# PDF Aggregator & Section Extractor

## Purpose

This Python tool **extracts sections or page ranges from one or more PDFs** and aggregates them into a **single PDF** with:

- A Table of Contents (TOC) reflecting section headings
- Optional bookmarks corresponding to TOC entries
- Pages cropped to remove headers and footers for a clean layout
- A flat TOC structure with unnumbered section titles

It is designed for **research papers, reports, or multi-chapter PDFs** where you need to assemble a subset of content into a new document while preserving readability and navigation.

---
## Key Features

- **Section-based extraction:** Extract sections (e.g., `1.3`, `5`) together with all of their subsections.
- **Page-range extraction:** Extract explicit page ranges using the `@` prefix (e.g., `@10-20`).
- **Header/footer cropping:** Remove headers and footers while keeping page sizes consistent across multiple PDFs.
- **Flat Table of Contents:** The TOC shows only the section titles from the original PDFs; numbering and indentation are stripped.
- **Bookmarks:** Optional PDF bookmarks that match the TOC entries for easy navigation.
- **Multi-source aggregation:** Accepts multiple PDFs and merges the specified sections or page ranges into a single output file.
- **Command-line driven:** Fully configurable via CLI arguments and compatible with debugging in VS Code.

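
To make the flat TOC concrete, here is a minimal sketch of how numbering could be stripped from outline titles. The names `strip_numbering` and `flat_toc` are hypothetical and not taken from `pdfaggregator.py`, and the regular expression assumes headings start with a dotted number such as `5` or `1.3`; a title that legitimately starts with a number (for example a year) would need extra handling.

```python
import re

# Assumption: a heading number is digits separated by dots, optionally followed
# by a trailing dot, then whitespace (e.g. "5 ", "1.3 ", "2.4.1. ").
_NUMBER_PREFIX = re.compile(r"^\s*\d+(?:\.\d+)*\.?\s+")


def strip_numbering(title):
    """Return the title without its leading section number, if it has one."""
    return _NUMBER_PREFIX.sub("", title).strip()


def flat_toc(entries):
    """Turn (title, level, page) outline entries into flat (title, page) rows.

    The level is dropped on purpose: the TOC is flat, so no indentation is kept.
    """
    return [(strip_numbering(title), page) for title, _level, page in entries]


outline = [("1 Introduction", 1, 3), ("1.3 Related Work", 2, 7), ("Appendix", 1, 40)]
for title, page in flat_toc(outline):
    print(f"{title} ... {page}")
```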
---
## Design Choices

1. **Separation of concerns**
   - `pdfaggregator.py`: All core PDF handling functions (section extraction, cropping, TOC creation).
   - `driver.py`: CLI entry point that parses arguments, validates inputs, and orchestrates aggregation.
   - `test_pdfaggregator.py`: Unit tests for parsing, extraction, TOC, and page handling.

2. **Level-aware section matching**
   - A section number (e.g., `5`) matches only the heading at that level, not subsection headings such as `5.1`; the matched section's subsections are still extracted along with it.
   - Page ranges (prefixed with `@`) are treated explicitly and bypass section matching.

3. **Argument parsing**
   - Uses Python’s standard `argparse` for CLI argument handling.
   - Input format supports multiple PDFs and multiple sections per file, e.g.:

     ```
     -i input.pdf:5,@50-60 input2.pdf:3.2
     ```

4. **Debug-friendly and testable**
   - CLI parsing is separated from `main()` for easier unit testing.
   - Compatible with VS Code debugging (`launch.json` with `${file}` for portable paths).

5. **TOC and bookmark handling**
   - TOC entries reflect **section headings only**; numbering is stripped.
   - Bookmarks are linked to the corresponding pages in the final PDF.
   - Cropping and page normalization ensure a consistent layout.

6. **Unit tests**
   - Core functions, including `parse_inputs`, page extraction, cropping, and TOC creation, are covered.
   - Uses `unittest` and avoids a direct CLI dependency for testability.

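
The input format above can be parsed with a few lines on top of `argparse`. The following is a sketch only: `parse_spec` and `parse_cli` are hypothetical names, the long options `--inputs` and `--output` are assumptions (only `-i` and `-o` appear in this README), and the real `parse_inputs` may return a different structure. Passing `argv` explicitly keeps the parser testable without touching `sys.argv`.

```python
import argparse


def parse_spec(spec):
    """Split 'file.pdf:5,@50-60' into (path, targets).

    Each target is ('section', '5') or ('pages', 50, 60).
    """
    # rpartition splits on the last colon, so earlier colons stay in the path.
    path, sep, targets = spec.rpartition(":")
    if not sep or not path or not targets:
        raise ValueError(f"expected PATH:TARGET[,TARGET...], got {spec!r}")
    parsed = []
    for item in (t.strip() for t in targets.split(",")):
        if not item:
            continue  # tolerate a stray comma
        if item.startswith("@"):
            # '@50-60' is an explicit page range and bypasses section matching.
            start, dash, end = item[1:].partition("-")
            first = int(start)
            last = int(end) if dash else first  # assumption: '@12' means one page
            if first > last:
                raise ValueError(f"descending page range in {item!r}")
            parsed.append(("pages", first, last))
        else:
            parsed.append(("section", item))
    return path, parsed


def parse_cli(argv=None):
    """Parse CLI arguments; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Aggregate PDF sections (sketch)")
    parser.add_argument("-i", "--inputs", nargs="+", required=True,
                        metavar="PDF:TARGETS")
    parser.add_argument("-o", "--output", required=True)
    args = parser.parse_args(argv)
    args.inputs = [parse_spec(spec) for spec in args.inputs]
    return args


args = parse_cli(["-o", "aggregated.pdf", "-i", "input.pdf:5,@50-60", "input2.pdf:3.2"])
print(args.output, args.inputs)
```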
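
Level-aware matching comes down to comparing the whole heading number rather than a prefix. The sketch below illustrates the idea with hypothetical helpers (`heading_number`, `matches_section`, `is_subsection`); it assumes outline titles begin with their dotted number, which may not hold for every PDF.

```python
import re

# Assumption: titles look like "5 Results", "5. Results" or "5.1 Setup".
_LEADING_NUMBER = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?(?=\s|$)")


def heading_number(title):
    """Return the dotted number at the start of a title, or None."""
    match = _LEADING_NUMBER.match(title)
    return match.group(1) if match else None


def matches_section(title, requested):
    """True only when the heading number equals the requested number exactly."""
    return heading_number(title) == requested.strip().rstrip(".")


def is_subsection(title, requested):
    """True when the heading sits below the requested section (e.g. 5.1 under 5)."""
    number = heading_number(title)
    return number is not None and number.startswith(requested.rstrip(".") + ".")


for title in ["5 Results", "5.1 Setup", "15 Appendix"]:
    print(title, matches_section(title, "5"), is_subsection(title, "5"))
```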
---
## How it Works (High-Level Flow)

1. **Parse CLI inputs**
   - Extract PDF file paths and associated sections or page ranges.
   - Normalize section references and detect page ranges (e.g., `@10-20`).

2. **Iterate through PDFs**
   - Load each PDF using `PyPDF2`.
   - Build the outline tree (hierarchical sections) if available.

3. **Process sections/page ranges**
   - For each requested section:
     - Find the matching section heading in the outline (level-aware).
     - Determine start and end pages, including all subsections.
   - For page ranges prefixed with `@`, extract the exact pages.

4. **Crop pages**
   - Remove headers and footers based on configurable ratios.
   - Normalize page size across all PDFs for visual consistency.

5. **Assemble final PDF**
   - Add a Table of Contents as the first pages.
   - Merge all extracted pages.
   - Add bookmarks corresponding to TOC entries.

6. **Write output**
   - Save the aggregated PDF to the specified output path.
   - The TOC shows **titles only**, in a flat structure, with correct page references.

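
Step 3 is the least obvious part of the flow, so here is a sketch of how start and end pages could be derived from an outline. It works on a plain list of `(title, level, start_page)` tuples rather than on `PyPDF2` objects, and the function names are hypothetical. Two assumptions are baked in: every section starts on a new page, and `@` page ranges are 1-based and inclusive.

```python
def section_page_range(outline, number, total_pages):
    """Return (start, end) 0-based page indices for a section and its subsections.

    outline: list of (title, level, start_page) in document order.
    """
    for index, (title, level, start) in enumerate(outline):
        parts = title.split()
        if parts and parts[0].rstrip(".") == number:
            break
    else:
        raise LookupError(f"section {number!r} not found in outline")

    end = total_pages - 1  # default: the section runs to the end of the document
    for _title, next_level, next_start in outline[index + 1:]:
        if next_level <= level:
            # First heading at the same or a higher level closes the section;
            # deeper headings (subsections) are skipped and therefore included.
            end = max(start, next_start - 1)
            break
    return start, end


def page_range_indices(first, last, total_pages):
    """Convert a 1-based inclusive '@first-last' range to 0-based page indices."""
    if not 1 <= first <= last <= total_pages:
        raise ValueError(f"page range {first}-{last} is outside 1..{total_pages}")
    return list(range(first - 1, last))


outline = [
    ("4 Methods", 1, 10),
    ("5 Results", 1, 20),
    ("5.1 Setup", 2, 21),
    ("5.2 Findings", 2, 25),
    ("6 Discussion", 1, 30),
]
print(section_page_range(outline, "5", total_pages=40))
print(section_page_range(outline, "5.1", total_pages=40))
print(page_range_indices(10, 20, total_pages=40))
```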
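
Ratio-based cropping in step 4 is plain arithmetic on the page box. The sketch below shows one way to compute it; `Box`, `crop_box`, and `scale_to_width` are hypothetical names, the default ratios are placeholders rather than the tool's actual defaults, and scaling to a common width is only one possible reading of "normalize page size". PDF coordinates have their origin at the bottom-left corner, so the footer is cut from the bottom edge and the header from the top.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: float
    bottom: float
    right: float
    top: float


def crop_box(width, height, header_ratio=0.08, footer_ratio=0.08):
    """Return the box left after trimming header and footer bands.

    Ratios are fractions of the page height (placeholder defaults).
    """
    if header_ratio < 0 or footer_ratio < 0 or header_ratio + footer_ratio >= 1:
        raise ValueError("ratios must be non-negative and sum to less than 1")
    return Box(0.0, height * footer_ratio, width, height * (1.0 - header_ratio))


def scale_to_width(box, target_width):
    """Scale factor that brings a cropped page to a common target width."""
    return target_width / (box.right - box.left)


letter = crop_box(612, 792, header_ratio=0.10, footer_ratio=0.05)
print(letter)
print(scale_to_width(letter, target_width=595))
```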
---
The same flow as a diagram:

                  +---------------------+
                  |    CLI / Inputs     |
                  |  - PDF files        |
                  |  - Sections/@pages  |
                  +----------+----------+
                             |
                             v
                  +---------------------+
                  |   parse_inputs()    |
                  | Normalize sections  |
                  | Detect page ranges  |
                  +----------+----------+
                             |
                             v
                  +---------------------+
                  |    Iterate PDFs     |
                  |   Load PDF with     |
                  |       PyPDF2        |
                  +----------+----------+
                             |
               +-------------+--------------+
               |                            |
               v                            v
      +-------------------+        +------------------+
      | Section Extraction|        | Page Range (@)   |
      | - Match outline   |        | - Extract pages  |
      | - Level-aware     |        |   directly       |
      | - Subsections     |        +------------------+
      +--------+----------+
               |
               v
    +----------------------+
    |      Crop Pages      |
    |  - Remove headers    |
    |  - Remove footers    |
    |  - Normalize size    |
    +----------+-----------+
               |
               v
    +----------------------+
    |   TOC & Bookmarks    |
    | - Extracted section  |
    |   titles only        |
    | - Flat TOC structure |
    | - Bookmarks linked   |
    +----------+-----------+
               |
               v
    +----------------------+
    |   Merge PDF Pages    |
    | - TOC pages first    |
    | - All extracted pages|
    +----------+-----------+
               |
               v
    +----------------------+
    |      Output PDF      |
    | - TOC & bookmarks    |
    | - Cropped/normalized |
    +----------------------+

---
## Example Usage

```bash
python driver.py \
  -o aggregated.pdf \
  -i input.pdf:5,@50-60 input2.pdf:3.2