# PDF Aggregator & Section Extractor

## Purpose

This Python tool lets you **extract sections or page ranges from one or more PDFs** and aggregate them into a **single PDF** with:

- Table of Contents (TOC) reflecting section headings
- Optional bookmarks corresponding to TOC entries
- Cropped pages to remove headers and footers for a clean layout
- Flat TOC structure with unnumbered section titles

It is designed for **research papers, reports, or multi-chapter PDFs** where you need to assemble a subset of content into a new document while preserving readability and navigation.

---

## Key Features

- **Section-based extraction:** Extract sections (e.g., `1.3`, `5`) and all subsections automatically.
- **Page-range extraction:** Extract explicit page ranges using a special syntax (e.g., `@10-20`).
- **Header/Footer cropping:** Automatically removes unwanted headers and footers while maintaining consistent page sizes across multiple PDFs.
- **Flat Table of Contents:** TOC shows only the section titles from the original PDFs; numbering and indentation are stripped.
- **Bookmarks:** Optional PDF bookmarks that match the TOC entries for easy navigation.
- **Multi-source aggregation:** Accepts multiple PDFs and merges the specified sections or page ranges into a single output file.
- **Command-line driven:** Fully configurable via CLI arguments and compatible with debugging in VS Code.

---

## Design Choices

1. **Separation of concerns**
   - `pdfaggregator.py`: All core PDF handling functions (section extraction, cropping, TOC creation).
   - `driver.py`: CLI entry point that parses arguments, validates inputs, and orchestrates aggregation.
   - `test_pdfaggregator.py`: Unit tests for parsing, extraction, TOC, and page handling.

2. **Level-aware section matching**
   - A section number (e.g., `5`) matches only the top-level chapter heading, not subsection headings like `5.1`; the subsections are still included in the extracted page range.
   - Page ranges (prefixed with `@`) are treated explicitly and bypass section matching.

3. **Argument parsing**
   - Uses Python’s standard `argparse` for CLI argument handling.
   - Input format supports multiple PDFs and multiple sections per file, e.g.:

     ```
     -i input.pdf:5,@50-60 input2.pdf:3.2
     ```

4. **Debug-friendly and testable**
   - CLI parsing is separated from `main()` for easier unit testing.
   - Compatible with VS Code debugging (`launch.json` with `${file}` for portable paths).

5. **TOC and bookmark handling**
   - TOC entries reflect **section headings only**; numbering is stripped.
   - Bookmarks are linked to the corresponding pages in the final PDF.
   - Cropping and page normalization ensure a consistent layout.

6. **Unit tests**
   - Core functions, including `parse_inputs`, page extraction, cropping, and TOC creation, are covered.
   - Uses `unittest` and avoids a direct CLI dependency for testability.

---

## How it Works (High-Level Flow)

1. **Parse CLI inputs**
   - Extract PDF file paths and associated sections or page ranges.
   - Normalize section references and detect page ranges (e.g., `@10-20`).

2. **Iterate through PDFs**
   - Load each PDF using `PyPDF2`.
   - Build the outline tree (hierarchical sections) if available.

3. **Process sections/page ranges**
   - For each requested section:
     - Find the matching section heading in the outline (level-aware).
     - Determine start and end pages, including all subsections.
   - For page ranges prefixed with `@`, extract the exact pages.

4. **Crop pages**
   - Remove headers and footers based on configurable ratios.
   - Normalize page size across all PDFs for visual consistency.

5. **Assemble final PDF**
   - Add a Table of Contents as the first pages.
   - Merge all extracted pages.
   - Add bookmarks corresponding to TOC entries.

6. **Write output**
   - Save the aggregated PDF to the specified output path.
   - TOC shows **titles only**, with a flat structure and correct page references.
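
The parsing and level-aware matching steps in the flow above can be illustrated with a small standard-library sketch. It is not the tool's actual `parse_inputs` implementation: the names `parse_input_spec`, `SectionRef`, `PageRange`, `heading_matches`, and `is_within` are hypothetical, and the sketch assumes page ranges are always written as `@first-last` with 1-based, inclusive page numbers.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class SectionRef:
    number: str  # dotted section number, e.g. "5" or "3.2"


@dataclass(frozen=True)
class PageRange:
    first: int  # assumed 1-based and inclusive
    last: int


Ref = Union[SectionRef, PageRange]


def parse_input_spec(spec: str) -> Tuple[str, List[Ref]]:
    """Split 'file.pdf:5,@50-60' into a path and its references (hypothetical helper)."""
    # Split on the last colon so a Windows path such as C:\docs\a.pdf:5 still works.
    path, sep, refs_part = spec.rpartition(":")
    if not sep or not path or not refs_part:
        raise ValueError(f"expected 'file.pdf:ref[,ref...]', got {spec!r}")

    refs: List[Ref] = []
    for token in (t.strip() for t in refs_part.split(",")):
        if not token:
            raise ValueError(f"empty reference in {spec!r}")
        if token.startswith("@"):
            # '@' marks an explicit page range and bypasses section matching.
            first_s, dash, last_s = token[1:].partition("-")
            if not dash:
                raise ValueError(f"expected '@first-last', got {token!r}")
            first, last = int(first_s), int(last_s)
            if first < 1 or last < first:
                raise ValueError(f"invalid page range {token!r}")
            refs.append(PageRange(first, last))
        else:
            refs.append(SectionRef(token))
    return path, refs


def heading_matches(requested: str, heading_number: str) -> bool:
    """Level-aware match: '5' matches '5', but not '5.1' or '15'."""
    return requested.split(".") == heading_number.split(".")


def is_within(requested: str, heading_number: str) -> bool:
    """True for the requested section itself and for any of its subsections."""
    parts = requested.split(".")
    return heading_number.split(".")[: len(parts)] == parts
```
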

---

                  +---------------------+
                  |    CLI / Inputs     |
                  | - PDF files         |
                  | - Sections/@pages   |
                  +----------+----------+
                             |
                             v
                  +---------------------+
                  |   parse_inputs()    |
                  | Normalize sections  |
                  | Detect page ranges  |
                  +----------+----------+
                             |
                             v
                  +---------------------+
                  |    Iterate PDFs     |
                  | Load PDF with       |
                  | PyPDF2              |
                  +----------+----------+
                             |
               +-------------+--------------+
               |                            |
               v                            v
      +-------------------+       +------------------+
      | Section Extraction|       | Page Range (@)   |
      | - Match outline   |       | - Extract pages  |
      | - Level-aware     |       |   directly       |
      | - Subsections     |       +------------------+
      +--------+----------+
               |
               v
    +----------------------+
    | Crop Pages           |
    | - Remove headers     |
    | - Remove footers     |
    | - Normalize size     |
    +----------+-----------+
               |
               v
    +----------------------+
    | TOC & Bookmarks      |
    | - Extracted section  |
    |   titles only        |
    | - Flat TOC structure |
    | - Bookmarks linked   |
    +----------+-----------+
               |
               v
    +----------------------+
    | Merge PDF Pages      |
    | - TOC pages first    |
    | - All extracted pages|
    +----------+-----------+
               |
               v
    +----------------------+
    | Output PDF           |
    | - TOC & bookmarks    |
    | - Cropped, normalized|
    +----------------------+

---

## Example Usage

```bash
python driver.py \
  -o aggregated.pdf \
  -i input.pdf:5,@50-60 input2.pdf:3.2