PDF Aggregator & Section Extractor

Purpose

This Python tool extracts sections or page ranges from one or more PDFs and aggregates them into a single PDF with:

Table of Contents (TOC) reflecting section headings
Optional bookmarks corresponding to TOC entries
Cropped pages to remove headers and footers for a clean layout
Flat TOC structure with unnumbered section titles

It is designed for research papers, reports, or multi-chapter PDFs where you need to assemble a subset of content into a new document, while preserving readability and navigation.

Key Features

Section-based extraction: Extract sections (e.g., 1.3, 5) and all subsections automatically.
Page-range extraction: Extract explicit page ranges using a special syntax (e.g., @10-20).
Header/Footer cropping: Automatically removes unwanted headers and footers while maintaining consistent page sizes across multiple PDFs.
Flat Table of Contents: TOC shows only the section titles from the original PDFs; numbering and indentation are stripped.
Bookmarks: Optional PDF bookmarks that match the TOC entries for easy navigation.
Multi-source aggregation: Accepts multiple PDFs and merges the specified sections or page ranges into a single output file.
Command-line driven: Fully configurable via CLI arguments and compatible with debugging in VS Code.

Design Choices

Separation of concerns
- pdfaggregator.py: All core PDF handling functions (section extraction, cropping, TOC creation).
- driver.py: CLI entry point that parses arguments, validates inputs, and orchestrates aggregation.
- test_pdfaggregator.py: Unit tests for parsing, extraction, TOC, and page handling.
Level-aware section matching
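- A section number (e.g., 5) matches only the heading with that exact number, not a subsection heading such as 5.1. The subsections of a matched section are still included in the extracted page range.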
- Page ranges (prefixed with @) are treated explicitly and bypass section matching.
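
As a rough illustration of the level-aware matching described above, the sketch below compares the leading number of each outline title against the requested section number. The helper names (`heading_number`, `find_section`) and the assumption that headings start with a dotted number are hypothetical and may differ from what pdfaggregator.py actually does.

```python
import re

# Assumption: headings begin with a dotted number such as "5" or "5.1",
# optionally followed by a period, then whitespace and the title text.
_NUMBER_RE = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+")


def heading_number(title):
    """Return the leading section number of a heading, or None if absent."""
    match = _NUMBER_RE.match(title)
    return match.group(1) if match else None


def find_section(titles, wanted):
    """Return the index of the heading whose number equals `wanted` exactly.

    `titles` is a list of outline titles in document order. Requesting "5"
    never matches "5.1", because the full dotted number is compared.
    """
    for index, title in enumerate(titles):
        if heading_number(title) == wanted:
            return index
    return None


titles = ["5.1 Setup", "5 Results", "5.2 Discussion", "Appendix"]
print(find_section(titles, "5"))
print(find_section(titles, "5.2"))
```

Comparing the whole dotted number, rather than using a prefix test, is what keeps a request for 5 from accidentally selecting 5.1.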
Argument parsing
- Uses Python’s standard argparse for CLI argument handling.
- Input format supports multiple PDFs and multiple sections per file, e.g.:
```
-i input.pdf:5,@50-60 input2.pdf:3.2
```
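
The input format above could be parsed roughly as follows. This is a minimal sketch, not the project's `parse_inputs`: the function name `parse_input_spec`, the returned tuple shape, the 1-based page numbers, and the acceptance of a single page such as `@10` are all assumptions made for illustration.

```python
def parse_input_spec(spec):
    """Split 'FILE:SEL1,SEL2' into (path, [(kind, value), ...]).

    Selectors starting with '@' become ("pages", (start, end));
    everything else becomes ("section", text).
    """
    # rpartition splits on the LAST colon, so a drive-letter colon in the
    # path (e.g. C:\docs\a.pdf:5) stays part of the path.
    path, sep, selectors = spec.rpartition(":")
    if not sep or not path or not selectors:
        raise ValueError(f"expected FILE:SELECTORS, got {spec!r}")

    parsed = []
    for item in selectors.split(","):
        item = item.strip()
        if not item:
            raise ValueError(f"empty selector in {spec!r}")
        if item.startswith("@"):
            start_text, dash, end_text = item[1:].partition("-")
            start = int(start_text)
            end = int(end_text) if dash else start  # assumed: "@10" is one page
            if start < 1 or end < start:
                raise ValueError(f"invalid page range {item!r}")
            parsed.append(("pages", (start, end)))
        else:
            parsed.append(("section", item))
    return path, parsed


print(parse_input_spec("input.pdf:5,@50-60"))
```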
Debug-friendly and testable
- CLI parsing separated from main() for easier unit testing.
- Compatible with VS Code debugging (launch.json with ${file} for portable paths).
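
One common way to keep CLI parsing separate from main() is to let the parsing function accept an explicit argument list, so tests can call it without touching `sys.argv`. The sketch below assumes this pattern; the long option names `--inputs` and `--output` and the function name `parse_args` are hypothetical, and only the short flags `-i` and `-o` come from this README.

```python
import argparse


def parse_args(argv=None):
    """Parse CLI arguments; pass a list in tests, or None to use sys.argv."""
    parser = argparse.ArgumentParser(
        description="Aggregate sections or page ranges from PDFs."
    )
    parser.add_argument(
        "-i", "--inputs", nargs="+", required=True, metavar="FILE:SELECTORS",
        help="one or more inputs, e.g. input.pdf:5,@50-60",
    )
    parser.add_argument(
        "-o", "--output", required=True, help="path of the aggregated PDF",
    )
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # A real driver would validate the inputs and run the aggregation here.
    print(f"would write {args.output} from {len(args.inputs)} input spec(s)")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```

A unit test can then call `parse_args(["-o", "out.pdf", "-i", "a.pdf:5"])` directly and inspect the resulting namespace.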
TOC and bookmark handling
- TOC entries reflect section headings only; numbering stripped.
- Bookmarks are linked to corresponding pages in the final PDF.
- Cropping and page normalization ensure a consistent layout.
Unit tests
- Core functions, including parse_inputs, page extraction, cropping, and TOC creation, are covered.
- Uses unittest and avoids direct CLI dependency for testability.

How it Works (High-Level Flow)

Parse CLI inputs
- Extract PDF file paths and associated sections or page ranges.
- Normalize section references and detect page ranges (e.g., @10-20).
Iterate through PDFs
- Load PDF using PyPDF2.
- Build outline tree (hierarchical sections) if available.
Process sections/page ranges
- For each requested section:
  - Find matching section heading in outline (level-aware).
  - Determine start and end pages, including all subsections.
- For page ranges prefixed with @, extract exact pages.
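
The start and end page calculation can be sketched as below, assuming the outline has already been flattened into `(level, title, start_page)` tuples in document order with 0-based page indices. That data shape and the name `section_page_span` are illustrative assumptions, not the tool's actual internals.

```python
def section_page_span(outline, index, total_pages):
    """Return (start, end) page indices, inclusive, for outline[index].

    The section runs until the page before the next heading at the same
    or a shallower level, so all deeper subsections are included.
    """
    level, _title, start = outline[index]
    end = total_pages - 1  # default: the section runs to the end of the PDF
    for next_level, _next_title, next_start in outline[index + 1:]:
        if next_level <= level:
            # Guard against a following heading that starts on the same page.
            end = max(start, next_start - 1)
            break
    return start, end


outline = [
    (1, "4 Methods", 10),
    (1, "5 Results", 20),
    (2, "5.1 Setup", 22),
    (2, "5.2 Discussion", 27),
    (1, "6 Conclusion", 30),
]
print(section_page_span(outline, 1, total_pages=40))
```

Note that this sketch ends a section on the page before the next heading, so content that shares a page with the next section's heading would be cut off; whether the real tool includes that shared page is not specified here.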
Crop pages
- Remove headers and footers based on configurable ratios.
- Normalize page size across all PDFs for visual consistency.
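
Ratio-based cropping amounts to simple arithmetic on the page rectangle. The sketch below computes a crop rectangle in PDF points, with the origin at the bottom-left corner as in default PDF user space. The function name `crop_box` and the default ratios are assumptions, and the sketch covers cropping only, not page-size normalization or the PyPDF2 calls that would apply the rectangle to a page.

```python
def crop_box(width, height, header_ratio=0.08, footer_ratio=0.08):
    """Return (left, bottom, right, top) after trimming header and footer.

    Ratios are fractions of the page height: header_ratio is removed from
    the top edge, footer_ratio from the bottom edge.
    """
    if header_ratio < 0 or footer_ratio < 0:
        raise ValueError("ratios must be non-negative")
    if header_ratio + footer_ratio >= 1:
        raise ValueError("ratios would remove the whole page")
    bottom = height * footer_ratio
    top = height * (1 - header_ratio)
    return (0.0, bottom, float(width), top)


# US Letter is 612 x 792 points.
print(crop_box(612, 792, header_ratio=0.1, footer_ratio=0.05))
```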
Assemble final PDF
- Add a Table of Contents as the first pages.
- Merge all extracted pages.
- Add bookmarks corresponding to TOC entries.
Write output
- Save aggregated PDF to specified output path.
- The TOC lists titles only, in a flat structure, with page references that match the final PDF.
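
Because the TOC occupies the first pages, every page reference has to be shifted by the TOC's own length. A minimal sketch of building the flat TOC entries follows, assuming each extracted chunk is described by its original title and page count; `flat_toc` and its parameters are hypothetical names, not the tool's API.

```python
import re

# Assumption: numbered headings look like "5 Results" or "3.2. Related Work".
_NUMBER_PREFIX = re.compile(r"^\s*\d+(?:\.\d+)*\.?\s+")


def flat_toc(entries, toc_pages=1):
    """Build flat TOC rows from (title, page_count) pairs in output order.

    Returns a list of (clean_title, page_number), where page_number is
    1-based in the final PDF and accounts for the TOC pages in front.
    """
    rows = []
    next_page = toc_pages + 1  # first content page follows the TOC
    for title, page_count in entries:
        clean_title = _NUMBER_PREFIX.sub("", title).strip()
        rows.append((clean_title, next_page))
        next_page += page_count
    return rows


for row in flat_toc([("5 Results", 10), ("3.2 Related Work", 4), ("Appendix", 3)]):
    print(row)
```

The same page numbers, converted to 0-based indices, could serve as bookmark targets so that bookmarks and TOC entries stay in sync.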

           +---------------------+
           |   CLI / Inputs      |
           |   - PDF files       |
           |   - Sections/@pages |
           +----------+----------+
                      |
                      v
           +---------------------+
           |  parse_inputs()     |
           |  Normalize sections |
           |  Detect page ranges |
           +----------+----------+
                      |
                      v
           +---------------------+
           |  Iterate PDFs       |
           |  Load PDF with      |
           |  PyPDF2             |
           +----------+----------+
                      |
        +-------------+--------------+
        |                            |
        v                            v

Example Usage

```
python driver.py \
    -o aggregated.pdf \
    -i input.pdf:5,@50-60 input2.pdf:3.2
```
