
# PDF Aggregator & Section Extractor

## Purpose

This Python tool **extracts sections or page ranges from one or more PDFs** and aggregates them into a **single PDF** with:

- Table of Contents (TOC) reflecting section headings
- Optional bookmarks corresponding to TOC entries
- Cropped pages to remove headers and footers for a clean layout
- Flat TOC structure with unnumbered section titles

It is designed for **research papers, reports, or multi-chapter PDFs** where you need to assemble a subset of content into a new document, while preserving readability and navigation.

---

## Key Features

- **Section-based extraction:** Request a section by number (e.g., `1.3` or `5`); its subsections are included automatically.
- **Page-range extraction:** Extract explicit page ranges using a special syntax (e.g., `@10-20`).
- **Header/Footer cropping:** Removes headers and footers by cropping a configurable ratio from the top and bottom of each page, and normalizes page sizes across source PDFs.
- **Flat Table of Contents:** TOC shows only the section titles from the original PDFs; numbering and indentation are stripped.
- **Bookmarks:** Optional PDF bookmarks that match the TOC entries for easy navigation.
- **Multi-source aggregation:** Accepts multiple PDFs and merges the specified sections or page ranges into a single output file.
- **Command-line driven:** Fully configurable via CLI arguments and compatible with debugging in VS Code.
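
As a rough illustration of the two request syntaxes listed above, the sketch below classifies a single token as either a section number or an explicit page range. The helper name `classify_request` and the returned tuples are hypothetical and not taken from `pdfaggregator.py`; treating page numbers as 1-based and inclusive is also an assumption.

```python
import re

# Hypothetical helper, not part of pdfaggregator.py.
_PAGE_RANGE = re.compile(r"^@(\d+)-(\d+)$")   # e.g. "@10-20"
_SECTION = re.compile(r"^\d+(\.\d+)*$")       # e.g. "5" or "1.3"


def classify_request(token):
    """Return ("pages", start, end) or ("section", number) for one token."""
    token = token.strip()
    match = _PAGE_RANGE.match(token)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        # Assumption: pages are 1-based and the range is inclusive.
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {token!r}")
        return ("pages", start, end)
    if _SECTION.match(token):
        return ("section", token)
    raise ValueError(f"unrecognized request: {token!r}")
```

Checking for the `@` prefix first keeps page ranges out of the section-matching path entirely.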

---

## Design Choices

1. **Separation of concerns**
   - `pdfaggregator.py`: All core PDF handling functions (section extraction, cropping, TOC creation).
   - `driver.py`: CLI entry point that parses arguments, validates inputs, and orchestrates aggregation.
   - `test_pdfaggregator.py`: Unit tests for parsing, extraction, TOC, and page handling.

2. **Level-aware section matching**
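   - A section number such as `5` matches only the heading numbered `5`, never a subsection heading like `5.1`; once matched, the extracted page range still covers that section's subsections.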
   - Page ranges (prefixed with `@`) are treated explicitly and bypass section matching.

3. **Argument parsing**
   - Uses Python’s standard `argparse` for CLI argument handling.
   - Input format supports multiple PDFs and multiple sections per file, e.g.:

     ```
     -i input.pdf:5,@50-60 input2.pdf:3.2
     ```

4. **Debug-friendly and testable**
   - CLI parsing separated from `main()` for easier unit testing.
   - Compatible with VS Code debugging (`launch.json` with `${file}` for portable paths).

5. **TOC and bookmark handling**
   - TOC entries reflect **section headings only**; numbering is stripped.
   - Bookmarks are linked to the corresponding pages in the final PDF.
   - Cropping and page normalization ensure a consistent layout.

6. **Unit tests**
   - Core functions, including `parse_inputs`, page extraction, cropping, and TOC creation, are covered.
   - Uses `unittest` and avoids direct CLI dependency for testability.
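
The argument-parsing and testability points above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `driver.py`: the signature of `parse_inputs` is guessed, `build_parser` and `parse_cli` are hypothetical names, and the long options `--inputs` and `--output` are invented here (only `-i` and `-o` appear in this README).

```python
import argparse


def parse_inputs(items):
    """Turn ["a.pdf:5,@50-60", "b.pdf:3.2"] into [(path, [requests]), ...]."""
    parsed = []
    for item in items:
        # rpartition splits on the last colon, so a path containing a
        # drive letter (e.g. "C:\\docs\\a.pdf:5") keeps its path intact.
        path, sep, specs = item.rpartition(":")
        if not sep or not path or not specs:
            raise ValueError(f"expected FILE:SPEC[,SPEC...], got {item!r}")
        requests = [s.strip() for s in specs.split(",") if s.strip()]
        parsed.append((path, requests))
    return parsed


def build_parser():
    # Long option names are assumptions made for this sketch.
    parser = argparse.ArgumentParser(description="Aggregate PDF sections.")
    parser.add_argument("-i", "--inputs", nargs="+", required=True,
                        metavar="FILE:SPEC[,SPEC...]")
    parser.add_argument("-o", "--output", required=True)
    return parser


def parse_cli(argv=None):
    """Parse argv without touching any PDF, so tests can call it directly."""
    args = build_parser().parse_args(argv)
    return args.output, parse_inputs(args.inputs)
```

Because `parse_cli` accepts an explicit argument list, a `unittest` case can pass a list of strings instead of patching `sys.argv`.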

---

## How it Works (High-Level Flow)

1. **Parse CLI inputs**
   - Extract PDF file paths and associated sections or page ranges.
   - Normalize section references (e.g., `5`, `3.2`) and detect page ranges (e.g., `@10-20`).

2. **Iterate through PDFs**
   - Load PDF using `PyPDF2`.
   - Build outline tree (hierarchical sections) if available.

3. **Process sections/page ranges**
   - For each requested section:
     - Find matching section heading in outline (level-aware).
     - Determine start and end pages, including all subsections.
   - For page ranges prefixed with `@`, extract exact pages.

4. **Crop pages**
   - Remove headers and footers based on configurable ratios.
   - Normalize page size across all PDFs for visual consistency.

5. **Assemble final PDF**
   - Add a Table of Contents as the first pages.
   - Merge all extracted pages.
   - Add bookmarks corresponding to TOC entries.

6. **Write output**
   - Save aggregated PDF to specified output path.
   - The TOC in the output lists **titles only**, in a flat structure, with page references that match the final document.
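
Step 3 (level-aware matching plus subsection coverage) can be sketched on a flattened outline. The example assumes the outline has already been reduced to `(level, title, start_page)` tuples with 0-based pages; the function names are hypothetical, and ending a section one page before the next heading is a simplification that ignores sections sharing a page.

```python
import re


def section_number(title):
    """Return the leading section number of a heading, or None."""
    match = re.match(r"\s*(\d+(?:\.\d+)*)\.?\s", title + " ")
    return match.group(1) if match else None


def find_section_span(outline, wanted, total_pages):
    """Return (title, first_page, last_page) for the section `wanted`.

    `outline` is a list of (level, title, start_page) in document order.
    The span ends just before the next heading at the same or a
    shallower level, so every subsection falls inside it.
    """
    for index, (level, title, start) in enumerate(outline):
        if section_number(title) != wanted:
            continue  # exact match only: "5" never matches "5.1"
        end = total_pages - 1  # default: section runs to the last page
        for next_level, _, next_start in outline[index + 1:]:
            if next_level <= level:
                end = max(start, next_start - 1)
                break
        return title, start, end
    raise KeyError(f"section {wanted} not found in outline")
```

With headings `5 Results` (page 20), `5.1 Setup` (page 22) and `6 Discussion` (page 30), a request for `5` yields pages 20 through 29, while a request for `5.1` stops before the next heading at its own level.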

---

                +---------------------+
                |  CLI / Inputs       |
                |  - PDF files        |
                |  - Sections/@pages  |
                +----------+----------+
                           |
                           v
                +---------------------+
                |  parse_inputs()     |
                |  Normalize sections |
                |  Detect page ranges |
                +----------+----------+
                           |
                           v
                +---------------------+
                |  Iterate PDFs       |
                |  Load PDF with      |
                |  PyPDF2             |
                +----------+----------+
                           |
              +------------+------------+
              |                         |
              v                         v
    +-------------------+     +-------------------+
    | Section Extraction|     | Page Range (@)    |
    | - Match outline   |     | - Extract pages   |
    | - Level-aware     |     |   directly        |
    | - Subsections     |     |                   |
    +---------+---------+     +---------+---------+
              |                         |
              +------------+------------+
                           |
                           v
              +-------------------------+
              |  Crop Pages             |
              |  - Remove headers       |
              |  - Remove footers       |
              |  - Normalize size       |
              +------------+------------+
                           |
                           v
              +-------------------------+
              |  TOC & Bookmarks        |
              |  - Extracted section    |
              |    titles only          |
              |  - Flat TOC structure   |
              |  - Bookmarks linked     |
              +------------+------------+
                           |
                           v
              +-------------------------+
              |  Merge PDF Pages        |
              |  - TOC pages first      |
              |  - All extracted pages  |
              +------------+------------+
                           |
                           v
              +-------------------------+
              |  Output PDF             |
              |  - TOC & bookmarks      |
              |  - Cropped & normalized |
              +-------------------------+
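
The "Crop Pages" step in the flow above comes down to simple box arithmetic. The sketch below shows one possible way to do it, assuming the ratios are fractions of the page height and that size normalization means scaling each cropped page to fit a common target size; both function names are hypothetical, and applying the result to a PyPDF2 page object is left out because it depends on the library version in use.

```python
def crop_box(box, header_ratio, footer_ratio):
    """Trim a (left, bottom, right, top) box by fractions of its height."""
    if header_ratio < 0 or footer_ratio < 0:
        raise ValueError("ratios must be non-negative")
    if header_ratio + footer_ratio >= 1:
        raise ValueError("ratios must sum to less than 1")
    left, bottom, right, top = box
    height = top - bottom
    # PDF coordinates grow upward: the footer sits at the bottom edge,
    # the header at the top edge.
    return (left,
            bottom + height * footer_ratio,
            right,
            top - height * header_ratio)


def fit_scale(box, target_width, target_height):
    """Uniform scale factor that fits `box` inside the target size."""
    left, bottom, right, top = box
    return min(target_width / (right - left), target_height / (top - bottom))
```

For a 600 x 800 point page with a 10% header and 5% footer, `crop_box` keeps the band between y = 40 and y = 720; `fit_scale` then gives the factor needed to bring that band to a shared page size.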


---
## Example Usage

```bash
python driver.py \
    -o aggregated.pdf \
    -i input.pdf:5,@50-60 input2.pdf:3.2