improved pdfcreator

1- use CLI 2- refactor code
2026-01-19 23:06:56 +00:00
parent 6c4b78f274
commit a5934e45b2
8 changed files with 810 additions and 113 deletions
--- a/python/pdfcreator/README.md
+++ b/python/pdfcreator/README.md
@@ -0,0 +1,161 @@
+# PDF Aggregator & Section Extractor
+
+## Purpose
+
+This Python tool allows you to **extract sections or page ranges from one or more PDFs** and aggregate them into a **single PDF** with:
+
+- Table of Contents (TOC) reflecting section headings
+- Optional bookmarks corresponding to TOC entries
+- Cropped pages to remove headers and footers for a clean layout
+- Flat TOC structure with unnumbered section titles
+
+It is designed for **research papers, reports, or multi-chapter PDFs** where you need to assemble a subset of content into a new document, while preserving readability and navigation.
+
+---
+
+## Key Features
+
+- **Section-based extraction:** Extract sections (e.g., `1.3`, `5`) and all subsections automatically.
+- **Page-range extraction:** Extract explicit page ranges using a special syntax (e.g., `@10-20`).
+- **Header/Footer cropping:** Crops a configurable fraction from the top and bottom of each page to remove headers and footers, then normalizes page sizes so pages from different PDFs match.
+- **Flat Table of Contents:** TOC shows only the section titles from the original PDFs; numbering and indentation are stripped.
+- **Bookmarks:** Optional PDF bookmarks that match the TOC entries for easy navigation.
+- **Multi-source aggregation:** Accepts multiple PDFs and merges the specified sections or page ranges into a single output file.
+- **Command-line driven:** Fully configurable via CLI arguments and compatible with debugging in VS Code.
+
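The flat, unnumbered TOC described in the feature list can be illustrated with a small sketch. The names `strip_numbering` and `flat_toc` and the regular expression are assumptions made for illustration; they are not taken from `pdfaggregator.py`, which may strip numbering differently.

```python
import re

# Hypothetical pattern: a leading section number such as "1.3" or "5."
# followed by whitespace. Not taken from the project's code.
_NUMBER_PREFIX = re.compile(r"^\s*\d+(?:\.\d+)*\.?\s+")


def strip_numbering(title: str) -> str:
    """Return the heading text without its leading section number."""
    return _NUMBER_PREFIX.sub("", title).strip()


def flat_toc(entries):
    """Turn (title, page) pairs into a flat list of unnumbered TOC entries."""
    return [(strip_numbering(title), page) for title, page in entries]


if __name__ == "__main__":
    for title, page in flat_toc([("5 Results", 3), ("5.1 Setup", 4)]):
        print(f"{title} ... {page}")
```

One limitation of this sketch: a title that legitimately starts with a bare number followed by a space (for example a year) would lose that number.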
+---
+
+## Design Choices
+
+1. **Separation of concerns**
+   - `pdfaggregator.py`: All core PDF handling functions (section extraction, cropping, TOC creation).
+   - `driver.py`: CLI entry point that parses arguments, validates inputs, and orchestrates aggregation.
+   - `test_pdfaggregator.py`: Unit tests for parsing, extraction, TOC, and page handling.
+
+2. **Level-aware section matching**
+   - A section number such as `5` matches only the top-level chapter heading, not subsection headings such as `5.1`. The chapter's subsections are still included in the extracted pages.
+   - Page ranges (prefixed with `@`) are treated explicitly and bypass section matching.
+
+3. **Argument parsing**
+   - Uses Python’s standard `argparse` for CLI argument handling.
+   - Input format supports multiple PDFs and multiple sections per file, e.g.:
+
+     ```
+     -i input.pdf:5,@50-60 input2.pdf:3.2
+     ```
+
+4. **Debug-friendly and testable**
+   - CLI parsing is separated from `main()` so it can be unit tested directly.
+   - Compatible with VS Code debugging (`launch.json` with `${file}` for portable paths).
+
+5. **TOC and bookmark handling**
+   - TOC entries reflect **section headings only**; numbering is stripped.
+   - Bookmarks are linked to the corresponding pages in the final PDF.
+   - Cropping and page normalization ensure a consistent layout.
+
+6. **Unit tests**
+   - Core functions, including `parse_inputs`, page extraction, cropping, and TOC creation, are covered.
+   - Uses `unittest` and avoids direct CLI dependency for testability.
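
As an illustration of the input format shown in the design choices (`input.pdf:5,@50-60`), the sketch below splits one `-i` value into a file path and its selectors. `parse_spec`, `PageRange`, and the error handling are hypothetical; the project's own `parse_inputs` may have a different signature and behavior.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class PageRange:
    """Explicit page range written as @START-END (hypothetical type)."""
    start: int
    end: int


Selector = Union[str, PageRange]


def parse_spec(spec: str) -> Tuple[str, List[Selector]]:
    """Split 'path.pdf:5,@50-60' into the path and a list of selectors."""
    # Split on the last colon so a drive letter in the path is left alone.
    path, sep, selectors = spec.rpartition(":")
    if not sep or not path or not selectors:
        raise ValueError(f"expected PATH:SELECTORS, got {spec!r}")

    parsed: List[Selector] = []
    for item in selectors.split(","):
        item = item.strip()
        if not item:
            raise ValueError(f"empty selector in {spec!r}")
        if item.startswith("@"):
            # Page ranges bypass section matching entirely.
            first, dash, last = item[1:].partition("-")
            if not dash:
                raise ValueError(f"expected @START-END, got {item!r}")
            start, end = int(first), int(last)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range {item!r}")
            parsed.append(PageRange(start, end))
        else:
            # Anything else is kept as a section number string, e.g. "3.2".
            parsed.append(item)
    return path, parsed
```

Keeping section numbers as strings avoids treating `3.2` as a float, which would make `3.10` and `3.1` indistinguishable.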
+
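Level-aware matching can be sketched as an exact comparison of the leading section number, so that `5` does not match `5.1` or `15`. The helper names and the assumption that outline titles begin with their section number are illustrative only and are not confirmed by the project's code.

```python
import re
from typing import List, Optional

# Hypothetical pattern: captures "5" from "5 Results" and "5.1" from "5.1 Setup".
_LEADING_NUMBER = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?(?=\s|$)")


def heading_number(title: str) -> Optional[str]:
    """Return the section number a title starts with, or None."""
    match = _LEADING_NUMBER.match(title)
    return match.group(1) if match else None


def find_section(titles: List[str], wanted: str) -> Optional[int]:
    """Return the index of the heading numbered exactly `wanted`."""
    for index, title in enumerate(titles):
        # Compare the whole number, not a prefix, so "5" skips "5.1" and "15".
        if heading_number(title) == wanted:
            return index
    return None
```

A plain `title.startswith("5")` check would wrongly accept `5.1 Setup` and `50 Glossary`, which is the failure mode this design choice avoids.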
+---
+
+## How it Works (High-Level Flow)
+
+1. **Parse CLI inputs**
+   - Extract PDF file paths and associated sections or page ranges.
+   - Normalize section references and detect page ranges (e.g., `@10-20`).
+
+2. **Iterate through PDFs**
+   - Load PDF using `PyPDF2`.
+   - Build outline tree (hierarchical sections) if available.
+
+3. **Process sections/page ranges**
+   - For each requested section:
+     - Find matching section heading in outline (level-aware).
+     - Determine start and end pages, including all subsections.
+   - For page ranges prefixed with `@`, extract exact pages.
+
+4. **Crop pages**
+   - Remove headers and footers based on configurable ratios.
+   - Normalize page size across all PDFs for visual consistency.
+
+5. **Assemble final PDF**
+   - Add a Table of Contents as the first pages.
+   - Merge all extracted pages.
+   - Add bookmarks corresponding to TOC entries.
+
+6. **Write output**
+   - Save aggregated PDF to specified output path.
+   - The TOC shows **titles only** in a flat structure, with page references that match the final PDF.
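
Step 3 of the flow above (finding the start and end pages of a section, including its subsections) can be sketched on a flattened outline. `OutlineEntry` and `section_page_range` are hypothetical names, pages are assumed to be 1-based, and the sketch assumes the next section starts on a new page.

```python
from typing import List, NamedTuple, Tuple


class OutlineEntry(NamedTuple):
    level: int       # 1 for chapters, 2 for subsections, and so on
    title: str
    start_page: int  # 1-based page on which the heading appears


def section_page_range(
    outline: List[OutlineEntry], index: int, last_page: int
) -> Tuple[int, int]:
    """Return (first, last) pages of outline[index], subsections included."""
    entry = outline[index]
    for later in outline[index + 1:]:
        # Deeper entries are subsections and belong to this section;
        # the first entry at the same or a higher level ends it.
        if later.level <= entry.level:
            return entry.start_page, max(entry.start_page, later.start_page - 1)
    # No later sibling: the section runs to the end of the document.
    return entry.start_page, last_page
```

If a following section can start in the middle of a page, the end page would need to be `later.start_page` instead, at the cost of including part of the next section.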
+
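The ratio-based cropping in step 4 comes down to simple arithmetic on the page box. The sketch below only computes the new box; `cropped_box` and its parameter names are assumptions, and writing the result back to a page depends on the PyPDF2 version in use, so that part is not shown.

```python
from typing import Tuple


def cropped_box(
    width: float, height: float, header_ratio: float, footer_ratio: float
) -> Tuple[float, float, float, float]:
    """Return (left, bottom, right, top) after trimming header and footer.

    PDF user space has its origin at the lower-left corner by default,
    so the footer is trimmed from y = 0 upward and the header from the top.
    """
    if not 0 <= header_ratio < 1 or not 0 <= footer_ratio < 1:
        raise ValueError("ratios must be in the range [0, 1)")
    if header_ratio + footer_ratio >= 1:
        raise ValueError("ratios would remove the whole page")

    bottom = height * footer_ratio
    top = height * (1 - header_ratio)
    return (0.0, bottom, float(width), top)


if __name__ == "__main__":
    # US Letter page (612 x 792 points) with hypothetical 8% / 6% margins.
    print(cropped_box(612, 792, header_ratio=0.08, footer_ratio=0.06))
```

Using ratios rather than absolute points keeps the same settings usable across source PDFs with different page sizes.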
+---
+
+                +---------------------+
+                |   CLI / Inputs      |
+                |   - PDF files       |
+                |   - Sections/@pages |
+                +----------+----------+
+                           |
+                           v
+                +---------------------+
+                |  parse_inputs()     |
+                |  Normalize sections |
+                |  Detect page ranges |
+                +----------+----------+
+                           |
+                           v
+                +---------------------+
+                |  Iterate PDFs       |
+                |  Load PDF with      |
+                |  PyPDF2             |
+                +----------+----------+
+                           |
+             +-------------+--------------+
+             |                            |
+             v                            v
+    +-------------------+        +------------------+
+    | Section Extraction|        | Page Range (@)   |
+    | - Match outline   |        | - Extract pages  |
+    | - Level-aware     |        |   directly       |
+    | - Subsections     |        +--------+---------+
+    +--------+----------+                 |
+             |                            |
+             +-------------+--------------+
+                           |
+                           v
+              +------------------------+
+              |  Crop Pages            |
+              |  - Remove headers      |
+              |  - Remove footers      |
+              |  - Normalize size      |
+              +------------+-----------+
+                           |
+                           v
+              +------------------------+
+              |  TOC & Bookmarks       |
+              |  - Extracted section   |
+              |    titles only         |
+              |  - Flat TOC structure  |
+              |  - Bookmarks linked    |
+              +------------+-----------+
+                           |
+                           v
+              +------------------------+
+              |  Merge PDF Pages       |
+              |  - TOC pages first     |
+              |  - All extracted pages |
+              +------------+-----------+
+                           |
+                           v
+              +------------------------+
+              |  Output PDF            |
+              |  - TOC & bookmarks     |
+              |  - Cropped & normalized|
+              +------------------------+
+
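The first box of the diagram (CLI inputs) could be wired up with `argparse` roughly as follows. Only the `-i` and `-o` flags appear in the README; the `dest` names, help texts, and the choice to make both flags required are assumptions, and `driver.py` may define them differently.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build the argument parser (sketch; option details are assumed)."""
    parser = argparse.ArgumentParser(description="Aggregate PDF sections.")
    parser.add_argument(
        "-i", dest="inputs", nargs="+", required=True,
        metavar="PDF:SELECTORS",
        help="one or more inputs such as input.pdf:5,@50-60",
    )
    parser.add_argument(
        "-o", dest="output", required=True, metavar="OUTPUT",
        help="path of the aggregated PDF",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv, or sys.argv[1:] when argv is None."""
    # Taking argv as a parameter lets unit tests call this without a real CLI.
    return build_parser().parse_args(argv)
```

Passing an explicit list, for example `parse_args(["-o", "out.pdf", "-i", "a.pdf:5"])`, is what allows tests to exercise parsing without spawning a process.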
+
+---
+
+## Example Usage
+
+```bash
+python driver.py \
+    -o aggregated.pdf \
+    -i input.pdf:5,@50-60 input2.pdf:3.2