improved pdfcreator

1- use CLI
2- refactor code
local
2026-01-19 23:06:56 +00:00
parent 6c4b78f274
commit a5934e45b2
8 changed files with 810 additions and 113 deletions

python/pdfcreator/README.md Normal file

@@ -0,0 +1,161 @@
# PDF Aggregator & Section Extractor
## Purpose
This Python tool allows you to **extract sections or page ranges from one or more PDFs** and aggregate them into a **single PDF** with:
- Table of Contents (TOC) reflecting section headings
- Optional bookmarks corresponding to TOC entries
- Cropped pages to remove headers and footers for a clean layout
- Flat TOC structure with unnumbered section titles
It is designed for **research papers, reports, or multi-chapter PDFs** where you need to assemble a subset of content into a new document, while preserving readability and navigation.
---
## Key Features
- **Section-based extraction:** Extract sections (e.g., `1.3`, `5`) and all subsections automatically.
- **Page-range extraction:** Extract explicit page ranges using a special syntax (e.g., `@10-20`).
- **Header/Footer cropping:** Automatically removes unwanted headers and footers while maintaining consistent page sizes across multiple PDFs.
- **Flat Table of Contents:** TOC shows only the section titles from the original PDFs; numbering and indentation are stripped.
- **Bookmarks:** Optional PDF bookmarks that match the TOC entries for easy navigation.
- **Multi-source aggregation:** Accepts multiple PDFs and merges the specified sections or page ranges into a single output file.
- **Command-line driven:** Fully configurable via CLI arguments and compatible with debugging in VS Code.
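
As a rough illustration of the flat TOC behavior listed above, the sketch below strips a leading section number from each heading. The helper names `strip_numbering` and `flat_toc` are hypothetical, and the regular expression is an assumption; the tool's actual rules may differ (for instance, a heading that legitimately starts with a number followed by a space, such as a year, would lose it here).

```python
import re

# Assumed pattern: a leading number such as "5", "5." or "3.2.1" followed by whitespace.
_NUMBER_PREFIX = re.compile(r"^\s*\d+(?:\.\d+)*\.?\s+")


def strip_numbering(title):
    """Remove a leading section number from a heading (hypothetical helper)."""
    return _NUMBER_PREFIX.sub("", title).strip()


def flat_toc(entries):
    """Turn (title, page) pairs into flat, unnumbered TOC rows."""
    return [(strip_numbering(title), page) for title, page in entries]


if __name__ == "__main__":
    for title, page in flat_toc([("1.3 Related Work", 4), ("5. Results", 9)]):
        print(f"{title} ... {page}")
```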
---
## Design Choices
1. **Separation of concerns**
- `pdfaggregator.py`: All core PDF handling functions (section extraction, cropping, TOC creation).
- `driver.py`: CLI entry point that parses arguments, validates inputs, and orchestrates aggregation.
- `test_pdfaggregator.py`: Unit tests for parsing, extraction, TOC, and page handling.
2. **Level-aware section matching**
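- A section number (e.g., `5`) matches only the heading with exactly that number (the top-level chapter), not subsection headings such as `5.1`; once matched, the extracted range still covers its subsections.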
- Page ranges (prefixed with `@`) are treated explicitly and bypass section matching.
3. **Argument parsing**
- Uses Python's standard `argparse` for CLI argument handling.
- Input format supports multiple PDFs and multiple sections per file, e.g.:
```
-i input.pdf:5,@50-60 input2.pdf:3.2
```
4. **Debug-friendly and testable**
- CLI parsing separated from `main()` for easier unit testing.
- Compatible with VS Code debugging (`launch.json` with `${file}` for portable paths).
5. **TOC and bookmark handling**
- TOC entries reflect **section headings only**; numbering stripped.
- Bookmarks are linked to corresponding pages in the final PDF.
- Cropping and page normalization ensure a consistent layout.
6. **Unit tests**
- Core functions, including `parse_inputs`, page extraction, cropping, and TOC creation, are covered.
- Uses `unittest` and avoids direct CLI dependency for testability.
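
The argument format and the separation of parsing from `main()` described above could look roughly like the sketch below. Only the `-i` and `-o` flags and the name `parse_inputs` appear in this README; the long option names, the return shape, the 1-based page numbering, and `build_parser` are assumptions made for illustration.

```python
import argparse


def parse_inputs(specs):
    """Parse 'file.pdf:5,@50-60' strings into (path, requests) pairs.

    Each request is ('section', '5') or ('pages', (50, 60)). This return
    shape is an assumption, not the tool's documented API.
    """
    parsed = []
    for spec in specs:
        # Split on the last colon: everything before it is the file path.
        path, sep, selectors = spec.rpartition(":")
        if not sep or not path or not selectors:
            raise ValueError(f"expected FILE:SELECTORS, got {spec!r}")
        requests = []
        for item in selectors.split(","):
            item = item.strip()
            if not item:
                continue
            if item.startswith("@"):
                # '@10-20' is an explicit page range; '@7' is a single page.
                start, dash, end = item[1:].partition("-")
                first = int(start)
                last = int(end) if dash else first
                if first < 1 or last < first:
                    raise ValueError(f"invalid page range {item!r}")
                requests.append(("pages", (first, last)))
            else:
                requests.append(("section", item))
        parsed.append((path, requests))
    return parsed


def build_parser():
    """Build the CLI parser (long option names are assumed)."""
    parser = argparse.ArgumentParser(description="Aggregate PDF sections")
    parser.add_argument("-i", "--inputs", nargs="+", required=True,
                        metavar="FILE:SELECTORS")
    parser.add_argument("-o", "--output", required=True)
    return parser


def main(argv=None):
    """Parse arguments only; passing argv explicitly keeps this unit-testable."""
    args = build_parser().parse_args(argv)
    return args.output, parse_inputs(args.inputs)


if __name__ == "__main__":
    print(main())
```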
---
## How it Works (High-Level Flow)
1. **Parse CLI inputs**
- Extract PDF file paths and associated sections or page ranges.
- Normalize section references and detect page ranges (e.g., `@10-20`).
2. **Iterate through PDFs**
- Load PDF using `PyPDF2`.
- Build outline tree (hierarchical sections) if available.
3. **Process sections/page ranges**
- For each requested section:
- Find matching section heading in outline (level-aware).
- Determine start and end pages, including all subsections.
- For page ranges prefixed with `@`, extract exact pages.
4. **Crop pages**
- Remove headers and footers based on configurable ratios.
- Normalize page size across all PDFs for visual consistency.
5. **Assemble final PDF**
- Add a Table of Contents as the first pages.
- Merge all extracted pages.
- Add bookmarks corresponding to TOC entries.
6. **Write output**
- Save aggregated PDF to specified output path.
- TOC shows **titles only**, flat structure, and correct page references.
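
Step 3 above can be sketched on a simplified outline, independent of any PDF library. Here the outline is assumed to be a list of `(level, title, start_page)` tuples in document order with 0-based pages; that representation and the names `section_number` and `find_section_pages` are hypothetical. The sketch also assumes each section ends on the page before the next heading at the same or a shallower level, which is a simplification when two sections share a page.

```python
import re


def section_number(title):
    """Return the leading number of a heading ('5.1 Setup' -> '5.1'), or None."""
    match = re.match(r"\s*(\d+(?:\.\d+)*)", title)
    return match.group(1) if match else None


def find_section_pages(outline, wanted, total_pages):
    """Return (first_page, last_page), 0-based and inclusive, for `wanted`.

    `outline` is an assumed list of (level, title, start_page) tuples.
    """
    for index, (level, title, start) in enumerate(outline):
        # Exact comparison: '5' must not match '5.1' or '15'.
        if section_number(title) != wanted:
            continue
        # Deeper headings (subsections) are skipped, so they stay in the range.
        for next_level, _, next_start in outline[index + 1:]:
            if next_level <= level:
                return start, max(start, next_start - 1)
        # No later heading at this level: the section runs to the last page.
        return start, total_pages - 1
    raise KeyError(f"section {wanted!r} not found in outline")


if __name__ == "__main__":
    demo = [(0, "4 Background", 10), (0, "5 Method", 20),
            (1, "5.1 Setup", 22), (1, "5.2 Data", 25), (0, "6 Results", 30)]
    print(find_section_pages(demo, "5", total_pages=40))
```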
---
                  +---------------------+
                  | CLI / Inputs        |
                  | - PDF files         |
                  | - Sections/@pages   |
                  +----------+----------+
                             |
                             v
                  +---------------------+
                  | parse_inputs()      |
                  | Normalize sections  |
                  | Detect page ranges  |
                  +----------+----------+
                             |
                             v
                  +---------------------+
                  | Iterate PDFs        |
                  | Load PDF with       |
                  | PyPDF2              |
                  +----------+----------+
                             |
               +-------------+--------------+
               |                            |
               v                            v
      +-------------------+        +------------------+
      | Section Extraction|        | Page Range (@)   |
      | - Match outline   |        | - Extract pages  |
      | - Level-aware     |        |   directly       |
      | - Subsections     |        +------------------+
      +--------+----------+
               |
               v
    +----------------------+
    | Crop Pages           |
    | - Remove headers     |
    | - Remove footers     |
    | - Normalize size     |
    +----------+-----------+
               |
               v
    +----------------------+
    | TOC & Bookmarks      |
    | - Extracted section  |
    |   titles only        |
    | - Flat TOC structure |
    | - Bookmarks linked   |
    +----------+-----------+
               |
               v
    +----------------------+
    | Merge PDF Pages      |
    | - TOC pages first    |
    | - All extracted pages|
    +----------+-----------+
               |
               v
    +----------------------+
    | Output PDF           |
    | - TOC & bookmarks    |
    | - Cropped &          |
    |   normalized         |
    +----------------------+
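
The "Crop Pages" stage in the flow above comes down to simple box arithmetic, sketched here without any PDF library. The function names `crop_box` and `common_size`, the default ratios, and the choice of the largest cropped width and height as the normalized size are all assumptions; the resulting rectangle would then be applied to each page with whatever crop-box API the PDF library provides.

```python
def crop_box(left, bottom, right, top, header_ratio=0.08, footer_ratio=0.08):
    """Return (left, bottom, right, top) with header and footer bands removed.

    PDF coordinates start at the bottom-left corner, so the header is cut
    from `top` and the footer from `bottom`. Default ratios are assumed.
    """
    if header_ratio < 0 or footer_ratio < 0 or header_ratio + footer_ratio >= 1:
        raise ValueError("ratios must be non-negative and sum to less than 1")
    height = top - bottom
    return (left,
            bottom + height * footer_ratio,
            right,
            top - height * header_ratio)


def common_size(boxes):
    """Pick one (width, height) large enough to hold every cropped box."""
    widths = [right - left for left, _, right, _ in boxes]
    heights = [top - bottom for _, bottom, _, top in boxes]
    return max(widths), max(heights)


if __name__ == "__main__":
    # A 600 x 800 point page with a 12.5% header and a 25% footer removed.
    box = crop_box(0, 0, 600, 800, header_ratio=0.125, footer_ratio=0.25)
    print(box, common_size([box, (0, 50, 500, 750)]))
```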
---
## Example Usage
```bash
python driver.py \
-o aggregated.pdf \
-i input.pdf:5,@50-60 input2.pdf:3.2