# PDF Aggregator & Section Extractor
## Purpose
This Python tool allows you to **extract sections or page ranges from one or more PDFs** and aggregate them into a **single PDF** with:
- Table of Contents (TOC) reflecting section headings
- Optional bookmarks corresponding to TOC entries
- Cropped pages to remove headers and footers for a clean layout
- Flat TOC structure with unnumbered section titles
It is designed for **research papers, reports, or multi-chapter PDFs** where you need to assemble a subset of content into a new document, while preserving readability and navigation.
---
## Key Features
- **Section-based extraction:** Extract sections (e.g., `1.3`, `5`) and all subsections automatically.
- **Page-range extraction:** Extract explicit page ranges using a special syntax (e.g., `@10-20`).
- **Header/Footer cropping:** Automatically removes unwanted headers and footers while maintaining consistent page sizes across multiple PDFs.
- **Flat Table of Contents:** TOC shows only the section titles from the original PDFs; numbering and indentation are stripped.
- **Bookmarks:** Optional PDF bookmarks that match the TOC entries for easy navigation.
- **Multi-source aggregation:** Accepts multiple PDFs and merges the specified sections or page ranges into a single output file.
- **Command-line driven:** Fully configurable via CLI arguments and compatible with debugging in VS Code.
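
The header/footer cropping feature can be illustrated with a small, library-independent sketch. It assumes the crop is expressed as fractions of the page height; the names `Box`, `crop_box`, `header_ratio`, and `footer_ratio` are hypothetical and not taken from `pdfaggregator.py`. The resulting box would then be applied as the page's crop box by whichever PDF library is in use.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """A page rectangle in PDF points; the origin is the bottom-left corner."""
    left: float
    bottom: float
    right: float
    top: float


def crop_box(page: Box, header_ratio: float, footer_ratio: float) -> Box:
    """Trim a fraction of the page height from the top and from the bottom.

    header_ratio and footer_ratio are illustrative parameter names,
    not the tool's actual configuration keys.
    """
    if not (0 <= header_ratio < 1 and 0 <= footer_ratio < 1):
        raise ValueError("ratios must be in [0, 1)")
    if header_ratio + footer_ratio >= 1:
        raise ValueError("ratios would remove the whole page")
    height = page.top - page.bottom
    return Box(
        left=page.left,
        bottom=page.bottom + footer_ratio * height,  # footer sits at the bottom
        right=page.right,
        top=page.top - header_ratio * height,        # header sits at the top
    )


# US Letter page (612 x 792 points), trimming 10% at the top and 5% at the bottom.
letter = Box(0, 0, 612, 792)
cropped = crop_box(letter, header_ratio=0.10, footer_ratio=0.05)
```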
---
## Design Choices
1. **Separation of concerns**
   - `pdfaggregator.py`: All core PDF handling functions (section extraction, cropping, TOC creation).
   - `driver.py`: CLI entry point that parses arguments, validates inputs, and orchestrates aggregation.
   - `test_pdfaggregator.py`: Unit tests for parsing, extraction, TOC, and page handling.
2. **Level-aware section matching**
   - A section number such as `5` matches only the top-level chapter heading, not subsections like `5.1`; the subsections are still extracted as part of the chapter's page span.
   - Page ranges (prefixed with `@`) are treated explicitly and bypass section matching.
3. **Argument parsing**
   - Uses Python's standard `argparse` for CLI argument handling.
   - Input format supports multiple PDFs and multiple sections per file, e.g.:
     ```
     -i input.pdf:5,@50-60 input2.pdf:3.2
     ```
4. **Debug-friendly and testable**
   - CLI parsing is separated from `main()` for easier unit testing.
   - Compatible with VS Code debugging (`launch.json` with `${file}` for portable paths).
5. **TOC and bookmark handling**
   - TOC entries reflect **section headings only**; numbering is stripped.
   - Bookmarks are linked to the corresponding pages in the final PDF.
   - Cropping and page normalization ensure a consistent layout.
6. **Unit tests**
   - Core functions, including `parse_inputs`, page extraction, cropping, and TOC creation, are covered.
   - Uses `unittest` and avoids a direct CLI dependency for testability.
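
The `FILE:SELECTOR,...` input format described above could be parsed along the following lines. This is a minimal sketch, not the tool's actual `parse_inputs`: the helper `parse_spec`, the long option names `--inputs` and `--output`, and the tuple representation of page ranges are all assumptions made for illustration.

```python
import argparse
from typing import List, Tuple, Union

# A selector is either a section number ("5", "3.2") or an inclusive,
# 1-based page range such as (50, 60). This representation is an assumption.
Selector = Union[str, Tuple[int, int]]


def parse_spec(spec: str) -> Tuple[str, List[Selector]]:
    """Split 'file.pdf:5,@50-60' into a path and a list of selectors."""
    # rpartition splits on the LAST colon, so paths containing colons survive.
    path, sep, selectors = spec.rpartition(":")
    if not sep or not path or not selectors:
        raise ValueError(f"expected FILE:SELECTOR[,SELECTOR...], got {spec!r}")
    parsed: List[Selector] = []
    for item in selectors.split(","):
        item = item.strip()
        if not item:
            raise ValueError(f"empty selector in {spec!r}")
        if item.startswith("@"):
            start, dash, end = item[1:].partition("-")
            first = int(start)
            last = int(end) if dash else first  # "@7" means the single page 7
            if first < 1 or last < first:
                raise ValueError(f"invalid page range {item!r}")
            parsed.append((first, last))
        else:
            parsed.append(item)
    return path, parsed


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Aggregate PDF sections")
    # argparse applies `type` to each value and reports ValueError as a usage error.
    parser.add_argument("-i", "--inputs", nargs="+", required=True, type=parse_spec)
    parser.add_argument("-o", "--output", required=True)
    return parser


# Passing an explicit argument list keeps the parser testable without a real CLI call.
args = build_parser().parse_args(
    ["-o", "aggregated.pdf", "-i", "input.pdf:5,@50-60", "input2.pdf:3.2"]
)
```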
---
## How it Works (High-Level Flow)
1. **Parse CLI inputs**
   - Extract PDF file paths and associated sections or page ranges.
   - Normalize section references and detect page ranges (e.g., `@10-20`).
2. **Iterate through PDFs**
   - Load each PDF using `PyPDF2`.
   - Build the outline tree (hierarchical sections) if available.
3. **Process sections/page ranges**
   - For each requested section:
     - Find the matching section heading in the outline (level-aware).
     - Determine start and end pages, including all subsections.
   - For page ranges prefixed with `@`, extract the exact pages.
4. **Crop pages**
   - Remove headers and footers based on configurable ratios.
   - Normalize page size across all PDFs for visual consistency.
5. **Assemble final PDF**
   - Add a Table of Contents as the first pages.
   - Merge all extracted pages.
   - Add bookmarks corresponding to TOC entries.
6. **Write output**
   - Save the aggregated PDF to the specified output path.
   - The TOC shows **titles only**, in a flat structure, with correct page references.
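
Step 3 can be sketched without any PDF library by treating the outline as a flat list of entries in document order. Everything here is hypothetical: the `OutlineEntry` shape, the `split_title` and `find_section` helpers, and the sample outline are illustrations of level-aware matching, not the code in `pdfaggregator.py`. The sketch also assumes each section starts on a new page.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class OutlineEntry(NamedTuple):
    level: int   # 1 for chapters, 2 for subsections, and so on
    title: str   # as stored in the outline, e.g. "5.1 Setup"
    page: int    # 1-based page on which the section starts


# Leading section number ("5", "5.1", "5.") followed by the bare title.
NUMBERED = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+(.*)$")


def split_title(title: str) -> Tuple[Optional[str], str]:
    """Return (number, bare title); number is None for unnumbered titles."""
    match = NUMBERED.match(title)
    if match:
        return match.group(1), match.group(2).strip()
    return None, title.strip()


def find_section(
    outline: List[OutlineEntry], number: str, last_page: int
) -> Optional[Tuple[str, int, int]]:
    """Return (bare title, first page, last page) for the requested section."""
    for index, entry in enumerate(outline):
        found_number, bare_title = split_title(entry.title)
        if found_number != number:  # exact match: "5" never matches "5.1"
            continue
        end = last_page
        # The section runs until the next entry at the same or a higher level,
        # so deeper subsections stay inside the extracted span.
        for later in outline[index + 1:]:
            if later.level <= entry.level:
                end = max(entry.page, later.page - 1)
                break
        return bare_title, entry.page, end
    return None


outline = [
    OutlineEntry(1, "4 Background", 30),
    OutlineEntry(1, "5 Results", 40),
    OutlineEntry(2, "5.1 Setup", 41),
    OutlineEntry(2, "5.2 Findings", 45),
    OutlineEntry(1, "6 Discussion", 52),
]

chapter = find_section(outline, "5", last_page=60)       # ("Results", 40, 51)
subsection = find_section(outline, "5.1", last_page=60)  # ("Setup", 41, 44)
```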
---
                   +---------------------+
                   | CLI / Inputs        |
                   | - PDF files         |
                   | - Sections/@pages   |
                   +----------+----------+
                              |
                              v
                   +---------------------+
                   | parse_inputs()      |
                   | Normalize sections  |
                   | Detect page ranges  |
                   +----------+----------+
                              |
                              v
                   +---------------------+
                   | Iterate PDFs        |
                   | Load PDF with       |
                   | PyPDF2              |
                   +----------+----------+
                              |
                +-------------+--------------+
                |                            |
                v                            v
      +-------------------+        +------------------+
      | Section Extraction|        | Page Range (@)   |
      | - Match outline   |        | - Extract pages  |
      | - Level-aware     |        |   directly       |
      | - Subsections     |        +------------------+
      +---------+---------+
                |
                v
    +-----------------------+
    | Crop Pages            |
    | - Remove headers      |
    | - Remove footers      |
    | - Normalize size      |
    +-----------+-----------+
                |
                v
    +-----------------------+
    | TOC & Bookmarks       |
    | - Extracted section   |
    |   titles only         |
    | - Flat TOC structure  |
    | - Bookmarks linked    |
    +-----------+-----------+
                |
                v
    +-----------------------+
    | Merge PDF Pages       |
    | - TOC pages first     |
    | - All extracted pages |
    +-----------+-----------+
                |
                v
    +-----------------------+
    | Output PDF            |
    | - TOC & bookmarks     |
    | - Cropped & normalized|
    +-----------------------+
---
## Example Usage
```bash
python driver.py \
-o aggregated.pdf \
-i input.pdf:5,@50-60 input2.pdf:3.2