improved pdfcreator
1- use CLI 2- refactor code
161
python/pdfcreator/README.md
Normal file
@@ -0,0 +1,161 @@
# PDF Aggregator & Section Extractor

## Purpose

This Python tool allows you to **extract sections or page ranges from one or more PDFs** and aggregate them into a **single PDF** with:

- Table of Contents (TOC) reflecting section headings
- Optional bookmarks corresponding to TOC entries
- Cropped pages to remove headers and footers for a clean layout
- Flat TOC structure with unnumbered section titles

It is designed for **research papers, reports, or multi-chapter PDFs** where you need to assemble a subset of content into a new document, while preserving readability and navigation.

---

## Key Features

- **Section-based extraction:** Extract sections (e.g., `1.3`, `5`) and all subsections automatically.
- **Page-range extraction:** Extract explicit page ranges using a special syntax (e.g., `@10-20`).
- **Header/Footer cropping:** Automatically removes unwanted headers and footers while maintaining consistent page sizes across multiple PDFs.
- **Flat Table of Contents:** TOC shows only the section titles from the original PDFs; numbering and indentation are stripped.
- **Bookmarks:** Optional PDF bookmarks that match the TOC entries for easy navigation.
- **Multi-source aggregation:** Accepts multiple PDFs and merges the specified sections or page ranges into a single output file.
- **Command-line driven:** Fully configurable via CLI arguments and compatible with debugging in VS Code.

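
The two request forms listed above (a section number such as `1.3` and a page range such as `@10-20`) can be told apart with a small classifier. The following is a minimal sketch, not the tool's actual code: the names `Spec` and `classify_spec` are hypothetical, and treating page numbers as 1-based and inclusive is an assumption.

```python
import re
from typing import NamedTuple, Optional


class Spec(NamedTuple):
    """One parsed request (hypothetical structure, for illustration only)."""
    kind: str                # "section" or "pages"
    section: Optional[str]   # e.g. "1.3" when kind == "section"
    start: Optional[int]     # first page when kind == "pages" (assumed 1-based)
    end: Optional[int]       # last page when kind == "pages" (assumed inclusive)


_PAGE_RANGE = re.compile(r"^@(\d+)-(\d+)$")   # "@10-20"
_SECTION = re.compile(r"^\d+(\.\d+)*$")       # "5", "1.3", "2.4.1"


def classify_spec(token: str) -> Spec:
    """Classify a single token as a section number or an explicit page range."""
    token = token.strip()
    match = _PAGE_RANGE.match(token)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {token!r}")
        return Spec("pages", None, start, end)
    if _SECTION.match(token):
        return Spec("section", token, None, None)
    raise ValueError(f"unrecognised request: {token!r}")
```
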
---

## Design Choices

1. **Separation of concerns**
   - `pdfaggregator.py`: All core PDF handling functions (section extraction, cropping, TOC creation).
   - `driver.py`: CLI entry point that parses arguments, validates inputs, and orchestrates aggregation.
   - `test_pdfaggregator.py`: Unit tests for parsing, extraction, TOC, and page handling.

2. **Level-aware section matching**
   - A section number such as `5` matches only the top-level chapter heading, not subsection headings like `5.1`. The chapter's subsections are still included in the extracted page range.
   - Page ranges (prefixed with `@`) are treated explicitly and bypass section matching.

3. **Argument parsing**
   - Uses Python’s standard `argparse` for CLI argument handling.
   - Input format supports multiple PDFs and multiple sections per file, e.g.:

     ```
     -i input.pdf:5,@50-60 input2.pdf:3.2
     ```

4. **Debug-friendly and testable**
   - CLI parsing is separated from `main()` for easier unit testing.
   - Compatible with VS Code debugging (`launch.json` with `${file}` for portable paths).

5. **TOC and bookmark handling**
   - TOC entries reflect **section headings only**; numbering is stripped.
   - Bookmarks are linked to the corresponding pages in the final PDF.
   - Cropping and page normalization ensure a consistent layout.

6. **Unit tests**
   - Core functions, including `parse_inputs`, page extraction, cropping, and TOC creation, are covered.
   - Uses `unittest` and avoids a direct CLI dependency for testability.

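
The argument-parsing choice above can be sketched as follows. This is an illustration under stated assumptions rather than the code in `driver.py`: only the `-i` and `-o` flags and the `PDF:SPEC,SPEC` input format come from this README, while the long option names, the `build_parser` helper, and the signature and return shape of `parse_inputs` are guesses made for the example. Splitting on the last colon is a deliberate choice so that a path containing a drive letter still parses.

```python
import argparse
from typing import Dict, List, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser (long option names are assumptions)."""
    parser = argparse.ArgumentParser(
        description="Aggregate PDF sections or page ranges into one file."
    )
    parser.add_argument(
        "-i", "--inputs", nargs="+", required=True, metavar="PDF:SPECS",
        help="one or more inputs such as input.pdf:5,@50-60",
    )
    parser.add_argument("-o", "--output", required=True, help="output PDF path")
    return parser


def parse_inputs(items: Sequence[str]) -> Dict[str, List[str]]:
    """Map each PDF path to its requested sections and page ranges."""
    result: Dict[str, List[str]] = {}
    for item in items:
        # Split on the LAST colon: everything before it is the path.
        path, sep, specs = item.rpartition(":")
        tokens = [s.strip() for s in specs.split(",") if s.strip()]
        if not sep or not path or not tokens:
            raise ValueError(f"expected PDF:SPEC[,SPEC...], got {item!r}")
        result.setdefault(path, []).extend(tokens)
    return result


if __name__ == "__main__":
    # Keeping parsing out of main() lets tests call parse_inputs() directly.
    args = build_parser().parse_args()
    print(args.output, parse_inputs(args.inputs))
```
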
---

## How it Works (High-Level Flow)

1. **Parse CLI inputs**
   - Extract PDF file paths and associated sections or page ranges.
   - Normalize section references and detect page ranges (e.g., `@10-20`).

2. **Iterate through PDFs**
   - Load PDF using `PyPDF2`.
   - Build outline tree (hierarchical sections) if available.

3. **Process sections/page ranges**
   - For each requested section:
     - Find matching section heading in outline (level-aware).
     - Determine start and end pages, including all subsections.
   - For page ranges prefixed with `@`, extract exact pages.

4. **Crop pages**
   - Remove headers and footers based on configurable ratios.
   - Normalize page size across all PDFs for visual consistency.

5. **Assemble final PDF**
   - Add a Table of Contents as the first pages.
   - Merge all extracted pages.
   - Add bookmarks corresponding to TOC entries.

6. **Write output**
   - Save aggregated PDF to specified output path.
   - TOC shows **titles only**, flat structure, and correct page references.

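
Step 3 above (level-aware matching, then finding the start and end pages including subsections) can be illustrated on a flattened outline. This is a sketch under several assumptions, not the tool's implementation: `OutlineEntry` and `find_section_span` are hypothetical names, the outline is assumed to be already flattened into `(level, title, page)` entries, each title is assumed to begin with its section number, the nesting level is assumed to equal the depth of the number, and the next heading is assumed to start on a new page.

```python
from typing import List, NamedTuple, Optional, Tuple


class OutlineEntry(NamedTuple):
    """One flattened outline item (hypothetical structure)."""
    level: int   # 1 for chapters, 2 for their subsections, and so on
    title: str   # heading text, assumed to start with its number, e.g. "5.1 Setup"
    page: int    # 0-based index of the page the heading points to


def find_section_span(
    outline: List[OutlineEntry], section: str, total_pages: int
) -> Optional[Tuple[int, int]]:
    """Return (first_page, last_page), 0-based inclusive, or None if not found."""
    wanted_level = section.count(".") + 1   # "5" -> 1, "5.1" -> 2
    for i, entry in enumerate(outline):
        parts = entry.title.split(maxsplit=1)
        number = parts[0].rstrip(".") if parts else ""
        if entry.level != wanted_level or number != section:
            continue
        # The section runs until the next heading at the same or a higher
        # level, so every deeper subsection in between is included.
        for later in outline[i + 1:]:
            if later.level <= entry.level:
                return entry.page, max(entry.page, later.page - 1)
        return entry.page, total_pages - 1   # last section: runs to the end
    return None
```
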
---

                  +---------------------+
                  | CLI / Inputs        |
                  | - PDF files         |
                  | - Sections/@pages   |
                  +----------+----------+
                             |
                             v
                  +---------------------+
                  | parse_inputs()      |
                  | Normalize sections  |
                  | Detect page ranges  |
                  +----------+----------+
                             |
                             v
                  +---------------------+
                  | Iterate PDFs        |
                  | Load PDF with       |
                  | PyPDF2              |
                  +----------+----------+
                             |
               +-------------+--------------+
               |                            |
               v                            v
      +-------------------+        +------------------+
      | Section Extraction|        | Page Range (@)   |
      | - Match outline   |        | - Extract pages  |
      | - Level-aware     |        |   directly       |
      | - Subsections     |        +------------------+
      +--------+----------+
               |
               v
    +-----------------------+
    | Crop Pages            |
    | - Remove headers      |
    | - Remove footers      |
    | - Normalize size      |
    +----------+------------+
               |
               v
    +-----------------------+
    | TOC & Bookmarks       |
    | - Extracted section   |
    |   titles only         |
    | - Flat TOC structure  |
    | - Bookmarks linked    |
    +----------+------------+
               |
               v
    +-----------------------+
    | Merge PDF Pages       |
    | - TOC pages first     |
    | - All extracted pages |
    +----------+------------+
               |
               v
    +-----------------------+
    | Output PDF            |
    | - TOC & bookmarks     |
    | - Cropped & normalized|
    +-----------------------+

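
The "Crop Pages" stage in the flow can be reduced to a small piece of arithmetic: trim a fraction of the page height from the top and from the bottom. The sketch below shows only that calculation and makes no claim about how `pdfaggregator.py` does it. The function name `crop_box`, the default ratios, and the `(left, bottom, right, top)` tuple layout in PDF points are all assumptions. The resulting rectangle would then have to be applied to each page's crop box by the PDF library.

```python
from typing import Tuple

# (left, bottom, right, top) in PDF points; the origin is the bottom-left corner.
Box = Tuple[float, float, float, float]


def crop_box(
    media_box: Box, header_ratio: float = 0.08, footer_ratio: float = 0.08
) -> Box:
    """Return the page rectangle with header and footer bands trimmed away.

    Ratios are fractions of the page height (default values are assumptions).
    """
    if header_ratio < 0 or footer_ratio < 0 or header_ratio + footer_ratio >= 1:
        raise ValueError("ratios must be non-negative and sum to less than 1")
    left, bottom, right, top = media_box
    height = top - bottom
    return (
        left,
        bottom + footer_ratio * height,   # raise the bottom edge past the footer
        right,
        top - header_ratio * height,      # lower the top edge past the header
    )
```
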
---

## Example Usage

```bash
python driver.py \
    -o aggregated.pdf \
    -i input.pdf:5,@50-60 input2.pdf:3.2
```