PDF Aggregator & Section Extractor
Purpose
This Python tool allows you to extract sections or page ranges from one or more PDFs and aggregate them into a single PDF with:
- Table of Contents (TOC) reflecting section headings
- Optional bookmarks corresponding to TOC entries
- Cropped pages to remove headers and footers for a clean layout
- Flat TOC structure with unnumbered section titles
It is designed for research papers, reports, or multi-chapter PDFs where you need to assemble a subset of content into a new document, while preserving readability and navigation.
Key Features
- Section-based extraction: Extract sections (e.g., 1.3, 5) and all subsections automatically.
- Page-range extraction: Extract explicit page ranges using a special syntax (e.g., @10-20).
- Header/Footer cropping: Automatically removes unwanted headers and footers while maintaining consistent page sizes across multiple PDFs.
- Flat Table of Contents: TOC shows only the section titles from the original PDFs; numbering and indentation are stripped.
- Bookmarks: Optional PDF bookmarks that match the TOC entries for easy navigation.
- Multi-source aggregation: Accepts multiple PDFs and merges the specified sections or page ranges into a single output file.
- Command-line driven: Fully configurable via CLI arguments and compatible with debugging in VS Code.
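As an illustration of the flat TOC feature, the following is a minimal sketch of how section numbering and indentation might be stripped from headings. The names flat_toc_title and build_flat_toc are hypothetical and are not taken from pdfaggregator.py; the regular expression assumes headings start with a dotted number such as "1.3" followed by whitespace.

```python
import re

# Assumed heading shape: optional indentation, a dotted section number
# ("5", "1.3", "2.4.1."), then whitespace, then the title text.
_NUMBER_PREFIX = re.compile(r"^\s*\d+(?:\.\d+)*\.?\s+")


def flat_toc_title(heading: str) -> str:
    """Return the heading without its leading section number or indentation."""
    return _NUMBER_PREFIX.sub("", heading).strip()


def build_flat_toc(entries):
    """Turn (heading, page) pairs into a flat list of (title, page) pairs."""
    return [(flat_toc_title(heading), page) for heading, page in entries]
```

A heading that merely begins with a digit, such as "3D Rendering", is left unchanged because the number must be followed by whitespace. A title that genuinely starts with a standalone number would lose it, so a real implementation may need a stricter rule.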
Design Choices
- Separation of concerns
  - pdfaggregator.py: All core PDF handling functions (section extraction, cropping, TOC creation).
  - driver.py: CLI entry point that parses arguments, validates inputs, and orchestrates aggregation.
  - test_pdfaggregator.py: Unit tests for parsing, extraction, TOC, and page handling.
- Level-aware section matching
  - Section numbers (e.g., 5) match only top-level chapters, not subsections like 5.1.
  - Page ranges (prefixed with @) are treated explicitly and bypass section matching.
- Argument parsing
  - Uses Python’s standard argparse for CLI argument handling.
  - Input format supports multiple PDFs and multiple sections per file, e.g.:
    -i input.pdf:5,@50-60 input2.pdf:3.2
- Debug-friendly and testable
  - CLI parsing separated from main() for easier unit testing.
  - Compatible with VS Code debugging (launch.json with ${file} for portable paths).
- TOC and bookmark handling
  - TOC entries reflect section headings only; numbering is stripped.
  - Bookmarks are linked to the corresponding pages in the final PDF.
  - Cropping and page normalization ensure a consistent layout.
- Unit tests
  - Core functions, including parse_inputs, page extraction, cropping, and TOC creation, are covered.
  - Uses unittest and avoids direct CLI dependency for testability.
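The argument-parsing and testability choices above can be sketched as follows. This is a hedged illustration, not the code in driver.py: the names SectionRef, PageRange, parse_selector, parse_input_spec, build_parser, and parse_args are hypothetical, and the real parse_inputs may have a different signature. Only the -i and -o flags and the file.pdf:selector,selector syntax come from the usage shown in this README.

```python
import argparse
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class SectionRef:
    number: str  # e.g. "5" or "3.2"


@dataclass(frozen=True)
class PageRange:
    first: int  # 1-based, inclusive
    last: int   # 1-based, inclusive


Selector = Union[SectionRef, PageRange]


def parse_selector(token: str) -> Selector:
    """Classify one selector: '@10-20' is a page range, '3.2' is a section."""
    token = token.strip()
    if token.startswith("@"):
        first, dash, last = token[1:].partition("-")
        start = int(first)
        end = int(last) if dash else start  # '@7' means the single page 7
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {token!r}")
        return PageRange(start, end)
    if not token or not all(part.isdigit() for part in token.split(".")):
        raise ValueError(f"invalid section number: {token!r}")
    return SectionRef(token)


def parse_input_spec(spec: str) -> Tuple[str, List[Selector]]:
    """Split 'file.pdf:5,@50-60' into the path and its selectors."""
    # rpartition keeps earlier colons (e.g. a Windows drive letter) in the path.
    path, colon, selectors = spec.rpartition(":")
    if not colon or not path or not selectors:
        raise ValueError(f"expected PATH:SELECTORS, got {spec!r}")
    return path, [parse_selector(t) for t in selectors.split(",")]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Aggregate PDF sections.")
    parser.add_argument("-i", dest="inputs", nargs="+", required=True)
    parser.add_argument("-o", dest="output", required=True)
    return parser


def parse_args(argv=None):
    """Parse an explicit argv list so tests never touch sys.argv."""
    args = build_parser().parse_args(argv)
    return args.output, [parse_input_spec(s) for s in args.inputs]
```

Because parse_args accepts an explicit argument list, a unittest case can call it directly, for example parse_args(["-o", "out.pdf", "-i", "a.pdf:5"]), without launching the CLI.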
How it Works (High-Level Flow)
- Parse CLI inputs
  - Extract PDF file paths and associated sections or page ranges.
  - Normalize section and page-range references (e.g., @10-20).
- Iterate through PDFs
  - Load PDF using PyPDF2.
  - Build outline tree (hierarchical sections) if available.
- Process sections/page ranges
  - For each requested section:
    - Find matching section heading in outline (level-aware).
    - Determine start and end pages, including all subsections.
  - For page ranges prefixed with @, extract exact pages.
- Crop pages
  - Remove headers and footers based on configurable ratios.
  - Normalize page size across all PDFs for visual consistency.
- Assemble final PDF
  - Add a Table of Contents as the first pages.
  - Merge all extracted pages.
  - Add bookmarks corresponding to TOC entries.
- Write output
  - Save aggregated PDF to specified output path.
  - TOC shows titles only, flat structure, and correct page references.
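The section-processing step can be illustrated with a small, library-free sketch. It assumes the PDF outline has already been flattened into (level, title, page) entries, that outline depth mirrors the dotted section number (level 1 for "5", level 2 for "5.1"), and that each section starts on a new page. OutlineEntry, find_section, and section_span are hypothetical names, not functions from pdfaggregator.py.

```python
from typing import List, NamedTuple, Optional, Tuple


class OutlineEntry(NamedTuple):
    level: int  # 1 = chapter, 2 = subsection, ...
    title: str  # e.g. "5.1 Setup"
    page: int   # 1-based start page


def heading_number(title: str) -> str:
    """Return the leading number of a heading, e.g. '5.1' for '5.1 Setup'."""
    parts = title.split()
    return parts[0].rstrip(".") if parts else ""


def find_section(outline: List[OutlineEntry], number: str) -> Optional[int]:
    """Level-aware match: '5' matches the chapter heading '5 ...', never '5.1 ...'."""
    depth = number.count(".") + 1
    for index, entry in enumerate(outline):
        if entry.level == depth and heading_number(entry.title) == number:
            return index
    return None


def section_span(outline: List[OutlineEntry], number: str,
                 total_pages: int) -> Tuple[int, int]:
    """Return (first_page, last_page) of a section, including its subsections."""
    index = find_section(outline, number)
    if index is None:
        raise KeyError(f"section {number!r} not found in outline")
    start, level = outline[index].page, outline[index].level
    # The section ends just before the next heading at the same or a higher level.
    for later in outline[index + 1:]:
        if later.level <= level:
            return start, max(start, later.page - 1)
    return start, total_pages  # last section runs to the end of the document
```

With chapters starting at pages 10, 20, and 30 and subsections 5.1 and 5.2 at pages 22 and 25, requesting "5" yields pages 20 to 29, while requesting "5.1" yields pages 22 to 24.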
            +---------------------+
            | CLI / Inputs        |
            | - PDF files         |
            | - Sections/@pages   |
            +----------+----------+
                       |
                       v
            +---------------------+
            | parse_inputs()      |
            | Normalize sections  |
            | Detect page ranges  |
            +----------+----------+
                       |
                       v
            +---------------------+
            | Iterate PDFs        |
            | Load PDF with       |
            | PyPDF2              |
            +----------+----------+
                       |
         +-------------+--------------+
         |                            |
         v                            v
+-------------------+   +------------------+
| Section Extraction|   | Page Range (@)   |
| - Match outline   |   | - Extract pages  |
| - Level-aware     |   |   directly       |
| - Subsections     |   +------------------+
+--------+----------+
         |
         v
+-----------------------+
| Crop Pages            |
| - Remove headers      |
| - Remove footers      |
| - Normalize size      |
+-----------+-----------+
            |
            v
+-----------------------+
| TOC & Bookmarks       |
| - Extracted section   |
|   titles only         |
| - Flat TOC structure  |
| - Bookmarks linked    |
+-----------+-----------+
            |
            v
+-----------------------+
| Merge PDF Pages       |
| - TOC pages first     |
| - All extracted pages |
+-----------+-----------+
            |
            v
+-----------------------+
| Output PDF            |
| - TOC & bookmarks     |
| - Cropped & normalized|
+-----------------------+
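The Crop Pages step in the flow above can be approximated with plain arithmetic. The sketch below assumes the usual PDF coordinate system (origin at the bottom-left corner, units in points) and assumes that the header and footer ratios are fractions of the page height. Box, crop_box, and common_size are hypothetical names, and how the resulting box is applied to a PyPDF2 page is deliberately left out.

```python
from typing import Iterable, NamedTuple, Tuple


class Box(NamedTuple):
    left: float
    bottom: float
    right: float
    top: float


def crop_box(page: Box, header_ratio: float, footer_ratio: float) -> Box:
    """Trim header_ratio of the height from the top and footer_ratio from the bottom."""
    if header_ratio < 0 or footer_ratio < 0 or header_ratio + footer_ratio >= 1:
        raise ValueError("ratios must be non-negative and sum to less than 1")
    height = page.top - page.bottom
    return Box(
        page.left,
        page.bottom + footer_ratio * height,  # footer sits at the bottom edge
        page.right,
        page.top - header_ratio * height,     # header sits at the top edge
    )


def common_size(boxes: Iterable[Box]) -> Tuple[float, float]:
    """One possible normalization target: the largest width and height seen."""
    boxes = list(boxes)
    width = max(b.right - b.left for b in boxes)
    height = max(b.top - b.bottom for b in boxes)
    return width, height
```

For a US Letter page (612 by 792 points) with a 10% header and 5% footer, the cropped box keeps the full width and spans roughly 39.6 to 712.8 points vertically. Using the largest cropped size as the common target is only one option; scaling every page to the first page's size would be another.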
Example Usage
python driver.py \
-o aggregated.pdf \
-i input.pdf:5,@50-60 input2.pdf:3.2