PDF Aggregator & Section Extractor

Purpose

This Python tool allows you to extract sections or page ranges from one or more PDFs and aggregate them into a single PDF with:

  • Table of Contents (TOC) reflecting section headings
  • Optional bookmarks corresponding to TOC entries
  • Cropped pages to remove headers and footers for a clean layout
  • Flat TOC structure with unnumbered section titles

It is designed for research papers, reports, or multi-chapter PDFs where you need to assemble a subset of content into a new document, while preserving readability and navigation.


Key Features

  • Section-based extraction: Extract sections (e.g., 1.3, 5) and all subsections automatically.
  • Page-range extraction: Extract explicit page ranges using a special syntax (e.g., @10-20).
  • Header/Footer cropping: Automatically removes unwanted headers and footers while maintaining consistent page sizes across multiple PDFs.
  • Flat Table of Contents: TOC shows only the section titles from the original PDFs; numbering and indentation are stripped.
  • Bookmarks: Optional PDF bookmarks that match the TOC entries for easy navigation.
  • Multi-source aggregation: Accepts multiple PDFs and merges the specified sections or page ranges into a single output file.
  • Command-line driven: Fully configurable via CLI arguments and compatible with debugging in VS Code.
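The flat TOC feature strips numbering from the original headings. A minimal sketch of how that could be done is shown here; `flat_title` and the regular expression are hypothetical and are not taken from pdfaggregator.py.

```python
import re

# Assumption: headings start with a dotted section number such as "1.3" or "5.".
# This helper is illustrative only and is not part of pdfaggregator.py.
_NUMBER_PREFIX = re.compile(r"^\s*\d+(?:\.\d+)*\.?\s+")


def flat_title(outline_title: str) -> str:
    """Return the heading text without leading indentation or section number."""
    return _NUMBER_PREFIX.sub("", outline_title).strip()


titles = ["1.3 Related Work", "  5. Evaluation", "Appendix"]
print([flat_title(t) for t in titles])
# prints ['Related Work', 'Evaluation', 'Appendix']
```

A known limitation of this sketch: a heading that legitimately begins with a number (for example a year) would lose it as well.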

Design Choices

  1. Separation of concerns

    • pdfaggregator.py: All core PDF handling functions (section extraction, cropping, TOC creation).
    • driver.py: CLI entry point that parses arguments, validates inputs, and orchestrates aggregation.
    • test_pdfaggregator.py: Unit tests for parsing, extraction, TOC, and page handling.
  2. Level-aware section matching

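    • A section number (e.g., 5) is matched only against the top-level chapter heading, never against a subsection heading such as 5.1; the subsections of the matched section are still extracted along with it.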
    • Page ranges (prefixed with @) are treated explicitly and bypass section matching.
  3. Argument parsing

    • Uses Python's standard argparse for CLI argument handling.

    • Input format supports multiple PDFs and multiple sections per file, e.g.:

      -i input.pdf:5,@50-60 input2.pdf:3.2
      
  4. Debug-friendly and testable

    • CLI parsing separated from main() for easier unit testing.
    • Compatible with VS Code debugging (launch.json with ${file} for portable paths).
  5. TOC and bookmark handling

    • TOC entries reflect section headings only; numbering is stripped.
    • Bookmarks are linked to the corresponding pages in the final PDF.
    • Cropping and page normalization ensure a consistent layout.
  6. Unit tests

    • Core functions, including parse_inputs, page extraction, cropping, and TOC creation, are covered.
    • Uses unittest and avoids direct CLI dependency for testability.
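The argument-parsing and testability choices above might look roughly like the sketch below. Only the `-i` and `-o` flags and the name `parse_inputs` come from this README; the signatures, `build_parser`, `parse_args`, and the returned dictionary shape are assumptions made for illustration.

```python
import argparse
from typing import Dict, List, Optional


def parse_inputs(specs: List[str]) -> Dict[str, List[str]]:
    """Map each PDF path to its requested sections and @page ranges.

    Assumed input shape: "FILE:ITEM[,ITEM...]", e.g. "input.pdf:5,@50-60".
    """
    requests: Dict[str, List[str]] = {}
    for spec in specs:
        # Split on the last colon so a path that itself contains ":" survives.
        path, sep, selection = spec.rpartition(":")
        items = [item.strip() for item in selection.split(",") if item.strip()]
        if not sep or not path or not items:
            raise ValueError(f"expected FILE:SECTION[,SECTION...], got {spec!r}")
        requests.setdefault(path, []).extend(items)
    return requests


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Aggregate PDF sections")
    parser.add_argument("-i", dest="inputs", nargs="+", required=True,
                        metavar="FILE:SECTIONS")
    parser.add_argument("-o", dest="output", required=True, metavar="OUTPUT")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Kept separate from main() so tests can pass an explicit argv list."""
    return build_parser().parse_args(argv)


args = parse_args(["-o", "aggregated.pdf",
                   "-i", "input.pdf:5,@50-60", "input2.pdf:3.2"])
print(parse_inputs(args.inputs))
# prints {'input.pdf': ['5', '@50-60'], 'input2.pdf': ['3.2']}
```

Because `parse_args` accepts an explicit list, a unittest case can exercise it without touching `sys.argv`.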

How it Works (High-Level Flow)

  1. Parse CLI inputs

    • Extract PDF file paths and associated sections or page ranges.
    • Normalize section references and detect page ranges (e.g., @10-20).
  2. Iterate through PDFs

    • Load PDF using PyPDF2.
    • Build outline tree (hierarchical sections) if available.
  3. Process sections/page ranges

    • For each requested section:
      • Find matching section heading in outline (level-aware).
      • Determine start and end pages, including all subsections.
    • For page ranges prefixed with @, extract exact pages.
  4. Crop pages

    • Remove headers and footers based on configurable ratios.
    • Normalize page size across all PDFs for visual consistency.
  5. Assemble final PDF

    • Add a Table of Contents as the first pages.
    • Merge all extracted pages.
    • Add bookmarks corresponding to TOC entries.
  6. Write output

    • Save aggregated PDF to specified output path.
    • The TOC shows titles only, in a flat structure, with correct page references.
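Step 3 of the flow can be sketched in plain Python, independent of any PDF library. The `OutlineEntry` tuple, `section_range`, and `page_range` are hypothetical names. The sketch also assumes that the outline has already been flattened into document order, that each section starts on a fresh page, and that `@` ranges use 1-based inclusive page numbers. None of these details are confirmed by the tool itself.

```python
from typing import List, NamedTuple, Optional, Tuple


class OutlineEntry(NamedTuple):
    level: int   # 1 for chapters, 2 for subsections, ...
    title: str   # e.g. "5.1 Setup"
    page: int    # 0-based index of the page where the heading starts


def leading_number(title: str) -> str:
    """Return the section number at the start of a title ("5.1 Setup" -> "5.1")."""
    parts = title.split(maxsplit=1)
    return parts[0].rstrip(".") if parts else ""


def section_range(outline: List[OutlineEntry], number: str,
                  page_count: int) -> Optional[Tuple[int, int]]:
    """Find (first_page, last_page), 0-based inclusive, for a section number.

    Level-aware: "5" is compared only with level-1 entries, so "5.1" never
    matches it, yet the returned range still spans all of chapter 5.
    """
    depth = number.count(".") + 1
    for i, entry in enumerate(outline):
        if entry.level != depth or leading_number(entry.title) != number:
            continue
        end = page_count - 1  # default: section runs to the end of the PDF
        for later in outline[i + 1:]:
            if later.level <= entry.level:  # next sibling or higher-level heading
                end = max(entry.page, later.page - 1)
                break
        return entry.page, end
    return None


def page_range(token: str) -> Tuple[int, int]:
    """Convert "@10-20" (assumed 1-based, inclusive) to 0-based indices."""
    first, _, last = token[1:].partition("-")
    start, end = int(first), int(last or first)
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {token!r}")
    return start - 1, end - 1


outline = [
    OutlineEntry(1, "4 Methods", 30),
    OutlineEntry(1, "5 Results", 40),
    OutlineEntry(2, "5.1 Setup", 42),
    OutlineEntry(2, "5.2 Findings", 47),
    OutlineEntry(1, "6 Discussion", 55),
]
print(section_range(outline, "5", page_count=80))    # prints (40, 54)
print(section_range(outline, "5.1", page_count=80))  # prints (42, 46)
print(page_range("@10-20"))                          # prints (9, 19)
```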

           +---------------------+
           |   CLI / Inputs      |
           |   - PDF files       |
           |   - Sections/@pages |
           +----------+----------+
                      |
                      v
           +---------------------+
           |  parse_inputs()     |
           |  Normalize sections |
           |  Detect page ranges |
           +----------+----------+
                      |
                      v
           +---------------------+
           |  Iterate PDFs       |
           |  Load PDF with      |
           |  PyPDF2             |
           +----------+----------+
                      |
        +-------------+--------------+
        |                            |
        v                            v

+-------------------+ +------------------+
| Section Extraction| | Page Range (@)   |
| - Match outline   | | - Extract pages  |
| - Level-aware     | |   directly       |
| - Subsections     | +------------------+
+--------+----------+
         |
         v
+------------------------+
| Crop Pages             |
| - Remove headers       |
| - Remove footers       |
| - Normalize size       |
+-----------+------------+
            |
            v
+------------------------+
| TOC & Bookmarks        |
| - Extracted section    |
|   titles only          |
| - Flat TOC structure   |
| - Bookmarks linked     |
+-----------+------------+
            |
            v
+------------------------+
| Merge PDF Pages        |
| - TOC pages first      |
| - All extracted pages  |
+-----------+------------+
            |
            v
+------------------------+
| Output PDF             |
| - TOC & bookmarks      |
| - Cropped & normalized |
+------------------------+
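The "Crop Pages" step relies on configurable header and footer ratios. The arithmetic behind such a crop might look like the sketch below; `crop_box`, `common_size`, and the choice of the largest page as the normalization target are assumptions, and the sketch works on plain coordinate tuples rather than on PyPDF2 page objects.

```python
from typing import List, Tuple

# (left, bottom, right, top) in PDF points, origin at the bottom-left corner.
Box = Tuple[float, float, float, float]


def crop_box(media_box: Box, header_ratio: float, footer_ratio: float) -> Box:
    """Shrink a page box by a fraction of its height at the top and bottom."""
    if header_ratio < 0 or footer_ratio < 0 or header_ratio + footer_ratio >= 1:
        raise ValueError("ratios must be non-negative and sum to less than 1")
    left, bottom, right, top = media_box
    height = top - bottom
    return (left, bottom + footer_ratio * height,
            right, top - header_ratio * height)


def common_size(boxes: List[Box]) -> Tuple[float, float]:
    """Pick one (width, height) for all pages: here, the largest of each.

    Assumption: the real tool may normalize differently.
    """
    width = max(right - left for left, _, right, _ in boxes)
    height = max(top - bottom for _, bottom, _, top in boxes)
    return width, height


letter = (0, 0, 612, 792)
cropped = crop_box(letter, header_ratio=0.125, footer_ratio=0.0625)
print(cropped)                                   # prints (0, 49.5, 612, 693.0)
print(common_size([cropped, (0, 0, 595, 842)]))  # prints (612, 842)
```

The resulting tuple would then be applied to each page's crop box by whatever PDF library performs the actual page manipulation.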


Example Usage

python driver.py \
    -o aggregated.pdf \
    -i input.pdf:5,@50-60 input2.pdf:3.2