improved pdfcreator

1. use a CLI instead of hard-coded config
2. refactor code into reusable modules
This commit is contained in:

python/.vscode/launch.json (vendored, 9 lines changed)
@@ -4,12 +4,17 @@
     // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
     "version": "0.2.0",
     "configurations": [
         {
-            "name": "Python Debugger: Current File",
+            "name": "Debug PDF Aggregator",
             "type": "debugpy",
             "request": "launch",
             "program": "${file}",
+            "args": [
+                "-o",
+                "${workspaceFolder}/pdfcreator/out.pdf",
+                "-i",
+                "${workspaceFolder}/pdfcreator/input.pdf:5,@50-60",
+            ],
             "console": "integratedTerminal"
         }
     ]
python/pdfcreator/README.md (new file, 161 lines)
# PDF Aggregator & Section Extractor

## Purpose

This Python tool allows you to **extract sections or page ranges from one or more PDFs** and aggregate them into a **single PDF** with:

- Table of Contents (TOC) reflecting section headings
- Optional bookmarks corresponding to TOC entries
- Cropped pages to remove headers and footers for a clean layout
- Flat TOC structure with unnumbered section titles

It is designed for **research papers, reports, or multi-chapter PDFs** where you need to assemble a subset of content into a new document while preserving readability and navigation.

---
## Key Features

- **Section-based extraction:** Extract sections (e.g., `1.3`, `5`) and all subsections automatically.
- **Page-range extraction:** Extract explicit page ranges using a special syntax (e.g., `@10-20`).
- **Header/footer cropping:** Automatically removes unwanted headers and footers while maintaining consistent page sizes across multiple PDFs.
- **Flat Table of Contents:** TOC shows only the section titles from the original PDFs; numbering and indentation are stripped.
- **Bookmarks:** Optional PDF bookmarks that match the TOC entries for easy navigation.
- **Multi-source aggregation:** Accepts multiple PDFs and merges the specified sections or page ranges into a single output file.
- **Command-line driven:** Fully configurable via CLI arguments and compatible with debugging in VS Code.

---
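The flat, unnumbered TOC in the feature list comes down to stripping the leading numbering from each extracted heading. A minimal sketch of the regex approach used by `strip_numbering` in `pdfaggregator.py`:

```python
import re

def strip_numbering(title):
    """Remove leading numbering: '1.3 Background' -> 'Background'."""
    return re.sub(r'^\d+(\.\d+)*\s+', '', title)

print(strip_numbering("1.3 Background"))   # Background
print(strip_numbering("5 Conclusion"))     # Conclusion
print(strip_numbering("Appendix A"))       # unchanged: Appendix A
```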
## Design Choices

1. **Separation of concerns**
   - `pdfaggregator.py`: All core PDF handling functions (section extraction, cropping, TOC creation).
   - `driver.py`: CLI entry point that parses arguments, validates inputs, and orchestrates aggregation.
   - `tests.py`: Unit tests for parsing, extraction, TOC, and page handling.

2. **Level-aware section matching**
   - Section numbers (e.g., `5`) match only top-level chapters, not subsections like `5.1`.
   - Page ranges (prefixed with `@`) are treated explicitly and bypass section matching.

3. **Argument parsing**
   - Uses Python's standard `argparse` for CLI argument handling.
   - The input format supports multiple PDFs and multiple sections per file, e.g.:

     ```
     -i input.pdf:5,@50-60 input2.pdf:3.2
     ```

4. **Debug-friendly and testable**
   - CLI parsing is separated from `main()` for easier unit testing.
   - Compatible with VS Code debugging (`launch.json` with `${file}` for portable paths).

5. **TOC and bookmark handling**
   - TOC entries reflect **section headings only**; numbering is stripped.
   - Bookmarks are linked to the corresponding pages in the final PDF.
   - Cropping and page normalization ensure a consistent layout.

6. **Unit tests**
   - Core functions, including `parse_inputs`, page extraction, cropping, and TOC creation, are covered.
   - Uses `unittest` and avoids direct CLI dependency for testability.

---
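The `@` page-range syntax from design point 2 can be sketched as a small parser (mirroring `parse_page_range` in `pdfaggregator.py`): ranges on the CLI are 1-based and inclusive, while the indices returned are 0-based.

```python
def parse_page_range(entry):
    """Return zero-based page indices for an '@start-end' entry, else None."""
    if not entry.startswith("@"):
        return None  # treated as a section prefix instead
    try:
        start, end = entry[1:].split("-")
        # 1-based inclusive on the CLI -> 0-based half-open internally
        return list(range(int(start) - 1, int(end)))
    except ValueError:
        return None

print(parse_page_range("@10-20"))  # indices 9 through 19
print(parse_page_range("3.2"))     # None
```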
## How it Works (High-Level Flow)

1. **Parse CLI inputs**
   - Extract PDF file paths and associated sections or page ranges.
   - Normalize section references (e.g., `@10-20`).

2. **Iterate through PDFs**
   - Load each PDF using `pypdf`.
   - Build an outline tree (hierarchical sections) if available.

3. **Process sections/page ranges**
   - For each requested section:
     - Find the matching section heading in the outline (level-aware).
     - Determine start and end pages, including all subsections.
   - For page ranges prefixed with `@`, extract the exact pages.

4. **Crop pages**
   - Remove headers and footers based on configurable ratios.
   - Normalize page size across all PDFs for visual consistency.

5. **Assemble final PDF**
   - Add a Table of Contents as the first pages.
   - Merge all extracted pages.
   - Add bookmarks corresponding to TOC entries.

6. **Write output**
   - Save the aggregated PDF to the specified output path.
   - TOC shows **titles only**, with a flat structure and correct page references.

---
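The cropping in step 4 is pure arithmetic on the page's media box: the bottom and top edges each move inward by a fixed fraction of the page height. A sketch with a hypothetical `crop_bounds` helper (the real work happens in `crop_page` in `pdfaggregator.py`):

```python
def crop_bounds(lly, ury, top_ratio=0.12, bottom_ratio=0.06):
    """Return (new_lly, new_ury) after trimming fractions of the height."""
    height = ury - lly
    return lly + height * bottom_ratio, ury - height * top_ratio

# A US-Letter page is 792 points tall: trims ~47.5pt off the
# bottom and ~95pt off the top, leaving the body text intact.
print(crop_bounds(0, 792))
```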
    +---------------------+
    |  CLI / Inputs       |
    |  - PDF files        |
    |  - Sections/@pages  |
    +----------+----------+
               |
               v
    +---------------------+
    |  parse_inputs()     |
    |  Normalize sections |
    |  Detect page ranges |
    +----------+----------+
               |
               v
    +---------------------+
    |  Iterate PDFs       |
    |  Load PDF with      |
    |  pypdf              |
    +----------+----------+
               |
        +------+---------------+
        |                      |
        v                      v
    +-------------------+  +------------------+
    | Section Extraction|  | Page Range (@)   |
    | - Match outline   |  | - Extract pages  |
    | - Level-aware     |  |   directly       |
    | - Subsections     |  +------------------+
    +--------+----------+
             |
             v
    +----------------------+
    | Crop Pages           |
    | - Remove headers     |
    | - Remove footers     |
    | - Normalize size     |
    +----------+-----------+
               |
               v
    +----------------------+
    | TOC & Bookmarks      |
    | - Extracted section  |
    |   titles only        |
    | - Flat TOC structure |
    | - Bookmarks linked   |
    +----------+-----------+
               |
               v
    +----------------------+
    | Merge PDF Pages      |
    | - TOC pages first    |
    | - All extracted pages|
    +----------+-----------+
               |
               v
    +----------------------+
    | Output PDF           |
    | - TOC & bookmarks    |
    | - Cropped, normalized|
    +----------------------+

---
## Example Usage

```bash
python driver.py \
    -o aggregated.pdf \
    -i input.pdf:5,@50-60 input2.pdf:3.2
```
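Each `-i` argument bundles a file path with its extraction entries. A sketch of how one argument is split, using a hypothetical `parse_input_spec` helper that mirrors the per-item logic of `parse_inputs` in `pdfaggregator.py`:

```python
def parse_input_spec(item):
    """Split 'file.pdf:5,@50-60' into a file path and its section entries."""
    if ":" not in item:
        raise ValueError(
            f"Invalid input '{item}'. Expected format: file.pdf:section1,section2")
    file_path, sections = item.split(":", 1)
    return {"file": file_path,
            "sections": [s.strip() for s in sections.split(",") if s.strip()]}

print(parse_input_spec("input.pdf:5,@50-60"))
# {'file': 'input.pdf', 'sections': ['5', '@50-60']}
```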
python/pdfcreator/driver.py (new file, 99 lines)
# driver.py
import argparse

from pypdf import PdfReader, PdfWriter

from pdfaggregator import (build_outline_tree, create_toc_pdf,
                           extract_page_range, extract_section_prefix,
                           parse_inputs, parse_page_range)

# Default inputs, useful when debugging without CLI arguments
PDF_INPUTS = [
    {"file": "pdfcreator/input.pdf", "sections": ["5", "@20-30"]},
    # {"file": "pdfcreator/input2.pdf",
    #  "sections": ["3.1"]}
]

OUTPUT_PDF = "pdfcreator/extracted_sections.pdf"

# crop ratios
HEADER_CROP = 0.1
FOOTER_CROP = 0.03


def main(pdf_inputs, output_pdf):
    content_writer = PdfWriter()
    toc_entries = []
    current_page = 0

    for pdf_info in pdf_inputs:
        REFERENCE_BOX = None
        reader = PdfReader(pdf_info["file"])
        outline_tree = build_outline_tree(reader)

        for entry in pdf_info["sections"]:
            page_indices = parse_page_range(entry)

            if page_indices:
                # Explicit page range
                current_page, REFERENCE_BOX, toc_entry = extract_page_range(
                    entry, reader, content_writer, current_page, REFERENCE_BOX,
                    HEADER_CROP, FOOTER_CROP
                )
            else:
                # Section prefix
                current_page, REFERENCE_BOX, toc_entry = extract_section_prefix(
                    entry, reader, content_writer, current_page, REFERENCE_BOX,
                    outline_tree, HEADER_CROP, FOOTER_CROP
                )

            if toc_entry:
                toc_entries.append(toc_entry)

    # Create TOC PDF
    toc_pdf = create_toc_pdf(toc_entries)
    toc_page_count = len(toc_pdf.pages)

    final_writer = PdfWriter()
    # add TOC pages
    for page in toc_pdf.pages:
        final_writer.add_page(page)
    # add content pages
    for page in content_writer.pages:
        final_writer.add_page(page)

    # Add bookmarks
    bookmark_stack = {}
    for entry in toc_entries:
        parent = bookmark_stack.get(entry["level"] - 1)
        bm = final_writer.add_outline_item(
            title=entry["title"],
            page_number=(entry["page"] - 1) + toc_page_count,
            parent=parent
        )
        bookmark_stack[entry["level"]] = bm

    with open(output_pdf, "wb") as f:
        final_writer.write(f)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Extract sections and page ranges from PDFs and aggregate into a new PDF with TOC and bookmarks."
    )

    parser.add_argument(
        "-o", "--output",
        required=True,
        help="Output PDF file"
    )

    parser.add_argument(
        "-i", "--inputs",
        nargs="+",
        required=True,  # parse_inputs iterates over this list
        help="Inputs in the form file.pdf:section1,section2 or file.pdf:@1-10"
    )

    args = parser.parse_args()
    pdf_inputs = parse_inputs(args)

    main(pdf_inputs, args.output)
Binary file not shown.
@@ -1,23 +1,50 @@
-from pypdf import PdfReader
-from io import BytesIO
-from reportlab.lib.pagesizes import LETTER
+import re
+from pypdf import PdfReader, PdfWriter
+
 from reportlab.pdfgen import canvas
-from pypdf import PdfReader, PdfWriter, PageObject
+from reportlab.lib.pagesizes import LETTER
+from io import BytesIO
 
-# ----------- CONFIG -------------
+# ================= CONFIG =================
 
 PDF_INPUTS = [
-    {"file": "pdfcreator/input.pdf", "sections": ["1", "2.2", "3"]},
-    {"file": "pdfcreator/input2.pdf", "sections": ["3", "4"]},
+    {"file": "pdfcreator/input.pdf", "sections": ["1.3", "2.1", "@111-114"]},
+    {"file": "pdfcreator/input2.pdf", "sections": ["3.2"]},
 ]
-OUTPUT_PDF = "pdfcreator/combined_sections.pdf"
 
-# Cropping ratios
-HEADER_CROP = 0.1  # top of first page of section
-FOOTER_CROP = 0.0  # bottom of pages
-# --------------------------------
+OUTPUT_PDF = "pdfcreator/extracted_sections.pdf"
 
-# ----- HELPER FUNCTIONS ---------
+HEADER_CROP = 0.12  # top of first page of section
+FOOTER_CROP = 0.06  # bottom of all pages
+
+# =========================================
+
+
+def strip_numbering(title):
+    """
+    Remove leading numbering from a string like '1.3 Background'.
+    Returns 'Background'.
+    """
+    return re.sub(r'^\d+(\.\d+)*\s+', '', title)
+
+
+# ---------- Outline utilities ------------
+def parse_page_range(entry):
+    """
+    Returns a list of zero-based page indices if entry is a page range.
+    Page ranges must be prefixed with '@', e.g., "@1-10".
+    Otherwise returns None (treated as section prefix).
+    """
+    if entry.startswith("@"):
+        s = entry[1:]  # remove the @
+        try:
+            start, end = s.split("-")
+            start = int(start) - 1  # zero-based
+            end = int(end)  # inclusive in range
+            return list(range(start, end))
+        except ValueError:
+            print(f"[WARN] Invalid page range: {entry}")
+            return None
+    return None  # not a page range
 
 def build_outline_tree(reader):
@@ -36,14 +63,14 @@ def build_outline_tree(reader):
     return _build(reader.outline)
 
 
-def find_section(nodes, title):
+def find_section_with_level(nodes, prefix, level=0):
     for node in nodes:
-        if node["title"] == title or node["title"].startswith(title + " "):
-            return node
-        found = find_section(node["children"], title)
-        if found:
+        if node["title"].startswith(prefix):
+            return node, level
+        found = find_section_with_level(node["children"], prefix, level + 1)
+        if found[0]:
             return found
-    return None
+    return None, None
 
 
 def collect_subtree_pages(node, pages=None):
@@ -66,138 +93,178 @@ def flatten_outline_pages(nodes, pages=None):
 
 def find_end_page(target_node, outline_tree, total_pages):
     subtree_pages = collect_subtree_pages(target_node)
-    last_section_page = max(subtree_pages)
-    all_outline_pages = flatten_outline_pages(outline_tree)
-    all_outline_pages = sorted(set(all_outline_pages))
-    for page in all_outline_pages:
-        if page > last_section_page:
-            return page
+    last_page = max(subtree_pages)
+    all_pages = sorted(set(flatten_outline_pages(outline_tree)))
+    for p in all_pages:
+        if p > last_page:
+            return p
     return total_pages
 
 
+# ---------- Page manipulation ------------
 def crop_page(page, top_ratio=0.0, bottom_ratio=0.0):
     llx, lly, urx, ury = page.mediabox
     height = ury - lly
 
     new_lly = lly + height * bottom_ratio
     new_ury = ury - height * top_ratio
-    if new_ury <= new_lly:
-        raise ValueError("Invalid crop ratios: page height would be negative")
     page.cropbox.lower_left = (llx, new_lly)
     page.cropbox.upper_right = (urx, new_ury)
 
 
-def normalize_page_size(page, reference_box):
-    """
-    Force page MediaBox and CropBox to match reference.
-    """
-    page.mediabox.lower_left = reference_box.lower_left
-    page.mediabox.upper_right = reference_box.upper_right
-
-    page.cropbox.lower_left = reference_box.lower_left
-    page.cropbox.upper_right = reference_box.upper_right
-
-# --------------------------------
+# ---------- TOC generation ---------------
+def create_toc_pdf(toc_entries, heading):
+    buffer = BytesIO()
+    c = canvas.Canvas(buffer, pagesize=LETTER)
+
+    c.setFont("Helvetica-Bold", 16)
+    c.drawString(50, 750, heading)
+
+    c.setFont("Helvetica", 12)
+    y = 720
+
+    for entry in toc_entries:
+        line = f"{strip_numbering(entry['title'])} ........................ {entry['page']}"
+        c.drawString(50, y, line)  # flat: no indentation
+        y -= 18
+
+        if y < 50:
+            c.showPage()
+            c.setFont("Helvetica", 12)
+            y = 750
+
+    c.save()
+    buffer.seek(0)
+    return PdfReader(buffer)
 
 
-# --------- MAIN PROCESS ----------
-writer = PdfWriter()
-toc_entries = []  # To build TOC later
-current_page_index = 0
+# ================= MAIN ===================
+content_writer = PdfWriter()
+toc_entries = []
+current_page = 0
+REFERENCE_BOX = None
 
 for pdf_info in PDF_INPUTS:
-    file_path = pdf_info["file"]
-    sections_to_extract = pdf_info["sections"]
-
-    reader = PdfReader(file_path)
+    reader = PdfReader(pdf_info["file"])
     outline_tree = build_outline_tree(reader)
     total_pages = len(reader.pages)
 
-    for section_title in sections_to_extract:
-        target = find_section(outline_tree, section_title)
-        if not target:
-            print(f"[WARN] Section '{section_title}' not found in {file_path}")
-            continue
+    for entry in pdf_info["sections"]:
+        page_indices = parse_page_range(entry)
 
-        start_page = target["page"]
-        end_page = find_end_page(target, outline_tree, total_pages)
+        if page_indices:
+            # --- Explicit page range ---
+            toc_entries.append({
+                "title": f"Pages {entry[1:]}",  # remove '@' for display
+                "page": current_page + 1,
+                "level": 0
+            })
 
-        REFERENCE_BOX = None
-        # Add pages to combined PDF
-        for i, p in enumerate(range(start_page, end_page)):
-            page = reader.pages[p]
+            for i, p in enumerate(page_indices):
+                if p < 0 or p >= total_pages:
+                    print(
+                        f"[WARN] Page {p+1} out of range in {pdf_info['file']}")
+                    continue
+                page = reader.pages[p]
 
-            # Crop first page header+footer
-            if i == 0:
                 crop_page(page, top_ratio=HEADER_CROP,
                           bottom_ratio=FOOTER_CROP)
-            else:
-                crop_page(page, top_ratio=HEADER_CROP,
-                          bottom_ratio=FOOTER_CROP)
-                # crop_page(page, bottom_ratio=FOOTER_CROP)
+                if REFERENCE_BOX is None:
+                    REFERENCE_BOX = (
+                        page.cropbox.lower_left,
+                        page.cropbox.upper_right
+                    )
+                page.mediabox.lower_left = REFERENCE_BOX[0]
+                page.mediabox.upper_right = REFERENCE_BOX[1]
+                page.cropbox.lower_left = REFERENCE_BOX[0]
+                page.cropbox.upper_right = REFERENCE_BOX[1]
 
-            if REFERENCE_BOX is None:
-                # Make a copy, not a reference
-                REFERENCE_BOX = (
-                    page.cropbox.lower_left,
-                    page.cropbox.upper_right
-                )
-            # Step 3: Normalize page size
-            page.mediabox.lower_left = REFERENCE_BOX[0]
-            page.mediabox.upper_right = REFERENCE_BOX[1]
-            page.cropbox.lower_left = REFERENCE_BOX[0]
-            page.cropbox.upper_right = REFERENCE_BOX[1]
+                content_writer.add_page(page)
+                current_page += 1
+        else:
 
-            writer.add_page(page)
+            target, level = find_section_with_level(
+                outline_tree, entry)
+            if not target:
+                print(
+                    f"[WARN] Section {entry} not found in {pdf_info['file']}")
+                continue
 
-        # Track TOC
-        toc_entries.append({
-            "title": f"{section_title} ({file_path})",
-            "page": current_page_index + 1  # 1-based page number
-        })
-        current_page_index += (end_page - start_page)
+            start_page = target["page"]
+            end_page = find_end_page(target, outline_tree, total_pages)
 
-# --------- ADD TOC PAGE(S) ----------
+            toc_entries.append({
+                "title": target["title"],  # EXACT heading text
+                "page": current_page + 1,  # 1-based
+                "level": level
+            })
 
+            for i, p in enumerate(range(start_page, end_page)):
+                page = reader.pages[p]
 
-def create_toc_pdf(toc_entries):
-    packet = BytesIO()
-    c = canvas.Canvas(packet, pagesize=LETTER)
-    c.setFont("Helvetica-Bold", 16)
-    c.drawString(50, 750, "Table of Contents")
-    c.setFont("Helvetica", 12)
-    y = 720
-    for entry in toc_entries:
-        text = f"{entry['title']} .... {entry['page']}"
-        c.drawString(50, y, text)
-        y -= 20
-        if y < 50:
-            c.showPage()
-            y = 750
-    c.save()
-    packet.seek(0)
-    return PdfReader(packet)
+                if i == 0:
+                    crop_page(page, HEADER_CROP, FOOTER_CROP)
+                else:
+                    crop_page(page, bottom_ratio=FOOTER_CROP)
 
+                # Capture reference AFTER cropping
+                if REFERENCE_BOX is None:
+                    REFERENCE_BOX = (
+                        page.cropbox.lower_left,
+                        page.cropbox.upper_right
+                    )
 
-toc_pdf = create_toc_pdf(toc_entries)
+                # Normalize page size
+                page.mediabox.lower_left = REFERENCE_BOX[0]
+                page.mediabox.upper_right = REFERENCE_BOX[1]
+                page.cropbox.lower_left = REFERENCE_BOX[0]
+                page.cropbox.upper_right = REFERENCE_BOX[1]
 
-# Combine TOC + extracted sections
+                content_writer.add_page(page)
+                current_page += 1
+
+
+# ---------- Build final PDF ---------------
 final_writer = PdfWriter()
 
-# TOC first
+# Derive TOC heading from first source document
+first_reader = PdfReader(PDF_INPUTS[0]["file"])
+toc_heading = "Contents" if first_reader.outline else "Table of Contents"
+
+# Visible TOC pages
+toc_pdf = create_toc_pdf(toc_entries, toc_heading)
+toc_page_count = len(toc_pdf.pages)
+
 for page in toc_pdf.pages:
     final_writer.add_page(page)
 
-# Then extracted content
-for page in writer.pages:
+# Content pages
+for page in content_writer.pages:
     final_writer.add_page(page)
 
-# Save
+bookmark_stack = {}
+
+for entry in toc_entries:
+    parent = bookmark_stack.get(entry["level"] - 1)
+
+    bm = final_writer.add_outline_item(
+        title=entry["title"],  # exact heading text
+        page_number=(entry["page"] - 1) + toc_page_count,
+        parent=parent
+    )
+
+    bookmark_stack[entry["level"]] = bm
+
+
+# ---------- Write output ------------------
 with open(OUTPUT_PDF, "wb") as f:
     final_writer.write(f)
 
-# --------- WRITE OUTPUT -----------
-with open(OUTPUT_PDF, "wb") as f:
-    final_writer.write(f)
-
-print(f"[INFO] Combined PDF written to {OUTPUT_PDF} with TOC.")
+print(f"[OK] Created {OUTPUT_PDF}")
python/pdfcreator/pdfaggregator.py (new file, 231 lines)
|
|||||||
|
# pdfaggregator.py
|
||||||
|
import re
|
||||||
|
from io import BytesIO
|
||||||
|
from pypdf import PdfReader, PdfWriter, PageObject
|
||||||
|
from reportlab.pdfgen import canvas
|
||||||
|
from reportlab.lib.pagesizes import LETTER
|
||||||
|
|
||||||
|
# -----------------------------
|
||||||
|
# Parsing / Section Utilities
|
||||||
|
# -----------------------------
|
||||||
|
|
||||||
|
|
||||||
|
def parse_page_range(entry):
|
||||||
|
"""Return list of zero-based page indices if entry is a page range (@1-10)."""
|
||||||
|
if entry.startswith("@"):
|
||||||
|
s = entry[1:]
|
||||||
|
try:
|
||||||
|
start, end = s.split("-")
|
||||||
|
start = int(start) - 1
|
||||||
|
end = int(end)
|
||||||
|
return list(range(start, end))
|
||||||
|
except ValueError:
|
||||||
|
print(f"[WARN] Invalid page range: {entry}")
|
||||||
|
return None
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def strip_numbering(title):
|
||||||
|
"""Remove leading numbering like '1.3 Background' -> 'Background'"""
|
||||||
|
return re.sub(r'^\d+(\.\d+)*\s+', '', title)
|
||||||
|
|
||||||
|
|
||||||
|
def crop_page(page, top_ratio=0.0, bottom_ratio=0.0):
|
||||||
|
"""Crop the top/bottom of a page using ratios."""
|
||||||
|
llx, lly, urx, ury = page.mediabox
|
||||||
|
height = ury - lly
|
||||||
|
new_lly = lly + height * bottom_ratio
|
||||||
|
new_ury = ury - height * top_ratio
|
||||||
|
page.cropbox.lower_left = (llx, new_lly)
|
||||||
|
page.cropbox.upper_right = (urx, new_ury)
|
||||||
|
|
||||||
|
# -----------------------------
|
||||||
|
# Outline / Section Tree
|
||||||
|
# -----------------------------
|
||||||
|
|
||||||
|
|
||||||
|
def find_section_with_level(nodes, prefix, level=0):
|
||||||
|
"""Find a section node by prefix in outline tree."""
|
||||||
|
for node in nodes:
|
||||||
|
if node["title"].startswith(prefix):
|
||||||
|
return node, level
|
||||||
|
found = find_section_with_level(
|
||||||
|
node.get("children", []), prefix, level + 1)
|
||||||
|
if found[0]:
|
||||||
|
return found
|
||||||
|
return None, None
|
||||||
|
|
||||||
|
|
||||||
|
def collect_subtree_pages(node, pages=None):
|
||||||
|
"""Recursively collect pages of node and all its children."""
|
||||||
|
if pages is None:
|
||||||
|
pages = []
|
||||||
|
pages.append(node["page"])
|
||||||
|
for child in node.get("children", []):
|
||||||
|
collect_subtree_pages(child, pages)
|
||||||
|
return pages
|
||||||
|
|
||||||
|
|
||||||
|
def flatten_outline_pages(nodes, pages=None):
|
||||||
|
"""Flatten all pages from the outline tree."""
|
||||||
|
if pages is None:
|
||||||
|
pages = []
|
||||||
|
for node in nodes:
|
||||||
|
pages.append(node["page"])
|
||||||
|
flatten_outline_pages(node.get("children", []), pages)
|
||||||
|
return pages
|
||||||
|
|
||||||
|
|
||||||
|
def find_end_page(target_node, outline_tree, total_pages):
|
||||||
|
"""Find the last page of a section including its subsections."""
|
||||||
|
subtree_pages = collect_subtree_pages(target_node)
|
||||||
|
last_page = max(subtree_pages)
|
||||||
|
all_pages = sorted(set(flatten_outline_pages(outline_tree)))
|
||||||
|
for p in all_pages:
|
||||||
|
if p > last_page:
|
||||||
|
return p
|
||||||
|
return total_pages
|
||||||
|
|
||||||
|
|
||||||
|
def build_outline_tree(reader):
|
||||||
|
"""
|
||||||
|
Build a normalized outline tree from pypdf's reader.outline.
|
||||||
|
|
||||||
|
Each node:
|
||||||
|
{
|
||||||
|
"title": str,
|
||||||
|
"page": int,
|
||||||
|
"children": [ ... ]
|
||||||
|
}
|
||||||
|
"""
|
||||||
|
def walk(items):
|
||||||
|
tree = []
|
||||||
|
for item in items:
|
||||||
|
if isinstance(item, list):
|
||||||
|
# children of previous item
|
||||||
|
if tree:
|
||||||
|
tree[-1]["children"] = walk(item)
|
||||||
|
else:
|
||||||
|
tree.append({
|
||||||
|
"title": item.title.strip(),
|
||||||
|
"page": reader.get_destination_page_number(item),
|
||||||
|
"children": []
|
||||||
|
})
|
||||||
|
return tree
|
||||||
|
|
||||||
|
try:
|
||||||
|
outline = reader.outline
|
||||||
|
except Exception:
|
||||||
|
return []
|
||||||
|
|
||||||
|
if not outline:
|
||||||
|
return []
|
||||||
|
|
||||||
|
return walk(outline)
|
||||||
|
|
||||||
|
|
||||||
|
# -----------------------------
|
||||||
|
# TOC Generation
|
||||||
|
# -----------------------------
|
||||||
|
|
||||||
|
|
||||||
|
def create_toc_pdf(toc_entries, heading="Table of Contents"):
|
||||||
|
"""Generate a flat, unnumbered TOC PDF page in memory."""
|
||||||
|
buffer = BytesIO()
|
||||||
|
c = canvas.Canvas(buffer, pagesize=LETTER)
|
||||||
|
|
||||||
|
c.setFont("Helvetica-Bold", 16)
|
||||||
|
c.drawString(50, 750, heading)
|
||||||
|
|
||||||
|
c.setFont("Helvetica", 12)
|
||||||
|
y = 720
|
||||||
|
|
||||||
|
for entry in toc_entries:
|
||||||
|
line = f"{strip_numbering(entry['title'])} ........................ {entry['page']}"
|
||||||
|
c.drawString(50, y, line)
|
||||||
|
y -= 18
|
||||||
|
if y < 50:
|
||||||
|
c.showPage()
|
||||||
|
c.setFont("Helvetica", 12)
|
||||||
|
y = 750
|
||||||
|
|
||||||
|
c.save()
|
||||||
|
buffer.seek(0)
|
||||||
|
return PdfReader(buffer)
|
||||||
|
|
||||||
|
|
||||||
|
def extract_page_range(entry, reader, content_writer, current_page, REFERENCE_BOX, header_crop=0.05, footer_crop=0.03):
|
||||||
|
"""Extract pages from an explicit @page-range entry."""
|
||||||
|
page_indices = parse_page_range(entry)
|
||||||
|
if not page_indices:
|
||||||
|
return current_page, REFERENCE_BOX, None # nothing extracted
|
||||||
|
|
||||||
|
toc_entry = {"title": f"Pages {entry[1:]}",
|
||||||
|
"page": current_page + 1, "level": 0}
|
||||||
|
|
||||||
|
for i, p in enumerate(page_indices):
|
||||||
|
if p < 0 or p >= len(reader.pages):
|
||||||
|
continue
|
||||||
|
page = reader.pages[p]
|
||||||
|
crop_page(page, header_crop, footer_crop)
|
||||||
|
if REFERENCE_BOX is None:
|
||||||
|
REFERENCE_BOX = (page.cropbox.lower_left, page.cropbox.upper_right)
|
||||||
|
page.mediabox.lower_left = REFERENCE_BOX[0]
|
||||||
|
page.mediabox.upper_right = REFERENCE_BOX[1]
|
||||||
|
page.cropbox.lower_left = REFERENCE_BOX[0]
|
||||||
|
page.cropbox.upper_right = REFERENCE_BOX[1]
|
||||||
|
content_writer.add_page(page)
|
||||||
|
current_page += 1
|
||||||
|
|
||||||
|
return current_page, REFERENCE_BOX, toc_entry
|
||||||
|
|
||||||
|
|
||||||
|
def extract_section_prefix(entry, reader, content_writer, current_page, REFERENCE_BOX, outline_tree, header_crop=0.05, footer_crop=0.03):
    """Extract the pages of a section (and its subsections) found in the PDF outline."""
    target, level = find_section_with_level(outline_tree, entry)
    if not target:
        source = reader.stream.name if hasattr(reader.stream, "name") else ""
        print(f"[WARN] Section {entry} not found in PDF {source}")
        return current_page, REFERENCE_BOX, None

    start_page = target["page"]
    end_page = find_end_page(target, outline_tree, len(reader.pages))
    toc_entry = {"title": target["title"],
                 "page": current_page + 1, "level": level}

    for i, p in enumerate(range(start_page, end_page)):
        page = reader.pages[p]
        # Crop the header only on the section's first page; later pages keep it
        crop_page(page, header_crop if i == 0 else 0, footer_crop)
        if REFERENCE_BOX is None:
            REFERENCE_BOX = (page.cropbox.lower_left, page.cropbox.upper_right)
        page.mediabox.lower_left = REFERENCE_BOX[0]
        page.mediabox.upper_right = REFERENCE_BOX[1]
        page.cropbox.lower_left = REFERENCE_BOX[0]
        page.cropbox.upper_right = REFERENCE_BOX[1]
        content_writer.add_page(page)
        current_page += 1

    return current_page, REFERENCE_BOX, toc_entry

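`find_end_page` is also defined elsewhere; from its call signature above and the test fixture (a section with no following sibling runs to the end of the document), one plausible sketch is that a section ends where the next outline entry at the same or a shallower level begins. The helper `_walk` and this exact rule are assumptions, not the author's confirmed implementation:

```python
def _walk(nodes, level=0):
    # Yield (node, level) pairs in document order, depth-first
    for node in nodes:
        yield node, level
        yield from _walk(node.get("children", []), level + 1)


def find_end_page(target, outline_tree, total_pages):
    # Sketch only: the section ends at the next same-or-shallower-level
    # outline entry, or at the end of the document if there is none.
    flat = list(_walk(outline_tree))
    target_level = next(lvl for node, lvl in flat if node is target)
    seen = False
    for node, lvl in flat:
        if node is target:
            seen = True
            continue
        if seen and lvl <= target_level:
            return node["page"]
    return total_pages
```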
def parse_inputs(args):
    """Parse CLI positional arguments into the PDF_INPUTS structure."""
    pdf_inputs = []

    for item in args.inputs:
        if ":" not in item:
            raise ValueError(
                f"Invalid input '{item}'. Expected format: file.pdf:section1,section2"
            )

        file_path, sections = item.split(":", 1)
        section_list = [s.strip() for s in sections.split(",") if s.strip()]

        pdf_inputs.append({
            "file": file_path,
            "sections": section_list
        })

    return pdf_inputs
6	python/pdfcreator/terminal.sh	Executable file
@@ -0,0 +1,6 @@
#!/bin/bash
export CT2_CUDA_ALLOW_FP16=1

# 'mamba run' executes the command within the context of the environment
# without needing to source .bashrc or shell hooks manually.
mamba run -n base python ~/family-repo/Code/python/pdfcreator/driver.py "$@"
128	python/pdfcreator/tests.py	Normal file
@@ -0,0 +1,128 @@
import unittest
from types import SimpleNamespace

from pypdf import PdfWriter, PageObject

from pdfaggregator import (parse_inputs, strip_numbering, crop_page,
                           extract_page_range, extract_section_prefix,
                           parse_page_range, find_section_with_level,
                           find_end_page)


class TestPdfExtractionFunctions(unittest.TestCase):
    def setUp(self):
        # Dummy PDF with 5 blank pages
        self.writer = PdfWriter()
        for _ in range(5):
            self.writer.add_page(
                PageObject.create_blank_page(width=600, height=800))
        # A PdfWriter exposes .pages, so it can stand in for a reader here
        self.reader = self.writer
        self.content_writer = PdfWriter()
        self.outline_tree = [{"title": "Section1", "page": 0, "children": [
            {"title": "Section1.1", "page": 1, "children": []}]}]

    def test_extract_page_range(self):
        current_page, REFERENCE_BOX, toc = extract_page_range(
            "@1-3", self.reader, self.content_writer, 0, None)
        # "@1-3" is inclusive: pages 1-3, i.e. indices 0-2
        self.assertEqual(len(self.content_writer.pages), 3)
        self.assertEqual(toc["title"], "Pages 1-3")
        self.assertEqual(current_page, 3)

    def test_extract_section_prefix(self):
        current_page, REFERENCE_BOX, toc = extract_section_prefix(
            "Section1", self.reader, self.content_writer, 0, None, self.outline_tree)
        # Section1 has no following sibling, so it runs to the end of the document
        self.assertEqual(len(self.content_writer.pages), 5)
        self.assertEqual(toc["title"], "Section1")
        self.assertEqual(current_page, 5)

class TestPdfAggregator(unittest.TestCase):

    def test_parse_page_range(self):
        self.assertEqual(parse_page_range("@1-5"), [0, 1, 2, 3, 4])
        self.assertEqual(parse_page_range("@10-12"), [9, 10, 11])
        self.assertIsNone(parse_page_range("1.3"))
        self.assertIsNone(parse_page_range("Introduction-Overview"))

    def test_strip_numbering(self):
        self.assertEqual(strip_numbering("1.3 Background"), "Background")
        self.assertEqual(strip_numbering("2.1.5 Experimental Setup"),
                         "Experimental Setup")
        self.assertEqual(strip_numbering("NoNumberingHere"), "NoNumberingHere")

    def test_crop_page(self):
        page = PageObject.create_blank_page(width=600, height=800)
        crop_page(page, top_ratio=0.1, bottom_ratio=0.05)
        llx, lly = page.cropbox.lower_left
        urx, ury = page.cropbox.upper_right
        # 10% trimmed from the top and 5% from the bottom leaves 85% of the height
        self.assertAlmostEqual(ury - lly, 800 * 0.85)

class TestParseInputs(unittest.TestCase):

    def test_single_pdf_single_section(self):
        args = SimpleNamespace(inputs=["doc1.pdf:1.3"])

        result = parse_inputs(args)

        self.assertEqual(result, [
            {
                "file": "doc1.pdf",
                "sections": ["1.3"]
            }
        ])

    def test_single_pdf_multiple_sections(self):
        args = SimpleNamespace(inputs=["doc1.pdf:1.3,2.1,@10-20"])

        result = parse_inputs(args)

        self.assertEqual(result, [
            {
                "file": "doc1.pdf",
                "sections": ["1.3", "2.1", "@10-20"]
            }
        ])

    def test_multiple_pdfs(self):
        args = SimpleNamespace(inputs=[
            "doc1.pdf:1.3,@5-10",
            "doc2.pdf:Introduction,3.2"
        ])

        result = parse_inputs(args)

        self.assertEqual(result, [
            {
                "file": "doc1.pdf",
                "sections": ["1.3", "@5-10"]
            },
            {
                "file": "doc2.pdf",
                "sections": ["Introduction", "3.2"]
            }
        ])

    def test_whitespace_is_trimmed(self):
        args = SimpleNamespace(inputs=["doc1.pdf: 1.3 , @5-10 , Introduction "])

        result = parse_inputs(args)

        self.assertEqual(result[0]["sections"],
                         ["1.3", "@5-10", "Introduction"])

    def test_missing_colon_raises_error(self):
        args = SimpleNamespace(inputs=["doc1.pdf"])

        with self.assertRaises(ValueError):
            parse_inputs(args)


if __name__ == "__main__":
    unittest.main()
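`strip_numbering` is exercised by the tests above but defined elsewhere in `pdfaggregator`; its expected behavior (drop a leading dotted section number such as `1.3 ` or `2.1.5 `, leave other titles untouched) suggests a one-line regex, sketched here under that assumption:

```python
import re


def strip_numbering(title):
    # Sketch only: remove a leading dotted section number ("1.3 ", "2.1.5 ")
    # so TOC entries show unnumbered titles, per the README.
    return re.sub(r"^\d+(?:\.\d+)*\s+", "", title)
```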