Text extraction
Using pdfspine's get_text family, page search, the reusable TextPage handle, and table detection to pull text and structure out of a PDF.
pdfspine implements PyMuPDF's full get_text family, page search, the reusable
TextPage handle, and table detection.
get_text variants
Page.get_text(option="text", *, clip=None, flags=None, textpage=None, sort=False)
returns a different native object depending on option:
| option | Returns | Description |
|---|---|---|
| "text" | str | Plain text in reading order (the default). |
| "words" | list[tuple] | One tuple per word, with its bounding box. |
| "blocks" | list[tuple] | One tuple per text block, with its bounding box. |
| "dict" | dict | Nested blocks → lines → spans structure. |
| "rawdict" | dict | Like "dict", but down to per-character detail. |
| "json" | str | The "dict" structure serialized to JSON. |
| "rawjson" | str | The "rawdict" structure serialized to JSON. |
| "html" | str | HTML reconstruction of the page. |
| "xhtml" | str | XHTML reconstruction. |
| "xml" | str | Low-level XML with per-glyph geometry. |
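The nested "dict" result is the one most often post-processed, so here is a minimal sketch of walking it. It assumes the key names follow PyMuPDF's layout ("blocks", "lines", "spans", and per-span "text" and "size"); the page does not spell these out, and `iter_spans` and the `sample` data are hypothetical stand-ins for a real `page.get_text("dict")` result.

```python
from typing import Any, Iterator


def iter_spans(page_dict: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield every span from a "dict"-style result, in document order."""
    for block in page_dict.get("blocks", []):
        # Blocks without a "lines" key (e.g. image blocks) are skipped by .get().
        for line in block.get("lines", []):
            yield from line.get("spans", [])


# Hand-written stand-in for page.get_text("dict"); key names are assumed
# to match PyMuPDF's layout.
sample = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [{"text": "Invoice ", "size": 14.0},
                       {"text": "#1", "size": 14.0}]},
            {"spans": [{"text": "Total due", "size": 9.0}]},
        ]},
        {"type": 1},  # an image block: no "lines" key
    ]
}

# Collect all span text, then only the text set larger than 12 pt.
span_texts = [span["text"] for span in iter_spans(sample)]
large_text = "".join(s["text"] for s in iter_spans(sample) if s["size"] > 12)
```

Filtering on span attributes such as size is a common way to pick out headings, which the flat "text" output cannot support.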
page = doc[0]
text = page.get_text() # "text"
words = page.get_text("words")
blocks = page.get_text("blocks")
data = page.get_text("dict")
html = page.get_text("html")
xml = page.get_text("xml")
Options
- clip: a Rect (or 4-sequence) limiting extraction to a sub-region.
- sort: when True, orders blocks top-to-bottom, left-to-right by (y, x).
- flags: PyMuPDF text-extraction flag bits.
- textpage: reuse a previously built TextPage to avoid re-parsing the page.
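To make the `sort` ordering concrete, the sketch below sorts block-like tuples by `(y, x)` in plain Python. It illustrates the ordering rule only and is not pdfspine's implementation. It assumes each tuple starts with `(x0, y0, x1, y1, text)` as in PyMuPDF's "blocks" output; `sort_blocks` and the sample coordinates are hypothetical.

```python
def sort_blocks(blocks):
    """Order block tuples top-to-bottom, then left-to-right."""
    # Each tuple starts with (x0, y0, x1, y1, ...): key on y0 first, then x0.
    return sorted(blocks, key=lambda b: (b[1], b[0]))


# Made-up blocks in extraction order: the right column happens to come first.
sample_blocks = [
    (300.0, 72.0, 520.0, 90.0, "right column"),
    (72.0, 200.0, 280.0, 230.0, "footer"),
    (72.0, 72.0, 280.0, 90.0, "left column"),
]

ordered = [b[4] for b in sort_blocks(sample_blocks)]
```

This version compares coordinates exactly, so blocks whose tops differ by a fraction of a point are treated as separate rows; a real extractor may apply a tolerance.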
clip = pdfspine.Rect(0, 0, 300, 400)
snippet = page.get_text("text", clip=clip, sort=True)
There is also a document-level convenience:
text = doc.get_page_text(0, "text", sort=True)
Searching
Page.search_for(needle, *, hit_max=0, quads=False, clip=None, flags=None, textpage=None)
finds every occurrence of needle and returns its geometry:
rects = page.search_for("Total") # list[Rect]
quads = page.search_for("Total", quads=True) # list[Quad] (handles rotation)
# Cap the number of hits and restrict to a region:
hits = page.search_for("Total", hit_max=5, clip=pdfspine.Rect(0, 0, 595, 200))
Returning Quad geometry is the right choice when text may be rotated or
skewed; the four corner points describe the exact glyph quadrilateral.
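The difference between a quad and its enclosing rectangle can be shown with plain coordinate arithmetic. The sketch below models a quad as four `(x, y)` corner tuples instead of a pdfspine `Quad`, so it runs without the library; `quad_bounds`, `quad_area`, and the sample corners are hypothetical.

```python
def quad_bounds(corners):
    """Axis-aligned bounding box (x0, y0, x1, y1) of four corner points."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


def quad_area(corners):
    """Area of the quadrilateral via the shoelace formula.

    Corners must be given in perimeter order (e.g. ul, ur, lr, ll).
    """
    total = 0.0
    for i, (x0, y0) in enumerate(corners):
        x1, y1 = corners[(i + 1) % len(corners)]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


# A made-up text run tilted by 45 degrees, corners in perimeter order.
tilted = [(0.0, 0.0), (10.0, 10.0), (8.0, 12.0), (-2.0, 2.0)]

bounds = quad_bounds(tilted)
rect_area = (bounds[2] - bounds[0]) * (bounds[3] - bounds[1])
exact_area = quad_area(tilted)
```

For this sample the bounding rectangle covers more than three times the area of the quad itself, which is why rectangle hits on rotated text tend to over-select neighbouring content.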
TextPage
When you extract text and search the same page, build a TextPage once and
pass it back via textpage= to avoid re-parsing:
tp = page.get_textpage() # optional: flags=, clip=
text = page.get_text("text", textpage=tp)
hits = page.search_for("invoice", textpage=tp)
TextPage also exposes PyMuPDF's direct extractors:
tp.extractText() # -> str
tp.extractWORDS() # -> list[tuple]
tp.extractBLOCKS() # -> list[tuple]
tp.extractDICT() # -> dict
tp.extractRAWDICT() # -> dict
tp.extractJSON() # -> str
tp.rect # -> Rect of the page
Tables
Page.find_tables(...) detects tables and returns a TableFinder:
finder = page.find_tables() # strategy="lines" by default
print(len(finder), "tables")
for table in finder: # also: finder.tables, finder[i]
    print(table.bbox, table.row_count, "x", table.col_count)
    grid = table.extract() # list[list[str | None]]
    md = table.to_markdown() # GitHub-Flavored Markdown
    html = table.to_html() # an HTML <table> string
Strategy
find_tables accepts a strategy of "lines" (default), "lines_strict", or
"text", plus tuning knobs:
finder = page.find_tables(
    strategy="lines",
    line_max_thickness=3.0,
    snap_tolerance=3.0,
    min_line_length=3.0,
)
PyMuPDF's vertical_strategy / horizontal_strategy keyword arguments are also
accepted (a single non-default value selects that strategy).
Table attributes
| Member | Type | Description |
|---|---|---|
| Table.bbox | Rect | Bounding box of the table. |
| Table.row_count | int | Number of rows. |
| Table.col_count | int | Number of columns. |
| Table.header | list | Header row cell text (or []). |
| Table.rows | list[float] | Snapped horizontal grid-line y positions. |
| Table.cols | list[float] | Snapped vertical grid-line x positions. |
| Table.cells | list[list[Rect \| None]] | Per-cell rectangles, one inner list per row; entries may be None. |
| Table.spans | list[tuple] | (row, col, row_span, col_span, Rect) per merged cell. |
| Table.extract() | list[list] | Cell-text grid (None for empty/continuation). |
| Table.to_markdown() | str | Markdown rendering. |
| Table.to_html() | str | HTML rendering. |
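Because `Table.extract()` reports `None` for the continuation cells of a merge, downstream code sometimes wants the merged text repeated in every covered cell. The sketch below does this from `Table.spans`-style tuples. It assumes `row` and `col` are zero-based and name the top-left cell of the merge, which this page does not state; `fill_merged` and the sample data are hypothetical.

```python
def fill_merged(grid, spans):
    """Return a copy of grid with merged-cell text copied into covered cells."""
    filled = [list(row) for row in grid]  # leave the caller's grid untouched
    for row, col, row_span, col_span, _rect in spans:
        value = grid[row][col]  # text of the merge's top-left cell (assumed)
        for r in range(row, row + row_span):
            for c in range(col, col + col_span):
                filled[r][c] = value
    return filled


# Stand-ins for table.extract() and table.spans; the Rect is omitted (None)
# because only the row/column indices matter here.
grid = [
    ["Region", None, "Total"],
    ["North", "South", "9"],
]
spans = [(0, 0, 1, 2, None)]

filled = fill_merged(grid, spans)
```

Cells that are `None` because they are genuinely empty are left alone, since only positions covered by a span are overwritten.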
Page inventory
fonts = page.get_fonts() # list of font tuples
images = page.get_images() # list of image tuples
drawings = page.get_drawings() # vector drawings (geometry as Point/Rect)