
Text extraction

Using pdfspine's get_text family, page search, the reusable TextPage handle, and table detection to pull text and structure out of a PDF.

pdfspine implements PyMuPDF's full get_text family, page search, the reusable TextPage handle, and table detection.

get_text variants

Page.get_text(option="text", *, clip=None, flags=None, textpage=None, sort=False) returns a different native object depending on option:

option       Returns       Description
"text"       str           Plain text in reading order (the default).
"words"      list[tuple]   One tuple per word, with its bounding box.
"blocks"     list[tuple]   One tuple per text block, with its bounding box.
"dict"       dict          Nested blocks → lines → spans structure.
"rawdict"    dict          Like dict, but down to per-character detail.
"json"       str           The dict structure serialized to JSON.
"rawjson"    str           The rawdict structure serialized to JSON.
"html"       str           HTML reconstruction of the page.
"xhtml"      str           XHTML reconstruction.
"xml"        str           Low-level XML with per-glyph geometry.

page = doc[0]

text = page.get_text()                # "text"
words = page.get_text("words")
blocks = page.get_text("blocks")
data = page.get_text("dict")
html = page.get_text("html")
xml = page.get_text("xml")
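
The "dict" result is easiest to understand by walking it. The sketch below assumes pdfspine mirrors PyMuPDF's layout: a top-level "blocks" list, each text block holding "lines", and each line holding "spans" with "text" and "size" keys. The helpers iter_spans and largest_text and the sample data are illustrative, not part of the library.

```python
from typing import Iterator


def iter_spans(page_dict: dict) -> Iterator[dict]:
    """Yield every span of a get_text("dict") result, in document order."""
    for block in page_dict.get("blocks", []):
        # Blocks without a "lines" key (e.g. image blocks) contribute nothing.
        for line in block.get("lines", []):
            yield from line.get("spans", [])


def largest_text(page_dict: dict) -> str:
    """Join the text of all spans set in the largest font size (often a title)."""
    spans = list(iter_spans(page_dict))
    if not spans:
        return ""
    top = max(span["size"] for span in spans)
    return " ".join(s["text"].strip() for s in spans if s["size"] == top)


# Hand-written sample in the assumed layout; real data would come from
# page.get_text("dict").
sample = {
    "blocks": [
        {"lines": [{"spans": [{"text": "Quarterly Report", "size": 18.0}]}]},
        {"lines": [{"spans": [{"text": "Revenue rose ", "size": 10.0},
                              {"text": "12%", "size": 10.0}]}]},
        {"type": 1},  # stand-in for an image block with no lines
    ]
}

print(largest_text(sample))
```

The same traversal should apply to "rawdict" if its outer nesting matches; only the span contents differ.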

Options

  • clip: a Rect (or 4-sequence) limiting extraction to a sub-region.
  • sort: when True, orders blocks top-to-bottom, then left-to-right, by (y, x).
  • flags: an integer bitmask of PyMuPDF text-extraction flags.
  • textpage: reuse a previously built TextPage to avoid re-parsing the page.

clip = pdfspine.Rect(0, 0, 300, 400)
snippet = page.get_text("text", clip=clip, sort=True)
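
The effect of sort=True can be approximated by hand. This sketch assumes block tuples follow PyMuPDF's layout and begin with (x0, y0, x1, y1, text, ...). The helper sort_blocks is hypothetical, and the library's own ordering may round coordinates or break ties differently.

```python
def sort_blocks(blocks: list[tuple]) -> list[tuple]:
    """Order block tuples top-to-bottom, then left-to-right, by (y0, x0)."""
    return sorted(blocks, key=lambda b: (b[1], b[0]))


# Sample tuples in the assumed (x0, y0, x1, y1, text) layout.
blocks = [
    (300.0, 72.0, 520.0, 90.0, "right column, first row"),
    (72.0, 140.0, 280.0, 160.0, "left column, second row"),
    (72.0, 72.0, 280.0, 90.0, "left column, first row"),
]

for block in sort_blocks(blocks):
    print(block[4])
```

Note that a pure (y, x) ordering interleaves the columns of a multi-column page row by row, which is not always the intended reading order.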

There is also a document-level convenience:

text = doc.get_page_text(0, "text", sort=True)
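
As a small sketch of how the document-level call might be used across several pages, the hypothetical helper below takes the page numbers explicitly, so it relies only on get_page_text as shown above.

```python
def collect_text(doc, page_numbers, option="text"):
    """Return {page_number: extracted text} via the document-level shortcut."""
    return {
        pno: doc.get_page_text(pno, option, sort=True)
        for pno in page_numbers
    }
```

For example, collect_text(doc, range(3)) would gather the first three pages of an open document.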

Searching

Page.search_for(needle, *, hit_max=0, quads=False, clip=None, flags=None, textpage=None) finds every occurrence of needle and returns the geometry of each hit:

rects = page.search_for("Total")             # list[Rect]
quads = page.search_for("Total", quads=True)  # list[Quad] (handles rotation)

# Cap the number of hits and restrict to a region:
hits = page.search_for("Total", hit_max=5, clip=pdfspine.Rect(0, 0, 595, 200))

Returning Quad geometry is the right choice when text may be rotated or skewed; the four corner points describe the exact glyph quadrilateral.
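
To illustrate why the corner points matter, the sketch below works on plain (x, y) tuples rather than library objects. It assumes the corners arrive in PyMuPDF's Quad order (ul, ur, ll, lr) and that y grows downward; quad_bounds and quad_angle are hypothetical helpers.

```python
import math

Point = tuple[float, float]


def quad_bounds(ul: Point, ur: Point, ll: Point, lr: Point) -> tuple:
    """Smallest axis-aligned (x0, y0, x1, y1) box enclosing the quad."""
    xs = [p[0] for p in (ul, ur, ll, lr)]
    ys = [p[1] for p in (ul, ur, ll, lr)]
    return (min(xs), min(ys), max(xs), max(ys))


def quad_angle(ul: Point, ur: Point) -> float:
    """Angle of the quad's top edge in degrees; 0 means horizontal text."""
    return math.degrees(math.atan2(ur[1] - ul[1], ur[0] - ul[0]))


# A horizontal hit and a hit whose top edge runs straight down the page.
flat = ((100.0, 200.0), (160.0, 200.0), (100.0, 212.0), (160.0, 212.0))
turned = ((100.0, 200.0), (100.0, 260.0), (88.0, 200.0), (88.0, 260.0))

print(quad_bounds(*flat), quad_angle(flat[0], flat[1]))
print(quad_bounds(*turned), quad_angle(turned[0], turned[1]))
```

The bounding box of the turned quad still encloses the text, but only the quad preserves which edge is the top of the glyphs.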

TextPage

When you both extract text from a page and search it, build a TextPage once and pass it back via textpage= to avoid re-parsing:

tp = page.get_textpage()                  # optional: flags=, clip=

text = page.get_text("text", textpage=tp)
hits = page.search_for("invoice", textpage=tp)
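
The reuse pattern can be wrapped in a helper when several searches run against one page. This is a minimal sketch that uses only the calls shown above; the function name count_terms is hypothetical.

```python
def count_terms(page, terms):
    """Extract the page text once and count search hits for each term."""
    tp = page.get_textpage()  # parse the page a single time
    text = page.get_text("text", textpage=tp)
    hits = {term: len(page.search_for(term, textpage=tp)) for term in terms}
    return text, hits
```

Every call after get_textpage() reuses the same parsed handle, so the cost of adding another search term is the search alone.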

TextPage also exposes PyMuPDF's direct extractors:

tp.extractText()       # -> str
tp.extractWORDS()      # -> list[tuple]
tp.extractBLOCKS()     # -> list[tuple]
tp.extractDICT()       # -> dict
tp.extractRAWDICT()    # -> dict
tp.extractJSON()       # -> str
tp.rect                # -> Rect of the page

Tables

Page.find_tables(...) detects tables and returns a TableFinder:

finder = page.find_tables()               # strategy="lines" by default
print(len(finder), "tables")

for table in finder:                      # also: finder.tables, finder[i]
    print(table.bbox, table.row_count, "x", table.col_count)
    grid = table.extract()                # list[list[str | None]]
    md = table.to_markdown()              # GitHub-Flavored Markdown
    html = table.to_html()                # an HTML <table> string
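
The grid returned by extract() is a plain list of rows, so it can be reshaped without the library. The sketch below assumes the first row holds the column names and treats None cells as empty strings; grid_to_records is a hypothetical helper, and the sample grid is made up.

```python
def grid_to_records(grid):
    """Turn a table.extract() grid into a list of dicts keyed by the first row."""
    if not grid:
        return []
    # Fall back to a positional name when a header cell is None or blank.
    header = [(cell or "").strip() or f"col{i}" for i, cell in enumerate(grid[0])]
    return [
        {key: (cell or "") for key, cell in zip(header, row)}
        for row in grid[1:]
    ]


# Sample grid in the list[list[str | None]] shape.
grid = [
    ["Item", "Qty", None],
    ["Widget", "2", "9.50"],
    ["Gadget", None, "4.00"],
]

for record in grid_to_records(grid):
    print(record)
```

Replacing None with an empty string discards the difference between an empty cell and a continuation of a merged cell; keep the None values if that difference matters.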

Strategy

find_tables accepts a strategy of "lines" (default), "lines_strict", or "text", plus tuning knobs:

finder = page.find_tables(
    strategy="lines",
    line_max_thickness=3.0,
    snap_tolerance=3.0,
    min_line_length=3.0,
)

PyMuPDF's vertical_strategy / horizontal_strategy keyword arguments are also accepted (a single non-default value selects that strategy).
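
One way to use the strategies together is to try the line-based detector first and fall back to the text-based one when no ruled tables are found. This is a sketch of that idea, not library behavior; find_tables_with_fallback is a hypothetical helper that relies only on find_tables(strategy=...) and len(finder).

```python
def find_tables_with_fallback(page, strategies=("lines", "text")):
    """Return (strategy, finder) for the first strategy that detects a table."""
    for strategy in strategies:
        finder = page.find_tables(strategy=strategy)
        if len(finder) > 0:
            return strategy, finder
    # No strategy produced a table.
    return None, None
```

Text-based detection tends to be more permissive, so results from the fallback path are worth a sanity check before use.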

Table attributes

Member               Type                      Description
Table.bbox           Rect                      Bounding box of the table.
Table.row_count      int                       Number of rows.
Table.col_count      int                       Number of columns.
Table.header         list                      Header row cell text (or []).
Table.rows           list[float]               Snapped horizontal grid-line y positions.
Table.cols           list[float]               Snapped vertical grid-line x positions.
Table.cells          list[list[Rect | None]]
Table.spans          list[tuple]               (row, col, row_span, col_span, Rect) per merged cell.
Table.extract()      list[list]                Cell-text grid (None for empty/continuation).
Table.to_markdown()  str                       Markdown rendering.
Table.to_html()      str                       HTML rendering.
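
The relationship between the grid lines and the cell boxes can be sketched in plain Python. The example assumes Table.rows and Table.cols are sorted ascending and include the outer borders, so n + 1 lines bound n cells; cell_boxes is a hypothetical helper and ignores merged cells.

```python
def cell_boxes(rows, cols):
    """Build an (x0, y0, x1, y1) box per cell from grid-line positions."""
    return [
        [(cols[c], rows[r], cols[c + 1], rows[r + 1]) for c in range(len(cols) - 1)]
        for r in range(len(rows) - 1)
    ]


# Three horizontal and three vertical lines bound a 2 x 2 grid.
rows = [100.0, 120.0, 140.0]
cols = [50.0, 150.0, 300.0]

boxes = cell_boxes(rows, cols)
print(len(boxes), "x", len(boxes[0]))
print(boxes[1][0])
```

For tables with merged cells, Table.cells and Table.spans remain the authoritative source; this reconstruction only covers the regular grid.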

Page inventory

fonts = page.get_fonts()        # list of font tuples
images = page.get_images()      # list of image tuples
drawings = page.get_drawings()  # vector drawings (geometry as Point/Rect)
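
As an illustration of working with the inventory tuples, the sketch below lists the distinct fonts on a page. It assumes font tuples follow PyMuPDF's layout (xref, ext, type, basefont, name, encoding), which this page does not confirm for pdfspine; font_names and the sample tuples are hypothetical. Subset fonts carry a six-letter prefix such as "ABCDEF+", which is stripped here.

```python
import re

# PDF subset fonts are tagged with six uppercase letters and a plus sign.
_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def font_names(font_tuples):
    """Return the sorted, de-duplicated base font names without subset tags."""
    # Index 3 is the basefont field in the assumed tuple layout.
    return sorted({_SUBSET_PREFIX.sub("", f[3]) for f in font_tuples})


# Sample tuples in the assumed layout.
fonts = [
    (12, "ttf", "TrueType", "ABCDEF+Helvetica", "F1", "WinAnsiEncoding"),
    (15, "cff", "Type1C", "Times-Roman", "F2", ""),
    (18, "ttf", "TrueType", "GHIJKL+Helvetica", "F3", "WinAnsiEncoding"),
]

print(font_names(fonts))
```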
