Format Details
Detailed breakdown of how OCC extracts metrics from each format.
Word (.docx)
Parser: mammoth
Metrics extracted:
- Words: raw text extracted via mammoth.extractRawText(), then split on whitespace
- Pages: estimated at 250 words per page (Math.max(1, Math.ceil(words / 250)))
- Paragraphs: text split on double newlines, filtered for non-empty segments
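The three metrics above can be sketched as a single function over already-extracted raw text. This is a minimal illustration, not OCC's actual code: docxMetrics is a hypothetical helper name, and treating any run of two or more newlines as a paragraph break is an assumption about how "double newlines" is interpreted.

```javascript
// Sketch of the Word metrics, applied to text that has already been extracted.
// docxMetrics is a hypothetical name, not part of OCC or mammoth.
function docxMetrics(rawText) {
  // Split on runs of whitespace and drop the empty strings produced by
  // leading or trailing whitespace.
  const words = rawText.split(/\s+/).filter(Boolean).length;

  // 250 words per page, never fewer than one page.
  const pages = Math.max(1, Math.ceil(words / 250));

  // Assumption: a paragraph break is two or more consecutive newlines.
  const paragraphs = rawText
    .split(/\n{2,}/)
    .filter((segment) => segment.trim().length > 0).length;

  return { words, pages, paragraphs };
}

console.log(docxMetrics("Title\n\nFirst paragraph here.\n\nSecond one."));
// { words: 6, pages: 1, paragraphs: 3 }
```

Note that an empty document still reports one page, because of the Math.max(1, ...) floor.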
Structure extraction: mammoth converts DOCX to HTML (mapping Heading 1–Heading 6 styles to <h1>–<h6>), then turndown converts to markdown with #–###### headers. This gives accurate heading hierarchy without parsing DOCX XML directly.
Table extraction: occ table inspect parses <table>/<tr>/<td> elements from mammoth's HTML output, preserving colspan and rowspan attributes for merged cells. Headers are auto-detected from <th> tags or unique first-row values.
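As a rough illustration of reading cells and merge attributes out of HTML like mammoth's output, the sketch below uses regular expressions. It is not OCC's parser: parseHtmlTables and spanOf are hypothetical names, nested tables are not handled, and HTML entities are left undecoded.

```javascript
// Read a numeric colspan/rowspan attribute; default to 1 when absent.
function spanOf(attrs, name) {
  const m = attrs.match(new RegExp(name + '\\s*=\\s*"?(\\d+)', "i"));
  return m ? Number(m[1]) : 1;
}

// Hypothetical sketch: returns one array of rows per table, each row an
// array of { text, header, colspan, rowspan } cells.
function parseHtmlTables(html) {
  const tables = [];
  for (const [, tableBody] of html.matchAll(/<table\b[^>]*>([\s\S]*?)<\/table>/gi)) {
    const rows = [];
    for (const [, rowBody] of tableBody.matchAll(/<tr\b[^>]*>([\s\S]*?)<\/tr>/gi)) {
      const cells = [];
      for (const [, tag, attrs, inner] of rowBody.matchAll(/<(td|th)\b([^>]*)>([\s\S]*?)<\/\1>/gi)) {
        cells.push({
          text: inner.replace(/<[^>]+>/g, "").trim(), // strip inner markup
          header: tag.toLowerCase() === "th",
          colspan: spanOf(attrs, "colspan"),
          rowspan: spanOf(attrs, "rowspan"),
        });
      }
      rows.push(cells);
    }
    tables.push(rows);
  }
  return tables;
}
```

A row whose cells all have header set to true corresponds to the <th> case of header detection; the fallback based on first-row values is not modelled here.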
Page estimation
DOCX files don't store reliable page counts. OCC estimates pages at 250 words per page, a common rule of thumb for double-spaced manuscript text; real page counts vary with font, spacing, and layout.
PDF (.pdf)
Parser: pdf-parse
Metrics extracted:
- Words: text extracted by pdf-parse, then split on whitespace
- Pages: actual page count from the PDF metadata (data.numpages)
Structure extraction: pdf-parse is invoked with a custom pagerender callback that prepends [Page N] markers before each page's text. These markers enable section-to-page mapping in the structure tree. Headers in the extracted text are identified by markdown heading syntax.
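To show how [Page N] markers could drive section-to-page mapping, here is a small sketch that walks the extracted text line by line. mapHeadingsToPages is a hypothetical name, and the exact marker layout (at the start of a line, optionally followed by text) is an assumption rather than something OCC guarantees.

```javascript
// Hypothetical sketch: attach the most recent [Page N] marker to each
// markdown heading found in the extracted text.
function mapHeadingsToPages(text) {
  const sections = [];
  let currentPage = null; // stays null until the first marker is seen
  for (const line of text.split("\n")) {
    let rest = line;
    const marker = line.match(/^\[Page (\d+)\]/);
    if (marker) {
      currentPage = Number(marker[1]);
      rest = line.slice(marker[0].length); // keep any text after the marker
    }
    const heading = rest.trim().match(/^(#{1,6})\s+(.+)$/);
    if (heading) {
      sections.push({ level: heading[1].length, title: heading[2], page: currentPage });
    }
  }
  return sections;
}

console.log(mapHeadingsToPages("[Page 1]\n# Intro\nbody\n[Page 2]\n## Details"));
// [ { level: 1, title: 'Intro', page: 1 }, { level: 2, title: 'Details', page: 2 } ]
```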
PDF is the only format that provides a true page count rather than an estimate.
Excel (.xlsx)
Parser: SheetJS (xlsx)
Metrics extracted:
- Sheets: workbook.SheetNames.length
- Rows: derived from each sheet's !ref range
- Cells: rows × columns derived from each sheet's !ref range
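A sheet's !ref is an A1-style range string such as "A1:C10". The sketch below decodes it by hand to show how rows and cells follow from the range; columnNumber and rangeSize are hypothetical names, and returning zeros for a missing !ref (an empty sheet) is an assumption about the desired behaviour.

```javascript
// Convert column letters ("A", "Z", "AA") to a 1-based column number.
function columnNumber(letters) {
  let n = 0;
  for (const ch of letters.toUpperCase()) {
    n = n * 26 + (ch.charCodeAt(0) - 64); // "A" is char code 65
  }
  return n;
}

// Hypothetical sketch: rows, columns and cells spanned by an A1-style range.
function rangeSize(ref) {
  const empty = { rows: 0, columns: 0, cells: 0 };
  if (typeof ref !== "string") return empty; // sheet without a !ref
  const m = ref.match(/^\$?([A-Za-z]+)\$?(\d+)(?::\$?([A-Za-z]+)\$?(\d+))?$/);
  if (!m) return empty;
  const startCol = columnNumber(m[1]);
  const startRow = Number(m[2]);
  const endCol = m[3] ? columnNumber(m[3]) : startCol; // single-cell range
  const endRow = m[4] ? Number(m[4]) : startRow;
  const rows = endRow - startRow + 1;
  const columns = endCol - startCol + 1;
  return { rows, columns, cells: rows * columns };
}

console.log(rangeSize("A1:C10")); // { rows: 10, columns: 3, cells: 30 }
```

In practice SheetJS's XLSX.utils.decode_range returns the same bounds as 0-based indices, so hand-decoding is only needed for illustration. The cell count here is the area of the range, which includes blank cells inside it.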
Word and page counts are not extracted from spreadsheets.
Spreadsheet inspection: occ sheet inspect <file> uses the same SheetJS workbook model to expose workbook properties, hidden sheet state, defined names, formula/comment/hyperlink signals, inferred schema, and lightweight row samples for agent-oriented preflight.
Table extraction: occ table inspect treats each sheet as a table, using the same getCell() and renderCell() utilities as sheet inspection. Merged cells are handled via the !merges array, with the top-left cell carrying colspan/rowspan values.
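Each entry in a SheetJS !merges array is a range of the form { s: { r, c }, e: { r, c } } with 0-based indices. As a hedged sketch of how the top-left cell could pick up its span values, the hypothetical mergeSpans helper below also records which cells are covered by a merge; how OCC actually represents covered cells is not stated here.

```javascript
// Hypothetical sketch: derive colspan/rowspan for the top-left cell of each
// merged range, and note which other cells the merge covers.
function mergeSpans(merges = []) {
  const anchors = new Map(); // "row,col" of top-left cell -> spans
  const covered = new Set(); // "row,col" of cells hidden by a merge
  for (const { s, e } of merges) {
    anchors.set(`${s.r},${s.c}`, {
      rowspan: e.r - s.r + 1,
      colspan: e.c - s.c + 1,
    });
    for (let r = s.r; r <= e.r; r++) {
      for (let c = s.c; c <= e.c; c++) {
        if (r !== s.r || c !== s.c) covered.add(`${r},${c}`);
      }
    }
  }
  return { anchors, covered };
}

const { anchors } = mergeSpans([{ s: { r: 0, c: 0 }, e: { r: 0, c: 2 } }]);
console.log(anchors.get("0,0")); // { rowspan: 1, colspan: 3 }
```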
PowerPoint (.pptx)
Parser: JSZip + officeparser
Metrics extracted:
- Words: text extracted via officeparser, then split on whitespace
- Slides: counted by inspecting the ZIP structure for ppt/slides/slideN.xml entries
Structure extraction: Slides are enumerated from the ZIP in order and # Slide N headers are inserted, creating a flat one-level structure.
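Slide counting and header insertion can be sketched over a plain list of ZIP entry names, such as the keys JSZip exposes after loading an archive. slideHeaders is a hypothetical name; sorting numerically by the N in slideN.xml, and reusing that N in the header, are assumptions about what "in order" means.

```javascript
// Hypothetical sketch: pick out slide entries, order them numerically, and
// build the flat "# Slide N" structure.
function slideHeaders(entryNames) {
  const numbers = [];
  for (const name of entryNames) {
    // Anchored so that ppt/slides/_rels/slide1.xml.rels is not counted.
    const m = name.match(/^ppt\/slides\/slide(\d+)\.xml$/);
    if (m) numbers.push(Number(m[1]));
  }
  numbers.sort((a, b) => a - b); // numeric sort: slide2 before slide10
  return { count: numbers.length, headers: numbers.map((n) => `# Slide ${n}`) };
}

console.log(slideHeaders([
  "ppt/slides/slide10.xml",
  "ppt/slides/slide2.xml",
  "ppt/slides/_rels/slide2.xml.rels",
  "ppt/slides/slide1.xml",
]));
// { count: 3, headers: [ '# Slide 1', '# Slide 2', '# Slide 10' ] }
```

The numeric sort matters because a plain string sort would place slide10.xml before slide2.xml.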
Table extraction: occ table inspect finds <a:tbl> elements in slide XML, extracting text from <a:t> tags within <a:tc> cells. gridSpan and rowSpan attributes are read for merged cells. Each table's location is reported as Slide N.
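The following regex-based sketch illustrates the shape of that extraction on a single slide's XML. It is not OCC's implementation: extractSlideTables is a hypothetical name, text runs within a cell are simply concatenated, and XML entities are left undecoded.

```javascript
// Hypothetical sketch: pull tables, rows and cells out of one slide's XML.
function extractSlideTables(slideXml, slideNumber) {
  // Read a numeric attribute such as gridSpan="2"; default to 1 when absent.
  const attr = (attrs, name) => {
    const m = attrs.match(new RegExp(`\\b${name}="(\\d+)"`));
    return m ? Number(m[1]) : 1;
  };
  const tables = [];
  for (const [, tbl] of slideXml.matchAll(/<a:tbl>([\s\S]*?)<\/a:tbl>/g)) {
    const rows = [];
    for (const [, tr] of tbl.matchAll(/<a:tr\b[^>]*>([\s\S]*?)<\/a:tr>/g)) {
      const cells = [];
      for (const [, attrs, tc] of tr.matchAll(/<a:tc\b([^>]*)>([\s\S]*?)<\/a:tc>/g)) {
        const runs = [...tc.matchAll(/<a:t>([^<]*)<\/a:t>/g)].map((m) => m[1]);
        cells.push({
          text: runs.join(""), // assumption: runs are joined without a separator
          colspan: attr(attrs, "gridSpan"),
          rowspan: attr(attrs, "rowSpan"),
        });
      }
      rows.push(cells);
    }
    tables.push({ location: `Slide ${slideNumber}`, rows });
  }
  return tables;
}
```

The \b after a:tc and a:tr keeps the patterns from matching sibling elements such as a:tcPr.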
ODT (OpenDocument Text)
Parser: officeparser
Metrics extracted:
- Words: text extracted via officeparser, then split on whitespace
- Pages: estimated at 250 words per page (same as Word)
- Paragraphs: text split on newlines, filtered for non-empty segments
Structure extraction: Text is extracted via officeparser. Heading detection is best-effort since ODT formatting may not always be preserved in the plain text output.
Table extraction: occ table inspect parses <table:table> elements from content.xml, extracting text from <table:table-cell> elements. table:number-columns-spanned and table:number-rows-spanned attributes are read for merged cells.
ODS (OpenDocument Spreadsheet)
Parser: JSZip + officeparser
Metrics extracted:
- Sheets: counted by matching <table:table elements in content.xml
- Rows: counted by matching <table:table-row elements in content.xml
- Cells: counted from officeparser text output (non-empty lines)
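One subtlety when counting elements by their opening tag is that <table:table is also a prefix of <table:table-row, <table:table-cell and <table:table-column. The sketch below guards against that by requiring whitespace, "/" or ">" after the element name; countOdsElements is a hypothetical name and the exact patterns OCC uses are not shown in this document.

```javascript
// Hypothetical sketch: count sheets and rows in an ODS content.xml string.
function countOdsElements(contentXml) {
  const count = (re) => (contentXml.match(re) || []).length;
  return {
    // [\s/>] stops "<table:table" from also matching "<table:table-row".
    sheets: count(/<table:table[\s/>]/g),
    rows: count(/<table:table-row[\s/>]/g),
  };
}

console.log(countOdsElements(
  '<table:table table:name="S1"><table:table-row/><table:table-row/></table:table>'
));
// { sheets: 1, rows: 2 }
```

Counting elements this way ignores table:number-rows-repeated, so a single element that stands for many repeated rows is counted once.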
ODP (OpenDocument Presentation)
Parser: JSZip + officeparser
Metrics extracted:
- Words: text extracted via officeparser, then split on whitespace
- Slides: counted by matching <draw:page elements in content.xml
Structure extraction: As with PPTX, slides are counted from content.xml and # Slide N headers are inserted, giving a flat one-level structure.
Table extraction: occ table inspect parses <table:table> elements from content.xml within each <draw:page>, providing per-slide context for each table's location.