Format Details
Detailed breakdown of how OCC extracts metrics from each format.
Word (.docx)
Parser: mammoth
Metrics extracted:
- Words: raw text extracted via mammoth.extractRawText(), then split on whitespace
- Pages: estimated at 250 words per page (Math.max(1, Math.ceil(words / 250)))
- Paragraphs: text split on double newlines, filtered for non-empty segments
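The three Word metrics can be sketched as plain functions over the extracted text. This is a minimal illustration, not OCC's actual code: the helper names are hypothetical, and `text` is assumed to be the `value` field of the result returned by mammoth.extractRawText().

```javascript
// Hypothetical helpers; the names are illustrative, not taken from OCC.
const WORDS_PER_PAGE = 250;

function countWords(text) {
  // Split on runs of whitespace and drop the empty strings that
  // leading or trailing whitespace would otherwise produce.
  return text.split(/\s+/).filter(Boolean).length;
}

function estimatePages(words) {
  // Always at least one page, rounded up to the next whole page.
  return Math.max(1, Math.ceil(words / WORDS_PER_PAGE));
}

function countParagraphs(text) {
  // Paragraphs are separated by a blank line (two or more newlines).
  return text.split(/\n\n+/).filter((p) => p.trim().length > 0).length;
}

function docxMetrics(text) {
  const words = countWords(text);
  return {
    words,
    pages: estimatePages(words),
    paragraphs: countParagraphs(text),
  };
}
```

Because of the Math.max(1, ...) floor, an empty document still reports one page.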
Structure extraction: mammoth converts DOCX to HTML (mapping Heading 1–Heading 6 styles to <h1>–<h6>), then turndown converts to markdown with #–###### headers. This gives accurate heading hierarchy without parsing DOCX XML directly.
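To illustrate the heading mapping, here is a deliberately simplified stand-in for the HTML-to-Markdown step. It handles headings only and uses a regular expression rather than turndown, so treat it as a sketch of the idea and not as what OCC or turndown actually does. The function name is hypothetical.

```javascript
// Simplified stand-in for the HTML-to-Markdown step: converts only
// <h1>..<h6> elements and leaves all other markup untouched.
function headingsToMarkdown(html) {
  return html.replace(
    /<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi,
    (_match, level, inner) => {
      // Strip inline tags such as <em> or <a> from the heading text.
      const text = inner.replace(/<[^>]+>/g, "").trim();
      return "\n" + "#".repeat(Number(level)) + " " + text + "\n";
    }
  );
}
```

The heading level carries over directly: an `<h3>` becomes a line starting with `###`.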
Page estimation
DOCX files don't store reliable page counts. OCC estimates pages at 250 words per page, which is a standard publishing convention.
PDF (.pdf)
Parser: pdf-parse
Metrics extracted:
- Words: text extracted by pdf-parse, then split on whitespace
- Pages: actual page count from the PDF metadata (data.numpages)
Structure extraction: pdf-parse is invoked with a custom pagerender callback that prepends [Page N] markers before each page's text. These markers enable section-to-page mapping in the structure tree. Headers in the extracted text are identified by markdown heading syntax.
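A rough sketch of how such markers could support section-to-page mapping follows. It assumes each `[Page N]` marker sits on its own line and that headings use Markdown `#` syntax. The function name and the shape of the returned objects are hypothetical, not OCC's internal format.

```javascript
// Hypothetical helper: walks marker-annotated text and records the page
// on which each Markdown-style heading appears.
function mapHeadingsToPages(text) {
  const sections = [];
  let page = null; // unknown until the first [Page N] marker is seen
  for (const line of text.split("\n")) {
    const marker = line.trim().match(/^\[Page (\d+)\]$/);
    if (marker) {
      page = Number(marker[1]);
      continue;
    }
    const heading = line.match(/^(#{1,6})\s+(.+)$/);
    if (heading) {
      sections.push({
        level: heading[1].length,
        title: heading[2].trim(),
        page,
      });
    }
  }
  return sections;
}
```

Each heading inherits the number of the most recent marker, so a section that starts midway through page 2 is reported on page 2.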
PDF is the only format that provides a true page count rather than an estimate.
Excel (.xlsx)
Parser: SheetJS (xlsx)
Metrics extracted:
- Sheets: workbook.SheetNames.length
- Rows: derived from each sheet's !ref range
- Cells: rows × columns derived from each sheet's !ref range
Word and page counts are not extracted from spreadsheets.
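As an illustration of the row and cell arithmetic, the sketch below parses an A1-style range such as "A1:C10" by hand. The helper names are hypothetical. SheetJS also provides XLSX.utils.decode_range for this purpose, and whether OCC uses it or its own parsing is not stated here.

```javascript
// Convert column letters ("A", "Z", "AA") to a 1-based column number.
function columnNumber(letters) {
  let n = 0;
  for (const ch of letters.toUpperCase()) {
    n = n * 26 + (ch.charCodeAt(0) - 64); // "A" has char code 65
  }
  return n;
}

// Rows, columns, and cells covered by an A1-style range such as "A1:C10".
function rangeDimensions(ref) {
  // Assumption: a sheet with no !ref is treated as empty.
  if (!ref) return { rows: 0, cols: 0, cells: 0 };
  const [start, end = start] = ref.split(":"); // "B2" alone is a 1x1 range
  const parse = (cell) => {
    const m = cell.match(/^\$?([A-Za-z]+)\$?(\d+)$/);
    if (!m) throw new Error(`Unrecognized cell reference: ${cell}`);
    return { col: columnNumber(m[1]), row: Number(m[2]) };
  };
  const a = parse(start);
  const b = parse(end);
  const rows = b.row - a.row + 1;
  const cols = b.col - a.col + 1;
  return { rows, cols, cells: rows * cols };
}
```

Note that this counts every cell inside the bounding range, including blank ones, which matches the rows × columns definition above.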
PowerPoint (.pptx)
Parser: JSZip + officeparser
Metrics extracted:
- Words: text extracted via officeparser, then split on whitespace
- Slides: counted by inspecting the ZIP structure for ppt/slides/slideN.xml entries
Structure extraction: Slides are enumerated from the ZIP in order and # Slide N headers are inserted, creating a flat one-level structure.
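The slide enumeration might look roughly like the following. The function name is hypothetical, and `entryNames` is assumed to be the list of paths inside the archive, for example Object.keys(zip.files) on a loaded JSZip instance. The sketch also assumes that slide file numbers reflect presentation order, which the file names alone do not guarantee.

```javascript
// Hypothetical helper: entryNames is assumed to be the list of paths in
// the archive, e.g. Object.keys(zip.files) for a loaded JSZip instance.
function slideOutline(entryNames) {
  const slideNumbers = entryNames
    .map((name) => name.match(/^ppt\/slides\/slide(\d+)\.xml$/))
    .filter(Boolean) // drop entries that are not slide parts
    .map((m) => Number(m[1]))
    .sort((a, b) => a - b); // numeric sort so slide10 follows slide9
  return {
    slides: slideNumbers.length,
    headers: slideNumbers.map((n) => `# Slide ${n}`),
  };
}
```

The anchored pattern skips relationship files under ppt/slides/_rels/ and layout parts under ppt/slideLayouts/, which would otherwise inflate the count.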
ODT (OpenDocument Text)
Parser: officeparser
Metrics extracted:
- Words — text extracted via officeparser, then split on whitespace
- Pages — estimated at 250 words per page (same as Word)
- Paragraphs — text split on newlines, filtered for non-empty segments
Structure extraction: Text is extracted via officeparser. Heading detection is best-effort since ODT formatting may not always be preserved in the plain text output.
ODS (OpenDocument Spreadsheet)
Parser: JSZip + officeparser
Metrics extracted:
- Sheets: counted by matching <table:table elements in content.xml
- Rows: counted by matching <table:table-row elements in content.xml
- Cells: counted from officeparser text output (non-empty lines)
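One detail worth showing is that a naive search for `<table:table` would also hit `<table:table-row`, `<table:table-cell`, and `<table:table-column`. The sketch below is one possible way to count opening tags by exact name. The function name and the regular expression are assumptions, not OCC's actual matching logic, and `xml` is assumed to be the content.xml text read from the archive.

```javascript
// Count opening tags with exactly the given qualified name. The lookahead
// stops "table:table" from also matching "table:table-row" and similar.
function countElements(xml, qualifiedName) {
  // Escape regex metacharacters in the element name.
  const escaped = qualifiedName.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(`<${escaped}(?=[\\s/>])`, "g");
  return (xml.match(pattern) || []).length;
}
```

Under these assumptions, the sheet count would be countElements(xml, "table:table") and the row count countElements(xml, "table:table-row"). Rows that ODF collapses with a table:number-rows-repeated attribute are counted once by this approach.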
ODP (OpenDocument Presentation)
Parser: JSZip + officeparser
Metrics extracted:
- Words: text extracted via officeparser, then split on whitespace
- Slides: counted by matching <draw:page elements in content.xml
Structure extraction: Similar to PPTX — slides are counted from content.xml and # Slide N headers are inserted.