Skip to content

Supported Formats

This page covers the document formats supported by OCC across its command families.

For occ code language support, see the CLI Reference. The strongest code exploration support path is JavaScript, TypeScript, and Python.

OCC supports seven office document formats across three categories.

Format Summary

Format Extension Words Pages Paragraphs Sheets Rows Cells Slides Structure Inspect Tables Parser Library
Word .docx Yes Yes* Yes Yes Yes Yes mammoth
PDF .pdf Yes Yes Yes pdf-parse
Excel .xlsx Yes Yes Yes Yes Yes SheetJS/xlsx
PowerPoint .pptx Yes Yes Yes Yes Yes JSZip + officeparser
ODT .odt Yes Yes* Yes Yes Yes Yes officeparser
ODS .ods Yes Yes Yes JSZip + officeparser
ODP .odp Yes Yes Yes Yes Yes JSZip + officeparser

* Pages for Word (.docx) and ODT (.odt) are estimated at 250 words per page.

Structure extraction (--structure) parses heading hierarchy into a tree with dotted section codes. DOCX heading styles are accurately mapped via mammoth + turndown. PDF pages are mapped to sections. PPTX/ODP produce slide-level headers. Spreadsheets have no heading hierarchy and are skipped.

Inspect commands (occ doc/sheet/slide inspect) provide format-specific metadata, risk flags, content stats, and content previews. PDF does not currently have a dedicated inspect command.

Tables extraction (occ table inspect) returns structured table content with headers, rows, and merged cell support. PDF tables cannot be structurally extracted (returns an informative note). ODS table extraction is not yet supported.

Categories

Text Documents

Word (.docx) and ODT (.odt) — extract word counts, page estimates, and paragraph counts.

Spreadsheets

Excel (.xlsx) and ODS (.ods) — extract sheet counts, row counts, and cell counts. Word counts are not applicable.

Presentations

PowerPoint (.pptx) and ODP (.odp) — extract word counts and slide counts from presentation text content.

PDF

PDF (.pdf) — extracts word counts and actual page counts (not estimated).