Parser System
OCC has several parser families:
- Office document parsers for metrics and structure extraction
- Inspection parsers for format-specific metadata, risk flags, and content previews (occ doc/sheet/slide inspect)
- Table extraction parsers for structured table content (occ table inspect)
- Code parsers for the occ code graph builder
The document parser system extracts metrics from DOCX, PDF, XLSX, PPTX, and ODF files. The inspection parsers reuse the same underlying libraries (mammoth, SheetJS, JSZip) for deeper format-specific analysis. The table extraction parsers extract structured table data from document XML or HTML output. The code parser system normalizes supported source languages into one graph model so the CLI can run the same queries across multiple languages.
Office Parser Interface
Every parser function returns an object that matches the ParserOutput interface defined in src/types.ts:
interface ParserOutput {
  fileType: string; // Display name for the format
  metrics: Record<string, number>; // Only populated fields (e.g., { words: 5200, pages: 21 })
}
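As a rough illustration of that contract, the sketch below shows a parser that returns a ParserOutput. The parsePlainText function, its "Text" display name, and its words/lines metrics are hypothetical and not part of OCC; the real parsers read office formats through libraries. Leaving zero counts out of metrics is one assumed reading of "only populated fields".

```typescript
// Shape described above, repeated so the sketch is self-contained.
interface ParserOutput {
  fileType: string;
  metrics: Record<string, number>;
}

// Hypothetical parser, shown only to illustrate the return contract.
function parsePlainText(content: string): ParserOutput {
  const metrics: Record<string, number> = {};

  const words = content.split(/\s+/).filter((w) => w.length > 0).length;
  const lines = content.length === 0 ? 0 : content.split(/\r?\n/).length;

  // Assumption: "only populated fields" means zero counts are omitted.
  if (words > 0) metrics.words = words;
  if (lines > 0) metrics.lines = lines;

  return { fileType: "Text", metrics };
}
```

A caller can then treat a missing key as "not measured" rather than checking for zero.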
The router in parsers/index.ts wraps each result with file metadata:
interface ParseResult {
  filePath: string;
  size: number;
  success: boolean; // false if parsing failed
  fileType: string;
  metrics: Record<string, number> | null;
}
Office Dispatch Flow
graph TD
    A[parseFile] --> B[getExtension]
    B --> C{PARSER_MAP}
    C -->|docx| D[parseDocx]
    C -->|pdf| E[parsePdf]
    C -->|xlsx| F[parseXlsx]
    C -->|pptx| G[parsePptx]
    C -->|odt/ods/odp| H[parseOdf]
    C -->|unknown| I["{ success: false }"]
    D --> J[Return result]
    E --> J
    F --> J
    G --> J
    H --> J
PARSER_MAP
The extension-to-parser mapping in parsers/index.ts:
const PARSER_MAP: Record<string, ParserFn> = {
  docx: parseDocx,
  pdf: parsePdf,
  xlsx: parseXlsx,
  pptx: parsePptx,
  odt: parseOdf,
  ods: parseOdf,
  odp: parseOdf,
};
Note that odt, ods, and odp all route to the same parseOdf function, which internally dispatches based on the file extension.
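The dispatch path can be sketched end to end as follows. This is a simplified illustration, not the code in parsers/index.ts: the ParserFn signature, the stub parsers, the display names, and the "Unknown" placeholder for unsupported extensions are all assumptions, and getExtension here only handles POSIX-style paths.

```typescript
interface ParserOutput {
  fileType: string;
  metrics: Record<string, number>;
}

interface ParseResult {
  filePath: string;
  size: number;
  success: boolean;
  fileType: string;
  metrics: Record<string, number> | null;
}

// Assumed signature; the real ParserFn type is not shown in this section.
type ParserFn = (filePath: string) => Promise<ParserOutput>;

// Stub parsers standing in for parseDocx and parseOdf (display names are placeholders).
const stubDocx: ParserFn = async () => ({ fileType: "Word", metrics: { words: 5200 } });
const stubOdf: ParserFn = async (filePath) => ({
  // One function serves several formats and branches on the extension internally.
  fileType: filePath.toLowerCase().endsWith(".ods") ? "ODF Spreadsheet" : "ODF Document",
  metrics: {},
});

const PARSER_MAP: Record<string, ParserFn> = {
  docx: stubDocx,
  odt: stubOdf,
  ods: stubOdf,
};

// Lowercased extension without the dot, or "" when there is none.
function getExtension(filePath: string): string {
  const base = filePath.slice(filePath.lastIndexOf("/") + 1);
  const dot = base.lastIndexOf(".");
  return dot <= 0 ? "" : base.slice(dot + 1).toLowerCase();
}

async function parseFile(filePath: string, size: number): Promise<ParseResult> {
  const ext = getExtension(filePath);
  // Own-property check so names like "constructor" never match by accident.
  const parser = Object.prototype.hasOwnProperty.call(PARSER_MAP, ext)
    ? PARSER_MAP[ext]
    : undefined;
  if (parser === undefined) {
    // Unknown extension: report failure instead of throwing.
    return { filePath, size, success: false, fileType: "Unknown", metrics: null };
  }
  const output = await parser(filePath);
  return { filePath, size, success: true, ...output };
}
```

Keeping the map as plain data means supporting a new extension is a one-line change, and the unknown-extension branch stays in a single place.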
Batch Concurrency
parseFiles() processes files in batches of 10 using Promise.allSettled:
for (let i = 0; i < files.length; i += concurrency) {
  const batch = files.slice(i, i + concurrency);
  const results = await Promise.allSettled(
    batch.map(f => parseFile(f.path, f.size))
  );
  // collect results...
}
Promise.allSettled is used instead of Promise.all so that a single failing file doesn't abort the entire batch.
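The same pattern can be written as a small generic helper. The sketch below is illustrative only: runInBatches and its parameters are hypothetical names, not OCC functions, and the guard on concurrency is an addition to prevent an endless loop when the step is zero.

```typescript
// Generic batching helper; names are illustrative, not taken from OCC.
async function runInBatches<T, R>(
  items: T[],
  concurrency: number,
  worker: (item: T) => Promise<R>,
): Promise<PromiseSettledResult<R>[]> {
  if (!Number.isInteger(concurrency) || concurrency < 1) {
    throw new RangeError("concurrency must be a positive integer");
  }
  const settled: PromiseSettledResult<R>[] = [];
  for (let i = 0; i < items.length; i += concurrency) {
    const batch = items.slice(i, i + concurrency);
    // Every promise in the batch settles before the next batch starts,
    // and a rejection is recorded instead of aborting the loop.
    const results = await Promise.allSettled(batch.map((item) => worker(item)));
    settled.push(...results);
  }
  return settled;
}
```

With five items and a concurrency of 2, the helper runs three batches. If the worker rejects for one item, the other four still come back with status "fulfilled", which is the behavior Promise.all would not give.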
Error Handling
When a parser throws an exception, parseFile() catches it and returns a result with success: false and metrics: null. These "Unreadable" entries still appear in the output (highlighted in red in tabular mode) so the user knows which files failed.
If Promise.allSettled reports a rejected promise anyway (which shouldn't happen, since parseFile catches errors internally), parseFiles() falls back to an "Unreadable" entry for that file as well.
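The two layers of protection can be sketched like this. The helper names (unreadable, safeParse, collect) are hypothetical, and storing the "Unreadable" label in fileType is an assumption about how the entry is represented.

```typescript
interface ParserOutput {
  fileType: string;
  metrics: Record<string, number>;
}

interface ParseResult {
  filePath: string;
  size: number;
  success: boolean;
  fileType: string;
  metrics: Record<string, number> | null;
}

interface FileEntry {
  path: string;
  size: number;
}

// Assumption: the "Unreadable" label is carried in fileType.
function unreadable(file: FileEntry): ParseResult {
  return { filePath: file.path, size: file.size, success: false, fileType: "Unreadable", metrics: null };
}

// First layer: a throwing parser becomes a failed result, not a rejection.
async function safeParse(
  file: FileEntry,
  parse: (path: string) => Promise<ParserOutput>,
): Promise<ParseResult> {
  try {
    const output = await parse(file.path);
    return { filePath: file.path, size: file.size, success: true, ...output };
  } catch {
    return unreadable(file);
  }
}

// Second layer: any rejected promise is mapped back to its file by position.
function collect(
  batch: FileEntry[],
  settled: PromiseSettledResult<ParseResult>[],
): ParseResult[] {
  return batch.map((file, i) => {
    const outcome = settled[i];
    return outcome !== undefined && outcome.status === "fulfilled"
      ? outcome.value
      : unreadable(file);
  });
}
```

Either way the output keeps one row per input file, so a failure is visible to the user instead of silently shrinking the list.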
Code Parser System
The occ code pipeline lives under src/code/ and normalizes multiple languages into one graph shape:
- discover.ts finds supported code files
- parsers.ts extracts symbols, imports, calls, and inheritance
- build.ts resolves those parsed facts into nodes and edges
- query.ts answers CLI-level questions from the in-memory graph
The strongest support path is currently:
- JavaScript
- TypeScript
- Python
Normalized Parsed Facts
Regardless of language, the parser layer tries to emit the same kinds of facts:
- symbols: functions, classes, and variables
- imports: specifier, bindings, and best-known import kind
- calls: caller, callee, qualifier, and source line
- inheritance: child class, base class, and source line
Those facts are still pre-resolution. build.ts then converts them into graph nodes and edges and assigns resolution status.
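One way to picture the pre-resolution facts is as a set of plain records. The type and field names below are hypothetical, chosen to mirror the list above; the actual definitions under src/code/ may differ, and the "relative" import kind is a made-up value.

```typescript
// Hypothetical shapes; the section lists what each fact carries, not the real types.
type SymbolKind = "function" | "class" | "variable";

interface SymbolFact {
  kind: SymbolKind;
  name: string;
}

interface ImportFact {
  specifier: string;
  bindings: string[];
  kind: string; // best-known import kind
}

interface CallFact {
  caller: string;
  callee: string;
  qualifier: string | null; // e.g. "this", "self", or null for a bare call
  line: number;
}

interface InheritanceFact {
  child: string;
  base: string;
  line: number;
}

interface ParsedFile {
  filePath: string;
  symbols: SymbolFact[];
  imports: ImportFact[];
  calls: CallFact[];
  inheritance: InheritanceFact[];
}

// Facts for this five-line Python file:
//   1: from .animal import Animal
//   2:
//   3: class Dog(Animal):
//   4:     def speak(self):
//   5:         self.bark()
const example: ParsedFile = {
  filePath: "pets/dog.py",
  symbols: [
    { kind: "class", name: "Dog" },
    { kind: "function", name: "speak" },
  ],
  imports: [{ specifier: ".animal", bindings: ["Animal"], kind: "relative" }],
  calls: [{ caller: "Dog.speak", callee: "bark", qualifier: "self", line: 5 }],
  inheritance: [{ child: "Dog", base: "Animal", line: 3 }],
};
```

Nothing in these records says where Animal or bark is defined. That linking step is what the graph builder adds later.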
JavaScript and TypeScript
JS and TS files are parsed with the TypeScript compiler API. That path currently handles:
- function declarations
- arrow functions and function expressions assigned to variables
- class declarations and methods
- import declarations and bindings
- extends relationships
- call expressions, including qualified calls like this.foo() and super.foo()
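To give a feel for this kind of extraction, here is a minimal sketch that walks a syntax tree with the TypeScript compiler API (the typescript npm package). It is not OCC's parsers.ts: the Extracted shape and the extract function are hypothetical, and methods and imports are left out to keep it short.

```typescript
import * as ts from "typescript";

// Hypothetical result shape for this sketch.
interface Extracted {
  functions: string[];
  classes: string[];
  inheritance: string[];
  calls: string[];
}

function extract(source: string, fileName = "example.ts"): Extracted {
  // Parse only; no type checking or program setup is needed for syntax facts.
  const sourceFile = ts.createSourceFile(fileName, source, ts.ScriptTarget.Latest, true);
  const out: Extracted = { functions: [], classes: [], inheritance: [], calls: [] };

  const visit = (node: ts.Node): void => {
    if (ts.isFunctionDeclaration(node) && node.name) {
      out.functions.push(node.name.text);
    } else if (
      ts.isVariableDeclaration(node) &&
      ts.isIdentifier(node.name) &&
      node.initializer &&
      (ts.isArrowFunction(node.initializer) || ts.isFunctionExpression(node.initializer))
    ) {
      // Arrow functions and function expressions assigned to a variable.
      out.functions.push(node.name.text);
    } else if (ts.isClassDeclaration(node) && node.name) {
      const className = node.name.text;
      out.classes.push(className);
      for (const clause of node.heritageClauses ?? []) {
        if (clause.token !== ts.SyntaxKind.ExtendsKeyword) continue;
        for (const base of clause.types) {
          out.inheritance.push(`${className} -> ${base.expression.getText(sourceFile)}`);
        }
      }
    } else if (ts.isCallExpression(node)) {
      // getText keeps the qualifier, e.g. "this.foo" or "super.foo".
      out.calls.push(node.expression.getText(sourceFile));
    }
    ts.forEachChild(node, visit);
  };

  visit(sourceFile);
  return out;
}
```

Because the walk is purely syntactic, it records that a call to this.foo exists without knowing which foo is meant. That decision is deferred to resolution.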
Python
Python files use a lighter-weight parser path built around line-oriented extraction and import helpers. That path is currently tuned for:
- top-level functions and classes
- methods under classes
- import ... and from ... import ... statements
- class inheritance
- common method receivers like self and cls
Relative and repo-local imports are resolved through repository-aware helpers in languages.ts.
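A line-oriented extractor of this kind might look roughly like the sketch below. It is an assumption-laden illustration rather than OCC's implementation: the PyFacts shape and the regular expressions are hypothetical, only the first module of an "import a, b" line is recorded, and text inside multi-line strings is not filtered out.

```typescript
interface PyClass {
  name: string;
  bases: string[];
  methods: string[];
}

// Hypothetical result shape for this sketch.
interface PyFacts {
  functions: string[]; // top-level functions only
  classes: PyClass[];
  imports: string[]; // module specifiers as written
}

function extractPython(source: string): PyFacts {
  const facts: PyFacts = { functions: [], classes: [], imports: [] };
  let current: PyClass | null = null;

  for (const line of source.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue;

    const indent = line.length - line.trimStart().length;
    // Any unindented statement ends the class body being tracked.
    if (indent === 0) current = null;

    const cls = /^class\s+(\w+)\s*(?:\(([^)]*)\))?\s*:/.exec(line);
    if (cls) {
      const bases = (cls[2] ?? "")
        .split(",")
        .map((b) => b.trim())
        .filter((b) => b !== "");
      current = { name: cls[1] ?? "", bases, methods: [] };
      facts.classes.push(current);
      continue;
    }

    const def = /^\s*(?:async\s+)?def\s+(\w+)\s*\(/.exec(line);
    if (def) {
      const name = def[1] ?? "";
      if (indent === 0) facts.functions.push(name);
      else if (current) current.methods.push(name);
      continue;
    }

    const imp = /^(?:import\s+([\w.]+)|from\s+(\.*[\w.]*)\s+import\b)/.exec(line);
    if (imp) facts.imports.push(imp[1] ?? imp[2] ?? "");
  }
  return facts;
}
```

The trade-off is speed and zero dependencies against precision: indentation stands in for a real syntax tree, which is enough for the constructs listed above but not for arbitrary Python.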
Resolution Behavior
The graph builder resolves parsed facts into edges with explicit status:
- resolved when OCC can confidently connect the relationship
- ambiguous when multiple candidates match
- unresolved when a target cannot be connected
That explicit status is part of the contract. OCC prefers surfacing uncertainty to inventing a definitive answer.
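A simplified way to model the three statuses is a function that looks at how many definitions match a callee name. The Resolution type, the SymbolIndex map, and resolveCall are hypothetical names for this sketch; the real builder considers more context than a bare name lookup.

```typescript
type Resolution =
  | { status: "resolved"; target: string }
  | { status: "ambiguous"; candidates: string[] }
  | { status: "unresolved" };

// Hypothetical index: symbol name -> locations where a symbol of that name is defined.
type SymbolIndex = Map<string, string[]>;

function resolveCall(callee: string, index: SymbolIndex): Resolution {
  const candidates = index.get(callee) ?? [];
  const [first] = candidates;
  if (first === undefined) return { status: "unresolved" };
  if (candidates.length === 1) return { status: "resolved", target: first };
  // Several matches: report all of them instead of picking one.
  return { status: "ambiguous", candidates: [...candidates] };
}
```

Returning the candidate list in the ambiguous case lets a caller show the user every plausible target rather than a guess.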
Key behaviors in the current code parser and resolver layer:
- Receiver-aware method resolution for this, super, self, and cls
- Ambiguity tracking with candidate locations for uncertain call targets
- Dependency categorization into local, external, and unresolved imports
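Receiver-aware resolution, the first item above, can be sketched as a lookup that starts from the enclosing class. The ClassInfo shape and resolveReceiverCall are hypothetical, and walking up the inheritance chain for inherited methods is an assumption about how such a resolver would behave, not a description of OCC's code.

```typescript
// Hypothetical class record for this sketch.
interface ClassInfo {
  name: string;
  base: string | null;
  methods: Set<string>;
}

type Receiver = "this" | "super" | "self" | "cls";

function resolveReceiverCall(
  receiver: Receiver,
  method: string,
  currentClass: string,
  classes: Map<string, ClassInfo>,
): string | null {
  const own = classes.get(currentClass);
  if (own === undefined) return null;

  // super starts at the base class; this, self, and cls start at the current class.
  let cursor: ClassInfo | undefined =
    receiver === "super" ? (own.base !== null ? classes.get(own.base) : undefined) : own;

  const seen = new Set<string>(); // guards against cyclic inheritance data
  while (cursor !== undefined && !seen.has(cursor.name)) {
    if (cursor.methods.has(method)) return `${cursor.name}.${method}`;
    seen.add(cursor.name);
    cursor = cursor.base !== null ? classes.get(cursor.base) : undefined;
  }
  return null; // caller would mark the edge unresolved
}
```

Given a Dog class that extends Animal, a call to this.speak() inside Dog maps to Dog.speak when Dog defines it, while super.speak() skips Dog and maps to Animal.speak.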