Parser System¶
The parser system is responsible for extracting metrics from office documents.
Parser Interface¶
Every parser function implements the ParserOutput interface defined in src/types.ts:
interface ParserOutput {
fileType: string; // Display name for the format
metrics: Record<string, number>; // Only populated fields (e.g., { words: 5200, pages: 21 })
}
The router in parsers/index.ts wraps each result with file metadata:
interface ParseResult {
filePath: string;
size: number;
success: boolean; // false if parsing failed
fileType: string;
metrics: Record<string, number> | null;
}
Dispatch Flow¶
graph TD
A[parseFile] --> B[getExtension]
B --> C{PARSER_MAP}
C -->|docx| D[parseDocx]
C -->|pdf| E[parsePdf]
C -->|xlsx| F[parseXlsx]
C -->|pptx| G[parsePptx]
C -->|odt/ods/odp| H[parseOdf]
C -->|unknown| I["{ success: false }"]
D --> J[Return result]
E --> J
F --> J
G --> J
H --> J
PARSER_MAP¶
The extension-to-parser mapping in parsers/index.ts:
const PARSER_MAP: Record<string, ParserFn> = {
docx: parseDocx,
pdf: parsePdf,
xlsx: parseXlsx,
pptx: parsePptx,
odt: parseOdf,
ods: parseOdf,
odp: parseOdf,
};
Note that odt, ods, and odp all route to the same parseOdf function, which internally dispatches based on the file extension.
Batch Concurrency¶
parseFiles() processes files in batches of 10 using Promise.allSettled:
for (let i = 0; i < files.length; i += concurrency) {
const batch = files.slice(i, i + concurrency);
const results = await Promise.allSettled(
batch.map(f => parseFile(f.path, f.size))
);
// collect results...
}
Promise.allSettled is used instead of Promise.all so that a single failing file doesn't abort the entire batch.
Error Handling¶
When a parser throws an exception, parseFile() catches it and returns a result with success: false and metrics: null. These "Unreadable" entries still appear in the output (highlighted in red in tabular mode) so the user knows which files failed.
If Promise.allSettled itself reports a rejected promise (which shouldn't happen since parseFile catches internally), it falls back to an "Unreadable" entry as well.