Document Parsing
Firecrawl provides document parsing capabilities that convert supported document formats into clean, structured Markdown.
Supported document formats
Firecrawl currently supports:
- Excel spreadsheets (`.xlsx`, `.xls`)
  - Each worksheet is converted to an HTML table
  - Worksheets are separated by H2 headings with the sheet name
  - Preserves cell formatting and data types
- Word documents (`.docx`, `.doc`, `.odt`, `.rtf`)
  - Extracts text while preserving document structure
  - Maintains headings, paragraphs, lists, and tables
  - Preserves basic formatting and styling
- PDF documents (`.pdf`)
  - Extracts text content with layout information
  - Preserves document structure including sections and paragraphs
  - Handles both text-based and scanned PDFs (with OCR support)
  - Supports a `mode` option to control parsing strategy: `fast` (text-only), `auto` (text with OCR fallback, default), or `ocr` (force OCR)
  - Priced at 1 credit per page (PDF → Markdown)
PDF parsing modes
Use the `parsers` option to control how PDFs are processed:
| Mode | Description |
|---|---|
| `auto` | Attempts fast text-based extraction first and falls back to OCR if needed. Default. |
| `fast` | Text-based parsing only (embedded text). Fastest, but won't extract text from scanned or image-heavy pages. |
| `ocr` | Forces OCR parsing on every page. Use for scanned documents or when `auto` misclassifies a page. |
parsers: [{ type: "pdf", mode: "ocr", maxPages: 20 }]
parsers: [{ type: "pdf" }]
parsers: ["pdf"]
parsers: []

Passing an empty array, `parsers: []`, skips PDF parsing and returns the PDF as base64 (flat 1 credit per PDF).
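The option shapes shown above can be collected into a small typed helper. The TypeScript sketch below is illustrative only: `PdfMode`, `PdfParser`, `ParserOption`, `pdfParsers`, and `estimatePdfCredits` are hypothetical names, not part of the Firecrawl SDK. The credit estimate simply restates the pricing given on this page (1 credit per page when the PDF is parsed, a flat 1 credit when `parsers` is empty) and deliberately ignores `maxPages`, whose effect on billing is not described here.

```typescript
// Hypothetical helper types; the field names mirror the examples above.
type PdfMode = "fast" | "auto" | "ocr";

interface PdfParser {
  type: "pdf";
  mode?: PdfMode;
  maxPages?: number;
}

// The examples above show both the object form and the "pdf" string form.
type ParserOption = PdfParser | "pdf";

// Build a `parsers` value; omitted fields are left for the service to default.
function pdfParsers(mode?: PdfMode, maxPages?: number): ParserOption[] {
  const parser: PdfParser = { type: "pdf" };
  if (mode !== undefined) parser.mode = mode;
  if (maxPages !== undefined) parser.maxPages = maxPages;
  return [parser];
}

// Rough estimate from the stated pricing only (assumption: maxPages is ignored).
function estimatePdfCredits(pageCount: number, parsers: ParserOption[]): number {
  return parsers.length === 0 ? 1 : pageCount;
}

console.log(JSON.stringify(pdfParsers("ocr", 20)));
console.log(estimatePdfCredits(12, pdfParsers()));
console.log(estimatePdfCredits(12, []));
```

Assuming the SDK's `scrape` method accepts these options as its second argument, the result could be passed along as `firecrawl.scrape(url, { parsers: pdfParsers("ocr", 20) })`.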
How to use document parsing
Document parsing works automatically when you provide a URL pointing to a supported document type. Firecrawl detects the file type from the URL extension or the response `Content-Type` header and processes it accordingly.
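To make the detection rule concrete, here is a minimal sketch of how a client could predict which parser a URL will hit. It is not Firecrawl's actual implementation: `DocumentKind` and `detectDocumentKind` are hypothetical names, checking the extension before the header is an assumption, and the MIME types are the standard registered ones for the formats listed earlier.

```typescript
type DocumentKind = "excel" | "word" | "pdf";

// Extensions taken from the supported formats list.
const EXTENSION_KINDS = new Map<string, DocumentKind>([
  ["xlsx", "excel"], ["xls", "excel"],
  ["docx", "word"], ["doc", "word"], ["odt", "word"], ["rtf", "word"],
  ["pdf", "pdf"],
]);

// Standard media types for the same formats.
const CONTENT_TYPE_KINDS = new Map<string, DocumentKind>([
  ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "excel"],
  ["application/vnd.ms-excel", "excel"],
  ["application/vnd.openxmlformats-officedocument.wordprocessingml.document", "word"],
  ["application/msword", "word"],
  ["application/vnd.oasis.opendocument.text", "word"],
  ["application/rtf", "word"],
  ["application/pdf", "pdf"],
]);

function detectDocumentKind(url: string, contentType?: string): DocumentKind | undefined {
  // Drop the query string and fragment so "file.pdf?download=1" still matches.
  const [path = ""] = url.toLowerCase().split(/[?#]/);
  const dot = path.lastIndexOf(".");
  const extension = dot >= 0 ? path.slice(dot + 1) : "";
  const byExtension = EXTENSION_KINDS.get(extension);
  if (byExtension !== undefined) return byExtension;

  // Fall back to the header, ignoring parameters such as "; charset=utf-8".
  const [rawMime = ""] = (contentType ?? "").split(";");
  return CONTENT_TYPE_KINDS.get(rawMime.trim().toLowerCase());
}

console.log(detectDocumentKind("https://example.com/data.xlsx"));
console.log(detectDocumentKind("https://example.com/download", "application/pdf"));
```

A URL with no recognizable extension and no matching header returns `undefined`, which a caller could treat as an ordinary web page.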
Example: scraping an Excel file
import Firecrawl from '@mendable/firecrawl-js';

const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });

// The .xlsx extension triggers spreadsheet parsing.
// Top-level await requires an ES module context.
const doc = await firecrawl.scrape('https://example.com/data.xlsx');
console.log(doc.markdown);

Example: scraping a Word document
import Firecrawl from '@mendable/firecrawl-js';

const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });

// The .docx extension triggers Word document parsing.
// Top-level await requires an ES module context.
const doc = await firecrawl.scrape('https://example.com/data.docx');
console.log(doc.markdown);

Output format
All supported document types are converted to clean, structured Markdown. For example, an Excel file with multiple sheets might be converted to:
## Sheet1

| Name | Value |
|-------|-------|
| Item 1 | 100 |
| Item 2 | 200 |

## Sheet2

| Date | Description |
|------------|--------------|
| 2023-01-01 | First quarter|
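Because each worksheet sits under its own H2 heading, the Markdown can be split back into per-sheet rows on the client. The sketch below is one possible approach, not something the SDK provides: `Sheet` and `splitSheets` are hypothetical names, and it assumes the tables arrive as Markdown pipe tables like the sample above (HTML tables would need a different parser). Escaped pipes inside cells are not handled.

```typescript
interface Sheet {
  name: string;
  rows: string[][];
}

function splitSheets(markdown: string): Sheet[] {
  const sheets: Sheet[] = [];
  let current: Sheet | undefined;

  for (const rawLine of markdown.split("\n")) {
    const line = rawLine.trim();
    if (line.startsWith("## ")) {
      // An H2 heading starts a new worksheet.
      current = { name: line.slice(3).trim(), rows: [] };
      sheets.push(current);
    } else if (current && line.startsWith("|")) {
      // Strip the outer pipes, then split the remaining text into cells.
      const cells = line
        .replace(/^\|/, "")
        .replace(/\|$/, "")
        .split("|")
        .map((cell) => cell.trim());
      // Skip the header separator row (cells made only of dashes and colons).
      if (!cells.every((cell) => /^:?-+:?$/.test(cell))) current.rows.push(cells);
    }
  }
  return sheets;
}

// The sample output from this page, one line per array entry.
const sample = [
  "## Sheet1",
  "| Name | Value |",
  "|-------|-------|",
  "| Item 1 | 100 |",
  "| Item 2 | 200 |",
  "## Sheet2",
  "| Date | Description |",
  "|------------|--------------|",
  "| 2023-01-01 | First quarter|",
].join("\n");

console.log(JSON.stringify(splitSheets(sample), null, 2));
```

The first row of each sheet is the header row; all cell values come back as strings, so numeric columns such as `Value` would need explicit conversion.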