Flagship Reference
Parse PDF
Parse PDF turns uploaded PDFs into a structured document tree plus Markdown, HTML, plain text, and optional annotated PDF debug artifacts. Use /v1/parse for one file or /v1/parse/batch for up to 10 PDFs with shared parse options.
Overview
What You Get Back
Completed parse jobs expose structured JSON plus optional Markdown, HTML, plain text, annotated PDF debug artifacts, richer Markdown, image-capable Markdown, and tagged PDF output. The JSON artifact is the structured representation for downstream automation. Markdown and text are text-friendly companions for indexing, previews, and human review.
The structured JSON preserves reading order and emits a hierarchical document rooted at numberOfPages plus kids, with semantic nodes for headings, paragraphs, lists, list items, tables, rows, cells, captions, and images when detected.
When the underlying output includes layout coordinates, nodes can also expose bounding boxes so you can map extracted content back to the source pages.
- JSON artifact: structured document tree for automation and indexing.
- Markdown artifact: readable text export with the same reading-order orientation.
- HTML artifact: styled companion for rendering and review.
- Plain text artifact: lightweight output for search, RAG, and simple ingestion.
- Annotated PDF artifact: visual debug output for comparing detected structure to source pages.
- Markdown with HTML artifact: Markdown-family output that keeps richer inline/table markup.
- Markdown with images artifact: explicit image-capable Markdown; external sidecars are bundled into one zip.
- Only one markdown-style artifact can be requested per job because the parse engine emits one Markdown-family file per run.
- Tagged PDF artifact: automated structure inference for accessibility review, not a PDF/UA compliance guarantee.
- Header and footer content stays excluded by default unless include_header_footer=true is requested.
- Tagged PDFs can prefer their native structure tree when use_struct_tree=true is supplied.
- Sanitization, reading-order, table, line-break, hybrid-mode, and image-output settings tune extraction behavior, while output_mode and formats select emitted artifacts.
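As a rough illustration of consuming the JSON artifact, the sketch below walks the document tree depth-first and collects the text-bearing nodes. It assumes the field names that appear in the sample status payload on this page (`kids` at the root, `children` on nested nodes, `type`, `content`, `page number`, `bounding box`); real payloads may differ, and `iter_nodes` and `text_nodes` are hypothetical helper names, not part of the DocuShell API.

```python
from typing import Any, Dict, Iterator, List


def iter_nodes(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every descendant of `node` in depth-first (reading) order."""
    # Assumption: the root uses "kids" and nested nodes use "children",
    # as in the sample payload; both keys are checked on every node.
    for child in node.get("kids", []) + node.get("children", []):
        yield child
        yield from iter_nodes(child)


def text_nodes(document: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect nodes that carry text, with page and bounding box when present."""
    rows = []
    for n in iter_nodes(document):
        if "content" in n:
            rows.append({
                "type": n.get("type"),
                "content": n["content"],
                "page": n.get("page number"),   # None when not emitted
                "bbox": n.get("bounding box"),  # None when not emitted
            })
    return rows


sample = {
    "numberOfPages": 1,
    "kids": [
        {"type": "heading", "content": "Executive summary", "page number": 1},
        {"type": "list", "children": [
            {"type": "listItem", "content": "Renewals remained above 92%."},
        ]},
    ],
}
rows = text_nodes(sample)
print([r["type"] for r in rows])
```

Container nodes such as lists and tables carry no `content` of their own in the sample, so only their leaves are collected here.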
RAG
RAG Ingestion Workflow
Use DocuShell Parse PDF when a search, RAG, review, or agent workflow needs readable chunks plus source metadata.
For most RAG pipelines, request formats=json,markdown. Treat Markdown as the primary text to chunk and embed, then attach JSON metadata from the same parse job so retrieved passages can point back to source pages and bounding boxes.
Do not chunk PDFs by blind character windows first. Start with document semantics: headings, paragraphs, lists, captions, and tables. Merge short neighboring elements when needed, keep tables intact when their structure matters, and carry page numbers plus bounding boxes into vector metadata.
The JSON artifact is also the audit layer. Store enough metadata to reproduce a citation, highlight a source region in a review UI, and debug bad retrieval without reparsing the PDF.
- Submit with formats=json,markdown for the common RAG pair.
- Set use_struct_tree=true for tagged PDFs when you want DocuShell to prefer reliable native structure tags.
- Keep DocuShell's default rendering-mismatch defenses enabled for untrusted PDFs so hidden, off-page, tiny, or transparent text is less likely to enter model context.
- Keep the default header/footer exclusion unless repeated page furniture is important to the answer.
- Use sanitize=true only when your pipeline should mask visible emails, URLs, and phone numbers in the extracted output.
- Store per-chunk metadata such as source file ID, page number, heading path, node type, and bounding box when available.
- Keep table nodes or Markdown tables as standalone chunks when row and column relationships are important.
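The chunking guidance above can be sketched as a heading-driven grouping pass. This is a simplified illustration, not DocuShell's method: it builds chunk text directly from a flat, reading-ordered list of JSON nodes rather than from the Markdown artifact, it assumes the `type`, `content`, `page number`, and `bounding box` field names from the sample payload, and `chunk_by_heading` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional


def chunk_by_heading(nodes: List[Dict[str, Any]], source_file_id: str) -> List[Dict[str, Any]]:
    """Group reading-ordered text nodes into one chunk per heading."""
    chunks: List[Dict[str, Any]] = []
    current: Optional[Dict[str, Any]] = None
    for node in nodes:
        is_heading = node.get("type") == "heading"
        if is_heading or current is None:
            # A heading (or the very first node) opens a new chunk.
            current = {
                "content": "",
                "metadata": {
                    "source_file_id": source_file_id,
                    "section": node.get("content") if is_heading else None,
                    "node_types": [],
                    "page_start": None,
                    "page_end": None,
                    "bounding_boxes": [],
                },
            }
            chunks.append(current)
        meta = current["metadata"]
        text = node.get("content", "")
        current["content"] += ("\n\n" if current["content"] else "") + text
        if node.get("type") not in meta["node_types"]:
            meta["node_types"].append(node.get("type"))
        page = node.get("page number")
        if page is not None:
            # Track the page span so citations can point back to the source.
            meta["page_start"] = page if meta["page_start"] is None else min(meta["page_start"], page)
            meta["page_end"] = page if meta["page_end"] is None else max(meta["page_end"], page)
            if node.get("bounding box"):
                meta["bounding_boxes"].append({"page": page, "bbox": node["bounding box"]})
    return chunks


nodes = [
    {"type": "heading", "content": "Data Retention", "page number": 3,
     "bounding box": {"x": 0.88, "y": 1.21, "w": 6.21, "h": 0.52}},
    {"type": "paragraph", "content": "Customer documents are retained.", "page number": 3},
    {"type": "paragraph", "content": "Deletion follows the retention window.", "page number": 4},
    {"type": "heading", "content": "Access", "page number": 5},
]
chunks = chunk_by_heading(nodes, "file_demo")
print(len(chunks), chunks[0]["metadata"]["section"])
```

A production pipeline would likely also merge very short chunks and keep tables as standalone chunks, as the list above recommends.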
Implementation
RAG Examples
These examples show the DocuShell API request and the downstream metadata shape to keep with embeddings.
Submit a RAG-ready parse job
bash
curl -X POST "https://api.docushell.com/api/v1/parse" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Idempotency-Key: rag-parse-001" \
-F "file=@./policy-handbook.pdf;type=application/pdf" \
-F "formats=json,markdown" \
-F "reading_order=xycut" \
-F "use_struct_tree=true"
DocuShell RAG flow
text
PDF upload
-> POST /api/v1/parse with formats=json,markdown
-> poll GET /api/v1/jobs/:jobId
-> download Markdown and JSON artifacts
-> chunk Markdown by headings, paragraphs, lists, and tables
-> attach JSON metadata: page, bounding box, node type, heading path
-> store embeddings plus metadata
-> retrieve chunks and cite the source page or region
Chunk metadata shape
json
{
"content": "## Data Retention\n\nCustomer documents are retained for the configured retention window...",
"metadata": {
"source_file_id": "file_01JX...",
"source_name": "policy-handbook.pdf",
"section": "Data Retention",
"node_types": ["heading", "paragraph"],
"page_start": 3,
"page_end": 4,
"bounding_boxes": [
{ "page": 3, "bbox": { "x": 0.88, "y": 1.21, "w": 6.21, "h": 0.52 } }
]
}
}
Structure
Tagged PDFs And Structure Trees
Tagged PDFs can provide stronger semantic structure than coordinate-only extraction when the tags are present and trustworthy.
When a source PDF includes usable structure tags, use_struct_tree=true tells DocuShell to prefer that structure for reading order and semantic hierarchy. This can improve headings, lists, table relationships, and natural chunk boundaries for RAG pipelines.
Real document collections are mixed. Some PDFs are well tagged, some have no tags, and some have tags that are not useful enough for downstream retrieval. Always inspect representative outputs before relying on a single extraction strategy.
- Use use_struct_tree=true for tagged policy documents, manuals, reports, and accessible PDFs where author-defined structure is expected.
- Use the regular layout-aware path for untagged or poorly tagged files, or compare both settings during integration testing.
- Request formats=tagged_pdf when you need a generated tagged PDF artifact for review or accessibility workflows.
- Do not treat formats=tagged_pdf as a PDF/UA compliance guarantee. Review the artifact before making accessibility claims.
- For RAG chunking, start new chunks at major headings, preserve heading-plus-paragraph groups, and avoid splitting tables across chunks.
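One lightweight way to compare the two settings during integration testing is to count node types in each JSON artifact and look at the differences. The sketch below assumes you have already parsed the same PDF twice (once with use_struct_tree=true, once without) and loaded both documents as dictionaries. The `kids`/`children`/`type` field names come from the sample payload on this page, while `node_type_counts` and `diff_counts` are hypothetical helpers.

```python
from collections import Counter
from typing import Any, Dict


def node_type_counts(node: Dict[str, Any]) -> Counter:
    """Count node types across the whole tree below `node`."""
    counts: Counter = Counter()
    for child in node.get("kids", []) + node.get("children", []):
        counts[child.get("type", "unknown")] += 1
        counts.update(node_type_counts(child))  # add counts from descendants
    return counts


def diff_counts(layout_doc: Dict[str, Any], struct_doc: Dict[str, Any]) -> Dict[str, int]:
    """Return (struct-tree count - layout count) for every type that differs."""
    a, b = node_type_counts(layout_doc), node_type_counts(struct_doc)
    return {t: b[t] - a[t] for t in sorted(set(a) | set(b)) if b[t] != a[t]}


layout_doc = {"kids": [{"type": "paragraph"}, {"type": "paragraph"}, {"type": "paragraph"}]}
struct_doc = {"kids": [{"type": "heading"}, {"type": "paragraph"}, {"type": "paragraph"}]}
print(diff_counts(layout_doc, struct_doc))
```

A shift from paragraphs toward headings, lists, or tables suggests the native tags carry useful structure. Counts alone do not prove the tags are correct, so a manual spot check is still worthwhile.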
Inspection
Playground Inspection Views
The Parse Playground lets developers inspect both the visual overlay and the structured data behind it.
Annotated PDF Viewer
Renders the selected PDF pages and overlays layout boxes, category tags, and reading-order numbers from parser bounding boxes.
Blocks / Tables Output
Lists extracted nodes as data: order, category type, page number, bounding-box coordinates, and extracted text or table content when available.
JSON, Markdown, And Text
Switch tabs to compare the raw structured JSON with Markdown and plain text companion artifacts from the same parse job.
Capabilities
What Parse Supports Today
The public parse lane exposes curated parser controls while OCR/enrichment remain backend-profile settings.
| Capability | Availability | Notes |
|---|---|---|
| JSON artifact | Available | Hierarchical document output for automation, indexing, and structured QA. |
| Markdown artifact | Available | Text-first companion download for previews, search, and LLM ingestion. |
| HTML artifact | Available | Optional styled companion download for rendering and review. |
| Plain text artifact | Available | Optional lightweight text output for search, RAG, and simple ingestion. |
| Annotated PDF artifact | Available | Optional visual debug artifact for validating extracted structure. |
| Annotated PDF playground viewer | Available | The playground renders PDF pages with layout boxes, category tags, and reading-order numbers when block geometry is present. |
| Blocks / tables output | Available when present | The playground lists parser nodes with order, category type, page number, bounding box, and extracted content. |
| Markdown with HTML | Available | Request formats=markdown_with_html when Markdown should retain richer inline/table markup. |
| Markdown with images | Available | Request formats=markdown_with_images; external sidecars download as a zip. |
| Tagged PDF output | Available | Request formats=tagged_pdf; review before making accessibility compliance claims. |
| Batch parse | Available | Use /v1/parse/batch for up to 10 PDFs with shared parse options, per-file statuses, and per-file artifacts. |
| Reading-order preservation | Available | Structured and text-oriented artifacts follow the detected reading order. |
| Heading and list detection | Available | Headings plus numbered, bulleted, and nested lists are represented when detected. |
| Table extraction | Available / backend-gated | Structured tables are emitted when detected; complex or borderless tables may require the hybrid backend. |
| Image extraction with coordinates | Available when present | Image nodes and coordinates can appear in JSON; use markdown_with_images for image-capable Markdown. |
| Tagged PDF structure | Available | Use use_struct_tree=true to prefer native structure tags when a tagged PDF provides them. |
| Sanitization | Available | Use sanitize=true to mask email addresses, URLs, and phone numbers in extracted output. |
| Reading-order override | Available | Use reading_order=xycut or reading_order=off when you need an explicit reading-order setting. |
| Table-method override | Available | Use table_method=default or table_method=cluster for light table extraction tuning. |
| Keep line breaks | Available | Use keep_line_breaks=true when text-oriented output should preserve original line breaks more closely. |
| Header/footer inclusion | Available | Use include_header_footer=true when you need repeated page furniture in the output. |
| Request hybrid mode | Backend-gated | Use hybrid_mode=auto or hybrid_mode=full only when the DocuShell hybrid backend is enabled. |
| Image output mode | Available | Use image_output=off, embedded, or external for image-capable outputs. |
| OCR / scanned PDFs | Backend-gated | Available when the DocuShell hybrid OCR profile is active; otherwise scans return ocr_required. |
| Formula/chart enrichment | Backend-gated | Available only when the active DocuShell backend profile includes those enrichments. |
Endpoint
Single-File Endpoint
POST /v1/parse
Submit a PDF for queued parsing and receive structured JSON plus Markdown, HTML, plain text, and annotated PDF debug output.
Auth
Bearer token required on submit, status, and artifact download requests.
Idempotency
Server-minted job_id values with optional Idempotency-Key replay support.
Content Type
multipart/form-data
Headers
| Name | Type | Required | Location | Description |
|---|---|---|---|---|
| Authorization | Bearer <API_KEY> | Yes | header | User-owned API key created in the DocuShell dashboard. |
| Idempotency-Key | string | No | header | Recommended for safely retrying submit requests without creating duplicate jobs. |
Request Fields
| Name | Type | Required | Location | Description |
|---|---|---|---|---|
| file | file | Yes | multipart | PDF upload. The gateway validates PDF magic bytes before forwarding the file. |
| file_name | string | No | multipart | Optional file name override used for storage metadata and downstream artifact names. |
| page_range | string | No | multipart | Comma-separated pages or ranges such as 1-3,5,9-11. |
| include_header_footer | boolean | No | multipart | Set to true to keep header and footer content in the extracted output. Default: false |
| use_struct_tree | boolean | No | multipart | Set to true to prefer native tagged-PDF structure when the source document includes a usable structure tree. Default: false |
| sanitize | boolean | No | multipart | Set to true to mask email addresses, URLs, and phone numbers in extracted output. Default: false |
| reading_order | `xycut` or `off` | No | multipart | Optional reading-order strategy. Omit it to keep the current default extraction behavior. |
| table_method | `default` or `cluster` | No | multipart | Optional table-detection strategy. Omit it to keep the current default extraction behavior. |
| keep_line_breaks | boolean | No | multipart | Set to true to preserve source line breaks more aggressively in text-oriented output. Default: false |
| output_mode | `json`, `both`, `html`, or `all` | No | multipart | Backward-compatible artifact bundle selector. json keeps only structured JSON, both adds Markdown, html adds HTML, and all returns the common legacy bundle: JSON, Markdown, HTML, text, and annotated PDF. Default: both |
| formats | `json`, `markdown`, `html`, `text`, `annotated_pdf`, `markdown_with_html`, `markdown_with_images`, or `tagged_pdf` | No | multipart | Optional explicit artifact list. Send as repeated fields or a comma-separated value, such as formats=json,text. |
| hybrid_mode | `auto` or `full` | No | multipart | Optional per-job hybrid triage override. Requires the hybrid backend to be enabled by operations. |
| image_output | `off`, `embedded`, or `external` | No | multipart | Controls image handling for image-capable outputs. markdown_with_images defaults to embedded images unless external is requested. |
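A client can validate the page_range value before uploading to avoid a round trip that ends in invalid_page_range. The sketch below expands a selector such as 1-3,5,9-11 into explicit page numbers. It is a client-side approximation under the assumption that pages are 1-based and ranges are inclusive; the server's exact validation rules are not documented here, and `parse_page_range` is a hypothetical helper.

```python
from typing import List


def parse_page_range(spec: str) -> List[int]:
    """Expand a selector like "1-3,5,9-11" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty segment in page_range: {spec!r}")
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)  # int() raises ValueError on junk
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid range segment: {part!r}")
        pages.update(range(start, end + 1))  # inclusive upper bound
    return sorted(pages)


print(parse_page_range("1-3,5,9-11"))
```

Comparing the largest returned page against the known page count of the PDF would catch selectors that resolve to no valid pages.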
Request Notes
- Plan limits are enforced before the job is queued. Starter keeps the 50 MB per-file cap; Pro, Growth, and Scale raise upload size, per-PDF and per-job page limits, and concurrency as monthly credits grow.
- Set use_struct_tree=true when tagged PDFs should favor their native structure tree. Leave it off for the default reading-order-oriented extraction path.
- Structured JSON remains the canonical parse result and is always generated for successful jobs so status responses can keep returning result.document.
- sanitize, reading_order, table_method, keep_line_breaks, hybrid_mode, and image_output are extraction-tuning knobs. output_mode and formats control which companion artifacts are emitted.
- Request newer artifact types such as markdown_with_html, markdown_with_images, and tagged_pdf with formats; only one markdown-style format (markdown, markdown_with_html, or markdown_with_images) can be requested per job because the parse engine emits one Markdown-family file per run.
- DocuShell keeps rendering-mismatch safety filters enabled for Parse PDF output. sanitize=true is a separate optional control for masking visible sensitive data.
- OCR, formula extraction, and chart/image descriptions follow the active DocuShell backend profile. They are not per-request fields on the shared public API.
- Status polling stays on /v1/jobs/:jobId. Artifact streaming happens through the shared download route with format=json|markdown|html|text|annotated_pdf|markdown_with_html|markdown_with_images|tagged_pdf.
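The one-Markdown-family-format rule is easy to check on the client before submitting. The sketch below validates a requested formats list against the values listed in the request fields table and rejects more than one markdown-style entry. `validate_formats` is a hypothetical helper, and the server remains the authority on what it accepts.

```python
from typing import Iterable, List

# Format names copied from the `formats` request field on this page.
ALLOWED_FORMATS = {
    "json", "markdown", "html", "text", "annotated_pdf",
    "markdown_with_html", "markdown_with_images", "tagged_pdf",
}
MARKDOWN_FAMILY = {"markdown", "markdown_with_html", "markdown_with_images"}


def validate_formats(formats: Iterable[str]) -> List[str]:
    """Return the de-duplicated formats in request order, or raise ValueError."""
    seen: List[str] = []
    for fmt in formats:
        if fmt not in ALLOWED_FORMATS:
            raise ValueError(f"unknown format: {fmt!r}")
        if fmt not in seen:
            seen.append(fmt)
    markdown_like = [f for f in seen if f in MARKDOWN_FAMILY]
    if len(markdown_like) > 1:
        raise ValueError(f"only one markdown-style format allowed, got {markdown_like}")
    return seen


# Build the comma-separated value for the multipart `formats` field.
formats_field = ",".join(validate_formats(["json", "markdown_with_images"]))
print(formats_field)
```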
Multipart submit
bash
curl -X POST "https://api.docushell.com/api/v1/parse" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Idempotency-Key: parse-demo-001" \
-F "file=@./quarterly-report.pdf;type=application/pdf" \
-F "file_name=quarterly-report.pdf" \
-F "page_range=1-3" \
-F "include_header_footer=true" \
-F "use_struct_tree=true" \
-F "sanitize=true" \
-F "reading_order=xycut" \
-F "table_method=cluster" \
-F "keep_line_breaks=true" \
-F "formats=json,markdown_with_images" \
-F "image_output=embedded"
Queued response
json
{
"job_id": "job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT",
"status": "queued",
"cost": 2500,
"service": "parse-pdf",
"request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E",
"links": {
"status": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT"
}
}
Parse job status
json
{
"job_id": "job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT",
"status": "done",
"service": "parse-pdf",
"request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E",
"result": {
"document": {
"fileName": "quarterly-report.pdf",
"numberOfPages": 2,
"kids": [
{
"type": "section",
"children": [
{
"type": "heading",
"content": "Executive summary",
"heading level": 1,
"page number": 1,
"bounding box": { "x": 0.88, "y": 0.74, "w": 6.15, "h": 0.33 }
},
{
"type": "paragraph",
"content": "Revenue rose 18% year over year across the managed-services portfolio.",
"page number": 1,
"bounding box": { "x": 0.88, "y": 1.21, "w": 6.21, "h": 0.52 }
},
{
"type": "list",
"children": [
{ "type": "listItem", "content": "Renewals remained above 92%." },
{ "type": "listItem", "content": "Average contract value increased in EMEA." }
]
},
{
"type": "table",
"children": [
{
"type": "tableRow",
"children": [
{ "type": "tableCell", "content": "Region" },
{ "type": "tableCell", "content": "Growth" }
]
},
{
"type": "tableRow",
"children": [
{ "type": "tableCell", "content": "North America" },
{ "type": "tableCell", "content": "21%" }
]
}
]
},
{
"type": "caption",
"content": "Table 1. Regional growth by quarter."
}
]
}
]
},
"artifacts": {
"markdown_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown",
"json_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=json",
"html_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=html",
"text_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=text",
"annotated_pdf_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=annotated_pdf"
},
"metadata": {
"engine": "docushell_parse",
"output_mode": "all",
"include_header_footer": true,
"use_struct_tree": true,
"sanitize": true,
"reading_order": "xycut",
"table_method": "cluster",
"keep_line_breaks": true
}
},
"metrics": {
"queue_wait_ms": 214,
"duration_ms": 1789
},
"links": {
"status": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT",
"download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download"
}
}
JSON artifact download
bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=json" \
-H "Authorization: Bearer YOUR_API_KEY"
Markdown artifact download
bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown" \
-H "Authorization: Bearer YOUR_API_KEY"
HTML artifact download
bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=html" \
-H "Authorization: Bearer YOUR_API_KEY"
Plain text artifact download
bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=text" \
-H "Authorization: Bearer YOUR_API_KEY"
Annotated PDF debug artifact download
bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=annotated_pdf" \
-H "Authorization: Bearer YOUR_API_KEY" \
--output document.annotated.pdf
Markdown with HTML download
bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown_with_html" \
-H "Authorization: Bearer YOUR_API_KEY" \
--output document.with-html.md
Markdown with images download
bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown_with_images" \
-H "Authorization: Bearer YOUR_API_KEY" \
--output document.with-images.md
Tagged PDF download
bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=tagged_pdf" \
-H "Authorization: Bearer YOUR_API_KEY" \
--output document.tagged.pdf
Artifacts
- The JSON artifact is the structured representation. It includes the document root, numberOfPages, and the hierarchical kids array.
- Node types capture semantic structure such as headings, paragraphs, lists, list items, tables, rows, cells, and captions.
- Where available, nodes include bounding boxes so you can map structured content back to source pages.
- The Parse Playground uses those same node types and bounding boxes to draw layout boxes, category tags, and reading-order numbers in the Annotated PDF viewer.
- Markdown is the flattened companion artifact for indexing, previews, search pipelines, and quick human review.
- For RAG pipelines, index Markdown chunks and attach JSON metadata such as page number, node type, heading path, and bounding box.
- HTML is an optional companion artifact for styled downstream rendering and review when output_mode requests it.
- Plain text is available for search, RAG, and simple ingestion pipelines.
- Annotated PDF is an optional visual debug artifact for comparing detected structure to the source page.
- markdown_with_html is available as an explicit format when you want Markdown output with richer inline/table markup retained.
- markdown_with_images is available as an explicit format. Embedded images produce a self-contained Markdown file; external image sidecars are bundled into one zip.
- Only one markdown-style artifact can be requested in a single job: markdown, markdown_with_html, or markdown_with_images.
- tagged_pdf is available as an explicit format for accessibility review workflows. It is not a PDF/UA compliance guarantee.
Poll And Download
- Poll GET /v1/jobs/:jobId until status becomes done or failed.
- When the job completes, the status payload includes public artifact links under result.artifacts.
- Use GET /v1/jobs/:jobId/download?format=json for the structured document, format=markdown for the Markdown companion, format=html for HTML, format=text for plain text, format=annotated_pdf for the visual debug artifact, format=markdown_with_html for richer Markdown, format=markdown_with_images for image-capable Markdown, and format=tagged_pdf for tagged PDF output.
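The polling step can be sketched as a small loop that stops on the terminal statuses named above (done or failed). To keep the example runnable without network access, the HTTP call is passed in as a function; `poll_job`, `fetch_status`, the two-second interval, and the attempt cap are all illustrative assumptions rather than documented client behavior.

```python
import time
from typing import Any, Callable, Dict

TERMINAL_STATUSES = {"done", "failed"}  # terminal job statuses named on this page


def poll_job(
    fetch_status: Callable[[str], Dict[str, Any]],
    job_id: str,
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status(job_id) until the job reaches a terminal status."""
    for attempt in range(1, max_attempts + 1):
        payload = fetch_status(job_id)  # e.g. GET /v1/jobs/:jobId with a bearer token
        if payload.get("status") in TERMINAL_STATUSES:
            return payload
        if attempt < max_attempts:
            sleep(interval_s)
    raise TimeoutError(f"job {job_id} not terminal after {max_attempts} polls")


# Demo with a fake fetcher that reports "queued" twice and then "done".
_responses = iter([{"status": "queued"}, {"status": "queued"}, {"status": "done"}])
final = poll_job(lambda _job_id: next(_responses), "job_demo", sleep=lambda _s: None)
print(final["status"])
```

In a real client, `fetch_status` would wrap an authenticated HTTP GET, and the artifact links would then be read from `result.artifacts` in the returned payload.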
Failure Notes
- invalid_pdf covers invalid file types and malformed uploads rejected before the worker starts.
- corrupt_pdf is reserved for damaged PDFs that fail deeper validation or parser execution.
- password_protected is returned when the document requires a password.
- ocr_required is returned for scans or image-only PDFs when hybrid OCR is disabled, unavailable, or still produces too little extractable text.
- invalid_page_range is returned when the submitted page selector is malformed or selects no valid pages.
- page_limit_exceeded is returned when the requested page set is larger than the plan-specific parse cap.
- server_busy or backend_unavailable indicate temporary capacity problems. Retry with the same Idempotency-Key when safe.
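A client can turn these notes into a simple retry decision. The sketch below reads the error envelope used in the examples on this page (`error.code`, `error.message`, `error.request_id`) and marks only server_busy and backend_unavailable as retryable, since those are the codes described as temporary. rate_limit_exceeded is deliberately left out because no retry policy is stated for it here. `classify_error` is a hypothetical helper.

```python
from typing import Any, Dict

# Codes the failure notes describe as temporary capacity problems.
RETRYABLE_CODES = {"server_busy", "backend_unavailable"}


def classify_error(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize an error response and flag whether a retry is reasonable."""
    err = payload.get("error") or {}
    code = err.get("code", "unknown")
    return {
        "code": code,
        "retryable": code in RETRYABLE_CODES,
        "request_id": err.get("request_id"),  # useful when contacting support
    }


busy = classify_error({"error": {"code": "server_busy", "request_id": "req_demo"}})
locked = classify_error({"error": {"code": "password_protected"}})
print(busy["retryable"], locked["retryable"])
```

When a retry is attempted, reuse the same Idempotency-Key and the same payload so the request is not treated as a conflicting submit.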
Password-protected PDF
400 password_protected
The document cannot be parsed until it is decrypted outside the public API lane.
400 error
json
{
"error": {
"code": "password_protected",
"message": "This PDF is password-protected and cannot be parsed without a password.",
"type": "invalid_request_error",
"request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
}
}
OCR required
400 ocr_required
The parser could not extract text from a scan or image-only file.
400 error
json
{
"error": {
"code": "ocr_required",
"message": "This PDF appears to require OCR before it can be parsed.",
"type": "invalid_request_error",
"request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
}
}
Invalid page range
400 invalid_page_range
The submitted selector is malformed or does not resolve to valid pages.
400 error
json
{
"error": {
"code": "invalid_page_range",
"message": "The requested page_range is invalid for this PDF.",
"type": "invalid_request_error",
"request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
}
}
Plan page limit exceeded
400 page_limit_exceeded
The requested document or selected page range is larger than the active plan allows.
400 error
json
{
"error": {
"code": "page_limit_exceeded",
"message": "Requested page range exceeds your plan limit.",
"type": "invalid_request_error",
"request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
}
}
Authentication failure
401 invalid_api_key
Returned when the bearer token is missing, revoked, expired, or not allowed to use the API lane.
401 error
json
{
"error": {
"code": "invalid_api_key",
"message": "Invalid API key.",
"type": "auth_error",
"request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
}
}
Idempotency conflict
409 idempotency_key_reused
Returned when the same Idempotency-Key is reused with a different payload than the original request.
409 error
json
{
"error": {
"code": "idempotency_key_reused",
"message": "This Idempotency-Key was already used with a different request.",
"type": "invalid_request_error",
"request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
}
}
Webhook access disabled
403 webhook_access_disabled
Returned when a Starter API key submits webhook fields. Pro, Growth, and Scale include webhooks.
403 error
json
{
"error": {
"code": "webhook_access_disabled",
"message": "Webhooks are available on Pro, Growth, and Scale. Starter includes API access without webhooks.",
"type": "billing_error",
"request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
}
}
Rate limit
429 rate_limit_exceeded
Returned when the API key or caller fingerprint exceeds the configured request rate.
429 error
json
{
"error": {
"code": "rate_limit_exceeded",
"message": "Rate limit exceeded.",
"type": "rate_limit_error",
"request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
}
}
Endpoint
Batch Endpoint
Use batch parse when you want one accepted request, one batch status, and one shared set of parse options for multiple PDFs.
POST /v1/parse/batch
Submit up to 10 PDFs as one async parse batch with shared parse options and per-file artifacts.
Auth
Bearer token required on submit, status, and artifact download requests.
Idempotency
Required Idempotency-Key. The default idempotency window is one hour from submit, intentionally shorter than Stripe-style 24-hour idempotency. Reusing the same key and request replays the accepted response; reusing the key with a different request returns 409 idempotency_key_reused.
Content Type
multipart/form-data
Headers
| Name | Type | Required | Location | Description |
|---|---|---|---|---|
| Authorization | Bearer <API_KEY> | Yes | header | User-owned API key created in the DocuShell dashboard. |
| Idempotency-Key | string | Yes | header | Required for every batch submit. Use a fresh key per logical batch and reuse it only for the exact same retry. |
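One way to guarantee that a key is reused only for the exact same retry is to derive it from the request content. The sketch below hashes the file bytes and the shared parse options into a stable key. This is an illustration under assumptions: the accepted key length and character set are not documented here, `derive_idempotency_key` is a hypothetical helper, and a content-derived key will also replay a deliberately repeated submit inside the idempotency window, so add a per-batch salt to the options if that is not wanted.

```python
import hashlib
import json
from typing import Dict, Iterable


def derive_idempotency_key(
    file_blobs: Iterable[bytes],
    options: Dict[str, str],
    prefix: str = "parse-batch",
) -> str:
    """Build a deterministic key from file contents and shared parse options."""
    h = hashlib.sha256()
    for blob in file_blobs:
        # Hash each file's own digest so file boundaries influence the result.
        h.update(hashlib.sha256(blob).digest())
    # sort_keys makes the key independent of option ordering.
    h.update(json.dumps(options, sort_keys=True).encode("utf-8"))
    return f"{prefix}-{h.hexdigest()[:32]}"


key_a = derive_idempotency_key([b"%PDF-1.7 a", b"%PDF-1.7 b"], {"formats": "json,markdown", "page_range": "1-5"})
key_b = derive_idempotency_key([b"%PDF-1.7 a", b"%PDF-1.7 b"], {"page_range": "1-5", "formats": "json,markdown"})
print(key_a == key_b)
```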
Request Fields
| Name | Type | Required | Location | Description |
|---|---|---|---|---|
| files[] | file[] | Yes | multipart | PDF uploads for the batch. Every file is preflighted before enqueue; one invalid file rejects the whole submit. |
| page_range | string | No | multipart | Shared page selector applied to every file, such as 1-3,5. Selected pages count toward the batch total page limit. |
| include_header_footer | boolean | No | multipart | Shared setting. Set to true to keep repeated headers and footers in extracted output. Default: false |
| use_struct_tree | boolean | No | multipart | Shared setting. Set to true to prefer native tagged-PDF structure when available. Default: false |
| sanitize | boolean | No | multipart | Shared setting. Set to true to mask email addresses, URLs, and phone numbers in extracted output. Default: false |
| reading_order | `xycut` or `off` | No | multipart | Shared reading-order strategy. |
| table_method | `default` or `cluster` | No | multipart | Shared table-detection strategy. |
| keep_line_breaks | boolean | No | multipart | Shared setting. Set to true when text-oriented output should preserve original line breaks more closely. Default: false |
| output_mode | `json`, `both`, `html`, or `all` | No | multipart | Backward-compatible artifact bundle selector. Do not send this together with formats. Default: both |
| formats | `json`, `markdown`, `html`, `text`, `annotated_pdf`, `markdown_with_html`, `markdown_with_images`, or `tagged_pdf` | No | multipart | Explicit artifact list. Send repeated fields or a comma-separated value. Do not send this together with output_mode. |
| hybrid_mode | `auto` or `full` | No | multipart | Optional shared hybrid triage override when the hybrid backend is enabled by operations. |
| image_output | `off`, `embedded`, or `external` | No | multipart | Shared image handling for image-capable outputs. |
| x-docushell-webhook-url | string | No | header | Optional public HTTPS endpoint for the terminal batch webhook. URLs with credentials, localhost, private, reserved, or metadata-service addresses are rejected. |
| x-docushell-webhook-secret | string | No | header | Required when x-docushell-webhook-url is present. Must be 16-256 characters with sufficient variety; do not reuse an API key. |
| x-docushell-webhook-endpoint-id | string | No | header | Saved managed webhook endpoint id. Use this instead of sending a per-request webhook URL and secret. |
Request Notes
- Batch parse is async-only. Submit returns 202; poll GET /v1/parse/batch/:batchId for the authoritative batch state.
- All files are preflighted before enqueue. Empty, non-PDF, corrupt, password-protected, oversized, invalid-page-range, per-file page-limit, total-byte-limit, and total-page-limit failures reject the whole submit.
- Default v1 limits are 10 files, 100 MB total upload, 500 selected pages, and 2 batch submits per minute.
- Parse options are shared across the batch. v1 does not support per-file parse settings.
- Send either output_mode or formats, not both. Only one markdown-style format (markdown, markdown_with_html, or markdown_with_images) can be requested.
- The batch lane has separate backpressure. Queue saturation returns 503 server_busy with Retry-After; rate limits return 429.
- Batch responses report estimated_credits only: each file estimates max(10, selected_pages) credits. v1 does not perform final credit settlement on this lane.
- Every status and download request is owner-scoped. Unknown batches, files, or owner mismatches return 404.
- Artifacts expire one hour after terminal completion by default. Batch idempotency expires one hour after submit by default.
- Terminal statuses are completed, completed_with_failures, and failed. No per-file retry is attempted in v1.
- Webhooks are signed best-effort terminal notifications with short bounded retry. Polling remains the source of truth.
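The default limits and the credit estimate lend themselves to a quick client-side check before uploading. The sketch below applies the max(10, selected_pages) rule per file and flags batches that exceed the default v1 limits. It assumes "100 MB" means 100,000,000 bytes, which this page does not specify, and `estimate_credits` and `preflight` are hypothetical helpers that do not replace the server's own preflight.

```python
from typing import List, Tuple

MAX_FILES = 10
MAX_TOTAL_BYTES = 100 * 1000 * 1000  # assumption: decimal megabytes
MAX_SELECTED_PAGES = 500
MIN_CREDITS_PER_FILE = 10


def estimate_credits(selected_pages: List[int]) -> int:
    """Sum max(10, selected_pages) over the files in the batch."""
    return sum(max(MIN_CREDITS_PER_FILE, pages) for pages in selected_pages)


def preflight(files: List[Tuple[int, int]]) -> List[str]:
    """Check (size_bytes, selected_pages) pairs against the default v1 limits."""
    problems = []
    if not 1 <= len(files) <= MAX_FILES:
        problems.append(f"file count {len(files)} outside 1..{MAX_FILES}")
    total_bytes = sum(size for size, _ in files)
    total_pages = sum(pages for _, pages in files)
    if total_bytes > MAX_TOTAL_BYTES:
        problems.append(f"total upload {total_bytes} bytes exceeds {MAX_TOTAL_BYTES}")
    if total_pages > MAX_SELECTED_PAGES:
        problems.append(f"{total_pages} selected pages exceeds {MAX_SELECTED_PAGES}")
    return problems


batch = [(1_200_000, 8), (643_200, 4)]
print(preflight(batch), estimate_credits([pages for _, pages in batch]))
```

For the two-file batch in the sample response (8 and 4 pages), each file falls under the 10-credit floor, so the estimate is 20, matching the estimated_credits value shown there.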
Multipart batch submit
bash
curl -X POST "https://api.docushell.com/api/v1/parse/batch" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Idempotency-Key: parse-batch-demo-001" \
-H "x-docushell-webhook-url: https://example.com/docushell/webhooks" \
-H "x-docushell-webhook-secret: replace_with_a_long_random_secret" \
-F "files[]=@./report-q1.pdf;type=application/pdf" \
-F "files[]=@./report-q2.pdf;type=application/pdf" \
-F "page_range=1-5" \
-F "formats=json,markdown"
Accepted batch response
json
{
"batch_id": "9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab",
"service": "parse-pdf",
"status": "queued",
"counts": {
"total": 2,
"queued": 2,
"processing": 0,
"completed": 0,
"failed": 0
},
"usage": {
"total_upload_bytes": 1843200,
"total_selected_pages": 12
},
"estimated_credits": 20,
"created_at": "2026-05-23T10:12:30.000Z",
"updated_at": "2026-05-23T10:12:30.000Z",
"completed_at": null,
"expires_at": null,
"webhook_delivery": {
"status": "pending"
},
"metrics": null,
"files": [
{
"file_id": "file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20",
"status": "queued",
"page_count": 8,
"billable_pages": 8,
"estimated_credits": 10
},
{
"file_id": "file_95ac2fb4-060d-427c-9f86-864747dfb935",
"status": "queued",
"page_count": 4,
"billable_pages": 4,
"estimated_credits": 10
}
],
"links": {
"status": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab",
"download": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/download"
}
}
Batch status response
json
{
  "batch_id": "9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab",
  "service": "parse-pdf",
  "status": "completed_with_failures",
  "counts": {
    "total": 2,
    "queued": 0,
    "processing": 0,
    "completed": 1,
    "failed": 1
  },
  "usage": {
    "total_upload_bytes": 1843200,
    "total_selected_pages": 12
  },
  "estimated_credits": 20,
  "created_at": "2026-05-23T10:12:30.000Z",
  "updated_at": "2026-05-23T10:14:04.000Z",
  "completed_at": "2026-05-23T10:14:04.000Z",
  "expires_at": "2026-05-23T11:14:04.000Z",
  "webhook_delivery": {
    "status": "delivered"
  },
  "metrics": {
    "queue_wait_ms": 214,
    "duration_ms": 82341
  },
  "files": [
    {
      "file_id": "file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20",
      "status": "completed",
      "page_count": 8,
      "billable_pages": 8,
      "estimated_credits": 10,
      "artifacts": {
        "json_download": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/files/file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20/download?format=json",
        "markdown_download": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/files/file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20/download?format=markdown"
      }
    },
    {
      "file_id": "file_95ac2fb4-060d-427c-9f86-864747dfb935",
      "status": "failed",
      "page_count": 4,
      "billable_pages": 4,
      "estimated_credits": 10,
      "failure_code": "corrupt_pdf"
    }
  ],
  "links": {
    "status": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab",
    "download": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/download"
  }
}
Batch ZIP download
bash
curl "https://api.docushell.com/api/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/download" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output parse-batch.zip
Per-file artifact download
bash
curl "https://api.docushell.com/api/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/files/file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20/download?format=json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output report.json
Artifacts
- Per-file download links appear under files[].artifacts after each completed file is promoted.
- The batch ZIP is generated on demand from completed file artifacts, streamed, and removed after the response finishes.
- Per-file artifact downloads can be retried until the batch expires.
- The per-file format query must match an artifact requested for the batch.
- Original filenames are not part of the public status payload or webhook logs.
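The format rule above can be checked on the client before a download is attempted. The following is a minimal sketch, assuming artifact keys in files[].artifacts follow the "<format>_download" pattern seen in the sample status payload; that naming is inferred from the example, and the helper name artifact_link is hypothetical.

```python
from typing import Optional


def artifact_link(file_entry: dict, fmt: str) -> Optional[str]:
    """Return the per-file download link for `fmt`, or None if unavailable.

    Assumes artifact keys look like "<format>_download", as in the sample
    status payload; this is an inference, not a documented contract.
    """
    if file_entry.get("status") != "completed":
        # Failed or unfinished files carry no artifacts block.
        return None
    return file_entry.get("artifacts", {}).get(f"{fmt}_download")


# One completed entry copied from the sample status response.
sample_file = {
    "file_id": "file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20",
    "status": "completed",
    "artifacts": {
        "json_download": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/files/file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20/download?format=json",
    },
}

json_link = artifact_link(sample_file, "json")   # link present
html_link = artifact_link(sample_file, "html")   # None: html was not requested
```

Returning None instead of raising lets the caller decide whether a missing format is an error or simply something to skip.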
Poll And Download
- Poll GET /v1/parse/batch/:batchId until status becomes completed, completed_with_failures, or failed.
- When the batch completes or partially completes, use GET /v1/parse/batch/:batchId/download for the generated ZIP, or use each files[].artifacts.*_download link for a specific file artifact.
- If a file status is failed, its per-file download returns 409 batch_file_failed. Continue downloading completed files until expires_at.
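The poll-then-download flow above might look like the sketch below. The HTTP call is left to a caller-supplied fetch_status function so the loop stays transport-agnostic; the function names, the two-second interval, and the attempt cap are illustrative assumptions, not values specified by the API.

```python
import time
from typing import Callable, List

TERMINAL_STATUSES = {"completed", "completed_with_failures", "failed"}


def poll_batch(fetch_status: Callable[[], dict],
               interval_s: float = 2.0,
               max_attempts: int = 150,
               sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch_status() until the batch reports a terminal status."""
    for _ in range(max_attempts):
        payload = fetch_status()
        if payload["status"] in TERMINAL_STATUSES:
            return payload
        sleep(interval_s)
    raise TimeoutError("batch did not reach a terminal status in time")


def downloadable_links(payload: dict) -> List[str]:
    """Collect every *_download link that belongs to a completed file."""
    links = []
    for entry in payload.get("files", []):
        if entry.get("status") != "completed":
            continue  # a failed file would answer 409 batch_file_failed
        for key, url in entry.get("artifacts", {}).items():
            if key.endswith("_download"):
                links.append(url)
    return links
```

In practice fetch_status would wrap an authenticated GET on the batch status route and return the decoded JSON body. Injecting sleep keeps the loop easy to exercise without real delays.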
Failure Notes
- invalid_pdf, corrupt_pdf, password_protected, invalid_page_range, and page_limit_exceeded can be returned during preflight before a batch is accepted.
- server_busy with Retry-After means the dedicated batch queue or active batch lane is saturated. Retry later with the same Idempotency-Key only for the exact same request.
- Download 400 means the format in the query was not among the artifacts requested for this batch.
- Download 425 batch_not_ready means the batch or file is not terminal yet.
- Download 409 batch_file_failed means that accepted file reached a terminal failed state.
- Download 410 output_expired means the artifact TTL has passed.
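These download outcomes can be folded into one decision function. The sketch below assumes the error envelope shape shown in this reference (error.code inside the JSON body). The returned action labels are hypothetical names chosen for illustration, and treating output_expired as a cue to resubmit is an inference from the TTL note rather than documented guidance.

```python
def classify_download_error(http_status: int, body: dict) -> str:
    """Map a failed download response to a coarse client action."""
    code = body.get("error", {}).get("code", "")
    if http_status == 425 and code == "batch_not_ready":
        return "wait"         # keep polling, then retry the download
    if http_status == 409 and code == "batch_file_failed":
        return "skip_file"    # this file is terminally failed
    if http_status == 410 and code == "output_expired":
        return "resubmit"     # artifacts are gone (assumed remedy)
    if http_status == 400:
        return "fix_format"   # request only formats chosen at submit time
    return "unknown"
```

Checking the error code alongside the status keeps 409 batch_file_failed distinct from other 409 responses such as idempotency_key_reused.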
Batch not ready
425 batch_not_ready
Returned when a batch or file download is attempted before terminal status.
425 error
json
{
  "error": {
    "code": "batch_not_ready",
    "message": "Batch is not ready.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
Batch file failed
409 batch_file_failed
Returned when a per-file artifact is requested for a file that failed after enqueue.
409 error
json
{
  "error": {
    "code": "batch_file_failed",
    "message": "Batch file failed.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
Expired output
410 output_expired
Returned after the one-hour default TTL for completed batch artifacts passes.
410 error
json
{
  "error": {
    "code": "output_expired",
    "message": "Batch output expired.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
Queue saturated
503 server_busy
Returned when batch-specific queue backpressure rejects the submit.
503 error
json
{
  "error": {
    "code": "server_busy",
    "message": "The parse batch queue is busy. Retry later.",
    "type": "internal_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
Authentication failure
401 invalid_api_key
Returned when the bearer token is missing, revoked, expired, or not allowed to use the API lane.
401 error
json
{
  "error": {
    "code": "invalid_api_key",
    "message": "Invalid API key.",
    "type": "auth_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
Idempotency conflict
409 idempotency_key_reused
Returned when the same Idempotency-Key is reused with a different payload than the original request.
409 error
json
{
  "error": {
    "code": "idempotency_key_reused",
    "message": "This Idempotency-Key was already used with a different request.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
Webhook access disabled
403 webhook_access_disabled
Returned when a Starter API key submits webhook fields. Pro, Growth, and Scale include webhooks.
403 error
json
{
  "error": {
    "code": "webhook_access_disabled",
    "message": "Webhooks are available on Pro, Growth, and Scale. Starter includes API access without webhooks.",
    "type": "billing_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
Rate limit
429 rate_limit_exceeded
Returned when the API key or caller fingerprint exceeds the configured request rate.
429 error
json
{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded.",
    "type": "rate_limit_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
Webhooks
Completion Webhooks
Use webhook_url and webhook_secret for per-request completion callbacks, or send x-docushell-webhook-url and its matching secret header on batch parse requests.
Receivers must validate x-docushell-signature, deduplicate by x-docushell-delivery, and respond within the 10-second request timeout. Use public HTTPS staging endpoints or approved tunnels for receiver tests.
Terminal event names include pdf.parse.completed, pdf.parse.failed, pdf.parse.batch.completed, pdf.parse.batch.completed_with_failures, pdf.parse.batch.failed, resume.parse.completed, resume.parse.failed, resume.batch.completed, resume.batch.completed_with_failures, and resume.batch.failed.
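Deduplication and event routing on the receiver side can be sketched as follows. This example covers only the x-docushell-delivery check and the event names listed above; signature validation is left out because the signing scheme is not described in this section. The class and function names are hypothetical, the in-memory set stands in for whatever durable store a real receiver would use, and how the event name reaches the receiver (header or body field) is not assumed here.

```python
from typing import Mapping, Set

TERMINAL_EVENTS = {
    "pdf.parse.completed", "pdf.parse.failed",
    "pdf.parse.batch.completed", "pdf.parse.batch.completed_with_failures",
    "pdf.parse.batch.failed",
    "resume.parse.completed", "resume.parse.failed",
    "resume.batch.completed", "resume.batch.completed_with_failures",
    "resume.batch.failed",
}


class DeliveryLog:
    """In-memory record of delivery IDs that were already handled."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def first_time(self, headers: Mapping[str, str]) -> bool:
        # HTTP header names are case-insensitive, so normalize before lookup.
        lowered = {k.lower(): v for k, v in headers.items()}
        delivery_id = lowered.get("x-docushell-delivery")
        if not delivery_id or delivery_id in self._seen:
            return False
        self._seen.add(delivery_id)
        return True


def outcome(event_name: str) -> str:
    """Return the trailing outcome segment of a listed terminal event."""
    if event_name not in TERMINAL_EVENTS:
        return "unknown"
    return event_name.rsplit(".", 1)[1]
```

A receiver would call first_time before doing any work and acknowledge repeated deliveries without reprocessing them.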
Artifacts
Artifact Downloads
The shared download route streams one artifact at a time.
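Because each request returns a single artifact, a client that wants several formats issues one request per format. The sketch below only builds the URLs. The base URL and format values are taken from the curl examples in this section, while the helper names are hypothetical and the format set is not claimed to be exhaustive.

```python
from typing import Iterable, List
from urllib.parse import urlencode

BASE_URL = "https://api.docushell.com/api"  # as used in the curl examples

# Format values that appear in this section's curl examples.
FORMATS = {
    "json", "markdown", "html", "text", "annotated_pdf",
    "markdown_with_html", "markdown_with_images", "tagged_pdf",
}


def job_download_url(job_id: str, fmt: str) -> str:
    """Build the download URL for one artifact of one job."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return f"{BASE_URL}/v1/jobs/{job_id}/download?{urlencode({'format': fmt})}"


def job_download_urls(job_id: str, formats: Iterable[str]) -> List[str]:
    """One URL per requested format, in the order given."""
    return [job_download_url(job_id, fmt) for fmt in formats]
```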
JSON artifact download
bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=json" \
  -H "Authorization: Bearer YOUR_API_KEY"
Markdown artifact download
bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown" \
  -H "Authorization: Bearer YOUR_API_KEY"
HTML artifact download
bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=html" \
  -H "Authorization: Bearer YOUR_API_KEY"
Plain text artifact download
bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=text" \
  -H "Authorization: Bearer YOUR_API_KEY"
Annotated PDF debug artifact download
bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=annotated_pdf" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.annotated.pdf
Markdown with HTML download
bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown_with_html" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.with-html.md
Markdown with images download
bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown_with_images" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.with-images.md
Tagged PDF download
bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=tagged_pdf" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.tagged.pdf
Artifacts
JSON Artifact Versus Markdown Artifact
Use the JSON artifact when you need structure, semantic content types, or document geometry. It is the canonical machine-readable representation and the right choice for pipelines that need sections, lists, tables, headings, captions, or positional metadata.
Use the Markdown artifact when you need a lightweight, portable text representation that still follows the extracted reading order. It works well for previews, quick QA, search indexing, and downstream LLM ingestion.
The parse status payload also includes result.metadata so you can inspect which extraction-tuning options were applied to a completed job.
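To illustrate why the JSON artifact suits structure-aware pipelines, here is a small traversal sketch. It relies on the documented root shape (numberOfPages plus kids). It also assumes that nested nodes use kids for their children and carry type and content keys; those key names and the type values in the sample are placeholders for illustration, not the confirmed schema.

```python
from typing import Iterator, List, Tuple


def walk(node: dict, depth: int = 0) -> Iterator[Tuple[int, dict]]:
    """Yield (depth, node) for a node and its descendants in reading order."""
    yield depth, node
    for child in node.get("kids", []):
        yield from walk(child, depth + 1)


def nodes_of_type(root: dict, wanted: str) -> List[dict]:
    """Collect nodes whose (assumed) "type" key equals `wanted`."""
    return [n for _, n in walk(root) if n.get("type") == wanted]


# Hand-written sample; the "type" and "content" keys are assumptions.
sample = {
    "numberOfPages": 1,
    "kids": [
        {"type": "heading", "content": "Summary"},
        {"type": "list", "kids": [
            {"type": "list item", "content": "First"},
            {"type": "list item", "content": "Second"},
        ]},
    ],
}

items = [n["content"] for n in nodes_of_type(sample, "list item")]
```

The same depth-first walk would not be possible on the Markdown artifact without re-parsing, which is the practical difference between the two outputs.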
Support
Troubleshooting Parse Failures
Most parse failures are actionable before you retry. Keep the same Idempotency-Key only when you are replaying the exact same logical request after a timeout or transport issue.
- invalid_pdf: confirm the upload is a real PDF before retrying.
- corrupt_pdf: re-export or repair the file, then resubmit.
- password_protected: decrypt the PDF before uploading. Password submission is not part of this public lane yet.
- ocr_required: run OCR upstream first, then resubmit the text-native PDF.
- invalid_page_range: retry with a selector like 1-3,5 that resolves inside the document bounds.
- Unexpected text layout: retry with reading_order=off or reading_order=xycut, depending on whether you want less or more reading-order reconstruction.
- Weak table extraction: retry with table_method=cluster if the default path misses cell groupings.
- server_busy or backend_unavailable: retry safely with the same Idempotency-Key after the transient issue clears.
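The guidance above separates failures that need a changed request from transient ones that can be replayed unchanged. A minimal sketch of that split follows; RetryPlan and plan_retry are hypothetical names. Issuing a new Idempotency-Key after fixing the input follows from the idempotency_key_reused rule, since the corrected request is no longer the same payload.

```python
from typing import NamedTuple


class RetryPlan(NamedTuple):
    retry: bool                  # whether a resubmit can succeed at all
    reuse_idempotency_key: bool  # True only for an identical replay
    note: str


# Failures that require changing the file or the parameters first.
_FIX_FIRST = {
    "invalid_pdf": "confirm the upload is a real PDF",
    "corrupt_pdf": "re-export or repair the file",
    "password_protected": "decrypt the PDF before uploading",
    "ocr_required": "run OCR upstream first",
    "invalid_page_range": "choose a page selector inside the document bounds",
}
# Failures that clear on their own; the same request can be replayed.
_TRANSIENT = {"server_busy", "backend_unavailable"}


def plan_retry(failure_code: str) -> RetryPlan:
    """Suggest how to retry after a given failure code."""
    if failure_code in _TRANSIENT:
        return RetryPlan(True, True, "replay the identical request once the issue clears")
    if failure_code in _FIX_FIRST:
        return RetryPlan(True, False, _FIX_FIRST[failure_code])
    return RetryPlan(False, False, "unrecognized failure code; inspect the error body")
```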