Flagship Reference

Parse PDF

Parse PDF turns uploaded PDFs into a structured document tree plus Markdown, HTML, plain text, and optional annotated PDF debug artifacts. Use /v1/parse for one file or /v1/parse/batch for up to 10 PDFs with shared parse options.

Overview

What You Get Back

Completed parse jobs expose structured JSON plus optional Markdown, HTML, plain text, annotated PDF debug, Markdown with HTML, Markdown with images, and tagged PDF artifacts. The JSON artifact is the structured representation for downstream automation. Markdown and plain text are flattened companions for indexing, previews, and human review.

The structured JSON preserves reading order and emits a hierarchical document whose root carries numberOfPages and a kids array, with semantic nodes for headings, paragraphs, lists, list items, tables, rows, cells, captions, and images when detected.

When the underlying output includes layout coordinates, nodes can also expose bounding boxes so you can map extracted content back to the source pages.

  • JSON artifact: structured document tree for automation and indexing.
  • Markdown artifact: readable text export with the same reading-order orientation.
  • HTML artifact: styled companion for rendering and review.
  • Plain text artifact: lightweight output for search, RAG, and simple ingestion.
  • Annotated PDF artifact: visual debug output for comparing detected structure to source pages.
  • Markdown with HTML artifact: Markdown-family output that keeps richer inline/table markup.
  • Markdown with images artifact: explicit image-capable Markdown; external sidecars are bundled into one zip.
  • Only one markdown-style artifact can be requested per job because the parse engine emits one Markdown-family file per run.
  • Tagged PDF artifact: automated structure inference for accessibility review, not a PDF/UA compliance guarantee.
  • Header and footer content stays excluded by default unless include_header_footer=true is requested.
  • Tagged PDFs can prefer their native structure tree when use_struct_tree=true is supplied.
  • Sanitization, reading-order, table, line-break, hybrid-mode, and image-output settings tune extraction behavior while output_mode and formats select emitted artifacts.

RAG

RAG Ingestion Workflow

Use DocuShell Parse PDF when a search, RAG, review, or agent workflow needs readable chunks plus source metadata.

For most RAG pipelines, request formats=json,markdown. Treat Markdown as the primary text to chunk and embed, then attach JSON metadata from the same parse job so retrieved passages can point back to source pages and bounding boxes.

Do not chunk PDFs by blind character windows first. Start with document semantics: headings, paragraphs, lists, captions, and tables. Merge short neighboring elements when needed, keep tables intact when their structure matters, and carry page numbers plus bounding boxes into vector metadata.

The JSON artifact is also the audit layer. Store enough metadata to reproduce a citation, highlight a source region in a review UI, and debug bad retrieval without reparsing the PDF.

  • Submit with formats=json,markdown for the common RAG pair.
  • Use use_struct_tree=true for tagged PDFs when you want DocuShell to prefer reliable native structure tags.
  • Keep DocuShell's default rendering-mismatch defenses enabled for untrusted PDFs so hidden, off-page, tiny, or transparent text is less likely to enter model context.
  • Keep the default header/footer exclusion unless repeated page furniture is important to the answer.
  • Use sanitize=true only when your pipeline should mask visible emails, URLs, and phone numbers in the extracted output.
  • Store per-chunk metadata such as source file ID, page number, heading path, node type, and bounding box when available.
  • Keep table nodes or Markdown tables as standalone chunks when row and column relationships are important.
RAG answers should cite DocuShell metadata, not just the text string sent to the model. Page and bounding-box metadata make citations inspectable.

Implementation

RAG Examples

These examples show the DocuShell API request and the downstream metadata shape to keep with embeddings.

Submit a RAG-ready parse job

bash

curl -X POST "https://api.docushell.com/api/v1/parse" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Idempotency-Key: rag-parse-001" \
  -F "file=@./policy-handbook.pdf;type=application/pdf" \
  -F "formats=json,markdown" \
  -F "reading_order=xycut" \
  -F "use_struct_tree=true"
`use_struct_tree=true` is helpful when the source PDF has reliable native tags. If usable tags are not present, continue validating the output in the Parse Playground and downloaded artifacts.

DocuShell RAG flow

text

PDF upload
  -> POST /api/v1/parse with formats=json,markdown
  -> poll GET /api/v1/jobs/:jobId
  -> download Markdown and JSON artifacts
  -> chunk Markdown by headings, paragraphs, lists, and tables
  -> attach JSON metadata: page, bounding box, node type, heading path
  -> store embeddings plus metadata
  -> retrieve chunks and cite the source page or region

Chunk metadata shape

json

{
  "content": "## Data Retention\n\nCustomer documents are retained for the configured retention window...",
  "metadata": {
    "source_file_id": "file_01JX...",
    "source_name": "policy-handbook.pdf",
    "section": "Data Retention",
    "node_types": ["heading", "paragraph"],
    "page_start": 3,
    "page_end": 4,
    "bounding_boxes": [
      { "page": 3, "bbox": { "x": 0.88, "y": 1.21, "w": 6.21, "h": 0.52 } }
    ]
  }
}
Use the exact metadata keys your system prefers, but preserve DocuShell page and bounding-box data when available.
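As a minimal sketch of how a chunk in roughly that shape could be assembled, the Python below walks a parsed document tree, starts a new chunk at each heading, keeps every table as one block, and collects page and bounding-box metadata. The key names ("kids", "children", "content", "page number", "bounding box") are taken from the sample status response in this reference; the function names and the grouping rule are illustrative assumptions, not part of the DocuShell API.

```python
from typing import Any, Iterator

Node = dict[str, Any]


def children_of(node: Node) -> list[Node]:
    # Assumption: child nodes sit under "kids" (root) or "children" (nested),
    # as in the sample status response.
    return list(node.get("kids", [])) + list(node.get("children", []))


def table_text(table: Node) -> str:
    """Flatten a table into one pipe-separated line per row so it stays in one chunk."""
    return "\n".join(
        " | ".join(str(cell.get("content", "")) for cell in children_of(row))
        for row in children_of(table)
    )


def iter_blocks(node: Node) -> Iterator[Node]:
    """Yield text-bearing blocks in document order; a table is yielded as one block."""
    if node.get("type") == "table":
        yield {**node, "content": table_text(node)}
        return
    if "content" in node:
        yield node
    for child in children_of(node):
        yield from iter_blocks(child)


def chunk_by_heading(document: Node, source_file_id: str) -> list[Node]:
    """Start a new chunk at each heading and collect page and bounding-box metadata."""
    chunks: list[Node] = []
    for block in iter_blocks(document):
        is_heading = block.get("type") == "heading"
        if is_heading or not chunks:
            chunks.append({
                "content": "",
                "metadata": {
                    "source_file_id": source_file_id,
                    "section": block["content"] if is_heading else None,
                    "node_types": [],
                    "page_start": None,
                    "page_end": None,
                    "bounding_boxes": [],
                },
            })
        chunk, meta = chunks[-1], chunks[-1]["metadata"]
        chunk["content"] = f'{chunk["content"]}\n\n{block["content"]}'.strip()
        node_type = block.get("type", "unknown")
        if node_type not in meta["node_types"]:
            meta["node_types"].append(node_type)
        page = block.get("page number")
        if page is not None:
            meta["page_start"] = page if meta["page_start"] is None else min(meta["page_start"], page)
            meta["page_end"] = page if meta["page_end"] is None else max(meta["page_end"], page)
            if "bounding box" in block:
                meta["bounding_boxes"].append({"page": page, "bbox": block["bounding box"]})
    return chunks
```

Short neighboring chunks can be merged afterwards if they fall below a size your embedding model handles well; that threshold is a pipeline decision rather than something this reference specifies.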

Structure

Tagged PDFs And Structure Trees

Tagged PDFs can provide stronger semantic structure than coordinate-only extraction when the tags are present and trustworthy.

When a source PDF includes usable structure tags, use_struct_tree=true tells DocuShell to prefer that structure for reading order and semantic hierarchy. This can improve headings, lists, table relationships, and natural chunk boundaries for RAG pipelines.

Real document collections are mixed. Some PDFs are well tagged, some have no tags, and some have tags that are not useful enough for downstream retrieval. Always inspect representative outputs before relying on a single extraction strategy.

  • Use use_struct_tree=true for tagged policy documents, manuals, reports, and accessible PDFs where author-defined structure is expected.
  • Use the regular layout-aware path for untagged or poorly tagged files, or compare both settings during integration testing.
  • Request formats=tagged_pdf when you need a generated tagged PDF artifact for review or accessibility workflows.
  • Do not treat formats=tagged_pdf as a PDF/UA compliance guarantee. Review the artifact before making accessibility claims.
  • For RAG chunking, start new chunks at major headings, preserve heading-plus-paragraph groups, and avoid splitting tables across chunks.

Inspection

Playground Inspection Views

The Parse Playground lets developers inspect both the visual overlay and the structured data behind it.

Capabilities

What Parse Supports Today

The public parse lane exposes curated parser controls, while OCR and enrichment remain backend-profile settings.

| Capability | Availability | Notes |
| --- | --- | --- |
| JSON artifact | Available | Hierarchical document output for automation, indexing, and structured QA. |
| Markdown artifact | Available | Text-first companion download for previews, search, and LLM ingestion. |
| HTML artifact | Available | Optional styled companion download for rendering and review. |
| Plain text artifact | Available | Optional lightweight text output for search, RAG, and simple ingestion. |
| Annotated PDF artifact | Available | Optional visual debug artifact for validating extracted structure. |
| Annotated PDF playground viewer | Available | The playground renders PDF pages with layout boxes, category tags, and reading-order numbers when block geometry is present. |
| Blocks / tables output | Available when present | The playground lists parser nodes with order, category type, page number, bounding box, and extracted content. |
| Markdown with HTML | Available | Request formats=markdown_with_html when Markdown should retain richer inline/table markup. |
| Markdown with images | Available | Request formats=markdown_with_images; external sidecars download as a zip. |
| Tagged PDF output | Available | Request formats=tagged_pdf; review before making accessibility compliance claims. |
| Batch parse | Available | Use /v1/parse/batch for up to 10 PDFs with shared parse options, per-file statuses, and per-file artifacts. |
| Reading-order preservation | Available | Structured and text-oriented artifacts follow the detected reading order. |
| Heading and list detection | Available | Headings plus numbered, bulleted, and nested lists are represented when detected. |
| Table extraction | Available / backend-gated | Structured tables are emitted when detected; complex or borderless tables may require the hybrid backend. |
| Image extraction with coordinates | Available when present | Image nodes and coordinates can appear in JSON; use markdown_with_images for image-capable Markdown. |
| Tagged PDF structure | Available | Use use_struct_tree=true to prefer native structure tags when a tagged PDF provides them. |
| Sanitization | Available | Use sanitize=true to mask email addresses, URLs, and phone numbers in extracted output. |
| Reading-order override | Available | Use reading_order=xycut or reading_order=off when you need an explicit reading-order setting. |
| Table-method override | Available | Use table_method=default or table_method=cluster for light table extraction tuning. |
| Keep line breaks | Available | Use keep_line_breaks=true when text-oriented output should preserve original line breaks more closely. |
| Header/footer inclusion | Available | Use include_header_footer=true when you need repeated page furniture in the output. |
| Request hybrid mode | Backend-gated | Use hybrid_mode=auto or hybrid_mode=full only when the DocuShell hybrid backend is enabled. |
| Image output mode | Available | Use image_output=off, image_output=embedded, or image_output=external for image-capable outputs. |
| OCR / scanned PDFs | Backend-gated | Available when the DocuShell hybrid OCR profile is active; otherwise scans return ocr_required. |
| Formula/chart enrichment | Backend-gated | Available only when the active DocuShell backend profile includes those enrichments. |

Endpoint

Single-File Endpoint

POST /v1/parse

Submit a PDF for queued parsing and receive structured JSON plus Markdown, HTML, plain text, and annotated PDF debug output.

Auth

Bearer token required on submit, status, and artifact download requests.

Idempotency

Server-minted job_id values with optional Idempotency-Key replay support.

Content Type

multipart/form-data

Headers

| Name | Type | Required | Location | Description |
| --- | --- | --- | --- | --- |
| Authorization | Bearer <API_KEY> | Yes | header | User-owned API key created in the DocuShell dashboard. |
| Idempotency-Key | string | No | header | Recommended for safely retrying submit requests without creating duplicate jobs. |

Request Fields

| Name | Type | Required | Location | Description |
| --- | --- | --- | --- | --- |
| file | file | Yes | multipart | PDF upload. The gateway validates PDF magic bytes before forwarding the file. |
| file_name | string | No | multipart | Optional file name override used for storage metadata and downstream artifact names. |
| page_range | string | No | multipart | Comma-separated pages or ranges such as 1-3,5,9-11. |
| include_header_footer | boolean | No | multipart | Set to true to keep header and footer content in the extracted output. Default: false |
| use_struct_tree | boolean | No | multipart | Set to true to prefer native tagged-PDF structure when the source document includes a usable structure tree. Default: false |
| sanitize | boolean | No | multipart | Set to true to mask email addresses, URLs, and phone numbers in extracted output. Default: false |
| reading_order | `xycut` or `off` | No | multipart | Optional reading-order strategy. Omit it to keep the current default extraction behavior. |
| table_method | `default` or `cluster` | No | multipart | Optional table-detection strategy. Omit it to keep the current default extraction behavior. |
| keep_line_breaks | boolean | No | multipart | Set to true to preserve source line breaks more aggressively in text-oriented output. Default: false |
| output_mode | `json`, `both`, `html`, or `all` | No | multipart | Backward-compatible artifact bundle selector. json keeps only structured JSON, both adds Markdown, html adds HTML, and all returns the common legacy bundle: JSON, Markdown, HTML, text, and annotated PDF. Default: both |
| formats | `json`, `markdown`, `html`, `text`, `annotated_pdf`, `markdown_with_html`, `markdown_with_images`, or `tagged_pdf` | No | multipart | Optional explicit artifact list. Send as repeated fields or a comma-separated value, such as formats=json,text. |
| hybrid_mode | `auto` or `full` | No | multipart | Optional per-job hybrid triage override. Requires the hybrid backend to be enabled by operations. |
| image_output | `off`, `embedded`, or `external` | No | multipart | Controls image handling for image-capable outputs. markdown_with_images defaults to embedded images unless external is requested. |
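The page_range selector in the request fields is a comma-separated list of pages and ranges. The sketch below shows one way a client might validate such a selector before submitting, so obviously malformed input is caught without spending a request. It assumes 1-based page numbers and closed ranges; the exact grammar the server accepts (whitespace, open-ended ranges, and so on) is not documented here, and `parse_page_range` is a hypothetical helper.

```python
def parse_page_range(selector: str) -> list[int]:
    """Expand a selector such as "1-3,5,9-11" into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in selector.split(","):
        part = part.strip()
        if not part:
            raise ValueError("empty entry in page_range")
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)  # int() raises ValueError on junk
        else:
            start = end = int(part)
        # Assumption: pages are 1-based and a range must not run backwards.
        if start < 1 or end < start:
            raise ValueError(f"invalid range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

A passing client-side check does not guarantee acceptance: the server can still return invalid_page_range when the selector resolves to no valid pages in the uploaded PDF.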

Request Notes

  • Plan limits are enforced before the job is queued. Starter keeps the 50 MB per-file cap; Pro, Growth, and Scale raise the upload size, the per-PDF and per-job page limits, and concurrency as monthly credits grow.
  • Set use_struct_tree=true when tagged PDFs should favor their native structure tree. Leave it off for the default reading-order-oriented extraction path.
  • Structured JSON remains the canonical parse result and is always generated for successful jobs so status responses can keep returning result.document.
  • sanitize, reading_order, table_method, keep_line_breaks, hybrid_mode, and image_output are extraction-tuning knobs. output_mode and formats control which companion artifacts are emitted.
  • Request newer artifact types such as markdown_with_html, markdown_with_images, and tagged_pdf with formats; only one markdown-style format (markdown, markdown_with_html, or markdown_with_images) can be requested per job because the parse engine emits one Markdown-family file per run.
  • DocuShell keeps rendering-mismatch safety filters enabled for Parse PDF output. sanitize=true is a separate optional control for masking visible sensitive data.
  • OCR, formula extraction, and chart/image descriptions follow the active DocuShell backend profile. They are not per-request fields on the shared public API.
  • Status polling stays on /v1/jobs/:jobId. Artifact streaming happens through the shared download route with format=json|markdown|html|text|annotated_pdf|markdown_with_html|markdown_with_images|tagged_pdf.
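The one-Markdown-family-format rule in the notes above is easy to check on the client before a submit. The following is a hedged sketch: the format names come from the formats field in this reference, while `normalize_formats` and its error messages are hypothetical.

```python
MARKDOWN_FAMILY = {"markdown", "markdown_with_html", "markdown_with_images"}
KNOWN_FORMATS = MARKDOWN_FAMILY | {"json", "html", "text", "annotated_pdf", "tagged_pdf"}


def normalize_formats(formats: list[str]) -> str:
    """Validate a list of formats and return the comma-separated field value."""
    unique: list[str] = []
    for name in formats:
        if name not in KNOWN_FORMATS:
            raise ValueError(f"unknown format: {name!r}")
        if name not in unique:  # drop duplicates, keep the caller's order
            unique.append(name)
    markdown_like = [name for name in unique if name in MARKDOWN_FAMILY]
    if len(markdown_like) > 1:
        raise ValueError(f"only one Markdown-family format per job, got {markdown_like}")
    return ",".join(unique)
```

For example, `normalize_formats(["json", "markdown"])` yields the value used in the RAG submit example, and a list containing both markdown and markdown_with_html is rejected locally.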

Multipart submit

bash

curl -X POST "https://api.docushell.com/api/v1/parse" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Idempotency-Key: parse-demo-001" \
  -F "file=@./quarterly-report.pdf;type=application/pdf" \
  -F "file_name=quarterly-report.pdf" \
  -F "page_range=1-3" \
  -F "include_header_footer=true" \
  -F "use_struct_tree=true" \
  -F "sanitize=true" \
  -F "reading_order=xycut" \
  -F "table_method=cluster" \
  -F "keep_line_breaks=true" \
  -F "formats=json,markdown_with_images" \
  -F "image_output=embedded"

Queued response

json

{
  "job_id": "job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT",
  "status": "queued",
  "cost": 2500,
  "service": "parse-pdf",
  "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E",
  "links": {
    "status": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT"
  }
}

Parse job status

json

{
  "job_id": "job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT",
  "status": "done",
  "service": "parse-pdf",
  "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E",
  "result": {
    "document": {
      "fileName": "quarterly-report.pdf",
      "numberOfPages": 2,
      "kids": [
        {
          "type": "section",
          "children": [
            {
              "type": "heading",
              "content": "Executive summary",
              "heading level": 1,
              "page number": 1,
              "bounding box": { "x": 0.88, "y": 0.74, "w": 6.15, "h": 0.33 }
            },
            {
              "type": "paragraph",
              "content": "Revenue rose 18% year over year across the managed-services portfolio.",
              "page number": 1,
              "bounding box": { "x": 0.88, "y": 1.21, "w": 6.21, "h": 0.52 }
            },
            {
              "type": "list",
              "children": [
                { "type": "listItem", "content": "Renewals remained above 92%." },
                { "type": "listItem", "content": "Average contract value increased in EMEA." }
              ]
            },
            {
              "type": "table",
              "children": [
                {
                  "type": "tableRow",
                  "children": [
                    { "type": "tableCell", "content": "Region" },
                    { "type": "tableCell", "content": "Growth" }
                  ]
                },
                {
                  "type": "tableRow",
                  "children": [
                    { "type": "tableCell", "content": "North America" },
                    { "type": "tableCell", "content": "21%" }
                  ]
                }
              ]
            },
            {
              "type": "caption",
              "content": "Table 1. Regional growth by quarter."
            }
          ]
        }
      ]
    },
    "artifacts": {
      "markdown_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown",
      "json_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=json",
      "html_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=html",
      "text_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=text",
      "annotated_pdf_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=annotated_pdf"
    },
    "metadata": {
      "engine": "docushell_parse",
      "output_mode": "all",
      "include_header_footer": true,
      "use_struct_tree": true,
      "sanitize": true,
      "reading_order": "xycut",
      "table_method": "cluster",
      "keep_line_breaks": true
    }
  },
  "metrics": {
    "queue_wait_ms": 214,
    "duration_ms": 1789
  },
  "links": {
    "status": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT",
    "download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download"
  }
}
The JSON artifact preserves reading order and exposes the structured document tree through `numberOfPages` and `kids`.
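To map nodes back to source pages, a review UI typically needs the located nodes grouped by page. The sketch below does that for a document shaped like the sample above. It assumes the key names shown there ("page number", "bounding box", "kids", "children") and skips nodes without coordinates; the units of the bounding box are not stated in this reference, so the values are passed through unchanged. `regions_by_page` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Any

Node = dict[str, Any]


def regions_by_page(document: Node) -> dict[int, list[Node]]:
    """Group nodes that carry a page number and bounding box by page, in document order."""
    pages: dict[int, list[Node]] = defaultdict(list)
    stack = [document]
    while stack:
        node = stack.pop()
        page = node.get("page number")
        bbox = node.get("bounding box")
        if page is not None and bbox is not None:
            pages[page].append(
                {"type": node.get("type"), "content": node.get("content"), "bbox": bbox}
            )
        # Assumption: children may appear under "kids" or "children".
        children = list(node.get("kids", [])) + list(node.get("children", []))
        stack.extend(reversed(children))  # reversed so the first child is visited first
    return dict(pages)
```

Applied to the sample response, page 1 would contain the heading and the paragraph, while the list, table, and caption nodes are skipped because the sample shows no coordinates for them.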

JSON artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=json" \
  -H "Authorization: Bearer YOUR_API_KEY"
Use `format=json` to stream the structured document artifact directly.

Markdown artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown" \
  -H "Authorization: Bearer YOUR_API_KEY"
Markdown downloads preserve the extracted reading order in a plain-text friendly artifact.

HTML artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=html" \
  -H "Authorization: Bearer YOUR_API_KEY"
HTML downloads are available when the parse job requested HTML output.

Plain text artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=text" \
  -H "Authorization: Bearer YOUR_API_KEY"
Plain text downloads are available when the parse job requested output_mode=all or included text in formats.

Annotated PDF debug artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=annotated_pdf" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.annotated.pdf
Annotated PDF downloads are visual debug artifacts for validating extracted structure against the source page.

Markdown with HTML download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown_with_html" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.with-html.md
`markdown_with_html` preserves richer inline/table markup inside a Markdown-family artifact.

Markdown with images download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown_with_images" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.with-images.md
When `image_output=external` emits sidecars, this download may be a zip containing Markdown plus image assets.
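Because the same download may be either a Markdown file or a zip, a client can inspect the bytes before choosing a file name. This is a minimal sketch using Python's standard `zipfile.is_zipfile`; the helper name and the `.md` / `.zip` naming are illustrative assumptions, and a response header may also indicate the type, which this reference does not document.

```python
import io
import zipfile


def artifact_filename(payload: bytes, stem: str = "document.with-images") -> str:
    """Return a file name with .zip for a zip bundle and .md for plain Markdown."""
    if zipfile.is_zipfile(io.BytesIO(payload)):
        return f"{stem}.zip"
    return f"{stem}.md"
```

After downloading the artifact into memory, write it to `artifact_filename(payload)` instead of a hard-coded `.md` path so external image sidecars are not saved under a misleading extension.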

Tagged PDF download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=tagged_pdf" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.tagged.pdf
Tagged PDF output is automated structure inference and should be reviewed before accessibility compliance claims.

Artifacts

  • The JSON artifact is the structured representation. It includes the document root, numberOfPages, and the hierarchical kids array.
  • Node types capture semantic structure such as headings, paragraphs, lists, list items, tables, rows, cells, and captions.
  • Where available, nodes include bounding boxes so you can map structured content back to source pages.
  • The Parse Playground uses those same node types and bounding boxes to draw layout boxes, category tags, and reading-order numbers in the Annotated PDF viewer.
  • Markdown is the flattened companion artifact for indexing, previews, search pipelines, and quick human review.
  • For RAG pipelines, index Markdown chunks and attach JSON metadata such as page number, node type, heading path, and bounding box.
  • HTML is an optional companion artifact for styled downstream rendering and review when output_mode or formats requests it.
  • Plain text is available for search, RAG, and simple ingestion pipelines.
  • Annotated PDF is an optional visual debug artifact for comparing detected structure to the source page.
  • markdown_with_html is available as an explicit format when you want Markdown output with richer inline/table markup retained.
  • markdown_with_images is available as an explicit format. Embedded images produce a self-contained Markdown file; external image sidecars are bundled into one zip.
  • Only one markdown-style artifact can be requested in a single job: markdown, markdown_with_html, or markdown_with_images.
  • tagged_pdf is available as an explicit format for accessibility review workflows. It is not a PDF/UA compliance guarantee.

Poll And Download

  • Poll GET /v1/jobs/:jobId until status becomes done or failed.
  • When the job completes, the status payload includes public artifact links under result.artifacts.
  • Use GET /v1/jobs/:jobId/download?format=json for the structured document, format=markdown for the Markdown companion, format=html for HTML, format=text for plain text, format=annotated_pdf for the visual debug artifact, format=markdown_with_html for richer Markdown, format=markdown_with_images for image-capable Markdown, and format=tagged_pdf for tagged PDF output.
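The polling step above can be sketched as a small loop. The example uses only the Python standard library; the base URL mirrors the curl examples, while the two-second interval, the five-minute timeout, and the function names are assumptions rather than documented recommendations. The status fetcher is passed in so the loop can be exercised without network access.

```python
import json
import time
import urllib.request
from typing import Any, Callable

BASE_URL = "https://api.docushell.com/api"  # as used in the curl examples


def fetch_json(path: str, api_key: str) -> dict[str, Any]:
    """GET a JSON resource with the bearer token attached."""
    request = urllib.request.Request(
        BASE_URL + path, headers={"Authorization": f"Bearer {api_key}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_job(
    job_id: str,
    fetch_status: Callable[[str], dict[str, Any]],
    interval_s: float = 2.0,   # assumption: polling interval is not specified here
    timeout_s: float = 300.0,  # assumption: give up after five minutes
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Poll /v1/jobs/:jobId until status is done or failed, then return the payload."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status(f"/v1/jobs/{job_id}")
        if status.get("status") in ("done", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout_s} seconds")
        sleep(interval_s)
```

A typical call would be `wait_for_job(job_id, lambda path: fetch_json(path, api_key))`, after which the artifact links can be read from `result.artifacts` in the returned payload.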

Failure Notes

  • invalid_pdf covers invalid file types and malformed uploads rejected before the worker starts.
  • corrupt_pdf is reserved for damaged PDFs that fail deeper validation or parser execution.
  • password_protected is returned when the document requires a password.
  • ocr_required is returned for scans or image-only PDFs when hybrid OCR is disabled, unavailable, or still produces too little extractable text.
  • invalid_page_range is returned when the submitted page selector is malformed or selects no valid pages.
  • page_limit_exceeded is returned when the requested page set is larger than the plan-specific parse cap.
  • server_busy or backend_unavailable indicate temporary capacity problems. Retry with the same Idempotency-Key when safe.

Password-protected PDF

400 password_protected

The document cannot be parsed until it is decrypted outside the public API lane.

400 error

json

{
  "error": {
    "code": "password_protected",
    "message": "This PDF is password-protected and cannot be parsed without a password.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

OCR required

400 ocr_required

The parser could not extract text from a scan or image-only file.

400 error

json

{
  "error": {
    "code": "ocr_required",
    "message": "This PDF appears to require OCR before it can be parsed.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Invalid page range

400 invalid_page_range

The submitted selector is malformed or does not resolve to valid pages.

400 error

json

{
  "error": {
    "code": "invalid_page_range",
    "message": "The requested page_range is invalid for this PDF.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Plan page limit exceeded

400 page_limit_exceeded

The requested document or selected page range is larger than the active plan allows.

400 error

json

{
  "error": {
    "code": "page_limit_exceeded",
    "message": "Requested page range exceeds your plan limit.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Authentication failure

401 invalid_api_key

Returned when the bearer token is missing, revoked, expired, or not allowed to use the API lane.

401 error

json

{
  "error": {
    "code": "invalid_api_key",
    "message": "Invalid API key.",
    "type": "auth_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Idempotency conflict

409 idempotency_key_reused

Returned when the same Idempotency-Key is reused with a different payload than the original request.

409 error

json

{
  "error": {
    "code": "idempotency_key_reused",
    "message": "This Idempotency-Key was already used with a different request.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Webhook access disabled

403 webhook_access_disabled

Returned when a Starter API key submits webhook fields. Pro, Growth, and Scale include webhooks.

403 error

json

{
  "error": {
    "code": "webhook_access_disabled",
    "message": "Webhooks are available on Pro, Growth, and Scale. Starter includes API access without webhooks.",
    "type": "billing_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Rate limit

429 rate_limit_exceeded

Returned when the API key or caller fingerprint exceeds the configured request rate.

429 error

json

{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded.",
    "type": "rate_limit_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
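
The error envelopes above share one shape, so a client can route on `error.code`. The sketch below separates codes this reference describes as temporary (server_busy, backend_unavailable) from those that need a changed request. Treating rate_limit_exceeded as "wait, then retry" is an assumption on my part, and the returned labels are illustrative.

```python
from typing import Any

# Described in the failure notes as temporary capacity problems.
RETRY_WITH_SAME_KEY = {"server_busy", "backend_unavailable"}
# Assumption: a rate-limited request can be retried after backing off.
BACK_OFF_THEN_RETRY = {"rate_limit_exceeded"}


def classify_error(body: dict[str, Any]) -> str:
    """Return "retry", "back_off", or "do_not_retry" for an error response body."""
    code = body.get("error", {}).get("code", "")
    if code in RETRY_WITH_SAME_KEY:
        return "retry"  # resend with the same Idempotency-Key
    if code in BACK_OFF_THEN_RETRY:
        return "back_off"
    return "do_not_retry"  # e.g. password_protected, invalid_page_range, invalid_api_key
```

Codes such as password_protected or invalid_page_range fall into the last branch because resending the same request cannot succeed until the file or the parameters change.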

Endpoint

Batch Endpoint

Use batch parse when you want one accepted request, one batch status, and one shared set of parse options for multiple PDFs.

POST /v1/parse/batch

Submit up to 10 PDFs as one async parse batch with shared parse options and per-file artifacts.

Auth

Bearer token required on submit, status, and artifact download requests.

Idempotency

Required Idempotency-Key. The default idempotency window is one hour from submit, intentionally shorter than Stripe-style 24-hour idempotency. Reusing the same key and request replays the accepted response; reusing the key with a different request returns 409 idempotency_key_reused.

Content Type

multipart/form-data

Headers

| Name | Type | Required | Location | Description |
| --- | --- | --- | --- | --- |
| Authorization | Bearer <API_KEY> | Yes | header | User-owned API key created in the DocuShell dashboard. |
| Idempotency-Key | string | Yes | header | Required for every batch submit. Use a fresh key per logical batch and reuse it only for the exact same retry. |

Request Fields

| Name | Type | Required | Location | Description |
| --- | --- | --- | --- | --- |
| files[] | file[] | Yes | multipart | PDF uploads for the batch. Every file is preflighted before enqueue; one invalid file rejects the whole submit. |
| page_range | string | No | multipart | Shared page selector applied to every file, such as 1-3,5. Selected pages count toward the batch total page limit. |
| include_header_footer | boolean | No | multipart | Shared setting. Set to true to keep repeated headers and footers in extracted output. Default: false |
| use_struct_tree | boolean | No | multipart | Shared setting. Set to true to prefer native tagged-PDF structure when available. Default: false |
| sanitize | boolean | No | multipart | Shared setting. Set to true to mask email addresses, URLs, and phone numbers in extracted output. Default: false |
| reading_order | `xycut` or `off` | No | multipart | Shared reading-order strategy. |
| table_method | `default` or `cluster` | No | multipart | Shared table-detection strategy. |
| keep_line_breaks | boolean | No | multipart | Shared setting. Set to true when text-oriented output should preserve original line breaks more closely. Default: false |
| output_mode | `json`, `both`, `html`, or `all` | No | multipart | Backward-compatible artifact bundle selector. Do not send this together with formats. Default: both |
| formats | `json`, `markdown`, `html`, `text`, `annotated_pdf`, `markdown_with_html`, `markdown_with_images`, or `tagged_pdf` | No | multipart | Explicit artifact list. Send repeated fields or a comma-separated value. Do not send this together with output_mode. |
| hybrid_mode | `auto` or `full` | No | multipart | Optional shared hybrid triage override when the hybrid backend is enabled by operations. |
| image_output | `off`, `embedded`, or `external` | No | multipart | Shared image handling for image-capable outputs. |
| x-docushell-webhook-url | string | No | header | Optional public HTTPS endpoint for the terminal batch webhook. URLs with credentials, localhost, private, reserved, or metadata-service addresses are rejected. |
| x-docushell-webhook-secret | string | No | header | Required when x-docushell-webhook-url is present. Must be 16-256 characters with sufficient variety; do not reuse an API key. |
| x-docushell-webhook-endpoint-id | string | No | header | Saved managed webhook endpoint id. Use this instead of sending a per-request webhook URL and secret. |

Request Notes

  • Batch parse is async-only. Submit returns 202; poll GET /v1/parse/batch/:batchId for truth.
  • All files are preflighted before enqueue. Empty, non-PDF, corrupt, password-protected, oversized, invalid-page-range, per-file page-limit, total-byte-limit, and total-page-limit failures reject the whole submit.
  • Default v1 limits are 10 files, 100 MB total upload, 500 selected pages, and 2 batch submits per minute.
  • Parse options are shared across the batch. v1 does not support per-file parse settings.
  • Send either output_mode or formats, not both. Only one markdown-style format (markdown, markdown_with_html, or markdown_with_images) can be requested.
  • The batch lane has separate backpressure. Queue saturation returns 503 server_busy with Retry-After; rate limits return 429.
  • Batch responses report estimated_credits only: each file is estimated at max(10, selected_pages) credits. v1 does not perform final credit settlement on this lane.
  • Every status and download request is owner-scoped. Unknown batches, files, or owner mismatches return 404.
  • Artifacts expire one hour after terminal completion by default. Batch idempotency expires one hour after submit by default.
  • Terminal statuses are completed, completed_with_failures, and failed. No per-file retry is attempted in v1.
  • Webhooks are signed best-effort terminal notifications with short bounded retry. Polling remains the source of truth.
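
The default limits and the credit estimate in the notes above can be mirrored in a client-side preflight. This is a sketch only: the constants restate the documented defaults, 100 MB is interpreted conservatively as 100,000,000 bytes (an assumption), and the function names are hypothetical. Plan-specific limits may differ from these defaults.

```python
MAX_FILES = 10
MAX_TOTAL_BYTES = 100 * 1000 * 1000  # assumption: decimal megabytes, the stricter reading
MAX_SELECTED_PAGES = 500


def estimate_credits(selected_pages_per_file: list[int]) -> int:
    """Each file is estimated at max(10, selected_pages) credits."""
    return sum(max(10, pages) for pages in selected_pages_per_file)


def preflight_batch(file_sizes: list[int], selected_pages_per_file: list[int]) -> list[str]:
    """Return a list of problems; an empty list means the default limits are met."""
    problems: list[str] = []
    if not file_sizes:
        problems.append("batch is empty")
    if len(file_sizes) > MAX_FILES:
        problems.append(f"{len(file_sizes)} files exceeds the limit of {MAX_FILES}")
    if sum(file_sizes) > MAX_TOTAL_BYTES:
        problems.append("total upload size exceeds 100 MB")
    if sum(selected_pages_per_file) > MAX_SELECTED_PAGES:
        problems.append(f"selected pages exceed the limit of {MAX_SELECTED_PAGES}")
    return problems
```

With two files of 8 and 4 selected pages, `estimate_credits([8, 4])` gives 20, since each file is rounded up to the 10-credit minimum.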

Multipart batch submit

bash

curl -X POST "https://api.docushell.com/api/v1/parse/batch" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Idempotency-Key: parse-batch-demo-001" \
  -H "x-docushell-webhook-url: https://example.com/docushell/webhooks" \
  -H "x-docushell-webhook-secret: replace_with_a_long_random_secret" \
  -F "files[]=@./report-q1.pdf;type=application/pdf" \
  -F "files[]=@./report-q2.pdf;type=application/pdf" \
  -F "page_range=1-5" \
  -F "formats=json,markdown"

Accepted batch response

json

{
  "batch_id": "9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab",
  "service": "parse-pdf",
  "status": "queued",
  "counts": {
    "total": 2,
    "queued": 2,
    "processing": 0,
    "completed": 0,
    "failed": 0
  },
  "usage": {
    "total_upload_bytes": 1843200,
    "total_selected_pages": 12
  },
  "estimated_credits": 20,
  "created_at": "2026-05-23T10:12:30.000Z",
  "updated_at": "2026-05-23T10:12:30.000Z",
  "completed_at": null,
  "expires_at": null,
  "webhook_delivery": {
    "status": "pending"
  },
  "metrics": null,
  "files": [
    {
      "file_id": "file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20",
      "status": "queued",
      "page_count": 8,
      "billable_pages": 8,
      "estimated_credits": 10
    },
    {
      "file_id": "file_95ac2fb4-060d-427c-9f86-864747dfb935",
      "status": "queued",
      "page_count": 4,
      "billable_pages": 4,
      "estimated_credits": 10
    }
  ],
  "links": {
    "status": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab",
    "download": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/download"
  }
}
The same `Idempotency-Key` with the same request replays this response. A different request with that key returns `409 idempotency_key_reused`.
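
Deriving a stable Idempotency-Key (sketch)

One way to keep replays and conflicts predictable is to derive the key from the request itself, so an identical request always reuses its key and a changed request gets a new one. This is a sketch under assumptions: `idempotency_key` is a hypothetical helper, and how the server decides that two requests are "the same" (for example, whether file order matters) is not specified here.

```python
import hashlib
import json


def idempotency_key(file_bytes_list, options, prefix="parse-batch"):
    """Derive a stable key from file contents and shared parse options."""
    digest = hashlib.sha256()
    for data in file_bytes_list:
        # Hash each file separately so file boundaries and order affect the key.
        digest.update(hashlib.sha256(data).digest())
    # sort_keys makes the fingerprint independent of dict insertion order.
    digest.update(json.dumps(options, sort_keys=True).encode("utf-8"))
    return f"{prefix}-{digest.hexdigest()[:32]}"
```

A content-derived key also means that deliberately resubmitting identical content inside the idempotency window replays the earlier response. Add a caller-chosen value to `options` when a fresh run is intended.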

Batch status response

json

{
  "batch_id": "9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab",
  "service": "parse-pdf",
  "status": "completed_with_failures",
  "counts": {
    "total": 2,
    "queued": 0,
    "processing": 0,
    "completed": 1,
    "failed": 1
  },
  "usage": {
    "total_upload_bytes": 1843200,
    "total_selected_pages": 12
  },
  "estimated_credits": 20,
  "created_at": "2026-05-23T10:12:30.000Z",
  "updated_at": "2026-05-23T10:14:04.000Z",
  "completed_at": "2026-05-23T10:14:04.000Z",
  "expires_at": "2026-05-23T11:14:04.000Z",
  "webhook_delivery": {
    "status": "delivered"
  },
  "metrics": {
    "queue_wait_ms": 214,
    "duration_ms": 82341
  },
  "files": [
    {
      "file_id": "file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20",
      "status": "completed",
      "page_count": 8,
      "billable_pages": 8,
      "estimated_credits": 10,
      "artifacts": {
        "json_download": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/files/file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20/download?format=json",
        "markdown_download": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/files/file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20/download?format=markdown"
      }
    },
    {
      "file_id": "file_95ac2fb4-060d-427c-9f86-864747dfb935",
      "status": "failed",
      "page_count": 4,
      "billable_pages": 4,
      "estimated_credits": 10,
      "failure_code": "corrupt_pdf"
    }
  ],
  "links": {
    "status": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab",
    "download": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/download"
  }
}
`completed_with_failures` means at least one file produced artifacts and at least one accepted file failed after enqueue. Check each file status before downloading.
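
Splitting a status payload by file outcome (sketch)

The per-file check can be sketched as a small function over the status payload shown above. `split_files` is a hypothetical helper name. It reads only fields that appear in the sample response (`files[].status`, `files[].artifacts`, `files[].failure_code`).

```python
def split_files(batch):
    """Return (download links by file_id, failure codes by file_id)."""
    downloads, failures = {}, {}
    for entry in batch.get("files", []):
        status = entry.get("status")
        if status == "completed":
            # Copy so callers can mutate the result without touching the payload.
            downloads[entry["file_id"]] = dict(entry.get("artifacts", {}))
        elif status == "failed":
            failures[entry["file_id"]] = entry.get("failure_code")
    return downloads, failures
```

Files that are still queued or processing land in neither mapping, so the function is also safe to call on a non-terminal payload.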

Batch ZIP download

bash

curl "https://api.docushell.com/api/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/download" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output parse-batch.zip
The batch ZIP is generated on demand from completed file artifacts and is not a durable artifact itself.

Per-file artifact download

bash

curl "https://api.docushell.com/api/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/files/file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20/download?format=json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output report.json
Use the file-specific download route when you want one artifact instead of the generated batch ZIP.

Artifacts

  • Per-file download links appear under files[].artifacts after each completed file is promoted.
  • The batch ZIP is generated on demand from completed file artifacts, streamed, and removed after the response finishes.
  • Per-file artifact downloads can be retried until the batch expires.
  • The per-file format query must match an artifact requested for the batch.
  • Original filenames are not part of the public status payload or webhook logs.

Poll And Download

  • Poll GET /v1/parse/batch/:batchId until status becomes completed, completed_with_failures, or failed.
  • When the batch completes or partially completes, use GET /v1/parse/batch/:batchId/download for the generated ZIP or use each files[].artifacts.*_download link for a specific file artifact.
  • If a file's status is failed, its per-file download returns 409 batch_file_failed. You can keep downloading the completed files until expires_at.
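
Polling loop (sketch)

The polling step above can be sketched as a loop that stops on any of the three terminal statuses. `poll_until_terminal` and its parameters are hypothetical. `fetch_status` stands for whatever function performs the authenticated GET and returns the decoded JSON payload. The backoff values are illustrative choices, not documented recommendations.

```python
import time

TERMINAL = {"completed", "completed_with_failures", "failed"}


def poll_until_terminal(fetch_status, interval=2.0, max_interval=30.0,
                        timeout=900.0, sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until the batch status is terminal or time runs out."""
    deadline = clock() + timeout
    while True:
        batch = fetch_status()
        if batch["status"] in TERMINAL:
            return batch
        if clock() + interval > deadline:
            raise TimeoutError("batch did not reach a terminal status in time")
        sleep(interval)
        # Back off gradually so long-running batches are polled less often.
        interval = min(interval * 2, max_interval)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.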

Failure Notes

  • invalid_pdf, corrupt_pdf, password_protected, invalid_page_range, and page_limit_exceeded can be returned during preflight before a batch is accepted.
  • server_busy with Retry-After means the dedicated batch queue or active batch lane is saturated. Retry later with the same Idempotency-Key only for the exact same request.
  • Download 400 means the format in the query was not among the formats requested for this batch.
  • Download 425 batch_not_ready means the batch or file is not terminal yet.
  • Download 409 batch_file_failed means that accepted file reached a terminal failed state.
  • Download 410 output_expired means the artifact TTL has passed.
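
Mapping download statuses to next steps (sketch)

The download outcomes listed above can be folded into one dispatch function. This is an illustrative sketch: `download_action` and the action labels are hypothetical names, and the 404 and 429 entries are taken from the owner-scoping and rate-limit notes elsewhere on this page.

```python
def download_action(status_code):
    """Map a download response status to a coarse next step."""
    if 200 <= status_code < 300:
        return "save"
    actions = {
        400: "fix_format",   # format was not requested for this batch
        404: "check_ids",    # unknown batch or file, or owner mismatch
        409: "skip_file",    # batch_file_failed: this file has no artifacts
        410: "expired",      # output_expired: artifact TTL has passed
        425: "poll_again",   # batch_not_ready: not terminal yet
        429: "slow_down",    # rate limit exceeded
    }
    return actions.get(status_code, "investigate")
```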

Batch not ready

425 batch_not_ready

Returned when a batch or file download is attempted before the batch or file reaches a terminal status.

425 error

json

{
  "error": {
    "code": "batch_not_ready",
    "message": "Batch is not ready.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Batch file failed

409 batch_file_failed

Returned when a per-file artifact is requested for a file that failed after enqueue.

409 error

json

{
  "error": {
    "code": "batch_file_failed",
    "message": "Batch file failed.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Expired output

410 output_expired

Returned after the one-hour default TTL for completed batch artifacts passes.

410 error

json

{
  "error": {
    "code": "output_expired",
    "message": "Batch output expired.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Queue saturated

503 server_busy

Returned when batch-specific queue backpressure rejects the submit.

503 error

json

{
  "error": {
    "code": "server_busy",
    "message": "The parse batch queue is busy. Retry later.",
    "type": "internal_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
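
Honoring Retry-After (sketch)

Because 503 server_busy arrives with a Retry-After header, a client can turn that header into a bounded wait before resubmitting the same request with the same Idempotency-Key. A minimal sketch follows. `retry_delay_seconds`, its default, and its cap are hypothetical choices. It accepts both forms that HTTP allows for Retry-After (delay in seconds or an HTTP date), since this page does not say which one the API sends.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay_seconds(retry_after, now=None, default=5.0, cap=120.0):
    """Convert a Retry-After header value into a bounded delay in seconds."""
    if retry_after is None:
        return default
    value = retry_after.strip()
    if value.isdigit():
        delay = float(value)
    else:
        try:
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            # Unparseable header: fall back to the default wait.
            return default
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        delay = (when - now).total_seconds()
    # Never wait a negative time, and never longer than the cap.
    return max(0.0, min(delay, cap))
```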

Authentication failure

401 invalid_api_key

Returned when the bearer token is missing, revoked, expired, or not allowed to use the API lane.

401 error

json

{
  "error": {
    "code": "invalid_api_key",
    "message": "Invalid API key.",
    "type": "auth_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Idempotency conflict

409 idempotency_key_reused

Returned when the same Idempotency-Key is reused with a different payload than the original request.

409 error

json

{
  "error": {
    "code": "idempotency_key_reused",
    "message": "This Idempotency-Key was already used with a different request.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Webhook access disabled

403 webhook_access_disabled

Returned when a Starter API key submits webhook fields. Pro, Growth, and Scale include webhooks.

403 error

json

{
  "error": {
    "code": "webhook_access_disabled",
    "message": "Webhooks are available on Pro, Growth, and Scale. Starter includes API access without webhooks.",
    "type": "billing_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Rate limit

429 rate_limit_exceeded

Returned when the API key or caller fingerprint exceeds the configured request rate.

429 error

json

{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded.",
    "type": "rate_limit_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Webhooks

Completion Webhooks

Use webhook_url and webhook_secret for per-request completion callbacks, or send the x-docushell-webhook-url and x-docushell-webhook-secret headers on batch parse requests.

Receivers must validate x-docushell-signature, deduplicate by x-docushell-delivery, and respond within the 10-second request timeout. Use public HTTPS staging endpoints or approved tunnels for receiver tests.
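
Receiver sketch

The receiver duties above (validate, deduplicate, answer quickly) can be sketched as follows. The signature scheme is an assumption: this page does not say how x-docushell-signature is computed, so the sketch assumes a hex-encoded HMAC-SHA256 of the raw request body keyed with the webhook secret. Confirm the real scheme before relying on it. `signature_matches`, `DeliveryLog`, and `handle_webhook` are hypothetical names, and header keys are assumed to be lowercased by the web framework.

```python
import hashlib
import hmac


def signature_matches(secret, raw_body, signature_header):
    # ASSUMPTION: hex HMAC-SHA256 over the raw body; the real scheme may differ.
    expected = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    received = signature_header.strip().lower()
    # Constant-time comparison on bytes avoids timing leaks.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))


class DeliveryLog:
    """In-memory dedupe store; a shared store would be needed across processes."""

    def __init__(self):
        self._seen = set()

    def first_time(self, delivery_id):
        if delivery_id in self._seen:
            return False
        self._seen.add(delivery_id)
        return True


def handle_webhook(headers, raw_body, secret, log):
    """Return 'reject', 'duplicate', or 'accept' for one delivery."""
    signature = headers.get("x-docushell-signature", "")
    delivery_id = headers.get("x-docushell-delivery", "")
    if not delivery_id or not signature_matches(secret, raw_body, signature):
        return "reject"
    if not log.first_time(delivery_id):
        return "duplicate"  # acknowledge without reprocessing
    # Hand real work to a background queue so the response stays fast.
    return "accept"
```

Since polling remains the source of truth, an accepted event is best treated as a prompt to fetch the status route, not as the final state.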

Terminal event names include pdf.parse.completed, pdf.parse.failed, pdf.parse.batch.completed, pdf.parse.batch.completed_with_failures, pdf.parse.batch.failed, resume.parse.completed, resume.parse.failed, resume.batch.completed, resume.batch.completed_with_failures, and resume.batch.failed.

Artifacts

Artifact Downloads

The shared download route streams one artifact at a time.

JSON artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=json" \
  -H "Authorization: Bearer YOUR_API_KEY"
Use `format=json` to stream the structured document artifact directly.

Markdown artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown" \
  -H "Authorization: Bearer YOUR_API_KEY"
Markdown downloads preserve the extracted reading order in a plain-text friendly artifact.

HTML artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=html" \
  -H "Authorization: Bearer YOUR_API_KEY"
HTML downloads are available when the parse job requested HTML output.

Plain text artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=text" \
  -H "Authorization: Bearer YOUR_API_KEY"
Plain text downloads are available when the parse job requested all artifacts or explicit text output.

Annotated PDF debug artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=annotated_pdf" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.annotated.pdf
Annotated PDF downloads are visual debug artifacts for validating extracted structure against the source page.

Markdown with HTML download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown_with_html" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.with-html.md
`markdown_with_html` preserves richer inline/table markup inside a Markdown-family artifact.

Markdown with images download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown_with_images" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.with-images.md
When `image_output=external` emits sidecars, this download may be a zip containing Markdown plus image assets.
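
Choosing a file extension for this download (sketch)

Because the same route may return either Markdown or a zip, a client can inspect the downloaded bytes before naming the file. The sketch below uses the standard-library ZIP check. `artifact_filename` is a hypothetical helper, and inspecting the payload rather than a response header is a design choice made here because this page does not document the Content-Type for each case.

```python
import io
import zipfile


def artifact_filename(stem, payload):
    """Return '<stem>.zip' when payload is a ZIP archive, else '<stem>.md'."""
    is_zip = zipfile.is_zipfile(io.BytesIO(payload))
    return f"{stem}.zip" if is_zip else f"{stem}.md"
```

Usage: read the response body into memory (or a temporary file), call `artifact_filename("document.with-images", body)`, then write the bytes under the returned name.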

Tagged PDF download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=tagged_pdf" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.tagged.pdf
Tagged PDF output is automated structure inference and should be reviewed before accessibility compliance claims.

Artifacts

JSON Artifact Versus Markdown Artifact

Use the JSON artifact when you need structure, semantic content types, or document geometry. It is the canonical machine-readable representation and the right choice for pipelines that need sections, lists, tables, headings, captions, or positional metadata.

Use the Markdown artifact when you need a lightweight, portable text representation that still follows the extracted reading order. It works well for previews, quick QA, search indexing, and downstream LLM ingestion.

The parse status payload also includes result.metadata so you can inspect which extraction-tuning options were applied to a completed job.
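
Walking the JSON artifact (sketch)

To illustrate why the JSON artifact suits structural pipelines, the sketch below tallies node kinds by walking the tree through its `kids` arrays. Only `kids` is taken from this reference. The `type` key used to label each node is a hypothetical field name, so check a real artifact for the actual key before adapting this.

```python
from collections import Counter


def count_node_types(node, counts=None):
    """Recursively count nodes by their (assumed) 'type' field."""
    counts = Counter() if counts is None else counts
    node_type = node.get("type")  # ASSUMPTION: hypothetical field name
    if node_type:
        counts[node_type] += 1
    for child in node.get("kids", []):
        count_node_types(child, counts)
    return counts
```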

Support

Troubleshooting Parse Failures

Most parse failures are actionable before you retry. Keep the same Idempotency-Key only when you are replaying the exact same logical request after a timeout or transport issue.

  • invalid_pdf: confirm the upload is a real PDF before retrying.
  • corrupt_pdf: re-export or repair the file, then resubmit.
  • password_protected: decrypt the PDF before uploading. Password submission is not part of this public lane yet.
  • ocr_required: run OCR upstream first, then resubmit the text-native PDF.
  • invalid_page_range: retry with a selector like 1-3,5 that resolves inside the document bounds.
  • Unexpected text layout: retry with reading_order=off or reading_order=xycut, depending on whether you want less or more reading-order reconstruction.
  • Weak table extraction: retry with table_method=cluster if the default path misses cell groupings.
If the gateway returns server_busy or backend_unavailable, retry safely with the same Idempotency-Key after the transient issue clears.