Flagship Reference

Parse PDF

Parse PDF turns uploaded PDFs into a structured document tree plus Markdown, HTML, plain text, and optional annotated PDF debug artifacts. Use /v1/parse for one file or /v1/parse/batch for up to 10 PDFs with shared parse options.

Reference · 15 min


Getting started: Review auth, idempotency, and the shared jobs flow first.
Download formats: Jump to JSON, Markdown, HTML, text, and annotated PDF artifact handling.
Batch parse: Submit multiple PDFs, poll batch status, and download per-file artifacts or a generated ZIP.
RAG ingestion: Use DocuShell JSON and Markdown artifacts for chunking, retrieval metadata, and citations.
Try Parse live: Open the parser playground with annotated PDF overlays, extracted blocks, JSON, and live API execution.

Overview

What You Get Back

Completed parse jobs expose structured JSON plus optional Markdown, HTML, plain text, annotated PDF debug artifacts, richer Markdown, image-capable Markdown, and tagged PDF output. The JSON artifact is the structured representation for downstream automation. Markdown and text are text-friendly companions for indexing, previews, and human review.

The structured JSON preserves reading order and emits a hierarchical document whose root carries numberOfPages and a kids array, with semantic nodes for headings, paragraphs, lists, list items, tables, rows, cells, captions, and images when detected.

When the underlying output includes layout coordinates, nodes can also expose bounding boxes so you can map extracted content back to the source pages.

JSON artifact: structured document tree for automation and indexing.
Markdown artifact: readable text export with the same reading-order orientation.
HTML artifact: styled companion for rendering and review.
Plain text artifact: lightweight output for search, RAG, and simple ingestion.
Annotated PDF artifact: visual debug output for comparing detected structure to source pages.
Markdown with HTML artifact: Markdown-family output that keeps richer inline/table markup.
Markdown with images artifact: explicit image-capable Markdown; external sidecars are bundled into one zip.
Only one markdown-style artifact can be requested per job because the parse engine emits one Markdown-family file per run.
Tagged PDF artifact: automated structure inference for accessibility review, not a PDF/UA compliance guarantee.
Header and footer content stays excluded by default unless include_header_footer=true is requested.
Tagged PDFs can prefer their native structure tree when use_struct_tree=true is supplied.
Sanitization, reading-order, table, line-break, hybrid-mode, and image-output settings tune extraction behavior while output_mode and formats select emitted artifacts.

RAG

RAG Ingestion Workflow

Use DocuShell Parse PDF when a search, RAG, review, or agent workflow needs readable chunks plus source metadata.

For most RAG pipelines, request formats=json,markdown. Treat Markdown as the primary text to chunk and embed, then attach JSON metadata from the same parse job so retrieved passages can point back to source pages and bounding boxes.

Do not chunk PDFs by blind character windows first. Start with document semantics: headings, paragraphs, lists, captions, and tables. Merge short neighboring elements when needed, keep tables intact when their structure matters, and carry page numbers plus bounding boxes into vector metadata.

The JSON artifact is also the audit layer. Store enough metadata to reproduce a citation, highlight a source region in a review UI, and debug bad retrieval without reparsing the PDF.

Submit with formats=json,markdown for the common RAG pair.
Use use_struct_tree=true for tagged PDFs when you want DocuShell to prefer reliable native structure tags.
Keep DocuShell's default rendering-mismatch defenses enabled for untrusted PDFs so hidden, off-page, tiny, or transparent text is less likely to enter model context.
Keep the default header/footer exclusion unless repeated page furniture is important to the answer.
Use sanitize=true only when your pipeline should mask visible emails, URLs, and phone numbers in the extracted output.
Store per-chunk metadata such as source file ID, page number, heading path, node type, and bounding box when available.
Keep table nodes or Markdown tables as standalone chunks when row and column relationships are important.

RAG answers should cite DocuShell metadata, not just the text string sent to the model. Page and bounding-box metadata make citations inspectable.
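
The heading-first chunking advice above can be sketched in a few lines. This is a minimal illustration, not DocuShell code: `chunk_markdown_by_heading` is a hypothetical helper that assumes the Markdown artifact uses ATX headings (`#` to `######`). Because it only splits at heading lines, Markdown tables stay inside a single chunk. It does not special-case fenced code, so a `#` line inside a code fence would be treated as a heading.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_markdown_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading section and record the heading path."""
    chunks: list[dict] = []
    path: list[str] = []    # current heading trail, outermost first
    levels: list[int] = []  # heading levels that correspond to `path`
    body: list[str] = []    # lines collected for the chunk in progress

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading_path": list(path), "content": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # close the previous section before the trail changes
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then push the new one.
            while levels and levels[-1] >= level:
                levels.pop()
                path.pop()
            levels.append(level)
            path.append(match.group(2))
        body.append(line)
    flush()
    return chunks
```

Short neighboring chunks can be merged afterwards, and the `heading_path` list can be stored alongside the page and bounding-box metadata taken from the JSON artifact.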

Implementation

RAG Examples

These examples show the DocuShell API request and the downstream metadata shape to keep with embeddings.

Submit a RAG-ready parse job

bash

curl -X POST "https://api.docushell.com/api/v1/parse" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Idempotency-Key: rag-parse-001" \
  -F "file=@./policy-handbook.pdf;type=application/pdf" \
  -F "formats=json,markdown" \
  -F "reading_order=xycut" \
  -F "use_struct_tree=true"

`use_struct_tree=true` is helpful when the source PDF has reliable native tags. If usable tags are not present, continue validating the output in the Parse Playground and downloaded artifacts.

DocuShell RAG flow

text

PDF upload
  -> POST /api/v1/parse with formats=json,markdown
  -> poll GET /api/v1/jobs/:jobId
  -> download Markdown and JSON artifacts
  -> chunk Markdown by headings, paragraphs, lists, and tables
  -> attach JSON metadata: page, bounding box, node type, heading path
  -> store embeddings plus metadata
  -> retrieve chunks and cite the source page or region

Chunk metadata shape

json

{
  "content": "## Data Retention\n\nCustomer documents are retained for the configured retention window...",
  "metadata": {
    "source_file_id": "file_01JX...",
    "source_name": "policy-handbook.pdf",
    "section": "Data Retention",
    "node_types": ["heading", "paragraph"],
    "page_start": 3,
    "page_end": 4,
    "bounding_boxes": [
      { "page": 3, "bbox": { "x": 0.88, "y": 1.21, "w": 6.21, "h": 0.52 } }
    ]
  }
}

Use the exact metadata keys your system prefers, but preserve DocuShell page and bounding-box data when available.
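
One way to assemble that record from parsed nodes is sketched below. `build_chunk` is a hypothetical helper, and it assumes each node is a dict carrying the `type`, `content`, `page number`, and `bounding box` keys shown in the parse job status example on this page. Nodes without page or box data are tolerated.

```python
def build_chunk(nodes: list[dict], source_file_id: str, source_name: str, section: str) -> dict:
    """Combine neighboring nodes into one chunk with page and bounding-box metadata."""
    pages = [node["page number"] for node in nodes if "page number" in node]
    boxes = [
        {"page": node["page number"], "bbox": node["bounding box"]}
        for node in nodes
        if "page number" in node and "bounding box" in node
    ]
    # Keep node types in first-seen order without duplicates.
    node_types: list[str] = []
    for node in nodes:
        node_type = node.get("type")
        if node_type and node_type not in node_types:
            node_types.append(node_type)
    return {
        "content": "\n\n".join(node["content"] for node in nodes if node.get("content")),
        "metadata": {
            "source_file_id": source_file_id,
            "source_name": source_name,
            "section": section,
            "node_types": node_types,
            "page_start": min(pages) if pages else None,
            "page_end": max(pages) if pages else None,
            "bounding_boxes": boxes,
        },
    }
```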

Structure

Tagged PDFs And Structure Trees

Tagged PDFs can provide stronger semantic structure than coordinate-only extraction when the tags are present and trustworthy.

When a source PDF includes usable structure tags, use_struct_tree=true tells DocuShell to prefer that structure for reading order and semantic hierarchy. This can improve headings, lists, table relationships, and natural chunk boundaries for RAG pipelines.

Real document collections are mixed. Some PDFs are well tagged, some have no tags, and some have tags that are not useful enough for downstream retrieval. Always inspect representative outputs before relying on a single extraction strategy.

Use use_struct_tree=true for tagged policy documents, manuals, reports, and accessible PDFs where author-defined structure is expected.
Use the regular layout-aware path for untagged or poorly tagged files, or compare both settings during integration testing.
Request formats=tagged_pdf when you need a generated tagged PDF artifact for review or accessibility workflows.
Do not treat formats=tagged_pdf as a PDF/UA compliance guarantee. Review the artifact before making accessibility claims.
For RAG chunking, start new chunks at major headings, preserve heading-plus-paragraph groups, and avoid splitting tables across chunks.
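
When comparing both settings during integration testing, a rough first signal is how the node-type counts differ between the two JSON artifacts. The sketch below is an illustration only: `count_node_types` and `compare_runs` are hypothetical helpers, and the walker assumes nested nodes appear under either `kids` or `children`, as in the status example on this page.

```python
from collections import Counter


def count_node_types(document: dict) -> Counter:
    """Count nodes by type across the whole document tree."""
    counts: Counter = Counter()
    stack = list(document.get("kids", []))
    while stack:
        node = stack.pop()
        counts[node.get("type", "unknown")] += 1
        # Nested nodes are assumed to live under "kids" or "children".
        stack.extend(node.get("kids", []))
        stack.extend(node.get("children", []))
    return counts


def compare_runs(layout_doc: dict, struct_tree_doc: dict) -> dict:
    """Return {type: (layout_count, struct_tree_count)} for types whose counts differ."""
    layout = count_node_types(layout_doc)
    tagged = count_node_types(struct_tree_doc)
    return {
        node_type: (layout[node_type], tagged[node_type])
        for node_type in sorted(set(layout) | set(tagged))
        if layout[node_type] != tagged[node_type]
    }
```

A large swing in heading or list counts between the two runs is a prompt to inspect the document by eye, not a verdict on which setting is better.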

Inspection

Playground Inspection Views

The Parse Playground lets developers inspect both the visual overlay and the structured data behind it.

Annotated PDF Viewer

Renders the selected PDF pages and overlays layout boxes, category tags, and reading-order numbers from parser bounding boxes.

Blocks / Tables Output

Lists extracted nodes as data: order, category type, page number, bounding-box coordinates, and extracted text or table content when available.

JSON, Markdown, And Text

Switch tabs to compare the raw structured JSON with Markdown and plain text companion artifacts from the same parse job.

Capabilities

What Parse Supports Today

The public parse lane exposes curated parser controls while OCR/enrichment remain backend-profile settings.

Capability	Availability	Notes
JSON artifact	Available	Hierarchical document output for automation, indexing, and structured QA.
Markdown artifact	Available	Text-first companion download for previews, search, and LLM ingestion.
HTML artifact	Available	Optional styled companion download for rendering and review.
Plain text artifact	Available	Optional lightweight text output for search, RAG, and simple ingestion.
Annotated PDF artifact	Available	Optional visual debug artifact for validating extracted structure.
Annotated PDF playground viewer	Available	The playground renders PDF pages with layout boxes, category tags, and reading-order numbers when block geometry is present.
Blocks / tables output	Available when present	The playground lists parser nodes with order, category type, page number, bounding box, and extracted content.
Markdown with HTML	Available	Request `formats=markdown_with_html` when Markdown should retain richer inline/table markup.
Markdown with images	Available	Request `formats=markdown_with_images`; external sidecars download as a zip.
Tagged PDF output	Available	Request `formats=tagged_pdf`; review before making accessibility compliance claims.
Batch parse	Available	Use `/v1/parse/batch` for up to 10 PDFs with shared parse options, per-file statuses, and per-file artifacts.
Reading-order preservation	Available	Structured and text-oriented artifacts follow the detected reading order.
Heading and list detection	Available	Headings plus numbered, bulleted, and nested lists are represented when detected.
Table extraction	Available / backend-gated	Structured tables are emitted when detected; complex or borderless tables may require the hybrid backend.
Image extraction with coordinates	Available when present	Image nodes and coordinates can appear in JSON; use `markdown_with_images` for image-capable Markdown.
Tagged PDF structure	Available	Use `use_struct_tree=true` to prefer native structure tags when a tagged PDF provides them.
Sanitization	Available	Use `sanitize=true` to mask email addresses, URLs, and phone numbers in extracted output.
Reading-order override	Available	Use `reading_order=xycut|off` when you need an explicit reading-order setting.
Table-method override	Available	Use `table_method=default|cluster` for light table extraction tuning.
Keep line breaks	Available	Use `keep_line_breaks=true` when text-oriented output should preserve original line breaks more closely.
Header/footer inclusion	Available	Use `include_header_footer=true` when you need repeated page furniture in the output.
Request hybrid mode	Backend-gated	Use `hybrid_mode=auto|full` only when the DocuShell hybrid backend is enabled.
Image output mode	Available	Use `image_output=off|embedded|external` for image-capable outputs.
OCR / scanned PDFs	Backend-gated	Available when the DocuShell hybrid OCR profile is active; otherwise scans return `ocr_required`.
Formula/chart enrichment	Backend-gated	Available only when the active DocuShell backend profile includes those enrichments.

Endpoint

Single-File Endpoint

POST /v1/parse

Submit a PDF for queued parsing and receive structured JSON plus Markdown, HTML, plain text, and annotated PDF debug output.

Auth

Bearer token required on submit, status, and artifact download requests.

Idempotency

Server-minted job_id values with optional Idempotency-Key replay support.

Content Type

multipart/form-data

Headers

Name	Type	Required	Location	Description
Authorization	Bearer <API_KEY>	Yes	header	User-owned API key created in the DocuShell dashboard.
Idempotency-Key	string	No	header	Recommended for safely retrying submit requests without creating duplicate jobs.

Request Fields

Name	Type	Required	Location	Description
file	file	Yes	multipart	PDF upload. The gateway validates PDF magic bytes before forwarding the file.
file_name	string	No	multipart	Optional file name override used for storage metadata and downstream artifact names.
page_range	string	No	multipart	Comma-separated pages or ranges such as `1-3,5,9-11`.
include_header_footer	boolean	No	multipart	Set to `true` to keep header and footer content in the extracted output. Default: false
use_struct_tree	boolean	No	multipart	Set to `true` to prefer native tagged-PDF structure when the source document includes a usable structure tree. Default: false
sanitize	boolean	No	multipart	Set to `true` to mask email addresses, URLs, and phone numbers in extracted output. Default: false
reading_order	`xycut` | `off`	No	multipart	Optional reading-order strategy. Omit it to keep the current default extraction behavior.
table_method	`default` | `cluster`	No	multipart	Optional table-detection strategy. Omit it to keep the current default extraction behavior.
keep_line_breaks	boolean	No	multipart	Set to `true` to preserve source line breaks more aggressively in text-oriented output. Default: false
output_mode	`json` | `both` | `html` | `all`	No	multipart	Backward-compatible artifact bundle selector. `json` keeps only structured JSON, `both` adds Markdown, `html` adds HTML, and `all` returns the common legacy bundle: JSON, Markdown, HTML, text, and annotated PDF. Default: both
formats	`json` | `markdown` | `html` | `text` | `annotated_pdf` | `markdown_with_html` | `markdown_with_images` | `tagged_pdf`	No	multipart	Optional explicit artifact list. Send as repeated fields or a comma-separated value, such as `formats=json,text`.
hybrid_mode	`auto` | `full`	No	multipart	Optional per-job hybrid triage override. Requires the hybrid backend to be enabled by operations.
image_output	`off` | `embedded` | `external`	No	multipart	Controls image handling for image-capable outputs. `markdown_with_images` defaults to embedded images unless `external` is requested.
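
A client can check a `page_range` value before uploading. The sketch below is an assumption about the selector grammar based only on the `1-3,5,9-11` example: comma-separated items, each a single 1-based page or an inclusive `start-end` range. `parse_page_range` is a hypothetical helper; the server remains the authority and may accept or reject forms this function does not.

```python
def parse_page_range(selector: str) -> list[int]:
    """Expand a selector such as "1-3,5,9-11" into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in selector.split(","):
        part = part.strip()
        if not part:
            raise ValueError("page_range contains an empty segment")
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)  # int() raises ValueError on junk
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page segment: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

The length of the returned list is the number of selected pages, which is the figure plan page limits apply to.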

Request Notes

Plan limits are enforced before the job is queued. Starter keeps the 50 MB per-file cap; Pro, Growth, and Scale raise the upload size, the per-PDF and per-job page limits, and concurrency as monthly credits grow.
Set use_struct_tree=true when tagged PDFs should favor their native structure tree. Leave it off for the default reading-order-oriented extraction path.
Structured JSON remains the canonical parse result and is always generated for successful jobs so status responses can keep returning result.document.
sanitize, reading_order, table_method, keep_line_breaks, hybrid_mode, and image_output are extraction-tuning knobs. output_mode and formats control which companion artifacts are emitted.
Request newer artifact types such as markdown_with_html, markdown_with_images, and tagged_pdf with formats; only one markdown-style format (markdown, markdown_with_html, or markdown_with_images) can be requested per job because the parse engine emits one Markdown-family file per run.
DocuShell keeps rendering-mismatch safety filters enabled for Parse PDF output. sanitize=true is a separate optional control for masking visible sensitive data.
OCR, formula extraction, and chart/image descriptions follow the active DocuShell backend profile. They are not per-request fields on the shared public API.
Status polling stays on /v1/jobs/:jobId. Artifact streaming happens through the shared download route with format=json|markdown|html|text|annotated_pdf|markdown_with_html|markdown_with_images|tagged_pdf.
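
The one-Markdown-family-format rule is easy to check before submitting. The following is a minimal client-side sketch; `validate_formats` is a hypothetical helper, and the format names are the ones listed for the `formats` field above.

```python
KNOWN_FORMATS = {
    "json", "markdown", "html", "text", "annotated_pdf",
    "markdown_with_html", "markdown_with_images", "tagged_pdf",
}
MARKDOWN_FAMILY = {"markdown", "markdown_with_html", "markdown_with_images"}


def validate_formats(value: str) -> list[str]:
    """Parse a comma-separated formats value and enforce the documented constraints."""
    formats = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [item for item in formats if item not in KNOWN_FORMATS]
    if unknown:
        raise ValueError(f"unknown formats: {unknown}")
    markdown_like = {item for item in formats if item in MARKDOWN_FAMILY}
    if len(markdown_like) > 1:
        raise ValueError(
            f"only one markdown-style format is allowed per job, got {sorted(markdown_like)}"
        )
    return formats
```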

Multipart submit

bash

curl -X POST "https://api.docushell.com/api/v1/parse" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Idempotency-Key: parse-demo-001" \
  -F "file=@./quarterly-report.pdf;type=application/pdf" \
  -F "file_name=quarterly-report.pdf" \
  -F "page_range=1-3" \
  -F "include_header_footer=true" \
  -F "use_struct_tree=true" \
  -F "sanitize=true" \
  -F "reading_order=xycut" \
  -F "table_method=cluster" \
  -F "keep_line_breaks=true" \
  -F "formats=json,markdown_with_images" \
  -F "image_output=embedded"

Queued response

json

{
  "job_id": "job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT",
  "status": "queued",
  "cost": 2500,
  "service": "parse-pdf",
  "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E",
  "links": {
    "status": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT"
  }
}

Parse job status

json

{
  "job_id": "job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT",
  "status": "done",
  "service": "parse-pdf",
  "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E",
  "result": {
    "document": {
      "fileName": "quarterly-report.pdf",
      "numberOfPages": 2,
      "kids": [
        {
          "type": "section",
          "children": [
            {
              "type": "heading",
              "content": "Executive summary",
              "heading level": 1,
              "page number": 1,
              "bounding box": { "x": 0.88, "y": 0.74, "w": 6.15, "h": 0.33 }
            },
            {
              "type": "paragraph",
              "content": "Revenue rose 18% year over year across the managed-services portfolio.",
              "page number": 1,
              "bounding box": { "x": 0.88, "y": 1.21, "w": 6.21, "h": 0.52 }
            },
            {
              "type": "list",
              "children": [
                { "type": "listItem", "content": "Renewals remained above 92%." },
                { "type": "listItem", "content": "Average contract value increased in EMEA." }
              ]
            },
            {
              "type": "table",
              "children": [
                {
                  "type": "tableRow",
                  "children": [
                    { "type": "tableCell", "content": "Region" },
                    { "type": "tableCell", "content": "Growth" }
                  ]
                },
                {
                  "type": "tableRow",
                  "children": [
                    { "type": "tableCell", "content": "North America" },
                    { "type": "tableCell", "content": "21%" }
                  ]
                }
              ]
            },
            {
              "type": "caption",
              "content": "Table 1. Regional growth by quarter."
            }
          ]
        }
      ]
    },
    "artifacts": {
      "markdown_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown",
      "json_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=json",
      "html_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=html",
      "text_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=text",
      "annotated_pdf_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=annotated_pdf"
    },
    "metadata": {
      "engine": "docushell_parse",
      "output_mode": "all",
      "include_header_footer": true,
      "use_struct_tree": true,
      "sanitize": true,
      "reading_order": "xycut",
      "table_method": "cluster",
      "keep_line_breaks": true
    }
  },
  "metrics": {
    "queue_wait_ms": 214,
    "duration_ms": 1789
  },
  "links": {
    "status": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT",
    "download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download"
  }
}

The JSON artifact preserves reading order and exposes the structured document tree through `numberOfPages` and `kids`.
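
To turn that tree into per-node records for indexing, a depth-first walk is usually enough. The sketch below is an illustration under assumptions: `iter_nodes` is a hypothetical helper, it reads the `heading level`, `page number`, and `bounding box` keys as they appear in the example above, and it looks for nested nodes under both `kids` and `children`.

```python
from typing import Iterator


def iter_nodes(document: dict) -> Iterator[dict]:
    """Yield content-bearing nodes in document order, each with its heading trail."""
    trail: list[tuple[int, str]] = []  # (heading level, heading text), outermost first

    def visit(node: dict) -> Iterator[dict]:
        if node.get("type") == "heading" and node.get("content"):
            level = node.get("heading level", 1)
            # A new heading replaces any heading at the same or a deeper level.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, node["content"]))
        if node.get("content"):
            yield {
                "type": node.get("type"),
                "content": node["content"],
                "page": node.get("page number"),
                "bbox": node.get("bounding box"),
                "heading_path": [title for _, title in trail],
            }
        for child in node.get("kids", []) + node.get("children", []):
            yield from visit(child)

    for kid in document.get("kids", []):
        yield from visit(kid)
```

Container nodes without `content`, such as sections, lists, and tables, are traversed but not emitted; their leaves are.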

JSON artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=json" \
  -H "Authorization: Bearer YOUR_API_KEY"

Use `format=json` to stream the structured document artifact directly.

Markdown artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown" \
  -H "Authorization: Bearer YOUR_API_KEY"

Markdown downloads preserve the extracted reading order in a plain-text-friendly artifact.

HTML artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=html" \
  -H "Authorization: Bearer YOUR_API_KEY"

HTML downloads are available when the parse job requested HTML output.

Plain text artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=text" \
  -H "Authorization: Bearer YOUR_API_KEY"

Plain text downloads are available when the parse job requests all artifacts or explicit text output.

Annotated PDF debug artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=annotated_pdf" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.annotated.pdf

Annotated PDF downloads are visual debug artifacts for validating extracted structure against the source page.

Markdown with HTML download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown_with_html" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.with-html.md

`markdown_with_html` preserves richer inline/table markup inside a Markdown-family artifact.

Markdown with images download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown_with_images" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.with-images.md

When `image_output=external` emits sidecars, this download may be a zip containing Markdown plus image assets.
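
Because the same download may be either a Markdown file or a zip, a client can inspect the bytes before choosing a file name. The sketch below uses the standard-library `zipfile.is_zipfile` check; `artifact_filename` is a hypothetical helper, and it assumes the whole payload is already in memory.

```python
import io
import zipfile


def artifact_filename(stem: str, payload: bytes) -> str:
    """Pick ".zip" when the payload is a zip archive, otherwise ".md"."""
    if zipfile.is_zipfile(io.BytesIO(payload)):
        return f"{stem}.zip"
    return f"{stem}.md"
```

If the response carries a Content-Type or Content-Disposition header, prefer that over sniffing; this page does not document those headers, so the byte check is a fallback.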

Tagged PDF download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=tagged_pdf" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.tagged.pdf

Tagged PDF output is automated structure inference and should be reviewed before accessibility compliance claims.

Artifacts

The JSON artifact is the structured representation. It includes the document root, numberOfPages, and the hierarchical kids array.
Node types capture semantic structure such as headings, paragraphs, lists, list items, tables, rows, cells, and captions.
Where available, nodes include bounding boxes so you can map structured content back to source pages.
The Parse Playground uses those same node types and bounding boxes to draw layout boxes, category tags, and reading-order numbers in the Annotated PDF viewer.
Markdown is the flattened companion artifact for indexing, previews, search pipelines, and quick human review.
For RAG pipelines, index Markdown chunks and attach JSON metadata such as page number, node type, heading path, and bounding box.
HTML is an optional companion artifact for styled downstream rendering and review, emitted when output_mode or formats requests it.
Plain text is available for search, RAG, and simple ingestion pipelines.
Annotated PDF is an optional visual debug artifact for comparing detected structure to the source page.
markdown_with_html is available as an explicit format when you want Markdown output with richer inline/table markup retained.
markdown_with_images is available as an explicit format. Embedded images produce a self-contained Markdown file; external image sidecars are bundled into one zip.
Only one markdown-style artifact can be requested in a single job: markdown, markdown_with_html, or markdown_with_images.
tagged_pdf is available as an explicit format for accessibility review workflows. It is not a PDF/UA compliance guarantee.

Poll And Download

Poll GET /v1/jobs/:jobId until status becomes done or failed.
When the job completes, the status payload includes public artifact links under result.artifacts.
Use GET /v1/jobs/:jobId/download?format=json for the structured document, format=markdown for the Markdown companion, format=html for HTML, format=text for plain text, format=annotated_pdf for the visual debug artifact, format=markdown_with_html for richer Markdown, format=markdown_with_images for image-capable Markdown, and format=tagged_pdf for tagged PDF output.
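
The polling loop can be sketched without committing to an HTTP client. In this illustration `poll_job` is a hypothetical helper and `fetch_status` is a caller-supplied function that performs GET /v1/jobs/:jobId with the bearer token and returns the decoded JSON body. The interval and timeout defaults are arbitrary choices, not documented values.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"done", "failed"}


def poll_job(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch_status until the job reaches a terminal status or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        payload = fetch_status()
        if payload.get("status") in TERMINAL_STATUSES:
            return payload
        if clock() >= deadline:
            raise TimeoutError("job did not reach a terminal status in time")
        sleep(interval_s)


def artifact_links(payload: dict) -> dict:
    """Return the artifact links from a completed status payload, or an empty dict."""
    return (payload.get("result") or {}).get("artifacts") or {}
```

Injecting `sleep` and `clock` keeps the loop testable; production code can leave both at their defaults.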

Failure Notes

invalid_pdf covers invalid file types and malformed uploads rejected before the worker starts.
corrupt_pdf is reserved for damaged PDFs that fail deeper validation or parser execution.
password_protected is returned when the document requires a password.
ocr_required is returned for scans or image-only PDFs when hybrid OCR is disabled, unavailable, or still produces too little extractable text.
invalid_page_range is returned when the submitted page selector is malformed or selects no valid pages.
page_limit_exceeded is returned when the requested page set is larger than the plan-specific parse cap.
server_busy and backend_unavailable indicate temporary capacity problems. Retry with the same Idempotency-Key when safe.
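
A client can separate temporary failures from permanent ones by reading the error code. The sketch below treats only the two codes described above as retryable; the helper names and the exponential backoff schedule are assumptions for illustration, not documented DocuShell behavior.

```python
from typing import Optional

# Codes this page describes as temporary capacity problems.
RETRYABLE_CODES = {"server_busy", "backend_unavailable"}


def error_code(body: dict) -> Optional[str]:
    """Extract error.code from an error response body, if present."""
    return (body.get("error") or {}).get("code")


def should_retry(body: dict) -> bool:
    """True when the error is one of the temporary capacity codes."""
    return error_code(body) in RETRYABLE_CODES


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Exponential delays in seconds: base, 2*base, 4*base, ... limited to cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(attempts)]
```

Each retry should resend the identical request with the same Idempotency-Key. Errors such as `password_protected` or `invalid_page_range` will not succeed on retry and need a changed request instead.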

Password-protected PDF

400 password_protected

The document cannot be parsed until it is decrypted outside the public API lane.

400 error

json

{
  "error": {
    "code": "password_protected",
    "message": "This PDF is password-protected and cannot be parsed without a password.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

OCR required

400 ocr_required

The parser could not extract text from a scan or image-only file.

400 error

json

{
  "error": {
    "code": "ocr_required",
    "message": "This PDF appears to require OCR before it can be parsed.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Invalid page range

400 invalid_page_range

The submitted selector is malformed or does not resolve to valid pages.

400 error

json

{
  "error": {
    "code": "invalid_page_range",
    "message": "The requested page_range is invalid for this PDF.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Plan page limit exceeded

400 page_limit_exceeded

The requested document or selected page range is larger than the active plan allows.

400 error

json

{
  "error": {
    "code": "page_limit_exceeded",
    "message": "Requested page range exceeds your plan limit.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Authentication failure

401 invalid_api_key

Returned when the bearer token is missing, revoked, expired, or not allowed to use the API lane.

401 error

json

{
  "error": {
    "code": "invalid_api_key",
    "message": "Invalid API key.",
    "type": "auth_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Idempotency conflict

409 idempotency_key_reused

Returned when the same Idempotency-Key is reused with a different payload than the original request.

409 error

json

{
  "error": {
    "code": "idempotency_key_reused",
    "message": "This Idempotency-Key was already used with a different request.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Webhook access disabled

403 webhook_access_disabled

Returned when a Starter API key submits webhook fields. Pro, Growth, and Scale include webhooks.

403 error

json

{
  "error": {
    "code": "webhook_access_disabled",
    "message": "Webhooks are available on Pro, Growth, and Scale. Starter includes API access without webhooks.",
    "type": "billing_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Rate limit

429 rate_limit_exceeded

Returned when the API key or caller fingerprint exceeds the configured request rate.

429 error

json

{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded.",
    "type": "rate_limit_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Endpoint

Batch Endpoint

Use batch parse when you want one accepted request, one batch status, and one shared set of parse options for multiple PDFs.

POST /v1/parse/batch

Submit up to 10 PDFs as one async parse batch with shared parse options and per-file artifacts.

Auth

Bearer token required on submit, status, and artifact download requests.

Idempotency

Required Idempotency-Key. The default idempotency window is one hour from submit, intentionally shorter than Stripe-style 24-hour idempotency. Reusing the same key and request replays the accepted response; reusing the key with a different request returns 409 idempotency_key_reused.

Content Type

multipart/form-data

Headers

Name	Type	Required	Location	Description
Authorization	Bearer <API_KEY>	Yes	header	User-owned API key created in the DocuShell dashboard.
Idempotency-Key	string	Yes	header	Required for every batch submit. Use a fresh key per logical batch and reuse it only for the exact same retry.
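
One way to guarantee that a retry reuses the same key is to derive it from the request itself. This is a sketch of that idea, not a DocuShell requirement: `batch_idempotency_key` is a hypothetical helper, and the `parse-batch` prefix and 32-character digest length are arbitrary choices.

```python
import hashlib


def batch_idempotency_key(files: list[bytes], options: dict[str, str], prefix: str = "parse-batch") -> str:
    """Derive a stable key from the file contents (in order) and the shared options."""
    digest = hashlib.sha256()
    for payload in files:
        # Hash each file separately so file boundaries affect the result.
        digest.update(hashlib.sha256(payload).digest())
    for name in sorted(options):
        digest.update(f"{name}={options[name]}\n".encode("utf-8"))
    return f"{prefix}-{digest.hexdigest()[:32]}"
```

The trade-off is that resubmitting identical files and options inside the idempotency window replays the earlier accepted response. A caller who wants a genuinely new batch with the same content should mix in something unique, such as a run identifier, as an extra option.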

Request Fields

Name	Type	Required	Location	Description
files[]	file[]	Yes	multipart	PDF uploads for the batch. Every file is preflighted before enqueue; one invalid file rejects the whole submit.
page_range	string	No	multipart	Shared page selector applied to every file, such as `1-3,5`. Selected pages count toward the batch total page limit.
include_header_footer	boolean	No	multipart	Shared setting. Set to `true` to keep repeated headers and footers in extracted output. Default: false
use_struct_tree	boolean	No	multipart	Shared setting. Set to `true` to prefer native tagged-PDF structure when available. Default: false
sanitize	boolean	No	multipart	Shared setting. Set to `true` to mask email addresses, URLs, and phone numbers in extracted output. Default: false
reading_order	`xycut` | `off`	No	multipart	Shared reading-order strategy.
table_method	`default` | `cluster`	No	multipart	Shared table-detection strategy.
keep_line_breaks	boolean	No	multipart	Shared setting. Set to `true` when text-oriented output should preserve original line breaks more closely. Default: false
output_mode	`json` | `both` | `html` | `all`	No	multipart	Backward-compatible artifact bundle selector. Do not send this together with `formats`. Default: both
formats	`json` | `markdown` | `html` | `text` | `annotated_pdf` | `markdown_with_html` | `markdown_with_images` | `tagged_pdf`	No	multipart	Explicit artifact list. Send repeated fields or a comma-separated value. Do not send this together with `output_mode`.
hybrid_mode	`auto` | `full`	No	multipart	Optional shared hybrid triage override when the hybrid backend is enabled by operations.
image_output	`off` | `embedded` | `external`	No	multipart	Shared image handling for image-capable outputs.
x-docushell-webhook-url	string	No	header	Optional public HTTPS endpoint for the terminal batch webhook. URLs with credentials, localhost, private, reserved, or metadata-service addresses are rejected.
x-docushell-webhook-secret	string	No	header	Required when `x-docushell-webhook-url` is present. Must be 16-256 characters with sufficient variety; do not reuse an API key.
x-docushell-webhook-endpoint-id	string	No	header	Saved managed webhook endpoint id. Use this instead of sending a per-request webhook URL and secret.
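
The webhook URL and secret rules above can be approximated client-side before submitting. The sketch below is an assumption about how those rules might be checked, not the gateway's actual validation: `check_webhook_config` is a hypothetical helper, it only inspects literal IP addresses (it does not resolve DNS names), and it does not attempt to measure the "sufficient variety" requirement for secrets.

```python
import ipaddress
from urllib.parse import urlsplit


def check_webhook_config(url: str, secret: str) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    problems: list[str] = []
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.scheme != "https":
        problems.append("webhook URL must use https")
    if parts.username or parts.password:
        problems.append("webhook URL must not embed credentials")
    if not host:
        problems.append("webhook URL must include a host")
    elif host == "localhost" or host.endswith(".localhost"):
        problems.append("webhook URL must not point at localhost")
    else:
        try:
            address = ipaddress.ip_address(host)
        except ValueError:
            address = None  # a DNS name; what it resolves to is not checked here
        if address is not None and not address.is_global:
            problems.append("webhook URL must not use a private or reserved address")
    if not 16 <= len(secret) <= 256:
        problems.append("webhook secret must be 16-256 characters")
    return problems
```

The server still makes the final decision, for example when a public host name resolves to a private address.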

Request Notes

Batch parse is async-only. Submit returns 202; poll GET /v1/parse/batch/:batchId as the source of truth for batch state.
All files are preflighted before enqueue. Empty, non-PDF, corrupt, password-protected, oversized, invalid-page-range, per-file page-limit, total-byte-limit, and total-page-limit failures reject the whole submit.
Default v1 limits are 10 files, 100 MB total upload, 500 selected pages, and 2 batch submits per minute.
Parse options are shared across the batch. v1 does not support per-file parse settings.
Send either output_mode or formats, not both. Only one markdown-style format (markdown, markdown_with_html, or markdown_with_images) can be requested.
The batch lane has separate backpressure. Queue saturation returns 503 server_busy with Retry-After; rate limits return 429.
Batch responses report estimated_credits only: each file estimates max(10, selected_pages) credits. v1 does not perform final credit settlement on this lane.
Every status and download request is owner-scoped. Unknown batches, files, or owner mismatches return 404.
Artifacts expire one hour after terminal completion by default. Batch idempotency expires one hour after submit by default.
Terminal statuses are completed, completed_with_failures, and failed. No per-file retry is attempted in v1.
Webhooks are signed best-effort terminal notifications with short bounded retry. Polling remains the source of truth.
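
The limits and the credit estimate in the notes above can be checked locally before uploading. The following is a minimal sketch, not part of the DocuShell API: `PlannedFile`, `preflight_problems`, and `estimated_credits` are hypothetical helper names, and reading "100 MB" as decimal megabytes is an assumption (the stricter interpretation). The server-side preflight remains authoritative.

```python
from dataclasses import dataclass

# Default v1 limits quoted in the request notes.
MAX_FILES = 10
MAX_TOTAL_BYTES = 100 * 1000 * 1000  # assumption: decimal megabytes
MAX_SELECTED_PAGES = 500
MARKDOWN_FAMILY = {"markdown", "markdown_with_html", "markdown_with_images"}


@dataclass
class PlannedFile:
    """Hypothetical local description of one PDF you intend to upload."""
    size_bytes: int
    selected_pages: int


def preflight_problems(files, formats=None, output_mode=None):
    """Return a list of problems found locally; an empty list means none were found."""
    problems = []
    if not files:
        problems.append("no files")
    if len(files) > MAX_FILES:
        problems.append("more than 10 files")
    if sum(f.size_bytes for f in files) > MAX_TOTAL_BYTES:
        problems.append("total upload exceeds 100 MB")
    if sum(f.selected_pages for f in files) > MAX_SELECTED_PAGES:
        problems.append("more than 500 selected pages")
    if formats and output_mode:
        problems.append("formats and output_mode are mutually exclusive")
    if formats and len(MARKDOWN_FAMILY & set(formats)) > 1:
        problems.append("only one markdown-style format is allowed")
    return problems


def estimated_credits(files):
    """Each file estimates max(10, selected_pages) credits; the batch total is the sum."""
    return sum(max(10, f.selected_pages) for f in files)
```

Two files with 8 and 4 selected pages each fall under the 10-credit floor, so the batch estimate is 20.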

Multipart batch submit

bash

curl -X POST "https://api.docushell.com/api/v1/parse/batch" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Idempotency-Key: parse-batch-demo-001" \
  -H "x-docushell-webhook-url: https://example.com/docushell/webhooks" \
  -H "x-docushell-webhook-secret: replace_with_a_long_random_secret" \
  -F "files[]=@./report-q1.pdf;type=application/pdf" \
  -F "files[]=@./report-q2.pdf;type=application/pdf" \
  -F "page_range=1-5" \
  -F "formats=json,markdown"

Accepted batch response

json

{
  "batch_id": "9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab",
  "service": "parse-pdf",
  "status": "queued",
  "counts": {
    "total": 2,
    "queued": 2,
    "processing": 0,
    "completed": 0,
    "failed": 0
  },
  "usage": {
    "total_upload_bytes": 1843200,
    "total_selected_pages": 12
  },
  "estimated_credits": 20,
  "created_at": "2026-05-23T10:12:30.000Z",
  "updated_at": "2026-05-23T10:12:30.000Z",
  "completed_at": null,
  "expires_at": null,
  "webhook_delivery": {
    "status": "pending"
  },
  "metrics": null,
  "files": [
    {
      "file_id": "file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20",
      "status": "queued",
      "page_count": 8,
      "billable_pages": 8,
      "estimated_credits": 10
    },
    {
      "file_id": "file_95ac2fb4-060d-427c-9f86-864747dfb935",
      "status": "queued",
      "page_count": 4,
      "billable_pages": 4,
      "estimated_credits": 10
    }
  ],
  "links": {
    "status": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab",
    "download": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/download"
  }
}

The same `Idempotency-Key` with the same request replays this response. A different request with that key returns `409 idempotency_key_reused`.

Batch status response

json

{
  "batch_id": "9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab",
  "service": "parse-pdf",
  "status": "completed_with_failures",
  "counts": {
    "total": 2,
    "queued": 0,
    "processing": 0,
    "completed": 1,
    "failed": 1
  },
  "usage": {
    "total_upload_bytes": 1843200,
    "total_selected_pages": 12
  },
  "estimated_credits": 20,
  "created_at": "2026-05-23T10:12:30.000Z",
  "updated_at": "2026-05-23T10:14:04.000Z",
  "completed_at": "2026-05-23T10:14:04.000Z",
  "expires_at": "2026-05-23T11:14:04.000Z",
  "webhook_delivery": {
    "status": "delivered"
  },
  "metrics": {
    "queue_wait_ms": 214,
    "duration_ms": 82341
  },
  "files": [
    {
      "file_id": "file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20",
      "status": "completed",
      "page_count": 8,
      "billable_pages": 8,
      "estimated_credits": 10,
      "artifacts": {
        "json_download": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/files/file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20/download?format=json",
        "markdown_download": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/files/file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20/download?format=markdown"
      }
    },
    {
      "file_id": "file_95ac2fb4-060d-427c-9f86-864747dfb935",
      "status": "failed",
      "page_count": 4,
      "billable_pages": 4,
      "estimated_credits": 10,
      "failure_code": "corrupt_pdf"
    }
  ],
  "links": {
    "status": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab",
    "download": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/download"
  }
}

`completed_with_failures` means at least one file produced artifacts and at least one accepted file failed after enqueue. Check each file status before downloading.
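
As a rough illustration of checking each file before downloading, the sketch below splits a parsed status payload into downloadable and failed files. It only reads fields shown in the status response above (`files[].status`, `files[].artifacts`, `files[].failure_code`); the function names are hypothetical.

```python
def completed_artifact_links(status_payload):
    """Map file_id -> {artifact key: relative download path} for completed files."""
    links = {}
    for entry in status_payload.get("files", []):
        if entry.get("status") == "completed":
            links[entry["file_id"]] = dict(entry.get("artifacts", {}))
    return links


def failed_files(status_payload):
    """Map file_id -> failure_code for files that reached a failed state."""
    return {
        entry["file_id"]: entry.get("failure_code")
        for entry in status_payload.get("files", [])
        if entry.get("status") == "failed"
    }
```

The returned paths are relative, so prefix them with the API base URL you already use for status requests.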

Batch ZIP download

bash

curl "https://api.docushell.com/api/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/download" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output parse-batch.zip

The batch ZIP is generated on demand from completed file artifacts and is not a durable artifact itself.

Per-file artifact download

bash

curl "https://api.docushell.com/api/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/files/file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20/download?format=json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output report.json

Use the file-specific download route when you want one artifact instead of the generated batch ZIP.

Artifacts

Per-file download links appear under files[].artifacts after each completed file is promoted.
The batch ZIP is generated on demand from completed file artifacts, streamed, and removed after the response finishes.
Per-file artifact downloads can be retried until the batch expires.
The per-file format query must match an artifact requested for the batch.
Original filenames are not part of the public status payload or webhook logs.

Poll And Download

Poll GET /v1/parse/batch/:batchId until status becomes completed, completed_with_failures, or failed.
When the batch completes or partially completes, use GET /v1/parse/batch/:batchId/download for the generated ZIP or use each files[].artifacts.*_download link for a specific file artifact.
If a file status is failed, its per-file download returns 409 batch_file_failed. Continue downloading completed files until expires_at.
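
The polling step can be sketched as a small loop. This is an illustrative helper, not an official client: `fetch_status` is a caller-supplied function (an assumption) that performs GET /v1/parse/batch/:batchId with your HTTP library and returns the parsed JSON. The interval and timeout defaults are arbitrary and should respect your rate limits.

```python
import time

# Terminal batch statuses listed in the request notes.
TERMINAL_STATUSES = {"completed", "completed_with_failures", "failed"}


def poll_until_terminal(fetch_status, interval_s=2.0, timeout_s=600.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until the batch is terminal, then return that payload.

    Raises TimeoutError if no terminal status is seen within timeout_s.
    """
    deadline = clock() + timeout_s
    while True:
        payload = fetch_status()
        if payload["status"] in TERMINAL_STATUSES:
            return payload
        if clock() >= deadline:
            raise TimeoutError("batch did not reach a terminal status in time")
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop easy to exercise in tests without real waiting.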

Failure Notes

invalid_pdf, corrupt_pdf, password_protected, invalid_page_range, and page_limit_exceeded can be returned during preflight before a batch is accepted.
server_busy with Retry-After means the dedicated batch queue or active batch lane is saturated. Retry later with the same Idempotency-Key only for the exact same request.
Download 400 means the requested format was not requested for this batch.
Download 425 batch_not_ready means the batch or file is not terminal yet.
Download 409 batch_file_failed means that accepted file reached a terminal failed state.
Download 410 output_expired means the artifact TTL has passed.
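
One way to act on these download responses is a simple lookup keyed by HTTP status. The sketch below is illustrative: the function name and the wording of each step are not part of the API, treating 200 as the success status is an assumption, and the 404 entry follows the owner-scoping rule in the request notes.

```python
def next_step_for_download(status_code: int) -> str:
    """Suggest a follow-up action for a batch download response."""
    steps = {
        200: "save the response body",  # assumption: success is a plain 200
        400: "request a format that was included in the batch submit",
        404: "check the batch id, file id, and API key owner",
        409: "skip this file; it failed after enqueue",
        410: "resubmit the batch; the artifacts expired",
        425: "keep polling; the batch or file is not terminal yet",
    }
    return steps.get(status_code, "inspect error.code and request_id in the response body")
```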

Batch not ready

425 batch_not_ready

Returned when a batch or file download is attempted before terminal status.

425 error

json

{
  "error": {
    "code": "batch_not_ready",
    "message": "Batch is not ready.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Batch file failed

409 batch_file_failed

Returned when a per-file artifact is requested for a file that failed after enqueue.

409 error

json

{
  "error": {
    "code": "batch_file_failed",
    "message": "Batch file failed.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Expired output

410 output_expired

Returned after the one-hour default TTL for completed batch artifacts passes.

410 error

json

{
  "error": {
    "code": "output_expired",
    "message": "Batch output expired.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Queue saturated

503 server_busy

Returned when batch-specific queue backpressure rejects the submit.

503 error

json

{
  "error": {
    "code": "server_busy",
    "message": "The parse batch queue is busy. Retry later.",
    "type": "internal_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Authentication failure

401 invalid_api_key

Returned when the bearer token is missing, revoked, expired, or not allowed to use the API lane.

401 error

json

{
  "error": {
    "code": "invalid_api_key",
    "message": "Invalid API key.",
    "type": "auth_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Idempotency conflict

409 idempotency_key_reused

Returned when the same Idempotency-Key is reused with a different payload than the original request.

409 error

json

{
  "error": {
    "code": "idempotency_key_reused",
    "message": "This Idempotency-Key was already used with a different request.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Webhook access disabled

403 webhook_access_disabled

Returned when a Starter API key submits webhook fields. Pro, Growth, and Scale include webhooks.

403 error

json

{
  "error": {
    "code": "webhook_access_disabled",
    "message": "Webhooks are available on Pro, Growth, and Scale. Starter includes API access without webhooks.",
    "type": "billing_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Rate limit

429 rate_limit_exceeded

Returned when the API key or caller fingerprint exceeds the configured request rate.

429 error

json

{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded.",
    "type": "rate_limit_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}

Webhooks

Completion Webhooks

Use webhook_url and webhook_secret for per-request completion callbacks, or send x-docushell-webhook-url and its matching secret header on batch parse requests.

Receivers must validate x-docushell-signature, deduplicate by x-docushell-delivery, and respond within the 10-second request timeout. Use public HTTPS staging endpoints or approved tunnels for receiver tests.
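
A receiver's validate-then-deduplicate order might look like the sketch below. The signing scheme is an assumption: this page does not specify how x-docushell-signature is computed, so the hex HMAC-SHA256 over the raw request body used here is a placeholder to replace with the documented scheme. The function names and the in-memory set are also illustrative; a real receiver would use a durable store for delivery ids.

```python
import hashlib
import hmac


def signature_matches(secret: str, raw_body: bytes, received_signature: str) -> bool:
    # ASSUMPTION: hex-encoded HMAC-SHA256 of the raw body, keyed by the webhook secret.
    # Confirm the real scheme before relying on this check.
    expected = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), received_signature.encode("utf-8"))


def handle_delivery(headers, raw_body, secret, seen_deliveries):
    """Return "rejected", "duplicate", or "accepted" for one webhook request.

    headers is expected to use lowercase header names.
    """
    signature = headers.get("x-docushell-signature", "")
    delivery_id = headers.get("x-docushell-delivery", "")
    if not delivery_id or not signature_matches(secret, raw_body, signature):
        return "rejected"
    if delivery_id in seen_deliveries:
        return "duplicate"  # already processed; acknowledge without repeating work
    seen_deliveries.add(delivery_id)
    return "accepted"
```

Verifying the signature before recording the delivery id keeps forged requests from poisoning the deduplication store. Because polling remains the source of truth, an accepted delivery can simply trigger a status fetch rather than carrying state itself.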

Terminal event names include pdf.parse.completed, pdf.parse.failed, pdf.parse.batch.completed, pdf.parse.batch.completed_with_failures, pdf.parse.batch.failed, resume.parse.completed, resume.parse.failed, resume.batch.completed, resume.batch.completed_with_failures, and resume.batch.failed.

Artifacts

Artifact Downloads

The shared download route streams one artifact at a time.

JSON artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=json" \
  -H "Authorization: Bearer YOUR_API_KEY"

Use `format=json` to stream the structured document artifact directly.

Markdown artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown" \
  -H "Authorization: Bearer YOUR_API_KEY"

Markdown downloads preserve the extracted reading order in a plain-text friendly artifact.

HTML artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=html" \
  -H "Authorization: Bearer YOUR_API_KEY"

HTML downloads are available when the parse job requested HTML output.

Plain text artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=text" \
  -H "Authorization: Bearer YOUR_API_KEY"

Plain text downloads are available when the parse job requests all artifacts or explicit text output.

Annotated PDF debug artifact download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=annotated_pdf" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.annotated.pdf

Annotated PDF downloads are visual debug artifacts for validating extracted structure against the source page.

Markdown with HTML download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown_with_html" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.with-html.md

`markdown_with_html` preserves richer inline/table markup inside a Markdown-family artifact.

Markdown with images download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown_with_images" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.with-images.md

When `image_output=external` emits sidecars, this download may be a zip containing Markdown plus image assets.

Tagged PDF download

bash

curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=tagged_pdf" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.tagged.pdf

Tagged PDF output is automated structure inference and should be reviewed before accessibility compliance claims.

Artifacts

JSON Artifact Versus Markdown Artifact

Use the JSON artifact when you need structure, semantic content types, or document geometry. It is the canonical machine-readable representation and the right choice for pipelines that need sections, lists, tables, headings, captions, or positional metadata.

Use the Markdown artifact when you need a lightweight, portable text representation that still follows the extracted reading order. It works well for previews, quick QA, search indexing, and downstream LLM ingestion.

The parse status payload also includes result.metadata so you can inspect which extraction-tuning options were applied to a completed job.

Support

Troubleshooting Parse Failures

Most parse failures are actionable before you retry. Keep the same Idempotency-Key only when you are replaying the exact same logical request after a timeout or transport issue.

invalid_pdf: confirm the upload is a real PDF before retrying.
corrupt_pdf: re-export or repair the file, then resubmit.
password_protected: decrypt the PDF before uploading. Password submission is not part of this public lane yet.
ocr_required: run OCR upstream first, then resubmit the text-native PDF.
invalid_page_range: retry with a selector like 1-3,5 that resolves inside the document bounds.
Unexpected text layout: retry with reading_order=off or reading_order=xycut, depending on whether you want less or more reading-order reconstruction.
Weak table extraction: retry with table_method=cluster if the default path misses cell groupings.

If the gateway returns server_busy or backend_unavailable, retry safely with the same Idempotency-Key after the transient issue clears.
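
The retry guidance above can be sketched as two small helpers. The names, the backoff base, and the cap are illustrative choices rather than documented values, and only the numeric-seconds form of Retry-After is handled; an HTTP-date value falls back to exponential backoff.

```python
# Error codes this page describes as safe to retry with the same Idempotency-Key.
RETRYABLE_CODES = {"server_busy", "backend_unavailable"}


def should_retry_with_same_key(error_code: str) -> bool:
    """True when the exact same request can be replayed under its original key."""
    return error_code in RETRYABLE_CODES


def retry_delay_seconds(retry_after_header, attempt, base_s=1.0, cap_s=60.0):
    """Prefer the server's Retry-After (seconds); otherwise back off exponentially.

    attempt is zero-based: 0 for the first retry, 1 for the second, and so on.
    """
    if retry_after_header is not None:
        try:
            return max(0.0, float(retry_after_header))
        except ValueError:
            pass  # HTTP-date form is not parsed in this sketch
    return min(cap_s, base_s * (2 ** attempt))
```

Codes such as corrupt_pdf or invalid_page_range are not retryable as-is: the request has to change, which also means using a new Idempotency-Key.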

Code Samples

Submit a RAG-ready parse job

curl -X POST "https://api.docushell.com/api/v1/parse" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Idempotency-Key: rag-parse-001" \
  -F "file=@./policy-handbook.pdf;type=application/pdf" \
  -F "formats=json,markdown" \
  -F "reading_order=xycut" \
  -F "use_struct_tree=true"

DocuShell RAG flow

PDF upload
  -> POST /api/v1/parse with formats=json,markdown
  -> poll GET /api/v1/jobs/:jobId
  -> download Markdown and JSON artifacts
  -> chunk Markdown by headings, paragraphs, lists, and tables
  -> attach JSON metadata: page, bounding box, node type, heading path
  -> store embeddings plus metadata
  -> retrieve chunks and cite the source page or region

Chunk metadata shape

{
  "content": "## Data Retention\n\nCustomer documents are retained for the configured retention window...",
  "metadata": {
    "source_file_id": "file_01JX...",
    "source_name": "policy-handbook.pdf",
    "section": "Data Retention",
    "node_types": ["heading", "paragraph"],
    "page_start": 3,
    "page_end": 4,
    "bounding_boxes": [
      { "page": 3, "bbox": { "x": 0.88, "y": 1.21, "w": 6.21, "h": 0.52 } }
    ]
  }
}
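
The "chunk Markdown by headings" step in the flow above could start from something like the following sketch. It is a simplified illustration, not DocuShell's chunker: it splits only on ATX headings (`#` through `######`), does not skip heading-like lines inside fenced code, and fills just the `source_name` and `section` metadata fields. Page numbers and bounding boxes would have to be joined in from the JSON artifact, whose field names are not assumed here.

```python
import re

# An ATX heading: one to six '#' characters, whitespace, then the title text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_markdown_by_heading(markdown: str, source_name: str):
    """Split Markdown into chunks that each start at a heading.

    Text before the first heading becomes a chunk whose section is None.
    """
    chunks = []
    section = None
    buffer = []

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({
                "content": text,
                "metadata": {"source_name": source_name, "section": section},
            })
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            section = match.group(2)
        buffer.append(line)
    flush()
    return chunks
```

Long sections may still need a secondary split by paragraph or token count before embedding.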