# Parse PDF

Parse PDF turns uploaded PDFs into a structured document tree plus Markdown, HTML, plain text, and optional annotated PDF debug artifacts. Use `/v1/parse` for one file or `/v1/parse/batch` for up to 10 PDFs with shared parse options.

Source: https://docs.docushell.com/parse-pdf
Category: Reference
Read time: 15 min

## Related

- [Getting started](/getting-started.md): Review auth, idempotency, and the shared jobs flow first.
- [Download formats](#artifact-downloads): Jump to JSON, Markdown, HTML, text, and annotated PDF artifact handling.
- [Batch parse](#submit-parse-batch): Submit multiple PDFs, poll batch status, and download per-file artifacts or a generated ZIP.
- [RAG ingestion](#rag-ingestion-workflow): Use DocuShell JSON and Markdown artifacts for chunking, retrieval metadata, and citations.
- [Try Parse live](/playgrounds/parse): Open the parser playground with annotated PDF overlays, extracted blocks, JSON, and live API execution.

## What You Get Back

Completed parse jobs expose structured JSON plus optional Markdown, HTML, plain text, annotated PDF debug artifacts, Markdown with HTML, Markdown with images, and tagged PDF output. The JSON artifact is the structured representation for downstream automation. Markdown and text are text-friendly companions for indexing, previews, and human review.

The structured JSON preserves reading order and emits a hierarchical document whose root carries `numberOfPages` and a `kids` array, with semantic nodes for headings, paragraphs, lists, list items, tables, rows, cells, captions, and images when detected.

When the underlying output includes layout coordinates, nodes can also expose bounding boxes so you can map extracted content back to the source pages.

- JSON artifact: structured document tree for automation and indexing.
- Markdown artifact: readable text export with the same reading-order orientation.
- HTML artifact: styled companion for rendering and review.
- Plain text artifact: lightweight output for search, RAG, and simple ingestion.
- Annotated PDF artifact: visual debug output for comparing detected structure to source pages.
- Markdown with HTML artifact: Markdown-family output that keeps richer inline/table markup.
- Markdown with images artifact: explicit image-capable Markdown; external sidecars are bundled into one zip.
- Only one markdown-style artifact can be requested per job because the parse engine emits one Markdown-family file per run.
- Tagged PDF artifact: automated structure inference for accessibility review, not a PDF/UA compliance guarantee.
- Header and footer content stays excluded by default unless `include_header_footer=true` is requested.
- Tagged PDFs can prefer their native structure tree when `use_struct_tree=true` is supplied.
- Sanitization, reading-order, table, line-break, hybrid-mode, and image-output settings tune extraction behavior while `output_mode` and `formats` select emitted artifacts.

## RAG Ingestion Workflow

Use DocuShell Parse PDF when a search, RAG, review, or agent workflow needs readable chunks plus source metadata.

For most RAG pipelines, request `formats=json,markdown`. Treat Markdown as the primary text to chunk and embed, then attach JSON metadata from the same parse job so retrieved passages can point back to source pages and bounding boxes.

Do not chunk PDFs by blind character windows first. Start with document semantics: headings, paragraphs, lists, captions, and tables. Merge short neighboring elements when needed, keep tables intact when their structure matters, and carry page numbers plus bounding boxes into vector metadata.

The JSON artifact is also the audit layer. Store enough metadata to reproduce a citation, highlight a source region in a review UI, and debug bad retrieval without reparsing the PDF.

- Submit with `formats=json,markdown` for the common RAG pair.
- Use `use_struct_tree=true` for tagged PDFs when you want DocuShell to prefer reliable native structure tags.
- Keep DocuShell's default rendering-mismatch defenses enabled for untrusted PDFs so hidden, off-page, tiny, or transparent text is less likely to enter model context.
- Keep the default header/footer exclusion unless repeated page furniture is important to the answer.
- Use `sanitize=true` only when your pipeline should mask visible emails, URLs, and phone numbers in the extracted output.
- Store per-chunk metadata such as source file ID, page number, heading path, node type, and bounding box when available.
- Keep table nodes or Markdown tables as standalone chunks when row and column relationships are important.

> RAG answers should cite DocuShell metadata, not just the text string sent to the model. Page and bounding-box metadata make citations inspectable.
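
The chunking guidance above can be sketched as a small grouping pass. This is a minimal illustration, not DocuShell's own chunker: it assumes the parse nodes have already been flattened into reading order and that each node carries `type` and `content` as in the sample status response. The names `chunk_nodes`, `body_length`, and `min_chars` are hypothetical.

```python
from typing import Dict, List


def body_length(nodes: List[Dict]) -> int:
    """Count characters of non-heading text in a chunk under construction."""
    return sum(len(n.get("content", "")) for n in nodes if n.get("type") != "heading")


def chunk_nodes(nodes: List[Dict], min_chars: int = 200) -> List[List[Dict]]:
    """Group reading-order nodes into chunks along semantic boundaries.

    A heading starts a new chunk, a table is always its own chunk, and other
    nodes are merged until the chunk holds at least `min_chars` of body text.
    """
    chunks: List[List[Dict]] = []
    current: List[Dict] = []
    for node in nodes:
        kind = node.get("type")
        starts_new_chunk = (
            kind in ("heading", "table") or body_length(current) >= min_chars
        )
        if starts_new_chunk and current:
            chunks.append(current)
            current = []
        if kind == "table":
            chunks.append([node])  # tables stay whole and standalone
        else:
            current.append(node)
    if current:
        chunks.append(current)
    return chunks
```

Tune `min_chars` against your embedding model's context size; the default here is an arbitrary placeholder.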

## RAG Examples

These examples show the DocuShell API request and the downstream metadata shape to keep with embeddings.

### Submit a RAG-ready parse job

```bash
curl -X POST "https://api.docushell.com/api/v1/parse" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Idempotency-Key: rag-parse-001" \
  -F "file=@./policy-handbook.pdf;type=application/pdf" \
  -F "formats=json,markdown" \
  -F "reading_order=xycut" \
  -F "use_struct_tree=true"
```

`use_struct_tree=true` helps when the source PDF has reliable native tags. If usable tags are not present, validate the output in the Parse Playground and the downloaded artifacts before relying on it.

### DocuShell RAG flow

```text
PDF upload
  -> POST /api/v1/parse with formats=json,markdown
  -> poll GET /api/v1/jobs/:jobId
  -> download Markdown and JSON artifacts
  -> chunk Markdown by headings, paragraphs, lists, and tables
  -> attach JSON metadata: page, bounding box, node type, heading path
  -> store embeddings plus metadata
  -> retrieve chunks and cite the source page or region
```

### Chunk metadata shape

```json
{
  "content": "## Data Retention\n\nCustomer documents are retained for the configured retention window...",
  "metadata": {
    "source_file_id": "file_01JX...",
    "source_name": "policy-handbook.pdf",
    "section": "Data Retention",
    "node_types": ["heading", "paragraph"],
    "page_start": 3,
    "page_end": 4,
    "bounding_boxes": [
      { "page": 3, "bbox": { "x": 0.88, "y": 1.21, "w": 6.21, "h": 0.52 } }
    ]
  }
}
```

Use the exact metadata keys your system prefers, but preserve DocuShell page and bounding-box data when available.
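
One way to assemble a record in that shape is sketched below. It assumes each node uses the `type`, `page number`, and `bounding box` keys seen in the sample status response, and that the caller supplies the Markdown text of the chunk. `build_chunk_record` and its parameters are hypothetical names.

```python
from typing import Dict, List


def build_chunk_record(
    content: str,
    nodes: List[Dict],
    source_file_id: str,
    source_name: str,
    section: str,
) -> Dict:
    """Combine chunk text with page and bounding-box metadata from its nodes."""
    pages = [n["page number"] for n in nodes if "page number" in n]
    boxes = [
        {"page": n["page number"], "bbox": n["bounding box"]}
        for n in nodes
        if "page number" in n and "bounding box" in n
    ]
    node_types: List[str] = []
    for n in nodes:
        kind = n.get("type")
        if kind and kind not in node_types:
            node_types.append(kind)  # keep first-seen order, no duplicates
    metadata: Dict = {
        "source_file_id": source_file_id,
        "source_name": source_name,
        "section": section,
        "node_types": node_types,
        "bounding_boxes": boxes,
    }
    if pages:  # page data is only present when the parser reported it
        metadata["page_start"] = min(pages)
        metadata["page_end"] = max(pages)
    return {"content": content, "metadata": metadata}
```

Nodes without page or bounding-box data are tolerated, so the record degrades gracefully when layout coordinates are missing.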

## Tagged PDFs And Structure Trees

Tagged PDFs can provide stronger semantic structure than coordinate-only extraction when the tags are present and trustworthy.

When a source PDF includes usable structure tags, `use_struct_tree=true` tells DocuShell to prefer that structure for reading order and semantic hierarchy. This can improve headings, lists, table relationships, and natural chunk boundaries for RAG pipelines.

Real document collections are mixed. Some PDFs are well tagged, some have no tags, and some have tags that are not useful enough for downstream retrieval. Always inspect representative outputs before relying on a single extraction strategy.

- Use `use_struct_tree=true` for tagged policy documents, manuals, reports, and accessible PDFs where author-defined structure is expected.
- Use the regular layout-aware path for untagged or poorly tagged files, or compare both settings during integration testing.
- Request `formats=tagged_pdf` when you need a generated tagged PDF artifact for review or accessibility workflows.
- Do not treat `formats=tagged_pdf` as a PDF/UA compliance guarantee. Review the artifact before making accessibility claims.
- For RAG chunking, start new chunks at major headings, preserve heading-plus-paragraph groups, and avoid splitting tables across chunks.

## Playground Inspection Views

The Parse Playground lets developers inspect both the visual overlay and the structured data behind it.

- [Annotated PDF Viewer](/playgrounds/parse): Renders the selected PDF pages and overlays layout boxes, category tags, and reading-order numbers from parser bounding boxes.
- [Blocks / Tables Output](/playgrounds/parse): Lists extracted nodes as data: order, category type, page number, bounding-box coordinates, and extracted text or table content when available.
- [JSON, Markdown, And Text](/playgrounds/parse): Switch tabs to compare the raw structured JSON with Markdown and plain text companion artifacts from the same parse job.

## What Parse Supports Today

The public parse lane exposes curated parser controls, while OCR and enrichment remain backend-profile settings.

| Capability | Availability | Notes |
| --- | --- | --- |
| JSON artifact | Available | Hierarchical document output for automation, indexing, and structured QA. |
| Markdown artifact | Available | Text-first companion download for previews, search, and LLM ingestion. |
| HTML artifact | Available | Optional styled companion download for rendering and review. |
| Plain text artifact | Available | Optional lightweight text output for search, RAG, and simple ingestion. |
| Annotated PDF artifact | Available | Optional visual debug artifact for validating extracted structure. |
| Annotated PDF playground viewer | Available | The playground renders PDF pages with layout boxes, category tags, and reading-order numbers when block geometry is present. |
| Blocks / tables output | Available when present | The playground lists parser nodes with order, category type, page number, bounding box, and extracted content. |
| Markdown with HTML | Available | Request `formats=markdown_with_html` when Markdown should retain richer inline/table markup. |
| Markdown with images | Available | Request `formats=markdown_with_images`; external sidecars download as a zip. |
| Tagged PDF output | Available | Request `formats=tagged_pdf`; review before making accessibility compliance claims. |
| Batch parse | Available | Use `/v1/parse/batch` for up to 10 PDFs with shared parse options, per-file statuses, and per-file artifacts. |
| Reading-order preservation | Available | Structured and text-oriented artifacts follow the detected reading order. |
| Heading and list detection | Available | Headings plus numbered, bulleted, and nested lists are represented when detected. |
| Table extraction | Available / backend-gated | Structured tables are emitted when detected; complex or borderless tables may require the hybrid backend. |
| Image extraction with coordinates | Available when present | Image nodes and coordinates can appear in JSON; use `markdown_with_images` for image-capable Markdown. |
| Tagged PDF structure | Available | Use `use_struct_tree=true` to prefer native structure tags when a tagged PDF provides them. |
| Sanitization | Available | Use `sanitize=true` to mask email addresses, URLs, and phone numbers in extracted output. |
| Reading-order override | Available | Use `reading_order=xycut\|off` when you need an explicit reading-order setting. |
| Table-method override | Available | Use `table_method=default\|cluster` for light table extraction tuning. |
| Keep line breaks | Available | Use `keep_line_breaks=true` when text-oriented output should preserve original line breaks more closely. |
| Header/footer inclusion | Available | Use `include_header_footer=true` when you need repeated page furniture in the output. |
| Request hybrid mode | Backend-gated | Use `hybrid_mode=auto\|full` only when the DocuShell hybrid backend is enabled. |
| Image output mode | Available | Use `image_output=off\|embedded\|external` for image-capable outputs. |
| OCR / scanned PDFs | Backend-gated | Available when the DocuShell hybrid OCR profile is active; otherwise scans return `ocr_required`. |
| Formula/chart enrichment | Backend-gated | Available only when the active DocuShell backend profile includes those enrichments. |

## Single-File Endpoint

- Method: `POST`
- Path: `/v1/parse`
- Auth: Bearer token required on submit, status, and artifact download requests.
- Idempotency: Server-minted `job_id` values with optional `Idempotency-Key` replay support.
- Content type: `multipart/form-data`

Submit a PDF for queued parsing and receive structured JSON plus Markdown, HTML, plain text, and annotated PDF debug output.

### Headers

| Name | Type | Required | Location | Description |
| --- | --- | --- | --- | --- |
| Authorization | `Bearer <API_KEY>` | Yes | header | User-owned API key created in the DocuShell dashboard. |
| Idempotency-Key | string | No | header | Recommended for safely retrying submit requests without creating duplicate jobs. |

### Request Fields

| Name | Type | Required | Location | Description |
| --- | --- | --- | --- | --- |
| file | file | Yes | multipart | PDF upload. The gateway validates PDF magic bytes before forwarding the file. |
| file_name | string | No | multipart | Optional file name override used for storage metadata and downstream artifact names. |
| page_range | string | No | multipart | Comma-separated pages or ranges such as `1-3,5,9-11`. |
| include_header_footer | boolean | No | multipart | Set to `true` to keep header and footer content in the extracted output. Default: false |
| use_struct_tree | boolean | No | multipart | Set to `true` to prefer native tagged-PDF structure when the source document includes a usable structure tree. Default: false |
| sanitize | boolean | No | multipart | Set to `true` to mask email addresses, URLs, and phone numbers in extracted output. Default: false |
| reading_order | `xycut` \| `off` | No | multipart | Optional reading-order strategy. Omit it to keep the current default extraction behavior. |
| table_method | `default` \| `cluster` | No | multipart | Optional table-detection strategy. Omit it to keep the current default extraction behavior. |
| keep_line_breaks | boolean | No | multipart | Set to `true` to preserve source line breaks more aggressively in text-oriented output. Default: false |
| output_mode | `json` \| `both` \| `html` \| `all` | No | multipart | Backward-compatible artifact bundle selector. `json` keeps only structured JSON, `both` adds Markdown, `html` adds HTML, and `all` returns the common legacy bundle: JSON, Markdown, HTML, text, and annotated PDF. Default: both |
| formats | `json` \| `markdown` \| `html` \| `text` \| `annotated_pdf` \| `markdown_with_html` \| `markdown_with_images` \| `tagged_pdf` | No | multipart | Optional explicit artifact list. Send as repeated fields or a comma-separated value, such as `formats=json,text`. |
| hybrid_mode | `auto` \| `full` | No | multipart | Optional per-job hybrid triage override. Requires the hybrid backend to be enabled by operations. |
| image_output | `off` \| `embedded` \| `external` | No | multipart | Controls image handling for image-capable outputs. `markdown_with_images` defaults to embedded images unless `external` is requested. |

### Request Notes

- Plan limits are enforced before the job is queued. Starter keeps the 50 MB per-file cap; Pro, Growth, and Scale raise the upload size, the page limit per PDF or job, and concurrency as monthly credits grow.
- Set `use_struct_tree=true` when tagged PDFs should favor their native structure tree. Leave it off for the default reading-order-oriented extraction path.
- Structured JSON remains the canonical parse result and is always generated for successful jobs so status responses can keep returning `result.document`.
- `sanitize`, `reading_order`, `table_method`, `keep_line_breaks`, `hybrid_mode`, and `image_output` are extraction-tuning knobs. `output_mode` and `formats` control which companion artifacts are emitted.
- Request newer artifact types such as `markdown_with_html`, `markdown_with_images`, and `tagged_pdf` with `formats`; only one markdown-style format (`markdown`, `markdown_with_html`, or `markdown_with_images`) can be requested per job because the parse engine emits one Markdown-family file per run.
- DocuShell keeps rendering-mismatch safety filters enabled for Parse PDF output. `sanitize=true` is a separate optional control for masking visible sensitive data.
- OCR, formula extraction, and chart/image descriptions follow the active DocuShell backend profile. They are not per-request fields on the shared public API.
- Status polling stays on `/v1/jobs/:jobId`. Artifact streaming happens through the shared download route with `format=json|markdown|html|text|annotated_pdf|markdown_with_html|markdown_with_images|tagged_pdf`.
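
The one-Markdown-family rule can be checked on the client before submitting. The sketch below is a hypothetical helper, not part of any DocuShell SDK: it takes a list of requested formats, rejects unknown names and multiple markdown-style formats, and returns the comma-separated value for the `formats` field.

```python
from typing import List

ALLOWED_FORMATS = {
    "json", "markdown", "html", "text", "annotated_pdf",
    "markdown_with_html", "markdown_with_images", "tagged_pdf",
}
MARKDOWN_FAMILY = {"markdown", "markdown_with_html", "markdown_with_images"}


def validate_formats(formats: List[str]) -> str:
    """Return the comma-separated `formats` value or raise ValueError."""
    unique = list(dict.fromkeys(formats))  # drop duplicates, keep order
    unknown = [f for f in unique if f not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError(f"unknown formats: {', '.join(unknown)}")
    markdown_like = [f for f in unique if f in MARKDOWN_FAMILY]
    if len(markdown_like) > 1:
        raise ValueError(
            "only one markdown-style format per job: " + ", ".join(markdown_like)
        )
    return ",".join(unique)
```

For example, `validate_formats(["json", "markdown"])` returns `"json,markdown"`, while combining `markdown` with `markdown_with_html` raises `ValueError`.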

### Sample Requests

#### Multipart submit

```bash
curl -X POST "https://api.docushell.com/api/v1/parse" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Idempotency-Key: parse-demo-001" \
  -F "file=@./quarterly-report.pdf;type=application/pdf" \
  -F "file_name=quarterly-report.pdf" \
  -F "page_range=1-3" \
  -F "include_header_footer=true" \
  -F "use_struct_tree=true" \
  -F "sanitize=true" \
  -F "reading_order=xycut" \
  -F "table_method=cluster" \
  -F "keep_line_breaks=true" \
  -F "formats=json,markdown_with_images" \
  -F "image_output=embedded"
```

### Queued Response

#### Queued response

```json
{
  "job_id": "job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT",
  "status": "queued",
  "cost": 2500,
  "service": "parse-pdf",
  "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E",
  "links": {
    "status": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT"
  }
}
```

### Status Response

#### Parse job status

```json
{
  "job_id": "job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT",
  "status": "done",
  "service": "parse-pdf",
  "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E",
  "result": {
    "document": {
      "fileName": "quarterly-report.pdf",
      "numberOfPages": 2,
      "kids": [
        {
          "type": "section",
          "children": [
            {
              "type": "heading",
              "content": "Executive summary",
              "heading level": 1,
              "page number": 1,
              "bounding box": { "x": 0.88, "y": 0.74, "w": 6.15, "h": 0.33 }
            },
            {
              "type": "paragraph",
              "content": "Revenue rose 18% year over year across the managed-services portfolio.",
              "page number": 1,
              "bounding box": { "x": 0.88, "y": 1.21, "w": 6.21, "h": 0.52 }
            },
            {
              "type": "list",
              "children": [
                { "type": "listItem", "content": "Renewals remained above 92%." },
                { "type": "listItem", "content": "Average contract value increased in EMEA." }
              ]
            },
            {
              "type": "table",
              "children": [
                {
                  "type": "tableRow",
                  "children": [
                    { "type": "tableCell", "content": "Region" },
                    { "type": "tableCell", "content": "Growth" }
                  ]
                },
                {
                  "type": "tableRow",
                  "children": [
                    { "type": "tableCell", "content": "North America" },
                    { "type": "tableCell", "content": "21%" }
                  ]
                }
              ]
            },
            {
              "type": "caption",
              "content": "Table 1. Regional growth by quarter."
            }
          ]
        }
      ]
    },
    "artifacts": {
      "markdown_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown",
      "json_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=json",
      "html_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=html",
      "text_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=text",
      "annotated_pdf_download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=annotated_pdf"
    },
    "metadata": {
      "engine": "docushell_parse",
      "output_mode": "all",
      "include_header_footer": true,
      "use_struct_tree": true,
      "sanitize": true,
      "reading_order": "xycut",
      "table_method": "cluster",
      "keep_line_breaks": true
    }
  },
  "metrics": {
    "queue_wait_ms": 214,
    "duration_ms": 1789
  },
  "links": {
    "status": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT",
    "download": "/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download"
  }
}
```

The JSON artifact preserves reading order and exposes the structured document tree through `numberOfPages` and `kids`.
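
A short traversal sketch shows how the tree can be consumed. It assumes child nodes appear under `kids` at the root and under `children` deeper in the tree, and that tables use `tableRow` and `tableCell` nodes, all as in the sample response above. `iter_nodes` and `table_rows` are hypothetical helper names.

```python
from typing import Dict, Iterator, List


def iter_nodes(node: Dict) -> Iterator[Dict]:
    """Yield a node and all of its descendants depth-first, in document order."""
    yield node
    for child in node.get("kids", []) + node.get("children", []):
        yield from iter_nodes(child)


def table_rows(table: Dict) -> List[List[str]]:
    """Flatten a table node into a list of rows of cell text."""
    rows: List[List[str]] = []
    for row in table.get("children", []):
        if row.get("type") != "tableRow":
            continue
        rows.append(
            [
                cell.get("content", "")
                for cell in row.get("children", [])
                if cell.get("type") == "tableCell"
            ]
        )
    return rows
```

Calling `table_rows` on the table node from the sample response would produce `[["Region", "Growth"], ["North America", "21%"]]`.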

### Download Samples

#### JSON artifact download

```bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=json" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

Use `format=json` to stream the structured document artifact directly.

#### Markdown artifact download

```bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

Markdown downloads preserve the extracted reading order in a plain-text-friendly artifact.

#### HTML artifact download

```bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=html" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

HTML downloads are available when the parse job requests HTML output.

#### Plain text artifact download

```bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=text" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

Plain text downloads are available when the parse job requests all artifacts or explicit text output.

#### Annotated PDF debug artifact download

```bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=annotated_pdf" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.annotated.pdf
```

Annotated PDF downloads are visual debug artifacts for validating extracted structure against the source page.

#### Markdown with HTML download

```bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown_with_html" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.with-html.md
```

`markdown_with_html` preserves richer inline/table markup inside a Markdown-family artifact.

#### Markdown with images download

```bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown_with_images" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.with-images.md
```

When `image_output=external` emits sidecars, this download may be a zip containing Markdown plus image assets.
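
Because the same download can be either a Markdown file or a zip, a client may want to inspect the bytes before choosing a file name. The sketch below uses the standard library's `zipfile.is_zipfile`; the helper name `artifact_extension` is hypothetical.

```python
import io
import zipfile


def artifact_extension(payload: bytes) -> str:
    """Return '.zip' when the downloaded bytes are a zip archive, else '.md'."""
    return ".zip" if zipfile.is_zipfile(io.BytesIO(payload)) else ".md"
```

Saving as `"document.with-images" + artifact_extension(payload)` avoids writing a zip archive under a `.md` name.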

#### Tagged PDF download

```bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=tagged_pdf" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.tagged.pdf
```

Tagged PDF output is automated structure inference and should be reviewed before making accessibility compliance claims.

### Artifacts

- The JSON artifact is the structured representation. It includes the document root, `numberOfPages`, and the hierarchical `kids` array.
- Node types capture semantic structure such as headings, paragraphs, lists, list items, tables, rows, cells, and captions.
- Where available, nodes include bounding boxes so you can map structured content back to source pages.
- The Parse Playground uses those same node types and bounding boxes to draw layout boxes, category tags, and reading-order numbers in the Annotated PDF viewer.
- Markdown is the flattened companion artifact for indexing, previews, search pipelines, and quick human review.
- For RAG pipelines, index Markdown chunks and attach JSON metadata such as page number, node type, heading path, and bounding box.
- HTML is an optional companion artifact for styled downstream rendering and review when `output_mode` requests it.
- Plain text is available for search, RAG, and simple ingestion pipelines.
- Annotated PDF is an optional visual debug artifact for comparing detected structure to the source page.
- `markdown_with_html` is available as an explicit format when you want Markdown output with richer inline/table markup retained.
- `markdown_with_images` is available as an explicit format. Embedded images produce a self-contained Markdown file; external image sidecars are bundled into one zip.
- Only one markdown-style artifact can be requested in a single job: `markdown`, `markdown_with_html`, or `markdown_with_images`.
- `tagged_pdf` is available as an explicit format for accessibility review workflows. It is not a PDF/UA compliance guarantee.

### Poll And Download

- Poll `GET /v1/jobs/:jobId` until `status` becomes `done` or `failed`.
- When the job completes, the status payload includes public artifact links under `result.artifacts`.
- Use `GET /v1/jobs/:jobId/download?format=json` for the structured document, `format=markdown` for the Markdown companion, `format=html` for HTML, `format=text` for plain text, `format=annotated_pdf` for the visual debug artifact, `format=markdown_with_html` for richer Markdown, `format=markdown_with_images` for image-capable Markdown, and `format=tagged_pdf` for tagged PDF output.
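
The polling step can be sketched as a loop that stops on a terminal status. To stay independent of any HTTP client, this sketch takes a caller-supplied `fetch_status` callable that is assumed to perform the authenticated `GET /v1/jobs/:jobId` request and return the decoded JSON body. `wait_for_job`, `interval_s`, and `timeout_s` are hypothetical names and defaults.

```python
import time
from typing import Callable, Dict

TERMINAL_STATUSES = {"done", "failed"}


def wait_for_job(
    fetch_status: Callable[[], Dict],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the job reports `done` or `failed`, then return the payload."""
    deadline = time.monotonic() + timeout_s
    while True:
        payload = fetch_status()
        if payload.get("status") in TERMINAL_STATUSES:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not reach a terminal status in time")
        sleep(interval_s)
```

On `done`, read the download links from `result.artifacts` in the returned payload. The `sleep` parameter exists only so the loop can be exercised in tests without waiting.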

### Failure Notes

- `invalid_pdf` covers invalid file types and malformed uploads rejected before the worker starts.
- `corrupt_pdf` is reserved for damaged PDFs that fail deeper validation or parser execution.
- `password_protected` is returned when the document requires a password.
- `ocr_required` is returned for scans or image-only PDFs when hybrid OCR is disabled, unavailable, or still produces too little extractable text.
- `invalid_page_range` is returned when the submitted page selector is malformed or selects no valid pages.
- `page_limit_exceeded` is returned when the requested page set is larger than the plan-specific parse cap.
- `server_busy` or `backend_unavailable` indicate temporary capacity problems. Retry with the same Idempotency-Key when safe.
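
A client can use these codes to decide whether to retry or to fix the request. The sketch below assumes error bodies follow the `error.code` shape shown in the error examples on this page; `RETRYABLE_CODES` and `is_retryable` are hypothetical names, and rate-limit handling is deliberately left out.

```python
from typing import Dict

# Codes this page describes as temporary capacity problems.
RETRYABLE_CODES = {"server_busy", "backend_unavailable"}


def is_retryable(error_body: Dict) -> bool:
    """Return True when the error code signals a temporary capacity problem."""
    code = error_body.get("error", {}).get("code")
    return code in RETRYABLE_CODES
```

When `is_retryable` returns `True`, resubmit with the same `Idempotency-Key`; for codes such as `password_protected` or `invalid_page_range`, change the input instead.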

### Error Examples

#### Password-protected PDF

- Status: `400`
- Code: `password_protected`

The document cannot be parsed until it is decrypted outside the public API lane.

##### 400 error

```json
{
  "error": {
    "code": "password_protected",
    "message": "This PDF is password-protected and cannot be parsed without a password.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
```

#### OCR required

- Status: `400`
- Code: `ocr_required`

The parser could not extract text from a scan or image-only file.

##### 400 error

```json
{
  "error": {
    "code": "ocr_required",
    "message": "This PDF appears to require OCR before it can be parsed.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
```

#### Invalid page range

- Status: `400`
- Code: `invalid_page_range`

The submitted selector is malformed or does not resolve to valid pages.

##### 400 error

```json
{
  "error": {
    "code": "invalid_page_range",
    "message": "The requested page_range is invalid for this PDF.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
```
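
A client-side check can catch malformed selectors before upload. This is a sketch of one plausible interpretation, not the server's validator: it assumes 1-based page numbers, ignores pages beyond `page_count`, and rejects selectors that resolve to no pages. `parse_page_range` is a hypothetical name.

```python
from typing import List


def parse_page_range(selector: str, page_count: int) -> List[int]:
    """Expand a selector such as '1-3,5,9-11' into sorted 1-based page numbers."""
    pages = set()
    for part in selector.split(","):
        bounds = [b.strip() for b in part.split("-")]
        if len(bounds) > 2 or not all(b.isdecimal() for b in bounds):
            raise ValueError(f"malformed segment: {part!r}")
        start, end = int(bounds[0]), int(bounds[-1])
        if start < 1 or end < start:
            raise ValueError(f"invalid range: {part!r}")
        # Assumption: pages past the end of the document are silently dropped.
        pages.update(range(start, min(end, page_count) + 1))
    if not pages:
        raise ValueError("selector does not resolve to any valid page")
    return sorted(pages)
```

The length of the returned list is also the selected-page count that plan page limits would apply to.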

#### Plan page limit exceeded

- Status: `400`
- Code: `page_limit_exceeded`

The requested document or selected page range is larger than the active plan allows.

##### 400 error

```json
{
  "error": {
    "code": "page_limit_exceeded",
    "message": "Requested page range exceeds your plan limit.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
```

#### Authentication failure

- Status: `401`
- Code: `invalid_api_key`

Returned when the bearer token is missing, revoked, expired, or not allowed to use the API lane.

##### 401 error

```json
{
  "error": {
    "code": "invalid_api_key",
    "message": "Invalid API key.",
    "type": "auth_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
```

#### Idempotency conflict

- Status: `409`
- Code: `idempotency_key_reused`

Returned when the same Idempotency-Key is reused with a different payload than the original request.

##### 409 error

```json
{
  "error": {
    "code": "idempotency_key_reused",
    "message": "This Idempotency-Key was already used with a different request.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
```
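
One way to avoid this conflict is to derive the key from the request itself, so a changed file or option always yields a new key. This is a client-side convention, not something DocuShell requires; `derive_idempotency_key` and its `prefix` parameter are hypothetical.

```python
import hashlib
from typing import Dict


def derive_idempotency_key(
    file_bytes: bytes, options: Dict[str, str], prefix: str = "parse"
) -> str:
    """Build a deterministic key from the file content and parse options."""
    digest = hashlib.sha256()
    digest.update(file_bytes)
    for name in sorted(options):  # sort so option order does not change the key
        digest.update(b"\x00" + name.encode("utf-8") + b"=" + options[name].encode("utf-8"))
    return f"{prefix}-{digest.hexdigest()[:32]}"
```

The trade-off is that an intentional resubmission of the same file with the same options replays the earlier response while the idempotency window is open; add a caller-chosen option or prefix when a fresh job is wanted.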

#### Webhook access disabled

- Status: `403`
- Code: `webhook_access_disabled`

Returned when a Starter API key submits webhook fields. Pro, Growth, and Scale include webhooks.

##### 403 error

```json
{
  "error": {
    "code": "webhook_access_disabled",
    "message": "Webhooks are available on Pro, Growth, and Scale. Starter includes API access without webhooks.",
    "type": "billing_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
```

#### Rate limit

- Status: `429`
- Code: `rate_limit_exceeded`

Returned when the API key or caller fingerprint exceeds the configured request rate.

##### 429 error

```json
{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded.",
    "type": "rate_limit_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
```

## Batch Endpoint

Use batch parse when you want one accepted request, one batch status, and one shared set of parse options for multiple PDFs.

- Method: `POST`
- Path: `/v1/parse/batch`
- Auth: Bearer token required on submit, status, and artifact download requests.
- Idempotency: Required `Idempotency-Key`. The default idempotency window is one hour from submit, intentionally shorter than Stripe-style 24-hour idempotency. Reusing the same key and request replays the accepted response; reusing the key with a different request returns `409 idempotency_key_reused`.
- Content type: `multipart/form-data`

Submit up to 10 PDFs as one async parse batch with shared parse options and per-file artifacts.

### Headers

| Name | Type | Required | Location | Description |
| --- | --- | --- | --- | --- |
| Authorization | `Bearer <API_KEY>` | Yes | header | User-owned API key created in the DocuShell dashboard. |
| Idempotency-Key | string | Yes | header | Required for every batch submit. Use a fresh key per logical batch and reuse it only for the exact same retry. |

### Request Fields

| Name | Type | Required | Location | Description |
| --- | --- | --- | --- | --- |
| files[] | file[] | Yes | multipart | PDF uploads for the batch. Every file is preflighted before enqueue; one invalid file rejects the whole submit. |
| page_range | string | No | multipart | Shared page selector applied to every file, such as `1-3,5`. Selected pages count toward the batch total page limit. |
| include_header_footer | boolean | No | multipart | Shared setting. Set to `true` to keep repeated headers and footers in extracted output. Default: false |
| use_struct_tree | boolean | No | multipart | Shared setting. Set to `true` to prefer native tagged-PDF structure when available. Default: false |
| sanitize | boolean | No | multipart | Shared setting. Set to `true` to mask email addresses, URLs, and phone numbers in extracted output. Default: false |
| reading_order | `xycut` \| `off` | No | multipart | Shared reading-order strategy. |
| table_method | `default` \| `cluster` | No | multipart | Shared table-detection strategy. |
| keep_line_breaks | boolean | No | multipart | Shared setting. Set to `true` when text-oriented output should preserve original line breaks more closely. Default: false |
| output_mode | `json` \| `both` \| `html` \| `all` | No | multipart | Backward-compatible artifact bundle selector. Do not send this together with `formats`. Default: both |
| formats | `json` \| `markdown` \| `html` \| `text` \| `annotated_pdf` \| `markdown_with_html` \| `markdown_with_images` \| `tagged_pdf` | No | multipart | Explicit artifact list. Send repeated fields or a comma-separated value. Do not send this together with `output_mode`. |
| hybrid_mode | `auto` \| `full` | No | multipart | Optional shared hybrid triage override when the hybrid backend is enabled by operations. |
| image_output | `off` \| `embedded` \| `external` | No | multipart | Shared image handling for image-capable outputs. |
| x-docushell-webhook-url | string | No | header | Optional public HTTPS endpoint for the terminal batch webhook. URLs with credentials, localhost, private, reserved, or metadata-service addresses are rejected. |
| x-docushell-webhook-secret | string | No | header | Required when `x-docushell-webhook-url` is present. Must be 16-256 characters with sufficient variety; do not reuse an API key. |
| x-docushell-webhook-endpoint-id | string | No | header | Saved managed webhook endpoint id. Use this instead of sending a per-request webhook URL and secret. |
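
The table above requires a webhook secret of 16-256 characters that is not a reused API key. A minimal sketch of generating one with Python's standard library follows. The helper name `new_webhook_secret` is hypothetical, and the service's exact "sufficient variety" check is not specified here, so treat a random URL-safe token as a reasonable guess rather than a guaranteed pass.

```python
import secrets


def new_webhook_secret(num_bytes: int = 32) -> str:
    """Return a random URL-safe secret (hypothetical helper, not part of the API).

    32 random bytes encode to 43 URL-safe characters, which sits inside the
    documented 16-256 character range.
    """
    secret = secrets.token_urlsafe(num_bytes)
    if not 16 <= len(secret) <= 256:
        raise ValueError("secret length is outside the documented 16-256 range")
    return secret
```

Generate the secret once per receiver, store it next to the receiver configuration, and send it in `x-docushell-webhook-secret`.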

### Request Notes

- Batch parse is async-only. Submit returns `202`; poll `GET /v1/parse/batch/:batchId` as the source of truth for batch state.
- All files are preflighted before enqueue. Empty, non-PDF, corrupt, password-protected, oversized, invalid-page-range, per-file page-limit, total-byte-limit, and total-page-limit failures reject the whole submit.
- Default v1 limits are 10 files, 100 MB total upload, 500 selected pages, and 2 batch submits per minute.
- Parse options are shared across the batch. v1 does not support per-file parse settings.
- Send either `output_mode` or `formats`, not both. Only one markdown-style format (`markdown`, `markdown_with_html`, or `markdown_with_images`) can be requested.
- The batch lane has separate backpressure. Queue saturation returns `503 server_busy` with `Retry-After`; rate limits return `429`.
- Batch responses report `estimated_credits` only: each file is estimated at `max(10, selected_pages)` credits. v1 does not perform final credit settlement on this lane.
- Every status and download request is owner-scoped. Unknown batches, files, or owner mismatches return `404`.
- Artifacts expire one hour after terminal completion by default. Batch idempotency expires one hour after submit by default.
- Terminal statuses are `completed`, `completed_with_failures`, and `failed`. No per-file retry is attempted in v1.
- Webhooks are signed, best-effort terminal notifications with short, bounded retries. Polling remains the source of truth.
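
Because one invalid file rejects the whole submit, it can help to check the documented limits locally before uploading. The sketch below mirrors the default v1 limits, the `output_mode`/`formats` rule, and the `max(10, selected_pages)` credit estimate from the notes above. The function names are hypothetical, and reading "100 MB" as 100,000,000 bytes is an assumption; the server remains the authority on every check.

```python
MAX_FILES = 10
MAX_TOTAL_BYTES = 100_000_000  # assumption: "100 MB" read as decimal megabytes
MAX_SELECTED_PAGES = 500
MARKDOWN_FAMILY = {"markdown", "markdown_with_html", "markdown_with_images"}


def estimate_credits(selected_pages):
    """Sum max(10, selected_pages) over the files, as the notes describe."""
    return sum(max(10, pages) for pages in selected_pages)


def check_batch(file_sizes, selected_pages, formats=None, output_mode=None):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if not 1 <= len(file_sizes) <= MAX_FILES:
        problems.append("file count must be between 1 and 10")
    if sum(file_sizes) > MAX_TOTAL_BYTES:
        problems.append("total upload exceeds 100 MB")
    if sum(selected_pages) > MAX_SELECTED_PAGES:
        problems.append("selected pages exceed 500")
    if formats and output_mode:
        problems.append("send either output_mode or formats, not both")
    if formats and len(MARKDOWN_FAMILY.intersection(formats)) > 1:
        problems.append("only one markdown-style format is allowed")
    return problems
```

For a two-file batch with 8 and 4 selected pages, `estimate_credits([8, 4])` gives 20, which matches the `estimated_credits` value in the sample responses.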

### Sample Requests

#### Multipart batch submit

```bash
curl -X POST "https://api.docushell.com/api/v1/parse/batch" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Idempotency-Key: parse-batch-demo-001" \
  -H "x-docushell-webhook-url: https://example.com/docushell/webhooks" \
  -H "x-docushell-webhook-secret: replace_with_a_long_random_secret" \
  -F "files[]=@./report-q1.pdf;type=application/pdf" \
  -F "files[]=@./report-q2.pdf;type=application/pdf" \
  -F "page_range=1-5" \
  -F "formats=json,markdown"
```

### Queued Response

#### Accepted batch response

```json
{
  "batch_id": "9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab",
  "service": "parse-pdf",
  "status": "queued",
  "counts": {
    "total": 2,
    "queued": 2,
    "processing": 0,
    "completed": 0,
    "failed": 0
  },
  "usage": {
    "total_upload_bytes": 1843200,
    "total_selected_pages": 12
  },
  "estimated_credits": 20,
  "created_at": "2026-05-23T10:12:30.000Z",
  "updated_at": "2026-05-23T10:12:30.000Z",
  "completed_at": null,
  "expires_at": null,
  "webhook_delivery": {
    "status": "pending"
  },
  "metrics": null,
  "files": [
    {
      "file_id": "file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20",
      "status": "queued",
      "page_count": 8,
      "billable_pages": 8,
      "estimated_credits": 10
    },
    {
      "file_id": "file_95ac2fb4-060d-427c-9f86-864747dfb935",
      "status": "queued",
      "page_count": 4,
      "billable_pages": 4,
      "estimated_credits": 10
    }
  ],
  "links": {
    "status": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab",
    "download": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/download"
  }
}
```

The same `Idempotency-Key` with the same request replays this response. A different request with that key returns `409 idempotency_key_reused`.
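
One way to keep retries on the same key while giving different requests different keys is to derive the key from the request content. This is a sketch of that idea, not a documented DocuShell convention: `file_digest`, `batch_idempotency_key`, and the `label` parameter are hypothetical, and the service may compare requests differently from this hash.

```python
import hashlib
import json


def file_digest(data: bytes) -> str:
    """SHA-256 hex digest of one file's bytes."""
    return hashlib.sha256(data).hexdigest()


def batch_idempotency_key(file_digests, options, label="", prefix="parse-batch"):
    """Derive a stable key from file digests (in upload order), options, and a label.

    The same inputs always give the same key, so a transport-level retry reuses it.
    Change `label` when you intentionally resubmit identical content as a new batch.
    """
    canonical = json.dumps(
        {"files": list(file_digests), "options": options, "label": label},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}-{digest[:32]}"
```

Without a fresh `label`, resubmitting identical content inside the idempotency window would replay the earlier response instead of starting a new batch.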

### Status Response

#### Batch status response

```json
{
  "batch_id": "9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab",
  "service": "parse-pdf",
  "status": "completed_with_failures",
  "counts": {
    "total": 2,
    "queued": 0,
    "processing": 0,
    "completed": 1,
    "failed": 1
  },
  "usage": {
    "total_upload_bytes": 1843200,
    "total_selected_pages": 12
  },
  "estimated_credits": 20,
  "created_at": "2026-05-23T10:12:30.000Z",
  "updated_at": "2026-05-23T10:14:04.000Z",
  "completed_at": "2026-05-23T10:14:04.000Z",
  "expires_at": "2026-05-23T11:14:04.000Z",
  "webhook_delivery": {
    "status": "delivered"
  },
  "metrics": {
    "queue_wait_ms": 214,
    "duration_ms": 82341
  },
  "files": [
    {
      "file_id": "file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20",
      "status": "completed",
      "page_count": 8,
      "billable_pages": 8,
      "estimated_credits": 10,
      "artifacts": {
        "json_download": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/files/file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20/download?format=json",
        "markdown_download": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/files/file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20/download?format=markdown"
      }
    },
    {
      "file_id": "file_95ac2fb4-060d-427c-9f86-864747dfb935",
      "status": "failed",
      "page_count": 4,
      "billable_pages": 4,
      "estimated_credits": 10,
      "failure_code": "corrupt_pdf"
    }
  ],
  "links": {
    "status": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab",
    "download": "/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/download"
  }
}
```

`completed_with_failures` means at least one file produced artifacts and at least one accepted file failed after enqueue. Check each file status before downloading.
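
The per-file check can be sketched as a small helper that separates completed files (with their download links) from failed ones. It relies only on the fields shown in the sample status response; the function name `split_batch_files` is hypothetical.

```python
def split_batch_files(status_payload):
    """Return (completed, failed) for a batch status payload.

    completed maps file_id -> dict of artifact download links.
    failed maps file_id -> failure_code (None if the field is absent).
    Files that are still queued or processing appear in neither dict.
    """
    completed, failed = {}, {}
    for entry in status_payload.get("files", []):
        status = entry.get("status")
        if status == "completed":
            completed[entry["file_id"]] = dict(entry.get("artifacts", {}))
        elif status == "failed":
            failed[entry["file_id"]] = entry.get("failure_code")
    return completed, failed
```

Applied to the sample above, the first file lands in `completed` with its `json_download` and `markdown_download` links, and the second lands in `failed` with `corrupt_pdf`.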

### Download Samples

#### Batch ZIP download

```bash
curl "https://api.docushell.com/api/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/download" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output parse-batch.zip
```

The batch ZIP is generated on demand from completed file artifacts and is not a durable artifact itself.

#### Per-file artifact download

```bash
curl "https://api.docushell.com/api/v1/parse/batch/9c7f2f2e-4f4b-4cbf-bb12-7fd3c1f4f2ab/files/file_6c27f6d4-43c1-48c6-a3c8-3a89d0a3cf20/download?format=json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output report.json
```

Use the file-specific download route when you want one artifact instead of the generated batch ZIP.

### Artifacts

- Per-file download links appear under `files[].artifacts` after each completed file is promoted.
- The batch ZIP is generated on demand from completed file artifacts, streamed, and removed after the response finishes.
- Per-file artifact downloads can be retried until the batch expires.
- The per-file `format` query must match an artifact requested for the batch.
- Original filenames are not part of the public status payload or webhook logs.

### Poll And Download

- Poll `GET /v1/parse/batch/:batchId` until `status` becomes `completed`, `completed_with_failures`, or `failed`.
- When the batch completes or partially completes, use `GET /v1/parse/batch/:batchId/download` for the generated ZIP or use each `files[].artifacts.*_download` link for a specific file artifact.
- If a file status is `failed`, its per-file download returns `409 batch_file_failed`. Continue downloading completed files until `expires_at`.
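
The polling step above can be sketched as a loop that stops on any of the three terminal statuses. The HTTP call is injected as `fetch_status`, a hypothetical callable that returns the decoded status payload, so the sketch stays independent of any HTTP client. The interval and timeout defaults are illustrative, not documented values.

```python
import time

TERMINAL_STATUSES = {"completed", "completed_with_failures", "failed"}


def poll_batch(fetch_status, interval_s=2.0, timeout_s=600.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until the batch is terminal, then return the payload.

    Raises TimeoutError if no terminal status is seen within timeout_s.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        payload = fetch_status()
        if payload.get("status") in TERMINAL_STATUSES:
            return payload
        if clock() >= deadline:
            raise TimeoutError("batch did not reach a terminal status in time")
        sleep(interval_s)
```

In a real client, `fetch_status` would wrap an authenticated `GET /v1/parse/batch/:batchId` request. Keep the interval modest so polling does not run into the `429` rate limit.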

### Failure Notes

- `invalid_pdf`, `corrupt_pdf`, `password_protected`, `invalid_page_range`, and `page_limit_exceeded` can be returned during preflight before a batch is accepted.
- `server_busy` with `Retry-After` means the dedicated batch queue or active batch lane is saturated. Retry later, reusing the same `Idempotency-Key` only for the exact same request.
- Download `400` means the `format` in the query was not among the artifacts requested for this batch.
- Download `425 batch_not_ready` means the batch or file is not terminal yet.
- Download `409 batch_file_failed` means that accepted file reached a terminal failed state.
- Download `410 output_expired` means the artifact TTL has passed.
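
The notes above map naturally onto a small decision function. This is a sketch under assumptions: the action names are hypothetical labels for client code, and `Retry-After` is handled only in its delta-seconds form, with a fallback delay when the header is missing or carries an HTTP date.

```python
def parse_retry_after(value, default_s=30.0):
    """Read Retry-After as seconds; fall back to default_s if it is absent or not numeric."""
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        return default_s


def next_step(http_status, error_code=None, retry_after=None):
    """Return (action, delay_seconds) for a batch submit or download response."""
    if http_status == 503 and error_code == "server_busy":
        return ("retry_same_key", parse_retry_after(retry_after))
    if http_status == 425:
        return ("poll_again", 0.0)      # batch or file is not terminal yet
    if http_status == 409 and error_code == "batch_file_failed":
        return ("skip_file", 0.0)       # this file will never have artifacts
    if http_status == 410:
        return ("resubmit", 0.0)        # artifact TTL has passed
    if http_status == 400:
        return ("fix_request", 0.0)     # format was not requested for this batch
    return ("inspect", 0.0)
```

Only the `503 server_busy` branch retries the same request unchanged; the others either wait for status, move on to the next file, or need a changed request.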

### Error Examples

#### Batch not ready

- Status: `425`
- Code: `batch_not_ready`

Returned when a batch or file download is attempted before terminal status.

##### 425 error

```json
{
  "error": {
    "code": "batch_not_ready",
    "message": "Batch is not ready.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
```

#### Batch file failed

- Status: `409`
- Code: `batch_file_failed`

Returned when a per-file artifact is requested for a file that failed after enqueue.

##### 409 error

```json
{
  "error": {
    "code": "batch_file_failed",
    "message": "Batch file failed.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
```

#### Expired output

- Status: `410`
- Code: `output_expired`

Returned after the one-hour default TTL for completed batch artifacts passes.

##### 410 error

```json
{
  "error": {
    "code": "output_expired",
    "message": "Batch output expired.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
```

#### Queue saturated

- Status: `503`
- Code: `server_busy`

Returned when batch-specific queue backpressure rejects the submit.

##### 503 error

```json
{
  "error": {
    "code": "server_busy",
    "message": "The parse batch queue is busy. Retry later.",
    "type": "internal_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
```

#### Authentication failure

- Status: `401`
- Code: `invalid_api_key`

Returned when the bearer token is missing, revoked, expired, or not allowed to use the API lane.

##### 401 error

```json
{
  "error": {
    "code": "invalid_api_key",
    "message": "Invalid API key.",
    "type": "auth_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
```

#### Idempotency conflict

- Status: `409`
- Code: `idempotency_key_reused`

Returned when the same `Idempotency-Key` is reused with a different payload than the original request.

##### 409 error

```json
{
  "error": {
    "code": "idempotency_key_reused",
    "message": "This Idempotency-Key was already used with a different request.",
    "type": "invalid_request_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
```

#### Webhook access disabled

- Status: `403`
- Code: `webhook_access_disabled`

Returned when a Starter API key submits webhook fields. Pro, Growth, and Scale include webhooks.

##### 403 error

```json
{
  "error": {
    "code": "webhook_access_disabled",
    "message": "Webhooks are available on Pro, Growth, and Scale. Starter includes API access without webhooks.",
    "type": "billing_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
```

#### Rate limit

- Status: `429`
- Code: `rate_limit_exceeded`

Returned when the API key or caller fingerprint exceeds the configured request rate.

##### 429 error

```json
{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded.",
    "type": "rate_limit_error",
    "request_id": "req_01JX8Y62XCDNZ2BM7TBM2M9Q8E"
  }
}
```
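
Every error example in this section uses the same envelope: an `error` object with `code`, `message`, `type`, and `request_id`. A minimal sketch of decoding it into a typed value follows; the `ApiError` class and `parse_error` function are hypothetical names, not part of an official SDK.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiError:
    status: int
    code: str
    error_type: str
    message: str
    request_id: str


def parse_error(status: int, body: str) -> ApiError:
    """Decode the documented error envelope; missing fields become empty strings."""
    error = json.loads(body).get("error", {})
    return ApiError(
        status=status,
        code=error.get("code", ""),
        error_type=error.get("type", ""),
        message=error.get("message", ""),
        request_id=error.get("request_id", ""),
    )
```

Logging `request_id` alongside `code` makes it easier to reference a specific failed call in a support request.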

## Completion Webhooks

Use `webhook_url` and `webhook_secret` for per-request completion callbacks, or send the `x-docushell-webhook-url` and `x-docushell-webhook-secret` headers on batch parse requests.

Receivers must validate `x-docushell-signature`, deduplicate by `x-docushell-delivery`, and respond within the 10-second request timeout. Use public HTTPS staging endpoints or approved tunnels for receiver tests.

Terminal event names include `pdf.parse.completed`, `pdf.parse.failed`, `pdf.parse.batch.completed`, `pdf.parse.batch.completed_with_failures`, `pdf.parse.batch.failed`, `resume.parse.completed`, `resume.parse.failed`, `resume.batch.completed`, `resume.batch.completed_with_failures`, and `resume.batch.failed`.
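
The receiver duties above can be sketched as three small pieces: a signature check, a delivery de-duplicator, and an outcome reader for the event name. The signature scheme is not specified in this section, so the check below assumes a hex-encoded HMAC-SHA256 of the raw request body keyed with the webhook secret. That is an assumption to confirm against the webhook reference before relying on it. All names are hypothetical.

```python
import hashlib
import hmac


def verify_signature(secret: str, raw_body: bytes, signature_header: str) -> bool:
    """ASSUMED scheme: hex HMAC-SHA256 over the raw body, compared in constant time."""
    expected = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    received = signature_header.strip().lower()
    return hmac.compare_digest(expected.encode("ascii"), received.encode("utf-8"))


class DeliveryLog:
    """In-memory de-duplication keyed by the x-docushell-delivery header value."""

    def __init__(self):
        self._seen = set()

    def first_time(self, delivery_id: str) -> bool:
        """Return True the first time an id is seen, False on repeats."""
        if delivery_id in self._seen:
            return False
        self._seen.add(delivery_id)
        return True


def event_outcome(event_name: str) -> str:
    """Return the last dot-separated segment, e.g. 'completed_with_failures'."""
    return event_name.rsplit(".", 1)[-1]
```

A production receiver would persist seen delivery ids in shared storage and acknowledge quickly, deferring artifact downloads to a background task so the response stays within the timeout.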

## Artifact Downloads

The shared download route streams one artifact at a time.

### JSON artifact download

```bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=json" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

Use `format=json` to stream the structured document artifact directly.

### Markdown artifact download

```bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

Markdown downloads preserve the extracted reading order in a plain-text-friendly artifact.

### HTML artifact download

```bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=html" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

HTML downloads are available when the parse job requested HTML output.

### Plain text artifact download

```bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=text" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

Plain text downloads are available when the parse job requested all artifacts or explicit text output.

### Annotated PDF debug artifact download

```bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=annotated_pdf" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.annotated.pdf
```

Annotated PDF downloads are visual debug artifacts for validating extracted structure against the source page.

### Markdown with HTML download

```bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown_with_html" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.with-html.md
```

`markdown_with_html` preserves richer inline/table markup inside a Markdown-family artifact.

### Markdown with images download

```bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=markdown_with_images" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.with-images.md
```

When `image_output=external` emits sidecars, this download may be a ZIP archive containing Markdown plus image assets.
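
Because the same route can return either a Markdown file or an archive, a client may want to inspect the bytes before saving them. The sketch below uses Python's `zipfile` module to tell the two apart. The function name and the fallback filename `document.md` are hypothetical, and the entry names inside a real archive are whatever the service emits.

```python
import io
import zipfile


def unpack_markdown_download(data: bytes) -> dict:
    """Return {name: bytes} for a markdown_with_images download.

    A ZIP payload is expanded into its member files; anything else is treated
    as a single Markdown document under a placeholder name.
    """
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return {
                name: archive.read(name)
                for name in archive.namelist()
                if not name.endswith("/")  # skip directory entries
            }
    return {"document.md": data}
```

Checking the payload itself avoids depending on a particular `Content-Type` value, which this section does not document.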

### Tagged PDF download

```bash
curl "https://api.docushell.com/api/v1/jobs/job_01JX8Y5YJ2M2D8N1AQ5F7Q3KVT/download?format=tagged_pdf" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output document.tagged.pdf
```

Tagged PDF output is produced by automated structure inference and should be reviewed before you make accessibility compliance claims.

## JSON Artifact Versus Markdown Artifact

Use the JSON artifact when you need structure, semantic content types, or document geometry. It is the canonical machine-readable representation and the right choice for pipelines that need sections, lists, tables, headings, captions, or positional metadata.

Use the Markdown artifact when you need a lightweight, portable text representation that still follows the extracted reading order. It works well for previews, quick QA, search indexing, and downstream LLM ingestion.

The parse status payload also includes `result.metadata` so you can inspect which extraction-tuning options were applied to a completed job.
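
To illustrate what the JSON artifact's structure makes possible, the sketch below walks the document tree through its `kids` arrays and counts nodes by semantic type. The `kids` field comes from the artifact description; the `type` key used for the node label is an assumption made for illustration, so adjust `type_key` to the field name in your actual artifact.

```python
from collections import Counter


def count_node_types(root: dict, type_key: str = "type") -> Counter:
    """Count nodes by their type label, visiting every node reachable via 'kids'.

    type_key is an ASSUMED field name; nodes without it are traversed but not counted.
    """
    counts = Counter()
    stack = [root]
    while stack:
        node = stack.pop()
        label = node.get(type_key)
        if label is not None:
            counts[label] += 1
        stack.extend(kid for kid in node.get("kids", []) if isinstance(kid, dict))
    return counts
```

A count like this is a quick sanity check that tables, headings, and lists were detected before a document is fed into a chunking or indexing pipeline.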

## Troubleshooting Parse Failures

Most parse failures are actionable before you retry. Keep the same `Idempotency-Key` only when you are replaying the exact same logical request after a timeout or transport issue.

- `invalid_pdf`: confirm the upload is a real PDF before retrying.
- `corrupt_pdf`: re-export or repair the file, then resubmit.
- `password_protected`: decrypt the PDF before uploading. Password submission is not part of this public lane yet.
- `ocr_required`: run OCR upstream first, then resubmit the text-native PDF.
- `invalid_page_range`: retry with a selector like `1-3,5` that resolves inside the document bounds.
- Unexpected text layout: retry with `reading_order=off` or `reading_order=xycut`, depending on whether you want less or more reading-order reconstruction.
- Weak table extraction: retry with `table_method=cluster` if the default path misses cell groupings.

> If the gateway returns `server_busy` or `backend_unavailable`, retry safely with the same `Idempotency-Key` after the transient issue clears.
