# Security And Privacy

DocuShell is designed around ephemeral storage, strict input validation, rate limiting, PDF prompt-injection defenses, and private-service isolation for browser-based PDF workflows.

Source: https://docs.docushell.com/security
Category: Security

## Related

- [Public API Gateway](/public-api-gateway.md): Understand how public requests are normalized before reaching processing services.
- [Parse PDF](/parse-pdf.md#rag-ingestion-workflow): Review RAG ingestion, structure metadata, and citation guidance.
- [Webpage to PDF](/webpage-to-pdf.md): Review URL rendering and private-network protections.

## Privacy Model

- Uploaded and generated files are ephemeral and swept from `/tmp/docushell-storage` within one hour.
- Parse batch artifacts use the batch storage root and expire one hour after terminal completion by default.
- Where a service supports one-time streaming, generated files are streamed through the gateway and deleted once the download completes or is interrupted.
- DocuShell does not turn user PDFs into a long-lived document store.
- Request IDs and operational metadata are used for debugging without exposing internal worker URLs.
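
The retention behavior described above can be pictured as a periodic sweep over the storage root. The following is a minimal sketch, assuming files are aged by modification time; the function name `sweep_expired` and the `RETENTION_SECONDS` constant are hypothetical and do not describe DocuShell's internal implementation.

```python
import time
from pathlib import Path
from typing import List, Optional

RETENTION_SECONDS = 60 * 60  # assumed one-hour window, mirroring the privacy model


def sweep_expired(root: Path, now: Optional[float] = None,
                  retention: float = RETENTION_SECONDS) -> List[Path]:
    """Delete regular files under root whose mtime is older than the retention window."""
    now = time.time() if now is None else now
    removed = []
    # Materialize the listing first so deletions do not disturb iteration.
    for path in list(root.rglob("*")):
        if path.is_file() and now - path.stat().st_mtime > retention:
            path.unlink(missing_ok=True)
            removed.append(path)
    return removed
```

Calling `sweep_expired(Path("/tmp/docushell-storage"))` on a schedule would approximate the one-hour guarantee; how the real sweeper is triggered is not specified here.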

## Validation

- Public inputs are validated with schemas before work is queued.
- PDF uploads are checked by MIME type and magic bytes, not just the file extension.
- Parse batch preflight is all-or-nothing: every file must pass PDF, page, size, and password checks before the batch is queued.
- Plan limits are enforced before or during queueing so oversized jobs fail early.
- Structured logs and shared error middleware keep failures consistent across services.
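
To illustrate the magic-byte check and the all-or-nothing preflight, here is a small sketch. It assumes a strict check for the `%PDF-` signature at the start of the file and covers only format and size; page and password checks need a PDF parser and are omitted. The names `looks_like_pdf` and `preflight_batch` are hypothetical, not part of the DocuShell API.

```python
from typing import Dict, List, Tuple

PDF_MAGIC = b"%PDF-"  # signature that opens a PDF file


def looks_like_pdf(data: bytes, declared_mime: str) -> bool:
    """Require both a PDF MIME type and the %PDF- signature at byte 0."""
    return declared_mime == "application/pdf" and data.startswith(PDF_MAGIC)


def preflight_batch(files: Dict[str, Tuple[bytes, str]], max_bytes: int) -> List[str]:
    """Return rejection reasons; an empty list means the whole batch may be queued."""
    errors = []
    for name, (data, mime) in files.items():
        if not looks_like_pdf(data, mime):
            errors.append(f"{name}: not a PDF")
        elif len(data) > max_bytes:
            errors.append(f"{name}: exceeds size limit")
    return errors
```

A caller would queue the batch only when `preflight_batch(...)` returns an empty list, so a single bad file rejects the whole submission.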

## PDF Prompt-Injection Defense

PDFs can contain machine-readable text that a human reviewer cannot reasonably see. That matters for LLM, RAG, resume screening, contract review, and document automation workflows because hidden instructions can be extracted and passed into a downstream model as trusted context.

DocuShell keeps rendering-mismatch defenses enabled for Parse PDF output. The public API does not expose a field to disable these defenses. For untrusted or internet-sourced PDFs, treat this as part of the extraction baseline rather than an optional tuning knob.

`sanitize=true` is a separate control. It masks visible sensitive data in extracted output, such as emails, URLs, and phone numbers. It is disabled by default because it changes legitimate document content.

- Hidden or transparent text is filtered when it is not part of what a normal reader should see.
- Extremely small text and off-page text are filtered so machine-only prompts are less likely to enter model context.
- Hidden PDF layers are excluded where the parser can identify that the layer is not visible.
- Header and footer content remains excluded by default to reduce repeated boilerplate in downstream context.
- Use the Parse Playground, JSON artifact, and annotated PDF artifact to inspect what the parser extracted before connecting a new RAG workflow to production.

> These defenses reduce PDF-specific prompt-injection risk, but they are not a replacement for normal LLM application controls such as instruction hierarchy, retrieval allowlists, output validation, and least-privilege tool access.
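
The filtering rules in this section can be sketched as a visibility predicate over extracted text spans. This is an illustration under stated assumptions: the `TextSpan` record, its field names, and the font-size and opacity thresholds are all hypothetical and are not DocuShell's parser schema or actual cutoffs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextSpan:
    # Hypothetical span record; field names are illustrative only.
    text: str
    x: float
    y: float
    font_size: float
    opacity: float
    layer_visible: bool = True


def is_reader_visible(span: TextSpan, page_width: float, page_height: float,
                      min_font_size: float = 2.0, min_opacity: float = 0.05) -> bool:
    """Reject spans on hidden layers, near-transparent, tiny, or with an off-page origin."""
    on_page = 0 <= span.x <= page_width and 0 <= span.y <= page_height
    return (span.layer_visible
            and span.opacity >= min_opacity
            and span.font_size >= min_font_size
            and on_page)


def filter_spans(spans: List[TextSpan], page_width: float,
                 page_height: float) -> List[TextSpan]:
    """Keep only the spans a normal reader would plausibly see."""
    return [s for s in spans if is_reader_visible(s, page_width, page_height)]
```

A downstream pipeline could run a similar check on its own side as defense in depth, for example when it ingests PDFs from sources other than Parse PDF.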

## Parse Safety Controls

Public Parse PDF controls separate document safety, sensitive-data masking, and workflow validation.

| Control | Default | DocuShell API behavior |
| --- | --- | --- |
| Rendering-mismatch filtering | Enabled | Filters hidden, off-page, tiny, transparent, or hidden-layer text where identifiable. No public disable field is exposed. |
| Sensitive-data masking | Disabled | Set `sanitize=true` to mask visible emails, URLs, and phone numbers in extracted output. |
| Header/footer exclusion | Enabled | Repeated page furniture stays out of output unless `include_header_footer=true` is requested. |
| PDF validation | Mandatory | Uploads are checked by schema, MIME, magic bytes, page preflight, password state, size, and plan limits before queueing. |
| Artifact inspection | Available | Use JSON, Markdown, text, and annotated PDF outputs to review extraction before indexing or prompting. |
| Ephemeral storage | Enabled | Uploads and generated artifacts are temporary and swept within the configured retention window. |
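
To make the effect of sensitive-data masking concrete, the sketch below replaces URLs, email addresses, and phone-like digit runs with placeholders. The regular expressions and the `[URL]`, `[EMAIL]`, and `[PHONE]` tokens are assumptions chosen for illustration; the actual patterns and output format used when `sanitize=true` is set are not documented here.

```python
import re

# Illustrative patterns only; production matching needs broader coverage.
URL_RE = re.compile(r"https?://[^\s<>\"']+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_sensitive(text: str) -> str:
    """Replace URLs, email addresses, and phone-like digit runs with placeholders."""
    # URLs go first so that any address embedded in a URL is masked with it.
    text = URL_RE.sub("[URL]", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

Because masking rewrites legitimate content, a workflow that needs contact details intact (for example, resume parsing) would leave it disabled, which matches the default in the table.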

## URL Security

Webpage-to-PDF rendering accepts public URLs only. Private network targets, intranet hosts, and cloud metadata IP ranges are blocked before Chromium is asked to render.

- Private and loopback network ranges are rejected.
- Cloud metadata endpoints are rejected.
- Rendering runs in isolated Chromium workers behind the gateway.
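
The address checks above can be approximated with Python's standard `ipaddress` module. This is a sketch, not DocuShell's gateway code: `is_blocked_ip` and `is_renderable_url` are hypothetical names, and the resolver is injected so the example runs without network access.

```python
import ipaddress
from typing import Callable, Iterable
from urllib.parse import urlsplit


def is_blocked_ip(ip: str) -> bool:
    """True for private, loopback, link-local (e.g. 169.254.169.254), and other non-public ranges."""
    addr = ipaddress.ip_address(ip)
    return (addr.is_private or addr.is_loopback or addr.is_link_local
            or addr.is_reserved or addr.is_multicast or addr.is_unspecified)


def is_renderable_url(url: str, resolve: Callable[[str], Iterable[str]]) -> bool:
    """Allow only http(s) URLs whose host resolves exclusively to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    addresses = list(resolve(parts.hostname))
    # An empty resolution or any blocked address rejects the URL.
    return bool(addresses) and not any(is_blocked_ip(a) for a in addresses)
```

In a real deployment the resolver would wrap something like `socket.getaddrinfo`, and the renderer would need to connect to the same addresses that were checked, otherwise a DNS change between validation and rendering could bypass the filter.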
