Security And Privacy
DocuShell is designed around ephemeral storage, strict input validation, rate limiting, PDF prompt-injection defenses, and private-service isolation for browser-based PDF workflows.
Privacy Model
- Uploaded and generated files are ephemeral and swept from /tmp/docushell-storage within one hour.
- Parse batch artifacts use the batch storage root and expire one hour after terminal completion by default.
- Where a service supports one-time streaming, generated files are streamed through the gateway and deleted once the download completes or is interrupted.
- DocuShell does not turn user PDFs into a long-lived document store.
- Request IDs and operational metadata are used for debugging without exposing internal worker URLs.
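
The one-hour sweep described above can be pictured with a small sketch. This is not DocuShell's actual sweeper: the function name `sweep_expired`, the `RETENTION_SECONDS` constant, and the use of file modification time as the age signal are all assumptions made for illustration.

```python
import time
from pathlib import Path
from typing import List, Optional

# Assumed retention window: one hour, matching the policy described above.
RETENTION_SECONDS = 60 * 60


def sweep_expired(root: Path, now: Optional[float] = None,
                  max_age: float = RETENTION_SECONDS) -> List[Path]:
    """Delete regular files under root whose mtime is older than max_age seconds.

    Hypothetical helper: a real sweeper might also track job state so that
    in-flight files are never removed.
    """
    now = time.time() if now is None else now
    removed = []
    # Materialize the listing first so deletions do not disturb iteration.
    for path in list(root.rglob("*")):
        if path.is_file() and now - path.stat().st_mtime > max_age:
            path.unlink()
            removed.append(path)
    return removed
```

A periodic task could call `sweep_expired(Path("/tmp/docushell-storage"))`. Passing `now` explicitly keeps the function easy to test.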
Validation
- Public inputs are validated with schemas before work is queued.
- PDF uploads are checked by MIME and magic bytes, not just file extension.
- Parse batch preflight is all-or-nothing: every file must pass PDF, page, size, and password checks before the batch is queued.
- Plan limits are enforced before or during queueing so oversized jobs fail early.
- Structured logs and shared error middleware keep failures consistent across services.
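
As a rough illustration of the MIME and magic-byte check and the all-or-nothing preflight, consider the sketch below. The names `looks_like_pdf` and `preflight_batch` are hypothetical, and the strict "header at byte 0" rule is an assumption; how tolerant DocuShell is of leading bytes is not documented here. Page, size, and password checks are omitted.

```python
from typing import List, Tuple

# Every PDF file begins with this signature, followed by a version such as "1.7".
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes, declared_mime: str) -> bool:
    """Require both a PDF MIME type and the PDF signature at the start of the data."""
    mime = declared_mime.split(";")[0].strip().lower()
    return mime == "application/pdf" and data.startswith(PDF_MAGIC)


def preflight_batch(files: List[Tuple[str, str, bytes]]) -> List[str]:
    """Return the names of files that fail the check.

    Each entry is (name, declared_mime, data). In an all-or-nothing
    preflight, the batch is queued only when this list is empty.
    """
    return [name for name, mime, data in files if not looks_like_pdf(data, mime)]
```

A file named `report.pdf` that actually contains a ZIP archive would be rejected by the signature check even though its extension and declared MIME type look correct.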
PDF Prompt-Injection Defense
PDFs can contain machine-readable text that a human reviewer cannot reasonably see. That matters for LLM, RAG, resume screening, contract review, and document automation workflows because hidden instructions can be extracted and passed into a downstream model as trusted context.
DocuShell keeps rendering-mismatch defenses enabled for Parse PDF output. The public API does not expose a field to disable these defenses. For untrusted or internet-sourced PDFs, treat this as part of the extraction baseline rather than an optional tuning knob.
sanitize=true is a separate control. It masks visible sensitive data in extracted output, such as emails, URLs, and phone numbers. It is disabled by default because it changes legitimate document content.
- Hidden or transparent text is filtered when it is not part of what a normal reader should see.
- Extremely small text and off-page text are filtered so machine-only prompts are less likely to enter model context.
- Hidden PDF layers are excluded where the parser can identify that the layer is not visible.
- Header and footer content remains excluded by default to reduce repeated boilerplate in downstream context.
- Use the Parse Playground, JSON artifact, and annotated PDF artifact to inspect what the parser extracted before connecting a new RAG workflow to production.
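
The filtering rules above can be sketched as a visibility predicate over extracted text spans. Everything in this example is hypothetical: the `TextSpan` shape, the thresholds `MIN_FONT_SIZE` and `MIN_OPACITY`, and the function names are illustrative assumptions, not DocuShell's parser internals.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TextSpan:
    """Hypothetical extracted span; coordinates and font size in PDF points."""
    text: str
    x: float
    y: float
    font_size: float
    opacity: float
    layer_visible: bool = True


# Assumed thresholds, chosen only for illustration.
MIN_FONT_SIZE = 2.0
MIN_OPACITY = 0.05


def is_visible(span: TextSpan, page_width: float, page_height: float) -> bool:
    """True when a normal reader could plausibly see the span on the page."""
    on_page = 0 <= span.x <= page_width and 0 <= span.y <= page_height
    return (
        on_page
        and span.font_size >= MIN_FONT_SIZE
        and span.opacity >= MIN_OPACITY
        and span.layer_visible
    )


def filter_visible_spans(spans: List[TextSpan], page_width: float,
                         page_height: float) -> List[TextSpan]:
    """Drop tiny, transparent, off-page, and hidden-layer spans."""
    return [s for s in spans if is_visible(s, page_width, page_height)]
```

On a US Letter page (612 by 792 points), a 0.5 pt span reading "ignore previous instructions" would be dropped while ordinary 11 pt body text is kept.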
Parse Safety Controls
Public Parse PDF controls separate document safety, sensitive-data masking, and workflow validation.
| Control | Default | DocuShell API behavior |
|---|---|---|
| Rendering-mismatch filtering | Enabled | Filters hidden, off-page, tiny, transparent, or hidden-layer text where identifiable. No public disable field is exposed. |
| Sensitive-data masking | Disabled | Set sanitize=true to mask visible emails, URLs, and phone numbers in extracted output. |
| Header/footer exclusion | Enabled | Repeated page furniture stays out of output unless include_header_footer=true is requested. |
| PDF validation | Mandatory | Uploads are checked by schema, MIME, magic bytes, page preflight, password state, size, and plan limits before queueing. |
| Artifact inspection | Available | Use JSON, Markdown, text, and annotated PDF outputs to review extraction before indexing or prompting. |
| Ephemeral storage | Enabled | Uploads and generated artifacts are temporary and swept within the configured retention window. |
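
To make the effect of sensitive-data masking concrete, the sketch below masks emails, URLs, and phone numbers in a string. It only illustrates the idea behind `sanitize=true`: the regular expressions, the placeholder tokens, and the function name `mask_sensitive` are assumptions, and the patterns are deliberately simplified.

```python
import re

# Simplified, assumed patterns; production masking would need broader coverage.
URL_RE = re.compile(r"https?://[^\s<>]+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_sensitive(text: str) -> str:
    """Replace URLs, emails, and phone-like digit runs with placeholder tokens."""
    text = URL_RE.sub("[URL]", text)      # URLs first, so embedded "@" is not re-matched
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

Because masking rewrites legitimate content, such as a contact line on a resume, it makes sense as an opt-in control rather than part of the always-on filtering.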
URL Security
Webpage-to-PDF rendering accepts public URLs and blocks private network targets, intranet hosts, and metadata IP ranges before Chromium is asked to render.
- Private and loopback network ranges are rejected.
- Cloud metadata endpoints are rejected.
- Rendering runs in isolated Chromium workers behind the gateway.
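
A minimal sketch of the pre-render URL check, using Python's standard `ipaddress` module, might look like the following. The function names and the injectable `resolver` parameter are assumptions for illustration; DocuShell's actual blocklist and resolution strategy are not described in this document.

```python
import ipaddress
import socket
from typing import Callable, List
from urllib.parse import urlsplit


def resolve_host(host: str) -> List[str]:
    """Resolve a hostname to its IP addresses with the system resolver."""
    return sorted({info[4][0] for info in socket.getaddrinfo(host, None)})


def is_public_address(addr: str) -> bool:
    """Reject private, loopback, link-local (cloud metadata), and other special ranges."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        return False
    return not (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
    )


def is_allowed_url(url: str,
                   resolver: Callable[[str], List[str]] = resolve_host) -> bool:
    """Allow only http(s) URLs whose host resolves exclusively to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = resolver(parts.hostname)
    except OSError:
        return False  # unresolvable hosts are rejected
    return bool(addresses) and all(is_public_address(a) for a in addresses)
```

A check like this runs before the browser is asked to render. Because a hostname can resolve differently between the check and the fetch, a production design would likely also pin the validated address or enforce the same rule at the network layer.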