Security and Privacy

DocuShell is designed around ephemeral storage, strict input validation, rate limiting, PDF prompt-injection defenses, and private-service isolation for browser-based PDF workflows.

Privacy Model

Uploaded and generated files are ephemeral and swept from /tmp/docushell-storage within one hour.
Parse batch artifacts use the batch storage root and expire one hour after terminal completion by default.
Where a service supports one-time streaming, generated files are streamed through the gateway and deleted as soon as the download completes or is interrupted.
DocuShell does not turn user PDFs into a long-lived document store.
Request IDs and operational metadata are used for debugging without exposing internal worker URLs.
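
The retention behavior above can be pictured as a periodic sweep over the storage directory. The following is a minimal sketch, not DocuShell's actual sweeper: the function name `sweep_expired`, the use of file modification time as the age signal, and the `RETENTION_SECONDS` constant are all assumptions made for illustration.

```python
import time
from pathlib import Path
from typing import List, Optional

# One hour, matching the retention window described above (assumed constant name).
RETENTION_SECONDS = 60 * 60


def sweep_expired(root: Path, now: Optional[float] = None,
                  max_age: float = RETENTION_SECONDS) -> List[Path]:
    """Delete regular files under `root` whose mtime is older than `max_age` seconds.

    Returns the list of paths that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    # Materialize the listing first so deletions do not disturb iteration.
    for path in list(root.rglob("*")):
        try:
            if path.is_file() and now - path.stat().st_mtime > max_age:
                path.unlink()
                removed.append(path)
        except FileNotFoundError:
            # The file was already removed, e.g. after a one-time download.
            continue
    return removed
```

A scheduler would call something like `sweep_expired(Path("/tmp/docushell-storage"))` at a short interval so that no file outlives the one-hour window by much.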

Validation

Public inputs are validated with schemas before work is queued.
PDF uploads are checked by declared MIME type and by magic bytes, not just by file extension.
Parse batch preflight is all-or-nothing: every file must pass PDF, page, size, and password checks before the batch is queued.
Plan limits are enforced before or during queueing so oversized jobs fail early.
Structured logs and shared error middleware keep failures consistent across services.
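
The MIME, magic-byte, and all-or-nothing rules above can be sketched as follows. This is an illustrative outline only: the `Upload` type, `check_pdf`, and `preflight_batch` are hypothetical names, the magic-byte test is a strict check at offset 0, and the page-count and password checks are omitted because they need a real PDF parser.

```python
from dataclasses import dataclass
from typing import Dict, List

PDF_MAGIC = b"%PDF-"            # signature at the start of a PDF file
ALLOWED_MIME = "application/pdf"


@dataclass
class Upload:
    """Hypothetical upload record used only for this sketch."""
    filename: str
    declared_mime: str
    data: bytes


def check_pdf(upload: Upload, max_bytes: int) -> List[str]:
    """Return a list of validation errors; an empty list means the file passed."""
    errors = []
    if upload.declared_mime != ALLOWED_MIME:
        errors.append("declared MIME type is not application/pdf")
    if not upload.data.startswith(PDF_MAGIC):
        errors.append("missing %PDF- magic bytes")
    if len(upload.data) > max_bytes:
        errors.append("file exceeds the size limit")
    return errors


def preflight_batch(uploads: List[Upload], max_bytes: int) -> Dict[str, List[str]]:
    """All-or-nothing preflight: map each failing filename to its errors.

    The batch should be queued only when the returned dict is empty.
    """
    failures = {}
    for upload in uploads:
        errors = check_pdf(upload, max_bytes)
        if errors:
            failures[upload.filename] = errors
    return failures
```

A caller would queue the batch only when `preflight_batch(...)` returns an empty dict, so a single renamed HTML file rejects the whole batch before any work starts.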

PDF Prompt-Injection Defense

PDFs can contain machine-readable text that a human reviewer cannot reasonably see. That matters for LLM, RAG, resume screening, contract review, and document automation workflows because hidden instructions can be extracted and passed into a downstream model as trusted context.

DocuShell keeps rendering-mismatch defenses enabled for Parse PDF output. The public API does not expose a field to disable these defenses. For untrusted or internet-sourced PDFs, treat this as part of the extraction baseline rather than an optional tuning knob.

`sanitize=true` is a separate control. It masks visible sensitive data in extracted output, such as emails, URLs, and phone numbers. It is disabled by default because it changes legitimate document content.
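
To make the effect of masking concrete, here is a rough sketch of what such a pass over extracted text could look like. The regular expressions, the `mask_sensitive` name, and the `[EMAIL]`, `[URL]`, and `[PHONE]` placeholders are assumptions for illustration; the actual patterns and replacement tokens DocuShell uses are not described here.

```python
import re

# Deliberately simple patterns; real-world masking needs broader coverage.
URL_RE = re.compile(r"https?://[^\s<>\"']+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_sensitive(text: str) -> str:
    """Replace visible URLs, emails, and phone numbers with placeholder tokens."""
    text = URL_RE.sub("[URL]", text)      # URLs first, so their digits are not seen as phones
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


sample = "Contact jane@example.com or +1 (555) 010-9999, see https://example.com/cv"
masked = mask_sensitive(sample)
```

Because the replacement rewrites content a reader would legitimately expect to see, it is easy to understand why this kind of masking is opt-in rather than part of the default extraction path.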

Hidden or transparent text is filtered when it is not part of what a normal reader should see.
Extremely small text and off-page text are filtered so machine-only prompts are less likely to enter model context.
Hidden PDF layers are excluded where the parser can identify that the layer is not visible.
Header and footer content remains excluded by default to reduce repeated boilerplate in downstream context.
Use the Parse Playground, JSON artifact, and annotated PDF artifact to inspect what the parser extracted before connecting a new RAG workflow to production.

These defenses reduce PDF-specific prompt-injection risk, but they are not a replacement for normal LLM application controls such as instruction hierarchy, retrieval allowlists, output validation, and least-privilege tool access.
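
The filtering rules listed above can be thought of as a visibility predicate applied to each extracted text span. The sketch below assumes a hypothetical `TextSpan` record and made-up thresholds (`MIN_FONT_SIZE`, `MIN_OPACITY`); it shows the shape of the idea and is not DocuShell's parser logic.

```python
from dataclasses import dataclass
from typing import List

MIN_FONT_SIZE = 2.0   # points; hypothetical cutoff for "extremely small" text
MIN_OPACITY = 0.05    # hypothetical cutoff for "transparent" text


@dataclass
class TextSpan:
    """Hypothetical extracted span with just enough layout data for the check."""
    text: str
    x: float              # position of the span origin, in page units
    y: float
    font_size: float
    opacity: float        # 0.0 is fully transparent, 1.0 is fully opaque
    layer_visible: bool = True


def is_visible(span: TextSpan, page_width: float, page_height: float) -> bool:
    """True when a normal reader could plausibly see the span on the page."""
    on_page = 0 <= span.x <= page_width and 0 <= span.y <= page_height
    return (
        on_page
        and span.layer_visible
        and span.font_size >= MIN_FONT_SIZE
        and span.opacity >= MIN_OPACITY
    )


def filter_spans(spans: List[TextSpan], page_width: float,
                 page_height: float) -> List[TextSpan]:
    """Keep only the spans that pass the visibility predicate."""
    return [s for s in spans if is_visible(s, page_width, page_height)]
```

On a 612 by 792 point page, a span positioned at x = -500, a span set in 0.5 point type, and a span with zero opacity would all be dropped, while ordinary body text passes through unchanged.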

Parse Safety Controls

Public Parse PDF controls separate document safety, sensitive-data masking, and workflow validation.

| Control | Default | DocuShell API behavior |
| --- | --- | --- |
| Rendering-mismatch filtering | Enabled | Filters hidden, off-page, tiny, transparent, or hidden-layer text where identifiable. No public disable field is exposed. |
| Sensitive-data masking | Disabled | Set `sanitize=true` to mask visible emails, URLs, and phone numbers in extracted output. |
| Header/footer exclusion | Enabled | Repeated page furniture stays out of output unless `include_header_footer=true` is requested. |
| PDF validation | Mandatory | Uploads are checked by schema, MIME, magic bytes, page preflight, password state, size, and plan limits before queueing. |
| Artifact inspection | Available | Use JSON, Markdown, text, and annotated PDF outputs to review extraction before indexing or prompting. |
| Ephemeral storage | Enabled | Uploads and generated artifacts are temporary and swept within the configured retention window. |

URL Security

Webpage-to-PDF rendering accepts public URLs and blocks private network targets, intranet hosts, and metadata IP ranges before Chromium is asked to render.

Private and loopback network ranges are rejected.
Cloud metadata endpoints are rejected.
Rendering runs in isolated Chromium workers behind the gateway.
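
A pre-render check of this kind is commonly built on address classification. The sketch below uses Python's standard `ipaddress` module; the function names `is_public_ip` and `is_render_target_allowed` and the injectable `resolve` parameter are assumptions for illustration, not DocuShell's gateway code.

```python
import ipaddress
import socket
from typing import Callable, List, Optional
from urllib.parse import urlsplit


def is_public_ip(ip: str) -> bool:
    """False for private, loopback, link-local (incl. 169.254.169.254), and similar ranges."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False
    return not (
        addr.is_private or addr.is_loopback or addr.is_link_local
        or addr.is_reserved or addr.is_multicast or addr.is_unspecified
    )


def _resolve(host: str) -> List[str]:
    """Resolve a hostname to its IP addresses using the system resolver."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_render_target_allowed(url: str,
                             resolve: Optional[Callable[[str], List[str]]] = None) -> bool:
    """Allow only http(s) URLs whose host resolves exclusively to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = (resolve or _resolve)(parts.hostname)
    except OSError:
        # Resolution failures are treated as a rejection.
        return False
    return bool(addresses) and all(is_public_ip(a) for a in addresses)
```

Requiring every resolved address to be public, rather than any one of them, avoids hosts that mix public and internal records. A check like this is only meaningful if the renderer then connects to the address that was validated, since a hostname can resolve differently on a later lookup.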