Security And Privacy

DocuShell is designed around ephemeral storage, strict input validation, rate limiting, PDF prompt-injection defenses, and private-service isolation for browser-based PDF workflows.

Privacy Model

  • Uploaded and generated files are ephemeral and swept from /tmp/docushell-storage within one hour.
  • Parse batch artifacts use the batch storage root and expire one hour after terminal completion by default.
  • Where a service supports one-time streaming, generated files are streamed through the gateway and deleted once the download completes or is interrupted.
  • DocuShell does not turn user PDFs into a long-lived document store.
  • Request IDs and operational metadata are used for debugging without exposing internal worker URLs.
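
The one-hour sweep described above can be pictured with a minimal sketch. This is not DocuShell's implementation: the function name, the use of file modification time as the age signal, and the recursive walk are all assumptions made for illustration. Only the one-hour window comes from the text.

```python
import os
import time
from pathlib import Path
from typing import List, Optional

RETENTION_SECONDS = 3600  # one-hour window, as stated in the privacy model


def sweep_expired(root: Path,
                  now: Optional[float] = None,
                  retention: float = RETENTION_SECONDS) -> List[Path]:
    """Delete regular files under root that are older than the retention window.

    Age is measured from the file's modification time (an assumption of this
    sketch). Returns the paths that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    # Materialize the listing first so deletion does not disturb iteration.
    for path in list(root.rglob("*")):
        if path.is_file() and now - path.stat().st_mtime > retention:
            path.unlink(missing_ok=True)
            removed.append(path)
    return removed
```

A scheduler would call `sweep_expired(Path("/tmp/docushell-storage"))` on an interval shorter than the retention window so that no file outlives it by much.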

Validation

  • Public inputs are validated with schemas before work is queued.
  • PDF uploads are checked by MIME and magic bytes, not just file extension.
  • Parse batch preflight is all-or-nothing: every file must pass PDF, page, size, and password checks before the batch is queued.
  • Plan limits are enforced before or during queueing so oversized jobs fail early.
  • Structured logs and shared error middleware keep failures consistent across services.
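
The magic-byte check and the all-or-nothing preflight above can be sketched as follows. The `Upload` type, the size and page limits, and the function names are hypothetical and chosen only for illustration; real limits depend on the plan, and the page count and password state would come from a PDF parser rather than from the caller.

```python
from dataclasses import dataclass
from typing import Dict, List

PDF_MAGIC = b"%PDF-"            # header bytes that open a PDF file
MAX_BYTES = 50 * 1024 * 1024    # hypothetical size limit
MAX_PAGES = 500                 # hypothetical page limit


@dataclass
class Upload:
    name: str
    mime: str
    data: bytes
    page_count: int   # assumed to be supplied by a parser during preflight
    encrypted: bool   # True when the file requires a password


def check_upload(u: Upload) -> List[str]:
    """Return every reason this file fails preflight (empty list means it passes)."""
    errors = []
    if u.mime != "application/pdf":
        errors.append("unexpected MIME type")
    # Strict check: the header must be the first bytes of the file.
    if not u.data.startswith(PDF_MAGIC):
        errors.append("missing %PDF- header")
    if len(u.data) > MAX_BYTES:
        errors.append("file too large")
    if u.page_count > MAX_PAGES:
        errors.append("too many pages")
    if u.encrypted:
        errors.append("password-protected")
    return errors


def preflight_batch(uploads: List[Upload]) -> Dict[str, List[str]]:
    """Map each failing file name to its errors.

    The batch may be queued only when the returned dict is empty.
    """
    return {u.name: errs for u in uploads if (errs := check_upload(u))}
```

Collecting every failure before rejecting, rather than stopping at the first one, lets a caller fix the whole batch in one pass.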

PDF Prompt-Injection Defense

PDFs can contain machine-readable text that a human reviewer cannot reasonably see. That matters for LLM, RAG, resume screening, contract review, and document automation workflows because hidden instructions can be extracted and passed into a downstream model as trusted context.

DocuShell keeps rendering-mismatch defenses enabled for Parse PDF output. The public API does not expose a field to disable these defenses. For untrusted or internet-sourced PDFs, treat this as part of the extraction baseline rather than an optional tuning knob.

sanitize=true is a separate control. It masks visible sensitive data in extracted output, such as emails, URLs, and phone numbers. It is disabled by default because it changes legitimate document content.

  • Hidden or transparent text is filtered when it is not part of what a normal reader should see.
  • Extremely small text and off-page text are filtered so machine-only prompts are less likely to enter model context.
  • Hidden PDF layers are excluded where the parser can identify that the layer is not visible.
  • Header and footer content remains excluded by default to reduce repeated boilerplate in downstream context.
  • Use the Parse Playground, JSON artifact, and annotated PDF artifact to inspect what the parser extracted before connecting a new RAG workflow to production.

These defenses reduce PDF-specific prompt-injection risk, but they are not a replacement for normal LLM application controls such as instruction hierarchy, retrieval allowlists, output validation, and least-privilege tool access.
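
To make the rendering-mismatch idea concrete, here is a minimal sketch of a visibility filter over extracted text spans. The `Span` structure, the thresholds, and the function names are assumptions for illustration only; DocuShell's parser and its actual cutoffs are not documented here.

```python
from dataclasses import dataclass
from typing import List, Tuple

MIN_FONT_PT = 2.0    # hypothetical cutoff for "extremely small" text
MIN_OPACITY = 0.05   # hypothetical cutoff for "transparent" text


@dataclass
class Span:
    text: str
    font_size: float                          # points
    opacity: float                            # 0.0 transparent, 1.0 opaque
    bbox: Tuple[float, float, float, float]   # x0, y0, x1, y1 in page units
    layer_visible: bool = True                # False for hidden PDF layers


def is_visible(span: Span, page_width: float, page_height: float) -> bool:
    """Return True when a normal reader could plausibly see this span."""
    x0, y0, x1, y1 = span.bbox
    # The span counts as on-page if its box overlaps the page rectangle.
    on_page = x1 > 0 and y1 > 0 and x0 < page_width and y0 < page_height
    return (span.layer_visible
            and span.opacity >= MIN_OPACITY
            and span.font_size >= MIN_FONT_PT
            and on_page)


def filter_spans(spans: List[Span], page_width: float, page_height: float) -> List[Span]:
    """Keep only spans that pass every visibility check."""
    return [s for s in spans if is_visible(s, page_width, page_height)]
```

A filter like this only covers the cases it can detect from span metadata, which is why the downstream LLM controls listed above still apply.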

Parse Safety Controls

Public Parse PDF controls separate document safety, sensitive-data masking, and workflow validation.

Control | Default | DocuShell API behavior
--- | --- | ---
Rendering-mismatch filtering | Enabled | Filters hidden, off-page, tiny, transparent, or hidden-layer text where identifiable. No public disable field is exposed.
Sensitive-data masking | Disabled | Set sanitize=true to mask visible emails, URLs, and phone numbers in extracted output.
Header/footer exclusion | Enabled | Repeated page furniture stays out of output unless include_header_footer=true is requested.
PDF validation | Mandatory | Uploads are checked by schema, MIME, magic bytes, page preflight, password state, size, and plan limits before queueing.
Artifact inspection | Available | Use JSON, Markdown, text, and annotated PDF outputs to review extraction before indexing or prompting.
Ephemeral storage | Enabled | Uploads and generated artifacts are temporary and swept within the configured retention window.
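
As an illustration of the sensitive-data masking control, the sketch below shows one way visible emails, URLs, and phone numbers could be masked in extracted text. The regular expressions, the placeholder tokens, and the function name are assumptions, not DocuShell's actual patterns or output format. Only the `sanitize` flag and its disabled default come from the documentation.

```python
import re

# Deliberately simple patterns for illustration; production masking needs
# broader coverage and will have false positives (for example, long numeric IDs).
URL_RE = re.compile(r"https?://[^\s<>\"']+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_sensitive(text: str, sanitize: bool = False) -> str:
    """Mask emails, URLs, and phone numbers when sanitize is True.

    With sanitize=False (the default), the text is returned unchanged.
    The [URL], [EMAIL], and [PHONE] placeholders are hypothetical.
    """
    if not sanitize:
        return text
    # URLs are masked first so later patterns never see their contents.
    text = URL_RE.sub("[URL]", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

Because masking rewrites legitimate document content, keeping it opt-in matches the default shown in the table.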

URL Security

Webpage-to-PDF rendering accepts public URLs and blocks private network targets, intranet hosts, and metadata IP ranges before Chromium is asked to render.

  • Private and loopback network ranges are rejected.
  • Cloud metadata endpoints are rejected.
  • Rendering runs in isolated Chromium workers behind the gateway.
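
A pre-render URL check along these lines can be sketched with Python's standard library. The function names and the exact set of rejected address classes are assumptions for illustration; DocuShell's real blocklist is not documented here.

```python
import ipaddress
import socket
from typing import List
from urllib.parse import urlsplit


def is_blocked_ip(value: str) -> bool:
    """Return True for addresses that must never be rendered."""
    ip = ipaddress.ip_address(value)
    return (ip.is_private        # RFC 1918 and similar internal ranges
            or ip.is_loopback    # 127.0.0.0/8, ::1
            or ip.is_link_local  # 169.254.0.0/16, including 169.254.169.254
            or ip.is_reserved
            or ip.is_multicast
            or ip.is_unspecified)


def resolve_and_check(url: str) -> List[str]:
    """Resolve the URL's host and reject it if any address is blocked.

    Returns the sorted list of resolved addresses on success and raises
    ValueError otherwise.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("only absolute http(s) URLs are accepted")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    infos = socket.getaddrinfo(parts.hostname, port, type=socket.SOCK_STREAM)
    # Drop any IPv6 scope suffix such as "%eth0" before parsing.
    addresses = sorted({info[4][0].split("%")[0] for info in infos})
    for addr in addresses:
        if is_blocked_ip(addr):
            raise ValueError(f"blocked target address: {addr}")
    return addresses
```

A check like this is only effective if the renderer then connects to the address that was validated. Otherwise a hostname can resolve to a public address during the check and to a private one at render time.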