Convert PDF files into normalized text chunks. The extractor uses OCR when necessary and preserves logical sections so you can stream the output into search indexes or prompt scaffolds.

Request body

  • sourceUrl — HTTPS or signed URL pointing to the PDF. Required if file is not provided.
  • file — Optional uploaded PDF file (use multipart/form-data when sending the binary).
  • sourceName — Optional label saved with the extracted records (e.g., Playbook 2025).
  • options — Optional object. Supported keys:
    • segmentLength — Target characters per chunk (default 1000).
    • language — ISO language hint (en, fr, etc.) to improve OCR accuracy.
    • ocr — Boolean; force OCR on scan-heavy PDFs (defaults to automatic detection).
  • webhookUrl — Optional HTTPS URL Horizon should call when the extraction finishes.
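As a rough illustration of how these fields fit together, the sketch below assembles a JSON body in Python. The helper name build_extract_payload and its keyword arguments are hypothetical; only the JSON keys come from the field list above, and how the service treats a request that sets both sourceUrl and file is not covered here.

```python
from typing import Any, Dict, Optional


def build_extract_payload(
    source_url: Optional[str] = None,
    file_data_uri: Optional[str] = None,
    source_name: Optional[str] = None,
    segment_length: Optional[int] = None,
    language: Optional[str] = None,
    ocr: Optional[bool] = None,
    webhook_url: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble a request body, leaving out every field the caller did not set."""
    # The field list states that sourceUrl is required when file is not provided.
    if not source_url and not file_data_uri:
        raise ValueError("provide either sourceUrl or file")

    payload: Dict[str, Any] = {}
    if source_url:
        payload["sourceUrl"] = source_url
    if file_data_uri:
        payload["file"] = file_data_uri
    if source_name:
        payload["sourceName"] = source_name

    # Only send option keys that were explicitly chosen, so server defaults
    # (segmentLength 1000, automatic OCR detection) stay in effect otherwise.
    candidate_options = {
        "segmentLength": segment_length,
        "language": language,
        "ocr": ocr,
    }
    options = {key: value for key, value in candidate_options.items() if value is not None}
    if options:
        payload["options"] = options

    if webhook_url:
        payload["webhookUrl"] = webhook_url
    return payload
```

Passing only source_url yields a body with a single sourceUrl key, which leaves every option at its documented default.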

Sample request

curl https://api.worklet.cloud/v1/extract/pdf \
  -H "Content-Type: application/json" \
  -d '{
    "sourceUrl": "https://cdn.example.com/handbooks/agent-playbook.pdf",
    "sourceName": "Agent Playbook 2025",
    "options": {
      "segmentLength": 1500,
      "language": "en",
      "ocr": true
    }
  }'

# or send the PDF inline as a Base64-encoded data URI

curl https://api.worklet.cloud/v1/extract/pdf \
  -H "Content-Type: application/json" \
  -d '{
    "file": "data:application/pdf;base64,JVBERi0xLjcKJcTl8uX...<snip>",
    "sourceName": "Agent Playbook 2025",
    "options": {
      "segmentLength": 1500,
      "language": "en",
      "ocr": true
    }
  }'
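The file value in the second request is a data URI. A minimal sketch of producing one from PDF bytes in Python follows; the function name pdf_to_data_uri is hypothetical, and the data:application/pdf;base64, prefix is copied from the sample above.

```python
import base64


def pdf_to_data_uri(pdf_bytes: bytes) -> str:
    """Encode raw PDF bytes as a Base64 data URI like the one in the sample request."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return f"data:application/pdf;base64,{encoded}"


# A PDF 1.7 file begins with the bytes b"%PDF-1.7\n", which encode to the
# same "JVBERi0xLjcK" prefix that appears in the sample request.
print(pdf_to_data_uri(b"%PDF-1.7\n"))  # data:application/pdf;base64,JVBERi0xLjcK
```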

Response

Returns 202 Accepted with a jobId, status, and statusUrl. When the PDF is small enough to finish synchronously, the response also includes the normalized chunks in the result field.
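A caller therefore has two paths: use result right away when it is present, or keep statusUrl for later polling. The sketch below shows one way to branch on that, assuming the body has already been parsed from JSON into a dict. The helper name split_extract_response is hypothetical, and result is treated as an opaque value because its exact shape is not described here.

```python
from typing import Any, Dict, Optional, Tuple


def split_extract_response(body: Dict[str, Any]) -> Tuple[Optional[Any], Optional[str]]:
    """Return (result, status_url); one of the two is None.

    If the job finished synchronously, result is returned and there is
    nothing to poll. Otherwise the caller gets the statusUrl to poll.
    """
    result = body.get("result")
    if result is not None:
        return result, None
    return None, body["statusUrl"]
```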

Notes

  • Signed URLs should remain valid until the job completes; most files process within a few minutes.
  • Set segmentLength to align with downstream token budgets.
  • OCR runs automatically when vector text is unavailable; use ocr: false to bypass it for machine-generated PDFs.
  • Poll GET /jobs/{jobId} (the same as the returned statusUrl) to monitor progress or retrieve the final result later.
  • To upload the file directly, send multipart/form-data with a file field instead of sourceUrl (e.g., curl -F "file=@/path/to/file.pdf").
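The polling note can be sketched as a small loop. This is an illustration only: the status strings "completed" and "failed" are assumptions (the possible status values are not listed here), and fetch_status stands in for whatever function performs GET on the statusUrl and returns the parsed JSON body.

```python
import time
from typing import Any, Callable, Dict, Sequence


def poll_job(
    fetch_status: Callable[[], Dict[str, Any]],
    interval_seconds: float = 5.0,
    max_attempts: int = 60,
    terminal_statuses: Sequence[str] = ("completed", "failed"),  # assumed values
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status until the job reports a terminal status or attempts run out."""
    for attempt in range(max_attempts):
        job = fetch_status()
        if job.get("status") in terminal_statuses:
            return job
        # Wait between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"job still running after {max_attempts} attempts")
```

With the defaults this waits up to about five minutes, in line with the note that most files process within a few minutes. Where webhookUrl is set, the webhook can replace polling entirely.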

x402 flow

PDF extraction is billed per document via Coinbase’s x402 protocol. When payment is required, Horizon returns a structured 402 challenge:

HTTP/1.1 402 Payment Required
Content-Type: application/json

{
  "x402Version": 1,
  "accepts": [
    {
      "scheme": "exact",
      "network": "base-sepolia",
      "maxAmountRequired": "150000",
      "resource": "POST /extract/pdf",
      "description": "Horizon PDF extraction",
      "mimeType": "application/json",
      "payTo": "0xYourReceivingWallet",
      "maxTimeoutSeconds": 300,
      "asset": "0xYourUSDCContract",
      "extra": {
        "name": "USDC",
        "version": "1"
      }
    }
  ],
  "error": null
}

Resolve it by forwarding the accepts entry to your facilitator, calling /verify and /settle, then replaying the request with the facilitator-provided Base64 payload in the X-PAYMENT header. Successful responses include an X-PAYMENT-RESPONSE header carrying the settlement receipt. See the Coinbase quickstart if you need help provisioning facilitator credentials.
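The client-side bookkeeping around this flow can be sketched in Python. The facilitator calls themselves are left out, since their request format depends on the facilitator you use. All three helper names are hypothetical, and the default of 6 decimals is an assumption based on USDC; confirm it against the asset actually named in the challenge.

```python
from decimal import Decimal
from typing import Any, Dict, Optional


def select_payment_requirement(
    challenge: Dict[str, Any],
    scheme: str = "exact",
    network: Optional[str] = None,
) -> Optional[Dict[str, Any]]:
    """Pick the first accepts entry that matches the wanted scheme and network."""
    for requirement in challenge.get("accepts", []):
        if requirement.get("scheme") != scheme:
            continue
        if network is not None and requirement.get("network") != network:
            continue
        return requirement
    return None


def amount_in_tokens(max_amount_required: str, decimals: int = 6) -> Decimal:
    """Convert an atomic-unit amount string to whole tokens (assumes 6 decimals, as USDC uses)."""
    return Decimal(max_amount_required) / (Decimal(10) ** decimals)


def with_payment_header(headers: Dict[str, str], payment_payload_b64: str) -> Dict[str, str]:
    """Copy the original request headers and attach the Base64 payload as X-PAYMENT."""
    replay_headers = dict(headers)
    replay_headers["X-PAYMENT"] = payment_payload_b64
    return replay_headers
```

Under that 6-decimal assumption, the maxAmountRequired of "150000" in the sample challenge corresponds to 0.15 USDC.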