POST /extract/pdf
Extract PDF documents
curl --request POST \
  --url https://api.horizon.new/v1/extract/pdf \
  --header 'Content-Type: application/json' \
  --data '{
  "sourceUrl": "<string>",
  "sourceName": "<string>",
  "options": {},
  "metadata": {},
  "webhookUrl": "<string>"
}'
{
  "jobId": "job_01hx9q9",
  "status": "queued",
  "statusUrl": "<string>",
  "result": {},
  "etaSeconds": 123,
  "jobType": "extract/pdf"
}
Convert PDF files into normalized text chunks enriched with metadata. The extractor uses OCR when necessary and preserves logical sections so you can stream the output into search indexes or prompt scaffolds.

Request body

  • sourceUrl — HTTPS or signed URL pointing to the PDF. Required if file is not provided.
  • file — Optional uploaded PDF file (use multipart/form-data when sending the binary).
  • sourceName — Optional label saved with the extracted records (e.g., Playbook 2025).
  • options — Optional object. Supported keys:
    • segmentLength — Target characters per chunk (default 1000).
    • language — ISO language hint (en, fr, etc.) to improve OCR accuracy.
    • ocr — Boolean; force OCR on scan-heavy PDFs (defaults to automatic detection).
  • metadata — Optional object for custom tags (e.g., {"department":"Support"}).
  • webhookUrl — Optional HTTPS URL Horizon should call when the extraction finishes.
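The fields above can be assembled programmatically. The following is a minimal Python sketch, not an official client: `build_extract_request` and its parameter names are hypothetical. It omits unset keys so the documented defaults apply, and it covers only the `sourceUrl` form of the request.

```python
from typing import Any, Dict, Optional


def build_extract_request(
    source_url: Optional[str] = None,
    source_name: Optional[str] = None,
    segment_length: Optional[int] = None,
    language: Optional[str] = None,
    ocr: Optional[bool] = None,
    metadata: Optional[Dict[str, Any]] = None,
    webhook_url: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble a JSON body for POST /v1/extract/pdf (hypothetical helper)."""
    if not source_url:
        # sourceUrl is required unless a file is uploaded, which this sketch does not cover.
        raise ValueError("sourceUrl is required when no file is provided")
    if webhook_url is not None and not webhook_url.startswith("https://"):
        # The field list above describes webhookUrl as an HTTPS URL.
        raise ValueError("webhookUrl must be an HTTPS URL")
    if segment_length is not None and segment_length <= 0:
        raise ValueError("segmentLength must be a positive number of characters")

    body: Dict[str, Any] = {"sourceUrl": source_url}
    if source_name is not None:
        body["sourceName"] = source_name

    # Only include option keys the caller set, so server-side defaults
    # (segmentLength 1000, automatic OCR detection) stay in effect otherwise.
    options: Dict[str, Any] = {}
    if segment_length is not None:
        options["segmentLength"] = segment_length
    if language is not None:
        options["language"] = language
    if ocr is not None:
        options["ocr"] = ocr
    if options:
        body["options"] = options

    if metadata:
        body["metadata"] = metadata
    if webhook_url is not None:
        body["webhookUrl"] = webhook_url
    return body
```

Serialize the returned dictionary with `json.dumps` and send it with `Content-Type: application/json`.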

Sample request

curl https://api.horizon.new/v1/extract/pdf \
  -H "Content-Type: application/json" \
  -d '{
    "sourceUrl": "https://cdn.example.com/handbooks/agent-playbook.pdf",
    "sourceName": "Agent Playbook 2025",
    "options": {
      "segmentLength": 1500,
      "language": "en",
      "ocr": true
    },
    "metadata": {
      "department": "Support",
      "quarter": "Q1-2025"
    }
  }'

# Or send the file inline as a Base64-encoded data URI

curl https://api.horizon.new/v1/extract/pdf \
  -H "Content-Type: application/json" \
  -d '{
    "file": "data:application/pdf;base64,JVBERi0xLjcKJcTl8uX...<snip>",
    "sourceName": "Agent Playbook 2025",
    "options": {
      "segmentLength": 1500,
      "language": "en",
      "ocr": true
    },
    "metadata": {
      "department": "Support",
      "quarter": "Q1-2025"
    }
  }'

Response

Returns 202 Accepted with a jobId, status, and statusUrl. When the PDF is small enough to finish synchronously, the normalized chunks are included in result.
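A client therefore has two paths after a 202: use the inline `result`, or fall back to `statusUrl`. The sketch below shows one way to branch. The function name is hypothetical, and it assumes that a synchronous completion is signaled by `status` being `completed` together with a non-empty `result`, which this page does not state explicitly.

```python
from typing import Any, Dict, Optional, Tuple


def split_accept_response(
    status_code: int, body: Dict[str, Any]
) -> Tuple[Optional[Dict[str, Any]], Optional[str]]:
    """Return (result, None) if chunks are inline, else (None, statusUrl)."""
    if status_code != 202:
        raise RuntimeError(f"expected 202 Accepted, got {status_code}")

    missing = [key for key in ("jobId", "status", "statusUrl") if key not in body]
    if missing:
        raise ValueError(f"response is missing fields: {', '.join(missing)}")

    result = body.get("result")
    # Assumption: an empty object or null means the job is still running.
    if body["status"] == "completed" and result:
        return result, None
    return None, body["statusUrl"]
```

When the second element is not `None`, the job is still queued or processing and the caller should poll that URL.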

Notes

  • Signed URLs should remain valid until the job completes; most files process within a few minutes.
  • Set segmentLength to align with downstream token budgets.
  • OCR runs automatically when the PDF has no embedded text layer; set ocr: false to skip it for machine-generated PDFs.
  • Poll GET /jobs/{jobId} (the endpoint the returned statusUrl points to) to monitor progress or to retrieve the final result later.
  • To upload the file directly, send multipart/form-data with a file field instead of sourceUrl (e.g., curl -F "file=@playbook.pdf").
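The polling note above can be sketched as a small loop. This is an illustrative Python example rather than a reference client: `wait_for_job` and `fetch_json` are hypothetical names, the 5-second interval and 10-minute timeout are arbitrary choices, and the sketch assumes the status endpoint needs no extra headers.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict


def fetch_json(url: str) -> Dict[str, Any]:
    """GET a URL and decode its JSON body (assumes no auth headers are needed)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_for_job(
    status_url: str,
    fetch: Callable[[str], Dict[str, Any]] = fetch_json,
    interval_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll statusUrl until the job reaches completed or failed."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = fetch(status_url)
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":
            raise RuntimeError(f"job {job.get('jobId')} failed")
        # queued and processing are the non-terminal states; keep waiting.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job did not finish within {timeout_seconds} seconds")
        sleep(interval_seconds)
```

Passing `fetch` and `sleep` as parameters keeps the loop easy to exercise without network access. If you supply a `webhookUrl`, polling can serve as a fallback rather than the primary path.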

x402 flow

PDF extraction is billed per document via Coinbase’s x402 protocol. When payment is required, Horizon returns a structured 402 challenge:
HTTP/1.1 402 Payment Required
Content-Type: application/json

{
  "x402Version": 1,
  "accepts": [
    {
      "scheme": "exact",
      "network": "base-sepolia",
      "maxAmountRequired": "150000",
      "resource": "POST /extract/pdf",
      "description": "Horizon PDF extraction",
      "mimeType": "application/json",
      "payTo": "0xYourReceivingWallet",
      "maxTimeoutSeconds": 300,
      "asset": "0xYourUSDCContract",
      "extra": {
        "name": "USDC",
        "version": "1"
      }
    }
  ],
  "error": null
}
Resolve it by forwarding the accepts entry to your facilitator, calling /verify and /settle, then replaying the request with the facilitator-provided Base64 payload inside X-PAYMENT. Successful responses include X-PAYMENT-RESPONSE with the settlement receipt. See the Coinbase quickstart if you need help provisioning facilitator credentials.
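The client-side part of that flow can be sketched as follows. This Python example only selects an `accepts` entry and attaches the `X-PAYMENT` header; the facilitator calls are left out because their request shapes are not described here. Both function names are hypothetical, and treating `maxAmountRequired` as an integer string in the asset's smallest unit is an assumption based on the example challenge.

```python
from typing import Any, Dict, Optional


def select_payment_requirements(
    challenge: Dict[str, Any],
    scheme: str = "exact",
    network: str = "base-sepolia",
    max_amount: Optional[int] = None,
) -> Dict[str, Any]:
    """Pick the first accepts entry matching scheme, network, and a spend cap."""
    if challenge.get("x402Version") != 1:
        raise ValueError("unsupported x402Version")

    for entry in challenge.get("accepts", []):
        if entry.get("scheme") != scheme or entry.get("network") != network:
            continue
        # Assumption: maxAmountRequired is an integer encoded as a string.
        amount = int(entry["maxAmountRequired"])
        if max_amount is not None and amount > max_amount:
            continue
        return entry
    raise LookupError("no acceptable payment option in the 402 challenge")


def with_payment_header(headers: Dict[str, str], payment_payload_b64: str) -> Dict[str, str]:
    """Copy the original headers and add the facilitator-provided payload."""
    replay_headers = dict(headers)
    replay_headers["X-PAYMENT"] = payment_payload_b64
    return replay_headers
```

The selected entry is what you would forward to your facilitator. After verification and settlement, replay the original POST with the headers returned by `with_payment_header` and the same request body.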

Body

application/json

Provide either sourceUrl or file.

  • sourceUrl (string<uri>)
  • sourceName (string)
  • options (object): Extraction hints such as language, segmentLength, transcriptionModel, or sheet preferences depending on the endpoint.
  • metadata (object)
  • webhookUrl (string<uri>): Webhook to call when the extraction completes.
  • file (file): Upload the raw file instead of providing sourceUrl.

Response

Extraction job accepted

  • jobId (string, required): Example: "job_01hx9q9".
  • status (enum<string>, required): One of queued, processing, completed, failed.
  • statusUrl (string<uri>, required): Canonical link to GET /jobs/{jobId} for this job.
  • jobType (string, required): Example: "extract/pdf".
  • result (object | null): Present when the job completes synchronously.
  • etaSeconds (integer | null): Estimated seconds until completion.