Extract PDFs

Convert PDF files into normalized text chunks enriched with metadata. The extractor uses OCR when necessary and preserves logical sections so you can stream the output into search indexes or prompt scaffolds.

Request body

sourceUrl — HTTPS or signed URL pointing to the PDF. Required if file is not provided.
file — Uploaded PDF file; required if sourceUrl is not provided (use multipart/form-data when sending the binary).
sourceName — Optional label saved with the extracted records (e.g., Playbook 2025).
options — Optional object. Supported keys:
- segmentLength — Target characters per chunk (default 1000).
- language — ISO language hint (en, fr, etc.) to improve OCR accuracy.
- ocr — Boolean; force OCR on scan-heavy PDFs (defaults to automatic detection).
metadata — Optional object for custom tags (e.g., {"department":"Support"}).
webhookUrl — Optional HTTPS URL Horizon should call when the extraction finishes.
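
The field rules above can be checked client-side before a request is sent. The following is a minimal Python sketch, not an official client: build_extract_payload is a hypothetical helper name, and rejecting a body that carries both sourceUrl and file is an assumption based on the "either sourceUrl or file" rule; the service itself may be more lenient.

```python
from typing import Any, Dict, Optional


def build_extract_payload(
    source_url: Optional[str] = None,
    file_data_uri: Optional[str] = None,
    source_name: Optional[str] = None,
    segment_length: Optional[int] = None,
    language: Optional[str] = None,
    ocr: Optional[bool] = None,
    metadata: Optional[Dict[str, Any]] = None,
    webhook_url: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble a JSON body for the PDF extraction endpoint (hypothetical helper)."""
    # Assumption: exactly one of sourceUrl / file should be present.
    if (source_url is None) == (file_data_uri is None):
        raise ValueError("provide exactly one of source_url or file_data_uri")

    payload: Dict[str, Any] = {}
    if source_url is not None:
        payload["sourceUrl"] = source_url
    else:
        payload["file"] = file_data_uri

    # Only send option keys the caller set, so the documented defaults
    # (segmentLength 1000, automatic OCR detection) still apply otherwise.
    options = {
        key: value
        for key, value in (
            ("segmentLength", segment_length),
            ("language", language),
            ("ocr", ocr),
        )
        if value is not None
    }
    if options:
        payload["options"] = options
    if source_name is not None:
        payload["sourceName"] = source_name
    if metadata:
        payload["metadata"] = metadata
    if webhook_url is not None:
        payload["webhookUrl"] = webhook_url
    return payload
```

Serializing the returned dictionary with json.dumps gives a body in the same shape as the JSON used in the sample requests.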

Sample request

curl https://api.horizon.new/v1/extract/pdf \
  -H "Content-Type: application/json" \
  -d '{
    "sourceUrl": "https://cdn.example.com/handbooks/agent-playbook.pdf",
    "sourceName": "Agent Playbook 2025",
    "options": {
      "segmentLength": 1500,
      "language": "en",
      "ocr": true
    },
    "metadata": {
      "department": "Support",
      "quarter": "Q1-2025"
    }
  }'

# Or send the file inline in the JSON body as a Base64-encoded data URI

curl https://api.horizon.new/v1/extract/pdf \
  -H "Content-Type: application/json" \
  -d '{
    "file": "data:application/pdf;base64,JVBERi0xLjcKJcTl8uX...<snip>",
    "sourceName": "Agent Playbook 2025",
    "options": {
      "segmentLength": 1500,
      "language": "en",
      "ocr": true
    },
    "metadata": {
      "department": "Support",
      "quarter": "Q1-2025"
    }
  }'
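
The file value in the second sample is a data URI. One way to produce such a value from local bytes is sketched below in Python; pdf_to_data_uri is a hypothetical helper, and it assumes the endpoint expects exactly the data:application/pdf;base64, prefix shown in the sample.

```python
import base64


def pdf_to_data_uri(pdf_bytes: bytes) -> str:
    """Encode raw PDF bytes as a data URI with the prefix used in the sample."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return "data:application/pdf;base64," + encoded


# Usage: read the file in binary mode, then place the result in the "file" field.
# with open("playbook.pdf", "rb") as handle:
#     file_field = pdf_to_data_uri(handle.read())
```

Base64 inflates the payload by roughly one third, so for large PDFs sourceUrl or a multipart upload is likely the better fit.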

Response

Returns 202 Accepted with a jobId, status, statusUrl, and jobType. When the PDF is small enough to finish synchronously, the normalized chunks are included in result; otherwise, etaSeconds may carry an estimate of the time remaining.
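
A client usually has to handle both outcomes: a job that is already finished in the 202 body, and one that must be polled. The Python sketch below shows one way to do that. wait_for_job and fetch_status are hypothetical names; fetch_status stands in for whatever HTTP call you use to GET the statusUrl, and the sketch assumes the status document has the same jobId, status, and statusUrl fields as the 202 body.

```python
import time
from typing import Any, Callable, Dict

# Statuses after which the job will not change again.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(
    accepted: Dict[str, Any],
    fetch_status: Callable[[str], Dict[str, Any]],
    poll_seconds: float = 5.0,
    max_polls: int = 60,
) -> Dict[str, Any]:
    """Return the job document once it reaches a terminal status."""
    job = accepted
    for _ in range(max_polls):
        # A synchronously completed job returns here without any polling.
        if job["status"] in TERMINAL_STATUSES:
            return job
        time.sleep(poll_seconds)
        job = fetch_status(job["statusUrl"])
    if job["status"] in TERMINAL_STATUSES:
        return job
    raise TimeoutError(
        f"job {job['jobId']} still {job['status']} after {max_polls} polls"
    )
```

Passing the fetcher in as a function keeps the loop independent of any particular HTTP library and makes it easy to exercise with canned responses.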

Notes

Signed URLs should remain valid until the job completes; most files process within a few minutes.
Set segmentLength to align with downstream token budgets.
OCR runs automatically when vector text is unavailable; use ocr: false to bypass it for machine-generated PDFs.
Poll GET /jobs/{jobId} (the same as the returned statusUrl) to monitor progress or retrieve the final result later.
To upload the file directly, send multipart/form-data with a file field instead of sourceUrl (e.g., curl -F "file=@playbook.pdf").
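
As a rough illustration of aligning segmentLength with a token budget, the Python sketch below converts tokens to characters. segment_length_for_budget is a hypothetical helper, and the default of 4 characters per token is only a common rule of thumb for English text, not a Horizon figure; measure the ratio with your own tokenizer before relying on it.

```python
def segment_length_for_budget(
    token_budget: int,
    chunks_per_prompt: int = 1,
    chars_per_token: float = 4.0,
) -> int:
    """Estimate a segmentLength (in characters) that fits a token budget."""
    if token_budget <= 0 or chunks_per_prompt <= 0 or chars_per_token <= 0:
        raise ValueError("all arguments must be positive")
    # Split the budget across the chunks sharing one prompt, then
    # convert tokens to characters with the assumed ratio.
    return int(token_budget / chunks_per_prompt * chars_per_token)
```

For example, under the assumed ratio a 375-token allowance for a single chunk corresponds to a segmentLength of 1500, the value used in the sample request.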

x402 flow

PDF extraction is billed per document via Coinbase’s x402 protocol. When payment is required, Horizon returns a structured 402 challenge:

HTTP/1.1 402 Payment Required
Content-Type: application/json

{
  "x402Version": 1,
  "accepts": [
    {
      "scheme": "exact",
      "network": "base-sepolia",
      "maxAmountRequired": "150000",
      "resource": "POST /extract/pdf",
      "description": "Horizon PDF extraction",
      "mimeType": "application/json",
      "payTo": "0xYourReceivingWallet",
      "maxTimeoutSeconds": 300,
      "asset": "0xYourUSDCContract",
      "extra": {
        "name": "USDC",
        "version": "1"
      }
    }
  ],
  "error": null
}

Resolve it by forwarding the accepts entry to your facilitator, calling /verify and /settle, then replaying the request with the facilitator-provided Base64 payload inside X-PAYMENT. Successful responses include X-PAYMENT-RESPONSE with the settlement receipt. See the Coinbase quickstart if you need help provisioning facilitator credentials.
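
The client-side bookkeeping around that flow can be sketched in Python as follows. This is an illustration under stated assumptions, not facilitator code: pick_payment_requirement and with_payment_header are hypothetical helpers, the facilitator calls themselves are omitted, and reading maxAmountRequired as an integer count of the asset's smallest unit is an assumption based on the string value in the sample challenge.

```python
from typing import Any, Dict, Optional


def pick_payment_requirement(
    challenge: Dict[str, Any],
    network: str,
    spend_limit: int,
) -> Optional[Dict[str, Any]]:
    """Return the first "exact" accepts entry on `network` within the spend limit."""
    for entry in challenge.get("accepts", []):
        # Assumption: the string holds an integer amount in the asset's smallest unit.
        amount = int(entry["maxAmountRequired"])
        if (
            entry.get("scheme") == "exact"
            and entry.get("network") == network
            and amount <= spend_limit
        ):
            return entry
    return None


def with_payment_header(
    headers: Dict[str, str], payment_payload_b64: str
) -> Dict[str, str]:
    """Copy the original request headers and attach X-PAYMENT for the replay."""
    replay = dict(headers)
    replay["X-PAYMENT"] = payment_payload_b64
    return replay
```

Checking the amount against a local spend limit before contacting the facilitator gives the client a chance to refuse an unexpectedly large charge.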

Body

application/json

Provide either sourceUrl or file.

sourceUrl (string<uri>)

sourceName (string)

options (object): Extraction hints. For this endpoint, the supported keys are the ones listed under Request body (segmentLength, language, ocr); hints such as transcriptionModel or sheet preferences apply to other endpoints.

metadata (object)

webhookUrl (string<uri>): Webhook to call when the extraction completes.

file: Upload the raw file instead of providing sourceUrl.

Response

Extraction job accepted

jobId (string, required). Example: "job_01hx9q9"

status (enum<string>, required). Available options: queued, processing, completed, failed

statusUrl (string<uri>, required): Canonical link to GET /jobs/{jobId} for this job.

jobType (string, required). Example: "extract/pdf"

result (object | null): Present when the job completes synchronously.

etaSeconds (integer | null): Estimated seconds until completion.
