Extract docs

Flatten Microsoft Word (.doc, .docx) and similar word-processing files into clean text blocks. Styles, tables, and lists are normalized, and the heading hierarchy is preserved for downstream indexing.

Request body

sourceUrl — HTTPS or signed URL pointing to the document. Required if file is not provided.
file — Uploaded document (.doc, .docx, etc.) sent as multipart/form-data. Required if sourceUrl is not provided.
sourceName — Optional label for the extracted dataset (e.g., Pricing Policy).
options — Optional object. Supported keys:
- segmentLength — Target characters per chunk (default 1000).
- language — ISO language hint to improve sentence segmentation.
- includeComments — Boolean; include tracked changes and comments in output (default false).
webhookUrl — Optional HTTPS URL Horizon should call when the extraction finishes.
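The fields above can be assembled into a JSON body along these lines. This is a minimal sketch covering only the sourceUrl path; the helper name build_extract_payload and its validation rules are assumptions for illustration, not part of the API.

```python
from typing import Any, Optional


def build_extract_payload(
    source_url: str,
    source_name: Optional[str] = None,
    segment_length: int = 1000,
    language: Optional[str] = None,
    include_comments: bool = False,
    webhook_url: Optional[str] = None,
) -> dict[str, Any]:
    """Build a request body from the documented field names.

    Hypothetical helper: the defaults mirror the documented ones
    (segmentLength 1000, includeComments false).
    """
    if not source_url:
        raise ValueError("sourceUrl is required when no file is uploaded")
    if segment_length <= 0:
        raise ValueError("segmentLength must be a positive number of characters")
    if webhook_url is not None and not webhook_url.startswith("https://"):
        raise ValueError("webhookUrl must be an HTTPS URL")

    options: dict[str, Any] = {
        "segmentLength": segment_length,
        "includeComments": include_comments,
    }
    if language is not None:
        options["language"] = language

    payload: dict[str, Any] = {"sourceUrl": source_url, "options": options}
    # Optional top-level fields are only sent when the caller supplies them.
    if source_name is not None:
        payload["sourceName"] = source_name
    if webhook_url is not None:
        payload["webhookUrl"] = webhook_url
    return payload
```

Serializing the returned dict with json.dumps gives a body of the same shape as the first sample request.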

Sample request

curl https://api.worklet.cloud/v1/extract/doc \
  -H "Content-Type: application/json" \
  -d '{
    "sourceUrl": "https://cdn.example.com/policies/pricing.docx",
    "sourceName": "Pricing Policy v3",
    "options": {
      "segmentLength": 1200,
      "language": "en",
      "includeComments": false
    }
  }'

# Or send the raw document inline as a Base64 data URI in the file field

curl https://api.worklet.cloud/v1/extract/doc \
  -H "Content-Type: application/json" \
  -d '{
    "file": "data:application/vnd.openxmlformats-officedocument.wordprocessingml.document;base64,UEsDBBQABgAIA...",
    "sourceName": "Pricing Policy v3",
    "options": {
      "segmentLength": 1200,
      "language": "en",
      "includeComments": false
    }
  }'

Response

Returns 202 Accepted with jobId, status, and statusUrl. For small documents, the extracted chunks are included immediately in the result field.
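A client might read that response as sketched below. The AcceptedJob class and parse_accepted function are hypothetical names, and the shape of result is not specified here, so it is passed through untouched.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class AcceptedJob:
    job_id: str
    status: str
    status_url: str
    # Present only when the document was small enough to finish inline.
    result: Optional[Any] = None


def parse_accepted(status_code: int, body: dict) -> AcceptedJob:
    """Validate a 202 response body and pull out the documented fields."""
    if status_code != 202:
        raise ValueError(f"expected 202 Accepted, got {status_code}")
    missing = [key for key in ("jobId", "status", "statusUrl") if key not in body]
    if missing:
        raise ValueError(f"response is missing fields: {', '.join(missing)}")
    return AcceptedJob(
        job_id=body["jobId"],
        status=body["status"],
        status_url=body["statusUrl"],
        result=body.get("result"),
    )
```

When result is None, the caller would fall back to polling the statusUrl or waiting for the webhook.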

Notes

Tracked changes and comments are stripped by default; set includeComments: true to retain tracked changes and reviewer comments.
Embedded images are ignored; captions are extracted where available.
Use segmentLength to tune chunk size for language models or vector storage.
Poll GET /jobs/{jobId} (matches the statusUrl) to check progress or fetch the final output later on demand.
To upload the document directly, send multipart/form-data with a file field instead of sourceUrl.
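Polling the job endpoint could look roughly like the loop below. The terminal status names "completed" and "failed" are assumptions, since the status values are not listed here, and fetch_status stands in for whatever HTTP call retrieves GET /jobs/{jobId} in your client.

```python
import time
from typing import Callable

# Assumed terminal states; adjust to the values the job endpoint actually returns.
TERMINAL_STATUSES = {"completed", "failed"}


def poll_job(
    fetch_status: Callable[[], dict],
    interval_seconds: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until the job reaches a terminal status.

    fetch_status is a caller-supplied function that performs the
    GET request against the statusUrl and returns the parsed JSON body.
    """
    for attempt in range(max_attempts):
        job = fetch_status()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        # Wait between attempts, but not after the last one.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"job did not finish after {max_attempts} attempts")
```

Passing the sleep function as a parameter keeps the loop easy to test, and a webhookUrl avoids polling altogether when the caller can receive callbacks.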

x402 flow

Word document extraction is priced via Coinbase’s x402 protocol. A request sent without a payment proof receives a challenge like:

HTTP/1.1 402 Payment Required
Content-Type: application/json

{
  "x402Version": 1,
  "accepts": [
    {
      "scheme": "exact",
      "network": "base-sepolia",
      "maxAmountRequired": "140000",
      "resource": "POST /extract/doc",
      "description": "Horizon Word document extraction",
      "mimeType": "application/json",
      "payTo": "0xYourReceivingWallet",
      "maxTimeoutSeconds": 300,
      "asset": "0xYourUSDCContract",
      "extra": {
        "name": "USDC",
        "version": "1"
      }
    }
  ],
  "error": null
}

Send the accepts payload to your facilitator, complete /verify and /settle, then replay the original request with the Base64 token in the X-PAYMENT header. Horizon validates the proof, resumes extraction, and returns an X-PAYMENT-RESPONSE header on success. Refer to the Coinbase quickstart if you need a reference implementation.
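The client-side bookkeeping around that flow might be sketched as follows. All function names are hypothetical; the 6-decimal conversion assumes the asset is USDC, and treating X-PAYMENT and X-PAYMENT-RESPONSE as Base64-encoded JSON is an assumption to check against the x402 specification. Producing the signed payment payload itself is left to your wallet or facilitator tooling.

```python
import base64
import json
from decimal import Decimal


def select_requirement(challenge: dict, scheme: str = "exact", network: str = "base-sepolia") -> dict:
    """Pick the first entry in accepts that matches the scheme and network."""
    for requirement in challenge.get("accepts", []):
        if requirement.get("scheme") == scheme and requirement.get("network") == network:
            return requirement
    raise LookupError(f"no payment requirement for {scheme} on {network}")


def atomic_to_decimal(amount: str, decimals: int = 6) -> Decimal:
    """Convert an atomic-unit string to a token amount (assumes 6 decimals, as for USDC)."""
    return Decimal(amount).scaleb(-decimals)


def encode_payment_header(payment_payload: dict) -> str:
    """Serialize a payment payload to the Base64 token sent in X-PAYMENT."""
    raw = json.dumps(payment_payload, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_payment_response(header_value: str) -> dict:
    """Decode an X-PAYMENT-RESPONSE value, assuming Base64-encoded JSON."""
    return json.loads(base64.b64decode(header_value))
```

With the challenge shown above, select_requirement returns the single base-sepolia entry, and atomic_to_decimal("140000") gives 0.14 under the 6-decimal assumption.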
