POST /extract/website

Extract single web pages
curl --request POST \
  --url https://api.horizon.new/v1/extract/website \
  --header 'Content-Type: application/json' \
  --data '{
  "sourceUrl": "<string>",
  "sourceName": "<string>",
  "options": {},
  "webhookUrl": "<string>"
}'
{
  "jobId": "job_01hx9q9",
  "status": "queued",
  "statusUrl": "<string>",
  "result": {},
  "etaSeconds": 123,
  "jobType": "extract/pdf"
}
Fetch a single webpage (or lightweight article) and convert it into clean text chunks. Use this when you need one-off ingestion without running a full crawl.

Request body

  • sourceUrl — Absolute URL to the page. Required if file is not provided.
  • file — Optional HTML snapshot (text/html) uploaded via multipart/form-data.
  • sourceName — Optional label stored with the extracted record.
  • options — Optional object. Supported keys:
    • selector — CSS selector that scopes extraction to a specific container.
    • stripSelectors — Array of selectors to remove (ads, nav, etc.).
    • segmentLength — Target characters per chunk (default 1200).
    • language — ISO language hint for improved segmentation.
  • webhookUrl — Optional HTTPS URL Horizon should call when the extraction finishes.

Sample request

curl https://api.horizon.new/v1/extract/website \
  -H "Content-Type: application/json" \
  -d '{
    "sourceUrl": "https://blog.horizon.new/horizon-product-overview",
    "sourceName": "Horizon Product Overview",
    "options": {
      "selector": "article",
      "stripSelectors": [".share-buttons", ".newsletter-cta"],
      "segmentLength": 1100,
      "language": "en"
    }
  }'

# or upload an HTML snapshot as multipart/form-data with a file field

curl https://api.horizon.new/v1/extract/website \
  -F 'file=@snapshot.html;type=text/html' \
  -F 'sourceName=Horizon Product Overview Snapshot' \
  -F 'options={"selector": "article", "stripSelectors": [".share-buttons", ".newsletter-cta"], "segmentLength": 1100, "language": "en"}'
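
To be notified when extraction finishes instead of polling, add webhookUrl to the same request. A minimal sketch; the callback endpoint below is illustrative and the webhook payload shape is not documented here:

# ask Horizon to call back over HTTPS when the job finishes

curl https://api.horizon.new/v1/extract/website \
  -H "Content-Type: application/json" \
  -d '{
    "sourceUrl": "https://blog.horizon.new/horizon-product-overview",
    "webhookUrl": "https://example.com/hooks/horizon-extraction"
  }'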

Response

Returns 202 Accepted with jobId, status, and statusUrl. If the page is small, the request may complete synchronously and the response will include the normalized chunks under result.
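
A minimal sketch of submitting the request and capturing the job handle, assuming jq is installed; result is null unless the job completed synchronously:

RESPONSE=$(curl -s https://api.horizon.new/v1/extract/website \
  -H "Content-Type: application/json" \
  -d '{"sourceUrl": "https://blog.horizon.new/horizon-product-overview"}')

# keep jobId and statusUrl for polling; result is only populated on synchronous completion
echo "$RESPONSE" | jq '{jobId, status, statusUrl, result}'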

Notes

  • The extractor renders the page with a headless browser to execute light client-side JavaScript. Heavier SPAs may require exporting content or using the crawl endpoint.
  • Use stripSelectors to remove headers/footers, cookie banners, or social widgets before chunking.
  • Authentication-gated URLs are not supported; provide publicly accessible pages or host signed snapshots.
  • Poll GET /jobs/{jobId} (the same URL as statusUrl) to monitor progress or retrieve the extracted chunks later; see the polling sketch after this list.
  • To upload a static HTML snapshot instead of crawling, send multipart/form-data with a file field.
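
A polling sketch against the returned statusUrl, assuming jq and the status values listed in the response schema (queued, processing, completed, failed). The URL below is illustrative; use the statusUrl from your own response:

STATUS_URL="https://api.horizon.new/v1/jobs/job_01hx9q9"

# poll until the job leaves the queued/processing states
while true; do
  JOB=$(curl -s "$STATUS_URL")
  STATUS=$(echo "$JOB" | jq -r '.status')
  case "$STATUS" in
    completed|failed) break ;;
  esac
  sleep 5
done

echo "$JOB" | jq '.result'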

x402 flow

Website extraction is priced per page via Coinbase’s x402 protocol. A request without a payment proof receives:
HTTP/1.1 402 Payment Required
Content-Type: application/json

{
  "x402Version": 1,
  "accepts": [
    {
      "scheme": "exact",
      "network": "base-sepolia",
      "maxAmountRequired": "130000",
      "resource": "POST /extract/website",
      "description": "Horizon website extraction",
      "mimeType": "application/json",
      "payTo": "0xYourReceivingWallet",
      "maxTimeoutSeconds": 180,
      "asset": "0xYourUSDCContract",
      "extra": {
        "name": "USDC",
        "version": "1"
      }
    }
  ],
  "error": null
}
Send the challenge to your facilitator, call /verify and /settle, then replay the request with the facilitator-issued Base64 payload in X-PAYMENT. Horizon resumes processing and returns settlement details via X-PAYMENT-RESPONSE.
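
A sketch of the replay step, assuming PAYMENT_PAYLOAD holds the Base64 payload issued by your facilitator after /verify and /settle:

# replay the original request with the payment proof; -D - prints response headers
# so the X-PAYMENT-RESPONSE settlement details are visible
curl https://api.horizon.new/v1/extract/website \
  -H "Content-Type: application/json" \
  -H "X-PAYMENT: $PAYMENT_PAYLOAD" \
  -D - \
  -d '{"sourceUrl": "https://blog.horizon.new/horizon-product-overview"}'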

Body

application/json

Provide either sourceUrl or file.

sourceUrl
string<uri>
sourceName
string
options
object

Extraction hints for this endpoint: selector, stripSelectors, segmentLength, and language.

webhookUrl
string<uri>

Webhook to call when the extraction completes.

file
file

Upload the raw file instead of providing sourceUrl.

Response

Extraction job accepted

jobId
string
required
Example:

"job_01hx9q9"

status
enum<string>
required
Available options:
queued,
processing,
completed,
failed
statusUrl
string<uri>
required

Canonical link to GET /jobs/{jobId} for this job.

jobType
string
required
Example:

"extract/pdf"

result
object | null

Present when the job completes synchronously.

etaSeconds
integer | null

Estimated seconds until completion.