Skip to main content
Fetch a single webpage (or lightweight article) and convert it into clean text chunks. Use this when you need one-off ingestion without running a full crawl.

Request body

  • sourceUrl — Absolute URL to the page. Required if file is not provided.
  • file — Optional HTML snapshot (text/html) uploaded via multipart/form-data.
  • sourceName — Optional label stored with the extracted record.
  • options — Optional object. Supported keys:
    • selector — CSS selector that scopes extraction to a specific container.
    • stripSelectors — Array of selectors to remove (ads, nav, etc.).
    • segmentLength — Target characters per chunk (default 1200).
    • language — ISO language hint for improved segmentation.
  • webhookUrl — Optional HTTPS URL Horizon should call when the extraction finishes.

Sample request

curl https://api.worklet.cloud/v1/extract/website \
  -H "Content-Type: application/json" \
  -d '{
    "sourceUrl": "https://blog.horizon.new/horizon-product-overview",
    "sourceName": "Horizon Product Overview",
    "options": {
      "selector": "article",
      "stripSelectors": [".share-buttons", ".newsletter-cta"],
      "segmentLength": 1100,
      "language": "en"
    }
  }'

# or upload an HTML snapshot

curl https://api.worklet.cloud/v1/extract/website \
  -H "Content-Type: application/json" \
  -d '{
    "file": "data:text/html;base64,PGh0bWw+PGhlYWQ+...",
    "sourceName": "Horizon Product Overview Snapshot",
    "options": {
      "selector": "article",
      "stripSelectors": [".share-buttons", ".newsletter-cta"],
      "segmentLength": 1100,
      "language": "en"
    }
  }'

Response

Returns 202 Accepted with jobId, status, and statusUrl. If the URL is concise, the request may finish synchronously and include the normalized chunks under result.

Notes

  • The extractor renders the page with a headless browser to execute light client-side JavaScript. Heavier SPAs may require exporting content or using the crawl endpoint.
  • Use stripSelectors to remove headers/footers, cookie banners, or social widgets before chunking.
  • Authentication-gated URLs are not supported; provide publicly accessible pages or host signed snapshots.
  • Poll GET /jobs/{jobId} (same as the statusUrl) to monitor progress or retrieve the extracted chunks later.
  • To upload a static HTML snapshot instead of crawling, send multipart/form-data with a file field.

x402 flow

Website extraction is priced per page via Coinbase’s x402 protocol. A missing proof results in:
HTTP/1.1 402 Payment Required
Content-Type: application/json

{
  "x402Version": 1,
  "accepts": [
    {
      "scheme": "exact",
      "network": "base-sepolia",
      "maxAmountRequired": "130000",
      "resource": "POST /extract/website",
      "description": "Horizon website extraction",
      "mimeType": "application/json",
      "payTo": "0xYourReceivingWallet",
      "maxTimeoutSeconds": 180,
      "asset": "0xYourUSDCContract",
      "extra": {
        "name": "USDC",
        "version": "1"
      }
    }
  ],
  "error": null
}
Send the challenge to your facilitator, call /verify and /settle, then replay the request with the facilitator-issued Base64 payload in X-PAYMENT. Horizon resumes processing and returns settlement details via X-PAYMENT-RESPONSE.