Introduction

Crawling automatically visits every accessible subpage of a website and converts the pages into structured JSON or markdown. This guide shows you how to:
  • Initiate a crawling session
  • Check crawling session status
  • List crawled pages
  • Access crawled page content

Prerequisites

To get the most out of this guide, you’ll need a Kadoa API key to authenticate your requests.

1. Start a Crawl

To initiate a web crawl, send a POST request to the /crawl endpoint (relative to https://api.kadoa.com/v4) with the desired configuration. View full API reference →
// POST /v4/crawl
{
  "url": "https://demo.vercel.store/",
  "maxDepth": 10,
  "maxPages": 50
}
Use pathsFilterIn and pathsFilterOut to include or exclude specific paths. Adjust timeout, maxDepth, and maxPages to refine the crawling process.
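
For illustration, here is a minimal Python sketch of the same request using the requests library. The x-api-key header name and the KADOA_API_KEY environment variable are assumptions made for this example; check the API reference for the exact authentication scheme.

import os
import requests

API_BASE = "https://api.kadoa.com/v4"
# Assumption: the API key is sent in an x-api-key header; verify in the API reference.
HEADERS = {"x-api-key": os.environ["KADOA_API_KEY"]}

payload = {
    "url": "https://demo.vercel.store/",
    "maxDepth": 10,
    "maxPages": 50,
    # Optional path filters described above (illustrative values):
    "pathsFilterIn": ["/products"],
    "pathsFilterOut": ["/cart"],
}

response = requests.post(f"{API_BASE}/crawl", json=payload, headers=HEADERS)
response.raise_for_status()
# Assumption: the response body returns the session identifier as "sessionId".
session_id = response.json()["sessionId"]
print("Crawl session started:", session_id)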

2. Check Crawl Status

Monitor the progress of your crawling session using the /crawl/<sessionId>/status endpoint. View full API reference →
// GET /v4/crawl/<sessionId>/status
{
  "payload": {
    "crawledPages": 14,
    "finished": true
  },
  "sessionId": "<sessionId>",
  "error": null
}
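
Continuing the Python sketch from step 1 (reusing API_BASE, HEADERS, and session_id), a simple polling loop can wait for the session to finish. The response fields match the example payload above.

import time

def wait_for_crawl(session_id, poll_interval=5.0):
    """Poll the status endpoint until the crawl session reports finished."""
    while True:
        resp = requests.get(f"{API_BASE}/crawl/{session_id}/status", headers=HEADERS)
        resp.raise_for_status()
        body = resp.json()
        payload = body.get("payload", {})
        print("Crawled pages so far:", payload.get("crawledPages", 0))
        if payload.get("finished"):
            return body
        time.sleep(poll_interval)

status = wait_for_crawl(session_id)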

3. List Crawled Pages

Access the crawled pages using the /crawl/<sessionId>/pages endpoint with pagination. View full API reference →
// GET /v4/crawl/<sessionId>/pages?currentPage=0&pageSize=100
Query parameters:
  • currentPage: Zero-based page index (non-negative integer, starting from 0).
  • pageSize: Number of results per page (positive integer, minimum 1).
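
Building on the same Python sketch, the listing endpoint can be walked page by page. The key holding the results (assumed here to be "pages") is an assumption; consult the API reference for the exact response shape.

def list_crawled_pages(session_id, page_size=100):
    """Collect all crawled pages by stepping through the paginated listing."""
    pages, current_page = [], 0
    while True:
        resp = requests.get(
            f"{API_BASE}/crawl/{session_id}/pages",
            params={"currentPage": current_page, "pageSize": page_size},
            headers=HEADERS,
        )
        resp.raise_for_status()
        batch = resp.json().get("pages", [])  # assumption: results live under a "pages" key
        if not batch:
            break
        pages.extend(batch)
        current_page += 1
    return pages

crawled_pages = list_crawled_pages(session_id)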

4. Retrieve Page Content

Now let’s retrieve the content of the crawled pages in our preferred format. The API can deliver the page payload directly in an LLM-ready format, such as markdown. View full API reference →
// GET /v4/crawl/<sessionId>/pages/<pageId>?format=md
Supported Formats:
  • html: Full HTML structure
  • md: Markdown format
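
To round off the sketch, each page can be fetched as markdown. The format query parameter matches the request above; the field exposing the page identifier (assumed "id") is an assumption.

def get_page_content(session_id, page_id, fmt="md"):
    """Fetch a single crawled page in the requested format (html or md)."""
    resp = requests.get(
        f"{API_BASE}/crawl/{session_id}/pages/{page_id}",
        params={"format": fmt},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.text

# Assumption: each page object exposes its identifier as "id".
for page in crawled_pages:
    markdown = get_page_content(session_id, page["id"])
    print(markdown[:200])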