Crawling - Kadoa API

Introduction

Crawling automatically visits all accessible subpages of a website and converts them into structured JSON or markdown. This guide shows you how to:

Initiate a crawling session
Check crawling session status
List crawled pages
Access crawled page content

Prerequisites

To get the most out of this guide, you’ll need to:

Create a Kadoa account
Get your API key

1. Start a Crawl

To initiate a web crawl, send a POST request to the /crawl endpoint with the desired configuration.

// POST /v4/crawl
{
  "url": "https://demo.vercel.store/",
  "maxDepth": 10,
  "maxPages": 50
}

Use pathsFilterIn and pathsFilterOut to include or exclude specific paths. Adjust timeout, maxDepth, and maxPages to refine the crawling process.
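As a rough illustration, the request above could be assembled in Python using only the standard library. This is a sketch under assumptions, not the official client: the `x-api-key` header name and the list-of-strings shape for `pathsFilterIn`/`pathsFilterOut` are assumptions to verify against the API reference, and `build_crawl_request` and `start_crawl` are hypothetical helper names.

```python
import json
import urllib.request

# Host taken from the status example in this guide.
BASE_URL = "https://api.kadoa.com/v4"


def build_crawl_request(api_key, url, max_depth=10, max_pages=50,
                        paths_filter_in=None, paths_filter_out=None):
    """Build (but do not send) the POST /crawl request."""
    body = {"url": url, "maxDepth": max_depth, "maxPages": max_pages}
    # Assumption: path filters are sent as lists of strings.
    if paths_filter_in:
        body["pathsFilterIn"] = list(paths_filter_in)
    if paths_filter_out:
        body["pathsFilterOut"] = list(paths_filter_out)
    return urllib.request.Request(
        f"{BASE_URL}/crawl",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,  # assumed header name; check the API reference
        },
        method="POST",
    )


def start_crawl(api_key, url, **options):
    """Send the request and return the decoded JSON response."""
    request = build_crawl_request(api_key, url, **options)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Separating request construction from sending keeps the payload easy to inspect before any network call is made.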

2. Check Crawl Status

Monitor the progress of your crawling session using the /crawl/<sessionId>/status endpoint.

GET https://api.kadoa.com/v4/crawl/<sessionId>/status

{
  "payload": {
    "crawledPages": 14,
    "finished": true
  },
  "sessionId": "<sessionId>",
  "error": null
}
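Because a crawl runs asynchronously, a client would typically poll this endpoint until `finished` is true. The following is a minimal sketch assuming the response shape shown above; the `x-api-key` header name, the polling interval, and the helper names (`is_finished`, `fetch_status`, `wait_for_crawl`) are assumptions made for illustration.

```python
import json
import time
import urllib.request

BASE_URL = "https://api.kadoa.com/v4"


def is_finished(status):
    """Return True once the status response reports a completed crawl."""
    if status.get("error"):
        raise RuntimeError(f"Crawl failed: {status['error']}")
    return bool((status.get("payload") or {}).get("finished"))


def fetch_status(api_key, session_id):
    """GET /crawl/<sessionId>/status and decode the JSON body."""
    request = urllib.request.Request(
        f"{BASE_URL}/crawl/{session_id}/status",
        headers={"x-api-key": api_key},  # assumed header name
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_crawl(api_key, session_id, interval=5.0, max_checks=120,
                   fetch=fetch_status):
    """Poll until the crawl finishes; `fetch` can be swapped out in tests."""
    for _ in range(max_checks):
        status = fetch(api_key, session_id)
        if is_finished(status):
            return status
        time.sleep(interval)
    raise TimeoutError(f"Crawl {session_id} did not finish after {max_checks} checks")
```

Capping the number of checks avoids polling forever if a session never completes.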

3. List Crawled Pages

Access the crawled pages using the /crawl/<sessionId>/pages endpoint with pagination.

// GET /v4/crawl/<sessionId>/pages?currentPage=0&pageSize=100

Query parameters:

currentPage: Non-negative integer, starting from 0.
pageSize: Positive integer, starting from 1.
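The pagination parameters can be validated and encoded before the request is made. This is a small sketch: `pages_url` is a hypothetical helper, and the base URL assumes the host shown in the status example. The response shape for this endpoint is not described here, so the sketch stops at building the URL.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.kadoa.com/v4"


def pages_url(session_id, current_page=0, page_size=100):
    """Build the URL for one page of crawl results."""
    if current_page < 0:
        raise ValueError("currentPage starts from 0")
    if page_size < 1:
        raise ValueError("pageSize starts from 1")
    query = urlencode({"currentPage": current_page, "pageSize": page_size})
    return f"{BASE_URL}/crawl/{session_id}/pages?{query}"
```

To walk through all results, increment `current_page` on each request while keeping `page_size` fixed.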

4. Retrieve Page Content

Now let’s retrieve the content of the crawled pages in our preferred format. The API can deliver the page payload directly in an LLM-ready format, such as markdown.

// GET /v4/crawl/<sessionId>/pages/<pageId>?format=md

Supported Formats:

html: Full HTML structure
md: Markdown format
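A client might guard the `format` parameter against unsupported values before fetching a page. This is a sketch only: `page_content_url` and `fetch_page_content` are hypothetical helpers, the `x-api-key` header name is an assumption, and the body is returned as raw text because the exact response envelope is not shown in this guide.

```python
import urllib.request
from urllib.parse import urlencode

BASE_URL = "https://api.kadoa.com/v4"
SUPPORTED_FORMATS = {"html", "md"}  # the two formats listed above


def page_content_url(session_id, page_id, fmt="md"):
    """Build the URL for one crawled page in the requested format."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format {fmt!r}; expected one of {sorted(SUPPORTED_FORMATS)}")
    query = urlencode({"format": fmt})
    return f"{BASE_URL}/crawl/{session_id}/pages/{page_id}?{query}"


def fetch_page_content(api_key, session_id, page_id, fmt="md"):
    """Fetch the page and return the response body as text."""
    request = urllib.request.Request(
        page_content_url(session_id, page_id, fmt),
        headers={"x-api-key": api_key},  # assumed header name
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```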
