Crawl all accessible subpages of a website and convert them into structured JSON or markdown. This guide covers:
  • Initiating crawl sessions with single or multiple URLs
  • Checking crawl status
  • Listing crawled pages
  • Accessing page content

Prerequisites

  • Kadoa account with API key
  • SDK installed: npm install @kadoa/node-sdk or pip install kadoa-sdk

1. Start a Crawl

Start a crawl session with a single URL or multiple URLs from the same domain. View full API reference →
import { KadoaClient } from '@kadoa/node-sdk';

const client = new KadoaClient({ apiKey: 'YOUR_API_KEY' });

const result = await client.crawler.session.start({
  url: "https://demo.vercel.store/",
  maxDepth: 10,
  maxPages: 50,
});

console.log(result.sessionId);

Multiple URLs

Crawl from multiple entry points on the same domain:
const result = await client.crawler.session.start({
  startUrls: [
    "https://demo.vercel.store/",
    "https://demo.vercel.store/collections",
    "https://demo.vercel.store/about",
  ],
  maxDepth: 10,
  maxPages: 50,
});
When using startUrls, all URLs must be from the same domain or subdomain. You can mix www.example.com and shop.example.com, but not example.com and different-site.com.

2. Check Crawl Status

Monitor crawl progress to know when extraction is complete. View full API reference →
// sessionId is the value returned by session.start in step 1 (result.sessionId)
const status = await client.crawler.session.getSessionStatus(sessionId);

console.log(status.payload.crawledPages);
console.log(status.payload.finished);
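
For long-running crawls, poll the status until the session reports completion. This is a minimal sketch using only the getSessionStatus call above; the 5-second interval and the helper name waitForCrawl are arbitrary choices.
// Poll every 5 seconds until the session is finished.
async function waitForCrawl(sessionId: string) {
  while (true) {
    const status = await client.crawler.session.getSessionStatus(sessionId);
    console.log(`Crawled ${status.payload.crawledPages} pages so far`);
    if (status.payload.finished) return;
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }
}

await waitForCrawl(result.sessionId);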

3. List Crawled Pages

Get a paginated list of crawled pages with their statuses. View full API reference →
const pages = await client.crawler.session.getPages(sessionId, {
  currentPage: 1,
  pageSize: 100,
});

for (const page of pages.payload) {
  console.log(page.id, page.url, page.status);
}
Page statuses: DONE, CRAWLING, PENDING
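
To collect every page from a large crawl, increment currentPage until a batch comes back empty. This sketch relies only on the getPages call and the payload array shown above.
// Page through the results until an empty batch is returned.
const allPages: any[] = [];
let currentPage = 1;

while (true) {
  const batch = await client.crawler.session.getPages(sessionId, {
    currentPage,
    pageSize: 100,
  });
  if (batch.payload.length === 0) break;
  allPages.push(...batch.payload);
  currentPage += 1;
}

console.log(`Fetched ${allPages.length} pages in total`);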

4. Retrieve Page Content

Get page content in markdown (LLM-ready) or HTML format. View full API reference →
// Get as markdown (pageId comes from the page list in step 3)
const markdown = await client.crawler.session.getPage(sessionId, pageId, {
  format: "markdown",
});

console.log(markdown.payload);

// Get as HTML
const html = await client.crawler.session.getPage(sessionId, pageId, {
  format: "html",
});
Supported formats: md (markdown), html
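
A common follow-up is to export every finished page as markdown, for example to feed an LLM pipeline. The sketch below combines the getPages and getPage calls from steps 3 and 4; the export directory, the filename scheme, and the assumption that markdown.payload is a string are illustrative choices.
import { mkdir, writeFile } from 'node:fs/promises';

// Save the markdown of every finished page to ./export/<pageId>.md
await mkdir('export', { recursive: true });

const pages = await client.crawler.session.getPages(sessionId, {
  currentPage: 1,
  pageSize: 100,
});

for (const page of pages.payload) {
  if (page.status !== 'DONE') continue; // skip pages still pending or crawling
  const markdown = await client.crawler.session.getPage(sessionId, page.id, {
    format: 'markdown',
  });
  await writeFile(`export/${page.id}.md`, String(markdown.payload));
}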

Configuration Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | - | Single URL to crawl (use this or startUrls) |
| startUrls | string[] | - | Multiple URLs to crawl (use this or url) |
| maxDepth | number | - | Maximum crawl depth from the entry points |
| maxPages | number | - | Maximum number of pages to crawl |
| maxMatches | number | - | Stop after N matched pages (with blueprint) |
| pathsFilterIn | string[] | - | Regex patterns for paths to include |
| pathsFilterOut | string[] | - | Regex patterns for paths to exclude |
| proxyType | string | null | Proxy type: "dc" (datacenter) or "residential" |
| proxyCountry | string | - | ISO country code for the proxy location |
| concurrency | number | 20 | Number of parallel crawlers |
| timeout | number | - | Request timeout in milliseconds |
| strictDomain | boolean | true | Stay within the same domain |
| loadImages | boolean | true | Load images during the crawl |
| callbackUrl | string | - | Webhook URL for completion notification |
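
As an illustration, the start call below combines several of these options: path filters, a datacenter proxy, reduced concurrency, and a completion webhook. The values are examples only and the webhook URL is a placeholder.
const session = await client.crawler.session.start({
  url: "https://demo.vercel.store/",
  maxDepth: 5,
  maxPages: 200,
  pathsFilterIn: ["^/products/"],      // only follow product paths
  pathsFilterOut: ["^/cart", "^/account"],
  proxyType: "dc",                     // datacenter proxy
  proxyCountry: "US",
  concurrency: 10,                     // below the default of 20
  strictDomain: true,
  loadImages: false,                   // skip images for faster crawling
  callbackUrl: "https://example.com/webhooks/crawl-finished",
});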

Artifact Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| screenshot | boolean | false | Capture page screenshots |
| screenshotFull | boolean | false | Capture full-page screenshots |
| archivePdf | boolean | false | Generate PDF archives |
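
The sketch below assumes artifact options are passed to session.start alongside the configuration options above; treat it as an illustration of the table rather than a complete reference.
// Assumption: artifact flags are accepted by session.start together with crawl options.
const session = await client.crawler.session.start({
  url: "https://demo.vercel.store/",
  maxPages: 20,
  screenshot: true,
  screenshotFull: true,
  archivePdf: true,
});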

Error Handling

| Error | Cause | Solution |
| --- | --- | --- |
| 401 Unauthorized | Invalid API key | Verify API key in dashboard |
| 402 Payment Required | Insufficient credits | Top up account credits |
| 404 Not Found | Invalid session or page ID | Verify ID exists |
| 429 Too Many Requests | Rate limit exceeded | Reduce request frequency |
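
These errors surface as exceptions thrown by the SDK calls. The sketch below wraps a call in try/catch and inspects a hypothetical status field on the error; check the SDK's error types for the exact shape.
try {
  const status = await client.crawler.session.getSessionStatus(sessionId);
  console.log(status.payload.finished);
} catch (error: any) {
  // `error.status` is a hypothetical field used for illustration;
  // the real error shape depends on the SDK.
  if (error?.status === 429) {
    console.warn('Rate limited; back off before retrying');
  } else {
    console.error('Request failed:', error);
  }
}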

Next Steps