Crawl all accessible subpages of a website and convert them into structured JSON or markdown. This guide covers:
- Initiating crawl sessions with single or multiple URLs
- Checking crawl status
- Listing crawled pages
- Accessing page content
Prerequisites
- Kadoa account with API key
- SDK installed: npm install @kadoa/node-sdk or pip install kadoa-sdk
1. Start a Crawl
Start a crawl session with a single URL or multiple URLs from the same domain.
View full API reference →
Node SDK:
import { KadoaClient } from '@kadoa/node-sdk';

const client = new KadoaClient({ apiKey: 'YOUR_API_KEY' });

const result = await client.crawler.session.start({
  url: "https://demo.vercel.store/",
  maxDepth: 10,
  maxPages: 50,
});

console.log(result.sessionId);
Multiple URLs
Crawl from multiple entry points on the same domain:
Node SDK:
const result = await client.crawler.session.start({
  startUrls: [
    "https://demo.vercel.store/",
    "https://demo.vercel.store/collections",
    "https://demo.vercel.store/about",
  ],
  maxDepth: 10,
  maxPages: 50,
});
When using startUrls, all URLs must be from the same domain or subdomain. You can mix www.example.com and shop.example.com, but not example.com and different-site.com.
2. Check Crawl Status
Monitor crawl progress to know when extraction is complete.
View full API reference →
Node SDK:
const status = await client.crawler.session.getSessionStatus(sessionId);

console.log(status.payload.crawledPages);
console.log(status.payload.finished);
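If you want to block until the crawl has finished, you can poll this status call in a loop. A minimal sketch, reusing the sessionId returned in step 1 and only the crawledPages and finished fields shown above; the helper name and polling interval are illustrative assumptions:

// Poll the session status until the crawler reports completion.
async function waitForCrawl(sessionId: string, intervalMs = 5000) {
  while (true) {
    const status = await client.crawler.session.getSessionStatus(sessionId);
    console.log(`Crawled pages so far: ${status.payload.crawledPages}`);
    if (status.payload.finished) {
      return status;
    }
    // Wait before checking again to avoid hammering the API.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

const finalStatus = await waitForCrawl(result.sessionId);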
3. List Crawled Pages
Get a paginated list of crawled pages with their statuses.
View full API reference →
Node SDK:
const pages = await client.crawler.session.getPages(sessionId, {
  currentPage: 1,
  pageSize: 100,
});

for (const page of pages.payload) {
  console.log(page.id, page.url, page.status);
}
Page statuses: DONE, CRAWLING, PENDING
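For larger crawls you can walk through the list by incrementing currentPage. A minimal sketch, assuming an empty payload array signals that there are no further result pages:

// List every crawled page across all result pages.
let currentPage = 1;
let total = 0;
while (true) {
  const batch = await client.crawler.session.getPages(sessionId, {
    currentPage,
    pageSize: 100,
  });
  if (batch.payload.length === 0) break; // assumed end-of-results signal
  for (const page of batch.payload) {
    console.log(page.id, page.url, page.status);
    total += 1;
  }
  currentPage += 1;
}
console.log(`Listed ${total} pages in total`);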
4. Retrieve Page Content
Get page content in markdown (LLM-ready) or HTML format.
View full API reference →
Node SDK:
// Get as markdown
const markdown = await client.crawler.session.getPage(sessionId, pageId, {
  format: "markdown",
});
console.log(markdown.payload);

// Get as HTML
const html = await client.crawler.session.getPage(sessionId, pageId, {
  format: "html",
});
Supported formats: md (markdown), html
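Putting steps 3 and 4 together, a common pattern is to fetch the markdown for every page that has finished crawling. A minimal sketch using only the calls shown above, and assuming a single result page of up to 100 entries is enough for the example:

// Fetch LLM-ready markdown for every completed page.
const donePages = await client.crawler.session.getPages(sessionId, {
  currentPage: 1,
  pageSize: 100,
});

for (const page of donePages.payload) {
  if (page.status !== "DONE") continue; // skip pages still crawling or pending
  const content = await client.crawler.session.getPage(sessionId, page.id, {
    format: "markdown",
  });
  console.log(`--- ${page.url} ---`);
  console.log(content.payload);
}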
Configuration Options
- url (string): Single URL to crawl (use this or startUrls)
- startUrls (string[]): Multiple URLs to crawl (use this or url)
- maxDepth (number): Maximum crawl depth from entry points
- maxPages (number): Maximum pages to crawl
- maxMatches (number): Stop after N matched pages (with blueprint)
- pathsFilterIn (string[]): Regex patterns for paths to include
- pathsFilterOut (string[]): Regex patterns for paths to exclude
- proxyType (string, default null): Proxy type: "dc" (datacenter) or "residential"
- proxyCountry (string): ISO country code for proxy location
- concurrency (number, default 20): Number of parallel crawlers
- timeout (number): Request timeout in milliseconds
- strictDomain (boolean, default true): Stay within the same domain
- loadImages (boolean, default true): Load images during crawl
- callbackUrl (string): Webhook URL for completion notification
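To illustrate how several of these options combine, here is a sketch of a more tightly scoped crawl. The parameter names come from the table above; the URLs, regex patterns, and values are placeholders, not recommendations:

// Crawl only product pages, skip the cart, and route through a residential proxy.
const scopedSession = await client.crawler.session.start({
  url: "https://demo.vercel.store/",
  maxDepth: 5,
  maxPages: 200,
  pathsFilterIn: ["^/product/.*"],  // regex patterns for paths to include
  pathsFilterOut: ["^/cart.*"],     // regex patterns for paths to exclude
  proxyType: "residential",
  proxyCountry: "US",
  concurrency: 10,
  strictDomain: true,
  loadImages: false,                // skip images to speed up the crawl
  callbackUrl: "https://example.com/webhooks/crawl-finished",
});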
Artifact Options
- screenshot (boolean, default false): Capture page screenshots
- screenshotFull (boolean, default false): Capture full-page screenshots
- archivePdf (boolean, default false): Generate PDF archives
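A minimal sketch of enabling artifacts, assuming these flags are passed in the same options object as the crawl configuration above:

// Capture screenshots and PDF archives while crawling.
const artifactSession = await client.crawler.session.start({
  url: "https://demo.vercel.store/",
  maxPages: 20,
  screenshot: true,      // page screenshots
  screenshotFull: true,  // full-page screenshots
  archivePdf: true,      // PDF archive per page
});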
Error Handling
- 401 Unauthorized: Invalid API key. Verify the API key in your dashboard.
- 402 Payment Required: Insufficient credits. Top up account credits.
- 404 Not Found: Invalid session or page ID. Verify the ID exists.
- 429 Too Many Requests: Rate limit exceeded. Reduce request frequency.
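In the Node SDK these failures surface as thrown exceptions, so wrap calls in try/catch. A minimal sketch; the exact error shape depends on the SDK, and the status property checked here is an assumption for illustration:

try {
  const crawl = await client.crawler.session.start({
    url: "https://demo.vercel.store/",
    maxPages: 50,
  });
  console.log(crawl.sessionId);
} catch (error) {
  // Hypothetical: assumes the thrown error exposes the HTTP status code.
  const status = (error as { status?: number }).status;
  if (status === 429) {
    console.error("Rate limit exceeded: back off and retry later");
  } else if (status === 401) {
    console.error("Invalid API key: verify it in the dashboard");
  } else {
    console.error("Crawl request failed:", error);
  }
}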
Next Steps