Introduction
Crawling automatically visits all accessible subpages of a website and converts them into structured JSON or Markdown. This guide helps you:
- Initiate a crawling session with a single URL or multiple URLs
- Check crawling session status
- List crawled pages
- Access crawled page content
Prerequisites
To get the most out of this guide, you’ll need to:
- Create a Kadoa account
- Get your API key
1. Start a Crawl
To initiate a web crawl, send a POST request to the `/crawl` endpoint with the desired configuration. You can provide either a single URL or multiple starting URLs from the same domain.
View full API reference →
When using `startUrls` with multiple URLs, all URLs must be from the same domain or subdomain. For example, you can mix www.example.com and shop.example.com, but not example.com and different-site.com.
- Use either `url` (single URL) or `startUrls` (multiple URLs), not both
- Use `pathsFilterIn` and `pathsFilterOut` to include or exclude specific paths
- Adjust `timeout`, `maxDepth`, and `maxPages` to refine the crawling process
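As a rough sketch, a request with multiple starting URLs could look like the following. The base URL, the `x-api-key` header name, and the exact shape of the filter values are assumptions here; confirm them against the API reference linked above.

```python
import requests

# Assumed values -- replace with the base URL and auth header
# documented in the API reference and your dashboard.
BASE_URL = "https://api.kadoa.com"
API_KEY = "YOUR_API_KEY"
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}

# Start a crawl from multiple URLs on the same domain, narrowing the
# scope with path filters and crawl limits.
payload = {
    "startUrls": [
        "https://www.example.com",
        "https://shop.example.com",
    ],
    "pathsFilterIn": ["/products"],   # assumed format: list of path patterns to include
    "pathsFilterOut": ["/admin"],     # assumed format: list of path patterns to exclude
    "maxDepth": 3,
    "maxPages": 100,
}

response = requests.post(f"{BASE_URL}/crawl", json=payload, headers=HEADERS)
response.raise_for_status()
session = response.json()
print(session)  # expected to include the session identifier used in the next steps
```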
2. Check Crawl Status
Monitor the progress of your crawling session using the `/crawl/status/<sessionId>` endpoint.
View full API reference →
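A minimal status check might look like this, assuming the same base URL and auth header as above and using the session ID returned when the crawl was started.

```python
import requests

BASE_URL = "https://api.kadoa.com"        # assumed base URL
HEADERS = {"x-api-key": "YOUR_API_KEY"}   # assumed auth header name

session_id = "YOUR_SESSION_ID"  # returned by the /crawl request

# Poll this endpoint until the session reports completion.
status = requests.get(f"{BASE_URL}/crawl/status/{session_id}", headers=HEADERS)
status.raise_for_status()
print(status.json())  # inspect the progress fields returned for the session
```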
3. List Crawled Pages
Access the crawled pages using the `/crawl/<sessionId>/pages` endpoint with pagination.
View full API reference →
- `currentPage`: Positive integer, starting from 0.
- `pageSize`: Positive integer, starting from 1.
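A paginated listing request could be sketched as follows; again, the base URL and auth header are assumptions, while `currentPage` and `pageSize` are the query parameters described above.

```python
import requests

BASE_URL = "https://api.kadoa.com"        # assumed base URL
HEADERS = {"x-api-key": "YOUR_API_KEY"}   # assumed auth header name
session_id = "YOUR_SESSION_ID"

# Fetch the first page of results, 25 crawled pages at a time.
params = {"currentPage": 0, "pageSize": 25}
resp = requests.get(
    f"{BASE_URL}/crawl/{session_id}/pages",
    headers=HEADERS,
    params=params,
)
resp.raise_for_status()
print(resp.json())  # the listing identifies each crawled page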
4. Retrieve Page Content
Now let’s retrieve the content of the crawled pages in our preferred format. The API can deliver the page payload directly in an LLM-ready format, such as Markdown.
View full API reference →
- `html`: Full HTML structure
- `md`: Markdown format
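The sketch below shows the general idea of requesting a single page’s content as Markdown. The content endpoint path, the page identifier, and the `format` query parameter name are hypothetical placeholders, since this section does not spell them out; use the exact route and parameters from the API reference.

```python
import requests

BASE_URL = "https://api.kadoa.com"        # assumed base URL
HEADERS = {"x-api-key": "YOUR_API_KEY"}   # assumed auth header name
session_id = "YOUR_SESSION_ID"
page_id = "YOUR_PAGE_ID"                  # taken from the page listing above

# Hypothetical content endpoint and "format" parameter -- check the
# API reference for the actual path and parameter names.
resp = requests.get(
    f"{BASE_URL}/crawl/{session_id}/pages/{page_id}/content",
    headers=HEADERS,
    params={"format": "md"},  # "md" for Markdown, "html" for full HTML
)
resp.raise_for_status()
print(resp.json())
```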