browser_extract_site
Extract content from multiple pages of a website. Crawls links starting from a URL and extracts content in the specified format. Returns a job_id immediately so progress can be tracked asynchronously. Use browser_extract_site_progress to monitor the job and browser_extract_site_result to retrieve the output.
Usage Example
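A minimal sketch of assembling the arguments for this tool, using only the parameter names listed under Parameters. The helper build_extract_site_args is hypothetical and not part of the tool. How the arguments are actually sent (MCP client, HTTP call, SDK) is not specified here, so the sketch stops at producing the argument object. The listing types depth, max_pages and timeout_per_page as string while giving numeric defaults, so the example leaves them at their defaults rather than guess the wire type.

```python
import json

# Parameter names taken from the Parameters section of this page.
OPTIONAL_PARAMS = {
    "depth", "max_pages", "follow_external", "output_format",
    "include_images", "include_metadata", "exclude_patterns",
    "timeout_per_page",
}
OUTPUT_FORMATS = {"markdown", "text", "json"}


def build_extract_site_args(context_id, url, **options):
    """Hypothetical helper: build the argument object for browser_extract_site."""
    unknown = set(options) - OPTIONAL_PARAMS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    fmt = options.get("output_format", "markdown")
    if fmt not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {fmt!r}")
    # Required parameters first, then whichever optional ones were supplied.
    return {"context_id": context_id, "url": url, **options}


args = build_extract_site_args(
    "ctx_000001",
    "https://example.com/docs",
    follow_external=False,
    output_format="markdown",
    exclude_patterns=["*/login*", "*/admin/*"],
)
print(json.dumps(args, indent=2))
```

Omitted optional parameters are simply left out of the object, so the documented defaults apply.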
Parameters
Required
context_id (string, required): The unique identifier of the browser context (e.g., 'ctx_000001').
url (string, required): Starting URL to begin extraction from.
Optional
depth (string): How many link levels to follow from the starting page. Default: 2. Higher values extract more pages but take longer.
max_pages (string): Maximum number of pages to extract. Default: 5. Limits total extraction to prevent runaway crawling.
follow_external (boolean): Whether to follow links to external domains. Default: false. When false, only links within the same domain are followed.
output_format (enum: markdown, text, json): Output format for extracted content: 'markdown' (default), 'text', or 'json'. Markdown preserves structure, text is plain, JSON includes metadata.
include_images (boolean): Include resolved image URLs in output. Default: true.
include_metadata (boolean): Include page title and description metadata. Default: true.
exclude_patterns (string): Array of URL patterns to skip (glob patterns). Example: ["*/login*", "*/admin/*"]
timeout_per_page (string): Timeout per page in milliseconds. Default: 10000 (10 seconds).
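To get a feel for how the example exclude_patterns values behave, the sketch below checks URLs against them with Python's fnmatch module. This is an approximation: the tool's own glob implementation is not described here and may differ (for example in case sensitivity or in how `*` treats `/`). The helper is_excluded and the example.com URLs are illustrative only.

```python
from fnmatch import fnmatchcase


def is_excluded(url, patterns):
    """Return True if the URL matches any glob pattern (fnmatch semantics)."""
    return any(fnmatchcase(url, pattern) for pattern in patterns)


patterns = ["*/login*", "*/admin/*"]

for url in (
    "https://example.com/login?next=/home",
    "https://example.com/admin/users",
    "https://example.com/docs/intro",
):
    print(url, "->", "skip" if is_excluded(url, patterns) else "crawl")
```

Under fnmatch semantics `*` also matches `/`, so "*/admin/*" matches an admin path at any depth, but not a bare ".../admin" without a trailing slash.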
Response
Returns a JSON object with the operation result.
{
  "success": true,
  "result": <value>
}
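Because the call returns a job_id right away, a caller typically starts the job, polls browser_extract_site_progress, and then fetches browser_extract_site_result. The sketch below shows that loop under several assumptions that this page does not confirm: call_tool is a hypothetical function supplied by the caller that invokes a tool by name and returns the parsed JSON response; the job_id is assumed to sit at result["job_id"]; the progress and result tools are assumed to take a job_id argument; and the progress response is assumed to carry a status field with the values "completed" and "failed". Check the documentation for those two tools before relying on any of these.

```python
import time


def run_site_extraction(call_tool, args, poll_interval=1.0, max_polls=60):
    """Start an extraction job, poll until it finishes, return the result response.

    call_tool(name, arguments) is a hypothetical caller-supplied function.
    """
    start = call_tool("browser_extract_site", args)
    if not start.get("success"):
        raise RuntimeError(f"extraction did not start: {start!r}")
    job_id = start["result"]["job_id"]  # assumed location of job_id

    for _ in range(max_polls):
        progress = call_tool("browser_extract_site_progress", {"job_id": job_id})
        status = progress["result"].get("status")  # assumed field name
        if status == "completed":  # assumed status value
            return call_tool("browser_extract_site_result", {"job_id": job_id})
        if status == "failed":  # assumed status value
            raise RuntimeError(f"extraction job {job_id} failed: {progress!r}")
        time.sleep(poll_interval)

    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")
```

Passing call_tool in as a parameter keeps the loop independent of the transport, and max_polls bounds the wait so a stalled job cannot block the caller indefinitely.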