
browser_extract_site

Browser Extract Site

Extracts content from multiple pages of a website. The tool crawls links starting from a URL and extracts each page in the specified format. It returns a job_id immediately so that progress can be tracked asynchronously: use browser_extract_site_progress to monitor the job and browser_extract_site_result to retrieve the output.

Usage Example

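import asyncio
from owl_browser import OwlBrowser, RemoteConfig

async def main(config: RemoteConfig) -> None:
    # Async usage: the browser is closed when the block exits
    async with OwlBrowser(config) as browser:
        context = await browser.create_context()
        context_id = context["context_id"]
        # Returns immediately; the crawl continues as a background job
        await browser.extract_site(
            context_id=context_id,
            url="https://example.com",
        )

# Build `config` for your deployment, then run: asyncio.run(main(config))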

Parameters

Required

context_id (string, required)

The unique identifier of the browser context (e.g., 'ctx_000001').

url (string, required)

Starting URL to begin extraction from.

Optional

depth (string)

How many link levels to follow from the starting page. Default: 2. Higher values extract more pages but take longer.

max_pages (string)

Maximum number of pages to extract. Default: 5. Limits total extraction to prevent runaway crawling.

follow_external (boolean)

Whether to follow links to external domains. Default: false. When false, only links within the same domain are followed.
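
To illustrate how depth, max_pages and follow_external interact, here is a small breadth-first sketch over an in-memory link graph. It is not the tool's actual crawler: the function name, the graph, and the choice to count the starting page as level 0 and as one of the max_pages are all assumptions made for illustration.

```python
from collections import deque
from urllib.parse import urlparse


def plan_crawl(start_url, links, depth=2, max_pages=5, follow_external=False):
    """Return URLs in the order a breadth-first crawl would visit them.

    `links` maps each URL to the URLs it links to (a stand-in for real
    pages). The start page is treated as level 0, which is an assumption.
    """
    start_host = urlparse(start_url).netloc
    visited = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(visited) < max_pages:
        url, level = queue.popleft()
        if level >= depth:
            continue  # links on this page lie beyond the depth limit
        for target in links.get(url, []):
            if target in seen:
                continue
            if not follow_external and urlparse(target).netloc != start_host:
                continue  # skip links that leave the starting domain
            seen.add(target)
            visited.append(target)
            queue.append((target, level + 1))
            if len(visited) >= max_pages:
                break
    return visited


site = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.org/x",
    ],
    "https://example.com/a": ["https://example.com/a/1", "https://example.com/a/2"],
    "https://example.com/b": ["https://example.com/b/1"],
    "https://example.com/a/1": ["https://example.com/a/1/deep"],
}

default_plan = plan_crawl("https://example.com/", site)
shallow_plan = plan_crawl("https://example.com/", site, depth=1, max_pages=10)
external_plan = plan_crawl(
    "https://example.com/", site, depth=1, max_pages=10, follow_external=True
)
```

With the defaults, the page limit is reached before the depth limit: the sketch stops after five pages even though https://example.com/b/1 is within two levels of the start.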

output_format (enum: markdown, text, json)

Output format for extracted content: 'markdown' (default), 'text', or 'json'. Markdown preserves structure, text is plain, and JSON includes metadata.

include_images (boolean)

Include resolved image URLs in output. Default: true.

include_metadata (boolean)

Include page title and description metadata. Default: true.

exclude_patterns (string)

Array of URL patterns to skip, written as glob patterns. Example: ["*/login*", "*/admin/*"]
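
To preview which URLs a set of patterns would skip, the example patterns can be tried locally. This sketch assumes the tool's glob matching behaves like Python's fnmatch module, where * matches any run of characters including "/". The tool's own matcher may differ, and is_excluded is a hypothetical helper name.

```python
from fnmatch import fnmatchcase


def is_excluded(url: str, exclude_patterns: list[str]) -> bool:
    """True if the URL matches any glob pattern (fnmatch semantics assumed)."""
    return any(fnmatchcase(url, pattern) for pattern in exclude_patterns)


patterns = ["*/login*", "*/admin/*"]
urls = [
    "https://example.com/",
    "https://example.com/login?next=%2F",
    "https://example.com/admin/users",
    "https://example.com/blog/post-1",
]

# Keep only the URLs that no pattern excludes.
kept = [u for u in urls if not is_excluded(u, patterns)]
```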

timeout_per_page (string)

Timeout per page in milliseconds. Default: 10000 (10 seconds).
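
When choosing how long to wait for a job, a rough upper bound can be derived from max_pages and timeout_per_page. The sketch below assumes pages are fetched one at a time and that the per-page timeout is the only delay. Neither is stated on this page, so treat the number as an estimate only.

```python
def worst_case_seconds(max_pages: int = 5, timeout_per_page_ms: int = 10_000) -> float:
    """Upper bound on crawl time if every page hits its timeout.

    Assumes sequential page fetches (not confirmed by the documentation).
    """
    return max_pages * timeout_per_page_ms / 1000


default_budget = worst_case_seconds()          # documented defaults
larger_budget = worst_case_seconds(20, 15_000)  # 20 pages, 15 s each
```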

Response

Returns a JSON object with the operation result.

{
  "success": true,
  "result": <value>
}
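
A caller will typically check success before using result. The following is a minimal sketch of that check on a raw JSON string. The ExtractError class and unwrap function are hypothetical names, and the job_id field inside result in the sample payload is an assumption based on the description above, since the shape of result is not specified here.

```python
import json
from typing import Any


class ExtractError(RuntimeError):
    """Raised when the response reports success == false."""


def unwrap(raw: str) -> Any:
    """Parse a response body and return its result, or raise on failure."""
    payload = json.loads(raw)
    if not payload.get("success"):
        # The error payload shape is not documented, so include it whole.
        raise ExtractError(f"extract_site failed: {payload}")
    return payload.get("result")


# Sample payload; the contents of "result" are assumed for illustration.
sample = '{"success": true, "result": {"job_id": "job_123"}}'
result = unwrap(sample)

try:
    unwrap('{"success": false}')
    failed = False
except ExtractError:
    failed = True
```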