Web scraping

Cortex Click provides a web scraper optimized for parsing and cleaning data for LLM consumption. It automatically converts content to markdown, strips redundant sections such as navigation headers and sidebars, and resolves images and links to fully qualified URLs. This lets the intelligent content engine insert images and citations from your website automatically.
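
To make "fully qualified" concrete, here is a minimal sketch of resolving relative markdown link and image targets against the URL of the page they came from. It is an illustration only, not Cortex Click's implementation: the function name resolveMarkdownLinks, the sample paths, and the regex-based matching are all assumptions, and the only real API used is the standard URL constructor.

```typescript
// Illustrative sketch only; not how Cortex Click does it internally.
// Rewrites relative link and image targets in markdown to absolute URLs.
function resolveMarkdownLinks(markdown: string, pageUrl: string): string {
  // Matches [text](target) and ![alt](target). Link titles are not handled.
  const linkPattern = /(!?\[[^\]]*\]\()([^)\s]+)(\))/g;
  return markdown.replace(
    linkPattern,
    (match: string, open: string, target: string, close: string) => {
      try {
        // The URL constructor resolves `target` relative to `pageUrl`.
        return open + new URL(target, pageUrl).href + close;
      } catch {
        return match; // leave targets that cannot be parsed untouched
      }
    },
  );
}

// Hypothetical page and paths, used only for demonstration.
const page = "https://www.cortexclick.com/docs/intro";
const input = "See [pricing](/pricing) and ![logo](../img/logo.png).";
console.log(resolveMarkdownLinks(input, page));
```

Targets that are already absolute pass through unchanged, so running a step like this twice is harmless.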

Scraping individual URLs

Upsert one or more URLs for web scraping. The upsert call returns immediately with a 202 Accepted response; scraping and indexing happen asynchronously.

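// `catalog` is assumed to be a catalog handle obtained earlier via the SDK.
const docs: UrlDocument[] = [
  {
    url: "https://www.cortexclick.com/",
    contentType: "url", // marks this document as a single page to scrape
  },
];

// Returns once the request is accepted (202); scraping continues in the background.
await catalog.upsertDocuments(docs);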
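
When the URLs come from user input or a spreadsheet, it can help to validate and de-duplicate them before upserting. The sketch below is one way to do that. The helper toUrlDocuments is hypothetical, and the local UrlDocument type is only a stand-in that mirrors the shape shown above; in real code you would use the type exported by the SDK.

```typescript
// Local stand-in for the SDK's UrlDocument type (assumption: same shape as above).
type UrlDocument = { url: string; contentType: "url" };

// Hypothetical helper: keeps only valid, unique http(s) URLs.
function toUrlDocuments(urls: string[]): UrlDocument[] {
  const seen = new Set<string>();
  const result: UrlDocument[] = [];
  for (const raw of urls) {
    let parsed: URL;
    try {
      parsed = new URL(raw.trim());
    } catch {
      continue; // skip strings that are not absolute URLs
    }
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") continue;
    if (seen.has(parsed.href)) continue; // skip duplicates after normalization
    seen.add(parsed.href);
    result.push({ url: parsed.href, contentType: "url" });
  }
  return result;
}

const docs = toUrlDocuments([
  "https://www.cortexclick.com/",
  "https://www.cortexclick.com", // same page once normalized
  "not a url",
]);
console.log(docs);
// Then, as in the example above: await catalog.upsertDocuments(docs);
```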

Scraping sitemaps

Upsert one or more sitemap documents to scrape and index an entire website. Sitemaps and sitemap indexes are traversed recursively. The upsert call returns immediately with a 202 Accepted response; scraping and indexing happen asynchronously.

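// `catalog` is assumed to be a catalog handle obtained earlier via the SDK.
const docs: SitemapDocument[] = [
  {
    sitemapUrl: "https://www.cortexclick.com/sitemap.xml",
    contentType: "sitemap-url", // marks this document as a sitemap to traverse
  },
];

// Returns once the request is accepted (202); traversal and scraping continue in the background.
await catalog.upsertDocuments(docs);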
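
To illustrate what recursive traversal of sitemaps and sitemap indexes means, here is a rough sketch. It is not Cortex Click's implementation: collectPageUrls, the FetchXml type, and the example.com data are all hypothetical, and a regular expression stands in for a real XML parser. A sitemap index lists other sitemaps, so the function follows each child until it reaches plain sitemaps that list pages.

```typescript
// Fetcher abstraction so the sketch runs without network access.
type FetchXml = (url: string) => Promise<string>;

// Recursively collects page URLs from a sitemap or a sitemap index.
async function collectPageUrls(
  sitemapUrl: string,
  fetchXml: FetchXml,
  visited: Set<string> = new Set<string>(),
): Promise<string[]> {
  if (visited.has(sitemapUrl)) return []; // guard against cycles
  visited.add(sitemapUrl);

  const xml = await fetchXml(sitemapUrl);

  // Simplification: extract <loc> values with a regex instead of an XML parser.
  const locs: string[] = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let m: RegExpExecArray | null;
  while ((m = locPattern.exec(xml)) !== null) locs.push(m[1]);

  // A plain <urlset> lists pages directly.
  if (!/<sitemapindex[\s>]/.test(xml)) return locs;

  // A <sitemapindex> lists child sitemaps; descend into each one.
  const pages: string[] = [];
  for (const child of locs) {
    pages.push(...(await collectPageUrls(child, fetchXml, visited)));
  }
  return pages;
}

// In-memory stand-in for a website (hypothetical data).
const fakeSite: Record<string, string> = {
  "https://example.com/sitemap.xml":
    "<sitemapindex><sitemap><loc>https://example.com/blog.xml</loc></sitemap></sitemapindex>",
  "https://example.com/blog.xml":
    "<urlset><url><loc>https://example.com/blog/a</loc></url></urlset>",
};
const fetchFake: FetchXml = async (url) => fakeSite[url] ?? "";

collectPageUrls("https://example.com/sitemap.xml", fetchFake).then((pages) =>
  console.log(pages),
);
```

The visited set keeps a sitemap index that references itself, directly or through a child, from causing an endless loop.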

Scheduled runs

From the catalog page, you can create an indexer to automatically update your content. Indexers can run daily, weekly, or monthly depending on your needs.


SDK support for scheduled runs is coming soon.