Indexing documents

Cortex Click provides document indexing APIs and SDKs that simplify ingesting your knowledge base. These tools clean and chunk the input data for optimal consumption by the intelligent content engine, store the raw input while also vectorizing it to enable real-time search, and keep the data up to date as changes occur.

Documents are identified by a documentId property and can optionally include url and imageUrl metadata, which are used to insert URL citations (cross-linking) and images into generated content and answers. Content with cross-linking, citations, and images tends to perform better, so include these fields wherever possible.

To request support for additional data formats and sources, please open a GitHub issue or contact us.

Upserting documents directly

Batches of documents can be submitted directly via the batch upsert endpoint. The endpoint supports the following formats:

  • text (.txt)
  • markdown (.md and .mdx)
  • JSON
  • .docx files
  • web scraping URLs and sitemaps

Uploading markdown documents into a catalog:

const catalog = await client.getCatalog("github-markdown");
 
const docs: TextDocument[] = [
  {
    documentId: "1",
    contentType: "markdown",
    content: "# some markdown",
    url: "https://foo.com",
    imageUrl: "https://foo.com/image.jpg",
  },
  {
    documentId: "2",
    contentType: "markdown",
    content: "# some more markdown",
    url: "https://foo.com/2",
    imageUrl: "https://foo.com/image2.jpg",
  },
];
 
await catalog.upsertDocuments(docs);
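
The other upsert formats follow the same document shape. As a minimal sketch, assuming the plain-text content type string is "text" (check the batch upsert documentation for the exact contentType values), a .txt document could be upserted like this:

const catalog = await client.getCatalog("support-articles");

// the catalog name and the "text" contentType here are illustrative assumptions
const textDoc: TextDocument = {
  documentId: "returns-faq",
  contentType: "text",
  content: "Our return window is 30 days from the date of delivery.",
  url: "https://acme.com/support/returns",
};

await catalog.upsertDocuments([textDoc]);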

Scraping an entire website via the sitemap and uploading it to your catalog:

// the catalog name is a placeholder
const catalog = await client.getCatalog("acme-website");
 
const sitemap: SitemapDocument = {
  sitemapUrl: "https://acme.com/sitemap.xml",
  contentType: "sitemap-url",
};
 
await catalog.upsertDocuments([sitemap]);

See the batch upsert documentation for more details.

SDK indexers for common data types

The SDK also includes several client-side indexers that wrap the batch upsert endpoint to conveniently ingest common data sources in bulk. There is a fully-managed version of the web scraping indexer for scheduled updates, and more sources are actively in development.

A directory indexer that ingests a GitHub repo:

const catalog = await client.getCatalog("github-docs");
const rootDir = path.join(process.env.GITHUB_DOCS_ROOT_DIR, "content");
 
const gitHubDocsIndexer = new DirectoryIndexer(catalog, {
  rootDir,
  urlBase: "https://www.acme.com",
  // an optional function that maps directory structure to URLs on a website
  getUrl,
  // set document ID to URL
  getId: getUrl,
  // only include markdown
  includeFile(filePath) {
    return filePath.endsWith(".md");
  },
});
 
await gitHubDocsIndexer.index();
 
// getUrl specifies how to map documents on disk to public URLs
const getUrl = (docsPathList: string[], sitePathList: string[]) => {
  const fileName = sitePathList.pop();
  if (fileName === "_index.md") {
    return sitePathList.join("/");
  }
 
  return [...sitePathList, fileName].join("/").slice(0, -3);
};
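
For example, given a sitePathList of ["docs", "getting-started.md"], getUrl returns "docs/getting-started" (slice(0, -3) strips the ".md" extension), while a section index such as ["docs", "_index.md"] resolves to just "docs"; the indexer can then combine these paths with urlBase to produce the public URLs used for cross-linking.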

Indexing a Shopify product catalog:

const catalog = await client.getCatalog("shopify-products");
 
const shopifyProductIndexer = catalog.shopifyIndexer({
  shopifyBaseUrl: "https://shopify-store-url.com",
});
 
await shopifyProductIndexer.index();