Indexing documents
Cortex Click provides document indexing APIs and SDKs that simplify ingesting your knowledge base. These tools take care of cleaning the input data and chunking it for optimal consumption by the intelligent content engine. They store the raw input while also vectorizing it to enable real-time search capabilities. Additionally, these tools ensure that the data remains up to date as changes occur.
Documents are identified by a documentId property and can optionally include imageUrl and url metadata, which is used to insert images and URL citations (cross-linking) into generated content and answers.
Content with cross-linking, citations, and images tends to be higher performing, so include these fields wherever possible.
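As a rough illustration of the document shape described above, the following TypeScript sketch models a document and reports which optional metadata fields are still missing. Only documentId, imageUrl, and url are named on this page; the content field, the IndexedDocument type name, and the missingCitationFields helper are assumptions made for illustration and are not the SDK's actual types.

```typescript
// Hypothetical document shape. Only documentId, imageUrl, and url are
// named in this page; everything else here is an assumption.
interface IndexedDocument {
  documentId: string; // unique identifier for the document
  content: string; // assumed field holding the raw text
  url?: string; // used for URL citations (cross-linking)
  imageUrl?: string; // used to insert images into generated content
}

// Returns the names of optional metadata fields that are missing or blank,
// so they can be filled in before the document is indexed.
function missingCitationFields(doc: IndexedDocument): string[] {
  const missing: string[] = [];
  if (!doc.url || doc.url.trim() === "") missing.push("url");
  if (!doc.imageUrl || doc.imageUrl.trim() === "") missing.push("imageUrl");
  return missing;
}

const example: IndexedDocument = {
  documentId: "getting-started",
  content: "# Getting started\nInstall the SDK and create a catalog.",
  url: "https://example.com/docs/getting-started",
};

// Lists the optional fields that are still worth adding to this document.
console.log(missingCitationFields(example));
```

A check like this could run before indexing to flag documents that would produce content without citations or images.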
To request support for additional data formats and sources, please open a GitHub issue or contact us.
Upserting documents directly
Batches of documents can be submitted directly via the batch upsert endpoint. It includes support for the following formats:
- text (.txt)
- markdown (.md and .mdx)
- JSON
- .docx files
- web scraping of URLs and sitemaps
Uploading markdown documents into a catalog:
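The original sample for this step is not reproduced here. As a stand-in, here is a minimal sketch of how markdown files might be prepared for a batch upsert: each file becomes a document, and the documents are split into fixed-size batches. The MarkdownDocument shape, the contentType value, the toMarkdownDocument and toBatches helpers, and the batch size are all assumptions for illustration; the real field names and limits are in the batch upsert documentation.

```typescript
// Hypothetical shape for a markdown document. Fields other than
// documentId and url are assumptions, not confirmed API names.
interface MarkdownDocument {
  documentId: string;
  contentType: "markdown";
  content: string;
  url?: string;
}

// Builds a document from a relative file path and its markdown content.
// The path without its .md/.mdx extension is used as the documentId and
// appended to baseUrl to form the citation URL.
function toMarkdownDocument(
  path: string,
  content: string,
  baseUrl: string,
): MarkdownDocument {
  const slug = path.replace(/\.mdx?$/, "");
  return {
    documentId: slug,
    contentType: "markdown",
    content,
    url: `${baseUrl}/${slug}`,
  };
}

// Splits a list into consecutive batches of at most batchSize items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

const markdownDocs = [
  toMarkdownDocument("guides/install.md", "# Install", "https://example.com/docs"),
  toMarkdownDocument("guides/usage.mdx", "# Usage", "https://example.com/docs"),
  toMarkdownDocument("faq.md", "# FAQ", "https://example.com/docs"),
];

// Each batch would then be submitted to the batch upsert endpoint.
const markdownBatches = toBatches(markdownDocs, 2);
console.log(markdownBatches.map((batch) => batch.length));
```

Deriving documentId from the file path keeps identifiers stable across runs, so re-submitting a changed file would update the existing document rather than create a duplicate, assuming upserts are keyed on documentId.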
Scraping an entire website via the sitemap and uploading it to your catalog:
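The original sample for this step is not reproduced here either. The sketch below shows one way the request for a sitemap scrape might be assembled and validated on the client before submission. The SitemapDocument shape, the "sitemap-url" content type value, the sitemapUrl field name, and the sitemapDocument helper are assumptions for illustration; confirm the actual names against the batch upsert documentation.

```typescript
// Hypothetical request shape for sitemap scraping. The contentType value
// and the sitemapUrl field name are assumptions, not confirmed API names.
interface SitemapDocument {
  contentType: "sitemap-url";
  sitemapUrl: string;
}

// Accepts either a full sitemap URL or any other page URL on the site.
// If the path does not end in .xml, it is replaced with the conventional
// /sitemap.xml location at the site root.
function sitemapDocument(siteOrSitemapUrl: string): SitemapDocument {
  const parsed = new URL(siteOrSitemapUrl); // throws on malformed input
  if (parsed.protocol !== "https:" && parsed.protocol !== "http:") {
    throw new Error(`unsupported protocol: ${parsed.protocol}`);
  }
  if (!parsed.pathname.endsWith(".xml")) {
    parsed.pathname = "/sitemap.xml";
  }
  return { contentType: "sitemap-url", sitemapUrl: parsed.toString() };
}

// A single entry like this would stand in for the whole site in the batch.
console.log(sitemapDocument("https://example.com"));
```

Validating the URL up front surfaces typos locally instead of as a failed scrape later.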
See the batch upsert documentation for more details.
SDK indexers for common data types
The SDK also includes several client-side indexers that wrap the batch upsert endpoint to conveniently ingest common data sources in bulk. There is a fully-managed version of the web scraping indexer for scheduled updates, and more sources are actively in development.
- Directory indexer for uploading entire directories of files, including common sources like GitHub repos and Google Drive folders
- JSON indexer for uploading arrays of JSON data in bulk
- TSV indexer for turning rows in a spreadsheet into indexed documents
- Shopify indexer to index entire product catalogs from a Shopify store URL
A directory indexer that ingests a GitHub repo:
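The SDK sample for this step is not reproduced here. To show the general idea without guessing at the SDK's class names, the sketch below covers only the mapping step a directory indexer might perform: given files already read from a cloned repo, it picks a content type from the file extension, skips unsupported files, and links each document back to its source. The FileEntry and DirectoryDocument types, the content type labels, and the toDocuments helper are assumptions for illustration; reading files from disk and handling binary formats such as .docx are left out.

```typescript
// Assumed content type labels for the text formats listed on this page.
type ContentType = "text" | "markdown" | "json";

// A file that has already been read from a cloned repository.
interface FileEntry {
  path: string; // path relative to the repo root, using "/" separators
  content: string;
}

// Hypothetical document shape produced by the mapping step.
interface DirectoryDocument {
  documentId: string;
  contentType: ContentType;
  content: string;
  url: string;
}

const CONTENT_TYPES: Record<string, ContentType> = {
  ".txt": "text",
  ".md": "markdown",
  ".mdx": "markdown",
  ".json": "json",
};

// Returns the lowercase extension of the file name, or "" for names with
// no extension and for dotfiles such as ".gitignore".
function extensionOf(path: string): string {
  const name = path.slice(path.lastIndexOf("/") + 1);
  const dot = name.lastIndexOf(".");
  return dot <= 0 ? "" : name.slice(dot).toLowerCase();
}

// Maps supported files to documents and silently skips everything else.
// baseUrl is the web address that file paths are appended to, for example
// "https://github.com/example-org/example-repo/blob/main".
function toDocuments(files: FileEntry[], baseUrl: string): DirectoryDocument[] {
  const docs: DirectoryDocument[] = [];
  for (const file of files) {
    const contentType = CONTENT_TYPES[extensionOf(file.path)];
    if (!contentType) continue; // unsupported extension
    docs.push({
      documentId: file.path,
      contentType,
      content: file.content,
      url: `${baseUrl}/${file.path}`,
    });
  }
  return docs;
}

const repoFiles: FileEntry[] = [
  { path: "README.md", content: "# Example repo" },
  { path: "src/index.ts", content: "export {};" },
  { path: "docs/guide.mdx", content: "# Guide" },
  { path: ".gitignore", content: "node_modules" },
];

const repoDocs = toDocuments(
  repoFiles,
  "https://github.com/example-org/example-repo/blob/main",
);
console.log(repoDocs.map((doc) => doc.documentId));
```

Setting url to the file's location in the repository gives generated content something to cite, in line with the recommendation to include url wherever possible.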
Indexing a Shopify product catalog:
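The SDK sample for this step is not reproduced here. As a loose sketch of what indexing a product catalog involves, the code below turns product records into documents with url and imageUrl metadata. The Product and ProductDocument shapes and the productToDocument helper are assumptions for illustration and do not reflect the Shopify indexer's actual interface; the only detail taken from Shopify itself is that storefront product pages live under /products/ followed by the product handle.

```typescript
// Hypothetical, simplified product record. A real catalog has more fields.
interface Product {
  handle: string; // URL-safe product identifier
  title: string;
  description: string;
  imageUrl?: string;
}

// Hypothetical document shape; content is an assumed field name.
interface ProductDocument {
  documentId: string;
  content: string;
  url: string;
  imageUrl?: string;
}

// Converts one product into a document whose url points at the storefront
// product page and whose imageUrl, when present, is carried through.
function productToDocument(storeUrl: string, product: Product): ProductDocument {
  const base = storeUrl.replace(/\/+$/, ""); // drop trailing slashes
  const doc: ProductDocument = {
    documentId: `product-${product.handle}`,
    content: `# ${product.title}\n\n${product.description}`,
    url: `${base}/products/${product.handle}`,
  };
  if (product.imageUrl) doc.imageUrl = product.imageUrl;
  return doc;
}

const catalog: Product[] = [
  {
    handle: "blue-mug",
    title: "Blue Mug",
    description: "A ceramic mug.",
    imageUrl: "https://shop.example.com/images/blue-mug.png",
  },
  { handle: "red-mug", title: "Red Mug", description: "Another ceramic mug." },
];

const productDocs = catalog.map((p) =>
  productToDocument("https://shop.example.com/", p),
);
console.log(productDocs.map((doc) => doc.url));
```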