
Ingestion settings

Ingestion settings control how documents are parsed, chunked, and enriched during ingestion. They’re used by the ingestion service when processing documents from document sources.

At runtime, the ingestion pipeline typically resolves settings in this order:

  • source.ingestionSettings (if a document source has explicit settings)
  • otherwise a default ingestion settings record (if one exists)
  • otherwise a built-in fallback (for some file types like spreadsheets)
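The precedence above can be sketched as a small lookup. This is an illustration only, not the ingestion service's code: the function name `resolve_settings`, the `BUILTIN_FALLBACKS` table, and its contents are hypothetical, and the real built-in fallbacks are not documented here.

```python
from typing import Optional

# Hypothetical built-in fallbacks keyed by file type (assumption for illustration).
BUILTIN_FALLBACKS = {"xlsx": {"name": "builtin-spreadsheet"}}


def resolve_settings(source: dict, default_record: Optional[dict], file_type: str) -> Optional[dict]:
    """Return the first available settings object, in precedence order."""
    # 1. Explicit settings attached to the document source.
    explicit = source.get("ingestionSettings")
    if explicit is not None:
        return explicit
    # 2. The default ingestion settings record, if one exists.
    if default_record is not None:
        return default_record
    # 3. A built-in fallback, available only for some file types.
    return BUILTIN_FALLBACKS.get(file_type)
```

A source with its own `ingestionSettings` always wins; a file type with no fallback and no default record resolves to `None` in this sketch.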

API endpoints

All endpoints require Authorization: Bearer <ACCESS_TOKEN>.

  • GET /api/ingestion/settings — list settings
  • POST /api/ingestion/settings — create settings
  • HEAD /api/ingestion/settings/{id} — check existence (200 if exists, 204 if not)
  • GET /api/ingestion/settings/{id} — fetch by id
  • PUT /api/ingestion/settings/{id} — update
  • DELETE /api/ingestion/settings/{id} — delete (204)
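As a minimal sketch of calling these endpoints, the snippet below builds authenticated requests with Python's standard library. The host in `BASE_URL` and the helper name `build_request` are placeholders, and sending JSON bodies with a `Content-Type: application/json` header is an assumption rather than something this page states.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://example.invalid"  # placeholder host; replace with your deployment


def build_request(method: str, path: str, token: str,
                  body: Optional[dict] = None) -> urllib.request.Request:
    """Build a request carrying the required Bearer token."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Authorization": f"Bearer {token}"}
    if data is not None:
        # Assumption: the API accepts JSON request bodies.
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)


# List existing settings, then prepare a create call.
list_req = build_request("GET", "/api/ingestion/settings", "ACCESS_TOKEN")
create_req = build_request("POST", "/api/ingestion/settings", "ACCESS_TOKEN",
                           {"name": "example"})
```

Passing a request to `urllib.request.urlopen` sends it; the sketch stops short of that so it can be read without a live deployment.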

What’s inside an ingestion settings object

Ingestion settings are a single object that includes:

  • LLM provider + embedding model (used for embedding / enrichment)
  • parser (how raw content is extracted)
  • splitter (how text is chunked)
  • extractors (optional enrichment stages)

The concrete type of each sub-object is selected by its discriminator field, _t.
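To make the discriminator concrete, here is a small sketch that collects the `_t` value from each typed sub-object of a settings payload. The helper name `discriminators` is hypothetical, and the sample values are taken from the create example on this page.

```python
def discriminators(settings: dict) -> dict:
    """Collect the `_t` value of each typed sub-object in a settings payload."""
    found = {}
    for key in ("provider", "parser", "splitter"):
        sub = settings.get(key)
        if isinstance(sub, dict):
            found[key] = sub.get("_t")
    # Extractors are a list, so each entry carries its own `_t`.
    found["extractors"] = [e.get("_t") for e in settings.get("extractors", [])]
    return found


example = {
    "provider": {"_t": "azure-openai"},
    "parser": {"_t": "azure-document-intelligence"},
    "splitter": {"_t": "sentence", "chunkSize": 2048, "chunkOverlap": 128},
    "extractors": [{"_t": "keyword", "keywords": 10}],
}
print(discriminators(example))
```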

Parser providers (parser._t)

Common parser providers include:

  • azure-document-intelligence
  • llama-parse
  • default (if enabled in your deployment)

Splitter types (splitter._t)

Common splitter types include:

  • sentence (chunkSize/chunkOverlap)
  • token (chunkSize/chunkOverlap/separator)
  • semantic (bufferSize/breakpointPercentileThreshold)
  • markdown, markdown-table, html
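The parameter names listed above can be turned into a client-side sanity check before a payload is sent. This is a sketch under assumptions: `SPLITTER_PARAMS` and `unknown_splitter_params` are hypothetical names, the table only covers the parameters named on this page, and the service itself may accept additional fields.

```python
# Parameters named on this page for each splitter type.
SPLITTER_PARAMS = {
    "sentence": {"chunkSize", "chunkOverlap"},
    "token": {"chunkSize", "chunkOverlap", "separator"},
    "semantic": {"bufferSize", "breakpointPercentileThreshold"},
}


def unknown_splitter_params(splitter: dict) -> set:
    """Return keys that are not listed for the splitter's `_t` type."""
    allowed = SPLITTER_PARAMS.get(splitter.get("_t"))
    if allowed is None:
        # No parameters are listed for this type (e.g. markdown, html).
        return set()
    return set(splitter) - allowed - {"_t"}
```

For instance, passing `chunkSize` to a `semantic` splitter would be flagged, since that type is listed with `bufferSize` and `breakpointPercentileThreshold` instead.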

Extractor types (extractors[]._t)

Optional extractors can enrich the ingestion output:

  • keyword
  • metadata
  • summary
  • questions-answered
  • title

Create example

This example mirrors the ingestion-service integration tests.

{
  "name": "pytest-test-settings",
  "default": false,
  "provider": {
    "_t": "azure-openai",
    "endpoint": "https://YOUR-RESOURCE.openai.azure.com",
    "apiKey": "YOUR_AZURE_OPENAI_API_KEY",
    "apiVersion": "2024-02-01",
    "deployment": "gpt-4",
    "embeddingDeployment": "text-embedding-3-large"
  },
  "embeddingModel": "text-embedding-3-large",
  "embeddingDimensions": 3072,
  "parser": {
    "_t": "azure-document-intelligence",
    "endpoint": "https://YOUR-DI.cognitiveservices.azure.com",
    "apiKey": "YOUR_AZURE_DOCUMENT_INTELLIGENCE_KEY",
    "model": "prebuilt-layout",
    "locale": "en",
    "format": "markdown"
  },
  "splitter": {
    "_t": "sentence",
    "chunkSize": 2048,
    "chunkOverlap": 128
  },
  "extractors": [
    {
      "_t": "keyword",
      "keywords": 10
    }
  ],
  "excludeMetadata": [],
  "extractMetadata": []
}