Ingestion settings

Ingestion settings control how documents are parsed, chunked, and enriched during ingestion. They’re used by the ingestion service when processing documents from document sources.

At runtime, the ingestion pipeline typically resolves settings in this order, using the first that applies:

source.ingestionSettings (if a document source has explicit settings)
otherwise a default ingestion settings record (if one exists)
otherwise a built-in fallback (for some file types like spreadsheets)
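The resolution order above can be sketched as a small function. This is only an illustration of the precedence, not the ingestion service's actual code: the function name, the dict-based shapes, and the contents of `BUILTIN_FALLBACKS` are assumptions made for the example.

```python
from typing import Any, Optional

# Hypothetical fallback table. The documentation only says that some file
# types (such as spreadsheets) have a built-in fallback; the entry below is
# a placeholder, not the real fallback configuration.
BUILTIN_FALLBACKS: dict[str, dict[str, Any]] = {
    "spreadsheet": {"name": "builtin-spreadsheet"},
}


def resolve_ingestion_settings(
    source: dict[str, Any],
    default_settings: Optional[dict[str, Any]],
    file_type: str,
) -> Optional[dict[str, Any]]:
    """Return the first settings object that applies, or None."""
    # 1. Explicit settings on the document source win.
    explicit = source.get("ingestionSettings")
    if explicit is not None:
        return explicit
    # 2. Otherwise use the default settings record, if one exists.
    if default_settings is not None:
        return default_settings
    # 3. Otherwise use a built-in fallback, available only for some file types.
    return BUILTIN_FALLBACKS.get(file_type)
```

If none of the three applies, this sketch returns `None`; how the real pipeline behaves in that case is not stated here.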

API endpoints

All endpoints require Authorization: Bearer <ACCESS_TOKEN>.

GET /api/ingestion/settings — list settings
POST /api/ingestion/settings — create settings
HEAD /api/ingestion/settings/{id} — check existence (200 if exists, 204 if not)
GET /api/ingestion/settings/{id} — fetch by id
PUT /api/ingestion/settings/{id} — update
DELETE /api/ingestion/settings/{id} — delete (204)
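As a rough illustration of calling these endpoints, the sketch below builds an authenticated request with Python's standard library and uses the HEAD endpoint to check whether a settings record exists. `BASE_URL` is a placeholder host, and the helper names and the injectable `opener` parameter are assumptions for the example; only the paths, the bearer header, and the 200/204 status codes come from the list above.

```python
import urllib.request

# Placeholder host; replace with your deployment's base URL.
BASE_URL = "https://ingestion.example.invalid"


def build_request(method: str, path: str, token: str) -> urllib.request.Request:
    """Build a request carrying the required bearer token."""
    return urllib.request.Request(
        BASE_URL + path,
        method=method,
        headers={"Authorization": f"Bearer {token}"},
    )


def settings_exists(settings_id: str, token: str, opener=urllib.request.urlopen) -> bool:
    """Check existence via HEAD: 200 means the record exists, 204 means it does not."""
    request = build_request("HEAD", f"/api/ingestion/settings/{settings_id}", token)
    with opener(request) as response:
        return response.status == 200
```

Passing `opener` explicitly makes the helper easy to exercise without a network connection; in normal use the default `urllib.request.urlopen` is used.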

What’s inside an ingestion settings object

An ingestion settings record is a single object that includes:

LLM provider + embedding model (used for embedding / enrichment)
parser (how raw content is extracted)
splitter (how text is chunked)
extractors (optional enrichment stages)

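Each sub-object selects its type through the discriminator field `_t`. For example, `parser._t` selects the parser provider and `splitter._t` selects the splitter type.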
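To make the discriminator idea concrete, the following sketch reads the `_t` value from each sub-object of a settings dict. The function name is hypothetical, and treating missing sub-objects as `None` or an empty list is an assumption of the example rather than documented service behavior.

```python
from typing import Any


def discriminators(settings: dict[str, Any]) -> dict[str, Any]:
    """Collect the `_t` discriminator of each typed sub-object."""
    return {
        "provider": settings.get("provider", {}).get("_t"),
        "parser": settings.get("parser", {}).get("_t"),
        "splitter": settings.get("splitter", {}).get("_t"),
        # extractors is a list, so each entry carries its own `_t`.
        "extractors": [e.get("_t") for e in settings.get("extractors", [])],
    }
```

A summary like this can be handy when logging which parser, splitter, and extractors a given settings record will use.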

Parser providers (`parser._t`)

Common parser providers include:

azure-document-intelligence
llama-parse
default (if enabled in your deployment)

Splitter types (`splitter._t`)

Common splitter types include:

sentence (chunkSize/chunkOverlap)
token (chunkSize/chunkOverlap/separator)
semantic (bufferSize/breakpointPercentileThreshold)
markdown, markdown-table, html
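A small client-side check can catch a splitter configured with parameters that belong to a different type. The sketch below encodes only the fields listed above; the list describes common splitter types and is not guaranteed to be exhaustive, so an "unlisted" field is a hint to double-check, not proof of an error. The table and function names are hypothetical.

```python
from typing import Any

# Fields per splitter type, taken from the list above. No fields are listed
# for markdown, markdown-table, or html, so their sets are left empty.
SPLITTER_FIELDS: dict[str, set[str]] = {
    "sentence": {"chunkSize", "chunkOverlap"},
    "token": {"chunkSize", "chunkOverlap", "separator"},
    "semantic": {"bufferSize", "breakpointPercentileThreshold"},
    "markdown": set(),
    "markdown-table": set(),
    "html": set(),
}


def unlisted_fields(splitter: dict[str, Any]) -> list[str]:
    """Return the splitter's fields that are not listed for its `_t` type."""
    kind = splitter.get("_t")
    if kind not in SPLITTER_FIELDS:
        raise ValueError(f"splitter type not in the documented list: {kind!r}")
    return sorted(set(splitter) - {"_t"} - SPLITTER_FIELDS[kind])
```

For instance, a `token` splitter that carries `bufferSize` would be flagged, since that field is listed only for `semantic`.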

Extractor types (`extractors[]._t`)

Optional extractors can enrich the ingestion output:

keyword
metadata
summary
questions-answered
title

Create example

This example mirrors the ingestion-service integration tests.

{
  "name": "pytest-test-settings",
  "default": false,
  "provider": {
    "_t": "azure-openai",
    "endpoint": "https://YOUR-RESOURCE.openai.azure.com",
    "apiKey": "YOUR_AZURE_OPENAI_API_KEY",
    "apiVersion": "2024-02-01",
    "deployment": "gpt-4",
    "embeddingDeployment": "text-embedding-3-large"
  },
  "embeddingModel": "text-embedding-3-large",
  "embeddingDimensions": 3072,
  "parser": {
    "_t": "azure-document-intelligence",
    "endpoint": "https://YOUR-DI.cognitiveservices.azure.com",
    "apiKey": "YOUR_AZURE_DOCUMENT_INTELLIGENCE_KEY",
    "model": "prebuilt-layout",
    "locale": "en",
    "format": "markdown"
  },
  "splitter": {
    "_t": "sentence",
    "chunkSize": 2048,
    "chunkOverlap": 128
  },
  "extractors": [
    {
      "_t": "keyword",
      "keywords": 10
    }
  ],
  "excludeMetadata": [],
  "extractMetadata": []
}
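One way to send a payload like the one above is sketched below using only the standard library. The helper name and the `base_url` argument are hypothetical, and the `Content-Type: application/json` header is an assumption, since this page only specifies the `Authorization` header.

```python
import json
import urllib.request
from typing import Any


def build_create_request(
    base_url: str, token: str, settings: dict[str, Any]
) -> urllib.request.Request:
    """Build the POST /api/ingestion/settings request for a settings payload."""
    body = json.dumps(settings).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/ingestion/settings",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            # Assumed: the API accepts a JSON request body.
            "Content-Type": "application/json",
        },
    )
```

The request can then be sent with `urllib.request.urlopen(request)`. The shape of the response body is not described on this page, so the sketch stops at building the request.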