Configuration | Extend

The Extract API accepts a config object that controls how documents are processed and how values are returned. Configuration options are organized into several categories:

Schema: The JSON Schema describing the fields to extract (optional).
Base processor: The model family that powers extraction (accuracy vs. speed and cost).
Extraction rules: Natural-language guidance for the model.
Advanced options: Citations, multimodal processing, large-array handling, chunking, the Review Agent, and Excel.
Parse config: How the document is parsed before extraction.

For default values and the full schema, see the Create Extract Run API reference.

Prefer a UI? Extend Studio lets you configure an extractor visually and export the config JSON.

Schema

`schema`

Type: JSON Schema object (optional)

Defines the fields to extract and their shape. The root must be an object; each property describes a field you want returned in output.value. Add array properties for repeating rows, nest object properties for grouped data, and use extend:type for typed fields like dates, currency, and signatures.

{
  "config": {
    "schema": {
      "type": "object",
      "properties": {
        "invoice_number": { "type": ["string", "null"], "description": "The invoice number." }
      }
    }
  }
}

schema is optional. If you omit it, Extend automatically infers a schema from the document before extracting. See Schema-less extraction in the overview.

For the full reference — objects, arrays, enums, and custom types — see Schema.
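As a rough illustration of the shapes described above, the Python sketch below builds a schema with a scalar field, a nested object, and an array of repeating rows, then checks the root-object requirement before wrapping it in a request body. The field names (vendor, line_items, and so on) and the helper check_schema_root are hypothetical, and the check on properties is an assumption. Only the root-must-be-an-object rule and the nullable type style come from this page. The Schema reference remains the authority on which keywords are accepted.

```python
import json

# Hypothetical invoice schema: the field names are illustrative, not part of the API.
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": ["string", "null"], "description": "The invoice number."},
        # Nested object property for grouped data.
        "vendor": {
            "type": "object",
            "properties": {
                "name": {"type": ["string", "null"]},
                "address": {"type": ["string", "null"]},
            },
        },
        # Array property for repeating rows.
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": ["string", "null"]},
                    "total": {"type": ["number", "null"]},
                },
            },
        },
    },
}


def check_schema_root(candidate: dict) -> None:
    """Raise ValueError unless the root is an object with a properties mapping.

    The properties check is an assumption made for this sketch.
    """
    if candidate.get("type") != "object":
        raise ValueError("schema root must have type 'object'")
    if not isinstance(candidate.get("properties"), dict):
        raise ValueError("schema root must define 'properties'")


check_schema_root(schema)
request_body = {"config": {"schema": schema}}
print(json.dumps(request_body, indent=2))
```

Running a local check like this before sending a request can catch a misplaced root type early, though it does not replace server-side validation.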

Base processor

`baseProcessor`

Type: "extraction_performance" | "extraction_light" (default: "extraction_performance")

Selects the model family that powers extraction.

Processor	When to use
`extraction_performance`	Best for complex documents, high accuracy requirements, and multimodal content. Offers higher accuracy on complex layouts, better handling of handwritten content, and more sophisticated reasoning, and parses documents as Markdown for better performance. The default.
`extraction_light`	Best for high-volume processing, cost-sensitive applications, and simple document types. Offers faster processing and lower cost per run, with good accuracy for straightforward extractions, but does not support advanced visual features (figure parsing, signature detection, page rotation).

{
  "config": {
    "baseProcessor": "extraction_performance"
  }
}

`baseVersion`

Type: string

Pins the run to a specific version of the selected processor. If omitted, the latest stable version is used. See the Extraction Performance versions page for the changelog.

{
  "config": {
    "baseProcessor": "extraction_performance",
    "baseVersion": "4.6.0"
  }
}
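If you assemble configs in code, a small builder can keep the processor choice and the optional version pin in one place. The sketch below is one possible approach in Python. The helper build_processor_config is hypothetical and not part of any Extend SDK. It only encodes the two processor names and the omit-to-get-latest behavior described above.

```python
from typing import Optional

# Processor names as listed on this page.
ALLOWED_PROCESSORS = {"extraction_performance", "extraction_light"}


def build_processor_config(base_processor: str = "extraction_performance",
                           base_version: Optional[str] = None) -> dict:
    """Return a request fragment; baseVersion is included only when pinned."""
    if base_processor not in ALLOWED_PROCESSORS:
        raise ValueError(f"unknown baseProcessor: {base_processor!r}")
    config = {"baseProcessor": base_processor}
    if base_version is not None:
        # Leaving baseVersion out means the latest stable version is used.
        config["baseVersion"] = base_version
    return {"config": config}


pinned = build_processor_config("extraction_performance", "4.6.0")
floating = build_processor_config("extraction_light")
```

Pinning a version trades automatic improvements for reproducible results, which is usually worth it once an extractor has been evaluated.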

Extraction rules

`extractionRules`

Type: string

Plain-language rules that steer the model — useful for disambiguating fields, setting formats, or encoding business logic. Applied across the whole extraction.

When schema is omitted, extractionRules also serves as schema generation instructions. One sentence describing the document type and key fields produces a more targeted inferred schema.

{
  "config": {
    "extractionRules": "If multiple totals appear, use the grand total. Return all dates in ISO 8601 format."
  }
}

Advanced options

Citations

`advancedOptions.citationsEnabled`

Type: boolean

Returns spatial (bounding-box) references and source text for each extracted value. Useful for highlighting and validation in review interfaces, but adds processing overhead. See Citations for the response shape.

Generating citations uses an additional citation-focused model, which moderately increases latency. Disable it in latency-critical pipelines that don’t need spatial references.

`advancedOptions.citationMode`

Type: "line" | "word" | "block" (default: "line")

Controls the granularity of each citation. Requires citationsEnabled: true and a base processor version that supports bounding-box citations.

line — returns one or more relevant OCR lines per citation (default).
word — narrows to the relevant OCR word span when possible. Useful for precise citations from a table cell to an array property (e.g. line_items.total).
block — returns block-level polygons (paragraphs, key-value regions, tables). Highest recall, lowest granularity.

`advancedOptions.arrayCitationStrategy`

Type: "item" | "property"

Sets the granularity of citations on array fields. Requires citationsEnabled: true; property-level citations also require extraction_performance ≥ 4.4.0.

{
  "config": {
    "advancedOptions": {
      "citationsEnabled": true,
      "citationMode": "line",
      "arrayCitationStrategy": "property"
    }
  }
}
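The dependencies between the citation options can be checked locally before a run is submitted. The Python sketch below is one reading of the requirements stated in this section, not a validator provided by Extend. The helper names are hypothetical. It assumes citationsEnabled defaults to false, that versions are dotted integers, and that an unpinned version is recent enough.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version such as '4.6.0' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def citation_problems(config: dict) -> list:
    """Return human-readable problems with the citation options in a config dict."""
    problems = []
    advanced = config.get("advancedOptions", {})
    # Assumption: citations are off unless explicitly enabled.
    enabled = advanced.get("citationsEnabled", False)

    if "citationMode" in advanced and not enabled:
        problems.append("citationMode requires citationsEnabled: true")

    strategy = advanced.get("arrayCitationStrategy")
    if strategy is not None and not enabled:
        problems.append("arrayCitationStrategy requires citationsEnabled: true")

    if strategy == "property":
        processor = config.get("baseProcessor", "extraction_performance")
        version = config.get("baseVersion")  # None means latest stable
        if processor != "extraction_performance":
            problems.append("property-level citations require extraction_performance")
        elif version is not None and parse_version(version) < (4, 4, 0):
            problems.append("property-level citations require extraction_performance >= 4.4.0")
    return problems
```

Comparing versions as integer tuples avoids the string-comparison pitfall where "4.10.0" would sort before "4.4.0".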

Multimodal

`advancedOptions.advancedMultimodalEnabled`

Type: boolean

Uses vision-language models to better understand visual elements in the document. Essential for scanned documents, handwritten content, checks and forms, and poor-quality images. It adds latency, so disable it for clean digital PDFs, text-only documents, and latency-critical workflows where visual understanding isn’t required.

{
  "config": {
    "advancedOptions": {
      "advancedMultimodalEnabled": true
    }
  }
}

Reasoning insights

`advancedOptions.modelReasoningInsightsEnabled`

Type: boolean

Returns the model’s reasoning for each field as reasoning entries in the metadata insights array. Useful for debugging and validation during development; consider disabling it in production to reduce overhead. See Insights.

{
  "config": {
    "advancedOptions": {
      "modelReasoningInsightsEnabled": true
    }
  }
}

Review Agent

`advancedOptions.reviewAgent.enabled`

Type: boolean

When enabled, an automated agent reviews each extracted value and adds a reviewAgentScore (1–5) to the field’s metadata, plus issue and review_summary insights that flag fields needing manual review. See Review Agent.

{
  "config": {
    "advancedOptions": {
      "reviewAgent": { "enabled": true }
    }
  }
}

Current date

`advancedOptions.currentDateEnabled`

Type: boolean (default: false)

Includes the current date as context for the model during extraction.

{
  "config": {
    "advancedOptions": {
      "currentDateEnabled": true
    }
  }
}

Large arrays

`advancedOptions.arrayStrategy.type`

Type: "large_array_heuristics" | "large_array_max_context" | "large_array_overlap_context"

Controls how very large arrays (for example, hundreds of line items across many pages) are extracted and merged. Omit arrayStrategy for the default behavior; set it only for large-array use cases. If you’re unsure which to use, reach out to the Extend team.

Strategy	Latency / cost	Description
(omit)	Standard	Default. Arrays are merged using intelligent (Performance) or confidence (Light) merging.
`large_array_heuristics`	Lower	Optimized for very large arrays where latency matters, using simpler chunking and merging heuristics.
`large_array_max_context`	Higher (≈2× credits)	Multiple passes through the document for maximum accuracy.
`large_array_overlap_context`	Medium	Keeps surrounding page context for each chunk to eliminate context loss at chunk boundaries.

{
  "config": {
    "advancedOptions": {
      "arrayStrategy": { "type": "large_array_heuristics" }
    }
  }
}

Chunking and merging

Extract breaks large documents into chunks, extracts from each, and merges the results. These options tune that process.

`advancedOptions.chunkingOptions.chunkingStrategy`

Type: "standard" | "semantic"

standard — page-based chunking with heuristics (e.g. reduces chunk size for large tables). Works for most documents.
semantic — uses AI to intelligently determine whether pages can be split without breaking content relationships.

`advancedOptions.chunkingOptions.pageChunkSize`

Type: integer

The number of pages per chunk (25 by default). Larger chunks mean fewer processing calls and less overhead; smaller chunks can lower latency for large-array extraction.
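To get a feel for how pageChunkSize affects the number of processing calls, the Python sketch below splits a document into fixed-size page chunks. It is a deliberately simplified model: the function chunk_pages is hypothetical, and the real standard strategy also applies heuristics (such as shrinking chunks around large tables) that are not modeled here.

```python
def chunk_pages(total_pages: int, page_chunk_size: int = 25) -> list:
    """Split pages 1..total_pages into consecutive (start, end) chunks, inclusive."""
    if total_pages < 0 or page_chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and page_chunk_size >= 1")
    return [
        (start, min(start + page_chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, page_chunk_size)
    ]


default_chunks = chunk_pages(60)      # three chunks at the default size of 25
smaller_chunks = chunk_pages(60, 10)  # six chunks of ten pages each
```

Under this simplified model, a 60-page document yields three chunks at the default size and six chunks at a size of 10, which shows the trade-off between fewer calls and smaller units of work.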

`advancedOptions.chunkingOptions.chunkSelectionStrategy`

Type: "intelligent" | "confidence" | "take_first" | "take_last"

When the same field is found in multiple chunks, this decides which value wins.

Strategy	Speed	Description
`intelligent`	Slowest	Uses an additional LLM call to pick the most accurate value from document context.
`confidence`	Fast	Selects the value with the highest confidence score.
`take_first`	Fastest	Takes the first non-null value (earliest page). Best when authoritative values appear at the start.
`take_last`	Fastest	Takes the last non-null value (latest page). Best when authoritative values appear at the end.
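The three non-LLM strategies in the table can be pictured with a small Python sketch. This is an illustration of the described behavior, not Extend's implementation. The candidate tuple shape (page, value, confidence) and the function select_value are hypothetical, and intelligent is left out because it depends on an additional LLM call.

```python
from typing import Any, List, Optional, Tuple

# Each candidate is (page, value, confidence); this tuple shape is hypothetical.
Candidate = Tuple[int, Optional[Any], float]


def select_value(candidates: List[Candidate], strategy: str) -> Optional[Any]:
    """Pick one value for a field that was found in several chunks."""
    found = [c for c in candidates if c[1] is not None]  # ignore null values
    if not found:
        return None
    if strategy == "confidence":
        return max(found, key=lambda c: c[2])[1]  # highest confidence wins
    if strategy == "take_first":
        return min(found, key=lambda c: c[0])[1]  # earliest page wins
    if strategy == "take_last":
        return max(found, key=lambda c: c[0])[1]  # latest page wins
    raise ValueError(f"strategy not covered by this sketch: {strategy!r}")


candidates = [(2, "1,200.00", 0.71), (14, None, 0.0), (30, "1,450.00", 0.93)]
first_total = select_value(candidates, "take_first")
best_total = select_value(candidates, "confidence")
```

In this example take_first returns the page-2 value while confidence and take_last both return the page-30 value, which shows why the right choice depends on where the authoritative value tends to appear.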

`advancedOptions.chunkingOptions.customSemanticChunkingRules`

Type: string

Custom rules to guide semantic chunking.

{
  "config": {
    "advancedOptions": {
      "chunkingOptions": {
        "chunkingStrategy": "standard",
        "pageChunkSize": 25,
        "chunkSelectionStrategy": "confidence"
      }
    }
  }
}

Large tables can shrink the effective chunk size when chunking by page. To preserve context across a long table, try intelligent merging (chunkSelectionStrategy: "intelligent") and enable table header continuation in parseConfig (see Parse config).

Page ranges

`advancedOptions.pageRanges`

Type: Array<{ start: number, end: number }>

Limits extraction to specific pages. Page numbers are 1-based and inclusive; ranges can overlap or arrive out of order (the platform merges and sorts them). Use it when the relevant data is consistently on known pages of a long document — it reduces processing time and cost.

{
  "config": {
    "advancedOptions": {
      "pageRanges": [
        { "start": 1, "end": 5 }
      ]
    }
  }
}
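The platform merges and sorts ranges for you, so client-side normalization is optional. If you want to preview which pages a set of ranges covers, the Python sketch below shows one way to do it. The function normalize_page_ranges is hypothetical, and treating adjacent ranges (for example 1 to 5 and 6 to 8) as one range is an assumption of this sketch rather than documented platform behavior.

```python
def normalize_page_ranges(page_ranges: list) -> list:
    """Sort 1-based inclusive ranges and merge any that overlap or touch."""
    for r in page_ranges:
        if r["start"] < 1 or r["end"] < r["start"]:
            raise ValueError(f"invalid page range: {r}")
    merged = []
    for r in sorted(page_ranges, key=lambda r: (r["start"], r["end"])):
        # Assumption: adjacent ranges are merged as well as overlapping ones.
        if merged and r["start"] <= merged[-1]["end"] + 1:
            merged[-1]["end"] = max(merged[-1]["end"], r["end"])
        else:
            merged.append({"start": r["start"], "end": r["end"]})
    return merged


requested = [{"start": 10, "end": 12}, {"start": 1, "end": 5}, {"start": 4, "end": 8}]
normalized = normalize_page_ranges(requested)
```

Here the out-of-order, overlapping input collapses to two ranges covering pages 1 to 8 and 10 to 12, and the input list is left unmodified.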

Excel

`advancedOptions.excelSheetSelectionStrategy`

Type: "intelligent" | "all" | "first" | "last"

Chooses which sheets to extract from a workbook.

`advancedOptions.excelSheetRanges`

Type: Array<ExcelSheetRange>

Restricts extraction to specific sheet-index ranges.

{
  "config": {
    "advancedOptions": {
      "excelSheetSelectionStrategy": "intelligent"
    }
  }
}

Parse config

`parseConfig`

Type: Parse config object

Because Extract runs Parse under the hood, you can tune how the document is parsed before extraction with parseConfig. It accepts the same options as the Parse API — figure parsing, signature detection, agentic OCR, formula parsing, table formatting, and the parse engine. Reach for this when a value isn’t being read correctly (for example, enabling agentic OCR for messy scans).

{
  "config": {
    "parseConfig": {
      "blockOptions": {
        "text": { "agentic": { "enabled": true } }
      }
    }
  }
}

For every parse option, see the Parse Configuration reference.

Using a saved extractor

To reuse a configuration across runs and workflows, create an Extractor and reference it by id instead of inlining config each time. You can override specific fields per run with overrideConfig.

An extractor is a kind of processor — see that page for how saving a configuration lets you version, evaluate, and optimize it.

Create an extractor — set up a new extractor with your configuration.
Update an extractor — modify an existing extractor’s configuration.
Run an extractor — execute an extractor, optionally with extractor.overrideConfig.