Configuration
The Extract API accepts a config object that controls how documents are processed and how values are returned. Configuration options are organized into several categories:
- Schema: The JSON Schema describing the fields to extract (optional).
- Base processor: The model family that powers extraction, which trades accuracy against speed and cost.
- Extraction rules: Natural-language guidance for the model.
- Advanced options: Citations, multimodal processing, reasoning insights, the Review Agent, current date, large-array handling, chunking, page ranges, and Excel sheet selection.
- Parse config: How the document is parsed before extraction.
For default values and the full schema, see the Create Extract Run API reference.
Prefer a UI? Extend Studio lets you configure an extractor visually and export the config JSON.
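As a rough sketch of how these categories fit together, a config can be assembled as a plain dictionary and serialized to JSON. The key names come from this page; the field name, rule text, and chosen values are illustrative only, and parseConfig is left out because its keys are documented in the Parse Configuration reference.

```python
import json

# Illustrative config touching most categories described above.
# Key names are taken from this page; every value is an example only.
config = {
    "schema": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},  # hypothetical field
        },
    },
    "baseProcessor": "extraction_performance",
    "extractionRules": "Return dates in YYYY-MM-DD format.",
    "advancedOptions": {"citationsEnabled": True},
}

# Serialize to the JSON you would send or paste elsewhere.
print(json.dumps(config, indent=2))
```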
Schema
schema
Type: JSON Schema object (optional)
Defines the fields to extract and their shape. The root must be an object; each property describes a field you want returned in output.value. Add array properties for repeating rows, nest object properties for grouped data, and use extend:type for typed fields like dates, currency, and signatures.
schema is optional. If you omit it, Extend automatically infers a schema from the document before extracting. See Schema-less extraction in the overview.
For the full reference — objects, arrays, enums, and custom types — see Schema.
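The shape described above might look like the following sketch: an object root, a typed date, a nested object for grouped data, and an array for repeating rows. The field names are hypothetical, and the exact extend:type values ("date", "currency") and the JSON type paired with them are assumptions; the Schema reference is the authority.

```python
# Sketch of a schema; field names are hypothetical.
schema = {
    "type": "object",  # the root must be an object
    "properties": {
        "invoice_number": {"type": "string"},
        # Assumed spelling of the custom type for dates.
        "invoice_date": {"type": "string", "extend:type": "date"},
        # Nested object for grouped data.
        "vendor": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "address": {"type": "string"},
            },
        },
        # Array property for repeating rows.
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    # Assumed spelling of the custom type for currency.
                    "total": {"type": "object", "extend:type": "currency"},
                },
            },
        },
    },
}

# Each top-level property is a field expected back in output.value.
print(sorted(schema["properties"]))
```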
Base processor
baseProcessor
Type: "extraction_performance" | "extraction_light" (default: "extraction_performance")
Selects the model family that powers extraction.
baseVersion
Type: string
Pins the run to a specific version of the selected processor. If omitted, the latest stable version is used. See the Extraction Performance versions page for the changelog.
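A minimal sketch of the processor settings, assuming you build the config in Python. The helper name processor_settings is hypothetical, and the version string "4.4.0" is only an example of the format; check the versions page for real values.

```python
ALLOWED_PROCESSORS = {"extraction_performance", "extraction_light"}


def processor_settings(base_processor="extraction_performance", base_version=None):
    """Build the processor keys, omitting baseVersion when the run is unpinned."""
    if base_processor not in ALLOWED_PROCESSORS:
        raise ValueError(f"unknown baseProcessor: {base_processor!r}")
    settings = {"baseProcessor": base_processor}
    if base_version is not None:
        # Pin to one version; leaving this out means the latest stable version.
        settings["baseVersion"] = base_version
    return settings


unpinned = processor_settings()
pinned = processor_settings("extraction_performance", "4.4.0")  # example version
print(unpinned)
print(pinned)
```

Pinning trades automatic improvements for reproducibility, so it tends to suit production pipelines that are evaluated against a fixed version.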
Extraction rules
extractionRules
Type: string
Plain-language rules that steer the model — useful for disambiguating fields, setting formats, or encoding business logic. Applied across the whole extraction.
When schema is omitted, extractionRules also serves as schema generation instructions. One sentence describing the document type and key fields produces a more targeted inferred schema.
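The two roles of extractionRules can be sketched side by side. The rule text and field name below are invented for illustration; only the key names come from this page.

```python
# With a schema: the rules steer how the listed fields are read and formatted.
config_with_schema = {
    "schema": {
        "type": "object",
        "properties": {"due_date": {"type": "string"}},  # hypothetical field
    },
    "extractionRules": (
        "Return dates as YYYY-MM-DD. "
        "If both a due date and a payment date appear, use the due date."
    ),
}

# Without a schema: one sentence naming the document type and key fields
# also guides the inferred schema.
config_schemaless = {
    "extractionRules": (
        "This is a commercial invoice; capture the invoice number, "
        "vendor, totals, and line items."
    ),
}

print("schema" in config_with_schema, "schema" in config_schemaless)
```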
Advanced options
Citations
advancedOptions.citationsEnabled
Type: boolean
Returns spatial (bounding-box) references and source text for each extracted value. Useful for highlighting and validation in review interfaces, but adds processing overhead. See Citations for the response shape.
Generating citations uses an additional citation-focused model, which moderately increases latency. Disable citations in latency-critical pipelines that don’t need spatial references.
advancedOptions.citationMode
Type: "line" | "word" | "block" (default: "line")
Controls the granularity of each citation. Requires citationsEnabled: true and a base processor version that supports bounding-box citations.
- line — returns one or more relevant OCR lines per citation (default).
- word — narrows to the relevant OCR word span when possible. Useful for precise citations from a table cell to an array property (e.g. line_items.total).
- block — returns block-level polygons (paragraphs, key-value regions, tables). Highest recall, lowest granularity.
advancedOptions.arrayCitationStrategy
Type: "item" | "property"
Granularity for citations on array fields. Requires citationsEnabled: true and extraction_performance ≥ 4.4.0 for property-level citations.
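The citation options above could be combined as in this sketch. The helper citation_options is hypothetical; it only checks the enumerated values listed on this page and always sets citationsEnabled, since the other two options require it. It does not check processor version support.

```python
CITATION_MODES = {"line", "word", "block"}
ARRAY_CITATION_STRATEGIES = {"item", "property"}


def citation_options(mode="line", array_strategy=None):
    """Return advancedOptions keys for citations, validating enum values."""
    if mode not in CITATION_MODES:
        raise ValueError(f"unknown citationMode: {mode!r}")
    if array_strategy is not None and array_strategy not in ARRAY_CITATION_STRATEGIES:
        raise ValueError(f"unknown arrayCitationStrategy: {array_strategy!r}")
    # citationMode and arrayCitationStrategy both require citationsEnabled.
    options = {"citationsEnabled": True, "citationMode": mode}
    if array_strategy is not None:
        options["arrayCitationStrategy"] = array_strategy
    return options


config = {"advancedOptions": citation_options("word", "property")}
print(config)
```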
Multimodal
advancedOptions.advancedMultimodalEnabled
Type: boolean
Uses vision-language models to better understand visual elements in the document. Essential for scanned documents, handwritten content, checks and forms, and poor-quality images. It adds latency, so disable it for clean digital PDFs, text-only documents, and latency-critical workflows where visual understanding isn’t required.
Reasoning insights
advancedOptions.modelReasoningInsightsEnabled
Type: boolean
Returns the model’s reasoning for each field as reasoning entries in the metadata insights array. Useful for debugging and validation during development; consider disabling it in production to reduce overhead. See Insights.
Review Agent
advancedOptions.reviewAgent.enabled
Type: boolean
When enabled, an automated agent reviews each extracted value and adds a reviewAgentScore (1–5) to the field’s metadata, plus issue and review_summary insights that flag fields needing manual review. See Review Agent.
Current date
advancedOptions.currentDateEnabled
Type: boolean (default: false)
Includes the current date as context for the model during extraction.
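The boolean flags from the multimodal, reasoning insights, Review Agent, and current date sections might be set together as sketched here. The helper advanced_flags and its two parameters are hypothetical; the choices simply follow the guidance above (multimodal for scanned input, reasoning insights mainly during development).

```python
def advanced_flags(scanned_input, development):
    """Pick boolean advancedOptions from two coarse facts about the workload."""
    return {
        # Vision-language processing helps scans and handwriting but adds latency.
        "advancedMultimodalEnabled": scanned_input,
        # Per-field reasoning is mostly useful while debugging.
        "modelReasoningInsightsEnabled": development,
        # Adds review scores and insights that flag fields for manual review.
        "reviewAgent": {"enabled": True},
        # Gives the model today's date as context.
        "currentDateEnabled": True,
    }


production_scans = advanced_flags(scanned_input=True, development=False)
print(production_scans)
```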
Large arrays
advancedOptions.arrayStrategy.type
Type: "large_array_heuristics" | "large_array_max_context" | "large_array_overlap_context"
Controls how very large arrays (for example, hundreds of line items across many pages) are extracted and merged. Omit arrayStrategy for the default behavior; set it only for large-array use cases. If you’re unsure which to use, reach out to the Extend team.
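A short sketch of opting in to a large-array strategy only when needed. The helper with_array_strategy is hypothetical, and the strategy chosen in the example is arbitrary rather than a recommendation.

```python
ARRAY_STRATEGIES = {
    "large_array_heuristics",
    "large_array_max_context",
    "large_array_overlap_context",
}


def with_array_strategy(advanced_options, strategy=None):
    """Return a copy of advancedOptions, adding arrayStrategy only if requested."""
    result = dict(advanced_options)
    if strategy is None:
        return result  # omit arrayStrategy to keep the default behavior
    if strategy not in ARRAY_STRATEGIES:
        raise ValueError(f"unknown arrayStrategy type: {strategy!r}")
    result["arrayStrategy"] = {"type": strategy}
    return result


default_options = with_array_strategy({"citationsEnabled": True})
large_array_options = with_array_strategy({}, "large_array_heuristics")
print(default_options)
print(large_array_options)
```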
Chunking and merging
Extract breaks large documents into chunks, extracts from each, and merges the results. These options tune that process.
advancedOptions.chunkingOptions.chunkingStrategy
Type: "standard" | "semantic"
- standard — page-based chunking with heuristics (e.g. reduces chunk size for large tables). Works for most documents.
- semantic — uses AI to intelligently determine whether pages can be split without breaking content relationships.
advancedOptions.chunkingOptions.pageChunkSize
Type: integer
The number of pages per chunk (25 by default). Larger chunks mean fewer processing calls and less overhead; smaller chunks can lower latency for large-array extraction.
advancedOptions.chunkingOptions.chunkSelectionStrategy
Type: "intelligent" | "confidence" | "take_first" | "take_last"
When the same field is found in multiple chunks, this decides which value wins.
advancedOptions.chunkingOptions.customSemanticChunkingRules
Type: string
Custom rules to guide semantic chunking.
Large tables can shrink the effective chunk size when chunking by page. To preserve context across a long table, try intelligent merging (chunkSelectionStrategy: "intelligent") and enable table header continuation in parseConfig (see Parse config).
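The chunking options can be sketched as one fragment, together with a rough estimate of how many chunks a document produces. The estimate assumes pure page-based splitting and ignores the table heuristics and semantic decisions described above, so treat it as a lower bound at best. The rule text is invented, and this page does not say how pageChunkSize interacts with semantic chunking, so the combination is illustrative.

```python
import math

chunking_options = {
    "chunkingStrategy": "semantic",
    "pageChunkSize": 10,
    "chunkSelectionStrategy": "intelligent",
    # Invented example rule; only used to guide semantic chunking.
    "customSemanticChunkingRules": "Keep each statement period with its transaction table.",
}


def estimated_chunk_count(total_pages, page_chunk_size=25):
    """Rough chunk count for page-based splitting (25 pages per chunk by default)."""
    if total_pages < 0 or page_chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and page_chunk_size >= 1")
    return math.ceil(total_pages / page_chunk_size)


print(estimated_chunk_count(120))      # 5 chunks at the default size
print(estimated_chunk_count(120, 10))  # 12 chunks at 10 pages per chunk
```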
Page ranges
advancedOptions.pageRanges
Type: Array<{ start: number, end: number }>
Limits extraction to specific pages. Page numbers are 1-based and inclusive; ranges can overlap or arrive out of order (the platform merges and sorts them). Use it when the relevant data is consistently on known pages of a long document — it reduces processing time and cost.
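To illustrate what merging and sorting page ranges means, here is a minimal client-side sketch. It is not the platform's implementation; normalize_page_ranges is a hypothetical helper that merges only overlapping ranges and leaves adjacent ones (such as 1-3 and 4-5) separate.

```python
def normalize_page_ranges(ranges):
    """Sort 1-based inclusive page ranges and merge any that overlap."""
    merged = []
    for r in sorted(ranges, key=lambda r: (r["start"], r["end"])):
        if r["start"] < 1 or r["end"] < r["start"]:
            raise ValueError(f"invalid page range: {r}")
        if merged and r["start"] <= merged[-1]["end"]:
            # Overlaps the previous range: extend it instead of adding a new one.
            merged[-1]["end"] = max(merged[-1]["end"], r["end"])
        else:
            merged.append({"start": r["start"], "end": r["end"]})
    return merged


page_ranges = [
    {"start": 5, "end": 8},
    {"start": 1, "end": 3},
    {"start": 2, "end": 6},
]
print(normalize_page_ranges(page_ranges))  # [{'start': 1, 'end': 8}]
```

Because the platform accepts overlapping and unordered ranges, this normalization is optional; it is mainly useful for logging or estimating how many pages a run will cover.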
Excel
advancedOptions.excelSheetSelectionStrategy
Type: "intelligent" | "all" | "first" | "last"
Chooses which sheets to extract from a workbook.
advancedOptions.excelSheetRanges
Type: Array<ExcelSheetRange>
Restricts extraction to specific sheet-index ranges.
Parse config
parseConfig
Type: Parse config object
Because Extract runs Parse under the hood, you can tune how the document is parsed before extraction with parseConfig. It accepts the same options as the Parse API — figure parsing, signature detection, agentic OCR, formula parsing, table formatting, and the parse engine. Reach for this when a value isn’t being read correctly (for example, enabling agentic OCR for messy scans).
For every parse option, see the Parse Configuration reference.
Using a saved extractor
To reuse a configuration across runs and workflows, create an Extractor and reference it by id instead of inlining config each time. You can override specific fields per run with overrideConfig.
An extractor is a kind of processor — see that page for how saving a configuration lets you version, evaluate, and optimize it.
- Create an extractor — set up a new extractor with your configuration.
- Update an extractor — modify an existing extractor’s configuration.
- Run an extractor — execute an extractor, optionally with extractor.overrideConfig.
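Referencing a saved extractor with a per-run override might look like the sketch below. The extractor.id and extractor.overrideConfig nesting is inferred from the wording on this page; the helper name and the ID value are placeholders, and the rest of the request body (such as the input file) is omitted. See the Run an extractor reference for the actual request shape.

```python
import json


def build_extractor_reference(extractor_id, override_config=None):
    """Build the 'extractor' part of a run request, with optional overrides."""
    extractor = {"id": extractor_id}
    if override_config:
        # Only the fields listed here are overridden for this run.
        extractor["overrideConfig"] = override_config
    return {"extractor": extractor}


# "ex_example123" is a placeholder, not a real extractor ID.
request_fragment = build_extractor_reference(
    "ex_example123",
    override_config={"advancedOptions": {"citationsEnabled": False}},
)
print(json.dumps(request_fragment, indent=2))
```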

