Data Extractor
Overview
The Data Extractor endpoint is a knowledge agent that extracts structured data from documents using the AI model you select. Specify the entity types you want (organizations, people, dates, amounts, and so on), and the agent identifies the matching values in each document and returns them grouped by entity type.
Use this endpoint to:
- Extract structured data from documents
- Identify and extract specific entities (organizations, people, dates, financial data, etc.)
- Automate data entry from contracts, invoices, and legal documents
- Build document processing pipelines with custom entity extraction
- Parse multiple documents simultaneously for batch processing
Endpoint Details
Method: POST
Endpoint: /api/agent/data_extractor
Base URL: https://api.k-v.ai
Authentication: Access Key (Required)
Request Specification
Headers
| Header | Type | Required | Description |
|---|---|---|---|
| access-key | string | Yes | Your unique access-key generated from the platform UI |
| Content-Type | string | Yes | Must be application/json |
Request Body
{
  "doc_process_ids": [
    "264dfa262b748d15ccbeaada89430c68"
  ],
  "entity_list": [
    "Organisations",
    "People"
  ],
  "model_data": {
    "model_name": "gpt-5.1",
    "api_key": ""
  }
}
Body Fields
| Field | Type | Required | Description |
|---|---|---|---|
| doc_process_ids | array | Yes | Array of document IDs to extract entities from (obtained from Upload or List Documents APIs) |
| entity_list | array | No | List of entity types to extract (e.g., "People", "Organisations", "Dates", "Amounts"). Leave empty to skip extraction |
| model_data | object | Yes | AI model configuration |
| model_data.model_name | string | Yes | AI model to use for extraction (see supported models below) |
| model_data.api_key | string | No | Your own LLM API key (leave empty to use platform's default keys) |
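The fields above can be assembled programmatically. The following is a minimal sketch, not an official client: the helper name `build_request` and its defaults are assumptions made for illustration, while the header and field names come from the tables above.

```python
import json

API_URL = "https://api.k-v.ai/api/agent/data_extractor"


def build_request(access_key, doc_process_ids, entity_list=None,
                  model_name="gpt-5.1", api_key=""):
    """Return (headers, json_body) for a Data Extractor call.

    Hypothetical helper: it only assembles the documented fields.
    """
    if not doc_process_ids:
        # doc_process_ids is required, so fail early on the client side.
        raise ValueError("doc_process_ids must contain at least one document ID")
    headers = {
        "access-key": access_key,
        "Content-Type": "application/json",
    }
    body = {
        "doc_process_ids": list(doc_process_ids),
        # Optional field: an empty list means nothing is extracted.
        "entity_list": list(entity_list or []),
        "model_data": {
            "model_name": model_name,
            # An empty string uses the platform's default keys.
            "api_key": api_key,
        },
    }
    return headers, json.dumps(body)


headers, body = build_request(
    "YOUR_ACCESS_KEY",
    ["264dfa262b748d15ccbeaada89430c68"],
    entity_list=["Organisations", "People"],
)
print(body)
```

The returned headers and body can be passed to any HTTP client as a POST to `API_URL`.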
Supported AI Models
| Model Name | Provider |
|---|---|
| gpt-5.1 | OpenAI |
| gpt-5-mini | OpenAI |
| claude-sonnet-4-5-20250929 | Anthropic |
| gemini/gemini-2.5-flash-lite | Google |
| gemini/gemini-2.5-pro | Google |
| gemini/gemini-3-pro-preview | Google |
| mistral/mistral-small-latest | Mistral AI |
| mistral/mistral-medium-latest | Mistral AI |
| llama3.1-70b | Meta |
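Because an unsupported model_name is rejected with a 422 response, it can be convenient to check the name before sending a request. The sketch below assumes a hand-maintained copy of the table above; `SUPPORTED_MODELS` and `check_model_name` are illustrative names, and the server remains the authority on which names it accepts (the sample 422 message in this document, for instance, also lists gpt-5-chat-latest).

```python
# Copied from the Supported AI Models table; update it when the table changes.
SUPPORTED_MODELS = {
    "gpt-5.1",
    "gpt-5-mini",
    "claude-sonnet-4-5-20250929",
    "gemini/gemini-2.5-flash-lite",
    "gemini/gemini-2.5-pro",
    "gemini/gemini-3-pro-preview",
    "mistral/mistral-small-latest",
    "mistral/mistral-medium-latest",
    "llama3.1-70b",
}


def check_model_name(model_name):
    """Return model_name unchanged, or raise ValueError if it is not in the table."""
    if model_name not in SUPPORTED_MODELS:
        options = ", ".join(sorted(SUPPORTED_MODELS))
        raise ValueError(f"Unsupported model_name {model_name!r}; expected one of: {options}")
    return model_name


print(check_model_name("gemini/gemini-2.5-pro"))
```

Note that the Gemini and Mistral names include a provider prefix, so a bare name such as "gemini-2.5-pro" fails this check.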
Using Your Own LLM API Keys
Platform Keys (Default)
{
  "model_data": {
    "model_name": "gpt-5.1",
    "api_key": ""
  }
}
Your Own Keys
{
  "model_data": {
    "model_name": "gpt-5.1",
    "api_key": "sk-your-openai-api-key-here"
  }
}
Response Specification
Success Response (200 OK)
{
  "data": {
    "entities": [
      {
        "file_name": "Settlement Agreement (1).pdf",
        "doc_hash": "264dfa262b748d15ccbeaada89430c68",
        "entity_table": [
          {
            "entity_type": "Organisations",
            "values": [
              "Widget Corporation (Defendant)",
              "Dewey, Cheatum & Howe LLP",
              "ABC Software Corporation (Plaintiff)",
              "Propel Software Corporation (Plaintiff)"
            ]
          },
          {
            "entity_type": "People",
            "values": [
              "Joe Average (CEO of ABC Software Corporation, Plaintiff)",
              "James Smith, Esq. (Attorney for Plaintiff)"
            ]
          }
        ],
        "tokens": {
          "input": 1208,
          "output": 78,
          "total": 1286
        }
      }
    ],
    "tokens": {
      "input": 1208,
      "output": 78,
      "total": 1286
    }
  },
  "message": "Entities extracted successfully"
}
Success Response - Empty Entity List
When entity_list is empty or not provided:
{
  "data": {
    "entities": [
      {
        "file_name": "Settlement Agreement (1).pdf",
        "doc_hash": "264dfa262b748d15ccbeaada89430c68",
        "entity_table": [],
        "tokens": {
          "input": 0,
          "output": 0,
          "total": 0
        }
      }
    ],
    "tokens": {
      "input": 0,
      "output": 0,
      "total": 0
    }
  },
  "message": "Entities extracted successfully"
}
Note: No entities are extracted when entity_list is empty, and no tokens are consumed.
Response Fields
| Field | Type | Description |
|---|---|---|
| data | object | Response data object |
| data.entities | array | Array of extracted entity results per document |
| data.entities[].file_name | string | Original filename of the document |
| data.entities[].doc_hash | string | Document process ID |
| data.entities[].entity_table | array | Array of entity types and their extracted values |
| data.entities[].entity_table[].entity_type | string | Type of entity extracted (matches entity_list values) |
| data.entities[].entity_table[].values | array | List of extracted values for this entity type |
| data.entities[].tokens | object | Token usage for this specific document |
| data.entities[].tokens.input | integer | Input tokens consumed |
| data.entities[].tokens.output | integer | Output tokens generated |
| data.entities[].tokens.total | integer | Total tokens used for this document |
| data.tokens | object | Total token usage across all documents |
| data.tokens.input | integer | Total input tokens consumed |
| data.tokens.output | integer | Total output tokens generated |
| data.tokens.total | integer | Total tokens used for all documents |
| message | string | Human-readable response message |
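To show how these fields nest, the sketch below flattens a success response into a dictionary keyed by doc_hash. The function name `entities_by_document` is an assumption for illustration, and the sample is an abbreviated copy of the success response above.

```python
def entities_by_document(response_json):
    """Map each doc_hash to {entity_type: [values, ...]}."""
    result = {}
    for doc in response_json.get("data", {}).get("entities", []):
        # entity_table is an empty list when entity_list was empty.
        result[doc["doc_hash"]] = {
            row["entity_type"]: row["values"]
            for row in doc.get("entity_table", [])
        }
    return result


# Abbreviated copy of the documented success response.
sample = {
    "data": {
        "entities": [
            {
                "file_name": "Settlement Agreement (1).pdf",
                "doc_hash": "264dfa262b748d15ccbeaada89430c68",
                "entity_table": [
                    {"entity_type": "Organisations",
                     "values": ["Widget Corporation (Defendant)"]},
                    {"entity_type": "People",
                     "values": ["James Smith, Esq. (Attorney for Plaintiff)"]},
                ],
                "tokens": {"input": 1208, "output": 78, "total": 1286},
            }
        ],
        "tokens": {"input": 1208, "output": 78, "total": 1286},
    },
    "message": "Entities extracted successfully",
}

for doc_hash, table in entities_by_document(sample).items():
    for entity_type, values in table.items():
        print(doc_hash, entity_type, values)
```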
Understanding Token Usage
Per-Document Tokens
Each object in data.entities[] carries its own tokens object, which reports the tokens consumed while processing that one document.
Aggregate Tokens
The data.tokens object is the sum of token usage across all documents in the request. Use it for cost calculation and monitoring.
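The relationship between the two levels can be checked with a few lines of code. This is a sketch under the assumption stated above, that data.tokens equals the sum of the per-document tokens objects; `sum_document_tokens` is an illustrative name, and the numbers are taken from the sample success response.

```python
def sum_document_tokens(data):
    """Add up the per-document token counts found in data["entities"]."""
    totals = {"input": 0, "output": 0, "total": 0}
    for doc in data.get("entities", []):
        for key in totals:
            totals[key] += doc["tokens"][key]
    return totals


# The "data" object from the sample success response, without entity values.
data = {
    "entities": [
        {"doc_hash": "264dfa262b748d15ccbeaada89430c68",
         "entity_table": [],
         "tokens": {"input": 1208, "output": 78, "total": 1286}},
    ],
    "tokens": {"input": 1208, "output": 78, "total": 1286},
}

per_document_sum = sum_document_tokens(data)
print(per_document_sum)
print(per_document_sum == data["tokens"])
```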
Error Responses
401 Unauthorized
{
  "data": {},
  "message": "Invalid or missing access key"
}
Cause: Missing or invalid access-key header.
422 Unprocessable Entity - Invalid Model Name
{
  "detail": [
    {
      "type": "literal_error",
      "loc": [
        "body",
        "model_data",
        "model_name"
      ],
      "msg": "Input should be 'gpt-5-chat-latest', 'gpt-5.1', 'gpt-5-mini', 'claude-sonnet-4-5-20250929', 'gemini/gemini-2.5-flash-lite', 'mistral/mistral-small-latest', 'mistral/mistral-medium-latest', 'gemini/gemini-2.5-pro', 'gemini/gemini-3-pro-preview' or 'llama3.1-70b'",
      "input": "",
      "ctx": {
        "expected": "'gpt-5-chat-latest', 'gpt-5.1', 'gpt-5-mini', 'claude-sonnet-4-5-20250929', 'gemini/gemini-2.5-flash-lite', 'mistral/mistral-small-latest', 'mistral/mistral-medium-latest', 'gemini/gemini-2.5-pro', 'gemini/gemini-3-pro-preview' or 'llama3.1-70b'"
      }
    }
  ]
}
Cause: Invalid or unsupported model_name in model_data. See the supported models list above.
400 Bad Request - Invalid Document IDs
{
  "data": {},
  "message": "Invalid docs selected"
}
Cause: Missing or invalid doc_process_ids.
500 Internal Server Error - Invalid LLM API Key
{
  "data": {},
  "message": "litellm.AuthenticationError: AuthenticationError: OpenAIException - Incorrect API key provided: tyrdfuih**uhf7. You can find your API key at https://platform.openai.com/account/api-keys."
}
Cause: The api_key provided in model_data is invalid or expired. Verify your LLM provider API key.
500 Internal Server Error - General Error
{
  "data": {},
  "message": "Something went wrong"
}
Causes:
- LLM service temporarily unavailable
- Server-side processing error
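The error bodies above come in two shapes: the 400, 401, and 500 responses carry a top-level message, while the 422 response carries a detail list with loc and msg entries. The following sketch turns either shape into a single line suitable for logging. The function name `describe_error` and the output format are assumptions for illustration.

```python
def describe_error(status_code, body):
    """Summarize a documented error body as a one-line string."""
    if "detail" in body:
        # 422 validation errors: one entry per invalid field.
        parts = []
        for item in body["detail"]:
            location = ".".join(str(part) for part in item.get("loc", []))
            parts.append(f"{location}: {item.get('msg', '')}")
        return f"{status_code} validation error: " + "; ".join(parts)
    # 400, 401, and 500 responses: {"data": {}, "message": "..."}.
    return f"{status_code}: {body.get('message', 'Unknown error')}"


print(describe_error(401, {"data": {}, "message": "Invalid or missing access key"}))
print(describe_error(422, {
    "detail": [
        {"type": "literal_error",
         "loc": ["body", "model_data", "model_name"],
         "msg": "Input should be 'gpt-5.1' or another supported model",
         "input": ""}
    ]
}))
```

The msg text in the second call is shortened for readability; the real message lists every accepted model name.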
Code Snippets
cURL
curl --location 'https://api.k-v.ai/api/agent/data_extractor' \
  --header 'access-key: YOUR_ACCESS_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "doc_process_ids": [
      "264dfa262b748d15ccbeaada89430c68"
    ],
    "entity_list": [
      "Organisations",
      "People"
    ],
    "model_data": {
      "model_name": "gpt-5.1",
      "api_key": ""
    }
  }'
Python
import requests
import json

url = "https://api.k-v.ai/api/agent/data_extractor"

payload = json.dumps({
    "doc_process_ids": [
        "264dfa262b748d15ccbeaada89430c68"
    ],
    "entity_list": [
        "Organisations",
        "People"
    ],
    "model_data": {
        "model_name": "gpt-5.1",
        "api_key": ""
    }
})
headers = {
    'access-key': 'YOUR_ACCESS_KEY',
    'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
Node.js (Axios)
const axios = require('axios');

let data = JSON.stringify({
  "doc_process_ids": [
    "264dfa262b748d15ccbeaada89430c68"
  ],
  "entity_list": [
    "Organisations",
    "People"
  ],
  "model_data": {
    "model_name": "gpt-5.1",
    "api_key": ""
  }
});

let config = {
  method: 'post',
  maxBodyLength: Infinity,
  url: 'https://api.k-v.ai/api/agent/data_extractor',
  headers: {
    'access-key': 'YOUR_ACCESS_KEY',
    'Content-Type': 'application/json'
  },
  data: data
};

axios.request(config)
  .then((response) => {
    console.log(JSON.stringify(response.data));
  })
  .catch((error) => {
    console.log(error);
  });
Important Notes
- Document Processing: Only extract from documents with status: "processed"
- Entity Specificity: More specific entity names yield better extraction accuracy
- Empty Entity List: Returns 200 OK with empty entity_table and zero token usage
- Token Costs: Monitor tokens.total for cost management
Next Steps
After extracting entities:
- Validate Results: Review extracted values for accuracy
- Automate Workflows: Integrate into document processing pipelines
Need Help? Contact support at support@k-v.ai