Bevaya Labs | Insurance AI Research & Specialized Models for InsurGPT™

The Benchmarks

When put to the test, InsurGPT™ beats general AI on every insurance task.

Specialized models outperform general-purpose ones on the work your team actually handles. Here's the head-to-head.

On claims accuracy, InsurGPT™ scored 99%. The strongest general model managed 62%.

A 37-point gap on the work your team handles every day — claims indexing, FNOL, demand letters, medical bills.

Read the full benchmark report

Source: Roots benchmark tests, December 2025.

Claims Accuracy

InsurGPT™: 99%
Mistral AI: 62%
GPT-5.0: 58%
Gemini 3.0 Pro: 55%

Featured Research

GutenOCR. Grounded OCR for production documents.

Generic OCR fails on the messy, multi-format documents insurance operations actually receive: handwritten notes, faxed pages, sixth-generation scans, multi-column tables with merged cells. GutenOCR is Bevaya Labs' proprietary grounded OCR, built specifically for the documents that defeat off-the-shelf tools.

Handles handwritten notes, faxed pages, and low-quality scans where generic OCR fails.
Every extracted field traces back to its exact source location — the foundation of X-Ray Mode.
Powers Document Intelligence inside the Bevaya Platform at production scale today.

Claim Number: CLM-2024-118827 · confidence 0.99 · source: p1 · line 14 · col 28

Demand_Letter_FAXED.pdf Page 1 of 4

WEXLER LAW FIRM

Attorneys at Law · Personal Injury

May 2, 2026

VIA EMAIL: claims-intake@bevaya-demo.com

Bevaya Insurance Company · Claims Department

RE: Daniel R. Smith v. Stoltz Trucking & Logistics LLC

Date of Loss: November 14, 2024

Claim Number: CLM-2024-118827

Policy Number: CGL-PA-7711-04

Every extracted field traces to its source token, even on faxed pages, handwritten notes, and low-quality scans.
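To make the idea of a grounded field concrete, here is a minimal sketch of how an extracted value might carry its confidence and source location. The class and attribute names (`SourceLocation`, `GroundedField`) are hypothetical and chosen for illustration; they are not GutenOCR's actual schema. The sample values mirror the Claim Number shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceLocation:
    """Where a value was read from (hypothetical schema)."""
    page: int
    line: int
    col: int

    def label(self) -> str:
        # Same compact form used in the example above.
        return f"p{self.page} · line {self.line} · col {self.col}"


@dataclass(frozen=True)
class GroundedField:
    """An extracted value plus its confidence and source location."""
    name: str
    value: str
    confidence: float
    source: SourceLocation


claim_number = GroundedField(
    name="Claim Number",
    value="CLM-2024-118827",
    confidence=0.99,
    source=SourceLocation(page=1, line=14, col=28),
)

print(f"{claim_number.name}: {claim_number.value} "
      f"({claim_number.confidence:.2f}, {claim_number.source.label()})")
```

Keeping the location on the field itself, rather than in a separate log, is what lets a reviewer jump from any value straight to the spot on the page it came from.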

Insurance math, validated

When a loss run omits Total Incurred because Paid, Reserves, and Expenses are listed separately, InsurGPT™ reconciles the calculation, validates the result, and flags low-confidence outputs for human review. Insurance arithmetic isn't a transcription problem; it's a reasoning problem the models have to get right.

Derived Field Reconciliation | Reconciled

Total Incurred missing from a loss run? InsurGPT™ derives it from Paid + Reserves + Expenses, validates the math against the document's own totals, and surfaces the computed value alongside its inputs.
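The arithmetic described above can be sketched in a few lines. This is an illustration under stated assumptions, not InsurGPT™'s internal logic: the function name `derive_total_incurred`, the one-cent `tolerance`, and the status labels are all hypothetical. Only the formula (Paid + Reserves + Expenses) comes from the text.

```python
from decimal import Decimal
from typing import Optional


def derive_total_incurred(
    paid: Decimal,
    reserves: Decimal,
    expenses: Decimal,
    stated_total: Optional[Decimal] = None,
    tolerance: Decimal = Decimal("0.01"),  # assumed rounding allowance
) -> dict:
    """Compute Total Incurred and compare it to a document total if one exists."""
    computed = paid + reserves + expenses
    if stated_total is None:
        status = "derived"       # nothing on the page to check against
    elif abs(computed - stated_total) <= tolerance:
        status = "reconciled"    # agrees with the document's own total
    else:
        status = "mismatch"      # candidate for human review
    # Return the inputs alongside the result so a reviewer can see the math.
    return {
        "total_incurred": computed,
        "inputs": {"paid": paid, "reserves": reserves, "expenses": expenses},
        "status": status,
    }


result = derive_total_incurred(
    Decimal("12500.00"), Decimal("7500.00"), Decimal("1250.50")
)
print(result["total_incurred"], result["status"])  # 21250.50 derived
```

`Decimal` is used instead of `float` so that currency sums do not pick up binary rounding error.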

Cross-Column Validation | Verified

Every numeric field is checked against neighboring columns and document-level subtotals. A reserve figure that doesn't roll up to the schedule total gets caught before it leaves the canvas.
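As a rough sketch of the roll-up check described above, the snippet below sums one column across claim rows and compares it to the schedule total printed on the document. The function `check_rollup`, the row layout, and the sample figures are hypothetical and exist only to illustrate the idea.

```python
from decimal import Decimal


def check_rollup(rows, column, stated_total, tolerance=Decimal("0.01")):
    """Sum one column across rows and compare it to the document's stated total."""
    computed = sum((row[column] for row in rows), Decimal("0"))
    difference = computed - stated_total
    return {
        "column": column,
        "computed": computed,
        "stated": stated_total,
        "difference": difference,
        "ok": abs(difference) <= tolerance,
    }


# Made-up claim rows for illustration.
rows = [
    {"claim": "A", "reserves": Decimal("5000.00")},
    {"claim": "B", "reserves": Decimal("2500.00")},
    {"claim": "C", "reserves": Decimal("1000.00")},
]

good = check_rollup(rows, "reserves", Decimal("8500.00"))
bad = check_rollup(rows, "reserves", Decimal("9500.00"))
print(good["ok"], bad["ok"], bad["difference"])  # True False -1000.00
```

Reporting the signed difference, not just a pass/fail flag, tells the reviewer how far off the schedule total is and in which direction.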

Low-Confidence Routing | Flagged

Calculations that fall below your confidence threshold route automatically to a human reviewer with the source page, the inputs used, and the proposed value — not a black-box answer to rubber-stamp.
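A threshold-based routing rule of the kind described above might look like the following sketch. The `Calculation` class, the queue names, and the default threshold of 0.90 are assumptions made for illustration; the text only says that results below a customer-set threshold go to a human reviewer together with the source page, the inputs, and the proposed value.

```python
from dataclasses import dataclass


@dataclass
class Calculation:
    """A computed value awaiting acceptance (hypothetical structure)."""
    field_name: str
    proposed_value: str
    confidence: float
    source_page: int
    inputs: dict


def route(calc: Calculation, threshold: float = 0.90) -> dict:
    """Send low-confidence calculations to review with their supporting context."""
    if calc.confidence >= threshold:
        return {
            "queue": "auto_accept",
            "field": calc.field_name,
            "value": calc.proposed_value,
        }
    # Below threshold: attach what a reviewer needs to check the math.
    return {
        "queue": "human_review",
        "field": calc.field_name,
        "proposed_value": calc.proposed_value,
        "source_page": calc.source_page,
        "inputs": calc.inputs,
    }


calc = Calculation(
    field_name="Total Incurred",
    proposed_value="21250.50",
    confidence=0.72,
    source_page=3,
    inputs={"paid": "12500.00", "reserves": "7500.00", "expenses": "1250.50"},
)
print(route(calc)["queue"])  # human_review
```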

Clinical coherence on medical bills

Drug codes extracted from a medical bill are validated against national databases to verify clinical coherence — a check a general-purpose model has no concept of. Knowing what an NDC number is, and what it should appear next to, is insurance domain knowledge encoded in the model.

NDC Validation | Verified

Every National Drug Code pulled off a bill is checked against the FDA's NDC Directory to confirm the code exists, the drug name matches, and the dosage form is consistent with what's billed.
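The three checks above (code exists, name matches, dosage form is consistent) can be sketched as follows. This is an illustration, not the production validator: the function names are hypothetical, the lookup uses a small in-memory dictionary standing in for the FDA's NDC Directory, and the single directory entry is an invented placeholder rather than a real drug. The format handling assumes the hyphenated 10-digit NDC layouts (4-4-2, 5-3-2, 5-4-1) and the zero-padded 11-digit 5-4-2 form commonly used on bills.

```python
import re
from typing import List, Optional

_NDC_PATTERN = re.compile(r"^(\d{4,5})-(\d{3,4})-(\d{1,2})$")
# 10-digit layouts plus the already-padded 11-digit billing layout.
_VALID_SHAPES = {(4, 4, 2), (5, 3, 2), (5, 4, 1), (5, 4, 2)}


def normalize_ndc(code: str) -> Optional[str]:
    """Return the code in 5-4-2 form, or None if the shape is not recognized."""
    match = _NDC_PATTERN.match(code.strip())
    if match is None:
        return None
    labeler, product, package = match.groups()
    if (len(labeler), len(product), len(package)) not in _VALID_SHAPES:
        return None
    # Pad the short segment with a leading zero.
    return f"{labeler.zfill(5)}-{product.zfill(4)}-{package.zfill(2)}"


def validate_ndc(code: str, billed_name: str, billed_form: str,
                 directory: dict) -> List[str]:
    """Return a list of issues; an empty list means all three checks passed."""
    normalized = normalize_ndc(code)
    if normalized is None:
        return ["malformed NDC"]
    entry = directory.get(normalized)
    if entry is None:
        return ["NDC not found in directory"]
    issues = []
    if entry["name"].lower() != billed_name.strip().lower():
        issues.append("drug name mismatch")
    if entry["dosage_form"].lower() != billed_form.strip().lower():
        issues.append("dosage form mismatch")
    return issues


# Invented placeholder entry, NOT a real NDC or drug.
toy_directory = {
    "00000-0000-01": {"name": "Example Drug", "dosage_form": "tablet"},
}

print(validate_ndc("0000-0000-01", "EXAMPLE DRUG", "Tablet", toy_directory))
print(validate_ndc("0000-0000-01", "Example Drug", "Injection", toy_directory))
```

Normalizing before lookup matters because the same product can appear in 10-digit form on a label and in 11-digit form on a bill.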

CPT & ICD Coherence | Cross-checked

Procedure codes (CPT/HCPCS) are cross-referenced against diagnosis codes (ICD-10) to catch billing combinations that don't clinically hang together — the kind of mismatch a generalist OCR has no way to see.

Provider & Pricing Sanity | Calibrated

Billed amounts are sanity-checked against expected ranges for the procedure, provider type, and jurisdiction. Outliers are flagged before they reach reserve-setting, not after.
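One simple way to picture the range check above is a lookup keyed by procedure, provider type, and jurisdiction. Everything in this sketch is hypothetical: the `EXPECTED_RANGES` table, the placeholder procedure code `PROC-A`, the dollar bounds, and the result labels are invented for illustration and do not reflect real pricing data or how the calibrated model derives its ranges.

```python
from decimal import Decimal

# Invented reference ranges: (procedure, provider type, jurisdiction) -> (low, high)
EXPECTED_RANGES = {
    ("PROC-A", "hospital", "PA"): (Decimal("800.00"), Decimal("2400.00")),
}


def check_billed_amount(procedure, provider_type, jurisdiction, billed,
                        ranges=EXPECTED_RANGES):
    """Classify a billed amount against the expected range, if one is known."""
    bounds = ranges.get((procedure, provider_type, jurisdiction))
    if bounds is None:
        return "no_reference"   # nothing to compare against
    low, high = bounds
    return "in_range" if low <= billed <= high else "outlier"


print(check_billed_amount("PROC-A", "hospital", "PA", Decimal("1500.00")))
print(check_billed_amount("PROC-A", "hospital", "PA", Decimal("9800.00")))
```

Returning a distinct `no_reference` result keeps a missing benchmark from being mistaken for a clean bill.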

Intelligent document orchestration

A 40-page submission package is identified as multiple document subtypes — ACORDs, supplementals, loss runs, schedules — split into indexed components, and routed through the right workflow for each type. No human triage.

Multi-Subtype Classification | Identified

InsurGPT™ reads a single bundled PDF and identifies every document inside it — ACORD 125, ACORD 140, supplemental applications, loss runs, SOVs — without anyone pre-tagging the pages.

Component Splitting & Indexing | Indexed

The package is split at the right page boundaries, each component is indexed with its subtype and page range, and downstream nodes pull the slice they need instead of re-parsing the whole bundle.
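The index described above, each component with its subtype and page range, can be sketched from a list of per-page labels. This assumes a classifier has already assigned a subtype to every page; the `Component` class and `split_package` function are hypothetical names. Note the simplification: grouping by label alone would merge two back-to-back documents of the same subtype, so a real splitter also needs a boundary signal.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass(frozen=True)
class Component:
    """One document inside a bundle, with its 1-indexed page range."""
    subtype: str
    first_page: int
    last_page: int


def split_package(page_labels: List[str]) -> List[Component]:
    """Group consecutive pages that share a subtype into indexed components."""
    components = []
    page = 1
    for subtype, run in groupby(page_labels):
        length = len(list(run))
        components.append(Component(subtype, page, page + length - 1))
        page += length
    return components


labels = ["ACORD 125", "ACORD 125", "ACORD 140",
          "Loss Run", "Loss Run", "Loss Run", "SOV"]
for component in split_package(labels):
    print(component)
```

With the page range stored on each component, a downstream step can request pages 4 to 6 for the loss run without re-reading the full bundle.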

Subtype-Aware Routing | Routed

Each component is dispatched to the workflow tuned for it — loss runs to the reconciliation flow, ACORDs to underwriting intake, SOVs to schedule normalization — with no human in the middle making the routing call.
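The dispatch described above amounts to a mapping from subtype to workflow. In this sketch the workflow identifiers, the `route_component` function, and the fallback queue for unrecognized subtypes are assumptions; only the three pairings (loss runs to reconciliation, ACORDs to underwriting intake, SOVs to schedule normalization) come from the text.

```python
# Subtype family -> workflow name (identifiers are hypothetical).
ROUTES = {
    "Loss Run": "reconciliation_flow",
    "ACORD": "underwriting_intake",
    "SOV": "schedule_normalization",
}


def route_component(subtype: str) -> str:
    """Pick the workflow for a component based on its subtype."""
    # ACORD 125, ACORD 140, and so on share one intake workflow here.
    family = "ACORD" if subtype.startswith("ACORD") else subtype
    # Assumed fallback so an unrecognized subtype is never silently dropped.
    return ROUTES.get(family, "unclassified_queue")


for subtype in ["ACORD 125", "Loss Run", "SOV", "Broker Email"]:
    print(subtype, "->", route_component(subtype))
```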

Research

Page stream segmentation with LLMs

How Bevaya Labs approaches a foundational problem in insurance document AI.

Case Study

Workers' comp carrier processes claims 100x faster

How indexing automation delivered 432% ROI in 12 months.

Architecture

Inside the Bevaya platform architecture

How specialized models, HITL controls, and integrations come together in production.

FAQ

Frequently asked questions.

What is Bevaya Labs?

Bevaya Labs is the applied research arm of Bevaya. The team develops specialized AI models for insurance, publishes findings to demonstrate the depth of the work, and releases open tools like GutenOCR. Every model the team builds is shipped into customer production deployments.

Why does InsurGPT™ outperform GPT, Gemini, and other frontier models?

InsurGPT™ is a mosaic of dozens of specialized models, each purpose-built for one insurance task. When a document enters the system, InsurGPT™ selects the right combination of models for that document. A loss run is processed by models trained specifically on loss run formats. An ACORD form is handled by models that know every field and variation. That specialization is why InsurGPT™ reaches 93% accuracy on loss runs while general-purpose models sit at 80–84%.

How long does it take to build a production-quality insurance model?

Months, not days. Our loss run model alone took seven months of expert annotation by insurance domain specialists before reaching production quality. Across the model portfolio, one to two years of data collection and labeling is typical to create a robust, cross-customer model for a use case. This is why prompt engineering and off-the-shelf APIs cannot match purpose-built insurance AI.

How does Bevaya manage model drift and continuous improvement?

Bevaya Labs runs production-grade MLOps. Every model is monitored for drift, controlled rollouts use A/B testing for every new version, and data and model versions are tracked for full reproducibility. Federated learning across the customer base means every carrier benefits from platform-wide improvements while their individual data stays protected.

Why publish research and open-source tools at all?

Two reasons. First, insurance buyers do not believe bold claims without substantiation — publishing methods lets technical evaluators verify the work for themselves. Second, the team participates in the broader AI research community. Open tools like GutenOCR move the field forward and demonstrate technical depth that prompt-engineered competitors cannot match.

Who leads AI research at Bevaya?

Ratish Dalvi is VP of AI and Machine Learning. He leads a team of AI researchers, ML engineers, and data annotation specialists focused on vision-language reasoning models for insurance. The team publishes at venues including COLING and maintains the Bevaya Labs research blog.

Can our IT or AI team evaluate the models directly?

Yes. Bevaya supports proof-of-value evaluations on real customer documents, with published benchmarks for context. The Bevaya Labs team will share methodology and engage directly with technical evaluators on model architecture, training approach, and production results. For GutenOCR specifically, a live demo runs at ocr.roots.ai.

How does Bevaya Labs compare to building AI in-house?

Building an in-house equivalent takes 12–36 months, requires a dedicated AI team and infrastructure, and starts at $5M+ in upfront cost. A single-model approach cannot match the specialized mosaic InsurGPT™ is built on, and the review experience, orchestration, and integrations all have to be built from scratch. Bevaya Labs delivers seven years of focused research, 300M+ documents of training data, and an operating platform — today.

AI Research & Development

The research behind insurance's most accurate AI.

Confirms accuracy

Reads loss history

Extracts ACORDs

Sorts 100+ doc types

Property schedules

Separates documents

Reads the unreadable

Traces every answer

By the numbers.

Behind the platform

How we build and govern the models.

INSURANCE-NATIVE REASONING

It is not extraction. It is judgment.

Workflow Canvas

Human-in-the-Loop

Document Intelligence

Grounded Explainability

Analytics Dashboard

Governed Automation

InsurGPT™

An AI agent across the platform

Research Output

Published, so customers can verify the claims.

GET STARTED

See the research running on your documents.
