Why Small Language Models?
General-purpose LLMs are expensive, slow, and not optimized for evaluation tasks. Qualifire’s SLM judges solve this with purpose-built models fine-tuned for specific evaluation tasks, delivering higher accuracy at a fraction of the cost and latency.
99.6% Faster
~100ms latency vs seconds for general-purpose LLMs
97% Cheaper
$0.01 / 1M tokens vs $1.25–$3.00 / 1M tokens for frontier LLMs
Higher Accuracy
Fine-tuned models outperform general-purpose LLMs on targeted evaluation tasks
Omni — Multi-Task Evaluation Model
Omni is Qualifire’s flagship 14B-parameter model, capable of handling multiple evaluation tasks in a single inference call. It delivers frontier-model accuracy at SLM speed and cost.
| Property | Value |
|---|---|
| Parameters | 14B |
| Latency | ~100ms |
| Cost | $0.01 / 1M tokens |
| Tasks | Prompt Injection Detection, Safety, Grounding, Hallucination Detection, Policy Enforcement, Tool Use Quality, Topic Scoping |
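As a sketch of how a single multi-task call might be assembled, the snippet below builds one request covering several evaluation tasks. The endpoint is omitted and every field name here is an illustrative assumption, not the documented API schema:

```python
import json

# Hypothetical payload for a single Omni inference call that runs several
# evaluation tasks at once. Field and task names are illustrative assumptions.
def build_omni_request(input_text: str, output_text: str, tasks: list[str]) -> dict:
    supported = {
        "prompt_injection", "safety", "grounding", "hallucination",
        "policy", "tool_use", "topic_scoping",
    }
    unknown = set(tasks) - supported
    if unknown:
        raise ValueError(f"unsupported tasks: {sorted(unknown)}")
    return {
        "model": "omni",  # hypothetical model identifier
        "input": input_text,
        "output": output_text,
        "tasks": tasks,
    }

payload = build_omni_request(
    "What is our refund policy?",
    "Refunds are available within 30 days of purchase.",
    ["grounding", "hallucination", "safety"],
)
print(json.dumps(payload, indent=2))
```

Batching tasks into one call is what makes Omni cheaper than chaining several single-task judges: one inference pass amortizes latency across all requested checks.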
Benchmarks
Omni matches or exceeds the performance of frontier models such as GPT-5, Claude Sonnet 4.5, and Gemini 3 Pro across evaluation tasks, at 60x lower latency and 125–300x lower cost.
- Prompt Injection
- Hallucination Detection
- Grounding
- Policy Enforcement
- Tool Use Quality
- Topic Scoping
- Safety
Detects prompt injection and jailbreak attempts targeting your AI system.
| Model | Creator | Avg F1 | Latency | Cost/1M tokens |
|---|---|---|---|---|
| Sentinel v2 | Qualifire | 0.957 | ~0.038s | $0.005 |
| Omni | Qualifire | 0.936 | ~0.1s | $0.01 |
| Qwen3Guard 8B | Qwen | 0.882 | ~0.76s | — |
| Qwen3Guard 4B | Qwen | 0.877 | ~0.48s | — |
| Qwen3Guard 0.6B | Qwen | 0.858 | ~0.27s | — |
| GPT OSS Safeguard 20B | OpenAI | 0.803 | ~10s | — |
| Llama Guard 3 8B | Meta | 0.628 | ~0.21s | — |
| Llama Guard 3 1B | Meta | 0.475 | ~0.09s | — |
Specialist Models
In addition to Omni, Qualifire provides fine-tuned specialist models optimized for single tasks where maximum accuracy or minimal latency is required.
Sentinel — Prompt Injection Detection
Detects prompt injection and jailbreak attempts that try to manipulate your AI into ignoring its instructions.
| Property | Value |
|---|---|
| Avg F1 | 0.957 |
| Latency | ~38ms |
| Parameters | 596M |
| Cost | $0.005 / 1M tokens |
Benchmark comparison (Prompt Injection):
| Model | Creator | Avg F1 | Latency | Cost/1M tokens |
|---|---|---|---|---|
| Sentinel v2 | Qualifire | 0.957 | ~0.038s | $0.005 |
| Qwen3Guard 8B | Qwen | 0.882 | ~0.76s | — |
| Qwen3Guard 4B | Qwen | 0.877 | ~0.48s | — |
| Qwen3Guard 0.6B | Qwen | 0.858 | ~0.27s | — |
| GPT OSS Safeguard 20B | OpenAI | 0.803 | ~10s | — |
| Llama Guard 3 8B | Meta | 0.628 | ~0.21s | — |
Cleric — Content Safety Moderation
Evaluates content for harmful or inappropriate material across multiple safety categories (dangerous content, harassment, hate speech, sexually explicit).
| Property | Value |
|---|---|
| Avg F1 | 0.886 |
| Latency | ~38ms |
| Parameters | 0.6B |
| Cost | $0.01 / 1M tokens |
Paladin — Context Grounding
Verifies that responses are accurately grounded in provided reference material.
Paladin Mini is optimized for speed-critical applications. For higher accuracy, use Omni.
| Property | Value |
|---|---|
| Avg Score | 79.31 |
| Latency | ~64ms |
| Parameters | 3.8B |
| Cost | $0.016 / 1M tokens |
Ranger — Tool Use Quality
Evaluates MCP tool selection quality for AI agents — correct tool selection, parameters, and values.
| Property | Value |
|---|---|
| F1 | 0.945 |
| Latency | ~90ms |
| Cost | $0.01 / 1M tokens |
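A tool-use evaluation needs the user request, the tool call the agent made, and the tools it could have chosen from. The sketch below assembles such a check; the field names and the `ranger` model identifier are assumptions for illustration, not the documented schema:

```python
# Hypothetical tool-use quality check payload for Ranger. The judge would
# assess whether the chosen tool and its arguments fit the user request.
def build_tool_use_check(user_request: str, tool_call: dict, available_tools: list[str]) -> dict:
    if tool_call["name"] not in available_tools:
        raise ValueError(f"unknown tool: {tool_call['name']}")
    return {
        "model": "ranger",  # hypothetical model identifier
        "request": user_request,
        "tool_call": tool_call,
        "available_tools": available_tools,
    }

check = build_tool_use_check(
    "What's the weather in Paris?",
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    ["get_weather", "search_web"],
)
```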
Sage — Hallucination Detection
Uses reasoning to identify inaccurate outputs and logic faults.
| Property | Value |
|---|---|
| F1 | 0.834 |
| Latency | ~250ms |
| Cost | $0.01 / 1M tokens |
Hunter — PII Detection
Identifies and flags personally identifiable information to prevent data leaks.
| Property | Value |
|---|---|
| F1 | 0.834 |
| Latency | ~40ms |
| Cost | $0.01 / 1M tokens |
Magistrate — Policy Enforcement
Enforces custom rules, standards, and policies using natural language assertions.
| Property | Value |
|---|---|
| F1 | 0.835 |
| Latency | ~100ms |
| Cost | $0.01 / 1M tokens |
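Because Magistrate takes policies as natural language assertions, a check is just the output under test plus a list of plain-English rules. The payload shape below is an illustrative assumption, not the documented API:

```python
# Hypothetical policy-enforcement check for Magistrate. Each assertion is a
# plain-English rule the judge evaluates against the output.
def build_policy_check(output_text: str, assertions: list[str]) -> dict:
    if not assertions:
        raise ValueError("at least one assertion is required")
    return {
        "model": "magistrate",  # hypothetical model identifier
        "output": output_text,
        "assertions": assertions,
    }

check = build_policy_check(
    "Sure, I can help you reset your password.",
    [
        "The assistant must never ask for the user's current password.",
        "The assistant must keep a professional tone.",
    ],
)
```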
Deployment Options
Qualifire SLMs can be deployed in the way that fits your infrastructure and compliance requirements.
SaaS
Fully managed by Qualifire. No infrastructure to maintain — just send API requests.
Your Cloud
Deploy in your own cloud environment (AWS, GCP, Azure) for data residency and compliance needs.
On-Premise
Run entirely on your infrastructure for maximum control and air-gapped environments.
Qualifire models can be fine-tuned for your specific domain and policies. Contact our team to discuss custom model training for your use case.
Getting Started
Evaluations
Learn how to use these models through Qualifire’s evaluation system
SDK
Integrate SLM judges into your application with the Qualifire SDK
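A common integration pattern is to gate each model response on a judge verdict before it reaches the user. The sketch below shows that flow with a stub judge standing in for a real SLM call; the `Verdict` shape, threshold, and function names are assumptions, not the Qualifire SDK's actual interface:

```python
from dataclasses import dataclass

# Minimal sketch of gating an LLM response on a judge verdict.
@dataclass
class Verdict:
    label: str    # e.g. "safe" or "unsafe" (assumed label scheme)
    score: float  # judge confidence in [0, 1]

def guard(response: str, judge, threshold: float = 0.5) -> str:
    """Return the response unchanged, or a block message if the judge flags it."""
    verdict = judge(response)
    if verdict.label == "unsafe" and verdict.score >= threshold:
        return "[blocked by safety judge]"
    return response

# Stub judge standing in for a real SLM call over the network.
def stub_judge(text: str) -> Verdict:
    return Verdict("unsafe" if "password" in text else "safe", 0.9)

print(guard("Tell me your password", stub_judge))  # prints "[blocked by safety judge]"
print(guard("Hello there", stub_judge))            # prints "Hello there"
```

Keeping the judge behind a small `guard` function like this makes it easy to swap a specialist model (e.g. Sentinel for injection, Cleric for safety) without touching application code.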