Skip to main content

Evaluations

Evaluations are the fundamental building block for ensuring the integrity of your AI agent’s behavior. Each evaluation acts as a specialized check tailored to catch specific issues before they reach your users.

Evaluation Modes

Each evaluation can run in different modes, letting you choose the right balance for your use case:

Speed

~20ms latency. Simple pass/fail with the fastest possible response. Best for real-time guardrails and high-throughput systems.

Balanced

~100ms latency. Includes reasoning explanations while maintaining good performance. Best for production use with explanations.

Quality

~500ms latency. Uses larger models for the most thorough analysis. Best for detailed analysis, debugging, and experiments.

Evaluation Categories

Protect your AI system from attacks and prevent sensitive data exposure. Includes prompt injection detection and PII scanning.
Ensure your AI produces appropriate, non-harmful content across multiple safety categories including dangerous content, harassment, and hate speech.
Verify that your AI produces accurate, high-quality outputs. Includes hallucination detection, context grounding, and tool selection quality.
Enforce your custom rules and guardrails using natural language assertions. Define any policy and have it consistently enforced.
Ensure your AI stays on-topic by defining allowed topics. Detects when conversations drift outside the intended scope of your application.

Qualifire’s Small Language Models (SLMs) Judges

Qualifire employs a suite of fine-tuned, state-of-the-art Small Language Models (SLMs), each specialized for a specific evaluation task. This provides faster, more accurate, and more targeted analysis of agent behavior.
Detects prompt injection and jailbreak attempts that try to manipulate your AI into ignoring its instructions or behaving maliciously.Results:
  • BENIGN — Input is safe
  • INJECTION — Attack attempt detected
Use when: You need to protect against adversarial inputs trying to bypass your system prompt or guardrails.F1 Score: 0.988 | Latency: ~20ms
Evaluates content for harmful or inappropriate material across multiple safety categories.
CategoryDescription
Dangerous ContentViolence instructions, self-harm, harmful activities
HarassmentBullying, abuse, targeting individuals or groups
Sexually ExplicitAdult content, non-consensual sexual content
Hate SpeechDiscrimination, incitement against protected groups
Results:
  • SAFE — Content passes all safety checks
  • UNSAFE — Harmful content detected (includes which categories were triggered)
Use when: You need to ensure AI outputs don’t contain harmful, abusive, or inappropriate content.F1 Score: 0.946 | Latency: ~35ms
Verifies that responses are properly anchored in your provided reference material. Ensures claims are supported by source documents or the system prompt.Configuration:
  • Single-turn: Evaluates against the system prompt only
  • Multi-turn: Evaluates against the full conversation history
Results:
  • GROUNDED — Response is supported by the context
  • UNGROUNDED — Response makes claims not found in context
Use when: You have specific reference material (documents, knowledge bases) that responses should be based on.Balanced Accuracy: 98.48% | Latency: ~80ms
Evaluates whether your AI agent correctly selects and calls tools/functions. Catches wrong tool selection, invalid parameters, and incorrect parameter values.Results:
  • VALID_CALL — Tool call is correct
  • TOOL_ERROR — Wrong tool was selected
  • PARAM_NAME_ERROR — Invalid parameter name used
  • PARAM_VALUE_ERROR — Parameter value is incorrect
Use when: Your AI agent uses function calling and you need to ensure tools are invoked correctly.F1 Score: 0.945 | Latency: ~500ms
Evaluates whether content complies with your custom-defined policies and guardrails. Define any rule in natural language and enforce it consistently.Example assertions:
  • “Response must not provide medical advice”
  • “Always recommend consulting a professional for legal matters”
  • “Never disclose internal pricing information”
  • “Responses should be in a professional tone”
Configuration:
  • Target: Choose what to evaluate
    • input — Check only the user’s message
    • output — Check only the AI’s response
    • both — Check the entire conversation
Results:
  • COMPLIES — Content follows the policy
  • WARNING — Potential concern (borderline case)
  • VIOLATES — Content breaks the policy
Use when: You have specific business rules, compliance requirements, or behavioral guidelines your AI must follow.F1 Score: 0.835 | Latency: ~100ms
Identifies when your AI generates information that isn’t supported by the provided context. Catches fabricated facts, invented details, and unfaithful responses.Results:
  • NOT_HALLUCINATED — Response is faithful to the context
  • HALLUCINATED — Response contains unsupported claims
Use when: You need to ensure AI responses stick to the facts provided in the conversation or knowledge base.F1 Score: 0.8335 | Latency: ~250ms
Scans content for Personally Identifiable Information to prevent data leaks and ensure privacy compliance.Detected categories include:
  • Personal identifiers (name, date of birth, address)
  • Financial data (credit card, bank account, SSN)
  • Government IDs (passport, driver’s license, national ID)
  • Contact information (phone, email, IP address)
  • Healthcare data (health insurance ID)
Results:
  • NO_PII_FOUND — Content is clean
  • PII_FOUND — Sensitive data detected (includes the specific type and location)
Use when: You need to prevent PII from being stored, logged, or exposed in responses.F1 Score: 0.8335 | Latency: ~40ms

Combining Evaluations

You can run multiple evaluations simultaneously. The overall result passes only if all individual evaluations pass, giving you comprehensive coverage in a single check.
A typical production setup might include: - Prompt Injection — Block attacks on input - Content Moderation — Ensure safe outputs - Hallucinations — Verify accuracy - Custom Assertions — Enforce business rules

Bypass Behavior

When an evaluation can’t run due to missing requirements (e.g., no AI response yet for hallucination detection), it automatically bypasses with a pass result. This prevents evaluations from blocking your application when they don’t apply to the current context.
For code examples showing how to run evaluations, see the SDK documentation.