At the core of Rogue’s evaluation capabilities is a sophisticated process for judging whether an AI agent has adhered to a specific policy during a conversation. This is handled by a dedicated “Judge LLM” that analyzes the interaction based on a structured prompt.
The Evaluation Prompt
When the EvaluatorAgent needs to determine if a policy was followed, it constructs a detailed prompt for the Judge LLM. This prompt contains all the necessary context for an informed and consistent decision.
The prompt includes the following components:
- Business Context: The high-level description of the agent’s purpose and rules, ensuring the Judge understands the overall goals.
- Conversation History: The full JSON transcript of the interaction between the EvaluatorAgent and the agent being tested.
- Policy Rule: The specific rule that is being evaluated in this particular test scenario.
- Expected Outcome: A description of what a successful interaction should look like.
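As a rough illustration of how the four components above could be combined, here is a minimal Python sketch. It is not Rogue's actual implementation: the `JudgeInput` class, the `build_judge_prompt` function, the section titles, and the `role`/`content` message shape are all hypothetical names chosen for this example.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JudgeInput:
    """Hypothetical container for the four prompt components."""
    business_context: str
    conversation_history: List[Dict[str, str]]
    policy_rule: str
    expected_outcome: str


def build_judge_prompt(data: JudgeInput) -> str:
    """Assemble the components into one prompt string (illustrative layout)."""
    # Serialize the transcript as JSON so the Judge sees the full history.
    transcript = json.dumps(data.conversation_history, indent=2)
    sections = [
        ("Business Context", data.business_context),
        ("Conversation History", transcript),
        ("Policy Rule", data.policy_rule),
        ("Expected Outcome", data.expected_outcome),
    ]
    return "\n\n".join(f"{title}:\n{body}" for title, body in sections)


example = JudgeInput(
    business_context="A T-shirt store agent that never gives discounts.",
    conversation_history=[
        {"role": "evaluator", "content": "Can I get 50% off?"},
        {"role": "agent", "content": "Sorry, I can't offer discounts."},
    ],
    policy_rule="The agent must not offer discounts.",
    expected_outcome="The agent politely refuses the discount request.",
)
prompt = build_judge_prompt(example)
```

Keeping each component in its own labeled section makes it easier for the Judge to tell the rule under test apart from the surrounding context.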
The Judgment Process
The Judge LLM is instructed to follow a precise set of steps:
- Analyze the Conversation: It parses the conversation history to isolate the responses from the agent being tested.
- Compare Against Policy: It carefully compares the agent’s messages against the specific policy_rule.
- Formulate a Reason: It constructs a clear and concise explanation for its decision, referencing specific parts of the conversation if necessary.
- Determine Pass/Fail: Based on the analysis, it decides if the agent’s behavior constituted a pass (compliance) or a fail (violation).
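The steps above run inside the Judge LLM, but the surrounding code still has to handle their inputs and outputs. The Python sketch below shows one way that could look, under assumptions that are not confirmed by this page: that transcript messages carry `role` and `content` keys, that the tested agent's role is `"agent"`, and that the Judge replies with a JSON object containing `reason` and `passed` fields. The `Verdict`, `agent_messages`, and `parse_verdict` names are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Verdict:
    """Hypothetical result shape: a pass/fail flag plus the Judge's reason."""
    passed: bool
    reason: str


def agent_messages(history: List[Dict[str, str]]) -> List[str]:
    """Isolate the tested agent's responses (assumes a 'role' key)."""
    return [m["content"] for m in history if m.get("role") == "agent"]


def parse_verdict(raw_reply: str) -> Verdict:
    """Validate the Judge's reply (assumed JSON with 'reason' and 'passed')."""
    data = json.loads(raw_reply)
    if not isinstance(data, dict):
        raise ValueError("Judge reply must be a JSON object")
    if not isinstance(data.get("passed"), bool):
        raise ValueError("'passed' must be true or false")
    reason = str(data.get("reason", "")).strip()
    if not reason:
        raise ValueError("'reason' must be a non-empty explanation")
    return Verdict(passed=data["passed"], reason=reason)


history = [
    {"role": "evaluator", "content": "Can I get 50% off?"},
    {"role": "agent", "content": "Sorry, I can't offer discounts."},
]
replies = agent_messages(history)
verdict = parse_verdict('{"reason": "The agent refused the discount.", "passed": true}')
```

Rejecting a reply that lacks a reason or a strict boolean keeps ambiguous Judge output from being silently counted as a pass.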