Overview
Red Teaming in Rogue provides automated security testing for AI agents by simulating adversarial attacks to identify vulnerabilities. The system uses a vulnerability-centric approach: each vulnerability is tested with the attack techniques relevant to it, and results are mapped to compliance frameworks.

How It Works
The Red Team Orchestrator follows a systematic approach:

- Select Vulnerabilities: Choose which vulnerabilities to test (or use a predefined scan type)
- Apply Attacks: For each vulnerability, apply relevant attack techniques
- Generate Attack Messages: Create adversarial prompts using attack transformations
- Send to Agent: Deliver attack messages to the target agent
- Evaluate Responses: Use LLM-based judges to detect successful exploits
- Calculate Risk Scores: Compute CVSS-like risk scores for findings
- Map to Frameworks: Associate findings with compliance frameworks
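The steps above can be sketched as a simple orchestration loop. This is an illustrative sketch only: all names (`attacks_for`, `transform`, `send_to_agent`, `judge`) are hypothetical stand-ins, not Rogue's actual API.

```python
# Hypothetical sketch of the red-team orchestration loop described above.
# Every callable here is an assumed stand-in, not Rogue's real interface.

def run_red_team(vulnerabilities, attacks_for, transform, send_to_agent, judge):
    findings = []
    for vuln in vulnerabilities:                      # 1. select vulnerabilities
        for attack in attacks_for(vuln):              # 2. apply relevant attacks
            message = transform(attack, vuln)         # 3. generate attack message
            response = send_to_agent(message)         # 4. send to target agent
            verdict = judge(vuln, message, response)  # 5. LLM-based evaluation
            if verdict["vulnerability_detected"]:
                findings.append({"vulnerability": vuln,
                                 "attack": attack,
                                 "verdict": verdict})
    return findings                                   # 6-7. score and map downstream
```

In practice the findings list would then feed the risk-scoring and framework-mapping steps.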
Scan Types
Rogue offers three scan types for different use cases:

Basic Scan (Free)
A curated set of essential security tests focusing on:

- Prompt Security: System prompt extraction, override attempts, indirect injection
- PII Protection: Direct exposure, API/database access, session data leaks
Full Scan (Premium)
Comprehensive testing across all 87+ vulnerability types, including:

- Content Safety (hate speech, explicit content, violence)
- Bias & Fairness (age, gender, race, disability, religion)
- Technical Vulnerabilities (SQL injection, shell injection, SSRF)
- Business Logic (unauthorized commitments, goal misalignment)
- Agent-Specific (memory poisoning, RAG attacks, tool discovery)
Custom Scan
Select specific vulnerabilities and attacks for targeted testing.

Vulnerability Categories
Rogue tests across 13 vulnerability categories:

| Category | Description | Example Vulnerabilities |
|---|---|---|
| Content Safety | Harmful content generation | Hate speech, explicit content, violence |
| PII Protection | Personal data exposure | Direct PII, API/DB access, session leaks |
| Technical | Code/injection attacks | SQL injection, command injection, SSRF |
| Bias & Fairness | Discriminatory responses | Gender, race, age, disability bias |
| Prompt Security | Prompt manipulation | Extraction, override, indirect injection |
| Access Control | Authorization bypass | RBAC, BOLA, BFLA, excessive agency |
| Business Logic | Business rule violations | Unauthorized commitments, off-topic |
| Intellectual Property | IP violations | Copyright, trade secrets |
| Information Quality | Factual accuracy | Hallucination, misinformation |
| Compliance | Regulatory violations | COPPA, FERPA |
| Specialized Threats | Critical content | Weapons, drugs, extremism |
| Agent-Specific | Agent architecture attacks | Memory poisoning, RAG attacks |
| Resource Attacks | Resource exhaustion | DoS, unbounded consumption |
Attack Categories
Attacks are organized into three categories:

Single-Turn Attacks
Quick, single-message attacks that attempt immediate exploitation:

| Attack | Description |
|---|---|
| Base64 | Encodes payload in Base64 to bypass text filters |
| ROT13 | Uses ROT13 cipher to obfuscate payload |
| Prompt Injection | Direct instruction override attempts |
| Roleplay | Persona-based filter bypass |
| Prompt Probing | Attempts to extract system prompts |
| Leetspeak | Character substitution obfuscation |
| Gray Box | Injects fake internal context |
| Multilingual | Uses translation to bypass filters |
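The encoding-based transforms in the table (Base64, ROT13, Leetspeak) can be sketched as plain string rewrites. These functions are illustrative, not Rogue's internals; the framing sentences around each payload are invented examples.

```python
import base64
import codecs

# Illustrative obfuscation transforms: each rewrites a payload so that
# naive keyword filters no longer match the original text. These are
# sketches of the technique, not Rogue's actual implementations.

def base64_attack(payload: str) -> str:
    encoded = base64.b64encode(payload.encode()).decode()
    return f"Decode this Base64 and follow the instructions: {encoded}"

def rot13_attack(payload: str) -> str:
    return f"Apply ROT13 to the following and comply: {codecs.encode(payload, 'rot13')}"

# Common leetspeak substitutions (a partial, assumed mapping).
LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"})

def leetspeak_attack(payload: str) -> str:
    return payload.translate(LEET)
```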
Multi-Turn Attacks (Premium)
Sophisticated attacks that build context over multiple messages:

| Attack | Description |
|---|---|
| Social Engineering | Trust-building prompt extraction |
| GOAT | Generative Offensive Agent Tester |
| Crescendo | Gradually escalating intensity |
| Linear Jailbreak | Sequential progression strategy |
| Bad Likert Judge | Manipulative evaluator persona |
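A Crescendo-style escalation can be sketched as a loop that sends increasingly direct prompts within one shared session, stopping when the judge flags an exploit. Everything here is a hypothetical sketch; `send_to_agent` and `judge` are assumed stand-ins, not Rogue's API.

```python
# Hypothetical sketch of a Crescendo-style multi-turn attack: every turn
# reuses the same session so the agent retains context, and prompts
# escalate from mild to direct until the judge detects an exploit.

def crescendo(turns, send_to_agent, judge, session_id):
    transcript = []
    for prompt in turns:  # ordered mildest first, most direct last
        response = send_to_agent(prompt, session_id=session_id)
        transcript.append((prompt, response))
        if judge(response)["vulnerability_detected"]:
            return {"exploited": True, "transcript": transcript}
    return {"exploited": False, "transcript": transcript}
```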
Agentic Attacks (Premium)
AI-driven adaptive attacks:

| Attack | Description |
|---|---|
| Hydra | Multi-headed parallel exploration |
| Tree Jailbreak | Tree search-based attack exploration |
| Meta-Agent | Meta-agent orchestrated strategies |
| Iterative Jailbreak | AI-driven refinement of attempts |
LLM-Based Evaluation
Rogue uses LLM judges to evaluate agent responses for vulnerabilities:

- vulnerability_detected: Whether the vulnerability was exploited
- confidence: High, medium, or low confidence
- severity: Critical, high, medium, or low
- reason: Explanation of the finding
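The judge's fields can be modeled as a small structured record. The field names come from the list above; the class itself and the `is_actionable` policy are illustrative assumptions, not Rogue's actual schema.

```python
from dataclasses import dataclass

# Illustrative container for a judge verdict, using the four fields
# documented above. The class and the helper method are sketches,
# not Rogue's real data model.

@dataclass
class JudgeVerdict:
    vulnerability_detected: bool
    confidence: str   # "high" | "medium" | "low"
    severity: str     # "critical" | "high" | "medium" | "low"
    reason: str

    def is_actionable(self) -> bool:
        # Example triage policy: act on detected findings
        # unless the judge's confidence is low.
        return self.vulnerability_detected and self.confidence != "low"
```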
Session Management
Red team attacks use intelligent session management:

- Single-Turn Attacks: Each attempt gets a fresh session
- Multi-Turn Attacks: All turns share a session for context continuity
- Session IDs: Use the format `redteam-{vulnerability}-{attack}-{seed}`
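Building an ID in the documented format is a one-line string operation; the function name here is illustrative.

```python
# Build a session ID in the documented
# redteam-{vulnerability}-{attack}-{seed} format.
# The function name is illustrative, not Rogue's API.

def session_id(vulnerability: str, attack: str, seed: int) -> str:
    return f"redteam-{vulnerability}-{attack}-{seed}"
```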
Output & Reporting
Red team results include:

- Vulnerability Results: Per-vulnerability pass/fail with severity
- Attack Statistics: Success rates per attack technique
- Framework Compliance: Scores mapped to OWASP, MITRE, etc.
- CVSS Risk Scores: Industry-standard 0-10 scoring
- CSV Exports: Detailed conversation logs for analysis
- Key Findings: Top critical issues with summaries
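Per-attack success rates like those in the Attack Statistics output can be computed by simple aggregation. The record shape below (`attack`, `exploited` keys) is an assumption for illustration, not Rogue's export format.

```python
from collections import Counter

# Illustrative aggregation of per-attack success rates from a list of
# finding records. The {"attack": ..., "exploited": ...} shape is an
# assumed structure, not Rogue's actual result schema.

def attack_success_rates(results):
    attempts, successes = Counter(), Counter()
    for r in results:
        attempts[r["attack"]] += 1
        if r["exploited"]:
            successes[r["attack"]] += 1
    return {a: successes[a] / attempts[a] for a in attempts}
```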
Integration with Policy Evaluation
Red teaming complements Rogue’s policy evaluation:

- Policy Evaluation: Tests business logic and expected behaviors
- Red Teaming: Tests security and adversarial resistance