Running the Agent
The ARMs red-teaming agent is executed through src/main.py, which orchestrates the complete evaluation pipeline from initialization to result generation. This guide covers the agent structure, command-line interface, and execution flow.
Dynamic Memory Module
The Dynamic Memory Module is a core component of ARMs that adapts strategy selection by learning from attack history. It implements a layered memory architecture with ε-greedy exploration to balance exploitation of successful strategies against exploration of diverse attack patterns.
Memory Architecture
The memory system consists of three layers (sketched in code below the list):
- Episodic Memory: Stores individual attack instances with success/failure outcomes
- Strategy Memory: Tracks performance statistics for each attack strategy
- Similarity Memory: Maintains embeddings for behavior similarity matching
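To make this concrete, the sketch below shows one way the three layers could be represented. The dataclass names and fields are illustrative assumptions for exposition, not the actual Trajectory implementation.
# Illustrative three-layer memory structure. Names and fields are assumptions,
# not the actual Trajectory implementation.
from dataclasses import dataclass, field

@dataclass
class EpisodicEntry:
    behavior: str   # target behavior that was attacked
    strategy: str   # attack strategy used for this instance
    success: bool   # outcome of the attack

@dataclass
class StrategyStats:
    attempts: int = 0
    successes: int = 0

    @property
    def success_rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0

@dataclass
class LayeredMemory:
    episodic: list[EpisodicEntry] = field(default_factory=list)          # Episodic Memory
    strategies: dict[str, StrategyStats] = field(default_factory=dict)   # Strategy Memory
    embeddings: dict[str, list[float]] = field(default_factory=dict)     # Similarity Memory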
ε-Greedy Exploration Algorithm
The memory module uses an ε-greedy approach for strategy selection:
# Simplified ε-greedy strategy selection
import random

def select_strategy(self, behavior, epsilon_lambda=1.0):
    # Score the new behavior against past behaviors
    similarities = self.compute_similarities(behavior)
    # Retrieve the top-k most similar memories
    top_memories = self.retrieve_top_k(similarities)
    # ε-greedy choice: exploration probability is epsilon * epsilon_lambda ** total_queries
    if random.random() < self.epsilon * (epsilon_lambda ** self.total_queries):
        # Exploration: random strategy selection
        strategy = random.choice(self.available_strategies)
    else:
        # Exploitation: select the best-performing strategy from memory
        strategy = self.select_best_strategy(top_memories)
    return strategy
Memory Configuration Parameters
- top_k: Number of similar memories to retrieve for strategy selection
- alpha: Weighting factor for similarity scoring and strategy ranking
- epsilon_lambda: Decay rate for exploration probability over time (illustrated below)
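For intuition, the probability of exploring at query t is ε · epsilon_lambda^t, so an epsilon_lambda below 1.0 gradually shifts the agent from exploration toward exploitation. A quick illustration (the base ε of 0.5 is an assumed example value, not a documented default):
# Exploration schedule: P(explore at query t) = epsilon * epsilon_lambda ** t.
# epsilon = 0.5 is an assumed example value, not a documented default.
epsilon = 0.5
for epsilon_lambda in (1.0, 0.95):
    probs = [epsilon * epsilon_lambda ** t for t in (0, 10, 50)]
    print(epsilon_lambda, [round(p, 3) for p in probs])
# 1.0  -> [0.5, 0.5, 0.5]      exploration probability stays constant
# 0.95 -> [0.5, 0.299, 0.038]  exploration decays toward pure exploitation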
Memory Persistence
The memory state is automatically saved and can be loaded for future sessions:
# Save memory state after evaluation
# Automatically saved to: results/<timestamp>/memory.json

# Load existing memory for continued evaluation
python -m src.main \
    --dataset StrongReject \
    --preload "results/previous_run/memory.json" \
    --attacker_name 'gpt-4o-2024-11-20' \
    --victim_name claude-3-7-sonnet-20250219
Memory Analysis
The memory module provides insights into strategy effectiveness (see the analysis sketch after this list):
- Success Rates: Per-strategy attack success percentages
- Diversity Metrics: Coverage of different attack patterns
- Learning Curves: Performance improvement over time
- Strategy Preferences: Most frequently selected strategies
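For example, per-strategy success rates can be recomputed offline from a serialized memory file. The sketch below assumes a hypothetical schema in which memory.json holds an "episodic" list of entries with "strategy" and "success" fields; the real schema is defined by the Trajectory class.
# Hypothetical post-hoc analysis of a saved memory state. The schema assumed
# here ("episodic" entries with "strategy" and "success" fields) is illustrative.
import json
from collections import Counter

with open("results/previous_run/memory.json") as f:
    state = json.load(f)

attempts, successes = Counter(), Counter()
for entry in state.get("episodic", []):
    attempts[entry["strategy"]] += 1
    successes[entry["strategy"]] += int(entry["success"])

for strategy, n in attempts.most_common():
    print(f"{strategy}: {successes[strategy] / n:.1%} success over {n} attempts")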
Agent Structure
The main entry point src/main.py is structured around several key components:
Core Components
- Model Providers: Support for OpenAI, Anthropic, Together, and custom model endpoints
- Memory System: Layered memory with ε-greedy exploration for strategy selection
- MCP Integration: Dynamic loading and management of attack strategy servers
- Evaluation Framework: Automated assessment of attack success and response quality
- Data Management: Flexible input handling for datasets and risk definitions
Initialization Sequence
# Key initialization steps in main.py

# 1. Environment setup and result directories
timestamp = datetime.datetime.now().strftime("%m%d_%H%M%S%f")
res_dir = os.path.join('results', timestamp)
os.makedirs(res_dir, exist_ok=True)

# 2. Model initialization based on provider
if 'gpt' in victim_name.lower():
    victim_model = OpenAIModel(victim_name, victim_base_url)
elif 'claude' in victim_name.lower():
    victim_model = AnthropicModel(victim_name)
elif 'internvl' in victim_name.lower():
    victim_model = InternVLModel(victim_name, victim_base_url)

# 3. Memory system initialization
memory = Trajectory(top_k=top_k, alpha=alpha, epsilon_lambda=epsilon_lambda)

# 4. Attack strategy server setup: MCP servers are loaded from the configuration file
Command-Line Interface
The agent accepts numerous command-line parameters for flexible configuration:
| Parameter | Type | Description | Example |
|---|---|---|---|
| --dataset | str | Target dataset for evaluation | StrongReject |
| --generate_from_risks | flag | Generate behaviors from risk definitions | --generate_from_risks |
| --risk_input_file | str | Path to risk definition JSON file | datasets/EU_AI_Act/risk_definition.json |
| --attacker_name | str | Backbone model for red-teaming agent | gpt-4o-2024-11-20 |
| --attacker_base_url | str | Custom endpoint for attacker model | http://localhost:8000/v1 |
| --victim_name | str | Target model to evaluate | claude-3-7-sonnet-20250219 |
| --victim_base_url | str | Custom endpoint for victim model | http://localhost:8911/v1 |
| --mcp_config | str | MCP server configuration file | attack_lib/configs/full_config.json |
| --instance_per_category | int | Number of instances per risk category | 10 |
| --max_actions | int | Maximum actions per red-teaming session | 30 |
| --epoch | int | Number of evaluation epochs | 1 |
| --debug | flag | Run in debug mode (first instance only) | --debug |
Memory Configuration Parameters
| Parameter | Default | Description |
|---|---|---|
| --top_k | 3 | Number of top memories to retrieve for strategy selection |
| --alpha | 1.2 | Weighting factor for memory similarity scoring |
| --epsilon_lambda | 1.0 | Exploration decay rate in ε-greedy strategy selection |
| --preload | "" | Path to an existing memory state to preload |
Execution Flow
The agent follows a structured execution flow:
1. Initialization Phase
- Environment Setup: Creates timestamped result directories
- Model Loading: Initializes attacker, victim, and evaluator models
- Data Preparation: Loads datasets or generates behaviors from risk definitions
- Memory Initialization: Sets up the layered memory system
- MCP Server Startup: Launches attack strategy servers
2. Red-teaming Loop
# Simplified red-teaming loop structure
for category in dataset_categories:
    for instance in category_instances:
        # Strategy selection using memory
        selected_strategies = memory.retrieve_strategies(instance)
        for action_step in range(max_actions):
            # Generate an attack using the selected strategy
            attack_instance = strategy.generate_attack(
                behavior=instance.behavior,
                history=conversation_history,
            )
            # Query the victim model
            victim_response = victim_model.query(attack_instance)
            # Evaluate the response
            evaluation = evaluator.assess(victim_response, instance.behavior)
            # Update memory with the result
            memory.update(strategy, evaluation.success)
            # Terminate on success; the loop bound already enforces max_actions
            if evaluation.success:
                break
3. Evaluation and Logging
- Success Assessment: Automated evaluation of attack effectiveness
- Memory Updates: Strategy performance tracking for future selection
- Result Logging: Comprehensive logging of conversations and outcomes
- Progress Tracking: Real-time progress monitoring with timestamps
4. Result Compilation
- Statistics Generation: Attack success rates and performance metrics
- Memory Serialization: Saving memory state for future use
- Report Generation: Structured output files with evaluation results
Result Output
The agent generates comprehensive output in the results/<timestamp>/ directory:
Directory Structure
results/0210_143052123456/
├── .cache/ # Generated images during execution
│ ├── instance_001_step_1.png
│ ├── instance_001_step_2.png
│ └── ...
├── messages/ # Conversation logs with backbone model
│ ├── instance_001.json
│ ├── instance_002.json
│ └── ...
├── log.json # Main execution log
├── memory.json # Serialized memory state
└── results.json # Final evaluation results
Key Output Files
- log.json: Detailed logs of the red-teaming process for each instance, including strategy selection, attack generation, and evaluation steps
- memory.json: Complete state of the layered memory module, including strategy performance history and similarity embeddings
- results.json: Final evaluation results with attack success rates, strategy effectiveness, and summary statistics
- messages/: Individual conversation files containing the complete dialogue history between the agent and models
- .cache/: All images generated during the red-teaming process, organized by instance and step
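Because the layout is fixed, a run's artifacts can be enumerated programmatically; a minimal sketch using the example run directory from above:
# Enumerate a run's outputs using the directory layout described above.
from pathlib import Path

run_dir = Path("results/0210_143052123456")  # example run directory from above
conversations = sorted((run_dir / "messages").glob("instance_*.json"))
images = sorted((run_dir / ".cache").glob("*.png"))
print(f"{len(conversations)} conversation logs, {len(images)} generated images")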
Running Multiple Configurations
For systematic evaluation, you can run multiple configurations:
# Example: Multi-model evaluation
MODELS=("claude-3-7-sonnet-20250219" "gpt-4o-2024-11-20" "claude-3-5-sonnet-20241022")
for model in "${MODELS[@]}"; do
    python -m src.main \
        --dataset JailbreakBench \
        --attacker_name 'gpt-4o-2024-11-20' \
        --victim_name "$model" \
        --mcp_config "attack_lib/configs/full_config.json" \
        --instance_per_category 20 \
        --max_actions 30
done
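Each run writes to its own results/<timestamp>/ directory, and the timestamp includes microseconds, so the runs in this loop cannot overwrite one another's output.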
Use the --debug flag for initial testing and setup verification; it runs only the first instance per category, significantly reducing execution time.
The --max_actions parameter directly controls the compute budget per instance.