Running the Agent

The ARMs red-teaming agent is executed through src/main.py, which orchestrates the complete evaluation pipeline from initialization to result generation. This guide covers the agent structure, command-line interface, and execution flow.

Dynamic Memory Module

The Dynamic Memory Module is a core component of ARMs that enables adaptive strategy selection through intelligent learning from attack history. It implements a layered memory architecture with ε-greedy exploration to balance exploitation of successful strategies with exploration of diverse attack patterns.

Memory Architecture

The memory system consists of three layers:

ε-Greedy Exploration Algorithm

The memory module uses an ε-greedy approach for strategy selection:

# Simplified ε-greedy strategy selection
import random

def select_strategy(self, behavior, epsilon_lambda=1.0):
    # Score the new behavior against past behaviors in memory
    similarities = self.compute_similarities(behavior)

    # Retrieve the top-k most similar memory entries
    top_memories = self.retrieve_top_k(similarities)

    # ε-greedy selection: when epsilon_lambda < 1, the exploration
    # probability decays as the number of queries grows
    if random.random() < self.epsilon * (epsilon_lambda ** self.total_queries):
        # Exploration: pick a random strategy
        strategy = random.choice(self.available_strategies)
    else:
        # Exploitation: pick the best-performing strategy from memory
        strategy = self.select_best_strategy(top_memories)

    return strategy
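
The decay term `epsilon_lambda ** total_queries` shrinks the exploration probability geometrically over the run. A quick illustration (the epsilon and lambda values here are made up for the demo, not project defaults):

```python
# Illustrative only: how the exploration probability from the snippet above
# decays as total_queries grows.
def exploration_prob(epsilon, epsilon_lambda, total_queries):
    return epsilon * (epsilon_lambda ** total_queries)

for n in (0, 10, 50):
    print(f"after {n} queries: explore with p = {exploration_prob(0.3, 0.95, n):.3f}")
```

With epsilon_lambda = 1.0 (the default), the exploration rate stays constant instead of decaying.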

Memory Persistence

The memory state is automatically saved and can be loaded for future sessions:

# The memory state is saved automatically to: results/<timestamp>/memory.json

# Load an existing memory state to continue an evaluation
python -m src.main \
    --dataset StrongReject \
    --preload "results/previous_run/memory.json" \
    --attacker_name 'gpt-4o-2024-11-20' \
    --victim_name claude-3-7-sonnet-20250219
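
Internally, persistence amounts to serializing the memory state to JSON and reading it back. A minimal sketch, assuming a plain dict of strategy statistics; the real memory.json schema is defined by the Trajectory class and is not reproduced here:

```python
import json
import os

# Minimal sketch of persisting memory state between runs. The state layout
# used here ("strategies" with win/try counts) is an illustrative assumption.
def save_memory(state, res_dir):
    path = os.path.join(res_dir, "memory.json")
    with open(path, "w") as f:
        json.dump(state, f, indent=2)
    return path

def load_memory(path):
    with open(path) as f:
        return json.load(f)
```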

Memory Analysis

The memory module provides insights into strategy effectiveness:

Agent Structure

The main entry point src/main.py is structured around several key components:

Core Components

Initialization Sequence

# Key initialization steps in main.py
import os
import datetime

# 1. Environment setup and result directories
timestamp = datetime.datetime.now().strftime("%m%d_%H%M%S%f")
res_dir = os.path.join('results', timestamp)
os.makedirs(res_dir, exist_ok=True)

# 2. Model initialization based on provider
if 'gpt' in victim_name.lower():
    victim_model = OpenAIModel(victim_name, victim_base_url)
elif 'claude' in victim_name.lower():
    victim_model = AnthropicModel(victim_name)
elif 'internvl' in victim_name.lower():
    victim_model = InternVLModel(victim_name, victim_base_url)

# 3. Memory system initialization
memory = Trajectory(top_k=top_k, alpha=alpha, epsilon_lambda=epsilon_lambda)

# 4. Attack strategy server setup: MCP servers are loaded from the
# configuration file passed via --mcp_config
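
Step 4 boils down to parsing the JSON file passed via --mcp_config. A hypothetical sketch; the "mcpServers" key is an assumed schema, not verified against the project's config format:

```python
import json

# Hypothetical sketch of step 4: read the MCP server definitions from the
# file passed via --mcp_config. The "mcpServers" key is an assumption.
def load_mcp_servers(config_path):
    with open(config_path) as f:
        config = json.load(f)
    return config.get("mcpServers", {})
```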

Command-Line Interface

The agent accepts numerous command-line parameters for flexible configuration:

--dataset (str): Target dataset for evaluation. Example: StrongReject
--generate_from_risks (flag): Generate behaviors from risk definitions
--risk_input_file (str): Path to risk definition JSON file. Example: datasets/EU_AI_Act/risk_definition.json
--attacker_name (str): Backbone model for the red-teaming agent. Example: gpt-4o-2024-11-20
--attacker_base_url (str): Custom endpoint for the attacker model. Example: http://localhost:8000/v1
--victim_name (str): Target model to evaluate. Example: claude-3-7-sonnet-20250219
--victim_base_url (str): Custom endpoint for the victim model. Example: http://localhost:8911/v1
--mcp_config (str): MCP server configuration file. Example: attack_lib/configs/full_config.json
--instance_per_category (int): Number of instances per risk category. Example: 10
--max_actions (int): Maximum actions per red-teaming session. Example: 30
--epoch (int): Number of evaluation epochs. Example: 1
--debug (flag): Run in debug mode (first instance per category only)
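
Combining these flags, a run that generates behaviors from risk definitions could look like the following. The flag values are taken from the examples above; the exact combination is illustrative:

```shell
python -m src.main \
    --generate_from_risks \
    --risk_input_file datasets/EU_AI_Act/risk_definition.json \
    --attacker_name 'gpt-4o-2024-11-20' \
    --victim_name claude-3-7-sonnet-20250219 \
    --mcp_config attack_lib/configs/full_config.json \
    --instance_per_category 10 \
    --max_actions 30
```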

Memory Configuration Parameters

--top_k (default: 3): Number of top memories to retrieve for strategy selection
--alpha (default: 1.2): Weighting factor for memory similarity scoring
--epsilon_lambda (default: 1.0): Exploration decay rate in ε-greedy strategy selection
--preload (default: ""): Path to an existing memory state to preload
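
One plausible reading of the alpha parameter is as an exponent that sharpens the similarity weighting: values above 1 favor the closest matches more strongly. This is an illustrative assumption about the scoring, not the verified Trajectory implementation:

```python
# Illustrative sketch: alpha > 1 sharpens the preference for the most
# similar memories when weighting retrieved entries. The exact scoring
# inside Trajectory is an assumption here.
def weight_similarities(similarities, alpha=1.2):
    return [s ** alpha for s in similarities]

weights = weight_similarities([0.9, 0.5, 0.1])
```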

Execution Flow

The agent follows a structured execution flow:

1. Initialization Phase

2. Red-teaming Loop

# Simplified red-teaming loop structure

for category in dataset_categories:
    for instance in category_instances:

        # Strategy selection using memory
        selected_strategies = memory.retrieve_strategies(instance)
        strategy = selected_strategies[0]  # top-ranked strategy (simplified)

        for action_step in range(max_actions):

            # Generate an attack with the selected strategy
            attack_instance = strategy.generate_attack(
                behavior=instance.behavior,
                history=conversation_history
            )

            # Query the victim model
            victim_response = victim_model.query(attack_instance)

            # Evaluate the response
            evaluation = evaluator.assess(victim_response, instance.behavior)

            # Update memory with the outcome
            memory.update(strategy, evaluation.success)

            # Stop early on success; otherwise the loop ends after max_actions
            if evaluation.success:
                break

3. Evaluation and Logging

4. Result Compilation

Result Output

The agent generates comprehensive output in the results/<timestamp>/ directory:

Directory Structure

results/0210_143052123456/
├── .cache/                    # Generated images during execution
│   ├── instance_001_step_1.png
│   ├── instance_001_step_2.png
│   └── ...
├── messages/                  # Conversation logs with backbone model  
│   ├── instance_001.json
│   ├── instance_002.json
│   └── ...
├── log.json                   # Main execution log
├── memory.json               # Serialized memory state
└── results.json              # Final evaluation results
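
After a run, results.json can be post-processed for quick summaries. A sketch, assuming each entry carries a boolean "success" field; the actual output schema may differ:

```python
import json

# Sketch: compute an overall attack success rate from results.json.
# The per-instance "success" field is an assumption about the schema.
def success_rate(results_path):
    with open(results_path) as f:
        results = json.load(f)
    outcomes = [r.get("success", False) for r in results]
    return sum(outcomes) / max(len(outcomes), 1)
```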

Key Output Files

Running Multiple Configurations

For systematic evaluation, you can run multiple configurations:

# Example: Multi-model evaluation
MODELS=("claude-3-7-sonnet-20250219" "gpt-4o-2024-11-20" "claude-3-5-sonnet-20241022")

for model in "${MODELS[@]}"; do
    python -m src.main \
        --dataset JailbreakBench \
        --attacker_name 'gpt-4o-2024-11-20' \
        --victim_name "$model" \
        --mcp_config "attack_lib/configs/full_config.json" \
        --instance_per_category 20 \
        --max_actions 30
done

Performance Tip: Use the --debug flag for initial testing and setup verification. It runs only the first instance per category, significantly reducing execution time.

Resource Management: Monitor API usage during large evaluations. The --max_actions parameter directly controls the compute budget per instance.