ARMs Documentation
ARMs (Adaptive Red-Teaming Agent against Multimodal Models) is a comprehensive red-teaming framework designed to systematically evaluate the safety and robustness of vision-language models (VLMs). Built on the Model Context Protocol (MCP), ARMs provides a plug-and-play architecture for integrating diverse attack strategies and conducting thorough safety assessments.
Overview of ARMs
ARMs addresses critical challenges in multimodal AI safety evaluation through the following key capabilities:
- Systematic Red-teaming: Automated generation and optimization of diverse attack strategies against VLMs
- Memory-Driven Adaptation: Dynamic strategy selection based on historical success patterns and exploration needs
- Comprehensive Coverage: Evaluation across 71 risk categories with both instance-based and policy-based assessments
- Extensible Framework: Easy integration of new attack strategies, victim models, and evaluation metrics
- Real-world Alignment: Risk definitions and evaluations based on established regulatory and safety frameworks
Installation
ARMs requires Python 3.10+ and several dependencies for multimodal processing and model integration.
Environment Setup
Create and activate a conda environment:
conda create -n ARMs python=3.10
conda activate ARMs
Install Dependencies
Install the required packages:
pip install -r requirements.txt
Environment Configuration
Set up the Python path and API keys:
# Set Python path (replace with your actual path)
export PYTHONPATH=/your_path/ARMs-preview:$PYTHONPATH
# Configure API keys
export OPENAI_API_KEY=
export ANTHROPIC_API_KEY=
export TOGETHER_API_KEY=
Replace /your_path/ARMs-preview with the actual path to your ARMs directory, and configure the API keys for the model providers you plan to use.
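A quick sanity check that the configuration took effect (hypothetical checks; importing src assumes the repository root is on PYTHONPATH as set above):
# Verify the environment before running evaluations
echo $PYTHONPATH                                    # should include your ARMs-preview directory
python -c "import src; print('ok')"                 # the package invoked below via `python -m src.main`
test -n "$OPENAI_API_KEY" && echo "OPENAI_API_KEY is set"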
Quickstart
Here's how to get started with ARMs for red-teaming evaluation:
Basic Usage
Run a debug evaluation on the StrongReject dataset:
python -m src.main \
--dataset StrongReject \
--attacker_name 'gpt-4o-2024-11-20' \
--victim_name claude-3-7-sonnet-20250219 \
--mcp_config "attack_lib/configs/full_config.json" \
--instance_per_category 10 \
--max_actions 30 \
--debug
This command will:
- Use GPT-4o as the red-teaming agent backbone
- Target Claude-3.7-Sonnet as the victim model
- Load all 17 attack strategies from the full configuration
- Run in debug mode (first instance per category only)
- Allow up to 30 actions per red-teaming session
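To run the full evaluation afterwards, the same command can be reused without the --debug flag (a sketch; every other parameter is unchanged from the example above):
python -m src.main \
--dataset StrongReject \
--attacker_name 'gpt-4o-2024-11-20' \
--victim_name claude-3-7-sonnet-20250219 \
--mcp_config "attack_lib/configs/full_config.json" \
--instance_per_category 10 \
--max_actions 30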
Key Parameters
- --dataset: Choose from [StrongReject, JailbreakBench, JailbreakV, OWASP, EU-AI-Act, FINRA]
- --attacker_name: Backbone model for the red-teaming agent (recommended: 'gpt-4o-2024-11-20')
- --victim_name: Target model to evaluate
- --mcp_config: Configuration file defining available attack strategies
- --max_actions: Compute budget (number of queries and tool calls allowed)
- --debug: Run on first instance per category for quick testing
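As an illustration, a policy-based evaluation against the EU-AI-Act dataset might be launched as follows (an assumed combination of the parameters listed above; the config path and instance count simply mirror the quickstart):
# Policy-based red-teaming run (illustrative parameter choices)
python -m src.main \
--dataset EU-AI-Act \
--attacker_name 'gpt-4o-2024-11-20' \
--victim_name gpt-4o-2024-11-20 \
--mcp_config "attack_lib/configs/full_config.json" \
--instance_per_category 10 \
--max_actions 30 \
--debug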
Supported Models
Attacker models (backbone):
- gpt-4o-2024-11-20 (recommended)
- claude-3-7-sonnet-20250219
- Qwen/Qwen3-8B (requires local endpoint)
Victim models:
- gpt-4o-2024-11-20
- claude-opus-4-20250514-v1:0
- claude-3-7-sonnet-20250219
- claude-3-5-sonnet-20241022
- InternVL3-2B/8B/14B/38B (requires vLLM endpoint)
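The InternVL3 victims (and the Qwen3-8B attacker backbone) are accessed through a locally hosted endpoint. A minimal sketch using vLLM's OpenAI-compatible server, assuming the Hugging Face model id OpenGVLab/InternVL3-8B and the default host/port; how ARMs is pointed at the endpoint (e.g., an environment variable or config entry) is not covered here:
# Serve an InternVL3 model locally for use as a victim endpoint
vllm serve OpenGVLab/InternVL3-8B --port 8000 --trust-remote-code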
The --debug flag is recommended for first-time usage, as it provides a quick run-through of the pipeline. Remove it for full evaluations across all instances.
Results Structure
After running ARMs, results are saved in results/<timestamp>/ with the following structure:
- .cache/ - All generated images during the process
- messages/ - Conversation logs with the backbone model
- log.json - Main red-teaming process logs for each instance
- memory.json - Layered memory module contents
- results.json - Final evaluation results for each instance
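To take a quick look at the most recent run (a sketch; the internal schema of results.json is not documented here, so only the file layout above is assumed):
# Inspect the newest timestamped results directory
latest=$(ls -td results/*/ | head -n 1)                   # most recent run
ls "$latest"                                              # .cache/, messages/, log.json, memory.json, results.json
python -m json.tool "$latest/results.json" | head -n 40   # pretty-print the final evaluation results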