ARMs Documentation
ARMs (Adaptive Red-Teaming Agent against Multimodal Models) is a comprehensive red-teaming framework designed to systematically evaluate the safety and robustness of vision-language models (VLMs). Built on the Model Context Protocol (MCP), ARMs provides a plug-and-play architecture for integrating diverse attack strategies and conducting thorough safety assessments.
Overview of ARMs
ARMs addresses critical challenges in multimodal AI safety evaluation through the following key capabilities:
- Systematic Red-teaming: Automated generation and optimization of diverse attack strategies against VLMs
- Memory-Driven Adaptation: Dynamic strategy selection based on historical success patterns and exploration needs
- Comprehensive Coverage: Evaluation across 71 risk categories with both instance-based and policy-based assessments
- Extensible Framework: Easy integration of new attack strategies, victim models, and evaluation metrics
- Real-world Alignment: Risk definitions and evaluations based on established regulatory and safety frameworks
Installation
ARMs requires Python 3.10+ and several dependencies for multimodal processing and model integration.
Environment Setup
Create and activate a conda environment:
conda create -n ARMs python=3.10
conda activate ARMs
Install Dependencies
Install the required packages:
pip install -r requirements.txt
Environment Configuration
Set up the Python path and API keys:
# Set Python path (replace with your actual path)
export PYTHONPATH=/your_path/ARMs-preview:$PYTHONPATH
# Configure API keys
export OPENAI_API_KEY=
export ANTHROPIC_API_KEY=
export TOGETHER_API_KEY=
Replace /your_path/ARMs-preview with the actual path to your ARMs directory, and configure the API keys for the model providers you plan to use.
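A quick sanity check that the configuration took effect (hypothetical checks; importing src assumes the repository root is on PYTHONPATH as set above):
# Verify the environment before running evaluations
echo $PYTHONPATH                                    # should include your ARMs-preview directory
python -c "import src; print('ok')"                 # the package invoked below via `python -m src.main`
test -n "$OPENAI_API_KEY" && echo "OPENAI_API_KEY is set"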
Quickstart
Here's how to get started with ARMs for red-teaming evaluation:
Basic Usage
Run a debug evaluation on the StrongReject dataset:
python -m src.main \
--dataset StrongReject \
--attacker_name 'gpt-4o-2024-11-20' \
--victim_name claude-3-7-sonnet-20250219 \
--mcp_config "attack_lib/configs/full_config.json" \
--instance_per_category 10 \
--max_actions 30 \
--debug
This command will:
- Use GPT-4o as the red-teaming agent backbone
- Target Claude-3.7-Sonnet as the victim model
- Load all 17 attack strategies from the full configuration
- Run in debug mode (first instance per category only)
- Allow up to 30 actions per red-teaming session
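To run the full evaluation afterwards, the same command can be reused without the --debug flag (a sketch; every other parameter is unchanged from the example above):
python -m src.main \
--dataset StrongReject \
--attacker_name 'gpt-4o-2024-11-20' \
--victim_name claude-3-7-sonnet-20250219 \
--mcp_config "attack_lib/configs/full_config.json" \
--instance_per_category 10 \
--max_actions 30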
Key Parameters
- --dataset: Choose from [StrongReject, JailbreakBench, JailbreakV, OWASP, EU-AI-Act, FINRA]
- --attacker_name: Backbone model for the red-teaming agent (recommended: 'gpt-4o-2024-11-20')
- --victim_name: Target model to evaluate
- --mcp_config: Configuration file defining available attack strategies
- --max_actions: Compute budget (number of queries and tool calls allowed)
- --debug: Run on first instance per category for quick testing
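As an illustration, a policy-based evaluation against the EU-AI-Act dataset might be launched as follows (an assumed combination of the parameters listed above; the config path and instance count simply mirror the quickstart):
# Policy-based red-teaming run (illustrative parameter choices)
python -m src.main \
--dataset EU-AI-Act \
--attacker_name 'gpt-4o-2024-11-20' \
--victim_name gpt-4o-2024-11-20 \
--mcp_config "attack_lib/configs/full_config.json" \
--instance_per_category 10 \
--max_actions 30 \
--debug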
Supported Models
Attacker models (backbone):
- gpt-4o-2024-11-20 (recommended)
- claude-3-7-sonnet-20250219
- Qwen/Qwen3-8B (requires local endpoint)
Victim models:
- gpt-4o-2024-11-20
- claude-opus-4-20250514-v1:0
- claude-3-7-sonnet-20250219
- claude-3-5-sonnet-20241022
- InternVL3-2B/8B/14B/38B (requires vLLM endpoint)
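The InternVL3 victims (and the Qwen3-8B attacker backbone) are accessed through a locally hosted endpoint. A minimal sketch using vLLM's OpenAI-compatible server, assuming the Hugging Face model id OpenGVLab/InternVL3-8B and the default host/port; how ARMs is pointed at the endpoint (e.g., an environment variable or config entry) is not covered here:
# Serve an InternVL3 model locally for use as a victim endpoint
vllm serve OpenGVLab/InternVL3-8B --port 8000 --trust-remote-code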
The --debug flag is recommended for first-time usage, as it provides a quick run-through of the pipeline. Remove it for full evaluations across all instances.
Results Structure
After running ARMs, results are saved in results/<timestamp>/ with the following structure:
- .cache/ - All generated images during the process
- messages/ - Conversation logs with the backbone model
- log.json - Main red-teaming process logs for each instance
- memory.json - Layered memory module contents
- results.json - Final evaluation results for each instance
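To take a quick look at the most recent run (a sketch; the internal schema of results.json is not documented here, so only the file layout above is assumed):
# Inspect the newest timestamped results directory
latest=$(ls -td results/*/ | head -n 1)                   # most recent run
ls "$latest"                                              # .cache/, messages/, log.json, memory.json, results.json
python -m json.tool "$latest/results.json" | head -n 40   # pretty-print the final evaluation results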