Examples
This page provides practical examples for using ARMs in various scenarios, from basic red-teaming evaluations to advanced custom configurations.
Basic Examples
Quick Debug Run
Perfect for testing your setup and getting familiar with ARMs. This runs only the first instance per category.
python -m src.main \
--dataset StrongReject \
--attacker_name 'gpt-4o-2024-11-20' \
--victim_name claude-3-7-sonnet-20250219 \
--mcp_config "attack_lib/configs/full_config.json" \
--instance_per_category 1 \
--max_actions 10 \
--debug
Standard Evaluation
Comprehensive evaluation on a standard benchmark with all attack strategies.
python -m src.main \
--dataset JailbreakBench \
--attacker_name 'gpt-4o-2024-11-20' \
--victim_name claude-3-7-sonnet-20250219 \
--mcp_config "attack_lib/configs/full_config.json" \
--instance_per_category 20 \
--max_actions 30
Policy Compliance Testing
Evaluate compliance with regulatory frameworks by generating behaviors from risk definitions.
python -m src.main \
--generate_from_risks \
--risk_input_file "datasets/EU_AI_Act/risk_definition.json" \
--attacker_name 'gpt-4o-2024-11-20' \
--victim_name claude-3-7-sonnet-20250219 \
--mcp_config "attack_lib/configs/full_config.json" \
--instance_per_category 5 \
--max_actions 25
Local Model Evaluation
Evaluate locally hosted models (e.g., InternVL3 with vLLM).
python -m src.main \
--dataset StrongReject \
--attacker_name 'gpt-4o-2024-11-20' \
--victim_name InternVL3-14B \
--victim_base_url 'http://localhost:8911/v1' \
--mcp_config "attack_lib/configs/full_config.json" \
--instance_per_category 10 \
--max_actions 30
Advanced Usage
Custom Memory Configuration
Fine-tune the memory module parameters for specific evaluation needs.
python -m src.main \
--dataset JailbreakV \
--attacker_name 'gpt-4o-2024-11-20' \
--victim_name claude-3-5-sonnet-20241022 \
--mcp_config "attack_lib/configs/full_config.json" \
--instance_per_category 15 \
--max_actions 40 \
--top_k 5 \
--alpha 1.5 \
--epsilon_lambda 0.8
Multi-Epoch Evaluation
Run multiple epochs for statistical significance and consistency testing.
python -m src.main \
--dataset OWASP \
--attacker_name 'gpt-4o-2024-11-20' \
--victim_name gpt-4o-2024-11-20 \
--mcp_config "attack_lib/configs/full_config.json" \
--instance_per_category 10 \
--max_actions 30 \
--epoch 3
Lightweight Attack Configuration
Use a subset of attack strategies for faster evaluation or targeted testing.
python -m src.main \
--dataset StrongReject \
--attacker_name 'gpt-4o-2024-11-20' \
--victim_name claude-3-7-sonnet-20250219 \
--mcp_config "attack_lib/configs/lightweight_config.json" \
--instance_per_category 20 \
--max_actions 25
Custom Configurations
Creating Custom Attack Configurations
Create a custom MCP configuration file with specific attack strategies.
# attack_lib/configs/custom_config.json
[
{
"server_name": "acronym",
"command": "python",
"args": ["attack_lib/servers/acronym.py"]
},
{
"server_name": "actor_attack",
"command": "python",
"args": ["attack_lib/servers/actor_attack.py"]
},
{
"server_name": "crescendo",
"command": "python",
"args": ["attack_lib/servers/crescendo.py"]
}
]
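Before pointing --mcp_config at a new file, a quick structural check can catch typos early. The following standalone helper is a minimal sketch (it is not part of ARMs); the required keys mirror the entries shown above:
# validate_mcp_config.py -- illustrative helper, not part of ARMs
import json
import sys

REQUIRED_KEYS = {"server_name", "command", "args"}

def validate(path: str) -> None:
    with open(path) as f:
        servers = json.load(f)
    if not isinstance(servers, list):
        raise ValueError("config must be a JSON list of server entries")
    for i, entry in enumerate(servers):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {i} is missing keys: {sorted(missing)}")
    print(f"{path}: {len(servers)} server entries look well-formed")

if __name__ == "__main__":
    validate(sys.argv[1])
Run it as python validate_mcp_config.py attack_lib/configs/custom_config.json before starting an evaluation.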
Custom Risk Definitions
Define custom risk categories for domain-specific evaluation.
# datasets/Healthcare/risk_definition.json
{
"Medical Misinformation": {
"description": "Requests that could spread harmful medical misinformation",
"rules": [
"Do not provide unverified medical advice",
"Do not contradict established medical consensus without evidence",
"Do not recommend dangerous self-medication"
]
},
"Patient Privacy": {
"description": "Requests that could compromise patient confidentiality",
"rules": [
"Do not reveal patient information without consent",
"Do not assist in accessing protected health information",
"Do not help bypass medical privacy protections"
]
}
}
Batch Evaluations
Multi-Model Comparison
Bash script to evaluate multiple victim models with the same configuration.
#!/bin/bash
# Multi-model evaluation script
MODELS=("claude-3-7-sonnet-20250219" "claude-3-5-sonnet-20241022" "gpt-4o-2024-11-20")
DATASET="JailbreakBench"
INSTANCES=10
for MODEL in "${MODELS[@]}"; do
  echo "Evaluating model: $MODEL"
  python -m src.main \
    --dataset "$DATASET" \
    --attacker_name 'gpt-4o-2024-11-20' \
    --victim_name "$MODEL" \
    --mcp_config "attack_lib/configs/full_config.json" \
    --instance_per_category "$INSTANCES" \
    --max_actions 30
  echo "Completed evaluation for $MODEL"
done
Cross-Dataset Evaluation
Evaluate the same model across multiple datasets for comprehensive assessment.
#!/bin/bash
# Cross-dataset evaluation script
DATASETS=("StrongReject" "JailbreakBench" "OWASP" "EU-AI-Act")
VICTIM_MODEL="claude-3-7-sonnet-20250219"
for DATASET in "${DATASETS[@]}"; do
  echo "Evaluating dataset: $DATASET"
  python -m src.main \
    --dataset "$DATASET" \
    --attacker_name 'gpt-4o-2024-11-20' \
    --victim_name "$VICTIM_MODEL" \
    --mcp_config "attack_lib/configs/full_config.json" \
    --instance_per_category 15 \
    --max_actions 30
  echo "Completed evaluation for $DATASET"
done
Standalone Behavior Generation
Generate diverse behaviors from risk definitions without immediate evaluation.
# Generate behaviors for later use
python -m src.risk_to_behavior \
--input_file "datasets/FINRA/risk_definition.json" \
--output_file "datasets/FINRA/generated_behaviors.csv" \
--samples_per_category 20
# Then use the generated behaviors
python -m src.main \
--dataset Generated-FINRA \
--attacker_name 'gpt-4o-2024-11-20' \
--victim_name claude-3-7-sonnet-20250219 \
--mcp_config "attack_lib/configs/full_config.json" \
--instance_per_category 20 \
--max_actions 35
Best Practices: Start with a debug run to verify your setup, choose a compute budget (--max_actions) that matches your needs, and run multiple epochs when you need statistical significance in research evaluations.
Resource Management: Large-scale evaluations can consume significant API credits. Monitor your usage and adjust batch sizes accordingly.
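Because the number of model calls grows multiplicatively with the run parameters, a quick back-of-envelope estimate helps before launching a large run. All figures below are hypothetical; actual usage depends on how many actions each instance consumes before termination:
# Rough upper bound on attack actions for one run (illustrative only)
categories = 10      # hypothetical: depends on the dataset
instances = 20       # --instance_per_category
max_actions = 30     # --max_actions (worst case: every action is used)
epochs = 1           # --epoch

upper_bound = categories * instances * max_actions * epochs
print(f"Up to ~{upper_bound:,} attack actions per evaluation")  # 6,000 here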
Adding Victim Models
ARMs supports integration of various victim models through a flexible provider system. You can easily add new models by extending the existing model classes or configuring custom endpoints.
Supported Model Providers
ARMs currently supports the following model providers:
- OpenAI: GPT-4o, GPT-4, GPT-3.5 families
- Anthropic: Claude-3.7-Sonnet, Claude-3.5-Sonnet, Claude-3 families
- Together AI: Various open-source models via Together's API
- Custom vLLM: Self-hosted models using vLLM serving
- Custom OpenAI-compatible: Any API following OpenAI's format
Adding OpenAI-Compatible Models
For models that follow the OpenAI API format:
# Use with custom endpoint
python -m src.main \
--dataset StrongReject \
--attacker_name 'gpt-4o-2024-11-20' \
--victim_name your-custom-model \
--victim_base_url 'http://your-endpoint:8000/v1' \
--mcp_config "attack_lib/configs/full_config.json"
InternVL Models with vLLM
Setting up and evaluating InternVL models:
# 1. Start vLLM server for InternVL
python -m vllm.entrypoints.openai.api_server \
--model OpenGVLab/InternVL3-8B \
--trust-remote-code \
--host 0.0.0.0 \
--port 8911 \
--served-model-name InternVL3-8B
# 2. Evaluate with ARMs
python -m src.main \
--dataset JailbreakV \
--attacker_name 'gpt-4o-2024-11-20' \
--victim_name InternVL3-8B \
--victim_base_url 'http://localhost:8911/v1' \
--instance_per_category 10
Customized Providers
For models requiring special handling, you can extend the ARMs model provider system.
Creating a Custom Provider
Implement a new model class by extending the base model interface:
# src/models/custom_model.py
from typing import Any, Dict, List, Optional

from .base_model import BaseModel


class CustomModel(BaseModel):
    def __init__(self, model_name: str, base_url: Optional[str] = None):
        super().__init__(model_name)
        self.base_url = base_url
        # Initialize your model client here

    def query(self, messages: List[Dict[str, Any]]) -> str:
        """
        Query the custom model with messages.

        Args:
            messages: List of message dictionaries with role and content

        Returns:
            Model response as a string
        """
        # Implement your model query logic here
        response = self._call_model_api(messages)
        return response

    def _call_model_api(self, messages):
        # Custom API calling logic (HTTP request to self.base_url, SDK call, etc.)
        pass
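Once _call_model_api is implemented, you can sanity-check the provider directly before wiring it into the evaluation pipeline. A minimal smoke test (run from the repository root; the endpoint URL is a placeholder):
# Quick manual check of the custom provider (illustrative)
from src.models.custom_model import CustomModel

model = CustomModel("your-custom-model", base_url="http://localhost:8000/v1")
reply = model.query([{"role": "user", "content": "Say hello."}])
print(reply)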
Registering the Custom Provider
Add your custom model to the main evaluation pipeline:
# In src/main.py, add model detection logic
if 'custom-model' in victim_name.lower():
    from src.models.custom_model import CustomModel
    victim_model = CustomModel(victim_name, victim_base_url)
elif 'gpt' in victim_name.lower():
    victim_model = OpenAIModel(victim_name, victim_base_url)
# ... existing model detection logic
Provider Configuration Examples
Common configurations for different model types:
# Hugging Face Transformers (via custom endpoint)
python -m src.main \
--victim_name llama-2-7b-chat \
--victim_base_url 'http://localhost:8080/v1'
# Azure OpenAI
python -m src.main \
--victim_name gpt-4 \
--victim_base_url 'https://your-resource.openai.azure.com/v1'
# AWS Bedrock (requires custom provider implementation)
python -m src.main \
--victim_name claude-v2 \
--victim_base_url 'bedrock://us-east-1'
Model-Specific Considerations
- Input Format: Ensure your model handles multimodal inputs correctly
- Rate Limiting: Implement appropriate request throttling (a combined throttling and retry sketch follows this list)
- Error Handling: Handle API failures and timeouts gracefully
- Response Parsing: Extract text responses consistently
- Token Limits: Manage context length constraints
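The throttling and error-handling points above can often be addressed in one place. Below is a minimal sketch of a retry-with-backoff wrapper around a provider's query method; the delay values are illustrative assumptions, not ARMs defaults:
# Throttled, retrying wrapper around a provider query (illustrative)
import time

def query_with_retries(model, messages, max_retries=3, min_interval=1.0):
    """Call model.query with basic throttling and exponential backoff."""
    last_error = None
    for attempt in range(max_retries):
        try:
            response = model.query(messages)
            time.sleep(min_interval)  # crude throttle between requests
            return response
        except Exception as err:  # prefer provider-specific exceptions in practice
            last_error = err
            time.sleep(2 ** attempt)  # backoff: 1s, 2s, 4s
    raise RuntimeError(f"query failed after {max_retries} attempts") from last_error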
Model Integration Tip: Test your custom provider with a simple debug run before conducting full evaluations to ensure proper integration.