Our Benchmarking Agent is currently in beta; it will become the standard way to measure AI performance across model providers.

ARC Harness (arcagi3)

This is a developer harness for building and benchmarking agentic research workflows on the ARC-AGI-3 corpus of environments.

When to use it

  • Compare model versions or prompt strategies on the same game set.
  • Detect regressions after code or prompt changes.
  • Generate official scorecards and replays for sharing.
  • Experiment with multiple custom agentic architectures.

Quickstart

Prerequisites

  • Python: 3.9+
  • uv: the recommended package manager. Install it from uv.pm or with curl -LsSf https://astral.sh/uv/install.sh | sh (a quick version check is shown after this list)
  • ARC-AGI-3 API key: required to talk to the ARC server. Sign up for a key at three.arcprize.org
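
You can confirm that Python and uv are both available before continuing:
python --version
uv --version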

Install

Clone the repository:
git clone git@github.com:arcprize/arc-agi-3-benchmarking.git
cd arc-agi-3-benchmarking
From repo root:
uv venv
uv sync
This creates a virtual environment (if needed) and installs the project and dependencies in editable mode. Alternatively, without uv:
pip install -e .
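If you used uv and want to run commands without the uv run prefix, activate the environment it created (uv venv places it in .venv by default):
source .venv/bin/activate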

Setting up your environment

Set the ARC API key and your provider keys. You can put them in a .env file (see .env.example) or export them in your shell. Then check your configuration:
uv run python -m arcagi3.runner --check
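If you go the .env route, a minimal file might look like the sketch below. The exact variable names are defined in .env.example; the names shown here are illustrative assumptions, not confirmed settings:
# Illustrative .env sketch -- check .env.example for the actual variable names
ARC_API_KEY=your-arc-agi-3-key
OPENROUTER_API_KEY=your-openrouter-key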

Select your game

uv run python -m arcagi3.runner --list-games
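If the list is long, standard shell tools work for filtering it (this assumes one game per output line):
uv run python -m arcagi3.runner --list-games | grep ls20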

Pick your model

uv run python -m arcagi3.runner --list-models

Benchmark

uv run python -m arcagi3.runner \
  --game_id ls20 \
  --config gpt-5-2-openrouter \
  --max_actions 3
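
This example points the gpt-5-2-openrouter configuration at the ls20 game and caps the run at three actions; for a real benchmark you will likely want a much higher --max_actions so the agent has room to play the game to completion.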

Scorecards

When you run a benchmark, a scorecard is saved on the ARC server. If you are logged in, you can view your scorecards at three.arcprize.org/scorecards.

Learn More

The benchmarking README has more information than what is published here. Be sure to also read the guide on how to create your own agent to start experimenting with new agentic architectures.