Our Benchmarking Agent is currently in beta; it will become the standard way to measure AI performance across model providers.

ARC Harness (arcagi3)

This is a developer harness for building and benchmarking agentic research workflows on the ARC-AGI-3 corpus of environments.

When to use it

  • Compare model versions or prompt strategies on the same game set.
  • Detect regressions after code or prompt changes.
  • Generate official scorecards and replays for sharing.
  • Experiment with multiple custom agentic architectures.

Quickstart

Prerequisites

  • Python: 3.9+
  • uv: the recommended package manager. Install it from uv.pm or with curl -LsSf https://astral.sh/uv/install.sh | sh (a quick version check is shown after this list)
  • ARC-AGI-3 API key: required to talk to the ARC server. Sign up for a key at three.arcprize.org
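
You can confirm that Python and uv are both available before continuing:
python --version
uv --version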

Install

Clone the repository:
git clone git@github.com:arcprize/arc-agi-3-benchmarking.git
cd arc-agi-3-benchmarking
From repo root:
uv venv
uv sync
This creates a virtual environment (if needed) and installs the project and dependencies in editable mode. Alternatively, without uv:
pip install -e .
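If you used uv and want to run commands without the uv run prefix, activate the environment it created (uv venv places it in .venv by default):
source .venv/bin/activate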

Setting up your environment

Set the ARC API key and your provider keys. You can put them in a .env file (see .env.example) or export them in your shell. Then check your configuration:
uv run python -m arcagi3.runner --check
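If you go the .env route, a minimal file might look like the sketch below. The exact variable names are defined in .env.example; the names shown here are illustrative assumptions, not confirmed settings:
# Illustrative .env sketch -- check .env.example for the actual variable names
ARC_API_KEY=your-arc-agi-3-key
OPENROUTER_API_KEY=your-openrouter-key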

Select your game

uv run python -m arcagi3.runner --list-games
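If the list is long, standard shell tools work for filtering it (this assumes one game per output line):
uv run python -m arcagi3.runner --list-games | grep ls20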

Pick your model

uv run python -m arcagi3.runner --list-models

Benchmark

uv run python -m arcagi3.runner \
  --game_id ls20 \
  --config gpt-5-2-openrouter \
  --max_actions 3
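
This example points the gpt-5-2-openrouter configuration at the ls20 game and caps the run at three actions; for a real benchmark you will likely want a much higher --max_actions so the agent has room to play the game to completion.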

Scorecards

When you run a benchmark, a scorecard is saved on the ARC server. If you are logged in, you can view your scorecards at three.arcprize.org/scorecards.

Learn More

The benchmarking README has more information than what is published here. Be sure to also read the guide on how to create your own agent to start experimenting with new agentic architectures.