ARC-AGI-3 Scoring Methodology

ARC-AGI-3 uses Relative Human Action Efficiency (RHAE, pronounced “ray”) to score AI systems. RHAE measures per-level action efficiency compared to a human baseline, normalized per game, across all games.

What Gets Measured

AI is scored on two criteria:

Completion — How many levels did the AI complete in each game?
Efficiency — How many actions did the AI take compared to humans?

What Counts as an Action

An action is a discrete interaction with the environment. Each turn where the agent submits a command, move, or input that affects the game state counts as an action. Internal operations that do not alter the environment (tool calls, reasoning steps, retries) are not counted as actions.

Human Baseline

Human baselines are established through controlled testing where participants play each ARC-AGI-3 game for the first time (having never seen the game before). For each game, multiple first-time players are observed, and the 2nd best human (fewest actions) per game is recorded as the baseline. Using the 2nd best human:

Removes outlier winners while still representing proficient human performance
Avoids penalizing for early misclicks
Keeps the baseline grounded in real play, not theoretical speed-runs
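
As an illustration of the selection rule above, here is a minimal sketch that picks the 2nd best (second-fewest actions) result from a set of first-time human runs. The function name and the sample action counts are hypothetical, and the text does not say how ties are handled; this sketch simply sorts and takes the second entry, so two tied best runs would yield that shared value.

```python
def second_best_baseline(action_counts):
    """Return the second-fewest action count among human runs.

    Assumption: ties are not treated specially; the sorted list is
    indexed directly, so a tie for best returns the tied value.
    """
    if len(action_counts) < 2:
        raise ValueError("need at least two human runs to pick the 2nd best")
    return sorted(action_counts)[1]


# Hypothetical action counts from four first-time players on one game.
human_runs = [14, 9, 22, 11]
print(second_best_baseline(human_runs))  # 11 (9 is the best run and is skipped)
```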

How Scoring Works

Per-Level Scoring

For each level the AI completes, calculate:

level_score = human_baseline_actions / ai_actions

If human baseline is 10 actions and AI takes 10 → level score is 1.0 (100%)
If human baseline is 10 actions and AI takes 20 → level score is 0.5 (50%)
If human baseline is 10 actions and AI takes 1,000 → level score is 0.01 (1%)
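
The three cases above follow directly from the ratio. A minimal sketch of the raw per-level ratio, before the cap described in the Per-Level Score Cap section is applied (the function name is an assumption for illustration, not part of any official tooling):

```python
def level_score(human_baseline_actions, ai_actions):
    """Raw per-level ratio: human baseline actions divided by AI actions."""
    if ai_actions <= 0:
        raise ValueError("ai_actions must be positive for a completed level")
    return human_baseline_actions / ai_actions


print(level_score(10, 10))    # 1.0
print(level_score(10, 20))    # 0.5
print(level_score(10, 1000))  # 0.01
```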

Per-Level Score Cap

The maximum score per level is capped at 1.0, which corresponds to matching the human baseline. If an AI discovers a shortcut and completes a level in fewer actions than the human baseline, it still receives only 1.0. This encourages building AI that generalizes across games rather than exploiting individual levels.
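
One way to express the cap is to take the minimum of 1.0 and the per-level ratio. The sketch below is illustrative only; the function name and the `completed` flag are assumptions, with uncompleted levels scoring 0 as in the per-game example.

```python
def capped_level_score(human_baseline_actions, ai_actions, completed=True):
    """Per-level score with the 1.0 cap applied.

    Assumption: a level that was not completed scores 0.0 regardless
    of how many actions were spent on it.
    """
    if not completed:
        return 0.0
    if ai_actions <= 0:
        raise ValueError("ai_actions must be positive for a completed level")
    return min(1.0, human_baseline_actions / ai_actions)


print(capped_level_score(10, 5))   # 1.0 (raw ratio 2.0 is capped)
print(capped_level_score(10, 20))  # 0.5
```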

Per-Game Aggregation

The game score is the average of all per-level scores for that game.

Example: A game has 7 levels. The AI scores:

Levels 1-3: 0.5 each (took twice as many actions as human)
Levels 4-7: 0 each (did not complete the level)

Game score = (0.5 + 0.5 + 0.5 + 0 + 0 + 0 + 0) / 7 ≈ 0.21 (21%)

Per-level aggregation prevents longer levels from drowning out signal from shorter levels, and lets you see exactly where a test-taker is strong or weak.
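
The 7-level example can be reproduced with a plain average over per-level scores, where uncompleted levels contribute 0. This is a minimal sketch; the function name is hypothetical.

```python
def game_score(level_scores):
    """Average of per-level scores; every level counts, completed or not."""
    if not level_scores:
        raise ValueError("a game must have at least one level")
    return sum(level_scores) / len(level_scores)


# Levels 1-3 scored 0.5 each; levels 4-7 were not completed.
scores = [0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
print(round(game_score(scores), 2))  # 0.21
```

Because the denominator is the total number of levels rather than the number completed, every unfinished level pulls the game score down.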

Total Score

Total score is the average of all game scores, resulting in a final score between 0% and 100%.
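
Putting the pieces together, the total score is an unweighted mean of game scores, so each game counts equally regardless of how many levels it has. The sketch below assumes hypothetical game names and per-level scores that have already been capped and zeroed for uncompleted levels.

```python
def game_score(level_scores):
    """Average of per-level scores for one game."""
    return sum(level_scores) / len(level_scores)


def total_score(games):
    """Unweighted average of per-game scores, in the range 0.0 to 1.0."""
    if not games:
        raise ValueError("need at least one game")
    return sum(game_score(levels) for levels in games.values()) / len(games)


# Hypothetical results: per-level scores keyed by an illustrative game name.
results = {
    "game_a": [1.0, 1.0],            # game score 1.0
    "game_b": [0.5, 0.5, 0.0, 0.0],  # game score 0.25
    "game_c": [0.0, 0.0, 0.0],       # game score 0.0
}
print(f"{total_score(results):.1%}")
```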

Score Interpretation

Score	Interpretation
100%	AI completes all games/levels while matching or surpassing human efficiency
1-99%	A mixture of level completion rates and efficiency relative to human baseline
0%	AI never completes a level across any game
