What Gets Measured
AI is scored on two criteria:
- Completion: How many levels did the AI complete in each game?
- Efficiency: How many actions did the AI take compared to humans?
What Counts as an Action
An action is a discrete interaction with the environment. Each turn where the agent submits a command, move, or input that affects the game state counts as an action. Internal operations that do not alter the environment (tool calls, reasoning steps, retries) are not counted as actions.
Human Baseline
Human baselines are established through controlled testing in which participants play each ARC-AGI-3 game for the first time (having never seen the game before). For each game, multiple first-time players are observed, and the result of the 2nd best human (the player with the second-fewest actions) is recorded as the baseline for that game. Using the 2nd best human:
- Removes outlier winners while still representing proficient human performance
- Avoids penalizing for early misclicks
- Keeps the baseline grounded in real play, not theoretical speed-runs
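
The baseline selection described above can be sketched in a few lines. This is a minimal illustration, not ARC-AGI-3 tooling: the function name `second_best_baseline` and the input shape (one total action count per first-time player) are assumptions, and so is the handling of ties, which the text does not specify.

```python
def second_best_baseline(action_counts):
    """Return the second-fewest action count among first-time human players.

    `action_counts` is a hypothetical list holding one action total per player.
    """
    if len(action_counts) < 2:
        raise ValueError("need at least two players to take the 2nd best")
    # Sort ascending so the fewest actions come first; index 1 is the 2nd best.
    # Assumption: tied players are kept as separate entries, so two players
    # tied for best yield that same count as the baseline.
    return sorted(action_counts)[1]


print(second_best_baseline([14, 9, 31, 12, 22]))  # 12
```

In this example the best player (9 actions) is treated as an outlier and the baseline becomes 12.
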
How Scoring Works
Per-Level Scoring
For each level the AI completes, calculate the ratio of human baseline actions to AI actions (level score = human baseline actions / AI actions):
- If the human baseline is 10 actions and the AI takes 10 → level score is 1.0 (100%)
- If the human baseline is 10 actions and the AI takes 20 → level score is 0.5 (50%)
- If the human baseline is 10 actions and the AI takes 1,000 → level score is 0.01 (1%)
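
The examples above are consistent with a simple ratio of human baseline actions to AI actions. The sketch below reproduces them; it also applies the 1.0 cap from the Per-Level Score Cap section and the zero score for uncompleted levels from the Per-Game Aggregation example. The function name `level_score` and its parameters are hypothetical.

```python
def level_score(human_actions: int, ai_actions: int, completed: bool = True) -> float:
    """Score one level as human baseline actions / AI actions, capped at 1.0."""
    if not completed:
        # An uncompleted level scores 0, as in the per-game aggregation example.
        return 0.0
    if human_actions <= 0 or ai_actions <= 0:
        raise ValueError("action counts must be positive")
    # The cap means beating the human baseline still yields exactly 1.0.
    return min(1.0, human_actions / ai_actions)


print(level_score(10, 10))    # 1.0
print(level_score(10, 20))    # 0.5
print(level_score(10, 1000))  # 0.01
print(level_score(10, 5))     # 1.0 (capped)
```
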
Per-Level Score Cap
The maximum score per level is capped at 1.0, which corresponds to matching the human baseline. If an AI discovers a shortcut and completes a level in fewer actions than the human baseline, it still receives only 1.0. This encourages building AI that generalizes across games rather than exploiting individual levels.
Per-Game Aggregation
The game score is the average of all per-level scores for that game. Example: A game has 7 levels. The AI scores:
- Levels 1-3: 0.5 each (took twice as many actions as the human baseline)
- Levels 4-7: 0 each (did not complete the level)
- Game score: (3 × 0.5 + 4 × 0) / 7 ≈ 0.21 (21%)
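
The aggregation for this example can be checked with a short sketch. It assumes the per-level scores are already available as a list of floats with uncompleted levels entered as 0; the name `game_score` is hypothetical.

```python
def game_score(level_scores):
    """Average the per-level scores of one game (uncompleted levels count as 0)."""
    if not level_scores:
        raise ValueError("a game needs at least one level")
    return sum(level_scores) / len(level_scores)


# The 7-level example: levels 1-3 score 0.5 each, levels 4-7 score 0.
scores = [0.5] * 3 + [0.0] * 4
print(round(game_score(scores), 3))  # 0.214
```

Because uncompleted levels stay in the denominator, completing more levels raises the game score even when efficiency on each level is low.
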
Total Score
Total score is the average of all game scores, resulting in a final score between 0% and 100%.
Score Interpretation
| Score | Interpretation |
|---|---|
| 100% | AI completes all games/levels while matching or surpassing human efficiency |
| 1-99% | A mixture of level completion rates and efficiency relative to human baseline |
| 0% | AI never completes a level across any game |
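
To connect the table to the arithmetic, the total score can be sketched as an unweighted average of per-game scores expressed as a percentage. The function name `total_score`, the game identifiers, and the example values are made up for illustration.

```python
def total_score(game_scores):
    """Average per-game scores (each in [0, 1]) and return a percentage.

    `game_scores` is a hypothetical mapping of game id -> game score.
    """
    if not game_scores:
        raise ValueError("need at least one game")
    return 100.0 * sum(game_scores.values()) / len(game_scores)


# Illustrative values only: one partially solved game, one perfect, one unsolved.
example = {"game_a": 0.25, "game_b": 1.0, "game_c": 0.0}
print(f"{total_score(example):.1f}%")  # 41.7%
```

A result of exactly 100% requires every game score to be 1.0, and 0% requires every game score to be 0, matching the first and last rows of the table.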

