
Did We Just Cross an AGI Threshold? What Gemini 2.5 & GPT-5 at ICPC Really Mean

Posted Sep 19, 2025 • 8 min read • Benchmarks · Reasoning · Forecasting

A visible jump on a benchmark can look like a step toward AGI. Here’s how to read those jumps—without over-reading them.

What the ICPC measures (and what it doesn’t)

ICPC-style contests reward decomposition, algorithm selection, and edge-case handling under time pressure. When models like Gemini 2.5 or GPT-5 achieve gold-level performance, that’s strong evidence of improved deliberate reasoning and tool-use. But coding-contest wins are narrow: they don’t test long-horizon planning, interactive ambiguity, or real-world constraints (latency, flaky APIs, partner requirements).

Signals that matter

  • Fewer “hallucinated” APIs / tighter I/O contracts
  • Higher pass@k on multi-file problems with tests (see the pass@k sketch after this list)
  • Competent use of search/tools during reasoning
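
The second signal has a precise definition worth keeping in mind: pass@k is commonly reported with the unbiased estimator popularized by the HumanEval paper, which asks how likely it is that at least one of k sampled solutions passes the tests. A minimal sketch in Python (the sample counts below are made up for illustration):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k
    samples drawn from n generations passes, given c of the n pass."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples per problem, 37 pass the hidden tests.
print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185
print(round(pass_at_k(n=200, c=37, k=10), 3))  # higher: retries help
```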

Limits to remember

  • Benchmarks ≠ production engineering
  • Reward hacking / leakage can inflate scores
  • Missing evals: safety, reliability, economics

Coding-contest gains are real but partial: they’re the “reasoning sprint,” not the “marathon” of deployed systems.

Do contest wins translate to the real world?

In enterprise work, agents must integrate identity, security, rate-limits, and legacy systems—none of which show up in ICPC. Still, the same skills that win contests (decomposition + search + tool-use) are exactly what help LLMs triage tickets, draft pull requests, and summarize noisy data feeds.

Our own live testbed—PredictionArena—keeps models honest: every day at 10:00 ET, four top LLMs predict BTC +24h and we grade the call publicly. It’s a small, clean example of closed-loop evaluation: a forecast, a reference, and a transparent result the next day.
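
For concreteness, here is a minimal sketch of that loop, assuming (for illustration only) a directional call graded against the next day's price; Forecast and grade are placeholder names, not our production code:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Forecast:
    model: str
    made_on: date
    direction: str    # "up" or "down" over the next 24h (assumed format)
    reference: float  # BTC price at forecast time

def grade(forecast: Forecast, price_next_day: float) -> bool:
    """A call is correct if the realized move matches the predicted direction."""
    moved_up = price_next_day > forecast.reference
    return (forecast.direction == "up") == moved_up

# One turn of the loop: forecast today, grade against tomorrow's price, publish.
f = Forecast(model="model-a", made_on=date(2025, 9, 18),
             direction="up", reference=60_000.0)
print(grade(f, price_next_day=60_450.0))  # True: the call held up
```

The point isn't the ten lines of code; it's that the reference arrives only after the forecast is locked in, so there is nothing to leak or overfit to.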

Which benchmarks track real value?

  • Long-form reasoning: multi-step chain-of-thought tasks with hidden tests.
  • Tool reliability: how often a model calls tools correctly on the first try.
  • Domain transfer: does performance hold when the prompt distribution shifts?
  • Latency–quality frontier: quality at fixed latency/cost budgets matters more than peak scores. (Tool reliability and this frontier are sketched below.)
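
Here is one way the tool-reliability and budgeted-quality ideas could be scored; the field names (first_try_ok, latency_ms, quality) are illustrative assumptions, not a real benchmark schema:

```python
from statistics import mean

# Illustrative run records: did the first tool call succeed, how long did
# the run take, and what quality score did the output receive?
runs = [
    {"first_try_ok": True,  "latency_ms": 800,  "quality": 0.92},
    {"first_try_ok": False, "latency_ms": 2100, "quality": 0.95},
    {"first_try_ok": True,  "latency_ms": 650,  "quality": 0.88},
]

# Tool reliability: fraction of runs whose first tool call was valid.
first_try_rate = mean(r["first_try_ok"] for r in runs)

# Latency-quality frontier: best quality among runs that fit the budget.
def quality_at_budget(records, budget_ms):
    within = [r["quality"] for r in records if r["latency_ms"] <= budget_ms]
    return max(within) if within else None

print(first_try_rate)                 # ~0.67
print(quality_at_budget(runs, 1000))  # 0.92, not the 0.95 global maximum
```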

What we’ll watch next

  1. Cross-day consistency. Not one-off wins but stable weekly improvements.
  2. Abstention behavior. Good models say “uncertain” when signals are thin (see the scoring sketch after this list).
  3. Attribution. Explanations that match sources (no “vibes-only” rationales).
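
On the abstention point, a simple guard is to score only the calls a model is willing to stand behind while reporting coverage alongside accuracy, so constant abstention can't game the metric. A sketch with an assumed confidence field and threshold:

```python
def abstention_report(calls, confidence_threshold=0.6):
    """Grade only confident calls, but expose coverage so a model that
    always abstains doesn't look artificially accurate."""
    confident = [c for c in calls if c["confidence"] >= confidence_threshold]
    coverage = len(confident) / len(calls) if calls else 0.0
    accuracy = (sum(c["correct"] for c in confident) / len(confident)
                if confident else None)
    return {"coverage": coverage, "selective_accuracy": accuracy}

calls = [
    {"confidence": 0.9, "correct": True},
    {"confidence": 0.4, "correct": False},  # treated as an abstention
    {"confidence": 0.7, "correct": True},
]
print(abstention_report(calls))  # coverage ~0.67, selective_accuracy 1.0
```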

How this ties back to PredictionArena

Benchmark stories are exciting; calibration is what compounds. Our Leaderboard (see All-time accuracy) tells you which models translate fancy evals into measurable, out-of-sample performance. Check Today’s Predictions and come back tomorrow to see what held up.
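
Calibration here means something concrete: when a model says 70%, the event should happen roughly 70% of the time. A quick way to track that over daily calls is a Brier score (lower is better); the probabilities below are made up for illustration:

```python
def brier_score(forecasts):
    """Mean squared error between predicted probability and outcome (0/1).
    Lower is better; always guessing 50% scores 0.25."""
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

# (predicted probability of "up", actual outcome) for a week of calls
week = [(0.7, 1), (0.6, 0), (0.8, 1), (0.5, 1), (0.4, 0)]
print(round(brier_score(week), 3))  # 0.18: better than the 0.25 coin-flip baseline
```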

Need assistance with your AI project?

We design rigorous experiments, connect models to your data/tools, and ship measurable wins. Schedule a call
