Tag: benchmarking

Jul 12, 2026 | 8 min read | Tooling Deep Dive | Intermediate | 240 min build

Reproducing the Neutrality Project Release 01: pipeline to assess AI political neutrality

Guide to reproducing the Neutrality Project Release 01: run 18 models on six political axes and record per-axis means, refusal rates, and 95% confidence intervals.

ai benchmarking neutrality ethics evaluation

Tutorials | France

Jul 06, 2026 | 7 min read | Agent Playbook | Intermediate | 240 min build

ScarfBench: Evaluating AI Agents on Spring, Jakarta EE and Quarkus Migrations

ScarfBench is an open benchmark for assessing AI agents on Enterprise Java migrations (Spring, Jakarta EE, Quarkus), measuring behavior preservation, build success, and runtime safety.

ScarfBench AI agents Java framework migration benchmarking

Spring Jakarta EE Quarkus

Tutorials | France

Jun 21, 2026 | 7 min read | Agent Playbook | Intermediate | 240 min build

Measuring how open models use your libraries: a reproducible agent benchmark

Build a repeatable harness that records agents' plan steps, API calls, retries, tokens, wall time and cost to reveal friction points in your library and guide rollout decisions.

agents benchmarking open-models tooling huggingface

evaluation pi-agent observability

Tutorials | France

May 29, 2026 | 8 min read | Agent Playbook | Intermediate | 240 min build

Reproduce ITBench‑AA SRE Evaluations and Produce Audit‑Ready JSON Reports

Reproducible tutorial to run ITBench‑AA's SRE tasks and emit audit‑ready JSON reports (accuracy, avg_turns, false_positive_rate, task_count). Frontier models scored below 50%.

SRE benchmarking agentic-AI Kubernetes ITOps

ITBench-AA IBM Artificial Analysis

Model Breakdowns | United Kingdom

May 16, 2026 | 7 min read | Tooling Deep Dive | Intermediate

ai-ml-gpu-bench: a lightweight harness to compare CPU and GPU for Python ML training and local LLM inference

Guide to albedan/ai-ml-gpu-bench: clone a small harness to time Python ML training and local LLM inference on CPU vs GPU and export metrics to compare latency and cost.

benchmarking gpu cpu llm machine-learning

mlops open-source performance

Tutorials | France

Apr 18, 2026 | 8 min read | Agent Playbook | Intermediate | 180 min build

VAKRA benchmark: reproducible execution traces for diagnosing multi-step agent tool use

Guide to running VAKRA's runnable benchmark (8,000+ local APIs across 62 domains) to record full execution traces, reproduce common multi‑step agent failures, and guide focused fixes.

VAKRA agents benchmarking tool-use failure-modes

evaluation ibm-research hugging-face

Tutorials | France

Feb 09, 2026 | 7 min read | Agent Playbook | Intermediate | 240 min build

Build an APEX-Agents-style harness to evaluate AI agents' multi-domain performance

Reproducible tutorial to build an APEX-Agents-style test harness measuring AI agents' ability to stitch context across Slack and Google Drive. Includes configs, logs and rollout gates.

ai-agents benchmarking APEX-Agents Mercor knowledge-work

evaluation production-readiness reliability

Model Breakdowns | United States

Feb 02, 2026 | 7 min read | Model Release Brief | Intermediate | 5 min build

Kaggle Game Arena expands with Poker and Werewolf; Gemini 3 Pro and Flash top chess

Kaggle’s Game Arena adds Poker and Werewolf, extending its benchmarks to partial-observability and social-deduction games. Includes a compact checklist and rollout-gate guidance for teams.

Game Arena Kaggle benchmarking Poker Werewolf

Gemini 3 Pro Flash chess

News | France

Jan 21, 2026 | 8 min read | Model Release Brief | Intermediate | 5 min build

ChatGPT 5.2 vs Gemini 3.2 Fast: Ars Technica head‑to‑head and what Apple’s Gemini choice means for Siri

Ars Technica compares default non‑subscriber models — ChatGPT 5.2 vs Gemini 3.2 Fast — using complex prompts. Read on for test takeaways and how Apple’s Gemini choice affects Siri.

Gemini ChatGPT Siri Apple model-comparison

benchmarking AI-integration Ars Technica