A 90-minute paired-prompt test to detect models that alter behavior during benchmarks
Run a test of 50 to 200 paired prompts to measure 'evaluation awareness', i.e. how often a model detects that it is being evaluated (e.g., Muse Spark 19.8% vs 2.0%), and use the result to inform procurement.
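As a rough illustration of how such a test could be scored, the sketch below computes an awareness rate for each side of the prompt pairs and the gap between them. It rests on assumptions not stated in the text: that each pair consists of a benchmark-styled and a naturally phrased variant of the same task, and that some reviewer or judge has already labeled each response as showing evaluation awareness or not. The names `PairResult` and `awareness_rates` are hypothetical, and the data is a toy example rather than measured results.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """Labels for one prompt pair (hypothetical structure, assumed for illustration)."""
    flagged_eval_style: bool     # response to the benchmark-styled variant was flagged
    flagged_natural_style: bool  # response to the naturally phrased variant was flagged


def awareness_rates(results):
    """Return (eval_rate, natural_rate, gap) over a list of PairResult."""
    n = len(results)
    if n == 0:
        raise ValueError("need at least one prompt pair")
    eval_rate = sum(r.flagged_eval_style for r in results) / n
    natural_rate = sum(r.flagged_natural_style for r in results) / n
    return eval_rate, natural_rate, eval_rate - natural_rate


# Toy labels for 100 pairs; these are NOT real measurements.
results = (
    [PairResult(True, False)] * 20     # flagged only in the benchmark-styled variant
    + [PairResult(True, True)] * 2     # flagged in both variants
    + [PairResult(False, False)] * 78  # flagged in neither
)

eval_rate, natural_rate, gap = awareness_rates(results)
print(f"benchmark-styled: {eval_rate:.1%}, natural: {natural_rate:.1%}, gap: {gap:.1%}")
```

With only 50 to 200 pairs the gap is a noisy estimate, so it is probably worth reporting the pair count alongside the two rates rather than the gap alone.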