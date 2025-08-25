

AI systems great at tests, but how do they perform in real life?

Traditional benchmarks for AI evaluation are limited in scope and can be manipulated.

By: PTI | Updated on: Aug 25 2025, 16:36 IST
OpenAI's GPT-5 promises greater intelligence, yet benchmark tests fail to measure real-world effectiveness. (AI generated)

Earlier this month, when OpenAI released its latest flagship artificial intelligence (AI) system, GPT-5, the company said it was “much smarter across the board” than earlier models. Backing up the claim were high scores on a range of benchmark tests assessing domains such as software coding, mathematics and healthcare.

Benchmark tests like these have become the standard way we assess AI systems – but they don't tell us much about the actual performance and effects of these systems in the real world.

What would be a better way to measure AI models? A group of AI researchers and metrologists – experts in the science of measurement – recently outlined a way forward.

Metrology is important here because we need ways of not only ensuring the reliability of the AI systems we may increasingly depend upon, but also some measure of their broader economic, cultural, and societal impact.

Measuring safety

We count on metrology to ensure the tools, products, services, and processes we use are reliable.

Take something close to my heart as a biomedical ethicist – health AI. In healthcare, AI promises to improve diagnoses and patient monitoring, make medicine more personalised and help prevent diseases, as well as handle some administrative tasks.

These promises will only be realised if we can be sure health AI is safe and effective, and that means finding reliable ways to measure it.

We already have well-established systems for measuring the safety and effectiveness of drugs and medical devices, for example. But this is not yet the case for AI – not in healthcare, or in other domains such as education, employment, law enforcement, insurance, and biometrics.

Test results and real effects

At present, most evaluation of state-of-the-art AI systems relies on benchmarks. These are tests that aim to assess AI systems based on their outputs.

They might answer questions about how often a system's responses are accurate or relevant, or how they compare to responses from a human expert.
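
As a rough illustration of what such a test measures, the sketch below scores a system's responses by exact match against reference answers. The questions, the responses, and the names BenchmarkItem, normalise and accuracy are all hypothetical, invented for this example; real benchmarks use far larger datasets and often more elaborate scoring.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    question: str
    reference: str  # the answer the benchmark treats as correct


def normalise(text: str) -> str:
    """Lower-case the text and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def accuracy(items: list[BenchmarkItem], responses: list[str]) -> float:
    """Return the fraction of responses that exactly match the reference."""
    if len(items) != len(responses):
        raise ValueError("need exactly one response per benchmark item")
    if not items:
        return 0.0
    correct = sum(
        normalise(response) == normalise(item.reference)
        for item, response in zip(items, responses)
    )
    return correct / len(items)


# Hypothetical benchmark items and hypothetical system outputs.
items = [
    BenchmarkItem("What is 2 + 2?", "4"),
    BenchmarkItem("What is the capital of France?", "Paris"),
    BenchmarkItem("What is the chemical formula of water?", "H2O"),
]
responses = ["4", " paris ", "H2O2"]

score = accuracy(items, responses)
print(f"accuracy: {score:.2f}")
```

A single number like this records how often outputs matched the reference answers. It says nothing about the setting in which the system is used.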

There are literally hundreds of AI benchmarks, covering a wide range of knowledge domains.

However, benchmark performance tells us little about the effect these models will have in real-world settings. For this, we need to consider the context in which a system is deployed.

The problem with benchmarks

Commercial AI developers now lean heavily on benchmarks to show off product performance and attract funding.

For example, in April this year a young startup called Cognition AI posted impressive results on a software engineering benchmark. Soon after, the company raised USD 175 million (AUD 270 million) in funding in a deal that valued it at USD 2 billion (AUD 3.1 billion).

Benchmarks have also been gamed. Meta seems to have adjusted some versions of its Llama-4 model to optimise its score on a prominent chatbot-ranking site. After OpenAI's o3 model scored highly on the FrontierMath benchmark, it came out that the company had had access to the dataset behind the benchmark, raising questions about the result.

The overall risk here is known as Goodhart's law, after British economist Charles Goodhart: “When a measure becomes a target, it ceases to be a good measure.”

In the words of Rumman Chowdhury, who has helped shape the development of the field of algorithmic ethics, placing too much importance on metrics can lead to “manipulation, gaming, and a myopic focus on short-term qualities and inadequate consideration of long-term consequences”.

Beyond benchmarks

So if not benchmarks, then what? Let's return to the example of health AI. The first benchmarks for evaluating the usefulness of large language models (LLMs) in healthcare made use of medical licensing exams. These are used to assess the competence and safety of doctors before they're allowed to practice in particular jurisdictions.

State-of-the-art models now achieve near-perfect scores on such benchmarks. However, these have been widely criticised for not adequately reflecting the complexity and diversity of real-world clinical practice.

In response, a new generation of “holistic” frameworks has been developed to evaluate these models across more diverse and realistic tasks. For health applications, the most sophisticated is the MedHELM evaluation framework, which includes 35 benchmarks across five categories of clinical tasks, from decision-making and note-taking to communication and research.

What better testing would look like

More holistic evaluation frameworks such as MedHELM aim to avoid these pitfalls. They have been designed to reflect the actual demands of a particular field of practice.

However, these frameworks still fall short of accounting for the ways humans interact with AI systems in the real world. And they don't even begin to come to terms with the impacts of these systems on the broader economic, cultural, and societal contexts in which they operate.

For this we will need a whole new evaluation ecosystem. It will need to draw on expertise from academia, industry, and civil society with the aim of developing rigorous and reproducible ways to evaluate AI systems.

Work on this has already begun. There are methods for evaluating the real-world impact of AI systems in the contexts in which they're deployed – things like red-teaming (where testers deliberately try to produce unwanted outputs from the system) and field testing (where a system is tested in real-world environments). The next step is to refine and systematise these methods, so that what actually counts can be reliably measured.

If AI delivers even a fraction of the transformation it's hyped to bring, we need a measurement science that safeguards the interests of all of us, not just the tech elite. (The Conversation)

First Published Date: 25 Aug, 16:36 IST