Managers need to question where the 'ground truth' of any AI has come from

AI tools are big business. They reach into every aspect of life: allocating social housing, hiring top talent, diagnosing medical conditions, predicting traffic jams, forecasting stock prices, generating sales leads, and the list goes on.

The global AI market is expected to exceed $1 trillion this decade. Whether it's the media, business or governments, everyone appears to be discussing AI and how it will transform the world. 

Managers are adopting AI because they believe it will allow their organisations to carry out tasks and make decisions quicker, more accurately, and at a lower cost than their staff could alone.

The pressure for adoption comes from all directions: vendors, boards, competitive markets, even the media. And there is constant reassurance that these AI tools will deliver on their promises, based on seemingly credible third-party performance claims using common assessment measures.

Yet, all too often managers risk discovering a huge gap between expectations and reality. Far from improving performance, AI tools may lower the accuracy and quality of decisions and even undermine the knowledge capital the organisation has built up over decades.

If it sounds shocking, it should be. As my award-winning research with co-authors Sarah Lebovitz and Natalia Levina shows, it's a risk that thousands of organisations are taking by failing to do the appropriate due diligence when adopting AI solutions. It is a risk that not only threatens to damage organisations and governments but, in some situations, could even cost lives.

We were able to study, at extremely close quarters, how five AI tools were evaluated for adoption in a renowned US hospital that employed leading experts in their fields.

The tools were designed for use in diagnostic radiology. For more than 11 months, we observed managers testing and evaluating the AI tools at research conferences, workshops, symposia, vendor presentations and 31 detailed evaluation meetings. We also conducted 22 interviews, had many informal conversations, and had access to a wide range of associated data.

What we discovered was both surprising and highly concerning. It wasn't that the medical professionals involved didn't want to thoroughly evaluate the AI tools. Rather, they didn't initially know the critical questions to ask to make a reliable assessment.

As they probed further and looked beyond the surface-level metrics, they discovered fundamental flaws in the way the AI had been trained and validated. 

At the heart of any AI software is its 'ground truth', the labelled data that represents (and is used to verify) the correct answer to the question the AI is trying to solve. At its simplest, for example, the ground truth dataset for a cat identification AI might consist of labelled images of various breeds of cats. A picture of a cat could then be checked against the ground truth dataset to see what type of cat it is.

However, most applications for AI tools in organisations are far more complex, whether it's deciding if a radiology image shows a malignant tumour, a candidate is a suitable hire, or a start-up is investable. Far from being clear and absolute, what constitutes the ground truth is often up for discussion. In these cases users need to be certain that the ground truth is a sufficiently verifiable version of the truth so that they can rely on it for their decision-making.

Wherever possible, the ground truth dataset should be based on objective information. A radiological image for detecting malignant tumours should be checked against subsequent biopsy results, for example.

Often, however, the nature of the prediction problem means the labelling of the ground truth data cannot be fully objective. In such cases, the labelling should be performed and checked by people who are sufficiently expert, using their know-how and applying the relevant professional standards for that information.

Why managers should not rely on AUC when adopting AI

In practice this means the tool developers interact with relevant expert practitioners, and tap into the experts' accumulated knowledge and experience to better understand the practices and processes involved, in order to codify as much of that know-how as possible.

Unfortunately, as with many other organisations making AI purchasing decisions, the managers in our study relied too heavily on a metric commonly used to assess AI performance – the AUC (Area Under the receiver operating characteristic Curve).

The problem is that the AUC says little about the tool's performance versus the performance of the people at the organisation who will be using the tool. Instead it measures how well the tool's outputs agree with whatever ground truth labels were selected by the AI designers, i.e. performance on their own terms.
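To make this concrete, here is a minimal Python sketch of what the AUC computes for a yes/no prediction task. The scores and labels are hypothetical values invented for illustration, not data from our study, and the function name `auc` is our own. The AUC is the probability that a randomly chosen case labelled positive receives a higher score than a randomly chosen case labelled negative, so "positive" and "negative" are defined entirely by the labels supplied.

```python
from itertools import product


def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs the tool ranks correctly.

    Tied scores count as half a correct ranking.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical tool outputs for six cases, and the labels chosen by the
# tool's designers (1 = condition present, 0 = condition absent).
tool_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
designer_labels = [1, 1, 1, 0, 0, 0]

print(auc(tool_scores, designer_labels))  # 1.0: perfect, on the designers' own terms
```

Nothing in this calculation asks where the labels came from, or how the organisation's own experts would have judged the same six cases.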

Once the medical professionals in our study looked beyond the AUC metric and began to put the AI tools under the spotlight, problems soon emerged. In a series of pilot studies, the medical professionals used their expert know-how to develop their own ground truth datasets and test the AI against them.

In many cases their results conflicted with the accuracy measures claimed for the tool. On closer examination it became clear that the ground truth used by the models had not been generated in a way that reflected how the experts arrived at their decisions in real life.
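The effect can be illustrated with a small Python sketch. All values here are hypothetical and chosen only to show the mechanism; they are not results from the hospital's pilot studies, and the names `pairwise_auc`, `vendor_labels` and `expert_labels` are assumptions made for the example. The same tool outputs are scored twice, once against the labels supplied with the tool and once against labels produced by in-house experts.

```python
def pairwise_auc(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical tool outputs for the same six cases.
tool_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

# Hypothetical labels shipped with the tool versus labels assigned by
# in-house experts, who disagree on the second and fourth cases.
vendor_labels = [1, 1, 1, 0, 0, 0]
expert_labels = [1, 0, 1, 1, 0, 0]

disputed = [i for i, (v, e) in enumerate(zip(vendor_labels, expert_labels)) if v != e]

print(f"AUC against vendor labels: {pairwise_auc(tool_scores, vendor_labels):.2f}")
print(f"AUC against expert labels: {pairwise_auc(tool_scores, expert_labels):.2f}")
print(f"Cases where the labels disagree (0-based index): {disputed}")
```

In this toy example the tool looks flawless against the vendor's labels and noticeably weaker against the experts', even though its outputs have not changed. The disputed cases are the ones worth examining by hand.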

Most organisations adopting AI don't go through such a rigorous process of evaluation. But failing to examine an AI tool properly risks damaging fall-out from its poor performance.

The behaviour of managers, often reluctant AI adopters, can also aggravate problems. They may back their own know-how against an AI tool, paying lip service to using it while continuing to work as usual.

The risk here is that, as happened in at least one organisation, senior managers give the credit for good results to the AI tool and make the staff redundant. By the time an organisation realises the hit it has taken to its performance and expertise, the damage may be very costly or even impossible to rectify.

The best strategy is to evaluate AI tools thoroughly before they are acquired, implemented and embedded in the organisation. This means putting some key questions to whoever is pushing for the AI's adoption, whether that's the designer, vendor or the organisation's own data scientists.

The questions managers need to ask when buying AI

Find out exactly how the AI tool has been trained. Can it be objectively validated? How was the data labelled? Who did the labelling, and who validated it? How expert were they in their field? Was it done to the professional standards expected in that area? What evidence was used? Where does the data come from? How applicable is it to the exact use case?

Don’t be put off. Don't accept jargon-filled responses that obscure the truth. Only then, if satisfied with the answers, is it time to move to piloting the tool.

AI is going to be unimaginably transformative, in many cases for the better. Eventually the way AI tools are constructed will become more transparent, best practices will be established by stakeholders, and the sale of AI tools will be better regulated.

But until then, our study shows that ‘caveat emptor’, the principle that the buyer alone is responsible for checking the quality and suitability of goods before a purchase, has to be the watchword for organisations thinking of adopting AI tools.

At the moment, the burden of checking the merits of these AI tools falls on the purchasers. And thorough due diligence is needed if organisations want to avoid a bad case of buyer's remorse.

Further reading:

Lebovitz S., Lifshitz-Assaf H. and Levina N. (2022). To engage or not to engage with AI for critical judgments: How professionals deal with opacity when using AI for medical diagnosis. Organization Science. 33(1):126-14. 

Lebovitz S., Levina N., and Lifshitz-Assaf H. (2021). Is AI Ground Truth Really True? The Dangers of Training and Evaluating AI Tools Based on Experts’ Know-What. MIS Quarterly 45(3b): 1501-1525.


Hila Lifshitz-Assaf is Professor of Management and a visiting faculty member at Harvard University's Lab for Innovation Science. She teaches Digital Transformation on the Executive MBA and Distance Learning MBA, and Managing Digital Innovation on the MSc Management of Information Systems and Digital Innovation.

Follow Hila Lifshitz-Assaf on Twitter @H_DigInnovation.
