John Burn-Murdoch and Sarah O’Connor

July 2 2026

Welcome back to The AI Shift, our weekly exploration of AI’s impacts on jobs and work. For this edition, we ask whether one reason there often appear to be contradictions between different pieces of evidence about AI’s capabilities is that researchers are asking two very different questions of the technology.

John writes

Time and again as AI has matured over the past few years we have run into apparent disconnects and puzzles regarding its capabilities and impacts: AI is acing notoriously challenging exams (but not taking jobs?); AI is performing complex software tasks that would take an experienced coder multiple days (OK, but what about other types of white-collar work?); AI has a 50 per cent chance of succeeding at complex technical tasks (isn’t 50 per cent quite low?).

We’ve talked about many aspects of this before — AI automates tasks not jobs; capability and reliability are very different — but I wanted to take a step back and get into why these disconnects arise. Something that often goes unmentioned when talking about measures of AI’s capabilities is that many of them were never really designed to assess whether or when AI becomes capable of doing white-collar jobs; they were designed primarily to assess whether or when AI might start to pose risks to human lives and societies.

These are extremely different questions, with almost diametrically opposing criteria in some cases. For AI to demonstrably pose a large-scale danger to individuals or organisations, it only needs to have a reasonable chance of succeeding at the kinds of software tasks that could enable the hacking of computer systems or other cyber attacks (bear in mind such attacks can have major physical as well as digital consequences). If it succeeds at this even just 50 per cent of the time, that’s a huge concern. But for AI to be able to replace human workers, its outputs need to be reliable and consistent, with success rates much closer to 100 per cent, and it needs to be adept at a wide range of tasks beyond software, including dealing with the inherent messiness of working with humans and navigating the physical world.

Arguably the most prominent tracker of AI capabilities to date is the software time horizon chart from AI research firm METR, which tracks the mounting complexity and scale of software tasks that AI is able to complete with at least a 50 per cent success rate, and finds rapid progress following an exponential curve, even accelerating beyond a constant growth rate in recent months. However, an alternative measure developed by Princeton University’s Stephan Rabanser, Sayash Kapoor and Arvind Narayanan, which incorporates safety, consistency and robustness standards from fields like aviation and nuclear energy to give a more holistic picture of AI’s overall reliability, finds that progress on these broader measures is happening much more slowly than on raw ‘does it have the capability to sometimes achieve this?’ metrics.

The Princeton team are worried about the risk of a bad outcome resulting from AI failing, so their rubric assesses how confident we can be that AI will almost always pass the test. Metrics like METR’s software progress chart, by contrast, were born out of worries about the risk of a bad outcome resulting from AI succeeding, and thus assess ‘is there even a modest chance that AI could pass the test?’. Yet the answers to these very different questions, with their very different implications, are frequently flattened into a single ‘how capable is AI?’

Something similar is at work with the broader over-indexing on software and coding in AI, both in the benchmarks and the capabilities themselves. There was an interesting moment in an episode of Bloomberg’s Odd Lots podcast a few months back when METR’s President Chris Painter explained that another reason headline measures of AI capabilities tend to focus on coding tasks is that progress in this domain is an indicator of when AI systems may start to be able to speed up their own progress — setting in motion the ‘recursive self-improvement’ flywheel that could propel the technology into unknown and potentially dangerous territory. AI companies and researchers alike are so focused on software because it’s where both the biggest rewards and greatest risks lie.

None of this changes the overarching story of continued progress in AI performance across a wide range of domains — new evaluation measures now track everything from legal work to medical diagnostics and management consulting — but whenever we talk about its capabilities it’s useful to ask whether the measure in question was designed to assess the chance that AI might automate and disrupt human jobs (capturing broad and reliable performance), or the chance that it might automate itself and in doing so disrupt humanity (narrow and edge-case performance).

Sarah, I’m curious how you’ve been navigating this strange period in which we’re simultaneously asking whether a new technology could be the nuclear bomb or the spreadsheet.

Sarah writes

John, I think this is such a useful way to clarify the debate about how powerful AI really is. It can sometimes feel as if different people are living in different worlds on this one. In one world, AI is improving at an exponential pace and people are fretting about catastrophic risks, while in the other, people are shrugging that the technology still doesn’t even seem good enough to do many of the tasks involved in their own daily jobs.

But as you say, both of these things can be true at once. The people in the first group aren’t just intoxicated by hype, and the people in the second aren’t just burying their heads in the sand.

I’m increasingly worried about the risks of damaging AI-enabled cyber attacks, even though I remain sceptical that AI is going to replace vast numbers of human jobs any time soon. After all, you probably wouldn’t hire someone for an important role in your organisation who has flashes of brilliance but only succeeds in the tasks you give them 50 per cent of the time. (Just this week, Bloomberg ran an interesting story about Ford rehiring experienced “grey beard” engineers because the company’s automated quality control systems weren’t good enough.) But you certainly would worry if your attackers had powerful tools with a 50 per cent success rate. Indeed, the recent joint statement by the leaders of the Five Eyes cyber security agencies was a sobering read on this score.

And it leaves me with a slightly depressing question, John. Have we created a technology which is now capable enough to be dangerous, but still not reliable enough to be useful in many domains? And if so, shouldn’t the economic incentives for the AI labs now be to improve reliability (in order to boost enterprise adoption), rather than just pushing those capability charts ever higher?

John responds

Great questions, Sarah. One could certainly make a case that the dangerous frontier coding capabilities have won out over reliable utility so far. We’re still in the foothills of proven value-adding uses in most domains, but the cyber capabilities of cutting-edge models are extremely advanced. More than a year ago researchers from Anthropic and Carnegie Mellon University showed that publicly available models were already able to infiltrate business-sized computer networks in simulations of real-world attacks such as the 2017 breach of consumer credit firm Equifax. Since then there have been proven cases of AI-assisted or even AI-orchestrated attacks. Last September, for example, Anthropic detected a sophisticated cyber attack attributed to a hacking group linked to the Chinese government, in which Anthropic’s own agentic tools were used to automate almost the entire attack, breaching major tech firms and government agencies.

There has been more positive news on the cyber front since then: the very latest models are now also proving a huge boon to cyber defence, helping businesses spot and fix critical vulnerabilities at unprecedented scale. But on the broader reliability of AI there’s still a lot of room for improvement. Some of the most recent models actually scored lower than their predecessors for consistency and safety, even as headline accuracy rose. Even so, I suspect the combination of high-profile gaffes caused by AI slip-ups and the ongoing re-evaluation of corporate AI budgets may nudge reliability metrics higher up the list in the months ahead.
