OpenAI's Deep Research demonstrates that reasoning does not solve hallucinations
Therefore, really great, flawless Agentic AI is probably not going to arrive this year, and we don't know when or if it will. A long post synthesizing recent AI progress, and… what even is analysis?
This post will be completely about AI, one of my favorite topics. I believe that the developments in AI in the past few years kicked off the biggest technological shift in my lifetime, on par with the Industrial Revolution, electrification, computing and the internet.
I came to this mostly by playing a game called “Can AI do this task?” on a daily basis. Thanks to that habit/hobby, I find myself knowing things: theoretical things, but especially practical insights from spending many hundreds of hours using AI. By now I’ve started to formally train people on using AI at work, and I initiated and helped create an AI usage policy. More and more, I’ve become a go-to person on AI for the people around me. And last but not least, I’ve learned to code thanks to AI and created Magicdoor.
With all that background, I find myself delightfully puzzled and uncertain about the future, but also reasonably equipped to tune out the (insane amount of) noise, hype and anti-hype out there. So I end up writing nuanced, long-form takes: neither overly hyped nor overly dismissive, reflecting my personal ambivalence. If you enjoy that sort of thing, you’ll probably enjoy this post, as well as some of my previous posts about AI.
The problem with OpenAI’s Deep Research
Deep Research1 is hyped. It seems to have completely taken over the conversation, even among some of my favorite and more nuanced AI commentators. One of my other favorite AI analysts, Benedict Evans, just released a great piece on Deep Research in which he was a lot less optimistic. He looked at OpenAI’s very own marketing example from the release announcement, about smartphone adoption.
It just so happens that Benedict is a career telco analyst, so he not only knows how to analyse data, he is also an expert in this particular domain. I strongly recommend reading the whole piece, but I’m going to try summarizing it here (by hand, not with AI ✍️2).
Deep Research uses two sources: Statista and Statcounter. Ben first criticises the nature of the Statcounter data: This is ‘traffic’ data, which is a bad way to measure adoption because high-end phones (i.e. iPhones) are used more. I would probably not have flagged this error myself.
He then criticizes Statista for being a generally low quality source, with most of the data paywalled anyway. I fully agree with that, every time I’ve ended up on Statista it has been near useless.
Quote: “Setting that aside, though, let’s jump through the registration hoops and look at one number - Japan. Deep Research says that the Japanese smartphone market is split 69% iOS and 31% Android. That prompts two questions: is that what those sources say, and are they right? These are very different kinds of question.”
First, Statcounter doesn’t say 69% anywhere, it says 59.7% for iOS 🤨. On Statista, it turns out the key underlying source is another report from Kantar World Panel, which actually shows almost the exact opposite result: 63% Android and 36% iOS. Wow…
Benedict ends this section with: “We could also go and check some of the other numbers, but if I have to check every number in a table then it hasn’t saved me any time - I might as well do it myself anyway. And for what it’s worth, a Japanese regulator does a survey of the actual number we’re looking for here (page 25), which says that the installed base is about 53% Android and 47% iOS. Ah.”
This is pretty bad, and remember that this is powered by o3 — the very edge of the frontier — and it is one of OpenAI’s own marketing examples!
One thing I often explain to young people who come to work for me is that trust is earned in drops and lost in buckets. It’s something I learned myself the hard way when I was interning at Deutsche Bank. Once people start finding small mistakes in your work, they immediately start wondering what other mistakes you made that they haven’t found yet.
Finding one mistake like this in a piece of analysis makes the entire thing suspect, as well as every other analysis done by that person (or machine). This is what Benedict really means when he says: “if I have to check every number in a table then it hasn’t saved me any time - I might as well do it myself anyway.” If an intern made a mistake like this, it would be a great teachable moment for a coaching session; if a highly paid professional researcher did it, it would be grounds for immediate termination.
That’s why I agree with the general heuristic that AI is like having infinite interns who work inhumanly fast. It can be a huge time-saver for sure, but you won’t let it anywhere near anything really high-stakes.
Deep Research was built for ‘analysis’ but it fails at doing that
Analysis is a distinct writing space in between factual reporting and opinion writing.
Noah Smith wrote about this just this week:
But the hoary distinction between fact and opinion, enshrined in the organization of legacy publications, leaves out a crucial third kind of writing that’s in very high demand: analysis.
Consider a forecast about the future, such as “Democrats will win the 2026 midterm elections.” This isn’t a fact, because it can’t be proven or disproven with currently available data. And although people might colloquially call it an “opinion”, it’s not a subjective value judgement either — it’s based on facts, even though it’s not a fact itself. A forecast is a third kind of thing.
[…]
Forecasts, assessments, and theories are all part of a category called analysis. Some people casually refer to analysis as “opinion” — if you say “Democrats will win the midterms”, they might say “Well that’s just, like, your opinion, man!”. But discerning people recognize that there is a salient, useful distinction between value judgements and reasoned interpretations of facts. The way people argue about the two things is fundamentally different — to argue about opinions, you typically have to make an emotional appeal, while arguing about analysis can be done using logic and reason alone.
Analysis is particularly tricky because a good analysis combines and builds an argument on a foundational set of facts from various places. If the underlying facts are wrong, the whole argument falls apart. AI hypers usually reply with "you're using it wrong" or "the models will get better." The first one just doesn't make sense in this case, since the bad example we just analyzed was created and marketed by OpenAI themselves!
But the second one is more in line with what I'm talking about in this post and what Benedict also keeps highlighting in his repeated "are better models better?" question. What does 'better' mean? Instead of 85% of the data being correct, will the next version of Deep Research get 90% correct? Benedict:
That doesn’t help me. If there are mistakes in the table, it doesn’t matter how many there are - I can’t trust it. If, on the other hand, you think that these models will go to being 100% right, that would change everything, but that would also be a binary change in the nature of these systems, not a percentage change, and we don’t know if that’s even possible.
We just don't know how, and hence if or when, AI will get to 100% accuracy. The big reason I am writing all this now is that over the past two months o3, DeepSeek, and reasoning models more generally were explicitly and aggressively marketed as doing almost exactly that. And hey, don't worry about that "almost" part, it will get fixed. And then: AGI!
Well, it turns out that back here on earth, that "almost" part is the critical part, and at this point we are no closer to AI doing autonomous, reliable analysis than we were two years ago.

There are good reasons to believe this is a fundamental problem
So we have seen that Deep Research, powered by the most advanced reasoning model in the world (OpenAI’s o3), still hallucinates at a deal-breaking rate. This adds fuel to the fire for the skeptic side of the discourse. My favorite skeptic is Emmanuel Maggiori, a machine learning expert and practitioner from ‘before it was cool’, who does a great job explaining the fundamentals.
Since the results from GPT and its siblings and cousins are so surprisingly good given the fundamental approach, I and others who try to stay grounded keep an open mind to the possibility that they might continue to surprise us. But Emmanuel feels strongly that there is no way the current trajectory can lead to AGI, or even make that step to 100% accuracy.
Looking at how Deep Research fails, it does seem like he has a strong case. Without discounting the achievement of figuring out that ‘reasoning’ significantly improves answer quality and accuracy, in retrospect it’s clear that reasoning improves, but does not solve, and probably cannot solve, the problem. Fundamentally, LLMs create things that look right, but they don’t know if they ARE right. They don’t understand what they are doing. I was struck by this argument from Dwarkesh Patel, who asked:
These things [LLMs] have basically the entire corpus of human knowledge memorized and they haven’t been able to make a single new connection that has led to a discovery?
Whereas if even a moderately intelligent person had this much stuff memorized, they would notice — Oh, this thing causes this symptom. This other thing also causes this symptom. There’s a medical cure right here
Shouldn’t we be expecting that kind of stuff?
Again, with the benefit of hindsight, it is quite easy to understand why LLMs behave like this. There are many technical primers online, but I recommend this one from Ethan Mollick, which explains the basics of tokenisation and next-token prediction. What it boils down to is that LLMs just spit out the highest-probability next token (a word or word snippet), given the prompt plus the result so far. That last part is important to understand, because it holds the key to why getting the model to reason out loud actually does work to improve answers.
The inference cycle of an LLM:
Round 1: Input = { Prompt } -> Output = { the next word }
Round 2: Input = { Prompt + new word } -> Output = { the next word }
Round 3: Input = { Prompt + two new words } -> Output = { the next word }
…and so on, for every subsequent word.
So the answer, as it is being generated, goes back into the LLM to predict every next word. Knowing this, it is obvious why ‘thinking out loud’ works: it’s the only way an LLM can ‘think’. But it’s also quite clear why they hallucinate so much: they literally just spit out whatever is most likely to follow, given the training data3.
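To make that loop concrete, here is a minimal sketch in Python. The hard-coded probability table and the `next_token_distribution` helper are toy stand-ins for a real trained model; the point is only the shape of the loop: predict one token, append it, feed the longer text back in.

```python
def next_token_distribution(context: str) -> dict[str, float]:
    # Toy stand-in for a trained model: a real LLM computes a probability
    # for every token in its vocabulary with a neural network.
    table = {
        "the best type of pet is a": {" dog": 0.50, " cat": 0.20, " subjective": 0.11},
        "the best type of pet is a dog": {" because": 0.7, ".": 0.3},
        "the best type of pet is a dog because": {" they": 0.8, " of": 0.2},
    }
    return table.get(context, {".": 1.0})

def generate(prompt: str, max_tokens: int = 4) -> str:
    text = prompt
    for _ in range(max_tokens):
        dist = next_token_distribution(text)
        # Greedy decoding: always take the single most likely next token.
        next_token = max(dist, key=dist.get)
        text += next_token  # the output so far becomes part of the next input
    return text

print(generate("the best type of pet is a"))
# -> "the best type of pet is a dog because they."
```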
So, to recap this section: there are good reasons to believe that both the lack of truly novel ideas and the persistent hallucinations are fundamental to the way LLMs work. OpenAI constantly claims to have the solution, but they haven’t released it. And the last thing they did release, positioned very loudly as a step toward solving hallucinations, actually hallucinates like crazy.
The most honest assessment is that nobody knows the solution
So it seems that really nobody knows how to eliminate hallucinations, or even whether a solution exists. That means all of the talk about AI replacing people’s jobs and about AGI is pure opinion, not rooted in any real-world observation. While I do think it is important to do high-quality alignment research just in case, the entire conversation about whether and when AGI will arrive is just the same as talking about whether aliens exist.
The more practical problem is that, because we don’t know if the technology will become reliable, we don’t know what kinds of products we should build: buttons and sliders, where AI is an API call that does a certain task, or an agent, where the AI turns everything else into an API call (a rough sketch of both shapes follows below).
Benedict Evans: “First, to repeat, we don’t know if the error rate will go away, and so we don’t know whether we should be building products that presume the model will sometimes be wrong or whether in a year or two we will be building products that presume we can rely on the model by itself.”
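To make those two product shapes concrete, here is a rough sketch in Python. The `call_llm` helper, the tool names, and the prompt format are placeholders I made up for illustration, not any particular vendor's API.

```python
def call_llm(prompt: str) -> str:
    # Placeholder for a real model call; it returns a canned reply here so
    # the sketch runs end to end. Swap in whatever API you actually use.
    return "DONE: (model answer would go here)"

# Shape 1: "buttons and sliders" - the AI is a single API call behind a
# fixed feature. The surrounding product stays deterministic, and a wrong
# answer is contained to this one field.
def summarize_ticket(ticket_text: str) -> str:
    return call_llm(f"Summarize this support ticket in two sentences:\n{ticket_text}")

# Shape 2: "agent" - the AI decides which tools to call and in what order,
# so any per-step error rate compounds across the whole run.
TOOLS = {
    "search_orders": lambda query: f"orders matching {query!r}",
    "refund_order": lambda order_id: f"refund issued for order {order_id}",
}

def run_agent(goal: str, max_steps: int = 5) -> str:
    history = f"Goal: {goal}"
    for _ in range(max_steps):
        decision = call_llm(
            f"{history}\nReply 'tool_name: argument' using one of {list(TOOLS)}, "
            "or 'DONE: <answer>' when finished."
        )
        if decision.startswith("DONE:"):
            return decision[len("DONE:"):].strip()
        tool, _, argument = decision.partition(":")
        history += f"\n{decision} -> {TOOLS[tool.strip()](argument.strip())}"
    return history

print(summarize_ticket("My order arrived damaged and I want a refund."))
print(run_agent("Refund order 1234 if it was delivered damaged"))
```

The contrast is Benedict’s point: the first shape is built assuming the model will sometimes be wrong, while the second only makes sense once you can rely on the model by itself.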
So, what to build is one interesting question to ponder. The other interesting topic is to figure out how to get the most useful outputs from AI right now, while buffering for the high error rate.
It’s slop all the way down
Even though image generation models and LLMs are very different under the hood, they seem to have very similar problems. All generative models are severely hamstrung by whatever happens to be in their training data, while at the same time occasionally having a moment of surprising awesomeness. But as it stands right now, when you are using AI you are almost constantly confronted with their inability to understand what they are doing.
Images: Absolutely impossible to get anything specific. Very poor prompt adherence. But some things come out looking just awesome. It makes the image generation experience frustrating, because many attempts are needed, and even then you eventually have to settle for something that’s just… OK. For example, it took me 16 tries and $0.72 to generate the cover image for this post, and I still think it’s quite shitty (I just gave up).



Code: Verbose, lots of unnecessary extra things, needlessly complicated (and this is just simple UI code, imagine actual full-stack features).

Code-quality agent: Almost everything it says is wrong or shortsighted

Writing: Verbose, poor style adherence, lack of ‘so what’

But what if this is still at an ‘above average’ level compared to humans?
When I was starting Ox Street, the developers I initially hired created a mess that took better developers months to clean up. Worse than AI, they lied about how out of their depth they were, and refused to explain or even admit responsibility for the situation. When I read Noah Smith’s or Benedict Evans’ analysis, I am delighted by the depth and originality of their insights, and impressed by their level of factual backing. But when I read the average piece of analysis, I am usually disappointed to find the same kinds of errors o3 makes. So, despite being quite bad at analysis, is Deep Research still an improvement over the average human analysis? Quite possibly. Does my customer support bot do a better job staying on brand and finding the right answer to a question, compared to low-paid offshore support workers? Almost certainly. Would I be building fully functional Next.js apps without AI? Very unlikely.
So this is the state of things. The technology is already transformative. But it’s also wrong or bad way too much. But is it still better than most humans? What is ‘better’, really? This is a question about evaluation and benchmarks. And Benedict’s ‘are better models better’ question.
For now, we know these things: Usage of AI is increasing rapidly. It has strong product-market fit in coding, marketing and customer support, but nowhere else for now. It is wrong a lot, and we don’t know if it will stop being wrong. You can get better at using it well, and the only way to do that is to use it. More likely than not, it will be important to get good at using it.
This was one of my more high-effort posts. I am not at all planning to monetize this newsletter but I do hope to reach more people over time. A like and some shares with likeminded people would mean a lot!
DeepSeek, DeepResearch. It seems like product names are becoming just as indistinguishable as model performance. Also, when is DeepThought going to be released?
Here is the job Claude 3.5 Sonnet did when prompted to summarize Benedict’s piece:
OpenAI's Deep Research demo, designed for research and analysis work, shows significant accuracy issues when tested against smartphone market data in Japan - it claimed 69% iOS/31% Android split, while actual regulatory data shows roughly 47% iOS/53% Android
The tool relies on problematic sources (Statcounter and Statista) and presents incorrect data even from these sources, making it unreliable for professional research where accuracy is crucial
While LLMs excel at understanding vague queries (what humans [computers] are bad at), they struggle with precise data retrieval (what computers are good at), creating a fundamental challenge for tools like Deep Research [The hallucination in this one is quite bad]
The error rate issue is binary, not incremental - even 85-90% accuracy isn't enough for professional research, as any mistakes mean the entire output requires verification, negating time savings
Despite limitations, the tool can reduce a 2-day research project to several hours for domain experts who can verify and correct the output - making it useful as an "infinite intern" or "bicycle for the mind," but not as a standalone solution
I’m leaving out some nuance for readability/relevance reasons. LLMs don’t just spit out the highest-probability next token; they produce a distribution of tokens, each with a probability. So after “the best type of pet is a…” you might have a 50% chance of “dog” and an 11% chance of “subjective”. Since this new word gets added to the prompt for the next round, you can imagine you’d have a completely different conversation with the AI depending on whether it ended up being “dog” or “subjective”. In one case, the AI is going to double and triple down on pitching dogs as the best type of pet, whereas in the other it’s going to give a balanced view on the pros and cons of different pets.
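For what that fork in the road looks like mechanically, here is a tiny sketch: the 50% “dog” and 11% “subjective” come from the example above, and the remaining options are invented just to fill out the distribution.

```python
import random

# Next-token distribution after "the best type of pet is a" (partly invented)
next_token_probs = {"dog": 0.50, "cat": 0.20, "subjective": 0.11, "bird": 0.10, "matter": 0.09}

tokens = list(next_token_probs)
weights = list(next_token_probs.values())

# Sampling instead of always taking the most likely token: different runs can
# land on "dog" or "subjective", and whichever token is chosen gets appended
# to the prompt and steers everything generated after it.
for _ in range(3):
    choice = random.choices(tokens, weights=weights, k=1)[0]
    print(f"the best type of pet is a {choice} ...")
```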