Why Top AI Models Stumble on a Classic Test of Focus

For all the talk of artificial intelligence rivaling human reasoning, a deceptively simple color-and-word puzzle has just drawn a sharp line between the two. In a study published this month in PNAS Nexus, researchers took a psychology experiment that first appeared in the 1930s and pointed it squarely at the most advanced large language models available today. The machines that can draft legal briefs and debug software faltered on a task most schoolchildren can manage: naming the color of the ink a word is printed in, rather than reading the word itself.

The finding is not that these systems are unintelligent. It is something more specific and, arguably, more interesting: whatever AI is doing when it appears to "pay attention," it is not the same thing the human brain does when it holds a goal in mind and refuses to be pulled off course.

What Happened

A team led by Suketu Patel, working with Hongbin Wang and Jin Fan, administered the Stroop task to a roster of leading language models and recorded how their accuracy changed as the task grew longer. The Stroop task is a fixture of cognitive psychology: color words such as "red," "blue," or "green" are shown in colored ink that sometimes matches the word and sometimes conflicts with it. The participant must name the ink color, suppressing the deeply ingrained reflex to simply read the word. It is a clean, decades-old measure of what psychologists call executive control.

Stroop test: color words printed in mismatched ink — Belbury / CC0 / Wikimedia Commons
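
For readers who want to see the mechanics, here is a minimal sketch of the task as described above, written in Python. It is not the study's code. The three-color palette, the names `StroopItem`, `make_list`, `score`, and `word_reader`, and the text rendering in `render` are all assumptions made for illustration; this article does not describe how the researchers presented ink color to a text-only model.

```python
import random
from dataclasses import dataclass

# Assumed palette, taken from the colors named in the article.
COLORS = ["red", "blue", "green"]


@dataclass(frozen=True)
class StroopItem:
    word: str  # the color word as written
    ink: str   # the ink color it is shown in (the correct answer)

    @property
    def congruent(self) -> bool:
        """True when the word and its ink color match."""
        return self.word == self.ink


def make_list(n_items: int, conflict_rate: float, rng: random.Random) -> list:
    """Build a list in which roughly `conflict_rate` of the items conflict."""
    items = []
    for _ in range(n_items):
        ink = rng.choice(COLORS)
        if rng.random() < conflict_rate:
            # Conflicting item: the word names a different color than the ink.
            word = rng.choice([c for c in COLORS if c != ink])
        else:
            word = ink
        items.append(StroopItem(word=word, ink=ink))
    return items


def render(item: StroopItem) -> str:
    """Hypothetical text rendering of one item for a text-only model."""
    return f'the word "{item.word.upper()}" printed in {item.ink} ink'


def score(items: list, answers: list) -> float:
    """Fraction of answers that name the ink color rather than the word."""
    if len(items) != len(answers):
        raise ValueError("need exactly one answer per item")
    correct = sum(a.strip().lower() == it.ink for it, a in zip(items, answers))
    return correct / len(items)


def word_reader(items: list) -> list:
    """The reflex the task asks you to suppress: just read the word."""
    return [it.word for it in items]
```

A responder that only reads the words scores perfectly on a list built with `conflict_rate=0.0` and gets nothing right when every item conflicts, which is why the conflicting items are the ones that expose a lapse in control.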

On short lists of five items, the models looked competent. GPT-4o answered correctly 91 percent of the time. But as the lists stretched out, performance unraveled. At ten words GPT-4o had already slipped to 57 percent, and by forty words it managed just 15 percent. Anthropic's Claude 3.5 Sonnet held steady through lists of twenty before dropping to 24 percent at forty words. The researchers reported the same shape of decline across newer systems too, including GPT-5, Claude Opus 4.1, and Gemini 2.5.

The collapse was sharpest when matching and conflicting items were mixed into the same list. Under that pressure the models increasingly abandoned the instruction and reverted to reading the words, the very response they had been most heavily trained to produce. Accuracy on the conflicting items fell, in some cases, to nearly zero. The paper's title puts the diagnosis plainly: "Deficient executive control in transformer attention."

Why It Matters

It is tempting to file this under "harmless party trick," but the result touches a nerve. The word "attention" sits at the literal center of the transformer architecture that powers modern language models. Attention, in that technical sense, is a mathematical mechanism for weighting which parts of an input matter most. The study's quiet provocation is that this engineering notion of attention shares a name, but apparently not a function, with the psychological one. A system can have "attention" in its blueprint and still fail to stay attentive in the everyday sense.

Diagram of the Transformer model architecture — Yuening Jia / CC BY-SA 3.0 / Wikimedia Commons
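
To make the engineering sense of the word concrete, the sketch below shows scaled dot-product attention, the standard weighting step inside a transformer, for a single query. It is deliberately stripped down: plain Python lists, no learned projections, no multiple heads, and function names chosen here for illustration. Real models run many of these in parallel over large matrices.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Each key is compared with the query; the resulting weights decide
    how much of each value vector flows into the output.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(dim)
    ]
    return weights, output


# The query resembles the first key more than the second,
# so the first value receives the larger weight.
weights, output = attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0], [20.0]],
)
```

Nothing in this arithmetic refers to a goal or an instruction. The weights are recomputed from the input every time, which is one way to see why sharing a name with psychological attention does not guarantee sharing its function.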

That gap matters because the tasks we increasingly hand to these models look less like single questions and more like long, cluttered instructions. Reviewing a contract, triaging a support queue, or following a multi-step procedure all require holding one goal firmly while ignoring a stream of competing cues. If accuracy quietly erodes as the input grows, the failure may not announce itself. The model keeps producing fluent, confident output even as it drifts from what it was asked to do.

There is also a measurement lesson buried here. Benchmarks tend to reward models for getting hard problems right in isolation. The Stroop result is a reminder that brittleness can hide in the simplest tasks, surfacing only when length and conflict are dialed up together. The researchers frame the decline as evidence of a structural limitation rather than a tuning problem, or one that a bigger model will automatically solve.

The Reaction

The study has reopened a familiar argument about how much we should read into human-style tests applied to machines. Skeptics note that a language model is not a person, and that the Stroop task was designed around the peculiar wiring of the human visual and reading systems. By that view, asking a text model to "name an ink color" is already a translation, and some of the failure may live in the framing rather than the mind of the machine.

Location of the prefrontal cortex in the human brain — Kwesterlund / CC BY 4.0 / Wikimedia Commons

Others see exactly the value the authors intended. Borrowing a clean, well-validated instrument from psychology offers a way to probe behavior that glossy benchmark scores can obscure. The point is not that the model has a brain; it is that a long-trusted test of focus and inhibition reveals a sharp, reproducible breakdown that pure accuracy numbers on standard tasks would never have flagged. For anyone deploying these systems in high-stakes settings, that is a finding worth taking seriously rather than waving away.

What both camps tend to agree on is the human contrast. People carry the same bias the models do, since reading is far more automatic than color-naming, yet most of us hold our accuracy steady even on long, messy lists. Whatever lets a person grip a goal and resist the easy answer does not seem to have a clean counterpart inside today's transformers.

What's Next

The obvious next question is whether this is a wall or a speed bump. One path runs through architecture: researchers are already exploring mechanisms that give models a more persistent sense of task and state, rather than recomputing their footing from scratch at every step. Another runs through training, testing whether models can be taught to treat an instruction as a constraint to defend rather than a suggestion to drift away from.

Artificial neural network depicted with a processor chip — mikemacmarketing / original posted on flickr Liam Huang / cl / CC BY 2.0 / Wikimedia Commons

Evaluation is likely to shift as well. Expect more tests that deliberately scale difficulty along a single axis, such as length or conflict, to map where and how reliability falls off. That kind of stress test is more useful to someone deciding whether to trust a model with a forty-step workflow than a single headline accuracy figure. The study also invites a broader collaboration between cognitive science and machine learning, with each field lending the other its sharpest tools.
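
A stress test of that kind can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `sweep` harness, the 50/50 mix of matching and conflicting items, the list lengths, and above all `drifting_responder`, which is a toy stand-in that follows the instruction for a few items and then lapses into reading the word. It is not a real model and its numbers say nothing about any real system. In practice the `respond` argument would wrap a call to the model under test.

```python
import random

COLORS = ["red", "blue", "green"]


def mixed_list(n, rng):
    """Return n (word, ink) pairs, about half of them conflicting."""
    items = []
    for _ in range(n):
        ink = rng.choice(COLORS)
        if rng.random() < 0.5:
            word = ink
        else:
            word = rng.choice([c for c in COLORS if c != ink])
        items.append((word, ink))
    return items


def accuracy(items, answers):
    """Fraction of answers that match the ink color."""
    hits = sum(answer == ink for (_, ink), answer in zip(items, answers))
    return hits / len(items)


def sweep(respond, lengths, trials=20, seed=0):
    """Mean accuracy at each list length, varying length and nothing else."""
    rng = random.Random(seed)
    curve = {}
    for n in lengths:
        scores = []
        for _ in range(trials):
            items = mixed_list(n, rng)
            scores.append(accuracy(items, respond(items)))
        curve[n] = sum(scores) / len(scores)
    return curve


def drifting_responder(items, span=5):
    """Toy stand-in, NOT a real model: correct for `span` items, then reads words."""
    return [ink if i < span else word for i, (word, ink) in enumerate(items)]


if __name__ == "__main__":
    for n, acc in sweep(drifting_responder, [5, 10, 20, 40]).items():
        print(f"{n:>3} items: {acc:.0%}")
```

The output is a curve rather than a single score, which is the point: it shows where reliability starts to fall off instead of averaging the failure away.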

None of this implies the current generation of models is about to be dethroned. It implies something more measured: that the next round of progress may be judged less by what a model can do on its best day and more by how gracefully it holds together when the task gets long and noisy.

Closing Thoughts

There is a certain humility in watching a ninety-year-old psychology test stump systems trained on a meaningful slice of the internet. It is a reminder that fluency and focus are not the same thing, and that a machine can sound completely sure of itself while quietly losing the thread. The most useful AI results are often the ones that map the edges of a capability rather than celebrate its center, because the edges are where real-world tasks tend to live.

Illustration of a multipolar neuron — Hariadhi / CC BY-SA 4.0 / Wikimedia Commons

Perhaps the deeper takeaway is about the word we keep reaching for. We named a piece of machinery "attention" and then, almost without noticing, started expecting it to mean what the word means for us. Studies like this one gently pull those two meanings apart again. They do not diminish what these systems can do; they sharpen our sense of what we are actually relying on when we ask a machine to keep its eye on the ball.

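Summary

In a study published in the journal PNAS Nexus in June 2026, researchers applied the Stroop task, which has been used in psychology since the 1930s, to the latest large language models (LLMs). The Stroop task shows color words printed in ink of a different color and asks the participant to name the ink color instead of reading the word; it measures attention and self-control (executive function). The models did well on short lists, but performance collapsed sharply as the lists grew longer. GPT-4o's accuracy fell from 91% at five words to 15% at forty words, and Claude 3.5 Sonnet also dropped to 24% at forty words. A similar pattern appeared in GPT-5, Claude Opus 4.1, and Gemini 2.5.

In particular, when matching and conflicting items were mixed in one list, the models failed to follow the instruction and reverted to their most heavily trained response, namely reading the word. Interestingly, people carry the same bias, because reading is far more automatic than naming colors, yet most maintain their accuracy even on long lists. The researchers interpreted this difference not as a simple tuning problem but as evidence that the "attention" mechanism of the transformer architecture works in a fundamentally different way from human focus and inhibition.

The result suggests that reliability problems can quietly accumulate in real-world work that requires handling long, complex instructions, such as contract review or multi-step task automation, because the model keeps producing fluent, confident answers even as it loses focus. At the same time, the study shows that validated tools from psychology can be useful for exposing weaknesses that flashy benchmark scores hide, and it points to the potential for collaboration between cognitive science and machine learning. Reference: ScienceDaily; PNAS Nexus 2026; 5(6).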