What Happened
For most of the short history of large language models, the honest answer to a simple question was no. Could these systems create genuinely new knowledge, the kind a working researcher had never seen before? They could summarize, translate, and recombine, but the frontier of human understanding seemed to belong to people alone. A recent project from Google DeepMind is one of the clearest signs yet that the line is moving. The company has introduced what it calls an AI co-mathematician, and the name is deliberate. This is not a chatbot you ask for an answer. It is a research partner designed to sit alongside professional mathematicians and work with them on problems that nobody has solved.
Built on Google's latest Gemini models, the system posted a new high score on FrontierMath Tier 4, the hardest level of a benchmark that Epoch AI describes as a set of problems built to defeat even the best AI systems, some expected to stand unsolved for decades. It solved 23 of 48 private problems, a 48 percent success rate, against 19 percent for the same Gemini model used on its own. Three of the problems it cracked had never been solved by any system tested before. There is an important caveat worth stating plainly: unlike its competitors, the co-mathematician ran without limits on how much it could compute, which makes the comparison uneven and the underlying cost meaningfully higher. But the gap at the top of the leaderboard is real.
What matters more than the score is the shape of the thing. The system is described not as a model you query but as a stateful, asynchronous workspace, a hierarchy of agents coordinated by a top-level project manager, working in parallel across many separate lines of inquiry. It tracks dead ends as carefully as it tracks progress, on the principle that, as the paper puts it, knowing what does not work is often as important as knowing what does. It writes up its reasoning in full, with annotations and notes on where each idea came from, so that a human collaborator can follow the trail rather than simply trust the conclusion.
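To make that shape concrete, here is a minimal Python sketch of a workspace that keeps state, runs several lines of inquiry in parallel, and records dead ends alongside progress with a note on where each idea came from. It is an illustration only, not DeepMind's implementation: every name in it (Finding, Workspace, run_project, and the two toy workers) is hypothetical, and plain functions on a thread pool stand in for the agents.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Finding:
    inquiry: str   # which line of inquiry produced this result
    worked: bool   # False marks a dead end, which is kept, not discarded
    note: str      # the reasoning, or the reason the approach failed
    source: str    # provenance: where the idea came from


@dataclass
class Workspace:
    """Project state that persists across runs (the 'stateful' part)."""
    findings: List[Finding] = field(default_factory=list)

    def progress(self) -> List[Finding]:
        return [f for f in self.findings if f.worked]

    def dead_ends(self) -> List[Finding]:
        return [f for f in self.findings if not f.worked]


# A worker stands in for one sub-agent; here it is just a function.
Worker = Callable[[str], Finding]


def run_project(workspace: Workspace, inquiries: Dict[str, Worker]) -> None:
    """Hypothetical project manager: run every line of inquiry in parallel
    and record each outcome, successful or not, in the shared workspace."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(worker, name) for name, worker in inquiries.items()]
        for future in futures:  # collected in submission order
            workspace.findings.append(future.result())


# Toy workers with made-up outcomes, purely for illustration.
def try_induction(name: str) -> Finding:
    return Finding(name, False, "base case holds, inductive step fails", "textbook")


def try_counting(name: str) -> Finding:
    return Finding(name, True, "double counting gives the bound", "human hint")


workspace = Workspace()
run_project(workspace, {"induction": try_induction, "counting": try_counting})
for f in workspace.findings:
    status = "progress" if f.worked else "dead end"
    print(f"{f.inquiry}: {status} ({f.note}; from {f.source})")
```

The design choice worth noticing is that a failed approach is stored as a first-class record rather than dropped, so a later run, or a human collaborator, can see what was already tried and why it did not work.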
Why It Matters
The deeper claim buried in this work is that we have been measuring the wrong thing. For years, progress in mathematical AI has been tracked by how many problems a model can solve in isolation, under a strict budget of time and tokens. But raw problem-solving, at least at the level of well-posed questions, is no longer where these systems fall down. The harder skill is orchestration: holding a research thread together over weeks, reading the narrow and difficult literature around a question, being honest about uncertainty, and recognizing the moment to stop and ask a human for help.
This is the same shift that quietly transformed software. The leap there was not a smarter autocomplete but the arrival of coding agents that could work across a real codebase over a long horizon while staying steerable. The co-mathematician is an attempt to give mathematics an equivalent. It also sits in a lineage of DeepMind systems aimed at discovery rather than conversation. AlphaEvolve, an earlier effort, found new algorithms for matrix multiplication and resolved combinatorial puzzles that had resisted researchers for decades. The difference is the posture. AlphaEvolve was an autonomous search engine pointed at a problem. The co-mathematician is built to collaborate, to be argued with, corrected, and redirected by a person who knows the terrain.
Reaction
The most persuasive evidence came not from the benchmark but from mathematicians who put it to work on real research. Marc Lackenby, a mathematician at Oxford, used the system to settle a long-standing open question from the Kourovka Notebook, a famous running collection of unsolved problems in group theory. The detail that stays with you is how it happened. The AI's first proof contained a flaw. A reviewer agent inside the system flagged it, and in seeing exactly where the argument broke, Lackenby realized he knew how to repair it. The back-and-forth was not a failure of the tool; it was the entire point of it.
Others reported similar experiences. One mathematician obtained claimed proofs for conjectures about Stirling coefficients; another, working on a technical subproblem in Hamiltonian systems, was handed a key lemma that, in his words, withstood careful checking, after other AI systems had failed on the same prompt. He even rated the elegance of its proofs above anything he had seen from other models. Every one of them stressed the same condition for success: the system works best when the human already knows the field deeply and can steer it. This is not a machine that replaces the mathematician. It is one that rewards a mathematician who knows what to ask, and how to listen to the answer.
What's Next
For now the co-mathematician is in limited release, available to a small circle of researchers. Google has signaled that the longer goal is to fold this style of working into products that far more people can reach, the way agentic coding tools moved from labs into everyday developer workflows within a couple of years. If that happens, the texture of mathematical research could change for a much wider community, not only at a handful of elite institutions.
The ambitions stated around the project are large. One prominent lab leader has suggested that AI could contribute to solving a Millennium Prize Problem, one of the seven great open questions in mathematics, within a handful of years. DeepMind's leadership has argued that the labs with the strongest mathematics and coding tools will pull ahead of the rest, precisely because those capabilities compound: better tools help build better models, which in turn make better tools. Whether or not the boldest timelines hold, the direction is set. The next contests will be less about solving a single hard problem and more about sustaining genuine research over the long, uncertain arc that real discovery requires.
Closing Thoughts
It would be easy to read this as a story about machines catching up to us, but the more interesting reading is about partnership, and about what gets harder when the partnership works. The paper is unusually candid about the failure modes. Review cycles between agents can drift into what the authors call a reviewer-pleasing bias, settling on an argument that sounds right while a subtle error hides inside it. They can also lock into endless disagreement, a kind of death spiral that experienced users learn to recognize and set aside. None of this is solved, and the authors do not pretend otherwise.
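The two failure modes can be pictured with a small Python sketch of a review loop. This is a hypothetical guard, not a mechanism the paper describes: the function review_loop, its status strings, and the max_rounds cap are all assumptions made for illustration. It stops when the reviewer repeats an objection or the round limit is reached, and it treats a reviewer's approval as provisional, since an argument that pleases the reviewer can still hide an error.

```python
from typing import Callable, List, Optional, Tuple

Review = Callable[[str], Optional[str]]   # returns an objection, or None to accept
Revise = Callable[[str, str], str]        # (draft, objection) -> new draft


def review_loop(draft: str, revise: Revise, review: Review,
                max_rounds: int = 3) -> Tuple[str, str]:
    """Run prover/reviewer rounds and return (final_draft, status)."""
    seen: List[str] = []
    for _ in range(max_rounds):
        objection = review(draft)
        if objection is None:
            # Approval is not proof: a human still has to check the argument.
            return draft, "accepted_pending_human_check"
        if objection in seen:
            # Same objection again: the agents are going in circles.
            return draft, "escalate_to_human"
        seen.append(objection)
        draft = revise(draft, objection)
    return draft, "escalate_to_human"


# Toy reviewer that is satisfied once the draft mentions a fix.
def toy_review(draft: str) -> Optional[str]:
    return None if "fixed" in draft else "gap in step 2"


def toy_revise(draft: str, objection: str) -> str:
    return draft + " (fixed: " + objection + ")"


final_draft, status = review_loop("proof sketch", toy_revise, toy_review)
print(status)
```

Neither exit path claims the proof is correct; both hand the final judgment back to a person, which is the posture the authors describe.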
And there is a quieter problem that reaches beyond any single system. If an AI can generate a plausible twenty-page proof in minutes, while a human referee needs days to check it, the slow and largely volunteer machinery of mathematical peer review comes under real strain. The authors name this directly as a risk to the field, not merely as noise in the literature. It is a reminder that the bottleneck in knowledge has never only been generation; it has always also been verification, the patient work of deciding what is actually true. What this project shows is not that the machine has replaced the mathematician's judgment, but that it has made that judgment more valuable, and more necessary, than before. The most striking moment in the whole effort was not the AI's answer. It was a human being, shown precisely where an argument failed, who suddenly understood how to make it whole.
Sources and further reading: OfficeChai, the research paper "AI Co-Mathematician" (arXiv), and Epoch AI's FrontierMath benchmark.
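Summary
Google DeepMind has unveiled an "AI co-mathematician." It is not a chatbot that simply answers questions but a research workspace in which multiple agents collaborate in parallel under a single project manager, working with human researchers on mathematical problems that remain unsolved. The system set a new record on the top tier of FrontierMath, the hardest mathematics benchmark, solving 48 percent (23 of 48 problems), well above the 19 percent achieved by the same Gemini model on its own. It should be stated plainly, however, that unlike competing systems it ran without a compute limit, so its cost is far higher and the comparison is not a fair one.
The most striking part was the experience of working mathematicians. Marc Lackenby, a mathematician at Oxford, used the system to resolve a long-standing open problem in group theory listed in the Kourovka Notebook. The AI's first proof contained a flaw, and when the system's internal reviewer agent pinpointed it, Lackenby realized how to fill the gap. In other words, the system did not replace the human; it worked best when a researcher with deep knowledge of the field steered it skillfully. The researchers also described the failure modes candidly, including a bias in which review converges on plausible but subtly wrong conclusions, and endless disagreement between agents.
The deeper implication is a shift in what gets measured. What now matters is less the ability to solve individual problems than orchestration: sustaining a research thread over weeks, synthesizing narrow and difficult literature, being honest about uncertainty, and knowing when to ask a human for help. Because AI can now turn out a 20-page proof in minutes, there is also a structural risk that volunteer-run mathematical peer review will come under heavy strain. Ultimately, the bottleneck in knowledge lies not only in generation but also in verification, the work of determining what is true, and the key point is that this tool has made human judgment more valuable rather than replacing it.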