NVIDIA Cosmos 3 Opens Physical AI Omnimodel for Robots

What Happened

NVIDIA unveiled Cosmos 3 at GTC Taipei on May 31, 2026, calling it the first fully open omnimodel for physical AI. The release packages vision reasoning, world generation, and action prediction into a single mixture-of-transformers system that can ingest text, images, video, ambient sound, and motion trajectories and generate any of those modalities as output. Founder and CEO Jensen Huang framed the launch as a generational shift, telling developers the model gives them a new foundation for building robots, autonomous vehicles, and vision agents that can perceive, reason, plan, and act in the physical world.

Jensen Huang during an NVIDIA keynote. Photo by Pronoia / CC0 via Wikimedia Commons.

The lineup ships in two scales out of the gate. Cosmos 3 Super, the 64B-parameter flagship, is tuned for post-training the robotics and AV models that need the highest physics accuracy. Cosmos 3 Nano, a 16B-parameter sibling built on a dense 8B-parameter transformer, targets sub-second video and action reasoning. A third variant, Cosmos 3 Edge, is slated to follow for real-time inference on devices. NVIDIA paired the launch with the Cosmos Coalition, a new alliance with Agile Robots, Black Forest Labs, Generalist, LTX, Runway, and Skild AI committed to building open world models together.

According to the company, Cosmos 3 currently ranks first among open models on Artificial Analysis, Physics-IQ, PAI-Bench, and R-Bench for world generation, on RoboLab and RoboArena for action policy, and on VANTAGE-Bench and TAR for vision understanding. Models are live on build.nvidia.com and Hugging Face today, with NIM microservice deployment available for production.

Why It Matters

Physical AI has been stuck on a structural problem: the systems that drive robots and self-driving cars need huge amounts of real-world data, yet collecting it is slow, dangerous, and expensive. Cosmos 3 attacks that gap by acting as a synthetic data factory. Trained on what NVIDIA calls one of the largest multimodal physical AI datasets — billions of samples spanning text, image, video, sound, and action — it can simulate plausible environments and produce labeled trajectories to bootstrap downstream policies. The pitch to developers is direct: less collected data, lower training cost, faster iteration.

A humanoid robot of the kind Cosmos 3 is built to train. Photo by Sikander / CC BY-SA 4.0 via Wikimedia Commons.

The technical bet is the two-tower mixture-of-transformers design. A reasoner tower — effectively a vision-language model — interprets multimodal inputs and works out how objects, motion, and space relate before any pixels are generated. A separate generator tower then renders physics-aware video and action sequences conditioned on that reasoning. The split lets one model reason and produce, instead of stitching a perception model to a separate simulator and a separate policy network. For teams building autonomous vehicles or warehouse robots, that means fewer integration seams and more consistent behavior under unfamiliar scenarios.
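The data flow between the two towers can be illustrated with a toy sketch. This is not NVIDIA's API or the Cosmos 3 architecture: `ReasonerTower`, `GeneratorTower`, `ObjectState`, and `run_pipeline` are hypothetical names, the "scene" is a handful of objects on a one-dimensional line, and the "physics" is constant-velocity motion. It only shows the shape of the idea, in which a reasoning step produces a structured interpretation and a generation step is conditioned on it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectState:
    name: str
    position: float  # 1-D position in metres (toy simplification)
    velocity: float  # metres per second


@dataclass(frozen=True)
class Reasoning:
    moving: tuple[str, ...]                     # objects judged to be in motion
    closing_pairs: tuple[tuple[str, str], ...]  # pairs whose gap is shrinking


class ReasonerTower:
    """Hypothetical stand-in: interprets the scene and generates nothing."""

    def interpret(self, scene: list[ObjectState]) -> Reasoning:
        moving = tuple(o.name for o in scene if o.velocity != 0.0)
        closing = []
        for i, a in enumerate(scene):
            for b in scene[i + 1:]:
                gap = b.position - a.position
                relative_velocity = b.velocity - a.velocity
                # The gap shrinks when it and the relative velocity
                # have opposite signs.
                if gap * relative_velocity < 0:
                    closing.append((a.name, b.name))
        return Reasoning(moving, tuple(closing))


class GeneratorTower:
    """Hypothetical stand-in: produces future frames and an action,
    conditioned on the reasoner's output."""

    def generate(self, scene, reasoning, steps, dt):
        frames = []
        for k in range(1, steps + 1):
            frame = {}
            for o in scene:
                # Only objects the reasoner flagged as moving are advanced.
                shift = o.velocity * k * dt if o.name in reasoning.moving else 0.0
                frame[o.name] = o.position + shift
            frames.append(frame)
        action = "brake" if reasoning.closing_pairs else "hold"
        return frames, action


def run_pipeline(scene, steps, dt):
    """Reason first, then generate conditioned on that reasoning."""
    reasoning = ReasonerTower().interpret(scene)
    frames, action = GeneratorTower().generate(scene, reasoning, steps, dt)
    return reasoning, frames, action


if __name__ == "__main__":
    demo_scene = [ObjectState("robot", 0.0, 1.0), ObjectState("pallet", 5.0, 0.0)]
    print(run_pipeline(demo_scene, steps=2, dt=0.5))
```

In a real system both towers would be learned transformers rather than hand-written rules; the point of the sketch is only that the generator never sees the raw scene without the reasoner's interpretation attached.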

It also reshapes the open-source frontier. Until now, the strongest physical AI models have been proprietary, locked inside vendors' simulation stacks. By dropping Cosmos 3 weights on Hugging Face under an open license and pairing them with NVIDIA's training tools, NVIDIA is making a bid to set the default open baseline the rest of the field measures against, much as Llama did for text models in 2023 and 2024.

Industry Reaction

The Coalition is the headline signal, but the deployment list is longer. NVIDIA named Agile Robots, Doosan Robotics, LG Electronics, Samsung, and Skild AI as robotics builders adopting the platform, with Li Auto picked out for autonomous driving. On the vision AI side, Centific, Fogsphere, Linker Vision, Milestone Systems, and Yuan are pulling Cosmos 3 into industrial and smart-space products. Cloud partners Baseten, CoreWeave, Microsoft Azure, Nebius, Deep Infra, and Classmethod will host the inference workloads.

An industrial collaborative robot arm. Photo by Auledas / CC BY-SA 4.0 via Wikimedia Commons.

Robotics analysts at AIwire read the move as NVIDIA stitching a single platform across what used to be three separate stacks: humanoids, AVs, and industrial vision. The Cosmos Coalition language about contributing models and research suggests NVIDIA wants outside labs publishing on top of the omnimodel, not forking it. That mirrors the path of recent vision-language and code models, where shared baselines accelerated everyone's progress.

A quadruped mobile robot of the type Cosmos 3 can train. Photo by Jonte / CC BY-SA 4.0 via Wikimedia Commons.

There is also a tooling tailwind. The same week as the Cosmos 3 launch, NVIDIA rolled out new physical AI agent skills for neural scene reconstruction and defect-image generation, plus an updated Isaac stack for humanoid training. Together with last week's Generalist AI funding round and Mistral's pivot to industrial AI, these releases suggest the field is finally getting the unified tooling its hype cycle has promised for two years.

What's Next

The roadmap centers on three threads. First, Cosmos 3 Edge is the missing leg of the lineup — designed for real-time inference on robots and vehicles where round-trip cloud latency is fatal. Until it ships, deployments will lean on Super and Nano running in NVIDIA's DGX Cloud or in coalition partners' data centers, with on-device control loops handed off to smaller distilled policies.

An NVIDIA H100 data center accelerator. Photo by Geekerwan / CC BY 3.0 via Wikimedia Commons.

Second, the Cosmos Coalition will be the test of whether NVIDIA can keep the model open without losing control of the ecosystem. Founding members like Runway and Black Forest Labs already ship competitive video models; their willingness to contribute back, rather than fork, will shape how durable the alliance feels by GTC 2027. Third, regulators are watching. Synthetic data for autonomous driving sits in a legal gray zone in the United States and the European Union, and a foundation model that generates plausible traffic scenes will pull both regions deeper into the data-provenance debate that began with text models.

For engineers building today, the practical next step is benchmarking. Independent evaluations of physical AI claims have a thin track record, and the leaderboards NVIDIA cites are young. Expect academic groups to publish replication studies within months, particularly on Physics-IQ and the RoboArena policy tests. If the numbers hold, Cosmos 3 will quietly become the reference point for every physical AI demo through the rest of 2026. If they fade under scrutiny, the omnimodel pitch will need to defend itself task by task.
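A task-by-task comparison of claimed and independently measured scores might be organized along these lines. The sketch is a bookkeeping helper only: `compare_scores`, the task names, the tolerance, and every number are placeholders invented for illustration, not results from Physics-IQ, RoboArena, or any other benchmark.

```python
from __future__ import annotations

from statistics import mean


def compare_scores(claimed, replicated, tolerance=0.05):
    """Label each task by whether replicated runs support the claimed score.

    claimed:    {task: score reported by the vendor}
    replicated: {task: [scores from independent runs]}
    tolerance:  largest shortfall (claimed minus observed mean) still
                counted as "holds"; an assumed threshold, not a standard.
    """
    report = {}
    for task, claimed_score in claimed.items():
        runs = replicated.get(task)
        if not runs:
            report[task] = "not replicated"
            continue
        shortfall = claimed_score - mean(runs)
        report[task] = "holds" if shortfall <= tolerance else "falls short"
    return report


if __name__ == "__main__":
    # Placeholder values for illustration only.
    claimed = {"task_a": 0.80, "task_b": 0.70, "task_c": 0.90}
    replicated = {"task_a": [0.78, 0.80], "task_b": [0.55, 0.60]}
    for task, verdict in compare_scores(claimed, replicated).items():
        print(f"{task}: {verdict}")
```

Keeping the verdict per task, rather than averaging everything into one number, matches the way a leaderboard claim would have to be defended if some results do not replicate.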

Closing Thoughts

For two years the phrase “physical AI” lived mostly in keynote slides. Cosmos 3 is the first time a single open model claims to reason, simulate, and act on the same backbone, and the first time a partner list reaches across humanoids, autonomous driving, and industrial vision at once. Whether the model lives up to its leaderboards is the next question. The architectural argument, that reasoning and generation belong in one system sharing a world prior, is the part that may matter longer than any single benchmark.

The Atlas humanoid robot built by Boston Dynamics. Photo by DARPA / Public domain via Wikimedia Commons.

If you build robots or AVs, the practical move this week is to pull the Hugging Face weights and run the Physics-IQ tasks against your own scenarios. If you build foundation models, the more interesting question is whether the two-tower design generalizes outside physical AI — for instance, to scientific simulation or game engines. Either way, the omnimodel framing is now planted, and the next twelve months will tell us whether physical AI has finally found its Llama moment or whether it remains, for now, the most ambitious slide deck in the industry.

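Summary

NVIDIA unveiled Cosmos 3, an open omnimodel for physical AI, at GTC Taipei on May 31. Its mixture-of-transformers architecture bundles vision reasoning, world generation, and action prediction into a single system, and it can both understand and generate text, images, video, ambient sound, and action trajectories. Two versions shipped first, the 64-billion-parameter Super and the 16-billion-parameter Nano, and an Edge version for edge devices will follow soon.

The core is the two-tower structure. The Reasoner tower, acting as a vision-language model, first interprets objects, motion, and spatial relationships; the Generator tower then produces physically plausible video and action sequences conditioned on that result. Observers say it resolves in one stroke the fragmentation that robot and autonomous-driving developers faced when bolting together separate perception models, simulators, and policy networks. Partners such as Agile Robots, Doosan Robotics, LG Electronics, Samsung, and Li Auto announced adoption right away, and anyone can download and use the model through Hugging Face and NIM microservices.

Two questions remain. First, whether the benchmark leads NVIDIA cites, such as Physics-IQ and RoboArena, hold up under external validation will be tested over the coming months. Second, synthetic video data is likely to reopen regulatory and data-provenance questions in autonomous driving. Even so, the architectural message of handling reasoning and generation together on a single backbone is likely to become a reference point that pulls the whole physical AI industry in its direction.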