Jensen Huang GTC Keynote Speech Transcript

Welcome to GTC. What an amazing year. We want to be able to make a new world. Through the magic of artificial intelligence, we are going to bring you to NVIDIA’s headquarters. This is where we work.

What an amazing year it was, and we have incredible things to talk about. I just want you to know that I am up here without a net. There are no scripts, there is no teleprompter, and I’ve got a lot of things to cover. Let’s get started.

First of all, I want to thank all the sponsors, all the amazing people who are part of this conference. Just about every single industry is represented. Healthcare is here. Transportation. Retail. The computer industry. Everybody in the computer industry is here, and so it’s really, really terrific to see all of you. Thank you for sponsoring it.

GTC started with GeForce. It all started with GeForce. Today, I have here a GeForce 5090. And the 5090, unbelievably, 25 years later—25 years after we started working on GeForce—GeForce is sold out all over the world.

This is the 5090, the Blackwell generation. Comparing it to the 4090, it’s 30% smaller in volume. It’s 30% better at dissipating energy, and has incredible performance. Hard to even compare. And the reason for that is because of artificial intelligence.

GeForce brought CUDA to the world. CUDA enabled AI. And AI has now come back to revolutionize computer graphics. What you’re looking at is real-time computer graphics, 100% path traced. For every pixel that’s rendered, artificial intelligence predicts the other 15. Think about this for a second: for every pixel that we mathematically rendered, artificial intelligence inferred the other 15. And it has to do so with so much precision that the image looks right, and it’s temporally accurate, meaning that from frame to frame, going forward or backward, the image has to stay temporally stable. Incredible.
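
As a rough sketch of how a 1-in-16 ratio can come about (this decomposition is my illustration, not something stated in the keynote): if the renderer computes one out of every four pixels per frame through 2x upscaling in each axis, and fully renders only one out of every four displayed frames while generating the other three, then

$$\underbrace{2 \times 2}_{\text{upscaling}} \times \underbrace{4}_{\text{frame generation}} = 16 \text{ output pixels per rendered pixel},$$

so one pixel is computed and the other 15 are inferred.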

Evolution of AI

Artificial intelligence has made extraordinary progress. It has only been 10 years. Now we’ve been talking about AI for a little longer than that, but AI really came into the world’s consciousness about a decade ago.

Started with perception AI: computer vision, speech recognition. Then generative AI. The last five years we’ve largely focused on generative AI, teaching an AI how to translate from one modality to another: text to image, image to text, text to video, amino acids to proteins, properties to chemicals—all kinds of different ways that we can use AI to generate content.

«From a retrieval computing model, we now have a generative computing model… Rather than retrieving data, it now generates answers. Fundamentally changed how computing is done. Every single layer of computing has been transformed.»

Generative AI fundamentally changed how computing is done. From a retrieval computing model, we now have a generative computing model. Almost everything that we did in the past was about creating content in advance, storing multiple versions of it, and fetching whatever version we think is appropriate at the moment of use.

Now AI understands the context, understands what we’re asking, understands the meaning of our request, and generates what it knows. If it needs, it’ll retrieve information, augment its understanding, and generate an answer for us. Rather than retrieving data, it now generates answers. This fundamentally changed how computing is done. Every single layer of computing has been transformed.

In the last several years, the last couple, two, three years, a major breakthrough happened: a fundamental advance in artificial intelligence. We call it agentic AI. Agentic AI basically means that you have an AI that has agency. It can perceive and understand the context of the circumstance. It can reason—very importantly, can reason—about how to answer or how to solve a problem. It can plan an action. It can use tools.

It now understands multi-modality information. It can go to a website and look at the format of the website, words and videos. Maybe even play a video. It learns from that website, understands it, comes back, and uses that information, that newfound knowledge, to do its job. Agentic AI.

The next wave is already happening. We are going to talk a lot about that today: robotics, which has been enabled by physical AI. AI that understands the physical world. It understands things like friction and inertia, cause and effect, object permanence. The ability to understand the three-dimensional world is what is going to enable a new era of AI we call physical AI, and it will enable robotics.

Each one of these phases, each one of these waves opens up new market opportunities for all of us. It brings more and new partners to GTC. As a result, GTC is now jam packed.

I’m standing here. I wish all of you could see what I see. We are in the middle of a stadium. Last year was the first year back that we did this live. It was like a rock concert. GTC was described as the Woodstock of AI. This year, it has been described as the Super Bowl of AI. The only difference is that everybody wins at this Super Bowl. Everybody is a winner. Every single year more people come because AI is able to solve more interesting problems for more industries and more companies.

This year we are going to talk a lot about agentic AI and physical AI. At its core, what enables each wave and each phase of AI are three fundamental questions:

  1. The first is, how do you solve the data problem? The reason why that’s important is because AI is a data-driven computer science approach. It needs data to learn from. It needs digital experience to learn from.
  2. The second is, how do you solve the training problem without humans in the loop? The reason why «human in the loop» is fundamentally challenging is because we only have so much time, and we would like an AI to be able to learn at superhuman rates, at super real-time rates, and to be able to learn at a scale that no humans can keep up with.
  3. And the third is, how do you scale? How do you create, how do you find an algorithm whereby the more resource you provide—whatever the resource is—the smarter the AI becomes? The scaling law.

This last year, this is where almost the entire world got it wrong: the computation requirement. The scaling law of AI is more resilient and in fact hyper accelerated. The amount of computation we need at this point, as a result of agentic AI, as a result of reasoning, is easily 100 times more than we thought we needed this time last year.

Reasoning and Computation Requirements

Let’s reason about why that’s true. The first part is, let’s just go from what the AI can do. Let me work backwards. Agentic AI, as I mentioned, at its foundation is reasoning. We now have AIs that can reason, which is fundamentally about breaking a problem down step by step. Maybe it approaches a problem in a few different ways and selects the best answer. Maybe it solves the same problem in a variety of ways and ensures it has the same answer—consistency checking. Or maybe after it’s done deriving the answer, it plugs it back into the equation—maybe a quadratic equation—to confirm that in fact that’s the right answer, instead of just one-shot blurting it out.

Remember two years ago, when we started working with ChatGPT, a miracle as it was, there were many complicated questions and many simple questions it simply couldn’t get right. And understandably so. It took a one-shot—whatever it learned by studying pre-trained data, whatever it saw from other experiences, pre-trained data—it does a one-shot and blurts it out, like a savant. Now we have AIs that can reason step by step, using a technology called chain of thought, best of N, consistency checking, a variety of different path planning, a variety of different techniques. We now have AIs that can reason, step by step by step.

You could imagine, as a result, the number of tokens we generate—and the fundamental technology of AI is still the same: generate the next token, predict the next token. It’s just that the next token now makes up step one. Then, after it generates step one, that step one goes into the input of the AI again as it generates step two, and step three, and step four. So instead of just generating one token or one word after the next, it generates a sequence of words that represents a step of reasoning.
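
As a minimal sketch of that loop, assuming a hypothetical `generate_next_token(context)` model interface (not any particular NVIDIA or OpenAI API), the only point being that each reasoning step becomes part of the input for the next:

```python
def reason(prompt: str, generate_next_token, max_tokens: int = 4096) -> str:
    """Toy chain-of-thought decoding loop.

    Every generated token is appended to the context, so step one of the
    reasoning becomes input when the model generates step two, and so on.
    """
    context = prompt
    generated = []
    for _ in range(max_tokens):
        token = generate_next_token(context)   # predict just ONE next token
        if token == "<end_of_answer>":         # hypothetical stop token
            break
        generated.append(token)
        context += token                       # feed the output back in
    return "".join(generated)
```

The cost follows directly: a reasoning trace of thousands of tokens means thousands of passes through this loop instead of a single short answer.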

The amount of tokens that’s generated as a result is substantially higher, and I’ll show you in a second. Easily a hundred times more. Now a hundred times more, what does that mean? Well, it could generate a hundred times more tokens, and you can see that happening as I explained previously. Or the model is more complex. It generates ten times more tokens, and in order for us to keep the model responsive, interactive, so that we don’t lose our patience waiting for it to think, we now have to compute ten times faster. Ten times tokens, ten times faster—the amount of computation we have to do is a hundred times more, easily.
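
Written out, the arithmetic here is simply

$$\text{compute} \;\propto\; (\text{tokens per answer}) \times (\text{token rate required}) \;\approx\; 10 \times 10 = 100\times.$$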

You’re going to see this in the rest of the presentation—the amount of computation we have to do for inference is dramatically higher than it used to be.

Training AI to Reason

The question then becomes: how do we teach an AI how to do what I just described? How to execute this chain of thought? Well, one method is you have to teach the AI how to reason. And as I mentioned earlier, in training, there are two fundamental problems we have to solve:

  1. Where does the data come from?
  2. How do we not have it be limited by human in the loop?

There’s only so much data and so much human demonstration we can perform. And so this is the big breakthrough in the last couple of years: reinforcement learning with verifiable results. Basically, reinforcement learning of an AI as it attacks, or tries to engage in solving, a problem step by step.

We have many problems that have been solved in the history of humanity where we know the answer. We know the quadratic equation and how to solve it. We know the Pythagorean theorem, the rules of a right triangle. We know many, many rules of math and geometry and logic and science. We have puzzle games that we could give it. We have constraint-type problems like Sudoku.

We have hundreds of these problem spaces where we can generate millions of different examples and give the AI hundreds of chances as we use reinforcement learning to reward it as it does a better and better job. So as a result, you take hundreds of different topics, millions of different examples, hundreds of different tries. Each one of the tries generating tens of thousands of tokens.

You put that all together, we’re talking about trillions and trillions of tokens in order to train that model. And now with reinforcement learning, we have the ability to generate an enormous amount of tokens—synthetic data generation, basically using a robotic approach to teach an AI.
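
A minimal sketch of that reinforcement-learning-with-verifiable-results idea, using a made-up quadratic-equation verifier as the “answer we already know how to check” (the `policy` callable and everything else here are illustrative, not an NVIDIA training pipeline):

```python
import random

def verify_quadratic(a, b, c, roots, tol=1e-6):
    """Verifiable result: plug the proposed roots back into ax^2 + bx + c = 0."""
    return bool(roots) and all(abs(a * r * r + b * r + c) < tol for r in roots)

def score_policy(policy, problems, tries_per_problem=8):
    """Give the model many tries per problem; reward 1 only when the check passes."""
    rewards = []
    for a, b, c in problems:
        for _ in range(tries_per_problem):
            roots = policy(a, b, c)                 # the model's attempt
            rewards.append(1.0 if verify_quadratic(a, b, c, roots) else 0.0)
    return sum(rewards) / len(rewards)              # signal fed back into training

# Synthetic data generation: solvable problems can be produced by the million.
problems = [(1.0, random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(1_000)]
```

Multiply hundreds of topics by millions of examples, hundreds of tries, and tens of thousands of tokens per try, and you arrive at the trillions of training tokens described above.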

The combination of these two things has put an enormous, enormous challenge of computing in front of the industry.

Infrastructure Growth

You can see that the industry is responding. This is what I’m about to show you: Hopper shipments of the top four CSPs—the top four CSPs, the ones with the public clouds: Amazon, Azure, GCP, and OCI. The top four CSPs—not the AI companies, that’s not included, not all the startups, not included, not enterprise, not included, a whole bunch of things not included—just those four.

Just to give you a sense of comparing the peak year of Hopper and the first year of Blackwell. The peak year of Hopper and the first year of Blackwell. So you can kind of see that in fact, AI is going through an inflection point. It has become more useful because it’s smarter, it can reason. It is more used—you can tell it’s more used because whenever you go to ChatGPT these days, it seems like you have to wait longer and longer and longer, which is a good thing. It says a lot of people are using it with great effect. And the amount of computation necessary to train those models and to inference those models has grown tremendously.

So in just one year—and Blackwell is just starting to ship—in just one year, you could see the incredible growth in AI infrastructure.

Well, that’s been reflected in computing across the board. We’re now seeing, and this [slide] is the forecast of analysts about the increase of capital expense of the world’s data centers, including CSPs and enterprise and so on, the world’s data centers through the end of the decade, so 2030.

I’ve said before that I expect data center build-out to reach a trillion dollars, and I am fairly certain we’re going to reach that very soon. Two dynamics are happening at the same time:

  1. The first dynamic is that the vast majority of that growth is likely to be accelerated, meaning we’ve known for some time that general purpose computing has run its course, and that we need a new computing approach. And the world is going through a platform shift from hand-coded software running on general purpose computers to machine learning software running on accelerators and GPUs. This way of doing computation is at this point past the tipping point, and we are now seeing the inflection happening in the world’s data center buildouts.
  2. Second is an increase in recognition that the future of software requires capital investment. This is a very big idea. Whereas in the past we wrote the software and ran it on computers, in the future the computer is going to generate the tokens for the software. The computer has become a generator of tokens, not a retriever of files—from retrieval-based computing to generative-based computing, from the old way of doing data centers to a new way of building this infrastructure, and I call them AI factories.

They are AI factories because they have one job and one job only: generating these incredible tokens that we then reconstitute into music, into words, into videos, into research, into chemicals, or proteins. We reconstitute it into all kinds of information of different types.

The world is going through a transition in not just the amount of data centers that will be built, but also how they are built.

CUDA Libraries and Ecosystem

Well, everything in the data center will be accelerated, not all of it’s AI. This slide is genuinely my favorite. The reason for that is because for all of you coming to GTC all these years, you have been listening to me talk about these libraries this whole time. This is in fact what GTC is all about. This one slide. In fact, a long time ago, 20 years ago, this is the only slide we had—one library after another library after another library.

You can’t just accelerate software. We needed an AI framework in order to create AI, and we accelerate the AI frameworks. You need frameworks for physics, and biology, and multi-physics, and all kinds of different quantum physics. You need all kinds of libraries and frameworks. We call them CUDA-X libraries, acceleration frameworks for each one of these fields of science.

[Jensen proceeds to describe various CUDA libraries and their applications across different industries, including cuDF for data frames, cuLitho for computational lithography, and many others.]

This is just a sampling of the libraries that make accelerated computing possible. It’s not just CUDA. We’re so proud of CUDA, but if not for CUDA, and the fact that we have such a large install base, none of these libraries would be useful for the developers who use them. For all the developers that use them, you use it because:

  1. It’s going to give you incredible speed up, incredible scale up
  2. Because the install base of CUDA is now everywhere—it’s in every cloud, it’s in every data center, it’s available from every computer company in the world. It’s literally everywhere.

And therefore, by using one of these libraries, your software, your amazing software can reach everyone. We’ve now reached the tipping point of accelerated computing. CUDA has made it possible. And all of you, this is what GTC is about—the ecosystem—all of you made this possible.

AI Everywhere

AI started in the cloud. It started in the cloud for good reason, because it turns out that AI needs infrastructure. It’s machine learning. If the science is machine learning, then you need a machine to do the science. So machine learning requires infrastructure, and the cloud data centers had infrastructure, as well as the ability to do the science, the engineering, the computer science, and the research—the perfect circumstance for AI to take off in the cloud with the CSPs.

But that’s not where AI is limited to. AI will go everywhere. The cloud service providers, of course, they like our leading-edge technology. They like the fact that we have a full stack, because accelerated computing, as you know, is a big step. It’s not just the chip; it’s the chip, the libraries, the programming model, and a whole bunch of software that goes on top of it. That entire stack is incredibly complex. Each one of those layers, each one of those libraries, is essentially like SQL.

SQL, as you know, is called in-storage computing. It was the big revolution of computation by IBM. SQL is just one library—just imagine, here we have a whole bunch of them. And in the case of AI, there’s a whole bunch more. So the stack is complicated.

CSPs also love that NVIDIA CUDA developers are CSP customers, because in the final analysis, they’re building infrastructure for the world to use. So the rich developer ecosystem is really valued and really, really deeply appreciated.

Now that we’re going to take AI out to the rest of the world, the rest of the world has different system configurations, operating environment differences, domain-specific library differences, usage differences. AI, as it translates to enterprise IT, as it translates to manufacturing, as it translates to robotics or self-driving cars, or even companies that are starting GPU clouds—there’s a whole bunch of companies, maybe 20 of them, who started during this time. And what they do is just one thing: they host GPUs. They call themselves GPU clouds.

One of our great partners, CoreWeave, is in the process of going public, and we’re super proud of them. GPU clouds, they have their own requirements. But one of the areas that I’m super excited about is edge.

Today we announced that Cisco, NVIDIA, T-Mobile (the largest telecommunications company in the world), and Cerberus ODC are going to build a full stack for radio networks here in the United States. That’s going to be the second stack. And so this stack that we’re announcing today will put AI into the edge.

Remember, $100 billion of the world’s capital investments each year is in the radio networks and all of the data centers provisioning for communications. In the future, there is no question in my mind that’s going to be accelerated computing infused with AI. AI will do a far, far better job adapting the radio signals, the massive MIMO, to the changing environments and traffic conditions. Of course it would. Of course we would use reinforcement learning to do that. Of course MIMO is essentially one giant radio robot. Of course it is. And so we will of course provide for those capabilities. Of course, AI can revolutionize communications.

«Jensen, because of your work, I can do my life’s work in my lifetime.» – Jensen quoting a scientist’s comment to him, which he described as deeply moving: «And boy, if that doesn’t touch you, well, you’ve got to be a corpse.»

Autonomous Vehicles

One of the earliest industries that AI went into was autonomous vehicles. The moment I saw AlexNet—and we’ve been working on computer vision for a long time—the moment I saw AlexNet was such an inspiring moment, such an exciting moment. It caused us to decide to go all in on building self-driving cars.

So we’ve been working on self-driving cars now for over a decade. We built technology that almost every single self-driving car company uses. It could be in the data center—for example, Tesla uses a lot of NVIDIA GPUs in the data center. It could be in the data center or the car. Waymo and Wayve use NVIDIA computers in data centers as well as in the car. It could be just in the car—it’s very rare, but sometimes it’s just in the car. Or they could use all of our software in addition. We work with the car industry however the car industry would like us to work with them.

We build all three computers:

  1. The training computer
  2. The simulation computer
  3. The robotics computer, the self-driving car computer

All the software stack that sits on top of it, models and algorithms, just as we do with all of the other industries that I’ve demonstrated.

Today, I’m super excited to announce that GM has selected NVIDIA to partner with them to build their future self-driving car fleet. The time for autonomous vehicles has arrived. And we’re looking forward to building AI with GM in all three areas:

  1. AI for manufacturing—they can revolutionize the way they manufacture
  2. AI for enterprise—they can revolutionize the way they work, design cars and simulate cars
  3. AI in the car—in short, AI infrastructure for GM

One of the areas that I’m deeply proud of, and it rarely gets any attention, is safety. In our company, it’s called HALOS. Safety requires technology from the system to the software, the algorithms, the methodologies—everything from ensuring diversity to monitoring, transparency, and explainability.

All of these different philosophies have to be deeply ingrained into every single part of how you develop the system and the software. We’re the first company in the world, I believe, to have every line of code safety assessed. Seven million lines of code, safety assessed. Our chip, our system, our system software, our system algorithms are safety assessed by third parties that crawl through every line of code to ensure that it is designed to ensure diversity, transparency, and explainability. We have filed over a thousand patents.

Blackwell and Data Centers

Blackwell is in full production, and this is what it looks like. For us, this is a sight of beauty. This is… How is this not beautiful? How is this not beautiful?

This is a big deal because we made a fundamental transition in computer architecture. I just want you to know that, in fact, I showed you a version of this about three years ago. It was called Grace Hopper, and the system was called Ranger. The Ranger system is maybe about half of the width of the screen. And it was the world’s first NVLink 32.

Three years ago, we showed Ranger working and it was way too large, but it was exactly the right idea. We were trying to solve scale-up. Distributed computing is about using a whole lot of different computers working together to solve a very large problem. But there is no replacement for scaling up before you scale out. Both are important, but you want to scale up first before you scale out.

But scaling up is incredibly hard; there is no simple answer for it. You’re not going to scale up like Hadoop—take a whole bunch of commodity computers, hook them up into a large network, and do in-storage computing using Hadoop. Hadoop was a revolutionary idea, as we know. It enabled hyperscale data centers to solve problems of gigantic size using off-the-shelf computers.

However, the problem we’re trying to solve is so complex that scaling in that way would have simply cost way too much power, way too much energy. It would have never—deep learning would have never happened. And so the thing that we had to do was scale up first.

[Jensen describes the evolution of NVIDIA’s system architecture from HGX to the new disaggregated NV-Link systems, explaining how they’ve improved scale-up capabilities dramatically.]

We wanted to scale up even further. And I told you that Ranger took this system and scaled it up by another factor of four. And so we had NVLink 32, but the system was way too large. And so we had to do something quite remarkable: re-engineer how NVLink worked, and how scale-up worked.

The first thing that we did was we said, listen, the NVLink switches are embedded on the motherboard in this system. We need to disaggregate the NVLink system and take it out. So this is the NVLink system, this is an NVLink switch. This is the highest performance switch the world’s ever made. And this makes it possible for every GPU to talk to every GPU at exactly the same time at full bandwidth.

We disaggregated it, we took it out, and we put it in the center of the chassis. There are 18 of these switches in nine different switch trays, as we call them. And the switches are disaggregated. The compute is now sitting in here. This is equivalent to these two things in compute.

What’s amazing is this is completely liquid-cooled. And by liquid cooling it, we can compress all of these compute nodes into one rack. This is the big change for the entire industry. All of you in the audience, I want to thank you for making this fundamental shift from integrated NVLink to disaggregated NVLink, from air-cooled to liquid-cooled, from 60,000 components per computer or so to 600,000 components per rack. 120 kilowatts, fully liquid-cooled, and as a result, we have a one-exaflops computer in one rack.

3,000 pounds… 5,000 cables… about 2 miles worth… just incredible electronics. 600,000 parts, I think that’s like 20 cars—20 cars worth of parts. And it integrates into one supercomputer.

Well, our goal is to do scale-up. And this is what it now looks like. We essentially wanted to build this chip. It’s just that no reticle limit can do this. No process technology can do this. It’s 130 trillion transistors—20 trillion of it is used for computing. So you can’t reasonably build this anytime soon.

And so the way to solve this problem is to disaggregate it, as I described, into the Grace Blackwell NVLink 72 rack. But as a result, we have done the ultimate scale-up. This is the most extreme scale-up the world has ever done. The amount of computation that’s possible here… The memory bandwidth, 570 terabytes per second. Everything in this machine is now in T’s. Everything’s a trillion. And you have an exaflops, which is a million trillion floating point operations per second.

Inference and AI Factories

The reason why we wanted to do this is to solve an extreme problem. And that extreme problem, a lot of people misunderstood to be easy. And in fact, it is the ultimate extreme computing problem. And it’s called inference.

And the reason for that is very simple. Inference is token generation by a factory. And a factory is revenue and profit generating—or lack of. And so this factory has to be built with extreme efficiency, with extreme performance. Because everything about this factory directly affects your quality of service, your revenues, and your profitability.

[Jensen describes the trade-offs between token generation speed and throughput in AI factories, illustrating how these affect both user experience and data center economics.]

You have two axes. On the x-axis is the tokens per second. Whenever you put a prompt into chat, you can see what comes out as tokens. Those tokens are reformulated into words. It’s more than a token per word.

We’ve already established that if you want your AI to be smarter, you want to generate a whole bunch of tokens. And so, those tokens might be second-guessing itself. It might be asking, «Is this the best word I could use?» And so, it talks to itself, just like we talk to ourselves. The more tokens you generate, the smarter your AI.

But, if you take too long to answer a question, the customer is not going to come back. This is no different than web search. There is a real limit to how long it can take before it comes back with a smart answer.

So, you have these two dimensions that you’re fighting against. You’re trying to generate a whole bunch of tokens, but you’re trying to do it as quickly as possible. Therefore, your token rate matters. You want your tokens per second for that one user to be as fast as possible.

However, in computer science and factories, there’s a fundamental tension between latency (response time) and throughput. And the reason is very simple. If you’re in the large high-volume business, you batch up—it’s called batching. You batch up a lot of customer demand, and you manufacture a certain version of it for everybody to consume later. However, from the moment that they batched up and manufactured whatever they did to the time that you consumed it, it could take a long time.

You have these two fundamental tensions. On the one hand, you would like the customer’s quality of service to be as good as possible—smart AIs that are super fast. On the other hand, you’re trying to get your data center to produce tokens for as many people as possible so you can maximize your revenues.

The perfect answer is to the upper right. Ideally, the shape of that curve is a square—that you could generate very fast tokens per person up until the limits of the factory—but no factory can do that. And so, it’s probably some curve. And your goal is to maximize the area under the curve.
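
As a toy illustration of that curve (all numbers invented), bigger batches keep the GPUs better utilized, which raises total factory throughput, but each individual user then sees fewer tokens per second:

```python
# Toy model of the latency/throughput trade-off in an AI factory.
# Larger batches improve utilization (total tokens/sec) but dilute
# the rate any single user experiences. All numbers are made up.
PEAK_TOKENS_PER_SEC = 100_000           # ideal factory throughput when saturated

def factory_throughput(batch_size: int) -> float:
    utilization = batch_size / (batch_size + 32)   # saturating utilization curve
    return PEAK_TOKENS_PER_SEC * utilization

for batch in (1, 8, 64, 512):
    total = factory_throughput(batch)
    per_user = total / batch            # what one interactive user sees
    print(f"batch={batch:4d}  total={total:8.0f} tok/s  per-user={per_user:7.1f} tok/s")
```

Maximizing the area under the curve means finding operating points that keep both of these numbers high at once.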

[Jensen demonstrates this concept with a video showing traditional LLMs versus reasoning models, where a reasoning model uses many more tokens to solve a complex problem correctly, while a traditional model gives a quick but incorrect answer.]

NVIDIA Dynamo

One of the observations—and this is a really terrific thing about having a homogeneous architecture like NVLink72—is that every single GPU could do all of the things that I just described. And we observe that these reasoning models are doing a couple of phases of computing.

One of the phases of computing is thinking. When you’re thinking, you’re not producing a lot of tokens. You’re producing tokens that you’re maybe consuming yourself. You’re thinking. Maybe you’re reading, you’re digesting information—that information could be a PDF, that information could be a website. You could literally be watching a video, ingesting all of that at super linear rates, and you take all of that information, and you then formulate the answer, formulate a planned answer. And so that digestion of information, context processing, is very flops intensive.

On the other hand, the next phase is called decode. So the first part we call pre-fill; the next phase, decode, requires floating point operations, but it requires an enormous amount of bandwidth.

And it’s fairly easy to calculate. You know, if you have a model, and it’s a few trillion parameters, well, it takes a few terabytes per second. Notice I was mentioning 576 terabytes per second. It takes terabytes per second to just pull the model in from HBM memory and to generate literally one token.

And the reason it generates one token is because, remember, these large language models are predicting the next token. That’s why they say the next token. It’s not predicting every single token. It’s predicting the next token.

Now, we have all kinds of new techniques, speculative decoding and others, for doing that faster, but in the final analysis, you’re predicting the next token. And so you ingest, pull in, the entire model and the context—we call it the KV cache—and then we produce one token. And then we take that one token, we put it back into our brain, and we produce the next token.

Every single time we do that, we take trillions of parameters in, we produce one token. Trillions of parameters in, produce another token. Trillions of parameters in, produce another token. And in that demo, we produced 8,600 tokens. So trillions of bytes of information have been taken into our GPUs and produced one token at a time.
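
As a rough version of that arithmetic, assuming a hypothetical two-trillion-parameter model stored at one byte per parameter (the size and precision are my assumptions, not figures from the talk):

$$2\times10^{12}\ \text{parameters} \times 1\ \tfrac{\text{byte}}{\text{parameter}} \approx 2\ \text{TB read per decode step},$$

so serving even 100 tokens per second to a single user this naive way would mean on the order of 200 TB/s of memory traffic. Batching many users into each decode step amortizes that read, which is part of why hundreds of terabytes per second of HBM bandwidth, and NVLink to pool it, matter so much.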

Which is fundamentally the reason why you want NVLink. NVLink gives us the ability to take all of those GPUs and turn them into one massive GPU—the ultimate scale up.

The second thing is that now that everything is on NVLink, I can disaggregate the pre-fill from the decode, and I could decide I want to use more GPUs for pre-fill, less for decode, because I’m thinking a lot. I’m doing agentic activities. I’m reading a lot of information. I’m doing deep research.

Well, this dynamic operation is really complicated. So I’ve just now described pipeline parallel, tensor parallel, expert parallel, in-flight batching, disaggregated inferencing, workload management. And then I’ve got to take this thing called a KV cache, I’ve got to route it to the right GPU, I’ve got to manage it through all the memory hierarchies. That piece of software is insanely complicated.
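
As a deliberately simplified sketch of that allocation decision, assuming a hypothetical 72-GPU NVLink domain and invented demand numbers (real systems like Dynamo schedule this dynamically across far more dimensions):

```python
# Toy split of one NVLink domain between pre-fill (flops-heavy context
# processing) and decode (bandwidth-heavy token generation).
TOTAL_GPUS = 72

def split_gpus(prefill_demand: float, decode_demand: float) -> tuple[int, int]:
    """Allocate GPUs in proportion to the current demand of each phase."""
    share = prefill_demand / (prefill_demand + decode_demand)
    prefill_gpus = min(TOTAL_GPUS - 1, max(1, round(TOTAL_GPUS * share)))
    return prefill_gpus, TOTAL_GPUS - prefill_gpus

# Agentic, deep-research workloads read a lot, so weight the split toward pre-fill.
print(split_gpus(prefill_demand=3.0, decode_demand=1.0))   # -> (54, 18)
```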

And so today we’re announcing the NVIDIA Dynamo. And NVIDIA Dynamo does all that. It is essentially the operating system of an AI factory.

Whereas in the past, in the way that we ran data centers, our operating system would be something like VMware. And we would orchestrate, and we still do—we’re a big user—orchestrate a whole bunch of different enterprise applications running on top of our enterprise IT. But in the future, the application is not enterprise IT. It’s agents. And the operating system is not something like VMware. It’s something like Dynamo. And this operating system is running on top of not a data center, but on top of an AI factory.

We decided to call this operating system, this piece of software, insanely complicated software, the NVIDIA Dynamo. It’s open source. And we’re so happy that so many of our partners are working with us on it.

Performance Comparisons

[Jensen shows performance comparisons between Hopper and Blackwell systems, demonstrating up to 40X performance improvements for reasoning models, then explains the roadmap for future architectures.]

Roadmap and Future Architectures

We’re now in full production of Blackwell. Computer companies all over the world are ramping these incredible machines at scale. And I’m just so pleased and so grateful that all of you worked hard on transitioning into this new architecture.

And now in the second half of this year, we’ll easily transition into the upgrade. So we have Blackwell Ultra NVLink 72. It’s one and a half times more flops. It’s got a new instruction for attention. It’s one and a half times more memory. All that memory is useful for things like the KV cache. It has two times the networking bandwidth. And so now that we have the same architecture, we’ll just gracefully glide into that. And that’s called Blackwell Ultra. That’s coming in the second half of this year.

The next architecture, one year out, is named after an astronomer—her name is Vera Rubin. She discovered dark matter. Vera Rubin is incredible because the CPU is new. It’s twice the performance of Grace, with more bandwidth, and yet it’s just a tiny little 50-watt CPU.

And Rubin, a brand new GPU; CX9, a brand new networking SmartNIC; NVLink 6, a brand new NVLink; brand new memories, HBM4—basically everything is brand new except for the chassis. And this way we could take a whole lot of risk in one direction and not risk a whole bunch of other things related to the infrastructure. And so Vera Rubin, NVLink 144, is the second half of next year.

[He continues to describe the future roadmap, including Rubin Ultra with NVLink 576.]

Silicon Photonics

One of the areas that I’m very excited about is the largest enterprise networking company taking Spectrum X and integrating it into their product line so that they can help the world’s enterprises become AI companies. We’re at 100,000 GPUs with CX-7; now CX-8 is coming, and CX-9 is coming. And during Rubin’s time frame, we would like to scale out the number of GPUs to many hundreds of thousands.

Now, the challenge with scaling out to many hundreds of thousands of GPUs is the connection of the scale out. The connection on scale up is copper. We should use copper as far as we can—that’s, you know, call it a meter or two. And that’s incredibly good connectivity, very high reliability, very good energy efficiency, very low cost. And so we use copper as much as we can on scale up. But on scale out, where the data centers are now the size of a stadium, we’re going to need something that runs much longer distances. And that’s where silicon photonics comes in.

The challenge of silicon photonics has been that the transceivers consume a lot of energy. To go from electrical to photonic, you have to go through a transceiver, and that takes a certain amount of energy. Each one of these transceivers is 30 watts. If you buy in high volume, it’s $1,000. This is a plug. On this side is electrical. On this side is optical. The optics come in through the yellow. You plug this into a switch. It’s electrical on this side. There are transceivers, lasers, and a technology called Mach-Zehnder.

If we had 100,000 GPUs, we would have 100,000 of this side. And then another 100,000, which connects the switch to the switch. And then on the other side, connected to the other network. If we had 250,000, we’ll add another layer of switches. And so each GPU, every GPU, 250,000, every GPU, would have six transceivers. Every GPU would have six of these plugs. And these six plugs would add 180 watts per GPU, 180 watts per GPU, and $6,000 per GPU.

And so the question is, how do we scale out now to millions of GPUs? Because if we had a million GPUs, multiplied by six, it would be six million transceivers times 30 watts—180 megawatts of transceivers. And those transceivers don’t do any math; they just move signals around. And so the question is, how can we afford it? As I mentioned earlier, energy is our most important commodity. Everything is related, ultimately, to energy. So this is going to limit our revenues, our customers’ revenues, by subtracting out 180 megawatts of power.
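
Written out, the arithmetic he is walking through is:

$$10^{6}\ \text{GPUs} \times 6\ \tfrac{\text{transceivers}}{\text{GPU}} = 6\times10^{6}\ \text{transceivers};\qquad 6\times10^{6} \times 30\ \text{W} = 180\ \text{MW};\qquad 6\times10^{6} \times \$1{,}000 = \$6\ \text{billion}.$$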

And so this is the amazing thing that we did. We invented the world’s first MRM (micro-ring modulator), and this is what it looks like. There’s a little waveguide. On that waveguide, the light goes to a ring. That ring resonates, and it controls the reflectivity of the waveguide as the light goes around, limiting and modulating the amount of light that passes through. It shuts the light off by absorbing it, or passes it on. It turns this direct, continuous laser beam into ones and zeros. And that’s the miracle.

And that photonic IC is then stacked with the electronic IC, which is then stacked with a whole bunch of micro lenses, which is stacked with this thing called a fiber array. These things are all manufactured at TSMC using a technology called COUPE, and packaged using 3D CoWoS technology, working with a whole bunch of technology providers, the names I just showed you earlier. And it turns into this incredible machine.

Enterprise Computing

Let’s talk about enterprise computing. This is really important. In order for us to bring AI to the world’s enterprise, we need to take a step back for a second and remind ourselves of this: AI and machine learning has reinvented the entire computing stack. The processor is different, the operating system is different, the applications on top are different. The way the applications work, the way you orchestrate them are different, and the way you run them are different.

Let me give you one example. The way you access data will be fundamentally different than in the past. Instead of retrieving precisely the data that you want and reading it to try to understand it, in the future we will do what we do with Perplexity. Instead of doing retrieval that way, I just ask Perplexity what I want. Ask a question, and it will tell you the answer.

This is the way enterprise IT will work in the future as well. We’ll have AI agents, which are part of our digital workforce. There are a billion knowledge workers in the world. There are probably going to be 10 billion digital workers working with us side by side. 100% of software engineers in the future—there are 30 million of them around the world—100% of them are going to be AI assisted. I’m certain of that. 100% of NVIDIA software engineers will be AI assisted by the end of this year.

AI agents will be everywhere, how they run, what enterprises run, and how we run it will be fundamentally different. We need a new line of computers.

This is what a PC should look like. 20 petaflops. Unbelievable. 72 CPU cores. Chip to chip interface. HBM memory. And just in case, some PCI express slots for your GeForce.

This is called DGX Station. DGX Spark and DGX Station are going to be available by all of the OEMs. HP, Dell, Lenovo, ASUS. It’s going to be manufactured for data scientists and researchers all over the world. This is the computer of the age of AI. This is what computers should look like. And this is what computers will run in the future.

And we have a whole lineup for enterprise now. From little tiny ones to workstation ones, server ones to supercomputer ones, and these will be available from all of our partners.

We will also revolutionize the rest of the computing stack. Remember, computing has three pillars:

  1. There’s computing. You’re looking at it.
  2. There’s networking, as I mentioned earlier, Spectrum X going to the world of computers.
  3. And the third is storage. Storage has to be completely reinvented.

Rather than a retrieval-based storage system, it’s going to be a semantics-based retrieval system, a semantics-based storage system. And so the storage system has to be continuously embedding information in the background, taking raw data, embedding it into knowledge, and then later on, when you access it, you don’t retrieve it. You just talk to it. You ask your questions. You give it problems.
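
A minimal sketch of what “embed in the background, then ask instead of retrieve” could look like, using a hypothetical `embed()` text-to-vector model and cosine similarity; this is an illustration of the idea, not NVIDIA’s actual storage stack:

```python
import numpy as np

class SemanticStore:
    """Toy semantics-based store: ingest raw text, query it by meaning."""

    def __init__(self, embed):
        self.embed = embed                 # hypothetical text -> vector model
        self.vectors, self.docs = [], []

    def ingest(self, doc: str) -> None:
        # In a real system this runs continuously in the background,
        # turning raw data into embedded knowledge as it arrives.
        self.vectors.append(np.asarray(self.embed(doc), dtype=float))
        self.docs.append(doc)

    def ask(self, question: str, k: int = 3) -> list[str]:
        # Instead of fetching a file by name, find what is closest in meaning;
        # the passages would then go to a language model to generate the answer.
        q = np.asarray(self.embed(question), dtype=float)
        sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
                for v in self.vectors]
        top = np.argsort(sims)[::-1][:k]
        return [self.docs[i] for i in top]
```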

And one of the examples—I wish we had a video of it—is from Aaron at Box, who worked with us to put one up in the cloud. And it’s basically a super smart storage system. And in the future, you’re going to have something like that in every single enterprise. That is the enterprise storage of the future.

And we’re working with the entire storage industry, really fantastic partners, DDN and Dell, and HP Enterprise, and Hitachi and IBM, and NetApp and Nutanix and Pure Storage and Vast and Weka. Basically, the entire world storage industry will be offering this stack. For the very first time, your storage system will be GPU accelerated.

Models and Partnerships

Today, I’m super excited to announce this incredible model that everybody can run. I showed you earlier, R1, a reasoning model. I showed you versus Llama 3, a non-reasoning model. And obviously, R1 is much smarter. But we can do it even better than that. And we can make it possible to be enterprise ready for any company. And it’s now completely open source. It’s part of our system we call NIMs. And you can download it. You can run it anywhere. You can run it on DGX Spark. You can run it on DGX Station. You can run on any of the servers that the OEMs make. You can run it in the cloud. You can integrate it into any of your agentic AI frameworks.

And we’re working with companies all over the world. I’ve got some great partners in the audience. I want to recognize Accenture, Julie Sweet, and her team are building their AI factory and their AI framework. And we have Amdocs, the world’s largest telecommunication software company, AT&T, John Stankey, and his team building an AT&T AI system, agentic system, Larry Fink at BlackRock and team building theirs.

[Jensen lists multiple corporate partnerships including Cadence, Capital One, Dell, E&Y, Nasdaq, SAP, and ServiceNow.]

Robotics

Let’s talk about robots. Well, the time has come, the time has come for robots. Robots have the benefit of being able to interact with the physical world and do things that otherwise digital information cannot.

We know very clearly that the world has severe shortage of human laborers, human workers. By the end of this decade, the world is going to be at least 50 million workers short. We’d be more than delighted to pay them each $50,000 to come to work. We’re probably going to have to pay robots $50,000 a year to come to work. And so this is going to be a very, very large industry.

There are all kinds of robotic systems. Your infrastructure will be robotic. Billions of cameras in warehouses and factories—10, 20 million factories around the world. Every car is already a robot, as I mentioned earlier, and now we’re building general robots.

[Jensen demonstrates a robot called Blue on stage.]

Everything that moves will be autonomous. Physical AI will embody robots of every kind in every industry. Three computers built by NVIDIA enable a continuous loop of robot AI simulation, training, testing, and real world experience.

Training robots requires huge volumes of data. Internet scale data provides common sense and reasoning, but robots need action and control data, which is expensive to capture. With Blueprints built on NVIDIA Omniverse and Cosmos, developers can generate massive amounts of diverse synthetic data for training robot policies.

At its core, we have the same challenges. As I mentioned before, there are three that we focus on:

  1. How do you solve the data problem? How, where do you create the data necessary to train the AI?
  2. What’s the model architecture?
  3. What’s the scaling law? How can we scale either the data, the compute, or both, so that we can make AI smarter and smarter?

In robotics, we created a system called Omniverse. It’s our operating system for physical AI. So you’ve heard me talk about Omniverse for a long time. We added two technologies to it. Today, I’m going to show you two things.

One of them is so that we could scale AI with generative capabilities. A generative model that understands the physical world. We call it Cosmos. Using Omniverse to condition Cosmos and using Cosmos to generate an infinite number of environments allows us to create data that is grounded, controlled by us, and yet be systematically infinite at the same time.

The second thing, just as we were talking about earlier, one of the incredible scaling capabilities of language models today is reinforcement learning, verifiable rewards. The question is, what’s the verifiable rewards in robotics? And as we know very well, it’s the laws of physics. Verifiable physics rewards.

And so we need an incredible physics engine. We need a physics engine that is designed for very fine-grained rigid and soft bodies, designed for training tactile feedback, fine motor skills, and actuator controls. We needed it to be GPU accelerated so that these virtual worlds could run in super linear time, super real time, and train these AI models incredibly fast. And we needed it to be integrated harmoniously into a framework that is used by roboticists all over the world, like MuJoCo.

And so today we’re announcing something really, really special. It is a partnership of three companies, DeepMind, Disney Research, and NVIDIA, and we call it Newton.

Today we’re introducing NVIDIA Isaac Groot N1. Groot N1 is a generalist foundation model for humanoid robots. It’s built on the foundations of synthetic data generation, and learning and simulation. Groot N1 features a dual system architecture for thinking fast and slow, inspired by principles of human cognitive processing. The slow thinking system lets the robot perceive and reason about its environment and instructions, and plan the right actions to take. The fast thinking system translates the plan into precise and continuous robot actions.

Groot N1’s generalization lets robots manipulate common objects with ease and execute multi-step sequences collaboratively. And with this entire pipeline of synthetic data generation and robot learning, humanoid robot developers can post-train Groot N1 across multiple embodiments and tasks in many environments. Around the world, in every industry, developers are using NVIDIA’s three computers to build the next generation of embodied AI.

And today we’re announcing that Groot N1 is open-sourced!

Closing

I want to thank all of you for coming to GTC. We talked about several things:

  1. Blackwell is in full production. And the ramp is incredible. Customer demand is incredible. And for good reason: there’s an inflection point in AI. The amount of computation we have to do in AI is so much greater as a result of reasoning AI, and the training of reasoning AI systems and agentic systems.
  2. Blackwell NVLink 72 with Dynamo is 40 times the AI factory performance of Hopper. And inference is going to be one of the most important workloads in the next decade as we scale out AI.
  3. We have an annual rhythm of roadmaps that has been laid out for you so that you could plan your AI infrastructure.

And we have three AI infrastructures we’re building:

  1. AI infrastructure for the cloud
  2. AI infrastructure for enterprise
  3. AI infrastructure for robots

Thank you!