
Nvidia's Inference Problem And Alarming Sell Side AI IQ

by 100IQ Win The Knowledge

I have engaged in several exchanges in public article comment sections over the fact that the GPU is ill-suited for datacenter ml/dl inference. This is naturally not a popular argument to make, as domination of this market is seen as critical to Nvidia’s (NVDA) growth story and current premium valuation. Critics view this as some sort of self-serving or short-biased argument, but the reality is that the empirical technological evidence has always clearly pointed to this conclusion. Every move Nvidia has made in the inference market is that of a laggard, not a leader. NVDLA, TensorRT, and now the introduction of the Tesla T4 inference accelerator all reflect Nvidia’s strategy of trying to maintain stickiness on the training side in the face of the GPU’s fundamental technological limitations in ml/dl inferencing. Power consumption, latency, and cost may not be a problem for high-performance GPUs in ml/dl training, but for datacenter scale-out inference they are major limiting factors.

I figured today was a good time to address some of what I’m focusing on here, as I expect inference accelerators to take center stage at this week’s AI Hardware Summit. Also, a notable announcement out of Facebook (NASDAQ:FB), coupled with a new Google (NASDAQ:GOOG) (NASDAQ:GOOGL) TPU paper and last week’s introduction of Nvidia’s T4 GPU, makes for optimal timing for a market-related note on this topic.

Let’s start with the recent TPU paper and Nvidia’s T4 GPU introduction:

TPU vs. GPU Inferencing: Tortoise vs. Hare?

The best analogy for Google vs. Nvidia in inferencing is the tortoise vs. the hare fable, except slow and steady isn’t going to win this race. Google figured out four years ago that low-precision math is ideal for ml/dl inferencing. I predicted that with Turing we would see more evidence of the TPU-ification of the GPU, and the T4 card has confirmed that. The boosted INT8 performance and added INT4 math clearly demonstrate just how much Nvidia has been playing catch-up here.
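To make the low-precision point concrete, here is a minimal sketch of a dot product computed with 8-bit integers instead of floats. It uses a generic symmetric, per-tensor quantization scheme chosen purely for illustration; it is not a description of how the TPU or the T4 actually quantizes, and the function names and sample values are hypothetical.

```python
# Illustrative only: generic symmetric INT8 quantization, not any vendor's scheme.

def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def int8_dot(weights, activations):
    """Dot product in integer arithmetic, rescaled to a float at the end."""
    q_w, s_w = quantize_int8(weights)
    q_a, s_a = quantize_int8(activations)
    # Products of two 8-bit values are summed in a wider integer accumulator.
    acc = sum(w * a for w, a in zip(q_w, q_a))
    return acc * s_w * s_a


# Hypothetical sample data.
weights = [0.42, -1.30, 0.07, 0.95]
activations = [1.10, 0.25, -0.60, 0.80]

exact = sum(w * a for w, a in zip(weights, activations))
approx = int8_dot(weights, activations)

print(f"float dot product: {exact:.4f}")
print(f"INT8 dot product:  {approx:.4f}")
```

The integer result lands close to the floating-point one, which is the basic reason inference can tolerate cheaper, lower-power arithmetic units.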

Now Nvidia has dubbed the Tesla T4 GPU as “the world’s most advanced inference accelerator.”

Here is how the soon to be released card stacks up against a now three-year-old TPU1:


[Table: Tesla T4 (Q4 2018) vs. TPU1, comparing power use, process node, die size, and theoretical dies per wafer]

(Source Nvidia/ Google, my calculations on theoretical dies per 12” Wafer)
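A theoretical dies-per-wafer figure of this kind can be approximated with a common first-order formula: gross wafer area divided by die area, minus an edge-loss term. The sketch below is one such estimate and not necessarily the calculation used for the table. The die areas are assumptions based on publicly reported figures (about 545 mm² for the T4's die and an upper bound of about 331 mm² for TPU1), and the estimate ignores yield, scribe lines, and edge exclusion.

```python
import math


def dies_per_wafer(die_area_mm2: float, wafer_diameter_mm: float = 300.0) -> int:
    """First-order estimate of whole dies on a round wafer.

    gross term: wafer area / die area
    edge term:  partial dies lost around the circumference
    """
    radius = wafer_diameter_mm / 2
    gross = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return math.floor(gross - edge_loss)


# ASSUMED die areas (publicly reported figures, not values from the table).
T4_DIE_MM2 = 545.0
TPU1_DIE_MM2 = 331.0  # reported as an upper bound

t4_dies = dies_per_wafer(T4_DIE_MM2)
tpu1_dies = dies_per_wafer(TPU1_DIE_MM2)

print(f"Tesla T4: ~{t4_dies} candidate dies per 300 mm wafer")
print(f"TPU1:     ~{tpu1_dies} candidate dies per 300 mm wafer")
print(f"TPU1 advantage before yield: {tpu1_dies / t4_dies:.2f}x")
```

Because defect-limited yield also falls as die area grows, the gap in good dies per wafer would typically be wider than this raw count suggests.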

Marketing will be marketing, but it’s hard to get around the fact that on a deployed production basis TPU1 perf/watt is still better than the soon to be released T4. And that’s really only part of the problem.

In the recently published September issue of Communications of the ACM, Google’s chip engineers shared a paper titled “A Domain-Specific Architecture for Deep Neural Networks.”

Here’s an excerpt from that paper on TPU1 vs. potentially newer GPUs:

“A fair comparison with a newer GPU would include a newer TPU, and, for an additional 10W, we could triple performance of the 28nm, 0.7GHz, 40W TPU just by using the K80’s GDDR5 memory. Moving the TPU to a 16nm process would improve its performance/Watt even further.”

So, at roughly 50W Google’s TPU could deliver up to 270 TOPS of INT8 performance, about triple TPU1’s 92 TOPS peak. And remember, Google can’t disclose explicit costs and uses performance/watt as a proxy measurement, but anyone with decent semiconductor knowledge can make some reasonable inferences about the giant economic advantage here. TPU1 is manufactured at 28nm and uses DDR3 memory. That is a less expensive node and cheaper memory for a chip that’s 40% smaller. When you factor in the yield differential between nodes and the costs, you are talking about a very significant per-chip manufacturing cost differential.
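The perf/watt comparison can be sketched in a few lines. The inputs are assumptions drawn from public material rather than from measured production data: 92 peak INT8 TOPS and the 40W figure quoted in the paper for TPU1, Nvidia's advertised 130 INT8 TOPS at 70W for the T4, and the paper's hypothetical of tripled performance for an additional 10W. Peak numbers and power bases are not perfectly comparable across vendors, so treat the output as a rough proxy only.

```python
def tops_per_watt(peak_tops: float, watts: float) -> float:
    """Peak INT8 tera-operations per second per watt."""
    return peak_tops / watts


# ASSUMED public figures (see lead-in); not measured production data.
TPU1_TOPS, TPU1_WATTS = 92.0, 40.0
T4_TOPS, T4_WATTS = 130.0, 70.0

# The paper's hypothetical: triple the performance for an extra 10W
# by swapping in GDDR5 memory.
UPGRADED_TPU_TOPS = 3 * TPU1_TOPS
UPGRADED_TPU_WATTS = TPU1_WATTS + 10.0

results = {
    "TPU1": tops_per_watt(TPU1_TOPS, TPU1_WATTS),
    "Tesla T4": tops_per_watt(T4_TOPS, T4_WATTS),
    "TPU1 + GDDR5 (hypothetical)": tops_per_watt(
        UPGRADED_TPU_TOPS, UPGRADED_TPU_WATTS
    ),
}

for name, value in results.items():
    print(f"{name}: {value:.2f} INT8 TOPS/W")
```

On these inputs the three-year-old TPU1 still comes out ahead of the T4 on peak TOPS per watt, and the hypothetical memory upgrade widens the gap considerably.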

The paper also explicitly points out some things I have cited in the past. They may not matter to the technical community, who understand this, but they cause repeated confusion in the investment community about the potential relationship between these two companies on inference accelerators.

From the paper:

“Most architecture research papers are based on simulations running small, easily portable benchmarks that project potential performance if ever implemented. This article is not one of them but rather a retrospective evaluation of machines running real, large production workloads in datacenters since 2015, some used routinely by more than one billion people. These six applications, as in Table 1, are representative of 95% of TPU datacenter use in 2016.

Since we are measuring production workloads, the benchmark platforms for us to compare must also be deployable in Google datacenters, as that is the only place the production workloads run. The many servers in Google datacenters and the requirements for application dependability at Google scale mean machines must at minimum check for memory errors. As the Nvidia Maxwell GPU and the more recent Pascal P40 GPU do not check for errors on internal memory, it is infeasible to deploy these processors at Google scale and meet the strict reliability requirements of Google applications.”

All I can say is keep this rigor in mind the next time you read an upgrade note or comment touting Google GCP availability of an inference-capable GPU as evidence that Google is going to be rolling out GPUs across its scale-out consumer-facing infrastructure. Simply put, it’s not happening. They are to this day deploying TPU1 for these workloads. And when you factor in the huge cost advantage, let alone the perf/watt advantage, you don’t need to be a finance genius to conclude that $3k-6k GPUs are a non-starter for scale-out datacenter inference.

And if you didn’t get this already I suggest paying some close attention to Facebook’s recent GLOW announcement.

Facebook GLOW: Accelerating the Inference Accelerators

Google’s demonstrated economic/technological datacenter advantage with TPU1 has not gone unnoticed by its hyperscale peers. A couple of days ago Facebook announced GLOW. I touched on other in-house chip initiatives at the hyperscale level in my datacenter-focused article, and the Facebook GLOW announcement is clearly along these lines.

Facebook describes GLOW as…

“Glow is a machine learning compiler that accelerates the performance of deep learning frameworks on different hardware platforms. It enables the ecosystem of hardware developers and researchers to focus on building next-gen hardware accelerators that can be supported by deep learning frameworks like PyTorch.”

Make no mistake, this development is basically Facebook’s response to the fact that they don’t yet have their own TPU for inferencing. And to go one step further, it can be interpreted as Facebook acknowledging that at this point building their own TPU makes no sense. So, Facebook’s chip team is in fact taking an almost Android-esque approach here by providing a high-level abstraction layer for every single startup making an inference card. This is essentially Facebook’s way of exploiting the fact that, technically speaking, there are going to be tons of these matrix-multiply-unit-driven coprocessors to choose from, so why not give them all what they need to meet Facebook’s own anticipated future inference needs? This move fits nicely into an economic view that they should leave themselves with the most options possible as this market explodes with choices, and that in doing so they ensure the lowest possible TCO for future inference accelerators. Think of a market where the scale-out coprocessor SKUs are sub-$400. Not exactly what Intel and Nvidia want to see, but this is planning for a future in which Facebook and the rest of hyperscale are awfully happy. That’s the nature of a cycle like this taking off in semi land, and this is simply more evidence supporting the broader argument I laid out in my AI datacenter short thesis article.

What does this all mean? Down the rabbit hole we go…

AI ML/DL And The Sell-Side: Perception vs. Reality

Since the T4 was announced there have been several upgrades of Nvidia shares based on “increased confidence in inferencing datacenter scale out penetration.” Now, even if you don’t appreciate the technical aspect of the story, you would think these guys would take a couple of weeks and do some more work before making bold claims that may quickly prove to be quite embarrassing. Unfortunately, when there’s little accountability and a lot of hype, that doesn’t happen, and embarrassing notes marketed as “research” start to flood the market.

How embarrassing?

This embarrassing…

Let’s be clear: Turing GPUs are NOT going to outperform the V100 in ml/dl training, let alone deliver an 8x-10x improvement in training teraflops. This tragedy of a table encapsulates the entire Nvidia sell-side problem. They are utterly clueless on ML/DL. Every announced Turing SKU has fewer Tensor Cores than the V100 and significantly less memory bandwidth than the V100’s 900 GB/s. The Quadro RTX 8000 is at 672 GB/s and has 576 Tensor Cores vs. the 640 on the Tesla V100 die, and it delivers slightly lower FP16 teraflops, and this is the high-end SKU. The Tesla T4, being targeted at inferencing, has half the Tensor Cores of the V100 and nearly a third of the FP16 performance, obviously in favor of lower-precision math and lower power consumption.

Now, this analyst clearly doesn’t understand floating point math or tensor cores, and didn’t even bother to look closely at Nvidia’s own disclosed specs, as even their marketing machine couldn’t make something like this fly. And this is who investors are supposed to turn to for advice on the future of the inferencing market?

And this is literally the crux of his thesis:

“Tesla T4 delivers a significant performance improvement over the prior Pascal architecture. Specifically, Mr. Huang noted that Turing delivers an 8x performance improvement over Pascal (Exhibit 1). We believe this serves to highlight the company’s consistent execution in driving notable performance improvements generation to generation, which is further reaffirmed by the improvements witnessed in training. Recall that in terms of total teraflops of performance Pascal improved 10x compared to Maxwell and Volta represented a 12x improvement relative to Pascal (Exhibit 2), which has in turn been one of the key factors, in our view, enabling Nvidia to maintain its competitive position in the datacenter market (see Exhibit 2 for relative performance compared to Google’s TPU ASIC).” GS Nvidia Note, Sept 17

Volta didn’t improve 12x over Pascal. The V100 delivers about 0.5x more FP16 half-precision teraflops than the P100. That’s the appropriate apples-to-apples comparison. Now, the notable difference between Volta and Pascal is that Volta introduced tensor cores, and with them Nvidia started quoting “deep learning” tensor teraflops as a metric. “Deep learning teraflops” are a mixed-precision metric based on matrix multiply-accumulate operations. Basically, you multiply in FP16 and accumulate the results in FP32 in one clock cycle. This is your TPU MAC unit at work on a GPU die (thanks again, Google). But the misleading thing about deep learning teraflops is that matrix multiply doesn’t dominate all ML/DL training. In fact, once you get outside of the CNN image world, you are talking matrix multiply operations for 20% or less of the training time. This is why, for example, the performance gains of Graphcore’s IPU over the V100 jump massively once you get to RNN and LSTM networks. But even citing this metric, you are looking at a 5.5x gain when comparing FP16 to DL teraflops, and that in and of itself is misleading, as can be seen from actual CNN benchmark numbers at Volta launch. Here are the actual Pascal to Volta ML/DL performance gains as cited by Nvidia for ResNet-50 when Volta launched.
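Separately from Nvidia's benchmark figures, the matrix-multiply point above can be made concrete with Amdahl's law. This is a rough sketch under simplifying assumptions: only the matrix-multiply share of training time is accelerated (by the 5.5x cited above), everything else runs at the same speed, and the 90% share used for a CNN-like workload is a hypothetical value chosen for contrast.

```python
def overall_speedup(accelerated_fraction: float, unit_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only part of the work gets faster."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / unit_speedup)


TENSOR_UNIT_SPEEDUP = 5.5     # the FP16 vs. DL teraflops ratio cited above
NON_CNN_MATMUL_SHARE = 0.20   # "20% or less of the training time"
CNN_MATMUL_SHARE = 0.90       # ASSUMED share for a matmul-heavy CNN

non_cnn_gain = overall_speedup(NON_CNN_MATMUL_SHARE, TENSOR_UNIT_SPEEDUP)
cnn_gain = overall_speedup(CNN_MATMUL_SHARE, TENSOR_UNIT_SPEEDUP)

print(f"matmul at 20% of runtime: {non_cnn_gain:.2f}x end to end")
print(f"matmul at 90% of runtime: {cnn_gain:.2f}x end to end")
```

Even with a large headline gain on the matrix units, a workload that spends most of its time elsewhere sees only a modest end-to-end improvement, which is why a tensor teraflops figure overstates real training gains.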

This analyst’s mistakes aren’t typo-type stuff, crazy TAM math, or even the usual DCF model-based assumption nonsense. These are fundamental failures of technological understanding that anyone with a basic grasp of this type of hardware would never make. The V100 is an 815 mm² chip using the fastest memory technology available, developed with extremely sophisticated process tech. Simply put, when you consider the physics of chip engineering and current cutting-edge memory tech, you can’t make a GPU big enough to deliver anything close to 1.4 petaflops of training performance. And even if we suspended reality and made that an overnight possibility, could you imagine a customer who just paid $400k for a DGX-2 discovering that Nvidia is selling a $5k-$8k single chip that almost matches that ml/dl compute? I’m sure that hypothetical customer would want no less than a jacket from Jensen for such a fleecing.

To understand where the analyst screwed up, you simply need to turn to Nvidia’s marketing. They have gotten so out of control with the numbers they throw out, the different SKUs, and the performance comparisons that anyone who hasn’t seriously done his homework can easily get lost. Not exactly an excuse for a Goldman semiconductor analyst, but if he can stumble this badly you can use your imagination for everyone else. (In his defense, I have seen far worse in other ml/dl work that leans more on economic assumptions.)

I also like how meager the TPU looks on this table – Google has never looked smaller!

But really there isn’t much you can say here, as this guy isn’t the only person in the market struggling with this nonsense. I have seen articles talking about finding out how the T4 stacks up against TPU2 or TPU3 as if that’s an appropriate comparison. See, what most people don’t get is that large-scale ml/dl training is a system problem. There’s not going to be a single chip anytime soon that will be able to train one of these massive neural nets in the time desired, which is why single-chip comparisons are stupid. TPU Pods, clusters, GPU farms, DGX stations, or whatever you want to call them are the story here. Yes, TPU3 can do inferencing, but the chips are far bigger and, given the water cooling, so obviously is the power consumption. So, when we talk inferencing between GPU and TPU the comparison remains vs. TPU1. And we will be talking similar comparisons vs. Cambricon’s MLU100 and Groq’s TPU. Even last night we got a new inference accelerator startup touting GPU-crushing numbers. This startup, called Habana, is led by the former CEO of the now Amazon-owned Annapurna Labs. They had this to share about their Goya inference chip:

“The Goya chip can process 15,000 ResNet-50 images/second with 1.3-ms latency at a batch size of 10 while running at 100 W. That compares to 2,657 images/second for an Nvidia V100 and 1,225 for a dual-socket Xeon 8180. At a batch size of one, Goya handles 8,500 ResNet-50 images/second with a 0.27-ms latency.”

“Habana Labs is showcasing a Goya inference processor card in a live server, running multiple neural-network topologies, at the AI Hardware Summit on September 18 – 19, 2018, in Mountain View, CA.

Habana Labs plans to sample its first Gaudi training processor in the second quarter of 2019. Gaudi has a 2Tbps (Terabits per second) interface per device and its training performance scales linearly to thousands of processors.”
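The quoted Goya figures can be turned into a few derived ratios. The sketch below uses only the numbers in the excerpt above; the in-flight estimate applies Little's law (throughput times latency) and is my own reading of the figures, not something Habana states. The variable names are hypothetical.

```python
# Figures taken from the Habana excerpt quoted above.
GOYA_BATCH10_IMAGES_PER_S = 15_000
GOYA_BATCH10_LATENCY_S = 1.3e-3
GOYA_WATTS = 100
GOYA_BATCH1_IMAGES_PER_S = 8_500
GOYA_BATCH1_LATENCY_S = 0.27e-3
V100_IMAGES_PER_S = 2_657
XEON_8180_DUAL_IMAGES_PER_S = 1_225

# Throughput ratios on ResNet-50.
goya_vs_v100 = GOYA_BATCH10_IMAGES_PER_S / V100_IMAGES_PER_S
goya_vs_xeon = GOYA_BATCH10_IMAGES_PER_S / XEON_8180_DUAL_IMAGES_PER_S

# Efficiency at the stated 100 W operating point.
goya_images_per_s_per_watt = GOYA_BATCH10_IMAGES_PER_S / GOYA_WATTS

# Little's law: average items in flight = throughput * latency.
in_flight_batch10 = GOYA_BATCH10_IMAGES_PER_S * GOYA_BATCH10_LATENCY_S
in_flight_batch1 = GOYA_BATCH1_IMAGES_PER_S * GOYA_BATCH1_LATENCY_S

print(f"Goya vs. V100 throughput:      {goya_vs_v100:.1f}x")
print(f"Goya vs. dual Xeon throughput: {goya_vs_xeon:.1f}x")
print(f"Goya efficiency:               {goya_images_per_s_per_watt:.0f} images/s/W")
print(f"Images in flight (batch 10):   {in_flight_batch10:.1f}")
print(f"Images in flight (batch 1):    {in_flight_batch1:.1f}")
```

The excerpt does not give the batch size or power draw behind the V100 and Xeon numbers, so the throughput ratios are indicative rather than a like-for-like comparison.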

In case you are seeking a comparison here, at GTC Japan Nvidia claimed the best latency, efficiency, and throughput for inference on the same benchmark. This was just a few days ago, and yet Israel-based Habana is demonstrating their (literally) orders of magnitude better chip live and already shipping to early access customers. Now, we have already established that Nvidia pretty much pretends the TPU1 doesn’t exist when it comes to this topic because it’s not for sale, but it’s not like these new inference accelerators aren’t seriously starting to make the rounds. Just last week Cambricon was picking on them as it demonstrated the MLU100’s lower inference latency vs. Nvidia GPUs, and I’m sure by the end of this week the noise is really going to ratchet up.

To be frank, I can’t think of anything more catastrophically wrong than telling someone to buy Nvidia shares on a scale-out datacenter ML/DL inferencing thesis. You have a better shot selling autonomous vehicles or arguing the Nintendo Switch is going to hit a billion units than selling the GPU inference story for datacenter ml/dl scale-out acceleration to anyone with a proper understanding of the space. Set aside the fact that there already are infinitely better scale-out options out there on a perf/watt/$ basis, and simply consider how this market is about to evolve. Google already has its own scale-out datacenter inference chip. The Chinese hyperscale players now have several good domestic options in the space, which they will be incentivized to use. You then have dozens of inference startups chasing the same few meaningful scale-out volume opportunities. And so far in inferencing the one thing they all have in common is that they are far better suited for the workload than the GPU. That works out to horrible chip economics and leaves the remaining hyperscale customers in great bargaining positions, as winning any of their business will prove existential for most of these startups. This means the scale-out hyperscale datacenter won’t even be a place to make real, meaningful profits, and all of these inference-driven startups will need to look to other markets at the edge for business or diversify into training (already a strategy seemingly being pursued by all). Bottom line, the whole thing looks like a dirty street fight for a few years, and as running costs for these well-funded fabless chip companies ain’t crazy, competitors can hang around for a lot longer than in other industries. So, consolidation and the full shake-out can take time, which further weighs on margins for everyone.

Now, I don’t expect chip analysts to lay out such a broad-stroke, big-picture thesis for the space, but I do think anyone covering this sector could get good intel on all this stuff by just taking the time to talk to the customers involved or to actual chip engineers. But raising price targets and making outrageous claims is a lot more fun in a bullish tape. This is the sad state of affairs in Nvidia sell-side research these days. Anyway, the next time someone says “I’ll take Goldman Sachs research over some anonymous SA author,” feel free to cite this article.

Disclosure: I am/we are short NVDA.

I wrote this article myself, and it expresses my own opinions. I am not receiving compensation for it (other than from Seeking Alpha). I have no business relationship with any company whose stock is mentioned in this article.
