Cache Coherent Chaos

Kevin Morris

February 11, 2020

We are at the dawn of the biggest change in decades in our global computing infrastructure. Despite the slow death of Moore’s Law, the rate of change in our actual computer and networking systems is accelerating, with a number of discontinuous, revolutionary changes that reverberate throughout every corner of computer system architecture.

As Moore’s Law grinds to an economic halt, the rate of performance improvement of conventional von Neumann processors has slowed, giving rise to a new age of accelerators for high-demand workloads. AI in particular puts impossible demands on processors and memory, and an entire new industry has emerged from the startup ecosystem to explore new hardware that can outperform conventional CPUs on AI workloads. GPUs, FPGAs, and a new wave of AI-specific devices are competing to boost the performance of inference tasks – from the data center all the way through to the edge and endpoint devices.

But creating a processor that can crush AI tasks at blinding speed on a miserly power budget is only one aspect of the problem. AI does not exist in a vacuum. Any application that requires AI acceleration also includes a passel of processing that isn’t in the AI realm. That means systems have to be created that allow conventional processors to partner well with special-purpose accelerators. The critical factor in those systems lies in feeding those accelerators with the massive amount of data they can consume, and keeping that data in sync with what the application processor is doing. It doesn’t matter how fast your accelerator is if you don’t have big enough pipes to keep data going in and out at an appropriate rate.
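To put rough numbers on the "big enough pipes" point, here is a minimal back-of-the-envelope sketch in the spirit of a roofline-style bound. The function name and every figure in it are illustrative assumptions, not measurements of any real accelerator or link.

```python
def sustained_ops_per_sec(peak_ops_per_sec, link_bytes_per_sec, ops_per_byte):
    """Upper bound on sustained throughput for a data-fed accelerator.

    The accelerator can never run faster than its own peak, and it can never
    run faster than the link delivers bytes times the work done per byte.
    """
    if min(peak_ops_per_sec, link_bytes_per_sec, ops_per_byte) <= 0:
        raise ValueError("all inputs must be positive")
    return min(peak_ops_per_sec, link_bytes_per_sec * ops_per_byte)


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
PEAK = 100 * 10**12      # accelerator peak: 100 tera-ops per second
LINK = 32 * 10**9        # interconnect: 32 gigabytes per second
INTENSITY = 100          # workload: 100 operations per byte moved

baseline = sustained_ops_per_sec(PEAK, LINK, INTENSITY)
faster_chip = sustained_ops_per_sec(2 * PEAK, LINK, INTENSITY)
wider_pipe = sustained_ops_per_sec(PEAK, 2 * LINK, INTENSITY)

print(baseline, faster_chip, wider_pipe)
```

Under these assumed numbers the link is the limit, so doubling the accelerator's peak changes nothing while doubling the link bandwidth doubles the sustained rate.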

Obviously, cache-coherent memory interfaces are an attractive strategy for blasting bits back and forth between processors and accelerators. They minimize latency and can scale to enormous data rates depending on the needs of the application and the structure of the system. With today’s multi-processor systems, cache coherence is already critical to ensure that multiple processors see consistent contents in their local caches. What we haven’t had is a de facto standard for cache-coherent interfaces between processors and accelerators.
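For readers who want to see what "coherent" means mechanically, here is a toy sketch of a write-invalidate scheme in the style of the textbook MSI protocol, tracking a single memory line shared by a CPU and an accelerator. The class and method names are hypothetical, and this is not how CCIX, CXL, or any real interconnect is implemented; it only illustrates the idea that a write by one agent invalidates stale copies held by the others.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"   # this cache holds the only copy, and it is dirty
    SHARED = "S"     # clean copy; other caches may hold one too
    INVALID = "I"    # no usable copy


class ToyCoherentSystem:
    """Toy MSI-style model of one memory line shared by several agents."""

    def __init__(self, num_agents, initial_value=0):
        self.memory = initial_value
        self.states = [State.INVALID] * num_agents
        self.values = [None] * num_agents

    def _write_back_owner(self):
        # If an agent holds a dirty copy, flush it to memory and downgrade it.
        for agent, state in enumerate(self.states):
            if state is State.MODIFIED:
                self.memory = self.values[agent]
                self.states[agent] = State.SHARED

    def read(self, agent):
        # A miss fetches the latest value, forcing a write-back if needed.
        if self.states[agent] is State.INVALID:
            self._write_back_owner()
            self.values[agent] = self.memory
            self.states[agent] = State.SHARED
        return self.values[agent]

    def write(self, agent, value):
        # Invalidate every other copy before taking ownership. The whole
        # line is overwritten here, so a prior dirty copy can be dropped.
        for other in range(len(self.states)):
            if other != agent:
                self.states[other] = State.INVALID
                self.values[other] = None
        self.values[agent] = value
        self.states[agent] = State.MODIFIED


# Agent 0 plays the CPU, agent 1 the accelerator (labels are illustrative).
system = ToyCoherentSystem(num_agents=2)
system.read(0)                 # CPU caches the line
system.write(1, 42)            # accelerator writes; CPU copy is invalidated
print(system.states[0].value)  # CPU's cached copy is no longer usable
print(system.read(0))          # CPU re-reads and sees the accelerator's value
```

The point of a coherent interface is that this bookkeeping happens in hardware, so software on either side never has to flush or copy buffers by hand.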

Hmmm… Wonder why that would be?

I’ve been advised never to attribute to malice that which is better explained by incompetence. Or, more relevant in this case, perhaps, is not to attribute to devious competitive strategy that which is more likely the result of simply tripping over one’s own corporate shoelaces but landing, favorably, on top of your competitors. Specifically, amidst all this chaos in the computing world, Intel is hellbent on defending their dominance of the data center. How important is the data center to Intel? It accounts for over 30% of the company’s revenue and, by some estimates, more than 40% of the company’s value, with the data center market expected to grow to $90 billion by 2022. “Rival” AMD is proud when they gain a point or two of market share, with their share sitting at 20-30% against Intel’s 70-80%. So – Intel has, and fully intends to protect, a dominant position in an extremely lucrative business in data center computing hardware.

All this disruptive change poses a risk to Intel, however. The company can’t count on just continuing to crank out more Moore’s Law improvements of the same x86 architecture processors while the world shifts to heterogeneous computing with varied workloads such as AI. Nvidia brilliantly noticed this strategic hole a few years ago and moved quickly to fill it with general-purpose GPUs capable of accelerating specialized workloads in the data center. That allowed them to carve out something like a $3B business in data center acceleration – which is almost exactly $3B too high for Intel’s taste. Looming on the horizon at that time were also FPGAs, which clearly had the potential to take on a wide variety of data center workloads, with a power and performance profile much more attractive than GPUs. Intel answered that challenge by acquiring Altera for north of $16B in 2015, giving them a strategic asset to help prevent Xilinx from becoming the next Nvidia.

What does all this have to do with cache-coherent interface standards? One way to think about it is that it would not have been to Intel’s advantage to make it super easy for third parties to bring their GPUs and FPGAs into Intel’s servers when Intel didn’t have their own acceleration strategy in place yet. If Intel didn’t have general-purpose GPUs to compete with Nvidia, or FPGAs to compete with Xilinx, why would they want to give their competitors a head start within their own servers?

Companies like AMD and Xilinx saw the need for a cache coherence standard, however, so in 2016 they set about making one. The Cache Coherent Interconnect for Accelerators (CCIX – pronounced “see-six”) was created by a consortium that included AMD, Arm, Huawei, IBM, Mellanox (subsequently acquired by Nvidia), Qualcomm, and Xilinx. In June 2018, the first version of the CCIX standard, running on top of PCIe Gen 4, was released to consortium members.

Now, we’ll just hit “pause” on the history lesson for a moment.

How do you react if you’re Intel? Well, you could jump on the CCIX train and make sure your next generation of Xeon processors were CCIX compatible. Then everybody could bring their accelerators in and get high-performance, cache-coherent links to the Intel processors that own something like 80% of the data center sockets. Server customers could mix-and-match and choose the accelerator suppliers they liked best for their workloads. Dogs and cats would live together in peace and harmony, baby unicorns would frolic on the lawn, and everyone would hold hands and sing Ku… Oh, wait, just kidding. None of that would EVER happen.

Instead, of course, Intel went about creating their own, completely different standard for cache-coherent interfaces to their CPUs. Their standard, Compute Express Link (CXL), also uses PCIe for the physical layer. But beyond that, it is in no way compatible with CCIX. Why would Intel create their own, new standard if a broadly supported industry standard was already underway? Uh, would you believe it was because they analyzed CCIX and determined it would not meet the demands of the server environments of the future? Yeah, neither do we.

At this point, many of the trade publications launched into technical analyses comparing CCIX and CXL, noting that CXL is a master-slave architecture where the CPU is in charge and the other devices are all subservient, whereas CCIX allows peer-to-peer connections with no CPU involved. Intel, of course, says that CXL is lighter, faster than a speeding bullet, and can leap tall technical hurdles in a single clock cycle (OK, maybe that’s not exactly what they say).

Before we get into that, let’s just point out that it basically does not matter whether CCIX or CXL is a technically superior standard. CXL is going to win. Period.

Whoa, what heresy is this? It’s just simple economics, actually. This may seem obvious, but cache-coherent interfaces are useful only among devices that have caches. What computing devices have caches? Well, there are CPUs and … uh… yeah. That’s pretty much it at this point. So, the CCIX vision where non-CPU devices would interact via a cache-coherent interface is a bit of a solution in search of a problem. Sure, we can contrive examples where it would be useful, but in the vast majority of cases, we want an accelerator to be sharing data with a CPU. Whose CPU would that be? Well, in the data center today, 70-80% of the time, it will be Intel’s.

So, if you’re a company that wants your accelerator chips to compete in the data center, you’re probably going to want to be able to hook up with Xeon CPUs, and they are apparently going to be speaking only CXL. For a quick litmus test, Xilinx, a founder of the CCIX consortium, is adding CXL support to their devices. It would be foolish of them not to. Ah, but here’s a tricky bit. In an FPGA that already has PCIe interfaces, CCIX support can apparently be added via soft IP. That means you can buy a currently existing FPGA and use some of the FPGA LUT fabric to give it CCIX support. Not so with CXL. For CXL you need actual changes to the silicon, so companies like Xilinx have to spin entirely new chips to get CXL support. Wow, what a lucky break for Intel, who happens to already be building CXL support into their own FPGAs and other devices. You’d almost think they planned it that way.

So, what we have is a situation similar to the x86 architecture. Few people would argue that the x86 ISA – developed around 1978 – is the ideal basis for modern computing. But market forces are far more powerful than technological esoterics. Betamax was pretty clearly superior to VHS, but today we all use… Hah! Well, neither. And, that may be another lesson here. As the data center gorilla, Intel has vast assets it can use to defend its position. There are countless cubby-holes like the CCIX/CXL conundrum where the company can manipulate the game in their favor – some likely deliberate, and others completely by accident. None of those will protect them from a complete discontinuity in the data center where purely-heterogeneous computing takes over. Then, it could be the wild wild west all over again.

It will be interesting to watch.

Comments

Kev

February 11, 2020

Cache coherence is of limited use because it doesn’t scale. If you want to scale x86/ARM, you can do it by moving threads instead of data:

https://youtu.be/Bh5axlxIUvM

Wandering-threads fixes most of the problems of SMP and programming distributed systems, and you don’t have to rewrite your code. It’s the easy way to do in-memory computing, but Intel and ARM’s cores probably still run too hot for that.

1. Kevin Morris
  
  February 11, 2020
  
  You make a good point. There is also the outstanding question of what our base-level server setup with accelerators and servers actually is. Intel appears to be primarily thinking of servers that contain both application processors and accelerators such as FPGAs. Other approaches have FPGAs in a pool as shared resources (e.g. Amazon F1). It isn’t clear to me what arrangement will work the best for the most workloads.
  
Christoforos

February 12, 2020

Great article, Kevin!
Cache coherency and low-latency communication over PCIe are extremely important factors for FPGA-based applications.
For many applications (e.g. quantitative finance), RTT latency is critical to the utilization of FPGAs.
…and let’s not forget IBM’s OpenCAPI, which also supports extremely low-latency communication and cache coherency.

1. acantle
  
  February 12, 2020
  
  Thanks for bringing OpenCAPI into focus here, Christoforos!
  Kevin, I’d strongly encourage you to extend this very interesting article by bringing OpenCAPI into the fold.
  As someone involved in attaching coherent buses to FPGAs since 2007, with both Intel and IBM, I can attest that it is a history worth telling and one that’s not widely publicized. Intel were at it for over a decade with FSB as well as iterations of QPI before they gave up, and IBM worked on 3 versions of CAPI from 2014 through to today’s OpenCAPI 3.0, which is now working well, alongside CAPI 2.0 over PCIe Gen4, in full production on POWER9. It’s also worth noting that OpenCAPI’s Phy layer is interchangeable with Nvidia’s NVLink, so it is possible to switch between FPGA- and GPU-attached accelerators in POWER9 systems or, indeed, mix and match them.
  
  Please take a look at an introductory presentation I gave, alongside IBM, at the OCP Summit in March 2018 for a detailed overview of the coherent attached FPGA history and OpenCAPI.
  
  https://www.youtube.com/watch?v=LjQ8OE1cdCY
  
  I think that everyone’s agreed that the industry will come together with CXL in the coming years, but, thanks to OpenCAPI, innovation with coherently attached accelerators can take place in production systems today while we await a production-worthy CXL system. The shift from OpenCAPI to CXL will also be pretty painless because CXL is eerily similar to OpenCAPI!
  