Xilinx Throws Down

Kevin Morris

February 24, 2015

When the #1 FPGA company makes what is arguably their biggest new-technology announcement in a decade, you’d expect there to be a lot of substance. With this week’s announcement of UltraScale+ Virtex, Kintex, and Zynq devices planned to roll out on TSMC’s 16nm FinFET process, the company did not disappoint. This is one of the broadest, most complex announcements we have ever heard from Xilinx. So, with that preface, let’s take a look at what those folks on the south side of San Jose have been up to lately.

In summary, Xilinx is announcing new Virtex, Kintex, and Zynq families of programmable devices with major improvements in capability over previous generations.

Xilinx is unveiling its “UltraScale+” device families. Note the “+”. That means that these families are based on TSMC’s 16nm FinFET process – rather than the 20nm planar process that the current “UltraScale” devices use. And, since we’re on the topic of underlying fabrication process and FinFETs, let’s get that part of the discussion out of the way first. No, this announcement does not answer the mostly irrelevant but marketing-wise omnipresent question of whether Xilinx or Altera will ship the first FinFET-based FPGAs, with Xilinx working with TSMC and Altera jumping over to Intel’s 14nm Tri-Gate (Intel’s name for FinFETs) process. Actually, Xilinx and Altera are now racing each other for second place in that derby, because Achronix already crossed the finish line on that one with their Intel-fabbed 22nm “Speedster” FinFET FPGAs.

FinFETs offer a step function in performance/power for FPGAs. This is beyond the normal single-node advantage we’d expect from a typical Moore’s Law shrink. FinFETs can do more work with less power, and leak less, than similar-sized planar transistors. The result is that the families Xilinx is announcing will have more than a typical one-node improvement in performance, power, and density – based on the new process alone. That’s a big deal.

But everybody will eventually have FinFETs, so any competitive advantage from process is a matter of timing. Yes, those of us using the next generation of chips will reap massive benefits from using devices made with the latest process. You’ll do more, with a cheaper chip, and it will take less power.

Now, let’s move on to the stuff in this announcement that is much more interesting. Xilinx isn’t just sitting back and re-doing the same FPGAs (only bigger), taking advantage of the next process node. That would give us a normal, Moore’s Law improvement in FPGA technology. The company has done some significant innovation in the areas of architecture, packaging, tools, and IP that give us considerably more boost than Moore’s Law alone.

One of the biggest advances Xilinx has made is not mentioned in their latest release, but it has its footprints all over it. That is Vivado, the company’s completely overhauled design tool suite. A few years ago, the company made a major investment in a total rewrite of their aging ISE tools. The result was a state-of-the-art comprehensive EDA tool with all the latest bells and whistles in terms of data model, integration, algorithms, and performance. Now Vivado has had a few years to mature and get into fighting trim, and it played a major role in the development of the architecture for the latest devices. Xilinx used Vivado to completely redesign the routing resources on the chip – eliminating bottlenecks and giving the tools exactly what they needed for more demanding designs. As a result, the UltraScale and UltraScale+ families are significantly better than their predecessors in terms of overall routability and utilization. If you’re accustomed to sizing your FPGA based on a 60-70% utilization, you’ll be pleasantly surprised with the 90%+ results many teams are finding with these newly re-architected devices.
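To put the utilization point in concrete terms, here is a minimal sketch of the sizing arithmetic. The design size of one million LUTs is a made-up figure for illustration, and the function name is ours, not anything from Xilinx's tools.

```python
import math


def required_luts(design_luts: int, utilization: float) -> int:
    """Device LUT capacity needed to fit a design at a target utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return math.ceil(design_luts / utilization)


# Hypothetical design size, chosen only to make the comparison easy to read.
design = 1_000_000

for target in (0.60, 0.70, 0.90):
    capacity = required_luts(design, target)
    print(f"{target:.0%} utilization -> {capacity:,} LUTs of device capacity")
```

The same design needs a noticeably smaller device at 90% utilization than at 60%, which is why routability shows up directly in the part you end up buying.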

What ARE included in the recent announcement are several important architectural improvements that capitalize on the above process, routability, and utilization advantages. These include UltraRAM – a new block that delivers significantly more on-chip memory capacity, SmartConnect – a new interconnect optimization technology, and “Heterogeneous Multi-processing” in the new Zynq devices – which is basically expanding the ARM-based processing system with a lot of optimized hard IP. The company is also touting “3D-on-3D” which is a marketing way of pointing out that they are using both 3D transistors (FinFETs) and 3D packaging technology (silicon interposers and TSVs). All of these improvements are significant, and all of them work together to bring us devices with dramatically more capability than we have ever seen before in programmable logic.

Let’s take a look at them one by one.

In many applications, high-performance memory is at a premium. You need memory for buffering, for shared resources between processors and accelerators, and for various types of caching. Often, it pays to have the memory located where it is needed, rather than at the other end of a busy multi-purpose bus or switch fabric. While FPGAs have always had some memory resources, designers have had to go off chip to get access to large amounts of storage. But off-chip memory interfaces are expensive and power hungry, and they introduce a considerable amount of latency. They also chew up valuable IO on your FPGA, and sometimes IO is the scarcest resource of all.

To address this issue, Xilinx has created what they call “UltraRAM” – high-performance memory blocks strategically located where they are likely to do the most good. Some device configurations have as much as 432 Mb of UltraRAM, significantly more memory than has been available in previous generations. UltraRAM doesn’t replace the existing block RAM and LUT-based memory resources. Different types and sizes of on-chip memory are useful for different purposes, and UltraRAM just augments the existing lineup with a new, much larger, high-performance memory block. Considering the size and complexity of the target applications for these devices, UltraRAM will likely come in extremely handy in many designs, and it is likely to improve the system performance, reduce latency, reduce power consumption, and significantly reduce BOM cost and board complexity when compared with external memory.

Next up is what Xilinx calls “SmartConnect” interconnect technology. Like UltraRAM, SmartConnect is a feature born of the demands of the much larger designs being implemented on these new devices. While the overhaul of the routing resources we described above dramatically improves the detailed routing part of your design, there is a new meta-level of interconnect that needs to be addressed as well. When your chip has large, complex blocks that talk to each other over prescribed interfaces, you typically have specific latency and/or throughput targets for those pipes. That means a “one priority fits all” interconnect strategy is guaranteed to be sub-optimal for at least part of your design. SmartConnect allows distinct optimization of interconnect to meet specific goals of each part of a design, reducing the overall routing resources required and matching the type of interconnect chosen to the specific constraints of each interface.

The Virtex and Kintex FPGA families take advantage of all this new stuff in just the way you’d expect. Interestingly, almost all of the Virtex devices will be fabricated with multi-die 3D silicon-interposer technology. The largest Virtex weighs in with a hefty 3.4 million 4-input LUT equivalent logic cells, 432 Mb UltraRAM, 94.5 Mb block RAM, and 46.4 Mb distributed RAM. It packs a whopping 11,904 DSP slices, four hardened PCIe® Gen3 x16 / Gen4 x8 interfaces, twelve 150G Interlaken interfaces, and eight 100G Ethernet MACs w/ RS-FEC. This is topped off with 832 single-ended IOs and an impressive 128 GTY 32.75Gb/s SerDes transceivers.

Clearly, this perfect storm of process scaling, 3D transistor improvement, architectural improvement, packaging technology advances, and design tool progress will give us the largest single leap forward in FPGA technology and capability we’ve ever seen.

Probably the biggest advance in this announcement, however, is the new UltraScale+ Zynq offering. Zynq got such a massive upgrade, it almost needs a new name. The current Zynq is a wonderful example of what we call an “HIPP” (Heterogeneous Integrated Processing Platform). It combines a dual-core ARM Cortex-A9 based processing subsystem with copious amounts of FPGA fabric and IO. This combination allows designs to take advantage of hardware acceleration of demanding algorithms along with conventional high-performance applications processing. The result is a highly power-efficient device with formidable processing capability.

With Zynq UltraScale+, the FPGA portion of that equation benefits from all of the enhancements we described above. But the ARM-based subsystem gets a major upgrade as well. The applications processors are now quad-core, 64-bit, ARM Cortex-A53s – packing significantly more MIPS than the old A9s. Then, for the real-time bits of your application, they dropped in dual-core Cortex-R5 real-time processors. Rounding out the passel-o-processors is a Mali-400MP graphics processor. Taken together, that’s a LOT more processing oomph than before, and the addition of the real-time engines and GPU means that you can tailor the type of processing better to the part of your application that needs it.

One of the big application areas for this new Zynq family is video, and Xilinx acknowledged that with the addition of a hardened H.265/264 codec unit. An “Advanced Dynamic Power Management Unit” brings some ASIC/ASSP-grade application power management to a programmable device. There is also a new configuration security unit to help lock down your design, and forward-looking DDR4/LPDDR4 memory interface support – which will be important for the high-performance designs likely to land in the new Zynq’s lap.

Taken together, we feel that the Zynq upgrades are the most significant of the bunch. Of course, the UltraScale+ Virtex and Kintex families are both taking huge leaps forward, but Zynq (with all its new hardened ARM-based IP) seems like a whole new animal. In the hands of capable design teams, Zynq will enable applications that might otherwise be impossible. It should deliver an immense amount of aggregate heterogeneous processing capability at an unmatched performance-per-watt efficiency.

Of course, we will all have to wait awhile before we get to play with these amazing devices. Xilinx plans initial samples late this year with volume production ramping in 2016. But tool early-access support starts much sooner than that (Q2 2015). So, if you’re one of the lucky ones in the early access program, you’ll be able to take these families for at least a virtual test-drive pretty soon.

Comments

kevin

February 24, 2015

What do you think? Are Xilinx’s upcoming UltraScale+ 16nm FinFET families what you were expecting?

WEATHERBEE

February 24, 2015

Hi Kevin, the realtime cores on the Zynq UltraScale according to Xilinx’s website are Cortex-R5 microcontrollers not Cortex-A5 microprocessors. Totally different beast and in my opinion it is about time the Cortex-R started showing up more often in generally available SoCs.

kevin

February 24, 2015

Thanks Weatherbee,

That was a typo, and it has been corrected.

Kevin

gobeavs

February 25, 2015

Just another nail in the coffin for Achronix. I think it’s funny that Achronix puts a LUT count on the hard IP. That is an interesting marketing ploy. I don’t think Xilinx or Altera put some lame effective LUTs including hard IP number out there.

You know what is funny. Altera was first to market with hard ARM. That was Excalibur. Then they gave it up. Now both Xilinx and Altera have major efforts in hard ARM. If Altera would have stuck it out, maybe they would have become the leader in hard ARM FPGAs.

Also, Altera came out with the huge ram idea. Remember the MegaRAM? Another idea they abandoned. Now Xilinx brings it back. Funny.

kevin

February 26, 2015

@gobeavs, I’m as impressed with this announcement as anybody, but I think you’re assigning a bit too much nobility to Xilinx and Altera marketing when you say “I don’t think Xilinx or Altera put some lame effective LUTs… number out there.”

Let’s review who we’re talking about here. First, with all due respect, these are the two marketing departments who brought us “system gates” – remember those?

Moving to the modern era, let’s browse a couple of product tables.
Xilinx:
http://www.xilinx.com/publications/prod_mktg/ultrascale_product_selection_guide.pdf#VUS

Looking at Xilinx, we see (on the XCVU440):
– Effective LEs – 5,391K
Wow, this device must have 5.3 million LUTs, right? Oops, hang on, in 0.1 point font, we see a footnote: “Relative to the effective logic utilization demonstrated in the competition’s 20nm product portfolio” Or, to paraphrase, we think our logic utilization is better than theirs, so we’re gonna inflate our “effective LEs” number by 20% – just because.

So, looking down farther, we see the “real” “Logic Cells” number:
– Logic Cells (K) – 4,433
OK, so that means this device ACTUALLY has 4.4 million LUTs, then. Yes?
Not so fast, there, cowboy. You could get your scanning electron microscope, pop the top off a package, and start counting logic cells (1, 2, 3… ) and you would run out LONG before 4,433K.
How come? Slide down a couple lines and you see this:
CLB LUTs – 2,532,960
THAT’s how many actual, real, logic cells there are on this device. Since FPGA companies are using much wider logic cells than legacy 4-input LUTs, they (BOTH) feel the need to tell us how many of the old, nonexistent LUT4s it would take to do about the same amount of “stuff”.

Result, the FPGA billed as 5.3 million cells, actually has 2.5 million.

Lest we think Xilinx is alone in their datasheetsmansmithing, looking at the top line of an Altera product table:

Altera:
http://www.altera.com/devices/fpga/arria-fpgas/arria10/overview/arr10-overview.html

For the GT1150:
Equivalent LEs – 1,150K
Adaptive Logic Modules (ALMs) – 427,200
Yep, even in Altera marketing land, a 1.2M logic cell device only has 427K logic cells.

So, by only rolling in some extra juice for hard IP in their otherwise-straightforward LUT4 counts… Achronix marketing just brought a knife to a gunfight.

TotallyLost

February 26, 2015

@gobeavs, and it wasn’t that many years ago that key Xilinx staff were brutally vocal putting down folks pioneering C/SystemC to RTL.

Now the company stance is: Vivado HLS accelerates design implementation and verification by enabling C, C++, and SystemC specifications to be directly synthesized into VHDL or Verilog RTL.

NIH rhetoric is more than common with them, until they release it as their idea.

gobeavs

March 1, 2015

I just think Achronix is toast. I would really love to hear what their advantage is. I think the Intel 22 nm may have been a small advantage but probably not enough to make that big of a difference. I think you need to be twice as fast or half the die size to get market traction. Achronix started with a speed story (pico pipes) but has abandoned that for normal FPGA arch.

I think Achronix was dumb not to put the 6 input LUT in there. I think it is clearly an advantage for high performance FPGAs. It reduces levels of logic between flops. The 6 input LUT is a big deal when it comes to Fmax.

Xilinx UltraScale+ and Altera Stratix 10 are going to end any hope for Achronix. Almost all FPGA startups fail, that is the history. The last one to exit was Silicon Blue. They had a niche market idea that did ok. Only a 62 M dollar exit though.

tentner

March 5, 2015

Marketing LCs:
While I generally agree, I would like to point out that the Altera ALMs can be used as 2 fully featured 4-input LUTs without restrictions + some extras, so 854K is a realistic count for the mentioned device, which leaves a (still quite large) marketing factor of 1.35x. In contrast, the Xilinx marketing factor is totally ridiculous (2.12x/1.76x). It is also clearly higher than what they use e.g. on Kintex 7 devices, I have no clue why.
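For anyone who wants to redo the arithmetic, here is a rough sketch using only the figures quoted in this thread. The constant and function names are made up for illustration, and treating each ALM as two 4-input LUTs is the assumption stated in this comment, not a vendor figure.

```python
# Figures as quoted in this thread (XCVU440 and Arria 10 GT1150).
XCVU440_EFFECTIVE_LES = 5_391_000
XCVU440_LOGIC_CELLS = 4_433_000
XCVU440_CLB_LUTS = 2_532_960
GT1150_EQUIVALENT_LES = 1_150_000
GT1150_ALMS = 427_200


def marketing_factor(marketed: int, physical: int) -> float:
    """Ratio of the advertised count to the count physically on the die."""
    return marketed / physical


# Xilinx: both headline numbers relative to the real CLB LUT count.
xilinx_effective = marketing_factor(XCVU440_EFFECTIVE_LES, XCVU440_CLB_LUTS)
xilinx_logic_cells = marketing_factor(XCVU440_LOGIC_CELLS, XCVU440_CLB_LUTS)

# Altera: assumption from this comment, one ALM counted as two 4-input LUTs.
gt1150_lut4s = 2 * GT1150_ALMS
altera = marketing_factor(GT1150_EQUIVALENT_LES, gt1150_lut4s)

print(f"Xilinx effective LEs / CLB LUTs: {xilinx_effective:.2f}x")
print(f"Xilinx logic cells / CLB LUTs:   {xilinx_logic_cells:.2f}x")
print(f"Altera equivalent LEs / LUT4s:   {altera:.2f}x ({gt1150_lut4s:,} LUT4s)")
```

Rounded to two places this gives about 2.13x and 1.75x for Xilinx and 1.35x for Altera, in line with the factors mentioned here.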

Maybe we should also mention Lattice which do not do such stupid marketing numbers, at least to my knowledge…

gobeavs

March 6, 2015

It is perfectly reasonable to provide a LUT4 equivalent. A LUT6 requires 64 bits of SRAM to implement any function of 6 inputs. A LUT4 only requires 16 bits. So the SRAM area of LUT6 is 4x of the LUT4. However, it only gives you a real usable ratio of around 2x.
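The SRAM numbers follow from the fact that a k-input LUT stores one bit per input combination, so it needs 2^k bits. A small sketch, with function names invented for illustration, models a LUT as nothing more than a truth table held in an integer:

```python
def lut_config_bits(inputs: int) -> int:
    """SRAM bits needed to store the full truth table of a k-input LUT."""
    if inputs < 0:
        raise ValueError("inputs must be non-negative")
    return 2 ** inputs


def lut_eval(table_bits: int, inputs: list) -> int:
    """Look up one output bit; inputs[0] is the least significant address bit."""
    index = 0
    for position, bit in enumerate(inputs):
        index |= (bit & 1) << position
    return (table_bits >> index) & 1


lut4_bits = lut_config_bits(4)
lut6_bits = lut_config_bits(6)
print(f"LUT4: {lut4_bits} bits, LUT6: {lut6_bits} bits, "
      f"ratio {lut6_bits // lut4_bits}x")

# Example: a 4-input AND is a table with only the all-ones entry set.
and4_table = 1 << 15
print(lut_eval(and4_table, [1, 1, 1, 1]), lut_eval(and4_table, [1, 0, 1, 1]))
```

This only covers the storage side of the argument. The roughly 2x usable ratio is an empirical, design-dependent figure and does not fall out of this arithmetic.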

I don’t think it’s easy to compare a LUT6 device with a LUT4 device and understand the comparable logic size. It’s reasonable to create an equivalent LUT4 count for a LUT6 device. Let’s say you have a design fitting into a Stratix that you want to try and move to Cyclone. You can use LUT4 count to get you in the ballpark. Until you compile, you won’t know the final answer.

What is unreasonable is to pile on an extra number on top of the ~2x number. This number is bogus – Effective LEs – 5,391K. The one that says we do better packing than you do. The tools and algorithms are constantly changing from release to release.

Achronix is the first one to count hard IP as a LUT count from anything I have ever seen.

tentner

March 8, 2015

I agree that a LUT6 is of more value than a LUT4. But I think the original LUT4 was quite cleverly chosen as it is a good compromise for what is most frequently required. So I think a factor of about 2x is much too high. On the other hand, it is really design dependent, as already mentioned.

I hope that Achronix stops counting hard IP as LUTs. You either need the IP block or you don’t… Otherwise others suddenly start to translate DSP blocks into LUTs… NO, PLEASE DON’T!
