Where’s the CNN Synthesis?

Kevin Morris

August 16, 2018

The mission of electronic design automation (EDA) has always been primarily to facilitate the design and verification of electronic circuits. EDA began, of course, with companies like Mentor, Daisy, and Valid providing specialized software for capturing and editing schematic drawings. These tools took the native human-readable language of the designer, the schematic, and created the fundamental machine-readable structure of EDA: the netlist.

In the four decades since, EDA has not strayed far from that path, conceptually. The job just got tougher. Moore’s Law took complexity through the roof, for logic design in particular. With designs going from handfuls of gates to billions, the human-readable side of that equation evolved. Schematics gave way to gate-level and then to register-level hardware description languages (HDLs), and EDA responded with an entirely new class of tools: logic synthesis.

A generation of digital designers became experts in HDLs and synthesis. At the same time, the explosion of the synthesis market rocketed Synopsys into the number one position in the EDA industry on the coattails of Design Compiler, a logic synthesis tool that dominated the industry for decades. The foresight (or luck, depending on your view) that Synopsys showed in grabbing onto a major methodology shift won the company a seat at the head of the table that they’ve enjoyed for over twenty years.

Now, another disruptive change is hitting the industry. Artificial intelligence has progressed more in the past four years than in all of its previous long history, fueled primarily by rapid progress in convolutional neural networks (CNNs). CNNs present a unique challenge to hardware design, as the dominant architecture of the past twenty years – the so-called “system on chip” (SoC) – is woefully inadequate for meeting the computational demands of CNNs. That means designers need to come up with novel digital architectures to implement CNNs efficiently in hardware.

More specifically, executing a CNN model in software on a von Neumann machine is woefully inefficient. In order to meet the demands of applications such as machine vision, we need to make orders of magnitude improvement in latency, throughput, power consumption, and cost. Or, to put it another way, we need custom hardware designed specifically for the task.

Unfortunately, every CNN model is unique in its topology. So far, at least, there is no one-graph-fits-all approach to CNN design. That means that every model or algorithm requires a unique structure of logic, memory, and data flow to optimize it across the key metrics. So, we’ve got software-only implementations that are inadequate – running on multi-core CPUs, GPUs, and some specialized processors, and we’ve got hardware implementations that require unique and specialized logic design on a per-application basis.

It’s time for a new tool/flow.

The demands of CNN implementation are a significant departure from the direction that logic system design flow has taken up until now. Our current tool flow has evolved to ever-higher levels of abstraction. We began with simple logic gates and evolved to ever-larger structures stitched together to create our desired function. Today, most systems are created by combining processors, peripherals, and specialized blocks/accelerators to meet the requirements of our system/application.

In that context, we could view CNNs as just another one of those “specialized blocks/accelerators.” We already have design tools and flows for those. High-level synthesis, for example, is adept at taking a software-like description of an algorithm in C or C++ and synthesizing that into a highly-optimized logic structure (usually a datapath with control, memory, and interfaces). And some approaches to CNN design are taking advantage of HLS today.
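
As a rough illustration of the kind of software-like description an HLS tool consumes, here is a minimal C++ sketch of the loop nest at the heart of one convolution layer. The names (`conv2d`, `FeatureMap`, `Kernel`, `MAX_DIM`, `K`) and the sizes are hypothetical, chosen only for this example, and no vendor-specific pragmas or directives are shown; a real HLS flow would likely need tool-specific guidance on pipelining, unrolling, and memory partitioning.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sizes, chosen only for illustration.
constexpr std::size_t MAX_DIM = 8;  // largest feature map the storage can hold
constexpr std::size_t K = 3;        // kernel is K x K

using FeatureMap = std::array<std::array<int, MAX_DIM>, MAX_DIM>;
using Kernel = std::array<std::array<int, K>, K>;

// Single-channel "valid" 2D convolution (computed as cross-correlation, the
// way CNN frameworks usually do). Storage is statically sized and there is no
// dynamic allocation; height and width select the active region of the input.
FeatureMap conv2d(const FeatureMap& in, const Kernel& w,
                  std::size_t height, std::size_t width) {
    assert(height >= K && height <= MAX_DIM);
    assert(width >= K && width <= MAX_DIM);

    FeatureMap out{};  // zero-initialized; cells outside the result stay 0
    for (std::size_t r = 0; r + K <= height; ++r) {
        for (std::size_t c = 0; c + K <= width; ++c) {
            int acc = 0;  // one multiply-accumulate chain per output pixel
            for (std::size_t kr = 0; kr < K; ++kr) {
                for (std::size_t kc = 0; kc < K; ++kc) {
                    acc += in[r + kr][c + kc] * w[kr][kc];
                }
            }
            out[r][c] = acc;
        }
    }
    return out;
}
```

The code says nothing about how the four loops should be unrolled, pipelined, or mapped onto memories. Those decisions are left to the tool and the designer.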

There are three major problems with this approach, however. First, it appears that the optimal implementation of CNNs will often be heterogeneous combinations of conventional processors with custom hardware. That means that there is a partitioning of functionality between software and hardware that is well beyond the scope of current HLS technology. Second, HLS is a very general tool for generating hardware architectures from software-like sequential algorithms, but CNN architectures tend to be much more structured and predictable. Throwing some code at HLS and hoping that it magically creates an optimized CNN is quite a roll of the dice. Finally, the very small number of folks who currently know how to develop CNNs don’t tend to have the expertise in hardware design required to use the latest HLS design flows.

So, there’s a disconnect between the state of the art in logic system design and the data science experts who design CNNs. This gap must be bridged with an automated tool flow that can understand the native language of CNN experts and can drive a process that results in customized CNN hardware. Sounds simple, right?

It’s not like nobody has thought of this problem. There are currently a number of tool flows that cobble together pieces of the puzzle in a Rube-Goldbergian fashion, with varying degrees of success. Just in the past couple of years, a number of (mostly academic) efforts have produced notable results. Most of these start with one of the current CNN modeling frameworks, such as Caffe or TensorFlow, as input and produce some kind of synthesizable RTL as output. These flows include fpgaConvNet, ALAMO, Angel-Eye, DeepBurning, Haddoc2, Caffeine, Finn, FP-DNN, Snowflake, FFTCodeGen, and perhaps others we’ve overlooked.

Most of these tool flows target FPGA hardware, although for some applications we might want ASIC implementations instead. Some are specific to Xilinx or to Intel FPGA flows, while others make efforts to produce portable results. It is possible that the optimal implementation in many situations might take advantage of new eFPGA IP blocks in an ASIC (such as those provided by Achronix, Flex Logix, and others), producing a custom chip whose CNN model can be reprogrammed or optimized in the field.

The EDA industry dominates most of the underlying technology required to solve this problem, and it is a problem that will be front-and-center for at least the next couple of decades, with very high percentages of new system designs trying to take advantage of the capabilities of CNNs. The infrastructure that EDA already owns for implementing and verifying logic hardware, from high levels of abstraction down to optimized, verified, placed-and-routed gates, is essential to the solution, yet we see no evidence that EDA is tackling the top level of the problem. Instead, there is a plethora of competing academic efforts underway trying to build clumsy structures around existing EDA (and FPGA vendor) tool flows.

Perhaps EDA is already working on this problem in secret. Or, perhaps there are startups quietly toiling away in hopes of becoming the “Synopsys” of the next era in electronic system design. Or, maybe we’re just too early in the evolution of this technology to start canonizing it with purpose-built tools. Our guess is that EDA has just overlooked the opportunity or doesn’t know where to start. One thing is certain, though. This problem is too important to be ignored for long. It will be interesting to watch.

Comments

Kev

August 16, 2018

CNNs are a lot like the Fast-SPICE problem, and neural networks in general are like circuit simulation in (say) Verilog. The problem for the EDA guys is that they have not managed to get off RTL design into more abstract forms like asynchronous-FSM, or to make “real number” modeling work properly in SystemVerilog.

Speaking of SystemVerilog, I tried to add the support for a-FSM design over a decade ago because asynchronous implementation is something you need for low power and FinFET level design, but the big guys on the SV committees shut down that effort. That inspired me to work out how to do it all in C++ –

http://parallel.cc

I’m currently working on getting LLVM folks to implement that so I don’t have to deal with big EDA companies (who are extremely unlikely to catch on suddenly).

These guys might be doing the back-end piece –

http://ascenium.com/

None of the FPGA companies I know are close to getting a good C++ synthesis flow working, so arbitrary neural networks seem a stretch.

TotallyLost

August 17, 2018

I’m not a fan of using C++ as an HDL because classes using dynamic allocation (with pointers) are nearly impossible to implement in a clean way that is inherently parallel in logic. Take that away, and we are pretty much back to a highly typed C standard.

Other approaches like OpenMP C hand the compiler writer clean and clear parallel handles for synthesis that are highly portable across run time architectures, including logic based synthesis for FPGA and ASICs. And this can include run time constrained pointers within typed memory systems by translating pointers into array structures.
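
One way to picture "translating pointers into array structures" is a linked list whose links are indices into a fixed, typed pool rather than raw addresses. The sketch below is only an assumption about what such a translation could look like; `Pool`, `Node`, `NIL`, `POOL_SIZE`, and `sum_list` are hypothetical names, not part of OpenMP or any synthesis tool.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t POOL_SIZE = 8;    // hypothetical fixed pool capacity
constexpr std::size_t NIL = POOL_SIZE;  // sentinel index playing the role of nullptr

struct Node {
    int value;
    std::size_t next;  // index into the pool instead of a Node* pointer
};

using Pool = std::array<Node, POOL_SIZE>;

// Walk a list stored in the pool and sum its values. Every "dereference" is a
// bounded index into one typed array, so the memory it can touch is known
// up front.
int sum_list(const Pool& pool, std::size_t head) {
    int total = 0;
    std::size_t steps = 0;
    for (std::size_t i = head; i != NIL; i = pool[i].next) {
        assert(i < POOL_SIZE);  // the "pointer" stays inside the pool
        ++steps;
        assert(steps <= POOL_SIZE);  // a well-formed list has no cycles
        total += pool[i].value;
    }
    return total;
}
```

Because each link can only name one of `POOL_SIZE` slots, the pool can in principle map onto a single block of memory with a known address width.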

FPGAs (and ASICs) are good C run time environments for algorithms that have fairly static data and instruction flows, allowing independent parallel logic and memories for high performance and low power. Things like FFTs, filters, and high speed parallel FSMs are easy.
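
A filter is a convenient example of such a static flow. The following is a minimal sketch of a direct-form FIR filter with an OpenMP work-sharing pragma marking the loop whose iterations are independent. The names (`fir`, `Samples`, `Coeffs`, `MAX_SAMPLES`, `TAPS`) and sizes are hypothetical, and whether a given synthesis tool accepts OpenMP directives is an assumption here, not something the code demonstrates.

```cpp
#include <array>
#include <cassert>

constexpr int MAX_SAMPLES = 8;  // hypothetical buffer capacity
constexpr int TAPS = 4;         // hypothetical filter length

using Samples = std::array<int, MAX_SAMPLES>;
using Coeffs = std::array<int, TAPS>;

// Direct-form FIR filter: y[n] = sum over k of h[k] * x[n - k], with x
// treated as zero before index 0. Only the first `count` samples are used.
Samples fir(const Samples& x, const Coeffs& h, int count) {
    assert(count >= 0 && count <= MAX_SAMPLES);

    Samples y{};
    // Each output sample reads only x and h and writes its own y[n], so the
    // iterations are independent. A compiler without OpenMP enabled ignores
    // the pragma and runs the loop serially with the same result.
    #pragma omp parallel for
    for (int n = 0; n < count; ++n) {
        int acc = 0;
        for (int k = 0; k < TAPS && k <= n; ++k) {
            acc += h[k] * x[n - k];
        }
        y[n] = acc;
    }
    return y;
}
```

The data flow here does not depend on the sample values, only on the fixed loop bounds.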

FPGAs are not good C run time environments where the data drives significant variation in logic flows, or is based on a large memory pool with multiple data types … and/or dynamic allocation with method pointers. Architectures with pipelines, caches, and multiple cores do a better job, faster, and at lower power. It’s possible ASICs can cleanly handle some of these applications, but I’m not yet sold that they can do better for all.

The good news is that Intel is a strong supporter of both OpenMP and open source development tools. Hopefully the DARPA initiatives will take us down that road, with Intel support.

https://www.darpa.mil/attachments/eri_design_proposers_day.pdf

https://spectrum.ieee.org/tech-talk/semiconductors/design/darpa-picks-its-first-set-of-winners-in-electronics-resurgence-initiative

1. TotallyLost
  
  August 17, 2018
  
  And this means a CNN coded in std C (or OpenMP) should have a clear synthesis path, that is both low power and high performance in an FPGA or ASIC. Assuming the HDL C synthesis tool provides simple net lists to the vendor optimization and P&R tools.
  
  1. TotallyLost
    
    August 17, 2018
    
    And assuming the C algorithm coders are run time environment aware, and don’t use an implementation architecture or coding style that is inherently difficult to implement on the run time architectures.
    
2. Kev
  
  August 29, 2018
  
  C++ is better than SystemVerilog if you add a couple of things to it. SystemC sucks, you certainly don’t want to use that.
  
  NB: C++ can be viewed as an “executable spec”, you don’t have to do a literal implementation.
  
  Stuff like OpenMP is just bad – you can’t fix a language deficiency with an API.
  
  1. TotallyLost
    
    August 31, 2018
    
    There is certainly an issue trying to make some tool be a “one size fits all”. What is “just bad” for some applications, can certainly be the golden choice for many others.
    
    And a good implementation of OpenMP is awesome … unfortunately that currently is limited to a few popular implementations. But it does have a very portable framework for describing preferred areas of parallelism, that when done right, can generate run time performance gains across multiple target architectures … including FPGA/ASIC. More importantly implementing the algorithms on some architecture like X86 first, and then collecting extensive run time metrics, yields concrete data on where and how to optimize parallelism for the intended target execution platform.
    
mfingeroff

August 20, 2018

I think it’s true that there is no “one size fits all” solution out there for convolutional neural networks (CNNs), and this will remain the case for quite a while. What we see today is the proliferation of general purpose accelerators combined with a compiler to implement any network, with some compromise of PPA, and mostly targeted towards FPGA. However, if you look at what’s being done with HLS targeted towards ASIC, general purpose CNN accelerators are not sufficient.

Most chip companies working in the computer/machine vision space want to be able to tune their design architecture based on the end application, and hand-coding RTL is no longer practical. This is why HLS has been exploding in the past couple of years, even though it’s been around for close to two decades. Companies get to leverage their in-house design expertise and add their “special sauce” to differentiate their CNN hardware implementations, using HLS to accelerate the design process. Push-button hardware-software flows are already happening in the FPGA space that will enable more people on the software side of design to participate in hardware creation. But just like we’ve seen in HLS over the years, automating high-performance/quality custom hardware creation from any old abstract description is still far in the future.

The current reality is that custom solutions tuned for PPA require a hardware design expert in the loop.

1. Kev
  
  August 29, 2018
  
  “HLS has been exploding in the past couple of years” – really, can’t say I’ve seen that anywhere. Working on the language committees for SystemVerilog and other things, only the complete lack of movement is apparent, and the verification world is as bad as it ever was.
  
  1. TotallyLost
    
    August 31, 2018
    
    LOL … have to be aware of choosing (and creating your own) echo chambers
    
Karl Stevens

October 30, 2018

Since neither C nor C++ is suitable for hardware design and SystemVerilog or VHDL are not really suitable to program anything, it is time to step back and think about hardware design:
There are inputs, outputs, and functional blocks.
Blocks do arithmetic and/or Boolean functions using logic gates or look-up tables.
Memory blocks store data and also can be used for control sequencing.
It takes time for gates to resolve so there has to be a way to generate time intervals so outputs are not sampled until fully resolved.
Hardware is inherently parallel so there is no assertion of what is to be done in parallel as must be done for multi-core or multi-threading.
Although a string of if/else statements is analogous to an and/or network of logic gates, the result is resolved sequentially instead of in parallel.
So what do I need to design hardware?
1) An arithmetic block.
2) A Boolean Logic block.
3) A way to generate a time event to allow for resolution time.
OOP programming is block(object) oriented and therefore closer to hardware structure.
C#/Roslyn/Mono is OOP and also has an AST API that is very useful for handling that pesky assignment expression operator precedence. OOP classes can be functional equivalents of hardware blocks.
An arithmetic class, a Boolean class, and a timed event class with a compiler and debugger pretty much does it.
Then I can debug hardware logic on the same platform used to debug the associated C code.
Just about ready for open source hardware/SOC design.
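
To make the three-class idea concrete, here is a minimal sketch of an arithmetic block, a Boolean block, and a timed event wired into a small accumulator. It is written in C++ rather than the C# suggested above, and every name in it (`ArithmeticBlock`, `BooleanBlock`, `TimedRegister`, `accumulate_cycle`, the nanosecond parameters) is hypothetical; it is one possible reading of the proposal, not the commenter's design.

```cpp
#include <cassert>
#include <cstdint>

// Arithmetic block: a combinational 8-bit adder.
struct ArithmeticBlock {
    static std::uint8_t add(std::uint8_t a, std::uint8_t b) {
        return static_cast<std::uint8_t>(a + b);  // wraps modulo 256
    }
};

// Boolean block: a small piece of combinational logic.
struct BooleanBlock {
    static bool and_not(bool a, bool b) { return a && !b; }
};

// Timed event: a register that samples its input only after the logic
// feeding it has had time to resolve.
class TimedRegister {
public:
    explicit TimedRegister(unsigned settle_ns) : settle_ns_(settle_ns) {}

    // Present a new value at the register input (not yet visible on q()).
    void drive(std::uint8_t d) { next_ = d; }

    // Sample the input after elapsed_ns have passed since it was driven.
    void tick(unsigned elapsed_ns) {
        assert(elapsed_ns >= settle_ns_);  // sampling too early is a timing bug
        q_ = next_;
    }

    std::uint8_t q() const { return q_; }

private:
    unsigned settle_ns_;
    std::uint8_t next_ = 0;
    std::uint8_t q_ = 0;
};

// One clock period of an accumulator: add `in` to the stored value when
// enable is high and hold is low, otherwise keep the stored value.
void accumulate_cycle(TimedRegister& acc, std::uint8_t in,
                      bool enable, bool hold, unsigned period_ns) {
    const bool load = BooleanBlock::and_not(enable, hold);
    const std::uint8_t sum = ArithmeticBlock::add(acc.q(), in);
    acc.drive(load ? sum : acc.q());
    acc.tick(period_ns);
}
```

Stepping `accumulate_cycle` in an ordinary debugger shows the register value change once per call, which is the kind of logic-level debugging the comment is asking for.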

Karl Stevens

October 31, 2018

Here’s what I think is the key paragraph, along with the point that a lot of design should be done before synthesis and P&R/timing.
“Unfortunately, every CNN model is unique in its topology. So far, at least, there is no one-graph-fits-all approach to CNN design. That means that every model or algorithm requires a unique structure of logic, memory, and data flow to optimize it across the key metrics. So, we’ve got software-only implementations that are inadequate – running on multi-core CPUs, GPUs, and some specialized processors, and we’ve got hardware implementations that require unique and specialized logic design on a per-application basis.

It’s time for a new tool/flow.”

Here are links to an active open-source project.

https://github.com/freechipsproject/chisel3/wiki
https://chisel.eecs.berkeley.edu/

Chisel (Constructing Hardware in a Scala Embedded Language) outputs Verilog for synthesis AFTER hardware/logic design and AFTER the HARDWARE designer has learned STILL ANOTHER PROGRAMMING LANGUAGE (Scala).

It is about time for the synthesis programmers to LEARN HARDWARE/BOOLEAN LOGIC and realize that most hardware bugs are logic bugs rather than synthesis bugs. Why does a hardware designer have to wait for synthesis, P&R, and timing analysis to run on an incomplete design before simulation can run? The D in EDA has nothing to do with design.
