In 1960, Gerald Estrin presented “Organization of computer systems: the fixed plus variable structure computer” at the Western Joint IRE-AIEE-ACM Computer Conference. His abstract reads in part: “…a growing number of important problems have been recorded which are not practicably computable by existing systems. These latter problems have provided the incentive for the present development of several large scale digital computers with the goal of one or two orders of magnitude increase in overall computational speed.” – and his solution to the problem is given in the title “the fixed plus variable structure computer” – thus giving birth to the concept of reconfigurable computing.
Yep. The idea of reconfigurable computing pre-dates Moore’s Law, which was born five years later, in 1965.
In the 58 years since, we have chased that reconfigurable computing carrot, dangling enticingly just out of reach from the end of our ever-evolving programming pole. And, for at least three of those six decades, we have had reasonable hardware available to fulfill the promise of reconfigurable computing – an architectural alternative that could deliver Estrin’s “one or two” orders of magnitude increase in overall computational speed. In fact, we now know it might deliver as much as three or four orders of magnitude, and that is on top of our almost-entitled biennial doubling due to Moore’s Law.
And, yet, we are still not there.
One could argue that Moore’s Law has prevented us from realizing the promise of reconfigurable computing. After all, simply riding the von Neumann horse from one semiconductor process node to the next gave us a reliable 2x improvement in price, performance, and power every other year, and we didn’t have to rewrite our software or even redesign our computing architecture to get it. When you get that kind of bounty almost for free, who needs to be greedy?
Reconfigurable computing has always had a die-hard cult-like academic following, however. If the real world can’t deliver on a promising technology, at least careers can be made publishing conference papers about what might have been. And, when modern FPGAs finally came along, researchers could easily build actual hardware to prove their point. Time after time, industry and venture capital were pulled into the fray, betting that the pairing of programmable logic with conventional processors would add those much-anticipated zeroes to our performance benchmarks. Time after time, they failed – not because we didn’t know how to build the hardware for reconfigurable computers (we did), but because nobody could program the things.
Of course, we could build the “Demo.”
For years, we’ve seen example after example of systems delivering amazing results with specialized algorithms accelerated to remarkable levels via painstakingly crafted custom FPGA accelerators parked next to conventional processors. If you had a team of expert digital designers, an unlimited budget, and a year or so to spare, you could deliver 100x better performance and power than conventional computers on your performance-critical application.
Or, you could just run it on 100 parallel Intel servers on day one – and get close enough, with far less cost, risk, and development time. And, as long as Intel was tracking Moore’s Law, doubling all our goodies every couple years, there was little incentive to dive into the terrifying world of RTL-based design to go even faster. The sustained systemic exponential improvement of Moore’s Law acted as a kind of sedative for development of the programming methodology that would unlock the potential of reconfigurable computing.
Of course, Intel didn’t continue tracking Moore’s Law forever. Over the past decade, we’ve seen significant slowing in the venerable trend, with less gain on each process node, and longer (and more expensive) development cycles between them. On top of that, the von Neumann architecture itself has started to hit practical limits in the form of power consumption. Faster clocking gave way to more parallelism, and, ultimately, even the practice of stacking racks and racks of servers into giant data centers hit a wall when the power company simply couldn’t deliver any more power.
Intel itself now needs reconfigurable computing, and, of course, they’re working on it. Hard.
With the acquisition of Altera, Intel is now in a position to deliver the required hardware – processors with FPGA accelerators – into the data centers of the world. Intel’s dominance of the data center gives it a substantial leg up on the competition when it comes to widespread deployment of a disruptive technology like reconfigurable computing. But as long as the programming problem remains unsolved, that dominance is vulnerable.
Intel inched closer to that goal recently with the release of its first SDK that merges Intel’s existing software development frameworks and compiler technology with the OpenCL capabilities in Altera’s Quartus Prime FPGA development tools, smoothing the path for OpenCL developers who want to take advantage of FPGA acceleration. The ultimate objective is to abstract away the details of the FPGA implementation so that software developers can write something close to conventional code and still benefit from FPGA acceleration, without needing hardware engineers who are FPGA/RTL experts on the team.
The FPGA SDK adds FPGA support to both Microsoft Visual Studio and the Eclipse-based Intel Code Builder for the OpenCL API, giving OpenCL developers a familiar environment for their FPGA exploits. To address the problem of hours-long code-compile-test cycles on FPGAs, Intel provides what it calls “Fast FPGA emulation,” using Intel’s compilers to emulate the functionality of the FPGA implementation in software, so that debug iterations feel more like a conventional software workflow.
Because OpenCL itself is also a bit of an evangelical sale, Intel has included a smorgasbord of features designed to reduce fear, uncertainty, and doubt in the software development crowd. An OpenCL jump-start wizard helps programmers to overcome “blank page” syndrome, and features like syntax highlighting and code auto-completion make it seem like regular-old software development. To help with the FPGA-isms, there is “what-if” kernel performance analysis and quick static FPGA resource and performance analysis. And, when it comes time to push the design into actual hardware, there is support for fast and incremental FPGA compile to reduce the pain of those final (hopefully) few design iterations.
It will be interesting to watch the adoption and evolution of the FPGA SDK for OpenCL. At first, it is likely to be a “power tool” for teams who are already sold on the benefits and aware of the costs and risks of FPGA acceleration. While this is not the magic bullet that will break the dam and allow mainstream software to flow into the long-waiting arms of reconfigurable computing envisioned by Gerald Estrin 58 years ago, it does represent a significant step forward in allowing software engineers with no hardware training to begin to take advantage of some of that promise. For now, tapping into the real potential of FPGA-based acceleration will still probably require the expertise of FPGA designers and the development time of RTL-based implementation. But any progress toward bridging that development gap could be significant.
