A Brief History of the Single-Chip DSP, Part II

Steven Leibson

September 8, 2021

Communications, Computers, FPGA, MilAero, System Design

After DSP’s annus mirabilis in 1948, another three decades would pass before actual, practical DSP chips would appear. DSP bits and pieces like TRW’s MPY016H hardware multiplier and TI’s TMC0280 LPC speech chip teased that real, integrated DSPs were just around the corner, but it was not until the 1980s that semiconductor technology advanced enough to make programmable DSP chips practical. The number of single-chip DSPs exploded during the 1980s and 1990s. Then, after 20 years, the era of the single-chip DSP came to an abrupt end. (Note: This article is the second half of “A Brief History of the Single-Chip DSP.”)

Wally Rhines was working for Texas Instruments (TI) in the 1970s, and he desperately wanted to leave TI’s site in Lubbock, Texas. When an opportunity arose for him to manage TI’s microprocessor operation in Houston, he took the position because he found Houston a far more attractive place to live. Besides, no one else wanted the job. TI’s 16-bit 9900 microprocessor was dead in the water due to its uncompetitive 16-bit address space. With TI having thus failed to capture a piece of the general-purpose microprocessor market, Rhines’ newly adopted microprocessor team in Houston created a four-pronged application-specific processor strategy. The four prongs of TI’s forked strategy were:

The TMS320 DSP family
The TMS340 family of graphics processors
The TMS360 mass-storage processor (which quickly went nowhere)
The TMS380 token-ring LAN processor for IBM’s networking architecture

Of these, the TMS320 DSP family became the rock star prong in the strategy. As Rhines said in an interview, “…it teaches a lesson: desperation is the mother of innovation.” After a couple of gestational years, TI rolled out the first TMS320 DSPs in April, 1982. However, just building the chip was not sufficient for a new technology like this. TI evangelized DSP and supported its new DSPs with software development tools and training for years before seeing significant success with the parts. According to Rhines, it took another five or six years before TI started to see some real revenue from the products.

TI Wasn’t the First

However, TI’s DSP chips were certainly not the first in the market. Intel had sprinted to an early lead by introducing the ill-fated 2920 Analog Signal Processor in 1979, but another of the company’s products, the 16-bit 8086 microprocessor, caught fire when its little brother with the 8-bit external data bus – the 8088 microprocessor – became the heart of the IBM PC. The Intel 2920 sank from sight, quite possibly because Intel’s full attention was being drawn to the general-purpose microprocessor markets.

TI was only one of several semiconductor companies preparing to enter the DSP arena in the early 1980s. According to Will Strauss, President of Forward Concepts and a DSP analyst for many decades, the first “true” single-chip DSPs with hardware multiplier/accumulators to be announced were the AT&T DSP-1 – developed by Bell Labs and first sampled within AT&T in May, 1979 – and the NEC µPD7720, which was announced at the IEEE Solid State Circuits Conference in February 1980. AT&T incorporated the DSP-1 into its groundbreaking 5ESS electronic switching system for its telephone network. AT&T then continued to evolve the device for a few generations, which included the DSP16 and the DSP32 (the first floating-point DSP chip). However, the AT&T DSP-1 and its successors remained captive within the Bell System, never to become commercially available to other systems companies.

The NEC µPD7720 had a 16×16-bit multiplier and two 16-bit accumulators, so it was a true single-chip DSP. Although NEC announced the device in early 1980, it didn’t become commercially available along with the required development tools until 1981. Strauss notes that the NEC µPD7720 found its greatest success in Japan, as happens with so many programmable ICs from Japan, and it was also popular in Europe.

Motorola Semiconductor became another early contender in the battle for DSP chip dominance during the 1980s, starting with the DSP56000 processor introduced in 1986. The Motorola DSP56000 had a 24-bit hardware multiplier and two 48-bit accumulators that could be extended by another 8 bits using a pair of extension registers. This large data-word capability gave the Motorola DSP56000 the ability to handle high-precision audio, so the Motorola DSP56000 quickly became popular with developers of high-end audio systems.
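
To see why those 8 extension bits matter, here is a rough back-of-the-envelope sketch in Python. It assumes plain signed two's-complement integer arithmetic, which is a simplification and not a model of the DSP56000's actual fractional number format, and the function name is made up for illustration. It counts how many worst-case products an accumulator of a given width can absorb before it overflows.

```python
def max_worst_case_products(operand_bits, accumulator_bits):
    """How many worst-case products fit in a signed accumulator.

    Assumes plain two's-complement integers (an illustrative
    simplification, not the DSP56000's fractional format).
    """
    # Largest possible product magnitude: (-2^(n-1)) * (-2^(n-1))
    worst_product = (1 << (operand_bits - 1)) ** 2
    # Largest positive value a signed accumulator can hold
    accumulator_max = (1 << (accumulator_bits - 1)) - 1
    return accumulator_max // worst_product


if __name__ == "__main__":
    print(max_worst_case_products(24, 48))  # 48-bit accumulator alone
    print(max_worst_case_products(24, 56))  # with 8 extension bits
```

Under these assumptions, a bare 48-bit accumulator has room for only one worst-case 24×24-bit product, while the extended 56-bit accumulator can sum hundreds of them, which is the kind of headroom long audio filters need.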

Duking It Out In The 1990s

The major participants in the DSP arena battled for dominance during the 1980s and 1990s. They produced multiple generations of increasingly powerful devices with multiple hardware multipliers, floating-point hardware multipliers, and larger amounts of on-chip memory. By the late 1990s, TI, Motorola, and Philips had developed DSP monster processors with VLIW architectures, multiple multiplier/accumulators, and additional function units for special operations such as bit swizzling.

Development of bigger and more powerful standalone DSP chips came to an abrupt halt when a competing chip technology veered out of nowhere and blindsided the DSP vendors. Just as the Chicxulub asteroid wiped out the dinosaurs 66 million years ago and left a thin layer of iridium in the rock strata as a calling card, FPGAs crashed the single-chip DSP party at the turn of the millennium.

The combination of one fundamental principle of DSP and some history explains how and why FPGAs quickly wiped out single-chip DSPs as a vibrant processor category. First, the principle: DSP is all math and DSP performance relies on the ability to perform a ton of multiply/accumulate operations (MACs) very quickly. That’s why the latest single-chip DSPs featured multiple hardware multiplier/accumulator units and additional function units to route non-MAC operations away from the multiplier/accumulators. The more MAC units a device has, the faster it can perform DSP operations because most DSP algorithms contain a lot of inherent parallelism that multiple MAC units can exploit.
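
A finite impulse response (FIR) filter is a handy illustration of the principle. The Python sketch below is only a model, with function names invented for this example. It counts the MACs in a direct-form FIR filter and then estimates, under the idealized assumption that every tap's MAC can run in parallel with no memory or adder-tree bottleneck, how many cycles one output sample takes on a device with a given number of MAC units.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR filter; returns (outputs, number of MACs performed)."""
    outputs = []
    mac_count = 0
    for n in range(len(samples)):
        acc = 0  # the accumulator
        for k, coeff in enumerate(coeffs):
            if n - k >= 0:  # samples before the start are treated as zero
                acc += coeff * samples[n - k]  # one multiply/accumulate
                mac_count += 1
        outputs.append(acc)
    return outputs, mac_count


def cycles_per_output(taps, mac_units):
    """Idealized cycles per output sample with the MACs spread over mac_units."""
    return -(-taps // mac_units)  # ceiling division


if __name__ == "__main__":
    outputs, macs = fir_filter([1, 2, 3, 4], [1, 1])
    print(outputs, macs)
    for units in (1, 8, 144):
        print(units, cycles_per_output(64, units))
```

Every tap of every output sample costs one MAC, and the taps are independent of one another, so in this idealized model a 64-tap filter drops from 64 cycles per output on a single MAC unit to 8 cycles on eight units and to a single cycle once there are at least 64 units.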

Now, the history: FPGAs first appeared on the scene in 1984 when Xilinx introduced the XC2064. That first FPGA was little more than a bunch of very slow gates (actually programmable logic blocks based on lookup tables) surrounded by a lot of programmable interconnect. This early architectural design allowed the FPGA to gobble up many TTL chips’ worth of logic on a board design. But the earliest FPGAs were pretty slow; they didn’t threaten the processors of the day and certainly didn’t impinge on DSP territory. Not at first, anyway.

The FPGA Age of Expansion

As intended from the start, FPGAs rode Moore’s Law, growing from the paltry 64 logic blocks in the original Xilinx XC2064 FPGA to tens of thousands of logic blocks by the year 2000. In an article published in Proceedings of the IEEE titled “Three Ages of FPGAs: A Retrospective on the First Thirty Years of FPGA Technology,” former Xilinx Fellow Steve Trimberger called the period of FPGA growth during the 1990s the “Age of Expansion.” During this era, FPGAs grew larger and larger by incorporating more and more programmable logic blocks. However, when compared to ASICs, circuits built with programmable logic within an FPGA are relatively slow, inefficient with respect to silicon usage, and more expensive. So MACs built with programmable logic are relatively slow and costly.

Later, during Trimberger’s “Age of Accumulation” – when FPGAs added hardened MAC blocks – FPGAs suddenly became serious DSP competitors. And FPGAs didn’t add just one or two hardware multipliers; enabled by the largesse of Moore’s Law, they added dozens of them.

The first FPGA device family to incorporate fast hardware multipliers was the Xilinx Virtex-II FPGA family. In July, 2001, Xilinx announced that it had already shipped a million dollars’ worth of Virtex-II XC2V6000 FPGAs, each with 144 hardened, on-chip 18×18-bit multipliers. So the first FPGA to incorporate hardware multipliers could already outperform every single-chip DSP that existed at the time, and likely every single-chip DSP that ever will exist.

Altera followed Xilinx and announced its first generation of Stratix FPGAs with 36×36-bit hardware multipliers in 2002. The hardware multipliers in the Stratix FPGAs were fractionable as 18×18-bit or 9×9-bit multipliers to permit even more MAC operations, albeit at lower bit resolution. In the first few years of this millennium, Xilinx and Altera FPGA families far outdistanced single-chip DSPs in the number of simultaneous MAC operations they could perform.

Today’s FPGAs Have MACS a’Plenty

Today, some of the smallest FPGAs from Intel (which bought Altera in 2015) and Xilinx deliver plenty of hardware multipliers. Members of the older but still-available Intel Cyclone IV FPGA family incorporate 80 to 532 18×18-bit embedded multipliers. Similarly, the older Xilinx Spartan 6 FPGA family includes devices with 8 to 180 DSP48A1 slices, while members of the newer Xilinx Artix FPGA family incorporate as many as 740 DSP48E1 slices. Each DSP48A1 slice contains an 18×18-bit multiplier and a 48-bit accumulator, while each DSP48E1 slice contains a 25×18-bit multiplier and a 48-bit accumulator. The number of bits in the DSP48-slice multipliers seems to, er, multiply over time.

The largest FPGAs from Intel and Xilinx feature thousands of DSP blocks and are capable of delivering three orders of magnitude more MACs/second than the fastest DSP chips. For example, members of the largest Intel Stratix 10 TX FPGA family are available with 5760 variable-precision DSP blocks, each containing two 18×19-bit hardware multipliers that can be configured as one 27×27-bit multiplier. That’s as many as 11,520 hardware multipliers on one big chip. The largest Xilinx Virtex UltraScale Plus FPGAs incorporate 12,288 DSP48E2 slices, each containing a 27×18-bit multiplier and a 48-bit accumulator.

Note that Intel and Xilinx are not the only FPGA vendors cramming hardware multipliers into their FPGAs. You can get FPGAs from Achronix, Lattice, and Microchip with various amounts of DSP hardware – MACs – built into the devices. For example, the recently announced Lattice CertusPro-NX FPGA is available in two sizes, with 96 or 156 on-chip 18×18-bit multipliers. (See “Lattice Launches CertusPro-NX.”)

If you still want to write DSP code and run it on a single-chip DSP, you can. NXP, which acquired Motorola Semiconductor’s successor, Freescale, offers the DSP56300, DSP56700, and MSC8000 DSP families. These are the latest – and quite possibly the last – single-chip descendants of the Motorola DSP lines. In addition, you can still purchase members of the TI TMS320 DSP families off the shelf. Meanwhile, hardware multipliers have become quite common in the design of general-purpose processors, where you can find monster 512-bit SIMD vector units fully capable of delivering respectable DSP performance, and even in microcontrollers, so that you can more easily incorporate DSP into even the smallest embedded designs. For all of this, give thanks to Moore’s Law.

However, there’s simply no comparison at this point. If your high-performance DSP application requires lots of fast MAC operations, FPGAs with their hundreds or thousands of fast hardware multipliers are uniquely qualified for the job.

How about you? How do you DSP? Why not leave a comment below?

Comments

acantle1

September 8, 2021

Back in the early 90’s I used the TI C80 DSP that had similar architecture to the Cell processor. On paper it was a wonderful device until you tried to use it with its fundamentally flawed Transfer Controller for handling ALL IO on and off the device! We used the C80 to Model Optical Transfer functions of Optical Lens on a Thermal Imaging Camera.
In that same system we used the Intel I860 Vector Processor for Matrix Transformations of 3D Wire Frame models of aircraft into 2D Rasterisation wireframe images that we inserted into live video.
We used a 4006, 6K gate, FPGA connected to the I860’s to take the 2D Wire Frame data and insert it into a live video feed from a Defence System. The FPGA shaded the wireframe image using the Gouraud BiLinear Interpolation shading technique. The 6K gate FPGA was over 100x faster than the I860 at this shading and the latency was 256 µs of video delay compared to the only competitor, an SGI Onyx 2, having a 10 frame, 200 ms, video pipeline delay!
This was the exact event that caused me to believe that FPGAs would take over the world of computing especially in the DSP arena. I took Nallatech full time in 1995 and the launch of Xilinx’s Virtex FPGA with its blockRAMs was a seminal moment where suddenly FPGAs could take on heavy lifting 2D Image processing DSP tasks and leave traditional DSPs in the dust! That was 1998. I then spent 5 years preaching to Xilinx that they had a massive market opportunity in the DSP Arena and it took until 2003 for Wim Roelandts, then Xilinx CEO, to acknowledge to the world that DSP was a $2B market opportunity for Xilinx.

Now here we are today trying to re-educate ourselves in DSP but calling it Data Centric Computing. What goes around comes around! 🙂

1. Steven Leibson
  
  September 9, 2021
  
  Thanks for sharing your memories, Acantle1. You were clearly ahead of the pack, on several fronts. Except for the i860. That wasn’t a very good design, in my opinion.
  
rbj

September 8, 2021

The Mot DSP56000 has some salient historical note, but has been well supplanted by the ADI SHArC. Why isn’t the SHArC or even ARM processors in this little history?

1. Steven Leibson
  
  September 9, 2021
  
  Hi rbj. Thanks for the comment. I considered including the ADI Sharc. I just couldn’t find anything significant to say about those DSPs except that they’re a successful and popular DSP line. Did they bring anything new to the table technically, in your opinion? Did they break any new ground? ADI’s own Web pages are remarkably modest on these questions, listing only specs. As far as ARM processors, they’re clearly not DSPs. Most are not enhanced with hardware MACs, no Harvard architecture. Just solid general-purpose processors that can be used for DSP, like dozens of other general-purpose processors. –Steve
  
Karl Stevens

September 8, 2021

There is a MS Research paper “Where’s the Beef?” that concluded that FPGAs can outperform traditional CPUs running at 10 times the clock frequency. Hence Project Catapult put FPGAs in all their data centers.

One conclusion was that FPGAs do not have to fetch instructions from slow memory. Embedded soft processors with the traditional load/multiply/add/store/branch kind of architecture are slow because of fmax for FPGA clocks.

DSP data does not have to be fetched, neither do the instructions. So it is up to the multiply/add parallelism.

Of course the pipeline latency of cpu’s has to be a factor because the add has to be done after the multiply ends.

1. Steven Leibson
  
  September 9, 2021
  
  You’re right of course, Karl Stevens. FPGAs do keep data close to the computing hardware and they don’t have instruction-fetch latencies because they are physically programmed (“spatially programmed” in Intel speak). However, I think the biggest difference is still that FPGAs muster three or four orders of magnitude more hardware MACs than general-purpose processors. General-purpose processors may clock 5x faster, but FPGAs overwhelm with sheer numbers of hardware MACs. Even the cheap ones have more than 100 hardware MACs these days.
  
Pat Hays

September 9, 2021

Steve’s essay made a good case that FPGAs caused the demise of the single-chip DSP from high-performance applications as the millennium turned. To round out the story, I’d like to add two other factors that caused the single-chip DSP to also disappear from applications with low- and mid-range performance.
The RISC pioneers didn’t include digital signal processing (lower case, “dsp”) tasks among their 1975-1985 benchmarks because the performance requirements of key dsp algorithms weren’t attainable in CPUs. These requirements led to the rise of the single-chip DSPs in the 1980s. By the 1990s, the DSP algorithms, LPC, APC, that Steve notes, and more, had marched out of the journal pages and into volume production. The time was ripe for a convergence between CPUs and DSPs. For software compatibility and because CPUs already had sophisticated memory management, we needed to add DSP support – SIMD arithmetic, saturation, etc., to CPUs – rather than more sophisticated memory management to DSPs. At the 1999 Microprocessor Forum, dsp extensions to the MIPS, ARM and ARC ISAs were all announced in a single session. In the same year, Intel delivered its first Streaming SIMD Extensions (SSE) to the x86 ISA.
The CPU/DSP convergence eliminated separate DSPs from systems where a CPU was already present, but what about the far larger world of embedded processing? In my opinion, the single-chip DSPs primed the pump for new Application-Specific Standard Products (ASSPs) for primary dsp tasks. As volumes took off for apps like echo cancellation, dynamic time warping, Wi-Fi, digital TV decoding, etc., it became feasible to amortize development across narrower market segments than it had been for earlier general-purpose DSPs. It was not only feasible, but it was necessary, to build ASSPs to optimize cost and power. Ironically, the general-purpose single-chip DSP disappeared as a result of its own success, but chips – both CPUs and embedded ASSPs – running digital signal processing tasks became pervasive.

1. Steven Leibson
  
  September 9, 2021
  
  Thanks for your analyses, Pat Hays. Certainly, anyone who can afford to develop an ASIC has the high ground. If you can find an ASSP that does exactly what you need, or near enough, you’re light years ahead. –Steve
  
Karl Stevens

September 9, 2021

FPGAs can also do general purpose computing very fast. if/else, for, while, expression evaluation, etc.
add, sub, mpy, and, or, xor, but not div for now because div is sequential.
It takes 3 true dual-port ram blocks, a comparator, an adder(Mac), and a few hundred LUTs.

Basically, read two operand addresses and an operator address to begin, then read two operands and an operator each clock cycle, and write the result on the last cycle. That is one clock to start, one for each operand. The result is written while the next two starting operands are being read.

Source code is compiled into an Abstract Syntax Tree then the Syntax Walker gets the nodes in correct sequence for evaluation.

I just put up an opensource on GIT. Karl15/OSAstEng (OpenSourceASTEngine)

VisualStudio19 and CSharp (C#) Compiler API can produce output showing the evaluation sequence and generates the contents for the BlockRams.

No long pipeline, out of order execution, branch prediction, instruction fetch, or cache. Just stream the memory contents and go.

1. Steven Leibson
  
  September 9, 2021
  
  Thanks for the additional comments, Karl Stevens. If you need to go really fast, FPGAs are often the best choice. However, you need to sign up for the power consumption, heat dissipation, and unit costs as well. Software programming is just so much easier. If you can live with the performance, processors are usually the better choice. Unless you absolutely, positively need more performance. –Steve
  
  1. Karl Stevens
    
    September 10, 2021
    
    Hi, Steve: I did not make it clear — I have an FPGA design that is programmable using the same keywords and expressions that software programs use. In fact the source code is compiled by the C# compiler. The compiler API is then used to generate the contents of the BlockRams on a pre-configured FPGA. From the software viewpoint, it is a CPU that does not have a typical load/store/branch/add ISA.
    Instead it has if/else, for, while, and expressions used in C# source code.
    
    Using an FPGA allows for building a running demo without the cost of a chip build.
    
    So write the program, compile, debug, load the block rams and go.
    
    Quartus/PlugInWizard easily connects the rams, a multiply, adder, compare, and a logic net takes a couple of hundred LUTs.
    
    Every clock edge does something useful rather than advance a mile long pipeline.
    
    1. johonkanen
      
      September 19, 2021
      
      VHDL in particular is very well suited for creating code that can be written pretty much like software, but compilation results in a synthesized logic circuit, and the code can be directly synthesized using standard tools like Quartus, Vivado or even ISE/Planahead or any other common tools that support vhdl93. I recently published a blog post on Newton-Raphson division in which I explain how using custom record types allows for synthesizing a division module that allows the use of commands like create_division(divisor) and request_division and get_division result. I have also written a couple of blog posts on calculating differential equations on FPGAs. I also have all of the code published on github. My github username is also johonkanen.
      