IBM’s new chip is designed to do both high-precision learning and low-precision inference across the three main flavors of deep learning
The field of deep learning is still in flux, but some things have started to settle out. In particular, experts recognize that neural nets can get a lot of computation done with little energy if a chip approximates an answer using low-precision math. That’s especially useful in mobile and other power-constrained devices. But some tasks, especially training a neural net to do something, still need precision. IBM recently revealed its newest solution, still a prototype, at the IEEE VLSI Symposia: a chip that does both equally well.
The disconnect between the needs of training a neural net and having that net execute its function, called inference, has been one of the big challenges for those designing chips that accelerate AI functions. IBM’s new AI accelerator chip is capable of what the company calls scaled precision. That is, it can do both training and inference at 32, 16, or even 1 or 2 bits.
“The most advanced precision that you can do for training is 16 bits, and the most advanced you can do for inference is 2 bits,” explains Kailash Gopalakrishnan, the distinguished member of the technical staff at IBM’s Yorktown Heights research center who led the effort. “This chip potentially covers the best of training known today and the best of inference known today.”
The chip’s ability to do all of this stems from two innovations that are both aimed at the same outcome—keeping all the processor components fed with data and working.
“One of the challenges that you have with traditional [chip] architectures when it comes to deep learning is that the utilization is typically very low,” says Gopalakrishnan. That is, even though a chip might be capable of a very high peak performance, typically only 20 to 30 percent of its resources can really be brought to bear on a problem. IBM aimed for 90 percent, for all tasks, all the time.
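To put rough numbers on that gap, here is a minimal Python sketch. The peak figure of 10 trillion operations per second is made up for illustration and is not a specification of IBM’s chip or any other; only the utilization percentages come from the figures quoted above.

```python
def effective_throughput(peak_tops: float, utilization: float) -> float:
    """Return the throughput a chip actually delivers, in trillions of ops per second.

    peak_tops   -- advertised peak throughput (hypothetical value in this sketch)
    utilization -- fraction of the chip's resources doing useful work, from 0.0 to 1.0
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return peak_tops * utilization


# Hypothetical peak, chosen only to make the arithmetic easy to follow.
PEAK_TOPS = 10.0

for label, utilization in [
    ("typical, low end", 0.20),
    ("typical, high end", 0.30),
    ("IBM's stated target", 0.90),
]:
    delivered = effective_throughput(PEAK_TOPS, utilization)
    print(f"{label}: {delivered:.1f} of {PEAK_TOPS:.1f} trillion ops per second")
```

Under these assumptions, raising utilization from 30 percent to 90 percent triples the delivered throughput without changing the peak at all.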
Low utilization is usually due to bottlenecks in the flow of data around the chip. To break through these information infarctions, Gopalakrishnan’s team came up with a “customized” data flow system. The data flow system is a network scheme that speeds the movement of data from one processing engine to the next. It is customized according to whether it’s handling learning or inference and for the different scales of precision.
The second innovation was the use of a specially designed “scratch pad” form of on-chip memory instead of the traditional cache memory found on a CPU or GPU. Caches are built to obey certain rules that make sense for general computing but cause delays in deep learning. For example, there are certain situations where a cache would push a chunk of data out to the computer’s main memory (evict it), but if that data’s needed as part of the neural network’s inferencing or learning process, the system will then have to wait until it can be retrieved from main memory.
A scratch pad doesn’t follow the same rules. Instead, it’s built to keep data flowing through the chip’s processing engines, making sure the data is at the right spot at just the right time. To get to 90 percent utilization, IBM had to design the scratch pad with a huge read/write bandwidth, 192 gigabytes per second.
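The difference can be illustrated with a toy simulation. This is a sketch under loud assumptions, not a model of IBM’s hardware: the cache uses a least-recently-used eviction rule (one common policy, chosen here for illustration), the scratch pad is assumed to know the access schedule in advance, and each prefetch is assumed to finish before the data is needed. The function names and the access pattern are hypothetical.

```python
from collections import OrderedDict


def lru_misses(accesses, capacity):
    """Count misses for a cache that evicts the least recently used block."""
    cache = OrderedDict()
    misses = 0
    for block in accesses:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1                    # must wait on main memory
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return misses


def scratchpad_stalls(accesses, capacity):
    """Count stalls for a software-managed scratch pad with one-step prefetch.

    Assumption: the schedule is known ahead of time, so the block needed at
    step i+1 is loaded while step i computes, and that load always finishes
    in time.
    """
    if capacity < 2:
        raise ValueError("need room for the block in use plus one prefetched block")
    resident = []
    stalls = 0
    for i, block in enumerate(accesses):
        if block not in resident:
            stalls += 1                    # nothing was staged, so wait
            if len(resident) >= capacity:
                resident.pop(0)
            resident.append(block)
        if i + 1 < len(accesses):
            upcoming = accesses[i + 1]
            if upcoming not in resident:
                if len(resident) >= capacity:
                    # Free a slot, but never the block currently in use.
                    victim = next(b for b in resident if b != block)
                    resident.remove(victim)
                resident.append(upcoming)
    return stalls


# Hypothetical workload: loop four times over five blocks with room for only four.
ACCESSES = [0, 1, 2, 3, 4] * 4
CAPACITY = 4

print("LRU cache misses:  ", lru_misses(ACCESSES, CAPACITY))
print("Scratch pad stalls:", scratchpad_stalls(ACCESSES, CAPACITY))
```

In this contrived pattern the cache keeps evicting exactly the block it will need next, while the scratch pad, told the schedule in advance, waits only on the very first load. Real workloads and real hardware are far less tidy, but the sketch shows why a general-purpose eviction rule can work against a predictable deep learning access pattern.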
The resulting chip can perform all three of today’s main flavors of deep learning AI: convolutional neural networks (CNNs), multilayer perceptrons (MLPs), and long short-term memory (LSTM). Together these techniques dominate speech, vision, and natural language processing, explains Gopalakrishnan. At 16-bit precision, typical for training, IBM’s new chip cranks through 1.5 trillion floating point operations per second; at 2-bit precision, best for inference, that leaps to 12 trillion operations per second.
Gopalakrishnan points out that because the chip is made using an advanced silicon CMOS manufacturing process (GlobalFoundries’ 14-nanometer process), all those operations per second are packed into a pretty small area. For inferencing a CNN, the chip can perform an average of 1.33 trillion operations per second per square millimeter. That figure is important “because in a lot of applications you are cost constrained by size,” he says.
The new architecture also proves something IBM researchers have been exploring for a few years: inference at really low precision doesn’t work well if the neural nets are trained at much higher precision. “As you go below eight bits, training and inference start to directly impact each other,” says Gopalakrishnan. A neural net trained at 16 bits but deployed as a 1-bit system will result in unacceptably large errors, he says. So the best results come from training a network at a precision similar to the one at which it will ultimately run.
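A small Python sketch gives a feel for how much information is lost when values computed at high precision are rounded to very few bits. It is an illustration only, not IBM’s method or experiment: it applies simple uniform rounding to randomly generated stand-in "weights" and measures the rounding error of the numbers themselves, not the accuracy of any neural net. The function names and the Gaussian weights are assumptions made for the example.

```python
import math
import random


def quantize(values, bits):
    """Round each value to one of 2**bits evenly spaced levels.

    The levels span the range from -max_abs to +max_abs, where max_abs is
    the largest magnitude in the input (uniform symmetric quantization).
    """
    levels = 2 ** bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    step = 2 * max_abs / (levels - 1)
    return [round((v + max_abs) / step) * step - max_abs for v in values]


def rms_error(original, approx):
    """Root-mean-square difference between two equal-length sequences."""
    total = sum((a - b) ** 2 for a, b in zip(original, approx))
    return math.sqrt(total / len(original))


# Stand-in weights: random Gaussian values, not taken from a real network.
random.seed(0)
WEIGHTS = [random.gauss(0.0, 1.0) for _ in range(10_000)]

ERRORS = {bits: rms_error(WEIGHTS, quantize(WEIGHTS, bits)) for bits in (16, 8, 2, 1)}

for bits, err in ERRORS.items():
    print(f"{bits:>2}-bit levels: RMS rounding error {err:.6f}")
```

The error grows sharply as the bit count drops, which is consistent with the point that a network tuned around finely resolved values cannot simply be rounded down to 1 or 2 bits after the fact.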
No word on when this technology might be commercialized in Watson or another form, but Gopalakrishnan’s boss, Mukesh Khare, IBM’s vice president of semiconductor research, says to expect it to evolve and improve. “This is the tip of the iceberg,” he says. “We have many more innovations in the pipeline.”
Editor’s note: This story was updated on 2 July 2018.