In our previous post we talked about the TPU (Tensor Processing Unit), Google’s ASIC (application-specific integrated circuit) processor designed specifically for neural networks. Today I will discuss how a TPU works. To follow along, you need a little intuition about how neural networks, CPUs, and GPUs work. So, let’s start with neural networks 😃
If you haven’t read our post on the TPU yet, please read it before this one.
Read more: What is TPU and is it better than CPU and GPU?
How Neural Network works
A neural network is a computational model loosely inspired by the neurons of the human brain. It is organized into layers, and each layer has its own task. The number of layers in a neural network depends on the complexity of the problem it solves.
As you can see in the above animation, the simplest form of a neural network has three layers:
- Input layer: takes all the inputs and passes them to the next layer
- Hidden layer: where all the calculation and processing happens
- Output layer: returns the result
Of the three, the hidden layer can itself consist of many layers, depending on the problem. Each layer is made of nodes, or neurons. Now that you know about the layers of a neural network, let’s see step by step how they solve a problem.
- First, the network takes the input data at the input layer and passes it to the hidden layer
- Then these numerical inputs are multiplied by the weights (a matrix multiplication)
- Then the bias is added to the result
- Every input is processed this way; the results are summed and finally passed to the activation function
- During training, this process is repeated, with the weights and biases adjusted each time, until we see the desired accuracy
combination = bias + weights * inputs
In short, neural networks require a massive number of multiplications and additions between data and parameters. We usually organize these multiplications and additions into a matrix multiplication, so the problem becomes how to execute large matrix multiplications as fast as possible while consuming as little power as possible.
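To make the formula above concrete, here is a minimal sketch of one layer in plain Python. The function name `dense_layer`, the sample numbers, and the choice of ReLU as the activation function are assumptions made for illustration; they are not taken from any particular library.

```python
def dense_layer(inputs, weights, biases):
    """Compute activation(bias + weights * inputs) for every neuron in a layer.

    inputs  : list of n input values
    weights : list of m rows, each holding the n weights of one neuron
    biases  : list of m bias values, one per neuron
    """
    outputs = []
    for row, bias in zip(weights, biases):
        # combination = bias + weights * inputs (a dot product plus the bias)
        combination = bias + sum(w * x for w, x in zip(row, inputs))
        # ReLU is used here only as an example activation function
        outputs.append(max(0.0, combination))
    return outputs


inputs = [1.0, 2.0, 3.0]
weights = [
    [0.5, -1.0, 0.25],    # neuron 0
    [1.0, 1.0, 1.0],      # neuron 1
    [-1.0, -1.0, -1.0],   # neuron 2
]
biases = [1.0, -2.0, 0.0]

print(dense_layer(inputs, weights, biases))  # [0.25, 4.0, 0.0]
```

Even this tiny layer needs 3 x 3 = 9 multiplications. Taken together, the rows of `weights` form a matrix, so the whole loop is one matrix-vector multiplication, which is exactly the operation the rest of this post is about speeding up.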
How CPU works
A CPU is a general-purpose processor based on the von Neumann architecture. It is good at logical, arithmetic, and I/O-related tasks, and it executes a single task at a time very efficiently. The greatest benefit of the CPU is its flexibility: thanks to the von Neumann architecture, you can load any kind of software for millions of different applications.
Because of this flexibility, the hardware doesn’t know what the next calculation will be until it reads the next instruction from the software. For each calculation, the CPU has to store the result in memory inside the CPU (registers or L1 cache). This memory access is the downside of the CPU architecture, known as the von Neumann bottleneck. Even though the huge scale of neural network calculations makes the upcoming steps entirely predictable, each of the CPU’s arithmetic logic units (ALUs) executes them one by one, accessing memory every time, which reduces performance and consumes more power.
How GPU works
A GPU, on the other hand, is optimized for graphics rendering and works in parallel, meaning it can handle many operations at the same time. A GPU has thousands of ALUs, which makes it much faster for this kind of work. GPUs are designed for rendering 2D and 3D images and video, which are themselves matrices, and neural networks mostly deal with matrix multiplication, so GPUs are very good at neural network tasks as well. Because a GPU is much faster than a CPU at matrix multiplication, it is widely used for deep learning.
But even with thousands more ALUs than a CPU, a GPU still works the same way. For every single calculation in those thousands of ALUs, the GPU needs to access registers or shared memory to read and store intermediate results. This leads us back to the fundamental problem of the von Neumann bottleneck. Because the GPU performs more parallel calculations on its thousands of ALUs, it also spends proportionally more energy accessing memory, and the complex wiring this requires increases its physical footprint.
How TPU works
A TPU is an ASIC (application-specific integrated circuit) processor designed specifically for deep learning and machine learning. It is not a general-purpose processor; it is a matrix processor specialized for neural network workloads. That means a TPU can’t handle the general tasks a CPU can: it can’t run a word processor or execute bank transactions, for example. But it can handle the massive multiplications and additions of neural networks at blazingly fast speeds, while consuming less power and fitting inside a smaller physical footprint. Because its primary task is matrix processing, the TPU’s major advantage is a large reduction of the von Neumann bottleneck.
Inside the TPU, thousands of multipliers and adders are connected directly to each other to form a large physical matrix of those operators, known as a systolic array architecture. Cloud TPU v2 has two 128×128 systolic arrays, for a total of 32,768 ALUs for 16-bit floating-point values in a single processor.
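The 32,768 figure follows directly from the array dimensions. The constant names in this quick check are made up for illustration.

```python
ARRAYS_PER_PROCESSOR = 2   # systolic arrays in one Cloud TPU v2 processor
ARRAY_ROWS = 128
ARRAY_COLS = 128

alus_per_array = ARRAY_ROWS * ARRAY_COLS
total_alus = ARRAYS_PER_PROCESSOR * alus_per_array

print(alus_per_array)  # 16384
print(total_alus)      # 32768
```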
How a systolic array executes the NN calculations
- First, the TPU loads the parameters from memory into the matrix of multipliers and adders.
- Then, the TPU loads the data from memory.
- As each multiplication is executed, the result is passed to the next multiplier and added to the running sum at the same time. The output is therefore the sum of all the multiplication results between data and parameters. During this whole process of massive calculation and data passing, no memory access is required at all.
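The three steps above can be sketched as a small cycle-by-cycle simulation. This is a toy model under stated assumptions, not a description of Google's actual hardware: it assumes each cell keeps one parameter for the whole run, that data values move one cell to the right per cycle, that partial sums move one cell down per cycle, and that row k of the data enters k cycles late so that matching values meet in the right cell. The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(x, w):
    """Multiply x (B x K) by w (K x N) on a simulated K x N systolic array."""
    B, K, N = len(x), len(w), len(w[0])
    # Step 1: cell (k, n) holds the parameter w[k][n] for the whole run.
    # a_reg and p_reg model the wires between neighbouring cells.
    a_reg = [[0.0] * N for _ in range(K)]   # data value handed to the right
    p_reg = [[0.0] * N for _ in range(K)]   # partial sum handed downwards
    y = [[0.0] * N for _ in range(B)]

    for t in range(B + K + N - 2):          # one iteration = one clock cycle
        new_a = [[0.0] * N for _ in range(K)]
        new_p = [[0.0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    # Step 2: data enters at the left edge, row k delayed by k cycles
                    b = t - k
                    a_in = x[b][k] if 0 <= b < B else 0.0
                else:
                    a_in = a_reg[k][n - 1]  # value from the left neighbour
                p_in = p_reg[k - 1][n] if k > 0 else 0.0  # sum from the cell above
                # Step 3: multiply, add to the running sum, pass both values on
                new_a[k][n] = a_in
                new_p[k][n] = p_in + a_in * w[k][n]
        a_reg, p_reg = new_a, new_p
        # Finished sums leave the bottom row of the array
        for n in range(N):
            b = t - (K - 1) - n
            if 0 <= b < B:
                y[b][n] = p_reg[K - 1][n]
    return y


x = [[1.0, 2.0], [3.0, 4.0]]   # data
w = [[5.0, 6.0], [7.0, 8.0]]   # parameters
print(systolic_matmul(x, w))   # [[19.0, 22.0], [43.0, 50.0]]
```

Inside the cycle loop, a cell only ever reads values handed over by its left and upper neighbours. The input `x` is touched once per value at the left edge and the result `y` once per value at the bottom edge, which is the point of the design.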
Further performance optimization in TPU
Quantization: Sometimes you don’t need too much detail about something. If you are tasting a cup of tea and it is too sweet or not sweet enough, you won’t ask your partner how many grains of sugar went into it; you will just ask how many teaspoons, right? Similarly, neural network predictions generally don’t require the precision of 32-bit or even 16-bit floating-point calculations. With that in mind, Google uses a technique called quantization on the TPU, which approximates 32-bit or 16-bit floating-point values with 8-bit integers. Thanks to this technique, the cost of neural network predictions, along with memory usage, energy consumption, and hardware footprint, is significantly reduced without significant loss of accuracy.
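As a rough illustration of the idea, here is a sketch of one common scheme, symmetric linear quantization to signed 8-bit integers. The post does not say which exact scheme the TPU toolchain uses, so treat the functions `quantize` and `dequantize` and the sample weights as assumptions made for this example.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers that share one scale factor."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return [v * scale for v in q]


weights = [0.82, -1.27, 0.05, 0.0, 0.4]
q, scale = quantize(weights)
restored = dequantize(q, scale)

print(q)  # [82, -127, 5, 0, 40]
print(max(abs(a - b) for a, b in zip(weights, restored)))  # worst-case rounding error
```

Each weight now fits in one byte instead of four, and the error introduced by rounding is at most half of one quantization step (`scale / 2`).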
Instruction set: Unlike processors that follow the RISC (reduced instruction set computer) design style, whose focus is to define simple instructions (e.g. load, store, add, and multiply), the TPU uses a CISC (complex instruction set computer) style as the basis of its instruction set. This style focuses on implementing high-level instructions that each run a more complex task (such as calculating multiply-and-add many times). The TPU is designed to be flexible enough to accelerate the computation needed by many different kinds of neural network models, rather than just one type.
The first-generation TPU includes the following computational resources:
- Matrix Multiplier Unit (MXU): 65,536 8-bit multiply-and-add units for matrix operations
- Unified Buffer (UB): 24MB of SRAM that works as registers
- Activation Unit (AU): Hardwired activation functions
These are controlled by a dozen high-level instructions specifically designed for neural network inference.
Parallel processing: Typical RISC processors provide instructions for simple calculations such as multiplying or adding numbers. These are so-called scalar processors, as they process a single operation (= scalar operation) with each instruction. The TPU instead uses the MXU as a matrix processor that processes hundreds of thousands of operations (= matrix operation) in a single clock cycle. You can think of it as printing a whole document at once, whereas a GPU prints a sentence at a time and a CPU prints a word at a time.
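To get a feel for the numbers, the sketch below counts how many scalar instructions one matrix operation would replace. It assumes the 65,536 multiply-and-add units are arranged as a square grid, and it uses a simplified cost model of one multiply and one add instruction per unit; both are assumptions for illustration, and the function name is made up.

```python
import math

MAC_UNITS = 65_536                  # multiply-and-add units in the MXU
side = math.isqrt(MAC_UNITS)        # side length, assuming a square grid


def scalar_instruction_count(rows, cols):
    """Scalar processor: one multiply and one add per matrix element."""
    return 2 * rows * cols


print(side)                                  # 256
print(scalar_instruction_count(side, side))  # 131072
```

Under these assumptions, a scalar processor would need 131,072 separate multiply and add instructions for the work that the grid of units performs together as one matrix operation.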
Systolic array: In traditional architectures (such as CPUs and GPUs), values are stored in registers, and the program tells the ALUs (arithmetic logic units) which registers to read, which operation to perform (such as addition, multiplication, or logical AND), and which register to put the result in. A program consists of a sequence of these read/operate/write operations, and supporting all of that flexibility increases the cost and area of the chip. In the MXU, matrix multiplication reuses each input many times to produce the output. An input value is read once but used for many different operations without being stored back in a register. The ALUs perform only multiplications and additions in fixed patterns, which simplifies their design.
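The effect of this reuse can be estimated with a simple counting model. It is a back-of-the-envelope sketch, not a measurement: it assumes a register-based design fetches the input value again for every multiply-add, while a systolic design reads each input value once and passes it from cell to cell. The function name `input_reads` and the 256 x 256 size are assumptions for the example.

```python
def input_reads(num_inputs, num_outputs):
    """Count input-value reads for one vector-times-matrix product."""
    multiply_adds = num_inputs * num_outputs
    register_style = multiply_adds   # input fetched again for every multiply-add
    systolic_style = num_inputs      # each input enters the array exactly once
    reuse_factor = register_style // systolic_style
    return register_style, systolic_style, reuse_factor


print(input_reads(256, 256))  # (65536, 256, 256)
```

In this model every input value is reused 256 times after a single read, which is where the savings in memory traffic come from.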