  • Jaskaranvir Singh

Tensor Processing Unit (TPU) - An AI Powered ASIC for Cloud Computing

Google created Cloud TPUs as specialized matrix processors for neural network workloads. TPUs can't run word processors, drive rocket engines, or process bank transactions, but they can quickly perform the enormous matrix operations used in neural networks.

What is a Tensor?

In mathematics, a tensor is a geometric object that maps vectors, scalars, and other tensors to a resultant tensor in a multilinear manner. In basic terms, a tensor is a generalized matrix: it can be a 0-D array (a single number), a 1-D array (a vector), a 2-D array (an ordinary matrix), a 3-D array (a cube of numbers), or a higher-dimensional structure. The rank of a tensor is its number of dimensions. What distinguishes a tensor from a simple matrix is its "dynamical" property: a tensor lives within a structure and interacts with other mathematical entities, and when those entities are transformed in a predictable way, the tensor's numerical values transform according to a corresponding rule. In computer science, a tensor is an n-dimensional array, equivalent to a NumPy array, the primary data structure used by machine-learning techniques.
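The correspondence between tensor rank and array dimension can be shown directly with NumPy (a minimal sketch; the variable names are illustrative):

```python
import numpy as np

# Rank 0: a scalar (a single number)
scalar = np.array(5.0)
# Rank 1: a vector
vector = np.array([1.0, 2.0, 3.0])
# Rank 2: an ordinary matrix
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
# Rank 3: a "cube" of numbers, e.g. a stack of matrices
cube = np.zeros((2, 3, 4))

# In NumPy, the rank (number of dimensions) is exposed as .ndim
print(scalar.ndim, vector.ndim, matrix.ndim, cube.ndim)  # 0 1 2 3
```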

TensorFlow is an open-source machine-learning platform that may be used for image classification, object detection, language modeling, and speech recognition; the tensor is its most fundamental unit of operation, and it builds on NumPy. A tensor processing unit (TPU), also known as a TensorFlow processing unit, is a special-purpose machine-learning accelerator: a processing IC created by Google to handle TensorFlow neural-network workloads. TPUs are application-specific integrated circuits (ASICs) that accelerate specific machine-learning tasks by placing processing elements, small DSPs with local memory, on a network so they can communicate and pass data between one another.


Quick prototyping, simple models, small and medium batch sizes, pre-existing code that cannot be updated, and certain math-heavy workloads are still better suited to CPUs and GPUs. In 2013, Google realized that unless it could create a chip to handle machine-learning inference, it would have to double the number of data centers it owned. Google claims that the resulting TPU delivered "15–30X higher performance and 30–80X higher performance-per-watt" than contemporary CPUs and GPUs.


Cloud Tensor Processing Unit

Google products like Assistant, Gmail, Search, and Translate are powered by Cloud TPUs, Google's custom-built machine-learning ASICs. Google's second-generation Cloud TPUs give the TensorFlow community faster processing performance, and their combination of low cost and high speed makes them well suited for teams training machine-learning models. Users can also build custom machine-learning systems for real-life problems: bring your own data, download a Google-optimized reference model, and start training. The four Cloud TPU offerings are as follows:

  • Cloud TPU v2 comprises 180 teraflops and 64 GB of high-bandwidth memory (HBM).

  • Cloud TPU v3 features 420 teraflops and 128 GB of HBM.

  • Cloud TPU v2 Pod contains 11.5 petaflops, 4 TB of HBM, and a two-dimensional toroidal mesh network.

  • Cloud TPU v3 Pod comprises a 2-D toroidal mesh network and 100 petaflops with 32 TB HBM.
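As a rough sanity check, the Pod figures above are consistent with a v2 Pod aggregating 64 TPU v2 devices and a v3 Pod aggregating 256 TPU v3 devices (the device counts are an assumption for this arithmetic, not stated above):

```python
# Back-of-the-envelope check of the Pod figures, assuming a v2 Pod
# aggregates 64 TPU v2 devices and a v3 Pod 256 TPU v3 devices.

v2_pod_tflops = 64 * 180      # 11,520 teraflops ~ 11.5 petaflops
v2_pod_hbm_gb = 64 * 64       # 4,096 GB = 4 TB of HBM

v3_pod_tflops = 256 * 420     # 107,520 teraflops > 100 petaflops
v3_pod_hbm_gb = 256 * 128     # 32,768 GB = 32 TB of HBM

print(v2_pod_tflops / 1000, v2_pod_hbm_gb)   # 11.52 4096
print(v3_pod_tflops / 1000, v3_pod_hbm_gb)   # 107.52 32768
```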

Characteristics of Cloud TPU

  1. Model Library: a collection of optimized models for Cloud TPU that provide accuracy, high performance, and quality in object detection, image classification, speech recognition, language modeling, and other areas.

  2. The user can connect to Cloud TPUs from tailored AI Platform Deep Learning VM Image machine types, which balance memory, processor speed, and high-performance storage resources.

  3. Cloud TPUs are fully integrated with the rest of Google Cloud Platform, including its data and analytics services, so users benefit from its networking, storage, and data-analytics technologies.

  4. Preemptible Cloud TPUs: preemptible instances are a more cost-effective alternative to on-demand instances and can save a lot of money.


Matrix processing, a combination of multiply and accumulate operations, is the TPU's principal job. TPUs are made up of thousands of multiply-accumulators that are linked together to form a large physical matrix.
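The multiply-accumulate structure can be sketched in plain Python; this is a scalar model of the arithmetic each accumulator performs, not actual TPU code:

```python
def matmul_mac(a, b):
    """Matrix multiply built purely from multiply-accumulate steps."""
    n, k = len(a), len(a[0])
    m = len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # one multiply-accumulate
            out[i][j] = acc
    return out

print(matmul_mac([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19.0, 22.0], [43.0, 50.0]]
```

A TPU performs these same multiply-accumulates in hardware, thousands at a time, with results flowing from one accumulator to the next.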


The TPU host streams data into an infeed queue. The TPU loads data from the infeed queue and stores it in HBM memory. To conduct matrix operations, the TPU loads the parameters from HBM memory into the MXU, then streams in the data from HBM. The result of each multiplication is passed on to the next multiply-accumulator, and the output is the sum of all multiplication results between the data and the parameters; no memory access is necessary during the matrix-multiplication procedure. When the computation is finished, the TPU loads the results into an outfeed queue, and the TPU host reads them from the outfeed queue and saves them in its memory.
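The host/device hand-off above can be mimicked with two queues; this is a toy model in which a Python thread stands in for the TPU and a squared-sum stands in for the matrix math (the names `infeed`/`outfeed` simply mirror the text):

```python
import queue
import threading

infeed = queue.Queue()
outfeed = queue.Queue()

def device_loop():
    """Toy 'TPU': pull batches from the infeed, compute, push to the outfeed."""
    while True:
        batch = infeed.get()
        if batch is None:          # sentinel: no more work
            outfeed.put(None)
            return
        outfeed.put(sum(x * x for x in batch))  # stand-in for matrix math

threading.Thread(target=device_loop, daemon=True).start()

# Host side: stream data into the infeed, then read results from the outfeed.
for batch in ([1, 2, 3], [4, 5]):
    infeed.put(batch)
infeed.put(None)

results = []
while (r := outfeed.get()) is not None:
    results.append(r)
print(results)  # [14, 41]
```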


Cloud TPUs are extremely fast at dense vector and matrix computations, but data transfer between the Cloud TPU and host memory is slow compared to the speed of calculation: the PCIe bus is substantially slower than both the Cloud TPU interconnect and the on-chip high-bandwidth memory (HBM). When a model is only partially compiled and execution travels back and forth between the host and the device, the TPU sits idle for the majority of the time, waiting for data to arrive across the PCIe bus. To address this issue, Cloud TPU's programming model is designed to run much of the training, ideally the whole training loop, on the TPU.
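A back-of-the-envelope estimate gives a feel for the imbalance; every figure below is an assumption chosen only to illustrate the point, not a measurement:

```python
# Rough estimate of per-step transfer time vs. compute time.
# All figures are illustrative assumptions, not measured values.

batch_bytes = 64 * 224 * 224 * 3 * 4   # 64 float32 images, 224x224x3 (~38.5 MB)
pcie_bytes_per_s = 16e9                # assumed ~16 GB/s effective PCIe bandwidth
tpu_flops = 180e12                     # 180 teraflops (the Cloud TPU v2 figure)
flops_per_step = 1e11                  # assumed work per step for a small model

transfer_s = batch_bytes / pcie_bytes_per_s
compute_s = flops_per_step / tpu_flops

# The transfer takes several times longer than the compute, so a loop that
# crosses the PCIe bus every step leaves the TPU idle most of the time.
print(f"transfer ~ {transfer_s * 1e3:.2f} ms, compute ~ {compute_s * 1e3:.2f} ms")
```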


TPU Programming Model

  1. All model parameters are stored in high-bandwidth memory on the device.

  2. The cost of launching computations on Cloud TPU is amortized by running many training steps in a loop.

  3. Input training data is streamed to an "infeed" queue on the Cloud TPU. During each training step, the program running on the Cloud TPU retrieves batches from these queues.

  4. The TensorFlow server running on the host machine (the CPU attached to the Cloud TPU device) fetches and pre-processes data before "infeeding" it to the Cloud TPU hardware.

  5. Data parallelism: On a Cloud TPU, each core executes an identical program stored in its own HBM in a synchronous manner. At the end of each neural network stage, a reduction operation is done across all cores.
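Points 2 and 5 above can be sketched as follows; the launch-overhead constants and per-core gradients are made-up numbers for illustration, not a real TPU API:

```python
# Point 2: amortize launch overhead by running many steps per "launch".
LAUNCH_OVERHEAD = 10.0   # assumed fixed cost of starting work on the device
STEP_COST = 1.0          # assumed cost of one training step

def run(steps_per_launch, total_steps):
    launches = total_steps // steps_per_launch
    return launches * (LAUNCH_OVERHEAD + steps_per_launch * STEP_COST)

# One launch per step vs. 100 steps per launch:
print(run(1, 1000))    # 11000.0 -> overhead dominates
print(run(100, 1000))  # 1100.0  -> overhead amortized away

# Point 5 (data parallelism): each core computes a gradient on its own
# batch, then a reduction averages them so every core applies the same update.
core_grads = [0.2, 0.4, 0.6, 0.8]            # one gradient per core (illustrative)
avg = sum(core_grads) / len(core_grads)      # the reduction step
print(avg)  # 0.5
```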

Different Versions of TPU

  • First generation TPU

  • Second generation TPU

  • Third generation TPU

  • Fourth generation TPU

  • Edge TPU

  • Pixel Neural Core

TPU version 4

Four TPU chips, 32 GiB of HBM, and 128 MiB of shared common memory make up the smallest TPU v4 setup. There are two cores in each TPU chip. There are four MXUs, a vector unit, and a scalar unit in each core.

TPU version 3

Four TPU chips and 32 GB of HBM are included in the smallest TPU v3 configuration. There are two cores in each TPU chip. There are two MXUs, a vector unit, and a scalar unit in each core.

TPU version 2

Four TPU chips and 16 GiB of HBM make up the smallest TPU v2 setup. There are two cores in each TPU chip. An MXU, a vector unit, and a scalar unit are all present in each core.

Edge TPU

Google Edge TPU is a purpose-built ASIC (tailored to execute a specific kind of application) for running AI at the edge. It provides high performance in a small physical and power footprint, allowing high-accuracy AI to be deployed at the edge. Working in tandem with Google Cloud TPU and Google Cloud services, it provides an end-to-end (cloud-to-edge, hardware-to-software) infrastructure for deploying clients' AI-based solutions. Through Coral's prototyping and production products, the Edge TPU enables high-quality ML inferencing at the edge, and the Coral platform includes a full developer toolkit for compiling and customizing Google AI models for the Edge TPU, integrating Google's AI and hardware expertise. For running AI at the edge, the Edge TPU complements CPUs, GPUs, FPGAs, and other ASIC technologies.


Pixel Neural Core

In 2019, Google launched the Pixel 4, which includes an Edge TPU called the Pixel Neural Core. Unlike a standard CPU built to handle a wide range of computational activities, machine-learning processors such as the Neural Core are tailored to a few specific, mathematically intensive operations. This makes them more like digital signal processors (DSPs) or graphics processing units (GPUs), but tailored to the particular operations of machine-learning algorithms. Rather than spending many CPU cycles, the Neural Core builds dedicated arithmetic logic units (ALUs) into hardware to process these instructions quickly and efficiently. Hundreds of these ALUs are most likely distributed across numerous cores, with shared local memory and a microprocessor in charge of scheduling work on the chip. The Neural Core appears to be a significant component behind a number of new features in the Google Pixel 4 and Pixel 4 XL.


TPU Version 4 vs. TPU Version 3

TPU v4 chips have a unified 32 GiB HBM memory space across the whole chip, enabling better coordination between the two on-chip TPU cores. HBM performance has been improved by adopting the most recent memory standards and speeds. DMA performance has also improved, including native support for high-performance striding at 512B granularity.


TPU Version 3 vs. TPU Version 2

TPU version 3 systems' enhanced FLOPS per core and memory capacity can help models perform better in the following ways:

  • Compute-bound models see significant per-core performance improvements on TPU version 3 configurations. Models that are memory-bound on TPU version 2 may not see the same improvement if they remain memory-bound on TPU version 3.

  • In cases where data does not fit into memory on TPU version 2 systems, TPU version 3 can improve performance by reducing the re-materialization of intermediate values.
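Re-materialization trades compute for memory: intermediate activations are recomputed on demand instead of being stored for the backward pass. A toy sketch, with made-up "layers":

```python
# Toy illustration of re-materialization. Names and layers are illustrative.

def forward_store_all(x, layers):
    """Keep every intermediate activation (more memory, no recompute)."""
    acts = [x]
    for f in layers:
        acts.append(f(acts[-1]))
    return acts            # memory: one activation per layer

def forward_checkpoint(x, layers):
    """Keep only the input; recompute activations on demand (less memory)."""
    def recompute(upto):
        a = x
        for f in layers[:upto]:
            a = f(a)
        return a
    return recompute

layers = [lambda a: a + 1, lambda a: a * 2, lambda a: a - 3]
stored = forward_store_all(5, layers)
recompute = forward_checkpoint(5, layers)
print(stored[2], recompute(2))  # 12 12 -- same value, memory traded for compute
```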

GPU vs. TPU vs. CPU

GPUs have far more cores than CPUs, although each core is simpler. These cores include arithmetic logic units (ALUs), control units, and memory caches, allowing GPUs to perform many calculations in parallel. Those ALUs were incorporated to allow rapid geometric calculations, so that games could run at a high frame rate. GPUs normally have access to 8 GB or 16 GB of memory, whereas CPUs can easily access more (depending on your RAM), and CPU transfers to and from RAM are substantially faster than transfers to and from a GPU.


Returning to our original comparison, CPUs were designed to handle many different jobs at once, such as operating-system activities, rather than a single difficult one. GPUs, on the other hand, were designed to do mathematical calculations as quickly as possible, because producing images consists entirely of such operations.

Although both TPUs and GPUs can do tensor operations, TPUs are better at big tensor operations, which are more common in neural-network training than in 3D graphics rendering. Google's TPU core is made up of two parts: a Matrix Multiply Unit (MXU) and a Vector Processing Unit. At the software layer, an optimizer switches between bfloat16 and float32 operations (where 16 and 32 are the numbers of bits) so that developers don't have to rewrite their code. As a result, the TPU's systolic-array architecture has a large density and power advantage, as well as a non-negligible speed advantage, over a GPU.
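bfloat16 keeps float32's 8-bit exponent but truncates the mantissa to 7 bits, so it covers the same numeric range with less precision. A minimal simulation using bit truncation (hardware conversions typically round rather than truncate):

```python
import struct

def to_bfloat16(x: float) -> float:
    """Simulate bfloat16 by zeroing the low 16 bits of a float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(to_bfloat16(3.14159))  # 3.140625 -- same magnitude, coarser precision
print(to_bfloat16(1e30))     # huge values still representable (8-bit exponent)
```

The preserved exponent is why neural-network training usually tolerates bfloat16 well: values keep their dynamic range, and only the least significant mantissa bits are lost.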

CPU                 | GPU                                  | TPU
--------------------|--------------------------------------|---------------------------------
Multipurpose        | Specialized for parallel computation | Specialized in matrix processing
Low latency         | High latency                         | High latency
Low throughput      | Very high throughput                 | Very high throughput
Low compute density | High compute density                 | High compute density

Conclusion

Google is seeking to establish a dominant position in the business through the Google Cloud TPU. The company's major goal is to provide outstanding service at a low cost, so it is looking to extend its services, which range from optimized search to Android capabilities to driverless vehicles. Research is still ongoing for this purpose, with the long-term impact and competitive advantage of Cloud TPU in mind. Each TPU generation builds on further study, so future generations are expected to be more capable.



