IBM Re-Architects The Mainframe With New Telum Processor – Forbes

The new IBM Z Telum processor can scale up to 32 chips and 256 CPU cores

Similar to what the company did with the new Power10 processors for cloud systems, IBM also started from scratch in designing a new processor for the company's IBM Z mainframe. The IBM Z has a long history and is unique in that it still uses processors specially designed for enterprise security, reliability, scalability, and performance. The new Telum processor for the next-generation IBM Z enhances all of these aspects and adds embedded acceleration, something most systems accomplish through discrete accelerators. IBM introduced the new Telum processor at the annual Hot Chips technology conference this morning.

IBM Z Telum processor die photo

A key to the design of the Telum processor was to put everything on one die for performance and efficiency. The Telum processor features 8 CPU cores, on-chip workload accelerators, and 32MB of what IBM calls semi-private cache. Each chip module will feature two closely coupled Telum die for a total of 16 cores per socket. Just to indicate how different this architecture is, the prior z15 processor featured twelve cores and a separate chip for a shared cache. The Telum processor will also be manufactured on the Samsung 7nm process as opposed to the 14nm process used for the z15 processor.

Besides the processing cores themselves, the most significant change is in the cache structure. Each CPU core has a dedicated L1 cache and 32MB of semi-private, low-latency L2 cache. The cache is semi-private because the L2 caches are used together to build a shared virtual 256MB L3 cache between the cores on the chip. The L2 caches are connected through a bi-directional ring bus capable of over 320 GB/s of bandwidth with an average latency of just 12ns. The L2 caches are also used to build a virtual shared L4 cache between all the chips in a drawer. There are up to four sockets per drawer and two processors per socket, for a total of up to eight chips and 64 CPU cores with 2GB of shared L4 cache per drawer. That can then be scaled up to four drawers in a rack for up to 32 chips and 256 CPU cores.
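
To make that scaling concrete, here is a short back-of-the-envelope sketch of how those figures combine. The constants are the ones quoted above; the variable names and the calculation itself are purely illustrative, not anything from IBM.

```python
# Illustrative arithmetic for the Telum cache/core topology described above.
CORES_PER_CHIP = 8
L2_PER_CORE_MB = 32           # semi-private L2 per core
CHIPS_PER_SOCKET = 2          # dual-die chip module
SOCKETS_PER_DRAWER = 4
DRAWERS_PER_RACK = 4

# The per-core L2 caches combine into a virtual shared L3 per chip,
# and the per-chip L3s combine into a virtual shared L4 per drawer.
virtual_l3_per_chip_mb = CORES_PER_CHIP * L2_PER_CORE_MB                      # 256 MB
chips_per_drawer = CHIPS_PER_SOCKET * SOCKETS_PER_DRAWER                      # 8 chips
cores_per_drawer = chips_per_drawer * CORES_PER_CHIP                          # 64 cores
virtual_l4_per_drawer_gb = chips_per_drawer * virtual_l3_per_chip_mb / 1024   # 2 GB

chips_per_rack = chips_per_drawer * DRAWERS_PER_RACK                          # 32 chips
cores_per_rack = chips_per_rack * CORES_PER_CHIP                              # 256 cores

print(f"Virtual L3 per chip:    {virtual_l3_per_chip_mb} MB")
print(f"Cores per drawer:       {cores_per_drawer}")
print(f"Virtual L4 per drawer:  {virtual_l4_per_drawer_gb:.0f} GB")
print(f"Chips / cores per rack: {chips_per_rack} / {cores_per_rack}")
```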

The IBM Z Telum processor dual-die chip module and four-chip drawer configuration

The cache architecture was matched with improvements in the CPU cores and accelerators. The Telum CPU cores are an out-of-order design with SMT2 (two-way simultaneous multithreading) that can operate at or above a 5GHz base frequency. The CPU cores also feature, among other things, enhanced branch prediction for large-footprint and diverse enterprise workloads. The Telum processor additionally offers encrypted memory and improvements to the trusted execution environment for enhanced security, along with dedicated on-chip accelerators for sort, compression, crypto, and artificial intelligence (AI) that scale with the workload.

One of the key dynamics of the electronics industry today is accelerated computing. Everything from smartphones to cloud servers is using custom or programmable processing blocks to perform tasks more efficiently than general-purpose CPUs. This is occurring for two reasons. The first is that as certain tasks mature, it becomes more efficient to perform them through dedicated hardware than through software. Even though some of these tasks may still be performed using a programmable processing engine, there are many programmable engines, such as DSPs, GPUs, NPUs, and FPGAs, that may be able to perform certain tasks more efficiently than CPUs because of the nature of the workload and/or the design of the processing cores.

The second reason for the rise in accelerators is the slowing of Moore's Law. As it becomes difficult to improve CPU performance and efficiency through semiconductor manufacturing technology alone, the industry is shifting more towards heterogeneous architectural improvements. By designing more efficient processing cores, whether they are dedicated to a specific function or optimized around a specific type of workload or execution, significantly improved performance and efficiency can be achieved in the same or a similar amount of space. As a result, the direction going forward is accelerated computing. Even innovative technologies like quantum and neuromorphic computing, two areas where IBM Research is leading the industry, are really forms of accelerated computing that will enhance traditional computing platforms.

AI is one of the most common workloads being accelerated, and there is a wide variety of processors and accelerators under development for both AI training and inference processing. The benefits of each will depend on how efficiently the accelerator processes particular workloads. For servers, most AI accelerators are discrete chips. While this does offer more silicon area for higher peak performance, it also increases cost, power consumption, latency, and variability in performing AI tasks. IBM's approach of putting the AI accelerator on the chip, interfacing it directly with the CPU cores, and sharing memory with them will allow secure, real-time or near-real-time processing of AI models while increasing overall system efficiency. And because the processor is aimed at enterprise-class workloads, as opposed to large research workloads like scientific or financial modeling, the demands are likely to be spread across multiple AI models with low-latency requirements. The AI accelerators were designed for business workloads like fraud detection, as well as system and infrastructure management tasks like workload placement, database query planning, and anomaly detection.
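
As a purely hypothetical illustration of why co-locating inference with the transaction path matters, the sketch below scores a payment inside a fixed latency budget and falls back to a simple rule when the model call cannot finish in time. None of the names, thresholds, or budgets come from IBM; the model is just a stand-in callable.

```python
import time

# Hypothetical per-transaction latency budget for in-line fraud scoring (ms).
LATENCY_BUDGET_MS = 1.0

def rules_only_check(txn):
    # Simplistic non-ML fallback: flag unusually large transfers.
    return 1.0 if txn.get("amount", 0) > 10_000 else 0.0

def score_transaction(txn, model):
    """Score a transaction in-line; degrade to a rules check if the model
    call blows the latency budget (illustrative only, not an IBM API)."""
    start = time.perf_counter()
    score = model(txn)
    elapsed_ms = (time.perf_counter() - start) * 1_000
    return score if elapsed_ms <= LATENCY_BUDGET_MS else rules_only_check(txn)

# Example usage with a trivial stand-in model.
if __name__ == "__main__":
    dummy_model = lambda txn: 0.2
    print(score_transaction({"amount": 250}, dummy_model))
```

The point of the sketch is simply that when the accelerator sits on the same chip and shares the caches with the cores, the model call can live inside the transaction's response-time budget rather than in a separate round trip to a discrete device.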

The AI accelerator features a matrix array with 128 processing tiles designed for 8-way FP16 SIMD operations and an activation array with 32 tiles designed for 8-way FP16/FP32 SIMD operations. The reason for the two arrays is to divide the work between the more straightforward matrix multiplication and convolution functions and more complex functions like sigmoid or softmax, while optimizing the execution of each. The two arrays are connected through an Intelligent Data Mover and Formatter capable of 600 GB/s of bandwidth internally, and they have programmable prefetchers and write-back engines connected to the on-chip caches with more than 120 GB/s of bandwidth. According to IBM, the AI accelerator multiplexes AI workloads from the various CPU cores, delivers an aggregate performance of over 6 TFLOPS per chip, and is anticipated to exceed 200 TFLOPS for a fully populated rack. The AI accelerator also uses the AI tools designed to work with other IBM platforms, from the IBM Deep Learning Compiler, for porting and optimizing trained models to the platform, to the Snap ML model libraries.
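
A quick back-of-the-envelope check shows how those throughput figures relate. The tile counts and the per-chip and per-rack numbers come from the paragraph above; the accelerator clock is an assumption introduced only to make the arithmetic visible, since it is not stated here.

```python
# Rough sanity check of the quoted accelerator throughput figures.
MATRIX_TILES = 128
SIMD_WIDTH = 8                  # 8-way FP16 per tile
OPS_PER_LANE = 2                # multiply + accumulate per lane per cycle
ASSUMED_CLOCK_GHZ = 3.0         # hypothetical clock; not stated in the article

flops_per_cycle = MATRIX_TILES * SIMD_WIDTH * OPS_PER_LANE           # 2048
peak_tflops_per_chip = flops_per_cycle * ASSUMED_CLOCK_GHZ / 1000    # ~6.1 TFLOPS

CHIPS_PER_RACK = 32             # 4 drawers x 4 sockets x 2 dies
rack_tflops = peak_tflops_per_chip * CHIPS_PER_RACK                  # ~197 TFLOPS

print(f"Per-chip peak (assumed {ASSUMED_CLOCK_GHZ} GHz): {peak_tflops_per_chip:.1f} TFLOPS")
print(f"Rack-level estimate: {rack_tflops:.0f} TFLOPS")
```

At the assumed clock, the matrix array alone lands right around IBM's "over 6 TFLOPS" per-chip figure, and scaling that across 32 chips approaches the stated rack-level number of over 200 TFLOPS.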

IBM Z Telum low-latency AI performance scales with the number of chips

According to IBM, the new cache structure has resulted in an estimated 40% increase in per-socket performance. This is impressive for a platform that has evolved into a scalable mainframe optimized across the stack, all the way down to the processor. Ironically, just as Moore's Law once drove the industry away from customized processors and systems, it is now driving the industry back to customization in the era of accelerated computing. While IBM's revenue is driven by software and services, having expertise that spans everything from semiconductor manufacturing to custom chips and systems gives IBM a competitive advantage in this new accelerated world focused on workload optimization.
