Optimizing Resource Utilization and Maximizing ROI with Composable Infrastructure – insideHPC

Sponsored Post

Today's IT organizations must maximize resource utilization to deliver computing capabilities when and where they are needed. As a result, many organizations build multi-purpose clusters, which can degrade performance.

Even worse from an ROI perspective, in many instances, once resources are no longer required for a particular project, they cannot be redeployed to another workload with precision and efficiency. Composable disaggregated infrastructure (CDI) can hold the key to solving this optimization problem, while also providing bare metal performance.

What is CDI?

At its core, CDI is the concept of using a set of disaggregated resources connected by an NVMe over Fabrics (NVMe-oF) solution so that you can dynamically provision hardware, regardless of scale. This infrastructure design provides the flexibility of the cloud and the value of virtualization, but with the performance of bare metal. Because it decouples applications and workloads from the underlying hardware, CDI lets you run diverse workloads on one cluster while still optimizing for each workload, and it can even support multi-tenant environments.
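The compose-and-release idea described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ResourcePool`, `compose`, `release`, and the device counts are hypothetical names and numbers invented for illustration, and the code does not represent the API of Liqid Command Center, GigaIO FabreX, or any other CDI product.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ResourcePool:
    """Hypothetical pool of free disaggregated devices, keyed by device type."""

    free: dict[str, int]

    def compose(self, request: dict[str, int]) -> dict[str, int]:
        """Reserve devices for one node; raise if the pool cannot cover the request."""
        short = {kind: n for kind, n in request.items() if self.free.get(kind, 0) < n}
        if short:
            raise ValueError(f"pool cannot satisfy request: {short}")
        for kind, count in request.items():
            self.free[kind] -= count
        return dict(request)

    def release(self, node: dict[str, int]) -> None:
        """Return a node's devices to the pool so another workload can use them."""
        for kind, count in node.items():
            self.free[kind] = self.free.get(kind, 0) + count


# Illustrative device counts only.
pool = ResourcePool(free={"gpu": 8, "nvme": 16, "nic": 4})

# Compose a GPU-heavy node for a training job.
training_node = pool.compose({"gpu": 6, "nvme": 4, "nic": 1})
print(pool.free)  # {'gpu': 2, 'nvme': 12, 'nic': 3}

# When the job ends, the devices go back to the pool and can be
# recomposed into a differently shaped node for the next workload.
pool.release(training_node)
inference_node = pool.compose({"gpu": 2, "nvme": 2, "nic": 1})
print(pool.free)  # {'gpu': 6, 'nvme': 14, 'nic': 3}
```

The point of the sketch is the lifecycle rather than the data structure: hardware is assigned per workload and returned afterward, instead of staying bound to a fixed server configuration.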

Software platforms often used in CDI-based clusters include Liqid Command Center and GigaIO FabreX. Liqid Command Center is a management software platform that dynamically composes physical servers on demand from pools of bare-metal resources. GigaIO FabreX is an enterprise-class, open-standard solution that enables complete disaggregation and composition of all resources in the rack.

What are the technical and business benefits of clusters that include CDI?

The disaggregated resources in CDI allow you to dynamically provision clusters using best-fit hardware without the reduction in performance that you would get in a cloud-based environment. For HPC and AI, the value of CDI comes from the flexibility to match the underlying hardware to different workloads and environments. This improves cost-effectiveness and scalability compared to cloud services, which in turn improves ROI.

For AI and HPC workloads, performance is still the top priority, and on-premises hardware provides better performance while retaining the ability to burst to the cloud on an as-needed basis. A well-designed cluster built with commercial off-the-shelf (COTS) hardware elements and connected with PCIe, Ethernet, and InfiniBand can increase the utilization, flexibility, and effective use of valuable data center assets. Organizations that implement CDI realize a 2x to 4x increase in data center resource utilization, on average.

Beyond optimizing resource allocation, CDI also provides several additional benefits for your dynamically configured system:

What are ideal use cases for CDI?

A wide variety of technology areas can benefit from CDI. These include:

For deep learning, it is best to keep clusters on-premises because on-premises computing can be more cost-effective than cloud-based computing when highly utilized. It's also advisable to keep primary storage close to on-premises compute resources to maximize network bandwidth while limiting latency.
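The "when highly utilized" condition can be made concrete with a rough break-even calculation. The sketch below assumes a simplified cost model in which on-premises cost is fixed per year and cloud cost scales with hours used. The function name and every dollar figure are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements or vendor pricing.

```python
HOURS_PER_YEAR = 24 * 365  # 8760


def break_even_utilization(onprem_cost_per_year: float,
                           cloud_price_per_hour: float) -> float:
    """Fraction of the year a system must be busy before on-premises is cheaper.

    Simplified model: on-premises cost is the same whether the hardware is
    busy or idle, while cloud cost is price per hour times hours used.
    """
    if cloud_price_per_hour <= 0:
        raise ValueError("cloud_price_per_hour must be positive")
    return onprem_cost_per_year / (cloud_price_per_hour * HOURS_PER_YEAR)


# Hypothetical figures for illustration only.
onprem_cost = 87_600.0  # amortized hardware, power, and operations per year
cloud_price = 25.0      # hourly price of a comparable cloud instance

utilization = break_even_utilization(onprem_cost, cloud_price)
print(f"Break-even utilization: {utilization:.0%}")  # Break-even utilization: 40%
```

Under these assumed numbers, the on-premises system is the cheaper option once it is busy more than about 40% of the time. With real figures the threshold will differ, and a result above 100% would mean the cloud is cheaper at any utilization.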

What are the key components of a CDI cluster?

There are two critical factors in deploying a successful CDI-based cluster. The first is a design that properly integrates leading-edge CDI software.

As mentioned above, two software platforms often used in CDI clusters are Liqid Command Center and GigaIO FabreX. Silicon Mechanics has worked with both technologies and uses them in its CDI-based clusters.

Liqid Command Center is fabric management software for bare-metal machine orchestration. Command Center provides:

GigaIO FabreX is an open-standard solution that allows you to use your preferred vendor and model for servers, GPUs, FPGAs, storage, and any other PCIe resource in your rack. In addition to composing resources to servers, FabreX can compose servers over PCIe. FabreX enables true server-to-server communication across PCIe and makes cluster-scale compute possible, with direct memory access by an individual server to the system memory of all other servers in the cluster fabric.

High-performance, low-latency networking, like InfiniBand from NVIDIA Networking, is the second critical factor in the way CDI operates. It's possible to disaggregate just about everything: compute (Intel, AMD, FPGAs), data storage (NVMe, SSD, Intel Optane, etc.), GPU accelerators (NVIDIA GPUs), and so on. You can rearrange these components however you see fit, but the networking underneath all those pipes stays the same. Think of networking as a fixed resource with a fixed effect on performance, as opposed to the other resources, which are disaggregated.

It is important to plan out an optimal network strategy for a CDI deployment. InfiniBand is ideal for large-scale or high-performance deployments. Conversely, Ethernet is a strong choice for smaller clusters. If you expand over time, you've got that underlying network to support anything that comes up in the lifecycle of that system.

How can CDI help handle demanding HPC and AI workflows?

Today, many organizations run demanding and complex workflows, such as HPC and AI, that require massive levels of costly resources. This drives IT departments to find flexible and agile solutions that effectively manage the on-premises data center while delivering the flexibility typically provided by the cloud. CDI is quickly emerging as a compelling option to meet the demands for deploying applications that incorporate advanced technologies.

Silicon Mechanics is an engineering firm providing custom, best-in-class solutions for HPC/AI, storage, and networking, based on open standards. The Silicon Mechanics Miranda CDI Cluster is a Linux-based reference architecture that provides a strong foundation for building disaggregated environments.

Get a comprehensive understanding of CDI clusters and what they can do for your organization by downloading the insideHPC white paper on CDI.
