X-Cloud Tensorizer: In Summary
- X-Cloud Tensorizer is a tool for fast PyTorch module, model, and tensor serialization and deserialization, making it possible to load models extremely quickly from HTTP/HTTPS and S3 endpoints. It also speeds up loading from network and local disk volumes.
- By loading LLMs faster and reducing GPU memory utilization, Tensorizer accelerates model instance spin-up times while reducing the overall cost of serving inference.
- Tensorizer is S3/HTTP-compatible, enabling model streams directly from S3 into the container without having to download the model to the container’s local filesystem.
- When scaling from zero, average per-request latency was more than 5x lower with Tensorizer than with the standard Hugging Face loader, while requiring fewer pod spin-ups and less RAM.
The sizes of state-of-the-art machine learning models have ballooned into billions of parameters, making them much harder to use for inference. These massive models can take a long time to load, which severely limits the ability to scale up quickly as demand increases. To mitigate the startup lead time, you can pay to keep large quantities of GPUs sitting idle, ready for bursts in requests, but this is a very expensive solution.
To improve inference performance while maintaining cost-effectiveness, X-Cloud employs a range of open-source tools to help reduce latency, improve throughput, and reduce resource usage. One such tool is vital for enabling companies to scale inference in a fast and cost-efficient way: X-Cloud Tensorizer.
What is X-Cloud’s Tensorizer?
X-Cloud Tensorizer is a tool built for PyTorch models that enables extremely fast and efficient model loading.
Whereas the process to load a very large model into GPU memory through normal means can be slow, Tensorizer significantly reduces latency and resource usage with its “zero-copy” model loading. Instead of loading the whole model into RAM before transferring it to the GPU, Tensorizer pulls it over chunk by chunk. This “tensor streaming” process is enabled by Tensorizer’s bespoke serialization format that puts all the necessary metadata at the beginning of a single binary file. This file can be loaded quickly and efficiently from local storage, an HTTP/HTTPS endpoint, or S3 bucket.
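The format described above can be illustrated with a minimal, self-contained sketch: a metadata header at the start of a single binary blob records each tensor's location, so a reader can discover everything it needs up front and then stream the data in one forward pass. This is only an illustration of the idea, not the actual Tensorizer file format.

```python
import io
import json
import struct

def serialize(tensors: dict) -> bytes:
    """Pack named tensors (here, raw bytes) into one binary blob:
    a JSON metadata header (names, sizes, offsets) up front,
    followed by the tensor data itself."""
    header, offset = {}, 0
    for name, data in tensors.items():
        header[name] = {"offset": offset, "size": len(data)}
        offset += len(data)
    meta = json.dumps(header).encode()
    out = io.BytesIO()
    # An 8-byte little-endian length prefix tells the reader where
    # the metadata ends and the data section begins.
    out.write(struct.pack("<Q", len(meta)))
    out.write(meta)
    for data in tensors.values():
        out.write(data)
    return out.getvalue()

def stream_tensors(blob: bytes):
    """Read the metadata first, then yield tensors one at a time in
    a single sequential pass -- the 'tensor streaming' idea."""
    stream = io.BytesIO(blob)
    (meta_len,) = struct.unpack("<Q", stream.read(8))
    header = json.loads(stream.read(meta_len))
    for name, info in header.items():
        yield name, stream.read(info["size"])
```

Because the metadata lives at the front of the file, a deserializer never has to scan the whole file to find a tensor, which is also what makes HTTP range requests and seeking practical.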
High-level features of Tensorizer:
- Extremely fast model loading speeds: By avoiding Python-level memory allocation for the entire model and making a single pass through the serialized file, Tensorizer can load models with billions of parameters in as little as 5 seconds.
- Reduction in resource and memory usage: With Tensorizer's "Plaid Mode", transfers to the GPU occur at the wire speed of the network, and only a single tensor is held in RAM at a time while in transit to the GPU, greatly reducing the amount of RAM the instance needs.
Via a "zero-copy" model load, X-Cloud Tensorizer uses a negligible amount of RAM compared to loading the entire model into memory before copying it to the GPU: it allocates only a buffer the size of the largest tensor, plus some additional metadata used to locate each tensor.
- S3/HTTP-compatible: Serialized models can be stored in X-Cloud S3-compatible Object Storage, enabling model streams directly from S3 into the container without having to download the model to the container’s local filesystem.
- Sharding and Filtering: Tensorizer can accept a filter function to select for specific tensors in a model, allowing for fast sharding of a large model across multiple nodes. This is augmented by Tensorizer’s HTTP range support for seeking to specific tensors.
- Local filesystem support: X-Cloud Tensorizer supports loading models from a local filesystem, so it can be used to serialize and load models locally.
- Improved Safety: A normal PyTorch checkpoint file uses the pickle format, which can enable arbitrary code execution. By using a single binary file, Tensorizer prevents this potential security threat.
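The "Plaid Mode" buffer-reuse idea from the features above can be sketched in a few lines of plain Python. The snippet below is a hypothetical illustration, not Tensorizer's implementation: one buffer sized to the largest tensor is reused for every transfer, so host RAM stays near the size of the largest tensor rather than the whole model. `copy_to_device` is a stand-in for the host-to-GPU copy.

```python
import io

def load_with_reusable_buffer(stream, sizes, copy_to_device):
    """Read tensors of the given byte sizes sequentially from a
    stream, reusing a single preallocated buffer for each one."""
    buf = bytearray(max(sizes))  # sized to the largest tensor
    for size in sizes:
        view = memoryview(buf)[:size]
        read = stream.readinto(view)     # fill the buffer in place
        if read != size:
            raise IOError("truncated stream")
        copy_to_device(bytes(view))      # hand this chunk to the GPU
```

Reusing one buffer this way is why peak host memory is bounded by the largest single tensor instead of the sum of all tensors.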
What this means for serving large models
When serving inference, requests aren’t always coming in at a steady load. One day, your product might go viral and you might receive a burst of requests much higher than normal.
However, for LLMs and image models with billions of parameters, spinning up new instances of the model to handle these bursts can take many minutes. This is a major challenge for companies looking to serve inference from these models. To maintain acceptable average request latency, many companies accept the large overhead of compute that sits idle with the model loaded, used only during these bursts. Because this is very expensive, companies often build complex, specialized queuing mechanisms in an attempt to reduce the required overhead, which saves money but increases average customer latency.
Customer latency and compute costs are often at odds with each other, but with X-Cloud Tensorizer, developers can see model load times of <5 seconds. This makes it easier, more flexible, and more cost-efficient to serve large models at scale while scaling with demand.
Faster spin-up times are a massive differentiator for the user experience. If a typical Google search took 30 seconds or more to return results, we would probably say "Bing it" rather than "Google it."
For a company looking to productize its machine learning applications, it's important to consider the trade-off between cost and latency. Leaving compute idle without much traffic leads to unnecessary expense. Scaling up on demand adds some response latency compared to having idle compute ready, but with fast model loading that penalty is far smaller than with other loading methods. Since long response times can drive users away from a service or product, it is important to balance both, and tools like X-Cloud Tensorizer help companies find a balance that suits their application.
X-Cloud Tensorizer also makes serving inference more affordable. Companies pay a steep price for the resources needed for inference (GPU capacity, memory bandwidth, storage fees, networking, etc.). By enabling fast scaling, X-Cloud Tensorizer reduces the cost of serving inference at scale from large language, image, and audio models.
How X-Cloud’s Tensorizer Works
To understand how X-Cloud Tensorizer works, it helps to look at how X-Cloud's inference stack compares to a typical setup, and at the serialization process itself.
Open source tools for serverless deployment
X-Cloud Cloud is built on serverless Kubernetes: an open-source deployment model that lets developers run their applications as if on a serverless platform while still enjoying the benefits of bare-metal Kubernetes. This enables X-Cloud users to run their own code, manage data, and integrate applications—without ever having to manage any infrastructure.
Inference on X-Cloud Cloud leverages many well-supported open source tools within, and in addition to, Kubernetes:
- Knative Serving acts as the serverless runtime, which manages autoscaling, revision control, and canary deployments; in short, the load balancer and autoscaler.
- KServe provides an easy-to-use interface via Kubernetes resource definitions for deploying models without the fuss of correctly configuring the underlying framework (e.g., TensorFlow).
- Ceph provides software-defined, scale-out, enterprise-grade storage. Built with triple replication, the X-Cloud Cloud Storage platform delivers high-availability, performant storage for your most demanding cloud-native workloads.
These open-source tools enable seamless autoscaling and scale to zero. Combined with bare-metal performance and high-performance, network-attached storage volumes, these features enable users on X-Cloud Cloud to see improved throughput and minimal latency for serving inference.
However, model load time is still a major factor in the latency of Knative scale-up. This is where X-Cloud Tensorizer fits in. As an efficient serializer and deserializer, X-Cloud Tensorizer helps reduce the time it takes for a new model instance to be ready—improving overall latency and performance with minimal resource utilization.