Run Mixtral-8x7B on Consumer Hardware with Expert Offloading

Activation pattern of Mixtral-8x7B's expert sub-networks (image source, CC-BY)

While Mixtral-8x7B is one of the best open large language models (LLMs), it is also a huge model with 46.7B parameters. Even when quantized to 4-bit, the model can't be fully loaded on a consumer GPU (e.g., an RTX 3090 with 24 GB of VRAM is not enough).
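
To see why 24 GB is tight, a back-of-the-envelope estimate of the weight footprint helps (a rough sketch: the 46.7B figure is the total parameter count, and the per-parameter cost ignores quantization scales, the KV cache, and CUDA overhead):

```python
# Back-of-the-envelope estimate of the weight footprint of Mixtral-8x7B in 4-bit.
n_params = 46.7e9          # total parameters
bytes_per_param = 0.5      # 4 bits per weight, ignoring quantization scales
weights_gb = n_params * bytes_per_param / 1e9
print(f"~{weights_gb:.1f} GB of weights alone")   # ~23.4 GB, before the KV cache
```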

Mixtral-8x7B is a mixture of experts (MoE). It is made of 8 expert sub-networks of 6 billion parameters each.

Since only 2 of the 8 experts are active for each token during decoding, the 6 remaining experts can be moved, or offloaded, to another device, e.g., the CPU RAM, to free up GPU VRAM. In practice, this offloading is complicated.
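
The following toy sketch (not the mixtral-offloading implementation) shows the idea: the experts live in CPU RAM, and only the two experts selected by the router are copied to the GPU for the current token. The class name and sizes are made up for illustration.

```python
import torch
import torch.nn as nn

# Toy sparse MoE layer: experts stay in CPU RAM, and only the top-2 experts
# chosen by the router are copied to the GPU for the current token.
class ToyMoELayer(nn.Module):
    def __init__(self, hidden_size=64, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList(
            nn.Linear(hidden_size, hidden_size) for _ in range(num_experts)
        )

    def forward(self, x):  # x: (hidden_size,) hidden state of a single token
        top_scores, top_ids = torch.topk(self.router(x), self.top_k)
        weights = torch.softmax(top_scores, dim=-1)
        out = torch.zeros_like(x)
        for w, idx in zip(weights, top_ids.tolist()):
            expert = self.experts[idx]
            if torch.cuda.is_available():
                expert.to("cuda")                    # copy the expert's weights to VRAM
                out = out + w * expert(x.to("cuda")).cpu()
                expert.to("cpu")                     # evict it again to free VRAM
            else:
                out = out + w * expert(x)            # CPU-only fallback
        return out

layer = ToyMoELayer()
print(layer(torch.randn(64)).shape)   # torch.Size([64])
```

The repeated CPU-to-GPU transfers in this loop are exactly what makes a naive scheme slow, which is the problem described next.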

Choosing which experts to activate is a decision made at inference time, for each input token and each layer of the model. Naively moving parts of the model to the CPU RAM, as with Accelerate's device_map, would create a communication bottleneck between the CPU and the GPU.
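
For reference, this is roughly what the naive approach looks like with Transformers and Accelerate (an illustrative sketch; the memory limits are placeholders for a 24 GB GPU, and this is precisely the setup that suffers from the transfer bottleneck):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# Naive offloading: Accelerate fills the GPU up to the memory budget and pushes
# the remaining layers to CPU RAM, regardless of which experts will actually be
# activated at inference time.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "22GiB", "cpu": "96GiB"},  # illustrative limits
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```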

Mixtral-offloading (MIT license) is a project that proposes a much more efficient solution to reduce VRAM consumption while preserving a reasonable inference speed.

In this article, I explain how mixtral-offloading implements expert-aware quantization and expert offloading to save memory and maintain a good inference speed. Using this framework, we will see how to run Mixtral-8x7B on consumer hardware and benchmark its inference speed.
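
The inference-speed benchmark boils down to timing generation and dividing by the number of new tokens. A generic sketch of that measurement is shown below; it assumes a `model` and `tokenizer` are already loaded, whichever offloading strategy produced them, and the function name is mine:

```python
import time

def tokens_per_second(model, tokenizer, prompt, max_new_tokens=100):
    # Time one greedy generation and report the decoding throughput.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.time()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    elapsed = time.time() - start
    new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
    return new_tokens / elapsed

# e.g., print(tokens_per_second(model, tokenizer, "Paris is the capital of"))
```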

The tutorial section is also available as a notebook that you can find here:

Get the notebook (#37)

MoE language models often allocate distinct experts to sub-tasks, but not consistently across long token sequences. Some experts are only active in short sequences of 2-4 tokens, while others have intermittent gaps in their activation. This is well illustrated by the figure at the top of this article.
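
A hedged sketch of how one could inspect these activation patterns, assuming the Transformers Mixtral implementation exposes per-layer router logits via output_router_logits and that `model` and `tokenizer` are already loaded:

```python
import torch

# Count which experts each layer activates for a short prompt.
inputs = tokenizer("Mixture-of-experts models route tokens dynamically.",
                   return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_router_logits=True)

for layer_idx, logits in enumerate(out.router_logits):   # one tensor per layer
    top2 = logits.topk(2, dim=-1).indices                # 2 experts chosen per token
    usage = torch.bincount(top2.flatten(), minlength=8)
    print(f"layer {layer_idx:2d}: expert usage {usage.tolist()}")
```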

