Kwai, Kuaishou & ETH Zürich Propose PERSIA, a Distributed Training System That Supports Deep Learning-Based Recommenders of up to 100 Trillion…

Modern recommender systems have countless real-world applications and have made astonishing progress thanks to the ever-increasing size of deep neural network models, which have grown from Google's 2016 model with 1 billion parameters to Facebook's latest model with 12 trillion parameters. There seems to be no limit to the significant quality boosts delivered by such model scaling, and deep learning practitioners believe the era of 100-trillion-parameter systems will arrive sooner rather than later.

The training of such extremely large models, which are both memory- and computation-intensive, is however challenging, even with the support of industrial-scale data centers. In the new paper PERSIA: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters, a research team from Kwai Inc., Kuaishou Technology and ETH Zürich proposes PERSIA, an efficient distributed training system that leverages a novel hybrid training algorithm to ensure both training efficiency and accuracy in such recommender models. The team provides theoretical demonstrations and empirical studies to validate the effectiveness of PERSIA on recommender systems of up to 100 trillion parameters.

The team summarizes their study's main contributions as:

The team first proposes a novel sync-async hybrid algorithm in which the embedding module is trained asynchronously while the dense neural network is updated synchronously. This hybrid algorithm achieves hardware efficiency comparable to that of the fully asynchronous mode without sacrificing statistical efficiency.
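To make this hybrid update pattern concrete, here is a minimal single-process sketch in NumPy, assuming a toy squared-error objective on a single linear layer; the names (emb_table, train_step, etc.) and the simulation itself are illustrative assumptions and do not correspond to PERSIA's actual API.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Embedding parameter server": a huge sparse table held in CPU memory,
# updated asynchronously (other workers may push updates concurrently).
emb_table = {i: rng.normal(size=8) for i in range(1000)}

# Dense model: a single linear layer, kept synchronous across workers.
dense_w = rng.normal(size=(8, 1))

def train_step(feature_ids, label, lr=0.01):
    # 1) Asynchronous part: pull embedding rows; they may be slightly
    #    stale because there is no global barrier with other workers.
    emb = np.mean([emb_table[i] for i in feature_ids], axis=0)

    # 2) Forward pass and gradients through the dense part
    #    (toy squared-error objective).
    pred = emb @ dense_w
    err = pred - label
    grad_dense = np.outer(emb, err)
    grad_emb = (dense_w @ err) / len(feature_ids)

    # 3) Asynchronously push the sparse embedding update back to the
    #    server, without waiting for other workers.
    for i in feature_ids:
        emb_table[i] -= lr * grad_emb

    # 4) Synchronous part: in a real cluster this dense gradient would
    #    be all-reduced across GPU workers before applying the update.
    return grad_dense

# One toy training step on a single worker.
grad = train_step(feature_ids=[3, 17, 512], label=np.array([1.0]))
dense_w -= 0.01 * grad
```

The intuition behind the split is that each sample touches only a tiny fraction of the embedding rows, so the staleness introduced by asynchronous sparse updates has little impact on statistical efficiency, while the small dense network retains the stability of synchronous updates.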

The team designed PERSIA (parallel recommendation training system with hybrid acceleration) to support the aforementioned hybrid algorithm through two fundamental aspects: 1) the placement of the training workflow over a heterogeneous cluster, and 2) the corresponding training procedure over the hybrid infrastructure. PERSIA features four modules designed to provide efficient autoscaling and to support recommender models of up to 100 trillion parameters.
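As an illustration of what such a heterogeneous placement might look like, the hypothetical sketch below maps typical recommender-training roles onto CPU and GPU nodes; the module names, replica counts, and dictionary format are assumptions for illustration, not PERSIA's actual modules or configuration schema.

```python
# Hypothetical placement of the training workflow over a heterogeneous
# cluster (illustrative assumption, not PERSIA's real configuration).
cluster_placement = {
    # Huge, sparse embedding table sharded across cheap CPU memory.
    "embedding_parameter_server": {"device": "cpu", "replicas": 16},
    # CPU workers that pull/push embedding rows asynchronously.
    "embedding_worker": {"device": "cpu", "replicas": 8},
    # Small dense network trained data-parallel on GPUs with
    # synchronous (all-reduce style) updates.
    "nn_worker": {"device": "gpu", "replicas": 4},
    # Streams training samples into the pipeline.
    "data_loader": {"device": "cpu", "replicas": 4},
}

for module, spec in cluster_placement.items():
    print(f"{module}: {spec['replicas']} replica(s) on {spec['device']}")
```

Placing the memory-bound embedding storage on inexpensive CPU nodes and the compute-bound dense layers on GPU nodes is what allows the embedding table to be scaled largely independently of the number of GPUs.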

The team evaluated PERSIA on three open-source benchmarks (Taobao-Ad, Avazu-Ad and Criteo-Ad) and on Kwai's real-world production micro-video recommendation workflow. They used two state-of-the-art distributed recommender training systems, XDL and PaddlePaddle, as their baselines.

The proposed hybrid algorithm achieved much higher throughput than all other systems. PERSIA reached nearly linear speedups, with significantly higher throughput than XDL and PaddlePaddle and 3.8× higher throughput than the fully synchronous algorithm on the Kwai-video benchmark. Moreover, PERSIA maintained stable training throughput even as model size increased to 100 trillion parameters, achieving 2.6× higher throughput than the fully synchronous mode.

Overall, the results show that PERSIA effectively supports the efficient and scalable training of recommender models at a scale of up to 100 trillion parameters. The team hopes their study and insights can benefit both academia and industry.

The code is available on the project's GitHub. The paper PERSIA: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters is on arXiv.

Author: Hecate He | Editor: Michael Sarazen

We know you don't want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
