Google & DeepMind Study the Interactions Between Scaling Laws and Neural Network Architectures

State-of-the-art AI models have ballooned to billions of parameters in recent years. Although the machine learning (ML) community has shown keen interest in the scaling properties of transformer-based models, there has been relatively little research on scaling effects with regard to the inductive biases imposed by different model architectures.

In the new paper Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?, a research team from Google and DeepMind posits that understanding the connections between neural network architectures and scaling laws is essential for designing and evaluating new models. The team pretrains and finetunes over 100 models to reveal useful insights on the scaling behaviours of ten diverse model architectures.

The team summarizes its main contributions as a systematic study that aims to answer a number of questions: Do different model architectures scale differently? How does inductive bias affect scaling behaviour? And how does scaling impact upstream and downstream model performance?

To answer these questions, the team conducted extensive experiments on a broad spectrum of models, including well-established transformer variants such as Evolved Transformer (So et al., 2019), Universal Transformers (Dehghani et al., 2018) and Switch Transformers (Fedus et al., 2021); lightweight models such as Google's ALBERT; and efficient transformers such as Performer (Choromanski et al., 2020) and Funnel Transformers (Dai et al., 2020). The study also covers non-transformer architectures, including Lightweight Convolutions (Wu et al., 2019), Dynamic Convolutions (Wu et al., 2019) and MLP-Mixers (Tolstikhin et al., 2021).

The study reports the number of trainable parameters, FLOPs (of a single forward pass) and speed (steps per second) for different architectures, as well as validation perplexity (on upstream pretraining) and results on 17 downstream tasks.
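To make the idea of a scaling curve concrete, the short sketch below (not taken from the paper) fits a saturating power law, L(N) = a·N^(-b) + c, to hypothetical (parameter count, validation loss) pairs for a single architecture; the fitted exponent b is one way such curves can be compared across architectures. The data values, initial guesses and function names here are illustrative assumptions, not results from the study.

```python
# Minimal sketch of fitting a scaling curve to hypothetical measurements.
# Assumes numpy and scipy are available; the numbers below are made up.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n_params, a, b, c):
    """Saturating power law often used to model loss vs. model size."""
    return a * n_params ** (-b) + c

# Hypothetical (parameter count, validation loss) pairs for one architecture.
n_params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
val_loss = np.array([4.1, 3.6, 3.2, 2.9, 2.7])

# Fit the curve; the exponent b summarizes how quickly loss improves with scale.
(a, b, c), _ = curve_fit(power_law, n_params, val_loss,
                         p0=[10.0, 0.1, 2.0], maxfev=10000)
print(f"fitted exponent b = {b:.3f}, irreducible loss c = {c:.3f}")
```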

The team's analysis leads to several conclusions: architecture plays a crucial role in scaling because many intricate factors are intertwined with architectural choices; some models perform well on upstream perplexity yet fail to transfer that advantage to downstream tasks; and the relative performance of different architectures can fluctuate across scales. The researchers also show that introducing novel inductive biases can be risky when scaling, and they suggest ML practitioners be mindful of this when performing expensive runs on transformer architectures that drastically modify the attention mechanism.

The paper Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? is on arXiv.

Author: Hecate He | Editor: Michael Sarazen

We know you don't want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
