
New AI Research Studies Point to Faster And More Powerful Language Models

LLM Studies

Insider Brief

  • To scale artificial intelligence systems, researchers need new methods that make language models more capable and efficient.
  • Two recent studies from leading AI labs propose approaches that could deliver significant gains in both capability and efficiency.
  • The studies were led by researchers at Meta AI and MIT.

A critical challenge for artificial intelligence researchers is making large language models (LLMs) more capable and efficient. Two recent studies from leading AI labs take distinct, innovative approaches that could improve LLMs on both fronts, though much work remains.

Meta AI Advance

The first paper, posted to the pre-print server arXiv as Better & Faster Large Language Models via Multi-token Prediction, is a collaboration between several prominent AI research groups, including Meta AI. It introduces a technique called multi-token prediction, which challenges the standard next-token approach used by most current LLMs.

Traditionally, LLMs work by predicting one token (a word or piece of a word) at a time. The researchers behind this study suggest that training LLMs to predict multiple future tokens simultaneously could dramatically improve their performance and sample efficiency.

The method proposed by the Meta AI-led research team involves adding multiple independent output heads to the LLM architecture. At each position in the training data, the model must predict the following n tokens using these n output heads, all operating on top of a shared model trunk.
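The shared-trunk-plus-heads idea can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the trunk here is a single tanh layer standing in for a full transformer, and all sizes and names (`W_trunk`, `N_FUTURE`, etc.) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_MODEL, N_FUTURE = 100, 32, 4  # hypothetical sizes

# Shared trunk: in the paper this is a full transformer; a single
# tanh layer stands in for it here.
W_trunk = rng.standard_normal((D_MODEL, D_MODEL)) * 0.1

# One independent output head per future token position.
heads = [rng.standard_normal((D_MODEL, VOCAB)) * 0.1 for _ in range(N_FUTURE)]

def predict_future_tokens(x):
    """Return one logit vector per future position, all computed
    from the same shared hidden representation."""
    h = np.tanh(x @ W_trunk)          # shared trunk output
    return [h @ W for W in heads]     # n independent heads

x = rng.standard_normal(D_MODEL)      # hidden state at one position
logits = predict_future_tokens(x)
predicted = [int(np.argmax(l)) for l in logits]
print(len(logits), logits[0].shape)   # 4 heads, each scoring the vocabulary
```

During training, each head would receive its own cross-entropy loss against the token at its offset; at inference time the extra heads can be dropped (keeping only standard next-token prediction) or used to propose several tokens at once.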

According to the study, the researchers found that treating multi-token prediction as an auxiliary training task significantly improved downstream capabilities, with no added training time overhead for both code and natural language models.

The gains were particularly pronounced in generative benchmarks like coding, where the team reports that their 13 billion parameter models solved 12% more problems on the HumanEval dataset and 17% more on MBPP than comparable next-token models.

Experiments on small algorithmic tasks showed that multi-token prediction could enhance the development of induction heads and algorithmic reasoning capabilities in LLMs.

An additional benefit of this approach is that models trained with 4-token prediction can be up to three times faster at inference, even with large batch sizes.

While the study acknowledges that the method is currently most effective for very large models, it represents a promising step toward faster and more capable LLMs.

The research team included: Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz and Gabriel Synnaeve.

Yes It KAN

Switching to the next study: researchers at MIT, including noted physicist and machine learning scientist Max Tegmark, proposed an ambitious alternative to the ubiquitous multi-layer perceptron (MLP) in their arXiv paper, KAN: Kolmogorov-Arnold Networks. The MLP is widely regarded as one of the most important building blocks in deep learning history.

The key innovation behind KANs is the use of learnable activation functions on the network's edges (its "weights") instead of the fixed activation functions on nodes that MLPs employ. A traditional MLP has linear weight parameters followed by separate non-linear activation functions applied at the nodes. KANs replace each linear weight with its own learnable non-linear function, removing the need for separate node-wise activations. Put another way, the non-linearity is baked into the weights rather than applied after the linear transformations.
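The edge-wise functions can be sketched in a few lines of numpy. This is a simplified illustration, not the authors' method: the paper parameterizes each edge with a B-spline, while fixed Gaussian bumps with learnable coefficients stand in for that here, and all sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N_IN, N_OUT, N_BASIS = 3, 2, 5  # hypothetical layer sizes

# Learnable coefficients: one set per edge (input i -> output j).
# The paper uses B-spline bases; Gaussian bumps stand in here.
centers = np.linspace(-2.0, 2.0, N_BASIS)
coeffs = rng.standard_normal((N_IN, N_OUT, N_BASIS)) * 0.1

def edge_fn(x, c):
    """Learnable univariate function on one edge: a weighted
    sum of fixed Gaussian bumps (coefficients c are trainable)."""
    return float(np.sum(c * np.exp(-(x - centers) ** 2)))

def kan_layer(x):
    """Each output node simply sums its incoming edges' non-linear
    functions; there is no separate node-wise activation, unlike an MLP."""
    out = np.zeros(N_OUT)
    for j in range(N_OUT):
        out[j] = sum(edge_fn(x[i], coeffs[i, j]) for i in range(N_IN))
    return out

y = kan_layer(np.array([0.5, -1.0, 2.0]))
print(y.shape)  # one value per output node
```

Because each edge carries an explicit one-dimensional function, the learned functions can be plotted individually, which is the basis for the interpretability claims discussed below.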

The researchers say this change may seem simple conceptually, but it has far-reaching implications.

According to the researchers, this fundamental difference helps KANs outperform MLPs in terms of both accuracy and interpretability.

The researchers report that in data fitting and partial differential equation (PDE) solving tasks, much smaller KANs achieved comparable or better accuracy than significantly larger MLPs. Theoretical and empirical evidence also suggests that KANs possess faster neural scaling laws than MLPs.

Interpretability

In the field of AI research, interpretability refers to the ability to explain or to provide insights into how a model works and arrives at its outputs or decisions. As AI systems become more complex and are deployed in high-stakes domains like healthcare and finance, the ability to understand and trust their reasoning process is crucial for accountability, safety and aligning the systems with human values.

KANs offer improvements in interpretability, according to the researchers: they can be visualized intuitively and interacted with easily by human users. The researchers demonstrated how KANs could assist scientists in (re)discovering mathematical and physical laws through two examples, one in mathematics and one in physics.

While the study acknowledges that KANs have only been evaluated in a few targeted use cases so far, the researchers argue that they represent a promising alternative to MLPs, potentially opening up new opportunities for improving deep learning models that currently rely heavily on MLPs.

In addition to Tegmark, the research team included Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić and Thomas Y. Hou.

As these summaries cannot capture all of the nuances of the studies, please read the papers for a deeper dive into the research.