Insider Brief
- Researchers from the Weizmann Institute of Science and Intel Labs have developed new algorithms that allow large language models (LLMs) from different developers to collaborate, speeding up AI processing by up to 2.8 times while lowering computational costs.
- The method, funded by Intel Labs and presented at the International Conference on Machine Learning, overcomes the barrier of each model's proprietary token ‘language’ by translating tokens into a shared format and promoting tokens whose meaning is common across models.
- Released on the open-source Hugging Face Transformers platform, the technology is already being adopted by AI developers worldwide, offering scalable improvements particularly valuable for edge devices such as autonomous vehicles and drones.
A new method developed by researchers at the Weizmann Institute of Science and Intel Labs could significantly accelerate large language models (LLMs) while lowering the computational burden, according to a study presented this week at the International Conference on Machine Learning in Vancouver. The research, funded by Intel Labs, enables AI models developed by different companies to collaborate for the first time by overcoming the proprietary ‘languages’ or token systems unique to each model.
According to the Weizmann Institute, LLMs such as ChatGPT and Gemini rely on processing vast amounts of data to generate responses. While powerful, they are slow and consume significant computing resources. In recent years, speculative decoding emerged as a method to improve LLM efficiency: a small, fast model drafts an initial response, and a larger, more accurate model verifies it and corrects any mistakes. Until now, however, this process was limited to models designed to work together, because both had to share the same tokenizer, the vocabulary of internal tokens in which a model reads and writes text.
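In outline, the draft-and-verify loop looks something like the sketch below. The `draft_model` and `target_model` objects and their methods are illustrative placeholders, not an API from the study or any particular library.

```python
# A minimal sketch of standard speculative decoding, assuming hypothetical
# draft_model / target_model objects; method names are illustrative only.
def speculative_decode(draft_model, target_model, ids, k=5, max_new_tokens=256):
    start = len(ids)
    while len(ids) - start < max_new_tokens:
        # 1. The small, fast model proposes k candidate tokens.
        draft = draft_model.sample(ids, num_tokens=k)
        # 2. The large model checks all k candidates in a single forward
        #    pass, far cheaper than generating k tokens one at a time.
        accepted = target_model.verify(ids, draft)  # longest accepted prefix
        ids = ids + accepted
        if len(accepted) < len(draft):
            # 3. At the first rejected token, the large model substitutes
            #    its own token, and drafting resumes from the corrected text.
            ids = ids + [target_model.sample_one(ids)]
    return ids
```

The verification step only works when both models emit tokens from the same vocabulary, which is precisely the restriction the new research removes.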
The study involved contributions from multiple researchers, including Nadav Timor and Professor David Harel of the Weizmann Institute, as well as Oren Pereg and colleagues from Intel Labs. The practical tools developed through their work aim to advance the efficiency and scalability of generative AI applications across industries.
The Weizmann and Intel researchers addressed this limitation by developing two algorithms that allow any small model to collaborate with any large model, regardless of their origin. The first algorithm translates a model’s internal tokens into a shared format understood by other models. The second encourages models to use common tokens that have identical meanings across systems, similar to shared words in human languages.
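A minimal sketch of the first algorithm's core idea, assuming plain text serves as the shared format and Hugging Face-style tokenizer interfaces; the function below is an illustration of the concept, not the paper's actual implementation.

```python
# Illustrative only: map a draft model's tokens into a target model's
# vocabulary by round-tripping through text, the one format all LLMs share.
def translate_tokens(draft_ids, draft_tokenizer, target_tokenizer):
    text = draft_tokenizer.decode(draft_ids, skip_special_tokens=True)
    return target_tokenizer.encode(text, add_special_tokens=False)
```

The second algorithm complements this translation by nudging models toward tokens that mean the same thing in both vocabularies, so less is lost in the round trip.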
“At first, we worried that too much information would be ‘lost in translation’ and that different models wouldn’t be able to collaborate effectively,” noted Timor. “But we were wrong. Our algorithms speed up the performance of LLMs by up to 2.8 times, leading to massive savings in spending on processing power.”
The researchers reported that their methods improved LLM performance by an average of 1.5 times, with peaks reaching up to 2.8 times faster responses. This improvement translates directly into energy savings and reduced computational costs, which could be substantial in large-scale AI operations. These findings were significant enough to warrant public presentation at ICML, an honor granted to only 1% of submissions.
To make the technology widely accessible, the team has released the algorithms on Hugging Face Transformers, a prominent open-source AI platform. According to the researchers, the new algorithms have already been integrated into standard tools used by developers globally. This move democratizes access to advanced AI acceleration techniques previously confined to major tech firms with proprietary models.
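For developers, this surfaces through the library's assisted-generation interface, which in recent Transformers releases accepts a draft model with a different tokenizer than the target. The sketch below uses placeholder checkpoint names, and exact argument support may vary by library version.

```python
# Hedged usage sketch of cross-vocabulary assisted generation in
# Hugging Face Transformers. Checkpoint names are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "org/large-target-model"   # placeholder checkpoint
draft_name = "org/small-draft-model"     # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(target_name)
assistant_tokenizer = AutoTokenizer.from_pretrained(draft_name)
model = AutoModelForCausalLM.from_pretrained(target_name)
assistant_model = AutoModelForCausalLM.from_pretrained(draft_name)

inputs = tokenizer("The future of on-device AI", return_tensors="pt")
outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,           # small model drafts tokens
    tokenizer=tokenizer,                       # both tokenizers are needed so
    assistant_tokenizer=assistant_tokenizer,   # the library can translate vocabularies
    max_new_tokens=64,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```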
“This new development is especially important for edge devices, from phones and drones to autonomous cars, which must rely on limited computing power when not connected to the internet,” Timor pointed out. “Imagine, for example, a self-driving car that is guided by an AI model. In this case, a faster model can make the difference between a safe decision and a dangerous error.”