Insider Brief
- Large Language Models (LLMs) have rapidly become integral to various applications, from programming assistants to chatbots.
- One challenge for these models is ensuring that all client requests are processed fairly and efficiently.
- Researchers propose an algorithm, called the Virtual Token Counter (VTC), that achieves fairness in serving by using a cost function that accounts for the number of input and output tokens processed for each client.
Large Language Models (LLMs) such as ChatGPT and Bard have rapidly become integral to a range of applications, from programming assistants to chatbots. However, their rise has introduced new challenges, particularly in ensuring that all client requests are processed fairly and efficiently. A research paper recently published on the pre-print server arXiv delves into these challenges and outlines a Virtual Token Counter, explained below, as one possible solution for improving the fairness and resource efficiency of LLM inference services.
The Problem of Fairness and Resource Utilization
LLM inference services, which handle requests ranging from brief chats to extensive document analyses, must maintain fairness to prevent any single client from monopolizing resources. Most major LLM services impose request rate limits to ensure fairness. However, this approach makes poor use of resources and can lead to customer complaints when spare capacity goes unused.
As the researchers mention in the paper: “While there is a rich literature on fair scheduling, serving LLMs presents new challenges due to their unpredictable request lengths and unique batching characteristics on parallel accelerators.”
The traditional First-Come-First-Serve (FCFS) scheduling method commonly used by LLM systems, though simple, has significant drawbacks as well. For instance, a client sending a disproportionate number of requests can slow down the service for all other clients. To mitigate this, many services rely on another type of request limit: a requests-per-minute (RPM) limit.
Yet, as the researchers note: “RPM can lead to low resource utilization. A client sending requests at a high rate will be restricted even if the system is underutilized, leading to wasted resources.”
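To make that drawback concrete, below is a minimal sketch, not taken from the paper, of a per-client requests-per-minute limiter; the class name and the 60-second window are illustrative assumptions. Once a client exhausts its quota, its requests are rejected even if the accelerators are sitting idle, which is exactly the wasted capacity the researchers describe.

```python
import time
from collections import deque

class RpmLimiter:
    """Hypothetical per-client requests-per-minute limiter (illustrative only)."""

    def __init__(self, max_requests_per_minute: int):
        self.limit = max_requests_per_minute
        self.timestamps = deque()  # arrival times of requests in the last minute

    def allow(self, now: float) -> bool:
        # Forget requests older than the 60-second window.
        while self.timestamps and now - self.timestamps[0] > 60.0:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False  # rejected, even if the GPUs are currently idle
        self.timestamps.append(now)
        return True

limiter = RpmLimiter(max_requests_per_minute=60)
print(limiter.allow(time.time()))  # True while the client is under its quota
```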
Fairness Is Unfair to LLMs?
LLM services struggle to implement fair scheduling because of challenges unique to the setting. One significant challenge is defining fairness in the context of LLM serving, which differs from traditional fairness concepts in networking and operating systems. Traditional fairness measures treat the cost of a request as a fixed, relatively easy-to-estimate value. In LLM serving, however, the cost per token can vary significantly: input tokens are typically processed in parallel, whereas output tokens must be generated sequentially, making them more resource-intensive.
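As a rough illustration of that asymmetry, rather than the paper's exact formulation, the cost of a request can be modeled as a weighted sum of its input and output tokens; the function name and the weights below (1 for input, 2 for output) are assumptions made for the sake of the example.

```python
def request_cost(num_input_tokens: int, num_output_tokens: int,
                 w_in: float = 1.0, w_out: float = 2.0) -> float:
    """Weighted token cost of one request (weights are illustrative assumptions).

    Output tokens are charged more because they are generated one at a time,
    while input tokens are processed in parallel during the prefill phase.
    """
    return w_in * num_input_tokens + w_out * num_output_tokens

# A short prompt with a long answer can cost more than a long prompt
# with a short answer.
print(request_cost(num_input_tokens=50, num_output_tokens=500))  # 1050.0
print(request_cost(num_input_tokens=500, num_output_tokens=50))  # 600.0
```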
The team also adds that the server's variable token-rate capacity and the unknown output length of requests pose further challenges.
They write: “Without knowing the length in advance, traditional algorithms like SFQ and Deficit Round Robin (DRR) cannot determine how many jobs can be scheduled within the quantum (of service).”
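To see where the quantum-based approach breaks down, here is a simplified, hypothetical sketch of one Deficit Round Robin pass: each client's deficit grows by a fixed quantum per round, and a job is dispatched only if its cost fits within that deficit. For LLM requests the cost depends on the output length, which is unknown until generation finishes, so the comparison marked below cannot be made ahead of time.

```python
from collections import deque

def drr_round(queues: dict[str, deque], deficits: dict[str, float],
              quantum: float, dispatch) -> None:
    """One Deficit Round Robin pass over per-client queues (illustrative sketch).

    `queues` maps a client id to a deque of (job, cost) pairs and `dispatch`
    actually runs a job.
    """
    for client, queue in queues.items():
        deficits[client] += quantum
        while queue:
            job, cost = queue[0]
            # DRR must know a job's cost *before* scheduling it. In LLM serving
            # the output length, and hence the cost, is unknown at this point.
            if cost > deficits[client]:
                break
            queue.popleft()
            deficits[client] -= cost
            dispatch(job)
```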
The researchers conducted extensive experiments to validate the performance of the VTC algorithm. The results demonstrated the superior performance of VTC in ensuring fairness compared to other baseline methods, which exhibited shortcomings under various conditions.
“Through extensive experiments, we demonstrate the superior performance of VTC in ensuring fairness, especially in contrast to other baseline methods,” the researchers write.
Virtual Token Counter
To address these issues, the researchers propose a novel scheduling algorithm called the Virtual Token Counter (VTC). The algorithm aims to achieve fairness in serving by using a cost function that accounts for the number of input and output tokens processed. Put more simply, the system ensures fairness by weighing how much text each client sends in and how much the model generates in response.
The VTC method keeps track of how much service each user has received and gives priority to those who have received less. This method is a step toward making sure resources are used efficiently without compromising fairness.
The researchers write: “VTC tracks the services received for each client and will prioritize the ones with the least services received, with a counter lift each time a client is new to the queue. This counter lift fills the gap created by a low load period of the client, ensuring fairness.”
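The sketch below is a simplified reading of that description, not the authors' reference implementation; the class and method names and the token weights are assumptions made for illustration. Each client carries a virtual counter of the service it has received, the scheduler always serves the backlogged client with the smallest counter, counters grow as tokens are actually processed, and a client returning after an idle spell has its counter lifted to the smallest active counter so that idle time does not translate into a burst of priority later.

```python
from collections import deque

class VirtualTokenCounterScheduler:
    """Simplified VTC-style fair scheduler (names and weights are assumptions)."""

    def __init__(self, w_in: float = 1.0, w_out: float = 2.0):
        self.w_in, self.w_out = w_in, w_out
        self.counters: dict[str, float] = {}  # virtual service received per client
        self.queues: dict[str, deque] = {}    # pending requests per client

    def submit(self, client: str, request) -> None:
        if client not in self.queues or not self.queues[client]:
            # "Counter lift": a client that is new to the queue starts from the
            # smallest counter among currently backlogged clients, so a low-load
            # period does not earn it extra priority later.
            active = [self.counters[c] for c, q in self.queues.items() if q]
            lift = min(active) if active else 0.0
            self.counters[client] = max(self.counters.get(client, 0.0), lift)
        self.queues.setdefault(client, deque()).append(request)

    def next_request(self):
        # Serve the backlogged client that has received the least virtual service.
        backlogged = [c for c, q in self.queues.items() if q]
        if not backlogged:
            return None
        client = min(backlogged, key=lambda c: self.counters[c])
        return client, self.queues[client].popleft()

    def record_tokens(self, client: str, input_tokens: int, output_tokens: int) -> None:
        # Charge the client for the tokens actually processed, with output
        # tokens weighted more heavily than input tokens.
        self.counters[client] += self.w_in * input_tokens + self.w_out * output_tokens
```

In a real serving system, a scheduler along these lines would sit in front of the batching engine, handing it requests and charging each client's counter as tokens are processed.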
Broader Implications for LLM Serving Systems
This research not only addresses fairness but also bridges a critical gap in LLM serving systems.
While well-known techniques like advanced batching mechanisms, memory optimizations and GPU kernel optimizations have improved LLM performance, none have considered fairness among clients, according to the team.
They add that the proposed VTC algorithm could be integrated with many of these techniques, further enhancing their effectiveness.
The researchers conclude by recognizing that the client needs to be kept in the loop if these services are to improve. As competition among LLM providers heats up, these services will look for ways to grow market share, and client service will be an important consideration; fairness will be part of those efforts.
“We studied the problem of fair serving in Large Language Models (LLMs) with regard to the service received by each client. We identified unique characteristics and challenges associated with fairness in LLM serving, as compared to traditional fairness problems in networking and operating systems. We then defined what constitutes fairness and proposed a fair scheduler, applying the concept of fair sharing to the domain of LLM serving at the token granularity.”
The research team includes Ying Sheng from UC Berkeley and Stanford University, along with Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, and Joseph E. Gonzalez, all from UC Berkeley, and Danyang Zhuo from Duke University.
This article attempts to summarize VTC and the researchers' findings; for a deeper dive, please read the paper.