Guest Post By
Alexey Aylarov
Modern speech-to-speech models now bypass the need to convert speech to text and back; they can speak while listening, and vice versa. They are also getting faster, to the point where the technology is advancing faster than human perception can keep up. That shifts the bottleneck from infrastructure to interaction design and model behavior. This piece explores what the thinking pause means for Voice AI, why conversations can feel unnatural or even creepy, where the middle ground lies, and why pacing and emotional context are crucial to resolving this.
Why uncanny speed is showing up now
Gone are the days when the biggest complaint about Voice AI was that it sounded robotic. In many demos, the voice is already good enough that your brain stops tagging it as automation. That's where the trouble begins, because humans have a built-in expectation for conversational timing.
In a normal conversation, a quick response feels natural for simple facts. For anything that requires lookup, reasoning, or caution, humans pause, because pausing signals thought and attention to what was said.
A typical human reaction time is often described in the 200 to 300 ms range for basic stimulus-response. New speech-to-speech models can start producing a response around 150 ms in streaming conditions, which means they can speak while still listening. Technically, that's a superpower, but in customer support contexts it comes across as cutting the caller off.
Timing is the real face of a voice agent
Timing signals whether the other side is listening, whether the caller is being rushed, and whether the interaction feels safe.
There are two common uncanny patterns. First, the agent is too fast in moments where a human would slow down. Second, the agent interrupts like a machine rather than like a person.
Interruptions are normal in human conversations, but they follow norms. We usually interrupt with cues: a short overlap that yields, a correction that comes with an apology, or a pause that signals you are taking the floor. When an AI interrupts with full, clean sentences at machine speed, it feels like the system is ignoring the caller's social signals, which leads to frustration.
Try it yourself
Don't just take my word for it: try Moshi, a conversational AI and a simple example of a model that listens and speaks at the same time. You'll notice that the overlap sounds too eager, silences are filled too quickly, and the agent does not behave like a human who waits for you to finish.
Do you feel comfortable talking to it? Probably not.
How the latency of a classic pipeline compares to speech-to-speech
A well-built ASR plus LLM plus TTS pipeline feels natural because every component can stream.
ASR runs in real time. LLMs used for live calls are usually fast variants of larger models, tuned for streaming output and lower latency rather than maximum reasoning depth. TTS has also shifted toward streaming generation, where it can start producing audio after a small buffer of words.
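To make the streaming budget concrete, here is a minimal sketch that sums per-stage times to first output for a classic pipeline. Every number is an illustrative assumption for a reasonably tuned setup, not a benchmark of any specific vendor or model.

```python
# Rough time-to-first-audio budget for a streaming ASR -> LLM -> TTS pipeline.
# All latency figures below are assumed for illustration only.
STAGE_LATENCY_MS = {
    "asr_partial": 150,      # ASR emits a usable partial after end of speech
    "turn_detection": 200,   # endpointing delay before committing the turn
    "llm_first_token": 250,  # time to first streamed token from a fast LLM
    "tts_first_audio": 150,  # TTS starts audio after buffering a few words
    "network_overhead": 50,  # transport hops between components
}

def time_to_first_audio(stages: dict[str, int]) -> int:
    """Serial worst case: each stage waits for the previous one to start emitting."""
    return sum(stages.values())

print(time_to_first_audio(STAGE_LATENCY_MS))  # 800 ms under these assumptions
```

Even with every component streaming, the serial hand-offs add up, which is why shaving any single stage below ~100 ms yields diminishing returns.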
In that classic pipeline, the latency trap is VAD and turn detection: VAD decides whether the user is still speaking, and turn detection decides when to send partial or final transcripts to the LLM. If you wait too long, the system feels slow. If you cut too early, you interrupt the caller or miss the last part of the request.
The most advanced speech-to-speech systems skip the ASR step entirely and ingest speech directly. When you do that, you also stop relying on traditional VAD thresholds in the same way, because turn-taking analysis can be built into the model’s behavior. This creates the uncanny speed effect.
The tech behind the speed
There are limits on how fast a Voice AI can respond.
First, infrastructure. If your setup is good, you start bumping into physics. Data still needs to travel over the internet. You can reduce hops, place components closer to the caller, and streamline the media path, but the last gains get expensive, and after a point the bigger improvements come from interaction design rather than from shaving milliseconds.
Second, telephony. Phone networks were built decades ago with constraints like codec choices, jitter buffers, routing quirks, and carrier-level variability. They weren’t designed for real-time AI. Even if your model is fast, the call leg can introduce timing drift that makes a conversation choppy or delayed.
How speech-to-speech models work without STT and TTS
At a high level, a speech-to-speech system has three core pieces.
- An audio encoder or tokenizer that converts raw waveforms into discrete audio tokens or latent representations. This is the step that turns messy audio into something a model can reason over in a stable way.
- A multimodal transformer that acts as the brain. It processes the input audio tokens, keeps context across turns, and generates output tokens that represent the response, ideally carrying intent, tone, and emotion along the way.
- An audio decoder or vocoder that converts the generated tokens back into an audible waveform at high fidelity.
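The three pieces above can be sketched as a toy pipeline. All class and method names here are invented for illustration; real systems (Moshi, for example) differ substantially, and the "model" below is a stand-in that just shuttles tokens through.

```python
# Minimal shape of a speech-to-speech stack: encoder -> transformer -> decoder.
# Everything here is a hypothetical stub; no real codec or model is involved.
from dataclasses import dataclass, field

@dataclass
class AudioTokenizer:
    """Encoder: raw waveform -> discrete audio tokens."""
    frame_ms: int = 80  # one token per 80 ms of audio (an assumed rate)

    def encode(self, waveform_ms: int) -> list[int]:
        # Stand-in for a neural codec; emits one token id per frame.
        return list(range(waveform_ms // self.frame_ms))

@dataclass
class MultimodalTransformer:
    """The brain: keeps context across turns, maps input tokens to output tokens."""
    context: list[int] = field(default_factory=list)

    def step(self, in_tokens: list[int]) -> list[int]:
        self.context.extend(in_tokens)        # full-duplex: keep listening while speaking
        return [t + 1000 for t in in_tokens]  # dummy "response" tokens

@dataclass
class Vocoder:
    """Decoder: generated tokens -> audible waveform (here, just a duration)."""
    frame_ms: int = 80

    def decode(self, out_tokens: list[int]) -> int:
        return len(out_tokens) * self.frame_ms

tokenizer, brain, vocoder = AudioTokenizer(), MultimodalTransformer(), Vocoder()
audio_out_ms = vocoder.decode(brain.step(tokenizer.encode(waveform_ms=800)))
```

Note what is absent: there is no text anywhere in the loop, which is exactly why traditional text-based guardrails have nothing to inspect.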
This architecture enables the model to respond quickly and naturally, but it also makes control harder because the system is not producing text as the primary intermediate. Guardrails become more important and more difficult at the same time, because the system moves fast enough to do something wrong before your traditional controls catch it.
When faster isn’t better
Should Voice AI respond instantly? It should be able to, because there are scenarios where instant matters, but whether it should depends on the business logic and the stakes of the call.
For simple, low-risk actions, speed is great. For payments, refunds, cancellations, disputes, anything where the customer is anxious or the policy is strict, a small pause plus confirmation feels more trustworthy.
This is where fillers like “um, uh, let me check, one moment” come in. Some developers now add controlled fillers or micro-pauses to voice agents so the system can take longer without sounding broken.
The emotional layer and why text-only sentiment is a limitation
Most production systems still infer emotion mainly from text. They do sentiment analysis on what was said. This is a limitation in customer support. A cheerful voice during a cancellation call, or a casual tone during a fraud dispute, can cause reputational damage.
Speech-to-speech models should get better at emotion recognition because they keep more of the original audio signal, and the best ones already try to adjust output accordingly, but we're still early. This is why many teams add explicit safety logic around tone and escalation, and why affective settings in some real-time model APIs are becoming a practical feature.
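Such explicit tone safety logic can be as simple as a table of tones allowed per call type, checked before any audio is produced. The call types, tone labels, and emotion labels here are invented for the sketch; a real system would feed in outputs from its own classifiers.

```python
# Hedged sketch of a tone guard: override the model's planned tone when it
# clashes with the call context or the caller's detected emotion.
ALLOWED_TONES = {
    "cancellation": {"neutral", "empathetic"},
    "fraud_dispute": {"neutral", "serious"},
    "general": {"neutral", "cheerful", "empathetic"},
}

def check_tone(call_type: str, planned_tone: str, caller_emotion: str) -> str:
    """Return the tone the agent is actually allowed to use."""
    allowed = ALLOWED_TONES.get(call_type, ALLOWED_TONES["general"])
    if planned_tone not in allowed:
        return "neutral"    # hard override before any audio is synthesized
    if caller_emotion == "angry" and planned_tone == "cheerful":
        return "empathetic" # don't sound upbeat at an upset caller
    return planned_tone
```

The guard runs between the model's decision and the speaker, which is the only place a control can sit when there is no text transcript to filter.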
The missing visual channel
One reason voice can feel uncanny is that it’s blind from the user’s perspective. In a human conversation, we see facial expressions and micro-reactions that tell us whether the other person understands, is confused, or is waiting.
Video-based agents can help, but syncing facial behavior to real speech is hard, and users are extremely sensitive to mismatched lip movement, eye contact, and timing. If the voice is fast but the face lags, the uncanny effect gets stronger.
Cultural norms cannot be overlooked
Turn-taking rules vary by culture and context. Some cultures consider interruption rude and prefer waiting; others don't mind. A human adapts without thinking about it. A model doesn't, unless you teach it through prompts, training, and evaluation.
This is why a voice agent that performs well in one market can feel rude or intrusive in another. And it's why choosing a different model doesn't always fix it. It's usually a matter of conversation design, localization, and testing with real users.
Looking Ahead
What’s next for Voice AI if the goal is truly human-like conversations?
Omni models are a big part of the answer. Models that handle audio end to end, and mix audio, text, and vision in one system, will learn better which conversations feel normal. They also open the door to agents that can pair voice with visual feedback more coherently, instead of stitching together separate speech and animation components.
Most importantly, the layer around the model—orchestration, observability, evaluation, and connectivity—is imperative. As raw latency improves, the hard work shifts to making behavior stable in production, handling turn-taking under noisy conditions, choosing when to speak and when to stay quiet, and building guardrails that keep fast systems safe.
About the author
Alexey Aylarov is CEO and Co-Founder of Voximplant, a Voice AI Orchestration Platform for real-time communications. He co-founded Voximplant in 2013 to make voice and audio communications programmable for developers. Before Voximplant, Alexey co-founded SIP-based calling services Flashphone and Zingaya, which proved the demand for developer-first communications. Alexey brings a rare mix of deep VoIP infrastructure expertise and hands-on AI platform building.