Atlas Cloud Introduces Atlas Inference to Streamline Large Language Model Deployment

Atlas Cloud has announced the launch of Atlas Inference, a next-generation AI inference platform engineered to dramatically reduce the GPU and server load required to run large language models (LLMs) at scale. Developed in collaboration with SGLang, Atlas Inference enables faster, cost-effective deployment by maximizing GPU throughput with fewer resources.

CEO Jerry Tang stated that the platform was designed to “break down the economics of AI deployment,” highlighting its ability to process 54,500 input and 22,500 output tokens per second per node. In benchmarking tests, a 12-node H100 cluster running Atlas Inference outperformed DeepSeek’s V3 reference implementation while using one-third fewer servers.

Powered by innovations such as Prefill/Decode Disaggregation and DeepEP Parallelism, Atlas Inference delivers industry-leading speed and efficiency, outperforming larger configurations from Amazon, Microsoft, and NVIDIA. The platform supports more than 10,000 concurrent sessions with sub-5-second latency and allows enterprises to upload and isolate custom models on dedicated GPUs.

Now available to enterprises and startups, Atlas Inference sets a new benchmark for scalable, high-performance LLM deployment.

James Dargan

James Dargan is a writer and researcher at The AI Insider. He focuses on the AI startup ecosystem and writes articles on the space in a tone accessible to the average reader.
