Atlas Cloud has announced the launch of Atlas Inference, a next-generation AI inference platform engineered to sharply reduce the GPU and server footprint required to run large language models (LLMs) at scale. Developed in collaboration with SGLang, Atlas Inference maximizes per-GPU throughput, enabling faster and more cost-effective deployment on fewer machines.
CEO Jerry Tang stated that the platform was designed to “break down the economics of AI deployment,” highlighting its ability to process 54,500 input and 22,500 output tokens per second per node. In benchmark tests, a 12-node H100 cluster running Atlas Inference outperformed DeepSeek’s V3 reference implementation while using one-third fewer servers.
Powered by techniques such as Prefill/Decode Disaggregation and DeepEP parallelism, Atlas Inference delivers industry-leading speed and efficiency; the company says it outperforms larger configurations from Amazon, Microsoft, and NVIDIA. The platform supports more than 10,000 concurrent sessions with sub-5-second latency and lets enterprises upload custom models and isolate them on dedicated GPUs.
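Of those techniques, prefill/decode disaggregation is the easiest to picture: prompt ingestion (prefill) is compute-bound, while token-by-token generation (decode) is memory-bandwidth-bound, so splitting the two phases across separate worker pools lets each be batched and scaled on its own terms. The sketch below illustrates only that general idea; every name in it (Request, prefill_worker, decode_worker, the in-process queues) is a hypothetical stand-in, not Atlas Inference's or SGLang's actual API.

```python
# Minimal sketch of prefill/decode disaggregation. All names are
# hypothetical illustrations, not Atlas Inference's real implementation.
import queue
import threading
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: list[str]                  # prompt tokens (strings for simplicity)
    max_new_tokens: int = 4
    kv_cache: list[str] = field(default_factory=list)  # stand-in for the KV cache
    output: list[str] = field(default_factory=list)

prefill_q: "queue.Queue[Request]" = queue.Queue()
decode_q: "queue.Queue[Request]" = queue.Queue()

def prefill_worker() -> None:
    # Prefill is compute-bound: it ingests the whole prompt once,
    # materializes the KV cache, then hands the request to the decode pool.
    while True:
        req = prefill_q.get()
        req.kv_cache = list(req.prompt)  # "run" the prompt through the model
        decode_q.put(req)                # transfer the request + KV cache
        prefill_q.task_done()

def decode_worker(results: "queue.Queue[Request]") -> None:
    # Decode is memory-bandwidth-bound and emits one token per step, so
    # isolating it from prefill keeps per-token latency predictable.
    while True:
        req = decode_q.get()
        while len(req.output) < req.max_new_tokens:
            next_token = f"tok{len(req.output)}"  # placeholder for a model step
            req.output.append(next_token)
            req.kv_cache.append(next_token)
        results.put(req)
        decode_q.task_done()

if __name__ == "__main__":
    results: "queue.Queue[Request]" = queue.Queue()
    threading.Thread(target=prefill_worker, daemon=True).start()
    threading.Thread(target=decode_worker, args=(results,), daemon=True).start()
    prefill_q.put(Request(prompt=["Hello", "world"]))
    print(results.get().output)  # ['tok0', 'tok1', 'tok2', 'tok3']
```

The hard part this sketch elides is moving the KV cache from the prefill pool to the decode pool: in a production system that transfer typically happens over NVLink or RDMA rather than an in-process queue.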
Now available to enterprises and startups, Atlas Inference sets a new benchmark for scalable, high-performance LLM deployment.