Microsoft AI has announced the release of three new multimodal foundation models designed to generate text, voice, and images, marking a continued expansion of its internal AI stack. The models — MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 — were developed by the MAI Superintelligence team led by Microsoft AI CEO Mustafa Suleyman.
The transcription model supports 25 languages and delivers faster performance than existing Azure offerings, while the voice model enables rapid audio generation and custom voice creation. The image model, previously available through MAI Playground, is now being deployed more broadly across Microsoft Foundry.
Leadership has positioned the release as part of a broader push toward human-centered AI design and cost-efficient model deployment, as Microsoft strengthens its position in the competitive multimodal AI landscape while maintaining its partnership with OpenAI.