GPT-4o can respond to audio in as little as 232 milliseconds

fact

The GPT-4o System Card reports that the model “can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation.” The system card describes GPT-4o as a single model trained end-to-end across text, vision, and audio, with all inputs and outputs processed by the same neural network. Speech therefore does not have to pass through a separate transcription-and-synthesis pipeline before the model can reply.

Sources

PRIMARY https://arxiv.org/abs/2410.21276

Last verified June 6, 2026
