GPT-4o can respond to audio in as little as 232 milliseconds

The GPT-4o System Card reports that the model “can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation.” The low latency follows from training a single model end-to-end across text, vision, and audio: speech no longer has to pass through a separate transcription-and-synthesis pipeline before the model can reply.

Last verified June 6, 2026