OpenAI has announced the release of GPT-4o, a new flagship AI model that can process and generate a combination of text, audio, and images in real time. The “o” in GPT-4o stands for “omni,” signifying the model’s ability to handle multiple modalities.
GPT-4o is a single model trained end-to-end across text, vision, and audio. This allows the model to process all inputs and outputs using the same neural network, preserving information such as tone, multiple speakers, and background noise. OpenAI says GPT-4o is its first model to combine all of these modalities.
GPT-4o aims to provide more natural human-computer interaction by accepting any combination of text, audio, and image inputs and generating corresponding outputs. The model can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is comparable to human response time in a conversation.
The new model matches the performance of GPT-4 Turbo on text in English and code while offering improved performance on text in non-English languages. GPT-4o is also faster and 50% cheaper in the API compared to its predecessor. The model particularly excels in vision and audio understanding compared to existing models, according to OpenAI.
GPT-4o’s text and image inputs and text outputs are being released publicly, while the remaining modalities will be rolled out gradually. Audio outputs will initially be limited to preset voices and will adhere to existing safety policies.
The new model’s capabilities will be iteratively rolled out, with extended red team access starting today. GPT-4o’s text and image capabilities are now available in ChatGPT, with a new version of Voice Mode featuring GPT-4o planned for release in alpha within ChatGPT Plus in the coming weeks.
Developers can access GPT-4o in the API as a text and vision model, with support for audio and video capabilities planned for release to a small group of trusted partners in the coming weeks.
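In practice, calling GPT-4o through the API looks much like calling earlier GPT-4 models, with image content passed alongside text in the same chat request. The sketch below assumes the official openai Python SDK and uses a placeholder image URL and prompt; treat those specifics as illustrative rather than definitive.

from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# A single request mixing text and image input; GPT-4o handles both modalities together.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is shown in this image."},
                {
                    "type": "image_url",
                    # Placeholder URL for illustration only.
                    "image_url": {"url": "https://example.com/sample.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)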
GPT-4o is available in the free tier, and to Plus users with up to 5x higher message limits.
[Image courtesy: OpenAI]