The race to dominate the AI frontier just got another plot twist, and this time the model doesn't just talk back: it studies your images, thinks before answering, and reaches for its own tools.
OpenAI launched the newest members of its "o" series today: o3 and its lightweight sibling, o4-mini. These aren't just tuned-up chatbots. They're reasoning models that work through problems step by step, can fold images directly into that reasoning, and, for the first time, can agentically use and combine every tool inside ChatGPT, from web search and Python to image analysis and image generation. No Frankenstein modules stitched together to fake visual literacy.
This is effectively AI with eyes and hands: it can look at what you show it and act on what it sees.
One model to rule them all?
Don't confuse the branding: the "o" in GPT-4o stands for "omni," but this "o" series is OpenAI's line of reasoning models, and the implications are exactly what you'd expect. A single model can take in a screenshot or a hand-drawn sketch, reason about it mid-thought, and decide on its own when to search the web or run code, all in one conversation. It's the first real hint of a future where AI assistants aren't just in your phone; they effectively are your phone.
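For developers, the easiest way to see the image-reasoning side is the API. Below is a minimal sketch using the official openai Python SDK's Chat Completions endpoint; the prompt and image URL are placeholders, and it assumes your account has access to the model and that it accepts image inputs through this endpoint.

```python
# Minimal sketch: ask an o-series model to reason about an image.
# Assumes the official `openai` Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3",  # or "o4-mini" for the cheaper, faster variant
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this chart imply about Q1 revenue?"},
                # Any publicly reachable image URL (or a base64 data URL) works here.
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```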
Introducing OpenAI o3 and o4-mini—our smartest and most capable models to date.
For the first time, our reasoning models can agentically use and combine every tool within ChatGPT, including web search, Python, image analysis, file interpretation, and image generation. pic.twitter.com/rDaqV0x0wE
— OpenAI (@OpenAI) April 16, 2025
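The agentic tool use OpenAI is describing lives inside ChatGPT itself, but the closest developer-facing equivalent is function calling: you declare tools, and the model decides when to invoke them. The sketch below assumes the official openai Python SDK; the get_weather tool is a made-up example, not one of ChatGPT's built-in tools.

```python
# Sketch of developer-side tool use via function calling.
# The `get_weather` tool is hypothetical; ChatGPT's built-in tools
# (web search, Python, image analysis) are not exposed this way.
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "Should I bike to work in Amsterdam today?"}],
    tools=tools,
)

# If the model decided a tool call is needed, it shows up here
# instead of a plain text answer.
print(response.choices[0].message.tool_calls)
```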
o4-mini is the one built for speed and affordability; Altman calls it "a ridiculously good deal for the price," and it keeps the full tool-using, image-reasoning feature set. o3, meanwhile, is the flagship, which OpenAI bills as its most capable model to date, aimed squarely at hard coding, math, science, and visual reasoning problems.
And it's not just raw capability. o4-mini in particular is cheap and efficient enough to deploy at real scale, and that's the kicker: agentic, multimodal assistants stop being a flashy demo and start being something companies can actually afford to run all day. Think personal assistants that don't just wait for commands, but go off and handle the errand.
Beyond chatbots: Enter the agentic era
With this release, OpenAI is laying the groundwork for the agentic layer of AI—those smarter-than-smart assistants that not only talk and write but observe, act, and autonomously handle tasks.
Want your AI to parse a Twitter thread, generate a chart, draft a tweet, and announce it on Discord with a smug meme? That’s not just within reach. It’s practically on your desk—wearing a monocle, sipping espresso, and correcting your grammar in a delightful baritone.
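That kind of multi-step errand is, mechanically, just a loop around the same API: the model proposes a tool call, your code runs it, and the result goes back into the conversation until the model decides it's done. Here's a stripped-down sketch of that pattern; the tool functions (fetch_thread, make_chart, post_to_discord) are hypothetical stand-ins, not real libraries.

```python
# Hypothetical agent loop: the model picks tools, local code runs them,
# and results are appended back into the conversation.
import json
from openai import OpenAI

client = OpenAI()

# Placeholder implementations standing in for real integrations.
def fetch_thread(url: str) -> str:
    return "thread text..."

def make_chart(data: str) -> str:
    return "chart.png"

def post_to_discord(message: str) -> str:
    return "posted"

LOCAL_TOOLS = {
    "fetch_thread": fetch_thread,
    "make_chart": make_chart,
    "post_to_discord": post_to_discord,
}

def spec(name: str, param: str) -> dict:
    # Build a minimal JSON-schema tool definition the model can see.
    return {"type": "function", "function": {"name": name, "parameters": {
        "type": "object", "properties": {param: {"type": "string"}}, "required": [param]}}}

TOOL_SPECS = [spec("fetch_thread", "url"), spec("make_chart", "data"),
              spec("post_to_discord", "message")]

messages = [{"role": "user",
             "content": "Summarize this thread, chart it, and announce it on Discord."}]

while True:
    reply = client.chat.completions.create(
        model="o4-mini", messages=messages, tools=TOOL_SPECS
    ).choices[0].message

    if not reply.tool_calls:  # no more tools requested: the model is done
        print(reply.content)
        break

    messages.append(reply)
    for call in reply.tool_calls:  # run each requested tool locally
        result = LOCAL_TOOLS[call.function.name](**json.loads(call.function.arguments))
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```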
The o-series models are meant to power everything from autonomous research and coding agents to whatever AI-first hardware comes next, offering a hint at the movement that has tech's old guard (and new) on edge. In the same way the iPhone redefined mobile, these models look like the beginning of AI's native interface era.
OpenAI vs. the field
This isn't happening in a vacuum. Google's Gemini is evolving. Anthropic's Claude is punching above its weight. Meta has a Llama in the lab. But OpenAI's o series may have done something the rest haven't yet nailed: a single reasoning model that can think with images and wield a full toolbox on its own.
o3 and o4-mini are out!
they are very capable.
o4-mini is a ridiculously good deal for the price.
they can use and combine every tool within chatgpt.
multimodal understanding is particularly impressive.
— Sam Altman (@sama) April 16, 2025
This could be OpenAI’s answer to the inevitable: hardware. Whether through Apple’s rumored AI collaboration or its own “Jony Ive stealth mode” project, OpenAI is prepping for a world where AI isn’t just an app—it’s the OS.
Edited by Andrew Hayward