GPT-4o: Actually Good Multimodal AI

OpenAI just made a big move in the AI space with the release of GPT-4o (the “o” stands for “omni”). This new model is crazy because it’s a single model that can process not just text, but also audio and images. And it’s going to be available to free users, at least for text.

This is a clear sign that large language models (LLMs) are becoming a commodity. It’s fantastic news for us as end users because it means we’ll likely see a race to the bottom in terms of pricing. I expect we’ll end up with free access to other models as well (like Claude).

Natural Voice and Active Perception

One thing that really stood out to me in the demos was how incredibly natural the voice sounded. It had little half-laughs and inflections that were eerily similar to the way humans speak. It really was the uncanny valley moment for voice assistants.

But what’s nuts is that GPT-4o can actively “watch” through the camera and use what it sees as input to the model. This extra context is something we have as humans, and it’s going to make the system so much more capable and useful. It’s basically like giving the AI model a pair of eyes.

Copilot Capabilities

Another amazing feature is that the desktop GPT-4o app and the iPad app shown in the demo videos can “see” the screen on your desktop or iPad and provide comments or act as a copilot to help you learn or accomplish tasks. Imagine having a (competent) personal AI assistant by your side, guiding you through complex tasks and helping you learn.

Best Text-on-Image Generation Ever?

Although they didn’t cover this in the video, the blog post showcases some of the best “text on image” generation I’ve ever seen from any AI art generator. The level of detail and accuracy is mind-blowing, even compared to Midjourney and DALL·E 3 (which are both pretty decent at it now).

How Does GPT-4o Stack Up?

I know some rankings on Twitter showed GPT-4o as better than GPT-4 Turbo and Claude 3 Opus, but in my personal testing, I don’t think it’s quite there yet. However, it’s much faster than both, which is a huge plus. And it’s significantly better than GPT-3.5, so it’s going to be a big step up for free users.

Potential Impact on Education and Accessibility

One area where I think GPT-4o could have a massive impact is in education and accessibility. With its ability to process multiple modalities and provide personalized assistance, it could revolutionize the way students learn and make education more accessible to people with different learning styles or disabilities.

The idea has been discussed before, but this is the first time it’s felt “real” that there could be a world where every student has access to a personal AI tutor that can adapt to their unique needs and help them reach their full potential.

One Last Thing

I will say, the voice was slightly too flirty and “cute” for an AI assistant, and I can see why there were a ton of comparisons to “Her”. I actually think this is going to perpetuate the AI girlfriend problem, but I think we’ll adapt. I expect that as AI starts to eat more tech and desk jobs, and as the current movement against constant entertainment grows, we’re going to end up course-correcting into a more “in-person” culture. Or at least I hope so.

Thanks!

- Joseph
