⭐GPT-4o vs GPT-4 vs Gemini 1.5 — Comparative Analysis

May 18, 2024

OpenAI’s recent launch of GPT-4o marks a significant milestone in the evolution of AI language models and our interactions with them. One of the standout features is its capability for live interaction with ChatGPT, enabling real-time conversational interruptions. Despite a few minor issues during the live demo, the achievements of the OpenAI team are nothing short of remarkable. Most excitingly, OpenAI immediately made the GPT-4o API accessible following the demonstration.

In this article, I will provide an independent analysis comparing the classification abilities of GPT-4o, GPT-4, and Google’s Gemini and Unicorn models using a custom English dataset I developed.

Which of these models is strongest in English understanding?

Image taken from OpenAI’s live demo — Source

What’s New with GPT-4o?

At the forefront is the concept of an Omni model, designed to comprehend and process text, audio, and video seamlessly.

OpenAI’s focus appears to have shifted towards democratizing GPT-4 level intelligence, making it accessible even to free users.

OpenAI also announced that GPT-4o includes enhanced quality and speed across more than 50 languages, promising a more inclusive and globally accessible AI experience at a lower price.

They also mentioned that paid subscribers would get five times the capacity compared with non-paid users.

Furthermore, they will release a desktop version of ChatGPT to facilitate real-time reasoning across audio, vision, and text interfaces for the masses.

How to use the GPT-4o API

The new GPT-4o model follows the existing chat-completion API from OpenAI, making it backward compatible and simple to use.

import asyncio

from openai import AsyncOpenAI

OPENAI_API_KEY = "<your-api-key>"


def openai_chat_resolve(response, strip_tokens=None) -> str:
    """Return the text of the first choice, with any unwanted tokens removed."""
    if strip_tokens is None:
        strip_tokens = []
    if response and response.choices:
        content = response.choices[0].message.content
        # The content can be None, so check it before calling strip().
        if content is not None and content.strip() != '':
            content = content.strip()
            for token in strip_tokens:
                content = content.replace(token, '')
            return content
    raise Exception(f'Cannot resolve response: {response}')


async def openai_chat_request(prompt: str, model_name: str, temperature=0.0):
    """Send a single-turn chat-completion request and return the raw response."""
    message = {'role': 'user', 'content': prompt}
    client = AsyncOpenAI(api_key=OPENAI_API_KEY)
    return await client.chat.completions.create(
        model=model_name,
        messages=[message],
        temperature=temperature,
    )


# openai_chat_request is a coroutine, so it has to be awaited (here via asyncio.run).
response = asyncio.run(openai_chat_request(prompt="Hello!", model_name="gpt-4o-2024-05-13"))
print(openai_chat_resolve(response))

GPT-4o is also available through the ChatGPT interface.

Official Evaluation

OpenAI’s blog post includes evaluation scores of known datasets, such as MMLU and HumanEval.

As we can derive from the graph, GPT-4o’s performance can be classified as state-of-the-art in this space — which sounds very promising considering the new model is cheaper and faster.

However, during the last year, I have seen multiple models that claim to have state-of-the-art language performance across known datasets. In reality, some of these models have been partially trained (or overfit) on these open datasets, resulting in unrealistic scores on leaderboards.

Therefore, it is important to do independent analyses of the performance of these models using lesser-known datasets — such as the one that I created 😄

My Evaluation Dataset 🔢

As I have explained in previous articles, I have created a topic dataset that we can use to measure classification performance across different LLMs.

The dataset consists of 200 sentences categorized under 50 topics, some of which are closely related in order to make the classification task harder.

I manually created and labeled the entire dataset in English.

I then used GPT-4 (gpt-4-0613) to translate the dataset into multiple languages.

However, during this evaluation, we will only evaluate the English version of the dataset — meaning that the results should not be affected by potential biases originating from using the same language model for dataset creation and topic prediction.

Go and check out the dataset for yourself: topic dataset.

Performance Results 📊

I decided to evaluate the following models:

GPT-4o: gpt-4o-2024-05-13
GPT-4: gpt-4-0613
GPT-4-Turbo: gpt-4-turbo-2024-04-09
Gemini 1.5 Pro: gemini-1.5-pro-preview-0409
Gemini 1.0: gemini-1.0-pro-002
Palm 2 Unicorn: text-unicorn@001
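Since these models sit behind different APIs, one way to run the same test over all of them is to hide each provider behind a common async function. The sketch below is only an illustration of that idea, not the code used for this article: `classify`, `fake_classify`, and the three sample sentences are hypothetical stand-ins, and a real run would replace `fake_classify` with actual API calls.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

# Display name -> model identifier, as listed above.
MODELS: Dict[str, str] = {
    "GPT-4o": "gpt-4o-2024-05-13",
    "GPT-4": "gpt-4-0613",
    "GPT-4-Turbo": "gpt-4-turbo-2024-04-09",
    "Gemini 1.5 Pro": "gemini-1.5-pro-preview-0409",
    "Gemini 1.0": "gemini-1.0-pro-002",
    "Palm 2 Unicorn": "text-unicorn@001",
}

# Hypothetical interface: (model_id, sentence) -> predicted topic.
Classifier = Callable[[str, str], Awaitable[str]]
Sample = Tuple[str, str]  # (sentence, correct topic)


async def count_mistakes(model_id: str, samples: List[Sample], classify: Classifier) -> int:
    """Classify every sentence with one model and count wrong answers."""
    answers = await asyncio.gather(*(classify(model_id, s) for s, _ in samples))
    return sum(answer != label for answer, (_, label) in zip(answers, samples))


async def evaluate_all(models: Dict[str, str], samples: List[Sample], classify: Classifier) -> Dict[str, int]:
    """Run the same samples through every model concurrently."""
    counts = await asyncio.gather(
        *(count_mistakes(model_id, samples, classify) for model_id in models.values())
    )
    return dict(zip(models.keys(), counts))


async def fake_classify(model_id: str, sentence: str) -> str:
    # Stand-in for a real API call; it ignores the model and uses a keyword rule.
    return "Astronomy" if "comet" in sentence else "Cooking"


# Illustrative samples, not taken from the topic dataset.
samples = [
    ("A comet passed close to Earth.", "Astronomy"),
    ("Simmer the sauce for ten minutes.", "Cooking"),
    ("The striker scored twice.", "Football"),
]
results = asyncio.run(evaluate_all(MODELS, samples, fake_classify))
print(results)
```

Keeping the classifier pluggable means the OpenAI and Google models can share one evaluation loop, with only the thin request wrapper differing per provider.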

The task given to the language models is to match each sentence in the dataset with the correct topic. This allows us to calculate an accuracy score and an error rate for each model.
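To make the task concrete, here is a minimal sketch of how a sentence and a list of candidate topics could be turned into a prompt, and how a free-text answer could be mapped back to a topic. The prompt wording, the helper names `build_prompt` and `match_topic`, and the example topics are my assumptions for illustration; the exact prompt used in the evaluation is not shown in this article.

```python
from typing import List, Optional


def build_prompt(sentence: str, topics: List[str]) -> str:
    """Build a single-turn classification prompt (hypothetical wording)."""
    topic_lines = "\n".join(f"- {topic}" for topic in topics)
    return (
        "Classify the sentence into exactly one of the topics below.\n"
        "Answer with the topic name only.\n\n"
        f"Topics:\n{topic_lines}\n\n"
        f"Sentence: {sentence}"
    )


def match_topic(answer: str, topics: List[str]) -> Optional[str]:
    """Map a raw model answer to a known topic, ignoring case and stray punctuation."""
    cleaned = answer.strip().strip(".\"'").lower()
    for topic in topics:
        if topic.lower() == cleaned:
            return topic
    return None  # the answer is not one of the candidate topics


topics = ["Cooking", "Astronomy", "Football"]  # illustrative, not from the dataset
prompt = build_prompt("The telescope revealed a new comet.", topics)
print(prompt)
print(match_topic(" astronomy. ", topics))
```

Normalizing the answer before comparing it avoids counting a correct prediction as a mistake just because the model added a trailing period or changed the casing.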

Since the models mostly classify correctly, I am plotting the error rate for each model.

Remember that a lower error rate indicates better model performance.

Barplot of the Error Rate for each model

As we can derive from the graph, GPT-4o has the lowest error rate of all the models with only 2 mistakes.
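For reference, the error rate is simply the number of wrong predictions divided by the number of sentences. The sketch below reproduces the arithmetic for 2 mistakes on 200 sentences; the `error_rate` helper and the placeholder labels are assumptions for illustration, not the actual dataset or evaluation code.

```python
from typing import Sequence


def error_rate(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that differ from the correct label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equally long")
    mistakes = sum(p != l for p, l in zip(predictions, labels))
    return mistakes / len(labels)


# Placeholder data: 200 sentences spread over 50 topics, with 2 wrong predictions.
labels = [f"topic_{i % 50}" for i in range(200)]
predictions = list(labels)
predictions[0] = "wrong_topic"
predictions[1] = "wrong_topic"

rate = error_rate(predictions, labels)
print(f"error rate: {rate:.1%}, accuracy: {1 - rate:.1%}")
```

With 200 sentences, each mistake moves the error rate by 0.5 percentage points, which is worth keeping in mind when comparing models that differ by a single mistake.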

We can also see that GPT-4, Gemini 1.5, and Palm 2 Unicorn each made only one more mistake than GPT-4o, showcasing their strong performance. Interestingly, GPT-4 Turbo performs slightly worse than GPT-4-0613, which runs counter to what OpenAI writes on its model page.

Lastly, Gemini 1.0 is lagging behind, which should be expected given its price range.

Conclusion 💡

This analysis using a uniquely crafted English dataset reveals insights into the state-of-the-art capabilities of these advanced language models.

GPT-4o, OpenAI’s latest offering, stands out with the lowest error rate among the tested models, which affirms OpenAI’s claims regarding its performance.

The AI community and users alike must continue performing independent evaluations using diverse datasets, as these help in providing a clearer picture of a model’s practical effectiveness, beyond what is suggested by standardized benchmarks alone.

Note that the dataset is fairly small, so results might vary with a different dataset. The evaluation was done using the English dataset only; a multilingual comparison will have to wait for another time.

Thanks for reading!
