Competition Heats Up: New Challengers Take on OpenAI's AI Dominance

It seems that OpenAI’s unquestioned reign over artificial intelligence is coming to an end. Its generative model GPT-4, introduced just a year ago, had been the technology’s gold standard, but it now faces serious competition. In the past month alone, three companies - Mistral AI, Anthropic, and Inflection AI - have unveiled models capable of rivaling GPT-4, just as Google did with its Gemini model in late 2023. As the race for performance heats up, the industry is plunging into a battle of benchmarks: each new model release is accompanied by a series of standardized evaluations, called benchmarks, meant to quantify the AI’s performance in reasoning, comprehension, coding, and mathematics, among other tasks, and compare it to its competitors. The ultimate goal is to prove that the model matches or surpasses GPT-4. This raises the question: can OpenAI regain its dominance with the highly anticipated GPT-5, whose release date continues to fuel speculation?

Anthropic, a rival to OpenAI, claimed the top spot in the performance race by presenting its model Claude 3 along with results from ten benchmarks, in which it consistently outperformed the competition. On closer examination, however, it surpassed GPT-4 by only a tenth of a percentage point on three of the ten tests. Moreover, Anthropic did not disclose the detailed test results, which calls its claimed superiority into question. “Each company selects the benchmarks in which their model excels,” explains Françoise Soulié-Fogelman, Scientific Advisor at Hub France IA. “They can do this because there is no single dominant benchmark for evaluating large language models,” she adds.

Unlike “traditional” artificial intelligence, where standards have emerged, there is no definitive benchmark for ChatGPT and its counterparts. This is because large language models (LLMs) are inherently versatile: they must be able to perform nearly any task, including ones their developers never considered. As a result, proving one model’s absolute superiority over another is a real challenge, since it requires measuring performance across a wide range of tasks and weighing the importance or relevance of each task relative to the others. Most companies therefore now measure the performance of their AI models for specific use cases rather than in absolute terms. “Because the output of an LLM is highly dependent on the given prompt, it is essential to closely examine and compare its performance with that of another model,” adds a researcher from a reputable institution. This researcher also notes that current benchmarks are becoming saturated, emphasizing the need to develop new evaluation methods for increasingly powerful models.

“The more profound problem is that public benchmarks can be contaminated, unintentionally skewing their results,” warns Stanislas Polu, co-founder of the French startup Dust and a former OpenAI researcher. In practice, benchmarks are human-designed exercises with fixed answers, resulting in a set of predetermined correct responses. Although AI developers commit to not feeding their models the benchmark answers directly, there is no guarantee that the “cheat sheet” doesn’t exist elsewhere in the AI’s training data - for example, in a discussion forum where users talk about the benchmark. The LLM can then simply retrieve the results instead of reasoning through the problem, as if a student took an exam after reading the answers the night before. Preliminary studies have already shown that simply varying the values in benchmark exercises can cause a model’s performance to decline dramatically.
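The perturbation studies mentioned above can be illustrated with a minimal, hypothetical sketch: take a benchmark item, generate variants with fresh values, and compare accuracy on the canonical item against accuracy on the variants. A large gap suggests memorization rather than reasoning. The `toy_model` below is a stand-in for a real LLM call, not any actual API.

```python
import random

def make_variant(a=None, b=None):
    """Generate a benchmark-style arithmetic item.

    Hypothetical example: the canonical (possibly memorized) item
    uses a=17, b=24; variants swap in fresh random values."""
    a = random.randint(10, 99) if a is None else a
    b = random.randint(10, 99) if b is None else b
    return f"What is {a} + {b}?", a + b

def contamination_probe(model_fn, n_variants=50):
    """Compare accuracy on the canonical item vs. perturbed variants.

    A large gap suggests the canonical answer was retrieved from
    training data rather than derived."""
    canonical_q, canonical_a = make_variant(17, 24)
    canonical_ok = model_fn(canonical_q) == canonical_a

    hits = sum(
        1
        for _ in range(n_variants)
        for q, a in [make_variant()]
        if model_fn(q) == a
    )
    return canonical_ok, hits / n_variants

# Toy "model" that only knows the canonical item by heart,
# mimicking a contaminated LLM.
memorized = {"What is 17 + 24?": 41}
def toy_model(question):
    return memorized.get(question, 0)

canonical_ok, variant_acc = contamination_probe(toy_model)
print(canonical_ok, variant_acc)  # the memorizer aces the original, fails variants
```

Real contamination studies are far more elaborate (paraphrasing, n-gram overlap checks against training corpora), but the core logic is this comparison between the published item and its perturbations.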

Before the latest generation of LLMs, developers relied on size criteria (more parameters, larger training datasets, and so on) to demonstrate that a model surpassed its predecessor. Performance gains were built into the AI’s construction, and there was less need to analyze the model’s output. With today’s LLMs, however, parameter count is no longer the sole criterion for improvement: it offers little guarantee of significantly better reasoning, while becoming ever more costly to test.

Behind the battle of benchmarks, one observation emerges: no one clearly surpasses GPT-4. As OpenAI delays the release of GPT-5, the industry seems to be reaching a plateau. “The best model is one year old, and more than 20 months old if we count from the end of its training. Either it turns out that it’s very difficult to surpass GPT-4, and with other competitors catching up, we enter a performance plateau. Or OpenAI releases a new model - GPT-4.5 or GPT-5 - that is clearly superior, returning to the pattern we have known for two years,” projects Stanislas Polu. OpenAI, the company behind ChatGPT, had accustomed industry observers to making major announcements soon after its competitors, in order to steal their thunder. That is why some expected a countermove after Mistral and Anthropic presented their models. Instead of counterattacking, however, OpenAI became embroiled in a reputation war with Elon Musk. Aware of the anticipation surrounding the next model, CEO Sam Altman doesn’t hesitate to play with the audience. When asked about the release date of GPT-5, he enigmatically replied, “Patience, it will be worth the wait.”

Since OpenAI officially announced in November that it was working on GPT-5, Altman has remained silent, fueling curiosity about the progress of its research and potentially affecting the market. “If we enter a phase of stagnation where everyone has the same performance, model developers will have to move up the value chain and invest even more in product creation,” anticipates Stanislas Polu. It is up to OpenAI - or perhaps a less anticipated competitor - to prove that the performance race is not on hold. Meanwhile, investors and the market continue to bet on the continued improvement of artificial intelligence.


Written By

Jiri Bílek

In the vast realm of AI and U.N. directives, Jiri crafts tales that bridge tech divides. With every word, he champions a world where machines serve all, harmoniously.