Shortly after Hunter Lightman joined OpenAI as a researcher in 2022, he watched his colleagues launch ChatGPT, one of the fastest-growing products ever. Meanwhile, Lightman quietly worked on a team teaching OpenAI’s models to solve high school math competitions.
Today that team, known as MathGen, is considered instrumental to OpenAI’s industry-leading effort to create AI reasoning models: the core technology behind AI agents that can do tasks on a computer like a human would.
OpenAI’s models are far from perfect today — the company’s latest AI systems still hallucinate and its agents struggle with complex tasks.
But its state-of-the-art models have improved significantly on mathematical reasoning. One of OpenAI’s models recently won a gold medal at the International Mathematical Olympiad (IMO), a math competition for the world’s brightest high school students. OpenAI believes these reasoning capabilities will translate to other subjects, and ultimately power general-purpose agents that the company has always dreamed of building.
ChatGPT was a happy accident — a low-key research preview turned viral consumer business — but OpenAI’s agents are the product of a yearslong, deliberate effort within the company.
“Eventually, you’ll just ask the computer for what you need and it’ll do all of these tasks for you,” said OpenAI CEO Sam Altman at the company’s first developer conference in 2023. “These capabilities are often talked about in the AI field as agents. The upsides of this are going to be tremendous.”
**The reinforcement learning renaissance**
The rise of OpenAI’s reasoning models and agents is tied to a machine learning training technique known as reinforcement learning (RL), which gives an AI model feedback on whether its choices in a simulated environment were correct.
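To make that feedback loop concrete, here is a minimal RL sketch in Python. The actions, payoffs, and update rule are toy stand-ins invented for illustration, not anything from OpenAI’s training stack: an agent tries actions, a simulated environment scores them, and rewarded actions become more likely over time.

```python
import random

# Toy RL loop: an "agent" picks actions, a simulated environment scores
# them, and positive feedback makes good choices more likely next time.
# All names and numbers here are illustrative, not OpenAI's real code.

ACTIONS = ["plan", "guess", "verify"]
values = {a: 0.0 for a in ACTIONS}  # learned estimate of each action's payoff

def environment_reward(action):
    """Simulated environment: verifying your work pays off most often."""
    success_rate = {"plan": 0.5, "guess": 0.1, "verify": 0.9}
    return 1.0 if random.random() < success_rate[action] else 0.0

def pick_action(epsilon=0.1):
    """Mostly exploit the best-known action, occasionally explore."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(values, key=values.get)

for _ in range(1000):
    action = pick_action()
    reward = environment_reward(action)
    # Nudge the estimate toward the observed reward (the RL update).
    values[action] += 0.05 * (reward - values[action])

print(values)  # "verify" should end up with the highest estimated value
```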
RL has been used for decades. For instance, in 2016, about a year after OpenAI was founded, AlphaGo, an AI system that Google DeepMind trained with RL, gained global attention by beating a world champion at the board game Go.
Around that time, one of OpenAI’s first employees, Andrej Karpathy, began pondering how to leverage RL to create an AI agent that could use a computer. But it would take years for OpenAI to develop the necessary models and training techniques.
By 2018, OpenAI had developed its first large language model in the GPT series, pretrained on massive amounts of internet data using large clusters of GPUs. GPT models excelled at text processing, eventually leading to ChatGPT, but struggled with basic math.
It took until 2023 for OpenAI to achieve a breakthrough, initially dubbed “Q*” and then “Strawberry,” by combining LLMs, RL, and a technique called “test-time computation.” The latter gave the models extra time and computing power to plan, work through problems, and verify their steps before providing an answer.
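A rough sketch of the test-time computation idea, under toy assumptions: spend more inference-time compute by generating several candidate answers and checking each one before responding. Here `generate_candidate` and `verify` are hypothetical stand-ins for a model call and a step-checker; real systems verify reasoning steps, not just final arithmetic.

```python
import random

def generate_candidate(question):
    """Stand-in for one noisy model answer to 'a+b'; sometimes wrong."""
    a, b = map(int, question.split("+"))
    return a + b + random.choice([-1, 0, 0, 0, 1])

def verify(question, answer):
    """A cheap checker; math answers are easy to verify exactly."""
    a, b = map(int, question.split("+"))
    return answer == a + b

def answer_with_budget(question, budget):
    """A bigger compute budget means more attempts and fewer errors."""
    candidate = None
    for _ in range(budget):
        candidate = generate_candidate(question)
        if verify(question, candidate):
            return candidate
    return candidate  # fall back to the last attempt

print(answer_with_budget("17+25", budget=8))  # almost always 42
```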
This allowed OpenAI to introduce a new approach called “chain-of-thought” (CoT), in which a model works through intermediate steps before committing to a solution, and it improved AI’s performance on math questions the models hadn’t seen before.
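Schematically, a chain-of-thought trace looks like the hand-written example below; real reasoning models learn to produce such traces themselves, and theirs run far longer.

```python
# Hand-written illustration of a chain-of-thought trace, not model output.
question = "A train travels 60 km in 45 minutes. What is its speed in km/h?"
chain_of_thought = (
    "45 minutes is 45/60 = 0.75 hours. "
    "Speed = distance / time = 60 / 0.75 = 80 km/h."
)
final_answer = "80 km/h"
print(question, chain_of_thought, "Final answer:", final_answer, sep="\n")
```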
“I could see the model starting to reason,” said OpenAI researcher Ahmed El-Kishky. “It would notice mistakes and backtrack, it would get frustrated. It really felt like reading the thoughts of a person.”
Though individually these techniques weren’t novel, OpenAI uniquely combined them to create Strawberry, which directly led to the development of o1. OpenAI quickly identified that the planning and fact-checking abilities of AI reasoning models could be useful to power AI agents.
“We had solved a problem that I had been banging my head against for a couple of years,” said Lightman. “It was one of the most exciting moments of my research career.”
With AI reasoning models, OpenAI identified two new axes along which to improve its models: using more computational power during post-training, and giving models more time and processing power while answering a question. “OpenAI, as a company, focuses on scalability,” said Lightman.

Shortly after the 2023 Strawberry breakthrough, OpenAI spun up an “Agents” team, led by researcher Daniel Selsam, to push the new paradigm further. Despite the name, OpenAI didn’t initially distinguish between reasoning models and agents; the team’s goal was simply AI systems that could complete complex tasks. Selsam’s work eventually folded into the larger project to develop the o1 reasoning model, whose key contributors included co-founder Ilya Sutskever, Chief Research Officer Mark Chen, and Chief Scientist Jakub Pachocki.
OpenAI invested heavily in its pursuit of AGI, and those investments led to its breakthroughs in AI reasoning models. Because the company prioritized building highly intelligent AI models over commercial products, it was able to put o1 ahead of other efforts. As other AI labs began seeing diminishing returns from traditional pretraining scaling, OpenAI’s decision to explore new training methods proved prescient. Today, much of the AI field’s progress hinges on advances in reasoning models.
**What does it mean for an AI to “reason”?**
The objective of AI research is to recreate human intelligence with computers, and since o1’s launch, ChatGPT’s interface has taken on more human-like labels such as “thinking” and “reasoning.” Asked whether OpenAI’s models truly reason, El-Kishky hedged, framing the question in computer science terms: if the model learns to spend its computing resources effectively to arrive at an answer, then by his definition it is reasoning. Lightman prefers to judge the models by their results rather than by whether their methods resemble human thought.
OpenAI’s researchers concede that others may dispute their terminology and definitions of reasoning; what matters, they argue, is what the models can do. Other AI researchers tend to agree. In a blog post, Nathan Lambert, an AI researcher with the nonprofit AI2, compares AI reasoning models to airplanes: both are useful systems inspired by something in nature (human reasoning and bird flight, respectively) that operate through entirely different mechanisms. And a joint position paper from AI researchers at OpenAI, Anthropic, and Google DeepMind calls for further study of AI reasoning models, noting that their internal workings remain poorly understood.
**The next frontier: AI agents for subjective tasks**
Today’s AI agents work best in well-defined, verifiable domains such as coding. OpenAI’s Codex agent helps software engineers offload simple coding tasks, and Anthropic’s models power popular AI coding tools such as Cursor and Claude Code. These agents mark a new frontier, and they are among the first AI products users have shown a clear willingness to pay for.
General-purpose AI agents such as OpenAI’s ChatGPT Agent and Perplexity’s Comet still struggle with complex, subjective tasks like shopping online or finding a long-term parking spot. These agents often take longer than users would like and make mistakes, underscoring how much the underlying models still need to improve at subjective tasks.
OpenAI researcher Noam Brown says the company has developed new, general-purpose RL techniques that make it possible to teach AI models skills that aren’t easily verified. For instance, OpenAI’s IMO model spawns multiple agents that explore several ideas simultaneously before selecting the best possible answer. Google and xAI have released AI models built on similar techniques, a sign of how quickly the field is moving.
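A loose sketch of that parallel-exploration pattern, with hypothetical agents and a random number standing in for the learned scorer that would judge each attempt:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def agent_attempt(problem, seed):
    """One agent explores an approach and self-scores its answer."""
    rng = random.Random(seed)
    approach = rng.choice(["induction", "contradiction", "construction"])
    confidence = rng.random()  # stand-in for a learned scoring model
    return f"{approach} argument for '{problem}'", confidence

def solve_in_parallel(problem, n_agents=4):
    """Run several agents at once, then keep the highest-scored attempt."""
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        attempts = list(pool.map(lambda s: agent_attempt(problem, s),
                                 range(n_agents)))
    best_answer, _ = max(attempts, key=lambda pair: pair[1])
    return best_answer

print(solve_in_parallel("the sum of the first n odd numbers is n^2"))
```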
OpenAI now aims to fold these gains into its models, potentially starting with the upcoming GPT-5. The company wants to cement its position in the AI market by offering the best AI model for developers and consumers alike, and by building user-friendly AI agents that intuitively understand what users want without making them dig through manual settings.
While OpenAI was once a leader in the AI industry, it now faces strong competition from companies like Google, Anthropic, xAI, and Meta. The challenge for OpenAI lies not only in delivering its envisioned future of AI agents but also in doing so before its competitors.
