A deep dive comparing GPT-5.4 with Anthropic's Claude 4.6 and Google's Gemini 3.1.

OpenAI just dropped GPT-5.4. Here is an honest comparison of how it performs against Claude 4.6 and Gemini 3.1 for coding, writing, and everyday agent tasks.

I didn't expect OpenAI to drop a massive model update this early in March. But here we are. Yesterday, they announced GPT-5.4, aiming squarely at the professional and agentic workflows that Anthropic and Google have been trying to dominate over the last few months. I've spent the last 24 hours digging into the specs, the pricing, and the actual experience of using the model.

It is fast, it is surprisingly good at operating a browser, and it brings some much-needed sanity to how models handle large tool ecosystems. But is it an instant switch from Claude 4.6 or Gemini 3.1? The answer depends entirely on what you are trying to build.

Here is a breakdown of where GPT-5.4 fits into the current frontier landscape.

The big shift into native computer use

We all knew this was coming. Anthropic opened the floodgates with computer use in the Claude 3.5 era, and Claude 4.6 refined it. Now OpenAI has entered the chat.

GPT-5.4 introduces native computer-use capabilities right out of the box. On the OSWorld benchmark, which tests navigating a desktop environment through screenshots, mouse actions, and keyboard input, it hits a 75.0% success rate. That genuinely outpaces human performance on that specific test. It also scores highly on WebArena and Mind2Web, showing it can navigate messy, unoptimized websites.

Compared to Claude 4.6, GPT-5.4 feels a bit more assertive when taking control of a screen. Anthropic's models still tend to pause and ask for confirmation when things get ambiguous. OpenAI built GPT-5.4 to parallelize work and keep moving. If you are building background agents that need to power through 30,000 property tax portals without stopping, GPT-5.4 is probably your best bet right now. I keep thinking about what this means for data entry jobs. It is impressive, but also a bit unsettling to watch it click around a screen with that level of confidence.
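The screenshot-to-action loop these benchmarks measure can be sketched in a few lines. Everything below is hypothetical scaffolding, not any vendor's real API: `Action`, `fake_model`, and the stubbed screen capture are stand-ins for whatever harness you actually wire up.

```python
# Minimal sketch of a computer-use agent loop: capture screen,
# ask the model for one action, execute it, repeat until "done".
# The model call is stubbed; real APIs differ in shape and detail.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def fake_model(screenshot: bytes, goal: str) -> Action:
    # Stand-in for the real model call: finishes immediately.
    return Action(kind="done")

def run_agent(goal: str, max_steps: int = 20) -> list[Action]:
    history: list[Action] = []
    for _ in range(max_steps):
        screenshot = b"...pixels..."       # capture_screen() in a real harness
        action = fake_model(screenshot, goal)
        history.append(action)
        if action.kind == "done":
            break
        # dispatch(action)  # move the mouse / send keys in a real harness
    return history
```

The interesting design question is the step budget: an assertive model like GPT-5.4 will burn through `max_steps` quickly, so the cap is your main safety valve.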

Coding and reasoning in a two-horse race

For the last few months, a lot of developers have been bouncing between Claude 4.6 and GPT-5.3-Codex for programming tasks.

GPT-5.4 basically swallows the Codex engine whole and adds better general reasoning on top of it. OpenAI is claiming a 57.7% score on SWE-Bench Pro. They even introduced an experimental Playwright Interactive skill that allows the model to visually debug web apps while it builds them.

But Claude 4.6 still has that unexplainable human touch for writing and architectural planning. I find myself talking to Claude when I need to figure out how to build something, but I would hand the actual execution to GPT-5.4. Google's Gemini 3.1 Pro is also in this mix, but it tends to shine brightest when you are already deep in the Google Cloud ecosystem or need to process massive amounts of raw text across a 2-million token window.

Managing the tool sprawl

This is arguably the most practical update for developers. As we build more complex agents, we give them more tools. The Model Context Protocol (MCP) made it easy to connect models to databases, APIs, and local files.

The problem is that passing 30 different tool definitions into a model's context window eats up tokens and slows down the response. With GPT-5.4, OpenAI introduced "tool search." Instead of reading every single tool definition upfront, the model gets a lightweight list. When it needs a tool, it looks up the specific definition dynamically.

OpenAI claims this cuts token usage by almost half for tool-heavy workflows. Neither Claude nor Gemini has a native equivalent to this exact mechanism yet. If you are building agents that juggle dozens of APIs, this feature alone makes GPT-5.4 worth testing.
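The mechanism itself is easy to sketch. Here is a minimal illustration of the idea, with made-up tool names and schemas; the real API surface will differ:

```python
# "Tool search" sketch: keep full tool schemas out of the prompt.
# The model sees only names + one-line summaries upfront, and the
# full JSON schema is resolved lazily when a specific tool is needed.
TOOLS = {
    "get_invoice": {
        "summary": "Fetch an invoice by ID",
        "schema": {"type": "object",
                   "properties": {"invoice_id": {"type": "string"}}},
    },
    "send_email": {
        "summary": "Send an email to a customer",
        "schema": {"type": "object",
                   "properties": {"to": {"type": "string"},
                                  "body": {"type": "string"}}},
    },
}

def tool_index() -> str:
    # What goes into the context window upfront: a few tokens per tool.
    return "\n".join(f"{name}: {t['summary']}" for name, t in TOOLS.items())

def lookup_tool(name: str) -> dict:
    # Called only when the model decides it needs this tool.
    return TOOLS[name]["schema"]
```

With 30+ tools, the savings come from the index being a handful of tokens per tool instead of a full schema per tool.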

Web research and the long context game

GPT-5.4 is getting much better at persistent web research. On BrowseComp, a benchmark measuring how well agents can find hard-to-locate information, the GPT-5.4 Pro model hits an 89.3% success rate.

It handles needle-in-a-haystack queries by searching across multiple rounds, synthesizing sources, and rarely giving up early. In ChatGPT, the "Thinking" mode now provides an upfront plan of its thought process. You can actually interrupt it mid-response and tell it to adjust course, which saves a lot of time on complex queries.

On the context side, GPT-5.4 supports up to 1 million tokens. This is great, but Gemini 3.1 still holds the crown here with its massive context windows that handle video and audio natively with incredible recall. If you need to drop three full-length movies into a prompt, Gemini is still the answer.

The spreadsheet and document wars

This is where OpenAI is flexing hard for enterprise dollars. They focused heavily on making GPT-5.4 better at formatting spreadsheets, generating presentations, and structuring long legal documents.

They even launched a dedicated ChatGPT for Excel add-in alongside the model. On the GDPval benchmark, which tests knowledge work across 44 occupations, GPT-5.4 matched or beat human professionals 83% of the time. If you work in finance, law, or consulting, GPT-5.4 is designed to speak your language. It hallucinates less on factual claims and handles long-horizon tasks better than GPT-5.2 did.

Gemini 3.1 integrates beautifully with Google Workspace, so if you live in Google Sheets and Docs, sticking with Gemini makes the most sense. But for raw analytical power on a messy dataset, GPT-5.4 is currently leading the pack.

Pricing and the efficiency argument

Cost is where things get interesting. Frontier models are getting cheaper, but "cheap" is relative depending on your scale.

  • GPT-5.4: $2.50 per million input tokens, $15 per million output.
  • GPT-5.4 Pro: $30 per million input, $180 per million output.

OpenAI is banking on token efficiency here. Because GPT-5.4 requires fewer steps to solve problems and uses the new tool search feature, your overall bill might be lower even if the per-token cost is slightly higher than previous baseline models.
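To see how those per-token prices translate into a bill, here is a quick back-of-the-envelope calculation at the standard GPT-5.4 rates listed above (the 200k/10k token split is an illustrative tool-heavy run, not a measured figure):

```python
# Cost of one agent run at the listed GPT-5.4 rates.
PRICE_IN = 2.50 / 1_000_000    # $ per input token
PRICE_OUT = 15.00 / 1_000_000  # $ per output token

def run_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT

# Example: a tool-heavy run with 200k tokens in, 10k tokens out
cost = run_cost(200_000, 10_000)  # $0.50 in + $0.15 out = $0.65
```

If tool search really halves input tokens on workflows like this, the input side of that bill drops to roughly $0.25 per run, which is where the efficiency argument starts to bite at scale.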

That said, if you need to run high-volume, low-latency background tasks, Gemini 3.1 Flash is still the undisputed king of cost efficiency. You don't use GPT-5.4 to classify a million tweets. You use it to build a web app or automate a complex payroll process.

Which model should you choose?

I genuinely don't believe in a single default model anymore. We are in an era of routing tasks to the right engine.

  • Use GPT-5.4 if you are building autonomous agents that need to use external tools, operate a computer, or write complex frontend code. It is the most robust engine for getting work done without hand-holding.
  • Use Claude 4.6 if you are doing creative writing, nuanced reasoning, or simply want an AI coworker that feels more natural to converse with.
  • Use Gemini 3.1 if you need massive context windows, deep integration with Google Workspace, or require the raw speed and low cost of their Flash models.
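That routing philosophy can be captured in a toy dispatcher. The model names and keyword rules below are illustrative only, not real API identifiers; a production router would classify tasks with a cheap model rather than keyword matching:

```python
# Toy task router implementing the rule of thumb above.
def route(task: str) -> str:
    t = task.lower()
    if any(k in t for k in ("browser", "tool", "frontend", "agent")):
        return "gpt-5.4"       # autonomous execution and tool use
    if any(k in t for k in ("essay", "writing", "brainstorm")):
        return "claude-4.6"    # creative and architectural work
    if any(k in t for k in ("video", "long document", "bulk", "classify")):
        return "gemini-3.1"    # huge context or high-volume/low-cost
    return "gpt-5.4"           # default to the execution engine
```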

I genuinely don't know how long OpenAI will hold the crown for computer use and coding. Anthropic and Google are probably weeks away from their next counter-punches. But for today, GPT-5.4 is an incredibly impressive piece of engineering.

Conclusion

The agent era is moving faster than most people can track. We are finally past the point where models just generate text. They are now actively operating software, managing tools, and doing actual desktop work. GPT-5.4 is a major step in that direction.

If you are building products in this space, I recommend testing the new tool search and computer-use capabilities today. It might change how you design your entire architecture. Start small, give it a few complex tools, and watch how it decides to use them. The results are worth the effort.

Frequently Asked Questions

Is GPT-5.4 better than Claude 4.6 for coding?

GPT-5.4 incorporates the engine from GPT-5.3-Codex, making it incredibly strong for complex frontend tasks and long-horizon tool use. However, Claude 4.6 still holds its own for nuanced debugging and specific architectural planning.

How much does GPT-5.4 cost compared to Gemini 3.1?

GPT-5.4 is priced at $2.50 per million input tokens and $15 per million output tokens. This is generally competitive, but Gemini 3.1 Flash remains the undisputed champion for low-cost, high-volume tasks.

Does GPT-5.4 have computer use capabilities?

Yes, GPT-5.4 is OpenAI's first general-purpose model with native computer-use capabilities. It scores 75% on the OSWorld benchmark, a result that currently leads the industry.