Is GPT-5.2 Truly at a Human Expert Level?

Today we’re joined by Dominic Jainy, an IT professional whose work at the intersection of artificial intelligence, machine learning, and blockchain gives him a unique vantage point on the technology shaping our future. With the recent launch of OpenAI’s GPT-5.2, the battle for AI dominance has intensified, leaving businesses to navigate a landscape of impressive claims and complex realities. We’ll be exploring what this new model’s performance gains truly mean for day-to-day operations, the strategic silence in the ongoing rivalry with Google’s Gemini, and the persistent gap between AI’s promise and its practical application. We will also delve into the critical debate over how to measure success and the surprising economics of these powerful new tools.

The article highlights a huge leap in performance on OpenAI’s GDPval benchmark, from 38.8% to 70.9%. Beyond this metric, what real-world business tasks have you seen it master? Could you share a step-by-step example of how it now handles a complex project differently?

That jump from 38.8% to 70.9% in matching or exceeding human performance is a figure you can really feel in practice. It’s not just an abstract improvement; it’s a tangible shift in capability. I’ve seen it firsthand in tasks that previously felt like they were 80% automated and 20% tedious human cleanup. Take the workforce planning spreadsheet example mentioned in the launch. With GPT-5.1, you’d provide the raw data and your desired structure, and it would assemble a correct but starkly basic table. The real work would begin for you then—formatting, applying formulas, adding conditional highlighting. With GPT-5.2, that entire second stage is absorbed. The model now infers the need for professional presentation, delivering a polished, formatted, and ready-to-use spreadsheet in one go. It’s moved from being a data processor to a genuine project completer.

Given OpenAI’s earlier “code red” over Gemini 3, the launch materials for GPT-5.2 noticeably lacked a direct comparison. What does this strategic silence tell us about the competitive landscape, and how are you measuring these two models against each other in practical applications?

The silence is fascinating, isn’t it? After the urgency of Sam Altman’s “code red” memo, the industry was bracing for a direct, head-to-head comparison in this launch. The fact that it was omitted from the main announcement suggests a shift in strategy. It tells me OpenAI is confident in its product on its own terms and doesn’t feel the need to frame its success solely in relation to its biggest competitor. In practical applications, I’ve stopped relying on any single benchmark. Instead, I create bake-offs for specific, high-value tasks. For example, I’ll task both models with refactoring a large, legacy codebase. I’m not just looking at the output; I’m measuring the entire process. Which model required fewer prompts? Which one grasped the complex, layered context of our internal libraries better? As Rachid Wehbi suggested, a model’s ability to maintain its “train of thought” is infinitely more valuable than a marginal win on a standardized test.

The text notes that GPT-5.2 improves on the “last 20%” of enterprise tasks, like formatting. Where do you still see the biggest gap between AI’s promise and its daily use in business, and what specific metrics should companies use to evaluate this progress?

Bob Hutchins perfectly captured the source of so much enterprise frustration with that “last 20%” comment. GPT-5.2 absolutely narrows that gap in areas like formatting and handoffs. However, the biggest remaining chasm is in true, end-to-end project autonomy. We’re getting excellent results on discrete, well-defined tasks, but asking an AI to manage a complex, multi-step project with multiple dependencies without constant human oversight is still a work in progress. To measure this, companies must look beyond accuracy. I advise tracking metrics like “reduction in manual intervention points” or “total time-to-completion for multi-stage workflows.” These reflect the real economic value. It’s not just about whether the AI can do a single task right; it’s about how much it can do right in a sequence before needing a person to step in.
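To make those two metrics concrete, here is a minimal sketch of how a team might compute them from its own workflow logs. The `StepRecord` structure, the step names, and all the numbers are hypothetical illustrations, not data from OpenAI or from any real deployment.

```python
from dataclasses import dataclass

@dataclass
class StepRecord:
    """One step of a multi-stage workflow (hypothetical log format)."""
    name: str
    minutes: float
    human_intervened: bool

def summarize(run):
    """Count manual intervention points and total time for one workflow run."""
    return {
        "intervention_points": sum(step.human_intervened for step in run),
        "time_to_completion_min": sum(step.minutes for step in run),
    }

def intervention_reduction(baseline, candidate):
    """Fractional drop in manual intervention points, baseline vs. candidate."""
    before = summarize(baseline)["intervention_points"]
    after = summarize(candidate)["intervention_points"]
    if before == 0:
        return 0.0  # nothing to reduce
    return (before - after) / before

# Hypothetical runs of the same four-step workflow with two different models.
baseline_run = [
    StepRecord("gather data", 10, False),
    StepRecord("build table", 15, True),
    StepRecord("apply formatting", 20, True),
    StepRecord("handoff summary", 10, True),
]
candidate_run = [
    StepRecord("gather data", 10, False),
    StepRecord("build table", 8, False),
    StepRecord("apply formatting", 5, False),
    StepRecord("handoff summary", 7, True),
]

baseline_summary = summarize(baseline_run)
candidate_summary = summarize(candidate_run)
reduction = intervention_reduction(baseline_run, candidate_run)
print(baseline_summary, candidate_summary, f"{reduction:.0%} fewer interventions")
```

Tracking both numbers per workflow, rather than per task, reflects the point made above: what matters is how far the sequence runs before a person has to step in.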

Maria Sukhareva criticized OpenAI’s proprietary benchmarks, while Vectara’s independent test showed GPT-5.2 still lags in hallucination rates. How much weight should businesses give to official scores versus independent evaluations? Could you provide an anecdote where these two types of results conflicted?

This is the central challenge for any business leader trying to make an informed decision. Maria Sukhareva’s skepticism is well-founded; when a company develops its own benchmark, it’s essentially grading its own homework. It can show progress on tasks drawn from the 44 occupations the benchmark covers, but it doesn’t guarantee broad, real-world competence. That’s why independent evaluations like Vectara’s are indispensable. Their data, which shows GPT-5.2 with an 8.4% hallucination rate (better, but still behind competitors), is a crucial reality check. I once worked with a model that topped leaderboards for code generation. It was brilliant in demos. But when we piloted it on production code, it began introducing subtle but critical logic errors that our static analysis tools missed. The model was optimized for the benchmark, not for the messy, unpredictable reality of our enterprise environment. A balanced approach is key: view official scores as a signal of intent and independent tests as a measure of reliability.

OpenAI claims GPT-5.2 is more cost-effective due to token efficiency, despite higher per-token pricing. Can you break down how this works? Please walk me through how a company might see a lower final bill for a complex task like refactoring a large codebase.

It’s a classic case of total cost of ownership over sticker price. Yes, the per-token price is higher, at $1.75 per million input tokens and $14 per million output tokens. But the magic is in “token efficiency.” Imagine refactoring a large, complex piece of software with an older model. You’d feed it a function, get a suggestion back, critique it, send a modified prompt, and repeat. That back-and-forth conversation chews through tokens. GPT-5.2’s deeper reasoning lets it understand more of the codebase’s context from the start. You might achieve a superior result in just one or two prompts instead of five or six. So, even though each token costs more, you’re using far fewer of them to reach the finish line. The final bill for that refactoring project drops because overall token consumption falls by more than the unit price rises, making the higher unit price a worthwhile investment.
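The arithmetic behind that claim can be sketched in a few lines. The GPT-5.2 prices are the ones quoted in the answer, assumed here to be per million tokens. The older model’s prices, the number of rounds, and the token counts per round are hypothetical values chosen only for illustration.

```python
# Prices in dollars per one million tokens.
PRICE_PER_MILLION = {
    "older_model": {"input": 1.25, "output": 10.00},  # assumed, for comparison only
    "gpt_5_2": {"input": 1.75, "output": 14.00},      # figures quoted in the interview
}

def task_cost(model, rounds, input_tokens_per_round, output_tokens_per_round):
    """Total dollar cost of a task that takes `rounds` prompt/response exchanges."""
    prices = PRICE_PER_MILLION[model]
    input_cost = rounds * input_tokens_per_round * prices["input"] / 1_000_000
    output_cost = rounds * output_tokens_per_round * prices["output"] / 1_000_000
    return input_cost + output_cost

# Hypothetical refactoring task: each round sends 40,000 tokens of code and
# instructions and receives 8,000 tokens back.
older_cost = task_cost("older_model", rounds=6,
                       input_tokens_per_round=40_000, output_tokens_per_round=8_000)
newer_cost = task_cost("gpt_5_2", rounds=2,
                       input_tokens_per_round=40_000, output_tokens_per_round=8_000)

print(f"older model, 6 rounds: ${older_cost:.3f}")
print(f"GPT-5.2, 2 rounds:     ${newer_cost:.3f}")
```

Under these assumptions the six-round session costs about $0.78 and the two-round session about $0.36. The saving depends entirely on the drop in rounds: if the newer model needed the same six rounds, the higher unit price would make it the more expensive option.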

What is your forecast for the AI model supremacy race in the next 12 months?

Over the next year, I believe the race for supremacy will pivot away from a generalist arms race focused on topping broad benchmarks. Instead, the battle will be fought on the grounds of reliability and specialization. The winning platforms won’t be the ones that can write a poem and debug code in the same breath, but the ones that can offer near-zero hallucination rates for financial analysis or provably secure code generation for enterprise software. We will see a greater emphasis on vertical-specific models and a much deeper conversation around trust and verification. The ultimate winner won’t be the model with the highest IQ, but the one that businesses can depend on to execute critical tasks flawlessly, securely, and cost-effectively, day in and day out.
