Evaluating DeepSeek-R1 and o1: Real-World Performance and Key Insights

In the rapidly evolving field of artificial intelligence, evaluating models on real-world tasks is essential to understanding how to use them. The article “Beyond benchmarks: How DeepSeek-R1 and o1 perform on real-world tasks” compares two prominent AI models, DeepSeek-R1 and OpenAI’s competing model, o1. By examining how well they execute real-world tasks, it looks beyond the traditional benchmark tests usually used to assess these models and gives a more realistic picture of their practical applications.

Introduction to DeepSeek-R1 and o1

The primary objective of the comparison was to evaluate how well the models handle ad hoc tasks that require gathering information from the web, identifying the pertinent data, and carrying out simple tasks that would otherwise take substantial manual effort. The experiments used Perplexity Pro Search, which supports both o1 and R1, so both models were tested under the same conditions. A key observation was that both models are prone to errors, particularly when the input prompt is not specific enough, which can lead to inaccurate or incomplete results.

Interestingly, although o1 demonstrated a slightly superior aptitude for reasoning tasks, R1 offered an advantage with its transparency in the reasoning process. This transparency proved particularly useful in scenarios where mistakes occurred, which is not uncommon in real-world applications involving complex data sets and multifaceted queries. The ability to understand where and why errors happened allows users to refine their approach and enhance the accuracy of subsequent prompts, making R1 a valuable tool in iterative problem-solving processes.

Real-World Task: Calculating Returns on Investments

To assess the models’ abilities, an experiment tested how well they could calculate returns on investments (ROI) using web data. The task assumed that $140 was invested across seven major companies (Alphabet, Amazon, Apple, Meta, Microsoft, Nvidia, and Tesla) on the first day of every month from January to December 2024. The models needed to pull the stock price for the start of each month, split the monthly investment equally among the stocks ($20 per stock), and calculate the current portfolio value.
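The arithmetic behind this task can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used in the article: the function name `portfolio_value`, the tickers `AAA` and `BBB`, and every price are hypothetical, and fractional shares are assumed to be allowed.

```python
from typing import Dict, List

MONTHLY_PER_STOCK = 20.0  # $140 split equally across 7 stocks


def portfolio_value(
    monthly_prices: Dict[str, List[float]],
    current_prices: Dict[str, float],
    per_stock: float = MONTHLY_PER_STOCK,
) -> Dict[str, float]:
    """Return total invested, current value, and ROI for equal monthly buys."""
    invested = 0.0
    value = 0.0
    for ticker, prices in monthly_prices.items():
        # Fractional shares bought at each month's starting price (assumption).
        shares = sum(per_stock / price for price in prices)
        invested += per_stock * len(prices)
        value += shares * current_prices[ticker]
    return {
        "invested": invested,
        "value": value,
        "roi": (value - invested) / invested,
    }


# Hypothetical prices for two made-up tickers over three months (not real data).
example_prices = {"AAA": [10.0, 20.0, 40.0], "BBB": [50.0, 50.0, 50.0]}
example_current = {"AAA": 40.0, "BBB": 100.0}
result = portfolio_value(example_prices, example_current)
print(result)
```

The key point is that each monthly purchase buys a different number of shares, so the calculation has to accumulate shares month by month rather than compare only the first and last prices.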

Despite the seemingly straightforward nature of this task, both models failed to perform it accurately. The o1 model returned a list of stock prices for January 2024 and January 2025, along with an irrelevant formula, and did not calculate the values correctly. It erroneously concluded that there was no ROI. R1, for its part, invested only in January 2024 and then calculated the returns for January 2025. However, R1’s transparent reasoning showed that it relied on Perplexity’s retrieval engine to obtain the monthly stock prices, which pointed directly to the source of its calculation errors.

Addressing Data Retrieval Issues

To work around the retrieval issue, the models were next given the required data in a text file. The file included the name of each stock and an HTML table with price data from January to December 2024. Even with retrieval taken out of the picture, both models again failed to deliver accurate results. The o1 model did extract the data but suggested doing the calculations manually in Excel, and its vague reasoning made troubleshooting difficult.
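Extracting prices from an HTML table like the one described can be sketched with Python's standard-library `html.parser`. The article does not show the actual file, so the table layout, column names, and prices below are assumptions made purely for illustration.

```python
from html.parser import HTMLParser
from typing import List


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: List[str] = []
        self._cell: List[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cell = []

    def handle_data(self, data):
        # Ignore whitespace and text that sits outside table cells.
        if self._in_cell:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
            self._row.append("".join(self._cell).strip())
        elif tag == "tr":
            self.rows.append(self._row)


# Hypothetical table layout and prices (not the file used in the article).
html = """
<table>
  <tr><th>Month</th><th>Price</th></tr>
  <tr><td>2024-01</td><td>100.50</td></tr>
  <tr><td>2024-02</td><td>110.25</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)
header, *body = parser.rows
prices = {month: float(price) for month, price in body}
print(prices)
```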

R1 correctly parsed the HTML data and performed the month-by-month calculations, but its final value got lost in the reasoning chain. A stock split event for Nvidia further confounded its final output. Nevertheless, R1’s detailed reasoning trace let users see where the model went wrong and how to refine the prompt and the data formatting for better results. That kind of feedback is valuable when iterating on a task.
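A stock split changes the price basis partway through the year, which is why it can derail a month-by-month calculation. Nvidia carried out a 10-for-1 split in June 2024, so prices quoted before the split are roughly ten times the post-split basis. The sketch below shows one common way to put all months on the same basis before computing returns. The function name `split_adjust` and all prices are hypothetical, and the article does not say how the source data handled the split.

```python
from typing import Dict


def split_adjust(
    prices: Dict[str, float], split_month: str, ratio: float
) -> Dict[str, float]:
    """Divide every pre-split price by the ratio so all months share one basis."""
    # Month keys are "YYYY-MM" strings, so string comparison orders them correctly.
    return {
        month: price / ratio if month < split_month else price
        for month, price in prices.items()
    }


# Hypothetical raw prices; "2024-07" is assumed to be the first post-split quote.
raw = {"2024-05": 900.0, "2024-06": 1100.0, "2024-07": 120.0}
adjusted = split_adjust(raw, split_month="2024-07", ratio=10.0)
print(adjusted)
```

Without an adjustment like this, shares bought before the split would be undercounted by the split ratio when valued at a post-split price.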

Performance Comparison in NBA Statistics Analysis

Another experiment asked the models to compare four leading NBA centers and determine which one improved the most in field goal percentage (FG%) from the 2022/2023 season to the 2023/2024 season. Although the task requires multi-step reasoning, it is straightforward because player stats are publicly available. Both models identified Giannis Antetokounmpo as the most improved player, although the statistical figures varied slightly between the two models.

However, both models made a notable error by including Victor Wembanyama in the comparison. They failed to take into account that he was an NBA rookie in 2023/2024 and therefore had no NBA stats for the 2022/2023 season, which should have excluded him from this analysis. R1 stood out here by providing a more detailed breakdown, including a comparison table and source links, which helped in refining the prompt. Once the prompt specified that only NBA season stats should be used, the model excluded Wembanyama from the results, showing how user input and model transparency together can lead to more accurate outputs.
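The comparison itself is a simple difference between two seasons, with one guard: a player without stats in both seasons cannot be compared. The following sketch illustrates that check. The player names and FG% values are placeholders, not real statistics, and the function name `fg_improvement` is hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical players and FG% values, for illustration only (not real statistics).
# None marks a season with no NBA games, such as the year before a rookie season.
fg_pct: Dict[str, Tuple[Optional[float], Optional[float]]] = {
    "Player A": (55.0, 61.0),
    "Player B": (63.0, 58.0),
    "Player C": (None, 46.0),
}


def fg_improvement(stats):
    """Return per-player FG% change, skipping anyone missing a season."""
    return {
        name: round(after - before, 1)
        for name, (before, after) in stats.items()
        if before is not None and after is not None
    }


changes = fg_improvement(fg_pct)
best = max(changes, key=changes.get)
print(changes, best)
```

Here "Player C" is dropped because there is no prior NBA season to compare against, which is the check both models skipped for Wembanyama.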

Insights and Limitations of AI Models

Evaluating real-world performance is crucial for understanding and using these models. Assessing them through practical use cases reveals how well they perform outside controlled experimental environments, where results can differ significantly from benchmark scores. In these experiments, neither DeepSeek-R1 nor o1 reliably completed multi-step tasks that depend on retrieved data, and vague prompts made errors more likely. o1 was slightly stronger at reasoning, but R1’s visible reasoning trace made its mistakes easier to diagnose and its prompts easier to refine. As AI becomes more integrated into daily life, this kind of real-world validation becomes increasingly important.
