Evaluating ChatGPT for Software Vulnerability Tasks: A Comparative Analysis

With its impressive 1.7 trillion parameters, ChatGPT has emerged as a powerful language model. However, its applicability to code-oriented tasks, such as software vulnerability analysis and repair, remains relatively unexplored. In this article, we delve into the evaluation of ChatGPT against code-specific models, specifically examining its performance on four vulnerability tasks using the Big-Vul and CVEFixes datasets. This comprehensive analysis sheds light on the potential limitations of using ChatGPT for software vulnerability tasks while emphasizing the need for domain-specific fine-tuning.

Evaluation of ChatGPT against code-specific models

To comprehensively evaluate ChatGPT’s performance, security analysts conducted experiments using the Big-Vul and CVEFixes datasets. These datasets provide a comprehensive set of vulnerability tasks, enabling a thorough comparison of ChatGPT against baseline methods. The evaluation focused on the F1-measure and top-10 accuracy metrics.

The results of the evaluation revealed that ChatGPT achieved an F1-measure of 10% and 29% on the Big-Vul and CVEFixes datasets, respectively. These scores were significantly lower compared to the other baseline methods. Similarly, the top-10 accuracy of ChatGPT was 25% and 65%, which again reflected the lowest performance among the examined models.

Analysis of Multiclass Accuracy

In addition to F1-measure and top-10 accuracy, multiclass accuracy was also considered as a crucial performance indicator. The analysis revealed that ChatGPT achieved the lowest multiclass accuracy of 13%, showcasing a striking 45%-52% difference from the best baseline model. These outcomes underscore the challenges faced by ChatGPT in accurately classifying vulnerability tasks across multiple classes.

Evaluation of Severity Estimation

Severity estimation holds paramount importance in vulnerability analysis to prioritize remediation efforts. However, ChatGPT’s performance in this regard proved to be unsatisfactory. The evaluation indicated that ChatGPT exhibited the highest mean squared error (MSE) of 5.4 and 5.85, implying its inaccurate severity estimation compared to the other baselines. This finding raises concerns about relying on ChatGPT for precise severity estimation in vulnerability assessment.

Assessment of Repair Patch Generation

One vital aspect of vulnerability repair is the generation of correct repair patches. Regrettably, ChatGPT failed to generate accurate repair patches in this evaluation. On the other hand, the baseline models demonstrated success in rectifying vulnerable functions, correctly repairing 7% to 30% of them. This stark contrast highlights the limitations of ChatGPT in generating effective repair solutions.

Limitations of fine-tuning

Fine-tuning is a commonly employed technique to optimize language models for specific tasks. However, in the case of ChatGPT, fine-tuning for vulnerability tasks is not viable due to proprietary parameters. This constraint further underlines the challenges in adapting ChatGPT directly for software vulnerability tasks.

The Importance of Domain-specific Fine-tuning

The analysis of ChatGPT’s performance in vulnerability tasks underscores the significance of domain-specific fine-tuning. The complexity and specificity of software vulnerability tasks necessitate the customization of language models like ChatGPT to better suit the requirements. This suggests the need for further research and work on fine-tuning or adapting ChatGPT specifically for software vulnerability tasks.

Comparison with previous studies

While previous studies have examined the effectiveness of large language models in automated program repair, they have not accounted for the latest versions of ChatGPT. This article bridges that gap by shedding light on the specific performance of ChatGPT in software vulnerability tasks. Additionally, the notable disparities in results indicate the necessity for dedicated exploration of ChatGPT’s potential in this domain.

In conclusion, the evaluation of ChatGPT for software vulnerability tasks reveals its limitations in comparison to code-specific models. The lower F1-measure, top-10 accuracy, multiclass accuracy, inaccurate severity estimation, and inability to generate correct repair patches highlight the challenges faced by ChatGPT in this context. The proprietary nature of its parameters further restricts fine-tuning for vulnerability tasks. As such, this study emphasizes the need for additional research and efforts to fine-tune or tailor ChatGPT specifically for software vulnerability analysis and repair. By addressing these challenges, ChatGPT could potentially be leveraged more effectively in securing software systems in the future.

Explore more

Will the OnePlus Turbo 6X Redefine Budget Battery Life?

The persistent frustration of reaching for a mobile device mid-afternoon only to find a low-battery notification remains a defining struggle for modern smartphone users across all price tiers. While flagship models often receive the latest efficiency optimizations, budget-conscious consumers have traditionally been forced to trade performance for longevity or settle for cumbersome, heavy chassis designs. Recent developments in battery chemistry

How Is the OnePlus 2026 Sale Shaking Up the Indian Market?

Dominic Jainy brings a seasoned perspective from the intersection of high-performance IT and consumer hardware. As an expert in artificial intelligence and machine learning, he understands that the hardware we carry is the foundation for the next generation of software experiences. In this conversation, we explore the strategic implications of the OnePlus Community Sale 2026, examining how significant price corrections

How Are Hackers Exploiting Trusted Services and Plugins?

Dominic Jainy is an IT professional whose career has been defined by a deep curiosity for the structural integrity of the digital world. With extensive expertise in artificial intelligence, machine learning, and blockchain, he has spent years analyzing how complex systems can be both optimized and exploited. Dominic brings a uniquely holistic perspective to cybersecurity, often looking beyond the immediate

Will Pepeto Outperform Dogecoin After Its New Listing?

The digital asset landscape is currently weathering a period of intense turbulence, with the total market value shedding over 8% in a single week, leaving many seasoned traders paralyzed by uncertainty. Amidst this volatility, the original meme coin, Dogecoin, is attempting a massive institutional pivot through high-level enterprise partnerships, while newer utility-focused projects are capturing the capital that has fled

Trend Analysis: Remote Employee Moonlighting

The quiet transition from traditional single-employer loyalty to a stealthy multi-job lifestyle is fundamentally restructuring the modern professional contract. As the digital economy removes the physical barriers of the office, the phenomenon of “polygamous working” has emerged as a significant disruptor for human resource departments globally. What once existed as a side hustle in the gig economy has evolved into