OpenAI researchers have introduced CriticGPT, a new artificial intelligence model designed to detect and critique errors in ChatGPT-generated code, Ars Technica reports. The model aims to improve the alignment of AI systems with human expectations through reinforcement learning from human feedback (RLHF), a technique used to improve the accuracy of large language model (LLM) outputs.

In its research paper “LLM Critics Help Catch LLM Bugs”, OpenAI explains that CriticGPT serves as an assistant for human trainers who review ChatGPT-generated code.

Built on the GPT-4 family of models, CriticGPT analyzes code and highlights potential errors, helping human reviewers identify bugs that might otherwise go unnoticed.
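OpenAI has not published an output format for CriticGPT, but a critique of this kind is easiest to picture as a set of comments anchored to specific spans of the reviewed code. The short Python sketch below is purely illustrative: the class, field names, and example findings are assumptions made for this article, not part of OpenAI’s system.

```python
from dataclasses import dataclass

@dataclass
class CritiqueComment:
    """One critic finding, anchored to a span of the reviewed code.
    This structure is hypothetical, for illustration only."""
    start_line: int
    end_line: int
    severity: str   # e.g. "bug" or "security"; the categories are assumed
    comment: str

# Hypothetical findings a critic model might attach to a ChatGPT answer.
critique = [
    CritiqueComment(12, 14, "bug",
                    "File handle is never closed if parsing raises."),
    CritiqueComment(27, 27, "security",
                    "User input is interpolated into a shell command."),
]
for c in critique:
    print(f"Lines {c.start_line}-{c.end_line} [{c.severity}]: {c.comment}")
```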

The development of CriticGPT involved training the model on a large number of inputs containing intentionally inserted errors. Human trainers took code written by ChatGPT, introduced bugs by hand, and then wrote feedback as if they had just discovered those errors themselves. This process taught the model to identify and critique many different types of coding error.
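As a rough illustration of that data-construction step, the snippet below assembles one such training record in Python. The field names and the example bug are assumptions chosen for clarity; the paper does not specify OpenAI’s actual schema.

```python
# Toy version of one "tampering" record: a trainer takes model-written
# code, inserts a bug, and writes the critique that catches it.
# All field names below are illustrative, not OpenAI's actual schema.

clean_code = "def mean(xs):\n    return sum(xs) / len(xs)\n"
buggy_code = "def mean(xs):\n    return sum(xs) / (len(xs) - 1)\n"  # bug inserted by the trainer

training_record = {
    "task": "Write a function that returns the mean of a list of numbers.",
    "tampered_answer": buggy_code,
    "trainer_critique": (
        "Divides by len(xs) - 1 instead of len(xs), so every result is "
        "too large; it also raises ZeroDivisionError for a one-element list."
    ),
    "original_answer": clean_code,
}
print(training_record["trainer_critique"])
```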

The study found that teams pairing humans with CriticGPT produced more comprehensive critiques than humans working alone, while hallucinating fewer problems than the model did on its own.

Interestingly, CriticGPT’s capabilities are not limited to code review. The model was tested on a subset of ChatGPT training data that human annotators had previously rated as flawless. CriticGPT found errors in 24% of that data, and human reviewers later confirmed those errors. This demonstrates the model’s potential to detect subtle mistakes that even careful human review can miss.

Despite the promising results, CriticGPT has limitations. It was trained on relatively short ChatGPT responses, which may not fully prepare it to evaluate longer, more complex ones. And while CriticGPT reduces the rate of AI hallucinations, it does not eliminate them entirely, so human trainers can still make labeling mistakes based on its false outputs.

The research team acknowledges that CriticGPT is most effective at detecting errors that can be pinpointed to one specific place in the code. Real-world errors in AI output, however, are often spread across multiple parts of an answer, which poses a challenge for future iterations of the model.

OpenAI plans to integrate models like CriticGPT into its RLHF labeling pipeline, giving its trainers AI assistance. The move is aimed at developing better tools for evaluating the output of LLM systems, which can be difficult for humans to assess without additional help.
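As background on why better labels matter here: in a typical RLHF pipeline, labelers compare candidate responses, and those comparisons train a reward model with a standard pairwise preference loss, so better-informed judgments translate directly into a better training signal. The PyTorch sketch below shows that loss with toy values; it is a minimal illustration of the general technique, not OpenAI’s implementation.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss commonly used to train RLHF reward models:
    push the reward of the labeler-preferred response above the reward
    of the rejected one. Inputs are per-comparison scalar rewards."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy rewards a (hypothetical) reward model assigned to paired responses.
chosen = torch.tensor([1.3, 0.4])    # responses labelers preferred
rejected = torch.tensor([0.2, 0.9])  # responses labelers rejected
print(preference_loss(chosen, rejected).item())
```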