All generative AI models, from Google’s Gemini to Anthropic’s Claude to OpenAI’s GPT-4o, have been found to hallucinate to varying degrees. These models can be unreliable narrators, sometimes with hilarious results and other times with more serious consequences.
### Benchmarking Hallucinations
A recent study by researchers from Cornell, the universities of Washington and Waterloo, and the nonprofit research institute AI2 set out to benchmark the hallucinations of models like GPT-4o against authoritative sources on a range of topics. No model performed exceptionally well across all subjects, and the models that hallucinated least did so partly by declining to answer questions they would otherwise have gotten wrong.
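The paper’s actual evaluation harness isn’t reproduced here, but the core idea, comparing model answers to authoritative reference answers while counting abstentions separately, can be sketched briefly. Everything in the snippet below (the `Item` fields, the string-matching checks, the canned stand-in model) is a simplified assumption for illustration, not the study’s methodology:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Item:
    question: str
    reference: str  # answer drawn from an authoritative source

def evaluate(items: list[Item], ask_model: Callable[[str], str]) -> dict:
    """Score answers against references, counting abstentions separately,
    since a model can look accurate simply by refusing to answer."""
    correct = hallucinated = abstained = 0
    for item in items:
        answer = ask_model(item.question)
        if "i don't know" in answer.lower():            # crude abstention check
            abstained += 1
        elif item.reference.lower() in answer.lower():  # crude correctness match
            correct += 1
        else:
            hallucinated += 1                           # unsupported claim stated as fact
    n = len(items)
    return {
        "accuracy": correct / n,
        "hallucination_rate": hallucinated / n,
        "abstention_rate": abstained / n,
    }

# Demo with a canned "model" standing in for a real API call.
items = [
    Item("Who wrote Hamlet?", "William Shakespeare"),
    Item("What is the capital of Australia?", "Canberra"),
]
fake_model = lambda q: "I don't know." if "Australia" in q else "William Shakespeare wrote Hamlet."
print(evaluate(items, fake_model))
# {'accuracy': 0.5, 'hallucination_rate': 0.0, 'abstention_rate': 0.5}
```

Tracking abstentions as their own category matters because, as the study observed, a model can appear more factual simply by refusing to answer questions it would get wrong.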
### Trusting AI Outputs
According to Wenting Zhao, a doctoral student at Cornell and co-author of the study, even the best AI models can generate hallucination-free text only about 35% of the time, which raises real concerns about the reliability of AI-generated content.
### Testing Different Models
The study evaluated more than a dozen popular models, including GPT-4o, Meta’s Llama 3 70B, Mistral’s Mixtral 8x22B, and Cohere’s Command R+. Despite major AI vendors’ claims of progress on factuality, the results indicate that all of these models remain prone to hallucination.
### Future Improvements
Zhao suggests that near-term gains in reducing hallucinations may be modest, and argues that humans remain essential for fact-checking and validating what generative AI models produce. Developing better fact-checking tools and building human-in-the-loop review into publishing workflows could help mitigate hallucinations in AI-generated content.
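As one illustration of what a human-in-the-loop process could look like, the sketch below routes low-confidence claims to a human reviewer instead of publishing them directly. The confidence score, the 0.9 threshold, and the `review_queue` are hypothetical choices for this example, not a design proposed by the researchers:

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    confidence: float  # assumed to come from some upstream verifier

@dataclass
class ReviewPipeline:
    threshold: float = 0.9                                # assumed cutoff
    review_queue: list[Claim] = field(default_factory=list)

    def route(self, claim: Claim) -> str:
        """Publish high-confidence claims; hold the rest for a human fact-checker."""
        if claim.confidence >= self.threshold:
            return "publish"
        self.review_queue.append(claim)                   # queued for human review
        return "needs_review"

pipeline = ReviewPipeline()
print(pipeline.route(Claim("Paris is the capital of France.", 0.98)))       # publish
print(pipeline.route(Claim("The Eiffel Tower was built in 1789.", 0.40)))   # needs_review
```

The design choice here is simply that nothing below the confidence threshold reaches readers without a person checking it first, which is the kind of safeguard Zhao describes.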
In conclusion, the study underscores the ongoing challenge of making AI models accurate and reliable. Progress has been made, but substantial room remains to reduce hallucinations and improve the trustworthiness of AI-generated content.
