FACTS Grounding: A new benchmark for evaluating the factuality of large language models
In this blog post, we introduce FACTS Grounding, a benchmark for evaluating how well large language models (LLMs) generate responses that are factually accurate with respect to provided source material. The benchmark comprises 1,719 examples, each carefully crafted to require a long-form response grounded in a provided document. The dataset is divided into a public set for open evaluation and a private held-out set used for leaderboard scoring.
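As a rough illustration of what a benchmark example looks like, the sketch below pairs a user request with the context document the model must stay grounded in. The field names (system_instruction, user_request, context_document) are assumptions for illustration, not the official schema of the released dataset.

```python
from dataclasses import dataclass


@dataclass
class GroundingExample:
    """One FACTS Grounding example (field names are illustrative, not the official schema)."""
    system_instruction: str   # e.g. "Answer using only the document below."
    user_request: str         # the user's question or task
    context_document: str     # the source text the response must be grounded in


example = GroundingExample(
    system_instruction="Answer the question using only the provided document.",
    user_request="Summarize the refund policy described in the document.",
    context_document="... full policy text ...",
)

# The model receives the instruction, the document, and the request together,
# and must produce a long-form answer supported entirely by the document.
prompt = f"{example.system_instruction}\n\n{example.context_document}\n\n{example.user_request}"
```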
To track progress and evaluate AI systems on their ability to generate responses that are not only factually accurate but also sufficiently detailed, we have launched the FACTS Grounding leaderboard on Kaggle. We intend to maintain and update the leaderboard regularly as the field advances, so that it remains a thorough measure of LLM factuality.
To keep the automated judging reliable, we developed comprehensive judging prompt templates and verified their agreement with human raters. The benchmark examples comprise complex inputs spanning multiple domains that require extensive grounding in the accompanying documents. We evaluate each model's responses with three judge models, so that no single judge's preferences dominate the results. This approach rewards LLMs that respond accurately to diverse user requests while penalizing hallucinations and unsupported claims.
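The sketch below shows, at a very high level, how an automated judge can be prompted to check grounding. The template wording and the `call_judge` helper are illustrative assumptions, not the exact judging templates used by FACTS Grounding.

```python
# Minimal sketch of an automated grounding judge. The prompt wording and the
# `call_judge` callable are hypothetical stand-ins for the real judging setup.

JUDGE_TEMPLATE = """You are given a source document, a user request, and a model response.
Decide whether every factual claim in the response is supported by the document.

Document:
{document}

Request:
{request}

Response:
{response}

Answer with a single word: "grounded" or "ungrounded"."""


def judge_response(call_judge, document: str, request: str, response: str) -> bool:
    """Return True if the judge model considers the response fully grounded.

    `call_judge` is any function that sends a prompt to a judge LLM and returns
    its text output; wire it to the model provider of your choice.
    """
    prompt = JUDGE_TEMPLATE.format(document=document, request=request, response=response)
    verdict = call_judge(prompt).strip().lower()
    return verdict.startswith("grounded")
```

In practice the same response would be judged by several such models and the verdicts aggregated, as described below.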
FACTS Grounding scores are produced automatically by the AI judge models using the judging templates. The scoring process first evaluates whether a response is eligible, that is, whether it adequately addresses the user's request, and then evaluates whether the response is accurately grounded in the provided document. The final benchmark score is the average of all judge models' scores across all examples.
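The snippet below is a small sketch of that aggregation: each judge scores every example, and the final score averages over judges and examples. The data layout and the convention that ineligible responses are assigned 0.0 are assumptions made for illustration.

```python
# Sketch of the score aggregation described above. Assumes ineligible
# responses have already been assigned a score of 0.0 (an illustrative choice).

def final_score(per_judge_scores: list[list[float]]) -> float:
    """per_judge_scores[j][i] is judge j's score (0.0-1.0) for example i."""
    judge_averages = [sum(scores) / len(scores) for scores in per_judge_scores]
    return sum(judge_averages) / len(judge_averages)


# Example: three judges scoring the same four examples.
scores = [
    [1.0, 0.0, 1.0, 1.0],  # judge A
    [1.0, 0.0, 1.0, 0.0],  # judge B
    [1.0, 1.0, 1.0, 0.0],  # judge C
]
print(f"Final FACTS Grounding score: {final_score(scores):.2f}")
```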
The FACTS Grounding dataset poses genuine challenges, and we believe it can serve as a useful benchmark for improving AI systems. We encourage researchers, engineers, and developers to engage with FACTS Grounding by evaluating their models on the public set of examples or by submitting their models for evaluation.
Comprehensive benchmarking, coupled with continuous research and development, is essential for improving AI systems in natural language processing and dialogue generation. Use of the FACTS Grounding dataset and leaderboard is subject to Google's Terms and Conditions.