Stop benchmarking in the lab: Inclusion Arena shows how LLMs perform in production

Benchmarks for testing models are increasingly important for enterprises that need to assess performance against their own requirements. However, many existing benchmarks rely on static datasets or fixed testing environments, which may not accurately reflect how models behave in real-world applications.

Researchers from Inclusion AI, affiliated with Alibaba’s Ant Group, have introduced a new model leaderboard called Inclusion Arena. This initiative aims to evaluate model performance based on real-world usage and user preferences, rather than static knowledge alone.

In a recent paper, the researchers describe Inclusion Arena as a live leaderboard that connects advanced large language models (LLMs) to actual AI applications. The system randomly pairs models for comparison during multi-turn dialogues in real usage scenarios, rather than relying on a fixed test set.
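In practice, this kind of setup amounts to sampling two models from a pool for an eligible dialogue turn, sending both the same prompt and conversation history, and letting the user's choice become a preference label. The sketch below illustrates that flow under assumptions of our own; the model pool, the `call_model` helper, and the record format are hypothetical placeholders, not Inclusion Arena's actual integration.

```python
# Minimal sketch of random pairwise sampling during a dialogue turn.
# Model names, call_model, and the record layout are hypothetical,
# not taken from the Inclusion Arena paper.
import random

MODEL_POOL = ["model-a", "model-b", "model-c", "model-d"]

def call_model(model_name: str, history: list[str], prompt: str) -> str:
    """Placeholder for a real model API call."""
    return f"[{model_name} response to: {prompt}]"

def sample_comparison(prompt: str, history: list[str]) -> dict:
    """Pick two distinct models at random, collect a response from each,
    and return a record ready for the user's preference label."""
    model_x, model_y = random.sample(MODEL_POOL, 2)
    return {
        "history": history,
        "prompt": prompt,
        "candidates": {
            model_x: call_model(model_x, history, prompt),
            model_y: call_model(model_y, history, prompt),
        },
        "preferred": None,  # filled in once the user picks a response
    }
```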

Inclusion Arena distinguishes itself from static benchmarks and leaderboards such as MMLU and the OpenLLM Leaderboard through its dynamic approach to ranking models. It uses the Bradley-Terry model, which the researchers assert produces more stable ratings than the more widely used Elo rating method.
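The Bradley-Terry model treats each pairwise preference as a contest and assigns every model a latent strength, so that the probability of model i being preferred over model j is θ_i / (θ_i + θ_j). The following sketch fits those strengths with a standard iterative (minorization-maximization) update; the model names and win counts are invented for illustration and are not Inclusion Arena data or its implementation.

```python
# Minimal sketch of Bradley-Terry rating estimation from pairwise comparisons.
# Win counts below are hypothetical, not Inclusion Arena results.
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200, tol: float = 1e-8) -> np.ndarray:
    """Estimate Bradley-Terry strengths from a wins matrix.

    wins[i, j] = number of times model i was preferred over model j.
    Returns strengths normalized to sum to 1.
    """
    n = wins.shape[0]
    comparisons = wins + wins.T          # total head-to-head counts per pair
    theta = np.ones(n)
    for _ in range(iters):
        new_theta = np.empty(n)
        for i in range(n):
            denom = 0.0
            for j in range(n):
                if i != j and comparisons[i, j] > 0:
                    denom += comparisons[i, j] / (theta[i] + theta[j])
            new_theta[i] = wins[i].sum() / denom if denom > 0 else theta[i]
        new_theta /= new_theta.sum()
        if np.max(np.abs(new_theta - theta)) < tol:
            return new_theta
        theta = new_theta
    return theta

# Hypothetical head-to-head results for three models.
models = ["model-a", "model-b", "model-c"]
wins = np.array([
    [0, 30, 45],
    [20, 0, 40],
    [15, 10, 0],
])
for name, strength in zip(models, bradley_terry(wins)):
    print(f"{name}: {strength:.3f}")
```

Because strengths are estimated jointly from all pairwise outcomes rather than updated match by match, the resulting ranking is less sensitive to the order in which comparisons arrive, which is the stability advantage the researchers cite over Elo.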

The framework integrates directly into AI applications and currently draws on two apps: Joyland, a character-chat interface, and T-Box, which focuses on educational communication. User prompts are sent to multiple LLMs, and users select their preferred responses, supplying the pairwise preference data used to rate models. Initial data collection has yielded more than 501,000 pairwise comparisons, with Anthropic's Claude 3.7 Sonnet currently ranked as the top-performing model.

The growing number of leaderboards complicates the selection process for enterprises evaluating LLMs. While Inclusion Arena aims to provide insights that reflect practical usage, businesses are still encouraged to run their own internal evaluations to determine the best-fit models for their specific applications. Other recent benchmarks, such as RewardBench 2 from the Allen Institute for AI, also aim to align model evaluation with real-world use, underscoring how quickly AI evaluation standards are evolving.

Source: https://venturebeat.com/ai/stop-benchmarking-in-the-lab-inclusion-arena-shows-how-llms-perform-in-production/
