Beyond generic benchmarks: How Yourbench lets enterprises evaluate AI models against actual data


Every AI model release inevitably includes charts touting how it outperformed its competitors in this benchmark test or that evaluation matrix. 

However, these benchmarks often test for general capabilities. For organizations that want to use models and large language model-based agents, it’s harder to evaluate how well the agent or the model actually understands their specific needs. 

Model repository Hugging Face launched Yourbench, an open-source tool that lets developers and enterprises create their own benchmarks to test model performance against their internal data. 

Sumuk Shashidhar, part of the evaluations research team at Hugging Face, announced Yourbench on X. The feature offers “custom benchmarking and synthetic data generation from ANY of your documents. It’s a big step towards improving how model evaluations work.”

He added that Hugging Face knows “that for many use cases what really matters is how well a model performs your specific task. Yourbench lets you evaluate models on what matters to you.”

Creating custom evaluations

Hugging Face said in a paper that it demonstrated Yourbench by replicating subsets of the Massive Multitask Language Understanding (MMLU) benchmark “using minimal source text, achieving this for under $15 in total inference cost while perfectly preserving the relative model performance rankings.” 
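To make "preserving the relative model performance rankings" concrete, here is a minimal sketch of one common way to compare two leaderboards, Spearman's rank correlation. Using that statistic here is an assumption for illustration, the function names are hypothetical, and the scores are invented placeholders rather than results from the paper.

```python
def ranks(scores):
    """Map each model to its rank (1 = best) by descending score. Assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation between two score tables over the same models."""
    if set(scores_a) != set(scores_b):
        raise ValueError("both tables must cover the same models")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two models to compare rankings")
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    squared_diffs = sum((rank_a[m] - rank_b[m]) ** 2 for m in scores_a)
    return 1 - 6 * squared_diffs / (n * (n ** 2 - 1))


# Placeholder accuracies, invented for illustration only.
original = {"model_a": 0.81, "model_b": 0.74, "model_c": 0.62}
replicated = {"model_a": 0.77, "model_b": 0.70, "model_c": 0.55}

rho = spearman_rho(original, replicated)
print(rho)  # 1.0 means the two benchmarks order the models identically
```

A value of 1.0 indicates that the custom benchmark orders the models exactly as the reference benchmark does, even when the absolute scores differ.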

Organizations need to pre-process their documents before Yourbench can work. This involves three stages:

  • Document Ingestion to “normalize” file formats.
  • Semantic Chunking to break down the documents to meet context window limits and focus the model’s attention.
  • Document Summarization
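The three stages above can be sketched roughly as follows. This is not Yourbench's actual code or API: every function name is hypothetical, the chunker packs paragraphs by word count as a simple stand-in for semantic chunking, and the summarizer is a truncation placeholder where a real pipeline would presumably call a model.

```python
import re


def normalize(raw_text):
    """Stage 1 stand-in: split into paragraphs and collapse stray whitespace."""
    paragraphs = re.split(r"\n\s*\n", raw_text)
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return [p for p in cleaned if p]


def chunk(paragraphs, max_words=200):
    """Stage 2 stand-in: pack whole paragraphs into chunks of at most max_words.

    A single paragraph longer than max_words becomes its own oversized chunk.
    """
    chunks, current, count = [], [], 0
    for paragraph in paragraphs:
        n_words = len(paragraph.split())
        if current and count + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(paragraph)
        count += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def summarize(chunks, max_words=30):
    """Stage 3 placeholder: keep the first max_words words of the document."""
    words = " ".join(chunks).split()
    return " ".join(words[:max_words])


document = "Refunds are   accepted.\n\nWithin 30 days.\n\n\nTravel needs manager approval."
paragraphs = normalize(document)
chunks = chunk(paragraphs, max_words=6)
summary = summarize(chunks, max_words=5)
```

Keeping paragraphs whole inside each chunk is a design choice in this sketch: it keeps each chunk under a fixed budget without cutting sentences in half.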

Next comes the question-and-answer generation process, which creates questions from information on the documents. This is where the user brings in their chosen LLM to see which one best answers the questions. 
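As a rough illustration of that comparison step, the sketch below scores candidate models against a set of generated question-and-answer pairs and sorts them by accuracy. It assumes each model is exposed as a plain Python callable and uses exact-match scoring, which is a simplification; the stub models and Q&A pairs are invented, and none of this reflects Yourbench's real interface.

```python
def normalize_answer(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def accuracy(model, qa_pairs):
    """Fraction of questions the model answers with an exact (normalized) match."""
    if not qa_pairs:
        raise ValueError("qa_pairs must not be empty")
    correct = sum(
        normalize_answer(model(question)) == normalize_answer(answer)
        for question, answer in qa_pairs
    )
    return correct / len(qa_pairs)


def rank_models(models, qa_pairs):
    """Return (name, accuracy) pairs sorted from best to worst."""
    scores = {name: accuracy(fn, qa_pairs) for name, fn in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented Q&A pairs and stub "models" standing in for real LLM calls.
qa_pairs = [
    ("What is the refund window?", "30 days"),
    ("Who approves travel?", "The line manager"),
]
models = {
    "stub_a": lambda q: "30 days" if "refund" in q else "the line manager",
    "stub_b": lambda q: "30 days",
}

leaderboard = rank_models(models, qa_pairs)
print(leaderboard)
```

In practice, free-form answers usually need a more forgiving scorer than exact match, such as a rubric or a judge model, so treat the scoring function here as the part most likely to change.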

Hugging Face tested Yourbench with DeepSeek V3 and R1 models, Alibaba’s Qwen models including the reasoning model Qwen QwQ, Mistral Large 2411 and Mistral 3.1 Small, Llama 3.1 and Llama 3.3, Gemini 2.0 Flash, Gemini 2.0 Flash Lite and Gemma 3, GPT-4o, GPT-4o-mini, and o3 mini, and Claude 3.7 Sonnet and Claude 3.5 Haiku.

Shashidhar said Hugging Face also offers cost analysis on the models and found that Qwen and Gemini 2.0 Flash “produce tremendous value for very very low costs.”

Compute limitations

However, creating custom LLM benchmarks based on an organization’s documents comes at a cost. Yourbench requires a lot of compute power to work. Shashidhar said on X that the company is “adding capacity” as fast as it can.

Hugging Face runs several GPUs and partners with companies like Google to use their cloud services for inference tasks. VentureBeat reached out to Hugging Face about Yourbench’s compute usage.

Benchmarking is not perfect

Benchmarks and other evaluation methods give users an idea of how well models perform, but they do not perfectly capture how the models will behave in day-to-day use.

Some have even voiced skepticism about whether benchmark tests show models’ limitations, warning that they can lead to false conclusions about safety and performance. A study also warned that benchmarking agents could be “misleading.”

However, enterprises cannot avoid evaluating models now that there are many choices in the market and technology leaders must justify the rising cost of using AI models. This has led to different methods to test model performance and reliability. 

Google DeepMind introduced FACTS Grounding, which tests a model’s ability to generate factually accurate responses based on information from documents. Researchers from Yale and Tsinghua University developed self-invoking code benchmarks to guide enterprises on which coding LLMs work for them. 


