You make your evals, then your evals make you. Introducing AugmentQA

Introduction
In the world of coding assistants, benchmarks shape progress. You build your evaluations, and then your evaluations build you. But what happens when your benchmarks don’t match reality?
Most existing code retrieval benchmarks optimize for synthetic problems — think isolated snippets or artificially generated question-answer pairs. But real software engineers aren’t solving artificial puzzles; they’re navigating large, messy, interconnected codebases. And that’s exactly why we built AugmentQA, an internal benchmark designed specifically to measure repository-aware code retrieval through realistic question-answering tasks directly sourced from real-world software development scenarios.
Here's the kicker: optimizing for synthetic benchmarks means climbing the wrong hill. Models trained this way perform well on artificial tasks, but stumble in the real world. Augment’s retrieval system, optimized explicitly to understand repository context, significantly outperforms open-source models — even those topping synthetic leaderboards like CoIR.
Below is a quick preview of how Augment stacks up when evaluated on realistic question-answering retrieval (measured by ground-truth keyword recall):

Want your coding assistant to actually help engineers? Then measure what matters most: real-world performance.
Here's how we did it — and how Augment leads the way.
What makes a good retrieval benchmark?
AugmentQA, our internal question-answering retrieval benchmark, was designed to measure precisely what matters: how well a coding assistant retrieves relevant information from complex repositories to address real-world engineering problems. But what exactly makes a benchmark "good" for this purpose?
When benchmarks don’t represent realistic scenarios, engineering teams risk optimizing for the wrong outcomes—wasting resources, chasing misleading metrics, and ultimately failing to deliver value. To avoid these pitfalls, we identified three key attributes of a truly effective benchmark:
- Realistic retrieval corpus: Engineers depend heavily on project structure, coding conventions, and cross-file connections. Using a large, actual software repository captures these complexities.
- Authentic, challenging questions:
Questions must reflect genuine struggles from engineers actively working on the codebase. Real questions are ambiguous, context-dependent, and specific, mirroring exactly the complexity developers face daily. For instance, here are actual examples from our engineering team's internal use of Augment:
  - "Where is the load balancing strategy for our embedding indexers defined?"
  - "Which chat retrievers do we have deployed according to the metadata file?"
  - "Implement chat prod formatter adapter (similar to edit prod adapters)."
  - "What's our deployment schedule?"
  - "Where does binks prompt formatter merge overlapping chunks?"
- Robust automatic evaluation:
Manual evaluations slow iteration and introduce human bias. By automatically scoring answers based on clearly defined criteria (like our ground-truth keywords approach), we can iterate rapidly and confidently.
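To make the last two attributes concrete, here is a minimal sketch of what a single benchmark entry could look like when it pairs a real engineer question with ground-truth keywords for automatic scoring. The field names and keyword values below are illustrative assumptions, not AugmentQA's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkEntry:
    """Hypothetical shape of one question-answering item: a real engineer
    question, the repository snapshot it was asked against, and the keywords
    that any complete answer (or retrieved context) should contain."""
    question: str
    repo_snapshot: str                                   # e.g. a commit SHA of the corpus
    ground_truth_keywords: list[str] = field(default_factory=list)

# Illustrative entry modeled on one of the example questions above;
# the keywords are invented for demonstration.
entry = BenchmarkEntry(
    question="Where is the load balancing strategy for our embedding indexers defined?",
    repo_snapshot="<commit-sha>",
    ground_truth_keywords=["round_robin", "embedding_indexer", "load_balancer"],
)
```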
In short, benchmarks shape product outcomes. When you measure realistically, you create tools that genuinely help engineers.
Existing benchmarks fall short
Earlier, we defined three essential criteria for a truly effective retrieval benchmark: a realistic retrieval corpus; authentic, challenging questions; and robust automatic evaluation. Unfortunately, most existing public benchmarks—such as those found in CoIR or CodeRAG-Bench—fail these tests:
- Lack of realistic retrieval corpus:
Benchmarks often rely on isolated code snippets, coding-competition solutions, or synthetically generated code samples. This ignores the complexity, interconnectedness, and structural conventions present in real codebases.
- Synthetic or oversimplified questions:
Real engineering queries are nuanced and context-dependent, reflecting genuine uncertainty. Many existing benchmarks instead rely heavily on artificially generated or overly simplistic questions, which rarely mirror real-world developer struggles.
- Weak evaluation methods:
Popular benchmarks frequently rely on proxy metrics, such as matching functions or metadata labels, to determine relevance. These proxies often don’t align closely with what a real user would genuinely find helpful given their query, weakening confidence that benchmark improvements translate directly to real-world usability.
The result? Retrieval solutions optimized on these benchmarks can appear successful yet consistently stumble in real-world scenarios, precisely when developers need the most support.
AugmentQA: internal question-answering benchmark
Recognizing these critical gaps, we built AugmentQA from the ground up as a question-answering retrieval benchmark explicitly designed to evaluate how effectively models retrieve context from real codebases. We sourced authentic questions directly from our internal engineering team's use of Augment. But gathering realistic questions alone wasn’t sufficient—we also needed a reliable and scalable method to objectively evaluate retrieval quality and answer accuracy.
After extensive experimentation, we chose ground-truth keywords as our evaluation mechanism. Every valid answer must include specific keywords, enabling us to automatically quantify answer completeness and retrieval effectiveness. This keyword-based method is not only objective but also more robust than simpler measures (like file path recall), ensuring accurate and actionable feedback at scale.
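As a rough illustration of how keyword-based scoring can be automated, the snippet below computes a simple keyword-recall score over retrieved context. The matching rule here (a case-insensitive substring check) is an assumption chosen for clarity, not necessarily the exact rule AugmentQA uses.

```python
def keyword_recall(retrieved_chunks: list[str], ground_truth_keywords: list[str]) -> float:
    """Fraction of ground-truth keywords that appear anywhere in the retrieved context.

    Deliberately simple: a case-insensitive substring check. A production metric
    would likely also handle tokenization, aliases, and near-duplicate keywords.
    """
    if not ground_truth_keywords:
        return 0.0
    haystack = "\n".join(retrieved_chunks).lower()
    hits = sum(1 for kw in ground_truth_keywords if kw.lower() in haystack)
    return hits / len(ground_truth_keywords)

# Example: two of the three keywords appear in the retrieved chunks -> recall ≈ 0.67.
chunks = ["LOAD_BALANCER_STRATEGY = 'round_robin'", "def start_embedding_indexer(shard):"]
print(keyword_recall(chunks, ["round_robin", "embedding_indexer", "rendezvous_hashing"]))
```

Because a check like this looks at the retrieved content itself rather than at which file it came from, it stays meaningful even when relevant information moves or is duplicated across files, which is part of why a content-level check can be more robust than file path recall.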
As our model improves, the benchmark itself evolves too. By continuously collecting new failure cases from actual use, we're able to consistently introduce more challenging and nuanced scenarios. This iterative loop pushes our system forward, steadily increasing AugmentQA’s difficulty to reflect real-world engineering complexity:

In other words, AugmentQA isn’t static—it continually grows tougher, mirroring the real-world challenges that software engineers face, and ensuring our improvements directly translate into tangible productivity gains.
Evaluation results: internal and open-source models
To see how well our system performs against existing open-source code retrieval solutions, we compared the top two models from the CoIR leaderboard—Salesforce’s SFR-Embedding-Code-2B_R and CodeSage-large-v2—with our internal systems on AugmentQA.
We evaluated two setups:
- Embeddings-only retrieval: To ensure fairness, we tested embedding-based retrieval alone (sketched just after this list), including several versions of our internal embeddings, notably our recent March update.
- Full retrieval system: We also evaluated our complete system, which integrates multiple retrieval methods currently powering Augment’s production chat and agent workflows. (More details in future blog posts!)
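For orientation, the embeddings-only setup can be thought of as plain nearest-neighbor search over repository chunk embeddings. The sketch below shows that baseline shape with a stand-in embed() function; the chunking, the embedding model, and the top-k choice are placeholder assumptions, not a description of Augment's production pipeline.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Stand-in for any embedding model (open-source or internal);
    returns one L2-normalized vector per input text."""
    rng = np.random.default_rng(0)                       # placeholder: random vectors
    vecs = rng.normal(size=(len(texts), 256))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def retrieve_top_k(question: str, chunks: list[str], k: int = 10) -> list[str]:
    """Embeddings-only retrieval: rank repository chunks by cosine similarity
    to the question embedding and return the top-k chunks as context."""
    chunk_vecs = embed(chunks)                           # in practice, precomputed per repository
    query_vec = embed([question])[0]
    scores = chunk_vecs @ query_vec                      # cosine similarity (vectors are normalized)
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]

# A benchmark run is then: retrieve for each question, score with keyword recall, and average.
```

The full retrieval system layers additional retrieval methods on top of this kind of baseline, and the gap between the two setups is exactly what the second configuration is meant to measure.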
The results clearly show that Augment embeddings, trained specifically for real-world coding assistant scenarios, substantially outperform open-source embeddings optimized for synthetic code retrieval tasks. This underscores a critical point: strong performance on artificial benchmarks doesn’t necessarily translate to noticeable real-world improvements.
Additionally, our recent March update alone delivered meaningful improvements. Users should notice this quality upgrade directly in their everyday workflows, for example through more precise answers to their questions and a better ability to locate information in their codebase.
Finally, there's a notable performance boost when comparing embeddings-only retrieval with our full retrieval system, highlighting the practical value of sophisticated retrieval methods tailored explicitly for navigating complex codebases.
What’s next?
AugmentQA has shown clearly that when you measure the right things—realistic scenarios, authentic challenges—you get tools that meaningfully improve software engineers' daily workflows.
But we're not done yet. We’re actively enhancing Augment’s full retrieval system, integrating advanced retrieval strategies and richer contextual understanding. In upcoming blog posts, we'll dive deeper into these capabilities, share insights from our ongoing research, and illustrate exactly how Augment is revolutionizing real-world software development.
Stay tuned—the best is yet to come!