Updates
March 20, 2025

Augment leads on CCEval: Benchmarking code completion for continuous improvement

Speed and quality of inline code completions are both critical for maintaining development flow. A good suggestion that completes the next line of code but takes two minutes to generate isn't useful, and a lightning-fast wrong answer is worse still.

Engineers value measurable improvements in their tools and workflows. But how do we objectively evaluate if AI coding assistants improve code quality? And how do we ensure that changes to code completion features actually enhance the developer experience?

How we test inline completions

We evaluate Augment’s inline coding performance using multiple benchmarks and testing processes. The benchmarks range from the sophisticated to the simple, depending on our goal. One simple benchmark we use is a modified version of CCEval (CrossCodeEval), which focuses specifically on a critical capability: understanding code across multiple files. This benchmark is simple enough to serve both as a fast quality assessment for hill climbing and as a regression test that fits into a standard CI pipeline (sketched below).
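As a rough illustration of the regression-test use, here is a minimal sketch in Python (pytest style). The `run_cceval` helper, the endpoint URL, and the 0.65 threshold are hypothetical placeholders, not Augment's actual harness or numbers.

```python
# Minimal sketch of gating CI on a CCEval-style score (pytest style).
# `run_cceval`, the endpoint, and the 0.65 threshold are illustrative
# placeholders, not Augment's actual harness or numbers.

def run_cceval(model_endpoint: str) -> float:
    """Placeholder: run the benchmark against `model_endpoint` and
    return exact-match accuracy in [0, 1]."""
    raise NotImplementedError("wire this up to your own evaluation harness")

def test_completion_quality_regression():
    accuracy = run_cceval(model_endpoint="http://localhost:8000/complete")
    # Fail the build if accuracy drops below the previously accepted baseline.
    assert accuracy >= 0.65, f"CCEval accuracy regressed to {accuracy:.2%}"
```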

What CCEval tests and why it matters

Our version of CCEval is designed to test important qualities for inline completions in real-world development:

  • Context awareness: Can the model understand code relationships across different files?
  • Fill-in-the-middle capabilities: Can it generate code that fits seamlessly within existing structures? (See the sketch after this list.)
  • Real-world applicability: The benchmark uses actual open-source repositories, not contrived examples
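To make the first two points concrete, here is a sketch of what a single cross-file fill-in-the-middle test item can look like. The field names and sample data are invented for illustration and are not the actual CCEval schema.

```python
# Illustrative shape of a cross-file fill-in-the-middle test item.
# Field names and sample data are invented; this is not the real CCEval schema.
from dataclasses import dataclass

@dataclass
class FimExample:
    prefix: str              # code before the cursor in the current file
    suffix: str              # code after the cursor in the current file
    cross_file_context: str  # relevant snippets drawn from other files
    reference: str           # the ground-truth completion

example = FimExample(
    prefix="def total_price(cart):\n    return ",
    suffix="\n\nprint(total_price(cart))",
    cross_file_context="# pricing.py\ndef apply_discount(amount, rate): ...",
    reference="sum(apply_discount(item.price, item.rate) for item in cart)",
)
```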

The standard benchmark tests inline code completions across ~1,000 repositories and ~10,000 completions, covering Python, Java, TypeScript, and C#. What makes CCEval particularly valuable is its focus on cross-file code completion tasks—the kinds of challenges that developers face daily when working in complex codebases.

The evaluation method uses simple metrics, such as exact matching. While this approach might seem basic, it provides a clear, fast signal about the quality of context-aware inline completions.
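A minimal sketch of an exact-match metric is shown below; the whitespace normalization is an assumption for illustration, not necessarily how the benchmark compares strings.

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of completions that match the reference exactly.

    Stripping surrounding whitespace is an assumption for illustration;
    the real benchmark may compare strings differently.
    """
    if not references:
        return 0.0
    matches = sum(
        pred.strip() == ref.strip()
        for pred, ref in zip(predictions, references, strict=True)
    )
    return matches / len(references)
```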

How Augment performs on CCEval

The chart demonstrates Augment's performance advantage on the CCEval benchmark; the numbers behind it are broken out in the next section.

Comparing Augment to the competition

We periodically evaluate Augment against other AI coding assistants using this benchmark as a relative quality check. We do this less for marketing purposes than to ensure we're building the best possible product. To this end, we evaluate competitors under their ideal conditions. For example, some competitors perform better when they have access to all open files, while Augment's performance is not dependent on open files.

The results show a striking difference in performance:

  • Augment: 67% accuracy
  • GitHub Copilot with all open files: 50% accuracy
  • GitHub Copilot without open files: 30% accuracy

This represents a substantial lead in accuracy on this benchmark, driven by stronger context awareness and fill-in-the-middle quality compared to alternatives.

Augment's technical advantages stem from three integrated capabilities:

1. Complete codebase indexing: Unlike systems limited to open files, Augment analyzes your entire codebase to understand cross-file dependencies.

2. Context-aware completion: Our fill-in-the-middle approach generates code that fits seamlessly within existing structures by understanding both preceding and following code.

3. Specialized retrieval models: We use fine-tuned models optimized specifically for code understanding across multiple files and languages.

These capabilities work together to deliver more accurate and contextually appropriate suggestions.
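A hedged sketch of how these pieces can fit together is shown below. The function names, the prompt layout, and the `index`/`model` interfaces are assumptions for illustration, not Augment's implementation.

```python
# Illustrative retrieval-augmented fill-in-the-middle flow.
# Function names, prompt layout, and the index/model interfaces are
# assumptions for this sketch, not Augment's actual implementation.

def retrieve_context(index, prefix: str, suffix: str, k: int = 5) -> list[str]:
    """Query a codebase-wide index for the k snippets most relevant to the cursor."""
    return index.search(query=prefix + suffix, top_k=k)

def build_fim_prompt(context: list[str], prefix: str, suffix: str) -> str:
    """Assemble retrieved cross-file context plus prefix/suffix into one prompt."""
    context_block = "\n\n".join(context)
    return f"{context_block}\n<prefix>{prefix}</prefix><suffix>{suffix}</suffix>"

def complete(model, index, prefix: str, suffix: str) -> str:
    prompt = build_fim_prompt(retrieve_context(index, prefix, suffix), prefix, suffix)
    return model.generate(prompt)  # the model fills in the middle
```

Keeping retrieval separate from prompt assembly in a sketch like this makes it easy to swap in different indexes or prompt formats without touching the rest of the pipeline.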

Of course, Augment and competing products are constantly improving, and we expect to see all products improve their performance on this benchmark over time.

Why codebase understanding matters

Augment's key advantage is its ability to understand your entire codebase, not just open files.

This delivers:

1. More accurate completions that correctly reference functions, classes, and variables across your project

2. Suggestions that follow your established code patterns and conventions

3. Reduced context-switching since you don't need to manually open relevant files

The limitations of today's evaluation

While these results are encouraging, we recognize some limitations in how we're measuring success:

  • Exact match doesn't capture semantic equivalence (different code that accomplishes the same thing)
  • The benchmark doesn't measure other important factors like suggestion speed or UX
  • Real-world projects may have different characteristics than the benchmark repositories

Beyond a single benchmark: Our comprehensive evaluation approach

While CCEval gives us valuable insights into cross-file context understanding, we don't rely on just one benchmark. No single benchmark can capture everything that matters in a coding assistant. Each has strengths and limitations, which is why we employ multiple evaluation approaches:

  • CCEval for fast quality assessment and regression testing
  • Performance benchmarks that measure runtime efficiency
  • Benchmarks based on completion quality under more realistic coding conditions
  • Benchmarks that use code execution within real-world projects
  • Feedback from users

We actually started with more sophisticated benchmarks but found that CCEval scores correlate well with user acceptance rates. That correlation gave us confidence that the simpler benchmark is a good proxy for real-world performance.
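As an illustration of that kind of sanity check, one could compute a rank correlation between benchmark scores and acceptance rates across model versions. The numbers below are invented, and Spearman correlation via SciPy is just one reasonable choice.

```python
# Illustrative check that benchmark scores track user acceptance rates.
# All numbers below are made up for the example.
from scipy.stats import spearmanr

cceval_accuracy = [0.52, 0.55, 0.58, 0.61, 0.63, 0.67]  # successive model versions
acceptance_rate = [0.18, 0.19, 0.21, 0.22, 0.24, 0.26]  # observed in production

rho, p_value = spearmanr(cceval_accuracy, acceptance_rate)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```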

Conclusion: Committed to quality through rigorous evaluation

At Augment, we're committed to creating AI coding assistants that understand your codebase and help you write better code. Through comprehensive testing and continuous evaluation, we're building tools that don't just score well on benchmarks but genuinely improve how developers work.

Richard Hankins

Richard Hankins is a Founding Engineer at Augment and was an early engineer at Pure Storage. In a past life, he worked as a Research Engineer at Nokia Research Center, and at Intel. Richard earned his Ph.D. in Computer Science from the University of Michigan, where he specialized in database systems.
