Reinforcement Learning from Developer Behaviors: A Breakthrough in Code Generation Quality
TL;DR: We've achieved a breakthrough in AI code completion by learning directly from how developers naturally interact with their IDEs. Our approach, which we call Reinforcement Learning from Developer Behaviors (RLDB), has improved our model's performance as much as doubling the model size or training on 10x more finetuning data would - and we did it without requiring specialized training data collection.
Note: Privacy and security are foundational to our work at Augment. We never train on proprietary customer code, and we maintain strict data minimization principles. All training described in this post was conducted using internal development data, opt-in user contributions, and appropriately licensed open-source data, with our practices verified by a trusted third party in our SOC 2 Type II attestation.
Introduction
Building a high-quality code completion system is a hard problem in AI. While many models can generate syntactically correct code, delivering suggestions that are consistently accurate, contextually relevant, and genuinely helpful requires a deep understanding of both developer intent and the full codebase in which they are operating. At Augment Code, we built a best-in-class completion system that offers real-time suggestions while maintaining awareness of large repositories. But we knew we could push the boundaries further. The question was: how could we meaningfully improve a model that's already highly capable?
The answer came through a breakthrough in reinforcement learning. We pioneered a new approach called Reinforcement Learning from Developer Behaviors (RLDB) that learns directly from natural coding workflows, without requiring any special guidance or feedback from developers. By analyzing real development patterns, RLDB has achieved improvements equivalent to doubling our model size or training on 10x more data - a step-function improvement in completion quality.
In this post, we’ll dive into where traditional approaches fall short, how RLDB works, and the dramatic improvements we've seen in real-world usage.
Traditional SFT and RL Pipeline Overview
To understand why we needed a new approach, it helps to understand how reinforcement learning (RL) is typically used to improve AI models. The standard RL pipeline begins with supervised finetuning (SFT), where a code generation model is trained on a curated dataset of example completions. This can improve a model's code generation capabilities, but doesn't fully align it with human preferences.
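For readers who want to see what this stage looks like in code, here is a minimal sketch of the SFT objective: next-token cross-entropy over curated examples. The tiny model and random token batch below are placeholders for illustration, not our production setup.

```python
# Minimal sketch of the SFT objective: next-token cross-entropy on curated
# prompt/completion sequences. Illustrative only; model and data are toy placeholders.
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 64

class TinyLM(nn.Module):
    """Toy autoregressive language model: embedding + GRU + output head."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.emb(tokens))
        return self.head(h)  # logits for the next token at each position

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Pretend batch: token ids for "prompt + completion" sequences.
batch = torch.randint(0, vocab_size, (8, 32))

logits = model(batch[:, :-1])                 # predict token t+1 from tokens <= t
loss = nn.functional.cross_entropy(           # standard next-token loss
    logits.reshape(-1, vocab_size), batch[:, 1:].reshape(-1)
)
loss.backward()
opt.step()
```

In practice, the loss is usually masked so that only the completion tokens, not the prompt, contribute to the gradient.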
Next comes reward model (RM) training, where a separate model is trained to predict human preferences. The reward model scores a generation model’s outputs based on how well they align with what humans prefer. Typically, this involves collecting at least tens of thousands of comparisons where humans choose which of several model answers is better.
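Those comparisons are typically folded into a Bradley-Terry-style pairwise loss: the reward model should assign the human-preferred response a higher score than the rejected one. A minimal sketch, with random scores standing in for a real reward model's outputs:

```python
# Sketch of the standard pairwise reward-model loss (Bradley-Terry style):
# maximize the probability that the preferred response outscores the rejected one.
import torch
import torch.nn.functional as F

# In a real pipeline these scores come from a reward model applied to
# (prompt, response) pairs; here they are random placeholders.
score_chosen = torch.randn(16, requires_grad=True)    # r(x, y_preferred)
score_rejected = torch.randn(16, requires_grad=True)  # r(x, y_rejected)

# -log sigmoid(r_chosen - r_rejected): small when chosen clearly outscores rejected.
loss = -F.logsigmoid(score_chosen - score_rejected).mean()
loss.backward()
print(float(loss))
```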
Finally, the SFT model is fine-tuned using reinforcement learning to maximize the scores predicted by the reward model. This process gradually shifts the model's behavior to better align with human preferences.
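Conceptually, this stage maximizes the reward model's score while a KL penalty keeps the policy close to the SFT model so it doesn't drift into degenerate outputs. Below is a toy sketch of that objective, using a simple REINFORCE-style surrogate with placeholder log-probabilities and rewards rather than a full PPO implementation:

```python
# Toy sketch of the KL-regularized RL objective used in RLHF-style training:
# maximize reward-model score minus a penalty for drifting from the SFT policy.
import torch

torch.manual_seed(0)
beta = 0.1                                  # strength of the KL penalty

# Log-probabilities of the sampled completion under the current policy and
# under the frozen SFT reference model (random placeholders for illustration).
logp_policy = torch.randn(8, requires_grad=True)
logp_reference = torch.randn(8)
reward_model_score = torch.randn(8)         # scalar score per sampled completion

# Shaped reward: RM score minus beta * (log pi(y|x) - log pi_ref(y|x)).
shaped_reward = reward_model_score - beta * (logp_policy - logp_reference)

# REINFORCE-style surrogate: raise log-prob of samples with high shaped reward.
loss = -(logp_policy * shaped_reward.detach()).mean()
loss.backward()
```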
Challenges with Traditional SFT and RL
The traditional approach described above has demonstrated remarkable success in various language modeling applications. However, when applied to the domain of code generation, we faced unique challenges.
Picture a software developer in their element – seamlessly switching between files, launching commands with shortcuts, and navigating through different windows while coding. This natural flow of development work creates complex patterns of interaction, making it challenging to capture and learn from these activities when training AI models. RLDB is designed to capture valuable signals directly from the actions of software developers in the course of their normal work, addressing limitations in traditional training approaches for code completion.
Human annotation, a cornerstone of SFT and RM/RL training, struggles to scale in this domain. Imagine trying to hand-annotate code suggestions for a million-line codebase; it may take a human annotator an entire day just to understand enough context to label a single example. This inefficiency makes manual annotation impractical for large-scale data collection.
Moreover, the incomplete state of code in an IDE differs significantly from the examples that exist in publicly available open-source code. Developers frequently work with broken code: partially written functions, unresolved imports, fragmented code snippets, and placeholders—scenarios that are rarely seen in public repositories. After all, developers rarely submit half-written code to a repository. Bridging the gap between such incomplete states and complete-code training data is a fundamental challenge when training code generation models.
Some systems, like ChatGPT and Gemini, train their reward models by showing users two options side by side and asking them to choose the better one. While this works for chat workflows, it is disruptive for coding.
On the other hand, recording all developer actions—keystrokes, mouse movements, window switches—seems promising but generates excessive noise. Identifying meaningful signals from this data is nontrivial. These challenges necessitated a reward model and reinforcement learning algorithm tailored to the unique dynamics of developer behavior and incomplete-state code environments.
Our Solution: Learning from Natural Developer Behaviors
To address these challenges, we developed RLDB: Reinforcement Learning from Developer Behaviors, which combines robust data infrastructure and innovative algorithm design.
Data Infrastructure
Our RLDB system is seamlessly integrated with production infrastructure, capturing full repository content and IDE states at regular intervals with minimal overhead. This enables us to reconstruct the complete development context at any given moment, providing a foundation for context-aware RM and RL training. Additionally, we track user behaviors as sequences of text-change events, offering rich, granular insights into developer preferences and supporting the collection of high-quality training data.
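As a rough illustration of the kind of records such an infrastructure might capture, here is a simplified, hypothetical sketch: the class names, fields, and replay logic are illustrative assumptions for this post and omit most of what the production system records.

```python
# Hypothetical sketch of repository snapshots and text-change events.
# Names and fields are illustrative assumptions, not the actual production schema.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class RepoSnapshot:
    """Full repository content plus IDE state at one point in time."""
    timestamp_ms: int
    files: Dict[str, str]          # path -> file contents
    open_file: str                 # file focused in the editor
    cursor_offset: int             # caret position within the open file

@dataclass
class TextChangeEvent:
    """One edit in the stream of developer text changes."""
    timestamp_ms: int
    path: str
    offset: int                    # where the edit starts
    removed: str                   # text deleted by the edit
    inserted: str                  # text typed or accepted by the edit

@dataclass
class Session:
    """A snapshot plus the edits that followed it, used to reconstruct context."""
    snapshot: RepoSnapshot
    events: List[TextChangeEvent] = field(default_factory=list)

    def replay(self) -> Dict[str, str]:
        """Apply the recorded edits to the snapshot to recover a later repo state."""
        files = dict(self.snapshot.files)
        for e in self.events:
            text = files.get(e.path, "")
            files[e.path] = text[: e.offset] + e.inserted + text[e.offset + len(e.removed):]
        return files
```

Reconstructing the repository state at the moment a suggestion was shown is what makes the reward model and RL training context-aware.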
Our training process strictly excludes any customer-owned code. The training processes discussed here relied exclusively on our internal development datasets and properly licensed open-source contributions.
Algorithm Design
We use RL to optimize our code completion model by aligning it with signals from the reward model. While traditional approaches rely on ranking multiple responses to determine quality, we found this insufficient for capturing the nuanced needs of developers: pairwise comparisons are disruptive to collect inside an IDE and cover only a small slice of real coding contexts. Instead, our design prioritizes alignment with the distribution of real-world coding tasks for more robust generalization, and it allows the reward model to be trained on 100x more data than valid comparison pairs by leveraging data collected from developer environments.
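As one hypothetical illustration of how behavioral signals can stand in for explicit comparisons, the sketch below derives a scalar training target from whether a suggestion was accepted and how much of it survived subsequent edits, then fits a reward model to those targets by regression. The specific signals, weighting, and toy model here are illustrative assumptions, not a description of our production labeling scheme.

```python
# Hypothetical illustration of turning developer behavior into reward-model targets
# instead of explicit "A vs. B" comparisons. The signals and mapping below are
# invented for this sketch and are NOT the production labeling scheme.
import torch
import torch.nn as nn

def behavior_to_target(accepted: bool, survived_chars: int, suggested_chars: int) -> float:
    """Map observed behavior to a scalar target in [0, 1]."""
    if not accepted:
        return 0.0
    # Fraction of the accepted suggestion still present after later edits.
    return max(0.0, min(1.0, survived_chars / max(1, suggested_chars)))

# Toy reward model: scores a (context, suggestion) feature vector.
reward_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.AdamW(reward_model.parameters(), lr=1e-3)

features = torch.randn(32, 128)                       # placeholder encodings
targets = torch.tensor(
    [behavior_to_target(i % 2 == 0, i, 40) for i in range(32)], dtype=torch.float32
)

pred = reward_model(features).squeeze(-1)
loss = nn.functional.mse_loss(torch.sigmoid(pred), targets)  # regression on behavior-derived targets
loss.backward()
opt.step()
```

Because every completion a developer sees yields a usable signal, this style of supervision scales far beyond what hand-labeled comparison pairs can provide.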
The flexibility of our reward model allows it to optimize towards or against any specified direction, moving beyond traditional “better or worse” preference collection paradigms. Building on this foundation, we developed a custom RL algorithm specifically designed to maximize the potential of our RM, achieving an 8%+ perplexity reduction over traditional RL methods. These innovations create a system that not only learns effectively but also scales with the complexities of real-world development workflows.
The Results: Real-World Examples
Here are some concrete examples showcasing how our RLDB model outperforms the traditional SFT model. By better understanding incomplete code states and capturing user intent, the RLDB model not only delivers more accurate suggestions, but also minimizes issues like repetitions and hallucinations that often plague SFT models.
Privacy and Security
We never train on proprietary customer code. All RLDB training is done using our internal development environment and data from users who have explicitly opted into helping improve the system. We have achieved SOC 2 Type II attestation, reflecting our commitment to security and privacy.
What's Next?
The RLDB journey is just beginning. We’re continuously refining our understanding of developer workflows and transforming those insights into better, more context-aware code suggestions. We see immense potential in this direction: our rich interaction data, combined with innovative reward modeling, unlocks possibilities for AI-assisted development that were previously out of reach. The early success of RLDB proves that we can extract meaningful signals from complex developer environments and translate them into significant performance gains.
If pioneering the future of AI-driven development through novel approaches like RLDB excites you, we'd love to talk! We're looking for engineers and researchers who are passionate about pushing the boundaries of what's possible with context-aware AI for code. To experience firsthand how our RLDB-enhanced model can transform your development workflow, sign up for Augment today.