A real-time index for your codebase: Secure, personal, scalable
TL;DR: At Augment, we built a secure, personalized code indexing system that:
- Updates within seconds of code changes (vs. competitors' 10-minute delays)
- Uses custom AI models instead of generic embeddings
- Processes thousands of files/second using Google Cloud
- Maintains separate indices per developer to handle branch switching
- Implements strict security with proof-of-possession and no third-party APIs
- Efficiently manages RAM usage by sharing overlapping indices between users
The result: Context-aware AI that actually keeps up with real development workflows while protecting code security.
A different approach to retrieval
The dominant approach in retrieval remains calling generic model APIs, like OpenAI, to create embeddings for snippets of code files and documents, and then using embedding search APIs, like Pinecone, to retrieve snippets related to the query. However, this approach leads to poor quality, poor latency, and potentially even poor security.
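For concreteness, here is a minimal sketch of that generic pipeline, assuming the OpenAI and Pinecone Python SDKs; the index name, snippet chunking, and model choice are illustrative placeholders, not a recommendation:

```python
# Sketch of the generic approach: embed code snippets with a third-party API
# and retrieve them from a hosted vector database. All names are placeholders.
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
index = Pinecone(api_key="YOUR_PINECONE_KEY").Index("code-snippets")  # hypothetical index

def embed(texts: list[str]) -> list[list[float]]:
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in resp.data]

def index_file(path: str, snippets: list[str]) -> None:
    index.upsert(vectors=[
        {"id": f"{path}#{i}", "values": vec, "metadata": {"path": path}}
        for i, vec in enumerate(embed(snippets))
    ])

def retrieve(query: str, top_k: int = 10):
    return index.query(vector=embed([query])[0], top_k=top_k, include_metadata=True)
```

In this setup, every file change means new round trips to two external services, and the embeddings leave your infrastructure, which is where the latency and security concerns come from.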
Therefore, Augment has taken a different approach: We have developed an indexing and embedding search system tailored to give you the most relevant and up-to-date context across all our products.
A personal index for each developer
All of our products understand your codebase. But what is actually “your codebase”? In the real world, this is a surprisingly tricky question.
Consider the scenario: you have just received a comment on one of your pull requests asking you to rename a function. You want to address the comment quickly and move on, so you switch to the branch, navigate through all definitions and usages of the function name, and adapt them where needed. Before updating your PR, you probably also merge or rebase it to keep it up to date with the branch you want to merge into, and resolve any merge conflicts.
Retrieving from your main or development branch does not cut it: the function in question may not even exist on other branches, and models are likely to hallucinate when faced with names that are not defined in the context. Some of our competitors update their context only every 10 minutes, which in our experience is also not sufficient, as professional developers tend to switch branches fairly often. Retrieving from the wrong branch may even reintroduce problems and patterns that your teammates have recently eliminated, exacerbating the pains of working in large teams.
AI that does not respect the exact version of the code you are working on can easily cost you and your team more time than it saves.
Therefore we maintain a real-time index of your codebase, for each user.
Keeping up with your codebase
Changing branches in git, search-and-replace, and automatic formatting can change hundreds of files within a second. As in the example above, such operations are frequent, and this is especially true in large organizations where hundreds or thousands of developers merge changes into the same codebase.
Our goal is to update your personal search index within a few seconds of any change to your files, so that the next prediction benefits from the full context awareness. This creates tough requirements for the indexing system.
The architecture of our indexing system makes heavy use of Google Cloud, especially PubSub, BigTable, and AI Hypercomputer. These technologies provide a highly reliable and scalable backbone, which keeps our services lightweight and simple. To get the most out of our GPUs, the embedding model workers build on our custom inference stack.
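As an illustration only (not our production code), a heavily simplified worker in this style might pull file-change messages from PubSub, embed the snippets, and persist the vectors to BigTable; the message schema, row-key layout, and embedding call below are all hypothetical:

```python
# Illustrative embedding worker: consume file-change events from Pub/Sub,
# embed the snippets, and write the vectors to Bigtable.
import json
from google.cloud import bigtable, pubsub_v1

PROJECT = "my-project"                 # placeholder project and resource names
SUBSCRIPTION = "file-changes-sub"

table = bigtable.Client(project=PROJECT).instance("indexing").table("embeddings")

def embed_snippets(snippets: list[str]) -> list[bytes]:
    # Stand-in for a call into the custom inference stack; returns dummy
    # zero vectors so the sketch stays self-contained.
    return [bytes(4 * 256) for _ in snippets]

def handle_message(message: pubsub_v1.subscriber.message.Message) -> None:
    event = json.loads(message.data)   # e.g. {"user": ..., "path": ..., "snippets": [...]}
    for i, vector in enumerate(embed_snippets(event["snippets"])):
        row = table.direct_row(f'{event["user"]}#{event["path"]}#{i}'.encode())
        row.set_cell("v", b"embedding", vector)
        row.commit()
    message.ack()                      # only acknowledge once the index is updated

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)
subscriber.subscribe(subscription_path, callback=handle_message).result()
```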
Today, our indexing system is capable of processing many thousands of files per second, which means that your branch switch is handled almost instantly. Our architecture also allows us to scale well beyond those numbers; the only limitation is cost, most of which comes from the GPUs creating the embeddings.
Balancing workloads
Besides the throughput spikes arising from the day-to-day activity of our users, our indexing system also needs to handle bulk workloads. The first case of bulk uploads is when a new user signs up or an existing user checks out a new codebase; here we can observe uploads of 100k files or more. In these cases, we aim to not let our customers wait more than a few minutes to get started, so we want to devote the full available throughput to bulk uploads.
Another case of bulk uploads is the deployment of new search indices. Our context awareness is constantly improving, which makes this a surprisingly common event. Each new embedding model needs to run in shadow mode, possibly for an extended period, to catch up; while one customer may already have been switched to the new search index, other customers may still be in catch-up mode.
We ensure that all users enjoy the instantaneous throughput that keeps their personal search index in sync with their codebase, even when multiple bulk jobs overlap. We facilitate this by maintaining separate queues in PubSub and drawing from the bulk queues only as much as needed to keep the GPUs in the embedding model workers saturated.
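A toy sketch of that prioritization: interactive, per-user updates are always served first, and bulk work only fills whatever GPU capacity is left. In production this happens across separate PubSub queues; here plain in-memory queues stand in, and the batch size is an arbitrary placeholder.

```python
# Toy scheduler: serve interactive (per-user) updates first, and top up GPU
# batches with bulk work (initial uploads, index re-builds) only when there
# is spare capacity. Queue handles and batch size are illustrative.
from queue import Empty, Queue

BATCH_SIZE = 64  # placeholder number of snippets per embedding batch

def next_batch(interactive: Queue, bulk_queues: list[Queue]) -> list:
    batch = []
    # Drain the interactive queue first, up to the batch size.
    while len(batch) < BATCH_SIZE:
        try:
            batch.append(interactive.get_nowait())
        except Empty:
            break
    # Fill the remainder from bulk queues so the GPUs stay saturated.
    for bulk in bulk_queues:
        while len(batch) < BATCH_SIZE:
            try:
                batch.append(bulk.get_nowait())
            except Empty:
                break
    return batch
```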
Your code stays your code
Augment prioritizes your code’s security with a secure-by-design approach guided by principles like Data Minimization, Least Privilege, and Fail-Safe mechanisms. We self-host our embedding search on Google Cloud, avoiding third-party APIs that could expose embeddings, which research shows can be reverse-engineered into source code (arXiv 2305.03010, arXiv 2004.00053).
Managing multiple code repositories with varying access rights (due to acquisitions, trade secrets, or contractor collaboration) can risk unauthorized data exposure. Augment addresses this with Proof of Possession: the IDE must prove to the backend that it knows a file's content by sending a cryptographic hash of that content before it is allowed to retrieve content from the file. This ensures predictions are strictly limited to the data the user is authorized to access, preventing leaks of restricted information.
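A minimal sketch of the idea (not our actual protocol): the client registers a cryptographic digest of each file it actually has on disk, and retrieval is only served for blobs whose digests have been presented.

```python
# Minimal proof-of-possession sketch: the client hashes the files it possesses;
# the backend only serves retrieval for blobs whose hashes were presented.
# The hash choice and session layout are illustrative.
import hashlib

class RetrievalSession:
    def __init__(self) -> None:
        self.proven_hashes: set[str] = set()

    def prove_possession(self, content: bytes) -> str:
        """Client side: compute and register the digest of a file it possesses."""
        digest = hashlib.sha256(content).hexdigest()
        self.proven_hashes.add(digest)
        return digest

    def may_retrieve(self, blob_hash: str) -> bool:
        """Backend side: only blobs the client has proven possession of are retrievable."""
        return blob_hash in self.proven_hashes
```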
Serving embedding search
We perform multiple embedding searches over your codebase on every keystroke, but serving personalized search indices with low latency can be costly. For large codebases, the embeddings of all snippets can easily reach 10 GB, and to keep latency low we need to keep a substantial portion of them in RAM. Using that much RAM per user for extended periods of time would balloon the cost of our service.
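To make that concrete with illustrative numbers (not our production configuration): a million indexed snippets with 2,560-dimensional float32 embeddings already occupy roughly 10 GB.

```python
# Back-of-envelope memory estimate; snippet count and embedding width are assumptions.
snippets, dims, bytes_per_float = 1_000_000, 2_560, 4
print(snippets * dims * bytes_per_float / 1e9)  # ~10.2 GB
```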
To address this, we share the parts of search indices that overlap between users from the same tenant. To guarantee the Proof of Possession principle discussed above, we went with our own implementation of embedding search. Our embedding search service not only scores embeddings to find the most relevant content, but also verifies that the client has proven it is allowed to access that content.
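A condensed sketch of how shared scoring and the possession check can compose, reusing the hypothetical RetrievalSession above; brute-force cosine similarity stands in for the real scoring, and the data layout is illustrative.

```python
# Condensed sketch: one shared embedding matrix per tenant, scored with
# brute-force cosine similarity, with results filtered to blobs the requesting
# client has proven possession of.
import numpy as np

class SharedTenantIndex:
    def __init__(self, embeddings: np.ndarray, blob_hashes: list[str]) -> None:
        # Normalize once so that dot products are cosine similarities.
        self.embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.blob_hashes = blob_hashes

    def search(self, query: np.ndarray, session: "RetrievalSession", top_k: int = 10):
        scores = self.embeddings @ (query / np.linalg.norm(query))
        results = []
        for i in np.argsort(-scores):
            # Access check: skip blobs the client has not proven it possesses.
            if session.may_retrieve(self.blob_hashes[i]):
                results.append((self.blob_hashes[i], float(scores[i])))
            if len(results) == top_k:
                break
        return results
```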
Custom models
Generic embedding models are good at identifying which pieces of text are similar. However, embeddings from generic models can easily miss a lot of relevant context that is not similar on a textual level. Callsites are not necessarily similar to function definitions, documentation is not necessarily similar to the code it refers to, and code from different languages is not necessarily similar even if it implements related functionality.
On top of that, even highly “relevant” documents might not need to be retrieved. For example, the LLM behind our code completions is closely familiar with popular open-source libraries, such as PyTorch. Showing that LLM “relevant” pieces of the PyTorch implementation does not improve the quality of its outputs, and the limited space in the prompt could be used for something else. So we further need to prioritize helpfulness over relevance.
In our experience, the value of generic embedding models quickly degrades for larger codebases as most embedding models get confused by clutter. While a project developed by a single engineer might work fine with generic embedding models, we have found that they are insufficient for codebases of large teams.
To help professional software engineers working with large and complex codebases, we have developed a set of custom context models that we specifically trained to identify the most helpful context.
Try it yourself!
Augment’s context awareness arises from custom models and personalized context, and it comes with robust security. Whether you’re handling massive repositories or working on your side project, Augment ensures you have the right context at your fingertips. Sign up for Augment to see for yourself.