
Is RAG the answer to all your generative AI hallucination problems?

While large language models (LLMs) are capable of incredible feats of summarization and translation, deploying them in mission-critical ways is beset with problems – even for the largest tech companies in the world.

While they are trained on huge volumes of data, LLMs are still limited by their training data and the quality of the prompt. Even then, there is always a chance that the model will “hallucinate” and make things up when it doesn’t know the correct answer.

Enter retrieval augmented generation (RAG), a fast-emerging technique for solving these problems. Let’s dig in and look at what it does, where it’s effective, and the limitations and costs of employing it.

What is RAG?

Foundational LLMs are trained on as much data as possible to build their neural networks. The simplest way to think of any given training data set is as the entire contents of the open internet, every book ever published, plus whatever else the researchers could get their hands on. Once this data is in place, the months-long training process can start, meaning every model has a natural knowledge cutoff.

While LLMs can be retrained with additional data, or fine-tuned with data for a specific purpose, both processes are expensive, time-consuming, and only temporarily fix the problem. Without easy access to more up-to-date information, any AI model – or tool built using one – is going to be limited.

Retrieval augmented generation attempts to solve this by ‘augmenting’ the LLM with a database of relevant, easily updated information it can pull from. This can be anything from legal documents and news articles, to scientific papers and patent applications (though legal documents really are a popular option). RAG can also be used to give an AI model up-to-date access to proprietary or internal information at any organization.

For example, an IT support bot could use OpenAI’s GPT-4 as its base LLM, but also be augmented by a database that includes all the relevant internal help docs, successful chat logs, codes of conduct, and anything else relevant to the task. When a user asks how to connect to the organization’s VPN on their smartphone, instead of offering general advice based on GPT-4’s training data, the chatbot can use the accurate information in the help docs to respond correctly. And if the organization changes VPN provider or otherwise needs to update the instructions, they only have to update the information in the RAG database, instead of completely retraining an entire LLM. 

How does it work?

A RAG pipeline requires an extra layer of processing to be performed on the prompts given to an AI model. While the specific structure varies significantly between implementations, most pipelines work in the same broad way.

As a technique, RAG relies on AI models with larger context windows: the amount of text they can process in one go, including both the input and the output. Earlier LLMs like GPT-3 had a context window of 2,048 tokens (which equates to around 1,500 words), limiting how much additional context you could provide with any given prompt. Now you can find LLMs with context windows of 128,000 tokens, or even a million, making RAG pipelines much easier to apply.

First, all the information you want to make available to the AI model has to be encoded in a vector database like Pinecone or Qdrant. How this data is broken up and encoded can have a significant impact on the AI’s retrieval speed, performance, and accuracy. Ideally, you want the AI to be able to retrieve the smallest amount of data possible that contains all of the context it needs to generate the correct reply. For some applications, this will just be a few key facts; for others, it might require thousands of words of relevant text. This is the real art of RAG.
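As a rough sketch of that indexing step, the Python below uses OpenAI’s embeddings API and a plain in-memory list as a stand-in for a vector database like Pinecone or Qdrant. The chunk size, model name, and document contents are purely illustrative, not recommendations.

```python
# Indexing sketch: split documents into chunks, embed each chunk, and keep
# the (text, vector) pairs in a plain list. A real deployment would store
# these in a vector database such as Pinecone or Qdrant instead.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chunk(text: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking; production pipelines usually split on
    # headings, paragraphs, or sentences so each chunk stands on its own.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

index = []  # list of (chunk_text, embedding_vector) pairs
for doc in ["...VPN setup guide...", "...code of conduct..."]:
    for piece in chunk(doc):
        index.append((piece, embed(piece)))
```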

When you prompt the AI model, instead of responding immediately, it queries the vector database for information relevant to the prompt. The search strategy you use to query the database is, again, the kind of important design decision that impacts the overall effectiveness of the RAG pipeline. 
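Continuing the sketch above, a minimal retrieval step might embed the prompt and rank the stored chunks by cosine similarity. A real vector database performs this nearest-neighbor search for you, far more efficiently than the brute-force loop here.

```python
# Retrieval sketch: embed the user's prompt, then return the k most similar
# chunks from the index built above (brute-force cosine similarity).
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query: str, k: int = 3) -> list[str]:
    q_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```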

Any information the AI model retrieves is then appended to the prompt and sent to the LLM for processing. The LLM responds based on both its training data and the additional context pulled from the RAG pipeline. Typically, you are counting on the original training data to get the AI model to respond in clear sentences, while it uses the RAG data to provide relevant, accurate information.
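Putting the pieces together, the final step appends the retrieved chunks to the prompt and sends the combined text to the LLM. The model name and prompt wording below are placeholders, not a prescription.

```python
# Generation sketch: build an augmented prompt from the retrieved chunks and
# ask the LLM to answer using that context.
def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    messages = [
        {"role": "system",
         "content": "Use the provided context to answer the question."},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content

print(answer("How do I connect to the VPN on my smartphone?"))
```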

What is RAG good for?

If you’ve played around with ChatGPT, you’ve probably found that it often replies with generic, vague overviews, especially when there isn’t a clearly factual response in its training data. RAG helps avoid this outcome by making sure your applications have additional, relevant data, and that you can easily and cheaply update that information. 

Zach Bartholomew, VP of product at Perigon, explains that RAG can also be used to establish “ground truth” in your applications. If you tell the LLM that the data it pulls from the vector database is better and more accurate than what it was trained on, it can significantly reduce hallucinations. 
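In practice, that instruction often lives in the system prompt. The wording below is a hypothetical example, not a quote from any particular product.

```python
# Illustrative "ground truth" instruction: tell the model the retrieved
# context is more current and authoritative than its training data.
GROUND_TRUTH_INSTRUCTION = (
    "The context below comes from our internal knowledge base. It is more "
    "current and accurate than anything in your training data; when the two "
    "conflict, always prefer the context."
)
```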

You can even set guardrails to ensure the model doesn’t try to answer a question for which it has no data. For example, if a user asked the IT chatbot how to connect to a service that didn’t exist, instead of replying with advice on how to connect to an imaginary service, it would reply that the service wasn’t available.
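One simple way to build that guardrail, continuing the earlier sketch, is to check how relevant the best match actually is before answering. The similarity threshold here is purely illustrative.

```python
# Guardrail sketch: if nothing in the index is similar enough to the
# question, say so instead of letting the model improvise an answer.
def answer_with_guardrail(query: str, min_score: float = 0.75) -> str:
    q_vec = embed(query)
    best_score = max(cosine(q_vec, vec) for _, vec in index)
    if best_score < min_score:
        return "Sorry, I don't have any information about that."
    return answer(query)
```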

Bartholomew also explains that RAG can be used to build traceable citations into AI applications. While it’s almost impossible to attribute any particular LLM response to a specific piece of training data, RAG can be employed with a significant level of transparency by getting the LLM to cite which particular resources it is using to generate its responses. While this may not always work perfectly, it should be considered as part of an effective RAG pipeline – especially if you want to be able to effectively troubleshoot.
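A rough way to do this, again building on the sketch above, is to label each retrieved chunk with a source identifier and ask the model to reference those labels in its answer.

```python
# Citation sketch: number each retrieved chunk and ask the model to cite
# the numbers it relied on, so answers can be traced back to source text.
def answer_with_citations(query: str) -> str:
    chunks = retrieve(query)
    sources = "\n\n".join(f"[{i + 1}] {text}" for i, text in enumerate(chunks))
    messages = [
        {"role": "system",
         "content": "Answer from the numbered sources and cite them, e.g. [1]."},
        {"role": "user",
         "content": f"Sources:\n{sources}\n\nQuestion: {query}"},
    ]
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content
```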

RAG can also be cost effective to implement. According to Wordsmith.ai CEO Ross McNairn, “For all but a handful of companies, it’s impractical to even consider training their own foundational model as a way to personalize the output of an LLM.” However, a RAG pipeline can be deployed relatively quickly with far less upfront cost.

The problems with RAG

Implementing a RAG pipeline can’t solve every problem with existing LLMs, nor does it come without overhead. As Bartholomew explains, “Your LLM to RAG pipeline is only going to be as good as the data you have and that you embedded.” Once again, it’s the old computer science truism: Garbage In/Garbage Out.

Before we even get to the complexities of deploying a RAG pipeline, it’s important to consider that sourcing good data, formatting it so that it’s usable, chunking it so that it’s retrievable, and embedding it in a vector database is not a trivial task. There is a reason that a lot of startups are using RAG for things like legal texts, scientific papers, and patent applications: the dataset is at least somewhat consistent, if not necessarily cleaned up and ready to use.

And then there are those deployment complexities. A RAG pipeline calls for relatively novel technologies like vector databases, natural language processing, data embedding, and LLM integration. If you don’t have those skills in-house, you will either need to outsource, develop them internally, or hire someone who does.

A RAG pipeline also adds some latency, especially at first. According to Bartholomew, the first proof-of-concept at Perigon took up to 30 seconds to respond to a query and required several round-trips to the LLM. Whether you’re running your own server or using an API from a company like OpenAI, that level of compute utilization can get expensive quickly.

On top of all that, RAG can only reduce hallucinations, not eliminate them completely. At a certain point, you are still deploying a black-box LLM that you can’t fully understand. There will always be edge cases where it responds in unexpected ways. Dealing with them will always be a challenge; RAG is just one approach to doing so.

Building better AI applications

A well-designed RAG pipeline using an appropriate data source enables you to build and deploy AI applications that are significantly more reliable in the real world. With access to accurate information and clear instructions, the chances of them hallucinating are much lower. Whether you’re building an external product or internal tools, RAG could make all the difference to your users.

However, RAG isn’t appropriate for every conceivable AI use case. An effective RAG pipeline relies on a high-quality database of information relevant to the application’s specific purpose. Creating these kinds of databases is not a trivial task, and may even be impossible in some situations, so proceed with caution.