Is RAG Still Needed? Choosing the Best Approach for LLMs
There's a fundamental truth about LLMs, or large language models: they are frozen in time. They know everything about our world up until their training cutoff date and absolutely nothing about what happened five minutes ago. Nor do they know anything about your private data, your internal wikis, or your proprietary codebase. And if we do want an LLM to know any of that stuff, we have to solve the problem of context injection: how do we get the right data into the model at the right time? There have been two very different ways to handle this.

The first is what we can think of as the engineering approach: RAG, retrieval augmented generation. Here we've got an LLM and an input prompt from the user. Ahead of time, we take the documents we want to give to this LLM, which could be PDFs, code files, or entire books, and we chunk them, breaking them into smaller pieces. We pass those chunks through an embedding model, which turns them into vectors, and those vectors are stored in a dedicated vector database. When a user asks a question, the system performs a semantic search against that database to retrieve the most relevant chunks and injects them into the context window. So the context window now contains the user prompt plus the retrieved chunks, and together they form the context the model actually sees. This works, but it relies on one thing: the hope that your retrieval logic actually found the right information in the vector database.

The second approach is more of a brute-force approach, and it's called long context. This is the model-native solution: you skip the vector database and you skip the embedding model. You take your documents, put them straight into the context window, and let the model's attention mechanism do the heavy lifting of finding the answer.

For a long time, this brute-force method wasn't really an option, because early context windows were tiny. Early LLMs could hold maybe 4K tokens of context. You couldn't fit a novel in there, let alone a corporate knowledge base, so you basically had to use RAG. But today's models have much larger context windows, some of them a million tokens or more.
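To make the RAG pipeline described above concrete, here is a minimal sketch in Python. Everything in it is an illustrative stand-in rather than any particular library's API: the bag-of-words "embedding" plays the role of a real embedding model, and a plain in-memory list plays the role of the vector database.

```python
# Toy sketch of the RAG pipeline: chunk documents, "embed" them, store the
# vectors, then at query time retrieve the most similar chunks and inject
# them into the prompt. The embedding and the "vector database" here are
# deliberately simplistic stand-ins for the real components.
import math
from collections import Counter

def chunk(text: str, size: int = 50) -> list[str]:
    """Split a document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Ahead of time: chunk each document and store (vector, chunk) pairs
# in an in-memory "vector database".
documents = ["... your PDFs, code files, or books as plain text ..."]
index = [(embed(c), c) for doc in documents for c in chunk(doc)]

def retrieve(query: str, k: int = 3) -> list[str]:
    """Semantic search: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [c for _, c in ranked[:k]]

def build_prompt(query: str) -> str:
    """Inject the retrieved chunks into the context window with the user prompt."""
    context = "\n\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The key point the sketch captures is the dependency the transcript calls out: whatever the LLM ends up seeing is only as good as what `retrieve` pulls out of the index.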
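By contrast, the long-context approach needs almost no machinery. Here is a hypothetical sketch of the same step without retrieval; the token budget and the characters-per-token estimate are rough illustrative numbers, not any specific model's limits.

```python
# Long-context alternative: skip the embedding model and the vector database,
# paste the documents straight into the prompt, and rely on the model's
# attention to find the relevant passage.
def build_long_context_prompt(query: str, documents: list[str],
                              max_tokens: int = 1_000_000) -> str:
    """Concatenate whole documents into the context window, up to a token budget."""
    corpus = "\n\n".join(documents)
    # Crude token estimate: roughly 4 characters per token.
    if len(corpus) / 4 > max_tokens:
        raise ValueError("Documents exceed the context window; consider RAG instead.")
    return f"Documents:\n{corpus}\n\nQuestion: {query}"
```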