Retrieval Augmented Generation (RAG)

 By Oliver Allchin & Cat Bazela

[Image: a woman with blonde hair sits at a desk covered in open books, head dropped in exhaustion at the tasks in front of her.]


Literature reviews take ages. Having to find and evaluate different sources, explore and compare key information, theories and ideas, and work it all into some sort of synthesis - it’s all such a palaver.

Don’t worry though, AI can do all that tedious reading and thinking for you, summarising the entirety of human knowledge in seconds, leaving you free to do something presumably more interesting, like hoovering.

Or at least that’s the premise the latest raft of AI-assisted search tools seem to be built on, but can they really replicate the depth, thoroughness, insight and reliability of a human literature reviewer? Perhaps more importantly, do we want them to?

What is RAG?

Retrieval Augmented Generation (RAG) systems combine generative AI models with information retrieval. The goal is to make the responses given by GenAI more accurate and reliable by grounding them in information retrieved from verifiable sources at the time you ask your question, rather than relying on the model’s training data alone.

When you enter a prompt or question, the AI turns this into a search, finds relevant sources (either from the web or from a database or corpus of published literature) and summarises the top results with citations. Some tools generate a research plan which you can edit and refine to guide the process. In theory, these tools allow you to use an LLM as a ‘research assistant’ that searches for and summarises information, providing a short list of sources for further reading.
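To make the pattern concrete, here is a minimal sketch of a RAG pipeline in Python. It is illustrative only: the tiny in-memory corpus, the keyword-overlap retrieve() function and the stubbed generate() call are placeholders of our own, standing in for the web searches, vector indexes and hosted language models that the tools reviewed below actually use.

from collections import Counter

# A toy corpus standing in for a web index or database of published literature.
CORPUS = [
    {"id": 1, "title": "RSPB advice on feeding garden birds",
     "text": "Clean feeders regularly and remove food if sick birds are seen."},
    {"id": 2, "title": "Defra guidance on avian influenza",
     "text": "Report dead wild birds and follow biosecurity advice during outbreaks."},
    {"id": 3, "title": "University digital induction case study",
     "text": "New students complete an online module covering core digital tools."},
]

def retrieve(query: str, k: int = 2) -> list[dict]:
    """Rank documents by crude keyword overlap with the query (a stand-in for a real search engine)."""
    query_words = Counter(query.lower().split())
    def score(doc: dict) -> int:
        doc_words = Counter((doc["title"] + " " + doc["text"]).lower().split())
        return sum((query_words & doc_words).values())
    return sorted(CORPUS, key=score, reverse=True)[:k]

def build_prompt(query: str, sources: list[dict]) -> str:
    """Assemble the augmented prompt: the question plus numbered source snippets to cite."""
    source_block = "\n".join(f"[{d['id']}] {d['title']}: {d['text']}" for d in sources)
    return ("Answer the question using only the sources below, citing them as [n].\n\n"
            f"Sources:\n{source_block}\n\nQuestion: {query}\nAnswer:")

def generate(prompt: str) -> str:
    """Placeholder for a call to a hosted LLM; a real tool would send the prompt to a model here."""
    return f"(a generated, cited summary would be produced from this {len(prompt)}-character prompt)"

if __name__ == "__main__":
    question = "Is it safe to feed garden birds during a bird flu outbreak?"
    print(generate(build_prompt(question, retrieve(question))))

The key point is that the model is asked to answer from the numbered sources it is given, and to cite them, rather than from whatever it absorbed during training.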

However, it’s important to critically evaluate and verify the responses such tools give. In this article, we take a look at some freely available RAG tools and undertake a light-touch assessment of the quality of their responses to our prompts.

Perplexity

Perplexity has a tool called "Deep Research" which is designed to "save you hours of time by conducting in-depth research and analysis on your behalf". The tool refines its search with each iteration and then provides you with a report.

We used the prompt “How do universities in the UK typically induct their new and existing students into the digital tools, websites, apps and equipment that they will use during their studies? How do they ensure entry-level digital skills are taught, as well as an understanding of the ethical approaches to using technology?” 

The generated output was a structured report, along with information on how the tool identified the 46 references/sources used in its creation. All of the sources were listed and, on checking, all were real. On a first read-through, the report seemed well written and appeared to draw on the resources the tool had provided.

However, of the 46 sources listed, only six were cited within the report itself. Of those six in-text citations, three were government web pages, two were web pages from higher education institutions, and one was a list-style blog post. It then became apparent that more of the sources had been used for information but not cited within the text. The tool was also prone to attributing information to the wrong source, or to amalgamating sources to describe something that doesn’t exist (hallucination). Overall, the references cited did not accurately represent the information presented in the report.

[Image: a collection of notes, sticky post-its, pens and paper, gathered together in what could be a typical research session.]


Gemini Deep Research

Gemini’s Deep Research feature is marketed as a 'research assistant' that will save you 'hours of work' and is available via Google Gemini.

Essentially, it Googles stuff for you and writes you a report of what it’s found. This report can be converted into an 'audio overview' so you can listen to it.

We used the prompt "I have blue tits nesting in our garden in the UK. Is it safe to feed them given the current situation with bird flu?"

It generated a research plan, which is a set of questions it aims to answer. This is potentially helpful for users who are trying to think of research questions to use as the basis for a literature review.


We found the response generated by Deep Research was mostly useful and accurately reflected current UK guidance on feeding garden birds. However, the sources consulted were a mixed bag. Whilst it did consult appropriate sources such as Defra and the RSPB, there were several citations to the websites of bird food companies, which arguably have a vested interest in the question of whether or not to feed birds. Despite identifying the city and country we made the request from, Deep Research made frequent references to US legislation and agencies that weren’t relevant to this context.

In one example, guidance from Defra (a UK agency) is referenced, yet the first source cited is the blog of a company that sells bird feeders, and the second is the Pennsylvania Game Commission website, which makes no mention of Defra guidelines.

Whilst the 2800-word report was detailed and mostly helpful, it didn’t necessarily provide more information than we found by simply going to the RSPB website. 

Ai2 Scholar QA

Ai2 Scholar QA is a free tool designed to "satisfy literature searches that require insights from multiple relevant documents, and synthesise those insights into a comprehensive report".

It searches a corpus of around 8 million open access academic papers - mostly in science, engineering, environment and medicine - indexed using the Vespa search engine. When you enter a question or prompt, it generates summaries with citations to key evidence from the literature, including tables comparing key themes from the papers it finds. It’s available to use free of charge, without the need to sign up for an account.

We used the prompt “How can we encourage people to follow hygiene guidance during pandemics?”

This generated a seemingly coherent, well-structured report with citations to (mostly) real and relevant journal papers. The summary was a little generic and lacking in detail, full of broad-brush statements rather than specific data or evidence. One nice feature is the literature comparison table, which summarises evidence from a handful of studies. However, there’s no indication of why the papers in the table have been selected over the others cited in the summary.

Citations give no page numbers, but hovering over the citation brings up a snippet of the original text that’s being summarised. Often this bears little relation to the point being made - for example, a point about using serving utensils and avoiding shaking hands cited a text that didn’t appear to mention either. 

We noticed instances where the citation refers to a section of a paper that is itself paraphrasing yet another paper (secondary citation).


We also noticed some citations to ‘LLM Memory’. Hovering over the citation displays the following text: "Generated by Anthropic Claude…we could not find any reference with evidence that supports this statement". Clearly, the model is prone to hallucination, but at least it flags these up.
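It isn’t clear how the tool decides that a statement lacks supporting evidence, but the sketch below shows one crude way such a check could work, using a simple word-overlap heuristic of our own devising. It is purely illustrative; a real system would more likely use a model-based judgement of whether a retrieved snippet actually supports the claim.

def is_supported(statement: str, snippets: list[str], threshold: float = 0.5) -> bool:
    """Treat a statement as supported if enough of its longer words appear in some retrieved snippet."""
    words = {w for w in statement.lower().split() if len(w) > 3}
    if not words:
        return False
    return any(len(words & set(s.lower().split())) / len(words) >= threshold
               for s in snippets)

# Example: one claim overlaps the retrieved evidence, the other would be flagged.
snippets = ["Hand hygiene campaigns increased compliance in hospital settings."]
for claim in ["Hand hygiene campaigns increased compliance.",
              "Serving utensils should be used at shared meals."]:
    label = "cited to evidence" if is_supported(claim, snippets) else "flagged: no supporting reference found"
    print(f"{claim} -> {label}")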




It may be that these issues are due to the prompt itself not being well defined, and it’s possible that we asked about an area not well covered by the underlying corpus. The tool doesn’t give any feedback on whether your prompt was appropriate, or any indication of whether it can answer it effectively, so it’s difficult to check this. Ai2 Scholar QA is also a work in progress and may well have changed since the time of writing.

Conclusion 

So, can RAG tools allow you to skip the whole literature review process? 

Not really, based on what we’ve seen so far. They may seem (and to some extent are) more reliable than standard GenAI chatbots, but they’re still prone to hallucination, bias, oversimplification and omitting key information. The process by which sources are selected isn’t transparent, so there’s no way to know whether the summary actually reflects the evidence from the literature unless you perform your own literature search to compare.

RAG can be seen as a way for AI developers to add a veneer of ‘scholarliness’ and veracity to their tools. At first glance, RAG summaries look reliable and well-referenced, but even a small amount of digging reveals some dubious sources and inaccurate referencing.

There is a danger that users will take the responses of these tools at face value, so it’s important to encourage people to take a critical approach to evaluating them.

These tools seem founded on the idea that reading and writing are tedious, time-consuming activities that AI is here to save us from. We’d argue that a literature review isn’t just a means to an end, it’s a valuable process in itself - a way of trying out different perspectives, exploring new ideas and reaching new conclusions that you wouldn’t otherwise have been able to arrive at. In handing responsibility for the literature search over to AI, you not only deny yourself this opportunity, but you also risk missing key sources and potentially getting a skewed, incomplete or biased snapshot of a topic. 

AI will only give you the prevailing discourse on any topic, the ‘average’ response. It will also amplify any biases or structural inequalities found in the source material. A human-led literature review is more likely to surface different perspectives, underrepresented voices and outlying ideas. It will also be more interesting.

RAG tools are potentially useful when starting a literature search, suggesting a handful of mostly relevant sources and a top-level summary as a jumping-off point for further investigation. Provided, that is, that you have the skills and time to question the responses they generate, engage with them critically, and undertake your own exploration of the literature.

Designing assignments that encourage students to critique the outputs of these tools, perhaps comparing the results against their own literature searches, may help students develop their critical thinking skills. We would recommend signposting the following guides:


Oliver Allchin is the Liaison Librarian for Science at the Western Bank Library
Cat Bazela is a Digital Learning Advisor in the Digital Learning Team