PaperQA2 - Custom Open-source RAG for Scientific Documents with Citation Support

PaperQA2 is a package for doing high-accuracy retrieval augmented generation (RAG) on PDFs or text files, with a focus on the scientific literature.

By default, it uses OpenAI embeddings and models with a NumPy vector DB to embed and search documents. However, you can easily swap in other closed- or open-source models or embeddings (see details below).
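
This pluggable design can be sketched abstractly: an embedding backend is anything that maps text to a vector, so swapping providers means swapping one object. The names below are illustrative only, not PaperQA2's actual interfaces:

```python
from typing import Protocol

class EmbeddingModel(Protocol):
    """Minimal interface a swappable embedding backend would satisfy."""
    def embed(self, text: str) -> list[float]: ...

class DummyEmbedding:
    """Stand-in backend: hashes characters into a tiny fixed-size vector.
    A real backend would call an API or a local model instead."""
    def embed(self, text: str) -> list[float]:
        vec = [0.0] * 4
        for i, ch in enumerate(text):
            vec[i % 4] += ord(ch)
        return vec

def embed_corpus(model: EmbeddingModel, docs: list[str]) -> list[list[float]]:
    # The rest of the pipeline only depends on the Protocol, not the backend.
    return [model.embed(d) for d in docs]
```

Because downstream code depends only on the interface, replacing OpenAI embeddings with an open-source model is a one-line change at construction time.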

PaperQA2 depends on some awesome libraries/APIs that make the project possible. Here are some, in no particular order:

  1. Semantic Scholar
  2. Crossref
  3. Unpaywall
  4. Pydantic
  5. tantivy
  6. LiteLLM
  7. pybtex
  8. PyMuPDF

Features

  • A simple interface to get good answers with grounded responses containing in-text citations
  • State-of-the-art implementation including document metadata-awareness in embeddings and LLM-based re-ranking and contextual summarization (RCS)
  • Support for agentic RAG, where a language agent can iteratively refine queries and answers
  • Documentation available
  • Automatic redundant fetching of paper metadata, including citation and journal quality data from multiple providers
  • A usable full-text search engine for a local repository of PDF/text files
  • A robust interface for customization, with default support for all LiteLLM models

How does it work?

PaperQA2 Algorithm: Understanding Its Workflow

The PaperQA2 algorithm follows a structured workflow to process documents and generate answers. Its default workflow is divided into three main phases:

Phase 1: Paper Search

  • Generate Candidate Papers: The process begins by using an LLM to create keyword-based queries that retrieve papers potentially relevant to the question.
  • Chunking and Embedding: Retrieved papers are divided into manageable chunks, embedded as vectors, and stored in the system’s state for further processing.
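
The chunking step can be illustrated with a minimal sketch. This is not PaperQA2's actual implementation; the character-based splitting, chunk size, and overlap values are assumptions for illustration:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character chunks (illustrative defaults).

    Overlap keeps sentences that straddle a boundary retrievable from
    at least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk already reaches the end of the text
    return chunks
```

Each resulting chunk would then be embedded as a vector and stored alongside its source-paper metadata.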

Phase 2: Gather Evidence

  • Query Embedding and Ranking: The user’s query is embedded into a vector and compared with the stored document chunks to rank the top k most relevant chunks.
  • Scored Summaries: Each ranked chunk is summarized and scored in the context of the query.
  • Rescoring with LLM: An LLM re-evaluates and selects the most relevant summaries for further use.
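
The ranking step above can be sketched in pure Python. This is a simplified illustration of cosine-similarity top-k retrieval, not PaperQA2's actual code (which uses embedding models and a NumPy vector store):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]
```

The chunks selected here are what get summarized and re-scored by the LLM in the next two bullets.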

Phase 3: Generate Answer

  • Contextual Prompting: The best summaries, along with additional context, are combined into a prompt.
  • Answer Generation: The prompt is processed to generate a comprehensive and accurate answer to the query.
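
The contextual-prompting step can be sketched as simple string assembly. The prompt wording and citation-key format below are hypothetical examples, not PaperQA2's actual templates:

```python
def build_prompt(question: str, summaries: list[tuple[str, str]]) -> str:
    """Combine scored summaries, given as (citation_key, text) pairs,
    into a single answer-generation prompt."""
    context = "\n".join(f"[{key}]: {text}" for key, text in summaries)
    return (
        "Answer the question using only the context below, "
        "citing sources by key.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Keeping citation keys attached to each summary is what lets the final answer carry grounded in-text citations.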

Dynamic Tool Invocation

One of the strengths of PaperQA2 is its flexibility. Tools can be employed in any order based on the needs of a language agent. For example:

  • An LLM agent might perform a narrow or broad search depending on the complexity of the query.
  • Different phrasing may be used for gathering evidence and generating the final answer, optimizing the relevance and accuracy of results.

This adaptability ensures that PaperQA2 is robust and versatile for various query types and contexts.
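
That flexible ordering can be sketched as a tool registry driven by an agent's plan. The tool names, state fields, and plan format here are hypothetical, chosen only to mirror the three phases described above:

```python
from typing import Callable

# Hypothetical tools; real PaperQA2 tool names and signatures may differ.
def paper_search(state: dict, query: str) -> dict:
    state["papers"].append(f"papers for: {query}")
    return state

def gather_evidence(state: dict, query: str) -> dict:
    state["evidence"].append(f"evidence for: {query}")
    return state

def generate_answer(state: dict, query: str) -> dict:
    state["answer"] = f"Answer to: {query}"
    return state

TOOLS: dict[str, Callable[[dict, str], dict]] = {
    "paper_search": paper_search,
    "gather_evidence": gather_evidence,
    "generate_answer": generate_answer,
}

def run_agent(plan: list[str], query: str) -> dict:
    """Execute tools in whatever order the agent's plan dictates."""
    state: dict = {"papers": [], "evidence": [], "answer": None}
    for tool_name in plan:
        state = TOOLS[tool_name](state, query)
    return state
```

Because the loop imposes no fixed order, an agent can search twice before gathering evidence, or skip straight to answering when enough evidence is already in the state.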

License

Apache 2.0 License

Resources & Downloads

  • GitHub - Future-House/paper-qa: High accuracy RAG for answering questions from scientific documents with citations
  • See the team's 2024 paper for examples of PaperQA2's superhuman performance in scientific tasks like question answering, summarization, and contradiction detection.






