PaperQA2 - Custom Open-source RAG for Scientific Documents with Citation Support

PaperQA2 is a package for doing high-accuracy retrieval augmented generation (RAG) on PDFs or text files, with a focus on the scientific literature.

By default, it uses OpenAI embeddings and models with a NumPy vector DB to embed and search documents. However, you can easily swap in other closed- or open-source models or embeddings (see details below).
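
This pluggable design can be sketched abstractly: an embedding backend is anything that maps text to a vector, so swapping providers means swapping one object. The names below are illustrative only, not PaperQA2's actual interfaces:

```python
from typing import Protocol

class EmbeddingModel(Protocol):
    """Minimal interface a swappable embedding backend would satisfy."""
    def embed(self, text: str) -> list[float]: ...

class DummyEmbedding:
    """Stand-in backend: hashes characters into a tiny fixed-size vector.
    A real backend would call an API or a local model instead."""
    def embed(self, text: str) -> list[float]:
        vec = [0.0] * 4
        for i, ch in enumerate(text):
            vec[i % 4] += ord(ch)
        return vec

def embed_corpus(model: EmbeddingModel, docs: list[str]) -> list[list[float]]:
    # The rest of the pipeline only depends on the Protocol, not the backend.
    return [model.embed(d) for d in docs]
```

Because downstream code depends only on the interface, replacing OpenAI embeddings with an open-source model is a one-line change at construction time.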

PaperQA2 depends on some awesome libraries/APIs that make the project possible. Here are some, in no particular order:

  1. Semantic Scholar
  2. Crossref
  3. Unpaywall
  4. Pydantic
  5. tantivy
  6. LiteLLM
  7. pybtex
  8. PyMuPDF

Features

  • A simple interface to get good answers with grounded responses containing in-text citations
  • State-of-the-art implementation including document metadata-awareness in embeddings and LLM-based re-ranking and contextual summarization (RCS)
  • Support for agentic RAG, where a language agent can iteratively refine queries and answers
  • Documentation available
  • Automatic redundant fetching of paper metadata, including citation and journal quality data from multiple providers
  • A usable full-text search engine for a local repository of PDF/text files
  • A robust interface for customization, with default support for all LiteLLM models

How does it work?

PaperQA2 Algorithm: Understanding Its Workflow

The PaperQA2 algorithm follows a structured workflow to process documents and generate answers. Its default workflow is divided into three main phases:

Phase 1: Paper Search

  • Generate Candidate Papers: The process begins by using an LLM to create keyword-based queries that retrieve papers potentially relevant to the question.
  • Chunking and Embedding: Retrieved papers are divided into manageable chunks, embedded as vectors, and stored in the system’s state for further processing.
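
The chunking step can be illustrated with a minimal sketch. This is not PaperQA2's actual implementation; the character-based splitting, chunk size, and overlap values are assumptions for illustration:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character chunks (illustrative defaults).

    Overlap keeps sentences that straddle a boundary retrievable from
    at least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk already reaches the end of the text
    return chunks
```

Each resulting chunk would then be embedded as a vector and stored alongside its source-paper metadata.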

Phase 2: Gather Evidence

  • Query Embedding and Ranking: The user’s query is embedded into a vector and compared with the stored document chunks to rank the top k most relevant chunks.
  • Scored Summaries: Each ranked chunk is summarized and scored in the context of the query.
  • Rescoring with LLM: An LLM re-evaluates and selects the most relevant summaries for further use.
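
The ranking step above can be sketched in pure Python. This is a simplified illustration of cosine-similarity top-k retrieval, not PaperQA2's actual code (which uses embedding models and a NumPy vector store):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]
```

The chunks selected here are what get summarized and re-scored by the LLM in the next two bullets.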

Phase 3: Generate Answer

  • Contextual Prompting: The best summaries, along with additional context, are combined into a prompt.
  • Answer Generation: The prompt is processed to generate a comprehensive and accurate answer to the query.
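
The contextual-prompting step can be sketched as simple string assembly. The prompt wording and citation-key format below are hypothetical examples, not PaperQA2's actual templates:

```python
def build_prompt(question: str, summaries: list[tuple[str, str]]) -> str:
    """Combine scored summaries, given as (citation_key, text) pairs,
    into a single answer-generation prompt."""
    context = "\n".join(f"[{key}]: {text}" for key, text in summaries)
    return (
        "Answer the question using only the context below, "
        "citing sources by key.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Keeping citation keys attached to each summary is what lets the final answer carry grounded in-text citations.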

Dynamic Tool Invocation

One of the strengths of PaperQA2 is its flexibility. Tools can be employed in any order based on the needs of a language agent. For example:

  • An LLM agent might perform a narrow or broad search depending on the complexity of the query.
  • Different phrasing may be used for gathering evidence and generating the final answer, optimizing the relevance and accuracy of results.

This adaptability ensures that PaperQA2 is robust and versatile for various query types and contexts.
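
That flexible ordering can be sketched as a tool registry driven by an agent's plan. The tool names, state fields, and plan format here are hypothetical, chosen only to mirror the three phases described above:

```python
from typing import Callable

# Hypothetical tools; real PaperQA2 tool names and signatures may differ.
def paper_search(state: dict, query: str) -> dict:
    state["papers"].append(f"papers for: {query}")
    return state

def gather_evidence(state: dict, query: str) -> dict:
    state["evidence"].append(f"evidence for: {query}")
    return state

def generate_answer(state: dict, query: str) -> dict:
    state["answer"] = f"Answer to: {query}"
    return state

TOOLS: dict[str, Callable[[dict, str], dict]] = {
    "paper_search": paper_search,
    "gather_evidence": gather_evidence,
    "generate_answer": generate_answer,
}

def run_agent(plan: list[str], query: str) -> dict:
    """Execute tools in whatever order the agent's plan dictates."""
    state: dict = {"papers": [], "evidence": [], "answer": None}
    for tool_name in plan:
        state = TOOLS[tool_name](state, query)
    return state
```

Because the loop imposes no fixed order, an agent can search twice before gathering evidence, or skip straight to answering when enough evidence is already in the state.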

License

Apache 2.0 License

Resources & Downloads

  • GitHub - Future-House/paper-qa: High accuracy RAG for answering questions from scientific documents with citations
  • See the team's 2024 paper for examples of PaperQA2's superhuman performance in scientific tasks like question answering, summarization, and contradiction detection.






