From PDFs to Structured XML: How GROBID Simplifies the Process


GROBID is a powerful machine-learning library designed to transform raw documents, like PDFs, into structured XML/TEI documents, with a focus on technical and scientific publications.

GROBID began as a hobby project in 2008, was open-sourced in 2011, and has steadily evolved as a side project ever since.

GROBID offers a reliable, open-source solution for researchers, developers, and organizations needing to extract and structure complex information from scientific and technical documents efficiently.

The app is built in Java, ensuring compatibility across Windows, Linux, and macOS platforms.

GROBID includes a comprehensive web service API, Docker images, batch processing, a Java API, a generic training and evaluation framework (precision, recall, etc., n-fold cross-evaluation), systematic end-to-end benchmarking on thousands of documents, and the semi-automatic generation of training data.
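As a rough illustration of the web service API, the following Python sketch posts a PDF to a running GROBID server and returns the TEI XML. It assumes a server reachable at http://localhost:8070 (GROBID's default port) and the /api/processFulltextDocument endpoint, with the PDF sent in a multipart field named input. The helper names build_multipart and process_fulltext are made up for this example, and only the standard library is used.

```python
import urllib.request
import uuid

# Assumption: a GROBID server is running locally on its default port.
GROBID_URL = "http://localhost:8070"


def build_multipart(field, filename, payload, boundary=None):
    """Encode one file as a multipart/form-data body.

    Returns a (body_bytes, content_type_header) tuple.
    """
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def process_fulltext(pdf_path, base_url=GROBID_URL, timeout=120):
    """POST a PDF to the full-text endpoint and return the TEI XML as text."""
    with open(pdf_path, "rb") as fh:
        body, content_type = build_multipart("input", "document.pdf", fh.read())
    request = urllib.request.Request(
        f"{base_url}/api/processFulltextDocument",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a server running, process_fulltext("paper.pdf") would return the TEI document as a string; paper.pdf is a placeholder path.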

Key Features:

  1. Header Extraction and Parsing: Automatically extracts bibliographic details, including title, abstract, authors, affiliations, keywords, and more.
  2. Affiliation and Address Parsing: Accurately extracts and structures affiliation and address blocks from articles.
  3. Date Parsing: Parses dates into ISO-standardized formats, including day, month, and year.
  4. Full-Text Extraction: Extracts and structures full text from PDF articles for seamless analysis.
  5. Reference Extraction and Parsing:
    • Handles both patent and non-patent references from patent publications.
    • Achieves high accuracy:
      • ~0.87 F1-score on a PubMed Central set of 1,943 PDFs containing 90,125 references.
      • ~0.89 F1-score on a bioRxiv set of 2,000 PDFs using its Deep Learning citation model.
  6. Citation Context Recognition: Identifies and resolves citation contexts within the text.
  7. PDF Coordinates for Extracted Information: Maps extracted content to its exact location within the PDF for precision and usability.
  8. Comprehensive Metadata Handling: Supports standard publication metadata, including DOI, PMID, and other critical identifiers.
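To give a sense of what working with the extracted header looks like, here is a minimal sketch that reads a title, authors, an ISO-formatted publication date, and a DOI from a TEI document using Python's standard XML parser. The SAMPLE_TEI string is a simplified, hand-written stand-in, not real GROBID output, and the read_header function is hypothetical; actual GROBID results contain many more elements and may be organised differently.

```python
import xml.etree.ElementTree as ET

# TEI documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, simplified sample for illustration only (not GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">An Example Article</title></titleStmt>
      <publicationStmt>
        <date type="published" when="2020-01-15">15 January 2020</date>
      </publicationStmt>
      <sourceDesc><biblStruct>
        <analytic>
          <author><persName>
            <forename>Ada</forename><surname>Lovelace</surname>
          </persName></author>
        </analytic>
        <idno type="DOI">10.1234/example</idno>
      </biblStruct></sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def read_header(tei_xml):
    """Collect a few bibliographic fields from a TEI header."""
    root = ET.fromstring(tei_xml)
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    date = root.find(".//tei:publicationStmt/tei:date", TEI_NS)
    doi = root.find(".//tei:idno[@type='DOI']", TEI_NS)
    authors = []
    for pers in root.iterfind(".//tei:author/tei:persName", TEI_NS):
        parts = [
            pers.findtext("tei:forename", default="", namespaces=TEI_NS),
            pers.findtext("tei:surname", default="", namespaces=TEI_NS),
        ]
        authors.append(" ".join(p for p in parts if p))
    return {
        "title": title.text if title is not None else None,
        # The "when" attribute carries the normalised ISO date.
        "published": date.get("when") if date is not None else None,
        "doi": doi.text if doi is not None else None,
        "authors": authors,
    }


header = read_header(SAMPLE_TEI)
print(header["title"], header["published"], header["doi"], header["authors"])
```

The same lookups can be pointed at the TEI returned by a GROBID server, adjusting the paths to whatever structure the real output uses.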

Clients

To make the GROBID service easier to use at scale, clients written in Python, Java, and Node.js are available. They use the web services for parallel batch processing.

A third-party client for Go is also available, offering functionality similar to the Python client.

All these clients take advantage of multi-threading to scale the processing of large sets of PDFs. As a consequence, they are much more efficient than the batch command lines (which use only one thread) and should be preferred.
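The idea behind this kind of parallel batch processing can be sketched with a thread pool. This is not the code of any of the clients mentioned above, only an illustration of the pattern: process_batch and its process_one argument are hypothetical names, and process_one stands for any function that sends a single PDF to the service and returns the result.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_batch(paths, process_one, workers=4):
    """Run process_one over many paths concurrently.

    Returns (results, errors): two dicts keyed by path, so one failing
    document does not abort the rest of the batch.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each submitted future back to the path it is processing.
        futures = {pool.submit(process_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going, record the failure
                errors[path] = exc
    return results, errors
```

Threads suit this workload because each worker spends most of its time waiting on the server's HTTP response. A sensible value for workers depends on how many requests the server is configured to handle concurrently.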

GROBID Modules

A series of additional modules has been developed for structure-aware text mining directly on scholarly PDFs, reusing GROBID's PDF processing and sequence labelling machinery:

  • software-mention: recognition of software mentions and associated attributes in scientific literature
  • datastet: identification of sections and sentences introducing datasets in a scientific article, identification of dataset names and attributes (implicit and named datasets), and classification of dataset types
  • grobid-quantities: recognition and normalization of physical quantities/measurements
  • grobid-superconductors: recognition of superconductor material and properties in scientific literature
  • entity-fishing: extraction of Wikidata entities from text and documents; it can also use GROBID to pre-process scientific articles in PDF, leading to more precise and relevant entity extraction and the ability to annotate the PDF with an interactive layout
  • grobid-ner: named entity recognition
  • grobid-astro: recognition of astronomical entities in scientific papers
  • grobid-bio: a toy bio-entity tagger using BioNLP/NLPBA 2004 dataset
  • grobid-dictionaries: structuring dictionaries in raw PDF format

License

GROBID is distributed under the Apache 2.0 license.

The documentation is distributed under the CC-0 license and the annotated data under the CC-BY license.

How to cite

If you want to cite GROBID, please refer to its GitHub project, together with the Software Heritage project-level permanent identifier. For example, with BibTeX:

@misc{GROBID,
    title = {GROBID},
    howpublished = {\url{https://github.com/kermitt2/grobid}},
    publisher = {GitHub},
    year = {2008--2024},
    archivePrefix = {swh},
    eprint = {1:dir:dab86b296e3c3216e2241968f0d63b68e8209d3c}
}

Resources & Downloads

  • GitHub - kermitt2/grobid: A machine learning software for extracting information from scholarly documents
  • GROBID: Download GROBID for free.








By Hazem Abbas