From PDFs to Structured XML: How GROBID Simplifies the Process

Hazem Abbas

Nov 24, 2024

GROBID is a powerful machine-learning library designed to transform raw documents, like PDFs, into structured XML/TEI documents, with a focus on technical and scientific publications.

Originally started as a hobby project in 2008, GROBID was open-sourced in 2011 and has steadily evolved as a side project ever since.

GROBID offers a reliable, open-source solution for researchers, developers, and organizations needing to extract and structure complex information from scientific and technical documents efficiently.

The app is built in Java, ensuring compatibility across Windows, Linux, and macOS platforms.

GROBID includes a comprehensive web service API, Docker images, batch processing, a Java API, a generic training and evaluation framework (precision, recall, and so on, with n-fold cross-evaluation), systematic end-to-end benchmarking on thousands of documents, and semi-automatic generation of training data.
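To make the web service API more concrete, here is a minimal sketch of sending one PDF to a GROBID server using only the Python standard library. It rests on assumptions you should check against the GROBID service documentation for your version: that a server is already running locally on port 8070, that the full-text endpoint is `/api/processFulltextDocument`, and that the PDF is sent as a multipart form field named `input`. The helper names `build_multipart` and `process_fulltext` are illustrative, not part of GROBID.

```python
import os
import urllib.request
import uuid

# Assumption: a GROBID server is running locally on its usual port.
GROBID_URL = "http://localhost:8070"


def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Encode a single file as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def process_fulltext(pdf_path: str, base_url: str = GROBID_URL) -> str:
    """POST one PDF to the full-text service and return the response as text.

    Assumption: the endpoint path and the 'input' field name match the
    GROBID version you are running.
    """
    with open(pdf_path, "rb") as fh:
        body, content_type = build_multipart(
            "input", os.path.basename(pdf_path), fh.read()
        )
    request = urllib.request.Request(
        f"{base_url}/api/processFulltextDocument",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return response.read().decode("utf-8")
```

Calling `process_fulltext("paper.pdf")` (a hypothetical file name) should return the TEI XML for that document if the server is reachable; network errors surface as the usual `urllib.error` exceptions.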

Key Features:

Header Extraction and Parsing: Automatically extracts bibliographic details, including title, abstract, authors, affiliations, keywords, and more.
Affiliation and Address Parsing: Accurately extracts and structures affiliation and address blocks from articles.
Date Parsing: Parses dates into ISO-standardized formats, including day, month, and year.
Full-Text Extraction: Extracts and structures full text from PDF articles for seamless analysis.
Reference Extraction and Parsing:
- Handles both patent and non-patent references from patent publications.
- Achieves high accuracy:
  - ~0.87 F1-score on a PubMed Central set of 1,943 PDFs containing 90,125 references.
  - ~0.89 F1-score on a bioRxiv set of 2,000 PDFs using its Deep Learning citation model.
Citation Context Recognition: Identifies and resolves citation contexts within the text.
PDF Coordinates for Extracted Information: Maps extracted content to its exact location within the PDF for precision and usability.
Comprehensive Metadata Handling: Supports standard publication metadata, including DOI, PMID, and other critical identifiers.
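
The extracted header fields listed above end up as elements in the TEI XML output, so they can be read back with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a hand-written, heavily simplified TEI fragment; the sample document, the element paths chosen, and the `extract_header` helper are illustrative assumptions, and real GROBID output is considerably richer.

```python
import xml.etree.ElementTree as ET

# TEI documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, simplified sample for illustration only (not real GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">An Example Paper</title></titleStmt>
      <publicationStmt>
        <date type="published" when="2024-11-24">24 Nov 2024</date>
      </publicationStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author><persName>
              <forename type="first">Ada</forename><surname>Lovelace</surname>
            </persName></author>
            <author><persName>
              <forename type="first">Alan</forename><surname>Turing</surname>
            </persName></author>
          </analytic>
          <idno type="DOI">10.1234/example</idno>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def extract_header(tei_xml: str) -> dict:
    """Pull title, author names, ISO publication date, and DOI from a TEI header."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)

    authors = []
    for pers in root.iterfind(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        parts = [
            pers.findtext("tei:forename", default="", namespaces=TEI_NS),
            pers.findtext("tei:surname", default="", namespaces=TEI_NS),
        ]
        authors.append(" ".join(p for p in parts if p))

    # The ISO-normalized date is carried in the 'when' attribute.
    date = root.find(".//tei:publicationStmt/tei:date", TEI_NS)
    published = date.get("when") if date is not None else None

    doi = root.findtext(".//tei:idno[@type='DOI']", default=None, namespaces=TEI_NS)
    return {"title": title, "authors": authors, "published": published, "doi": doi}


if __name__ == "__main__":
    print(extract_header(SAMPLE_TEI))
```

Keeping the namespace map in one place matters here: without it, `find` and `findtext` silently return nothing, because every TEI element is namespace-qualified.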

Clients

To make it easier to use the GROBID service at scale, the project provides clients written in Python, Java, and Node.js that call the web services for parallel batch processing:

Python GROBID client (the most complete one in terms of supported services and options)
Java GROBID client
Node.js GROBID client

A third-party client for Go is also available, offering functionality similar to the Python client:

Go GROBID client

All these clients use multi-threading to scale the processing of large sets of PDFs. As a consequence, they are much more efficient than the batch command lines (which use only one thread) and should be preferred.
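
The idea behind these clients, sending many PDFs to the service concurrently instead of one at a time, can be sketched with Python's standard thread pool. This is not the API of the official Python client; `process_batch` and its `process_one` callback are hypothetical names, and the callback stands in for whatever function actually sends a PDF to the server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Union


def process_batch(
    pdf_paths: Iterable[str],
    process_one: Callable[[str], str],
    workers: int = 4,
) -> Dict[str, Union[str, Exception]]:
    """Run process_one over many PDFs in parallel threads.

    Returns a mapping from each path to its result, or to the exception
    raised for that file, so one bad PDF does not abort the whole batch.
    """
    results: Dict[str, Union[str, Exception]] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit everything first so the pool can work on several files at once.
        futures = {path: pool.submit(process_one, path) for path in pdf_paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:  # keep the failure, continue with the rest
                results[path] = exc
    return results


if __name__ == "__main__":
    # Stand-in for a real request to the server, for demonstration only.
    def fake_process(path: str) -> str:
        return f"<TEI>{path}</TEI>"

    print(process_batch(["a.pdf", "b.pdf"], fake_process, workers=2))
```

Threads suit this workload because each worker spends most of its time waiting on the server's response. The worker count should stay in line with what the server is configured to handle; the right number depends on your deployment.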

GROBID Modules

A series of additional modules has been developed for structure-aware text mining directly on scholarly PDFs, reusing GROBID's PDF processing and sequence labelling machinery:

software-mention: recognition of software mentions and associated attributes in scientific literature
datastet: identification of sections and sentences introducing datasets in a scientific article, identification of dataset names and attributes (implicit and named datasets), and classification of the type of datasets
grobid-quantities: recognition and normalization of physical quantities/measurements
grobid-superconductors: recognition of superconductor material and properties in scientific literature
entity-fishing: a tool for extracting Wikidata entities from text and documents; it can also use GROBID to pre-process scientific articles in PDF, leading to more precise and relevant entity extraction and the ability to annotate the PDF with an interactive layout
grobid-ner: named entity recognition
grobid-astro: recognition of astronomical entities in scientific papers
grobid-bio: a toy bio-entity tagger using BioNLP/NLPBA 2004 dataset
grobid-dictionaries: structuring dictionaries in raw PDF format

License

GROBID is distributed under the Apache 2.0 license.

The documentation is distributed under the CC-0 license and the annotated data under the CC-BY license.

How to cite

If you want to cite this work, please refer to the GROBID GitHub project, together with the Software Heritage project-level permanent identifier. For example, with BibTeX:

@misc{GROBID,
    title = {GROBID},
    howpublished = {\url{https://github.com/kermitt2/grobid}},
    publisher = {GitHub},
    year = {2008--2024},
    archivePrefix = {swh},
    eprint = {1:dir:dab86b296e3c3216e2241968f0d63b68e8209d3c}
}

Resources & Downloads

GROBID

Download GROBID for free. A machine learning software for extracting information. GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents such as PDF into structured XML/TEI encoded documents with a particular focus on technical and scientific publications. First developments started in 2008 as a hobby.

SourceForge
