What is Text Annotation?

Text annotation is the process of associating labels or tags with specific parts of a text, such as words, phrases, or sentences. The aim is to attach additional information to the text, which can then be used for further analysis or processing, particularly in the field of Artificial Intelligence (AI).
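To make the idea concrete, here is a minimal sketch of how a labeled span might be stored. The SpanAnnotation class, its field names, and the label names are assumptions made for illustration; they are not the format of any tool listed in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanAnnotation:
    """Hypothetical record tying a label to a character range of a text."""
    start: int  # character offset where the span begins (inclusive)
    end: int    # character offset where the span ends (exclusive)
    label: str


text = "Marie Curie was born in Warsaw."
annotations = [
    SpanAnnotation(start=0, end=11, label="PERSON"),
    SpanAnnotation(start=24, end=30, label="LOCATION"),
]


def labeled_spans(text, annotations):
    """Return (covered text, label) pairs for each annotation."""
    return [(text[a.start:a.end], a.label) for a in annotations]


for covered, label in labeled_spans(text, annotations):
    print(f"{label}: {covered}")
```

Storing offsets rather than copies of the words keeps the annotation tied to an exact position, which matters when the same word appears more than once.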

Significance of Text Annotation

Text annotation is crucial for supervised machine learning tasks in AI applications. It helps to train AI models to understand language-based data more accurately.

By annotating text, we can teach AI models to recognize patterns in the text, understand context, and make accurate predictions when presented with new, similar data.

Benefits of Text Annotation

  1. Improved Accuracy: Text annotation can enhance the accuracy of AI models by giving them labeled examples to learn from.
  2. Contextual Understanding: It can help AI models understand the context of a text, leading to more nuanced and accurate responses.
  3. Better Predictions: With annotated data, AI models can make more accurate predictions, as they can better understand the patterns and context within the text. This can lead to improved performance in tasks such as text classification, sentiment analysis, and language translation.

In the context of AI applications, text annotation can be used in various ways. For instance, it can be used in natural language processing to help machines understand human language. It can also be used in chatbots to understand user queries and generate appropriate responses.

Furthermore, in sentiment analysis, text annotation can help determine the sentiment expressed in a piece of text, which can be particularly useful for businesses to understand customer feedback.

Overall, text annotation plays a fundamental role in enhancing the capabilities of AI applications by providing them with the necessary data to learn and understand language-based patterns.

In this post, you will find the best open-source text labeling and annotation tools that you can download, install, use, and customize for free.

1. Label Studio

Label Studio is an open-source data labeling tool that supports various data types and exports to multiple model formats. It is used to prepare raw data or enhance existing training data for more accurate machine learning models.

GitHub - HumanSignal/label-studio: Label Studio is a multi-type data labeling and annotation tool with standardized output format

2. Doccano

Doccano is an open-source text annotation tool offering features for text classification, sequence labeling, and sequence to sequence tasks. It supports collaborative annotation, multiple languages, mobile use, emojis, a dark theme, and a RESTful API.

It can be easily installed using Docker and Docker Compose.

GitHub - doccano/doccano: Open source annotation tool for machine learning practitioners.

3. Universal Data Tool

The Universal Data Tool is a versatile application for editing and annotating various types of data, including images, text, audio, and documents. It supports a wide range of data types and offers features such as real-time collaboration, an easy-to-use GUI for project configuration, and the ability to create training courses for labelers. The tool can be used on the web or as a desktop application, and supports data download/upload in CSV or JSON formats.

Features

  • Collaborate with others in real time, no sign up!
  • Usable on the web or as a Windows, Mac, or Linux desktop application
  • Configure your project with an easy-to-use GUI
  • Easily create courses to train your labelers
  • Download/upload as easy-to-use CSV (sample.udt.csv) or JSON
  • Support for Images, Videos, PDFs, Text, Audio Transcription and many other formats
  • Can be easily integrated into a React application
  • Annotate images or videos with classifications, tags, bounding boxes, polygons and points
  • Fast Automatic Smart Pixel Segmentation using WebWorkers and WebAssembly
  • Import data from Google Drive, Youtube, CSV, Clipboard and more
  • Annotate NLP datasets with Named Entity Recognition (NER), classification and Part of Speech (PoS) tagging.
  • Easily load into pandas or use with fast.ai
  • Runs easily with Docker: docker run -p 3000:3000 universaldatatool/universaldatatool
  • Runs with Singularity: singularity run universaldatatool/universaldatatool
GitHub - UniversalDataTool/universal-data-tool: Collaborate & label any type of data, images, text, or documents, in an easy web interface or desktop app.

4. YEDDA

YEDDA is a tool for annotating text in various languages, including symbols and emojis. It supports shortcut-key annotation and a command annotation mode, and exports annotated text as sequence text. It also includes intelligent recommendation and administrator analysis.

YEDDA is compatible with all mainstream operating systems including Windows, Linux, and MacOS.

GitHub - jiesutd/YEDDA: YEDDA: A Lightweight Collaborative Text Span Annotation Tool. Code for ACL 2018 Best Demo Paper Nomination.

5. Argilla

Argilla is a self-hosted open-source collaboration platform for AI engineers and domain experts, offering high-quality outputs, full data ownership, and increased efficiency.

It aids in improving AI output quality through data quality, offers control over data and models, and enhances efficiency by enabling quick iterations on data and models. Argilla also provides tools for effective data management and model training.

GitHub - argilla-io/argilla: Argilla is a collaboration platform for AI engineers and domain experts that require high-quality outputs, full data ownership, and overall efficiency.

6. Knowtator

Knowtator is a text annotation tool integrated with the Protégé knowledge representation system, designed to facilitate the creation of training and evaluation corpora for biomedical language processing tasks.

It allows for the easy definition and incorporation of complex annotation schemas. The tool is licensed under the Mozilla Public License Version 1.1, with potential different licensing restrictions for 3rd party dependencies.


7. KernAI Refinery

Refinery, an open-source platform by Kern AI, is designed for data scientists working with natural language data. It supports semi-automated labeling, data subset quality assessment, and centralized data monitoring, aiming to make manual labeling more efficient.

The tool utilizes technologies like Hugging Face and spaCy for pre-built language models and integrates with other labeling tools for flexible data processing.

Features

  • (Semi-)automated labeling workflow for NLP tasks
  • Manual and programmatic classifications and span-labeling
  • Integration with state-of-the-art libraries and frameworks
  • Creation and management of lookup lists/knowledge bases
  • Neural search-based retrieval of similar records and outliers
  • Sliceable labeling sessions
  • Multiple labeling tasks per project
  • Rich library of ready-made automations
  • Extensive data management and monitoring
  • Integration with Hugging Face for automatic creation of embeddings
  • JSON-based data model for data upload/download
  • Overview of project metrics
  • Data accessible and extendable via Python SDK
  • In-place attribute modifications
  • Team workspaces in the managed version
  • Role-based access and minimized labeling views for multiple users
  • Integration of crowd labeling workflows
  • Automated calculation of inter-annotator agreements
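Inter-annotator agreement can be measured in several ways. The sketch below computes Cohen's kappa for two annotators, one widely used metric; it is an illustration only, and nothing here is taken from Refinery's code or confirms which metric Refinery uses. The function name and the sample labels are made up for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
print(cohen_kappa(annotator_1, annotator_2))
```

A value of 1 means perfect agreement, while a value near 0 means the annotators agree about as often as chance would predict.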
GitHub - code-kern-ai/refinery: The data scientist’s open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.

8. Recogito.js

RecogitoJS is a JavaScript library for text annotation, useful for adding annotation functionality to a webpage or building custom annotation apps.

It can be installed via npm or by downloading the latest release. The library allows for event handlers and full documentation is available on the Wiki.

GitHub - recogito/recogito-js: A JavaScript library for text annotation

9. Label Sleuth

Label Sleuth is an open-source, no-code system for text annotation and classifier creation. It empowers domain experts like physicians, lawyers, and psychologists to build custom NLP models without NLP experts.

Typically, real-world NLP model creation requires both domain and machine learning expertise. Label Sleuth eliminates this need with an intuitive UX for data labeling and model building. As users label data, machine learning models are trained in the background, making predictions and suggesting what to label next.

Being a no-code system, it requires no machine learning knowledge and allows for rapid model development, from task definition to a working model in hours.
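One common way a system can suggest what to label next is uncertainty sampling: pick the examples the current model is least sure about. The sketch below shows that general idea for a binary classifier. It is not Label Sleuth's implementation, and the function name, document ids, and probabilities are invented for illustration.

```python
def least_confident(probabilities, k=1):
    """Return the k example ids whose predicted probability is closest to 0.5.

    probabilities maps an example id to the model's predicted probability
    of the positive class (a hypothetical input format).
    """
    ranked = sorted(probabilities, key=lambda example: abs(probabilities[example] - 0.5))
    return ranked[:k]


predictions = {"doc1": 0.95, "doc2": 0.52, "doc3": 0.10, "doc4": 0.40}
print(least_confident(predictions, k=2))
```

Labeling the most uncertain examples first tends to give the model more information per label than labeling examples it already classifies confidently.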

GitHub - label-sleuth/label-sleuth: Open source no-code system for text annotation and building of text classifiers

10. Markup

Markup is an online annotation tool that can be used to transform unstructured documents into structured formats for NLP and ML tasks, such as named-entity recognition. Markup learns as you annotate to predict and suggest complex annotations, and also provides integrated access to common and custom ontologies for concept mapping.

Features

  • Predictive annotation - Markup's machine learning-powered predictive annotation feature suggests complex annotations as you work, making the process of annotating documents more efficient and saving you valuable time.
  • Integrated ontology access - Markup provides integrated access to a wide range of common ontologies (e.g. UMLS, SNOMED-CT, ICD-10), as well as the ability to upload custom ontologies, for concept mapping.
  • Predictive ontology mapping - Markup's predictive ontology mapping feature uses machine learning to suggest appropriate mappings to standard and custom terminologies based on the text you're annotating.
  • User-friendly interface - Whether you're a technical expert or a beginner, Markup's user-friendly interface makes it easy for anyone to start annotating documents with minimal setup.
GitHub - samueldobbie/markup: A web-based document annotation tool, powered by GPT-4 :rocket:

11. Slate

Slate is a text document labeling tool that supports various scales and types of annotation, useful for tasks like Part-of-Speech tagging, Named Entity Recognition, Text Classification, and more.

It stands out for its speed, trivial installation, screen space optimization, terminal-based operation for constrained environments, and easy configuration and modification.

GitHub - jkkummerfeld/slate: A Super-Lightweight Annotation Tool for Experts: Label text in a terminal with just Python

12. DataLabel

DataLabel is a UI-based data editing tool that makes it easy to create labeled text data in a dataframe. With DataLabel, you can quickly and effortlessly edit your data without having to write any code. Its intuitive interface makes it ideal for both experienced data professionals and those new to data editing.

DataLabel is best used inside a Jupyter notebook or another IPython environment. Once DataLabel is installed, you can start using it right away: import DataLabel in your script and use the edit function to open the UI-based editor.

GitHub - CoreLabsAI/datalabel: datalabel is a UI-based data editing tool that makes it easy to create labeled text data in a dataframe. With datalabel, you can quickly and effortlessly edit your data without having to write any code. Its intuitive interface makes it ideal for both experienced data professionals and those new to data editing.

13. Anno

Anno is a Go package for text annotation. It offers a simple interface, is very extensible, and presents well in JSON, making it suitable for APIs.

GitHub - matryer/anno: Go package for text annotation.

14. MedTator

MedTator is a serverless text annotation tool designed for ease of use in corpus development, built with HTML5 and open-source packages. It requires no Java, Python, PHP, Docker, MySQL, or any server or client runtime installation for corpus annotation.

GitHub - OHNLP/MedTator: A Serverless Text Annotation Tool for Corpus Development

15. Label

LABEL is an open-source text annotation software developed by the French Supreme Court for publishing court decisions. It allows proofreading and review of pre-annotated decisions by an NLP algorithm.

It features an admin panel for managing documents and accounts, and allows for document assignment and reassignment.

Features

  • Admin Panel: Manages documents and accounts, assigns and reassigns documents.
  • Contextual Actions & Search: Allows viewing of anonymized documents, assignment of documents, and document search.
  • Advanced Filters: Enables document filtering by various fields (treatment date, import date, source database etc.)
  • Supplementary Annotations: Customizable annotation types, including non-generic ones.
  • Inline Editing: Provides interactive labels for in-text editing and annotation.
  • Linked Annotations: Allows linking of annotations for consistency in anonymized documents.
GitHub - Cour-de-cassation/label: Open source text annotation software created by the french supreme court ‘Cour de cassation’

16. Annotato (React)

Annotato is a React component designed to annotate or display annotations in a text, with applications in creating training data for machine learning or enhancing text reading experiences. It features two modes: read and edit. The read mode displays stored annotations, while the edit mode allows for new annotations and removal of existing ones.

The component also supports custom onClick and onHover events for annotations.

GitHub - YusufCelik/annotato: Annotato is a React component that helps to annotate or display and add interactivity to previously made annotations in a given text.

17. TextFlow

TextFlow is a minimalist interface for full text processing, supporting use-cases like data annotation, annotation monitoring, inter-annotator agreement, final dataset generation, and auto-models, with the last three being experimental.

Features

  • Data Annotation
  • Annotation Monitoring
  • Inter-annotator Agreement (Experimental)
  • Final Dataset Generation (Experimental)
  • Auto-Models (Experimental)
GitHub - ysenarath/textflow: Framework for Text Annotation.

18. Potato

Potato is a user-friendly, web-based text annotation tool that allows for quick setup and deployment of various text annotation tasks. It operates as a web server and is driven by a single configuration file, requiring no coding to start.

Although no additional web design is typically needed, Potato is easily customizable to adjust the interface and elements seen by annotators.

Key Features

  • Easy setup and customization
  • Wide range of built-in schemas and templates
  • Supports diverse data types
  • Supports multi-task setup
  • Improves annotator productivity with features like keyboard shortcuts, dynamic highlighting, and label tooltips
  • Features for learning more about your annotators, like pre- and post-screening questions
  • Quality control features like attention test, qualification test, and built-in time check
GitHub - davidjurgens/potato: potato: portable text annotation tool

19. Brat

Brat is a free and open-source rapid text annotation tool.

GitHub - huntdatacenter/bratserver: brat rapid annotation tool

20. Textcodify (PDF Annotation)

Textcodify is a program designed for creating NLP-style annotations for PDF files. Currently, the codebase is messy with a lot of global state and mutations, and it can't export yet as all annotations are stored in an SQLite database.

It's more of an experimental tool built with Vala, Poppler, and SQLite, and it may be useful for those who need a free tool for text analysis on PDF files, despite its limitations.

GitHub - FransHeuvelmans/Textcodify: Text annotator for pdf files

21. Vogon

Vogon is an ontology-based text annotation tool for creating relations between terms in a text. These relations can then be exported as RDF triples.
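An RDF triple states a relation as subject, predicate, object. The sketch below writes a few such relations in the N-Triples syntax, assuming every term is identified by an IRI. The example.org IRIs and the to_ntriples helper are hypothetical and do not reflect Vogon's actual export.

```python
def to_ntriples(triples):
    """Serialize (subject, predicate, object) IRI triples as N-Triples lines.

    Assumes all three parts are IRIs; literal objects would need quoting instead.
    """
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


relations = [
    ("http://example.org/term/Darwin", "http://example.org/rel/wrote", "http://example.org/term/OriginOfSpecies"),
    ("http://example.org/term/Darwin", "http://example.org/rel/visited", "http://example.org/term/Galapagos"),
]
print(to_ntriples(relations))
```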


22. Screenity (Google Chrome)

Screenity is a comprehensive screen and camera recorder for Chrome, offering unlimited recordings of your tab, desktop, or any application. It allows annotations, highlights clicks, and provides individual audio controls.

It also offers customization options and supports export in multiple formats. Features include trimming recordings, highlighting mouse activity, and running Screenity locally without installation from the Chrome Store.


23. Erupt Framework

Erupt Framework uses pure Java annotations in a single class file to rapidly build admin back ends. It generates no code, supports all mainstream databases, custom pages, and multiple data sources, and provides over 20 types of business components.

It uses core technologies like Spring Boot, JPA, Reflect, TypeScript, and NG-ZORRO. Erupt is a low-code full-stack class framework that dynamically generates pages and background functions.

Features

  • Automatic table creation: table structure is automatically generated, no need to manually create tables
  • Basic knowledge of Spring Boot is enough
  • You only need to understand the two annotations @Erupt and @EruptField to start developing
  • Only one .java file is needed; templates, controllers, services, and DAOs do not need to be created
  • Dynamic condition processing, logical deletion (tombstones), LDAP, custom login logic, RedisSession, operation log, etc.
  • Supports MySQL, Oracle, SQL Server, PostgreSQL, H2, and even MongoDB

GitHub - erupts/erupt: 🚀 Low-code admin management framework with an object-view model → 0️⃣ zero front-end code, zero code generation, zero SQL, zero API declarations, zero DTO / VO / BO creation, with table structure comments generated automatically 🛡 Built-in strict security policies and fine-grained permission isolation ☁️ Cloud development capabilities: upgrades without downtime, lightweight dependencies, and easy data visualization for every service in a cluster

24. Acharya

Acharya is a data-centric annotation tool designed to enhance the accuracy of Named Entity Recognition projects. It allows for rapid identification and correction of labeling errors, supports multiple data formats, and facilitates the training of models to assist in the annotation process.

The tool also enables the setup of an MLOps pipeline to experiment with different algorithms on the same data, thereby improving accuracy and performance. Acharya features a data-centric dashboard, an advanced workbench, in-built data versioning, and auto labeling suggestions.

GitHub - astutic/Acharya: A Data Centric NER annotation tool for your Named Entity Recognition projects

25. TagEditor (Windows)

TagEditor is a desktop application that allows quick text annotation using the spaCy library. It can annotate dependencies, parts of speech, named entities, text categories, and coreference resolution. It also allows creating customized annotated data or training datasets in .json or .spacy formats. No installation is required, simply download, unpack, and run 'TagEditor.exe'.

Features

  • Quick text annotation using the spaCy library
  • Ability to annotate dependencies, parts of speech, named entities, text categories, and coreference resolution
  • Creation of customized annotated data or training datasets in .json or .spacy formats
  • No installation required
  • Option to start tagging immediately or load existing datasets
  • Context menu for editing, deleting, inserting words or sentences and merging or splitting sentences
  • Ability to assign new paragraphs and remove all newline characters and extra whitespaces in the text
  • Creation of training data in "simple training style" or JSON
  • Option to save and load projects for future editing
  • Proper tokenization for different languages
  • Assignment of labels to named entities and creation of output data with char/token offset or BILUO / IOB scheme
  • Option to annotate on top of already annotated text in different modes
  • Editing of POS tags (fine-grained) and viewing coarse-grained pos tags and morphs
  • Coreference annotation according to PreCo 'Data Format'
  • Ability to assign labels to paragraphs, sentences or to spans in Text Categories
  • 'Spans classification mode' that allows multiple overlapping labels
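The IOB and BILUO schemes mentioned above encode entity spans as one tag per token. The sketch below converts token-level spans into either scheme. It is an illustration written for this post, not TagEditor's or spaCy's code; the function name and the span format (start token, exclusive end token, label) are assumptions, and IOB here means the variant where every entity starts with a B- tag.

```python
def spans_to_tags(num_tokens, spans, scheme="IOB"):
    """Convert (start, end_exclusive, label) token spans into per-token tags."""
    if scheme not in ("IOB", "BILUO"):
        raise ValueError("scheme must be 'IOB' or 'BILUO'")
    tags = ["O"] * num_tokens  # O = outside any entity
    for start, end, label in spans:
        if scheme == "BILUO" and end - start == 1:
            tags[start] = f"U-{label}"  # U = unit, a single-token entity
            continue
        tags[start] = f"B-{label}"  # B = beginning of an entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # I = inside an entity
        if scheme == "BILUO":
            tags[end - 1] = f"L-{label}"  # L = last token of a multi-token entity
    return tags


tokens = ["Marie", "Curie", "visited", "Paris"]
entity_spans = [(0, 2, "PER"), (3, 4, "LOC")]
print(spans_to_tags(len(tokens), entity_spans, scheme="IOB"))
print(spans_to_tags(len(tokens), entity_spans, scheme="BILUO"))
```

BILUO marks entity ends and single-token entities explicitly, which gives a model more boundary information at the cost of a larger tag set.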
GitHub - d5555/TagEditor: 🏖TagEditor - Annotation tool for spaCy

26. SMART

SMART is an open-source application that aids data scientists and research teams in creating labeled training datasets for supervised machine learning tasks. If used for a research publication, it should be cited as detailed in the document.

It can be easily installed as a Docker app.

GitHub - RTIInternational/SMART: Smarter Manual Annotation for Resource-constrained collection of Training data

27. Piaf

Piaf is an open-source QA annotation platform with features such as a user-friendly interface, contributor enrollment and certification, admin team administration, input/output of texts in SQuAD format, user scoring, and annotation management.
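For readers unfamiliar with SQuAD format, the sketch below builds a single record following the SQuAD v1.1 JSON layout and checks that the answer offset points at the answer text. The sample question, id, and helper function are made up for illustration, and Piaf's own import and export files may differ in detail.

```python
import json

context = "Piaf is an open-source QA annotation platform."
answer_text = "QA annotation platform"

# One article with one paragraph and one question, in the SQuAD v1.1 layout.
dataset = {
    "version": "1.1",
    "data": [
        {
            "title": "Piaf",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "id": "example-0001",  # hypothetical id
                            "question": "What kind of platform is Piaf?",
                            "answers": [
                                {
                                    "text": answer_text,
                                    # Character offset of the answer inside the context.
                                    "answer_start": context.index(answer_text),
                                }
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}


def answers_are_aligned(dataset):
    """Check that every answer_start offset really points at the answer text."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    if paragraph["context"][start:end] != answer["text"]:
                        return False
    return True


print(json.dumps(dataset, indent=2))
print(answers_are_aligned(dataset))
```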

GitHub - etalab/piaf: Question Answering annotation platform - Plateforme d’annotation