docconv - Converts PDF, DOC, DOCX, XML, HTML, RTF, etc to plain text with Go

docconv is a free Go wrapper library that converts PDF, DOC, DOCX, XML, HTML, RTF, ODT, and Pages documents, as well as images (see the optional dependencies below), to plain text. It also provides docd, a tool that can run as a service or from the command line. See go help install for details on where the installed docd executable is placed.

Make sure that the full path to the docd executable is in your PATH environment variable. To add image support to the docconv library, you first need to install and build gosseract.

Once gosseract is installed, you can add -tags ocr to any go command when building, fetching, or testing docconv to include support for processing images. When docd runs as a service, documents can be sent as a multipart POST request, and the plain text (body) and meta information are then returned as a JSON object.

Features

  • Go wrapper library to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, and Pages documents to plain text
  • Optional image support through gosseract: add -tags ocr to any go command when building, fetching, or testing docconv
  • The docd tool runs as a service on port 8888 by default, from within a Docker container, or from the command line
  • Documents can be converted locally or sent over the network as a multipart POST request

Install

With Go

$ go install code.sajari.com/docconv/v2/docd@latest

With macOS

$ brew install poppler-qt5 wv unrtf tidy-html5
$ go get github.com/JalfResi/justext

docd tool

The docd tool runs in one of three ways:

  1. As a service on port 8888 (by default). Documents can be sent as a multipart POST request, and the plain text (body) and meta information are then returned as a JSON object.
  2. As a service exposed from within a Docker container. Official images are published at https://hub.docker.com/r/sajari/docd.

    Optionally, you can build it yourself:

    $ cd docd
    $ docker build -t docd .
  3. Via the command line. Documents can be sent as an argument, e.g.

$ docd -input document.pdf

Usage

Send documents using curl:

$ curl -s -F input=@your-file.pdf http://localhost:8888/convert

License

MIT License

Resources & Downloads

GitHub - sajari/docconv: Converts PDF, DOC, DOCX, XML, HTML, RTF, etc to plain text