Ambar: Libre Document Search Engine for Office, Text and PDF Documents

Ambar is an open-source document search engine with automated crawling, OCR, tagging and instant full-text search.

Ambar offers a new way to integrate full-text document search into your workflow:

  • Easily deploy Ambar with a single docker-compose file
  • Run Google-like searches across your documents and the text content of your images
  • Tag your documents
  • Use a simple REST API to integrate Ambar into your workflow

Tutorial: Mastering Ambar Search Queries

  • Fuzzy Search (John~3)
  • Phrase Search ("John Smith")
  • Search By Author (author:John)
  • Search By File Path (filename:*.txt)
  • Search By Date (when: yesterday, today, lastweek, etc.)
  • Search By Size (size>1M)
  • Search By Tags (tags:ocr)
  • Search As You Type
  • Supported language analyzers: English (ambar_en), Russian (ambar_ru), German (ambar_de), Italian (ambar_it), Polish (ambar_pl), Chinese (ambar_cn), CJK (ambar_cjk)
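
The query forms above are plain strings, so they are easy to assemble in code before sending them to the search box or the REST API. The following is a minimal sketch in Python; the helper names (fuzzy, phrase, field, larger_than, build_query) are hypothetical and not part of Ambar, and joining several terms with spaces is an assumption rather than documented behavior.

```python
# Hypothetical helpers (not part of Ambar) that assemble query strings
# using the syntax shown in the list above.


def fuzzy(word: str, distance: int) -> str:
    """Fuzzy term, e.g. John~3."""
    return f"{word}~{distance}"


def phrase(text: str) -> str:
    """Exact phrase, e.g. "John Smith"."""
    return f'"{text}"'


def field(name: str, value: str) -> str:
    """Field filter, e.g. author:John, filename:*.txt, tags:ocr."""
    return f"{name}:{value}"


def larger_than(size: str) -> str:
    """Size filter, e.g. size>1M."""
    return f"size>{size}"


def build_query(*terms: str) -> str:
    """Join non-empty terms into one query string.

    Assumption: separate terms can be combined with spaces.
    """
    return " ".join(term for term in terms if term)


if __name__ == "__main__":
    query = build_query(
        phrase("John Smith"),
        field("filename", "*.txt"),
        larger_than("1M"),
    )
    print(query)
```

Keeping each query form in its own small function makes it harder to mistype an operator, and the resulting string can be pasted into the Ambar search field to check that it behaves as expected.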

Crawling

Ambar 2.0 supports only local file system crawling. If you need to crawl an SMB share or an FTP location, mount it using standard Linux tools. Crawling is automatic: no schedule is needed, because the crawlers monitor file system events and automatically process new, changed and removed files.
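
To illustrate what "new, changed and removed files" means in practice, here is a small Python sketch that classifies files by comparing two snapshots of a directory tree. This is not how Ambar's crawlers work: they react to file system events, whereas this sketch swaps in simple snapshot polling because it needs only the standard library. The names take_snapshot and diff_snapshots are hypothetical.

```python
import os
from typing import Dict, Set, Tuple

# Maps a file path to its (modification time, size in bytes).
Snapshot = Dict[str, Tuple[float, int]]


def take_snapshot(root: str) -> Snapshot:
    """Record modification time and size for every file under root."""
    snapshot: Snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                stat = os.stat(path)
            except OSError:
                # The file disappeared between listing and stat; skip it.
                continue
            snapshot[path] = (stat.st_mtime, stat.st_size)
    return snapshot


def diff_snapshots(
    old: Snapshot, new: Snapshot
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return the sets of (added, changed, removed) paths."""
    old_paths, new_paths = set(old), set(new)
    added = new_paths - old_paths
    removed = old_paths - new_paths
    # A file counts as changed when its mtime or size differs.
    changed = {p for p in old_paths & new_paths if old[p] != new[p]}
    return added, changed, removed
```

A caller would take one snapshot, wait, take another, and pass both to diff_snapshots. An event-based crawler avoids this repeated scanning, which is why no crawl schedule is needed in Ambar.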

Content Extraction

Ambar supports large files (>30 MB).

Supported file types:

  • ZIP archives
  • Mail archives (PST)
  • MS Office documents (Word, Excel, PowerPoint, Visio, Publisher)
  • Images (with OCR)
  • Email messages with attachments
  • Adobe PDF (with OCR)
  • OpenOffice documents
  • RTF, plain text
  • HTML / XHTML

OCR languages: Eng, Rus, Ita, Deu, Fra, Spa, Pl, Nld

Files are processed in multiple threads.

License

Ambar is released under the MIT License.

Resources