Magda is an open-source Big Data cataloging system

Magda is a data catalog system that provides a single place where all of an organization's data can be cataloged, enriched, searched, tracked and prioritized - whether big or small, internally or externally sourced, available as files, databases or APIs. Magda is designed specifically around the concept of federation - providing a single view across all data of interest to a user, regardless of where the data is stored or where it was sourced from.

The system can quickly crawl external data sources, track changes, apply automatic enhancements and send notifications when changes occur, giving data users a one-stop shop to discover all the data that's available to them.
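One way to picture the change-tracking step is to fingerprint each dataset's metadata on every crawl and compare the result with the previous crawl. The following is a minimal sketch under that assumption, not Magda's actual implementation; the function names (`fingerprint`, `diff_crawls`) and the record shape are hypothetical.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Return a stable hash of a metadata record (key order does not matter)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_crawls(previous: dict, current: dict) -> dict:
    """Compare two crawls, each a mapping of dataset id -> metadata record."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    # A dataset counts as changed when it exists in both crawls
    # but its metadata fingerprint differs.
    changed = sorted(
        ds_id
        for ds_id in set(previous) & set(current)
        if fingerprint(previous[ds_id]) != fingerprint(current[ds_id])
    )
    return {"added": added, "removed": removed, "changed": changed}
```

The three lists returned by `diff_crawls` could then drive whatever notification mechanism is in use; that part is left out here because it depends on the deployment.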

Magda was originally developed for the Australian government’s federal open data portal data.gov.au, providing a single place for Australia’s citizens, scientists, journalists and businesses to discover and access 80,000+ datasets, from linked data APIs to small Excel files.

Features

  1. Supports big and small data
  2. Improves metadata search and cataloging
  3. Built-in search with full-text search support and maps
  4. Filter data by date, organization, and data formats
  5. Powerful and scalable search based on Elasticsearch
  6. Quick and reliable aggregation of external sources of datasets
  7. An unopinionated central store of metadata, able to cater for most metadata schemas
  8. Federated authentication via Passport.js - log in via Google, Facebook, WSFed, AAF or CKAN, or easily create new providers.
  9. Based on Kubernetes for cloud agnosticism - deployable to nearly any cloud, on-premises, or on a local machine.
  10. Easy (as long as you know Kubernetes) installation and upgrades
  11. Extensions are added as new Docker images in the cluster, and hence can be developed in any language

Magda Architecture

Magda is built around a collection of microservices that are distributed as Docker containers. This was done to provide easy extensibility - Magda can be customised by adding new services, written in any technology and packaged as Docker images, and integrating them with the rest of the system via stable HTTP APIs.
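To illustrate the idea of an extension as a small, self-contained HTTP service, here is a minimal sketch using only the Python standard library. It does not use or reproduce any real Magda API: the `/status/live` path, the port, and the names `route`, `ExtensionHandler` and `serve` are all assumptions made for the example. In practice such a service would be packaged as a Docker image and would talk to the rest of the system over HTTP.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def route(path: str) -> tuple[int, dict]:
    """Map a request path to (status code, JSON body). Paths are hypothetical."""
    if path == "/status/live":
        return 200, {"status": "ok"}
    return 404, {"error": "not found"}


class ExtensionHandler(BaseHTTPRequestHandler):
    """Serve GET requests by delegating to the pure `route` function."""

    def do_GET(self) -> None:
        status, body = route(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Run the service until interrupted; a container would use this as its entry point."""
    HTTPServer(("0.0.0.0", port), ExtensionHandler).serve_forever()
```

Calling `serve()` starts the service on port 8080. Keeping the routing logic in a plain function separate from the handler makes it easy to test without opening a socket.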

Using Helm and Kubernetes for orchestration means that configuration of a customised Magda instance can be stored and tracked as plain text, and instances with identical configuration can be quickly and easily reproduced.

License

Magda is released under the Apache-2.0 License.

Resources