Python is an interpreted general-purpose programming language. It is used for web development, desktop application development, system scripting and automation.

It is a high-level language created in the early 1991 by Guido van Rossum and maintained by Python Software Foundation.

The language is easy to learn which makes it suitable for beginners and students. We recommended it for teens and children in this article. It also works on different platforms and operating systems like Windows, Linux, macOS and Raspberry Pi.

Python can be treated in a functional, object-oriented or procedural way.

The current and most active Python version is Python 3. However, some applications and frameworks are still using Python 2.7.

In the recent years it gained a popularity among data scientists and data engineers because of its usability and rich ecosystem.

The python ecosystem contains dozens of packages, libraries and frameworks which ease data science tasks.

Python for data science

Photo by Christina Morillo from Pexels

In this article we collected several libraries for scrapping, data manipulation, machine learning, deep learning, statistics and data visualization.

Our primary goal is to create an ever-green list to help data scientists find what they need.

Please note, that is an evergreen article which we will keep updating with libraries and frameworks.

Here are the best open-source Python packages for data science and data engineering.

1- Data-tools

Data-tools is a command-line tool written in Python for data extraction, data manipulation, and file format conversion.

It has date conversion, several file-formats, joining data, data trimming, utf-8 support, data sorting and more.

GitHub - clarkgrubb/data-tools: File format conversion tools
File format conversion tools. Contribute to clarkgrubb/data-tools development by creating an account on GitHub.

2- Pandas

Pandas is a popular Python library for data analysis and data manipulation. It is used by most data scientists and engineers. The Pandas library is easy to learn for beginners with its flat learning curve.

pandas - Python Data Analysis Library

3- Scrapy

Scrapping is an essential part of data collection. Scrapy is a web scrapping framework written on top of Python. It helps developers and data engineers to extract structured data from web pages.

Scrapy | A Fast and Powerful Scraping and Web Crawling Framework

4- BeautifulSoup


BeautifulSoup is a Python library for extracting data from HTML and XML files. It supports multiple parsers like lxml, html5lib, Python's HTML parser.

- Download and install BeautifulSoup.

GitHub - wention/BeautifulSoup4: git mirror for Beautiful Soup 4.3.2
git mirror for Beautiful Soup 4.3.2. Contribute to wention/BeautifulSoup4 development by creating an account on GitHub.

5- NumPy


NumPy is a scientific computing library in for Python. NumPy has dozens of useful functions for mathematical computation and provide C/C++ and Fortran code integration.

NumPy has a rich ecosystem of sub libraries and large community of developers. It has been widely used for data science, machine learning, data visualization and data manipulation.

- Download and install NumPy.

NumPy
Why NumPy? Powerful n-dimensional arrays. Numerical computing tools. Interoperable. Performant. Open source.

6- SciPy


SciPy is a mathematical, statistical and scientific Python library built on top NumPy. SciPy provide seamless on N-dimensional array manipulation.

GitHub - scipy/scipy: SciPy library main repository
SciPy library main repository. Contribute to scipy/scipy development by creating an account on GitHub.

7- PyTorch


PyTorch is an essential Python library for tensor computation and deep neural networks. PyTorch can be extended with other Python libraries when needed like NumPy, SciPy and others.

PyTorch provides seamless GPU support and it works on Linux, Windows and macOS.

- Get PyTorch.

GitHub - pytorch/pytorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Tensors and Dynamic neural networks in Python with strong GPU acceleration - GitHub - pytorch/pytorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration

8- TensorFlow


TensorFlow is an open-source platform for machine learning. It has a vast ecosystem supported by a large community of data scientists and engineers. TensorFlow supports CUDA-enabled GPU it also offers a CPU-only package tensorflow-cpu.

  pip install tensorflow

GitHub - tensorflow/tensorflow: An Open Source Machine Learning Framework for Everyone
An Open Source Machine Learning Framework for Everyone - GitHub - tensorflow/tensorflow: An Open Source Machine Learning Framework for Everyone

9- Seaborn


Seaborn is a rich data visualization library that based on Matplotlib. It does not require a steep learning curve as matplotlib and provides a high-end interface for all matplotlib functions with extra tools.

seaborn: statistical data visualization — seaborn 0.12.2 documentation

10- Matplotlib


Matplotlib is a popular visualization library for Python. It offers different distributions and test data sets. Matplotlib is easily installed on Linux (Debian, Fedora, Red Hat and Arch). It can be also installed with PyPi, ActiveState and Anaconda.

The library depends on several Python libraries like NumPy, Cycler, pyparsing and requires Python >=3.6 to work.

Matplotlib documentation — Matplotlib 3.7.2 documentation

11- Pingouin


Pingouin is an open-source statistical package for Python 3. It is an interface on Pandas and NumPy. It is easy to learn and packed with many statistical tests and plotting functions.

Installation — pingouin 0.5.3 documentation

12- tick


The tick is a lightweight machine learning library for Python. It is consisting of several modules and focuses on statistical learning for time-dependent systems.

The library has several built-in tools and helpers for simulation, linear computation, Hawkes processing for parametric and non-parametirc estimation, a plot helper, a dataset and R integration support.

- Get Python tick.

Tick — tick 0.6.0 documentation

13- GrasPy / graspologic



This package is written by Johns Hopkins University's NeuroData lab and Microsoft Research's Project Essex. It is open-source project for analysis of graphs or networks.

graspy
A set of python modules for graph statistics

14- Scikit-Learn


Scikit-Learn is a Python-based framework for data analysis. It's built on top of NumPy matplotlib and SciPy.

Scikit-Learn is packed with dozens of algorithms and tools which make predictive data analysis easier.

scikit-learn: machine learning in Python — scikit-learn 1.3.0 documentation

15- Plotly's Python


Plotly's is a data visualization library that eases building interactive graphs. It is free to use as an open-source project and works smoothly offline. It also works with Plotly's dash which is licensed same under  MIT license.

Plotly
Plotly’s

16- TinyDB


Sometimes a local flat-file database is required to save data.  TinyDB is a lightweight flat-file local database. It can work with large datasets as a document-oriented database.

Welcome to TinyDB! — TinyDB 4.8.0 documentation

17- Theano


Theano is a lightweight Python library for data processing and analysis. It offers speed, dynamic C code generation and full GPU support. It has a similar interface to NumPy,

Theano project is popular among data scientists and students on GitHub.

GitHub - Theano/Theano: Theano was a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It is being continued as aesara: www.github.com/pymc-devs/aesara
Theano was a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It is being continued as aesara: www.github.co…

18- PyBrain


PyBrain is a modular machine learning framework written in Python.

- Install PyBrain.

GitHub - pybrain/pybrain
Contribute to pybrain/pybrain development by creating an account on GitHub.

19- Gensim


Gensim is a free Python library for data processing, training large scaling NLP models, data streaming and text analysis. It depends on NumPy and smart_open libraries.  

Gensim requires Python 3.6 or higher.
- Get Gensim.

Gensim: topic modelling for humans
Efficient topic modelling in Python

20- Shogun


Shogun is an old machine learning framework. It supports several programming languages notably: Python, R, Java, Scala, Ruby and Lua.

It is released as an open-source project under GPL v3.0.
- Get Shogun.

GitHub - shogun-toolbox/shogun: Shōgun
Shōgun. Contribute to shogun-toolbox/shogun development by creating an account on GitHub.

21- ArcGIS


ArcGIS is a set of Python libraries for processing, manipulating and visualizing geographical data, automate spatial workflows, perform advanced spatial analytics, and build models for spatial machine learning and deep learning.
- Install ArcGIS.

ArcGIS Python Libraries | Python Packages for Spatial Data Science
ArcGIS Python libraries are Python packages that include ArcPy & ArcGIS API for Python for spatial data science. Discover their capabilities and features.

22- PyCaret


PyCaret is a low-code machine learning library written in Python. It aims for usability, productivity through its ease of use. It is well documented and has several tutorials and code samples.
- PyCaret.

Home - PyCaret
An open source low-code machine learning library in Python

23- Open Mining


Open Mining is a business intelligence application server written in Python. It is not a simple library rather than a complete application development suite for data mining.

It requires Python 2.7, Lua5.2, MongoDB, Redis, and NodeJS (NPM).

GitHub - mining/mining: Business Intelligence (BI) in Python, OLAP
Business Intelligence (BI) in Python, OLAP. Contribute to mining/mining development by creating an account on GitHub.

24- jsonschema


jsonschema is an implementation of JSON schema for Python. It supports Draft7,6,4, and Draft 3. It offers lazy validation and programmatic querying.

GitHub - python-jsonschema/jsonschema: An implementation of the JSON Schema specification for Python
An implementation of the JSON Schema specification for Python - GitHub - python-jsonschema/jsonschema: An implementation of the JSON Schema specification for Python

25- Volupuous


Volupuous is a data validation library for Python. It helps validate the data from JSON, Yaml, CSV and TSV files. It is built to support complex data structures.

GitHub - alecthomas/voluptuous: CONTRIBUTIONS ONLY: Voluptuous, despite the name, is a Python data validation library.
CONTRIBUTIONS ONLY: Voluptuous, despite the name, is a Python data validation library. - GitHub - alecthomas/voluptuous: CONTRIBUTIONS ONLY: Voluptuous, despite the name, is a Python data validatio…

26- pickleDB


pickleDB is yet another flat-file key-value JSON database for Python. It may come in handy to save or record data on the fly.

GitHub - patx/pickledb: pickleDB is an open source key-value store using Python’s json module.
pickleDB is an open source key-value store using Python’s json module. - GitHub - patx/pickledb: pickleDB is an open source key-value store using Python’s json module.

27- Caffe Deep Learning


Caffe is a deep learning framework written in Python 3. It offers speed and  modularity. Caffe has custom distributions: Intel Caffe; a CPU optimized version  for Intel and Xeon processors, OpenCL Caffe for AMD or Intel processors and Windows Caffe for Windows machines.

It is developed by Berkeley AI Research (BAIR)/The Berkeley Vision and Learning Center (BVLC) and community contributors.

GitHub - BVLC/caffe: Caffe: a fast open framework for deep learning.
Caffe: a fast open framework for deep learning. Contribute to BVLC/caffe development by creating an account on GitHub.

28- Toolz


Toolz is a functional library in Python which comes with a set of utilities for functions, dictionaries and iterators.

GitHub - pytoolz/toolz: A functional standard library for Python.
A functional standard library for Python. Contribute to pytoolz/toolz development by creating an account on GitHub.

29- fn.py


fn.py is a tiny functional programming Python library. It is packed with dozens of tools that speed up projects development especially data-science releated ones.

GitHub - kachayev/fn.py: Functional programming in Python: implementation of missing features to enjoy FP
Functional programming in Python: implementation of missing features to enjoy FP - GitHub - kachayev/fn.py: Functional programming in Python: implementation of missing features to enjoy FP

30- Graph-tool


Graph-tool is an advanced visualization module for Python. It is packed with dozens of functions and algorithms to build versatile and interactive diagrams.

graph-tool: Efficent network analysis with python
graph-tool: Efficent network analysis with python

31- Pydot


Pydot is a python visualization library built as an interface for Graphviz. It has limited dependencies and written completely in Python.

GitHub - pydot/pydot: Python interface to Graphviz’s Dot language
Python interface to Graphviz’s Dot language. Contribute to pydot/pydot development by creating an account on GitHub.

32- pytablewriter


pytablewriter is an open-source Python library for writing data tables in multiple various formats. It supports CSV, TSV, JSON, LTSV, LaTeX, markdown (with different flavors), MediaWiki, TOML and YAML.

It also supports binary file formats as Microsoft Excel (xlsx, xls), SQLite database and pandas.DataFrame.

GitHub - thombashi/pytablewriter: pytablewriter is a Python library to write a table in various formats: AsciiDoc / CSV / Elasticsearch / HTML / JavaScript / JSON / LaTeX / LDJSON / LTSV / Markdown / MediaWiki / NumPy / Excel / Pandas / Python / reStructuredText / SQLite / TOML / TSV.
pytablewriter is a Python library to write a table in various formats: AsciiDoc / CSV / Elasticsearch / HTML / JavaScript / JSON / LaTeX / LDJSON / LTSV / Markdown / MediaWiki / NumPy / Excel / Pan…

33- Keras


Keras is a deep-learning library for Python. It is easy to learn with a large community of developers and data scientists who are supplying it with tutorials and  code samples.

Keras: Deep Learning for humans
Keras documentation

34- statsmodels


The statsmodels is a Python module packed with several statistical models for statistical data analysis or exploration. It is an open-source library which is released under BSD (3-clause) license.

statsmodels It works with other libraries like NumPy, SciPy, and pandas. It also supports R-style formulas and pandas data frames.

statsmodels 0.14.0

35- Bokeh


Bokeh is yet another visualization library for Python. It has built-in server for creating a browser-ready graphs. Bokeh offers out-of-box Geo data and mapping visualization, interactive annotations, command-line interface and full Jupyter integration.

Bokeh has built-in WebGL acceleration and JavaScript development support.

Bokeh documentation
Bokeh is a Python library for creating interactive visualizations for modern web browsers. It helps you build beautiful graphics, ranging from simple plots to complex dashboards with streaming data…

Conclusions

In addition to this list, the Python ecosystem is gaining new packages every day. Therefore, we will keep updating this what our new findings when possible. If you create or find a new data science related library that need to be on this list, please drop us a message.