Python is an interpreted general-purpose programming language. It is used for web development, desktop application development, system scripting and automation.
It is a high-level language created in the early 1991 by Guido van Rossum and maintained by Python Software Foundation.
The language is easy to learn which makes it suitable for beginners and students. We recommended it for teens and children in this article. It also works on different platforms and operating systems like Windows, Linux, macOS and Raspberry Pi.
Python can be treated in a functional, object-oriented or procedural way.
The current and most active Python version is Python 3. However, some applications and frameworks are still using Python 2.7.
In the recent years it gained a popularity among data scientists and data engineers because of its usability and rich ecosystem.
The python ecosystem contains dozens of packages, libraries and frameworks which ease data science tasks.
Python for data science
In this article we collected several libraries for scrapping, data manipulation, machine learning, deep learning, statistics and data visualization.
Our primary goal is to create an ever-green list to help data scientists find what they need.
Please note, that is an evergreen article which we will keep updating with libraries and frameworks.
Here are the best open-source Python packages for data science and data engineering.
Data-tools is a command-line tool written in Python for data extraction, data manipulation, and file format conversion.
It has date conversion, several file-formats, joining data, data trimming, utf-8 support, data sorting and more.
Pandas is a popular Python library for data analysis and data manipulation. It is used by most data scientists and engineers. The Pandas library is easy to learn for beginners with its flat learning curve.
Scrapping is an essential part of data collection. Scrapy is a web scrapping framework written on top of Python. It helps developers and data engineers to extract structured data from web pages.
BeautifulSoup is a Python library for extracting data from HTML and XML files. It supports multiple parsers like lxml, html5lib, Python's HTML parser.
- Download and install BeautifulSoup.
NumPy is a scientific computing library in for Python. NumPy has dozens of useful functions for mathematical computation and provide C/C++ and Fortran code integration.
NumPy has a rich ecosystem of sub libraries and large community of developers. It has been widely used for data science, machine learning, data visualization and data manipulation.
- Download and install NumPy.
SciPy is a mathematical, statistical and scientific Python library built on top NumPy. SciPy provide seamless on N-dimensional array manipulation.
PyTorch is an essential Python library for tensor computation and deep neural networks. PyTorch can be extended with other Python libraries when needed like NumPy, SciPy and others.
PyTorch provides seamless GPU support and it works on Linux, Windows and macOS.
- Get PyTorch.
TensorFlow is an open-source platform for machine learning. It has a vast ecosystem supported by a large community of data scientists and engineers. TensorFlow supports CUDA-enabled GPU it also offers a CPU-only package tensorflow-cpu.
pip install tensorflow
Seaborn is a rich data visualization library that based on Matplotlib. It does not require a steep learning curve as matplotlib and provides a high-end interface for all matplotlib functions with extra tools.
Matplotlib is a popular visualization library for Python. It offers different distributions and test data sets. Matplotlib is easily installed on Linux (Debian, Fedora, Red Hat and Arch). It can be also installed with PyPi, ActiveState and Anaconda.
The library depends on several Python libraries like NumPy, Cycler, pyparsing and requires Python >=3.6 to work.
Pingouin is an open-source statistical package for Python 3. It is an interface on Pandas and NumPy. It is easy to learn and packed with many statistical tests and plotting functions.
The tick is a lightweight machine learning library for Python. It is consisting of several modules and focuses on statistical learning for time-dependent systems.
The library has several built-in tools and helpers for simulation, linear computation, Hawkes processing for parametric and non-parametirc estimation, a plot helper, a dataset and R integration support.
- Get Python tick.
This package is written by Johns Hopkins University's NeuroData lab and Microsoft Research's Project Essex. It is open-source project for analysis of graphs or networks.
Scikit-Learn is a Python-based framework for data analysis. It's built on top of NumPy matplotlib and SciPy.
Scikit-Learn is packed with dozens of algorithms and tools which make predictive data analysis easier.
15- Plotly's Python
Plotly's is a data visualization library that eases building interactive graphs. It is free to use as an open-source project and works smoothly offline. It also works with Plotly's dash which is licensed same under MIT license.
Sometimes a local flat-file database is required to save data. TinyDB is a lightweight flat-file local database. It can work with large datasets as a document-oriented database.
Theano is a lightweight Python library for data processing and analysis. It offers speed, dynamic C code generation and full GPU support. It has a similar interface to NumPy,
Theano project is popular among data scientists and students on GitHub.
PyBrain is a modular machine learning framework written in Python.
- Install PyBrain.
Gensim is a free Python library for data processing, training large scaling NLP models, data streaming and text analysis. It depends on NumPy and smart_open libraries.
Gensim requires Python 3.6 or higher.
- Get Gensim.
Shogun is an old machine learning framework. It supports several programming languages notably: Python, R, Java, Scala, Ruby and Lua.
It is released as an open-source project under GPL v3.0.
- Get Shogun.
ArcGIS is a set of Python libraries for processing, manipulating and visualizing geographical data, automate spatial workflows, perform advanced spatial analytics, and build models for spatial machine learning and deep learning.
- Install ArcGIS.
PyCaret is a low-code machine learning library written in Python. It aims for usability, productivity through its ease of use. It is well documented and has several tutorials and code samples.
23- Open Mining
Open Mining is a business intelligence application server written in Python. It is not a simple library rather than a complete application development suite for data mining.
It requires Python 2.7, Lua5.2, MongoDB, Redis, and NodeJS (NPM).
jsonschema is an implementation of JSON schema for Python. It supports Draft7,6,4, and Draft 3. It offers lazy validation and programmatic querying.
Volupuous is a data validation library for Python. It helps validate the data from JSON, Yaml, CSV and TSV files. It is built to support complex data structures.
pickleDB is yet another flat-file key-value JSON database for Python. It may come in handy to save or record data on the fly.
Caffe is a deep learning framework written in Python 3. It offers speed and modularity. Caffe has custom distributions: Intel Caffe; a CPU optimized version for Intel and Xeon processors, OpenCL Caffe for AMD or Intel processors and Windows Caffe for Windows machines.
It is developed by Berkeley AI Research (BAIR)/The Berkeley Vision and Learning Center (BVLC) and community contributors.
Toolz is a functional library in Python which comes with a set of utilities for functions, dictionaries and iterators.
fn.py is a tiny functional programming Python library. It is packed with dozens of tools that speed up projects development especially data-science releated ones.
Graph-tool is an advanced visualization module for Python. It is packed with dozens of functions and algorithms to build versatile and interactive diagrams.
Pydot is a python visualization library built as an interface for Graphviz. It has limited dependencies and written completely in Python.
pytablewriter is an open-source Python library for writing data tables in multiple various formats. It supports CSV, TSV, JSON, LTSV, LaTeX, markdown (with different flavors), MediaWiki, TOML and YAML.
It also supports binary file formats as Microsoft Excel (xlsx, xls), SQLite database and pandas.DataFrame.
Keras is a deep-learning library for Python. It is easy to learn with a large community of developers and data scientists who are supplying it with tutorials and code samples.
The statsmodels is a Python module packed with several statistical models for statistical data analysis or exploration. It is an open-source library which is released under BSD (3-clause) license.
statsmodels It works with other libraries like NumPy, SciPy, and pandas. It also supports R-style formulas and pandas data frames.
Bokeh is yet another visualization library for Python. It has built-in server for creating a browser-ready graphs. Bokeh offers out-of-box Geo data and mapping visualization, interactive annotations, command-line interface and full Jupyter integration.
In addition to this list, the Python ecosystem is gaining new packages every day. Therefore, we will keep updating this what our new findings when possible. If you create or find a new data science related library that need to be on this list, please drop us a message.