data science

35 Data Science Python Libraries for Scientists

Hamza Musa

25 Dec 2020 — 14 min read

Python is an interpreted general-purpose programming language. It is used for web development, desktop application development, system scripting and automation.

It is a high-level language created in the early 1991 by Guido van Rossum and maintained by Python Software Foundation.

The language is easy to learn which makes it suitable for beginners and students. We recommended it for teens and children in this article. It also works on different platforms and operating systems like Windows, Linux, macOS and Raspberry Pi.

Python can be treated in a functional, object-oriented or procedural way.

The current and most active Python version is Python 3. However, some applications and frameworks are still using Python 2.7.

In the recent years it gained a popularity among data scientists and data engineers because of its usability and rich ecosystem.

In Medevel.com, we covered several articles and collections regarding Python, which can benefits developers, data scientists, data engineers, and web developers, you can check them in the following list:

The python ecosystem contains dozens of packages, libraries and frameworks which ease data science tasks.

Python for data science

Photo by **Christina Morillo** from **Pexels**

In this article we collected several libraries for scrapping, data manipulation, machine learning, deep learning, statistics and data visualization.

Our primary goal is to create an ever-green list to help data scientists find what they need.

Please note, that is an evergreen article which we will keep updating with libraries and frameworks.

Here are the best open-source Python packages for data science and data engineering.

1- Data-tools

Data-tools is a command-line tool written in Python for data extraction, data manipulation, and file format conversion.

It has date conversion, several file-formats, joining data, data trimming, utf-8 support, data sorting and more.

2- Pandas

Pandas is a popular Python library for data analysis and data manipulation. It is used by most data scientists and engineers. The Pandas library is easy to learn for beginners with its flat learning curve.

pandas - Python Data Analysis Library

Python Data Analysis Library

3- Scrapy

Scrapping is an essential part of data collection. Scrapy is a web scrapping framework written on top of Python. It helps developers and data engineers to extract structured data from web pages.

Scrapy | A Fast and Powerful Scraping and Web Crawling Framework

A Fast and Powerful Scraping and Web Crawling Framework

4- BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML and XML files. It supports multiple parsers like lxml, html5lib, Python's HTML parser.

- Download and install BeautifulSoup.

https://www.crummy.com/software/BeautifulSoup/

5- NumPy

NumPy is a scientific computing library in for Python. NumPy has dozens of useful functions for mathematical computation and provide C/C++ and Fortran code integration.

NumPy has a rich ecosystem of sub libraries and large community of developers. It has been widely used for data science, machine learning, data visualization and data manipulation.

- Download and install NumPy.

6- SciPy

SciPy is a mathematical, statistical and scientific Python library built on top NumPy. SciPy provide seamless on N-dimensional array manipulation.

7- PyTorch

PyTorch is an essential Python library for tensor computation and deep neural networks. PyTorch can be extended with other Python libraries when needed like NumPy, SciPy and others.

PyTorch provides seamless GPU support and it works on Linux, Windows and macOS.

- Get PyTorch.

8- TensorFlow

TensorFlow is an open-source platform for machine learning. It has a vast ecosystem supported by a large community of data scientists and engineers. TensorFlow supports CUDA-enabled GPU it also offers a CPU-only package tensorflow-cpu.

  pip install tensorflow

9- Seaborn

Seaborn is a rich data visualization library that based on Matplotlib. It does not require a steep learning curve as matplotlib and provides a high-end interface for all matplotlib functions with extra tools.

10- Matplotlib

Matplotlib is a popular visualization library for Python. It offers different distributions and test data sets. Matplotlib is easily installed on Linux (Debian, Fedora, Red Hat and Arch). It can be also installed with PyPi, ActiveState and Anaconda.

The library depends on several Python libraries like NumPy, Cycler, pyparsing and requires Python >=3.6 to work.

Matplotlib documentation — Matplotlib 3.7.2 documentation

Logo image

11- Pingouin

Pingouin is an open-source statistical package for Python 3. It is an interface on Pandas and NumPy. It is easy to learn and packed with many statistical tests and plotting functions.

12- tick

The tick is a lightweight machine learning library for Python. It is consisting of several modules and focuses on statistical learning for time-dependent systems.

The library has several built-in tools and helpers for simulation, linear computation, Hawkes processing for parametric and non-parametirc estimation, a plot helper, a dataset and R integration support.

- Get Python tick.

13- GrasPy / graspologic

This package is written by Johns Hopkins University's NeuroData lab and Microsoft Research's Project Essex. It is open-source project for analysis of graphs or networks.

14- Scikit-Learn

Scikit-Learn is a Python-based framework for data analysis. It's built on top of NumPy matplotlib and SciPy.

Scikit-Learn is packed with dozens of algorithms and tools which make predictive data analysis easier.

scikit-learn: machine learning in Python — scikit-learn 1.3.0 documentation

15- Plotly's Python

Plotly's is a data visualization library that eases building interactive graphs. It is free to use as an open-source project and works smoothly offline. It also works with Plotly's dash which is licensed same under MIT license.

Get Plotly for Python.

16- TinyDB

Sometimes a local flat-file database is required to save data. TinyDB is a lightweight flat-file local database. It can work with large datasets as a document-oriented database.

Welcome to TinyDB! — TinyDB 4.8.0 documentation

Logo

17- Theano

Theano is a lightweight Python library for data processing and analysis. It offers speed, dynamic C code generation and full GPU support. It has a similar interface to NumPy,

Theano project is popular among data scientists and students on GitHub.

18- PyBrain

PyBrain is a modular machine learning framework written in Python.

- Install PyBrain.

19- Gensim

Gensim is a free Python library for data processing, training large scaling NLP models, data streaming and text analysis. It depends on NumPy and smart_open libraries.

Gensim requires Python 3.6 or higher.
- Get Gensim.

20- Shogun

Shogun is an old machine learning framework. It supports several programming languages notably: Python, R, Java, Scala, Ruby and Lua.

It is released as an open-source project under GPL v3.0.
- Get Shogun.

21- ArcGIS

ArcGIS is a set of Python libraries for processing, manipulating and visualizing geographical data, automate spatial workflows, perform advanced spatial analytics, and build models for spatial machine learning and deep learning.
- Install ArcGIS.

22- PyCaret

PyCaret is a low-code machine learning library written in Python. It aims for usability, productivity through its ease of use. It is well documented and has several tutorials and code samples.
- PyCaret.

23- Open Mining

Open Mining is a business intelligence application server written in Python. It is not a simple library rather than a complete application development suite for data mining.

It requires Python 2.7, Lua5.2, MongoDB, Redis, and NodeJS (NPM).

24- jsonschema

jsonschema is an implementation of JSON schema for Python. It supports Draft7,6,4, and Draft 3. It offers lazy validation and programmatic querying.

25- Volupuous

Volupuous is a data validation library for Python. It helps validate the data from JSON, Yaml, CSV and TSV files. It is built to support complex data structures.

26- pickleDB

pickleDB is yet another flat-file key-value JSON database for Python. It may come in handy to save or record data on the fly.

27- Caffe Deep Learning

Caffe is a deep learning framework written in Python 3. It offers speed and modularity. Caffe has custom distributions: Intel Caffe; a CPU optimized version for Intel and Xeon processors, OpenCL Caffe for AMD or Intel processors and Windows Caffe for Windows machines.

It is developed by Berkeley AI Research (BAIR)/The Berkeley Vision and Learning Center (BVLC) and community contributors.

28- Toolz

Toolz is a functional library in Python which comes with a set of utilities for functions, dictionaries and iterators.

29- fn.py

fn.py is a tiny functional programming Python library. It is packed with dozens of tools that speed up projects development especially data-science releated ones.

30- Graph-tool

Graph-tool is an advanced visualization module for Python. It is packed with dozens of functions and algorithms to build versatile and interactive diagrams.

graph-tool: Efficent network analysis with python

graph-toolEfficient network analysisTiago de Paula Peixoto <tiagoskewed.de>

31- Pydot

Pydot is a python visualization library built as an interface for Graphviz. It has limited dependencies and written completely in Python.

32- pytablewriter

pytablewriter is an open-source Python library for writing data tables in multiple various formats. It supports CSV, TSV, JSON, LTSV, LaTeX, markdown (with different flavors), MediaWiki, TOML and YAML.

It also supports binary file formats as Microsoft Excel (xlsx, xls), SQLite database and pandas.DataFrame.

33- Keras

Keras is a deep-learning library for Python. It is easy to learn with a large community of developers and data scientists who are supplying it with tutorials and code samples.

Keras: Deep Learning for humans

Keras documentation

Keras Team

34- statsmodels

The statsmodels is a Python module packed with several statistical models for statistical data analysis or exploration. It is an open-source library which is released under BSD (3-clause) license.

statsmodels It works with other libraries like NumPy, SciPy, and pandas. It also supports R-style formulas and pandas data frames.

statsmodels 0.14.0

logo

35- Bokeh

Bokeh is yet another visualization library for Python. It has built-in server for creating a browser-ready graphs. Bokeh offers out-of-box Geo data and mapping visualization, interactive annotations, command-line interface and full Jupyter integration.

Bokeh has built-in WebGL acceleration and JavaScript development support.

Conclusions

In addition to this list, the Python ecosystem is gaining new packages every day. Therefore, we will keep updating this what our new findings when possible. If you create or find a new data science related library that need to be on this list, please drop us a message.

Reducing Front Desk Bottlenecks In Busy Clinics

Building Effective Screening Routes For Community Health Teams

Whiplash Recovery: What To Expect And How To Support Healing

Zvec: The Embedded Vector Database That Could Change How You Build RAG Systems