The Rust Revolution in Data Science: Why It's a Game Changer and 10 Projects to Prove It

Rust, a systems programming language, is gaining traction in various fields, including data science.

Known for its performance, safety, and concurrency capabilities, Rust offers several advantages for data engineering, databases, real-time data processing, and data analytics.

Why Rust?

Rust is designed to be fast and reliable, making it ideal for data-intensive applications. Here are some key features that make Rust suitable for data science (a short concurrency sketch follows the list):

  • Memory Safety: Prevents common bugs and security issues.
  • Concurrency: Allows for efficient parallel data processing.
  • Performance: Compiles to machine code for maximum speed.
  • Interoperability: Integrates well with other languages and tools.
  • Community and Ecosystem: Growing number of libraries and frameworks for data science.
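
To make the concurrency point concrete, here is a minimal sketch that sums a large vector across several threads using only the standard library (scoped threads require Rust 1.63 or later); the data and thread count are arbitrary example values.

```rust
use std::thread;

fn main() {
    // Example data: sum 1..=1_000_000 across four worker threads.
    let data: Vec<u64> = (1..=1_000_000).collect();
    let chunk_size = data.len() / 4 + 1;

    // Scoped threads may borrow `data`; the compiler guarantees the
    // borrows cannot outlive the scope, so no locks or copies are needed.
    let total: u64 = thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    });

    assert_eq!(total, 500_000_500_000);
    println!("total = {total}");
}
```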

Features of Rust

  • Zero-cost abstractions
  • Ownership and borrowing system
  • Pattern matching
  • Concurrency without data races
  • Safe memory management
  • High performance with low-level control
  • Robust type system
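
The ownership, borrowing, and pattern-matching features above can be illustrated with a small, self-contained sketch; the Measurement type and values are invented for illustration.

```rust
// A toy record type: a reading is either a temperature or missing.
#[derive(Debug)]
enum Measurement {
    Temperature(f64),
    Missing,
}

// Borrows the slice immutably; the caller keeps ownership of the data.
fn mean_temperature(readings: &[Measurement]) -> Option<f64> {
    let mut sum = 0.0;
    let mut count = 0u32;
    for reading in readings {
        // Pattern matching forces every variant to be handled explicitly.
        match reading {
            Measurement::Temperature(value) => {
                sum += *value;
                count += 1;
            }
            Measurement::Missing => {}
        }
    }
    if count == 0 { None } else { Some(sum / count as f64) }
}

fn main() {
    let readings = vec![
        Measurement::Temperature(21.5),
        Measurement::Missing,
        Measurement::Temperature(23.5),
    ];
    // `readings` is only borrowed, so it remains usable afterwards.
    println!("mean = {:?}", mean_temperature(&readings));
    println!("count = {}", readings.len());
}
```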

Pros and Cons of Rust

Pros:

  • Safety: Ensures memory safety and prevents data races.
  • Performance: Comparable to C/C++ with modern features.
  • Concurrency: Excellent support for concurrent programming.
  • Community Support: Growing ecosystem with active development.

Cons:

  • Learning Curve: Steep learning curve for beginners.
  • Tooling: Less mature tooling compared to Python or R.
  • Ecosystem: Smaller ecosystem for data science libraries.

Success Stories

  1. Dropbox: Uses Rust for parts of its file storage system, leading to significant performance improvements.
  2. Discord: Rewrote latency-sensitive backend services, such as its Read States service, in Rust to handle real-time messaging with low latency.
  3. Figma: Migrated parts of its backend to Rust for better performance and reliability.
  4. Coursera: Utilizes Rust for data processing tasks to achieve faster and safer computations.

Top 10 Rust Applications and Projects for Data Science

1. Polars (DataFrame Library)

Polars is a DataFrame library for Rust built on Apache Arrow’s memory model. Arrow provides very cache-efficient columnar data structures and is becoming the de facto standard for columnar data.
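
A minimal sketch, assuming the polars crate with its lazy feature enabled; the column names and values are made up for illustration.

```rust
use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // Build a small in-memory, Arrow-backed DataFrame.
    let df = df![
        "city" => &["Oslo", "Lima", "Pune"],
        "temp_c" => &[4.5, 19.0, 31.2],
    ]?;

    // Lazy query: filter rows, then materialize an eager DataFrame.
    let warm = df
        .lazy()
        .filter(col("temp_c").gt(lit(10.0)))
        .collect()?;

    println!("{warm}");
    Ok(())
}
```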

2. DataFusion (SQL Query Engine)

Apache DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format. Python Bindings are also available. DataFusion offers SQL and Dataframe APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.
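
A minimal sketch, assuming the datafusion and tokio crates; the file path, table name, and column names are placeholders.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Register a CSV file as a table in a new session context.
    let ctx = SessionContext::new();
    ctx.register_csv("events", "events.csv", CsvReadOptions::new())
        .await?;

    // Run a SQL query and pretty-print the resulting record batches.
    let df = ctx
        .sql("SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id")
        .await?;
    df.show().await?;
    Ok(())
}
```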

3. Ballista (Distributed Compute Platform)

Ballista is a scalable distributed SQL query engine powered by the Rust implementation of Apache Arrow and Apache Arrow DataFusion. It supports distributed, parallel query execution, among several other useful features; a minimal client sketch follows the feature list below.

Features

  • Supports HDFS as well as cloud object stores. S3 is supported today and GCS and Azure support is planned.
  • DataFrame and SQL APIs available from Python and Rust.
  • Clients can connect to a Ballista cluster using Flight SQL.
  • JDBC support via the Arrow Flight SQL JDBC Driver.
  • Scheduler web interface and REST UI for monitoring query progress and viewing query plans and metrics.
  • Support for Docker, Docker Compose, and Kubernetes deployment, as well as manual deployment on bare metal.
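
As a rough sketch only: the snippet below follows the BallistaContext client API documented in earlier Ballista releases (it also assumes the datafusion and tokio crates); the scheduler address, table name, and file path are placeholders, and the entry point has changed across releases.

```rust
use ballista::prelude::*;
use datafusion::prelude::CsvReadOptions;

#[tokio::main]
async fn main() -> Result<()> {
    // Connect to a running Ballista scheduler (address/port are placeholders).
    let config = BallistaConfig::builder()
        .set("ballista.shuffle.partitions", "4")
        .build()?;
    let ctx = BallistaContext::remote("localhost", 50050, &config).await?;

    // Register a CSV file and run a SQL query that executes on the cluster.
    ctx.register_csv("trips", "trips.csv", CsvReadOptions::new())
        .await?;
    let df = ctx.sql("SELECT COUNT(*) FROM trips").await?;
    df.show().await?;
    Ok(())
}
```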

4. Tantivy (Search Engine)

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust. Tantivy is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is not an off-the-shelf search engine server, but rather a crate that can be used to build such a search engine.

Tantivy offers full-text search with fast indexing and query execution.
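
A minimal indexing-and-search sketch, assuming a recent tantivy release; the field name and example documents are invented.

```rust
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index, IndexWriter};

fn main() -> tantivy::Result<()> {
    // Define a schema with one indexed, stored text field.
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let schema = schema_builder.build();

    // Create an in-memory index and add a couple of documents.
    let index = Index::create_in_ram(schema);
    let mut writer: IndexWriter = index.writer(50_000_000)?;
    writer.add_document(doc!(title => "Rust for data engineering"))?;
    writer.add_document(doc!(title => "Search engines written in Rust"))?;
    writer.commit()?;

    // Parse a query against the title field and fetch the top hits.
    let reader = index.reader()?;
    let searcher = reader.searcher();
    let query_parser = QueryParser::for_index(&index, vec![title]);
    let query = query_parser.parse_query("data")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    println!("{} matching document(s)", top_docs.len());
    Ok(())
}
```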

5. Dozer (Real-time Data Processing)

Dozer is a real-time data movement tool that uses change data capture (CDC) to move data from various sources to multiple sinks.

According to its maintainers, Dozer is orders of magnitude faster than Debezium plus Kafka and natively supports stateless transformations. It is primarily used for moving data into warehouses; the Dozer team, for example, uses it to move data into ClickHouse and to build data APIs and LLM integrations.

6. TiKV

TiKV is an open-source, distributed, and transactional key-value database. Unlike traditional NoSQL systems, TiKV provides not only classical key-value APIs but also transactional APIs with ACID compliance.

Built in Rust and powered by Raft, TiKV was originally created by PingCAP to complement TiDB, a distributed HTAP database compatible with the MySQL protocol.
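
A minimal sketch, assuming the tikv-client and tokio crates and a running TiKV cluster whose PD endpoint is reachable at the placeholder address below.

```rust
use tikv_client::RawClient;

#[tokio::main]
async fn main() -> Result<(), tikv_client::Error> {
    // Connect through the cluster's PD endpoints (placeholder address).
    let client = RawClient::new(vec!["127.0.0.1:2379"]).await?;

    // A simple raw key-value round trip.
    client.put("rust-kv-demo".to_owned(), "hello".to_owned()).await?;
    let value = client.get("rust-kv-demo".to_owned()).await?;
    println!("value = {:?}", value);
    Ok(())
}
```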

7. Redox OS (Operating System)

Redox OS is a complete, general-purpose, microkernel-based operating system written in Rust, first created in 2015.

8. RustPython (Python Interpreter)

RustPython is a Python interpreter written in Rust that combines Rust's safety with Python's flexibility.
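
A minimal embedding sketch adapted from the example in the RustPython README, assuming the rustpython-vm crate with default features; module paths such as vm::compiler::Mode vary between releases.

```rust
use rustpython_vm as vm;

fn main() -> vm::PyResult<()> {
    // Run a tiny Python snippet inside an embedded interpreter.
    vm::Interpreter::without_stdlib(Default::default()).enter(|vm| {
        let scope = vm.new_scope_with_builtins();
        let source = "print(sum(range(10)))";
        let code = vm
            .compile(source, vm::compiler::Mode::Exec, "<embedded>".to_owned())
            .expect("failed to compile example source");
        vm.run_code_obj(code, scope)?;
        Ok(())
    })
}
```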

9. MeiliSearch (Search Engine)

Meilisearch is a free and open-source, lightning-fast, and hyper-relevant search engine built in Rust that fits effortlessly into apps, websites, and workflows.

Its features include hybrid search, filtering and sorting, geosearch, a RESTful API, and more.

10. Weld (Runtime for Data-Intensive Applications)

Weld is a language and runtime for improving the performance of data-intensive applications. It optimizes across libraries and functions by expressing each library's core computations in a common intermediate representation and then optimizing across the entire workflow.

Modern analytics applications combine multiple functions from different libraries and frameworks to build complex workflows. Even though individual functions can achieve high performance in isolation, the performance of the combined workflow is often an order of magnitude below hardware limits due to extensive data movement across the functions. Weld’s take on solving this problem is to lazily build up a computation for the entire workflow, and then optimizing and evaluating it only when a result is needed.

Conclusion

Rust is proving to be a powerful tool in the data science ecosystem. Its focus on safety, performance, and concurrency makes it a strong candidate for data engineering, databases, real-time data processing, and analytics tasks. With growing community support and successful implementations by major companies, Rust is worth considering for your next data-intensive project.






