Data With Rust

Setting up a Rust developer environment

Karim Jedda — Sun, 29 Dec 2024 15:28:03 GMT

To install Rust, the current widely accepted method is to use an installer called rustup. When you visit that website, depending on which operating system is detected, you’ll have the choice of either downloading an .exe file or running a script in your terminal.

I’ll link some videos to show how it can be done, if for example you don’t know what a terminal is.

Installing on MacOS/Linux

As of February 2026, the instructions on the rustup website for installing Rust on MacOS/Linux consist of running the following command in a terminal:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

💡

Note
The rustup website presents the same information as the official Rust website.

While much can be said about the dangers of piping a shell script from the internet straight into your command line, it is beyond the scope of this website/guide. One thing is for sure: the same trust problem exists for whatever you have to install.

Of course you can read the code by running curl -s https://sh.rustup.rs | vi - to see what’s inside.

Installing on Windows

As of February 2026, the instructions on the rustup website for installing Rust on Windows consist of downloading and running rustup-init.exe.

Verifying the installation

Verifying the installation works the same on all the systems. Open a command prompt or terminal and run:

rustc --version
# Should output something like 
# rustc 1.91.1 (ed61e7d7e 2025-11-07)

cargo --version
# Should output something like 
# cargo 1.91.1 (ea2d97820 2025-10-10)

If both of these commands ran properly, chances are high that everything worked. If they didn’t, which Murphy’s Law ensures might happen, reach out by email and I’ll help you.
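As an extra check beyond the version numbers, you can compile and run a tiny program. The following is only a minimal sketch: the file name hello.rs, the function name greeting and the message text are arbitrary choices made for this illustration and are not part of the rustup instructions.

```rust
// hello.rs: a tiny program to confirm the toolchain can compile and run code.
// The file name, function name and message are arbitrary choices for this sketch.

/// Builds the message that main prints.
fn greeting() -> String {
    String::from("Hello from a local Rust install!")
}

fn main() {
    println!("{}", greeting());
}
```

Save it as hello.rs, run rustc hello.rs in the same folder, then start the resulting binary (./hello on MacOS/Linux, hello.exe on Windows). If the message shows up, the compiler and linker are both working.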

Additional remarks

It is possible to use brew on MacOS, chocolatey on Windows or apt on Linux. However, I do not recommend those if you’re just starting out, because you might end up with a version that is either outdated or not properly configured.
It is possible to run all of your code in Docker, but at this stage it is completely unnecessary since we’re just starting out.
Remember how we were running code in the browser previously? The trick is that this website is using the Rust Playground behind the scenes to run the code there and display the output here. You can continue working in the Playground without installing anything. Unfortunately, that website doesn’t contain all the packages we will need going forward. Besides, now that you have Rust on your own machine, nobody can take it away!
To uninstall the Rust toolchain from your system, you can run rustup self uninstall, but you won’t do that just now, right? Right!? 👻

Chapter 1 Recap

Karim Jedda — Thu, 06 Jun 2024 11:21:00 GMT

So far we've covered many topics, ranging from a light introduction to the Rust programming language to its history, features and benefits for Data Engineering. We also covered how it compares to other programming languages, which hopefully made you excited about what is ahead of us: writing code, running it and building the future of data engineering using this new programming language.

Summary of Rust's features and advantages for data engineering

To recap what we've discussed, Rust is a perfect language for all things data. It makes it easier to build more robust systems, data pipelines and more. It has features and tooling built-in that other programming languages achieve only by shoehorning or concatenating multiple different tools that add to the complexity of the code base.

Future of Rust in the data engineering field and Rust jobs (transferable skill)

As we've shown very briefly, Rust is a strong performer in the Data Engineering space. However, learning Rust will open up many other doors: since it's a general-purpose programming language, you can use it for many things other than data engineering.

Although we're mainly covering Data Engineering, Rust's growing popularity and increasing adoption by huge companies and open-source projects mean that Rust skills are in high demand. Learning Rust for Data Engineering can lead to new job opportunities and career growth in a variety of fields beyond data engineering, such as cloud computing, embedded systems, and network programming.

What's next?

So far, we've been using some magic to run Rust in your browser but we can agree that it's time to leverage more of what Rust has to offer. That's why it's important to understand how Rust and the tooling around it can be installed on your local machine, or any machine for that matter, how the different parts fit together and how we can get you to create your first Rust projects.

I hope you're hyped! Let's go.

How does Rust compare to Python (and other programming languages)?

Karim Jedda — Wed, 05 Jun 2024 11:09:00 GMT

It's important to know how Rust compares to Python if you are considering switching some of your workloads to Rust. Using Rust instead of Python is a tough sell, especially for things like data engineering. For any data problem that you can think of, there is almost surely some Python implementation somewhere that can provide you with parts of the solution, if not all of it.

So why Rust? Let's have a look.

Syntax and readability

With enough precautions, one can write Rust code that almost reads like Python code. I documented many examples in the initial article that kickstarted this website on my personal blog.

It is not a secret though, that Rust code will be a bit more verbose: it's statically typed (every variable has a fixed type, which you either declare or let the compiler infer), requires a lot of punctuation (semicolons, curly braces...) and leverages borrowing, making it at times tedious to write and read. However, these things provide you with the guarantees of performance and safety.

Writing a similar program in Python, with the same features and guarantees would most certainly yield code that is hard to read. Python's clarity comes from the happy path that it assumes it is always in: it will gladly run wrong code.

Writing and reading Rust definitely needs some getting used to, but the ROI (return on investment) is significant.

Here's an example of a function that concatenates an array in both Python and Rust:

# Python
def main(arr):
    concatenated = "".join(arr)
    print(concatenated)

main(["Hello", "world", "!"]) # works
main("Hello there") # works too

// Rust
fn main() {
    let arr = ["Hello", "world", "!"];
    let concatenated = &arr.join("");
    println!("{:?}", concatenated);
}

You'll notice that the Rust function is a bit more verbose than the Python one. But here's the thing: the Python function would still run if it were given a string as input, whereas the Rust function would not even compile. Don't believe me? Edit the code and try it out.

If you were to change the Python method to provide the same guarantees, this is the code you'll need:

# Python
from typing import List

def main(arr: List[str]) -> None:
    if not isinstance(arr, list):
        raise TypeError("Expected 'arr' to be a list")
    concatenated = "".join(arr)
    print(concatenated)

main(["Hello", "world", "!"]) # works
main("Hello there") # doesn't work

So who's verbose now? There is a lot more to do in Python to get the same level of guarantees that you'd have with Rust.
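For a closer like-for-like comparison, here is a sketch of the Rust version written as a function that takes its input as a parameter, the way the Python main(arr) does. The function name concatenate is a hypothetical name picked for this illustration.

```rust
/// Joins string slices into one String.
/// The parameter type is checked by the compiler, so no runtime isinstance-style check is needed.
fn concatenate(parts: &[&str]) -> String {
    parts.join("")
}

fn main() {
    let joined = concatenate(&["Hello", "world", "!"]);
    println!("{}", joined);

    // Passing a plain string is rejected at compile time,
    // because a &str is not a slice of string slices:
    // let oops = concatenate("Hello there");
}
```

Uncommenting the last line stops the build with a type mismatch error, which is the guarantee the Python version only gets from its manual isinstance check at runtime.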

Dynamic vs static typing

Now there's a case that can be made for Python and typing, since Python has some support for optional typing as previously shown. However, from experience, introducing types to Python makes it extremely verbose and awkward to read.

Types on the Python side are essentially just type hints, which can be circumvented at runtime if needed. This makes Python at best a dynamically typed language, even if there exist static type checkers for Python (mypy) which can be used to approach the benefits of static type checking. Even if there's a benefit to adding static type checkers, their usage feels like a limitation in a language like Python, where the main benefit is flexibility and development speed over strict type safety or absolute correctness. Besides, they'll be yet another tool you'd have to maintain and create a configuration file for. Typing in Python feels like shoehorning, at least to me.

Don't just take my word for it, here's what the creator of Flask thinks about typing in Python:

We clearly don't have agreement on the costs and benefits of typing of Python code at @getsentry but the amount of time wasted in getting shit typed at Sentry and how ugly it looks afterwards make me mad. And it does not even catch regressions.
— Armin Ronacher (@mitsuhiko) March 6, 2023

So what's the alternative here? I think you guessed it, Rust!

Rust on the other hand is statically typed, meaning types are enforced during compile time and cannot be changed on the fly. Variables that are declared with a specific type can only hold values of that type and errors are caught early on before the code even runs. In general, having typing as a main component in a language instead of an add-on is preferred since enforcement of a good practice is safer than suggestion of a good practice: it's on you and the team if somehow someone forgot to add a type to a variable in Python and it caused an error - and this will happen, eventually. Knowing that all tools can fail, it's better to catch most of the failures during compilation, before the code is even built, rather than afterwards.
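To make the "a variable keeps its type" point concrete, here is a small sketch. The names describe, row_count, ratio and label are made up for this illustration; the commented-out lines are the ones the compiler would reject.

```rust
/// Formats a short summary; every parameter has a fixed, compiler-checked type.
fn describe(label: &str, row_count: u32, ratio: f64) -> String {
    format!("{label}: {row_count} rows, ratio {ratio}")
}

fn main() {
    let row_count: u32 = 42; // explicit annotation
    let ratio = 0.75;        // no annotation: the compiler infers f64
    let label = "orders";    // inferred as a string slice

    // let row_count: u32 = "forty-two"; // rejected: expected u32, found a string

    let mut total: u64 = 0;
    total += u64::from(row_count); // widening u32 to u64 has to be spelled out
    // total = "many";             // rejected: total can only ever hold a u64

    println!("{}", describe(label, row_count, ratio));
    println!("total = {total}");
}
```

As written, this prints "orders: 42 rows, ratio 0.75" followed by "total = 42". Note that type inference keeps the annotations optional in many places, while the checks themselves are never optional.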

Examples of other statically typed languages are Java, C++ and Go. I'll add some comparisons to Python too, with interactive examples down the line.

Benchmarking & efficiency

Now I long wondered why types are even useful and why anyone would bother with them. Using types is a tradeoff between code flexibility and code readability, but it doesn't seem to bring much at first. Adding types all over the place can make the code difficult to read and also less flexible to change: if you picked an incorrect type, you need to refactor quite a bit when, for example, your values grow beyond the capacity of their current type (type widening from int32 to int64).
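The widening case can be sketched in a few lines. This is only an illustration under assumptions: the function names sum_u32 and sum_u64 are hypothetical, and checked_add is used here so the overflow shows up as a None value instead of a crash or a silently wrong number.

```rust
/// Sums into a u32 and reports overflow as None instead of wrapping around.
fn sum_u32(values: &[u32]) -> Option<u32> {
    values.iter().try_fold(0u32, |acc, &v| acc.checked_add(v))
}

/// Same sum, but widened to u64 so the result has room to grow.
fn sum_u64(values: &[u32]) -> u64 {
    values.iter().map(|&v| u64::from(v)).sum()
}

fn main() {
    // u32::MAX is 4_294_967_295, so adding 1 no longer fits in 32 bits.
    let values = [u32::MAX, 1];

    println!("as u32: {:?}", sum_u32(&values)); // None: the sum overflowed
    println!("as u64: {}", sum_u64(&values));   // 4294967296
}
```

The refactoring cost mentioned above is visible here: moving from the first function to the second changes the return type, and every caller has to follow.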

The case then needs to be made why types help.

Code runs on computers and how computers work is extremely well understood. What is still an ongoing effort is how we can convert abstractions (human readable code) into machine readable code in the most efficient way. This is why there are so many different programming languages. They all cater to different developer audiences and to different use cases. The commonality being converting human thoughts into things machines can run. When doing this conversion, many optimisations can be made, as well as catching obvious bugs, when we can already predict what the code will be doing.

An example: You can think of a compiler like an architect who designs and reviews blueprints for a house before construction starts. The architect can spot errors or inefficiencies in the plans early on and get them corrected before any physical work begins. This avoids wasting time and money on implementation that would not meet requirements or pass inspections. With a solid plan in place upfront, the actual construction can also proceed more efficiently and effectively due to the level of detail in the blueprints. Dynamic/loosely typed "building" lacks these types of careful upfront checks and optimizations.

Knowing types in advance can help the compiler prepare and implement these optimisations for you. In essence, you can consider the code you write almost like a big configuration file for the compiler. The compiler is itself a piece of code purposely made to efficiently translate human input into maximally efficient machine code. When I learned this my approach to programming changed quite a bit.

All of this is of course only as good as our ability to model and predict things in advance. At least we can give guarantees for the things that are predicted and stay flexible/robust to change. But maybe to drive the point home, an example is in order. Let's take the example of summing all prime numbers between 1 and 1,000,000.

Here is the Rust code:

fn is_prime(n: u32) -> bool {
    if n <= 1 {
        return false;
    }
    for i in 2..(n as f64).sqrt() as u32 + 1 {
        if n % i == 0 {
            return false;
        }
    }
    return true;
}

fn main() {
    let mut sum: u64 = 0;
    for i in 1..1000000 {
        if is_prime(i) {
            sum += i as u64;
        }
    }
    println!("Sum: {}", sum);
}

and here is the Python code:

import math

def is_prime(n):
    if n <= 1:
        return False
    for i in range(2, int(math.sqrt(n)) + 1):
        if n % i == 0:
            return False
    return True

sum = 0
for i in range(1, 1000000):
    if is_prime(i):
        sum += i
print("Sum:", sum)

To benchmark the Rust code, you could run cargo bench, but we're not going to use it right now. What we'll do is just use the Unix time tool to keep things simple and get a rough estimate.

Below are the execution times for the two pieces of code above. The important thing to note is the relative time difference and not the absolute numbers since those will be different on different machines.

# Rust
time ./rustprimes # 0.50s user 0.00s system 74% cpu 0.673 total

# Python
time python3 pythonprimes.py # 2.25s user 0.02s system 99% cpu 2.270 total

Given relatively similar code, the Rust code is a clear winner, meaning the Rust compiler did a great job optimising the input it was given.
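If the time tool is not available (on Windows, for instance), a rough in-process measurement is possible with the standard library. The sketch below is an alternative way of measuring, not the method used for the numbers above. It rewrites the prime sum in iterator style, the helper name sum_primes_below is made up for this illustration, and it assumes an optimised build (for example rustc -O or cargo run --release), since debug builds are much slower.

```rust
use std::time::Instant;

/// Trial division up to the square root, same idea as the loop version above.
fn is_prime(n: u32) -> bool {
    if n <= 1 {
        return false;
    }
    let limit = (n as f64).sqrt() as u32;
    (2..=limit).all(|i| n % i != 0)
}

/// Sums all primes strictly below the limit, widening to u64 to avoid overflow.
fn sum_primes_below(limit: u32) -> u64 {
    (1..limit).filter(|&n| is_prime(n)).map(u64::from).sum()
}

fn main() {
    let start = Instant::now();
    let sum = sum_primes_below(1_000_000);
    let elapsed = start.elapsed();

    // Only the computation is timed, not process start-up.
    println!("Sum: {sum} (computed in {elapsed:?})");
}
```

Because this excludes process start-up, the number will not match the output of time exactly. As before, only the relative difference between runs on the same machine is meaningful.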

Now you might object that I'm not comparing the Rust code to a typed version of the Python code:

import math

def is_prime(n: int) -> bool:
    if n <= 1:
        return False
    for i in range(2, int(math.sqrt(n)) + 1):
        if n % i == 0:
            return False
    return True

sum: int = 0
for i in range(1, 1000000):
    if is_prime(i):
        sum += i
print("Sum:", sum)

And guess what? The results are the same:

# Python
time python3 pythonprimestyped.py # 2.24s user 0.02s system 99% cpu 2.257 total

This makes the case that adding types to Python doesn't bring any runtime performance benefit, since the interpreter does not use the hints to optimise the code. What remains is the remote (unsubstantiated) eventuality that it will make it easier for large teams to collaborate on the code base.

All of this to drive the point home that Rust is geared and designed from the ground up for leveraging types (amongst other things) to be efficient, as opposed to Python where types are rather more of a convenience.

Comparing with other data engineering languages

That was Python. Now how does Rust fare compared to other programming languages that are compiled and typed? The following is not meant to be an exhaustive list but rather a comparison of languages that are usually used in data engineering tasks.

Amongst programming languages usually used for data engineering, either in the development of the big data systems themselves or as interfaces to using those tools, we mainly find: Java, Scala, Go and C++.

C++ might be surprising, but has been used with quite some success for developing the big data systems themselves (Redpanda, DuckDB).

The following is completely biased, but informed by my personal experience throughout the years.

Java/Scala

Java and Scala are currently the most used languages for building big data systems. Many frameworks, tools and libraries are written in and support Java & Scala out of the box. Amongst these are notably Apache Spark, Apache Hadoop, Apache Flink, Apache Beam and many more. Java offers good performance and portability, and can handle large scale projects. Java is however very verbose, which can make it difficult to read, write and maintain. This is why Scala is sometimes presented as a good and more concise alternative to Java that still runs on the JVM.

From my personal experience I can say:

Scala has a very steep learning curve
Projects written in Java very quickly grow in complexity, making them extremely difficult to operate and maintain (Kafka is an example of such a behemoth)
Java projects require a lot of boilerplate code
Garbage collection in both can cause unnecessary latency
Java has a larger community than Scala, easier to find help

Both Java and Scala can do the job but they are more tailored towards enterprise settings, where robustness largely outweighs delivery speed. Generally speaking, for every Java project you can find a huge company (or cloud provider) that offers support and services for the tool, since it requires a lot of resources to just understand, let alone maintain, what is going on. It's also always a good thing for an enterprise to have someone else responsible for these things and file the costs incurred by these big systems under operating costs.

Depending on your team setup, they might be a good choice. But I'm here to convince you otherwise, right? ;)

Go

Go is a very interesting programming language. It features fast compile times and has built in concurrency support. Besides, it's - relatively - simple to learn. In my personal opinion I think it's a great language for things like microservices where there is a lot of chatter/traffic over a network. It has limited library support for data processing.

I'd recommend Go for data scrapers or for tasks where it's only important to fetch data from an API/Database for example, but not for data transformation.

C++

I've used C++ quite a lot growing up and it's been one of the first programming languages - with Java - that I learned. It features high performance, low level control over hardware and has many libraries for efficient data processing. DuckDB, Redpanda and other tools are written in C++.

It has a steep learning curve, that's for sure. On top of that, it requires manual memory management and is prone to memory leaks and segmentation faults.

Most of these issues are properly addressed in Rust.

Tooling

The tooling and ecosystem around a programming language are extremely important. They can either stand in your way or boost your efficiency.

For most programming languages (except perhaps Scala), you need to stitch multiple tools together to get a simulacrum of a working environment. For instance, in Python, to get typing, tests and package management you'll need at least:

mypy: For typing
pip: Package management for Python, or Poetry or pipenv or or or ... (the investigation is still open on this one)
venv: virtual environment to not pollute your system libraries
pytest: Testing framework - let's be real, very few people use the integrated unit test library ;)
...

These are all things that need to be properly maintained and configured to provide a setup that is robust in a team setting. Some might say it's not needed, we can abstract all of it away by introducing yet another tool like Docker to hide away the complexity in a CI/CD pipeline, somewhere. But let's be real, it's a lot. Although these libraries are great and do their job pretty well, they add friction to something that should be smooth (this is written in 2023) especially for beginners.

The real challenge is shoveling & processing data, not yak shaving around tooling.

In Rust's case, the tooling is designed to make the development process as smooth and efficient as possible. Most of the work you'll do involves using Cargo, which is Rust's package manager and build tool. It comes bundled with Rust. It even makes it easy to build and bundle projects written in Rust. Cargo also provides a simple way to manage different versions of a package, so you can easily switch between different versions of a dependency without worrying about conflicts.

One of the major advantages of Cargo is its ability to automatically build and link native libraries, which can be a complex task in other programming languages. This means that with Rust and Cargo, you can easily build and distribute cross-platform applications with native performance.

It's not all perfect. One small caveat persists: you sometimes get long compile times when compiling Rust code. So that's that.

There's a lot more to unpack over the next chapters, but for an overview it suffices to say that Rust's tooling has been purposefully built to help with programming and not added as an afterthought.
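As one example of tooling that is built in, here is a sketch of a unit test living next to the code it checks, with no separate test framework to install. The function mean and the test names are hypothetical and only serve this illustration; in a Cargo project, the tests run with cargo test.

```rust
/// Arithmetic mean of a slice, or None when there is nothing to average.
fn mean(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    Some(values.iter().sum::<f64>() / values.len() as f64)
}

fn main() {
    println!("{:?}", mean(&[1.0, 2.0, 3.0]));
}

// This module is only compiled when the test harness runs (for example via cargo test).
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn mean_of_empty_slice_is_none() {
        assert_eq!(mean(&[]), None);
    }

    #[test]
    fn mean_of_three_values() {
        assert_eq!(mean(&[1.0, 2.0, 3.0]), Some(2.0));
    }
}
```

Compared with the Python list above, the testing piece of the setup is covered by the toolchain that rustup already installed.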

When to use which?

As my favorite tax consultant would say: "it depends".

This is not a question that can usually be answered easily although my goal with this website is to motivate you to try Rust out and see if it works for you.

These considerations might help in your decision, on top of the things we have already covered. They are mostly covering why you should use Rust and are not meant to be applicable in absolutely every case, your mileage may vary.

The team's capabilities

A very important factor to consider first is how the team is set up and what capabilities are at the team's disposal. If you already have a team of Rust developers, we wouldn't be having this discussion right now, as Rust would be the obvious choice. If the team is mostly composed of Java or Python developers, it might be tough to convince them to try something new. Refer them to this website and I hope to do a good enough job of getting them comfortable with learning something new by building many of the things they are accustomed to.

Either way, if the team is open minded in terms of what tool to use, Rust can be a good candidate to try out, even as a first MVP (minimum viable product) and compare with a Python one. I'll share more resources as we go, of teams who made the jump and their conclusions.

The budget & constraints

If you have a lot of budget, work in a big enterprise with many teams and a lot of legacy systems you need to interface with, it will be very difficult to avoid Java, unless you're the one calling the shots on a new greenfield project. The question in big companies is usually "who's going to maintain this" and as long as there are no "big Rust shops", it might be a risky bet.

Even if the case can be made that using Rust will make things easier to maintain in the long run, it remains to be proven. What works for one project might not work on another so it's important to take into consideration how much wiggle room there is to try out something new.

The scope & goal

Beyond the points mentioned above, the most important point that should dictate which tools you use is what you actually want to build.

If you're building a landing page, Rust might not be the best tool.

If you're building a data stream processing pipeline and are hitting road blocks, or high costs, with other methods, Rust might well help.

Define your goal and run some tests and comparisons: ultimately, the best tool for the job is the tool that gets the task done within your context and unique setting.

From my personal experience

I'll leave you with this though: if you were to rank (from 1 = good to 4 = less good) the different programming languages on system maintainability, ease of use, flexibility to change and performance, I'd place them as follows (this is very subjective and based on my personal experience):

Language	Maintainability	Ease of use	Flexibility to change	Performance
Rust	1	3	3	2
Java/Scala	2	2	2	3
Python	3	1	1	4
C++	4	4	4	1

Some comments on this table:

The numbers are completely subjective and aim to provide relative comparison metrics instead of absolute ones. The numbers will look completely different for somebody else, especially depending on the years of experience with those languages. Over the following chapters, I'll provide more color to these metrics and explain a little bit more. For now, the main takeaways are:

Rust systems are more maintainable than Java, Python or C++ ones, but place second after C++ in terms of performance, even if they come relatively close in different benchmarks
Python is the language that is the easiest to use and the easiest to change (meaning changing or updating the code)
Java and Scala are still strong contenders, all things considered.

This table could be extended to include things like which programming language is more efficient and more. This will however be the topic for a longer chapter later on.

I think so far we covered a lot without even discussing programming language efficiency and resource usage, which are also important things. It's time we get going and start with programming, don't you think? It's the only way to really get an idea of what we're dealing with.

Let's make a small recap now


Advantages of Rust for Data Engineering

Karim Jedda — Tue, 04 Jun 2024 11:04:00 GMT

After going through some of Rust's main features, it might make sense to have a closer look at how these features help our data engineering efforts. Some of the points have already been covered but it makes sense to list them again in a general overview.

Reliability and scalability

Rust is a relatively reliable programming language. Writing the code and compiling it successfully provides certain guarantees that help prevent many runtime errors: it's statically typed and ensures memory safety at compile time. Both these features make Rust very reliable in translating ideas correctly into machine instructions.

Rust's type system can also make the code more readable, which helps increase reliability not only in function but also in representation: another developer can reason about the code base without too much overhead (given the initial developer didn't try to be "too smart" by implementing complicated code).

Rust is also very performant, with performance comparable to that of C++ and often much better than that of interpreted languages like Python and Ruby. Its concurrency model and memory safety help make it a good choice for building scalable systems that can make efficient use of multicore CPUs.
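To give a feel for the multicore point, here is a minimal sketch using only the standard library. The function name parallel_sum and the choice of four workers are assumptions made for this illustration, not a recommendation for real workloads, where a dedicated library would usually be a better fit.

```rust
use std::thread;

/// Splits the slice into roughly equal chunks and sums each chunk on its own thread.
fn parallel_sum(values: &[u64], workers: usize) -> u64 {
    let workers = workers.max(1);
    // Ceiling division so every element lands in a chunk; never 0, since chunks(0) panics.
    let chunk_size = ((values.len() + workers - 1) / workers).max(1);

    // Scoped threads may borrow values directly: the compiler knows they
    // all finish before this function returns, so no copying is required.
    thread::scope(|scope| {
        let handles: Vec<_> = values
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || chunk.iter().sum::<u64>()))
            .collect();

        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .sum::<u64>()
    })
}

fn main() {
    let data: Vec<u64> = (1..=1_000).collect();
    println!("{}", parallel_sum(&data, 4)); // 500500
}
```

The threads only read the data, so sharing it is allowed. Trying to mutate values from several threads at once would be rejected at compile time, which is where the memory safety and concurrency model meet.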

Rust is a scalable and reliable programming tool to cover a wide variety of data engineering tasks.

Performance

I'd like to elaborate a little bit more on performance by starting with the fact that most programs can be made performant with enough time invested debugging, profiling and tweaking different parts. Usually this time invested in tuning the software is not reflected in online benchmarks. It therefore helps to see performance in relation to other things and not as an absolute metric (unless you just want to calculate decimals of PI). In Rust's case, it is easy to write performant code without having to refer to arcane and obscure knowledge: use the basics and the software should be good enough. It's still possible to tune a lot more but it's more than enough for 80% of the cases.

At the end of the day, performance is a function of how fast a set of instructions leads to the result and how much effort is necessary to get there.

There is this thing called Rosetta Code online, comparing programming languages on a set of tasks, where the implementations are often "tailored" to be the most efficient ones. These don't reflect real-life conditions.

Here's a better benchmark: given two developers with comparable skillsets, ask them to implement the same thing in under 30 minutes (for example: invert a matrix). One can use Rust and one can use Python. Do this experiment often enough and then see what the average performance of the different implementations is, clustered by programming language. I'm almost certain Rust wins by a landslide.

Of course all implementations can be tuned but it doesn't mean they will ever be looked at again if they work.

Popularity and community support

There is much to be said about Rust's steadily growing popularity and support within the community. For many years in a row, Rust has been voted the most loved programming language in the Stack Overflow Developer Survey.

It's being used now across multiple industries like web development (example), embedded programming, game development and, just like this website will show, data engineering. It's been adopted by several large tech companies: Amazon, Microsoft, Mozilla, Discord and Google among others.

There is also a growing and welcoming community of developers who contribute to the language and tools built with it. Rust's community offers multiple avenues for learning and this guide hopes to be one of those. More and more events around Rust are being organized and more content is being generated daily to teach people about Rust.

New developments like WASM raise a lot of interest and curiosity as to how they tie with Rust's capabilities. The future is looking bright for Rust's adoption.

Libraries and ecosystem

Rust's community has developed a rich ecosystem of libraries and tools, many of which are open-source and available on platforms like crates.io, GitHub, and GitLab. There are a lot of libraries you can use for data engineering tasks, notably Pola.rs, Apache DataFusion and Apache Arrow, which we will cover on this website. The list is constantly growing with new additions that leverage Rust's performance and interface seamlessly with the language's capabilities.

Some libraries might still be missing though and you won't find (yet) the same coverage of libraries that programming language ecosystems like Python provide.

All things considered, Rust is a great candidate for almost all of the tasks that require data transformation. Over the next chapters, we'll go through all the details with a lot more code.

Next, we'll see how Rust compares to Python in a bit more detail.


Features of Rust

Karim Jedda — Mon, 03 Jun 2024 11:00:00 GMT

Now let's have a look at Rust's features, without going too much into details. Just a light overview of what this is all about and how it might be helpful for us data engineers.

Statically-typed and compiled

Rust is a statically typed language. To simplify, this means that the type of variables and values must be known or at least specified at compile time. This mostly allows for better error checking. It can also be helpful in eliminating the guesswork when using code you didn't write yourself. Many times, I had Python fail at runtime because I passed a variable with an incompatible type. Not that I didn't check beforehand, I just couldn't check all possible variants. In Rust, the compiler does that for me.
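One way to read "checking all possible variants" is through enums and match. In the sketch below, the Cell type and the as_number function are hypothetical names invented for this illustration: the compiler refuses to build the program unless every variant is handled.

```rust
/// A value as it might arrive from a loosely structured data source.
#[derive(Debug)]
enum Cell {
    Int(i64),
    Float(f64),
    Text(String),
    Missing,
}

/// Converts a cell to a number where that makes sense.
/// Removing any arm of the match makes the compiler report a missing case.
fn as_number(cell: &Cell) -> Option<f64> {
    match cell {
        Cell::Int(v) => Some(*v as f64),
        Cell::Float(v) => Some(*v),
        Cell::Text(s) => s.trim().parse().ok(), // text that is not a number becomes None
        Cell::Missing => None,
    }
}

fn main() {
    let row = vec![
        Cell::Int(3),
        Cell::Float(1.5),
        Cell::Text(String::from(" 2.5 ")),
        Cell::Text(String::from("n/a")),
        Cell::Missing,
    ];

    for cell in &row {
        println!("{:?} -> {:?}", cell, as_number(cell));
    }
}
```

Adding a new variant to Cell later on would turn every match that forgets it into a compile error, so the cases that were easy to miss in Python are listed by the compiler instead.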

Rust is also compiled, which just means the code is transformed into machine code (via LLVM) before actually running. This can have an impact on the efficiency and execution time of the programs, but that always depends on the implementation.

Because Rust is statically typed and compiled, you get more guarantees as to the performance (it's predictable) of your program and a reduced likelihood of runtime errors.

In short, Rust is statically typed and compiled, which helps us avoid shipping code with mistakes that we would otherwise only catch by running enough tests beforehand. These guarantees are built into the Rust toolchain.

💡

Rust being compiled is both a blessing and a curse. Compilation often requires some time and it's a major criticism Rust receives compared to other languages.

By also being typed, it slightly increases the verbosity but doesn't do so unnecessarily like other programming languages. There is for example no need for defining a class nor writing public static void to confuse yours truly.

It's a give and take.

Safety features (borrowing and ownership)

Rust comes with a series of safety features that enable it to be a great tool for efficiently working with data. Borrowing and ownership refer to models for safe and efficient memory management.

Ownership means that any given value has one owner that is responsible for its lifetime. In other words, this owner is responsible for allocating the memory occupied by the value and deallocating it at the end of its lifetime (i.e. when it's no longer needed). This is very helpful in avoiding things like memory leaks and other unpleasant SRE nightmare fuel.
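Here is a small sketch of ownership in action. The `consume` function and the sample rows are hypothetical and only serve to show a value changing owner.

```rust
// Hypothetical helper: takes ownership of the vector and returns its length.
fn consume(rows: Vec<String>) -> usize {
    rows.len()
    // `rows` goes out of scope here and its memory is freed automatically.
}

fn main() {
    let rows = vec![String::from("a,1"), String::from("b,2")];

    // Ownership of `rows` moves into `consume`.
    let n = consume(rows);
    println!("consumed {} rows", n);

    // `rows` can no longer be used here. Uncommenting the next line
    // fails to compile because the value has been moved:
    // println!("{:?}", rows);
}
```

There is no manual free and no garbage collector involved: the memory is released when the single owner goes out of scope.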

Borrowing, on the other hand, means that it is possible to pass around references to a value without changing or transferring ownership. Think of this "reference to a value" like an alias or a symlink to the contents of the value. Different parts of the program can use the value without requiring a copy step. This borrowing can be done in a mutable way (we can change the value) or in an immutable way (we can't change the value). Only one mutable borrow of a value can be active at a time, and not alongside any immutable borrows.

Now the natural question might be how borrowing affects ownership, especially with a mutable borrow. In this case, the owner keeps ownership but temporarily gives up access to the value until the borrow ends.
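To make the two kinds of borrow concrete, here is a minimal sketch. The `total` and `add_bonus` helpers are hypothetical names chosen for this illustration.

```rust
// Hypothetical helper. Immutable borrow: can read the values but not change them.
fn total(scores: &[u32]) -> u32 {
    scores.iter().sum()
}

// Hypothetical helper. Mutable borrow: can change the values in place.
fn add_bonus(scores: &mut Vec<u32>, bonus: u32) {
    for s in scores.iter_mut() {
        *s += bonus;
    }
}

fn main() {
    let mut scores = vec![10, 20, 30];

    let before = total(&scores); // shared borrow, ends when the call returns
    add_bonus(&mut scores, 5);   // exclusive borrow, ends when the call returns
    let after = total(&scores);

    println!("before = {}, after = {}", before, after);

    // `scores` was never moved, so `main` still owns it and can use it.
    println!("{:?}", scores);
}
```

Neither function copies the vector, and `main` remains the owner throughout.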

Let's take a handheld console as an example. You borrow it from your friend Louis. Meanwhile, it is not possible for Louis (or any other friend) to use that console even though he still has ownership over that console. The handheld console has only 1 game in it and you manage to beat your friend's high score. Now you give the console back to Louis and disappoint him by pulverising his high score.

Rust kinda works the same. Let's see this through an example. Just like previously, you can run the code. Can you guess the output before running it?


fn main() {
    // Louis owns the console
    let mut console_high_score = 8999;

    {   // I borrow the console for this block

        let y = &mut console_high_score;

        // I'm ruining the high score here by performing an action (mutation)
        *y += 2;

    }  // At the end of this block, I return it to Louis

    println!("console_high_score is now {}", console_high_score);
}

For every line of code or object in your Rust code, you'll get used to reasoning about these two features. A bit of planning upfront saves a lot of head-scratching afterwards. This won't be easy at first, but the compiler will help, as you'll see in the next chapters.

Don't worry too much about the & or * characters for now.

Concurrency support

The ownership/borrowing model explained above helps prevent data races by enforcing strict rules over shared resources. In the case of the handheld console, another friend cannot borrow the console from Louis while it's in my hands (unlike some stocks and derivatives, for the financially inclined).

This is especially helpful when dealing with concurrent programs. To simplify to the extreme, concurrency is when two or more overlapping tasks are handled by a single processor. For example, serializing, validating and storing a piece of data can all run on one single processor.

It is usually necessary to share data between concurrent tasks and there are many ways Rust helps with that, other than just ownership and borrowing. There are things like threads, futures, streams and more but it might be a bit too soon to explore.

The point is, Rust has built-in support for safe concurrency without the mess and this is very good news for data intensive tasks.
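As a rough sketch of what this can look like with threads from the standard library, the following splits some data in two and sums each half on its own thread. The `parallel_sum` function is a hypothetical example, not a recommended way to sum numbers.

```rust
use std::thread;

// Hypothetical helper: sums a slice by splitting it into two halves,
// each summed on its own scoped thread.
fn parallel_sum(values: &[u64]) -> u64 {
    let (left, right) = values.split_at(values.len() / 2);
    thread::scope(|s| {
        // Both threads only read the data, so shared borrows are allowed.
        let l = s.spawn(|| left.iter().sum::<u64>());
        let r = s.spawn(|| right.iter().sum::<u64>());
        l.join().unwrap() + r.join().unwrap()
    })
}

fn main() {
    let values: Vec<u64> = (1..=100).collect();
    println!("sum = {}", parallel_sum(&values));
}
```

If both threads tried to mutate the same data instead of just reading it, the borrowing rules would reject the program at compile time rather than letting a data race reach production.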

Let's check out the advantages of Rust for Data Engineering:

Advantages of Rust for Data Engineering

After going through some of Rust’s main features, it might make sense to have a closer look at how these features help our data engineering efforts. Some of the points have already been covered but it makes sense to list them again in a general overview. Reliability and scalability Rust


History of Rust

Karim Jedda — Sun, 02 Jun 2024 10:55:00 GMT

Now that we've had a short introduction to Rust for Data Engineering, let's get an understanding of where it came from and perhaps where it's headed.

The beginnings

Graydon Hoare created Rust as a personal project in 2006 while working at Mozilla. The language was announced in 2010 and its first stable version, Rust 1.0, was released in 2015.

When Rust was first released, it was met with a bit of criticism, which is quite natural in the programming realm. The main remarks were around its complexity, steep learning curve, verbosity (boilerplate code) and lack of maturity / limited ecosystem.

This has changed a lot since then. Rust has improved considerably, the community has grown and much (if not most) of the early criticism has been addressed.

Rust today

Fast forward to 2024 and Rust is gaining visibility and popularity by the day.

Rust in the Linux Kernel
CTOs of multiple companies encouraging the use of Rust
Big corporations officially announcing their support for Rust
Me talking about Rust for Data Engineering 👻

... and the list goes on.

Next, let's have a look at Rust's features:

Features of Rust


Introduction to Rust

Karim Jedda — Sat, 01 Jun 2024 19:59:00 GMT

We'll start this guide with a short introduction to Rust and a few code snippets that show how you can start using it.

💡

It's not important to understand everything immediately. The important part is to get you running Rust code as fast as possible, ideally in the next 5 minutes and without having to install anything just yet. It's always helpful to get a general feeling for the language before diving in. This guide is structured so that you can run almost every code block and see the output it produces.

That way, we keep the examples practical and down to earth. When you hover on a code block, you'll see a play button, upon clicking on it, the code will run. More on that, in a following chapter.

What is Rust?

Rust is a programming language which, according to the official website (January 2023), is "a language empowering everyone to build reliable and efficient software". It was started by Graydon Hoare in 2006 and its first stable version was released in 2015, more on Rust's history in the next chapter. The key points are that it's a multi-purpose programming language with influences from a wide range of programming languages, notably C++.

It is frequently advertised as a memory safe and performant alternative to C/C++.

The following is Rust code that outputs "Hello World", try and run it 😄

fn main() {
  println!("Hello, World!");
}

How Rust works is that the code you write is first compiled to machine code and then executed. This is done by the Rust compiler, which relies on LLVM. It's not important to know what these are yet, but you can look them up if you want. The gist of it: the same Rust code can be compiled for many different platforms. As a matter of fact, it can even be compiled to WebAssembly and run in your browser ◕ ◡ ◕

The following is also Rust code. Can you guess what it does before running it? You can edit this one as well to replace the names of the functions if you like but keep it short, we have work to do.

fn mystery_function(input: &str) -> i32 {
    let mysteries = "aeiouAEIOU";
    input.chars().filter(|c| mysteries.contains(*c)).count() as i32
}

fn main(){
    let input_string = "Hello, world!";
    let mystery_count = mystery_function(&input_string);
    println!("The number of mysteries in '{}' is {}", input_string, mystery_count);
}

That looks pretty much like Python, right? The only weird looking parts are those &, * and the -> symbols and perhaps the str and i32 too. We'll get to those in time, not to worry.

Why is Rust a good choice for data engineering?

The argument that Rust is a memory safe and performant alternative to C/C++ doesn't matter that much to Data Engineers, since we don't usually deal with C or C++.

What is important is that Rust is a systems programming language which enforces a certain approach to programming that is at the same time efficient and good at eliminating a whole category of errors that usually happen with data workloads.

Here's perhaps a very simplified example that doesn't require dwelling on performance, low-level control or even memory usage.

Let's take the following Python code & run it multiple times:

import random 

def count_characters(s: str, c: str) -> int:
    return s.count(c)

def get_count():
    if(random.random() < 0.5):
        return count_characters("Hello world", "o") 
    else:
        return count_characters("Hello world", 23) 

result = get_count()
print(result)

If you run the above code multiple times, you'll see that it sometimes works and sometimes fails. Of course, a linter, a battery of tests or even a properly set up type checker might flag this as a warning or even an error, but none of them prevents that code from ending up running somewhere, especially if the problem isn't caught in time.

When using complex objects or when the scope of the project increases, these types of errors sneak in and are discovered only when it's too late. You can't test everything.

This happens more than you think. This simplistic example is meant to convey that preventing these issues from hitting production is an afterthought, and there will never be any guarantees.

Consider the equivalent code in Rust and try to run it:

extern crate rand;
use rand::Rng;

fn count_characters(s: &str, c: char) -> usize {
    s.chars().filter(|x| *x == c).count()
}

fn main() {
    let mut rng = rand::thread_rng();
    if rng.gen_bool(0.5) {
        count_characters("Hello world", 'o');
    } else {
        count_characters("Hello world", 23);
    }
}

Notice what the compiler says. Does it make a bit more sense now? The compiler doesn't allow you to build this application, so needless to say it won't end up on your production server.

💡

Bonus: Try to fix the code and make it run (it's editable! 😄 )

In essence, interface definitions and specifications are enforced at the lowest level and not as an afterthought in a CI/CD pipeline or in some external YAML file serving as documentation. This of course always requires a bit more work and planning upfront but you'll get a few guarantees in exchange, especially the guarantees that matter in the context of working with data. This way of approaching things is not the reason for Rust's performance and scalability but certainly enables it.
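For reference, here is one possible way the example above could be made to compile, as a sketch only. To keep it free of external crates, the random branch is replaced by a hypothetical `coin_flip` parameter, and the character `'l'` in the second branch is an arbitrary choice.

```rust
fn count_characters(s: &str, c: char) -> usize {
    s.chars().filter(|x| *x == c).count()
}

// Hypothetical stand-in for the random branch: the caller supplies the coin flip.
fn get_count(coin_flip: bool) -> usize {
    if coin_flip {
        count_characters("Hello world", 'o')
    } else {
        // The failing version passed the integer 23 here; a `char` is required.
        count_characters("Hello world", 'l')
    }
}

fn main() {
    println!("{}", get_count(true));
    println!("{}", get_count(false));
}
```

Both branches now respect the function's signature, so there is no code path left that can fail because of a wrong argument type.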

We saw through a very simple example what makes Rust such a good fit for Data Engineering. Over the next chapters we'll discover more examples and get them to run.

If you like it so far, consider subscribing for free to get the new chapters.

Continue the lesson here:

History of Rust


Welcome to Data With Rust!

Karim Jedda — Sun, 17 Mar 2024 12:38:17 GMT

This website is dedicated to teaching you why and how to use Rust for data. It's structured as a book, but is a practical handbook on how to get started using the Rust programming language for Data Engineering tasks. I'll show you practical examples of how Rust can be used for most of the tasks expected from a data engineer.

💡

This is a personal hobby project made by Karim Jedda. Data With Rust is in no way endorsed by or affiliated with the Rust Foundation. Rust & Cargo are registered trademarks of the Rust Foundation.

With that out of the way, let me start by saying that most of the content will be openly accessible, however some sections and upcoming features will be reserved for members of my newsletter. This website is a project I started working on after getting multiple requests and positive signals to compile resources and guides on using Rust as an alternative to Python for Data Engineering.

Read the book

As a short teaser, and hopefully to get you excited, here's the recording of my EuroPython 2023 talk, titled "Rust for Python Data Engineers" in which I make the case for using Rust for data engineering workloads:

What you can expect

First of all, I'm not a Rust expert but I've been writing Python code for at least a decade now (feeling old...). One thing I know is how to get things done in a maintainable and simple way (mostly). I've worked in a lot of different industries and in different settings. You can find an example of my teaching style, approach to learning and a glimpse of how my brain works on my personal blog.

I'll focus here on practical ways of introducing Rust into your day-to-day tasks. Theoretical concepts will not be discussed in depth, but wherever applicable I'll refer to people, guides, videos or websites that can explain them better than me.

The code and examples in this guide will be practical and very hands-on, the main focus being on getting things done.

The chapters are going to be published and updated regularly. You can subscribe to the newsletter to be the first to get the new chapters and updates. You can also follow me on LinkedIn or alternatively on Twitter as I'll post updates there, if that's easier and requires less commitment. Here's a link to the privacy policy for this project, just in case.

Who is this course/guide for?

You might be wondering who this guide could be for, and it's a fair question. The audience I'll keep in mind for this book is as follows:

Anyone curious about how Rust can be useful for data problems
Anyone who heard about Rust but doesn't know where to start
Python developers who would like a hands-on starting point for Rust
Developers searching for a practical and updated guide tailored to data workloads
Developers looking to add a new skill but like to see some code before investing more time
Junior to Senior developers, everyone should be able to find something interesting
Myself: Someone who knows Python very well and got interested in Rust in the context of my job

One thing is for sure though: This guide might not be for Rust experts, who certainly already know more than me. Feedback and corrections are always welcome, which is why most of the guide is available for free.

Having fun

To be completely fair and transparent, learning Rust is a little bit difficult.

This guide won't be an academic lecture of what parts constitute the Rust language but rather a compilation of useful things you can do with Rust, today, to solve a wide variety of data challenges.

By seeing the code and understanding what task it's supposed to solve, I believe that understanding of how Rust works can build up over time.

In other words, some concepts might not click immediately or overnight. It might take some time and that's completely normal.

It doesn't mean that there won't be any explanation or theoretical concepts, it's just that they take second place to making things work.

The bet is that it might make a complicated topic approachable and perhaps fun too.

Let's start.


Running code in the browser

Karim Jedda — Sat, 09 Mar 2024 10:43:18 GMT

The best way to learn is to try things out. With embedded coding environments this has never been easier:

print("This code will run on a container somewhere and return here!")

You can edit the code, and it will still run, feel free to experiment:

print("This code will run on a container somewhere and here!")

This works across programming languages and tools, no need for setting up anything locally until you feel confident.

fn main() {
  println!("Whats up people?");
}

This makes it possible to tailor the test environment to showcase a specific thing.

fn main(){
    let mut console_high_score = 8999;
    {   // I borrow the console for this block
        let y = &mut console_high_score;
        // I'm ruining the high score here by performing an action (mutation)
        *y += 2;
    }  // At the end of this block, I return it to Louis
    println!("console_high_score is now {}", console_high_score);
}

This is ideal to tinker, try things out, and learn from mistakes.

Data With Rust: Update #2

Karim Jedda — Wed, 17 May 2023 19:41:00 GMT

Data With Rust: Update #1

Karim Jedda — Sun, 12 Mar 2023 19:39:00 GMT
