15 Python Libraries Every Data Engineer Needs
Reduce complexity and improve your data engineering work
Python's ecosystem is still growing strong, and the explosion of libraries can make getting into data engineering a bit intimidating.
So I sat down and thought, "If I could keep only 15 Python libraries for most of my data engineering work, which ones would I choose?" To make this more digestible, I sorted these into four categories: data ingestion, data transformation, developer tools, and data validation.
If you prefer watching over reading, I've got you covered.
🌊 Data Ingestion
1. Requests
This is the go-to HTTP library in Python. It is essential for querying APIs and fetching data from the web, including web scraping tasks. Mastering more than just a basic GET request pays off: understanding how to handle status codes and manage retries will help you build robust pipelines.
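As a small sketch of what that looks like, here is one way to configure automatic retries on a Requests session (the URL, retry counts, and backoff settings are just illustrative):
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
# Retry transient server errors and rate limiting with exponential backoff
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retries))
# Always set a timeout and raise on 4xx/5xx responses
response = session.get("https://example.com", timeout=10)
response.raise_for_status()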
2. BeautifulSoup
Used alongside Requests, BeautifulSoup is the Python standard for parsing HTML content. It's a must-have if you do web scraping.
To illustrate how the above two libraries work together, here is a short snippet:
import requests
from bs4 import BeautifulSoup
# URL to fetch
url = 'https://example.com'
# Send a GET request
response = requests.get(url)
# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find all 'a' tags
    links = soup.find_all('a')
    # Print each link's URL and text
    for link in links:
        print(f"Text: {link.text.strip()}, URL: {link.get('href')}")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
3. dlt
dlt from dltHub is a bit more than just a library. It's a framework that follows best practices for creating data pipelines. It supports various sources and destinations, including REST APIs and databases, making it a versatile choice for data ingestion. Check dlt's documentation on its core concepts to understand how things work.
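As a rough sketch (assuming the DuckDB destination and a made-up dataset and table name), a minimal dlt pipeline can look like this:
import dlt
# Define a pipeline that loads into a local DuckDB database
pipeline = dlt.pipeline(
    pipeline_name="api_ingest",
    destination="duckdb",
    dataset_name="raw_data",
)
# Any iterable of dicts works as a source; here a tiny inline example
data = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
load_info = pipeline.run(data, table_name="users")
print(load_info)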
🛠️ Data Transformation
4. DuckDB
DuckDB is an in-process OLAP database written in C++ that acts like a Swiss army knife for data engineering. It supports data formats like CSV, JSON, and Parquet, as well as table formats like Iceberg and Delta Lake. It works well with dataframe libraries like Polars and Pandas thanks to Arrow: you can query and process your Pandas/Polars dataframes directly with DuckDB. DuckDB focuses on SQL, offering many functions to simplify data manipulation tasks.
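Here is a small sketch of that dataframe interoperability (the dataframe contents are made up): DuckDB can run SQL directly against a Pandas dataframe that is in scope.
import duckdb
import pandas as pd
# An in-memory dataframe can be queried by name, no copy or conversion needed
df = pd.DataFrame({"city": ["Paris", "Lyon", "Paris"], "sales": [120, 80, 40]})
result = duckdb.sql("SELECT city, SUM(sales) AS total FROM df GROUP BY city").df()
print(result)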
Note: For full disclosure, at the time of writing I'm working for MotherDuck (DuckDB in the Cloud), so yes, I'm kind of biased here. If you want to learn more about DuckDB, you can check my work on the MotherDuck YouTube channel.
5. Polars
In contrast to DuckDB's friendly SQL, Polars takes a dataframe approach. Polars is a high-performance library written in Rust with Python bindings. Like DuckDB, it's especially good for single-node computing environments, on local machines or in the cloud. It handles different data file types and transformations efficiently, making it ideal for fast data processing tasks.
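A minimal sketch of Polars' expression-based API (the file name and columns are hypothetical, and this assumes a recent Polars version):
import polars as pl
# Lazy scan lets Polars optimize the whole query before reading the file
result = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 0)
    .group_by("city")
    .agg(pl.col("amount").sum().alias("total"))
    .collect()
)
print(result)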
6. PySpark
PySpark (Apache Spark's Python API) has been the gold standard of the past decade for handling large datasets across distributed systems. Note that truly large datasets are actually a tiny percentage of use cases, given how powerful a single-node machine can be today. Most of your use cases can be handled with more straightforward frameworks like DuckDB or Polars. PySpark can't simply run on any Python runtime (or it will be overkill as a standalone tool); you will still need a Spark cluster, hence the complexity.
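For reference, here is a sketch of a local PySpark session (the Parquet file and columns are made up; real workloads would point at a cluster rather than local[*]):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
# Local session for experimentation; production jobs run against a cluster
spark = SparkSession.builder.appName("example").master("local[*]").getOrCreate()
df = spark.read.parquet("events.parquet")
daily = df.groupBy("event_date").agg(F.count("*").alias("events"))
daily.show()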
🧰 Developer Tools
7. Loguru
I never liked Python's default logging module, and this library provides a simpler approach. It's a one-line setup, and it's designed to replace traditional logging and print statements. Good logging means easier debugging. Easier debugging means more robust pipelines.
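A short sketch of what that setup looks like (the log file name and rotation policy are arbitrary choices):
from loguru import logger
# Add a rotating file sink alongside the default stderr output
logger.add("pipeline.log", rotation="10 MB", level="INFO")
logger.info("Pipeline started")
logger.warning("Retrying API call ({} of {})", 2, 3)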
8. Typer
Typer is an intuitive tool for building command-line interfaces (CLIs). It is based on the principles of FastAPI (by the same author) and simplifies the creation of CLIs. CLIs for data pipelines are also crucial: running a pipeline with specific parameters (e.g., specific dates) or backfilling is all enabled by a powerful CLI. You don't want to modify the code for every custom run; use your CLI parameters instead!
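As a sketch, a pipeline entry point built with Typer might look like this (the parameter names and the run logic are hypothetical):
import typer
app = typer.Typer()
@app.command()
def run(start_date: str, end_date: str, full_refresh: bool = False):
    """Run the pipeline for a date range, optionally as a full refresh."""
    typer.echo(f"Running from {start_date} to {end_date} (full_refresh={full_refresh})")
if __name__ == "__main__":
    app()
You can then run something like python pipeline.py 2024-01-01 2024-01-31 --full-refresh without touching the code.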
9. Fire
For simpler projects that still require a CLI, Fire offers an option that is, in my opinion, less powerful than Typer but easier to bootstrap. It automatically detects function parameters and includes them in the CLI, streamlining the process of setting up and running scripts.
import fire
def hello(name="World"):
    return "Hello %s!" % name
if __name__ == '__main__':
    fire.Fire(hello)
Then you can run:
python hello.py # Hello World!
python hello.py --name=David # Hello David!
python hello.py --help # Shows usage information.
10. Ruff
Ruff is a tool that helps clean up and organize your code. It's a linter and code formatter built in Rust, which means it runs blazingly fast compared to its competitors. As it combines multiple tools in one (linter, formatter), it can replace your pylint/black toolkit. Fewer dependencies, fewer struggles.
11. Pytest
Pytest is the common standard for testing. It's very popular because it gives clear information on which parts of your tests are failing and why. Pytest is also powerful because it has many plugins you can add to help test specific parts of your code more effectively. For instance, the pytest-compare plugin helps check if the parts of your code that interact with each other are doing so correctly.
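Here is a minimal sketch of a parametrized test (the clean_amount function is a made-up example):
import pytest
def clean_amount(raw: str) -> float:
    """Strip currency symbols and thousands separators, then convert to float."""
    return float(raw.replace("$", "").replace(",", ""))
@pytest.mark.parametrize(
    "raw, expected",
    [("$1,200.50", 1200.50), ("300", 300.0)],
)
def test_clean_amount(raw, expected):
    assert clean_amount(raw) == expected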
12. python-dotenv
When working on projects, especially locally, you often need to handle sensitive information like passwords or API keys. Python-dotenv helps manage these secrets safely by letting you store them in a .env file, which you keep out of your main project files (and out of version control). This means you can use all your secrets in your code without risking exposing them. When you're ready to move your project to the cloud, the transition is easier: no sensitive information is left in the code, and you just have to provision your cloud runtime with the appropriate environment variables to make it work.
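A minimal sketch of how it works (the variable name API_KEY and the .env contents are just an example):
# .env (kept out of version control)
# API_KEY=super-secret-value
import os
from dotenv import load_dotenv
# Reads the .env file and populates os.environ
load_dotenv()
api_key = os.getenv("API_KEY")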
🚦 Data Validation
13. Pydantic
This tool is great for making sure the data you receive is exactly what you expect. Think of it as a supercharged version of Python's own dataclasses. Pydantic lets you define exactly how your data should look and behave, which is especially useful for ensuring that data from the internet or other sources meets your standards. For example, you can set up Pydantic to check that a URL or a user's password meets specific criteria, which helps prevent errors in your data processing.
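A sketch of that URL/password example, assuming Pydantic v2 (the model and rules are made up):
from pydantic import BaseModel, HttpUrl, field_validator
class Account(BaseModel):
    user_id: int
    website: HttpUrl
    password: str
    @field_validator("password")
    @classmethod
    def password_long_enough(cls, value: str) -> str:
        if len(value) < 8:
            raise ValueError("password must be at least 8 characters")
        return value
# Raises a ValidationError if the types or rules don't match
account = Account(user_id="42", website="https://example.com", password="s3cret-pass")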
14. Pandera
This tool checks that data organized in tables (dataframes) fits a specific format. This is very useful when you pass data around different parts of your program, to make sure nothing breaks. Pandera allows you to define a schema, or a blueprint, for your data tables, and it checks incoming data against these schemas. You can use Pandera with Pandas dataframes, Polars, or even Pydantic models to ensure your data is correct before you proceed with processing.
If you're unsure how Pydantic and Pandera relate, think of Pydantic as handling data validation at the Python object level (typically for dict structures), while Pandera focuses on validating data at the dataframe level.
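A minimal sketch of a Pandera schema for a Pandas dataframe (the columns and checks are arbitrary):
import pandas as pd
import pandera as pa
# Declare the expected structure of the dataframe
schema = pa.DataFrameSchema({
    "city": pa.Column(str),
    "sales": pa.Column(int, checks=pa.Check.ge(0)),
})
df = pd.DataFrame({"city": ["Paris", "Lyon"], "sales": [120, 80]})
validated = schema.validate(df)  # raises a SchemaError if the data doesn't match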
15. PyArrow
PyArrow is a bit like the hidden machinery that helps various data tools work together seamlessly by standardizing how they describe and store data in memory. This compatibility is crucial for tools like DuckDB, which can work with data from Pandas or Polars without converting between formats. While PyArrow isn't typically used by developers directly for everyday tasks, it plays a vital role in the background. I definitely use it from time to time to type data explicitly and avoid schema inference.
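For instance, here is a sketch of defining an explicit Arrow schema instead of relying on type inference (the columns and file name are made up):
import pyarrow as pa
import pyarrow.parquet as pq
# Explicit schema: no type inference surprises downstream
schema = pa.schema([("city", pa.string()), ("sales", pa.int64())])
table = pa.table({"city": ["Paris", "Lyon"], "sales": [120, 80]}, schema=schema)
pq.write_table(table, "sales.parquet")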
With these 15 libraries, we cover the essentials. However, data engineering is such a wide domain that even when scoping to these categories, you sometimes need a little extra.
What did I miss? Which library do you think should be on this list?