Building a Data Engineering Project in 20 Minutes
Learning web scraping with real-estate listings, uploading to S3, Spark, and Delta Lake, adding Jupyter notebooks and ingesting into Druid, visualizing with Superset, and managing everything with Dagster.
--
This post focuses on practical data pipelines, with examples from web-scraping real-estate listings, uploading them to S3 with MinIO, processing them with Spark and Delta Lake, adding some data science magic with Jupyter Notebooks, ingesting into the data warehouse Apache Druid, visualizing dashboards with Superset, and managing everything with Dagster.
The goal is to touch on common data engineering challenges and to use promising new technologies, tools, and frameworks, most of which I wrote about in Business Intelligence meets Data Engineering with Emerging Technologies. Everything runs on Kubernetes in a scalable fashion, but it also runs locally with Kubernetes on Docker Desktop.
You can find the source code for the data pipeline in practical-data-engineering, and all the details for setting things up in data-engineering-devops. Although not everything is finished, you can follow the current status of the project on real-estate-project.
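To give a flavour of how the pieces fit together before we dive in, here is a minimal Dagster sketch of the pipeline stages this post walks through. The op names, payloads, and paths are hypothetical placeholders rather than the actual code from the practical-data-engineering repository, and it uses Dagster's current @op/@job API (the repository may use the older solid/pipeline decorators).

```python
# Minimal sketch of the pipeline stages covered in this post.
# All names and paths below are illustrative placeholders.
from dagster import op, job


@op
def scrape_listings() -> list:
    # Scrape real-estate listings from the source website
    return [{"id": 1, "price": 350_000, "city": "Zurich"}]


@op
def upload_raw_to_s3(listings: list) -> str:
    # Persist the raw scrape to S3 (MinIO) and return the object path
    return "s3://real-estate/raw/listings.json"


@op
def write_delta_table(raw_path: str) -> str:
    # Clean the raw data with Spark and write it as a Delta Lake table
    return "s3://real-estate/delta/listings"


@op
def ingest_into_druid(delta_path: str) -> None:
    # Trigger a Druid ingestion task so Superset can query the data
    pass


@job
def real_estate_pipeline():
    # Wire the ops into a dependency graph: scrape -> S3 -> Delta -> Druid
    ingest_into_druid(write_delta_table(upload_raw_to_s3(scrape_listings())))
```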
Table of Contents:
- What are we building, and why?
- What will you learn?
- Hands-On with Tech, Tools and Frameworks
+ Getting the Data — Scraping
+ Storing on S3-MinIO
+ Change Data Capture (CDC)
+ Adding Database features to S3 — Delta Lake & Spark
+ Machine Learning part — Jupyter Notebook
+ Ingesting Data Warehouse for low latency — Apache Druid
+ The UI with Dashboards and more — Apache Superset
+ Orchestrating everything together — Dagster
+ DevOps engine — Kubernetes
- Conclusion
What are we building, and why?
A data application that will collect real-estate listings coupled with Google Maps route calculations, but potentially other macro…