Building a Data Engineering Project in 20 Minutes
Learning web scraping with real-estate listings, uploading to S3, Spark, and Delta Lake, adding Jupyter notebooks and ingesting into Druid, visualizing with Superset, and managing everything with Dagster.
--
This post focuses on practical data pipelines, with examples from web-scraping real-estate listings, uploading them to S3 with MinIO, processing them with Spark and Delta Lake, adding some data science magic with Jupyter Notebooks, ingesting into the data warehouse Apache Druid, visualizing dashboards with Superset, and managing everything with Dagster.
The goal is to touch on common data engineering challenges and to use promising new technologies, tools, and frameworks, most of which I wrote about in Business Intelligence meets Data Engineering with Emerging Technologies. Everything runs on Kubernetes in a scalable fashion, but it also runs locally with Kubernetes on Docker Desktop.
You can find the source code for the data pipeline in practical-data-engineering, and all the details for setting things up in data-engineering-devops. Although not everything is finished, you can follow the current status of the project on real-estate-project.
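To give a flavour of how the pieces fit together before we dive in, here is a minimal Dagster sketch of the pipeline stages this post walks through. The op names, payloads, and paths are hypothetical placeholders rather than the actual code from the practical-data-engineering repository, and it uses Dagster's current @op/@job API (the repository may use the older solid/pipeline decorators).

```python
# Minimal sketch of the pipeline stages covered in this post.
# All names and paths below are illustrative placeholders.
from dagster import op, job


@op
def scrape_listings() -> list:
    # Scrape real-estate listings from the source website
    return [{"id": 1, "price": 350_000, "city": "Zurich"}]


@op
def upload_raw_to_s3(listings: list) -> str:
    # Persist the raw scrape to S3 (MinIO) and return the object path
    return "s3://real-estate/raw/listings.json"


@op
def write_delta_table(raw_path: str) -> str:
    # Clean the raw data with Spark and write it as a Delta Lake table
    return "s3://real-estate/delta/listings"


@op
def ingest_into_druid(delta_path: str) -> None:
    # Trigger a Druid ingestion task so Superset can query the data
    pass


@job
def real_estate_pipeline():
    # Wire the ops into a dependency graph: scrape -> S3 -> Delta -> Druid
    ingest_into_druid(write_delta_table(upload_raw_to_s3(scrape_listings())))
```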
Table of Contents:
- What are we building, and why?
- What will you learn?
- Hands-On with Tech, Tools and Frameworks
+ Getting the Data — Scraping
+ Storing on S3-MinIO
+ Change Data Capture (CDC)
+ Adding Database features to S3 — Delta Lake & Spark
+ Machine Learning part — Jupyter Notebook
+ Ingesting Data Warehouse for low latency — Apache Druid
+ The UI with Dashboards and more — Apache Superset
+ Orchestrating everything together — Dagster
+ DevOps engine — Kubernetes
- Conclusion
What are we building, and why?
A data application that will collect real-estate listings coupled with Google Maps route calculations, but potentially other macro…