
Simon Späti

Learning web-scraping with real estate listings, uploading to S3, Spark, and Delta Lake, adding Jupyter notebooks, ingesting into Druid, visualizing with Superset, and managing everything with Dagster.

A collection of open-source tools used in this project

This post focuses on practical data pipelines, with examples from web-scraping real estate listings, uploading them to S3 with MinIO, processing them with Spark and Delta Lake, adding some data science magic with Jupyter Notebooks, ingesting into the Apache Druid data warehouse, visualising dashboards with Superset, and managing everything with Dagster.
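To make one of the pipeline steps more concrete, here is a minimal sketch of writing scraped listings to a Delta table on MinIO via Spark. It is not the article's actual code: the bucket name, endpoint, credentials, and schema are assumptions for illustration only.

```python
# Minimal sketch (illustrative assumptions, not the article's code):
# write scraped real estate listings to a Delta table on MinIO's S3 API.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("real-estate-pipeline")
    # Delta Lake support (requires the delta-spark package on the classpath)
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Point the S3A connector at a local MinIO endpoint (hypothetical values)
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "minio123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Pretend these rows came from the web scraper
listings = spark.createDataFrame(
    [("Zurich", 3.5, 1_250_000), ("Basel", 4.5, 980_000)],
    ["city", "rooms", "price"],
)

# Append today's scrape to a Delta table in the (assumed) bucket
listings.write.format("delta").mode("append").save("s3a://real-estate/listings")
```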


Always enjoy your articles, Javier, well written again! If you liked this article, you might like https://sspaeti.com/blog/olap-whats-coming-next/ as well. I wrote it a bit earlier, yet it still holds true in most cases today.


Nice write-up and well explained! But I don't agree with your conclusion; I'd guess you end up in chaos if you use trigger after trigger after trigger.


How to make BI better with emerging technologies and twelve data engineering approaches.

Today we have more requirements, ever-growing tools and frameworks, complex cloud architectures, and a data stack that is changing rapidly. I hear claims such as: “Business Intelligence (BI) takes too long to integrate new data”, or “understanding how the numbers match up is very hard and needs lots of analysis”. The goal of this article is to make business intelligence easier, faster, and more accessible with techniques from the sphere of data engineering.


The first example of email can be found on computers at MIT in a program called MAILBOX, all the way back in 1965. Despite its advantage of fast, direct sharing, several studies state that the average office worker receives 110 messages a day. Given that it takes about 20 minutes to regain focus after every interruption, this is a major distraction and also adds to the overall stress.

Why email is so popular

Let’s start with why…


Data Warehousing with Open-Source Druid, Apache Airflow & Superset

These days, everyone talks about open-source software. However, this is still not common in the Data Warehousing (DWH) field. Why is this?

Druid — the data store

Druid is an open-source, column-oriented, distributed data store written in Java. It’s designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data.
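As a rough illustration of what those low-latency queries look like in practice, Druid exposes a SQL endpoint over HTTP. The host, port, and the "listings" datasource below are assumptions for the sketch, not something taken from the article.

```python
# Minimal sketch: querying Druid through its SQL-over-HTTP endpoint.
# Host, port, and the "listings" datasource are illustrative assumptions.
import json
import urllib.request

query = {
    "query": """
        SELECT city, COUNT(*) AS num_listings, AVG(price) AS avg_price
        FROM listings
        GROUP BY city
        ORDER BY num_listings DESC
    """
}

request = urllib.request.Request(
    "http://localhost:8888/druid/v2/sql",          # Druid router's SQL endpoint
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    # By default Druid returns the result set as an array of JSON objects
    for row in json.loads(response.read()):
        print(row)
```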


Are you on the lookout for a replacement for Microsoft Analysis cubes? Are you looking for a big data OLAP system that scales ad libitum? Do you want your analytics updated even in real time? In this blog, I want to show you possible solutions that are ready for the future and fit into an existing data architecture.

What is OLAP?

OLAP is an acronym for Online Analytical Processing. OLAP performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modelling. An OLAP cube is a multidimensional database that is optimised for data warehousing and…
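To give a small, hedged flavour of the cube idea (not taken from the article): a cube is essentially a measure pre-aggregated across several dimensions, which you can mimic in miniature with a pivot over dimension columns.

```python
# Minimal sketch of the OLAP-cube idea: aggregating a measure (sales)
# across dimensions (region, product, year). The data is made up.
import pandas as pd

facts = pd.DataFrame({
    "region":  ["EU", "EU", "US", "US", "EU", "US"],
    "product": ["A",  "B",  "A",  "B",  "A",  "A"],
    "year":    [2019, 2019, 2019, 2020, 2020, 2020],
    "sales":   [100,  80,   120,  90,   110,  130],
})

# One "slice" of a cube: sales by region x product, summed over all years
cube_slice = facts.pivot_table(
    index="region", columns="product", values="sales", aggfunc="sum"
)
print(cube_slice)
```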


A stag beetle and its tool

The tools I use focus on simplifying your life in communication, grammar, focus at work or study, quick notes, and two tools for Google Chrome. Some of the tools you might already know, but hopefully not all their functions and features, as I try to elaborate on the ones that are not obvious at first glance. Without further ado, please enjoy the tools below.

Instant Language Translation

Around you

Google Translate — play.google.com and itunes.apple.com


Today, there are 6,500 people on LinkedIn who call themselves data engineers, according to stitchdata.com. In San Francisco alone, there are 6,600 job listings for this same title. The number of data engineers has doubled in the past year, yet engineering leaders still find themselves faced with a significant shortage of data engineering talent. So is it really the future of data warehousing? What is data engineering? These questions and many more are what I want to answer in this blog post.


There is a bit of confusion between Data Warehouse vs Data Lake, and ETL vs ELT. I hear that Data Warehouses are not used anymore, that they are being replaced by Data Lakes altogether, but is that true? And why do we need Data Warehouses anyway? I will go into that, as well as the definitions of both, plus explain the differences between them.

Data Warehouse vs Data Lake

Data Warehouse definition

A Data Warehouse, in short DWH and also known as an Enterprise Data Warehouse (EDW), is the traditional way of collecting data, as we have done for 31 years. The DWH serves the purpose of being the…

Simon Späti

Data Engineer & Dad / passionate about data / curious in life / author sspaeti.com
