Simon Späti

Today we face more requirements, with ever-growing tools and frameworks, complex cloud architectures, and a data stack that is changing rapidly. I hear claims such as: “Business Intelligence (BI) takes too long to integrate new data”, or “understanding how the numbers match up is very hard and needs lots of analysis”. …

A collection of open-source tools used in this project

This post focuses on practical data pipelines, with examples that cover web-scraping real estate listings, uploading them to S3 with MinIO, Spark and Delta Lake, adding some Data Science magic with Jupyter Notebooks, ingesting into the data warehouse Apache Druid, visualising dashboards with Superset and managing everything with Dagster.
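To make the shape of such a pipeline concrete, here is a minimal sketch in plain Python. It is not the project's code: every function and field name is hypothetical, the scraper returns fixed sample rows, and a dict and a list stand in for the MinIO bucket and the Druid datasource. It only shows how the stages hand data to each other.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Listing:
    # Hypothetical record shape for one scraped property listing.
    listing_id: str
    city: str
    price: float
    living_area_m2: float


def scrape_listings() -> List[Listing]:
    # Stand-in for the web-scraping step; returns fixed sample rows.
    return [
        Listing("a1", "Bern", 850_000.0, 100.0),
        Listing("a2", "Bern", 600_000.0, 80.0),
        Listing("b1", "Zurich", 1_200_000.0, 100.0),
    ]


def store_raw(listings: List[Listing], object_store: Dict[str, List[Listing]]) -> str:
    # Stand-in for uploading to an S3-compatible bucket such as MinIO.
    key = "raw/listings"
    object_store[key] = list(listings)
    return key


def add_price_per_m2(object_store: Dict[str, List[Listing]], key: str) -> List[dict]:
    # Stand-in for the transformation step (Spark, Delta Lake, notebooks).
    return [
        {
            "listing_id": item.listing_id,
            "city": item.city,
            "price_per_m2": item.price / item.living_area_m2,
        }
        for item in object_store[key]
    ]


def ingest(rows: List[dict], warehouse: List[dict]) -> int:
    # Stand-in for ingestion into the warehouse (Apache Druid in the post).
    warehouse.extend(rows)
    return len(rows)


def run_pipeline(object_store: Dict[str, List[Listing]], warehouse: List[dict]) -> int:
    # Wiring the steps together is the orchestrator's job (Dagster in the post).
    key = store_raw(scrape_listings(), object_store)
    rows = add_price_per_m2(object_store, key)
    return ingest(rows, warehouse)
```

Calling `run_pipeline({}, [])` runs all four stages in order and returns the number of rows ingested. In a real setup each stand-in would be replaced by the corresponding tool, while the hand-offs between stages stay the same.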

The goal is to touch…

Excellent write-up Doug, appreciate it! I also see it more and more in my promoted ads. I think they're now trying to get users. I haven't tried either; however, I don't like the proprietary F3 format, or do you know if you could also use Parquet, Delta or other formats?

I think you cannot compare it with Dremio, as Dremio is just a Data Virtualisation tool with a lot of in-memory Apache Arrow, so nothing is persisted. I categorised these tools into OLAP Technologies (Druid, ClickHouse, ...), Cloud Data Warehouses (Firebolt, Snowflake, BigQuery, ...), Data Virtualisation (Dremio, Informatica Data Virtualisation, ...) and Serviced Cloud & Analytics (Looker, Sisense, ...). In case of interest, I wrote about the different technologies here:

Always enjoy your articles, Javier, well written again! If you liked this article, you might like as well. I wrote it a bit earlier, yet it still holds in most of the cases today.

Nice write-up and well explained! But I don't agree with your conclusion; I would guess you end up in chaos if you use trigger, trigger, trigger.

I agree that Jupyter notebooks are easy for people who are not engineers. But with Papermill and Dagster, you can easily integrate them into your orchestrator or pipelines:

I would also have a look at alternatives to Airflow, as Airflow is not data-driven but execution-driven:

Plus, triggers are also supported by orchestrators, and the set-up with Kubernetes is getting easier and easier.
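To illustrate what I mean by execution-driven versus data-driven, here is a toy sketch in plain Python. It is not how Airflow or Dagster are implemented, and all names in it are made up. In the first half the runner only knows the task order and data moves through shared state; in the second half each step declares which outputs it consumes, and the runner derives the order and passes the values along.

```python
from typing import Callable, Dict, List, Optional, Tuple

# --- Execution-driven: only the order is declared, data moves via side effects.
shared_state: Dict[str, List[int]] = {}


def extract_task() -> None:
    shared_state["numbers"] = [1, 2, 3]


def double_task() -> None:
    # Silently depends on extract_task having run first.
    shared_state["doubled"] = [n * 2 for n in shared_state["numbers"]]


def run_in_order(tasks: List[Callable[[], None]]) -> None:
    for task in tasks:
        task()


# --- Data-driven: steps declare their inputs; the runner resolves the order.
def extract() -> List[int]:
    return [1, 2, 3]


def double(numbers: List[int]) -> List[int]:
    return [n * 2 for n in numbers]


def total(doubled: List[int]) -> int:
    return sum(doubled)


# Step name -> (function, names of the upstream steps whose outputs it consumes).
GRAPH: Dict[str, Tuple[Callable, List[str]]] = {
    "total": (total, ["double"]),
    "double": (double, ["extract"]),
    "extract": (extract, []),
}


def run_graph(graph: Dict[str, Tuple[Callable, List[str]]],
              target: str,
              cache: Optional[dict] = None):
    # Compute upstream outputs first, then feed them into the target step.
    cache = {} if cache is None else cache
    if target not in cache:
        func, upstream = graph[target]
        inputs = [run_graph(graph, name, cache) for name in upstream]
        cache[target] = func(*inputs)
    return cache[target]
```

With `run_in_order`, swapping the two tasks breaks at runtime because the dependency is invisible to the runner. With `run_graph`, the order in which steps are listed in `GRAPH` does not matter, since the dependencies are part of the declaration.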

Data Warehousing with Open-Source Druid, Apache Airflow & Superset

These days, everyone talks about open-source software. However, this is still not common in the Data Warehousing (DWH) field. Why is this?

For this post, I chose some open-source technologies and used them together to build a full data architecture for a Data Warehouse system.

I went with Apache Druid

Simon Späti

Data Engineer & Technical Author with 15+ years of experience. I enjoy maintaining awareness of new innovative and emerging open-source technologies.
