This post focuses on practical data pipelines, with examples from web-scraping real-estate listings, storing them in S3 with MinIO, processing them with Spark and Delta Lake, adding some data-science magic with Jupyter notebooks, ingesting into the Apache Druid data warehouse, visualising dashboards with Superset, and orchestrating everything with Dagster.
The goal is to touch on common data engineering challenges using promising new technologies, tools and frameworks, most of which I wrote about in Business Intelligence meets Data Engineering with Emerging Technologies. Everything runs on Kubernetes in a scalable fashion, but also locally with Kubernetes on Docker Desktop.
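To give a rough feel for how the stack above chains together, here is a minimal, purely illustrative sketch in plain Python. The function names and sample data are mine, not from the post; in the real pipeline each stage would be a Dagster op backed by MinIO, Spark/Delta Lake and Druid rather than plain functions:

```python
import json

def scrape_listings():
    # Stand-in for web-scraping real-estate listings.
    return [{"city": "Zurich", "price": 1200000},
            {"city": "Basel", "price": 850000}]

def store_raw(listings):
    # Stand-in for writing raw JSON to S3/MinIO.
    return json.dumps(listings)

def transform(raw):
    # Stand-in for a Spark/Delta Lake transformation step.
    listings = json.loads(raw)
    return {row["city"]: row["price"] for row in listings}

def run_pipeline():
    # The orchestrator (Dagster) would wire these dependencies together
    # and rerun stages when their inputs change.
    return transform(store_raw(scrape_listings()))
```

The point is the shape, not the code: each stage consumes the previous stage's output, which is exactly the dependency graph an orchestrator manages for you.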
Nice write-up and well explained! But I don't agree with your conclusion; I'd guess you end up in chaos if you use trigger after trigger after trigger.
I agree that Jupyter notebooks are easy for people who aren't engineers. But with Papermill and Dagster, you can easily integrate them into your orchestrator or pipelines: https://towardsdatascience.com/business-intelligence-meets-data-engineering-with-emerging-technologies-8810c3eed8b1
I would also have a look at alternatives to Airflow, as Airflow is execution-driven rather than data-driven: https://www.quora.com/What-are-common-alternatives-to-Apache-Airflow
Plus, triggers are also supported by orchestrators, and setting up on Kubernetes is getting easier and easier.
Today we face more requirements than ever: a growing number of tools and frameworks, complex cloud architectures, and a data stack that changes rapidly. I hear claims such as "Business Intelligence (BI) takes too long to integrate new data", or "understanding how the numbers add up is very hard and needs lots of analysis". The goal of this article is to make business intelligence easier, faster and more accessible with techniques from the sphere of data engineering.
In an earlier post, I pointed out what data engineering is and why it’s the successor of business intelligence and data warehousing. When a data engineer is needed…
The first example of email can be found on computers at MIT in a program called MAILBOX, all the way back in 1965. Despite its advantages of fast, direct sharing, several studies state that the average office worker receives 110 messages a day. Given that every interruption takes around 20 minutes to recover from, this is a major distraction and adds to overall stress.
I asked myself: isn’t there a smarter way to communicate (discuss, argue, debate) at work? That is what I go through in this article.
Let’s start with why…
These days, everyone talks about open-source software. However, this is still not common in the Data Warehousing (DWH) field. Why is this?
For this post, I chose some open-source technologies and used them together to build a full data architecture for a Data Warehouse system.
Druid is an open-source, column-oriented, distributed data store written in Java. It’s designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data.
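To make "low-latency queries" a bit more concrete, here is a minimal sketch of a Druid native timeseries query; the datasource name `listings` and the metric `count` are assumptions for illustration, not from the post:

```json
{
  "queryType": "timeseries",
  "dataSource": "listings",
  "granularity": "day",
  "intervals": ["2020-01-01/2020-02-01"],
  "aggregations": [
    { "type": "longSum", "name": "events", "fieldName": "count" }
  ]
}
```

A query like this is POSTed as JSON to the Druid broker and returns one aggregated row per day, which is the kind of pre-aggregated, column-oriented access pattern Druid is optimised for.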
Are you on the lookout for a replacement for Microsoft Analysis Services cubes? Are you looking for a big-data OLAP system that scales ad libitum? Do you want your analytics updated in real time? In this blog, I want to show you possible solutions that are ready for the future and fit into an existing data architecture.
OLAP is an acronym for Online Analytical Processing. OLAP performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modelling. An OLAP cube is a multidimensional database that is optimised for data warehouse and…
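As a rough illustration of what a cube enables, here is a minimal sketch in plain Python of two basic OLAP operations, roll-up and slice, over a tiny fact table; the table, dimensions and numbers are invented for the example:

```python
from collections import defaultdict

# A tiny fact table: (year, country, product, revenue).
facts = [
    (2020, "CH", "bikes", 100),
    (2020, "CH", "cars", 250),
    (2020, "DE", "bikes", 80),
    (2021, "CH", "bikes", 120),
]

def roll_up(facts, dim_index):
    """Aggregate revenue along one dimension (e.g. per year)."""
    totals = defaultdict(int)
    for row in facts:
        totals[row[dim_index]] += row[3]
    return dict(totals)

def slice_cube(facts, dim_index, value):
    """Fix one dimension to a single value (e.g. country == 'CH')."""
    return [row for row in facts if row[dim_index] == value]

revenue_per_year = roll_up(facts, 0)   # totals per year
swiss_rows = slice_cube(facts, 1, "CH")
```

An OLAP cube effectively pre-computes and indexes these aggregations across all dimension combinations, so analysts get the answers without scanning the raw fact table each time.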
The tools I cover focus on simplifying your life in communication, grammar, focus at work or study, note-taking, plus two tools for Google Chrome. Some of them you might already know, but hopefully not all their functions and features, as I try to elaborate on the ones that aren't obvious at first glance. Without further ado, enjoy the tools below.
While this is the obvious choice for translating text, many might not know that the app offers on-the-fly translation using your camera. See how it works in the video.
Today, there are 6,500 people on LinkedIn who call themselves data engineers, according to stitchdata.com. In San Francisco alone, there are 6,600 job listings for this same title. The number of data engineers has doubled in the past year, but engineering leaders still find themselves faced with a significant shortage of data engineering talent. So is it really the future of data warehousing? What is data engineering? These questions and more I want to answer in this blog post.
There is a bit of confusion around Data Warehouse vs Data Lake, or ETL vs ELT. I hear that Data Warehouses are no longer used, that they have been replaced by Data Lakes altogether, but is that true? And why do we need Data Warehouses anyway? I will go into that, define both, and explain the differences between them.
A Data Warehouse, in short DWH and also known as an Enterprise Data Warehouse (EDW), is the traditional way of collecting data, as we have done for 31 years. The DWH serves the purpose of being the…