Hi Iris, first of all, thanks for the very kind words and the terrific questions.

2 min read · Jun 26, 2020

Secondly, all of my views are my own, so they might differ from yours, and some of them look into the future, where it is difficult to tell what will happen next.

However, to your questions:

1. Notebooks are for opening up data silos in large organisations, so that non-developers can access data. If you're a data engineer, I would, as mentioned below, use an IDE and something like Dagster together with Papermill, and add the notebook to your central data pipeline. But in git you do not have access to your data and visualisations, meaning you can only share code, not your findings. That's what notebooks are great at. I have nothing against git: I push my Jupyter files (.ipynb) to git as well (you could automate that from your Jupyter server).
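To make the Papermill part a bit more concrete, here is a minimal sketch of how a pipeline step could run a parameterised notebook through the `papermill` command-line tool. The notebook paths and the `run_date` parameter are made up for illustration, and the helper names are my own, not part of Papermill or Dagster.

```python
import subprocess
from typing import Dict, List


def papermill_command(input_nb: str, output_nb: str, parameters: Dict[str, object]) -> List[str]:
    """Build the argument list for: papermill INPUT OUTPUT -p name value ..."""
    cmd = ["papermill", input_nb, output_nb]
    for name, value in parameters.items():
        # Each notebook parameter is passed as its own "-p name value" triple.
        cmd += ["-p", name, str(value)]
    return cmd


def run_notebook(input_nb: str, output_nb: str, parameters: Dict[str, object]) -> None:
    """Execute the notebook; assumes papermill is installed and on the PATH."""
    subprocess.run(papermill_command(input_nb, output_nb, parameters), check=True)


# Hypothetical paths and parameter, for illustration only.
example = papermill_command(
    "report.ipynb", "out/report_2020-06-26.ipynb", {"run_date": "2020-06-26"}
)
```

The executed output notebook keeps the results and visualisations for that run, which is the part you can share with non-developers, while the input notebook stays in git like any other code.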

2. I agree with you that Scala is a better fit for functional programming. On the other hand, Python with, e.g., Spark has removed a lot of the filler code, which makes it cleaner and also functional.

I guess one downside for me, and maybe for others, is that Scala comes from Java, which already scares away many developers. Also, the learning curve is much steeper than with Python. And the ecosystem around data engineering, data science, web development, and other areas is just enormous with Python. That's why, for me, Python is the future, especially in the sphere of data engineering. But again, all this is just a best guess from my subjective point of view, and you might have a different view with Scala. The future will tell :-)
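As a small illustration of what I mean by functional style in Python, here is a sketch in plain Python, not Spark. The `Order` type and the sample data are invented for the example; the filter, map, reduce shape is the same one you would write against a Spark RDD.

```python
from functools import reduce
from typing import Dict, Iterable, NamedTuple, Tuple


class Order(NamedTuple):
    country: str
    amount: float


def total_by_country(orders: Iterable[Order], min_amount: float = 0.0) -> Dict[str, float]:
    """Sum order amounts per country without mutating any input."""
    # Keep only orders above the threshold.
    valid = filter(lambda o: o.amount > min_amount, orders)
    # Project each order to a (key, value) pair.
    pairs = map(lambda o: (o.country, o.amount), valid)

    def add(acc: Dict[str, float], pair: Tuple[str, float]) -> Dict[str, float]:
        country, amount = pair
        # Return a new dict instead of updating the accumulator in place.
        return {**acc, country: acc.get(country, 0.0) + amount}

    return reduce(add, pairs, {})


# Invented sample data, for illustration only.
orders = [Order("CH", 10.0), Order("DE", 5.0), Order("CH", 2.5), Order("DE", -1.0)]
totals = total_by_country(orders)
```

There is no class hierarchy or setup code around it, only the transformation itself, which is the kind of filler-free code I had in mind.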

3. For data quality, testing, and auditing, I would suggest you check out Great Expectations (https://github.com/great-expectations/great_expectations). It is built specifically for data pipelines, where testing is way harder than in a traditional software engineering project. Data governance is harder, and that's where I would want the best possible data catalog or metadata store. With that in place, you can apply much better governance. The previously mentioned tool Amundsen also includes a rating, which gives you a sound feeling, or score, for which data set to use.
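To give a feeling for the idea of testing data rather than code, here is a hand-rolled sketch in plain Python. It is not the Great Expectations API; the function names, the columns (`user_id`, `age`), and the rows are all made up. It only shows the principle of declaring what the data should look like and getting a report back per check.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def check_not_null(rows: List[Row], column: str) -> Dict[str, Any]:
    """Report the indexes of rows where the column is missing or None."""
    bad = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"check": f"{column} is not null", "success": not bad, "failing_rows": bad}


def check_between(rows: List[Row], column: str, low: float, high: float) -> Dict[str, Any]:
    """Report the indexes of rows whose non-null value lies outside [low, high]."""
    bad = [
        i
        for i, row in enumerate(rows)
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]
    return {"check": f"{column} between {low} and {high}", "success": not bad, "failing_rows": bad}


def validate(rows: List[Row]) -> List[Dict[str, Any]]:
    """Run all checks for this (hypothetical) data set and collect the results."""
    return [check_not_null(rows, "user_id"), check_between(rows, "age", 0, 120)]


# Invented rows, for illustration only.
rows = [
    {"user_id": 1, "age": 34},
    {"user_id": None, "age": 29},
    {"user_id": 3, "age": 150},
]
results = validate(rows)
```

A pipeline step could fail, or just log a warning, whenever one of the results has `success` set to `False`. A dedicated tool adds the parts that are tedious to build yourself, such as documentation and profiling of the data.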

Written by Simon Späti
