A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away. – Antoine de Saint-Exupéry
Links
Context
- Modern Data Stack. Read what the modern data stack is and how came around.
- Find out about Redshift's impact on modern data stackFind out about Redshift's impact on modern data stack
We saw a tremendous amount of innovation immediately following the launch of Redshift in 2012, unlocking brand new levels of performance, efficiencies, and new behaviors. We then saw a maturation period as these nascent products were deployed by the market, improved their technology, and rounded out their feature sets. By now, these products are ready to act as a foundation on which successive innovations can be built.
Ingestion: Fivetran, Stitch
Warehousing: Snowflake, Bigquery, Redshi... - Data Stack. Learn how some of the most amazing companies in the world are organising their data stack. Learn more about the tools that they are using and why.
- The modern data stack was built around SQL.
Fullstack
- How I made an open housing dataset with BigQuery and dbt
- Data Stacks For Fun & Nonprofit — Part III
- A Simple Analytics Project Using Airflow, dbt and Superset
- dbt Core & Airflow
- Startup analytics 101: from seed to scale
- Open Source Data Stack Conference
- Gaining insights on my workout data with Apache Superset
- The four priorities of a one-person analytics team: lessons from Lola.com
- Bootstrap a Modern Data Stack in 5 minutes with Terraform
- Jan Soubusta – Data Pipeline as Code: Journey of our Blueprint
- Recipes from Airbyte:
- mdsinabox – a sports monte carlo simulator
- Modern Data Stack in a Box with DuckDB
- Presto SQL + S3 Data + Superset = Data Lake
Hosting
- Hosting web apps for free
- For long time, the best choice was Heroku, hence it's used rather extensively below. However, Salesforce who acquired the company in 2011 decided to eliminate free plans in 2022. There are quite a few modern alternatives though, here or here.
Heroku
Scheduling
Airflow
Other paid
Other open-source
ELT
Meltano
Airbyte
- Airbyte vs Meltano
- Building an Open-source Ingestion Layer with Airbyte
- Ingestion with Airbyte: A Guided Tutorial
CI
- Setup a slim CI for dbt with BigQuery and Docker
- GitHub Actions
- Team might offer good value for money
- There is even a specific action for dbt and for follow-up notification to Slack
Python
- data load tool (dlt): open-source Python library that makes data loading easy
Transformation
dbt
- dbt Cloud: free for one developer
- dbt Core:
- How to Deploy dbt to Production using GitHub Actions
- hint on deploying from dbt
- to specify steps based on trigger:
if: github.event_name == 'workflow_dispatch'
Docs
- How to deploy dbt docs to GitLab Pages
- Make it password protected
SDF
SQLMesh
- official site, github, docs, blog, paid offering
- comparison to dbt
- my thoughts after testing it a bit:
SQLMesh is heavily built around the “incremental idea” — it saves database snapshots and compares them over time, adjusting or supplementing as needed. This approach allows for the “terraform plan & apply” model they use and optimises for cost by default, recognising that storage is cheap but compute is expensive. This is quite a paradigm shift, especially compared to dbt, which primarily relies on the “full refresh” concept.
Because of this incremental approach, scheduling is tightly integrated. SQLMesh needs to “understand the state” of the data. It’s built to work well with Airflow, but also supports DLT and Kestra — all open-source tools, which is evident in the project’s open nature. There’s no paid offering either.
One key feature is its primary focus on SQL, while understanding the code itself. It knows when you’re only making formatting changes, can translate between SQL flavours, and automatically handles column lineage. Importantly, there’s no YAML or other configuration languages — just SQL and occasionally Python. They don’t even use Jinja for templating; macros are built directly into SQL, and you can write Python macros using any Python package. This is possible thanks to their sqlglot library, which directly manipulates SQL semantics rather than just doing text substitution.
The whole platform feels very geared toward the engineering experience. The Browser UI and Observer, for example, are clearly designed for developers, not end users. This focus is visible across the product: everything is written in SQL, it has the “terraform” style plan, a built-in CI/CD bot, unit tests, audits (like dbt tests), and so on.
They do, however, acknowledge some of the things dbt does well, such as SQL-based transformations, the CLI experience, managing environments (dev vs. prod), and project structures. They ensure backward compatibility with dbt through a “dbt package,” which allows the use of dbt projects and packages.
However, some aspects of SQLMesh feel more complicated. It took me longer to get the basics compared to dbt, possibly due to its more complex underlying concept. There are more commands to learn, the documentation is somewhat weaker, and overall it doesn’t feel as straightforward or streamlined. It gives the impression of senior engineers piecing together a complex solution, while dbt started as a simpler idea that gradually matured and expanded with additional tools and packages.
Storage
BigQuery
- Console
- Creating a service account
- Meltano setup
- dbt setup:
- Superset setup
- SQL functions
- Monthly free: 10 GB storage, 1 TB for analysis
PostgreSQL
- How to log into a Postgresql database
- How to get started with PostgreSQL
- PostgreSQL Database startup / shutdown /restart
- psql docs
- PostgreSQL on Heroku
- Only 10K rows, 1GB storage; otherwise $9/mo; see pricing
DuckDB
- Official site
- Modern Data Stack in a Box with DuckDB
pg_duckdb
: DuckDB-powered Postgres for high performance apps & analytics- And they have friendly SQL
Dashboarding
Superset
- Set up using Makefile
- Running on Heroku:
# Set up basics
$ pip install apache-superset
$ pip install psycopg2
$ pip freeze > requirements.txt # or just limit to installed above
$ echo "web: superset db upgrade" > Procfile
$ echo "python-3.8.2" > runtime.txt
# add configs to superset_config.py
# Generate secret key
$ python
>>> from cryptography import fernet
>>> fernet.Fernet.generate_key()
# Set up Heroku app
$ heroku login
$ heroku create mds4all-superset
$ heroku config:set SECRET_KEY=<SECRET_KEY>
$ heroku config:set PORT=<PORT>
$ heroku addons:create heroku-postgresql:hobby-dev
$ git push heroku main
$ heroku logs –-tail # check for errors
# Initialise Superset
$ heroku run bash
$ export FLASK_APP=superset
$ superset fab create-admin
$ superset init
$ exit
# Finalise app
$ echo 'web: gunicorn "superset.app:create_app()"' > Procfile
$ git push heroku main
$ heroku logs --tail # check for errors
# Open app
$ heroku open
# add pybigquery to requirements.txt for BigQuery connections
- Running Superset on GCP
- Cloud Run: example one and two
- App Engine: example here
- Cloud Run vs App Engine
- Adding New Database Drivers in Docker
Querying
Trino
Editor
VSCode
Atom
Code
Local installs
- data stack: PostgreSQL/BigQuery, Meltano, dbt, Superset on Heroku
### BigQuery
# Install gcloud from https://cloud.google.com/sdk/docs/install
$ gcloud auth login
# Add var
$ PROJECT_ID=$(gcloud config get-value project)
# Create user
$ gcloud iam service-accounts create bigquery-sa --display-name="BigQuery SA"
# Add perms
$ gcloud projects add-iam-policy-binding $PROJECT_ID --member="serviceAccount:bigquery-sa@$PROJECT_ID.iam.gserviceaccount.com" --role="roles/bigquery.user"
$ gcloud projects add-iam-policy-binding ${PROJECT_ID} --member="serviceAccount:bigquery-sa@${PROJECT_ID}.iam.gserviceaccount.com" --role="roles/bigquery.dataEditor"
# Get creds to local
$ gcloud iam service-accounts keys create bigquery-sa.json --iam-account=bigquery-sa@${PROJECT_ID}.iam.gserviceaccount.com
### PostgreSQL
# $ brew install postgresql
# $ postgres --version
$ pg_ctl -D /usr/local/var/postgres start
# - for error running it
# $ ps ax | grep postmaster ## finds the running task
# $ sudo kill INT ## kill the process
#
# log_min_messages = error in postgres.conf
# to remove warnings
$ psql -d postgres michal -W # michal, pw:"admin"
> create database ds4fnp;
> create user ds4fnp with password 'ds4fnp';
> grant all privileges on database ds4fnp to ds4fnp;
> \q
$ psql -U ds4fnp
# > \c ds4fnp
> create schema ds4fnp;
> \q
### Directory
$ mkdir ds4fnp
$ cd ds4fnp
$ python3 -m venv .venv
$ source .venv/bin/activate
### Meltano
$ pip install meltano
$ meltano init meltano
$ cd meltano
$ meltano add extractor tap-spreadsheets-anywhere
# $ meltano invoke tap-spreadsheets-anywhere --help
$ meltano add loader target-postgres
# $ meltano invoke target-postgres --help
# - error on macOS: https://stackoverflow.com/a/62931654
### ELT
# - add config info to tap and target in meltano.yml
$ meltano elt tap-spreadsheets-anywhere target-postgres
### dbt
# $ brew update
# $ brew install git
# $ brew tap fishtown-analytics/dbt
# $ brew install dbt
# - set up profiles.yml
$ cd ..
$ pip install dbt
$ dbt init dbt --adapter postgres
$ cd dbt
$ dbt debug
# - set up dbt_project.yml
# - set up schema.yml files
$ dbt run
$ dbt docs generate
$ dbt docs serve
### Superset
$ cd ..
$ pip install apache-superset
# $ pip install sqlalchemy==1.3.24
$ superset db upgrade
$ superset fab create-admin
$ superset init
$ superset run -p 8088 --with-threads --reload --debugger # why just one?
# postgresql+psycopg2://ds4fnp:ds4fnp@127.0.0.1:5432/ds4fnp
# or for docker:
# docker network ls
# docker network connect postgres_default superset_app
# postgresql://ds4fnp:ds4fnp@postgres:5432/ds4fnp
#
# $ pip install pybigquery
# bigquery://{project_id}
### Closing
# - ctrl+c several times
$ deactivate
$ pg_ctl -D /usr/local/var/postgres stop
Docker
- data stack: PostgreSQL, Superset, (dbt locally)
- How to Deploy Docker Containers to The Cloud
# Postgres
$ docker pull postgres
$ docker run --name postgres -e POSTGRES_USER=cookie -e POSTGRES_PASSWORD=cookie -p 5432:5432 -d postgres
$ brew install libpq
$ brew link --force libpq
$ docker exec -it postgres psql -U cookie -d postgres
# Superset
$ docker pull apache/superset
$ docker run -d -p 8080:8088 --name superset apache/superset
# -v $PWD/local_config.py:/app/pythonpath/superset_config.py
$ docker exec -it superset superset fab create-admin \
--username admin \
--firstname Superset \
--lastname Admin \
--email admin@superset.com \
--password admin
$ docker exec -it superset superset db upgrade
# $ docker exec -it superset superset load_examples
$ docker exec -it superset superset init
# postgresql+psycopg2://cookie:cookie@host.docker.internal:5432/postgres
Tasks
- Test if Github works if I copy-paste it
- Move Github from GDrive to michal
- Test one last time from scratch
- Commit
- Play around with:
- BigQuery
- Different hosting (GCP)
- Google Data Studio
- Keboola
- Airflow
- Dockerisation and running on servers
- Airbyte, Trino
Notes
- What might be worth waiting for
- A follow-up article on ds4np
- Meltano: new integrations in July (?)