Build a modern data stack

  • data-engineering
  • learn
  • summer

A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away. – Antoine de Saint-Exupéry

Links

Context

  • Modern Data Stack. Read what the modern data stack is and how it came about.
  • Find out about Redshift's impact on the modern data stack
    We saw a tremendous amount of innovation immediately following the launch of Redshift in 2012, unlocking brand new levels of performance, efficiencies, and new behaviors. We then saw a maturation period as these nascent products were deployed by the market, improved their technology, and rounded out their feature sets. By now, these products are ready to act as a foundation on which successive innovations can be built.


    Ingestion: Fivetran, Stitch
    Warehousing: Snowflake, Bigquery, Redshi...
  • Data Stack. Learn how some of the most amazing companies in the world are organising their data stack. Learn more about the tools that they are using and why.
  • The modern data stack was built around SQL.

Fullstack

Hosting

Heroku

Scheduling

Airflow

Other paid

Other open-source

ELT

Meltano

Airbyte

CI

Python

Transformation

dbt

SDF

SQLMesh

  • official site, github, docs, blog, paid offering
  • comparison to dbt
  • my thoughts after testing it a bit:

    SQLMesh is heavily built around the “incremental idea” — it saves database snapshots and compares them over time, adjusting or supplementing as needed. This approach allows for the “terraform plan & apply” model they use and optimises for cost by default, recognising that storage is cheap but compute is expensive. This is quite a paradigm shift, especially compared to dbt, which primarily relies on the “full refresh” concept.
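
    A minimal sketch of that plan-and-apply loop as I ran it (the DuckDB scaffold is just the quickest way to try it):

    $ pip install sqlmesh
    $ sqlmesh init duckdb   # scaffold an example project backed by DuckDB
    $ sqlmesh plan          # diff against prod, preview changes and backfills
    $ sqlmesh plan dev      # the same, but against an isolated dev environment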

    Because of this incremental approach, scheduling is tightly integrated: SQLMesh needs to “understand the state” of the data. It’s built to work well with Airflow, but also supports DLT and Kestra, all open-source tools, which reflects the project’s open nature. The core itself is fully open source; the paid offering (linked above) is a separate product on top.
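
    That state awareness shows up in the run command; a sketch (each model’s cadence comes from its own cron setting):

    $ sqlmesh run   # apply only the intervals that are currently due, per model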

    One key feature is its primary focus on SQL, paired with a real understanding of the code itself. It knows when you’re only making formatting changes, can translate between SQL flavours, and automatically handles column lineage. Importantly, there’s no YAML or other configuration language, just SQL and occasionally Python. They don’t even use Jinja for templating; macros are built directly into SQL, and you can write Python macros using any Python package. This is possible thanks to their sqlglot library, which manipulates SQL semantics directly rather than just doing text substitution.
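
    The flavour translation is easy to poke at through sqlglot directly; a quick sketch using the example from its README:

    $ pip install sqlglot
    $ python -c "import sqlglot; print(sqlglot.transpile('SELECT EPOCH_MS(1618088028295)', read='duckdb', write='hive')[0])"
    SELECT FROM_UNIXTIME(1618088028295 / 1000)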

    The whole platform feels very geared toward the engineering experience. The Browser UI and Observer, for example, are clearly designed for developers, not end users. This focus is visible across the product: everything is written in SQL, it has the “terraform” style plan, a built-in CI/CD bot, unit tests, audits (like dbt tests), and so on.
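
    Some of that surface area in CLI form (commands from my notes, so double-check them against the docs):

    $ sqlmesh test    # unit tests that compare query output to fixtures
    $ sqlmesh audit   # audits, the dbt-tests analogue
    $ sqlmesh ui      # the local browser UI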

    They do, however, acknowledge some of the things dbt does well, such as SQL-based transformations, the CLI experience, managing environments (dev vs. prod), and project structures. They ensure backward compatibility with dbt through a “dbt package,” which allows the use of dbt projects and packages.
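
    Trying it on top of an existing dbt project looked roughly like this (the template flag is from my notes, so verify it):

    $ cd some-dbt-project   # placeholder path
    $ sqlmesh init -t dbt   # bootstrap SQLMesh inside the dbt project
    $ sqlmesh plan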

    However, some aspects of SQLMesh feel more complicated. It took me longer to get the basics down than it did with dbt, possibly due to its more complex underlying concept. There are more commands to learn, the documentation is somewhat weaker, and overall it doesn’t feel as straightforward or streamlined. It gives the impression of senior engineers piecing together a complex solution, while dbt started as a simpler idea that gradually matured and expanded with additional tools and packages.

Storage

BigQuery

PostgreSQL

DuckDB

Dashboarding

Superset

# Set up basics
$ pip install apache-superset
$ pip install psycopg2
$ pip freeze > requirements.txt # or just limit to installed above
$ echo "web: superset db upgrade" > Procfile
$ echo "python-3.8.2" > runtime.txt
# add configs to superset_config.py
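
# A minimal superset_config.py sketch (my assumption of the needed settings:
# SECRET_KEY from the config var set below, plus the metadata DB pointed at
# Heroku's DATABASE_URL; some SQLAlchemy versions reject Heroku's postgres://
# scheme, hence the rewrite)
$ cat > superset_config.py <<'EOF'
import os

SECRET_KEY = os.environ["SECRET_KEY"]
SQLALCHEMY_DATABASE_URI = os.environ["DATABASE_URL"].replace(
    "postgres://", "postgresql://", 1
)
EOF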

# Generate secret key
$ python
>>> from cryptography import fernet  
>>> fernet.Fernet.generate_key()

# Set up Heroku app
$ heroku login
$ heroku create mds4all-superset
$ heroku config:set SECRET_KEY=<SECRET_KEY>
$ heroku config:set PORT=<PORT>
$ heroku addons:create heroku-postgresql:hobby-dev
$ git push heroku main
$ heroku logs --tail # check for errors

# Initialise Superset
$ heroku run bash
$ export FLASK_APP=superset
$ superset fab create-admin
$ superset init
$ exit

# Finalise app
$ echo 'web: gunicorn "superset.app:create_app()"' > Procfile
$ git push heroku main
$ heroku logs --tail # check for errors

# Open app
$ heroku open
# add pybigquery to requirements.txt for BigQuery connections

Querying

Trino

Editor

VSCode

Atom

Code

Local installs

  • data stack: PostgreSQL/BigQuery, Meltano, dbt, Superset on Heroku
### BigQuery

# Install gcloud from https://cloud.google.com/sdk/docs/install 
$ gcloud auth login
 
# Add var
$ PROJECT_ID=$(gcloud config get-value project)

# Create user
$ gcloud iam service-accounts create bigquery-sa --display-name="BigQuery SA"

# Add perms
$ gcloud projects add-iam-policy-binding $PROJECT_ID --member="serviceAccount:bigquery-sa@$PROJECT_ID.iam.gserviceaccount.com" --role="roles/bigquery.user"

$ gcloud projects add-iam-policy-binding ${PROJECT_ID} --member="serviceAccount:bigquery-sa@${PROJECT_ID}.iam.gserviceaccount.com" --role="roles/bigquery.dataEditor"

# Get creds to local
$ gcloud iam service-accounts keys create bigquery-sa.json --iam-account=bigquery-sa@${PROJECT_ID}.iam.gserviceaccount.com
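
# Point client libraries at the key via the standard ADC env var
$ export GOOGLE_APPLICATION_CREDENTIALS=$PWD/bigquery-sa.json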


### PostgreSQL
# $ brew install postgresql
# $ postgres --version
$ pg_ctl -D /usr/local/var/postgres start
# - for error running it
# $ ps ax | grep postmaster ## finds the running task
# $ sudo kill -INT <PID> ## kill the process
#
# log_min_messages = error in postgresql.conf
# to remove warnings
$ psql -d postgres -U michal -W # michal, pw:"admin"
> create database ds4fnp;
> create user ds4fnp with password 'ds4fnp';
> grant all privileges on database ds4fnp to ds4fnp;
> \q
$ psql -U ds4fnp
# > \c ds4fnp
> create schema ds4fnp;
> \q
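# Quick sanity check (my addition): list schemas to confirm ds4fnp exists
$ psql -U ds4fnp -d ds4fnp -c '\dn'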

### Directory 
$ mkdir ds4fnp
$ cd ds4fnp
$ python3 -m venv .venv
$ source .venv/bin/activate

### Meltano
$ pip install meltano
$ meltano init meltano
$ cd meltano
$ meltano add extractor tap-spreadsheets-anywhere
# $ meltano invoke tap-spreadsheets-anywhere --help
$ meltano add loader target-postgres
# $ meltano invoke target-postgres --help
# - error on macOS: https://stackoverflow.com/a/62931654

### ELT
# - add config info to tap and target in meltano.yml
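# - the same settings can also be set from the CLI instead of editing the file
#   (keys are illustrative; check each plugin's settings list)
# $ meltano config target-postgres set host localhost
# $ meltano config target-postgres set user ds4fnp
# $ meltano config target-postgres set dbname ds4fnp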
$ meltano elt tap-spreadsheets-anywhere target-postgres

### dbt
# $ brew update
# $ brew install git
# $ brew tap fishtown-analytics/dbt
# $ brew install dbt
# - set up profiles.yml
$ cd ..
$ pip install dbt  
$ dbt init dbt --adapter postgres
$ cd dbt
$ dbt debug
# - set up dbt_project.yml
# - set up schema.yml files
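# - a first model can be a single SQL file (sketch; the source table is a placeholder)
# $ mkdir -p models/staging
# $ echo "select * from ds4fnp.my_raw_table" > models/staging/stg_example.sql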
$ dbt run
$ dbt docs generate
$ dbt docs serve

### Superset
$ cd ..
$ pip install apache-superset
# $ pip install sqlalchemy==1.3.24
$ superset db upgrade
$ superset fab create-admin
$ superset init
$ superset run -p 8088 --with-threads --reload --debugger # why just one?
# postgresql+psycopg2://ds4fnp:ds4fnp@127.0.0.1:5432/ds4fnp
# $ pip install pybigquery
# bigquery://{project_id}

### Closing
# - ctrl+c several times
$ deactivate
$ pg_ctl -D /usr/local/var/postgres stop

Docker


# Postgres
$ docker pull postgres
$ docker run --name postgres -e POSTGRES_USER=cookie -e POSTGRES_PASSWORD=cookie -p 5432:5432 -d postgres

$ brew install libpq
$ brew link --force libpq
$ docker exec -it postgres psql -U cookie -d postgres

# Superset
$ docker pull apache/superset
$ docker run -d -p 8080:8088 --name superset apache/superset
# -v $PWD/local_config.py:/app/pythonpath/superset_config.py
$ docker exec -it superset superset fab create-admin \
              --username admin \
              --firstname Superset \
              --lastname Admin \
              --email admin@superset.com \
              --password admin
$ docker exec -it superset superset db upgrade
# $ docker exec -it superset superset load_examples
$ docker exec -it superset superset init

# postgresql+psycopg2://cookie:cookie@host.docker.internal:5432/postgres
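
# Quick check (my addition): confirm Postgres accepts connections before pointing Superset at it
$ docker exec -it postgres pg_isready -U cookie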

Tasks

  • Test if GitHub works if I copy-paste it
  • Move GitHub from GDrive to michal
  • Test one last time from scratch
  • Commit
  • Play around with:
    • BigQuery
    • Different hosting (GCP)
    • Google Data Studio
    • Keboola
    • Airflow
    • Dockerisation and running on servers
    • Airbyte, Trino

Notes

  • What might be worth waiting for
    • A follow-up article on ds4fnp
    • Meltano: new integrations in July (?)
Metadata