Build a modern data stack

  • data-engineering
  • learn
  • summer

A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away. – Antoine de Saint-Exupéry

Links

Context

  • Modern Data Stack. Read what the modern data stack is and how it came about.
  • Find out about Redshift's impact on the modern data stack
    We saw a tremendous amount of innovation immediately following the launch of Redshift in 2012, unlocking brand new levels of performance, efficiencies, and new behaviors. We then saw a maturation period as these nascent products were deployed by the market, improved their technology, and rounded out their feature sets. By now, these products are ready to act as a foundation on which successive innovations can be built.


    Ingestion: Fivetran, Stitch
    Warehousing: Snowflake, Bigquery, Redshi...
  • Data Stack. Learn how some of the most amazing companies in the world are organising their data stack. Learn more about the tools that they are using and why.
  • The modern data stack was built around SQL.

Fullstack

Hosting

Heroku

Scheduling

Airflow

Other paid

Other open-source

ELT

Meltano

Airbyte

CI

Python

Transformation

dbt

SDF

SQLMesh

  • official site, github, docs, blog, paid offering
  • comparison to dbt
  • my thoughts after testing it a bit:

    SQLMesh is heavily built around the “incremental idea” — it saves database snapshots and compares them over time, adjusting or supplementing as needed. This approach allows for the “terraform plan & apply” model they use and optimises for cost by default, recognising that storage is cheap but compute is expensive. This is quite a paradigm shift, especially compared to dbt, which primarily relies on the “full refresh” concept.
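
    A minimal sketch of that plan-and-apply loop as I ran it (the DuckDB scaffold is just the quickest way to try it):

    $ pip install sqlmesh
    $ sqlmesh init duckdb   # scaffold an example project backed by DuckDB
    $ sqlmesh plan          # diff against prod, preview changes and backfills
    $ sqlmesh plan dev      # the same, but against an isolated dev environment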

    Because of this incremental approach, scheduling is tightly integrated: SQLMesh needs to “understand the state” of the data. It’s built to work well with Airflow, but also supports DLT and Kestra, all open-source tools, which reflects the project’s open nature. The core itself is fully open source; the paid offering (linked above) is a separate product on top.
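
    That state awareness shows up in the run command; a sketch (each model’s cadence comes from its own cron setting):

    $ sqlmesh run   # apply only the intervals that are currently due, per model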

    One key feature is its primary focus on SQL, paired with a real understanding of the code itself. It knows when you’re only making formatting changes, can translate between SQL flavours, and automatically handles column lineage. Importantly, there’s no YAML or other configuration language, just SQL and occasionally Python. They don’t even use Jinja for templating; macros are built directly into SQL, and you can write Python macros using any Python package. This is possible thanks to their sqlglot library, which manipulates SQL semantics directly rather than just doing text substitution.
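
    The flavour translation is easy to poke at through sqlglot directly; a quick sketch using the example from its README:

    $ pip install sqlglot
    $ python -c "import sqlglot; print(sqlglot.transpile('SELECT EPOCH_MS(1618088028295)', read='duckdb', write='hive')[0])"
    SELECT FROM_UNIXTIME(1618088028295 / 1000)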

    The whole platform feels very geared toward the engineering experience. The Browser UI and Observer, for example, are clearly designed for developers, not end users. This focus is visible across the product: everything is written in SQL, it has the “terraform” style plan, a built-in CI/CD bot, unit tests, audits (like dbt tests), and so on.
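
    Some of that surface area in CLI form (commands from my notes, so double-check them against the docs):

    $ sqlmesh test    # unit tests that compare query output to fixtures
    $ sqlmesh audit   # audits, the dbt-tests analogue
    $ sqlmesh ui      # the local browser UI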

    They do, however, acknowledge some of the things dbt does well, such as SQL-based transformations, the CLI experience, managing environments (dev vs. prod), and project structures. They ensure backward compatibility with dbt through a “dbt package,” which allows the use of dbt projects and packages.
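
    Trying it on top of an existing dbt project looked roughly like this (the template flag is from my notes, so verify it):

    $ cd some-dbt-project   # placeholder path
    $ sqlmesh init -t dbt   # bootstrap SQLMesh inside the dbt project
    $ sqlmesh plan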

    However, some aspects of SQLMesh feel more complicated. It took me longer to get the basics down than it did with dbt, possibly due to its more complex underlying concept. There are more commands to learn, the documentation is somewhat weaker, and overall it doesn’t feel as straightforward or streamlined. It gives the impression of senior engineers piecing together a complex solution, while dbt started as a simpler idea that gradually matured and expanded with additional tools and packages.

Storage

BigQuery

PostgreSQL

DuckDB

Dashboarding

Superset

# Set up basics
$ pip install apache-superset
$ pip install psycopg2
$ pip freeze > requirements.txt # or just limit to installed above
$ echo "web: superset db upgrade" > Procfile
$ echo "python-3.8.2" > runtime.txt
# add configs to superset_config.py
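
# A minimal superset_config.py sketch (my assumption of the needed settings:
# SECRET_KEY from the config var set below, plus the metadata DB pointed at
# Heroku's DATABASE_URL; some SQLAlchemy versions reject Heroku's postgres://
# scheme, hence the rewrite)
$ cat > superset_config.py <<'EOF'
import os

SECRET_KEY = os.environ["SECRET_KEY"]
SQLALCHEMY_DATABASE_URI = os.environ["DATABASE_URL"].replace(
    "postgres://", "postgresql://", 1
)
EOF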

# Generate secret key
$ python
>>> from cryptography import fernet  
>>> fernet.Fernet.generate_key()

# Set up Heroku app
$ heroku login
$ heroku create mds4all-superset
$ heroku config:set SECRET_KEY=<SECRET_KEY>
$ heroku config:set PORT=<PORT>
$ heroku addons:create heroku-postgresql:hobby-dev
$ git push heroku main
$ heroku logs --tail # check for errors

# Initialise Superset
$ heroku run bash
$ export FLASK_APP=superset
$ superset fab create-admin
$ superset init
$ exit

# Finalise app
$ echo 'web: gunicorn "superset.app:create_app()"' > Procfile
$ git push heroku main
$ heroku logs --tail # check for errors

# Open app
$ heroku open
# add pybigquery to requirements.txt for BigQuery connections

Querying

Trino

Editor

VSCode

Atom

Code

Local installs

  • data stack: PostgreSQL/BigQuery, Meltano, dbt, Superset on Heroku
### BigQuery

# Install gcloud from https://cloud.google.com/sdk/docs/install 
$ gcloud auth login
 
# Add var
$ PROJECT_ID=$(gcloud config get-value project)

# Create user
$ gcloud iam service-accounts create bigquery-sa --display-name="BigQuery SA"

# Add perms
$ gcloud projects add-iam-policy-binding $PROJECT_ID --member="serviceAccount:bigquery-sa@$PROJECT_ID.iam.gserviceaccount.com" --role="roles/bigquery.user"

$ gcloud projects add-iam-policy-binding ${PROJECT_ID} --member="serviceAccount:bigquery-sa@${PROJECT_ID}.iam.gserviceaccount.com" --role="roles/bigquery.dataEditor"

# Get creds to local
$ gcloud iam service-accounts keys create bigquery-sa.json --iam-account=bigquery-sa@${PROJECT_ID}.iam.gserviceaccount.com
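
# Point client libraries at the key via the standard ADC env var
$ export GOOGLE_APPLICATION_CREDENTIALS=$PWD/bigquery-sa.json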


### PostgreSQL
# $ brew install postgresql
# $ postgres --version
$ pg_ctl -D /usr/local/var/postgres start
# - for error running it
# $ ps ax | grep postmaster ## finds the running task
# $ sudo kill -INT <PID> ## kill the process
#
# log_min_messages = error in postgresql.conf
# to remove warnings
$ psql -d postgres -U michal -W # michal, pw:"admin"
> create database ds4fnp;
> create user ds4fnp with password 'ds4fnp';
> grant all privileges on database ds4fnp to ds4fnp;
> \q
$ psql -U ds4fnp
# > \c ds4fnp
> create schema ds4fnp;
> \q
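# Quick sanity check (my addition): list schemas to confirm ds4fnp exists
$ psql -U ds4fnp -d ds4fnp -c '\dn'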

### Directory 
$ mkdir ds4fnp
$ cd ds4fnp
$ python3 -m venv .venv
$ source .venv/bin/activate

### Meltano
$ pip install meltano
$ meltano init meltano
$ cd meltano
$ meltano add extractor tap-spreadsheets-anywhere
# $ meltano invoke tap-spreadsheets-anywhere --help
$ meltano add loader target-postgres
# $ meltano invoke target-postgres --help
# - error on macOS: https://stackoverflow.com/a/62931654

### ELT
# - add config info to tap and target in meltano.yml
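# - the same settings can also be set from the CLI instead of editing the file
#   (keys are illustrative; check each plugin's settings list)
# $ meltano config target-postgres set host localhost
# $ meltano config target-postgres set user ds4fnp
# $ meltano config target-postgres set dbname ds4fnp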
$ meltano elt tap-spreadsheets-anywhere target-postgres

### dbt
# $ brew update
# $ brew install git
# $ brew tap fishtown-analytics/dbt
# $ brew install dbt
# - set up profiles.yml
$ cd ..
$ pip install dbt  
$ dbt init dbt --adapter postgres
$ cd dbt
$ dbt debug
# - set up dbt_project.yml
# - set up schema.yml files
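# - a first model can be a single SQL file (sketch; the source table is a placeholder)
# $ mkdir -p models/staging
# $ echo "select * from ds4fnp.my_raw_table" > models/staging/stg_example.sql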
$ dbt run
$ dbt docs generate
$ dbt docs serve

### Superset
$ cd ..
$ pip install apache-superset
# $ pip install sqlalchemy==1.3.24
$ superset db upgrade
$ superset fab create-admin
$ superset init
$ superset run -p 8088 --with-threads --reload --debugger # why just one?
# postgresql+psycopg2://ds4fnp:ds4fnp@127.0.0.1:5432/ds4fnp
# $ pip install pybigquery
# bigquery://{project_id}

### Closing
# - ctrl+c several times
$ deactivate
$ pg_ctl -D /usr/local/var/postgres stop

Docker


# Postgres
$ docker pull postgres
$ docker run --name postgres -e POSTGRES_USER=cookie -e POSTGRES_PASSWORD=cookie -p 5432:5432 -d postgres

$ brew install libpq
$ brew link --force libpq
$ docker exec -it postgres psql -U cookie -d postgres

# Superset
$ docker pull apache/superset
$ docker run -d -p 8080:8088 --name superset apache/superset
# -v $PWD/local_config.py:/app/pythonpath/superset_config.py
$ docker exec -it superset superset fab create-admin \
              --username admin \
              --firstname Superset \
              --lastname Admin \
              --email admin@superset.com \
              --password admin
$ docker exec -it superset superset db upgrade
# $ docker exec -it superset superset load_examples
$ docker exec -it superset superset init

# postgresql+psycopg2://cookie:cookie@host.docker.internal:5432/postgres
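
# Quick check (my addition): confirm Postgres accepts connections before pointing Superset at it
$ docker exec -it postgres pg_isready -U cookie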

Tasks

  • Test if GitHub works if I copy-paste it
  • Move GitHub from GDrive to michal
  • Test one last time from scratch
  • Commit
  • Play around with:
    • BigQuery
    • Different hosting (GCP)
    • Google Data Studio
    • Keboola
    • Airflow
    • Dockerisation and running on servers
    • Airbyte, Trino

Notes

  • What might be worth waiting for
    • A follow-up article on ds4fnp
    • Meltano: new integrations in July (?)
Metadata