A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away. – Antoine de Saint-Exupéry
Links
Context
- Modern Data Stack. Read what the modern data stack is and how it came about.
- Find out about Redshift's impact on the modern data stack:
  "We saw a tremendous amount of innovation immediately following the launch of Redshift in 2012, unlocking brand new levels of performance, efficiencies, and new behaviors. We then saw a maturation period as these nascent products were deployed by the market, improved their technology, and rounded out their feature sets. By now, these products are ready to act as a foundation on which successive innovations can be built.
  Ingestion: Fivetran, Stitch
  Warehousing: Snowflake, BigQuery, Redshi..."
- Data Stack. Learn how some of the most amazing companies in the world are organising their data stack. Learn more about the tools that they are using and why.
- The modern data stack was built around SQL.
Fullstack
- How I made an open housing dataset with BigQuery and dbt
- Data Stacks For Fun & Nonprofit — Part III
- A Simple Analytics Project Using Airflow, dbt and Superset
- dbt Core & Airflow
- Startup analytics 101: from seed to scale
- Open Source Data Stack Conference
- Gaining insights on my workout data with Apache Superset
- The four priorities of a one-person analytics team: lessons from Lola.com
- Bootstrap a Modern Data Stack in 5 minutes with Terraform
- Jan Soubusta – Data Pipeline as Code: Journey of our Blueprint
- Recipes from Airbyte:
Hosting
- Hosting web apps for free
- For a long time, the best choice was Heroku, hence it is used rather extensively below. However, Salesforce, which acquired the company in 2011, decided to eliminate free plans in 2022. There are quite a few modern alternatives though, here or here.
Heroku
Scheduling
Airflow
Prefect
ELT
Meltano
Airbyte
- Airbyte vs Meltano
- Building an Open-source Ingestion Layer with Airbyte
- Ingestion with Airbyte: A Guided Tutorial
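- Quick local start (a sketch; at the time of writing Airbyte shipped a docker-compose setup, but check the current docs as this changes):
$ git clone https://github.com/airbytehq/airbyte.git
$ cd airbyte
$ docker-compose up # UI on http://localhost:8000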
CI
- Setup a slim CI for dbt with BigQuery and Docker
- GitHub Actions
- The Team plan might offer good value for money
- There is even a specific action for dbt and one for follow-up notifications to Slack
Transformation
dbt
- dbt Cloud: free for one developer
- dbt Core:
- How to Deploy dbt to Production using GitHub Actions
- hint on deploying from dbt
- to run a step only for a specific trigger (see the workflow sketch below):
if: github.event_name == 'workflow_dispatch'
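- A minimal sketch of such a workflow, written as a heredoc for easy pasting (the file name, branch, adapter, and Python version are assumptions, and profile/credential handling is left out; it is not the official dbt setup):
$ mkdir -p .github/workflows
$ cat > .github/workflows/dbt_run.yml <<'EOF'
name: dbt run
on:
  push:
    branches: [main]
  workflow_dispatch:
jobs:
  dbt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.8"
      - run: pip install dbt-bigquery
      - run: dbt run --profiles-dir .
      # a step gated on the trigger, as in the hint above
      - if: github.event_name == 'workflow_dispatch'
        run: dbt test --profiles-dir .
EOF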
Docs
- How to deploy dbt docs to GitLab Pages
- Make it password protected
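- A minimal .gitlab-ci.yml sketch for publishing the generated docs (untested; assumes a working dbt profile is available on the runner, since dbt docs generate connects to the warehouse; the password protection above is most likely GitLab's Pages access control setting):
$ cat > .gitlab-ci.yml <<'EOF'
pages:
  image: python:3.8
  script:
    - pip install dbt
    - dbt docs generate --profiles-dir .
    - mkdir public
    - cp target/index.html target/manifest.json target/catalog.json public/
  artifacts:
    paths:
      - public
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
EOF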
Storage
BigQuery
- Console
- Creating a service account
- Meltano setup
- dbt setup:
- Superset setup
- SQL functions
- Monthly free: 10 GB storage, 1 TB for analysis
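- A quick sanity check of a service-account connection from the shell (assumes the gcloud and bq CLIs are installed and a key file like the one created in the Local installs section below):
$ gcloud auth activate-service-account --key-file=bigquery-sa.json
$ bq query --use_legacy_sql=false 'SELECT 1 AS ok'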
PostgreSQL
- How to log into a Postgresql database
- How to get started with PostgreSQL
- PostgreSQL Database startup / shutdown /restart
- psql docs
- PostgreSQL on Heroku
- Only 10K rows, 1GB storage; otherwise $9/mo; see pricing
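- Handy Heroku CLI commands for the hosted database (assumes the Heroku CLI and an app with the Postgres add-on; <app> is a placeholder):
$ heroku pg:info -a <app> # plan, row count, storage
$ heroku pg:psql -a <app> # open a psql session
$ heroku config:get DATABASE_URL -a <app> # connection string for dbt/Superset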
DuckDB
- Modern Data Stack in a Box with DuckDB
- pg_duckdb: DuckDB-powered Postgres for high performance apps & analytics
Dashboarding
Superset
- Set up using Makefile
- Running on Heroku:
# Set up basics
$ pip install apache-superset
$ pip install psycopg2
$ pip freeze > requirements.txt # or limit it to the packages installed above
$ echo "web: superset db upgrade" > Procfile
$ echo "python-3.8.2" > runtime.txt
# add configs to superset_config.py
# Generate secret key
$ python
>>> from cryptography import fernet
>>> fernet.Fernet.generate_key()
# Set up Heroku app
$ heroku login
$ heroku create mds4all-superset
$ heroku config:set SECRET_KEY=<SECRET_KEY>
$ heroku config:set PORT=<PORT>
$ heroku addons:create heroku-postgresql:hobby-dev
$ git push heroku main
$ heroku logs --tail # check for errors
# Initialise Superset
$ heroku run bash
$ export FLASK_APP=superset
$ superset fab create-admin
$ superset init
$ exit
# Finalise app
$ echo 'web: gunicorn "superset.app:create_app()"' > Procfile
$ git push heroku main
$ heroku logs --tail # check for errors
# Open app
$ heroku open
# add pybigquery to requirements.txt for BigQuery connections
- Running Superset on GCP
- Cloud Run: example one and two
- App Engine: example here
- Cloud Run vs App Engine
- Adding New Database Drivers in Docker
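- For the Cloud Run option above, the deploy step boils down to something like this (a sketch; image, service name, and region are placeholders, assumes a Dockerfile based on apache/superset, and Superset still needs SECRET_KEY and a metadata database configured):
$ gcloud builds submit --tag gcr.io/$PROJECT_ID/superset
$ gcloud run deploy superset --image gcr.io/$PROJECT_ID/superset --region europe-west1 --port 8088 --allow-unauthenticated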
Querying
Trino
Editor
VSCode
Atom
Code
Local installs
- data stack: PostgreSQL/BigQuery, Meltano, dbt, Superset on Heroku
### BigQuery
# Install gcloud from https://cloud.google.com/sdk/docs/install
$ gcloud auth login
# Add var
$ PROJECT_ID=$(gcloud config get-value project)
# Create user
$ gcloud iam service-accounts create bigquery-sa --display-name="BigQuery SA"
# Add perms
$ gcloud projects add-iam-policy-binding $PROJECT_ID --member="serviceAccount:bigquery-sa@$PROJECT_ID.iam.gserviceaccount.com" --role="roles/bigquery.user"
$ gcloud projects add-iam-policy-binding ${PROJECT_ID} --member="serviceAccount:bigquery-sa@${PROJECT_ID}.iam.gserviceaccount.com" --role="roles/bigquery.dataEditor"
# Get creds to local
$ gcloud iam service-accounts keys create bigquery-sa.json --iam-account=bigquery-sa@${PROJECT_ID}.iam.gserviceaccount.com
### PostgreSQL
# $ brew install postgresql
# $ postgres --version
$ pg_ctl -D /usr/local/var/postgres start
# - for error running it
# $ ps ax | grep postmaster ## finds the running task
# $ sudo kill -INT <PID> ## kill the process (PID from the command above)
#
# set log_min_messages = error in postgresql.conf to remove warnings
$ psql -d postgres -U michal -W # user michal, pw: "admin"
> create database ds4fnp;
> create user ds4fnp with password 'ds4fnp';
> grant all privileges on database ds4fnp to ds4fnp;
> \q
$ psql -U ds4fnp
# > \c ds4fnp
> create schema ds4fnp;
> \q
### Directory
$ mkdir ds4fnp
$ cd ds4fnp
$ python3 -m venv .venv
$ source .venv/bin/activate
### Meltano
$ pip install meltano
$ meltano init meltano
$ cd meltano
$ meltano add extractor tap-spreadsheets-anywhere
# $ meltano invoke tap-spreadsheets-anywhere --help
$ meltano add loader target-postgres
# $ meltano invoke target-postgres --help
# - error on macOS: https://stackoverflow.com/a/62931654
### ELT
# - add config info to tap and target in meltano.yml
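# - a hypothetical sketch of that config (keys per the tap-spreadsheets-anywhere and
#   target-postgres docs; paths, names, and credentials are placeholders - check the
#   docs of the plugin variants actually installed):
#   plugins:
#     extractors:
#       - name: tap-spreadsheets-anywhere
#         config:
#           tables:
#             - path: https://example.com/data
#               name: raw_data
#               pattern: ".*\\.csv"
#               start_date: "2020-01-01T00:00:00Z"
#               key_properties: []
#               format: csv
#     loaders:
#       - name: target-postgres
#         config:
#           host: localhost
#           port: 5432
#           user: ds4fnp
#           password: ds4fnp
#           dbname: ds4fnp
#           default_target_schema: ds4fnp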
$ meltano elt tap-spreadsheets-anywhere target-postgres
### dbt
# $ brew update
# $ brew install git
# $ brew tap fishtown-analytics/dbt
# $ brew install dbt
# - set up profiles.yml
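# - a hypothetical ~/.dbt/profiles.yml sketch for the local Postgres created above
#   (the profile name has to match the one referenced in dbt_project.yml):
#   ds4fnp:
#     target: dev
#     outputs:
#       dev:
#         type: postgres
#         host: localhost
#         port: 5432
#         user: ds4fnp
#         password: ds4fnp
#         dbname: ds4fnp
#         schema: ds4fnp
#         threads: 1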
$ cd ..
$ pip install dbt
$ dbt init dbt --adapter postgres
$ cd dbt
$ dbt debug
# - set up dbt_project.yml
# - set up schema.yml files
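# - a hypothetical schema.yml source sketch (the table name is a placeholder; it should
#   match whatever Meltano loaded into the ds4fnp schema):
#   version: 2
#   sources:
#     - name: ds4fnp
#       schema: ds4fnp
#       tables:
#         - name: raw_data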
$ dbt run
$ dbt docs generate
$ dbt docs serve
### Superset
$ cd ..
$ pip install apache-superset
# $ pip install sqlalchemy==1.3.24
$ superset db upgrade
$ superset fab create-admin
$ superset init
$ superset run -p 8088 --with-threads --reload --debugger # why just one?
# postgresql+psycopg2://ds4fnp:ds4fnp@127.0.0.1:5432/ds4fnp
# $ pip install pybigquery
# bigquery://{project_id}
### Closing
# - ctrl+c several times
$ deactivate
$ pg_ctl -D /usr/local/var/postgres stop
Docker
- data stack: PostgreSQL, Superset, (dbt locally)
- How to Deploy Docker Containers to The Cloud
# Postgres
$ docker pull postgres
$ docker run --name postgres -e POSTGRES_USER=cookie -e POSTGRES_PASSWORD=cookie -p 5432:5432 -d postgres
$ brew install libpq
$ brew link --force libpq
$ docker exec -it postgres psql -U cookie -d postgres
# Superset
$ docker pull apache/superset
$ docker run -d -p 8080:8088 --name superset apache/superset
# -v $PWD/local_config.py:/app/pythonpath/superset_config.py
$ docker exec -it superset superset fab create-admin \
--username admin \
--firstname Superset \
--lastname Admin \
--email admin@superset.com \
--password admin
$ docker exec -it superset superset db upgrade
# $ docker exec -it superset superset load_examples
$ docker exec -it superset superset init
# postgresql+psycopg2://cookie:cookie@host.docker.internal:5432/postgres
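# dbt (run locally against the containerised Postgres above - a sketch; assumes an existing
# dbt project and a profiles.yml pointing at localhost:5432 with the cookie/cookie credentials)
$ dbt debug
$ dbt run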
Tasks
- Test if Github works if I copy-paste it
- Move Github from GDrive to michal
- Test one last time from scratch
- Commit
- Play around with:
- BigQuery
- Different hosting (GCP)
- Google Data Studio
- Keboola
- Airflow
- Dockerisation and running on servers
- Airbyte, Trino
Notes
- What might be worth waiting for
- A follow-up article on ds4fnp
- Meltano: new integrations in July (?)