A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away. – Antoine de Saint-Exupéry
Links
Context
- Modern Data Stack. Read what the modern data stack is and how it came about.
- Find out about Redshift's impact on the modern data stack:
  "We saw a tremendous amount of innovation immediately following the launch of Redshift in 2012, unlocking brand new levels of performance, efficiencies, and new behaviors. We then saw a maturation period as these nascent products were deployed by the market, improved their technology, and rounded out their feature sets. By now, these products are ready to act as a foundation on which successive innovations can be built.
  Ingestion: Fivetran, Stitch
  Warehousing: Snowflake, BigQuery, Redshi..."
- Data Stack. Learn how some of the most amazing companies in the world are organising their data stack. Learn more about the tools that they are using and why.
- The modern data stack was built around SQL.
Fullstack
- How I made an open housing dataset with BigQuery and dbt
- Data Stacks For Fun & Nonprofit — Part III
- A Simple Analytics Project Using Airflow, dbt and Superset
- dbt Core & Airflow
- Startup analytics 101: from seed to scale
- Open Source Data Stack Conference
- Gaining insights on my workout data with Apache Superset
- The four priorities of a one-person analytics team: lessons from Lola.com
- Bootstrap a Modern Data Stack in 5 minutes with Terraform
- Jan Soubusta – Data Pipeline as Code: Journey of our Blueprint
- Recipes from Airbyte:
Hosting
- Hosting web apps for free
- For a long time, the best choice was Heroku, hence it is used rather extensively below. However, Salesforce, which acquired the company in 2011, decided to eliminate free plans in 2022. There are quite a few modern alternatives though, here or here.
Heroku
Scheduling
Airflow
Prefect
ELT
Meltano
Airbyte
- Airbyte vs Meltano
- Building an Open-source Ingestion Layer with Airbyte
- Ingestion with Airbyte: A Guided Tutorial
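- Quick local start (a sketch; at the time of writing Airbyte shipped a docker-compose setup, but check the current docs as this changes):
$ git clone https://github.com/airbytehq/airbyte.git
$ cd airbyte
$ docker-compose up # UI on http://localhost:8000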
CI
- Setup a slim CI for dbt with BigQuery and Docker
- GitHub Actions
- The Team plan might offer good value for money
- There is even a specific action for dbt and one for follow-up notifications to Slack
Transformation
dbt
- dbt Cloud: free for one developer
- dbt Core:
- How to Deploy dbt to Production using GitHub Actions
- hint on deploying from dbt
- to run a step only for a specific trigger (see the workflow sketch below):
if: github.event_name == 'workflow_dispatch'
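- A minimal sketch of such a workflow, written as a heredoc for easy pasting (the file name, branch, adapter, and Python version are assumptions, and profile/credential handling is left out; it is not the official dbt setup):
$ mkdir -p .github/workflows
$ cat > .github/workflows/dbt_run.yml <<'EOF'
name: dbt run
on:
  push:
    branches: [main]
  workflow_dispatch:
jobs:
  dbt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.8"
      - run: pip install dbt-bigquery
      - run: dbt run --profiles-dir .
      # a step gated on the trigger, as in the hint above
      - if: github.event_name == 'workflow_dispatch'
        run: dbt test --profiles-dir .
EOF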
Docs
- How to deploy dbt docs to GitLab Pages
- Make it password protected
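- A minimal .gitlab-ci.yml sketch for publishing the generated docs (untested; assumes a working dbt profile is available on the runner, since dbt docs generate connects to the warehouse; the password protection above is most likely GitLab's Pages access control setting):
$ cat > .gitlab-ci.yml <<'EOF'
pages:
  image: python:3.8
  script:
    - pip install dbt
    - dbt docs generate --profiles-dir .
    - mkdir public
    - cp target/index.html target/manifest.json target/catalog.json public/
  artifacts:
    paths:
      - public
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
EOF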
Storage
BigQuery
- Console
- Creating a service account
- Meltano setup
- dbt setup:
- Superset setup
- SQL functions
- Monthly free: 10 GB storage, 1 TB for analysis
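- A quick sanity check of a service-account connection from the shell (assumes the gcloud and bq CLIs are installed and a key file like the one created in the Local installs section below):
$ gcloud auth activate-service-account --key-file=bigquery-sa.json
$ bq query --use_legacy_sql=false 'SELECT 1 AS ok'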
PostgreSQL
- How to log into a Postgresql database
- How to get started with PostgreSQL
- PostgreSQL Database startup / shutdown /restart
- psql docs
- PostgreSQL on Heroku
- Only 10K rows, 1GB storage; otherwise $9/mo; see pricing
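- Handy Heroku CLI commands for the hosted database (assumes the Heroku CLI and an app with the Postgres add-on; <app> is a placeholder):
$ heroku pg:info -a <app> # plan, row count, storage
$ heroku pg:psql -a <app> # open a psql session
$ heroku config:get DATABASE_URL -a <app> # connection string for dbt/Superset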
DuckDB
- Modern Data Stack in a Box with DuckDB
- pg_duckdb: DuckDB-powered Postgres for high performance apps & analytics
Dashboarding
Superset
- Set up using Makefile
- Running on Heroku:
# Set up basics
$ pip install apache-superset
$ pip install psycopg2
$ pip freeze > requirements.txt # or limit it to the packages installed above
$ echo "web: superset db upgrade" > Procfile
$ echo "python-3.8.2" > runtime.txt
# add configs to superset_config.py
# Generate secret key
$ python
>>> from cryptography import fernet
>>> fernet.Fernet.generate_key()
# Set up Heroku app
$ heroku login
$ heroku create mds4all-superset
$ heroku config:set SECRET_KEY=<SECRET_KEY>
$ heroku config:set PORT=<PORT>
$ heroku addons:create heroku-postgresql:hobby-dev
$ git push heroku main
$ heroku logs --tail # check for errors
# Initialise Superset
$ heroku run bash
$ export FLASK_APP=superset
$ superset fab create-admin
$ superset init
$ exit
# Finalise app
$ echo 'web: gunicorn "superset.app:create_app()"' > Procfile
$ git push heroku main
$ heroku logs --tail # check for errors
# Open app
$ heroku open
# add pybigquery to requirements.txt for BigQuery connections
- Running Superset on GCP
- Cloud Run: example one and two
- App Engine: example here
- Cloud Run vs App Engine
- Adding New Database Drivers in Docker
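- For the Cloud Run option above, the deploy step boils down to something like this (a sketch; image, service name, and region are placeholders, assumes a Dockerfile based on apache/superset, and Superset still needs SECRET_KEY and a metadata database configured):
$ gcloud builds submit --tag gcr.io/$PROJECT_ID/superset
$ gcloud run deploy superset --image gcr.io/$PROJECT_ID/superset --region europe-west1 --port 8088 --allow-unauthenticated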
Querying
Trino
Editor
VSCode
Atom
Code
Local installs
- data stack: PostgreSQL/BigQuery, Meltano, dbt, Superset on Heroku
### BigQuery
# Install gcloud from https://cloud.google.com/sdk/docs/install
$ gcloud auth login
# Add var
$ PROJECT_ID=$(gcloud config get-value project)
# Create user
$ gcloud iam service-accounts create bigquery-sa --display-name="BigQuery SA"
# Add perms
$ gcloud projects add-iam-policy-binding $PROJECT_ID --member="serviceAccount:bigquery-sa@$PROJECT_ID.iam.gserviceaccount.com" --role="roles/bigquery.user"
$ gcloud projects add-iam-policy-binding ${PROJECT_ID} --member="serviceAccount:bigquery-sa@${PROJECT_ID}.iam.gserviceaccount.com" --role="roles/bigquery.dataEditor"
# Get creds to local
$ gcloud iam service-accounts keys create bigquery-sa.json --iam-account=bigquery-sa@${PROJECT_ID}.iam.gserviceaccount.com
### PostgreSQL
# $ brew install postgresql
# $ postgres --version
$ pg_ctl -D /usr/local/var/postgres start
# - for error running it
# $ ps ax | grep postmaster ## finds the running task
# $ sudo kill -INT <PID> ## kill the process (PID from the command above)
#
# set log_min_messages = error in postgresql.conf to remove warnings
$ psql -d postgres -U michal -W # user michal, pw: "admin"
> create database ds4fnp;
> create user ds4fnp with password 'ds4fnp';
> grant all privileges on database ds4fnp to ds4fnp;
> \q
$ psql -U ds4fnp
# > \c ds4fnp
> create schema ds4fnp;
> \q
### Directory
$ mkdir ds4fnp
$ cd ds4fnp
$ python3 -m venv .venv
$ source .venv/bin/activate
### Meltano
$ pip install meltano
$ meltano init meltano
$ cd meltano
$ meltano add extractor tap-spreadsheets-anywhere
# $ meltano invoke tap-spreadsheets-anywhere --help
$ meltano add loader target-postgres
# $ meltano invoke target-postgres --help
# - error on macOS: https://stackoverflow.com/a/62931654
### ELT
# - add config info to tap and target in meltano.yml
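# - a hypothetical sketch of that config (keys per the tap-spreadsheets-anywhere and
#   target-postgres docs; paths, names, and credentials are placeholders - check the
#   docs of the plugin variants actually installed):
#   plugins:
#     extractors:
#       - name: tap-spreadsheets-anywhere
#         config:
#           tables:
#             - path: https://example.com/data
#               name: raw_data
#               pattern: ".*\\.csv"
#               start_date: "2020-01-01T00:00:00Z"
#               key_properties: []
#               format: csv
#     loaders:
#       - name: target-postgres
#         config:
#           host: localhost
#           port: 5432
#           user: ds4fnp
#           password: ds4fnp
#           dbname: ds4fnp
#           default_target_schema: ds4fnp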
$ meltano elt tap-spreadsheets-anywhere target-postgres
### dbt
# $ brew update
# $ brew install git
# $ brew tap fishtown-analytics/dbt
# $ brew install dbt
# - set up profiles.yml
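# - a hypothetical ~/.dbt/profiles.yml sketch for the local Postgres created above
#   (the profile name has to match the one referenced in dbt_project.yml):
#   ds4fnp:
#     target: dev
#     outputs:
#       dev:
#         type: postgres
#         host: localhost
#         port: 5432
#         user: ds4fnp
#         password: ds4fnp
#         dbname: ds4fnp
#         schema: ds4fnp
#         threads: 1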
$ cd ..
$ pip install dbt
$ dbt init dbt --adapter postgres
$ cd dbt
$ dbt debug
# - set up dbt_project.yml
# - set up schema.yml files
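# - a hypothetical schema.yml source sketch (the table name is a placeholder; it should
#   match whatever Meltano loaded into the ds4fnp schema):
#   version: 2
#   sources:
#     - name: ds4fnp
#       schema: ds4fnp
#       tables:
#         - name: raw_data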
$ dbt run
$ dbt docs generate
$ dbt docs serve
### Superset
$ cd ..
$ pip install apache-superset
# $ pip install sqlalchemy==1.3.24
$ superset db upgrade
$ superset fab create-admin
$ superset init
$ superset run -p 8088 --with-threads --reload --debugger # why just one?
# postgresql+psycopg2://ds4fnp:ds4fnp@127.0.0.1:5432/ds4fnp
# $ pip install pybigquery
# bigquery://{project_id}
### Closing
# - ctrl+c several times
$ deactivate
$ pg_ctl -D /usr/local/var/postgres stop
Docker
- data stack: PostgreSQL, Superset, (dbt locally)
- How to Deploy Docker Containers to The Cloud
# Postgres
$ docker pull postgres
$ docker run --name postgres -e POSTGRES_USER=cookie -e POSTGRES_PASSWORD=cookie -p 5432:5432 -d postgres
$ brew install libpq
$ brew link --force libpq
$ docker exec -it postgres psql -U cookie -d postgres
# Superset
$ docker pull apache/superset
$ docker run -d -p 8080:8088 --name superset apache/superset
# -v $PWD/local_config.py:/app/pythonpath/superset_config.py
$ docker exec -it superset superset fab create-admin \
--username admin \
--firstname Superset \
--lastname Admin \
--email admin@superset.com \
--password admin
$ docker exec -it superset superset db upgrade
# $ docker exec -it superset superset load_examples
$ docker exec -it superset superset init
# postgresql+psycopg2://cookie:cookie@host.docker.internal:5432/postgres
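# dbt (run locally against the containerised Postgres above - a sketch; assumes an existing
# dbt project and a profiles.yml pointing at localhost:5432 with the cookie/cookie credentials)
$ dbt debug
$ dbt run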
Tasks
- Test if Github works if I copy-paste it
- Move Github from GDrive to michal
- Test one last time from scratch
- Commit
- Play around with:
- BigQuery
- Different hosting (GCP)
- Google Data Studio
- Keboola
- Airflow
- Dockerisation and running on servers
- Airbyte, Trino
Notes
- What might be worth waiting for
- A follow-up article on ds4fnp
- Meltano: new integrations in July (?)