Learning
- SQL Climber
- SQL Zoo
- DataCamp
- Lost at SQL
- Window Functions – intro, questions
Reading
- Keep in mind that relational database is just one of several database paradigms, even though one of the most commonly used.
- Learn why SQL should be the default choice for data transformation logic
- I don't want to learn your garbage query language. I just want my SQL back.
- Understand that SQL queries start with FROMUnderstand that SQL queries start with FROM
When executing, the order actually is:
FROM
JOIN
ON
WHERE
GROUP BY
HAVING
SELECT -- including window functions
ORDER BY
LIMIT
- Understand that CROSS JOIN UNNEST works through value tablesUnderstand that CROSS JOIN UNNEST works through value tables
Now let's talk about unnesting.
The UNNEST function takes an array and returns a value table of the array's element type. Whereas most BigQuery tables are SQL tables defined as a collection of columns, a value table has rows of some value type.
For numbers_array, UNNEST(numbers_array) returns a value table whose value type is INT64, since numbers_array is an array with an element type of INT64. This value table contains all of the elements in numbers_array for the current row from t1
- Understand that CTEs are mostly pass-throughs nowUnderstand that CTEs are mostly pass-throughs now
All modern analytical database optimizers appear to treat our “import statement” CTEs as pass-throughs. These CTEs have no impact on performance whatsoever, and just act in the way that we want: as a great tool to clean up our code but not ultimately to change the explain plan.
Tested on: Redshift, Bigquery, Snowflake
Yet, CTEs originally used to be – and sometimes still are – optimisation boundaries, i.e. databases are not allowed to optimise across CTEs, hence they are always be reso... - Use PARTITION BY on columns based on which one filters oftenUse PARTITION BY on columns based on which one filters often
Partitioning divides your table into parts and keeps the related data together based on column values such as date, country, region, etc. Partitions act as virtual columns. You define them at table creation, and they can help reduce the amount of data scanned per query, thereby improving performance.
When deciding the columns on which to partition, consider the following:
Columns that are used as filters are good candidates for partitioning.
Partitioning has a cost. As the number of par... - Use linter, even for SQLUse linter, even for SQL
Roles of a linting tool
Style
Nobody likes to nitpick on style in PR reviews, and nobody likes to get these reviews either. Not only is it a bad experience for everyone involved, it also gets in the way of the important conversation around business logic, architecture and testing.
Raising style issues is a good start, but enforcing them automatically is what allows developers to truly forget about formatting.
Code Smells
Beyond “ugly” code, there is another category of “problematic”... - Beware using NULL in WHERE NOT INBeware using NULL in WHERE NOT IN
SELECT *
FROM pony
WHERE id NOT IN (1, 2, NULL)
-- 0 rows, major wtf
It's because:
SELECT *
FROM pony
WHERE NOT (
id = 1
OR id = 2
OR id = NULL
)
And then remove the parenthesis using De Morgan’s laws:
SELECT *
FROM pony
WHERE
id != 1
AND id != 2
AND id != NULL
Like explained in the intro, id != NULL is always NULL, therefore the entire WHERE clause is always FALSE.
Advice:
Keep using subqueries, mind NULL values. Still better than JOINs that can cause du... - Create fizzbuzz for SQL to test analyst candidatesCreate fizzbuzz for SQL to test analyst candidates
What is fizzbuzz?
Fizzbuzz for SQL
Implementation for free
Paid implementation
Tools
- Trino – SQL query engine that runs at ludicrous speed. Used to be PrestoSQL.
- SQLite
- Sribe – API that provides line by line descriptions of your queries.
- Malloy – "React of SQL"
- Splink – fast, accurate and scalable probabilistic data linkage
- Ibis – translate Python to SQL
- SQLGlot – translate between SQL flavours
Snippets
- Snippets
- How to use Jinja templating for SQL in Python:
from jinja2 import Template
- Combine minus signs on the start of the opening block and the start of the ending block.
- Since we now have a nested loop, we need to keep track of two indices. We can do this by using the block
{% set outer_loop = loop %}
to assign the outer loop to a new variableouter_loop
before it is “replaced” by the inner loop.
Notes
"SQL is declarative, but used for imperative ends—we need to know how it works, step by step. Software is the opposite: It typically uses imperative means for declarative ends." – Benn Stancil, The smol analyst