Learn SQL

  • data-coding
  • learn
  • sql
  • summer

Learning#

Reading#

  • Keep in mind that relational database is just one of several database paradigms, even though one of the most commonly used.
  • Learn why SQL should be the default choice for data transformation logic
  • I don't want to learn your garbage query language. I just want my SQL back.
  • Understand that SQL queries start with FROMUnderstand that SQL queries start with FROM
    When executing, the order actually is:

    FROM
    JOIN
    ON
    WHERE
    GROUP BY
    HAVING
    SELECT -- including window functions
    ORDER BY
    LIMIT

  • Understand that CROSS JOIN UNNEST works through value tablesUnderstand that CROSS JOIN UNNEST works through value tables
    Now let's talk about unnesting.

    The UNNEST function takes an array and returns a value table of the array's element type. Whereas most BigQuery tables are SQL tables defined as a collection of columns, a value table has rows of some value type.

    For numbers_array, UNNEST(numbers_array) returns a value table whose value type is INT64, since numbers_array is an array with an element type of INT64. This value table contains all of the elements in numbers_array for the current row from t1
  • Understand that CTEs are mostly pass-throughs nowUnderstand that CTEs are mostly pass-throughs now

    All modern analytical database optimizers appear to treat our “import statement” CTEs as pass-throughs. These CTEs have no impact on performance whatsoever, and just act in the way that we want: as a great tool to clean up our code but not ultimately to change the explain plan.
    Tested on: Redshift, Bigquery, Snowflake
    Yet, CTEs originally used to be – and sometimes still are – optimisation boundaries, i.e. databases are not allowed to optimise across CTEs, hence they are always be reso...
  • Use PARTITION BY on columns based on which one filters oftenUse PARTITION BY on columns based on which one filters often
    Partitioning divides your table into parts and keeps the related data together based on column values such as date, country, region, etc. Partitions act as virtual columns. You define them at table creation, and they can help reduce the amount of data scanned per query, thereby improving performance.

    When deciding the columns on which to partition, consider the following:

    Columns that are used as filters are good candidates for partitioning.
    Partitioning has a cost. As the number of par...
  • Use linter, even for SQLUse linter, even for SQL
    Roles of a linting tool
    Style

    Nobody likes to nitpick on style in PR reviews, and nobody likes to get these reviews either. Not only is it a bad experience for everyone involved, it also gets in the way of the important conversation around business logic, architecture and testing.
    Raising style issues is a good start, but enforcing them automatically is what allows developers to truly forget about formatting.


    Code Smells

    Beyond “ugly” code, there is another category of “problematic”...
  • Beware using NULL in WHERE NOT INBeware using NULL in WHERE NOT IN
    SELECT *
    FROM pony
    WHERE id NOT IN (1, 2, NULL)
    -- 0 rows, major wtf


    It's because:
    SELECT *
    FROM pony
    WHERE NOT (
    id = 1
    OR id = 2
    OR id = NULL
    )


    And then remove the parenthesis using De Morgan’s laws:

    SELECT *
    FROM pony
    WHERE
    id != 1
    AND id != 2
    AND id != NULL


    Like explained in the intro, id != NULL is always NULL, therefore the entire WHERE clause is always FALSE.

    Advice:
    Keep using subqueries, mind NULL values. Still better than JOINs that can cause du...
  • Create fizzbuzz for SQL to test analyst candidatesCreate fizzbuzz for SQL to test analyst candidates

    What is fizzbuzz?
    Fizzbuzz for SQL
    Implementation for free
    Paid implementation

Tools#

  • Trino – SQL query engine that runs at ludicrous speed. Used to be PrestoSQL.
  • SQLite
  • Sribe – API that provides line by line descriptions of your queries.
  • Malloy"React of SQL"
  • Splink – fast, accurate and scalable probabilistic data linkage
  • Ibis – translate Python to SQL
  • SQLGlot – translate between SQL flavours

Snippets#

  • Snippets
  • How to use Jinja templating for SQL in Python:
    • from jinja2 import Template
    • Combine minus signs on the start of the opening block and the start of the ending block.
    • Since we now have a nested loop, we need to keep track of two indices. We can do this by using the block {% set outer_loop = loop %} to assign the outer loop to a new variable outer_loop before it is “replaced” by the inner loop.

Notes#

"SQL is declarative, but used for imperative ends—we need to know how it works, step by step. Software is the opposite: It typically uses imperative means for declarative ends." – Benn Stancil, The smol analyst

Metadata