Think of a Relational Database Management System (RDBMS) as a vast collection of Excel workbooks. Each workbook (in database terms, we call these “tables”) contains different sheets, and on each sheet, you have rows and columns of data. The sheets are organized in a way that the data can be quickly searched, updated, inserted, or deleted.
A relational database consists of multiple tables that relate to each other. The relation between tables is formed in the sense of shared columns.
A transaction is a sequential group of statements or queries to perform single or multiple tasks in a database. Transactions may consist of a single read, write, delete, or update operations or a combination of these.
ACID – zbiór właściwości gwarantujących poprawne przetwarzanie transakcji w bazach danych
Atomicity - każda transakcja albo zostanie wykonana w całości, albo w ogóle
Consistency - po wykonaniu transakcji system będzie spójny, czyli nie zostaną naruszone zasady integralności (nie będzie nieautoryzowanych zmian).
Isolation - jeśli dwie transakcje wykonują się współbieżnie, to ich efekt będzie taki jakby były wykonane sekwencyjnie
Durability - jeśli transakcja się powiedzie, to zostaje ona utrwalona i aktualne dane są dostępne i spójne nawet pomimo możliwych awarii
Database schema is the structure of a database, refers to the organization of data as a blueprint that demonstrates how the database is constructed. So, a database schema describes how the data may relate to other tables or other data models. However, the schema does not actually contain data.
Database instance is a snap shot of data in a database at a single moment in time.
Keys help you to uniquely identify row of data in a table despite the size of the database, which might contains thousand of records. It establishes and identify the relationship between tables.
Normalization rules divide large tables into smaller tables and join them using relationships. The purpose of normalization in SQL is:
SQL stands for “Structured Query Language”. It’s used to manage & query data stored in a relational database management system (RDBMS).
People casually use “SQL” interchangeably with “MySQL” and “PostgreSQL”. While that’s not technically correct, in most cases for beginners in the field, it doesn’t matter too much unless you want to be pedantic.
SQL is general, high-level language for querying and manipulating relational databases (RDBMS). MySQL, Postgres, SQLite, and SQL Server are all RDBMS. You use varying flavors of SQL syntax to query each unique RDBMS. For example, to query Postgres, you write PostgreSQL.
Here we use PostgreSQL because it is 2nd most popular, open source and extensible.
BETWEEN is a logical operator that selects values within a given range (inclusive)
SELECT column_name(s)
FROM table_name
WHERE column_name BETWEEN value1 AND value2;
IN operator allows us to specify multiple values in a single line’s WHERE clause, instead
of the more tedious approach of having to use multiple OR conditions to filter for multiple
values.
SELECT ...
FROM ...
WHERE column IN (..., ...);
Single quotes are used to indicate the beginning and end of a string in SQL. Double quotes generally aren’t used in SQL, but that can vary from database to database. Stick to using single quotes.
LIKE:
SELECT ...
FROM ...
WHERE column LIKE ... ;
LIKE is often accompanied with wildcards % (zero or multiple characters) and _ (single
character).
Theoretically, rows in a relational database such as Postgres, MySQL, SQL Server, etc. aren’t
stored in any particular or guaranteed sort order. In fact, executing an identical SELECT *
query twice on the same dataset doesn’t ensure the results will come back in the same order.
ORDER BY clause allows you to reorder your results based on the data in one or more
columns.
SELECT column1, column2
FROM table_name
WHERE condition(s)
ORDER BY column;
To reverse the default ascending sort order for ORDER BY, you can use the DESC SQL keyword.
The ORDER BY clause doesn’t just work with one column – you can sort on two, or even multiple
columns
Pro tip: you can substitute numbers for column names in the ORDER BY clause. The numbers correspond to the columns you specify in the SELECT clause (starting from 1).
LIMIT and OFFSET are used for limiting the output
SELECT *
FROM callers
ORDER BY call_received DESC
LIMIT 5;
will get you the 5 most recent phone calls received, while here
SELECT *
FROM callers
ORDER BY call_received DESC
OFFSET 10
LIMIT 5;
we skip the first 10 phone calls and fetch the subsequent 5 phone calls received.
SUM adds together all the values in a particular column.MIN returns the lowest value in a particular columnMAX returns the highest value in a particular columnAVG calculates the average of a group of selected values.COUNT counts how many rows are in a particular column.GROUP BYGROUP BY command tells the database to separate your data into groups, which can then be
aggregated independently.
SELECT
ticker,
AVG(open),
MAX(close)
FROM stock_prices
GROUP BY ticker
ORDER BY ticker;
You can GROUP BY two columns, even multiple columns, in SQL. To group by multiple categories, just
separate column names with commas
HAVINGHAVING allows you to filter data based on values from aggregate functions (as opposed to WHERE
clause which filters BEFORE grouping).
SELECT ticker, AVG(open)
FROM stock_prices
GROUP BY ticker
HAVING AVG(open) > 200;
DISTINCTThe DISTINCT SQL command is used in conjunction with the SELECT statement to return only
distinct (different) values.
SELECT DISTINCT manufacturer
FROM pharmacy_sales;
You can use DISTINCT with aggregate functions – the most common one being COUNT
SELECT COUNT(DISTINCT user_id)
FROM trades;
ABS()ROUND()CEIL()FLOOR()POWER()MOD()CASTIn SQL, integer division discards the remainder from the output, providing only the integer (whole number) part of the result. You can still make SQL achieve decimal output with a few tricks such as:
CAST() function
SELECT
CAST(10 AS DECIMAL)/4,
CAST(10 AS FLOAT)/4,
10/CAST(6 AS DECIMAL),
10/CAST(6 AS FLOAT);
In SQL, NULL indicates the absence of a value. Unlike an empty string or zero, NULL doesn’t represent a specific value, but rather, a missing or unknown piece of information.
Now, to correctly identify NULL values, we can utilize the IS NULL or IS NOT NULL condition:
SELECT *
FROM goodreads
WHERE book_title IS NULL;
SELECT COALESCE(book_rating, 0)
FROM goodreads;
If column_name is NULL, it returns the specified ‘expression’. Otherwise, it returns the value of column_name.
IFNULL() function: Handles two arguments, returning the second if the first is null; else, it
returns the first.
SELECT book_title, IFNULL(book_rating, 0) AS rated_books
FROM goodreads;
CASEThe CASE statement evaluates the specified conditions sequentially and returns a value when it
encounters the first true condition. If none of the conditions are met, the ELSE clause provides a
default value
SELECT
column_1,
column_2,
CASE
WHEN condition_1 THEN result_1
WHEN condition_2 THEN result_2
WHEN ... THEN ...
ELSE result_3 -- If condition_1 and condition_2 are not met, return result_3 in ELSE clause
END AS column_3_name -- Give your new column an alias
FROM table_1;
CASE Statement in WHERE Clause
SELECT
actor,
character,
platform
FROM marvel_avengers
WHERE
CASE
WHEN platform = 'Instagram' THEN followers >= 500000
WHEN platform = 'Twitter' THEN followers >= 200000
ELSE followers >= 100000
END;
COUNT() in CASE Statement
SELECT
platform,
COUNT(CASE
WHEN followers >= 500000 THEN 1
ELSE NULL
END) AS popular_actor_count,
COUNT(CASE
WHEN followers < 500000 THEN 1
ELSE NULL
END) AS less_popular_actor_count
FROM marvel_avengers
GROUP BY platform;
SUM(), AVG()SELECT *
FROM artists
JOIN songs
ON artists.artist_id = songs.artist_id;
JOIN keyword, which is short for an INNER JOIN. There are actually 4 different types of JOINS in
SQL.
| Type of Joins | Description |
|---|---|
| INNER JOIN | Returns only the rows with matching values from both tables. |
| LEFT JOIN | Returns all the rows from the left table and the matching rows from the right table. |
| RIGHT JOIN | Returns all the rows from the right table and the matching rows from the left table. |
| FULL OUTER JOIN | Returns all rows when there is a match in either the left or the right table. If there is no match, NULL values are returned for columns from the table without a match. |
CURRENT_DATE() Returns today’s date.CURRENT_TIME() Returns the current time without the date.NOW() Returns the current date and time.SELECT *
FROM messages
WHERE sent_date > '2022-08-10 00:00:00';
EXTRACT() function extracts a specific component (i.e. year, month, day, hour, or minute)
SELECT
message_id,
sent_date,
EXTRACT(YEAR FROM sent_date) AS extracted_year,
EXTRACT(MONTH FROM sent_date) AS extracted_month,
EXTRACT(DAY FROM sent_date) AS extracted_day,
EXTRACT(HOUR FROM sent_date) AS extracted_hour,
EXTRACT(MINUTE FROM sent_date) AS extracted_minute,
FROM messages
LIMIT 3;
DATE_TRUNC() rounds down a date or timestamp to a specified unit of timeINTERVAL function is used to handle date and time gaps by adding and subtracting intervals such
as “3 days”, “5 hours”, “45 minutes”.
SELECT
message_id,
sent_date,
sent_date + INTERVAL '2 days' AS add_2days,
sent_date - INTERVAL '3 days' AS minus_3days,
sent_date + INTERVAL '2 hours' AS add_2hours,
sent_date - INTERVAL '10 minutes' AS minus_10mins
FROM messages
LIMIT 3;
TO_CHAR() converts a date or timestamp to a string with a specified format.::DATE or TO_DATE(): Convert strings into dates.::TIMESTAMP or TO_TIMESTAMP(): Convert strings into timestamps.Common Table Expression, is like a query within a query. With the WITH statement, you can create temporary tables to store results, making complex queries more readable and maintainable. These temporary tables exist only for the duration of the main query, streamlining your analysis process.
-- Start of a CTE
WITH genre_revenue_cte AS (
SELECT
genre,
SUM(concert_revenue) AS total_revenue
FROM concerts
GROUP BY genre
)
-- End of a CTE
SELECT
g.genre,
g.total_revenue,
AVG(c.concert_revenue) AS avg_concert_revenue
FROM genre_revenue_cte AS g
INNER JOIN concerts AS c
ON g.genre = c.genre
WHERE c.concert_revenue > g.total_revenue * 0.5
GROUP BY g.genre, g.total_revenue;
By nesting queries within parentheses, you can generate temporary tables to perform calculations and filter data within the main query.
SELECT artist_name
FROM concerts
WHERE concert_revenue > (
SELECT AVG(concert_revenue) FROM concerts);
In simple terms, they are functions that operate by creating virtual “windows” within the dataset.
SELECT
spend,
SUM(spend) OVER (
PARTITION BY product
ORDER BY transaction_date) AS running_total
FROM product_spend;
Here’s what each SQL command in the window function is doing:
SUM(spend) is a typical aggregate function
OVER is required for window functions
PARTITION BY: makes each product it’s own section / window,
ORDER BY: the data is ordered by transaction_date, and the running_total accumulates the sum
across the current row and all subsequent rows of spend
ORDER BY essentially sorts the data by the specified column, similar to an ORDER BY
clause. Without ORDER BY, each value would be a sum of all the spend values without its respective
product.
| Clause | Order | Description |
|---|---|---|
| FROM | 1 | The query begins with the FROM clause, where the database identifies the tables involved and accesses the necessary data. |
| WHERE | 2 | The database applies the conditions specified in the WHERE clause to filter the data retrieved from the tables in the FROM clause. |
| GROUP BY | 3 | If a GROUP BY clause is present, the data is grouped based on the specified columns, and aggregation functions (such as SUM(), AVG(), COUNT()) are applied to each group. |
| HAVING | 4 | The HAVING clause filters the aggregated data based on specified conditions. |
| SELECT | 5 | The SELECT clause defines the columns to be included in the final result set. |
| ORDER BY | 6 | If an ORDER BY clause is used, the result set is sorted according to the specified columns. |
| LIMIT/OFFSET | 7 | If LIMIT or OFFSET clause is present, the result set is restricted to the specified number of rows and optionally offset by a certain number of rows. |
CONCAT function can be used for formatting e.g.
CONCAT(
'$',
ROUND(SUM(total_sales) / 1000000),
' million'
) AS sales
MAKE_DATE(year::INTEGER, month::INTEGER, day::INTEGER) can be used to format date
WITH cte1 AS (
...
),
cte2 AS (
...
),
cte3 AS (
...
)
SELECT
user_id,
tweet_date,
ROUND(AVG(tweet_count) OVER (
PARTITION BY user_id
ORDER BY tweet_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW), 2) AS rolling_avg_3
FROM tweets;
SELECT
merchant_id,
credit_card_id,
amount,
transaction_timestamp
FROM
transactions AS t1
WHERE
EXISTS (
SELECT
*
FROM
transactions AS t2
WHERE
t1.merchant_id = t2.merchant_id
AND t1.credit_card_id = t2.credit_card_id
AND t1.amount = t2.amount
AND t1.transaction_timestamp < t2.transaction_timestamp
AND t2.transaction_timestamp <= t1.transaction_timestamp + INTERVAL '10 min'
)