SQL
SQL (Structured Query Language) is a declarative programming language.
Pronunciation: /ˈɛs kjuː ˈɛl/
(an SQL query).
Database I use in examples
psql template1 template1=# CREATE USER test WITH PASSWORD 'test'; template1=# CREATE DATABASE test; template1=# GRANT ALL PRIVILEGES ON DATABASE test TO test; template1=# \q psql -d test
Practice
BLOB field
Storing images in the database has some advantages:
- The image data is stored in the database. There is no extra step to load it. There is no risk that the file's pathname is incorrect
- Deleting a row deletes the image automatically
- Changes to an image are not visible to other clients until you commit the change
- Rolling back a transaction restores a previous state of the image
- Updating a row creates a lock, so no other client can update the same image concurrently
- Database backups include all the images
- SQL privileges control access to the image as well as the row
Disadvantages are:
- Performance
- Database network usage
- Backups size
CHECK constraint
Example:
CREATE TABLE DaysOfWeek ( name CHAR(20) PRIMARY KEY, CHECK (name IN ('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday')) ); INSERT INTO DaysOfWeek(name) VALUES ('Sunday'); INSERT INTO DaysOfWeek(name) VALUES ('Another');
INSERT 0 1 ERROR: new row for relation "daysofweek" violates check constraint "daysofweek_name_check" DETAIL: Failing row contains (Another ).
Use it only if You are 100% sure that the list of allowed options will be constant, otherwise - use FOREIGN KEY
:
CREATE TABLE GenderChoices ( gender VARCHAR(20) PRIMARY KEY ); INSERT INTO GenderChoices(gender) VALUES ('male'); INSERT INTO GenderChoices(gender) VALUES ('female'); CREATE TABLE users ( name VARCHAR(100) PRIMARY KEY, gender VARCHAR(20) REFERENCES GenderChoices ON UPDATE CASCADE ); INSERT INTO users(name, gender) VALUES ('user1', 'male'); INSERT INTO users(name, gender) VALUES ('user2', 'other');
INSERT 0 1 ERROR: insert or update on table "users" violates foreign key constraint "users_gender_fkey" DETAIL: Key (gender)=(other) is not present in table "genderchoices".
There are other ways to solve the problem: ENUM
field and restriction on the client side.
DEFAULT
Be careful when you alter existing table with a great number of rows, it locks the database.
DISTINCT
Use it carefully, may cause performance problems on large data sets.
INDEX
Mistakes in defining indexes:
- Defining no indexes or not enough indexes
- Defining too many indexes or indexes that don't help
- Running queries that no index can help
MENTOR - technique to maintain indexes:
- Measure (analyze logs, search for slowest/most time consuming/most popular queries)
- Explain (analyze query execution plan using
EXPLAIN
) - Nominate
- Test
- Optimize
- Rebuild
A fat index may be fast
If your query references only the columns included in the index data structure, the database generates your query results by reading only the index.
Create index concurrently
Creating new indexes blocks the table. Use CONCURRENTLY to run it in background.
FLOAT VS NUMERIC (DECIMAL)
The advantage of NUMERIC
(DECIMAL
) are that they store rational numbers without rounding, as the FLOAT
data types do.
Example:
CREATE TABLE numbers ( number_id SERIAL PRIMARY KEY, value_float FLOAT, value_decimal NUMERIC(10,2) ); INSERT INTO numbers (value_float, value_decimal) VALUES (1./3, 1./3); INSERT INTO numbers (value_float, value_decimal) VALUES (1./3, 1./3); INSERT INTO numbers (value_float, value_decimal) VALUES (1./3, 1./3); SELECT SUM(value_float) * 1000 as sum_float, SUM(value_decimal) * 1000 as sum_decimal from numbers;
sum_float | sum_decimal -----------+------------- 1000 | 990.00
IEEE-754 (IEEE Standard for Floating-Point Arithmetic)
One advantage of IEEE-754 is that by using the exponent, it can represent fractional values that are both very small and very large.
DECIMAL
is appropriate type for an amount of money because it can handle money values just as easy as FLOAT
and more accurately.
FOREIGN KEY
FOREIGN KEY
if a "key" to consistency.
How to declare:
CREATE TABLE artwork ( artwork_id SERIAL PRIMARY KEY, author_id INTEGER REFERENCES author ) /* or */ CREATE TABLE artwork ( artwork_id SERIAL PRIMARY KEY, author_id INTEGER, FOREIGN KEY (author_id) REFERENCES author ) /* or */ CREATE TABLE artwork ( artwork_id SERIAL PRIMARY KEY, author_id INTEGER ) ALTER TABLE artwork ADD CONSTRAINT artwork_author_id_fk FOREIGN KEY (author_id) REFERENCES author (author_id);
Full-text search
Use a specialized search engine, for instance: Elasticsearch.
CASCADE UPDATE/DELETE
Allows to update or delete the parent row and lets the database takes care of any child rows that reference it.
CREATE TABLE artwork ( artwork_id SERIAL PRIMARY KEY, author_id INTEGER, FOREIGN KEY (author_id) REFERENCES author ON UPDATE CASCADE ON DELETE SET DEFAULT )
CSV/JSON/etc. or FOREIGN KEY?
Storing lists in a text field is a bad practice:
- We are limited in number of items can be placed into text field (mostly because of performance degradation)
- Foreign Key helps us to keep data consistency on database layer
- Additional validation, encoding/decoding logic may be required on the client side
- Unable to use joins and
IN
statement (must useLIKE
or regexp) - No chance to use the field as part of compound index
- We can't use count (and other aggregation functions) to get number of value usages
- Updates are much easier if you use Foreign Key
About regexp:
Some people, when confronted with a problem, think, "I know, I'll use regular expressions." Now they have two problems.
Jamie Zawinski
If you are concerned about performance:
- Joins must work fast enough if number of possible values is small (hundreds and even thousands)
- Use caching to speed up your app
- Remember about YAGNI, caching or denormalization may not worth time spent on their implementation (do it only if you have evidence that it will improve performance drastically)
Example of Foreign Key constraint usage:
CREATE TABLE anime ( anime_key CHAR(255) PRIMARY KEY, title CHAR(255) NOT NULL ); CREATE TABLE genre ( genre_key CHAR(255) PRIMARY KEY, title CHAR(255) NOT NULL ); CREATE TABLE anime_genre ( anime_key CHAR(255) REFERENCES anime, genre_key CHAR(255) REFERENCES genre, PRIMARY KEY (anime_key, genre_key) ); INSERT INTO anime (anime_key, title) VALUES ('kill-la-kill', 'Kill La Kill'); INSERT INTO genre (genre_key, title) VALUES ('action', 'Action'); INSERT INTO genre (genre_key, title) VALUES ('comedy', 'Comedy'); INSERT INTO anime_genre (anime_key, genre_key) VALUES ('kill-la-kill', 'action'); INSERT INTO anime_genre (anime_key, genre_key) VALUES ('kill-la-kill', 'comedy'); SELECT genre.title FROM anime_genre LEFT JOIN genre USING (genre_key) WHERE anime_genre.anime_key = 'kill-la-kill';
anime_genre
is an intersection table.
Invalid data example:
INSERT INTO anime_genre (anime_key, genre_key) VALUES ('unknown-anime', 'comedy'); ERROR: insert or update on table "anime_genre" violates foreign key constraint "anime_genre_anime_key_fkey" DETAIL: Key (anime_key)=(unknown-anime) is not present in table "anime".
GROUP BY and aggregate
SELECT anime.title, COUNT(anime_genre.*) as genre_count FROM anime LEFT JOIN anime_genre USING (anime_key) GROUP BY anime.anime_key;
anime | genre_count Kill La Kill | 2
NULL and the third state
Using null is not the antipattern; the antipattern is using null like an ordinary value or using an ordinary value like null.
SQL Antipatterns: Avoiding the Pitfalls of Database Programming by Bill Karwin
Sometimes I hear that the third state is bad, only True and False must be allowed. I don't agree with it, an unknown is a natural third state. Sometimes we not sure about something, and there is no right answer except "I don't know".
NULL
in SQL is the best implementation of unknown I ever had seen:
DO language plpgsql $$ BEGIN RAISE NOTICE 'NULL OR TRUE = %', NULL OR TRUE; RAISE NOTICE 'NULL AND TRUE = %', NULL AND TRUE; RAISE NOTICE 'NULL OR NULL = %', NULL OR NULL; RAISE NOTICE 'NULL AND NULL = %', NULL AND NULL; RAISE NOTICE 'NULL + 1 = %', NULL + 1; RAISE NOTICE 'NULL || ''abc'' = %', NULL || 'abc'; RAISE NOTICE 'NOT NULL = %', NOT NULL; RAISE NOTICE 'NULL == NULL = %', NULL = NULL; RAISE NOTICE 'NULL != NULL = %', NULL != NULL; END $$;
NOTICE: NULL OR TRUE = t NOTICE: NULL AND TRUE = <NULL> NOTICE: NULL OR NULL = <NULL> NOTICE: NULL AND NULL = <NULL> NOTICE: NULL + 1 = <NULL> NOTICE: NULL || 'abc' = <NULL> NOTICE: NOT NULL = <NULL> NOTICE: NULL == NULL = <NULL> NOTICE: NULL != NULL = <NULL>
In SQL NONE
plays important role:
- Shows that there is no value was assigned (not available yet)
- No reference exists
- An outer join uses
NULL
values as placeholders for the columns of an unmatched table in an outer join
Searching for NULL
values:
SELECT title FROM artwork WHERE autor_id IS NULL; SELECT title FROM artwork WHERE autor_id IS NOT NULL;
Using IS DISTINCT FROM
:
SELECT title FROM artwork WHERE author_id IS DISTINCT FROM 123; /* is equal to */ SELECT title FROM artwork WHERE author_id IS NULL OR author_id != 123;
COALESCE
The function returns its first not-null argument.
ORDER BY RANDOM()
SELECT artwork_id FROM artwork ORDER BY RANDOM() LIMIT 1;
It is expensive and unreliable.
An alternative:
SELECT artwork_id FROM artwork LIMIT 1 OFFSET FLOOR(RANDOM() * (SELECT COUNT(*) FROM artwork));
PRIMARY KEY
PRIMARY KEY is a constraint.
PRIMARY KEY needs to:
- Prevent a table from containing duplicate rows (as it creates
UNIQUE INDEX
) - Support foreign key reference
How to declare:
CREATE TABLE artwork ( artwork_id INTEGER PRIMARY KEY ); /* or */ CREATE TABLE artwork ( artwork_id INTEGER, PRIMARY KEY (artwork_id) ); /* or */ CREATE TABLE artwork ( artwork_id INTEGER ); ALTER TABLE artwork ADD PRIMARY KEY (artwork_id);
Result is the same:
test=# \d+ artwork Table "public.artwork" Column | Type | Modifiers | Storage | Stats target | Description ------------+---------+-----------+---------+--------------+------------- artwork_id | integer | not null | plain | | Indexes: "artwork_pkey" PRIMARY KEY, btree (artwork_id)
PRIMARY KEY name
A good practice is to use next pattern:
{table name}_id
Fields references the table must have the same name, it allows to use JOIN USING
syntax:
SELECT artwork.title, author.name FROM artwork JOIN author USING (author_id); /* instead */ SELECT artwork.title, author.name FROM artwork JOIN author ON author.author_id = artwork.author_id;
artwork.artwork_id
vs artwork.id
: artwork_id
is more clear.
PRIMARY KEY value
There are primary key value types:
- autoincrement field
- natural id
- guid (Globally Unique IDentifier)
- compound (aka composite) primary key
Autoincrement and guid fields are known as pseudo keys or surrogate keys.
Most databases provide a mechanism to generate unique integer values serially, outside the scope of transaction isolation.
Autoincrement field example (PG):
CREATE TABLE items ( item_id SERIAL PRIMARY KEY, title VARCHAR(255) ); INSERT INTO items (title) VALUES ('item1'); INSERT INTO items (title) VALUES ('item2'); SELECT * FROM items;
item_id | title ---------+------- 1 | item1 2 | item2
Natural id, if it is short and unique, usually the best solution for a primary key. For example, phone numbers, usernames, etc.
GUID (uuid4 usually) is solution for distributed (horizontally) databases.
A compound key is a good solution for intersection table (there are a lot of cases where it may be used).
Trees (hierarchical data)
The most obvious solution is to use a parent_key field (adjacency list). The solution has its benefits:
- simple
- easy to modify (copy/move tree leaves)
Drawbacks:
- leaf remove is complex (needs to find and remove all leaves in the subtree,
ON DELETE CASCADE
solves the problem)
If You need to know how many children has the leave, how many descendants has the tree, etc., in most cases better just to use counters instead of query all the tree each time.
But the simple parent_key solution fails if you need to select all descendants (for instance: get full replies tree).
Exception is two level tree:
SELECT FROM nodes as n1 LEFT OUTER JOIN nodes as n2 ON n2.parent_id = n1.node_id;
There are alternative solutions:
PostgreSQL>=8.4
supports recursive queries- Path enumeration technique (drawbacks: no referral integrity)
- Nested sets (drawbacks: hard to insert/delete, complex, no referral integrity)
- Closure table (benefits: allow node to belong to multiple trees, drawback: requires additional table)
UNION
Query results can be combined with UNION
only if their columns are the same in number and type (provide NULL
placeholders for columns that are unique to each table).
Wildcard in SELECT
The *
symbol means every column (the list of columns is implicit).
There are few thoughts why better to specify columns explicitly:
- when we ask for only fields used in the index: request must be executed faster (no need to query the main table)
- blob and text fields stored separately from the main table (it stores only references to them), if don't ask for these fields, we decrease response time and network usage
Vocabulary
Antipattern
Antipattern is a technique that is intended to solve a problem but that often leads to other problems.
EAV design
Stands for Entity-Attribute-Value. Aka open schema, schemaless or name-value pairs.
Intersection table
Aka a join table, many-to-many table or mapping table.
Intersection table has foreign keys referencing two tables (implements many-to-many relationship).
Normalization
Objectives of normalization:
- To represent facts about the real world in a way that we can understand
- To reduce storing facts redundantly and to prevent anomalous or inconsistent data
- To support integrity constraints
One-way cryptographic hash function
The function transforms its input string into a new string, called the hash, that is unrecognizable.
Another characteristic of a hash is that it's not reversible. You can't recover the input string from its hash because the hashing algorithm is designed to "lose" some information about the input.
Example using bcrypt (adaptive hashing function):
import bcrypt hashed_password = bcrypt.hashpw(password.encode('utf-8'), bcrypt.gensalt(12))
Relational database
A relational database is a digital database whose organization is based on the relational model of data, as proposed by E.F. Codd in 1970.
Relational databases are specifically designed to manage relationships:
- each entity (row) can be uniquely identified by a primary key
- entities are normalized. The data for a unique entity is stored only once, and related entities store just its primary key. Changing the data of an entity has to happen in only one place
- entities can be joined at query time, allowing for cross-entity search
- changes to a single entity are atomic, consistent, isolated, and durable (ACID)
- most relational databases support ACID transactions across multiple entities
Links
SQL Antipatterns: Avoiding the Pitfalls of Database Programming by Bill Karwin