Databases

How to choose a DB

If short:

  1. Use MongoDB (joke)
  2. PostgreSQL
  3. It depends

See PyCon Ukraine 2016, lighting talks.

There is no quick and easy rule for determining which type of storage engine is better for the use case, it worth testing empirically.

As with everything else in architecture, the proper answer is "it depends".

DB types

By storage:

Relational model

Proposed by Edgar Codd in 1970. Active usage - since mid-1980s (SQL, RDBMSs). The dominance of relational databases has lasted around 25-30 years.

Data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL).

Column oriented storage

Don't store all the values from one row together, but store all the values from each column together.

The column-oriented storage layour relies on each column file containing the rows in the same order. If you need to reassemble an entire row, you can take the N-th entry from each of the individual column files and put them together to form the N-th row of the table.

Network model and hierarchical model

1970s-1980s.

Object oriented model

Appeared in 1980s, 1990s.

Document oriented model

Back to hierarchical model: storing nested records (one-to-many relations).
Schema flexibility, better performance due to locality (data is not splitted across multiple tables).
For highly interconnected data, the document model is awkward.

Target use cases where data comes in self-contained documents and relationships between one document and another are rare.

XML model

2000s.

Relational model

Graph model

For highly interconnected data is the most natural model, target use cases where anything is potentially related to everything.

A graph consists of two kinds of objects: vertices (entities) and edges (relationships of arcs).

Languages: Cypher, SPARQL (a query language for triple-stores), Datalog, Gremlin.

Databases

Couchbase

Weak durability by writing to disk asynchronously.

Redis

Weak durability by writing to disk asynchronously.

See also Redis, usage examples.

AWS RedShift

Hosted version of ParAccel.

Miscelenous

Primary vs secondary indexes

Both are key: document/row pairs.
Primary index keys are unique, secondary index keys can be not unique.

Durable RAM

Battery powered RAM.

SQL-on-Hadoop projects

Apache Hive, Spark SQL, Cloudera Impala, Facebook Presto, Apache Tajo, Apache Drill.

Parquet

A columnar storage format that supports a document data model, based on Google's Dremel.

Isolation levels

# sqlalchemy
session_factory = sessionmaker(
    bind=create_engine(
        postgres_uri,
        isolation_level="REPEATABLE",
    )
)

PostgreSQL isolation levels.

Isolation Level Dirty Read Nonrepeatable Read Phantom Read Serialization Anomaly
Read uncommitted Allowed, but not in PG Possible Possible Possible
Read committed Not possible Possible Possible Possible
Repeatable read Not possible Not possible Allowed, but not in PG Possible
Serializable Not possible Not possible Not possible Not possible.

Dirty read: a transaction reads data written by a concurrent uncommitted transaction.

Nonrepeatable read: a transaction re-reads data it has previously read and finds that data has been modified by another transaction (that committed since the initial read).

Phantom read: a transaction re-executes a query returning a set of rows that satisfy a search condition and finds that the set of rows satisfying the condition has changed due to another recently-committed transaction.

Serialization anomaly: the result of successfully committing a group of transactions is inconsistent with all possible orderings of running those transactions one at a time.

Vocabulary

SQL injection

An SQL injection:

Columnal database

The Tables Have Turned.

Vertica slogan

Levenshtein automaton

pass

Data warehouse

A separate database for analytics purposes. A trend from late 1980s and early 1990s.
The process of getting data in the warehouse is known as Extract-Transform-Load (ETL).

Star schema

Aka dimensional modeling.
Star - event and connections to dimension tables like the rays of a star.
There is a fact table represents events and dimension tables represent who, when (allows to represent additional information, like publick holidays), where, what, how and why.

Replication and Partitioning

Replication - keeping a copy of the same data on several different nodes:

Variations:

Partitioning (aka sharding) - Splitting a big database into smaller subsets called partitions so that different partitions can be assigned to different nodes. Distribute large dataset across many disks, distribute query load across many processors.

Partition aka:

Keys distribution:

Service discovery in distributed databases

Knowledge about partitions may be stored in:

Many use ZooKeeper (LinkedIn's Expresso/Helix, HBase, SolrCloud, Kafka).

Kassandra and Reak use gossip protocol (there is no external coordination service).

NoSQL

A catchy Twitter tag, retroactively reinterpreted as Not Only SQL.

Query optimizer

In a relational database, the query optimizer automatically decides which parts of the query to execute in which order, and which indexes to use.

Schemaless

schema-on-read - the structure of the data is implicit, and only interpreted when the data is read.
schema-on-write - the schema is explicit and the database ensures all written data conforms to it.

Databases index

HBase (opensource BigTable)
Ketama

Designing Data-Intensive Applications by Martin Kleppmann

Licensed under CC BY-SA 3.0