Databases

How to choose a DB

If short:

Use MongoDB (joke)
PostgreSQL
It depends

See PyCon Ukraine 2016, lighting talks.

There is no quick and easy rule for determining which type of storage engine is better for the use case, it worth testing empirically.

As with everything else in architecture, the proper answer is "it depends".

DB types

relational databases
object-oriented databases
key/value stores
column-oriented databases
document-oriented databases
hierarchical databases
a tree, each node has one parent.
network databases (CODASYL model)
like hierarchical, but a record can have multiple perents.
The only way of accessing a record is from a root record along chains of links (access path).
map/reduce frameworks
semantic data stores
graph databases

By storage:

in memory
disk

Relational model

Proposed by Edgar Codd in 1970. Active usage - since mid-1980s (SQL, RDBMSs). The dominance of relational databases has lasted around 25-30 years.

Data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL).

Column oriented storage

Don't store all the values from one row together, but store all the values from each column together.

The column-oriented storage layour relies on each column file containing the rows in the same order. If you need to reassemble an entire row, you can take the N-th entry from each of the individual column files and put them together to form the N-th row of the table.

Network model and hierarchical model

1970s-1980s.

Object oriented model

Appeared in 1980s, 1990s.

Document oriented model

Back to hierarchical model: storing nested records (one-to-many relations).
Schema flexibility, better performance due to locality (data is not splitted across multiple tables).
For highly interconnected data, the document model is awkward.

Target use cases where data comes in self-contained documents and relationships between one document and another are rare.

XML model

2000s.

Relational model

Graph model

For highly interconnected data is the most natural model, target use cases where anything is potentially related to everything.

A graph consists of two kinds of objects: vertices (entities) and edges (relationships of arcs).

Languages: Cypher, SPARQL (a query language for triple-stores), Datalog, Gremlin.

Databases

Couchbase

Weak durability by writing to disk asynchronously.

Redis

Weak durability by writing to disk asynchronously.

AWS RedShift

Hosted version of ParAccel.

Miscelenous

Primary vs secondary indexes

Both are key: document/row pairs.
Primary index keys are unique, secondary index keys can be not unique.

Durable RAM

Battery powered RAM.

SQL-on-Hadoop projects

Apache Hive, Spark SQL, Cloudera Impala, Facebook Presto, Apache Tajo, Apache Drill.

Parquet

A columnar storage format that supports a document data model, based on Google's Dremel.

Isolation levels

# sqlalchemy
session_factory = sessionmaker(
    bind=create_engine(
        postgres_uri,
        isolation_level="REPEATABLE",
    )
)

PostgreSQL isolation levels.

Isolation Level	Dirty Read	Nonrepeatable Read	Phantom Read	Serialization Anomaly
Read uncommitted	Allowed, but not in PG	Possible	Possible	Possible
Read committed	Not possible	Possible	Possible	Possible
Repeatable read	Not possible	Not possible	Allowed, but not in PG	Possible
Serializable	Not possible	Not possible	Not possible	Not possible.

Dirty read: a transaction reads data written by a concurrent uncommitted transaction.

Nonrepeatable read: a transaction re-reads data it has previously read and finds that data has been modified by another transaction (that committed since the initial read).

Phantom read: a transaction re-executes a query returning a set of rows that satisfy a search condition and finds that the set of rows satisfying the condition has changed due to another recently-committed transaction.

Serialization anomaly: the result of successfully committing a group of transactions is inconsistent with all possible orderings of running those transactions one at a time.

Vocabulary

SQL injection

An SQL injection:

Hi, this is you sons school, we're having some computer trouble.
Oh, dear - did he break something? In a way -)
Do you really name your son "Robert'); DROP TABLE Students;"?
Oh, yes. Little Bobby tables we call him.
Well, we've lost this year's student records. I hope yo're happy.
And I hope you've learned to sanitize your database inputs.

Columnal database

The Tables Have Turned.

Vertica slogan

Levenshtein automaton

pass

Data warehouse

A separate database for analytics purposes. A trend from late 1980s and early 1990s.
The process of getting data in the warehouse is known as Extract-Transform-Load (ETL).

Star schema

Aka dimensional modeling.
Star - event and connections to dimension tables like the rays of a star.
There is a fact table represents events and dimension tables represent who, when (allows to represent additional information, like publick holidays), where, what, how and why.

Replication and Partitioning

Replication - keeping a copy of the same data on several different nodes:

latency (keep data geographically close to users)
availability
scalability (read throughput)

Variations:

single-leader
multi-leader
leaderless

Partitioning (aka sharding) - Splitting a big database into smaller subsets called partitions so that different partitions can be assigned to different nodes. Distribute large dataset across many disks, distribute query load across many processors.

Partition aka:

Shard in: MongoDB, Elasticsearch, SolrCloud.
Region in: HBase.
Tablet in: Bigtable.
Vnode in: Cassandra and Riak.
vBucket in Couchbase.

Keys distribution:

by keys range
by key hash range (hash function makes data uniformly distributed, Cassandra and MongoDB use MD5)

Service discovery in distributed databases

Knowledge about partitions may be stored in:

each node
a routing tier
in a client

Many use ZooKeeper (LinkedIn's Expresso/Helix, HBase, SolrCloud, Kafka).

Kassandra and Reak use gossip protocol (there is no external coordination service).

NoSQL

A catchy Twitter tag, retroactively reinterpreted as Not Only SQL.

Query optimizer

In a relational database, the query optimizer automatically decides which parts of the query to execute in which order, and which indexes to use.

Schemaless

schema-on-read - the structure of the data is implicit, and only interpreted when the data is read.
schema-on-write - the schema is explicit and the database ensures all written data conforms to it.

Databases index

HBase (opensource BigTable)
Ketama

Links

Designing Data-Intensive Applications by Martin Kleppmann

Licensed under CC BY-SA 3.0