High scalability

Obstacles:

reads
writes
data size
data complexity
response time
access patterns

Tools to address them:

NoSQL
message queues
caches
search indexes
batch and stream processing frameworks

Load

Checking what happens when increasing load:

increase the load while keeping system resources unchanged, and see how performance is affected
increase the load, and see how many resources you need to add to keep performance unchanged

Describing performance:

throughput (the number of records/requests processed per unit of time)
response time (the time between the client sending a request and receiving the response)
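
A minimal sketch of measuring both numbers, assuming an in-process function call stands in for a real request. `measure` and `fake_handler` are hypothetical names, and the 1 ms sleep is an invented placeholder for real work:

```python
import statistics
import time


def measure(handler, requests):
    """Call handler once per request; return (throughput per second, median response time in seconds)."""
    durations = []
    started = time.perf_counter()
    for request in requests:
        t0 = time.perf_counter()
        handler(request)
        durations.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - started
    # Throughput: how many requests were completed per second overall.
    throughput = len(durations) / elapsed
    # Response time: summarised here by the median of per-request durations.
    return throughput, statistics.median(durations)


def fake_handler(request):
    # Hypothetical stand-in for real work (assumption: about 1 ms per request).
    time.sleep(0.001)


throughput, median_rt = measure(fake_handler, range(50))
print(f"throughput: {throughput:.0f} req/s, median response time: {median_rt * 1000:.2f} ms")
```

To check behaviour under increasing load, the same measurement could be repeated with more requests (or concurrent callers) while resources stay fixed.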

CPU

Many applications today are data-intensive. Raw CPU power is rarely a limiting factor.

Queues

Help to handle spikes, scale horizontally, and make the system more reliable.
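
A rough in-process sketch of the idea, using Python's standard `queue` and `threading` modules rather than a real message broker. `run_workers` is a hypothetical name and doubling each item is a placeholder for real work. The producer dumps a burst of tasks into the queue at once, and a fixed pool of workers drains it at its own pace:

```python
import queue
import threading


def run_workers(tasks, worker_count=3):
    """Enqueue all tasks at once (a spike) and let a fixed pool of workers drain the queue."""
    task_queue = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            item = task_queue.get()
            if item is None:  # sentinel: no more work for this worker
                task_queue.task_done()
                break
            with results_lock:
                results.append(item * 2)  # placeholder for real processing
            task_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for task in tasks:  # the spike: everything arrives immediately
        task_queue.put(task)
    for _ in threads:  # one sentinel per worker
        task_queue.put(None)
    task_queue.join()
    for thread in threads:
        thread.join()
    return sorted(results)


processed = run_workers(range(10))
print(processed)
```

Raising `worker_count` is the in-process analogue of scaling horizontally: more consumers read from the same queue without the producer changing.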

See Feeds.

Big data?

Big data - when the amount of data, or the resources needed to process it, is the current system limit.

Throughput:

low = <100/s
medium = 100-5000/s
high = >5000/s

Numbers:

Airbnb: 100k messages sent from mobile per hour
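
To compare the hourly figure with the per-second tiers above, it can be converted to an average rate. A small sketch, where `per_second` and `classify` are hypothetical helpers. Assumptions: the result is an average rather than a peak, and boundary values (exactly 100/s or 5000/s) go to the higher tier because the notes leave them undefined:

```python
def per_second(count, seconds):
    """Average rate per second for `count` events spread over `seconds`."""
    return count / seconds


def classify(rate_per_second):
    """Map a rate onto the low/medium/high tiers from these notes."""
    if rate_per_second < 100:
        return "low"
    if rate_per_second < 5000:  # assumption: exactly 100/s counts as medium
        return "medium"
    return "high"  # assumption: exactly 5000/s counts as high


airbnb_rate = per_second(100_000, 3600)  # 100k messages per hour
print(round(airbnb_rate, 1), classify(airbnb_rate))  # 27.8 low
```

So 100k messages per hour averages to roughly 28 messages per second, which is "low" on this scale even though the hourly number looks large.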

High Scalability: Building bigger, faster, more reliable websites
Data Pipeline Architect - Resources to help you with data planning and plumbing
Why You Shouldn’t Build Your Own Data Pipeline
Spark talk on PyCon Ukraine 2017 by Taras Lehinevych

Vocabulary

Data-intensive applications

Limiting factors are the amount of data, the complexity of data, the speed at which it is changing.

Data warehouse

In the late 1980s and early 1990s, there was a trend toward using a separate database for analytics.
It is safe to run analytic queries there that would often harm the performance of concurrently executing transactions if they ran on the main database.

There is also the data lake.

Compute-intensive application

Where CPU cycles are the bottleneck.

Stream processing

Send a message to another process, to be handled asynchronously.

Batch processing

Periodically crunch a large amount of accumulated data.

ETL

Extract-Transform-Load - a process of getting data into a data warehouse.
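
A toy sketch of the three steps, using two in-memory SQLite databases to stand in for the main database and the warehouse. The table names, columns, and sample rows are all invented for illustration:

```python
import sqlite3


def extract(source):
    """Extract: read raw rows from the main (transactional) database."""
    return source.execute("SELECT id, amount_cents, country FROM orders").fetchall()


def transform(rows):
    """Transform: normalise country codes and aggregate revenue per country."""
    totals = {}
    for _order_id, amount_cents, country in rows:
        key = country.strip().upper()
        totals[key] = totals.get(key, 0) + amount_cents
    return sorted(totals.items())


def load(warehouse, rows):
    """Load: write the analysis-friendly rows into the warehouse."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS revenue_by_country "
        "(country TEXT PRIMARY KEY, total_cents INTEGER)"
    )
    warehouse.executemany("INSERT OR REPLACE INTO revenue_by_country VALUES (?, ?)", rows)
    warehouse.commit()


# Invented sample data in a stand-in "main database".
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER, country TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1000, "ua"), (2, 2500, "UA "), (3, 700, "pl")],
)

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(source)))
result = warehouse.execute(
    "SELECT country, total_cents FROM revenue_by_country ORDER BY country"
).fetchall()
print(result)  # [('PL', 700), ('UA', 3500)]
```

Analytic queries then run against `warehouse` only, so they do not compete with transactions on `source`.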

Reliability

The system should continue to work correctly even in the face of adversity (hardware or software faults).

Scalability

As the system grows, there should be reasonable ways of dealing with that growth.

Vertical scaling (scaling up) - moving to a more powerful machine.
Horizontal scaling (scaling out) - distributing the load across multiple machines.

Be pragmatic

Using several fairly powerful machines can still be simpler and cheaper than a large number of small virtual machines.

Maintainability

Over time, many different people will work on the system, and they should be able to work on it productively.

Latency vs response time

Response time - what a client sees (includes network and queuing delays).

Latency - the time a request spends waiting to be handled (during this period the request is latent).
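
A small worked example of the difference, assuming a single server that handles requests one at a time in arrival order. `simulate`, the fixed service time, and the fixed network delay are all simplifying assumptions. Here latency is the waiting time, and response time adds service time and network delay on top of it:

```python
def simulate(arrivals, service_time, network_delay):
    """Return (latency, response_time) per request for one FIFO server.

    arrivals: times at which requests reach the server, in ascending order.
    service_time: time the server spends handling one request (assumed constant).
    network_delay: total round-trip network time per request (assumed constant).
    """
    results = []
    server_free_at = 0.0
    for arrival in arrivals:
        start = max(arrival, server_free_at)  # wait if the server is still busy
        latency = start - arrival  # time spent waiting to be handled
        server_free_at = start + service_time
        response_time = latency + service_time + network_delay  # what the client sees
        results.append((latency, response_time))
    return results


# Three requests arrive at the same moment; each takes 2 s to handle, plus 1 s of network time.
timings = simulate(arrivals=[0.0, 0.0, 0.0], service_time=2.0, network_delay=1.0)
print(timings)  # [(0.0, 3.0), (2.0, 5.0), (4.0, 7.0)]
```

The service time is identical for all three requests, yet the last client sees 7 s instead of 3 s purely because of queuing.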

MapReduce

MapReduce is a programming model for processing large amounts of data in bulk across many machines.

MapReduce is neither a declarative query language nor a fully imperative query API, but somewhere in between.
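
A single-process sketch of the model, using word count as the usual illustration. The function names are hypothetical, and a real framework would run the map and reduce steps in parallel across many machines. The caller supplies only the two small functions (`map_phase`, `reduce_phase`), while grouping by key is handled by the framework, which is what puts the model between declarative and imperative:

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (key, value) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key (done by the framework in a real system)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    pairs = [pair for document in documents for pair in map_phase(document)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = map_reduce(["big data big load", "data load data"])
print(counts)  # {'big': 2, 'data': 3, 'load': 2}
```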

Links

High Scalability: Building bigger, faster, more reliable websites by Todd Hoff
Apache Kafka talk on PyCon Ukraine 2017 by Taras Voinarovskyy

Licensed under CC BY-SA 3.0