GBlog: new blogging engine, new blog

See http://nanvel.com
GitHub https://github.com/nanvel/gblog

For geeks like me.

Features:

Written in Python using Tornado


All the main Python code is about 200 lines.

Easy for local development


Just run python app.py and open localhost:5000 in a browser to check how your new post looks.

Minimal requirements


tornado>=4.0
redis
arrow
docutils
pygments

Easy to deploy


I spent about 30 minutes installing and configuring everything on a new VPS: nginx, Redis, Python packages.

RST syntax


http://docutils.sourceforge.net/rst.html
If you are familiar with Python, you are probably familiar with RST too.

Custom RST directives


Directives for videos, blockquotes and more. Adding the directives you need is pretty easy.
See http://nanvel.com/#b=1416086820&l=1
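For illustration, here is a minimal sketch of how a custom directive can be registered with docutils (the "video" directive name and the generated markup are assumptions, not GBlog's actual code):
# Sketch: registering a custom "video" directive with docutils.
# The directive name and the iframe markup are illustrative assumptions.
from docutils import nodes
from docutils.parsers.rst import Directive, directives


class Video(Directive):
    """Usage: .. video:: <youtube_id>"""
    required_arguments = 1

    def run(self):
        video_id = self.arguments[0]
        html = '<iframe src="https://www.youtube.com/embed/{id}"></iframe>'.format(id=video_id)
        return [nodes.raw('', html, format='html')]

directives.register_directive('video', Video)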

Fast


It doesn't use any relational database: all posts are stored in files and Redis, and it doesn't render pages on every request. Redis ZREVRANGEBYSCORE is used for pagination.
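As a rough illustration of that kind of pagination (the 'posts' key name and the redis-py client are assumptions, not necessarily GBlog's actual code):
# Sketch: paginating post ids by timestamp with ZREVRANGEBYSCORE.
import redis

conn = redis.StrictRedis()

def posts_page(last_timestamp, limit=10):
    # newest first: posts strictly older than last_timestamp, at most `limit` of them
    return conn.zrevrangebyscore(
        'posts', '({0}'.format(last_timestamp), '-inf',
        start=0, num=limit, withscores=True)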

No special editor required


Edit content in your favourite editor or directly on GitHub.

Safe, no need to worry about losing your notes


I have at least 3 copies of my posts: on my laptop, on the VPS and on GitHub.

Select my friends' activities, the NoSQL way

Related to: http://stackoverflow.com/questions/26820983/get-my-friends-activities-using-redis-redis-join-alternative

Goal: understand whether it makes sense to use Redis here, or whether sticking with Postgres is the right choice.

The task

We have two tables:

Activities (user_id, activity_id, timestamp)
Friends (user_id, friend_id)

We need to get a paginated list of friends' activities for a specified user_id. In SQL it looks like:
SELECT act.activity_id, act.timestamp from activities act
JOIN friends fr ON fr.friend_id=act.user_id AND fr.user_id='{user_id}'
WHERE act.timestamp < {last}
ORDER BY act.timestamp DESC
LIMIT {limit};
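A rough sketch of running this query from Python with psycopg2 (the connection parameters are assumptions):
# Sketch: keyset pagination over friends' activities with psycopg2.
import psycopg2

conn = psycopg2.connect(dbname='test')

def friends_activities(user_id, last, limit=10):
    with conn.cursor() as cur:
        cur.execute("""
            SELECT act.activity_id, act.timestamp FROM activities act
            JOIN friends fr ON fr.friend_id = act.user_id AND fr.user_id = %s
            WHERE act.timestamp < %s
            ORDER BY act.timestamp DESC
            LIMIT %s;
        """, (user_id, last, limit))
        return cur.fetchall()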

Let's try to use Redis for this task. The simplified plan is as follows (a sketch of populating this data follows the list):
  1. fill a set of the user's friends in the Redis database ('friends:{user_id}')
  2. and a zset of user_ids sorted by last activity ('activities')
  3. and the user's activities sorted by timestamp ('activities:{user_id}')
  4. ZINTERSTORE 'test:activities' and 'test:friends:{user_id}' -> friends
  5. 'tmp:{user_id}' = []
  6. for friend_id, timestamp in friends:
  7. ZUNIONSTORE 'tmp:{user_id}' and 'activities:{friend_id}' -> 'tmp:{user_id}'
  8. if len(ZRANGEBYSCORE 'tmp:{user_id}' timestamp, last) >= limit: break
  9. endfor
  10. return ZRANGE 'tmp:{user_id}', 0, limit
  11. del 'tmp:{user_id}'
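
A rough Python sketch of populating this data with redis-py (key names follow the plan above; the Lua script below uses a 'test:' prefix and slightly different names):
# Sketch: filling the structures from the plan, using redis-py 3.x zadd(mapping).
import redis

conn = redis.StrictRedis()

def add_friend(user_id, friend_id):
    conn.sadd('friends:{}'.format(user_id), friend_id)

def add_activity(user_id, activity_id, timestamp):
    # per-user activities sorted by timestamp
    conn.zadd('activities:{}'.format(user_id), {activity_id: timestamp})
    # user_ids sorted by their last activity
    conn.zadd('activities', {user_id: timestamp})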

To do all of this on the Redis side I wrote a Lua script:
def search(self, user, last, limit):
    SCRIPT = """
    redis.call("ZINTERSTORE", "test:tmp:" .. ARGV[1], 2, "test:last_user_activity", "test:friends:" .. ARGV[1], "AGGREGATE", "MAX")
    local users = redis.call("ZREVRANGE", "test:tmp:" .. ARGV[1], 0, -1, "WITHSCORES")
    if users == nil then
        return {}
    end
    redis.call("DEL", "test:tmp:" .. ARGV[1])
    local counter = 0
    local lastval = users[1]
    for k, v in pairs(users) do
        if (counter % 2 == 0) then
            lastval = v
        else
            redis.call("ZUNIONSTORE", "test:tmp:" .. ARGV[1], 2, "test:tmp:" .. ARGV[1], "test:user_activities:" .. lastval, "AGGREGATE", "MAX")
            redis.call("ZREMRANGEBYSCORE", "test:tmp:" .. ARGV[1], ARGV[2], ARGV[3])
            if redis.call("ZCOUNT", "test:tmp:" .. ARGV[1], v, ARGV[2]) >= tonumber(ARGV[4]) then break end
        end
        counter = counter + 1
    end
    local users = redis.call("ZREVRANGE", "test:tmp:" .. ARGV[1], 0, ARGV[4] - 1)
    redis.call("DEL", "test:tmp:" .. ARGV[1])
    return users
    """
    return self.conn.eval(SCRIPT, 0, user, last, get_timestamp(), limit)

The full script and its output are in the gist: https://gist.github.com/nanvel/8725b9c71c0040b0472b

Briefly about results

Redis vs PostgreSQL, both running on my laptop.

Postgres tables and indexes:
DROP TABLE IF EXISTS activities;
DROP TABLE IF EXISTS friends;
CREATE TABLE activities (
    id SERIAL,
    user_id VARCHAR(100),
    activity_id VARCHAR(100),
    timestamp BIGSERIAL
);
CREATE TABLE friends (
    id SERIAL,
    user_id VARCHAR(100),
    friend_id VARCHAR(100)
);
CREATE INDEX activities_user_id_index ON activities (user_id);
CREATE INDEX activities_timestamp_index ON activities (timestamp);
CREATE INDEX friends_user_id_index ON friends (user_id);
CREATE INDEX friends_friend_id_index ON friends (friend_id);

Activities count: 30000
Friends count: 25000
My friends count: 15000

Activities per page: 10
Page 1: 0.161883 s for postgres vs 0.025598 s for redis.
Page 2: 0.203902 s for postgres vs 0.026051 s for redis.
Page 10: 0.149319 s for postgres vs 0.048609 s for redis.

How do you solve problems similar to the one described above? Is Redis a good fit for this task?
Any thoughts on whether a graph database would solve the problem?

Amazon CloudSearch spike project

Gist: https://gist.github.com/nanvel/4f7696174ac3a9b3554c
"""
Search bebop series.
"""
import arrow
import json

from tornado import options
from tornado.httpclient import HTTPError, HTTPClient, HTTPRequest
from tornado_botocore import Botocore
from tvs import TVS


DOMAIN_NAME = 'test-bebop-domain'
API_VERSION = '2013-01-01'


if __name__ == '__main__':
    options.parse_command_line()
    # create domain
    cs_create_domain = Botocore(
        service='cloudsearch', operation='CreateDomain',
        region_name='us-west-2')
    session = cs_create_domain.session
    try:
        # create domain, domain will be reused if already exists
        print cs_create_domain.call(domain_name=DOMAIN_NAME)
        # { 
        #    "DomainStatus":{ 
        #       "DomainId":"240020657974/test-bebop-domain",
        #       "Created":true,
        #       "SearchService":{},
        #       "SearchInstanceCount":0,
        #       "DomainName":"test-bebop-domain",
        #       "DocService":{},
        #       "Deleted":false,
        #       "Processing":false,
        #       "RequiresIndexDocuments":false,
        #       "ARN":"arn:aws:cloudsearch:us-west-2:240020657974:domain/test-bebop-domain",
        #       "SearchPartitionCount":0
        #    },
        #    "ResponseMetadata":{ 
        #       "RequestId":"38b0cba7-60f2-11e4-980e-6d6976ea3108"
        #    }
        # }
    except HTTPError as e:
        print e.response.body
    # configure fields
    cs_define_index_field = Botocore(
        service='cloudsearch', operation='DefineIndexField',
        region_name='us-west-2', session=session)
    # Fields:
    # - title - text + show in result
    # - airdate - uint
    # - genre - literal + facet enabled (or literal-array?)
    # - content - text
    FIELDS = [{
        'DomainName': DOMAIN_NAME,
        'IndexField': {
            'IndexFieldName': 'title',
            'IndexFieldType': 'text',
            'TextOptions': {
                'HighlightEnabled': False,
                'DefaultValue': 'untitled',
                'ReturnEnabled': True,
            }
        }
    }, {
        'DomainName': DOMAIN_NAME,
        'IndexField': {
            'IndexFieldName': 'content',
            'IndexFieldType': 'text',
            'TextOptions': {
                'HighlightEnabled': False,
                'DefaultValue': '',
                'ReturnEnabled': False,
            }
        }
    }, {
        'DomainName': DOMAIN_NAME,
        'IndexField': {
            'IndexFieldName': 'airdate',
            'IndexFieldType': 'int',
            'IntOptions': {
                'DefaultValue': 946684800,
            }
        }
    }, {
        'DomainName': DOMAIN_NAME,
        'IndexField': {
            'IndexFieldName': 'genre',
            'IndexFieldType': 'literal-array',
            'LiteralArrayOptions': {
                'DefaultValue': '',
                'FacetEnabled': True,
                'ReturnEnabled': False,
                'SearchEnabled': True,
            }
        }
    }]
    try:
        for params in FIELDS:
            print cs_define_index_field.call(**params)
    except HTTPError as e:
        print e.response.body
    # add data
    batch = []
    for tv in TVS:
        batch.append({
            'type': 'add', 'id': tv['number'],
            'fields': {
                'title': tv['title'],
                'content': tv['content'],
                'airdate': arrow.get(tv['airdate'], ['YYYY-MM-DD', 'MMMM D, YYYY']).timestamp,
                'genre': tv['genre'],
            }
        })
    # get document and search endpoints
    cs_describe_domains = Botocore(
        service='cloudsearch', operation='DescribeDomains',
        region_name='us-west-2', session=session)
    response = cs_describe_domains.call(domain_names=[DOMAIN_NAME])
    # { 
    #    "DomainStatusList":[ 
    #       { 
    #          "DomainId":"240020657974/test-bebop-domain",
    #          "Created":true,
    #          "SearchService":{ 
    #             "Endpoint":"search-test-bebop-domain-kmvxd5zzot4opij6zvb6okvrma.us-west-2.cloudsearch.amazonaws.com"
    #          },
    #          "SearchInstanceCount":1,
    #          "DomainName":"test-bebop-domain",
    #          "DocService":{ 
    #             "Endpoint":"doc-test-bebop-domain-kmvxd5zzot4opij6zvb6okvrma.us-west-2.cloudsearch.amazonaws.com"
    #          },
    #          "SearchInstanceType":"search.m1.small",
    #          "Deleted":false,
    #          "Processing":false,
    #          "RequiresIndexDocuments":true,
    #          "ARN":"arn:aws:cloudsearch:us-west-2:240020657974:domain/test-bebop-domain",
    #          "SearchPartitionCount":1
    #       }
    #    ],
    #    "ResponseMetadata":{ 
    #       "RequestId":"7993ac9b-6101-11e4-8510-8ffcccb94f21"
    #    }
    # }
    search_endpoint = response['DomainStatusList'][0]['SearchService']['Endpoint']
    document_endpoint = response['DomainStatusList'][0]['DocService']['Endpoint']
    httpclient = HTTPClient()
    # reindex
    cs_index_documents = Botocore(
        service='cloudsearch', operation='IndexDocuments',
        region_name='us-west-2', session=session)
    print cs_index_documents.call(domain_name=DOMAIN_NAME)
    # wait until reindexing completes
    # add documents
    url = 'http://{document_endpoint}/{api_version}/documents/batch'.format(
        document_endpoint=document_endpoint,
        api_version=API_VERSION)
    try:
        request = HTTPRequest(
            url=url, body=json.dumps(batch),
            headers={'Content-Type': 'application/json'}, method='POST')
        request.params = None
        cs_describe_domains.endpoint.auth.add_auth(request=request)
        response = httpclient.fetch(request=request)
        print response.body
    except HTTPError as e:
        print e.response.body
    # search
    url = 'http://{search_endpoint}/{api_version}/search?q=bebop'.format(
        search_endpoint=search_endpoint, api_version=API_VERSION)
    request = HTTPRequest(
        url=url, headers={'Content-Type': 'application/json'},
        method='GET')
    request.params = None
    cs_describe_domains.endpoint.auth.add_auth(request=request)
    response = httpclient.fetch(request=request)
    print response.body
    # { 
    #    "status":{ 
    #       "rid":"st/UtJYpAAoghec=",
    #       "time-ms":82
    #    },
    #    "hits":{ 
    #       "found":12,
    #       "start":0,
    #       "hit":[ 
    #          { 
    #             "id":"3",
    #             "fields":{ 
    #                "airdate":"910396800",
    #                "title":"Honky Tonk Women"
    #             }
    #          },
    #          { 
    #             "id":"18",
    #             "fields":{ 
    #                "airdate":"920073600",
    #                "title":"Speak Like a Child"
    #             }
    #          },
    #          ...
    #       ]
    #    }
    # }

Tornado and asynchronous DynamoDB example project

GitHub:

https://github.com/nanvel/bebop

It uses the low-level API to access DynamoDB directly, for example:
class DDBEpisode(DDBBase):

    @gen.coroutine
    def create(self, **kwargs):
        ddb_put_item = self.dynamodb(operation='PutItem')
        res = yield gen.Task(ddb_put_item.call,
            table_name=self.TABLE_NAME,
            item=self.with_types(kwargs))
        raise gen.Return(res)
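
A sketch of how such a coroutine could be called from a Tornado handler (the handler and the arguments passed to create are assumptions, not necessarily the project's actual code):
from tornado import gen, web

class EpisodeHandler(web.RequestHandler):

    @gen.coroutine
    def post(self):
        # DDBEpisode.create is the coroutine shown above;
        # the constructor and field names here are illustrative assumptions
        episode = DDBEpisode()
        result = yield episode.create(
            number=self.get_argument('number'),
            title=self.get_argument('title'))
        self.write(result)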

And it uses DynamoDB Local, so you don't need the real Amazon infrastructure for your experiments (I accidentally lost $180 because I forgot to remove a few DynamoDB tables, be careful!).

See the Amazon documentation for the full API description:

http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/Welcome.html

You can start your experiments easily:
git clone https://github.com/nanvel/bebop
cd bebop/bin
# http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Tools.DynamoDBLocal.html
wget http://dynamodb-local.s3-website-us-west-2.amazonaws.com/dynamodb_local_latest
tar -xvf dynamodb_local_latest
rm dynamodb_local_latest
cd ..
make dynamo
python app.py
python example.py