response.urljoin - same as urlparse.urljoin (urllib.parse.urljoin in Python 3), but with response.url as the first argument:

def parse(self, response):
    next_page = response.urljoin('/page/2/')
    yield scrapy.Request(next_page, callback=self.parse)
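The equivalence can be checked with the stdlib directly (the base URL below is just an example standing in for response.url):

```python
from urllib.parse import urljoin

# response.urljoin('/page/2/') is shorthand for urljoin(response.url, '/page/2/')
base = 'http://quotes.toscrape.com/page/1/'  # stands in for response.url
print(urljoin(base, '/page/2/'))  # http://quotes.toscrape.com/page/2/
```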

Form request

aka POST request.

def parse(self, response):
    return scrapy.FormRequest(
        response.url,
        formdata={'username': 'john', 'password': 'secret'},
    )

Also see from_response method - returns a new FormRequest object with its form field values pre-populated with those found in the HTML form element contained in the given response.


Selectors are a higher-level interface on top of lxml. They handle broken HTML and inconsistent encodings.

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']


Concise XPath.
XPath tutorial.
Scrapy best practices on The scrapinghub blog.

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse

>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()
['good']

>>> response = HtmlResponse(url='', body=body, encoding='utf-8')
>>> response.selector.xpath('//span/text()').extract()
['good']

>>> response.selector.xpath('//span/text()').extract_first()
'good'

Parts of an expression separated by / are known as steps.
A condition inside [] is known as a predicate.
// selects matching elements anywhere in the document, not only those belonging to the current node.
@ selects attributes.

More examples:

.//text()  # extract all text
./table/tr[td]  # select only `tr`s that contain a `td`
./li[a]/parent::ul  # select `ul` that contains at least one `li` with `a` inside
./ul/li[@id="someid"]/following-sibling::li[1]  # following sibling
./ul/li[@id="someid"]/preceding-sibling::li[1]  # preceding sibling
./div[not(contains(@class,"somecls"))]  # class does not contain "somecls"
name(.)  # get current tag name
(./p | ./a)  # select `p` and `a` tags
./*[self::p or self::a]  # select `p` and `a` tags
./td/parent::tr/parent::table  # select parent elements
./../../a  # a few levels up (similar to `parent::*`)

XPath functions


Use .extract() or .extract_first().
Using re: .re(r'\d+ (.+)') or .re_first(r'\d+ (.+)').

Parsing, sanitizing, and more: w3lib - a Python library of web-related functions.




scrapy startproject myproject [project_dir]
scrapy genspider mydomain mydomain.com

Global commands: startproject, genspider, settings, runspider, shell, fetch, view, version.

Project commands (only work from inside a project): crawl, check, list, edit, parse, bench.

Running a spider:
scrapy crawl <spidername> -s CLOSESPIDER_ITEMCOUNT=10

Using proxy:

export http_proxy=<ip/host>:<port>
scrapy crawl <spidername>


from scrapy.utils.response import open_in_browser
open_in_browser(response)

from scrapy.shell import inspect_response
inspect_response(response, self)

It is possible to debug XPaths in the Google Chrome browser console using the $x helper:

$x('//span/text()')


Scrapy shell

scrapy shell ''
2017-02-12 13:50:08 [scrapy] INFO: Spider opened
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1064a6ad0>
[s]   item       {}
[s]   request    <GET>
[s]   response   <302>
[s]   settings   <scrapy.settings.Settings object at 0x1064a6a50>
[s]   spider     <DefaultSpider 'default' at 0x1084c5490>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

view and fetch functions are very useful.


import logging

import scrapy

class MySpider(scrapy.Spider):

    # ...

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        # or
        self.log("Log something ...")
        # or
        self.log("Log something ...", level=logging.INFO)

Logging levels: logging.CRITICAL, logging.ERROR, logging.WARNING, logging.INFO, logging.DEBUG.

Use LOG_LEVEL setting to specify desired logging level.

Logs output tuning:

Logs management:

Memory usage

See Debugging memory leaks.



Command line:

scrapy shell -s SOME_SETTING=VALUE


Items provide the container of scraped data, while Item Loaders provide the mechanism for populating that container.

Items and ItemLoaders suck


  1. Fields don't have any validation.
  2. Values are set/get by key: item['myvalue'] = 0.
    item.myvalue = 0 is shorter, and my editor's autocomplete works with attributes, not keys.
  3. ItemLoader's in/out methods duplicate Field.input/output_processor, and one needs to keep them in sync with the Item fields.
  4. I can pass a dict as the first argument to ItemLoader, and it will be accepted the same as an Item.
  5. Passing an item from one parser to another through request.meta looks like:
item = loader.load_item()
yield Request(next_url, meta={'item': item})  # in the first callback; next_url is illustrative

# in the next callback:
item = response.meta['item']
loader = ItemLoader(item)
loader.add_value('myfield', 1)
yield loader.load_item()

load_item (and the input/output processors) gets called twice.
6. I can't copy response.xpath(...) or response.xpath(...).re_first(...) from the scrapy console (where I debug) 1:1 into my code; it must be rewritten as add_xpath(fieldname, xpath, re). And even after copying the xpath and re, I can't be sure it behaves the same way, because input/output processors are applied on top.
7. add_value(None, {}) looks weird.

Solution: use builders instead. And schema validation.

class Builder(object):

    field1 = None
    _field2 = []

    def __init__(self, field1=None):
        self.field1 = field1
        # reset mutable attributes
        self._field2 = []

    def __setattr__(self, name, value):
        """Raise an exception if a field name was mistyped."""
        if not hasattr(self, name):
            raise AttributeError("{name} attribute does not exist.".format(name=name))
        super(Builder, self).__setattr__(name, value)

    def add_field2(self, value):
        # Any validation or formatting, if required.
        self._field2.append(value)

    def copy(self):
        # Code to return a copy of the object, if you need it.
        raise NotImplementedError

    def load(self):
        # Can use an ItemLoader and/or validation here.
        return {
            'field1': self.field1,
            'field2': self._field2,
        }

Item pipelines

Use if the problem is domain specific and the pipeline can be reused across projects.

Spider middlewares

Use if the problem is domain specific and the middleware can be reused across projects. Use to modify or drop items.

Downloader middlewares

Use for custom login or special cookies handling.



treq - an asynchronous equivalent of the requests package.
Simpler than scrapy's Request/Response.

Async DB clients


Using threads


Use locks (threading.RLock()) to guard shared global state.
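A minimal illustration of guarding shared global state with an RLock:

```python
import threading

counter = 0
lock = threading.RLock()

def work():
    global counter
    for _ in range(10000):
        with lock:  # serialize access to the shared counter
            counter += 1

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000
```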

Run executables with reactor.spawnProcess().


Plain classes that get loaded at crawl startup and can access settings, the crawler, register callbacks to signals, and define their own signals.

Close spider


Memory usage extension

Shuts down the spider when it exceeds a memory limit.
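Both the close-spider and memory-usage extensions are driven by settings; a sketch of the relevant settings.py values (the numbers are illustrative):

```python
# settings.py -- values are illustrative
CLOSESPIDER_ITEMCOUNT = 1000   # close the spider after N scraped items
CLOSESPIDER_TIMEOUT = 3600     # ...or after N seconds

MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 2048       # shut down when memory exceeds this limit
MEMUSAGE_NOTIFY_MAIL = ['you@example.com']  # optionally get a warning email
```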


Use telnet console:

telnet localhost 6023
est()  # get execution engine status

See Learning Scrapy by Dimitrios Kouzis-Loukas, "Performance" chapter.

Pipeline: scheduler -> throttler -> downloader -> spider -> item pipelines.



Modify scrapy.cfg:

[deploy]
url = http://localhost:6800
project = myproject

Then install the deploy client, deploy, and schedule a run:

pip install scrapyd-client
scrapyd-deploy
curl http://localhost:6800/schedule.json -d project=myproject -d spider=myspider

Multiple servers:

[deploy:server1]
url = http://server1:6800

[deploy:server2]
url = http://server2:6800

scrapyd-deploy server1


Default task priority is 0.
To set another priority, use the priority parameter:

curl http://localhost:6800/schedule.json -d project=myproject -d spider=myspider -d priority=1

Best practices

Avoid mounting an accidental denial-of-service attack

Use throttling, watch response time.


Look at the copyright notice of the site.


Use HttpProxyMiddleware (enabled by default) and http_proxy (https_proxy) environment variables.

Crawlera is a smart downloader designed specifically for web crawling and scraping. It allows you to crawl quickly and reliably, managing thousands of proxies internally, so you don’t have to.

Saving to a database

Batch inserts are usually more efficient.
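As a sketch, a pipeline that buffers items and writes them in batches, with sqlite3 standing in for the real database (class name, field name, and batch size are all assumptions):

```python
import sqlite3

class BatchInsertPipeline(object):
    """Buffer items and insert them in batches (sqlite3 as a stand-in DB)."""

    batch_size = 100

    def __init__(self):
        self.buffer = []
        self.conn = sqlite3.connect(':memory:')
        self.conn.execute('CREATE TABLE items (text TEXT)')

    def process_item(self, item, spider):
        self.buffer.append((item['text'],))
        if len(self.buffer) >= self.batch_size:
            self.flush()
        return item

    def flush(self):
        # executemany sends many rows in one round-trip
        self.conn.executemany('INSERT INTO items VALUES (?)', self.buffer)
        self.conn.commit()
        self.buffer = []

    def close_spider(self, spider):
        self.flush()  # don't lose the tail of the buffer
```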

Spider name

If the spider scrapes a single domain, a common practice is to name the spider after the domain, with or without the TLD. So, for example, a spider that crawls mywebsite.com would often be called mywebsite.


Set the User-Agent header to something that identifies you.



The main goal in scraping is to extract structured data from unstructured sources.



JSON Line format

.jl files have one JSON object per line, so they can be read more efficiently.
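The format is trivial to produce and consume with the stdlib (a sketch using an in-memory buffer instead of a file):

```python
import io
import json

items = [{'text': 'a'}, {'text': 'b'}]

# write: one JSON object per line
buf = io.StringIO()
for item in items:
    buf.write(json.dumps(item) + '\n')

# read back lazily, one line at a time -- no need to load the whole file
buf.seek(0)
loaded = [json.loads(line) for line in buf]
print(loaded)  # [{'text': 'a'}, {'text': 'b'}]
```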

Learning Scrapy by Dimitrios Kouzis-Loukas

Licensed under CC BY-SA 3.0