AutoThrottle extension
This is an extension for automatically throttling crawling speed based on the load of both the Scrapy server and the website you are crawling.
Design goals
be nicer to sites instead of using the default download delay of zero;
automatically adjust Scrapy to the optimum crawling speed, so the user doesn’t have to tune the download delays to find the optimum one. The user only needs to specify the maximum number of concurrent requests allowed, and the extension does the rest.
How it works
The AutoThrottle extension adjusts download delays dynamically to make the spider send AUTOTHROTTLE_TARGET_CONCURRENCY concurrent requests on average to each remote website.
It uses download latency to compute the delays. The main idea is the following: if a server needs latency seconds to respond, a client should send a request every latency/N seconds to have N requests processed in parallel.
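The arithmetic behind this idea can be sketched in a few lines. The function name `request_interval` is hypothetical and not part of Scrapy; it only illustrates the latency/N relationship described above.

```python
def request_interval(latency: float, n: float) -> float:
    """Seconds to wait between requests so that about n are in flight at once.

    Hypothetical helper for illustration only; not a Scrapy API.
    """
    if latency < 0 or n <= 0:
        raise ValueError("latency must be non-negative and n must be positive")
    return latency / n


# A server that takes 2.0 s to respond, with a target of 4 parallel requests:
print(request_interval(2.0, 4))  # 0.5
```

With a target below 1 (for example 0.5), the interval becomes longer than the latency itself, so on average less than one request is in flight.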
Instead of adjusting the delays, one can just set a small fixed download delay and impose hard limits on concurrency using the CONCURRENT_REQUESTS_PER_DOMAIN or CONCURRENT_REQUESTS_PER_IP options. This provides a similar effect, but there are some important differences:
because the download delay is small, there will be occasional bursts of requests;
non-200 (error) responses can often be returned faster than regular responses, so with a small download delay and a hard concurrency limit the crawler will send requests to the server faster when the server starts to return errors. This is the opposite of what a crawler should do: in case of errors it makes more sense to slow down, since these errors may be caused by the high request rate.
AutoThrottle doesn’t have these issues.
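The second difference can be illustrated with a rough back-of-the-envelope model. It assumes a negligible download delay and that every finished request is replaced immediately, so the request rate is approximately the concurrency limit divided by the response time. The function name and the numbers are illustrative assumptions, not measurements.

```python
def requests_per_second(concurrency_limit: int, response_time: float) -> float:
    """Approximate request rate when a fixed number of slots is always kept busy.

    Simplified model for illustration: ignores download delay and scheduling overhead.
    """
    return concurrency_limit / response_time


# Healthy server: regular responses take about 1.0 s each.
print(requests_per_second(8, 1.0))   # 8.0
# Overloaded server: error responses come back in about 0.25 s each.
print(requests_per_second(8, 0.25))  # 32.0
```

Under this model the crawler sends four times as many requests per second exactly when the server is struggling, which is the behaviour AutoThrottle is designed to avoid.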
Throttling algorithm
The AutoThrottle algorithm adjusts download delays based on the following rules:
spiders always start with a download delay of AUTOTHROTTLE_START_DELAY;
when a response is received, the target download delay is calculated as latency / N, where latency is the latency of the response and N is AUTOTHROTTLE_TARGET_CONCURRENCY;
the download delay for the next requests is set to the average of the previous download delay and the target download delay;
latencies of non-200 responses are not allowed to decrease the delay;
the download delay can’t become less than DOWNLOAD_DELAY or greater than AUTOTHROTTLE_MAX_DELAY.
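The rules above can be sketched as a single update function. This is a minimal illustration of the documented rules only, not Scrapy's actual implementation, which may differ in detail; the name `next_delay` and its parameters are hypothetical, with defaults loosely mirroring the settings named in the rules.

```python
def next_delay(
    previous_delay: float,
    latency: float,
    status: int,
    target_concurrency: float = 1.0,  # stands in for AUTOTHROTTLE_TARGET_CONCURRENCY
    min_delay: float = 0.0,           # stands in for DOWNLOAD_DELAY
    max_delay: float = 60.0,          # stands in for AUTOTHROTTLE_MAX_DELAY
) -> float:
    """Return the download delay to use after one response (illustrative sketch)."""
    # Target delay is latency / N.
    target_delay = latency / target_concurrency
    # New delay is the average of the previous delay and the target delay.
    new_delay = (previous_delay + target_delay) / 2.0
    # Keep the delay within [min_delay, max_delay].
    new_delay = min(max(min_delay, new_delay), max_delay)
    # A non-200 response is not allowed to decrease the delay.
    if status != 200 and new_delay <= previous_delay:
        return previous_delay
    return new_delay


delay = 5.0  # stands in for AUTOTHROTTLE_START_DELAY
delay = next_delay(delay, latency=1.0, status=200)   # (5.0 + 1.0) / 2 = 3.0
delay = next_delay(delay, latency=1.0, status=200)   # (3.0 + 1.0) / 2 = 2.0
delay = next_delay(delay, latency=0.2, status=404)   # fast error: stays at 2.0
delay = next_delay(delay, latency=10.0, status=500)  # slow error: rises to 6.0
print(delay)  # 6.0
```

Because each step averages the old delay with the new target, a single unusually fast or slow response moves the delay only halfway toward its target.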
Note
The AutoThrottle extension honours the standard Scrapy settings for concurrency and delay. This means that it will respect the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP options and never set a download delay lower than DOWNLOAD_DELAY.
In Scrapy, the download latency is measured as the time elapsed between establishing the TCP connection and receiving the HTTP headers.
Note that these latencies are very hard to measure accurately in a cooperative multitasking environment because Scrapy may be busy processing a spider callback, for example, and unable to attend to downloads. However, these latencies should still give a reasonable estimate of how busy Scrapy (and ultimately, the server) is, and this extension builds on that premise.
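Settings
The settings used to control the AutoThrottle extension are:
AUTOTHROTTLE_ENABLED
AUTOTHROTTLE_START_DELAY
AUTOTHROTTLE_MAX_DELAY
AUTOTHROTTLE_TARGET_CONCURRENCY
AUTOTHROTTLE_DEBUG
CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS_PER_IP
DOWNLOAD_DELAY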
For more information see How it works.
AUTOTHROTTLE_ENABLED
Default: False
Enables the AutoThrottle extension.
AUTOTHROTTLE_START_DELAY
Default: 5.0
The initial download delay (in seconds).
AUTOTHROTTLE_MAX_DELAY
Default: 60.0
The maximum download delay (in seconds) to be set in case of high latencies.
AUTOTHROTTLE_TARGET_CONCURRENCY
Default: 1.0
Average number of requests Scrapy should be sending in parallel to remote websites.
By default, AutoThrottle adjusts the delay to send a single concurrent request to each of the remote websites. Set this option to a higher value (e.g. 2.0) to increase the throughput and the load on remote servers. A lower AUTOTHROTTLE_TARGET_CONCURRENCY value (e.g. 0.5) makes the crawler more conservative and polite.
Note that the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP options are still respected when the AutoThrottle extension is enabled. This means that if AUTOTHROTTLE_TARGET_CONCURRENCY is set to a value higher than CONCURRENT_REQUESTS_PER_DOMAIN or CONCURRENT_REQUESTS_PER_IP, the crawler won’t reach this number of concurrent requests.
At any given point in time, Scrapy may be sending more or fewer concurrent requests than AUTOTHROTTLE_TARGET_CONCURRENCY; it is a suggested value the crawler tries to approach, not a hard limit.
AUTOTHROTTLE_DEBUG
Default: False
Enables AutoThrottle debug mode, which displays stats for every response received, so you can see how the throttling parameters are being adjusted in real time.
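Putting the settings together, a project's settings.py might contain an excerpt like the one below. The setting names are the ones documented on this page; the values are illustrative assumptions chosen for the example, not recommendations.

```python
# settings.py (excerpt): illustrative values only

AUTOTHROTTLE_ENABLED = True            # the extension is disabled by default
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound for the delay, in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # aim for about two parallel requests per site
AUTOTHROTTLE_DEBUG = True              # show throttling stats for every response

# Standard settings that AutoThrottle still respects:
DOWNLOAD_DELAY = 0.5                   # the delay never drops below this value
CONCURRENT_REQUESTS_PER_DOMAIN = 8     # hard cap; keep it above the target concurrency
```

Keeping CONCURRENT_REQUESTS_PER_DOMAIN above AUTOTHROTTLE_TARGET_CONCURRENCY leaves room for the crawler to actually reach the target.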