Scheduler

The scheduler component receives requests from the engine and stores them in persistent and/or non-persistent data structures. It also feeds those requests back to the engine when the engine asks for the next request to download.

Overriding the default scheduler

You can use your own custom scheduler class by supplying its full Python path in the SCHEDULER setting.
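
As a minimal sketch, assuming a hypothetical project package named myproject with a scheduler class MyScheduler defined in myproject/scheduler.py, the setting could look like this in settings.py:

```python
# settings.py
# "myproject.scheduler.MyScheduler" is a hypothetical import path used for
# illustration; replace it with the path of your own scheduler class.
SCHEDULER = "myproject.scheduler.MyScheduler"
```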

Minimal scheduler interface

class scrapy.core.scheduler.BaseScheduler

The scheduler component is responsible for storing requests received from the engine, and feeding them back upon request (also to the engine).

The original sources of said requests are:

Spider: start method, requests created for URLs in the start_urls attribute, request callbacks
Spider middleware: process_spider_output and process_spider_exception methods
Downloader middleware: process_request, process_response and process_exception methods

The order in which the scheduler returns its stored requests (via the next_request method) plays a great part in determining the order in which those requests are downloaded. See Request order.

The methods defined in this class constitute the minimal interface that the Scrapy engine will interact with.

close(reason: str) → Deferred[None] | None

Called when the spider is closed by the engine. It receives the reason why the crawl finished as an argument and is useful for running cleanup code.

Parameters: reason (str) – a string that describes the reason why the spider was closed

abstractmethod enqueue_request(request: Request) → bool

Process a request received by the engine.

Return True if the request is stored correctly, False otherwise.

If False, the engine will fire a request_dropped signal, and will not make further attempts to schedule the request at a later time. For reference, the default Scrapy scheduler returns False when the request is rejected by the dupefilter.

classmethod from_crawler(crawler: Crawler) → Self: Factory method which receives the current Crawler object as argument.

abstractmethod has_pending_requests() → bool: True if the scheduler has enqueued requests, False otherwise.

abstractmethod next_request() → Request | None

Return the next Request to be processed, or None to indicate that there are no requests to be considered ready at the moment.

Returning None implies that no request from the scheduler will be sent to the downloader in the current reactor cycle. The engine will continue calling next_request until has_pending_requests is False.

open(spider: Spider) → Deferred[None] | None

Called when the spider is opened by the engine. It receives the spider instance as an argument and is useful for running initialization code.

Parameters: spider (Spider) – the spider object for the current crawl
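
The interface above can be sketched as a small in-memory FIFO scheduler. This is only an illustration under stated assumptions: SimpleFifoScheduler and FakeRequest are hypothetical names, the duplicate check looks at the URL only, and the class does not import Scrapy so that it can be run on its own. In a real project it would subclass scrapy.core.scheduler.BaseScheduler.

```python
from collections import deque


class FakeRequest:
    """Hypothetical stand-in for a Scrapy request, exposing only ``url``."""

    def __init__(self, url):
        self.url = url


class SimpleFifoScheduler:
    """In-memory FIFO scheduler exposing the minimal interface."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    @classmethod
    def from_crawler(cls, crawler):
        # A real scheduler could read settings from the crawler here.
        return cls()

    def open(self, spider):
        # Initialization code goes here; returning None is allowed.
        return None

    def close(self, reason):
        # Cleanup code goes here; this sketch simply discards pending requests.
        self._queue.clear()
        return None

    def has_pending_requests(self):
        return len(self._queue) > 0

    def enqueue_request(self, request):
        # Simplified duplicate check keyed on the request URL only.
        if request.url in self._seen:
            return False  # the engine would fire request_dropped
        self._seen.add(request.url)
        self._queue.append(request)
        return True

    def next_request(self):
        # None tells the engine that nothing is ready right now.
        return self._queue.popleft() if self._queue else None
```

Requests come back in the order they were stored, and a repeated URL is rejected with False.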

Default scheduler

class scrapy.core.scheduler.Scheduler

Default scheduler.

Requests are stored into priority queues (SCHEDULER_PRIORITY_QUEUE) that sort requests by priority.

By default, memory-based priority queues are used for all requests. When using JOBDIR, disk-based priority queues are also created, and only unserializable requests are stored in the memory-based priority queues. For a given priority value, requests in memory take precedence over requests on disk.

Each priority queue stores requests in separate internal queues, one per priority value. The memory priority queue uses SCHEDULER_MEMORY_QUEUE queues, while the disk priority queue uses SCHEDULER_DISK_QUEUE queues. The internal queues determine request order when requests have the same priority. Start requests are stored into separate internal queues by default, and ordered differently.

Duplicate requests are filtered out with an instance of DUPEFILTER_CLASS.

Request order

With default settings, pending requests are stored in a LIFO queue (except for start requests). As a result, crawling happens in depth-first order (DFO), which is usually the most convenient crawl order. However, you can enforce breadth-first order (BFO) or a custom order (except for the first few requests).
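
The effect of the queue discipline on crawl order can be previewed with a toy model. The sketch below is not Scrapy code: LINKS is a hypothetical site structure, crawl_order is a hypothetical helper, and concurrency, priorities and the special handling of start requests are all ignored.

```python
from collections import deque

# Hypothetical site structure: each page maps to the links found on it.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
}


def crawl_order(start, lifo):
    """Return the order in which pages are visited.

    lifo=True pops the most recently stored request (depth-first);
    lifo=False pops the oldest stored request (breadth-first).
    """
    pending = deque([start])
    visited = []
    while pending:
        page = pending.pop() if lifo else pending.popleft()
        visited.append(page)
        pending.extend(LINKS.get(page, []))
    return visited
```

With lifo=True the model finishes one branch before moving to the next, while lifo=False visits every page at one depth before going deeper.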

Start request order

Start requests are sent in the order they are yielded from start(), and given the same priority, other requests take precedence over start requests.

You can set SCHEDULER_START_MEMORY_QUEUE and SCHEDULER_START_DISK_QUEUE to None to handle start requests the same as other requests when it comes to order and priority.
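
A settings.py fragment for that configuration might look like the following sketch, which uses only the two settings named above:

```python
# settings.py
# Store start requests in the same internal queues as any other request.
SCHEDULER_START_MEMORY_QUEUE = None
SCHEDULER_START_DISK_QUEUE = None
```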

Crawling in BFO order

To crawl in BFO order, use the following settings:

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"

Crawling in a custom order

You can manually set priority on requests to force a specific request order.
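
One way to organise this is a small function that maps each URL to a priority value. The sketch below is an assumption-laden illustration: priority_for and its URL rules are hypothetical, and the code only sorts URLs to preview the resulting order. In a spider, the value would be passed through the priority argument of the request, where a higher value means the request is scheduled earlier.

```python
def priority_for(url):
    """Hypothetical rule: product pages first, then categories, then the rest."""
    if "/product/" in url:
        return 20
    if "/category/" in url:
        return 10
    return 0


urls = [
    "https://example.com/about",
    "https://example.com/category/books",
    "https://example.com/product/42",
]

# In a spider callback the value would accompany the request, for example
# scrapy.Request(url, priority=priority_for(url)).
# Here the URLs are only sorted to preview the resulting order.
ordered = sorted(urls, key=priority_for, reverse=True)
```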

Concurrency affects order

While pending requests are below the configured values of CONCURRENT_REQUESTS or CONCURRENT_REQUESTS_PER_DOMAIN, those requests are sent concurrently.

As a result, the first few requests of a crawl may not follow the desired order. Lowering those settings to 1 enforces the desired order except for the very first request, but it significantly slows down the crawl as a whole.
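
As a sketch, the corresponding settings.py fragment would be:

```python
# settings.py
# Send one request at a time so that scheduler order is preserved,
# at the cost of a much slower crawl.
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```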

Job directory contents

Warning

The files that this class generates in the job directory are an implementation detail, and may change without a warning in a future version of Scrapy. Do not rely on the following information for anything other than debugging purposes.

When using JOBDIR, this scheduler class:

Creates a directory named requests.queue inside the job directory, meant to keep track of all requests stored in the scheduler (i.e. not downloaded yet).
Generates inside that directory an active.json file with a JSON representation of the state (startprios) of SCHEDULER_PRIORITY_QUEUE. The file is generated whenever the job stops (cleanly) and is loaded when resuming the job.
Instantiates the configured SCHEDULER_PRIORITY_QUEUE with requests.queue/ as persistence directory (key) and SCHEDULER_DISK_QUEUE as downstream_queue_cls. The priority queue may create additional files and directories inside that directory, directly or through instances of SCHEDULER_DISK_QUEUE.

This scheduler class also uses the configured DUPEFILTER_CLASS, which may also write data inside the job directory.
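
For debugging only, the saved state could be inspected with a few lines of standard-library code. The helper below, read_scheduler_state, is a hypothetical name, and it assumes nothing about the file beyond the location described above and the fact that it contains JSON.

```python
import json
from pathlib import Path


def read_scheduler_state(jobdir):
    """Return the parsed active.json of a job directory, or None if absent.

    Debugging aid only: the file layout is an implementation detail.
    """
    state_file = Path(jobdir) / "requests.queue" / "active.json"
    if not state_file.exists():
        return None
    with state_file.open(encoding="utf-8") as f:
        return json.load(f)
```

The function returns None for a job directory whose crawl has not stopped cleanly yet, since the file is only written at that point.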

__init__(dupefilter: BaseDupeFilter, jobdir: str | None = None, dqclass: type[BaseQueue] | None = None, mqclass: type[BaseQueue] | None = None, logunser: bool = False, stats: StatsCollector | None = None, pqclass: type[ScrapyPriorityQueue] | None = None, crawler: Crawler | None = None)

Initialize the scheduler.

Parameters:

dupefilter (scrapy.dupefilters.BaseDupeFilter instance or similar: any class that implements the BaseDupeFilter interface) – An object responsible for checking and filtering duplicate requests. The value for the DUPEFILTER_CLASS setting is used by default.
jobdir (str or None) – The path of a directory to be used for persisting the crawl’s state. The value for the JOBDIR setting is used by default. See Jobs: pausing and resuming crawls.
dqclass (type) – A class to be used as persistent request queue. The value for the SCHEDULER_DISK_QUEUE setting is used by default.
mqclass (type) – A class to be used as non-persistent request queue. The value for the SCHEDULER_MEMORY_QUEUE setting is used by default.
logunser (bool) – A boolean that indicates whether or not unserializable requests should be logged. The value for the SCHEDULER_DEBUG setting is used by default.
stats (scrapy.statscollectors.StatsCollector instance or similar: any class that implements the StatsCollector interface) – A stats collector object to record stats about the request scheduling process. The value for the STATS_CLASS setting is used by default.
pqclass (type) – A class to be used as priority queue for requests. The value for the SCHEDULER_PRIORITY_QUEUE setting is used by default.
crawler (scrapy.crawler.Crawler) – The crawler object corresponding to the current crawl.

__len__() → int: Return the total number of enqueued requests.

close(reason: str) → Deferred[None] | None

Dump pending requests to disk if there is a disk queue.
Return the result of the dupefilter’s close method.

enqueue_request(request: Request) → bool

Unless the received request is filtered out by the dupefilter, attempt to push it into the disk queue, falling back to pushing it into the memory queue.

Increment the appropriate stats, such as: scheduler/enqueued, scheduler/enqueued/disk, scheduler/enqueued/memory.

Return True if the request was stored successfully, False otherwise.

classmethod from_crawler(crawler: Crawler) → Self: Factory method which receives the current Crawler object as argument.

has_pending_requests() → bool: True if the scheduler has enqueued requests, False otherwise.

next_request() → Request | None

Return a Request object from the memory queue, falling back to the disk queue if the memory queue is empty. Return None if there are no more enqueued requests.

Increment the appropriate stats, such as: scheduler/dequeued, scheduler/dequeued/disk, scheduler/dequeued/memory.
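
The flow described for enqueue_request and next_request can be modelled with a toy class. This is a sketch, not Scrapy's implementation: ToyScheduler is a hypothetical name, requests are plain dicts, a set stands in for the dupefilter, deques stand in for the memory and disk queues, and priorities are ignored.

```python
from collections import Counter, deque


class ToyScheduler:
    """Toy model of the enqueue/dequeue flow; not Scrapy's implementation."""

    def __init__(self, use_disk=True):
        self.seen = set()      # stands in for the dupefilter
        self.memory = deque()  # stands in for the memory queue
        self.disk = deque() if use_disk else None  # stands in for the disk queue
        self.stats = Counter()

    def _serializable(self, request):
        # Assumption for the sketch: a request is a dict, and it is
        # serializable unless it is flagged otherwise.
        return not request.get("unserializable", False)

    def enqueue_request(self, request):
        if request["url"] in self.seen:
            return False  # filtered out as a duplicate
        self.seen.add(request["url"])
        if self.disk is not None and self._serializable(request):
            self.disk.append(request)
            self.stats["scheduler/enqueued/disk"] += 1
        else:
            # Fall back to memory when there is no disk queue or the
            # request cannot be serialized.
            self.memory.append(request)
            self.stats["scheduler/enqueued/memory"] += 1
        self.stats["scheduler/enqueued"] += 1
        return True

    def next_request(self):
        # Memory takes precedence; the disk queue is the fallback.
        if self.memory:
            request = self.memory.pop()
            self.stats["scheduler/dequeued/memory"] += 1
        elif self.disk:
            request = self.disk.pop()
            self.stats["scheduler/dequeued/disk"] += 1
        else:
            return None
        self.stats["scheduler/dequeued"] += 1
        return request
```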

open(spider: Spider) → Deferred[None] | None

Initialize the memory queue.
Initialize the disk queue if the jobdir argument wasn’t empty.
Return the result of the dupefilter’s open method.

Priority queues

class scrapy.pqueues.DownloaderAwarePriorityQueue(crawler: Crawler, downstream_queue_cls: type[QueueProtocol], key: str, slot_startprios: dict[str, Iterable[int]] | None = None, *, start_queue_cls: type[QueueProtocol] | None = None): PriorityQueue which takes Downloader activity into account: domains (slots) with the fewest active downloads are dequeued first.

Disk persistence

Warning

The files that this class generates on disk are an implementation detail, and may change without a warning in a future version of Scrapy. Do not rely on the following information for anything other than debugging purposes.

When a component instantiates this class with a non-empty key argument, key is used as a persistence directory, and inside that directory this class creates a subdirectory per download slot (domain).

Those subdirectories are named after the corresponding download slot, with path-unsafe characters replaced by underscores and an MD5 hash suffix to avoid collisions.
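
A naming scheme of this kind can be sketched as follows. The helper slot_dirname is hypothetical, and the set of characters treated as path-safe and the use of a hyphen before the hash are assumptions; the exact rule used by Scrapy may differ.

```python
import hashlib


def slot_dirname(slot):
    """Build a directory name for a download slot (hypothetical helper).

    Assumption: letters, digits, "-", "." and "_" are treated as path-safe.
    """
    safe = "".join(c if c.isalnum() or c in "-._" else "_" for c in slot)
    # The hash of the original slot name keeps two slots apart even when
    # their sanitized names are identical.
    digest = hashlib.md5(slot.encode("utf-8")).hexdigest()
    return f"{safe}-{digest}"
```

For example, the slots "a/b" and "a:b" both sanitize to "a_b", but their hash suffixes differ, so they end up in different subdirectories.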

For each download slot, this class creates an instance of ScrapyPriorityQueue with the download slot subdirectory as key and its own downstream_queue_cls.

class scrapy.pqueues.ScrapyPriorityQueue(crawler: Crawler, downstream_queue_cls: type[QueueProtocol], key: str, startprios: Iterable[int] = (), *, start_queue_cls: type[QueueProtocol] | None = None)

A priority queue implemented using multiple internal queues (typically, FIFO queues). It uses one internal queue for each priority value. The internal queue must implement the following methods:

push(obj)

pop()

close()

__len__()

Optionally, the queue can provide a peek method, which should return the next object that pop would return, without removing it from the queue.

The __init__ method of ScrapyPriorityQueue receives a downstream_queue_cls argument, which is a class used to instantiate a new (internal) queue when a new priority is allocated.

Only integer priorities should be used. Lower numbers are higher priorities.

startprios is a sequence of priorities to start with. If the queue was previously closed leaving some priority buckets non-empty, those priorities should be passed in startprios.
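
The idea of one internal queue per priority can be sketched in plain Python. ToyPriorityQueue is a hypothetical, in-memory model rather than the real class: its push takes the priority as an explicit argument, and buckets named in startprios start empty, whereas disk-backed internal queues would reopen persisted data.

```python
from collections import deque


class ToyPriorityQueue:
    """One internal FIFO queue per integer priority; lower numbers pop first."""

    def __init__(self, startprios=()):
        # Recreate the buckets that were non-empty when the queue was closed.
        self.queues = {priority: deque() for priority in startprios}

    def push(self, obj, priority):
        # Allocate a new internal queue the first time a priority is seen.
        self.queues.setdefault(priority, deque()).append(obj)

    def pop(self):
        non_empty = [p for p, q in self.queues.items() if q]
        if not non_empty:
            return None
        priority = min(non_empty)
        obj = self.queues[priority].popleft()
        if not self.queues[priority]:
            del self.queues[priority]
        return obj

    def close(self):
        # Return the priorities that still hold objects, i.e. the value to
        # pass as startprios when the queue is created again.
        return sorted(p for p, q in self.queues.items() if q)

    def __len__(self):
        return sum(len(q) for q in self.queues.values())
```

Objects with the same priority come out in insertion order because each bucket in this sketch is a FIFO queue.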

Disk persistence

Warning

The files that this class generates on disk are an implementation detail, and may change without a warning in a future version of Scrapy. Do not rely on the following information for anything other than debugging purposes.

When a component instantiates this class with a non-empty key argument, key is used as a persistence directory.

For every request enqueued, this class checks:

Whether the request is a start request or not.
The priority of the request.

For each combination of the above seen, this class creates an instance of downstream_queue_cls (or start_queue_cls for start requests if it was passed) with key set to a subdirectory of the persistence directory, named as the negated request priority (e.g. -1), with an s suffix in case of a start request (e.g. -1s).
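
The naming rule can be written as a one-line helper. queue_dirname is a hypothetical function based only on the description above, and, like the layout itself, it should be used for debugging purposes only.

```python
def queue_dirname(priority, is_start_request=False):
    """Name of the subdirectory for a request priority (hypothetical helper)."""
    name = str(-priority)  # the directory is named after the negated priority
    return f"{name}s" if is_start_request else name
```

A request with priority 1 maps to "-1", or to "-1s" if it is a start request.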