Core API

This section documents the Scrapy core API, and it’s intended for developers of extensions and middlewares.

Crawler API

The main entry point to the Scrapy API is the Crawler object, which components can obtain when they are initialized. It provides access to all Scrapy core components, and it is the only way for components to access them and hook their functionality into Scrapy.

The Extension Manager is responsible for loading and keeping track of installed extensions. It is configured through the EXTENSIONS setting, which contains a dictionary of all available extensions and their order, similar to how you configure the downloader middlewares.
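A component usually receives the crawler through a from_crawler class method and reads what it needs from it. The following is a minimal sketch of that pattern, not a definitive implementation: RunTagExtension, the RUN_TAG setting, FakeStats and fake_crawler are hypothetical names, and the stand-in crawler exists only so that the sketch runs without Scrapy installed.

```python
from types import SimpleNamespace


class RunTagExtension:
    """Hypothetical extension that copies a setting into the crawl stats."""

    def __init__(self, tag):
        self.tag = tag

    @classmethod
    def from_crawler(cls, crawler):
        # The crawler is the access point to the settings and the stats.
        tag = crawler.settings.get("RUN_TAG", "untagged")  # RUN_TAG is made up
        crawler.stats.set_value("run_tag", tag)
        return cls(tag)


class FakeStats:
    """Stand-in that offers only set_value() and get_value()."""

    def __init__(self):
        self._stats = {}

    def set_value(self, key, value):
        self._stats[key] = value

    def get_value(self, key, default=None):
        return self._stats.get(key, default)


# Stand-in for a real Crawler: a plain dict plays the role of the settings.
fake_crawler = SimpleNamespace(settings={"RUN_TAG": "nightly"}, stats=FakeStats())
extension = RunTagExtension.from_crawler(fake_crawler)
print(extension.tag, fake_crawler.stats.get_value("run_tag"))
```

In a real project the class would be enabled through the EXTENSIONS setting, and Scrapy would pass the actual Crawler object to from_crawler.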

class scrapy.crawler.Crawler(spidercls: type[Spider], settings: dict[str, Any] | Settings | None = None, init_reactor: bool = False)[source]

The Crawler object must be instantiated with a scrapy.Spider subclass. Settings are optional and can be passed as a dict or as a scrapy.settings.Settings object.

request_fingerprinter

The request fingerprint builder of this crawler.

This is used by extensions and middlewares to build short, unique identifiers for requests. See Request fingerprints.

settings

The settings manager of this crawler.

This is used by extensions & middlewares to access the Scrapy settings of this crawler.

For an introduction on Scrapy settings see Settings.

For the API see Settings class.

signals

The signals manager of this crawler.

This is used by extensions & middlewares to hook themselves into Scrapy functionality.

For an introduction on signals see Signals.

For the API see SignalManager class.

stats

The stats collector of this crawler.

This is used by extensions & middlewares to record stats of their behaviour, or access stats collected by other extensions.

For an introduction on stats collection see Stats Collection.

For the API see StatsCollector class.

extensions

The extension manager that keeps track of enabled extensions.

Most extensions won’t need to access this attribute.

For an introduction on extensions and a list of available extensions in Scrapy see Extensions.

engine

The execution engine, which coordinates the core crawling logic between the scheduler, downloader and spiders.

Some extensions may want to access the Scrapy engine, to inspect or modify the downloader and scheduler behaviour, although this is an advanced use and this API is not yet stable.

spider: Spider currently being crawled. This is an instance of the spider class provided while constructing the crawler, and it is created after the arguments given in the crawl() method.

async crawl_async(*args: Any, **kwargs: Any) → None[source]

Start the crawler by instantiating its spider class with the given args and kwargs arguments, while setting the execution engine in motion. Should be called only once.

Added in version 2.14.

Completes when the crawl is finished.

crawl(*args: Any, **kwargs: Any) → Generator[Deferred[Any], Any, None][source]

Start the crawler by instantiating its spider class with the given args and kwargs arguments, while setting the execution engine in motion. Should be called only once.

Return a deferred that is fired when the crawl is finished.

async stop_async() → None[source]: Start a graceful stop of the crawler and complete when the crawler is stopped.

Added in version 2.14.

stop() → Deferred[None][source]: Start a graceful stop of the crawler and return a deferred that is fired when the crawler is stopped.

get_addon(cls: type[_T]) → _T | None[source]: Return the run-time instance of an add-on of the specified class or a subclass, or None if none is found.

Added in version 2.12.

get_downloader_middleware(cls: type[_T]) → _T | None[source]

Return the run-time instance of a downloader middleware of the specified class or a subclass, or None if none is found.

Added in version 2.12.

This method can only be called after the crawl engine has been created, e.g. at signals engine_started or spider_opened.

get_extension(cls: type[_T]) → _T | None[source]

Return the run-time instance of an extension of the specified class or a subclass, or None if none is found.

Added in version 2.12.

This method can only be called after the extension manager has been created, e.g. at signals engine_started or spider_opened.

get_item_pipeline(cls: type[_T]) → _T | None[source]

Return the run-time instance of an item pipeline of the specified class or a subclass, or None if none is found.

Added in version 2.12.

This method can only be called after the crawl engine has been created, e.g. at signals engine_started or spider_opened.

get_spider_middleware(cls: type[_T]) → _T | None[source]

Return the run-time instance of a spider middleware of the specified class or a subclass, or None if none is found.

Added in version 2.12.

This method can only be called after the crawl engine has been created, e.g. at signals engine_started or spider_opened.

class scrapy.crawler.AsyncCrawlerRunner(settings: dict[str, Any] | Settings | None = None)[source]

This is a convenient helper class that keeps track of, manages and runs crawlers inside an already set up reactor or asyncio event loop.

The AsyncCrawlerRunner object is instantiated with settings, which are optional and can be passed as a dict or as a Settings object.

When the TWISTED_REACTOR_ENABLED setting is set to True, this class requires a reactor to be installed and uses it. Otherwise, it requires that no reactor is installed and that an asyncio event loop is installed, and it uses that event loop.

This class shouldn’t be needed (since Scrapy is responsible for using it accordingly) unless writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.

This class provides coroutine APIs. It requires AsyncioSelectorReactor when used with a reactor.

crawl(crawler_or_spidercls: type[Spider] | str | Crawler, *args: Any, **kwargs: Any) → Task[None][source]

Run a crawler with the provided arguments.

It will call the given Crawler’s crawl_async() method, while keeping track of it so it can be stopped later.

If crawler_or_spidercls isn’t a Crawler instance, this method will try to create one using this parameter as the spider class given to it.

Returns a Task object which completes when the crawling is finished.

Parameters:

crawler_or_spidercls (Crawler instance, Spider subclass or string) – already created crawler, or a spider class or spider’s name inside the project to create it
args – arguments to initialize the spider
kwargs – keyword arguments to initialize the spider

async join() → None[source]: Completes when all managed crawlers have completed their executions.
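The bookkeeping described above, where crawl() returns a Task and join() completes once every tracked job is done, follows a common asyncio pattern. The sketch below models that pattern with the standard library only. It is not Scrapy's implementation, and ToyRunner, fake_crawl and the spider names are hypothetical.

```python
import asyncio


class ToyRunner:
    """Toy model: remembers every started job so join() can wait for all."""

    def __init__(self):
        self._active = set()

    def crawl(self, job, *args, **kwargs):
        # Wrap the coroutine in a Task and forget it once it is done.
        task = asyncio.create_task(job(*args, **kwargs))
        self._active.add(task)
        task.add_done_callback(self._active.discard)
        return task

    async def join(self):
        # Loop in case new jobs are started while waiting.
        while self._active:
            await asyncio.gather(*self._active)


finished = []


async def fake_crawl(name, delay):
    await asyncio.sleep(delay)  # stands in for the actual crawling work
    finished.append(name)


async def main():
    runner = ToyRunner()
    runner.crawl(fake_crawl, "books", 0.02)
    runner.crawl(fake_crawl, "quotes", 0.01)
    await runner.join()  # returns once both jobs are done


asyncio.run(main())
print(sorted(finished))
```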

async stop() → None[source]

Simultaneously stops all crawling jobs in progress.

Completes when they all have ended.

class scrapy.crawler.CrawlerRunner(settings: dict[str, Any] | Settings | None = None)[source]

This is a convenient helper class that keeps track of, manages and runs crawlers inside an already set up reactor.

The CrawlerRunner object is instantiated with settings, which are optional and can be passed as a dict or as a Settings object.

This class shouldn’t be needed (since Scrapy is responsible for using it accordingly) unless writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.

This class provides Deferred-based APIs. Use AsyncCrawlerRunner for modern coroutine APIs.

crawl(crawler_or_spidercls: type[Spider] | str | Crawler, *args: Any, **kwargs: Any) → Deferred[None][source]

Run a crawler with the provided arguments.

It will call the given Crawler’s crawl() method, while keeping track of it so it can be stopped later.

If crawler_or_spidercls isn’t a Crawler instance, this method will try to create one using this parameter as the spider class given to it.

Returns a deferred that is fired when the crawling is finished.

Parameters:

crawler_or_spidercls (Crawler instance, Spider subclass or string) – already created crawler, or a spider class or spider’s name inside the project to create it
args – arguments to initialize the spider
kwargs – keyword arguments to initialize the spider

join()[source]: Returns a deferred that is fired when all managed crawlers have completed their executions.

stop() → Deferred[Any][source]

Simultaneously stops all crawling jobs in progress.

Returns a deferred that is fired when they all have ended.

class scrapy.crawler.AsyncCrawlerProcess(settings: dict[str, Any] | Settings | None = None, install_root_handler: bool = True)[source]

Bases: CrawlerProcessBase, AsyncCrawlerRunner

A class to run multiple Scrapy crawlers in a process simultaneously.

This class extends AsyncCrawlerRunner by adding support for starting a reactor and handling shutdown signals, like the keyboard interrupt command Ctrl-C. It also configures top-level logging.

This utility should be a better fit than AsyncCrawlerRunner if you aren’t running another reactor within your application.

The AsyncCrawlerProcess object is instantiated with settings, which are optional and can be passed as a dict or as a Settings object.

When the TWISTED_REACTOR_ENABLED setting is set to True, this class installs a reactor and uses it. Otherwise, it requires that no reactor is installed, and it installs an asyncio event loop and uses it.

Parameters:

install_root_handler – whether to install the root logging handler (default: True)

This class shouldn’t be needed (since Scrapy is responsible for using it accordingly) unless writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.

This class provides coroutine APIs. It requires AsyncioSelectorReactor when used with a reactor.

crawl(crawler_or_spidercls: type[Spider] | str | Crawler, *args: Any, **kwargs: Any) → Task[None]

Run a crawler with the provided arguments.

It will call the given Crawler’s crawl_async() method, while keeping track of it so it can be stopped later.

If crawler_or_spidercls isn’t a Crawler instance, this method will try to create one using this parameter as the spider class given to it.

Returns a Task object which completes when the crawling is finished.

Parameters:

crawler_or_spidercls (Crawler instance, Spider subclass or string) – already created crawler, or a spider class or spider’s name inside the project to create it
args – arguments to initialize the spider
kwargs – keyword arguments to initialize the spider

property crawlers: set[Crawler]: Set of crawlers started by crawl() and managed by this class.

create_crawler(crawler_or_spidercls: type[Spider] | str | Crawler) → Crawler

Return a Crawler object.

If crawler_or_spidercls is a Crawler, the runner’s settings are merged into it as defaults: for each setting, the runner’s value is applied only if the Crawler does not already have that setting at an equal or higher priority. The Crawler is then returned.
If crawler_or_spidercls is a Spider subclass, a new Crawler is constructed for it using this runner’s settings.
If crawler_or_spidercls is a string, this function finds a spider with this name in a Scrapy project (using spider loader), then creates a Crawler instance for it.

async join() → None: Completes when all managed crawlers have completed their executions.

start(stop_after_crawl: bool = True, install_signal_handlers: bool = True) → None[source]

This method starts a reactor or an asyncio event loop, depending on the value of the TWISTED_REACTOR_ENABLED setting.

When using a reactor it adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE and installs a DNS resolver based on TWISTED_DNS_RESOLVER.

If stop_after_crawl is True, the reactor/event loop will be stopped after all crawlers have finished, using join().

Parameters:

stop_after_crawl (bool) – whether to stop the reactor or event loop when all crawlers have finished
install_signal_handlers (bool) – whether to install the OS signal handlers from Twisted and Scrapy (default: True)

async stop() → None

Simultaneously stops all crawling jobs in progress.

Completes when they all have ended.

class scrapy.crawler.CrawlerProcess(settings: dict[str, Any] | Settings | None = None, install_root_handler: bool = True)[source]

Bases: CrawlerProcessBase, CrawlerRunner

A class to run multiple Scrapy crawlers in a process simultaneously.

This class extends CrawlerRunner by adding support for starting a reactor and handling shutdown signals, like the keyboard interrupt command Ctrl-C. It also configures top-level logging.

This utility should be a better fit than CrawlerRunner if you aren’t running another reactor within your application.

The CrawlerProcess object is instantiated with settings, which are optional and can be passed as a dict or as a Settings object.

Parameters:

install_root_handler – whether to install the root logging handler (default: True)

This class shouldn’t be needed (since Scrapy is responsible for using it accordingly) unless writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.

This class provides Deferred-based APIs. Use AsyncCrawlerProcess for modern coroutine APIs.

crawl(crawler_or_spidercls: type[Spider] | str | Crawler, *args: Any, **kwargs: Any) → Deferred[None]

Run a crawler with the provided arguments.

It will call the given Crawler’s crawl() method, while keeping track of it so it can be stopped later.

If crawler_or_spidercls isn’t a Crawler instance, this method will try to create one using this parameter as the spider class given to it.

Returns a deferred that is fired when the crawling is finished.

Parameters:

crawler_or_spidercls (Crawler instance, Spider subclass or string) – already created crawler, or a spider class or spider’s name inside the project to create it
args – arguments to initialize the spider
kwargs – keyword arguments to initialize the spider

property crawlers: set[Crawler]: Set of crawlers started by crawl() and managed by this class.

create_crawler(crawler_or_spidercls: type[Spider] | str | Crawler) → Crawler

Return a Crawler object.

If crawler_or_spidercls is a Crawler, the runner’s settings are merged into it as defaults: for each setting, the runner’s value is applied only if the Crawler does not already have that setting at an equal or higher priority. The Crawler is then returned.
If crawler_or_spidercls is a Spider subclass, a new Crawler is constructed for it using this runner’s settings.
If crawler_or_spidercls is a string, this function finds a spider with this name in a Scrapy project (using spider loader), then creates a Crawler instance for it.

join(): Returns a deferred that is fired when all managed crawlers have completed their executions.

start(stop_after_crawl: bool = True, install_signal_handlers: bool = True) → None[source]

This method starts a reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS resolver based on TWISTED_DNS_RESOLVER.

If stop_after_crawl is True, the reactor will be stopped after all crawlers have finished, using join().

Parameters:

stop_after_crawl (bool) – whether to stop the reactor when all crawlers have finished
install_signal_handlers (bool) – whether to install the OS signal handlers from Twisted and Scrapy (default: True)

stop() → Deferred[Any]

Simultaneously stops all crawling jobs in progress.

Returns a deferred that is fired when they all have ended.

Settings API

scrapy.settings.SETTINGS_PRIORITIES

Dictionary that sets the key name and priority level of the default settings priorities used in Scrapy.

Each item defines a settings entry point, giving it a code name for identification and an integer priority. Higher priorities take precedence over lower ones when setting and retrieving values in the Settings class.

SETTINGS_PRIORITIES = {
    "default": 0,
    "command": 10,
    "addon": 15,
    "project": 20,
    "spider": 30,
    "cmdline": 40,
}

For a detailed explanation of each settings source, see Settings.

scrapy.settings.get_settings_priority(priority: int | str) → int[source]: Small helper function that looks up a given string priority in the SETTINGS_PRIORITIES dictionary and returns its numerical value, or directly returns a given numerical priority.
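The lookup described above can be modelled in a few lines of plain Python. This is a sketch of the documented behaviour rather than Scrapy's source. resolve_priority is a hypothetical name, and the dictionary is copied from the listing above so that the example runs without Scrapy installed.

```python
SETTINGS_PRIORITIES = {
    "default": 0,
    "command": 10,
    "addon": 15,
    "project": 20,
    "spider": 30,
    "cmdline": 40,
}


def resolve_priority(priority):
    """Return an integer priority from either a named level or a number."""
    if isinstance(priority, str):
        return SETTINGS_PRIORITIES[priority]  # unknown names raise KeyError
    return priority  # numbers are passed through unchanged


print(resolve_priority("spider"))  # 30
print(resolve_priority(25))        # 25
```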

class scrapy.settings.Settings(values: _SettingsInput = None, priority: int | str = 'project')[source]

Bases: BaseSettings

This object stores Scrapy settings for the configuration of internal components, and can be used for any further customization.

It is a direct subclass and supports all methods of BaseSettings. Additionally, after instantiation of this class, the new object will have the global default settings described in Built-in settings reference already populated.

class scrapy.settings.BaseSettings(values: _SettingsInput = None, priority: int | str = 'project')[source]

Instances of this class behave like dictionaries, but store priorities along with their (key, value) pairs, and can be frozen (i.e. marked immutable).

Key-value entries can be passed on initialization with the values argument, and they take the priority level given by the priority argument (unless values is already an instance of BaseSettings, in which case the existing priority levels are kept). If the priority argument is a string, the priority name is looked up in SETTINGS_PRIORITIES. Otherwise, a specific integer should be provided.

Once the object is created, new settings can be loaded or updated with the set() method, and can be accessed with the square bracket notation of dictionaries, or with the get() method of the instance and its value conversion variants. When requesting a stored key, the value with the highest priority will be retrieved.
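To make the priority rule concrete, here is a toy model in plain Python that stores a priority next to each value and can be frozen. It is a sketch under stated assumptions, not Scrapy's implementation: ToySettings and the shortened PRIORITIES table are hypothetical, and the sketch assumes that a write with an equal priority replaces the stored value and that writing to a frozen object raises TypeError.

```python
PRIORITIES = {"default": 0, "project": 20, "spider": 30, "cmdline": 40}


class ToySettings:
    """Toy model: each key maps to a (value, priority) pair."""

    def __init__(self):
        self._data = {}
        self._frozen = False

    def set(self, name, value, priority="project"):
        if self._frozen:
            raise TypeError("settings are frozen")
        level = PRIORITIES[priority] if isinstance(priority, str) else priority
        current = self._data.get(name)
        # Assumption: overwrite when the new priority is at least the stored one.
        if current is None or level >= current[1]:
            self._data[name] = (value, level)

    def get(self, name, default=None):
        return self._data[name][0] if name in self._data else default

    def getpriority(self, name):
        return self._data[name][1] if name in self._data else None

    def freeze(self):
        self._frozen = True


settings = ToySettings()
settings.set("DOWNLOAD_DELAY", 1, priority="spider")
settings.set("DOWNLOAD_DELAY", 5, priority="project")  # lower priority: ignored
settings.set("DOWNLOAD_DELAY", 2, priority="cmdline")  # higher priority: wins
print(settings.get("DOWNLOAD_DELAY"), settings.getpriority("DOWNLOAD_DELAY"))
```

The last line prints the value written at the cmdline level together with its numerical priority, because the project-level write arrived with a lower priority than the value already stored.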

add_to_list(name: str, item: Any) → None[source]

Append item to the list setting with the specified name if item is not already in that list.

This change is applied regardless of the priority of the name setting. The setting priority is not affected by this change either.

copy() → Self[source]

Make a deep copy of current settings.

This method returns a new instance of this class, populated with the same values and their priorities.

Modifications to the new object won’t be reflected on the original settings.

copy_to_dict() → dict[str, Any][source]

Make a copy of current settings and convert to a dict.

This method returns a new dict populated with the same values as the current settings.

Modifications to the returned dict won’t be reflected on the original settings.

This method can be useful for example for printing settings in Scrapy shell.

freeze() → None[source]

Disable further changes to the current settings.

After calling this method, the present state of the settings becomes immutable. Trying to change values through the set() method and its variants is no longer possible and raises a TypeError.

frozencopy() → Self[source]

Return an immutable copy of the current settings.

Alias for a freeze() call on the object returned by copy().

get(name: str, default: Any = None) → Any[source]

Get a setting value without affecting its original type.

Parameters:

name (str) – the setting name
default (object) – the value to return if no setting is found

get_component_priority_dict_with_base(name: str) → BaseSettings[source]

Get a composition of a component priority dictionary setting and its _BASE counterpart.

Keys are resolved to their import path for deduplication and then restored to their latest input representation.

Parameters:

name (str) – name of the component priority dictionary setting

getbool(name: str, default: bool = False) → bool[source]

Get a setting value as a boolean.

1, '1', True and 'True' return True, while 0, '0', False, 'False' and None return False.

For example, settings populated through environment variables set to '0' will return False when using this method.

Parameters:

name (str) – the setting name
default (object) – the value to return if no setting is found
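The conversion table above can be modelled with a small helper. This is a sketch of the documented mapping only: to_bool is a hypothetical name, and raising ValueError for any other string is an assumption of this sketch, not something stated above.

```python
def to_bool(value):
    """Model of the mapping: numbers and numeric strings go through int()."""
    if value is None:
        return False
    try:
        return bool(int(value))
    except ValueError:
        # int() rejects the strings 'True' and 'False', so map them by hand.
        if value == "True":
            return True
        if value == "False":
            return False
        # Assumption of this sketch: anything else is an error.
        raise ValueError(f"unsupported boolean value: {value!r}")


for raw in (1, "1", True, "True", 0, "0", False, "False", None):
    print(repr(raw), "->", to_bool(raw))
```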

getdict(name: str, default: dict[Any, Any] | None = None) → dict[Any, Any][source]

Get a setting value as a dictionary. If the setting’s original type is a dictionary, a copy of it will be returned. If it is a string it will be evaluated as a JSON dictionary. In the case that it is a BaseSettings instance itself, it will be converted to a dictionary, containing all its current settings values as they would be returned by get(), and losing all information about priority and mutability.

Parameters:

name (str) – the setting name
default (object) – the value to return if no setting is found

getdictorlist(name: str, default: dict[Any, Any] | list[Any] | tuple[Any] | None = None) → dict[Any, Any] | list[Any][source]

Get a setting value as either a dict or a list.

If the setting is already a dict or a list, a copy of it will be returned.

If it is a string it will be evaluated as JSON, or as a comma-separated list of strings as a fallback.

For example, settings populated from the command line will return:

{'key1': 'value1', 'key2': 'value2'} if set to '{"key1": "value1", "key2": "value2"}'
['one', 'two'] if set to '["one", "two"]' or 'one,two'

Parameters:

name (str) – the setting name
default (object) – the value to return if no setting is found
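The parsing order described above, JSON first and a comma-separated list as the fallback, can be sketched as follows. to_dict_or_list is a hypothetical helper that models the documented results, not Scrapy's code. Rejecting JSON values that are neither an object nor an array, and using a deep copy, are choices made for this sketch.

```python
import copy
import json


def to_dict_or_list(value):
    """Model of the dict-or-list conversion for strings, dicts and lists."""
    if isinstance(value, str):
        try:
            loaded = json.loads(value)
        except ValueError:
            # Not valid JSON: fall back to a comma-separated list of strings.
            return value.split(",")
        if not isinstance(loaded, (dict, list)):
            # Choice made for this sketch: scalars such as '5' are rejected.
            raise TypeError("JSON value must be an object or an array")
        return loaded
    # Already a dict or a list: hand back a copy, not the stored object.
    return copy.deepcopy(value)


print(to_dict_or_list('{"key1": "value1", "key2": "value2"}'))
print(to_dict_or_list('["one", "two"]'))
print(to_dict_or_list("one,two"))
```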

getfloat(name: str, default: float = 0.0) → float[source]

Get a setting value as a float.

Parameters:

name (str) – the setting name
default (object) – the value to return if no setting is found

getint(name: str, default: int = 0) → int[source]

Get a setting value as an int.

Parameters:

name (str) – the setting name
default (object) – the value to return if no setting is found

getlist(name: str, default: list[Any] | None = None) → list[Any][source]

Get a setting value as a list. If the setting’s original type is a list, a copy of it will be returned. If it’s a string it will be split by ','. If it is an empty string, an empty list will be returned.

For example, settings populated through environment variables set to 'one,two' will return the list ['one', 'two'] when using this method.

Parameters:

name (str) – the setting name
default (object) – the value to return if no setting is found

getpriority(name: str) → int | None[source]

Return the current numerical priority value of a setting, or None if the given name does not exist.

Parameters:

name (str) – the setting name

getwithbase(name: str) → BaseSettings[source]

Get a composition of a dictionary-like setting and its _BASE counterpart.

Use get_component_priority_dict_with_base() instead if the setting is a component priority dictionary.

Parameters:

name (str) – name of the dictionary-like setting

maxpriority() → int[source]: Return the numerical value of the highest priority present throughout all settings, or the numerical value for default from SETTINGS_PRIORITIES if there are no settings stored.

pop(k[, d]) → v[source]

Remove the specified key and return the corresponding value. If the key is not found, d is returned if given; otherwise KeyError is raised.

remove_from_list(name: str, item: Any) → None[source]

Remove item from the list setting with the specified name.

If item is missing, raise ValueError.

This change is applied regardless of the priority of the name setting. The setting priority is not affected by this change either.

replace_in_component_priority_dict(name: str, old_cls: type, new_cls: type, priority: int | None = None) → None[source]

Replace old_cls with new_cls in the name component priority dictionary.

If old_cls is missing, or has None as value, KeyError is raised.

If old_cls was present as an import string, even more than once, those keys are dropped and replaced by new_cls.

If priority is specified, that is the value assigned to new_cls in the component priority dictionary. Otherwise, the value of old_cls is used. If old_cls was present multiple times (possible with import strings) with different values, the value assigned to new_cls is one of them, with no guarantee about which one it is.

This change is applied regardless of the priority of the name setting. The setting priority is not affected by this change either.

set(name: str, value: Any, priority: int | str = 'project') → None[source]

Store a key/value attribute with a given priority.

Settings should be populated before the Crawler object applies them (in the crawl_async() or crawl() method), otherwise they won’t have any effect.

Parameters:

name (str) – the setting name
value (object) – the value to associate with the setting
priority (str or int) – the priority of the setting. Should be a key of SETTINGS_PRIORITIES or an integer

set_in_component_priority_dict(name: str, cls: type, priority: int | None) → None[source]

Set the cls component in the name component priority dictionary setting with priority.

If cls already exists, its value is updated.

If cls was present as an import string, even more than once, those keys are dropped and replaced by cls.

This change is applied regardless of the priority of the name setting. The setting priority is not affected by this change either.

setdefault(k[, d]) → D.get(k, d)[source]

Return D.get(k, d), and also set D[k] = d if k is not in D.

setdefault_in_component_priority_dict(name: str, cls: type, priority: int | None) → None[source]

Set the cls component in the name component priority dictionary setting with priority if not already defined (even as an import string).

If cls is not already defined, it is set regardless of the priority of the name setting. The setting priority is not affected by this change either.

setmodule(module: ModuleType | str, priority: int | str = 'project') → None[source]

Store settings from a module with a given priority.

This is a helper function that calls set() for every globally declared uppercase variable of module with the provided priority.

Parameters:

module (types.ModuleType or str) – the module or the path of the module
priority (str or int) – the priority of the settings. Should be a key of SETTINGS_PRIORITIES or an integer

update(values: _SettingsInput, priority: int | str = 'project') → None[source]

Store key/value pairs with a given priority.

This is a helper function that calls set() for every item of values with the provided priority.

If values is a string, it is assumed to be JSON-encoded and parsed into a dict with json.loads() first. If it is a BaseSettings instance, the per-key priorities will be used and the priority parameter ignored. This allows inserting/updating settings with different priorities with a single command.

Parameters:

values (dict, iterable, string or BaseSettings) – the settings names and values
priority (str or int) – the priority of the settings. Should be a key of SETTINGS_PRIORITIES or an integer

SpiderLoader API

Custom spider loaders can be employed by specifying their path in the SPIDER_LOADER_CLASS project setting. They must implement SpiderLoaderProtocol.

class scrapy.spiderloader.SpiderLoaderProtocol(*args, **kwargs)[source]

Protocol for spider loader implementations.

See SPIDER_LOADER_CLASS.

find_by_request(request: Request) → list[str][source]: Return the list of spider names that can handle the given request.

classmethod from_settings(settings: BaseSettings) → Self[source]: Return an instance of the class for the given settings.

list() → list[str][source]: Return a list with the names of all spiders available in the project.

load(spider_name: str) → type[Spider][source]: Return the spider class for the given spider name. If the spider name is not found, it must raise a KeyError.

class scrapy.spiderloader.SpiderLoader(settings: BaseSettings)[source]

SpiderLoader is a class which locates and loads spiders in a Scrapy project.

find_by_request(request: Request) → list[str][source]

Return the list of spider names that can handle the given request.

It will try to match the request’s url against the domains of the spiders.

classmethod from_settings(settings: BaseSettings) → Self[source]

Create an instance of the class.

It’s called with the current project settings, and it loads the spiders found recursively in the modules of the SPIDER_MODULES setting.

list() → list[str][source]: Return a list with the names of all spiders available in the project.

load(spider_name: str) → type[Spider][source]

Return the spider class for the given spider name.

If the spider name is not found, raise a KeyError.

class scrapy.spiderloader.DummySpiderLoader[source]: A dummy spider loader that does not load any spiders.

Signals API

class scrapy.signalmanager.SignalManager(sender: Any = _Anonymous)[source]

connect(receiver: Any, signal: Any, **kwargs: Any) → None[source]

Connect a receiver function to a signal.

The signal can be any object, although Scrapy comes with some predefined signals that are documented in the Signals section.

Parameters:

receiver (collections.abc.Callable) – the function to be connected
signal (object) – the signal to connect to

disconnect(receiver: Any, signal: Any, **kwargs: Any) → None[source]: Disconnect a receiver function from a signal. This has the opposite effect of the connect() method, and the arguments are the same.

disconnect_all(signal: Any, **kwargs: Any) → None[source]

Disconnect all receivers from the given signal.

Parameters:

signal (object) – the signal to disconnect from

send_catch_log(signal: Any, **kwargs: Any) → list[tuple[Any, Any]][source]

Send a signal, catch exceptions and log them.

The keyword arguments are passed to the signal handlers (connected through the connect() method).
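The "send, catch exceptions and log them" behaviour can be illustrated with a standard-library model. This sketch is not Scrapy's implementation. toy_send_catch_log is a hypothetical function that takes the receivers directly instead of looking them up by signal, and it stores the caught exception itself in the result list.

```python
import logging

logger = logging.getLogger("toy_signals")


def toy_send_catch_log(receivers, **kwargs):
    """Call every receiver; log failures instead of propagating them."""
    results = []
    for receiver in receivers:
        try:
            outcome = receiver(**kwargs)
        except Exception as exc:
            logger.error("Error caught on signal handler %r", receiver, exc_info=True)
            outcome = exc  # this sketch keeps the exception as the result
        results.append((receiver, outcome))
    return results


def on_item(item):
    return item["id"]


def broken(item):
    raise RuntimeError("boom")


results = toy_send_catch_log([on_item, broken], item={"id": 7})
print([(receiver.__name__, outcome) for receiver, outcome in results])
```

One failing handler does not prevent the others from running, and the caller still gets one (receiver, result) pair per handler.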

async send_catch_log_async(signal: Any, **kwargs: Any) → list[tuple[Any, Any]][source]

Like send_catch_log(), but supports asynchronous signal handlers: it sends a signal, catches exceptions and logs them.

Returns a coroutine that completes once all signal handlers have finished.

The keyword arguments are passed to the signal handlers (connected through the connect() method).

Added in version 2.14.

send_catch_log_deferred(signal: Any, **kwargs: Any) → Deferred[list[tuple[Any, Any]]][source]

Like send_catch_log(), but supports asynchronous signal handlers: it sends a signal, catches exceptions and logs them.

Returns a Deferred that fires once all signal handlers have finished.

The keyword arguments are passed to the signal handlers (connected through the connect() method).

async wait_for(signal: Any) → None[source]

Await the next signal.

See Delaying start request iteration for an example.

Stats Collector API

There are several Stats Collectors available under the scrapy.statscollectors module and they all implement the Stats Collector API defined by the StatsCollector class (which they all inherit from).

class scrapy.statscollectors.StatsCollector[source]

get_value(key, default=None)[source]: Return the value for the given stats key or default if it doesn’t exist.

get_stats()[source]: Get all stats from the currently running spider as a dict.

set_value(key, value)[source]: Set the given value for the given stats key.

set_stats(stats)[source]: Override the current stats with the dict passed in stats argument.

inc_value(key, count=1, start=0)[source]: Increment the value of the given stats key by the given count, using start as the initial value if the key is not set yet.

max_value(key, value)[source]: Set the given value for the given key only if current value for the same key is lower than value. If there is no current value for the given key, the value is always set.

min_value(key, value)[source]: Set the given value for the given key only if current value for the same key is greater than value. If there is no current value for the given key, the value is always set.

clear_stats()[source]: Clear all stats.
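The semantics of inc_value(), max_value() and min_value() described above can be modelled with a dictionary. ToyStats and the stat keys are hypothetical. This is a sketch of the documented behaviour, not one of the collectors shipped in scrapy.statscollectors.

```python
class ToyStats:
    """Toy model of the value-updating methods, backed by a plain dict."""

    def __init__(self):
        self._stats = {}

    def get_value(self, key, default=None):
        return self._stats.get(key, default)

    def inc_value(self, key, count=1, start=0):
        # An unset key starts from `start` before the increment is applied.
        self._stats[key] = self._stats.get(key, start) + count

    def max_value(self, key, value):
        # An unset key always takes the new value.
        self._stats[key] = max(self._stats.get(key, value), value)

    def min_value(self, key, value):
        self._stats[key] = min(self._stats.get(key, value), value)


stats = ToyStats()
stats.inc_value("pages")               # unset key: 0 + 1
stats.inc_value("pages", count=2)      # 1 + 2
stats.inc_value("retries", start=10)   # unset key: 10 + 1
stats.max_value("largest_body", 300)
stats.max_value("largest_body", 120)   # lower than the current value: ignored
stats.min_value("fastest_ms", 80)
stats.min_value("fastest_ms", 45)      # lower than the current value: replaces it
print(stats.get_value("pages"), stats.get_value("retries"))
print(stats.get_value("largest_body"), stats.get_value("fastest_ms"))
```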

The following methods are not part of the stats collection API; they are used when implementing custom stats collectors:

open_spider()[source]: Open the spider for stats collection.

close_spider()[source]: Close the spider. After this is called, no more specific stats can be accessed or collected.

Engine API

class scrapy.core.engine.ExecutionEngine[source]

needs_backout() → bool[source]

Returns True if no more requests can be sent at the moment, or False otherwise.

See Delaying start request iteration for an example.