New in version 2.0.
The following callables may be defined as coroutines using async def, and hence use coroutine syntax (e.g. await, async for, async with):
The following are known caveats of the current implementation that we aim to address in future versions of Scrapy:
The callback output is not processed until the whole callback finishes.
As a side effect, if the callback raises an exception, none of its output is processed.
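The side effect described above can be illustrated outside Scrapy with plain asyncio. This is only a sketch: the names callback and run_callback are hypothetical, and the try/except merely imitates a caller that receives a coroutine's result all at once. Because the coroutine hands over its output only when it returns, an exception raised midway means the items built so far never reach the caller.

```python
import asyncio


async def callback(values):
    # Hypothetical callback: builds its output, then returns it in one go.
    items = []
    for value in values:
        if value < 0:
            # Raising here discards the items collected so far.
            raise ValueError(f"bad value: {value}")
        items.append({"value": value})
    return items


def run_callback(values):
    """Return (items, error); items is empty when the coroutine raised."""
    try:
        return asyncio.run(callback(values)), None
    except ValueError as exc:
        return [], exc


ok_items, ok_error = run_callback([1, 2])
failed_items, failed_error = run_callback([1, -1, 2])
```

In the failing run, the item built from the first value is lost even though it was created before the exception.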
Because asynchronous generators were introduced in Python 3.6, you can only use yield if you are using Python 3.6 or later.
If you need to output multiple items or requests and you are using Python 3.5, return an iterable (e.g. a list) instead.
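A minimal sketch of the "return an iterable" pattern, assuming nothing beyond the standard library. FakeResponse is a hypothetical stand-in for a Scrapy response, and in a real spider parse would be a method taking self. The sketch is driven with asyncio.run only so it can be executed on its own, which requires a newer Python than 3.5; the callback body itself uses only syntax available in 3.5.

```python
import asyncio


class FakeResponse:
    """Hypothetical stand-in for a Scrapy response, for illustration only."""

    def __init__(self, url, titles):
        self.url = url
        self.titles = titles


async def parse(response):
    # Collect the output in a list and return it, instead of using yield,
    # so the callback is a plain coroutine rather than an async generator.
    items = []
    for title in response.titles:
        await asyncio.sleep(0)  # placeholder for real awaited I/O
        items.append({"url": response.url, "title": title})
    return items


result = asyncio.run(parse(FakeResponse("https://example.com", ["a", "b"])))
```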
There are several use cases for coroutines in Scrapy. Code that would return Deferreds when written for previous Scrapy versions, such as downloader middlewares and signal handlers, can be rewritten to be shorter and cleaner:
    from itemadapter import ItemAdapter


    class DbPipeline:
        def _update_item(self, data, item):
            adapter = ItemAdapter(item)
            adapter['field'] = data
            return item

        def process_item(self, item, spider):
            adapter = ItemAdapter(item)
            dfd = db.get_some_data(adapter['id'])
            dfd.addCallback(self._update_item, item)
            return dfd
becomes:

    from itemadapter import ItemAdapter


    class DbPipeline:
        async def process_item(self, item, spider):
            adapter = ItemAdapter(item)
            adapter['field'] = await db.get_some_data(adapter['id'])
            return item
Coroutines may be used to call asynchronous code. This includes other coroutines, functions that return Deferreds and functions that return awaitable objects such as Future.
This means you can use many useful Python libraries providing such code:
    import aiohttp
    import treq
    from scrapy import Spider


    class MySpider(Spider):
        # ...
        async def parse_with_deferred(self, response):
            additional_response = await treq.get('https://additional.url')
            additional_data = await treq.content(additional_response)
            # ... use response and additional_data to yield items and requests

        async def parse_with_asyncio(self, response):
            async with aiohttp.ClientSession() as session:
                async with session.get('https://additional.url') as additional_response:
                    additional_data = await additional_response.text()
            # ... use response and additional_data to yield items and requests
Common use cases for asynchronous code include:
requesting data from websites, databases and other services (in callbacks, pipelines and middlewares);
storing data in databases (in pipelines and middlewares);
delaying the spider initialization until some external event (in the spider_opened handler);
calling asynchronous Scrapy methods like ExecutionEngine.download (see the screenshot pipeline example).
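As an illustration of the "storing data in databases" use case, here is a rough sketch of a coroutine pipeline. InMemoryStore and StoragePipeline are hypothetical names: the store is an in-memory stand-in for an asynchronous database client, not a real driver, and the pipeline is run directly with asyncio here rather than by Scrapy.

```python
import asyncio


class InMemoryStore:
    """Hypothetical async storage client standing in for a real database driver."""

    def __init__(self):
        self.rows = []

    async def insert(self, row):
        await asyncio.sleep(0)  # placeholder for a real network round trip
        self.rows.append(row)


class StoragePipeline:
    def __init__(self, store):
        self.store = store

    async def process_item(self, item, spider):
        # Wait for the write to finish, then pass the item on unchanged.
        await self.store.insert(dict(item))
        return item


store = InMemoryStore()
pipeline = StoragePipeline(store)
returned = asyncio.run(
    pipeline.process_item({"id": 1, "name": "example"}, spider=None)
)
```

Returning the item after the awaited write keeps the pipeline's contract the same as a synchronous process_item: later pipelines still receive the item.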