Scrapy 2.11 documentation¶
Scrapy is a fast, high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
Getting help¶
Having trouble? We’d like to help!
Try the FAQ – it’s got answers to some common questions.
Ask or search questions on Stack Overflow using the scrapy tag.
Ask or search questions in the Scrapy subreddit.
Search for questions on the archives of the scrapy-users mailing list.
Ask a question in the #scrapy IRC channel.
Report bugs with Scrapy in our issue tracker.
Join the Scrapy Discord community.
Basic concepts¶
Command line tool
Learn about the command-line tool used to manage your Scrapy project.
Spiders
Write the rules to crawl your websites.
Selectors
Extract the data from web pages using XPath.
Scrapy shell
Test your extraction code in an interactive environment.
Items
Define the data you want to scrape.
Item Loaders
Populate your items with the extracted data.
Item Pipeline
Post-process and store your scraped data.
Feed exports
Output your scraped data using different formats and storages.
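For example, a `settings.py` sketch using the `FEEDS` setting to write two feeds at once (file names are illustrative):

```python
# settings.py (sketch): export scraped items to JSON and CSV simultaneously.
FEEDS = {
    "items.json": {"format": "json", "encoding": "utf8", "indent": 4},
    "items.csv": {"format": "csv"},
}
```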
Requests and Responses
Understand the classes used to represent HTTP requests and responses.
Link Extractors
Convenient classes to extract links to follow from pages.
Settings
Learn how to configure Scrapy and see all available settings.
Exceptions
See all available exceptions and their meaning.
Solving specific problems¶
Frequently Asked Questions
Get answers to the most frequently asked questions.
Debugging Spiders
Learn how to debug common problems of your Scrapy spider.
Spiders Contracts
Learn how to use contracts for testing your spiders.
Common Practices
Get familiar with some Scrapy common practices.
Broad Crawls
Tune Scrapy for crawling many domains in parallel.
Using your browser’s Developer Tools for scraping
Learn how to scrape with your browser’s developer tools.
Selecting dynamically-loaded content
Read webpage data that is loaded dynamically.
Debugging memory leaks
Learn how to find and get rid of memory leaks in your crawler.
Downloading and processing files and images
Download files and/or images associated with your scraped items.
Deploying Spiders
Deploy your Scrapy spiders and run them on a remote server.
AutoThrottle extension
Adjust crawl rate dynamically based on load.
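A `settings.py` sketch enabling the extension; the delay values are illustrative starting points, not recommendations:

```python
# settings.py (sketch): let AutoThrottle adapt delays to observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 60.0          # ceiling under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per site
```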
Benchmarking
Check how Scrapy performs on your hardware.
Jobs: pausing and resuming crawls
Learn how to pause and resume crawls for large spiders.
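In a `settings.py` sketch, persistence boils down to pointing the scheduler at a job directory (the path below is illustrative; the same value can be passed on the command line with `-s JOBDIR=...`):

```python
# settings.py (sketch): persist scheduler state so a crawl can be
# stopped and later resumed, as long as the same job directory is used.
JOBDIR = "crawls/myspider-1"
```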
Coroutines
Use the coroutine syntax.
Extending Scrapy¶
Architecture overview
Understand the Scrapy architecture.
Add-ons
Enable and configure third-party extensions.
Downloader Middleware
Customize how pages get requested and downloaded.
Spider Middleware
Customize the input and output of your spiders.
Extensions
Extend Scrapy with your custom functionality.
Signals
See all available signals and how to work with them.
Scheduler
Understand the scheduler component.
Item Exporters
Quickly export your scraped items to a file (XML, CSV, etc.).
Components
Learn the common API and some good practices when building custom Scrapy components.
Core API
Use it from extensions and middlewares to extend Scrapy functionality.