Scrapy
  • Scrapy at a glance
    • Pick a website
    • Define the data you want to scrape
    • Write a Spider to extract the data
    • Run the spider to extract the data
    • Review scraped data
    • What else?
    • What’s next?
  • Installation guide
    • Pre-requisites
    • Installing Scrapy
    • Platform specific installation notes
  • Scrapy Tutorial
    • Creating a project
    • Defining our Item
    • Our first Spider
    • Storing the scraped data
    • Next steps
  • Examples
  • Command line tool
    • Default structure of Scrapy projects
    • Using the scrapy tool
    • Available tool commands
    • Custom project commands
  • Items
    • Declaring Items
    • Item Fields
    • Working with Items
    • Extending Items
    • Item objects
    • Field objects
  • Spiders
    • Spider arguments
    • Built-in spiders reference
  • Link Extractors
    • Built-in link extractors reference
  • Selectors
    • Using selectors
    • Built-in Selectors reference
  • Item Loaders
    • Using Item Loaders to populate items
    • Input and Output processors
    • Declaring Item Loaders
    • Declaring Input and Output Processors
    • Item Loader Context
    • ItemLoader objects
    • Reusing and extending Item Loaders
    • Available built-in processors
  • Scrapy shell
    • Launch the shell
    • Using the shell
    • Example of shell session
    • Invoking the shell from spiders to inspect responses
  • Item Pipeline
    • Writing your own item pipeline
    • Item pipeline example
    • Activating an Item Pipeline component
  • Feed exports
    • Serialization formats
    • Storages
    • Storage URI parameters
    • Storage backends
    • Settings
  • Logging
    • Log levels
    • How to set the log level
    • How to log messages
    • Logging from Spiders
    • scrapy.log module
    • Logging settings
  • Stats Collection
    • Common Stats Collector uses
    • Available Stats Collectors
  • Sending e-mail
    • Quick example
    • MailSender class reference
    • Mail settings
  • Telnet Console
    • How to access the telnet console
    • Available variables in the telnet console
    • Telnet console usage examples
    • Telnet Console signals
    • Telnet settings
  • Web Service
    • Web service resources
    • Web service settings
    • Writing a web service resource
    • Examples of web service resources
    • Example of web service client
  • Frequently Asked Questions
    • How does Scrapy compare to BeautifulSoup or lxml?
    • What Python versions does Scrapy support?
    • Does Scrapy work with Python 3?
    • Did Scrapy “steal” X from Django?
    • Does Scrapy work with HTTP proxies?
    • How can I scrape an item with attributes in different pages?
    • Scrapy crashes with: ImportError: No module named win32api
    • How can I simulate a user login in my spider?
    • Does Scrapy crawl in breadth-first or depth-first order?
    • My Scrapy crawler has memory leaks. What can I do?
    • How can I make Scrapy consume less memory?
    • Can I use Basic HTTP Authentication in my spiders?
    • Why does Scrapy download pages in English instead of my native language?
    • Where can I find some example Scrapy projects?
    • Can I run a spider without creating a project?
    • I get “Filtered offsite request” messages. How can I fix them?
    • What is the recommended way to deploy a Scrapy crawler in production?
    • Can I use JSON for large exports?
    • Can I return (Twisted) deferreds from signal handlers?
    • What does the response status code 999 mean?
    • Can I call pdb.set_trace() from my spiders to debug them?
    • Simplest way to dump all my scraped items into a JSON/CSV/XML file?
    • What’s this huge cryptic __VIEWSTATE parameter used in some forms?
    • What’s the best way to parse big XML/CSV data feeds?
    • Does Scrapy manage cookies automatically?
    • How can I see the cookies being sent and received from Scrapy?
    • How can I instruct a spider to stop itself?
    • How can I prevent my Scrapy bot from getting banned?
    • Should I use spider arguments or settings to configure my spider?
    • I’m scraping an XML document and my XPath selector doesn’t return any items
    • I’m getting an error: “cannot import name crawler”
  • Debugging Spiders
    • Parse Command
    • Scrapy Shell
    • Open in browser
    • Logging
  • Spiders Contracts
    • Custom Contracts
  • Common Practices
    • Run Scrapy from a script
    • Running multiple spiders in the same process
    • Distributed crawls
    • Avoiding getting banned
    • Dynamic Creation of Item Classes
  • Broad Crawls
    • Increase concurrency
    • Reduce log level
    • Disable cookies
    • Disable retries
    • Reduce download timeout
    • Disable redirects
  • Using Firefox for scraping
    • Caveats with inspecting the live browser DOM
    • Useful Firefox add-ons for scraping
  • Using Firebug for scraping
    • Introduction
    • Getting links to follow
    • Extracting the data
  • Debugging memory leaks
    • Common causes of memory leaks
    • Debugging memory leaks with trackref
    • Debugging memory leaks with Guppy
    • Leaks without leaks
  • Downloading Item Images
    • Using the Images Pipeline
    • Usage example
    • Enabling your Images Pipeline
    • Images Storage
    • Additional features
    • Implementing your custom Images Pipeline
    • Custom Images pipeline example
  • Ubuntu packages
  • Scrapyd
  • AutoThrottle extension
    • Design goals
    • How it works
    • Throttling algorithm
    • Settings
  • Benchmarking
  • Jobs: pausing and resuming crawls
    • Job directory
    • How to use it
    • Keeping persistent state between batches
    • Persistence gotchas
  • DjangoItem
    • Using DjangoItem
    • DjangoItem caveats
    • Django settings set up
  • Architecture overview
    • Overview
    • Components
    • Data flow
    • Event-driven networking
  • Downloader Middleware
    • Activating a downloader middleware
    • Writing your own downloader middleware
    • Built-in downloader middleware reference
  • Spider Middleware
    • Activating a spider middleware
    • Writing your own spider middleware
    • Built-in spider middleware reference
  • Extensions
    • Extension settings
    • Loading & activating extensions
    • Available, enabled and disabled extensions
    • Disabling an extension
    • Writing your own extension
    • Built-in extensions reference
  • Core API
    • Crawler API
    • Settings API
    • Signals API
    • Stats Collector API
  • Requests and Responses
    • Request objects
    • Request.meta special keys
    • Request subclasses
    • Response objects
    • Response subclasses
  • Settings
    • Designating the settings
    • Populating the settings
    • How to access settings
    • Rationale for setting names
    • Built-in settings reference
  • Signals
    • Deferred signal handlers
    • Built-in signals reference
  • Exceptions
    • Built-in Exceptions reference
  • Item Exporters
    • Using Item Exporters
    • Serialization of item fields
    • Built-in Item Exporters reference
  • Release notes
    • 0.20.2 (released 2013-12-09)
    • 0.20.1 (released 2013-11-28)
    • 0.20.0 (released 2013-11-08)
    • 0.18.4 (released 2013-10-10)
    • 0.18.3 (released 2013-10-03)
    • 0.18.2 (released 2013-09-03)
    • 0.18.1 (released 2013-08-27)
    • 0.18.0 (released 2013-08-09)
    • 0.16.5 (released 2013-05-30)
    • 0.16.4 (released 2013-01-23)
    • 0.16.3 (released 2012-12-07)
    • 0.16.2 (released 2012-11-09)
    • 0.16.1 (released 2012-10-26)
    • 0.16.0 (released 2012-10-18)
    • 0.14.4
    • 0.14.3
    • 0.14.2
    • 0.14.1
    • 0.14
    • 0.12
    • 0.10
    • 0.9
    • 0.8
    • 0.7
  • Contributing to Scrapy
    • Reporting bugs
    • Writing patches
    • Submitting patches
    • Coding style
    • Scrapy Contrib
    • Documentation policies
    • Tests
  • Versioning and API Stability
    • Versioning
    • API Stability
  • Experimental features
    • Add commands using external libraries
 

Link Extractors

LinkExtractors are objects whose only purpose is to extract links from web pages (scrapy.http.Response objects), which will eventually be followed.

There are two Link Extractors available in Scrapy by default, but you can create your own custom Link Extractors to suit your needs by implementing a simple interface.

The only public method that every LinkExtractor has is extract_links, which receives a Response object and returns a list of links. Link Extractors are meant to be instantiated once and their extract_links method called several times with different responses, to extract links to follow.

Link extractors are used in the CrawlSpider class (available in Scrapy), through a set of rules, but you can also use them in your spiders, even if you don’t subclass from CrawlSpider, as their purpose is very simple: to extract links.
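
For example, a spider that does not subclass CrawlSpider can still instantiate a link extractor once and follow the links it returns. The following is only a sketch (the spider name, start URL and allow pattern are made-up placeholders), using the imports documented elsewhere in these docs:

    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.http import Request
    from scrapy.spider import BaseSpider

    class FollowLinksSpider(BaseSpider):
        name = 'followlinks'
        start_urls = ['http://www.example.com/']

        # Instantiate the link extractor once and reuse it for every response
        link_extractor = SgmlLinkExtractor(allow=(r'/some/section/', ))

        def parse(self, response):
            # extract_links() returns a list of Link objects; follow each one
            for link in self.link_extractor.extract_links(response):
                yield Request(link.url, callback=self.parse)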

Built-in link extractors reference

All available link extractor classes bundled with Scrapy are provided in the scrapy.contrib.linkextractors module.

SgmlLinkExtractor

class scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)

The SgmlLinkExtractor extends the base BaseSgmlLinkExtractor by providing additional filters that you can specify to extract links, including regular expression patterns that the links must match to be extracted. All those filters are configured through these constructor parameters:

Parameters:
  • allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be extracted. If not given (or empty), it will match all links.
  • deny (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted). It has precedence over the allow parameter. If not given (or empty) it won’t exclude any links.
  • allow_domains (str or list) – a single value or a list of strings containing the domains which will be considered for extracting the links
  • deny_domains (str or list) – a single value or a list of strings containing domains which won’t be considered for extracting the links
  • deny_extensions (list) – a list of extensions that should be ignored when extracting links. If not given, it will default to the IGNORED_EXTENSIONS list defined in the scrapy.linkextractor module.
  • restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links. See examples below.
  • tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
  • attrs (list) – list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',)
  • canonicalize (boolean) – canonicalize each extracted url (using scrapy.utils.url.canonicalize_url). Defaults to True.
  • unique (boolean) – whether duplicate filtering should be applied to extracted links.
  • process_value (callable) – see process_value argument of BaseSgmlLinkExtractor class constructor
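
For instance, here is a small self-contained sketch (the URL patterns, domain and XPath below are illustrative assumptions, not values from this page) combining several of these filters and running the extractor against an in-memory response:

    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.http import HtmlResponse

    # A tiny in-memory page, just to have something to run the extractor on
    response = HtmlResponse(url='http://www.example.com/index.html',
        body='<div id="content"><a href="/category/books">Books</a>'
             '<a href="/accounts/logout">Logout</a></div>')

    link_extractor = SgmlLinkExtractor(
        allow=(r'/category/\w+', ),                  # absolute URLs must match this
        deny=(r'/accounts/logout', ),                # deny takes precedence over allow
        allow_domains=('example.com', ),             # only links to this domain
        restrict_xpaths=('//div[@id="content"]', ),  # only scan this page region
    )

    # Returns a list of Link objects; here, only the /category/books link
    links = link_extractor.extract_links(response)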

BaseSgmlLinkExtractor

class scrapy.contrib.linkextractors.sgml.BaseSgmlLinkExtractor(tag="a", attr="href", unique=False, process_value=None)

The purpose of this Link Extractor is only to serve as a base class for the SgmlLinkExtractor. You should use that one instead.

The constructor arguments are:

Parameters:
  • tag (str or callable) – either a string (with the name of a tag) or a function that receives a tag name and returns True if links should be extracted from that tag, or False if they shouldn’t. Defaults to 'a'.
  • attr (str or callable) – either a string (with the name of a tag attribute), or a function that receives an attribute name and returns True if links should be extracted from it, or False if they shouldn’t. Defaults to 'href'.
  • unique (boolean) – a boolean that specifies whether duplicate filtering should be applied to the extracted links.
  • process_value (callable) –

    a function which receives each value extracted from the scanned tags and attributes, and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.

    For example, to extract links from this code:

    <a href="javascript:goToPage('../other/page.html'); return false">Link text</a>
    

    You can use the following function in process_value:

    import re

    def process_value(value):
        # Pull the real target page out of the javascript:goToPage('...') call
        m = re.search(r"javascript:goToPage\('(.*?)'", value)
        if m:
            return m.group(1)