Using the Frontier with Scrapy¶

Using Frontera is quite easy, it includes a set of Scrapy middlewares and Scrapy scheduler that encapsulates Frontera usage and can be easily configured using Scrapy settings.

Activating the frontier¶

The Frontera uses 2 different middlewares: SchedulerSpiderMiddleware and SchedulerDownloaderMiddleware, and it’s own scheduler FronteraScheduler.

To activate the Frontera in your Scrapy project, just add them to the SPIDER_MIDDLEWARES, DOWNLOADER_MIDDLEWARES and SCHEDULER settings:

SPIDER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
})

DOWNLOADER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
})

SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

Create a Frontera settings.py file and add it to your Scrapy settings:

FRONTERA_SETTINGS = 'tutorial.frontera.settings'

Another option is to put these settings right into Scrapy settings module.

Organizing files¶

When using frontier with a Scrapy project, we propose the following directory structure:

my_scrapy_project/
    my_scrapy_project/
        frontera/
            __init__.py
            settings.py
            middlewares.py
            backends.py
        spiders/
            ...
        __init__.py
        settings.py
     scrapy.cfg

These are basically:

my_scrapy_project/frontera/settings.py: the Frontera settings file.
my_scrapy_project/frontera/middlewares.py: the middlewares used by the Frontera.
my_scrapy_project/frontera/backends.py: the backend(s) used by the Frontera.
my_scrapy_project/spiders: the Scrapy spiders folder
my_scrapy_project/settings.py: the Scrapy settings file
scrapy.cfg: the Scrapy config file

Running the сrawl¶

Just run your Scrapy spider as usual from the command line:

scrapy crawl myspider

Frontier Scrapy settings¶

You can configure your frontier two ways:

Using FRONTERA_SETTINGS parameter, which is a module path pointing to Frontera settings in Scrapy settings file. Defaults to None
Define frontier settings right into Scrapy settings file.

Defining frontier settings via Scrapy settings¶

Frontier settings can also be defined via Scrapy settings. In this case, the order of precedence will be the following:

Settings defined in the file pointed by FRONTERA_SETTINGS (higher precedence)
Settings defined in the Scrapy settings
Default frontier settings

Writing Scrapy spider¶

Spider logic¶

Creation of basic Scrapy spider is described at Quick start single process page.

It’s also a good practice to prevent spider from closing because of insufficiency of queued requests transport::

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = cls(*args, **kwargs)
    spider._set_crawler(crawler)
    spider.crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
    return spider

def spider_idle(self):
    self.log("Spider idle signal caught.")
    raise DontCloseSpider

Configuration guidelines¶

There several tunings you can make for efficient broad crawling.

Adding one of seed loaders for bootstrapping of crawling process:

SPIDER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.seeds.file.FileSeedLoader': 1,
})

Various settings suitable for broad crawling:

HTTPCACHE_ENABLED = False   # Turns off disk cache, which has low hit ratio during broad crawls
REDIRECT_ENABLED = True
COOKIES_ENABLED = False
DOWNLOAD_TIMEOUT = 120
RETRY_ENABLED = False   # Retries can be handled by Frontera itself, depending on crawling strategy
DOWNLOAD_MAXSIZE = 10 * 1024 * 1024  # Maximum document size, causes OOM kills if not set
LOGSTATS_INTERVAL = 10  # Print stats every 10 secs to console

Auto throttling and concurrency settings for polite and responsible crawling::

# auto throttling
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_DEBUG = False
AUTOTHROTTLE_MAX_DELAY = 3.0
AUTOTHROTTLE_START_DELAY = 0.25     # Any small enough value, it will be adjusted during operation by averaging
                                    # with response latencies.
RANDOMIZE_DOWNLOAD_DELAY = False

# concurrency
CONCURRENT_REQUESTS = 256           # Depends on many factors, and should be determined experimentally
CONCURRENT_REQUESTS_PER_DOMAIN = 10
DOWNLOAD_DELAY = 0.0

Check also Scrapy broad crawling recommendations.

Scrapy Seed Loaders¶

Frontera has some built-in Scrapy middlewares for seed loading.

Seed loaders use the process_start_requests method to generate requests from a source that are added later to the FrontierManager.

Activating a Seed loader¶

Just add the Seed Loader middleware to the SPIDER_MIDDLEWARES scrapy settings:

SPIDER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.seeds.FileSeedLoader': 650
})

FileSeedLoader¶

Load seed URLs from a file. The file must be formatted contain one URL per line:

http://www.asite.com
http://www.anothersite.com
...

Yo can disable URLs using the # character:

...
#http://www.acommentedsite.com
...

Settings:

SEEDS_SOURCE: Path to the seeds file

S3SeedLoader¶

Load seeds from a file stored in an Amazon S3 bucket

File format should the same one used in FileSeedLoader.

Settings:

SEEDS_SOURCE: Path to S3 bucket file. eg: s3://some-project/seed-urls/
SEEDS_AWS_ACCESS_KEY: S3 credentials Access Key
SEEDS_AWS_SECRET_ACCESS_KEY: S3 credentials Secret Access Key