Settings¶
The Frontera settings allow you to customize the behaviour of all components, including the FrontierManager, Middleware and Backend themselves.
The infrastructure of the settings provides a global namespace of key-value mappings that can be used to pull configuration values from. The settings can be populated through different mechanisms, which are described below.
For a list of available built-in settings see: Built-in settings reference.
Designating the settings¶
When you use Frontera, you have to tell it which settings you’re using. As FrontierManager is the main entry point to Frontier usage, you can do this by using the method described in the Loading from settings section.
When using a string path pointing to a settings file for the frontier we propose the following directory structure:
my_project/
    frontier/
        __init__.py
        settings.py
        middlewares.py
        backends.py
    ...
These are basically:
frontier/settings.py: the frontier settings file.
frontier/middlewares.py: the middlewares used by the frontier.
frontier/backends.py: the backend(s) used by the frontier.
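As a rough illustration, a frontier/settings.py file could consist of nothing more than uppercase module-level assignments. The setting names below come from the reference on this page; the values are arbitrary examples, not recommendations.

```python
# frontier/settings.py (illustrative values only)
BACKEND = 'frontera.contrib.backends.memory.FIFO'
MIDDLEWARES = [
    'frontera.contrib.middlewares.fingerprint.UrlFingerprintMiddleware',
]
MAX_REQUESTS = 2000      # finish after 2000 returned requests (0 = no limit)
MAX_NEXT_REQUESTS = 10   # requests returned per get_next_requests call
AUTO_START = True        # start the frontier automatically
```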
How to access settings¶
Settings can be accessed through the FrontierManager.settings attribute, which is passed to the Middleware.from_manager and Backend.from_manager class methods:
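from frontera.core.components import Component

class MyMiddleware(Component):
    @classmethod
    def from_manager(cls, manager):
        # manager.settings is the frontier's Settings object
        settings = manager.settings
        if settings.TEST_MODE:
            print("test mode is enabled!")
        return cls()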
In other words, settings can be accessed as attributes of the Settings object.
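The attribute-style access described above can be sketched with a small stand-in class. SketchSettings is a hypothetical name used only for illustration; it is not Frontera's Settings implementation, and it only assumes that uppercase names are treated as settings and that overrides win over defaults.

```python
class SketchSettings(object):
    """Hypothetical stand-in: exposes uppercase keys as attributes."""

    def __init__(self, defaults, overrides=None):
        self._values = dict(defaults)
        for name, value in (overrides or {}).items():
            if name.isupper():  # ignore anything that is not a setting name
                self._values[name] = value

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        try:
            return self.__dict__['_values'][name]
        except KeyError:
            raise AttributeError(name)


settings = SketchSettings(
    {'TEST_MODE': False, 'MAX_REQUESTS': 0},
    {'TEST_MODE': True, 'not_a_setting': 1},
)
if settings.TEST_MODE:
    print("test mode is enabled!")
```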
Built-in frontier settings¶
Here’s a list of all available Frontera settings, in alphabetical order, along with their default values and the scope where they apply.
AUTO_START¶
Default: True
Whether to enable frontier automatic start. See Starting/Stopping the frontier
BACKEND¶
Default: 'frontera.contrib.backends.memory.FIFO'
The Backend to be used by the frontier. For more info see Activating a backend.
BC_MIN_REQUESTS¶
Default: 64
The broad crawling queue get operation keeps retrying until the specified number of requests is collected. The maximum number of retries is hard-coded to 3.
BC_MIN_HOSTS¶
Default: 24
Keep retrying when getting requests from the queue until requests for the specified minimum number of hosts have been collected. The maximum number of retries is hard-coded to 3.
BC_MAX_REQUESTS_PER_HOST¶
Default: 128
If possible, don’t include batches of requests for a specific host once the count of requests already collected for that host exceeds this maximum. This is a suggestion for the broad crawling queue get algorithm.
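How the three BC_* settings could interact is sketched below. This is not Frontera's queue code: collect_requests and fetch_batch are hypothetical names, and the sketch assumes that both minimums must be met before the loop stops early and that the per-host limit is applied request by request.

```python
def collect_requests(fetch_batch, min_requests=64, min_hosts=24,
                     max_requests_per_host=128, max_tries=3):
    """Hypothetical helper; fetch_batch() returns a list of (host, url) pairs."""
    per_host = {}
    collected = []
    tries = 0
    while tries < max_tries:
        tries += 1
        for host, url in fetch_batch():
            if per_host.get(host, 0) >= max_requests_per_host:
                continue  # this host already has its share, skip the request
            per_host[host] = per_host.get(host, 0) + 1
            collected.append((host, url))
        # Stop early once enough requests and enough distinct hosts are collected.
        if len(collected) >= min_requests and len(per_host) >= min_hosts:
            break
    return collected, tries
```

With the defaults from this page, the loop gives up after three attempts even if fewer than 64 requests or 24 hosts were gathered.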
CANONICAL_SOLVER¶
Default: frontera.contrib.canonicalsolvers.Basic
The CanonicalSolver to be used by the frontier for resolving canonical URLs. For more info see Canonical URL Solver.
DELAY_ON_EMPTY¶
Default: 5.0
Delay between calls to the backend for new batches in the Scrapy scheduler, when the queue size drops below CONCURRENT_REQUESTS. When the backend has no requests to fetch, this delay helps to exhaust the rest of the buffer without hitting the backend on every request. Increase it if calls to your backend are taking too long, and decrease it if you need a fast spider bootstrap from seeds.
DISCOVERY_MAX_PAGES¶
Default: 100
The maximum number of pages to schedule by Discovery crawling strategy.
DOMAIN_STATS_LOG_INTERVAL¶
Default: 300
Time interval in seconds to rotate the domain statistics in the db worker batch generator. Enabled only when logging is set to DEBUG.
KAFKA_GET_TIMEOUT¶
Default: 5.0
Time the process should block waiting until the requested amount of data is received from the message bus. This is a general message bus setting with an obsolete Kafka-related name.
LOCAL_MODE¶
Default: True
Sets single process run mode. Crawling strategy together with backend are used from the same spider process.
LOGGING_CONFIG¶
Default: logging.conf
The path to a file with logging module configuration. See https://docs.python.org/2/library/logging.config.html#logging-config-fileformat. If the file is absent, the logging system will be initialized with logging.basicConfig() and the CONSOLE handler will be used. This option is used only in the db worker and strategy worker.
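The fallback described above can be approximated with the standard library alone. The helper name setup_logging is hypothetical, and the sketch only attaches the default stream handler from logging.basicConfig(); it does not reproduce Frontera's own CONSOLE handler.

```python
import logging
import logging.config
import os


def setup_logging(path='logging.conf'):
    """Hypothetical helper: use the config file if present, else basicConfig()."""
    if os.path.exists(path):
        logging.config.fileConfig(path)
        return 'file'
    logging.basicConfig()  # default handler writes to the console (stderr)
    return 'basic'
```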
MAX_NEXT_REQUESTS¶
Default: 64
The maximum number of requests returned by the get_next_requests API method. In a distributed context it can be the number of requests produced per spider by the db worker, or the count of requests read from the message bus per attempt to fill the spider queue. In single process mode it is the count of requests to get from the backend per call to the get_next_requests method.
MAX_REQUESTS¶
Default: 0
Maximum number of returned requests after which Frontera is finished. If value is 0 (default), the frontier will continue indefinitely. See Finishing the frontier.
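The stopping rule can be written as a one-line check. The function name is_finished is hypothetical; the sketch only encodes what the description states, namely that 0 disables the limit.

```python
def is_finished(returned_requests, max_requests=0):
    """Hypothetical check: a limit of 0 means the frontier never finishes by count."""
    return max_requests != 0 and returned_requests >= max_requests
```

For example, with MAX_REQUESTS = 100 the check becomes true once 100 requests have been returned, while with the default of 0 it stays false regardless of the count.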
MESSAGE_BUS¶
Default: frontera.contrib.messagebus.zeromq.MessageBus
Points Frontera to message bus implementation. Defaults to ZeroMQ.
MESSAGE_BUS_CODEC¶
Default: frontera.contrib.backends.remote.codecs.msgpack
Points Frontera to message bus codec implementation. Here is the codec interface description. Defaults to MsgPack.
MIDDLEWARES¶
A list containing the middlewares enabled in the frontier. For more info see Activating a middleware.
Default:
[
'frontera.contrib.middlewares.fingerprint.UrlFingerprintMiddleware',
]
NEW_BATCH_DELAY¶
Default: 30.0
Used in the DB worker: the time interval between production of new batches for all partitions. If a partition is busy, it will be skipped.
OVERUSED_KEEP_PER_KEY¶
Default: 1000
After the purging this number of requests will be left in the queue.
OVERUSED_MAX_KEYS¶
Default: None
A threshold that triggers purging of keys in OverusedBuffer. After purging, OVERUSED_KEEP_KEYS keys are left. None disables purging.
OVERUSED_MAX_PER_KEY¶
Default: None
Purging will start when reaching this number of requests per key and leave OVERUSED_KEEP_PER_KEY requests.
None
disables purging.
OVERUSED_SLOT_FACTOR¶
Default: 5.0
The ratio (in progress + queued requests in that slot) / (max allowed concurrent downloads per slot) above which a slot is considered overused. This affects only the Scrapy scheduler.
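A worked version of this ratio, as a sketch: is_overused is a hypothetical helper, and the strict "greater than" comparison is an assumption about how the threshold is applied.

```python
def is_overused(in_progress, queued, max_concurrent_per_slot, slot_factor=5.0):
    """Hypothetical helper applying the ratio described above."""
    ratio = (in_progress + queued) / float(max_concurrent_per_slot)
    return ratio > slot_factor


# 8 in progress + 40 queued with 8 concurrent downloads gives a ratio of 6.0.
busy = is_overused(in_progress=8, queued=40, max_concurrent_per_slot=8)
# 8 in progress + 30 queued gives 4.75, which stays under the default factor.
calm = is_overused(in_progress=8, queued=30, max_concurrent_per_slot=8)
```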
REQUEST_MODEL¶
Default: 'frontera.core.models.Request'
The Request model to be used by the frontier.
RESPONSE_MODEL¶
Default: 'frontera.core.models.Response'
The Response model to be used by the frontier.
SPIDER_LOG_CONSUMER_BATCH_SIZE¶
Default: 512
This is the batch size used by strategy and db workers when consuming the spider log stream. Increasing it causes the worker to spend more time on every task while processing more items per task, leaving less time for other tasks during a fixed time interval. Reducing it results in several tasks running within the same time interval, but with lower overall efficiency. Adjust it when your consumers are too slow or too fast.
SCORING_LOG_CONSUMER_BATCH_SIZE¶
Default: 512
This is the batch size used by the db worker when consuming the scoring log stream. Use it when you need to adjust scoring log consumption speed.
SCORING_PARTITION_ID¶
Default: 0
Used by the strategy worker; represents the partition the strategy worker is assigned to.
SPIDER_LOG_PARTITIONS¶
Default: 1
Number of spider log stream partitions. This affects the number of required strategy worker(s): each strategy worker is assigned to its own partition.
SPIDER_FEED_PARTITIONS¶
Default: 1
Number of spider feed partitions. This directly affects the number of spider processes running. Every spider is assigned to its own partition.
STORE_CONTENT¶
Default: False
Determines if content should be sent over the message bus and stored in the backend: a serious performance killer.
STRATEGY¶
Default: frontera.worker.strategies.basic.BasicCrawlingStrategy
The path to crawling strategy class.
STRATEGY_ARGS¶
Default: {}
Dict with default arguments for the crawling strategy. Can be overridden with a command line option in the strategy worker.
SW_FLUSH_INTERVAL¶
Default: 300
Interval between flushing of states in the strategy worker. Also used to set the initial random delay to flush states periodically, using the formula RANDINT(SW_FLUSH_INTERVAL).
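One way to read the RANDINT(SW_FLUSH_INTERVAL) formula is a uniformly chosen whole number of seconds between 0 and the interval. The sketch below assumes exactly that; the bounds and the helper name initial_flush_delay are assumptions, not taken from the Frontera source.

```python
import random

SW_FLUSH_INTERVAL = 300  # seconds, the documented default


def initial_flush_delay(interval=SW_FLUSH_INTERVAL, rng=random):
    """Hypothetical helper: random first delay in [0, interval] seconds."""
    return rng.randint(0, interval)


delay = initial_flush_delay()
```

A random initial delay of this kind spreads out the first flush of several workers started at the same moment.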
Built-in fingerprint middleware settings¶
Settings used by the UrlFingerprintMiddleware and DomainFingerprintMiddleware.
URL_FINGERPRINT_FUNCTION¶
Default: frontera.utils.fingerprint.sha1
The function used to calculate the url fingerprint.
DOMAIN_FINGERPRINT_FUNCTION¶
Default: frontera.utils.fingerprint.sha1
The function used to calculate the domain fingerprint.
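A fingerprint function of the kind these two settings refer to could look like the following sketch. sha1_fingerprint is a hypothetical function written for illustration (Python 3 assumed); it is not the code of frontera.utils.fingerprint.sha1.

```python
import hashlib


def sha1_fingerprint(key):
    """Hypothetical fingerprint function: hex SHA-1 digest of the key."""
    if isinstance(key, str):
        key = key.encode('utf-8')  # hashlib works on bytes
    return hashlib.sha1(key).hexdigest()
```

A settings file could then reference such a function by its dotted path, for example 'my_project.frontier.utils.sha1_fingerprint' (a hypothetical path).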
TLDEXTRACT_DOMAIN_INFO¶
Default: False
If set to True, will use tldextract to attach extra domain information (second-level, top-level and subdomain) to the meta field (see Adding additional data to objects).
Built-in backends settings¶
SQLAlchemy¶
SQLALCHEMYBACKEND_CACHE_SIZE¶
Default: 10000
SQLAlchemy Metadata LRU cache size. It’s used for caching objects that would otherwise be requested from the DB every time already known documents are crawled. This mainly saves DB throughput: increase it if you’re experiencing problems with too high a volume of SELECTs to the Metadata table, or decrease it if you need to save memory.
SQLALCHEMYBACKEND_CLEAR_CONTENT¶
Default: True
Set to False if you need to disable table content clean up on backend instantiation (e.g. every Scrapy spider run).
SQLALCHEMYBACKEND_DROP_ALL_TABLES¶
Default: True
Set to False if you need to disable dropping of DB tables on backend instantiation (e.g. every Scrapy spider run).
SQLALCHEMYBACKEND_ENGINE¶
Default: sqlite:///:memory:
SQLAlchemy database URL. Default is set to memory.
SQLALCHEMYBACKEND_ENGINE_ECHO¶
Default: False
Turn on/off SQLAlchemy verbose output. Useful for debugging SQL queries.
SQLALCHEMYBACKEND_MODELS¶
Default:
{
'MetadataModel': 'frontera.contrib.backends.sqlalchemy.models.MetadataModel',
'StateModel': 'frontera.contrib.backends.sqlalchemy.models.StateModel',
'QueueModel': 'frontera.contrib.backends.sqlalchemy.models.QueueModel'
}
This is a mapping of the SQLAlchemy models used by backends. It is mainly used for customization. This setting uses a dictionary where the key represents the name of the model to define and the value the model to use.
Revisiting backend¶
SQLALCHEMYBACKEND_REVISIT_INTERVAL¶
Default: timedelta(days=1)
Time between document visits, expressed as a datetime.timedelta object. Changing this setting will only affect documents scheduled after the change. All previously queued documents will be crawled with the old periodicity.
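Since the value is a datetime.timedelta, a settings file has to import it before assigning the interval. The six-hour value below is an arbitrary illustration, not a recommended setting.

```python
from datetime import timedelta

# Illustrative value: revisit documents every six hours instead of daily.
SQLALCHEMYBACKEND_REVISIT_INTERVAL = timedelta(hours=6)
```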
HBase backend¶
HBASE_DROP_ALL_TABLES¶
Default: False
Enables dropping and creation of new HBase tables on worker start.
HBASE_DOMAIN_METADATA_CACHE_SIZE¶
Default: 1000
The count of domain-value pairs cached in memory in strategy worker. Pairs are evicted from cache using LRU policy.
HBASE_DOMAIN_METADATA_BATCH_SIZE¶
Default: 100
Maximum count of domain-value pairs kept in write buffer before actual write happens.
HBASE_NAMESPACE¶
Default: crawler
Name of HBase namespace where all crawler related tables will reside.
HBASE_STATE_WRITE_LOG_SIZE¶
Default: 15000
Number of state changes in the state cache of the strategy worker before it gets flushed to HBase and cleared.
HBASE_STATE_CACHE_SIZE_LIMIT¶
Default: 3000000
Number of cached state changes in the state cache of strategy worker. Internally there is cachetools.LRUCache
storing all the recent state changes, discarding least recently used when the cache gets over its capacity.
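The eviction behaviour of a size-limited LRU cache can be illustrated with the standard library. TinyLRU is a hypothetical stand-in written for this page; it is not cachetools.LRUCache and only mimics the "discard least recently used when over capacity" rule.

```python
from collections import OrderedDict


class TinyLRU(object):
    """Hypothetical stand-in for an LRU cache with a fixed capacity."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        value = self._data.pop(key)
        self._data[key] = value  # re-insert as the most recently used entry
        return value

    def put(self, key, value):
        self._data.pop(key, None)
        self._data[key] = value
        while len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict the least recently used entry


cache = TinyLRU(maxsize=2)
cache.put('a', 1)
cache.put('b', 2)
cache.get('a')     # 'a' becomes the most recently used key
cache.put('c', 3)  # capacity exceeded, so 'b' is evicted
```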
HBASE_USE_FRAMED_COMPACT¶
Default: False
Enabling this option dramatically reduces transmission overhead, but the server needs to be properly configured to use Thrift’s framed transport and compact protocol.
HBASE_USE_SNAPPY¶
Default: False
Whether to compress content and metadata in HBase using Snappy. Decreases the amount of disk and network IO within HBase, lowering response times. HBase has to be properly configured to support Snappy compression.
ZeroMQ message bus settings¶
The message bus class is frontera.contrib.messagebus.zeromq.MessageBus
ZMQ_ADDRESS¶
Default: 127.0.0.1
Defines where the ZeroMQ socket should bind or connect. Can be a hostname or an IP address. Right now ZMQ has only been properly tested with IPv4. Proper IPv6 support will be added in the near future.
ZMQ_BASE_PORT¶
Default: 5550
The base port for all ZeroMQ sockets. Six sockets are used overall, on consecutive ports starting from the base port. Make sure the ports in the range [base, base+5] are available.
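The port range that needs to be free follows directly from the base port. The helper zmq_ports below is a hypothetical name used to spell out the arithmetic; it does not say which socket uses which port.

```python
def zmq_ports(base_port=5550, socket_count=6):
    """Ports base, base+1, ..., base+5 for the six documented sockets."""
    return list(range(base_port, base_port + socket_count))


ports = zmq_ports()  # the six ports used with the default base port
```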
Kafka message bus settings¶
The message bus class is frontera.contrib.messagebus.kafkabus.MessageBus
KAFKA_LOCATION¶
Hostname and port of the Kafka broker, separated by a colon (:). Can also be a string with several hostname:port pairs separated by commas (,).
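The accepted string format can be made concrete with a small parser. parse_kafka_location is a hypothetical helper and the broker names are made up; Frontera passes the setting on to its Kafka client rather than parsing it this way.

```python
def parse_kafka_location(location):
    """Hypothetical helper: split 'host:port,host:port' into (host, port) tuples."""
    brokers = []
    for item in location.split(','):
        host, _, port = item.strip().rpartition(':')
        brokers.append((host, int(port)))
    return brokers


single = parse_kafka_location('localhost:9092')
several = parse_kafka_location('kafka1:9092, kafka2:9093')
```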
KAFKA_CODEC¶
Default: KAFKA_CODEC
Compression codec to use with kafka-python 1.0.x. It is a string and can be one of none, snappy, gzip or lz4.
KAFKA_CERT_PATH¶
OS path to the folder with three certificate files: ca-cert.pem, client-cert.pem, client-key.pem.
KAFKA_ENABLE_SSL¶
Boolean. Set to True to enable SSL connection in Kafka client.
SPIDER_LOG_DBW_GROUP¶
Default: dbw-spider-log
Kafka consumer group name, used for spider log by db worker(s).
SPIDER_LOG_SW_GROUP¶
Default: sw-spider-log
Kafka consumer group name, used for spider log by strategy worker(s).
SCORING_LOG_DBW_GROUP¶
Default: dbw-scoring-log
Kafka consumer group name, used for scoring log by db worker(s).
SPIDER_FEED_GROUP¶
Default: fetchers-spider-feed
Kafka consumer group name, used for spider feed by spider(s).
SCORING_LOG_TOPIC¶
Kafka topic used for scoring log stream.
Default settings¶
If no settings are specified, frontier will use the built-in default ones. For a complete list of default values see: Built-in settings reference. All default settings can be overridden.