Message bus
The message bus is the transport layer abstraction mechanism. It provides an interface and several implementations. Only one message bus
can be used in the crawler at a time, and it is selected with the MESSAGE_BUS
setting.
Spider processes can use
- class frontera.contrib.backends.remote.messagebus.MessageBusBackend(manager)
to communicate using the message bus.
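As a rough illustration of the selection described above, a spider-side settings module might look like the sketch below. The class paths come from this page. The module itself is hypothetical, and the use of `BACKEND` as the setting name for the spider backend is an assumption that should be checked against the settings reference.

```python
# Sketch of a spider-side settings module (hypothetical file, e.g. spider_settings.py).

# Transport used by the crawler. Only one message bus can be active at a time;
# the ZeroMQ implementation is the documented default.
MESSAGE_BUS = 'frontera.contrib.messagebus.zeromq.MessageBus'

# Backend that lets the spider process communicate over the message bus.
# The setting name BACKEND is an assumption; the class path is from this page.
BACKEND = 'frontera.contrib.backends.remote.messagebus.MessageBusBackend'
```

Both values are plain dotted paths, so switching the transport should only require changing the `MESSAGE_BUS` string.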
Built-in message bus reference
ZeroMQ
It’s the default option, implemented using the lightweight ZeroMQ library in
- class frontera.contrib.messagebus.zeromq.MessageBus(settings)
and can be configured using ZeroMQ message bus settings.
The ZeroMQ message bus requires an installed ZeroMQ library and a running broker process, see Start cluster.
WARNING! The ZeroMQ message bus doesn’t yet support multiple SW and DB workers; only one instance of each worker type is allowed.
Kafka
Can be selected with
- class frontera.contrib.messagebus.kafkabus.MessageBus(settings)
and configured using Kafka message bus settings.
It requires a running Kafka service and is more suitable for large-scale web crawling.
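Selecting the Kafka implementation could be sketched as follows. The class path is the one given above. `KAFKA_LOCATION` and the broker address are assumptions used for illustration only; the Kafka message bus settings list the actual option names.

```python
# Sketch of a settings fragment that selects the Kafka message bus.
MESSAGE_BUS = 'frontera.contrib.messagebus.kafkabus.MessageBus'

# Assumed setting name and placeholder address of a running Kafka broker.
KAFKA_LOCATION = 'localhost:9092'
```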
Protocol
Depending on the stream, Frontera uses several message types to encode its messages. Every message is a native Python object serialized using msgpack (JSON is also available, but has to be selected manually in code).
Here are the classes to subclass in order to implement your own codec:
- class frontera.core.codec.BaseEncoder
  - encode_add_seeds(seeds)
    Encodes an add_seeds message.
    Parameters: seeds (list) – A list of frontier Request objects
    Returns: bytes encoded message
  - encode_page_crawled(response, links)
    Encodes a page_crawled message.
    Parameters:
    - response (object) – A frontier Response object
    - links (list) – A list of Request objects
    Returns: bytes encoded message
  - encode_request_error(request, error)
    Encodes a request_error message.
    Parameters:
    - request (object) – A frontier Request object
    - error (string) – Error description
    Returns: bytes encoded message
  - encode_request(request)
    Encodes requests for the spider feed stream.
    Parameters: request (object) – Frontera Request object
    Returns: bytes encoded message
  - encode_update_score(fingerprint, score, url, schedule)
    Encodes update_score messages for the scoring log stream.
    Parameters:
    - fingerprint (str) – fingerprint in hex form
    - score (float) – score
    - url (str) – A document url
    - schedule (bool) – True if the document needs to be scheduled for download
    Returns: bytes encoded message
  - encode_new_job_id(job_id)
    Encodes a change of the job_id parameter.
    Parameters: job_id (int)
    Returns: bytes encoded message
  - encode_offset(partition_id, offset)
    Encodes the current spider offset in the spider feed.
    Parameters:
    - partition_id (int)
    - offset (int)
    Returns: bytes encoded message
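To make the encoder interface above more concrete, here is a minimal sketch of a JSON-based encoder covering three of the listed methods. It is not Frontera's own codec. The class name `SketchJsonEncoder`, the helper `_dump`, and the `[type, payload]` wire layout are all hypothetical. So that the example runs without Frontera installed, the class does not derive from `frontera.core.codec.BaseEncoder`; a real codec would subclass it and implement the remaining methods as well.

```python
import json


class SketchJsonEncoder(object):
    """Illustrative encoder; a real codec would subclass BaseEncoder."""

    def _dump(self, message_type, payload):
        # Hypothetical wire layout: a two-element list [type, payload],
        # serialized to JSON and returned as bytes.
        return json.dumps([message_type, payload]).encode('utf-8')

    def encode_update_score(self, fingerprint, score, url, schedule):
        # Mirrors the documented parameters of encode_update_score.
        return self._dump('update_score', {
            'fingerprint': fingerprint,
            'score': float(score),
            'url': url,
            'schedule': bool(schedule),
        })

    def encode_new_job_id(self, job_id):
        return self._dump('new_job_id', {'job_id': int(job_id)})

    def encode_offset(self, partition_id, offset):
        return self._dump('offset', {
            'partition_id': int(partition_id),
            'offset': int(offset),
        })


encoder = SketchJsonEncoder()
raw = encoder.encode_offset(partition_id=0, offset=42)

# Decoding with the same layout recovers the message type and its fields.
message_type, payload = json.loads(raw.decode('utf-8'))
```

Tagging every message with its type is one simple way for a matching decoder to dispatch on the message kind. Swapping `json.dumps` for a msgpack serializer would keep the same structure.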
-