Crawling strategy¶
Use frontera.worker.strategies.bfs
module for reference. In general, you need to write a crawling strategy class
implementing the interface:
-
class
frontera.worker.strategies.
BaseCrawlingStrategy
(manager, mb_stream, states_context)¶ Interface definition for a crawling strategy.
Before calling these methods strategy worker is adding ‘state’ key to meta field in every
Request
with state of the URL. Pleases refer for the states to HBaseBackend implementation.After exiting from all of these methods states from meta field are passed back and stored in the backend.
Methods
-
classmethod
from_worker
(manager, mb_stream, states_context)¶ Called on instantiation in strategy worker.
Parameters: - manager –
class: Backend <frontera.core.manager.FrontierManager> instance - mb_stream –
class: UpdateScoreStream <frontera.worker.strategy.UpdateScoreStream> instance
Returns: new instance
- manager –
-
add_seeds
(seeds)¶ Called when add_seeds event is received from spider log.
Parameters: seeds (list) – A list of Request
objects.
-
page_crawled
(response, links)¶ Called every time document was successfully crawled, and receiving page_crawled event from spider log.
Parameters:
-
page_error
(request, error)¶ Called every time there was error during page downloading.
Parameters: - request (object) – The fetched with error
Request
object. - error (str) – A string identifier for the error.
- request (object) – The fetched with error
-
finished
()¶ Called by Strategy worker, after finishing processing each cycle of spider log. If this method returns true, then Strategy worker reports that crawling goal is achieved, stops and exits.
Returns: bool
-
close
()¶ Called when strategy worker is about to close crawling strategy.
-
classmethod
The class can be put in any module and passed to strategy worker using command line option or
CRAWLING_STRATEGY
setting on startup.
The strategy class instantiated in strategy worker, and can use it’s own storage or any other kind of resources. All
items from spider log will be passed through these methods. Scores returned doesn’t have to be the same as in
method arguments. Periodically finished()
method is called to check if crawling goal is achieved.