Middlewares
Frontier Middleware sits between FrontierManager and Backend objects, using hooks for Request and Response processing according to frontier data flow.
It’s a light, low-level system for filtering and altering Frontier’s requests and responses.
Activating a middleware
To activate a Middleware component, add it to the MIDDLEWARES setting, which is a list whose values can be class paths or instances of Middleware objects.
Here’s an example:
MIDDLEWARES = [
'frontera.contrib.middlewares.domain.DomainMiddleware',
]
Middlewares are called in the order in which they are defined in the list, so choose your middleware's position according to where you want it to run. The order matters because each middleware performs a different action, and your middleware could depend on some previous (or subsequent) middleware being applied.
Finally, keep in mind that some middlewares may need to be enabled through a particular setting. See each middleware's documentation for more info.
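As a sketch of why order matters, the list below places DomainMiddleware ahead of DomainFingerprintMiddleware, on the assumption that the latter writes into the domain field the former creates. The class paths are the ones documented on this page; the ordering your project needs may differ.

```python
# Settings sketch: middlewares run top to bottom.
# Assumption: DomainFingerprintMiddleware expects meta['domain'] to exist,
# so DomainMiddleware is listed before it.
MIDDLEWARES = [
    'frontera.contrib.middlewares.domain.DomainMiddleware',
    'frontera.contrib.middlewares.fingerprint.UrlFingerprintMiddleware',
    'frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware',
]
```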
Writing your own middleware
Writing your own frontier middleware is easy. Each Middleware component is a single Python class inherited from Component.
FrontierManager will communicate with all active middlewares through the methods described below.
class frontera.core.components.Middleware
Interface definition for a Frontier Middleware
Methods
- frontier_start()
  Called when the frontier starts, see starting/stopping the frontier.
- frontier_stop()
  Called when the frontier stops, see starting/stopping the frontier.
- page_crawled(response)
  This method is called every time a page has been crawled.
  Parameters: response (object) – The Response object for the crawled page.
  Returns: Response or None
  Should either return None or a Response object.
  If it returns None, FrontierManager won't continue processing any other middleware and Backend will never be notified.
  If it returns a Response object, this will be passed to the next middleware. This process will repeat for all active middlewares until the result is finally passed to the Backend.
  If you want to filter a page, just return None.
- request_error(page, error)
  This method is called each time an error occurs when crawling a page.
  Parameters:
    - request (object) – The Request object that was crawled with an error.
    - error (string) – A string identifier for the error.
  Returns: Request or None
  Should either return None or a Request object.
  If it returns None, FrontierManager won't continue processing any other middleware and Backend will never be notified.
  If it returns a Request object, this will be passed to the next middleware. This process will repeat for all active middlewares until the result is finally passed to the Backend.
  If you want to filter a page error, just return None.
Class Methods
- classmethod from_manager(manager)
  Class method called from FrontierManager passing the manager itself.
  Example of usage:
  def from_manager(cls, manager):
      return cls(settings=manager.settings)
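To make the interface concrete, here is a minimal, self-contained sketch of a filtering middleware and of the chain behaviour described above. It does not import Frontera: `Response`, `ExtensionFilterMiddleware`, `run_page_crawled`, and the `BLOCKED_EXTENSIONS` setting are all hypothetical stand-ins, and a real component would inherit from frontera.core.components.Middleware instead.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace


@dataclass
class Response:
    """Hypothetical stand-in for a frontier Response (url and meta only)."""
    url: str
    meta: dict = field(default_factory=dict)


class ExtensionFilterMiddleware:
    """Sketch of a middleware that drops pages with unwanted extensions."""

    def __init__(self, blocked_extensions=()):
        self.blocked_extensions = tuple(blocked_extensions)

    @classmethod
    def from_manager(cls, manager):
        # BLOCKED_EXTENSIONS is an invented setting name, used for illustration.
        return cls(manager.settings.get("BLOCKED_EXTENSIONS", ()))

    def frontier_start(self):
        pass

    def frontier_stop(self):
        pass

    def page_crawled(self, response):
        # Returning None filters the page: later middlewares never see it.
        if response.url.lower().endswith(self.blocked_extensions):
            return None
        response.meta["checked"] = True
        return response

    def request_error(self, page, error):
        # Pass failed requests through unchanged.
        return page


def run_page_crawled(middlewares, response):
    """Mimic the documented chain: stop as soon as a middleware returns None."""
    for middleware in middlewares:
        response = middleware.page_crawled(response)
        if response is None:
            return None
    return response


# Usage: a fake manager object that only carries a settings mapping.
manager = SimpleNamespace(settings={"BLOCKED_EXTENSIONS": (".pdf",)})
chain = [ExtensionFilterMiddleware.from_manager(manager)]
dropped = run_page_crawled(chain, Response("http://example.com/report.pdf"))
kept = run_page_crawled(chain, Response("http://example.com/index.html"))
```

Here `dropped` is None because the middleware filtered the page, while `kept` is the same Response object with an extra meta entry, ready to be handed to the next component.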
Built-in middleware reference
This page describes all Middleware components that come with Frontera.
For information on how to use them and how to write your own middleware, see the middleware usage guide.
For a list of the components enabled by default (and their order), see the MIDDLEWARES setting.
DomainMiddleware
class frontera.contrib.middlewares.domain.DomainMiddleware
This Middleware will add a domain info field to every Request.meta and Response.meta if it is activated.
The domain object will contain the following fields, with both keys and values as bytes:
- netloc: URL netloc according to RFC 1808 syntax specifications
- name: Domain name
- scheme: URL scheme
- tld: Top level domain
- sld: Second level domain
- subdomain: URL subdomain(s)
An example for a Request object:
>>> request.url
'http://www.scrapinghub.com:8080/this/is/an/url'
>>> request.meta['domain']
{
    "name": "scrapinghub.com",
    "netloc": "www.scrapinghub.com",
    "scheme": "http",
    "sld": "scrapinghub",
    "subdomain": "www",
    "tld": "com"
}
If TEST_MODE is active, it will accept testing URLs, parsing letter domains:
>>> request.url
'A1'
>>> request.meta['domain']
{
    "name": "A",
    "netloc": "A",
    "scheme": "-",
    "sld": "-",
    "subdomain": "-",
    "tld": "-"
}
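The field breakdown above can be illustrated with a rough standard-library sketch. This is not Frontera's implementation: `naive_domain_info` is a hypothetical helper that treats the last host label as the TLD (so suffixes such as co.uk are handled wrongly), uses the host without the port as netloc to match the example, and returns str rather than bytes.

```python
from urllib.parse import urlparse


def naive_domain_info(url):
    """Split a URL into the domain fields listed above (naive TLD handling)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    labels = host.split(".")
    if len(labels) > 1:
        tld, sld = labels[-1], labels[-2]
        name = sld + "." + tld
    else:
        # Single-label hosts such as "localhost" have no TLD.
        tld, sld, name = "", host, host
    return {
        "netloc": host,
        "name": name,
        "scheme": parsed.scheme,
        "tld": tld,
        "sld": sld,
        "subdomain": ".".join(labels[:-2]),
    }


info = naive_domain_info("http://www.scrapinghub.com:8080/this/is/an/url")
```

A production implementation would normally consult a public suffix list to separate tld and sld correctly.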
UrlFingerprintMiddleware
class frontera.contrib.middlewares.fingerprint.UrlFingerprintMiddleware
This Middleware will add a fingerprint field to every Request.meta and Response.meta if it is activated.
The fingerprint will be calculated from the object's URL, using the function defined in the URL_FINGERPRINT_FUNCTION setting. You can write your own fingerprint calculation function and use it by changing this setting. The fingerprint must be bytes.
An example for a Request object:
>>> request.url
'http://www.scrapinghub.com:8080'
>>> request.meta['fingerprint']
'60d846bc2969e9706829d5f1690f11dafb70ed18'
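A custom fingerprint function could look like the sketch below, which hashes the URL with SHA-1 and returns the hex digest as bytes. The function name and the module path in the trailing comment are hypothetical, and the comment assumes the setting accepts a dotted import path in the same way MIDDLEWARES accepts class paths.

```python
import hashlib


def sha1_url_fingerprint(url):
    """Return the SHA-1 hex digest of a URL as bytes."""
    if isinstance(url, str):
        url = url.encode("utf-8")
    return hashlib.sha1(url).hexdigest().encode("ascii")


fingerprint = sha1_url_fingerprint("http://www.scrapinghub.com:8080")

# In the settings module (hypothetical path, shown as an assumption):
# URL_FINGERPRINT_FUNCTION = 'myproject.fingerprints.sha1_url_fingerprint'
```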
frontera.utils.fingerprint.hostname_local_fingerprint(key)
This function is used for URL fingerprinting, which serves to uniquely identify the document in storage.
hostname_local_fingerprint constructs the fingerprint by taking the first 4 bytes as the CRC32 of the host and the remainder as the MD5 of the rest of the URL. This default is chosen to make use of the HBase block cache: all the documents of an average website are expected to fit within one cache block, which can be read efficiently from disk once.
Parameters: key – str URL
Returns: str 20 bytes hex string
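The scheme described above can be sketched as follows. This is an illustration written from the description, not the library's code: the function name is hypothetical, and the choice of which URL parts count as "the rest" (path, params, query, fragment) is an assumption.

```python
import hashlib
import struct
import zlib
from binascii import hexlify
from urllib.parse import urlparse


def hostname_local_fingerprint_sketch(url):
    """CRC32 of the host (4 bytes) followed by MD5 of the rest (16 bytes)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").encode("utf-8")
    rest = ";".join(
        [parsed.path, parsed.params, parsed.query, parsed.fragment]
    ).encode("utf-8")
    host_crc = zlib.crc32(host)            # unsigned 32-bit int in Python 3
    doc_md5 = hashlib.md5(rest).digest()   # 16 raw bytes
    # Big-endian packing keeps the host checksum as the leading 4 bytes.
    packed = struct.pack(">I16s", host_crc, doc_md5)
    return hexlify(packed).decode("ascii")


fp_a = hostname_local_fingerprint_sketch("http://www.scrapinghub.com/a")
fp_b = hostname_local_fingerprint_sketch("http://www.scrapinghub.com/b")
```

In this sketch the 20 packed bytes come out as 40 hex characters, and two URLs on the same host share their first 8 characters, which is what lets keys for one site sort next to each other in storage.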
DomainFingerprintMiddleware
class frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware
This Middleware will add a fingerprint field to the domain field of every Request.meta and Response.meta if it is activated.
The fingerprint will be calculated from the object's URL, using the function defined in the DOMAIN_FINGERPRINT_FUNCTION setting. You can write your own fingerprint calculation function and use it by changing this setting. The fingerprint must be bytes.
An example for a Request object:
>>> request.url
'http://www.scrapinghub.com:8080'
>>> request.meta['domain']
{
    "fingerprint": "5bab61eb53176449e25c2c82f172b82cb13ffb9d",
    "name": "scrapinghub.com",
    "netloc": "www.scrapinghub.com",
    "scheme": "http",
    "sld": "scrapinghub",
    "subdomain": "www",
    "tld": "com"
}