gransk.core¶
gransk.core.abstract_subscriber¶
-
class
gransk.core.abstract_subscriber.
Subscriber
(pipeline)¶ Bases:
object
Abstract class that subscribers inherit from.
Add subscriber to pipeline.
Parameters: pipeline ( gransk.core.pipeline.Pipeline
) – Pipeline managing subscribers and events.
-
CONSUMES
= None¶ Subscribe to the following list of topics [
unicode
]
-
MAGIC
= None¶ Documents starting with these bytes should be passed to this subscriber.
-
SERVICE_ID
= None¶ This subscriber may be fetched from the pipeline by this ID.
-
consume
(doc, payload)¶ Abstract method for receiving data.
Parameters: - doc (
gransk.core.document.Document
) – The document the event belongs to. - payload (
file
) – File pointer belonging to the document.
-
produce
(topic, doc, payload)¶ Add a new event to the pipeline.
Parameters: - topic (
unicode
) – Topic to add event to. - doc (
gransk.core.document.Document
) – The document the event belongs to. - payload (
file
) – File pointer belonging to the document.
-
setup
(config)¶ Placeholder for configuration of subscriber, before receiving data.
Parameters: config ( dict
) – Configuration object for processing.
-
stop
()¶ Stop subscriber after processing is completed.
-
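The subscriber contract described above can be illustrated with a small stand-in. This is a minimal sketch and not gransk's implementation: FakePipeline and UppercaseSubscriber are hypothetical names, the topics 'text' and 'upper_text' are made up, and the assumption that the base constructor subscribes consume() to every topic in CONSUMES is inferred from "Add subscriber to pipeline".

```python
import io


class FakePipeline(object):
    """Hypothetical stand-in for gransk.core.pipeline.Pipeline."""

    def __init__(self):
        self.listeners = {}  # topic -> list of callbacks
        self.events = []     # record of (topic, doc) pairs for inspection

    def register_listener(self, topic, callback):
        self.listeners.setdefault(topic, []).append(callback)

    def produce(self, topic, doc, payload):
        self.events.append((topic, doc))
        for callback in self.listeners.get(topic, []):
            callback(doc, payload)


class Subscriber(object):
    """Stand-in base class mirroring the documented attributes."""
    CONSUMES = None

    def __init__(self, pipeline):
        self.pipeline = pipeline
        # Assumption: the base class subscribes consume() to each topic.
        for topic in self.CONSUMES or []:
            pipeline.register_listener(topic, self.consume)

    def produce(self, topic, doc, payload):
        self.pipeline.produce(topic, doc, payload)

    def consume(self, doc, payload):
        raise NotImplementedError


class UppercaseSubscriber(Subscriber):
    """Hypothetical subscriber: reads 'text' events, emits 'upper_text'."""
    CONSUMES = ['text']

    def consume(self, doc, payload):
        upper = payload.read().upper()
        self.produce('upper_text', doc, io.BytesIO(upper))


pipeline = FakePipeline()
UppercaseSubscriber(pipeline)
results = []
pipeline.register_listener(
    'upper_text', lambda doc, payload: results.append(payload.read()))
pipeline.produce('text', 'doc-1', io.BytesIO(b'hello'))
```

A subscriber only declares what it consumes and calls produce() for what it emits, so it never needs to know which other subscribers exist.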
gransk.core.bootstrap¶
gransk.core.detect_type¶
-
class
gransk.core.detect_type.
Subscriber
(pipeline)¶ Bases:
gransk.core.abstract_subscriber.Subscriber
Class for determining document type.
Add subscriber to pipeline.
Parameters: pipeline ( gransk.core.pipeline.Pipeline
) – Pipeline managing subscribers and events.
-
consume
(doc, payload)¶ Determine document type, either by extension or based on Tika mimetype. Produces an event based on the found type.
Parameters: - doc (
gransk.core.document.Document
) – Document to process. - payload – File pointer to the document.
-
setup
(config)¶ Generate file extension-based type detection from the given configuration.
Parameters: config ( dict
) – Configuration object.
-
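The extension-based detection with a mimetype fallback can be sketched as follows. This is an illustration under stated assumptions: the configuration keys 'extension_types' and 'mimetype_types', the type names, and the rule that the extension takes priority over the mimetype are all hypothetical and not taken from gransk.

```python
import os

# Hypothetical configuration layout; the real gransk config keys may differ.
config = {
    'extension_types': {
        'text': ['.txt', '.md'],
        'archive': ['.zip', '.tar'],
    },
    'mimetype_types': {
        'application/pdf': 'document',
    },
}


def build_extension_lookup(config):
    """Invert {type: [extensions]} into {extension: type}."""
    lookup = {}
    for doc_type, extensions in config['extension_types'].items():
        for ext in extensions:
            lookup[ext.lower()] = doc_type
    return lookup


def detect_type(path, mimetype, ext_lookup, config):
    """Try the file extension first, then the mimetype, else 'unknown'."""
    ext = os.path.splitext(path)[1].lower()
    if ext in ext_lookup:
        return ext_lookup[ext]
    return config['mimetype_types'].get(mimetype, 'unknown')


ext_lookup = build_extension_lookup(config)
```

Building the inverted lookup once, in the setup step, keeps the per-document work down to a single dictionary access.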
gransk.core.document¶
gransk.core.file_collector¶
gransk.core.helper¶
Module containing string variables used throughout the processing.
gransk.core.injector¶
gransk.core.magic¶
-
class
gransk.core.magic.
Subscriber
(pipeline)¶ Bases:
gransk.core.abstract_subscriber.Subscriber
Identify extractor subscribers based on file header.
Add subscriber to pipeline.
Parameters: pipeline ( gransk.core.pipeline.Pipeline
) – Pipeline managing subscribers and events.
-
consume
(doc, payload)¶ Identify extractors and call their callback functions.
Parameters: - doc (
gransk.core.document.Document
) – The document object. - payload (
file
) – File pointer belonging to the document.
-
setup
(_)¶ Compile file headers for all magic extractors into a regex pattern.
-
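Compiling several file headers into one regex pattern, as setup is described as doing, might look like the sketch below. The header-to-extractor mapping, the function names, and the 16-byte read size are assumptions chosen for illustration and do not come from gransk.

```python
import io
import re

# Hypothetical magic headers mapped to extractor names.
MAGIC = {
    b'%PDF': 'pdf_extractor',
    b'PK\x03\x04': 'zip_extractor',
    b'\x1f\x8b': 'gzip_extractor',
}


def compile_magic(headers):
    """Build one pattern that is an alternation of the escaped headers."""
    # Longest first, so a longer header wins over a shorter prefix of it.
    ordered = sorted(headers, key=len, reverse=True)
    return re.compile(b'(' + b'|'.join(re.escape(h) for h in ordered) + b')')


def find_extractor(pattern, headers, payload):
    """Match the start of the payload; return the extractor name or None."""
    match = pattern.match(payload.read(16))
    payload.seek(0)  # rewind so the extractor sees the whole file
    return headers[match.group(1)] if match else None


pattern = compile_magic(MAGIC)
pdf_payload = io.BytesIO(b'%PDF-1.4 example')
pdf_result = find_extractor(pattern, MAGIC, pdf_payload)
```

Because pattern.match() only matches at the beginning of the bytes, a header that appears later in the file does not trigger an extractor.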
gransk.core.pipeline¶
-
class
gransk.core.pipeline.
Pipeline
¶ Bases:
object
Class for instantiating and managing subscribers and events during processing. Subscribers are registered to topics (
str
). A subscriber may register to any number of topics, or to a magic header. When a subscriber produces an event, the pipeline finds all subscribers for that event topic and calls them (their consume(doc, payload)
function) one by one. See gransk.core.abstract_subscriber.Subscriber.CONSUMES
.
During text extraction, we may want to implement custom extractors. This is done by registering to a magic header, which means the first N bytes of the document. See
gransk.core.abstract_subscriber.Subscriber.MAGIC
.
-
get_service
(service_id)¶ Get service by ID.
Parameters: service_id ( str
) – ID of service to fetch. Returns: object
service. None if no service is found.
-
produce
(topic, doc, payload)¶ Produce a new event.
Parameters: - topic (
str
) – The topic of the produced event. - doc (
gransk.core.document.Document
) – The document to which the event belongs. - payload (
file
) – The file pointer belonging to the document.
-
register_listener
(topic, callback)¶ Register a subscriber callback to a topic.
Parameters: - topic (
str
) – The topic to subscribe to. - callback (
function
) – Function to call when an event with this topic is produced.
-
register_magic
(magic, subscriber)¶ Register a subscriber to a magic header.
Parameters: - magic – The header of files to subscribe to.
- subscriber – The subscriber object.
-
register_service
(service_id, service)¶ Register a subscriber as a service that is fetchable by ID. There may only be a single service with a given ID.
Parameters: - service_id (
str
) – The ID of the service. - service (
object
) – The service object.
-
stop
()¶ Stop all subscribers.
-
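The Pipeline methods listed above can be modelled in a few lines. The following is a simplified sketch, not the gransk source: MiniPipeline is a hypothetical name, and raising ValueError on a duplicate service ID is only one possible way to enforce the "single service with a given ID" rule.

```python
class MiniPipeline(object):
    """Hypothetical, simplified model of the documented Pipeline interface."""

    def __init__(self):
        self.listeners = {}  # topic -> callbacks, in registration order
        self.services = {}   # service_id -> service object

    def register_listener(self, topic, callback):
        self.listeners.setdefault(topic, []).append(callback)

    def register_service(self, service_id, service):
        # Assumption: a duplicate ID is rejected; gransk may differ here.
        if service_id in self.services:
            raise ValueError('service already registered: %s' % service_id)
        self.services[service_id] = service

    def get_service(self, service_id):
        # Returns None when no service has this ID, as documented.
        return self.services.get(service_id)

    def produce(self, topic, doc, payload):
        # Call every listener for the topic, one by one.
        for callback in self.listeners.get(topic, []):
            callback(doc, payload)


pipeline = MiniPipeline()
calls = []
pipeline.register_listener(
    'text', lambda doc, payload: calls.append(('first', doc)))
pipeline.register_listener(
    'text', lambda doc, payload: calls.append(('second', doc)))
pipeline.register_service('call_log', calls)
pipeline.produce('text', 'doc-1', None)
pipeline.produce('unused_topic', 'doc-2', None)
```

In this sketch an event on a topic without listeners is silently dropped, and listeners run in the order in which they were registered.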
-
gransk.core.pipeline.
build_pipeline
(config)¶ Build the pipeline based on the given configuration.
Parameters: config ( dict
) – The configuration object. Returns: Instantiated gransk.core.pipeline.Pipeline
-
gransk.core.pipeline.
init_subscriber
(config, subscriber_mod, pipeline)¶ Instantiate a Subscriber object and add it to the pipeline.
Parameters: - subscriber_mod (
str
) – Reference to the module containing the Subscriber. - pipeline (
gransk.core.pipeline.Pipeline
) – The pipeline object to add the subscriber to.
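One way to instantiate a Subscriber from a module reference is a dynamic import. The sketch below assumes that subscriber_mod is a dotted module path, that the module exposes a class named Subscriber, and that setup(config) is called right after construction; none of this is confirmed by the documentation above. The demo_subscriber module and the max_size key are throwaway names so that the example runs without gransk installed.

```python
import importlib
import sys
import types


def init_subscriber(config, subscriber_mod, pipeline):
    """Import the module, instantiate its Subscriber and configure it."""
    module = importlib.import_module(subscriber_mod)
    subscriber = module.Subscriber(pipeline)
    subscriber.setup(config)  # assumption: setup() runs after construction
    return subscriber  # returned here only so the example can inspect it


class Subscriber(object):
    """Hypothetical subscriber that just remembers what it was given."""

    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.config = None

    def setup(self, config):
        self.config = config


# Register a throwaway module so import_module() can find it.
demo = types.ModuleType('demo_subscriber')
demo.Subscriber = Subscriber
sys.modules['demo_subscriber'] = demo

pipeline = object()
instance = init_subscriber({'max_size': 10}, 'demo_subscriber', pipeline)
```

Referring to subscribers by module name lets a configuration file decide which subscribers are loaded, without the pipeline code importing them directly.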
gransk.core.process¶
-
class
gransk.core.process.
Subscriber
(pipeline)¶ Bases:
gransk.core.abstract_subscriber.Subscriber
Module for producing common processing events on a document.
Add subscriber to pipeline.
Parameters: pipeline ( gransk.core.pipeline.Pipeline
) – Pipeline managing subscribers and events.
-
consume
(doc, _)¶ Run a document through processing events.
Parameters: doc ( gransk.core.document.Document
) – Document to process.
-