gransk.core
gransk.core.abstract_subscriber
- class gransk.core.abstract_subscriber.Subscriber(pipeline)
  Bases: object
  Abstract class that subscribers inherit from.
  Add subscriber to pipeline.
  Parameters: pipeline (gransk.core.pipeline.Pipeline) – Pipeline managing subscribers and events.
- CONSUMES = None
  Topics to subscribe to, given as a list of unicode strings.
- MAGIC = None
  Documents starting with these bytes should be passed to this subscriber.
- SERVICE_ID = None
  This subscriber may be fetched from the pipeline by this ID.
- consume(doc, payload)
  Abstract method for receiving data.
  Parameters:
  - doc (gransk.core.document.Document) – The document the event belongs to.
  - payload (file) – File pointer belonging to the document.
- produce(topic, doc, payload)
  Add a new event to the pipeline.
  Parameters:
  - topic (unicode) – Topic to add the event to.
  - doc (gransk.core.document.Document) – The document the event belongs to.
  - payload (file) – File pointer belonging to the document.
- setup(config)
  Placeholder for configuring the subscriber before it receives data.
  Parameters: config (dict) – Configuration object for processing.
- stop()
  Stop the subscriber after processing is completed.
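The interface above can be illustrated with a minimal, self-contained sketch. It does not import gransk: `RecordingPipeline` and `SketchSubscriber` are hypothetical stand-ins that only mirror the attributes and methods documented here, and the topic name, service ID, and `out_topic` configuration key are assumptions made for illustration. The sketch is written for Python 3, so the payload is an `io.BytesIO` object, not a `file`.

```python
import io


class RecordingPipeline:
    """Hypothetical stand-in for the pipeline: it only records produced events."""

    def __init__(self):
        self.events = []

    def produce(self, topic, doc, payload):
        self.events.append((topic, doc))


class SketchSubscriber:
    """Stand-in exposing the documented attributes and methods (not gransk's code)."""

    CONSUMES = None
    MAGIC = None
    SERVICE_ID = None

    def __init__(self, pipeline):
        self.pipeline = pipeline

    def setup(self, config):
        pass

    def consume(self, doc, payload):
        raise NotImplementedError

    def produce(self, topic, doc, payload):
        self.pipeline.produce(topic, doc, payload)

    def stop(self):
        pass


class LineCounter(SketchSubscriber):
    """Example subscriber: counts lines, then emits a follow-up event."""

    CONSUMES = ["text"]          # hypothetical topic name
    SERVICE_ID = "line_counter"  # hypothetical service ID

    def setup(self, config):
        # "out_topic" is an assumed configuration key.
        self.out_topic = config.get("out_topic", "line_count")
        self.counts = {}

    def consume(self, doc, payload):
        payload.seek(0)
        self.counts[doc] = sum(1 for _ in payload)
        self.produce(self.out_topic, doc, payload)

    def stop(self):
        self.stopped = True


# Lifecycle: construct with a pipeline, configure, receive data, stop.
pipeline = RecordingPipeline()
counter = LineCounter(pipeline)
counter.setup({})
counter.consume("doc-1", io.BytesIO(b"first\nsecond\n"))
counter.stop()
```

The sketch drives `setup`, `consume`, and `stop` by hand; in gransk the pipeline is described as managing subscribers and events, so these calls would presumably come from there.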
gransk.core.bootstrap
gransk.core.detect_type
- class gransk.core.detect_type.Subscriber(pipeline)
  Bases: gransk.core.abstract_subscriber.Subscriber
  Class for determining document type.
  Add subscriber to pipeline.
  Parameters: pipeline (gransk.core.pipeline.Pipeline) – Pipeline managing subscribers and events.
- consume(doc, payload)
  Determine the document type, either by file extension or from the Tika mimetype, and produce an event based on the detected type.
  Parameters:
  - doc (gransk.core.document.Document) – Document to process.
  - payload – File pointer to the document.
- setup(config)
  Build the file-extension-based type detection from the given configuration.
  Parameters: config (dict) – Configuration object.
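As a rough illustration of extension-based detection with a mimetype fallback, consider the sketch below. The layout of the `extensions` configuration key, the helper names, and the rule of falling back to the top-level media type are all assumptions; the documentation above only states that the type is determined by extension or by Tika mimetype.

```python
import os

# Hypothetical configuration: each type name lists the extensions it covers.
config = {"extensions": {"text": ["txt", "md"], "image": ["png", "jpg"]}}


def build_extension_map(config):
    """Invert the configuration into an extension -> type lookup table."""
    ext_map = {}
    for doc_type, extensions in config.get("extensions", {}).items():
        for ext in extensions:
            ext_map[ext.lower()] = doc_type
    return ext_map


def detect_type(path, mimetype, ext_map):
    """Prefer the file extension; otherwise fall back to the mimetype."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext in ext_map:
        return ext_map[ext]
    if mimetype:
        # Assumed fallback: use the top-level media type, e.g. "image".
        return mimetype.split("/", 1)[0]
    return "unknown"


ext_map = build_extension_map(config)
```

With this table, `detect_type("report.TXT", None, ext_map)` resolves through the extension, while a file with an unlisted extension relies on whatever mimetype was detected for it.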
gransk.core.document
gransk.core.file_collector
gransk.core.helper
Module containing string variables used throughout the processing.
gransk.core.injector
gransk.core.magic
- class gransk.core.magic.Subscriber(pipeline)
  Bases: gransk.core.abstract_subscriber.Subscriber
  Identify extractor subscribers based on the file header.
  Add subscriber to pipeline.
  Parameters: pipeline (gransk.core.pipeline.Pipeline) – Pipeline managing subscribers and events.
- consume(doc, payload)
  Identify extractors and call their callback functions.
  Parameters:
  - doc (gransk.core.document.Document) – The document object.
  - payload (file) – File pointer belonging to the document.
- setup(_)
  Compile file headers for all magic extractors into a regex pattern.
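One way to compile several file headers into a single regex pattern is sketched below. The header table, the extractor names, the 16-byte read size, and both helper functions are hypothetical; this is not gransk's implementation, only an illustration of the technique using Python 3 bytes patterns.

```python
import io
import re

# Hypothetical magic headers mapped to hypothetical extractor names.
MAGIC_HEADERS = {
    b"%PDF": "pdf_extractor",
    b"PK\x03\x04": "zip_extractor",
    b"\x89PNG": "png_extractor",
}


def compile_magic(headers):
    """Join all headers into one alternation, escaping regex metacharacters."""
    # Longest first, so a longer header wins over one that is its prefix.
    ordered = sorted(headers, key=len, reverse=True)
    return re.compile(b"|".join(re.escape(header) for header in ordered))


def find_extractor(payload, pattern, headers):
    """Match the first bytes of the payload against the compiled headers."""
    payload.seek(0)
    head = payload.read(16)  # assumed to be long enough for every header
    payload.seek(0)          # leave the file pointer where later readers expect it
    match = pattern.match(head)
    return headers[match.group(0)] if match else None


pattern = compile_magic(MAGIC_HEADERS)
pdf_result = find_extractor(io.BytesIO(b"%PDF-1.4 body"), pattern, MAGIC_HEADERS)
other_result = find_extractor(io.BytesIO(b"plain text"), pattern, MAGIC_HEADERS)
```

Because `pattern.match` anchors at the start of the buffer, a header that appears later in the file does not count as a match.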
gransk.core.pipeline
- class gransk.core.pipeline.Pipeline
  Bases: object
  Class for instantiating and managing subscribers and events during processing. Subscribers are registered to topics (str). A subscriber may register to any number of topics, or to a magic header. When a subscriber produces an event, the pipeline finds all subscribers for that event topic and calls them (their consume(doc, payload) function) one by one. See gransk.core.abstract_subscriber.Subscriber.CONSUMES.
  During text extraction, we may want to implement custom extractors. This is done by registering to a magic header, that is, the first N bytes of the document. See gransk.core.abstract_subscriber.Subscriber.MAGIC.
- get_service(service_id)
  Get service by ID.
  Parameters: service_id (str) – ID of service to fetch.
  Returns: The service object, or None if no service is found.
- produce(topic, doc, payload)
  Produce a new event.
  Parameters:
  - topic (str) – The topic of the produced event.
  - doc (gransk.core.Document) – The document to which the event belongs.
  - payload (file) – The file pointer belonging to the document.
- register_listener(topic, callback)
  Register a subscriber callback to a topic.
  Parameters:
  - topic (str) – The topic to subscribe to.
  - callback (function) – Function to call when an event with this topic is produced.
- register_magic(magic, subscriber)
  Register a subscriber to a magic header.
  Parameters:
  - magic – The header of files to subscribe to.
  - subscriber – The subscriber object.
- register_service(service_id, service)
  Register a subscriber as a service that is fetchable by ID. There may only be a single service with a given ID.
  Parameters:
  - service_id (str) – The ID of the service.
  - service (object) – The service object.
- stop()
  Stop all subscribers.
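The topic dispatch and service lookup described for this class can be modeled with a toy pipeline. `SketchPipeline` is a hypothetical stand-in, not gransk's implementation: calling listeners in registration order and rejecting a duplicate service ID with `ValueError` are assumptions, since the documentation only says that subscribers are called one by one and that a service ID is unique.

```python
class SketchPipeline:
    """Toy model of the documented behavior (not gransk's code)."""

    def __init__(self):
        self.listeners = {}  # topic -> callbacks, in registration order
        self.services = {}   # service_id -> service object

    def register_listener(self, topic, callback):
        self.listeners.setdefault(topic, []).append(callback)

    def register_service(self, service_id, service):
        # Assumption: registering the same ID twice is treated as an error.
        if service_id in self.services:
            raise ValueError("service already registered: %s" % service_id)
        self.services[service_id] = service

    def get_service(self, service_id):
        # Returns None when no service has this ID.
        return self.services.get(service_id)

    def produce(self, topic, doc, payload):
        # Call every listener registered for the topic, one by one.
        for callback in self.listeners.get(topic, []):
            callback(doc, payload)


calls = []
pipeline = SketchPipeline()
pipeline.register_listener("text", lambda doc, payload: calls.append(("first", doc)))
pipeline.register_listener("text", lambda doc, payload: calls.append(("second", doc)))
pipeline.register_service("index", object())

pipeline.produce("text", "doc-1", None)       # reaches both listeners
pipeline.produce("unhandled", "doc-1", None)  # no listeners, nothing happens
```

A listener's callback could itself call `produce` with a new topic, which is how one subscriber's output would become another subscriber's input in this model.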
- gransk.core.pipeline.build_pipeline(config)
  Build the pipeline based on the given configuration.
  Parameters: config (dict) – The configuration object.
  Returns: Instantiated gransk.core.pipeline.Pipeline
- gransk.core.pipeline.init_subscriber(config, subscriber_mod, pipeline)
  Instantiate a Subscriber object and add it to the pipeline.
  Parameters:
  - subscriber_mod (str) – Reference to the module containing the Subscriber.
  - pipeline (gransk.core.Pipeline) – The pipeline object to add the subscriber to.
gransk.core.process
- class gransk.core.process.Subscriber(pipeline)
  Bases: gransk.core.abstract_subscriber.Subscriber
  Module for producing common processing events on a document.
  Add subscriber to pipeline.
  Parameters: pipeline (gransk.core.pipeline.Pipeline) – Pipeline managing subscribers and events.
- consume(doc, _)
  Run a document through processing events.
  Parameters: doc (gransk.core.document.Document) – Document to process.