gransk.core

gransk.core.abstract_subscriber

class gransk.core.abstract_subscriber.Subscriber(pipeline)

Bases: object

Abstract class that all subscribers inherit from.

Add subscriber to pipeline.

Parameters: pipeline (gransk.core.pipeline.Pipeline) – Pipeline managing subscribers and events.
CONSUMES = None

List of topics (unicode) that this subscriber subscribes to.

MAGIC = None

Documents starting with these bytes should be passed to this subscriber.

SERVICE_ID = None

This subscriber may be fetched from the pipeline by this ID.

consume(doc, payload)

Abstract method for receiving data.

Parameters:
  • doc (gransk.core.document.Document) – The document the event belongs to.
  • payload (file) – File pointer belonging to the document.
produce(topic, doc, payload)

Add a new event to the pipeline.

Parameters:
  • topic (unicode) – Topic to add event to.
  • doc (gransk.core.document.Document) – The document the event belongs to.
  • payload (file) – File pointer belonging to the document.
setup(config)

Placeholder for configuration of subscriber, before receiving data.

Parameters: config (dict) – Configuration object for processing.
stop()

Stop the subscriber after processing is completed.
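
The interface above can be illustrated with a minimal sketch. It does not import gransk: the Subscriber base class below is a stand-in that only mirrors the documented attributes and method signatures, and RecordingPipeline, LineCounter, the topic names and the out_topic configuration key are all hypothetical. A plain dict is used in place of gransk.core.document.Document.

```python
import io


class Subscriber(object):
    """Stand-in mirroring the documented abstract subscriber (assumption)."""
    CONSUMES = None
    MAGIC = None
    SERVICE_ID = None

    def __init__(self, pipeline):
        self.pipeline = pipeline

    def setup(self, config):
        pass

    def consume(self, doc, payload):
        raise NotImplementedError

    def produce(self, topic, doc, payload):
        # Hand the event to the pipeline, which dispatches it further.
        self.pipeline.produce(topic, doc, payload)

    def stop(self):
        pass


class RecordingPipeline(object):
    """Hypothetical pipeline that only records the events it receives."""

    def __init__(self):
        self.events = []

    def produce(self, topic, doc, payload):
        self.events.append((topic, doc, payload))


class LineCounter(Subscriber):
    """Hypothetical subscriber: counts lines, then forwards the document."""
    CONSUMES = [u'text']  # hypothetical topic name

    def setup(self, config):
        # 'out_topic' is a hypothetical configuration key.
        self.out_topic = config.get('out_topic', u'line_count')

    def consume(self, doc, payload):
        doc['lines'] = sum(1 for _ in payload)
        payload.seek(0)  # rewind so later subscribers can read the payload
        self.produce(self.out_topic, doc, payload)


pipeline = RecordingPipeline()
counter = LineCounter(pipeline)
counter.setup({})
counter.consume({}, io.BytesIO(b'first\nsecond\n'))
```

Rewinding the payload before producing is a design choice of this sketch; the documentation does not state whether subscribers are expected to do so.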

gransk.core.bootstrap

gransk.core.detect_type

class gransk.core.detect_type.Subscriber(pipeline)

Bases: gransk.core.abstract_subscriber.Subscriber

Class for determining document type.

Add subscriber to pipeline.

Parameters: pipeline (gransk.core.pipeline.Pipeline) – Pipeline managing subscribers and events.
consume(doc, payload)

Determine document type, either by extension or based on Tika mimetype. Produces an event based on the found type.

Parameters:
  • doc (gransk.core.document.Document) – Document to process.
  • payload – File pointer to the document.
setup(config)

Generate file extension-based type detection from the given configuration.

Parameters: config (dict) – Configuration object.
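
Extension-based detection of the kind described above could be sketched as follows. The configuration layout (an 'extensions' key mapping each document type to a list of extensions), the type names and both helper functions are assumptions made for illustration; the actual gransk configuration format is not documented here, and the Tika mimetype fallback is not shown.

```python
import os

# Hypothetical configuration layout: document type -> list of extensions.
CONFIG = {
    'extensions': {
        'text': ['.txt', '.md'],
        'picture': ['.jpg', '.png'],
    }
}


def build_extension_lookup(config):
    """Invert the configuration into an extension -> type dictionary."""
    lookup = {}
    for doc_type, extensions in config['extensions'].items():
        for ext in extensions:
            lookup[ext.lower()] = doc_type
    return lookup


def detect_by_extension(lookup, path):
    """Return the configured type for path, or None if it is unknown."""
    ext = os.path.splitext(path)[1].lower()
    return lookup.get(ext)


lookup = build_extension_lookup(CONFIG)
known = detect_by_extension(lookup, 'notes/README.MD')
unknown = detect_by_extension(lookup, 'archive.bin')
```

Building the lookup once in setup keeps each later lookup in consume to a single dictionary access.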

gransk.core.document

gransk.core.file_collector

class gransk.core.file_collector.Collector(config)

Bases: object

Class for collecting paths from filesystem.

collect(root_path)

Collect all files matching a path recursively.

Parameters: root_path – Input path. May point to a file or directory.
Returns: Iterator of found paths.
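
A recursive collection of this kind might look like the sketch below. It is a standalone function built on os.walk, not the Collector implementation, and it assumes that a file path yields just that file and that a directory yields every file beneath it. Any pattern matching or filtering driven by the configuration is left out.

```python
import os


def collect(root_path):
    """Yield root_path if it is a file, otherwise every file beneath it."""
    if os.path.isfile(root_path):
        yield root_path
        return
    for dirpath, _dirnames, filenames in os.walk(root_path):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)
```

Because the function is a generator, paths are produced lazily, which matches the documented return type of an iterator.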

gransk.core.helper

Module containing string variables used throughout the processing.

gransk.core.injector

gransk.core.magic

class gransk.core.magic.Subscriber(pipeline)

Bases: gransk.core.abstract_subscriber.Subscriber

Identify extractor subscribers based on file header.

Add subscriber to pipeline.

Parameters: pipeline (gransk.core.pipeline.Pipeline) – Pipeline managing subscribers and events.
consume(doc, payload)

Identify extractors and call their callback functions.

Parameters:
  • doc (gransk.core.document.Document) – The document object.
  • payload (file) – File pointer belonging to the document.
setup(_)

Compile file headers for all magic extractors into a regex pattern.
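
One way to compile several file headers into a single regex is sketched below. The header table, the extractor names and both helper functions are hypothetical; only the %PDF and PK\x03\x04 signatures are real file headers (PDF and ZIP). The sketch assumes headers are matched against the start of the payload.

```python
import io
import re

# Hypothetical magic headers mapped to hypothetical extractor names.
MAGIC_HEADERS = {
    b'%PDF': 'pdf_extractor',
    b'PK\x03\x04': 'zip_extractor',
}


def compile_magic(headers):
    """Build one pattern that matches any of the given headers."""
    # Longest first, so a longer header wins over one that is its prefix.
    ordered = sorted(headers, key=len, reverse=True)
    return re.compile(b'|'.join(re.escape(h) for h in ordered))


def identify(pattern, headers, payload):
    """Return the extractor name for payload, or None if nothing matches."""
    head = payload.read(16)
    payload.seek(0)  # leave the payload readable for the extractor
    match = pattern.match(head)  # match() anchors at the first byte
    return headers[match.group(0)] if match else None


pattern = compile_magic(MAGIC_HEADERS)
pdf_hit = identify(pattern, MAGIC_HEADERS, io.BytesIO(b'%PDF-1.4 example'))
no_hit = identify(pattern, MAGIC_HEADERS, io.BytesIO(b'plain text'))
```

A single alternation lets every registered header be tested in one pass over the first bytes, instead of one comparison per extractor.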

gransk.core.pipeline

class gransk.core.pipeline.Pipeline

Bases: object

Class for instantiating and managing subscribers and events during processing. Subscribers are registered to topics (str). A subscriber may register to any number of topics, or to a magic header. When a subscriber produces an event, the pipeline finds all subscribers for that event topic and calls the consume(doc, payload) method of each, one by one. See gransk.core.abstract_subscriber.Subscriber.CONSUMES.

During text extraction, we may want to implement custom extractors. This is done by registering to a magic header, which means the first N bytes of the document. See gransk.core.abstract_subscriber.Subscriber.MAGIC.
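
The topic dispatch described above can be sketched with a toy class. ToyPipeline is not the gransk Pipeline; it only mirrors the documented register_listener and produce signatures, and it assumes that callbacks are called in registration order and that an event with no listeners is silently ignored. Topic names and the dict used as a document are hypothetical.

```python
class ToyPipeline(object):
    """Toy topic dispatcher mirroring the documented method signatures."""

    def __init__(self):
        self.listeners = {}  # topic -> list of callbacks

    def register_listener(self, topic, callback):
        self.listeners.setdefault(topic, []).append(callback)

    def produce(self, topic, doc, payload):
        # Call every callback registered for the topic, one by one.
        for callback in self.listeners.get(topic, []):
            callback(doc, payload)


calls = []
toy = ToyPipeline()
toy.register_listener(u'text', lambda doc, payload: calls.append(('first', doc)))
toy.register_listener(u'text', lambda doc, payload: calls.append(('second', doc)))
toy.produce(u'text', {'path': 'a.txt'}, None)
toy.produce(u'no_listeners', {}, None)  # nothing is registered for this topic
```

In this sketch a subscriber's consume method would be the callback passed to register_listener, once for each topic in its CONSUMES list.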

get_service(service_id)

Get service by ID.

Parameters: service_id (str) – ID of service to fetch.
Returns: The service object, or None if no service is found.
produce(topic, doc, payload)

Produce a new event.

Parameters:
  • topic (str) – The topic of the produced event.
  • doc (gransk.core.document.Document) – The document to which the event belongs.
  • payload (file) – The file pointer belonging to the document.
register_listener(topic, callback)

Register a subscriber callback to a topic.

Parameters:
  • topic (str) – The topic to subscribe to.
  • callback (function) – Function to call when an event with this topic is produced.
register_magic(magic, subscriber)

Register a subscriber to a magic header.

Parameters:
  • magic – The header of files to subscribe to.
  • subscriber – The subscriber object.
register_service(service_id, service)

Register a subscriber as a service that is fetchable by ID. There may only be a single service with a given ID.

Parameters:
  • service_id (str) – The ID of the service.
  • service (object) – The service object.
stop()

Stop all subscribers.

gransk.core.pipeline.build_pipeline(config)

Build the pipeline based on the given configuration.

Parameters: config (dict) – The configuration object.
Returns: Instantiated gransk.core.pipeline.Pipeline
gransk.core.pipeline.init_subscriber(config, subscriber_mod, pipeline)

Instantiate a Subscriber object and add it to the pipeline.

Parameters:
  • config (dict) – The configuration object.
  • subscriber_mod (str) – Reference to the module containing the Subscriber.
  • pipeline (gransk.core.pipeline.Pipeline) – The pipeline object to add the subscriber to.

gransk.core.process

class gransk.core.process.Subscriber(pipeline)

Bases: gransk.core.abstract_subscriber.Subscriber

Module for producing common processing events on a document.

Add subscriber to pipeline.

Parameters: pipeline (gransk.core.pipeline.Pipeline) – Pipeline managing subscribers and events.
consume(doc, _)

Run a document through processing events.

Parameters: doc (gransk.core.document.Document) – Document to process.