gransk.plugins.extractors

Extract content from documents.

gransk.plugins.extractors.ewf_strings

gransk.plugins.extractors.file_meta

class gransk.plugins.extractors.file_meta.Subscriber(pipeline)

Bases: gransk.core.abstract_subscriber.Subscriber

Class for extracting metadata from documents using Apache Tika.

Add subscriber to pipeline.

Parameters: pipeline (gransk.core.pipeline.Pipeline) – Pipeline managing subscribers and events.
consume(doc, payload)

Upload document to Apache Tika and parse results.

Parameters:
  • doc (gransk.core.document.Document) – Document object.
  • payload (file) – File pointer belonging to the document.
setup(config)

Load the media type mapping from file. The mapping is used to determine the document type.

Parameters: config (dict) – Configuration object.
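The media type lookup described for setup can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the JSON format, the example entries, and the helper names `load_mediatype_mapping` and `document_type` are all assumptions, since the mapping file itself is not documented here.

```python
import json

# Hypothetical mapping content; the real file name and format are not
# documented in this reference.
EXAMPLE_MAPPING = (
    '{"application/pdf": "document", '
    '"image/png": "picture", '
    '"text/plain": "text"}'
)


def load_mediatype_mapping(raw):
    """Parse a JSON object that maps media types to document types."""
    mapping = json.loads(raw)
    # Lower-case the keys so that lookups are case-insensitive.
    return {key.lower(): value for key, value in mapping.items()}


def document_type(mapping, mediatype, default="unknown"):
    """Look up a document type, ignoring parameters such as '; charset=utf-8'."""
    base = mediatype.split(";", 1)[0].strip().lower()
    return mapping.get(base, default)


mapping = load_mediatype_mapping(EXAMPLE_MAPPING)
print(document_type(mapping, "text/plain; charset=utf-8"))
```

Stripping the parameters before the lookup keeps the mapping small, because a value such as `text/plain; charset=utf-8` resolves to the same entry as `text/plain`.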

gransk.plugins.extractors.picture_meta

class gransk.plugins.extractors.picture_meta.Subscriber(pipeline)

Bases: gransk.core.abstract_subscriber.Subscriber

Determine width and height. Called when the document is a picture.

Add subscriber to pipeline.

Parameters: pipeline (gransk.core.pipeline.Pipeline) – Pipeline managing subscribers and events.
consume(doc, payload)

Parse picture header and extract width/height information.

Parameters:
  • doc (gransk.core.document.Document) – Document object.
  • payload (file) – File pointer belonging to the document.
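As an illustration of what parsing a picture header involves, the sketch below reads the width and height from a PNG file object. It covers PNG only, and the function name `png_dimensions` is hypothetical. The plugin's own parsers, and the formats it supports, may differ.

```python
import io
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(payload):
    """Return (width, height) for a PNG file object, or None if not a PNG."""
    # 8-byte signature + 4-byte chunk length + 4-byte chunk type + 8 bytes
    # holding width and height.
    header = payload.read(24)
    payload.seek(0)  # rewind so later readers can start from the beginning
    if len(header) < 24 or not header.startswith(PNG_SIGNATURE):
        return None
    if header[12:16] != b"IHDR":
        return None
    # IHDR stores width and height as big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", header[16:24])
    return width, height


# Build a minimal header in memory rather than reading a real file.
sample = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 640, 480)
print(png_dimensions(io.BytesIO(sample)))
```

Only the first 24 bytes are read, so the dimensions can be determined without decoding the image data.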
setup(config)

Define picture magic headers and compile a regex pattern that is used later to select the correct parser.

Parameters: config (dict) – Configuration object.
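One way to combine magic headers into a single pattern is sketched below. The table of magic numbers and the names `MAGIC_HEADERS`, `MAGIC_PATTERN`, and `detect_format` are illustrative assumptions. The actual headers and parser lookup used by the plugin are not documented here.

```python
import re

# Illustrative magic numbers; the plugin's actual table may differ.
MAGIC_HEADERS = [
    ("png", b"\x89PNG\r\n\x1a\n"),
    ("gif", b"GIF87a"),
    ("gif", b"GIF89a"),
    ("jpeg", b"\xff\xd8\xff"),
]

# One capturing group per magic header, joined as alternatives.
MAGIC_PATTERN = re.compile(
    b"|".join(b"(" + re.escape(magic) + b")" for _, magic in MAGIC_HEADERS)
)


def detect_format(header):
    """Return the format name whose magic header starts `header`, else None."""
    match = MAGIC_PATTERN.match(header)
    if match is None:
        return None
    # lastindex is the 1-based number of the group that matched.
    return MAGIC_HEADERS[match.lastindex - 1][0]


print(detect_format(b"GIF89a\x01\x00\x01\x00"))
```

Compiling the pattern once in setup means each later call needs only a single regex match against the start of the file, and the matched group index identifies which parser to use.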

gransk.plugins.extractors.strings

gransk.plugins.extractors.tika_extractor

class gransk.plugins.extractors.tika_extractor.Subscriber(pipeline)

Bases: gransk.core.abstract_subscriber.Subscriber

Class for uploading documents to Apache Tika and reading the text response. Tika is an open source tool that can parse a large number (more than 200) of document formats.

Add subscriber to pipeline.

Parameters: pipeline (gransk.core.pipeline.Pipeline) – Pipeline managing subscribers and events.
consume(doc, payload)

Upload document to Apache Tika and add result to document as text.

Parameters:
  • doc (gransk.core.document.Document) – Document object.
  • payload (file) – File pointer belonging to the document.
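For orientation, an upload to a Tika server can be sketched with the standard library as shown below. It assumes a Tika server at `http://localhost:9998` and uses the server's `PUT /tika` endpoint with `Accept: text/plain`. Whether this plugin uses that endpoint, host, or HTTP client is an assumption, and the function names are hypothetical.

```python
import urllib.request


def build_tika_text_request(payload_bytes, base_url="http://localhost:9998"):
    """Build (but do not send) a PUT request asking Tika for plain text."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=payload_bytes,
        headers={"Accept": "text/plain"},
        method="PUT",
    )


def extract_text(payload_bytes, base_url="http://localhost:9998", timeout=60):
    """Send the document to a running Tika server and return its text."""
    request = build_tika_text_request(payload_bytes, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Keeping request construction separate from sending makes the upload logic easy to inspect without a running server. `extract_text` requires one to be reachable at `base_url`.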
setup(config)

Define the maximum size of a document to upload.

Parameters: config (dict) – Configuration object.
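A size limit of this kind could be enforced along the following lines. The configuration key `max_upload_size`, its default value, and the helper names are hypothetical. The key the plugin actually reads is not documented here.

```python
import io
import os


def payload_size(payload):
    """Return the size of a seekable file object without moving its position."""
    position = payload.tell()
    payload.seek(0, os.SEEK_END)
    size = payload.tell()
    payload.seek(position)  # restore the original position
    return size


def should_upload(payload, config):
    """Return True if the payload is within the configured size limit."""
    # "max_upload_size" is a hypothetical key used only for illustration.
    max_size = config.get("max_upload_size", 10 * 1024 * 1024)
    return payload_size(payload) <= max_size


payload = io.BytesIO(b"x" * 100)
print(should_upload(payload, {"max_upload_size": 1024}))
```

Measuring the size by seeking avoids reading a large file into memory just to decide whether to skip it.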