gransk.plugins.extractors¶
Extract content from documents.
gransk.plugins.extractors.ewf_strings¶
gransk.plugins.extractors.file_meta¶
class gransk.plugins.extractors.file_meta.Subscriber(pipeline)¶
Bases: gransk.core.abstract_subscriber.Subscriber
Class for extracting metadata from documents using Apache Tika.
Add subscriber to pipeline.
Parameters: pipeline (gransk.core.pipeline.Pipeline) – Pipeline managing subscribers and events.
consume(doc, payload)¶
Upload document to Apache Tika and parse results.
Parameters:
- doc (gransk.core.document.Document) – Document object.
- payload (file) – File pointer belonging to document.
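To illustrate the upload-and-parse step, here is a minimal sketch that sends a document to a Tika server's /meta endpoint and decodes the JSON reply. It is not the gransk implementation: the function names, the TIKA_META_URL constant, and the assumption that a Tika server is listening on localhost:9998 are all hypothetical.

```python
import json
import urllib.request

# Assumption: a Tika server running locally on its default port.
TIKA_META_URL = "http://localhost:9998/meta"


def build_meta_request(payload, url=TIKA_META_URL):
    """Build a PUT request that carries the raw document bytes."""
    return urllib.request.Request(
        url,
        data=payload.read(),
        method="PUT",
        # Ask Tika for JSON rather than its default CSV output.
        headers={"Accept": "application/json"},
    )


def parse_meta_response(body):
    """Decode the JSON metadata returned by Tika into a dict."""
    return json.loads(body.decode("utf-8"))


def fetch_meta(payload, url=TIKA_META_URL):
    """Upload the payload and return the parsed metadata (needs a live server)."""
    with urllib.request.urlopen(build_meta_request(payload, url)) as response:
        return parse_meta_response(response.read())
```

Splitting request construction from response parsing keeps both halves testable without a running Tika server.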
setup(config)¶
Load mediatype mapping from file. This is used to determine document type.
Parameters: config (dict) – Configuration object.
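The mapping file's format is not described here. As a rough sketch, assuming a JSON object that maps media types to document types, loading and lookup could look like the following. The file layout and both function names are hypothetical.

```python
import json


def load_mediatype_mapping(path):
    """Read a media type -> document type mapping (assumed JSON layout)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def document_type(mapping, mediatype, default="unknown"):
    """Look up the document type for a media type string."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    base = mediatype.split(";", 1)[0].strip().lower()
    return mapping.get(base, default)
```

With a mapping such as {"image/png": "picture"}, a media type reported as "image/png; charset=binary" would resolve to "picture" under this sketch.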
gransk.plugins.extractors.picture_meta¶
class gransk.plugins.extractors.picture_meta.Subscriber(pipeline)¶
Bases: gransk.core.abstract_subscriber.Subscriber
Determine width and height. Called when the document is a picture.
Add subscriber to pipeline.
Parameters: pipeline (gransk.core.pipeline.Pipeline) – Pipeline managing subscribers and events.
consume(doc, payload)¶
Parse picture header and extract width/height information.
Parameters:
- doc (gransk.core.document.Document) – Document object.
- payload (file) – File pointer belonging to document.
setup(config)¶
Define picture magic headers and compute regex pattern to find correct parser later.
Parameters: config (dict) – Configuration object.
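The combination of magic headers, a compiled regex, and per-format parsers can be sketched as below. This is an illustration only and covers just PNG and GIF; the names MAGIC, PARSERS, PATTERN, and picture_size are hypothetical, and the formats actually supported by the plugin are not stated here.

```python
import re
import struct


def _png_size(data):
    # IHDR width and height are big-endian 32-bit integers at bytes 16..24.
    return struct.unpack(">II", data[16:24])


def _gif_size(data):
    # Logical screen width and height are little-endian 16-bit integers
    # at bytes 6..10.
    return struct.unpack("<HH", data[6:10])


# Magic headers expressed as regex fragments, one named group per format.
MAGIC = {
    "png": re.escape(b"\x89PNG\r\n\x1a\n"),
    "gif": rb"GIF8[79]a",
}
PARSERS = {"png": _png_size, "gif": _gif_size}

# One alternation pattern; the name of the matching group selects the parser.
PATTERN = re.compile(
    b"|".join(
        b"(?P<" + name.encode("ascii") + b">" + magic + b")"
        for name, magic in MAGIC.items()
    )
)


def picture_size(payload):
    """Return (width, height), or None if the header is unknown or truncated."""
    header = payload.read(32)
    match = PATTERN.match(header)
    if match is None:
        return None
    try:
        width, height = PARSERS[match.lastgroup](header)
    except struct.error:
        return None
    return width, height
```

Compiling the alternation once and dispatching on the matched group name means each document's header is scanned a single time, however many formats are registered.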
gransk.plugins.extractors.strings¶
gransk.plugins.extractors.tika_extractor¶
class gransk.plugins.extractors.tika_extractor.Subscriber(pipeline)¶
Bases: gransk.core.abstract_subscriber.Subscriber
Class for uploading documents to Apache Tika and reading the text response. Tika is an open-source tool capable of parsing a large number (>200) of document formats.
Add subscriber to pipeline.
Parameters: pipeline (gransk.core.pipeline.Pipeline) – Pipeline managing subscribers and events.
consume(doc, payload)¶
Upload document to Apache Tika and add result to document as text.
Parameters:
- doc (gransk.core.document.Document) – Document object.
- payload (file) – File pointer belonging to document.
setup(config)¶
Define maximum size of document to upload.
Parameters: config (dict) – Configuration object.
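A size-guarded upload might look roughly like the sketch below, which skips oversized payloads and otherwise asks a Tika server's /tika endpoint for plain text. The default limit, the URL, and the function names are assumptions for illustration, not values taken from gransk.

```python
import os
import urllib.request

# Assumptions: a local Tika server on its default port and an arbitrary limit.
TIKA_TEXT_URL = "http://localhost:9998/tika"
DEFAULT_MAX_SIZE = 50 * 1024 * 1024  # hypothetical 50 MiB cap


def payload_size(payload):
    """Return the total size of a seekable file object, preserving its position."""
    position = payload.tell()
    payload.seek(0, os.SEEK_END)
    size = payload.tell()
    payload.seek(position)
    return size


def extract_text(payload, max_size=DEFAULT_MAX_SIZE, url=TIKA_TEXT_URL):
    """Return extracted text, or None when the payload exceeds max_size."""
    if payload_size(payload) > max_size:
        return None
    request = urllib.request.Request(
        url,
        data=payload.read(),
        method="PUT",
        headers={"Accept": "text/plain"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

Checking the size before reading avoids loading a very large file into memory only to have the upload rejected or time out.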