pydoxtools.Document

Bases: Pipeline

Basic document pipeline class to analyze documents from all kinds of formats.

A list and documentation of all document analysis related functions can be found ->here<-.

The Document class is designed for information extraction from documents. It inherits from the pydoxtools.document_base.Pipeline class and uses a predefined extraction pipeline focused on document processing tasks. To load a document, create an instance of the Document class with a file path, a file object, a string, a URL or give it some data directly as a dict:

from pathlib import Path
from pydoxtools import Document
doc = Document(fobj=Path('./data/demo.docx'))

Extracted data can be accessed by calling the x method with the specified output in the pipeline:

doc.x("addresses")
doc.x("entities")
doc.x("full_text")
# etc...

Most members can also be called as normal class attributes for easier readability:

doc.addresses

Additionally, it is possible to get the data directly in dict, yaml or json form:

doc.property_dict("addresses","filename","keywords")
doc.yaml("addresses","filename","keywords")
doc.json("addresses","filename","keywords")

To retrieve a list of all available extraction data methods, call the x_funcs() method:

doc.x_funcs()
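
The access patterns above (x("name"), plain attribute access, and dict/json export) can be pictured with a small toy model. This is only a sketch and not pydoxtools code: the class MiniPipeline, its output registry and the "num_words" output are hypothetical names chosen for illustration.

```python
import json


class MiniPipeline:
    """Toy stand-in (not pydoxtools code) for a pipeline with named outputs."""

    def __init__(self, text: str):
        self._text = text
        # Hypothetical registry: output name -> function computing it.
        self._outputs = {
            "full_text": lambda: self._text,
            "num_words": lambda: len(self._text.split()),
        }

    def x(self, name: str):
        """Compute the output registered under `name`."""
        return self._outputs[name]()

    def x_funcs(self) -> list[str]:
        """List the names of all available outputs."""
        return sorted(self._outputs)

    def __getattr__(self, name: str):
        # Only called when normal attribute lookup fails, so regular
        # attributes and methods keep working as usual.
        outputs = self.__dict__.get("_outputs", {})
        if name in outputs:
            return self.x(name)
        raise AttributeError(name)

    def property_dict(self, *names: str) -> dict:
        """Collect several outputs into a plain dict."""
        return {name: self.x(name) for name in names}

    def json(self, *names: str) -> str:
        """Serialize several outputs as a JSON string."""
        return json.dumps(self.property_dict(*names))


doc = MiniPipeline("hello pipeline world")
print(doc.x("num_words"), doc.num_words)  # both routes give the same value
print(doc.json("full_text", "num_words"))
```

The point of the sketch is that attribute access is only a readability shortcut for x(); both go through the same named-output lookup.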

Customizing the Document Pipeline:

The extraction pipeline can be partially overwritten or completely replaced to customize the document processing. To customize the pipeline, it's recommended to use the basic document pipeline defined in pydoxtools.Document as a starting point and only overwrite parts as needed.

Inherited classes can override any part of the graph. To exchange, override, extend or introduce extraction pipelines for specific file types (including the generic one: ""), such as .html, .pdf, .txt, etc., follow the example below.

Rules for customizing the extraction pipeline:

  • The pipeline is defined as a dictionary of several lists of pydoxtools.operator_base.Operator nodes.
  • Each pydoxtools.operator_base.Operator defines a set of output & input values through the out and input methods. The input method takes a dictionary or list of input values and the out method takes a dictionary or list of output values.
  • Operator nodes are configured through method chaining.
  • Arguments can be overwritten by a new pipeline in inherited documents or document types higher up in the hierarchy. The argument precedence is as follows:

python-class-member < extractor-graph-function < configuration

  • The different lists in the dictionary represent a "hierarchy" of pipelines which can be combined in different ways by referencing each other. For example, the ".png" pipeline references the "image" pipeline, which in turn is combined with the "application/pdf" pipeline. All pipelines fall back to the "*" pipeline, which is the most generic one.

Such a hierarchy is declared like this:

_operators = {
        # .pdf-specific pipeline
        "application/pdf": [*PDFNodes],
        # image specific pipeline (does OCR)
        "image": [*OCRNodes],
        # .png-specific pipeline
        ".png": ["image", "application/pdf"],
        # base pipeline
        "*": [*BaseNodes],
        }

Here, the "image" pipeline overwrites the "application/pdf" pipeline for .png files. The "application/pdf"
pipeline, on the other hand, overwrites the "*" pipeline for .pdf files. The "*" pipeline does not need
to be specified, as it will always be the fallback pipeline. This way it is possible
to dynamically adapt a pipeline to different types of input data. In the document pipeline
this is used to dynamically adapt the pipeline to different file types.
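
How such a hierarchy could be flattened can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the resolution logic pydoxtools actually uses: nodes are modelled as hypothetical (output_name, origin) tuples, strings inside a list are treated as references to other pipelines, and the first definition of an output wins.

```python
def resolve_pipeline(operators: dict, doc_type: str) -> list:
    """Flatten a hierarchy of pipelines into one ordered list of nodes.

    Strings that name another pipeline are followed as references,
    everything else is treated as a node. "*" is appended last so that
    it acts as the fallback.
    """
    resolved: list = []

    def visit(key: str, seen: tuple) -> None:
        if key in seen:  # guard against circular references
            return
        for entry in operators.get(key, []):
            if isinstance(entry, str) and entry in operators:
                visit(entry, seen + (key,))
            else:
                resolved.append(entry)

    visit(doc_type, ())
    if doc_type != "*":
        visit("*", ())
    return resolved


def effective_outputs(operators: dict, doc_type: str) -> dict:
    """Map each output name to the pipeline that ends up providing it."""
    outputs: dict = {}
    for name, origin in resolve_pipeline(operators, doc_type):
        outputs.setdefault(name, origin)  # first definition wins
    return outputs


# Hypothetical nodes: (output name, pipeline that defines it).
operators = {
    "application/pdf": [("full_text", "pdf"), ("tables", "pdf")],
    "image": [("full_text", "ocr")],
    ".png": ["image", "application/pdf"],
    "*": [("full_text", "base"), ("language", "base")],
}

print(effective_outputs(operators, ".png"))
print(effective_outputs(operators, "application/pdf"))
```

For ".png" the OCR node provides full_text, the pdf pipeline contributes tables, and language comes from the "*" fallback, which mirrors the overwrite order described above.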

When customizing, it is possible to derive a new class from pydoxtools.Document and partially
overwrite its hierarchy for your purposes. For example, to add a new pipeline for component
extraction, one could do something like the following. The example adds a few more nodes
to the generic pipeline in order to extract product information from documents. It
already works for all the document types defined in the base pydoxtools.Document class!

```python
class DocumentX(pydoxtools.Document):
    _operators = {
        # nodes added to the generic pipeline apply to all document types
        "*": [
            # extract product information from the page templates
            FunctionOperator(get_products_from_pages).input("page_templates", pages="page_set")
            .out(product_information="product_information").cache(),
            # turn every extracted table into a specification
            FunctionOperator(lambda tables: [componardo.spec_utils.table2specs(t) for t in tables])
            .input("tables").out("raw_specs")
            .cache().t(list[componardo.spec_utils.Specification])
            .docs("transform tables into a list of specs"),
            # build the final product objects from the document extract
            FunctionOperator(lambda x: [ComponentExtractor([x]).component])
            .t(componardo.extract_product.ComponentX)
            .input(x="document_extract").out("products").cache()
            .docs("Extract products from Documents"),
        ]
    }
```

When creating a new pipeline for documentation purposes, use a function or class for complex operations and include the documentation there. Lambda functions should not be used in this case.
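
The reason behind this advice can be seen in plain Python: a named function carries a docstring that documentation tooling can pick up, while a lambda has none. The function count_words below is a hypothetical example and not part of pydoxtools; how exactly pydoxtools collects documentation is not shown here.

```python
def count_words(full_text: str) -> int:
    """Count the whitespace-separated words in the full text."""
    return len(full_text.split())


# The same operation as a lambda: it works, but cannot hold a docstring.
count_words_lambda = lambda full_text: len(full_text.split())

# The named function carries its documentation along, the lambda does not.
print(count_words.__name__, "->", count_words.__doc__)
print(count_words_lambda.__name__, "->", count_words_lambda.__doc__)
```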

document_type property

This has to be implemented as a member function and not as part of the pipeline, because the selection of the pipeline depends on it.

filename: str | None property

Returns the filename or some other identifier of a file.

__init__(fobj=None, source=None, meta=None, document_type='auto', page_numbers=None, max_pages=None, configuration=None, **kwargs)

Initialize a Document instance.

Either fobj or source is required. Both can be given. If only one of them is specified, the other one is inferred automatically.

document_type, page_numbers and max_pages are not required either, but can be used to override the default behaviour. Specifically, document_type can be used to manually specify the pipeline that should be used.

Parameters:

Name Type Description Default
fobj str | bytes | Path | IO | dict | list | set

The file object or data to load. Depending on the type of object passed:

  • If a string or bytes object: the object itself is the document. In case of a bytes object, the source helps in determining the file type through file endings.
  • If a string representing a URL: the document will be loaded from the URL.
  • If a pathlib.Path object: load the document from the path.
  • If a file object: load the document from the file object (e.g., bytestream).
  • If a Python dict object: interpret the dict as a document.
  • If a Python list object: interpret the list as a document.

None
source str | Path

The source of the extracted data (e.g., URL, 'pdfupload', parent-URL, or a path). If source is given in addition to fobj, it overrides the automatically inferred source. A special case applies if the document is a data object from a database. In that case the index key from the database should be used as source. This greatly facilitates downstream tasks that have to refer back to where the data came from.

This also applies to "explode" operations on documents, where the newly created documents all try to trace their origin using the "source" attribute.

None
document_type str

The document type, which directly specifies the pipeline to be used. If "auto" is given, the type is inferred automatically. For example, in some cases a string given in fobj should not be loaded as a file but used as raw "string" data. In this case we can explicitly specify document_type="string".

'auto'
meta dict[str, str]

Optionally set document metadata, which can be very useful for downstream tasks like building an index.

None
page_numbers list[int]

A list of specific pages to extract from the document (e.g., in a PDF).

None
max_pages int

The maximum number of pages to extract to protect resources.

None
configuration dict

configuration dictionary for the pipeline

None
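
The argument precedence described earlier (python-class-member < extractor-graph-function < configuration) can be pictured with collections.ChainMap from the standard library. This is only a sketch of the lookup order, not how pydoxtools stores its arguments; the key "model_size" and all values are hypothetical, while 'use_clean_text_for_spacy' is an option name mentioned in this reference.

```python
from collections import ChainMap

# Hypothetical argument values at each level, lowest precedence first.
class_member = {"model_size": "medium", "use_clean_text_for_spacy": True}
graph_function = {"model_size": "small"}
configuration = {"model_size": "large"}

# ChainMap searches its maps from left to right, so the level with the
# highest precedence has to come first.
effective = ChainMap(configuration, graph_function, class_member)

print(effective["model_size"])                # taken from the configuration
print(effective["use_clean_text_for_spacy"])  # only set on the class member
```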

__repr__()

Returns:

Name Type Description
str

A string representation of the instance.

document_type_detection() cached

This function is important because it detects the type of data that is going to be used for our pipeline. That is also why it is implemented as a member function and cannot be pushed into the pipeline itself: it needs to run in order to select which pipeline we are going to use.

Detects the document type based on various criteria. TODO: add a doc-type extractor using, for example, python-magic.
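
What "various criteria" might look like can be sketched with the standard library. The function below is a hypothetical illustration and not the detection logic of pydoxtools: it assumes that an explicit document_type wins, that dict and list inputs select pipelines keyed by their Python type, that file endings are mapped to MIME types with mimetypes.guess_type, and that everything else falls back to "*".

```python
import mimetypes
from pathlib import Path


def detect_document_type(fobj, source=None, document_type="auto"):
    """Return a pipeline key for the given input (illustrative only)."""
    if document_type != "auto":
        return document_type  # an explicit choice always wins
    if isinstance(fobj, dict):
        return dict
    if isinstance(fobj, list):
        return list
    # A Path carries its own file ending; for other inputs such as
    # bytes, the source has to provide it.
    if isinstance(fobj, Path):
        name = str(fobj)
    elif source is not None:
        name = str(source)
    else:
        name = None
    if name is not None:
        guessed, _ = mimetypes.guess_type(name)
        if guessed:
            return guessed
    return "*"  # fall back to the generic pipeline


print(detect_document_type(Path("demo.pdf")))
print(detect_document_type(b"\x89PNG", source="scan.png"))
print(detect_document_type({"key": "value"}))
print(detect_document_type("just some text", document_type="string"))
```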

Text extraction attributes and functions

The pydoxtools.Document is built on the pydoxtools.Pipeline class and most of the text extraction functionality makes extensive use of the pipeline features. All attributes and functions that are created by the pipeline are documented here.

Pipeline visualizations for the structure of the Document pipelines for different document types can be found here.

DG

Alias for:

* document_graph->DG (output)

name : <Document>.x('DG') or <Document>.DG

return type : <class 'networkx.classes.digraph.DiGraph'>

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

a_d_ratio

Letter/digit ratio of the text

name : <Document>.x('a_d_ratio') or <Document>.a_d_ratio

return type : <class 'float'>

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf
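
A letter/digit ratio can be computed as in the following sketch. The exact definition used by pydoxtools is not given here, so this is an assumption: alphabetic characters are divided by digit characters, and the handling of texts without digits (infinity, or 0.0 for texts with neither) is a choice made for this example.

```python
def letter_digit_ratio(text: str) -> float:
    """Number of alphabetic characters divided by the number of digits."""
    letters = sum(ch.isalpha() for ch in text)
    digits = sum(ch.isdigit() for ch in text)
    if digits == 0:
        # Assumption: infinity when there are letters but no digits,
        # 0.0 when the text contains neither.
        return float("inf") if letters else 0.0
    return letters / digits


print(letter_digit_ratio("abcd 12"))
```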

addresses

get addresses from text

name : <Document>.x('addresses') or <Document>.addresses

return type : list[str]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

answers

Extract answers from the text using the Huggingface question answering pipeline

name : <Document>.x('answers') or <Document>.answers

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

chat_answers

Extract answers from the text using OpenAI Chat GPT and other models.

name : <Document>.x('chat_answers') or <Document>.chat_answers

return type : typing.Callable[[list[str], list[str] | str], list[str]]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

clean_format

The format used to convert the document to a clean string for downstream processing tasks

name : <Document>.x('clean_format') or <Document>.clean_format

return type : typing.Any

supports pipeline flows : application/epub+zip, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, mediawiki, pandoc, text/markdown, text/rtf

clean_spacy_text

Generate text to be used for spacy. Depending on the 'use_clean_text_for_spacy' option it will use page templates and replace complicated text structures such as tables for better text understanding.

name : <Document>.x('clean_spacy_text') or <Document>.clean_spacy_text

return type : <class 'str'>

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

clean_text

pipe_type : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/pdf, application/x-yaml, image, image/jpeg, image/png, image/tiff, text/html
description : Alias for: full_text->clean_text (output)

pipe_type : application/epub+zip, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, mediawiki, pandoc, text/markdown, text/rtf
description : For some downstream tasks, it is better to have pure text, without any structural elements in it.

name : <Document>.x('clean_text') or <Document>.clean_text

return type : <class 'str'>

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

convert_to

Generic pandoc converter for other document formats. TODO: better docs

name : <Document>.x('convert_to') or <Document>.convert_to

return type : typing.Callable

supports pipeline flows : application/epub+zip, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, mediawiki, pandoc, text/markdown, text/rtf

coreferences

Resolve coreferences in the text

name : <Document>.x('coreferences') or <Document>.coreferences

return type : list[list[tuple[int, int]]]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

data

pipe_type : PIL.Image.Image
description : Converts the image to a numpy array

pipe_type : image, image/jpeg, image/png, image/tiff
description : Converts the image to a numpy array for downstream processing tasks

pipe_type : application/x-yaml
description : Load yaml data from a string

pipe_type : *, <class 'dict'>, <class 'list'>, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, mediawiki, pandoc, text/html, text/markdown, text/rtf
description : The unprocessed data.

name : <Document>.x('data') or <Document>.data

return type : <class 'numpy.ndarray'> | typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

data_sel

select values by key from source data in Document

name : <Document>.x('data_sel') or <Document>.data_sel

return type : typing.Callable[..., dict]

supports pipeline flows : <class 'dict'>, application/x-yaml
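
Selecting values by key from dict-like source data amounts to something like the following sketch. The function select_keys is a hypothetical stand-in, not the pydoxtools implementation, and silently skipping missing keys is an assumption made for this example.

```python
def select_keys(data: dict, *keys: str) -> dict:
    """Return a new dict that only contains the requested keys."""
    # Assumption: keys that are missing in the source data are skipped.
    return {key: data[key] for key in keys if key in data}


record = {"name": "resistor", "value": "10k", "stock": 42}
print(select_keys(record, "name", "stock"))
```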

do

Alias for:

* document_objects->do (output)

name : <Document>.x('do') or <Document>.do

return type : dict[int, pydoxtools.document_base.DocumentElement]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

document_graph

Builds a networkx graph from the relations and coreferences

name : <Document>.x('document_graph') or <Document>.document_graph

return type : <class 'networkx.classes.digraph.DiGraph'>

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

document_objects

pipe_type : PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff
description : extracts a list of document objects such as tables, text boxes, figures, etc.

pipe_type : *, <class 'dict'>, <class 'list'>, application/epub+zip, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, mediawiki, pandoc, text/html, text/markdown, text/rtf
description : output a list of document elements which can be referenced by id

name : <Document>.x('document_objects') or <Document>.document_objects

return type : dict[int, pydoxtools.document_base.DocumentElement]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

elements

pipe_type : application/pdf
description : Loads a pdf file and returns a list of basic document elements such as lines, figures, etc.

pipe_type : PIL.Image.Image, image, image/jpeg, image/png, image/tiff
description : Loads the pdf file into a list of pydoxtools.document_base.DocumentElement

pipe_type : *, <class 'dict'>, <class 'list'>, application/x-yaml, text/html
description : extracts a list of document objects such as tables, text boxes, figures, etc.

pipe_type : application/epub+zip, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, mediawiki, pandoc, text/markdown, text/rtf
description : split a pandoc document into text elements.

name : <Document>.x('elements') or <Document>.elements

return type : <class 'pandas.core.frame.DataFrame'> | list[pydoxtools.document_base.DocumentElement]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

embedded_meta

pipe_type : application/epub+zip, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, mediawiki, pandoc, text/markdown, text/rtf
description : Alias for: meta_pandoc->embedded_meta (output)

pipe_type : PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff
description : Alias for: meta_pdf->embedded_meta (output)

pipe_type : *, <class 'dict'>, <class 'list'>, application/x-yaml, text/html
description : represents the metadata embedded in the file

name : <Document>.x('embedded_meta') or <Document>.embedded_meta

return type : <class 'dict'> | typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

embedding

Get a vector (embedding) for the entire text by taking the mean of the contextual embeddings of all tokens

name : <Document>.x('embedding') or <Document>.embedding

return type : list[float]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf
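
Taking the mean of the token embeddings (mean pooling) is an element-wise average over all token vectors. The sketch below shows the arithmetic in plain Python with made-up two-dimensional vectors; the model that produces the contextual embeddings is not part of this example.

```python
def mean_embedding(token_vectors: list[list[float]]) -> list[float]:
    """Average a list of equally sized token vectors element-wise."""
    if not token_vectors:
        return []
    count = len(token_vectors)
    # zip(*...) groups the i-th component of every token vector together.
    return [sum(column) / count for column in zip(*token_vectors)]


# Hypothetical embeddings for a text with two tokens.
tokens = [[1.0, 2.0], [3.0, 4.0]]
print(mean_embedding(tokens))
```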

entities

Extract entities from text

name : <Document>.x('entities') or <Document>.entities

return type : list[str]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

file_meta

Some fast-to-calculate metadata information about a document

name : <Document>.x('file_meta') or <Document>.file_meta

return type : dict[str, typing.Any]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

final_urls

Extracts the main content from the html document, removing boilerplate and other noise

name : <Document>.x('final_urls') or <Document>.final_urls

return type : typing.Any

supports pipeline flows : text/html

full_text

pipe_type : text/html
description : Alias for: main_content->full_text (output)

pipe_type : application/x-yaml
description : Alias for: raw_content->full_text (output)

pipe_type : application/epub+zip, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, mediawiki, pandoc, text/markdown, text/rtf
description : Converts the document to a string using pandoc

pipe_type : <class 'dict'>
description : Dump dict data to a yaml-like string

pipe_type : <class 'list'>
description : Dump list data to a yaml-like string

pipe_type : PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff
description : Extracts the full text from the document by grouping text elements

pipe_type : *
description : Full text as a string value

name : <Document>.x('full_text') or <Document>.full_text

return type : <class 'str'> | typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

goose_article

Extracts the main content from the html document, removing boilerplate and other noise

name : <Document>.x('goose_article') or <Document>.goose_article

return type : typing.Any

supports pipeline flows : text/html

graphic_elements

Filters the document elements and only keeps the graphic elements

name : <Document>.x('graphic_elements') or <Document>.graphic_elements

return type : list[pydoxtools.document_base.DocumentElement]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

headers

Extracts the headers from the document

name : <Document>.x('headers') or <Document>.headers

return type : list[pydoxtools.document_base.DocumentElement]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

html_keywords

Extracts explicitly given keywords from the html document

name : <Document>.x('html_keywords') or <Document>.html_keywords

return type : set[str]

supports pipeline flows : text/html

html_keywords_str

Extracts the main content from the html document, removing boilerplate and other noise

name : <Document>.x('html_keywords_str') or <Document>.html_keywords_str

return type : typing.Any

supports pipeline flows : text/html

image_elements

Filters the document elements and only keeps the image elements

name : <Document>.x('image_elements') or <Document>.image_elements

return type : list[pydoxtools.document_base.DocumentElement]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

images

pipe_type : PIL.Image.Image, image, image/jpeg, image/png, image/tiff
description : Access images as a dictionary with page numbers as keys for downstream processing tasks

pipe_type : application/pdf
description : Render a pdf into images which can be used for further downstream processing

name : <Document>.x('images') or <Document>.images

return type : dict[int, PIL.Image.Image]

supports pipeline flows : PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff

items

Get the items of the dictionary

name : <Document>.x('items') or <Document>.items

return type : typing.Any

supports pipeline flows : <class 'dict'>, application/x-yaml

keys

Get the keys of the dictionary

name : <Document>.x('keys') or <Document>.keys

return type : typing.Any

supports pipeline flows : <class 'dict'>, application/x-yaml

keywords

pipe_type : text/html
description : Aggregates the keywords from the html document and found by other algorithms

pipe_type : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/markdown, text/rtf
description : Alias for: textrank_keywords->keywords (output)

name : <Document>.x('keywords') or <Document>.keywords

return type : set[str]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

labeled_text_boxes

Classifies the text elements into addresses, emails, phone numbers, etc. if possible.

name : <Document>.x('labeled_text_boxes') or <Document>.labeled_text_boxes

return type : list[pydoxtools.document_base.DocumentElement]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

language

pipe_type : *, <class 'dict'>, <class 'list'>, application/epub+zip, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, mediawiki, pandoc, text/markdown, text/rtf
description : Detect language of a document, return 'unknown' in case of an error

pipe_type : PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff
description : Extracts the language of the document

pipe_type : text/html
description : Extracts the main content from the html document, removing boilerplate and other noise

name : <Document>.x('language') or <Document>.language

return type : <class 'str'> | typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

line_elements

Filters the document elements and only keeps the text elements

name : <Document>.x('line_elements') or <Document>.line_elements

return type : list[pydoxtools.document_base.DocumentElement]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

lists

pipe_type : PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff
description : Extracts lists from the document text elements

pipe_type : *, <class 'dict'>, <class 'list'>, application/epub+zip, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, mediawiki, pandoc, text/html, text/markdown, text/rtf
description : Extracts the lists from the document

name : <Document>.x('lists') or <Document>.lists

return type : <class 'pandas.core.frame.DataFrame'> | list[pydoxtools.document_base.DocumentElement]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

main_content

Extracts the main content from the html document, removing boilerplate and other noise

name : <Document>.x('main_content') or <Document>.main_content

return type : typing.Any

supports pipeline flows : text/html

main_content_clean_html

Extracts the main content from the html document, removing boilerplate and other noise

name : <Document>.x('main_content_clean_html') or <Document>.main_content_clean_html

return type : typing.Any

supports pipeline flows : text/html

main_image

Extracts the main image from the html document

name : <Document>.x('main_image') or <Document>.main_image

return type : typing.Any

supports pipeline flows : text/html

meta

Metadata of the document

name : <Document>.x('meta') or <Document>.meta

return type : dict[str, typing.Any]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

meta_pandoc

meta information from pandoc document

name : <Document>.x('meta_pandoc') or <Document>.meta_pandoc

return type : typing.Any

supports pipeline flows : application/epub+zip, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, mediawiki, pandoc, text/markdown, text/rtf

meta_pdf

pipe_type : application/pdf
description : Loads a pdf file and returns a list of basic document elements such as lines, figures, etc.

pipe_type : PIL.Image.Image, image, image/jpeg, image/png, image/tiff
description : Loads the pdf file into a list of pydoxtools.document_base.DocumentElement

name : <Document>.x('meta_pdf') or <Document>.meta_pdf

return type : <class 'dict'>

supports pipeline flows : PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff

noun_chunks

Alias for:

* spacy_noun_chunks->noun_chunks (output)

name : <Document>.x('noun_chunks') or <Document>.noun_chunks

return type : typing.List[pydoxtools.document_base.TokenCollection]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

noun_graph

Create a graph of similar nouns

name : <Document>.x('noun_graph') or <Document>.noun_graph

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

noun_ids

Vectors for nouns and corresponding noun ids in order to find them in the spacy document

name : <Document>.x('noun_ids') or <Document>.noun_ids

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

noun_index

Create an index for the nouns

name : <Document>.x('noun_index') or <Document>.noun_index

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

noun_query

Create a query function for the nouns which can be used to do nearest-neighbor queries

name : <Document>.x('noun_query') or <Document>.noun_query

return type : typing.Callable[..., list[tuple]]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf
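As a rough illustration of what a nearest-neighbor query function with the return type `typing.Callable[..., list[tuple]]` can look like, here is a minimal pure-Python sketch. It is not the pydoxtools implementation: the names `cosine` and `make_query`, the toy vectors, and the `(item, score)` tuple layout are all assumptions made for this example.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def make_query(
    items: Sequence[str], vecs: Sequence[Sequence[float]]
) -> Callable[..., list[tuple]]:
    """Return a closure that ranks `items` by similarity to a query vector."""

    def query(query_vec: Sequence[float], k: int = 3) -> list[tuple]:
        # Score every item, then keep the k most similar ones.
        scored = [(item, cosine(query_vec, v)) for item, v in zip(items, vecs)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]

    return query


# Toy data (hypothetical): three nouns with hand-made 2-d vectors.
nouns = ["invoice", "receipt", "mountain"]
vectors = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
noun_query = make_query(nouns, vectors)
top = noun_query([1.0, 0.0], k=2)
```

A real index would typically use an approximate nearest-neighbor structure instead of scoring every item, but the calling pattern (build once, query many times) stays the same.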

noun_vecs

Vectors for nouns and corresponding noun ids in order to find them in the spacy document

name : <Document>.x('noun_vecs') or <Document>.noun_vecs

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

num_pages

pipe_type description
*, <class 'dict'>, <class 'list'>, application/epub+zip, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, mediawiki, pandoc, text/html, text/markdown, text/rtf Number of pages in the document
PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff Outputs the number of pages in the document

name : <Document>.x('num_pages') or <Document>.num_pages

return type : <class 'int'>

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

num_sents

number of sentences

name : <Document>.x('num_sents') or <Document>.num_sents

return type : <class 'int'>

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

num_words

Number of words in the document

name : <Document>.x('num_words') or <Document>.num_words

return type : <class 'int'>

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

ocr_pdf_file

Extracts the text from the document using OCR. It does this by creating a pdf, which is important for keeping the positional information of the text elements.

name : <Document>.x('ocr_pdf_file') or <Document>.ocr_pdf_file

return type : typing.Any

supports pipeline flows : PIL.Image.Image, image, image/jpeg, image/png, image/tiff

page_classifier

Classifies the pages into different types. This is useful, for example, for identifying a table of contents, certain chapters, etc. This works as a zero-shot classifier, so the classes are not predefined. It can be called like this:

Document('somefile.pdf').page_classifier(candidate_labels=['table_of_contents', 'credits', 'license'])

name : <Document>.x('page_classifier') or <Document>.page_classifier

return type : typing.Callable[[list[str]], dict]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

page_set

pipe_type description
*, <class 'dict'>, <class 'list'>, application/epub+zip, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, mediawiki, pandoc, text/html, text/markdown, text/rtf A constant value
application/pdf Loads a pdf file and returns a list of basic document elements such as lines, figures, etc.
PIL.Image.Image, image, image/jpeg, image/png, image/tiff Loads the pdf file into a list of pydoxtools.document_base.DocumentElement objects

name : <Document>.x('page_set') or <Document>.page_set

return type : set[int] | typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

page_templates

Generates a text page while replacing certain elements of the page, which can be specified as a list of ElementTypes. It also automatically replaces elements that have no textual representation, such as images and figures, with an identifier. The ID of the placeholder refers to the index of the DocumentObject. For example, if we encounter the identifier {Table_22}, we can find the object using doc.document_objects[22] or doc.do[22].

name : <Document>.x('page_templates') or <Document>.page_templates

return type : typing.Callable[[list[str]], dict[int, str]]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf
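The placeholder convention described for page_templates can be processed with a small helper. This is a sketch under assumptions: the pattern `{<TypeName>_<index>}` is inferred from the single example `{Table_22}`, and `find_placeholders`, the `Figure` type name and the sample page text are hypothetical.

```python
import re

# Assumed placeholder shape: "{<TypeName>_<index>}", e.g. "{Table_22}".
PLACEHOLDER = re.compile(r"\{([A-Za-z]+)_(\d+)\}")


def find_placeholders(page_text: str) -> list[tuple[str, int]]:
    """Return (element type, object index) pairs in order of appearance."""
    return [(kind, int(idx)) for kind, idx in PLACEHOLDER.findall(page_text)]


# Hypothetical page text as it might come out of a page template.
page = "Revenue by region is shown in {Table_22}, next to {Figure_3}."
refs = find_placeholders(page)
```

Each extracted index could then be used to look up the corresponding object, for example with doc.document_objects[22] for the first reference.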

page_templates_str

Outputs a nice text version of the documents with annotated document objects such as page numbers, tables, figures, etc.

name : <Document>.x('page_templates_str') or <Document>.page_templates_str

return type : <class 'str'>

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

page_templates_str_minimal

No documentation

name : <Document>.x('page_templates_str_minimal') or <Document>.page_templates_str_minimal

return type : <class 'str'>

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

pages_bbox

pipe_type description
application/pdf Loads a pdf file and returns a list of basic document elements such as lines, figures, etc.
PIL.Image.Image, image, image/jpeg, image/png, image/tiff Loads the pdf file into a list of pydoxtools.document_base.DocumentElement objects

name : <Document>.x('pages_bbox') or <Document>.pages_bbox

return type : <class 'numpy.ndarray'>

supports pipeline flows : PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff

pandoc_document

Loads the document using the pandoc project (https://pandoc.org/) into a pydoxtools list of pydoxtools.document_base.DocumentElement objects

name : <Document>.x('pandoc_document') or <Document>.pandoc_document

return type : Pandoc(Meta, [Block])

supports pipeline flows : application/epub+zip, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, mediawiki, pandoc, text/markdown, text/rtf

pdf_links

Extracts the main content from the html document, removing boilerplate and other noise

name : <Document>.x('pdf_links') or <Document>.pdf_links

return type : typing.Any

supports pipeline flows : text/html

pil_image

pipe_type description
PIL.Image.Image Alias for:
* _fobj->pil_image (output)
image, image/jpeg, image/png, image/tiff Converts the image to a PIL-style image for downstream processing tasks

name : <Document>.x('pil_image') or <Document>.pil_image

return type : <class 'PIL.Image.Image'> | typing.Any

supports pipeline flows : PIL.Image.Image, image, image/jpeg, image/png, image/tiff

schemadata

Extracts the main content from the html document, removing boilerplate and other noise

name : <Document>.x('schemadata') or <Document>.schemadata

return type : typing.Any

supports pipeline flows : text/html

sections

Extracts the sections from the document by grouping text elements

name : <Document>.x('sections') or <Document>.sections

return type : typing.Any

supports pipeline flows : application/epub+zip, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, mediawiki, pandoc, text/markdown, text/rtf

segment_query

Create a query function for the text segments which can be used to do nearest-neighbor queries

name : <Document>.x('segment_query') or <Document>.segment_query

return type : typing.Callable[..., list[tuple]]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

semantic_relations

Extract relations from text for building a knowledge graph

name : <Document>.x('semantic_relations') or <Document>.semantic_relations

return type : <class 'pandas.core.frame.DataFrame'>

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

sent_graph

Create a graph of similar sentences

name : <Document>.x('sent_graph') or <Document>.sent_graph

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf
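To make the idea of a "graph of similar sentences" concrete, the following sketch connects two items whenever the cosine similarity of their vectors reaches a threshold. It only illustrates the general technique and does not claim to match how pydoxtools builds its graph; `unit`, `similarity_graph`, the threshold of 0.8 and the toy vectors are assumptions.

```python
import math
from itertools import combinations


def unit(v: list[float]) -> list[float]:
    """Scale a vector to length 1 (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


def similarity_graph(
    vecs: list[list[float]], threshold: float = 0.8
) -> dict[int, dict[int, float]]:
    """Connect items i and j when their cosine similarity is at least `threshold`."""
    units = [unit(v) for v in vecs]
    graph: dict[int, dict[int, float]] = {i: {} for i in range(len(vecs))}
    for i, j in combinations(range(len(vecs)), 2):
        # For unit vectors the dot product equals the cosine similarity.
        sim = sum(a * b for a, b in zip(units[i], units[j]))
        if sim >= threshold:
            graph[i][j] = sim
            graph[j][i] = sim
    return graph


# Toy sentence vectors (hypothetical): the first two point in a similar direction.
sentence_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
graph = similarity_graph(sentence_vecs)
```

The result is an undirected weighted graph stored as nested dictionaries; isolated items keep an empty neighbor dictionary.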

sent_ids

Vectors for sentences & sentence_ids

name : <Document>.x('sent_ids') or <Document>.sent_ids

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

sent_index

Create an index for the sentences

name : <Document>.x('sent_index') or <Document>.sent_index

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

sent_query

Create a query function for the sentences which can be used to do nearest-neighbor queries

name : <Document>.x('sent_query') or <Document>.sent_query

return type : typing.Callable[..., list[tuple]]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

sent_vecs

Vectors for sentences & sentence_ids

name : <Document>.x('sent_vecs') or <Document>.sent_vecs

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

sents

Alias for:

* spacy_sents->sents (output)

name : <Document>.x('sents') or <Document>.sents

return type : list[str]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

short_title

Extracts the main content from the html document, removing boilerplate and other noise

name : <Document>.x('short_title') or <Document>.short_title

return type : typing.Any

supports pipeline flows : text/html

side_titles

Extracts the titles from the document by detecting unusual font styles

name : <Document>.x('side_titles') or <Document>.side_titles

return type : typing.Any

supports pipeline flows : PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff

slow_summary

Summarize the text using the Huggingface summarization pipeline

name : <Document>.x('slow_summary') or <Document>.slow_summary

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

spacy_doc

Spacy Document and Language Model for this document

name : <Document>.x('spacy_doc') or <Document>.spacy_doc

return type : <class 'spacy.tokens.doc.Doc'>

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

spacy_embeddings

Embeddings calculated by a spacy transformer

name : <Document>.x('spacy_embeddings') or <Document>.spacy_embeddings

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

spacy_nlp

Spacy Document and Language Model for this document

name : <Document>.x('spacy_nlp') or <Document>.spacy_nlp

return type : <class 'spacy.language.Language'>

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

spacy_noun_chunks

Extracts noun chunks from spacy. The result is not cached because it is already contained in the spacy doc.

name : <Document>.x('spacy_noun_chunks') or <Document>.spacy_noun_chunks

return type : typing.List[pydoxtools.document_base.TokenCollection]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

spacy_sents

List of sentences by spacy nlp framework

name : <Document>.x('spacy_sents') or <Document>.spacy_sents

return type : list[str]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

spacy_vectorizer

Create a vectorizer function from spacy library.

name : <Document>.x('spacy_vectorizer') or <Document>.spacy_vectorizer

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

spacy_vectors

Vectors for all tokens calculated by spacy

name : <Document>.x('spacy_vectors') or <Document>.spacy_vectors

return type : torch.Tensor | typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

summary

Extracts the main content from the html document, removing boilerplate and other noise

name : <Document>.x('summary') or <Document>.summary

return type : typing.Any

supports pipeline flows : text/html

table_areas

Areas of all detected tables

name : <Document>.x('table_areas') or <Document>.table_areas

return type : list[numpy.ndarray]

supports pipeline flows : PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff

table_box_levels

pipe_type description
application/pdf Detects table candidates from the document elements
PIL.Image.Image, image, image/jpeg, image/png, image/tiff Extracts the table candidates from the document. As this is an image, we need to use a different method than for pdfs. Right now this relies on neural networks. TODO: add an additional, purely text-based method.

name : <Document>.x('table_box_levels') or <Document>.table_box_levels

return type : typing.Any

supports pipeline flows : PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff

table_candidates

pipe_type description
application/pdf Detects table candidates from the document elements
PIL.Image.Image, image, image/jpeg, image/png, image/tiff Extracts the table candidates from the document. As this is an image, we need to use a different method than for pdfs. Right now this relies on neural networks. TODO: add an additional, purely text-based method.

name : <Document>.x('table_candidates') or <Document>.table_candidates

return type : typing.Any

supports pipeline flows : PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff

table_context

Outputs a dictionary with the context of each table in the document

name : <Document>.x('table_context') or <Document>.table_context

return type : dict[int, str]

supports pipeline flows : PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff

table_df0

Filter valid tables from table candidates by checking whether meaningful values can be extracted

name : <Document>.x('table_df0') or <Document>.table_df0

return type : list[pandas.core.frame.DataFrame]

supports pipeline flows : PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff

tables

Extracts the tables from the document as a document element

name : <Document>.x('tables') or <Document>.tables

return type : dict[int, pydoxtools.document_base.DocumentElement]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

tables_df

pipe_type description
PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff Dataframes of all tables
text/html Extracts the main content from the html document, removing boilerplate and other noise
*, <class 'dict'>, <class 'list'>, application/epub+zip, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, mediawiki, pandoc, text/markdown, text/rtf No documentation

name : <Document>.x('tables_df') or <Document>.tables_df

return type : list[pandas.core.frame.DataFrame] | typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

tables_dict

List of tables

name : <Document>.x('tables_dict') or <Document>.tables_dict

return type : list[dict[int, dict[int, str]]]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf
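The return type `list[dict[int, dict[int, str]]]` means each table is a nested dictionary. The sketch below turns one such table into a list of rows. It assumes the outer keys are column indices and the inner keys are row indices (the layout pandas produces with `DataFrame.to_dict()`); that orientation is not stated in the reference, and `table_to_rows` and the sample table are hypothetical.

```python
def table_to_rows(table: dict[int, dict[int, str]]) -> list[list[str]]:
    """Convert {column: {row: value}} into a list of rows, filling gaps with ''."""
    columns = sorted(table)
    row_ids = sorted({row for col in table.values() for row in col})
    return [[table[c].get(r, "") for c in columns] for r in row_ids]


# Hypothetical table with two columns and two rows.
table = {0: {0: "item", 1: "apple"}, 1: {0: "price", 1: "1.20"}}
rows = table_to_rows(table)
```

If the actual orientation turns out to be row-first, the same helper yields the transposed table, so it is worth checking one known table before relying on it.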

text_box_elements

pipe_type description
<class 'dict'>, application/x-yaml Create a dataframe from a dictionary. TODO: this is not working correctly, it should create a list of pydoxtools.document_base.DocumentElement objects
PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff Extracts a dataframe of text boxes from the document by grouping text elements
text/html Extracts the text boxes from the html document
<class 'list'> No documentation
*, application/epub+zip, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, mediawiki, pandoc, text/markdown, text/rtf Text boxes extracted as a pandas Dataframe with some additional metadata

name : <Document>.x('text_box_elements') or <Document>.text_box_elements

return type : list[pydoxtools.document_base.DocumentElement] | list[str]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

text_segment_ids

Get a list of ids for the individual text segments

name : <Document>.x('text_segment_ids') or <Document>.text_segment_ids

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

text_segment_index

Create an index for the text segments

name : <Document>.x('text_segment_index') or <Document>.text_segment_index

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

text_segment_vec_res

Calculate the embeddings for each text segment

name : <Document>.x('text_segment_vec_res') or <Document>.text_segment_vec_res

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

text_segment_vecs

Get the embeddings for individual text segments

name : <Document>.x('text_segment_vecs') or <Document>.text_segment_vecs

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

text_segments

Split the text into segments

name : <Document>.x('text_segments') or <Document>.text_segments

return type : list[str]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

textrank_keywords

Extract keywords from the graph of similar nouns

name : <Document>.x('textrank_keywords') or <Document>.textrank_keywords

return type : set[str]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

textrank_sents

Extract the most important sentences from the graph of similar sentences

name : <Document>.x('textrank_sents') or <Document>.textrank_sents

return type : set[str]

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf
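TextRank-style extraction generally works by running PageRank over a similarity graph and keeping the highest-scoring nodes. The sketch below shows that general technique on a tiny weight matrix. It is not taken from the pydoxtools source; `pagerank`, `top_sentences`, the damping factor of 0.85, the iteration count and the sample data are assumptions.

```python
def pagerank(
    weights: list[list[float]], damping: float = 0.85, iterations: int = 50
) -> list[float]:
    """Weighted PageRank by power iteration; weights[i][j] is the edge i -> j."""
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Each neighbor i passes on its score in proportion to the edge weight.
            incoming = sum(
                scores[i] * weights[i][j] / out_sum[i]
                for i in range(n)
                if out_sum[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def top_sentences(
    sentences: list[str], weights: list[list[float]], k: int = 1
) -> set[str]:
    """Return the k sentences with the highest PageRank score."""
    scores = pagerank(weights)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return {sentences[i] for i in ranked[:k]}


# Hypothetical sentences and pairwise similarities (symmetric, zero diagonal).
sentences = [
    "The report covers revenue and costs.",
    "Revenue grew last year.",
    "Costs stayed flat.",
]
weights = [
    [0.0, 0.6, 0.6],
    [0.6, 0.0, 0.1],
    [0.6, 0.1, 0.0],
]
best = top_sentences(sentences, weights, k=1)
```

The first sentence is similar to both others, so it collects the most score; the same procedure applied to a noun graph would rank keywords instead of sentences.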

title

Extracts the main content from the html document, removing boilerplate and other noise

name : <Document>.x('title') or <Document>.title

return type : typing.Any

supports pipeline flows : text/html

titles

pipe_type description
PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff Extracts the titles from the document by detecting unusual font styles
text/html Extracts the titles from the html document

name : <Document>.x('titles') or <Document>.titles

return type : tuple[str, str] | typing.Any

supports pipeline flows : PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff, text/html

tok_embeddings

Get the tokenized text

name : <Document>.x('tok_embeddings') or <Document>.tok_embeddings

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

tokens

Get the tokenized text

name : <Document>.x('tokens') or <Document>.tokens

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

url

pipe_type description
text/html Extracts the main content from the html document, removing boilerplate and other noise
*, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/markdown, text/rtf Url of this document

name : <Document>.x('url') or <Document>.url

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

urls

Extracts the urls from the html document

name : <Document>.x('urls') or <Document>.urls

return type : typing.Any

supports pipeline flows : text/html

valid_tables

Filter valid tables from table candidates by checking whether meaningful values can be extracted

name : <Document>.x('valid_tables') or <Document>.valid_tables

return type : typing.Any

supports pipeline flows : PIL.Image.Image, application/pdf, image, image/jpeg, image/png, image/tiff

values

Get the values of the dictionary

name : <Document>.x('values') or <Document>.values

return type : typing.Any

supports pipeline flows : <class 'dict'>, application/x-yaml

vec_res

Calculate context-based vectors (embeddings) for the entire text

name : <Document>.x('vec_res') or <Document>.vec_res

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

vector

Embeddings from spaCy

name : <Document>.x('vector') or <Document>.vector

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf

vectorizer

Gets the vectorizer function used for this document, which can be applied to arbitrary text

name : <Document>.x('vectorizer') or <Document>.vectorizer

return type : typing.Any

supports pipeline flows : *, <class 'dict'>, <class 'list'>, PIL.Image.Image, application/epub+zip, application/pdf, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/x-yaml, image, image/jpeg, image/png, image/tiff, mediawiki, pandoc, text/html, text/markdown, text/rtf
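
A vectorizer function is useful because a query embedded with the same function as the document yields vectors that can be compared directly. The sketch below shows that pattern with a toy hashing vectorizer built from the standard library. It is a stand-in, not the embedding model pydoxtools uses; `make_vectorizer` and `cosine` are hypothetical names.

```python
import math
import zlib


def make_vectorizer(dim=64):
    """Return a toy text -> vector function (stand-in for a real embedding model)."""

    def vectorize(text):
        vec = [0.0] * dim
        for token in text.lower().split():
            # Hash each token into one of `dim` buckets and count occurrences.
            vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
        return vec

    return vectorize


def cosine(a, b):
    """Cosine similarity of two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


vectorizer = make_vectorizer()
doc_vec = vectorizer("invoice for three laptops and one monitor")

# Embedding the query with the same function makes the vectors comparable.
same = cosine(doc_vec, vectorizer("invoice for three laptops and one monitor"))
other = cosine(doc_vec, vectorizer("zebra"))
print(same, other)
```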

Configuration parameters

| name | description | default_values |
| --- | --- | --- |
| chat_model_id | To use OpenAI ChatGPT, set this to 'gpt-3.5-turbo' or 'gpt-4'. Models from the gpt4all library are also supported; they can be run locally and most are available for commercial purposes. Currently available models are: ['wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0', 'ggml-model-gpt4all-falcon-q4_0', 'ous-hermes-13b.ggmlv3.q4_0', 'GPT4All-13B-snoozy.ggmlv3.q4_0', 'orca-mini-7b.ggmlv3.q4_0', 'orca-mini-3b.ggmlv3.q4_0', 'orca-mini-13b.ggmlv3.q4_0', 'wizardLM-13B-Uncensored.ggmlv3.q4_0', 'ggml-replit-code-v1-3', 'ggml-all-MiniLM-L6-v2-f16', 'starcoderbase-3b-ggml', 'starcoderbase-7b-ggml', 'llama-2-7b-chat.ggmlv3.q4_0'] | gpt-3.5-turbo |
| coreference_method | Can be 'fast' or 'accurate'. | fast |
| full_text_format | The format used to convert the document to a string. | markdown |
| graph_debug_context_size | can be 'fast' or 'accurate' | 0 |
| image_dpi | The DPI used when rendering the document. The standard image generation resolution is set to 216 DPI for PDFs so that the resolution is sufficient for downstream OCR tasks (e.g. table extraction). | 216 |
| max_size_text_segment | Controls the text segmentation for knowledge bases. Overlap is only relevant for large text segments that need to be split up into smaller pieces. | 512 |
| max_text_segment_num | Controls the text segmentation for knowledge bases. Overlap is only relevant for large text segments that need to be split up into smaller pieces. | 100 |
| min_size_text_segment | Controls the text segmentation for knowledge bases. Overlap is only relevant for large text segments that need to be split up into smaller pieces. | 256 |
| ocr_lang | Configuration for the OCR extractor. OCR can be turned on or off, and the language used for OCR can be specified. | auto |
| ocr_on | Configuration for the OCR extractor. OCR can be turned on or off, and the language used for OCR can be specified. | True |
| qam_model_id | Configuration for values: qam_model_id = deepset/minilm-uncased-squad2 (default) | deepset/minilm-uncased-squad2 |
| spacy_model | Explicitly specifies the spaCy model to use. | auto |
| spacy_model_size | The model size used for spaCy text analysis. Can be: sm, md, lg, trf. | md |
| summarizer_max_text_len | Configuration for values: summarizer_model = sshleifer/distilbart-cnn-12-6 (default), summarizer_token_overlap = 50 (default), summarizer_max_text_len = 200 (default) | 200 |
| summarizer_model | Configuration for values: summarizer_model = sshleifer/distilbart-cnn-12-6 (default), summarizer_token_overlap = 50 (default), summarizer_max_text_len = 200 (default) | sshleifer/distilbart-cnn-12-6 |
| summarizer_token_overlap | Configuration for values: summarizer_model = sshleifer/distilbart-cnn-12-6 (default), summarizer_token_overlap = 50 (default), summarizer_max_text_len = 200 (default) | 50 |
| text_segment_overlap | Controls the text segmentation for knowledge bases. Overlap is only relevant for large text segments that need to be split up into smaller pieces. | 0.3 |
| top_k_text_rank_keywords | Configuration for values: top_k_text_rank_keywords = 5 (default) | 5 |
| top_k_text_rank_sentences | Controls the number of most important sentences that are extracted from the text. | 5 |
| use_clean_text_for_spacy | Whether pydoxtools cleans up the text before using spaCy on it. | True |
| vectorizer_model | Chooses the embeddings model (Hugging Face style) and whether vectorization uses only the tokenizer. Using only the tokenizer is much faster and uses less CPU than creating actual contextual embeddings with the model, but the result is of lower quality because it lacks context. | sentence-transformers/all-MiniLM-L6-v2 |
| vectorizer_only_tokenizer | Chooses the embeddings model (Hugging Face style) and whether vectorization uses only the tokenizer. Using only the tokenizer is much faster and uses less CPU than creating actual contextual embeddings with the model, but the result is of lower quality because it lacks context. | False |
| vectorizer_overlap_ratio | Chooses the embeddings model (Hugging Face style) and whether vectorization uses only the tokenizer. Using only the tokenizer is much faster and uses less CPU than creating actual contextual embeddings with the model, but the result is of lower quality because it lacks context. | 0.1 |
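
To make the interplay of the segmentation parameters listed above more concrete, the following sketch splits a long token list into overlapping windows using the default values of `max_size_text_segment` (512) and `text_segment_overlap` (0.3). It rests on two assumptions that the table does not confirm: that segment sizes are counted in tokens, and that the overlap is a fraction of the maximum segment size. `split_with_overlap` is a hypothetical helper, not a pydoxtools function.

```python
def split_with_overlap(tokens, max_size=512, overlap=0.3):
    """Split a long token list into windows of at most max_size tokens.

    Assumption: consecutive windows share round(max_size * overlap) tokens.
    """
    if len(tokens) <= max_size:
        # Short segments are kept whole, so the overlap plays no role.
        return [tokens]
    step = max_size - round(max_size * overlap)
    if step <= 0:
        raise ValueError("overlap must be smaller than 1")
    segments = []
    for start in range(0, len(tokens), step):
        segments.append(tokens[start:start + max_size])
        if start + max_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return segments


tokens = [f"t{i}" for i in range(1000)]
segments = split_with_overlap(tokens)
print([len(s) for s in segments])  # → [512, 512, 284]
```

With these defaults, each window starts 358 tokens after the previous one, so neighbouring segments share 154 tokens.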