Reference

Pipeline

Base class for all document classes in pydoxtools, defining a common pipeline interface and establishing a basic pipeline schema that derived classes can override.

The MetaPipelineClassConfiguration acts as a compiler to resolve the pipeline hierarchy, allowing pipelines to inherit, mix, extend, or partially overwrite each other. Each key in the _pipelines dictionary represents a different pipeline version.

The pydoxtools.Document class leverages this functionality to build separate pipelines for different file types, as the information processing requirements differ significantly between file types.
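The merge behavior can be sketched as a plain dictionary update, where a derived pipeline starts from its parent's operator mapping and overrides or extends individual entries (a simplified illustration with made-up node names, not the actual MetaPipelineClassConfiguration logic):

```python
# Simplified sketch of pipeline inheritance: a derived pipeline starts from
# its parent's operator mapping and overrides or extends individual entries.
# All names below are illustrative, not real pydoxtools internals.

base_pipeline = {
    "full_text": "extract_text_op",
    "tables": "extract_tables_op",
}

# A PDF-specific pipeline overrides one node and adds another.
pdf_overrides = {
    "full_text": "pdf_text_op",
    "images": "pdf_image_op",
}

# Keys present in the override win; everything else is inherited unchanged.
pdf_pipeline = {**base_pipeline, **pdf_overrides}
```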

Attributes:

  _operators (dict[str, list[Operator]]):
      Stores the definition of the pipeline graph, a collection of connected
      operators/functions that process data from a document.

  _pipelines (dict[str, dict[str, Operator]]):
      Provides access to all operator functions by their "out-key", which was
      defined in _operators.

Todo
  • Use pandera (https://github.com/unionai-oss/pandera) to validate dataframes exchanged between operators & loaders (https://pandera.readthedocs.io/en/stable/pydantic_integration.html)

configuration property

Returns a dictionary of all configuration objects for the current pipeline.

Returns:

  dict: A dictionary containing the names and values of all configuration
      objects for the current pipeline.

pipeline_chooser: str property

Must be implemented by derived classes to decide which pipeline they should use.
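A derived class might implement this roughly as follows (a hypothetical sketch using a toy class, not the real pydoxtools.Document logic):

```python
import pathlib

class MyDocument:
    """Toy stand-in for a Pipeline subclass (illustrative only)."""

    def __init__(self, source: str):
        self.source = source

    @property
    def pipeline_chooser(self) -> str:
        # Choose a pipeline version based on the file extension,
        # falling back to a generic "text" pipeline.
        return pathlib.Path(self.source).suffix.lstrip(".").lower() or "text"

doc = MyDocument("report.PDF")
```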

uuid cached property

Retrieves a universally unique identifier (UUID) for the instance.

This method generates a new UUID for the instance using Python's uuid.uuid4() function. The UUID is then cached as a property, ensuring that the same UUID is returned for subsequent accesses.

Returns:

  uuid.UUID: A unique identifier for the instance.
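The cached-property pattern behind this can be sketched as (a minimal stand-in, not the actual class):

```python
import functools
import uuid

class Instance:
    """Minimal model of a class with a cached uuid property."""

    @functools.cached_property
    def uuid(self) -> uuid.UUID:
        # Generated once on first access, then cached on the instance,
        # so subsequent accesses return the same object.
        return uuid.uuid4()

obj = Instance()
```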

x_funcs: dict[str, Operator] cached property

Gets all operators/pipeline nodes and their property names for this specific file type/pipeline.

__getattr__(extract_name)

Retrieves an extractor result by directly accessing it as an attribute.

This method is automatically called for attribute names that aren't defined on class level, allowing for a convenient way to access pipeline operator outputs without needing to call the 'x' method.

Example

document.addresses instead of document.x('addresses')

Parameters:

  extract_name (str): The name of the extractor result to be accessed.
      Required.

Returns:

  Any: The result of the extractor after processing the document.
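The delegation can be sketched with a toy class (illustrative only; the real implementation resolves operators lazily):

```python
class ToyPipeline:
    """Minimal model of attribute access delegating to x()."""

    def __init__(self):
        # Pre-computed results stand in for lazily evaluated operators.
        self._results = {"addresses": ["1 Main St"]}

    def x(self, operator_name: str):
        # Look up (in the real class: lazily compute) an operator result.
        return self._results[operator_name]

    def __getattr__(self, extract_name: str):
        # Called only for attributes not found via normal lookup;
        # delegate to the pipeline's x() method.
        try:
            return self.x(extract_name)
        except KeyError:
            raise AttributeError(extract_name)

doc = ToyPipeline()
```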

__getitem__(extract_name)

Retrieves an extractor result by accessing it with dictionary-style index notation.

This method allows pipeline operator outputs to be accessed as items, providing a convenient alternative to calling the 'x' method.

Example

document["addresses"] instead of document.x('addresses')

Parameters:

  extract_name (str): The name of the extractor result to be accessed.
      Required.

Returns:

  Any: The result of the extractor after processing the document.

__getstate__()

Returns the variables necessary for pickling, leaving out everything that could potentially contain a lambda function (lambdas cannot be pickled).

__init__(**configuration)

Initializes the Pipeline instance with cache-related attributes.

Parameters:

  **configuration: Key-value pairs representing the configuration settings
      for the pipeline. Each key is a string naming a configuration setting,
      and the value is the corresponding value to be set.

__repr__()

Returns:

  str: A string representation of the instance.

__setstate__(state)

Restores _x_func_cache, which is needed for pickling/unpickling to work.

gather_inputs(mapped_args, traceable)

Gathers arguments from the pipeline and class and maps them to the provided keys.

This method retrieves all required input parameters from _in_mapping, which was declared with "pipe". It first checks whether each parameter is available as an extractor; if so, it calls that function to get the value. Otherwise, it falls back to the member variables or functions of the derived pipeline class.

Parameters:

  mapped_args (dict): A dictionary containing the keys to be mapped to their
      corresponding values.
  traceable (bool): Whether to propagate source information through the
      pipeline for traceability.

Returns:

  dict: A dictionary containing the mapped keys and their corresponding
      values.
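The operator-first, attribute-fallback lookup can be sketched as follows (hypothetical names, simplified logic):

```python
class ToySource:
    """Stand-in for a pipeline that resolves inputs by name."""

    chunk_size = 512  # plain class attribute, used as a fallback

    def __init__(self):
        # Operators keyed by their out-key; callables stand in for
        # the real (lazy) operator machinery.
        self._operators = {"full_text": lambda: "hello world"}

    def gather_inputs(self, mapped_args: dict) -> dict:
        out = {}
        for target_key, name in mapped_args.items():
            if name in self._operators:
                # Prefer a pipeline operator with that out-key...
                out[target_key] = self._operators[name]()
            else:
                # ...otherwise fall back to a member variable/attribute.
                out[target_key] = getattr(self, name)
        return out

src = ToySource()
```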

get_configuration_names(pipeline) cached classmethod

Returns a list of names of all configuration objects for a given pipeline.

This function is cached, which is important for performance.

Parameters:

  pipeline (str): The name of the pipeline to retrieve configuration objects
      from. Required.

Returns:

  list[str]: A list of strings containing the names of all configuration
      objects for the given pipeline.

node_infos(pipeline_type=None) classmethod

Aggregates the pipeline operations and their corresponding types and metadata.

This method iterates through all the pipelines registered in the class, and gathers information about each node/operation, such as the pipeline types it appears in, the return type of the operation, and the operation's docstring.

Returns:

  list[dict]: TODO...

non_interactive_pipeline()

Returns all non-interactive operators/pipeline nodes.

operator_types() classmethod

Returns a dictionary of operators with their types, suitable for declaring a pydantic model.

If the corresponding option is set to True, only types that are valid JSON schema are included in the model. The typical use case is exposing the pipeline via this model through an HTTP API, e.g. with FastAPI; in that case, only types that are valid JSON schema should be allowed. By default this is set to False.

pipeline_graph(image_path=None, document_logic_id='*') classmethod

Generates a visualization of the defined pipelines and optionally saves it as an image.

Parameters:

  image_path (str | Path): File path for the generated image. If provided,
      the generated graph will be saved as an image. Defaults to None.
  document_logic_id (str): The document logic ID for which the pipeline graph
      should be generated. Defaults to "*".

Returns:

  AGraph: A PyGraphviz AGraph object representing the pipeline graph. This
      object can be visualized or manipulated using PyGraphviz functions.

Notes:

This method requires the NetworkX and PyGraphviz libraries to be installed.
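Independently of PyGraphviz, the graph's edges can be derived from an _operators-style definition. A simplified sketch (hypothetical structure in which each out-key lists the inputs it consumes):

```python
# Build a simple edge list from an _operators-style definition, where each
# node maps its out-key to the out-keys it consumes as inputs
# (a hypothetical, simplified structure).
operators = {
    "full_text": [],
    "tokens": ["full_text"],
    "addresses": ["tokens"],
}

# One directed edge per (input, node) pair.
edges = [(src, dst) for dst, inputs in operators.items() for src in inputs]
```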

pre_cache()

Pre-caches the results of all operators that have caching enabled.

This method iterates through the defined operators and calls each one with caching enabled, storing the results for faster access in future calls.

Returns:

  self: The instance of the class, allowing for method chaining.

run_pipeline(exclude=None)

Runs all operators defined in the pipeline for testing or pre-caching purposes.

!!IMPORTANT!!! This function should normally not be used, as the pipeline is executed lazily anyway.

This method iterates through the defined operators and calls each one, ensuring that the extractor logic is functioning correctly and caching the results if required.

run_pipeline_fast()

Runs the pipeline, but excludes long-running calculations.

set_disk_cache_settings(enable, ttl=3600 * 24 * 7)

Sets the disk cache settings: whether caching is enabled, and the time-to-live (ttl) of cached entries in seconds (default: one week).

to_dict(*args, **kwargs)

Returns a dictionary that accumulates the properties given in *args or with a mapping in **kwargs.

Parameters:

  *args (str): A variable number of strings, each representing a property
      name.
  **kwargs (str): A mapping of custom keys (keys) to property names (values)
      for the returned dictionary.

Note:

This function currently only supports properties that do not require any arguments, such as "full_text". Properties like "answers" that return a function requiring arguments cannot be used with this function.

Returns:

  dict: A dictionary with the accumulated properties and their values, using
      either the property names or custom keys as specified in the input
      arguments.
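The accumulation behavior can be sketched as follows (a toy stand-in with two argument-free properties, not the real implementation):

```python
class ToyDoc:
    """Stand-in exposing a couple of argument-free properties."""

    @property
    def full_text(self):
        return "hello"

    @property
    def language(self):
        return "en"

    def to_dict(self, *args: str, **kwargs: str) -> dict:
        # Positional names keep their own name as the key; keyword
        # arguments map a custom key to a property name.
        result = {name: getattr(self, name) for name in args}
        result.update({key: getattr(self, name) for key, name in kwargs.items()})
        return result

doc = ToyDoc()
```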

to_json(*args, **kwargs)

Accumulates the properties given in *args or with a mapping in **kwargs, and dumps the result as JSON.

Parameters:

  *args (str): A variable number of strings, each representing a property
      name.
  **kwargs (str): A mapping of custom keys (keys) to property names (values)
      for the returned dictionary.

Note:

This function currently only supports properties that do not require any arguments, such as "full_text". Properties like "answers" that return a function requiring arguments cannot be used with this function.

Returns:

  str: A JSON-formatted string representing the accumulated properties and
      their values, using either the property names or custom keys as
      specified in the input arguments.

to_yaml(*args, **kwargs)

Accumulates the properties given in *args or with a mapping in **kwargs, and dumps the result as YAML.

Parameters:

  *args (str): A variable number of strings, each representing a property
      name.
  **kwargs (str): A mapping of custom keys (keys) to property names (values)
      for the returned dictionary.

Note:

This function currently only supports properties that do not require any arguments, such as "full_text". Properties like "answers" that return a function requiring arguments cannot be used with this function.

Returns:

  str: A YAML-formatted string representing the accumulated properties and
      their values, using either the property names or custom keys as
      specified in the input arguments.

x(operator_name, disk_cache=False, traceable=False)

Calls an extractor from the defined pipeline and returns the result.

Parameters:

  operator_name (str): The name of the extractor to be called. Required.
  disk_cache (bool): Whether to cache the call. The pipeline can be told
      explicitly to cache a call, which makes caching more efficient by only
      caching the calls we want. Defaults to False.
  traceable (bool): Some operators will propagate the source of their
      information through the pipeline, which adds traceability. Setting
      traceable=True turns this feature on. Defaults to False.

Returns:

  Any: The result of the extractor after processing the document.

Raises:

  OperatorException: If an error occurs while executing the extractor.

Notes:

The extractor's parameters can be overridden using *args and **kwargs.
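The explicit opt-in caching idea can be sketched with an in-memory result cache (a toy model; the real pipeline uses a disk cache and lazy operators):

```python
class CachingPipeline:
    """Toy model of x() with an opt-in, in-memory result cache."""

    def __init__(self):
        self.calls = 0          # counts actual computations, for illustration
        self._x_func_cache = {}

    def _compute(self, operator_name: str):
        # Stand-in for running a real operator on the document.
        self.calls += 1
        return operator_name.upper()

    def x(self, operator_name: str, cache: bool = False):
        # Only cache the calls the caller explicitly asked to cache.
        if cache and operator_name in self._x_func_cache:
            return self._x_func_cache[operator_name]
        result = self._compute(operator_name)
        if cache:
            self._x_func_cache[operator_name] = result
        return result

p = CachingPipeline()
```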

x_all()

Retrieves the results of all operators defined in the pipeline.

Returns:

  dict: A dictionary containing the results of all operators, with keys as
      the extractor names and values as the corresponding results.