Reference
Pipeline
Base class for all document classes in pydoxtools, defining a common pipeline interface and establishing a basic pipeline schema that derived classes can override.
The MetaPipelineClassConfiguration acts as a compiler to resolve the pipeline hierarchy, allowing pipelines to inherit, mix, extend, or partially overwrite each other. Each key in the _pipelines dictionary represents a different pipeline version.
The pydoxtools.Document class leverages this functionality to build separate pipelines for different file types, as the information processing requirements differ significantly between file types.
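The inherit/extend/partially-overwrite behaviour can be pictured as plain dict merging. The following is an illustrative simplification only, not the actual `MetaPipelineClassConfiguration` compiler code, and the operator names are made up:

```python
# Illustrative sketch: pipeline versions as dicts of out-key -> operator,
# where a derived pipeline inherits, extends, or partially overwrites a base.
base = {"full_text": "load_text_op", "language": "detect_lang_op"}

# a hypothetical "*.pdf" pipeline inherits everything from base,
# overrides one operator, and adds a new one
pdf = {**base, "full_text": "pdf_text_op", "tables": "pdf_table_op"}

print(pdf["language"])   # inherited from base
print(pdf["full_text"])  # overridden for PDFs
```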
Attributes:

Name | Type | Description
---|---|---
_operators | dict[str, list[Operator]] | Stores the definition of the pipeline graph: a collection of connected operators/functions that process data from a document.
_pipelines | dict[str, dict[str, Operator]] | Provides access to all operator functions by the "out-key" defined in _operators.
Todo
- Use pandera (https://github.com/unionai-oss/pandera) to validate dataframes exchanged between operators & loaders (https://pandera.readthedocs.io/en/stable/pydantic_integration.html)
configuration
property
Returns a dictionary of all configuration objects for the current pipeline.
Returns:

Name | Description
---|---
dict | A dictionary containing the names and values of all configuration objects for the current pipeline.
pipeline_chooser: str
property
Must be implemented by derived classes to decide which pipeline they should use.
uuid
cached
property
Retrieves a universally unique identifier (UUID) for the instance.

This method generates a new UUID for the instance using Python's uuid.uuid4() function. The UUID is then cached as a property, ensuring that the same UUID is returned on subsequent accesses.

Returns:

Type | Description
---|---
uuid.UUID | A unique identifier for the instance.
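The documented behaviour corresponds to the standard cached-property pattern. A minimal sketch, which mirrors what is described above but is not necessarily pydoxtools' exact implementation:

```python
import uuid
from functools import cached_property

class Pipeline:
    @cached_property
    def uuid(self) -> uuid.UUID:
        # generated once on first access, then cached on the instance
        return uuid.uuid4()

p = Pipeline()
assert p.uuid == p.uuid  # the same UUID on every subsequent access
```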
x_funcs: dict[str, Operator]
cached
property
Returns all operators/pipeline nodes and their property names for this specific file type/pipeline.
__getattr__(extract_name)
Retrieves an extractor result by directly accessing it as an attribute.
This method is automatically called for attribute names that aren't defined on class level, allowing for a convenient way to access pipeline operator outputs without needing to call the 'x' method.
Example
document.addresses instead of document.x('addresses')
Parameters:

Name | Type | Description | Default
---|---|---|---
extract_name | str | The name of the extractor result to be accessed. | required

Returns:

Name | Type | Description
---|---|---
Any | Any | The result of the extractor after processing the document.
__getitem__(extract_name)
Retrieves an extractor result via dictionary-style item access.
This method allows pipeline operator outputs to be accessed with subscript notation, providing a convenient alternative to calling the 'x' method.
Example
document["addresses"] instead of document.x('addresses')
Parameters:

Name | Type | Description | Default
---|---|---|---
extract_name | str | The name of the extractor result to be accessed. | required

Returns:

Name | Type | Description
---|---|---
Any | Any | The result of the extractor after processing the document.
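Both access styles can be pictured as thin wrappers around x(). A simplified sketch; the operator table and its contents here are made up for illustration and this is not the real pydoxtools code:

```python
class Pipeline:
    def __init__(self):
        # hypothetical operator table; real pipelines build this from _operators
        self._x_funcs = {"addresses": lambda: ["42 Example Road"]}

    def x(self, extract_name):
        return self._x_funcs[extract_name]()

    def __getattr__(self, extract_name):
        # only invoked when normal attribute lookup fails
        try:
            return self.x(extract_name)
        except KeyError:
            raise AttributeError(extract_name)

    def __getitem__(self, extract_name):
        return self.x(extract_name)

doc = Pipeline()
print(doc.addresses)     # attribute-style access
print(doc["addresses"])  # item-style access
```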
__getstate__()
Returns the variables necessary for pickling, omitting everything that could potentially contain a lambda function (lambdas cannot be pickled).
__init__(**configuration)
Initializes the Pipeline instance with cache-related attributes.
**configuration: A dictionary of key-value pairs representing the configuration settings for the pipeline. Each key is a string representing the name of the configuration setting, and the value is the corresponding value to be set.
__repr__()
Returns:

Name | Description
---|---
str | A string representation of the instance.
__setstate__(state)
Restores the instance state after unpickling; _x_func_cache needs to be restored for pickling to work.
gather_inputs(mapped_args, traceable)
Gathers arguments from the pipeline and class, and maps them to the provided keys of kwargs.
This method retrieves all required input parameters from _in_mapping, which was declared with "pipe". It first checks if the parameter is available as an extractor. If so, it calls the function to get the value. Otherwise, it gets the member-variables or functions of the derived pipeline class if an extractor with that name cannot be found.
Parameters:

Name | Type | Description | Default
---|---|---|---
mapped_args | dict | A dictionary containing the keys to be mapped to the corresponding values. | required

Returns:

Name | Description
---|---
dict | A dictionary containing the mapped keys and their corresponding values.
get_configuration_names(pipeline)
cached
classmethod
Returns a list of names of all configuration objects for a given pipeline.
This is a cached function, which is important for performance.
Parameters:

Name | Type | Description | Default
---|---|---|---
pipeline | str | The name of the pipeline to retrieve configuration objects from. | required

Returns:

Name | Type | Description
---|---|---
list | list[str] | A list of strings containing the names of all configuration objects for the given pipeline.
node_infos(pipeline_type=None)
classmethod
Aggregates the pipeline operations and their corresponding types and metadata.
This method iterates through all the pipelines registered in the class, and gathers information about each node/operation, such as the pipeline types it appears in, the return type of the operation, and the operation's docstring.
Returns:

Type | Description
---|---
list[dict] | TODO...
non_interactive_pipeline()
Returns all non-interactive operators/pipeline nodes.
operator_types()
classmethod
Returns a dictionary of operators with their types, suitable for declaring a pydantic model.
When the JSON-schema restriction is enabled, only valid JSON-schema types are included in the model. The typical use case is exposing the pipeline through this model in an HTTP API, e.g. via FastAPI; in that case only types that are valid JSON schema should be allowed. This restriction is set to "False" by default.
pipeline_graph(image_path=None, document_logic_id='*')
classmethod
Generates a visualization of the defined pipelines and optionally saves it as an image.
Parameters:

Name | Type | Description | Default
---|---|---|---
image_path | str \| Path | File path for the generated image. If provided, the generated graph will be saved as an image. | None
document_logic_id | str | The document logic ID for which the pipeline graph should be generated. | '*'

Returns:

Name | Description
---|---
AGraph | A PyGraphviz AGraph object representing the pipeline graph. This object can be visualized or manipulated using PyGraphviz functions.
Notes
This method requires the NetworkX and PyGraphviz libraries to be installed.
pre_cache()
Pre-caches the results of all operators that have caching enabled.
This method iterates through the defined operators and calls each one with caching enabled, storing the results for faster access in future calls.
Returns:

Name | Description
---|---
self | The instance of the class, allowing for method chaining.
run_pipeline(exclude=None)
Runs all operators defined in the pipeline for testing or pre-caching purposes.
!!IMPORTANT!! This function should normally not be used, as the pipeline is executed lazily anyway.
This method iterates through the defined operators and calls each one, ensuring that the extractor logic is functioning correctly and caching the results if required.
run_pipeline_fast()
Runs the pipeline, but excludes long-running calculations.
set_disk_cache_settings(enable, ttl=3600 * 24 * 7)
Sets the disk-cache settings (enable flag and time-to-live in seconds; the default TTL corresponds to one week).
to_dict(*args, **kwargs)
Returns a dictionary that accumulates the properties given in *args or with a mapping in **kwargs.

Parameters:

Name | Type | Description | Default
---|---|---|---
*args | str | A variable number of strings, each representing a property name. | ()
**kwargs | dict | A dictionary mapping property names (values) to custom keys (keys) for the returned dictionary. | {}
Note
This function currently only supports properties that do not require any arguments, such as "full_text". Properties like "answers" that return a function requiring arguments cannot be used with this function.
Returns:

Name | Description
---|---
dict | A dictionary with the accumulated properties and their values, using either the property names or custom keys as specified in the input arguments.
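The accumulation logic described above can be sketched as follows. Doc and its properties are illustrative stand-ins, not real pydoxtools classes:

```python
class Doc:
    full_text = "hello world"
    language = "en"

    def to_dict(self, *args, **kwargs):
        # plain property names keep their own name as the output key
        out = {name: getattr(self, name) for name in args}
        # kwargs maps custom output keys to property names
        out.update({key: getattr(self, prop) for key, prop in kwargs.items()})
        return out

d = Doc()
print(d.to_dict("full_text", lang="language"))
# → {'full_text': 'hello world', 'lang': 'en'}
```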
to_json(*args, **kwargs)
Accumulates the properties given in *args or with a mapping in **kwargs and dumps the output as JSON.

Parameters:

Name | Type | Description | Default
---|---|---|---
*args | str | A variable number of strings, each representing a property name. | ()
**kwargs | dict | A dictionary mapping property names (values) to custom keys (keys) for the returned dictionary. | {}
Note
This function currently only supports properties that do not require any arguments, such as "full_text". Properties like "answers" that return a function requiring arguments cannot be used with this function.
Returns:

Name | Description
---|---
str | A JSON-formatted string representing the accumulated properties and their values, using either the property names or custom keys as specified in the input arguments.
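A JSON dump like this can be built on the same accumulation pattern and serialized with the stdlib json module. Doc is an illustrative stand-in, not the real pydoxtools class:

```python
import json

class Doc:
    full_text = "hello"

    def to_dict(self, *args, **kwargs):
        out = {name: getattr(self, name) for name in args}
        out.update({key: getattr(self, prop) for key, prop in kwargs.items()})
        return out

    def to_json(self, *args, **kwargs):
        # serialize the accumulated properties as a JSON string
        return json.dumps(self.to_dict(*args, **kwargs))

print(Doc().to_json("full_text"))  # → {"full_text": "hello"}
```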
to_yaml(*args, **kwargs)
Accumulates the properties given in *args or with a mapping in **kwargs and dumps the output as YAML.

Parameters:

Name | Type | Description | Default
---|---|---|---
*args | str | A variable number of strings, each representing a property name. | ()
**kwargs | dict | A dictionary mapping property names (values) to custom keys (keys) for the returned dictionary. | {}
|
Note
This function currently only supports properties that do not require any arguments, such as "full_text". Properties like "answers" that return a function requiring arguments cannot be used with this function.
Returns:

Name | Description
---|---
str | A YAML-formatted string representing the accumulated properties and their values, using either the property names or custom keys as specified in the input arguments.
x(operator_name, disk_cache=False, traceable=False)
Calls an extractor from the defined pipeline and returns the result.
Parameters:

Name | Type | Description | Default
---|---|---|---
operator_name | str | The name of the extractor to be called. | required
disk_cache | bool | Whether to cache the call on disk. The pipeline can be told explicitly to cache a call, making caching more efficient by only caching the calls we want. | False
traceable | bool | Some operators propagate the source of their information through the pipeline, which adds traceability. Setting traceable=True turns this feature on. | False
Returns:

Name | Type | Description
---|---|---
Any | Any | The result of the extractor after processing the document.

Raises:

Type | Description
---|---
OperatorException | If an error occurs while executing the extractor.
Notes
The extractor's parameters can be overridden using *args and **kwargs.
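The call-and-wrap behaviour described for x() can be sketched as below. The operator table, the in-memory cache standing in for disk caching, and all names are illustrative, not pydoxtools' actual implementation:

```python
class OperatorException(Exception):
    pass

class Pipeline:
    def __init__(self):
        self._x_func_cache = {}
        # hypothetical operator table for illustration
        self._x_funcs = {"word_count": lambda: len("some example text".split())}

    def x(self, operator_name, disk_cache=False):
        # return a cached result if one exists
        if operator_name in self._x_func_cache:
            return self._x_func_cache[operator_name]
        try:
            result = self._x_funcs[operator_name]()
        except Exception as err:
            # wrap any failure in the documented exception type
            raise OperatorException(
                f"error in operator '{operator_name}'") from err
        if disk_cache:
            # real disk caching omitted; a memory cache stands in here
            self._x_func_cache[operator_name] = result
        return result

p = Pipeline()
print(p.x("word_count"))  # → 3
```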
x_all()
Retrieves the results of all operators defined in the pipeline.
Returns:

Name | Description
---|---
dict | A dictionary containing the results of all operators, with keys as the extractor names and values as the corresponding results.