Reference
Pipeline
Base class for all document classes in pydoxtools, defining a common pipeline interface and establishing a basic pipeline schema that derived classes can override.
The MetaPipelineClassConfiguration acts as a compiler to resolve the pipeline hierarchy, allowing pipelines to inherit, mix, extend, or partially overwrite each other. Each key in the _pipelines dictionary represents a different pipeline version.
The pydoxtools.Document class leverages this functionality to build separate pipelines for different file types, as the information processing requirements differ significantly between file types.
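The inherit/extend/partially-overwrite behaviour can be pictured as plain dict merging. The following is an illustrative simplification only, not the actual `MetaPipelineClassConfiguration` compiler code, and the operator names are made up:

```python
# Illustrative sketch: pipeline versions as dicts of out-key -> operator,
# where a derived pipeline inherits, extends, or partially overwrites a base.
base = {"full_text": "load_text_op", "language": "detect_lang_op"}

# a hypothetical "*.pdf" pipeline inherits everything from base,
# overrides one operator, and adds a new one
pdf = {**base, "full_text": "pdf_text_op", "tables": "pdf_table_op"}

print(pdf["language"])   # inherited from base
print(pdf["full_text"])  # overridden for PDFs
```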
Attributes:

Name | Type | Description
---|---|---
_operators | dict[str, list[Operator]] | Stores the definition of the pipeline graph: a collection of connected operators/functions that process data from a document.
_pipelines | dict[str, dict[str, Operator]] | Provides access to all operator functions by the "out-key" defined in _operators.
Todo
- Use pandera (https://github.com/unionai-oss/pandera) to validate dataframes exchanged between operators & loaders (https://pandera.readthedocs.io/en/stable/pydantic_integration.html)
configuration
property
Returns a dictionary of all configuration objects for the current pipeline.
Returns:

Name | Description
---|---
dict | A dictionary containing the names and values of all configuration objects for the current pipeline.
pipeline_chooser: str
property
Must be implemented by derived classes to decide which pipeline they should use.
uuid
cached
property
Retrieves a universally unique identifier (UUID) for the instance.

This method generates a new UUID for the instance using Python's uuid.uuid4() function. The UUID is then cached as a property, ensuring that the same UUID is returned on subsequent accesses.

Returns:

Type | Description
---|---
uuid.UUID | A unique identifier for the instance.
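The documented behaviour corresponds to the standard cached-property pattern. A minimal sketch, which mirrors what is described above but is not necessarily pydoxtools' exact implementation:

```python
import uuid
from functools import cached_property

class Pipeline:
    @cached_property
    def uuid(self) -> uuid.UUID:
        # generated once on first access, then cached on the instance
        return uuid.uuid4()

p = Pipeline()
assert p.uuid == p.uuid  # the same UUID on every subsequent access
```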
x_funcs: dict[str, Operator]
cached
property
Returns all operators/pipeline nodes and their property names for this specific file type/pipeline.
__getattr__(extract_name)
Retrieves an extractor result by directly accessing it as an attribute.
This method is automatically called for attribute names that aren't defined on class level, allowing for a convenient way to access pipeline operator outputs without needing to call the 'x' method.
Example
document.addresses instead of document.x('addresses')
Parameters:

Name | Type | Description | Default
---|---|---|---
extract_name | str | The name of the extractor result to be accessed. | required

Returns:

Name | Type | Description
---|---|---
Any | Any | The result of the extractor after processing the document.
__getitem__(extract_name)
Retrieves an extractor result via dictionary-style item access.
This method allows pipeline operator outputs to be accessed with subscript notation, providing a convenient alternative to calling the 'x' method.
Example
document["addresses"] instead of document.x('addresses')
Parameters:

Name | Type | Description | Default
---|---|---|---
extract_name | str | The name of the extractor result to be accessed. | required

Returns:

Name | Type | Description
---|---|---
Any | Any | The result of the extractor after processing the document.
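Both access styles can be pictured as thin wrappers around x(). A simplified sketch; the operator table and its contents here are made up for illustration and this is not the real pydoxtools code:

```python
class Pipeline:
    def __init__(self):
        # hypothetical operator table; real pipelines build this from _operators
        self._x_funcs = {"addresses": lambda: ["42 Example Road"]}

    def x(self, extract_name):
        return self._x_funcs[extract_name]()

    def __getattr__(self, extract_name):
        # only invoked when normal attribute lookup fails
        try:
            return self.x(extract_name)
        except KeyError:
            raise AttributeError(extract_name)

    def __getitem__(self, extract_name):
        return self.x(extract_name)

doc = Pipeline()
print(doc.addresses)     # attribute-style access
print(doc["addresses"])  # item-style access
```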
__getstate__()
Returns the variables necessary for pickling, omitting everything that could potentially contain a lambda function (lambdas cannot be pickled).
__init__(**configuration)
Initializes the Pipeline instance with cache-related attributes.
**configuration: A dictionary of key-value pairs representing the configuration settings for the pipeline. Each key is a string representing the name of the configuration setting, and the value is the corresponding value to be set.
__repr__()
Returns:

Name | Description
---|---
str | A string representation of the instance.
__setstate__(state)
Restores the instance state after unpickling; _x_func_cache needs to be restored for pickling to work.
gather_inputs(mapped_args, traceable)
Gathers arguments from the pipeline and class, and maps them to the provided keys of kwargs.
This method retrieves all required input parameters from _in_mapping, which was declared with "pipe". It first checks if the parameter is available as an extractor. If so, it calls the function to get the value. Otherwise, it gets the member-variables or functions of the derived pipeline class if an extractor with that name cannot be found.
Parameters:

Name | Type | Description | Default
---|---|---|---
mapped_args | dict | A dictionary containing the keys to be mapped to the corresponding values. | required

Returns:

Name | Description
---|---
dict | A dictionary containing the mapped keys and their corresponding values.
get_configuration_names(pipeline)
cached
classmethod
Returns a list of names of all configuration objects for a given pipeline.
This is a cached function, which is important for performance.
Parameters:

Name | Type | Description | Default
---|---|---|---
pipeline | str | The name of the pipeline to retrieve configuration objects from. | required

Returns:

Name | Type | Description
---|---|---
list | list[str] | A list of strings containing the names of all configuration objects for the given pipeline.
node_infos(pipeline_type=None)
classmethod
Aggregates the pipeline operations and their corresponding types and metadata.
This method iterates through all the pipelines registered in the class, and gathers information about each node/operation, such as the pipeline types it appears in, the return type of the operation, and the operation's docstring.
Returns:

Type | Description
---|---
list[dict] | TODO...
non_interactive_pipeline()
Returns all non-interactive operators/pipeline nodes.
operator_types()
classmethod
Returns a dictionary of operators with their types, suitable for declaring a pydantic model.
When the JSON-schema restriction is enabled, only valid JSON-schema types are included in the model. The typical use case is exposing the pipeline through this model in an HTTP API, e.g. via FastAPI; in that case only types that are valid JSON schema should be allowed. This restriction is set to "False" by default.
pipeline_graph(image_path=None, document_logic_id='*')
classmethod
Generates a visualization of the defined pipelines and optionally saves it as an image.
Parameters:

Name | Type | Description | Default
---|---|---|---
image_path | str \| Path | File path for the generated image. If provided, the generated graph will be saved as an image. | None
document_logic_id | str | The document logic ID for which the pipeline graph should be generated. | '*'

Returns:

Name | Description
---|---
AGraph | A PyGraphviz AGraph object representing the pipeline graph. This object can be visualized or manipulated using PyGraphviz functions.
Notes
This method requires the NetworkX and PyGraphviz libraries to be installed.
pre_cache()
Pre-caches the results of all operators that have caching enabled.
This method iterates through the defined operators and calls each one with caching enabled, storing the results for faster access in future calls.
Returns:

Name | Description
---|---
self | The instance of the class, allowing for method chaining.
run_pipeline(exclude=None)
Runs all operators defined in the pipeline for testing or pre-caching purposes.
!!IMPORTANT!! This function should normally not be used, as the pipeline is executed lazily anyway.
This method iterates through the defined operators and calls each one, ensuring that the extractor logic is functioning correctly and caching the results if required.
run_pipeline_fast()
Runs the pipeline, but excludes long-running calculations.
set_disk_cache_settings(enable, ttl=3600 * 24 * 7)
Sets the disk-cache settings (enable flag and time-to-live in seconds; the default TTL corresponds to one week).
to_dict(*args, **kwargs)
Returns a dictionary that accumulates the properties given in *args or with a mapping in **kwargs.

Parameters:

Name | Type | Description | Default
---|---|---|---
*args | str | A variable number of strings, each representing a property name. | ()
**kwargs | dict | A dictionary mapping property names (values) to custom keys (keys) for the returned dictionary. | {}
Note
This function currently only supports properties that do not require any arguments, such as "full_text". Properties like "answers" that return a function requiring arguments cannot be used with this function.
Returns:

Name | Description
---|---
dict | A dictionary with the accumulated properties and their values, using either the property names or custom keys as specified in the input arguments.
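The accumulation logic described above can be sketched as follows. Doc and its properties are illustrative stand-ins, not real pydoxtools classes:

```python
class Doc:
    full_text = "hello world"
    language = "en"

    def to_dict(self, *args, **kwargs):
        # plain property names keep their own name as the output key
        out = {name: getattr(self, name) for name in args}
        # kwargs maps custom output keys to property names
        out.update({key: getattr(self, prop) for key, prop in kwargs.items()})
        return out

d = Doc()
print(d.to_dict("full_text", lang="language"))
# → {'full_text': 'hello world', 'lang': 'en'}
```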
to_json(*args, **kwargs)
Accumulates the properties given in *args or with a mapping in **kwargs and dumps the output as JSON.

Parameters:

Name | Type | Description | Default
---|---|---|---
*args | str | A variable number of strings, each representing a property name. | ()
**kwargs | dict | A dictionary mapping property names (values) to custom keys (keys) for the returned dictionary. | {}
Note
This function currently only supports properties that do not require any arguments, such as "full_text". Properties like "answers" that return a function requiring arguments cannot be used with this function.
Returns:

Name | Description
---|---
str | A JSON-formatted string representing the accumulated properties and their values, using either the property names or custom keys as specified in the input arguments.
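A JSON dump like this can be built on the same accumulation pattern and serialized with the stdlib json module. Doc is an illustrative stand-in, not the real pydoxtools class:

```python
import json

class Doc:
    full_text = "hello"

    def to_dict(self, *args, **kwargs):
        out = {name: getattr(self, name) for name in args}
        out.update({key: getattr(self, prop) for key, prop in kwargs.items()})
        return out

    def to_json(self, *args, **kwargs):
        # serialize the accumulated properties as a JSON string
        return json.dumps(self.to_dict(*args, **kwargs))

print(Doc().to_json("full_text"))  # → {"full_text": "hello"}
```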
to_yaml(*args, **kwargs)
Accumulates the properties given in *args or with a mapping in **kwargs and dumps the output as YAML.

Parameters:

Name | Type | Description | Default
---|---|---|---
*args | str | A variable number of strings, each representing a property name. | ()
**kwargs | dict | A dictionary mapping property names (values) to custom keys (keys) for the returned dictionary. | {}
|
Note
This function currently only supports properties that do not require any arguments, such as "full_text". Properties like "answers" that return a function requiring arguments cannot be used with this function.
Returns:

Name | Description
---|---
str | A YAML-formatted string representing the accumulated properties and their values, using either the property names or custom keys as specified in the input arguments.
x(operator_name, disk_cache=False, traceable=False)
Calls an extractor from the defined pipeline and returns the result.
Parameters:

Name | Type | Description | Default
---|---|---|---
operator_name | str | The name of the extractor to be called. | required
disk_cache | bool | Whether to cache the call on disk. The pipeline can be told explicitly to cache a call, making caching more efficient by only caching the calls we want. | False
traceable | bool | Some operators propagate the source of their information through the pipeline, which adds traceability. Setting traceable=True turns this feature on. | False
Returns:

Name | Type | Description
---|---|---
Any | Any | The result of the extractor after processing the document.

Raises:

Type | Description
---|---
OperatorException | If an error occurs while executing the extractor.
Notes
The extractor's parameters can be overridden using *args and **kwargs.
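The call-and-wrap behaviour described for x() can be sketched as below. The operator table, the in-memory cache standing in for disk caching, and all names are illustrative, not pydoxtools' actual implementation:

```python
class OperatorException(Exception):
    pass

class Pipeline:
    def __init__(self):
        self._x_func_cache = {}
        # hypothetical operator table for illustration
        self._x_funcs = {"word_count": lambda: len("some example text".split())}

    def x(self, operator_name, disk_cache=False):
        # return a cached result if one exists
        if operator_name in self._x_func_cache:
            return self._x_func_cache[operator_name]
        try:
            result = self._x_funcs[operator_name]()
        except Exception as err:
            # wrap any failure in the documented exception type
            raise OperatorException(
                f"error in operator '{operator_name}'") from err
        if disk_cache:
            # real disk caching omitted; a memory cache stands in here
            self._x_func_cache[operator_name] = result
        return result

p = Pipeline()
print(p.x("word_count"))  # → 3
```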
x_all()
Retrieves the results of all operators defined in the pipeline.
Returns:

Name | Description
---|---
dict | A dictionary containing the results of all operators, with keys as the extractor names and values as the corresponding results.