
pydoxtools.DocumentBag

Bases: Pipeline

This class is a work in progress (WIP); use it with caution.

The DocumentBag class loads and processes a set of documents using a pipeline. It leverages Dask bags for efficient memory usage and large-scale computations on documents.

Notes
  • The Dask bag documentation can be found at https://docs.dask.org/en/stable/bag.html.
  • Dask dataframes can be used for downstream calculations.
  • This class helps scale LLM & AI inference to larger workloads.
  • It uses iterative Dask bags & dataframes to avoid out-of-memory issues.
Rationale

This class is needed to create and process new document bags, instead of using Dask bags directly with arbitrary data. It reduces the boilerplate code needed to create new documents and traceable data sources.

__repr__()

Returns:

str: A string representation of the instance.

Text extraction attributes and functions

pydoxtools.DocumentBag is built on the pydoxtools.Pipeline class, and most of the text extraction functionality makes extensive use of the pipeline features. All attributes and functions that are created by the pipeline are documented here.

Visualizations of the pipeline structure for the different document types can be found in the pydoxtools documentation.

Document

Get a factory for pre-configured documents. It can be called just like the pydoxtools.Document class, but the resulting document automatically gets assigned the same configuration as all documents in this bag.

name : <DocumentBag>.x('Document') or <DocumentBag>.Document

return type : typing.Any

supports pipeline flows : *, <class 'dask.bag.core.Bag'>, <class 'list'>, <class 'pathlib.Path'>, <class 'pydoxtools.document.DatabaseSource'>
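
A minimal sketch of how this factory might be used; the directory path is a placeholder and assumes the bag was built from a set of files:

    from pathlib import Path
    from pydoxtools import DocumentBag

    # Build a bag over a directory of files (hypothetical path).
    docs = DocumentBag(source=Path("./my_documents"))

    # The factory creates Documents that inherit the bag's doc_configuration,
    # so ad-hoc documents are processed with the same settings as the bag's own.
    doc = docs.Document("Some ad-hoc text that should be processed consistently.")
    print(doc.full_text)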

_stats

A constant value

name : <DocumentBag>.x('_stats') or <DocumentBag>._stats

return type : typing.Any

supports pipeline flows : *, <class 'dask.bag.core.Bag'>, <class 'list'>, <class 'pathlib.Path'>, <class 'pydoxtools.document.DatabaseSource'>

apply

Creates a new DocumentBag by applying a function to the documents in the underlying dask bag. This works similarly to apply on pandas DataFrames and Series, but with documents as the basic datatype. Apply functions are therefore required to produce data which can be used as a document again (which covers a lot of data types).

name : <DocumentBag>.x('apply') or <DocumentBag>.apply

return type : typing.Callable[..., pydoxtools.document.DocumentBag]

supports pipeline flows : *, <class 'dask.bag.core.Bag'>, <class 'list'>, <class 'pathlib.Path'>, <class 'pydoxtools.document.DatabaseSource'>
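
The exact call signature of apply is not shown in this reference; as a rough, assumed sketch, it could be used with a callable that maps each document to new, document-compatible data:

    from pathlib import Path
    from pydoxtools import DocumentBag

    docs = DocumentBag(source=Path("./my_documents"))  # hypothetical directory

    # Assumed usage: derive a new DocumentBag whose documents are built from
    # the first 200 characters of each original document's text.
    short_docs = docs.apply(lambda d: d.full_text[:200])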

bag

pipe_type description
<class 'list'>, <class 'pydoxtools.document.DatabaseSource'> No documentation
<class 'pathlib.Path'> Create a dask bag with all the file paths in it.

name : <DocumentBag>.x('bag') or <DocumentBag>.bag

return type : typing.Any

supports pipeline flows : <class 'list'>, <class 'pathlib.Path'>, <class 'pydoxtools.document.DatabaseSource'>

bag_apply

Applies a function element-wise to the documents in a dask bag and builds a new dask bag from the results. This works similarly to apply on pandas DataFrames and Series, but with documents as the basic datatype. Apply functions are therefore required to produce data which can be used as a document again (which covers a lot of data types).

name : <DocumentBag>.x('bag_apply') or <DocumentBag>.bag_apply

return type : typing.Callable[..., dask.bag.core.Bag]

supports pipeline flows : *, <class 'dask.bag.core.Bag'>, <class 'list'>, <class 'pathlib.Path'>, <class 'pydoxtools.document.DatabaseSource'>

compute

No documentation

name : <DocumentBag>.x('compute') or <DocumentBag>.compute

return type : typing.Any

supports pipeline flows : <class 'dask.bag.core.Bag'>, <class 'list'>, <class 'pathlib.Path'>, <class 'pydoxtools.document.DatabaseSource'>

connection_string

No documentation

name : <DocumentBag>.x('connection_string') or <DocumentBag>.connection_string

return type : typing.Any

supports pipeline flows : <class 'pydoxtools.document.DatabaseSource'>

d

Alias for:

* get_dicts->d (output)

name : <DocumentBag>.x('d') or <DocumentBag>.d

return type : typing.Callable[[typing.Any], dask.bag.core.Bag]

supports pipeline flows : *, <class 'dask.bag.core.Bag'>, <class 'list'>, <class 'pathlib.Path'>, <class 'pydoxtools.document.DatabaseSource'>

dataframe

Load a table using dask/pandas read_sql

sql: can be either a table name (to load the entire table) or an SQL expression

name : <DocumentBag>.x('dataframe') or <DocumentBag>.dataframe

return type : <class 'dask.dataframe.core.DataFrame'>

supports pipeline flows : <class 'pydoxtools.document.DatabaseSource'>
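
A sketch of loading documents from a database table; the connection string, table name and index column are placeholders:

    from pydoxtools import DocumentBag
    from pydoxtools.document import DatabaseSource

    source = DatabaseSource(
        connection_string="sqlite:///products.db",  # hypothetical database
        sql="products",                             # a table name or an SQL expression
        index_column="id",
    )

    docs = DocumentBag(source=source)
    df = docs.dataframe  # dask dataframe loaded via read_sql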

dir_list

pipe_type description
<class 'list'> Applies a function to the items in a dask bag and filters them based on the result;
if the function returns False, the element is dropped from the bag.
<class 'pathlib.Path'> No documentation

name : <DocumentBag>.x('dir_list') or <DocumentBag>.dir_list

return type : <class 'dask.bag.core.Bag'> | typing.Any

supports pipeline flows : <class 'list'>, <class 'pathlib.Path'>

docs

pipe_type description
<class 'dask.bag.core.Bag'> Alias for:
* source->docs (output)
<class 'pydoxtools.document.DatabaseSource'> Create a dask bag with one data document for each row of the source table.
<class 'list'>, <class 'pathlib.Path'> Create a bag with one document for each file that was found. From this point on, the logic is handed off to the dask bag pipeline.

name : <DocumentBag>.x('docs') or <DocumentBag>.docs

return type : <class 'dask.bag.core.Bag'> | typing.Any

supports pipeline flows : <class 'dask.bag.core.Bag'>, <class 'list'>, <class 'pathlib.Path'>, <class 'pydoxtools.document.DatabaseSource'>

e

Alias for:

* exploded->e (output)

name : <DocumentBag>.x('e') or <DocumentBag>.e

return type : typing.Callable[..., ForwardRef('DocumentBag')]

supports pipeline flows : *, <class 'dask.bag.core.Bag'>, <class 'list'>, <class 'pathlib.Path'>, <class 'pydoxtools.document.DatabaseSource'>

exploded

name : <DocumentBag>.x('exploded') or <DocumentBag>.exploded

return type : typing.Callable[..., ForwardRef('DocumentBag')]

supports pipeline flows : *, <class 'dask.bag.core.Bag'>, <class 'list'>, <class 'pathlib.Path'>, <class 'pydoxtools.document.DatabaseSource'>
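
exploded carries no docstring here; a hedged sketch of how it (and its alias e, documented above) might be used, assuming 'text_segments' is an available document property:

    from pathlib import Path
    from pydoxtools import DocumentBag

    docs = DocumentBag(source=Path("./my_documents"))  # hypothetical directory

    # Assumed usage: build a new DocumentBag with one document per text segment
    # of each original document.
    segments = docs.exploded("text_segments")
    # equivalent short form via the alias:
    segments = docs.e("text_segments")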

file_path_list

pipe_type description
<class 'list'> Applies a function to the items in a dask bag and filters them based on the result;
if the function returns False, the element is dropped from the bag.
<class 'pathlib.Path'> No documentation

name : <DocumentBag>.x('file_path_list') or <DocumentBag>.file_path_list

return type : <class 'dask.bag.core.Bag'> | typing.Any

supports pipeline flows : <class 'list'>, <class 'pathlib.Path'>

get_dicts

Returns a function closure which returns a bag of the specified properties of the enclosed documents.

name : <DocumentBag>.x('get_dicts') or <DocumentBag>.get_dicts

return type : typing.Callable[[typing.Any], dask.bag.core.Bag]

supports pipeline flows : *, <class 'dask.bag.core.Bag'>, <class 'list'>, <class 'pathlib.Path'>, <class 'pydoxtools.document.DatabaseSource'>
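
A hedged sketch of get_dicts and its alias d, assuming the closure accepts a document property name and yields a dask bag of the corresponding values:

    from pathlib import Path
    from pydoxtools import DocumentBag

    docs = DocumentBag(source=Path("./my_documents"))  # hypothetical directory

    # Assumed usage: collect the "full_text" property of every document.
    texts = docs.get_dicts("full_text")   # or the short alias: docs.d("full_text")
    print(texts.take(2))                  # dask bags support take() for inspection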

index_column

No documentation

name : <DocumentBag>.x('index_column') or <DocumentBag>.index_column

return type : typing.Any

supports pipeline flows : <class 'pydoxtools.document.DatabaseSource'>

paths

name : <DocumentBag>.x('paths') or <DocumentBag>.paths

return type : typing.Callable

supports pipeline flows : <class 'list'>, <class 'pathlib.Path'>

root_path

Alias for:

* source->root_path (output)

name : <DocumentBag>.x('root_path') or <DocumentBag>.root_path

return type : typing.Any

supports pipeline flows : <class 'list'>, <class 'pathlib.Path'>

sql

No documentation

name : <DocumentBag>.x('sql') or <DocumentBag>.sql

return type : typing.Any

supports pipeline flows : <class 'pydoxtools.document.DatabaseSource'>

stats

Gather a number of statistics from the documents as a pandas dataframe.

name : <DocumentBag>.x('stats') or <DocumentBag>.stats

return type : typing.Any

supports pipeline flows : *, <class 'dask.bag.core.Bag'>, <class 'list'>, <class 'pathlib.Path'>, <class 'pydoxtools.document.DatabaseSource'>
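
A small sketch, assuming stats is accessed like the other pipeline attributes:

    from pathlib import Path
    from pydoxtools import DocumentBag

    docs = DocumentBag(source=Path("./my_documents"))  # hypothetical directory
    stats_df = docs.stats   # pandas dataframe with per-document statistics
    print(stats_df.head())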

take

No documentation

name : <DocumentBag>.x('take') or <DocumentBag>.take

return type : typing.Any

supports pipeline flows : <class 'dask.bag.core.Bag'>, <class 'list'>, <class 'pathlib.Path'>, <class 'pydoxtools.document.DatabaseSource'>

vectorizer

Vectorizes a query, using the document configuration of the DocumentBag to determine which model to use.

name : <DocumentBag>.x('vectorizer') or <DocumentBag>.vectorizer

return type : typing.Any

supports pipeline flows : *, <class 'dask.bag.core.Bag'>, <class 'list'>, <class 'pathlib.Path'>, <class 'pydoxtools.document.DatabaseSource'>
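
A hedged sketch, assuming the vectorizer can be called directly with a query string:

    from pathlib import Path
    from pydoxtools import DocumentBag

    docs = DocumentBag(source=Path("./my_documents"))  # hypothetical directory

    # Assumed usage: embed a free-text query with the same model that the bag's
    # documents are configured to use (e.g. for similarity search downstream).
    query_vec = docs.vectorizer("how do I replace the filter?")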

Configuration parameters

bytes_per_chunk: configuration value for the chunk size used when partitioning data (default: 256 MiB)

doc_configuration: a configuration object that is passed through to the Documents created in this document bag; any setting supported by Document can be specified here (default: {})

forgiving_extracts: when enabled, errors encountered during certain batch operations on the document bag do not stop the extraction; instead, an error message is stored in the affected document (default: False)

verbosity: configuration value for the verbosity level (default: None)
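
A sketch of passing these configuration parameters, assuming they can be supplied as constructor keyword arguments (the doc_configuration key shown is a placeholder for any setting supported by pydoxtools.Document):

    from pathlib import Path
    from pydoxtools import DocumentBag

    docs = DocumentBag(
        source=Path("./my_documents"),                  # hypothetical directory
        doc_configuration=dict(spacy_model_size="md"),  # forwarded to every Document (assumed key)
        forgiving_extracts=True,                        # store errors in documents instead of aborting
    )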