Document

A Document holds raw text, a structured collection of paragraphs (each of which can contain sentences), or both.

Example Usage:

The example below creates a Document from raw text, then retrieves its sentences and paragraphs with and without a regex filter.

from llm_etl_pipeline.extraction.public.documents import Document

# Create a simple document with raw text
doc_raw_text = Document(
    raw_text="This is the first sentence. This is the second sentence.\n\n"
             "This is a new paragraph. Another sentence in the same paragraph."
)

# Example using get_paras_or_sents_raw_text
# Get all sentences
all_sentences = doc_raw_text.get_paras_or_sents_raw_text(reference_depth='sentences')

# Get paragraphs containing "new paragraph"
filtered_paragraphs = doc_raw_text.get_paras_or_sents_raw_text(
    regex_pattern="new paragraph",
    reference_depth='paragraphs'
)

# Get sentences containing "second"
filtered_sentences = doc_raw_text.get_paras_or_sents_raw_text(
    regex_pattern="second",
    reference_depth='sentences'
)

API Reference

class llm_etl_pipeline.Document(**data)[source]

Bases: BaseModel

Represents a document, capable of storing raw text and/or a structured collection of paragraphs, which can in turn contain sentences.

This class provides functionalities for segmenting text into paragraphs and sentences, and for retrieving filtered content based on regular expressions and reference depth.

Parameters:
  • raw_text (Annotated[str, Strict(strict=True), StringConstraints(strip_whitespace=True, to_upper=None, to_lower=None, strict=None, min_length=1, max_length=None, pattern=None)] | None)

  • paragraphs (list[Paragraph])

  • paragraph_segmentation_mode (Literal['newlines', 'empty_line', 'sat'])

  • sat_model_id (Literal['sat-1l', 'sat-1l-sm', 'sat-3l', 'sat-3l-sm', 'sat-6l', 'sat-6l-sm', 'sat-9l', 'sat-12l', 'sat-12l-sm'] | str | pathlib._local.Path)

raw_text

The raw, unsegmented text of the document. Cannot be reassigned once set to a truthy value.

Type:

Optional[NonEmptyStr]

paragraphs

A list of Paragraph objects representing the segmented content of the document. Cannot be reassigned once populated.

Type:

list[Paragraph]
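
The write-once behavior described for raw_text and paragraphs can be pictured with a small plain-Python sketch. This is not the library's code: the class name WriteOnceDocument is hypothetical, the real Document is a Pydantic model that presumably enforces the rule through its own validation, and the exception type raised here (ValueError) is an assumption.

```python
class WriteOnceDocument:
    """Hypothetical illustration of write-once attributes; not the library's Document."""

    _write_once = ("raw_text", "paragraphs")

    def __init__(self, raw_text=None, paragraphs=None):
        self.raw_text = raw_text
        self.paragraphs = paragraphs if paragraphs is not None else []

    def __setattr__(self, name, value):
        # Block reassignment only when the current value is already truthy.
        if name in self._write_once and getattr(self, name, None):
            raise ValueError(f"'{name}' cannot be reassigned once set")
        super().__setattr__(name, value)


doc = WriteOnceDocument()
doc.raw_text = "Hello."        # allowed: the previous value was None
try:
    doc.raw_text = "Other."    # rejected: raw_text already holds a truthy value
except ValueError as exc:
    print(exc)
```

Empty values (None, an empty list) stay assignable in this sketch, which mirrors the "once set to a truthy value" and "once populated" wording above.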

paragraph_segmentation_mode

The method used for segmenting raw_text into paragraphs. Defaults to “empty_line”.

  • “newlines”: Segments by newline characters.

  • “empty_line”: Segments by empty lines.

  • “sat”: Uses a SaT (Segment any Text) model for segmentation.

Type:

Literal[“newlines”, “empty_line”, “sat”]
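
The difference between the two rule-based modes can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper names split_on_newlines and split_on_empty_lines are hypothetical, and the handling of surrounding whitespace and blank segments is an assumption.

```python
import re


def split_on_newlines(text: str) -> list[str]:
    """Hypothetical stand-in for "newlines": every line break starts a new paragraph."""
    return [part.strip() for part in text.split("\n") if part.strip()]


def split_on_empty_lines(text: str) -> list[str]:
    """Hypothetical stand-in for "empty_line": only blank lines separate paragraphs."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


text = "First line.\nSecond line.\n\nThird line."
print(split_on_newlines(text))     # ['First line.', 'Second line.', 'Third line.']
print(split_on_empty_lines(text))  # ['First line.\nSecond line.', 'Third line.']
```

The “sat” mode is not sketched here because it depends on a trained model rather than a fixed splitting rule.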

sat_model_id

The identifier for the SaT model to be used for segmentation (both paragraphs and sentences, if “sat” mode is used). Defaults to “sat-3l-sm”.

Type:

SaTModelId

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

raw_text: typing.Optional[typing.Annotated[str]]
paragraphs: list[llm_etl_pipeline.extraction.public.paragraphs.Paragraph]
paragraph_segmentation_mode: typing.Literal['newlines', 'empty_line', 'sat']
sat_model_id: typing.Union[typing.Literal['sat-1l', 'sat-1l-sm', 'sat-3l', 'sat-3l-sm', 'sat-6l', 'sat-6l-sm', 'sat-9l', 'sat-12l', 'sat-12l-sm'], str, pathlib._local.Path]

property sentences: list[Sentence]

Provides access to all sentences within the paragraphs of the document by flattening and combining sentences from each paragraph into a single list.

This property iterates through all Paragraph objects in the paragraphs list and chains their respective sentences lists together.

Returns:

A list of all Sentence objects contained within all paragraphs.

Return type:

list[Sentence]
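
The flattening described above can be sketched with itertools.chain. The Sentence and Paragraph classes below are minimal hypothetical stand-ins (plain dataclasses with an assumed raw_text field), not the library's models, and flatten_sentences is an illustrative helper rather than the actual property.

```python
from dataclasses import dataclass, field
from itertools import chain


@dataclass
class Sentence:  # hypothetical stand-in, not the library's Sentence model
    raw_text: str


@dataclass
class Paragraph:  # hypothetical stand-in, not the library's Paragraph model
    raw_text: str
    sentences: list[Sentence] = field(default_factory=list)


def flatten_sentences(paragraphs: list[Paragraph]) -> list[Sentence]:
    """Chain each paragraph's sentences into one list, preserving document order."""
    return list(chain.from_iterable(p.sentences for p in paragraphs))


paragraphs = [
    Paragraph("A. B.", [Sentence("A."), Sentence("B.")]),
    Paragraph("C.", [Sentence("C.")]),
]
print([s.raw_text for s in flatten_sentences(paragraphs)])  # ['A.', 'B.', 'C.']
```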

get_paras_or_sents_raw_text(regex_pattern=None, reference_depth='sentences')[source]

Retrieves raw text content from either sentences or paragraphs, optionally filtered by a regex pattern.

The method compiles the provided regex (or uses a wildcard regex if none is provided) and then filters the raw text of sentences or paragraphs based on whether they match the pattern.

Parameters:
  • regex_pattern (Optional[RegexPattern]) – An optional regular expression pattern to filter the text items. If None, all items’ raw text will be returned.

  • reference_depth (ReferenceDepth) – Specifies whether to retrieve sentences or paragraphs. Must be “sentences” or “paragraphs”. Defaults to “sentences”.

Returns:

A list of raw text strings from the filtered sentences or paragraphs.

Return type:

list[str]
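
The filtering step can be approximated with the standard re module. This is a sketch under assumptions, not the method's source: filter_raw_text is a hypothetical helper, the fallback pattern ".*" stands in for the "wildcard regex", and matching anywhere in the text (re.search rather than a full match) is assumed.

```python
import re
from typing import Optional


def filter_raw_text(items: list[str], regex_pattern: Optional[str] = None) -> list[str]:
    """Return the strings in `items` that the pattern matches (all of them if no pattern)."""
    # Fall back to a match-everything pattern when no filter is given.
    pattern = re.compile(regex_pattern if regex_pattern is not None else r".*")
    # Assumption: an item is kept when the pattern matches anywhere in it.
    return [text for text in items if pattern.search(text)]


sentences = ["This is the first sentence.", "This is the second sentence."]
print(filter_raw_text(sentences, "second"))  # ['This is the second sentence.']
print(len(filter_raw_text(sentences)))       # 2
```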