PdfConverter

PdfConverter is a Pydantic model that converts PDF documents to plain text. Its configuration options are declared as Pydantic fields, and it wraps a DocumentConverter instance internally, so a conversion needs only one method call.

Example Usage:

The following example converts a PDF document to plain text using PdfConverter.

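from pathlib import Path

from llm_etl_pipeline import PdfConverter

# Path to the PDF to convert; replace with a real file.
dummy_pdf_path = Path("example.pdf")

# Create a converter with the default options.
converter = PdfConverter()

# Extract the document's text and print it.
extracted_text = converter.convert_to_text(dummy_pdf_path)
print(extracted_text)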
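
For more than one document, the same call can be applied in a loop. The sketch below is a minimal illustration and not part of llm_etl_pipeline: convert_folder is a hypothetical helper, and it assumes only that the object passed in exposes convert_to_text(path) returning a str, as documented for PdfConverter.

```python
from pathlib import Path
from typing import Dict


def convert_folder(converter, folder: Path) -> Dict[str, str]:
    """Convert every *.pdf directly inside `folder` and return {file name: text}.

    Hypothetical helper: `converter` is any object with a
    convert_to_text(path) -> str method, such as a PdfConverter instance.
    """
    results: Dict[str, str] = {}
    # sorted() makes the processing order deterministic across platforms.
    for pdf_path in sorted(folder.glob("*.pdf")):
        text = converter.convert_to_text(pdf_path)
        # Save the text next to the source file, e.g. report.pdf -> report.txt.
        pdf_path.with_suffix(".txt").write_text(text, encoding="utf-8")
        results[pdf_path.name] = text
    return results
```

With the real class this would be called as convert_folder(PdfConverter(), Path("pdfs")), where "pdfs" is a placeholder directory name.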

API Reference

class llm_etl_pipeline.PdfConverter(**data)[source]

Bases: BaseModel

A specialized class for the conversion of PDF documents, leveraging Pydantic for configuration options management and internally encapsulating a DocumentConverter instance.

This class provides a streamlined interface for converting PDF documents into text, with configurable OCR, table structure detection, and cell matching during the conversion process.

All attributes are currently frozen: they are set when the instance is created and cannot be reassigned afterwards.

Parameters:
  • do_ocr (bool)

  • do_table_structure (bool)

  • do_cell_matching (bool)

do_ocr

Indicates whether Optical Character Recognition (OCR) should be performed on the PDF document. Defaults to False. This field is frozen (frozen=True).

Type:

bool

do_table_structure

Indicates whether to detect table structures within the PDF. Defaults to True. This field is frozen (frozen=True).

Type:

bool

do_cell_matching

Indicates whether to perform cell matching for detected tables. Defaults to False. This field is frozen (frozen=True).

Type:

bool

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.
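
The frozen fields and the constructor validation described above can be illustrated with a stand-in model. This is a sketch, not the source of PdfConverter: PdfOptionsSketch and is_rejected are hypothetical names, the class only mirrors the three documented fields and their defaults, and the behavior shown assumes Pydantic v2.

```python
from pydantic import BaseModel, Field, ValidationError


class PdfOptionsSketch(BaseModel):
    """Hypothetical stand-in mirroring the documented fields and defaults."""

    do_ocr: bool = Field(default=False, frozen=True)
    do_table_structure: bool = Field(default=True, frozen=True)
    do_cell_matching: bool = Field(default=False, frozen=True)


def is_rejected(action) -> bool:
    """Run `action` and report whether Pydantic raised ValidationError."""
    try:
        action()
    except ValidationError:
        return True
    return False


# Options are chosen once, as keyword arguments to the constructor.
options = PdfOptionsSketch(do_ocr=True)

# Reassigning a frozen field after construction is rejected.
reassign_rejected = is_rejected(lambda: setattr(options, "do_ocr", False))

# A value that cannot be interpreted as a bool fails validation up front.
bad_input_rejected = is_rejected(lambda: PdfOptionsSketch(do_ocr="maybe"))

print(reassign_rejected, bad_input_rejected)
```

In practice this means a differently configured converter is obtained by constructing a new instance rather than by mutating an existing one.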

do_ocr: bool
do_table_structure: bool
do_cell_matching: bool
convert_to_text(input_pdf_path)[source]

Converts a PDF document to plain text using the internal DocumentConverter instance.

This method takes various forms of PDF input (file path, string path, or document stream) and utilizes the pre-configured DocumentConverter to perform the conversion.

Parameters:

input_pdf_path (Union[Path, str, DocumentStream]) – The path to the input PDF document (as a pathlib.Path object or string), or a DocumentStream object representing the PDF content.

Returns:

The extracted plain text content from the converted PDF document.

Return type:

str

Raises:

Exception – Re-raises any exception encountered during the conversion process by the underlying DocumentConverter. An error message is also logged.
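
Because convert_to_text re-raises whatever the underlying converter raised, callers that process many files may want to contain failures per document. The following is a minimal sketch of one way to do that; try_convert is a hypothetical helper, not part of the library, and it assumes only the documented convert_to_text(input_pdf_path) interface returning a str.

```python
import logging
from pathlib import Path
from typing import Optional, Union

logger = logging.getLogger(__name__)


def try_convert(converter, pdf_path: Union[Path, str]) -> Optional[str]:
    """Return the extracted text, or None if the conversion raised.

    Hypothetical helper: `converter` is any object with a
    convert_to_text(path) -> str method, such as a PdfConverter instance.
    """
    try:
        return converter.convert_to_text(pdf_path)
    except Exception as exc:  # the documented contract is a bare Exception
        # Record the failure and let the caller carry on with other files.
        logger.warning("Skipping %s: %s", pdf_path, exc)
        return None
```

Catching Exception broadly is usually discouraged, but it matches the documented contract here; the except clause can be narrowed if the concrete exception types are known.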