PdfConverter
A specialized class for the conversion of PDF documents, leveraging Pydantic
for configuration options management and internally encapsulating a DocumentConverter
instance.
This class provides a streamlined interface for converting PDF documents
into text.
Example Usage:
Let’s demonstrate how to convert a PDF document to plain text using PdfConverter.
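from pathlib import Path

from llm_etl_pipeline import PdfConverter

# Path to the PDF to convert (replace with a real file).
dummy_pdf_path = Path("example.pdf")
# Default options: no OCR, table structure detection on, no cell matching.
converter = PdfConverter()
# convert_to_text returns the extracted plain text as a str.
extracted_text = converter.convert_to_text(dummy_pdf_path)
print(extracted_text)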
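The same call extends naturally to a folder of documents. The following is a minimal sketch, not part of the library: `convert_folder` is a hypothetical helper, and the only assumption it makes about the converter is the documented `convert_to_text(path) -> str` method, so any object with that method (such as a PdfConverter instance) can be passed in.

```python
from pathlib import Path
from typing import Dict


def convert_folder(converter, input_dir: Path, output_dir: Path) -> Dict[str, Path]:
    """Convert every *.pdf in input_dir and write one .txt file per document.

    `converter` is any object exposing convert_to_text(path) -> str
    (hypothetical helper; only that one method is assumed).
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    written: Dict[str, Path] = {}
    # Sort for a deterministic processing order.
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        text = converter.convert_to_text(pdf_path)
        target = output_dir / f"{pdf_path.stem}.txt"
        target.write_text(text, encoding="utf-8")
        written[pdf_path.name] = target
    return written
```

With the library installed, a call could look like `convert_folder(PdfConverter(), Path("pdfs"), Path("texts"))`, where both folder names are placeholders.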
API Reference
- class llm_etl_pipeline.PdfConverter(**data)
Bases: BaseModel
A specialized class for the conversion of PDF documents, leveraging Pydantic for configuration options management and internally encapsulating a DocumentConverter instance.
This class provides a streamlined interface for converting PDF documents into text, with configurable OCR, table structure detection, and cell matching during the conversion process.
All attributes are currently frozen: they are set when the instance is created and cannot be reassigned afterwards.
- Parameters:
do_ocr (bool)
do_table_structure (bool)
do_cell_matching (bool)
- do_ocr
Indicates whether Optical Character Recognition (OCR) should be performed on the PDF document. Defaults to False. This field is frozen (frozen=True).
- Type:
bool
- do_table_structure
Indicates whether to detect table structures within the PDF. Defaults to True. This field is frozen (frozen=True).
- Type:
bool
- do_cell_matching
Indicates whether to perform cell matching for detected tables. Defaults to False. This field is frozen (frozen=True).
- Type:
bool
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- do_ocr: bool
- do_table_structure: bool
- do_cell_matching: bool
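The frozen fields and the validation behaviour described above can be illustrated without the library itself. The sketch below assumes Pydantic v2 and uses a hypothetical stand-in model, `PdfOptionsSketch`, that mirrors the three documented fields and their defaults; it is not the PdfConverter source, only an illustration of how frozen Pydantic fields behave.

```python
from pydantic import BaseModel, Field, ValidationError


class PdfOptionsSketch(BaseModel):
    """Hypothetical stand-in mirroring the three documented fields."""

    do_ocr: bool = Field(default=False, frozen=True)
    do_table_structure: bool = Field(default=True, frozen=True)
    do_cell_matching: bool = Field(default=False, frozen=True)


# Options are chosen once, as keyword arguments at construction time.
options = PdfOptionsSketch(do_ocr=True)

# Reassigning a frozen field is rejected with a ValidationError.
frozen_error = None
try:
    options.do_ocr = False
except ValidationError as exc:
    frozen_error = exc.errors()[0]["type"]

# Input that cannot be validated as a bool is rejected at construction.
invalid_error = None
try:
    PdfOptionsSketch(do_ocr="not a boolean")
except ValidationError as exc:
    invalid_error = exc.errors()[0]["type"]

# To use different settings, build a new instance instead of mutating.
ocr_disabled = PdfOptionsSketch(do_ocr=False)
```

Assuming PdfConverter declares its fields the same way, the equivalent call would be `PdfConverter(do_ocr=True)`, with a new instance created whenever different settings are needed.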
- convert_to_text(input_pdf_path)
Converts a PDF document to plain text using the internal DocumentConverter instance.
This method takes various forms of PDF input (file path, string path, or document stream) and utilizes the pre-configured DocumentConverter to perform the conversion.
- Parameters:
input_pdf_path (Union[Path, str, DocumentStream]) – The path to the input PDF document (as a pathlib.Path object or string), or a DocumentStream object representing the PDF content.
- Returns:
The extracted plain text content from the converted PDF document.
- Return type:
str
- Raises:
Exception – Re-raises any exception encountered during the conversion process by the underlying DocumentConverter. An error message is also logged.
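Because convert_to_text re-raises conversion errors, callers that process many files may want to catch them so that one bad document does not stop the run. The following is a minimal sketch under that assumption: `convert_or_none` is a hypothetical helper, not part of the library, and it relies only on the documented `convert_to_text` method.

```python
import logging
from pathlib import Path
from typing import Optional, Union

logger = logging.getLogger(__name__)


def convert_or_none(converter, pdf_path: Union[Path, str]) -> Optional[str]:
    """Return the extracted text, or None if the conversion raised.

    `converter` is any object exposing convert_to_text(path) -> str
    (hypothetical helper; only that one method is assumed).
    """
    try:
        return converter.convert_to_text(pdf_path)
    except Exception:
        # Log the traceback on the caller's side and let the caller continue.
        logger.exception("Skipping %s: conversion failed", pdf_path)
        return None
```

Catching the broad Exception mirrors the documented Raises entry; if the underlying converter's specific exception types are known, catching only those is the safer choice.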