Project Structure and Overview
This project is organized into several key directories and files that support its ETL (Extract, Transform, Load) operations, its documentation, and its development practices.
The top-level structure includes:
`.github/workflows/`: Contains the Continuous Integration (CI) workflow definitions.
`docs/`: Houses the source files and configuration for building the project documentation with Sphinx.
`logs/`: Stores the log files generated by the customized loguru logger.
`llm_etl_pipeline/`: The core directory containing the main ETL application logic.
`.pre-commit-config.yaml`: Configures pre-commit hooks that automate steps like code linting and import sorting, enforcing code style and quality before commits.
The `__main__.py` file at the project root orchestrates the entire ETL process; users execute it from the command line, passing the input directory path as an argument. While it currently serves as the main entry point, it can be considered an early version of an orchestration layer. A more sophisticated orchestration layer was not implemented due to time constraints, but the modular design of the `llm_etl_pipeline` directory aims to provide independent, high-level building blocks for composing custom ETL pipelines.
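A minimal sketch of what such an entry point might look like, assuming an argparse-based CLI; the imported function names and the positional `input_dir` argument are hypothetical stand-ins for the project's actual API:

```python
"""Hypothetical sketch of the root __main__.py entry point."""
import argparse
from pathlib import Path

# Assumed public API; the actual names in llm_etl_pipeline may differ.
from llm_etl_pipeline.extraction import extract_documents
from llm_etl_pipeline.transformation import transform_documents


def main() -> None:
    parser = argparse.ArgumentParser(description="Run the LLM ETL pipeline.")
    parser.add_argument("input_dir", type=Path,
                        help="Directory containing the input PDF documents.")
    args = parser.parse_args()

    documents = extract_documents(args.input_dir)  # Extract
    df = transform_documents(documents)            # Transform
    df.to_csv("output.csv", index=False)           # Load (interim approach)


if __name__ == "__main__":
    main()
```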
The `llm_etl_pipeline` directory itself is organized into the following key sub-folders, each developed to be independent and to serve a distinct purpose (illustrative sketches of each follow the list):
`customized_loggers/`: This Python package sets up and manages a customized loguru logger, writing logs to the console and simultaneously to the `logs/` directory at the project's root.
`extraction/`: Contains the core methods and classes for extracting raw data from the PDF documents, including PDF conversion and text segmentation.
`transformation/`: Contains the classes and functions that transform and validate the extracted data, including data cleaning and deduplication logic.
`typings/`: Dedicated to custom Python type hints, which drive runtime (dynamic) validation checks across the project, particularly within the extraction and transformation packages.
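As a rough sketch of the `customized_loggers` setup, using loguru's standard `logger.add` API (the format of the file name and the rotation policy are illustrative, not the project's actual configuration):

```python
import sys

from loguru import logger

# Remove loguru's default handler so we fully control the sinks.
logger.remove()

# Sink 1: human-readable console output.
logger.add(sys.stderr, level="INFO")

# Sink 2: persistent log files under logs/ at the project root,
# rotated once a file reaches 10 MB (illustrative policy).
logger.add("logs/etl_{time}.log", level="DEBUG", rotation="10 MB")

logger.info("Pipeline started")
```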
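A minimal sketch of the extraction step, using `pypdf` as a stand-in for whichever PDF library the project actually uses; both helper functions are hypothetical:

```python
from pathlib import Path

from pypdf import PdfReader


def extract_text(pdf_path: Path) -> str:
    """Convert a PDF into one raw text string (hypothetical helper)."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def segment_paragraphs(text: str) -> list[str]:
    """Naive segmentation: split on blank lines (hypothetical helper)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```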
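A sketch of cleaning and deduplication with pandas; the `text` column name is a hypothetical example:

```python
import pandas as pd


def clean_and_deduplicate(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize text, drop empty rows, and remove duplicates."""
    df = df.copy()
    df["text"] = df["text"].str.strip()  # hypothetical column
    df = df.dropna(subset=["text"])
    df = df[df["text"] != ""]
    return df.drop_duplicates(subset=["text"]).reset_index(drop=True)
```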
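The `typings` package likely contains reusable annotated types along these lines; `NonEmptyStr` and `normalize_title` are hypothetical examples of how such a type can be enforced by Pydantic at runtime:

```python
from typing import Annotated

from pydantic import StringConstraints, validate_call

# A reusable constrained type: any str must be non-empty after stripping.
NonEmptyStr = Annotated[str, StringConstraints(strip_whitespace=True, min_length=1)]


@validate_call
def normalize_title(title: NonEmptyStr) -> str:
    return title.lower()


normalize_title("  Annual Report  ")  # -> "annual report"
normalize_title("   ")  # Raises pydantic.ValidationError
```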
Within the `extraction`, `transformation`, and `typings` sub-folders, you will find additional sub-folders named `public` and `internal`. These indicate whether the methods and classes within are designed for public consumption or internal project use, respectively.
Notably, a dedicated `load` sub-folder has not yet been developed due to time constraints. For now, the processed pandas DataFrames are simply persisted with their `to_csv` method, keeping the initial data-persistence approach simple (see the snippet below).
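As an illustration of this interim approach (the output path and columns are hypothetical):

```python
from pathlib import Path

import pandas as pd

# Hypothetical output location; the real pipeline may write elsewhere.
output_path = Path("output") / "monetary_information.csv"
output_path.parent.mkdir(parents=True, exist_ok=True)

df = pd.DataFrame({"amount": [1_000_000], "currency": ["EUR"]})
df.to_csv(output_path, index=False)  # Interim "load" step
```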
Data Validation with Pydantic
Data validation is a cornerstone of this project’s reliability. To enforce strict data integrity, we’ve heavily leveraged the Pydantic package.
Every public method in the project is decorated with Pydantic's `validate_call`, ensuring that arguments conform to their predefined schemas at the point of invocation. This rigorous validation, together with the immutability described below, applies to public interfaces; private methods are not subject to the same constraints.
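A minimal sketch of how `validate_call` guards a public function; the function and its parameters are hypothetical:

```python
from pydantic import validate_call


@validate_call
def segment_text(text: str, min_sentence_length: int = 5) -> list[str]:
    """Split raw text into sentences (hypothetical public helper)."""
    return [s.strip() for s in text.split(".")
            if len(s.strip()) >= min_sentence_length]


segment_text("A short one. This sentence is long enough.")  # OK
segment_text(12345)  # Raises pydantic.ValidationError: not a str
```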
Furthermore, most of the project's classes, such as `Document` and `MonetaryInformation`, are designed to be immutable: once an instance is initialized, its field values cannot be altered, preventing unintended data corruption.
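In Pydantic v2 terms, this is typically achieved with a frozen model configuration; the fields below are hypothetical stand-ins for the real `MonetaryInformation` schema:

```python
from pydantic import BaseModel, ConfigDict


class MonetaryInformation(BaseModel):
    # frozen=True makes instances immutable (and hashable).
    model_config = ConfigDict(frozen=True)

    amount: float  # hypothetical field
    currency: str  # hypothetical field


info = MonetaryInformation(amount=2_500_000.0, currency="EUR")
info.amount = 0.0  # Raises pydantic.ValidationError: instance is frozen
```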
Testing Strategy
For quality assurance, a suite of unit tests validates individual components and functions of the codebase. While these tests provide a foundation, overall coverage currently stands at approximately 66%, indicating areas for future expansion.
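As an example of the style of unit test in the suite, here is a sketch exercising the hypothetical `segment_text` helper from the earlier validation example; the import path is an assumption:

```python
import pytest
from pydantic import ValidationError

# Hypothetical import path; the real module layout may differ.
from llm_etl_pipeline.extraction.public import segment_text


def test_segment_text_keeps_only_long_sentences():
    result = segment_text("Hi. This sentence is long enough.")
    assert result == ["This sentence is long enough"]


def test_segment_text_rejects_non_string_input():
    # validate_call on the public API should reject wrong argument types.
    with pytest.raises(ValidationError):
        segment_text(12345)
```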