Parser — Construction Cost RAG Pipeline

January 2026

Engineered a multi-stage document parsing pipeline in Python using Docling OCR and regex-based heuristics to extract, classify, and normalize construction cost data from books, reducing manual data correction effort by ~95%.
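As a rough illustration of the regex-based extract, classify, and normalize steps, the sketch below parses one line of OCR text into a structured record. The line layout (code, description, unit, cost), the `CostItem` fields, the `UNIT_ALIASES` table, and the function names are all assumptions made for this example, not the pipeline's actual rules.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed line layout: "<code> <description> <unit> <cost>",
# e.g. "03 30 00 Cast-in-place concrete CY 1,250.50" (hypothetical).
ITEM_RE = re.compile(
    r"^(?P<code>\d{2}(?: \d{2}){1,2})\s+"
    r"(?P<desc>.+?)\s+"
    r"(?P<unit>[A-Za-z]{2,4}\.?)\s+"
    r"\$?(?P<cost>\d{1,3}(?:,\d{3})*(?:\.\d+)?)$"
)

# Hypothetical alias table mapping raw unit abbreviations to canonical names.
UNIT_ALIASES = {
    "CY": "cubic_yard",
    "SF": "square_foot",
    "LF": "linear_foot",
    "EA": "each",
}


@dataclass
class CostItem:
    code: str
    description: str
    unit: str
    cost: float


def classify_line(line: str) -> str:
    """Label a line as 'item', 'heading', or 'other' using simple heuristics."""
    text = line.strip()
    if ITEM_RE.match(text):
        return "item"
    # Heuristic: all-caps text without digits is treated as a section heading.
    if text.isupper() and not any(ch.isdigit() for ch in text):
        return "heading"
    return "other"


def parse_line(line: str) -> Optional[CostItem]:
    """Extract and normalize a cost item, or return None if the line does not match."""
    match = ITEM_RE.match(line.strip())
    if match is None:
        return None
    raw_unit = match["unit"].rstrip(".").upper()
    return CostItem(
        code=match["code"],
        description=match["desc"].strip(),
        # Unknown units fall back to the lowercased raw abbreviation.
        unit=UNIT_ALIASES.get(raw_unit, raw_unit.lower()),
        cost=float(match["cost"].replace(",", "")),
    )
```

Lines that fail the item pattern return `None`, so they can be routed to a later stage instead of being silently dropped.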

Implemented an LLM-powered post-processing layer on OpenAI's GPT-4.1-mini API, with batch processing and retry logic, to normalize raw extracted items into a canonical schema. Generated natural-language chunks from the normalized items and indexed them in ChromaDB using HuggingFace BGE embeddings, with FlashRank reranking at query time, enabling RAG-based semantic search over the entire cost database.
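The batch processing and retry logic could look roughly like the sketch below. It is library-agnostic: the LLM call is passed in as a plain callable, and the names `batched`, `with_retry`, and `normalize_all`, the batch size, and the exponential backoff schedule are assumptions for illustration rather than the project's actual code.

```python
import time
from typing import Callable, Iterator, List, Sequence

# Type of the (hypothetical) function that sends one batch to the LLM
# and returns one normalized result per input item.
BatchFn = Callable[[List[str]], List[str]]


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def with_retry(
    fn: BatchFn,
    batch: List[str],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call fn(batch), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(batch)
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


def normalize_all(
    items: Sequence[str],
    fn: BatchFn,
    batch_size: int = 20,
    **retry_kwargs,
) -> List[str]:
    """Normalize all items batch by batch, preserving input order."""
    results: List[str] = []
    for batch in batched(items, batch_size):
        results.extend(with_retry(fn, batch, **retry_kwargs))
    return results
```

Injecting the batch function and the `sleep` callable keeps the retry behavior testable without network access. In practice, the retry would likely be narrowed to transient errors (rate limits, timeouts) rather than every exception.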

Tech Stack

Python, Docling, EasyOCR, Pandas, LlamaIndex, ChromaDB, HuggingFace, FlashRank