This paper develops a framework that uses artificial intelligence (AI) to improve both the extraction of knowledge from scanned archival documents and the search and retrieval functions of Knowledge Management Systems (KMS). The framework applies state-of-the-art techniques in natural language processing (NLP) and deep learning (DL) to handle heterogeneous and unstructured data sources. It works in several stages. For data preparation, it uses heuristic and rule-based techniques to extract data from scanned archive documents. It then applies an indexing approach to organize the extracted data. Finally, for information retrieval, it uses a Large Language Model (LLM) to measure the similarity between the user query and the documents. The proposed framework is evaluated against traditional approaches to data extraction, search, and information retrieval. The study shows that rule-based heuristics reduce extraction time by targeting specific parts of a document. In addition, our experiments show that the inverted file (IVF) indexing method gives the fastest search, and that our parallelism approach is effective in optimizing query processing. Performance on the BeIR dataset was consistent across the indexing methods, except for a noticeable drop in accuracy with the product quantization (PQ) index.
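
To make the IVF idea mentioned above concrete, the following is a minimal pure-Python sketch of an inverted file index: vectors are grouped around coarse centroids, and a query scans only the `nprobe` closest groups instead of the whole collection. It is an illustration of the general technique, not the implementation evaluated in the paper; the names `IVFIndex`, `kmeans`, `l2`, `n_lists`, and `nprobe` are assumptions chosen for this example, and a real system would typically rely on an optimized vector search library.

```python
import random


def l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids (illustrative, not tuned)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: l2(v, centroids[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                dim = len(members[0])
                centroids[c] = [
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                ]
    return centroids


class IVFIndex:
    """Hypothetical minimal IVF index: one inverted list per coarse centroid."""

    def __init__(self, vectors, n_lists, seed=0):
        self.vectors = vectors
        self.centroids = kmeans(vectors, n_lists, seed=seed)
        self.lists = [[] for _ in range(n_lists)]
        # Assign every vector id to the inverted list of its nearest centroid.
        for i, v in enumerate(vectors):
            c = min(range(n_lists), key=lambda j: l2(v, self.centroids[j]))
            self.lists[c].append(i)

    def search(self, query, k=1, nprobe=1):
        """Return ids of the k nearest vectors found in the nprobe closest lists."""
        order = sorted(
            range(len(self.centroids)), key=lambda j: l2(query, self.centroids[j])
        )
        candidates = [i for j in order[:nprobe] for i in self.lists[j]]
        # Ties are broken by id so results are deterministic.
        candidates.sort(key=lambda i: (l2(query, self.vectors[i]), i))
        return candidates[:k]
```

Under these assumptions, a small `nprobe` trades some accuracy for speed because fewer lists are scanned, while setting `nprobe` equal to `n_lists` scans every list and reproduces an exhaustive search.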
