Abstract
The oil and gas industry is data-driven and depends heavily on information technology. According to internal statistics, the amount of data coming from upstream alone doubles every two years. This data arrives from a wide variety of vendor sources and is handled by the repositories of various applications. Well data, in particular, is a key asset for the industry throughout the process lifetime, from early exploration to production. In practice, companies often create their own well data master repositories that are poorly synchronized with each other and with other databases. As a result, well data resides in siloed databases with no commonly defined standard. Often, there is little in the way of a mechanism to cross-validate well data quality across the various sources. Thus, maintaining high-quality definitive versions of well data is critical to any firm's data management strategy.
Recently, Big Data technologies have evolved to quickly fetch and analyze large volumes of data, which can substantially improve data quality within a reasonable time. In this paper, a novel system is presented to preserve a high level of well data quality in a heterogeneous environment. The system uses Apache Spark as its main framework for distributed processing and mid-tier software as a data integration layer. Through a set of defined mapping rules, the system compares data from multiple databases against the database that hosts the organization's verified data. Oil and gas companies typically dedicate one master database to the corporate standard well data, so this database is used as the reference for comparison against well data residing in project repositories. Moreover, the system extends its functionality to cover well sub-data types such as headers, check shots, deviation surveys, and picks. The final output is a data quality report that supports strategic decision-making.
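As a rough illustration of the comparison step described above, the following minimal sketch checks project-repository well headers against a master copy using a set of mapping rules. It is not the system presented in this paper: it uses plain Python dictionaries in place of Apache Spark, and the field names, the `MAPPING_RULES` table, the `Finding` record, and the numeric tolerance are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass

# Hypothetical mapping rules: project-repository field -> master-database field.
MAPPING_RULES = {
    "WELL_NM": "well_name",
    "SURF_LAT": "latitude",
    "SURF_LON": "longitude",
}


@dataclass(frozen=True)
class Finding:
    """One line of the (assumed) data quality report."""
    well_id: str
    field: str            # master field name, or "*" for a whole-record issue
    master_value: object
    project_value: object
    status: str           # "mismatch" or "missing_in_master"


def _equal(a, b, tol):
    # Numeric values are compared within a tolerance; everything else exactly.
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b) <= tol
    return a == b


def compare_to_master(master, project, rules=MAPPING_RULES, tol=1e-6):
    """Compare project well records against the master, keyed by well id."""
    findings = []
    for well_id, record in sorted(project.items()):
        reference = master.get(well_id)
        if reference is None:
            # The well exists in the project repository but not in the master.
            findings.append(Finding(well_id, "*", None, None, "missing_in_master"))
            continue
        for project_field, master_field in rules.items():
            expected = reference.get(master_field)
            actual = record.get(project_field)
            if not _equal(expected, actual, tol):
                findings.append(
                    Finding(well_id, master_field, expected, actual, "mismatch")
                )
    return findings
```

In a distributed setting, the same logic would more naturally be expressed as a join between the master and project tables on the well identifier, followed by a column-wise comparison; the loop above only shows the rule-driven comparison on a small scale.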