Abstract

The New Mexico produced water chemistry database (PWCD) was constructed to analyze and help alleviate a number of produced-water-related issues in New Mexico. Data in this database suffer from serious quality problems caused by the data acquisition methods. This paper first classifies the problems in the database into the following types:

  1. token-level data problems, such as misspellings and typographical errors,

  2. field-level data problems, such as field-definition conflicts between data sources, for example "first name last name" versus "last name first name" in a name field,

  3. entity-level data problems, such as the same entity appearing under different representations, for example "San Juan 30–6 234" and "SJ 30 6 unit 234",

  4. record-level data problems, such as duplicate records, contradictory records, and incomplete data.

Next, a systematic method is presented to identify and solve these problems in character data. In the first step, a context-based token analysis method was developed to identify and clean up token-level data problems. An extended field-matching algorithm was then developed to identify equivalent string values with different formats and to solve the field-level, entity-level, and record-level problems. The solution to the entity identity problem was applied in the data-linking process, which links relevant records in different databases so that their data can be shared. The same field-matching algorithm was extended to detect duplicate records in the source database and eliminate record-level problems. The processed database provides more accurate, consistent, complete, and reliable data for further analysis. The methods used in this paper can be applied to other character-data quality-control problems.
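To make the field-matching idea concrete, the following is a minimal sketch in Python, not the algorithm developed in this paper: the abbreviation table, the filler-token set, and the rule that two names match when their normalized token sequences are identical are all assumptions made for this illustration.

    import re

    # Hypothetical abbreviation table and filler-token set; the real PWCD
    # rules are not given here, so these values are assumptions for the example.
    ABBREVIATIONS = {"sj": "san juan"}
    FILLER_TOKENS = {"unit"}

    def normalize(name):
        """Lowercase, split on non-alphanumeric characters, expand
        abbreviations, and drop filler tokens; return a canonical token tuple."""
        tokens = re.split(r"[^a-z0-9]+", name.lower())
        canonical = []
        for tok in tokens:
            if not tok or tok in FILLER_TOKENS:
                continue
            canonical.extend(ABBREVIATIONS.get(tok, tok).split())
        return tuple(canonical)

    def same_entity(a, b):
        """Treat two names as the same entity if they normalize identically."""
        return normalize(a) == normalize(b)

    # "San Juan 30-6 234" and "SJ 30 6 unit 234" both normalize to
    # ('san', 'juan', '30', '6', '234'), so they would be linked as one entity.
    print(same_entity("San Juan 30-6 234", "SJ 30 6 unit 234"))  # True

A full implementation would also need to tolerate misspellings and field-format conflicts, which this exact-match sketch does not attempt.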

Introduction

The general purpose of data quality control is to transform the source data into well-defined, consistent forms of information encoding and representation for further analysis or applications. Different communities use different names for the same process: statisticians and epidemiologists speak of record or data linkage, computer scientists of data scrubbing, pre-processing, or data cleaning (1)(2)(3), while in commercial processing of customer databases or business mailing lists the same processes are called merge/purge processing, data integration, or ETL (4).

Data quality problems can be classified into two categories according to the number of data sources: single-source problems and multi-source problems. Single-source problems are mainly caused by careless design of the schema and integrity constraints. The schema and constraints of a data source dictate which data can be entered or stored and which cannot, so careless design gives rise to a high probability of errors and inconsistencies. For example, a constraint restricting the pH values of produced water samples to between 0 and 14 can raise an alarm when values fall outside this range. Some of these problems can be solved by integrity checks. Multi-source problems are caused by the integration of multiple (mostly independently developed) data sources, each of which may contain dirty data, represent the same data in different ways, and contradict the others. Multi-source problems are more complicated than single-source problems because all problems from the single-source case can occur with different representations in different sources.
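As an illustration of such a single-source integrity constraint, the pH rule can be written as a simple range check applied at data-entry or load time. This is only a sketch; the record layout (a dictionary with a "pH" key) is an assumption made for the example.

    def check_ph(record):
        """Return a list of warnings for one produced-water sample record."""
        warnings = []
        ph = record.get("pH")
        if ph is None:
            warnings.append("pH value missing")
        elif not 0.0 <= ph <= 14.0:
            warnings.append("pH %s outside the valid range 0-14" % ph)
        return warnings

    print(check_ph({"pH": 6.8}))   # []
    print(check_ph({"pH": 71.0}))  # ['pH 71.0 outside the valid range 0-14']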
