Abstract

Spatially abnormal data, also called spatial outliers, refer to those data whose non-spatial property values are significantly different from those of other spatial data in their spatial neighborhood, not unlike a brand-new fashionable house surrounded by old and weathered ones. Spatial outlier detection aids in understanding geographical data that are widely used in the petroleum industry and thus is of interest to petroleum scientists.

This paper presents a method for identifying spatial outliers based on spatial statistical theory and the geostatistical concept of local estimation. A k-nearest-neighbor search strategy is used to find a fixed number, k, of neighbors for each spatial point in a spatial data set. The inverse distance weighting local estimation method is then used to calculate the expected value of each spatial point from the values of its neighbors. A z-score test is used to evaluate the difference between the expected and measured values of the spatial samples. The larger the z-score, the larger the difference between the expected value and the measured value. Spatial objects are considered outliers if their z-scores meet the outlier criterion at a given confidence level. Results from different z-score-based outlier detection algorithms are compared and analyzed. The proposed outlier detection algorithm is applied to the New Mexico Produced Water Chemistry Database (PWCD). Results show that outlier detection can aid in checking for bad data and in analyzing produced-water-related problems.
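The pipeline described above (k-nearest-neighbor search, inverse distance weighting, z-score test) can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation: the function names `idw_zscores` and `flag_outliers`, the distance exponent `power=2.0`, the default `k=4`, the threshold of 2.0, and the choice to standardize the residuals (measured minus expected) over the whole data set are all assumptions, since the abstract does not specify them.

```python
import numpy as np

def idw_zscores(coords, values, k=4, power=2.0):
    """Return (expected, z) for each point using k-NN inverse distance weighting.

    Assumed details (not given in the abstract): Euclidean distance,
    weights 1/d**power, and z-scores computed on the residuals
    (measured - expected) over the whole data set.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)

    # Pairwise Euclidean distances; a point is never its own neighbor.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)

    # Indices and distances of the k nearest neighbors of every point.
    nbr = np.argsort(dist, axis=1)[:, :k]
    nbr_dist = np.take_along_axis(dist, nbr, axis=1)

    # Inverse distance weights, normalized to sum to 1 per point.
    # The small floor guards against division by zero for coincident points.
    w = 1.0 / np.maximum(nbr_dist, 1e-12) ** power
    w /= w.sum(axis=1, keepdims=True)

    expected = (w * values[nbr]).sum(axis=1)
    residual = values - expected
    z = (residual - residual.mean()) / residual.std()
    return expected, z

def flag_outliers(z, threshold=2.0):
    """Flag points whose absolute z-score exceeds the (assumed) threshold."""
    return np.abs(z) > threshold
```

For example, on a regular grid where every sample equals 10 except one sample of 100, the anomalous sample receives by far the largest absolute z-score. Its immediate neighbors receive smaller, opposite-sign scores because the anomalous value inflates their expected values.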

Introduction

With increasing amounts of data being collected and stored in databases, efficient and effective analysis methods are needed to make use of the information contained implicitly in the data (1). This process is called knowledge discovery in databases (KDD), or data mining. Most current KDD research aims to identify patterns that apply to the majority of objects in a data set, through tasks such as data clustering, data classification, or data generalization. Outlier detection is another important KDD task.

Outliers are defined as data points that are far outside the norm for a variable or population (2), or as observations that deviate so much from other observations as to arouse suspicion that they were generated by a different mechanism (3). Outliers are also defined as values that are dubious in the eyes of the researcher, and as contaminants. Outliers can arise from several different mechanisms or causes. Some are caused by errors in the data. Others arise from the inherent variability of the data. Outliers may represent nuisance values, errors, or legitimate data, and their identification can lead to the discovery of both useful knowledge and serious problems. Practical applications include credit card fraud detection, oil field production performance analysis, and abnormal event diagnosis.

Many outlier detection methods are discussed in the statistical literature. In statistics, outlier detection methods can be classified into two categories. The first category is distribution-based outlier detection. In this category, a standard distribution (e.g., the normal or Poisson distribution) is fitted to the data, and outliers are defined relative to that distribution. Outlier detection methods in this category are also called discordancy tests (4).
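As a minimal illustration of the distribution-based idea, the sketch below fits a normal distribution to a one-dimensional sample and flags values that lie more than a chosen number of standard deviations from the mean. The function name `normal_discordancy_flags` and the cutoff of 3 standard deviations are assumptions made for illustration. Formal discordancy tests in the literature use test-specific statistics and critical values that are not reproduced here.

```python
import numpy as np

def normal_discordancy_flags(sample, cutoff=3.0):
    """Flag values far from the mean of a fitted normal distribution.

    Assumptions: the data are roughly normal, the fit uses the sample
    mean and the sample standard deviation (ddof=1), and `cutoff` is an
    illustrative choice rather than a formal critical value.
    """
    x = np.asarray(sample, dtype=float)
    mu = x.mean()
    sigma = x.std(ddof=1)
    z = (x - mu) / sigma
    return np.abs(z) > cutoff, z
```

Note that the outlier itself inflates both the fitted mean and standard deviation, so in small samples an extreme value can mask itself. This is one reason formal tests rely on tabulated critical values rather than a fixed cutoff.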
