An important challenge in asset management is analyzing large amounts of data quickly enough to support timely decision making. Analyzing all available data manually is impractical and inefficient. Pattern recognition algorithms that automatically recognize events of interest are therefore advantageous for effective asset management.
Conventional pattern recognition algorithms usually require a fairly large training set in which data points are carefully prepared and "labeled"; for example, subject-matter experts designate a piece of equipment’s status as healthy or faulty. Preparing such labels is time consuming and error-prone when the training set is large. For most applications, a small number of data points have already been labeled by the experts through their routine activities. While these data points are usually not enough to form a training set for conventional pattern recognition methods, some of the newer methods can take advantage of them along with the hidden manifold structure revealed by the unlabeled data. Moreover, subject-matter experts may be willing to provide more input if they are given a clear indication of the value of that input and a manageable subset of data is pre-selected for their inspection. Many other industries face the same challenge: the cost of acquiring labels is too high to be practical, while large amounts of unlabeled data and limited expert time are available. A suite of advanced machine learning algorithms (e.g., semi-supervised learning, active learning) has been developed to tackle this challenge, and many of them have been used successfully in various applications in the past few years. In this paper, we review the concepts and report our observations about the effectiveness of these methods in a real-world asset management scenario. We consider well test validation in an asset with a large number of tests as an example of a label-rich data set that can serve as the basis for our numerical review of existing methods. Specifically, we look at the task of building a statistical model to recognize the validity of rate measurement tests in a test separator. In this case, the operators have labeled most of these tests as valid or invalid through their daily activities.
The extensive well test validation data provide sufficient information to assess the newer approaches under review. We then plan to apply a similar approach to tasks such as equipment health monitoring, for example identifying pump failures with limited expert input.
Exxon Mobil Corporation has numerous subsidiaries, many with names that include ExxonMobil, Exxon, Esso and Mobil. For convenience and simplicity in this paper, the parent company and its subsidiaries may be referenced separately or collectively as "ExxonMobil." Abbreviated references describing global or regional operational organizations and global or regional business lines are also sometimes used for convenience and simplicity. Nothing in this paper is intended to override the corporate separateness of these separate legal entities. Working relationships discussed in this paper do not necessarily represent a reporting connection, but may reflect a functional guidance, stewardship, or service relationship.
Conceptually, the amount of labeled input can be reduced by combining the information from the labels with the statistical distribution of the data (e.g., clusters). As an extreme example, suppose that the pump measurement data show two distinct clusters and that the operators have labeled a few data points in one cluster as pump failures, because reports had to be made when wells were shut in. This information is sufficient to label one of the clusters as faulty and the other one as healthy. For a new measurement, a prediction may be made by first determining the cluster to which the measurement belongs and then assigning it the corresponding label. Most real-world problems are much more challenging than this example because of the number of data points, the dimensionality of the data, the lack of clear cluster structure and the potential ambiguity of data structures. Nevertheless, similar ideas can be used to develop highly accurate statistical models with a limited number of labels.
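The two-cluster pump example can be illustrated with a minimal cluster-then-label sketch. Everything in it is an assumption made for illustration: the synthetic two-dimensional measurements, the choice of k-means as the clustering method (the text above does not name one), the helper names, and the rule that a cluster containing no expert labels receives the default label "healthy". It is not the method evaluated in this paper.

```python
import math
import random
from collections import Counter


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def init_centroids(points, k):
    """Farthest-first initialization: deterministic, spreads the seeds apart."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, n_iter=50):
    """Plain Lloyd's k-means; returns (centroids, cluster index per point)."""
    centroids = init_centroids(points, k)
    for _ in range(n_iter):
        assignments = [nearest(p, centroids) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            # Keep the old centroid if a cluster happens to be empty.
            updated.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
                if members else centroids[j]
            )
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]


def label_clusters(assignments, known_labels, k, default_label):
    """Majority vote of the few expert labels that fall inside each cluster.

    Clusters without any expert label get `default_label` (an assumption
    mirroring the example: only failures were reported by the operators).
    """
    votes = [Counter() for _ in range(k)]
    for index, label in known_labels.items():
        votes[assignments[index]][label] += 1
    return [v.most_common(1)[0][0] if v else default_label for v in votes]


def predict(measurement, centroids, cluster_labels):
    """Find the cluster of a new measurement, then return that cluster's label."""
    return cluster_labels[nearest(measurement, centroids)]


# Synthetic stand-in for pump measurements: two well-separated groups.
rng = random.Random(0)
normal_ops = [(rng.gauss(1.0, 0.3), rng.gauss(1.0, 0.3)) for _ in range(50)]
failures = [(rng.gauss(5.0, 0.3), rng.gauss(5.0, 0.3)) for _ in range(50)]
points = normal_ops + failures

# Only three of the 100 points carry an expert label (hypothetical indices).
known_labels = {52: "faulty", 67: "faulty", 90: "faulty"}

centroids, assignments = kmeans(points, k=2)
cluster_labels = label_clusters(assignments, known_labels, k=2,
                                default_label="healthy")

print(predict((1.1, 0.9), centroids, cluster_labels))
print(predict((4.8, 5.2), centroids, cluster_labels))
```

With three labels out of 100 points, every measurement receives a prediction because the cluster structure carries the rest of the information. When clusters overlap or the data are high dimensional, the majority vote becomes unreliable, which is where the semi-supervised and active learning methods reviewed in this paper come in.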