Recent advances in search, machine learning, and natural language processing have made it possible to extract structured information from free text, providing a new and largely untapped source of insights for well and reservoir planning. However, applying these techniques to data that is messy or lacks a labeled training set poses major challenges; we cover some of the ways in which these problems can be overcome. We present a method to compare the distributions of hypothesized and realized risks to oil wells, as described in two datasets that contain free-text descriptions of risks. We treat one dataset as a training set for a logistic regression classifier, and then use this classifier to label the events in the other, out-of-domain dataset. To adjust for differences between the datasets, we rebalance the training set and supplement it with labeled instances automatically extracted from the test set. These simple domain adaptation techniques allow us to achieve an average F1 score of 0.84 on the out-of-domain test set.
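The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words features, the oversampling scheme used for rebalancing, the keyword rule used to extract labeled instances from the out-of-domain set, and all example texts are assumptions made purely for illustration, and the classifier is reduced to a single binary risk category.

```python
import math
import random
from collections import Counter

def features(text):
    """Bag-of-words counts for a free-text risk description (assumed features)."""
    return Counter(text.lower().split())

def rebalance(examples, seed=0):
    """Oversample so every label has as many examples as the largest class.
    This is one possible rebalancing scheme, chosen here for simplicity."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

def auto_label(texts, keyword_rules):
    """Label out-of-domain texts that match a keyword rule (hypothetical rule set)."""
    labeled = []
    for text in texts:
        for keyword, label in keyword_rules.items():
            if keyword in text.lower():
                labeled.append((text, label))
                break
    return labeled

def score(model, text):
    """Linear score: bias plus the weighted word counts of the text."""
    weights, bias = model
    return bias + sum(weights[w] * c for w, c in features(text).items())

def train(examples, epochs=200, lr=0.1):
    """Binary logistic regression fitted by stochastic gradient descent."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for text, label in examples:
            z = max(-30.0, min(30.0, score((weights, bias), text)))
            error = label - 1.0 / (1.0 + math.exp(-z))
            bias += lr * error
            for w, c in features(text).items():
                weights[w] += lr * error * c
    return weights, bias

def predict(model, text):
    return 1 if score(model, text) > 0 else 0

# Toy data, invented for illustration: label 1 = stuck pipe, 0 = other risk.
source = [
    ("possible stuck pipe in shale section", 1),
    ("risk of stuck pipe while tripping", 1),
    ("lost circulation expected in fractured zone", 0),
    ("potential lost circulation below casing shoe", 0),
    ("mud losses anticipated in depleted sand", 0),
    ("severe losses possible in limestone", 0),
]
target_texts = [
    "pipe became stuck at 2500 m, stuck pipe event",
    "string stuck after connection",
    "total losses while drilling",
]

balanced = rebalance(source)                            # step 1: rebalance
extra = auto_label(target_texts, {"stuck pipe": 1})     # step 2: supplement
model = train(balanced + extra)                         # step 3: fit classifier
labels = [predict(model, t) for t in target_texts]      # step 4: label target set
```

In practice the predicted labels on the out-of-domain set would be compared against a held-out, hand-labeled sample to compute per-class F1; that evaluation step is omitted here because the sketch has no such gold labels.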
