Identification of lithology types is one of the processes geoscientists rely on to understand subsurface formations and better evaluate the quality of reservoirs and aquifers. However, direct lithological identification usually requires considerable effort and time. Therefore, researchers have developed several machine learning models based on well-logging data to avoid the challenges associated with direct lithological identification and to increase identification accuracy. Nevertheless, high uncertainty and low accuracy are commonly encountered due to the heterogeneous nature of lithology types. This work employs decision tree ensemble techniques to predict lithologies more accurately in a time-saving and cost-efficient manner, while accounting for uncertainty.
This study investigated a real-world well-log dataset from the public Athabasca Oil Sands Database to identify and extract the relevant features. We then conducted thorough training using grid search to optimize the hyperparameters of the ensemble decision tree models. Two ensemble techniques were evaluated: random forest (RF) and extreme gradient boosting (XGB). We selected accuracy, precision, and recall as metrics to assess the developed models' performance using 5-fold cross-validation. Finally, we performed a chi-squared test of the hypothesis that the two models perform identically.
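The grid-search and cross-validation workflow described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the data are synthetic stand-ins for the Athabasca well logs, the hyperparameter grid is hypothetical, and only the RF model is shown (XGBoost would slot into the same `GridSearchCV` call).

```python
# Sketch of grid-search hyperparameter tuning with 5-fold cross-validation.
# Synthetic data and a small hypothetical grid; the paper uses real well-log
# features from the Athabasca Oil Sands Database.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for well-log features across four lithology classes
# (e.g. sand, coal, sandy shale, cemented sand).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Exhaustive search over a small grid, scored by accuracy under 5-fold CV.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [5, 10]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Per-class precision and recall, as used in the study, can then be read from `sklearn.metrics.classification_report` on held-out predictions.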
The XGB and RF models achieved 94% and 93% accuracy, respectively. The XGB model's weighted average recall and precision, both 93%, are only 5% and 4% higher than the RF model's. In addition, the chi-squared test yielded a p-value of 0.013, suggesting a low probability of a difference in the two models' performance. Sand and coal formations are more straightforward to classify than sandy shale and cemented sand; the low representation of sandy shale and cemented sand in the dataset may explain their prediction errors. The developed models can classify the studied field's lithologies with an overall accuracy of 94%. In addition, there is no statistically significant evidence of a difference in prediction performance between XGB and RF.
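One plausible form of the chi-squared comparison between the two classifiers is a test of independence on a contingency table of correct versus incorrect predictions. The abstract does not specify the exact table layout, and the counts below are illustrative only (scaled from the reported 94% and 93% accuracies on a hypothetical 1000-sample test set).

```python
# Chi-squared test of independence on a 2x2 contingency table.
# Rows: model; columns: correct vs. incorrect predictions on the same
# held-out samples. Counts are illustrative, not from the paper.
from scipy.stats import chi2_contingency

table = [
    [940, 60],  # XGB: 94% accuracy on 1000 hypothetical samples
    [930, 70],  # RF:  93% accuracy
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p:.3f}, dof={dof}")
```

A large p-value here would fail to reject the null hypothesis of identical performance, which is the conclusion the study draws; for paired predictions on the same samples, McNemar's test is a common alternative.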