Chemostratigraphy has become a powerful correlation tool for identifying strata where traditional stratigraphic methods fail. This is largely due to its ability to capture subtle variations in seemingly homogeneous sedimentary formations or in rocks lacking biostratigraphic markers. In this study, we compared two multivariate statistical analyses for extracting high-resolution stratigraphic information from a major-element dataset acquired using MicroXRF for three Paleozoic formations in Saudi Arabia. Our main goal was to demonstrate the efficiency and reproducibility of the statistical protocol for identifying correlatable chemozones within highly homogeneous formations.
Dimensionality reduction is widely used in machine learning and big data analytics, especially for analyzing and visualizing large, high-dimensional datasets. Chemostratigraphy can benefit immensely from some of these cutting-edge data transformation techniques. As a result of increasingly sophisticated analytical techniques and equipment developed in the past 20 years, it is now possible to detect ever smaller concentrations of the elements of interest, which results in millions of data points for geochemical analysis. One well-established dimensionality reduction technique used in geochemistry for defining chemozones is principal component analysis (PCA), coupled with pairwise correlation and hierarchical clustering of principal components (HCPC) (Williams et al., 2019; Hussain et al., 2020). Although PCA methods have been shown to explain a high percentage of the variance when robust elemental data are available, they depend mainly on the assumption of linear relationships.
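To make the linear projection behind PCA concrete, the following is a minimal sketch and not the protocol used in this study. The helper name `pca_scores`, the five-column layout, and the synthetic concentrations for two hypothetical chemozones are assumptions for illustration only. The clustering step of HCPC, which would operate on the resulting scores, is omitted.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Return PC scores and explained-variance ratios for a samples-by-elements matrix.

    Hypothetical helper for illustration; not the study's implementation.
    """
    # Standardize each element (column) so abundant elements do not dominate.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # The right singular vectors of the standardized matrix are the principal axes.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    # Project the samples onto the leading principal axes.
    scores = Z @ Vt[:n_components].T
    return scores, explained[:n_components]

# Synthetic stand-in data (NOT the study's measurements): 200 samples, 5 elements,
# drawn from two hypothetical chemozones that differ in mean composition.
rng = np.random.default_rng(0)
zone_a = rng.normal(loc=[60, 15, 5, 2, 1], scale=1.0, size=(100, 5))
zone_b = rng.normal(loc=[55, 18, 7, 2, 1], scale=1.0, size=(100, 5))
X = np.vstack([zone_a, zone_b])

scores, explained = pca_scores(X, n_components=2)
```

Because every score is a weighted sum of the standardized element concentrations, any curved relationship between elements can only be approximated by straight axes, which is the linearity assumption noted above.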
We introduce a new algorithm that improves on our earlier results on the same dataset. Uniform Manifold Approximation and Projection (UMAP; McInnes et al., 2018; Allaoui et al., 2020) is a non-linear algorithm for dimensionality reduction based on Riemannian geometry and algebraic topology. It uses an exponential probability distribution in high dimensions and, unlike PCA, is not restricted to Euclidean distances; any distance metric can be used. In addition, the probabilities are not normalized. It considerably outperforms PCA in data clustering and classification tasks and yields higher accuracy.
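The high-dimensional step described above can be sketched as follows. This is a simplified illustration of the membership strengths defined by McInnes et al. (2018), not the reference implementation and not the code used in this study. The helper name `fuzzy_memberships`, the synthetic data, and the choice of a Manhattan distance are assumptions; the low-dimensional embedding optimization is omitted.

```python
import numpy as np

def fuzzy_memberships(X, k=10, metric=None, n_iter=64):
    """Directed and symmetrized UMAP-style membership strengths (hypothetical helper)."""
    n = X.shape[0]
    if metric is None:
        metric = lambda a, b: np.sqrt(np.sum((a - b) ** 2))  # Euclidean default
    # Full pairwise distance matrix; acceptable for a small illustration.
    D = np.array([[metric(X[i], X[j]) for j in range(n)] for i in range(n)])
    target = np.log2(k)
    P = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(D[i])
        nbrs = order[order != i][:k]   # k nearest neighbours, excluding i itself
        d = D[i, nbrs]
        rho = d[0]                     # distance to the nearest neighbour
        # Bisection for sigma_i so that the memberships sum to log2(k).
        lo, hi = 0.0, 1.0
        while np.sum(np.exp(-(d - rho) / hi)) < target:
            hi *= 2.0
        for _ in range(n_iter):
            mid = 0.5 * (lo + hi)
            if np.sum(np.exp(-(d - rho) / mid)) < target:
                lo = mid
            else:
                hi = mid
        # Exponential kernel; rows are NOT rescaled to sum to 1.
        P[i, nbrs] = np.exp(-(d - rho) / hi)
    # Fuzzy union makes the graph symmetric: p_ij = p_i|j + p_j|i - p_i|j * p_j|i.
    P_sym = P + P.T - P * P.T
    return P, P_sym

# Synthetic stand-in data (NOT the study's measurements) with a non-Euclidean metric.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
manhattan = lambda a, b: np.sum(np.abs(a - b))
P_dir, P_sym = fuzzy_memberships(X, k=10, metric=manhattan)
```

In this sketch each row of the directed matrix sums to log2(k) rather than to 1, which is the sense in which the probabilities are not normalized, and the metric is passed in as an ordinary function, so any distance can be substituted.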