{ "cells": [ { "cell_type": "markdown", "id": "4564329f-01ac-4df2-8503-f0149be314eb", "metadata": {}, "source": [ "# Multivariate curve resolution (MCR)\n", "\n", "Next to its preprocessing, plotting and supervised calibration functionalities, chemometrics is also useful for the unsupervised factorization of data. In this example, multivariate curve resolution (MCR) is used to decompose Raman spectra of carbohydrate mixtures into single component concentrations. MCR factorizes a spectral data set $D$ into concentration $\\hat C$ and spectra $\\hat S$ estimates such that constraints both on $\\hat C$ and $\\hat S$ are fulfilled. Typical constraints are non-negativity, magnitude of the estimated vectors and unimodality, among others. Mathematically speaking:\n", "\n", "$$\\min||D - \\hat C\\hat S^T ||$$\n", "such that e.g. $$\\hat C, \\hat S > 0$$, \n", "\n", "\n", "MCR is especially useful, if we have additional information on how the concentration profiles or the spectra should look like.\n", "\n", "For this example, we first need to load the required packages. numpy is used for matrix handling, matplotlib for plotting and pandas for importing the data. From chemometrics, we import the main package `chemometrics`, the MCR transformer class (`chemometrics.mcr.McrAR`) and MCR constraints (`chemometrics.mcr.constraint`). " ] }, { "cell_type": "code", "execution_count": null, "id": "42853e18-042c-4d2d-a210-f564b8109d12", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "# chemometrics\n", "import chemometrics as cm\n", "from chemometrics.mcr import McrAR\n", "import chemometrics.mcr.constraint as constraint" ] }, { "cell_type": "markdown", "id": "a2734b42-2013-4a38-b225-0d775bf4413a", "metadata": {}, "source": [ "The data is provided in three csv-files. 'carbs_D.csv' contains the spectra of mixtures of three carbohydrates, 'carbs_C.csv' contains the volume fractions of the three carbohydrates and 'carbs_S.csv' contains the pure component spectra. The three carbohydrates are 1) fructose, 2) lactose, 3) ribose. The original carbohydrate spectra were published as [SPECARB](http://www.models.life.ku.dk/~specarb/specarb.html) and subsequently mixed *in silico*, combined with Gaussian noise and published as part of the R library [mdatools](https://github.com/svkucheryavski/mdatools).\n", "\n", "We are relying on the `read_csv` functionality of pandas to read the data. The data is then converted to pure numpy arrays such that they can be processed with chemometrics and scikit-learn." ] }, { "cell_type": "code", "execution_count": null, "id": "bc8b8640-0731-487e-85dd-6fcd09fdc290", "metadata": {}, "outputs": [], "source": [ "# prepare file names for loading\n", "files = ['carbs_D.csv', 'carbs_C.csv', 'carbs_S.csv']\n", "\n", "# import data into dict\n", "data = {}\n", "for file in files:\n", " data[file] = pd.read_csv(file, index_col=0)\n", "\n", "# convert to numpy array for easy handling with chemometrics\n", "D, C, S = [data[key].values for key in data]\n", "\n", "# store wavenumbers and names of used carbohydrates\n", "wavenumbers = pd.to_numeric(data[files[0]].columns.values)\n", "substance_names = data[files[2]].columns" ] }, { "cell_type": "markdown", "id": "6fbe9710-ca3c-46b9-8f2e-b85ff6c7c82e", "metadata": {}, "source": [ "Typically, in case of MCR, we would not have the actual component concentrations available. However, since we are working with artificial data and to get a feeling on the underlying differences in the spectra, the spectra colored by the reference concentration of lactose are plotted below by a call to `chemometrics.plot_colored_series`.\n", "\n", "Based on the plotted spectra, we see that there are a few peaks associated with lactose. However, there is a significant overlap between lactose and the other two sugars. This is a situation where MCR is useful. MCR factorizes the data while taking into account certain constraints. After factorization, we obtain the estimates of the pure spectra `S` and get an estimate of the concentrations `C` of the components used in the system." ] }, { "cell_type": "code", "execution_count": null, "id": "333b4b3e-e5a1-45cb-933a-962e3b698a7f", "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=[12, 5])\n", "\n", "lines = cm.plot_colored_series(D.T, x=wavenumbers, reference=C[:, 1])\n", "\n", "plt.xlabel(r'Wavenumber / $\\mathrm{cm^{-1}}$')\n", "plt.ylabel('Intensity / AU')" ] }, { "cell_type": "markdown", "id": "64bd7c06-911b-4a2c-827e-6c8bf8cd5d7f", "metadata": {}, "source": [ "## Selecting the number of components\n", "\n", "For real-world data, we often don't know how many components contribute to the spectra. The MCR implementation of chemometrics requires us to provide the number of components when factorizing data. The most straight-forward approach to estimate the number of components for new data is to use a PCA or an SVD and investigate, how much variance is explained by how many components. In chemometrics, we can use the `cm.plot_svd` function to get a quick overview of the singular vectors and the eigenvalues, former corresponding to the composition and spectra, latter corresponding to the explained variance.\n", "\n", "After executing the function with our data, we get three plots sumarizing the first three eigenvectors (left and right plot) as well as the eigenvalues (center plot). If we look at the eigenvalues, we can clearly see that three components are necessary to describe the spectra adequatly. After the first three components, there is a steep drop in the magnitude of eigenvalues with a subsequent plateau (L-shaped curve). The horizontal part of the L is typically considered to be due to measurement noise. We can thus proceed towards fitting an MCR model with three independent components." ] }, { "cell_type": "code", "execution_count": null, "id": "65be4142-d0d7-44c7-ae82-d30b7afcfa7c", "metadata": {}, "outputs": [], "source": [ "cm.plot_svd(D,n_comp=3)" ] }, { "cell_type": "markdown", "id": "a683fc99-03d8-4228-9651-6c4f027ed18e", "metadata": {}, "source": [ "## Fitting an MCR model\n", "\n", "For fitting the MCR model, we need to generate an initial concentration estimate (technically, we could also use a spectral estimate - the procedure would be the same). For a reliable convergence of the algorithm as implemented in chemometrics, it is important to select a reasonable starting estimate of `C`. As observed above, we will need three components to describe the variance in the data. We will use the output of a truncated SVD. However, since the left singular matrix contains negative entries, we are taking the absolut values of the vector and scaling it, such that the left singular matrix largest entry is 1." ] }, { "cell_type": "code", "execution_count": null, "id": "b254e737-95d0-4892-b5f2-e8237ea42057", "metadata": {}, "outputs": [], "source": [ "# construct initial concentration estimate\n", "from sklearn.decomposition import TruncatedSVD\n", "c_init = np.abs(TruncatedSVD(n_components=3).fit_transform(D))\n", "c_init /= np.max(c_init)" ] }, { "cell_type": "markdown", "id": "78276f17-aed3-4b09-842a-fa7e49fb141d", "metadata": {}, "source": [ "With our initial estimate for `C`, we can now initialize `McrAR`. As constraints, we use non-negativity `constraint.Nonneg` for the concentration estimate as well as the spectral estimate. Additionally, to fix the scaling, we set a normalization constraint for the concentration vector `constraint.Normalizer`. After initialization, the MCR transformer is fitted to the data set `D`. To inspect the result, the fitted spectra (red) are plotted below with the pure spectra (black and grey). The results show a very good overlap of the estimated spectra and the pure component spectra." ] }, { "cell_type": "code", "execution_count": null, "id": "51f49573-f890-4490-9cc0-d6f80bb6755f", "metadata": {}, "outputs": [], "source": [ "# generate and fit MCR model\n", "mcr = McrAR(c_constraints=[constraint.Nonneg(), constraint.Normalizer()], st_constraints=[constraint.Nonneg()])\n", "mcr.fit(D, C=c_init)\n", "\n", "for i in range(3):\n", " plt.figure(figsize=[12, 5])\n", " plt.plot(wavenumbers, S[:, i], 'k')\n", " plt.plot(wavenumbers, S, 'k', alpha=0.5)\n", " plt.plot(wavenumbers, mcr.ST_.T[:, i], 'r', alpha=0.8, linewidth=2)\n", " plt.ylabel('Intensity / AU')\n", " plt.xlabel('Wavenumber / cm-1')\n", " plt.title(substance_names[i])" ] }, { "cell_type": "markdown", "id": "f1752878-a7e4-4499-8224-f04e6f29beaa", "metadata": {}, "source": [ "We may also use the `McrAR` transformer for transforming data from the high dimensional spectral space into the concentration space. This is done with the `transform` method. Calls to transform use the estimated spectra to project the data into the concentration space by a least square fit. The estimation of the spectra is not updated.\n", "\n", "Below, the training data was transformed and compared to the true concentrations in an observed vs predicted plot. A systematic bias is visible for the predictions especially of lactose and ribose. However, for an unsupervised prediction, the result is convincing. It is also worth noting that the transform method is very fast since it does not rely on an iterative approach but predicts the concentrations directly with a least square fit and the application of the constraints." ] }, { "cell_type": "code", "execution_count": null, "id": "d5310902-2c26-46ac-a813-67d6023bf806", "metadata": {}, "outputs": [], "source": [ "c_estimate = mcr.transform(D)\n", "plt.figure(figsize=[5,5])\n", "\n", "for i in range(3):\n", " plt.scatter(C[:, i],c_estimate[:, i], alpha=0.5)\n", "\n", "plt.plot([0,1],[0,1], 'k')\n", "plt.legend(substance_names)\n", "plt.xlabel('Observed')\n", "plt.ylabel('Predicted')" ] }, { "cell_type": "markdown", "id": "14c9afc7-a16e-46fd-8b53-2a9b09651a55", "metadata": {}, "source": [ "## Conclusion\n", "\n", "In summary, this example demonstrates, how MCR may be used for the unsupervised factorization of spectroscopic data sets. It showes, how the number of components is chosen, how the `McrAR` fit is performed, the results of the fit and the transformation of a toy dataset.\n", "\n", "`McrAR` is highly extensible and may be completed with other constraints which are not available in chemometrics." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }