chemometrics.PLSRegression¶
- class chemometrics.PLSRegression(n_components=2, *, max_iter=500, tol=1e-06, copy=True)¶
Bases: PLSRegression, LVmixin
PLS regression with added chemometric functionality
References
Calculations according to
- Eriksson(1,2,3)
L. Eriksson, E. Johansson, N. Kettaneh-Wold, J. Trygg, C. Wikström, and S. Wold. Multi- and Megavariate Data Analysis, Part I Basic Principles and Applications. Second Edition.
- __init__(n_components=2, *, max_iter=500, tol=1e-06, copy=True)¶
Methods
__init__([n_components, max_iter, tol, copy])
cooks_distance(X, Y) – Calculate Cook's distance from the calibration data
crit_dhypx([confidence]) – Calculate critical dhypx according to Hotelling's T2
crit_dmodx([confidence]) – Critical distance to hyperplane based on an F2 test
dhypx(X) – Normalized distance on hyperplane
distance_plot(X[, sample_id, confidence]) – Plot distances collinear and orthogonal to model predictor hyperplane
dmodx(X[, normalize, absolute]) – Calculate distance to model hyperplane in X (DModX)
fit(X, Y) – Fit model to data.
fit_transform(X[, y]) – Learn and apply the dimension reduction on the train data.
get_feature_names_out([input_features]) – Get output feature names for transformation.
get_params([deep]) – Get parameters for this estimator.
hat(X) – Calculate the hat (projection) matrix
inverse_transform(X[, Y]) – Transform data back to its original space.
leverage(X) – Calculate the statistical leverage
plot(X, Y) – Displays a figure with 4 common analytical plots for PLS models
predict(X[, copy]) – Predict targets of given samples.
residuals(X, Y[, scaling]) – Calculate (normalized) residuals
score(X, y[, sample_weight]) – Return the coefficient of determination of the prediction.
set_params(**params) – Set the parameters of this estimator.
transform(X[, Y, copy]) – Apply the dimension reduction.
Attributes
coef_ – The coefficients of the linear model.
- property coef_¶
The coefficients of the linear model.
- cooks_distance(X, Y)¶
Calculate Cook’s distance from the calibration data
- Parameters
X ((n, m) ndarray) – Matrix of predictors. n samples x m predictors
Y ((n, o) ndarray) – Matrix of responses. n samples x o responses
- Returns
distances – Cook’s distance of each sample for each response
- Return type
(n, o) ndarray
Notes
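Cook’s distance is calculated according to
\[D_i = \frac{r_i^2}{p\hat\sigma^2} \frac{h_{ii}}{(1-h_{ii})^2}\]
with \(r_i\) being the residuals, \(h_{ii}\) the leverage, \(\hat\sigma^2\) the mean squared error and \(p\) the number of latent variables.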
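The formula above can be tried out on a handful of numbers. The following is a minimal NumPy sketch, not the class's own implementation: the helper name `cooks_distance_sketch` is hypothetical, and it assumes that the variance term is the mean squared error of the residuals with n - p degrees of freedom.

```python
import numpy as np

def cooks_distance_sketch(residuals, leverage, n_latent):
    """Cook's distance from raw residuals (n, o) and leverages (n,).

    Assumption: the variance term is the response-wise mean squared
    error sum(r^2) / (n - p), with p the number of latent variables.
    """
    residuals = np.asarray(residuals, dtype=float)
    h = np.asarray(leverage, dtype=float)[:, None]  # broadcast over responses
    n = residuals.shape[0]
    mse = np.sum(residuals**2, axis=0) / (n - n_latent)
    return residuals**2 / (n_latent * mse) * h / (1.0 - h) ** 2

# Toy data: four samples, one response, one latent variable
r = np.array([[0.5], [-1.0], [0.2], [0.3]])
h = np.array([0.2, 0.6, 0.1, 0.1])
d = cooks_distance_sketch(r, h, n_latent=1)
```

The second sample combines the largest residual with the largest leverage and therefore receives by far the largest distance.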
- crit_dhypx(confidence=0.95)¶
Calculate critical dhypx according to Hotelling’s T2
- crit_dmodx(confidence=0.95)¶
Critical distance to hyperplane based on an F2 test
The critical distance to the model hyperplane is estimated based on an F2 distribution. Values above crit_dmodx may be considered outliers. Because dmodx is only approximately F2 distributed [Eriksson], the estimated critical distance is biased. It nevertheless gives a reasonable indication of points worth investigating.
- dhypx(X)¶
Normalized distance on hyperplane
Provides a distance on the hyperplane, normalized by the distance observed during calibration. It can be a useful measure to see whether new data is comparable to the calibration data. The normalized dhypx is slightly biased towards larger values since the estimated x_residual_std_ is slightly underestimated during model calibration [Eriksson].
- distance_plot(X, sample_id=None, confidence=0.95)¶
Plot distances collinear and orthogonal to model predictor hyperplane
Generates a figure with two subplots. The subplots provide information on how X behaves compared to the calibration data. Subplots:
1) Distance in model hyperplane of predictors. Provides insight into the magnitude of variation within the hyperplane compared to the calibration data. Large values indicate samples which are outside of the calibration space but may be described by linearly scaled latent variables.
2) Distance orthogonal to model hyperplane. Provides insight into the magnitude of variation orthogonal to the model hyperplane compared to the calibration data. Large values indicate samples which show a significant trend not observed in the calibration data.
- dmodx(X, normalize=True, absolute=False)¶
Calculate distance to model hyperplane in X (DModX)
DModX provides the distance to the model hyperplane spanned by the loading vectors. Any information in the predictors that is not captured by the PLS model contributes to DModX. If normalize is True, DModX is divided by the mean residual variance of X observed during model calibration.
- Parameters
X ((n, m) ndarray) – matrix of predictors. n samples x m predictors
normalize ({True (default); False}) – normalization of DModX by error in X during calibration
absolute ({True; False (default)}) – return the absolute distance to the model plane (not normalized by degrees of freedom)
- Returns
dmodx – distance of n samples to model hyperplane
- Return type
(n, ) ndarray
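To illustrate the idea of a distance to the model hyperplane, here is a minimal NumPy sketch. It is not the class's implementation: the helper `dmodx_sketch` and its arguments (`scores`, `loadings`, `s0`) are hypothetical, X is assumed to be preprocessed like the calibration data, and the degrees-of-freedom correction m - a is an assumption.

```python
import numpy as np

def dmodx_sketch(X, scores, loadings, s0=None):
    """Row-wise distance of X to the plane spanned by the loadings.

    X: (n, m) preprocessed predictors, scores: (n, a), loadings: (m, a).
    s0: optional calibration residual scale used for normalization
        (hypothetical argument; pass None for the unnormalized distance).
    """
    X = np.asarray(X, dtype=float)
    m = X.shape[1]
    a = loadings.shape[1]
    E = X - scores @ loadings.T                     # part of X the model misses
    dist = np.sqrt(np.sum(E**2, axis=1) / (m - a))  # assumed df correction
    return dist if s0 is None else dist / s0

# Toy model with a single orthonormal loading vector along the first axis
P = np.array([[1.0], [0.0], [0.0]])
X_new = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
T_new = X_new @ P          # valid projection only because P is orthonormal
dist = dmodx_sketch(X_new, T_new, P)
```

The first sample lies in the model plane and gets a distance of zero, while the second carries variation in the two directions the model does not describe.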
- fit(X, Y)¶
Fit model to data.
- Parameters
X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.
Y (array-like of shape (n_samples,) or (n_samples, n_targets)) – Target vectors, where n_samples is the number of samples and n_targets is the number of response variables.
- Returns
self – Fitted model.
- Return type
object
- fit_transform(X, y=None)¶
Learn and apply the dimension reduction on the train data.
- Parameters
X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.
y (array-like of shape (n_samples, n_targets), default=None) – Target vectors, where n_samples is the number of samples and n_targets is the number of response variables.
- Returns
self – Return x_scores if Y is not given, (x_scores, y_scores) otherwise.
- Return type
ndarray of shape (n_samples, n_components)
- get_feature_names_out(input_features=None)¶
Get output feature names for transformation.
- Parameters
input_features (array-like of str or None, default=None) – Only used to validate feature names with the names seen in fit().
- Returns
feature_names_out – Transformed feature names.
- Return type
ndarray of str objects
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
- hat(X)¶
Calculate the hat (projection) matrix
Calculate the hat matrix in the X/Y score space. The hat matrix \(H\) projects the observed \(Y\) onto the predicted \(\hat Y\). For obtaining the standard hat matrix, the provided X matrix should correspond to the matrix used during the calibration (call to fit) [Eriksson].
- Parameters
X ((n, m) ndarray) – Matrix of predictors. n samples x m predictors
- Returns
hat – Hat matrix, symmetric matrix, n x n samples
- Return type
(n, n) ndarray
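A hat matrix in score space can be sketched with plain NumPy. This is an illustration under assumptions, not the class's code: the helper `hat_sketch` is hypothetical and takes the X scores directly, using the usual projection form \(H = T (T^T T)^{-1} T^T\).

```python
import numpy as np

def hat_sketch(T_cal, T_new=None):
    """Hat matrix in score space, H = T_new (T_cal^T T_cal)^-1 T_new^T.

    T_cal: (n_cal, a) calibration scores, T_new: (n, a) scores to evaluate.
    With T_new omitted, the standard (calibration) hat matrix is returned.
    """
    T_new = T_cal if T_new is None else T_new
    gram_inv = np.linalg.pinv(T_cal.T @ T_cal)
    return T_new @ gram_inv @ T_new.T

# Toy scores: three samples, two latent variables
T = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
H = hat_sketch(T)
```

For calibration scores the result is a projection matrix: it is symmetric, idempotent, and its trace equals the number of latent variables.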
- inverse_transform(X, Y=None)¶
Transform data back to its original space.
- Parameters
X (array-like of shape (n_samples, n_components)) – New data, where n_samples is the number of samples and n_components is the number of pls components.
Y (array-like of shape (n_samples, n_components)) – New target, where n_samples is the number of samples and n_components is the number of pls components.
- Returns
X_reconstructed (ndarray of shape (n_samples, n_features)) – Return the reconstructed X data.
Y_reconstructed (ndarray of shape (n_samples, n_targets)) – Return the reconstructed Y data. Only returned when Y is given.
Notes
This transformation will only be exact if n_components=n_features.
- leverage(X)¶
Calculate the statistical leverage
Calculate the leverage (self-influence of Y) in the X/Y score space. For obtaining the standard leverage, the provided X matrix should correspond to the matrix used during calibration (call to fit).
- Parameters
X ((n, m) ndarray) – Matrix of predictors. n samples x m predictors
- Returns
leverage – leverage for n samples
- Return type
(n, ) ndarray
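The leverage is the diagonal of the hat matrix, so it can be obtained without building the full n x n matrix. The sketch below assumes the score-space form \(h_{ii} = t_i^T (T^T T)^{-1} t_i\); the helper `leverage_sketch` is hypothetical and works on scores rather than on X.

```python
import numpy as np

def leverage_sketch(T_cal, T_new=None):
    """Leverage h_ii = t_i^T (T_cal^T T_cal)^-1 t_i for each row of T_new."""
    T_new = T_cal if T_new is None else T_new
    gram_inv = np.linalg.pinv(T_cal.T @ T_cal)
    # Row-wise quadratic form; avoids the full (n, n) hat matrix
    return np.einsum("ij,jk,ik->i", T_new, gram_inv, T_new)

T = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
h_cal = leverage_sketch(T)                          # calibration samples
h_far = leverage_sketch(T, np.array([[3.0, 3.0]]))  # sample far from calibration
```

Calibration leverages stay between 0 and 1, whereas a new sample far outside the calibration scores can exceed 1, which is why the standard interpretation applies to the calibration matrix only.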
- plot(X, Y)¶
Displays a figure with 4 common analytical plots for PLS models
Generates a figure with four subplots providing analytical insights into the PLS model. Typically, the calibration data is used for the method call. The following four subplots are generated:
1) observed -> predicted. Provides insights into the linearity of the data and shows how well the model performs over the model range.
2) predicted -> studentized residuals. Similar to 1). Useful for evaluating the error structure (e.g. homoscedasticity) and detecting outliers (studentized residuals > 3).
3) leverage -> studentized residuals. Provides insights into any data points/outliers which strongly affect the model. Optimally, the points should be scattered in the center left. The plot includes a limit on the Cook’s distance of 0.5 and 1 as dashed and solid bordeaux lines, respectively.
4) predictors -> VIP. Provides insights into the predictor importance for the model.
- Parameters
X ((n, m) ndarray) – Matrix of predictors. n samples x m predictors
Y ((n, o) ndarray) – Matrix of responses. n samples x o responses
- Returns
axes – List of axes for subplots
- Return type
list(axis, …)
Notes
The residuals are studentized according to
\[\hat{r}_i = \frac{r_i}{\sqrt{MSE (1-h_{ii})}}\]
The Cook’s distance limit is calculated according to
\[\hat{r}_i = \pm \sqrt{D_{crit} p \frac{(1-h_{ii})}{h_{ii}}}\]
with \(\hat{r}_i\) being the studentized residuals, \(r_i\) the original residuals, MSE the mean squared error, \(h_{ii}\) the leverage, \(D_{crit}\) the critical distance, \(p\) the number of latent variables.
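The Cook’s distance limit lines can be reproduced numerically from the second formula. This is a small NumPy sketch with a hypothetical helper name (`cooks_limit`) and an arbitrary leverage grid; it only evaluates the expression and does no plotting.

```python
import numpy as np

def cooks_limit(leverage, d_crit, n_latent):
    """Studentized residual at which Cook's distance equals d_crit."""
    leverage = np.asarray(leverage, dtype=float)
    return np.sqrt(d_crit * n_latent * (1.0 - leverage) / leverage)

# Leverage grid strictly inside (0, 1) to avoid division by zero
h_grid = np.linspace(0.05, 0.95, 10)
upper_05 = cooks_limit(h_grid, 0.5, n_latent=2)  # limit for D_crit = 0.5
upper_10 = cooks_limit(h_grid, 1.0, n_latent=2)  # limit for D_crit = 1
# The lower branches are simply -upper_05 and -upper_10
```

Both limits shrink as the leverage grows, so a high-leverage sample needs only a moderate studentized residual to cross them.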
- predict(X, copy=True)¶
Predict targets of given samples.
- Parameters
X (array-like of shape (n_samples, n_features)) – Samples.
copy (bool, default=True) – Whether to copy X and Y, or perform in-place normalization.
- Returns
y_pred – Returns predicted values.
- Return type
ndarray of shape (n_samples,) or (n_samples, n_targets)
Notes
This call requires the estimation of a matrix of shape (n_features, n_targets), which may be an issue in high dimensional space.
- residuals(X, Y, scaling='studentize')¶
Calculate (normalized) residuals
Calculate the (normalized) residuals. The scaling scheme may be defined between ‘none’, ‘standardize’ and ‘studentize’. The normalized residuals should only be calculated with the current training set.
- Parameters
X ((n, m) ndarray) – Matrix of predictors. n samples x m predictors
Y ((n, o) ndarray) – Matrix of responses. n samples x o responses
scaling ({'none', 'standardize', 'studentize' (default)}) – Define scaling of returned residuals
- Returns
residuals – Matrix of unscaled, standardized or studentized residuals
- Return type
(n, o) ndarray
Notes
The response-wise standard deviation \(\sigma_j\) is calculated according to
\[\sigma_j = \sqrt{\frac{\sum_{i=1}^n r_{i,j}^2}{n - p}}.\]
Residuals are studentized according to
\[\hat{r}_i = \frac{r_i}{\sigma\sqrt{(1-h_{ii})}},\]
with \(\hat{r}_i\) being the studentized residuals, \(r_i\) the original residuals and \(h_{ii}\) the leverage.
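The two formulas above translate directly into a few lines of NumPy. The following sketch uses a hypothetical helper, `scale_residuals_sketch`, and assumes that 'standardize' means division by \(\sigma_j\) alone; the source does not spell that out.

```python
import numpy as np

def scale_residuals_sketch(residuals, leverage, n_latent):
    """Return (standardized, studentized) residuals for an (n, o) matrix.

    Assumption: 'standardized' divides by sigma_j only, while
    'studentized' additionally divides by sqrt(1 - h_ii).
    """
    residuals = np.asarray(residuals, dtype=float)
    leverage = np.asarray(leverage, dtype=float)
    n = residuals.shape[0]
    sigma = np.sqrt(np.sum(residuals**2, axis=0) / (n - n_latent))  # (o,)
    standardized = residuals / sigma
    studentized = standardized / np.sqrt(1.0 - leverage)[:, None]
    return standardized, studentized

r = np.array([[1.0], [-1.0], [2.0], [0.0]])
h = np.array([0.5, 0.25, 0.75, 0.5])
standardized, studentized = scale_residuals_sketch(r, h, n_latent=2)
```

Studentizing inflates the residuals of high-leverage samples, which otherwise look deceptively small because the model is pulled towards them.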
- score(X, y, sample_weight=None)¶
Return the coefficient of determination of the prediction.
The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.
- Parameters
X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns
score – \(R^2\) of self.predict(X) with respect to y.
- Return type
float
Notes
The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with the default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
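The definition of \(R^2\) given for score can be checked by hand with NumPy. The helper `r2_sketch` below is hypothetical and only mirrors the stated formula, averaging uniformly over outputs; it does not handle sample weights.

```python
import numpy as np

def r2_sketch(y_true, y_pred):
    """R^2 = 1 - u / v, averaged uniformly over the output columns."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    u = ((y_true - y_pred) ** 2).sum(axis=0)               # residual sum of squares
    v = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)  # total sum of squares
    return float(np.mean(1.0 - u / v))

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r2_sketch(y, y)                           # exact predictions
constant = r2_sketch(y, np.full_like(y, y.mean()))  # always predict the mean
reversed_pred = r2_sketch(y, y[::-1])               # systematically wrong
```

The three cases cover the range described above: a perfect model scores 1.0, the constant mean model scores 0.0, and a model worse than the mean scores below zero.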
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- transform(X, Y=None, copy=True)¶
Apply the dimension reduction.
- Parameters
X (array-like of shape (n_samples, n_features)) – Samples to transform.
Y (array-like of shape (n_samples, n_targets), default=None) – Target vectors.
copy (bool, default=True) – Whether to copy X and Y, or perform in-place normalization.
- Returns
x_scores, y_scores – Return x_scores if Y is not given, (x_scores, y_scores) otherwise.
- Return type
array-like or tuple of array-like