chemometrics.PLSRegression

class chemometrics.PLSRegression(n_components=2, *, max_iter=500, tol=1e-06, copy=True)

Bases: PLSRegression, LVmixin

PLS regression with added chemometric functionality

References

Calculations according to

[Eriksson]

L. Eriksson, E. Johansson, N. Kettaneh-Wold, J. Trygg, C. Wikström, and S. Wold. Multi- and Megavariate Data Analysis, Part I Basic Principles and Applications. Second Edition.

__init__(n_components=2, *, max_iter=500, tol=1e-06, copy=True)

Methods

__init__([n_components, max_iter, tol, copy])

cooks_distance(X, Y)

Calculate Cook's distance from the calibration data

crit_dhypx([confidence])

Calculate critical dhypx according to Hotelling's T2

crit_dmodx([confidence])

Critical distance to hyperplane based on an F2 test

dhypx(X)

Normalized distance on hyperplane

distance_plot(X[, sample_id, confidence])

Plot distances colinear and orthogonal to model predictor hyperplane

dmodx(X[, normalize, absolute])

Calculate distance to model hyperplane in X (DModX)

fit(X, Y)

Fit model to data.

fit_transform(X[, y])

Learn and apply the dimension reduction on the train data.

get_feature_names_out([input_features])

Get output feature names for transformation.

get_params([deep])

Get parameters for this estimator.

hat(X)

Calculate the hat (projection) matrix

inverse_transform(X[, Y])

Transform data back to its original space.

leverage(X)

Calculate the statistical leverage

plot(X, Y)

Displays a figure with 4 common analytical plots for PLS models

predict(X[, copy])

Predict targets of given samples.

residuals(X, Y[, scaling])

Calculate (normalized) residuals

score(X, y[, sample_weight])

Return the coefficient of determination of the prediction.

set_params(**params)

Set the parameters of this estimator.

transform(X[, Y, copy])

Apply the dimension reduction.

Attributes

coef_

The coefficients of the linear model.

property coef_

The coefficients of the linear model.

cooks_distance(X, Y)

Calculate Cook’s distance from the calibration data

Parameters
  • X ((n, m) ndarray) – Matrix of predictors. n samples x m predictors

  • Y ((n, o) ndarray) – Matrix of responses. n samples x o responses

Returns

distances – Cook’s distance of each of the n samples for each of the o responses

Return type

(n, o) ndarray

Notes

Cook’s distance is calculated according to

\[D_i = \frac{r_i^2}{p\hat\sigma^2} \frac{h_{ii}}{(1-h_{ii})^2}\]

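
The formula can be evaluated by hand for a single response column. The following is a minimal sketch, not the library's implementation. It assumes the term in the denominator is the residual variance estimate (the MSE also used for the studentized residuals), and the function and argument names are hypothetical.

```python
def cooks_distance(residuals, leverages, n_latent, sigma2):
    """Cook's distance D_i for one response column (illustrative sketch).

    residuals : raw residuals r_i
    leverages : diagonal hat-matrix entries h_ii, each strictly below 1
    n_latent  : p, the number of latent variables
    sigma2    : residual variance estimate (assumed to be the MSE)
    """
    return [
        (r ** 2) / (n_latent * sigma2) * h / (1.0 - h) ** 2
        for r, h in zip(residuals, leverages)
    ]


# Two samples: a high-leverage point and a low-leverage point
d = cooks_distance([1.0, -0.5], [0.5, 0.2], n_latent=2, sigma2=1.0)
print(d)
```

The high-leverage sample dominates even though its residual is only twice as large, which is the behaviour Cook's distance is meant to expose.
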
crit_dhypx(confidence=0.95)

Calculate critical dhypx according to Hotelling’s T2

crit_dmodx(confidence=0.95)

Critical distance to hyperplane based on an F2 test

The critical distance to the model hyperplane is estimated based on an F2 distribution. Values above crit_dmodx may be considered outliers. dmodx is only approximately F2 distributed [Eriksson]. It is thus worth noting that the estimated critical distance is biased. It nevertheless gives a reasonable indication of points worth investigating.

dhypx(X)

Normalized distance on hyperplane

Provides a distance on the hyperplane, normalized by the distance observed during calibration. It can be a useful measure to see whether new data is comparable to the calibration data. The normalized dhypx is slightly biased towards larger values since the estimated x_residual_std_ is slightly underestimated during model calibration [Eriksson].

distance_plot(X, sample_id=None, confidence=0.95)

Plot distances colinear and orthogonal to model predictor hyperplane

Generates a figure with two subplots. The subplots provide information on how X behaves compared to the calibration data. Subplots:

1) Distance in model hyperplane of predictors. Provides insight into the magnitude of variation within the hyperplane compared to the calibration data. Large values indicate samples which are outside of the calibration space but may be described by linearly scaled latent variables.

2) Distance orthogonal to model hyperplane. Provides insight into the magnitude of variation orthogonal to the model hyperplane compared to the calibration data. Large values indicate samples which show a significant trend not observed in the calibration data.

dmodx(X, normalize=True, absolute=False)

Calculate distance to model hyperplane in X (DModX)

DModX provides the distance to the model hyperplane spanned by the loading vectors. Any information in the predictors that is not captured by the PLS model contributes to DModX. If the DModX is normalized, DModX is divided by the mean residual variance of X observed during model calibration.

Parameters
  • X ((n, m) ndarray) – matrix of predictors. n samples x m predictors

  • normalize ({True (default); False}) – normalization of DModX by error in X during calibration

  • absolute ({True; False (default)}) – return the absolute distance to the model plane (not normalized by degrees of freedom)

Returns

dmodx – distance of n samples to model hyperplane

Return type

(n, ) ndarray
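
The roles of normalize and absolute can be illustrated with a small dependency-free sketch. It is not the library's code. It assumes the X residuals (X minus its reconstruction from scores and loadings) are already available and that the degrees-of-freedom correction is m predictors minus A latent variables; the library may use a different convention. The function name and the calibration_scale argument are hypothetical.

```python
import math


def dmodx_from_residuals(x_residuals, n_components,
                         calibration_scale=None, absolute=False):
    """Row-wise distance to the model hyperplane (illustrative sketch).

    x_residuals       : list of rows holding the unexplained part of X
    n_components      : number of latent variables A
    calibration_scale : residual scale observed during calibration, or
                        None to skip the normalization (hypothetical)
    absolute          : if True, skip the degrees-of-freedom correction
    """
    distances = []
    for row in x_residuals:
        sum_sq = sum(e * e for e in row)
        if not absolute:
            # Assumed correction: m predictors minus A latent variables
            sum_sq /= (len(row) - n_components)
        d = math.sqrt(sum_sq)
        if calibration_scale is not None:
            d /= calibration_scale
        distances.append(d)
    return distances


E = [[3.0, 4.0, 0.0], [0.0, 0.0, 0.0]]
print(dmodx_from_residuals(E, n_components=1, absolute=True))
```

A sample that lies exactly in the model hyperplane has zero residuals and therefore a distance of zero, whichever options are chosen.
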

fit(X, Y)

Fit model to data.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.

  • Y (array-like of shape (n_samples,) or (n_samples, n_targets)) – Target vectors, where n_samples is the number of samples and n_targets is the number of response variables.

Returns

self – Fitted model.

Return type

object

fit_transform(X, y=None)

Learn and apply the dimension reduction on the train data.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.

  • y (array-like of shape (n_samples, n_targets), default=None) – Target vectors, where n_samples is the number of samples and n_targets is the number of response variables.

Returns

self – Return x_scores if Y is not given, (x_scores, y_scores) otherwise.

Return type

ndarray of shape (n_samples, n_components)

get_feature_names_out(input_features=None)

Get output feature names for transformation.

Parameters

input_features (array-like of str or None, default=None) – Only used to validate feature names with the names seen in fit().

Returns

feature_names_out – Transformed feature names.

Return type

ndarray of str objects

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

hat(X)

Calculate the hat (projection) matrix

Calculate the hat matrix in the X/Y score space. The hat matrix \(H\) projects the observed \(Y\) onto the predicted \(\hat Y\). For obtaining the standard hat matrix, the provided X matrix should correspond to the matrix used during the calibration (call to fit) [Eriksson].

Parameters

X ((n, m) ndarray) – Matrix of predictors. n samples x m predictors

Returns

hat – Hat matrix, symmetric matrix, n x n samples

Return type

(n, n) ndarray
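
As an illustration of what a hat matrix in score space looks like, the sketch below builds \(H = T (T^T T)^{-1} T^T\) from a score matrix \(T\). It assumes the score columns are mutually orthogonal, so that \(T^T T\) is diagonal, and it ignores any intercept term. The function name is hypothetical and the library's own computation may differ.

```python
def hat_matrix(scores):
    """Projection matrix onto the span of orthogonal score columns (sketch).

    scores : list of n rows, each holding the A scores of one sample
    """
    n = len(scores)
    n_comp = len(scores[0])
    # With orthogonal columns, T'T is diagonal: invert it element-wise
    col_ss = [sum(row[a] ** 2 for row in scores) for a in range(n_comp)]
    return [
        [
            sum(scores[i][a] * scores[j][a] / col_ss[a] for a in range(n_comp))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Four samples, two latent variables with orthogonal score columns
T = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
H = hat_matrix(T)
for row in H:
    print(row)
```

The result is symmetric and idempotent, and its trace equals the number of latent variables, as expected for a projection matrix.
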

inverse_transform(X, Y=None)

Transform data back to its original space.

Parameters
  • X (array-like of shape (n_samples, n_components)) – New data, where n_samples is the number of samples and n_components is the number of pls components.

  • Y (array-like of shape (n_samples, n_components)) – New target, where n_samples is the number of samples and n_components is the number of pls components.

Returns

  • X_reconstructed (ndarray of shape (n_samples, n_features)) – Return the reconstructed X data.

  • Y_reconstructed (ndarray of shape (n_samples, n_targets)) – Return the reconstructed Y target. Only returned when Y is given.

Notes

This transformation will only be exact if n_components=n_features.

leverage(X)

Calculate the statistical leverage

Calculate the leverage (self-influence of Y) in the X/Y score space. For obtaining the standard leverage, the provided X matrix should correspond to the matrix used during calibration (call to fit).

Parameters

X ((n, m) ndarray) – Matrix of predictors. n samples x m predictors

Returns

leverage – leverage for n samples

Return type

(n, ) ndarray
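
The leverage is the diagonal of the hat matrix, so it can be obtained without forming the full n x n matrix. The sketch below is an assumption-laden illustration, not the library's code: it takes a score matrix with mutually orthogonal columns and ignores any intercept term. The function name is hypothetical.

```python
def leverage_from_scores(scores):
    """Leverage h_ii per sample from orthogonal score columns (sketch).

    scores : list of n rows, each holding the A scores of one sample
    """
    n_comp = len(scores[0])
    # Sum of squares per score column (the diagonal of T'T)
    col_ss = [sum(row[a] ** 2 for row in scores) for a in range(n_comp)]
    return [
        sum(row[a] ** 2 / col_ss[a] for a in range(n_comp))
        for row in scores
    ]


# Three samples, two latent variables; the columns are orthogonal
T = [[3.0, 1.0], [-3.0, 1.0], [0.0, -2.0]]
h = leverage_from_scores(T)
print(h, sum(h))
```

For calibration scores the leverages sum to the number of latent variables, which gives a quick sanity check on the result.
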

plot(X, Y)

Displays a figure with 4 common analytical plots for PLS models

Generates a figure with four subplots providing analytical insights into the PLS model. Typically, the calibration data is used for the method call. The following four subplots are generated:

1) observed -> predicted. Provides insights into the linearity of the data and shows how well the model performs over the model range.

2) predicted -> studentized residuals. Similar to 1). Useful for evaluating the error structure (e.g. homoscedasticity) and detecting outliers (studentized residuals > 3).

3) leverage -> studentized residuals. Provides insights into any data points/outliers which strongly affect the model. Optimally, the points should be scattered in the center left. The plot includes a limit on the Cook’s distance of 0.5 and 1 as dashed and solid bordeaux lines, respectively.

4) predictors -> VIP. Provides insights into the predictor importance for the model.

Parameters
  • X ((n, m) ndarray) – Matrix of predictors. n samples x m predictors

  • Y ((n, o) ndarray) – Matrix of responses. n samples x o responses

Returns

axes – List of axes, one per subplot

Return type

list(axis, …)

Notes

The residuals are studentized according to

\[\hat{r}_i = \frac{r_i}{\sqrt{MSE (1-h_{ii})}}\]

The Cook’s distance limit is calculated according to

\[\hat{r}_i = \pm \sqrt{D_{crit} p \frac{(1-h_{ii})}{h_{ii}}}\]

with \(\hat{r}_i\) being the studentized residuals, \(r_i\) the original residuals, MSE the mean squared error, \(h_{ii}\) the leverage, \(D_{crit}\) the critical distance, \(p\) the number of latent variables.
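
The Cook’s distance limit drawn in the leverage plot can be reproduced from the second formula. The following is a minimal sketch with hypothetical names, assuming leverages strictly between 0 and 1; it is not taken from the library source.

```python
import math


def cooks_limit(leverage, d_crit, n_latent):
    """Studentized residual at which Cook's distance equals d_crit (sketch).

    leverage : h_ii, strictly between 0 and 1
    d_crit   : critical Cook's distance, e.g. 0.5 or 1
    n_latent : p, the number of latent variables
    """
    return math.sqrt(d_crit * n_latent * (1.0 - leverage) / leverage)


# Positive branch of the limit curves for D_crit = 0.5 and D_crit = 1
for h in (0.1, 0.3, 0.5):
    print(h, cooks_limit(h, 0.5, 2), cooks_limit(h, 1.0, 2))
```

The limit shrinks as the leverage grows, so a sample with high leverage crosses the line at a much smaller studentized residual than a sample near the center of the calibration data.
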

predict(X, copy=True)

Predict targets of given samples.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Samples.

  • copy (bool, default=True) – Whether to copy X and Y, or perform in-place normalization.

Returns

y_pred – Returns predicted values.

Return type

ndarray of shape (n_samples,) or (n_samples, n_targets)

Notes

This call requires the estimation of a matrix of shape (n_features, n_targets), which may be an issue in high dimensional space.

residuals(X, Y, scaling='studentize')

Calculate (normalized) residuals

Calculate the (normalized) residuals. The scaling scheme may be defined between ‘none’, ‘standardize’ and ‘studentize’. The normalized residuals should only be calculated with the current training set.

Parameters
  • X ((n, m) ndarray) – Matrix of predictors. n samples x m predictors

  • Y ((n, o) ndarray) – Matrix of responses. n samples x o responses

  • scaling ({'none', 'standardize', 'studentize' (default)}) – Define scaling of returned residuals

Returns

residuals – Matrix of unscaled, standardized or studentized residuals

Return type

(n, o) ndarray

Notes

The response-wise standard deviation \(\sigma_j\) is calculated according to

\[\sigma_j = \sqrt{\frac{\sum_{i=1}^n r_{i,j}^2}{n - p}}.\]

Residuals are studentized according to

\[\hat{r}_i = \frac{r_i}{\sigma\sqrt{(1-h_{ii})}},\]

with \(\hat{r}_i\) being the studentized residuals, \(r_i\) the original residuals and \(h_{ii}\) the leverage.
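
The three scaling options can be compared on a single response column. This is a minimal sketch of the formulas above, assuming the leverages are supplied by the caller and that \(p\) in the denominator is the number of latent variables. The function name is hypothetical and the code is not the library's implementation.

```python
import math


def scale_residuals(residuals, leverages, n_latent, scaling="studentize"):
    """Scale the residuals of one response column (illustrative sketch).

    residuals : raw residuals r_i
    leverages : h_ii per sample, each strictly below 1
    n_latent  : p, assumed to be the number of latent variables
    scaling   : 'none', 'standardize' or 'studentize'
    """
    n = len(residuals)
    # Standard deviation of the residuals with n - p degrees of freedom
    sigma = math.sqrt(sum(r * r for r in residuals) / (n - n_latent))
    if scaling == "none":
        return list(residuals)
    if scaling == "standardize":
        return [r / sigma for r in residuals]
    if scaling == "studentize":
        return [r / (sigma * math.sqrt(1.0 - h))
                for r, h in zip(residuals, leverages)]
    raise ValueError("scaling must be 'none', 'standardize' or 'studentize'")


r = [1.0, -1.0, 2.0, -2.0]
h = [0.75, 0.75, 0.25, 0.25]
print(scale_residuals(r, h, n_latent=2, scaling="standardize"))
print(scale_residuals(r, h, n_latent=2, scaling="studentize"))
```

Studentizing inflates the residuals of high-leverage samples, because such samples pull the fit towards themselves and their raw residuals understate the misfit.
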

score(X, y, sample_weight=None)

Return the coefficient of determination of the prediction.

The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred)** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a \(R^2\) score of 0.0.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – \(R^2\) of self.predict(X) wrt. y.

Return type

float

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
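
The definition of \(R^2\) given above can be checked by hand for a single response. The sketch below covers only the unweighted, single-output case and assumes the true values are not all identical; the function name is hypothetical.

```python
def r2_simple(y_true, y_pred):
    """Coefficient of determination 1 - u/v for one response (sketch)."""
    mean = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    v = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1.0 - u / v


y = [1.0, 2.0, 3.0]
print(r2_simple(y, [1.0, 2.0, 3.0]))  # perfect prediction
print(r2_simple(y, [2.0, 2.0, 2.0]))  # constant prediction of the mean
print(r2_simple(y, [3.0, 2.0, 1.0]))  # worse than the constant model
```

The three calls cover the cases named in the description: a perfect model, a constant model that predicts the mean, and a model that is worse than the constant one and therefore scores below zero.
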

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

transform(X, Y=None, copy=True)

Apply the dimension reduction.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Samples to transform.

  • Y (array-like of shape (n_samples, n_targets), default=None) – Target vectors.

  • copy (bool, default=True) – Whether to copy X and Y, or perform in-place normalization.

Returns

x_scores, y_scores – Return x_scores if Y is not given, (x_scores, y_scores) otherwise.

Return type

array-like or tuple of array-like