Multiple Regression in R using LARS - Winona



1 - Multiple Regression in R using OLS1.1 – “Review” of OLS Load the comma-delimited file bodyfat.csv into R> Bodyfat = read.table(file.choose(),header=T,sep=",")Read 3528 items > Bodyfat = Bodyfat[,-1] first column density is redundantResponse is in column 1, the candidate predictors are in columns 2 – 14.> X <- Bodyfat[,2:14]> y <- Bodyfat[,1]> dim(X)[1] 252 13> dim(y)[1] 252 1> pairs.plus(Bodyfat) Examine a scatterplot matrix with the “bells and whistles”…> bodyfat.ols = lm(bodyfat~.,data=Bodyfat)> summary(bodyfat.ols)Call:lm(formula = bodyfat ~ ., data = Bodyfat)Residuals: Min 1Q Median 3Q Max -11.1966 -2.8824 -0.1111 3.1901 9.9979 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -21.35323 22.18616 -0.962 0.33680 age 0.06457 0.03219 2.006 0.04601 * weight -0.09638 0.06185 -1.558 0.12047 height -0.04394 0.17870 -0.246 0.80599 neck -0.47547 0.23557 -2.018 0.04467 * chest -0.01718 0.10322 -0.166 0.86792 abdomen 0.95500 0.09016 10.592 < 2e-16 ***hip -0.18859 0.14479 -1.302 0.19401 thigh 0.24835 0.14617 1.699 0.09061 . knee 0.01395 0.24775 0.056 0.95516 ankle 0.17788 0.22262 0.799 0.42505 biceps 0.18230 0.17250 1.057 0.29166 forearm 0.45574 0.19930 2.287 0.02309 * wrist -1.65450 0.53316 -3.103 0.00215 ** ---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 4.309 on 238 degrees of freedomMultiple R-squared: 0.7486, Adjusted R-squared: 0.7348 F-statistic: 54.5 on 13 and 238 DF, p-value: < 2.2e-16Regression diagnostics using a variety of functions written by Chris Malone for his senior capstone project while an undergraduate at WSU.> Diagplot1(bodyfat.ols) Look at Cook’s Distances & Leverages> Diagplot2(bodyfat.ols) DFBETA’s primarily> Diagplot3(bodyfat.ols,dfbet=T) AVP’s, DFBETAS, and VIFS> Resplot Various diagnostic plots examining the residuals2924175216344500> MLRdiag(bodyfat.ols) Inverse response plots with case diagnostics added> VIF(bodyfat.ols) returns table of VIF’s for each predictor. This table is returned by Diagplot3 as wellVariance Inflation Factor Table Variable VIF Rsquaredage age 2.224469 0.5504545weight weight 44.652515 0.9776048height height 2.939110 0.6597610neck neck 4.431923 0.7743643chest chest 10.234694 0.9022931abdomen abdomen 12.775528 0.9217253hip hip 14.541932 0.9312333thigh thigh 7.958662 0.8743507knee knee 4.825304 0.7927592ankle ankle 1.924098 0.4802760biceps biceps 3.670907 0.7275877forearm forearm 2.191933 0.5437817wrist wrist 3.348404 0.7013503There is clearly evidence of collinearity suggesting that a reduced model should be considered. Model “selection” is the focus of this handout. We will first consider using standard stepwise selection methods – forward, backward, mixed, or potentially all possible subsets.1.2 – C + R Plots and CERES Plots in R These plots are used to visualize the functional form for a predictor in a OLS multiple regression setting. We can formulate an OLS regression with a response Y and potential predictors X1,X2,…,Xp as follows:Y= ηo+η1τ1X1+…+ηpτpXp+εwhere the τi(Xi)'s represent the functional form of the ith predictor in the model. For example τiXi=lnXi or τiXi=polynomial of degree 2 in Xi (i.e. add Xi and Xi2) terms to the model. The model above is an example of what we call an additive model. 
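As a concrete illustration of the τj(Xj) idea, the sketch below adds a log term and a degree-2 polynomial term directly in an lm() formula. The choice of weight and abdomen as the terms to transform is purely illustrative (an assumption made for the example, not a finding from the analysis above).

# Illustration only: log and polynomial terms written directly in the formula.
# Which predictors (if any) deserve such terms should be judged from the
# C+R / CERES plots discussed below.
bodyfat.add = lm(bodyfat ~ age + log(weight) + poly(abdomen, 2) + wrist, data = Bodyfat)
summary(bodyfat.add)

# Rough comparison with the corresponding straight-line terms via AIC
AIC(lm(bodyfat ~ age + weight + abdomen + wrist, data = Bodyfat), bodyfat.add)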
Later in the course we will look at the other methods for developing flexible additive models in a regression setting.The package car contains functions for regression that are similar to those available in Arc which the software developed to accompany Applied Regression: Including Computing and Graphics by Cook & Weisberg (text from STAT 360). Although not as interactive as Arc, the crPlots() & ceresPlots() functions in the car library will construct C+R and CERES plots respectively for each term in a regression model. As stated earlier, both C+R plots and CERES Plots are used to visualize the predictors that might benefit from the creation of nonlinear terms based on the predictor. CERES plots are better when there are nonlinear relationships amongst the predictors themselves. The nonlinear relationships between the predictors can “bleed” into the C+R Plots, resulting in an inaccurate representation of the potential ponent + Residual Plots (C+R Plots)> crPlots(bodyfat.ols)CERES Plots (Conditional Expectation RESidual plots)> ceresPlots(bodyfat.ols)1.3 - Standard Stepwise Selection Methods for OLS RegressionThese methods seek to minimize a penalized version of the RSS = residual sum of squares of the regression model. These statistics are Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), adjusted R-square (adj-R2), and Mallow’s Ck and presented below:AIC=nlogRSSkn+2k=1nσ2RSS+2kσ2Ck=RSSkσ2+2k-n=1nRSS+2kσ2BIC= 1nRSS+lognkσ2 Adjusted R2=1- RSS/(n-k-1)SSTot/(n-1)where k = the number of parameters in the candidate model and σ2= estimated residual variance from the “full” model. Minimizing AIC, BIC, or Ck in the case of OLS yields the “best” model according to that criterion. In contrast, the adjusted R2 is maximized to find the “best” model.Backward Elimination> bodyfat.back = step(bodyfat.ols,direction="backward")Backward elimination results are displayed (not shown)> anova(bodyfat.back)Analysis of Variance TableResponse: bodyfat Df Sum Sq Mean Sq F value Pr(>F) age 1 1493.3 1493.3 81.4468 < 2.2e-16 ***weight 1 6674.3 6674.3 364.0279 < 2.2e-16 ***neck 1 182.5 182.5 9.9533 0.001808 ** abdomen 1 4373.0 4373.0 238.5125 < 2.2e-16 ***hip 1 6.9 6.9 0.3747 0.541022 thigh 1 136.6 136.6 7.4523 0.006799 ** forearm 1 90.1 90.1 4.9164 0.027528 * wrist 1 166.8 166.8 9.1002 0.002827 ** Residuals 243 4455.3 18.3 ---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > bodyfat.back$anova Step Df Deviance Resid. Df Resid. 
Dev AIC1 NA NA 238 4420.064 749.84912 - knee 1 0.05885058 239 4420.123 747.85243 - chest 1 0.52286065 240 4420.646 745.88224 - height 1 0.68462867 241 4421.330 743.92125 - ankle 1 13.28231735 242 4434.613 742.67726 - biceps 1 20.71159705 243 4455.324 741.8514Forward Selection (painful due to the fact candidate predictors need to be listed explicitly)> bodyfat.base = lm(bodyfat~1,data=Bodyfat) Model with intercept only> bodyfat.forward step(bodyfat.base,~.+age+weight+height+neck+chest+abdomen+hip+thigh+knee+ankle+biceps+forearm+wrist,direction="forward")Start: AIC=1071.75bodyfat ~ 1 Df Sum of Sq RSS AIC+ abdomen 1 11631.5 5947.5 800.65+ chest 1 8678.3 8900.7 902.24+ hip 1 6871.2 10707.8 948.82+ weight 1 6593.0 10986.0 955.29+ thigh 1 5505.0 12073.9 979.08+ knee 1 4548.4 13030.6 998.30+ biceps 1 4277.3 13301.7 1003.49+ neck 1 4230.9 13348.1 1004.36+ forearm 1 2295.8 15283.2 1038.48+ wrist 1 2111.5 15467.5 1041.50+ age 1 1493.3 16085.7 1051.38+ ankle 1 1243.5 16335.5 1055.26<none> 17579.0 1071.75+ height 1 11.2 17567.7 1073.59Etc…Both or Mixed Selection> bodyfat.mixed = step(bodyfat.ols) default=”both”, feeds in full model> anova(bodyfat.mixed)Analysis of Variance TableResponse: bodyfat Df Sum Sq Mean Sq F value Pr(>F) age 1 1493.3 1493.3 81.4468 < 2.2e-16 ***weight 1 6674.3 6674.3 364.0279 < 2.2e-16 ***neck 1 182.5 182.5 9.9533 0.001808 ** abdomen 1 4373.0 4373.0 238.5125 < 2.2e-16 ***hip 1 6.9 6.9 0.3747 0.541022 thigh 1 136.6 136.6 7.4523 0.006799 ** forearm 1 90.1 90.1 4.9164 0.027528 * wrist 1 166.8 166.8 9.1002 0.002827 ** Residuals 243 4455.3 18.3 ---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > bodyfat.mixed$anova Step Df Deviance Resid. Df Resid. Dev AIC1 NA NA 238 4420.064 749.84912 - knee 1 0.05885058 239 4420.123 747.85243 - chest 1 0.52286065 240 4420.646 745.88224 - height 1 0.68462867 241 4421.330 743.92125 - ankle 1 13.28231735 242 4434.613 742.67726 - biceps 1 20.71159705 243 4455.324 741.8514Stepwise Methods Using the leaps package in RThe package leaps available through CRAN will perform forward, backward, and mixed approaches as well, but offer some improvements over the default step function in base R.> library(leaps)> bodyfat.full = regsubsets(bodyfat~.,data=Bodyfat,nvmax=13)> summary(bodyfat.full)Subset selection objectCall: regsubsets.formula(bodyfat ~ ., data = Bodyfat, nvmax = 13)13 Variables (and intercept) Forced in Forced outage FALSE FALSEweight FALSE FALSEheight FALSE FALSEneck FALSE FALSEchest FALSE FALSEabdomen FALSE FALSEhip FALSE FALSEthigh FALSE FALSEknee FALSE FALSEankle FALSE FALSEbiceps FALSE FALSEforearm FALSE FALSEwrist FALSE FALSE1 subsets of each size up to 13Selection Algorithm: exhaustive age weight height neck chest abdomen hip thigh knee ankle biceps forearm wrist1 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " " " 2 ( 1 ) " " "*" " " " " " " "*" " " " " " " " " " " " " " " 3 ( 1 ) " " "*" " " " " " " "*" " " " " " " " " " " " " "*" 4 ( 1 ) " " "*" " " " " " " "*" " " " " " " " " " " "*" "*" 5 ( 1 ) " " "*" " " "*" " " "*" " " " " " " " " " " "*" "*" 6 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " " " "*" "*" 7 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " " " "*" "*" 8 ( 1 ) "*" "*" " " "*" " " "*" "*" "*" " " " " " " "*" "*" 9 ( 1 ) "*" "*" " " "*" " " "*" "*" "*" " " " " "*" "*" "*" 10 ( 1 ) "*" "*" " " "*" " " "*" "*" "*" " " "*" "*" "*" "*" 11 ( 1 ) "*" "*" "*" "*" " " "*" "*" "*" " " "*" "*" "*" "*" 12 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*" 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" 
"*" > reg.summary = summary(bodyfat.full)> names(reg.summary)[1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj" > par(mfrow=c(2,2)) set up a 2 X 2 grid of plots> plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="b")> plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted R-square",type="b")> plot(reg.summary$cp,xlab="Number of Variables",ylab="Mallow's Cp",type="b")> plot(reg.summary$bic,xlab="Number of Variables",ylab="Bayesian Information Criterion (BIC)",type="b")> par(mfrow=c(1,1)) restore to 1 plot per pageFind “optimal” model size using adjusted-R2, Mallow’s Ck, and BIC> which.max(reg.summary$adjr2)[1] 9> which.min(reg.summary$cp)[1] 7> which.min(reg.summary$bic)[1] 4The regsubsets() function has a built-in plot command which can display the selected variables for the “best” model with a given model selection statistic. The top row of each plot contains a black square for each variable selected according to the optimal model associated with that statistic. Examples using the R2 (unadjusted), adjusted R2, Mallow’s Ck, and the BIC are shown on the following page.> plot(bodyfat.full,scale="r2") > plot(bodyfat.full,scale="adjr2") > plot(bodyfat.full,scale="Cp") > plot(bodyfat.full,scale="bic") Automatic Selection via All Possible SubsetsThe package bestglm uses will return the “best” model using user-specified model selection criterion such as AIC (basically Mallow’s Ck for OLS), BIC, and cross-validation schemes. The PDF documentation for this package is excellent with several complete examples and details on how to use the various options. The output below shows the use of the bestglm function to find the “best” OLS model using the AIC/Ck criterion.> library(bestglm)> Xy = cbind(x,y)> bodyfat.best = bestglm(Xy,IC="AIC")> attributes(bodyfat.best)$names[1] "BestModel" "BestModels" "Bestq" "qTable" "Subsets" "Title" "ModelReport"$class[1] "bestglm"> bodyfat.best$Subsets> bodyfat.best$BestModelCall:lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE), drop = FALSE], y = y))Coefficients:(Intercept) age weight neck abdomen hip thigh forearm wrist -22.65637 0.06578 -0.08985 -0.46656 0.94482 -0.19543 0.30239 0.51572 -1.53665 > summary(bodyfat.best$BestModel)Residuals: Min 1Q Median 3Q Max -10.9757 -2.9937 -0.1644 2.9766 10.2244 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -22.65637 11.71385 -1.934 0.05426 . age 0.06578 0.03078 2.137 0.03356 * weight -0.08985 0.03991 -2.252 0.02524 * neck -0.46656 0.22462 -2.077 0.03884 * abdomen 0.94482 0.07193 13.134 < 2e-16 ***hip -0.19543 0.13847 -1.411 0.15940 thigh 0.30239 0.12904 2.343 0.01992 * forearm 0.51572 0.18631 2.768 0.00607 ** wrist -1.53665 0.50939 -3.017 0.00283 ** ---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 4.282 on 243 degrees of freedomMultiple R-squared: 0.7466, Adjusted R-squared: 0.7382 F-statistic: 89.47 on 8 and 243 DF, p-value: < 2.2e-16Save the “best” OLS model to an object named appropriately. 
We can then examine various regression diagnostics for this model as considered above.
> bodyfat.bestols = lm(formula(bodyfat.best$BestModel))
> MLRdiag(bodyfat.bestols)    etc…
> VIF(bodyfat.bestols)
 Variance Inflation Factor Table
         Variable       VIF  Rsquared
age           age  2.059194 0.5143731
weight     weight 18.829990 0.9468932
neck         neck  4.081562 0.7549958
abdomen   abdomen  8.236808 0.8785937
hip           hip 13.471431 0.9257688
thigh       thigh  6.283117 0.8408433
forearm   forearm  1.940309 0.4846180
wrist       wrist  3.096051 0.6770079
The presence of collinearity even after reducing the model size suggests some problems with the “in/out” selection strategies, regardless of the criterion used to select them.

1.4 - Cross-Validation Functions for OLS
In this section I will show some sample R code that can be used (and altered) to perform cross-validation to estimate the “average” PSE (prediction squared error), MPSE (mean prediction squared error), or the MSEP (mean squared error for prediction). Note these are all the same thing! We could also take the square root of any of these to obtain the root mean squared error of prediction (RMSEP). I won’t even consider listing other associated acronyms. Suppose we have m observations for which we want to predict the value of the response (y). These m observations must NOT have been used to develop/train the model!

MSE for prediction = $\dfrac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_{pred,i}\right)^2$

As discussed in class there are several schemes that can be used to estimate the predictive abilities of a regression model. The methods we discussed are:
- Test/Train Samples or Split Sample approach
- k-fold Cross-Validation (k-fold CV)
- Leave-out-one Cross-Validation (LOOCV)
- Monte Carlo Cross-Validation (MCCV)
- .632 Bootstrap

In this handout I will demonstrate these different methods of cross-validation using the Bodyfat example. In the textbook (Section 6.5.3, pp. 248-251), the authors demonstrate how to use k-fold cross-validation to determine the optimal number of predictors in the OLS model using the Hitters data found in the ISLR package. The approach I will take with the body fat data is a little different: I will assume that we have a model chosen and we wish to estimate the predictive performance of this model using the MSEP estimated via cross-validation.

Test/Train or Split Sample Cross-Validation for an OLS Regression Model
The code below will construct a split-sample cross-validation for an OLS model. We first choose a fraction p of the available data to form our training data (e.g. p = .67). The remaining data are then used to form our test cases. You fit the model to the training data and then predict the response value for the test cases.
> dim(Bodyfat)[1] 252 14> n = nrow(Bodyfat)> n[1] 252> p = .67> m = floor(n*(1-p))> m[1] 83> sam = sample(1:n,m,replace=F)> Bodyfat.test = Bodyfat[sam,]> Bodyfat.train = Bodyfat[-sam,]> Bodyfat.ols = lm(bodyfat~.,data=Bodyfat.train)> bodyfat.pred = predict(Bodyfat.ols,newdata=Bodyfat.test)> pred.err = Bodyfat.test$bodyfat – bodyfat.pred> pred.err> pred.err 109 72 78 84 194 223 199 129 247 6.16872468 -3.89057224 3.02314968 5.55960734 -2.26666371 -5.76919180 1.29122863 2.29403342 0.84380506 211 71 143 212 116 55 142 3 183 -5.80966602 4.82964948 4.87661670 2.78002236 0.16611536 -4.26941892 -3.01050196 7.11824662 -4.41990034 197 206 1 91 83 134 222 110 58 5.61137864 2.53708347 -3.66664960 -1.72264289 -4.54710906 5.50948186 -3.43478744 0.16872091 1.35967343 102 97 15 45 240 51 151 173 207 0.04633038 -7.65387694 -1.67430723 -3.82375057 4.43252304 -5.70578266 1.00073463 3.66326700 10.55889964 47 27 204 53 192 170 119 196 50 2.68099394 -1.29145551 -8.82900506 -6.68075831 8.27135710 -3.41991050 7.41856162 2.75480323 -2.12010334 101 252 88 107 123 163 153 67 232 2.71919034 5.19305658 2.17247349 -7.31113211 2.05349309 -2.67587344 5.74096473 5.40488472 -5.18731001 234 118 5 177 2 184 133 103 157 2.20765427 -0.20192157 1.13773495 -2.32761863 -3.24592972 -4.80316777 -1.70083102 2.46434656 2.33830994 10 241 185 17 61 166 235 56 100 2.33085273 2.17346001 -0.47892211 5.91822645 1.14124567 1.53410161 4.48022867 -0.81379728 3.74242033 42 41 148 40 251 210 233 169 43 3.56701914 -2.47471287 7.27411561 0.85255665 1.77435585 -4.14641758 -0.90216282 -3.40791216 -2.22336196 49 195 -4.26635609 5.97086308 > mean(pred.err^2)[1] 17.8044If did this yourself you would most likely obtain a different PSE, because your random sample of the indices would produce different test and training samples. We can guarantee our results will match by using the command set.seed()to obtain the same random samples.> set.seed(1) if we all used this value before any command that utilizes randomization we get the same results. > set.seed(1)> sam = sample(1:n,m,replace=F)> Bodyfat.test = Bodyfat[sam,]> Bodyfat.train = Bodyfat[-sam,]> Bodyfat.ols = lm(bodyfat~.,data=Bodyfat.train)> bodyfat.pred = predict(Bodyfat.ols,newdata=Bodyfat.test)> pred.err = Bodyfat.test$bodyfat - bodyfat.pred> mean(pred.err^2) [1] 14.20143> set.seed(1000)> sam = sample(1:n,m,replace=F)> Bodyfat.test = Bodyfat[sam,]> Bodyfat.train = Bodyfat[-sam,]> Bodyfat.ols = lm(bodyfat~.,data=Bodyfat.train)> bodyfat.pred = predict(Bodyfat.ols,newdata=Bodyfat.test)> pred.err = Bodyfat.test$bodyfat - bodyfat.pred> mean(pred.err^2)[1] 20.6782> set.seed(1111)> sam = sample(1:n,m,replace=F)> Bodyfat.test = Bodyfat[sam,]> Bodyfat.train = Bodyfat[-sam,]> Bodyfat.ols = lm(bodyfat~.,data=Bodyfat.train)> bodyfat.pred = predict(Bodyfat.ols,newdata=Bodyfat.test)> pred.err = Bodyfat.test$bodyfat - bodyfat.pred> mean(pred.err^2)[1] 22.39968 Notice the variation in the MPSE estimates!Here is a slight variation on the code that will produce the results. 
> set.seed(1)
> test = sample(n,m)
> Bodyfat.ols = lm(bodyfat~.,data=Bodyfat,subset=-test)
> mean((bodyfat-predict(Bodyfat.ols,Bodyfat))[test]^2)
[1] 14.20143
> set.seed(1111)
> test = sample(n,m)
> Bodyfat.ols = lm(bodyfat~.,data=Bodyfat,subset=-test)
> mean((bodyfat-predict(Bodyfat.ols,Bodyfat))[test]^2)
[1] 22.39968
> set.seed(1000)
> test = sample(n,m)
> Bodyfat.ols = lm(bodyfat~.,data=Bodyfat,subset=-test)
> mean((bodyfat-predict(Bodyfat.ols,Bodyfat))[test]^2)
[1] 20.6782

k-Fold Cross-Validation
To perform a k-fold cross-validation we first need to divide our available data into k roughly equal-sized sets of observations. We then fit our model using (k - 1) of the sets and predict the set of observations not used. This is done k times, with each set being left out in turn. Typical values used in practice are k = 5 and k = 10. The function below will take an OLS model, cross-validate it using k-fold cross-validation, and return the MSEP.

> kfold.cv = function(fit,k=10) {
    sum.sqerr <- rep(0,k)
    y = fit$model[,1]
    x = fit$model[,-1]
    data = fit$model
    n = nrow(data)
    folds = sample(1:k,nrow(data),replace=T)
    for (i in 1:k) {
      fit2 <- lm(formula(fit),data=data[folds!=i,])
      ypred <- predict(fit2,newdata=data[folds==i,])
      sum.sqerr[i] <- sum((y[folds==i]-ypred)^2)
    }
    cv = sum(sum.sqerr)/n
    cv
  }
> kfold.cv(Bodyfat.ols,k=10)
[1] 21.02072

Leave-Out-One Cross-Validation (LOOCV)
We use the fact that the prediction error for $y_i$ when the ith case is deleted from the fit is
$y_i - \hat{y}_{(i)} = \dfrac{e_i}{1-h_i}$
where $e_i$ is the ordinary residual and $h_i$ is the leverage of the ith case. This is also called the ith jackknife residual, and the sum of these squared residuals is called the PRESS statistic, one of the first measures of prediction error. In R you can obtain the prediction errors as follows:

> pred.err = resid(fit)/(1-lm.influence(fit)$hat)    where fit is the OLS model we want to estimate the prediction error for.

> pred.err = resid(Bodyfat.ols)/(1-lm.influence(Bodyfat.ols)$hat)
> mean(pred.err^2)
[1] 20.29476

Monte Carlo Cross-validation (MCCV) for an OLS Regression Model
This function performs Monte Carlo Cross-validation for an arbitrary OLS model. The main argument is the fitted model from the lm() function.
Optional arguments are the fraction of observations to use in the training set (default is p = .667 or approximately two-thirds of the original data) and the number of replications (default is B = 100, which is rather small actually).> ols.mccv = function(fit,p=.667,B=100) { cv <- rep(0,B) y = fit$model[,1] x = fit$model[,-1] data = fit$model n = nrow(data) for (i in 1:B) { ss <- floor(n*p) sam <- sample(1:n,ss,replace=F) fit2 <- lm(formula(fit),data=data[sam,]) ypred <- predict(fit2,newdata=x[-sam,]) cv[i] <- mean((y[-sam]-ypred)^2) } cv}Here is a different version using a cleaner approach for dealing with the train/test data.> ols.mccv2 = function(fit,p=.667,B=100) { cv <- rep(0,B) y = fit$model[,1] x = fit$model[,-1] data = fit$model n = nrow(data) for (i in 1:B) { ss <- floor(n*p) sam <- sample(n,ss,replace=F) fit2 <- lm(formula(fit),subset=sam) ypred <- predict(fit2,data) cv[i] <- mean((y - ypred)[-sam]^2) } cv}MCCV Example: Bodyfat OLS – using dataframe with standardized X’s> Bodyfat.x = scale(Bodyfat[,-1])> Bodyfat.scale = data.frame(bodyfat=Bodyfat$bodyfat,Bodyfat.x)> names(Bodyfat.scale)[1] "bodyfat" "age" "weight" "height" "neck" "chest" "abdomen" "hip" "thigh" "knee" "ankle" [12] "biceps" "forearm" "wrist" > bodyfat.ols = lm(bodyfat~.,data=Bodyfat.scale) note this is the full model> summary(bodyfat.ols)Call:lm(formula = bodyfat ~ ., data = Bodyfat.scale)Residuals: Min 1Q Median 3Q Max -11.1966 -2.8824 -0.1111 3.1901 9.9979 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 19.15079 0.27147 70.544 < 2e-16 ***age 0.81376 0.40570 2.006 0.04601 * weight -2.83261 1.81766 -1.558 0.12047 height -0.11466 0.46633 -0.246 0.80599 neck -1.15582 0.57264 -2.018 0.04467 * chest -0.14488 0.87021 -0.166 0.86792 abdomen 10.29781 0.97225 10.592 < 2e-16 ***hip -1.35104 1.03729 -1.302 0.19401 thigh 1.30382 0.76738 1.699 0.09061 . knee 0.03364 0.59752 0.056 0.95516 ankle 0.30150 0.37731 0.799 0.42505 biceps 0.55078 0.52117 1.057 0.29166 forearm 0.92091 0.40272 2.287 0.02309 * wrist -1.54462 0.49775 -3.103 0.00215 ** ---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 4.309 on 238 degrees of freedomMultiple R-squared: 0.7486, Adjusted R-squared: 0.7348 F-statistic: 54.5 on 13 and 238 DF, p-value: < 2.2e-16 > results = ols.mccv(bodyfat.ols)> mean(results) Avg. 
PSE or MPSE[1] 21.12341> results [1] 23.05984 18.67317 16.19726 20.80143 19.03076 23.04956 19.51525 25.96077 23.53890 23.21706 25.40393 20.72045 [13] 20.90526 25.05236 21.92504 19.57326 22.51138 18.41306 19.44615 20.58773 19.99501 24.28484 21.15600 26.98161 [25] 17.92298 22.44382 19.99677 21.53336 19.05944 23.18343 23.86428 14.57523 22.69167 18.88682 19.61813 21.38962 [37] 23.44665 23.61154 23.99787 19.85491 18.90459 23.63645 20.71451 20.67572 17.69072 19.93048 21.76474 17.93562 [49] 24.07194 17.43179 27.06074 20.57398 21.50898 17.12468 22.13966 22.57043 19.62440 20.08109 23.17060 17.34390 [61] 19.71538 18.75761 24.37032 17.28482 21.68807 20.82153 22.52111 24.41738 20.89708 25.00203 15.67878 22.44576 [73] 20.98620 20.75209 20.18027 24.02314 14.90204 23.22266 23.30630 21.74893 20.02030 19.84644 22.31727 19.65249 [85] 22.67341 17.84598 19.57925 21.03370 22.04792 21.77679 22.90814 23.08865 22.26177 19.02380 21.20719 18.50420 [97] 20.77255 20.66017 19.95717 24.41063> sum(resid(bodyfat.ols)^2)/252 RSS/n < MPSE as it should be![1] 17.53994> results = ols.mccv(bodyfat.ols,B=1000)> mean(results)[1] 20.95795> bodyfat.step = step(bodyfat.ols) find the “best” OLS model using mixed selection.> results = ols.mccv(bodyfat.step,B=500)> mean(results)[1] 19.4979 Q: The MPSE is smaller for the simpler model, but is this the best we can do?Bootstrap Estimate of the Mean Squared Error for PredictionThe bootstrap in statistics is a method for approximating the sampling distribution of a statistic by resampling from our observed random sample. To put it simply, a bootstrap sample is a sample of size n drawn with replacement from our original sample. A bootstrap sample for regression (or classification) problems is illustrated below.Data: x1,y1, x2, y2,…, (xn, yn) here the xi's are the p-dimensional predictor vectors.Bootstrap Sample:x1*,y1*,x2*,y2*,…,(xn*,yn*) where (xi*,yi*) is a random selected observation from the original data drawn with replacement. We can use the bootstrap sample to calculate any statistic of interest. This process is then repeated a large number of times (B = 500, 1000, 5000, etc.).For estimating prediction error we fit a model to our bootstrap sample and use it to predict the observations not selected in our bootstrap sample. One can show that about 63.2% of the original observations will represented in the bootstrap sample and about 36.8% of the original observations will not be selected. Thus we will almost certainly have some observations that are not represented in our bootstrap sample to serve as a “test” set, with the selected observations in our bootstrap sample serving as our “training” set. For each bootstrap sample we can predict the response for the cases Estimating the prediction error via the .632 BootstrapAgain our goal is to estimate the mean prediction squared error (MPSE or PSE for short) or mean squared error for prediction (MSEP).Another alternative to those presented above is to use the .632 bootstrap for estimating the PSE. The algorithm is given below:First calculate the average squared residual (ASR) from your model ASR = RSS/n. Take B bootstrap samples drawn with replacement, i.e. 
we draw a sample with replacement from the numbers 1 to n and use those observations as our “new data”.
Fit the model to each of the B bootstrap samples, computing ASR(j) = the average squared residual for predicting the observations not represented in the jth bootstrap sample, j = 1,…,B.
Compute ASR(0) = the average of the B bootstrap ASR(j) values.
Compute the optimism (OP) = .632*(ASR(0) - ASR).
The .632 bootstrap estimate of the mean PSE = ASR + OP.
The bootstrap approach has been shown to be better than k-fold cross-validation in many cases.

Here is a function implementing the .632 bootstrap estimate of the mean PSE, again using the body fat dataset (Bodyfat).

> bootols.cv = function(fit,B=100) {
    ASR = mean(fit$residuals^2)
    boot.err <- rep(0,B)
    y = fit$model[,1]
    x = fit$model[,-1]
    data = fit$model
    n = nrow(data)
    for (i in 1:B) {
      sam = sample(1:n,n,replace=T)
      samind = sort(unique(sam))
      temp = lm(formula(fit),data=data[sam,])
      ypred = predict(temp,newdata=data[-samind,])
      boot.err[i] = mean((y[-samind]-ypred)^2)
    }
    ASR0 = mean(boot.err)
    OP = .632*(ASR0 - ASR)
    PSE = ASR + OP
    PSE
  }

Again we perform cross-validation on the full OLS model for the body fat data.
> Bodyfat.ols = lm(bodyfat~.,data=Bodyfat)
> set.seed(1111)
> bootols.cv(Bodyfat.ols,B=100)
[1] 20.16974
> bootols.cv(Bodyfat.ols,B=100)
[1] 19.87913
> bootols.cv(Bodyfat.ols,B=100)
[1] 19.80591
> bootols.cv(Bodyfat.ols,B=1000)    increasing the number of bootstrap samples (B = 1000)
[1] 19.89335

More on Prediction Error and the Variance-Bias Tradeoff
For any regression problem we assume that the response has the following model:
$Y = f(x) + \varepsilon$  where $x = (x_1, x_2, \ldots, x_p)$ is the collection of p predictors and $Var(\varepsilon) = \sigma^2_{\varepsilon}$.
Our goal in modeling is to approximate or estimate f(x) using a random sample of size n: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where the $x_i$'s are the p-dimensional predictor vectors.

$PSE(Y) = E\left[(Y-\hat{f}(x))^2\right] = \left[E(\hat{f}(x)) - f(x)\right]^2 + E\left[(\hat{f}(x) - E\hat{f}(x))^2\right] + \sigma^2_{\varepsilon} = \text{Bias}^2 + Var(\hat{f}(x)) + \text{Irreducible Error}$

The cross-validation methods discussed above are all acceptable ways to estimate PSE(Y), but some are certainly better than others. This is still an active area of research and there is no definitive best method for every situation. Some methods are better at estimating the variance component of the PSE while others are better at estimating the bias. Ideally we would like to use a method of cross-validation that does a reasonable job of estimating each component. In the sections to follow we will introduce alternatives to OLS, or variations of OLS, for developing models for f(x). Some of these modeling strategies have the potential to be very flexible (i.e. have small bias) but at the expense of being highly variable, i.e. have large variation, $Var(\hat{f}(x))$. Balancing these two components of prediction error is critical, and cross-validation is one of the main tools we will use to create this balance in our model development.

2 - Shrinkage Methods (“Automatic” Variable Selection Methods)
In our review of OLS we covered the classic stepwise model selection methods: forward, backward, and mixed. All three of these methods either include or exclude terms starting from an appropriate base model. Other model selection methods have been developed that are viable alternatives to these in/out strategies. These include ridge regression (an old method with new-found life), the LASSO (newer), LARS (newest), PCR, and PLS. We will discuss the idea behind each of these modeling methods in the sections below. Aside from model selection, these methods have also been used extensively in high-dimensional regression problems.
A high-dimensional problem is one in which n < p or n ≪ p. The text authors present two examples where this might be the case, but there are certainly many others.
- Predicting blood pressure: rather than use standard predictors such as age, gender, and BMI, one might also collect measurements for half a million single nucleotide polymorphisms (SNP’s) for inclusion in the model. Thus we might have n ≈ 300 and p ≈ 500,000!
- Predicting purchasing behavior of online shoppers: using a table of 50,000 key words (coded 0/1) that potential customers might use in the process of searching for products, we might try to predict their purchasing behavior. We might gather information from 5,000 randomly selected visitors to the website, in which case n ≈ 5,000 and p ≈ 50,000!
Ridge and Lasso regression models will allow us to fit models in these situations where n < p, where OLS mathematically cannot!

2.1 - Ridge Regression or Regularized Regression
Ridge regression chooses parameter estimates, $\hat{\beta}^{ridge}$, to minimize the residual sum of squares subject to a penalty on the size of the coefficients. After standardizing all potential terms in the model, the ridge coefficients minimize

$\hat{\beta}^{ridge} = \min_{\beta}\ \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{k} u_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^{k}\beta_j^2$

Here λ > 0 is a complexity parameter that controls the amount of shrinkage: the larger λ, the greater the amount of shrinkage. The intercept is not included in the shrinkage and will be estimated as the mean of the response. An equivalent way to write the ridge regression criterion is

$\hat{\beta}^{ridge} = \min_{\beta}\ \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{k} u_{ij}\beta_j\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{k}\beta_j^2 \le s$

This clearly shows how the size of the parameter estimates is constrained. This formulation of the problem also leads to a nice geometric interpretation of how the penalized least squares estimation works (see the figure below).

Important Question: Why is it essential to standardize the terms in our model?

Visualization of Ridge Regression
[Figure: contours of the OLS criterion centered at the usual OLS estimate $(\hat{\beta}_1, \hat{\beta}_2)$, the circular constraint region $\sum_{j=1}^{2}\beta_j^2 = \beta_1^2 + \beta_2^2 \le s$, and the ridge regression estimate $(\hat{\beta}_1^{ridge}, \hat{\beta}_2^{ridge})$ where the contours first touch the constraint region.]

In matrix notation the ridge regression criterion is given by
$RSS(\lambda) = (y - U\beta)^T(y - U\beta) + \lambda\,\beta^T\beta$
with the resulting parameter estimates being very similar to those for OLS
$\hat{\beta}^{ridge} = (U^TU + \lambda I)^{-1}U^Ty$    where I is the k x k identity matrix.

There are several packages in R that contain functions that perform ridge regression. One we will use is lm.ridge in the package MASS. The MASS package actually contains a variety of very useful functions. MASS stands for Modern Applied Statistics with S-PLUS (expensive R) by Venables & Ripley; it is an excellent reference if you are so inclined. The function call using lm.ridge is very similar to the lm() function. The other function we will use is the function ridge in the genridge package. The genridge package contains a number of plotting functions to help visualize the coefficient shrinkage that takes place when using ridge regression. Using the bodyfat dataset we will conduct a ridge regression analysis. In order to fairly compare the parameter estimates obtained via ridge regression to those from OLS, we will first run the OLS regression using the standardized predictors.
> bodyfat.scaled = lm(bodyfat~.,data=Bodyfat.scale)
> summary(bodyfat.scaled)
Call: lm(formula = bodyfat ~ ., data = Bodyfat.scale)
Residuals: Min 1Q Median 3Q Max -11.1966 -2.8824 -0.1111 3.1901 9.9979 Coefficients: Estimate Std.
Error t value Pr(>|t|) (Intercept) 19.15079 0.27147 70.544 < 2e-16 ***age 0.81376 0.40570 2.006 0.04601 * weight -2.83261 1.81766 -1.558 0.12047 height -0.11466 0.46633 -0.246 0.80599 neck -1.15582 0.57264 -2.018 0.04467 * chest -0.14488 0.87021 -0.166 0.86792 abdomen 10.29781 0.97225 10.592 < 2e-16 ***hip -1.35104 1.03729 -1.302 0.19401 thigh 1.30382 0.76738 1.699 0.09061 . knee 0.03364 0.59752 0.056 0.95516 ankle 0.30150 0.37731 0.799 0.42505 biceps 0.55078 0.52117 1.057 0.29166 forearm 0.92091 0.40272 2.287 0.02309 * wrist -1.54462 0.49775 -3.103 0.00215 ** ---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 4.309 on 238 degrees of freedomMultiple R-squared: 0.7486, Adjusted R-squared: 0.7348 F-statistic: 54.5 on 13 and 238 DF, p-value: < 2.2e-16 > mean(bodyfat)[1] 19.15079To run ridge regression we first need to choose an optimal value for the penalty parameter ?. The size of reasonable ? values varies vastly from one ridge model to the next so using some form of automated selection method like cross-validation to help find one is a good idea. Another approach is use the effective degrees of freedom of the model which is given by the trace (sum of the diagonal elements) of the matrixdfλ=tr[UUTU+λI-1UT]which as we can see is a function of ?. Note that when ? = 0, i.e. OLS, this matrix is the hat matrix whose trace is always k. To fit ridge models and choose an appropriate ? we will use the function lm.ridge from the MASS package.> args(lm.ridge)function (formula, data, subset, na.action, lambda = 0, model = FALSE, x = FALSE, y = FALSE, contrasts = NULL, ...)The args command is an easy way to see what arguments a function takes to run. Of course some functions are quite complex so using the command >?lm.ridge will bring up the help file with additional details on the arguments and generally simple examples of the functions use. We first we will use a wide range of ? values and let the built-in optimal selection algorithms choose good candidates.> bodyfat.ridge = lm.ridge(bodyfat~.,data=Bodyfat.scale,lambda=seq(0,1000,.1))> select(bodyfat.ridge)modified HKB estimator is 1.664046 modified L-W estimator is 3.91223 smallest value of GCV at 1.1 cross-validation choice for ?Using the ridge function from the genridge package along with some different plotting features we can see the shrinkage in the parameter estimates.> bodyfat.ridge2 = ridge(bodyfat,bodyfat.Xs, lambda=seq(0,1000,.1))> traceplot(bodyfat.ridge2)> traceplot(bodyfat.ridge2,X=”df”)We can narrow the range on the ? choices to take a closer look at the optimal shrinkage parameter values.> bodyfat.ridge3 = ridge(bodyfat,bodyfat.Xs, lambda=seq(0,4,.001))> traceplot(bodyfat.ridge3)> traceplot(bodyfat.ridge3,X=”df”)> bodyfat.xs = Bodyfat.scale[,-1]> bodyfat.y = Bodyfat.scale[,1]> bodyfat.ridge = ridge(bodyfat.y,bodyfat.xs,lambda=seq(0,10,2))> pairs(bodyfat.ridge)This plot shows the shrinkage in the estimated coefficients occurring as lambda increases from 0 to 10 by increments of 2. Most of the shrinkage occurs in the first 3 terms: age, weight, and height. 
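Since lm.ridge() returns the GCV criterion for every λ in the supplied sequence (the GCV component listed in the attributes below), we can also locate the GCV-minimizing λ directly. This is only a sketch with a grid of my own choosing; it should land near the GCV choice of 1.1 reported by select() above.

# Sketch: pull the GCV-minimizing lambda off an lm.ridge() fit
gcv.fit = lm.ridge(bodyfat ~ ., data = Bodyfat.scale, lambda = seq(0, 10, 0.01))
lam.gcv = gcv.fit$lambda[which.min(gcv.fit$GCV)]
lam.gcv
plot(gcv.fit$lambda, gcv.fit$GCV, type = "l", xlab = "lambda", ylab = "GCV")
abline(v = lam.gcv, lty = 2)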
> plot3d(bodyfat.ridge,variables=1:3)A 3-D look at the shrinkage of the coefficients of age, weight, and height.Fit ridge regression model using the HKB optimal value for ????????> bodyfat.ridge4 = lm.ridge(bodyfat~.,data=Bodyfat.scale,lambda=1.66)> attributes(bodyfat.ridge4)$names[1] "coef" "scales" "Inter" "lambda" "ym" "xm" "GCV" [8] "kHKB" "kLW" $class[1] "ridgelm"Compare the OLS coefficients to the ridge coefficients side-by-side.> cbind(coef(bodyfat.scaled),coef(bodyfat.ridge4)) [,1] [,2](Intercept) 19.15079365 19.150793651age 0.81375776 0.941990017weight -2.83261161 -1.944588412height -0.11466232 -0.313666216neck -1.15582043 -1.182543415chest -0.14487500 -0.009673795abdomen 10.29780784 9.416114940hip -1.35104126 -1.197685531thigh 1.30382219 1.227323244knee 0.03363573 0.027303926ankle 0.30149592 0.235719800biceps 0.55078084 0.461889816forearm 0.92090523 0.891302127wrist -1.54461619 -1.592169696The decreases in the parameter estimates for most notably abdomen and weight allow for nominal increases in some of the parameter estimates for the other predictors. Unfortunately the ridge regression routines in these packages do not allow for easy extraction of the fitted values and residuals from the fit. It is not hard to write to a simple function that will return the fitted values from a lm.ridge fit.ridgefitted = function(fit,xmat) {p = length(coef(fit))fitted = coef(fit)[1] + xmat%*%coef(fit)[2:p]fitted}> ridge4fit = ridgefitted(bodyfat.ridge4,bodyfat.Xs)> plot(bodyfat,ridge4fit,xlab="Bodyfat",ylab="Fitted Values from Ridge Regression") > ridge4resid = bodyfat - ridge4fit> plot(ridge4fit,ridge4resid,xlab="Fitted Values",ylab="Ridge Residuals") Ridge Regression using glmnet() (Friedman, Hastie, Tibshirani 2013)The glmnet package contains the function glmnet()which can be used to fit both the ridge regression and the Lasso model discussed in the next section. This function has a natural predict() function so obtaining fitted values and making predictions is easier than in the functions used above.We again return to the body fat example. The author’s also present another example of ridge regression in Lab 2 of Chapter 6 beginning on pg. 251 using data on baseball hitters and their salaries. The function glmnet()does not use standard formula conventions for developing models. Instead we form the a model matrix (X) that contains the predictors/terms as columns and the response vector y, and use them as arguments to the function. The columns of X must be numeric, so any categorical variables will need to converted to dummy variables 1st. This is easily achieved by using the model.matrix()function.For this example we will use a driver seat position data set found in the faraway package from CRAN. Response is a numeric measurement of their hip position when sitting in the driver seat.> library(faraway) you need to install it first!> names(seatpos)[1] "Age" "Weight" "HtShoes" "Ht" "Seated" "Arm" "Thigh" "Leg" [9] "hipcenter"> summary(seatpos) Age Weight HtShoes Ht Seated Arm Min. :19.00 Min. :100.0 Min. :152.8 Min. :150.2 Min. : 79.40 Min. :26.00 1st Qu.:22.25 1st Qu.:131.8 1st Qu.:165.7 1st Qu.:163.6 1st Qu.: 85.20 1st Qu.:29.50 Median :30.00 Median :153.5 Median :171.9 Median :169.5 Median : 89.40 Median :32.00 Mean :35.26 Mean :155.6 Mean :171.4 Mean :169.1 Mean : 88.95 Mean :32.22 3rd Qu.:46.75 3rd Qu.:174.0 3rd Qu.:177.6 3rd Qu.:175.7 3rd Qu.: 91.62 3rd Qu.:34.48 Max. :72.00 Max. :293.0 Max. :201.2 Max. :198.4 Max. :101.60 Max. :39.60 Thigh Leg hipcenter Min. :31.00 Min. :30.20 Min. 
:-279.15 1st Qu.:35.73 1st Qu.:33.80 1st Qu.:-203.09 Median :38.55 Median :36.30 Median :-174.84 Mean :38.66 Mean :36.26 Mean :-164.88 3rd Qu.:41.30 3rd Qu.:38.33 3rd Qu.:-119.92 Max. :45.50 Max. :43.10 Max. : -30.95 > pairs.plus(seatpos)> hip.ols = lm(hipcenter~.,data=seatpos)> summary(hip.ols)Call:lm(formula = hipcenter ~ ., data = seatpos)Residuals: Min 1Q Median 3Q Max -73.827 -22.833 -3.678 25.017 62.337 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 436.43213 166.57162 2.620 0.0138 *Age 0.77572 0.57033 1.360 0.1843 Weight 0.02631 0.33097 0.080 0.9372 HtShoes -2.69241 9.75304 -0.276 0.7845 Ht 0.60134 10.12987 0.059 0.9531 Seated 0.53375 3.76189 0.142 0.8882 Arm -1.32807 3.90020 -0.341 0.7359 Thigh -1.14312 2.66002 -0.430 0.6706 Leg -6.43905 4.71386 -1.366 0.1824 ---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 37.72 on 29 degrees of freedomMultiple R-squared: 0.6866,Adjusted R-squared: 0.6001 F-statistic: 7.94 on 8 and 29 DF, p-value: 1.306e-05 > attach(seatpos)> VIF(hip.ols) Variance Inflation Factor Table Variable VIF RsquaredAge Age 1.997931 0.4994823Weight Weight 3.647030 0.7258043HtShoes HtShoes 307.429378 0.9967472Ht Ht 333.137832 0.9969982Seated Seated 8.951054 0.8882813Arm Arm 4.496368 0.7775983Thigh Thigh 2.762886 0.6380596Leg Leg 6.694291 0.8506190As stated above it is imperative when performing ridge regression (or any other regularized regression method) that we scale the terms to have mean 0 and variance 1. We will form a new data frame in R containing the seat position data with the predictors scaled.> X = model.matrix(hipcenter~.,data=seatpos)[,-1]> X = scale(X)> summary(X) Age Weight HtShoes Ht Seated Min. :-1.0582 Min. :-1.55477 Min. :-1.66748 Min. :-1.69012 Min. :-1.93695 1st Qu.:-0.8467 1st Qu.:-0.66743 1st Qu.:-0.50810 1st Qu.:-0.49307 1st Qu.:-0.76091 Median :-0.3425 Median :-0.05957 Median : 0.05028 Median : 0.03721 Median : 0.09071 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 3rd Qu.: 0.7474 3rd Qu.: 0.51335 3rd Qu.: 0.55484 3rd Qu.: 0.59434 3rd Qu.: 0.54187 Max. : 2.3904 Max. : 3.83913 Max. : 2.67401 Max. : 2.62373 Max. : 2.56446 Arm Thigh Leg Min. :-1.8436 Min. :-1.97556 Min. :-1.78135 1st Qu.:-0.8055 1st Qu.:-0.75620 1st Qu.:-0.72367 Median :-0.0640 Median :-0.02716 Median : 0.01082 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 3rd Qu.: 0.6701 3rd Qu.: 0.68252 3rd Qu.: 0.60577 Max. : 2.1902 Max. : 1.76639 Max. 
: 2.00866 > var(X) Age Weight HtShoes Ht Seated Arm Thigh LegAge 1.00000000 0.08068523 -0.07929694 -0.09012812 -0.1702040 0.3595111 0.09128584 -0.04233121Weight 0.08068523 1.00000000 0.82817733 0.82852568 0.7756271 0.6975524 0.57261442 0.78425706HtShoes -0.07929694 0.82817733 1.00000000 0.99814750 0.9296751 0.7519530 0.72486225 0.90843341Ht -0.09012812 0.82852568 0.99814750 1.00000000 0.9282281 0.7521416 0.73496041 0.90975238Seated -0.17020403 0.77562705 0.92967507 0.92822805 1.0000000 0.6251964 0.60709067 0.81191429Arm 0.35951115 0.69755240 0.75195305 0.75214156 0.6251964 1.0000000 0.67109849 0.75381405Thigh 0.09128584 0.57261442 0.72486225 0.73496041 0.6070907 0.6710985 1.00000000 0.64954120Leg -0.04233121 0.78425706 0.90843341 0.90975238 0.8119143 0.7538140 0.64954120 1.00000000> seatpos.scale = data.frame(hip=seatpos$hipcenter,X)> names(seatpos.scale)[1] "hip" "Age" "Weight" "HtShoes" "Ht" "Seated" "Arm" "Thigh" "Leg" > hip.ols = lm(hip~.,data=seatpos.scale)> summary(hip.ols)Call:lm(formula = hip ~ ., data = seatpos.scale)Residuals: Min 1Q Median 3Q Max -73.827 -22.833 -3.678 25.017 62.337 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -164.8849 6.1190 -26.946 <2e-16 ***Age 11.9218 8.7653 1.360 0.184 Weight 0.9415 11.8425 0.080 0.937 HtShoes -30.0157 108.7294 -0.276 0.784 Ht 6.7190 113.1843 0.059 0.953 Seated 2.6324 18.5529 0.142 0.888 Arm -4.4775 13.1494 -0.341 0.736 Thigh -4.4296 10.3076 -0.430 0.671 Leg -21.9165 16.0445 -1.366 0.182 ---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 37.72 on 29 degrees of freedomMultiple R-squared: 0.6866,Adjusted R-squared: 0.6001 F-statistic: 7.94 on 8 and 29 DF, p-value: 1.306e-05 > attach(seatpos.scale)> VIF(hip.ols) Variance Inflation Factor Table Variable VIF RsquaredAge Age 1.997931 0.4994823Weight Weight 3.647030 0.7258043HtShoes HtShoes 307.429378 0.9967472Ht Ht 333.137832 0.9969982Seated Seated 8.951054 0.8882813Arm Arm 4.496368 0.7775983Thigh Thigh 2.762886 0.6380596Leg Leg 6.694291 0.8506190> detach(seatpos.scale)Rescaling the X’s does not change the model performance in any way. The p-values, R2, RSS, VIF’s, etc. are all the same. The only changes are the estimated regression coefficients. We now consider fitting a ridge regression model to these data.> X = model.matrix(hip~.,data=seatpos.scale)[,-1]> y = seatpos.scale$hip> library(glmnet)> grid = 10^seq(10,-2,length=100) <- set up a wide range of ? 
values> grid [1] 1.000000e+10 7.564633e+09 5.722368e+09 4.328761e+09 3.274549e+09 2.477076e+09 1.873817e+09 [8] 1.417474e+09 1.072267e+09 8.111308e+08 6.135907e+08 4.641589e+08 3.511192e+08 2.656088e+08 [15] 2.009233e+08 1.519911e+08 1.149757e+08 8.697490e+07 6.579332e+07 4.977024e+07 3.764936e+07 [22] 2.848036e+07 2.154435e+07 1.629751e+07 1.232847e+07 9.326033e+06 7.054802e+06 5.336699e+06 [29] 4.037017e+06 3.053856e+06 2.310130e+06 1.747528e+06 1.321941e+06 1.000000e+06 7.564633e+05 [36] 5.722368e+05 4.328761e+05 3.274549e+05 2.477076e+05 1.873817e+05 1.417474e+05 1.072267e+05 [43] 8.111308e+04 6.135907e+04 4.641589e+04 3.511192e+04 2.656088e+04 2.009233e+04 1.519911e+04 [50] 1.149757e+04 8.697490e+03 6.579332e+03 4.977024e+03 3.764936e+03 2.848036e+03 2.154435e+03 [57] 1.629751e+03 1.232847e+03 9.326033e+02 7.054802e+02 5.336699e+02 4.037017e+02 3.053856e+02 [64] 2.310130e+02 1.747528e+02 1.321941e+02 1.000000e+02 7.564633e+01 5.722368e+01 4.328761e+01 [71] 3.274549e+01 2.477076e+01 1.873817e+01 1.417474e+01 1.072267e+01 8.111308e+00 6.135907e+00 [78] 4.641589e+00 3.511192e+00 2.656088e+00 2.009233e+00 1.519911e+00 1.149757e+00 8.697490e-01 [85] 6.579332e-01 4.977024e-01 3.764936e-01 2.848036e-01 2.154435e-01 1.629751e-01 1.232847e-01 [92] 9.326033e-02 7.054802e-02 5.336699e-02 4.037017e-02 3.053856e-02 2.310130e-02 1.747528e-02> ridge.mod = glmnet(X,y,alpha=0,lambda=grid) alpha = 0 for ridge alpha = 1 for Lasso > dim(coef(ridge.mod)) has 100 columns of parameter estimates, one for [1] 9 100 each lambda in our sequence.> coef(ridge.mod)[,1] ???????? (Intercept) Age Weight HtShoes Ht Seated Arm -1.648849e+02 7.202949e-08 -2.248007e-07 -2.796599e-07 -2.804783e-07 -2.567201e-07 -2.054084e-07 Thigh Leg -2.075522e-07 -2.763501e-07 When lambda is very large we see that the parameter estimates are near 0 and the intercept estimate is approximately equal to the mean of the response (y).> mean(y)[1] -164.8849> coef(ridge.mod)[,100] ?????????? (Intercept) Age Weight HtShoes Ht Seated Arm -164.8848684 11.7788069 0.9953047 -23.0463404 -0.2996308 2.4814234 -4.4305306 Thigh Leg -4.2792243 -21.9175012 When lambda is near 0, we see that the coefficients do not differ much from the OLS regression parameter estimates which are shown below.> coef(hip.ols) (Intercept) Age Weight HtShoes Ht Seated Arm -164.8848684 11.9218052 0.9415132 -30.0156578 6.7190129 2.6323517 -4.4775359 Thigh Leg -4.4295690 -21.9165049 We can see this shrinkage of the coefficients graphically by plotting the results.> plot(ridge.mod,xvar="lambda")What value of ? should we use to obtain the “best” ridge regression model? In the code below we form a train data set consisting of 75% of the original data set and use the remaining cases as test cases. We then look at the mean PSE for various choices of ? by setting the parameter s = ? 
in the glmnet()function call.> train = sample(n,floor(n*p))> train [1] 15 23 4 28 20 2 9 35 21 25 22 31 34 18 32 7 16 27 26 36 29 5 8 19 12 13 17 11> test = (-train)> ridge.mod = glmnet(X[train,],y[train],alpha=0,lambda=grid)> ridge.pred = predict(ridge.mod,s=1000,newx=X[test,]) ????????> PSE = mean((ridge.pred-y[test])^2)> PSE[1] 3226.455> ridge.pred = predict(ridge.mod,s=100,newx=X[test,])???????> PSE = mean((ridge.pred-y[test])^2)> PSE[1] 1556.286> ridge.pred = predict(ridge.mod,s=10,newx=X[test,]) ??????> PSE = mean((ridge.pred-y[test])^2)> PSE[1] 1334.643> ridge.pred = predict(ridge.mod,s=5,newx=X[test,]) ?????> PSE = mean((ridge.pred-y[test])^2)> PSE[1] 1336.778> ridge.pred = predict(ridge.mod,s=1,newx=X[test,])> PSE = mean((ridge.pred-y[test])^2)> PSE[1] 1353.926It appears a λ value between 5 and 10 appears optimal for this particular train/test set combination. What if we use different train/test sets?> set.seed(1)> train = sample(n,floor(n*p))> ridge.mod glmnet(X[train,],y[train],lambda=grid,alpha=0)> ridge.pred = predict(ridge.mod,s=1000,newx=X[test,])> PSE = mean((ridge.pred - y[test])^2)> PSE[1] 2983.198> ridge.pred = predict(ridge.mod,s=100,newx=X[test,])> PSE = mean((ridge.pred - y[test])^2)> PSE[1] 1321.951> ridge.pred = predict(ridge.mod,s=50,newx=X[test,])> PSE = mean((ridge.pred - y[test])^2)> PSE[1] 1304.09> ridge.pred = predict(ridge.mod,s=25,newx=X[test,])> PSE = mean((ridge.pred - y[test])^2)> PSE[1] 1375.791> ridge.pred = predict(ridge.mod,s=10,newx=X[test,])> PSE = mean((ridge.pred - y[test])^2)> PSE[1] 1527.045Now it appears that the “optimal” ? is somewhere between 25 and 50?We can use cross-validation to choose an “optimal” ? for prediction purposes. The function cv.glmnet()uses 10-fold cross-validation to find an optimal value for ?. > cv.out = cv.glmnet(X[train,],y[train],alpha=0)Warning message:Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold This dataset is too small to use 10-fold cross-validation on as the sample size n = 38!> plot(cv.out)> cv.out$lambda.min[1] 36.58695> bestlam = cv.out$lambda.min> ridge.best = glmnet(X[train,],y[train],alpha=0,lambda=bestlam)> ridge.pred = predict(ridge.best,newx=X[test,])> PSE = mean((ridge.pred-y[test])^2)> PSE[1] 1328.86> coef(ridge.best)9 x 1 sparse Matrix of class "dgCMatrix" s0(Intercept) -165.669959Age 9.954727Weight -1.937062HtShoes -10.000879Ht -10.249189Seated -4.228238Arm -4.404729Thigh -4.954140Leg -10.749589> coef(hip.ols) (Intercept) Age Weight HtShoes Ht Seated Arm -164.8848684 11.9218052 0.9415132 -30.0156578 6.7190129 2.6323517 -4.4775359 Thigh Leg -4.4295690 -21.9165049 2.2 - The LassoThe lasso is another shrinkage method like ridge, but uses an L1-norm based penalty.The parameter estimates are chosen according to the following βlasso=minβi=1n(yi-βo-j=1kuijβj)2 subject to j=1kβj≤tHere t > 0 is the complexity parameter that controls the amount of shrinkage, the smaller t the greater the amount of shrinkage. As with ridge regression, the intercept is not included in the shrinkage and will be estimated as the mean of the response. If t is chosen larger than to=j=1kβjls then there will be no shrinkage and the lasso estimates will be the same as the OLS estimates. If t=to/2 then the OLS estimates will be shrunk by about 50%, however this is not to say that βjlasso=βjls/2 . The shrinkage can result in some parameters being zeroed, essentially dropping the associated predictor from the model as the figure below shows. 
Here the lasso estimate for β1 is exactly zero: $\hat{\beta}_1^{lasso} = 0$.
[Figure: contours of the OLS criterion centered at the usual OLS estimate $(\hat{\beta}_1, \hat{\beta}_2)$, the diamond-shaped constraint region $\sum_{j=1}^{2}|\beta_j| = |\beta_1| + |\beta_2| \le s$, and the lasso estimate $(\hat{\beta}_1^{lasso}, \hat{\beta}_2^{lasso})$ at a corner of the constraint region.]

We return again to the body fat example and look at the use of the Lasso to build a model for the body fat.
> X = model.matrix(bodyfat~.,data=Bodyfat)[,-1]
> y = Bodyfat$bodyfat
> n = nrow(X)
> p = .667
> set.seed(1)
> train = sample(n,floor(n*p))
> test = (-train)
> grid = 10^seq(10,-2,length=100)
> lasso.mod = glmnet(X[train,],y[train],alpha=1,lambda=grid)
> plot(lasso.mod)
> plot(lasso.mod,xvar="lambda")
> set.seed(1)
> cv.out = cv.glmnet(X[train,],y[train],alpha=1)
> plot(cv.out)
> bestlam.lasso = cv.out$lambda.min
> bestlam.lasso
[1] 0.1533247

Use the test set to obtain an estimate of the PSE for the Lasso
===========================================================================================
> lasso.mod = glmnet(X[train,],y[train],alpha=1,lambda=bestlam.lasso)
> lasso.pred = predict(lasso.mod,newx=X[test,])
> PSE = mean((lasso.pred-y[test])^2)
> PSE
[1] 23.72193

Use the same 10-fold cross-validation to estimate the optimal λ for ridge regression. Then estimate the PSE using the same test data as for the Lasso. Compare the mean PSE values.
============================================================================================
> set.seed(1)
> cv.out = cv.glmnet(X[train,],y[train],alpha=0)
> bestlam.ridge = cv.out$lambda.min
> bestlam.ridge
[1] 0.6335397
> ridge.mod = glmnet(X[train,],y[train],alpha=0,lambda=bestlam.ridge)
> ridge.pred = predict(ridge.mod,newx=X[test,])
> PSE = mean((ridge.pred - y[test])^2)
> PSE
[1] 26.11665

Compare the coefficient estimates from the Lasso, ridge regression, and OLS. Also compare the PSE for the test data.
===============================================================================================
> coef(lasso.mod)
                     s0
(Intercept)  1.27888249
age          0.08947089
weight       .
height      -0.28803077
neck        -0.39922361
chest        .
abdomen      0.67740803
hip          .
thigh        .
knee         .
ankle        .
biceps       .
forearm      0.34448133
wrist       -1.27946216
> coef(ridge.mod)
                      s0
(Intercept) -4.956145707
age          0.130174182
weight      -0.005247158
height      -0.310172813
neck        -0.452885891
chest        0.159678718
abdomen      0.467277929
hip          0.003963329
thigh        0.189565205
knee         0.057918646
ankle        0.043846187
biceps       0.022254664
forearm      0.348491036
wrist       -1.470360498
> temp = data.frame(bodyfat = y[train],X[train,])
> head(temp)
    bodyfat age weight height neck chest abdomen   hip thigh knee ankle biceps forearm wrist
67     21.5  54 151.50  70.75 35.6  90.0    83.9  93.9  55.0 36.1  21.7   29.6    27.4  17.4
94     24.9  46 192.50  71.75 38.0 106.6    97.5 100.6  58.9 40.5  24.5   33.3    29.6  19.1
144     9.4  23 159.75  72.25 35.5  92.1    77.1  93.9  56.1 36.1  22.7   30.5    27.2  18.2
227    14.8  55 169.50  68.25 37.2 101.7    91.1  97.1  56.6 38.5  22.6   33.4    29.3  18.8
51     10.2  47 158.25  72.25 34.9  90.2    86.7  98.3  52.6 37.2  22.4   26.0    25.8  17.3
222    26.0  54 230.00  72.25 42.5 119.9   110.4 105.5  64.2 42.7  27.0   38.4    32.0  19.6
> ols.mod = lm(bodyfat~.,data=temp)
> coef(ols.mod)
 (Intercept)          age       weight       height         neck        chest      abdomen          hip        thigh
-40.65178764   0.09175572  -0.15569221   0.06515098  -0.41393595   0.10173785   0.92607342  -0.18562568   0.37387418
        knee        ankle       biceps      forearm        wrist
 -0.03849608   0.36585984   0.11606918   0.44247339  -1.54993981
> ols.step = step(ols.mod)
> coef(ols.step)
 (Intercept)          age       weight         neck      abdomen          hip        thigh      forearm        wrist
-24.91558645   0.09187304  -0.10466396  -0.46132959   0.93852739  -0.24119508   0.38608812   0.51997961  -1.33498663
> ols.pred = predict(ols.mod,newdata=Bodyfat[test,])
> PSE = mean((ols.pred-y[test])^2)
> PSE
[1] 23.39602
> ols.pred2 = predict(ols.step,newdata=Bodyfat[test,])
> PSE = mean((ols.pred2-y[test])^2)
> PSE
[1] 22.8308

For these data we see the three approaches differ in their results. The Lasso zeroes out some coefficients and thus completely eliminates some terms from the model. Ridge will shrink coefficients down to very near zero, effectively eliminating them, but technically zeroes none of them. Stepwise selection in OLS is strictly in or out, so some terms get dropped and some don't; however, there is no shrinkage of the estimated coefficients.
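One compact way to see the in/out versus shrinkage behavior is to put the four sets of estimates side by side. This is just a sketch using the objects fit above; the column labels and rounding are my own choices, and the lasso's "." entries simply show up as zeros.

# Side-by-side coefficient comparison: OLS, stepwise OLS, ridge, lasso
cmp = data.frame(
  OLS   = coef(ols.mod),
  Step  = coef(ols.step)[names(coef(ols.mod))],   # NA where stepwise dropped the term
  Ridge = as.matrix(coef(ridge.mod))[, 1],
  Lasso = as.matrix(coef(lasso.mod))[, 1],
  row.names = names(coef(ols.mod)))
round(cmp, 3)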
A good question to ask would be “how do these methods cross-validate for making future predictions?” We can use cross-validation methods to compare these competing models via estimates of the PSE.Monte Carlo Cross-Validation of OLS Regression Models> ols.mccv = function(fit,p=.667,B=100) { cv <- rep(0,B) y = fit$model[,1] x = fit$model[,-1] data = fit$model n = nrow(data) for (i in 1:B) { ss <- floor(n*p) sam <- sample(1:n,ss,replace=F) fit2 <- lm(formula(fit),data=data[sam,]) ypred <- predict(fit2,newdata=x[-sam,]) cv[i] <- mean((y[-sam]-ypred)^2) } cv}Monte Carlo Cross-Validation of Ridge and Lasso Regression> glmnet.mccv = function(X,y,alpha=0,lambda=1,p=.667,B=100) { cv <- rep(0,B) n = nrow(X) for (i in 1:B) { ss <- floor(n*p) sam <- sample(n,ss,replace=F) fit <- glmnet(X[sam,],y[sam],lambda=lambda) ypred <- predict(fit,newx=X[-sam,]) cv[i] <- mean((y[-sam]-ypred)^2) } cv}> set.seed(1)> rr.cv = glmnet.mccv(X,y,alpha=0,lambda=.634)> Statplot(rr.cv)> mean(rr.cv) > sd(rr.cv)[1] 21.65482[1] 2.847533> set.seed(1)> lass.cv = glmnet.mccv(X,y,alpha=1,lambda=.153)> mean(lass.cv)> sd(lass.cv)[1] 20.30297[1] 2.601356> ols.scale = lm(bodyfat~.,data=Bodyfat.scale)> ols.results = ols.mccv(ols.scale)> mean(ols.results)[1] 20.68592> sd(ols.results)[1] 2.737272> Statplot(ols.results)> ols.scalestep = step(ols.scale)> ols.results = ols.mccv(ols.scalestep)> mean(ols.results)[1] 19.72026> sd(ols.results)[1] 2.185153> Statplot(ols.results)1.3 - Least Angle Regression (LAR) – FYI only!The lars function in the library of the same name will perform least angle regression which is another shrinkage method for fitting regression models. lars(x, y, type = c("lasso", "lar", "forward.stagewise", "stepwise"))lar = Least Angle Regression (LAR) – see algorithm and diagram next pageforward.stagewise = Forward Stagewise selectionstepwise = forward stepwise selection (classic method)For Lasso regression use the glmnet function versus the lars implementation. LAR Algorithm As we can see the LAR and Forward Stagewise selection methods produce very similar models to the lasso for these data as seen below. Good advice would be to try them all, plot the results, and examine them for any large differences. The usability of the results from lars is an issue. Extracting fitted values, residuals and making predictions using lars is very cumbersome, but definitely doable.> X = model.matrix(bodyfat~.,data=Bodyfat)[,-1]> y = Bodyfat$bodyfat> bodyfat.lars = lars(X,y,type="lar")> plot(bodyfat.lars)> summary(bodyfat.lars)LARS/LARCall: lars(x = X, y = y, type = "lar") Df Rss Cp0 1 17579.0 696.5471 2 6348.3 93.8242 3 5999.8 77.0623 4 5645.1 59.9634 5 5037.4 29.2415 6 4998.8 29.1646 7 4684.9 14.2627 8 4678.3 15.9058 9 4658.4 16.8319 10 4644.8 18.09910 11 4516.3 13.18311 12 4427.8 10.41612 13 4421.5 12.07913 14 4420.1 14.000> fit = predict.lars(bodyfat.lars,X,s=11)> fit = predict.lars(bodyfat.lars,X,s=6)1.5 - Principal Component Regression (PCR) & Partial Least Squares (PLS)Multivariate regression methods like principal component regression (PCR) and partial least squares regression (PLSR) enjoy large popularity in a wide range of fields, including the natural sciences. The main reason is that they have been designed to confront the situation where there are many, generally correlated, predictor variables, and relatively few samples – a situation that is common, especially in chemistry where developments in spectroscopy allow for obtaining hundreds of spectra readings on single sample. 
1.5 - Principal Component Regression (PCR) & Partial Least Squares (PLS)
Multivariate regression methods like principal component regression (PCR) and partial least squares regression (PLSR) enjoy great popularity in a wide range of fields, including the natural sciences. The main reason is that they have been designed to confront the situation where there are many, generally correlated, predictor variables and relatively few samples – a situation that is common, especially in chemistry, where developments in spectroscopy make it possible to obtain hundreds of spectral readings on a single sample. In these situations n << p, thus some form of dimension reduction in the predictor space is necessary.

Principal components analysis is a dimension reduction technique in which p independent/orthogonal linear combinations of the input numeric variables X1, X2, ..., Xp are formed so that the first linear combination accounts for as much of the total variation in the original data as possible. The 2nd linear combination accounts for as much of the remaining variation in the data as possible, subject to the constraint that it is orthogonal to the first linear combination, and so on. Generally the variables are all scaled to have mean 0 and variance 1 (denoted $X_j^*$), thus the total variation in the scaled data is given by

$\sum_{j=1}^{p} V(X_j^*) = p = \sum_{j=1}^{p} V(Z_j)$

where

$Z_1 = a_{11}X_1^* + a_{12}X_2^* + \dots + a_{1p}X_p^*$
$Z_2 = a_{21}X_1^* + a_{22}X_2^* + \dots + a_{2p}X_p^*$
$\vdots$
$Z_p = a_{p1}X_1^* + a_{p2}X_2^* + \dots + a_{pp}X_p^*$

and $Cov(Z_i, Z_j) = Corr(Z_i, Z_j) = 0$ for $i \neq j$.

The linear combinations are determined by the spectral analysis (i.e. finding eigenvalues and eigenvectors) of the sample correlation matrix R. The variance of the jth principal component $Z_j$ is

$V(Z_j) = \lambda_j$, the jth largest eigenvalue of R,

and the coefficients of the linear combination $(a_{j1}, a_{j2}, \dots, a_{jp})$ are given by the eigenvector corresponding to $\lambda_j$.

Ideally the first k principal components will account for a sizeable percentage of the total variation in the data. We can then use these k principal components, $Z_1, Z_2, \dots, Z_k$, as predictors in the multiple regression model

$E(Y|X_1, X_2, \dots, X_p) = \beta_0 + \sum_{j=1}^{k} \beta_j Z_j$   and   $V(Y|X_1, \dots, X_p) = \sigma^2$

Yarn Data
These data were obtained from a calibration study of polyethylene terephthalate (PET) yarns, which are used for textile and industrial purposes. PET yarns are produced by a process of melt-spinning, whose settings largely determine the final semi-crystalline structure of the yarn, which in turn determines its physical properties; these physical properties are important quality parameters for the end use of the yarn. Raman near-infrared (NIR) spectroscopy has recently become an important tool in the pharmaceutical and semiconductor industries for investigating structural information on polymers; in particular, it is used to reveal information about the chemical nature, conformational order, state of order, and orientation of polymers. Thus, Raman spectra are used to predict the physical characteristics of polymers.

In this example, we study the relationship between the overall density of a PET yarn and its NIR spectrum. The data consist of a sample of n = 21 PET yarns having known mechanical and structural properties. For each PET yarn, the Y-variable is the density (kg/m3) of the yarn, and the p = 268 X-variables (measured at 268 frequencies in the range 598 – 1900 cm-1) are selected from the NIR spectrum of that yarn. Thus n << p!!

Many of the X-variables are highly correlated, as the scatterplot matrices on the following pages clearly show (variables X1, ..., X10 and variables X31, ..., X40). Obviously all 268 variables contain similar information, and therefore we should be able to effectively use principal components to reduce the dimensionality of the spectral data.

Form principal components for the PET yarn data
First load the package pls, which contains the yarn data and routines for both principal component regression (PCR) and partial least squares (PLS) regression.

> library(pls)
> Yarn = yarn[1:21,]
> R = cor(Yarn$NIR)
> eigenR = eigen(R)
> attributes(eigenR)
$names
[1] "values"  "vectors"

Only the first four PCs have variance larger than a single scaled variable, i.e. there are four eigenvalues greater than 1.0.
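As a quick check of that claim, a short sketch like the one below looks at the leading eigenvalues directly (a scree-type plot with a reference line at 1).

# Sketch: inspect the leading eigenvalues of R; the claim above is that four exceed 1.
round(eigenR$values[1:6], 3)
sum(eigenR$values > 1)
plot(eigenR$values[1:10], type = "b", xlab = "Component", ylab = "Eigenvalue")
abline(h = 1, lty = 2)        # variance of a single scaled variable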
We will now form Z1, Z2, Z3, Z4 using the corresponding eigenvectors.

> z1 = scale(Yarn$NIR)%*%eigenR$vectors[,1]
> z2 = scale(Yarn$NIR)%*%eigenR$vectors[,2]
> z3 = scale(Yarn$NIR)%*%eigenR$vectors[,3]
> z4 = scale(Yarn$NIR)%*%eigenR$vectors[,4]
> YarnPC = data.frame(density=Yarn$density,z1,z2,z3,z4)
> yarn.pcr = lm(density~.,data=YarnPC)
> summary(yarn.pcr)
Call:
lm(formula = density ~ ., data = YarnPC)
Residuals:
   Min     1Q Median     3Q    Max 
-2.106 -0.522  0.246  0.632  1.219 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  33.6200     0.2249   149.5  < 2e-16 ***
z1           -2.3062     0.0194  -118.7  < 2e-16 ***
z2            0.6602     0.0261    25.3  2.5e-14 ***
z3            1.8044     0.0341    52.8  < 2e-16 ***
z4            0.9759     0.1434     6.8  4.2e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Residual standard error: 1.03 on 16 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999 
F-statistic: 4.39e+03 on 4 and 16 DF,  p-value: <2e-16 

> pairs.plus(YarnPC)
The marginal response plots look rather interesting (see the highlighted panels in the scatterplot matrix). The residuals look surprisingly good considering the nonlinear relationships displayed in the marginal response plots.

Using the pcr command from the pls package:
> yarn.pcr2 = pcr(density~scale(NIR),data=yarn[1:21,],ncomp=6,validation="CV")
> summary(yarn.pcr2)
Data:   X dimension: 21 268 
        Y dimension: 21 1
Fit method: svdpc
Number of components considered: 6
VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV           31.31    15.38    14.29    2.392    1.312    1.167   0.9461
adjCV        31.31    15.39    14.46    2.358    1.288    1.145   0.9336
TRAINING: % variance explained
         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
X          52.49    81.59    98.57    99.54    99.75    99.86
density    80.13    83.77    99.65    99.91    99.93    99.95

> loadingplot(yarn.pcr2,comps=1:4,legendpos="topright")
The loading plot shows the weight assigned to each variable on the first four PCs; the solid line shows the weights on the first principal component. Identifying important individual frequencies will be very difficult, but you can see ranges of frequencies that appear important for each component.

Extract fitted values from the 4-component fit:
> fit = fitted(yarn.pcr2)[,,4]
> plot(Yarn$density,fit)
> predplot(yarn.pcr2,ncomp=1:6)
> corrplot(yarn.pcr2,comps=1:4)
The correlation plot displays the correlation of each of the 268 variables with each of the first four principal components.

> YarnTest = yarn[22:28,]
> predict(yarn.pcr2,ncomp=4,newdata=YarnTest)
, , 4 comps
   density
1    50.95
2    50.97
3    31.92
4    34.77
5    30.72
6    19.93
7    19.37
> YarnTest$density
[1] 51.04 50.32 32.14 34.69 30.30 20.45 20.06
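Since the hand-built regression on Z1-Z4 and the 4-component pcr() fit are based on the same leading principal components, their fitted values should essentially agree. A quick sketch to confirm this:

# Sketch: compare fitted values from the lm() on Z1-Z4 with the 4-component pcr() fit.
plot(fitted(yarn.pcr), fitted(yarn.pcr2)[,,4],
     xlab = "lm on Z1-Z4", ylab = "pcr(), 4 components")
abline(0, 1)                                   # points should lie along this line
cor(fitted(yarn.pcr), fitted(yarn.pcr2)[,,4])  # should be essentially 1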
Partial Least Squares (PLS) Algorithm
While PCR focuses on the covariance structure of the X's independent of the response Y, partial least squares (PLS) looks at the covariance structure of the X's and the response Y jointly. The algorithm for PLS is shown below. To better understand the PLS algorithm, consider the simple example below.

Generate some data
y = rnorm(100)
y = y - mean(y)                      # center the response
x1 = rnorm(100)
x1 = (x1 - mean(x1))/sd(x1)          # standardize the predictors
x2 = y + x1 + rnorm(100)
x2 = (x2 - mean(x2))/sd(x2)
phi1 = sum(y*x1)                     # covariance-type weights for the first component
phi2 = sum(y*x2)
z1 = phi1*x1 + phi2*x2               # first PLS component
z1 = (z1 - mean(z1))/sd(z1)
th1 = lsfit(z1,y,int=F)$coef         # no-intercept regression of y on z1
y1 = y - th1*z1                      # working residual after the first component
pairs(cbind(y,x1,x2,z1,y1))

Now we do the second iteration
x11 = x1 - sum(x1*z1)*z1/sum(z1*z1)  # orthogonalize each predictor with respect to z1
x21 = x2 - sum(x2*z1)*z1/sum(z1*z1)
phi1 = sum(y1*x11)
phi2 = sum(y1*x21)
z2 = phi1*x11 + phi2*x21             # second PLS component
z2 = (z2 - mean(z2))/sd(z2)
th2 = lsfit(z2,y1,int=F)$coef
y2 = y1 - th2*z2                     # working residual after two components
pairs(cbind(y,z2,y2,y1))

Ultimately the final fitted values are a linear combination of the z-components and thus can be expressed as $\hat{Y} = \bar{y} + \sum_{j=1}^{k} \theta_j z_j$. Interpretation of the results is done in a similar fashion to PCR, by examining plots of the cross-validation results, the variable loadings, and the correlations with the original predictors.

We now examine the results from PLS regression for the yarn data.
> yarn.pls = plsr(density~NIR,ncomp=10,data=Yarn,validation="CV")
> summary(yarn.pls)
Data:   X dimension: 21 268 
        Y dimension: 21 1
Fit method: kernelpls
Number of components considered: 10
VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps
CV           31.31    6.473    4.912    2.164   0.8847   0.6041   0.6550   0.3983   0.3244   0.2890    0.2822
adjCV        31.31    5.840    4.862    2.150   0.8552   0.5908   0.6367   0.3816   0.3115   0.2759    0.2687
TRAINING: % variance explained
         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps
X          47.07    98.58    99.50    99.72    99.87    99.98    99.98    99.99    99.99     99.99
density    98.19    98.29    99.71    99.97    99.99    99.99   100.00   100.00   100.00    100.00

> plot(RMSEP(yarn.pls),legendpos="topright")
The 4-component model is suggested by cross-validation.

> predplot(yarn.pls,ncomp=4,line=T)
> plot(yarn.pls,plottype="scores",comps=1:4)
> plot(yarn.pls,"loadings",comps=1:4,legendpos="topright")
> abline(h=0)
> predict(yarn.pls,ncomp=4,newdata=YarnTest)
, , 4 comps
   density
1    51.05
2    50.72
3    32.01
4    34.29
5    30.36
6    20.58
7    19.08
> YarnTest$density
[1] 51.04 50.32 32.14 34.69 30.30 20.45 20.06
> sum((predict(yarn.pcr2,ncomp=4,newdata=YarnTest)-YarnTest$density)^2)
[1] 1.407
> sum((predict(yarn.pls,ncomp=4,newdata=YarnTest)-YarnTest$density)^2)
[1] 1.320
The 4-component PLS model does a slightly better job of predicting the test yarn densities.
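The two iterations in the simple example above generalize directly. Below is a rough sketch that wraps the same deflation steps in a loop; it is an illustration only (in practice you would use plsr() from the pls package), and the function name pls.sketch is hypothetical.

# Rough sketch: the hand-worked PLS iterations above wrapped in a loop.
# Illustration only; plsr() in the pls package is what you would use in practice.
pls.sketch = function(X, y, k = 2) {
  X = scale(X)                          # center and scale each predictor
  ybar = mean(y)
  r = y - ybar                          # working residual (centered response)
  Z = matrix(0, nrow(X), k)
  theta = numeric(k)
  for (j in 1:k) {
    phi = drop(crossprod(X, r))         # phi_i = sum(r * x_i), as in the example
    z = drop(X %*% phi)                 # j-th PLS component
    z = (z - mean(z)) / sd(z)
    theta[j] = sum(z * r) / sum(z * z)  # no-intercept least squares slope
    r = r - theta[j] * z                # deflate the response
    X = X - z %*% (crossprod(z, X) / sum(z * z))  # orthogonalize each x on z
    Z[, j] = z
  }
  list(fitted = ybar + drop(Z %*% theta), scores = Z, theta = theta)
}
# e.g. with the toy data above:  pls.sketch(cbind(x1, x2), y, k = 2)$theta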