Title: | Regularization Paths for SCAD and MCP Penalized Regression Models |
---|---|
Description: | Fits regularization paths for linear regression, GLM, and Cox regression models using lasso or nonconvex penalties, in particular the minimax concave penalty (MCP) and smoothly clipped absolute deviation (SCAD) penalty, with options for additional L2 penalties (the "elastic net" idea). Utilities for carrying out cross-validation as well as post-fitting visualization, summarization, inference, and prediction are also provided. For more information, see Breheny and Huang (2011) <doi:10.1214/10-AOAS388> or visit the ncvreg homepage <https://pbreheny.github.io/ncvreg/>. |
Authors: | Patrick Breheny [aut, cre] |
Maintainer: | Patrick Breheny <[email protected]> |
License: | GPL-3 |
Version: | 3.14.3.1 |
Built: | 2024-10-25 05:47:43 UTC |
Source: | https://github.com/pbreheny/ncvreg |
Calculates the cross-validated AUC (concordance) from a cv.ncvsurv
object.
## S3 method for class 'cv.ncvsurv' AUC(obj, ...)
## S3 method for class 'cv.ncvsurv' AUC(obj, ...)
obj |
A |
... |
For S3 method compatibility; not used |
The area under the curve (AUC), or equivalently, the concordance statistic
(C), is calculated according to the procedure described in van Houwelingen
and Putter (2011). The function calls survival::concordancefit()
, except
cross-validated linear predictors are used to guard against overfitting.
Thus, the values returned by AUC.cv.ncvsurv()
will be lower than those you
would obtain with concordancefit()
if you fit the full (unpenalized) model.
Patrick Breheny, Brandon Butcher, and Lawrence Hunsicker
van Houwelingen H, Putter H (2011). Dynamic Prediction in Clinical Survival Analysis. CRC Press.
cv.ncvsurv()
, survival::concordancefit()
data(Lung) X <- Lung$X y <- Lung$y cvfit <- cv.ncvsurv(X, y, returnY=TRUE) head(AUC(cvfit)) lam <- cvfit$lambda plot(lam, AUC(cvfit), xlim=rev(range(lam)), lwd=3, type='l', las=1, xlab=expression(lambda), ylab='AUC')
data(Lung) X <- Lung$X y <- Lung$y cvfit <- cv.ncvsurv(X, y, returnY=TRUE) head(AUC(cvfit)) lam <- cvfit$lambda plot(lam, AUC(cvfit), xlim=rev(range(lam)), lwd=3, type='l', las=1, xlab=expression(lambda), ylab='AUC')
Performs k-fold cross validation for MCP- or SCAD-penalized regression models over a grid of values for the regularization parameter lambda.
cv.ncvreg( X, y, ..., cluster, nfolds = 10, seed, fold, returnY = FALSE, trace = FALSE ) cv.ncvsurv( X, y, ..., cluster, nfolds = 10, seed, fold, se = c("quick", "bootstrap"), returnY = FALSE, trace = FALSE )
cv.ncvreg( X, y, ..., cluster, nfolds = 10, seed, fold, returnY = FALSE, trace = FALSE ) cv.ncvsurv( X, y, ..., cluster, nfolds = 10, seed, fold, se = c("quick", "bootstrap"), returnY = FALSE, trace = FALSE )
X |
The design matrix, without an intercept, as in |
y |
|
... |
|
cluster |
|
nfolds |
The number of cross-validation folds. Default is 10. |
seed |
You may set the seed of the random number generator in order to obtain reproducible results. |
fold |
Which fold each observation belongs to. By default the observations are randomly assigned. |
returnY |
Should |
trace |
If set to |
se |
For |
The function calls ncvreg
/ncvsurv
nfolds
times, each
time leaving out 1/nfolds
of the data. The cross-validation error is
based on the deviance; see here for more details.
For family="binomial"
models, the cross-validation fold assignments are
balanced across the 0/1 outcomes, so that each fold has the same proportion
of 0/1 outcomes (or as close to the same proportion as it is possible to
achieve if cases do not divide evenly).
For Cox models, cv.ncvsurv()
uses the approach of calculating the full
Cox partial likelihood using the cross-validated set of linear predictors.
Other approaches to cross-validation for the Cox regression model have been
proposed in the literature; the strengths and weaknesses of the various
methods for penalized regression in the Cox model are the subject of current
research. A simple approximation to the standard error is provided,
although an option to bootstrap the standard error (se='bootstrap'
) is also
available.
An object with S3 class cv.ncvreg
or cv.ncvsurv
containing:
The error for each value of lambda
, averaged across the cross-
validation folds.
The estimated standard error associated with each value of for cve
.
The fold assignments for cross-validation for each observation;
note that for cv.ncvsurv()
, these are in terms of the ordered observations,
not the original observations.
The sequence of regularization parameter values along which the cross-validation error was calculated.
The index of lambda
corresponding to lambda.min
.
The value of lambda
with the minimum cross-validation error.
The deviance for the intercept-only model. If you have supplied
your own lambda
sequence, this quantity may not be meaningful.
The estimated bias of the minimum cross-validation error, as in Tibshirani and Tibshirani (2009) doi:10.1214/08-AOAS224
If family="binomial"
, the cross-validation prediction error for
each value of lambda
.
If returnY=TRUE
, the matrix of cross-validated fitted values (see above).
Patrick Breheny; Grant Brown helped with the parallelization support
Breheny P and Huang J. (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5: 232-253. doi:10.1214/10-AOAS388
ncvreg()
, plot.cv.ncvreg()
, summary.cv.ncvreg()
data(Prostate) cvfit <- cv.ncvreg(Prostate$X, Prostate$y) plot(cvfit) summary(cvfit) fit <- cvfit$fit plot(fit) beta <- fit$beta[,cvfit$min] ## requires loading the parallel package ## Not run: library(parallel) X <- Prostate$X y <- Prostate$y cl <- makeCluster(4) cvfit <- cv.ncvreg(X, y, cluster=cl, nfolds=length(y)) ## End(Not run) # Survival data(Lung) X <- Lung$X y <- Lung$y cvfit <- cv.ncvsurv(X, y) summary(cvfit) plot(cvfit) plot(cvfit, type="rsq")
data(Prostate) cvfit <- cv.ncvreg(Prostate$X, Prostate$y) plot(cvfit) summary(cvfit) fit <- cvfit$fit plot(fit) beta <- fit$beta[,cvfit$min] ## requires loading the parallel package ## Not run: library(parallel) X <- Prostate$X y <- Prostate$y cl <- makeCluster(4) cvfit <- cv.ncvreg(X, y, cluster=cl, nfolds=length(y)) ## End(Not run) # Survival data(Lung) X <- Lung$X y <- Lung$y cvfit <- cv.ncvsurv(X, y) summary(cvfit) plot(cvfit) plot(cvfit, type="rsq")
Data from a subset of the Coronary Risk-Factor Study baseline survey, carried out in rural South Africa.
Heart
Heart
A list of two objects: y
and X
Coronary heart disease at baseline; 1=Yes 0=No
A matrix with 462 observations (rows) and 9 predictor variables
(columns). The remainder of this list describes the columns of X
Systolic blood pressure
Cumulative tobacco consumption, in kg
Low-density lipoprotein cholesterol
Adipose tissue concentration
Family history of heart disease (1=Present, 0=Absent)
Score on test designed to measure type-A behavior
Obesity
Current consumption of alcohol
Age of subject
https://web.stanford.edu/~hastie/ElemStatLearn/
Hastie T, Tibshirani R, and Friedman J. (2001). The Elements of Statistical Learning. Springer.
Rousseauw J, et al. (1983). Coronary risk factor screening in three rural communities. South African Medical Journal, 64: 430-436.
local_mfdr()
is called by summary.ncvreg()
, which typically offers a more convenient interface to users.
If, however, you are working with local mfdrs programmatically rather than interactively, you probably want to
use local_mfdr()
, which skips the sorting, filtering, and print formatting of summary.ncvreg()
.
local_mfdr( fit, lambda, X = NULL, y = NULL, method = c("ashr", "kernel"), sigma, ... )
local_mfdr( fit, lambda, X = NULL, y = NULL, method = c("ashr", "kernel"), sigma, ... )
fit |
A fitted |
lambda |
The value of lambda at which inference should be carried out. |
X , y
|
The design matrix and response used to fit the model; in most cases, it is not necessary to provide
|
method |
What method should be used to calculate the local fdr? Options are |
sigma |
For linear regression models, users can supply an estimate of the residual standard deviation. The default is to use RSS / DF, where degrees of freedom are approximated using the number of nonzero coefficients. |
... |
Additional arguments to |
If all features are penalized, then the object returns a data frame with one row per feature and four columns:
Estimate
: The coefficient estimate from the penalized regression fit
z
: A test statistic that approximately follows a standard normal distribution under the null hypothesis that the
feature is marginally independent of the outcome
mfdr
: The estimated marginal local false discovery rate
Selected
: Features with nonzero coefficient estimates are given an asterisk
If some features are penalized and others are not, then a list is returned with two elements: pen.vars
, which consists
of the data frame described above, and unpen.vars
, a data frame with four columns: Estimate
, SE
, Statistic
, and
p.value
. The standard errors and p-values are based on a classical lm
/glm
/coxph
model using the effect of the
penalized features as an offset.
# Linear regression data(Prostate) fit <- ncvreg(Prostate$X, Prostate$y) local_mfdr(fit, 0.1) fit <- ncvreg(Prostate$X, Prostate$y, penalty.factor=rep(0:1, each=4)) local_mfdr(fit, 0.1) # Logistic regression data(Heart) X <- Heart$X y <- Heart$y fit <- ncvreg(X, y, family='binomial') local_mfdr(fit, 0.1) # Cox regression data(Lung) X <- Lung$X y <- Lung$y fit <- ncvsurv(X, y) local_mfdr(fit, 0.1)
# Linear regression data(Prostate) fit <- ncvreg(Prostate$X, Prostate$y) local_mfdr(fit, 0.1) fit <- ncvreg(Prostate$X, Prostate$y, penalty.factor=rep(0:1, each=4)) local_mfdr(fit, 0.1) # Logistic regression data(Heart) X <- Heart$X y <- Heart$y fit <- ncvreg(X, y, family='binomial') local_mfdr(fit, 0.1) # Cox regression data(Lung) X <- Lung$X y <- Lung$y fit <- ncvsurv(X, y) local_mfdr(fit, 0.1)
Extract the log-likelihood of an ncvreg
or ncvsurv
object.
## S3 method for class 'ncvreg' logLik(object, REML = FALSE, ...) ## S3 method for class 'ncvsurv' logLik(object, ...)
## S3 method for class 'ncvreg' logLik(object, REML = FALSE, ...) ## S3 method for class 'ncvsurv' logLik(object, ...)
object |
An |
REML |
As in |
... |
For S3 compatibility |
Data from a randomised trial of two treatment regimens for lung cancer. This is a standard survival analysis data set from the classic textbook by Kalbfleisch and Prentice.
Lung
Lung
A list of two objects: y
and X
A two column matrix (Surv
object) containing the follow-up
time (in days) and an indicator variable for whether the patient died
while on the study or not.
A matrix with 137 observations (rows) and 9 predictor variables
(columns). The remainder of this list describes the columns of X
Treatment indicator (1=control group, 2=treatment group)
Karnofsky performance score (0=bad, 100=good)
Time from diagnosis to randomization (months)
Age (years, at baseline)
Prior therapy (0=no, 1=yes)
Indicator for whether the cancer type is squamous cell carcinoma (0=no, 1=yes)
Indicator for whether the cancer type is small cell lung cancer (0=no, 1=yes)
Indicator for whether the cancer type is adenocarcinoma (0=no, 1=yes)
Indicator for whether the cancer type is large cell carcinoma (0=no, 1=yes)
https://cran.r-project.org/package=survival
Kalbfleisch D and Prentice RL (1980), The Statistical Analysis of Failure Time Data. Wiley, New York.
Estimates the marginal false discovery rate (mFDR) of a penalized regression model.
mfdr(fit, X)
mfdr(fit, X)
fit |
An |
X |
The model matrix corresponding to |
The function estimates the marginal false discovery rate (mFDR) for a
penalized regression model. The estimate tends to be accurate in most
settings, but will be slightly conservative if predictors are highly
correlated. For an alternative way of estimating the mFDR, typically more
accurate in highly correlated cases, see perm.ncvreg()
.
An object with S3 class mfdr
inheriting from data.frame
, containing:
The number of variables selected at each value of lambda
, averaged over the permutation fits.
The actual number of selected variables for the non-permuted data.
The estimated marginal false discovery rate (EF/S
).
Patrick Breheny and Ryan Miller
ncvreg()
, ncvsurv()
, plot.mfdr()
, perm.ncvreg()
# Linear regression -------------------------------- data(Prostate) fit <- ncvreg(Prostate$X, Prostate$y) obj <- mfdr(fit) obj[1:10,] # Comparison with perm.ncvreg op <- par(mfrow=c(2,2)) plot(obj) plot(obj, type="EF") pmfit <- perm.ncvreg(Prostate$X, Prostate$y) plot(pmfit) plot(pmfit, type="EF") par(op) # Logistic regression ------------------------------ data(Heart) fit <- ncvreg(Heart$X, Heart$y, family="binomial") obj <- mfdr(fit) head(obj) op <- par(mfrow=c(1,2)) plot(obj) plot(obj, type="EF") par(op) # Cox regression ----------------------------------- data(Lung) fit <- ncvsurv(Lung$X, Lung$y) obj <- mfdr(fit) head(obj) op <- par(mfrow=c(1,2)) plot(obj) plot(obj, type="EF") par(op)
# Linear regression -------------------------------- data(Prostate) fit <- ncvreg(Prostate$X, Prostate$y) obj <- mfdr(fit) obj[1:10,] # Comparison with perm.ncvreg op <- par(mfrow=c(2,2)) plot(obj) plot(obj, type="EF") pmfit <- perm.ncvreg(Prostate$X, Prostate$y) plot(pmfit) plot(pmfit, type="EF") par(op) # Logistic regression ------------------------------ data(Heart) fit <- ncvreg(Heart$X, Heart$y, family="binomial") obj <- mfdr(fit) head(obj) op <- par(mfrow=c(1,2)) plot(obj) plot(obj, type="EF") par(op) # Cox regression ----------------------------------- data(Lung) fit <- ncvsurv(Lung$X, Lung$y) obj <- mfdr(fit) head(obj) op <- par(mfrow=c(1,2)) plot(obj) plot(obj, type="EF") par(op)
This function is intended for users who know exactly what they're doing and want complete control over the fitting process: no standardization is applied, no intercept is included, no path is fit. All of these things are best practices for data analysis, so if you are choosing not to do them, you are on your own – there is no guarantee that your results will be meaningful. Some things in particular that you should pay attention to:
If your model has an intercept, it is up to you to (un)penalize it properly, typically
by settings its corresponding element of penalty.factor
to zero.
You should provide initial values for the coefficients; in nonconvex optimization, initial values are very important in determining which local solution an algorithm converges to.
ncvfit( X, y, init = rep(0, ncol(X)), r, xtx, penalty = c("MCP", "SCAD", "lasso"), gamma = switch(penalty, SCAD = 3.7, 3), alpha = 1, lambda, eps = 1e-05, max.iter = 1000, penalty.factor = rep(1, ncol(X)), warn = TRUE )
ncvfit( X, y, init = rep(0, ncol(X)), r, xtx, penalty = c("MCP", "SCAD", "lasso"), gamma = switch(penalty, SCAD = 3.7, 3), alpha = 1, lambda, eps = 1e-05, max.iter = 1000, penalty.factor = rep(1, ncol(X)), warn = TRUE )
X |
Design matrix; no intercept will be added, no standardization will occur (n x p matrix) |
y |
Response vector (length n vector) |
init |
Initial values for beta. Default: zero (length p vector) |
r |
Residuals corresponding to |
xtx |
X scales: the jth element should equal |
penalty |
Penalty function to be applied, either "MCP" (default), "SCAD", or "lasso") |
gamma |
Tuning parameter of the MCP/SCAD penalty, as in |
alpha |
Tuning paramter controlling the ridge component of penalty, as in
|
lambda |
Regularization parameter value at which to estimate beta; must be
scalar – for pathwise optimization, see |
eps |
Convergence threshhold. The algorithm iterates until the RMSD for the change in linear predictors for each coefficient is less than eps. Default is 1e-4. |
max.iter |
Maximum number of allowed iterations; if this number is reached, algorithm will terminate prior to convergence. Default: 1000. |
penalty.factor |
Multiplicative factor for the penalty applied to each coefficient,
as in |
warn |
Return warning messages for failures to converge and model saturation? Default is TRUE. |
At the moment, this function only works for least-squares loss functions. Additional functionality for other loss functions (logistic, Cox) is in development.
A list containing:
beta
: The estimated regression coefficients
iter
: The number of iterations required to solve for 'beta
loss
: The loss (residual sum of squares) at convergence
resid
: The residuals at convergence
lambda
: See above
penalty
: See above
gamma
: See above
alpha
: See above
penalty.factor
: See above
n
: Sample size
data(Prostate) X <- cbind(1, Prostate$X) y <- Prostate$y fit <- ncvfit(X, y, lambda=0.1, penalty.factor=c(0, rep(1, ncol(X)-1))) fit$beta # Compare with: coef(ncvreg(X, y), 0.1) # The unstandardized version makes little sense here, as it fails to account # for differences in the scales of the predictors.
data(Prostate) X <- cbind(1, Prostate$X) y <- Prostate$y fit <- ncvfit(X, y, lambda=0.1, penalty.factor=c(0, rep(1, ncol(X)-1))) fit$beta # Compare with: coef(ncvreg(X, y), 0.1) # The unstandardized version makes little sense here, as it fails to account # for differences in the scales of the predictors.
Fit coefficients paths for MCP- or SCAD-penalized regression models over a grid of values for the regularization parameter lambda. Fits linear and logistic regression models, with option for an additional L2 penalty.
ncvreg( X, y, family = c("gaussian", "binomial", "poisson"), penalty = c("MCP", "SCAD", "lasso"), gamma = switch(penalty, SCAD = 3.7, 3), alpha = 1, lambda.min = ifelse(n > p, 0.001, 0.05), nlambda = 100, lambda, eps = 1e-04, max.iter = 10000, convex = TRUE, dfmax = p + 1, penalty.factor = rep(1, ncol(X)), warn = TRUE, returnX, ... )
ncvreg( X, y, family = c("gaussian", "binomial", "poisson"), penalty = c("MCP", "SCAD", "lasso"), gamma = switch(penalty, SCAD = 3.7, 3), alpha = 1, lambda.min = ifelse(n > p, 0.001, 0.05), nlambda = 100, lambda, eps = 1e-04, max.iter = 10000, convex = TRUE, dfmax = p + 1, penalty.factor = rep(1, ncol(X)), warn = TRUE, returnX, ... )
X |
The design matrix, without an intercept. |
y |
The response vector. |
family |
Either "gaussian", "binomial", or "poisson", depending on the response. |
penalty |
The penalty to be applied to the model. Either "MCP" (the default), "SCAD", or "lasso". |
gamma |
The tuning parameter of the MCP/SCAD penalty (see details). Default is 3 for MCP and 3.7 for SCAD. |
alpha |
Tuning parameter for the Mnet estimator which controls the
relative contributions from the MCP/SCAD penalty and the
ridge, or L2 penalty. |
lambda.min |
The smallest value for lambda, as a fraction of lambda.max. Default is 0.001 if the number of observations is larger than the number of covariates and .05 otherwise. |
nlambda |
The number of lambda values. Default is 100. |
lambda |
A user-specified sequence of lambda values. By default, a
sequence of values of length |
eps |
Convergence threshhold. The algorithm iterates until the
RMSD for the change in linear predictors for each
coefficient is less than |
max.iter |
Maximum number of iterations (total across entire path). Default is 10000. |
convex |
Calculate index for which objective function ceases to be locally convex? Default is TRUE. |
dfmax |
Upper bound for the number of nonzero coefficients. Default is no upper bound. However, for large data sets, computational burden may be heavy for models with a large number of nonzero coefficients. |
penalty.factor |
A multiplicative factor for the penalty applied to
each coefficient. If supplied, |
warn |
Return warning messages for failures to converge and model saturation? Default is TRUE. |
returnX |
Return the standardized design matrix along with the fit? By
default, this option is turned on if X is under 100 MB, but
turned off for larger matrices to preserve memory. Note that
certain methods, such as |
... |
Not used. |
The sequence of models indexed by the regularization parameter lambda
is
fit using a coordinate descent algorithm. For logistic regression models,
some care is taken to avoid model saturation; the algorithm may exit early in
this setting. The objective function is defined to be
where the loss function L is the deviance (-2 times the log likelihood) for the specified outcome distribution (gaussian/binomial/poisson). See here for more details.
This algorithm is stable, very efficient, and generally converges quite rapidly to the solution. For GLMs, adaptive rescaling is used.
An object with S3 class "ncvreg"
containing:
The fitted matrix of coefficients. The number of rows is equal to the number of coefficients, and the number of columns is equal to nlambda
.
A vector of length nlambda
containing the number of iterations until convergence at each value of lambda
.
The sequence of regularization parameter values in the path.
Same as above.
The last index for which the objective function is locally convex. The smallest value of lambda for which the objective function is convex is therefore lambda[convex.min]
, with corresponding coefficients beta[,convex.min]
.
A vector containing the deviance (i.e., the loss) at each value of lambda
. Note that for gaussian
models, the loss is simply the residual sum of squares.
Sample size.
Additionally, if returnX=TRUE
, the object will also contain
The standardized design matrix.
The response, centered if family='gaussian'
.
Breheny P and Huang J. (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5: 232-253. doi:10.1214/10-AOAS388
# Linear regression -------------------------------------------------- data(Prostate) X <- Prostate$X y <- Prostate$y op <- par(mfrow=c(2,2)) fit <- ncvreg(X, y) plot(fit, main=expression(paste(gamma,"=",3))) fit <- ncvreg(X, y, gamma=10) plot(fit, main=expression(paste(gamma,"=",10))) fit <- ncvreg(X, y, gamma=1.5) plot(fit, main=expression(paste(gamma,"=",1.5))) fit <- ncvreg(X, y, penalty="SCAD") plot(fit, main=expression(paste("SCAD, ",gamma,"=",3))) par(op) op <- par(mfrow=c(2,2)) fit <- ncvreg(X, y) plot(fit, main=expression(paste(alpha,"=",1))) fit <- ncvreg(X, y, alpha=0.9) plot(fit, main=expression(paste(alpha,"=",0.9))) fit <- ncvreg(X, y, alpha=0.5) plot(fit, main=expression(paste(alpha,"=",0.5))) fit <- ncvreg(X, y, alpha=0.1) plot(fit, main=expression(paste(alpha,"=",0.1))) par(op) op <- par(mfrow=c(2,2)) fit <- ncvreg(X, y) plot(mfdr(fit)) # Independence approximation plot(mfdr(fit), type="EF") # Independence approximation perm.fit <- perm.ncvreg(X, y) plot(perm.fit) plot(perm.fit, type="EF") par(op) # Logistic regression ------------------------------------------------ data(Heart) X <- Heart$X y <- Heart$y op <- par(mfrow=c(2,2)) fit <- ncvreg(X, y, family="binomial") plot(fit, main=expression(paste(gamma,"=",3))) fit <- ncvreg(X, y, family="binomial", gamma=10) plot(fit, main=expression(paste(gamma,"=",10))) fit <- ncvreg(X, y, family="binomial", gamma=1.5) plot(fit, main=expression(paste(gamma,"=",1.5))) fit <- ncvreg(X, y, family="binomial", penalty="SCAD") plot(fit, main=expression(paste("SCAD, ",gamma,"=",3))) par(op) op <- par(mfrow=c(2,2)) fit <- ncvreg(X, y, family="binomial") plot(fit, main=expression(paste(alpha,"=",1))) fit <- ncvreg(X, y, family="binomial", alpha=0.9) plot(fit, main=expression(paste(alpha,"=",0.9))) fit <- ncvreg(X, y, family="binomial", alpha=0.5) plot(fit, main=expression(paste(alpha,"=",0.5))) fit <- ncvreg(X, y, family="binomial", alpha=0.1) plot(fit, main=expression(paste(alpha,"=",0.1))) par(op)
# Linear regression -------------------------------------------------- data(Prostate) X <- Prostate$X y <- Prostate$y op <- par(mfrow=c(2,2)) fit <- ncvreg(X, y) plot(fit, main=expression(paste(gamma,"=",3))) fit <- ncvreg(X, y, gamma=10) plot(fit, main=expression(paste(gamma,"=",10))) fit <- ncvreg(X, y, gamma=1.5) plot(fit, main=expression(paste(gamma,"=",1.5))) fit <- ncvreg(X, y, penalty="SCAD") plot(fit, main=expression(paste("SCAD, ",gamma,"=",3))) par(op) op <- par(mfrow=c(2,2)) fit <- ncvreg(X, y) plot(fit, main=expression(paste(alpha,"=",1))) fit <- ncvreg(X, y, alpha=0.9) plot(fit, main=expression(paste(alpha,"=",0.9))) fit <- ncvreg(X, y, alpha=0.5) plot(fit, main=expression(paste(alpha,"=",0.5))) fit <- ncvreg(X, y, alpha=0.1) plot(fit, main=expression(paste(alpha,"=",0.1))) par(op) op <- par(mfrow=c(2,2)) fit <- ncvreg(X, y) plot(mfdr(fit)) # Independence approximation plot(mfdr(fit), type="EF") # Independence approximation perm.fit <- perm.ncvreg(X, y) plot(perm.fit) plot(perm.fit, type="EF") par(op) # Logistic regression ------------------------------------------------ data(Heart) X <- Heart$X y <- Heart$y op <- par(mfrow=c(2,2)) fit <- ncvreg(X, y, family="binomial") plot(fit, main=expression(paste(gamma,"=",3))) fit <- ncvreg(X, y, family="binomial", gamma=10) plot(fit, main=expression(paste(gamma,"=",10))) fit <- ncvreg(X, y, family="binomial", gamma=1.5) plot(fit, main=expression(paste(gamma,"=",1.5))) fit <- ncvreg(X, y, family="binomial", penalty="SCAD") plot(fit, main=expression(paste("SCAD, ",gamma,"=",3))) par(op) op <- par(mfrow=c(2,2)) fit <- ncvreg(X, y, family="binomial") plot(fit, main=expression(paste(alpha,"=",1))) fit <- ncvreg(X, y, family="binomial", alpha=0.9) plot(fit, main=expression(paste(alpha,"=",0.9))) fit <- ncvreg(X, y, family="binomial", alpha=0.5) plot(fit, main=expression(paste(alpha,"=",0.5))) fit <- ncvreg(X, y, family="binomial", alpha=0.1) plot(fit, main=expression(paste(alpha,"=",0.1))) par(op)
Fit coefficients paths for MCP- or SCAD-penalized Cox regression models over a grid of values for the regularization parameter lambda, with option for an additional L2 penalty.
ncvsurv( X, y, penalty = c("MCP", "SCAD", "lasso"), gamma = switch(penalty, SCAD = 3.7, 3), alpha = 1, lambda.min = ifelse(n > p, 0.001, 0.05), nlambda = 100, lambda, eps = 1e-04, max.iter = 10000, convex = TRUE, dfmax = p, penalty.factor = rep(1, ncol(X)), warn = TRUE, returnX, ... )
ncvsurv( X, y, penalty = c("MCP", "SCAD", "lasso"), gamma = switch(penalty, SCAD = 3.7, 3), alpha = 1, lambda.min = ifelse(n > p, 0.001, 0.05), nlambda = 100, lambda, eps = 1e-04, max.iter = 10000, convex = TRUE, dfmax = p, penalty.factor = rep(1, ncol(X)), warn = TRUE, returnX, ... )
X |
The design matrix of predictor values. |
y |
The time-to-event outcome, as a two-column matrix or
|
penalty |
The penalty to be applied to the model. Either "MCP" (the default), "SCAD", or "lasso". |
gamma |
The tuning parameter of the MCP/SCAD penalty (see details). Default is 3 for MCP and 3.7 for SCAD. |
alpha |
Tuning parameter for the Mnet estimator which controls the
relative contributions from the MCP/SCAD penalty and the ridge, or L2
penalty. |
lambda.min |
The smallest value for lambda, as a fraction of lambda.max. Default is .001 if the number of observations is larger than the number of covariates and .05 otherwise. |
nlambda |
The number of lambda values. Default is 100. |
lambda |
A user-specified sequence of lambda values. By default, a
sequence of values of length |
eps |
Convergence threshhold. The algorithm iterates until the RMSD for
the change in linear predictors for any coefficient is less than
|
max.iter |
Maximum number of iterations (total across entire path). Default is 1000. |
convex |
Calculate index for which objective function ceases to be locally convex? Default is TRUE. |
dfmax |
Upper bound for the number of nonzero coefficients. Default is no upper bound. However, for large data sets, computational burden may be heavy for models with a large number of nonzero coefficients. |
penalty.factor |
A multiplicative factor for the penalty applied to each
coefficient. If supplied, |
warn |
Return warning messages for failures to converge and model saturation? Default is TRUE. |
returnX |
Return the standardized design matrix along with the fit? By
default, this option is turned on if X is under 100 MB, but turned off for
larger matrices to preserve memory. Note that certain methods, such as
|
... |
Not used. |
The sequence of models indexed by the regularization parameter lambda
is fit using a coordinate descent algorithm. In order to accomplish this,
the second derivative (Hessian) of the Cox partial log-likelihood is
diagonalized (see references for details). The objective function is
defined to be
where the loss function L is the deviance (-2 times the partial log-likelihood) from the Cox regression mode. See here for more details.
Presently, ties are not handled by ncvsurv
in a particularly
sophisticated manner. This will be improved upon in a future release of
ncvreg.
An object with S3 class ncvsurv
containing:
The fitted matrix of coefficients. The number of rows is equal to the number of coefficients, and the number of columns is equal to nlambda
.
A vector of length nlambda
containing the number of iterations until convergence at each value of lambda
.
The sequence of regularization parameter values in the path.
Same as above.
The last index for which the objective function is locally convex. The smallest value of lambda for which the objective function is convex is therefore lambda[convex.min]
, with corresponding coefficients beta[,convex.min]
.
The deviance of the fitted model at each value of lambda
.
The number of instances.
For Cox models, the following objects are also returned (and are necessary to
estimate baseline survival conditonal on the estimated regression
coefficients), all of which are ordered by time on study. I.e., the ith row
of W
does not correspond to the ith row of X
):
Matrix of exp(beta)
values for each subject over all lambda
values.
Times on study.
Failure event indicator.
Additionally, if returnX=TRUE
, the object will also contain
The standardized design matrix.
Breheny P and Huang J. (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5: 232-253. doi:10.1214/10-AOAS388
Simon N, Friedman JH, Hastie T, and Tibshirani R. (2011) Regularization Paths for Cox's Proportional Hazards Model via Coordinate Descent. Journal of Statistical Software, 39: 1-13. doi:10.18637/jss.v039.i05
data(Lung) X <- Lung$X y <- Lung$y op <- par(mfrow=c(2,2)) fit <- ncvsurv(X, y) plot(fit, main=expression(paste(gamma,"=",3))) fit <- ncvsurv(X, y, gamma=10) plot(fit, main=expression(paste(gamma,"=",10))) fit <- ncvsurv(X, y, gamma=1.5) plot(fit, main=expression(paste(gamma,"=",1.5))) fit <- ncvsurv(X, y, penalty="SCAD") plot(fit, main=expression(paste("SCAD, ",gamma,"=",3))) par(op) fit <- ncvsurv(X,y) ll <- log(fit$lambda) op <- par(mfrow=c(2,1)) plot(ll, BIC(fit), type="l", xlim=rev(range(ll))) lam <- fit$lambda[which.min(BIC(fit))] b <- coef(fit, lambda=lam) b[b!=0] plot(fit) abline(v=lam) par(op) S <- predict(fit, X, type='survival', lambda=lam) plot(S, xlim=c(0,200))
data(Lung) X <- Lung$X y <- Lung$y op <- par(mfrow=c(2,2)) fit <- ncvsurv(X, y) plot(fit, main=expression(paste(gamma,"=",3))) fit <- ncvsurv(X, y, gamma=10) plot(fit, main=expression(paste(gamma,"=",10))) fit <- ncvsurv(X, y, gamma=1.5) plot(fit, main=expression(paste(gamma,"=",1.5))) fit <- ncvsurv(X, y, penalty="SCAD") plot(fit, main=expression(paste("SCAD, ",gamma,"=",3))) par(op) fit <- ncvsurv(X,y) ll <- log(fit$lambda) op <- par(mfrow=c(2,1)) plot(ll, BIC(fit), type="l", xlim=rev(range(ll))) lam <- fit$lambda[which.min(BIC(fit))] b <- coef(fit, lambda=lam) b[b!=0] plot(fit) abline(v=lam) par(op) S <- predict(fit, X, type='survival', lambda=lam) plot(S, xlim=c(0,200))
Fits multiple penalized regression models in which the outcome is randomly permuted, thereby allowing estimation of the marginal false discovery rate.
perm.ncvreg( X, y, ..., permute = c("outcome", "residuals"), N = 10, seed, trace = FALSE )
perm.ncvreg( X, y, ..., permute = c("outcome", "residuals"), N = 10, seed, trace = FALSE )
X |
The design matrix, without an intercept, as in |
y |
The response vector, as in |
... |
Additional arguments to |
permute |
What to permute. If |
N |
The number of permutation replications. Default is 10. |
seed |
You may set the seed of the random number generator in order to obtain reproducible results. |
trace |
If set to TRUE, perm.ncvreg will inform the user of its progress by announcing the beginning of each permutation fit. Default is FALSE. |
The function fits a penalized regression model to the actual data, then
repeats the process N
times with a permuted version of the response
vector. This allows estimation of the expected number of variables included
by chance for each value of lambda
. The ratio of this expected
quantity to the number of selected variables using the actual (non-permuted)
response is called the marginal false discovery rate (mFDR).
An object with S3 class "perm.ncvreg"
containing:
EF |
The number of variables selected at each value of |
S |
The actual number of selected variables for the non-permuted data. |
mFDR |
The estimated marginal false discovery rate ( |
fit |
The fitted |
loss |
The loss/deviance for each value of |
Patrick Breheny [email protected]
# Linear regression -------------------------------------------------- data(Prostate) pmfit <- perm.ncvreg(Prostate$X, Prostate$y) op <- par(mfcol=c(2,2)) plot(pmfit) plot(pmfit, type="EF") plot(pmfit$fit) lam <- pmfit$fit$lambda pmfit.r <- perm.ncvreg(Prostate$X, Prostate$y, permute='residuals') plot(pmfit.r, col="red") # Permuting residuals is lines(lam, pmfit$mFDR, col="gray60") # less conservative par(op) # Logistic regression ------------------------------------------------ data(Heart) pmfit <- perm.ncvreg(Heart$X, Heart$y, family="binomial") op <- par(mfcol=c(2,2)) plot(pmfit) plot(pmfit, type="EF") plot(pmfit$fit) par(op)
# Linear regression -------------------------------------------------- data(Prostate) pmfit <- perm.ncvreg(Prostate$X, Prostate$y) op <- par(mfcol=c(2,2)) plot(pmfit) plot(pmfit, type="EF") plot(pmfit$fit) lam <- pmfit$fit$lambda pmfit.r <- perm.ncvreg(Prostate$X, Prostate$y, permute='residuals') plot(pmfit.r, col="red") # Permuting residuals is lines(lam, pmfit$mFDR, col="gray60") # less conservative par(op) # Logistic regression ------------------------------------------------ data(Heart) pmfit <- perm.ncvreg(Heart$X, Heart$y, family="binomial") op <- par(mfcol=c(2,2)) plot(pmfit) plot(pmfit, type="EF") plot(pmfit$fit) par(op)
Fits multiple penalized regression models in which the residuals are randomly permuted, thereby allowing estimation of the marginal false discovery rate.
permres(fit, ...) ## S3 method for class 'ncvreg' permres(fit, lambda, N = 10, seed, trace = FALSE, ...)
permres(fit, ...) ## S3 method for class 'ncvreg' permres(fit, lambda, N = 10, seed, trace = FALSE, ...)
fit |
A fitted ncvreg model, as produced by |
... |
Not used. |
lambda |
The regularization parameter to use for estimating residuals.
Unlike |
N |
The number of permutation replications. Default is 10. |
seed |
You may set the seed of the random number generator in order to obtain reproducible results. |
trace |
If set to TRUE, perm.ncvreg will inform the user of its progress by announcing the beginning of each permutation fit. Default is FALSE. |
The function fits a penalized regression model to the actual data, then
repeats the process N
times with a permuted version of the response
vector. This allows estimation of the expected number of variables included
by chance for each value of lambda
. The ratio of this expected
quantity to the number of selected variables using the actual (non-permuted)
response is called the marginal false discovery rate (mFDR).
A list with the following components:
EF |
The number of variables selected at each value of |
S |
The actual number of selected variables for the non-permuted data. |
mFDR |
The estimated marginal false discovery rate ( |
loss |
The loss/deviance, averaged over the permutation fits. This is an estimate of the explanatory power of the model under null conditions, and can be used to adjust the loss of the fitted model in a manner akin to the idea of an adjusted R-squared in classical regression. |
Patrick Breheny [email protected]
ncvreg()
, 'mfdr()
, perm.ncvreg()
data(Prostate) fit <- ncvreg(Prostate$X, Prostate$y, N=50) permres(fit, lambda=0.15)
data(Prostate) fit <- ncvreg(Prostate$X, Prostate$y, N=50) permres(fit, lambda=0.15)
Plots the cross-validation curve from a cv.ncvreg
or cv.ncvsurv
object,
along with standard error bars.
## S3 method for class 'cv.ncvreg' plot( x, log.l = TRUE, type = c("cve", "rsq", "scale", "snr", "pred", "all"), selected = TRUE, vertical.line = TRUE, col = "red", ... )
## S3 method for class 'cv.ncvreg' plot( x, log.l = TRUE, type = c("cve", "rsq", "scale", "snr", "pred", "all"), selected = TRUE, vertical.line = TRUE, col = "red", ... )
x |
A |
log.l |
Should horizontal axis be on the log scale? Default is TRUE. |
type |
What to plot on the vertical axis:
|
selected |
If |
vertical.line |
If |
col |
Controls the color of the dots (CV estimates). |
... |
Other graphical parameters to |
Error bars representing approximate 68% confidence intervals are plotted
along with the estimates across values of lambda
. For rsq
and snr
applied to models other than linear regression, the Cox-Snell R-squared is used.
Patrick Breheny
Breheny P and Huang J. (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5: 232-253. doi:10.1214/10-AOAS388
# Linear regression -------------------------------------------------- data(Prostate) cvfit <- cv.ncvreg(Prostate$X, Prostate$y) plot(cvfit) op <- par(mfrow=c(2,2)) plot(cvfit, type="all") par(op) # Logistic regression ------------------------------------------------ data(Heart) cvfit <- cv.ncvreg(Heart$X, Heart$y, family="binomial") plot(cvfit) op <- par(mfrow=c(2,2)) plot(cvfit, type="all") par(op) # Cox regression ----------------------------------------------------- data(Lung) cvfit <- cv.ncvsurv(Lung$X, Lung$y) op <- par(mfrow=c(1,2)) plot(cvfit) plot(cvfit, type="rsq") par(op)
# Linear regression -------------------------------------------------- data(Prostate) cvfit <- cv.ncvreg(Prostate$X, Prostate$y) plot(cvfit) op <- par(mfrow=c(2,2)) plot(cvfit, type="all") par(op) # Logistic regression ------------------------------------------------ data(Heart) cvfit <- cv.ncvreg(Heart$X, Heart$y, family="binomial") plot(cvfit) op <- par(mfrow=c(2,2)) plot(cvfit, type="all") par(op) # Cox regression ----------------------------------------------------- data(Lung) cvfit <- cv.ncvsurv(Lung$X, Lung$y) op <- par(mfrow=c(1,2)) plot(cvfit) plot(cvfit, type="rsq") par(op)
Plot marginal false discovery rate curves from an mfdr
or perm.ncvreg
object.
## S3 method for class 'mfdr' plot( x, type = c("mFDR", "EF"), log.l = FALSE, selected = TRUE, legend = TRUE, ... )
## S3 method for class 'mfdr' plot( x, type = c("mFDR", "EF"), log.l = FALSE, selected = TRUE, legend = TRUE, ... )
x |
A |
type |
What to plot on the vertical axis. |
log.l |
Should horizontal axis be on the log scale? Default is |
selected |
If |
legend |
For |
... |
Other graphical parameters to pass to |
Patrick Breheny
Breheny P (2019). Marginal false discovery rates for penalized regression models. Biostatistics, 20: 299-314.
data(Prostate) fit <- ncvreg(Prostate$X, Prostate$y) obj <- mfdr(fit) obj[1:10,] # Some plotting options plot(obj) plot(obj, type="EF") plot(obj, log=TRUE) # Comparison with perm.ncvreg op <- par(mfrow=c(2,2)) plot(obj) plot(obj, type="EF") pmfit <- perm.ncvreg(Prostate$X, Prostate$y) plot(pmfit) plot(pmfit, type="EF") par(op)
data(Prostate) fit <- ncvreg(Prostate$X, Prostate$y) obj <- mfdr(fit) obj[1:10,] # Some plotting options plot(obj) plot(obj, type="EF") plot(obj, log=TRUE) # Comparison with perm.ncvreg op <- par(mfrow=c(2,2)) plot(obj) plot(obj, type="EF") pmfit <- perm.ncvreg(Prostate$X, Prostate$y) plot(pmfit) plot(pmfit, type="EF") par(op)
Produces a plot of the coefficient paths for a fitted ncvreg
object.
## S3 method for class 'ncvreg' plot(x, alpha = 1, log.l = FALSE, shade = TRUE, col, ...)
## S3 method for class 'ncvreg' plot(x, alpha = 1, log.l = FALSE, shade = TRUE, col, ...)
x |
Fitted |
alpha |
Controls alpha-blending, helpful when the number of features is large. Default is alpha=1. |
log.l |
Should horizontal axis be on the log scale? Default is FALSE. |
shade |
Should nonconvex region be shaded? Default is TRUE. |
col |
Vector of colors for coefficient lines. By default, evenly spaced colors are selected automatically. |
... |
Other graphical parameters to |
Patrick Breheny
Breheny P and Huang J. (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5: 232-253. doi:10.1214/10-AOAS388
data(Prostate) fit <- ncvreg(Prostate$X, Prostate$y) plot(fit) plot(fit, col="black") plot(fit, log=TRUE) fit <- ncvreg(Prostate$X, Prostate$y, penalty.factor=rep(c(1, 1, 1, Inf), 2)) plot(fit, col=c('red', 'black', 'green')) # Recycled among nonzero paths
data(Prostate) fit <- ncvreg(Prostate$X, Prostate$y) plot(fit) plot(fit, col="black") plot(fit, log=TRUE) fit <- ncvreg(Prostate$X, Prostate$y, penalty.factor=rep(c(1, 1, 1, Inf), 2)) plot(fit, col=c('red', 'black', 'green')) # Recycled among nonzero paths
Plot survival curve for a model that has been fit using ncvsurv()
followed
by a prediction of the survival function using predict.ncvsurv()
.
## S3 method for class 'ncvsurv.func' plot(x, alpha = 1, ...)
## S3 method for class 'ncvsurv.func' plot(x, alpha = 1, ...)
x |
A |
alpha |
Controls alpha-blending (i.e., transparency). Useful if many overlapping lines are present. |
... |
Other graphical parameters to pass to |
Patrick Breheny
data(Lung) X <- Lung$X y <- Lung$y fit <- ncvsurv(X, y) # A single survival curve S <- predict(fit, X[1,], type='survival', lambda=.15) plot(S, xlim=c(0,200)) # Lots of survival curves S <- predict(fit, X, type='survival', lambda=.08) plot(S, xlim=c(0,200), alpha=0.3)
data(Lung) X <- Lung$X y <- Lung$y fit <- ncvsurv(X, y) # A single survival curve S <- predict(fit, X[1,], type='survival', lambda=.15) plot(S, xlim=c(0,200)) # Lots of survival curves S <- predict(fit, X, type='survival', lambda=.08) plot(S, xlim=c(0,200), alpha=0.3)
Similar to other predict methods, this function returns predictions from a
fitted ncvreg
object.
## S3 method for class 'cv.ncvreg' predict( object, X, type = c("link", "response", "class", "coefficients", "vars", "nvars"), which = object$min, ... ) ## S3 method for class 'cv.ncvreg' coef(object, which = object$min, ...) ## S3 method for class 'cv.ncvsurv' predict( object, X, type = c("link", "response", "survival", "median", "coefficients", "vars", "nvars"), which = object$min, ... ) ## S3 method for class 'ncvreg' predict( object, X, type = c("link", "response", "class", "coefficients", "vars", "nvars"), lambda, which = 1:length(object$lambda), ... ) ## S3 method for class 'ncvreg' coef(object, lambda, which = 1:length(object$lambda), drop = TRUE, ...)
## S3 method for class 'cv.ncvreg' predict( object, X, type = c("link", "response", "class", "coefficients", "vars", "nvars"), which = object$min, ... ) ## S3 method for class 'cv.ncvreg' coef(object, which = object$min, ...) ## S3 method for class 'cv.ncvsurv' predict( object, X, type = c("link", "response", "survival", "median", "coefficients", "vars", "nvars"), which = object$min, ... ) ## S3 method for class 'ncvreg' predict( object, X, type = c("link", "response", "class", "coefficients", "vars", "nvars"), lambda, which = 1:length(object$lambda), ... ) ## S3 method for class 'ncvreg' coef(object, lambda, which = 1:length(object$lambda), drop = TRUE, ...)
object |
Fitted |
X |
Matrix of values at which predictions are to be made. Not used for
|
type |
Type of prediction:
|
which |
Indices of the penalty parameter |
... |
Not used. |
lambda |
Values of the regularization parameter |
drop |
If coefficients for a single value of |
The object returned depends on type.
Patrick Breheny
Breheny P and Huang J. (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5: 232-253. doi:10.1214/10-AOAS388
data(Heart) fit <- ncvreg(Heart$X, Heart$y, family="binomial") coef(fit, lambda=0.05) head(predict(fit, Heart$X, type="link", lambda=0.05)) head(predict(fit, Heart$X, type="response", lambda=0.05)) head(predict(fit, Heart$X, type="class", lambda=0.05)) predict(fit, type="vars", lambda=c(0.05, 0.01)) predict(fit, type="nvars", lambda=c(0.05, 0.01))
data(Heart) fit <- ncvreg(Heart$X, Heart$y, family="binomial") coef(fit, lambda=0.05) head(predict(fit, Heart$X, type="link", lambda=0.05)) head(predict(fit, Heart$X, type="response", lambda=0.05)) head(predict(fit, Heart$X, type="class", lambda=0.05)) predict(fit, type="vars", lambda=c(0.05, 0.01)) predict(fit, type="nvars", lambda=c(0.05, 0.01))
ncvsurv
object.Similar to other predict methods, this function returns predictions from a
fitted ncvsurv
object.
## S3 method for class 'ncvsurv' predict( object, X, type = c("link", "response", "survival", "hazard", "median", "coefficients", "vars", "nvars"), lambda, which = 1:length(object$lambda), ... )
## S3 method for class 'ncvsurv' predict( object, X, type = c("link", "response", "survival", "hazard", "median", "coefficients", "vars", "nvars"), lambda, which = 1:length(object$lambda), ... )
object |
Fitted |
X |
Matrix of values at which predictions are to be made. Not used for
|
type |
Type of prediction:
|
lambda |
Values of the regularization parameter |
which |
Indices of the penalty parameter |
... |
Not used. |
Estimation of baseline survival function conditional on the estimated values
of beta
is carried out according to the method described in Chapter
4.3 of Kalbfleish and Prentice. In particular, it agrees exactly the
results returned by survfit.coxph(..., type='kalbfleisch-prentice')
in the survival
package.
The object returned depends on type.
Patrick Breheny [email protected]
Breheny P and Huang J. (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5: 232-253. doi:10.1214/10-AOAS388
Kalbfleish JD and Prentice RL (2002). The Statistical Analysis of Failure Time Data, 2nd edition. Wiley.
data(Lung) X <- Lung$X y <- Lung$y fit <- ncvsurv(X,y) coef(fit, lambda=0.05) head(predict(fit, X, type="link", lambda=0.05)) head(predict(fit, X, type="response", lambda=0.05)) # Survival function S <- predict(fit, X[1,], type="survival", lambda=0.05) S(100) S <- predict(fit, X, type="survival", lambda=0.05) plot(S, xlim=c(0,200)) # Medians predict(fit, X[1,], type="median", lambda=0.05) M <- predict(fit, X, type="median") M[1:10, 1:10] # Nonzero coefficients predict(fit, type="vars", lambda=c(0.1, 0.01)) predict(fit, type="nvars", lambda=c(0.1, 0.01))
data(Lung) X <- Lung$X y <- Lung$y fit <- ncvsurv(X,y) coef(fit, lambda=0.05) head(predict(fit, X, type="link", lambda=0.05)) head(predict(fit, X, type="response", lambda=0.05)) # Survival function S <- predict(fit, X[1,], type="survival", lambda=0.05) S(100) S <- predict(fit, X, type="survival", lambda=0.05) plot(S, xlim=c(0,200)) # Medians predict(fit, X[1,], type="median", lambda=0.05) M <- predict(fit, X, type="median") M[1:10, 1:10] # Nonzero coefficients predict(fit, type="vars", lambda=c(0.1, 0.01)) predict(fit, type="nvars", lambda=c(0.1, 0.01))
Data from a study by by Stamey et al. (1989) to examine the association between prostate specific antigen (PSA) and several clinical measures that are potentially associated with PSA in men who were about to receive a radical prostatectomy.
Prostate
Prostate
A list of two objects: y
and X
Log PSA
A matrix with 97 instances (rows) and 8 predictor variables
(columns). The remainder of this list describes the columns of X
Log cancer volume
Log prostate weight
The man's age (years)
Log of the amount of benign hyperplasia
Seminal vesicle invasion (1=Yes, 0=No)
Log of capsular penetration
Gleason score
Percent of Gleason scores 4 or 5
https://web.stanford.edu/~hastie/ElemStatLearn/
Hastie T, Tibshirani R, and Friedman J. (2001). The Elements of Statistical Learning. Springer.
Stamey T, et al. (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. Journal of Urology, 16: 1076-1083.
Currently, only deviance residuals are supported.
## S3 method for class 'ncvreg' residuals(object, lambda, which = 1:length(object$lambda), drop = TRUE, ...)
## S3 method for class 'ncvreg' residuals(object, lambda, which = 1:length(object$lambda), drop = TRUE, ...)
object |
Object of class |
lambda |
Values of the regularization parameter at which residuals are requested (numeric vector). For values of lambda not in the sequence of fitted models, linear interpolation is used. |
which |
Index of the penalty parameter at which residuals are requested (default = all indices). If |
drop |
By default, if a single value of lambda is supplied, a vector of residuals is returned (logical; default= |
... |
Not used. |
data(Prostate) X <- Prostate$X y <- Prostate$y fit <- ncvreg(X, y) residuals(fit)[1:5, 1:5] head(residuals(fit, lambda=0.1))
data(Prostate) X <- Prostate$X y <- Prostate$y fit <- ncvreg(X, y) residuals(fit)[1:5, 1:5] head(residuals(fit, lambda=0.1))
Accepts a design matrix and returns a standardized version of that matrix (i.e., each column will have mean 0 and mean sum of squares equal to 1).
std(X, Xnew)
std(X, Xnew)
X |
A matrix (or object that can be coerced to a matrix, such as a data frame or numeric vector). |
Xnew |
Optional. If supplied, |
This function centers and scales each column of X
so that
and
for all j. This is usually not necessary to call directly, as ncvreg internally
standardizes the design matrix, but inspection of the standardized design matrix
can sometimes be useful. This differs from the base R function scale()
in two ways:
scale()
uses the sample standard deviation sqrt(sum(x^2)/(n-1))
, while std()
uses the root-mean-square standard deviation sqrt(mean(sum(x^2))
without the correction
std
is faster.
The standardized design matrix, with the following attribues:
center , scale
|
mean and standard deviation used to scale the columns |
nonsingular |
A vector indicating which columns of the original design matrix were able to be standardized (constant columns cannot be standardized to have a standard deviation of 1) |
data(Prostate) S <- std(Prostate$X) apply(S, 2, sum) apply(S, 2, function(x) mean(x^2)) # Standardizing new observations X1 <- Prostate$X[1:90,] X2 <- Prostate$X[91:97,] S <- std(X1) head(std(S, X2)) # Useful if you fit to a standardized X, but then get new obs: y <- Prostate$y[1:90] fit <- ncvreg(S, y) predict(fit, std(S, X2), lambda=0.1) # Same as predict(ncvreg(X1, y), X2, lambda=0.1)
data(Prostate) S <- std(Prostate$X) apply(S, 2, sum) apply(S, 2, function(x) mean(x^2)) # Standardizing new observations X1 <- Prostate$X[1:90,] X2 <- Prostate$X[91:97,] S <- std(X1) head(std(S, X2)) # Useful if you fit to a standardized X, but then get new obs: y <- Prostate$y[1:90] fit <- ncvreg(S, y) predict(fit, std(S, X2), lambda=0.1) # Same as predict(ncvreg(X1, y), X2, lambda=0.1)
Summary method for cv.ncvreg
objects
## S3 method for class 'cv.ncvreg' summary(object, ...) ## S3 method for class 'summary.cv.ncvreg' print(x, digits, ...)
## S3 method for class 'cv.ncvreg' summary(object, ...) ## S3 method for class 'summary.cv.ncvreg' print(x, digits, ...)
object |
A |
... |
Further arguments passed to or from other methods. |
x |
A |
digits |
Number of digits past the decimal point to print out. Can be a vector specifying different display digits for each of the five non-integer printed values. |
An object with S3 class summary.cv.ncvreg
. The class has its own
print method and contains the following list elements:
The penalty used by ncvreg
.
Either "linear"
or "logistic"
, depending on the family
option in ncvreg
.
Number of instances
Number of regression coefficients (not including the intercept).
The index of lambda
with the smallest cross-validation error.
The sequence of lambda
values used by cv.ncvreg
.
Cross-validation error (deviance).
Proportion of variance explained by the model, as estimated by cross-validation. For models outside of linear regression, the Cox-Snell approach to defining R-squared is used.
Signal to noise ratio, as estimated by cross-validation.
For linear regression models, the scale parameter estimate.
For logistic regression models, the prediction error (misclassification error).
Patrick Breheny
Breheny P and Huang J. (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5: 232-253. doi:10.1214/10-AOAS388
ncvreg()
, cv.ncvreg()
, plot.cv.ncvreg()
# Linear regression -------------------------------------------------- data(Prostate) cvfit <- cv.ncvreg(Prostate$X, Prostate$y) summary(cvfit) # Logistic regression ------------------------------------------------ data(Heart) cvfit <- cv.ncvreg(Heart$X, Heart$y, family="binomial") summary(cvfit) # Cox regression ----------------------------------------------------- data(Lung) cvfit <- cv.ncvsurv(Lung$X, Lung$y) summary(cvfit)
# Linear regression -------------------------------------------------- data(Prostate) cvfit <- cv.ncvreg(Prostate$X, Prostate$y) summary(cvfit) # Logistic regression ------------------------------------------------ data(Heart) cvfit <- cv.ncvreg(Heart$X, Heart$y, family="binomial") summary(cvfit) # Cox regression ----------------------------------------------------- data(Lung) cvfit <- cv.ncvsurv(Lung$X, Lung$y) summary(cvfit)
Inferential summaries for ncvreg
and ncvsurv
objects based on local marginal false discovery rates.
## S3 method for class 'ncvreg' summary(object, lambda, which, number, cutoff, sort = TRUE, sigma, ...) ## S3 method for class 'summary.ncvreg' print(x, digits, ...)
## S3 method for class 'ncvreg' summary(object, lambda, which, number, cutoff, sort = TRUE, sigma, ...) ## S3 method for class 'summary.ncvreg' print(x, digits, ...)
object |
An |
lambda |
The regularization parameter value at which inference should be reported. |
which |
Alternatively, |
number |
By default, |
cutoff |
Alternatively, specifying for example |
sort |
Should the results be sorted by |
sigma |
For linear regression models, users can supply an estimate of the residual standard deviation. The default is to use RSS / DF, where degrees of freedom are approximated using the number of nonzero coefficients. |
... |
Further arguments; in particular, if you have set |
x |
A |
digits |
Number of digits past the decimal point to print out. Can be a vector specifying different display digits for each of the five non-integer printed values. |
An object with S3 class summary.ncvreg
. The class has its own
print method and contains the following list elements:
penalty |
The penalty used by |
model |
Either |
n |
Number of instances. |
p |
Number of regression coefficients (not including the intercept). |
lambda |
The |
nvars |
The number of nonzero coefficients (again, not including the intercept) at that value of |
table |
A table containing estimates, normalized test statistics (z), and an estimate of the local mfdr for each coefficient. The mfdr may be loosely interpreted, in an empirical Bayes sense, as the probability that the given feature is null. |
unpen.table |
If there are any unpenalized coefficients, a separate inferential summary is given for them. Currently, this is based on |
Patrick Breheny [email protected]
ncvreg()
, cv.ncvreg()
, plot.cv.ncvreg()
, local_mfdr()
# Linear regression -------------------------------------------------- data(Prostate) fit <- ncvreg(Prostate$X, Prostate$y) summary(fit, lambda=0.08) # Logistic regression ------------------------------------------------ data(Heart) fit <- ncvreg(Heart$X, Heart$y, family="binomial") summary(fit, lambda=0.05) # Cox regression ----------------------------------------------------- data(Lung) fit <- ncvsurv(Lung$X, Lung$y) summary(fit, lambda=0.1) # Options ------------------------------------------------------------ fit <- ncvreg(Heart$X, Heart$y, family="binomial") summary(fit, lambda=0.08, number=3) summary(fit, lambda=0.08, number=Inf) summary(fit, lambda=0.08, cutoff=0.5) summary(fit, lambda=0.08, number=3, cutoff=0.5) summary(fit, lambda=0.08, number=5, cutoff=0.1) summary(fit, lambda=0.08, number=Inf, sort=FALSE) summary(fit, lambda=0.08, number=3, cutoff=0.5, sort=FALSE) # If X and y are not returned with the fit, they must be supplied fit <- ncvreg(Heart$X, Heart$y, family="binomial", returnX=FALSE) summary(fit, X=Heart$X, y=Heart$y, lambda=0.08)
# Linear regression -------------------------------------------------- data(Prostate) fit <- ncvreg(Prostate$X, Prostate$y) summary(fit, lambda=0.08) # Logistic regression ------------------------------------------------ data(Heart) fit <- ncvreg(Heart$X, Heart$y, family="binomial") summary(fit, lambda=0.05) # Cox regression ----------------------------------------------------- data(Lung) fit <- ncvsurv(Lung$X, Lung$y) summary(fit, lambda=0.1) # Options ------------------------------------------------------------ fit <- ncvreg(Heart$X, Heart$y, family="binomial") summary(fit, lambda=0.08, number=3) summary(fit, lambda=0.08, number=Inf) summary(fit, lambda=0.08, cutoff=0.5) summary(fit, lambda=0.08, number=3, cutoff=0.5) summary(fit, lambda=0.08, number=5, cutoff=0.1) summary(fit, lambda=0.08, number=Inf, sort=FALSE) summary(fit, lambda=0.08, number=3, cutoff=0.5, sort=FALSE) # If X and y are not returned with the fit, they must be supplied fit <- ncvreg(Heart$X, Heart$y, family="binomial", returnX=FALSE) summary(fit, X=Heart$X, y=Heart$y, lambda=0.08)