survMarkerTwoPhase

Evaluate the prognostic accuracy of a marker under two phase designs.

View on GitHub

This R package computes non-parametric and semi-parametric estimates of accuracy measures for risk prediction markers from survival data under two phase study designs, namely the case-cohort (cch) and the nested case-control (ncc) study designs.

The accuracy measures that can be estimated include the AUC, TPR(c), FPR(c), PPV(c), and NPV(c) for for a specific timepoint and fixed marker cutoff value c. Standard errors, along with normal approximated confidence intervals are also calculated. Note: standard errors for non-parametric estimates from a cch design are currently not available.

Below is a brief tutorial to get you started. Results from a small simulation to serve as package validation are then shown. For more information regarding estimation procedures see the references.

Time-dependent measures of accuracy

For a continuous marker measured at baseline, Y, a marker cutpoint c, failure time T, and a prediction time t, accuracy measures are defined as:

Tutorial

Download and install the package

# download the package from github
if (!require("devtools")) install.packages("devtools")
devtools::install_github("mdbrown/survMarkerTwoPhase")

Alternatively, you can download the source code directly from here https://github.com/mdbrown/survMarkerTwoPhase/releases/

# download the package from local source
install_packages("..(path_to_file).../survMarkerTwoPhase_1.2.tar.gz", repos = NULL)

Once the package is installed, we then load the package and the example data.

library(survMarkerTwoPhase)

# simulated data for illustration
data(SimData)
head(SimData)
##   survTime status        Y
## 1   0.1197      1  1.49310
## 2   1.0231      0 -0.73260
## 3   0.8282      0 -0.50211
## 4   2.0875      1  0.65758
## 5   4.6827      1  1.57806
## 6   0.3001      1  0.02419

Case-cohort design

For the case-cohort design we sample n=150 from entire cohort, and include all participants with observed failures.

# generate a sub-cohort from SimData
set.seed(12321)
# create a sample index. 1 if sampled, 0 if not
N <- nrow(SimData)
sampleInd <- rep(0, N)

# sample all with observed failure time. (200 individuals)
sampleInd[SimData$status == 1] <- 1

# sample 150 more observations from the entire data set without replacement
sampleInd[sample(1:N, 150)] <- 1

table(sampleInd)  #total number of subcohort is 293 
## sampleInd
##   0   1
## 207 293

To estimate accuracy measures for the case-cohort design, sample weights must be calculated. The sample weights w = 1/Pr(Sampled from cohort) for each observation included in the sub-cohort.

cohortData_cch <- SimData

# first calculate the Pr(Sampled from cohort) for each observation
sampleProb <- numeric(500)

# all non-censored observations were sampled, so their sample probability is
# 1
sampleProb[cohortData_cch$status == 1] <- 1
# all other individuals had a 150/N chance to be sampled
sampleProb[cohortData_cch$status == 0] <- 150/N

# the sample weights are 1/(probability of being sampled)
cohortData_cch$weights <- 1/sampleProb

# indicator of inclusion in the subcohort
cohortData_cch$subcohort = sampleInd

# marker data is unavailable for those not in the subcohort
cohortData_cch$Y[sampleInd == 0] = NA

Here we estimate using non-parametric estimation methods using double inverse probability weighting.

# estimate accuracy measures using non-parametric estimates by setting
# estimation.method = 'NP'
survMTP.cch(time = survTime, event = status, marker = Y, weights = weights, 
    subcoh = subcohort, data = cohortData_cch, estimation.method = "NP", predict.time = 2, 
    marker.cutpoint = 0)
## Warning: Standard error calculations are not available for non-parametric
## estimates yet.
##
##  Non-parametric estimates under Case-Cohort study design.
## 
##         estimate     se      lower 0.95  upper 0.95
## 
## AUC        0.749        NA            NA          NA 
## TPR(c)     0.699        NA            NA          NA 
## FPR(c)     0.394        NA            NA          NA 
## PPV(c)     0.331        NA            NA          NA 
## NPV(c)     0.878        NA            NA          NA 
## 
##  marker cutpoint: c = 0

Now estimate measures using semi-parametric estimation methods. These methods assume a Cox proportional hazards model.

# estimate accuracy measures using semi-parametric estimates
survMTP.cch(time = survTime, event = status, marker = Y, weights = weights, 
    subcoh = subcohort, data = cohortData_cch, estimation.method = "SP", predict.time = 2, 
    marker.cutpoint = 0)
##
##  Semi-parametric estimates under Case-Cohort study design.
## 
##         estimate     se      lower 0.95  upper 0.95
## 
## coef       1.088     0.083         0.926       1.249 
## AUC        0.788     0.026         0.731       0.835 
## TPR(c)     0.770     0.032         0.701       0.828 
## FPR(c)     0.347     0.047         0.261       0.444 
## PPV(c)     0.386     0.050         0.294       0.487 
## NPV(c)     0.909     0.022         0.855       0.945 
## 
##  marker cutpoint: c = 0

For more information see ?survMTP.cch.

Nested case-control

For a nested case-control sample, we need to input the full cohort data, with an indicator for inclusion in the subcohort, as well as a dataframe of selected sets, where each row is a case id followed by the corresponding id's that were matched to it.

Generate a nested case-control subcohort using the function ccwc from the Epi package.

require("Epi")

# 2 matched controls for each case
nmatch = 2

cohortData_ncc <- SimData
cohortData_ncc$id = 1:dim(cohortData_ncc)[1]

subcohort_ncc <- ccwc(exit = survTime, fail = status, data = cohortData_ncc, 
    controls = nmatch)  # match 2 controls for each case. 

# indicator for inclusion in the subcohort
sampleInd = rep(0, nrow(cohortData_ncc))  #initialize all equal to zero
sampleInd[subcohort_ncc$Map] = 1

cohortData_ncc$subcohort = sampleInd

table(sampleInd)  #subcohort sample size of 352

# marker data is unavailable for those not in the subcohort
cohortData_ncc$Y[sampleInd == 0] = NA

# now we need to build the  set matrix, which will be dimension (# of
# cases) x (nmatch + 1), so 200x3 here each row denotes a selected set, with the
# case id followed by the matched control ids
Sets <- matrix(nrow = sum(cohortData_ncc$status), ncol = nmatch + 1)

for (i in subcohort_ncc$Set) {
     Sets[i, ] <- unlist(subset(subcohort_ncc, Set == i, select = Map))
}

Now we are ready to calculate the measures using survMTP.ncc. Standard errors for both SP and NP estimates are calculated using perturbation. NP estimates use kernel smoothing while SP estimates assume a Cox proportional hazards model.

# Nonparametric estimates
survMTP.ncc(time = survTime, event = status, marker = Y, subcoh = subcohort, 
    id = id, data = cohortData_ncc, sets = Sets, estimation.method = "NP", 
    predict.time = 2, marker.cutpoint = 0)
##
##  Non-parametric estimates under Nested Case-Control study design.
## 
##         estimate     se      lower 0.95  upper 0.95
## 
## AUC        0.742     0.027         0.687       0.791 
## FPR(c)     0.379     0.032         0.318       0.445 
## TPR(c)     0.684     0.044         0.592       0.763 
## NPV(c)     0.875     0.020         0.831       0.909 
## PPV(c)     0.335     0.033         0.273       0.404 
## 
##  marker cutpoint: c = 0
# Semiparametric estimates
survMTP.ncc(time = survTime, event = status, marker = Y, subcoh = subcohort, 
    id = id, data = cohortData_ncc, sets = Sets, estimation.method = "SP", 
    predict.time = 2, marker.cutpoint = 0)
##
##  Semi-parametric estimates under Nested Case-Control study design.
## 
##         estimate     se      lower 0.95  upper 0.95
## 
## coef       1.017     0.082         0.857       1.177 
## AUC        0.777     0.020         0.736       0.813 
## FPR(c)     0.356     0.030         0.300       0.418 
## TPR(c)     0.755     0.030         0.691       0.808 
## NPV(c)     0.900     0.013         0.871       0.924 
## PPV(c)     0.381     0.032         0.321       0.444 
## 
##  marker cutpoint: c = 0

See ?survMTP.ncc for further help.

Package Validation

To validate the package procedures, a small simulation (1,000 replicates) was run. For each replicate, time-to-event cohort data was simulated with 80% censoring and sample size n=1,000. CCH and NCC sub samples were chosen and accuracy measures where estimated.

Case-cohort

Sampling distribution of estimates under CCH design Non-parametric and semi-parametric estimates under the CCH design. True measure value indicated by the purple vertical lines.

Nested case-control

For nested case-control designs, standard errors where also estimated. To validate these estimates, we compare the sampling distribution of SE estimates to the empirical standard error across simulations of accuracy measure estimates. The table below shows the mean estimate across simulations compared with the true parameter value.

Sampling distribution of estimates under NCC design Non-parametric and semi-parametric estimates of accuracy measures under the NCC design. True measure value indicated by the purple vertical lines. The table below shows the mean estimate across simulations compared with the true parameter value. Sampling distribution of estimates under NCC design Non-parametric and semi-parametric estimates of standard error under the NCC design. Empirical standard error calculated from the simulated measure sampling distribution is shown as the purple vertical dash-lines. A table with mean se across simulations compared with empirical se's is shown below
nonparametric mean(se) nonparametric empirical se semiparametric mean(se) semiparametric empirical se
coef 0.092 0.102
AUC 0.023 0.023 0.018 0.019
FPR 0.035 0.041 0.034 0.038
TPR 0.032 0.034 0.026 0.029
NPV 0.016 0.016 0.013 0.014
PPV 0.030 0.033 0.028 0.030

References

  1. Liu D, Cai T, Zheng Y. Evaluating the predictive value of biomarkers with stratified case-cohort design. Biometrics 2012, 4: 1219-1227.

  2. Pepe MS, Zheng Y, Jin Y. Evaluating the ROC performance of markers for future events. Lifetime Data Analysis. 2008, 14: 86-113.

  3. Zheng Y, Cai T, Pepe MS, Levy, W. Time-dependent predictive values of prognostic biomarkers with failure time outcome. JASA 2008, 103: 362-368.

  4. Cai T. and Zheng Y . Resampling Procedures for Making Inference under Nested Case-control Studies. JASA 2013 (in press).

  5. Cai T and Zheng Y/, Evaluating Prognostic Accuracy of Biomarkers under nested case-control studies. *Biostatistics 2012, 13,1, 89-100.

(* equal contributor and corresponding author).