xwas demo

1. Setup

We’ll assume you have compiled the xwas library from latest Github source. If you are using our provided Docker container then you can assume this is true. The two commands below will install in your R environment if necessary.

> library(devtools)
> install_github('chiragjp/xwas')

If you performed a git checkout of the code itself and ran your own build then the default build puts the library into ~/.R_libs/xwas/ and assumes your R library is set accordingly. The command below outputs the search path for R packages.

.libPaths()

## [1] "/Users/npho/.R_libs"                                           
## [2] "/Library/Frameworks/R.framework/Versions/3.2/Resources/library"

Now you can load the xwas package and are good to go.

library(xwas)

## Loading required package: parallel

## Loading required package: survey

## Loading required package: grid

## 
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
## 
##     dotchart

## Loading required package: survival

## Warning: package 'survival' was built under R version 3.2.5

## 
## Attaching package: 'xwas'

## The following object is masked from 'package:stats':
## 
##     qqplot

Further validate it was successfully loaded by checking the search() output.

search()

##  [1] ".GlobalEnv"        "package:xwas"      "package:survival" 
##  [4] "package:survey"    "package:grid"      "package:parallel" 
##  [7] "package:stats"     "package:graphics"  "package:grDevices"
## [10] "package:utils"     "package:datasets"  "package:methods"  
## [13] "Autoloads"         "package:base"

2. Package Survey

The xwas library comes with some data, load included NHANES datasets for downstream analyses.

data(nhanes)

Figure 1: the two data.frames when you load the nhanes data set are MainTable and VarDescription.

The data set is divided among two data.frames as diagrammed in Figure 1. The MainTable has all the individuals represented by a row and columns with the co-variates. The VarDescription table has the description of what the co-variates are from the columns in MainTable.

3. Perform Regression Analysis

Let’s look at telomere length and see what environental exposures could influence it.

First, what is the outcome variable codified as? Let’s refer to the VarDescription table.

VarDescription[grep("telomere", VarDescription$var_desc, ignore.case=T), ]

##       tab_name                             tab_desc    series     module
## 31731   telo_a Telomere Mean and Standard Deviation 1999-2000 laboratory
## 31741   telo_b Telomere Mean and Standard Deviation 2001-2002 laboratory
##            var             var_desc analyzable binary_ref_group
## 31731 TELOMEAN Mean Telomere Length          1               NA
## 31741 TELOMEAN Mean Telomere Length          1               NA
##       comment_var version_date is_comment is_weight is_questionnaire
## 31731        <NA>   2014-11-19          0         0                0
## 31741        <NA>   2014-11-19          0         0                0
##       is_ecological is_binary is_ordinal categorical_ref_group
## 31731             0         0          0                    NA
## 31741             0         0          0                    NA
##       categorical_levels    N age_range_low age_range_high age_median
## 31731               <NA> 7827            19             85         47
## 31741               <NA> 7827            19             85         47
##       female_N female_pct white_N white_pct mexican_N mexican_pct black_N
## 31731     4056  0.5182062    3965 0.5065798      1876   0.2396831    1333
## 31741     4056  0.5182062    3965 0.5065798      1876   0.2396831    1333
##       black_pct other_hispanic_N other_hispanic_pct other_N  other_pct
## 31731 0.1703079              417         0.05327712     236 0.03015204
## 31741 0.1703079              417         0.05327712     236 0.03015204
##       series_num category sub_category
## 31731          1    aging         <NA>
## 31741          2    aging         <NA>

Mean telomere length is codified as TELOMEAN in the MainTable by looking at the VarDescription$var column. From the VarDescription$series column we can also see that the 1999-2000 and 2001-2002 cohorts are associated with this measurement.

Now lets conduct an EWAS on TELOMEAN, focusing only on variables present in these two cohorts where the mean telomere length (TELOMEAN) was measured.

VarDescription.telo <- subset(VarDescription, series == '1999-2000' | series == '2001-2002')

Quick check on what this looks like by tabulating the data.frame.

sort(table(VarDescription.telo$category))

## 
##                 aging              cotinine cognitive functioning 
##                     2                     2                     4 
##       sexual behavior          immunization        social support 
##                     4                     6                     6 
##           alcohol use        blood pressure           street drug 
##                     8                     8                     8 
##   polyflourochemicals        phytoestrogens               dioxins 
##                    11                    12                    13 
##                diakyl               hormone        smoking family 
##                    14                    14                    16 
##               housing                furans            phthalates 
##                    18                    19                    19 
##             nutrients       viral infection      physical fitness 
##                    26                    26                    30 
##          hydrocarbons         body measures                 blood 
##                    37                    38                    40 
##          heavy metals            occupation    volatile compounds 
##                    42                    42                    51 
##   bacterial infection      smoking behavior                  pcbs 
##                    53                    55                    59 
##            pesticides               disease          biochemistry 
##                    64                    80                    93 
##        supplement use        pharmaceutical food component recall 
##                   128                   252                   256

pie(sort(table(VarDescription.telo$category)))

The top two categories are “food component recall” and “pharmaceutical” with 256 and 252 entries, respectively. Let us focus our EWAS on the environmental exposures.

useTheseCategories <- c('dioxins', 'furans', 'diakyl', 'heavy metals', 'hormone', 'hydrocarbons', 'pcbs', 'pesticides', 'physical fitness', 'phytoestrogens', 'polyflourochemicals', 'smoking behavior', 'smoking family', 'volatile compounds')
useTheseCategories <- sort(useTheseCategories) # store pre-sorted for easier processing

VarDescription.telo <- VarDescription.telo[VarDescription.telo$category %in% useTheseCategories, ]

Another inspection will confirm that we’re looking at much more tractable set of data.

pie(sort(table(VarDescription.telo$category)))

sort(table(VarDescription.telo$category))

## 
## polyflourochemicals      phytoestrogens             dioxins 
##                  11                  12                  13 
##              diakyl             hormone      smoking family 
##                  14                  14                  16 
##              furans    physical fitness        hydrocarbons 
##                  19                  30                  37 
##        heavy metals  volatile compounds    smoking behavior 
##                  42                  51                  55 
##                pcbs          pesticides 
##                  59                  64

The top two categories are now “pesticides” and “pcbs” with 64 and 59 entries, respectively. We’re close to performing the actual EWAS but will now define specific co-variates to adjust for in each regression. The vector is defined below as adjustfor and then the combined with unique co-variates in our cohort as well as the dependent variable of mean telomere length we are interested in studying (TELOMEAN).

We are also going to derive a new variable RIDAGEYR2 which is the age-squared to test for non-linear effects associated with age.

MainTable$RIDAGEYR2 <- MainTable$RIDAGEYR^2

The newData variable will contain all the pertinent records for analysis. However, the data is right-skewed so we both scale and log transform the data into a new object.

adjustfor <- c('RIDAGEYR', 'RIDAGEYR2', 'female', 'mexican', 'black', 'other_eth', 'other_hispanic', 'INDFMPIR')
variablesToKeep <- c(unique(VarDescription.telo$var), 'TELOMEAN', adjustfor, 'SDMVPSU', 'SDMVSTRA', 'WTMEC2YR', 'WTMEC4YR')
newData <- subset(MainTable, SDDSRVYR <= 2)[, variablesToKeep] ## only keep those that are 

newLogData <- cbind(newData[, is.categorical(newData, lower=0)], apply(newData[, !is.categorical(newData, lower=0)], 2, function(x) { scale(log(x+1e-10)) })) # log transform the data

The xwas library also has integrated support for the survey package to perform survey-weighted analysis. You will have to define your own survey::svydesign objects but it’s taken as an argument to most xwas functions. We define our study in the dsn variable then perform an EWAS both with and with-out the survey design for comparison.

dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weight=~WTMEC4YR, nest=T, data=subset(newLogData, WTMEC4YR > 0)) # create the design object

telo.norm <- xwas(data=subset(newLogData, WTMEC4YR > 0), depvar="TELOMEAN", adjvars=adjustfor, verbose=F) # ewas WITHOUT survey design

## <simpleError in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases>
## <simpleError in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases>

telo.surv <- xwas(data=subset(newLogData, WTMEC4YR > 0), depvar="TELOMEAN", adjvars=adjustfor, design=dsn, verbose=F) # ewas WITH survey design

## Warning in glm.fit(x = structure(numeric(0), .Dim = c(0L, 10L), .Dimnames =
## list(: no observations informative at iteration 1

## Warning: glm.fit: algorithm did not converge

## <simpleError in glm.fit(x = structure(numeric(0), .Dim = c(0L, 10L), .Dimnames = list(    NULL, c("(Intercept)", "I(scale(SMD100MN))", "RIDAGEYR",     "RIDAGEYR2", "female", "mexican", "black", "other_eth", "other_hispanic",     "INDFMPIR")), assign = 0:9), y = structure(numeric(0), class = "AsIs", "`scaled:center`" = 0.00244357177919902, "`scaled:scale`" = 0.999429581414316),     weights = numeric(0), start = NULL, etastart = NULL, mustart = NULL,     offset = NULL, family = structure(list(family = "gaussian",         link = "identity", linkfun = function (mu)         mu, linkinv = function (eta)         eta, variance = function (mu)         rep.int(1, length(mu)), dev.resids = function (y, mu,             wt)         wt * ((y - mu)^2), aic = function (y, n, mu, wt, dev)         {            nobs <- length(y)            nobs * (log(dev/nobs * 2 * pi) + 1) + 2 - sum(log(wt))        }, mu.eta = function (eta)         rep.int(1, length(eta)), initialize = expression({            n <- rep.int(1, nobs)            if (is.null(etastart) && is.null(start) && is.null(mustart) &&                 ((family$link == "inverse" && any(y == 0)) ||                   (family$link == "log" && any(y <= 0))))                 stop("cannot find valid starting values: please specify some")            mustart <- y        }), validmu = function (mu)         TRUE, valideta = function (eta)         TRUE), .Names = c("family", "link", "linkfun", "linkinv",     "variance", "dev.resids", "aic", "mu.eta", "initialize",     "validmu", "valideta"), class = "family"), control = structure(list(        epsilon = 1e-08, maxit = 25, trace = FALSE), .Names = c("epsilon",     "maxit", "trace")), intercept = TRUE): object 'fit' not found>

## Warning in glm.fit(x = structure(numeric(0), .Dim = c(0L, 10L), .Dimnames =
## list(: no observations informative at iteration 1

## Warning in glm.fit(x = structure(numeric(0), .Dim = c(0L, 10L), .Dimnames =
## list(: glm.fit: algorithm did not converge

## <simpleError in glm.fit(x = structure(numeric(0), .Dim = c(0L, 10L), .Dimnames = list(    NULL, c("(Intercept)", "I(scale(SMD100FL))", "RIDAGEYR",     "RIDAGEYR2", "female", "mexican", "black", "other_eth", "other_hispanic",     "INDFMPIR")), assign = 0:9), y = structure(numeric(0), class = "AsIs", "`scaled:center`" = 0.00244357177919902, "`scaled:scale`" = 0.999429581414316),     weights = numeric(0), start = NULL, etastart = NULL, mustart = NULL,     offset = NULL, family = structure(list(family = "gaussian",         link = "identity", linkfun = function (mu)         mu, linkinv = function (eta)         eta, variance = function (mu)         rep.int(1, length(mu)), dev.resids = function (y, mu,             wt)         wt * ((y - mu)^2), aic = function (y, n, mu, wt, dev)         {            nobs <- length(y)            nobs * (log(dev/nobs * 2 * pi) + 1) + 2 - sum(log(wt))        }, mu.eta = function (eta)         rep.int(1, length(eta)), initialize = expression({            n <- rep.int(1, nobs)            if (is.null(etastart) && is.null(start) && is.null(mustart) &&                 ((family$link == "inverse" && any(y == 0)) ||                   (family$link == "log" && any(y <= 0))))                 stop("cannot find valid starting values: please specify some")            mustart <- y        }), validmu = function (mu)         TRUE, valideta = function (eta)         TRUE), .Names = c("family", "link", "linkfun", "linkinv",     "variance", "dev.resids", "aic", "mu.eta", "initialize",     "validmu", "valideta"), class = "family"), control = structure(list(        epsilon = 1e-08, maxit = 25, trace = FALSE), .Names = c("epsilon",     "maxit", "trace")), intercept = TRUE): object 'fit' not found>

As you can see the EWAS with survey-weighted analysis takes longer to perform but it is technically more accurate as you can see within the visualizations below.

4. Visualization

Now that we have performed the heavy lifting of EWAS analysis we can use the packaged visualization functions, specifically manhattan, volcano, and qqplot to gather insights into the relationship of environmental exposures to mean telomere length (TELOMEAN).

par(mfrow=c(1,1), xpd=TRUE) # setup to plot outside of area if necessary

First look at a Manhattan plot of the telomere length without survey-weighted analysis.

manhattan(telo.norm)

Since we looked at the co-variates by sub-categories of environmental exposure it also makes sense to show this visually. If we provided a mapping of variables to categories the Manhattan plot can show this.

groups <- subset(VarDescription, category %in% useTheseCategories)[, c("category", "var")]
groups <- unique(groups)

Now by passing in this additional information we can show a more visually clear relationship. Let’s look at the telomere study without survey-weighted analysis.

manhattan(telo.norm, group.by=groups, legend=T, xlab="", ylim=c(0,14), nlabels=5)

Now we can compare this to the study done with survey-weighted analysis. You can see that there is less noise in the results and the significant co-variates are clearer.

manhattan(telo.surv, group.by=groups, legend=T, xlab="", ylim=c(0,14), nlabels=5)

Given the results, let’s convert to a more readable table and codify all the variables to their descriptions. We’ll just print out the top 10 results by significance.

top <- xwas.to.df(telo.surv) # convert xwas results to a data.frame

# create a table with the labels and descriptions for each variable used
labels <- subset(VarDescription, category %in% useTheseCategories)[, c("var", "var_desc", "category")]
labels <- labels[!duplicated(labels$var), ]
rownames(labels) <- labels$var

# match variables from the xwas output to the labels object
top <- merge(top, labels, all.x=TRUE, by="row.names")
rownames(top) <- top$Row.names
top <- top[, -which(colnames(top) %in% c("Row.names", "var"))]
top <- top[order(top$q), ]

# show the top 10 results with p-values and descriptions
head(top, n=10)

##                     p           x          q
## LBXBCD   0.0001500260 -0.06771447 0.03675636
## LBX138   0.0001689830  0.14720785 0.04123185
## LBXHXC   0.0003406962  0.14354116 0.08278917
## LBX180   0.0004093170  0.16330759 0.09905470
## WTMEC4YR 0.0004678815  0.12921712 0.11275945
## CVDVOMAX 0.0005148082  0.09742866 0.12355396
## LBX099   0.0006243116  0.11235434 0.14921046
## LBX153   0.0008250694  0.15831092 0.19636652
## URXACE   0.0009448780  0.07508905 0.22393607
## LBX170   0.0010012468  0.11929804 0.23629425
##                                       var_desc         category
## LBXBCD                          Cadmium (ug/L)     heavy metals
## LBX138                     PCB138 & 158 (ng/g)             pcbs
## LBXHXC                 3,3,4,4,5,5-hxcb (fg/g)             pcbs
## LBX180                           PCB180 (ng/g)             pcbs
## WTMEC4YR                                  <NA>             <NA>
## CVDVOMAX          Predicted VO2max (ml/kg/min) physical fitness
## LBX099                            PCB99 (ng/g)             pcbs
## LBX153                           PCB153 (ng/g)             pcbs
## URXACE   Acetochlor mercapturate (ug/L) result       pesticides
## LBX170                           PCB170 (ng/g)             pcbs

Another xwas visualization is the volcano plot that allows you to quickly discern both significance and effect size.

volcano(telo.norm, xlim=c(-.2,.2), xlab="Association Size")

Notice how the visualizations change with the survey-weighted analysis results.

volcano(telo.surv, xlim=c(-.2,.2), xlab="Association Size")

We can also compare our output above to Figure 2 from the paper we’re replicating.

Figure 2: this is grabbed from Figure 2 from the paper we’re replicating for comparison, link above.

An xwas visualization to determine significance is the qqplot. Any deviation from the line y=x indicates true signal in the associations.

qqplot(telo.norm, xlim=c(0,14), ylim=c(0,14))

Now with the survey-weighted data you can see that the signal:noise ratio is much more distinct.

qqplot(telo.surv, xlim=c(0,14), ylim=c(0,14))

Additional visualizations and other features coming soon in future releases.