We’ll assume you have compiled the xwas
library from latest Github source. If you are using our provided Docker container then you can assume this is true. The two commands below will install in your R environment if necessary.
> library(devtools)
> install_github('chiragjp/xwas')
If you performed a git checkout
of the code itself and ran your own build then the default build puts the library into ~/.R_libs/xwas/
and assumes your R library is set accordingly. The command below outputs the search path for R packages.
.libPaths()
## [1] "/Users/npho/.R_libs"
## [2] "/Library/Frameworks/R.framework/Versions/3.2/Resources/library"
Now you can load the xwas
package and are good to go.
library(xwas)
## Loading required package: parallel
## Loading required package: survey
## Loading required package: grid
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
## Loading required package: survival
## Warning: package 'survival' was built under R version 3.2.5
##
## Attaching package: 'xwas'
## The following object is masked from 'package:stats':
##
## qqplot
Further validate it was successfully loaded by checking the search()
output.
search()
## [1] ".GlobalEnv" "package:xwas" "package:survival"
## [4] "package:survey" "package:grid" "package:parallel"
## [7] "package:stats" "package:graphics" "package:grDevices"
## [10] "package:utils" "package:datasets" "package:methods"
## [13] "Autoloads" "package:base"
The xwas
library comes with some data, load included NHANES datasets for downstream analyses.
data(nhanes)
Figure 1: the two data.frames
when you load the nhanes
data set are MainTable
and VarDescription
.
The data set is divided among two data.frame
s as diagrammed in Figure 1
. The MainTable
has all the individuals represented by a row and columns with the co-variates. The VarDescription
table has the description of what the co-variates are from the columns in MainTable
.
Let’s look at telomere length and see what environental exposures could influence it.
First, what is the outcome variable codified as? Let’s refer to the VarDescription
table.
VarDescription[grep("telomere", VarDescription$var_desc, ignore.case=T), ]
## tab_name tab_desc series module
## 31731 telo_a Telomere Mean and Standard Deviation 1999-2000 laboratory
## 31741 telo_b Telomere Mean and Standard Deviation 2001-2002 laboratory
## var var_desc analyzable binary_ref_group
## 31731 TELOMEAN Mean Telomere Length 1 NA
## 31741 TELOMEAN Mean Telomere Length 1 NA
## comment_var version_date is_comment is_weight is_questionnaire
## 31731 <NA> 2014-11-19 0 0 0
## 31741 <NA> 2014-11-19 0 0 0
## is_ecological is_binary is_ordinal categorical_ref_group
## 31731 0 0 0 NA
## 31741 0 0 0 NA
## categorical_levels N age_range_low age_range_high age_median
## 31731 <NA> 7827 19 85 47
## 31741 <NA> 7827 19 85 47
## female_N female_pct white_N white_pct mexican_N mexican_pct black_N
## 31731 4056 0.5182062 3965 0.5065798 1876 0.2396831 1333
## 31741 4056 0.5182062 3965 0.5065798 1876 0.2396831 1333
## black_pct other_hispanic_N other_hispanic_pct other_N other_pct
## 31731 0.1703079 417 0.05327712 236 0.03015204
## 31741 0.1703079 417 0.05327712 236 0.03015204
## series_num category sub_category
## 31731 1 aging <NA>
## 31741 2 aging <NA>
Mean telomere length is codified as TELOMEAN
in the MainTable
by looking at the VarDescription$var
column. From the VarDescription$series
column we can also see that the 1999-2000 and 2001-2002 cohorts are associated with this measurement.
Now lets conduct an EWAS on TELOMEAN
, focusing only on variables present in these two cohorts where the mean telomere length (TELOMEAN
) was measured.
VarDescription.telo <- subset(VarDescription, series == '1999-2000' | series == '2001-2002')
Quick check on what this looks like by tabulating the data.frame
.
sort(table(VarDescription.telo$category))
##
## aging cotinine cognitive functioning
## 2 2 4
## sexual behavior immunization social support
## 4 6 6
## alcohol use blood pressure street drug
## 8 8 8
## polyflourochemicals phytoestrogens dioxins
## 11 12 13
## diakyl hormone smoking family
## 14 14 16
## housing furans phthalates
## 18 19 19
## nutrients viral infection physical fitness
## 26 26 30
## hydrocarbons body measures blood
## 37 38 40
## heavy metals occupation volatile compounds
## 42 42 51
## bacterial infection smoking behavior pcbs
## 53 55 59
## pesticides disease biochemistry
## 64 80 93
## supplement use pharmaceutical food component recall
## 128 252 256
pie(sort(table(VarDescription.telo$category)))
The top two categories are “food component recall” and “pharmaceutical” with 256 and 252 entries, respectively. Let us focus our EWAS on the environmental exposures.
useTheseCategories <- c('dioxins', 'furans', 'diakyl', 'heavy metals', 'hormone', 'hydrocarbons', 'pcbs', 'pesticides', 'physical fitness', 'phytoestrogens', 'polyflourochemicals', 'smoking behavior', 'smoking family', 'volatile compounds')
useTheseCategories <- sort(useTheseCategories) # store pre-sorted for easier processing
VarDescription.telo <- VarDescription.telo[VarDescription.telo$category %in% useTheseCategories, ]
Another inspection will confirm that we’re looking at much more tractable set of data.
pie(sort(table(VarDescription.telo$category)))
sort(table(VarDescription.telo$category))
##
## polyflourochemicals phytoestrogens dioxins
## 11 12 13
## diakyl hormone smoking family
## 14 14 16
## furans physical fitness hydrocarbons
## 19 30 37
## heavy metals volatile compounds smoking behavior
## 42 51 55
## pcbs pesticides
## 59 64
The top two categories are now “pesticides” and “pcbs” with 64 and 59 entries, respectively. We’re close to performing the actual EWAS but will now define specific co-variates to adjust for in each regression. The vector is defined below as adjustfor
and then the combined with unique co-variates in our cohort as well as the dependent variable of mean telomere length we are interested in studying (TELOMEAN
).
We are also going to derive a new variable RIDAGEYR2
which is the age-squared to test for non-linear effects associated with age.
MainTable$RIDAGEYR2 <- MainTable$RIDAGEYR^2
The newData
variable will contain all the pertinent records for analysis. However, the data is right-skewed so we both scale and log transform the data into a new object.
adjustfor <- c('RIDAGEYR', 'RIDAGEYR2', 'female', 'mexican', 'black', 'other_eth', 'other_hispanic', 'INDFMPIR')
variablesToKeep <- c(unique(VarDescription.telo$var), 'TELOMEAN', adjustfor, 'SDMVPSU', 'SDMVSTRA', 'WTMEC2YR', 'WTMEC4YR')
newData <- subset(MainTable, SDDSRVYR <= 2)[, variablesToKeep] ## only keep those that are
newLogData <- cbind(newData[, is.categorical(newData, lower=0)], apply(newData[, !is.categorical(newData, lower=0)], 2, function(x) { scale(log(x+1e-10)) })) # log transform the data
The xwas
library also has integrated support for the survey
package to perform survey-weighted analysis. You will have to define your own survey::svydesign
objects but it’s taken as an argument to most xwas
functions. We define our study in the dsn
variable then perform an EWAS both with and with-out the survey design for comparison.
dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weight=~WTMEC4YR, nest=T, data=subset(newLogData, WTMEC4YR > 0)) # create the design object
telo.norm <- xwas(data=subset(newLogData, WTMEC4YR > 0), depvar="TELOMEAN", adjvars=adjustfor, verbose=F) # ewas WITHOUT survey design
## <simpleError in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases>
## <simpleError in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases>
telo.surv <- xwas(data=subset(newLogData, WTMEC4YR > 0), depvar="TELOMEAN", adjvars=adjustfor, design=dsn, verbose=F) # ewas WITH survey design
## Warning in glm.fit(x = structure(numeric(0), .Dim = c(0L, 10L), .Dimnames =
## list(: no observations informative at iteration 1
## Warning: glm.fit: algorithm did not converge
## <simpleError in glm.fit(x = structure(numeric(0), .Dim = c(0L, 10L), .Dimnames = list( NULL, c("(Intercept)", "I(scale(SMD100MN))", "RIDAGEYR", "RIDAGEYR2", "female", "mexican", "black", "other_eth", "other_hispanic", "INDFMPIR")), assign = 0:9), y = structure(numeric(0), class = "AsIs", "`scaled:center`" = 0.00244357177919902, "`scaled:scale`" = 0.999429581414316), weights = numeric(0), start = NULL, etastart = NULL, mustart = NULL, offset = NULL, family = structure(list(family = "gaussian", link = "identity", linkfun = function (mu) mu, linkinv = function (eta) eta, variance = function (mu) rep.int(1, length(mu)), dev.resids = function (y, mu, wt) wt * ((y - mu)^2), aic = function (y, n, mu, wt, dev) { nobs <- length(y) nobs * (log(dev/nobs * 2 * pi) + 1) + 2 - sum(log(wt)) }, mu.eta = function (eta) rep.int(1, length(eta)), initialize = expression({ n <- rep.int(1, nobs) if (is.null(etastart) && is.null(start) && is.null(mustart) && ((family$link == "inverse" && any(y == 0)) || (family$link == "log" && any(y <= 0)))) stop("cannot find valid starting values: please specify some") mustart <- y }), validmu = function (mu) TRUE, valideta = function (eta) TRUE), .Names = c("family", "link", "linkfun", "linkinv", "variance", "dev.resids", "aic", "mu.eta", "initialize", "validmu", "valideta"), class = "family"), control = structure(list( epsilon = 1e-08, maxit = 25, trace = FALSE), .Names = c("epsilon", "maxit", "trace")), intercept = TRUE): object 'fit' not found>
## Warning in glm.fit(x = structure(numeric(0), .Dim = c(0L, 10L), .Dimnames =
## list(: no observations informative at iteration 1
## Warning in glm.fit(x = structure(numeric(0), .Dim = c(0L, 10L), .Dimnames =
## list(: glm.fit: algorithm did not converge
## <simpleError in glm.fit(x = structure(numeric(0), .Dim = c(0L, 10L), .Dimnames = list( NULL, c("(Intercept)", "I(scale(SMD100FL))", "RIDAGEYR", "RIDAGEYR2", "female", "mexican", "black", "other_eth", "other_hispanic", "INDFMPIR")), assign = 0:9), y = structure(numeric(0), class = "AsIs", "`scaled:center`" = 0.00244357177919902, "`scaled:scale`" = 0.999429581414316), weights = numeric(0), start = NULL, etastart = NULL, mustart = NULL, offset = NULL, family = structure(list(family = "gaussian", link = "identity", linkfun = function (mu) mu, linkinv = function (eta) eta, variance = function (mu) rep.int(1, length(mu)), dev.resids = function (y, mu, wt) wt * ((y - mu)^2), aic = function (y, n, mu, wt, dev) { nobs <- length(y) nobs * (log(dev/nobs * 2 * pi) + 1) + 2 - sum(log(wt)) }, mu.eta = function (eta) rep.int(1, length(eta)), initialize = expression({ n <- rep.int(1, nobs) if (is.null(etastart) && is.null(start) && is.null(mustart) && ((family$link == "inverse" && any(y == 0)) || (family$link == "log" && any(y <= 0)))) stop("cannot find valid starting values: please specify some") mustart <- y }), validmu = function (mu) TRUE, valideta = function (eta) TRUE), .Names = c("family", "link", "linkfun", "linkinv", "variance", "dev.resids", "aic", "mu.eta", "initialize", "validmu", "valideta"), class = "family"), control = structure(list( epsilon = 1e-08, maxit = 25, trace = FALSE), .Names = c("epsilon", "maxit", "trace")), intercept = TRUE): object 'fit' not found>
As you can see the EWAS with survey-weighted analysis takes longer to perform but it is technically more accurate as you can see within the visualizations below.
Now that we have performed the heavy lifting of EWAS analysis we can use the packaged visualization functions, specifically manhattan
, volcano
, and qqplot
to gather insights into the relationship of environmental exposures to mean telomere length (TELOMEAN
).
par(mfrow=c(1,1), xpd=TRUE) # setup to plot outside of area if necessary
First look at a Manhattan plot of the telomere length without survey-weighted analysis.
manhattan(telo.norm)
Since we looked at the co-variates by sub-categories of environmental exposure it also makes sense to show this visually. If we provided a mapping of variables to categories the Manhattan plot can show this.
groups <- subset(VarDescription, category %in% useTheseCategories)[, c("category", "var")]
groups <- unique(groups)
Now by passing in this additional information we can show a more visually clear relationship. Let’s look at the telomere study without survey-weighted analysis.
manhattan(telo.norm, group.by=groups, legend=T, xlab="", ylim=c(0,14), nlabels=5)
Now we can compare this to the study done with survey-weighted analysis. You can see that there is less noise in the results and the significant co-variates are clearer.
manhattan(telo.surv, group.by=groups, legend=T, xlab="", ylim=c(0,14), nlabels=5)
Given the results, let’s convert to a more readable table and codify all the variables to their descriptions. We’ll just print out the top 10 results by significance.
top <- xwas.to.df(telo.surv) # convert xwas results to a data.frame
# create a table with the labels and descriptions for each variable used
labels <- subset(VarDescription, category %in% useTheseCategories)[, c("var", "var_desc", "category")]
labels <- labels[!duplicated(labels$var), ]
rownames(labels) <- labels$var
# match variables from the xwas output to the labels object
top <- merge(top, labels, all.x=TRUE, by="row.names")
rownames(top) <- top$Row.names
top <- top[, -which(colnames(top) %in% c("Row.names", "var"))]
top <- top[order(top$q), ]
# show the top 10 results with p-values and descriptions
head(top, n=10)
## p x q
## LBXBCD 0.0001500260 -0.06771447 0.03675636
## LBX138 0.0001689830 0.14720785 0.04123185
## LBXHXC 0.0003406962 0.14354116 0.08278917
## LBX180 0.0004093170 0.16330759 0.09905470
## WTMEC4YR 0.0004678815 0.12921712 0.11275945
## CVDVOMAX 0.0005148082 0.09742866 0.12355396
## LBX099 0.0006243116 0.11235434 0.14921046
## LBX153 0.0008250694 0.15831092 0.19636652
## URXACE 0.0009448780 0.07508905 0.22393607
## LBX170 0.0010012468 0.11929804 0.23629425
## var_desc category
## LBXBCD Cadmium (ug/L) heavy metals
## LBX138 PCB138 & 158 (ng/g) pcbs
## LBXHXC 3,3,4,4,5,5-hxcb (fg/g) pcbs
## LBX180 PCB180 (ng/g) pcbs
## WTMEC4YR <NA> <NA>
## CVDVOMAX Predicted VO2max (ml/kg/min) physical fitness
## LBX099 PCB99 (ng/g) pcbs
## LBX153 PCB153 (ng/g) pcbs
## URXACE Acetochlor mercapturate (ug/L) result pesticides
## LBX170 PCB170 (ng/g) pcbs
Another xwas
visualization is the volcano plot that allows you to quickly discern both significance and effect size.
volcano(telo.norm, xlim=c(-.2,.2), xlab="Association Size")
Notice how the visualizations change with the survey-weighted analysis results.
volcano(telo.surv, xlim=c(-.2,.2), xlab="Association Size")
We can also compare our output above to Figure 2 from the paper we’re replicating.
Figure 2: this is grabbed from Figure 2 from the paper we’re replicating for comparison, link above.
An xwas
visualization to determine significance is the qqplot
. Any deviation from the line y=x
indicates true signal in the associations.
qqplot(telo.norm, xlim=c(0,14), ylim=c(0,14))
Now with the survey-weighted data you can see that the signal:noise ratio is much more distinct.
qqplot(telo.surv, xlim=c(0,14), ylim=c(0,14))
Additional visualizations and other features coming soon in future releases.