```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE)
```

## 1. Setup

We'll assume you have compiled the `xwas` library from latest Github source. If you are using our provided Docker container then you can assume this is true. The two commands below will install in your R environment if necessary.

```{r eval=FALSE}
> library(devtools)
> install_github('chiragjp/xwas')
```

If you performed a `git checkout` of the code itself and ran your own build then the default build puts the library into `~/.R_libs/xwas/` and assumes your R library is set accordingly. The command below outputs the search path for R packages.

```{r}
.libPaths()
```

Now you can load the `xwas` package and are good to go.

```{r}
library(xwas)
```

Further validate it was successfully loaded by checking the `search()` output.

```{r}
search()
```

## 2. Package Survey

The `xwas` library comes with some data, load included NHANES datasets for downstream analyses.

```{r}
data(nhanes)
```

![Figure 1: the two `data.frames` when you load the `nhanes` data set are `MainTable` and `VarDescription`.](nhanes.png)

The data set is divided among two `data.frame`s as diagrammed in `Figure 1`. The `MainTable` has all the individuals represented by a row and columns with the co-variates. The `VarDescription` table has the description of what the co-variates are from the columns in `MainTable`.

## 3. Perform Regression Analysis

**Let's look at telomere length and see what environental exposures could influence it.**

First, what is the outcome variable codified as? Let's refer to the `VarDescription` table.

```{r}
VarDescription[grep("telomere", VarDescription$var_desc, ignore.case=T), ]
```

Mean telomere length is codified as `TELOMEAN` in the `MainTable` by looking at the `VarDescription$var` column. From the `VarDescription$series` column we can also see that the 1999-2000 and 2001-2002 cohorts are associated with this measurement.

Now lets conduct an EWAS on `TELOMEAN`, focusing only on variables present in these two cohorts where the mean telomere length (`TELOMEAN`) was measured.

```{r}
VarDescription.telo <- subset(VarDescription, series == '1999-2000' | series == '2001-2002')
```

Quick check on what this looks like by tabulating the `data.frame`. 

```{r}
sort(table(VarDescription.telo$category))
pie(sort(table(VarDescription.telo$category)))
```

The top two categories are "food component recall" and "pharmaceutical" with 256 and 252 entries, respectively. Let us focus our EWAS on the environmental exposures.

```{r}
useTheseCategories <- c('dioxins', 'furans', 'diakyl', 'heavy metals', 'hormone', 'hydrocarbons', 'pcbs', 'pesticides', 'physical fitness', 'phytoestrogens', 'polyflourochemicals', 'smoking behavior', 'smoking family', 'volatile compounds')
useTheseCategories <- sort(useTheseCategories) # store pre-sorted for easier processing

VarDescription.telo <- VarDescription.telo[VarDescription.telo$category %in% useTheseCategories, ]
```

Another inspection will confirm that we're looking at much more tractable set of data.

```{r}
pie(sort(table(VarDescription.telo$category)))
sort(table(VarDescription.telo$category))
```

The top two categories are now "pesticides" and "pcbs" with 64 and 59 entries, respectively. We're close to performing the actual EWAS but will now define specific co-variates to adjust for in each regression. The vector is defined below as `adjustfor` and then the combined with unique co-variates in our cohort as well as the dependent variable of mean telomere length we are interested in studying (`TELOMEAN`). 

We are also going to derive a new variable `RIDAGEYR2` which is the age-squared to test for non-linear effects associated with age.

```{r}
MainTable$RIDAGEYR2 <- MainTable$RIDAGEYR^2
```

The `newData` variable will contain all the pertinent records for analysis. However, the data is right-skewed so we both scale and log transform the data into a new object.

```{r}
adjustfor <- c('RIDAGEYR', 'RIDAGEYR2', 'female', 'mexican', 'black', 'other_eth', 'other_hispanic', 'INDFMPIR')
variablesToKeep <- c(unique(VarDescription.telo$var), 'TELOMEAN', adjustfor, 'SDMVPSU', 'SDMVSTRA', 'WTMEC2YR', 'WTMEC4YR')
newData <- subset(MainTable, SDDSRVYR <= 2)[, variablesToKeep] ## only keep those that are 

newLogData <- cbind(newData[, is.categorical(newData, lower=0)], apply(newData[, !is.categorical(newData, lower=0)], 2, function(x) { scale(log(x+1e-10)) })) # log transform the data
```

The `xwas` library also has integrated support for the `survey` package to perform survey-weighted analysis. You will have to define your own `survey::svydesign` objects but it's taken as an argument to most `xwas` functions. We define our study in the `dsn` variable then perform an EWAS both with and with-out the survey design for comparison.

```{r}
dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weight=~WTMEC4YR, nest=T, data=subset(newLogData, WTMEC4YR > 0)) # create the design object

telo.norm <- xwas(data=subset(newLogData, WTMEC4YR > 0), depvar="TELOMEAN", adjvars=adjustfor, verbose=F) # ewas WITHOUT survey design
telo.surv <- xwas(data=subset(newLogData, WTMEC4YR > 0), depvar="TELOMEAN", adjvars=adjustfor, design=dsn, verbose=F) # ewas WITH survey design
```

As you can see the EWAS with survey-weighted analysis takes longer to perform but it is technically more accurate as you can see within the visualizations below.

## 4. Visualization

Now that we have performed the heavy lifting of EWAS analysis we can use the packaged visualization functions, specifically `manhattan`, `volcano`, and `qqplot` to gather insights into the relationship of environmental exposures to mean telomere length (`TELOMEAN`).

```{r}
par(mfrow=c(1,1), xpd=TRUE) # setup to plot outside of area if necessary
```

First look at a Manhattan plot of the telomere length without survey-weighted analysis.

```{r}
manhattan(telo.norm)
```

Since we looked at the co-variates by sub-categories of environmental exposure it also makes sense to show this visually. If we provided a mapping of variables to categories the Manhattan plot can show this.

```{r}
groups <- subset(VarDescription, category %in% useTheseCategories)[, c("category", "var")]
groups <- unique(groups)
```

Now by passing in this additional information we can show a more visually clear relationship. Let's look at the telomere study without survey-weighted analysis.

```{r}
manhattan(telo.norm, group.by=groups, legend=T, xlab="", ylim=c(0,14), nlabels=5)
```

Now we can compare this to the study done **with** survey-weighted analysis. You can see that there is less noise in the results and the significant co-variates are clearer.

```{r}
manhattan(telo.surv, group.by=groups, legend=T, xlab="", ylim=c(0,14), nlabels=5)
```

Given the results, let's convert to a more readable table and codify all the variables to their descriptions. We'll just print out the top 10 results by significance.

```{r}
top <- xwas.to.df(telo.surv) # convert xwas results to a data.frame

# create a table with the labels and descriptions for each variable used
labels <- subset(VarDescription, category %in% useTheseCategories)[, c("var", "var_desc", "category")]
labels <- labels[!duplicated(labels$var), ]
rownames(labels) <- labels$var

# match variables from the xwas output to the labels object
top <- merge(top, labels, all.x=TRUE, by="row.names")
rownames(top) <- top$Row.names
top <- top[, -which(colnames(top) %in% c("Row.names", "var"))]
top <- top[order(top$q), ]

# show the top 10 results with p-values and descriptions
head(top, n=10)
```

Another `xwas` visualization is the volcano plot that allows you to quickly discern both significance and effect size.

```{r}
volcano(telo.norm, xlim=c(-.2,.2), xlab="Association Size")
```

Notice how the visualizations change with the survey-weighted analysis results.

```{r}
volcano(telo.surv, xlim=c(-.2,.2), xlab="Association Size")
```

We can also compare our output above to [Figure 2 from the paper](http://ije.oxfordjournals.org/content/early/2016/04/07/ije.dyw043/F2.expansion.html) we're replicating.

![Figure 2: this is grabbed from Figure 2 from the paper we're replicating for comparison, link above.](chirag-f2.gif)

An `xwas` visualization to determine significance is the `qqplot`. Any deviation from the line `y=x` indicates true signal in the associations. 

```{r}
qqplot(telo.norm, xlim=c(0,14), ylim=c(0,14))
```

Now with the survey-weighted data you can see that the signal:noise ratio is much more distinct.

```{r}
qqplot(telo.surv, xlim=c(0,14), ylim=c(0,14))
```

Additional visualizations and other features coming soon in future releases.