What if Mr. Rogers Moved to Gotham? (A STA525 Clustering Example)

# Tag: STA525

# Autoantigen Array Data Part 1

### Overview

In this and a couple of subsequent posts, we will analyze an example Autoantigen Array data-set. In this post, we will load and examine an example data-set that requires normalization before statistical tests of hypotheses and other analyses can be performed.

The Auto antigen Arrays provide good instructional value as they are quite small by typical microarray standards. The size of the data is therefore easier to visualize and otherwise explore. Additionally,the small size presents some complications with respect to data normalization which are also of interest from an instructional perspective.

In subsequent posts, we will develop our own custom normalization approach and run a full analysis off the data.

#### Autoantigen Arrays

A nice overview of protein arrays is provided in this publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4610965/

The data we will be analyzing was generated by this lab:

https://microarray.swmed.edu/products/category/protein-array/

In particular, the array we will examine is of the type described here:

https://microarray.swmed.edu/products/product/autoantigen-microarray-panel-i/

### Loading Example Autoantigen Array Data

Our analysis will involve 10 samples that were assigned to the first 10 of the 16 blocks of a single Autoantigen Array slide. As all samples were run on the same slide, batch/slide effects will not need to be modeled. The data is therefore a subset of the two (i.e., ct3 and ct5 scan) GenePix output files that were generated for the slide of interest. The cy3 scan data provides relative IgG autoantigen abundance quantifications and the cy5 provides quantifications for IgM autoantigens.

The data is currently available via github and can be obtained as follows:

```
library(devtools)
install_github("dpgaile/AAarrayExmpl")
```

We will also make use of *aheatmap* function from the NMF library.

```
library(AAarrayExmpl)
library(NMF)
```

The GenePix data has been loaded and stored conveniently into list objects that have been included as part of the AAarrayExmpl R library. The lists can be accessed as follows:

```
data(ListArrays)
```

and provide two objects, *S1g* and *S1r*, which contain the the cy3 and cy5 GenePix output data for slide 1 (the slide of interest for this analysis), respectively. The lists are structured such that they contain 40 arrays each of dimensions: 96 by 16 by 4. The 40 arrays correspond to each of the 40 columns provided by each GenePix output file:

```
names(S1g)
```

## [1] "Block" "Column" ## [3] "Row" "Name" ## [5] "ID" "X" ## [7] "Y" "Dia." ## [9] "F532.Median" "F532.Mean" ## [11] "F532.SD" "F532.CV" ## [13] "B532" "B532.Median" ## [15] "B532.Mean" "B532.SD" ## [17] "B532.CV" "X....B532.1SD" ## [19] "X....B532.2SD" "F532...Sat." ## [21] "Ratio.of.Medians..635.532." "Ratio.of.Means..635.532." ## [23] "Median.of.Ratios..635.532." "Mean.of.Ratios..635.532." ## [25] "Ratios.SD..635.532." "Rgn.Ratio..635.532." ## [27] "Rgn.R2..635.532." "F.Pixels" ## [29] "B.Pixels" "Circularity" ## [31] "Sum.of.Medians..635.532." "Sum.of.Means..635.532." ## [33] "Log.Ratio..635.532." "F532.Median...B532" ## [35] "F532.Mean...B532" "F532.Total.Intensity" ## [37] "SNR.532" "Flags" ## [39] "Normalize" "Autoflag"

Each of the data arrays has components:

```
dimnames(S1g$F532.Median)
```

## [[1]] ## [1] "Aggrecan" ## [2] "alpha-actinin" ## [3] "Amyloid" ## [4] "Beta 2-glycoprotein " ## [5] "Beta 2-microglobulin" ## [6] "Ig control" ## [7] "Cardolipin" ## [8] "CENP-A" ## [9] "CENP-B" ## [10] "Chondroitin Sulfate " ## [11] "Chromatin" ## [12] "Collagen I" ## [13] "Collagen II" ## [14] "Collagen V" ## [15] "Collagen VI" ## [16] "Cytochrome C" ## [17] "DGPS" ## [18] "dsDNA" ## [19] "ssDNA" ## [20] "ssRNA" ## [21] "Elastin" ## [22] "Fibrinogen IV" ## [23] "Fibrinogen S" ## [24] "Gliadin_IgG" ## [25] "GBM_dissociated" ## [26] "Histone H1" ## [27] "Histone H2A" ## [28] "Histone H2B" ## [29] "Hinstone H3" ## [30] "Hinstone H4" ## [31] "Histone-total" ## [32] "Heparan HSPG" ## [33] "Hemocyanin" ## [34] "Heperan Sulfate" ## [35] "Heparin" ## [36] "BPI" ## [37] "Entaktin EDTA" ## [38] "Jo-1" ## [39] "Laminin" ## [40] "La/SSB" ## [41] "LC1" ## [42] "Matrigel" ## [43] "Myosin" ## [44] "PCNA" ## [45] "PL-7" ## [46] "PL-12" ## [47] "PM/Scl-100" ## [48] "Proteoglycan" ## [49] "Nucleolin" ## [50] "Anti-Ig" ## [51] "Ro/SSA_52KDa" ## [52] "Ro/SSA_60KDa" ## [53] "Scl-70" ## [54] "Thyroglobulin" ## [55] "Topoisomerase I" ## [56] "TPO" ## [57] "TTG" ## [58] "U1-snRNP-A" ## [59] "U1-snRNP-BB'" ## [60] "U1-snRNP-C" ## [61] "U1-snRNP-68" ## [62] "LKM1" ## [63] "C1q" ## [64] "Intrinsic Factor" ## [65] "MPO" ## [66] "M2" ## [67] "Mi-2" ## [68] "KU (P70/P80)" ## [69] "Sm/RNP" ## [70] "SRP54" ## [71] "Ribo phaspho protein P1" ## [72] "Ribo phaspho protein P2" ## [73] "PM/Scl-75" ## [74] "PR3" ## [75] "SP100" ## [76] "gP210" ## [77] "Nup62" ## [78] "GP2" ## [79] "Collagen IV" ## [80] "Collagen III" ## [81] "MBP_myelin basic protein" ## [82] "Peroxiredoxin 1" ## [83] "Vimentin" ## [84] "Vitronectin" ## [85] "Prothrombin protien" ## [86] "Nucleosome antigen" ## [87] "Mitochondrial antige" ## [88] "Sm" ## [89] "SmD" ## [90] "Decorin-bovine" ## [91] "MAG-myelin-associated glycoprotein-FC" ## [92] "Ribo phasphoprotein P0" ## [93] "Phophatidylinositol" ## [94] "Glycated Albumin" ## [95] "Sphingomyelin" ## [96] "Fibronectin" ## ## [[2]] ## [1] "NOD.B10.1" "NOD.B10.2" "NOD.B10.3" "NOD.B10.4" "NOD.B10.5" ## [6] "BL.10.1" "BL.10.2" "BL.10.3" "BL.10.4" "BL.10.5" ## [11] "Exp1.F0" "Exp1.MZ" "Exp1.SpB1" "Exp1.PrB1" "Exp2.FO" ## [16] "PBS" ## ## [[3]] ## [1] "rep.1" "rep.2" "rep.3" "rep.4"

Each block on the array contains 196 spots, with 94 of the 96 assays conducted in duplicate and two assays conducted with four replications. The assays with four replications are of the control variety, and we will sidestep them at this time. The first 5 blocks of the slide contain the results for strain “A” of mice while blocks 6 through 10 contain the results for strain “B”. At the time of this blog, the manuscript analyzing this data has yet to be submitted for review, so the exact strain identities have been masked.

The GenePix scan results for the the slides can be inspected using an R Shiny app included in the provided library.

#### R Shiny Visualization of GenePix Data

The GenePix data can be examined using an admittedly crude custom Shiny app provided by the AAarrayExmpl library.

```
quickVis()
```

### Spot and Background Signal Quantifications

Our *AAarrayExmpl* R library provides convenience data handling and visualization functions. Spot and background signal quantifications for IgG and IgM can be extracted as follows:

```
BackdatIgG=ExtractFromGenePix(GPlist=S1g,
SignalName="B532.Median",
SampleIndices=1:10)
BackdatIgM=ExtractFromGenePix(GPlist=S1r,
SignalName="B635.Median",
SampleIndices=1:10)
AAdatIgG=ExtractFromGenePix(GPlist=S1g,
SignalName="F532.Median",
SampleIndices=1:10)
AAdatIgM=ExtractFromGenePix(GPlist=S1r,
SignalName="F635.Median",
SampleIndices=1:10)
```

The new list objects now contain 94 (assays) by 10 (sample) data matrices for each of the duplicate spot assays. We can examine the structures of the objects

```
str(AAdatIgG)
```

## List of 4 ## $ X_raw_1 : int [1:94, 1:10] 259 219 208 621 4478 220 689 1299 909 819 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:94] "Aggrecan" "alpha-actinin" "Amyloid" "Beta 2-glycoprotein " ... ## .. ..$ : chr [1:10] "NOD.B10.1" "NOD.B10.2" "NOD.B10.3" "NOD.B10.4" ... ## $ X_raw_2 : int [1:94, 1:10] 259 231 252 621 5336 228 565 1988 445 707 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:94] "Aggrecan" "alpha-actinin" "Amyloid" "Beta 2-glycoprotein " ... ## .. ..$ : chr [1:10] "NOD.B10.1" "NOD.B10.2" "NOD.B10.3" "NOD.B10.4" ... ## $ X_resid_1: num [1:94, 1:10] 38.4 22 23.9 136.9 1812.5 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:94] "Aggrecan" "alpha-actinin" "Amyloid" "Beta 2-glycoprotein " ... ## .. ..$ : chr [1:10] "NOD.B10.1" "NOD.B10.2" "NOD.B10.3" "NOD.B10.4" ... ## $ X_resid_2: num [1:94, 1:10] 38.4 34 67.9 136.9 2670.5 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:94] "Aggrecan" "alpha-actinin" "Amyloid" "Beta 2-glycoprotein " ... ## .. ..$ : chr [1:10] "NOD.B10.1" "NOD.B10.2" "NOD.B10.3" "NOD.B10.4" ... ## - attr(*, "class")= chr "AutoAnt"

We can see, that at the heart of our simple objects, we have the data matrices:

```
dim(AAdatIgG$X_raw_1)
```

## [1] 94 10

```
dim(AAdatIgG$X_raw_2)
```

## [1] 94 10

where *X_raw_1* consists of the assay values for the first of each pair of duplicate assay values listed in the GenePix output. Similarly, *X_raw_2* consists of the assay values for the duplicates that are listed second. Although it is a common practice to average the replicate values at the early stages of an analysis and carry a single matrix forward, we will keep them separate for the time being.

#### Sample Spot and Background Distributions

The *AAplotDens* function in our *AAarrayExmpl* library provides a convenient function to display kernel density estimators (Wikipedia Link) of specified subsets of the data. We make use of the function to visualize the distributions of the 10 samples of interest to us. Note that we color code samples from strain A with different types of reds and samples from strain B with different types of grey.

```
clrs=c("red","red3","red4","orangered","orangered3",
"gray15","gray30","gray45","gray60","gray75")
pchs=c(rep(3,5),rep(4,5))
par(mfrow=c(2,2));par(mgp=c(1.5,.5,0));par(mar=c(2.75,2.75,1.5,0.25))
AAplotDens(BackdatIgG,SampleIndices=1:10,clrs=clrs,useMat="raw.resid",
xlims=NA,returnModes=F,main="IgG Background")
legend(5,0.11,lty=1,lwd=3,col=clrs,legend=paste("sample",1:10),bty="n",ncol=2)
AAplotDens(BackdatIgM,SampleIndices=1:10,clrs=clrs,useMat="raw.resid",
xlims=NA,returnModes=F,main="IgM Background")
AAplotDens(AAdatIgG,SampleIndices=1:10,clrs=clrs,useMat="raw.resid",
xlims=c(-175,175),returnModes=F,main="IgG Spot Signal")
legend(375,0.07,lty=1,lwd=3,col=clrs,legend=paste("block",1:10),bty="n",ncol=2)
AAplotDens(AAdatIgM,SampleIndices=1:10,clrs=clrs,useMat="raw.resid",
xlims=c(-100,100),returnModes=F,main="IgM Spot Signal")
```

Inspection of density estimates for IgG and IgM signal and background quantifications indicate that data normalization is required. For example, the samples from strain A exhibit are not all similarly distributed as one would expect. The variability observed across the different samples is most likely a result of technical variability rather than biological variability. The difference in background signal distributions suggests that the differing amounts of overall material were labeled/assayed for the various samples.

Plots of background versus spot quantifications across samples (using the same color scheme) also indicates differences in underlying distributions across samples.

```
par(mfrow=c(2,2));par(mgp=c(1.5,.5,0));par(mar=c(2.75,2.75,1.5,0.25))
plot(as.vector(cbind(BackdatIgG$X_raw_1,BackdatIgG$X_raw_2)),
as.vector(cbind(AAdatIgG$X_raw_1,AAdatIgG$X_raw_2)),
type="n",xlab="IgG Background Intensity",ylab="IgG Spot Intensity" )
abline(0,1,lwd=3,col="steelblue")
for(j in 1:10) points(c(BackdatIgG$X_raw_1[,j],BackdatIgG$X_raw_2[,j]),
c(AAdatIgG$X_raw_1[,j],AAdatIgG$X_raw_2[,j]),
pch=j,col=clrs[j],cex=0.5)
plot(as.vector(cbind(BackdatIgM$X_raw_1,BackdatIgM$X_raw_2)),
as.vector(cbind(AAdatIgM$X_raw_1,AAdatIgM$X_raw_2)),
type="n",xlab="IgM Background Intensity",ylab="IgM Spot Intensity" )
abline(0,1,lwd=3,col="steelblue")
for(j in 1:10) points(c(BackdatIgM$X_raw_1[,j],BackdatIgM$X_raw_2[,j]),
c(AAdatIgM$X_raw_1[,j],AAdatIgM$X_raw_2[,j]),
pch=j,col=clrs[j],cex=0.5)
plot(log(as.vector(cbind(BackdatIgG$X_raw_1,BackdatIgG$X_raw_2))),
log(as.vector(cbind(AAdatIgG$X_raw_1,AAdatIgG$X_raw_2))),
type="n",xlab="log IgG Background Intensity",ylab="log IgG Spot Intensity" )
abline(0,1,lwd=3,col="steelblue")
for(j in 1:10) points(log(c(BackdatIgG$X_raw_1[,j],BackdatIgG$X_raw_2[,j])),
log(c(AAdatIgG$X_raw_1[,j],AAdatIgG$X_raw_2[,j])),
pch=j,col=clrs[j],cex=0.5)
plot(log(as.vector(cbind(BackdatIgM$X_raw_1,BackdatIgM$X_raw_2))),
log(as.vector(cbind(AAdatIgM$X_raw_1,AAdatIgM$X_raw_2))),
type="n",xlab="log IgM Background Intensity",ylab="log IgM Spot Intensity" )
abline(0,1,lwd=3,col="steelblue")
for(j in 1:10) points(log(c(BackdatIgM$X_raw_1[,j],BackdatIgM$X_raw_2[,j])),
log(c(AAdatIgM$X_raw_1[,j],AAdatIgM$X_raw_2[,j])),
pch=j,col=clrs[j],cex=0.5)
```

### Heatmaps of Raw Signal

The *NMF* R library provides a function, *aheatmap*, which generates annotated heatmaps (Wikipedia Link). Heatmaps of IgG and IgM signal quantifications confirm that the scale of measurements differs significantly across the 94 autoantigen assays. The “.x” suffix indicates the spot replicate. Comparison of “.1” and “.2” columns for a given sample, speaks to the agreement between replicate assay measurements.

#### Uniform Rank-It Heatmaps

Heatmaps using uniform rank-its (i.e., rank values scaled to the unit interval) of the assays values provide better contrast for visualization.

```
W=cbind(AAdatIgG$X_raw_1[,1],AAdatIgG$X_raw_2[,1])
for(j in 2:10) W=cbind(W,AAdatIgG$X_raw_1[,j],AAdatIgG$X_raw_2[,j])
RnkW=array(rank(W),dim=dim(W))/(20*94+1)
colnames(RnkW)=as.character(as.vector(mapply(rep,1:10,2))
+rep(c(0.1,0.2),5))
par(mfrow=c(1,1))
aheatmap(RnkW,annCol=factor(c(rep("strain A",10),rep("strain B",10))),
main="Rank-it Raw Signal IgG")
```

```
W=cbind(AAdatIgM$X_raw_1[,1],AAdatIgM$X_raw_2[,1])
for(j in 2:10) W=cbind(W,AAdatIgM$X_raw_1[,j],AAdatIgM$X_raw_2[,j])
RnkW=array(rank(W),dim=dim(W))/(20*94+1)
colnames(RnkW)=as.character(as.vector(mapply(rep,1:10,2))
+rep(c(0.1,0.2),5))
par(mfrow=c(1,1))
aheatmap(RnkW,annCol=factor(c(rep("strain A",10),rep("strain B",10))),
main="Rank-it Raw Signal IgM")
```

Inspection of the heatmaps reveal that the different autoantigen assays have distributions of different centers and scales. The *AAplotByMean* in our *AAarrayExmpl* library provides figures which plot the observed signal assay values as a function of the magnitude of the center of each assay. We utilize the Tukey Trimean (Wikipedia Link) to estimate the centers of each assay distribution rather than the mean so as to minimize the effects of outliers. The plots created by the functions calls make clear the presence of heteroskedasticity in our data-set, and that the variance increases with assay distribution means, a not uncommon occurrence in biological data.

```
par(mfrow=c(1,2));par(mgp=c(1.5,.5,0));par(mar=c(2.75,2.75,1.5,0.25))
AAplotByMean(AAdatIgG,SampleIndices=1:10,clrs=clrs,pchs=pchs,MeanSubset=1:10,
useMat="raw",main="IgG raw signals")
AAplotByMean(AAdatIgM,SampleIndices=1:10,clrs=clrs,pchs=pchs,MeanSubset=1:10,
useMat="raw",main="IgM raw signals")
```

#### Uniform Rank-It (Within Assay and Across Samples) Heatmaps

Heatmaps using uniform rank-its of the assays, with ranks calculated with respect to the values across samples provide another useful visualization, with the variance of all rows approximately equal to 1/12. The heatmaps indicate notable differences in sample specific distributions. For example, sample 5 provides consistently larger spot intensity quantifications.

```
RnkSmplW=cbind(AAdatIgG$X_raw_1[,1],AAdatIgG$X_raw_2[,1])
for(j in 2:10) RnkSmplW=cbind(RnkSmplW,AAdatIgG$X_raw_1[,j],AAdatIgG$X_raw_2[,j])
for(i in 1:94) RnkSmplW[i,]=rank(RnkSmplW[i,])/(21)
colnames(RnkSmplW)=as.character(as.vector(mapply(rep,1:10,2))
+rep(c(0.1,0.2),5))
par(mfrow=c(1,1))
aheatmap(RnkSmplW,annCol=factor(c(rep("strain A",10),rep("strain B",10))),
main="Rank-it (Across Samples) Raw Signal IgG")
```

```
RnkSmplW=cbind(AAdatIgM$X_raw_1[,1],AAdatIgM$X_raw_2[,1])
for(j in 2:10) RnkSmplW=cbind(RnkSmplW,AAdatIgM$X_raw_1[,j],AAdatIgM$X_raw_2[,j])
for(i in 1:94) RnkSmplW[i,]=rank(RnkSmplW[i,])/(21)
colnames(RnkSmplW)=as.character(as.vector(mapply(rep,1:10,2))
+rep(c(0.1,0.2),5))
par(mfrow=c(1,1))
aheatmap(RnkSmplW,annCol=factor(c(rep("strain A",10),rep("strain B",10))),
main="Rank-it (Across Samples) Raw Signal IgM")
```

# A note regarding running shiny apps from a DigitalOcean Ubuntu install of RStudio Server

This link provides step by step instructions for setting up RStudio Server and Shiny on a Digital Ocean Ubuntu droplet: Dean Attali’s Blog Entry

It is very important to note that if one follows Step 9: Make pretty URLs for RStudio Server and Shiny Server, that one also follows these additional steps (from Dean Attali’s blog):

Bonus for advanced users:The above setup should be just fine for most users, but I did notice a few small issues with RStudio that seem to be fixed by allowing nginx to proxy WebSockets. For example, I noticed that when using the`ggvis`

package in my RStudio, tooltips were not working. The fix is to add the following three lines inside the`location /rstudio/`

settings (keep the`proxy_pass`

line and just add these three, and remember you have to restart nginx after changing the settings):`proxy_http_version 1.1; proxy_set_header Upgrade $http_upgrade; proxy_set_header Connection "upgrade";`

If you’re hosting a Shiny Server, add these 3 lines after the

`proxy_pass`

line inside`location /shiny/`

as well.

If those steps are not taken, then one will not be able to execute shiny apps from RStudio server (i.e., the app will start but the window will “gray” out). Complications may also appear with the R shiny server. However, allowing nginx to proxy Websockets remedies the problems.