Fun with Gene Expression Data - GSE69618

Let’s load some data. It is gene expression data. Google for GSE69618 to read about it and download it.

setwd("~jkinne/public_html/cs459-bd4isu-s2019/code/")
data <- read.csv("GSE69618_data.csv")

Now we clean our data a little - remove (NA-ize) values less than 10, log2 things, clean up the column names to make them shorter for displaying on figures

tooSmall <- data[,3:ncol(data)] <= 10

NA_count <- sum(is.na(data[,3:ncol(data)])) # how many values were NA already
small_count <- sum(tooSmall) # how many are too small and being turned into NA
total_count <- prod(dim(data[,3:ncol(data)])) # how many points total
print(paste("Percent of values that were ready as NA:  ", NA_count/total_count*100))
## [1] "Percent of values that were ready as NA:   0"
print(paste("Percent that were <= 10 so setting to NA: ", small_count/total_count*100))
## [1] "Percent that were <= 10 so setting to NA:  25.610428186457"
data[,3:ncol(data)][tooSmall] <- NA
data[,3:ncol(data)] <- log2(data[,3:ncol(data)])
temp <- colnames(data)
temp <- gsub("shRNA.", "", temp)
temp <- gsub("H1.", "", temp)
colnames(data) <- temp

# average expression value over entire dataset.  note this is with the NA's (which includes <= 10 values that were removed) removed already.
mean(as.matrix(data[,3:ncol(data)]), na.rm=TRUE)
## [1] 9.145149

Now we boxplot to look at things. Note that this shows the median, max, min for each sample.

temp <- par()$mai
par(mai=c(1,2,1,1)) # bottom, left, top, right
boxplot(data[,3:ncol(data)], 
        main="GSE69618, <=10 removed, log2", xlab="log2", las=2, ylab="",
        horizontal=TRUE, varwidth=TRUE, col=c(2,3))

par(mai=temp)

Heh data, how many NA’s - both per sample (column) and per gene (row)?

percentNA <- function(x) {
  return(sum(is.na(x))/length(x)*100)
}
na_by_col <- apply(data[3:ncol(data)], 2, percentNA) # apply per column
na_by_row <- apply(data[3:ncol(data)], 1, percentNA) # apply per row

# could have done this for na_by_col...
#na_by_col <- apply(data[3:ncol(data)], 2, function(x) { sum(is.na(x))/length(x)*100})

plot(na_by_col, main="Percent NA, per sample", ylab="%NA")

# Heh genes, how many NA's?  How many genes are at least 50% NAs, 20%, 10%, >0?
hist(na_by_row, main="Percent NA, per gene", xlab="%NA")

Top 10 highest expressed transcript. Bottom 10. Note - sum the rows, then …

stats <- data.frame(RefSeqID=data$RefSeqID)
stats$total <- apply(data[,3:ncol(data)], 1, sum, na.rm = TRUE)
stats$avg = stats$total/(ncol(data)-2)
dec_total <- order(stats$total, decreasing=TRUE, na.last=TRUE)
z <- stats[head(dec_total, n=10),]
print(z)
##           RefSeqID    total      avg
## 10186 NM_001190702 588.9476 17.32199
## 34536    NR_002312 579.7350 17.05103
## 34692    NR_002819 566.3305 16.65678
## 10176 NM_001190470 564.2952 16.59692
## 10165 NM_001190452 558.0484 16.41319
## 34845    NR_003051 542.5549 15.95750
## 16025    NM_001961 519.1043 15.26777
## 15519    NM_001402 510.4576 15.01346
## 19867    NM_006098 509.9307 14.99796
## 31008    NM_153201 507.7433 14.93363
zeros <- sum(stats$total == 0)
print(paste("Number of genes with no expression in any sample (or <= 10):",zeros))
## [1] "Number of genes with no expression in any sample (or <= 10): 1893"
hist(stats$avg)