Neue Datei hochladen

4499c56a · aakan96 · e73cf5ec · 4499c56a
Commit 4499c56a authored 1 year ago by aakan96
--- a/01_Differential_Expression_Analysis/Limma/limma_GSE29001.Rmd
+++ b/01_Differential_Expression_Analysis/Limma/limma_GSE29001.Rmd
+---
+title: "limma_GSE29001"
+output: pdf_document
+---
+```{r setup, include=FALSE}
+# Version info: R 4.2.2, Biobase 2.58.0, GEOquery 2.66.0, limma 3.54.0
+################################################################
+#   Differential expression analysis with limma
+library(GEOquery)
+library(limma)
+library(umap)
+# load series and platform data from GEO
+gset <- getGEO("GSE29001", GSEMatrix =TRUE, AnnotGPL=TRUE)
+if (length(gset) > 1) idx <- grep("GPL571", attr(gset, "names")) else idx <- 1
+gset <- gset[[idx]]
+# make proper column names to match toptable 
+fvarLabels(gset) <- make.names(fvarLabels(gset))
+# group membership for all samples
+gsms <- "001001100110011001100100110011001100110011001"
+sml <- strsplit(gsms, split="")[[1]]
+# log2 transformation
+ex <- exprs(gset)
+qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
+LogC <- (qx[5] > 100) ||
+  (qx[6]-qx[1] > 50 && qx[2] > 0)
+if (LogC) { ex[which(ex <= 0)] <- NaN
+exprs(gset) <- log2(ex) }
+```
+```{r}
+# assign samples to groups and set up design matrix
+gs <- factor(sml)
+groups <- make.names(c("normal","cancer"))
+levels(gs) <- groups
+gset$group <- gs
+design <- model.matrix(~group + 0, gset)
+colnames(design) <- levels(gs)
+gset <- gset[complete.cases(exprs(gset)), ] # skip missing values
+fit <- lmFit(gset, design)  # fit linear model
+# set up contrasts of interest and recalculate model coefficients
+cts <- c(paste(groups[1],"-",groups[2],sep=""))
+cont.matrix <- makeContrasts(contrasts=cts, levels=design)
+fit2 <- contrasts.fit(fit, cont.matrix)
+# compute statistics and table of top significant genes
+fit2 <- eBayes(fit2, 0.01)
+tT <- topTable(fit2, adjust="BH", sort.by="B", number=250)
+# here an error could occur but we can delete the unnecessary columns and continue:
+tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","Gene.symbol","Gene.title"))
+#write.table(tT, file=stdout(), row.names=F, sep="\t")
+```
+```{r}
+# Visualize and quality control test results.
+# Build histogram of P-values for all genes. Normal test
+# assumption is that most genes are not differentially expressed.
+tT2 <- topTable(fit2, adjust="BH", sort.by="B", number=Inf)
+hist(tT2$adj.P.Val, col = "grey", border = "white", xlab = "P-adj",
+     ylab = "Number of genes", main = "P-adj value distribution")
+```
+The above plot shows the adjusted p-value distribution across the number of genes. The p-value for this experiment was adjusted using Benjamini Hochberg method(BH). Genes falling to the left of the significance threshold of 0.05 are considered statistically significant and may have potential biological relevance. 
+```{r}
+# summarize test results as "up", "down" or "not expressed"
+dT <- decideTests(fit2, adjust.method="BH", p.value=0.05, lfc=0.263)
+# Venn diagram of results
+vennDiagram(dT, circle.col=palette())
+```
+The Venn diagram visualizes how many genes fall into each category ("up," "down," and "not expressed"), and if there are any overlapping genes between these categories. In the plot above we can see that , there were 6928 significant up and down regulated genes. others 15349 were not expressed.
+```{r}
+# create Q-Q plot for t-statistic
+t.good <- which(!is.na(fit2$F)) # filter out bad probes
+qqt(fit2$t[t.good], fit2$df.total[t.good], main="Moderated t statistic")
+```
+Q-Q plots (Quantile-Quantile plots) are useful for visually assessing whether the observed data follows an expected theoretical distribution, such as the normal distribution. In the plot above we can observe that the data points fall approximately along a straight line which suggests that the t-statistics are approximately normally distributed.
+```{r}
+# volcano plot (log P-value vs log fold change)
+#colnames(fit2) # list contrast names
+ct <- 1        # choose contrast of interest
+volcanoplot(fit2, coef=ct, main=colnames(fit2)[ct], pch=20,
+            highlight=length(which(dT[,ct]!=0)), names=rep('+', nrow(fit2)))
+```
+Volcano plots are commonly used in genomics to identify differentially expressed genes based on their statistical significance and magnitude of change. The resulting volcano plot has have the log-fold change (x-axis) i.e. 0.263 plotted against the negative logarithm of the adjusted p-value (y-axis)i.e. 0.05. Genes with a significant change in expression (based on adjusted p-values) appear as points far away from the center along the y-axis, while genes with a substantial fold change appear farther away from the center along the x-axis.
+```{r}
+# MD plot (log fold change vs mean log expression)
+# highlight statistically significant (p-adj < 0.05) probes
+plotMD(fit2, column=ct, status=dT[,ct], legend=F, pch=20, cex=1)
+abline(h=0)
+```
+MD plots are commonly used to assess the magnitude and direction of gene expression changes between groups. Each point on the plot represents a gene, and the position of the point indicates the log-fold change and the mean log expression level for that gene.
+```{r}
+################################################################
+# General expression data analysis
+ex <- exprs(gset)
+# box-and-whisker plot
+#dev.new(width=3+ncol(gset)/6, height=5)
+ord <- order(gs)  # order samples by group
+palette(c("#1B9E77", "#7570B3", "#E7298A", "#E6AB02", "#D95F02",
+          "#66A61E", "#A6761D", "#B32424", "#B324B3", "#666666"))
+par(mar=c(7,4,2,1))
+title <- paste ("GSE29001", "/", annotation(gset), sep ="")
+boxplot(ex[,ord], boxwex=0.6, notch=T, main=title, outline=FALSE, las=2, col=gs[ord])
+legend("topleft", groups, fill=palette(), bty="n")
+# expression value distribution
+par(mar=c(4,4,2,1))
+title <- paste ("GSE29001", "/", annotation(gset), " value distribution", sep ="")
+plotDensities(ex, group=gs, main=title, legend ="topright")
+# UMAP plot (dimensionality reduction)
+ex <- na.omit(ex) # eliminate rows with NAs
+ex <- ex[!duplicated(ex), ]  # remove duplicates
+ump <- umap(t(ex), n_neighbors = 15, random_state = 123)
+par(mar=c(3,3,2,6), xpd=TRUE)
+plot(ump$layout, main="UMAP plot, nbrs=15", xlab="", ylab="", col=gs, pch=20, cex=1.5)
+legend("topright", inset=c(-0.15,0), legend=levels(gs), pch=20,
+       col=1:nlevels(gs), title="Group", pt.cex=1.5)
+library("maptools")  # point labels without overlaps
+pointLabel(ump$layout, labels = rownames(ump$layout), method="SANN", cex=0.6)
+# mean-variance trend, helps to see if precision weights are needed
+plotSA(fit2, main="Mean variance trend, GSE29001")
+#dev.off()
+```
+1. Box-and-Whisker Plot:depicts the distribution of gene expression values across distinct sample groups (here, normal and cancer).The boxes show the interquartile range (IQR), while the centre line inside the box reflects the median expression value. The whiskers extend from the margins of the boxes and represent the variability of the data. Outliers are points that are not within the whiskers. The figure allows us to examine the expression distributions of the two groups and discover any variations in their central tendencies and spread.\
+2. Expression Value Distribution Plot: shows the density of gene expression levels for each sample group.The plot shows how gene expression values are spread among each group. It enables us to determine if the distributions of the normal and cancer groups are similar or dissimilar. Denser patches imply higher levels of gene expression, whereas sparser regions indicate lower levels of expression.\
+3. UMAP Plot (Dimensionality Reduction): displays a reduced-dimensional depiction of gene expression data using the UMAP method. The UMAPmethod uses dimensionality reduction to project high-dimensional gene expression data onto a 2D space. Each point on the plot represents a sample, and the colors indicate whether the sample is normal or cancer. The map allows us to see the separation or grouping of samples depending on their gene expression patterns. \
+4. Mean-Variance Trend Plot:depicts the connection between mean expression and variance in gene expression data.  The plot shows us the variation of gene expression varies with the mean expression level. \
\ No newline at end of file