Skip to content
Snippets Groups Projects
Commit 4499c56a authored by aakan96's avatar aakan96
Browse files

Neue Datei hochladen

parent e73cf5ec
No related branches found
No related tags found
No related merge requests found
---
title: "limma_GSE29001"
output: pdf_document
---
```{r setup, include=FALSE}
# Version info: R 4.2.2, Biobase 2.58.0, GEOquery 2.66.0, limma 3.54.0
################################################################
# Differential expression analysis with limma
library(GEOquery)
library(limma)
library(umap)
# load series and platform data from GEO
gset <- getGEO("GSE29001", GSEMatrix =TRUE, AnnotGPL=TRUE)
if (length(gset) > 1) idx <- grep("GPL571", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
# make proper column names to match toptable
fvarLabels(gset) <- make.names(fvarLabels(gset))
# group membership for all samples
gsms <- "001001100110011001100100110011001100110011001"
sml <- strsplit(gsms, split="")[[1]]
# log2 transformation
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
(qx[6]-qx[1] > 50 && qx[2] > 0)
if (LogC) { ex[which(ex <= 0)] <- NaN
exprs(gset) <- log2(ex) }
```
```{r}
# assign samples to groups and set up design matrix
gs <- factor(sml)
groups <- make.names(c("normal","cancer"))
levels(gs) <- groups
gset$group <- gs
design <- model.matrix(~group + 0, gset)
colnames(design) <- levels(gs)
gset <- gset[complete.cases(exprs(gset)), ] # skip missing values
fit <- lmFit(gset, design) # fit linear model
# set up contrasts of interest and recalculate model coefficients
cts <- c(paste(groups[1],"-",groups[2],sep=""))
cont.matrix <- makeContrasts(contrasts=cts, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
# compute statistics and table of top significant genes
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="BH", sort.by="B", number=250)
# here an error could occur but we can delete the unnecessary columns and continue:
tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","Gene.symbol","Gene.title"))
#write.table(tT, file=stdout(), row.names=F, sep="\t")
```
```{r}
# Visualize and quality control test results.
# Build histogram of P-values for all genes. Normal test
# assumption is that most genes are not differentially expressed.
tT2 <- topTable(fit2, adjust="BH", sort.by="B", number=Inf)
hist(tT2$adj.P.Val, col = "grey", border = "white", xlab = "P-adj",
ylab = "Number of genes", main = "P-adj value distribution")
```
The above plot shows the adjusted p-value distribution across the number of genes. The p-value for this experiment was adjusted using Benjamini Hochberg method(BH). Genes falling to the left of the significance threshold of 0.05 are considered statistically significant and may have potential biological relevance.
```{r}
# summarize test results as "up", "down" or "not expressed"
dT <- decideTests(fit2, adjust.method="BH", p.value=0.05, lfc=0.263)
# Venn diagram of results
vennDiagram(dT, circle.col=palette())
```
The Venn diagram visualizes how many genes fall into each category ("up," "down," and "not expressed"), and if there are any overlapping genes between these categories. In the plot above we can see that , there were 6928 significant up and down regulated genes. others 15349 were not expressed.
```{r}
# create Q-Q plot for t-statistic
t.good <- which(!is.na(fit2$F)) # filter out bad probes
qqt(fit2$t[t.good], fit2$df.total[t.good], main="Moderated t statistic")
```
Q-Q plots (Quantile-Quantile plots) are useful for visually assessing whether the observed data follows an expected theoretical distribution, such as the normal distribution. In the plot above we can observe that the data points fall approximately along a straight line which suggests that the t-statistics are approximately normally distributed.
```{r}
# volcano plot (log P-value vs log fold change)
#colnames(fit2) # list contrast names
ct <- 1 # choose contrast of interest
volcanoplot(fit2, coef=ct, main=colnames(fit2)[ct], pch=20,
highlight=length(which(dT[,ct]!=0)), names=rep('+', nrow(fit2)))
```
Volcano plots are commonly used in genomics to identify differentially expressed genes based on their statistical significance and magnitude of change. The resulting volcano plot has have the log-fold change (x-axis) i.e. 0.263 plotted against the negative logarithm of the adjusted p-value (y-axis)i.e. 0.05. Genes with a significant change in expression (based on adjusted p-values) appear as points far away from the center along the y-axis, while genes with a substantial fold change appear farther away from the center along the x-axis.
```{r}
# MD plot (log fold change vs mean log expression)
# highlight statistically significant (p-adj < 0.05) probes
plotMD(fit2, column=ct, status=dT[,ct], legend=F, pch=20, cex=1)
abline(h=0)
```
MD plots are commonly used to assess the magnitude and direction of gene expression changes between groups. Each point on the plot represents a gene, and the position of the point indicates the log-fold change and the mean log expression level for that gene.
```{r}
################################################################
# General expression data analysis
ex <- exprs(gset)
# box-and-whisker plot
#dev.new(width=3+ncol(gset)/6, height=5)
ord <- order(gs) # order samples by group
palette(c("#1B9E77", "#7570B3", "#E7298A", "#E6AB02", "#D95F02",
"#66A61E", "#A6761D", "#B32424", "#B324B3", "#666666"))
par(mar=c(7,4,2,1))
title <- paste ("GSE29001", "/", annotation(gset), sep ="")
boxplot(ex[,ord], boxwex=0.6, notch=T, main=title, outline=FALSE, las=2, col=gs[ord])
legend("topleft", groups, fill=palette(), bty="n")
# expression value distribution
par(mar=c(4,4,2,1))
title <- paste ("GSE29001", "/", annotation(gset), " value distribution", sep ="")
plotDensities(ex, group=gs, main=title, legend ="topright")
# UMAP plot (dimensionality reduction)
ex <- na.omit(ex) # eliminate rows with NAs
ex <- ex[!duplicated(ex), ] # remove duplicates
ump <- umap(t(ex), n_neighbors = 15, random_state = 123)
par(mar=c(3,3,2,6), xpd=TRUE)
plot(ump$layout, main="UMAP plot, nbrs=15", xlab="", ylab="", col=gs, pch=20, cex=1.5)
legend("topright", inset=c(-0.15,0), legend=levels(gs), pch=20,
col=1:nlevels(gs), title="Group", pt.cex=1.5)
library("maptools") # point labels without overlaps
pointLabel(ump$layout, labels = rownames(ump$layout), method="SANN", cex=0.6)
# mean-variance trend, helps to see if precision weights are needed
plotSA(fit2, main="Mean variance trend, GSE29001")
#dev.off()
```
1. Box-and-Whisker Plot:depicts the distribution of gene expression values across distinct sample groups (here, normal and cancer).The boxes show the interquartile range (IQR), while the centre line inside the box reflects the median expression value. The whiskers extend from the margins of the boxes and represent the variability of the data. Outliers are points that are not within the whiskers. The figure allows us to examine the expression distributions of the two groups and discover any variations in their central tendencies and spread.\
2. Expression Value Distribution Plot: shows the density of gene expression levels for each sample group.The plot shows how gene expression values are spread among each group. It enables us to determine if the distributions of the normal and cancer groups are similar or dissimilar. Denser patches imply higher levels of gene expression, whereas sparser regions indicate lower levels of expression.\
3. UMAP Plot (Dimensionality Reduction): displays a reduced-dimensional depiction of gene expression data using the UMAP method. The UMAPmethod uses dimensionality reduction to project high-dimensional gene expression data onto a 2D space. Each point on the plot represents a sample, and the colors indicate whether the sample is normal or cancer. The map allows us to see the separation or grouping of samples depending on their gene expression patterns. \
4. Mean-Variance Trend Plot:depicts the connection between mean expression and variance in gene expression data. The plot shows us the variation of gene expression varies with the mean expression level. \
\ No newline at end of file
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment