Data Simulation is a helpful tool for learning Bioinformatics
As a self-study beginner, I understand how challenging it can be to put the first bricks to build bioinformatic skills, especially the eagerness to master the codes necessary for running real analyses. However, the path from the starting point to the moment of actual implementation is long and can be somewhat frustrating.
Image by DALL-E 3 prompted by Author
In this blog, I aim to provide code simulations to assist learners in understanding how to run a few basic bioinformatic pipelines in R. These include RNA sequencing, DNA methylation, single-cell RNA sequencing, and spatial transcriptome.
The main benefit of this simulation approach is that there’s no need to download data, a process that often requires significant time and effort and is prone to errors. This way, beginners can focus on the crucial aspects of data analysis and visualization within these pipelines.
Another cool thing is that beginners can understand the data structure or shape by examining the codes of simulation. This will be helpful when we prepare the real data for our research projects.
Preparation
Things we need to prepare before diving into codes:
- About 5GB space free
- Installing packages
- Creating appropriate directory
Let us prepare the necessary packages for this tutorial blog. If you’ve already installed these packages, there’s no need to run the following code as it could be time-consuming. However, if you only need a few packages, feel free to install them individually. Please note that running this code could take some time, ranging from several minutes to an hour:
general_packages = c("ggplot2", "gridExtra", "dplyr", "tidyverse",
"BiocManager", "readr", "writexl")
bioc_packages = c("org.Hs.eg.db", "DESeq2", "fgsea",
"GSVA", "ChAMP", "Seurat",
"IlluminaHumanMethylation450kanno.ilmn12.hg19",
"IlluminaHumanMethylationEPICanno.ilm10b4.hg19",
"SpotClean")
lapply(general_packages, install.packages, character.only=TRUE)
lapply(bioc_packages, BiocManager::install, character.only=TRUE)
We have to make sure these packages are properly installed for subsequent analyses. Yet, these packages are important for our simulated data analyses as well as future real data analyses.
Once the packages are installed, we need to set up a directory to ensure our work is organized and easily accessible.
- Firstly, create an empty R script named
simulation.R
. - Then, place the
simulation.R
file in a folder that is both visible and convenient for you. - Exit RStudio and then restart it by opening the
simulation.R
script. This action sets the working directory to the location of thesimulation.R
file.
Image by Author
root = paste0(getwd(), "/simulated_blog")
rnaseq_root = paste0(getwd(), "/simulated_blog/RNAseq")
methylation_root = paste0(getwd(), "/simulated_blog/methylation")
scrnaseq_root = paste0(getwd(), "/simulated_blog/ScRNA")
ST_root = paste0(getwd(), "/simulated_blog/ST")
# Create the directories
dir.create(root)
dir.create(rnaseq_root)
dir.create(methylation_root)
dir.create(scrnaseq_root)
dir.create(ST_root)
After running this code block, we will get the folder like this:
Image by Author
RNA-seq Simulation and Analysis
Here is the simulation code:
setwd(rnaseq_root)
library(org.Hs.eg.db)
set.seed(1)
# Define genes
Genes = keys(org.Hs.eg.db, keytype="SYMBOL")
Genes = Genes[!duplicated(Genes)]
n_genes = length(Genes)
# Define samples
n_samples = 50
Samples = paste0("S", 1:n_samples)
# Assume 2 real phenotypes: A and B
real_phenotype = c(rep("A", n_samples/2), rep("B", n_samples/2))
# Count matrix simulation
nA = n_samples/2 ; nB = n_samples/2
countData = matrix(0, nrow=n_genes, ncol=n_samples)
pb = txtProgressBar(min=0, max=n_genes, initial=0, style=3)
for (i in 1:n_genes){
sizeA = round(runif(1, 1, 500))
sizeB = round(runif(1, 1, 500))
probA = runif(1)
probB = runif(1)
readcountA = rnbinom(n=nA, size=sizeA, prob=probA)
readcountB = rnbinom(n=nB, size=sizeB, prob=probB)
readcount = c(readcountA, readcountB)
countData[i,] = readcount
setTxtProgressBar(pb, i)
}
rownames(countData) = Genes
colnames(countData) = Samples
# Simulate phenotypes in reality
HistologyA = sample(c("H1", "H2"), size=nA, prob=c(0.8, 0.2), replace=TRUE)
HistologyB = sample(c("H1", "H2"), size=nB, prob=c(0.7, 0.3), replace=TRUE)
Histology = c(HistologyA, HistologyB)
MolecularA = sample(c("M1", "M2"), size=nA, prob=c(0.6, 0.4), replace=TRUE)
MolecularB = sample(c("M1", "M2"), size=nA, prob=c(0.1, 0.9), replace=TRUE)
Molecular = c(MolecularA, MolecularB)
colData = data.frame(Histology, Molecular)
The final outputs of this code are countData
and colData
, which represent the count matrix and the phenotype metadata, respectively. In this code, I’ve constructed a for loop to generate distinct negative binomial distributions between phenotype A and phenotype B for any given gene. The goal is to create two unique biological clusters associated with these phenotypes. Additionally, I’ve created real-world phenotypes, including Histology and Molecular. These have been intentionally designed with a degree of noise, as real-world data is seldom perfect.
This is how the countData
and colData
look like:
> print(countData[1:5, 1:5])
S1 S2 S3 S4 S5
A1BG 95 90 111 88 90
A2M 427 451 407 401 407
A2MP1 299 305 297 341 266
NAT1 1135 1286 1409 1149 1199
NAT2 462 421 409 404 385
> print(colData[1:5,])
Histology Molecular
S1 H1 M1
S2 H1 M2
S3 H1 M2
S4 H1 M1
S5 H1 M2
In countData
, each column corresponds to a sample and each row corresponds to a gene. The values represent the expression level of a particular gene in a specific sample. In colData
, each column represents a real-world phenotype. The order of the rows in colData
matches the order of the columns in countData
, indicating that they correspond to the same sample.
We can then proceed with the differential expression analysis. For this purpose, I utilize DESeq2:
library(DESeq2)
# Input of DESeq2: countData, colData
# If we want to compare M1 and M2 of molecular phenotype:
dds = DESeqDataSetFromMatrix(countData=countData, colData=colData, design=~Molecular)
# Run DESeq2
dds = DESeq(dds)
res = results(dds)
print(res)
# Additional: how to save the results
library(readr)
library(writexl)
Gene = rownames(res)
final_res = as.data.frame(cbind(Gene, res))
# If we want to save csv file
write_csv(final_res, "DESeq2_results_Molecular.csv")
# If we want to save xlsx file
write_xlsx(final_res, "DESeq2_results_Molecular.xlsx")
Next, we conduct pathway analysis. In this instance, I employ the Gene Set Enrichment Analysis (GSEA):
library(msigdbr)
library(fgsea)
library(dplyr)
library(tidyverse)
library(tidyr)
# Define the pathway
# I will use "H" collection in MSigDB database
H = msigdbr(species="Homo sapiens", category="H")
pathways = H %>% group_by(gs_name) %>%
summarise(pathways=list(gene_symbol)) %>%
unique() %>%
deframe()
# list of pathways is the format fgsea uses
# Define the metrics
# In this case I use the metric from the DESeq2 result
stats = res$stat
names(stats) = rownames(res)
# Run GSEA
gsea_res = fgsea(pathways=pathways, stats=stats)
# Additional: how to save the results
library(readr)
library(writexl)
# If we want to save csv file
write_csv(final_res, "GSEA_results_Molecular.csv")
# If we want to save xlsx file
write_xlsx(final_res, "GSEA_results_Molecular.xlsx")
The gsea_res
object encompasses all the computed statistics from the conventional GSEA. An explanation of these terms is beyond the scope of this blog.
Lastly, we can investigate the heterogeneity of pathway activity by conducting single-sample GSEA. For this, I utilize the Gene Set Variational Analysis (GSVA):
library(msigdbr)
library(GSVA)
library(dplyr)
library(tidyverse)
library(tidyr)
# Define the pathway
# I will use "H" collection in MSigDB database
H = msigdbr(species="Homo sapiens", category="H")
pathways = H %>% group_by(gs_name) %>%
summarise(pathways=list(gene_symbol)) %>%
unique() %>%
deframe()
# list of pathways is the format GSVA uses
gsva_res = gsva(expr=countData, gset.idx.list=pathways)
head(gsva_res)
# Additional: how to save the results
library(readr)
library(writexl)
Pathway = rownames(gsva_res)
gsva_res = as.data.frame(c(Pathway, gsva_res))
# If we want to save csv file
write_csv(gsva_res, "GSVA_results_H.csv")
# If we want to save xlsx file
write_xlsx(gsva_res, "GSVA_results_H.xlsx")
The resulting gsva_res
object comprises 50 Pathways x 20 Samples. Each cell contains the enrichment score of a given pathway for a sample.
Upon execution of the codes, the “RNAseq” folder is as follows:
Image by Author
There are numerous other analyses applicable to RNA-seq data. However, they fall outside the purview of this blog.
Methylation Simulation
Here is the simulation code:
setwd(methylation_root)
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
set.seed(1)
# Define genes
CpGs = IlluminaHumanMethylation450kanno.ilmn12.hg19::Other
CpGs = rownames(CpGs)
n_cpg = length(CpGs)
# Define samples
n_samples = 50
Samples = paste0("S", 1:n_samples)
# Assume 2 real phenotypes: A and B
real_phenotype = c(rep("A", n_samples/2), rep("B", n_samples/2))
# Beta-value matrix simulation
nA = n_samples/2 ; nB = n_samples/2
beta_matrix = matrix(0, nrow=n_cpg, ncol=n_samples)
pb = txtProgressBar(min=0, max=n_cpg, initial=0, style=3)
for (i in 1:n_cpg){
shape1A = runif(1, 1, 500)
shape1B = runif(1, 1, 500)
shape2A = runif(1, 1, 500)
shape2B = runif(1, 1, 500)
betaA = rbeta(n=nA, shape1=shape1A, shape2=shape2A)
betaB = rbeta(n=nB, shape1=shape1B, shape2=shape2B)
beta = c(betaA, betaB)
beta_matrix[i,] = beta
setTxtProgressBar(pb, i)
}
rownames(beta_matrix) = CpGs
colnames(beta_matrix) = Samples
# Simulate phenotypes in reality
HistologyA = sample(c("H1", "H2"), size=nA, prob=c(0.9, 0.1), replace=TRUE)
HistologyB = sample(c("H1", "H2"), size=nB, prob=c(0.1, 0.9), replace=TRUE)
Histology = c(HistologyA, HistologyB)
MolecularA = sample(c("M1", "M2"), size=nA, prob=c(0.8, 0.2), replace=TRUE)
MolecularB = sample(c("M1", "M2"), size=nA, prob=c(0.1, 0.9), replace=TRUE)
Molecular = c(MolecularA, MolecularB)
In this code, I’ve constructed a for loop to generate varying methylation distributions between phenotype A and phenotype B. The goal is to create two distinct biological clusters associated with these phenotypes. Similarly to RNA-seq simulation, I’ve created real-world phenotypes, including Histology and Molecular. Currently, I’m utilizing the probes of the Illumina Infinium HumanMethylation450 BeadChip. However, we can easily switch to using the Illumina MethylationEPIC BeadChip probes by simply replacing the CpGs with those related to EPIC.
library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
CpGs = IlluminaHumanMethylationEPICanno.ilm10b4.hg19::Other
CpGs = rownames(CpGs)
This is how the beta_matrix
looks like:
> print(beta_matrix[1:5, 1:5])
S1 S2 S3 S4 S5
cg00050873 0.2958955 0.3261187 0.2963041 0.3303318 0.3372731
cg00212031 0.3100877 0.3247370 0.3113969 0.2962113 0.3145456
cg00213748 0.4729288 0.4960375 0.4759625 0.4660788 0.5242928
cg00214611 0.4066274 0.4113257 0.3547656 0.4055290 0.3877596
cg00455876 0.8325121 0.8565954 0.8418491 0.8520826 0.8732811
In beta_matrix
, each column corresponds to a sample and each row corresponds to a CpG position. The values represent the methylation level of a particular CpG position in a specific sample.
I briefly demonstrate the example code of the methylation analysis by using the ChAMP package:
# I recommend to run this code block step by step
library(ChAMP)
myImport = list()
myImport$beta = beta_matrix
myImport$pd = data.frame(Histology, Molecular)
myLoad = champ.filter()
# Quality Control
champ.QC(pheno=myLoad$pd$Histology)
champ.QC(pheno=myLoad$pd$Molecular)
# Data Normalization
myNorm = champ.norm()
# Differential Methylation Position
myDMP = champ.DMP(pheno=myLoad$pd$Histology)
DMP.GUI(pheno=myLoad$pd$Histology)
# myDMP = champ.DMP(pheno=myLoad$pd$Molecular)
# Differential Methylation Region
myDMR = champ.DMR(pheno=myLoad$pd$Histology)
DMR.GUI(pheno=myLoad$pd$Histology)
# myDMR = champ.DMR(pheno=myLoad$pd$Molecular)
# Block analysis
myBlock = champ.Block(pheno=myLoad$pd$Histology)
Block.GUI(pheno=myLoad$pd$Histology)
# myBlock = champ.Block(pheno=myLoad$pd$Molecular)
# GSEA
myGSEA = champ.GSEA(pheno=myLoad$.pd$Histology)
# myGSEA = champ.GSEA(pheno=myLoad$pd$Molecular)
# Additional: how to save the results
library(readr)
library(writexl)
CpG = rownames(myDMP$H1_to_H2)
DMP_output = as.data.frame(cbind(CpG, myDMP$H1_to_H2))
DMR_output = myDMR$BumphunterDMR
Block_output = myBlock$Block
GSEA_output = myGSEA$DMR
# Saving files in .csv format (we can save in .xlsx like previous done)
write_csv(DMP_output, "DMP_results_Histology.csv")
write_csv(DMR_output, "DMR_results_Histology.csv")
write_csv(Block_output, "Block_results_Histology.csv")
write_csv(GSEA_output, "GSEA_results_Histology.csv")
The ChAMP package offers a comprehensive pipeline, ranging from initial to more advanced analyses, including Differential Methylation Position (DMP), Differential Methylation Region (DMR), and GSEA. This package generates numerous intriguing plots during the analysis process, which you’re free to explore.
Upon execution of these codes, the “methylation” folder is as follows:
Image by Author
Single-cell RNA-seq Simulation
Here is the simulation code:
setwd(scrnaseq_root)
# Load required libraries
library(Matrix)
library(ggplot2)
# Set the number of genes and cells
library(org.Hs.eg.db)
Genes = keys(org.Hs.eg.db, keytype="SYMBOL")
n_genes = length(Genes)
n_cells = 100
Cells = paste0("Cell", 1:n_cells)
n_CellType = 5
celltypes = c(rep("Fibroblast", n_cells/n_CellType),
rep("Neutrophil", n_cells/n_CellType),
rep("Macrophage", n_cells/n_CellType),
rep("Mast Cell", n_cells/n_CellType),
rep("Eosinophil", n_cells/n_CellType)
)
# Simulate gene expression data
set.seed(123)
mat <- matrix(0, nrow = n_genes, ncol=n_cells)
pb = txtProgressBar(min=0, max=n_genes, initial=0, style=3)
for (i in 1:n_genes) {
readcount_fibroblast = rnbinom(n=n_cells/n_CellType, prob=runif(1), size=round(runif(1,1,50)))
readcount_neutrophil = rnbinom(n=n_cells/n_CellType, prob=runif(1), size=round(runif(1,1,50)))
readcount_macrophage = rnbinom(n=n_cells/n_CellType, prob=runif(1), size=round(runif(1,1,50)))
readcount_mastcell = rnbinom(n=n_cells/n_CellType, prob=runif(1), size=round(runif(1,1,50)))
readcount_eosinophil = rnbinom(n=n_cells/n_CellType, prob=runif(1), size=round(runif(1,1,50)))
readcount = c(readcount_fibroblast,
readcount_neutrophil,
readcount_macrophage,
readcount_mastcell,
readcount_eosinophil)
mat[i,] = readcount
setTxtProgressBar(pb, i)
}
rownames(mat) = Genes
colnames(mat) = Cells
sparse_mat <- as(mat, "CsparseMatrix")
# Save the data as .mtx file
writeMM(sparse_mat, file = "matrix.mtx")
# Create and save features.tsv.gz and barcodes.tsv.gz files
d = data.frame(feature.names=rownames(mat), gene.column=rownames(mat))
write.table(d, file = "features.tsv", quote = FALSE, row.names = FALSE, col.names = FALSE)
write.table(colnames(mat), file = "barcodes.tsv", quote = FALSE, row.names = FALSE, col.names = FALSE)
# Remove previous data if it exists
if (file.exists("example")){
unlink("example", recursive = TRUE)
}
# Pack the data
dir_path <- "./example"
library(R.utils)
gzip("matrix.mtx", "example/matrix.mtx.gz")
gzip("features.tsv", "example/features.tsv.gz")
gzip("barcodes.tsv", "example/barcodes.tsv.gz")
In a similar manner, I’ve constructed a for loop to generate varying RNA-seq distributions for, let’s say, 5 different cell types. I’ve also implemented codes to create the real-world objects of scRNA, which are used as input by the Seurat package. These include three compressed files: matrix, features, and barcodes. All these objects are stored in the “example” folder.
I’ve provided an example code for sc-RNA analysis using the Seurat package:
library(Seurat)
data = Read10X(data.dir = "example")
seurat.obj = CreateSeuratObject(counts = data, project="Example")
# Normalize and Scale
seurat.obj = NormalizeData(seurat.obj)
seurat.obj = ScaleData(seurat.obj, features = rownames(seurat.obj))
# Find variable genes
seurat.obj = FindVariableFeatures(seurat.obj)
# Dimension reduction
seurat.obj = RunPCA(seurat.obj, features = VariableFeatures(object = seurat.obj), npcs=10)
seurat.obj = FindNeighbors(seurat.obj, dims = 1:10)
seurat.obj = FindClusters(seurat.obj, resolution = 0.5)
seurat.obj = RunUMAP(seurat.obj, dims = 1:10)
DimPlot(seurat.obj)
# Visualization
# Find 5 most variable genes
features = VariableFeatures(object=seurat.obj)[1:5]
# Which cell cluster has high expression of these genes
FeaturePlot(seurat.obj, features = features)
VlnPlot(seurat.obj, features = features)
RidgePlot(seurat.obj, features = features)
DotPlot(seurat.obj, features = features)
DoHeatmap(seurat.obj, features = features)
The Seurat package is a classic and comprehensive tool that we utilize for sc-RNA data analysis. This package is capable of generating numerous impressive plots.
Upon execution of the codes, the “ScRNA” folder is as follows:
Image by Author
Spatial Transcriptome (ST) Simulation
Here is the simulation code:
SimulateSTData = function(X, Y, n_clusters, tissue, root=getwd()) {
"
This function create a simple one-resolutation ST data
X, Y: width and height of the tissue image
n_clusters: simulated cluster of transcriptomes
tissue: list of tissue features, containing: centerX, centerY, radius, shape('I', 'II')
"
width = X
height = Y
n_cells = X * Y
Cells = paste0("Cell", 1:n_cells)
library(org.Hs.eg.db)
Genes = keys(org.Hs.eg.db, keytype="SYMBOL")
Genes = Genes[!duplicated(Genes)]
n_genes = length(Genes)
# Create tissue mask: a circle with a sin-like border with predefined radius
mask = matrix(0, nrow=width, ncol=height)
centerX = tissue$centerX ; centerY = tissue$centerY
radius = tissue$radius
a <- 0.6 * radius # amplitude of the sinusoidal variation
w <- 2 # frequency of the sinusoidal variation
# Define the sinusoidal coefficient function
sin_coef <- function(x, y) {
theta <- atan2(y, x) # angle with the x-axis
return(a * sin(w * theta))
}
# Define the circle function
if (tissue$shape == "I"){
circle <- function(x, y, r) {
return(x^2 + y^2 - r^2 - sin_coef(x, y))
}
}
if (tissue$shape == "II") {
circle <- function(x, y, r) {
return((x^2 + y^2 - (r + sin_coef(x, y))^2))
}
}
for (x in 1:width) {
for (y in 1:height){
if (circle(x=(x-centerX), y=(y-centerY), r=radius) <= 0){
mask[y, x] = 1
}
}
}
# Simulate gene expression data
set.seed(1)
mat <- matrix(0, nrow = n_genes, ncol=n_cells)
pb = txtProgressBar(min=0, max=n_genes, initial=0, style=3)
for (i in 1:n_genes) {
readcount = NULL
for (k in 1:(n_clusters-1)){
readcount = c(readcount, rnbinom(n=n_cells/n_clusters, prob=runif(1), size=round(runif(1,1,50))))
}
readcount = c(readcount, rnbinom(n=n_cells-length(readcount), prob=runif(1), size=round(runif(1,1,50))))
mat[i,] = readcount
setTxtProgressBar(pb, i)
}
rownames(mat) = Genes
colnames(mat) = Cells
# Make sure non-tissue space have no/baseline expression
for (cell in 1:ncol(mat)){
x = cell%%height
y = as.integer(cell/height)
if (circle(x=(x-centerX), y=(y-centerY), r=radius) > 0){
mat[,cell] = round(runif(1, 0, 5))
}
}
sparse_mat <- as(mat, "CsparseMatrix")
# Create raw folder
d = data.frame(feature.names=rownames(mat), gene.column=rownames(mat), assay=rep("Expression", nrow(mat)))
dir_path <- paste0(root, "/raw")
dir.create(dir_path)
writeMM(sparse_mat, file = "matrix.mtx")
write.table(d, file = "features.tsv", quote = FALSE, row.names = FALSE, col.names = FALSE)
write.table(colnames(mat), file = "barcodes.tsv", quote = FALSE, row.names = FALSE, col.names = FALSE)
library(R.utils)
gzip("matrix.mtx", paste0(dir_path, "/matrix.mtx.gz"))
gzip("features.tsv", paste0(dir_path, "/features.tsv.gz"))
gzip("barcodes.tsv", paste0(dir_path, "/barcodes.tsv.gz"))
# Create spatial folder
spatial_dir = paste0(root, "/spatial")
dir.create(spatial_dir)
# scalefactors_json.json
json <- list(
regist_target_img_scalef = 0,
tissue_hires_scalef = 0,
tissue_lowres_scalef = 1.0,
fiducial_diameter_fullres = 0,
spot_diameter_fullres = 0
)
json_data <- toJSON(json, auto_unbox = TRUE)
write(json_data, file = paste0(spatial_dir, "/scalefactors_json.json"))
library(png)
library(grid)
# Set the dimensions of the image tissue_lowres_image.png
width <- Y
height <- X
set.seed(1)
img <- matrix(runif(width * height * 3), nrow = width, ncol = height)
mask
img = img*mask
# Reshape the matrix into an array
img <- array(data = img, dim = c(width, height, 3))
# Save the image as a .png file
writePNG(img, paste0(spatial_dir, "/tissue_lowres_image.png"))
# tissue_positions.csv
in_tissue = NULL ; i = 0
array_row = rep(seq(0, X-1), Y)
array_row = array_row[order(array_row)]
array_col = rep(seq(0, (Y-1), 1), X)
for (i in 1:length(array_row)) {
x = array_col[i]
y = array_row[i]
if (circle(x=(x-centerX), y=(y-centerY), r=radius) <= 0){
in_tissue[i] = 1
} else {
in_tissue[i] = 0
}
}
tissue_positions = data.frame(barcode=Cells, in_tissue, array_row, array_col,
pxl_row_in_fullres=array_row, pxl_col_in_fullres=array_col)
write_csv(tissue_positions, paste0(spatial_dir, "/tissue_positions.csv"))
}
Simulating ST data is a complex task. It necessitates an understanding of the spatial correlation among the image, count matrix, barcode, and tissue position metadata. Therefore, I’ve encapsulated the simulation codes into a function for future use. In this simulated data, we treat low-resolution and high-resolution images as having the same size due to memory space constraints. I’ve introduced a “simulated tissue” with an apparently natural shape using a mathematical formula. To add more intrigue, I’ve created two types of tissue shapes.
Image by Author
We start using the simulation function:
setwd(ST_root)
library(Matrix)
library(ggplot2)
library(rhdf5)
library(jsonlite)
library(png)
library(SpotClean)
library(readr)
library(Seurat)
# Define parameters
X = 30
Y = 30
n_clusters = 5
tissue = list(centerX = 10, centerY=20, radius=5, shape="I")
root = "STData"
# Remove previous data if it exists
if (file.exists(root)){
unlink(root, recursive = TRUE)
}
# Simulate ST data
SimulateSTData(X=X, Y=Y, n_clusters=n_clusters, tissue=tissue, root=root)
# Load the data
raw = Read10X("STData/raw")
slide_info = read10xSlide("STData/spatial/tissue_positions.csv",
"STData/spatial/tissue_lowres_image.png",
"STData/spatial/scalefactors_json.json")
ST_obj = createSlide(count_mat = raw,
slide_info = slide_info)
The “ST” folder after running the code:
Image by Author
The real-world “STData/spatial” folder contains additional files, such as H&E images and immunofluorescence images. Simulating these images within this blog would be quite cumbersome and not particularly beneficial. To give you an idea, here’s what this folder looks like with real-world data:
Image by Author
To show the low-resolution original tissue:
visualizeSlide(slide_obj = ST_obj)
Image by Author
If we want to visualize the spatial expression of a given gene such as TP53. We can use the visualizeHeatmap function like this:
# We use TP53 gene as an example
g = "TP53"
visualizeHeatmap(ST_obj, g)
The image should be:
Image by Author
To make it clearer, we could use our customed code with the ggplot2 package:
pd = visualizeHeatmap(ST_obj, g)
pd = pd$data
tmp = expand.grid(row=1:Y, col=1:X)
ggplot(pd) +
geom_point(data=tmp, aes(x=col, y=-row), size=5, color="grey") +
geom_point(data=pd, aes(x=col, y=-row, color=value), size=5) +
scale_color_viridis_c(g)
The resulting image should be:
Image by Author
We found that there is too much contamination of expression outside the tissue. The SpotClean package helps us decontaminate the heatmap:
decont_obj = spotclean(ST_obj)
g = "TP53"
pd = visualizeHeatmap(decont_obj, g)
pd = pd$data
tmp = expand.grid(row=1:Y, col=1:X)
ggplot(pd) +
geom_point(data=tmp, aes(x=col, y=-row), size=5, color="grey") +
geom_point(data=pd, aes(x=col, y=-row, color=value), size=5) +
scale_color_viridis_c(g)
Image by Author
With that, I conclude this blog by creating a combined plot showing the spatial expression of TP53, CCND1, RB1, CTNNB1, EGFR, and IDH1, which are interested genes in cancer research, using this simulated data:
Raw heatmaps - Image by Author
Decontaminated heatmaps - Image by Author
Here is the code I used to generate this plot:
making_fun_plots = function(ST_obj, g){
pd = visualizeHeatmap(ST_obj, g)
pd = pd$data
tmp = expand.grid(row=1:Y, col=1:X)
p = ggplot(pd) +
geom_point(data=tmp, aes(x=col, y=-row), size=5, color="grey") +
geom_point(data=pd, aes(x=col, y=-row, color=value), size=5) +
scale_color_viridis_c(g) +
ggtitle(g) +
theme(legend.position = "none",
plot.title = element_text(size=20, face="bold"),
axis.text = element_blank(),
axis.title = element_blank())
return(p)
}
# Raw combined plot
p1 = making_fun_plots(ST_obj=ST_obj, g="TP53")
p2 = making_fun_plots(ST_obj=ST_obj,g="CCND1")
p3 = making_fun_plots(ST_obj=ST_obj,g="RB1")
p4 = making_fun_plots(ST_obj=ST_obj,g="CTNNB1")
p5 = making_fun_plots(ST_obj=ST_obj,g="EGFR")
p6 = making_fun_plots(ST_obj=ST_obj,g="IDH1")
gridExtra::grid.arrange(p1, p2, p3,
p4, p5, p6,
nrow=2)
# Decontaminated combined plot
p1 = making_fun_plots(ST_obj=decont_obj, g="TP53")
p2 = making_fun_plots(ST_obj=decont_obj,g="CCND1")
p3 = making_fun_plots(ST_obj=decont_obj,g="RB1")
p4 = making_fun_plots(ST_obj=decont_obj,g="CTNNB1")
p5 = making_fun_plots(ST_obj=decont_obj,g="EGFR")
p6 = making_fun_plots(ST_obj=decont_obj,g="IDH1")
gridExtra::grid.arrange(p1, p2, p3,
p4, p5, p6,
nrow=2)
Conclusion
Thank you for taking the time to read this blog. I believe that these simulation codes can serve as valuable tools for learning bioinformatics and understanding the data structure of significant bioinformatic data types.
I welcome any suggestions for improving the provided codes.