哈佛大學(xué)單細(xì)胞課程|筆記匯總 (八)
哈佛大學(xué)劉小樂教授講授的計(jì)算生物學(xué)和生物信息學(xué)導(dǎo)論 (2020 視頻+資料)
(八)Single-cell RNA-seq clustering analysis
得到高質(zhì)量的細(xì)胞后,我們可以探究細(xì)胞群中的不同細(xì)胞類型。
Exploration of quality control metrics
為了確定我們的細(xì)胞亞群是否可能是由于諸如細(xì)胞周期階段或線粒體表達(dá)之類的假象引起的,可以對(duì)不同cluster之間進(jìn)行多種可視化分析判斷影響因素。這里我們將調(diào)用DimPlot()和FeaturePlot()函數(shù)來分析并可視化多種質(zhì)控指標(biāo)。
按照樣本對(duì)細(xì)胞亞群進(jìn)行分離
從探索每個(gè)樣品中不同細(xì)胞亞群開始分析:
# 從Seurat對(duì)象中提取身份和樣本信息,以確定每個(gè)樣本的每個(gè)簇的細(xì)胞數(shù)
n_cells <- FetchData(seurat_integrated,
vars = c("ident", "orig.ident")) %>%
dplyr::count(ident, orig.ident) %>%
tidyr::spread(ident, n)
# View table
View(n_cells)
使用UMAP對(duì)每個(gè)樣品不同細(xì)胞亞群進(jìn)行可視化:
# UMAP of cells in each cluster by sample
DimPlot(seurat_integrated,
label = TRUE,
split.by = "sample") + NoLegend()
從上圖沒有看到單獨(dú)由樣品來源引起的聚類簇。
按照細(xì)胞周期對(duì)細(xì)胞亞群進(jìn)行分離
我們可以探討細(xì)胞是否在不同的細(xì)胞周期階段聚類。當(dāng)我們利用SCTransform執(zhí)行歸一化和無意義的變異源回歸(regression of uninteresting sources of variation)時(shí),沒有考慮細(xì)胞周期的影響。
如果我們的細(xì)胞簇在不同細(xì)胞周期上顯示出很大差異,則表明我們要重新運(yùn)行SCTransform并將S.Score和G2M.Score (細(xì)胞周期得分)添加到變量中以進(jìn)行回歸,然后重新運(yùn)行其余步驟 。
# Explore whether clusters segregate by cell cycle phase
DimPlot(seurat_integrated,
label = TRUE,
split.by = "Phase") + NoLegend()
從上圖未能看到單個(gè)細(xì)胞周期主導(dǎo)的聚類簇。
可視化其它指標(biāo)對(duì)細(xì)胞亞群分群的影響
我們還可以探索其他指標(biāo),例如每個(gè)細(xì)胞的UMI和基因數(shù)量,S期和G2M期的標(biāo)記,以及通過UMAP進(jìn)行的線粒體基因表達(dá)。其中各個(gè)S和G2M分?jǐn)?shù)可提供更多的周期信息。
# Determine metrics to plot present in [email protected]
metrics <- c("nUMI", "nGene", "S.Score", "G2M.Score", "mitoRatio")
FeaturePlot(seurat_integrated,
reduction = "umap",
features = metrics,
pt.size = 0.4,
sort.cell = TRUE,
min.cutoff = 'q10',
label = TRUE)
NOTE: The
sort.cellargument will plot the positive cells above the negative cells, while themin.cutoffargument will determine the threshold for shading. Amin.cutoffofq10translates to the 10% of cells with the lowest expression of the gene will not exhibit any purple shading (completely gray).
這些指標(biāo)在不同的cluster中還是比較均勻的,但nUMI和nGene在cluster 3、9、14和15以及也許還有cluster 17中顯示出了更高的值。
探索導(dǎo)致不同分群clusters的PC
我們可以探索不同PC分群clusters的情況,以確定定義的PC是否能夠很好地區(qū)分細(xì)胞類型。為了可視化此信息,我們需要提取細(xì)胞的UMAP坐標(biāo)信息以及每個(gè)PC的相應(yīng)分?jǐn)?shù),并通過UMAP查看。
首先識(shí)別Seurat對(duì)象中的待提取信息,然后使用FetchData()函數(shù)提取。
# Defining the information in the seurat object of interest
columns <- c(paste0("PC_", 1:16),
"ident",
"UMAP_1", "UMAP_2")
# Extracting this data from the seurat object
pc_data <- FetchData(seurat_integrated,
vars = columns)NOTE: How did we know in the
FetchData()function to includeUMAP_1to obtain the UMAP coordinates? The Seurat cheatsheet(https://satijalab.org/seurat/essential_commands.html) describes the function as being able to pull any data from the expression matrices, cell embeddings, or metadata.For instance, if you explore the
seurat_integrated@reductionslist object, the first component is for PCA, and includes a slot forcell.embeddings. We can use the column names (PC_1, PC_2, PC_3, etc.) to pull out the coordinates or PC scores corresponding to each cell for each of the PCs.We could do the same thing for UMAP:
# Extract the UMAP coordinates for the first 10 cells
seurat_integrated@[email protected][1:10, 1:2]
The FetchData() function just allows us to extract the data more easily.利用FetchData()函數(shù)能更輕松地提取數(shù)據(jù)。
在以下UMAP圖中,細(xì)胞將根據(jù)PC score進(jìn)行上色。下面的是top16 PCs:
# Adding cluster label to center of cluster on UMAP
umap_label <- FetchData(seurat_integrated,
vars = c("ident", "UMAP_1", "UMAP_2")) %>%
group_by(ident) %>%
summarise(x=mean(UMAP_1), y=mean(UMAP_2))
# Plotting a UMAP plot for each of the PCs
map(paste0("PC_", 1:16), function(pc){
ggplot(pc_data,
aes(UMAP_1, UMAP_2)) +
geom_point(aes_string(color=pc),
alpha = 0.7) +
scale_color_gradient(guide = FALSE,
low = "grey90",
high = "blue") +
geom_text(data=umap_label,
aes(label=ident, x, y)) +
ggtitle(pc)
}) %>%
plot_grid(plotlist = .)
我們可以看到clusters是如何由不同的PC表示。例如,驅(qū)動(dòng)PC_2的基因在cluster 6、11和17中貢獻(xiàn)更大(在15中也可能更高),因此我們可以回顧一下驅(qū)動(dòng)此PC的基因,以了解細(xì)胞類型可能是什么:
# Examine PCA results
print(seurat_integrated[["pca"]], dims = 1:5, nfeatures = 5)
使用CD79A基因和HLA基因作為PC_2的陽性標(biāo)記,我們可以假設(shè)cluster 6、11和17對(duì)應(yīng)于B細(xì)胞。這一點(diǎn)只能為我們細(xì)胞分群提供線索,cluster的真正身份還需要通過PC組合來確定。
探究已知的細(xì)胞類型marker
細(xì)胞聚類之后,我們可以通過尋找已知的marker來探索細(xì)胞類型。下圖顯示了帶有marker的簇的UMAP圖,然后顯示了預(yù)期的不同細(xì)胞類型。
DimPlot(object = seurat_integrated,
reduction = "umap",
label = TRUE) + NoLegend()
可以利用FeaturePlot()對(duì)細(xì)胞marker進(jìn)行展示:

利用Seurat的FeaturePlot()函數(shù)可以輕松地在UMAP可視化上探索已知的標(biāo)記,進(jìn)而確定簇的身份。同時(shí)我們可以使用RNA assay slot中的標(biāo)準(zhǔn)化計(jì)數(shù)數(shù)據(jù),來訪問所有基因的表達(dá)水平(而不僅僅是3000個(gè)高度可變的基因)。
# Select the RNA counts slot to be the default assay
DefaultAssay(seurat_integrated) <- "RNA"
# Normalize RNA data for visualization purposes
seurat_integrated <- NormalizeData(seurat_integrated, verbose = FALSE)同時(shí)還需要考慮簇與簇之間marker表達(dá)的一致性。例如,如果一個(gè)細(xì)胞類型有兩個(gè)標(biāo)記基因,而簇中僅表達(dá)了其中一個(gè),那么我們就不能絕對(duì)地將該簇分配給該細(xì)胞類型。
CD14+ monocyte markers
FeaturePlot(seurat_integrated,
reduction = "umap",
features = c("CD14", "LYZ"),
sort.cell = TRUE,
min.cutoff = 'q10',
label = TRUE)
從Marker表達(dá)來看,CD14+單核細(xì)胞似乎與cluster 1、3和14相對(duì)應(yīng)。
FCGR3A+ monocyte markers
FeaturePlot(seurat_integrated,
reduction = "umap",
features = c("FCGR3A", "MS4A7"),
sort.cell = TRUE,
min.cutoff = 'q10',
label = TRUE)
從Marker表達(dá)來看,F(xiàn)CGR3A+單核細(xì)胞與cluster 9對(duì)應(yīng)。
Macrophages
FeaturePlot(seurat_integrated,
reduction = "umap",
features = c("MARCO", "ITGAM", "ADGRE1"),
sort.cell = TRUE,
min.cutoff = 'q10',
label = TRUE)
似乎沒有cluster對(duì)應(yīng)于巨噬細(xì)胞。
Conventional dendritic cell markers
FeaturePlot(seurat_integrated,
reduction = "umap",
features = c("FCER1A", "CST3"),
sort.cell = TRUE,
min.cutoff = 'q10',
label = TRUE)
conventional dendritic cells(cDC)對(duì)應(yīng)于cluster 15。
Plasmacytoid dendritic cell markers
FeaturePlot(seurat_integrated,
reduction = "umap",
features = c("IL3RA", "GZMB", "SERPINF1", "ITM2C"),
sort.cell = TRUE,
min.cutoff = 'q10',
label = TRUE)
Plasmacytoid dendritic cells (pDC)對(duì)應(yīng)于cluster 19。
NOTE: If any cluster appears to contain two separate cell types, it’s helpful to increase our clustering resolution to properly subset the clusters. Alternatively, if we still can’t separate out the clusters using increased resolution, then it’s possible that we had used too few principal components such that we are just not separating out these cell types of interest. To inform our choice of PCs, we could look at our PC gene expression overlapping the UMAP plots and determine whether our cell populations are separating by the PCs included.
上面的分析中還存著待解決的問題:
cluster 7和20的細(xì)胞類型標(biāo)識(shí)是什么?
對(duì)應(yīng)于相同細(xì)胞類型的簇是否具有生物學(xué)上有意義的差異?
這些細(xì)胞類型有亞群?jiǎn)幔?/p>
通過為這些簇鑒定其他標(biāo)記基因,我們能否更相信這些細(xì)胞類型身份?
下一步將執(zhí)行marker識(shí)別分析,該分析將輸出簇之間表達(dá)差異顯著的基因,能對(duì)上面的問題進(jìn)行解惑。使用這些基因,我們可以確定或提高對(duì)簇/亞群身份的信心。
