Mouse Erythroid

This notebook will show you the basic usage of UniTVelo. In this tutorial, we will use the mouse gastrulation subset to erythroid lineage as example. Similar process can be applied on your own data or please refer to other tutorials.

Import Packages

Some of the pre- or post-processing steps for scRNA-seq data require functions from scVelo. For the time being, both scVelo and UniTVelo are imported.

[1]:

import scvelo as scv
scv.settings.verbosity = 0
import unitvelo as utv

(Running UniTVelo 0.2.0)
2022-08-02 05:34:58

[2]:

adata = scv.datasets.gastrulation_erythroid()
adata

[2]:

AnnData object with n_obs × n_vars = 9815 × 53801
    obs: 'sample', 'stage', 'sequencing.batch', 'theiler', 'celltype'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'MURK_gene', 'Δm', 'scaled Δm'
    uns: 'celltype_colors'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced'

Here we define some basic parameters needed for the method, - dataset: abosolute or relative path to the dataset (loom, h5ad, …) - label: name of the annotation column of observations in adata.obs - exp_metrics: to store the evaluation results, default {}

[3]:

dataset = './data/Gastrulation/erythroid_lineage.h5ad'
label = 'celltype'
exp_metrics = {}

Parameter cluster_edges is for algorithm evaluation purpose given expert annotated ground truth. It contains a list of tuples in which stores the source cluster and target cluster of cells.

[4]:

cluster_edges = [
    ("Blood progenitors 1", "Blood progenitors 2"),
    ("Blood progenitors 2", "Erythroid1"),
    ("Erythroid1", "Erythroid2"),
    ("Erythroid2", "Erythroid3")]

scVelo stochastic

For the following sections, we will comapre the model performance with both co-variance based stochastic mode and likelihood based dynamical mode.

[5]:

title = 'scVelo stochastic mode'
adata = scv.datasets.gastrulation_erythroid()
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

scv.tl.velocity(adata, mode='stochastic')

scv.tl.velocity_graph(adata)
scv.pl.velocity_embedding_stream(adata, color=label, dpi=100, title=title)

Model performance can be measured with two quantitative metrics beside visual inspection: - Cross-Boundary Direction Correctness (A -> B): transition correctness between clusters - In-Cluster Coherence: transition smoothness within clusters

[7]:

scv.pp.neighbors(adata)
adata_velo = adata[:, adata.var.loc[adata.var['velocity_genes'] == True].index]
exp_metrics["model_sto"] = utv.evaluate(adata_velo, cluster_edges, label, 'velocity')

# Cross-Boundary Direction Correctness (A->B)
{('Blood progenitors 1', 'Blood progenitors 2'): 0.5155429131874362, ('Blood progenitors 2', 'Erythroid1'): 0.2579814031216949, ('Erythroid1', 'Erythroid2'): 0.32298710704390915, ('Erythroid2', 'Erythroid3'): -0.49925458479887247}
Total Mean: 0.14931420963854192
# In-cluster Coherence
{'Blood progenitors 1': 0.746955, 'Blood progenitors 2': 0.64198303, 'Erythroid1': 0.709536, 'Erythroid2': 0.67009205, 'Erythroid3': 0.93389046}
Total Mean: 0.7404912710189819

scVelo dynamic

Dynamical mode tries to infer full transcriptional dynamics using a generalized mode. It also yields latent time and other subtle cellular processes.

[8]:

title = 'scVelo dynamical mode'
adata = scv.datasets.gastrulation_erythroid()
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

scv.tl.recover_dynamics(adata, n_jobs=20)
scv.tl.velocity(adata, mode='dynamical')

scv.tl.velocity_graph(adata)
scv.pl.velocity_embedding_stream(adata, color=label, dpi=100, title=title)

[9]:

scv.pp.neighbors(adata)
adata_velo = adata[:, adata.var.loc[adata.var['velocity_genes'] == True].index]
exp_metrics["model_dyn"] = utv.evaluate(adata_velo, cluster_edges, label, 'velocity')

# Cross-Boundary Direction Correctness (A->B)
{('Blood progenitors 1', 'Blood progenitors 2'): -0.2417314949668877, ('Blood progenitors 2', 'Erythroid1'): -0.11234413583476499, ('Erythroid1', 'Erythroid2'): -0.7968808353191983, ('Erythroid2', 'Erythroid3'): -0.2568351816216074}
Total Mean: -0.35194791193561464
# In-cluster Coherence
{'Blood progenitors 1': 0.7927679910222623, 'Blood progenitors 2': 0.8114391255652827, 'Erythroid1': 0.8692230934014971, 'Erythroid2': 0.874011179638546, 'Erythroid3': 0.842684935958577}
Total Mean: 0.8380252651172331

Latent time is introduced based on transcriptional dynamics. It approximates the differentiated time experienced by cells.

[10]:

scv.tl.latent_time(adata)
scv.pl.scatter(adata, color='latent_time', color_map='gnuplot', size=20, dpi=100)

UniTVelo

Here we re-analysis the mouse gastrulation data subset to erythroid lineage. We demonstrate that UniTVelo with unified-time setting could alleviate the over-fitting problem.

UniTVelo requires a configuration file as input. You may sub-class it from base config file config.py and override the parameters you need to change (demonstrated below). For the details and comments of each parameter, please refer to config.py.

[4]:

velo_config = utv.config.Configuration()
velo_config.R2_ADJUST = True
velo_config.IROOT = None
velo_config.FIT_OPTION = '1'
velo_config.AGENES_R2 = 1

utv.run_model is an integrated function for RNA velocity analysis. It includes the core velocity estimation process and a few pre-processing functions provided by scVelo (normalization, neighbor graph construction to be specific).

[12]:

adata = utv.run_model('./data/Gastrulation/erythroid_lineage.h5ad', label, config_file=velo_config)
scv.pl.velocity_embedding_stream(adata, color=adata.uns['label'], dpi=100, title='')

-------> Model Configuration Settings <-------

 GPU: 2 FIG_DIR: ./figures/     BASE_FUNCTION: Gaussian
 GENERAL: Curve BASIS: None     N_TOP_GENES: 2000
 OFFSET_GENES: False    FILTER_CELLS: False     EXAMINE_GENE: False
 RESCALE_TIME: False    RESCALE_DATA: True      R2_ADJUST: True
 IROOT: None    NUM_REPEAT: 1   FIT_OPTION: 1
 DENSITY: SVD   REORDER_CELL: Soft_Reorder      AGGREGATE_T: True
 ASSIGN_POS_U: False    WIN_SIZE: 50    LEARNING_RATE: 0.01
 MAX_ITER: 10000        USE_RAW: False  RAW_GENES: False

---> # of velocity genes used 513
---> # of velocity genes used 505
---> # of velocity genes used 503

Running UniTVelo may take a while depending on number of cells within beacuse currently the model adpots Gradient Descent for optimizing the loss function. Thus, returned data and fitted results for each gene are automatically saved under tempdata folder for future use.

[13]:

import os
os.listdir('./data/Gastrulation/tempdata')

[13]:

['Ms.csv', 'fitu.csv', 'Mu.csv', 'fits.csv', 'fitvar.csv', 'temp.h5ad']

Colors of each cluster can be changed manually if there is potential confusion on analysis.

[27]:

adata = scv.read('./data/Gastrulation/tempdata/temp.h5ad')
adata.uns[f'{label}_colors'] = ['#1f77b4', '#aec7e8', '#ff7f0e', '#ffbb78', '#2ca02c']

Evaluation metrics of UniTVelo illustrate the improved performance compare with previous method. Both CBDir and ICCoh value are higher.

[15]:

scv.pp.neighbors(adata)
adata_velo = adata[:, adata.var.loc[adata.var['velocity_genes'] == True].index]
exp_metrics["model_unit"] = utv.evaluate(adata_velo, cluster_edges, label, 'velocity')

# Cross-Boundary Direction Correctness (A->B)
{('Blood progenitors 1', 'Blood progenitors 2'): 0.9072786593461084, ('Blood progenitors 2', 'Erythroid1'): 0.7650428018475586, ('Erythroid1', 'Erythroid2'): 0.8304018448051743, ('Erythroid2', 'Erythroid3'): 0.8277381954264528}
Total Mean: 0.8326153753563236
# In-cluster Coherence
{'Blood progenitors 1': 0.983277426107309, 'Blood progenitors 2': 0.9736145077846102, 'Erythroid1': 0.9904579975276204, 'Erythroid2': 0.9954329107426398, 'Erythroid3': 0.9962790431382197}
Total Mean: 0.9878123770600797

Similar results can also be observed using latent time.

[16]:

scv.tl.latent_time(adata, min_likelihood=None)
scv.pl.scatter(adata, color='latent_time', color_map='gnuplot', size=20, dpi=100)

Further Analysis

Both scVelo and UniTVelo undergo gene selction procedure before inferring veloctity matrix. Therefore, only a subset of genes have the inferred parameters.

[29]:

subvar = adata.var.loc[adata.var['velocity_genes'] == True]
sub = adata[:, subvar.index]

UniTVelo is designed based on spiced mRNA oriented framework and currently a RBF model is used as linking function. The model has the ability to infer the activation time by scaling parameter, expression strength and peak time of individual genes. Specifically, by plotting the histogram of peak times, a brief activity of the biological system can be revealed. For instance, in this mouse erythroid lineage, most genes are classified as repression genes whilst a few are considered as induction genes.

[19]:

import seaborn as sns
import matplotlib.pyplot as plt
sns.displot(sub.var['fit_t0'].values, kde=True, bins=20)
plt.xticks([0, 0.5, 1], fontsize=13)
plt.yticks(fontsize=13)
plt.ylabel('Number of Genes', fontsize=15)
plt.title('Peak Time', fontsize=15)

[19]:

Text(0.5, 1.0, 'Peak Time')

The phase portraits and transcriptional activities of each gene can be visualized by plotting function. For instance, Cyr61. It also supports the checking of un/spliced changes along time.

[20]:

utv.pl.plot_range('Cyr61', adata, velo_config, show_legend=False, show_ax=True)

The following three heatmaps shows the expression profile (rows) againese cell ordering (columns, inferred cell time). Genes are separated into three groups for validation purpose, repression genes, inducitons genes and transient genes. Brighter color represents higher expression.

[21]:

gene = sub.var.loc[sub.var['fit_t0'] < 0.05].index # repression gene
scv.pl.heatmap(
    adata, var_names=gene, sortby='latent_time', yticklabels=False,
    col_color='celltype', n_convolve=100)

[22]:

gene = sub.var.loc[sub.var['fit_t0'] > 0.95].index # induction gene
scv.pl.heatmap(
    adata, var_names=gene, sortby='latent_time', yticklabels=False,
    col_color='celltype', n_convolve=100)

[23]:

gene = sub.var.loc[(sub.var['fit_t0'] > 0.05) & (sub.var['fit_t0'] < 0.95)].index # transient gene
scv.pl.heatmap(
    adata, var_names=gene, sortby='latent_time', yticklabels=False,
    col_color='celltype', n_convolve=100)