scRNA-seq#

scRNA-seq measures gene expression of individual cells.

Its analysis typically relies on data objects like AnnData, SingleCellExperiment & Seurat objects.

These objects often contain non-validated metadata, making data integration & interpretation hard.

In this notebook, LaminDB is used to turn AnnData objects into validated & queryable assets.

Setup#

!lamin init --storage ./test-scrna --schema bionty

import lamindb as ln
import lnschema_bionty as lb
import pandas as pd

ln.track()

💡 loaded instance: testuser1/test-scrna (lamindb 0.54.2)

💡 notebook imports: lamindb==0.54.2 lnschema_bionty==0.31.2 pandas==1.5.3

💡 Transform(id='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type=notebook, updated_at=2023-09-27 19:01:09, created_by_id='DzTjkKse')

💡 Run(id='lOuLIG7bl1pT2UwyrsU1', run_at=2023-09-27 19:01:09, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')

Human immune cells: Conde22#

lb.settings.species = "human"

Access #

Let’s look at a scRNA-seq count matrix in the form of an AnnData object that we’d like to ingest into LaminDB:

adata = ln.dev.datasets.anndata_human_immune_cells(
    populate_registries=True  # this pre-populates registries
)

adata

AnnData object with n_obs × n_vars = 1648 × 36503
    obs: 'donor', 'tissue', 'cell_type', 'assay'
    var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
    uns: 'cell_type_ontology_term_id_colors', 'default_embedding', 'schema_version', 'title'
    obsm: 'X_umap'

This AnnData object does not require filtering, normalizing or formatting, so we skip any preprocessing step and move straight to validation.

Validate #

Validate genes in `.var`#

lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id);

❗ 148 terms (0.40%) are not validated for ensembl_gene_id: ENSG00000269933, ENSG00000261737, ENSG00000259834, ENSG00000256374, ENSG00000263464, ENSG00000203812, ENSG00000272196, ENSG00000272880, ENSG00000270188, ENSG00000287116, ENSG00000237133, ENSG00000224739, ENSG00000227902, ENSG00000239467, ENSG00000272551, ENSG00000280374, ENSG00000236886, ENSG00000229352, ENSG00000286601, ENSG00000227021, ...

148 gene identifiers can’t be validated because they are not currently in the Gene registry. Let’s inspect them to see what to do:

inspector = lb.Gene.inspect(adata.var.index, lb.Gene.ensembl_gene_id)

Logging says 35 of the non-validated ids can be found in the Bionty reference. Let’s register them:

records = lb.Gene.from_values(inspector.non_validated, lb.Gene.ensembl_gene_id)
ln.save(records)

The remaining 113 are legacy IDs, not present in the current Ensembl assembly (e.g. ENSG00000112096).

We’d still like to register them:

validated = lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id)
records = [lb.Gene(ensembl_gene_id=gene_id) for gene_id in adata.var.index[~validated]]
ln.save(records)

Now all genes pass validation:

lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id);
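The gene curation steps above amount to a three-way split of the identifiers: already in the registry, known to the public reference, and unknown to both. Here is a minimal plain-Python sketch of that split. It is not how LaminDB implements validation: the `triage` function is hypothetical, and the two sets are toy stand-ins for the Gene registry and the Bionty reference.

```python
def triage(identifiers, registry, public_reference):
    """Split identifiers into validated, registrable, and legacy groups."""
    validated, registrable, legacy = [], [], []
    for identifier in identifiers:
        if identifier in registry:
            validated.append(identifier)  # already in the registry
        elif identifier in public_reference:
            registrable.append(identifier)  # known to the public reference
        else:
            legacy.append(identifier)  # unknown to both, e.g. retired IDs
    return validated, registrable, legacy


# toy data: sets stand in for the Gene registry and the public reference
registry = {"ENSG00000188290"}
public_reference = {"ENSG00000188290", "ENSG00000186827"}
var_index = ["ENSG00000188290", "ENSG00000186827", "ENSG00000112096"]

validated, registrable, legacy = triage(var_index, registry, public_reference)

# registering both remaining groups makes every identifier pass a second check
registry.update(registrable)
registry.update(legacy)
all_validated = all(identifier in registry for identifier in var_index)
print(validated, registrable, legacy, all_validated)
```

Keeping the legacy group separate mirrors the choice made above: those records are created from the bare identifier only, without annotations from the reference.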

Validate metadata in `.obs`#

adata.obs.columns

Index(['donor', 'tissue', 'cell_type', 'assay'], dtype='object')

ln.Feature.validate(adata.obs.columns);

❗ 1 term (25.00%) is not validated for name: donor

1 feature is not validated: "donor". Let’s register it:

Tip

Use features = ln.Feature.from_df(df) to bulk create features with types.

feature = ln.Feature(name="donor", type="category", registries=[ln.ULabel])
ln.save(feature)

All metadata columns are now validated as features:

ln.Feature.validate(adata.obs.columns);

Next, let’s validate the corresponding labels of each feature.

Some of the metadata labels can be typed using dedicated registries like CellType:

validated = lb.CellType.validate(adata.obs.cell_type)

❗ received 32 unique terms, 1616 empty/duplicated terms are ignored

❗ 2 terms (6.20%) are not validated for name: germinal center B cell, megakaryocyte

records = lb.CellType.from_values(adata.obs.cell_type[~validated], "name")
ln.save(records)

❗ now recursing through parents: this only happens once, but is much slower than bulk saving

lb.ExperimentalFactor.validate(adata.obs.assay)
lb.Tissue.validate(adata.obs.tissue);

Because we didn’t mount a custom schema that contains a Donor registry, we use the ULabel registry to track donor ids:

ln.ULabel.validate(adata.obs.donor);

❗ received 12 unique terms, 1636 empty/duplicated terms are ignored

❗ 12 terms (100.00%) are not validated for name: D496, 621B, A29, A36, A35, 637C, A52, A37, D503, 640C, A31, 582C

Donor labels are not validated, so let’s register them:

donors = [ln.ULabel(name=name) for name in adata.obs.donor.unique()]
ln.save(donors)

ln.ULabel.validate(adata.obs.donor);
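The validation logs in this section count unique terms rather than cells: with 1648 cells and 12 unique donor ids, the remaining 1636 entries are reported as empty or duplicated. The following plain-Python sketch reproduces that bookkeeping on toy data. The `summarize_terms` helper is hypothetical and only illustrates the counting, not LaminDB's internals.

```python
def summarize_terms(values, registry):
    """Count unique, ignored, and non-validated terms in a label column."""
    non_empty = [value for value in values if value not in (None, "")]
    unique = list(dict.fromkeys(non_empty))  # de-duplicate, keep first-seen order
    return {
        "n_unique": len(unique),
        "n_ignored": len(values) - len(unique),  # empty or duplicated entries
        "non_validated": [term for term in unique if term not in registry],
    }


# toy column with duplicates and one empty entry; the set stands in for a registry
donor_column = ["D496", "621B", "D496", "A29", "", "621B"]
registry = {"A29"}

summary = summarize_terms(donor_column, registry)
print(summary)
```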

Register #

modalities = ln.Modality.lookup()
experimental_factors = lb.ExperimentalFactor.lookup()
species = lb.Species.lookup()
features = ln.Feature.lookup()

Register data#

When we create a File object from an AnnData, we’ll automatically link its feature sets and get information about unmapped categories:

file = ln.File.from_anndata(
    adata, description="Conde22", field=lb.Gene.ensembl_gene_id, modality=modalities.rna
)

file.save()

The file has the following 2 linked feature sets:

file.features

Features:
  var: FeatureSet(id='vMT7XKlnTnDzWOZKdE73', n=36503, type='number', registry='bionty.Gene', hash='dnRexHCtxtmOU81_EpoJ', updated_at=2023-09-27 19:01:52, modality_id='m5vl3uGJ', created_by_id='DzTjkKse')
    'TRPM8', 'MCUB', 'NBDY', 'None', 'CTBS', 'None', 'VPS51', 'None', 'USP17L20', 'FNDC1-IT1', ...
  obs: FeatureSet(id='cXbZKNPrA429XX9OmdaD', n=4, registry='core.Feature', hash='LcJmEKQ6sT39iKzClddB', updated_at=2023-09-27 19:01:58, modality_id='cX7s1QPp', created_by_id='DzTjkKse')
    🔗 donor (0, core.ULabel): 
    🔗 assay (0, bionty.ExperimentalFactor): 
    🔗 cell_type (0, bionty.CellType): 
    🔗 tissue (0, bionty.Tissue): 

Register metadata links#

Let us first link external labels for the entire file:

file.labels.add(species.human, feature=features.species)
file.labels.add(experimental_factors.single_cell_rna_sequencing, feature=features.assay)

Next, we parse the columns of adata.obs for additional metadata:

file.labels.add(adata.obs.cell_type, feature=features.cell_type)
file.labels.add(adata.obs.assay, feature=features.assay)
file.labels.add(adata.obs.tissue, feature=features.tissue)
file.labels.add(adata.obs.donor, feature=features.donor)

file.features

Features:
  var: FeatureSet(id='vMT7XKlnTnDzWOZKdE73', n=36503, type='number', registry='bionty.Gene', hash='dnRexHCtxtmOU81_EpoJ', updated_at=2023-09-27 19:01:52, modality_id='m5vl3uGJ', created_by_id='DzTjkKse')
    'TRPM8', 'MCUB', 'NBDY', 'None', 'CTBS', 'None', 'VPS51', 'None', 'USP17L20', 'FNDC1-IT1', ...
  obs: FeatureSet(id='cXbZKNPrA429XX9OmdaD', n=4, registry='core.Feature', hash='LcJmEKQ6sT39iKzClddB', updated_at=2023-09-27 19:01:58, modality_id='cX7s1QPp', created_by_id='DzTjkKse')
    🔗 donor (12, core.ULabel): 'A36', '637C', '582C', 'A37', '621B', 'A35', 'D496', 'D503', 'A31', 'A52', ...
    🔗 assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v1', '10x 5' v2', '10x 3' v3'
    🔗 cell_type (32, bionty.CellType): 'mast cell', 'macrophage', 'CD4-positive helper T cell', 'memory B cell', 'gamma-delta T cell', 'progenitor cell', 'CD16-negative, CD56-bright natural killer cell, human', 'alveolar macrophage', 'classical monocyte', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', ...
    🔗 tissue (17, bionty.Tissue): 'jejunal epithelium', 'caecum', 'duodenum', 'omentum', 'thoracic lymph node', 'transverse colon', 'bone marrow', 'skeletal muscle tissue', 'sigmoid colon', 'spleen', ...
  external: FeatureSet(id='aF7lJgCN3yktEdwxJbHd', n=1, registry='core.Feature', hash='p-8KUgzl35HIJDfSWsA9', updated_at=2023-09-27 19:02:00, modality_id='cX7s1QPp', created_by_id='DzTjkKse')
    🔗 species (1, bionty.Species): 'human'

The file is now queryable by everything we linked:

file.describe()

File(id='hPfA4G55q74eh6N0Sz23', suffix='.h5ad', accessor='AnnData', description='Conde22', size=28049505, hash='WEFcMZxJNmMiUOFrcSTaig', hash_type='md5', updated_at=2023-09-27 19:01:58)

Provenance:
  🗃️ storage: Storage(id='yIHH0369', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-09-27 19:01:05, created_by_id='DzTjkKse')
  💫 transform: Transform(id='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type=notebook, updated_at=2023-09-27 19:01:52, created_by_id='DzTjkKse')
  👣 run: Run(id='lOuLIG7bl1pT2UwyrsU1', run_at=2023-09-27 19:01:09, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-27 19:01:05)
Features:
  var: FeatureSet(id='vMT7XKlnTnDzWOZKdE73', n=36503, type='number', registry='bionty.Gene', hash='dnRexHCtxtmOU81_EpoJ', updated_at=2023-09-27 19:01:52, modality_id='m5vl3uGJ', created_by_id='DzTjkKse')
    'TRPM8', 'MCUB', 'NBDY', 'None', 'CTBS', 'None', 'VPS51', 'None', 'USP17L20', 'FNDC1-IT1', ...
  obs: FeatureSet(id='cXbZKNPrA429XX9OmdaD', n=4, registry='core.Feature', hash='LcJmEKQ6sT39iKzClddB', updated_at=2023-09-27 19:01:58, modality_id='cX7s1QPp', created_by_id='DzTjkKse')
    🔗 donor (12, core.ULabel): 'A36', '637C', '582C', 'A37', '621B', 'A35', 'D496', 'D503', 'A31', 'A52', ...
    🔗 assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v1', '10x 5' v2', '10x 3' v3'
    🔗 cell_type (32, bionty.CellType): 'mast cell', 'macrophage', 'CD4-positive helper T cell', 'memory B cell', 'gamma-delta T cell', 'progenitor cell', 'CD16-negative, CD56-bright natural killer cell, human', 'alveolar macrophage', 'classical monocyte', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', ...
    🔗 tissue (17, bionty.Tissue): 'jejunal epithelium', 'caecum', 'duodenum', 'omentum', 'thoracic lymph node', 'transverse colon', 'bone marrow', 'skeletal muscle tissue', 'sigmoid colon', 'spleen', ...
  external: FeatureSet(id='aF7lJgCN3yktEdwxJbHd', n=1, registry='core.Feature', hash='p-8KUgzl35HIJDfSWsA9', updated_at=2023-09-27 19:02:00, modality_id='cX7s1QPp', created_by_id='DzTjkKse')
    🔗 species (1, bionty.Species): 'human'
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ tissues (17, bionty.Tissue): 'jejunal epithelium', 'caecum', 'duodenum', 'omentum', 'thoracic lymph node', 'transverse colon', 'bone marrow', 'skeletal muscle tissue', 'sigmoid colon', 'spleen', ...
  🏷️ cell_types (32, bionty.CellType): 'mast cell', 'macrophage', 'CD4-positive helper T cell', 'memory B cell', 'gamma-delta T cell', 'progenitor cell', 'CD16-negative, CD56-bright natural killer cell, human', 'alveolar macrophage', 'classical monocyte', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v1', '10x 5' v2', '10x 3' v3'
  🏷️ ulabels (12, core.ULabel): 'A36', '637C', '582C', 'A37', '621B', 'A35', 'D496', 'D503', 'A31', 'A52', ...

A less well curated dataset#

Access #

Let’s now consider a dataset with less well curated features:

pbmc68k = ln.dev.datasets.anndata_pbmc68k_reduced()
pbmc68k

AnnData object with n_obs × n_vars = 70 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

We see that this dataset is indexed by gene symbols:

pbmc68k.var.head()

	n_counts	highly_variable
index
HES4	1153.387451	True
TNFRSF4	304.358154	True
SSU72	2530.272705	False
PARK7	7451.664062	False
RBP7	272.811035	True

Validate #

lb.Gene.validate(pbmc68k.var.index, lb.Gene.symbol);

❗ 70 terms (9.20%) are not validated for symbol: ATPIF1, C1orf228, CCBL2, RP11-782C8.1, RP11-277L2.3, RP11-156E8.1, AC079767.4, GPX1, H1FX, SELT, ATP5I, IGJ, CCDC109B, FYB, H2AFY, FAM65B, HIST1H4C, HIST1H1E, ZNRD1, C6orf48, ...

lb.Gene.inspect(pbmc68k.var.index, lb.Gene.symbol);

Standardize symbols and register additional symbols from Bionty:

pbmc68k.var.index = lb.Gene.standardize(pbmc68k.var.index, lb.Gene.symbol)
gene_records = lb.Gene.from_values(pbmc68k.var.index, lb.Gene.symbol)
ln.save(gene_records)

In this case, we only want to register data with validated genes:

validated = lb.Gene.validate(pbmc68k.var.index, lb.Gene.symbol)
pbmc68k_validated = pbmc68k[:, validated].copy()

Convert gene symbols into ensembl gene ids:

records = lb.Gene.filter(id__in=[record.id for record in gene_records])
mapper = pd.DataFrame(records.values_list("symbol", "ensembl_gene_id")).set_index(0)[1]
pbmc68k_validated.var.insert(0, "gene_symbol", pbmc68k_validated.var.index)
pbmc68k_validated.var.rename(index=mapper, inplace=True)

pbmc68k_validated.var.head()

	gene_symbol	n_counts	highly_variable
ENSG00000188290	HES4	1153.387451	True
ENSG00000186827	TNFRSF4	304.358154	True
ENSG00000160075	SSU72	2530.272705	False
ENSG00000116288	PARK7	7451.664062	False
ENSG00000162444	RBP7	272.811035	True
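The renaming step relies on a symbol-to-ID mapper, and pandas `rename` leaves labels that are missing from the mapper unchanged. A plain-Python sketch of the same logic makes this behavior explicit. The symbol `UNKNOWN1` is a made-up placeholder for an unmapped gene, and the dictionary stands in for the Series built above.

```python
# (symbol, ensembl_gene_id) pairs, as they might come back from the registry
pairs = [
    ("HES4", "ENSG00000188290"),
    ("TNFRSF4", "ENSG00000186827"),
    ("SSU72", "ENSG00000160075"),
]
mapper = dict(pairs)

# toy index; "UNKNOWN1" is a hypothetical symbol without a mapping
var_index = ["HES4", "TNFRSF4", "SSU72", "UNKNOWN1"]

# keep the original symbols in a separate column before renaming
gene_symbol_column = list(var_index)

# labels missing from the mapper are kept as-is
renamed = [mapper.get(symbol, symbol) for symbol in var_index]
print(renamed)
```

Because unmapped labels pass through silently, it can be worth checking afterwards that every index entry looks like an Ensembl ID.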

Validate cell types:

# inspect shows none of the terms are mappable
lb.CellType.inspect(pbmc68k_validated.obs.cell_type)

# here we search the public ontology for each cell type name and grab the top match,
# then add the original pbmc68k name as a synonym of the matched record
celltype_bt = lb.CellType.bionty()
mapper = {}
for ct in pbmc68k_validated.obs.cell_type.unique():
    ontology_id = celltype_bt.search(ct).iloc[0].ontology_id
    record = lb.CellType.from_bionty(ontology_id=ontology_id)
    mapper[ct] = record.name
    record.save()
    record.add_synonym(ct)

# standardize cell type names in the dataset
pbmc68k_validated.obs.cell_type = pbmc68k_validated.obs.cell_type.map(mapper)

Now, all cell types are validated:

lb.CellType.validate(pbmc68k_validated.obs.cell_type);
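The loop above takes the top search hit for every cell type name, so the quality of the mapping depends on how good that hit is. The sketch below illustrates the same pattern in plain Python, swapping in `difflib.SequenceMatcher` from the standard library for the ontology search (this is not the Bionty search algorithm). The dataset labels and the three-name ontology are toy assumptions, and `top_match` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def top_match(query, ontology_names):
    """Return the best-scoring ontology name and its similarity score."""
    scored = [
        (SequenceMatcher(None, query.lower(), name.lower()).ratio(), name)
        for name in ontology_names
    ]
    score, name = max(scored)
    return name, score


# toy stand-ins for the public ontology and the dataset's free-text labels
ontology_names = ["dendritic cell", "monocyte", "B cell, CD19-positive"]
dataset_labels = ["Dendritic cells", "Monocytes"]

mapper = {}    # dataset label -> ontology name
synonyms = {}  # ontology name -> original labels recorded as synonyms
scores = {}    # dataset label -> similarity of the chosen match
for label in dataset_labels:
    name, score = top_match(label, ontology_names)
    mapper[label] = name
    scores[label] = score
    synonyms.setdefault(name, set()).add(label)

# standardize a toy column with the mapper
column = ["Monocytes", "Dendritic cells", "Monocytes"]
standardized = [mapper[label] for label in column]
print(standardized)
```

Keeping the scores around makes it easy to review low-scoring matches by hand before saving them as synonyms.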

Register #

file = ln.File.from_anndata(
    pbmc68k_validated,
    description="10x reference pbmc68k",
    field=lb.Gene.ensembl_gene_id,
    modality=modalities.rna,
)

file.save()

file.labels.add(pbmc68k_validated.obs.cell_type, feature=features.cell_type)
file.labels.add(species.human, feature=features.species)
file.labels.add(experimental_factors.single_cell_rna_sequencing, feature=features.assay)

file.features

Features:
  var: FeatureSet(id='efKTdXvuyyptTz4pN37A', n=754, type='number', registry='bionty.Gene', hash='WMDxN7253SdzGwmznV5d', updated_at=2023-09-27 19:02:29, modality_id='m5vl3uGJ', created_by_id='DzTjkKse')
    'SDF2L1', 'CCL5', 'H2AJ', 'KAT5', 'MCUB', 'BTG1', 'PARK7', 'POP5', 'COA1', 'GZMH', ...
  obs: FeatureSet(id='yITOYQY8CqEegvuctAgt', n=1, registry='core.Feature', hash='AmEc5BjhuP4J85kGIQkC', updated_at=2023-09-27 19:02:30, modality_id='cX7s1QPp', created_by_id='DzTjkKse')
    🔗 cell_type (9, bionty.CellType): 'dendritic cell', 'monocyte', 'central memory CD8-positive, alpha-beta T cell', 'CD16-positive, CD56-dim natural killer cell, human', 'mature T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'Cd4-negative, CD8_alpha-negative, CD11b-positive dendritic cell', 'B cell, CD19-positive', 'CD8-positive, alpha-beta memory T cell'
  external: FeatureSet(id='Dz6eP9kSkhtM3hFWEFrN', n=2, registry='core.Feature', hash='dR9prGZGs-GKL-5z6VXm', updated_at=2023-09-27 19:02:30, modality_id='cX7s1QPp', created_by_id='DzTjkKse')
    🔗 species (1, bionty.Species): 'human'
    🔗 assay (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'

file.describe()

File(id='ng0CwZ6maQnlYjysmXLV', suffix='.h5ad', accessor='AnnData', description='10x reference pbmc68k', size=660792, hash='GU-hbSJqGkENOxVKFLmvbA', hash_type='md5', updated_at=2023-09-27 19:02:30)

Provenance:
  🗃️ storage: Storage(id='yIHH0369', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-09-27 19:01:05, created_by_id='DzTjkKse')
  💫 transform: Transform(id='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type=notebook, updated_at=2023-09-27 19:02:29, created_by_id='DzTjkKse')
  👣 run: Run(id='lOuLIG7bl1pT2UwyrsU1', run_at=2023-09-27 19:01:09, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-27 19:01:05)
Features:
  var: FeatureSet(id='efKTdXvuyyptTz4pN37A', n=754, type='number', registry='bionty.Gene', hash='WMDxN7253SdzGwmznV5d', updated_at=2023-09-27 19:02:29, modality_id='m5vl3uGJ', created_by_id='DzTjkKse')
    'SDF2L1', 'CCL5', 'H2AJ', 'KAT5', 'MCUB', 'BTG1', 'PARK7', 'POP5', 'COA1', 'GZMH', ...
  obs: FeatureSet(id='yITOYQY8CqEegvuctAgt', n=1, registry='core.Feature', hash='AmEc5BjhuP4J85kGIQkC', updated_at=2023-09-27 19:02:30, modality_id='cX7s1QPp', created_by_id='DzTjkKse')
    🔗 cell_type (9, bionty.CellType): 'dendritic cell', 'monocyte', 'central memory CD8-positive, alpha-beta T cell', 'CD16-positive, CD56-dim natural killer cell, human', 'mature T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'Cd4-negative, CD8_alpha-negative, CD11b-positive dendritic cell', 'B cell, CD19-positive', 'CD8-positive, alpha-beta memory T cell'
  external: FeatureSet(id='Dz6eP9kSkhtM3hFWEFrN', n=2, registry='core.Feature', hash='dR9prGZGs-GKL-5z6VXm', updated_at=2023-09-27 19:02:30, modality_id='cX7s1QPp', created_by_id='DzTjkKse')
    🔗 species (1, bionty.Species): 'human'
    🔗 assay (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ cell_types (9, bionty.CellType): 'dendritic cell', 'monocyte', 'central memory CD8-positive, alpha-beta T cell', 'CD16-positive, CD56-dim natural killer cell, human', 'mature T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'Cd4-negative, CD8_alpha-negative, CD11b-positive dendritic cell', 'B cell, CD19-positive', 'CD8-positive, alpha-beta memory T cell'
  🏷️ experimental_factors (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'

file.view_flow()

https://d33wubrfki0l68.cloudfront.net/9a243542e1eac949359904b6f6290829cadaf14b/ddf65/_images/ce1a843e6afbfdfda76c9211f99ffdf39631ec327eab7391a2e63a1984388f41.svg

🎉 Now let’s continue with data integration: Integrate scRNA-seq datasets
