
Integrate scRNA-seq datasets#

scRNA-seq data integration is the process of jointly analyzing data from several scRNA-seq experiments to uncover shared or distinct biological patterns.

Here, we'll demonstrate how to fetch two scRNA-seq datasets by registered metadata, such as cell types, and then integrate them.

Setup#

!lamin load test-scrna
💡 found cached instance metadata: /home/runner/.lamin/instance--testuser1--test-scrna.env
💡 loaded instance: testuser1/test-scrna

import lamindb as ln
import lnschema_bionty as lb
import anndata as ad
💡 loaded instance: testuser1/test-scrna (lamindb 0.54.2)
ln.track()
💡 notebook imports: anndata==0.9.2 lamindb==0.54.2 lnschema_bionty==0.31.2
❗ record with similar name exist! did you mean to load it?
name       id              __ratio__
scRNA-seq  Nv48yAceNSh8z8  90.0
💡 Transform(id='agayZTonayqAz8', name='Integrate scRNA-seq datasets', short_name='scrna2', version='0', type=notebook, updated_at=2023-09-27 19:02:37, created_by_id='DzTjkKse')
💡 Run(id='CYidf9UlqBcuIDGaY1CV', run_at=2023-09-27 19:02:37, transform_id='agayZTonayqAz8', created_by_id='DzTjkKse')

Access#

Query files by provenance metadata#

users = ln.User.lookup()
ln.Transform.filter(created_by=users.testuser1).search("scrna")
name                          id              __ratio__
Integrate scRNA-seq datasets  agayZTonayqAz8  90.0
scRNA-seq                     Nv48yAceNSh8z8  90.0
transform = ln.Transform.filter(id="Nv48yAceNSh8z8").one()
ln.File.filter(transform=transform).df()
id                    storage_id  key   suffix  accessor  description            version  size      hash                    hash_type  transform_id    run_id                initial_version_id  updated_at           created_by_id
hPfA4G55q74eh6N0Sz23  yIHH0369    None  .h5ad   AnnData   Conde22                None     28049505  WEFcMZxJNmMiUOFrcSTaig  md5        Nv48yAceNSh8z8  lOuLIG7bl1pT2UwyrsU1  None                2023-09-27 19:01:58  DzTjkKse
ng0CwZ6maQnlYjysmXLV  yIHH0369    None  .h5ad   AnnData   10x reference pbmc68k  None     660792    GU-hbSJqGkENOxVKFLmvbA  md5        Nv48yAceNSh8z8  lOuLIG7bl1pT2UwyrsU1  None                2023-09-27 19:02:30  DzTjkKse

Query files based on biological metadata#

assays = lb.ExperimentalFactor.lookup()
species = lb.Species.lookup()
cell_types = lb.CellType.lookup()
query = ln.File.filter(
    experimental_factors=assays.single_cell_rna_sequencing,
    species=species.human,
    cell_types=cell_types.gamma_delta_t_cell,
)
query.df()
id                    storage_id  key   suffix  accessor  description  version  size      hash                    hash_type  transform_id    run_id                initial_version_id  updated_at           created_by_id
hPfA4G55q74eh6N0Sz23  yIHH0369    None  .h5ad   AnnData   Conde22      None     28049505  WEFcMZxJNmMiUOFrcSTaig  md5        Nv48yAceNSh8z8  lOuLIG7bl1pT2UwyrsU1  None                2023-09-27 19:01:58  DzTjkKse

Transform#

Compare gene sets#

Get file objects:

query = ln.File.filter()
file1, file2 = query.list()
file1.describe()
File(id='hPfA4G55q74eh6N0Sz23', suffix='.h5ad', accessor='AnnData', description='Conde22', size=28049505, hash='WEFcMZxJNmMiUOFrcSTaig', hash_type='md5', updated_at=2023-09-27 19:01:58)

Provenance:
  🗃️ storage: Storage(id='yIHH0369', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-09-27 19:01:05, created_by_id='DzTjkKse')
  📔 transform: Transform(id='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type='notebook', updated_at=2023-09-27 19:02:29, created_by_id='DzTjkKse')
  👣 run: Run(id='lOuLIG7bl1pT2UwyrsU1', run_at=2023-09-27 19:01:09, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-27 19:01:05)
Features:
  var: FeatureSet(id='vMT7XKlnTnDzWOZKdE73', n=36503, type='number', registry='bionty.Gene', hash='dnRexHCtxtmOU81_EpoJ', updated_at=2023-09-27 19:01:52, modality_id='m5vl3uGJ', created_by_id='DzTjkKse')
    'TRPM8', 'MCUB', 'NBDY', 'None', 'CTBS', 'None', 'VPS51', 'None', 'USP17L20', 'FNDC1-IT1', ...
  obs: FeatureSet(id='cXbZKNPrA429XX9OmdaD', n=4, registry='core.Feature', hash='LcJmEKQ6sT39iKzClddB', updated_at=2023-09-27 19:01:58, modality_id='cX7s1QPp', created_by_id='DzTjkKse')
    🔗 donor (12, core.ULabel): 'A36', '637C', '582C', 'A37', '621B', 'A35', 'D496', 'D503', 'A31', 'A52', ...
    🔗 assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v1', '10x 5' v2', '10x 3' v3'
    🔗 cell_type (32, bionty.CellType): 'mast cell', 'macrophage', 'CD4-positive helper T cell', 'memory B cell', 'gamma-delta T cell', 'progenitor cell', 'CD16-negative, CD56-bright natural killer cell, human', 'alveolar macrophage', 'classical monocyte', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', ...
    🔗 tissue (17, bionty.Tissue): 'jejunal epithelium', 'caecum', 'duodenum', 'omentum', 'thoracic lymph node', 'transverse colon', 'bone marrow', 'skeletal muscle tissue', 'sigmoid colon', 'spleen', ...
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ tissues (17, bionty.Tissue): 'jejunal epithelium', 'caecum', 'duodenum', 'omentum', 'thoracic lymph node', 'transverse colon', 'bone marrow', 'skeletal muscle tissue', 'sigmoid colon', 'spleen', ...
  🏷️ cell_types (32, bionty.CellType): 'mast cell', 'macrophage', 'CD4-positive helper T cell', 'memory B cell', 'gamma-delta T cell', 'progenitor cell', 'CD16-negative, CD56-bright natural killer cell, human', 'alveolar macrophage', 'classical monocyte', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v1', '10x 5' v2', '10x 3' v3'
  🏷️ ulabels (12, core.ULabel): 'A36', '637C', '582C', 'A37', '621B', 'A35', 'D496', 'D503', 'A31', 'A52', ...
file1.view_flow()
https://d33wubrfki0l68.cloudfront.net/7cdb32ecb4e21433f33dff6757ae1a457ebc20b0/bbeef/_images/b052ed27635ba78588536aebaf1b66a020c92038cee489444bb8bd08f30f6947.svg
file2.describe()
File(id='ng0CwZ6maQnlYjysmXLV', suffix='.h5ad', accessor='AnnData', description='10x reference pbmc68k', size=660792, hash='GU-hbSJqGkENOxVKFLmvbA', hash_type='md5', updated_at=2023-09-27 19:02:30)

Provenance:
  🗃️ storage: Storage(id='yIHH0369', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-09-27 19:01:05, created_by_id='DzTjkKse')
  📔 transform: Transform(id='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type='notebook', updated_at=2023-09-27 19:02:29, created_by_id='DzTjkKse')
  👣 run: Run(id='lOuLIG7bl1pT2UwyrsU1', run_at=2023-09-27 19:01:09, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-27 19:01:05)
Features:
  var: FeatureSet(id='efKTdXvuyyptTz4pN37A', n=754, type='number', registry='bionty.Gene', hash='WMDxN7253SdzGwmznV5d', updated_at=2023-09-27 19:02:29, modality_id='m5vl3uGJ', created_by_id='DzTjkKse')
    'SDF2L1', 'CCL5', 'H2AJ', 'KAT5', 'MCUB', 'BTG1', 'PARK7', 'POP5', 'COA1', 'GZMH', ...
  obs: FeatureSet(id='yITOYQY8CqEegvuctAgt', n=1, registry='core.Feature', hash='AmEc5BjhuP4J85kGIQkC', updated_at=2023-09-27 19:02:30, modality_id='cX7s1QPp', created_by_id='DzTjkKse')
    🔗 cell_type (9, bionty.CellType): 'dendritic cell', 'monocyte', 'central memory CD8-positive, alpha-beta T cell', 'CD16-positive, CD56-dim natural killer cell, human', 'mature T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'Cd4-negative, CD8_alpha-negative, CD11b-positive dendritic cell', 'B cell, CD19-positive', 'CD8-positive, alpha-beta memory T cell'
  external: FeatureSet(id='Dz6eP9kSkhtM3hFWEFrN', n=2, registry='core.Feature', hash='dR9prGZGs-GKL-5z6VXm', updated_at=2023-09-27 19:02:30, modality_id='cX7s1QPp', created_by_id='DzTjkKse')
    🔗 species (1, bionty.Species): 'human'
    🔗 assay (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ cell_types (9, bionty.CellType): 'dendritic cell', 'monocyte', 'central memory CD8-positive, alpha-beta T cell', 'CD16-positive, CD56-dim natural killer cell, human', 'mature T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'Cd4-negative, CD8_alpha-negative, CD11b-positive dendritic cell', 'B cell, CD19-positive', 'CD8-positive, alpha-beta memory T cell'
  🏷️ experimental_factors (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
file2.view_flow()
https://d33wubrfki0l68.cloudfront.net/9a243542e1eac949359904b6f6290829cadaf14b/ddf65/_images/ce1a843e6afbfdfda76c9211f99ffdf39631ec327eab7391a2e63a1984388f41.svg

Load files into memory:

file1_adata = file1.load()
file2_adata = file2.load()

The shared genes can be computed from the registered feature sets alone, without touching the loaded data:

file1_genes = file1.features["var"]
file2_genes = file2.features["var"]

shared_genes = file1_genes & file2_genes
len(shared_genes)
749
shared_genes.list("symbol")[:10]
['MCUB',
 'GZMB',
 'LY86',
 'TIGIT',
 'NDUFB11',
 'PTPN7',
 'TXN',
 'GLOD4',
 'HLA-DQB1',
 'EIF3E']
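
As a rough analogy for the `&` operation above, here is a dependency-free sketch using plain Python sets. It is not lamindb's implementation: the two sets are toy stand-ins that hold only the first few symbols printed by `describe()` for each file, and the variable names are hypothetical. The final subtraction assumes each feature set counts every gene exactly once.

```python
# Toy stand-ins: the first few gene symbols that describe() printed for each
# file (the real feature sets hold 36503 and 754 genes respectively).
file1_symbols = {"TRPM8", "MCUB", "NBDY", "CTBS", "VPS51", "USP17L20", "FNDC1-IT1"}
file2_symbols = {
    "SDF2L1", "CCL5", "H2AJ", "KAT5", "MCUB",
    "BTG1", "PARK7", "POP5", "COA1", "GZMH",
}

# `&` on Python sets keeps only the members present in both operands.
shared_symbols = file1_symbols & file2_symbols
print(sorted(shared_symbols))  # ['MCUB']

# Counts reported in this notebook: 754 genes in file2, 749 of them shared.
n_file2_genes = 754
n_shared_genes = 749
# Under the assumption above, this many file2 genes are missing from file1.
n_file2_only = n_file2_genes - n_shared_genes
print(n_file2_only)  # 5
```

Because nearly all genes of the smaller dataset are shared, restricting the analysis to the intersection loses little from `file2`.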

Compare cell types#

file1_celltypes = file1.cell_types.all()
file2_celltypes = file2.cell_types.all()

shared_celltypes = file1_celltypes & file2_celltypes
shared_celltypes_names = shared_celltypes.list("name")
shared_celltypes_names
['CD8-positive, alpha-beta memory T cell',
 'CD16-positive, CD56-dim natural killer cell, human']

We can now subset the two datasets by shared cell types:

file1_adata_subset = file1_adata[
    file1_adata.obs["cell_type"].isin(shared_celltypes_names)
]

file2_adata_subset = file2_adata[
    file2_adata.obs["cell_type"].isin(shared_celltypes_names)
]
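
The subsetting above relies on a boolean mask with one entry per cell. A minimal plain-Python sketch of that logic follows; `toy_cell_types` is a hypothetical stand-in for `adata.obs["cell_type"]`, and the list comprehension only mimics what `isin` computes.

```python
# The two shared cell type names found above.
shared_celltypes_names = [
    "CD8-positive, alpha-beta memory T cell",
    "CD16-positive, CD56-dim natural killer cell, human",
]

# Hypothetical per-cell annotations, standing in for adata.obs["cell_type"].
toy_cell_types = [
    "CD8-positive, alpha-beta memory T cell",
    "mast cell",
    "CD16-positive, CD56-dim natural killer cell, human",
    "macrophage",
]

# One boolean per cell: True if the cell's type is among the shared ones.
mask = [cell_type in shared_celltypes_names for cell_type in toy_cell_types]
print(mask)  # [True, False, True, False]

# Keep only the cells where the mask is True.
kept_cell_types = [ct for ct, keep in zip(toy_cell_types, mask) if keep]
print(len(kept_cell_types))  # 2
```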

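Concatenate the two subsets, labeling each cell with its file of origin. `ad.concat` defaults to an inner join on variables, so only genes present in both datasets are kept: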

adata_concat = ad.concat(
    [file1_adata_subset, file2_adata_subset],
    label="file",
    keys=[file1.description, file2.description],
)
adata_concat
AnnData object with n_obs × n_vars = 244 × 749
    obs: 'cell_type', 'file'
    obsm: 'X_umap'
adata_concat.obs.value_counts()
cell_type                                           file                 
CD8-positive, alpha-beta memory T cell              Conde22                  120
CD16-positive, CD56-dim natural killer cell, human  Conde22                  114
CD8-positive, alpha-beta memory T cell              10x reference pbmc68k      7
CD16-positive, CD56-dim natural killer cell, human  10x reference pbmc68k      3
dtype: int64
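
As a quick sanity check on the output above, the sketch below re-adds the reported counts in plain Python. The dictionary simply copies the four numbers printed by `value_counts()`, and the variable names are hypothetical.

```python
# Counts copied from the value_counts() output: (cell_type, file) -> n cells.
counts = {
    ("CD8-positive, alpha-beta memory T cell", "Conde22"): 120,
    ("CD16-positive, CD56-dim natural killer cell, human", "Conde22"): 114,
    ("CD8-positive, alpha-beta memory T cell", "10x reference pbmc68k"): 7,
    ("CD16-positive, CD56-dim natural killer cell, human", "10x reference pbmc68k"): 3,
}

# The grand total should equal n_obs of the concatenated AnnData object.
total_cells = sum(counts.values())
print(total_cells)  # 244

# Sum per file of origin to see how each dataset contributes.
cells_per_file = {}
for (_cell_type, file_label), n_cells in counts.items():
    cells_per_file[file_label] = cells_per_file.get(file_label, 0) + n_cells
print(cells_per_file)  # {'Conde22': 234, '10x reference pbmc68k': 10}
```

The totals match the 244 observations reported for `adata_concat`, and they show that the Conde22 subset contributes far more cells (234) than the pbmc68k subset (10), which is worth keeping in mind for any downstream comparison.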
# clean up test instance
!lamin delete --force test-scrna
!rm -r ./test-scrna
💡 deleting instance testuser1/test-scrna
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-scrna.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna