scRNA-seq#

You’ll learn how to manage a growing number of scRNA-seq data batches as a single queryable dataset.

Along the way, you’ll see how to create reports, leverage data lineage, and query statistics of individual data batches stored as files.

Specifically, you will:

  1. read a single .h5ad file as an AnnData and seed a growing dataset with it

  2. append a new data batch (a new .h5ad file) and create a new version of this dataset (here)

  3. query files by metadata individually and inspect their features (here, can be skipped)

  4. load the dataset into memory and save analytical results as plots (here)

  5. iterate over the dataset and train a model (to come)

  6. annotate the dataset by a cell type prediction (to come)

  7. discuss migrating a lakehouse of files to a single TileDB SOMA store of the same data (to come)

Setup#

!lamin init --storage ./test-scrna --schema bionty
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-01 16:42:00)
✅ saved: Storage(id='mw7lxLgT', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-10-01 16:42:00, created_by_id='DzTjkKse')
💡 loaded instance: testuser1/test-scrna
💡 did not register local instance on hub (if you want, call `lamin register`)

import lamindb as ln
import lnschema_bionty as lb
import pandas as pd

ln.track()
💡 loaded instance: testuser1/test-scrna (lamindb 0.54.4)
💡 notebook imports: lamindb==0.54.4 lnschema_bionty==0.31.2 pandas==1.5.3
💡 Transform(id='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type=notebook, updated_at=2023-10-01 16:42:05, created_by_id='DzTjkKse')
💡 Run(id='QbgMTXxxJxPHD0ovcimm', run_at=2023-10-01 16:42:05, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')

Access#

Let us look at the data of Conde et al., Science (2022).

These data are available in standardized form from the CellxGene data portal.

Here, we’ll use it to seed a growing in-house store of scRNA-seq data managed with the corresponding metadata in LaminDB registries.

Note

If you’re not interested in managing large collections of in-house data and you’d just like to query public data, please take a look at CellxGene census, which exposes all datasets hosted in the data portal as a concatenated TileDB SOMA store.

lb.settings.species = "human"

By calling ln.dev.datasets.anndata_human_immune_cells below, we download the dataset from the CellxGene portal and pre-populate some LaminDB registries.

adata = ln.dev.datasets.anndata_human_immune_cells(
    populate_registries=True  # this pre-populates registries
)
adata
AnnData object with n_obs × n_vars = 1648 × 36503
    obs: 'donor', 'tissue', 'cell_type', 'assay'
    var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
    uns: 'cell_type_ontology_term_id_colors', 'default_embedding', 'schema_version', 'title'
    obsm: 'X_umap'

This AnnData is already standardized using the same public ontologies that underlie lnschema-bionty, so we expect validation to be simple.

Nonetheless, LaminDB focuses on building clean in-house registries, so we still validate every identifier and label against them before registering the data.

Note

In the next notebook, we’ll look at the more difficult case of a non-standardized dataset that requires curation.

Validate#

Validate genes in .var#

lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id);
148 terms (0.40%) are not validated for ensembl_gene_id: ENSG00000269933, ENSG00000261737, ENSG00000259834, ENSG00000256374, ENSG00000263464, ENSG00000203812, ENSG00000272196, ENSG00000272880, ENSG00000270188, ENSG00000287116, ENSG00000237133, ENSG00000224739, ENSG00000227902, ENSG00000239467, ENSG00000272551, ENSG00000280374, ENSG00000236886, ENSG00000229352, ENSG00000286601, ENSG00000227021, ...
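
Conceptually, validation is a membership test of each identifier against the registry that returns one boolean per input. The following is a minimal plain-Python sketch of that idea, not LaminDB's implementation: `validate_terms` and the tiny `registry` set are hypothetical, and the identifiers are borrowed from the outputs in this notebook.

```python
def validate_terms(terms, registry):
    """Return one boolean per term: True if the term is present in the registry."""
    return [term in registry for term in terms]


# Hypothetical miniature registry; the real Gene registry holds tens of thousands of records.
registry = {"ENSG00000118407", "ENSG00000267060", "ENSG00000165195"}
terms = ["ENSG00000118407", "ENSG00000269933", "ENSG00000165195", "ENSG00000261737"]

mask = validate_terms(terms, registry)
non_validated = [term for term, ok in zip(terms, mask) if not ok]
share = 100 * len(non_validated) / len(terms)

print(mask)
print(f"{len(non_validated)} terms ({share:.2f}%) are not validated: {', '.join(non_validated)}")
```

The boolean mask is what makes later steps such as `adata.var.index[~validated]` possible: negating it selects exactly the identifiers that still need attention.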

148 gene identifiers can’t be validated (not currently in the Gene registry). Let’s inspect them to see what to do:

inspector = lb.Gene.inspect(adata.var.index, lb.Gene.ensembl_gene_id)
148 terms (0.40%) are not validated for ensembl_gene_id: ENSG00000269933, ENSG00000261737, ENSG00000259834, ENSG00000256374, ENSG00000263464, ENSG00000203812, ENSG00000272196, ENSG00000272880, ENSG00000270188, ENSG00000287116, ENSG00000237133, ENSG00000224739, ENSG00000227902, ENSG00000239467, ENSG00000272551, ENSG00000280374, ENSG00000236886, ENSG00000229352, ENSG00000286601, ENSG00000227021, ...
   detected 35 Gene terms in Bionty for ensembl_gene_id: 'ENSG00000276256', 'ENSG00000278704', 'ENSG00000277856', 'ENSG00000198786', 'ENSG00000275249', 'ENSG00000277196', 'ENSG00000274175', 'ENSG00000277836', 'ENSG00000277475', 'ENSG00000198727', 'ENSG00000198712', 'ENSG00000198804', 'ENSG00000274792', 'ENSG00000198899', 'ENSG00000198938', 'ENSG00000278817', 'ENSG00000198886', 'ENSG00000275869', 'ENSG00000268674', 'ENSG00000273554', ...
→  add records from Bionty to your Gene registry via .from_values()
   couldn't validate 113 terms: 'ENSG00000273888', 'ENSG00000280095', 'ENSG00000287388', 'ENSG00000244952', 'ENSG00000277050', 'ENSG00000258861', 'ENSG00000278927', 'ENSG00000254740', 'ENSG00000233776', 'ENSG00000272354', 'ENSG00000268955', 'ENSG00000285162', 'ENSG00000285106', 'ENSG00000256618', 'ENSG00000228139', 'ENSG00000232295', 'ENSG00000286601', 'ENSG00000204092', 'ENSG00000258414', 'ENSG00000273923', ...
→  if you are sure, create new records via ln.Gene() and save to your registry

The log reports that 35 of the non-validated IDs can be found in the Bionty reference. Let’s register them:

records = lb.Gene.from_values(inspector.non_validated, lb.Gene.ensembl_gene_id)
ln.save(records)
did not create Gene records for 113 non-validated ensembl_gene_ids: 'ENSG00000112096', 'ENSG00000182230', 'ENSG00000203812', 'ENSG00000204092', 'ENSG00000215271', 'ENSG00000221995', 'ENSG00000224739', 'ENSG00000224745', 'ENSG00000225932', 'ENSG00000226377', 'ENSG00000226380', 'ENSG00000226403', 'ENSG00000227021', 'ENSG00000227220', 'ENSG00000227902', 'ENSG00000228139', 'ENSG00000228906', 'ENSG00000229352', 'ENSG00000231575', 'ENSG00000232196', ...

The remaining 113 are legacy IDs, not present in the current Ensembl assembly (e.g. ENSG00000112096).

We’d still like to register them, but won’t dive into the details of converting them from an old Ensembl version to the current one.

validated = lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id)
records = [lb.Gene(ensembl_gene_id=gene_id) for gene_id in adata.var.index[~validated]]
ln.save(records)
113 terms (0.30%) are not validated for ensembl_gene_id: ENSG00000269933, ENSG00000261737, ENSG00000259834, ENSG00000256374, ENSG00000263464, ENSG00000203812, ENSG00000272196, ENSG00000272880, ENSG00000270188, ENSG00000287116, ENSG00000237133, ENSG00000224739, ENSG00000227902, ENSG00000239467, ENSG00000272551, ENSG00000280374, ENSG00000236886, ENSG00000229352, ENSG00000286601, ENSG00000227021, ...
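
The two-step resolution used in this subsection (first add what the public reference knows, then register the remaining legacy IDs by hand) can be sketched in plain Python. This is an illustration under stated assumptions, not LaminDB code: `resolve`, `registry`, and `reference` are hypothetical stand-ins for the Gene registry and the Bionty reference.

```python
def resolve(non_validated, reference):
    """Split non-validated IDs into reference hits and IDs unknown to the reference."""
    in_reference = [i for i in non_validated if i in reference]
    unknown = [i for i in non_validated if i not in reference]
    return in_reference, unknown


registry = {"ENSG00000118407"}                      # hypothetical in-house registry
reference = {"ENSG00000118407", "ENSG00000198786"}  # hypothetical public reference
var_index = ["ENSG00000118407", "ENSG00000198786", "ENSG00000112096"]

non_validated = [i for i in var_index if i not in registry]
in_reference, unknown = resolve(non_validated, reference)

registry.update(in_reference)  # step 1: records the reference can fill in
registry.update(unknown)       # step 2: legacy IDs, registered with the ID as the only field

print(in_reference, unknown)
print(all(i in registry for i in var_index))
```

Keeping the two groups separate matters because records from the reference arrive with rich metadata, while the manually created ones carry only the identifier.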

Now all genes pass validation:

lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id);

Our in-house Gene registry provides rich metadata for each gene measured in the AnnData:

lb.Gene.filter().df().head(10)
symbol stable_id ensembl_gene_id ncbi_gene_ids biotype description synonyms species_id bionty_source_id updated_at created_by_id
id
IIHQvzCLpelJ FILIP1 None ENSG00000118407 27145 protein_coding filamin A interacting protein 1 [Source:HGNC S... FILIP|KIAA1275 uHJU THKY 2023-10-01 16:42:13 DzTjkKse
NT6izGn0KNWn PTGES3L None ENSG00000267060 100885848 protein_coding prostaglandin E synthase 3 like [Source:HGNC S... uHJU THKY 2023-10-01 16:42:13 DzTjkKse
0dmlUdJnTKCf PIGA None ENSG00000165195 5277 protein_coding phosphatidylinositol glycan anchor biosynthesi... PIG-A|GPI3 uHJU THKY 2023-10-01 16:42:13 DzTjkKse
CaYYdlLhSH71 PSKH1 None ENSG00000159792 5681 protein_coding protein serine kinase H1 [Source:HGNC Symbol;A... uHJU THKY 2023-10-01 16:42:13 DzTjkKse
2TZXN1Uv4mpE TONSL-AS1 None ENSG00000232600 lncRNA TONSL antisense RNA 1 [Source:HGNC Symbol;Acc:... uHJU THKY 2023-10-01 16:42:13 DzTjkKse
1Hr0U7auzSVA None None ENSG00000286693 lncRNA novel transcript, antisense to UNC5D uHJU THKY 2023-10-01 16:42:13 DzTjkKse
kKc8pYcwV9Ug None None ENSG00000283464 IG_V_pseudogene novel pseudogene identical to IGHVII-44-2D uHJU THKY 2023-10-01 16:42:13 DzTjkKse
4t1Yj21ATJp3 FOXN3 None ENSG00000053254 1112 protein_coding forkhead box N3 [Source:HGNC Symbol;Acc:HGNC:1... C14ORF116|CHES1 uHJU THKY 2023-10-01 16:42:13 DzTjkKse
HjjAb70nQgcP None None ENSG00000286413 lncRNA novel transcript, antisense to MPP7 uHJU THKY 2023-10-01 16:42:13 DzTjkKse
rigC9STpRDNQ None None ENSG00000287010 lncRNA novel transcript uHJU THKY 2023-10-01 16:42:13 DzTjkKse

There are about 36k genes in the registry, all for species “human”.

lb.Gene.filter().df().shape
(36503, 11)

Validate metadata in .obs#

adata.obs.columns
Index(['donor', 'tissue', 'cell_type', 'assay'], dtype='object')

ln.Feature.validate(adata.obs.columns)
1 term (25.00%) is not validated for name: donor
array([False,  True,  True,  True])

1 feature is not validated: "donor". Let’s register it:

feature = ln.Feature(name="donor", type="category", registries=[ln.ULabel])
ln.save(feature)

Tip

You can also use features = ln.Feature.from_df(df) to bulk create features with types.
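
To illustrate what "bulk create features with types" could mean, here is a small plain-Python sketch that assigns a coarse type to each column. How `ln.Feature.from_df` actually infers types is not shown in this notebook; the rule below, the `infer_feature_type` helper, and the `n_genes` column are assumptions made only for illustration. The type names "category" and "number" are the ones that appear elsewhere on this page.

```python
def infer_feature_type(values):
    """Guess a coarse feature type from a column's values (illustrative rule only)."""
    non_missing = [v for v in values if v is not None]
    is_numeric = non_missing and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_missing
    )
    return "number" if is_numeric else "category"


columns = {
    "donor": ["A29", "A36", "A29"],
    "tissue": ["lung", "blood", "lung"],
    "n_genes": [1200, 950, 1830],  # hypothetical numeric column, not part of this dataset
}

feature_types = {name: infer_feature_type(values) for name, values in columns.items()}
print(feature_types)
```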

All metadata columns are now validated:

ln.Feature.validate(adata.obs.columns)
array([ True,  True,  True,  True])

Next, let’s validate the corresponding labels of each feature.

Some of the metadata labels can be typed using dedicated registries like CellType:

validated = lb.CellType.validate(adata.obs.cell_type)
❗ received 32 unique terms, 1616 empty/duplicated terms are ignored
2 terms (6.20%) are not validated for name: germinal center B cell, megakaryocyte

Register the non-validated cell types; they can all be loaded from a public ontology through Bionty:

records = lb.CellType.from_values(adata.obs.cell_type[~validated], "name")
ln.save(records)
❗ now recursing through parents: this only happens once, but is much slower than bulk saving

lb.ExperimentalFactor.validate(adata.obs.assay)
lb.Tissue.validate(adata.obs.tissue);

Because we didn’t mount a custom schema that contains a Donor registry, we use the ULabel registry to track donor ids:

ln.ULabel.validate(adata.obs.donor);
❗ received 12 unique terms, 1636 empty/duplicated terms are ignored
12 terms (100.00%) are not validated for name: D496, 621B, A29, A36, A35, 637C, A52, A37, D503, 640C, A31, 582C

Donor labels are not validated, so let’s register them:

donors = [ln.ULabel(name=name) for name in adata.obs.donor.unique()]
ln.save(donors)
ln.ULabel.validate(adata.obs.donor);
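
The warning about unique and ignored terms follows from simple counting: a per-cell column has one entry per cell, but only the distinct values need a label record. A minimal sketch of that bookkeeping, with a hypothetical `unique_terms` helper and a made-up donor column (the real column has 1648 entries and 12 unique donors, hence 1636 ignored):

```python
def unique_terms(values):
    """Return unique non-empty terms in first-seen order and the number of ignored entries."""
    unique = list(dict.fromkeys(v for v in values if v))
    return unique, len(values) - len(unique)


# Hypothetical per-cell donor column with duplicates and empty entries.
donor_column = ["A29", "A36", "A29", "", "D496", "A36", None]

labels, n_ignored = unique_terms(donor_column)
print(f"received {len(labels)} unique terms, {n_ignored} empty/duplicated terms are ignored")
```

This is why one record per unique donor is enough, as in the list comprehension over `adata.obs.donor.unique()` above.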

Register#

modalities = ln.Modality.lookup()
experimental_factors = lb.ExperimentalFactor.lookup()
species = lb.Species.lookup()
features = ln.Feature.lookup()

Register data#

When we create a File object from an AnnData, we’ll automatically link its feature sets and get information about unmapped categories:

file = ln.File.from_anndata(
    adata, description="Conde22", field=lb.Gene.ensembl_gene_id, modality=modalities.rna
)
file.save()

The file has the following 2 linked feature sets:

file.features
Features:
  var: FeatureSet(id='NbpG08k5MyhtFXpSVTO8', n=36503, type='number', registry='bionty.Gene', hash='dnRexHCtxtmOU81_EpoJ', updated_at=2023-10-01 16:42:37, modality_id='ea9UjVtc', created_by_id='DzTjkKse')
    'FILIP1', 'PTGES3L', 'PIGA', 'PSKH1', 'TONSL-AS1', 'None', 'None', 'None', 'IGLV2-8', 'FOXN3', 'None', 'None', 'None', 'LINC03007', 'ASTN1', 'RASSF5', 'TMEM220-AS1', 'LMO7DN', 'None', 'None', ...
  obs: FeatureSet(id='eqbPhBQrbb7AoBWu5XY8', n=4, registry='core.Feature', hash='BUgwklP7zIs3Qx8yNIJ1', updated_at=2023-10-01 16:42:41, modality_id='FqWJG3xl', created_by_id='DzTjkKse')
    🔗 assay (0, bionty.ExperimentalFactor): 
    🔗 donor (0, core.ULabel): 
    🔗 tissue (0, bionty.Tissue): 
    🔗 cell_type (0, bionty.CellType): 

Create a dataset from the file#

dataset = ln.Dataset(file, name="My versioned scRNA-seq dataset", version="1")

dataset
Dataset(id='vAk7VTi8De3y0rT7H8u9', name='My versioned scRNA-seq dataset', version='1', hash='WEFcMZxJNmMiUOFrcSTaig', transform_id='Nv48yAceNSh8z8', run_id='QbgMTXxxJxPHD0ovcimm', file_id='vAk7VTi8De3y0rT7H8u9', created_by_id='DzTjkKse')

Let’s inspect the features measured in this dataset, which were inherited from the file:

dataset.features
Features:
  var: FeatureSet(id='NbpG08k5MyhtFXpSVTO8', n=36503, type='number', registry='bionty.Gene', hash='dnRexHCtxtmOU81_EpoJ', updated_at=2023-10-01 16:42:37, modality_id='ea9UjVtc', created_by_id='DzTjkKse')
    'FILIP1', 'PTGES3L', 'PIGA', 'PSKH1', 'TONSL-AS1', 'None', 'None', 'None', 'IGLV2-8', 'FOXN3', 'None', 'None', 'None', 'LINC03007', 'ASTN1', 'RASSF5', 'TMEM220-AS1', 'LMO7DN', 'None', 'None', ...
  obs: FeatureSet(id='eqbPhBQrbb7AoBWu5XY8', n=4, registry='core.Feature', hash='BUgwklP7zIs3Qx8yNIJ1', updated_at=2023-10-01 16:42:41, modality_id='FqWJG3xl', created_by_id='DzTjkKse')
    🔗 assay (0, bionty.ExperimentalFactor): 
    🔗 donor (0, core.ULabel): 
    🔗 tissue (0, bionty.Tissue): 
    🔗 cell_type (0, bionty.CellType): 
  external: FeatureSet(id='hACccFumFfzUy8GiPLIA', n=1, registry='core.Feature', hash='nE8LzUHK6BMPNKZRcs7C', updated_at=2023-10-01 16:42:42, modality_id='FqWJG3xl', created_by_id='DzTjkKse')
    🔗 species (0, bionty.Species): 

This all looks good, so let’s save it:

dataset.save()

Annotate by linking labels:

dataset.labels.add(experimental_factors.single_cell_rna_sequencing, features.assay)
dataset.labels.add(species.human, features.species)
dataset.labels.add(adata.obs.cell_type, feature=features.cell_type)
dataset.labels.add(adata.obs.assay, feature=features.assay)
dataset.labels.add(adata.obs.tissue, feature=features.tissue)
dataset.labels.add(adata.obs.donor, feature=features.donor)
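
The lookup objects used above expose records as attributes, for example `experimental_factors.single_cell_rna_sequencing` for the record named 'single-cell RNA sequencing'. One way such attribute names could be derived is sketched below. The normalization rule (lowercase, runs of non-alphanumeric characters become underscores) is an assumption that merely reproduces the names seen in this notebook, and `to_attribute_name` and `make_lookup` are hypothetical helpers.

```python
import re
from types import SimpleNamespace


def to_attribute_name(name):
    """Lowercase a name and replace runs of non-alphanumeric characters with underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")


def make_lookup(names):
    """Build an object whose attributes map back to the original names."""
    return SimpleNamespace(**{to_attribute_name(n): n for n in names})


demo_lookup = make_lookup(["single-cell RNA sequencing", "human"])
print(demo_lookup.single_cell_rna_sequencing)
print(demo_lookup.human)
```

Attribute access of this kind supports tab completion in a notebook, which helps avoid typos in free-text labels.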

In version 1, the dataset and the file contain the same data. They are nonetheless tracked independently, and each can be queried through its own registry.

dataset.describe()
Dataset(id='vAk7VTi8De3y0rT7H8u9', name='My versioned scRNA-seq dataset', version='1', hash='WEFcMZxJNmMiUOFrcSTaig', updated_at=2023-10-01 16:42:46)

Provenance:
  💫 transform: Transform(id='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type=notebook, updated_at=2023-10-01 16:42:46, created_by_id='DzTjkKse')
  👣 run: Run(id='QbgMTXxxJxPHD0ovcimm', run_at=2023-10-01 16:42:05, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  📄 file: File(id='vAk7VTi8De3y0rT7H8u9', suffix='.h5ad', accessor='AnnData', description='Conde22', size=28049505, hash='WEFcMZxJNmMiUOFrcSTaig', hash_type='md5', updated_at=2023-10-01 16:42:46, storage_id='mw7lxLgT', transform_id='Nv48yAceNSh8z8', run_id='QbgMTXxxJxPHD0ovcimm', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-01 16:42:00)
Features:
  var: FeatureSet(id='NbpG08k5MyhtFXpSVTO8', n=36503, type='number', registry='bionty.Gene', hash='dnRexHCtxtmOU81_EpoJ', updated_at=2023-10-01 16:42:37, modality_id='ea9UjVtc', created_by_id='DzTjkKse')
    'FILIP1', 'PTGES3L', 'PIGA', 'PSKH1', 'TONSL-AS1', 'None', 'None', 'None', 'IGLV2-8', 'FOXN3', 'None', 'None', 'None', 'LINC03007', 'ASTN1', 'RASSF5', 'TMEM220-AS1', 'LMO7DN', 'None', 'None', ...
  obs: FeatureSet(id='eqbPhBQrbb7AoBWu5XY8', n=4, registry='core.Feature', hash='BUgwklP7zIs3Qx8yNIJ1', updated_at=2023-10-01 16:42:41, modality_id='FqWJG3xl', created_by_id='DzTjkKse')
    🔗 assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v1', '10x 3' v3', '10x 5' v2'
    🔗 donor (12, core.ULabel): '582C', 'A52', 'A37', 'D496', '640C', '621B', 'A29', 'A35', 'A36', '637C', ...
    🔗 tissue (17, bionty.Tissue): 'thoracic lymph node', 'lung', 'duodenum', 'bone marrow', 'blood', 'omentum', 'transverse colon', 'caecum', 'mesenteric lymph node', 'lamina propria', ...
    🔗 cell_type (32, bionty.CellType): 'alpha-beta T cell', 'alveolar macrophage', 'macrophage', 'progenitor cell', 'CD4-positive helper T cell', 'lymphocyte', 'effector memory CD4-positive, alpha-beta T cell', 'plasmacytoid dendritic cell', 'CD16-negative, CD56-bright natural killer cell, human', 'classical monocyte', ...
  external: FeatureSet(id='hACccFumFfzUy8GiPLIA', n=1, registry='core.Feature', hash='nE8LzUHK6BMPNKZRcs7C', updated_at=2023-10-01 16:42:42, modality_id='FqWJG3xl', created_by_id='DzTjkKse')
    🔗 species (1, bionty.Species): 'human'
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ tissues (17, bionty.Tissue): 'thoracic lymph node', 'lung', 'duodenum', 'bone marrow', 'blood', 'omentum', 'transverse colon', 'caecum', 'mesenteric lymph node', 'lamina propria', ...
  🏷️ cell_types (32, bionty.CellType): 'alpha-beta T cell', 'alveolar macrophage', 'macrophage', 'progenitor cell', 'CD4-positive helper T cell', 'lymphocyte', 'effector memory CD4-positive, alpha-beta T cell', 'plasmacytoid dendritic cell', 'CD16-negative, CD56-bright natural killer cell, human', 'classical monocyte', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v1', '10x 3' v3', '10x 5' v2'
  🏷️ ulabels (12, core.ULabel): '582C', 'A52', 'A37', 'D496', '640C', '621B', 'A29', 'A35', 'A36', '637C', ...
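
In the output above, the dataset and its file report the same hash, and the file lists hash_type='md5'. A content hash lets two records be recognized as referring to identical bytes. The sketch below shows one way a 22-character string can be produced from an MD5 digest; the URL-safe base64 encoding with padding removed is an assumption and is not confirmed by this notebook, and `content_hash` is a hypothetical helper.

```python
import base64
import hashlib


def content_hash(data: bytes) -> str:
    """MD5 digest of the bytes, URL-safe base64-encoded with '=' padding removed."""
    digest = hashlib.md5(data).digest()  # always 16 bytes
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")


file_bytes = b"hypothetical .h5ad content"
file_hash = content_hash(file_bytes)
dataset_hash = content_hash(file_bytes)  # a dataset backed by exactly this one file

print(len(file_hash), file_hash == dataset_hash)
```

Once a second batch is appended, the dataset would cover more than one file, so its hash could no longer coincide with that of any single file.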

And we can access the file like so:

dataset.file
File(id='vAk7VTi8De3y0rT7H8u9', suffix='.h5ad', accessor='AnnData', description='Conde22', size=28049505, hash='WEFcMZxJNmMiUOFrcSTaig', hash_type='md5', updated_at=2023-10-01 16:42:46, storage_id='mw7lxLgT', transform_id='Nv48yAceNSh8z8', run_id='QbgMTXxxJxPHD0ovcimm', created_by_id='DzTjkKse')

dataset.view_flow()
[Figure: data lineage graph rendered by dataset.view_flow(): https://d33wubrfki0l68.cloudfront.net/a40f8927ba2ee12d45732bbb9755bae7b3107fed/69230/_images/ffc30f76e5a6572abb8e3cb9c338554ad597cfc58efd9976740149b7f4e8c862.svg]