Append a new batch of data

Here, we’ll learn how to standardize a less well-curated dataset and how to append it as a new batch of data to the growing versioned dataset.

import lamindb as ln
import lnschema_bionty as lb
import pandas as pd

ln.track()

💡 loaded instance: testuser1/test-scrna (lamindb 0.54.4)

💡 notebook imports: lamindb==0.54.4 lnschema_bionty==0.31.2 pandas==1.5.3

💡 Transform(id='ManDYgmftZ8Cz8', name='Append a new batch of data', short_name='scrna1', version='0', type=notebook, updated_at=2023-10-01 16:42:52, created_by_id='DzTjkKse')

💡 Run(id='k9J7cQZqsnA5sK7oIO1S', run_at=2023-10-01 16:42:52, transform_id='ManDYgmftZ8Cz8', created_by_id='DzTjkKse')

Access

Let’s now consider a dataset with less well-curated features:

adata = ln.dev.datasets.anndata_pbmc68k_reduced()
adata

AnnData object with n_obs × n_vars = 70 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

We see that this dataset is indexed by gene symbols. Because we assume that all in-house datasets are indexed by Ensembl gene IDs, we’ll need to re-curate:

adata.var.head()

index	n_counts	highly_variable
HES4	1153.387451	True
TNFRSF4	304.358154	True
SSU72	2530.272705	False
PARK7	7451.664062	False
RBP7	272.811035	True
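A quick pattern check can tell whether an index holds gene symbols or Ensembl gene IDs. The sketch below is illustrative only: the helper `looks_like_ensembl_ids` is hypothetical and not part of lamindb, and it assumes unversioned human IDs of the form `ENSG` followed by 11 digits.

```python
import re

# Assumption: unversioned human Ensembl gene IDs ("ENSG" + 11 digits).
ENSEMBL_HUMAN_GENE = re.compile(r"^ENSG\d{11}$")


def looks_like_ensembl_ids(index):
    """Return True if every label matches the human Ensembl gene ID pattern."""
    return all(ENSEMBL_HUMAN_GENE.match(label) is not None for label in index)


symbols = ["HES4", "TNFRSF4", "SSU72", "PARK7", "RBP7"]
ensembl_ids = ["ENSG00000188290", "ENSG00000186827"]

print(looks_like_ensembl_ids(symbols))      # False
print(looks_like_ensembl_ids(ensembl_ids))  # True
```

A check like this only looks at the shape of the labels; it does not confirm that an ID exists in any Ensembl release.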

We are still working with human data, and can globally instruct bionty to assume human:

lb.settings.species = "human"

Validate

Curate & validate genes

lb.Gene.validate(adata.var.index, lb.Gene.symbol);

❗ 70 terms (9.20%) are not validated for symbol: ATPIF1, C1orf228, CCBL2, RP11-782C8.1, RP11-277L2.3, RP11-156E8.1, AC079767.4, GPX1, H1FX, SELT, ATP5I, IGJ, CCDC109B, FYB, H2AFY, FAM65B, HIST1H4C, HIST1H1E, ZNRD1, C6orf48, ...

lb.Gene.inspect(adata.var.index, lb.Gene.symbol);

❗ 70 terms (9.20%) are not validated for symbol: ATPIF1, C1orf228, CCBL2, RP11-782C8.1, RP11-277L2.3, RP11-156E8.1, AC079767.4, GPX1, H1FX, SELT, ATP5I, IGJ, CCDC109B, FYB, H2AFY, FAM65B, HIST1H4C, HIST1H1E, ZNRD1, C6orf48, ...

   detected 54 terms with synonyms: ATPIF1, C1orf228, CCBL2, AC079767.4, H1FX, SELT, ATP5I, IGJ, CCDC109B, FYB, H2AFY, FAM65B, HIST1H4C, HIST1H1E, ZNRD1, C6orf48, SEPT7, WBSCR22, RSBN1L-AS1, CCDC132, ...

→  standardize terms via .standardize()

   detected 5 Gene terms in Bionty for symbol: 'SNORD3B-2', 'SOD2', 'IGLL5', 'GPX1', 'RN7SL1'

→  add records from Bionty to your Gene registry via .from_values()

   couldn't validate 11 terms: 'RP11-489E7.4', 'RP11-390E23.6', 'RP11-620J15.3', 'RP11-277L2.3', 'RP11-156E8.1', 'CTD-3138B18.5', 'AC084018.1', 'RP11-291B21.2', 'RP11-782C8.1', 'RP3-467N11.1', 'TMBIM4-1'

→  if you are sure, create new records via ln.Gene() and save to your registry
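The inspection report sorts each term into one of a few groups. The sketch below mimics that triage with plain Python containers. It is not how `inspect` is implemented, and the `triage` helper, the tiny registry, the synonym table and the public set are made-up stand-ins.

```python
def triage(terms, registry, synonyms, public):
    """Sort terms into the kinds of groups an inspect-style report lists."""
    groups = {"validated": [], "synonym": [], "public": [], "unknown": []}
    for term in terms:
        if term in registry:
            groups["validated"].append(term)   # already in our registry
        elif term in synonyms:
            groups["synonym"].append(term)     # fixable by standardizing
        elif term in public:
            groups["public"].append(term)      # can be imported from the public source
        else:
            groups["unknown"].append(term)     # needs a manual decision
    return groups


# Toy stand-ins for the registry and the public ontology (assumptions).
registry = {"HES4", "PARK7"}
synonyms = {"ATPIF1": "ATP5IF1", "IGJ": "JCHAIN"}
public = {"GPX1", "SOD2"}

report = triage(["HES4", "ATPIF1", "GPX1", "RP11-782C8.1"], registry, synonyms, public)
for group, members in report.items():
    print(group, members)
```

Each group corresponds to a different follow-up action: standardize the synonyms, import the public matches, and decide by hand what to do with the rest.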

Standardize symbols and register additional symbols from Bionty:

adata.var.index = lb.Gene.standardize(adata.var.index, lb.Gene.symbol)
gene_records = lb.Gene.from_values(adata.var.index, lb.Gene.symbol)
ln.save(gene_records)

❗ did not create Gene records for 11 non-validated symbols: 'AC084018.1', 'CTD-3138B18.5', 'RP11-156E8.1', 'RP11-277L2.3', 'RP11-291B21.2', 'RP11-390E23.6', 'RP11-489E7.4', 'RP11-620J15.3', 'RP11-782C8.1', 'RP3-467N11.1', 'TMBIM4-1'

We only want to register data with validated genes; data for features we cannot validate wouldn’t be useful to us anyway.

Hence, we subset the AnnData object to the validated genes:

validated = lb.Gene.validate(adata.var.index, lb.Gene.symbol)
adata_validated = adata[:, validated].copy()

❗ 11 terms (1.40%) are not validated for symbol: RP11-782C8.1, RP11-277L2.3, RP11-156E8.1, RP3-467N11.1, RP11-390E23.6, RP11-489E7.4, RP11-291B21.2, RP11-620J15.3, TMBIM4-1, AC084018.1, CTD-3138B18.5
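Subsetting by a boolean mask keeps exactly the columns whose mask entry is true. The pure-Python sketch below shows the idea on a tiny made-up matrix; `subset_columns` is a hypothetical helper, not an AnnData or lamindb function.

```python
def subset_columns(matrix, names, mask):
    """Keep only the columns (and their names) whose mask entry is True."""
    keep = [i for i, ok in enumerate(mask) if ok]
    kept_names = [names[i] for i in keep]
    kept_matrix = [[row[i] for i in keep] for row in matrix]
    return kept_matrix, kept_names


names = ["HES4", "RP11-782C8.1", "PARK7"]
validated_symbols = {"HES4", "PARK7"}          # toy registry content (assumption)
mask = [name in validated_symbols for name in names]

matrix = [
    [1.0, 0.0, 3.0],   # cell 1
    [0.5, 2.0, 0.0],   # cell 2
]
sub_matrix, sub_names = subset_columns(matrix, names, mask)
print(sub_names)   # ['HES4', 'PARK7']
print(sub_matrix)  # [[1.0, 3.0], [0.5, 0.0]]
```

The mask must have one entry per column and follow the same order as the column names, which is why it is computed from the index that is being subset.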

Now, we need to convert gene symbols into Ensembl gene IDs:

# query the saved gene records and build a symbol → Ensembl gene ID lookup
records = lb.Gene.filter(id__in=[record.id for record in gene_records])
mapper = pd.DataFrame(records.values_list("symbol", "ensembl_gene_id")).set_index(0)[1]
# keep the original symbols in a column, then re-index var by Ensembl gene ID
adata_validated.var.insert(0, "gene_symbol", adata_validated.var.index)
adata_validated.var.rename(index=mapper, inplace=True)

adata_validated.var.head()

	gene_symbol	n_counts	highly_variable
ENSG00000188290	HES4	1153.387451	True
ENSG00000186827	TNFRSF4	304.358154	True
ENSG00000160075	SSU72	2530.272705	False
ENSG00000116288	PARK7	7451.664062	False
ENSG00000162444	RBP7	272.811035	True
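Renaming an index through a mapper replaces only the labels the mapper knows about; in pandas, labels missing from the mapper are left unchanged. The sketch below reproduces that behavior with a plain dictionary and adds two sanity checks. The helper `rename_labels` and the symbol `NOT_IN_MAPPER` are made up for illustration.

```python
def rename_labels(labels, mapper):
    """Replace each label found in mapper; leave all other labels as they are."""
    return [mapper.get(label, label) for label in labels]


# symbol → Ensembl gene ID pairs taken from the table above
mapper = {"HES4": "ENSG00000188290", "TNFRSF4": "ENSG00000186827"}
labels = ["HES4", "TNFRSF4", "NOT_IN_MAPPER"]  # last entry is a made-up symbol

renamed = rename_labels(labels, mapper)
unmapped = [label for label in labels if label not in mapper]
has_duplicates = len(set(renamed)) != len(renamed)

print(renamed)         # ['ENSG00000188290', 'ENSG00000186827', 'NOT_IN_MAPPER']
print(unmapped)        # ['NOT_IN_MAPPER']
print(has_duplicates)  # False
```

Checking for unmapped labels and duplicate IDs after a rename is a cheap way to catch symbols that slipped through or that collapsed onto the same identifier.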

Curate & validate cell types

Inspection shows none of the terms are validated:

inspector = lb.CellType.inspect(adata_validated.obs.cell_type)

❗ received 9 unique terms, 61 empty/duplicated terms are ignored

❗ 9 terms (100.00%) are not validated for name: Dendritic cells, CD19+ B, CD4+/CD45RO+ Memory, CD8+ Cytotoxic T, CD4+/CD25 T Reg, CD14+ Monocytes, CD56+ NK, CD8+/CD45RA+ Naive Cytotoxic, CD34+

   couldn't validate 9 terms: 'CD8+ Cytotoxic T', 'Dendritic cells', 'CD4+/CD25 T Reg', 'CD34+', 'CD4+/CD45RO+ Memory', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD19+ B', 'CD14+ Monocytes', 'CD56+ NK'

→  if you are sure, create new records via ln.CellType() and save to your registry

Let’s search the public ontology for each cell type name and add the name found in the AnnData object as a synonym to the top match:

bionty = lb.CellType.bionty()  # access the public ontology through bionty
name_mapper = {}
for name in adata_validated.obs.cell_type.unique():
    # search the public ontology and use the ontology id of the top match
    ontology_id = bionty.search(name).iloc[0].ontology_id
    # create a record by loading the top match from bionty
    record = lb.CellType.from_bionty(ontology_id=ontology_id)
    name_mapper[name] = record.name  # map the original name to the standardized name
    record.save()  # save the record
    # add the original name as a synonym, so that next time, we can just run .standardize()
    record.add_synonym(name)

❗ now recursing through parents: this only happens once, but is much slower than bulk saving

❗ now recursing through parents: this only happens once, but is much slower than bulk saving

❗ now recursing through parents: this only happens once, but is much slower than bulk saving

❗ now recursing through parents: this only happens once, but is much slower than bulk saving

❗ now recursing through parents: this only happens once, but is much slower than bulk saving
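A search-based mapper takes whatever the search ranks first, so it is worth looking at how close each match is before trusting it. The sketch below uses the standard library's `difflib` as a stand-in for the ontology search; it is not the algorithm behind `bionty.search`, and the candidate list and the `top_match` helper are assumptions made for illustration.

```python
import difflib

# Toy excerpt of ontology names (assumption, not the full ontology).
ontology_names = ["dendritic cell", "monocyte", "cytotoxic T cell", "natural killer cell"]


def top_match(query, candidates):
    """Return (best candidate, similarity score) using case-insensitive matching."""
    lowered = {candidate.lower(): candidate for candidate in candidates}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=0.0)
    if not hits:
        return None, 0.0
    score = difflib.SequenceMatcher(None, query.lower(), hits[0]).ratio()
    return lowered[hits[0]], score


for name in ["Dendritic cells", "CD14+ Monocytes"]:
    match, score = top_match(name, ontology_names)
    print(f"{name!r} -> {match!r} (score {score:.2f})")
```

A low score signals a match that deserves a manual look; the top hit of any text search is not guaranteed to be the biologically correct term, so reviewing `name_mapper` before relying on it is prudent.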

We can now standardize cell type names using the search-based mapper:

adata_validated.obs.cell_type = adata_validated.obs.cell_type.map(name_mapper)

Now, all cell types are validated:

validated = lb.CellType.validate(adata_validated.obs.cell_type)
assert all(validated)

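We don’t want to manage any of the other metadata columns in registries:

for column in ["n_genes", "percent_mito", "louvain"]:
    # note: drop() returns a new DataFrame and is called on adata here, so
    # adata_validated.obs keeps these columns; they are reported as unvalidated below
    adata.obs.drop(column, axis=1)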
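If the columns should actually be removed from the object that gets stored, keep in mind that `DataFrame.drop` returns a new DataFrame by default, so the result has to be assigned back. A minimal pandas sketch on a toy `obs` table (all values are made up):

```python
import pandas as pd

obs = pd.DataFrame(
    {
        "cell_type": ["monocyte", "dendritic cell"],
        "n_genes": [805, 1021],
        "percent_mito": [0.01, 0.02],
        "louvain": ["1", "2"],
    }
)

# Calling drop() without using its return value leaves obs untouched.
obs.drop("n_genes", axis=1)
print(list(obs.columns))  # ['cell_type', 'n_genes', 'percent_mito', 'louvain']

# Assigning the result keeps only the columns we want to retain.
trimmed = obs.drop(columns=["n_genes", "percent_mito", "louvain"])
print(list(trimmed.columns))  # ['cell_type']
```

For an AnnData object, the same pattern would mean assigning the trimmed table back to the `obs` attribute of the object that is registered.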

Register

modalities = ln.Modality.lookup()
experimental_factors = lb.ExperimentalFactor.lookup()
species = lb.Species.lookup()
features = ln.Feature.lookup()

file = ln.File.from_anndata(
    adata_validated,
    description="10x reference adata",
    field=lb.Gene.ensembl_gene_id,
    modality=modalities.rna,
)

❗    3 terms (75.00%) are not validated for name: n_genes, percent_mito, louvain

These three unvalidated terms are the obs columns we chose not to curate. Because we don’t want to manage them in registries, we can save the file as is.

file.save()

file.labels.add(adata_validated.obs.cell_type, features.cell_type)
file.labels.add(species.human, feature=features.species)
file.labels.add(experimental_factors.single_cell_rna_sequencing, feature=features.assay)

file.describe()

File(id='tKBg5wZP51aKaLiPXbGK', suffix='.h5ad', accessor='AnnData', description='10x reference adata', size=660792, hash='a2V0IgOjMRHsCeZH169UOQ', hash_type='md5', updated_at=2023-10-01 16:43:12)

Provenance:
  🗃️ storage: Storage(id='mw7lxLgT', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-10-01 16:42:00, created_by_id='DzTjkKse')
  💫 transform: Transform(id='ManDYgmftZ8Cz8', name='Append a new batch of data', short_name='scrna1', version='0', type=notebook, updated_at=2023-10-01 16:43:12, created_by_id='DzTjkKse')
  👣 run: Run(id='k9J7cQZqsnA5sK7oIO1S', run_at=2023-10-01 16:42:52, transform_id='ManDYgmftZ8Cz8', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-01 16:42:00)
Features:
  var: FeatureSet(id='2wFhGDUjuAeoU8jTR3XH', n=754, type='number', registry='bionty.Gene', hash='WMDxN7253SdzGwmznV5d', updated_at=2023-10-01 16:43:12, modality_id='ea9UjVtc', created_by_id='DzTjkKse')
    'NSUN6', 'S100A6', 'LAG3', 'EIF2AK1', 'DHRS4L2', 'LSM5', 'G0S2', 'CCDC107', 'PSMD7', 'HNRNPF', 'EIF3G', 'PSMC5', 'HLA-DMA', 'MFSD14B', 'OSBPL8', 'CD63', 'DHRS7', 'PNN', 'MRPS33', 'LYPD2', ...
  obs: FeatureSet(id='jugdraH3kleI6wzGVAaT', n=1, registry='core.Feature', hash='QFilx9ah7bacDSHBYJOD', updated_at=2023-10-01 16:43:12, modality_id='FqWJG3xl', created_by_id='DzTjkKse')
    🔗 cell_type (9, bionty.CellType): 'CD24-positive, CD4 single-positive thymocyte', 'B cell, CD19-positive', 'CD16-positive, CD56-dim natural killer cell, human', 'monocyte', 'gamma-delta T cell', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'dendritic cell', 'CD4-positive, alpha-beta T cell'
  external: FeatureSet(id='P1bXXRGZjslUlZiURh3N', n=2, registry='core.Feature', hash='pzAiye3Tiav8qFggirus', updated_at=2023-10-01 16:43:12, modality_id='FqWJG3xl', created_by_id='DzTjkKse')
    🔗 species (1, bionty.Species): 'human'
    🔗 assay (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ cell_types (9, bionty.CellType): 'CD24-positive, CD4 single-positive thymocyte', 'B cell, CD19-positive', 'CD16-positive, CD56-dim natural killer cell, human', 'monocyte', 'gamma-delta T cell', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'dendritic cell', 'CD4-positive, alpha-beta T cell'
  🏷️ experimental_factors (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'

file.view_flow()

https://d33wubrfki0l68.cloudfront.net/5891131c8d90dffba62150325709fd751fe76c5e/d4d3d/_images/ffe6314344188a3e3a703d09f34d475a1cd235371c4459bd3120a011104b56df.svg

Create a new version of the dataset by appending a file

import lamindb as ln

Query the file we just registered, which is the most recently created one:

file = ln.File.filter().order_by("-created_at").first()

file

File(id='tKBg5wZP51aKaLiPXbGK', suffix='.h5ad', accessor='AnnData', description='10x reference adata', size=660792, hash='a2V0IgOjMRHsCeZH169UOQ', hash_type='md5', updated_at=2023-10-01 16:43:12, storage_id='mw7lxLgT', transform_id='ManDYgmftZ8Cz8', run_id='k9J7cQZqsnA5sK7oIO1S', created_by_id='DzTjkKse')

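Query version 1 of the dataset:

dataset_v1 = ln.Dataset.filter(name="My versioned scRNA-seq dataset", version="1").one()

Create a new version from the new file and the file backing version 1:

dataset_v2 = ln.Dataset(
    [file, dataset_v1.file],
    is_new_version_of=dataset_v1,
)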

dataset_v2

Dataset(id='vAk7VTi8De3y0rT7H87n', name='My versioned scRNA-seq dataset', version='2', hash='0Uq1qU7xX7R6pyWN3oOT', transform_id='ManDYgmftZ8Cz8', run_id='k9J7cQZqsnA5sK7oIO1S', initial_version_id='vAk7VTi8De3y0rT7H8u9', created_by_id='DzTjkKse')
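The printed record shows what the new version carries over: the same name, an incremented version, and a pointer to the initial version. The sketch below models just that bookkeeping with a small dataclass. It is not lamindb's implementation; `DatasetRecord`, `new_version_of`, the placeholder IDs and the assumption that versions are integer strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetRecord:
    id: str
    name: str
    version: str
    initial_version_id: Optional[str] = None  # None for the first version


def new_version_of(previous: DatasetRecord, new_id: str) -> DatasetRecord:
    """Derive the next version: same name, bumped version, link to the first version."""
    initial = previous.initial_version_id or previous.id
    return DatasetRecord(
        id=new_id,
        name=previous.name,
        version=str(int(previous.version) + 1),  # assumes integer version strings
        initial_version_id=initial,
    )


v1 = DatasetRecord(id="id-of-v1", name="My versioned scRNA-seq dataset", version="1")
v2 = new_version_of(v1, new_id="id-of-v2")
v3 = new_version_of(v2, new_id="id-of-v3")
print(v2)
print(v3.version, v3.initial_version_id)  # 3 id-of-v1
```

Pointing every later version at the first one, rather than at its direct predecessor, makes it possible to collect all versions of a dataset with a single lookup.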

dataset_v2.save()

dataset_v2.labels.add_from(file)
dataset_v2.labels.add_from(dataset_v1)
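Conceptually, inheriting labels from several sources amounts to a per-feature union. The sketch below illustrates this with dictionaries of sets; it is not how `labels.add_from` is implemented, and the `add_labels_from` helper and the abbreviated label sets are assumptions.

```python
def add_labels_from(target, source):
    """Merge every label of source into target, feature by feature."""
    for feature, values in source.items():
        target.setdefault(feature, set()).update(values)


# Abbreviated toy label sets (assumption): the new file and version 1 of the dataset.
file_labels = {
    "species": {"human"},
    "assay": {"single-cell RNA sequencing"},
    "cell_type": {"monocyte", "dendritic cell"},
}
dataset_v1_labels = {
    "species": {"human"},
    "assay": {"10x 3' v3", "10x 5' v1"},
    "tissue": {"blood", "lung"},
    "cell_type": {"monocyte", "memory B cell"},
}

dataset_v2_labels = {}
add_labels_from(dataset_v2_labels, file_labels)
add_labels_from(dataset_v2_labels, dataset_v1_labels)

for feature in sorted(dataset_v2_labels):
    print(feature, sorted(dataset_v2_labels[feature]))
```

Because sets ignore duplicates, labels shared by both sources, such as the species, appear only once in the merged result.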

dataset_v2.view_flow()

https://d33wubrfki0l68.cloudfront.net/0ca7bd1cf8201b3d0e139d3f69fd452de8244436/6e0fa/_images/b2101198dfa0db0efdffe7be5d2661cd2a3529d9c885b680d7382d7a28ab1fc6.svg

Version 2 of the dataset now spans 39 cell types, 17 tissues, 12 donors and 4 assay labels:

dataset_v2.describe()

Dataset(id='vAk7VTi8De3y0rT7H87n', name='My versioned scRNA-seq dataset', version='2', hash='0Uq1qU7xX7R6pyWN3oOT', updated_at=2023-10-01 16:43:18)

Provenance:
  💫 transform: Transform(id='ManDYgmftZ8Cz8', name='Append a new batch of data', short_name='scrna1', version='0', type=notebook, updated_at=2023-10-01 16:43:13, created_by_id='DzTjkKse')
  👣 run: Run(id='k9J7cQZqsnA5sK7oIO1S', run_at=2023-10-01 16:42:52, transform_id='ManDYgmftZ8Cz8', created_by_id='DzTjkKse')
  🔖 initial_version: Dataset(id='vAk7VTi8De3y0rT7H8u9', name='My versioned scRNA-seq dataset', version='1', hash='WEFcMZxJNmMiUOFrcSTaig', updated_at=2023-10-01 16:42:46, transform_id='Nv48yAceNSh8z8', run_id='QbgMTXxxJxPHD0ovcimm', file_id='vAk7VTi8De3y0rT7H8u9', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-01 16:42:00)
Features:
  var: FeatureSet(id='H98mJCGEeflhKjjhDFRH', n=37257, type='number', registry='bionty.Gene', hash='bWfNZ3hy-yGnPj65T3kc', updated_at=2023-10-01 16:43:13, created_by_id='DzTjkKse')
    'BEND6', 'None', 'SCAMP3', 'None', 'PSMB3', 'None', 'UGT2B7', 'ADCY6', 'NMUR1', 'None', 'None', 'STAG1', 'LINC01035', 'None', 'SNRPA1-DT', 'LINC01730', 'TWF2', 'SCGB1D2', 'LINC03025', 'RGPD4', ...
  obs: FeatureSet(id='eqbPhBQrbb7AoBWu5XY8', n=4, registry='core.Feature', hash='BUgwklP7zIs3Qx8yNIJ1', updated_at=2023-10-01 16:42:41, modality_id='FqWJG3xl', created_by_id='DzTjkKse')
    🔗 assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v1', '10x 3' v3', '10x 5' v2'
    🔗 donor (12, core.ULabel): '582C', 'A52', 'A37', 'D496', '640C', '621B', 'A29', 'A35', 'A36', '637C', ...
    🔗 tissue (17, bionty.Tissue): 'bone marrow', 'omentum', 'ileum', 'blood', 'thoracic lymph node', 'liver', 'sigmoid colon', 'lung', 'caecum', 'skeletal muscle tissue', ...
    🔗 cell_type (39, bionty.CellType): 'CD24-positive, CD4 single-positive thymocyte', 'B cell, CD19-positive', 'CD16-positive, CD56-dim natural killer cell, human', 'monocyte', 'gamma-delta T cell', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'dendritic cell', 'CD4-positive, alpha-beta T cell', 'memory B cell', ...
  external: FeatureSet(id='P1bXXRGZjslUlZiURh3N', n=2, registry='core.Feature', hash='pzAiye3Tiav8qFggirus', updated_at=2023-10-01 16:43:12, modality_id='FqWJG3xl', created_by_id='DzTjkKse')
    🔗 species (1, bionty.Species): 'human'
    🔗 assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v1', '10x 3' v3', '10x 5' v2'
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ tissues (17, bionty.Tissue): 'bone marrow', 'omentum', 'ileum', 'blood', 'thoracic lymph node', 'liver', 'sigmoid colon', 'lung', 'caecum', 'skeletal muscle tissue', ...
  🏷️ cell_types (39, bionty.CellType): 'CD24-positive, CD4 single-positive thymocyte', 'B cell, CD19-positive', 'CD16-positive, CD56-dim natural killer cell, human', 'monocyte', 'gamma-delta T cell', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'dendritic cell', 'CD4-positive, alpha-beta T cell', 'memory B cell', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v1', '10x 3' v3', '10x 5' v2'
  🏷️ ulabels (12, core.ULabel): '582C', 'A52', 'A37', 'D496', '640C', '621B', 'A29', 'A35', 'A36', '637C', ...