Analyze the dataset in memory

Here, we’ll analyze the growing dataset by loading it into memory.

This is only possible if it’s not too large. If you deal with particularly large data, please read the guide on iterating over data batches (to come).

import lamindb as ln
import lnschema_bionty as lb
import anndata as ad
💡 loaded instance: testuser1/test-scrna (lamindb 0.54.4)
ln.track()
💡 notebook imports: anndata==0.9.2 lamindb==0.54.4 lnschema_bionty==0.31.2 scanpy==1.9.5
💡 Transform(id='mfWKm8OtAzp8z8', name='Analyze the dataset in memory', short_name='scrna3', version='0', type=notebook, updated_at=2023-10-01 16:43:31, created_by_id='DzTjkKse')
💡 Run(id='FlQSeyOHOli05fdSjPIA', run_at=2023-10-01 16:43:31, transform_id='mfWKm8OtAzp8z8', created_by_id='DzTjkKse')
ln.Dataset.filter().df()
id | name | description | version | hash | reference | reference_type | transform_id | run_id | file_id | initial_version_id | updated_at | created_by_id
vAk7VTi8De3y0rT7H8u9 | My versioned scRNA-seq dataset | None | 1 | WEFcMZxJNmMiUOFrcSTaig | None | None | Nv48yAceNSh8z8 | QbgMTXxxJxPHD0ovcimm | vAk7VTi8De3y0rT7H8u9 | None | 2023-10-01 16:42:46 | DzTjkKse
vAk7VTi8De3y0rT7H87n | My versioned scRNA-seq dataset | None | 2 | 0Uq1qU7xX7R6pyWN3oOT | None | None | ManDYgmftZ8Cz8 | k9J7cQZqsnA5sK7oIO1S | None | vAk7VTi8De3y0rT7H8u9 | 2023-10-01 16:43:18 | DzTjkKse
dataset = ln.Dataset.filter(name="My versioned scRNA-seq dataset", version="2").one()
dataset.files.df()
id | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | updated_at | created_by_id
tKBg5wZP51aKaLiPXbGK | mw7lxLgT | None | .h5ad | AnnData | 10x reference adata | None | 660792 | a2V0IgOjMRHsCeZH169UOQ | md5 | ManDYgmftZ8Cz8 | k9J7cQZqsnA5sK7oIO1S | None | 2023-10-01 16:43:12 | DzTjkKse
vAk7VTi8De3y0rT7H8u9 | mw7lxLgT | None | .h5ad | AnnData | Conde22 | None | 28049505 | WEFcMZxJNmMiUOFrcSTaig | md5 | Nv48yAceNSh8z8 | QbgMTXxxJxPHD0ovcimm | None | 2023-10-01 16:42:46 | DzTjkKse
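
As a rough pre-check before loading, the size values listed above can be summed and compared against a memory budget. The following is only a sketch: `fits_in_memory`, the 4 GiB budget, and the expansion factor of 3 are assumptions made for illustration (the on-disk size of an .h5ad file can understate its in-memory footprint, for example when the file is compressed), and none of them are part of lamindb.

```python
# Sizes in bytes, copied from the `size` column of dataset.files.df() above.
file_sizes = {
    "tKBg5wZP51aKaLiPXbGK": 660_792,
    "vAk7VTi8De3y0rT7H8u9": 28_049_505,
}


def fits_in_memory(sizes, budget_bytes, expansion_factor=3.0):
    """Hypothetical helper: guess whether files of the given on-disk sizes
    fit into `budget_bytes` once loaded.

    `expansion_factor` is an assumed safety margin, not a measured value.
    """
    estimated = sum(sizes) * expansion_factor
    return estimated <= budget_bytes


total_bytes = sum(file_sizes.values())
budget = 4 * 1024**3  # assumed budget of 4 GiB

print(f"{total_bytes / 1e6:.1f} MB on disk")
print(fits_in_memory(file_sizes.values(), budget))
```

With these two files the total is small, so the check passes comfortably. For datasets with many large files, the same check would signal when to iterate over batches instead.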

If the dataset doesn’t consist of too many files, we can now load it into memory.

Under the hood, the AnnData objects are concatenated during loading.

How long this takes depends on several factors, such as the number and size of the files.

If you load the dataset often, consider storing a concatenated version of it rather than the individual pieces.

adata = dataset.load()
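
To judge whether storing a concatenated copy is worthwhile, one option is to time the load. The sketch below uses a hypothetical `timed` helper and a stand-in `fake_load` function so that it runs without a LaminDB instance. Neither name comes from lamindb.

```python
import time


def timed(func, *args, **kwargs):
    """Hypothetical helper: call `func` and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_load():
    """Stand-in for dataset.load(); sleeps briefly instead of reading files."""
    time.sleep(0.01)
    return "adata"


result, seconds = timed(fake_load)
print(f"load took {seconds:.3f} s")
```

In this notebook, the equivalent call would be `adata, seconds = timed(dataset.load)`.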

By default, concatenation uses an outer join, as in pandas:

adata
AnnData object with n_obs × n_vars = 1718 × 36508
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain', 'donor', 'tissue', 'assay', 'file_id'
    obsm: 'X_pca', 'X_umap'
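
To illustrate what an outer join means here, the following pandas sketch concatenates two toy tables whose columns only partly overlap. The gene and cell names are made up. How missing entries are filled in the AnnData case is decided by anndata and is not shown here.

```python
import pandas as pd

# Two toy expression tables: rows are cells, columns are genes (made-up names).
batch1 = pd.DataFrame(
    {"geneA": [1.0, 2.0], "geneB": [3.0, 4.0]}, index=["cell1", "cell2"]
)
batch2 = pd.DataFrame(
    {"geneB": [5.0, 6.0], "geneC": [7.0, 8.0]}, index=["cell3", "cell4"]
)

# Outer join keeps the union of columns; inner join keeps only shared columns.
outer = pd.concat([batch1, batch2], join="outer")
inner = pd.concat([batch1, batch2], join="inner")

print(sorted(outer.columns))
print(list(inner.columns))
# Cells from batch2 have no measurement for geneA, so pandas inserts NaN.
print(int(outer["geneA"].isna().sum()))
```

With an outer join, no variable is dropped: the result keeps every gene that appears in at least one of the inputs.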

The AnnData object references the individual files in its .obs annotations:

adata.obs.file_id.cat.categories
Index(['tKBg5wZP51aKaLiPXbGK', 'vAk7VTi8De3y0rT7H8u9'], dtype='object')
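
The file_id column can be used like any other categorical annotation, for example to count cells per file or to select the cells of one file. The sketch below uses a small stand-in for adata.obs with made-up cell counts. Only the two file IDs are taken from the output above.

```python
import pandas as pd

# Stand-in for adata.obs; the cell counts (3 and 2) are made up.
obs = pd.DataFrame(
    {
        "file_id": pd.Categorical(
            ["tKBg5wZP51aKaLiPXbGK"] * 3 + ["vAk7VTi8De3y0rT7H8u9"] * 2
        )
    },
    index=[f"cell{i}" for i in range(5)],
)

# Number of cells contributed by each file.
cells_per_file = obs["file_id"].value_counts()
print(cells_per_file)

# Boolean mask selecting the cells that came from one file.
from_one_file = obs[obs["file_id"] == "vAk7VTi8De3y0rT7H8u9"]
print(len(from_one_file))
```

On the real object, the same boolean mask can subset the AnnData itself, as in `adata[adata.obs.file_id == "vAk7VTi8De3y0rT7H8u9"]`.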

We can easily obtain Ensembl IDs for gene symbols using the lookup object:

genes = lb.Gene.lookup(field="symbol")
genes.itm2a.ensembl_gene_id
'ENSG00000078596'
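
To give an intuition for attribute-style lookup, here is a toy sketch of how such an object could be built from a list of records. It is not how lnschema_bionty implements `lookup`. `build_lookup` and `genes_sketch` are hypothetical names, and only the ITM2A record is taken from this notebook.

```python
from types import SimpleNamespace

# Toy gene records; the ITM2A entry mirrors the values shown above.
records = [
    {"symbol": "ITM2A", "ensembl_gene_id": "ENSG00000078596"},
]


def build_lookup(records, field):
    """Hypothetical helper: expose records as attributes keyed by `field`.

    Keys are lower-cased; sanitizing values that are not valid Python
    identifiers (for example symbols containing hyphens) is omitted here.
    """
    entries = {rec[field].lower(): SimpleNamespace(**rec) for rec in records}
    return SimpleNamespace(**entries)


genes_sketch = build_lookup(records, field="symbol")
print(genes_sketch.itm2a.ensembl_gene_id)
```

The appeal of this pattern is that records become discoverable through tab completion instead of requiring an exact string query.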

Let us create a plot:

import scanpy as sc

sc.pp.pca(adata, n_comps=2)
sc.pl.pca(
    adata,
    color=genes.itm2a.ensembl_gene_id,
    title=(
        f"{genes.itm2a.symbol} / {genes.itm2a.ensembl_gene_id} /"
        f" {genes.itm2a.description}"
    ),
    save="_itm2a",
)
WARNING: saving figure to file figures/pca_itm2a.pdf
https://d33wubrfki0l68.cloudfront.net/a16b94d52965bac89711015a9f479cef8f8ce793/6e514/_images/ce5748da790bc9e7ae953d5579b9db31f24f18e7140bef36910844e66a896392.png
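
The file name in the warning above combines the plot name with the save suffix. Before registering the figure, it can help to build that path explicitly and check that the file exists. `expected_figure_path` is a hypothetical helper, and its default directory and extension are assumptions chosen to match the warning message above.

```python
from pathlib import Path


def expected_figure_path(plot_name, save_suffix, figdir="figures", ext="pdf"):
    """Hypothetical helper: build the path the plot is expected to be saved to.

    The defaults for `figdir` and `ext` are assumptions that match the
    warning printed above; adjust them if your plotting settings differ.
    """
    return Path(figdir) / f"{plot_name}{save_suffix}.{ext}"


path = expected_figure_path("pca", "_itm2a")
print(path.as_posix())
# True only if the plot was actually written to this location.
print(path.exists())
```

Passing the checked path on to the next step avoids registering a file that was never written.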
file = ln.File("./figures/pca_itm2a.pdf", description="My result on ITM2A")
file.save()
file.view_flow()
https://d33wubrfki0l68.cloudfront.net/8a4f6f9daecc068a91e2eb63a65b2bbc4122c2e9/c0f38/_images/4692cee5bbbbb76d791566fa57628fe4c0e0580b572610868f75cd7583fb6c4e.svg