Analyze the dataset in memory

Here, we’ll analyze the growing dataset by loading it into memory.

This is only possible if the dataset is not too large. If you work with particularly large data, see the guide on iterating over data batches (to come).

import lamindb as ln
import lnschema_bionty as lb
import anndata as ad

💡 loaded instance: testuser1/test-scrna (lamindb 0.54.4)

ln.track()

💡 notebook imports: anndata==0.9.2 lamindb==0.54.4 lnschema_bionty==0.31.2 scanpy==1.9.5

💡 Transform(id='mfWKm8OtAzp8z8', name='Analyze the dataset in memory', short_name='scrna3', version='0', type=notebook, updated_at=2023-10-01 16:43:31, created_by_id='DzTjkKse')

💡 Run(id='FlQSeyOHOli05fdSjPIA', run_at=2023-10-01 16:43:31, transform_id='mfWKm8OtAzp8z8', created_by_id='DzTjkKse')

ln.Dataset.filter().df()

	name	description	version	hash	reference	reference_type	transform_id	run_id	file_id	initial_version_id	updated_at	created_by_id
id
vAk7VTi8De3y0rT7H8u9	My versioned scRNA-seq dataset	None	1	WEFcMZxJNmMiUOFrcSTaig	None	None	Nv48yAceNSh8z8	QbgMTXxxJxPHD0ovcimm	vAk7VTi8De3y0rT7H8u9	None	2023-10-01 16:42:46	DzTjkKse
vAk7VTi8De3y0rT7H87n	My versioned scRNA-seq dataset	None	2	0Uq1qU7xX7R6pyWN3oOT	None	None	ManDYgmftZ8Cz8	k9J7cQZqsnA5sK7oIO1S	None	vAk7VTi8De3y0rT7H8u9	2023-10-01 16:43:18	DzTjkKse

dataset = ln.Dataset.filter(name="My versioned scRNA-seq dataset", version="2").one()

dataset.files.df()

	storage_id	key	suffix	accessor	description	version	size	hash	hash_type	transform_id	run_id	initial_version_id	updated_at	created_by_id
id
tKBg5wZP51aKaLiPXbGK	mw7lxLgT	None	.h5ad	AnnData	10x reference adata	None	660792	a2V0IgOjMRHsCeZH169UOQ	md5	ManDYgmftZ8Cz8	k9J7cQZqsnA5sK7oIO1S	None	2023-10-01 16:43:12	DzTjkKse
vAk7VTi8De3y0rT7H8u9	mw7lxLgT	None	.h5ad	AnnData	Conde22	None	28049505	WEFcMZxJNmMiUOFrcSTaig	md5	Nv48yAceNSh8z8	QbgMTXxxJxPHD0ovcimm	None	2023-10-01 16:42:46	DzTjkKse

If the dataset doesn’t consist of too many files, we can now load it into memory.

Under the hood, the AnnData objects are concatenated during loading.

How long this takes depends on factors such as the number and size of the files.

If you load the dataset often, consider storing a concatenated version of it rather than the individual pieces.

adata = dataset.load()

As in pandas, concatenation defaults to an outer join:

adata

AnnData object with n_obs × n_vars = 1718 × 36508
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain', 'donor', 'tissue', 'assay', 'file_id'
    obsm: 'X_pca', 'X_umap'
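
To illustrate what an outer join means during concatenation, here is a minimal sketch in plain pandas. The two tables and their gene and cell names are hypothetical toy data, not part of the dataset above, and the sketch does not reproduce what dataset.load() does internally.

```python
import pandas as pd

# Two hypothetical expression tables (cells x genes) that share only "gene2".
batch_a = pd.DataFrame(
    {"gene1": [1.0, 2.0], "gene2": [0.0, 3.0]},
    index=["cell_a1", "cell_a2"],
)
batch_b = pd.DataFrame(
    {"gene2": [5.0], "gene3": [7.0]},
    index=["cell_b1"],
)

# join="outer" (the pandas default) keeps the union of all columns;
# entries missing in one input become NaN.
outer = pd.concat([batch_a, batch_b], join="outer")

# join="inner" keeps only the columns present in every input.
inner = pd.concat([batch_a, batch_b], join="inner")

print(outer)
print(inner)
```

With an outer join, no variable is dropped just because one file lacks it, which is why the concatenated object can have more variables than any single input.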

The AnnData object references the individual files in its .obs annotations:

adata.obs.file_id.cat.categories

Index(['tKBg5wZP51aKaLiPXbGK', 'vAk7VTi8De3y0rT7H8u9'], dtype='object')
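
The file_id column makes it possible to trace each cell back to its source file. The following sketch uses a small hypothetical stand-in for adata.obs (the cell names and cell types are made up for illustration; only the two file IDs come from the output above) to count cells per file and select the cells of one file.

```python
import pandas as pd

# Hypothetical stand-in for adata.obs: one row per cell, categorical file_id.
obs = pd.DataFrame(
    {
        "cell_type": ["T cell", "B cell", "T cell"],  # toy values
        "file_id": pd.Categorical(
            [
                "tKBg5wZP51aKaLiPXbGK",
                "vAk7VTi8De3y0rT7H8u9",
                "vAk7VTi8De3y0rT7H8u9",
            ]
        ),
    },
    index=["cell_1", "cell_2", "cell_3"],
)

# Count how many cells come from each file.
cells_per_file = obs["file_id"].value_counts().to_dict()

# Boolean mask that selects the cells of a single file.
mask = obs["file_id"] == "vAk7VTi8De3y0rT7H8u9"
subset = obs[mask]

print(cells_per_file)
print(subset)
```

On the real object, the same kind of mask built from adata.obs.file_id can be used to subset the AnnData itself, for example adata[mask].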

We can easily obtain Ensembl IDs for gene symbols using the lookup object:

genes = lb.Gene.lookup(field="symbol")

genes.itm2a.ensembl_gene_id

'ENSG00000078596'
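
Conceptually, a lookup object exposes records as attributes named after a chosen field, which enables tab completion. The sketch below mimics that idea with the standard library only. GeneRecord, build_lookup, and genes_sketch are hypothetical names, and this is not how lnschema_bionty implements lookup().

```python
from collections import namedtuple

# Hypothetical minimal gene record; real Gene records carry more fields.
GeneRecord = namedtuple("GeneRecord", ["symbol", "ensembl_gene_id"])

# The single record uses the symbol and ID shown in the output above.
records = [GeneRecord(symbol="ITM2A", ensembl_gene_id="ENSG00000078596")]


def build_lookup(records, field="symbol"):
    """Return a namedtuple whose attribute names are the lowercased field values."""
    keys = [getattr(record, field).lower() for record in records]
    Lookup = namedtuple("Lookup", keys)
    return Lookup(*records)


genes_sketch = build_lookup(records)
print(genes_sketch.itm2a.ensembl_gene_id)
```

This sketch assumes every field value is a valid Python identifier once lowercased; a real implementation would need to sanitize names that are not.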

Let us create a plot:

import scanpy as sc

sc.pp.pca(adata, n_comps=2)

sc.pl.pca(
    adata,
    color=genes.itm2a.ensembl_gene_id,
    title=(
        f"{genes.itm2a.symbol} / {genes.itm2a.ensembl_gene_id} /"
        f" {genes.itm2a.description}"
    ),
    save="_itm2a",
)

WARNING: saving figure to file figures/pca_itm2a.pdf

https://d33wubrfki0l68.cloudfront.net/a16b94d52965bac89711015a9f479cef8f8ce793/6e514/_images/ce5748da790bc9e7ae953d5579b9db31f24f18e7140bef36910844e66a896392.png

file = ln.File("./figures/pca_itm2a.pdf", description="My result on ITM2A")

file.save()

file.view_flow()

https://d33wubrfki0l68.cloudfront.net/8a4f6f9daecc068a91e2eb63a65b2bbc4122c2e9/c0f38/_images/4692cee5bbbbb76d791566fa57628fe4c0e0580b572610868f75cd7583fb6c4e.svg