Train an autoencoder to get a low-dimensional representation#

import lamindb as ln
import anndata as ad
import numpy as np
import scgen
๐Ÿ’ก loaded instance: testuser1/test-scrna (lamindb 0.54.4)
2023-10-01 16:43:44,125:INFO - Created a temporary directory at /tmp/tmp12mhyx8w
2023-10-01 16:43:44,128:INFO - Writing /tmp/tmp12mhyx8w/_remote_module_non_scriptable.py
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/scvi/_settings.py:63: UserWarning: Since v1.0.0, scvi-tools no longer uses a random seed by default. Run `scvi.settings.seed = 0` to reproduce results from previous versions.
  self.seed = seed
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/scvi/_settings.py:70: UserWarning: Setting `dl_pin_memory_gpu_training` is deprecated in v1.0 and will be removed in v1.1. Please pass in `pin_memory` to the data loaders instead.
  self.dl_pin_memory_gpu_training = (
ln.track()
๐Ÿ’ก notebook imports: anndata==0.9.2 lamindb==0.54.4 numpy==1.25.2 scgen==2.1.1
๐Ÿ’ก Transform(id='Qr1kIHvK506rz8', name='Train an autoencoder to get a low-dimensional representation', short_name='scrna4', version='0', type=notebook, updated_at=2023-10-01 16:43:46, created_by_id='DzTjkKse')
๐Ÿ’ก Run(id='S2zEJMDz9f5iqX8LKq3P', run_at=2023-10-01 16:43:46, transform_id='Qr1kIHvK506rz8', created_by_id='DzTjkKse')
dataset = ln.Dataset.filter(name="My versioned scRNA-seq dataset", version="2").one()

Train scgen model on the concatenated dataset#

data_train = dataset.load(join="inner")
data_train
AnnData object with n_obs ร— n_vars = 1718 ร— 749
    obs: 'cell_type', 'file_id'
    obsm: 'X_umap'
data_train.obs.file_id.value_counts()
vAk7VTi8De3y0rT7H8u9    1648
tKBg5wZP51aKaLiPXbGK      70
Name: file_id, dtype: int64
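Loading with `join="inner"` concatenates the files on the intersection of their gene sets, analogous to an inner join on the var axis. A minimal numpy sketch of that alignment (the gene names and matrices below are made up purely for illustration):

```python
import numpy as np

# hypothetical per-file gene lists and expression matrices
genes1 = ["A", "B", "C", "D"]
genes2 = ["B", "D", "E"]
x1 = np.arange(8).reshape(2, 4)   # 2 cells x 4 genes
x2 = np.arange(6).reshape(2, 3)   # 2 cells x 3 genes

# inner join: keep only genes present in both files, in a fixed order
shared = [g for g in genes1 if g in set(genes2)]   # ["B", "D"]
idx1 = [genes1.index(g) for g in shared]
idx2 = [genes2.index(g) for g in shared]

# subset each matrix to the shared genes, then stack the cells
joined = np.vstack([x1[:, idx1], x2[:, idx2]])     # 4 cells x 2 shared genes
print(shared, joined.shape)
```

Genes present in only one file are dropped, which is why the concatenated object above has 749 variables rather than either file's full gene count.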

We use SCGEN here instead of SCVI or SCANVI because we only have access to normalized expression data.

scgen.SCGEN.setup_anndata(data_train)
vae = scgen.SCGEN(data_train)
vae.train(max_epochs=1)  # we use max_epochs=1 to be able to run it on CI
INFO: GPU available: False, used: False
INFO: TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
Epoch 1/1: 100%|██████████| 1/1 [00:00<00:00,  2.83it/s, v_num=1, train_loss_step=238, train_loss_epoch=264]
INFO: `Trainer.fit` stopped: `max_epochs=1` reached.

Train on the files iteratively#

For a large number of big files, it can be preferable to train the model iteratively, one file at a time, instead of loading the concatenated dataset into memory.
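The cells below follow a generic checkpoint-resume pattern: train on one file, save the model, restore it, and continue training on the next file. A toy numpy version of that control flow (the "model" here is just a weight vector with a made-up update rule, purely to illustrate the pattern, not scgen's internals):

```python
import os
import tempfile

import numpy as np


def train_step(weights, batch):
    # stand-in update rule: nudge the weights toward the batch mean
    return weights + 0.1 * (batch.mean(axis=0) - weights)


rng = np.random.default_rng(0)
# two "files" of cells x features
batches = [rng.normal(size=(50, 4)), rng.normal(size=(80, 4))]

ckpt = os.path.join(tempfile.mkdtemp(), "model.npy")

# train on the first file and checkpoint the weights
weights = train_step(np.zeros(4), batches[0])
np.save(ckpt, weights)

# later (possibly in another process): restore and continue on the second file
weights = train_step(np.load(ckpt), batches[1])
print(weights.shape)  # (4,)
```

With scgen, `vae.save(...)` and `scgen.SCGEN.load(...)` play the roles of `np.save` and `np.load`, as shown in the cells below.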

file1, file2 = dataset.files.list()
shared_genes = file1.features["var"] & file2.features["var"]
shared_genes_ensembl = shared_genes.list("ensembl_gene_id")
data_train1 = file1.load()[:, shared_genes_ensembl].copy()
data_train1
AnnData object with n_obs ร— n_vars = 70 ร— 749
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'gene_symbol', 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'
scgen.SCGEN.setup_anndata(data_train1)
vae = scgen.SCGEN(data_train1)
vae.train(max_epochs=1)  # we use max_epochs=1 to be able to run it on CI
INFO: GPU available: False, used: False
INFO: TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py:281: PossibleUserWarning: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
Epoch 1/1: 100%|██████████| 1/1 [00:00<00:00, 20.84it/s, v_num=1, train_loss_step=446, train_loss_epoch=446]
INFO: `Trainer.fit` stopped: `max_epochs=1` reached.

vae.save("saved_models/scgen")
data_train2 = file2.load()[:, shared_genes_ensembl].copy()
data_train2
AnnData object with n_obs ร— n_vars = 1648 ร— 749
    obs: 'donor', 'tissue', 'cell_type', 'assay'
    var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
    uns: 'cell_type_ontology_term_id_colors', 'default_embedding', 'schema_version', 'title'
    obsm: 'X_umap'
vae = scgen.SCGEN.load("saved_models/scgen", data_train2)
INFO     File saved_models/scgen/model.pt already downloaded
vae.train(max_epochs=1)  # we use max_epochs=1 to be able to run it on CI
INFO: GPU available: False, used: False
INFO: TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
Epoch 1/1: 100%|██████████| 1/1 [00:00<00:00,  3.27it/s, v_num=1, train_loss_step=185, train_loss_epoch=255]
INFO: `Trainer.fit` stopped: `max_epochs=1` reached.

vae.save("saved_models/scgen", overwrite=True)

Save the model weights#

weights = ln.File("saved_models/scgen/model.pt", key="models/scgen/model.pt")
weights.save()

Get and store the low-dimensional representation#

latent1 = vae.get_latent_representation(data_train1)
latent2 = vae.get_latent_representation(data_train2)

latent = np.vstack((latent1, latent2))
INFO     Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup
adata_latent = ad.AnnData(X=latent)

Label each observation with the id of its source file so the latent representation can be mapped back to the original files.

adata_latent.obs["file_id"] = np.concatenate(
    (np.full(len(data_train1), file1.id), np.full(len(data_train2), file2.id))
)
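The `file_id` labels line up with the rows of `latent` only because `np.concatenate` and `np.vstack` both follow the same `(latent1, latent2)` order. A small self-contained check of that alignment (the array sizes and ids below are made up):

```python
import numpy as np

latent1 = np.zeros((3, 2))   # e.g. 3 cells from the first file
latent2 = np.ones((2, 2))    # e.g. 2 cells from the second file

# stack the per-file latent matrices row-wise
latent = np.vstack((latent1, latent2))

# build the per-row labels in the same order
file_ids = np.concatenate(
    (np.full(len(latent1), "file1"), np.full(len(latent2), "file2"))
)

assert latent.shape == (5, 2)
# the first len(latent1) rows carry the first file's id
assert list(file_ids[: len(latent1)]) == ["file1"] * len(latent1)
print(latent.shape, file_ids.tolist())
```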
file_latent = ln.File(adata_latent, key="adata_latent.h5ad")
... storing 'file_id' as categorical
file_latent.save()
file_latent.genes.set(shared_genes)
file_latent.describe()
File(id='jsbgoOCSz9mAbH0PPyKL', key='adata_latent.h5ad', suffix='.h5ad', accessor='AnnData', size=801552, hash='ZQN73PwOhAlXdca2h43-Yg', hash_type='md5', updated_at=2023-10-01 16:43:48)

Provenance:
  ๐Ÿ—ƒ๏ธ storage: Storage(id='mw7lxLgT', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-10-01 16:42:00, created_by_id='DzTjkKse')
  ๐Ÿ’ซ transform: Transform(id='Qr1kIHvK506rz8', name='Train an autoencoder to get a low-dimensional representation', short_name='scrna4', version='0', type=notebook, updated_at=2023-10-01 16:43:48, created_by_id='DzTjkKse')
  ๐Ÿ‘ฃ run: Run(id='S2zEJMDz9f5iqX8LKq3P', run_at=2023-10-01 16:43:46, transform_id='Qr1kIHvK506rz8', created_by_id='DzTjkKse')
  ๐Ÿ‘ค created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-01 16:42:00)
Labels:
  ๐Ÿท๏ธ genes (749, bionty.Gene): 'MRPS25', 'HMGA1', 'SH3YL1', 'NAP1L1', 'SIAH2', 'TMEM176B', 'TMEM208', 'HMGB2', 'RAB7A', 'MIF', ...

Append the low-dimensional representation to the dataset#

dataset_v3 = ln.Dataset(
    dataset.files.list() + [file_latent],
    is_new_version_of=dataset,
)
dataset_v3
Dataset(id='vAk7VTi8De3y0rT7H8Lf', name='My versioned scRNA-seq dataset', version='3', hash='Fqz8hhxsjtN3qowY_Lbq', transform_id='Qr1kIHvK506rz8', run_id='S2zEJMDz9f5iqX8LKq3P', initial_version_id='vAk7VTi8De3y0rT7H8u9', created_by_id='DzTjkKse')
dataset_v3.save()
# clean up test instance
!lamin delete --force test-scrna
!rm -r ./test-scrna
๐Ÿ’ก deleting instance testuser1/test-scrna
โœ…     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-scrna.env
โœ…     instance cache deleted
โœ…     deleted '.lndb' sqlite file
โ—     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna