Train an autoencoder to get a low-dimensional representation#
import lamindb as ln
import anndata as ad
import numpy as np
import scgen
💡 loaded instance: testuser1/test-scrna (lamindb 0.54.4)
2023-10-01 16:43:44,125:INFO - Created a temporary directory at /tmp/tmp12mhyx8w
2023-10-01 16:43:44,128:INFO - Writing /tmp/tmp12mhyx8w/_remote_module_non_scriptable.py
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/scvi/_settings.py:63: UserWarning: Since v1.0.0, scvi-tools no longer uses a random seed by default. Run `scvi.settings.seed = 0` to reproduce results from previous versions.
self.seed = seed
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/scvi/_settings.py:70: UserWarning: Setting `dl_pin_memory_gpu_training` is deprecated in v1.0 and will be removed in v1.1. Please pass in `pin_memory` to the data loaders instead.
self.dl_pin_memory_gpu_training = (
ln.track()
💡 notebook imports: anndata==0.9.2 lamindb==0.54.4 numpy==1.25.2 scgen==2.1.1
💡 Transform(id='Qr1kIHvK506rz8', name='Train an autoencoder to get a low-dimensional representation', short_name='scrna4', version='0', type=notebook, updated_at=2023-10-01 16:43:46, created_by_id='DzTjkKse')
💡 Run(id='S2zEJMDz9f5iqX8LKq3P', run_at=2023-10-01 16:43:46, transform_id='Qr1kIHvK506rz8', created_by_id='DzTjkKse')
dataset = ln.Dataset.filter(name="My versioned scRNA-seq dataset", version="2").one()
Train scgen model on the concatenated dataset#
data_train = dataset.load(join="inner")
data_train
AnnData object with n_obs × n_vars = 1718 × 749
obs: 'cell_type', 'file_id'
obsm: 'X_umap'
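`join="inner"` concatenates the files on the intersection of their genes, which is why `n_vars` drops to the 749 genes shared by both files. A toy illustration of the inner-join semantics (the gene names below are made up):

```python
# Inner join on variables: only genes present in every file survive.
genes_file1 = ["g1", "g2", "g3", "g4"]
genes_file2 = ["g2", "g3", "g5"]

present = set(genes_file2)
shared = [g for g in genes_file1 if g in present]  # keeps file1's order
print(shared)  # ['g2', 'g3']
```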
data_train.obs.file_id.value_counts()
vAk7VTi8De3y0rT7H8u9 1648
tKBg5wZP51aKaLiPXbGK 70
Name: file_id, dtype: int64
We use SCGEN here instead of SCVI or SCANVI because we only have access to normalized expression data.
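SCGEN expects normalized (e.g. log-transformed) expression values, whereas SCVI and SCANVI model raw counts. A quick heuristic sanity check you could run before training (the `looks_normalized` helper below is ours, not part of scgen):

```python
import numpy as np

def looks_normalized(X: np.ndarray) -> bool:
    """Heuristic: raw counts are integer-valued; normalized data usually is not."""
    return not np.allclose(X, np.round(X))

raw_counts = np.array([[0.0, 3.0], [1.0, 0.0]])
# Typical normalization: scale per cell, then log-transform.
normalized = np.log1p(raw_counts / raw_counts.sum(axis=1, keepdims=True) * 1e4)

print(looks_normalized(raw_counts))  # False
print(looks_normalized(normalized))  # True
```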
scgen.SCGEN.setup_anndata(data_train)
vae = scgen.SCGEN(data_train)
vae.train(max_epochs=1) # we use max_epochs=1 to be able to run it on CI
INFO: GPU available: False, used: False
2023-10-01 16:43:46,745:INFO - GPU available: False, used: False
INFO: TPU available: False, using: 0 TPU cores
2023-10-01 16:43:46,748:INFO - TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
2023-10-01 16:43:46,750:INFO - IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
2023-10-01 16:43:46,751:INFO - HPU available: False, using: 0 HPUs
Training: 0%| | 0/1 [00:00<?, ?it/s]
Epoch 1/1: 0%| | 0/1 [00:00<?, ?it/s]
Epoch 1/1: 100%|██████████| 1/1 [00:00<00:00, 2.92it/s]
Epoch 1/1: 100%|██████████| 1/1 [00:00<00:00, 2.92it/s, v_num=1, train_loss_step=238, train_loss_epoch=264]
INFO: `Trainer.fit` stopped: `max_epochs=1` reached.
2023-10-01 16:43:47,295:INFO - `Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 1/1: 100%|██████████| 1/1 [00:00<00:00, 2.83it/s, v_num=1, train_loss_step=238, train_loss_epoch=264]
Train on the files iteratively#
For a large number of huge files it might be better to train the model iteratively.
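The pattern sketched generically: train on one file, persist the weights, then reload them to warm-start training on the next file. The `ToyModel` class below is a made-up stand-in; in the cells that follow, scgen's `save`/`load` play its role:

```python
import json
import os
import tempfile

class ToyModel:
    """Stand-in for a VAE: 'training' just accumulates the samples seen."""
    def __init__(self, n_seen=0):
        self.n_seen = n_seen

    def train(self, chunk):
        self.n_seen += len(chunk)

    def save(self, path):
        with open(path, "w") as f:
            json.dump({"n_seen": self.n_seen}, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(**json.load(f))

chunks = [[1, 2, 3], [4, 5]]  # stand-ins for per-file AnnData objects
path = os.path.join(tempfile.mkdtemp(), "model.json")

model = ToyModel()
for i, chunk in enumerate(chunks):
    if i > 0:
        model = ToyModel.load(path)  # warm-start from the previous file
    model.train(chunk)
    model.save(path)

print(ToyModel.load(path).n_seen)  # 5
```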
file1, file2 = dataset.files.list()
shared_genes = file1.features["var"] & file2.features["var"]
shared_genes_ensembl = shared_genes.list("ensembl_gene_id")
data_train1 = file1.load()[:, shared_genes_ensembl].copy()
data_train1
AnnData object with n_obs × n_vars = 70 × 749
obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
var: 'gene_symbol', 'n_counts', 'highly_variable'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
scgen.SCGEN.setup_anndata(data_train1)
vae = scgen.SCGEN(data_train1)
vae.train(max_epochs=1) # we use max_epochs=1 to be able to run it on CI
INFO: GPU available: False, used: False
2023-10-01 16:43:47,432:INFO - GPU available: False, used: False
INFO: TPU available: False, using: 0 TPU cores
2023-10-01 16:43:47,434:INFO - TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
2023-10-01 16:43:47,435:INFO - IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
2023-10-01 16:43:47,436:INFO - HPU available: False, using: 0 HPUs
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py:281: PossibleUserWarning: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
rank_zero_warn(
Training: 0%| | 0/1 [00:00<?, ?it/s]
Epoch 1/1: 0%| | 0/1 [00:00<?, ?it/s]
Epoch 1/1: 100%|██████████| 1/1 [00:00<00:00, 28.04it/s, v_num=1, train_loss_step=446, train_loss_epoch=446]
INFO: `Trainer.fit` stopped: `max_epochs=1` reached.
2023-10-01 16:43:47,495:INFO - `Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 1/1: 100%|██████████| 1/1 [00:00<00:00, 20.84it/s, v_num=1, train_loss_step=446, train_loss_epoch=446]
vae.save("saved_models/scgen")
data_train2 = file2.load()[:, shared_genes_ensembl].copy()
data_train2
AnnData object with n_obs × n_vars = 1648 × 749
obs: 'donor', 'tissue', 'cell_type', 'assay'
var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
uns: 'cell_type_ontology_term_id_colors', 'default_embedding', 'schema_version', 'title'
obsm: 'X_umap'
vae = scgen.SCGEN.load("saved_models/scgen", data_train2)
INFO
File saved_models/scgen/model.pt already downloaded
vae.train(max_epochs=1) # we use max_epochs=1 to be able to run it on CI
INFO: GPU available: False, used: False
2023-10-01 16:43:47,702:INFO - GPU available: False, used: False
INFO: TPU available: False, using: 0 TPU cores
2023-10-01 16:43:47,704:INFO - TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
2023-10-01 16:43:47,706:INFO - IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
2023-10-01 16:43:47,707:INFO - HPU available: False, using: 0 HPUs
Training: 0%| | 0/1 [00:00<?, ?it/s]
Epoch 1/1: 0%| | 0/1 [00:00<?, ?it/s]
Epoch 1/1: 100%|██████████| 1/1 [00:00<00:00, 3.42it/s]
Epoch 1/1: 100%|██████████| 1/1 [00:00<00:00, 3.42it/s, v_num=1, train_loss_step=185, train_loss_epoch=255]
INFO: `Trainer.fit` stopped: `max_epochs=1` reached.
2023-10-01 16:43:48,023:INFO - `Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 1/1: 100%|██████████| 1/1 [00:00<00:00, 3.27it/s, v_num=1, train_loss_step=185, train_loss_epoch=255]
vae.save("saved_models/scgen", overwrite=True)
Save the model weights#
weights = ln.File("saved_models/scgen/model.pt", key="models/scgen/model.pt")
weights.save()
Get and store the low-dimensional representation#
latent1 = vae.get_latent_representation(data_train1)
latent2 = vae.get_latent_representation(data_train2)
latent = np.vstack((latent1, latent2))
INFO
Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup
adata_latent = ad.AnnData(X=latent)
Set the file id for each observation:
adata_latent.obs["file_id"] = np.concatenate(
(np.full(len(data_train1), file1.id), np.full(len(data_train2), file2.id))
)
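Note that the row order of `latent` and the order of the `file_id` labels must match: both concatenate file 1 first, then file 2. A minimal shape check with toy arrays (sizes and ids below are illustrative):

```python
import numpy as np

latent1 = np.zeros((3, 10))  # 3 cells from "file 1", 10 latent dims
latent2 = np.ones((5, 10))   # 5 cells from "file 2"
latent = np.vstack((latent1, latent2))

file_ids = np.concatenate((np.full(3, "f1"), np.full(5, "f2")))

assert latent.shape == (8, 10)
assert file_ids.shape == (8,)
# The first len(latent1) labels belong to file 1:
print(file_ids[:3].tolist())  # ['f1', 'f1', 'f1']
```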
file_latent = ln.File(adata_latent, key="adata_latent.h5ad")
... storing 'file_id' as categorical
file_latent.save()
file_latent.genes.set(shared_genes)
file_latent.describe()
File(id='jsbgoOCSz9mAbH0PPyKL', key='adata_latent.h5ad', suffix='.h5ad', accessor='AnnData', size=801552, hash='ZQN73PwOhAlXdca2h43-Yg', hash_type='md5', updated_at=2023-10-01 16:43:48)
Provenance:
🗃️ storage: Storage(id='mw7lxLgT', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-10-01 16:42:00, created_by_id='DzTjkKse')
💫 transform: Transform(id='Qr1kIHvK506rz8', name='Train an autoencoder to get a low-dimensional representation', short_name='scrna4', version='0', type=notebook, updated_at=2023-10-01 16:43:48, created_by_id='DzTjkKse')
👣 run: Run(id='S2zEJMDz9f5iqX8LKq3P', run_at=2023-10-01 16:43:46, transform_id='Qr1kIHvK506rz8', created_by_id='DzTjkKse')
👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-01 16:42:00)
Labels:
🏷️ genes (749, bionty.Gene): 'MRPS25', 'HMGA1', 'SH3YL1', 'NAP1L1', 'SIAH2', 'TMEM176B', 'TMEM208', 'HMGB2', 'RAB7A', 'MIF', ...
Append the low-dimensional representation to the dataset#
dataset_v3 = ln.Dataset(
dataset.files.list() + [file_latent],
is_new_version_of=dataset,
)
dataset_v3
Dataset(id='vAk7VTi8De3y0rT7H8Lf', name='My versioned scRNA-seq dataset', version='3', hash='Fqz8hhxsjtN3qowY_Lbq', transform_id='Qr1kIHvK506rz8', run_id='S2zEJMDz9f5iqX8LKq3P', initial_version_id='vAk7VTi8De3y0rT7H8u9', created_by_id='DzTjkKse')
dataset_v3.save()
# clean up test instance
!lamin delete --force test-scrna
!rm -r ./test-scrna
💡 deleting instance testuser1/test-scrna
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-scrna.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
❗ consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna