Data integration for mouse brain Spatial-epigenome-transcriptome

We collected the data from from AtlasXplore Raw data.

Loading package

[1]:
from lib_3d_OT.utils import *
import scanpy as sc
import numpy as np
import pandas as pd
import torch
from lib_3d_OT.multi_modialty import *
import torch.optim as optim
import warnings
warnings.filterwarnings("ignore")
During startup - Warning messages:
1: package ‘methods’ was built under R version 4.3.2
2: package ‘datasets’ was built under R version 4.3.2
3: package ‘utils’ was built under R version 4.3.2
4: package ‘grDevices’ was built under R version 4.3.2
5: package ‘graphics’ was built under R version 4.3.2
6: package ‘stats’ was built under R version 4.3.2
R[write to console]:                    __           __
   ____ ___  _____/ /_  _______/ /_
  / __ `__ \/ ___/ / / / / ___/ __/
 / / / / / / /__/ / /_/ (__  ) /_
/_/ /_/ /_/\___/_/\__,_/____/\__/   version 6.1.1
Type 'citation("mclust")' for citing this R package in publications.

[ ]:
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")

Loading data

[3]:
adata1=sc.read_h5ad('/home/dbj/mouse/SpatialGlue/Mouse_Brain/adata_RNA.h5ad')
adata2=sc.read_h5ad('/home/dbj/mouse/SpatialGlue/Mouse_Brain/adata_peaks_normalized.h5ad')
adata1.obs['truth']=adata1.obs['RNA_clusters']
adata2.obs['truth']=adata2.obs['ATAC_clusters']
adata1.var_names_make_unique()
adata2.var_names_make_unique()

adata3=sc.read_h5ad('/home/dbj/mouse/spatialglue_alldata/Dataset9_Mouse_Brain_H3K27ac/adata_RNA.h5ad')
adata4=sc.read_h5ad('/home/dbj/mouse/spatialglue_alldata/Dataset9_Mouse_Brain_H3K27ac/adata_peaks_normalized.h5ad')
adata3.obs['truth'] = adata3.obs['RNA_clusters']
adata4.obs['truth'] = adata4.obs['H3K27ac_clusters']
adata3.var_names_make_unique()
adata4.var_names_make_unique()

Pre-processing data

  • 3d-OT adopts standard pre-processing steps for the transcriptomic and chromatin peak data.

  • Specifically,for the transcriptomics data,the gene expression counts are log-transformed and normalized by library size via the SCANPY package.

  • The top 3,000 highly variable genes (HVGs) are selected as input of PCA for dimension reduction.

  • To ensure a consistent input dimension with the chromatin peak data, the first 50 principal components are retained and used as the input of the model.

  • For the chromatin peak data, we used LSI (latent semantic indexing) to reduce the raw chromatin peak counts data to 50 dimensions.

[4]:
sc.pp.filter_genes(adata1, min_cells=10)
sc.pp.filter_cells(adata1, min_genes=200)
sc.pp.highly_variable_genes(adata1, flavor="seurat_v3", n_top_genes=3000)
adata1 =  adata1[:, adata1.var['highly_variable']]
sc.pp.normalize_total(adata1, target_sum=1e4)
sc.pp.log1p(adata1)
sc.pp.scale(adata1)
adata1.obsm['feat'] = pca(adata1, n_comps=50)

adata2 = adata2[adata1.obs_names].copy()
if 'X_lsi' not in adata2.obsm.keys():
    sc.pp.highly_variable_genes(adata2, flavor="seurat_v3", n_top_genes=3000)
    lsi(adata2, use_highly_variable=False, n_components=51)
adata2.obsm['feat'] = adata2.obsm['X_lsi'].copy()

sc.pp.filter_genes(adata3, min_cells=10)
sc.pp.filter_cells(adata3, min_genes=200)
adata3.obsm['spatial'] = adata3.obsm['spatial'][:, [1, 0]]
adata3.obsm['spatial'][:, 1] = -adata3.obsm['spatial'][:, 1]
sc.pp.highly_variable_genes(adata3, flavor="seurat_v3", n_top_genes=3000)
adata3 =  adata3[:, adata3.var['highly_variable']]
sc.pp.normalize_total(adata3, target_sum=1e4)
sc.pp.log1p(adata3)
sc.pp.scale(adata3)
adata3.obsm['feat'] = pca(adata3, n_comps=50)

adata4 = adata4[adata3.obs_names].copy()
adata4.obsm['spatial'] = adata3.obsm['spatial']
if 'X_lsi' not in adata4.obsm.keys():
    sc.pp.highly_variable_genes(adata4, flavor="seurat_v3", n_top_genes=3000)
    lsi(adata4, use_highly_variable=False, n_components=51)
adata4.obsm['feat'] = adata4.obsm['X_lsi'].copy()

Constructing neighbor graph and training the PointNet++Encoder1

[5]:
set_seed(7)
graph1 = prepare_data(adata1, location="spatial", nb_neighbors=16).to(device)
graph2 = prepare_data(adata2, location="spatial", nb_neighbors=16).to(device)
input_dim = graph1.express.shape[-1]
hidden_dim=32
model = extractModel(input_dim,hidden_dim,n_heads=4, n_layers=3)
model = model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
start_time = time.time()
best_model1 = train_graph_extractor(graph1,graph2, model ,optimizer, device,epochs=300)
Epoch 300/300, Loss: 2.5819, Loss1: 1.8757, Loss2: 0.7062

Constructing neighbor graph and training the PointNet++Encoder2

[6]:
graph3 = prepare_data(adata3, location="spatial", nb_neighbors=16).to(device)
graph4 = prepare_data(adata4, location="spatial", nb_neighbors=16).to(device)
input_dim = graph3.express.shape[-1]
hidden_dim=32
model2 = extractModel(input_dim,hidden_dim,n_heads=4, n_layers=3)
model2 = model2.to(device)
optimizer2 = optim.Adam(model2.parameters(), lr=0.001)
best_model2 = train_graph_extractor(graph3,graph4, model2 ,optimizer2, device,epochs=300)
Epoch 300/300, Loss: 2.7008, Loss1: 1.9689, Loss2: 0.7319

Obtain the fusion feature representation of ATAC+RNA

[7]:
with torch.no_grad():
    MODEL=extractModel(input_dim=input_dim,hidden_dim=hidden_dim)
    MODEL.to(device)
    MODEL.load_state_dict(best_model1)
    MODEL.eval()
    recon1, recon2,mixed_modal_features = MODEL(graph1, graph2)
    mixed_modal_features = mixed_modal_features.squeeze(0)
    gene_expression_matrix = mixed_modal_features.cpu().detach().numpy()
adata1.obsm['3d-OT']=gene_expression_matrix

After integration, we perform clustering analysis using the fusion feature representation

[8]:
clustering(adata1, n_clusters=18,key='3d-OT', method='mclust',random=2024,n_comp=20)
Using 3d-OT representation for clustering...
fitting ...
  |======================================================================| 100%
[9]:
import matplotlib.pyplot as plt
import copy
plt.rcParams['figure.figsize'] = (4,4)
plt.rcParams['font.size'] = 20

fig, ax = plt.subplots()
sc.pl.embedding(adata1,basis='spatial',color='3d-OT',size=25,ax=ax)
../_images/Multi_Omics_spatial_domain_identification_mousebrain_16_0.png

Calculate ARI using manual annotation

[10]:
adata=sc.read_h5ad('/home/dbj/mouse/spatialglue_alldata/Dataset7_Mouse_Brain_ATAC/3d-OT.h5ad')
[11]:
import matplotlib.pyplot as plt
import copy
plt.rcParams['figure.figsize'] = (4,4)
plt.rcParams['font.size'] = 20

fig, ax = plt.subplots()
sc.pl.embedding(adata,basis='spatial',color='truth',size=25,ax=ax)
../_images/Multi_Omics_spatial_domain_identification_mousebrain_19_0.png
[12]:
import numpy as np
from sklearn.metrics import (homogeneity_score,mutual_info_score,v_measure_score,adjusted_mutual_info_score,normalized_mutual_info_score,adjusted_rand_score
)

true_labels = np.array(adata.obs['truth'])
cluster_labels = np.array(adata1.obs['3d-OT'])

homogeneity = homogeneity_score(true_labels, cluster_labels)
mutual_info = mutual_info_score(true_labels, cluster_labels)
v_measure = v_measure_score(true_labels, cluster_labels)
ami = adjusted_mutual_info_score(true_labels, cluster_labels)
nmi = normalized_mutual_info_score(true_labels, cluster_labels)
ari = adjusted_rand_score(true_labels, cluster_labels)

print("Homogeneity:", homogeneity)
print("Mutual Information:", mutual_info)
print("V-Measure:", v_measure)
print("AMI:", ami)
print("NMI:", nmi)
print("ARI:", ari)
Homogeneity: 0.6730358682433512
Mutual Information: 1.3446039990672198
V-Measure: 0.5969821720866962
AMI: 0.5956272981301872
NMI: 0.5969821720866962
ARI: 0.43275843056253743

Obtain the fusion feature representation of H3K27ac+RNA

[11]:
with torch.no_grad():
    MODEL=extractModel(input_dim=input_dim,hidden_dim=hidden_dim)
    MODEL.to(device)
    MODEL.load_state_dict(best_model2)
    MODEL.eval()
    recon1, recon2,mixed_modal_features = MODEL(graph3, graph4)
    mixed_modal_features = mixed_modal_features.squeeze(0)
    gene_expression_matrix = mixed_modal_features.cpu().detach().numpy()
adata3.obsm['3d-OT']=gene_expression_matrix
[12]:
clustering(adata3, n_clusters=16,key='3d-OT', method='mclust',random=2024,n_comp=20)
Using 3d-OT representation for clustering...
fitting ...
  |======================================================================| 100%
[13]:
import matplotlib.pyplot as plt
import copy
plt.rcParams['figure.figsize'] = (4,4)
plt.rcParams['font.size'] = 20

fig, ax = plt.subplots()
sc.pl.embedding(adata3,basis='spatial',color='3d-OT',size=25,ax=ax)
../_images/Multi_Omics_spatial_domain_identification_mousebrain_24_0.png

H3K4me3+RNA

[ ]:
adata1=sc.read_h5ad('/home/dbj/mouse/spatialglue_alldata/Dataset8_Mouse_Brain_H3K4me3/adata_RNA.h5ad')
adata2=sc.read_h5ad('/home/dbj/mouse/spatialglue_alldata/Dataset8_Mouse_Brain_H3K4me3/adata_peaks_normalized.h5ad')
adata1.obs['truth'] = adata1.obs['RNA_clusters']
adata2.obs['truth'] = adata2.obs['H3K4me3_clusters']
adata1.var_names_make_unique()
adata2.var_names_make_unique()
[16]:
sc.pp.filter_genes(adata1, min_cells=10)
sc.pp.filter_cells(adata1, min_genes=200)
sc.pp.highly_variable_genes(adata1, flavor="seurat_v3", n_top_genes=3000)
adata1 =  adata1[:, adata1.var['highly_variable']]
sc.pp.normalize_total(adata1, target_sum=1e4)
sc.pp.log1p(adata1)
sc.pp.scale(adata1)
adata1.obsm['feat'] = pca(adata1, n_comps=50)

adata2 = adata2[adata1.obs_names].copy()
if 'X_lsi' not in adata2.obsm.keys():
    sc.pp.highly_variable_genes(adata2, flavor="seurat_v3", n_top_genes=3000)
    lsi(adata2, use_highly_variable=False, n_components=51)
adata2.obsm['feat'] = adata2.obsm['X_lsi'].copy()
[17]:
set_seed(7)
graph1 = prepare_data(adata1, location="spatial", nb_neighbors=16).to(device)
graph2 = prepare_data(adata2, location="spatial", nb_neighbors=16).to(device)
input_dim = graph1.express.shape[-1]
hidden_dim=32
model = extractModel(input_dim,hidden_dim,n_heads=4, n_layers=3)
model = model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
start_time = time.time()
best_model1 = train_graph_extractor(graph1,graph2, model ,optimizer, device,epochs=300)
Epoch 300/300, Loss: 2.7659, Loss1: 1.9033, Loss2: 0.8626
[18]:
with torch.no_grad():
    MODEL=extractModel(input_dim=input_dim,hidden_dim=hidden_dim)
    MODEL.to(device)
    MODEL.load_state_dict(best_model1)
    MODEL.eval()
    recon1, recon2,mixed_modal_features = MODEL(graph1, graph2)
    mixed_modal_features = mixed_modal_features.squeeze(0)
    gene_expression_matrix = mixed_modal_features.cpu().detach().numpy()
adata1.obsm['3d-OT']=gene_expression_matrix
[19]:
clustering(adata1, n_clusters=18,key='3d-OT', method='mclust',random=2024,n_comp=20)
Using 3d-OT representation for clustering...
fitting ...
  |======================================================================| 100%
[24]:
import matplotlib.pyplot as plt
import copy
plt.rcParams['figure.figsize'] = (4,4)
plt.rcParams['font.size'] = 20
adata1_rotated = copy.deepcopy(adata1)
coords = adata1_rotated.obsm['spatial']

adata1_rotated.obsm['spatial'] = np.column_stack((coords[:, 1],-coords[:, 0]))
fig, ax = plt.subplots()
sc.pl.embedding(adata1_rotated,basis='spatial',color='3d-OT',size=25,ax=ax)
../_images/Multi_Omics_spatial_domain_identification_mousebrain_31_0.png

H3K27me3+RNA

[26]:
adata1=sc.read_h5ad('/home/dbj/mouse/spatialglue_alldata/Dataset10_Mouse_Brain_H3K27me3/adata_RNA.h5ad')
adata2=sc.read_h5ad('/home/dbj/mouse/spatialglue_alldata/Dataset10_Mouse_Brain_H3K27me3/adata_peaks_normalized.h5ad')
adata1.obs['truth'] = adata1.obs['RNA_clusters']
adata2.obs['truth'] = adata2.obs['H3K27me3_clusters']
adata1.var_names_make_unique()
adata2.var_names_make_unique()
[ ]:
sc.pp.filter_genes(adata1, min_cells=10)
sc.pp.filter_cells(adata1, min_genes=200)
sc.pp.highly_variable_genes(adata1, flavor="seurat_v3", n_top_genes=3000)
adata1 =  adata1[:, adata1.var['highly_variable']]
sc.pp.normalize_total(adata1, target_sum=1e4)
sc.pp.log1p(adata1)
sc.pp.scale(adata1)
adata1.obsm['feat'] = pca(adata1, n_comps=50)

adata2 = adata2[adata1.obs_names].copy()
if 'X_lsi' not in adata2.obsm.keys():
    sc.pp.highly_variable_genes(adata2, flavor="seurat_v3", n_top_genes=3000)
    lsi(adata2, use_highly_variable=False, n_components=51)
adata2.obsm['feat'] = adata2.obsm['X_lsi'].copy()
[32]:
set_seed(7)
graph1 = prepare_data(adata1, location="spatial", nb_neighbors=8).to(device)
graph2 = prepare_data(adata2, location="spatial", nb_neighbors=8).to(device)
input_dim = graph1.express.shape[-1]
hidden_dim=32
model = extractModel(input_dim,hidden_dim,n_heads=4, n_layers=3)
model = model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
start_time = time.time()
best_model1 = train_graph_extractor(graph1,graph2, model ,optimizer, device,epochs=500)
Epoch 500/500, Loss: 2.1722, Loss1: 1.5743, Loss2: 0.5979
[33]:
with torch.no_grad():
    MODEL=extractModel(input_dim=input_dim,hidden_dim=hidden_dim)
    MODEL.to(device)
    MODEL.load_state_dict(best_model1)
    MODEL.eval()
    recon1, recon2,mixed_modal_features = MODEL(graph1, graph2)
    mixed_modal_features = mixed_modal_features.squeeze(0)
    gene_expression_matrix = mixed_modal_features.cpu().detach().numpy()
adata1.obsm['3d-OT']=gene_expression_matrix
[34]:
clustering(adata1, n_clusters=18,key='3d-OT', method='mclust',random=2024,n_comp=20)
Using 3d-OT representation for clustering...
fitting ...
  |======================================================================| 100%
[39]:
import matplotlib.pyplot as plt
import copy
plt.rcParams['figure.figsize'] = (4,4)
plt.rcParams['font.size'] = 20

fig, ax = plt.subplots()
sc.pl.embedding(adata1,basis='spatial',color='3d-OT',size=25,ax=ax)
../_images/Multi_Omics_spatial_domain_identification_mousebrain_38_0.png