{ "cells": [ { "cell_type": "markdown", "id": "dff710cb-1074-4eda-87ac-26aa283a4f13", "metadata": { "execution": { "iopub.execute_input": "2025-03-14T19:52:37.933712Z", "iopub.status.busy": "2025-03-14T19:52:37.932638Z", "iopub.status.idle": "2025-03-14T19:52:37.948501Z", "shell.execute_reply": "2025-03-14T19:52:37.947061Z", "shell.execute_reply.started": "2025-03-14T19:52:37.933660Z" } }, "source": [ "# Stage 1: Data preprocessing\n", "\n", "In this tutorial, we'll first walk through how to prepare the datasets for use in\n", "CASCADE, using the [Norman, et al. (2019)](https://doi.org/10.1126/science.aax4438)\n", "dataset as an example. This dataset contains single- and double-gene CRISPRa\n", "perturbations." ] }, { "cell_type": "code", "execution_count": 1, "id": "38a1ac22-f453-4c41-a786-633c9068539d", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:42:20.063503Z", "iopub.status.busy": "2025-03-20T06:42:20.062713Z", "iopub.status.idle": "2025-03-20T06:42:25.040388Z", "shell.execute_reply": "2025-03-20T06:42:25.039358Z", "shell.execute_reply.started": "2025-03-20T06:42:20.063457Z" } }, "outputs": [], "source": [ "import networkx as nx\n", "import numpy as np\n", "import pandas as pd\n", "import scanpy as sc\n", "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n", "\n", "from cascade.data import (\n", " configure_dataset,\n", " encode_regime,\n", " filter_unobserved_targets,\n", " get_all_targets,\n", " get_configuration,\n", " neighbor_impute,\n", ")\n", "from cascade.graph import assemble_scaffolds" ] }, { "cell_type": "markdown", "id": "2238cb0c-1d50-44f9-ac4f-a4a8fc342da7", "metadata": {}, "source": [ "## Read data\n", "\n", "First, we need to prepare the dataset into `AnnData` objects. See the\n", "[documentation](https://anndata.readthedocs.io/) for more details if you are\n", "unfamiliar, including how to construct `AnnData` objects from scratch, and how\n", "to read data in other formats (csv, mtx, loom, etc.) into `AnnData` objects.\n", "\n", "Here we just load existing `h5ad` files, which is the native file format for\n", "`AnnData`. The `h5ad` file used in this tutorial can be downloaded from here:\n", "\n", "- http://ftp.cbi.pku.edu.cn/pub/cascade-download/Norman-2019.h5ad" ] }, { "cell_type": "code", "execution_count": 2, "id": "dc79dbab-5127-4cbc-9067-576e0e2e47df", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:42:25.041992Z", "iopub.status.busy": "2025-03-20T06:42:25.041486Z", "iopub.status.idle": "2025-03-20T06:42:36.865741Z", "shell.execute_reply": "2025-03-20T06:42:36.864853Z", "shell.execute_reply.started": "2025-03-20T06:42:25.041968Z" } }, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 86744 × 22881\n", " obs: 'guide_id', 'gemgroup', 'ncounts', 'knockup'\n", " var: 'perturbed'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata = sc.read_h5ad(\"Norman-2019.h5ad\")\n", "adata" ] }, { "cell_type": "markdown", "id": "b64b112e-2bbe-47f5-84d5-f42ff867065e", "metadata": {}, "source": [ "## Data format requirements\n", "\n", "CASCADE requires the following data format:\n", "\n", "- Raw counts in `adata.X`;\n", "- Total count in `adata.obs`, which would be used when fitting data with\n", " the negative binomial distribution;\n", "- HGNC gene symbols as `adata.var_names`;\n", "- Perturbation label in `adata.obs` that specifies which genes are perturbed\n", " in each cell:\n", " - For control cells with no perturbation, the value **MUST** be an empty\n", " string `\"\"`;\n", " - For cells with multiple perturbations, the perturbed genes should be\n", " comma-separated, e.g., `\"CEBPB,KLF1\"`;\n", " - Name of perturbed genes must match those in `adata.var_names`.\n", "\n", "In this case, we can verify that the expression matrix contains raw counts:" ] }, { "cell_type": "code", "execution_count": 3, "id": "5a2a0d5b-b18d-443a-a916-9b9709da3591", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:42:36.866987Z", "iopub.status.busy": "2025-03-20T06:42:36.866753Z", "iopub.status.idle": "2025-03-20T06:42:36.874179Z", "shell.execute_reply": "2025-03-20T06:42:36.873263Z", "shell.execute_reply.started": "2025-03-20T06:42:36.866967Z" } }, "outputs": [ { "data": { "text/plain": [ "(,\n", " array([ 1., 1., 1., ..., 12., 3., 16.], dtype=float32))" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata.X, adata.X.data" ] }, { "cell_type": "markdown", "id": "f5b5be25-a512-42ae-a00e-40c9ee175dbd", "metadata": {}, "source": [ "The total counts are stored as `\"ncounts\"` in `adata.obs`:" ] }, { "cell_type": "code", "execution_count": 4, "id": "e8258211-2b10-4120-a50e-baf923389a63", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:42:36.876137Z", "iopub.status.busy": "2025-03-20T06:42:36.875941Z", "iopub.status.idle": "2025-03-20T06:42:36.884305Z", "shell.execute_reply": "2025-03-20T06:42:36.883582Z", "shell.execute_reply.started": "2025-03-20T06:42:36.876119Z" } }, "outputs": [ { "data": { "text/plain": [ "TTGAACGAGACTCGGA 15097.0\n", "CGTTGGGGTGTTTGTG 8551.0\n", "GAACCTAAGTGTTAGA 10999.0\n", "CCTTCCCTCCGTCATC 38454.0\n", "TCCCGATGTCTCTTAT 21433.0\n", " ... \n", "TTTCCTCGTACGCACC 11991.0\n", "TTTCCTCTCTTGCCGT 16561.0\n", "TTTGCGCAGTCATGCT 5192.0\n", "TTTGCGCCAGGACCCT 15704.0\n", "TTTGCGCTCTCGCATC-1 6825.0\n", "Name: ncounts, Length: 86744, dtype: float32" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata.obs[\"ncounts\"]" ] }, { "cell_type": "markdown", "id": "da3edbb7-3b0c-4133-9f23-9f600456600d", "metadata": {}, "source": [ "And perturbation labels are stored as `\"knockup\"` in `adata.obs`:" ] }, { "cell_type": "code", "execution_count": 5, "id": "79a87d1d-ba24-49a2-a055-ac75d8723d72", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:42:36.885130Z", "iopub.status.busy": "2025-03-20T06:42:36.884950Z", "iopub.status.idle": "2025-03-20T06:42:36.892925Z", "shell.execute_reply": "2025-03-20T06:42:36.892213Z", "shell.execute_reply.started": "2025-03-20T06:42:36.885113Z" } }, "outputs": [ { "data": { "text/plain": [ "TTGAACGAGACTCGGA ARID1A\n", "CGTTGGGGTGTTTGTG BCORL1\n", "GAACCTAAGTGTTAGA FOSB\n", "CCTTCCCTCCGTCATC KLF1,SET\n", "TCCCGATGTCTCTTAT BAK1,KLF1\n", " ... \n", "TTTCCTCGTACGCACC \n", "TTTCCTCTCTTGCCGT HK2\n", "TTTGCGCAGTCATGCT RHOXF2\n", "TTTGCGCCAGGACCCT BAK1,BCL2L11\n", "TTTGCGCTCTCGCATC-1 CEBPB,OSR2\n", "Name: knockup, Length: 86744, dtype: category\n", "Categories (237, object): ['', 'AHR', 'AHR,FEV', 'AHR,KLF1', ..., 'ZBTB10', 'ZBTB25', 'ZC3HAV1', 'ZNF318']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata.obs[\"knockup\"]" ] }, { "cell_type": "markdown", "id": "3babc0b8-99df-4452-903d-6ded8addcea8", "metadata": {}, "source": [ "Before any further processing, we back up the raw UMI counts in a layer called\n", "`“counts”`, which will be used later during model training." ] }, { "cell_type": "code", "execution_count": 6, "id": "87098765-026a-4484-9cd7-0ea0c6360e52", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:42:36.893717Z", "iopub.status.busy": "2025-03-20T06:42:36.893540Z", "iopub.status.idle": "2025-03-20T06:42:37.992858Z", "shell.execute_reply": "2025-03-20T06:42:37.991798Z", "shell.execute_reply.started": "2025-03-20T06:42:36.893700Z" } }, "outputs": [], "source": [ "adata.layers[\"counts\"] = adata.X.copy()" ] }, { "cell_type": "markdown", "id": "897ee91f-d3e6-43f4-a860-b66c9cd59ed8", "metadata": {}, "source": [ "## Cell and gene selection\n", "\n", "Since CASCADE can only model perturbations in measured genes, we first filter out\n", "any perturbation that was missing from the readout. A utility function called\n", "[filter_unobserved_targets](api/cascade.data.filter_unobserved_targets.rst) is\n", "provided for this purpose.\n", "\n", "In this case no cell was filtered:" ] }, { "cell_type": "code", "execution_count": 7, "id": "0580b413-657c-4ae7-aada-16bf589b1b15", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:42:37.993687Z", "iopub.status.busy": "2025-03-20T06:42:37.993485Z", "iopub.status.idle": "2025-03-20T06:42:38.233906Z", "shell.execute_reply": "2025-03-20T06:42:38.233128Z", "shell.execute_reply.started": "2025-03-20T06:42:37.993668Z" } }, "outputs": [ { "data": { "text/plain": [ "View of AnnData object with n_obs × n_vars = 86744 × 22881\n", " obs: 'guide_id', 'gemgroup', 'ncounts', 'knockup'\n", " var: 'perturbed'\n", " layers: 'counts'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "filter_unobserved_targets(adata, \"knockup\")" ] }, { "cell_type": "markdown", "id": "26504de2-bd59-4855-b73b-1b9fecff48a4", "metadata": {}, "source": [ "Next, we identify highly variable genes using the `\"seurat_v3\"` method,\n", "to allow the model to focus on informative genes:" ] }, { "cell_type": "code", "execution_count": 8, "id": "99819c64-421e-4e95-82c1-3c3231ec7c77", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:42:38.234825Z", "iopub.status.busy": "2025-03-20T06:42:38.234608Z", "iopub.status.idle": "2025-03-20T06:42:43.415170Z", "shell.execute_reply": "2025-03-20T06:42:43.414125Z", "shell.execute_reply.started": "2025-03-20T06:42:38.234796Z" } }, "outputs": [], "source": [ "sc.pp.highly_variable_genes(adata, n_top_genes=1000, flavor=\"seurat_v3\")" ] }, { "cell_type": "markdown", "id": "e1d022cc-4cf1-4141-8f74-c8f9a88915c3", "metadata": {}, "source": [ "Again, as CASCADE can only model perturbations in measured genes, we expand this\n", "highly variable gene set to incorporating all perturbed genes (via a utility\n", "function [get_all_targets](api/cascade.data.get_all_targets.rst)) to avoid\n", "discarding useful perturbation information:" ] }, { "cell_type": "code", "execution_count": 9, "id": "6f75a5a3-381e-4618-b246-5bb2ad90d76a", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:42:43.416536Z", "iopub.status.busy": "2025-03-20T06:42:43.416307Z", "iopub.status.idle": "2025-03-20T06:42:43.424544Z", "shell.execute_reply": "2025-03-20T06:42:43.423834Z", "shell.execute_reply.started": "2025-03-20T06:42:43.416517Z" } }, "outputs": [ { "data": { "text/plain": [ "AHR,ARID1A,ARRDC3,ATL1,BAK1,BCL2L11,BCORL1,BPGM,CBARP,CBFA2T3,CBL,CDKN1A,CDKN1B,CDKN1C,CEBPA,CEBPB,CEBPE,CELF2,CITED1,CKS1B,CLDN6,CNN1,CNNM4,COL1A1,COL2A1,CSRNP1,DLX2,DUSP9,EGR1,ETS2,FEV,FOSB,FOXA1,FOXA3,FOXF1,FOXL2,FOXL2NB,FOXO4,GLB1L2,HES7,HK2,HNF4A,HOXA13,HOXB9,HOXC13,IER5L,IGDCC3,IKZF3,IRF1,ISL2,JUN,KIF18B,KIF2C,KLF1,KMT2A,LHX1,LYL1,MAML2,MAP2K3,MAP2K6,MAP3K21,MAP4K3,MAP4K5,MAP7D1,MAPK1,MEIS1,MIDEAS,MIDN,NCL,NIT1,OSR2,PLK4,POU3F2,PRDM1,PRTG,PTPN1,PTPN12,PTPN13,PTPN9,RHOXF2,RREB1,RUNX1T1,S1PR2,SAMD1,SET,SGK1,SLC38A2,SLC4A1,SLC6A9,SNAI1,SPI1,STIL,TBX2,TBX3,TGFBR2,TMSB4X,TP73,TSC22D1,UBASH3A,UBASH3B,ZBTB1,ZBTB10,ZBTB25,ZC3HAV1,ZNF318" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_targets = get_all_targets(adata, key=\"knockup\")\n", "all_targets" ] }, { "cell_type": "code", "execution_count": 10, "id": "ed389bd3-244c-4f10-9e36-02f303e02787", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:42:43.425465Z", "iopub.status.busy": "2025-03-20T06:42:43.425287Z", "iopub.status.idle": "2025-03-20T06:42:43.432780Z", "shell.execute_reply": "2025-03-20T06:42:43.432085Z", "shell.execute_reply.started": "2025-03-20T06:42:43.425449Z" } }, "outputs": [ { "data": { "text/plain": [ "1064" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata.var[\"selected\"] = adata.var[\"highly_variable\"] | adata.var_names.isin(all_targets)\n", "adata.var[\"selected\"].sum()" ] }, { "cell_type": "markdown", "id": "db7a11c1-df29-4ec7-900b-e19129465c6d", "metadata": {}, "source": [ "## Encode intervention regime\n", "\n", "CASCADE represents genetic perturbations as a cell-by-gene binary matrix,\n", "which can be encoded from the `adata.obs[\"knockup\"]` column using the\n", "[encode_regime](api/cascade.data.encode_regime.rst) function. The function\n", "stores the encoded regime matrix in a layer with user-specified name,\n", "here using the name `\"interv\"`." ] }, { "cell_type": "code", "execution_count": 11, "id": "1338e090-26ac-4612-a4e2-bfdbc3a89e0c", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:42:43.433591Z", "iopub.status.busy": "2025-03-20T06:42:43.433420Z", "iopub.status.idle": "2025-03-20T06:42:49.310483Z", "shell.execute_reply": "2025-03-20T06:42:49.309478Z", "shell.execute_reply.started": "2025-03-20T06:42:43.433575Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encode_regime(adata, \"interv\", key=\"knockup\")\n", "adata.layers[\"interv\"]" ] }, { "cell_type": "markdown", "id": "c3e62ab0-2224-4b14-a35f-a36425422f14", "metadata": {}, "source": [ "## Encode technical covariates\n", "\n", "To minimize the effect of technical confounding on the causal discovery process,\n", "it is recommended to add all possible confounding factors into a covariate matrix\n", "in `adata.obsm`.\n", "\n", "Here we will add the one-hot encoded batch label (`\"gemgroup\"`) and log-centered\n", "total counts as the covariate:" ] }, { "cell_type": "code", "execution_count": 12, "id": "ce792cd2-7914-4e20-8e26-ae404a57c114", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:42:49.311908Z", "iopub.status.busy": "2025-03-20T06:42:49.311640Z", "iopub.status.idle": "2025-03-20T06:42:49.329984Z", "shell.execute_reply": "2025-03-20T06:42:49.329234Z", "shell.execute_reply.started": "2025-03-20T06:42:49.311888Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[0., 1., 0., ..., 0., 0., 0.],\n", " [0., 0., 0., ..., 0., 1., 0.],\n", " [0., 0., 0., ..., 1., 0., 0.],\n", " ...,\n", " [0., 1., 0., ..., 0., 0., 0.],\n", " [0., 0., 1., ..., 0., 0., 0.],\n", " [0., 0., 0., ..., 1., 0., 0.]])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "batch = OneHotEncoder().fit_transform(adata.obs[[\"gemgroup\"]]).toarray()\n", "batch" ] }, { "cell_type": "code", "execution_count": 13, "id": "227867b9-eae8-4214-8832-b8a08ec62a40", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:42:49.332791Z", "iopub.status.busy": "2025-03-20T06:42:49.332590Z", "iopub.status.idle": "2025-03-20T06:42:49.343880Z", "shell.execute_reply": "2025-03-20T06:42:49.343145Z", "shell.execute_reply.started": "2025-03-20T06:42:49.332774Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[ 0.46084866],\n", " [-0.8991263 ],\n", " [-0.29681763],\n", " ...,\n", " [-2.092782 ],\n", " [ 0.5551557 ],\n", " [-1.4385142 ]], dtype=float32)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "log_ncounts = StandardScaler().fit_transform(np.log10(adata.obs[[\"ncounts\"]]))\n", "log_ncounts" ] }, { "cell_type": "code", "execution_count": 14, "id": "36257898-2bd9-4780-81a9-50aa187cb77d", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:42:49.344753Z", "iopub.status.busy": "2025-03-20T06:42:49.344574Z", "iopub.status.idle": "2025-03-20T06:42:49.352562Z", "shell.execute_reply": "2025-03-20T06:42:49.351834Z", "shell.execute_reply.started": "2025-03-20T06:42:49.344736Z" } }, "outputs": [ { "data": { "text/plain": [ "(86744, 9)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata.obsm[\"covariate\"] = np.concatenate([batch, log_ncounts], axis=1)\n", "adata.obsm[\"covariate\"].shape" ] }, { "cell_type": "markdown", "id": "e65cb81b-5ee7-4d33-a910-66bc94c740f1", "metadata": {}, "source": [ "## Data normalization\n", "\n", "Next, we follow the standard scRNA-seq preprocessing approach in `scanpy`\n", "to normalize the expression matrix in `adata.X`. You may visit its\n", "[documentation](https://scanpy.readthedocs.io/) if unfamiliar." ] }, { "cell_type": "code", "execution_count": 15, "id": "3f72c8be-86a7-4421-b54c-0a4f5b83abc5", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:42:49.353442Z", "iopub.status.busy": "2025-03-20T06:42:49.353263Z", "iopub.status.idle": "2025-03-20T06:42:51.120183Z", "shell.execute_reply": "2025-03-20T06:42:51.119546Z", "shell.execute_reply.started": "2025-03-20T06:42:49.353426Z" } }, "outputs": [], "source": [ "sc.pp.normalize_total(adata, target_sum=1e4)\n", "sc.pp.log1p(adata)" ] }, { "cell_type": "markdown", "id": "d522642c-a6e8-4b2e-97a4-64f69036994f", "metadata": {}, "source": [ "Now we can subset the dataset to retain the selected genes only:" ] }, { "cell_type": "code", "execution_count": 16, "id": "f0c9d506-534d-4265-ad8d-7e15e542f34f", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:42:51.121235Z", "iopub.status.busy": "2025-03-20T06:42:51.121014Z", "iopub.status.idle": "2025-03-20T06:42:53.418293Z", "shell.execute_reply": "2025-03-20T06:42:53.417389Z", "shell.execute_reply.started": "2025-03-20T06:42:51.121215Z" } }, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 86744 × 1064\n", " obs: 'guide_id', 'gemgroup', 'ncounts', 'knockup'\n", " var: 'perturbed', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'selected'\n", " uns: 'hvg', 'log1p'\n", " obsm: 'covariate'\n", " layers: 'counts', 'interv'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata = adata[:, adata.var[\"selected\"]].copy()\n", "adata" ] }, { "cell_type": "markdown", "id": "5f6e3999-5ac2-4ee0-a668-23be2332d8fc", "metadata": {}, "source": [ "## Neighbor-based imputation\n", "\n", "Given that scRNA-seq data can be sparse, we recommend conducting a lightweight\n", "neighbor-based data imputation before model training. This is done by aggregating\n", "similar cells in the PCA space with the same perturbation. We provide a utility\n", "function called [neighbor_impute](api/cascade.data.neighbor_impute.rst) for this\n", "purpose:" ] }, { "cell_type": "code", "execution_count": 17, "id": "51a28912-6c44-4269-bf6d-fc0b219e32ed", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:42:53.419681Z", "iopub.status.busy": "2025-03-20T06:42:53.419454Z", "iopub.status.idle": "2025-03-20T06:43:05.607917Z", "shell.execute_reply": "2025-03-20T06:43:05.605650Z", "shell.execute_reply.started": "2025-03-20T06:42:53.419662Z" } }, "outputs": [], "source": [ "sc.pp.pca(adata)" ] }, { "cell_type": "code", "execution_count": 18, "id": "90da17c9-ffd3-447f-b36f-af5f106d838f", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:43:05.609281Z", "iopub.status.busy": "2025-03-20T06:43:05.608966Z", "iopub.status.idle": "2025-03-20T06:43:12.538213Z", "shell.execute_reply": "2025-03-20T06:43:12.536996Z", "shell.execute_reply.started": "2025-03-20T06:43:05.609251Z" } }, "outputs": [], "source": [ "adata = neighbor_impute(\n", " adata,\n", " k=20,\n", " use_rep=\"X_pca\",\n", " use_batch=\"knockup\",\n", " X_agg=\"mean\",\n", " obs_agg={\"ncounts\": \"sum\"},\n", " obsm_agg={\"covariate\": \"mean\"},\n", " layers_agg={\"counts\": \"sum\"},\n", ")" ] }, { "cell_type": "markdown", "id": "fd4ad7c4-768a-41e5-8333-ff2ed73e0b8f", "metadata": {}, "source": [ "Note that we used the `\"sum\"` aggregation for `adata.obs[\"ncounts\"]` and\n", "`adata.layers[\"counts\"]`, which maintains their count-based nature.\n", "\n", "## Configure dataset\n", "\n", "Now we can use the function\n", "[configure_dataset](api/cascade.data.configure_dataset.rst) to tell CASCADE\n", "where the expression matrix, intervention regime, covariates and total counts\n", "are stored:" ] }, { "cell_type": "code", "execution_count": 19, "id": "db76c707-0b43-48ae-9e81-35950ff229ab", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:43:12.539881Z", "iopub.status.busy": "2025-03-20T06:43:12.539630Z", "iopub.status.idle": "2025-03-20T06:43:12.546887Z", "shell.execute_reply": "2025-03-20T06:43:12.546081Z", "shell.execute_reply.started": "2025-03-20T06:43:12.539860Z" } }, "outputs": [ { "data": { "text/plain": [ "{'regime': 'interv',\n", " 'covariate': 'covariate',\n", " 'size': 'ncounts',\n", " 'weight': None,\n", " 'layer': 'counts'}" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "configure_dataset(\n", " adata,\n", " use_regime=\"interv\",\n", " use_covariate=\"covariate\",\n", " use_size=\"ncounts\",\n", " use_layer=\"counts\",\n", ")\n", "get_configuration(adata)" ] }, { "cell_type": "markdown", "id": "dbf1a98e-41c0-410a-9562-4df607f88329", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Construct scaffold graph\n", "\n", "Next, we need to construct a scaffold graph to guide the causal discovery process.\n", "\n", "The following 4 pre-built human gene scaffolds are available for download:\n", "\n", "- KEGG pathways:\n", " http://ftp.cbi.pku.edu.cn/pub/cascade-download/inferred_kegg_gene_only.gml.gz\n", "- TF-target predictions:\n", " http://ftp.cbi.pku.edu.cn/pub/cascade-download/TF-target.gml.gz\n", "- BioGRID protein-protein interactions:\n", " http://ftp.cbi.pku.edu.cn/pub/cascade-download/biogrid.gml.gz\n", "- GTEx correlated genes:\n", " http://ftp.cbi.pku.edu.cn/pub/cascade-download/corr.gml.gz" ] }, { "cell_type": "code", "execution_count": 20, "id": "8f99dec1-31ef-4e35-9f22-31965a68f78a", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:43:12.547832Z", "iopub.status.busy": "2025-03-20T06:43:12.547643Z", "iopub.status.idle": "2025-03-20T06:46:52.653547Z", "shell.execute_reply": "2025-03-20T06:46:52.652535Z", "shell.execute_reply.started": "2025-03-20T06:43:12.547807Z" } }, "outputs": [], "source": [ "kegg = nx.read_gml(\"inferred_kegg_gene_only.gml.gz\")\n", "tf_target = nx.read_gml(\"TF-target.gml.gz\")\n", "biogrid = nx.read_gml(\"biogrid.gml.gz\")\n", "corr = nx.read_gml(\"corr.gml.gz\")" ] }, { "cell_type": "markdown", "id": "15505c92-483e-4bbc-ad0f-4a115108b6d7", "metadata": {}, "source": [ "The individual scaffold components can be assembled into a hybrid scaffold using\n", "the [assemble_scaffolds](api/cascade.graph.assemble_scaffolds.rst) function,\n", "which also marginalizes these components with regard to the genes being\n", "modeled here:" ] }, { "cell_type": "code", "execution_count": 21, "id": "740f5fae-7585-4320-800c-6eab81f71505", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:46:52.654962Z", "iopub.status.busy": "2025-03-20T06:46:52.654708Z", "iopub.status.idle": "2025-03-20T06:47:33.292982Z", "shell.execute_reply": "2025-03-20T06:47:33.292171Z", "shell.execute_reply.started": "2025-03-20T06:46:52.654942Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32m14:47:12.452\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[33m1470173\u001b[0m:\u001b[36mgraph\u001b[0m:\u001b[36mmarginalize\u001b[0m - \u001b[33m\u001b[1m187 nodes are missing from the input graph and will be ignored.\u001b[0m\n", "\u001b[32m14:47:23.102\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[33m1470173\u001b[0m:\u001b[36mgraph\u001b[0m:\u001b[36mmarginalize\u001b[0m - \u001b[33m\u001b[1m182 nodes are missing from the input graph and will be ignored.\u001b[0m\n", "\u001b[32m14:47:28.593\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[33m1470173\u001b[0m:\u001b[36mgraph\u001b[0m:\u001b[36mmarginalize\u001b[0m - \u001b[33m\u001b[1m189 nodes are missing from the input graph and will be ignored.\u001b[0m\n", "\u001b[32m14:47:29.730\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[33m1470173\u001b[0m:\u001b[36mgraph\u001b[0m:\u001b[36mmarginalize\u001b[0m - \u001b[33m\u001b[1m647 nodes are missing from the input graph and will be ignored.\u001b[0m\n" ] }, { "data": { "text/plain": [ "(1064, 32264)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scaffold = assemble_scaffolds(corr, biogrid, tf_target, kegg, nodes=adata.var_names)\n", "scaffold.number_of_nodes(), scaffold.number_of_edges()" ] }, { "cell_type": "markdown", "id": "1b4d0a11-55f6-4527-9f56-bc28960f96c9", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Prepare gene function embeddings\n", "\n", "Lastly, we also fetch relevant entries from gene embeddings pre-computed using\n", "LSI of their GO annotations, which can be downloaded here:\n", "\n", "- http://ftp.cbi.pku.edu.cn/pub/cascade-download/gene2gos_lsi.csv.gz\n", "\n", "This will serve as the input of the interventional latent variable in CASCADE:" ] }, { "cell_type": "code", "execution_count": 22, "id": "cc958939-aa3e-4832-b812-185f38815c77", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:47:33.294016Z", "iopub.status.busy": "2025-03-20T06:47:33.293806Z", "iopub.status.idle": "2025-03-20T06:47:33.595607Z", "shell.execute_reply": "2025-03-20T06:47:33.594771Z", "shell.execute_reply.started": "2025-03-20T06:47:33.293999Z" } }, "outputs": [ { "data": { "text/plain": [ "(866, 32)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "latent_emb = pd.read_csv(\"gene2gos_lsi.csv.gz\", index_col=0)\n", "latent_emb = latent_emb.reindex(adata.var_names).dropna()\n", "latent_emb.shape" ] }, { "cell_type": "markdown", "id": "6911f682-4dd9-4d90-bd76-dfe9f937f8f7", "metadata": {}, "source": [ "## Save processed data files\n", "\n", "Finally, save the preprocessed data files for use in [stage 2](training.ipynb)." ] }, { "cell_type": "code", "execution_count": 23, "id": "877d9e00-f5ac-4dd1-b0f0-2624e881db38", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:47:33.596665Z", "iopub.status.busy": "2025-03-20T06:47:33.596468Z", "iopub.status.idle": "2025-03-20T06:48:00.195888Z", "shell.execute_reply": "2025-03-20T06:48:00.194377Z", "shell.execute_reply.started": "2025-03-20T06:47:33.596647Z" } }, "outputs": [], "source": [ "adata.write(\"adata.h5ad\", compression=\"gzip\")" ] }, { "cell_type": "code", "execution_count": 24, "id": "2e2faaf6-30f8-40b0-991b-a63b5c3c2a01", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:48:00.199080Z", "iopub.status.busy": "2025-03-20T06:48:00.198048Z", "iopub.status.idle": "2025-03-20T06:48:01.675562Z", "shell.execute_reply": "2025-03-20T06:48:01.674696Z", "shell.execute_reply.started": "2025-03-20T06:48:00.199033Z" } }, "outputs": [], "source": [ "nx.write_gml(scaffold, \"scaffold.gml.gz\")" ] }, { "cell_type": "code", "execution_count": 25, "id": "23d8a848-3052-4405-8e65-d08de9b1838c", "metadata": { "execution": { "iopub.execute_input": "2025-03-20T06:48:01.676794Z", "iopub.status.busy": "2025-03-20T06:48:01.676588Z", "iopub.status.idle": "2025-03-20T06:48:01.878803Z", "shell.execute_reply": "2025-03-20T06:48:01.878172Z", "shell.execute_reply.started": "2025-03-20T06:48:01.676776Z" } }, "outputs": [], "source": [ "latent_emb.to_csv(\"latent_emb.csv.gz\")" ] }, { "cell_type": "markdown", "id": "c24ff9a2-3f58-49f0-83cd-6f146bbd189b", "metadata": {}, "source": [ "## Afterwords\n", "\n", "Described above is the minimal preprocessing for running CASCADE. Additional\n", "steps such as filtering non-perturbed cells using\n", "[mixscape](https://pertpy.readthedocs.io/en/latest/tutorials/notebooks/mixscape.html)\n", "may also be useful depending on the data at hand." ] } ], "metadata": { "jupytext": { "formats": "ipynb,py:percent" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }