{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "dff710cb-1074-4eda-87ac-26aa283a4f13",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-14T19:52:37.933712Z",
     "iopub.status.busy": "2025-03-14T19:52:37.932638Z",
     "iopub.status.idle": "2025-03-14T19:52:37.948501Z",
     "shell.execute_reply": "2025-03-14T19:52:37.947061Z",
     "shell.execute_reply.started": "2025-03-14T19:52:37.933660Z"
    }
   },
   "source": [
    "# Stage 1: Data preprocessing\n",
    "\n",
    "In this tutorial, we'll first walk through how to prepare the datasets for use in\n",
    "CASCADE, using the [Norman, et al. (2019)](https://doi.org/10.1126/science.aax4438)\n",
    "dataset as an example. This dataset contains single- and double-gene CRISPRa\n",
    "perturbations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "38a1ac22-f453-4c41-a786-633c9068539d",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:42:20.063503Z",
     "iopub.status.busy": "2025-03-20T06:42:20.062713Z",
     "iopub.status.idle": "2025-03-20T06:42:25.040388Z",
     "shell.execute_reply": "2025-03-20T06:42:25.039358Z",
     "shell.execute_reply.started": "2025-03-20T06:42:20.063457Z"
    }
   },
   "outputs": [],
   "source": [
    "import networkx as nx\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import scanpy as sc\n",
    "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
    "\n",
    "from cascade.data import (\n",
    "    configure_dataset,\n",
    "    encode_regime,\n",
    "    filter_unobserved_targets,\n",
    "    get_all_targets,\n",
    "    get_configuration,\n",
    "    neighbor_impute,\n",
    ")\n",
    "from cascade.graph import assemble_scaffolds"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2238cb0c-1d50-44f9-ac4f-a4a8fc342da7",
   "metadata": {},
   "source": [
    "## Read data\n",
    "\n",
    "First, we need to prepare the dataset into `AnnData` objects. See the\n",
    "[documentation](https://anndata.readthedocs.io/) for more details if you are\n",
    "unfamiliar, including how to construct `AnnData` objects from scratch, and how\n",
    "to read data in other formats (csv, mtx, loom, etc.) into `AnnData` objects.\n",
    "\n",
    "Here we just load existing `h5ad` files, which is the native file format for\n",
    "`AnnData`. The `h5ad` file used in this tutorial can be downloaded from here:\n",
    "\n",
    "- http://ftp.cbi.pku.edu.cn/pub/cascade-download/Norman-2019.h5ad"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "dc79dbab-5127-4cbc-9067-576e0e2e47df",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:42:25.041992Z",
     "iopub.status.busy": "2025-03-20T06:42:25.041486Z",
     "iopub.status.idle": "2025-03-20T06:42:36.865741Z",
     "shell.execute_reply": "2025-03-20T06:42:36.864853Z",
     "shell.execute_reply.started": "2025-03-20T06:42:25.041968Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AnnData object with n_obs × n_vars = 86744 × 22881\n",
       "    obs: 'guide_id', 'gemgroup', 'ncounts', 'knockup'\n",
       "    var: 'perturbed'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata = sc.read_h5ad(\"Norman-2019.h5ad\")\n",
    "adata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b64b112e-2bbe-47f5-84d5-f42ff867065e",
   "metadata": {},
   "source": [
    "## Data format requirements\n",
    "\n",
    "CASCADE requires the following data format:\n",
    "\n",
    "- Raw counts in `adata.X`;\n",
    "- Total count in `adata.obs`, which would be used when fitting data with\n",
    "  the negative binomial distribution;\n",
    "- HGNC gene symbols as `adata.var_names`;\n",
    "- Perturbation label in `adata.obs` that specifies which genes are perturbed\n",
    "  in each cell:\n",
    "  - For control cells with no perturbation, the value **MUST** be an empty\n",
    "    string `\"\"`;\n",
    "  - For cells with multiple perturbations, the perturbed genes should be\n",
    "    comma-separated, e.g., `\"CEBPB,KLF1\"`;\n",
    "  - Name of perturbed genes must match those in `adata.var_names`.\n",
    "\n",
    "In this case, we can verify that the expression matrix contains raw counts:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "5a2a0d5b-b18d-443a-a916-9b9709da3591",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:42:36.866987Z",
     "iopub.status.busy": "2025-03-20T06:42:36.866753Z",
     "iopub.status.idle": "2025-03-20T06:42:36.874179Z",
     "shell.execute_reply": "2025-03-20T06:42:36.873263Z",
     "shell.execute_reply.started": "2025-03-20T06:42:36.866967Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(<Compressed Sparse Row sparse matrix of dtype 'float32'\n",
       " \twith 268281595 stored elements and shape (86744, 22881)>,\n",
       " array([ 1.,  1.,  1., ..., 12.,  3., 16.], dtype=float32))"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata.X, adata.X.data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5b5be25-a512-42ae-a00e-40c9ee175dbd",
   "metadata": {},
   "source": [
    "The total counts are stored as `\"ncounts\"` in `adata.obs`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "e8258211-2b10-4120-a50e-baf923389a63",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:42:36.876137Z",
     "iopub.status.busy": "2025-03-20T06:42:36.875941Z",
     "iopub.status.idle": "2025-03-20T06:42:36.884305Z",
     "shell.execute_reply": "2025-03-20T06:42:36.883582Z",
     "shell.execute_reply.started": "2025-03-20T06:42:36.876119Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "TTGAACGAGACTCGGA      15097.0\n",
       "CGTTGGGGTGTTTGTG       8551.0\n",
       "GAACCTAAGTGTTAGA      10999.0\n",
       "CCTTCCCTCCGTCATC      38454.0\n",
       "TCCCGATGTCTCTTAT      21433.0\n",
       "                       ...   \n",
       "TTTCCTCGTACGCACC      11991.0\n",
       "TTTCCTCTCTTGCCGT      16561.0\n",
       "TTTGCGCAGTCATGCT       5192.0\n",
       "TTTGCGCCAGGACCCT      15704.0\n",
       "TTTGCGCTCTCGCATC-1     6825.0\n",
       "Name: ncounts, Length: 86744, dtype: float32"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata.obs[\"ncounts\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "da3edbb7-3b0c-4133-9f23-9f600456600d",
   "metadata": {},
   "source": [
    "And perturbation labels are stored as `\"knockup\"` in `adata.obs`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "79a87d1d-ba24-49a2-a055-ac75d8723d72",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:42:36.885130Z",
     "iopub.status.busy": "2025-03-20T06:42:36.884950Z",
     "iopub.status.idle": "2025-03-20T06:42:36.892925Z",
     "shell.execute_reply": "2025-03-20T06:42:36.892213Z",
     "shell.execute_reply.started": "2025-03-20T06:42:36.885113Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "TTGAACGAGACTCGGA            ARID1A\n",
       "CGTTGGGGTGTTTGTG            BCORL1\n",
       "GAACCTAAGTGTTAGA              FOSB\n",
       "CCTTCCCTCCGTCATC          KLF1,SET\n",
       "TCCCGATGTCTCTTAT         BAK1,KLF1\n",
       "                          ...     \n",
       "TTTCCTCGTACGCACC                  \n",
       "TTTCCTCTCTTGCCGT               HK2\n",
       "TTTGCGCAGTCATGCT            RHOXF2\n",
       "TTTGCGCCAGGACCCT      BAK1,BCL2L11\n",
       "TTTGCGCTCTCGCATC-1      CEBPB,OSR2\n",
       "Name: knockup, Length: 86744, dtype: category\n",
       "Categories (237, object): ['', 'AHR', 'AHR,FEV', 'AHR,KLF1', ..., 'ZBTB10', 'ZBTB25', 'ZC3HAV1', 'ZNF318']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata.obs[\"knockup\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3babc0b8-99df-4452-903d-6ded8addcea8",
   "metadata": {},
   "source": [
    "Before any further processing, we back up the raw UMI counts in a layer called\n",
    "`“counts”`, which will be used later during model training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "87098765-026a-4484-9cd7-0ea0c6360e52",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:42:36.893717Z",
     "iopub.status.busy": "2025-03-20T06:42:36.893540Z",
     "iopub.status.idle": "2025-03-20T06:42:37.992858Z",
     "shell.execute_reply": "2025-03-20T06:42:37.991798Z",
     "shell.execute_reply.started": "2025-03-20T06:42:36.893700Z"
    }
   },
   "outputs": [],
   "source": [
    "adata.layers[\"counts\"] = adata.X.copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "897ee91f-d3e6-43f4-a860-b66c9cd59ed8",
   "metadata": {},
   "source": [
    "## Cell and gene selection\n",
    "\n",
    "Since CASCADE can only model perturbations in measured genes, we first filter out\n",
    "any perturbation that was missing from the readout. A utility function called\n",
    "[filter_unobserved_targets](api/cascade.data.filter_unobserved_targets.rst) is\n",
    "provided for this purpose.\n",
    "\n",
    "In this case no cell was filtered:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "0580b413-657c-4ae7-aada-16bf589b1b15",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:42:37.993687Z",
     "iopub.status.busy": "2025-03-20T06:42:37.993485Z",
     "iopub.status.idle": "2025-03-20T06:42:38.233906Z",
     "shell.execute_reply": "2025-03-20T06:42:38.233128Z",
     "shell.execute_reply.started": "2025-03-20T06:42:37.993668Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "View of AnnData object with n_obs × n_vars = 86744 × 22881\n",
       "    obs: 'guide_id', 'gemgroup', 'ncounts', 'knockup'\n",
       "    var: 'perturbed'\n",
       "    layers: 'counts'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "filter_unobserved_targets(adata, \"knockup\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26504de2-bd59-4855-b73b-1b9fecff48a4",
   "metadata": {},
   "source": [
    "Next, we identify highly variable genes using the `\"seurat_v3\"` method,\n",
    "to allow the model to focus on informative genes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "99819c64-421e-4e95-82c1-3c3231ec7c77",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:42:38.234825Z",
     "iopub.status.busy": "2025-03-20T06:42:38.234608Z",
     "iopub.status.idle": "2025-03-20T06:42:43.415170Z",
     "shell.execute_reply": "2025-03-20T06:42:43.414125Z",
     "shell.execute_reply.started": "2025-03-20T06:42:38.234796Z"
    }
   },
   "outputs": [],
   "source": [
    "sc.pp.highly_variable_genes(adata, n_top_genes=1000, flavor=\"seurat_v3\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e1d022cc-4cf1-4141-8f74-c8f9a88915c3",
   "metadata": {},
   "source": [
    "Again, as CASCADE can only model perturbations in measured genes, we expand this\n",
    "highly variable gene set to incorporating all perturbed genes (via a utility\n",
    "function [get_all_targets](api/cascade.data.get_all_targets.rst)) to avoid\n",
    "discarding useful perturbation information:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "6f75a5a3-381e-4618-b246-5bb2ad90d76a",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:42:43.416536Z",
     "iopub.status.busy": "2025-03-20T06:42:43.416307Z",
     "iopub.status.idle": "2025-03-20T06:42:43.424544Z",
     "shell.execute_reply": "2025-03-20T06:42:43.423834Z",
     "shell.execute_reply.started": "2025-03-20T06:42:43.416517Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AHR,ARID1A,ARRDC3,ATL1,BAK1,BCL2L11,BCORL1,BPGM,CBARP,CBFA2T3,CBL,CDKN1A,CDKN1B,CDKN1C,CEBPA,CEBPB,CEBPE,CELF2,CITED1,CKS1B,CLDN6,CNN1,CNNM4,COL1A1,COL2A1,CSRNP1,DLX2,DUSP9,EGR1,ETS2,FEV,FOSB,FOXA1,FOXA3,FOXF1,FOXL2,FOXL2NB,FOXO4,GLB1L2,HES7,HK2,HNF4A,HOXA13,HOXB9,HOXC13,IER5L,IGDCC3,IKZF3,IRF1,ISL2,JUN,KIF18B,KIF2C,KLF1,KMT2A,LHX1,LYL1,MAML2,MAP2K3,MAP2K6,MAP3K21,MAP4K3,MAP4K5,MAP7D1,MAPK1,MEIS1,MIDEAS,MIDN,NCL,NIT1,OSR2,PLK4,POU3F2,PRDM1,PRTG,PTPN1,PTPN12,PTPN13,PTPN9,RHOXF2,RREB1,RUNX1T1,S1PR2,SAMD1,SET,SGK1,SLC38A2,SLC4A1,SLC6A9,SNAI1,SPI1,STIL,TBX2,TBX3,TGFBR2,TMSB4X,TP73,TSC22D1,UBASH3A,UBASH3B,ZBTB1,ZBTB10,ZBTB25,ZC3HAV1,ZNF318"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_targets = get_all_targets(adata, key=\"knockup\")\n",
    "all_targets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "ed389bd3-244c-4f10-9e36-02f303e02787",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:42:43.425465Z",
     "iopub.status.busy": "2025-03-20T06:42:43.425287Z",
     "iopub.status.idle": "2025-03-20T06:42:43.432780Z",
     "shell.execute_reply": "2025-03-20T06:42:43.432085Z",
     "shell.execute_reply.started": "2025-03-20T06:42:43.425449Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1064"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata.var[\"selected\"] = adata.var[\"highly_variable\"] | adata.var_names.isin(all_targets)\n",
    "adata.var[\"selected\"].sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db7a11c1-df29-4ec7-900b-e19129465c6d",
   "metadata": {},
   "source": [
    "## Encode intervention regime\n",
    "\n",
    "CASCADE represents genetic perturbations as a cell-by-gene binary matrix,\n",
    "which can be encoded from the `adata.obs[\"knockup\"]` column using the\n",
    "[encode_regime](api/cascade.data.encode_regime.rst) function. The function\n",
    "stores the encoded regime matrix in a layer with user-specified name,\n",
    "here using the name `\"interv\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "1338e090-26ac-4612-a4e2-bfdbc3a89e0c",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:42:43.433591Z",
     "iopub.status.busy": "2025-03-20T06:42:43.433420Z",
     "iopub.status.idle": "2025-03-20T06:42:49.310483Z",
     "shell.execute_reply": "2025-03-20T06:42:49.309478Z",
     "shell.execute_reply.started": "2025-03-20T06:42:43.433575Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Compressed Sparse Row sparse matrix of dtype 'bool'\n",
       "\twith 108992 stored elements and shape (86744, 22881)>"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "encode_regime(adata, \"interv\", key=\"knockup\")\n",
    "adata.layers[\"interv\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3e62ab0-2224-4b14-a35f-a36425422f14",
   "metadata": {},
   "source": [
    "## Encode technical covariates\n",
    "\n",
    "To minimize the effect of technical confounding on the causal discovery process,\n",
    "it is recommended to add all possible confounding factors into a covariate matrix\n",
    "in `adata.obsm`.\n",
    "\n",
    "Here we will add the one-hot encoded batch label (`\"gemgroup\"`) and log-centered\n",
    "total counts as the covariate:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "ce792cd2-7914-4e20-8e26-ae404a57c114",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:42:49.311908Z",
     "iopub.status.busy": "2025-03-20T06:42:49.311640Z",
     "iopub.status.idle": "2025-03-20T06:42:49.329984Z",
     "shell.execute_reply": "2025-03-20T06:42:49.329234Z",
     "shell.execute_reply.started": "2025-03-20T06:42:49.311888Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0., 1., 0., ..., 0., 0., 0.],\n",
       "       [0., 0., 0., ..., 0., 1., 0.],\n",
       "       [0., 0., 0., ..., 1., 0., 0.],\n",
       "       ...,\n",
       "       [0., 1., 0., ..., 0., 0., 0.],\n",
       "       [0., 0., 1., ..., 0., 0., 0.],\n",
       "       [0., 0., 0., ..., 1., 0., 0.]])"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "batch = OneHotEncoder().fit_transform(adata.obs[[\"gemgroup\"]]).toarray()\n",
    "batch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "227867b9-eae8-4214-8832-b8a08ec62a40",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:42:49.332791Z",
     "iopub.status.busy": "2025-03-20T06:42:49.332590Z",
     "iopub.status.idle": "2025-03-20T06:42:49.343880Z",
     "shell.execute_reply": "2025-03-20T06:42:49.343145Z",
     "shell.execute_reply.started": "2025-03-20T06:42:49.332774Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.46084866],\n",
       "       [-0.8991263 ],\n",
       "       [-0.29681763],\n",
       "       ...,\n",
       "       [-2.092782  ],\n",
       "       [ 0.5551557 ],\n",
       "       [-1.4385142 ]], dtype=float32)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "log_ncounts = StandardScaler().fit_transform(np.log10(adata.obs[[\"ncounts\"]]))\n",
    "log_ncounts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "36257898-2bd9-4780-81a9-50aa187cb77d",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:42:49.344753Z",
     "iopub.status.busy": "2025-03-20T06:42:49.344574Z",
     "iopub.status.idle": "2025-03-20T06:42:49.352562Z",
     "shell.execute_reply": "2025-03-20T06:42:49.351834Z",
     "shell.execute_reply.started": "2025-03-20T06:42:49.344736Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(86744, 9)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata.obsm[\"covariate\"] = np.concatenate([batch, log_ncounts], axis=1)\n",
    "adata.obsm[\"covariate\"].shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e65cb81b-5ee7-4d33-a910-66bc94c740f1",
   "metadata": {},
   "source": [
    "## Data normalization\n",
    "\n",
    "Next, we follow the standard scRNA-seq preprocessing approach in `scanpy`\n",
    "to normalize the expression matrix in `adata.X`. You may visit its\n",
    "[documentation](https://scanpy.readthedocs.io/) if unfamiliar."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "3f72c8be-86a7-4421-b54c-0a4f5b83abc5",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:42:49.353442Z",
     "iopub.status.busy": "2025-03-20T06:42:49.353263Z",
     "iopub.status.idle": "2025-03-20T06:42:51.120183Z",
     "shell.execute_reply": "2025-03-20T06:42:51.119546Z",
     "shell.execute_reply.started": "2025-03-20T06:42:49.353426Z"
    }
   },
   "outputs": [],
   "source": [
    "sc.pp.normalize_total(adata, target_sum=1e4)\n",
    "sc.pp.log1p(adata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d522642c-a6e8-4b2e-97a4-64f69036994f",
   "metadata": {},
   "source": [
    "Now we can subset the dataset to retain the selected genes only:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "f0c9d506-534d-4265-ad8d-7e15e542f34f",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:42:51.121235Z",
     "iopub.status.busy": "2025-03-20T06:42:51.121014Z",
     "iopub.status.idle": "2025-03-20T06:42:53.418293Z",
     "shell.execute_reply": "2025-03-20T06:42:53.417389Z",
     "shell.execute_reply.started": "2025-03-20T06:42:51.121215Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AnnData object with n_obs × n_vars = 86744 × 1064\n",
       "    obs: 'guide_id', 'gemgroup', 'ncounts', 'knockup'\n",
       "    var: 'perturbed', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'selected'\n",
       "    uns: 'hvg', 'log1p'\n",
       "    obsm: 'covariate'\n",
       "    layers: 'counts', 'interv'"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata = adata[:, adata.var[\"selected\"]].copy()\n",
    "adata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f6e3999-5ac2-4ee0-a668-23be2332d8fc",
   "metadata": {},
   "source": [
    "## Neighbor-based imputation\n",
    "\n",
    "Given that scRNA-seq data can be sparse, we recommend conducting a lightweight\n",
    "neighbor-based data imputation before model training. This is done by aggregating\n",
    "similar cells in the PCA space with the same perturbation. We provide a utility\n",
    "function called [neighbor_impute](api/cascade.data.neighbor_impute.rst) for this\n",
    "purpose:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "51a28912-6c44-4269-bf6d-fc0b219e32ed",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:42:53.419681Z",
     "iopub.status.busy": "2025-03-20T06:42:53.419454Z",
     "iopub.status.idle": "2025-03-20T06:43:05.607917Z",
     "shell.execute_reply": "2025-03-20T06:43:05.605650Z",
     "shell.execute_reply.started": "2025-03-20T06:42:53.419662Z"
    }
   },
   "outputs": [],
   "source": [
    "sc.pp.pca(adata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "90da17c9-ffd3-447f-b36f-af5f106d838f",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:43:05.609281Z",
     "iopub.status.busy": "2025-03-20T06:43:05.608966Z",
     "iopub.status.idle": "2025-03-20T06:43:12.538213Z",
     "shell.execute_reply": "2025-03-20T06:43:12.536996Z",
     "shell.execute_reply.started": "2025-03-20T06:43:05.609251Z"
    }
   },
   "outputs": [],
   "source": [
    "adata = neighbor_impute(\n",
    "    adata,\n",
    "    k=20,\n",
    "    use_rep=\"X_pca\",\n",
    "    use_batch=\"knockup\",\n",
    "    X_agg=\"mean\",\n",
    "    obs_agg={\"ncounts\": \"sum\"},\n",
    "    obsm_agg={\"covariate\": \"mean\"},\n",
    "    layers_agg={\"counts\": \"sum\"},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd4ad7c4-768a-41e5-8333-ff2ed73e0b8f",
   "metadata": {},
   "source": [
    "Note that we used the `\"sum\"` aggregation for `adata.obs[\"ncounts\"]` and\n",
    "`adata.layers[\"counts\"]`, which maintains their count-based nature.\n",
    "\n",
    "## Configure dataset\n",
    "\n",
    "Now we can use the function\n",
    "[configure_dataset](api/cascade.data.configure_dataset.rst) to tell CASCADE\n",
    "where the expression matrix, intervention regime, covariates and total counts\n",
    "are stored:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "db76c707-0b43-48ae-9e81-35950ff229ab",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:43:12.539881Z",
     "iopub.status.busy": "2025-03-20T06:43:12.539630Z",
     "iopub.status.idle": "2025-03-20T06:43:12.546887Z",
     "shell.execute_reply": "2025-03-20T06:43:12.546081Z",
     "shell.execute_reply.started": "2025-03-20T06:43:12.539860Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'regime': 'interv',\n",
       " 'covariate': 'covariate',\n",
       " 'size': 'ncounts',\n",
       " 'weight': None,\n",
       " 'layer': 'counts'}"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "configure_dataset(\n",
    "    adata,\n",
    "    use_regime=\"interv\",\n",
    "    use_covariate=\"covariate\",\n",
    "    use_size=\"ncounts\",\n",
    "    use_layer=\"counts\",\n",
    ")\n",
    "get_configuration(adata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dbf1a98e-41c0-410a-9562-4df607f88329",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "## Construct scaffold graph\n",
    "\n",
    "Next, we need to construct a scaffold graph to guide the causal discovery process.\n",
    "\n",
    "The following 4 pre-built human gene scaffolds are available for download:\n",
    "\n",
    "- KEGG pathways:\n",
    "  http://ftp.cbi.pku.edu.cn/pub/cascade-download/inferred_kegg_gene_only.gml.gz\n",
    "- TF-target predictions:\n",
    "  http://ftp.cbi.pku.edu.cn/pub/cascade-download/TF-target.gml.gz\n",
    "- BioGRID protein-protein interactions:\n",
    "  http://ftp.cbi.pku.edu.cn/pub/cascade-download/biogrid.gml.gz\n",
    "- GTEx correlated genes:\n",
    "  http://ftp.cbi.pku.edu.cn/pub/cascade-download/corr.gml.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "8f99dec1-31ef-4e35-9f22-31965a68f78a",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:43:12.547832Z",
     "iopub.status.busy": "2025-03-20T06:43:12.547643Z",
     "iopub.status.idle": "2025-03-20T06:46:52.653547Z",
     "shell.execute_reply": "2025-03-20T06:46:52.652535Z",
     "shell.execute_reply.started": "2025-03-20T06:43:12.547807Z"
    }
   },
   "outputs": [],
   "source": [
    "kegg = nx.read_gml(\"inferred_kegg_gene_only.gml.gz\")\n",
    "tf_target = nx.read_gml(\"TF-target.gml.gz\")\n",
    "biogrid = nx.read_gml(\"biogrid.gml.gz\")\n",
    "corr = nx.read_gml(\"corr.gml.gz\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15505c92-483e-4bbc-ad0f-4a115108b6d7",
   "metadata": {},
   "source": [
    "The individual scaffold components can be assembled into a hybrid scaffold using\n",
    "the [assemble_scaffolds](api/cascade.graph.assemble_scaffolds.rst) function,\n",
    "which also marginalizes these components with regard to the genes being\n",
    "modeled here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "740f5fae-7585-4320-800c-6eab81f71505",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:46:52.654962Z",
     "iopub.status.busy": "2025-03-20T06:46:52.654708Z",
     "iopub.status.idle": "2025-03-20T06:47:33.292982Z",
     "shell.execute_reply": "2025-03-20T06:47:33.292171Z",
     "shell.execute_reply.started": "2025-03-20T06:46:52.654942Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32m14:47:12.452\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[33m1470173\u001b[0m:\u001b[36mgraph\u001b[0m:\u001b[36mmarginalize\u001b[0m - \u001b[33m\u001b[1m187 nodes are missing from the input graph and will be ignored.\u001b[0m\n",
      "\u001b[32m14:47:23.102\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[33m1470173\u001b[0m:\u001b[36mgraph\u001b[0m:\u001b[36mmarginalize\u001b[0m - \u001b[33m\u001b[1m182 nodes are missing from the input graph and will be ignored.\u001b[0m\n",
      "\u001b[32m14:47:28.593\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[33m1470173\u001b[0m:\u001b[36mgraph\u001b[0m:\u001b[36mmarginalize\u001b[0m - \u001b[33m\u001b[1m189 nodes are missing from the input graph and will be ignored.\u001b[0m\n",
      "\u001b[32m14:47:29.730\u001b[0m | \u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[33m1470173\u001b[0m:\u001b[36mgraph\u001b[0m:\u001b[36mmarginalize\u001b[0m - \u001b[33m\u001b[1m647 nodes are missing from the input graph and will be ignored.\u001b[0m\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(1064, 32264)"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scaffold = assemble_scaffolds(corr, biogrid, tf_target, kegg, nodes=adata.var_names)\n",
    "scaffold.number_of_nodes(), scaffold.number_of_edges()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b4d0a11-55f6-4527-9f56-bc28960f96c9",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "## Prepare gene function embeddings\n",
    "\n",
    "Lastly, we also fetch relevant entries from gene embeddings pre-computed using\n",
    "LSI of their GO annotations, which can be downloaded here:\n",
    "\n",
    "- http://ftp.cbi.pku.edu.cn/pub/cascade-download/gene2gos_lsi.csv.gz\n",
    "\n",
    "This will serve as the input of the interventional latent variable in CASCADE:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "cc958939-aa3e-4832-b812-185f38815c77",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:47:33.294016Z",
     "iopub.status.busy": "2025-03-20T06:47:33.293806Z",
     "iopub.status.idle": "2025-03-20T06:47:33.595607Z",
     "shell.execute_reply": "2025-03-20T06:47:33.594771Z",
     "shell.execute_reply.started": "2025-03-20T06:47:33.293999Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(866, 32)"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "latent_emb = pd.read_csv(\"gene2gos_lsi.csv.gz\", index_col=0)\n",
    "latent_emb = latent_emb.reindex(adata.var_names).dropna()\n",
    "latent_emb.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6911f682-4dd9-4d90-bd76-dfe9f937f8f7",
   "metadata": {},
   "source": [
    "## Save processed data files\n",
    "\n",
    "Finally, save the preprocessed data files for use in [stage 2](training.ipynb)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "877d9e00-f5ac-4dd1-b0f0-2624e881db38",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:47:33.596665Z",
     "iopub.status.busy": "2025-03-20T06:47:33.596468Z",
     "iopub.status.idle": "2025-03-20T06:48:00.195888Z",
     "shell.execute_reply": "2025-03-20T06:48:00.194377Z",
     "shell.execute_reply.started": "2025-03-20T06:47:33.596647Z"
    }
   },
   "outputs": [],
   "source": [
    "adata.write(\"adata.h5ad\", compression=\"gzip\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "2e2faaf6-30f8-40b0-991b-a63b5c3c2a01",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:48:00.199080Z",
     "iopub.status.busy": "2025-03-20T06:48:00.198048Z",
     "iopub.status.idle": "2025-03-20T06:48:01.675562Z",
     "shell.execute_reply": "2025-03-20T06:48:01.674696Z",
     "shell.execute_reply.started": "2025-03-20T06:48:00.199033Z"
    }
   },
   "outputs": [],
   "source": [
    "nx.write_gml(scaffold, \"scaffold.gml.gz\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "23d8a848-3052-4405-8e65-d08de9b1838c",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-20T06:48:01.676794Z",
     "iopub.status.busy": "2025-03-20T06:48:01.676588Z",
     "iopub.status.idle": "2025-03-20T06:48:01.878803Z",
     "shell.execute_reply": "2025-03-20T06:48:01.878172Z",
     "shell.execute_reply.started": "2025-03-20T06:48:01.676776Z"
    }
   },
   "outputs": [],
   "source": [
    "latent_emb.to_csv(\"latent_emb.csv.gz\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c24ff9a2-3f58-49f0-83cd-6f146bbd189b",
   "metadata": {},
   "source": [
    "## Afterwords\n",
    "\n",
    "Described above is the minimal preprocessing for running CASCADE. Additional\n",
    "steps such as filtering non-perturbed cells using\n",
    "[mixscape](https://pertpy.readthedocs.io/en/latest/tutorials/notebooks/mixscape.html)\n",
    "may also be useful depending on the data at hand."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "formats": "ipynb,py:percent"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}