Using pycisTopic on human cerebellum single-cell multiome data

[1]:
import os
os.chdir("/staging/leuven/stg_00002/lcb/sdewin/PhD/python_modules/pycisTopic_polars_tutorial")
[2]:
import pycisTopic
pycisTopic.__version__
[2]:
'2.0a0'

Downloading data

The data used for this tutorial is freely accessibile and can be downloaded from here, processed loom files can be found on Scope data can be explored on UCSC genome browser as well.

[5]:
!mkdir -p data
!wget -O data/fragments.tsv.gz https://cf.10xgenomics.com/samples/cell-arc/1.0.0/human_brain_3k/human_brain_3k_atac_fragments.tsv.gz
!wget -O data/fragments.tsv.gz.tbi https://cf.10xgenomics.com/samples/cell-arc/1.0.0/human_brain_3k/human_brain_3k_atac_fragments.tsv.gz.tbi
!wget -O data/cell_data.tsv https://raw.githubusercontent.com/aertslab/pycisTopic/polars/data/cell_data_human_cerebellum.tsv
--2024-02-29 17:02:20--  https://cf.10xgenomics.com/samples/cell-arc/1.0.0/human_brain_3k/human_brain_3k_atac_fragments.tsv.gz
Resolving cf.10xgenomics.com (cf.10xgenomics.com)... 2606:4700::6812:ad, 2606:4700::6812:1ad, 104.18.1.173, ...
Connecting to cf.10xgenomics.com (cf.10xgenomics.com)|2606:4700::6812:ad|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647677988 (1.5G) [text/tab-separated-values]
Saving to: ‘data/fragments.tsv.gz’

data/fragments.tsv. 100%[===================>]   1.53G   134MB/s    in 9.6s

2024-02-29 17:02:30 (164 MB/s) - ‘data/fragments.tsv.gz’ saved [1647677988/1647677988]

--2024-02-29 17:02:30--  https://cf.10xgenomics.com/samples/cell-arc/1.0.0/human_brain_3k/human_brain_3k_atac_fragments.tsv.gz.tbi
Resolving cf.10xgenomics.com (cf.10xgenomics.com)... 2606:4700::6812:1ad, 2606:4700::6812:ad, 104.18.1.173, ...
Connecting to cf.10xgenomics.com (cf.10xgenomics.com)|2606:4700::6812:1ad|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 944619 (922K) [binary/octet-stream]
Saving to: ‘data/fragments.tsv.gz.tbi’

data/fragments.tsv. 100%[===================>] 922.48K  --.-KB/s    in 0.02s

2024-02-29 17:02:30 (37.7 MB/s) - ‘data/fragments.tsv.gz.tbi’ saved [944619/944619]

--2024-02-29 17:02:30--  https://raw.githubusercontent.com/aertslab/pycisTopic/polars/data/cell_data_human_cerebellum.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8000::154, 2606:50c0:8002::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 309204 (302K) [text/plain]
Saving to: ‘data/cell_data.tsv’

data/cell_data.tsv  100%[===================>] 301.96K  --.-KB/s    in 0.02s

2024-02-29 17:02:31 (17.6 MB/s) - ‘data/cell_data.tsv’ saved [309204/309204]

Set up

Let’s make some directories to store the output of pycisTopic.

[3]:
import os
out_dir = "outs"
os.makedirs(out_dir, exist_ok = True)

Define a dictionary mapping sample ids ("10x_multiome_brain") to fragment files ("data/fragments.tsv.gz").

Multiple entries can be added to this dictionary in case you have multiple samples.

Note:

pycisTopic will automatically append sample ids to barcodes to avoid barcode collisions between samples!

[4]:
fragments_dict = {
    "10x_multiome_brain": "data/fragments.tsv.gz"
}

Getting pseudobulk profiles from cell annotations

In this tutorial we assume we are analyzing the scATAC-seq data from a multiome dataset, which allows to easily get the cell annotations from the scRNA-seq analysis. In case of independent scATAC-seq data, the cell annotation can also be obtained from alternative methods, such as unnanotated/preliminary clustering analysis (using predefined regions, for example SCREEN for mouse and human). In the later case, you can skip this section and use bulk regions as input to the QC step.

First we read the barcode-to-cell type annotation as a pd.DataFrame.

The index of the dataframe should correspond to the barcode column in the fragments.tsv.gz file (the sample-id can be optionally appended to the barcode using a split_pattern in this case "-") and must contain a column containing the sample id. Those sample ids should match with the sample ids defined in the fragments dictionary defined above.

Alternatively, a column name "barcode" can be added the metadata containing cell barcodes (without any suffixes!). In this case the index of the dataframe will not be used.

[5]:
import pandas as pd
cell_data = pd.read_table("data/cell_data.tsv", index_col = 0)
cell_data.head()
[5]:
VSN_cell_type VSN_leiden_res0.3 VSN_leiden_res0.6 VSN_leiden_res0.9 VSN_leiden_res1.2 VSN_sample_id Seurat_leiden_res0.6 Seurat_leiden_res1.2 Seurat_cell_type
AAACAGCCATTATGCG-1-10x_multiome_brain MOL_B MOL_B (0) MOL_B_1 (0) MOL_B_1 (1) MOL_B_3 (6) 10x_multiome_brain NFOL (1) MOL (1) MOL
AAACCAACATAGACCC-1-10x_multiome_brain MOL_B MOL_B (0) MOL_B_1 (0) MOL_B_3 (5) MOL_B_4 (4) 10x_multiome_brain NFOL (1) NFOL (3) NFOL
AAACCGAAGATGCCTG-1-10x_multiome_brain INH_VIP INH_VIP (6) INH_VIP (8) INH_VIP (8) INH_VIP (10) 10x_multiome_brain INH_VIP (7) INH_VIP (6) INH_VIP
AAACCGAAGTTAGCTA-1-10x_multiome_brain MOL_A MOL_A (1) MOL_A_2 (1) MOL_A_1 (0) MOL_A_2 (0) 10x_multiome_brain NFOL (1) NFOL (3) NFOL
AAACCGCGTTAGCCAA-1-10x_multiome_brain MGL MGL (7) MGL (10) MGL (10) MGL (12) 10x_multiome_brain MGL (8) MGL (9) MGL

We will also need the size of each chromosome, this can be downloaded from the UCSC databases.

[6]:
chromsizes = pd.read_table(
    "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes",
    header = None,
    names = ["Chromosome", "End"]
)
chromsizes.insert(1, "Start", 0)
chromsizes.head()
[6]:
Chromosome Start End
0 chr1 0 248956422
1 chr2 0 242193529
2 chr3 0 198295559
3 chr4 0 190214555
4 chr5 0 181538259

Now we a ready to generate pseudobulk ATAC-seq profiles.

This function will generate a fragments.tsv.gz and a bigwig file for each cell type defined by the variable parameter.

For each cell type:

  • The fragments.tsv.gz contains all fragments for barcodes of that cell type.

  • A bigwig file will be generated for each fragments.tsv.gz file, this you can visualize in any genome browser.

[8]:
from pycisTopic.pseudobulk_peak_calling import export_pseudobulk
os.makedirs(os.path.join(out_dir, "consensus_peak_calling"), exist_ok = True)
os.makedirs(os.path.join(out_dir, "consensus_peak_calling/pseudobulk_bed_files"), exist_ok = True)
os.makedirs(os.path.join(out_dir, "consensus_peak_calling/pseudobulk_bw_files"), exist_ok = True)


bw_paths, bed_paths = export_pseudobulk(
    input_data = cell_data,
    variable = "VSN_cell_type",
    sample_id_col = "VSN_sample_id",
    chromsizes = chromsizes,
    bed_path = os.path.join(out_dir, "consensus_peak_calling/pseudobulk_bed_files"),
    bigwig_path = os.path.join(out_dir, "consensus_peak_calling/pseudobulk_bw_files"),
    path_to_fragments = fragments_dict,
    n_cpu = 10,
    normalize_bigwig = True,
    temp_dir = "/tmp",
    split_pattern = "-"
)
2024-03-05 13:59:29,358 cisTopic     INFO     Splitting fragments by cell type.
2024-03-05 14:00:46,456 cisTopic     INFO     generating bigwig files
/lustre1/project/stg_00002/mambaforge/vsc33053/envs/pycisTopic_polars_tutorial/lib/python3.11/site-packages/joblib/externals/loky/process_executor.py:752: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  warnings.warn(

We will need the paths to the bed files later on, so let’s save them to disk.

[9]:
with open(os.path.join(out_dir, "consensus_peak_calling/bw_paths.tsv"), "wt") as f:
    for v in bw_paths:
        _ = f.write(f"{v}\t{bw_paths[v]}\n")
[10]:
with open(os.path.join(out_dir, "consensus_peak_calling/bed_paths.tsv"), "wt") as f:
    for v in bed_paths:
        _ = f.write(f"{v}\t{bed_paths[v]}\n")

Inferring consensus peaks

Next we will use MACS to call peaks for each pseudobulk fragments.tsv.gz file.

[11]:
bw_paths = {}
with open(os.path.join(out_dir, "consensus_peak_calling/bw_paths.tsv")) as f:
    for line in f:
        v, p = line.strip().split("\t")
        bw_paths.update({v: p})
[12]:
bed_paths = {}
with open(os.path.join(out_dir, "consensus_peak_calling/bed_paths.tsv")) as f:
    for line in f:
        v, p = line.strip().split("\t")
        bed_paths.update({v: p})
[14]:
from pycisTopic.pseudobulk_peak_calling import peak_calling
macs_path = "macs2"

os.makedirs(os.path.join(out_dir, "consensus_peak_calling/MACS"), exist_ok = True)

narrow_peak_dict = peak_calling(
    macs_path = macs_path,
    bed_paths = bed_paths,
    outdir = os.path.join(os.path.join(out_dir, "consensus_peak_calling/MACS")),
    genome_size = 'hs',
    n_cpu = 10,
    input_format = 'BEDPE',
    shift = 73,
    ext_size = 146,
    keep_dup = 'all',
    q_value = 0.05,
    _temp_dir = '/scratch/leuven/330/vsc33053/ray_spill'
)
2024-02-29 18:05:49,348 INFO worker.py:1724 -- Started a local Ray instance.
(macs_call_peak_ray pid=1232492) 2024-02-29 18:05:51,892 cisTopic     INFO     Calling peaks for AST_CER with macs2 callpeak --treatment outs/consensus_peak_calling/pseudobulk_bed_files/AST_CER.fragments.tsv.gz --name AST_CER  --outdir outs/consensus_peak_calling/MACS --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=1232485) 2024-02-29 18:05:51,919 cisTopic     INFO     Calling peaks for GC with macs2 callpeak --treatment outs/consensus_peak_calling/pseudobulk_bed_files/GC.fragments.tsv.gz --name GC  --outdir outs/consensus_peak_calling/MACS --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=1232488) 2024-02-29 18:05:51,975 cisTopic     INFO     Calling peaks for OPC with macs2 callpeak --treatment outs/consensus_peak_calling/pseudobulk_bed_files/OPC.fragments.tsv.gz --name OPC  --outdir outs/consensus_peak_calling/MACS --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=1232483) 2024-02-29 18:05:51,936 cisTopic     INFO     Calling peaks for ASTP with macs2 callpeak --treatment outs/consensus_peak_calling/pseudobulk_bed_files/ASTP.fragments.tsv.gz --name ASTP  --outdir outs/consensus_peak_calling/MACS --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=1232484) 2024-02-29 18:05:51,888 cisTopic     INFO     Calling peaks for INH_PVALB with macs2 callpeak --treatment outs/consensus_peak_calling/pseudobulk_bed_files/INH_PVALB.fragments.tsv.gz --name INH_PVALB  --outdir outs/consensus_peak_calling/MACS --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=1232489) 2024-02-29 18:05:51,895 cisTopic     INFO     Calling peaks for MGL with macs2 callpeak --treatment outs/consensus_peak_calling/pseudobulk_bed_files/MGL.fragments.tsv.gz --name MGL  --outdir outs/consensus_peak_calling/MACS --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=1232490) 2024-02-29 18:05:51,975 cisTopic     INFO     Calling peaks for INH_VIP with macs2 callpeak --treatment outs/consensus_peak_calling/pseudobulk_bed_files/INH_VIP.fragments.tsv.gz --name INH_VIP  --outdir outs/consensus_peak_calling/MACS --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=1232491) 2024-02-29 18:05:51,896 cisTopic     INFO     Calling peaks for MOL_A with macs2 callpeak --treatment outs/consensus_peak_calling/pseudobulk_bed_files/MOL_A.fragments.tsv.gz --name MOL_A  --outdir outs/consensus_peak_calling/MACS --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=1232486) 2024-02-29 18:05:51,961 cisTopic     INFO     Calling peaks for PURK with macs2 callpeak --treatment outs/consensus_peak_calling/pseudobulk_bed_files/PURK.fragments.tsv.gz --name PURK  --outdir outs/consensus_peak_calling/MACS --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=1232487) 2024-02-29 18:05:51,915 cisTopic     INFO     Calling peaks for MOL_B with macs2 callpeak --treatment outs/consensus_peak_calling/pseudobulk_bed_files/MOL_B.fragments.tsv.gz --name MOL_B  --outdir outs/consensus_peak_calling/MACS --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=1232484) 2024-02-29 18:05:58,586 cisTopic     INFO     INH_PVALB done!
(macs_call_peak_ray pid=1232484) 2024-02-29 18:05:58,594 cisTopic     INFO     Calling peaks for INH_SST with macs2 callpeak --treatment outs/consensus_peak_calling/pseudobulk_bed_files/INH_SST.fragments.tsv.gz --name INH_SST  --outdir outs/consensus_peak_calling/MACS --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=1232486) 2024-02-29 18:06:12,711 cisTopic     INFO     PURK done!
(macs_call_peak_ray pid=1232486) 2024-02-29 18:06:12,725 cisTopic     INFO     Calling peaks for AST with macs2 callpeak --treatment outs/consensus_peak_calling/pseudobulk_bed_files/AST.fragments.tsv.gz --name AST  --outdir outs/consensus_peak_calling/MACS --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=1232483) 2024-02-29 18:06:21,180 cisTopic     INFO     ASTP done!
(macs_call_peak_ray pid=1232483) 2024-02-29 18:06:21,201 cisTopic     INFO     Calling peaks for GP with macs2 callpeak --treatment outs/consensus_peak_calling/pseudobulk_bed_files/GP.fragments.tsv.gz --name GP  --outdir outs/consensus_peak_calling/MACS --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=1232489) 2024-02-29 18:06:21,581 cisTopic     INFO     MGL done!
(macs_call_peak_ray pid=1232489) 2024-02-29 18:06:21,604 cisTopic     INFO     Calling peaks for ENDO with macs2 callpeak --treatment outs/consensus_peak_calling/pseudobulk_bed_files/ENDO.fragments.tsv.gz --name ENDO  --outdir outs/consensus_peak_calling/MACS --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=1232489) 2024-02-29 18:06:28,838 cisTopic     INFO     ENDO done!
(macs_call_peak_ray pid=1232486) 2024-02-29 18:06:34,004 cisTopic     INFO     AST done!
(macs_call_peak_ray pid=1232490) 2024-02-29 18:06:46,068 cisTopic     INFO     INH_VIP done!
(macs_call_peak_ray pid=1232488) 2024-02-29 18:06:47,921 cisTopic     INFO     OPC done!
(macs_call_peak_ray pid=1232484) 2024-02-29 18:06:52,269 cisTopic     INFO     INH_SST done!
(macs_call_peak_ray pid=1232491) 2024-02-29 18:07:00,088 cisTopic     INFO     MOL_A done!
(macs_call_peak_ray pid=1232483) 2024-02-29 18:07:04,270 cisTopic     INFO     GP done!
(macs_call_peak_ray pid=1232487) 2024-02-29 18:07:05,433 cisTopic     INFO     MOL_B done!
(macs_call_peak_ray pid=1232485) 2024-02-29 18:07:15,367 cisTopic     INFO     GC done!
(macs_call_peak_ray pid=1232492) 2024-02-29 18:07:17,439 cisTopic     INFO     AST_CER done!

Finally, it is time to derive the consensus peaks. To do so, we use the TGCA iterative peak filtering approach. First, each summit is extended a peak_half_width in each direction and then we iteratively filter out less significant peaks that overlap with a more significant one. During this procedure peaks will be merged and depending on the number of peaks included into them, different processes will happen:

  • 1 peak: The original peak region will be kept

  • 2 peaks: The original peak region with the highest score will be kept

  • 3 or more peaks: The orignal peak region with the most significant score will be taken, and all the original peak regions in this merged peak region that overlap with the significant peak region will be removed. The process is repeated with the next most significant peak (if it was not removed already) until all peaks are processed.

This proccess will happen twice, first for each pseudobulk peaks; and after peak score normalization, to process all peaks together.

[16]:
from pycisTopic.iterative_peak_calling import get_consensus_peaks
# Other param
peak_half_width=250
path_to_blacklist="pycisTopic/blacklist/hg38-blacklist.v2.bed"
# Get consensus peaks
consensus_peaks = get_consensus_peaks(
    narrow_peaks_dict = narrow_peak_dict,
    peak_half_width = peak_half_width,
    chromsizes = chromsizes,
    path_to_blacklist = path_to_blacklist)
2024-02-29 18:07:54,059 cisTopic     INFO     Extending and merging peaks per class
2024-02-29 18:08:50,783 cisTopic     INFO     Normalizing peak scores
2024-02-29 18:08:51,331 cisTopic     INFO     Merging peaks
2024-02-29 18:09:59,689 cisTopic     INFO     Done!
[17]:
consensus_peaks.to_bed(
    path = os.path.join(out_dir, "consensus_peak_calling/consensus_regions.bed"),
    keep =True,
    compression = 'infer',
    chain = False)

QC

The next step is to perform QC for the scATAC-seq samples (in this case, only one run). There are several measurements and visualizations performed in this step:

  • Barcode rank plot

  • Duplication rate

  • Insertion size

  • TSS enrichment

  • Fraction of Reads In Peaks (FRIP)

To calculate the TSS enrichment we need to provide TSS annotations. You can easily download them via the pycistopic tss get_tss command.

In case you are unsure which column name is used by Ensembl to specify gene names in their databases, run the pycistopic tss gene_annotation_list and grep for your species.

[14]:
!pycistopic tss gene_annotation_list | grep Human
hsapiens_gene_ensembl   Human genes (GRCh38.p14)
[17]:
!mkdir -p outs/qc
!pycistopic tss get_tss \
    --output outs/qc/tss.bed \
    --name "hsapiens_gene_ensembl" \
    --to-chrom-source ucsc \
    --ucsc hg38
- Get TSS annotation from Ensembl BioMart with the following settings:
  - biomart_name: "hsapiens_gene_ensembl"
  - biomart_host: "http://www.ensembl.org"
  - transcript_type: ['protein_coding']
  - use_cache: True
- Getting chromosome sizes and alias mapping for "hg38" from UCSC.
- Update chromosome names in TSS annotation to "ucsc" chromosome names.
- Writing TSS annotation BED file to "outs/qc/tss.bed".
[19]:
!head outs/qc/tss.bed | column -t
#     Chromosome  Start  End      Gene  Score  Strand          Transcript_type
chrM  3306        3307   MT-ND1   .     +      protein_coding
chrM  4469        4470   MT-ND2   .     +      protein_coding
chrM  5903        5904   MT-CO1   .     +      protein_coding
chrM  7585        7586   MT-CO2   .     +      protein_coding
chrM  8365        8366   MT-ATP8  .     +      protein_coding
chrM  8526        8527   MT-ATP6  .     +      protein_coding
chrM  9206        9207   MT-CO3   .     +      protein_coding
chrM  10058       10059  MT-ND3   .     +      protein_coding
chrM  10469       10470  MT-ND4L  .     +      protein_coding

Next, let’s calculate the QC metrics using the pycistopic qc command.

[24]:
!pycistopic qc \
    --fragments data/fragments.tsv.gz \
    --regions outs/consensus_peak_calling/consensus_regions.bed \
    --tss outs/qc/tss.bed \
    --output outs/qc/10x_multiome_brain

In case you have multiple samples, you can run the QC step in parallel as follows.

[ ]:
regions_bed_filename = os.path.join(out_dir, "consensus_peak_calling/consensus_regions.bed")
tss_bed_filename = os.path.join(out_dir, "qc", "tss.bed")

pycistopic_qc_commands_filename = "pycistopic_qc_commands.txt"

# Create text file with all pycistopic qc command lines.
with open(pycistopic_qc_commands_filename, "w") as fh:
    for sample, fragment_filename in fragments_dict.items():
        print(
            "pycistopic qc",
            f"--fragments {fragment_filename}",
            f"--regions {regions_bed_filename}",
            f"--tss {tss_bed_filename}",
            f"--output {os.path.join(out_dir, "qc")}/{sample}",
            sep=" ",
            file=fh,
        )

And run the following in a command line environment.

cat pycistopic_qc_commands.txt | parallel -j 4 {}

Finally, we can visualize sample level statistics.

These include:

  • Barcode rank plot: The barcode rank plot shows the distribution of non-duplicate reads and which barcodes were inferred to be associated with cells. A steep drop-off (‘knee’) is indicative of good separation between the cell-associated barcodes and the barcodes associated with empty partitions.

  • Insertion size: ATAC-seq requires a proper pair of Tn5 transposase cutting events at the ends of DNA. In the nucleosome-free open chromatin regions, many molecules of Tn5 can kick in and chop the DNA into small pieces; around nucleosome-occupied regions, and Tn5 can only access the linker regions. Therefore, in a good ATAC-seq library, you should expect to see a sharp peak at the <100 bp region (open chromatin), and a peak at ~200bp region (mono-nucleosome), and other larger peaks (multi-nucleosomes). A clear nucleosome pattern indicates a good quality of the experiment.

  • Sample TSS enrichment: The TSS enrichment calculation is a signal to noise calculation. The reads around a reference set of TSSs are collected to form an aggregate distribution of reads centered on the TSSs and extending to 1000 bp in either direction (for a total of 2000bp). This distribution is then normalized by taking the average read depth in the 100 bps at each of the end flanks of the distribution (for a total of 200bp of averaged data) and calculating a fold change at each position over that average read depth. This means that the flanks should start at 1, and if there is high read signal at transcription start sites (highly open regions of the genome) there should be an increase in signal up to a peak in the middle.

[6]:
from pycisTopic.plotting.qc_plot import plot_sample_stats, plot_barcode_stats
import matplotlib.pyplot as plt
[7]:
for sample_id in fragments_dict:
    fig = plot_sample_stats(
        sample_id = sample_id,
        pycistopic_qc_output_dir = "outs/qc"
    )
../_images/notebooks_human_cerebellum_37_0.png

We can also visualize barcode level statistics.

These statistics can be used to filter cell barcodes to retain only high quality cells.

Note:

The pycistopic qc command will determine automatic thresholds for the minimum number of unique number of fragments and the minumum TSS enrichment. In case you want to change these thresholds or want to threhold based on FRIP, you can provide manually defined thresholds using the parameters: - unique_fragments_threshold - tss_enrichment_threshold - frip_threshold

In this case we will use the automatically defined thresholds, please manualy inspect the quality metrics to make sure these thresholds are valid!

The barcode level statistics include:

  • Total number of (unique) fragments

  • TSS enrichment: The score at position in the TSS enrichmen score for for each barcode (at position 0, the TSS). Noisy cells will have a low TSS enrichment.

  • FRIP: The fraction of reads in peaks for each barcode. Noisy cells have low FRIP values. However, this filter should be used with nuance, as it depends on the quality of the original peaks. For example, if there is a rare population in the sample, its specific peaks may be missed by peak calling algorithms, causing a decrease in their FRIP values.

[6]:
from pycisTopic.qc import get_barcodes_passing_qc_for_sample
sample_id_to_barcodes_passing_filters = {}
sample_id_to_thresholds = {}
for sample_id in fragments_dict:
    (
        sample_id_to_barcodes_passing_filters[sample_id],
        sample_id_to_thresholds[sample_id]
    ) = get_barcodes_passing_qc_for_sample(
            sample_id = sample_id,
            pycistopic_qc_output_dir = "outs/qc",
            unique_fragments_threshold = None, # use automatic thresholding
            tss_enrichment_threshold = None, # use automatic thresholding
            frip_threshold = 0,
            use_automatic_thresholds = True,
    )
10x_multiome_brain:
        Using automatic threshold for unique fragments: 1500.1880245466896
        Using automatic threshold for TSS enrichment: 7.534788253503151
[9]:
for sample_id in fragments_dict:
    fig = plot_barcode_stats(
        sample_id = sample_id,
        pycistopic_qc_output_dir = "outs/qc",
        bc_passing_filters = sample_id_to_barcodes_passing_filters[sample_id],
        detailed_title = False,
        **sample_id_to_thresholds[sample_id]
    )
../_images/notebooks_human_cerebellum_40_0.png

Creating a cisTopic object

In this step we will create a cisTopic object. This involves generating a count matrix containing fragment counts over consensus peaks (see above) for each cell barcode passing the QC metrices defined above.

Blacklist regions will be removed from this count matrix.

[9]:
path_to_regions = os.path.join(out_dir, "consensus_peak_calling/consensus_regions.bed")
path_to_blacklist = "pycisTopic/blacklist/hg38-blacklist.v2.bed"
pycistopic_qc_output_dir = "outs/qc"

from pycisTopic.cistopic_class import create_cistopic_object_from_fragments
import polars as pl

cistopic_obj_list = []
for sample_id in fragments_dict:
    sample_metrics = pl.read_parquet(
        os.path.join(pycistopic_qc_output_dir, f'{sample_id}.fragments_stats_per_cb.parquet')
    ).to_pandas().set_index("CB").loc[ sample_id_to_barcodes_passing_filters[sample_id] ]
    cistopic_obj = create_cistopic_object_from_fragments(
        path_to_fragments = fragments_dict[sample_id],
        path_to_regions = path_to_regions,
        path_to_blacklist = path_to_blacklist,
        metrics = sample_metrics,
        valid_bc = sample_id_to_barcodes_passing_filters[sample_id],
        n_cpu = 1,
        project = sample_id,
        split_pattern = '-'
    )
    cistopic_obj_list.append(cistopic_obj)
2024-03-06 10:15:48,250 cisTopic     INFO     Reading data for 10x_multiome_brain
2024-03-06 10:17:22,144 cisTopic     INFO     metrics provided!
2024-03-06 10:17:30,393 cisTopic     INFO     Counting fragments in regions
2024-03-06 10:18:32,026 cisTopic     INFO     Creating fragment matrix
2024-03-06 10:19:03,991 cisTopic     INFO     Converting fragment matrix to sparse matrix
2024-03-06 10:19:10,609 cisTopic     INFO     Removing blacklisted regions
2024-03-06 10:19:11,744 cisTopic     INFO     Creating CistopicObject
2024-03-06 10:19:13,305 cisTopic     INFO     Done!

In this case we only have one sample, so only one cisTopic object has been generated. If you would have multiple samples, you would need to run the merge() function on your cisTopic object list.

[10]:
cistopic_obj = cistopic_obj_list[0]
print(cistopic_obj)
CistopicObject from project 10x_multiome_brain with n_cells × n_regions = 2853 × 436234
[13]:
import pickle
pickle.dump(
    cistopic_obj,
    open(os.path.join(out_dir, "cistopic_obj.pkl"), "wb")
)

Adding metadata to a cisTopic object

We can add additional metadata (for regions or cells) to a cisTopic object. For example, let’s add the scRNA-seq data annotations. Missing values will be filled with Nan.

[14]:
import pandas as pd
cell_data = pd.read_table("data/cell_data.tsv", index_col = 0)
cell_data.head()
cistopic_obj.add_cell_data(cell_data, split_pattern='-')
pickle.dump(
    cistopic_obj,
    open(os.path.join(out_dir, "cistopic_obj.pkl"), "wb")
)
Warning: Some cells in this CistopicObject are not present in this cell_data. Values will be filled with Nan

[15]:
cistopic_obj.cell_data
[15]:
cisTopic_nr_frag cisTopic_log_nr_frag cisTopic_nr_acc cisTopic_log_nr_acc sample_id barcode_rank total_fragments_count log10_total_fragments_count unique_fragments_count log10_unique_fragments_count ... barcode VSN_cell_type VSN_leiden_res0.3 VSN_leiden_res0.6 VSN_leiden_res0.9 VSN_leiden_res1.2 VSN_sample_id Seurat_leiden_res0.6 Seurat_leiden_res1.2 Seurat_cell_type
CACCTCAGTTGTAAAC-1-10x_multiome_brain 18300 4.262451 15555 4.19187 10x_multiome_brain 1263 47941 4.680716 30052 4.477888 ... CACCTCAGTTGTAAAC-1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
TGACTCCTCATCCACC-1-10x_multiome_brain 100055 5.000239 62282 4.794363 10x_multiome_brain 93 311527 5.493497 177932 5.250257 ... TGACTCCTCATCCACC-1 AST_CER AST_CER (2) AST_CER (2) AST_CER (2) AST_CER_1 (7) 10x_multiome_brain BG (2) BG (2) BG
TTTCTCACATAAACCT-1-10x_multiome_brain 32192 4.507748 27001 4.43138 10x_multiome_brain 594 110849 5.044736 67109 4.826787 ... TTTCTCACATAAACCT-1 GC GC (4) GC (5) GC (6) GC (8) 10x_multiome_brain GC (4) GC (5) GC
GTCCTCCCACACAATT-1-10x_multiome_brain 88443 4.946663 58296 4.765639 10x_multiome_brain 97 325923 5.513116 174752 5.242425 ... GTCCTCCCACACAATT-1 AST_CER AST_CER (2) AST_CER (2) AST_CER (2) AST_CER_1 (7) 10x_multiome_brain BG (2) BG (2) BG
CTCCGTCCAGTTTGTG-1-10x_multiome_brain 131110 5.117636 78509 4.894919 10x_multiome_brain 54 405882 5.608401 232359 5.366161 ... CTCCGTCCAGTTTGTG-1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
GCAGGTTGTCCAAATG-1-10x_multiome_brain 2770 3.44248 2613 3.417139 10x_multiome_brain 3258 7170 3.855580 4043 3.606811 ... GCAGGTTGTCCAAATG-1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
AAGCTCCCAGCACCAT-1-10x_multiome_brain 2180 3.338456 2040 3.30963 10x_multiome_brain 3148 9905 3.995898 5832 3.765892 ... AAGCTCCCAGCACCAT-1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
CAGAATCTCCTCATGC-1-10x_multiome_brain 1744 3.241546 1681 3.225568 10x_multiome_brain 3340 4638 3.666424 2655 3.424228 ... CAGAATCTCCTCATGC-1 MOL_B MOL_B (0) MOL_B_1 (0) MOL_B_1 (1) MOL_B_1 (1) 10x_multiome_brain MOL (0) MOL (1) MOL
TAGCCGGGTAACAGGG-1-10x_multiome_brain 2674 3.427161 2414 3.382737 10x_multiome_brain 3134 10310 4.013301 6033 3.780605 ... TAGCCGGGTAACAGGG-1 ENDO MOL_A (1) MOL_A_2 (1) MOL_A_1 (0) ENDO (18) 10x_multiome_brain AST+ENDO (6) ENDO (15) ENDO
GTGCGCAGTGCTTAGA-1-10x_multiome_brain 5859 3.767823 5515 3.741546 10x_multiome_brain 2452 25196 4.401349 14736 4.168409 ... GTGCGCAGTGCTTAGA-1 INH_SST INH_SST (5) INH_SST (7) INH_SST (7) INH_SST (9) 10x_multiome_brain INH_SST (5) INH_SST (8) INH_SST

2853 rows × 31 columns

Running scrublet

Optionally, you can run also scrublet on the fragment count matrix to infer doublets from the scATAC-seq.

[17]:
import scrublet as scr
scrub = scr.Scrublet(cistopic_obj.fragment_matrix.T, expected_doublet_rate=0.1)
doublet_scores, predicted_doublets = scrub.scrub_doublets()
scrub.plot_histogram();
scrub.call_doublets(threshold=0.22)
scrub.plot_histogram();
scrublet = pd.DataFrame([scrub.doublet_scores_obs_, scrub.predicted_doublets_], columns=cistopic_obj.cell_names, index=['Doublet_scores_fragments', 'Predicted_doublets_fragments']).T
Preprocessing...
/lustre1/project/stg_00002/mambaforge/vsc33053/envs/pycisTopic_polars_tutorial/lib/python3.11/site-packages/scrublet/helper_functions.py:252: RuntimeWarning: invalid value encountered in sqrt
  CV_input = np.sqrt(b);
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.66
Detected doublet rate = 0.2%
Estimated detectable doublet fraction = 1.1%
Overall doublet rate:
        Expected   = 10.0%
        Estimated  = 16.1%
Elapsed time: 26.8 seconds
Detected doublet rate = 22.4%
Estimated detectable doublet fraction = 66.6%
Overall doublet rate:
        Expected   = 10.0%
        Estimated  = 33.6%
../_images/notebooks_human_cerebellum_50_3.png
../_images/notebooks_human_cerebellum_50_4.png
[18]:
cistopic_obj.add_cell_data(scrublet, split_pattern = '-')
sum(cistopic_obj.cell_data.Predicted_doublets_fragments == True)
[18]:
638
[19]:
pickle.dump(
    cistopic_obj,
    open(os.path.join(out_dir, "cistopic_obj.pkl"), "wb")
)
[20]:
# Remove doublets
singlets = cistopic_obj.cell_data[cistopic_obj.cell_data.Predicted_doublets_fragments == False].index.tolist()
# Subset cisTopic object
cistopic_obj_noDBL = cistopic_obj.subset(singlets, copy=True, split_pattern='-')
print(cistopic_obj_noDBL)
CistopicObject from project 10x_multiome_brain with n_cells × n_regions = 2215 × 436232
[21]:
pickle.dump(
    cistopic_obj,
    open(os.path.join(out_dir, "cistopic_obj.pkl"), "wb")
)

Run models

Next we will perform the actual topic modeling using LDA using a Collapsed Gibbs Sampler.

There are two functions that ca be used to perform the topic modeling, both producde similar results.

  • Serial LDA: The parallelization is done between models rather than within each model. Recommended for small-medium sized data sets in which several models with different number of topics are being tested. You can run these models with runCGSModels().

  • Parallel LDA with MALLET: The parallelization is done within each model. Recommended for large data sets where a few models with different number of topics are being tested. If working in a cluster, we recommed to submit a job per model so they can run simultaneously. You can run it with runCGSModelsMallet().

Here, we will use Mallet.

Note:

In order to be able to run Mallet you need the Mallet binary, these binaries can be downloaded from Github. You can also compile the binary from source, for more information please visit the Mallet Github repository.

[23]:
!wget https://github.com/mimno/Mallet/releases/download/v202108/Mallet-202108-bin.tar.gz
!tar -xf Mallet-202108-bin.tar.gz
--2024-03-06 10:35:09--  https://github.com/mimno/Mallet/releases/download/v202108/Mallet-202108-bin.tar.gz
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/18378040/6a3fdbe6-0d3f-4f99-add3-1f98129f43cb?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240306%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240306T093319Z&X-Amz-Expires=300&X-Amz-Signature=441a3185ed2778aeb182bf8f470200f1c88411564e67493f8e82405ac057b350&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=18378040&response-content-disposition=attachment%3B%20filename%3DMallet-202108-bin.tar.gz&response-content-type=application%2Foctet-stream [following]
--2024-03-06 10:35:09--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/18378040/6a3fdbe6-0d3f-4f99-add3-1f98129f43cb?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240306%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240306T093319Z&X-Amz-Expires=300&X-Amz-Signature=441a3185ed2778aeb182bf8f470200f1c88411564e67493f8e82405ac057b350&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=18378040&response-content-disposition=attachment%3B%20filename%3DMallet-202108-bin.tar.gz&response-content-type=application%2Foctet-stream
Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7187546 (6.9M) [application/octet-stream]
Saving to: ‘Mallet-202108-bin.tar.gz’

Mallet-202108-bin.t 100%[===================>]   6.85M  --.-KB/s    in 0.05s

2024-03-06 10:35:09 (138 MB/s) - ‘Mallet-202108-bin.tar.gz’ saved [7187546/7187546]

[ ]:
!mkdir -p /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/

Because we don’t know yet what number of topics will be optimal for our dataset we will run several topic models, each with a different number of topics.

[27]:
os.environ['MALLET_MEMORY'] = '200G'
from pycisTopic.lda_models import run_cgs_models_mallet
# Configure path Mallet
mallet_path="Mallet-202108/bin/mallet"
# Run models
models=run_cgs_models_mallet(
    cistopic_obj,
    n_topics=[2, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50],
    n_cpu=12,
    n_iter=500,
    random_state=555,
    alpha=50,
    alpha_by_topic=True,
    eta=0.1,
    eta_by_topic=False,
    tmp_path="/scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial",
    save_path="/scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial",
    mallet_path=mallet_path,
)
2024-03-06 10:52:57,071 cisTopic     INFO     Formatting input to corpus
2024-03-06 10:52:57,690 gensim.corpora.dictionary INFO     adding document #0 to Dictionary<0 unique tokens: []>
2024-03-06 10:53:23,393 gensim.corpora.dictionary INFO     built Dictionary<436234 unique tokens: ['0', '1', '2', '3', '4']...> from 2853 documents (total 46261622 corpus positions)
2024-03-06 10:53:23,395 cisTopic     INFO     Running model with 2 topics
2024-03-06 10:53:23,409 LDAMalletWrapper INFO     Serializing temporary corpus to /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt
2024-03-06 10:54:49,354 LDAMalletWrapper INFO     Converting temporary corpus to MALLET format with Mallet-202108/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt --output /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet
2024-03-06 10:55:16,373 LDAMalletWrapper INFO     Training MALLET LDA with Mallet-202108/bin/mallet train-topics --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet --num-topics 2  --alpha 50 --beta 0.1 --optimize-interval 0 --num-threads 12 --output-state /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/607c9d_state.mallet.gz --output-doc-topics /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/607c9d_doctopics.txt --output-topic-keys /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/607c9d_topickeys.txt --num-iterations 500 --inferencer-filename /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/607c9d_inferencer.mallet --doc-topics-threshold 0.0  --random-seed 555
2024-03-06 11:11:23,264 LDAMalletWrapper INFO     loading assigned topics from /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/607c9d_state.mallet.gz
2024-03-06 11:12:40,862 cisTopic     INFO     Model with 2 topics done!
2024-03-06 11:12:40,863 cisTopic     INFO     Saving model with 2 topics at /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial
2024-03-06 11:12:40,954 cisTopic     INFO     Running model with 5 topics
2024-03-06 11:12:40,967 LDAMalletWrapper INFO     Serializing temporary corpus to /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt
2024-03-06 11:14:09,509 LDAMalletWrapper INFO     Converting temporary corpus to MALLET format with Mallet-202108/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt --output /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet
2024-03-06 11:14:37,009 LDAMalletWrapper INFO     Training MALLET LDA with Mallet-202108/bin/mallet train-topics --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet --num-topics 5  --alpha 50 --beta 0.1 --optimize-interval 0 --num-threads 12 --output-state /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/3efb70_state.mallet.gz --output-doc-topics /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/3efb70_doctopics.txt --output-topic-keys /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/3efb70_topickeys.txt --num-iterations 500 --inferencer-filename /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/3efb70_inferencer.mallet --doc-topics-threshold 0.0  --random-seed 555
2024-03-06 11:31:43,881 LDAMalletWrapper INFO     loading assigned topics from /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/3efb70_state.mallet.gz
2024-03-06 11:33:02,089 cisTopic     INFO     Model with 5 topics done!
2024-03-06 11:33:02,090 cisTopic     INFO     Saving model with 5 topics at /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial
2024-03-06 11:33:02,210 cisTopic     INFO     Running model with 10 topics
2024-03-06 11:33:02,224 LDAMalletWrapper INFO     Serializing temporary corpus to /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt
2024-03-06 11:34:30,049 LDAMalletWrapper INFO     Converting temporary corpus to MALLET format with Mallet-202108/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt --output /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet
2024-03-06 11:34:56,968 LDAMalletWrapper INFO     Training MALLET LDA with Mallet-202108/bin/mallet train-topics --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet --num-topics 10  --alpha 50 --beta 0.1 --optimize-interval 0 --num-threads 12 --output-state /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/56914f_state.mallet.gz --output-doc-topics /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/56914f_doctopics.txt --output-topic-keys /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/56914f_topickeys.txt --num-iterations 500 --inferencer-filename /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/56914f_inferencer.mallet --doc-topics-threshold 0.0  --random-seed 555
2024-03-06 11:57:16,974 LDAMalletWrapper INFO     loading assigned topics from /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/56914f_state.mallet.gz
2024-03-06 11:58:36,725 cisTopic     INFO     Model with 10 topics done!
2024-03-06 11:58:36,726 cisTopic     INFO     Saving model with 10 topics at /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial
2024-03-06 11:58:36,873 cisTopic     INFO     Running model with 15 topics
2024-03-06 11:58:36,887 LDAMalletWrapper INFO     Serializing temporary corpus to /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt
2024-03-06 12:00:04,591 LDAMalletWrapper INFO     Converting temporary corpus to MALLET format with Mallet-202108/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt --output /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet
2024-03-06 12:00:30,544 LDAMalletWrapper INFO     Training MALLET LDA with Mallet-202108/bin/mallet train-topics --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet --num-topics 15  --alpha 50 --beta 0.1 --optimize-interval 0 --num-threads 12 --output-state /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/d5a078_state.mallet.gz --output-doc-topics /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/d5a078_doctopics.txt --output-topic-keys /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/d5a078_topickeys.txt --num-iterations 500 --inferencer-filename /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/d5a078_inferencer.mallet --doc-topics-threshold 0.0  --random-seed 555
2024-03-06 12:24:38,351 LDAMalletWrapper INFO     loading assigned topics from /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/d5a078_state.mallet.gz
2024-03-06 12:25:59,453 cisTopic     INFO     Model with 15 topics done!
2024-03-06 12:25:59,454 cisTopic     INFO     Saving model with 15 topics at /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial
2024-03-06 12:25:59,617 cisTopic     INFO     Running model with 20 topics
2024-03-06 12:25:59,635 LDAMalletWrapper INFO     Serializing temporary corpus to /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt
2024-03-06 12:27:25,547 LDAMalletWrapper INFO     Converting temporary corpus to MALLET format with Mallet-202108/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt --output /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet
2024-03-06 12:27:52,965 LDAMalletWrapper INFO     Training MALLET LDA with Mallet-202108/bin/mallet train-topics --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet --num-topics 20  --alpha 50 --beta 0.1 --optimize-interval 0 --num-threads 12 --output-state /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/da4528_state.mallet.gz --output-doc-topics /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/da4528_doctopics.txt --output-topic-keys /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/da4528_topickeys.txt --num-iterations 500 --inferencer-filename /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/da4528_inferencer.mallet --doc-topics-threshold 0.0  --random-seed 555
2024-03-06 12:49:36,810 LDAMalletWrapper INFO     loading assigned topics from /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/da4528_state.mallet.gz
2024-03-06 12:50:58,204 cisTopic     INFO     Model with 20 topics done!
2024-03-06 12:50:58,205 cisTopic     INFO     Saving model with 20 topics at /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial
2024-03-06 12:50:58,372 cisTopic     INFO     Running model with 25 topics
2024-03-06 12:50:58,386 LDAMalletWrapper INFO     Serializing temporary corpus to /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt
2024-03-06 12:52:25,150 LDAMalletWrapper INFO     Converting temporary corpus to MALLET format with Mallet-202108/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt --output /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet
2024-03-06 12:52:52,157 LDAMalletWrapper INFO     Training MALLET LDA with Mallet-202108/bin/mallet train-topics --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet --num-topics 25  --alpha 50 --beta 0.1 --optimize-interval 0 --num-threads 12 --output-state /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/c9d3ea_state.mallet.gz --output-doc-topics /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/c9d3ea_doctopics.txt --output-topic-keys /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/c9d3ea_topickeys.txt --num-iterations 500 --inferencer-filename /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/c9d3ea_inferencer.mallet --doc-topics-threshold 0.0  --random-seed 555
2024-03-06 13:23:17,202 LDAMalletWrapper INFO     loading assigned topics from /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/c9d3ea_state.mallet.gz
2024-03-06 13:24:41,201 cisTopic     INFO     Model with 25 topics done!
2024-03-06 13:24:41,201 cisTopic     INFO     Saving model with 25 topics at /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial
2024-03-06 13:24:41,396 cisTopic     INFO     Running model with 30 topics
2024-03-06 13:24:41,414 LDAMalletWrapper INFO     Serializing temporary corpus to /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt
2024-03-06 13:26:08,107 LDAMalletWrapper INFO     Converting temporary corpus to MALLET format with Mallet-202108/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt --output /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet
2024-03-06 13:26:35,899 LDAMalletWrapper INFO     Training MALLET LDA with Mallet-202108/bin/mallet train-topics --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet --num-topics 30  --alpha 50 --beta 0.1 --optimize-interval 0 --num-threads 12 --output-state /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/9522b8_state.mallet.gz --output-doc-topics /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/9522b8_doctopics.txt --output-topic-keys /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/9522b8_topickeys.txt --num-iterations 500 --inferencer-filename /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/9522b8_inferencer.mallet --doc-topics-threshold 0.0  --random-seed 555
2024-03-06 13:53:23,219 LDAMalletWrapper INFO     loading assigned topics from /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/9522b8_state.mallet.gz
2024-03-06 13:54:48,411 cisTopic     INFO     Model with 30 topics done!
2024-03-06 13:54:48,413 cisTopic     INFO     Saving model with 30 topics at /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial
2024-03-06 13:54:48,639 cisTopic     INFO     Running model with 35 topics
2024-03-06 13:54:48,654 LDAMalletWrapper INFO     Serializing temporary corpus to /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt
2024-03-06 13:56:15,826 LDAMalletWrapper INFO     Converting temporary corpus to MALLET format with Mallet-202108/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt --output /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet
2024-03-06 13:56:44,206 LDAMalletWrapper INFO     Training MALLET LDA with Mallet-202108/bin/mallet train-topics --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet --num-topics 35  --alpha 50 --beta 0.1 --optimize-interval 0 --num-threads 12 --output-state /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/71a4bc_state.mallet.gz --output-doc-topics /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/71a4bc_doctopics.txt --output-topic-keys /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/71a4bc_topickeys.txt --num-iterations 500 --inferencer-filename /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/71a4bc_inferencer.mallet --doc-topics-threshold 0.0  --random-seed 555
2024-03-06 14:24:54,558 LDAMalletWrapper INFO     loading assigned topics from /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/71a4bc_state.mallet.gz
2024-03-06 14:26:26,621 cisTopic     INFO     Model with 35 topics done!
2024-03-06 14:26:26,622 cisTopic     INFO     Saving model with 35 topics at /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial
2024-03-06 14:26:26,894 cisTopic     INFO     Running model with 40 topics
2024-03-06 14:26:26,914 LDAMalletWrapper INFO     Serializing temporary corpus to /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt
2024-03-06 14:27:55,771 LDAMalletWrapper INFO     Converting temporary corpus to MALLET format with Mallet-202108/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt --output /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet
2024-03-06 14:28:22,633 LDAMalletWrapper INFO     Training MALLET LDA with Mallet-202108/bin/mallet train-topics --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet --num-topics 40  --alpha 50 --beta 0.1 --optimize-interval 0 --num-threads 12 --output-state /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/5db9d5_state.mallet.gz --output-doc-topics /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/5db9d5_doctopics.txt --output-topic-keys /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/5db9d5_topickeys.txt --num-iterations 500 --inferencer-filename /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/5db9d5_inferencer.mallet --doc-topics-threshold 0.0  --random-seed 555
2024-03-06 14:54:05,167 LDAMalletWrapper INFO     loading assigned topics from /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/5db9d5_state.mallet.gz
2024-03-06 14:55:32,458 cisTopic     INFO     Model with 40 topics done!
2024-03-06 14:55:32,459 cisTopic     INFO     Saving model with 40 topics at /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial
2024-03-06 14:55:32,717 cisTopic     INFO     Running model with 45 topics
2024-03-06 14:55:32,732 LDAMalletWrapper INFO     Serializing temporary corpus to /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt
2024-03-06 14:56:59,495 LDAMalletWrapper INFO     Converting temporary corpus to MALLET format with Mallet-202108/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt --output /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet
2024-03-06 14:57:26,628 LDAMalletWrapper INFO     Training MALLET LDA with Mallet-202108/bin/mallet train-topics --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet --num-topics 45  --alpha 50 --beta 0.1 --optimize-interval 0 --num-threads 12 --output-state /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/cc641b_state.mallet.gz --output-doc-topics /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/cc641b_doctopics.txt --output-topic-keys /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/cc641b_topickeys.txt --num-iterations 500 --inferencer-filename /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/cc641b_inferencer.mallet --doc-topics-threshold 0.0  --random-seed 555
2024-03-06 15:26:12,558 LDAMalletWrapper INFO     loading assigned topics from /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/cc641b_state.mallet.gz
2024-03-06 15:27:42,660 cisTopic     INFO     Model with 45 topics done!
2024-03-06 15:27:42,660 cisTopic     INFO     Saving model with 45 topics at /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial
2024-03-06 15:27:42,926 cisTopic     INFO     Running model with 50 topics
2024-03-06 15:27:43,051 LDAMalletWrapper INFO     Serializing temporary corpus to /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt
2024-03-06 15:29:09,831 LDAMalletWrapper INFO     Converting temporary corpus to MALLET format with Mallet-202108/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.txt --output /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet
2024-03-06 15:29:36,971 LDAMalletWrapper INFO     Training MALLET LDA with Mallet-202108/bin/mallet train-topics --input /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/corpus.mallet --num-topics 50  --alpha 50 --beta 0.1 --optimize-interval 0 --num-threads 12 --output-state /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/e9e62f_state.mallet.gz --output-doc-topics /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/e9e62f_doctopics.txt --output-topic-keys /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/e9e62f_topickeys.txt --num-iterations 500 --inferencer-filename /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/e9e62f_inferencer.mallet --doc-topics-threshold 0.0  --random-seed 555
2024-03-06 16:02:19,422 LDAMalletWrapper INFO     loading assigned topics from /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial/e9e62f_state.mallet.gz
2024-03-06 16:03:47,824 cisTopic     INFO     Model with 50 topics done!
2024-03-06 16:03:47,824 cisTopic     INFO     Saving model with 50 topics at /scratch/leuven/330/vsc33053/ray_spill/mallet/tutorial
[29]:
pickle.dump(
    models,
    open(os.path.join(out_dir, "models.pkl"), "wb")
)

Model selection

Let’s select the model with the optimal number of topics.

There is no optimal way to do this selection, however getting the exact optimal number of topics is also not critical.

To make the selection easier we implemented several metrics:

  • Minmo_2011: Uses the average model coherence as calculated by Mimno et al (2011). In order to reduce the impact of the number of topics, we calculate the average coherence based on the top selected average values. The better the model, the higher coherence.

  • Log-likelihood: Uses the log-likelihood in the last iteration as calculated by Griffiths and Steyvers (2004). The better the model, the higher the log-likelihood.

  • Arun_2010: Uses a density-based metric as in Arun et al (2010) using the topic-region distribution, the cell-topic distribution and the cell coverage. The better the model, the lower the metric.

  • Cao_Juan_2009: Uses a divergence-based metric as in Cao Juan et al (2009) using the topic-region distribution. The better the model, the lower the metric.

Note:

Note that for both the Arun and Cao metric a lower score is related to a better model, for visualization purposes we inverted these scores. In the plots below a higher score for these metrics are thus related to a better model.

Also note that not all metrics agree (see for example the Arun metric).

For scATAC-seq data models, the most helpful methods are Minmo (topic coherence) and the log-likelihood in the last iteration.

[30]:
from pycisTopic.lda_models import evaluate_models
model = evaluate_models(
    models,
    select_model = 40,
    return_model = True
)
../_images/notebooks_human_cerebellum_62_0.png
[31]:
cistopic_obj.add_LDA_model(model)
[32]:
pickle.dump(
    cistopic_obj,
    open(os.path.join(out_dir, "cistopic_obj.pkl"), "wb")
)

Clustering and visualization

We can cluster the cells (or regions) using the leiden algorithm, and perform dimensionality reductiion with UMAP and TSNE. In these examples we will focus on the cells only. For these steps, the cell-topic contributions of the model will be used.

[82]:
from pycisTopic.clust_vis import (
    find_clusters,
    run_umap,
    run_tsne,
    plot_metadata,
    plot_topic,
    cell_topic_heatmap
)
[93]:
find_clusters(
    cistopic_obj,
    target  = 'cell',
    k = 10,
    res = [0.6, 1.2, 3],
    prefix = 'pycisTopic_',
    scale = True,
    split_pattern = '-'
)
2024-03-06 16:40:48,444 cisTopic     INFO     Finding neighbours
Columns ['pycisTopic_leiden_10_0.6'] will be overwritten
Columns ['pycisTopic_leiden_10_1.2'] will be overwritten
Columns ['pycisTopic_leiden_10_3'] will be overwritten
[42]:
run_umap(
    cistopic_obj,
    target  = 'cell', scale=True)
2024-03-06 16:21:36,432 cisTopic     INFO     Running UMAP
/lustre1/project/stg_00002/mambaforge/vsc33053/envs/pycisTopic_polars_tutorial/lib/python3.11/site-packages/umap/umap_.py:1943: UserWarning: n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(f"n_jobs value {self.n_jobs} overridden to 1 by setting random_state. Use no seed for parallelism.")
[43]:
run_tsne(
    cistopic_obj,
    target  = 'cell', scale=True)
2024-03-06 16:21:56,706 cisTopic     INFO     Running TSNE
[70]:
plot_metadata(
    cistopic_obj,
    reduction_name='UMAP',
    variables=['Seurat_cell_type', 'pycisTopic_leiden_10_0.6', 'pycisTopic_leiden_10_1.2', 'pycisTopic_leiden_10_3'],
    target='cell', num_columns=4,
    text_size=10,
    dot_size=5)
../_images/notebooks_human_cerebellum_70_0.png

Let’s annotate each cluster based on the overlap with scRNA-seq annotations.

[94]:
annot_dict = {}
for resolution in [0.6, 1.2, 3]:
    annot_dict[f"pycisTopic_leiden_10_{resolution}"] = {}
    for cluster in set(cistopic_obj.cell_data[f"pycisTopic_leiden_10_{resolution}"]):
        counts = cistopic_obj.cell_data.loc[
            cistopic_obj.cell_data.loc[cistopic_obj.cell_data[f"pycisTopic_leiden_10_{resolution}"] == cluster].index,
            "Seurat_cell_type"].value_counts()
        annot_dict[f"pycisTopic_leiden_10_{resolution}"][cluster] = f"{counts.index[counts.argmax()]}({cluster})"
[95]:
annot_dict
[95]:
{'pycisTopic_leiden_10_0.6': {'11': 'GP(11)',
  '6': 'INH_SST(6)',
  '8': 'INH_VIP(8)',
  '0': 'MOL(0)',
  '7': 'MGL(7)',
  '4': 'BG(4)',
  '12': 'MOL(12)',
  '9': 'GC(9)',
  '5': 'OPC(5)',
  '2': 'BG(2)',
  '10': 'AST(10)',
  '1': 'MOL(1)',
  '3': 'NFOL(3)'},
 'pycisTopic_leiden_10_1.2': {'15': 'MGL(15)',
  '8': 'MOL(8)',
  '12': 'MOL(12)',
  '5': 'OPC(5)',
  '3': 'MOL(3)',
  '6': 'BG(6)',
  '18': 'GP(18)',
  '14': 'GC(14)',
  '19': 'OPC(19)',
  '20': 'MOL(20)',
  '17': 'ENDO(17)',
  '16': 'AST(16)',
  '2': 'MOL(2)',
  '10': 'MOL(10)',
  '1': 'BG(1)',
  '11': 'MOL(11)',
  '0': 'NFOL(0)',
  '4': 'MOL(4)',
  '9': 'INH_SST(9)',
  '13': 'INH_VIP(13)',
  '7': 'MOL(7)'},
 'pycisTopic_leiden_10_3': {'21': 'OPC(21)',
  '31': 'MOL(31)',
  '15': 'MOL(15)',
  '8': 'BG(8)',
  '37': 'GP(37)',
  '12': 'MOL(12)',
  '41': 'INH_SST(41)',
  '5': 'BG(5)',
  '27': 'GC(27)',
  '43': 'PURK(43)',
  '6': 'MOL(6)',
  '3': 'BG(3)',
  '38': 'ENDO(38)',
  '18': 'MOL(18)',
  '14': 'NFOL(14)',
  '19': 'MOL(19)',
  '20': 'COP(20)',
  '30': 'MOL(30)',
  '17': 'BG(17)',
  '34': 'MOL(34)',
  '25': 'MOL(25)',
  '39': 'MGL(39)',
  '42': 'MOL(42)',
  '33': 'GC(33)',
  '24': 'MOL(24)',
  '28': 'INH_SST(28)',
  '29': 'NFOL(29)',
  '16': 'NFOL(16)',
  '2': 'OPC(2)',
  '10': 'INH_VIP(10)',
  '1': 'MOL(1)',
  '11': 'MOL(11)',
  '22': 'INH_SNCG(22)',
  '26': 'MOL(26)',
  '35': 'AST(35)',
  '23': 'MGL(23)',
  '36': 'MG(36)',
  '0': 'NFOL(0)',
  '40': 'MOL(40)',
  '4': 'MOL(4)',
  '9': 'MOL(9)',
  '13': 'BG(13)',
  '7': 'NFOL(7)',
  '32': 'OPC(32)'}}
[96]:
for resolution in [0.6, 1.2, 3]:
    cistopic_obj.cell_data[f'pycisTopic_leiden_10_{resolution}'] = [
        annot_dict[f'pycisTopic_leiden_10_{resolution}'][x] for x in cistopic_obj.cell_data[f'pycisTopic_leiden_10_{resolution}'].tolist()
    ]
[97]:
plot_metadata(
    cistopic_obj,
    reduction_name='UMAP',
    variables=['Seurat_cell_type', 'pycisTopic_leiden_10_0.6', 'pycisTopic_leiden_10_1.2', 'pycisTopic_leiden_10_3'],
    target='cell', num_columns=4,
    text_size=10,
    dot_size=5)
../_images/notebooks_human_cerebellum_75_0.png

We can also plot continuous values.

[78]:
plot_metadata(
    cistopic_obj,
    reduction_name='UMAP',
    variables=['log10_unique_fragments_count', 'tss_enrichment', 'Doublet_scores_fragments', 'fraction_of_fragments_in_peaks'],
    target='cell', num_columns=4,
    text_size=10,
    dot_size=5)
../_images/notebooks_human_cerebellum_77_0.png

Let’s visualize the cell-topic contributions.

[81]:
plot_topic(
    cistopic_obj,
    reduction_name = 'UMAP',
    target = 'cell',
    num_columns=5
)
../_images/notebooks_human_cerebellum_79_0.png

Or we can also draw a heatmap with the topic contributions (and annotations).

[90]:
cell_topic_heatmap(
    cistopic_obj,
    variables = ['Seurat_cell_type'],
    scale = False,
    legend_loc_x = 1.0,
    legend_loc_y = -1.2,
    legend_dist_y = -1,
    figsize = (10, 10)
)
../_images/notebooks_human_cerebellum_81_0.png

Topic binarization & QC

Next we can binarize topic-region and cell-topic distributions. The first is useful for exploring the topics with other tools that work with region sets (e.g. GREAT, cisTarget); while the latter is useful to automatically annotate topics.

We will first binarize the topic-region distributions. There are several methods that can be used for this: ‘otsu’ (Otsu, 1979), ‘yen’ (Yen et al., 1995), ‘li’ (Li & Lee, 1993), ‘aucell’ (Van de Sande et al., 2020) or ‘ntop’ (Taking the top n regions per topic). Otsu and Yen’s methods work well for topic-region distributions; however for some downstream analyses (e.g. deep learning) it may be convenient to use ntop to have balanced region sets.

[103]:
from pycisTopic.topic_binarization import binarize_topics
[116]:
region_bin_topics_top_3k = binarize_topics(
    cistopic_obj, method='ntop', ntop = 3_000,
    plot=True, num_columns=5
)
../_images/notebooks_human_cerebellum_84_0.png
[117]:
region_bin_topics_otsu = binarize_topics(
    cistopic_obj, method='otsu',
    plot=True, num_columns=5
)
../_images/notebooks_human_cerebellum_85_0.png

Similarly, we can now binarize the cell-topic distribions.

[101]:
binarized_cell_topic = binarize_topics(
    cistopic_obj,
    target='cell',
    method='li',
    plot=True,
    num_columns=5, nbins=100)
../_images/notebooks_human_cerebellum_87_0.png

Following, we can compute the topic quality control metrics. These include:

  • Number of assignments

  • Topic coherence (Mimno et al., 2011): Measures to which extent high scoring regions in the topic are actually co-accessible in the original data. If it is low it indicates that the topic is rather random. The higher, the better is a topic.

  • The marginal topic distribution: Indicates how much each topic contributes to the model. The higher, the better is a topic.

  • The gini index: Value between 0 and 1, that indicates the specificity of topics (0: General, 1:Specific)

  • If topics have been binarized, the number of regions/cells per topic will be added.

[121]:
from pycisTopic.topic_qc import compute_topic_metrics, plot_topic_qc, topic_annotation
import matplotlib.pyplot as plt
from pycisTopic.utils import fig2img
[118]:
topic_qc_metrics = compute_topic_metrics(cistopic_obj)
[119]:
fig_dict={}
fig_dict['CoherenceVSAssignments']=plot_topic_qc(topic_qc_metrics, var_x='Coherence', var_y='Log10_Assignments', var_color='Gini_index', plot=False, return_fig=True)
fig_dict['AssignmentsVSCells_in_bin']=plot_topic_qc(topic_qc_metrics, var_x='Log10_Assignments', var_y='Cells_in_binarized_topic', var_color='Gini_index', plot=False, return_fig=True)
fig_dict['CoherenceVSCells_in_bin']=plot_topic_qc(topic_qc_metrics, var_x='Coherence', var_y='Cells_in_binarized_topic', var_color='Gini_index', plot=False, return_fig=True)
fig_dict['CoherenceVSRegions_in_bin']=plot_topic_qc(topic_qc_metrics, var_x='Coherence', var_y='Regions_in_binarized_topic', var_color='Gini_index', plot=False, return_fig=True)
fig_dict['CoherenceVSMarginal_dist']=plot_topic_qc(topic_qc_metrics, var_x='Coherence', var_y='Marginal_topic_dist', var_color='Gini_index', plot=False, return_fig=True)
fig_dict['CoherenceVSGini_index']=plot_topic_qc(topic_qc_metrics, var_x='Coherence', var_y='Gini_index', var_color='Gini_index', plot=False, return_fig=True)
[120]:
# Plot topic stats in one figure
fig=plt.figure(figsize=(40, 43))
i = 1
for fig_ in fig_dict.keys():
    plt.subplot(2, 3, i)
    img = fig2img(fig_dict[fig_]) #To convert figures to png to plot together, see .utils.py. This converts the figure to png.
    plt.imshow(img)
    plt.axis('off')
    i += 1
plt.subplots_adjust(wspace=0, hspace=-0.70)
plt.show()
../_images/notebooks_human_cerebellum_92_0.png

Next, we can automatically annotate topics, in this case by cell type. Here we calculate the proportion of cells in each group that are assigned to the binarized topic in comparison to the ratio in the whole data set. We will consider a topic as general if the difference between the ration of cells in the whole data set in the binarized topic and the ratio of total cells in the assigned groups is above 0.2. This indicates that the topic is general, and the propotion test may fail if the topic is enriched in both foreground (the group) and background (the whole data set); resulting in a big difference between the ratios.

[122]:
topic_annot = topic_annotation(
    cistopic_obj,
    annot_var='Seurat_cell_type',
    binarized_cell_topic=binarized_cell_topic,
    general_topic_thr = 0.2
)
/lustre1/project/stg_00002/mambaforge/vsc33053/envs/pycisTopic_polars_tutorial/lib/python3.11/site-packages/statsmodels/stats/weightstats.py:792: RuntimeWarning: divide by zero encountered in scalar divide
  zstat = value / std
[123]:
topic_annot
[123]:
Seurat_cell_type Ratio_cells_in_topic Ratio_group_in_population is_general
Topic1 MOL, INH_SNCG 0.044164 0.274448 False
Topic2 GP, GC 0.036102 0.021381 False
Topic3 BG, MG 0.069401 0.121276 False
Topic4 ENDO 0.009814 0.005608 False
Topic5 BG, AST, MG 0.20715 0.131791 False
Topic6 INH_SST, GP, GC 0.052576 0.04171 False
Topic7 INH_SST, GP, INH_SNCG, INH_VIP, PURK, INH_PVAL... 0.121977 0.090782 False
Topic8 INH_SST, INH_SNCG, INH_PVALB 0.030494 0.038906 False
Topic9 MOL, GC 0.058184 0.276902 False
Topic10 OPC 0.071153 0.059586 False
Topic11 OPC 0.075009 0.059586 False
Topic12 BG, MG 0.179811 0.121276 False
Topic13 NFOL, COP 0.061689 0.120925 False
Topic14 INH_SST, BG, GP, MGL, INH_SNCG, INH_VIP, PURK,... 0.388013 0.31055 False
Topic15 MGL 0.037154 0.022783 False
Topic16 OPC 0.026639 0.059586 False
Topic17 INH_SST, INH_SNCG 0.042411 0.033649 False
Topic18 INH_SST, INH_VIP 0.031546 0.047319 False
Topic19 MOL, NFOL, COP 0.348405 0.382054 False
Topic20 INH_VIP, PURK 0.026288 0.030494 False
Topic21 INH_VIP, PURK 0.010165 0.030494 False
Topic22 MOL, NFOL, COP 0.60007 0.382054 True
Topic23 INH_SST, INH_SNCG, INH_VIP, PURK, INH_PVALB 0.057133 0.069401 False
Topic24 INH_SST, GP, INH_SNCG, INH_VIP, PURK, GC 0.078514 0.085524 False
Topic25 INH_SST, INH_SNCG, MG 0.023134 0.055731 False
Topic26 MOL, NFOL, COP 0.579741 0.382054 False
Topic27 INH_VIP, INH_PVALB, ENDO, MG 0.025237 0.059937 False
Topic28 INH_VIP 0.028041 0.026989 False
Topic29 BG, AST, MG 0.030845 0.131791 False
Topic30 MGL 0.011216 0.022783 False
Topic31 INH_SST, INH_SNCG, INH_PVALB 0.022783 0.038906 False
Topic32 BG, AST, MG 0.186821 0.131791 False
Topic33 GP, GC 0.012969 0.021381 False
Topic34 BG, GP, INH_SNCG, GC 0.043463 0.133894 False
Topic35 INH_SST, INH_SNCG, INH_VIP 0.040659 0.060638 False
Topic36 NFOL, COP 0.229933 0.120925 False
Topic37 AST 0.009113 0.010515 False
Topic38 MOL, NFOL 0.175605 0.35892 False
Topic39 NFOL, COP 0.04837 0.120925 False
Topic40 INH_SST, BG, GP, INH_SNCG, INH_VIP, PURK, INH_... 0.384508 0.287767 False

Differentially Accessible Regions (DARs)

Together with working with regulatory topics, we can also identify differentially accessible regions (DARs) between cell types. First, we will impute the region accessibility exploting the cell-topic and topic-region probabilities. To shrink very low probability values to 0, we use a scale factor (by default: 10^6).

[131]:
from pycisTopic.diff_features import (
    impute_accessibility,
    normalize_scores,
    find_highly_variable_features,
    find_diff_features
)
import numpy as np
[125]:
imputed_acc_obj = impute_accessibility(
    cistopic_obj,
    selected_cells=None,
    selected_regions=None,
    scale_factor=10**6
)
2024-03-06 16:55:01,494 cisTopic     INFO     Imputing region accessibility
2024-03-06 16:55:01,494 cisTopic     INFO     Impute region accessibility for regions 0-20000
2024-03-06 16:55:01,927 cisTopic     INFO     Impute region accessibility for regions 20000-40000
2024-03-06 16:55:02,201 cisTopic     INFO     Impute region accessibility for regions 40000-60000
2024-03-06 16:55:02,472 cisTopic     INFO     Impute region accessibility for regions 60000-80000
2024-03-06 16:55:02,746 cisTopic     INFO     Impute region accessibility for regions 80000-100000
2024-03-06 16:55:03,017 cisTopic     INFO     Impute region accessibility for regions 100000-120000
2024-03-06 16:55:03,291 cisTopic     INFO     Impute region accessibility for regions 120000-140000
2024-03-06 16:55:03,564 cisTopic     INFO     Impute region accessibility for regions 140000-160000
2024-03-06 16:55:03,836 cisTopic     INFO     Impute region accessibility for regions 160000-180000
2024-03-06 16:55:04,104 cisTopic     INFO     Impute region accessibility for regions 180000-200000
2024-03-06 16:55:04,373 cisTopic     INFO     Impute region accessibility for regions 200000-220000
2024-03-06 16:55:04,642 cisTopic     INFO     Impute region accessibility for regions 220000-240000
2024-03-06 16:55:04,911 cisTopic     INFO     Impute region accessibility for regions 240000-260000
2024-03-06 16:55:05,178 cisTopic     INFO     Impute region accessibility for regions 260000-280000
2024-03-06 16:55:05,458 cisTopic     INFO     Impute region accessibility for regions 280000-300000
2024-03-06 16:55:05,725 cisTopic     INFO     Impute region accessibility for regions 300000-320000
2024-03-06 16:55:05,995 cisTopic     INFO     Impute region accessibility for regions 320000-340000
2024-03-06 16:55:06,263 cisTopic     INFO     Impute region accessibility for regions 340000-360000
2024-03-06 16:55:06,535 cisTopic     INFO     Impute region accessibility for regions 360000-380000
2024-03-06 16:55:06,803 cisTopic     INFO     Impute region accessibility for regions 380000-400000
2024-03-06 16:55:07,071 cisTopic     INFO     Impute region accessibility for regions 400000-420000
2024-03-06 16:55:07,339 cisTopic     INFO     Impute region accessibility for regions 420000-440000
2024-03-06 16:55:07,578 cisTopic     INFO     Done!
[126]:
normalized_imputed_acc_obj = normalize_scores(imputed_acc_obj, scale_factor=10**4)
2024-03-06 16:55:12,062 cisTopic     INFO     Normalizing imputed data
2024-03-06 16:55:18,978 cisTopic     INFO     Done!

Optionally, we can identify highly variable regions. This is not mandatory, but will speed up the hypothesis testing step for identifying DARs.

[129]:
variable_regions = find_highly_variable_features(
    normalized_imputed_acc_obj,
    min_disp = 0.05,
    min_mean = 0.0125,
    max_mean = 3,
    max_disp = np.inf,
    n_bins=20,
    n_top_features=None,
    plot=True
)
2024-03-06 16:55:41,694 cisTopic     INFO     Calculating mean
2024-03-06 16:55:43,121 cisTopic     INFO     Calculating variance
../_images/notebooks_human_cerebellum_101_1.png
2024-03-06 16:56:02,058 cisTopic     INFO     Done!
[130]:
len(variable_regions)
[130]:
59709

We can now identify differentially accessible regions between groups. By default, this function will perform a Wilcoxon rank-sum test between each group using the specified variable and the rest. Alternatively, specified contrast can be provided as a list with foreground and background groups (e.g. for group 1 versus group 2 and 3, and group 2 versus group 1 and 3: [[[‘Group_1’], [‘Group_2, ‘Group_3’]], [[‘Group_2’], [‘Group_1, ‘Group_3’]]]).

[132]:
markers_dict= find_diff_features(
    cistopic_obj,
    imputed_acc_obj,
    variable='Seurat_cell_type',
    var_features=variable_regions,
    contrasts=None,
    adjpval_thr=0.05,
    log2fc_thr=np.log2(1.5),
    n_cpu=5,
    _temp_dir='/scratch/leuven/330/vsc33053/ray_spill',
    split_pattern = '-'
)
2024-03-06 16:57:06,033 INFO worker.py:1724 -- Started a local Ray instance.
2024-03-06 16:57:06,739 cisTopic     INFO     Subsetting data for AST (30 of 2853)
2024-03-06 16:57:10,317 cisTopic     INFO     Computing p-value for AST
2024-03-06 16:57:23,252 cisTopic     INFO     Computing log2FC for AST
2024-03-06 16:57:24,824 cisTopic     INFO     AST done!
2024-03-06 16:57:24,827 cisTopic     INFO     Subsetting data for BG (283 of 2853)
2024-03-06 16:57:24,875 cisTopic     INFO     Computing p-value for BG
2024-03-06 16:57:30,629 cisTopic     INFO     Computing log2FC for BG
2024-03-06 16:57:30,667 cisTopic     INFO     BG done!
2024-03-06 16:57:30,669 cisTopic     INFO     Subsetting data for COP (66 of 2853)
2024-03-06 16:57:30,700 cisTopic     INFO     Computing p-value for COP
2024-03-06 16:57:36,519 cisTopic     INFO     Computing log2FC for COP
2024-03-06 16:57:36,552 cisTopic     INFO     COP done!
2024-03-06 16:57:36,554 cisTopic     INFO     Subsetting data for ENDO (16 of 2853)
2024-03-06 16:57:36,582 cisTopic     INFO     Computing p-value for ENDO
2024-03-06 16:57:42,393 cisTopic     INFO     Computing log2FC for ENDO
2024-03-06 16:57:42,423 cisTopic     INFO     ENDO done!
2024-03-06 16:57:42,425 cisTopic     INFO     Subsetting data for GC (45 of 2853)
2024-03-06 16:57:42,459 cisTopic     INFO     Computing p-value for GC
2024-03-06 16:57:48,248 cisTopic     INFO     Computing log2FC for GC
2024-03-06 16:57:48,283 cisTopic     INFO     GC done!
2024-03-06 16:57:48,285 cisTopic     INFO     Subsetting data for GP (16 of 2853)
2024-03-06 16:57:48,311 cisTopic     INFO     Computing p-value for GP
2024-03-06 16:57:54,105 cisTopic     INFO     Computing log2FC for GP
2024-03-06 16:57:54,140 cisTopic     INFO     GP done!
2024-03-06 16:57:54,142 cisTopic     INFO     Subsetting data for INH_PVALB (15 of 2853)
2024-03-06 16:57:54,169 cisTopic     INFO     Computing p-value for INH_PVALB
2024-03-06 16:58:00,011 cisTopic     INFO     Computing log2FC for INH_PVALB
2024-03-06 16:58:00,045 cisTopic     INFO     INH_PVALB done!
2024-03-06 16:58:00,047 cisTopic     INFO     Subsetting data for INH_SNCG (38 of 2853)
2024-03-06 16:58:00,094 cisTopic     INFO     Computing p-value for INH_SNCG
2024-03-06 16:58:06,018 cisTopic     INFO     Computing log2FC for INH_SNCG
2024-03-06 16:58:06,054 cisTopic     INFO     INH_SNCG done!
2024-03-06 16:58:06,056 cisTopic     INFO     Subsetting data for INH_SST (58 of 2853)
2024-03-06 16:58:06,107 cisTopic     INFO     Computing p-value for INH_SST
2024-03-06 16:58:11,934 cisTopic     INFO     Computing log2FC for INH_SST
2024-03-06 16:58:11,968 cisTopic     INFO     INH_SST done!
2024-03-06 16:58:11,970 cisTopic     INFO     Subsetting data for INH_VIP (77 of 2853)
2024-03-06 16:58:12,011 cisTopic     INFO     Computing p-value for INH_VIP
2024-03-06 16:58:17,978 cisTopic     INFO     Computing log2FC for INH_VIP
2024-03-06 16:58:18,021 cisTopic     INFO     INH_VIP done!
2024-03-06 16:58:18,023 cisTopic     INFO     Subsetting data for MG (63 of 2853)
2024-03-06 16:58:18,052 cisTopic     INFO     Computing p-value for MG
2024-03-06 16:58:24,203 cisTopic     INFO     Computing log2FC for MG
2024-03-06 16:58:24,239 cisTopic     INFO     MG done!
2024-03-06 16:58:24,241 cisTopic     INFO     Subsetting data for MGL (65 of 2853)
2024-03-06 16:58:24,272 cisTopic     INFO     Computing p-value for MGL
2024-03-06 16:58:30,122 cisTopic     INFO     Computing log2FC for MGL
2024-03-06 16:58:30,154 cisTopic     INFO     MGL done!
2024-03-06 16:58:30,157 cisTopic     INFO     Subsetting data for MOL (745 of 2853)
2024-03-06 16:58:30,191 cisTopic     INFO     Computing p-value for MOL
2024-03-06 16:58:35,955 cisTopic     INFO     Computing log2FC for MOL
2024-03-06 16:58:35,996 cisTopic     INFO     MOL done!
2024-03-06 16:58:35,998 cisTopic     INFO     Subsetting data for NFOL (279 of 2853)
2024-03-06 16:58:36,062 cisTopic     INFO     Computing p-value for NFOL
2024-03-06 16:58:41,890 cisTopic     INFO     Computing log2FC for NFOL
2024-03-06 16:58:41,925 cisTopic     INFO     NFOL done!
2024-03-06 16:58:41,927 cisTopic     INFO     Subsetting data for OPC (170 of 2853)
2024-03-06 16:58:41,959 cisTopic     INFO     Computing p-value for OPC
2024-03-06 16:58:47,808 cisTopic     INFO     Computing log2FC for OPC
2024-03-06 16:58:47,842 cisTopic     INFO     OPC done!
2024-03-06 16:58:47,844 cisTopic     INFO     Subsetting data for PURK (10 of 2853)
2024-03-06 16:58:47,870 cisTopic     INFO     Computing p-value for PURK
2024-03-06 16:58:53,610 cisTopic     INFO     Computing log2FC for PURK
2024-03-06 16:58:53,639 cisTopic     INFO     PURK done!

We can also plot region accessibility onto the cell-topic UMAP. For example, let’s check how the best DARs for some cell types look like.

[133]:
from pycisTopic.clust_vis import plot_imputed_features
[134]:
plot_imputed_features(
    cistopic_obj,
    reduction_name='UMAP',
    imputed_data=imputed_acc_obj,
    features=[markers_dict[x].index.tolist()[0] for x in ['BG', 'GC', 'INH_SST', 'COP']],
    scale=False,
    num_columns=4
)
../_images/notebooks_human_cerebellum_107_0.png
[139]:
print("Number of DARs found:")
print("---------------------")
for x in markers_dict:
    print(f"  {x}: {len(markers_dict[x])}")
Number of DARs found:
---------------------
  AST: 22352
  BG: 26552
  COP: 11739
  ENDO: 9340
  GC: 12518
  GP: 12200
  INH_PVALB: 12405
  INH_SNCG: 12536
  INH_SST: 12082
  INH_VIP: 14240
  MG: 25522
  MGL: 9438
  MOL: 20705
  NFOL: 16428
  OPC: 17949
  PURK: 13188

Save region sets

[142]:
os.makedirs(os.path.join(out_dir, "region_sets"), exist_ok = True)
os.makedirs(os.path.join(out_dir, "region_sets", "Topics_otsu"), exist_ok = True)
os.makedirs(os.path.join(out_dir, "region_sets", "Topics_top_3k"), exist_ok = True)
os.makedirs(os.path.join(out_dir, "region_sets", "DARs_cell_type"), exist_ok = True)
[141]:
from pycisTopic.utils import region_names_to_coordinates
[143]:
for topic in region_bin_topics_otsu:
    region_names_to_coordinates(
        region_bin_topics_otsu[topic].index
    ).sort_values(
        ["Chromosome", "Start", "End"]
    ).to_csv(
        os.path.join(out_dir, "region_sets", "Topics_otsu", f"{topic}.bed"),
        sep = "\t",
        header = False, index = False
    )
[144]:
for topic in region_bin_topics_top_3k:
    region_names_to_coordinates(
        region_bin_topics_top_3k[topic].index
    ).sort_values(
        ["Chromosome", "Start", "End"]
    ).to_csv(
        os.path.join(out_dir, "region_sets", "Topics_top_3k", f"{topic}.bed"),
        sep = "\t",
        header = False, index = False
    )
[145]:
for cell_type in markers_dict:
    region_names_to_coordinates(
        markers_dict[cell_type].index
    ).sort_values(
        ["Chromosome", "Start", "End"]
    ).to_csv(
        os.path.join(out_dir, "region_sets", "DARs_cell_type", f"{cell_type}.bed"),
        sep = "\t",
        header = False, index = False
    )

Gene activity

Now we can infer gene activity.

In this function there are several options to evaluate: - Search space: The user can choose whether the search space should be include other genes or not (use_gene_boundaries), and the minimum and maximum distance it should have (upstream and downstream) - Distance weight: The parameters related to the distance weight measure the impact of distance when inferring region to gene weights as an exponential function. The user can control whether this weight should be used (distance_weight) and the effect of the distance (decay_rate). - Gene size weight: Large genes may have more peaks by chance. The user can optionally apply a weight based on the size of each gene (gene_size_weight), which by default is dividing the size of each gene by the median gene size in the genome. Alternatively, the user can also use average_scores which will calculate the gene activity as the mean weighted region accessibility of all regions linked to the gene. - Gini weight: This weight will give more importance to specific regions (gini_weight).

[157]:
import pyranges as pr
from pycisTopic.gene_activity import get_gene_activity
[150]:
chromsizes = pd.read_table(os.path.join(out_dir, "qc", "hg38.chrom_sizes_and_alias.tsv"))
chromsizes
[150]:
# ucsc length ensembl refseq_id genbank_id
0 chr1 248956422 1 NC_000001.11 CM000663.2
1 chr2 242193529 2 NC_000002.12 CM000664.2
2 chr3 198295559 3 NC_000003.12 CM000665.2
3 chr4 190214555 4 NC_000004.12 CM000666.2
4 chr5 181538259 5 NC_000005.10 CM000667.2
... ... ... ... ... ...
449 chrUn_KI270539v1 993 HSCHRUN_RANDOM_146 NT_187442.1 KI270539.1
450 chrUn_KI270385v1 990 HSCHRUN_RANDOM_195 NT_187487.1 KI270385.1
451 chrUn_KI270423v1 981 HSCHRUN_RANDOM_121 NT_187417.1 KI270423.1
452 chrUn_KI270392v1 971 HSCHRUN_RANDOM_193 NT_187485.1 KI270392.1
453 chrUn_KI270394v1 970 HSCHRUN_RANDOM_187 NT_187479.1 KI270394.1

454 rows × 5 columns

[153]:
chromsizes.rename({"# ucsc": "Chromosome", "length": "End"}, axis = 1, inplace = True)
chromsizes["Start"] = 0
chromsizes = pr.PyRanges(chromsizes[["Chromosome", "Start", "End"]])
[154]:
chromsizes
[154]:
Chromosome Start End
0 chr1 0 248956422
1 chr1_GL383518v1_alt 0 182439
2 chr1_GL383519v1_alt 0 110268
3 chr1_GL383520v2_alt 0 366580
4 chr1_KI270706v1_random 0 175055
... ... ... ...
449 chrX_KI270880v1_alt 0 284869
450 chrX_KI270881v1_alt 0 144206
451 chrX_KI270913v1_alt 0 274009
452 chrY 0 57227415
453 chrY_KI270740v1_random 0 37240

454 rows × 3 columns

[167]:
pr_annotation = pd.read_table(
        os.path.join(out_dir, "qc", "tss.bed")
    ).rename(
        {"Name": "Gene", "# Chromosome": "Chromosome"}, axis = 1)
pr_annotation["Transcription_Start_Site"] = pr_annotation["Start"]
pr_annotation = pr.PyRanges(pr_annotation)
pr_annotation
[167]:
Chromosome Start End Gene Score Strand Transcript_type Transcription_Start_Site
0 GL000009.2 58375 58376 NaN . - protein_coding 58375
1 GL000194.1 115017 115018 NaN . - protein_coding 115017
2 GL000194.1 115054 115055 MAFIP . - protein_coding 115054
3 GL000195.1 49163 49164 NaN . - protein_coding 49163
4 GL000213.1 139654 139655 NaN . - protein_coding 139654
... ... ... ... ... ... ... ... ...
87275 chrY 2500554 2500555 ZBED1 . - protein_coding 2500554
87276 chrY 2500894 2500895 ZBED1 . - protein_coding 2500894
87277 chrY 318795 318796 GTPBP6 . - protein_coding 318795
87278 chrY 1212648 1212649 CRLF2 . - protein_coding 1212648
87279 chrY 1212633 1212634 CRLF2 . - protein_coding 1212633

87280 rows × 8 columns

[168]:
gene_act, weigths = get_gene_activity(
    imputed_acc_obj,
    pr_annotation,
    chromsizes,
    use_gene_boundaries=True, # Whether to use the whole search space or stop when encountering another gene
    upstream=[1000, 100000], # Search space upstream. The minimum means that even if there is a gene right next to it
                             # these bp will be taken (1kbp here)
    downstream=[1000,100000], # Search space downstream
    distance_weight=True, # Whether to add a distance weight (an exponential function, the weight will decrease with distance)
    decay_rate=1, # Exponent for the distance exponential funciton (the higher the faster will be the decrease)
    extend_gene_body_upstream=10000, # Number of bp upstream immune to the distance weight (their value will be maximum for
                          #this weight)
    extend_gene_body_downstream=500, # Number of bp downstream immune to the distance weight
    gene_size_weight=False, # Whether to add a weights based on the length of the gene
    gene_size_scale_factor='median', # Dividend to calculate the gene size weigth. Default is the median value of all genes
                          #in the genome
    remove_promoters=False, # Whether to remove promoters when computing gene activity scores
    average_scores=True, # Whether to divide by the total number of region assigned to a gene when calculating the gene
                          #activity score
    scale_factor=1, # Value to multiply for the final gene activity matrix
    extend_tss=[10,10], # Space to consider a promoter
    gini_weight = True, # Whether to add a gini index weigth. The more unique the region is, the higher this weight will be
    return_weights= True, # Whether to return the final weights
    project='Gene_activity') # Project name for the gene activity object
Warning! Start and End columns now have different dtypes: int32 and int64
Warning! Start and End columns now have different dtypes: int32 and int64
2024-03-06 17:13:02,671 cisTopic     INFO     Calculating gene boundaries
Warning! Start and End columns now have different dtypes: int32 and int64
Warning! Start and End columns now have different dtypes: int32 and int64
Warning! Start and End columns now have different dtypes: int32 and int64
2024-03-06 17:13:17,782 cisTopic     INFO     Calculating distances
join: Strand data from other will be added as strand data to self.
If this is undesired use the flag apply_strand_suffix=False.
To turn off the warning set apply_strand_suffix to True or False.
2024-03-06 17:13:33,422 cisTopic     INFO     Calculating distance weigths
2024-03-06 17:13:34,066 cisTopic     INFO     Distance weights done
2024-03-06 17:13:34,066 cisTopic     INFO     Calculating gini weights
2024-03-06 17:13:47,195 cisTopic     INFO     Getting gene activity scores
2024-03-06 17:17:14,591 cisTopic     INFO     Creating imputed features object

As we did before for the imputed region accessibility, we can also infer the Differentially Accessible Genes (DAGs).

[169]:
DAG_markers_dict= find_diff_features(
    cistopic_obj,
    gene_act,
    variable='Seurat_cell_type',
    var_features=None,
    contrasts=None,
    adjpval_thr=0.05,
    log2fc_thr=np.log2(1.5),
    n_cpu=5,
    _temp_dir='/scratch/leuven/330/vsc33053/ray_spill',
    split_pattern = '-')
2024-03-06 17:17:33,892 INFO worker.py:1724 -- Started a local Ray instance.
2024-03-06 17:17:34,368 cisTopic     INFO     Subsetting data for AST (30 of 2853)
2024-03-06 17:17:36,523 cisTopic     INFO     Computing p-value for AST
2024-03-06 17:17:40,292 cisTopic     INFO     Computing log2FC for AST
2024-03-06 17:17:41,563 cisTopic     INFO     AST done!
2024-03-06 17:17:41,564 cisTopic     INFO     Subsetting data for BG (283 of 2853)
2024-03-06 17:17:41,588 cisTopic     INFO     Computing p-value for BG
2024-03-06 17:17:44,487 cisTopic     INFO     Computing log2FC for BG
2024-03-06 17:17:44,500 cisTopic     INFO     BG done!
2024-03-06 17:17:44,502 cisTopic     INFO     Subsetting data for COP (66 of 2853)
2024-03-06 17:17:44,527 cisTopic     INFO     Computing p-value for COP
2024-03-06 17:17:47,519 cisTopic     INFO     Computing log2FC for COP
2024-03-06 17:17:47,533 cisTopic     INFO     COP done!
2024-03-06 17:17:47,535 cisTopic     INFO     Subsetting data for ENDO (16 of 2853)
2024-03-06 17:17:47,566 cisTopic     INFO     Computing p-value for ENDO
2024-03-06 17:17:50,431 cisTopic     INFO     Computing log2FC for ENDO
2024-03-06 17:17:50,469 cisTopic     INFO     ENDO done!
2024-03-06 17:17:50,471 cisTopic     INFO     Subsetting data for GC (45 of 2853)
2024-03-06 17:17:50,486 cisTopic     INFO     Computing p-value for GC
2024-03-06 17:17:53,347 cisTopic     INFO     Computing log2FC for GC
2024-03-06 17:17:53,366 cisTopic     INFO     GC done!
2024-03-06 17:17:53,368 cisTopic     INFO     Subsetting data for GP (16 of 2853)
2024-03-06 17:17:53,381 cisTopic     INFO     Computing p-value for GP
2024-03-06 17:17:56,200 cisTopic     INFO     Computing log2FC for GP
2024-03-06 17:17:56,239 cisTopic     INFO     GP done!
2024-03-06 17:17:56,241 cisTopic     INFO     Subsetting data for INH_PVALB (15 of 2853)
2024-03-06 17:17:56,255 cisTopic     INFO     Computing p-value for INH_PVALB
2024-03-06 17:17:59,070 cisTopic     INFO     Computing log2FC for INH_PVALB
2024-03-06 17:17:59,107 cisTopic     INFO     INH_PVALB done!
2024-03-06 17:17:59,108 cisTopic     INFO     Subsetting data for INH_SNCG (38 of 2853)
2024-03-06 17:17:59,251 cisTopic     INFO     Computing p-value for INH_SNCG
2024-03-06 17:18:02,159 cisTopic     INFO     Computing log2FC for INH_SNCG
2024-03-06 17:18:02,209 cisTopic     INFO     INH_SNCG done!
2024-03-06 17:18:02,210 cisTopic     INFO     Subsetting data for INH_SST (58 of 2853)
2024-03-06 17:18:02,226 cisTopic     INFO     Computing p-value for INH_SST
2024-03-06 17:18:05,034 cisTopic     INFO     Computing log2FC for INH_SST
2024-03-06 17:18:05,048 cisTopic     INFO     INH_SST done!
2024-03-06 17:18:05,050 cisTopic     INFO     Subsetting data for INH_VIP (77 of 2853)
2024-03-06 17:18:05,069 cisTopic     INFO     Computing p-value for INH_VIP
2024-03-06 17:18:07,901 cisTopic     INFO     Computing log2FC for INH_VIP
2024-03-06 17:18:07,955 cisTopic     INFO     INH_VIP done!
2024-03-06 17:18:07,957 cisTopic     INFO     Subsetting data for MG (63 of 2853)
2024-03-06 17:18:07,972 cisTopic     INFO     Computing p-value for MG
2024-03-06 17:18:10,843 cisTopic     INFO     Computing log2FC for MG
2024-03-06 17:18:10,860 cisTopic     INFO     MG done!
2024-03-06 17:18:10,861 cisTopic     INFO     Subsetting data for MGL (65 of 2853)
2024-03-06 17:18:10,876 cisTopic     INFO     Computing p-value for MGL
2024-03-06 17:18:13,720 cisTopic     INFO     Computing log2FC for MGL
2024-03-06 17:18:13,735 cisTopic     INFO     MGL done!
2024-03-06 17:18:13,736 cisTopic     INFO     Subsetting data for MOL (745 of 2853)
2024-03-06 17:18:13,758 cisTopic     INFO     Computing p-value for MOL
2024-03-06 17:18:16,621 cisTopic     INFO     Computing log2FC for MOL
2024-03-06 17:18:16,634 cisTopic     INFO     MOL done!
2024-03-06 17:18:16,636 cisTopic     INFO     Subsetting data for NFOL (279 of 2853)
2024-03-06 17:18:16,654 cisTopic     INFO     Computing p-value for NFOL
2024-03-06 17:18:19,518 cisTopic     INFO     Computing log2FC for NFOL
2024-03-06 17:18:19,535 cisTopic     INFO     NFOL done!
2024-03-06 17:18:19,538 cisTopic     INFO     Subsetting data for OPC (170 of 2853)
2024-03-06 17:18:19,556 cisTopic     INFO     Computing p-value for OPC
2024-03-06 17:18:22,411 cisTopic     INFO     Computing log2FC for OPC
2024-03-06 17:18:22,439 cisTopic     INFO     OPC done!
2024-03-06 17:18:22,441 cisTopic     INFO     Subsetting data for PURK (10 of 2853)
2024-03-06 17:18:22,457 cisTopic     INFO     Computing p-value for PURK
2024-03-06 17:18:25,299 cisTopic     INFO     Computing log2FC for PURK
2024-03-06 17:18:25,317 cisTopic     INFO     PURK done!
[171]:
plot_imputed_features(
    cistopic_obj,
    reduction_name='UMAP',
    imputed_data=gene_act,
    features=['PDGFRA', 'OLIG2', 'MBP', 'SOX10', # Olig differentiation
              'CTNNA3', 'ENPP6', 'OLIG1', # Olig differentiation
              'GAD2', 'VIP', 'SST', 'CTXN3', # Int
              'NFIB', 'SOX9',  #Ast
              'LEF1', #Endo
              'SPI1'], #Glia
    scale=True,
    num_columns=4
)
../_images/notebooks_human_cerebellum_124_0.png
[172]:
print("Number of DAGs found:")
print("---------------------")
for x in markers_dict:
    print(f"  {x}: {len(DAG_markers_dict[x])}")
Number of DAGs found:
---------------------
  AST: 2321
  BG: 2668
  COP: 586
  ENDO: 2583
  GC: 2816
  GP: 2948
  INH_PVALB: 2683
  INH_SNCG: 2736
  INH_SST: 2786
  INH_VIP: 2714
  MG: 2256
  MGL: 4022
  MOL: 1735
  NFOL: 704
  OPC: 2487
  PURK: 2650

Label transfer

Exploting the gene activity scores, we can transfer labels from a reference data set (e.g. scRNA-seq). As an example, we will transfer labels from the scRNA-seq layer of this data set.

The methods available for label transferring are: - ingest (from scanpy) - harmony (Korsunsky et al, 2019) - bbknn (Polański et al, 2020) - scanorama (Hie et al, 2019) - cca.

Except for ingest, these methods return a common coembedding and labels are inferred using the distances between query and refenrence cells as weights.

[185]:
from pycisTopic.label_transfer import label_transfer
[199]:
import scanpy as sc
rna_anndata = sc.read_h5ad(
    "/staging/leuven/stg_00002/lcb/sdewin/PhD/python_modules/scenicplus_development_tutorial/scRNA_seq_pp/adata.h5ad"
).raw.to_adata()
atac_anndata = sc.AnnData(gene_act.mtx.T, obs = pd.DataFrame(index = gene_act.cell_names), var = pd.DataFrame(index = gene_act.feature_names))
[200]:
atac_anndata.obs["sample_id"] = "10x_multiome_brain"
rna_anndata.obs["sample_id"] = "10x_multiome_brain"
[204]:
label_dict = label_transfer(
    rna_anndata,
    atac_anndata,
    labels_to_transfer = ['Seurat_cell_type'],
    variable_genes = True,
    methods = ['ingest', 'harmony', 'bbknn', 'scanorama', 'cca'],
    return_label_weights = False,
    _temp_dir= '/scratch/leuven/330/vsc33053/ray_spill/'
)
2024-03-06 17:30:35,512 cisTopic     INFO     Normalizing RNA data
WARNING: adata.X seems to be already log-transformed.
/lustre1/project/stg_00002/mambaforge/vsc33053/envs/pycisTopic_polars_tutorial/lib/python3.11/site-packages/scanpy/preprocessing/_simple.py:843: UserWarning: Received a view of an AnnData. Making a copy.
  view_to_actual(adata)
2024-03-06 17:30:41,665 cisTopic     INFO     Processing 1 query sample(s) using 1 cpu(s)
2024-03-06 17:30:44,886 INFO worker.py:1724 -- Started a local Ray instance.
(label_transfer_ray pid=3796267) 2024-03-06 17:30:47,417 cisTopic     INFO     Normalizing ATAC data from sample 10x_multiome_brain
(label_transfer_ray pid=3796267) /lustre1/project/stg_00002/mambaforge/vsc33053/envs/pycisTopic_polars_tutorial/lib/python3.11/site-packages/scanpy/preprocessing/_simple.py:843: UserWarning: Received a view of an AnnData. Making a copy.
(label_transfer_ray pid=3796267)   view_to_actual(adata)
(label_transfer_ray pid=3796267) /lustre1/project/stg_00002/mambaforge/vsc33053/envs/pycisTopic_polars_tutorial/lib/python3.11/site-packages/pycisTopic/label_transfer.py:237: FutureWarning: Use anndata.concat instead of AnnData.concatenate, AnnData.concatenate is deprecated and will be removed in the future. See the tutorial for concat at: https://anndata.readthedocs.io/en/latest/concatenation.html
(label_transfer_ray pid=3796267)   adata_concat = ref_anndata.concatenate(
(label_transfer_ray pid=3796267) 2024-03-06 17:30:55,619 cisTopic     INFO     Running integration with ingest
(label_transfer_ray pid=3796267) /lustre1/project/stg_00002/mambaforge/vsc33053/envs/pycisTopic_polars_tutorial/lib/python3.11/site-packages/umap/umap_.py:1943: UserWarning: n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism.
(label_transfer_ray pid=3796267)   warn(f"n_jobs value {self.n_jobs} overridden to 1 by setting random_state. Use no seed for parallelism.")
(label_transfer_ray pid=3796267) <frozen _collections_abc>:949: ImplicitModificationWarning: Setting element `.obsm['rep']` of view, initializing view as actual.
(label_transfer_ray pid=3796267) <frozen _collections_abc>:949: ImplicitModificationWarning: Setting element `.obsm['X_umap']` of view, initializing view as actual.
(label_transfer_ray pid=3796267) <frozen _collections_abc>:949: ImplicitModificationWarning: Setting element `.obsm['X_pca']` of view, initializing view as actual.
(label_transfer_ray pid=3796267) /lustre1/project/stg_00002/mambaforge/vsc33053/envs/pycisTopic_polars_tutorial/lib/python3.11/site-packages/pycisTopic/label_transfer.py:247: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
(label_transfer_ray pid=3796267)   query_anndata.obs.loc[:, var] = query_anndata.obs.loc[
(label_transfer_ray pid=3796267) 2024-03-06 17:31:28,883 cisTopic     INFO     Running integration with harmony
(label_transfer_ray pid=3796267) 2024-03-06 17:31:29,391 harmonypy    INFO     Computing initial centroids with sklearn.KMeans...
(label_transfer_ray pid=3796267) 2024-03-06 17:31:29,391 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans...
(label_transfer_ray pid=3796267) 2024-03-06 17:31:30,685 harmonypy    INFO     sklearn.KMeans initialization complete.
(label_transfer_ray pid=3796267) 2024-03-06 17:31:30,697 harmonypy    INFO     Iteration 1 of 10
(label_transfer_ray pid=3796267) 2024-03-06 17:31:30,685 - harmonypy - INFO - sklearn.KMeans initialization complete.
(label_transfer_ray pid=3796267) 2024-03-06 17:31:30,697 - harmonypy - INFO - Iteration 1 of 10
(label_transfer_ray pid=3796267) 2024-03-06 17:31:31,458 harmonypy    INFO     Iteration 2 of 10
(label_transfer_ray pid=3796267) 2024-03-06 17:31:31,458 - harmonypy - INFO - Iteration 2 of 10
(label_transfer_ray pid=3796267) 2024-03-06 17:31:32,189 harmonypy    INFO     Iteration 3 of 10
(label_transfer_ray pid=3796267) 2024-03-06 17:31:32,189 - harmonypy - INFO - Iteration 3 of 10
(label_transfer_ray pid=3796267) 2024-03-06 17:31:32,916 harmonypy    INFO     Iteration 4 of 10
(label_transfer_ray pid=3796267) 2024-03-06 17:31:32,916 - harmonypy - INFO - Iteration 4 of 10
(label_transfer_ray pid=3796267) 2024-03-06 17:31:33,682 harmonypy    INFO     Converged after 4 iterations
(label_transfer_ray pid=3796267) 2024-03-06 17:31:33,722 cisTopic     INFO     Running integration with bbknn
(label_transfer_ray pid=3796267) 2024-03-06 17:31:33,682 - harmonypy - INFO - Converged after 4 iterations
(label_transfer_ray pid=3796267) WARNING: consider updating your call to make use of `computation`
(label_transfer_ray pid=3796267) 2024-03-06 17:31:45,983 cisTopic     INFO     Running integration with scanorama
(label_transfer_ray pid=3796267) Found 1430 genes among all datasets
(label_transfer_ray pid=3796267) [[0.         0.34805468]
(label_transfer_ray pid=3796267)  [0.         0.        ]]
(label_transfer_ray pid=3796267) Processing datasets (0, 1)
(label_transfer_ray pid=3796267) 2024-03-06 17:31:50,070 cisTopic     INFO     Running integration with cca

We can now add the annotations to our cisTopic object and visualize them in the cell-topic UMAP. Scanorama and harmony are the ones that work best.

[205]:
label_dict_x=[label_dict[key] for key in label_dict.keys()]
label_pd = pd.concat(label_dict_x, axis=1, sort=False)
label_pd.index = cistopic_obj.cell_names
label_pd.columns = ['pycisTopic_' + x for x in label_pd.columns]
cistopic_obj.add_cell_data(label_pd, split_pattern = '-')
[207]:
plot_metadata(
    cistopic_obj,
    reduction_name='UMAP',
    variables=['Seurat_cell_type'] + label_pd.columns.to_list(),
    remove_nan=True,
    seed=555,
    num_columns=3)
../_images/notebooks_human_cerebellum_133_0.png

Let’s plot a confusion matrix.

[236]:
import seaborn as sns
fig, axs = plt.subplots(ncols = 3, nrows = 2, figsize = (3 * 5, 2 * 5))
for method, ax in zip(label_pd.columns.to_list(), axs.ravel()):
    conf_mat = pd.crosstab(cistopic_obj.cell_data["Seurat_cell_type"], cistopic_obj.cell_data[method])
    conf_mat = conf_mat / conf_mat.sum()
    sns.heatmap(conf_mat.loc[conf_mat.columns], ax = ax)
fig.tight_layout()
fig.show()
../_images/notebooks_human_cerebellum_135_0.png

Exporting to loom

We can now create loom files to further explore the results. There are two types of loom files:

  • Region accessibility: These loom files include the imputed accessibility as matrix, topics as regulons and cell-topic distributions as AUC matrices. The imputed values, the cistopic object used for the imputation and the cell-topic and topic-region binarized distributions are required. Alternatively, we can also provide different clustering and the marker regions (DARs) per group per clustering.

  • Gene activity: These loom files contain the gene activity as matrix, scRNA-seq derived regulons and their AUC values based on gene activity. The gene activity values, the cistopic object, and scRNA-seq derived regulons are required. Alternatively, we can also provide different clustering and the marker genes (DAGs) per group per clustering.

[240]:
from pycisTopic.loom import export_region_accessibility_to_loom, export_gene_activity_to_loom
[241]:
cluster_markers = {'Seurat_cell_type': markers_dict}
[242]:
os.makedirs(os.path.join(out_dir, "loom"), exist_ok=True)
[244]:
export_region_accessibility_to_loom(
    accessibility_matrix = imputed_acc_obj,
    cistopic_obj = cistopic_obj,
    binarized_topic_region = region_bin_topics_otsu,
    binarized_cell_topic = binarized_cell_topic,
    selected_cells = cistopic_obj.projections['cell']['UMAP'].index.tolist(),
    out_fname = os.path.join(out_dir, "loom", "10x_multiome_brain_pycisTopic_region_accessibility.loom"),
    cluster_annotation = ['Seurat_cell_type'],
    cluster_markers = cluster_markers,
    tree_structure = ('10x_multiome_brain', 'pycisTopic', 'noDBL_all'),
    title = 'Tutorial - Region accessibility all',
    nomenclature = "hg38",
    split_pattern = '-'
)
/lustre1/project/stg_00002/mambaforge/vsc33053/envs/pycisTopic_polars_tutorial/lib/python3.11/site-packages/pycisTopic/loom.py:619: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
  regulon_mat.loc[:, col_idx] = np.where(
2024-03-06 18:10:04,308 cisTopic     INFO     Creating minimal loom
/lustre1/project/stg_00002/mambaforge/vsc33053/envs/pycisTopic_polars_tutorial/lib/python3.11/site-packages/pycisTopic/loom.py:840: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for name, threshold in auc_thresholds.iteritems()
2024-03-06 18:11:27,242 cisTopic     INFO     Adding annotations
2024-03-06 18:11:49,942 cisTopic     INFO     Adding clusterings
2024-03-06 18:11:49,980 cisTopic     INFO     Adding markers
No markers for  nan
2024-03-06 18:12:01,188 cisTopic     INFO     Exporting
/lustre1/project/stg_00002/mambaforge/vsc33053/envs/pycisTopic_polars_tutorial/lib/python3.11/site-packages/loomxpy/loomxpy.py:459: FutureWarning: The default value of regex will change from True to False in a future version.
  regulons.columns = regulons.columns.str.replace("_?\\(", "_(")
/lustre1/project/stg_00002/mambaforge/vsc33053/envs/pycisTopic_polars_tutorial/lib/python3.11/site-packages/loomxpy/loomxpy.py:437: FutureWarning: The default value of regex will change from True to False in a future version.
  regulons.columns = regulons.columns.str.replace("_?\\(", "_(")
[246]:
export_gene_activity_to_loom(
    gene_activity_matrix = gene_act,
    cistopic_obj = cistopic_obj,
    out_fname = os.path.join(out_dir, "loom", "10x_multiome_brain_pycisTopic_gene_activity.loom"),
    cluster_annotation = ['Seurat_cell_type'],
    cluster_markers = cluster_markers,
    tree_structure = ('10x_multiome_brain', 'pycisTopic', 'ATAC'),
    title = 'Tutorial - Gene activity',
    nomenclature = "hg38",
    split_pattern = '-'
)
2024-03-06 18:13:30,346 cisTopic     INFO     Creating minimal loom
2024-03-06 18:13:39,158 cisTopic     INFO     Adding annotations
2024-03-06 18:13:41,059 cisTopic     INFO     Adding clusterings
2024-03-06 18:13:41,095 cisTopic     INFO     Adding markers
No markers for  nan
2024-03-06 18:13:41,482 cisTopic     INFO     Exporting