API

cisTopic object

class pycisTopic.cistopic_class.CistopicObject(fragment_matrix: csr_matrix, binary_matrix: csr_matrix, cell_names: List[str], region_names: List[str], cell_data: DataFrame, region_data: DataFrame, path_to_fragments: str | Dict[str, str], project: str | None = 'cisTopic')[source]

cisTopic data class.

CistopicObject contains the cell by fragment matrices (stored as counts fragment_matrix and as binary accessibility binary_matrix), cell metadata cell_data, region metadata region_data and path/s to the fragments file/s path_to_fragments.

LDA models from CistopicLDAModel can be stored in selected_model, and cell/region projections can be stored in projections as a dictionary.

Attributes:
fragment_matrix: sparse.csr_matrix

A matrix containing cell names as column names, regions as row names and fragment counts as values.

binary_matrix: sparse.csr_matrix

A matrix containing cell names as column names, regions as row names and whether regions are accessible (0: Not accessible; 1: Accessible) as values.

cell_names: list

A list containing cell names.

region_names: list

A list containing region names.

cell_data: pd.DataFrame

A data frame containing cell information, with cells as indexes and attributes as columns.

region_data: pd.DataFrame

A data frame containing region information, with region as indexes and attributes as columns.

path_to_fragments: str or dict

A str or dict containing the path/s to the fragments file/s used to generate the CistopicObject.

project: str

Name of the cisTopic project.

add_LDA_model(model: CistopicLDAModel)[source]

Add LDA model to a cisTopic object.

Parameters:
model: CistopicLDAModel

Selected cisTopic LDA model results (see LDAModels.evaluate_models)

add_cell_data(cell_data: DataFrame, split_pattern: str | None = '___')[source]

Add cell metadata to CistopicObject. If a column already exists in the cell metadata, it will be overwritten.

Parameters:
cell_data: pd.DataFrame

A data frame containing metadata information, with cell names as indexes. If cells are missing from the metadata, values will be filled with NaN.

split_pattern: str

Pattern to split cell barcode from sample id. Default: ___
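As an illustration of how split_pattern relates cell names, barcodes and sample ids, the following minimal sketch assumes cell names of the form barcode + split_pattern + sample id (e.g. ATGTCGTC-1___Sample_1). The helpers tag_cell_name and split_cell_name are hypothetical and are not part of pycisTopic.

```python
from typing import Optional, Tuple


def tag_cell_name(barcode: str, sample_id: str, split_pattern: str = "___") -> str:
    """Hypothetical helper: append a sample id to a barcode."""
    return f"{barcode}{split_pattern}{sample_id}"


def split_cell_name(
    cell_name: str, split_pattern: str = "___"
) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: recover (barcode, sample id) from a cell name."""
    barcode, separator, sample_id = cell_name.partition(split_pattern)
    # If the pattern is absent, the whole name is treated as the barcode.
    return barcode, (sample_id if separator else None)


cell_name = tag_cell_name("ATGTCGTC-1", "Sample_1")
barcode, sample_id = split_cell_name(cell_name)
```

Metadata indexed by such names can then be matched to cells either by the full name or by the barcode part alone, which is presumably why the same split_pattern has to be used consistently across calls.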

add_region_data(region_data: DataFrame)[source]

Add region metadata to CistopicObject. If a column already exists in the region metadata, it will be overwritten.

Parameters:
region_data: pd.DataFrame

A data frame containing metadata information, with region names as indexes. If regions are missing from the metadata, values will be filled with NaN.

merge(cistopic_obj_list: List[CistopicObject], is_acc: int | None = 1, project: str | None = 'cisTopic_merge', copy: bool | None = False, split_pattern: str | None = '___')[source]

Merge a list of CistopicObject to the input CistopicObject. Reference coordinates must be the same between the objects. Existing cisTopicCGSModel and projections will be deleted. This ensures that models contained in a CistopicObject are derived from the cells it contains.

Parameters:
cistopic_obj_list: list

A list containing one or more CistopicObject to merge.

is_acc: int, optional

Minimal number of fragments for a region to be considered accessible. Default: 1.

project: str, optional

Name of the cisTopic project.

copy: bool, optional

Whether changes should be made on the input CistopicObject or a new object should be returned.

split_pattern: str

Pattern to split cell barcode from sample id. Default: ___

Returns:
CistopicObject

A combined CistopicObject. Two new columns in cell_data indicate the CistopicObject of origin (cisTopic_id) and the fragment file from which the cell comes (path_to_fragments).

subset(cells: List[str] | None = None, regions: List[str] | None = None, copy: bool | None = False, split_pattern: str | None = '___')[source]

Subset cells and/or regions from CistopicObject. Existing CistopicLDAModel and projections will be deleted. This ensures that models contained in a CistopicObject are derived from the cells it contains.

Parameters:
cells: list, optional

A list containing the names of the cells to keep.

regions: list, optional

A list containing the names of the regions to keep.

copy: bool, optional

Whether changes should be made on the input CistopicObject or a new object should be returned.

split_pattern: str

Pattern to split cell barcode from sample id. Default: ___

pycisTopic.cistopic_class.create_cistopic_object(fragment_matrix: DataFrame | csr_matrix, cell_names: List[str] | None = None, region_names: List[str] | None = None, path_to_blacklist: str | None = None, min_frag: int | None = 1, min_cell: int | None = 1, is_acc: int | None = 1, path_to_fragments: str | Dict[str, str] | None = {}, project: str | None = 'cisTopic', tag_cells: bool | None = True, split_pattern: str | None = '___')[source]

Creates a CistopicObject from a count matrix.

Parameters:
fragment_matrix: pd.DataFrame or sparse.csr_matrix

A data frame containing cell names as column names, regions as row names and fragment counts as values or sparse.csr_matrix containing cells as columns and regions as rows.

cell_names: list, optional

A list containing cell names. Only used if the fragment matrix is sparse.csr_matrix.

region_names: list, optional

A list containing region names. Only used if the fragment matrix is sparse.csr_matrix.

path_to_blacklist: str, optional

Path to bed file containing blacklist regions (Amemiya et al., 2019).

min_frag: int, optional

Minimal number of fragments in a cell for the cell to be kept. Default: 1

min_cell: int, optional

Minimal number of cells in which a region must be detected for the region to be kept. Default: 1

is_acc: int, optional

Minimal number of fragments for a region to be considered accessible. Default: 1

path_to_fragments: str, dict

A dict or str containing the paths to the fragments files used to generate the CistopicObject. Default: {}.

project: str, optional

Name of the cisTopic project. Default: ‘cisTopic’

tag_cells: bool, optional

Whether to add the project name as suffix to the cell names. Default: True

split_pattern: str

Pattern to split cell barcode from sample id. Default: ___

References

Amemiya, H. M., Kundaje, A., & Boyle, A. P. (2019). The ENCODE blacklist: identification of problematic regions of the genome. Scientific reports, 9(1), 1-5.
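The interplay of min_frag, min_cell and is_acc can be sketched in plain Python. This is an illustrative sketch only, not the pycisTopic implementation: the helper filter_and_binarize, the dense list-of-lists matrix and the order of filtering (cells first, then regions) are assumptions.

```python
from typing import List, Tuple


def filter_and_binarize(
    counts: List[List[int]],  # rows = regions, columns = cells
    min_frag: int = 1,
    min_cell: int = 1,
    is_acc: int = 1,
) -> Tuple[List[List[int]], List[List[int]], List[int], List[int]]:
    """Hypothetical helper: filter a regions-by-cells count matrix and binarize it."""
    n_regions = len(counts)
    n_cells = len(counts[0]) if counts else 0

    # Keep cells whose total fragment count reaches min_frag.
    kept_cells = [
        j for j in range(n_cells)
        if sum(counts[i][j] for i in range(n_regions)) >= min_frag
    ]
    # Keep regions detected (count > 0) in at least min_cell of the kept cells.
    kept_regions = [
        i for i in range(n_regions)
        if sum(1 for j in kept_cells if counts[i][j] > 0) >= min_cell
    ]
    fragment_matrix = [[counts[i][j] for j in kept_cells] for i in kept_regions]
    # A region is accessible in a cell when it has at least is_acc fragments.
    binary_matrix = [
        [1 if value >= is_acc else 0 for value in row] for row in fragment_matrix
    ]
    return fragment_matrix, binary_matrix, kept_regions, kept_cells


counts = [
    [2, 0, 0],  # region 0
    [1, 3, 0],  # region 1
    [0, 0, 0],  # region 2: never detected
]
frag, binary, regions, cells = filter_and_binarize(
    counts, min_frag=1, min_cell=1, is_acc=2
)
```

With is_acc=2, a single fragment is not enough to call a region accessible, so the binary matrix can contain zeros where the fragment matrix has non-zero counts.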

pycisTopic.cistopic_class.create_cistopic_object_from_fragments(path_to_fragments: str, path_to_regions: str, path_to_blacklist: str | None = None, metrics: str | DataFrame | None = None, valid_bc: List[str] | None = None, n_cpu: int | None = 1, min_frag: int | None = 1, min_cell: int | None = 1, is_acc: int | None = 1, check_for_duplicates: bool | None = True, project: str | None = 'cisTopic', partition: int | None = 5, fragments_df: DataFrame | PyRanges | None = None, split_pattern: str | None = '___', use_polars: bool | None = True)[source]

Creates a CistopicObject from a fragments file and defined genomic intervals (compatible with CellRangerATAC output)

Parameters:
path_to_fragments: str

The path to the fragments file containing chromosome, start, end and assigned barcode for each read (e.g. from CellRanger ATAC (/outs/fragments.tsv.gz)).

path_to_regions: str

Path to the bed file with the defined regions.

path_to_blacklist: str, optional

Path to bed file containing blacklist regions (Amemiya et al., 2019). Default: None

metrics: str or pd.DataFrame, optional

Path to a file, or a data frame, with barcodes and metrics from CellRanger or similar (e.g. from CellRanger ATAC /outs/singlecell.csv). If it is an output from CellRanger, only cells for which is__cell_barcode is 1 will be considered, otherwise only barcodes included in the metrics will be taken. Default: None

valid_bc: list, optional

A list with valid cell barcodes; only used if metrics is not provided. Default: None

n_cpu: int, optional

Number of cores to use. Default: 1.

min_frag: int, optional

Minimal number of fragments in a cell for the cell to be kept. Default: 1

min_cell: int, optional

Minimal number of cells in which a region must be detected for the region to be kept. Default: 1

is_acc: int, optional

Minimal number of fragments for a region to be considered accessible. Default: 1

check_for_duplicates: bool, optional

If no duplicate counts are provided per row in the fragments file, whether to collapse duplicates. Default: True.

project: str, optional

Name of the cisTopic project. It will also be used as the sample_id in CistopicObject.cell_data. Default: ‘cisTopic’

partition: int, optional

When using Pandas > 0.21, counting may fail (https://github.com/pandas-dev/pandas/issues/26314). In that case, the fragments data frame is divided into this number of partitions, and the data is merged after counting. Default: 5.

fragments_df: pd.DataFrame or pr.PyRanges, optional

A PyRanges or DataFrame containing chromosome, start, end and assigned barcode for each read, corresponding to the data in path_to_fragments.

split_pattern: str

Pattern to split cell barcode from sample id. Default: ___

use_polars: bool, optional

Whether to use polars to read fragments files. Default: True.

References

Amemiya, H. M., Kundaje, A., & Boyle, A. P. (2019). The ENCODE blacklist: identification of problematic regions of the genome. Scientific reports, 9(1), 1-5.
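Conceptually, the count matrix is obtained by counting, per cell barcode, the fragments that overlap each region. A naive pure-Python sketch of that idea follows. It is not how pycisTopic performs the counting: the helper count_fragments_in_regions, the half-open interval convention and the chrom:start-end region naming are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

Fragment = Tuple[str, int, int, str]  # chromosome, start, end, barcode
Region = Tuple[str, int, int]         # chromosome, start, end


def count_fragments_in_regions(
    fragments: List[Fragment],
    regions: List[Region],
    valid_bc: Optional[Set[str]] = None,
) -> Dict[Tuple[str, str], int]:
    """Hypothetical helper: count fragments per (region name, barcode)."""
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for chrom, start, end, barcode in fragments:
        # Skip barcodes that are not in the list of valid barcodes.
        if valid_bc is not None and barcode not in valid_bc:
            continue
        for r_chrom, r_start, r_end in regions:
            # Half-open intervals overlap when each starts before the other ends.
            if chrom == r_chrom and start < r_end and r_start < end:
                counts[(f"{r_chrom}:{r_start}-{r_end}", barcode)] += 1
    return dict(counts)


fragments = [
    ("chr1", 100, 250, "AAAC-1"),
    ("chr1", 120, 300, "AAAC-1"),
    ("chr1", 900, 1000, "TTTG-1"),
    ("chr2", 100, 250, "AAAC-1"),  # no region on chr2
    ("chr1", 150, 250, "GGGG-1"),  # barcode not in valid_bc
]
regions = [("chr1", 200, 400), ("chr1", 950, 1200)]
counts = count_fragments_in_regions(fragments, regions, valid_bc={"AAAC-1", "TTTG-1"})
```

The nested loop is quadratic and only suitable for toy data; real fragments files need an interval-aware join.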

pycisTopic.cistopic_class.create_cistopic_object_from_matrix_file(fragment_matrix_file: str, path_to_blacklist: str | None = None, compression: str | None = None, min_frag: int | None = 1, min_cell: int | None = 1, is_acc: int | None = 1, path_to_fragments: Dict[str, str] | None = {}, sample_id: DataFrame | None = None, project: str | None = 'cisTopic', split_pattern: str | None = '___')[source]

Creates a CistopicObject from a count matrix file (tsv).

Parameters:
fragment_matrix_file: str

Path to a tsv file containing cell names as column names, regions as row names and fragment counts as values.

path_to_blacklist: str, optional

Path to bed file containing blacklist regions (Amemiya et al., 2019). Default: None

compression: str, optional

Whether the file is compressed (e.g. bzip). Default: None

min_frag: int, optional

Minimal number of fragments in a cell for the cell to be kept. Default: 1

min_cell: int, optional

Minimal number of cells in which a region must be detected for the region to be kept. Default: 1

is_acc: int, optional

Minimal number of fragments for a region to be considered accessible. Default: 1

path_to_fragments: dict, optional

A dict containing the paths to the fragments files used to generate the CistopicObject. Default: {}.

sample_id: pd.DataFrame, optional

A data frame indicating from which sample each barcode is derived. Required if path_to_fragments is provided. Levels must agree with keys in path_to_fragments. Default: None.

project: str, optional

Name of the cisTopic project. Default: ‘cisTopic’

split_pattern: str

Pattern to split cell barcode from sample id. Default: ___

References

Amemiya, H. M., Kundaje, A., & Boyle, A. P. (2019). The ENCODE blacklist: identification of problematic regions of the genome. Scientific reports, 9(1), 1-5.

pycisTopic.cistopic_class.merge(cistopic_obj_list: List[CistopicObject], is_acc: int | None = 1, project: str | None = 'cisTopic_merge', split_pattern: str | None = '___')[source]

Merge a list of CistopicObject into a single CistopicObject. Reference coordinates must be the same between the objects. Existing cisTopicCGSModel and projections will be deleted. This ensures that models contained in a CistopicObject are derived from the cells it contains.

Parameters:
cistopic_obj_list: list

A list containing one or more CistopicObject to merge.

is_acc: int, optional

Minimal number of fragments for a region to be considered accessible. Default: 1.

project: str, optional

Name of the cisTopic project.

Pseudobulk formation and peak calling

class pycisTopic.pseudobulk_peak_calling.MACSCallPeak(macs_path: str, bed_path: str, name: str, outdir: str, genome_size: str, input_format: str | None = 'BEDPE', shift: int | None = 73, ext_size: int | None = 146, keep_dup: str | None = 'all', q_value: int | None = 0.05, nolambda: bool | None = True, skip_empty_peaks: bool = False)[source]

Parameters

macs_path: str

Path to MACS binary (e.g. /xxx/MACS/xxx/bin/macs2).

bed_path: str

Path to the fragments bed file.

name: str

Name of the group.

outdir: str

Path to the output directory.

genome_size: str

Effective genome size which is defined as the genome size which can be sequenced. Possible values: ‘hs’, ‘mm’, ‘ce’ and ‘dm’.

input_format: str, optional

Format of the tag file: ELAND, BED, ELANDMULTI, ELANDEXPORT, SAM, BAM, BOWTIE, BAMPE, or BEDPE. MACS's own default is AUTO, which lets MACS decide the format automatically; the default here is ‘BEDPE’.

shift: int, optional

To set an arbitrary shift in bp. For finding enriched cutting sites (such as in ATAC-seq) a shift of 73 bp is recommended. Default: 73.

ext_size: int, optional

To extend reads in 5’->3’ direction to fixed-size fragments. For ATAC-seq data, an extension of 146 bp is recommended. Default: 146.

keep_dup: str, optional

Whether to keep duplicate tags at the exact same location. Default: ‘all’.

q_value: float, optional

The q-value (minimum FDR) cutoff to call significant regions. Default: 0.05.

nolambda: bool, optional

Do not consider the local bias/lambda at peak candidate regions. Default: True.

call_peak()[source]

Run MACS2 peak calling.

load_narrow_peak(skip_empty_peaks: bool)[source]

Load MACS2 narrow peak files as pr.PyRanges.

pycisTopic.pseudobulk_peak_calling.export_pseudobulk(input_data: CistopicObject | DataFrame, variable: str, chromsizes: DataFrame | PyRanges, bed_path: str, bigwig_path: str, path_to_fragments: Dict[str, str] | None = None, sample_id_col: str = 'sample_id', n_cpu: int = 1, normalize_bigwig: bool = True, split_pattern: str = '___', temp_dir: str = '/tmp') Tuple[Dict[str, str], Dict[str, str]][source]

Create pseudobulks as bed and bigwig from single cell fragments file given a barcode annotation.

Parameters

input_data: CistopicObject or pd.DataFrame

A CistopicObject containing the specified variable as a column in CistopicObject.cell_data or a cell metadata pd.DataFrame containing barcode as rows, containing the specified variable as a column (additional columns are possible) and a sample_id column. Index names must contain the BARCODE (e.g. ATGTCGTC-1), additional tags are possible separating with - (e.g. ATGCTGTGCG-1-Sample_1). The levels in the sample_id column must agree with the keys in the path_to_fragments dictionary. Alternatively, if the cell metadata contains a column named barcode it will be used instead of the index names.

variable: str

A character string indicating the column that will be used to create the different group pseudobulk. It must be included in the cell metadata provided as input_data.

chromsizes: pd.DataFrame or pr.PyRanges

A data frame or pr.PyRanges containing size of each chromosome, containing ‘Chromosome’, ‘Start’ and ‘End’ columns.

bed_path: str

Path to folder where the fragments bed files per group will be saved. If None, files will not be generated.

bigwig_path: str

Path to folder where the bigwig files per group will be saved. If None, files will not be generated.

path_to_fragments: str or dict, optional

A dictionary of character strings, with sample name as names indicating the path to the fragments file/s from which pseudobulk profiles have to be created. If a CistopicObject is provided as input it will be ignored, but if a cell metadata pd.DataFrame is provided it is necessary to provide it. The keys of the dictionary need to match with the sample_id tag added to the index names of the input data frame.

sample_id_col: str, optional

Name of the column containing the sample name per barcode in the input CistopicObject.cell_data or class:pd.DataFrame. Default: ‘sample_id’.

n_cpu: int, optional

Number of cores to use. Default: 1.

normalize_bigwig: bool, optional

Whether bigwig files should be CPM normalized. Default: True.

split_pattern: str, optional

Pattern to split cell barcode from sample id. Default: ‘___’. Note, if split_pattern is not None, then export_pseudobulk will attempt to infer sample_id from the index of input_data and ignore sample_id_col.

temp_dir: str

Path to temporary directory. Default: ‘/tmp’.
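The core idea of pseudobulk formation, pooling the fragments of all cells that share a value of variable, can be sketched as follows. This is a simplified illustration, not the pycisTopic implementation: the helper group_fragments and the plain dict used as barcode annotation are assumptions, and writing bed/bigwig files and CPM normalization are left out.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Fragment = Tuple[str, int, int, str]  # chromosome, start, end, barcode


def group_fragments(
    fragments: List[Fragment],
    barcode_to_group: Dict[str, str],
) -> Dict[str, List[Fragment]]:
    """Hypothetical helper: pool fragments per group label."""
    pseudobulks: Dict[str, List[Fragment]] = defaultdict(list)
    for fragment in fragments:
        group = barcode_to_group.get(fragment[3])
        # Fragments from barcodes without an annotation are dropped.
        if group is not None:
            pseudobulks[group].append(fragment)
    return dict(pseudobulks)


fragments = [
    ("chr1", 100, 250, "AAAC-1"),
    ("chr1", 300, 420, "TTTG-1"),
    ("chr1", 500, 640, "GGGA-1"),
    ("chr1", 700, 810, "CCCC-1"),  # not annotated
]
annotation = {"AAAC-1": "B_cell", "TTTG-1": "T_cell", "GGGA-1": "B_cell"}
pseudobulks = group_fragments(fragments, annotation)
```

Each value of the resulting dictionary corresponds to the content of one per-group fragments bed file.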

pycisTopic.pseudobulk_peak_calling.macs_call_peak(macs_path: str, bed_path: str, name: str, outdir: str, genome_size: str, input_format: str | None = 'BEDPE', shift: int | None = 73, ext_size: int | None = 146, keep_dup: str | None = 'all', q_value: int | None = 0.05, nolambda: bool | None = True, skip_empty_peaks: bool = False)[source]

Performs pseudobulk peak calling with MACS2 in a group. Requires MACS2 to be installed (https://github.com/macs3-project/MACS).

Parameters

macs_path: str

Path to MACS binary (e.g. /xxx/MACS/xxx/bin/macs2).

bed_path: str

Path to the fragments bed file.

name: str

Name of the group.

outdir: str

Path to the output directory.

genome_size: str

Effective genome size which is defined as the genome size which can be sequenced. Possible values: ‘hs’, ‘mm’, ‘ce’ and ‘dm’.

input_format: str, optional

Format of the tag file: ELAND, BED, ELANDMULTI, ELANDEXPORT, SAM, BAM, BOWTIE, BAMPE, or BEDPE. MACS's own default is AUTO, which lets MACS decide the format automatically; the default here is ‘BEDPE’.

shift: int, optional

To set an arbitrary shift in bp. For finding enriched cutting sites (such as in ATAC-seq) a shift of 73 bp is recommended. Default: 73.

ext_size: int, optional

To extend reads in 5’->3’ direction to fixed-size fragments. For ATAC-seq data, an extension of 146 bp is recommended. Default: 146.

keep_dup: str, optional

Whether to keep duplicate tags at the exact same location. Default: ‘all’.

q_value: float, optional

The q-value (minimum FDR) cutoff to call significant regions. Default: 0.05.

nolambda: bool, optional

Do not consider the local bias/lambda at peak candidate regions. Default: True.
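To make the role of these parameters concrete, the sketch below shows how they could map onto a MACS2 callpeak command line. The flags are standard MACS2 options, but the helper build_macs2_command is hypothetical and the exact command assembled by pycisTopic is not documented here; in particular, adding --nomodel is an assumption (MACS2 only honours --extsize when it is set).

```python
from typing import List


def build_macs2_command(
    macs_path: str,
    bed_path: str,
    name: str,
    outdir: str,
    genome_size: str,
    input_format: str = "BEDPE",
    shift: int = 73,
    ext_size: int = 146,
    keep_dup: str = "all",
    q_value: float = 0.05,
    nolambda: bool = True,
) -> List[str]:
    """Hypothetical helper: assemble a MACS2 callpeak argument list."""
    cmd = [
        macs_path, "callpeak",
        "--treatment", bed_path,
        "--name", name,
        "--outdir", outdir,
        "--format", input_format,
        "--gsize", genome_size,
        "--qvalue", str(q_value),
        "--nomodel",  # assumption: skip model building so shift/extsize apply
        "--shift", str(shift),
        "--extsize", str(ext_size),
        "--keep-dup", keep_dup,
    ]
    if nolambda:
        cmd.append("--nolambda")
    return cmd


cmd = build_macs2_command(
    macs_path="macs2",
    bed_path="B_cell.bed.gz",
    name="B_cell",
    outdir="peaks",
    genome_size="hs",
)
```

An argument list like this can be passed to subprocess.run(cmd, check=True) without going through a shell.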

pycisTopic.pseudobulk_peak_calling.peak_calling(macs_path: str, bed_paths: Dict, outdir: str, genome_size: str, n_cpu: int | None = 1, input_format: str | None = 'BEDPE', shift: int | None = 73, ext_size: int | None = 146, keep_dup: str | None = 'all', q_value: float | None = 0.05, nolambda: bool | None = True, skip_empty_peaks: bool = False, **kwargs)[source]

Performs pseudobulk peak calling with MACS2. Requires MACS2 to be installed (https://github.com/macs3-project/MACS).

Parameters

macs_path: str

Path to MACS binary (e.g. /xxx/MACS/xxx/bin/macs2).

bed_paths: dict

A dictionary containing group labels as keys and the paths to their corresponding fragments bed files as values.

outdir: str

Path to the output directory.

genome_size: str

Effective genome size which is defined as the genome size which can be sequenced. Possible values: ‘hs’, ‘mm’, ‘ce’ and ‘dm’.

n_cpu: int, optional

Number of cores to use. Default: 1.

input_format: str, optional

Format of the tag file: ELAND, BED, ELANDMULTI, ELANDEXPORT, SAM, BAM, BOWTIE, BAMPE, or BEDPE. MACS's own default is AUTO, which lets MACS decide the format automatically; the default here is ‘BEDPE’.

shift: int, optional

To set an arbitrary shift in bp. For finding enriched cutting sites (such as in ATAC-seq) a shift of 73 bp is recommended. Default: 73.

ext_size: int, optional

To extend reads in 5’->3’ direction to fixed-size fragments. For ATAC-seq data, an extension of 146 bp is recommended. Default: 146.

keep_dup: str, optional

Whether to keep duplicate tags at the exact same location. Default: ‘all’.

q_value: float, optional

The q-value (minimum FDR) cutoff to call significant regions. Default: 0.05.

**kwargs

Additional parameters to pass to ray.init().

Iterative peak filtering

pycisTopic.iterative_peak_calling.calculate_peaks_and_extend(narrow_peaks: PyRanges, peak_half_width: int, chromsizes: DataFrame | PyRanges | None = None, path_to_blacklist: str | None = None)[source]

Extend peaks a number of base pairs in each direction from the summit.

Parameters

narrow_peaks: pr.PyRanges

A pr.PyRanges with the narrowPeak results from MACS2.

peak_half_width: int

Number of base pairs that each summit will be extended in each direction.

chromsizes: pr.PyRanges or pd.DataFrame

A data frame or pr.PyRanges containing the size of each chromosome, with ‘Chromosome’, ‘Start’ and ‘End’ columns.

path_to_blacklist: str, optional

Path to bed file containing blacklist regions (Amemiya et al., 2019). Default: None
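The summit extension can be illustrated with a small sketch. In the narrowPeak format the summit is stored as a 0-based offset from the peak start, so the absolute summit is start + offset. The helper extend_summit is hypothetical, and clipping the extended peak at chromosome boundaries is an assumption of this sketch; the library may instead discard peaks that extend beyond a chromosome.

```python
from typing import Dict, Optional, Tuple


def extend_summit(
    chrom: str,
    start: int,
    summit_offset: int,
    peak_half_width: int,
    chromsizes: Optional[Dict[str, int]] = None,
) -> Tuple[str, int, int]:
    """Hypothetical helper: fixed-width peak centred on the summit."""
    summit = start + summit_offset
    new_start = max(0, summit - peak_half_width)
    new_end = summit + peak_half_width
    if chromsizes is not None:
        # Assumption: clip at the chromosome end rather than dropping the peak.
        new_end = min(new_end, chromsizes[chrom])
    return chrom, new_start, new_end


centered = extend_summit("chr1", start=1000, summit_offset=120, peak_half_width=250)
clipped = extend_summit(
    "chr1", start=1000, summit_offset=120, peak_half_width=250,
    chromsizes={"chr1": 1200},
)
```

With peak_half_width=250, every peak that is not clipped ends up exactly 500 bp wide, which makes scores comparable across peaks.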

pycisTopic.iterative_peak_calling.cpm(x: PyRanges, column: str)[source]

CPM (counts per million) normalization.

Parameters

x: pr.PyRanges

A pr.PyRanges object.

column: str

Name of the column that has to be normalized
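CPM normalization rescales a column so that its values sum to one million. A minimal sketch on a plain list of values, where the helper cpm_normalize is hypothetical and stands in for the operation applied to the selected column:

```python
from typing import List


def cpm_normalize(values: List[float]) -> List[float]:
    """Hypothetical helper: scale values so that they sum to 1e6."""
    total = sum(values)
    if total == 0:
        # Avoid division by zero for an all-zero column.
        return [0.0 for _ in values]
    return [value / total * 1_000_000 for value in values]


scores = [1, 3]
normalized = cpm_normalize(scores)
```

After normalization, peak scores from pseudobulks of different depth are on a common scale and can be ranked together.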

pycisTopic.iterative_peak_calling.get_consensus_peaks(narrow_peaks_dict: Dict[str, PyRanges], peak_half_width: int, chromsizes: DataFrame | PyRanges | None = None, path_to_blacklist: str | None = None)[source]

Returns consensus peaks from a set of MACS narrow peak results. First, each summit is extended by peak_half_width in each direction; then less significant peaks that overlap with a more significant one are iteratively filtered out. During this procedure peaks are merged and, depending on the number of peaks included in each merged region, the following happens:

* 1 peak: The original peak region will be kept.
* 2 peaks: The original peak region with the highest score will be kept.
* 3 or more peaks: The original peak region with the most significant score will be taken, and all the original peak regions in this merged peak region that overlap with the significant peak region will be removed. The process is repeated with the next most significant peak (if it was not removed already) until all peaks are processed.

This process happens twice: first within the peaks of each pseudobulk and then, after peak score normalization, on all peaks together.

This approach is described in Corces et al. 2018.

Parameters

narrow_peaks_dict: dict

A dictionary containing group labels as keys and pr.PyRanges with the narrowPeak results from MACS2 as values (as returned by pycisTopic.pseudobulk_peak_calling.peak_calling()).

peak_half_width: int

Number of base pairs that each summit will be extended in each direction.

chromsizes: pr.PyRanges or pd.DataFrame

A data frame or pr.PyRanges containing the size of each chromosome, with ‘Chromosome’, ‘Start’ and ‘End’ columns.

path_to_blacklist: str, optional

Path to bed file containing blacklist regions (Amemiya et al., 2019). Default: None

pycisTopic.iterative_peak_calling.iterative_peak_filtering(center_extended_peaks: PyRanges)[source]

Returns consensus peaks from a set of MACS narrow peak results. First, each summit is extended by peak_half_width in each direction; then less significant peaks that overlap with a more significant one are iteratively filtered out. During this procedure, which is implemented in this function, peaks are merged and, depending on the number of peaks included in each merged region, the following happens:

* 1 peak: The original peak region will be kept.
* 2 peaks: The original peak region with the highest score will be kept.
* 3 or more peaks: The original peak region with the most significant score will be taken, and all the original peak regions in this merged peak region that overlap with the significant peak region will be removed. The process is repeated with the next most significant peak (if it was not removed already) until all peaks are processed.

This process happens twice: first within the peaks of each pseudobulk and then, after peak score normalization, on all peaks together.

This approach is described in Corces et al. 2018.

Parameters

center_extended_peaks: pr.PyRanges

A pr.PyRanges with all the peaks to be combined (and their MACS score), after centering and extending the peaks.
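The filtering rule described above is equivalent to a greedy selection: walk through the peaks from most to least significant and keep a peak only if it does not overlap a peak that was already kept. A minimal sketch under that reading follows. The helper filter_overlapping_peaks and the tuple representation are hypothetical, and the library operates on pr.PyRanges rather than on Python lists.

```python
from typing import List, Tuple

Peak = Tuple[str, int, int, float]  # chromosome, start, end, score


def filter_overlapping_peaks(peaks: List[Peak]) -> List[Peak]:
    """Hypothetical helper: greedy, score-ordered removal of overlapping peaks."""
    kept: List[Peak] = []
    # Visit peaks from the most to the least significant score.
    for peak in sorted(peaks, key=lambda p: p[3], reverse=True):
        chrom, start, end, _score = peak
        overlaps_kept = any(
            chrom == k_chrom and start < k_end and k_start < end
            for k_chrom, k_start, k_end, _ in kept
        )
        if not overlaps_kept:
            kept.append(peak)
    # Return the surviving peaks in genomic order.
    return sorted(kept, key=lambda p: (p[0], p[1]))


peaks = [
    ("chr1", 100, 600, 80.0),
    ("chr1", 400, 900, 50.0),   # overlaps the first, more significant peak
    ("chr1", 700, 1200, 30.0),  # overlaps only the removed second peak
    ("chr2", 100, 600, 10.0),
]
consensus = filter_overlapping_peaks(peaks)
```

Note that the third peak survives: it overlaps only a peak that was itself removed, which is the behaviour described for merged regions of three or more peaks.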

Fragments

pycisTopic.fragments.create_pyranges_from_polars_df(bed_df_pl: DataFrame) PyRanges[source]

Create PyRanges DataFrame from Polars DataFrame.

Parameters:
bed_df_pl

Polars DataFrame containing BED entries. This can also be a filtered Polars DataFrame with fragments or TSS annotation.

Returns:
PyRanges DataFrame.

Examples

Read BED file to Polars DataFrame with pyarrow engine.

>>> bed_df_pl = read_bed_to_polars_df("test.bed", engine="pyarrow")

Create PyRanges object directly from Polars DataFrame.

>>> bed_df_pr = create_pyranges_from_polars_df(bed_df_pl=bed_df_pl)
pycisTopic.fragments.filter_fragments_by_cb(fragments_df_pl: DataFrame, cbs: Series | Sequence) DataFrame[source]

Filter fragments by cell barcodes.

Parameters:
fragments_df_pl

Polars DataFrame with fragments.

cbs

List/Polars Series with Cell barcodes. See pycisTopic.fragments.get_cbs_passing_filter() for a way to get a filtered list of cell barcodes (selected_cbs variable).

Returns:
Polars DataFrame with fragments for the requested cell barcodes.

Examples

Read gzipped fragments BED file to a Polars DataFrame.

>>> fragments_df_pl = read_fragments_to_polars_df(
...     fragments_bed_filename="fragments.tsv.gz",
... )

List of cell barcodes for which to retain fragments.

>>> cbs = ["GGACATAAGGGCCACT-1", "ACCTTCATCTTTGAGA-1"]

Polars DataFrame with fragments for the requested cell barcodes.

>>> fragments_cb_filtered_df_pl = filter_fragments_by_cb(
...     fragments_df_pl=fragments_df_pl,
...     cbs=cbs,
... )

List of cell barcodes as a Polars categorical Series for which to retain fragments.

>>> cbs = pl.Series(
...     "CB",
...     ["GGACATAAGGGCCACT-1", "ACCTTCATCTTTGAGA-1"],
...     dtype=pl.Categorical,
... )

Read list of cell barcodes from a file.

>>> cbs = read_barcodes_file_to_polars_series("barcodes.tsv")

Polars DataFrame with fragments for the requested cell barcodes.

>>> fragments_cb_filtered_df_pl = filter_fragments_by_cb(
...     fragments_df_pl=fragments_df_pl,
...     cbs=cbs,
... )
pycisTopic.fragments.get_cbs_passing_filter(fragments_stats_per_cb_df_pl: pl.DataFrame, cbs: pl.Series | Sequence | None = None, min_fragments_per_cb: int | None = None, keep_top_x_cbs: int | None = None, collapse_duplicates: bool | None = True)[source]

Get cell barcodes passing the filter.

Parameters:
fragments_stats_per_cb_df_pl

Polars DataFrame with number of fragments and duplication ratio per cell barcode. See pycisTopic.fragments.get_fragments_per_cb().

cbs

Cell barcodes to keep. If specified, min_fragments_per_cb and keep_top_x_cbs are ignored.

min_fragments_per_cb

Minimum number of fragments needed per cell barcode to keep the cell barcode. Only used if cbs is None; keep_top_x_cbs is then ignored.

keep_top_x_cbs

Keep the x most abundant cell barcodes based on the number of fragments. Only used if cbs is None and min_fragments_per_cb is None.

collapse_duplicates

Collapse duplicate fragments (same chromosomal positions and linked to the same cell barcode).

Returns:
(Cell barcodes passing the filter, fragments_stats_per_cb_df_pl filtered by the cell barcodes passing the filter)

Examples

Read gzipped fragments BED file to a Polars DataFrame.

>>> fragments_df_pl = read_fragments_to_polars_df(
...     fragments_bed_filename="fragments.tsv.gz",
... )

Get number of fragments and duplication ratio per cell barcode (which have 10 fragments or more after collapsing duplicates).

>>> fragments_stats_per_cb_df_pl = get_fragments_per_cb(
...     fragments_df_pl=fragments_df_pl,
...     min_fragments_per_cb=10,
...     collapse_duplicates=True,
... )

Keep only cell barcodes which have 1000 or more fragments.

>>> cbs_selected, fragments_stats_per_cb_filtered_df_pl = get_cbs_passing_filter(
...     fragments_stats_per_cb_df_pl=fragments_stats_per_cb_df_pl,
...     min_fragments_per_cb=1000,
...     collapse_duplicates=True,
... )

Keep only the 4000 most abundant cell barcodes based on the number of fragments after collapsing duplicates.

>>> cbs_selected, fragments_stats_per_cb_filtered_df_pl = get_cbs_passing_filter(
...     fragments_stats_per_cb_df_pl=fragments_stats_per_cb_df_pl,
...     keep_top_x_cbs=4000,
...     collapse_duplicates=True,
... )
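The precedence between cbs, min_fragments_per_cb and keep_top_x_cbs described in the parameters can be sketched as follows. The helper select_cbs and the plain dictionary of fragment counts are hypothetical; returning all cell barcodes when no filter is given is an assumption of this sketch.

```python
from typing import Dict, List, Optional, Sequence


def select_cbs(
    fragments_per_cb: Dict[str, int],
    cbs: Optional[Sequence[str]] = None,
    min_fragments_per_cb: Optional[int] = None,
    keep_top_x_cbs: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: apply the first filter that is specified."""
    if cbs is not None:
        # An explicit list of cell barcodes takes precedence over everything.
        wanted = set(cbs)
        return [cb for cb in fragments_per_cb if cb in wanted]
    if min_fragments_per_cb is not None:
        return [
            cb for cb, n in fragments_per_cb.items() if n >= min_fragments_per_cb
        ]
    if keep_top_x_cbs is not None:
        ranked = sorted(fragments_per_cb, key=fragments_per_cb.get, reverse=True)
        return ranked[:keep_top_x_cbs]
    # Assumption: without any filter, all cell barcodes pass.
    return list(fragments_per_cb)


stats = {"AAAC-1": 5000, "CCCG-1": 800, "GGGT-1": 1500}
by_threshold = select_cbs(stats, min_fragments_per_cb=1000)
top_one = select_cbs(stats, keep_top_x_cbs=1)
explicit = select_cbs(stats, cbs=["CCCG-1"], min_fragments_per_cb=1000)
```

In the last call the threshold is ignored because an explicit list of cell barcodes was given.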
pycisTopic.fragments.get_fragments_in_peaks(fragments_df_pl: DataFrame, regions_df_pl: DataFrame) DataFrame[source]

Get number of total and unique fragments in peaks.

Parameters:
fragments_df_pl

Polars DataFrame with fragments.

regions_df_pl

Polars DataFrame with peak regions (consensus peaks or SCREEN regions). See pycisTopic.fragments.read_bed_to_polars_df() for a way to read a BED file with peak regions.

Returns:
Polars DataFrame with total fragment counts and unique fragment counts per region.

Examples

As input get a Polars DataFrame with fragments for the cell barcodes of interest. See pycisTopic.fragments.filter_fragments_by_cb

>>> fragments_cb_filtered_df_pl = filter_fragments_by_cb(
...     fragments_df_pl=fragments_df_pl,
...     cbs=cbs,
... )

Read BED file with consensus peaks or SCREEN regions (get first 3 columns only).

>>> regions_df_pl = read_bed_to_polars_df(
...     bed_filename=screen_regions_bed_filename,
...     min_column_count=3,
... )

Polars DataFrame with number of total and unique fragments in peaks.

>>> fragments_in_peaks_df_pl = get_fragments_in_peaks(
...     fragments_df_pl=fragments_cb_filtered_df_pl,
...     regions_df_pl=regions_df_pl,
... )
pycisTopic.fragments.get_fragments_per_cb(fragments_df_pl: DataFrame, min_fragments_per_cb: int = 10, collapse_duplicates: bool | None = True) DataFrame[source]

Get number of fragments and duplication ratio per cell barcode.

Parameters:
fragments_df_pl:

Polars DataFrame with fragments. See pycisTopic.fragments.read_fragments_to_polars_df().

min_fragments_per_cb:

Minimum number of fragments needed per cell barcode to keep the fragments for those cell barcodes.

collapse_duplicates:

Collapse duplicate fragments (same chromosomal positions and linked to the same cell barcode).

Returns:
Polars DataFrame with number of fragments and duplication ratio per cell barcode.

Examples

Read gzipped fragments BED file to a Polars DataFrame.

>>> fragments_df_pl = read_fragments_to_polars_df(
...     fragments_bed_filename="fragments.tsv.gz",
... )

Get number of fragments and duplication ratio per cell barcode (which have 10 fragments or more after collapsing duplicates).

>>> fragments_stats_per_cb_df_pl = get_fragments_per_cb(
...     fragments_df_pl=fragments_df_pl,
...     min_fragments_per_cb=10,
...     collapse_duplicates=True,
... )
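A rough sketch of what per-barcode statistics of this kind look like is given below. It assumes one row per distinct fragment with a Score column holding the number of times that fragment was observed, and it uses one common definition of the duplication ratio, (total - unique) / total; both are assumptions, and the exact columns returned by pycisTopic may differ. The helper fragments_per_cb is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

ScoredFragment = Tuple[str, int, int, str, int]  # chrom, start, end, barcode, score


def fragments_per_cb(
    fragments: List[ScoredFragment],
    min_fragments_per_cb: int = 10,
    collapse_duplicates: bool = True,
) -> Dict[str, Tuple[int, int, float]]:
    """Hypothetical helper: (total, unique, duplication ratio) per cell barcode."""
    total: Dict[str, int] = defaultdict(int)
    unique: Dict[str, int] = defaultdict(int)
    for _chrom, _start, _end, barcode, score in fragments:
        total[barcode] += score  # every observation, duplicates included
        unique[barcode] += 1     # each distinct fragment counted once
    stats = {}
    for barcode in total:
        # The threshold applies to unique fragments when duplicates are collapsed.
        n = unique[barcode] if collapse_duplicates else total[barcode]
        if n >= min_fragments_per_cb:
            ratio = (total[barcode] - unique[barcode]) / total[barcode]
            stats[barcode] = (total[barcode], unique[barcode], ratio)
    return stats


fragments = [
    ("chr1", 100, 200, "AAAC-1", 3),
    ("chr1", 300, 400, "AAAC-1", 1),
    ("chr1", 100, 200, "TTTG-1", 1),
]
stats = fragments_per_cb(fragments, min_fragments_per_cb=2)
```

Here AAAC-1 has 4 observed fragments of which 2 are unique, giving a duplication ratio of 0.5, while TTTG-1 falls below the threshold and is dropped.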
pycisTopic.fragments.get_insert_size_distribution(fragments_df_pl: DataFrame) DataFrame[source]

Get insert size distribution of fragments.

Parameters:
fragments_df_pl

Polars DataFrame with fragments, typically already filtered to the cell barcodes of interest. See pycisTopic.fragments.filter_fragments_by_cb() for a way to filter fragments by cell barcode and pycisTopic.fragments.get_cbs_passing_filter() for a way to get a filtered list of cell barcodes.

Returns:
Polars DataFrame with fragment counts and fragment ratios for each found insert size.

Examples

As input get a Polars DataFrame with fragments for the cell barcodes of interest. See pycisTopic.fragments.filter_fragments_by_cb

>>> fragments_cb_filtered_df_pl = filter_fragments_by_cb(
...     fragments_df_pl=fragments_df_pl,
...     cbs=cbs,
... )

Polars DataFrame with insert size distribution of fragments.

>>> insert_size_dist_df_pl = get_insert_size_distribution(
...     fragments_df_pl=fragments_cb_filtered_df_pl,
... )
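The insert size distribution can be illustrated with a few lines of plain Python. The sketch assumes that the insert size of a fragment is end - start and that every row counts once; whether the library weights rows by a duplicate count is not stated here. The helper insert_size_distribution is hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Fragment = Tuple[str, int, int, str]  # chromosome, start, end, barcode


def insert_size_distribution(
    fragments: List[Fragment],
) -> List[Tuple[int, int, float]]:
    """Hypothetical helper: (insert size, count, ratio), sorted by insert size."""
    sizes = Counter(end - start for _chrom, start, end, _barcode in fragments)
    total = sum(sizes.values())
    return [(size, count, count / total) for size, count in sorted(sizes.items())]


fragments = [
    ("chr1", 100, 150, "AAAC-1"),
    ("chr1", 200, 250, "AAAC-1"),
    ("chr1", 300, 500, "TTTG-1"),
    ("chr2", 0, 50, "TTTG-1"),
]
distribution = insert_size_distribution(fragments)
```

The ratios sum to 1, so the distribution can be compared between samples of different depth.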
pycisTopic.fragments.read_barcodes_file_to_polars_series(barcodes_tsv_filename: str) Series[source]

Read barcode TSV file to a Polars Series.

Parameters:
barcodes_tsv_filename

TSV file with CBs.

Returns:
Polars Series with CBs.

Examples

Read gzipped barcodes TSV file to a Polars Series.

>>> cbs = read_barcodes_file_to_polars_series(
...     barcodes_tsv_filename="barcodes.tsv.gz",
... )

Read uncompressed barcodes TSV file to a Polars Series.

>>> cbs = read_barcodes_file_to_polars_series(
...     barcodes_tsv_filename="barcodes.tsv",
... )
pycisTopic.fragments.read_bed_to_polars_df(bed_filename: str, engine: str | Literal['polars'] | Literal['pyarrow'] = 'pyarrow', min_column_count: int = 3) DataFrame[source]

Read BED file to a Polars DataFrame.

Parameters:
bed_filename

BED filename.

engine

Use Polars or pyarrow to read the BED file (default: pyarrow).

min_column_count

Minimum number of required columns needed in BED file.

Returns:
Polars DataFrame with BED entries.

Examples

Read BED file to Polars DataFrame with pyarrow engine.

>>> bed_df_pl = read_bed_to_polars_df("test.bed", engine="pyarrow")

Read BED file to Polars DataFrame with pyarrow engine and require that the BED file has at least 4 columns.

>>> bed_with_at_least_4_columns_df_pl = read_bed_to_polars_df(
...     "test.bed",
...     engine="pyarrow",
...     min_column_count=4,
... )
pycisTopic.fragments.read_fragments_to_polars_df(fragments_bed_filename: str, engine: str | Literal['polars'] | Literal['pyarrow'] = 'pyarrow') DataFrame[source]

Read fragments BED file to a Polars DataFrame.

If fragments don’t have a Score column, a Score column is created by counting the number of fragments with the same chromosome, start, end and CB.

Parameters:
fragments_bed_filename

Fragments BED filename.

engine

Use Polars or pyarrow to read the fragments BED file (default: pyarrow).

Returns:
Polars DataFrame with fragments.

Examples

Read gzipped fragments BED file to a Polars DataFrame.

>>> fragments_df_pl = read_fragments_to_polars_df(
...     fragments_bed_filename="fragments.tsv.gz",
... )

Read uncompressed fragments BED file to a Polars DataFrame.

>>> fragments_df_pl = read_fragments_to_polars_df(
...     fragments_bed_filename="fragments.tsv",
... )
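
The Score column rule described above (count fragments that share chromosome, start, end and CB) can be sketched in plain Python. The helper name add_score_column and the tuple layout are hypothetical; the real function operates on a Polars DataFrame.

```python
from collections import Counter


def add_score_column(fragments):
    """Hypothetical helper: collapse identical fragments and count them.

    fragments: iterable of (chromosome, start, end, cb) tuples without a score.
    Returns one (chromosome, start, end, cb, score) tuple per distinct
    fragment, in order of first appearance.
    """
    counts = Counter(fragments)
    return [(*fragment, score) for fragment, score in counts.items()]


toy_fragments_without_score = [
    ("chr1", 100, 250, "AAAC"),
    ("chr1", 100, 250, "AAAC"),
    ("chr1", 100, 250, "TTTG"),  # same position, different cell barcode
]
toy_fragments_with_score = add_score_column(toy_fragments_without_score)
```

Fragments at the same position but from different cell barcodes stay separate, because the CB is part of the counting key.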
pycisTopic.fragments.read_fragments_to_pyranges(fragments_bed_filename: str, engine: str | Literal['polars'] | Literal['pyarrow'] | Literal['pandas'] = 'pyarrow') PyRanges[source]

Read fragments BED file to PyRanges object.

Parameters:
fragments_bed_filename

Fragments BED filename.

engine

Use Polars, pyarrow or pandas to read the fragments BED file (default: pyarrow).

Returns:
PyRanges object with fragments.

Examples

Read fragments BED file to PyRanges object with pyarrow engine.

>>> bed_pr = read_fragments_to_pyranges("test.bed", engine="pyarrow")

Gene annotation

pycisTopic.gene_annotation.change_chromosome_source_in_bed(chrom_sizes_and_alias_df_pl: DataFrame, bed_df_pl: DataFrame, from_chrom_source_name: str, to_chrom_source_name: str) DataFrame[source]

Change chromosome names in a Polars DataFrame with BED entries from one chromosome source to another.

Parameters:
chrom_sizes_and_alias_df_pl

Polars DataFrame with chromosome sizes and alias mapping. See pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_file(), pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ncbi() and pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ucsc().

bed_df_pl

Polars DataFrame with BED entries for which chromosome names need to be remapped from from_chrom_source_name to to_chrom_source_name. See pycisTopic.fragments.read_bed_to_polars_df() and pycisTopic.gene_annotation.read_tss_annotation_from_bed().

from_chrom_source_name

Current chromosome source name for the input BED file: ucsc, ensembl, genbank or refseq. Can be guessed with pycisTopic.gene_annotation.find_most_likely_chromosome_source_in_bed().

to_chrom_source_name

Chromosome source name to which the output Polars DataFrame with BED entries should be mapped: ucsc, ensembl, genbank or refseq.

Returns:
Polars DataFrame with BED entries with changed chromosome names.

Examples

Get chromosome sizes and alias mapping for hg38.

>>> chrom_sizes_and_alias_hg38_df_pl = get_chrom_sizes_and_alias_mapping_from_ucsc(ucsc_assembly="hg38")

Get gene annotation for hg38 from Ensembl BioMart.

>>> hg38_tss_annotation_bed_df_pl = get_tss_annotation_from_ensembl(
...     biomart_name="hsapiens_gene_ensembl",
... )
>>> hg38_tss_annotation_bed_df_pl

Replace Ensembl chromosome names with UCSC chromosome names in gene annotation for hg38.

>>> hg38_tss_annotation_ucsc_chroms_bed_df_pl = change_chromosome_source_in_bed(
...     chrom_sizes_and_alias_df_pl=chrom_sizes_and_alias_hg38_df_pl,
...     bed_df_pl=hg38_tss_annotation_bed_df_pl,
...     from_chrom_source_name="ensembl",
...     to_chrom_source_name="ucsc",
... )
>>> hg38_tss_annotation_ucsc_chroms_bed_df_pl
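
Conceptually, the remapping is a lookup from one alias column to another. The sketch below shows that idea on plain Python data. The helper name remap_chromosomes, the dictionary keys and the decision to drop rows without an alias are assumptions, not documented pycisTopic behaviour.

```python
def remap_chromosomes(bed_rows, alias_rows, from_source, to_source):
    """Hypothetical helper: rename chromosomes using an alias table.

    bed_rows: list of (chromosome, start, end) tuples.
    alias_rows: list of dicts with one key per chromosome source.
    Rows whose chromosome has no alias in to_source are dropped
    (an assumption made for this sketch).
    """
    mapping = {
        row[from_source]: row[to_source]
        for row in alias_rows
        if row.get(from_source) and row.get(to_source)
    }
    return [
        (mapping[chrom], start, end)
        for chrom, start, end in bed_rows
        if chrom in mapping
    ]


# Toy alias table with two hg38 chromosomes.
toy_aliases = [
    {"ucsc": "chr1", "ensembl": "1", "refseq": "NC_000001.11"},
    {"ucsc": "chrM", "ensembl": "MT", "refseq": "NC_012920.1"},
]
toy_bed = [("1", 1000, 1001), ("MT", 5, 6), ("KI270728.1", 7, 8)]
toy_remapped = remap_chromosomes(toy_bed, toy_aliases, "ensembl", "ucsc")
```

The scaffold KI270728.1 is absent from the toy alias table, so it does not appear in the remapped output.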
pycisTopic.gene_annotation.find_most_likely_chromosome_source_in_bed(chrom_sizes_and_alias_df_pl: pl.DataFrame, bed_df_pl: pl.DataFrame)[source]

Find which chromosome source is the most likely in the provided BED file entries.

Find which chromosome source (UCSC, Ensembl, GenBank or RefSeq), given as a chrom_sizes_and_alias_df_pl Polars DataFrame, is the most likely in the provided Polars DataFrame with BED entries.

Parameters:
chrom_sizes_and_alias_df_pl

Polars DataFrame with chromosome sizes and alias mapping. See pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_file(), pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ncbi() and pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ucsc().

bed_df_pl

Polars DataFrame with BED entries. See pycisTopic.fragments.read_bed_to_polars_df().

Returns:
Tuple of most likely chromosome source and a Polars DataFrame with the ranking of
all possible chromosome sources.

Examples

>>> chrom_sizes_and_alias_hg38_df_pl = get_chrom_sizes_and_alias_mapping_from_ucsc(ucsc_assembly="hg38")
>>> bed_df_pl = read_bed_to_polars_df("test.bed", engine="pyarrow")
>>> best_chrom_source_name, chrom_source_stats_df_pl = find_most_likely_chromosome_source_in_bed(
...     chrom_sizes_and_alias_df_pl=chrom_sizes_and_alias_hg38_df_pl,
...     bed_df_pl=bed_df_pl,
... )
>>> print(best_chrom_source_name, chrom_source_stats_df_pl)
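
One plausible way to rank chromosome sources is to count how many chromosome names in the BED entries occur in each alias column. The sketch below does exactly that on plain Python data. The helper name rank_chromosome_sources and the scoring rule are assumptions; the actual ranking statistics returned by pycisTopic may differ.

```python
def rank_chromosome_sources(
    bed_chromosomes,
    alias_rows,
    sources=("ucsc", "ensembl", "genbank", "refseq"),
):
    """Hypothetical helper: rank sources by number of matching chromosome names.

    Returns (best_source, ranking), where ranking is a list of
    (source, number_of_distinct_matching_names) sorted from best to worst.
    """
    observed = set(bed_chromosomes)
    ranking = []
    for source in sources:
        known = {row[source] for row in alias_rows if row.get(source)}
        ranking.append((source, len(observed & known)))
    # sort() is stable, so ties keep the order given in `sources`.
    ranking.sort(key=lambda item: item[1], reverse=True)
    return ranking[0][0], ranking


# Toy alias table with two hg38 chromosomes.
toy_alias_table = [
    {"ucsc": "chr1", "ensembl": "1", "genbank": "CM000663.2", "refseq": "NC_000001.11"},
    {"ucsc": "chrM", "ensembl": "MT", "genbank": "J01415.2", "refseq": "NC_012920.1"},
]
toy_best_source, toy_ranking = rank_chromosome_sources(
    ["chr1", "chr1", "chrM", "chr2"], toy_alias_table
)
```

The name of the best source can then be passed as from_chrom_source_name to pycisTopic.gene_annotation.change_chromosome_source_in_bed().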
pycisTopic.gene_annotation.get_all_gene_annotation_ensembl_biomart_dataset_names(biomart_host: str = 'http://www.ensembl.org', use_cache: bool = True) pd.DataFrame[source]

Get all available gene annotation Ensembl BioMart dataset names.

Parameters:
biomart_host

BioMart host URL to use.

use_cache

Whether to cache requests to Ensembl BioMart server.

Returns:
Pandas DataFrame with all available gene annotation Ensembl BioMart datasets.

Examples

>>> biomart_latest_datasets = get_all_gene_annotation_ensembl_biomart_dataset_names(
...     biomart_host="http://www.ensembl.org",
... )
>>> biomart_jul2022_datasets = get_all_gene_annotation_ensembl_biomart_dataset_names(
...     biomart_host="http://jul2022.archive.ensembl.org/",
... )
pycisTopic.gene_annotation.get_biomart_dataset_name_for_species(biomart_datasets: pd.DataFrame, species: str) pd.DataFrame[source]

Get gene annotation Ensembl BioMart dataset names for species of interest.

Parameters:
biomart_datasets

All gene annotation Ensembl BioMart datasets. See pycisTopic.gene_annotation.get_all_gene_annotation_ensembl_biomart_dataset_names().

species

Species name to search for.

Returns:
Pandas DataFrame with the gene annotation Ensembl BioMart dataset names that match the species of interest.
pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_file(chrom_sizes_and_alias_tsv_filename: str | Path) DataFrame[source]

Get chromosome sizes and alias mapping from a chromosome alias TSV file.

Get chromosome sizes and alias mapping from a chromosome alias TSV file to map chromosome names between UCSC, Ensembl, GenBank and RefSeq chromosome names.

Parameters:
chrom_sizes_and_alias_tsv_filename:
Chromosome alias TSV files created with:
  • get_chrom_sizes_and_alias_mapping_from_ncbi

  • get_chrom_sizes_and_alias_mapping_from_ucsc

Returns:
Polars DataFrame with chromosome sizes and alias mapping between UCSC, Ensembl,
GenBank and RefSeq chromosome names.

Examples

Get chromosome sizes and alias mapping for hg38 from a previously written TSV file:

>>> chrom_sizes_and_alias_hg38_from_file_df_pl = get_chrom_sizes_and_alias_mapping_from_file(
...     chrom_sizes_and_alias_tsv_filename="hg38.chrom_sizes_and_alias.tsv",
... )
pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ncbi(accession_id: str, chrom_sizes_and_alias_tsv_filename: str | Path | None) DataFrame[source]

Get chromosome sizes and alias mapping from NCBI sequence reports.

Get chromosome sizes and alias mapping from NCBI sequence reports to be able to map chromosome names between UCSC, Ensembl, GenBank and RefSeq chromosome names or read mapping from local file (chrom_sizes_and_alias_tsv_filename) instead.

Parameters:
accession_id

NCBI assembly accession ID.

chrom_sizes_and_alias_tsv_filename

If specified, write the chromosome sizes and alias mapping to the specified file.

Returns:
Polars DataFrame with chromosome sizes and alias mapping between UCSC, Ensembl,
GenBank and RefSeq chromosome names.

Examples

Get chromosome sizes and alias mapping for different assemblies from NCBI.

Assembly accession IDs for a species can be queried with pycisTopic.gene_annotation.get_ncbi_assembly_accessions_for_species().

>>> chrom_sizes_and_alias_hg38_df_pl = get_chrom_sizes_and_alias_mapping_from_ncbi(
...     accession_id="GCF_000001405.40"
... )
>>> chrom_sizes_and_alias_mm10_df_pl = get_chrom_sizes_and_alias_mapping_from_ncbi(
...     accession_id="GCF_000001635.26"
... )
>>> chrom_sizes_and_alias_dm6_df_pl = get_chrom_sizes_and_alias_mapping_from_ncbi(
...     accession_id="GCF_000001215.4"
... )

Get chromosome sizes and alias mapping for Homo sapiens and also write it to a TSV file:

>>> chrom_sizes_and_alias_hg38_df_pl = get_chrom_sizes_and_alias_mapping_from_ncbi(
...     accession_id="GCF_000001405.40",
...     chrom_sizes_and_alias_tsv_filename="GCF_000001405.40.chrom_sizes_and_alias.tsv",
... )
pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ucsc(ucsc_assembly: str, chrom_sizes_and_alias_tsv_filename: str | Path | None = None) DataFrame[source]

Get chromosome sizes and alias mapping from UCSC genome browser.

Get chromosome sizes and alias mapping from UCSC genome browser for UCSC assembly to be able to map chromosome names between UCSC, Ensembl, GenBank and RefSeq chromosome names or read mapping from local file (chrom_sizes_and_alias_tsv_filename) instead.

Parameters:
ucsc_assembly:

UCSC assembly name (hg38, mm10, dm6, …).

chrom_sizes_and_alias_tsv_filename:

If specified, write the chromosome sizes and alias mapping to the specified file.

Returns:
Polars DataFrame with chromosome sizes and alias mapping between UCSC, Ensembl,
GenBank and RefSeq chromosome names.

Examples

Get chromosome sizes and aliases for different assemblies from UCSC:

>>> chrom_sizes_and_alias_hg38_df_pl = get_chrom_sizes_and_alias_mapping_from_ucsc(
...     ucsc_assembly="hg38"
... )
>>> chrom_sizes_and_alias_mm10_df_pl = get_chrom_sizes_and_alias_mapping_from_ucsc(
...     ucsc_assembly="mm10"
... )
>>> chrom_sizes_and_alias_dm6_df_pl = get_chrom_sizes_and_alias_mapping_from_ucsc(
...     ucsc_assembly="dm6"
... )

Get chromosome sizes and aliases for hg38 and also write it to a TSV file:

>>> chrom_sizes_and_alias_hg38_df_pl = get_chrom_sizes_and_alias_mapping_from_ucsc(
...     ucsc_assembly="hg38",
...     chrom_sizes_and_alias_tsv_filename="hg38.chrom_sizes_and_alias.tsv",
... )
pycisTopic.gene_annotation.get_ncbi_assembly_accessions_for_species(species: str) str[source]

Get NCBI assembly accession numbers and assembly names for a certain species.

Parameters:
species

Species name (Latin name) for which to look for NCBI assembly accession numbers.

Returns:
String with NCBI assembly accession number and assembly name.

Examples

>>> print(get_ncbi_assembly_accessions_for_species("homo sapiens"))
accession   assembly_name
GCF_000001405.40    GRCh38.p14
GCF_000001405.25    GRCh37.p13
GCF_000001405.26    GRCh38
GCF_000001405.27    GRCh38.p1
GCF_000001405.28    GRCh38.p2
GCF_000001405.29    GRCh38.p3
GCF_000001405.30    GRCh38.p4
GCF_000001405.31    GRCh38.p5
GCF_000001405.32    GRCh38.p6
GCF_000001405.33    GRCh38.p7
GCF_000001405.34    GRCh38.p8
GCF_000001405.35    GRCh38.p9
GCF_000001405.36    GRCh38.p10
GCF_000001405.37    GRCh38.p11
GCF_000001405.38    GRCh38.p12
GCF_000001405.39    GRCh38.p13
GCF_000002125.1     HuRef
GCF_000306695.2     CHM1_1.1
GCF_009914755.1     T2T-CHM13v2.0
>>> print(get_ncbi_assembly_accessions_for_species("drosophila melanogaster"))
accession   assembly_name
GCF_000001215.4     Release 6 plus ISO1 MT
pycisTopic.gene_annotation.get_tss_annotation_from_ensembl(biomart_name: str, biomart_host: str = 'http://www.ensembl.org', transcript_type: Sequence[str] | None = ['protein_coding'], use_cache: bool = True) DataFrame[source]

Get TSS annotation for requested transcript types from Ensembl BioMart.

Parameters:
biomart_name

Ensembl BioMart ID of the dataset. See pycisTopic.gene_annotation.get_biomart_dataset_name_for_species() to get the biomart_name for species of interest: e.g.: hsapiens_gene_ensembl, mmusculus_gene_ensembl, dmelanogaster_gene_ensembl, …

biomart_host

BioMart host URL to use.

transcript_type

Only keep list of specified transcript types (e.g.: ["protein_coding"]) or all (None).

use_cache

Whether to cache requests to Ensembl BioMart server.

Returns:
Polars DataFrame with TSS positions in BED format.

Examples

>>> tss_annotation_bed_df_pl = get_tss_annotation_from_ensembl(
...     biomart_name="hsapiens_gene_ensembl"
... )
>>> tss_annotation_jul2022_bed_df_pl = get_tss_annotation_from_ensembl(
...     biomart_name="hsapiens_gene_ensembl",
...     biomart_host="http://jul2022.archive.ensembl.org/",
... )
pycisTopic.gene_annotation.read_tss_annotation_from_bed(tss_annotation_bed_filename: str) DataFrame[source]

Read TSS annotation BED file to Polars DataFrame.

Read TSS annotation BED file created by pycisTopic.gene_annotation.get_tss_annotation_from_ensembl() and pycisTopic.gene_annotation.write_tss_annotation_to_bed() to Polars DataFrame with TSS positions in BED format.

Parameters:
tss_annotation_bed_filename

TSS annotation BED file to read. TSS annotation BED files can be written with pycisTopic.gene_annotation.write_tss_annotation_to_bed() and will have the following header line:

# Chromosome Start End Gene Score Strand Transcript_type

Minimum required columns for pycisTopic.tss_profile.get_tss_profile():

Chromosome, Start (0-based BED), Strand

Returns:
Polars DataFrame with TSS positions in BED format.

Examples

Get TSS annotation from Ensembl.

>>> tss_annotation_bed_df_pl = get_tss_annotation_from_ensembl(
...     biomart_name="hsapiens_gene_ensembl"
... )

If your fragments files use a different chromosome convention than the one used by Ensembl, take a look at pycisTopic.gene_annotation.change_chromosome_source_in_bed() to convert the Ensembl chromosome names to UCSC, GenBank or RefSeq chromosome names.

Write TSS annotation to a file.

>>> write_tss_annotation_to_bed(
...     tss_annotation_bed_df_pl=tss_annotation_bed_df_pl,
...     tss_annotation_bed_filename="hg38.tss.bed",
... )

Read TSS annotation from a file.

>>> tss_annotation_bed_df_pl = read_tss_annotation_from_bed(
...     tss_annotation_bed_filename="hg38.tss.bed"
... )
pycisTopic.gene_annotation.write_tss_annotation_to_bed(tss_annotation_bed_df_pl, tss_annotation_bed_filename: str) None[source]

Write TSS annotation Polars DataFrame to a BED file.

Write TSS annotation Polars DataFrame with TSS positions in BED format to a BED file.

Parameters:
tss_annotation_bed_df_pl

TSS annotation Polars DataFrame with TSS positions in BED format created with pycisTopic.gene_annotation.get_tss_annotation_from_ensembl().

tss_annotation_bed_filename

TSS annotation BED file to write to. TSS annotation BED files from pycisTopic.gene_annotation.get_tss_annotation_from_ensembl() will have the following header line:

# Chromosome Start End Gene Score Strand Transcript_type

Minimum required columns for pycisTopic.tss_profile.get_tss_profile():

Chromosome, Start (0-based BED), Strand

Returns:
None. The TSS annotation is written to tss_annotation_bed_filename.

Examples

Get TSS annotation from Ensembl.

>>> tss_annotation_bed_df_pl = get_tss_annotation_from_ensembl(
...     biomart_name="hsapiens_gene_ensembl"
... )

If your fragments files use a different chromosome convention than the one used by Ensembl, take a look at pycisTopic.gene_annotation.change_chromosome_source_in_bed() to convert the Ensembl chromosome names to UCSC, GenBank or RefSeq chromosome names.

Write TSS annotation to a file.

>>> write_tss_annotation_to_bed(
...     tss_annotation_bed_df_pl=tss_annotation_bed_df_pl,
...     tss_annotation_bed_filename="hg38.tss.bed",
... )

Read TSS annotation from a file.

>>> tss_annotation_bed_df_pl = read_tss_annotation_from_bed(
...     tss_annotation_bed_filename="hg38.tss.bed"
... )

Genomic ranges

pycisTopic.genomic_ranges.intersection(regions1_df_pl: DataFrame, regions2_df_pl: DataFrame, how: Literal['all', 'containment', 'first', 'last'] | str | None = None, regions1_info: bool = True, regions2_info: bool = False, regions1_coord: bool = False, regions2_coord: bool = False, regions1_suffix: str = '@1', regions2_suffix: str = '@2') DataFrame[source]

Get overlapping subintervals between first set and second set of regions.

Parameters:
regions1_df_pl

Polars DataFrame containing BED entries for first set of regions.

regions2_df_pl

Polars DataFrame containing BED entries for second set of regions.

how
What intervals to report:
  • "all" (None): all overlaps with second set of regions.

  • "containment": only overlaps where region of first set is contained within region of second set.

  • "first": first overlap with second set of regions.

  • "last": last overlap with second set of regions.

  • "outer": all regions for first and all regions of second (outer join). If no overlap was found for a region, the other region set will contain None for that entry.

  • "left": all first set of regions and overlap with second set of regions (left join). If no overlap was found for a region in the first set, the second region set will contain None for that entry.

  • "right": all second set of regions and overlap with first set of regions (right join). If no overlap was found for a region in the second set, the first region set will contain None for that entry.

regions1_info

Add non-coordinate columns from first set of regions to output of intersection.

regions2_info

Add non-coordinate columns from second set of regions to output of intersection.

regions1_coord

Add coordinates from first set of regions to output of intersection.

regions2_coord

Add coordinates from second set of regions to output of intersection.

regions1_suffix

Suffix added to coordinate columns of first set of regions.

regions2_suffix

Suffix added to coordinate and info columns of second set of regions.

strandedness

Note: Not implemented yet. {None, "same", "opposite", False}, default None, i.e. auto. Whether to compare PyRanges on the same strand, the opposite strand or ignore strand information. The default, None, means use "same" if both PyRanges are stranded, otherwise ignore the strand information.

Returns:
intersection_df_pl

Polars DataFrame containing BED entries with the intersection.

Examples

>>> regions1_df_pl = pl.from_dict(
...     {
...         "Chromosome": ["chr1"] * 3,
...         "Start": [1, 4, 10],
...         "End": [3, 9, 11],
...         "ID": ["a", "b", "c"],
...     }
... )
>>> regions1_df_pl
shape: (3, 4)
┌────────────┬───────┬─────┬─────┐
│ Chromosome ┆ Start ┆ End ┆ ID  │
│ ---        ┆ ---   ┆ --- ┆ --- │
│ str        ┆ i64   ┆ i64 ┆ str │
╞════════════╪═══════╪═════╪═════╡
│ chr1       ┆ 1     ┆ 3   ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 4     ┆ 9   ┆ b   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 10    ┆ 11  ┆ c   │
└────────────┴───────┴─────┴─────┘
>>> regions2_df_pl = pl.from_dict(
...     {
...         "Chromosome": ["chr1"] * 3,
...         "Start": [2, 2, 9],
...         "End": [3, 9, 10],
...         "Name": ["reg1", "reg2", "reg3"]
...     }
... )
>>> regions2_df_pl
shape: (3, 4)
┌────────────┬───────┬─────┬──────┐
│ Chromosome ┆ Start ┆ End ┆ Name │
│ ---        ┆ ---   ┆ --- ┆ ---  │
│ str        ┆ i64   ┆ i64 ┆ str  │
╞════════════╪═══════╪═════╪══════╡
│ chr1       ┆ 2     ┆ 3   ┆ reg1 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ chr1       ┆ 2     ┆ 9   ┆ reg2 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ chr1       ┆ 9     ┆ 10  ┆ reg3 │
└────────────┴───────┴─────┴──────┘
>>> intersection(regions1_df_pl, regions2_df_pl)
shape: (3, 4)
┌────────────┬───────┬─────┬─────┐
│ Chromosome ┆ Start ┆ End ┆ ID  │
│ ---        ┆ ---   ┆ --- ┆ --- │
│ str        ┆ i64   ┆ i64 ┆ str │
╞════════════╪═══════╪═════╪═════╡
│ chr1       ┆ 2     ┆ 3   ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 2     ┆ 3   ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 4     ┆ 9   ┆ b   │
└────────────┴───────┴─────┴─────┘
>>> intersection(regions1_df_pl, regions2_df_pl, how="first")
shape: (2, 4)
┌────────────┬───────┬─────┬─────┐
│ Chromosome ┆ Start ┆ End ┆ ID  │
│ ---        ┆ ---   ┆ --- ┆ --- │
│ str        ┆ i64   ┆ i64 ┆ str │
╞════════════╪═══════╪═════╪═════╡
│ chr1       ┆ 2     ┆ 3   ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 4     ┆ 9   ┆ b   │
└────────────┴───────┴─────┴─────┘
>>> intersection(
...     regions1_df_pl,
...     regions2_df_pl,
...     how="containment",
...     regions1_info=False,
...     regions2_info=True,
... )
shape: (1, 4)
┌────────────┬───────┬─────┬──────┐
│ Chromosome ┆ Start ┆ End ┆ Name │
│ ---        ┆ ---   ┆ --- ┆ ---  │
│ str        ┆ i64   ┆ i64 ┆ str  │
╞════════════╪═══════╪═════╪══════╡
│ chr1       ┆ 4     ┆ 9   ┆ reg2 │
└────────────┴───────┴─────┴──────┘
>>> intersection(
...     regions1_df_pl,
...     regions2_df_pl,
...     regions1_coord=True,
...     regions2_coord=True,
... )
shape: (3, 10)
┌────────────┬───────┬─────┬──────────────┬─────────┬───────┬──────────────┬─────────┬───────┬─────┐
│ Chromosome ┆ Start ┆ End ┆ Chromosome@1 ┆ Start@1 ┆ End@1 ┆ Chromosome@2 ┆ Start@2 ┆ End@2 ┆ ID  │
│ ---        ┆ ---   ┆ --- ┆ ---          ┆ ---     ┆ ---   ┆ ---          ┆ ---     ┆ ---   ┆ --- │
│ str        ┆ i64   ┆ i64 ┆ str          ┆ i64     ┆ i64   ┆ str          ┆ i64     ┆ i64   ┆ str │
╞════════════╪═══════╪═════╪══════════════╪═════════╪═══════╪══════════════╪═════════╪═══════╪═════╡
│ chr1       ┆ 2     ┆ 3   ┆ chr1         ┆ 1       ┆ 3     ┆ chr1         ┆ 2       ┆ 9     ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 2     ┆ 3   ┆ chr1         ┆ 1       ┆ 3     ┆ chr1         ┆ 2       ┆ 3     ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 4     ┆ 9   ┆ chr1         ┆ 4       ┆ 9     ┆ chr1         ┆ 2       ┆ 9     ┆ b   │
└────────────┴───────┴─────┴──────────────┴─────────┴───────┴──────────────┴─────────┴───────┴─────┘
>>> intersection(
...     regions1_df_pl,
...     regions2_df_pl,
...     regions1_info=False,
...     regions2_info=True,
...     regions2_coord=True,
... )
shape: (3, 7)
┌────────────┬───────┬─────┬──────────────┬─────────┬───────┬──────┐
│ Chromosome ┆ Start ┆ End ┆ Chromosome@2 ┆ Start@2 ┆ End@2 ┆ Name │
│ ---        ┆ ---   ┆ --- ┆ ---          ┆ ---     ┆ ---   ┆ ---  │
│ str        ┆ i64   ┆ i64 ┆ str          ┆ i64     ┆ i64   ┆ str  │
╞════════════╪═══════╪═════╪══════════════╪═════════╪═══════╪══════╡
│ chr1       ┆ 2     ┆ 3   ┆ chr1         ┆ 2       ┆ 9     ┆ reg2 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ chr1       ┆ 2     ┆ 3   ┆ chr1         ┆ 2       ┆ 3     ┆ reg1 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ chr1       ┆ 4     ┆ 9   ┆ chr1         ┆ 2       ┆ 9     ┆ reg2 │
└────────────┴───────┴─────┴──────────────┴─────────┴───────┴──────┘
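
The default behaviour in the first example (report every overlapping subinterval) can be reproduced with a short pure-Python sketch. The helper name intersect_all is hypothetical, the quadratic loop is for clarity only, and the row order of the real function may differ.

```python
def intersect_all(regions1, regions2):
    """Hypothetical helper: report every overlapping subinterval (how="all").

    Regions are (chromosome, start, end, label) tuples with half-open
    [start, end) coordinates, as in BED. The label of the first set is kept.
    """
    hits = []
    for chrom1, start1, end1, label1 in regions1:
        for chrom2, start2, end2, _label2 in regions2:
            if chrom1 != chrom2:
                continue
            start, end = max(start1, start2), min(end1, end2)
            if start < end:  # intervals that only touch do not overlap
                hits.append((chrom1, start, end, label1))
    return hits


toy_regions1 = [("chr1", 1, 3, "a"), ("chr1", 4, 9, "b"), ("chr1", 10, 11, "c")]
toy_regions2 = [
    ("chr1", 2, 3, "reg1"),
    ("chr1", 2, 9, "reg2"),
    ("chr1", 9, 10, "reg3"),
]
toy_intersection = intersect_all(toy_regions1, toy_regions2)
```

Region b (4 to 9) and reg3 (9 to 10) share only a boundary, so with half-open coordinates they produce no row, which matches the three rows shown in the first example.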
pycisTopic.genomic_ranges.overlap(regions1_df_pl: DataFrame, regions2_df_pl: DataFrame, how: Literal['all', 'containment', 'first'] | str | None = 'first', invert: bool = False) DataFrame[source]

Get overlap between two region sets.

Get overlap between first set and second set of regions and return interval of first set of regions.

Parameters:
regions1_df_pl

Polars DataFrame containing BED entries for first set of regions.

regions2_df_pl

Polars DataFrame containing BED entries for second set of regions.

how
What overlaps to report:
  • "all" (None): all overlaps with second set of regions.

  • "containment": only overlaps where region of first set is contained within region of second set.

  • "first": first overlap with second set of regions.

invert

Whether to return the intervals without overlaps.

strandedness

Note: Not implemented yet. {None, "same", "opposite", False}, default None, i.e. auto. Whether to compare PyRanges on the same strand, the opposite strand or ignore strand information. The default, None, means use "same" if both PyRanges are stranded, otherwise ignore the strand information.

Returns:
overlap_df_pl

Polars DataFrame containing BED entries with the overlap.

Examples

>>> regions1_df_pl = pl.from_dict(
...     {
...         "Chromosome": ["chr1"] * 3,
...         "Start": [1, 4, 10],
...         "End": [3, 9, 11],
...         "ID": ["a", "b", "c"],
...     }
... )
>>> regions1_df_pl
shape: (3, 4)
┌────────────┬───────┬─────┬─────┐
│ Chromosome ┆ Start ┆ End ┆ ID  │
│ ---        ┆ ---   ┆ --- ┆ --- │
│ str        ┆ i64   ┆ i64 ┆ str │
╞════════════╪═══════╪═════╪═════╡
│ chr1       ┆ 1     ┆ 3   ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 4     ┆ 9   ┆ b   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 10    ┆ 11  ┆ c   │
└────────────┴───────┴─────┴─────┘
>>> regions2_df_pl = pl.from_dict(
...     {
...         "Chromosome": ["chr1"] * 3,
...         "Start": [2, 2, 9],
...         "End": [3, 9, 10],
...         "Name": ["reg1", "reg2", "reg3"]
...     }
... )
>>> regions2_df_pl
shape: (3, 4)
┌────────────┬───────┬─────┬──────┐
│ Chromosome ┆ Start ┆ End ┆ Name │
│ ---        ┆ ---   ┆ --- ┆ ---  │
│ str        ┆ i64   ┆ i64 ┆ str  │
╞════════════╪═══════╪═════╪══════╡
│ chr1       ┆ 2     ┆ 3   ┆ reg1 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ chr1       ┆ 2     ┆ 9   ┆ reg2 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ chr1       ┆ 9     ┆ 10  ┆ reg3 │
└────────────┴───────┴─────┴──────┘
>>> overlap(regions1_df_pl, regions2_df_pl, how="first")
shape: (2, 4)
┌────────────┬───────┬─────┬─────┐
│ Chromosome ┆ Start ┆ End ┆ ID  │
│ ---        ┆ ---   ┆ --- ┆ --- │
│ str        ┆ i64   ┆ i64 ┆ str │
╞════════════╪═══════╪═════╪═════╡
│ chr1       ┆ 1     ┆ 3   ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 4     ┆ 9   ┆ b   │
└────────────┴───────┴─────┴─────┘
>>> overlap(regions1_df_pl, regions2_df_pl, how="all")
shape: (3, 4)
┌────────────┬───────┬─────┬─────┐
│ Chromosome ┆ Start ┆ End ┆ ID  │
│ ---        ┆ ---   ┆ --- ┆ --- │
│ str        ┆ i64   ┆ i64 ┆ str │
╞════════════╪═══════╪═════╪═════╡
│ chr1       ┆ 1     ┆ 3   ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 1     ┆ 3   ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 4     ┆ 9   ┆ b   │
└────────────┴───────┴─────┴─────┘
>>> overlap(regions1_df_pl, regions2_df_pl, how="containment")
shape: (1, 4)
┌────────────┬───────┬─────┬─────┐
│ Chromosome ┆ Start ┆ End ┆ ID  │
│ ---        ┆ ---   ┆ --- ┆ --- │
│ str        ┆ i64   ┆ i64 ┆ str │
╞════════════╪═══════╪═════╪═════╡
│ chr1       ┆ 4     ┆ 9   ┆ b   │
└────────────┴───────┴─────┴─────┘
>>> overlap(regions1_df_pl, regions2_df_pl, how="containment", invert=True)
shape: (2, 4)
┌────────────┬───────┬─────┬─────┐
│ Chromosome ┆ Start ┆ End ┆ ID  │
│ ---        ┆ ---   ┆ --- ┆ --- │
│ str        ┆ i64   ┆ i64 ┆ str │
╞════════════╪═══════╪═════╪═════╡
│ chr1       ┆ 1     ┆ 3   ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 10    ┆ 11  ┆ c   │
└────────────┴───────┴─────┴─────┘
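
The how and invert options shown above can be mimicked on plain tuples. The sketch below is an illustration under assumptions, not the pycisTopic implementation: the helper name overlap_sketch is hypothetical, and reporting a region once for "containment" is a choice made here.

```python
def overlap_sketch(regions1, regions2, how="first", invert=False):
    """Hypothetical helper: return regions of the first set that overlap the second.

    Regions are (chromosome, start, end, label) tuples with half-open
    [start, end) coordinates.
    """
    result = []
    for region in regions1:
        chrom1, start1, end1, _label = region
        if how == "containment":
            # Region of the first set must lie entirely inside a second-set region.
            matches = [
                r for r in regions2
                if r[0] == chrom1 and r[1] <= start1 and end1 <= r[2]
            ]
        else:
            matches = [
                r for r in regions2
                if r[0] == chrom1 and max(start1, r[1]) < min(end1, r[2])
            ]
        if invert:
            if not matches:
                result.append(region)  # keep only regions without a match
        elif how in ("all", None):
            result.extend([region] * len(matches))  # one row per overlap
        elif matches:
            result.append(region)  # "first"/"containment": report once
    return result


toy_r1 = [("chr1", 1, 3, "a"), ("chr1", 4, 9, "b"), ("chr1", 10, 11, "c")]
toy_r2 = [("chr1", 2, 3, "reg1"), ("chr1", 2, 9, "reg2"), ("chr1", 9, 10, "reg3")]
toy_first = overlap_sketch(toy_r1, toy_r2, how="first")
toy_all = overlap_sketch(toy_r1, toy_r2, how="all")
toy_contained = overlap_sketch(toy_r1, toy_r2, how="containment")
toy_not_contained = overlap_sketch(toy_r1, toy_r2, how="containment", invert=True)
```

On the toy regions this yields the same labels as the examples above: a and b for "first", a, a and b for "all", b for "containment", and a and c for "containment" with invert=True.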

TSS profile

pycisTopic.tss_profile.get_tss_profile(fragments_df_pl: DataFrame, tss_annotation: DataFrame, flank_window: int = 2000, smoothing_rolling_window: int = 10, minimum_signal_window: int = 100, tss_window: int = 50, min_norm: float = 0.2, use_genomic_ranges: bool = True)[source]

Get TSS profile for Polars DataFrame with fragments filtered by cell barcodes.

Parameters:
fragments_df_pl

Polars DataFrame with fragments (filtered by cell barcodes of interest). See pycisTopic.fragments.filter_fragments_by_cb().

tss_annotation

TSS annotation Polars DataFrame with at least the following columns: ["Chromosome", "Start", "Strand"]. The “Start” column is 0-based like a BED file. See pycisTopic.gene_annotation.get_tss_annotation_from_ensembl() and pycisTopic.gene_annotation.change_chromosome_source_in_bed() for ways to get TSS annotation from Ensembl BioMart.

flank_window

Flanking window around the TSS. Used for intersecting fragments with TSS positions and keeping cut sites. Default: 2000 (+/- 2000 bp).

smoothing_rolling_window

Rolling window used to smooth the cut sites signal. Default: 10.

minimum_signal_window
Average signal in the tails of the flanking window around the TSS:
  • [-flank_window, -flank_window + minimum_signal_window - 1]

  • [flank_window - minimum_signal_window + 1, flank_window]

is used to normalize the TSS enrichment. Default: 100 (average signal in [-2000, -1901], [1901, 2000] around TSS if flank_window=2000).

tss_window

Window around the TSS used to count fragments in the TSS when calculating the TSS enrichment per cell barcode. Default: 50 (+/- 50 bp).

min_norm

Minimum normalization score. If the average minimum signal value is below this value, this number is used to normalize the TSS signal. This approach penalizes cells with fewer reads. Default: 0.2.

use_genomic_ranges

Use genomic ranges implementation for calculating intersections, instead of using pyranges.

Returns:
tss_enrichment_per_cb, tss_norm_matrix_sample, tss_norm_matrix_per_cb

Examples

Get TSS annotation for requested transcript types from Ensembl BioMart.

>>> ensembl_tss_annotation_bed_df_pl = get_tss_annotation_from_ensembl(
...     biomart_name="hsapiens_gene_ensembl"
... )

Get TSS profile for Polars DataFrame with fragments filtered by cell barcodes.

>>> get_tss_profile(
...     fragments_df_pl=fragments_cb_filtered_df_pl,
...     tss_annotation=ensembl_tss_annotation_bed_df_pl,
...     flank_window=2000,
...     smoothing_rolling_window=10,
...     minimum_signal_window=100,
...     tss_window=50,
...     min_norm=0.2,
... )
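
To make the roles of minimum_signal_window, min_norm and tss_window concrete, the sketch below normalizes a toy cut-site signal by the average of its tails and summarizes the signal around the TSS. It is a simplified illustration under assumptions: the helper name tss_enrichment_sketch is hypothetical, smoothing is omitted, and taking the mean of the normalized signal within +/- tss_window is an assumption about the final summary step.

```python
def tss_enrichment_sketch(
    signal, flank_window, minimum_signal_window, tss_window, min_norm
):
    """Hypothetical helper: tail-normalized TSS enrichment for one cell barcode.

    signal: list of cut-site counts of length 2 * flank_window + 1, where
    index i corresponds to position i - flank_window relative to the TSS.
    """
    if len(signal) != 2 * flank_window + 1:
        raise ValueError("signal must cover [-flank_window, flank_window]")

    # Average signal in both tails of the flanking window (background).
    tails = signal[:minimum_signal_window] + signal[-minimum_signal_window:]
    background = max(sum(tails) / len(tails), min_norm)

    normalized = [value / background for value in signal]

    # Assumed summary: mean normalized signal within +/- tss_window of the TSS.
    center = normalized[flank_window - tss_window : flank_window + tss_window + 1]
    return sum(center) / len(center)


toy_signal = [1, 1, 2, 4, 8, 10, 8, 4, 2, 1, 1]
toy_tss_enrichment = tss_enrichment_sketch(
    toy_signal, flank_window=5, minimum_signal_window=2, tss_window=1, min_norm=0.2
)

# With empty tails the background falls back to min_norm.
toy_sparse_signal = [0, 0, 1, 2, 4, 5, 4, 2, 1, 0, 0]
toy_sparse_enrichment = tss_enrichment_sketch(
    toy_sparse_signal, flank_window=5, minimum_signal_window=2, tss_window=1, min_norm=0.2
)
```

The second call shows why min_norm matters: without the floor, a cell barcode with no signal in the tails would cause a division by zero.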

QC

pycisTopic.qc.compute_kde(training_data: ndarray, test_data: ndarray, no_threads: int = 8)[source]

Compute kernel-density estimate (KDE) using Gaussian kernels.

This function calculates the KDE in parallel and gives the same result as:

>>> from scipy.stats import gaussian_kde
>>> gaussian_kde(training_data)(test_data)
Parameters:
training_data

2D numpy array with training data to train the KDE.

test_data

2D numpy array with test data for which to evaluate the estimated probability density function (PDF).

no_threads

Number of threads to use in parallelization of KDE function.

Returns:
1D numpy array with probability density function (PDF) values for points in
test_data.
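
For intuition, a Gaussian KDE can be written out by hand in the one-dimensional case. The sketch below is a simplification of what the function above computes on 2D data: the helper name gaussian_kde_1d is hypothetical, and it uses Scott's rule for the bandwidth, which is the default of scipy.stats.gaussian_kde.

```python
import math


def gaussian_kde_1d(training, test):
    """Hypothetical helper: 1D Gaussian KDE with Scott's rule bandwidth.

    training: list of floats used to build the estimate (at least 2 values).
    test: list of floats at which to evaluate the estimated density.
    """
    n = len(training)
    mean = sum(training) / n
    # Unbiased sample variance (ddof=1).
    variance = sum((x - mean) ** 2 for x in training) / (n - 1)
    # Scott's factor for d=1 is n**(-1/5); the kernel variance is its square
    # times the data variance.
    kernel_variance = variance * n ** (-2 / 5)
    norm = 1.0 / (n * math.sqrt(2 * math.pi * kernel_variance))
    return [
        norm * sum(
            math.exp(-((t - x) ** 2) / (2 * kernel_variance)) for x in training
        )
        for t in test
    ]


toy_density = gaussian_kde_1d([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
```

Each test point costs one pass over all training points, which is why evaluating many test points in parallel, as no_threads allows, can help on large inputs.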
pycisTopic.qc.compute_qc_stats(fragments_df_pl: DataFrame, regions_df_pl: DataFrame, tss_annotation: DataFrame, tss_flank_window: int = 2000, tss_smoothing_rolling_window: int = 10, tss_minimum_signal_window: int = 100, tss_window: int = 50, tss_min_norm: float = 0.2, use_genomic_ranges: bool = True, min_fragments_per_cb: int = 10, collapse_duplicates: bool = True, no_threads: int = 8) tuple[DataFrame, DataFrame, DataFrame, DataFrame][source]

Compute quality check statistics from Polars DataFrame with fragments.

Parameters:
fragments_df_pl

Polars DataFrame with fragments (filtered by cell barcodes of interest). See pycisTopic.fragments.filter_fragments_by_cb().

regions_df_pl

Polars DataFrame with peak regions (consensus peaks or SCREEN regions). See pycisTopic.fragments.read_bed_to_polars_df() for a way to read a BED file with peak regions.

tss_annotation

TSS annotation Polars DataFrame with at least the following columns: ["Chromosome", "Start", "Strand"]. The “Start” column is 0-based like a BED file. See pycisTopic.gene_annotation.read_tss_annotation_from_bed(), pycisTopic.gene_annotation.get_tss_annotation_from_ensembl() and pycisTopic.gene_annotation.change_chromosome_source_in_bed() for ways to get TSS annotation from Ensembl BioMart.

tss_flank_window

Flanking window around the TSS. Used for intersecting fragments with TSS positions and keeping cut sites. Default: 2000 (+/- 2000 bp). See pycisTopic.tss_profile.get_tss_profile().

tss_smoothing_rolling_window

Rolling window used to smooth the cut sites signal. Default: 10. See pycisTopic.tss_profile.get_tss_profile().

tss_minimum_signal_window
Average signal in the tails of the flanking window around the TSS:
  • [-flank_window, -flank_window + minimum_signal_window - 1]

  • [flank_window - minimum_signal_window + 1, flank_window]

is used to normalize the TSS enrichment. Default: 100 (average signal in [-2000, -1901], [1901, 2000] around TSS if flank_window=2000). See pycisTopic.tss_profile.get_tss_profile().

tss_window

Window around the TSS used to count fragments in the TSS when calculating the TSS enrichment per cell barcode. Default: 50 (+/- 50 bp). See pycisTopic.tss_profile.get_tss_profile().

tss_min_norm

Minimum normalization score. If the average minimum signal value is below this value, this number is used to normalize the TSS signal. This approach penalizes cells with fewer reads. Default: 0.2. See pycisTopic.tss_profile.get_tss_profile().

use_genomic_ranges

Use genomic ranges implementation for calculating intersections, instead of using pyranges.

min_fragments_per_cb

Minimum number of fragments needed per cell barcode to keep the fragments for those cell barcodes.

collapse_duplicates

Collapse duplicate fragments (same chromosomal positions and linked to the same cell barcode).

no_threads

Number of threads to use when calculating kernel-density estimate (KDE) to get probability density function (PDF) values for log10 unique fragments in peaks vs TSS enrichment, fractions of fragments in peaks and duplication ratio. Default: 8

Returns:
Tuple with:
  • Polars DataFrame with fragments statistics per cell barcode.

  • Polars DataFrame with insert size distribution of fragments.

  • Polars DataFrame with TSS normalization matrix for the whole sample.

  • Polars DataFrame with TSS normalization matrix per cell barcode.

Examples

>>> from pycisTopic.fragments import read_bed_to_polars_df
>>> from pycisTopic.fragments import read_fragments_to_polars_df
>>> from pycisTopic.gene_annotation import read_tss_annotation_from_bed
  1. Read gzipped fragments BED file to a Polars DataFrame.

>>> fragments_df_pl = read_fragments_to_polars_df(
...     fragments_bed_filename="fragments.tsv.gz",
... )
  2. Read BED file with consensus peaks or SCREEN regions (get first 3 columns only) which will be used for counting number of fragments in peaks.

>>> regions_df_pl = read_bed_to_polars_df(
...     bed_filename=screen_regions_bed_filename,
...     min_column_count=3,
... )
  3. Read TSS annotation from a file. See pycisTopic.gene_annotation.read_tss_annotation_from_bed() for more info.

>>> tss_annotation_bed_df_pl = read_tss_annotation_from_bed(
...     tss_annotation_bed_filename="hg38.tss.bed",
... )
  4. Compute QC statistics.

>>> (
...     fragments_stats_per_cb_df_pl,
...     insert_size_dist_df_pl,
...     tss_norm_matrix_sample,
...     tss_norm_matrix_per_cb,
... ) = compute_qc_stats(
...     fragments_df_pl=fragments_cb_filtered_df_pl,
...     regions_df_pl=regions_df_pl,
...     tss_annotation=tss_annotation_bed_df_pl,
...     tss_flank_window=2000,
...     tss_smoothing_rolling_window=10,
...     tss_minimum_signal_window=100,
...     tss_window=50,
...     tss_min_norm=0.2,
...     use_genomic_ranges=True,
...     min_fragments_per_cb=10,
...     collapse_duplicates=True,
...     no_threads=8,
... )
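
The interplay of the tss_* parameters can be sketched in plain Python. This is only an illustration under stated assumptions: the profile is a list of smoothed cut-site signal values for positions -flank_window to +flank_window, the enrichment score is taken as the mean normalized signal within +/- tss_window, and the function name tss_enrichment is hypothetical. The actual implementation in pycisTopic.tss_profile.get_tss_profile() works on Polars DataFrames and may differ in detail.

```python
def tss_enrichment(profile, flank_window=2000, minimum_signal_window=100,
                   tss_window=50, min_norm=0.2):
    """Normalize a per-position signal by its tail average and score the TSS."""
    assert len(profile) == 2 * flank_window + 1
    # Tails: the first and last `minimum_signal_window` positions of the profile.
    tails = profile[:minimum_signal_window] + profile[-minimum_signal_window:]
    background = sum(tails) / len(tails)
    # Never divide by less than `min_norm`; this penalizes sparse barcodes.
    norm = max(background, min_norm)
    normalized = [value / norm for value in profile]
    # Positions -tss_window..+tss_window around the TSS (index `flank_window`).
    centre = normalized[flank_window - tss_window: flank_window + tss_window + 1]
    return sum(centre) / len(centre)  # assumed definition of the score


# Small windows keep the toy example readable.
dense = [1.0] * 21
dense[9:12] = [8.0, 8.0, 8.0]
sparse = [0.05] * 21
sparse[9:12] = [0.4, 0.4, 0.4]

dense_score = tss_enrichment(dense, flank_window=10, minimum_signal_window=3, tss_window=1)
sparse_score = tss_enrichment(sparse, flank_window=10, minimum_signal_window=3, tss_window=1)
```

Both toy profiles have the same eightfold peak-to-background ratio, but the sparse one is divided by min_norm instead of its own very low background, so it receives a lower score.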
pycisTopic.qc.get_barcodes_passing_qc_for_sample(sample_id: str, pycistopic_qc_output_dir: str | Path, unique_fragments_threshold: int | None = None, tss_enrichment_threshold: float | None = None, frip_threshold: float | None = None, use_automatic_thresholds: bool = True) tuple[np.ndarray, dict[str, float]][source]

Get barcodes passing quality control (QC) for a sample.

Parameters:
sample_id

Sample ID.

pycistopic_qc_output_dir

Directory with output from pycistopic qc.

unique_fragments_threshold

Threshold for number of unique fragments in peaks. If not defined, and use_automatic_thresholds is False, the threshold will be set to 0.

tss_enrichment_threshold

Threshold for TSS enrichment score. If not defined, and use_automatic_thresholds is False, the threshold will be set to 0.

frip_threshold

Threshold for fraction of reads in peaks (FRiP). If not defined the threshold will be set to 0.

use_automatic_thresholds

Use automatic thresholds for unique fragments in peaks and TSS enrichment score as calculated by Otsu’s method. If False, the thresholds will be set to 0 if not defined.

Returns:
Tuple with:
  • Numpy array with cell barcodes passing QC.

  • Dictionary with thresholds used for QC.

Raises:
FileNotFoundError

If the file with fragments statistics per cell barcode does not exist.
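
The threshold logic described above can be sketched as follows. This is a simplified stand-in, not the library code: statistics are plain dictionaries rather than a Polars DataFrame read from the QC output directory, the keys used here are assumed names, the comparison is assumed to be "greater than or equal", and automatic (Otsu) thresholds are left out.

```python
def barcodes_passing_qc(stats, unique_fragments_threshold=None,
                        tss_enrichment_threshold=None, frip_threshold=None):
    """Return barcodes meeting all thresholds, plus the thresholds applied."""
    # Undefined thresholds fall back to 0, as described for
    # use_automatic_thresholds=False.
    thresholds = {
        "unique_fragments_threshold": unique_fragments_threshold or 0,
        "tss_enrichment_threshold": tss_enrichment_threshold or 0.0,
        "frip_threshold": frip_threshold or 0.0,
    }
    passing = [
        row["CB"]
        for row in stats
        # Column names below are hypothetical.
        if row["unique_fragments_in_peaks_count"] >= thresholds["unique_fragments_threshold"]
        and row["tss_enrichment"] >= thresholds["tss_enrichment_threshold"]
        and row["fraction_of_fragments_in_peaks"] >= thresholds["frip_threshold"]
    ]
    return passing, thresholds


stats = [
    {"CB": "AAAC", "unique_fragments_in_peaks_count": 5200,
     "tss_enrichment": 9.1, "fraction_of_fragments_in_peaks": 0.55},
    {"CB": "CCGT", "unique_fragments_in_peaks_count": 300,
     "tss_enrichment": 7.5, "fraction_of_fragments_in_peaks": 0.40},
    {"CB": "GGTA", "unique_fragments_in_peaks_count": 4100,
     "tss_enrichment": 2.0, "fraction_of_fragments_in_peaks": 0.35},
]
passing, used = barcodes_passing_qc(
    stats, unique_fragments_threshold=1000, tss_enrichment_threshold=5.0
)
```

Only the first barcode passes both explicit thresholds; frip_threshold was not given, so it defaults to 0 and filters nothing.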

pycisTopic.qc.get_otsu_threshold(fragments_stats_per_cb_df_pl: DataFrame, min_otsu_fragments: int = 100, min_otsu_tss: float = 1.0)[source]

Get Otsu thresholds for number of unique fragments in peaks and TSS enrichment score.

Parameters:
fragments_stats_per_cb_df_pl

Polars DataFrame with fragments statistics per cell barcode as generated by pycisTopic.qc.compute_qc_stats().

min_otsu_fragments

When calculating Otsu threshold for number of unique fragments in peaks per CB, only consider those CBs which have at least this number of fragments.

min_otsu_tss

When calculating Otsu threshold for TSS enrichment score per CB, only consider those CBs which have at least this TSS value.

Returns:
Tuple with:
  • Otsu threshold for number of unique fragments in peaks.

  • Otsu threshold for TSS enrichment.

  • Polars DataFrame with fragments statistics per cell barcode for cell barcodes that passed both Otsu thresholds.

Examples

Only keep fragments stats for CBs that pass both Otsu thresholds.

>>> (
...     unique_fragments_in_peaks_count_otsu_threshold,
...     tss_enrichment_otsu_threshold,
...     fragments_stats_per_cb_for_otsu_threshold_df_pl,
... ) = get_otsu_threshold(
...     fragments_stats_per_cb_df_pl=fragments_stats_per_cb_df_pl,
...     min_otsu_fragments=100,
...     min_otsu_tss=1.0,
... )
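
For readers unfamiliar with Otsu's method, the following is a minimal histogram-based sketch in plain Python. It is a generic textbook version, not the pycisTopic implementation, which may bin or transform the values differently (for example by working on a log scale). The function name, the nbins parameter and the toy values are assumptions made for illustration.

```python
def otsu_threshold(values, nbins=256):
    """Return the bin centre that maximizes the between-class variance."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        counts[min(int((v - lo) / width), nbins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(nbins)]

    total = len(values)
    total_sum = sum(c * m for c, m in zip(counts, centers))
    best_threshold, best_variance = centers[0], -1.0
    weight_low, sum_low = 0, 0.0
    for i in range(nbins - 1):
        weight_low += counts[i]
        sum_low += counts[i] * centers[i]
        weight_high = total - weight_low
        if weight_low == 0 or weight_high == 0:
            continue
        mean_low = sum_low / weight_low
        mean_high = (total_sum - sum_low) / weight_high
        # Between-class variance (up to a constant factor).
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, centers[i]
    return best_threshold


# Toy fragment counts per barcode: background, low-quality and high-quality groups.
fragment_counts = [5, 20, 110, 120, 130, 140, 150, 5000, 5100, 5200, 5300]
# Mirror min_otsu_fragments: ignore barcodes below the minimum before thresholding.
kept = [v for v in fragment_counts if v >= 100]
threshold = otsu_threshold(kept)
```

Dropping barcodes below min_otsu_fragments before thresholding keeps the large mass of near-empty barcodes from dominating the histogram and pulling the threshold down.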

Topic modelling

class pycisTopic.lda_models.CistopicLDAModel(metrics: DataFrame, coherence: DataFrame, marg_topic: DataFrame, topic_ass: DataFrame, cell_topic: DataFrame, topic_region: DataFrame, parameters: DataFrame)[source]

cisTopic LDA model class.

CistopicLDAModel contains model quality metrics (model coherence (adaptation from Mimno et al., 2011), log-likelihood (Griffiths and Steyvers, 2004), density-based (Cao Juan et al., 2009) and divergence-based (Arun et al., 2010)), topic quality metrics (coherence, marginal distribution and total number of assignments), cell-topic and topic-region distributions, model parameters and model dimensions.

Parameters:
metrics: pd.DataFrame

pd.DataFrame containing model quality metrics, including model coherence (adaptation from Mimno et al., 2011), log-likelihood and density and divergence-based methods (Cao Juan et al., 2009; Arun et al., 2010).

coherence: pd.DataFrame

pd.DataFrame containing the coherence of each topic (Mimno et al., 2011).

marg_topic: pd.DataFrame

pd.DataFrame containing the marginal distribution for each topic. It can be interpreted as the importance of each topic for the whole corpus.

topic_ass: pd.DataFrame

pd.DataFrame containing the total number of assignments per topic.

cell_topic: pd.DataFrame

pd.DataFrame containing the cell-topic distributions, with cells as columns, topics as rows and the probability of each topic in each cell as values.

topic_region: pd.DataFrame

pd.DataFrame containing the topic-region distributions, with topics as columns, regions as rows and the probability of each region in each topic as values.

parameters: pd.DataFrame

pd.DataFrame containing parameters used for the model.

n_cells: int

Number of cells in the model.

n_regions: int

Number of regions in the model.

n_topic: int

Number of topics in the model.

References

Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 262-272).

Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228-5235.

Cao, J., Xia, T., Li, J., Zhang, Y., & Tang, S. (2009). A density-based method for adaptive LDA model selection. Neurocomputing, 72(7-9), 1775-1781.

Arun, R., Suresh, V., Madhavan, C. V., & Murthy, M. N. (2010). On finding the natural number of topics with latent dirichlet allocation: Some observations. In Pacific-Asia conference on knowledge discovery and data mining (pp. 391-402). Springer, Berlin, Heidelberg.

class pycisTopic.lda_models.LDAMallet(num_topics: int, corpus: Iterable | None = None, alpha: float | None = 50, eta: float | None = 0.1, id2word: FakeDict | None = None, n_cpu: int | None = 1, tmp_dir: str | None = None, optimize_interval: int | None = 0, iterations: int | None = 150, topic_threshold: float | None = 0.0, random_seed: int | None = 555, reuse_corpus: bool | None = False, mallet_path: str = 'mallet')[source]

Wrapper class to run LDA models with Mallet. This class has been adapted from gensim (https://github.com/RaRe-Technologies/gensim/blob/27bbb7015dc6bbe02e00bb1853e7952ac13e7fe0/gensim/models/wrappers/ldamallet.py).

Parameters:
num_topics: int

The number of topics to use in the model.

corpus: iterable of iterable of (int, int), optional

Collection of texts in BoW format. Default: None.

alpha: float, optional

Scalar value indicating the symmetric Dirichlet hyperparameter for topic proportions. Default: 50.

id2word: gensim.utils.FakeDict, optional

Mapping between token ids and words from corpus; if not specified, it will be inferred from corpus. Default: None.

n_cpu: int, optional

Number of threads that will be used for training. Default: 1.

tmp_dir: str, optional

Directory for produced temporary files. Default: None.

optimize_interval: int, optional

Optimize hyperparameters every optimize_interval iterations (this sometimes leads to a Java exception; set to 0 to switch off hyperparameter optimization). Default: 0.

iterations: int, optional

Number of training iterations. Default: 150.

topic_threshold: float, optional

Threshold of the probability above which we consider a topic. Default: 0.0.

random_seed: int, optional

Random seed to ensure consistent results, if 0 - use system clock. Default: 555.

mallet_path: str

Path to the mallet binary (e.g. /xxx/Mallet/bin/mallet). Default: “mallet”.

convert_input(corpus)[source]

Convert corpus to Mallet format and save it to a temporary text file.

Parameters:
corpus

Iterable of iterable of (int, int). Collection of texts in BoW format.

Returns:
None.
corpus_to_mallet(corpus, file_like)[source]

Convert corpus to Mallet format and write it to file_like descriptor.

Parameters:
corpus

Iterable of iterable of (int, int). Collection of texts in BoW format.

file_like

Writable file-like object in text mode.

Returns:
None.
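
As an illustration of the BoW input and the text that ends up in the Mallet corpus file, here is a small sketch modelled on the gensim wrapper this class was adapted from. It assumes one output line per document of the form "<document number> 0 <token> <token> ...", with each token repeated according to its count and token ids written as strings when no id2word mapping is used. The function name corpus_to_mallet_text is hypothetical, and the exact output of the pycisTopic method may differ.

```python
import io
from itertools import chain


def corpus_to_mallet_text(corpus, file_like):
    """Write one line per document: '<doc_no> 0 <token> <token> ...'."""
    for doc_no, doc in enumerate(corpus):
        # Repeat each token id as many times as it occurs in the document.
        tokens = chain.from_iterable(
            [str(token_id)] * int(count) for token_id, count in doc
        )
        file_like.write("%s 0 %s\n" % (doc_no, " ".join(tokens)))


# Two documents (cells); token ids stand for regions. With a binary
# accessibility matrix every count would be 1; the 2 is only for illustration.
corpus = [[(0, 1), (3, 1), (7, 1)], [(3, 1), (5, 2)]]
buffer = io.StringIO()
corpus_to_mallet_text(corpus, buffer)
mallet_text = buffer.getvalue()
```

In the cisTopic setting each cell plays the role of a document and each accessible region the role of a word, which is why a (region id, count) pair list per cell is all that is needed.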
fcorpusmallet()[source]

Get path to corpus.mallet file.

Returns:
str

Path to corpus.mallet file.

fcorpustxt()[source]

Get path to corpus text file.

Returns:
str

Path to corpus text file.

fdoctopics()[source]

Get path to document topic text file.

Returns:
str

Path to document topic text file.

finferencer()[source]

Get path to inferencer.mallet file.

Returns:
str

Path to inferencer.mallet file.

fstate()[source]

Get path to temporary file.

Returns:
str

Path to file.

ftopickeys()[source]

Get path to topic keys text file.

Returns:
str

Path to topic keys text file.

get_topics()[source]

Get topics X words matrix.

Returns:
np.ndarray

Topics X words matrix, shape num_topics x vocabulary_size.

load_word_topics()[source]

Load words X topics matrix from gensim.models.wrappers.LDAMallet.LDAMallet.fstate() file.

Returns:
np.ndarray

Matrix words X topics.

train(corpus, reuse_corpus)[source]

Train Mallet LDA.

Parameters:
corpus: iterable of iterable of (int, int)

Corpus in BoW format.

reuse_corpus: bool, optional

Whether to reuse the mallet corpus in the tmp directory. Default: False

pycisTopic.lda_models.evaluate_models(models: List[CistopicLDAModel], select_model: int | None = None, return_model: bool | None = True, metrics: str | None = ['Minmo_2011', 'loglikelihood', 'Cao_Juan_2009', 'Arun_2010'], min_topics_coh: int | None = 5, plot: bool | None = True, figsize: Tuple[float, float] | None = (6.4, 4.8), plot_metrics: bool | None = False, save: str | None = None)[source]

Model selection based on model quality metrics (model coherence (adaptation from Mimno et al., 2011), log-likelihood (Griffiths and Steyvers, 2004), density-based (Cao Juan et al., 2009) and divergence-based (Arun et al., 2010)).

Parameters:
models: list of :class:`CistopicLDAModel`

A list containing cisTopic LDA models, as returned from run_cgs_models or run_cgs_models_mallet.

select_model: int, optional

Integer indicating the number of topics of the selected model. If not provided, the best model will be selected automatically based on the model quality metrics. Default: None.

return_model: bool, optional

Whether to return the selected model as CistopicLDAModel. Default: True.

metrics: list of str
Metrics to use for plotting and model selection:
  • Minmo_2011: Uses the average model coherence as calculated by Mimno et al. (2011). To reduce the impact of the number of topics, the average coherence is calculated from the top selected average values. The better the model, the higher the coherence.

  • loglikelihood: Uses the log-likelihood in the last iteration as calculated by Griffiths and Steyvers (2004). The better the model, the higher the log-likelihood.

  • Arun_2010: Uses a divergence-based metric as in Arun et al. (2010) using the topic-region distribution, the cell-topic distribution and the cell coverage. The better the model, the lower the metric.

  • Cao_Juan_2009: Uses a density-based metric as in Cao Juan et al. (2009) using the topic-region distribution. The better the model, the lower the metric.

Default: all metrics.

min_topics_coh: int, optional

Minimum number of topics a model must have for its coherence to be used for model selection. Default: 5.

plot: bool, optional

Whether to return plot to the console. Default: True.

figsize: tuple, optional

Size of the figure. Default: (6.4, 4.8)

plot_metrics: bool, optional

Whether to plot metrics independently. Default: False.

save: str, optional

Output file to save plot. Default: None.

References

Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 262-272).

Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228-5235.

Cao, J., Xia, T., Li, J., Zhang, Y., & Tang, S. (2009). A density-based method for adaptive LDA model selection. Neurocomputing, 72(7-9), 1775-1781.

Arun, R., Suresh, V., Madhavan, C. V., & Murthy, M. N. (2010). On finding the natural number of topics with latent dirichlet allocation: Some observations. In Pacific-Asia conference on knowledge discovery and data mining (pp. 391-402). Springer, Berlin, Heidelberg.
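
Because two of the metrics are better when higher and two when lower, comparing models by eye is easier once they are put on a common scale. The sketch below shows one possible way to do that: min-max scaling with the lower-is-better metrics inverted, followed by a plain average. This is explicitly not the selection rule used by evaluate_models; the function names and the metric values are invented for illustration only.

```python
# Direction of each metric as described in the parameter list above.
HIGHER_IS_BETTER = {
    "Minmo_2011": True,
    "loglikelihood": True,
    "Cao_Juan_2009": False,
    "Arun_2010": False,
}


def rescale(values, higher_is_better):
    """Min-max scale to [0, 1] so that 1 is always the best model."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def combined_scores(metrics_by_name):
    """Average the rescaled metrics per model (hypothetical combination)."""
    n_models = len(next(iter(metrics_by_name.values())))
    scaled = [
        rescale(values, HIGHER_IS_BETTER[name])
        for name, values in metrics_by_name.items()
    ]
    return [sum(col[i] for col in scaled) / len(scaled) for i in range(n_models)]


# Invented metric values for three models with 5, 10 and 20 topics.
n_topics = [5, 10, 20]
metrics = {
    "Minmo_2011": [-2.1, -1.4, -1.6],
    "loglikelihood": [-9.5e6, -9.1e6, -9.0e6],
    "Cao_Juan_2009": [0.30, 0.12, 0.15],
    "Arun_2010": [410.0, 260.0, 240.0],
}
scores = combined_scores(metrics)
best_n_topics = n_topics[scores.index(max(scores))]
```

In practice the metrics rarely agree on a single optimum, which is why the function also offers plotting and an explicit select_model argument.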

pycisTopic.lda_models.run_cgs_model_mallet(binary_matrix: csr_matrix, corpus: Iterable, id2word: FakeDict, n_topics: List[int], cell_names: List[str], region_names: List[str], n_cpu: int | None = 1, n_iter: int | None = 500, random_state: int | None = 555, alpha: float | None = 50, alpha_by_topic: bool | None = True, eta: float | None = 0.1, eta_by_topic: bool | None = False, top_topics_coh: int | None = 5, tmp_path: str | None = None, save_path: str | None = None, reuse_corpus: bool | None = False, mallet_path: str = 'mallet')[source]

Run Latent Dirichlet Allocation in a model as implemented in Mallet (McCallum, 2002).

Parameters:
binary_matrix: sparse.csr_matrix

Binary sparse matrix containing cells as columns, regions as rows, and 1 if a region is considered accessible in a cell (otherwise, 0).

n_topics: list of int

A list containing the number of topics to use in each model.

cell_names: list of str

List containing cell names as ordered in the binary matrix columns.

region_names: list of str

List containing region names as ordered in the binary matrix rows.

n_cpu: int, optional

Number of CPUs to use for modelling. In this function parallelization is done per model, that is, each model runs entirely on a single CPU. We recommend setting the number of CPUs to the number of models to be inferred, so that all models start at the same time.

n_iter: int, optional

Number of iterations for which the Gibbs sampler will be run. Default: 500.

random_state: int, optional

Random seed to initialize the models. Default: 555.

alpha: float, optional

Scalar value indicating the symmetric Dirichlet hyperparameter for topic proportions. Default: 50.

alpha_by_topic: bool, optional

Boolean indicating whether the scalar given in alpha has to be divided by the number of topics. Default: True

eta: float, optional

Scalar value indicating the symmetric Dirichlet hyperparameter for topic multinomials. Default: 0.1.

eta_by_topic: bool, optional

Boolean indicating whether the scalar given in eta has to be divided by the number of topics. Default: False

top_topics_coh: int, optional

Number of topics to use to calculate the model coherence. For each model, the coherence will be calculated as the average of the top coherence values. Default: 5.

tmp_path: str, optional

Path to a temporary folder for Mallet. Default: None.

save_path: str, optional

Path to save models as independent files as they are completed. This is recommended for large data sets. Default: None.

reuse_corpus: bool, optional

Whether to reuse the mallet corpus in the tmp directory. Default: False

mallet_path: str

Path to Mallet binary (e.g. “/xxx/Mallet/bin/mallet”). Default: “mallet”.

References

McCallum, A. K. (2002). Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

pycisTopic.lda_models.run_cgs_models(cistopic_obj: CistopicObject, n_topics: List[int], n_cpu: int | None = 1, n_iter: int | None = 150, random_state: int | None = 555, alpha: float | None = 50, alpha_by_topic: bool | None = True, eta: float | None = 0.1, eta_by_topic: bool | None = False, top_topics_coh: int | None = 5, save_path: str | None = None, **kwargs)[source]

Run Latent Dirichlet Allocation using Gibbs Sampling as described in Griffiths and Steyvers, 2004.

Parameters:
cistopic_obj: CistopicObject

A CistopicObject. Note that cells/regions have to be filtered before running any LDA model.

n_topics: list of int

A list containing the number of topics to use in each model.

n_cpu: int, optional

Number of CPUs to use for modelling. In this function parallelization is done per model, that is, each model runs entirely on a single CPU. We recommend setting the number of CPUs to the number of models to be inferred, so that all models start at the same time.

n_iter: int, optional

Number of iterations for which the Gibbs sampler will be run. Default: 150.

random_state: int, optional

Random seed to initialize the models. Default: 555.

alpha: float, optional

Scalar value indicating the symmetric Dirichlet hyperparameter for topic proportions. Default: 50.

alpha_by_topic: bool, optional

Boolean indicating whether the scalar given in alpha has to be divided by the number of topics. Default: True

eta: float, optional

Scalar value indicating the symmetric Dirichlet hyperparameter for topic multinomials. Default: 0.1.

eta_by_topic: bool, optional

Boolean indicating whether the scalar given in eta has to be divided by the number of topics. Default: False

top_topics_coh: int, optional

Number of topics to use to calculate the model coherence. For each model, the coherence will be calculated as the average of the top coherence values. Default: 5.

save_path: str, optional

Path to save models as independent files as they are completed. This is recommended for large data sets. Default: None.

References

Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228-5235.
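
The effect of alpha_by_topic and eta_by_topic is easiest to see with numbers. The helper below is a hypothetical illustration of the rule stated in the parameter descriptions (divide the scalar by the number of topics when the flag is set); it is not taken from the pycisTopic source.

```python
def effective_prior(value, n_topics, by_topic):
    """Return the Dirichlet hyperparameter actually used for a model."""
    return value / n_topics if by_topic else value


# Defaults from the signature: alpha=50 with alpha_by_topic=True,
# eta=0.1 with eta_by_topic=False.
priors = {
    n_topics: (
        effective_prior(50, n_topics, by_topic=True),
        effective_prior(0.1, n_topics, by_topic=False),
    )
    for n_topics in [2, 10, 50]
}
```

With the defaults, alpha shrinks as the number of topics grows (25.0, 5.0 and 1.0 for 2, 10 and 50 topics), while eta stays at 0.1 for every model.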

pycisTopic.lda_models.run_cgs_models_mallet(cistopic_obj: CistopicObject, n_topics: List[int], n_cpu: int | None = 1, n_iter: int | None = 150, random_state: int | None = 555, alpha: float | None = 50, alpha_by_topic: bool | None = True, eta: float | None = 0.1, eta_by_topic: bool | None = False, top_topics_coh: int | None = 5, tmp_path: str | None = None, save_path: str | None = None, reuse_corpus: bool | None = False, mallet_path: str = 'mallet')[source]

Run Latent Dirichlet Allocation per model as implemented in Mallet (McCallum, 2002).

Parameters:
cistopic_obj: CistopicObject

A CistopicObject. Note that cells/regions have to be filtered before running any LDA model.

n_topics: list of int

A list containing the number of topics to use in each model.

n_cpu: int, optional

Number of CPUs to use for modelling. In this function parallelization is done per model, that is, each model runs entirely on a single CPU. We recommend setting the number of CPUs to the number of models to be inferred, so that all models start at the same time.

n_iter: int, optional

Number of iterations for which the Gibbs sampler will be run. Default: 150.

random_state: int, optional

Random seed to initialize the models. Default: 555.

alpha: float, optional

Scalar value indicating the symmetric Dirichlet hyperparameter for topic proportions. Default: 50.

alpha_by_topic: bool, optional

Boolean indicating whether the scalar given in alpha has to be divided by the number of topics. Default: True

eta: float, optional

Scalar value indicating the symmetric Dirichlet hyperparameter for topic multinomials. Default: 0.1.

eta_by_topic: bool, optional

Boolean indicating whether the scalar given in eta has to be divided by the number of topics. Default: False

top_topics_coh: int, optional

Number of topics to use to calculate the model coherence. For each model, the coherence will be calculated as the average of the top coherence values. Default: 5.

tmp_path: str, optional

Path to a temporary folder for Mallet. Default: None.

save_path: str, optional

Path to save models as independent files as they are completed. This is recommended for large data sets. Default: None.

reuse_corpus: bool, optional

Whether to reuse the mallet corpus in the tmp directory. Default: False

mallet_path: str

Path to Mallet binary (e.g. “/xxx/Mallet/bin/mallet”). Default: “mallet”.

References

McCallum, A. K. (2002). Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.
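
The top_topics_coh parameter described above can be illustrated with a short sketch. It assumes that "top coherence values" means the highest per-topic coherences; the function name and the toy values are hypothetical.

```python
def model_coherence(topic_coherences, top_topics_coh=5):
    """Average the `top_topics_coh` highest per-topic coherence values."""
    top = sorted(topic_coherences, reverse=True)[:top_topics_coh]
    return sum(top) / len(top)


# Toy per-topic coherences for a model with four topics.
coherences = [-1.0, -3.0, -2.0, -5.0]
top_two = model_coherence(coherences, top_topics_coh=2)
all_topics = model_coherence(coherences, top_topics_coh=10)
```

Averaging only the best few topics keeps models with many topics from being penalized for a long tail of low-coherence topics.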

Clustering & visualization

pycisTopic.clust_vis.cell_topic_heatmap(cistopic_obj: CistopicObject, variables: List[str] | None = None, remove_nan: bool | None = True, scale: bool | None = False, cluster_topics: bool | None = False, color_dict: Dict[str, Dict[str, str]] | None = {}, seed: int | None = 555, legend_loc_x: float | None = 1.2, legend_loc_y: float | None = -0.5, legend_dist_y: float | None = -1, figsize: Tuple[float, float] | None = (6.4, 4.8), selected_topics: List[int] | None = None, selected_cells: List[str] | None = None, harmony: bool | None = False, save: str | None = None)[source]

Plot heatmap with cell-topic distributions.

Parameters

cistopic_obj: class::CistopicObject

A cisTopic object with a model in class::CistopicObject.selected_model.

variables: list

List of variables to plot. They should be included in class::CistopicObject.cell_data and class::CistopicObject.region_data, depending on which target is specified.

remove_nan: bool, optional

Whether to remove data points for which the variable value is ‘nan’. Default: True

reduction_name: str

Name of the dimensionality reduction to use

scale: bool, optional

Whether to scale the cell-topic or topic-regions contributions prior to plotting. Default: False

cluster_topics: bool, optional

Whether to cluster rows in the heatmap. Otherwise, they will be ordered based on the maximum values over the ordered cells. Default: False

color_dict: dict, optional

A dictionary containing an entry per variable, whose values are dictionaries with variable levels as keys and corresponding colors as values. Default: None

seed: int, optional

Random seed used to select random colors. Default: 555

legend_loc_x: float, optional

X location for legend. Default: 1.2

legend_loc_y: float, optional

Y location for legend. Default: -0.5

legend_dist_y: float, optional

Y distance between legends. Default: -1

figsize: tuple, optional

Size of the figure. Default: (6.4, 4.8)

selected_topics: list, optional

A list with selected topics to be used for plotting. Default: None (use all topics)

selected_cells: list, optional

A list with selected cells to plot. Default: None (use all cells)

harmony: bool, optional

If target is ‘cell’, whether to use harmony processed topic contributions. Default: False

save: str, optional

Path to save plot. Default: None.

pycisTopic.clust_vis.find_clusters(cistopic_obj: CistopicObject, target: str | None = 'cell', k: int | None = 10, res: List[float] | None = [0.6], seed: int | None = 555, scale: bool | None = False, prefix: str | None = '', selected_topics: List[int] | None = None, selected_features: List[str] | None = None, harmony: bool | None = False, rna_components: DataFrame | None = None, use_umap_integration: bool | None = False, rna_weight: float | None = 0.5, split_pattern: str | None = '___', **kwargs)[source]

Perform Leiden cell or region clustering and add the results to the cisTopic object’s metadata.

Parameters

cistopic_obj: class::CistopicObject

A cisTopic object with a model in class::CistopicObject.selected_model.

target: str, optional

Whether cells (‘cell’) or regions (‘region’) should be clustered. Default: ‘cell’

k: int, optional

Number of neighbours in the k-neighbours graph. Default: 10

res: list of float, optional

Resolution parameter(s) for the Leiden algorithm step. Default: [0.6]

seed: int, optional

Seed parameter for the leiden algorithm step. Default: 555

scale: bool, optional

Whether to scale the cell-topic or topic-regions contributions prior to the clustering. Default: False

prefix: str, optional

Prefix to add to the clustering name when adding it to the corresponding metadata attribute. Default: ‘’

selected_topics: list, optional

A list with selected topics to be used for clustering. Default: None (use all topics)

selected_features: list, optional

A list with selected features (cells or regions) to cluster. This is recommended when working with regions (e.g. selecting regions in binarized topics), as working with all regions can be time consuming. Default: None (use all features)

harmony: bool, optional

If target is ‘cell’, whether to use harmony processed topic contributions. Default: False.

rna_components: pd.DataFrame, optional

A pandas dataframe containing RNA dimensionality reduction (e.g. PCA) components. If provided, both layers (atac and rna) will be considered for clustering.

use_umap_integration: bool, optional

Whether to use a weighted UMAP representation for the clustering or directly integrating the two graphs. Default: False

rna_weight: float, optional

Weight of the RNA layer on the clustering (only applicable when clustering via UMAP). Default: 0.5 (same weight)

pycisTopic.clust_vis.harmony(cistopic_obj: CistopicObject, vars_use: List[str], scale: bool | None = True, random_state: int | None = 555, **kwargs)[source]

Apply harmony batch effect correction (Korsunsky et al., 2019) over the cell-topic distribution.

Parameters

cistopic_obj: class::CistopicObject

A cisTopic object with a model in class::CistopicObject.selected_model.

vars_use: list

List of variables to correct batch effect with.

scale: bool, optional

Whether to scale probability matrix prior to correction. Default: True

random_state: int, optional

Random seed to use with harmony. Default: 555

References

Korsunsky, I., Millard, N., Fan, J., Slowikowski, K., Zhang, F., Wei, K., … & Raychaudhuri, S. (2019). Fast, sensitive and accurate integration of single-cell data with Harmony. Nature methods, 16(12), 1289-1296.

pycisTopic.clust_vis.input_check(atac_topics: DataFrame, rna_pca: DataFrame)[source]

A function to select cells present in both the RNA and the ATAC layers

pycisTopic.clust_vis.plot_imputed_features(cistopic_obj: CistopicObject, reduction_name: str, imputed_data: cisTopicImputedFeatures, features: List[str], scale: bool | None = False, cmap: str | matplotlib.cm | None = <matplotlib.colors.ListedColormap object>, dot_size: int | None = 10, alpha: float | int | None = 1, selected_cells: List[str] | None = None, figsize: Tuple[float, float] | None = (6.4, 4.8), num_columns: int | None = 1, save: str | None = None)[source]

Plot imputed features into dimensionality reduction.

Parameters

cistopic_obj: class::CistopicObject

A cisTopic object with dimensionality reductions in class::CistopicObject.projections.

reduction_name: str

Name of the dimensionality reduction to use

imputed_data: class::cisTopicImputedFeatures

A class::cisTopicImputedFeatures object derived from the input cisTopic object.

features: list

Names of the features to plot.

scale: bool, optional

Whether to scale the imputed features prior to plotting. Default: False

cmap: str or ‘matplotlib.cm’, optional

For continuous variables, color map to use for the legend color bar. Default: cm.viridis

dot_size: int, optional

Dot size in the plot. Default: 10

alpha: float, optional

Transparency value for the dots in the plot. Default: 1

selected_cells: list, optional

A list with selected cells to plot. Default: None (use all cells)

figsize: tuple, optional

Size of the figure. If num_columns is 1, this is the size for each figure; if num_columns is above 1, this is the overall size of the figure (if keeping default, it will be the size of each subplot in the figure). Default: (6.4, 4.8)

num_columns: int, optional

For multiplot figures, indicates the number of columns (the number of rows will be automatically determined based on the number of plots). Default: 1

save: str, optional

Path to save plot. Default: None.
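
The relation between num_columns, the number of plots and figsize can be sketched as follows. The helper is hypothetical and only mirrors the description above: rows are derived from the number of plots, and with the default figsize each subplot is assumed to keep the default size, so the overall figure grows with the grid.

```python
import math

DEFAULT_FIGSIZE = (6.4, 4.8)


def grid_layout(n_plots, num_columns=1, figsize=DEFAULT_FIGSIZE):
    """Return (n_rows, n_columns, overall_figsize) for a multiplot figure."""
    n_rows = math.ceil(n_plots / num_columns)
    if num_columns > 1 and figsize == DEFAULT_FIGSIZE:
        # Default size is interpreted per subplot (assumption from the text above).
        overall = (figsize[0] * num_columns, figsize[1] * n_rows)
    else:
        overall = figsize
    return n_rows, num_columns, overall


# Five features plotted in two columns need three rows.
layout = grid_layout(5, num_columns=2)
custom = grid_layout(5, num_columns=2, figsize=(10, 12))
```

Passing an explicit figsize together with num_columns above 1 therefore sets the size of the whole grid rather than of a single panel.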

pycisTopic.clust_vis.plot_metadata(cistopic_obj: CistopicObject, reduction_name: str, variables: List[str], target: str | None = 'cell', remove_nan: bool | None = True, show_label: bool | None = True, show_legend: bool | None = False, cmap: str | matplotlib.cm | None = <matplotlib.colors.ListedColormap object>, dot_size: int | None = 10, text_size: int | None = 10, alpha: float | int | None = 1, seed: int | None = 555, color_dictionary: Dict[str, str] | None = {}, figsize: Tuple[float, float] | None = (6.4, 4.8), num_columns: int | None = 1, selected_features: List[str] | None = None, save: str | None = None)[source]

Plot categorical and continuous metadata into dimensionality reduction.

Parameters

cistopic_obj: class::CistopicObject

A cisTopic object with dimensionality reductions in class::CistopicObject.projections.

reduction_name: str

Name of the dimensionality reduction to use

variables: list

List of variables to plot. They should be included in class::CistopicObject.cell_data or class::CistopicObject.region_data, depending on which target is specified.

target: str, optional

Whether cells (‘cell’) or regions (‘region’) should be used. Default: ‘cell’

remove_nan: bool, optional

Whether to remove data points for which the variable value is ‘nan’. Default: True

show_label: bool, optional

For categorical variables, whether to show the label in the plot. Default: True

show_legend: bool, optional

For categorical variables, whether to show the legend next to the plot. Default: False

cmap: str or ‘matplotlib.cm’, optional

For continuous variables, color map to use for the legend color bar. Default: cm.viridis

dot_size: int, optional

Dot size in the plot. Default: 10

text_size: int, optional

For categorical variables and if show_label is True, size of the labels in the plot. Default: 10

alpha: float, optional

Transparency value for the dots in the plot. Default: 1

seed: int, optional

Random seed used to select random colors. Default: 555

color_dictionary: dict, optional

A dictionary containing an entry per variable, whose values are dictionaries with variable levels as keys and corresponding colors as values. Default: {} (empty dictionary)

figsize: tuple, optional

Size of the figure. If num_columns is 1, this is the size for each figure; if num_columns is above 1, this is the overall size of the figure (if keeping default, it will be the size of each subplot in the figure). Default: (6.4, 4.8)

num_columns: int, optional

For multiplot figures, indicates the number of columns (the number of rows will be automatically determined based on the number of plots). Default: 1

selected_features: list, optional

A list with selected features (cells or regions) to plot. This is recommended when working with regions (e.g. selecting regions in binarized topics), as working with all regions can be time consuming. Default: None (use all features)

save: str, optional

Path to save plot. Default: None.
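As an illustration of the nested structure described for color_dictionary, the sketch below builds one entry per variable. The variable names ('cell_type', 'sample_id'), their levels and the colors are hypothetical, and the commented call assumes that a cisTopic object named cistopic_obj with a 'UMAP' reduction already exists.

```python
# Hypothetical variables and levels; replace them with columns of cell_data.
color_dictionary = {
    "cell_type": {"Neuron": "#1f77b4", "Astrocyte": "#ff7f0e"},
    "sample_id": {"sample_1": "#2ca02c", "sample_2": "#d62728"},
}

# One plot per variable is requested by passing the dictionary keys.
variables = list(color_dictionary)

# Assumed usage (needs pycisTopic and an existing object, so it is commented out):
# from pycisTopic.clust_vis import plot_metadata
# plot_metadata(cistopic_obj, reduction_name="UMAP", variables=variables,
#               color_dictionary=color_dictionary, num_columns=2)
```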

pycisTopic.clust_vis.plot_topic(cistopic_obj: CistopicObject, reduction_name: str, target: str | None = 'cell', cmap: str | matplotlib.cm | None = cm.viridis, dot_size: int | None = 10, alpha: float | int | None = 1, scale: bool | None = False, selected_topics: List[int] | None = None, selected_features: List[str] | None = None, harmony: bool | None = False, figsize: Tuple[float, float] | None = (6.4, 4.8), num_columns: int | None = 1, save: str | None = None)[source]

Plot topic distributions into dimensionality reduction.

Parameters

cistopic_obj: class::CistopicObject

A cisTopic object with dimensionality reductions in class::CistopicObject.projections.

reduction_name: str

Name of the dimensionality reduction to use

target: str, optional

Whether cells (‘cell’) or regions (‘region’) should be used. Default: ‘cell’

cmap: str or ‘matplotlib.cm’, optional

For continuous variables, color map to use for the legend color bar. Default: cm.viridis

dot_size: int, optional

Dot size in the plot. Default: 10

alpha: float, optional

Transparency value for the dots in the plot. Default: 1

scale: bool, optional

Whether to scale the cell-topic or topic-regions contributions prior to plotting. Default: False

selected_topics: list, optional

A list with selected topics to be used for plotting. Default: None (use all topics)

selected_features: list, optional

A list with selected features (cells or regions) to plot. This is recommended when working with regions (e.g. selecting regions in binarized topics), as working with all regions can be time consuming. Default: None (use all features)

harmony: bool, optional

If target is ‘cell’, whether to use harmony processed topic contributions. Default: False

figsize: tuple, optional

Size of the figure. If num_columns is 1, this is the size for each figure; if num_columns is above 1, this is the overall size of the figure (if keeping default, it will be the size of each subplot in the figure). Default: (6.4, 4.8)

num_columns: int, optional

For multiplot figures, indicates the number of columns (the number of rows will be automatically determined based on the number of plots). Default: 1

save: str, optional

Path to save plot. Default: None.

pycisTopic.clust_vis.run_tsne(cistopic_obj: CistopicObject, target: str | None = 'cell', scale: bool | None = False, reduction_name: str | None = 'tSNE', random_state: int | None = 555, perplexity: int | None = 30, selected_topics: List[int] | None = None, selected_features: List[str] | None = None, harmony: bool | None = False, rna_components: DataFrame | None = None, rna_weight: float | None = 0.5, **kwargs)[source]

Run tSNE and add it to the dimensionality reduction dictionary. If FItSNE is installed it will be used, otherwise sklearn TSNE implementation will be used.

Parameters

cistopic_obj: class::CistopicObject

A cisTopic object with a model in class::CistopicObject.selected_model.

target: str, optional

Whether cells (‘cell’) or regions (‘region’) should be used. Default: ‘cell’

scale: bool, optional

Whether to scale the cell-topic or topic-regions contributions prior to the dimensionality reduction. Default: False

reduction_name: str, optional

Reduction name to use as key in the dimensionality reduction dictionary. Default: ‘tSNE’

random_state: int, optional

Seed parameter for running tSNE. Default: 555

perplexity: int, optional

Perplexity parameter for FItSNE. Default: 30

selected_topics: list, optional

A list with selected topics to be used for clustering. Default: None (use all topics)

selected_features: list, optional

A list with selected features (cells or regions) to cluster. This is recommended when working with regions (e.g. selecting regions in binarized topics), as working with all regions can be time consuming. Default: None (use all features)

harmony: bool, optional

If target is ‘cell’, whether to use harmony processed topic contributions. Default: False

rna_components: pd.DataFrame, optional

A pandas dataframe containing RNA dimensionality reduction (e.g. PCA) components. If provided, both layers (atac and rna) will be considered for clustering.

rna_weight: float, optional

Weight of the RNA layer on the clustering (only applicable when clustering via UMAP). Default: 0.5 (same weight)

**kwargs

Parameters to pass to fitsne.FItSNE or sklearn.manifold.TSNE.

pycisTopic.clust_vis.run_umap(cistopic_obj: CistopicObject, target: str | None = 'cell', scale: bool | None = False, reduction_name: str | None = 'UMAP', random_state: int | None = 555, selected_topics: List[int] | None = None, selected_features: List[str] | None = None, harmony: bool | None = False, rna_components: DataFrame | None = None, rna_weight: float | None = 0.5, **kwargs)[source]

Run UMAP and add it to the dimensionality reduction dictionary.

Parameters

cistopic_obj: class::CistopicObject

A cisTopic object with a model in class::CistopicObject.selected_model.

target: str, optional

Whether cells (‘cell’) or regions (‘region’) should be used. Default: ‘cell’

scale: bool, optional

Whether to scale the cell-topic or topic-regions contributions prior to the dimensionality reduction. Default: False

reduction_name: str, optional

Reduction name to use as key in the dimensionality reduction dictionary. Default: ‘UMAP’

random_state: int, optional

Seed parameter for running UMAP. Default: 555

selected_topics: list, optional

A list with selected topics to be used for clustering. Default: None (use all topics)

selected_features: list, optional

A list with selected features (cells or regions) to cluster. This is recommended when working with regions (e.g. selecting regions in binarized topics), as working with all regions can be time consuming. Default: None (use all features)

harmony: bool, optional

If target is ‘cell’, whether to use harmony processed topic contributions. Default: False.

rna_components: pd.DataFrame, optional

A pandas dataframe containing RNA dimensionality reduction (e.g. PCA) components. If provided, both layers (atac and rna) will be considered for clustering.

rna_weight: float, optional

Weight of the RNA layer on the clustering (only applicable when clustering via UMAP). Default: 0.5 (same weight)

**kwargs

Parameters to pass to umap.UMAP.

pycisTopic.clust_vis.weighted_integration(atac_topics: DataFrame, rna_pca: DataFrame, common_cells: List[str], weight=0.5, **kwargs)[source]

A function for weighted integration via UMAP

Drop-out imputation & Differential features

class pycisTopic.diff_features.CistopicImputedFeatures(imputed_acc: csr_matrix, feature_names: List[str], cell_names: List[str], project: str)[source]

cisTopic imputation data class.

CistopicImputedFeatures contains the cell by features matrix (stored at mtx, with features being either regions or genes), cell names cell_names and feature names feature_names.

Attributes

mtx: sparse.csr_matrix

A matrix containing imputed values.

cell_names: list

A list containing cell names.

feature_names: list

A list containing feature names.

project: str

Name of the cisTopic imputation project.

make_rankings(seed=123)[source]

A function to generate rankings per cell based on the imputed accessibility scores per region.

Parameters

seed: int, optional

Random seed to ensure reproducibility of the rankings when there are ties

Return

CistopicImputedFeatures

A CistopicImputedFeatures object containing ranking values rather than scores.
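The role of the seed can be illustrated with a minimal sketch of ranking with random tie-breaking. This is not the pycisTopic implementation: the function name rank_with_random_ties, the descending order (rank 0 for the highest score) and the shuffle-then-stable-sort approach are assumptions made for illustration.

```python
import random


def rank_with_random_ties(scores, seed=123):
    """Return 0-based ranks (0 = highest score); tied scores are ordered at random."""
    rng = random.Random(seed)
    order = list(range(len(scores)))
    rng.shuffle(order)  # random order first, so the stable sort breaks ties randomly
    order.sort(key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


# Hypothetical imputed scores of four regions in one cell; regions 0 and 2 are tied.
scores_for_one_cell = [0.2, 0.9, 0.2, 0.5]
ranks = rank_with_random_ties(scores_for_one_cell, seed=123)
```

With a fixed seed the tied regions receive the same ranks on every run, which is the reproducibility the seed parameter refers to.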

merge(cistopic_imputed_features_list: List[CistopicImputedFeatures], project: str | None = 'cisTopic_impute_merge', copy: bool | None = False)[source]

Merge a list of CistopicImputedFeatures to the input CistopicImputedFeatures. Reference coordinates (for regions) must be the same between the objects.

Parameters

cistopic_imputed_features_list: list

A list containing one or more CistopicImputedFeatures to merge.

project: str, optional

Name of the cisTopic imputation project.

copy: bool, optional

Whether changes should be made on the input CistopicImputedFeatures or a new object should be returned.

Return

CistopicImputedFeatures

A combined CistopicImputedFeatures.

subset(cells: List[str] | None = None, features: List[str] | None = None, copy: bool | None = False, split_pattern: str | None = '___')[source]

Subset cells and/or regions from CistopicImputedFeatures.

Parameters

cells: list, optional

A list containing the names of the cells to keep.

features: list, optional

A list containing the names of the features to keep.

copy: bool, optional

Whether changes should be made on the input CistopicImputedFeatures or a new object should be returned.

split_pattern: str

Pattern to split cell barcode from sample id. Default: ___

pycisTopic.diff_features.find_diff_features(cistopic_obj: CistopicObject, imputed_features_obj: CistopicImputedFeatures, variable: str, var_features: List[str] | None = None, contrasts: List[List[str]] | None = None, adjpval_thr: float | None = 0.05, log2fc_thr: float | None = 0.5849625007211562, split_pattern: str | None = '___', n_cpu: int | None = 1, **kwargs)[source]

Find differential imputed features.

Parameters

cistopic_obj: class::CistopicObject

A cisTopic object including the cells in imputed_features_obj.

imputed_features_obj: CistopicImputedFeatures

A cisTopic imputation data object.

variable: str

Name of the grouping variable to use for the comparison. It must be included in class::CistopicObject.cell_data

var_features: list, optional

A list of features to use (e.g. variable features from find_highly_variable_features())

contrasts: List, optional

A list including contrasts to make in the form of lists with foreground and background, e.g. [[['Group_1'], ['Group_2', 'Group_3']], [['Group_2'], ['Group_1', 'Group_3']]]. Default: None.

adjpval_thr: float, optional

Adjusted p-values threshold. Default: 0.05

log2fc_thr: float, optional

Log2FC threshold. Default: np.log2(1.5)

split_pattern: str

Pattern to split cell barcode from sample id. Default: ___

n_cpu: int, optional

Number of cores to use. Default: 1

**kwargs

Parameters to pass to ray.init()
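The nested list expected by contrasts can be built programmatically. The sketch below creates one-versus-rest contrasts; the helper one_vs_rest_contrasts and the group names are hypothetical and not part of pycisTopic.

```python
def one_vs_rest_contrasts(groups):
    """Build [[foreground], [background]] pairs, one per group."""
    contrasts = []
    for group in groups:
        background = [g for g in groups if g != group]
        contrasts.append([[group], background])
    return contrasts


# Hypothetical levels of the grouping variable.
contrasts = one_vs_rest_contrasts(["Group_1", "Group_2", "Group_3"])
```

Each element has the foreground groups in the first slot and the background groups in the second, matching the format described for the contrasts parameter.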

pycisTopic.diff_features.find_highly_variable_features(input_mat: DataFrame | CistopicImputedFeatures, min_disp: float | None = 0.05, min_mean: float | None = 0.0125, max_disp: float | None = inf, max_mean: float | None = 3, n_bins: int | None = 20, n_top_features: int | None = None, plot: bool | None = True, save: str | None = None)[source]

Find highly variable features.

Parameters

input_mat: pd.DataFrame or CistopicImputedFeatures

A dataframe with values to be normalized or cisTopic imputation data.

min_disp: float, optional

Minimum dispersion value for a feature to be selected. Default: 0.05

min_mean: float, optional

Minimum mean value for a feature to be selected. Default: 0.0125

max_disp: float, optional

Maximum dispersion value for a feature to be selected. Default: np.inf

max_mean: float, optional

Maximum mean value for a feature to be selected. Default: 3

n_bins: int, optional

Number of bins for binning the mean gene expression. Normalization is done with respect to each bin. Default: 20

n_top_features: int, optional

Number of highly-variable features to keep. If specified, dispersion and mean thresholds will be ignored. Default: None

plot: bool, optional

Whether to plot dispersion versus mean values. Default: True.

save: str, optional

Path to save feature selection plot. Default: None

pycisTopic.diff_features.get_log2_fc(fg_mat, bg_mat)[source]

Calculate log2 fold change between foreground and background matrix.

Parameters:
fg_mat

2D-numpy foreground matrix.

bg_mat

2D-numpy background matrix.
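A log2 fold change of this kind can be sketched in plain Python as the log2 ratio of the mean foreground value to the mean background value per feature. The orientation (features as rows, cells as columns), the pseudocount and the function name log2_fold_change are assumptions for illustration, not the library's code.

```python
import math


def log2_fold_change(fg_rows, bg_rows, pseudocount=1e-12):
    """One log2FC per feature; rows are features, columns are cells (assumed)."""
    result = []
    for fg, bg in zip(fg_rows, bg_rows):
        fg_mean = sum(fg) / len(fg)
        bg_mean = sum(bg) / len(bg)
        # The pseudocount (an assumption) avoids division by zero and log2(0).
        result.append(math.log2((fg_mean + pseudocount) / (bg_mean + pseudocount)))
    return result


# Two features: the first is higher in the foreground, the second is lower.
fg = [[4.0, 4.0], [1.0, 1.0]]
bg = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
log2fc = log2_fold_change(fg, bg)
```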

pycisTopic.diff_features.get_wilcox_test_pvalues(fg_mat, bg_mat)[source]

Calculate Wilcoxon test p-values between foreground and background matrix.

Parameters:
fg_mat

2D-numpy foreground matrix.

bg_mat

2D-numpy background matrix.

pycisTopic.diff_features.impute_accessibility(cistopic_obj: CistopicObject, selected_cells: List[str] | None = None, selected_regions: List[str] | None = None, scale_factor: int | None = 1000000, chunk_size: int = 20000, project: str | None = 'cisTopic_Impute')[source]

Impute region accessibility.

Parameters:
cistopic_obj: class::CistopicObject

A cisTopic object with a model in class::CistopicObject.selected_model.

selected_cells: list, optional

A list with selected cells to impute accessibility for. Default: None

selected_regions: list, optional

A list with selected regions to impute accessibility for. Default: None

scale_factor: int, optional

A number to multiply the imputed values by. This is useful to convert low probabilities to 0, making the matrix more sparse. Default: 10**6.

chunk_size: int, optional

Chunk size used (number of regions for which imputed accessibility is calculated at the same time). Default: 20000.

project: str, optional

Name of the cisTopic imputation project. Default: cisTopic_Impute.
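The effect of scale_factor can be illustrated with a toy sketch in which the imputed value of a region in a cell is the sum over topics of the cell-topic probability times the topic-region probability, multiplied by scale_factor and truncated to an integer. The truncation step, the matrix orientation and the function name impute are assumptions; pycisTopic's actual implementation is not reproduced here.

```python
def impute(cell_topic, topic_region, scale_factor=10**6):
    """cell_topic: cells x topics; topic_region: topics x regions (probabilities)."""
    n_topics = len(topic_region)
    n_regions = len(topic_region[0])
    imputed = []
    for cell in cell_topic:
        row = []
        for r in range(n_regions):
            prob = sum(cell[k] * topic_region[k][r] for k in range(n_topics))
            # Truncation (assumed): products below 1 after scaling become 0.
            row.append(int(prob * scale_factor))
        imputed.append(row)
    return imputed


# Toy model with two cells, two topics and three regions.
cell_topic = [[0.5, 0.5], [1.0, 0.0]]
topic_region = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
imputed = impute(cell_topic, topic_region, scale_factor=100)
```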

pycisTopic.diff_features.markers(input_mat: DataFrame | CistopicImputedFeatures, barcode_group: List[List[str]], contrast_name: str, adjpval_thr: float | None = 0.05, log2fc_thr: float | None = 1, n_cpu: int | None = 1)[source]

Find differential imputed features.

Parameters:
input_mat: pd.DataFrame or CistopicImputedFeatures

A data frame or a cisTopic imputation data object.

barcode_group: List

List of length 2, including foreground cells on the first slot and background on the second.

contrast_name: str

Name of the contrast

adjpval_thr: float, optional

Adjusted p-values threshold. Default: 0.05

log2fc_thr: float, optional

Log2FC threshold. Default: 1

n_cpu: int, optional

Number of cores to use. Default: 1

pycisTopic.diff_features.mean_axis1(arr)[source]

Calculate the mean of each row of a 2D NumPy array with numba, mimicking np.mean(arr, axis=1).

Parameters:
arr

2D NumPy array for which to calculate the mean of each row.

pycisTopic.diff_features.normalize_scores(imputed_acc: DataFrame | CistopicImputedFeatures, scale_factor: int = 10000)[source]

Log-normalize imputation data. Feature counts for each cell are divided by the total counts for that cell and multiplied by the scale_factor.

Parameters:
imputed_acc: pd.DataFrame or CistopicImputedFeatures

A dataframe with values to be normalized or cisTopic imputation data.

scale_factor: int

Scale factor for cell-level normalization. Default: 10**4
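The normalization described above can be sketched as follows. The features-by-cells orientation, the use of the natural log(1 + x) as the log transform and the function name log_normalize are assumptions made for illustration.

```python
import math


def log_normalize(matrix, scale_factor=10**4):
    """matrix: features x cells (assumed); returns log1p of the scaled fractions."""
    n_cells = len(matrix[0])
    # Total signal per cell (column sums).
    totals = [sum(row[c] for row in matrix) for c in range(n_cells)]
    return [
        [math.log1p(row[c] / totals[c] * scale_factor) for c in range(n_cells)]
        for row in matrix
    ]


# Two features (rows) in two cells (columns).
matrix = [[1.0, 0.0], [3.0, 2.0]]
normalized = log_normalize(matrix, scale_factor=100)
```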

pycisTopic.diff_features.p_adjust_bh(p: float)[source]

Benjamini-Hochberg p-value correction for multiple hypothesis testing.
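For reference, the Benjamini-Hochberg procedure itself can be sketched in plain Python. This is a generic version of the standard algorithm under the hypothetical name benjamini_hochberg, not the pycisTopic code.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Each p-value is multiplied by n / rank, and a running minimum taken from the largest p-value downwards keeps the adjusted values monotone.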

pycisTopic.diff_features.subset_array_second_axis(arr, col_indices)[source]

Subset array by second axis based on provided col_indices.

Returns the same as arr[:, col_indices], but is much faster when arr and col_indices are big.

Parameters:
arr

2D-numpy array to subset by provided column indices.

col_indices

1D-numpy array (preferably with np.int64 as dtype) with column indices.

Topic binarization

pycisTopic.topic_binarization.binarize_topics(cistopic_obj: CistopicObject, target: str | None = 'region', method: str | None = 'otsu', smooth_topics: bool = True, ntop: int = 2000, predefined_thr: dict[str, float] | None = None, nbins: int = 100, plot: bool = False, figsize: tuple[float, float] | None = (6.4, 4.8), num_columns: int = 1, save: str | None = None)[source]

Binarize topic distributions.

Parameters:
cistopic_obj

A cisTopic object with a model in CistopicObject.selected_model.

target

Whether cell-topic (“cell”) or region-topic (“region”) distributions should be binarized. Default: “region”.

method
Method to use for topic binarization. Possible options are:
  • otsu [Otsu, 1979]

  • yen [Yen et al., 1995]

  • li [Li & Lee, 1993]

  • aucell [Van de Sande et al., 2020]

  • ntop [Taking the top n regions per topic]

Default: otsu.

smooth_topics

Whether to smooth topic distributions to penalize regions enriched across many topics. Default: True. The following formula is applied:

\[\beta_{w, k} (\log\beta_{w,k} - 1 / K \sum_{k'} \log \beta_{w,k'})\]

ntop

Number of top regions to select when using method="ntop". Default: 2000.

predefined_thr

A dictionary containing topics as keys and threshold as values. If a topic is not present, thresholds will be computed with the specified method. This can be used for manually adjusting thresholds when necessary. Default: None.

nbins

Number of bins to use in the histogram used for otsu, yen and li thresholding. Default: 100.

plot

Whether to plot region-topic distributions and their threshold. Default: False.

figsize

Size of the figure. If num_columns is 1, this is the size for each figure. If num_columns is above 1, this is the overall size of the figure. If keeping the default, it will be the size of each subplot in the figure. Default: (6.4, 4.8).

num_columns

For multiplot figures, indicates the number of columns (the number of rows will be automatically determined based on the number of plots). Default: 1.

save

Path to save plot. Default: None.

Returns:
A dictionary with topics as keys and, for each topic, a pd.DataFrame with the selected regions (region names as index) and a topic score column.

References

  • Otsu, N., 1979. A threshold selection method from gray-level histograms. IEEE transactions on systems, man, and cybernetics, 9(1), pp.62-66.

  • Yen, J.C., Chang, F.J. and Chang, S., 1995. A new criterion for automatic multilevel thresholding. IEEE Transactions on Image Processing, 4(3), pp.370-378.

  • Li, C.H. and Lee, C.K., 1993. Minimum cross entropy thresholding. Pattern recognition, 26(4), pp.617-625.

  • Van de Sande, B., Flerin, C., Davie, K., De Waegeneer, M., Hulselmans, G., Aibar, S., Seurinck, R., Saelens, W., Cannoodt, R., Rouchon, Q. and Verbeiren, T., 2020. A scalable SCENIC workflow for single-cell gene regulatory network analysis. Nature Protocols, 15(7), pp.2247-2276.

pycisTopic.topic_binarization.cross_entropy(array: ndarray, threshold: float, nbins: int = 100) → float[source]

Calculate entropies for Li thresholding on topic-region distributions [Li & Lee, 1993].

Parameters:
array

Array containing the region values for the topic to be binarized.

threshold

Distribution threshold to calculate entropy from.

nbins

Number of bins to use in the binarization histogram.

Returns:
Entropy for the given threshold.

pycisTopic.topic_binarization.histogram_and_bin_centers(array: ndarray, nbins: int = 100) → tuple[ndarray, ndarray][source]

Draw histogram from distribution and identify centers.

Parameters:
array

Scores distribution.

nbins

Number of bins to use in the histogram.

Returns:
Histogram values and bin centers.

pycisTopic.topic_binarization.smooth_topics_distributions(topic_region_distributions: DataFrame) → DataFrame[source]

Smooth topic-region distributions.

Smooth topic distributions to penalize regions enriched across many topics. The formula applied is:

\[\beta_{w, k} (\log\beta_{w,k} - 1 / K \sum_{k'} \log \beta_{w,k'})\]

Parameters:
topic_region_distributions

A pandas dataframe with topic-region distributions (with topics as columns and regions as rows).

Returns:
Smoothed topic-region dataframe.
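The smoothing formula can be checked on a single region with a plain-Python sketch. It assumes strictly positive scores, and the function name smooth_region_scores is hypothetical; it mirrors the formula rather than the library's DataFrame-based code.

```python
import math


def smooth_region_scores(beta_row):
    """beta_row: scores of one region w across the K topics (all > 0)."""
    n_topics = len(beta_row)
    mean_log = sum(math.log(b) for b in beta_row) / n_topics
    # beta_{w,k} * (log beta_{w,k} - (1/K) * sum_k' log beta_{w,k'})
    return [b * (math.log(b) - mean_log) for b in beta_row]


# A region equally present in all topics versus one specific to the first topic.
uniform = smooth_region_scores([0.2, 0.2, 0.2])
specific = smooth_region_scores([0.8, 0.1, 0.1])
```

A region with the same score in every topic is pushed to about zero, while a region concentrated in one topic keeps a positive score only in that topic.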
pycisTopic.topic_binarization.threshold_otsu(array: ndarray, nbins: int = 100) → float[source]

Apply Otsu threshold on topic-region distributions [Otsu, 1979].

Parameters:
array

Array containing the region values for the topic to be binarized.

nbins

Number of bins to use in the binarization histogram.

Returns:
Binarization threshold.
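As background, Otsu's method picks the histogram bin that maximizes the between-class variance of the two groups it separates. The following is a generic plain-Python sketch under the hypothetical name otsu_threshold, not the pycisTopic implementation; the toy scores are made up.

```python
def otsu_threshold(values, nbins=100):
    """Return the bin center that maximizes the between-class variance."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo  # degenerate case: all values are identical
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        idx = min(int((v - lo) / width), nbins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(nbins)]

    total = len(values)
    total_sum = sum(c * x for c, x in zip(counts, centers))
    best_thr, best_var = centers[0], -1.0
    w0 = 0      # number of points in the lower class
    sum0 = 0.0  # count-weighted sum of bin centers in the lower class
    for i in range(nbins - 1):
        w0 += counts[i]
        sum0 += counts[i] * centers[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0 = sum0 / w0
        mean1 = (total_sum - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_thr = between_var, centers[i]
    return best_thr


# Toy region scores for one topic: four low values and three high values.
scores = [0.01, 0.02, 0.03, 0.02, 0.9, 0.95, 1.0]
threshold = otsu_threshold(scores, nbins=10)
```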
pycisTopic.topic_binarization.threshold_yen(array: ndarray, nbins: int = 100) → float[source]

Apply Yen threshold on topic-region distributions [Yen et al., 1995].

Parameters:
array

Array containing the region values for the topic to be binarized.

nbins

Number of bins to use in the binarization histogram.

Returns:
Binarization threshold.

Topic QC

pycisTopic.topic_qc.compute_topic_metrics(cistopic_obj: CistopicObject, return_metrics: bool | None = True)[source]

Compute topic quality control metrics.

Parameters

cistopic_obj: class::CistopicObject

A cisTopic object with a model in class::CistopicObject.selected_model.

return_metrics: bool, optional

Whether to return metrics as class::pd.DataFrame. The metrics will also be appended to class::CistopicObject.selected_model.topic_qc_metrics regardless of the value of this parameter. Default: True.

References

Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 262-272).

pycisTopic.topic_qc.gini_coefficient(x)[source]

Compute Gini coefficient of array of values
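A common way to compute a Gini coefficient from sorted values is sketched below. The estimator used by pycisTopic is not stated here, so this formula and the function name gini are assumptions for illustration.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Sum of values weighted by their 1-based position in ascending order.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


even = gini([1, 1, 1, 1])
concentrated = gini([0, 0, 0, 1])
```

Evenly spread values give a coefficient of 0, and the coefficient grows as the total is concentrated in fewer entries.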

pycisTopic.topic_qc.plot_topic_qc(topic_qc_metrics: DataFrame | CistopicObject, var_x: str, var_y: str, min_x: int | None = None, max_x: int | None = None, min_y: int | None = None, max_y: int | None = None, var_color: str | None = None, cmap: str | None = 'viridis', dot_size: int | None = 10, text_size: int | None = 10, plot: bool | None = False, save: str | None = None, return_topics: bool | None = False, return_fig: bool | None = False)[source]

Plotting topic qc metrics and filtering.

Parameters

topic_qc_metrics: class::pd.DataFrame or class::CistopicObject

A topic metrics dataframe or a cisTopic object with class::CistopicObject.selected_model.topic_qc_metrics filled.

var_x: str

Metric to plot.

var_y: str

A second metric to plot in combination with var_x.

min_x: float, optional

Minimum value on var_x to keep the topic. Default: None.

max_x: float, optional

Maximum value on var_x to keep the topic. Default: None.

min_y: float, optional

Minimum value on var_y to keep the topic. Default: None.

max_y: float, optional

Maximum value on var_y to keep the topic. Default: None.

var_color: str, optional

Metric to color plot by. Default: None

cmap: str, optional

Color map to color 2D dot plots by density. Default: ‘viridis’.

dot_size: int, optional

Dot size in the plot. Default: 10

text_size: int, optional

Size of the labels in the plot. Default: 10

plot: bool, optional

Whether the plots should be returned to the console. Default: False.

save: str, optional

Path to save plots as a file. Default: None.

return_topics: bool, optional

Whether to return selected topics based on user-given thresholds. Default: False.

return_fig: bool, optional

Whether to return the plot figure; if several samples it will return a dictionary with the figures per sample. Default: False.

Return

list

A list with the selected topics.

pycisTopic.topic_qc.topic_annotation(cistopic_obj: CistopicObject, annot_var: str, binarized_cell_topic: Dict[str, DataFrame] | None = None, general_topic_thr: float | None = 0.2, **kwargs)[source]

Automatic annotation of topics.

Parameters

cistopic_obj: class::CistopicObject

A cisTopic object with a model in class::CistopicObject.selected_model.

annot_var: str

Name of the variable (contained in ‘class::CistopicObject.cell_data’) to use for annotation

binarized_cell_topic: Dict, optional

A dictionary containing binarized cell topic distributions (from binarize_topics()). If not provided, binarize_topics() will be run. Default: None.

general_topic_thr: float, optional

Threshold for considering a topic as general. After assigning topics to annotations, the ratio of cells in the binarized topic in the whole population is compared with the ratio of the total number of cells in the assigned groups versus the whole population. If the difference is above this threshold, the topic is considered general. Default: 0.2.

**kwargs

Arguments to pass to binarize_topics()
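The comparison described for general_topic_thr can be written out as a small sketch. The function name is_general_topic, its arguments and the direction of the difference (topic ratio minus assigned-group ratio) are assumptions based on the description above, not the library's code.

```python
def is_general_topic(n_cells_in_topic, n_cells_in_assigned_groups, n_cells_total,
                     general_topic_thr=0.2):
    """Flag a topic as general when it covers many more cells than its assigned groups."""
    ratio_topic = n_cells_in_topic / n_cells_total
    ratio_groups = n_cells_in_assigned_groups / n_cells_total
    return (ratio_topic - ratio_groups) > general_topic_thr


# Hypothetical counts: the first topic spans far more cells than its assigned groups.
broad = is_general_topic(800, 300, 1000)
narrow = is_general_topic(320, 300, 1000)
```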

Export to loom

pycisTopic.loom.add_annotation(loom, annots: DataFrame)[source]

A helper function to add annotations

pycisTopic.loom.add_clusterings(loom: SCopeLoom, cluster_data: DataFrame)[source]

A helper function to add clusters

pycisTopic.loom.add_markers(loom: SCopeLoom, markers_dict: Dict[str, Dict[str, DataFrame]])[source]

A helper function to add markers to clusterings

pycisTopic.loom.add_metrics(loom, metrics: DataFrame)[source]

A helper function to add metrics

pycisTopic.loom.df_to_named_matrix(df: DataFrame)[source]

A helper function to create metadata structure.

pycisTopic.loom.export_gene_activity_to_loom(gene_activity_matrix: CistopicImputedFeatures | DataFrame, cistopic_obj: CistopicObject, out_fname: str, regulons: List[Regulon] = None, selected_genes: List[str] | None = None, selected_cells: List[str] | None = None, auc_mtx: DataFrame | None = None, auc_thresholds: DataFrame | None = None, cluster_annotation: List[str] = None, cluster_markers: Dict[str, Dict[str, DataFrame]] = None, tree_structure: Sequence[str] = (), title: str = None, nomenclature: str = 'Unknown', split_pattern='___', num_workers: int = 1, **kwargs)[source]

Create SCope [Davie et al., 2018] compatible loom files for gene activity exploration

Parameters

gene_activity_matrix: class::CistopicImputedFeatures or class::pd.DataFrame

A cisTopic imputed features object containing imputed gene activity as values. Alternatively, a pandas data frame with genes as columns, cells as rows and gene activity per gene as values.

cistopic_obj: class::CistopicObject

The cisTopic object from which gene activity values have been derived. It must include cell meta data (including specified cluster annotation columns).

regulons: list

A list of regulons as derived from pySCENIC (Van de Sande et al., 2020).

out_fname: str

Path to output file.

selected_genes: list, optional

A list specifying which genes should be included in the loom file. Default: None

selected_cells: list, optional

A list specifying which cells should be included in the loom file. Default: None

auc_mtx: pd.DataFrame, optional

A regulon AUC matrix for the regulons as derived from pySCENIC (Van de Sande et al., 2020). If not provided it will be inferred.

auc_thresholds: pd.DataFrame, optional

AUC thresholds for the regulons as derived from pySCENIC (Van de Sande et al., 2020). If not provided it will be inferred.

cluster_annotation: list, optional

A list indicating which information in cistopic_obj.cell_data should be used as clusters. The specified names must be included as columns in cistopic_obj.cell_data. Default: None.

cluster_markers: dict, optional

A dictionary including an entry per cluster annotation (which should match with the names in cluster_annotation) including a dictionary per cluster with a pandas data frame with marker regions as rows and logFC and adjusted p-values as columns (the output of find_diff_features). Default: None.

tree_structure: sequence, optional

A sequence of strings that defines the category tree structure. Needs to be a sequence of strings with three elements. Default: ()

title: str, optional

The title for this loom file. If None, then the basename of the filename is used as the title. Default: None

nomenclature: str, optional

The name of the genome. Default: ‘Unknown’

**kwargs

Additional parameters for pyscenic.export.export2loom

References

Davie, K., Janssens, J., Koldere, D., De Waegeneer, M., Pech, U., Kreft, Ł., … & Aerts, S. (2018). A single-cell transcriptome atlas of the aging Drosophila brain. Cell, 174(4), 982-998.

Van de Sande, B., Flerin, C., Davie, K., De Waegeneer, M., Hulselmans, G., Aibar, S., … & Aerts, S. (2020). A scalable SCENIC workflow for single-cell gene regulatory network analysis. Nature Protocols, 15(7), 2247-2276.

pycisTopic.loom.export_minimal_loom_gene(ex_mtx: DataFrame, embeddings: Mapping[str, DataFrame], out_fname: str, regulons: List[Regulon] = None, cell_annotations: Mapping[str, str] | None = None, tree_structure: Sequence[str] = (), title: str | None = None, nomenclature: str = 'Unknown', num_workers: int = 2, auc_mtx=None, auc_thresholds=None, compress: bool = False)[source]

Create a loom file for a single cell experiment to be used in SCope.

Parameters

ex_mtx

The expression matrix (n_cells x n_genes).

regulons

A list of Regulons.

cell_annotations

A dictionary that maps a cell ID to its corresponding cell type annotation.

out_fname

The name of the file to create.

tree_structure

A sequence of strings that defines the category tree structure. Needs to be a sequence of strings with three elements.

title

The title for this loom file. If None, then the basename of the filename is used as the title.

nomenclature

The name of the genome.

num_workers

The number of cores to use for AUCell regulon enrichment.

embeddings

A dictionary that maps the name of an embedding to its representation as a pandas DataFrame with two columns: the first column is the first component of the projection for each cell followed by the second. The first mapping is the default embedding (use collections.OrderedDict to enforce this).

compress

Whether to compress metadata (only when using SCope).

pycisTopic.loom.export_region_accessibility_to_loom(accessibility_matrix: CistopicImputedFeatures | DataFrame, cistopic_obj: CistopicObject, binarized_topic_region: Dict[str, DataFrame], binarized_cell_topic: Dict[str, DataFrame], out_fname: str, selected_regions: List[str] = None, selected_cells: List[str] = None, cluster_annotation: List[str] = None, cluster_markers: Dict[str, Dict[str, DataFrame]] = None, tree_structure: Sequence[str] = (), title: str = None, nomenclature: str = 'Unknown', split_pattern: str = '___', **kwargs)[source]

Create SCope [Davie et al., 2018] compatible loom files for accessibility data exploration

Parameters

accessibility_matrix: class::CistopicImputedFeatures or class::pd.DataFrame

A cisTopic imputed features object containing imputed accessibility as values. Alternatively, a pandas data frame with regions as columns, cells as rows and accessibility per regions as values.

cistopic_obj: class::CistopicObject

The cisTopic object from which accessibility values have been derived. It must include cell meta data (including specified cluster annotation columns) and the topic model from which accessibility has been imputed.

binarized_topic_region: dictionary

A dictionary containing topics as keys and class::pd.DataFrame with regions in topics as index and their topic contribution as values. This is the output of binarize_topics() using target=’region’.

binarized_cell_topic: dictionary

A dictionary containing topics as keys and class::pd.DataFrame with cells in topics as index and their topic contribution as values. This is the output of binarize_topics() using target=’cell’.

out_fname: str

Path to output file.

selected_regions: list, optional

A list specifying which regions should be included in the loom file. This is useful when working with very large data sets (e.g. one can select only regions in topics as DARs to reduce the file size). Default: None

selected_cells: list, optional

A list specifying which cells should be included in the loom file. Default: None

cluster_annotation: list, optional

A list indicating which information in cistopic_obj.cell_data should be used as clusters. The specified names must be included as columns in cistopic_obj.cell_data. Default: None.

cluster_markers: dict, optional

A dictionary including an entry per cluster annotation (which should match with the names in cluster_annotation) including a dictionary per cluster with a pandas data frame with marker regions as rows and logFC and adjusted p-values as columns (the output of find_diff_features). Default: None.

tree_structure: sequence, optional

A sequence of strings that defines the category tree structure. Needs to be a sequence of strings with three elements. Default: ()

title: str, optional

The title for this loom file. If None, the basename of the filename is used as the title. Default: None

nomenclature: str, optional

The name of the genome. Default: ‘Unknown’

**kwargs

Additional parameters for pyscenic.export.export2loom

References

Davie, K., Janssens, J., Koldere, D., De Waegeneer, M., Pech, U., Kreft, Ł., … & Aerts, S. (2018). A single-cell transcriptome atlas of the aging Drosophila brain. Cell, 174(4), 982-998.

pycisTopic.loom.get_metadata(loom)[source]

A helper function to get metadata

pycisTopic.loom.get_regulons(loom)[source]

A helper function to get regulons

Signature enrichment

pycisTopic.signature_enrichment.gene_set_to_signature(gene_set: List, name: str)[source]

A helper function to generate gene signatures

Parameters

gene_set: List

List of genes

name: str

Name for the signature

pycisTopic.signature_enrichment.region_set_to_signature(query_region_set: PyRanges, target_region_set: PyRanges, name: str)[source]

A helper function to intersect query regions with the input data set regions

Parameters

query_region_set: pr.PyRanges

Pyranges with regions to query

target_region_set: pr.PyRanges

Pyranges with target regions

name: str

Name for the signature
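
The intersection step can be illustrated without PyRanges. The following is only a sketch under stated assumptions: regions are plain (chromosome, start, end) tuples with half-open coordinates, the signature is taken to be the names of the target regions that overlap at least one query region, and overlapping_targets is a hypothetical helper, not a pycisTopic function.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), half-open coordinates


def overlapping_targets(query: List[Region], target: List[Region]) -> List[str]:
    """Return names of target regions that overlap any query region (illustrative only)."""
    names = []
    for t_chrom, t_start, t_end in target:
        # Two half-open intervals overlap when each one starts before the other ends.
        if any(
            q_chrom == t_chrom and q_start < t_end and t_start < q_end
            for q_chrom, q_start, q_end in query
        ):
            names.append(f"{t_chrom}:{t_start}-{t_end}")
    return names


query_regions = [("chr1", 100, 200)]
target_regions = [("chr1", 150, 650), ("chr1", 700, 1200), ("chr2", 150, 650)]
signature = overlapping_targets(query_regions, target_regions)
```

Here only the first target region shares a chromosome and coordinates with the query, so it is the only name kept.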

pycisTopic.signature_enrichment.signature_enrichment(rankings: CistopicImputedFeatures, signatures: Dict[str, PyRanges] | Dict[str, List], enrichment_type: str = 'region', auc_threshold: float = 0.05, normalize: bool = False, n_cpu: int = 1)[source]

Get enrichment of a region signature in cells or topics using AUCell (Van de Sande et al., 2020)

Parameters

rankings: CistopicImputedFeatures

A CistopicImputedFeatures object with ranking values

signatures: Dictionary of pr.PyRanges (for regions) or list (for genes)

A dictionary containing region signatures as pr.PyRanges or gene names as list

enrichment_type: str

Whether features are genes or regions

auc_threshold: float

The fraction of the ranked genome to take into account for the calculation of the Area Under the recovery Curve. Default: 0.05

normalize: bool

Normalize the AUC values to a maximum of 1.0 per regulon. Default: False

n_cpu: int

The number of cores to use. Default: 1
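
To make the role of auc_threshold concrete, here is a simplified, pure-Python sketch of a recovery-curve AUC for one cell. It is not the AUCell implementation used by the library: the step-shaped curve, the normalisation by a loose upper bound, and the name recovery_auc_sketch are assumptions made for illustration.

```python
from typing import Dict, Set


def recovery_auc_sketch(
    ranks: Dict[str, int], signature: Set[str], auc_threshold: float = 0.05
) -> float:
    """ranks maps feature -> 0-based rank within one cell (0 = top-ranked feature)."""
    # Only the top fraction of the ranking contributes to the area.
    cutoff = max(1, round(auc_threshold * len(ranks)))
    hits = sorted(ranks[f] for f in signature if f in ranks and ranks[f] < cutoff)
    # Step-shaped recovery curve: a hit at rank r counts for every position
    # from r up to (but excluding) the cutoff.
    area = sum(cutoff - r for r in hits)
    # Loose upper bound: every signature feature recovered at rank 0.
    max_area = cutoff * len(signature)
    return area / max_area if max_area else 0.0


# 20 features named r0..r19, already ranked by their index.
cell_ranks = {f"r{i}": i for i in range(20)}
auc = recovery_auc_sketch(cell_ranks, {"r0", "r3", "r15"}, auc_threshold=0.25)
```

With 20 features and a threshold of 0.25, only ranks 0 to 4 are considered, so r15 does not contribute to the score.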

pyGREAT

pycisTopic.pyGREAT.get_region_signature(pyGREAT_results: Dict[str, DataFrame], region_set_key: str, ontology: str, term: str)[source]

Retrieve a GO region signature from GREAT results

Parameters:
pyGREAT_results: Dict

A dictionary with pyGREAT results.

region_set_key: str

Key of the region set to query

ontology: str

Ontology to query

term: str

Term to retrieve regions from

pycisTopic.pyGREAT.pyGREAT(region_sets: Dict[str, PyRanges], species: str, rule: str = 'basalPlusExt', span: float = 1000.0, upstream: float = 5.0, downstream: float = 1.0, two_distance: float = 1000.0, one_distance: float = 1000.0, include_curated_reg_doms: int = 1, bg_choice: str = 'wholeGenome', tmp_dir: str = None, n_cpu: int = 1, **kwargs)[source]

Run GREAT (McLean et al., 2010) on a dictionary of PyRanges objects. For more details on GREAT parameters, please visit http://great.stanford.edu/public/html/

Parameters:
region_sets: Dict

A dictionary containing region sets to query as pyRanges objects.

species: str

Genome assembly from where the coordinates come from. Possible values are: ‘mm9’, ‘mm10’, ‘hg19’, ‘hg38’

rule: str

How to associate genomic regions to genes. Possible options are ‘basalPlusExt’, ‘twoClosest’, ‘oneClosest’. Default: ‘basalPlusExt’

span: float

Unit: kb, only used when rule is ‘basalPlusExt’. Default: 1000.0

upstream: float

Unit: kb, only used when rule is ‘basalPlusExt’. Default: 5.0

downstream: float

Unit: kb, only used when rule is ‘basalPlusExt’. Default: 1.0

two_distance: float

Unit: kb, only used when rule is ‘twoClosest’. Default: 1000.0

one_distance: float

Unit: kb, only used when rule is ‘oneClosest’. Default: 1000.0

include_curated_reg_doms: int

Whether to include curated regulatory domains. Default: 1

bg_choice: str

A path to the background file or a string. Default: ‘wholeGenome’

tmp_dir: str

Temporary directory to save region sets as bed files for GREAT. Default: None

n_cpu: int

Number of cores to use. Default: 1

**kwargs

Other parameters to pass to ray.init
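
The upstream and downstream values of the ‘basalPlusExt’ rule are given in kb relative to the TSS and depend on the strand of the gene. The sketch below computes only the basal domain for a single gene. The extension step limited by span is omitted, and basal_domain is a hypothetical helper written for illustration, not part of pycisTopic or of the GREAT service.

```python
from typing import Tuple


def basal_domain(
    tss: int, strand: str, upstream_kb: float = 5.0, downstream_kb: float = 1.0
) -> Tuple[int, int]:
    """Return (start, end) of the basal regulatory domain around a TSS, in bp."""
    up, down = int(upstream_kb * 1000), int(downstream_kb * 1000)
    if strand == "-":
        # On the minus strand, upstream lies towards higher coordinates.
        return max(0, tss - down), tss + up
    return max(0, tss - up), tss + down


plus_domain = basal_domain(100_000, "+")
minus_domain = basal_domain(100_000, "-")
```

With the default values, a gene on the plus strand gets 5 kb before and 1 kb after its TSS, and the two distances swap sides for a gene on the minus strand.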

References

McLean, C. Y., Bristor, D., Hiller, M., Clarke, S. L., Schaar, B. T., Lowe, C. B., … & Bejerano, G. (2010). GREAT improves functional interpretation of cis-regulatory regions. Nature biotechnology, 28(5), 495-501.

pycisTopic.pyGREAT.pyGREAT_oneset(region_set: PyRanges, species: str, rule: str = 'basalPlusExt', span: float = 1000.0, upstream: float = 5.0, downstream: float = 1.0, two_distance: float = 1000.0, one_distance: float = 1000.0, include_curated_reg_doms: int = 1, bg_choice: str = 'wholeGenome', tmp_dir: str = None)[source]

Run GREAT (McLean et al., 2010) on a PyRanges object. For more details on GREAT parameters, please visit http://great.stanford.edu/public/html/

Parameters:
region_set: pr.PyRanges

A PyRanges object containing the regions to query.

species: str

Genome assembly from where the coordinates come from. Possible values are: ‘mm9’, ‘mm10’, ‘hg19’, ‘hg38’

rule: str

How to associate genomic regions to genes. Possible options are ‘basalPlusExt’, ‘twoClosest’, ‘oneClosest’. Default: ‘basalPlusExt’

span: float

Unit: kb, only used when rule is ‘basalPlusExt’. Default: 1000.0

upstream: float

Unit: kb, only used when rule is ‘basalPlusExt’. Default: 5.0

downstream: float

Unit: kb, only used when rule is ‘basalPlusExt’. Default: 1.0

two_distance: float

Unit: kb, only used when rule is ‘twoClosest’. Default: 1000.0

one_distance: float

Unit: kb, only used when rule is ‘oneClosest’. Default: 1000.0

include_curated_reg_doms: int

Whether to include curated regulatory domains. Default: 1

bg_choice: str

A path to the background file or a string. Default: ‘wholeGenome’

tmp_dir: str

Temporary directory to save region sets as bed files for GREAT. Default: None

References

McLean, C. Y., Bristor, D., Hiller, M., Clarke, S. L., Schaar, B. T., Lowe, C. B., … & Bejerano, G. (2010). GREAT improves functional interpretation of cis-regulatory regions. Nature biotechnology, 28(5), 495-501.

Gene activity

pycisTopic.gene_activity.calculate_distance_join(pr_obj: PyRanges)[source]

A helper function to calculate distances between regions and genes.

pycisTopic.gene_activity.calculate_distance_with_limits_join(pr_obj: PyRanges)[source]

A helper function to calculate distances between regions and genes, returning information on what is the relative distance to the TSS and end of the gene.

pycisTopic.gene_activity.extend_pyranges(pr_obj: PyRanges, upstream: int, downstream: int)[source]

A helper function to extend coordinates downstream/upstream in a pyRanges given upstream and downstream distances.
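
A minimal sketch of what a strand-aware extension looks like, using plain tuples instead of PyRanges. It assumes that upstream and downstream are interpreted relative to the strand of each interval. extend_intervals is a hypothetical stand-in, not the library function.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int, str]  # (chromosome, start, end, strand)


def extend_intervals(
    intervals: List[Interval], upstream: int, downstream: int
) -> List[Interval]:
    """Extend each interval by upstream/downstream bp relative to its strand."""
    extended = []
    for chrom, start, end, strand in intervals:
        if strand == "-":
            # On the minus strand, upstream lies towards higher coordinates.
            extended.append((chrom, start - downstream, end + upstream, strand))
        else:
            extended.append((chrom, start - upstream, end + downstream, strand))
    return extended


genes = [("chr1", 10_000, 12_000, "+"), ("chr1", 50_000, 53_000, "-")]
extended = extend_intervals(genes, upstream=5_000, downstream=1_000)
```

The sketch does not clip coordinates at chromosome ends, which a real implementation would need to handle (for example with the chromsizes information).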

pycisTopic.gene_activity.extend_pyranges_with_limits(pr_obj: PyRanges)[source]

A helper function to extend coordinates downstream/upstream in a pyRanges with Distance_upstream and Distance_downstream columns.

pycisTopic.gene_activity.get_gene_activity(imputed_acc_object: CistopicImputedFeatures, pr_annot: PyRanges, chromsizes: PyRanges, predefined_boundaries: PyRanges | None = None, use_gene_boundaries: bool | None = True, upstream: List[int] | None = [1000, 100000], downstream: List[int] | None = [1000, 100000], distance_weight: bool | None = True, decay_rate: float | None = 1, extend_gene_body_upstream: int | None = 5000, extend_gene_body_downstream: int | None = 0, gene_size_weight: bool | None = False, gene_size_scale_factor: int | str | None = 'median', remove_promoters: bool | None = False, scale_factor: float | None = 1, average_scores: bool | None = True, extend_tss: List[int] | None = [10, 10], return_weights: bool | None = True, gini_weight: bool | None = True, project: str | None = 'Gene_activity')[source]

Infer gene activity.

Parameters

imputed_acc_object: CistopicImputedFeatures

A cisTopic imputation data object.

pr_annot: pr.PyRanges

A pr.PyRanges containing gene annotation, including Chromosome, Start, End, Strand (as ‘+’ and ‘-’), Gene name and Transcription Start Site.

chromsizes: pr.PyRanges

A pr.PyRanges containing size of each chromosome, containing ‘Chromosome’, ‘Start’ and ‘End’ columns.

predefined_boundaries: pr.PyRanges

A pr.PyRanges containing predefined genomic domain boundaries (e.g. TAD boundaries) to use as boundaries. If given, use_gene_boundaries will be ignored.

use_gene_boundaries: bool, optional

Whether to use the whole search space or stop when encountering another gene. Default: True

upstream: List, optional

Search space upstream. The minimum (first position) means that even if there is a gene right next to it these bp will be taken. The second position indicates the maximum distance. Default: [1000,100000]

downstream: List, optional

Search space downstream. The minimum (first position) means that even if there is a gene right next to it these bp will be taken. The second position indicates the maximum distance. Default: [1000,100000]

distance_weight: bool, optional

Whether to add a distance weight (an exponential function, the weight will decrease with distance). Default: True

decay_rate: float, optional

Exponent for the distance exponential function (the higher the value, the faster the decrease). Default: 1

extend_gene_body_upstream: int, optional

Number of bp upstream immune to the distance weight (their value will be maximum for this weight). Default: 5000

extend_gene_body_downstream: int, optional

Number of bp downstream immune to the distance weight (their value will be maximum for this weight). Default: 0

gene_size_weight: bool, optional

Whether to add a weight based on the length of the gene. Default: False

gene_size_scale_factor: str or int, optional

Dividend to calculate the gene size weight. Default is the median value of all genes in the genome.

remove_promoters: bool, optional

Whether to ignore promoters when computing gene activity. Default: False

average_scores: bool, optional

Whether to divide by the total number of regions assigned to a gene when calculating the gene activity score. Default: True

scale_factor: float, optional

Value by which to multiply the final gene activity matrix. Default: 1

extend_tss: list, optional

Space around the TSS considered as promoter. Default: [10,10]

return_weights: bool, optional

Whether to return the final weight values. Default: True

gini_weight: bool, optional

Whether to add a Gini index weight. The more unique the region is, the higher this weight will be. Default: True

project: str, optional

Project name for the CistopicImputedFeatures with the gene activity
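
The interaction between distance_weight, decay_rate and extend_gene_body_upstream can be pictured with a small sketch. The exact functional form used by pycisTopic is not given in this reference, so the formula below is an assumption: an exponential decay over distance, with a plateau in which regions keep the maximum weight. The names distance_weight_sketch, plateau_bp and scale_bp are hypothetical.

```python
import math


def distance_weight_sketch(
    distance_bp: int,
    decay_rate: float = 1.0,
    plateau_bp: int = 5_000,     # assumed to play the role of extend_gene_body_upstream
    scale_bp: float = 5_000.0,   # assumed distance scale, not taken from the library
) -> float:
    """Exponential distance weight with a plateau of maximum weight near the gene."""
    # Regions within the plateau are immune to the decay and keep weight 1.0.
    effective = max(0, abs(distance_bp) - plateau_bp)
    return math.exp(-decay_rate * effective / scale_bp)


weights = {d: distance_weight_sketch(d) for d in (0, 5_000, 10_000, 100_000)}
```

A larger decay_rate makes the weight fall off faster beyond the plateau, while a larger plateau keeps more of the nearby regions at full weight.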

pycisTopic.gene_activity.reduce_pyranges_b(pr_obj: PyRanges, upstream: int, downstream: int)[source]

A helper function to reduce coordinates downstream/upstream in a pyRanges given upstream and downstream distances.

pycisTopic.gene_activity.reduce_pyranges_with_limits_b(pr_obj: PyRanges)[source]

A helper function to reduce coordinates downstream/upstream in a pyRanges with Distance_upstream and Distance_downstream columns.

pycisTopic.gene_activity.region_weights(imputed_acc_object, pr_annot, chromsizes, predefined_boundaries=None, use_gene_boundaries=True, upstream=[1000, 100000], downstream=[1000, 100000], distance_weight=True, decay_rate=1, extend_gene_body_upstream=5000, extend_gene_body_downstream=0, gene_size_weight=True, gene_size_scale_factor='median', remove_promoters=True, extend_tss=[10, 10], gini_weight=True)[source]

Calculate region weights.

Parameters

imputed_acc_object: CistopicImputedFeatures

A cisTopic imputation data object.

pr_annot: pr.PyRanges

A pr.PyRanges containing gene annotation, including Chromosome, Start, End, Strand (as ‘+’ and ‘-’), Gene name and Transcription Start Site.

chromsizes: pr.PyRanges

A pr.PyRanges containing size of each chromosome, containing ‘Chromosome’, ‘Start’ and ‘End’ columns.

predefined_boundaries: pr.PyRanges

A pr.PyRanges containing predefined genomic domain boundaries (e.g. TAD boundaries) to use as boundaries. If given, use_gene_boundaries will be ignored.

use_gene_boundaries: bool, optional

Whether to use the whole search space or stop when encountering another gene. Default: True

upstream: List, optional

Search space upstream. The minimum (first position) means that even if there is a gene right next to it these bp will be taken. The second position indicates the maximum distance. Default: [1000,100000]

downstream: List, optional

Search space downstream. The minimum (first position) means that even if there is a gene right next to it these bp will be taken. The second position indicates the maximum distance. Default: [1000,100000]

distance_weight: bool, optional

Whether to add a distance weight (an exponential function, the weight will decrease with distance). Default: True

decay_rate: float, optional

Exponent for the distance exponential function (the higher the value, the faster the decrease). Default: 1

extend_gene_body_upstream: int, optional

Number of bp upstream immune to the distance weight (their value will be maximum for this weight). Default: 5000

extend_gene_body_downstream: int, optional

Number of bp downstream immune to the distance weight (their value will be maximum for this weight). Default: 0

gene_size_weight: bool, optional

Whether to add a weight based on the length of the gene. Default: True

gene_size_scale_factor: str or int, optional

Dividend to calculate the gene size weight. Default is the median value of all genes in the genome.

remove_promoters: bool, optional

Whether to ignore promoters when computing gene activity. Default: True

extend_tss: list, optional

Space around the TSS considered as promoter. Default: [10,10]

gini_weight: bool, optional

Whether to add a Gini index weight. The more unique the region is, the higher this weight will be. Default: True

pycisTopic.gene_activity.weighted_aggregation(imputed_acc_obj_mtx: csr_matrix, region_weights_df_per_gene: DataFrame, average_scores: bool)[source]

Weighted aggregation of region probabilities into gene activity

Parameters

imputed_acc_obj_mtx: sparse.csr_matrix

A sparse matrix with regions as rows and cells as columns.

region_weights_df_per_gene: pd.DataFrame

A data frame with region index (from the sparse matrix) for the gene

average_scores: bool

Whether final values should be divided by the total number of regions aggregated
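
The aggregation itself amounts to a weighted sum over the regions linked to a gene, optionally divided by the number of regions. The sketch below uses nested lists and a dictionary in place of the sparse matrix and data frame. aggregate_gene_activity and its argument layout are assumptions made for illustration, not the library's internals.

```python
from typing import Dict, List


def aggregate_gene_activity(
    region_by_cell: List[List[float]],
    region_weights: Dict[int, float],
    average_scores: bool,
) -> List[float]:
    """Weighted sum of region values per cell for one gene.

    region_by_cell: rows are regions, columns are cells.
    region_weights: maps row index of a linked region to its weight.
    """
    n_cells = len(region_by_cell[0])
    scores = [0.0] * n_cells
    for region_idx, weight in region_weights.items():
        for cell_idx, value in enumerate(region_by_cell[region_idx]):
            scores[cell_idx] += weight * value
    if average_scores:
        # Divide by the number of regions aggregated for this gene.
        scores = [s / len(region_weights) for s in scores]
    return scores


matrix = [
    [0.2, 0.0],  # region 0
    [0.4, 0.1],  # region 1
    [0.0, 0.9],  # region 2, not linked to the gene below
]
gene_scores = aggregate_gene_activity(matrix, {0: 1.0, 1: 0.5}, average_scores=True)
```

Region 2 has no weight for this gene, so its high value in the second cell does not influence the score.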

Label transfer

pycisTopic.label_transfer.label_transfer(ref_anndata: AnnData, query_anndata: AnnData, labels_to_transfer: List[str], sample_id_col: str | None = 'sample_id', n_cpu: int | None = 1, variable_genes: bool | None = True, methods: List[str] | None = ['ingest', 'harmony', 'bbknn', 'scanorama', 'cca'], pca_ncomps: List[int] | None = [50, 50], n_neighbours: List[int] | None = [10, 10], bbknn_components: int | None = 30, cca_components: int | None = 30, return_label_weights: bool | None = False, **kwargs)[source]

Wrapper function of Ray processes to compute label transfer from a single reference to multiple query samples.

Parameters

ref_anndata: AnnData

An AnnData object containing the reference data set (typically, scRNA-seq data)

query_anndata: AnnData

An AnnData object containing the query data set, with features matching with the reference data set (typically, gene activities derived from scATAC-seq)

labels_to_transfer: List

Labels to transfer. They must be included in ref_anndata.obs.

sample_id_col: str

Name of the column containing the sample ids in the query data set. It must be included in query_anndata.obs. Default: sample_id

n_cpu: int, optional

Number of cores to use. Default: 1.

variable_genes: bool, optional

Whether variable genes matching between the two data set should be used (True) or otherwise, all matching genes (False). Default: True

methods: List, optional

Methods to be used for label transfer. These include: ‘ingest’ [from scanpy], ‘harmony’ [Korsunsky et al, 2019], ‘bbknn’ [Polański et al, 2020], ‘scanorama’ [Hie et al, 2019] and ‘cca’. Except for ingest, these methods return a common coembedding and labels are inferred using the distances between query and reference cells as weights.

pca_ncomps: List, optional

Number of principal components to use for reference and query, respectively. Default: [50,50]

n_neighbours: List, optional

Number of neighbours to use for reference and query, respectively. Default: [10,10]

bbknn_components: int, optional

Number of components to use for the umap for bbknn integration. Default: 30

cca_components: int, optional

Number of components to use for cca. Default: 30

return_label_weights: bool, optional

Whether to return the label scores per variable (as a dictionary, except for ingest). Default: False

**kwargs

Additional parameters for ray.init.

References

Korsunsky, I., Millard, N., Fan, J., Slowikowski, K., Zhang, F., Wei, K., … & Raychaudhuri, S. (2019). Fast, sensitive and accurate integration of single-cell data with Harmony. Nature methods, 16(12), 1289-1296.

Polański, K., Young, M. D., Miao, Z., Meyer, K. B., Teichmann, S. A., & Park, J. E. (2020). BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics, 36(3), 964-965.

Hie, B., Bryson, B., & Berger, B. (2019). Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nature biotechnology, 37(6), 685-691.

pycisTopic.label_transfer.label_transfer_coembedded(dist, labels)[source]

A helper function to propagate labels in a common space
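
As an illustration of propagating labels with distances as weights, the sketch below scores each label for one query cell from its distances to labelled reference cells. The inverse-distance weighting and the name weighted_label_vote are assumptions. The weighting actually applied by the library is not described in this reference.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_label_vote(distances: List[float], labels: List[str]) -> Dict[str, float]:
    """Score labels for one query cell from distances to reference cells."""
    scores: Dict[str, float] = defaultdict(float)
    for dist, label in zip(distances, labels):
        # Assumed weighting: closer reference cells contribute more.
        scores[label] += 1.0 / (dist + 1e-9)
    total = sum(scores.values())
    # Normalise so that the label scores sum to 1.
    return {label: score / total for label, score in scores.items()}


label_scores = weighted_label_vote([0.5, 1.0, 2.0], ["T cell", "B cell", "T cell"])
best_label = max(label_scores, key=label_scores.get)
```

Returning the full score dictionary rather than only the best label mirrors the idea behind return_label_weights, where per-label scores can be inspected afterwards.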

Utils

pycisTopic.utils.collapse_duplicates(df)[source]

Collapse duplicates from fragments df

pycisTopic.utils.coord_to_region_names(coord)[source]

PyRanges to region names

pycisTopic.utils.fig2img(fig)[source]

Convert a Matplotlib figure to a PIL Image and return it

pycisTopic.utils.get_tss_matrix(fragments, flank_window, tss_space_annotation)[source]

Get TSS matrix

pycisTopic.utils.gini(array)[source]

Calculate the Gini coefficient of a numpy array.
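
For reference, the Gini coefficient can be computed with the standard rank-weighted formula on sorted values. This pure-Python sketch assumes non-negative input and is not necessarily how pycisTopic handles edge cases such as zeros or negative values. gini_sketch is a hypothetical name.

```python
from typing import Sequence


def gini_sketch(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative values (0 = even, close to 1 = concentrated)."""
    xs = sorted(float(v) for v in values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over ascending values, with ranks starting at 1.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


even = gini_sketch([1, 1, 1, 1])     # identical values everywhere
skewed = gini_sketch([0, 0, 0, 10])  # a single entry holds the whole signal
```

In the gene activity context, a region whose accessibility is concentrated in few cells would score closer to the skewed case than to the even one.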

pycisTopic.utils.normalise_filepath(path: str | Path, check_not_directory: bool = True) → str[source]

Create a string path, expanding the home directory if present.
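
A rough equivalent using only the standard library is shown below. Raising IsADirectoryError when check_not_directory is set is an assumption about the behaviour, and normalise_filepath_sketch is a hypothetical stand-in for the library function.

```python
import os
from pathlib import Path
from typing import Union


def normalise_filepath_sketch(
    path: Union[str, Path], check_not_directory: bool = True
) -> str:
    """Return the path as a string with a leading '~' expanded to the home directory."""
    expanded = os.path.expanduser(str(path))
    if check_not_directory and os.path.isdir(expanded):
        # Assumed behaviour: refuse directories when a file path is expected.
        raise IsADirectoryError(f"Expected a file path, got a directory: {expanded!r}")
    return expanded


expanded = normalise_filepath_sketch("~/fragments.tsv.gz")
```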

pycisTopic.utils.read_fragments_from_file(fragments_bed_filename, use_polars: bool = True) → PyRanges[source]

Read fragments BED file to PyRanges object.

Parameters:
fragments_bed_filename: Fragments BED filename.
use_polars: Use polars instead of pandas for reading the fragments BED file.
Returns:
PyRanges object of fragments.

pycisTopic.utils.region_names_to_coordinates(region_names: Sequence[str]) → DataFrame[source]

Create Pandas DataFrame with region IDs to coordinates mapping.

Parameters:
region_names: List of region names in “chrom:start-end” format.
Returns:
Pandas DataFrame with region IDs to coordinates mapping.
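
The “chrom:start-end” format can be parsed with plain string operations. The sketch below returns a dictionary rather than a pandas DataFrame to stay dependency-free. parse_region_names is a hypothetical helper, not the library function.

```python
from typing import Dict, Sequence, Tuple


def parse_region_names(region_names: Sequence[str]) -> Dict[str, Tuple[str, int, int]]:
    """Map each 'chrom:start-end' region name to a (chrom, start, end) tuple."""
    coords = {}
    for name in region_names:
        # Split on the last ':' so chromosome names that contain ':' still parse.
        chrom, _, span = name.rpartition(":")
        start, _, end = span.partition("-")
        coords[name] = (chrom, int(start), int(end))
    return coords


coords = parse_region_names(["chr1:100-600", "chrX:2000-2500"])
```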