API

cisTopic object

class pycisTopic.cistopic_class.CistopicObject(fragment_matrix: csr_matrix, binary_matrix: csr_matrix, cell_names: List[str], region_names: List[str], cell_data: DataFrame, region_data: DataFrame, path_to_fragments: str | Dict[str, str], project: str | None = 'cisTopic')[source]

cisTopic data class.

CistopicObject contains the cell by fragment matrices (stored as counts fragment_matrix and as binary accessibility binary_matrix), cell metadata cell_data, region metadata region_data and path/s to the fragments file/s path_to_fragments.

LDA models from CistopicLDAModel can be stored in selected_model, and cell/region projections can be stored in projections as a dictionary.

Attributes:

fragment_matrix: sparse.csr_matrix: A matrix containing cell names as column names, regions as row names and fragment counts as values.
binary_matrix: sparse.csr_matrix: A matrix containing cell names as column names, regions as row names and whether regions are accessible (0: Not accessible; 1: Accessible) as values.
cell_names: list: A list containing cell names.
region_names: list: A list containing region names.
cell_data: pd.DataFrame: A data frame containing cell information, with cells as indexes and attributes as columns.
region_data: pd.DataFrame: A data frame containing region information, with region as indexes and attributes as columns.
path_to_fragments: str or dict: The path, or a dictionary of paths, to the fragments files used to generate the CistopicObject.
project: str: Name of the cisTopic project.

add_LDA_model(model: CistopicLDAModel)[source]

Add LDA model to a cisTopic object.

Parameters:

model: CistopicLDAModel: Selected cisTopic LDA model results (see LDAModels.evaluate_models)

add_cell_data(cell_data: DataFrame, split_pattern: str | None = '___')[source]

Add cell metadata to CistopicObject. If a column already exists in the cell metadata, it will be overwritten.

Parameters:

cell_data: pd.DataFrame: A data frame containing metadata information, with cell names as indexes. If cells are missing from the metadata, values will be filled with NaN.
split_pattern: str: Pattern to split cell barcode from sample id. Default: ___
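
The overwrite and NaN-fill behaviour described above can be pictured with a small pure-Python sketch. It does not call pycisTopic: the helper name add_cell_data_sketch and the dictionary layout are assumptions made for illustration, and the real method works on pandas data frames.

```python
import math


def add_cell_data_sketch(cell_names, cell_data, new_data):
    """Return updated per-cell metadata as {cell_name: {column: value}}.

    Hypothetical helper: mimics the documented semantics only.
    """
    new_columns = {col for row in new_data.values() for col in row}
    updated = {}
    for cell in cell_names:
        row = dict(cell_data.get(cell, {}))
        for col in new_columns:
            # Existing columns are overwritten; cells that are absent
            # from new_data receive NaN for the added columns.
            row[col] = new_data.get(cell, {}).get(col, math.nan)
        updated[cell] = row
    return updated


cells = ["AAAC-1___s1", "TTTG-1___s1"]
existing = {
    "AAAC-1___s1": {"cell_type": "unknown"},
    "TTTG-1___s1": {"cell_type": "unknown"},
}
annotation = {"AAAC-1___s1": {"cell_type": "neuron"}}
result = add_cell_data_sketch(cells, existing, annotation)
```

Here the annotated cell ends up with the new value, while the cell missing from the annotation gets NaN in the overwritten column.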

add_region_data(region_data: DataFrame)[source]

Add region metadata to CistopicObject. If a column already exists in the region metadata, it will be overwritten.

Parameters:

region_data: pd.DataFrame: A data frame containing metadata information, with region names as indexes. If regions are missing from the metadata, values will be filled with NaN.

merge(cistopic_obj_list: List[CistopicObject], is_acc: int | None = 1, project: str | None = 'cisTopic_merge', copy: bool | None = False, split_pattern: str | None = '___')[source]

Merge a list of CistopicObject to the input CistopicObject. Reference coordinates must be the same between the objects. Any existing cisTopicCGSModel and projections will be deleted, which ensures that models contained in a CistopicObject are derived from the cells it contains.

Parameters:

cistopic_obj_list: list: A list containing one or more CistopicObject to merge.
is_acc: int, optional: Minimal number of fragments for a region to be considered accessible. Default: 1.
project: str, optional: Name of the cisTopic project.
copy: bool, optional: Whether changes should be made on the input CistopicObject or a new object should be returned. Default: False.
split_pattern: str: Pattern to split cell barcode from sample id. Default: ___

Returns:

CistopicObject: A combined CistopicObject. Two new columns in cell_data indicate the CistopicObject of origin (cisTopic_id) and the fragment file from which the cell comes (path_to_fragments).

subset(cells: List[str] | None = None, regions: List[str] | None = None, copy: bool | None = False, split_pattern: str | None = '___')[source]

Subset cells and/or regions from CistopicObject. Any existing CistopicLDAModel and projections will be deleted, which ensures that models contained in a CistopicObject are derived from the cells it contains.

Parameters:

cells: list, optional: A list containing the names of the cells to keep.
regions: list, optional: A list containing the names of the regions to keep.
copy: bool, optional: Whether changes should be made on the input CistopicObject or a new object should be returned. Default: False.
split_pattern: str: Pattern to split cell barcode from sample id. Default: ___

pycisTopic.cistopic_class.create_cistopic_object(fragment_matrix: DataFrame | csr_matrix, cell_names: List[str] | None = None, region_names: List[str] | None = None, path_to_blacklist: str | None = None, min_frag: int | None = 1, min_cell: int | None = 1, is_acc: int | None = 1, path_to_fragments: str | Dict[str, str] | None = {}, project: str | None = 'cisTopic', tag_cells: bool | None = True, split_pattern: str | None = '___')[source]

Creates a CistopicObject from a count matrix.

Parameters:

fragment_matrix: pd.DataFrame or sparse.csr_matrix: A data frame containing cell names as column names, regions as row names and fragment counts as values or sparse.csr_matrix containing cells as columns and regions as rows.
cell_names: list, optional: A list containing cell names. Only used if the fragment matrix is sparse.csr_matrix.
region_names: list, optional: A list containing region names. Only used if the fragment matrix is sparse.csr_matrix.
path_to_blacklist: str, optional: Path to bed file containing blacklist regions (Amemiya et al., 2019).
min_frag: int, optional: Minimal number of fragments in a cell for the cell to be kept. Default: 1
min_cell: int, optional: Minimal number of cells in which a region is detected to be kept. Default: 1
is_acc: int, optional: Minimal number of fragments for a region to be considered accessible. Default: 1
path_to_fragments: str, dict: A dict or str containing the paths to the fragments files used to generate the CistopicObject. Default: {}.
project: str, optional: Name of the cisTopic project. Default: ‘cisTopic’
tag_cells: bool, optional: Whether to add the project name as suffix to the cell names. Default: True
split_pattern: str: Pattern to split cell barcode from sample id. Default: ___
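
To make the roles of min_frag, min_cell and is_acc concrete, here is a minimal sketch on a small dense matrix (regions as rows, cells as columns). It is not the library's implementation: the function name filter_and_binarize is hypothetical, and the order of the two filtering steps is an assumption.

```python
def filter_and_binarize(counts, cell_names, region_names,
                        min_frag=1, min_cell=1, is_acc=1):
    """counts[r][c] holds the fragment count for region r in cell c."""
    # Keep cells with at least min_frag fragments in total.
    keep_cells = [
        c for c in range(len(cell_names))
        if sum(row[c] for row in counts) >= min_frag
    ]
    # Keep regions detected (count > 0) in at least min_cell kept cells.
    keep_regions = [
        r for r in range(len(region_names))
        if sum(1 for c in keep_cells if counts[r][c] > 0) >= min_cell
    ]
    fragment_matrix = [[counts[r][c] for c in keep_cells] for r in keep_regions]
    # A region is accessible in a cell when it has at least is_acc fragments.
    binary_matrix = [[1 if v >= is_acc else 0 for v in row]
                     for row in fragment_matrix]
    return (
        [cell_names[c] for c in keep_cells],
        [region_names[r] for r in keep_regions],
        fragment_matrix,
        binary_matrix,
    )


counts = [
    [3, 0, 1],
    [0, 0, 2],
    [0, 0, 0],
]
cells, regions, frag, binary = filter_and_binarize(
    counts, ["c1", "c2", "c3"], ["r1", "r2", "r3"],
    min_frag=1, min_cell=1, is_acc=2,
)
```

In this toy input the empty cell c2 and the empty region r3 are dropped, and with is_acc=2 only counts of 2 or more are marked accessible in the binary matrix.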

References

Amemiya, H. M., Kundaje, A., & Boyle, A. P. (2019). The ENCODE blacklist: identification of problematic regions of the genome. Scientific reports, 9(1), 1-5.

Creates a CistopicObject from a fragments file and defined genomic intervals (compatible with CellRangerATAC output)

Parameters:

path_to_fragments: str: The path to the fragments file containing chromosome, start, end and assigned barcode for each read (e.g. from CellRanger ATAC (/outs/fragments.tsv.gz)).
path_to_regions: str: Path to the bed file with the defined regions.
path_to_blacklist: str, optional: Path to bed file containing blacklist regions (Amemiya et al., 2019). Default: None
metrics: str, optional: Data frame from CellRanger or similar, with barcodes and metrics (e.g. from CellRanger ATAC /outs/singlecell.csv). If it is an output from CellRanger, only cells for which is__cell_barcode is 1 will be considered, otherwise only barcodes included in the metrics will be taken. Default: None
valid_bc: list, optional: A list with valid cell barcodes can be provided, only used if metrics is not provided. Default: None
n_cpu: int, optional: Number of cores to use. Default: 1.
min_frag: int, optional: Minimal number of fragments in a cell for the cell to be kept. Default: 1
min_cell: int, optional: Minimal number of cells in which a region is detected to be kept. Default: 1
is_acc: int, optional: Minimal number of fragments for a region to be considered accessible. Default: 1
check_for_duplicates: bool, optional: If no duplicate counts are provided per row in the fragments file, whether to collapse duplicates. Default: True.
project: str, optional: Name of the cisTopic project. It will also be used as name for sample_id in the cell_data CistopicObject.cell_data. Default: ‘cisTopic’
partition: int, optional: When using Pandas > 0.21, counting may fail (https://github.com/pandas-dev/pandas/issues/26314). In that case, the fragments data frame is divided into this number of partitions, and the data is merged after counting.
fragments_df: pd.DataFrame or pr.PyRanges, optional: A PyRanges or DataFrame containing chromosome, start, end and assigned barcode for each read, corresponding to the data in path_to_fragments.
split_pattern: str: Pattern to split cell barcode from sample id. Default: ___
use_polars: bool, optional: Whether to use polars to read fragments files. Default: True.

References

Amemiya, H. M., Kundaje, A., & Boyle, A. P. (2019). The ENCODE blacklist: identification of problematic regions of the genome. Scientific reports, 9(1), 1-5.

pycisTopic.cistopic_class.create_cistopic_object_from_matrix_file(fragment_matrix_file: str, path_to_blacklist: str | None = None, compression: str | None = None, min_frag: int | None = 1, min_cell: int | None = 1, is_acc: int | None = 1, path_to_fragments: Dict[str, str] | None = {}, sample_id: DataFrame | None = None, project: str | None = 'cisTopic', split_pattern: str | None = '___')[source]

Creates a CistopicObject from a count matrix file (tsv).

Parameters:

fragment_matrix_file: str: Path to a tsv file containing cell names as column names, regions as row names and fragment counts as values.
path_to_blacklist: str, optional: Path to bed file containing blacklist regions (Amemiya et al., 2019). Default: None
compression: str, None: Whether the file is compressed (e.g. bzip). Default: None
min_frag: int, optional: Minimal number of fragments in a cell for the cell to be kept. Default: 1
min_cell: int, optional: Minimal number of cells in which a region is detected to be kept. Default: 1
is_acc: int, optional: Minimal number of fragments for a region to be considered accessible. Default: 1
path_to_fragments: dict, optional: A dictionary containing the paths to the fragments files used to generate the CistopicObject. Default: {}.
sample_id: pd.DataFrame, optional: A data frame indicating from which sample each barcode is derived. Required if path_to_fragments is provided. Levels must agree with keys in path_to_fragments. Default: None.
project: str, optional: Name of the cisTopic project. Default: ‘cisTopic’
split_pattern: str: Pattern to split cell barcode from sample id. Default: ___

References

Amemiya, H. M., Kundaje, A., & Boyle, A. P. (2019). The ENCODE blacklist: identification of problematic regions of the genome. Scientific reports, 9(1), 1-5.

pycisTopic.cistopic_class.merge(cistopic_obj_list: List[CistopicObject], is_acc: int | None = 1, project: str | None = 'cisTopic_merge', split_pattern: str | None = '___')[source]

Merge a list of CistopicObject into a single CistopicObject. Reference coordinates must be the same between the objects. Any existing cisTopicCGSModel and projections will be deleted, which ensures that models contained in a CistopicObject are derived from the cells it contains.

Parameters:

cistopic_obj_list: list: A list containing one or more CistopicObject to merge.
is_acc: int, optional: Minimal number of fragments for a region to be considered accessible. Default: 1.
project: str, optional: Name of the cisTopic project.

Pseudobulk formation and peak calling

class pycisTopic.pseudobulk_peak_calling.MACSCallPeak(macs_path: str, bed_path: str, name: str, outdir: str, genome_size: str, input_format: str | None = 'BEDPE', shift: int | None = 73, ext_size: int | None = 146, keep_dup: str | None = 'all', q_value: int | None = 0.05, nolambda: bool | None = True, skip_empty_peaks: bool = False)[source]

Parameters

macs_path: str: Path to MACS binary (e.g. /xxx/MACS/xxx/bin/macs2).
bed_path: str: Path to the fragments BED file.
name: str: Name of the group.
outdir: str: Path to the output directory.
genome_size: str: Effective genome size which is defined as the genome size which can be sequenced. Possible values: ‘hs’, ‘mm’, ‘ce’ and ‘dm’.
input_format: str, optional: Format of tag file; can be ELAND, BED, ELANDMULTI, ELANDEXPORT, SAM, BAM, BOWTIE, BAMPE, or BEDPE. MACS also accepts AUTO, which lets it decide the format automatically. Default: ‘BEDPE’.
shift: int, optional: To set an arbitrary shift in bp. For finding enriched cutting sites (such as in ATAC-seq) a shift of 73 bp is recommended. Default: 73.
ext_size: int, optional: To extend reads in 5’->3’ direction to fixed-size fragments. For ATAC-seq data, an extension of 146 bp is recommended. Default: 146.
keep_dup: str, optional: Whether to keep duplicate tags at the exact same location. Default: ‘all’.
q_value: float, optional: The q-value (minimum FDR) cutoff to call significant regions. Default: 0.05.
nolambda: bool, optional: Do not consider the local bias/lambda at peak candidate regions. Default: True.

call_peak()[source]: Run MACS2 peak calling.

load_narrow_peak(skip_empty_peaks: bool)[source]: Load MACS2 narrow peak files as pr.PyRanges.

pycisTopic.pseudobulk_peak_calling.export_pseudobulk(input_data: CistopicObject | DataFrame, variable: str, chromsizes: DataFrame | PyRanges, bed_path: str, bigwig_path: str, path_to_fragments: Dict[str, str] | None = None, sample_id_col: str = 'sample_id', n_cpu: int = 1, normalize_bigwig: bool = True, split_pattern: str = '___', temp_dir: str = '/tmp') → Tuple[Dict[str, str], Dict[str, str]][source]

Create pseudobulks as bed and bigwig from single cell fragments file given a barcode annotation.

Parameters

input_data: CistopicObject or pd.DataFrame: A CistopicObject containing the specified variable as a column in CistopicObject.cell_data or a cell metadata pd.DataFrame containing barcode as rows, containing the specified variable as a column (additional columns are possible) and a sample_id column. Index names must contain the BARCODE (e.g. ATGTCGTC-1), additional tags are possible separating with - (e.g. ATGCTGTGCG-1-Sample_1). The levels in the sample_id column must agree with the keys in the path_to_fragments dictionary. Alternatively, if the cell metadata contains a column named barcode it will be used instead of the index names.
variable: str: A character string indicating the column that will be used to define the groups for which pseudobulks are created. It must be included in the cell metadata provided as input_data.
chromsizes: pd.DataFrame or pr.PyRanges: A data frame or pr.PyRanges containing size of each chromosome, containing ‘Chromosome’, ‘Start’ and ‘End’ columns.
bed_path: str: Path to folder where the fragments bed files per group will be saved. If None, files will not be generated.
bigwig_path: str: Path to folder where the bigwig files per group will be saved. If None, files will not be generated.
path_to_fragments: str or dict, optional: A dictionary with sample names as keys and the paths to the fragments file/s from which pseudobulk profiles have to be created as values. If a CistopicObject is provided as input it will be ignored, but if a cell metadata pd.DataFrame is provided it is necessary to provide it. The keys of the dictionary need to match with the sample_id tag added to the index names of the input data frame.
sample_id_col: str, optional: Name of the column containing the sample name per barcode in the input CistopicObject.cell_data or pd.DataFrame. Default: ‘sample_id’.
n_cpu: int, optional: Number of cores to use. Default: 1.
normalize_bigwig: bool, optional: Whether bigwig files should be CPM normalized. Default: True.
split_pattern: str, optional: Pattern to split cell barcode from sample id. Default: ‘___’. Note, if split_pattern is not None, then export_pseudobulk will attempt to infer sample_id from the index of input_data and ignore sample_id_col.
temp_dir: str: Path to temporary directory. Default: ‘/tmp’.
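
The grouping step behind pseudobulk creation can be sketched in plain Python. This is an illustration only, not what export_pseudobulk runs internally: the helper names are hypothetical, cell names are assumed to look like barcode + split_pattern + sample id, and no bed or bigwig files are written.

```python
from collections import defaultdict


def group_barcodes(cell_data, variable, split_pattern="___"):
    """Map (group, sample_id) -> list of raw barcodes.

    cell_data: {cell_name: {column: value}}.
    """
    groups = defaultdict(list)
    for cell_name, row in cell_data.items():
        # Infer the sample id from the cell name, as split_pattern implies.
        barcode, sample_id = cell_name.split(split_pattern, 1)
        groups[(row[variable], sample_id)].append(barcode)
    return dict(groups)


def fragments_per_group(fragments, groups):
    """fragments: {sample_id: [(chrom, start, end, barcode), ...]}."""
    pseudobulks = defaultdict(list)
    for (group, sample_id), barcodes in groups.items():
        wanted = set(barcodes)
        for frag in fragments.get(sample_id, []):
            if frag[3] in wanted:
                pseudobulks[group].append(frag)
    return dict(pseudobulks)


cell_data = {
    "ATGTCGTC-1___s1": {"cell_type": "T"},
    "GGCATTAC-1___s1": {"cell_type": "B"},
    "ATGTCGTC-1___s2": {"cell_type": "B"},
}
fragments = {
    "s1": [("chr1", 10, 60, "ATGTCGTC-1"), ("chr1", 80, 150, "GGCATTAC-1")],
    "s2": [("chr2", 5, 40, "ATGTCGTC-1")],
}
pseudobulks = fragments_per_group(fragments, group_barcodes(cell_data, "cell_type"))
```

Note that the same barcode sequence appears in both samples but lands in different groups, which is why the sample id has to match the keys of the fragments dictionary.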

pycisTopic.pseudobulk_peak_calling.macs_call_peak(macs_path: str, bed_path: str, name: str, outdir: str, genome_size: str, input_format: str | None = 'BEDPE', shift: int | None = 73, ext_size: int | None = 146, keep_dup: str | None = 'all', q_value: int | None = 0.05, nolambda: bool | None = True, skip_empty_peaks: bool = False)[source]

Performs pseudobulk peak calling with MACS2 for one group. Requires MACS2 to be installed (https://github.com/macs3-project/MACS).

Parameters

macs_path: str: Path to MACS binary (e.g. /xxx/MACS/xxx/bin/macs2).
bed_path: str: Path to the fragments BED file.
name: str: Name of the group.
outdir: str: Path to the output directory.
genome_size: str: Effective genome size which is defined as the genome size which can be sequenced. Possible values: ‘hs’, ‘mm’, ‘ce’ and ‘dm’.
input_format: str, optional: Format of tag file; can be ELAND, BED, ELANDMULTI, ELANDEXPORT, SAM, BAM, BOWTIE, BAMPE, or BEDPE. MACS also accepts AUTO, which lets it decide the format automatically. Default: ‘BEDPE’.
shift: int, optional: To set an arbitrary shift in bp. For finding enriched cutting sites (such as in ATAC-seq) a shift of 73 bp is recommended. Default: 73.
ext_size: int, optional: To extend reads in 5’->3’ direction to fixed-size fragments. For ATAC-seq data, an extension of 146 bp is recommended. Default: 146.
keep_dup: str, optional: Whether to keep duplicate tags at the exact same location. Default: ‘all’.
q_value: float, optional: The q-value (minimum FDR) cutoff to call significant regions. Default: 0.05.
nolambda: bool, optional: Do not consider the local bias/lambda at peak candidate regions. Default: True.
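
As a rough guide to how these parameters relate to the MACS2 command line, the sketch below assembles a `macs2 callpeak` argument list. The long options used are standard MACS2 options, but the exact command that pycisTopic builds is not shown in this reference, so the set of flags (including --nomodel) is an assumption, as are the example paths and the function name build_macs2_command. The command is only constructed, not executed.

```python
def build_macs2_command(macs_path, bed_path, name, outdir, genome_size,
                        input_format="BEDPE", shift=73, ext_size=146,
                        keep_dup="all", q_value=0.05, nolambda=True):
    """Assemble a `macs2 callpeak` argument list from the documented parameters."""
    cmd = [
        macs_path, "callpeak",
        "--treatment", bed_path,
        "--name", name,
        "--outdir", outdir,
        "--format", input_format,
        "--gsize", genome_size,
        "--qvalue", str(q_value),
        # Assumed here: MACS2 applies --shift/--extsize only with --nomodel.
        "--nomodel",
        "--shift", str(shift),
        "--extsize", str(ext_size),
        "--keep-dup", keep_dup,
    ]
    if nolambda:
        cmd.append("--nolambda")
    return cmd


# Hypothetical paths and group name, for illustration only.
cmd = build_macs2_command(
    macs_path="/usr/bin/macs2",
    bed_path="T_cells.bed.gz",
    name="T_cells",
    outdir="macs_out",
    genome_size="hs",
)
```

Such a list could be handed to subprocess.run(cmd, check=True) on a machine where MACS2 is installed.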

pycisTopic.pseudobulk_peak_calling.peak_calling(macs_path: str, bed_paths: Dict, outdir: str, genome_size: str, n_cpu: int | None = 1, input_format: str | None = 'BEDPE', shift: int | None = 73, ext_size: int | None = 146, keep_dup: str | None = 'all', q_value: float | None = 0.05, nolambda: bool | None = True, skip_empty_peaks: bool = False, **kwargs)[source]

Performs pseudobulk peak calling with MACS2. Requires MACS2 to be installed (https://github.com/macs3-project/MACS).

Parameters

macs_path: str: Path to MACS binary (e.g. /xxx/MACS/xxx/bin/macs2).
bed_paths: dict: A dictionary containing group label as name and the path to their corresponding fragments bed file as value.
outdir: str: Path to the output directory.
genome_size: str: Effective genome size which is defined as the genome size which can be sequenced. Possible values: ‘hs’, ‘mm’, ‘ce’ and ‘dm’.
n_cpu: int, optional: Number of cores to use. Default: 1.
input_format: str, optional: Format of tag file; can be ELAND, BED, ELANDMULTI, ELANDEXPORT, SAM, BAM, BOWTIE, BAMPE, or BEDPE. MACS also accepts AUTO, which lets it decide the format automatically. Default: ‘BEDPE’.
shift: int, optional: To set an arbitrary shift in bp. For finding enriched cutting sites (such as in ATAC-seq) a shift of 73 bp is recommended. Default: 73.
ext_size: int, optional: To extend reads in 5’->3’ direction to fixed-size fragments. For ATAC-seq data, an extension of 146 bp is recommended. Default: 146.
keep_dup: str, optional: Whether to keep duplicate tags at the exact same location. Default: ‘all’.
q_value: float, optional: The q-value (minimum FDR) cutoff to call significant regions. Default: 0.05.
**kwargs: Additional parameters to pass to ray.init().

Iterative peak filtering

pycisTopic.iterative_peak_calling.calculate_peaks_and_extend(narrow_peaks: PyRanges, peak_half_width: int, chromsizes: DataFrame | PyRanges | None = None, path_to_blacklist: str | None = None)[source]

Extend peaks a number of base pairs in each direction from the summit.

Parameters

narrow_peaks: pr.PyRanges: A pr.PyRanges with the narrowPeak results from MACS2.
peak_half_width: int: Number of base pairs that each summit will be extended in each direction.
chromsizes: pr.PyRanges or pd.DataFrame: A data frame or pr.PyRanges containing the size of each chromosome, with ‘Chromosome’, ‘Start’ and ‘End’ columns.
path_to_blacklist: str, optional: Path to bed file containing blacklist regions (Amemiya et al., 2019). Default: None
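
The summit extension can be sketched as follows. This is a simplified stand-in, not the library code: peaks are plain dictionaries, the 'Summit' key is assumed to hold the narrowPeak summit offset relative to 'Start', chromsizes is a plain {chromosome: length} mapping, and dropping (rather than clipping) peaks that run off a chromosome end is an assumption. Blacklist filtering is left out.

```python
def extend_summits(narrow_peaks, peak_half_width, chromsizes=None):
    """Return fixed-width peaks centred on each summit."""
    extended = []
    for peak in narrow_peaks:
        summit = peak["Start"] + peak["Summit"]
        start = summit - peak_half_width
        end = summit + peak_half_width
        if chromsizes is not None:
            # Assumption: peaks extending past a chromosome end are dropped.
            if start < 0 or end > chromsizes[peak["Chromosome"]]:
                continue
        extended.append({**peak, "Start": start, "End": end})
    return extended


peaks = [
    {"Chromosome": "chr1", "Start": 100, "End": 400, "Summit": 50, "Score": 12.0},
    {"Chromosome": "chr1", "Start": 1000, "End": 1300, "Summit": 120, "Score": 30.0},
]
extended = extend_summits(peaks, peak_half_width=250, chromsizes={"chr1": 5000})
```

With peak_half_width=250 every surviving peak is exactly 500 bp wide; the first peak is dropped because its extended start would fall before position 0.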

pycisTopic.iterative_peak_calling.cpm(x: PyRanges, column: str)[source]

CPM (counts per million) normalization.

Parameters

x: pr.PyRanges: A PyRanges object.
column: str: Name of the column that has to be normalized
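
Counts-per-million scaling divides each value by the column total and multiplies by one million. A minimal sketch on a plain list (the library function operates on a PyRanges column instead, and the name cpm_sketch is hypothetical):

```python
def cpm_sketch(values):
    """Scale values so that they sum to one million."""
    total = sum(values)
    return [v * 1_000_000 / total for v in values]


scores = [10, 30, 60]
normalized = cpm_sketch(scores)
```

After scaling, scores from pseudobulks of different depth are on a comparable scale, which is the purpose of the normalization step before peaks from all groups are compared.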

pycisTopic.iterative_peak_calling.get_consensus_peaks(narrow_peaks_dict: Dict[str, PyRanges], peak_half_width: int, chromsizes: DataFrame | PyRanges | None = None, path_to_blacklist: str | None = None)[source]

Returns consensus peaks from a set of MACS narrow peak results. First, each summit is extended by peak_half_width in each direction, and then less significant peaks that overlap with a more significant one are iteratively filtered out. During this procedure peaks are merged and, depending on the number of peaks included in each merged region, different processes happen:

* 1 peak: The original peak region is kept.
* 2 peaks: The original peak region with the highest score is kept.
* 3 or more peaks: The original peak region with the most significant score is taken, and all the original peak regions in this merged peak region that overlap with the significant peak region are removed. The process is repeated with the next most significant peak (if it was not removed already) until all peaks are processed.

This process happens twice: first within the peaks of each pseudobulk, and then, after peak score normalization, on all peaks together.

This approach is described in Corces et al. 2018.

Parameters

narrow_peaks_dict: dict: A dictionary containing group labels as keys and pr.PyRanges with the narrowPeak results from MACS2 as values (as returned by pycisTopic.pseudobulk_peak_calling.peak_calling()).
peak_half_width: int: Number of base pairs that each summit will be extended in each direction.
chromsizes: pr.PyRanges or pd.DataFrame: A data frame or pr.PyRanges containing the size of each chromosome, with ‘Chromosome’, ‘Start’ and ‘End’ columns.
path_to_blacklist: str, optional: Path to bed file containing blacklist regions (Amemiya et al., 2019). Default: None

pycisTopic.iterative_peak_calling.iterative_peak_filtering(center_extended_peaks: PyRanges)[source]

Returns consensus peaks from a set of MACS narrow peak results. First, each summit is extended by peak_half_width in each direction, and then less significant peaks that overlap with a more significant one are iteratively filtered out. During this procedure, which is the part implemented in this function, peaks are merged and, depending on the number of peaks included in each merged region, different processes happen:

* 1 peak: The original peak region is kept.
* 2 peaks: The original peak region with the highest score is kept.
* 3 or more peaks: The original peak region with the most significant score is taken, and all the original peak regions in this merged peak region that overlap with the significant peak region are removed. The process is repeated with the next most significant peak (if it was not removed already) until all peaks are processed.

This process happens twice: first within the peaks of each pseudobulk, and then, after peak score normalization, on all peaks together.

This approach is described in Corces et al. 2018.

Parameters

center_extended_peaks: pr.PyRanges: A pr.PyRanges with all the peaks to be combined (and their MACS score), after centering and extending the peaks.
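
The iterative filtering idea can be written as a short greedy loop. The sketch below is a simplified illustration under stated assumptions, not the PyRanges-based implementation: peaks are plain dictionaries with a hypothetical 'Name' key, and the merge-then-resolve steps are collapsed into "keep the best remaining peak, discard everything that overlaps it, repeat".

```python
def overlaps(a, b):
    """True when two half-open intervals on the same chromosome overlap."""
    return (a["Chromosome"] == b["Chromosome"]
            and a["Start"] < b["End"] and b["Start"] < a["End"])


def iterative_filter(peaks):
    """Greedily keep the most significant peak and drop peaks overlapping it."""
    remaining = sorted(peaks, key=lambda p: p["Score"], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [p for p in remaining if not overlaps(best, p)]
    return sorted(kept, key=lambda p: (p["Chromosome"], p["Start"]))


peaks = [
    {"Name": "A", "Chromosome": "chr1", "Start": 100, "End": 600, "Score": 5},
    {"Name": "B", "Chromosome": "chr1", "Start": 400, "End": 900, "Score": 9},
    {"Name": "C", "Chromosome": "chr1", "Start": 800, "End": 1300, "Score": 7},
    {"Name": "E", "Chromosome": "chr1", "Start": 1000, "End": 1500, "Score": 3},
    {"Name": "D", "Chromosome": "chr2", "Start": 100, "End": 600, "Score": 1},
]
consensus = iterative_filter(peaks)
kept_names = [p["Name"] for p in consensus]
```

Peak B wins its neighbourhood and removes both A and C. Peak E overlapped only C, so once C is gone E survives even though its score is low, which shows why the order of processing matters.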

Fragments

pycisTopic.fragments.create_pyranges_from_polars_df(bed_df_pl: DataFrame) → PyRanges[source]

Create PyRanges DataFrame from Polars DataFrame.

Parameters:

bed_df_pl: Polars DataFrame containing BED entries. This can also be a filtered Polars DataFrame with fragments or TSS annotation.

Returns:

PyRanges DataFrame.

See also

pycisTopic.fragments.filter_fragments_by_cb
pycisTopic.fragments.read_bed_to_polars_df
pycisTopic.fragments.read_fragments_to_polars_df
pycisTopic.gene_annotation.change_chromosome_source_in_bed

Examples

Read BED file to Polars DataFrame with pyarrow engine.

>>> bed_df_pl = read_bed_to_polars_df("test.bed", engine="pyarrow")

Create PyRanges object directly from Polars DataFrame.

>>> bed_df_pr = create_pyranges_from_polars_df(bed_df_pl=bed_df_pl)

pycisTopic.fragments.filter_fragments_by_cb(fragments_df_pl: DataFrame, cbs: Series | Sequence) → DataFrame[source]

Filter fragments by cell barcodes.

Parameters:

fragments_df_pl: Polars DataFrame with fragments.
cbs: List/Polars Series with Cell barcodes. See pycisTopic.fragments.get_cbs_passing_filter() for a way to get a filtered list of cell barcodes (selected_cbs variable).

Returns:

Polars DataFrame with fragments for the requested cell barcodes.

See also

pycisTopic.fragments.get_cbs_passing_filter
pycisTopic.fragments.read_barcodes_file_to_polars_series

Examples

Read gzipped fragments BED file to a Polars DataFrame.

>>> fragments_df_pl = read_fragments_to_polars_df(
...     fragments_bed_filename="fragments.tsv.gz",
... )

List of cell barcodes for which to retain fragments.

>>> cbs = ["GGACATAAGGGCCACT-1", "ACCTTCATCTTTGAGA-1"]

Polars DataFrame with fragments for the requested cell barcodes.

>>> fragments_cb_filtered_df_pl = filter_fragments_by_cb(
...     fragments_df_pl=fragments_df_pl,
...     cbs=cbs,
... )

List of cell barcodes as a Polars categorical Series for which to retain fragments.

>>> cbs = pl.Series(
...     "CB",
...     ["GGACATAAGGGCCACT-1", "ACCTTCATCTTTGAGA-1"],
...     dtype=pl.Categorical,
... )

Read list of cell barcodes from a file.

>>> cbs = read_barcodes_file_to_polars_series("barcodes.tsv")

Polars DataFrame with fragments for the requested cell barcodes.

>>> fragments_cb_filtered_df_pl = filter_fragments_by_cb(
...     fragments_df_pl=fragments_df_pl,
...     cbs=cbs,
... )

pycisTopic.fragments.get_cbs_passing_filter(fragments_stats_per_cb_df_pl: pl.DataFrame, cbs: pl.Series | Sequence | None = None, min_fragments_per_cb: int | None = None, keep_top_x_cbs: int | None = None, collapse_duplicates: bool | None = True)[source]

Get cell barcodes passing the filter.

Parameters:

fragments_stats_per_cb_df_pl: Polars DataFrame with number of fragments and duplication ratio per cell barcode. See pycisTopic.fragments.get_fragments_per_cb().
cbs: Cell barcodes to keep. If specified, min_fragments_per_cb and keep_top_x_cbs are ignored.
min_fragments_per_cb: Minimum number of fragments needed per cell barcode to keep the cell barcode. Only used if cbs is None; keep_top_x_cbs will be ignored.
keep_top_x_cbs: Keep the x most abundant cell barcodes based on the number of fragments. Only used if cbs is None and min_fragments_per_cb is None.
collapse_duplicates: Collapse duplicate fragments (same chromosomal positions and linked to the same cell barcode).

Returns:

(Cell barcodes passing the filter, fragments_stats_per_cb_df_pl filtered by the cell barcodes passing the filter).

See also

pycisTopic.fragments.filter_fragments_by_cb
pycisTopic.fragments.get_fragments_per_cb

Examples

Read gzipped fragments BED file to a Polars DataFrame.

>>> fragments_df_pl = read_fragments_to_polars_df(
...     fragments_bed_filename="fragments.tsv.gz",
... )

Get number of fragments and duplication ratio per cell barcode (which have 10 fragments or more after collapsing duplicates).

>>> fragments_stats_per_cb_df_pl = get_fragments_per_cb(
...     fragments_df_pl=fragments_df_pl,
...     min_fragments_per_cb=10,
...     collapse_duplicates=True,
... )

Keep only cell barcodes which have 1000 or more fragments.

>>> cbs_selected, fragments_stats_per_cb_filtered_df_pl = get_cbs_passing_filter(
...     fragments_stats_per_cb_df_pl=fragments_stats_per_cb_df_pl,
...     min_fragments_per_cb=1000,
...     collapse_duplicates=True,
... )

Keep only the 4000 most abundant cell barcodes based on the number of fragments after collapsing duplicates.

>>> cbs_selected, fragments_stats_per_cb_filtered_df_pl = get_cbs_passing_filter(
...     fragments_stats_per_cb_df_pl=fragments_stats_per_cb_df_pl,
...     keep_top_x_cbs=4000,
...     collapse_duplicates=True,
... )
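
The precedence between cbs, min_fragments_per_cb and keep_top_x_cbs described in the parameter list can be summarised in a few lines of plain Python. This is a sketch of the documented precedence only: the function name select_cbs is hypothetical, the input is a plain {cell_barcode: fragment_count} dictionary rather than a Polars DataFrame, and returning every barcode when no option is set is an assumption.

```python
def select_cbs(fragments_per_cb, cbs=None, min_fragments_per_cb=None,
               keep_top_x_cbs=None):
    """Return (selected barcodes, counts restricted to those barcodes)."""
    if cbs is not None:
        # An explicit barcode list wins over both other options.
        selected = [cb for cb in cbs if cb in fragments_per_cb]
    elif min_fragments_per_cb is not None:
        selected = [cb for cb, n in fragments_per_cb.items()
                    if n >= min_fragments_per_cb]
    elif keep_top_x_cbs is not None:
        ranked = sorted(fragments_per_cb, key=fragments_per_cb.get, reverse=True)
        selected = ranked[:keep_top_x_cbs]
    else:
        # Assumption: with no filter given, all barcodes are kept.
        selected = list(fragments_per_cb)
    return selected, {cb: fragments_per_cb[cb] for cb in selected}


stats = {"A": 1500, "B": 900, "C": 4000}
by_threshold, _ = select_cbs(stats, min_fragments_per_cb=1000)
top_one, _ = select_cbs(stats, keep_top_x_cbs=1)
explicit, _ = select_cbs(stats, cbs=["B"], min_fragments_per_cb=1000)
```

In the last call the explicit list is honoured even though barcode B is below the threshold, mirroring the rule that the other two options are ignored when cbs is given.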

pycisTopic.fragments.get_fragments_in_peaks(fragments_df_pl: DataFrame, regions_df_pl: DataFrame) → DataFrame[source]

Get number of total and unique fragments in peaks.

Parameters:

fragments_df_pl: Polars DataFrame with fragments.
regions_df_pl: Polars DataFrame with peak regions (consensus peaks or SCREEN regions). See pycisTopic.fragments.read_bed_to_polars_df() for a way to read a BED file with peak regions.

Returns:

Polars DataFrame with total fragment counts and unique fragment counts per region.

See also

pycisTopic.fragments.filter_fragments_by_cb

Examples

As input get a Polars DataFrame with fragments for the cell barcodes of interest. See pycisTopic.fragments.filter_fragments_by_cb

>>> fragments_cb_filtered_df_pl = filter_fragments_by_cb(
...     fragments_df_pl=fragments_df_pl,
...     cbs=cbs,
... )

Read BED file with consensus peaks or SCREEN regions (get first 3 columns only).

>>> regions_df_pl = read_bed_to_polars_df(
...     bed_filename=screen_regions_bed_filename,
...     min_column_count=3,
... )

Polars DataFrame with number of total and unique fragments in peaks.

>>> fragments_in_peaks_df_pl = get_fragments_in_peaks(
...     fragments_df_pl=fragments_cb_filtered_df_pl,
...     regions_df_pl=regions_df_pl,
... )
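
Conceptually, the counting is an interval-overlap tally. The sketch below follows the Returns line above and tallies per region; it is an illustration only, with a hypothetical function name and tuple layout, and it uses a simple quadratic scan rather than the joins a real implementation would use. "Total" sums the Score of overlapping fragments, "unique" counts each overlapping fragment row once.

```python
def fragments_in_peaks_sketch(fragments, regions):
    """fragments: (chrom, start, end, cb, score); regions: (chrom, start, end)."""
    result = {}
    for r_chrom, r_start, r_end in regions:
        total = unique = 0
        for chrom, start, end, _cb, score in fragments:
            # Half-open interval overlap on the same chromosome.
            if chrom == r_chrom and start < r_end and r_start < end:
                total += score
                unique += 1
        result[(r_chrom, r_start, r_end)] = {"total": total, "unique": unique}
    return result


regions = [("chr1", 100, 200), ("chr1", 500, 600)]
fragments = [
    ("chr1", 150, 260, "A", 2),
    ("chr1", 190, 210, "B", 1),
    ("chr1", 580, 700, "A", 1),
    ("chr2", 100, 200, "A", 5),
]
in_peaks = fragments_in_peaks_sketch(fragments, regions)
```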

pycisTopic.fragments.get_fragments_per_cb(fragments_df_pl: DataFrame, min_fragments_per_cb: int = 10, collapse_duplicates: bool | None = True) → DataFrame[source]

Get number of fragments and duplication ratio per cell barcode.

Parameters:

fragments_df_pl: Polars DataFrame with fragments. See pycisTopic.fragments.read_fragments_to_polars_df().
min_fragments_per_cb: Minimum number of fragments needed per cell barcode to keep the fragments for those cell barcodes.
collapse_duplicates: Collapse duplicate fragments (same chromosomal positions and linked to the same cell barcode).

Returns:

Polars DataFrame with number of fragments and duplication ratio per cell barcode.

See also

pycisTopic.fragments.read_fragments_to_polars_df

Examples

Read gzipped fragments BED file to a Polars DataFrame.

>>> fragments_df_pl = read_fragments_to_polars_df(
...     fragments_bed_filename="fragments.tsv.gz",
... )

Get number of fragments and duplication ratio per cell barcode (which have 10 fragments or more after collapsing duplicates).

>>> fragments_stats_per_cb_df_pl = get_fragments_per_cb(
...     fragments_df_pl=fragments_df_pl,
...     min_fragments_per_cb=10,
...     collapse_duplicates=True,
... )
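
A pure-Python sketch of the per-barcode statistics is given below. The names are hypothetical and the definitions are assumptions: each input row is taken to be a distinct fragment whose Score says how often it was seen, and the duplication ratio is computed as (total - unique) / total, which may differ from the library's exact definition.

```python
from collections import Counter


def fragments_per_cb_sketch(fragments, min_fragments_per_cb=10,
                            collapse_duplicates=True):
    """fragments: iterable of (chrom, start, end, cb, score)."""
    total = Counter()
    unique = Counter()
    for _chrom, _start, _end, cb, score in fragments:
        total[cb] += score   # every observation, duplicates included
        unique[cb] += 1      # each distinct fragment once
    stats = {}
    for cb in total:
        counted = unique[cb] if collapse_duplicates else total[cb]
        if counted >= min_fragments_per_cb:
            stats[cb] = {
                "total": total[cb],
                "unique": unique[cb],
                "duplication_ratio": (total[cb] - unique[cb]) / total[cb],
            }
    return stats


fragments = [
    ("chr1", 10, 60, "A", 3),
    ("chr1", 100, 180, "A", 1),
    ("chr2", 5, 40, "B", 1),
]
stats = fragments_per_cb_sketch(fragments, min_fragments_per_cb=2)
```

Barcode B has a single fragment and falls below the threshold, so only barcode A is reported.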

pycisTopic.fragments.get_insert_size_distribution(fragments_df_pl: DataFrame) → DataFrame[source]

Get insert size distribution of fragments.

Parameters:

fragments_df_pl: Polars DataFrame with fragments, for example already filtered to the cell barcodes of interest with pycisTopic.fragments.filter_fragments_by_cb().

Returns:

Polars DataFrame with fragment counts and fragment ratios for each found insert size.

See also

pycisTopic.fragments.filter_fragments_by_cb

Examples

As input get a Polars DataFrame with fragments for the cell barcodes of interest. See pycisTopic.fragments.filter_fragments_by_cb

>>> fragments_cb_filtered_df_pl = filter_fragments_by_cb(
...     fragments_df_pl=fragments_df_pl,
...     cbs=cbs,
... )

Polars DataFrame with insert size distribution of fragments.

>>> insert_size_dist_df_pl = get_insert_size_distribution(
...     fragments_df_pl=fragments_cb_filtered_df_pl,
... )
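
The insert size of a fragment is its end minus its start in BED coordinates. The following sketch tabulates counts and ratios per insert size on plain tuples; the function name is hypothetical, and counting each row once (rather than weighting by a Score column) is an assumption.

```python
from collections import Counter


def insert_size_distribution_sketch(fragments):
    """fragments: iterable of (chrom, start, end); returns (size, count, ratio) rows."""
    sizes = Counter(end - start for _chrom, start, end in fragments)
    n_fragments = sum(sizes.values())
    return [(size, count, count / n_fragments)
            for size, count in sorted(sizes.items())]


fragments = [
    ("chr1", 100, 250),
    ("chr1", 300, 450),
    ("chr2", 10, 90),
    ("chr2", 0, 150),
]
distribution = insert_size_distribution_sketch(fragments)
```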

pycisTopic.fragments.read_barcodes_file_to_polars_series(barcodes_tsv_filename: str) → Series[source]

Read barcode TSV file to a Polars Series.

Parameters:

barcodes_tsv_filename: TSV file with CBs.

Returns:

Polars Series with CBs.

See also

pycisTopic.fragments.filter_fragments_by_cb

Examples

Read gzipped barcodes TSV file to a Polars Series.

>>> cbs = read_barcodes_file_to_polars_series(
...     barcodes_tsv_filename="barcodes.tsv.gz",
... )

Read uncompressed barcodes TSV file to a Polars Series.

>>> cbs = read_barcodes_file_to_polars_series(
...     barcodes_tsv_filename="barcodes.tsv",
... )

pycisTopic.fragments.read_bed_to_polars_df(bed_filename: str, engine: str | Literal['polars'] | Literal['pyarrow'] = 'pyarrow', min_column_count: int = 3) → DataFrame[source]

Read BED file to a Polars DataFrame.

Parameters:

bed_filename: BED filename.
engine: Use Polars or pyarrow to read the BED file (default: pyarrow).
min_column_count: Minimum number of required columns needed in BED file.

Returns:

Polars DataFrame with BED entries.

See also

pycisTopic.fragments.read_fragments_to_polars_df

Examples

Read BED file to Polars DataFrame with pyarrow engine.

>>> bed_df_pl = read_bed_to_polars_df("test.bed", engine="pyarrow")

Read BED file to Polars DataFrame with pyarrow engine and require that the BED file has at least 4 columns.

>>> bed_with_at_least_4_columns_df_pl = read_bed_to_polars_df(
...     "test.bed",
...     engine="pyarrow",
...     min_column_count=4,
... )
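
What a minimum column count check amounts to can be shown with a small stand-alone reader. This is only a sketch under assumptions: it reads tab-separated text from any file-like object, skips comment and track lines, and raises ValueError on short lines. The name read_bed_rows_sketch and the error behaviour are hypothetical and may differ from read_bed_to_polars_df.

```python
import io


def read_bed_rows_sketch(handle, min_column_count=3):
    """Parse BED lines and require at least min_column_count columns."""
    if min_column_count < 3:
        raise ValueError("a BED entry needs at least 3 columns")
    rows = []
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        # Skip empty lines and the usual non-data lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        columns = line.split("\t")
        if len(columns) < min_column_count:
            raise ValueError(
                f"line {line_number}: expected at least {min_column_count} "
                f"columns, found {len(columns)}"
            )
        # Start and End are integer coordinates; other columns stay strings.
        columns[1] = int(columns[1])
        columns[2] = int(columns[2])
        rows.append(columns)
    return rows


bed_text = "# comment\nchr1\t100\t200\nchr2\t5\t80\n"
bed_rows = read_bed_rows_sketch(io.StringIO(bed_text), min_column_count=3)
```

Calling the same function with min_column_count=4 on this three-column input raises ValueError.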

pycisTopic.fragments.read_fragments_to_polars_df(fragments_bed_filename: str, engine: str | Literal['polars'] | Literal['pyarrow'] = 'pyarrow') → DataFrame[source]

Read fragments BED file to a Polars DataFrame.

If the fragments don’t have a Score column, a Score column is created by counting the number of fragments with the same chromosome, start, end and CB.

Parameters:

fragments_bed_filename: Fragments BED filename.
engine: Use Polars or pyarrow to read the fragments BED file (default: pyarrow).

Returns:

Polars DataFrame with fragments.

See also

pycisTopic.fragments.read_bed_to_polars_df

Examples

Read gzipped fragments BED file to a Polars DataFrame.

>>> fragments_df_pl = read_fragments_to_polars_df(
...     fragments_bed_filename="fragments.tsv.gz",
... )

Read uncompressed fragments BED file to a Polars DataFrame.

>>> fragments_df_pl = read_fragments_to_polars_df(
...     fragments_bed_filename="fragments.tsv",
... )
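
The Score column rule described above (count fragments that share chromosome, start, end and CB) can be sketched in a few lines of plain Python. The helper name add_score_column_sketch and the tuple layout are hypothetical, and the real function works on a Polars DataFrame rather than on tuples.

```python
from collections import Counter


def add_score_column_sketch(fragments):
    """Collapse identical (chrom, start, end, cb) rows and append their count."""
    counts = Counter(fragments)  # keeps first-seen order of distinct fragments
    return [
        (chrom, start, end, cb, score)
        for (chrom, start, end, cb), score in counts.items()
    ]


fragments_without_score = [
    ("chr1", 100, 250, "CB_A"),
    ("chr1", 100, 250, "CB_A"),  # exact duplicate of the previous row
    ("chr1", 100, 250, "CB_B"),  # same coordinates, different barcode
    ("chr2", 5, 80, "CB_A"),
]
fragments_with_score = add_score_column_sketch(fragments_without_score)
```

Only rows that agree on all four fields are merged, so the CB_B fragment keeps its own row with a score of 1.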

pycisTopic.fragments.read_fragments_to_pyranges(fragments_bed_filename: str, engine: str | Literal['polars'] | Literal['pyarrow'] | Literal['pandas'] = 'pyarrow') → PyRanges[source]

Read fragments BED file to PyRanges object.

Parameters:

fragments_bed_filename: Fragments BED filename.
engine: Use Polars, pyarrow or pandas to read the fragments BED file (default: pyarrow).

Returns:

PyRanges object with fragments.

Examples

Read fragments BED file to PyRanges object with pyarrow engine.

>>> bed_pr = read_fragments_to_pyranges("test.bed", engine="pyarrow")

Gene annotation

pycisTopic.gene_annotation.change_chromosome_source_in_bed(chrom_sizes_and_alias_df_pl: DataFrame, bed_df_pl: DataFrame, from_chrom_source_name: str, to_chrom_source_name: str) → DataFrame[source]

Change chromosome names in a Polars DataFrame with BED entries from one chromosome source to another.

Parameters:

chrom_sizes_and_alias_df_pl: Polars DataFrame with chromosome sizes and alias mapping. See pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_file(), pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ncbi() and pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ucsc().
bed_df_pl: Polars DataFrame with BED entries for which chromosome names need to be remapped from from_chrom_source_name to to_chrom_source_name. See pycisTopic.fragments.read_bed_to_polars_df() and pycisTopic.gene_annotation.read_tss_annotation_from_bed().
from_chrom_source_name: Current chromosome source name for the input BED file: ucsc, ensembl, genbank or refseq. Can be guessed with pycisTopic.gene_annotation.find_most_likely_chromosome_source_in_bed().
to_chrom_source_name: Chromosome source name to which the output Polars DataFrame with BED entries should be mapped: ucsc, ensembl, genbank or refseq.

Returns:

Polars DataFrame with BED entries with changed chromosome names.

See also

pycisTopic.fragments.read_bed_to_polars_df
pycisTopic.gene_annotation.find_most_likely_chromosome_source_in_bed
pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_file
pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ncbi
pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ucsc
pycisTopic.gene_annotation.read_tss_annotation_from_bed
pycisTopic.gene_annotation.write_tss_annotation_to_bed

Examples

Get chromosome sizes and alias mapping for hg38.

>>> chrom_sizes_and_alias_hg38_df_pl = get_chrom_sizes_and_alias_mapping_from_ucsc(ucsc_assembly="hg38")

Get gene annotation for hg38 from Ensembl BioMart.

>>> hg38_tss_annotation_bed_df_pl = get_tss_annotation_from_ensembl(
...     biomart_name="hsapiens_gene_ensembl",
... )
>>> hg38_tss_annotation_bed_df_pl

Replace Ensembl chromosome names with UCSC chromosome names in gene annotation for hg38.

>>> hg38_tss_annotation_ucsc_chroms_bed_df_pl = change_chromosome_source_in_bed(
...     chrom_sizes_and_alias_df_pl=chrom_sizes_and_alias_hg38_df_pl,
...     bed_df_pl=hg38_tss_annotation_bed_df_pl,
...     from_chrom_source_name="ensembl",
...     to_chrom_source_name="ucsc",
... )
>>> hg38_tss_annotation_ucsc_chroms_bed_df_pl
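
Conceptually, the remapping is a lookup from one naming scheme to another. The sketch below assumes the alias table can be viewed as rows keyed by the source names ucsc, ensembl, genbank and refseq, and that BED entries on chromosomes without a mapping are dropped. Both points, as well as the function name, are assumptions made for illustration.

```python
def change_chromosome_source_sketch(alias_rows, bed_rows, from_source, to_source):
    """Rename the chromosome (first field) of each BED row via an alias table."""
    mapping = {
        row[from_source]: row[to_source]
        for row in alias_rows
        if row.get(from_source) and row.get(to_source)
    }
    # Assumption: entries whose chromosome has no alias are left out.
    return [
        (mapping[chrom], *rest)
        for chrom, *rest in bed_rows
        if chrom in mapping
    ]


alias_rows = [
    {"ucsc": "chr1", "ensembl": "1", "genbank": "CM000663.2", "refseq": "NC_000001.11"},
    {"ucsc": "chrM", "ensembl": "MT", "genbank": "J01415.2", "refseq": "NC_012920.1"},
]
ensembl_bed_rows = [
    ("1", 999, 1000, "geneA", ".", "+"),
    ("MT", 49, 50, "geneB", ".", "-"),
    ("unplaced_scaffold", 0, 1, "geneC", ".", "+"),
]
ucsc_bed_rows = change_chromosome_source_sketch(
    alias_rows, ensembl_bed_rows, from_source="ensembl", to_source="ucsc"
)
```

The gene names and coordinates are made up; only the chromosome field changes, and the unmapped scaffold is removed.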

pycisTopic.gene_annotation.find_most_likely_chromosome_source_in_bed(chrom_sizes_and_alias_df_pl: pl.DataFrame, bed_df_pl: pl.DataFrame)[source]

Find which chromosome source is the most likely in the provided BED file entries.

Find which chromosome source (UCSC, Ensembl, GenBank or RefSeq) from the chrom_sizes_and_alias_df_pl Polars DataFrame best matches the chromosome names in the provided Polars DataFrame with BED entries.

Parameters:

chrom_sizes_and_alias_df_pl: Polars DataFrame with chromosome sizes and alias mapping. See pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_file(), pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ncbi() and pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ucsc().
bed_df_pl: Polars DataFrame with BED entries. See pycisTopic.fragments.read_bed_to_polars_df().

Returns:

Tuple of most likely chromosome source and a Polars DataFrame with the ranking of all possible chromosome sources.

See also

pycisTopic.fragments.read_bed_to_polars_df
pycisTopic.gene_annotation.change_chromosome_source_in_bed
pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_file
pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ncbi
pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ucsc

Examples

>>> chrom_sizes_and_alias_hg38_df_pl = get_chrom_sizes_and_alias_mapping_from_ucsc(ucsc_assembly="hg38")
>>> bed_df_pl = read_bed_to_polars_df("test.bed", engine="pyarrow")
>>> best_chrom_source_name, chrom_source_stats_df_pl = find_most_likely_chromosome_source_in_bed(
...     chrom_sizes_and_alias_df_pl=chrom_sizes_and_alias_hg38_df_pl,
...     bed_df_pl=bed_df_pl,
... )
>>> print(best_chrom_source_name, chrom_source_stats_df_pl)
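
One plausible way to rank chromosome sources is to count how many distinct chromosome names in the BED entries are known under each naming scheme. The sketch below does exactly that. The scoring rule, the row layout of the alias table and the function name are assumptions and are not taken from the pycisTopic implementation.

```python
def find_most_likely_source_sketch(alias_rows, bed_chromosomes):
    """Return (best_source, ranking), ranking sources by matched chromosome names."""
    observed = set(bed_chromosomes)
    sources = ("ucsc", "ensembl", "genbank", "refseq")
    ranking = sorted(
        (
            (source, len(observed & {row[source] for row in alias_rows if row.get(source)}))
            for source in sources
        ),
        key=lambda item: item[1],
        reverse=True,  # sorted() is stable, so ties keep the order of `sources`
    )
    return ranking[0][0], ranking


alias_rows = [
    {"ucsc": "chr1", "ensembl": "1", "genbank": "CM000663.2", "refseq": "NC_000001.11"},
    {"ucsc": "chrM", "ensembl": "MT", "genbank": "J01415.2", "refseq": "NC_012920.1"},
]
best_source, source_ranking = find_most_likely_source_sketch(
    alias_rows, bed_chromosomes=["1", "1", "MT", "2"]
)
```

Here two of the three distinct names ("1" and "MT") match the Ensembl column, so Ensembl ranks first.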

pycisTopic.gene_annotation.get_all_gene_annotation_ensembl_biomart_dataset_names(biomart_host: str = 'http://www.ensembl.org', use_cache: bool = True) → pd.DataFrame[source]

Get all available gene annotation Ensembl BioMart dataset names.

Parameters:

biomart_host

BioMart host URL to use.

Default: http://www.ensembl.org
Archived Ensembl BioMart URLs: https://www.ensembl.org/info/website/archives/index.html (List of currently available archives)

use_cache

Whether to cache requests to Ensembl BioMart server.

Returns:

Pandas dataframe with all available gene annotation Ensembl BioMart datasets.

See also

pycisTopic.gene_annotation.get_biomart_dataset_name_for_species

Examples

>>> biomart_latest_datasets = get_all_gene_annotation_ensembl_biomart_dataset_names(
...     biomart_host="http://www.ensembl.org",
... )
>>> biomart_jul2022_datasets = get_all_gene_annotation_ensembl_biomart_dataset_names(
...     biomart_host="http://jul2022.archive.ensembl.org/",
... )

pycisTopic.gene_annotation.get_biomart_dataset_name_for_species(biomart_datasets: pd.DataFrame, species: str) → pd.DataFrame[source]

Get gene annotation Ensembl BioMart dataset names for species of interest.

Parameters:

biomart_datasets: All gene annotation Ensembl BioMart datasets. See pycisTopic.gene_annotation.get_all_gene_annotation_ensembl_biomart_dataset_names().
species: Species name to search for.

Returns:

Filtered Pandas dataframe with gene annotation Ensembl BioMart dataset names for the species of interest.

See also

pycisTopic.gene_annotation.get_all_gene_annotation_ensembl_biomart_dataset_names

pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_file(chrom_sizes_and_alias_tsv_filename: str | Path) → DataFrame[source]

Get chromosome sizes and alias mapping from a chromosome alias TSV file.

Get chromosome sizes and alias mapping from a chromosome alias TSV file to map chromosome names between UCSC, Ensembl, GenBank and RefSeq chromosome names.

Parameters:

chrom_sizes_and_alias_tsv_filename:

Chromosome alias TSV files created with:

get_chrom_sizes_and_alias_mapping_from_ncbi
get_chrom_sizes_and_alias_mapping_from_ucsc

Returns:

Polars DataFrame with chromosome sizes and alias mapping between UCSC, Ensembl, GenBank and RefSeq chromosome names.

See also

pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ncbi
pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ucsc

Examples

Get chromosome sizes and alias mapping for hg38 from a previously written TSV file:

>>> chrom_sizes_and_alias_hg38_from_file_df_pl = get_chrom_sizes_and_alias_mapping_from_file(
...     chrom_sizes_and_alias_tsv_filename="hg38.chrom_sizes_and_alias.tsv",
... )

pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ncbi(accession_id: str, chrom_sizes_and_alias_tsv_filename: str | Path | None = None) → DataFrame[source]

Get chromosome sizes and alias mapping from NCBI sequence reports.

Get chromosome sizes and alias mapping from NCBI sequence reports to be able to map chromosome names between UCSC, Ensembl, GenBank and RefSeq chromosome names or read mapping from local file (chrom_sizes_and_alias_tsv_filename) instead.

Parameters:

accession_id: NCBI assembly accession ID.
chrom_sizes_and_alias_tsv_filename: If specified, write the chromosome sizes and alias mapping to the specified file.

Returns:

Polars DataFrame with chromosome alias mapping between UCSC, Ensembl, GenBank and RefSeq chromosome names.

See also

pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_file
pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ucsc
pycisTopic.gene_annotation.get_ncbi_assembly_accessions_for_species

Examples

Get chromosome sizes and alias mapping for different assemblies from NCBI.

Assembly accession IDs for a species can be queried with pycisTopic.gene_annotation.get_ncbi_assembly_accessions_for_species().

>>> chrom_sizes_and_alias_hg38_df_pl = get_chrom_sizes_and_alias_mapping_from_ncbi(
...     accession_id="GCF_000001405.40"
... )
>>> chrom_sizes_and_alias_mm10_df_pl = get_chrom_sizes_and_alias_mapping_from_ncbi(
...     accession_id="GCF_000001635.26"
... )
>>> chrom_sizes_and_alias_dm6_df_pl = get_chrom_sizes_and_alias_mapping_from_ncbi(
...     accession_id="GCF_000001215.4"
... )

Get chromosome sizes and alias mapping for Homo sapiens and also write it to a TSV file:

>>> chrom_sizes_and_alias_hg38_df_pl = get_chrom_sizes_and_alias_mapping_from_ncbi(
...     accession_id="GCF_000001405.40",
...     chrom_sizes_and_alias_tsv_filename="GCF_000001405.40.chrom_sizes_and_alias.tsv",
... )

pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ucsc(ucsc_assembly: str, chrom_sizes_and_alias_tsv_filename: str | Path | None = None) → DataFrame[source]

Get chromosome sizes and alias mapping from UCSC genome browser.

Get chromosome sizes and alias mapping from UCSC genome browser for UCSC assembly to be able to map chromosome names between UCSC, Ensembl, GenBank and RefSeq chromosome names or read mapping from local file (chrom_sizes_and_alias_tsv_filename) instead.

Parameters:

ucsc_assembly: UCSC assembly names (hg38, mm10, dm6, …).
chrom_sizes_and_alias_tsv_filename: If specified, write the chromosome sizes and alias mapping to the specified file.

Returns:

Polars DataFrame with chromosome sizes and alias mapping between UCSC, Ensembl, GenBank and RefSeq chromosome names.

See also

pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_file
pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ncbi

Examples

Get chromosome sizes and aliases for different assemblies from UCSC:

>>> chrom_sizes_and_alias_hg38_df_pl = get_chrom_sizes_and_alias_mapping_from_ucsc(
...     ucsc_assembly="hg38"
... )
>>> chrom_sizes_and_alias_mm10_df_pl = get_chrom_sizes_and_alias_mapping_from_ucsc(
...     ucsc_assembly="mm10"
... )
>>> chrom_sizes_and_alias_dm6_df_pl = get_chrom_sizes_and_alias_mapping_from_ucsc(
...     ucsc_assembly="dm6"
... )

Get chromosome sizes and aliases for hg38 and also write it to a TSV file:

>>> chrom_sizes_and_alias_hg38_df_pl = get_chrom_sizes_and_alias_mapping_from_ucsc(
...     ucsc_assembly="hg38",
...     chrom_sizes_and_alias_tsv_filename="hg38.chrom_sizes_and_alias.tsv",
... )

pycisTopic.gene_annotation.get_ncbi_assembly_accessions_for_species(species: str) → str[source]

Get NCBI assembly accession numbers and assembly names for a certain species.

Parameters:

species: Species name (Latin name) for which to look for NCBI assembly accession numbers.

Returns:

String with NCBI assembly accession number and assembly name.

See also

pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ncbi

Examples

>>> print(get_ncbi_assembly_accessions_for_species("homo sapiens"))
accession   assembly_name
GCF_000001405.40    GRCh38.p14
GCF_000001405.25    GRCh37.p13
GCF_000001405.26    GRCh38
GCF_000001405.27    GRCh38.p1
GCF_000001405.28    GRCh38.p2
GCF_000001405.29    GRCh38.p3
GCF_000001405.30    GRCh38.p4
GCF_000001405.31    GRCh38.p5
GCF_000001405.32    GRCh38.p6
GCF_000001405.33    GRCh38.p7
GCF_000001405.34    GRCh38.p8
GCF_000001405.35    GRCh38.p9
GCF_000001405.36    GRCh38.p10
GCF_000001405.37    GRCh38.p11
GCF_000001405.38    GRCh38.p12
GCF_000001405.39    GRCh38.p13
GCF_000002125.1     HuRef
GCF_000306695.2     CHM1_1.1
GCF_009914755.1     T2T-CHM13v2.0
>>> print(get_ncbi_assembly_accessions_for_species("drosophila melanogaster"))
accession   assembly_name
GCF_000001215.4     Release 6 plus ISO1 MT
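
Because the function returns a single string, it usually has to be parsed before an accession can be passed on. The sketch below assumes the string has one header line followed by one whitespace- or tab-separated "accession, assembly name" pair per line, as in the printed output above. The helper name parse_assembly_accessions_sketch is hypothetical.

```python
def parse_assembly_accessions_sketch(text):
    """Map assembly name -> accession from the printed two-column listing."""
    lines = [line for line in text.splitlines() if line.strip()]
    accessions = {}
    for line in lines[1:]:  # skip the "accession assembly_name" header
        # Split once only: assembly names may themselves contain spaces.
        accession, assembly_name = line.split(maxsplit=1)
        accessions[assembly_name.strip()] = accession
    return accessions


human_listing = (
    "accession\tassembly_name\n"
    "GCF_000001405.40\tGRCh38.p14\n"
    "GCF_000001405.25\tGRCh37.p13\n"
)
fly_listing = "accession\tassembly_name\nGCF_000001215.4\tRelease 6 plus ISO1 MT\n"

human_accessions = parse_assembly_accessions_sketch(human_listing)
fly_accessions = parse_assembly_accessions_sketch(fly_listing)
```

The resulting dictionary makes it easy to look up, for example, the accession for GRCh38.p14 before requesting its chromosome sizes and aliases from NCBI.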

pycisTopic.gene_annotation.get_tss_annotation_from_ensembl(biomart_name: str, biomart_host: str = 'http://www.ensembl.org', transcript_type: Sequence[str] | None = ['protein_coding'], use_cache: bool = True) → DataFrame[source]

Get TSS annotation for requested transcript types from Ensembl BioMart.

Parameters:

biomart_name

Ensembl BioMart ID of the dataset. See pycisTopic.gene_annotation.get_biomart_dataset_name_for_species() to get the biomart_name for species of interest: e.g.: hsapiens_gene_ensembl, mmusculus_gene_ensembl, dmelanogaster_gene_ensembl, …

biomart_host

BioMart host URL to use.

Default: http://www.ensembl.org
Archived Ensembl BioMart URLs: https://www.ensembl.org/info/website/archives/index.html (List of currently available archives)

transcript_type

Only keep list of specified transcript types (e.g.: ["protein_coding"]) or all (None).

use_cache

Whether to cache requests to Ensembl BioMart server.

Returns:

Polars DataFrame with TSS positions in BED format.

See also

pycisTopic.gene_annotation.get_biomart_dataset_name_for_species
pycisTopic.gene_annotation.read_tss_annotation_from_bed
pycisTopic.gene_annotation.write_tss_annotation_to_bed

Examples

>>> tss_annotation_bed_df_pl = get_tss_annotation_from_ensembl(
...     biomart_name="hsapiens_gene_ensembl"
... )
>>> tss_annotation_jul2022_bed_df_pl = get_tss_annotation_from_ensembl(
...     biomart_name="hsapiens_gene_ensembl",
...     biomart_host="http://jul2022.archive.ensembl.org/",
... )

pycisTopic.gene_annotation.read_tss_annotation_from_bed(tss_annotation_bed_filename: str) → DataFrame[source]

Read TSS annotation BED file to Polars DataFrame.

Read TSS annotation BED file created by pycisTopic.gene_annotation.get_tss_annotation_from_ensembl() and pycisTopic.gene_annotation.write_tss_annotation_to_bed() to Polars DataFrame with TSS positions in BED format.

Parameters:

tss_annotation_bed_filename

TSS annotation BED file to read. TSS annotation BED files can be written with pycisTopic.gene_annotation.write_tss_annotation_to_bed() and will have the following header line:

# Chromosome Start End Gene Score Strand Transcript_type

Minimum required columns for pycisTopic.tss_profile.get_tss_profile(): Chromosome, Start (0-based BED), Strand

Returns:

Polars DataFrame with TSS positions in BED format.

See also

pycisTopic.gene_annotation.change_chromosome_source_in_bed
pycisTopic.gene_annotation.get_tss_annotation_from_ensembl
pycisTopic.gene_annotation.write_tss_annotation_to_bed

Examples

Get TSS annotation from Ensembl.

>>> tss_annotation_bed_df_pl = get_tss_annotation_from_ensembl(
...     biomart_name="hsapiens_gene_ensembl"
... )

If your fragments files use a different chromosome convention than the one used by Ensembl, take a look at pycisTopic.gene_annotation.change_chromosome_source_in_bed() to convert the Ensembl chromosome names to UCSC, Ensembl, GenBank or RefSeq chromosome names.

Write TSS annotation to a file.

>>> write_tss_annotation_to_bed(
...     tss_annotation_bed_df_pl=tss_annotation_bed_df_pl,
...     tss_annotation_bed_filename="hg38.tss.bed",
... )

Read TSS annotation from a file.

>>> tss_annotation_bed_df_pl = read_tss_annotation_from_bed(
...     tss_annotation_bed_filename="hg38.tss.bed"
... )

pycisTopic.gene_annotation.write_tss_annotation_to_bed(tss_annotation_bed_df_pl, tss_annotation_bed_filename: str) → None[source]

Write TSS annotation Polars DataFrame to a BED file.

Write TSS annotation Polars DataFrame with TSS positions in BED format to a BED file.

Parameters:

tss_annotation_bed_df_pl

TSS annotation Polars DataFrame with TSS positions in BED format created with pycisTopic.gene_annotation.get_tss_annotation_from_ensembl().

tss_annotation_bed_filename

TSS annotation BED file to write to. TSS annotation BED files from pycisTopic.gene_annotation.get_tss_annotation_from_ensembl() will have the following header line:

# Chromosome Start End Gene Score Strand Transcript_type

Minimum required columns for pycisTopic.tss_profile.get_tss_profile(): Chromosome, Start (0-based BED), Strand

Returns:

None. The TSS annotation is written to tss_annotation_bed_filename.

See also

pycisTopic.gene_annotation.change_chromosome_source_in_bed
pycisTopic.gene_annotation.get_tss_annotation_from_ensembl
pycisTopic.gene_annotation.read_tss_annotation_from_bed

Examples

Get TSS annotation from Ensembl.

>>> tss_annotation_bed_df_pl = get_tss_annotation_from_ensembl(
...     biomart_name="hsapiens_gene_ensembl"
... )

If your fragments files use a different chromosome convention than the one used by Ensembl, take a look at pycisTopic.gene_annotation.change_chromosome_source_in_bed() to convert the Ensembl chromosome names to UCSC, Ensembl, GenBank or RefSeq chromosome names.

Write TSS annotation to a file.

>>> write_tss_annotation_to_bed(
...     tss_annotation_bed_df_pl=tss_annotation_bed_df_pl,
...     tss_annotation_bed_filename="hg38.tss.bed",
... )

Read TSS annotation from a file.

>>> tss_annotation_bed_df_pl = read_tss_annotation_from_bed(
...     tss_annotation_bed_filename="hg38.tss.bed"
... )
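
The file layout described above (a header line starting with "# Chromosome" followed by BED rows) can be illustrated with a small round trip through an in-memory file. This sketch assumes tab-separated fields and a header that is prefixed with "# ". The two helper functions are hypothetical stand-ins and are not the pycisTopic readers and writers.

```python
import csv
import io

TSS_COLUMNS = ["Chromosome", "Start", "End", "Gene", "Score", "Strand", "Transcript_type"]


def write_tss_bed_sketch(rows, handle):
    """Write the commented header line, then one tab-separated row per TSS."""
    handle.write("# " + "\t".join(TSS_COLUMNS) + "\n")
    csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)


def read_tss_bed_sketch(handle):
    """Read the header line to get column names, then parse the rows."""
    header = handle.readline().rstrip("\n")
    if not header.startswith("# Chromosome"):
        raise ValueError("missing TSS annotation header line")
    columns = header[2:].split("\t")
    records = []
    for fields in csv.reader(handle, delimiter="\t"):
        record = dict(zip(columns, fields))
        record["Start"] = int(record["Start"])  # 0-based BED start
        record["End"] = int(record["End"])
        records.append(record)
    return records


buffer = io.StringIO()
write_tss_bed_sketch([("chr1", 999, 1000, "geneA", ".", "+", "protein_coding")], buffer)
buffer.seek(0)
tss_records = read_tss_bed_sketch(buffer)
```

The gene entry is made up. The point is only that the header line carries the column names, so a reader can recover the Chromosome, Start and Strand columns that get_tss_profile requires.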

Genomic ranges

pycisTopic.genomic_ranges.intersection(regions1_df_pl: DataFrame, regions2_df_pl: DataFrame, how: Literal['all', 'containment', 'first', 'last'] | str | None = None, regions1_info: bool = True, regions2_info: bool = False, regions1_coord: bool = False, regions2_coord: bool = False, regions1_suffix: str = '@1', regions2_suffix: str = '@2') → DataFrame[source]

Get overlapping subintervals between first set and second set of regions.

Parameters:

regions1_df_pl

Polars DataFrame containing BED entries for first set of regions.

regions2_df_pl

Polars DataFrame containing BED entries for second set of regions.

how

What intervals to report:

"all" (None): all overlaps with second set of regions.
"containment": only overlaps where region of first set is contained within region of second set.
"first": first overlap with second set of regions.
"last": last overlap with second set of regions.
"outer": all regions for first and all regions of second (outer join). If no overlap was found for a region, the other region set will contain None for that entry.
"left": all first set of regions and overlap with second set of regions (left join). If no overlap was found for a region in the first set, the second region set will contain None for that entry.
"right": all second set of regions and overlap with first set of regions (right join). If no overlap was found for a region in the second set, the first region set will contain None for that entry.

regions1_info

Add non-coordinate columns from first set of regions to output of intersection.

regions2_info

Add non-coordinate columns from second set of regions to output of intersection.

regions1_coord

Add coordinates from first set of regions to output of intersection.

regions2_coord

Add coordinates from second set of regions to output of intersection.

regions1_suffix

Suffix added to coordinate columns of first set of regions.

regions2_suffix

Suffix added to coordinate and info columns of second set of regions.

strandedness

Note: Not implemented yet. One of {None, "same", "opposite", False}, default None (i.e. auto). Whether to compare PyRanges on the same strand, the opposite strand, or ignore strand information. The default, None, means use "same" if both PyRanges are stranded, otherwise ignore the strand information.

Returns:

intersection_df_pl: Polars DataFrame containing BED entries with the intersection.

Examples

>>> regions1_df_pl = pl.from_dict(
...     {
...         "Chromosome": ["chr1"] * 3,
...         "Start": [1, 4, 10],
...         "End": [3, 9, 11],
...         "ID": ["a", "b", "c"],
...     }
... )
>>> regions1_df_pl
shape: (3, 4)
┌────────────┬───────┬─────┬─────┐
│ Chromosome ┆ Start ┆ End ┆ ID  │
│ ---        ┆ ---   ┆ --- ┆ --- │
│ str        ┆ i64   ┆ i64 ┆ str │
╞════════════╪═══════╪═════╪═════╡
│ chr1       ┆ 1     ┆ 3   ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 4     ┆ 9   ┆ b   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 10    ┆ 11  ┆ c   │
└────────────┴───────┴─────┴─────┘

>>> regions2_df_pl = pl.from_dict(
...     {
...         "Chromosome": ["chr1"] * 3,
...         "Start": [2, 2, 9],
...         "End": [3, 9, 10],
...         "Name": ["reg1", "reg2", "reg3"]
...     }
... )
>>> regions2_df_pl
shape: (3, 4)
┌────────────┬───────┬─────┬──────┐
│ Chromosome ┆ Start ┆ End ┆ Name │
│ ---        ┆ ---   ┆ --- ┆ ---  │
│ str        ┆ i64   ┆ i64 ┆ str  │
╞════════════╪═══════╪═════╪══════╡
│ chr1       ┆ 2     ┆ 3   ┆ reg1 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ chr1       ┆ 2     ┆ 9   ┆ reg2 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ chr1       ┆ 9     ┆ 10  ┆ reg3 │
└────────────┴───────┴─────┴──────┘

>>> intersection(regions1_df_pl, regions2_df_pl)
shape: (3, 4)
┌────────────┬───────┬─────┬─────┐
│ Chromosome ┆ Start ┆ End ┆ ID  │
│ ---        ┆ ---   ┆ --- ┆ --- │
│ str        ┆ i64   ┆ i64 ┆ str │
╞════════════╪═══════╪═════╪═════╡
│ chr1       ┆ 2     ┆ 3   ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 2     ┆ 3   ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 4     ┆ 9   ┆ b   │
└────────────┴───────┴─────┴─────┘

>>> intersection(regions1_df_pl, regions2_df_pl, how="first")
shape: (2, 4)
┌────────────┬───────┬─────┬─────┐
│ Chromosome ┆ Start ┆ End ┆ ID  │
│ ---        ┆ ---   ┆ --- ┆ --- │
│ str        ┆ i64   ┆ i64 ┆ str │
╞════════════╪═══════╪═════╪═════╡
│ chr1       ┆ 2     ┆ 3   ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 4     ┆ 9   ┆ b   │
└────────────┴───────┴─────┴─────┘

>>> intersection(
...     regions1_df_pl,
...     regions2_df_pl,
...     how="containment",
...     regions1_info=False,
...     regions2_info=True,
... )
shape: (1, 4)
┌────────────┬───────┬─────┬──────┐
│ Chromosome ┆ Start ┆ End ┆ Name │
│ ---        ┆ ---   ┆ --- ┆ ---  │
│ str        ┆ i64   ┆ i64 ┆ str  │
╞════════════╪═══════╪═════╪══════╡
│ chr1       ┆ 4     ┆ 9   ┆ reg2 │
└────────────┴───────┴─────┴──────┘

>>> intersection(
...     regions1_df_pl,
...     regions2_df_pl,
...     regions1_coord=True,
...     regions2_coord=True,
... )
shape: (3, 10)
┌────────────┬───────┬─────┬──────────────┬─────────┬───────┬──────────────┬─────────┬───────┬─────┐
│ Chromosome ┆ Start ┆ End ┆ Chromosome@1 ┆ Start@1 ┆ End@1 ┆ Chromosome@2 ┆ Start@2 ┆ End@2 ┆ ID  │
│ ---        ┆ ---   ┆ --- ┆ ---          ┆ ---     ┆ ---   ┆ ---          ┆ ---     ┆ ---   ┆ --- │
│ str        ┆ i64   ┆ i64 ┆ str          ┆ i64     ┆ i64   ┆ str          ┆ i64     ┆ i64   ┆ str │
╞════════════╪═══════╪═════╪══════════════╪═════════╪═══════╪══════════════╪═════════╪═══════╪═════╡
│ chr1       ┆ 2     ┆ 3   ┆ chr1         ┆ 1       ┆ 3     ┆ chr1         ┆ 2       ┆ 9     ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 2     ┆ 3   ┆ chr1         ┆ 1       ┆ 3     ┆ chr1         ┆ 2       ┆ 3     ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 4     ┆ 9   ┆ chr1         ┆ 4       ┆ 9     ┆ chr1         ┆ 2       ┆ 9     ┆ b   │
└────────────┴───────┴─────┴──────────────┴─────────┴───────┴──────────────┴─────────┴───────┴─────┘

>>> intersection(
...     regions1_df_pl,
...     regions2_df_pl,
...     regions1_info=False,
...     regions2_info=True,
...     regions2_coord=True,
... )
shape: (3, 7)
┌────────────┬───────┬─────┬──────────────┬─────────┬───────┬──────┐
│ Chromosome ┆ Start ┆ End ┆ Chromosome@2 ┆ Start@2 ┆ End@2 ┆ Name │
│ ---        ┆ ---   ┆ --- ┆ ---          ┆ ---     ┆ ---   ┆ ---  │
│ str        ┆ i64   ┆ i64 ┆ str          ┆ i64     ┆ i64   ┆ str  │
╞════════════╪═══════╪═════╪══════════════╪═════════╪═══════╪══════╡
│ chr1       ┆ 2     ┆ 3   ┆ chr1         ┆ 2       ┆ 9     ┆ reg2 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ chr1       ┆ 2     ┆ 3   ┆ chr1         ┆ 2       ┆ 3     ┆ reg1 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ chr1       ┆ 4     ┆ 9   ┆ chr1         ┆ 2       ┆ 9     ┆ reg2 │
└────────────┴───────┴─────┴──────────────┴─────────┴───────┴──────┘
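
The default behaviour shown in the first example (report every overlapping subinterval, keeping the ID of the first region set) can be reproduced with a naive pure-Python loop. This is a quadratic-time sketch for illustration only. It assumes half-open BED intervals and makes no claim about how the library computes intersections or orders its rows.

```python
def intersection_sketch(regions1, regions2):
    """Return overlapping subintervals, carrying over the ID of regions1."""
    hits = []
    for chrom1, start1, end1, region_id in regions1:
        for chrom2, start2, end2, _name in regions2:
            # The shared part of two half-open intervals, if any.
            start, end = max(start1, start2), min(end1, end2)
            if chrom1 == chrom2 and start < end:
                hits.append((chrom1, start, end, region_id))
    return hits


regions1 = [("chr1", 1, 3, "a"), ("chr1", 4, 9, "b"), ("chr1", 10, 11, "c")]
regions2 = [("chr1", 2, 3, "reg1"), ("chr1", 2, 9, "reg2"), ("chr1", 9, 10, "reg3")]
intersections = intersection_sketch(regions1, regions2)
```

Region c and reg3 only touch at position 10, which does not count as an overlap for half-open intervals, so c contributes no row.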

pycisTopic.genomic_ranges.overlap(regions1_df_pl: DataFrame, regions2_df_pl: DataFrame, how: Literal['all', 'containment', 'first'] | str | None = 'first', invert: bool = False) → DataFrame[source]

Get overlap between two region sets.

Get overlap between first set and second set of regions and return interval of first set of regions.

Parameters:

regions1_df_pl

Polars DataFrame containing BED entries for first set of regions.

regions2_df_pl

Polars DataFrame containing BED entries for second set of regions.

how

What overlaps to report:

"all" (None): all overlaps with second set of regions.
"containment": only overlaps where region of first set is contained within region of second set.
"first": first overlap with second set of regions.

invert

Whether to return the intervals without overlaps.

strandedness

Note: Not implemented yet. One of {None, "same", "opposite", False}, default None (i.e. auto). Whether to compare PyRanges on the same strand, the opposite strand, or ignore strand information. The default, None, means use "same" if both PyRanges are stranded, otherwise ignore the strand information.

Returns:

overlap_df_pl: Polars DataFrame containing BED entries with the overlap.

Examples

>>> regions1_df_pl = pl.from_dict(
...     {
...         "Chromosome": ["chr1"] * 3,
...         "Start": [1, 4, 10],
...         "End": [3, 9, 11],
...         "ID": ["a", "b", "c"],
...     }
... )
>>> regions1_df_pl
shape: (3, 4)
┌────────────┬───────┬─────┬─────┐
│ Chromosome ┆ Start ┆ End ┆ ID  │
│ ---        ┆ ---   ┆ --- ┆ --- │
│ str        ┆ i64   ┆ i64 ┆ str │
╞════════════╪═══════╪═════╪═════╡
│ chr1       ┆ 1     ┆ 3   ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 4     ┆ 9   ┆ b   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 10    ┆ 11  ┆ c   │
└────────────┴───────┴─────┴─────┘

>>> regions2_df_pl = pl.from_dict(
...     {
...         "Chromosome": ["chr1"] * 3,
...         "Start": [2, 2, 9],
...         "End": [3, 9, 10],
...         "Name": ["reg1", "reg2", "reg3"]
...     }
... )
>>> regions2_df_pl
shape: (3, 4)
┌────────────┬───────┬─────┬──────┐
│ Chromosome ┆ Start ┆ End ┆ Name │
│ ---        ┆ ---   ┆ --- ┆ ---  │
│ str        ┆ i64   ┆ i64 ┆ str  │
╞════════════╪═══════╪═════╪══════╡
│ chr1       ┆ 2     ┆ 3   ┆ reg1 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ chr1       ┆ 2     ┆ 9   ┆ reg2 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ chr1       ┆ 9     ┆ 10  ┆ reg3 │
└────────────┴───────┴─────┴──────┘

>>> overlap(regions1_df_pl, regions2_df_pl, how="first")
shape: (2, 4)
┌────────────┬───────┬─────┬─────┐
│ Chromosome ┆ Start ┆ End ┆ ID  │
│ ---        ┆ ---   ┆ --- ┆ --- │
│ str        ┆ i64   ┆ i64 ┆ str │
╞════════════╪═══════╪═════╪═════╡
│ chr1       ┆ 1     ┆ 3   ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 4     ┆ 9   ┆ b   │
└────────────┴───────┴─────┴─────┘

>>> overlap(regions1_df_pl, regions2_df_pl, how="all")
shape: (3, 4)
┌────────────┬───────┬─────┬─────┐
│ Chromosome ┆ Start ┆ End ┆ ID  │
│ ---        ┆ ---   ┆ --- ┆ --- │
│ str        ┆ i64   ┆ i64 ┆ str │
╞════════════╪═══════╪═════╪═════╡
│ chr1       ┆ 1     ┆ 3   ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 1     ┆ 3   ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 4     ┆ 9   ┆ b   │
└────────────┴───────┴─────┴─────┘

>>> overlap(regions1_df_pl, regions2_df_pl, how="containment")
shape: (1, 4)
┌────────────┬───────┬─────┬─────┐
│ Chromosome ┆ Start ┆ End ┆ ID  │
│ ---        ┆ ---   ┆ --- ┆ --- │
│ str        ┆ i64   ┆ i64 ┆ str │
╞════════════╪═══════╪═════╪═════╡
│ chr1       ┆ 4     ┆ 9   ┆ b   │
└────────────┴───────┴─────┴─────┘

>>> overlap(regions1_df_pl, regions2_df_pl, how="containment", invert=True)
shape: (2, 4)
┌────────────┬───────┬─────┬─────┐
│ Chromosome ┆ Start ┆ End ┆ ID  │
│ ---        ┆ ---   ┆ --- ┆ --- │
│ str        ┆ i64   ┆ i64 ┆ str │
╞════════════╪═══════╪═════╪═════╡
│ chr1       ┆ 1     ┆ 3   ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ chr1       ┆ 10    ┆ 11  ┆ c   │
└────────────┴───────┴─────┴─────┘
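
The how and invert options can be understood from a naive reference loop over both region sets. The sketch below reproduces the four example outputs above under the assumption of half-open BED intervals. It is a quadratic-time illustration with a hypothetical function name, not the library's algorithm.

```python
def overlap_sketch(regions1, regions2, how="first", invert=False):
    """Return regions of the first set, selected by overlap with the second set."""
    result = []
    for region in regions1:
        chrom1, start1, end1, _region_id = region
        matches = 0
        for chrom2, start2, end2, _name in regions2:
            if chrom1 != chrom2:
                continue
            if how == "containment":
                # The first-set region must lie fully inside the second-set region.
                is_match = start2 <= start1 and end1 <= end2
            else:
                is_match = max(start1, start2) < min(end1, end2)
            if is_match:
                matches += 1
        if invert:
            if matches == 0:
                result.append(region)  # keep only regions without a match
        elif how == "all" or how is None:
            result.extend([region] * matches)  # one row per overlap
        elif matches:
            result.append(region)  # "first" and "containment": one row per region
    return result


regions1 = [("chr1", 1, 3, "a"), ("chr1", 4, 9, "b"), ("chr1", 10, 11, "c")]
regions2 = [("chr1", 2, 3, "reg1"), ("chr1", 2, 9, "reg2"), ("chr1", 9, 10, "reg3")]
first_overlaps = overlap_sketch(regions1, regions2, how="first")
```

Unlike an intersection, the returned coordinates are always those of the first region set, which is why region a is reported as 1 to 3 rather than 2 to 3.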

TSS profile

pycisTopic.tss_profile.get_tss_profile(fragments_df_pl: DataFrame, tss_annotation: DataFrame, flank_window: int = 2000, smoothing_rolling_window: int = 10, minimum_signal_window: int = 100, tss_window: int = 50, min_norm: float = 0.2, use_genomic_ranges: bool = True)[source]

Get TSS profile for Polars DataFrame with fragments filtered by cell barcodes.

Parameters:

fragments_df_pl

Polars DataFrame with fragments (filtered by cell barcodes of interest). See pycisTopic.fragments.filter_fragments_by_cb().

tss_annotation

TSS annotation Polars DataFrame with at least the following columns: ["Chromosome", "Start", "Strand"]. The “Start” column is 0-based like a BED file. See pycisTopic.gene_annotation.get_tss_annotation_from_ensembl() and pycisTopic.gene_annotation.change_chromosome_source_in_bed() for ways to get TSS annotation from Ensembl BioMart.

flank_window

Flanking window around the TSS. Used for intersecting fragments with TSS positions and keeping cut sites. Default: 2000 (+/- 2000 bp).

smoothing_rolling_window

Rolling window used to smooth the cut sites signal. Default: 10.

minimum_signal_window

Average signal in the tails of the flanking window around the TSS:

[-flank_window, -flank_window + minimum_signal_window - 1]
[flank_window - minimum_signal_window + 1, flank_window]

is used to normalize the TSS enrichment. Default: 100 (average signal in [-2000, -1901], [1901, 2000] around TSS if flank_window=2000).

tss_window

Window around the TSS used to count fragments in the TSS when calculating the TSS enrichment per cell barcode. Default: 50 (+/- 50 bp).

min_norm

Minimum normalization score. If the average minimum signal value is below this value, this number is used to normalize the TSS signal. This approach penalizes cells with fewer reads. Default: 0.2

use_genomic_ranges

Use genomic ranges implementation for calculating intersections, instead of using pyranges.

Returns:

tss_enrichment_per_cb, tss_norm_matrix_sample, tss_norm_matrix_per_cb

See also

pycisTopic.fragments.filter_fragments_by_cb
pycisTopic.gene_annotation.change_chromosome_source_in_bed
pycisTopic.gene_annotation.get_tss_annotation_from_ensembl

Examples

Get TSS annotation for requested transcript types from Ensembl BioMart.

>>> ensembl_tss_annotation_bed_df_pl = get_tss_annotation_from_ensembl(
...     biomart_name="hsapiens_gene_ensembl"
... )

Get TSS profile for Polars DataFrame with fragments filtered by cell barcodes.

>>> get_tss_profile(
...     fragments_df_pl=fragments_cb_filtered_df_pl,
...     tss_annotation=ensembl_tss_annotation_bed_df_pl,
...     flank_window=2000,
...     smoothing_rolling_window=10,
...     minimum_signal_window=100,
...     tss_window=50,
...     min_norm=0.2,
... )
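
How flank_window, minimum_signal_window, tss_window and min_norm interact can be sketched in plain Python. This is a simplified illustration, not pycisTopic's implementation. It assumes a list of (smoothed) cut-site signal values, one per position in [-flank_window, flank_window]. The helper name normalize_tss_signal is hypothetical, and taking the mean inside the TSS window is an assumption.

```python
def normalize_tss_signal(signal, flank_window=2000, minimum_signal_window=100,
                         tss_window=50, min_norm=0.2):
    """Normalize a cut-site profile by the average signal in its tails.

    `signal[i]` is the (smoothed) signal at position `i - flank_window`
    relative to the TSS, so the list has 2 * flank_window + 1 entries.
    Hypothetical helper, for illustration only.
    """
    if len(signal) != 2 * flank_window + 1:
        raise ValueError("signal must cover [-flank_window, flank_window]")

    # Tails: the first and last `minimum_signal_window` positions.
    left_tail = signal[:minimum_signal_window]
    right_tail = signal[-minimum_signal_window:]
    background = (sum(left_tail) + sum(right_tail)) / (2 * minimum_signal_window)

    # Never normalize by less than `min_norm`; this penalizes sparse profiles.
    norm_factor = max(background, min_norm)
    normalized = [value / norm_factor for value in signal]

    # Assumed summary: mean normalized signal within +/- tss_window of the TSS.
    center = flank_window
    window = normalized[center - tss_window:center + tss_window + 1]
    return normalized, sum(window) / len(window)
```

With the defaults, the two tails cover positions [-2000, -1901] and [1901, 2000], matching the description of minimum_signal_window.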

QC

pycisTopic.qc.compute_kde(training_data: ndarray, test_data: ndarray, no_threads: int = 8)[source]

Compute kernel-density estimate (KDE) using Gaussian kernels.

This function calculates the KDE in parallel and gives the same result as:

>>> from scipy.stats import gaussian_kde
>>> gaussian_kde(training_data)(test_data)

Parameters:

training_data: 2D numpy array with training data to train the KDE.
test_data: 2D numpy array with test data for which to evaluate the estimated probability density function (PDF).
no_threads: Number of threads to use in parallelization of KDE function.

Returns:

1D numpy array with probability density function (PDF) values for points in
test_data.
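
As a rough illustration of what a Gaussian KDE computes and why it parallelizes well, here is a minimal pure-Python sketch for two-dimensional data. It is not pycisTopic's implementation. It assumes points are given as (x, y) pairs and uses Scott's rule for the bandwidth, which is the default of scipy.stats.gaussian_kde. The function name gaussian_kde_2d is hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def gaussian_kde_2d(training_data, test_data, no_threads=2):
    """Evaluate a 2D Gaussian KDE (Scott's rule bandwidth) at each test point.

    `training_data` and `test_data` are sequences of (x, y) pairs.
    Hypothetical helper, for illustration only.
    """
    n = len(training_data)
    mean_x = sum(x for x, _ in training_data) / n
    mean_y = sum(y for _, y in training_data) / n

    # Unbiased sample covariance of the training data.
    sxx = sum((x - mean_x) ** 2 for x, _ in training_data) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in training_data) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in training_data) / (n - 1)

    # Scott's rule: scale the covariance by (n ** (-1 / (d + 4))) ** 2, d = 2.
    factor_sq = n ** (-2.0 / 6.0)
    a, b, c = sxx * factor_sq, sxy * factor_sq, syy * factor_sq
    det = a * c - b * b
    norm = n * 2.0 * math.pi * math.sqrt(det)

    def density(point):
        px, py = point
        total = 0.0
        for x, y in training_data:
            dx, dy = px - x, py - y
            # Quadratic form with the inverse of [[a, b], [b, c]].
            energy = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
            total += math.exp(-0.5 * energy)
        return total / norm

    # Each test point is independent, so the work splits cleanly across workers.
    with ThreadPoolExecutor(max_workers=no_threads) as pool:
        return list(pool.map(density, test_data))
```

The thread pool here only shows that test points can be evaluated independently; in pure Python it will not give a real speed-up.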

pycisTopic.qc.compute_qc_stats(fragments_df_pl: DataFrame, regions_df_pl: DataFrame, tss_annotation: DataFrame, tss_flank_window: int = 2000, tss_smoothing_rolling_window: int = 10, tss_minimum_signal_window: int = 100, tss_window: int = 50, tss_min_norm: float = 0.2, use_genomic_ranges: bool = True, min_fragments_per_cb: int = 10, collapse_duplicates: bool = True, no_threads: int = 8) → tuple[DataFrame, DataFrame, DataFrame, DataFrame][source]

Compute quality check statistics from Polars DataFrame with fragments.

Parameters:

fragments_df_pl

Polars DataFrame with fragments (filtered by cell barcodes of interest). See pycisTopic.fragments.filter_fragments_by_cb().

regions_df_pl

Polars DataFrame with peak regions (consensus peaks or SCREEN regions). See pycisTopic.fragments.read_bed_to_polars_df() for a way to read a BED file with peak regions.

tss_annotation

TSS annotation Polars DataFrame with at least the following columns: ["Chromosome", "Start", "Strand"]. The “Start” column is 0-based like a BED file. See pycisTopic.gene_annotation.read_tss_annotation_from_bed(), pycisTopic.gene_annotation.get_tss_annotation_from_ensembl() and pycisTopic.gene_annotation.change_chromosome_source_in_bed() for ways to get TSS annotation from Ensembl BioMart.

tss_flank_window

Flanking window around the TSS. Used for intersecting fragments with TSS positions and keeping cut sites. Default: 2000 (+/- 2000 bp). See pycisTopic.tss_profile.get_tss_profile().

tss_smoothing_rolling_window

Rolling window used to smooth the cut sites signal. Default: 10. See pycisTopic.tss_profile.get_tss_profile().

tss_minimum_signal_window

Average signal in the tails of the flanking window around the TSS:

[-tss_flank_window, -tss_flank_window + tss_minimum_signal_window - 1]
[tss_flank_window - tss_minimum_signal_window + 1, tss_flank_window]

is used to normalize the TSS enrichment. Default: 100 (average signal in [-2000, -1901], [1901, 2000] around TSS if tss_flank_window=2000). See pycisTopic.tss_profile.get_tss_profile().

tss_window

Window around the TSS used to count fragments in the TSS when calculating the TSS enrichment per cell barcode. Default: 50 (+/- 50 bp). See pycisTopic.tss_profile.get_tss_profile().

tss_min_norm

Minimum normalization score. If the average minimum signal value is below this value, this number is used to normalize the TSS signal. This approach penalizes cells with fewer reads. Default: 0.2. See pycisTopic.tss_profile.get_tss_profile().

use_genomic_ranges

Use genomic ranges implementation for calculating intersections, instead of using pyranges.

min_fragments_per_cb

Minimum number of fragments needed per cell barcode to keep the fragments for those cell barcodes.

collapse_duplicates

Collapse duplicate fragments (same chromosomal positions and linked to the same cell barcode).

no_threads

Number of threads to use when calculating kernel-density estimate (KDE) to get probability density function (PDF) values for log10 unique fragments in peaks vs TSS enrichment, fractions of fragments in peaks and duplication ratio. Default: 8

Returns:

Tuple with:

Polars DataFrame with fragments statistics per cell barcode.
Polars DataFrame with insert size distribution of fragments.
Polars DataFrame with TSS normalization matrix for the whole sample.
Polars DataFrame with TSS normalization matrix per cell barcode.

See also

pycisTopic.fragments.filter_fragments_by_cb
pycisTopic.fragments.get_insert_size_distribution
pycisTopic.fragments.get_fragments_in_peaks
pycisTopic.fragments.read_bed_to_polars_df
pycisTopic.fragments.read_fragments_to_polars_df
pycisTopic.gene_annotation.read_tss_annotation_from_bed
pycisTopic.tss_profile.get_tss_profile

Examples

>>> from pycisTopic.fragments import read_bed_to_polars_df
>>> from pycisTopic.fragments import read_fragments_to_polars_df
>>> from pycisTopic.gene_annotation import read_tss_annotation_from_bed

Read gzipped fragments BED file to a Polars DataFrame.

>>> fragments_df_pl = read_fragments_to_polars_df(
...     fragments_bed_filename="fragments.tsv.gz",
... )

Read BED file with consensus peaks or SCREEN regions (get first 3 columns only) which will be used for counting number of fragments in peaks.

>>> regions_df_pl = read_bed_to_polars_df(
...     bed_filename=screen_regions_bed_filename,
...     min_column_count=3,
... )

Read TSS annotation from a file. See pycisTopic.gene_annotation.read_tss_annotation_from_bed() for more info.

>>> tss_annotation_bed_df_pl = read_tss_annotation_from_bed(
...     tss_annotation_bed_filename="hg38.tss.bed",
... )

Compute QC statistics.

>>> (
...     fragments_stats_per_cb_df_pl,
...     insert_size_dist_df_pl,
...     tss_norm_matrix_sample,
...     tss_norm_matrix_per_cb,
... ) = compute_qc_stats(
...     fragments_df_pl=fragments_cb_filtered_df_pl,
...     regions_df_pl=regions_df_pl,
...     tss_annotation=tss_annotation_bed_df_pl,
...     tss_flank_window=2000,
...     tss_smoothing_rolling_window=10,
...     tss_minimum_signal_window=100,
...     tss_window=50,
...     tss_min_norm=0.2,
...     use_genomic_ranges=True,
...     min_fragments_per_cb=10,
...     collapse_duplicates=True,
...     no_threads=8,
... )
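
The effect of collapse_duplicates and min_fragments_per_cb can be pictured with a small pure-Python sketch. It is only an illustration under assumptions: fragments are plain (chromosome, start, end, barcode) tuples, duplicates are collapsed before fragments are counted per barcode, and the helper name filter_fragments is hypothetical. The actual function works on Polars DataFrames.

```python
from collections import Counter


def filter_fragments(fragments, min_fragments_per_cb=10, collapse_duplicates=True):
    """Filter (chrom, start, end, barcode) tuples as described above.

    Hypothetical helper, for illustration only.
    """
    fragments = list(fragments)
    if collapse_duplicates:
        # Keep one copy of fragments with identical coordinates and barcode,
        # preserving the original order.
        fragments = list(dict.fromkeys(fragments))

    # Count the remaining fragments per cell barcode.
    per_cb = Counter(barcode for *_, barcode in fragments)

    # Keep only fragments from barcodes with enough fragments.
    return [frag for frag in fragments if per_cb[frag[3]] >= min_fragments_per_cb]
```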

pycisTopic.qc.get_barcodes_passing_qc_for_sample(sample_id: str, pycistopic_qc_output_dir: str | Path, unique_fragments_threshold: int | None = None, tss_enrichment_threshold: float | None = None, frip_threshold: float | None = None, use_automatic_thresholds: bool = True) → tuple[np.ndarray, dict[str, float]][source]

Get barcodes passing quality control (QC) for a sample.

Parameters:

sample_id: Sample ID.
pycistopic_qc_output_dir: Directory with output from pycistopic qc.
unique_fragments_threshold: Threshold for number of unique fragments in peaks. If not defined, and use_automatic_thresholds is False, the threshold will be set to 0.
tss_enrichment_threshold: Threshold for TSS enrichment score. If not defined, and use_automatic_thresholds is False, the threshold will be set to 0.
frip_threshold: Threshold for fraction of reads in peaks (FRiP). If not defined, the threshold will be set to 0.
use_automatic_thresholds: Use automatic thresholds for unique fragments in peaks and TSS enrichment score as calculated by Otsu’s method. If False, the thresholds will be set to 0 if not defined.

Returns:

Tuple with:

Numpy array with cell barcodes passing QC.
Dictionary with thresholds used for QC.

Raises:

FileNotFoundError: If the file with fragments statistics per cell barcode does not exist.
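
The way the three thresholds fall back to automatic values or to 0 can be sketched as follows. This is a simplified illustration, not the library's code. It assumes that explicitly passed thresholds take precedence over automatic ones, that barcodes pass when their values are greater than or equal to each threshold, and that the Otsu thresholds have already been computed. The helper name select_barcodes, its stats layout and the dictionary keys are hypothetical.

```python
def select_barcodes(stats, unique_fragments_threshold=None,
                    tss_enrichment_threshold=None, frip_threshold=None,
                    use_automatic_thresholds=True, otsu_thresholds=(0, 0.0)):
    """Return (passing barcodes, thresholds used).

    `stats` maps barcode -> (unique_fragments_in_peaks, tss_enrichment, frip).
    `otsu_thresholds` stands in for precomputed Otsu thresholds.
    Hypothetical helper, for illustration only.
    """
    otsu_fragments, otsu_tss = otsu_thresholds

    # Undefined thresholds fall back to Otsu values, or to 0 when automatic
    # thresholds are switched off.
    if unique_fragments_threshold is None:
        unique_fragments_threshold = otsu_fragments if use_automatic_thresholds else 0
    if tss_enrichment_threshold is None:
        tss_enrichment_threshold = otsu_tss if use_automatic_thresholds else 0.0
    # There is no automatic FRiP threshold: undefined always means 0.
    if frip_threshold is None:
        frip_threshold = 0.0

    passing = [
        barcode
        for barcode, (n_fragments, tss, frip) in stats.items()
        if n_fragments >= unique_fragments_threshold
        and tss >= tss_enrichment_threshold
        and frip >= frip_threshold
    ]
    thresholds = {
        "unique_fragments_threshold": unique_fragments_threshold,
        "tss_enrichment_threshold": tss_enrichment_threshold,
        "frip_threshold": frip_threshold,
    }
    return passing, thresholds
```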

pycisTopic.qc.get_otsu_threshold(fragments_stats_per_cb_df_pl: DataFrame, min_otsu_fragments: int = 100, min_otsu_tss: float = 1.0)[source]

Get Otsu thresholds for number of unique fragments in peaks and TSS enrichment score.

Parameters:

fragments_stats_per_cb_df_pl: Polars DataFrame with fragments statistics per cell barcode as generated by pycisTopic.qc.compute_qc_stats().
min_otsu_fragments: When calculating Otsu threshold for number of unique fragments in peaks per CB, only consider those CBs which have at least this number of fragments.
min_otsu_tss: When calculating Otsu threshold for TSS enrichment score per CB, only consider those CBs which have at least this TSS value.

Returns:

Tuple with:

Otsu threshold for number of unique fragments in peaks.
Otsu threshold for TSS enrichment.
Polars DataFrame with fragments statistics per cell barcode for cell barcodes that passed both Otsu thresholds.

Examples

Only keep fragments stats for CBs that pass both Otsu thresholds.

>>> (
...     unique_fragments_in_peaks_count_otsu_threshold,
...     tss_enrichment_otsu_threshold,
...     fragments_stats_per_cb_for_otsu_threshold_df_pl,
... ) = get_otsu_threshold(
...     fragments_stats_per_cb_df_pl=fragments_stats_per_cb_df_pl,
...     min_otsu_fragments=100,
...     min_otsu_tss=1.0,
... )
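
For readers unfamiliar with Otsu's method, the following pure-Python sketch shows the general technique: pick the histogram split that maximizes the between-class variance. It is a generic illustration, not the code used by get_otsu_threshold. The function name otsu_threshold, the number of bins and the choice to work on raw (untransformed) values are assumptions.

```python
def otsu_threshold(values, n_bins=256):
    """Return a histogram-based Otsu threshold for a list of numbers.

    Generic sketch of Otsu's method, for illustration only.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo
    width = (hi - lo) / n_bins

    # Histogram of the values, with the maximum placed in the last bin.
    counts = [0] * n_bins
    for value in values:
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]

    total = len(values)
    total_sum = sum(c * m for c, m in zip(counts, centers))

    best_variance, best_threshold = -1.0, lo
    weight_low, sum_low = 0, 0.0
    for i in range(n_bins - 1):
        weight_low += counts[i]
        sum_low += counts[i] * centers[i]
        weight_high = total - weight_low
        if weight_low == 0 or weight_high == 0:
            continue
        mean_low = sum_low / weight_low
        mean_high = (total_sum - sum_low) / weight_high
        # Between-class variance, up to a constant factor of 1 / total ** 2.
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = lo + (i + 1) * width  # upper edge of bin i
    return best_threshold
```

In the spirit of min_otsu_fragments and min_otsu_tss, values below a minimum could be dropped before calling such a function, for example `otsu_threshold([v for v in values if v >= 100])`.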

Topic modelling

class pycisTopic.lda_models.CistopicLDAModel(metrics: DataFrame, coherence: DataFrame, marg_topic: DataFrame, topic_ass: DataFrame, cell_topic: DataFrame, topic_region: DataFrame, parameters: DataFrame)[source]

cisTopic LDA model class.

CistopicLDAModel contains model quality metrics (model coherence (adaptation from Mimno et al., 2011), log-likelihood (Griffiths and Steyvers, 2004), density-based (Cao Juan et al., 2009) and divergence-based (Arun et al., 2010)), topic quality metrics (coherence, marginal distribution and total number of assignments), cell-topic and topic-region distribution, model parameters and model dimensions.

Parameters:

metrics: pd.DataFrame: pd.DataFrame containing model quality metrics, including model coherence (adaptation from Mimno et al., 2011), log-likelihood and density and divergence-based methods (Cao Juan et al., 2009; Arun et al., 2010).
coherence: pd.DataFrame: pd.DataFrame containing the coherence of each topic (Mimno et al., 2011).
marg_topic: pd.DataFrame: pd.DataFrame containing the marginal distribution for each topic. It can be interpreted as the importance of each topic for the whole corpus.
topic_ass: pd.DataFrame: pd.DataFrame containing the total number of assignments per topic.
cell_topic: pd.DataFrame: pd.DataFrame containing the topic cell distributions, with cells as columns, topics as rows and the probability of each topic in each cell as values.
topic_region: pd.DataFrame: pd.DataFrame containing the topic region distributions, with topics as columns, regions as rows and the probability of each region in each topic as values.
parameters: pd.DataFrame: pd.DataFrame containing parameters used for the model.
n_cells: int: Number of cells in the model.
n_regions: int: Number of regions in the model.
n_topic: int: Number of topics in the model.

References

Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 262-272).

Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National academy of Sciences, 101(suppl 1), 5228-5235.

Cao, J., Xia, T., Li, J., Zhang, Y., & Tang, S. (2009). A density-based method for adaptive LDA model selection. Neurocomputing, 72(7-9), 1775-1781.

Arun, R., Suresh, V., Madhavan, C. V., & Murthy, M. N. (2010). On finding the natural number of topics with latent dirichlet allocation: Some observations. In Pacific-Asia conference on knowledge discovery and data mining (pp. 391-402). Springer, Berlin, Heidelberg.

Wrapper class to run LDA models with Mallet. This class has been adapted from gensim (https://github.com/RaRe-Technologies/gensim/blob/27bbb7015dc6bbe02e00bb1853e7952ac13e7fe0/gensim/models/wrappers/ldamallet.py).

Parameters:

num_topics: int: The number of topics to use in the model.
corpus: iterable of iterable of (int, int), optional: Collection of texts in BoW format. Default: None.
alpha: float, optional: Scalar value indicating the symmetric Dirichlet hyperparameter for topic proportions. Default: 50.
id2word: gensim.utils.FakeDict, optional: Mapping between token ids and words from the corpus. If not specified, it will be inferred from the corpus. Default: None.
n_cpu: int, optional: Number of threads that will be used for training. Default: 1.
tmp_dir: str, optional: Directory for produced temporary files. Default: None.
optimize_interval: int, optional: Optimize hyperparameters every optimize_interval iterations (this sometimes leads to a Java exception; set to 0 to switch off hyperparameter optimization). Default: 0.
iterations: int, optional: Number of training iterations. Default: 150.
topic_threshold: float, optional: Threshold of the probability above which we consider a topic. Default: 0.0.
random_seed: int, optional: Random seed to ensure consistent results, if 0 - use system clock. Default: 555.
mallet_path: str: Path to the mallet binary (e.g. /xxx/Mallet/bin/mallet). Default: “mallet”.

convert_input(corpus)[source]

Convert corpus to Mallet format and save it to a temporary text file.

Parameters:

corpus: iterable of iterable of (int, int): Collection of texts in BoW format.

Returns:

None.

corpus_to_mallet(corpus, file_like)[source]

Convert corpus to Mallet format and write it to file_like descriptor.

Parameters:

corpus: iterable of iterable of (int, int): Collection of texts in BoW format.
file_like: Writable file-like object in text mode.

Returns:

None.
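
To make the BoW-to-Mallet conversion concrete, here is a small sketch modeled on the gensim wrapper this class was adapted from. The line layout (document number, a placeholder label 0, then each token repeated as many times as its count) follows that wrapper and is assumed to carry over here. The function name corpus_to_mallet_text and its id2word argument are illustrative.

```python
import io
from itertools import chain


def corpus_to_mallet_text(corpus, file_like, id2word=None):
    """Write a BoW corpus as one line per document: '<doc_no> 0 <tokens...>'.

    Sketch modeled on gensim's Mallet wrapper, for illustration only.
    """
    for doc_no, doc in enumerate(corpus):
        # Repeat every token `count` times; fall back to the id as the token.
        tokens = chain.from_iterable(
            [id2word[token_id] if id2word else str(token_id)] * int(count)
            for token_id, count in doc
        )
        file_like.write("%s 0 %s\n" % (doc_no, " ".join(tokens)))


# Usage: two documents, written to an in-memory text buffer.
buffer = io.StringIO()
corpus_to_mallet_text([[(0, 1), (2, 2)], [(1, 1)]], buffer)
mallet_text = buffer.getvalue()
```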

fcorpusmallet()[source]

Get path to corpus.mallet file.

Returns:

str: Path to corpus.mallet file.

fcorpustxt()[source]

Get path to corpus text file.

Returns:

str: Path to corpus text file.

fdoctopics()[source]

Get path to document topic text file.

Returns:

str: Path to document topic text file.

finferencer()[source]

Get path to inferencer.mallet file.

Returns:

str: Path to inferencer.mallet file.

fstate()[source]

Get path to temporary file.

Returns:

str: Path to file.

ftopickeys()[source]

Get path to topic keys text file.

Returns:

str: Path to topic keys text file.

get_topics()[source]

Get topics X words matrix.

Returns:

np.ndarray: Topics X words matrix, shape num_topics x vocabulary_size.

load_word_topics()[source]

Load words X topics matrix from gensim.models.wrappers.LDAMallet.LDAMallet.fstate() file.

Returns:

np.ndarray: Matrix words X topics.

train(corpus, reuse_corpus)[source]

Train Mallet LDA.

Parameters:

corpus: iterable of iterable of (int, int): Corpus in BoW format.
reuse_corpus: bool, optional: Whether to reuse the mallet corpus in the tmp directory. Default: False

pycisTopic.lda_models.evaluate_models(models: List[CistopicLDAModel], select_model: int | None = None, return_model: bool | None = True, metrics: str | None = ['Minmo_2011', 'loglikelihood', 'Cao_Juan_2009', 'Arun_2010'], min_topics_coh: int | None = 5, plot: bool | None = True, figsize: Tuple[float, float] | None = (6.4, 4.8), plot_metrics: bool | None = False, save: str | None = None)[source]

Model selection based on model quality metrics (model coherence (adaptation from Mimno et al., 2011), log-likelihood (Griffiths and Steyvers, 2004), density-based (Cao Juan et al., 2009) and divergence-based (Arun et al., 2010)).

Parameters:

models: list of :class:`CistopicLDAModel`

A list containing cisTopic LDA models, as returned from run_cgs_models or run_cgs_modelsMallet.

select_model: int, optional

Integer indicating the number of topics of the selected model. If not provided, the best model will be selected automatically based on the model quality metrics. Default: None.

return_model: bool, optional

Whether to return the selected model as CistopicLDAModel. Default: True.

metrics: list of str

Metrics to use for plotting and model selection:

Minmo_2011: Uses the average model coherence as calculated by Mimno et al. (2011). To reduce the impact of the number of topics, the average coherence is calculated from the top selected average values. The better the model, the higher the coherence.
loglikelihood: Uses the log-likelihood in the last iteration as calculated by Griffiths and Steyvers (2004). The better the model, the higher the log-likelihood.
Arun_2010: Uses a divergence-based metric as in Arun et al. (2010) using the topic-region distribution, the cell-topic distribution and the cell coverage. The better the model, the lower the metric.
Cao_Juan_2009: Uses a density-based metric as in Cao Juan et al. (2009) using the topic-region distribution. The better the model, the lower the metric.

Default: all metrics.

min_topics_coh: int, optional

Minimum number of topics in a model for its coherence to be used for model selection. Default: 5.

plot: bool, optional

Whether to return plot to the console. Default: True.

figsize: tuple, optional

Size of the figure. Default: (6.4, 4.8)

plot_metrics: bool, optional

Whether to plot metrics independently. Default: False.

save: str, optional

Output file to save plot. Default: None.
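
Because two of the metrics improve as they increase (Minmo_2011, loglikelihood) and two improve as they decrease (Arun_2010, Cao_Juan_2009), comparing them on one axis requires putting them on a common scale. The sketch below shows one plausible way to do that: min-max scaling with the lower-is-better metrics flipped. It is an assumption for illustration, not a description of what evaluate_models does internally. The names HIGHER_IS_BETTER and rescale_metric are hypothetical.

```python
# Direction of each metric, as stated in the parameter description above.
HIGHER_IS_BETTER = {
    "Minmo_2011": True,
    "loglikelihood": True,
    "Arun_2010": False,
    "Cao_Juan_2009": False,
}


def rescale_metric(values, higher_is_better):
    """Min-max scale values to [0, 1] so that 1 always means 'best model'.

    Hypothetical helper, for illustration only.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    scaled = [(value - lo) / (hi - lo) for value in values]
    # Flip metrics where a lower raw value indicates a better model.
    return scaled if higher_is_better else [1.0 - s for s in scaled]
```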

References

Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 262-272).

Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National academy of Sciences, 101(suppl 1), 5228-5235.

Cao, J., Xia, T., Li, J., Zhang, Y., & Tang, S. (2009). A density-based method for adaptive LDA model selection. Neurocomputing, 72(7-9), 1775-1781.

Arun, R., Suresh, V., Madhavan, C. V., & Murthy, M. N. (2010). On finding the natural number of topics with latent dirichlet allocation: Some observations. In Pacific-Asia conference on knowledge discovery and data mining (pp. 391-402). Springer, Berlin, Heidelberg.

pycisTopic.lda_models.run_cgs_model_mallet(binary_matrix: csr_matrix, corpus: Iterable, id2word: FakeDict, n_topics: List[int], cell_names: List[str], region_names: List[str], n_cpu: int | None = 1, n_iter: int | None = 500, random_state: int | None = 555, alpha: float | None = 50, alpha_by_topic: bool | None = True, eta: float | None = 0.1, eta_by_topic: bool | None = False, top_topics_coh: int | None = 5, tmp_path: str | None = None, save_path: str | None = None, reuse_corpus: bool | None = False, mallet_path: str = 'mallet')[source]

Run Latent Dirichlet Allocation in a model as implemented in Mallet (McCallum, 2002).

Parameters:

binary_matrix: sparse.csr_matrix: Binary sparse matrix containing cells as columns, regions as rows, and 1 if a region is considered accessible in a cell (otherwise, 0).
n_topics: list of int: A list containing the number of topics to use in each model.
cell_names: list of str: List containing cell names as ordered in the binary matrix columns.
region_names: list of str: List containing region names as ordered in the binary matrix rows.
n_cpu: int, optional: Number of CPUs to use for modelling. In this function parallelization is done per model, that is, each model runs entirely on a single CPU. We recommend setting the number of CPUs to the number of models that will be inferred, so all models start at the same time.
n_iter: int, optional: Number of iterations for which the Gibbs sampler will be run. Default: 500.
random_state: int, optional: Random seed to initialize the models. Default: 555.
alpha: float, optional: Scalar value indicating the symmetric Dirichlet hyperparameter for topic proportions. Default: 50.
alpha_by_topic: bool, optional: Boolean indicating whether the scalar given in alpha has to be divided by the number of topics. Default: True
eta: float, optional: Scalar value indicating the symmetric Dirichlet hyperparameter for topic multinomials. Default: 0.1.
eta_by_topic: bool, optional: Boolean indicating whether the scalar given in eta has to be divided by the number of topics. Default: False
top_topics_coh: int, optional: Number of topics to use to calculate the model coherence. For each model, the coherence will be calculated as the average of the top coherence values. Default: 5.
tmp_path: str, optional: Path to a temporary folder for Mallet. Default: None.
save_path: str, optional: Path to save models as independent files as they are completed. This is recommended for large data sets. Default: None.
reuse_corpus: bool, optional: Whether to reuse the mallet corpus in the tmp directory. Default: False
mallet_path: str: Path to Mallet binary (e.g. “/xxx/Mallet/bin/mallet”). Default: “mallet”.
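
The alpha_by_topic, eta_by_topic and top_topics_coh options can be illustrated with two small helpers. This is a sketch of the behaviour as described in the parameter list, not the library's code. The function names effective_hyperparameters and model_coherence are hypothetical.

```python
def effective_hyperparameters(n_topics, alpha=50, alpha_by_topic=True,
                              eta=0.1, eta_by_topic=False):
    """Return the (alpha, eta) values after optional division by n_topics.

    Hypothetical helper, for illustration only.
    """
    alpha_eff = alpha / n_topics if alpha_by_topic else alpha
    eta_eff = eta / n_topics if eta_by_topic else eta
    return alpha_eff, eta_eff


def model_coherence(topic_coherences, top_topics_coh=5):
    """Average of the `top_topics_coh` highest per-topic coherence values.

    Hypothetical helper, for illustration only.
    """
    top = sorted(topic_coherences, reverse=True)[:top_topics_coh]
    return sum(top) / len(top)
```

With the defaults, a model with 25 topics would use alpha = 50 / 25 = 2.0 and eta = 0.1.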

References

McCallum, A. K. (2002). Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Run Latent Dirichlet Allocation using Gibbs Sampling as described in Griffiths and Steyvers, 2004.

Parameters:

cistopic_obj: CistopicObject: A CistopicObject. Note that cells/regions have to be filtered before running any LDA model.
n_topics: list of int: A list containing the number of topics to use in each model.
n_cpu: int, optional: Number of CPUs to use for modelling. In this function parallelization is done per model, that is, each model runs entirely on a single CPU. We recommend setting the number of CPUs to the number of models that will be inferred, so all models start at the same time.
n_iter: int, optional: Number of iterations for which the Gibbs sampler will be run. Default: 150.
random_state: int, optional: Random seed to initialize the models. Default: 555.
alpha: float, optional: Scalar value indicating the symmetric Dirichlet hyperparameter for topic proportions. Default: 50.
alpha_by_topic: bool, optional: Boolean indicating whether the scalar given in alpha has to be divided by the number of topics. Default: True
eta: float, optional: Scalar value indicating the symmetric Dirichlet hyperparameter for topic multinomials. Default: 0.1.
eta_by_topic: bool, optional: Boolean indicating whether the scalar given in eta has to be divided by the number of topics. Default: False
top_topics_coh: int, optional: Number of topics to use to calculate the model coherence. For each model, the coherence will be calculated as the average of the top coherence values. Default: 5.
save_path: str, optional: Path to save models as independent files as they are completed. This is recommended for large data sets. Default: None.

References

Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National academy of Sciences, 101(suppl 1), 5228-5235.

Run Latent Dirichlet Allocation per model as implemented in Mallet (McCallum, 2002).

Parameters:

cistopic_obj: CistopicObject: A CistopicObject. Note that cells/regions have to be filtered before running any LDA model.
n_topics: list of int: A list containing the number of topics to use in each model.
n_cpu: int, optional: Number of CPUs to use for modelling. In this function parallelization is done per model, that is, each model runs entirely on a single CPU. We recommend setting the number of CPUs to the number of models that will be inferred, so all models start at the same time.
n_iter: int, optional: Number of iterations for which the Gibbs sampler will be run. Default: 150.
random_state: int, optional: Random seed to initialize the models. Default: 555.
alpha: float, optional: Scalar value indicating the symmetric Dirichlet hyperparameter for topic proportions. Default: 50.
alpha_by_topic: bool, optional: Boolean indicating whether the scalar given in alpha has to be divided by the number of topics. Default: True
eta: float, optional: Scalar value indicating the symmetric Dirichlet hyperparameter for topic multinomials. Default: 0.1.
eta_by_topic: bool, optional: Boolean indicating whether the scalar given in eta has to be divided by the number of topics. Default: False
top_topics_coh: int, optional: Number of topics to use to calculate the model coherence. For each model, the coherence will be calculated as the average of the top coherence values. Default: 5.
tmp_path: str, optional: Path to a temporary folder for Mallet. Default: None.
save_path: str, optional: Path to save models as independent files as they are completed. This is recommended for large data sets. Default: None.
reuse_corpus: bool, optional: Whether to reuse the mallet corpus in the tmp directory. Default: False
mallet_path: str: Path to Mallet binary (e.g. “/xxx/Mallet/bin/mallet”). Default: “mallet”.

References

McCallum, A. K. (2002). Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Clustering & visualization

pycisTopic.clust_vis.cell_topic_heatmap(cistopic_obj: CistopicObject, variables: List[str] | None = None, remove_nan: bool | None = True, scale: bool | None = False, cluster_topics: bool | None = False, color_dict: Dict[str, Dict[str, str]] | None = {}, seed: int | None = 555, legend_loc_x: float | None = 1.2, legend_loc_y: float | None = -0.5, legend_dist_y: float | None = -1, figsize: Tuple[float, float] | None = (6.4, 4.8), selected_topics: List[int] | None = None, selected_cells: List[str] | None = None, harmony: bool | None = False, save: str | None = None)[source]

Plot heatmap with cell-topic distributions.

Parameters

cistopic_obj: class::CistopicObject: A cisTopic object with a model in class::CistopicObject.selected_model.

variables: list: List of variables to plot. They should be included in class::CistopicObject.cell_data and class::CistopicObject.region_data, depending on which target is specified.
remove_nan: bool, optional: Whether to remove data points for which the variable value is ‘nan’. Default: True
reduction_name: str: Name of the dimensionality reduction to use
scale: bool, optional: Whether to scale the cell-topic or topic-regions contributions prior to plotting. Default: False
cluster_topics: bool, optional: Whether to cluster rows in the heatmap. Otherwise, they will be ordered based on the maximum values over the ordered cells. Default: False
color_dict: dict, optional: A dictionary containing an entry per variable, whose values are dictionaries with variable levels as keys and corresponding colors as values. Default: None
seed: int, optional: Random seed used to select random colors. Default: 555
legend_loc_x: float, optional: X location for legend. Default: 1.2
legend_loc_y: float, optional: Y location for legend. Default: -0.5
legend_dist_y: float, optional: Y distance between legends. Default: -1
figsize: tuple, optional: Size of the figure. Default: (6.4, 4.8)
selected_topics: list, optional: A list with selected topics to be used for plotting. Default: None (use all topics)
selected_cells: list, optional: A list with selected cells to plot. Default: None (use all cells)
harmony: bool, optional: If target is ‘cell’, whether to use harmony processed topic contributions. Default: False
save: str, optional: Path to save plot. Default: None.

Perform Leiden cell or region clustering and add the results to the cisTopic object’s metadata.

Parameters

cistopic_obj: class::CistopicObject: A cisTopic object with a model in class::CistopicObject.selected_model.
target: str, optional: Whether cells (‘cell’) or regions (‘region’) should be clustered. Default: ‘cell’
k: int, optional: Number of neighbours in the k-neighbours graph. Default: 10
res: float, optional: Resolution parameter for the leiden algorithm step. Default: 0.6
seed: int, optional: Seed parameter for the leiden algorithm step. Default: 555
scale: bool, optional: Whether to scale the cell-topic or topic-regions contributions prior to the clustering. Default: False
prefix: str, optional: Prefix to add to the clustering name when adding it to the correspondent metadata attribute. Default: ‘’
selected_topics: list, optional: A list with selected topics to be used for clustering. Default: None (use all topics)
selected_features: list, optional: A list with selected features (cells or regions) to cluster. This is recommended when working with regions (e.g. selecting regions in binarized topics), as working with all regions can be time consuming. Default: None (use all features)
harmony: bool, optional: If target is ‘cell’, whether to use harmony processed topic contributions. Default: False.
rna_components: pd.DataFrame, optional: A pandas dataframe containing RNA dimensionality reduction (e.g. PCA) components. If provided, both layers (atac and rna) will be considered for clustering.
use_umap_integration: bool, optional: Whether to use a weighted UMAP representation for the clustering or directly integrating the two graphs. Default: True
rna_weight: float, optional: Weight of the RNA layer on the clustering (only applicable when clustering via UMAP). Default: 0.5 (same weight)

pycisTopic.clust_vis.harmony(cistopic_obj: CistopicObject, vars_use: List[str], scale: bool | None = True, random_state: int | None = 555, **kwargs)[source]

Apply harmony batch effect correction (Korsunsky et al., 2019) over the cell-topic distribution.

Parameters

cistopic_obj: class::CistopicObject: A cisTopic object with a model in class::CistopicObject.selected_model.
vars_use: list: List of variables to correct batch effect with.
scale: bool, optional: Whether to scale probability matrix prior to correction. Default: True
random_state: int, optional: Random seed to use with harmony. Default: 555

References

Korsunsky, I., Millard, N., Fan, J., Slowikowski, K., Zhang, F., Wei, K., … & Raychaudhuri, S. (2019). Fast, sensitive and accurate integration of single-cell data with Harmony. Nature methods, 16(12), 1289-1296.

pycisTopic.clust_vis.input_check(atac_topics: DataFrame, rna_pca: DataFrame)[source]

A function to select cells present in both the RNA and the ATAC layers.
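
The idea of keeping only the cells shared by both layers can be shown with a minimal sketch on plain lists of cell names. The real function takes DataFrames; the helper name select_shared_cells and the choice to keep the ATAC order are assumptions for illustration.

```python
def select_shared_cells(atac_cells, rna_cells):
    """Return the cells found in both layers, in ATAC order.

    Hypothetical helper, for illustration only.
    """
    rna_set = set(rna_cells)
    return [cell for cell in atac_cells if cell in rna_set]
```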

pycisTopic.clust_vis.plot_imputed_features(cistopic_obj: CistopicObject, reduction_name: str, imputed_data: cisTopicImputedFeatures, features: List[str], scale: bool | None = False, cmap: str | matplotlib.cm | None = <matplotlib.colors.ListedColormap object>, dot_size: int | None = 10, alpha: float | int | None = 1, selected_cells: List[str] | None = None, figsize: Tuple[float, float] | None = (6.4, 4.8), num_columns: int | None = 1, save: str | None = None)[source]

Plot imputed features into dimensionality reduction.

Parameters

cistopic_obj: class::CistopicObject: A cisTopic object with dimensionality reductions in class::CistopicObject.projections.
reduction_name: str: Name of the dimensionality reduction to use
imputed_data: class::cisTopicImputedFeatures: A class::cisTopicImputedFeatures object derived from the input cisTopic object.
features: list: Names of the features to plot.
scale: bool, optional: Whether to scale the imputed features prior to plotting. Default: False
cmap: str or ‘matplotlib.cm’, optional: For continuous variables, color map to use for the legend color bar. Default: cm.viridis
dot_size: int, optional: Dot size in the plot. Default: 10
alpha: float, optional: Transparency value for the dots in the plot. Default: 1
selected_cells: list, optional: A list with selected cells to plot. Default: None (use all cells)
figsize: tuple, optional: Size of the figure. If num_columns is 1, this is the size for each figure; if num_columns is above 1, this is the overall size of the figure (if keeping default, it will be the size of each subplot in the figure). Default: (6.4, 4.8)
num_columns: int, optional: For multiplot figures, indicates the number of columns (the number of rows will be automatically determined based on the number of plots). Default: 1
save: str, optional: Path to save plot. Default: None.

pycisTopic.clust_vis.plot_metadata(cistopic_obj: CistopicObject, reduction_name: str, variables: List[str], target: str | None = 'cell', remove_nan: bool | None = True, show_label: bool | None = True, show_legend: bool | None = False, cmap: str | matplotlib.cm | None = <matplotlib.colors.ListedColormap object>, dot_size: int | None = 10, text_size: int | None = 10, alpha: float | int | None = 1, seed: int | None = 555, color_dictionary: Dict[str, str] | None = {}, figsize: Tuple[float, float] | None = (6.4, 4.8), num_columns: int | None = 1, selected_features: List[str] | None = None, save: str | None = None)[source]

Plot categorical and continuous metadata into dimensionality reduction.

Parameters

cistopic_obj: class::CistopicObject: A cisTopic object with dimensionality reductions in class::CistopicObject.projections.
reduction_name: str: Name of the dimensionality reduction to use
variables: list: List of variables to plot. They should be included in class::CistopicObject.cell_data or class::CistopicObject.region_data, depending on which target is specified.
target: str, optional: Whether cells (‘cell’) or regions (‘region’) should be used. Default: ‘cell’
remove_nan: bool, optional: Whether to remove data points for which the variable value is ‘nan’. Default: True
show_label: bool, optional: For categorical variables, whether to show the label in the plot. Default: True
show_legend: bool, optional: For categorical variables, whether to show the legend next to the plot. Default: False
cmap: str or ‘matplotlib.cm’, optional: For continuous variables, color map to use for the legend color bar. Default: cm.viridis
dot_size: int, optional: Dot size in the plot. Default: 10
text_size: int, optional: For categorical variables and if show_label is True, size of the labels in the plot. Default: 10
alpha: float, optional: Transparency value for the dots in the plot. Default: 1
seed: int, optional: Random seed used to select random colors. Default: 555
color_dictionary: dict, optional: A dictionary containing an entry per variable, whose values are dictionaries with variable levels as keys and corresponding colors as values. Default: None
figsize: tuple, optional: Size of the figure. If num_columns is 1, this is the size for each figure; if num_columns is above 1, this is the overall size of the figure (if keeping default, it will be the size of each subplot in the figure). Default: (6.4, 4.8)
num_columns: int, optional: For multiplot figures, indicates the number of columns (the number of rows will be automatically determined based on the number of plots). Default: 1
selected_features: list, optional: A list with selected features (cells or regions) to plot. This is recommended when working with regions (e.g. selecting regions in binarized topics), as working with all regions can be time consuming. Default: None (use all features)
save: str, optional: Path to save plot. Default: None.

pycisTopic.clust_vis.plot_topic(cistopic_obj: CistopicObject, reduction_name: str, target: str | None = 'cell', cmap: str | matplotlib.cm | None = <matplotlib.colors.ListedColormap object>, dot_size: int | None = 10, alpha: float | int | None = 1, scale: bool | None = False, selected_topics: List[int] | None = None, selected_features: List[str] | None = None, harmony: bool | None = False, figsize: Tuple[float, float] | None = (6.4, 4.8), num_columns: int | None = 1, save: str | None = None)[source]

Plot topic distributions into dimensionality reduction.

Parameters

cistopic_obj: class::CistopicObject: A cisTopic object with dimensionality reductions in class::CistopicObject.projections.
reduction_name: str: Name of the dimensionality reduction to use
target: str, optional: Whether cells (‘cell’) or regions (‘region’) should be used. Default: ‘cell’
cmap: str or ‘matplotlib.cm’, optional: For continuous variables, color map to use for the legend color bar. Default: cm.viridis
dot_size: int, optional: Dot size in the plot. Default: 10
alpha: float, optional: Transparency value for the dots in the plot. Default: 1
scale: bool, optional: Whether to scale the cell-topic or topic-regions contributions prior to plotting. Default: False
selected_topics: list, optional: A list with selected topics to be used for plotting. Default: None (use all topics)
selected_features: list, optional: A list with selected features (cells or regions) to plot. This is recommended when working with regions (e.g. selecting regions in binarized topics), as working with all regions can be time consuming. Default: None (use all features)
harmony: bool, optional: If target is ‘cell’, whether to use harmony processed topic contributions. Default: False
figsize: tuple, optional: Size of the figure. If num_columns is 1, this is the size for each figure; if num_columns is above 1, this is the overall size of the figure (if keeping default, it will be the size of each subplot in the figure). Default: (6.4, 4.8)
num_columns: int, optional: For multiplot figures, indicates the number of columns (the number of rows will be automatically determined based on the number of plots). Default: 1
save: str, optional: Path to save plot. Default: None.

Run tSNE and add it to the dimensionality reduction dictionary. If FItSNE is installed it will be used, otherwise sklearn TSNE implementation will be used.

Parameters

cistopic_obj: class::CistopicObject: A cisTopic object with a model in class::CistopicObject.selected_model.
target: str, optional: Whether cells (‘cell’) or regions (‘region’) should be used. Default: ‘cell’
scale: bool, optional: Whether to scale the cell-topic or topic-regions contributions prior to the dimensionality reduction. Default: False
reduction_name: str, optional: Reduction name to use as key in the dimensionality reduction dictionary. Default: ‘tSNE’
random_state: int, optional: Seed parameter for running tSNE. Default: 555
perplexity: int, optional: Perplexity parameter for FItSNE. Default: 30
selected_topics: list, optional: A list with selected topics to be used for clustering. Default: None (use all topics)
selected_features: list, optional: A list with selected features (cells or regions) to cluster. This is recommended when working with regions (e.g. selecting regions in binarized topics), as working with all regions can be time consuming. Default: None (use all features)
harmony: bool, optional: If target is ‘cell’, whether to use harmony processed topic contributions. Default: False
rna_components: pd.DataFrame, optional: A pandas dataframe containing RNA dimensionality reduction (e.g. PCA) components. If provided, both layers (atac and rna) will be considered for clustering.
rna_weight: float, optional: Weight of the RNA layer on the clustering (only applicable when clustering via UMAP). Default: 0.5 (same weight)
**kwargs: Parameters to pass to fitsne.FItSNE or sklearn.manifold.TSNE.

Run UMAP and add it to the dimensionality reduction dictionary.

Parameters

cistopic_obj: class::CistopicObject: A cisTopic object with a model in class::CistopicObject.selected_model.
target: str, optional: Whether cells (‘cell’) or regions (‘region’) should be used. Default: ‘cell’
scale: bool, optional: Whether to scale the cell-topic or topic-regions contributions prior to the dimensionality reduction. Default: False
reduction_name: str, optional: Reduction name to use as key in the dimensionality reduction dictionary. Default: ‘UMAP’
random_state: int, optional: Seed parameter for running UMAP. Default: 555
selected_topics: list, optional: A list with selected topics to be used for clustering. Default: None (use all topics)
selected_features: list, optional: A list with selected features (cells or regions) to cluster. This is recommended when working with regions (e.g. selecting regions in binarized topics), as working with all regions can be time consuming. Default: None (use all features)
harmony: bool, optional: If target is ‘cell’, whether to use harmony processed topic contributions. Default: False.
rna_components: pd.DataFrame, optional: A pandas dataframe containing RNA dimensionality reduction (e.g. PCA) components. If provided, both layers (atac and rna) will be considered for clustering.
rna_weight: float, optional: Weight of the RNA layer on the clustering (only applicable when clustering via UMAP). Default: 0.5 (same weight)
**kwargs: Parameters to pass to umap.UMAP.

pycisTopic.clust_vis.weighted_integration(atac_topics: DataFrame, rna_pca: DataFrame, common_cells: List[str], weight=0.5, **kwargs)[source]: A function for weighted integration via UMAP

Drop-out imputation & Differential features

class pycisTopic.diff_features.CistopicImputedFeatures(imputed_acc: csr_matrix, feature_names: List[str], cell_names: List[str], project: str)[source]

cisTopic imputation data class.

CistopicImputedFeatures contains the cell by features matrices (stored at mtx, with features being either regions or genes), cell names cell_names and feature names feature_names.

Attributes

mtx: sparse.csr_matrix: A matrix containing imputed values.
cell_names: list: A list containing cell names.
feature_names: list: A list containing feature names.
project: str: Name of the cisTopic imputation project.

make_rankings(seed=123)[source]

A function to generate rankings per cell based on the imputed accessibility scores per region.

Parameters

seed: int, optional: Random seed to ensure reproducibility of the rankings when there are ties

Return

CistopicImputedFeatures
A CistopicImputedFeatures containing ranking values rather than scores.
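
The idea of seeded tie-breaking can be illustrated with a minimal NumPy sketch. It is not the library's implementation: the function name, the row-per-cell layout and the convention that the highest score gets rank 0 are all assumptions made for illustration.

```python
import numpy as np

def rank_scores_with_random_ties(scores, seed=123):
    """Rank each row from highest score (rank 0) to lowest, breaking ties at random.

    Hypothetical helper: each row is treated as one cell and each column
    as one feature, which may differ from the layout used by the library.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    rankings = np.empty(scores.shape, dtype=np.int64)
    for i, row in enumerate(scores):
        # np.lexsort sorts by the LAST key first: -row is the primary key
        # (descending score), the random key only decides between ties.
        tie_breaker = rng.random(row.size)
        order = np.lexsort((tie_breaker, -row))
        # order[j] is the column holding rank j, so invert the permutation.
        rankings[i, order] = np.arange(row.size)
    return rankings
```

With a fixed seed, repeated calls on the same input return the same ranking, which is the property the seed parameter is meant to guarantee.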

merge(cistopic_imputed_features_list: List[CistopicImputedFeatures], project: str | None = 'cisTopic_impute_merge', copy: bool | None = False)[source]

Merge a list of CistopicImputedFeatures to the input CistopicImputedFeatures. Reference coordinates (for regions) must be the same between the objects.

Parameters

cistopic_imputed_features_list: list: A list containing one or more CistopicImputedFeatures to merge.
project: str, optional: Name of the cisTopic imputation project.
copy: bool, optional: Whether a new object should be returned (True) or the changes should be applied to the input CistopicImputedFeatures (False). Default: False

Return

CistopicImputedFeatures: A combined CistopicImputedFeatures.

subset(cells: List[str] | None = None, features: List[str] | None = None, copy: bool | None = False, split_pattern: str | None = '___')[source]

Subset cells and/or regions from CistopicImputedFeatures.

Parameters

cells: list, optional: A list containing the names of the cells to keep.
features: list, optional: A list containing the names of the features to keep.
copy: bool, optional: Whether a new object should be returned (True) or the changes should be applied to the input CistopicImputedFeatures (False). Default: False
split_pattern: str: Pattern to split cell barcode from sample id. Default: ___

pycisTopic.diff_features.find_diff_features(cistopic_obj: CistopicObject, imputed_features_obj: CistopicImputedFeatures, variable: str, var_features: List[str] | None = None, contrasts: List[List[str]] | None = None, adjpval_thr: float | None = 0.05, log2fc_thr: float | None = 0.5849625007211562, split_pattern: str | None = '___', n_cpu: int | None = 1, **kwargs)[source]

Find differential imputed features.

Parameters

cistopic_obj: class::CistopicObject: A cisTopic object including the cells in imputed_features_obj.
imputed_features_obj: CistopicImputedFeatures: A cisTopic imputation data object.
variable: str: Name of the group variable to do comparison. It must be included in class::CistopicObject.cell_data
var_features: list, optional: A list of features to use (e.g. variable features from find_highly_variable_features())
contrasts: List, optional: A list including contrasts to make in the form of lists with foreground and background, e.g. [[['Group_1'], ['Group_2', 'Group_3']], [['Group_2'], ['Group_1', 'Group_3']], [['Group_1'], ['Group_2', 'Group_3']]]. Default: None.
adjpval_thr: float, optional: Adjusted p-values threshold. Default: 0.05
log2fc_thr: float, optional: Log2FC threshold. Default: np.log2(1.5)
split_pattern: str: Pattern to split cell barcode from sample id. Default: ___
n_cpu: int, optional: Number of cores to use. Default: 1
**kwargs: Parameters to pass to ray.init()
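
The nested structure expected by contrasts (a list of [foreground, background] pairs, each side being a list of group labels) can be built programmatically. The helper below is a hypothetical convenience function, not part of pycisTopic; it produces one-versus-rest contrasts from a list of group labels.

```python
def one_vs_rest_contrasts(groups):
    """Return one [[foreground], [background]] pair per distinct group label.

    Hypothetical helper for illustration only.
    """
    labels = sorted(set(groups))
    return [[[g], [other for other in labels if other != g]] for g in labels]

contrasts = one_vs_rest_contrasts(["Group_1", "Group_2", "Group_3"])
```

The resulting list has the same shape as the example given for the contrasts parameter and could be passed to it directly.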

Find highly variable features.

Parameters

input_mat: pd.DataFrame or CistopicImputedFeatures: A dataframe with values to be normalized or cisTopic imputation data.
min_disp: float, optional: Minimum dispersion value for a feature to be selected. Default: 0.05
min_mean: float, optional: Minimum mean value for a feature to be selected. Default: 0.0125
max_disp: float, optional: Maximum dispersion value for a feature to be selected. Default: np.inf
max_mean: float, optional: Maximum mean value for a feature to be selected. Default: 3
n_bins: int, optional: Number of bins for binning the mean gene expression. Normalization is done with respect to each bin. Default: 20
n_top_features: int, optional: Number of highly-variable features to keep. If specified, dispersion and mean thresholds will be ignored. Default: None
plot: bool, optional: Whether to plot dispersion versus mean values. Default: True.
save: str, optional: Path to save feature selection plot. Default: None

pycisTopic.diff_features.get_log2_fc(fg_mat, bg_mat)[source]

Calculate log2 fold change between foreground and background matrix.

Parameters:

fg_mat: 2D-numpy foreground matrix.
bg_mat: 2D-numpy background matrix.
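
As a rough illustration of a log2 fold change between two matrices, the sketch below compares per-feature means. It assumes features as rows and cells as columns, and it adds a small pseudocount to avoid dividing by zero; both the layout and the pseudocount value are assumptions, not documented behaviour of get_log2_fc.

```python
import numpy as np

def log2_fold_change(fg_mat, bg_mat, pseudocount=1e-12):
    """Log2 ratio of per-row means (foreground over background).

    Hypothetical sketch: rows are features, columns are cells.
    """
    fg_mean = np.mean(np.asarray(fg_mat, dtype=float), axis=1)
    bg_mean = np.mean(np.asarray(bg_mat, dtype=float), axis=1)
    # The pseudocount keeps the ratio finite when a mean is exactly zero.
    return np.log2((fg_mean + pseudocount) / (bg_mean + pseudocount))
```

A feature whose foreground mean is twice its background mean gets a value close to 1, and a feature with equal means gets a value close to 0.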

pycisTopic.diff_features.get_wilcox_test_pvalues(fg_mat, bg_mat)[source]

Calculate Wilcoxon test p-values between foreground and background matrix.

Parameters:

fg_mat: 2D-numpy foreground matrix.
bg_mat: 2D-numpy background matrix.

pycisTopic.diff_features.impute_accessibility(cistopic_obj: CistopicObject, selected_cells: List[str] | None = None, selected_regions: List[str] | None = None, scale_factor: int | None = 1000000, chunk_size: int = 20000, project: str | None = 'cisTopic_Impute')[source]

Impute region accessibility.

Parameters:

cistopic_obj: class::CistopicObject: A cisTopic object with a model in class::CistopicObject.selected_model.
selected_cells: list, optional: A list with selected cells to impute accessibility for. Default: None
selected_regions: list, optional: A list with selected regions to impute accessibility for. Default: None
scale_factor: int, optional: A number to multiply the imputed values by. This is useful to convert low probabilities to 0, making the matrix more sparse. Default: 10**6.
chunk_size: int: Chunk size used (number of regions for which imputed accessibility is calculated at the same time). Default: 20000
project: str, optional: Name of the cisTopic imputation project. Default: cisTopic_Impute.

pycisTopic.diff_features.markers(input_mat: DataFrame | CistopicImputedFeatures, barcode_group: List[List[str]], contrast_name: str, adjpval_thr: float | None = 0.05, log2fc_thr: float | None = 1, n_cpu: int | None = 1)[source]

Find differential imputed features.

Parameters:

input_mat: pd.DataFrame or class::CistopicImputedFeatures: A data frame or a cisTopic imputation data object.
barcode_group: List: List of length 2, including foreground cells on the first slot and background on the second.
contrast_name: str: Name of the contrast
adjpval_thr: float, optional: Adjusted p-values threshold. Default: 0.05
log2fc_thr: float, optional: Log2FC threshold. Default: 1
n_cpu: int, optional: Number of cores to use. Default: 1

pycisTopic.diff_features.mean_axis1(arr)[source]

Calculate the mean along axis 1 (one mean per row) of a 2D-numpy matrix with numba, mimicking np.mean(x, axis=1).

Parameters:

arr: 2D-numpy array to calculate the per-row mean for.

pycisTopic.diff_features.normalize_scores(imputed_acc: DataFrame | CistopicImputedFeatures, scale_factor: int = 10000)[source]

Log-normalize imputation data. Feature counts for each cell are divided by the total counts for that cell and multiplied by the scale_factor.

Parameters:

imputed_acc: pd.DataFrame or class::CistopicImputedFeatures: A dataframe with values to be normalized or cisTopic imputation data.
scale_factor: int: Scale factor for cell-level normalization. Default: 10**4
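
The normalization described above can be sketched in NumPy as follows. The sketch assumes cells are columns and uses log(1 + x) as the log transform; neither assumption is confirmed by this page, and the function name is hypothetical.

```python
import numpy as np

def log_normalize(matrix, scale_factor=10_000):
    """Divide each cell's values by the cell total, scale, then apply log1p.

    Hypothetical sketch: columns are cells, rows are features.
    Cells with a total of zero would produce NaN and should be removed first.
    """
    matrix = np.asarray(matrix, dtype=float)
    totals = matrix.sum(axis=0, keepdims=True)  # one total per cell (column)
    return np.log1p(matrix / totals * scale_factor)
```

After this transform, cells with different total signal become comparable, because each column is first rescaled to the same total before the logarithm is taken.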

pycisTopic.diff_features.p_adjust_bh(p: float)[source]: Benjamini-Hochberg p-value correction for multiple hypothesis testing.
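
For reference, the Benjamini-Hochberg procedure itself can be written in a few lines of NumPy. This is a generic sketch of the standard method with a hypothetical function name, not a copy of p_adjust_bh.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)[::-1]       # indices from largest to smallest p
    ranks = np.arange(n, 0, -1)       # matching ranks n, n-1, ..., 1
    # Scale each p-value by n / rank, then enforce monotonicity by taking
    # the running minimum from the largest p-value downwards.
    adjusted = np.minimum.accumulate(p[order] * n / ranks)
    adjusted = np.minimum(adjusted, 1.0)
    result = np.empty(n, dtype=float)
    result[order] = adjusted
    return result
```

For example, the p-values 0.01, 0.04, 0.03 and 0.005 are adjusted to 0.02, 0.04, 0.04 and 0.02.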

pycisTopic.diff_features.subset_array_second_axis(arr, col_indices)[source]

Subset array by second axis based on provided col_indices.

Returns the same as arr[:, col_indices], but is much faster when arr and col_indices are big.

Parameters:

arr: 2D-numpy array to subset by provided column indices.
col_indices: 1D-numpy array (preferably with np.int64 as dtype) with column indices.
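
The plain NumPy expression that this function is documented to reproduce looks like this (toy data, chosen only for illustration):

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)               # 3 rows, 4 columns
col_indices = np.array([3, 0], dtype=np.int64)  # keep column 3, then column 0

# Fancy indexing on the second axis returns a new array with the chosen
# columns in the requested order.
subset = arr[:, col_indices]
```

Here subset has shape (3, 2), and its first column is the original last column.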

Topic binarization

pycisTopic.topic_binarization.binarize_topics(cistopic_obj: CistopicObject, target: str | None = 'region', method: str | None = 'otsu', smooth_topics: bool = True, ntop: int = 2000, predefined_thr: dict[str, float] | None = None, nbins: int = 100, plot: bool = False, figsize: tuple[float, float] | None = (6.4, 4.8), num_columns: int = 1, save: str | None = None)[source]

Binarize topic distributions.

Parameters:

cistopic_obj

A cisTopic object with a model in CistopicObject.selected_model.

target

Whether cell-topic (“cell”) or region-topic (“region”) distributions should be binarized. Default: “region”.

method

Method to use for topic binarization. Possible options are:

otsu [Otsu, 1979]
yen [Yen et al., 1995]
li [Li & Lee, 1993]
aucell [Van de Sande et al., 2020]
ntop [Taking the top n regions per topic]

Default: otsu.

smooth_topics

Whether to smooth topics distributions to penalize regions enriched across many topics. The following formula is applied:

\[\beta_{w, k} (\log\beta_{w,k} - 1 / K \sum_{k'} \log \beta_{w,k'})\]

ntop

Number of top regions to select when using method="ntop". Default: 2000.

predefined_thr

A dictionary containing topics as keys and threshold as values. If a topic is not present, thresholds will be computed with the specified method. This can be used for manually adjusting thresholds when necessary. Default: None.

nbins

Number of bins to use in the histogram used for otsu, yen and li thresholding. Default: 100.

plot

Whether to plot region-topic distributions and their threshold. Default: False.

figsize

Size of the figure. If num_columns is 1, this is the size for each figure. If num_columns is above 1, this is the overall size of the figure. If keeping the default, it will be the size of each subplot in the figure. Default: (6.4, 4.8).

num_columns

For multiplot figures, indicates the number of columns (the number of rows will be automatically determined based on the number of plots). Default: 1.

save

Path to save plot. Default: None.

Returns:

A dictionary containing, for each topic, a pd.DataFrame with the selected regions, with region names as indexes and a topic score column.
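
For method="ntop", the selection amounts to keeping the highest-scoring regions of each topic. A minimal sketch of that idea, using a plain NumPy array and returning lists of (region, score) pairs instead of data frames, could look as follows; the function name and the return structure are assumptions made for illustration.

```python
import numpy as np

def top_regions_per_topic(scores, region_names, topic_names, ntop=2000):
    """Keep the `ntop` highest-scoring regions for every topic.

    Hypothetical sketch: `scores` is a regions x topics array.
    """
    scores = np.asarray(scores, dtype=float)
    selected = {}
    for k, topic in enumerate(topic_names):
        # Sort region indices by descending score and keep the first ntop.
        order = np.argsort(scores[:, k])[::-1][:ntop]
        selected[topic] = [(region_names[i], float(scores[i, k])) for i in order]
    return selected
```

The result is keyed by topic, mirroring the dictionary-per-topic structure described under Returns.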

References

Otsu, N., 1979. A threshold selection method from gray-level histograms. IEEE transactions on systems, man, and cybernetics, 9(1), pp.62-66.
Yen, J.C., Chang, F.J. and Chang, S., 1995. A new criterion for automatic multilevel thresholding. IEEE Transactions on Image Processing, 4(3), pp.370-378.
Li, C.H. and Lee, C.K., 1993. Minimum cross entropy thresholding. Pattern recognition, 26(4), pp.617-625.
Van de Sande, B., Flerin, C., Davie, K., De Waegeneer, M., Hulselmans, G., Aibar, S., Seurinck, R., Saelens, W., Cannoodt, R., Rouchon, Q. and Verbeiren, T., 2020. A scalable SCENIC workflow for single-cell gene regulatory network analysis. Nature Protocols, 15(7), pp.2247-2276.

pycisTopic.topic_binarization.cross_entropy(array: ndarray, threshold: float, nbins: int = 100) → float[source]

Calculate entropies for Li thresholding on topic-region distributions [Li & Lee, 1993].

Parameters:

array: Array containing the region values for the topic to be binarized.
threshold: Distribution threshold to calculate entropy from.
nbins: Number of bins to use in the binarization histogram.

Returns:

Entropy for the given threshold.

pycisTopic.topic_binarization.histogram_and_bin_centers(array: ndarray, nbins: int = 100) → tuple[ndarray, ndarray][source]

Draw histogram from distribution and identify centers.

Parameters:

array: Scores distribution.
nbins: Number of bins to use in the histogram.

Returns:

Histogram values and bin centers.
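
A histogram together with its bin centers can be obtained from np.histogram as sketched below. The function name is hypothetical and the library's own implementation may differ in details such as the value range.

```python
import numpy as np

def histogram_with_centers(values, nbins=100):
    """Return histogram counts and the center of each bin."""
    flat = np.asarray(values, dtype=float).ravel()
    counts, edges = np.histogram(flat, bins=nbins)
    # np.histogram returns nbins + 1 edges; each center is the midpoint
    # between two consecutive edges.
    centers = (edges[:-1] + edges[1:]) / 2.0
    return counts, centers
```

The centers, rather than the edges, are what histogram-based thresholding methods usually use as candidate thresholds.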

pycisTopic.topic_binarization.smooth_topics_distributions(topic_region_distributions: DataFrame) → DataFrame[source]

Smooth topic-region distributions.

Smooth topics distributions to penalize regions enriched across many topics. The formula applied is:

\[\beta_{w, k} (\log\beta_{w,k} - 1 / K \sum_{k'} \log \beta_{w,k'})\]

Parameters:

topic_region_distributions: A pandas dataframe with topic-region distributions (with topics as columns and regions as rows).

Returns:

Smoothed topic-region dataframe.
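
The smoothing formula can be written directly in NumPy. The sketch below assumes a regions x topics array of strictly positive probabilities and the natural logarithm; how the library handles zeros and which log base it uses are not stated here.

```python
import numpy as np

def smooth_topic_region(beta):
    """Apply beta_{w,k} * (log beta_{w,k} - (1/K) * sum_k' log beta_{w,k'}).

    Hypothetical sketch: rows are regions (w), columns are topics (k).
    """
    beta = np.asarray(beta, dtype=float)
    log_beta = np.log(beta)
    # Average log-probability of each region across the K topics.
    mean_log = log_beta.mean(axis=1, keepdims=True)
    return beta * (log_beta - mean_log)
```

A region with the same probability in every topic is mapped to zero everywhere, which is how the transform penalizes regions enriched across many topics.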

pycisTopic.topic_binarization.threshold_otsu(array: ndarray, nbins: int = 100) → float[source]

Apply Otsu threshold on topic-region distributions [Otsu, 1979].

Parameters:

array: Array containing the region values for the topic to be binarized.
nbins: Number of bins to use in the binarization histogram.

Returns:

Binarization threshold.
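
Otsu's method picks the threshold that maximizes the between-class variance of the histogram. The sketch below follows the common histogram-based formulation; it is an illustration with a hypothetical function name, and it assumes the input contains at least two distinct values.

```python
import numpy as np

def otsu_threshold(values, nbins=100):
    """Return the bin center that maximizes between-class variance."""
    flat = np.asarray(values, dtype=float).ravel()
    counts, edges = np.histogram(flat, bins=nbins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    counts = counts.astype(float)
    # Cumulative weights of the low class (left of the split) and the
    # high class (right of the split) for every candidate split.
    weight_low = np.cumsum(counts)
    weight_high = np.cumsum(counts[::-1])[::-1]
    # Mean value of each class for every candidate split.
    mean_low = np.cumsum(counts * centers) / weight_low
    mean_high = (np.cumsum((counts * centers)[::-1]) / weight_high[::-1])[::-1]
    # Between-class variance when splitting after bin i.
    variance = weight_low[:-1] * weight_high[1:] * (mean_low[:-1] - mean_high[1:]) ** 2
    return centers[np.argmax(variance)]
```

On a clearly bimodal distribution the returned threshold falls in the gap between the two modes.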

pycisTopic.topic_binarization.threshold_yen(array: ndarray, nbins: int = 100) → float[source]

Apply Yen threshold on topic-region distributions [Yen et al., 1995].

Parameters:

array: Array containing the region values for the topic to be binarized.
nbins: Number of bins to use in the binarization histogram.

Returns:

Binarization threshold.

Topic QC

pycisTopic.topic_qc.compute_topic_metrics(cistopic_obj: CistopicObject, return_metrics: bool | None = True)[source]

Compute topic quality control metrics.

Parameters

cistopic_obj: class::CistopicObject: A cisTopic object with a model in class::CistopicObject.selected_model.
return_metrics: bool, optional: Whether to return metrics as class::pd.DataFrame. The metrics will also be appended to class::CistopicObject.selected_model.topic_qc_metrics regardless of the value of this parameter. Default: True.

References

Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 262-272).

pycisTopic.topic_qc.gini_coefficient(x)[source]: Compute Gini coefficient of array of values
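
One standard way to compute a Gini coefficient from sorted values is sketched below. It assumes non-negative values with a positive sum; the library's gini_coefficient may use a different but equivalent formulation, and the function name here is hypothetical.

```python
import numpy as np

def gini(values):
    """Gini coefficient: 0 for perfectly equal values, near 1 for maximal inequality."""
    x = np.sort(np.asarray(values, dtype=float).ravel())
    n = x.size
    index = np.arange(1, n + 1)
    # Rank-weighted sum over the sorted values, normalized by n * total.
    return np.sum((2 * index - n - 1) * x) / (n * np.sum(x))
```

In the context of topic QC, a value near 0 means the values are spread evenly, while a value near 1 means they are concentrated in a few entries.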

Plotting topic qc metrics and filtering.

Parameters

topic_qc_metrics: class::pd.DataFrame or class::CistopicObject: A topic metrics dataframe or a cisTopic object with class::CistopicObject.selected_model.topic_qc_metrics filled.
var_x: str: Metric to plot.
var_y: str, optional: A second metric to plot in combination with var_x.
min_x: float, optional: Minimum value on var_x to keep the topic. Default: None.
max_x: float, optional: Maximum value on var_x to keep the topic. Default: None.
min_y: float, optional: Minimum value on var_y to keep the topic. Default: None.
max_y: float, optional: Maximum value on var_y to keep the topic. Default: None.
var_color: str, optional: Metric to color plot by. Default: None
cmap: str, optional: Color map to color 2D dot plots by density. Default: None.
dot_size: int, optional: Dot size in the plot. Default: 10
text_size: int, optional: Size of the labels in the plot. Default: 10
plot: bool, optional: Whether the plots should be returned to the console. Default: True.
save: str, optional: Path to save plots as a file. Default: None.
return_topics: bool, optional: Whether to return selected topics based on user-given thresholds. Default: True.
return_fig: bool, optional: Whether to return the plot figure; if several samples it will return a dictionary with the figures per sample. Default: False.

Return — list

A list with the selected topics.

pycisTopic.topic_qc.topic_annotation(cistopic_obj: CistopicObject, annot_var: str, binarized_cell_topic: Dict[str, DataFrame] | None = None, general_topic_thr: float | None = 0.2, **kwargs)[source]

Automatic annotation of topics.

Parameters

cistopic_obj: class::CistopicObject: A cisTopic object with a model in class::CistopicObject.selected_model.
annot_var: str: Name of the variable (contained in ‘class::CistopicObject.cell_data’) to use for annotation
binarized_cell_topic: Dict, optional: A dictionary containing binarized cell topic distributions (from binarize_topics()). If not provided, binarize_topics() will be run. Default: None.
general_topic_thr: float, optional: Threshold for considering a topic as general. After assigning topics to annotations, the ratio of cells in the binarized topic in the whole population is compared with the ratio of the total number of cells in the assigned groups versus the whole population. If the difference is above this threshold, the topic is considered general. Default: 0.2.
**kwargs: Arguments to pass to binarize_topics()

Export to loom

pycisTopic.loom.add_annotation(loom, annots: DataFrame)[source]: A helper function to add annotations

pycisTopic.loom.add_clusterings(loom: SCopeLoom, cluster_data: DataFrame)[source]: A helper function to add clusters

pycisTopic.loom.add_markers(loom: SCopeLoom, markers_dict: Dict[str, Dict[str, DataFrame]])[source]: A helper function to add markers to clusterings

pycisTopic.loom.add_metrics(loom, metrics: DataFrame)[source]: A helper function to add metrics

pycisTopic.loom.df_to_named_matrix(df: DataFrame)[source]: A helper function to create metadata structure.

pycisTopic.loom.export_gene_activity_to_loom(gene_activity_matrix: CistopicImputedFeatures | DataFrame, cistopic_obj: CistopicObject, out_fname: str, regulons: List[Regulon] = None, selected_genes: List[str] | None = None, selected_cells: List[str] | None = None, auc_mtx: DataFrame | None = None, auc_thresholds: DataFrame | None = None, cluster_annotation: List[str] = None, cluster_markers: Dict[str, Dict[str, DataFrame]] = None, tree_structure: Sequence[str] = (), title: str = None, nomenclature: str = 'Unknown', split_pattern='___', num_workers: int = 1, **kwargs)[source]

Create SCope [Davie et al, 2018] compatible loom files for gene activity exploration

Parameters

gene_activity_matrix: class::CistopicImputedFeatures or class::pd.DataFrame: A cisTopic imputed features object containing imputed gene activity as values. Alternatively, a pandas data frame with genes as columns, cells as rows and gene activity per gene as values.
cistopic_obj: class::CistopicObject: The cisTopic object from which gene activity values have been derived. It must include cell meta data (including specified cluster annotation columns).
regulons: list: A list of regulons as derived from pySCENIC (Van de Sande et al., 2020).
out_fname: str: Path to output file.
selected_genes: list, optional: A list specifying which genes should be included in the loom file. Default: None
selected_cells: list, optional: A list specifying which cells should be included in the loom file. Default: None
auc_mtx: pd.DataFrame, optional: A regulon AUC matrix for the regulons as derived from pySCENIC (Van de Sande et al., 2020). If not provided it will be inferred.
auc_thresholds: pd.DataFrame, optional: AUC thresholds for the regulons as derived from pySCENIC (Van de Sande et al., 2020). If not provided, they will be inferred.
cluster_annotation: list, optional: A list indicating which information in cistopic_obj.cell_data should be used as clusters. The specified names must be included as columns in cistopic_obj.cell_data. Default: None.
cluster_markers: dict, optional: A dictionary including an entry per cluster annotation (which should match with the names in cluster_annotation) including a dictionary per cluster with a pandas data frame with marker regions as rows and logFC and adjusted p-values as columns (the output of find_diff_features). Default: None.
tree_structure: sequence, optional: A sequence of strings that defines the category tree structure. Needs to be a sequence of strings with three elements. Default: ()
title: str, optional: The title for this loom file. If None, then the basename of the filename is used as the title. Default: None
nomenclature: str, optional: The name of the genome. Default: ‘Unknown’
**kwargs: Additional parameters for pyscenic.export.export2loom

References

Davie, K., Janssens, J., Koldere, D., De Waegeneer, M., Pech, U., Kreft, Ł., … & Aerts, S. (2018). A single-cell transcriptome atlas of the aging Drosophila brain. Cell, 174(4), 982-998.

Van de Sande, B., Flerin, C., Davie, K., De Waegeneer, M., Hulselmans, G., Aibar, S., … & Aerts, S. (2020). A scalable SCENIC workflow for single-cell gene regulatory network analysis. Nature Protocols, 15(7), 2247-2276.

pycisTopic.loom.export_minimal_loom_gene(ex_mtx: DataFrame, embeddings: Mapping[str, DataFrame], out_fname: str, regulons: List[Regulon] = None, cell_annotations: Mapping[str, str] | None = None, tree_structure: Sequence[str] = (), title: str | None = None, nomenclature: str = 'Unknown', num_workers: int = 2, auc_mtx=None, auc_thresholds=None, compress: bool = False)[source]

Create a loom file for a single cell experiment to be used in SCope.

Parameters

ex_mtx: The expression matrix (n_cells x n_genes).
regulons: A list of Regulons.
cell_annotations: A dictionary that maps a cell ID to its corresponding cell type annotation.
out_fname: The name of the file to create.
tree_structure: A sequence of strings that defines the category tree structure. Needs to be a sequence of strings with three elements.
title: The title for this loom file. If None, then the basename of the filename is used as the title.
nomenclature: The name of the genome.
num_workers: The number of cores to use for AUCell regulon enrichment.
embeddings: A dictionary that maps the name of an embedding to its representation as a pandas DataFrame with two columns: the first column is the first component of the projection for each cell followed by the second. The first mapping is the default embedding (use collections.OrderedDict to enforce this).
compress: Whether to compress metadata (only when using SCope).

pycisTopic.loom.export_region_accessibility_to_loom(accessibility_matrix: CistopicImputedFeatures | DataFrame, cistopic_obj: CistopicObject, binarized_topic_region: Dict[str, DataFrame], binarized_cell_topic: Dict[str, DataFrame], out_fname: str, selected_regions: List[str] = None, selected_cells: List[str] = None, cluster_annotation: List[str] = None, cluster_markers: Dict[str, Dict[str, DataFrame]] = None, tree_structure: Sequence[str] = (), title: str = None, nomenclature: str = 'Unknown', split_pattern: str = '___', **kwargs)[source]

Create SCope [Davie et al, 2018] compatible loom files for accessibility data exploration

Parameters

accessibility_matrix: class::CistopicImputedFeatures or class::pd.DataFrame: A cisTopic imputed features object containing imputed accessibility as values. Alternatively, a pandas data frame with regions as columns, cells as rows and accessibility per regions as values.
cistopic_obj: class::CistopicObject: The cisTopic object from which accessibility values have been derived. It must include cell meta data (including specified cluster annotation columns) and the topic model from which accessibility has been imputed.
binarized_topic_region: dictionary: A dictionary containing topics as keys and class::pd.DataFrame with regions in topics as index and their topic contribution as values. This is the output of binarize_topics() using target='region'.
binarized_cell_topic: dictionary: A dictionary containing topics as keys and class::pd.DataFrame with cells in topics as index and their topic contribution as values. This is the output of binarize_topics() using target='cell'.
out_fname: str: Path to output file.
selected_regions: list, optional: A list specifying which regions should be included in the loom file. This is useful when working with very large data sets (e.g. one can select only regions in topics or DARs to reduce the file size). Default: None
selected_cells: list, optional: A list specifying which cells should be included in the loom file. Default: None
cluster_annotation: list, optional: A list indicating which information in cistopic_obj.cell_data should be used as clusters. The specified names must be included as columns in cistopic_obj.cell_data. Default: None.
cluster_markers: dict, optional: A dictionary including an entry per cluster annotation (which should match with the names in cluster_annotation) including a dictionary per cluster with a pandas data frame with marker regions as rows and logFC and adjusted p-values as columns (the output of find_diff_features). Default: None.
tree_structure: sequence, optional: A sequence of strings that defines the category tree structure. Needs to be a sequence of strings with three elements. Default: ()
title: str, optional: The title for this loom file. If None, then the basename of the filename is used as the title. Default: None
nomenclature: str, optional: The name of the genome. Default: ‘Unknown’
**kwargs: Additional parameters for pyscenic.export.export2loom

References

Davie, K., Janssens, J., Koldere, D., De Waegeneer, M., Pech, U., Kreft, Ł., … & Aerts, S. (2018). A single-cell transcriptome atlas of the aging Drosophila brain. Cell, 174(4), 982-998.

pycisTopic.loom.get_metadata(loom)[source]: A helper function to get metadata

pycisTopic.loom.get_regulons(loom)[source]: A helper function to get regulons

Signature enrichment

pycisTopic.signature_enrichment.gene_set_to_signature(gene_set: List, name: str)[source]

A helper function to generate gene signatures

Parameters

gene_set: List: List of genes
name: str: Name for the signature

pycisTopic.signature_enrichment.region_set_to_signature(query_region_set: PyRanges, target_region_set: PyRanges, name: str)[source]

A helper function to intersect query regions with the input data set regions

Parameters

query_region_set: pr.PyRanges: Pyranges with regions to query
target_region_set: pr.PyRanges: Pyranges with target regions
name: str: Name for the signature
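
As a rough illustration of the intersection described above, the following pure-Python sketch keeps the target regions that overlap at least one query region. The function name `overlapping_targets` and the tuple representation are hypothetical; pycisTopic operates on PyRanges objects, and its exact overlap rules may differ.

```python
from typing import List, Tuple

# Hypothetical representation: (chromosome, start, end), half-open like BED.
Region = Tuple[str, int, int]


def overlapping_targets(query: List[Region], target: List[Region]) -> List[str]:
    """Return the names of target regions that overlap any query region."""
    hits = []
    for chrom, start, end in target:
        for q_chrom, q_start, q_end in query:
            # Two half-open intervals overlap when each starts before the other ends.
            if chrom == q_chrom and start < q_end and q_start < end:
                hits.append(f"{chrom}:{start}-{end}")
                break  # one overlapping query region is enough
    return hits


query = [("chr1", 100, 200)]
target = [("chr1", 150, 250), ("chr1", 300, 400), ("chr2", 150, 250)]
signature = overlapping_targets(query, target)
```

Only the first target region is kept here, since the second does not overlap the query and the third lies on a different chromosome.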

pycisTopic.signature_enrichment.signature_enrichment(rankings: CistopicImputedFeatures, signatures: Dict[str, PyRanges] | Dict[str, List], enrichment_type: str = 'region', auc_threshold: float = 0.05, normalize: bool = False, n_cpu: int = 1)[source]

Get enrichment of a region signature in cells or topics using AUCell (Van de Sande et al., 2020)

Parameters

rankings: CistopicImputedFeatures: A CistopicImputedFeatures object with ranking values
signatures: Dictionary of pr.PyRanges (for regions) or list (for genes): A dictionary containing region signatures as pr.PyRanges or gene names as list
enrichment_type: str: Whether features are genes or regions
auc_threshold: float: The fraction of the ranked genome to take into account for the calculation of the Area Under the recovery Curve. Default: 0.05
normalize: bool: Normalize the AUC values to a maximum of 1.0 per regulon. Default: False
n_cpu: int: The number of cores to use. Default: 1
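
To make the role of auc_threshold more concrete, here is a simplified sketch of an area under the recovery curve for a single cell. It is not the AUCell implementation: the function name `recovery_auc`, the 0-based rank dictionary and the normalisation by `cutoff * n_members` are assumptions made for illustration only.

```python
from typing import Dict, Iterable


def recovery_auc(ranks: Dict[str, int], signature: Iterable[str],
                 auc_threshold: float = 0.05) -> float:
    """Area under the recovery curve over the top fraction of one ranking.

    ranks maps each feature to its 0-based rank in one cell (0 = top ranked).
    """
    # Number of top-ranked positions taken into account.
    cutoff = max(1, round(auc_threshold * len(ranks)))
    members = [f for f in signature if f in ranks]
    if not members:
        return 0.0
    # A member at rank r < cutoff adds (cutoff - r) to the area: the curve
    # steps up by one at r and stays up until the cutoff.
    area = sum(cutoff - ranks[f] for f in members if ranks[f] < cutoff)
    # Loose upper bound on the area, used here for scaling to [0, 1].
    max_area = cutoff * len(members)
    return area / max_area


# 100 features ranked r0 (top) to r99 (bottom); the top 5% are ranks 0-4.
ranks = {f"r{i}": i for i in range(100)}
auc = recovery_auc(ranks, ["r0", "r4", "r50"], auc_threshold=0.05)
```

In this toy ranking, two of the three signature members fall within the top 5% and contribute to the area, while the member at rank 50 is ignored.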

pyGREAT

pycisTopic.pyGREAT.get_region_signature(pyGREAT_results: Dict[str, DataFrame], region_set_key: str, ontology: str, term: str)[source]

Retrieve a GO region signature from GREAT results

Parameters:

pyGREAT_results: Dict: A dictionary with pyGREAT results.
region_set_key: str: Key of the region set to query
ontology: str: Ontology to query
term: str: Term to retrieve regions from

pycisTopic.pyGREAT.pyGREAT(region_sets: Dict[str, PyRanges], species: str, rule: str = 'basalPlusExt', span: float = 1000.0, upstream: float = 5.0, downstream: float = 1.0, two_distance: float = 1000.0, one_distance: float = 1000.0, include_curated_reg_doms: int = 1, bg_choice: str = 'wholeGenome', tmp_dir: str = None, n_cpu: int = 1, **kwargs)[source]

Run GREAT (McLean et al., 2010) on a dictionary of PyRanges objects. For more details on GREAT parameters, please visit http://great.stanford.edu/public/html/

Parameters:

region_sets: Dict: A dictionary containing region sets to query as pyRanges objects.
species: str: Genome assembly from where the coordinates come from. Possible values are: ‘mm9’, ‘mm10’, ‘hg19’, ‘hg38’
rule: str: How to associate genomic regions to genes. Possible options are ‘basalPlusExt’, ‘twoClosest’, ‘oneClosest’. Default: ‘basalPlusExt’
span: float: Unit: kb, only used when rule is ‘basalPlusExt’. Default: 1000.0
upstream: float: Unit: kb, only used when rule is ‘basalPlusExt’. Default: 5.0
downstream: float: Unit: kb, only used when rule is ‘basalPlusExt’. Default: 1.0
two_distance: float: Unit: kb, only used when rule is ‘twoClosest’. Default: 1000.0
one_distance: float: Unit: kb, only used when rule is ‘oneClosest’. Default: 1000.0
include_curated_reg_doms: int: Whether to include curated regulatory domains. Default: 1
bg_choice: str: A path to the background file or a string. Default: ‘wholeGenome’
tmp_dir: str: Temporary directory to save region sets as bed files for GREAT. Default: None
n_cpu: int: Number of cores to use. Default: 1
**kwargs: Other parameters to pass to ray.init

References

McLean, C. Y., Bristor, D., Hiller, M., Clarke, S. L., Schaar, B. T., Lowe, C. B., … & Bejerano, G. (2010). GREAT improves functional interpretation of cis-regulatory regions. Nature biotechnology, 28(5), 495-501.
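
The upstream and downstream parameters of the ‘basalPlusExt’ rule above define a strand-aware basal regulatory domain around each TSS, as described by McLean et al. (2010). The sketch below only illustrates that geometry with the default values; the function name `basal_domain` is hypothetical and the actual computation is performed by GREAT, not by this code.

```python
from typing import Tuple


def basal_domain(tss: int, strand: str, upstream_kb: float = 5.0,
                 downstream_kb: float = 1.0) -> Tuple[int, int]:
    """Return (start, end) of the basal domain around a TSS, in bp."""
    up = int(upstream_kb * 1000)
    down = int(downstream_kb * 1000)
    if strand == "+":
        # Upstream lies at smaller coordinates on the forward strand.
        start, end = tss - up, tss + down
    else:
        # On the reverse strand upstream lies at larger coordinates.
        start, end = tss - down, tss + up
    return max(0, start), end  # clip at the chromosome start


forward = basal_domain(10_000, "+")
reverse = basal_domain(10_000, "-")
```

Under the ‘basalPlusExt’ rule, each basal domain is then extended in both directions up to the nearest gene's basal domain, but by no more than span kb; that extension step is not shown here.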

pycisTopic.pyGREAT.pyGREAT_oneset(region_set: PyRanges, species: str, rule: str = 'basalPlusExt', span: float = 1000.0, upstream: float = 5.0, downstream: float = 1.0, two_distance: float = 1000.0, one_distance: float = 1000.0, include_curated_reg_doms: int = 1, bg_choice: str = 'wholeGenome', tmp_dir: str = None)[source]

Run GREAT (McLean et al., 2010) on a PyRanges object. For more details on GREAT parameters, please visit http://great.stanford.edu/public/html/

Parameters:

region_set: pr.PyRanges: The region set to query, as a PyRanges object.
species: str: Genome assembly from where the coordinates come from. Possible values are: ‘mm9’, ‘mm10’, ‘hg19’, ‘hg38’
rule: str: How to associate genomic regions to genes. Possible options are ‘basalPlusExt’, ‘twoClosest’, ‘oneClosest’. Default: ‘basalPlusExt’
span: float: Unit: kb, only used when rule is ‘basalPlusExt’. Default: 1000.0
upstream: float: Unit: kb, only used when rule is ‘basalPlusExt’. Default: 5.0
downstream: float: Unit: kb, only used when rule is ‘basalPlusExt’. Default: 1.0
two_distance: float: Unit: kb, only used when rule is ‘twoClosest’. Default: 1000.0
one_distance: float: Unit: kb, only used when rule is ‘oneClosest’. Default: 1000.0
include_curated_reg_doms: int: Whether to include curated regulatory domains. Default: 1
bg_choice: str: A path to the background file or a string. Default: ‘wholeGenome’
tmp_dir: str: Temporary directory to save region sets as bed files for GREAT. Default: None

References

McLean, C. Y., Bristor, D., Hiller, M., Clarke, S. L., Schaar, B. T., Lowe, C. B., … & Bejerano, G. (2010). GREAT improves functional interpretation of cis-regulatory regions. Nature biotechnology, 28(5), 495-501.

Gene activity

pycisTopic.gene_activity.calculate_distance_join(pr_obj: PyRanges)[source]: A helper function to calculate distances between regions and genes.

pycisTopic.gene_activity.calculate_distance_with_limits_join(pr_obj: PyRanges)[source]: A helper function to calculate distances between regions and genes, also returning the relative distance to the TSS and to the end of the gene.

pycisTopic.gene_activity.extend_pyranges(pr_obj: PyRanges, upstream: int, downstream: int)[source]: A helper function to extend coordinates downstream/upstream in a pyRanges given upstream and downstream distances.

pycisTopic.gene_activity.extend_pyranges_with_limits(pr_obj: PyRanges)[source]: A helper function to extend coordinates downstream/upstream in a pyRanges with Distance_upstream and Distance_downstream columns.

pycisTopic.gene_activity.get_gene_activity(imputed_acc_object: CistopicImputedFeatures, pr_annot: PyRanges, chromsizes: PyRanges, predefined_boundaries: PyRanges | None = None, use_gene_boundaries: bool | None = True, upstream: List[int] | None = [1000, 100000], downstream: List[int] | None = [1000, 100000], distance_weight: bool | None = True, decay_rate: float | None = 1, extend_gene_body_upstream: int | None = 5000, extend_gene_body_downstream: int | None = 0, gene_size_weight: bool | None = False, gene_size_scale_factor: int | str | None = 'median', remove_promoters: bool | None = False, scale_factor: float | None = 1, average_scores: bool | None = True, extend_tss: List[int] | None = [10, 10], return_weights: bool | None = True, gini_weight: bool | None = True, project: str | None = 'Gene_activity')[source]

Infer gene activity.

Parameters

imputed_acc_object: CistopicImputedFeatures: A cisTopic imputation data object.
pr_annot: pr.PyRanges: A pr.PyRanges containing gene annotation, including Chromosome, Start, End, Strand (as ‘+’ and ‘-’), Gene name and Transcription Start Site.
chromsizes: pr.PyRanges: A pr.PyRanges containing size of each chromosome, containing ‘Chromosome’, ‘Start’ and ‘End’ columns.
predefined_boundaries: pr.PyRanges: A pr.PyRanges containing predefined genomic domain boundaries (e.g. TAD boundaries) to use as boundaries. If given, use_gene_boundaries will be ignored.
use_gene_boundaries: bool, optional: Whether to use the whole search space or stop when encountering another gene. Default: True
upstream: List, optional: Search space upstream. The minimum (first position) means that even if there is a gene right next to it these bp will be taken. The second position indicates the maximum distance. Default: [1000,100000]
downstream: List, optional: Search space downstream. The minimum (first position) means that even if there is a gene right next to it these bp will be taken. The second position indicates the maximum distance. Default: [1000,100000]
distance_weight: bool, optional: Whether to add a distance weight (an exponential function, the weight will decrease with distance). Default: True
decay_rate: float, optional: Exponent for the exponential distance function (the higher the value, the faster the weight decreases). Default: 1
extend_gene_body_upstream: int, optional: Number of bp upstream immune to the distance weight (their value will be maximum for this weight). Default: 5000
extend_gene_body_downstream: int, optional: Number of bp downstream immune to the distance weight (their value will be maximum for this weight). Default: 0
gene_size_weight: bool, optional: Whether to add a weight based on the length of the gene. Default: False
gene_size_scale_factor: str or int, optional: Dividend to calculate the gene size weight. Default is the median value of all genes in the genome.
remove_promoters: bool, optional: Whether to ignore promoters when computing gene activity. Default: False
average_scores: bool, optional: Whether to divide by the total number of regions assigned to a gene when calculating the gene activity score. Default: True
scale_factor: float, optional: Value by which to multiply the final gene activity matrix. Default: 1
extend_tss: list, optional: Space around the TSS considered as promoter. Default: [10,10]
return_weights: bool, optional: Whether to return the final weight values. Default: True
gini_weight: bool, optional: Whether to add a Gini index weight. The more unique the region is, the higher this weight will be. Default: True
project: str, optional: Project name for the CistopicImputedFeatures with the gene activity. Default: ‘Gene_activity’
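
The interplay between distance_weight, decay_rate and extend_gene_body_upstream can be pictured with a small sketch of an exponentially decaying weight. This is an illustration under stated assumptions, not the pycisTopic formula: the function name `distance_weight`, the `scale_bp` constant and the choice to start the decay at the edge of the immune window are all hypothetical.

```python
import math


def distance_weight(distance_bp: int, decay_rate: float = 1.0,
                    immune_bp: int = 5000, scale_bp: float = 5000.0) -> float:
    """Exponentially decaying weight for a region at distance_bp from a gene.

    Regions within immune_bp keep the maximum weight (1.0); beyond that the
    weight decays with distance, faster for larger decay_rate.
    scale_bp is an assumed length scale, chosen only for this example.
    """
    if distance_bp <= immune_bp:
        return 1.0
    return math.exp(-decay_rate * (distance_bp - immune_bp) / scale_bp)


near = distance_weight(2_000)          # inside the immune window
far = distance_weight(10_000)          # 5 kb past the immune window
far_fast = distance_weight(10_000, decay_rate=2.0)
```

With these assumed settings, a region inside the immune window keeps full weight, and doubling decay_rate squares the weight of a distant region.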

pycisTopic.gene_activity.reduce_pyranges_b(pr_obj: PyRanges, upstream: int, downstream: int)[source]: A helper function to reduce coordinates downstream/upstream in a pyRanges given upstream and downstream distances.

pycisTopic.gene_activity.reduce_pyranges_with_limits_b(pr_obj: PyRanges)[source]: A helper function to reduce coordinates downstream/upstream in a pyRanges with Distance_upstream and Distance_downstream columns.

pycisTopic.gene_activity.region_weights(imputed_acc_object, pr_annot, chromsizes, predefined_boundaries=None, use_gene_boundaries=True, upstream=[1000, 100000], downstream=[1000, 100000], distance_weight=True, decay_rate=1, extend_gene_body_upstream=5000, extend_gene_body_downstream=0, gene_size_weight=True, gene_size_scale_factor='median', remove_promoters=True, extend_tss=[10, 10], gini_weight=True)[source]

Calculate region weights.

Parameters

imputed_acc_object: CistopicImputedFeatures: A cisTopic imputation data object.
pr_annot: pr.PyRanges: A pr.PyRanges containing gene annotation, including Chromosome, Start, End, Strand (as ‘+’ and ‘-’), Gene name and Transcription Start Site.
chromsizes: pr.PyRanges: A pr.PyRanges containing size of each chromosome, containing ‘Chromosome’, ‘Start’ and ‘End’ columns.
predefined_boundaries: pr.PyRanges: A pr.PyRanges containing predefined genomic domain boundaries (e.g. TAD boundaries) to use as boundaries. If given, use_gene_boundaries will be ignored.
use_gene_boundaries: bool, optional: Whether to use the whole search space or stop when encountering another gene. Default: True
upstream: List, optional: Search space upstream. The minimum (first position) means that even if there is a gene right next to it these bp will be taken. The second position indicates the maximum distance. Default: [1000,100000]
downstream: List, optional: Search space downstream. The minimum (first position) means that even if there is a gene right next to it these bp will be taken. The second position indicates the maximum distance. Default: [1000,100000]
distance_weight: bool, optional: Whether to add a distance weight (an exponential function, the weight will decrease with distance). Default: True
decay_rate: float, optional: Exponent for the exponential distance function (the higher the value, the faster the weight decreases). Default: 1
extend_gene_body_upstream: int, optional: Number of bp upstream immune to the distance weight (their value will be maximum for this weight). Default: 5000
extend_gene_body_downstream: int, optional: Number of bp downstream immune to the distance weight (their value will be maximum for this weight). Default: 0
gene_size_weight: bool, optional: Whether to add a weight based on the length of the gene. Default: True
gene_size_scale_factor: str or int, optional: Dividend to calculate the gene size weight. Default is the median value of all genes in the genome.
remove_promoters: bool, optional: Whether to ignore promoters when computing gene activity. Default: True
extend_tss: list, optional: Space around the TSS considered as promoter. Default: [10,10]
gini_weight: bool, optional: Whether to add a Gini index weight. The more unique the region is, the higher this weight will be. Default: True

pycisTopic.gene_activity.weighted_aggregation(imputed_acc_obj_mtx: csr_matrix, region_weights_df_per_gene: DataFrame, average_scores: bool)[source]

Weighted aggregation of region probabilities into gene activity

Parameters

imputed_acc_obj_mtx: sparse.csr_matrix: A sparse matrix with regions as rows and cells as columns.
region_weights_df_per_gene: pd.DataFrame: A data frame with region index (from the sparse matrix) for the gene
average_scores: bool: Whether final values should be divided by the total number of regions aggregated
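
The aggregation described above amounts to a weighted sum of region rows per cell, optionally divided by the number of regions. The sketch below uses plain Python lists instead of a sparse matrix and a data frame; the function name `aggregate_gene` and the dictionary of weights keyed by row index are hypothetical stand-ins for the real inputs.

```python
from typing import Dict, List


def aggregate_gene(region_by_cell: List[List[float]], weights: Dict[int, float],
                   average_scores: bool = True) -> List[float]:
    """Combine region rows into one gene activity score per cell.

    region_by_cell: regions as rows, cells as columns.
    weights: row index of each region linked to the gene -> region weight.
    """
    n_cells = len(region_by_cell[0])
    scores = [0.0] * n_cells
    for region_idx, weight in weights.items():
        for cell_idx, value in enumerate(region_by_cell[region_idx]):
            scores[cell_idx] += weight * value
    if average_scores and weights:
        # Divide by the number of regions aggregated for this gene.
        scores = [s / len(weights) for s in scores]
    return scores


matrix = [[1.0, 0.0],   # region 0
          [0.5, 0.5],   # region 1 (not linked to the gene)
          [0.0, 2.0]]   # region 2
gene_weights = {0: 1.0, 2: 0.5}
summed = aggregate_gene(matrix, gene_weights, average_scores=False)
averaged = aggregate_gene(matrix, gene_weights, average_scores=True)
```

Averaging keeps genes with many linked regions from dominating purely because of their region count.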

Label transfer

pycisTopic.label_transfer.label_transfer(ref_anndata: AnnData, query_anndata: AnnData, labels_to_transfer: List[str], sample_id_col: str | None = 'sample_id', n_cpu: int | None = 1, variable_genes: bool | None = True, methods: List[str] | None = ['ingest', 'harmony', 'bbknn', 'scanorama', 'cca'], pca_ncomps: List[int] | None = [50, 50], n_neighbours: List[int] | None = [10, 10], bbknn_components: int | None = 30, cca_components: int | None = 30, return_label_weights: bool | None = False, **kwargs)[source]

Wrapper function of Ray processes to compute label transfer from a single reference to multiple query samples.

Parameters

ref_anndata: AnnData: An AnnData object containing the reference data set (typically, scRNA-seq data)
query_anndata: AnnData: An AnnData object containing the query data set, with features matching with the reference data set (typically, gene activities derived from scATAC-seq)
labels_to_transfer: List: Labels to transfer. They must be included in ref_anndata.obs.
sample_id_col: str: Name of the column containing the sample ids in the query data set. It must be included in query_anndata.obs. Default: sample_id
n_cpu: int, optional: Number of cores to use. Default: 1.
variable_genes: bool, optional: Whether variable genes matching between the two data sets should be used (True) or otherwise, all matching genes (False). Default: True
methods: List, optional: Methods to be used for label transferring. These include: ‘ingest’ [from scanpy], ‘harmony’ [Korsunsky et al, 2019], ‘bbknn’ [Polański et al, 2020], ‘scanorama’ [Hie et al, 2019] and ‘cca’. Except for ingest, these methods return a common coembedding and labels are inferred using the distances between query and reference cells as weights. Default: [‘ingest’, ‘harmony’, ‘bbknn’, ‘scanorama’, ‘cca’]
pca_ncomps: List, optional: Number of principal components to use for reference and query, respectively. Default: [50,50]
n_neighbours: List, optional: Number of neighbours to use for reference and query, respectively. Default: [10,10]
bbknn_components: int, optional: Number of components to use for the umap for bbknn integration. Default: 30
cca_components: int, optional: Number of components to use for cca. Default: 30
return_label_weights: bool, optional: Whether to return the label scores per variable (as a dictionary, except for ingest). Default: False
**kwargs: Additional parameters for ray.init.

References

Korsunsky, I., Millard, N., Fan, J., Slowikowski, K., Zhang, F., Wei, K., … & Raychaudhuri, S. (2019). Fast, sensitive and accurate integration of single-cell data with Harmony. Nature methods, 16(12), 1289-1296.

Polański, K., Young, M. D., Miao, Z., Meyer, K. B., Teichmann, S. A., & Park, J. E. (2020). BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics, 36(3), 964-965.

Hie, B., Bryson, B., & Berger, B. (2019). Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nature biotechnology, 37(6), 685-691.

pycisTopic.label_transfer.label_transfer_coembedded(dist, labels)[source]: A helper function to propagate labels in a common space
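
One way to picture label propagation in a common space is to let every reference cell vote for its label with a weight that shrinks with its distance to the query cell. The sketch below is an assumption-laden illustration, not the pycisTopic implementation: the function name `propagate_labels` and the inverse-distance weight `1 / (1 + d)` are hypothetical choices.

```python
from collections import defaultdict
from typing import Dict, List


def propagate_labels(dist: List[List[float]], labels: List[str]) -> List[str]:
    """Assign each query cell the label with the highest weighted vote.

    dist[i][j] is the distance between query cell i and reference cell j;
    labels[j] is the label of reference cell j.
    """
    predicted = []
    for row in dist:
        scores: Dict[str, float] = defaultdict(float)
        for d, label in zip(row, labels):
            scores[label] += 1.0 / (1.0 + d)  # closer reference cells weigh more
        predicted.append(max(scores, key=scores.get))
    return predicted


reference_labels = ["T_cell", "B_cell", "B_cell"]
distances = [[0.1, 2.0, 3.0],   # query cell 0 sits next to the T cell
             [4.0, 0.2, 0.3]]   # query cell 1 sits next to both B cells
transferred = propagate_labels(distances, reference_labels)
```

The per-label scores computed inside the loop are the kind of values that could be returned as label weights rather than collapsed to a single call.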

Utils

pycisTopic.utils.collapse_duplicates(df)[source]: Collapse duplicates from fragments df

pycisTopic.utils.coord_to_region_names(coord)[source]: PyRanges to region names

pycisTopic.utils.fig2img(fig)[source]: Convert a Matplotlib figure to a PIL Image and return it

pycisTopic.utils.get_tss_matrix(fragments, flank_window, tss_space_annotation)[source]: Get TSS matrix

pycisTopic.utils.gini(array)[source]: Calculate the Gini coefficient of a numpy array.
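
For reference, the Gini coefficient of non-negative values can be computed from the sorted values with a closed-form sum. The pure-Python sketch below shows that standard formula; it is not necessarily how pycisTopic.utils.gini is implemented (that function takes a numpy array), and the name `gini_coefficient` is hypothetical.

```python
from typing import Sequence


def gini_coefficient(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = sum_i (2i - n - 1) * x_i / (n * sum(x)), with i = 1..n over sorted values.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


uniform = gini_coefficient([1, 1, 1, 1])       # equally accessible everywhere
specific = gini_coefficient([0, 0, 0, 1])      # accessible in a single cell
```

A region accessible in only one of n cells reaches the maximum of (n - 1) / n, which is why a Gini-based weight favours cell-type-specific regions.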

pycisTopic.utils.normalise_filepath(path: str | Path, check_not_directory: bool = True) → str[source]: Create a string path, expanding the home directory if present.

pycisTopic.utils.read_fragments_from_file(fragments_bed_filename, use_polars: bool = True) → PyRanges[source]

Read fragments BED file to PyRanges object.

Parameters:

fragments_bed_filename: Fragments BED filename.
use_polars: Use polars instead of pandas for reading the fragments BED file.

Returns:

PyRanges object of fragments.

pycisTopic.utils.region_names_to_coordinates(region_names: Sequence[str]) → DataFrame[source]

Create Pandas DataFrame with region IDs to coordinates mapping.

Parameters:

region_names: List of region names in “chrom:start-end” format.

Returns:

Pandas DataFrame with region IDs to coordinates mapping.
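
The “chrom:start-end” format mentioned above can be parsed with plain string operations. The following sketch returns tuples instead of a pandas DataFrame so that it stays dependency-free; the function name `parse_region_names` is hypothetical and the library function may validate its input differently.

```python
from typing import List, Sequence, Tuple


def parse_region_names(region_names: Sequence[str]) -> List[Tuple[str, int, int]]:
    """Split "chrom:start-end" strings into (chrom, start, end) tuples."""
    coords = []
    for name in region_names:
        # Split on the last colon so chromosome names containing ":" survive.
        chrom, _, span = name.rpartition(":")
        start, _, end = span.partition("-")
        coords.append((chrom, int(start), int(end)))
    return coords


coords = parse_region_names(["chr1:100-200", "chrX:5-10"])
# Rebuilding the names from the tuples gives back the original strings.
names = [f"{chrom}:{start}-{end}" for chrom, start, end in coords]
```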