API
cisTopic object
- class pycisTopic.cistopic_class.CistopicObject(fragment_matrix: csr_matrix, binary_matrix: csr_matrix, cell_names: List[str], region_names: List[str], cell_data: DataFrame, region_data: DataFrame, path_to_fragments: str | Dict[str, str], project: str | None = 'cisTopic')[source]
cisTopic data class.
CistopicObject contains the cell by fragment matrices (stored as counts in fragment_matrix and as binary accessibility in binary_matrix), cell metadata (cell_data), region metadata (region_data) and the path/s to the fragments file/s (path_to_fragments). LDA models from CisTopicLDAModel can be stored in selected_model, and cell/region projections can be stored in projections as a dictionary.
- Attributes:
- fragment_matrix: sparse.csr_matrix
A matrix containing cell names as column names, regions as row names and fragment counts as values.
- binary_matrix: sparse.csr_matrix
A matrix containing cell names as column names, regions as row names and whether regions are accessible (0: Not accessible; 1: Accessible) as values.
- cell_names: list
A list containing cell names.
- region_names: list
A list containing region names.
- cell_data: pd.DataFrame
A data frame containing cell information, with cells as indexes and attributes as columns.
- region_data: pd.DataFrame
A data frame containing region information, with region as indexes and attributes as columns.
- path_to_fragments: str or dict
A str or dict containing the path/s to the fragments file/s used to generate the CistopicObject.
- project: str
Name of the cisTopic project.
- add_LDA_model(model: CistopicLDAModel)[source]
Add LDA model to a cisTopic object.
- Parameters:
- model: CistopicLDAModel
Selected cisTopic LDA model results (see LDAModels.evaluate_models)
- add_cell_data(cell_data: DataFrame, split_pattern: str | None = '___')[source]
Add cell metadata to CistopicObject. If a column already exists in the cell metadata, it will be overwritten.
- Parameters:
- cell_data: pd.DataFrame
A data frame containing metadata information, with cell names as indexes. If cells are missing from the metadata, values will be filled with NaN.
- split_pattern: str
Pattern to split cell barcode from sample id. Default: ___
- add_region_data(region_data: DataFrame)[source]
Add region metadata to CistopicObject. If a column already exists in the region metadata, it will be overwritten.
- Parameters:
- region_data: pd.DataFrame
A data frame containing metadata information, with region names as indexes. If regions are missing from the metadata, values will be filled with NaN.
- merge(cistopic_obj_list: List[CistopicObject], is_acc: int | None = 1, project: str | None = 'cisTopic_merge', copy: bool | None = False, split_pattern: str | None = '___')[source]
Merge a list of CistopicObject to the input CistopicObject. Reference coordinates must be the same between the objects. Existing cisTopicCGSModel and projections will be deleted. This is to ensure that models contained in a CistopicObject are derived from the cells it contains.
- Parameters:
- cistopic_obj_list: list
A list containing one or more CistopicObject to merge.
- is_acc: int, optional
Minimal number of fragments for a region to be considered accessible. Default: 1.
- project: str, optional
Name of the cisTopic project.
- copy: bool, optional
Whether changes should be made on the input CistopicObject or a new object should be returned.
- split_pattern: str
Pattern to split cell barcode from sample id. Default: ___
- Returns:
- CistopicObject
A combined CistopicObject. Two new columns in cell_data indicate the CistopicObject of origin (cisTopic_id) and the fragments file from which each cell comes (path_to_fragments).
- subset(cells: List[str] | None = None, regions: List[str] | None = None, copy: bool | None = False, split_pattern: str | None = '___')[source]
Subset cells and/or regions from CistopicObject. Existing CisTopicLDAModel and projections will be deleted. This is to ensure that models contained in a CistopicObject are derived from the cells it contains.
- Parameters:
- cells: list, optional
A list containing the names of the cells to keep.
- regions: list, optional
A list containing the names of the regions to keep.
- copy: bool, optional
Whether changes should be made on the input CistopicObject or a new object should be returned.
- split_pattern: str
Pattern to split cell barcode from sample id. Default: ___
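The split_pattern convention used throughout this class can be sketched as follows. This is a minimal illustration, assuming cell names are composed as barcode, then the pattern, then the sample id; split_cell_name is a hypothetical helper and not part of pycisTopic.

```python
# Hypothetical helper (not a pycisTopic function): shows the naming convention only.
def split_cell_name(cell_name, split_pattern="___"):
    """Split 'barcode<split_pattern>sample_id' into (barcode, sample_id)."""
    barcode, sep, sample_id = cell_name.partition(split_pattern)
    # If the pattern is absent there is no sample id to recover.
    return barcode, (sample_id if sep else None)


tagged = split_cell_name("ATGTCGTC-1___Sample_1")
untagged = split_cell_name("ATGTCGTC-1")
print(tagged)
print(untagged)
```

Choosing a pattern such as ___ that does not occur in barcodes or sample names keeps the split unambiguous.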
- pycisTopic.cistopic_class.create_cistopic_object(fragment_matrix: DataFrame | csr_matrix, cell_names: List[str] | None = None, region_names: List[str] | None = None, path_to_blacklist: str | None = None, min_frag: int | None = 1, min_cell: int | None = 1, is_acc: int | None = 1, path_to_fragments: str | Dict[str, str] | None = {}, project: str | None = 'cisTopic', tag_cells: bool | None = True, split_pattern: str | None = '___')[source]
Creates a CistopicObject from a count matrix.
- Parameters:
- fragment_matrix: pd.DataFrame or sparse.csr_matrix
A data frame containing cell names as column names, regions as row names and fragment counts as values, or a sparse.csr_matrix containing cells as columns and regions as rows.
- cell_names: list, optional
A list containing cell names. Only used if the fragment matrix is a sparse.csr_matrix.
- region_names: list, optional
A list containing region names. Only used if the fragment matrix is a sparse.csr_matrix.
- path_to_blacklist: str, optional
Path to bed file containing blacklist regions (Amemiya et al., 2019).
- min_frag: int, optional
Minimal number of fragments in a cell for the cell to be kept. Default: 1
- min_cell: int, optional
Minimal number of cells in which a region is detected for the region to be kept. Default: 1
- is_acc: int, optional
Minimal number of fragments for a region to be considered accessible. Default: 1
- path_to_fragments: str, dict
A dict or str containing the paths to the fragments files used to generate the CistopicObject. Default: {}.
- project: str, optional
Name of the cisTopic project. Default: ‘cisTopic’
- tag_cells: bool, optional
Whether to add the project name as suffix to the cell names. Default: True
- split_pattern: str
Pattern to split cell barcode from sample id. Default: ___
References
Amemiya, H. M., Kundaje, A., & Boyle, A. P. (2019). The ENCODE blacklist: identification of problematic regions of the genome. Scientific reports, 9(1), 1-5.
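The effect of min_frag, min_cell and is_acc can be sketched on a toy matrix. This is only an illustration in plain Python lists: the order of the filtering steps and the use of a count above zero to mean "detected" are assumptions, not a statement about how create_cistopic_object is implemented.

```python
# Toy count matrix: rows are regions, columns are cells (as in fragment_matrix).
counts = [
    [0, 2, 1],  # region_1
    [0, 0, 3],  # region_2
    [0, 0, 0],  # region_3
]
min_frag, min_cell, is_acc = 1, 1, 1

# Keep cells (columns) with at least min_frag fragments in total.
n_cells = len(counts[0])
keep_cells = [j for j in range(n_cells) if sum(row[j] for row in counts) >= min_frag]

# Keep regions (rows) detected in at least min_cell of the kept cells (assumed: count > 0).
keep_regions = [
    i for i, row in enumerate(counts)
    if sum(row[j] > 0 for j in keep_cells) >= min_cell
]

filtered = [[counts[i][j] for j in keep_cells] for i in keep_regions]

# Binarize: a region is accessible in a cell when it has at least is_acc fragments.
binary = [[1 if value >= is_acc else 0 for value in row] for row in filtered]
print(filtered)
print(binary)
```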
- pycisTopic.cistopic_class.create_cistopic_object_from_fragments(path_to_fragments: str, path_to_regions: str, path_to_blacklist: str | None = None, metrics: str | DataFrame | None = None, valid_bc: List[str] | None = None, n_cpu: int | None = 1, min_frag: int | None = 1, min_cell: int | None = 1, is_acc: int | None = 1, check_for_duplicates: bool | None = True, project: str | None = 'cisTopic', partition: int | None = 5, fragments_df: DataFrame | PyRanges | None = None, split_pattern: str | None = '___', use_polars: bool | None = True)[source]
Creates a CistopicObject from a fragments file and defined genomic intervals (compatible with CellRangerATAC output)
- Parameters:
- path_to_fragments: str
The path to the fragments file containing chromosome, start, end and assigned barcode for each read (e.g. from CellRanger ATAC (/outs/fragments.tsv.gz)).
- path_to_regions: str
Path to the bed file with the defined regions.
- path_to_blacklist: str, optional
Path to bed file containing blacklist regions (Amemiya et al., 2019). Default: None
- metrics: str or pd.DataFrame, optional
Data frame from CellRanger or similar, with barcodes and metrics (e.g. from CellRanger ATAC /outs/singlecell.csv). If it is an output from CellRanger, only cells for which is__cell_barcode is 1 will be considered; otherwise only barcodes included in the metrics will be taken. Default: None
- valid_bc: list, optional
A list of valid cell barcodes; only used if metrics is not provided. Default: None
- n_cpu: int, optional
Number of cores to use. Default: 1.
- min_frag: int, optional
Minimal number of fragments in a cell for the cell to be kept. Default: 1
- min_cell: int, optional
Minimal number of cells in which a region is detected for the region to be kept. Default: 1
- is_acc: int, optional
Minimal number of fragments for a region to be considered accessible. Default: 1
- check_for_duplicates: bool, optional
If no duplicate counts are provided per row in the fragments file, whether to collapse duplicates. Default: True.
- project: str, optional
Name of the cisTopic project. It will also be used as the name for sample_id in CistopicObject.cell_data. Default: ‘cisTopic’
- partition: int, optional
When using Pandas > 0.21, counting may fail (https://github.com/pandas-dev/pandas/issues/26314). In that case, the fragments data frame is divided in this number of partitions, and after counting data is merged.
- fragments_df: pd.DataFrame or pr.PyRanges, optional
A PyRanges or DataFrame containing chromosome, start, end and assigned barcode for each read, corresponding to the data in path_to_fragments.
- split_pattern: str
Pattern to split cell barcode from sample id. Default: ___
- use_polars: bool, optional
Whether to use polars to read fragments files. Default: True.
References
Amemiya, H. M., Kundaje, A., & Boyle, A. P. (2019). The ENCODE blacklist: identification of problematic regions of the genome. Scientific reports, 9(1), 1-5.
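Conceptually, building the count matrix from a fragments file means counting, per barcode, the fragments that overlap each region. The sketch below shows that idea with plain tuples. The half-open overlap test and the chr:start-end region naming are assumptions made for illustration, and the library's actual implementation is not shown here.

```python
from collections import Counter

# (chromosome, start, end, barcode) records, as in a fragments file.
fragments = [
    ("chr1", 100, 250, "AAAC-1"),
    ("chr1", 240, 400, "AAAC-1"),
    ("chr1", 900, 1100, "TTTG-1"),
    ("chr2", 50, 120, "AAAC-1"),
]
# (chromosome, start, end) regions, as in the regions BED file.
regions = [("chr1", 200, 500), ("chr1", 1000, 1500), ("chr2", 300, 600)]
valid_bc = {"AAAC-1", "TTTG-1"}


def overlaps(frag, region):
    # Half-open intervals on the same chromosome overlap if each starts before the other ends.
    return frag[0] == region[0] and frag[1] < region[2] and region[1] < frag[2]


counts = Counter()
for frag in fragments:
    if frag[3] not in valid_bc:
        continue  # barcode not in the list of valid barcodes
    for region in regions:
        if overlaps(frag, region):
            region_name = f"{region[0]}:{region[1]}-{region[2]}"
            counts[(region_name, frag[3])] += 1

print(dict(counts))
```

The nested loop is quadratic and only suitable for a toy input; real data needs an interval index or a join in a data frame library.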
- pycisTopic.cistopic_class.create_cistopic_object_from_matrix_file(fragment_matrix_file: str, path_to_blacklist: str | None = None, compression: str | None = None, min_frag: int | None = 1, min_cell: int | None = 1, is_acc: int | None = 1, path_to_fragments: Dict[str, str] | None = {}, sample_id: DataFrame | None = None, project: str | None = 'cisTopic', split_pattern: str | None = '___')[source]
Creates a CistopicObject from a count matrix file (tsv).
- Parameters:
- fragment_matrix_file: str
Path to a tsv file containing cell names as column names, regions as row names and fragment counts as values.
- path_to_blacklist: str, optional
Path to bed file containing blacklist regions (Amemiya et al., 2019). Default: None
- compression: str, optional
Whether the file is compressed (e.g. bzip). Default: None
- min_frag: int, optional
Minimal number of fragments in a cell for the cell to be kept. Default: 1
- min_cell: int, optional
Minimal number of cells in which a region is detected for the region to be kept. Default: 1
- is_acc: int, optional
Minimal number of fragments for a region to be considered accessible. Default: 1
- path_to_fragments: dict, optional
A dict containing the paths to the fragments files used to generate the CistopicObject. Default: {}.
- sample_id: pd.DataFrame, optional
A data frame indicating from which sample each barcode is derived. Required if path_to_fragments is provided. Levels must agree with keys in path_to_fragments. Default: None.
- project: str, optional
Name of the cisTopic project. Default: ‘cisTopic’
- split_pattern: str
Pattern to split cell barcode from sample id. Default: ___
References
Amemiya, H. M., Kundaje, A., & Boyle, A. P. (2019). The ENCODE blacklist: identification of problematic regions of the genome. Scientific reports, 9(1), 1-5.
- pycisTopic.cistopic_class.merge(cistopic_obj_list: List[CistopicObject], is_acc: int | None = 1, project: str | None = 'cisTopic_merge', split_pattern: str | None = '___')[source]
Merge a list of CistopicObject to the input CistopicObject. Reference coordinates must be the same between the objects. Existing cisTopicCGSModel and projections will be deleted. This is to ensure that models contained in a CistopicObject are derived from the cells it contains.
- Parameters:
- cistopic_obj_list: list
A list containing one or more CistopicObject to merge.
- is_acc: int, optional
Minimal number of fragments for a region to be considered accessible. Default: 1.
- project: str, optional
Name of the cisTopic project.
Pseudobulk formation and peak calling
- class pycisTopic.pseudobulk_peak_calling.MACSCallPeak(macs_path: str, bed_path: str, name: str, outdir: str, genome_size: str, input_format: str | None = 'BEDPE', shift: int | None = 73, ext_size: int | None = 146, keep_dup: str | None = 'all', q_value: int | None = 0.05, nolambda: bool | None = True, skip_empty_peaks: bool = False)[source]
Parameters
- macs_path: str
Path to MACS binary (e.g. /xxx/MACS/xxx/bin/macs2).
- bed_path: str
Path to the fragments BED file.
- name: str
Name of the group.
- outdir: str
Path to the output directory.
- genome_size: str
Effective genome size which is defined as the genome size which can be sequenced. Possible values: ‘hs’, ‘mm’, ‘ce’ and ‘dm’.
- input_format: str, optional
Format of the tag file; can be ELAND, BED, ELANDMULTI, ELANDEXPORT, SAM, BAM, BOWTIE, BAMPE, or BEDPE. MACS itself defaults to AUTO, which lets it detect the format automatically; here the default is ‘BEDPE’.
- shift: int, optional
To set an arbitrary shift in bp. For finding enriched cutting sites (such as in ATAC-seq) a shift of 73 bp is recommended. Default: 73.
- ext_size: int, optional
To extend reads in the 5’->3’ direction to fixed-size fragments. For ATAC-seq data, an extension of 146 bp is recommended. Default: 146.
- keep_dup: str, optional
Whether to keep duplicate tags at the exact same location. Default: ‘all’.
- q_value: float, optional
The q-value (minimum FDR) cutoff to call significant regions. Default: 0.05.
- nolambda: bool, optional
Do not consider the local bias/lambda at peak candidate regions.
- pycisTopic.pseudobulk_peak_calling.export_pseudobulk(input_data: CistopicObject | DataFrame, variable: str, chromsizes: DataFrame | PyRanges, bed_path: str, bigwig_path: str, path_to_fragments: Dict[str, str] | None = None, sample_id_col: str = 'sample_id', n_cpu: int = 1, normalize_bigwig: bool = True, split_pattern: str = '___', temp_dir: str = '/tmp') Tuple[Dict[str, str], Dict[str, str]] [source]
Create pseudobulks as bed and bigwig from single cell fragments file given a barcode annotation.
Parameters
- input_data: CistopicObject or pd.DataFrame
A CistopicObject containing the specified variable as a column in CistopicObject.cell_data, or a cell metadata pd.DataFrame containing barcodes as rows, the specified variable as a column (additional columns are possible) and a sample_id column. Index names must contain the BARCODE (e.g. ATGTCGTC-1); additional tags are possible, separated with - (e.g. ATGCTGTGCG-1-Sample_1). The levels in the sample_id column must agree with the keys in the path_to_fragments dictionary. Alternatively, if the cell metadata contains a column named barcode, it will be used instead of the index names.
- variable: str
A character string indicating the column that will be used to create the different group pseudobulk. It must be included in the cell metadata provided as input_data.
- chromsizes: pd.DataFrame or pr.PyRanges
A data frame or pr.PyRanges containing the size of each chromosome, with ‘Chromosome’, ‘Start’ and ‘End’ columns.
- bed_path: str
Path to folder where the fragments bed files per group will be saved. If None, files will not be generated.
- bigwig_path: str
Path to folder where the bigwig files per group will be saved. If None, files will not be generated.
- path_to_fragments: str or dict, optional
A dictionary of character strings, with sample names as keys, indicating the path to the fragments file/s from which pseudobulk profiles have to be created. If a CistopicObject is provided as input it will be ignored, but if a cell metadata pd.DataFrame is provided it is required. The keys of the dictionary need to match the sample_id tag added to the index names of the input data frame.
- sample_id_col: str, optional
Name of the column containing the sample name per barcode in the input CistopicObject.cell_data or pd.DataFrame. Default: ‘sample_id’.
- n_cpu: int, optional
Number of cores to use. Default: 1.
- normalize_bigwig: bool, optional
Whether bigwig files should be CPM normalized. Default: True.
- split_pattern: str, optional
Pattern to split cell barcode from sample id. Default: ‘___’. Note, if split_pattern is not None, then export_pseudobulk will attempt to infer sample_id from the index of input_data and ignore sample_id_col.
- temp_dir: str
Path to temporary directory. Default: ‘/tmp’.
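The core idea of pseudobulk formation, grouping fragments by the annotation of their barcode, can be sketched in a few lines. This is an illustration on toy data only: the annotation values, the barcodes and the in-memory BED text are made up for the example, and no files or bigwigs are written.

```python
from collections import defaultdict

# Barcode annotation: the chosen `variable` column of the cell metadata (toy values).
cell_type = {"AAAC-1": "T_cell", "TTTG-1": "B_cell", "GGGA-1": "T_cell"}
fragments = [
    ("chr1", 100, 250, "AAAC-1"),
    ("chr1", 300, 420, "TTTG-1"),
    ("chr2", 50, 120, "GGGA-1"),
    ("chr2", 70, 190, "CCCT-1"),  # barcode without annotation, skipped
]

pseudobulks = defaultdict(list)
for chrom, start, end, barcode in fragments:
    group = cell_type.get(barcode)
    if group is not None:
        pseudobulks[group].append((chrom, start, end, barcode))

# One BED-formatted text per group; a real run would write one file per group under bed_path.
bed_text = {
    group: "\n".join("\t".join(map(str, frag)) for frag in frags)
    for group, frags in pseudobulks.items()
}
print(bed_text["T_cell"])
```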
- pycisTopic.pseudobulk_peak_calling.macs_call_peak(macs_path: str, bed_path: str, name: str, outdir: str, genome_size: str, input_format: str | None = 'BEDPE', shift: int | None = 73, ext_size: int | None = 146, keep_dup: str | None = 'all', q_value: int | None = 0.05, nolambda: bool | None = True, skip_empty_peaks: bool = False)[source]
Performs pseudobulk peak calling with MACS2 for one group. Requires MACS2 to be installed (https://github.com/macs3-project/MACS).
Parameters
- macs_path: str
Path to MACS binary (e.g. /xxx/MACS/xxx/bin/macs2).
- bed_path: str
Path to the fragments BED file.
- name: str
Name of the group.
- outdir: str
Path to the output directory.
- genome_size: str
Effective genome size which is defined as the genome size which can be sequenced. Possible values: ‘hs’, ‘mm’, ‘ce’ and ‘dm’.
- input_format: str, optional
Format of the tag file; can be ELAND, BED, ELANDMULTI, ELANDEXPORT, SAM, BAM, BOWTIE, BAMPE, or BEDPE. MACS itself defaults to AUTO, which lets it detect the format automatically; here the default is ‘BEDPE’.
- shift: int, optional
To set an arbitrary shift in bp. For finding enriched cutting sites (such as in ATAC-seq) a shift of 73 bp is recommended. Default: 73.
- ext_size: int, optional
To extend reads in the 5’->3’ direction to fixed-size fragments. For ATAC-seq data, an extension of 146 bp is recommended. Default: 146.
- keep_dup: str, optional
Whether to keep duplicate tags at the exact same location. Default: ‘all’.
- q_value: float, optional
The q-value (minimum FDR) cutoff to call significant regions. Default: 0.05.
- nolambda: bool, optional
Do not consider the local bias/lambda at peak candidate regions.
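To make the parameters above more concrete, here is a sketch of how they might map onto macs2 callpeak command-line flags. The flags themselves are standard MACS2 options, but build_macs2_command is a hypothetical helper, and the unconditional use of --nomodel and --call-summits is an assumption; the exact command that pycisTopic assembles is not shown in this reference.

```python
import shlex


def build_macs2_command(macs_path, bed_path, name, outdir, genome_size,
                        input_format="BEDPE", shift=73, ext_size=146,
                        keep_dup="all", q_value=0.05, nolambda=True):
    """Hypothetical helper: assemble a `macs2 callpeak` argument list."""
    cmd = [
        macs_path, "callpeak",
        "--treatment", bed_path,
        "--name", name,
        "--outdir", outdir,
        "--format", input_format,
        "--gsize", genome_size,
        "--qvalue", str(q_value),
        "--nomodel",  # assumed: needed for --shift/--extsize to take effect
        "--shift", str(shift),
        "--extsize", str(ext_size),
        "--keep-dup", keep_dup,
        "--call-summits",  # assumed: summits are used later for consensus peaks
    ]
    if nolambda:
        cmd.append("--nolambda")
    return cmd


cmd = build_macs2_command("macs2", "T_cell.bed.gz", "T_cell", "peaks/", "hs")
print(shlex.join(cmd))
```

Passing the arguments as a list (for example to subprocess.run) avoids shell quoting problems with group names that contain spaces.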
- pycisTopic.pseudobulk_peak_calling.peak_calling(macs_path: str, bed_paths: Dict, outdir: str, genome_size: str, n_cpu: int | None = 1, input_format: str | None = 'BEDPE', shift: int | None = 73, ext_size: int | None = 146, keep_dup: str | None = 'all', q_value: float | None = 0.05, nolambda: bool | None = True, skip_empty_peaks: bool = False, **kwargs)[source]
Performs pseudobulk peak calling with MACS2. Requires MACS2 to be installed (https://github.com/macs3-project/MACS).
Parameters
- macs_path: str
Path to MACS binary (e.g. /xxx/MACS/xxx/bin/macs2).
- bed_paths: dict
A dictionary containing group label as name and the path to their corresponding fragments bed file as value.
- outdir: str
Path to the output directory.
- genome_size: str
Effective genome size which is defined as the genome size which can be sequenced. Possible values: ‘hs’, ‘mm’, ‘ce’ and ‘dm’.
- n_cpu: int, optional
Number of cores to use. Default: 1.
- input_format: str, optional
Format of the tag file; can be ELAND, BED, ELANDMULTI, ELANDEXPORT, SAM, BAM, BOWTIE, BAMPE, or BEDPE. MACS itself defaults to AUTO, which lets it detect the format automatically; here the default is ‘BEDPE’.
- shift: int, optional
To set an arbitrary shift in bp. For finding enriched cutting sites (such as in ATAC-seq) a shift of 73 bp is recommended. Default: 73.
- ext_size: int, optional
To extend reads in the 5’->3’ direction to fixed-size fragments. For ATAC-seq data, an extension of 146 bp is recommended. Default: 146.
- keep_dup: str, optional
Whether to keep duplicate tags at the exact same location. Default: ‘all’.
- q_value: float, optional
The q-value (minimum FDR) cutoff to call significant regions. Default: 0.05.
- **kwargs
Additional parameters to pass to ray.init().
Iterative peak filtering
- pycisTopic.iterative_peak_calling.calculate_peaks_and_extend(narrow_peaks: PyRanges, peak_half_width: int, chromsizes: DataFrame | PyRanges | None = None, path_to_blacklist: str | None = None)[source]
Extend peaks a number of base pairs in each direction from the summit.
Parameters
- narrow_peaks: pr.PyRanges
A pr.PyRanges with the narrowPeak results from MACS2.
- peak_half_width: int
Number of base pairs that each summit will be extended in each direction.
- chromsizes: pr.PyRanges or pd.DataFrame
A data frame or pr.PyRanges containing the size of each chromosome, with ‘Chromosome’, ‘Start’ and ‘End’ columns.
- path_to_blacklist: str, optional
Path to bed file containing blacklist regions (Amemiya et al., 2019). Default: None
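The summit extension step can be sketched as follows. In the narrowPeak format the summit is stored as a 0-based offset from the peak start, so the fixed-width interval is centred on start plus offset. extend_summit is a hypothetical helper, and dropping (rather than clipping) intervals that run past a chromosome end is an assumption made for this example.

```python
def extend_summit(chrom, start, summit_offset, peak_half_width, chrom_sizes=None):
    """Hypothetical helper: fixed-width interval centred on a narrowPeak summit."""
    summit = start + summit_offset  # narrowPeak stores the summit relative to start
    new_start = summit - peak_half_width
    new_end = summit + peak_half_width
    if new_start < 0:
        return None  # assumed behaviour: drop intervals that fall off the chromosome start
    if chrom_sizes is not None and new_end > chrom_sizes[chrom]:
        return None  # assumed behaviour: drop intervals that fall off the chromosome end
    return (chrom, new_start, new_end)


inside = extend_summit("chr1", 1000, 120, 250)
too_close_to_start = extend_summit("chr1", 100, 20, 250)
past_the_end = extend_summit("chr1", 1000, 120, 250, chrom_sizes={"chr1": 1200})
print(inside, too_close_to_start, past_the_end)
```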
- pycisTopic.iterative_peak_calling.cpm(x: PyRanges, column: str)[source]
Counts per million (CPM) normalization.
Parameters
- x: pr.PyRanges
A pyRanges object
- column: str
Name of the column that has to be normalized
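Counts-per-million normalization rescales a column so that its values sum to one million, which makes peak scores comparable between pseudobulks of different depth. The sketch below works on a plain list rather than a PyRanges column, and cpm_values is a hypothetical name chosen to avoid confusion with the library function.

```python
def cpm_values(values):
    """Scale values so that they sum to one million (counts per million)."""
    total = sum(values)
    return [value * 1_000_000 / total for value in values]


scores = [10.0, 30.0, 60.0]
normalized = cpm_values(scores)
print(normalized)
```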
- pycisTopic.iterative_peak_calling.get_consensus_peaks(narrow_peaks_dict: Dict[str, PyRanges], peak_half_width: int, chromsizes: DataFrame | PyRanges | None = None, path_to_blacklist: str | None = None)[source]
Returns consensus peaks from a set of MACS narrow peak results. First, each summit is extended by peak_half_width in each direction, and then less significant peaks that overlap with a more significant one are iteratively filtered out. During this procedure peaks are merged and, depending on the number of peaks included in a merged region, different processes happen:
- 1 peak: The original peak region will be kept.
- 2 peaks: The original peak region with the highest score will be kept.
- 3 or more peaks: The original peak region with the most significant score will be taken, and all the original peak regions in this merged peak region that overlap with the significant peak region will be removed. The process is repeated with the next most significant peak (if it was not removed already) until all peaks are processed.
This process happens twice: first within each pseudobulk's peaks, and then, after peak score normalization, on all peaks together.
This approach is described in Corces et al. 2018.
Parameters
- narrow_peaks_dict: dict
A dictionary containing group labels as keys and pr.PyRanges with the narrowPeak results from MACS2 as values (as returned by pycisTopic.pseudobulk_peak_calling.peak_calling()).
- peak_half_width: int
Number of base pairs that each summit will be extended in each direction.
- chromsizes: pr.PyRanges or pd.DataFrame
A data frame or pr.PyRanges containing the size of each chromosome, with ‘Chromosome’, ‘Start’ and ‘End’ columns.
- path_to_blacklist: str, optional
Path to bed file containing blacklist regions (Amemiya et al., 2019). Default: None
- pycisTopic.iterative_peak_calling.iterative_peak_filtering(center_extended_peaks: PyRanges)[source]
Returns consensus peaks from a set of MACS narrow peak results. First, each summit is extended by peak_half_width in each direction, and then less significant peaks that overlap with a more significant one are iteratively filtered out. During this procedure, which is the part implemented in this function, peaks are merged and, depending on the number of peaks included in a merged region, different processes happen:
- 1 peak: The original peak region will be kept.
- 2 peaks: The original peak region with the highest score will be kept.
- 3 or more peaks: The original peak region with the most significant score will be taken, and all the original peak regions in this merged peak region that overlap with the significant peak region will be removed. The process is repeated with the next most significant peak (if it was not removed already) until all peaks are processed.
This process happens twice: first within each pseudobulk's peaks, and then, after peak score normalization, on all peaks together.
This approach is described in Corces et al. 2018.
Parameters
- center_extended_peaks: pr.PyRanges
A pr.PyRanges with all the peaks to be combined (and their MACS score), after centering and extending the peaks.
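The filtering rule described above amounts to a greedy selection: visit peaks from most to least significant and keep a peak only if it does not overlap one that was already kept. The sketch below implements that rule on plain tuples. It is a simplified illustration under that reading of the description, not the library's PyRanges-based implementation, and greedy_peak_filter is a hypothetical name.

```python
def greedy_peak_filter(peaks):
    """peaks: list of (chrom, start, end, score). Returns non-overlapping peaks."""
    kept = []
    # Visit peaks from most to least significant.
    for peak in sorted(peaks, key=lambda p: p[3], reverse=True):
        chrom, start, end, _score = peak
        clash = any(k[0] == chrom and start < k[2] and k[1] < end for k in kept)
        if not clash:
            kept.append(peak)
    # Report the surviving peaks in genomic order.
    return sorted(kept, key=lambda p: (p[0], p[1]))


peaks = [
    ("chr1", 100, 600, 5.0),
    ("chr1", 400, 900, 9.0),   # most significant of the overlapping cluster
    ("chr1", 800, 1300, 7.0),
    ("chr1", 2000, 2500, 1.0), # isolated peak, always kept
]
consensus = greedy_peak_filter(peaks)
print(consensus)
```

In the three-peak cluster, both neighbours overlap the strongest peak and are removed, which matches the "3 or more peaks" case in the description.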
Fragments
- pycisTopic.fragments.create_pyranges_from_polars_df(bed_df_pl: DataFrame) PyRanges [source]
Create PyRanges DataFrame from Polars DataFrame.
- Parameters:
- bed_df_pl
Polars DataFrame containing BED entries, e.g. a filtered Polars DataFrame with fragments or a TSS annotation.
- Returns:
- PyRanges DataFrame.
Examples
Read BED file to Polars DataFrame with pyarrow engine.
>>> bed_df_pl = read_bed_to_polars_df("test.bed", engine="pyarrow")
Create PyRanges object directly from Polars DataFrame.
>>> bed_df_pr = create_pyranges_from_polars_df(bed_df_pl=bed_df_pl)
- pycisTopic.fragments.filter_fragments_by_cb(fragments_df_pl: DataFrame, cbs: Series | Sequence) DataFrame [source]
Filter fragments by cell barcodes.
- Parameters:
- fragments_df_pl
Polars DataFrame with fragments.
- cbs
List/Polars Series with cell barcodes. See pycisTopic.fragments.get_cbs_passing_filter() for a way to get a filtered list of cell barcodes (selected_cbs variable).
- Returns:
- Polars DataFrame with fragments for the requested cell barcodes.
Examples
Read gzipped fragments BED file to a Polars DataFrame.
>>> fragments_df_pl = read_fragments_to_polars_df(
...     fragments_bed_filename="fragments.tsv.gz",
... )
List of cell barcodes for which to retain fragments.
>>> cbs = ["GGACATAAGGGCCACT-1", "ACCTTCATCTTTGAGA-1"]
Polars DataFrame with fragments for the requested cell barcodes.
>>> fragments_cb_filtered_df_pl = filter_fragments_by_cb(
...     fragments_df_pl=fragments_df_pl,
...     cbs=cbs,
... )
List of cell barcodes as a Polars categorical Series for which to retain fragments.
>>> cbs = pl.Series(
...     "CB",
...     ["GGACATAAGGGCCACT-1", "ACCTTCATCTTTGAGA-1"],
...     dtype=pl.Categorical,
... )
Read list of cell barcodes from a file.
>>> cbs = read_barcodes_file_to_polars_series("barcodes.tsv")
Polars DataFrame with fragments for the requested cell barcodes.
>>> fragments_cb_filtered_df_pl = filter_fragments_by_cb(
...     fragments_df_pl=fragments_df_pl,
...     cbs=cbs,
... )
- pycisTopic.fragments.get_cbs_passing_filter(fragments_stats_per_cb_df_pl: pl.DataFrame, cbs: pl.Series | Sequence | None = None, min_fragments_per_cb: int | None = None, keep_top_x_cbs: int | None = None, collapse_duplicates: bool | None = True)[source]
Get cell barcodes passing the filter.
- Parameters:
- fragments_stats_per_cb_df_pl
Polars DataFrame with number of fragments and duplication ratio per cell barcode. See pycisTopic.fragments.get_fragments_per_cb().
- cbs
Cell barcodes to keep. If specified, min_fragments_per_cb and keep_top_x_cbs are ignored.
- min_fragments_per_cb
Minimum number of fragments needed per cell barcode to keep the cell barcode. Only used if cbs is None; keep_top_x_cbs will be ignored.
- keep_top_x_cbs
Keep the x most abundant cell barcodes based on the number of fragments. Only used if cbs is None and min_fragments_per_cb is None.
- collapse_duplicates
Collapse duplicate fragments (same chromosomal positions and linked to the same cell barcode).
- Returns:
- (Cell barcodes passing the filter, fragments_stats_per_cb_df_pl filtered by the cell barcodes passing the filter)
Examples
Read gzipped fragments BED file to a Polars DataFrame.
>>> fragments_df_pl = read_fragments_to_polars_df(
...     fragments_bed_filename="fragments.tsv.gz",
... )
Get number of fragments and duplication ratio per cell barcode (which have 10 fragments or more after collapsing duplicates).
>>> fragments_stats_per_cb_df_pl = get_fragments_per_cb(
...     fragments_df_pl=fragments_df_pl,
...     min_fragments_per_cb=10,
...     collapse_duplicates=True,
... )
Keep only cell barcodes which have 1000 or more fragments.
>>> cbs_selected, fragments_stats_per_cb_filtered_df_pl = get_cbs_passing_filter(
...     fragments_stats_per_cb_df_pl=fragments_stats_per_cb_df_pl,
...     min_fragments_per_cb=1000,
...     collapse_duplicates=True,
... )
Keep only the 4000 most abundant cell barcodes based on the number of fragments after collapsing duplicates.
>>> cbs_selected, fragments_stats_per_cb_filtered_df_pl = get_cbs_passing_filter(
...     fragments_stats_per_cb_df_pl=fragments_stats_per_cb_df_pl,
...     keep_top_x_cbs=4000,
...     collapse_duplicates=True,
... )
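The precedence between the three selection options (an explicit barcode list first, then a minimum fragment count, then the top x barcodes) can be sketched on a plain dictionary. select_cbs is a hypothetical helper; returning every barcode when no option is given is an assumption, since the reference does not state what happens in that case.

```python
def select_cbs(fragments_per_cb, cbs=None, min_fragments_per_cb=None, keep_top_x_cbs=None):
    """Hypothetical helper. fragments_per_cb maps barcode -> fragment count."""
    if cbs is not None:
        # An explicit list takes precedence over both thresholds.
        return [cb for cb in cbs if cb in fragments_per_cb]
    if min_fragments_per_cb is not None:
        return [cb for cb, n in fragments_per_cb.items() if n >= min_fragments_per_cb]
    if keep_top_x_cbs is not None:
        ranked = sorted(fragments_per_cb, key=fragments_per_cb.get, reverse=True)
        return ranked[:keep_top_x_cbs]
    return list(fragments_per_cb)  # assumed fallback: keep everything


stats = {"A": 5000, "B": 800, "C": 1200}
by_threshold = select_cbs(stats, min_fragments_per_cb=1000)
top_two = select_cbs(stats, keep_top_x_cbs=2)
explicit = select_cbs(stats, cbs=["B"], min_fragments_per_cb=1000)
print(by_threshold, top_two, explicit)
```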
- pycisTopic.fragments.get_fragments_in_peaks(fragments_df_pl: DataFrame, regions_df_pl: DataFrame) DataFrame [source]
Get number of total and unique fragments in peaks.
- Parameters:
- fragments_df_pl
Polars DataFrame with fragments.
- regions_df_pl
Polars DataFrame with peak regions (consensus peaks or SCREEN regions). See pycisTopic.fragments.read_bed_to_polars_df() for a way to read a BED file with peak regions.
- Returns:
- Polars DataFrame with total fragment counts and unique fragment counts per region.
Examples
As input, get a Polars DataFrame with fragments for the cell barcodes of interest. See pycisTopic.fragments.filter_fragments_by_cb().
>>> fragments_cb_filtered_df_pl = filter_fragments_by_cb(
...     fragments_df_pl=fragments_df_pl,
...     cbs=cbs,
... )
Read BED file with consensus peaks or SCREEN regions (get first 3 columns only).
>>> regions_df_pl = read_bed_to_polars_df(
...     bed_filename=screen_regions_bed_filename,
...     min_column_count=3,
... )
Polars DataFrame with number of total and unique fragments in peaks.
>>> fragments_in_peaks_df_pl = get_fragments_in_peaks(
...     fragments_df_pl=fragments_cb_filtered_df_pl,
...     regions_df_pl=regions_df_pl,
... )
- pycisTopic.fragments.get_fragments_per_cb(fragments_df_pl: DataFrame, min_fragments_per_cb: int = 10, collapse_duplicates: bool | None = True) DataFrame [source]
Get number of fragments and duplication ratio per cell barcode.
- Parameters:
- fragments_df_pl:
Polars DataFrame with fragments. See pycisTopic.fragments.read_fragments_to_polars_df().
- min_fragments_per_cb:
Minimum number of fragments needed per cell barcode to keep the fragments for those cell barcodes.
- collapse_duplicates:
Collapse duplicate fragments (same chromosomal positions and linked to the same cell barcode).
- Returns:
- Polars DataFrame with number of fragments and duplication ratio per cell barcode.
Examples
Read gzipped fragments BED file to a Polars DataFrame.
>>> fragments_df_pl = read_fragments_to_polars_df(
...     fragments_bed_filename="fragments.tsv.gz",
... )
Get number of fragments and duplication ratio per cell barcode (which have 10 fragments or more after collapsing duplicates).
>>> fragments_stats_per_cb_df_pl = get_fragments_per_cb(
...     fragments_df_pl=fragments_df_pl,
...     min_fragments_per_cb=10,
...     collapse_duplicates=True,
... )
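As a rough illustration of what per-barcode statistics look like, the sketch below derives total and unique fragment counts from a Score column (the number of times each fragment was observed). The duplication ratio is computed here as duplicates divided by total fragments; that definition is an assumption for the example and may differ from the one the library uses.

```python
from collections import defaultdict

# (chromosome, start, end, barcode, score); score = times this fragment was observed.
fragments = [
    ("chr1", 100, 250, "AAAC-1", 3),
    ("chr1", 300, 420, "AAAC-1", 1),
    ("chr2", 50, 120, "TTTG-1", 2),
]

total = defaultdict(int)
unique = defaultdict(int)
for _chrom, _start, _end, cb, score in fragments:
    total[cb] += score  # every observation counts towards the total
    unique[cb] += 1     # each distinct fragment counts once

stats = {
    cb: {
        "total": total[cb],
        "unique": unique[cb],
        # Assumed definition: fraction of observations that are duplicates.
        "duplication_ratio": (total[cb] - unique[cb]) / total[cb],
    }
    for cb in total
}
print(stats)
```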
- pycisTopic.fragments.get_insert_size_distribution(fragments_df_pl: DataFrame) DataFrame [source]
Get insert size distribution of fragments.
- Parameters:
- fragments_df_pl
Polars DataFrame with fragments.
- cbs
List/Polars Series with cell barcodes. See pycisTopic.fragments.get_cbs_passing_filter() for a way to get a filtered list of cell barcodes (selected_cbs variable).
- Returns:
- Polars DataFrame with fragment counts and fragment ratios for each found insert size.
Examples
As input, get a Polars DataFrame with fragments for the cell barcodes of interest. See pycisTopic.fragments.filter_fragments_by_cb().
>>> fragments_cb_filtered_df_pl = filter_fragments_by_cb(
...     fragments_df_pl=fragments_df_pl,
...     cbs=cbs,
... )
Polars DataFrame with insert size distribution of fragments.
>>> insert_size_dist_df_pl = get_insert_size_distribution(
...     fragments_df_pl=fragments_cb_filtered_df_pl,
... )
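The shape of an insert size distribution, a count and a ratio per observed size, can be sketched with a Counter. Taking the insert size as end minus start (half-open BED coordinates) is an assumption made for this example.

```python
from collections import Counter

# (chromosome, start, end) of each fragment.
fragments = [
    ("chr1", 100, 250),
    ("chr1", 300, 450),
    ("chr2", 50, 120),
    ("chr2", 400, 550),
]

# Assumed definition: insert size = end - start.
sizes = Counter(end - start for _chrom, start, end in fragments)
n_fragments = sum(sizes.values())

# insert size -> (fragment count, fraction of all fragments)
distribution = {
    size: (count, count / n_fragments) for size, count in sorted(sizes.items())
}
print(distribution)
```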
- pycisTopic.fragments.read_barcodes_file_to_polars_series(barcodes_tsv_filename: str) Series [source]
Read barcode TSV file to a Polars Series.
- Parameters:
- barcodes_tsv_filename
TSV file with CBs.
- Returns:
- Polars Series with CBs.
Examples
Read gzipped barcodes TSV file to a Polars Series.
>>> cbs = read_barcodes_file_to_polars_series(
...     barcodes_tsv_filename="barcodes.tsv.gz",
... )
Read uncompressed barcodes TSV file to a Polars Series.
>>> cbs = read_barcodes_file_to_polars_series(
...     barcodes_tsv_filename="barcodes.tsv",
... )
- pycisTopic.fragments.read_bed_to_polars_df(bed_filename: str, engine: str | Literal['polars'] | Literal['pyarrow'] = 'pyarrow', min_column_count: int = 3) DataFrame [source]
Read BED file to a Polars DataFrame.
- Parameters:
- bed_filename
BED filename.
- engine
Use Polars or pyarrow to read the BED file (default: pyarrow).
- min_column_count
Minimum number of required columns needed in BED file.
- Returns:
- Polars DataFrame with BED entries.
Examples
Read BED file to Polars DataFrame with pyarrow engine.
>>> bed_df_pl = read_bed_to_polars_df("test.bed", engine="pyarrow")
Read BED file to Polars DataFrame with pyarrow engine and require that the BED file has at least 4 columns.
>>> bed_with_at_least_4_columns_df_pl = read_bed_to_polars_df( ... "test.bed", ... engine="pyarrow", ... min_column_count=4, ... )
- pycisTopic.fragments.read_fragments_to_polars_df(fragments_bed_filename: str, engine: str | Literal['polars'] | Literal['pyarrow'] = 'pyarrow') DataFrame [source]
Read fragments BED file to a Polars DataFrame.
If fragments don’t have a Score column, a Score column is created by counting the number of fragments with the same chromosome, start, end and CB.
- Parameters:
- fragments_bed_filename
Fragments BED filename.
- engine
Use Polars or pyarrow to read the fragments BED file (default: pyarrow).
- Returns:
- Polars DataFrame with fragments.
Examples
Read gzipped fragments BED file to a Polars DataFrame.
>>> fragments_df_pl = read_fragments_to_polars_df( ... fragments_bed_filename="fragments.tsv.gz", ... )
Read uncompressed fragments BED file to a Polars DataFrame.
>>> fragments_df_pl = read_fragments_to_polars_df( ... fragments_bed_filename="fragments.tsv", ... )
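The way a missing Score column is derived (counting fragments with the same chromosome, start, end and CB) can be sketched in plain Python. This is only an illustration of the counting rule described above, not the Polars code used by the library. The function name is hypothetical.

```python
from collections import Counter


def add_score_column_sketch(fragments):
    """Collapse identical (chromosome, start, end, cb) rows and count them.

    Returns rows of (chromosome, start, end, cb, score) in first-seen order.
    """
    counts = Counter(fragments)  # keeps insertion order of first occurrence
    return [(*fragment, n) for fragment, n in counts.items()]


toy_scored = add_score_column_sketch(
    [
        ("chr1", 100, 250, "CB_A"),
        ("chr1", 100, 250, "CB_A"),
        ("chr1", 100, 250, "CB_B"),
    ]
)
```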
- pycisTopic.fragments.read_fragments_to_pyranges(fragments_bed_filename: str, engine: str | Literal['polars'] | Literal['pyarrow'] | Literal['pandas'] = 'pyarrow') PyRanges [source]
Read fragments BED file to PyRanges object.
- Parameters:
- fragments_bed_filename
Fragments BED filename.
- engine
Use Polars, pyarrow or pandas to read the fragments BED file (default: pyarrow).
- Returns:
- PyRanges object with fragments.
Examples
Read BED file to PyRanges object with pyarrow engine.
>>> bed_pr = read_fragments_to_pyranges("test.bed", engine="pyarrow")
Gene annotation
- pycisTopic.gene_annotation.change_chromosome_source_in_bed(chrom_sizes_and_alias_df_pl: DataFrame, bed_df_pl: DataFrame, from_chrom_source_name: str, to_chrom_source_name: str) DataFrame [source]
Change chromosome names from Polars DataFrame with BED entries from one chromosome source to another one.
- Parameters:
- chrom_sizes_and_alias_df_pl
Polars DataFrame with chromosome sizes and alias mapping. See pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_file(), pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ncbi() and pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ucsc().
- bed_df_pl
Polars DataFrame with BED entries for which chromosome names need to be remapped from from_chrom_source_name to to_chrom_source_name. See pycisTopic.fragments.read_bed_to_polars_df() and pycisTopic.gene_annotation.read_tss_annotation_from_bed().
- from_chrom_source_name
Current chromosome source name for the input BED file: ucsc, ensembl, genbank or refseq. Can be guessed with pycisTopic.gene_annotation.find_most_likely_chromosome_source_in_bed().
- to_chrom_source_name
Chromosome source name to which the output Polars DataFrame with BED entries should be mapped: ucsc, ensembl, genbank or refseq.
- Returns:
- Polars Dataframe with BED entries with changed chromosome names.
See also
pycisTopic.fragments.read_bed_to_polars_df
pycisTopic.gene_annotation.find_most_likely_chromosome_source_in_bed
pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_file
pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ncbi
pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ucsc
pycisTopic.gene_annotation.read_tss_annotation_from_bed
pycisTopic.gene_annotation.write_tss_annotation_to_bed
Examples
Get chromosome sizes and alias mapping for hg38.
>>> chrom_sizes_and_alias_hg38_df_pl = get_chrom_sizes_and_alias_mapping_from_ucsc(ucsc_assembly="hg38")
Get gene annotation for hg38 from Ensembl BioMart.
>>> hg38_tss_annotation_bed_df_pl = get_tss_annotation_from_ensembl( ... biomart_name="hsapiens_gene_ensembl", ... ) >>> hg38_tss_annotation_bed_df_pl
Replace Ensembl chromosome names with UCSC chromosome names in gene annotation for hg38.
>>> hg38_tss_annotation_ucsc_chroms_bed_df_pl = change_chromosome_source_in_bed( ... chrom_sizes_and_alias_df_pl=chrom_sizes_and_alias_hg38_df_pl, ... bed_df_pl=hg38_tss_annotation_bed_df_pl, ... from_chrom_source_name="ensembl", ... to_chrom_source_name="ucsc", ... ) >>> hg38_tss_annotation_ucsc_chroms_bed_df_pl
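Conceptually, the remapping is a lookup from one alias column to another. The sketch below shows that idea with plain dictionaries. The alias table is a toy stand-in with assumed column names (ucsc, ensembl, genbank, refseq), the function name is hypothetical, and dropping BED entries without a mapping is an assumption rather than documented behaviour.

```python
def change_chromosome_source_sketch(alias_rows, bed_rows, from_source, to_source):
    """Rename the chromosome (first field) of each BED row.

    alias_rows: list of dicts, one per chromosome, keyed by source name.
    bed_rows: list of tuples whose first element is the chromosome name.
    """
    mapping = {
        row[from_source]: row[to_source]
        for row in alias_rows
        if row.get(from_source) and row.get(to_source)
    }
    # Assumption: entries on chromosomes without a mapping are dropped.
    return [
        (mapping[chrom], *rest) for chrom, *rest in bed_rows if chrom in mapping
    ]


toy_alias_rows = [
    {"ucsc": "chr1", "ensembl": "1", "genbank": "CM000663.2", "refseq": "NC_000001.11"},
    {"ucsc": "chrM", "ensembl": "MT", "genbank": "J01415.2", "refseq": "NC_012920.1"},
]
toy_bed_rows = [("1", 999, 1000, "GENE_A"), ("MT", 10, 11, "GENE_B"), ("scaffold_x", 5, 6, "GENE_C")]
toy_remapped = change_chromosome_source_sketch(
    toy_alias_rows, toy_bed_rows, from_source="ensembl", to_source="ucsc"
)
```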
- pycisTopic.gene_annotation.find_most_likely_chromosome_source_in_bed(chrom_sizes_and_alias_df_pl: pl.DataFrame, bed_df_pl: pl.DataFrame)[source]
Find which chromosome source is the most likely in the provided BED file entries.
Find which chromosome source (UCSC, Ensembl, GenBank and RefSeq) given as a chrom_sizes_and_alias_df_pl Polars DataFrame is the most likely in the provided Polars DataFrame with BED entries.
- Parameters:
- chrom_sizes_and_alias_df_pl
Polars DataFrame with chromosome sizes and alias mapping. See pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_file(), pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ncbi() and pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ucsc().
- bed_df_pl
Polars DataFrame with BED entries. See pycisTopic.fragments.read_bed_to_polars_df().
- Returns:
- Tuple of most likely chromosome source and a Polars DataFrame with the ranking of all possible chromosome sources.
See also
pycisTopic.fragments.read_bed_to_polars_df
pycisTopic.gene_annotation.change_chromosome_source_in_bed
pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_file
pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ncbi
pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ucsc
Examples
>>> chrom_sizes_and_alias_hg38_df_pl = get_chrom_sizes_and_alias_mapping_from_ucsc(ucsc_assembly="hg38") >>> bed_df_pl = read_bed_to_polars_df("test.bed", engine="pyarrow") >>> best_chrom_source_name, chrom_source_stats_df_pl = find_most_likely_chromosome_source_in_bed( ... chrom_sizes_and_alias_df_pl=chrom_sizes_and_alias_hg38_df_pl, ... bed_df_pl=bed_df_pl, ... ) >>> print(best_chrom_source_name, chrom_source_stats_df_pl)
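One plausible way to rank chromosome sources is to count how many distinct chromosome names in the BED entries appear in each alias column. The sketch below does exactly that. Whether pycisTopic ranks sources this way is an assumption, and the function name and toy alias table are hypothetical.

```python
def most_likely_chromosome_source_sketch(
    alias_rows, bed_chroms, sources=("ucsc", "ensembl", "genbank", "refseq")
):
    """Return (best_source, ranking), where ranking lists (source, hits).

    hits = number of distinct BED chromosome names found in that source column.
    """
    observed = set(bed_chroms)
    ranking = sorted(
        (
            (source, len(observed & {row[source] for row in alias_rows if row.get(source)}))
            for source in sources
        ),
        key=lambda item: item[1],
        reverse=True,  # sorted() is stable, so ties keep the order of `sources`
    )
    return ranking[0][0], ranking


toy_alias_rows = [
    {"ucsc": "chr1", "ensembl": "1", "genbank": "CM000663.2", "refseq": "NC_000001.11"},
    {"ucsc": "chrM", "ensembl": "MT", "genbank": "J01415.2", "refseq": "NC_012920.1"},
]
best_source, source_ranking = most_likely_chromosome_source_sketch(
    toy_alias_rows, bed_chroms=["1", "1", "MT"]
)
```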
- pycisTopic.gene_annotation.get_all_gene_annotation_ensembl_biomart_dataset_names(biomart_host: str = 'http://www.ensembl.org', use_cache: bool = True) pd.DataFrame [source]
Get all available gene annotation Ensembl BioMart dataset names.
- Parameters:
- biomart_host
BioMart host URL to use. Default: http://www.ensembl.org
Archived Ensembl BioMart URLs: https://www.ensembl.org/info/website/archives/index.html (List of currently available archives)
- use_cache
Whether to cache requests to Ensembl BioMart server.
- Returns:
- Pandas dataframe with all available gene annotation Ensembl BioMart datasets.
Examples
>>> biomart_latest_datasets = get_all_gene_annotation_ensembl_biomart_dataset_names( ... biomart_host="http://www.ensembl.org", ... ) >>> biomart_jul2022_datasets = get_all_gene_annotation_ensembl_biomart_dataset_names( ... biomart_host="http://jul2022.archive.ensembl.org/", ... )
- pycisTopic.gene_annotation.get_biomart_dataset_name_for_species(biomart_datasets: pd.DataFrame, species: str) pd.DataFrame [source]
Get gene annotation Ensembl BioMart dataset names for species of interest.
- Parameters:
- biomart_datasets
All gene annotation Ensembl BioMart datasets. See pycisTopic.gene_annotation.get_all_gene_annotation_ensembl_biomart_dataset_names().
- species
Species name to search for.
- Returns:
- Filtered list of gene annotation Ensembl BioMart dataset names.
- pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_file(chrom_sizes_and_alias_tsv_filename: str | Path) DataFrame [source]
Get chromosome sizes and alias mapping from a chromosome alias TSV file.
Get chromosome sizes and alias mapping from a chromosome alias TSV file to map chromosome names between UCSC, Ensembl, GenBank and RefSeq chromosome names.
- Parameters:
- chrom_sizes_and_alias_tsv_filename:
Chromosome alias TSV file created with get_chrom_sizes_and_alias_mapping_from_ncbi or get_chrom_sizes_and_alias_mapping_from_ucsc.
- Returns:
- Polars Dataframe with chromosome sizes and alias mapping between UCSC, Ensembl, GenBank and RefSeq chromosome names.
Examples
Get chromosome sizes and alias mapping for hg38 from a previously written TSV file:
>>> chrom_sizes_and_alias_hg38_from_file_df_pl = get_chrom_sizes_and_alias_mapping_from_file( ... chrom_sizes_and_alias_tsv_filename="hg38.chrom_sizes_and_alias.tsv", ... )
- pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ncbi(accession_id: str, chrom_sizes_and_alias_tsv_filename: str | Path | None) DataFrame [source]
Get chromosome sizes and alias mapping from NCBI sequence reports.
Get chromosome sizes and alias mapping from NCBI sequence reports to be able to map chromosome names between UCSC, Ensembl, GenBank and RefSeq chromosome names or read mapping from local file (chrom_sizes_and_alias_tsv_filename) instead.
- Parameters:
- accession_id
NCBI assembly accession ID.
- chrom_sizes_and_alias_tsv_filename
If specified, write the chromosome sizes and alias mapping to the specified file.
- Returns:
- Polars Dataframe with chromosome alias mapping between UCSC, Ensembl, GenBank and RefSeq chromosome names.
Examples
Get chromosome sizes and alias mapping for different assemblies from NCBI.
Assembly accession IDs for a species can be queried with pycisTopic.gene_annotation.get_ncbi_assembly_accessions_for_species.
>>> chrom_sizes_and_alias_hg38_df_pl = get_chrom_sizes_and_alias_mapping_from_ncbi( ... accession_id="GCF_000001405.40" ... ) >>> chrom_sizes_and_alias_mm10_df_pl = get_chrom_sizes_and_alias_mapping_from_ncbi( ... accession_id="GCF_000001215.4" ... ) >>> chrom_sizes_and_alias_dm6_df_pl = get_chrom_sizes_and_alias_mapping_from_ncbi( ... accession_id="GCF_000001215.4" ... )
Get chromosome sizes and alias mapping for Homo sapiens and also write it to a TSV file:
>>> chrom_sizes_and_alias_hg38_df_pl = get_chrom_sizes_and_alias_mapping_from_ncbi( ... accession_id="GCF_000001405.40", ... chrom_sizes_and_alias_tsv_filename="GCF_000001405.40.chrom_sizes_and_alias.tsv", ... )
- pycisTopic.gene_annotation.get_chrom_sizes_and_alias_mapping_from_ucsc(ucsc_assembly: str, chrom_sizes_and_alias_tsv_filename: str | Path | None = None) DataFrame [source]
Get chromosome sizes and alias mapping from UCSC genome browser.
Get chromosome sizes and alias mapping from UCSC genome browser for UCSC assembly to be able to map chromosome names between UCSC, Ensembl, GenBank and RefSeq chromosome names or read mapping from local file (chrom_sizes_and_alias_tsv_filename) instead.
- Parameters:
- ucsc_assembly:
UCSC assembly names (hg38, mm10, dm6, …).
- chrom_sizes_and_alias_tsv_filename:
If specified, write the chromosome sizes and alias mapping to the specified file.
- Returns:
- Polars Dataframe with chromosome sizes and alias mapping between UCSC, Ensembl, GenBank and RefSeq chromosome names.
Examples
Get chromosome sizes and aliases for different assemblies from UCSC:
>>> chrom_sizes_and_alias_hg38_df_pl = get_chrom_sizes_and_alias_mapping_from_ucsc( ... ucsc_assembly="hg38" ... ) >>> chrom_sizes_and_alias_mm10_df_pl = get_chrom_sizes_and_alias_mapping_from_ucsc( ... ucsc_assembly="mm10" ... ) >>> chrom_sizes_and_alias_dm6_df_pl = get_chrom_sizes_and_alias_mapping_from_ucsc( ... ucsc_assembly="dm6" ... )
Get chromosome sizes and aliases for hg38 and also write it to a TSV file:
>>> chrom_sizes_and_alias_hg38_df_pl = get_chrom_sizes_and_alias_mapping_from_ucsc( ... ucsc_assembly="hg38", ... chrom_sizes_and_alias_tsv_filename="hg38.chrom_sizes_and_alias.tsv", ... )
- pycisTopic.gene_annotation.get_ncbi_assembly_accessions_for_species(species: str) str [source]
Get NCBI assembly accession numbers and assembly names for a certain species.
- Parameters:
- species
Species name (latin name) for which to look for NCBI assembly accession numbers.
- Returns:
- String with NCBI assembly accession number and assembly name.
Examples
>>> print(get_ncbi_assembly_accessions_for_species("homo sapiens")) accession assembly_name GCF_000001405.40 GRCh38.p14 GCF_000001405.25 GRCh37.p13 GCF_000001405.26 GRCh38 GCF_000001405.27 GRCh38.p1 GCF_000001405.28 GRCh38.p2 GCF_000001405.29 GRCh38.p3 GCF_000001405.30 GRCh38.p4 GCF_000001405.31 GRCh38.p5 GCF_000001405.32 GRCh38.p6 GCF_000001405.33 GRCh38.p7 GCF_000001405.34 GRCh38.p8 GCF_000001405.35 GRCh38.p9 GCF_000001405.36 GRCh38.p10 GCF_000001405.37 GRCh38.p11 GCF_000001405.38 GRCh38.p12 GCF_000001405.39 GRCh38.p13 GCF_000002125.1 HuRef GCF_000306695.2 CHM1_1.1 GCF_009914755.1 T2T-CHM13v2.0 >>> print(get_ncbi_assembly_accessions_for_species("drosophila melanogaster")) accession assembly_name GCF_000001215.4 Release 6 plus ISO1 MT
- pycisTopic.gene_annotation.get_tss_annotation_from_ensembl(biomart_name: str, biomart_host: str = 'http://www.ensembl.org', transcript_type: Sequence[str] | None = ['protein_coding'], use_cache: bool = True) DataFrame [source]
Get TSS annotation for requested transcript types from Ensembl BioMart.
- Parameters:
- biomart_name
Ensembl BioMart ID of the dataset. See pycisTopic.gene_annotation.get_biomart_dataset_name_for_species() to get the biomart_name for species of interest: e.g.: hsapiens_gene_ensembl, mmusculus_gene_ensembl, dmelanogaster_gene_ensembl, …
- biomart_host
BioMart host URL to use. Default: http://www.ensembl.org
Archived Ensembl BioMart URLs: https://www.ensembl.org/info/website/archives/index.html (List of currently available archives)
- transcript_type
Only keep list of specified transcript types (e.g.: ["protein_coding"]) or all (None).
- use_cache
Whether to cache requests to Ensembl BioMart server.
- Returns:
- Polars DataFrame with TSS positions in BED format.
Examples
>>> tss_annotation_bed_df_pl = get_tss_annotation_from_ensembl( ... biomart_name="hsapiens_gene_ensembl" ... ) >>> tss_annotation_jul2022_bed_df_pl = get_tss_annotation_from_ensembl( ... biomart_name="hsapiens_gene_ensembl", ... biomart_host="http://jul2022.archive.ensembl.org/", ... )
- pycisTopic.gene_annotation.read_tss_annotation_from_bed(tss_annotation_bed_filename: str) DataFrame [source]
Read TSS annotation BED file to Polars DataFrame.
Read TSS annotation BED file created by pycisTopic.gene_annotation.get_tss_annotation_from_ensembl() and pycisTopic.gene_annotation.write_tss_annotation_to_bed() to Polars DataFrame with TSS positions in BED format.
- Parameters:
- tss_annotation_bed_filename
TSS annotation BED file to read. TSS annotation BED files can be written with pycisTopic.gene_annotation.write_tss_annotation_to_bed() and will have the following header line: # Chromosome Start End Gene Score Strand Transcript_type
- Minimum required columns for pycisTopic.tss_profile.get_tss_profile(): Chromosome, Start (0-based BED), Strand
- Returns:
- Polars DataFrame with TSS positions in BED format.
Examples
Get TSS annotation from Ensembl.
>>> tss_annotation_bed_df_pl = get_tss_annotation_from_ensembl( ... biomart_name="hsapiens_gene_ensembl" ... )
If your fragments files use a different chromosome convention than the one used by Ensembl, take a look at pycisTopic.gene_annotation.change_chromosome_source_in_bed() to convert the Ensembl chromosome names to UCSC, GenBank or RefSeq chromosome names.
Write TSS annotation to a file.
>>> write_tss_annotation_to_bed( ... tss_annotation_bed_df_pl=tss_annotation_bed_df_pl, ... tss_annotation_bed_filename="hg38.tss.bed", ... )
Read TSS annotation from a file.
>>> tss_annotation_bed_df_pl = read_tss_annotation_from_bed( ... tss_annotation_bed_filename="hg38.tss.bed" ... )
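To make the file layout concrete, here is a small standard-library sketch that parses a TSS annotation BED file with the header line shown above and checks the minimum required columns. It assumes the file is tab-separated and that the header starts with "# ". The function name and the toy record (GENE_A) are hypothetical, and the library itself reads the file into a Polars DataFrame rather than a list of dicts.

```python
import csv
import io

REQUIRED_COLUMNS = ("Chromosome", "Start", "Strand")


def parse_tss_annotation_bed_sketch(handle):
    """Parse a TSS annotation BED file-like object into a list of dicts."""
    # Header line: "# Chromosome Start End Gene Score Strand Transcript_type"
    header = handle.readline().lstrip("#").split()
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    rows = []
    for row in csv.DictReader(handle, fieldnames=header, delimiter="\t"):
        row["Start"] = int(row["Start"])  # 0-based BED start
        if "End" in row:
            row["End"] = int(row["End"])
        rows.append(row)
    return rows


example_text = (
    "# Chromosome\tStart\tEnd\tGene\tScore\tStrand\tTranscript_type\n"
    "chr1\t999\t1000\tGENE_A\t.\t+\tprotein_coding\n"
)
tss_rows = parse_tss_annotation_bed_sketch(io.StringIO(example_text))
```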
- pycisTopic.gene_annotation.write_tss_annotation_to_bed(tss_annotation_bed_df_pl, tss_annotation_bed_filename: str) None [source]
Write TSS annotation Polars DataFrame to a BED file.
Write TSS annotation Polars DataFrame with TSS positions in BED format to a BED file.
- Parameters:
- tss_annotation_bed_df_pl
TSS annotation Polars DataFrame with TSS positions in BED format created with pycisTopic.gene_annotation.get_tss_annotation_from_ensembl().
- tss_annotation_bed_filename
TSS annotation BED file to write to. TSS annotation BED files from pycisTopic.gene_annotation.get_tss_annotation_from_ensembl() will have the following header line: # Chromosome Start End Gene Score Strand Transcript_type
- Minimum required columns for pycisTopic.tss_profile.get_tss_profile(): Chromosome, Start (0-based BED), Strand
- Returns:
- None. The TSS annotation is written to tss_annotation_bed_filename.
Examples
Get TSS annotation from Ensembl.
>>> tss_annotation_bed_df_pl = get_tss_annotation_from_ensembl( ... biomart_name="hsapiens_gene_ensembl" ... )
If your fragments files use a different chromosome convention than the one used by Ensembl, take a look at pycisTopic.gene_annotation.change_chromosome_source_in_bed() to convert the Ensembl chromosome names to UCSC, GenBank or RefSeq chromosome names.
Write TSS annotation to a file.
>>> write_tss_annotation_to_bed( ... tss_annotation_bed_df_pl=tss_annotation_bed_df_pl, ... tss_annotation_bed_filename="hg38.tss.bed", ... )
Read TSS annotation from a file.
>>> tss_annotation_bed_df_pl = read_tss_annotation_from_bed( ... tss_annotation_bed_filename="hg38.tss.bed" ... )
Genomic ranges
- pycisTopic.genomic_ranges.intersection(regions1_df_pl: DataFrame, regions2_df_pl: DataFrame, how: Literal['all', 'containment', 'first', 'last'] | str | None = None, regions1_info: bool = True, regions2_info: bool = False, regions1_coord: bool = False, regions2_coord: bool = False, regions1_suffix: str = '@1', regions2_suffix: str = '@2') DataFrame [source]
Get overlapping subintervals between first set and second set of regions.
- Parameters:
- regions1_df_pl
Polars DataFrame containing BED entries for first set of regions.
- regions2_df_pl
Polars DataFrame containing BED entries for second set of regions.
- how
- What intervals to report:
"all" (None): all overlaps with second set of regions.
"containment": only overlaps where region of first set is contained within region of second set.
"first": first overlap with second set of regions.
"last": last overlap with second set of regions.
"outer": all regions for first and all regions of second (outer join). If no overlap was found for a region, the other region set will contain None for that entry.
"left": all first set of regions and overlap with second set of regions (left join). If no overlap was found for a region in the first set, the second region set will contain None for that entry.
"right": all second set of regions and overlap with first set of regions (right join). If no overlap was found for a region in the second set, the first region set will contain None for that entry.
- regions1_info
Add non-coordinate columns from first set of regions to output of intersection.
- regions2_info
Add non-coordinate columns from second set of regions to output of intersection.
- regions1_coord
Add coordinates from first set of regions to output of intersection.
- regions2_coord
Add coordinates from second set of regions to output of intersection.
- regions1_suffix
Suffix added to coordinate columns of first set of regions.
- regions2_suffix
Suffix added to coordinate and info columns of second set of regions.
- strandedness
Note: Not implemented yet. {None, "same", "opposite", False}, default None, i.e. auto. Whether to compare PyRanges on the same strand, the opposite or ignore strand information. The default, None, means use "same" if both PyRanges are stranded, otherwise ignore the strand information.
- Returns:
- intersection_df_pl
Polars Dataframe containing BED entries with the intersection.
Examples
>>> regions1_df_pl = pl.from_dict( ... { ... "Chromosome": ["chr1"] * 3, ... "Start": [1, 4, 10], ... "End": [3, 9, 11], ... "ID": ["a", "b", "c"], ... } ... ) >>> regions1_df_pl shape: (3, 4) ┌────────────┬───────┬─────┬─────┐ │ Chromosome ┆ Start ┆ End ┆ ID │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ str │ ╞════════════╪═══════╪═════╪═════╡ │ chr1 ┆ 1 ┆ 3 ┆ a │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤ │ chr1 ┆ 4 ┆ 9 ┆ b │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤ │ chr1 ┆ 10 ┆ 11 ┆ c │ └────────────┴───────┴─────┴─────┘
>>> regions2_df_pl = pl.from_dict( ... { ... "Chromosome": ["chr1"] * 3, ... "Start": [2, 2, 9], ... "End": [3, 9, 10], ... "Name": ["reg1", "reg2", "reg3"] ... } ... ) >>> regions2_df_pl shape: (3, 4) ┌────────────┬───────┬─────┬──────┐ │ Chromosome ┆ Start ┆ End ┆ Name │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ str │ ╞════════════╪═══════╪═════╪══════╡ │ chr1 ┆ 2 ┆ 3 ┆ reg1 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤ │ chr1 ┆ 2 ┆ 9 ┆ reg2 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤ │ chr1 ┆ 9 ┆ 10 ┆ reg3 │ └────────────┴───────┴─────┴──────┘
>>> intersection(regions1_df_pl, regions2_df_pl) shape: (3, 4) ┌────────────┬───────┬─────┬─────┐ │ Chromosome ┆ Start ┆ End ┆ ID │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ str │ ╞════════════╪═══════╪═════╪═════╡ │ chr1 ┆ 2 ┆ 3 ┆ a │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤ │ chr1 ┆ 2 ┆ 3 ┆ a │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤ │ chr1 ┆ 4 ┆ 9 ┆ b │ └────────────┴───────┴─────┴─────┘
>>> intersection(regions1_df_pl, regions2_df_pl, how="first") shape: (2, 4) ┌────────────┬───────┬─────┬─────┐ │ Chromosome ┆ Start ┆ End ┆ ID │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ str │ ╞════════════╪═══════╪═════╪═════╡ │ chr1 ┆ 2 ┆ 3 ┆ a │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤ │ chr1 ┆ 4 ┆ 9 ┆ b │ └────────────┴───────┴─────┴─────┘
>>> intersection( ... regions1_df_pl, ... regions2_df_pl, ... how="containment", ... regions1_info=False, ... regions2_info=True, ... ) shape: (1, 4) ┌────────────┬───────┬─────┬──────┐ │ Chromosome ┆ Start ┆ End ┆ Name │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ str │ ╞════════════╪═══════╪═════╪══════╡ │ chr1 ┆ 4 ┆ 9 ┆ reg2 │ └────────────┴───────┴─────┴──────┘
>>> intersection( ... regions1_df_pl, ... regions2_df_pl, ... regions1_coord=True, ... regions2_coord=True, ... ) shape: (3, 10) ┌────────────┬───────┬─────┬──────────────┬─────────┬───────┬──────────────┬─────────┬───────┬─────┐ │ Chromosome ┆ Start ┆ End ┆ Chromosome@1 ┆ Start@1 ┆ End@1 ┆ Chromosome@2 ┆ Start@2 ┆ End@2 ┆ ID │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ str ┆ i64 ┆ i64 ┆ str ┆ i64 ┆ i64 ┆ str │ ╞════════════╪═══════╪═════╪══════════════╪═════════╪═══════╪══════════════╪═════════╪═══════╪═════╡ │ chr1 ┆ 2 ┆ 3 ┆ chr1 ┆ 1 ┆ 3 ┆ chr1 ┆ 2 ┆ 9 ┆ a │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤ │ chr1 ┆ 2 ┆ 3 ┆ chr1 ┆ 1 ┆ 3 ┆ chr1 ┆ 2 ┆ 3 ┆ a │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤ │ chr1 ┆ 4 ┆ 9 ┆ chr1 ┆ 4 ┆ 9 ┆ chr1 ┆ 2 ┆ 9 ┆ b │ └────────────┴───────┴─────┴──────────────┴─────────┴───────┴──────────────┴─────────┴───────┴─────┘
>>> intersection( ... regions1_df_pl, ... regions2_df_pl, ... regions1_info=False, ... regions2_info=True, ... regions2_coord=True, ... ) shape: (3, 7) ┌────────────┬───────┬─────┬──────────────┬─────────┬───────┬──────┐ │ Chromosome ┆ Start ┆ End ┆ Chromosome@2 ┆ Start@2 ┆ End@2 ┆ Name │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ str ┆ i64 ┆ i64 ┆ str │ ╞════════════╪═══════╪═════╪══════════════╪═════════╪═══════╪══════╡ │ chr1 ┆ 2 ┆ 3 ┆ chr1 ┆ 2 ┆ 9 ┆ reg2 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ chr1 ┆ 2 ┆ 3 ┆ chr1 ┆ 2 ┆ 3 ┆ reg1 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ chr1 ┆ 4 ┆ 9 ┆ chr1 ┆ 2 ┆ 9 ┆ reg2 │ └────────────┴───────┴─────┴──────────────┴─────────┴───────┴──────┘
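The default behaviour (report every overlapping subinterval) can be reproduced with a naive pure-Python sketch. It treats intervals as half-open BED coordinates and compares every pair of regions, so it only illustrates the semantics and is not the algorithm used by the library. The function name is hypothetical.

```python
def intersection_sketch(regions1, regions2):
    """Return overlapping subintervals as (chromosome, start, end, id1).

    Each region is (chromosome, start, end, label) with half-open
    coordinates [start, end). Quadratic in the number of regions.
    """
    hits = []
    for chrom1, start1, end1, id1 in regions1:
        for chrom2, start2, end2, _name in regions2:
            if chrom1 != chrom2:
                continue
            start = max(start1, start2)
            end = min(end1, end2)
            if start < end:  # non-empty overlap
                hits.append((chrom1, start, end, id1))
    return hits


# Same toy regions as in the examples above.
regions1 = [("chr1", 1, 3, "a"), ("chr1", 4, 9, "b"), ("chr1", 10, 11, "c")]
regions2 = [("chr1", 2, 3, "reg1"), ("chr1", 2, 9, "reg2"), ("chr1", 9, 10, "reg3")]
toy_intersection = intersection_sketch(regions1, regions2)
```

Region c and reg3 touch at position 10 but do not overlap under half-open coordinates, so c contributes no row.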
- pycisTopic.genomic_ranges.overlap(regions1_df_pl: DataFrame, regions2_df_pl: DataFrame, how: Literal['all', 'containment', 'first'] | str | None = 'first', invert: bool = False) DataFrame [source]
Get overlap between two region sets.
Get overlap between first set and second set of regions and return interval of first set of regions.
- Parameters:
- regions1_df_pl
Polars DataFrame containing BED entries for first set of regions.
- regions2_df_pl
Polars DataFrame containing BED entries for second set of regions.
- how
- What overlaps to report:
"all" (None): all overlaps with second set of regions.
"containment": only overlaps where region of first set is contained within region of second set.
"first": first overlap with second set of regions.
- invert
Whether to return the intervals without overlaps.
- strandedness
Note: Not implemented yet. {None, "same", "opposite", False}, default None, i.e. auto. Whether to compare PyRanges on the same strand, the opposite or ignore strand information. The default, None, means use "same" if both PyRanges are stranded, otherwise ignore the strand information.
- Returns:
- overlap_df_pl
Polars Dataframe containing BED entries with the overlap.
Examples
>>> regions1_df_pl = pl.from_dict( ... { ... "Chromosome": ["chr1"] * 3, ... "Start": [1, 4, 10], ... "End": [3, 9, 11], ... "ID": ["a", "b", "c"], ... } ... ) >>> regions1_df_pl shape: (3, 4) ┌────────────┬───────┬─────┬─────┐ │ Chromosome ┆ Start ┆ End ┆ ID │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ str │ ╞════════════╪═══════╪═════╪═════╡ │ chr1 ┆ 1 ┆ 3 ┆ a │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤ │ chr1 ┆ 4 ┆ 9 ┆ b │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤ │ chr1 ┆ 10 ┆ 11 ┆ c │ └────────────┴───────┴─────┴─────┘
>>> regions2_df_pl = pl.from_dict( ... { ... "Chromosome": ["chr1"] * 3, ... "Start": [2, 2, 9], ... "End": [3, 9, 10], ... "Name": ["reg1", "reg2", "reg3"] ... } ... ) >>> regions2_df_pl shape: (3, 4) ┌────────────┬───────┬─────┬──────┐ │ Chromosome ┆ Start ┆ End ┆ Name │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ str │ ╞════════════╪═══════╪═════╪══════╡ │ chr1 ┆ 2 ┆ 3 ┆ reg1 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤ │ chr1 ┆ 2 ┆ 9 ┆ reg2 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤ │ chr1 ┆ 9 ┆ 10 ┆ reg3 │ └────────────┴───────┴─────┴──────┘
>>> overlap(regions1_df_pl, regions2_df_pl, how="first") shape: (2, 4) ┌────────────┬───────┬─────┬─────┐ │ Chromosome ┆ Start ┆ End ┆ ID │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ str │ ╞════════════╪═══════╪═════╪═════╡ │ chr1 ┆ 1 ┆ 3 ┆ a │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤ │ chr1 ┆ 4 ┆ 9 ┆ b │ └────────────┴───────┴─────┴─────┘
>>> overlap(regions1_df_pl, regions2_df_pl, how="all") shape: (3, 4) ┌────────────┬───────┬─────┬─────┐ │ Chromosome ┆ Start ┆ End ┆ ID │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ str │ ╞════════════╪═══════╪═════╪═════╡ │ chr1 ┆ 1 ┆ 3 ┆ a │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤ │ chr1 ┆ 1 ┆ 3 ┆ a │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤ │ chr1 ┆ 4 ┆ 9 ┆ b │ └────────────┴───────┴─────┴─────┘
>>> overlap(regions1_df_pl, regions2_df_pl, how="containment") shape: (1, 4) ┌────────────┬───────┬─────┬─────┐ │ Chromosome ┆ Start ┆ End ┆ ID │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ str │ ╞════════════╪═══════╪═════╪═════╡ │ chr1 ┆ 4 ┆ 9 ┆ b │ └────────────┴───────┴─────┴─────┘
>>> overlap(regions1_df_pl, regions2_df_pl, how="containment", invert=True) shape: (2, 4) ┌────────────┬───────┬─────┬─────┐ │ Chromosome ┆ Start ┆ End ┆ ID │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ str │ ╞════════════╪═══════╪═════╪═════╡ │ chr1 ┆ 1 ┆ 3 ┆ a │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤ │ chr1 ┆ 10 ┆ 11 ┆ c │ └────────────┴───────┴─────┴─────┘
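The difference between the how modes and invert can be sketched in plain Python. Unlike the intersection sketch, this one returns the original intervals of the first set. It is a naive all-pairs comparison written to match the example outputs above, not the library's implementation. The function name is hypothetical.

```python
def overlap_sketch(regions1, regions2, how="first", invert=False):
    """Return regions of the first set that overlap the second set.

    Regions are (chromosome, start, end, label) with half-open coordinates.
    how="first": each overlapping region once; how="all": once per overlap;
    how="containment": only regions fully inside a region of the second set.
    invert=True returns the regions without any match instead.
    """
    result = []
    for region in regions1:
        chrom1, start1, end1, _label = region
        matches = 0
        for chrom2, start2, end2, _name in regions2:
            if chrom1 != chrom2:
                continue
            if how == "containment":
                hit = start2 <= start1 and end1 <= end2
            else:
                hit = max(start1, start2) < min(end1, end2)
            matches += hit
        if invert:
            if matches == 0:
                result.append(region)
        elif how == "all":
            result.extend([region] * matches)
        elif matches:
            result.append(region)
    return result


regions1 = [("chr1", 1, 3, "a"), ("chr1", 4, 9, "b"), ("chr1", 10, 11, "c")]
regions2 = [("chr1", 2, 3, "reg1"), ("chr1", 2, 9, "reg2"), ("chr1", 9, 10, "reg3")]
```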
TSS profile
- pycisTopic.tss_profile.get_tss_profile(fragments_df_pl: DataFrame, tss_annotation: DataFrame, flank_window: int = 2000, smoothing_rolling_window: int = 10, minimum_signal_window: int = 100, tss_window: int = 50, min_norm: float = 0.2, use_genomic_ranges: bool = True)[source]
Get TSS profile for Polars DataFrame with fragments filtered by cell barcodes.
- Parameters:
- fragments_df_pl
Polars DataFrame with fragments (filtered by cell barcodes of interest). See pycisTopic.fragments.filter_fragments_by_cb().
- tss_annotation
TSS annotation Polars DataFrame with at least the following columns: ["Chromosome", "Start", "Strand"]. The “Start” column is 0-based like a BED file. See pycisTopic.gene_annotation.get_tss_annotation_from_ensembl() and pycisTopic.gene_annotation.change_chromosome_source_in_bed() for ways to get TSS annotation from Ensembl BioMart.
- flank_window
Flanking window around the TSS. Used for intersecting fragments with TSS positions and keeping cut sites. Default: 2000 (+/- 2000 bp).
- smoothing_rolling_window
Rolling window used to smooth the cut sites signal. Default: 10.
- minimum_signal_window
Average signal in the tails of the flanking window around the TSS ([-flank_window, -flank_window + minimum_signal_window - 1] and [flank_window - minimum_signal_window + 1, flank_window]) is used to normalize the TSS enrichment. Default: 100 (average signal in [-2000, -1901], [1901, 2000] around TSS if flank_window=2000).
- tss_window
Window around the TSS used to count fragments in the TSS when calculating the TSS enrichment per cell barcode. Default: 50 (+/- 50 bp).
- min_norm
Minimum normalization score. If the average minimum signal value is below this value, this number is used to normalize the TSS signal. This approach penalizes cells with fewer reads. Default: 0.2
- use_genomic_ranges
Use genomic ranges implementation for calculating intersections, instead of using pyranges.
- Returns:
- tss_enrichment_per_cb, tss_norm_matrix_sample, tss_norm_matrix_per_cb
Examples
Get TSS annotation for requested transcript types from Ensembl BioMart.
>>> ensembl_tss_annotation_bed_df_pl = get_tss_annotation_from_ensembl( ... biomart_name="hsapiens_gene_ensembl" ... )
Get TSS profile for Polars DataFrame with fragments filtered by cell barcodes.
>>> get_tss_profile( ... fragments_df_pl=fragments_cb_filtered_df_pl, ... tss_annotation=ensembl_tss_annotation_bed_df_pl, ... flank_window=2000, ... smoothing_rolling_window=10, ... minimum_signal_window=100, ... tss_window=50, ... min_norm=0.2, ... )
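How the windowing parameters interact can be illustrated with a simplified, single-profile sketch. It takes cut-site counts per position relative to the TSS, smooths them, normalizes by the average signal in the two tails (or by min_norm if the tails are lower), and averages the normalized signal within tss_window of the TSS. The exact smoothing alignment and the way the library aggregates per cell barcode are assumptions here, and the function name is hypothetical.

```python
def tss_enrichment_sketch(
    cut_site_counts,
    flank_window=2000,
    smoothing_rolling_window=10,
    minimum_signal_window=100,
    tss_window=50,
    min_norm=0.2,
):
    """Return (tss_enrichment, normalized_profile) for one count profile.

    cut_site_counts[i] is the count at position i - flank_window relative
    to the TSS, so the list must have 2 * flank_window + 1 entries.
    """
    n = len(cut_site_counts)
    if n != 2 * flank_window + 1:
        raise ValueError("expected 2 * flank_window + 1 positions")

    # Rolling mean over smoothing_rolling_window positions (assumed roughly
    # centered); windows are truncated at the profile edges.
    half = smoothing_rolling_window // 2
    smoothed = []
    for i in range(n):
        window = cut_site_counts[max(0, i - half): min(n, i - half + smoothing_rolling_window)]
        smoothed.append(sum(window) / len(window))

    # Background: average signal in both tails, floored at min_norm.
    tails = smoothed[:minimum_signal_window] + smoothed[-minimum_signal_window:]
    background = max(sum(tails) / len(tails), min_norm)
    normalized = [value / background for value in smoothed]

    # Enrichment: mean normalized signal within +/- tss_window of the TSS.
    center = normalized[flank_window - tss_window: flank_window + tss_window + 1]
    return sum(center) / len(center), normalized


# Tiny toy profile: flat background of 1 with a peak of 4 at the TSS.
toy_counts = [1.0] * 21
toy_counts[9:12] = [4.0, 4.0, 4.0]
toy_enrichment, _ = tss_enrichment_sketch(
    toy_counts,
    flank_window=10,
    smoothing_rolling_window=1,
    minimum_signal_window=2,
    tss_window=1,
)
```

With the defaults, the tail slices cover positions [-2000, -1901] and [1901, 2000], matching the windows given for minimum_signal_window.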
QC
- pycisTopic.qc.compute_kde(training_data: ndarray, test_data: ndarray, no_threads: int = 8)[source]
Compute kernel-density estimate (KDE) using Gaussian kernels.
This function calculates the KDE in parallel and gives the same result as:
>>> from scipy.stats import gaussian_kde >>> gaussian_kde(training_data)(test_data)
- Parameters:
- training_data
2D numpy array with training data to train the KDE.
- test_data
2D numpy array with test data for which to evaluate the estimated probability density function (PDF).
- no_threads
Number of threads to use in parallelization of KDE function.
- Returns:
- 1D numpy array with probability density function (PDF) values for points in test_data.
- pycisTopic.qc.compute_qc_stats(fragments_df_pl: DataFrame, regions_df_pl: DataFrame, tss_annotation: DataFrame, tss_flank_window: int = 2000, tss_smoothing_rolling_window: int = 10, tss_minimum_signal_window: int = 100, tss_window: int = 50, tss_min_norm: float = 0.2, use_genomic_ranges: bool = True, min_fragments_per_cb: int = 10, collapse_duplicates: bool = True, no_threads: int = 8) tuple[DataFrame, DataFrame, DataFrame, DataFrame] [source]
Compute quality check statistics from Polars DataFrame with fragments.
- Parameters:
- fragments_df_pl
Polars DataFrame with fragments. fragments_df_pl Polars DataFrame with fragments (filtered by cell barcodes of interest). See
pycisTopic.fragments.filter_fragments_by_cb()
.- regions_df_pl
Polars DataFrame with peak regions (consensus peaks or SCREEN regions). See
pycisTopic.fragments.read_bed_to_polars_df()
for a way to read a BED file with peak regions.- tss_annotation
TSS annotation Polars DataFrame with at least the following columns: ["Chromosome", "Start", "Strand"]. The "Start" column is 0-based like a BED file. See pycisTopic.gene_annotation.read_tss_annotation_from_bed(), pycisTopic.gene_annotation.get_tss_annotation_from_ensembl() and pycisTopic.gene_annotation.change_chromosome_source_in_bed() for ways to get TSS annotation from Ensembl BioMart.
- tss_flank_window
Flanking window around the TSS. Used for intersecting fragments with TSS positions and keeping cut sites. Default: 2000 (+/- 2000 bp). See pycisTopic.tss_profile.get_tss_profile().
- tss_smoothing_rolling_window
Rolling window used to smooth the cut sites signal. Default: 10. See pycisTopic.tss_profile.get_tss_profile().
- tss_minimum_signal_window
Average signal in the tails of the flanking window around the TSS,
[-flank_window, -flank_window + minimum_signal_window - 1]
and
[flank_window - minimum_signal_window + 1, flank_window],
is used to normalize the TSS enrichment. Default: 100 (average signal in [-2000, -1901] and [1901, 2000] around the TSS if flank_window=2000). See pycisTopic.tss_profile.get_tss_profile().
- tss_window
Window around the TSS used to count fragments in the TSS when calculating the TSS enrichment per cell barcode. Default: 50 (+/- 50 bp). See pycisTopic.tss_profile.get_tss_profile().
- tss_min_norm
Minimum normalization score. If the average minimum signal value is below this value, this number is used to normalize the TSS signal. This approach penalizes cells with fewer reads. Default: 0.2. See pycisTopic.tss_profile.get_tss_profile().
- use_genomic_ranges
Use genomic ranges implementation for calculating intersections, instead of using pyranges.
- min_fragments_per_cb
Minimum number of fragments needed per cell barcode to keep the fragments for those cell barcodes.
- collapse_duplicates
Collapse duplicate fragments (same chromosomal positions and linked to the same cell barcode).
- no_threads
Number of threads to use when calculating kernel-density estimate (KDE) to get probability density function (PDF) values for log10 unique fragments in peaks vs TSS enrichment, fractions of fragments in peaks and duplication ratio. Default:
8
- Returns:
- Tuple with:
Polars DataFrame with fragments statistics per cell barcode.
Polars DataFrame with insert size distribution of fragments.
Polars DataFrame with TSS normalization matrix for the whole sample.
Polars DataFrame with TSS normalization matrix per cell barcode.
See also
pycisTopic.fragments.filter_fragments_by_cb
pycisTopic.fragments.get_insert_size_distribution
pycisTopic.fragments.get_fragments_in_peaks
pycisTopic.fragments.read_bed_to_polars_df
pycisTopic.fragments.read_fragments_to_polars_df
pycisTopic.gene_annotation.read_tss_annotation_from_bed
pycisTopic.tss_profile.get_tss_profile
Examples
>>> from pycisTopic.fragments import read_bed_to_polars_df
>>> from pycisTopic.fragments import read_fragments_to_polars_df
>>> from pycisTopic.gene_annotation import read_tss_annotation_from_bed
Read gzipped fragments BED file to a Polars DataFrame.
>>> fragments_df_pl = read_fragments_to_polars_df(
...     fragments_bed_filename="fragments.tsv.gz",
... )
Read BED file with consensus peaks or SCREEN regions (get first 3 columns only) which will be used for counting number of fragments in peaks.
>>> regions_df_pl = read_bed_to_polars_df(
...     bed_filename=screen_regions_bed_filename,
...     min_column_count=3,
... )
Read TSS annotation from a file. See pycisTopic.gene_annotation.read_tss_annotation_from_bed() for more info.
>>> tss_annotation_bed_df_pl = read_tss_annotation_from_bed(
...     tss_annotation_bed_filename="hg38.tss.bed",
... )
Compute QC statistics.
>>> (
...     fragments_stats_per_cb_df_pl,
...     insert_size_dist_df_pl,
...     tss_norm_matrix_sample,
...     tss_norm_matrix_per_cb,
... ) = compute_qc_stats(
...     fragments_df_pl=fragments_cb_filtered_df_pl,
...     regions_df_pl=regions_df_pl,
...     tss_annotation=tss_annotation_bed_df_pl,
...     tss_flank_window=2000,
...     tss_smoothing_rolling_window=10,
...     tss_minimum_signal_window=100,
...     tss_window=50,
...     tss_min_norm=0.2,
...     use_genomic_ranges=True,
...     min_fragments_per_cb=10,
...     collapse_duplicates=True,
...     no_threads=8,
... )
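The normalization controlled by tss_minimum_signal_window and tss_min_norm can be sketched as follows. This is a simplified illustration under stated assumptions, not the library code: `normalize_tss_signal` is a hypothetical helper, the input is assumed to be an already smoothed per-position cut-site signal covering the whole flanking window, and the toy profile is invented for the example.

```python
def normalize_tss_signal(signal, minimum_signal_window=100, min_norm=0.2):
    """Divide a per-position cut-site signal by the average of its two tails."""
    # Background = mean of the first and last minimum_signal_window positions.
    tails = signal[:minimum_signal_window] + signal[-minimum_signal_window:]
    background = sum(tails) / len(tails)
    # Use min_norm when the background is lower, which penalizes sparse profiles.
    background = max(background, min_norm)
    return [value / background for value in signal]


# Toy profile over positions -5..5 with the TSS at index 5 and 2-position tails.
toy_signal = [1.0, 1.0, 2.0, 3.0, 6.0, 10.0, 6.0, 3.0, 2.0, 1.0, 1.0]
normalized = normalize_tss_signal(toy_signal, minimum_signal_window=2, min_norm=0.2)

# A sparse profile with empty tails is divided by min_norm instead of by zero.
sparse_signal = [0.0] * 5 + [1.0] + [0.0] * 5
sparse = normalize_tss_signal(sparse_signal, minimum_signal_window=2, min_norm=0.2)
```

With the defaults (flank window 2000, minimum signal window 100), the tails would correspond to the 100 outermost positions on each side of the TSS.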
- pycisTopic.qc.get_barcodes_passing_qc_for_sample(sample_id: str, pycistopic_qc_output_dir: str | Path, unique_fragments_threshold: int | None = None, tss_enrichment_threshold: float | None = None, frip_threshold: float | None = None, use_automatic_thresholds: bool = True) tuple[np.ndarray, dict[str, float]] [source]
Get barcodes passing quality control (QC) for a sample.
- Parameters:
- sample_id
Sample ID.
- pycistopic_qc_output_dir
Directory with output from pycistopic qc.
- unique_fragments_threshold
Threshold for number of unique fragments in peaks. If not defined, and use_automatic_thresholds is False, the threshold will be set to 0.
- tss_enrichment_threshold
Threshold for TSS enrichment score. If not defined, and use_automatic_thresholds is False, the threshold will be set to 0.
- frip_threshold
Threshold for fraction of reads in peaks (FRiP). If not defined the threshold will be set to 0.
- use_automatic_thresholds
Use automatic thresholds for unique fragments in peaks and TSS enrichment score as calculated by Otsu’s method. If False, the thresholds will be set to 0 if not defined.
- Returns:
- Tuple with:
Numpy array with cell barcodes passing QC.
Dictionary with thresholds used for QC.
- Raises:
- FileNotFoundError
If the file with fragments statistics per cell barcode does not exist.
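The threshold rules described above can be sketched in plain Python. This is an illustration only: `resolve_thresholds` and `barcodes_passing_qc` are hypothetical helpers, the dictionary keys and column names are invented for the example, and the use of `>=` when comparing against a threshold is an assumption.

```python
def resolve_thresholds(
    otsu_fragments,
    otsu_tss,
    unique_fragments_threshold=None,
    tss_enrichment_threshold=None,
    frip_threshold=None,
    use_automatic_thresholds=True,
):
    """Pick the thresholds to apply, following the documented fallback rules."""
    if unique_fragments_threshold is None:
        unique_fragments_threshold = otsu_fragments if use_automatic_thresholds else 0
    if tss_enrichment_threshold is None:
        tss_enrichment_threshold = otsu_tss if use_automatic_thresholds else 0.0
    if frip_threshold is None:
        # FRiP has no automatic threshold: undefined means 0.
        frip_threshold = 0.0
    return {
        "unique_fragments_threshold": unique_fragments_threshold,
        "tss_enrichment_threshold": tss_enrichment_threshold,
        "frip_threshold": frip_threshold,
    }


def barcodes_passing_qc(stats_per_cb, thresholds):
    """Return the barcodes whose statistics reach all three thresholds."""
    return [
        row["cb"]
        for row in stats_per_cb
        if row["unique_fragments_in_peaks"] >= thresholds["unique_fragments_threshold"]
        and row["tss_enrichment"] >= thresholds["tss_enrichment_threshold"]
        and row["frip"] >= thresholds["frip_threshold"]
    ]


# Invented per-barcode statistics (column names are hypothetical).
stats_per_cb = [
    {"cb": "AAAC-1", "unique_fragments_in_peaks": 5200, "tss_enrichment": 8.1, "frip": 0.45},
    {"cb": "AAAG-1", "unique_fragments_in_peaks": 150, "tss_enrichment": 1.2, "frip": 0.05},
    {"cb": "AACC-1", "unique_fragments_in_peaks": 4100, "tss_enrichment": 6.3, "frip": 0.38},
]
automatic = resolve_thresholds(otsu_fragments=1000, otsu_tss=4.0)
manual = resolve_thresholds(1000, 4.0, use_automatic_thresholds=False)
passing_automatic = barcodes_passing_qc(stats_per_cb, automatic)
passing_manual = barcodes_passing_qc(stats_per_cb, manual)
```

With automatic thresholds disabled and nothing set explicitly, every threshold is 0 and all barcodes pass.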
- pycisTopic.qc.get_otsu_threshold(fragments_stats_per_cb_df_pl: DataFrame, min_otsu_fragments: int = 100, min_otsu_tss: float = 1.0)[source]
Get Otsu thresholds for number of unique fragments in peaks and TSS enrichment score.
- Parameters:
- fragments_stats_per_cb_df_pl
Polars DataFrame with fragments statistics per cell barcode as generated by pycisTopic.qc.compute_qc_stats().
- min_otsu_fragments
When calculating Otsu threshold for number of unique fragments in peaks per CB, only consider those CBs which have at least this number of fragments.
- min_otsu_tss
When calculating Otsu threshold for TSS enrichment score per CB, only consider those CBs which have at least this TSS value.
- Returns:
- Tuple with:
Otsu threshold for number of unique fragments in peaks.
Otsu threshold for TSS enrichment.
Polars DataFrame with fragments statistics per cell barcode for cell barcodes that passed both Otsu thresholds.
Examples
Only keep fragments stats for CBs that pass both Otsu thresholds.
>>> (
...     unique_fragments_in_peaks_count_otsu_threshold,
...     tss_enrichment_otsu_threshold,
...     fragments_stats_per_cb_for_otsu_threshold_df_pl,
... ) = get_otsu_threshold(
...     fragments_stats_per_cb_df_pl=fragments_stats_per_cb_df_pl,
...     min_otsu_fragments=100,
...     min_otsu_tss=1.0,
... )
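For readers unfamiliar with Otsu's method, the sketch below shows the general idea on a list of fragment counts: build a histogram and choose the cut that maximizes the between-class variance. It is a generic textbook version, not the pycisTopic implementation; `otsu_threshold`, the number of bins, and the example counts are all assumptions made for illustration.

```python
def otsu_threshold(values, nbins=256):
    """Return the cut that maximizes between-class variance of a histogram of values."""
    low, high = min(values), max(values)
    if low == high:
        return low
    width = (high - low) / nbins
    counts = [0] * nbins
    for value in values:
        counts[min(int((value - low) / width), nbins - 1)] += 1
    centers = [low + (i + 0.5) * width for i in range(nbins)]
    total = len(values)
    total_sum = sum(c * m for c, m in zip(counts, centers))
    best_variance, best_threshold = -1.0, low
    weight_low, sum_low = 0, 0.0
    for i in range(nbins - 1):
        weight_low += counts[i]
        sum_low += counts[i] * centers[i]
        weight_high = total - weight_low
        if weight_low == 0 or weight_high == 0:
            continue
        mean_low = sum_low / weight_low
        mean_high = (total_sum - sum_low) / weight_high
        between = weight_low * weight_high * (mean_low - mean_high) ** 2
        if between > best_variance:
            best_variance = between
            # Cut at the boundary between bin i and bin i + 1.
            best_threshold = low + (i + 1) * width
    return best_threshold


# Invented fragment counts: background barcodes and real cells.
fragment_counts = [5, 20, 150, 180, 200, 170, 5000, 5200, 4800, 5100]
min_otsu_fragments = 100
# Only barcodes with at least min_otsu_fragments fragments take part in the estimate.
candidates = [count for count in fragment_counts if count >= min_otsu_fragments]
fragments_threshold = otsu_threshold(candidates)
passing = [count for count in fragment_counts if count >= fragments_threshold]
```

Dropping very low values first, as min_otsu_fragments and min_otsu_tss do, keeps the large mass of empty barcodes from dominating the histogram.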
Topic modelling
- class pycisTopic.lda_models.CistopicLDAModel(metrics: DataFrame, coherence: DataFrame, marg_topic: DataFrame, topic_ass: DataFrame, cell_topic: DataFrame, topic_region: DataFrame, parameters: DataFrame)[source]
cisTopic LDA model class.
CistopicLDAModel contains model quality metrics (model coherence (adaptation from Mimno et al., 2011), log-likelihood (Griffiths and Steyvers, 2004), density-based (Cao Juan et al., 2009) and divergence-based (Arun et al., 2010)), topic quality metrics (coherence, marginal distribution and total number of assignments), cell-topic and topic-region distribution, model parameters and model dimensions.
- Parameters:
- metrics: pd.DataFrame
pd.DataFrame containing model quality metrics, including model coherence (adaptation from Mimno et al., 2011), log-likelihood and density and divergence-based methods (Cao Juan et al., 2009; Arun et al., 2010).
- coherence: pd.DataFrame
pd.DataFrame containing the coherence of each topic (Mimno et al., 2011).
- marg_topic: pd.DataFrame
pd.DataFrame containing the marginal distribution for each topic. It can be interpreted as the importance of each topic for the whole corpus.
- topic_ass: pd.DataFrame
pd.DataFrame containing the total number of assignments per topic.
- cell_topic: pd.DataFrame
pd.DataFrame containing the cell-topic distributions, with cells as columns, topics as rows and the probability of each topic in each cell as values.
- topic_region: pd.DataFrame
pd.DataFrame containing the topic-region distributions, with topics as columns, regions as rows and the probability of each region in each topic as values.
- parameters: pd.DataFrame
pd.DataFrame containing parameters used for the model.
- n_cells: int
Number of cells in the model.
- n_regions: int
Number of regions in the model.
- n_topic: int
Number of topics in the model.
References
Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 262-272).
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National academy of Sciences, 101(suppl 1), 5228-5235.
Cao, J., Xia, T., Li, J., Zhang, Y., & Tang, S. (2009). A density-based method for adaptive LDA model selection. Neurocomputing, 72(7-9), 1775-1781.
Arun, R., Suresh, V., Madhavan, C. V., & Murthy, M. N. (2010). On finding the natural number of topics with latent dirichlet allocation: Some observations. In Pacific-Asia conference on knowledge discovery and data mining (pp. 391-402). Springer, Berlin, Heidelberg.
- class pycisTopic.lda_models.LDAMallet(num_topics: int, corpus: Iterable | None = None, alpha: float | None = 50, eta: float | None = 0.1, id2word: FakeDict | None = None, n_cpu: int | None = 1, tmp_dir: str | None = None, optimize_interval: int | None = 0, iterations: int | None = 150, topic_threshold: float | None = 0.0, random_seed: int | None = 555, reuse_corpus: bool | None = False, mallet_path: str = 'mallet')[source]
Wrapper class to run LDA models with Mallet. This class has been adapted from gensim (https://github.com/RaRe-Technologies/gensim/blob/27bbb7015dc6bbe02e00bb1853e7952ac13e7fe0/gensim/models/wrappers/ldamallet.py).
- Parameters:
- num_topics: int
The number of topics to use in the model.
- corpus: iterable of iterable of (int, int), optional
Collection of texts in BoW format. Default: None.
- alpha: float, optional
Scalar value indicating the symmetric Dirichlet hyperparameter for topic proportions. Default: 50.
- id2word: gensim.utils.FakeDict, optional
Mapping between token ids and words from the corpus. If not specified, it will be inferred from the corpus. Default: None.
- n_cpu: int, optional
Number of threads that will be used for training. Default: 1.
- tmp_dir: str, optional
Directory for temporary files produced during training. Default: None.
- optimize_interval: int, optional
Optimize hyperparameters every optimize_interval iterations (this sometimes leads to a Java exception; set to 0 to switch off hyperparameter optimization). Default: 0.
- iterations: int, optional
Number of training iterations. Default: 150.
- topic_threshold: float, optional
Threshold of the probability above which we consider a topic. Default: 0.0.
- random_seed: int, optional
Random seed to ensure consistent results, if 0 - use system clock. Default: 555.
- mallet_path: str
Path to the mallet binary (e.g. /xxx/Mallet/bin/mallet). Default: “mallet”.
- convert_input(corpus)[source]
Convert corpus to Mallet format and save it to a temporary text file.
- Parameters:
- corpus: iterable of iterable of (int, int)
Collection of texts in BoW format.
- Returns:
- None.
- corpus_to_mallet(corpus, file_like)[source]
Convert corpus to Mallet format and write it to file_like descriptor.
- Parameters:
- corpus: iterable of iterable of (int, int)
Collection of texts in BoW format.
- file_like
Writable file-like object in text mode.
- Returns:
- None.
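To make the BoW-to-Mallet conversion more concrete, the sketch below writes a small corpus to a text buffer. It is modeled on the layout used by the gensim wrapper this class was adapted from (one document per line: document id, a dummy label, then each token repeated as many times as its count); whether pycisTopic writes exactly this layout is an assumption. The function name `corpus_to_mallet_text`, the explicit `id2word` argument, and the region names are hypothetical.

```python
import io


def corpus_to_mallet_text(corpus, id2word, file_like):
    """Write a BoW corpus as one 'doc_id label tokens...' line per document."""
    for doc_id, document in enumerate(corpus):
        tokens = []
        for token_id, count in document:
            # Repeat each token according to its count in the document.
            tokens.extend([str(id2word[token_id])] * int(count))
        file_like.write(f"{doc_id} 0 {' '.join(tokens)}\n")


# Two documents (cells) over three words (regions); names are invented.
corpus = [[(0, 1), (2, 2)], [(1, 1)]]
id2word = {0: "chr1:100-600", 1: "chr1:900-1400", 2: "chr2:50-550"}
buffer = io.StringIO()
corpus_to_mallet_text(corpus, id2word, buffer)
mallet_text = buffer.getvalue()
```

In cisTopic terms a document is a cell and a word is a region, so with a binary accessibility matrix every count would be 1.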
- fdoctopics()[source]
Get path to document topic text file.
- Returns:
- str
Path to document topic text file.
- finferencer()[source]
Get path to inferencer.mallet file.
- Returns:
- str
Path to inferencer.mallet file.
- get_topics()[source]
Get topics X words matrix.
- Returns:
- np.ndarray
Topics X words matrix, shape num_topics x vocabulary_size.
- pycisTopic.lda_models.evaluate_models(models: List[CistopicLDAModel], select_model: int | None = None, return_model: bool | None = True, metrics: str | None = ['Minmo_2011', 'loglikelihood', 'Cao_Juan_2009', 'Arun_2010'], min_topics_coh: int | None = 5, plot: bool | None = True, figsize: Tuple[float, float] | None = (6.4, 4.8), plot_metrics: bool | None = False, save: str | None = None)[source]
Model selection based on model quality metrics (model coherence (adaptation from Mimno et al., 2011), log-likelihood (Griffiths and Steyvers, 2004), density-based (Cao Juan et al., 2009) and divergence-based (Arun et al., 2010)).
- Parameters:
- models: list of :class:`CistopicLDAModel`
A list containing cisTopic LDA models, as returned from run_cgs_models or run_cgs_modelsMallet.
- select_model: int, optional
Integer indicating the number of topics of the selected model. If not provided, the best model will be selected automatically based on the model quality metrics. Default: None.
- return_model: bool, optional
Whether to return the selected model as CistopicLDAModel. Default: True.
- metrics: list of str
- Metrics to use for plotting and model selection:
Minmo_2011: Uses the average model coherence as calculated by Mimno et al. (2011). To reduce the impact of the number of topics, the average coherence is calculated from the top selected average values. The better the model, the higher the coherence.
loglikelihood: Uses the log-likelihood in the last iteration as calculated by Griffiths and Steyvers (2004). The better the model, the higher the log-likelihood.
Arun_2010: Uses a divergence-based metric as in Arun et al. (2010) using the topic-region distribution, the cell-topic distribution and the cell coverage. The better the model, the lower the metric.
Cao_Juan_2009: Uses a density-based metric as in Cao Juan et al. (2009) using the topic-region distribution. The better the model, the lower the metric.
Default: all metrics.
- min_topics_coh: int, optional
Minimum number of topics a model must have for its coherence to be used in model selection. Default: 5.
- plot: bool, optional
Whether to return plot to the console. Default: True.
- figsize: tuple, optional
Size of the figure. Default: (6.4, 4.8)
- plot_metrics: bool, optional
Whether to plot metrics independently. Default: False.
- save: str, optional
Output file to save plot. Default: None.
References
Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 262-272).
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National academy of Sciences, 101(suppl 1), 5228-5235
Cao, J., Xia, T., Li, J., Zhang, Y., & Tang, S. (2009). A density-based method for adaptive LDA model selection. Neurocomputing, 72(7-9), 1775-1781.
Arun, R., Suresh, V., Madhavan, C. V., & Murthy, M. N. (2010). On finding the natural number of topics with latent dirichlet allocation: Some observations. In Pacific-Asia conference on knowledge discovery and data mining (pp. 391-402). Springer, Berlin, Heidelberg.
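The "average of the top coherence values" used for the model coherence can be illustrated with a short sketch. `model_coherence` is a hypothetical helper and the per-topic coherence values are invented; the library may compute and store these values differently.

```python
def model_coherence(topic_coherences, top_topics_coh=5):
    """Average the top_topics_coh highest per-topic coherence values."""
    top_values = sorted(topic_coherences, reverse=True)[:top_topics_coh]
    return sum(top_values) / len(top_values)


# Invented per-topic coherence values for two models of different size.
small_model = [-1.0, -2.0, -3.0, -10.0]
large_model = [-1.0, -2.0, -3.0, -10.0, -12.0, -15.0, -20.0, -30.0]
small_score = model_coherence(small_model, top_topics_coh=2)
large_score = model_coherence(large_model, top_topics_coh=2)
plain_average = sum(large_model) / len(large_model)
```

Here both models get the same score, whereas a plain average over all topics would penalize the larger model for its extra low-coherence topics. This is the sense in which averaging only the top values reduces the impact of the number of topics.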
- pycisTopic.lda_models.run_cgs_model_mallet(binary_matrix: csr_matrix, corpus: Iterable, id2word: FakeDict, n_topics: List[int], cell_names: List[str], region_names: List[str], n_cpu: int | None = 1, n_iter: int | None = 500, random_state: int | None = 555, alpha: float | None = 50, alpha_by_topic: bool | None = True, eta: float | None = 0.1, eta_by_topic: bool | None = False, top_topics_coh: int | None = 5, tmp_path: str | None = None, save_path: str | None = None, reuse_corpus: bool | None = False, mallet_path: str = 'mallet')[source]
Run Latent Dirichlet Allocation in a model as implemented in Mallet (McCallum, 2002).
- Parameters:
- binary_matrix: sparse.csr_matrix
Binary sparse matrix containing cells as columns, regions as rows, and 1 if a region is considered accessible in a cell (otherwise, 0).
- n_topics: list of int
A list containing the number of topics to use in each model.
- cell_names: list of str
List containing cell names as ordered in the binary matrix columns.
- region_names: list of str
List containing region names as ordered in the binary matrix rows.
- n_cpu: int, optional
Number of cpus to use for modelling. In this function parallelization is done per model, that is, one model will run entirely in a unique cpu. We recommend to set the number of cpus as the number of models that will be inferred, so all models start at the same time.
- n_iter: int, optional
Number of iterations for which the Gibbs sampler will be run. Default: 500.
- random_state: int, optional
Random seed to initialize the models. Default: 555.
- alpha: float, optional
Scalar value indicating the symmetric Dirichlet hyperparameter for topic proportions. Default: 50.
- alpha_by_topic: bool, optional
Boolean indicating whether the scalar given in alpha has to be divided by the number of topics. Default: True
- eta: float, optional
Scalar value indicating the symmetric Dirichlet hyperparameter for topic multinomials. Default: 0.1.
- eta_by_topic: bool, optional
Boolean indicating whether the scalar given in eta has to be divided by the number of topics. Default: False
- top_topics_coh: int, optional
Number of topics to use to calculate the model coherence. For each model, the coherence will be calculated as the average of the top coherence values. Default: 5.
- tmp_path: str, optional
Path to a temporary folder for Mallet. Default: None.
- save_path: str, optional
Path to save models as independent files as they are completed. This is recommended for large data sets. Default: None.
- reuse_corpus: bool, optional
Whether to reuse the mallet corpus in the tmp directory. Default: False
- mallet_path: str
Path to Mallet binary (e.g. “/xxx/Mallet/bin/mallet”). Default: “mallet”.
References
McCallum, A. K. (2002). Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.
- pycisTopic.lda_models.run_cgs_models(cistopic_obj: CistopicObject, n_topics: List[int], n_cpu: int | None = 1, n_iter: int | None = 150, random_state: int | None = 555, alpha: float | None = 50, alpha_by_topic: bool | None = True, eta: float | None = 0.1, eta_by_topic: bool | None = False, top_topics_coh: int | None = 5, save_path: str | None = None, **kwargs)[source]
Run Latent Dirichlet Allocation using Gibbs Sampling as described in Griffiths and Steyvers, 2004.
- Parameters:
- cistopic_obj: CistopicObject
A CistopicObject. Note that cells/regions have to be filtered before running any LDA model.
- n_topics: list of int
A list containing the number of topics to use in each model.
- n_cpu: int, optional
Number of cpus to use for modelling. In this function parallelization is done per model, that is, one model will run entirely in a unique cpu. We recommend to set the number of cpus as the number of models that will be inferred, so all models start at the same time.
- n_iter: int, optional
Number of iterations for which the Gibbs sampler will be run. Default: 150.
- random_state: int, optional
Random seed to initialize the models. Default: 555.
- alpha: float, optional
Scalar value indicating the symmetric Dirichlet hyperparameter for topic proportions. Default: 50.
- alpha_by_topic: bool, optional
Boolean indicating whether the scalar given in alpha has to be divided by the number of topics. Default: True
- eta: float, optional
Scalar value indicating the symmetric Dirichlet hyperparameter for topic multinomials. Default: 0.1.
- eta_by_topic: bool, optional
Boolean indicating whether the scalar given in eta has to be divided by the number of topics. Default: False
- top_topics_coh: int, optional
Number of topics to use to calculate the model coherence. For each model, the coherence will be calculated as the average of the top coherence values. Default: 5.
- save_path: str, optional
Path to save models as independent files as they are completed. This is recommended for large data sets. Default: None.
References
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National academy of Sciences, 101(suppl 1), 5228-5235.
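The effect of alpha_by_topic and eta_by_topic on the Dirichlet hyperparameters can be sketched as below. `effective_priors` is a hypothetical helper that only restates the parameter descriptions above; it does not call pycisTopic and the library may apply the division elsewhere.

```python
def effective_priors(n_topics, alpha=50.0, alpha_by_topic=True, eta=0.1, eta_by_topic=False):
    """Return the (alpha, eta) values implied by the *_by_topic flags."""
    alpha_value = alpha / n_topics if alpha_by_topic else alpha
    eta_value = eta / n_topics if eta_by_topic else eta
    return alpha_value, eta_value


# With the defaults, alpha shrinks as the number of topics grows while eta stays fixed.
priors_per_model = {n: effective_priors(n) for n in [5, 10, 25, 50]}
```

With the default settings a 25-topic model would use alpha = 2.0 and eta = 0.1, and a 50-topic model alpha = 1.0 and eta = 0.1.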
- pycisTopic.lda_models.run_cgs_models_mallet(cistopic_obj: CistopicObject, n_topics: List[int], n_cpu: int | None = 1, n_iter: int | None = 150, random_state: int | None = 555, alpha: float | None = 50, alpha_by_topic: bool | None = True, eta: float | None = 0.1, eta_by_topic: bool | None = False, top_topics_coh: int | None = 5, tmp_path: str | None = None, save_path: str | None = None, reuse_corpus: bool | None = False, mallet_path: str = 'mallet')[source]
Run Latent Dirichlet Allocation per model as implemented in Mallet (McCallum, 2002).
- Parameters:
- cistopic_obj: CistopicObject
A CistopicObject. Note that cells/regions have to be filtered before running any LDA model.
- n_topics: list of int
A list containing the number of topics to use in each model.
- n_cpu: int, optional
Number of cpus to use for modelling. In this function parallelization is done per model, that is, one model will run entirely in a unique cpu. We recommend to set the number of cpus as the number of models that will be inferred, so all models start at the same time.
- n_iter: int, optional
Number of iterations for which the Gibbs sampler will be run. Default: 150.
- random_state: int, optional
Random seed to initialize the models. Default: 555.
- alpha: float, optional
Scalar value indicating the symmetric Dirichlet hyperparameter for topic proportions. Default: 50.
- alpha_by_topic: bool, optional
Boolean indicating whether the scalar given in alpha has to be divided by the number of topics. Default: True
- eta: float, optional
Scalar value indicating the symmetric Dirichlet hyperparameter for topic multinomials. Default: 0.1.
- eta_by_topic: bool, optional
Boolean indicating whether the scalar given in eta has to be divided by the number of topics. Default: False
- top_topics_coh: int, optional
Number of topics to use to calculate the model coherence. For each model, the coherence will be calculated as the average of the top coherence values. Default: 5.
- tmp_path: str, optional
Path to a temporary folder for Mallet. Default: None.
- save_path: str, optional
Path to save models as independent files as they are completed. This is recommended for large data sets. Default: None.
- reuse_corpus: bool, optional
Whether to reuse the mallet corpus in the tmp directory. Default: False
- mallet_path: str
Path to Mallet binary (e.g. “/xxx/Mallet/bin/mallet”). Default: “mallet”.
References
McCallum, A. K. (2002). Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.
Clustering & visualization
- pycisTopic.clust_vis.cell_topic_heatmap(cistopic_obj: CistopicObject, variables: List[str] | None = None, remove_nan: bool | None = True, scale: bool | None = False, cluster_topics: bool | None = False, color_dict: Dict[str, Dict[str, str]] | None = {}, seed: int | None = 555, legend_loc_x: float | None = 1.2, legend_loc_y: float | None = -0.5, legend_dist_y: float | None = -1, figsize: Tuple[float, float] | None = (6.4, 4.8), selected_topics: List[int] | None = None, selected_cells: List[str] | None = None, harmony: bool | None = False, save: str | None = None)[source]
Plot heatmap with cell-topic distributions.
Parameters
- cistopic_obj: class::CistopicObject
A cisTopic object with a model in class::CistopicObject.selected_model.
- variables: list
List of variables to plot. They should be included in class::CistopicObject.cell_data and class::CistopicObject.region_data, depending on which target is specified.
- remove_nan: bool, optional
Whether to remove data points for which the variable value is ‘nan’. Default: True
- reduction_name: str
Name of the dimensionality reduction to use
- scale: bool, optional
Whether to scale the cell-topic or topic-regions contributions prior to plotting. Default: False
- cluster_topics: bool, optional
Whether to cluster rows in the heatmap. Otherwise, they will be ordered based on the maximum values over the ordered cells. Default: False
- color_dict: dict, optional
A dictionary containing an entry per variable, whose values are dictionaries with variable levels as keys and corresponding colors as values. Default: None
- seed: int, optional
Random seed used to select random colors. Default: 555
- legend_loc_x: float, optional
X location for legend. Default: 1.2
- legend_loc_y: float, optional
Y location for legend. Default: -0.5
- legend_dist_y: float, optional
Y distance between legends. Default: -1
- figsize: tuple, optional
Size of the figure. Default: (6.4, 4.8)
- selected_topics: list, optional
A list with selected topics to be used for plotting. Default: None (use all topics)
- selected_cells: list, optional
A list with selected cells to plot. Default: None (use all cells)
- harmony: bool, optional
If target is ‘cell’, whether to use harmony processed topic contributions. Default: False
- save: str, optional
Path to save plot. Default: None.
- pycisTopic.clust_vis.find_clusters(cistopic_obj: CistopicObject, target: str | None = 'cell', k: int | None = 10, res: List[float] | None = [0.6], seed: int | None = 555, scale: bool | None = False, prefix: str | None = '', selected_topics: List[int] | None = None, selected_features: List[str] | None = None, harmony: bool | None = False, rna_components: DataFrame | None = None, use_umap_integration: bool | None = False, rna_weight: float | None = 0.5, split_pattern: str | None = '___', **kwargs)[source]
Perform Leiden clustering of cells or regions and add the results to the cisTopic object's metadata.
Parameters
- cistopic_obj: class::CistopicObject
A cisTopic object with a model in class::CistopicObject.selected_model.
- target: str, optional
Whether cells (‘cell’) or regions (‘region’) should be clustered. Default: ‘cell’
- k: int, optional
Number of neighbours in the k-neighbours graph. Default: 10
- res: list of float, optional
Resolution parameter(s) for the Leiden algorithm step. Default: [0.6]
- seed: int, optional
Seed parameter for the leiden algorithm step. Default: 555
- scale: bool, optional
Whether to scale the cell-topic or topic-regions contributions prior to the clustering. Default: False
- prefix: str, optional
Prefix to add to the clustering name when adding it to the correspondent metadata attribute. Default: ‘’
- selected_topics: list, optional
A list with selected topics to be used for clustering. Default: None (use all topics)
- selected_features: list, optional
A list with selected features (cells or regions) to cluster. This is recommended when working with regions (e.g. selecting regions in binarized topics), as working with all regions can be time consuming. Default: None (use all features)
- harmony: bool, optional
If target is ‘cell’, whether to use harmony processed topic contributions. Default: False.
- rna_components: pd.DataFrame, optional
A pandas dataframe containing RNA dimensionality reduction (e.g. PCA) components. If provided, both layers (atac and rna) will be considered for clustering.
- use_umap_integration: bool, optional
Whether to use a weighted UMAP representation for the clustering or directly integrating the two graphs. Default: False
- rna_weight: float, optional
Weight of the RNA layer on the clustering (only applicable when clustering via UMAP). Default: 0.5 (same weight)
- pycisTopic.clust_vis.harmony(cistopic_obj: CistopicObject, vars_use: List[str], scale: bool | None = True, random_state: int | None = 555, **kwargs)[source]
Apply harmony batch effect correction (Korsunsky et al, 2019) over cell-topic distribution
Parameters
- cistopic_obj: class::CistopicObject
A cisTopic object with a model in class::CistopicObject.selected_model.
- vars_use: list
List of variables to correct batch effect with.
- scale: bool, optional
Whether to scale probability matrix prior to correction. Default: True
- random_state: int, optional
Random seed used to use with harmony. Default: 555
References
Korsunsky, I., Millard, N., Fan, J., Slowikowski, K., Zhang, F., Wei, K., … & Raychaudhuri, S. (2019). Fast, sensitive and accurate integration of single-cell data with Harmony. Nature methods, 16(12), 1289-1296.
- pycisTopic.clust_vis.input_check(atac_topics: DataFrame, rna_pca: DataFrame)[source]
A function to select cells present in both the RNA and the ATAC layers
- pycisTopic.clust_vis.plot_imputed_features(cistopic_obj: CistopicObject, reduction_name: str, imputed_data: cisTopicImputedFeatures, features: List[str], scale: bool | None = False, cmap: str | matplotlib.cm | None = <matplotlib.colors.ListedColormap object>, dot_size: int | None = 10, alpha: float | int | None = 1, selected_cells: List[str] | None = None, figsize: Tuple[float, float] | None = (6.4, 4.8), num_columns: int | None = 1, save: str | None = None)[source]
Plot imputed features into dimensionality reduction.
Parameters
- cistopic_obj: class::CistopicObject
A cisTopic object with dimensionality reductions in class::CistopicObject.projections.
- reduction_name: str
Name of the dimensionality reduction to use
- imputed_data: class::cisTopicImputedFeatures
A class::cisTopicImputedFeatures object derived from the input cisTopic object.
- features: list
Names of the features to plot.
- scale: bool, optional
Whether to scale the imputed features prior to plotting. Default: False
- cmap: str or ‘matplotlib.cm’, optional
For continuous variables, color map to use for the legend color bar. Default: cm.viridis
- dot_size: int, optional
Dot size in the plot. Default: 10
- alpha: float, optional
Transparency value for the dots in the plot. Default: 1
- selected_cells: list, optional
A list with selected cells to plot. Default: None (use all cells)
- figsize: tuple, optional
Size of the figure. If num_columns is 1, this is the size for each figure; if num_columns is above 1, this is the overall size of the figure (if keeping default, it will be the size of each subplot in the figure). Default: (6.4, 4.8)
- num_columns: int, optional
For multiplot figures, indicates the number of columns (the number of rows will be automatically determined based on the number of plots). Default: 1
- save: str, optional
Path to save plot. Default: None.
- pycisTopic.clust_vis.plot_metadata(cistopic_obj: CistopicObject, reduction_name: str, variables: List[str], target: str | None = 'cell', remove_nan: bool | None = True, show_label: bool | None = True, show_legend: bool | None = False, cmap: str | matplotlib.cm | None = <matplotlib.colors.ListedColormap object>, dot_size: int | None = 10, text_size: int | None = 10, alpha: float | int | None = 1, seed: int | None = 555, color_dictionary: Dict[str, str] | None = {}, figsize: Tuple[float, float] | None = (6.4, 4.8), num_columns: int | None = 1, selected_features: List[str] | None = None, save: str | None = None)[source]
Plot categorical and continuous metadata into dimensionality reduction.
Parameters
- cistopic_obj: class::CistopicObject
A cisTopic object with dimensionality reductions in class::CistopicObject.projections.
- reduction_name: str
Name of the dimensionality reduction to use
- variables: list
List of variables to plot. They should be included in class::CistopicObject.cell_data and class::CistopicObject.region_data, depending on which target is specified.
- target: str, optional
Whether cells (‘cell’) or regions (‘region’) should be used. Default: ‘cell’
- remove_nan: bool, optional
Whether to remove data points for which the variable value is ‘nan’. Default: True
- show_label: bool, optional
For categorical variables, whether to show the label in the plot. Default: True
- show_legend: bool, optional
For categorical variables, whether to show the legend next to the plot. Default: False
- cmap: str or ‘matplotlib.cm’, optional
For continuous variables, color map to use for the legend color bar. Default: cm.viridis
- dot_size: int, optional
Dot size in the plot. Default: 10
- text_size: int, optional
For categorical variables and if show_label is True, size of the labels in the plot. Default: 10
- alpha: float, optional
Transparency value for the dots in the plot. Default: 1
- seed: int, optional
Random seed used to select random colors. Default: 555
- color_dictionary: dict, optional
A dictionary containing an entry per variable, whose values are dictionaries with variable levels as keys and corresponding colors as values. Default: {} (empty dictionary)
- figsize: tuple, optional
Size of the figure. If num_columns is 1, this is the size for each figure; if num_columns is above 1, this is the overall size of the figure (if keeping default, it will be the size of each subplot in the figure). Default: (6.4, 4.8)
- num_columns: int, optional
For multiplot figures, indicates the number of columns (the number of rows will be automatically determined based on the number of plots). Default: 1
- selected_features: list, optional
A list with selected features (cells or regions) to plot. This is recommended when working with regions (e.g. selecting regions in binarized topics), as working with all regions can be time consuming. Default: None (use all features)
- save: str, optional
Path to save plot. Default: None.
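As a rough illustration of two of the parameters above, the sketch below builds a color_dictionary in the nested shape described (one entry per variable, mapping levels to colors) and works out how many rows a multiplot grid needs for a given num_columns. The variable names, levels, colors and the helper n_rows are made up for illustration, and the row calculation is an assumption about what "automatically determined" means, not pycisTopic's own code.

```python
import math

# Hypothetical variables and levels; colors are ordinary hex strings.
color_dictionary = {
    "cell_type": {"B_cell": "#1f77b4", "T_cell": "#ff7f0e"},
    "sample_id": {"sample_1": "#2ca02c", "sample_2": "#d62728"},
}


def n_rows(n_plots: int, num_columns: int) -> int:
    """Rows needed to fit n_plots panels into num_columns columns (assumed rule)."""
    return math.ceil(n_plots / num_columns)


variables = ["cell_type", "sample_id", "n_fragments"]  # hypothetical metadata columns
print(n_rows(len(variables), num_columns=2))  # three plots laid out in two columns
```

With num_columns left at 1, each variable would simply get its own row.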
- pycisTopic.clust_vis.plot_topic(cistopic_obj: CistopicObject, reduction_name: str, target: str | None = 'cell', cmap: str | matplotlib.cm | None = cm.viridis, dot_size: int | None = 10, alpha: float | int | None = 1, scale: bool | None = False, selected_topics: List[int] | None = None, selected_features: List[str] | None = None, harmony: bool | None = False, figsize: Tuple[float, float] | None = (6.4, 4.8), num_columns: int | None = 1, save: str | None = None)[source]
Plot topic distributions into dimensionality reduction.
Parameters
- cistopic_obj: class::CistopicObject
A cisTopic object with dimensionality reductions in class::CistopicObject.projections.
- reduction_name: str
Name of the dimensionality reduction to use
- target: str, optional
Whether cells (‘cell’) or regions (‘region’) should be used. Default: ‘cell’
- cmap: str or ‘matplotlib.cm’, optional
For continuous variables, color map to use for the legend color bar. Default: cm.viridis
- dot_size: int, optional
Dot size in the plot. Default: 10
- alpha: float, optional
Transparency value for the dots in the plot. Default: 1
- scale: bool, optional
Whether to scale the cell-topic or topic-regions contributions prior to plotting. Default: False
- selected_topics: list, optional
A list with selected topics to be used for plotting. Default: None (use all topics)
- selected_features: list, optional
A list with selected features (cells or regions) to plot. This is recommended when working with regions (e.g. selecting regions in binarized topics), as working with all regions can be time consuming. Default: None (use all features)
- harmony: bool, optional
If target is ‘cell’, whether to use harmony processed topic contributions. Default: False
- figsize: tuple, optional
Size of the figure. If num_columns is 1, this is the size for each figure; if num_columns is above 1, this is the overall size of the figure (if keeping default, it will be the size of each subplot in the figure). Default: (6.4, 4.8)
- num_columns: int, optional
For multiplot figures, indicates the number of columns (the number of rows will be automatically determined based on the number of plots). Default: 1
- save: str, optional
Path to save plot. Default: None.
- pycisTopic.clust_vis.run_tsne(cistopic_obj: CistopicObject, target: str | None = 'cell', scale: bool | None = False, reduction_name: str | None = 'tSNE', random_state: int | None = 555, perplexity: int | None = 30, selected_topics: List[int] | None = None, selected_features: List[str] | None = None, harmony: bool | None = False, rna_components: DataFrame | None = None, rna_weight: float | None = 0.5, **kwargs)[source]
Run tSNE and add it to the dimensionality reduction dictionary. If FItSNE is installed it will be used, otherwise sklearn TSNE implementation will be used.
Parameters
- cistopic_obj: class::CistopicObject
A cisTopic object with a model in class::CistopicObject.selected_model.
- target: str, optional
Whether cells (‘cell’) or regions (‘region’) should be used. Default: ‘cell’
- scale: bool, optional
Whether to scale the cell-topic or topic-regions contributions prior to the dimensionality reduction. Default: False
- reduction_name: str, optional
Reduction name to use as key in the dimensionality reduction dictionary. Default: ‘tSNE’
- random_state: int, optional
Seed parameter for running tSNE. Default: 555
- perplexity: int, optional
Perplexity parameter for FItSNE. Default: 30
- selected_topics: list, optional
A list with selected topics to be used for clustering. Default: None (use all topics)
- selected_features: list, optional
A list with selected features (cells or regions) to cluster. This is recommended when working with regions (e.g. selecting regions in binarized topics), as working with all regions can be time consuming. Default: None (use all features)
- harmony: bool, optional
If target is ‘cell’, whether to use harmony processed topic contributions. Default: False
- rna_components: pd.DataFrame, optional
A pandas dataframe containing RNA dimensionality reduction (e.g. PCA) components. If provided, both layers (atac and rna) will be considered for clustering.
- rna_weight: float, optional
Weight of the RNA layer on the clustering (only applicable when clustering via UMAP). Default: 0.5 (same weight)
- **kwargs
Parameters to pass to fitsne.FItSNE or sklearn.manifold.TSNE.
- pycisTopic.clust_vis.run_umap(cistopic_obj: CistopicObject, target: str | None = 'cell', scale: bool | None = False, reduction_name: str | None = 'UMAP', random_state: int | None = 555, selected_topics: List[int] | None = None, selected_features: List[str] | None = None, harmony: bool | None = False, rna_components: DataFrame | None = None, rna_weight: float | None = 0.5, **kwargs)[source]
Run UMAP and add it to the dimensionality reduction dictionary.
Parameters
- cistopic_obj: class::CistopicObject
A cisTopic object with a model in class::CistopicObject.selected_model.
- target: str, optional
Whether cells (‘cell’) or regions (‘region’) should be used. Default: ‘cell’
- scale: bool, optional
Whether to scale the cell-topic or topic-regions contributions prior to the dimensionality reduction. Default: False
- reduction_name: str, optional
Reduction name to use as key in the dimensionality reduction dictionary. Default: ‘UMAP’
- random_state: int, optional
Seed parameter for running UMAP. Default: 555
- selected_topics: list, optional
A list with selected topics to be used for clustering. Default: None (use all topics)
- selected_features: list, optional
A list with selected features (cells or regions) to cluster. This is recommended when working with regions (e.g. selecting regions in binarized topics), as working with all regions can be time consuming. Default: None (use all features)
- harmony: bool, optional
If target is ‘cell’, whether to use harmony processed topic contributions. Default: False.
- rna_components: pd.DataFrame, optional
A pandas dataframe containing RNA dimensionality reduction (e.g. PCA) components. If provided, both layers (atac and rna) will be considered for clustering.
- rna_weight: float, optional
Weight of the RNA layer on the clustering (only applicable when clustering via UMAP). Default: 0.5 (same weight)
- **kwargs
Parameters to pass to umap.UMAP.
Drop-out imputation & Differential features
- class pycisTopic.diff_features.CistopicImputedFeatures(imputed_acc: csr_matrix, feature_names: List[str], cell_names: List[str], project: str)[source]
cisTopic imputation data class.
CistopicImputedFeatures contains the cell by features matrices (stored at mtx, with features being either regions or genes), cell names (cell_names) and feature names (feature_names).
Attributes
- mtx: sparse.csr_matrix
A matrix containing imputed values.
- cell_names: list
A list containing cell names.
- feature_names: list
A list containing feature names.
- project: str
Name of the cisTopic imputation project.
- make_rankings(seed=123)[source]
A function to generate rankings per cell based on the imputed accessibility scores per region.
Parameters
- seed: int, optional
Random seed to ensure reproducibility of the rankings when there are ties
Return
- CistopicImputedFeatures
A CistopicImputedFeatures containing ranking values rather than scores.
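The role of the seed can be sketched as follows: when several regions have the same imputed score in a cell, their relative order is arbitrary, so a seeded random generator makes the tie-breaking reproducible. This is a pure-Python illustration and not the library's implementation; the function name rank_scores and the convention that rank 1 is the highest score are assumptions.

```python
import random


def rank_scores(scores: list[float], seed: int = 123) -> list[int]:
    """Return 1-based ranks (1 = highest score); ties are broken at random."""
    rng = random.Random(seed)
    indices = list(range(len(scores)))
    rng.shuffle(indices)                     # random order, used only among ties
    indices.sort(key=lambda i: -scores[i])   # stable sort keeps the shuffled tie order
    ranks = [0] * len(scores)
    for rank, i in enumerate(indices, start=1):
        ranks[i] = rank
    return ranks


cell_scores = [0.9, 0.1, 0.1, 0.5]  # imputed scores of four regions in one cell
print(rank_scores(cell_scores, seed=123))
```

The two tied regions receive ranks 3 and 4 in an order that depends only on the seed, so repeated runs with the same seed agree.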
- merge(cistopic_imputed_features_list: List[CistopicImputedFeatures], project: str | None = 'cisTopic_impute_merge', copy: bool | None = False)[source]
Merge a list of CistopicImputedFeatures to the input CistopicImputedFeatures. Reference coordinates (for regions) must be the same between the objects.
Parameters
- cistopic_imputed_features_list: list
A list containing one or more CistopicImputedFeatures to merge.
- project: str, optional
Name of the cisTopic imputation project.
- copy: bool, optional
Whether changes should be made on the input CistopicObject or a new object should be returned.
Return
- CistopicImputedFeatures
A combined CistopicImputedFeatures.
- subset(cells: List[str] | None = None, features: List[str] | None = None, copy: bool | None = False, split_pattern: str | None = '___')[source]
Subset cells and/or regions from CistopicImputedFeatures.
Parameters
- cells: list, optional
A list containing the names of the cells to keep.
- features: list, optional
A list containing the names of the features to keep.
- copy: bool, optional
Whether changes should be made on the input CistopicObject or a new object should be returned.
- split_pattern: str
Pattern to split cell barcode from sample id. Default: ___
- pycisTopic.diff_features.find_diff_features(cistopic_obj: CistopicObject, imputed_features_obj: CistopicImputedFeatures, variable: str, var_features: List[str] | None = None, contrasts: List[List[str]] | None = None, adjpval_thr: float | None = 0.05, log2fc_thr: float | None = 0.5849625007211562, split_pattern: str | None = '___', n_cpu: int | None = 1, **kwargs)[source]
Find differential imputed features.
Parameters
- cistopic_obj: class::CistopicObject
A cisTopic object including the cells in imputed_features_obj.
- imputed_features_obj: CistopicImputedFeatures
A cisTopic imputation data object.
- variable: str
Name of the group variable to do comparison. It must be included in class::CistopicObject.cell_data
- var_features: list, optional
A list of features to use (e.g. variable features from find_highly_variable_features())
- contrasts: List, optional
A list including contrasts to make in the form of lists with foreground and background, e.g. [[['Group_1'], ['Group_2', 'Group_3']], [['Group_2'], ['Group_1', 'Group_3']], [['Group_1'], ['Group_2', 'Group_3']]]. Default: None.
- adjpval_thr: float, optional
Adjusted p-values threshold. Default: 0.05
- log2fc_thr: float, optional
Log2FC threshold. Default: np.log2(1.5)
- split_pattern: str
Pattern to split cell barcode from sample id. Default: ___
- n_cpu: int, optional
Number of cores to use. Default: 1
- **kwargs
Parameters to pass to ray.init()
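The nested shape expected by contrasts (a list of [foreground, background] pairs, each side being a list of group names) can be generated rather than typed by hand. The helper below is a hypothetical convenience, not part of pycisTopic, and it makes no claim about what contrasts=None does internally.

```python
def one_vs_rest_contrasts(groups: list[str]) -> list[list[list[str]]]:
    """Build one [[foreground], [background]] pair per group (hypothetical helper)."""
    return [[[g], [other for other in groups if other != g]] for g in groups]


contrasts = one_vs_rest_contrasts(["Group_1", "Group_2", "Group_3"])
for foreground, background in contrasts:
    print(foreground, "vs", background)
```

The resulting list could then be passed as the contrasts argument, provided the group names match the levels of the chosen variable in cell_data.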
- pycisTopic.diff_features.find_highly_variable_features(input_mat: DataFrame | CistopicImputedFeatures, min_disp: float | None = 0.05, min_mean: float | None = 0.0125, max_disp: float | None = inf, max_mean: float | None = 3, n_bins: int | None = 20, n_top_features: int | None = None, plot: bool | None = True, save: str | None = None)[source]
Find highly variable features.
Parameters
- input_mat: pd.DataFrame or CistopicImputedFeatures
A dataframe with values to be normalized or cisTopic imputation data.
- min_disp: float, optional
Minimum dispersion value for a feature to be selected. Default: 0.05
- min_mean: float, optional
Minimum mean value for a feature to be selected. Default: 0.0125
- max_disp: float, optional
Maximum dispersion value for a feature to be selected. Default: np.inf
- max_mean: float, optional
Maximum mean value for a feature to be selected. Default: 3
- n_bins: int, optional
Number of bins for binning the mean gene expression. Normalization is done with respect to each bin. Default: 20
- n_top_features: int, optional
Number of highly-variable features to keep. If specified, dispersion and mean thresholds will be ignored. Default: None
- plot: bool, optional
Whether to plot dispersion versus mean values. Default: True.
- save: str, optional
Path to save feature selection plot. Default: None
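The interplay between the thresholds and n_top_features can be sketched as below. This is a simplified stand-in: it takes per-feature means and (already normalized) dispersions as plain lists, and the function name, the strict inequalities and the omission of the binning step are all assumptions made for illustration.

```python
import math


def select_features(means, disps, min_disp=0.05, min_mean=0.0125,
                    max_disp=math.inf, max_mean=3, n_top_features=None):
    """Return the indices of the selected features (illustrative logic only)."""
    idx = range(len(means))
    if n_top_features is not None:
        # Thresholds are ignored: keep the most dispersed features.
        ranked = sorted(idx, key=lambda i: disps[i], reverse=True)
        return sorted(ranked[:n_top_features])
    return [i for i in idx
            if min_mean < means[i] < max_mean and min_disp < disps[i] < max_disp]


means = [0.01, 0.5, 1.0, 4.0]
disps = [0.2, 0.01, 0.3, 0.5]
print(select_features(means, disps))                    # threshold-based selection
print(select_features(means, disps, n_top_features=2))  # top-n selection
```

Note that the last feature fails the max_mean threshold but is still kept when n_top_features is given, which is the behavior the parameter description implies.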
- pycisTopic.diff_features.get_log2_fc(fg_mat, bg_mat)[source]
Calculate log2 fold change between foreground and background matrix.
- Parameters:
- fg_mat
2D-numpy foreground matrix.
- bg_mat
2D-numpy background matrix.
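A log2 fold change of this kind is typically the log2 ratio of the mean foreground value to the mean background value per feature. The sketch below shows that calculation in plain Python; the orientation (rows are features, columns are cells), the small pseudocount that avoids division by zero, and the name log2_fc are assumptions rather than details taken from the library.

```python
import math


def log2_fc(fg_rows, bg_rows, pseudocount=1e-12):
    """Per-feature log2 fold change; rows are features, columns are cells (assumed)."""
    result = []
    for fg, bg in zip(fg_rows, bg_rows):
        fg_mean = sum(fg) / len(fg)
        bg_mean = sum(bg) / len(bg)
        result.append(math.log2((fg_mean + pseudocount) / (bg_mean + pseudocount)))
    return result


fg = [[3.0, 3.0], [1.0, 1.0]]            # two features x two foreground cells
bg = [[2.0, 2.0, 2.0], [4.0, 4.0, 4.0]]  # two features x three background cells
print(log2_fc(fg, bg))
```

The first feature has a 1.5-fold higher mean in the foreground, which corresponds to the np.log2(1.5) default used for log2fc_thr in find_diff_features.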
- pycisTopic.diff_features.get_wilcox_test_pvalues(fg_mat, bg_mat)[source]
Calculate wilcox test p-values between foreground and background matrix.
- Parameters:
- fg_mat
2D-numpy foreground matrix.
- bg_mat
2D-numpy background matrix.
- pycisTopic.diff_features.impute_accessibility(cistopic_obj: CistopicObject, selected_cells: List[str] | None = None, selected_regions: List[str] | None = None, scale_factor: int | None = 1000000, chunk_size: int = 20000, project: str | None = 'cisTopic_Impute')[source]
Impute region accessibility.
- Parameters:
- cistopic_obj: `class::CistopicObject`
A cisTopic object with a model in class::CistopicObject.selected_model.
- selected_cells: list, optional
A list with selected cells to impute accessibility for. Default: None
- selected_regions: list, optional
A list with selected regions to impute accessibility for. Default: None
- scale_factor: int, optional
A number to multiply the imputed values by. This is useful to convert low probabilities to 0, making the matrix more sparse. Default: 10**6.
- chunk_size:
Chunk size used (number of regions for which imputed accessibility is calculated at the same time).
- project: str, optional
Name of the cisTopic imputation project. Default: cisTopic_Impute.
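Conceptually, imputation from a topic model multiplies cell-topic probabilities by topic-region probabilities, and processing regions in chunks keeps the intermediate result small. The sketch below illustrates that idea with nested lists. It assumes the product formulation of LDA-based imputation and assumes that scaled values are truncated to integers (which is how low probabilities would become 0); neither detail is confirmed by the text above, and impute_chunked is a hypothetical name.

```python
def impute_chunked(cell_topic, topic_region, scale_factor=10**6, chunk_size=2):
    """cell_topic: cells x topics; topic_region: topics x regions (lists of rows)."""
    n_regions = len(topic_region[0])
    imputed = [[] for _ in cell_topic]
    for start in range(0, n_regions, chunk_size):    # process regions chunk by chunk
        stop = min(start + chunk_size, n_regions)
        for c, topic_probs in enumerate(cell_topic):
            for r in range(start, stop):
                p = sum(t * topic_region[k][r] for k, t in enumerate(topic_probs))
                imputed[c].append(int(p * scale_factor))  # truncation zeroes tiny values
    return imputed


cell_topic = [[1.0, 0.0], [0.5, 0.5]]                # two cells, two topics
topic_region = [[0.5, 0.5, 0.0], [0.0, 0.25, 0.75]]  # two topics, three regions
print(impute_chunked(cell_topic, topic_region, scale_factor=100))
```

A larger chunk_size trades memory for fewer passes; the result is the same either way.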
- pycisTopic.diff_features.markers(input_mat: DataFrame | CistopicImputedFeatures, barcode_group: List[List[str]], contrast_name: str, adjpval_thr: float | None = 0.05, log2fc_thr: float | None = 1, n_cpu: int | None = 1)[source]
Find differential imputed features.
- Parameters:
- input_mat: :class:`pd.DataFrame` or :class:`CistopicImputedFeatures`
A data frame or a cisTopic imputation data object.
- barcode_group: List
List of length 2, including foreground cells on the first slot and background on the second.
- contrast_name: str
Name of the contrast
- adjpval_thr: float, optional
Adjusted p-values threshold. Default: 0.05
- log2fc_thr: float, optional
Log2FC threshold. Default: 1
- n_cpu: int, optional
Number of cores to use. Default: 1
- pycisTopic.diff_features.mean_axis1(arr)[source]
Calculate the mean along axis 1 of a 2D-numpy matrix with numba, mimicking np.mean(x, axis=1) (one mean per row).
- Parameters:
- arr
2D-numpy array to calculate the mean along axis 1 for.
- pycisTopic.diff_features.normalize_scores(imputed_acc: DataFrame | CistopicImputedFeatures, scale_factor: int = 10000)[source]
Log-normalize imputation data. Feature counts for each cell are divided by the total counts for that cell and multiplied by the scale_factor.
- Parameters:
- imputed_acc: pd.DataFrame or :class:`CistopicImputedFeatures`
A dataframe with values to be normalized or cisTopic imputation data.
- scale_factor: int
Scale factor for cell-level normalization. Default: 10**4
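The normalization described above can be written out for a single cell as follows. The division by the cell total and the multiplication by scale_factor come from the description; the use of log1p for the log step is an assumption borrowed from common single-cell practice, and log_normalize is a hypothetical name.

```python
import math


def log_normalize(cell_values, scale_factor=10**4):
    """Scale one cell's values to scale_factor total, then apply log1p (assumed)."""
    total = sum(cell_values)
    return [math.log1p(v / total * scale_factor) for v in cell_values]


print(log_normalize([1, 3]))  # values become log1p(2500) and log1p(7500)
```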
- pycisTopic.diff_features.p_adjust_bh(p: float)[source]
Benjamini-Hochberg p-value correction for multiple hypothesis testing.
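For readers unfamiliar with the procedure, a plain-Python version of the Benjamini-Hochberg adjustment is sketched below. Each p-value is multiplied by n / rank (rank in ascending order), and a running minimum from the largest p-value downwards keeps the adjusted values monotone. The function name bh_adjust is hypothetical and the code is a reference sketch of the standard method, not the library's implementation.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)  # largest first
    adjusted = [0.0] * n
    running_min = 1.0
    for position, i in enumerate(order):
        rank = n - position                  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

For this input the adjusted values are approximately 0.02, 0.04, 0.04 and 0.02, so all four would pass a 0.05 threshold.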
- pycisTopic.diff_features.subset_array_second_axis(arr, col_indices)[source]
Subset array by second axis based on provided col_indices.
Returns the same as arr[:, col_indices], but is much faster when arr and col_indices are big.
- Parameters:
- arr
2D-numpy array to subset by provided column indices.
- col_indices
1D-numpy array (preferably with np.int64 as dtype) with column indices.
Topic binarization
- pycisTopic.topic_binarization.binarize_topics(cistopic_obj: CistopicObject, target: str | None = 'region', method: str | None = 'otsu', smooth_topics: bool = True, ntop: int = 2000, predefined_thr: dict[str, float] | None = None, nbins: int = 100, plot: bool = False, figsize: tuple[float, float] | None = (6.4, 4.8), num_columns: int = 1, save: str | None = None)[source]
Binarize topic distributions.
- Parameters:
- cistopic_obj
A cisTopic object with a model in CistopicObject.
- target
Whether cell-topic (“cell”) or region-topic (“region”) distributions should be binarized. Default: “region”.
- method
Method to use for topic binarization. Possible options are:
  - otsu [Otsu, 1979]
  - yen [Yen et al., 1995]
  - li [Li & Lee, 1993]
  - aucell [Van de Sande et al., 2020]
  - ntop [Taking the top n regions per topic]
Default: otsu.
- smooth_topics
Whether to smooth topics distributions to penalize regions enriched across many topics. The following formula is applied:
\[\beta_{w,k} \left(\log \beta_{w,k} - \frac{1}{K} \sum_{k'} \log \beta_{w,k'}\right)\]
- ntop
Number of top regions to select when using method="ntop". Default: 2000.
- predefined_thr
A dictionary containing topics as keys and threshold as values. If a topic is not present, thresholds will be computed with the specified method. This can be used for manually adjusting thresholds when necessary. Default: None.
- nbins
Number of bins to use in the histogram used for otsu, yen and li thresholding. Default: 100.
- plot
Whether to plot region-topic distributions and their threshold. Default: False.
- figsize
Size of the figure. If num_columns is 1, this is the size for each figure. If num_columns is above 1, this is the overall size of the figure. If keeping the default, it will be the size of each subplot in the figure. Default: (6.4, 4.8).
- num_columns
For multiplot figures, indicates the number of columns (the number of rows will be automatically determined based on the number of plots). Default: 1.
- save
Path to save plot. Default: None.
- Returns:
- A dictionary with topics as keys, each holding a pd.DataFrame with the selected regions, with region names as indexes and a topic score column.
References
Otsu, N., 1979. A threshold selection method from gray-level histograms. IEEE transactions on systems, man, and cybernetics, 9(1), pp.62-66.
Yen, J.C., Chang, F.J. and Chang, S., 1995. A new criterion for automatic multilevel thresholding. IEEE Transactions on Image Processing, 4(3), pp.370-378.
Li, C.H. and Lee, C.K., 1993. Minimum cross entropy thresholding. Pattern recognition, 26(4), pp.617-625.
Van de Sande, B., Flerin, C., Davie, K., De Waegeneer, M., Hulselmans, G., Aibar, S., Seurinck, R., Saelens, W., Cannoodt, R., Rouchon, Q. and Verbeiren, T., 2020. A scalable SCENIC workflow for single-cell gene regulatory network analysis. Nature Protocols, 15(7), pp.2247-2276.
- pycisTopic.topic_binarization.cross_entropy(array: ndarray, threshold: float, nbins: int = 100) -> float [source]
Calculate entropies for Li thresholding on topic-region distributions [Li & Lee, 1993].
- Parameters:
- array
Array containing the region values for the topic to be binarized.
- threshold
Distribution threshold to calculate entropy from.
- nbins
Number of bins to use in the binarization histogram.
- Returns:
- Entropy for the given threshold.
- pycisTopic.topic_binarization.histogram_and_bin_centers(array: ndarray, nbins: int = 100) -> tuple[ndarray, ndarray] [source]
Draw histogram from distribution and identify centers.
- Parameters:
- array
Scores distribution.
- nbins
Number of bins to use in the histogram.
- Returns:
- Histogram values and bin centers.
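What "histogram values and bin centers" means can be shown with a small equal-width histogram in plain Python. The function name histogram_and_centers is hypothetical, and the binning rule (equal-width bins between the minimum and maximum, with the maximum placed in the last bin) is an assumption that mirrors the usual NumPy convention.

```python
def histogram_and_centers(values, nbins=100):
    """Equal-width histogram; assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        idx = min(int((v - lo) / width), nbins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(nbins)]
    return counts, centers


counts, centers = histogram_and_centers([0, 1, 2, 3], nbins=2)
print(counts, centers)
```

The bin centers, rather than the bin edges, are what the thresholding methods then use as candidate thresholds.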
- pycisTopic.topic_binarization.smooth_topics_distributions(topic_region_distributions: DataFrame) -> DataFrame [source]
Smooth topic-region distributions.
Smooth topics distributions to penalize regions enriched across many topics. The formula applied is:
\[\beta_{w,k} \left(\log \beta_{w,k} - \frac{1}{K} \sum_{k'} \log \beta_{w,k'}\right)\]
- Parameters:
- topic_region_distributions
A pandas dataframe with topic-region distributions (with topics as columns and regions as rows).
- Returns:
- Smoothed topic-region dataframe.
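The smoothing formula can be checked by hand for a single region. In the sketch below, betas holds one region's probability in each of the K topics; a region that is equally probable in every topic gets a score of 0 everywhere, while a region concentrated in one topic gets a positive score there. The function name smooth_region is hypothetical and the list-based layout is a simplification of the dataframe the function operates on.

```python
import math


def smooth_region(betas):
    """Apply beta * (log(beta) - mean of log(beta) over topics) to one region."""
    mean_log = sum(math.log(b) for b in betas) / len(betas)
    return [b * (math.log(b) - mean_log) for b in betas]


print(smooth_region([0.5, 0.5]))  # equally enriched in both topics
print(smooth_region([0.8, 0.2]))  # mostly in the first topic
```

For the second region the first topic's score works out to 0.4 * ln(4), roughly 0.55, and the second topic's score is negative.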
- pycisTopic.topic_binarization.threshold_otsu(array: ndarray, nbins: int = 100) -> float [source]
Apply Otsu threshold on topic-region distributions [Otsu, 1979].
- Parameters:
- array
Array containing the region values for the topic to be binarized.
- nbins
Number of bins to use in the binarization histogram.
- Returns:
- Binarization threshold.
- pycisTopic.topic_binarization.threshold_yen(array: ndarray, nbins: int = 100) -> float [source]
Apply Yen threshold on topic-region distributions [Yen et al., 1995].
- Parameters:
- array
Array containing the region values for the topic to be binarized.
- nbins
Number of bins to use in the binarization histogram.
- Returns:
- Binarization threshold.
Topic QC
- pycisTopic.topic_qc.compute_topic_metrics(cistopic_obj: CistopicObject, return_metrics: bool | None = True)[source]
Compute topic quality control metrics.
Parameters
- cistopic_obj: class::CistopicObject
A cisTopic object with a model in class::CistopicObject.selected_model.
- return_metrics: bool, optional
Whether to return metrics as class::pd.DataFrame. The metrics will also be appended to class::CistopicObject.selected_model.topic_qc_metrics regardless of the value of this parameter. Default: True.
References
Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 262-272).
- pycisTopic.topic_qc.plot_topic_qc(topic_qc_metrics: DataFrame | CistopicObject, var_x: str, var_y: str, min_x: int | None = None, max_x: int | None = None, min_y: int | None = None, max_y: int | None = None, var_color: str | None = None, cmap: str | None = 'viridis', dot_size: int | None = 10, text_size: int | None = 10, plot: bool | None = False, save: str | None = None, return_topics: bool | None = False, return_fig: bool | None = False)[source]
Plot topic QC metrics and filter topics.
Parameters
- topic_qc_metrics: class::pd.DataFrame or class::CistopicObject
A topic metrics dataframe or a cisTopic object with class::CistopicObject.selected_model.topic_qc_metrics filled.
- var_x: str
Metric to plot.
- var_y: str, optional
A second metric to plot in combination with var_x.
- min_x: float, optional
Minimum value on var_x to keep the topic. Default: None.
- max_x: float, optional
Maximum value on var_x to keep the topic. Default: None.
- min_y: float, optional
Minimum value on var_y to keep the topic. Default: None.
- max_y: float, optional
Maximum value on var_y to keep the topic. Default: None.
- var_color: str, optional
Metric to color plot by. Default: None
- cmap: str, optional
Color map to color 2D dot plots by density. Default: ‘viridis’.
- dot_size: int, optional
Dot size in the plot. Default: 10
- text_size: int, optional
Size of the labels in the plot. Default: 10
- plot: bool, optional
Whether the plots should be returned to the console. Default: False.
- save: str, optional
Path to save plots as a file. Default: None.
- return_topics: bool, optional
Whether to return selected topics based on user-given thresholds. Default: False.
- return_fig: bool, optional
Whether to return the plot figure; if several samples it will return a dictionary with the figures per sample. Default: False.
Return
- list
A list with the selected topics.
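The filtering that the min/max parameters describe amounts to keeping the topics whose two metrics fall within the given bounds. A minimal sketch, assuming inclusive bounds and using made-up metric names ("coherence", "assignments") in a plain dictionary instead of a dataframe:

```python
def select_topics(metrics, var_x, var_y, min_x=None, max_x=None, min_y=None, max_y=None):
    """metrics: {topic: {metric: value}}; returns the topics within all given bounds."""
    def within(value, lower, upper):
        return (lower is None or value >= lower) and (upper is None or value <= upper)

    return [topic for topic, row in metrics.items()
            if within(row[var_x], min_x, max_x) and within(row[var_y], min_y, max_y)]


# Hypothetical metric values for three topics.
metrics = {
    "Topic1": {"coherence": -1.0, "assignments": 5000},
    "Topic2": {"coherence": -3.0, "assignments": 200},
    "Topic3": {"coherence": -1.5, "assignments": 8000},
}
print(select_topics(metrics, "coherence", "assignments", min_x=-2.0, min_y=1000))
```

Bounds left as None are simply not applied, which matches the None defaults listed above.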
- pycisTopic.topic_qc.topic_annotation(cistopic_obj: CistopicObject, annot_var: str, binarized_cell_topic: Dict[str, DataFrame] | None = None, general_topic_thr: float | None = 0.2, **kwargs)[source]
Automatic annotation of topics.
Parameters
- cistopic_obj: class::CistopicObject
A cisTopic object with a model in class::CistopicObject.selected_model.
- annot_var: str
Name of the variable (contained in ‘class::CistopicObject.cell_data’) to use for annotation
- binarized_cell_topic: Dict, optional
A dictionary containing binarized cell topic distributions (from binarize_topics()). If not provided, binarize_topics() will be run. Default: None.
- general_topic_thr: float, optional
Threshold for considering a topic as general. After assigning topics to annotations, the ratio of cells in the binarized topic in the whole population is compared with the ratio of the total number of cells in the assigned groups versus the whole population. If the difference is above this threshold, the topic is considered general. Default: 0.2.
- **kwargs
Arguments to pass to binarize_topics()
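The general_topic_thr rule can be made concrete with counts. The sketch below compares the fraction of all cells that belong to the binarized topic with the fraction of all cells that belong to the annotation groups assigned to it. The function name is hypothetical, and using the signed difference (topic ratio minus group ratio) is an assumption about how "difference" is computed.

```python
def is_general_topic(n_cells_in_topic, n_cells_in_assigned_groups, n_cells_total,
                     general_topic_thr=0.2):
    """True when the topic covers many more cells than its assigned groups explain."""
    ratio_topic = n_cells_in_topic / n_cells_total
    ratio_groups = n_cells_in_assigned_groups / n_cells_total
    return (ratio_topic - ratio_groups) > general_topic_thr


# A topic active in 800 of 1000 cells but assigned to groups totalling 300 cells.
print(is_general_topic(800, 300, 1000))
# A topic active in 350 cells, assigned to groups totalling 300 cells.
print(is_general_topic(350, 300, 1000))
```

In the first case the gap is 0.5, well above the default threshold, so the topic would be flagged as general; in the second the gap is only 0.05.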
Export to loom
- pycisTopic.loom.add_annotation(loom, annots: DataFrame)[source]
A helper function to add annotations
- pycisTopic.loom.add_clusterings(loom: SCopeLoom, cluster_data: DataFrame)[source]
A helper function to add clusters
- pycisTopic.loom.add_markers(loom: SCopeLoom, markers_dict: Dict[str, Dict[str, DataFrame]])[source]
A helper function to add markers to clusterings
- pycisTopic.loom.df_to_named_matrix(df: DataFrame)[source]
A helper function to create metadata structure.
- pycisTopic.loom.export_gene_activity_to_loom(gene_activity_matrix: CistopicImputedFeatures | DataFrame, cistopic_obj: CistopicObject, out_fname: str, regulons: List[Regulon] = None, selected_genes: List[str] | None = None, selected_cells: List[str] | None = None, auc_mtx: DataFrame | None = None, auc_thresholds: DataFrame | None = None, cluster_annotation: List[str] = None, cluster_markers: Dict[str, Dict[str, DataFrame]] = None, tree_structure: Sequence[str] = (), title: str = None, nomenclature: str = 'Unknown', split_pattern='___', num_workers: int = 1, **kwargs)[source]
Create SCope [Davie et al, 2018] compatible loom files for gene activity exploration
Parameters
- gene_activity_matrix: class::CistopicImputedFeatures or class::pd.DataFrame
A cisTopic imputed features object containing imputed gene activity as values. Alternatively, a pandas data frame with genes as columns, cells as rows and gene activity per gene as values.
- cistopic_obj: class::CistopicObject
The cisTopic object from which gene activity values have been derived. It must include cell meta data (including specified cluster annotation columns).
- regulons: list
A list of regulons as derived from pySCENIC (Van de Sande et al., 2020).
- out_fname: str
Path to output file.
- selected_genes: list, optional
A list specifying which genes should be included in the loom file. Default: None
- selected_cells: list, optional
A list specifying which cells should be included in the loom file. Default: None
- auc_mtx: pd.DataFrame, optional
A regulon AUC matrix for the regulons as derived from pySCENIC (Van de Sande et al., 2020). If not provided it will be inferred.
- auc_thresholds: pd.DataFrame, optional
AUC thresholds for the regulons as derived from pySCENIC (Van de Sande et al., 2020). If not provided it will be inferred.
- cluster_annotation: list, optional
A list indicating which information in cistopic_obj.cell_data should be used as clusters. The specified names must be included as columns in cistopic_obj.cell_data. Default: None.
- cluster_markers: dict, optional
A dictionary including an entry per cluster annotation (which should match with the names in cluster_annotation) including a dictionary per cluster with a pandas data frame with marker regions as rows and logFC and adjusted p-values as columns (the output of find_diff_features). Default: None.
- tree_structure: sequence, optional
A sequence of strings that defines the category tree structure. Needs to be a sequence of strings with three elements. Default: ()
- title: str, optional
The title for this loom file. If None then the basename of the filename is used as the title. Default: None
- nomenclature: str, optional
The name of the genome. Default: ‘Unknown’
- **kwargs
Additional parameters for pyscenic.export.export2loom
References
Davie, K., Janssens, J., Koldere, D., De Waegeneer, M., Pech, U., Kreft, Ł., … & Aerts, S. (2018). A single-cell transcriptome atlas of the aging Drosophila brain. Cell, 174(4), 982-998.
Van de Sande, B., Flerin, C., Davie, K., De Waegeneer, M., Hulselmans, G., Aibar, S., … & Aerts, S. (2020). A scalable SCENIC workflow for single-cell gene regulatory network analysis. Nature Protocols, 15(7), 2247-2276.
- pycisTopic.loom.export_minimal_loom_gene(ex_mtx: DataFrame, embeddings: Mapping[str, DataFrame], out_fname: str, regulons: List[Regulon] = None, cell_annotations: Mapping[str, str] | None = None, tree_structure: Sequence[str] = (), title: str | None = None, nomenclature: str = 'Unknown', num_workers: int = 2, auc_mtx=None, auc_thresholds=None, compress: bool = False)[source]
Create a loom file for a single cell experiment to be used in SCope.
Parameters
- ex_mtx
The expression matrix (n_cells x n_genes).
- regulons
A list of Regulons.
- cell_annotations
A dictionary that maps a cell ID to its corresponding cell type annotation.
- out_fname
The name of the file to create.
- tree_structure
A sequence of strings that defines the category tree structure. Needs to be a sequence of strings with three elements.
- title
The title for this loom file. If None then the basename of the filename is used as the title.
- nomenclature
The name of the genome.
- num_workers
The number of cores to use for AUCell regulon enrichment.
- embeddings
A dictionary that maps the name of an embedding to its representation as a pandas DataFrame with two columns: the first column is the first component of the projection for each cell followed by the second. The first mapping is the default embedding (use collections.OrderedDict to enforce this).
- compress
Compress metadata (only when using SCope).
- pycisTopic.loom.export_region_accessibility_to_loom(accessibility_matrix: CistopicImputedFeatures | DataFrame, cistopic_obj: CistopicObject, binarized_topic_region: Dict[str, DataFrame], binarized_cell_topic: Dict[str, DataFrame], out_fname: str, selected_regions: List[str] = None, selected_cells: List[str] = None, cluster_annotation: List[str] = None, cluster_markers: Dict[str, Dict[str, DataFrame]] = None, tree_structure: Sequence[str] = (), title: str = None, nomenclature: str = 'Unknown', split_pattern: str = '___', **kwargs)[source]
Create SCope [Davie et al, 2018] compatible loom files for accessibility data exploration
Parameters
- accessibility_matrix: class::CistopicImputedFeatures or class::pd.DataFrame
A cisTopic imputed features object containing imputed accessibility as values. Alternatively, a pandas data frame with regions as columns, cells as rows and accessibility per regions as values.
- cistopic_obj: class::CistopicObject
The cisTopic object from which accessibility values have been derived. It must include cell meta data (including specified cluster annotation columns) and the topic model from which accessibility has been imputed.
- binarized_topic_region: dictionary
A dictionary containing topics as keys and class::pd.DataFrame with regions in topics as index and their topic contribution as values. This is the output of binarize_topics() using target=’region’.
- binarized_cell_topic: dictionary
A dictionary containing topics as keys and class::pd.DataFrame with cells in topics as index and their topic contribution as values. This is the output of binarize_topics() using target=’cell’.
- out_fname: str
Path to output file.
- selected_regions: list, optional
A list specifying which regions should be included in the loom file. This is useful when working with very large data sets (e.g. one can select only regions in topics as DARs to reduce the file size). Default: None
- selected_cells: list, optional
A list specifying which cells should be included in the loom file. Default: None
- cluster_annotation: list, optional
A list indicating which information in cistopic_obj.cell_data should be used as clusters. The specified names must be included as columns in cistopic_obj.cell_data. Default: None.
- cluster_markers: dict, optional
A dictionary including an entry per cluster annotation (which should match with the names in cluster_annotation) including a dictionary per cluster with a pandas data frame with marker regions as rows and logFC and adjusted p-values as columns (the output of find_diff_features). Default: None.
- tree_structure: sequence, optional
A sequence of strings that defines the category tree structure. Needs to be a sequence of strings with three elements. Default: ()
- title: str, optional
The title for this loom file. If None, then the basename of the filename is used as the title. Default: None
- nomenclature: str, optional
The name of the genome. Default: ‘Unknown’
- **kwargs
Additional parameters for pyscenic.export.export2loom
References
Davie, K., Janssens, J., Koldere, D., De Waegeneer, M., Pech, U., Kreft, Ł., … & Aerts, S. (2018). A single-cell transcriptome atlas of the aging Drosophila brain. Cell, 174(4), 982-998.
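The parameter descriptions above imply a few consistency rules: every name in cluster_annotation must be a column of cistopic_obj.cell_data, the keys of cluster_markers should match cluster_annotation, and tree_structure needs three elements when given. A minimal sketch of such a pre-flight check follows. `check_loom_arguments` is a hypothetical helper, not part of pycisTopic, and it takes the column names as a plain list instead of a data frame.

```python
from typing import Dict, List, Optional, Sequence


def check_loom_arguments(
    cell_data_columns: List[str],
    cluster_annotation: Optional[List[str]] = None,
    cluster_markers: Optional[Dict[str, dict]] = None,
    tree_structure: Sequence[str] = (),
) -> List[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    annotations = cluster_annotation or []
    # Every cluster annotation must exist as a cell metadata column.
    for name in annotations:
        if name not in cell_data_columns:
            problems.append(f"'{name}' is not a column of cell_data")
    # Marker entries should correspond to a requested annotation.
    for name in (cluster_markers or {}):
        if name not in annotations:
            problems.append(f"markers given for unknown annotation '{name}'")
    # The documented default is an empty sequence; otherwise three strings.
    if len(tree_structure) not in (0, 3):
        problems.append("tree_structure must be empty or have three elements")
    return problems
```

An empty result means the arguments are mutually consistent; it does not guarantee that the export itself succeeds.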
Signature enrichment
- pycisTopic.signature_enrichment.gene_set_to_signature(gene_set: List, name: str)[source]
A helper function to generate gene signatures
Parameters
- gene_set: List
List of genes
- name: str
Name for the signature
- pycisTopic.signature_enrichment.region_set_to_signature(query_region_set: PyRanges, target_region_set: PyRanges, name: str)[source]
A helper function to intersect query regions with the input data set regions
Parameters
- query_region_set: pr.PyRanges
Pyranges with regions to query
- target_region_set: pr.PyRanges
Pyranges with target regions
- name: str
Name for the signature
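To illustrate what intersecting query regions with the data set regions means, here is a minimal pure-Python sketch. It assumes that the resulting signature consists of the target regions overlapping at least one query region and that regions are named `chr:start-end`; neither detail is stated above. `overlapping_targets` is a hypothetical name, and tuples stand in for PyRanges objects.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), half-open interval


def overlapping_targets(query: List[Region], target: List[Region]) -> List[str]:
    """Names of target regions that overlap at least one query region."""
    hits = []
    for chrom, start, end in target:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(
            q_chrom == chrom and q_start < end and start < q_end
            for q_chrom, q_start, q_end in query
        ):
            hits.append(f"{chrom}:{start}-{end}")
    return hits
```

The quadratic loop is only meant for illustration; interval libraries such as PyRanges do this far more efficiently.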
- pycisTopic.signature_enrichment.signature_enrichment(rankings: CistopicImputedFeatures, signatures: Dict[str, PyRanges] | Dict[str, List], enrichment_type: str = 'region', auc_threshold: float = 0.05, normalize: bool = False, n_cpu: int = 1)[source]
Get enrichment of a region signature in cells or topics using AUCell (Van de Sande et al., 2020)
Parameters
- rankings: CistopicImputedFeatures
A CistopicImputedFeatures object with ranking values
- signatures: Dictionary of pr.PyRanges (for regions) or list (for genes)
A dictionary containing region signatures as pr.PyRanges or gene names as list
- enrichment_type: str
Whether features are genes or regions
- auc_threshold: float
The fraction of the ranked genome to take into account for the calculation of the Area Under the recovery Curve. Default: 0.05
- normalize: bool
Normalize the AUC values to a maximum of 1.0 per regulon. Default: False
- n_cpu: int
The number of cores to use. Default: 1
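The role of auc_threshold can be illustrated with a simplified recovery-curve calculation. This is a sketch of the general idea behind AUCell, not the implementation used by pycisTopic: `recovery_auc` is a hypothetical function, the ranking is a plain list ordered from best to worst, and the result is scaled by the best achievable area within the cutoff.

```python
from typing import Iterable, List


def recovery_auc(ranking: List[str], signature: Iterable[str],
                 auc_threshold: float = 0.05) -> float:
    """Area under the recovery curve over the top fraction of a ranking."""
    signature = set(signature)
    # Only the top `auc_threshold` fraction of the ranking is considered.
    cutoff = max(1, round(auc_threshold * len(ranking)))
    hits = 0
    area = 0
    for feature in ranking[:cutoff]:
        if feature in signature:
            hits += 1
        area += hits  # height of the recovery curve at this rank
    # Best case: all signature features sit at the very top of the ranking.
    n_max = min(len(signature), cutoff)
    max_area = sum(min(i, n_max) for i in range(1, cutoff + 1))
    return area / max_area if max_area else 0.0
```

A signature concentrated at the top of the ranking scores close to 1, while a signature that only appears below the cutoff scores 0.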
pyGREAT
- pycisTopic.pyGREAT.get_region_signature(pyGREAT_results: Dict[str, DataFrame], region_set_key: str, ontology: str, term: str)[source]
Retrieve a GO region signature from GREAT results
- Parameters:
- pyGREAT_results: Dict
A dictionary with pyGREAT results.
- region_set_key: str
Key of the region set to query
- ontology: str
Ontology to query
- term: str
Term to retrieve regions from
- pycisTopic.pyGREAT.pyGREAT(region_sets: Dict[str, PyRanges], species: str, rule: str = 'basalPlusExt', span: float = 1000.0, upstream: float = 5.0, downstream: float = 1.0, two_distance: float = 1000.0, one_distance: float = 1000.0, include_curated_reg_doms: int = 1, bg_choice: str = 'wholeGenome', tmp_dir: str = None, n_cpu: int = 1, **kwargs)[source]
Run GREAT (McLean et al., 2010) on a dictionary of PyRanges objects. For more details on GREAT parameters, please visit http://great.stanford.edu/public/html/
- Parameters:
- region_sets: Dict
A dictionary containing region sets to query as pyRanges objects.
- species: str
Genome assembly from where the coordinates come from. Possible values are: ‘mm9’, ‘mm10’, ‘hg19’, ‘hg38’
- rule: str
How to associate genomic regions to genes. Possible options are ‘basalPlusExt’, ‘twoClosest’, ‘oneClosest’. Default: ‘basalPlusExt’
- span: float
Unit: kb, only used when rule is ‘basalPlusExt’. Default: 1000.0
- upstream: float
Unit: kb, only used when rule is ‘basalPlusExt’. Default: 5.0
- downstream: float
Unit: kb, only used when rule is ‘basalPlusExt’. Default: 1.0
- two_distance: float
Unit: kb, only used when rule is ‘twoClosest’. Default: 1000.0
- one_distance: float
Unit: kb, only used when rule is ‘oneClosest’. Default: 1000.0
- include_curated_reg_doms: int
Whether to include curated regulatory domains. Default: 1
- bg_choice: str
A path to the background file or a string. Default: ‘wholeGenome’
- tmp_dir: str
Temporary directory to save region sets as bed files for GREAT. Default: None
- n_cpu: int
Number of cores to use. Default: 1
- **kwargs
Other parameters to pass to ray.init
References
McLean, C. Y., Bristor, D., Hiller, M., Clarke, S. L., Schaar, B. T., Lowe, C. B., … & Bejerano, G. (2010). GREAT improves functional interpretation of cis-regulatory regions. Nature biotechnology, 28(5), 495-501.
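Under the ‘basalPlusExt’ rule described in the GREAT documentation, each gene gets a basal domain around its TSS (upstream and downstream, in kb), which is then extended by up to span kb until it meets the basal domain of a neighbouring gene. The sketch below only computes the strand-aware basal domain and omits the extension step. `basal_domain` is a hypothetical helper for illustration, not code from pycisTopic or GREAT.

```python
from typing import Tuple


def basal_domain(tss: int, strand: str, upstream_kb: float = 5.0,
                 downstream_kb: float = 1.0) -> Tuple[int, int]:
    """Basal regulatory domain (start, end) in bp around a TSS."""
    up = int(upstream_kb * 1000)
    down = int(downstream_kb * 1000)
    if strand == "+":
        # Upstream lies at lower coordinates on the forward strand.
        start, end = tss - up, tss + down
    elif strand == "-":
        # On the reverse strand, upstream lies at higher coordinates.
        start, end = tss - down, tss + up
    else:
        raise ValueError("strand must be '+' or '-'")
    # Coordinates cannot run off the start of the chromosome.
    return max(0, start), end
```

With the documented defaults, a forward-strand gene with its TSS at position 10,000 gets the basal domain 5,000 to 11,000.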
- pycisTopic.pyGREAT.pyGREAT_oneset(region_set: PyRanges, species: str, rule: str = 'basalPlusExt', span: float = 1000.0, upstream: float = 5.0, downstream: float = 1.0, two_distance: float = 1000.0, one_distance: float = 1000.0, include_curated_reg_doms: int = 1, bg_choice: str = 'wholeGenome', tmp_dir: str = None)[source]
Run GREAT (McLean et al., 2010) on a PyRanges object. For more details on GREAT parameters, please visit http://great.stanford.edu/public/html/
- Parameters:
- region_set: pr.PyRanges
A PyRanges object containing the regions to query.
- species: str
Genome assembly from where the coordinates come from. Possible values are: ‘mm9’, ‘mm10’, ‘hg19’, ‘hg38’
- rule: str
How to associate genomic regions to genes. Possible options are ‘basalPlusExt’, ‘twoClosest’, ‘oneClosest’. Default: ‘basalPlusExt’
- span: float
Unit: kb, only used when rule is ‘basalPlusExt’. Default: 1000.0
- upstream: float
Unit: kb, only used when rule is ‘basalPlusExt’. Default: 5.0
- downstream: float
Unit: kb, only used when rule is ‘basalPlusExt’. Default: 1.0
- two_distance: float
Unit: kb, only used when rule is ‘twoClosest’. Default: 1000.0
- one_distance: float
Unit: kb, only used when rule is ‘oneClosest’. Default: 1000.0
- include_curated_reg_doms: int
Whether to include curated regulatory domains. Default: 1
- bg_choice: str
A path to the background file or a string. Default: ‘wholeGenome’
- tmp_dir: str
Temporary directory to save region sets as bed files for GREAT. Default: None
- n_cpu: int
Number of cores to use. Default: 1
- **kwargs
Other parameters to pass to ray.init
References
McLean, C. Y., Bristor, D., Hiller, M., Clarke, S. L., Schaar, B. T., Lowe, C. B., … & Bejerano, G. (2010). GREAT improves functional interpretation of cis-regulatory regions. Nature biotechnology, 28(5), 495-501.
Gene activity
- pycisTopic.gene_activity.calculate_distance_join(pr_obj: PyRanges)[source]
A helper function to calculate distances between regions and genes.
- pycisTopic.gene_activity.calculate_distance_with_limits_join(pr_obj: PyRanges)[source]
A helper function to calculate distances between regions and genes, returning information on what is the relative distance to the TSS and end of the gene.
- pycisTopic.gene_activity.extend_pyranges(pr_obj: PyRanges, upstream: int, downstream: int)[source]
A helper function to extend coordinates downstream/upstream in a pyRanges given upstream and downstream distances.
- pycisTopic.gene_activity.extend_pyranges_with_limits(pr_obj: PyRanges)[source]
A helper function to extend coordinates downstream/upstream in a pyRanges with Distance_upstream and Distance_downstream columns.
- pycisTopic.gene_activity.get_gene_activity(imputed_acc_object: CistopicImputedFeatures, pr_annot: PyRanges, chromsizes: PyRanges, predefined_boundaries: PyRanges | None = None, use_gene_boundaries: bool | None = True, upstream: List[int] | None = [1000, 100000], downstream: List[int] | None = [1000, 100000], distance_weight: bool | None = True, decay_rate: float | None = 1, extend_gene_body_upstream: int | None = 5000, extend_gene_body_downstream: int | None = 0, gene_size_weight: bool | None = False, gene_size_scale_factor: int | str | None = 'median', remove_promoters: bool | None = False, scale_factor: float | None = 1, average_scores: bool | None = True, extend_tss: List[int] | None = [10, 10], return_weights: bool | None = True, gini_weight: bool | None = True, project: str | None = 'Gene_activity')[source]
Infer gene activity.
Parameters
- imputed_acc_object: CistopicImputedFeatures
A cisTopic imputation data object.
- pr_annot: pr.PyRanges
A pr.PyRanges containing gene annotation, including Chromosome, Start, End, Strand (as ‘+’ and ‘-’), Gene name and Transcription Start Site.
- chromsizes: pr.PyRanges
A pr.PyRanges containing the size of each chromosome, with ‘Chromosome’, ‘Start’ and ‘End’ columns.
- predefined_boundaries: pr.PyRanges
A pr.PyRanges containing predefined genomic domain boundaries (e.g. TAD boundaries) to use as boundaries. If given, use_gene_boundaries will be ignored.
- use_gene_boundaries: bool, optional
Whether to use the whole search space or stop when encountering another gene. Default: True
- upstream: List, optional
Search space upstream. The minimum (first position) means that even if there is a gene right next to it these bp will be taken. The second position indicates the maximum distance. Default: [1000,100000]
- downstream: List, optional
Search space downstream. The minimum (first position) means that even if there is a gene right next to it these bp will be taken. The second position indicates the maximum distance. Default: [1000,100000]
- distance_weight: bool, optional
Whether to add a distance weight (an exponential function, the weight will decrease with distance). Default: True
- decay_rate: float, optional
Exponent for the distance exponential function (the higher the value, the faster the decrease). Default: 1
- extend_gene_body_upstream: int, optional
Number of bp upstream immune to the distance weight (their value will be maximum for this weight). Default: 5000
- extend_gene_body_downstream: int, optional
Number of bp downstream immune to the distance weight (their value will be maximum for this weight). Default: 0
- gene_size_weight: bool, optional
Whether to add a weight based on the length of the gene. Default: False
- gene_size_scale_factor: str or int, optional
Dividend to calculate the gene size weight. Default is the median value of all genes in the genome.
- remove_promoters: bool, optional
Whether to ignore promoters when computing gene activity. Default: False
- average_scores: bool, optional
Whether to divide by the total number of region assigned to a gene when calculating the gene activity score. Default: True
- scale_factor: float, optional
Value to multiply for the final gene activity matrix. Default: 1
- extend_tss: list, optional
Space around the TSS considered as promoter. Default: [10,10]
- return_weights: bool, optional
Whether to return the final weight values. Default: True
- gini_weight: bool, optional
Whether to add a Gini index weight. The more unique the region is, the higher this weight will be. Default: True
- project: str, optional
Project name for the CistopicImputedFeatures with the gene activity.
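The interplay of distance_weight, decay_rate and the extend_gene_body_* parameters can be sketched as follows. This is an illustration under stated assumptions, not the pycisTopic formula: `distance_weight_sketch` is a hypothetical function, the 5000 bp length scale (`scale_bp`) is an assumption, and the maximum weight is taken to be 1.0.

```python
import math


def distance_weight_sketch(distance_bp: float, decay_rate: float = 1.0,
                           immune_bp: float = 5000.0,
                           scale_bp: float = 5000.0) -> float:
    """Exponentially decaying weight for a region at a given distance."""
    distance = abs(distance_bp)
    # Regions within the immune window always receive the maximum weight.
    if distance <= immune_bp:
        return 1.0
    # Beyond the window the weight decays; a higher decay_rate decays faster.
    return math.exp(-decay_rate * distance / scale_bp)
```

The weight never increases with distance, and raising decay_rate lowers the weight of every region outside the immune window.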
- pycisTopic.gene_activity.reduce_pyranges_b(pr_obj: PyRanges, upstream: int, downstream: int)[source]
A helper function to reduce coordinates downstream/upstream in a pyRanges given upstream and downstream distances.
- pycisTopic.gene_activity.reduce_pyranges_with_limits_b(pr_obj: PyRanges)[source]
A helper function to reduce coordinates downstream/upstream in a pyRanges with Distance_upstream and Distance_downstream columns.
- pycisTopic.gene_activity.region_weights(imputed_acc_object, pr_annot, chromsizes, predefined_boundaries=None, use_gene_boundaries=True, upstream=[1000, 100000], downstream=[1000, 100000], distance_weight=True, decay_rate=1, extend_gene_body_upstream=5000, extend_gene_body_downstream=0, gene_size_weight=True, gene_size_scale_factor='median', remove_promoters=True, extend_tss=[10, 10], gini_weight=True)[source]
Calculate region weights.
Parameters
- imputed_acc_object: CistopicImputedFeatures
A cisTopic imputation data object.
- pr_annot: pr.PyRanges
A pr.PyRanges containing gene annotation, including Chromosome, Start, End, Strand (as ‘+’ and ‘-’), Gene name and Transcription Start Site.
- chromsizes: pr.PyRanges
A pr.PyRanges containing the size of each chromosome, with ‘Chromosome’, ‘Start’ and ‘End’ columns.
- predefined_boundaries: pr.PyRanges
A pr.PyRanges containing predefined genomic domain boundaries (e.g. TAD boundaries) to use as boundaries. If given, use_gene_boundaries will be ignored.
- use_gene_boundaries: bool, optional
Whether to use the whole search space or stop when encountering another gene. Default: True
- upstream: List, optional
Search space upstream. The minimum (first position) means that even if there is a gene right next to it these bp will be taken. The second position indicates the maximum distance. Default: [1000,100000]
- downstream: List, optional
Search space downstream. The minimum (first position) means that even if there is a gene right next to it these bp will be taken. The second position indicates the maximum distance. Default: [1000,100000]
- distance_weight: bool, optional
Whether to add a distance weight (an exponential function, the weight will decrease with distance). Default: True
- decay_rate: float, optional
Exponent for the distance exponential function (the higher the value, the faster the decrease). Default: 1
- extend_gene_body_upstream: int, optional
Number of bp upstream immune to the distance weight (their value will be maximum for this weight). Default: 5000
- extend_gene_body_downstream: int, optional
Number of bp downstream immune to the distance weight (their value will be maximum for this weight). Default: 0
- gene_size_weight: bool, optional
Whether to add a weight based on the length of the gene. Default: True
- gene_size_scale_factor: str or int, optional
Dividend to calculate the gene size weight. Default is the median value of all genes in the genome.
- remove_promoters: bool, optional
Whether to ignore promoters when computing gene activity. Default: True
- extend_tss: list, optional
Space around the TSS considered as promoter. Default: [10,10]
- gini_weight: bool, optional
Whether to add a Gini index weight. The more unique the region is, the higher this weight will be. Default: True
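The idea behind gini_weight is that a region accessible in only a few cells is more specific than one accessible everywhere. A standard Gini index over a region's per-cell accessibility captures this. The sketch below uses the usual rank-based estimator; whether pycisTopic uses exactly this estimator, or rescales the result before using it as a weight, is not stated here. `gini_index` is a hypothetical name.

```python
from typing import Sequence


def gini_index(values: Sequence[float]) -> float:
    """Gini index of non-negative values (0 = uniform, near 1 = concentrated)."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the values sorted in ascending order.
    weighted = sum(rank * value for rank, value in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

A region with equal accessibility in all cells has an index of 0, while a region accessible in one cell out of four has an index of 0.75.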
- pycisTopic.gene_activity.weighted_aggregation(imputed_acc_obj_mtx: csr_matrix, region_weights_df_per_gene: DataFrame, average_scores: bool)[source]
Weighted aggregation of region probabilities into gene activity
Parameters
- imputed_acc_obj_mtx: sparse.csr_matrix
A sparse matrix with regions as rows and cells as columns.
- region_weights_df_per_gene: pd.DataFrame
A data frame with region index (from the sparse matrix) for the gene
- average_scores: bool
Whether final values should be divided by the total number of regions aggregated
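The aggregation for a single gene can be sketched in plain Python. This is an illustration only: `aggregate_gene_activity` is a hypothetical function, a nested list stands in for the sparse regions-by-cells matrix, and a dictionary mapping row index to weight stands in for the per-gene data frame.

```python
from typing import Dict, List


def aggregate_gene_activity(region_by_cell: List[List[float]],
                            region_weights: Dict[int, float],
                            average_scores: bool = True) -> List[float]:
    """Weighted sum of region accessibility per cell for one gene."""
    n_cells = len(region_by_cell[0])
    scores = [0.0] * n_cells
    # Add each linked region's accessibility, scaled by its weight.
    for region_index, weight in region_weights.items():
        for cell, value in enumerate(region_by_cell[region_index]):
            scores[cell] += weight * value
    # Optionally divide by the number of regions that were aggregated.
    if average_scores and region_weights:
        scores = [score / len(region_weights) for score in scores]
    return scores
```

Averaging keeps genes with many linked regions from receiving systematically higher scores than genes with few.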
Label transfer
- pycisTopic.label_transfer.label_transfer(ref_anndata: AnnData, query_anndata: AnnData, labels_to_transfer: List[str], sample_id_col: str | None = 'sample_id', n_cpu: int | None = 1, variable_genes: bool | None = True, methods: List[str] | None = ['ingest', 'harmony', 'bbknn', 'scanorama', 'cca'], pca_ncomps: List[int] | None = [50, 50], n_neighbours: List[int] | None = [10, 10], bbknn_components: int | None = 30, cca_components: int | None = 30, return_label_weights: bool | None = False, **kwargs)[source]
Wrapper function of Ray processes to compute label transfer from single reference to multiple query samples.
Parameters
- ref_anndata: AnnData
An AnnData object containing the reference data set (typically, scRNA-seq data)
- query_anndata: AnnData
An AnnData object containing the query data set, with features matching with the reference data set (typically, gene activities derived from scATAC-seq)
- labels_to_transfer: List
Labels to transfer. They must be included in ref_anndata.obs.
- sample_id_col: str
Name of the column containing the sample ids in the query data set. It must be included in query_anndata.obs. Default: sample_id
- n_cpu: int, optional
Number of cores to use. Default: 1.
- variable_genes: bool, optional
Whether variable genes matching between the two data set should be used (True) or otherwise, all matching genes (False). Default: True
- methods: List, optional
Methods to be used for label transferring. These include: ‘ingest’ [from scanpy], ‘harmony’ [Korsunsky et al, 2019], ‘bbknn’ [Polański et al, 2020], ‘scanorama’ [Hie et al, 2019] and ‘cca’. Except for ingest, these methods return a common coembedding and labels are inferred using the distances between query and reference cells as weights.
- pca_ncomps: List, optional
Number of principal components to use for reference and query, respectively. Default: [50,50]
- n_neighbours: List, optional
Number of neighbours to use for reference and query, respectively. Default: [10,10]
- bbknn_components: int, optional
Number of components to use for the umap for bbknn integration. Default: 30
- cca_components: int, optional
Number of components to use for cca. Default: 30
- return_label_weights: bool, optional
Whether to return the label scores per variable (as a dictionary, except for ingest). Default: False
- **kwargs
Additional parameters for ray.init.
References
Korsunsky, I., Millard, N., Fan, J., Slowikowski, K., Zhang, F., Wei, K., … & Raychaudhuri, S. (2019). Fast, sensitive and accurate integration of single-cell data with Harmony. Nature methods, 16(12), 1289-1296.
Polański, K., Young, M. D., Miao, Z., Meyer, K. B., Teichmann, S. A., & Park, J. E. (2020). BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics, 36(3), 964-965.
Hie, B., Bryson, B., & Berger, B. (2019). Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nature biotechnology, 37(6), 685-691.
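For the coembedding-based methods, labels are inferred from the distances between query and reference cells. A minimal sketch of distance-weighted voting for one query cell follows. It assumes inverse-distance weights, which is one common choice and not necessarily what pycisTopic uses; `transfer_label` is a hypothetical function.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def transfer_label(neighbours: List[Tuple[str, float]],
                   eps: float = 1e-9) -> Tuple[str, Dict[str, float]]:
    """Pick a label from (reference label, distance) pairs of one query cell."""
    if not neighbours:
        raise ValueError("at least one reference neighbour is required")
    scores = defaultdict(float)
    for label, distance in neighbours:
        # Closer reference cells contribute more to their label's score.
        scores[label] += 1.0 / (distance + eps)
    total = sum(scores.values())
    weights = {label: score / total for label, score in scores.items()}
    return max(weights, key=weights.get), weights
```

The normalised weights play the role of the per-label scores that return_label_weights exposes.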
Utils
- pycisTopic.utils.get_tss_matrix(fragments, flank_window, tss_space_annotation)[source]
Get TSS matrix
- pycisTopic.utils.normalise_filepath(path: str | Path, check_not_directory: bool = True) str [source]
Create a string path, expanding the home directory if present.
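The described behaviour can be approximated with the standard library. This is a sketch, not the pycisTopic source: `normalise_filepath_sketch` is a hypothetical name, and raising `IsADirectoryError` when check_not_directory is set is an assumption based on the parameter name.

```python
import os
from pathlib import Path
from typing import Union


def normalise_filepath_sketch(path: Union[str, Path],
                              check_not_directory: bool = True) -> str:
    """Return the path as a string with a leading '~' expanded."""
    # expanduser leaves paths without a leading '~' unchanged.
    path = os.path.expanduser(str(path))
    if check_not_directory and os.path.isdir(path):
        raise IsADirectoryError(f"Expected a file path, got a directory: {path}")
    return path
```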
- pycisTopic.utils.read_fragments_from_file(fragments_bed_filename, use_polars: bool = True) PyRanges [source]
Read fragments BED file to PyRanges object.
- Parameters:
- fragments_bed_filename: Fragments BED filename.
- use_polars: Use polars instead of pandas for reading the fragments BED file.
- Returns:
- PyRanges object of fragments.