biseqt.blot module¶
-
biseqt.blot.
find_peaks
(xs, rs, threshold)[source]¶ Finds maximal (disjoint) peak regions in a sequence of real numbers. Each value that is at least as large as the threshold constitutes the center of a peak with radius according to its position. In the output all overlapping peaks are merged into maximal peaks.
Parameters: - xs – the 1D data sequence of interest
- rs – the radii for peaks defined at every point exceeding threshold, could be a 1D sequence or a fixed number,
- threshold – cutoff value to compare with
xs
values.
Returns: A least of “peaks”, each a tuple of
(left, right)
coordinates inxs
. Returned peaks are guaranteed to be disjoint.Return type: list (tuple)
-
biseqt.blot.
wall_to_wall_distance
(len0, len1, diag)[source]¶ Wall to wall distance for a diagonal position
\[L = \min(l_0 - d, l_1) + \min(d, 0)\]with \(l_0,l_1\) being the length of the sequences (i.e
len0
andlen1
arguments) and \(d\) the starting diagonal (i.ediag
argument).
-
biseqt.blot.
expected_overlap_len
(len0, len1, diag, gap_prob)[source]¶ Calculates the expected length of an overlap alignment given its starting coordinates:
\[K = \left(\frac{2}{2 - g}\right) L\]where \(L\) is the
wall_to_wall_distance()
.Parameters: - len0 (int) – Length of the 1st sequence.
- len1 (int) – Length of the 2nd sequence.
- diag (int) – Starting diagonal of alignments to consider.
- gap_prob (float) – Probability of indels occuring at any position.
Returns: Expected length of an overlap alignment.
Return type: int
-
biseqt.blot.
band_radius
(expected_len, gap_prob, sensitivity)[source]¶ Calculates the smallest band radius in the dynamic programming table such that an alignment of given expected length, with the given gap probability, stays entirely within the diagonal band with probability given as sensitivity. This is given by:
\[r = \mathrm{erf}^{-1}\left(1-\epsilon\right) \sqrt{2gK}\]where \(g\) is the gap probability, \(1-\epsilon\) is the desired sensitivity, and \(K\) is the given expected length.
Parameters: - expected_len (int) – minimum expected length of similar region.
- gap_prob (float) – Probability of indels occuring at any position.
- sensitivity (float) – The probability that an alignment with given gap probability remains entirely within the band.
Returns: The smallest band radius guaranteeing the required sensitivity.
Return type: int
-
biseqt.blot.
band_radii
(expected_lens, gap_prob, sensitivity)[source]¶ Same as
band_radius()
but for bulk calculations (e.g. finding overlaps).Parameters: - expected_lens (list) – List of integers for which to calculate band radii.
- gap_prob (float) – As in
band_radius()
. - sensitivity (float) – As in
band_radius()
.
-
biseqt.blot.
H0_moments
(alphabet_len, wordlen, area)[source]¶ The mean and standrad deviation of the limiting normal distribution under the \(H_0\) (unrelated) model given by:
\[\begin{split}\begin{aligned} \mu_0 & = Ap^w \\ \sigma_0^2 & = A\left[(1 - p^w)\left(p^w + \frac{2p^{w+1}}{1 - p}\right) - 2wp^{2w}\right] \end{aligned}\end{split}\]where \(w\) is the word length, \(A\) is the area of the ROI, and \(p = \frac{1}{|\Sigma|}\) with \(|\Sigma|\) being the alphabet length.
-
biseqt.blot.
H1_moments
(alphabet_len, wordlen, area, seglen, p_match)[source]¶ The mean and standrad deviation of the limiting normal distribution under the \(H_1\) (related) model given by:
\[\begin{split}\begin{aligned} \mu_1 & = \mu_0 + Kp^w \\ \sigma_1^2 & = \sigma_0^2 + 2K\left[(1 - p^w)\left(p^w + \frac{2p^{w+1}}{1 - p}\right) - 2wp^{2w}\right] \end{aligned}\end{split}\]where \(w\) is the word length, \(K\) is the similarity length, and \(p\) is the match probability.
-
class
biseqt.blot.
WordBlot
(S, T, g_max=None, sensitivity=None, **kw)[source]¶ Bases:
biseqt.seeds.SeedIndex
A similarity finder based on m-dependent CLT statistics.
-
g_max
¶ float – Upper bound for indel probabilities in mutation model.
-
sensitivity
¶ float – Desired sensitivity of bands.
-
score_num_seeds
(**kw)[source]¶ Calculates our key central statistics based on m-dependent CLT. For a given observation of number of seeds in a region of interest (ROI), calculates the z-core against the \(H_0\) (unrelated) and \(H_1\) (related) models. In either case, the distribution of number of seeds, being a sum of m-dependent identically distributed random variables, is asymptotically normal with parameters given by
H0_moments()
,H1_moments()
.Keyword Arguments: - num_seeds (int) – Number of observed seed in the ROI.
- area (int|float) – Area of the ROI.
- seglen (int) – Similar segment length in H1 model.
- p_match (float) – Expected match probability at any given position.
Returns: z-scores in H0 and H1 models
Return type: tuple (float)
-
band_radius
(K)[source]¶ Wraps
band_radius()
with our mutation parameters and sequence lengths.Parameters: K (int) – expected alignment length of interest. Returns: radius of band for desired sensitivity
.Return type: int
-
segment_dims
(d_band=None, a_band=None)[source]¶ Calculate the edit path length \(K\) and the area of the given diagonal/antiodiagonal segment.
Keyword Arguments: - d_band (tuple) – lower and upper diagonals limiting the segment.
- a_band (tuple) – lower and upper antidiagonal positions limiting the segment.
Returns: edit path length and segment area.
Return type: tuple
-
estimate_match_probability
(num_seeds, d_band=None, a_band=None)[source]¶ Estimate the edit path match probability given the provided observed number of seeds \(n\) in given sigment:
\[\hat{p} = \left(\frac{n - Ap_\circ^w}{K}\right)^{\frac{1}{w}}\]where \(n, A, p_\circ, K, w\) are the number of seeds, segment area, null probability of matching nucleotides, segment length, and the word length, respectively.
Parameters: num_seeds (int) – number of seeds observed in segment. - Keywords Args:
- d_band (tuple):
- lower and upper diagonals limiting the segment.
- a_band (tuple):
- lower and upper antidiagonal positions limiting the segment.
Returns: estimated match probability Return type: float
-
classmethod
find_all_neighbors
(seeds, d_radius, a_radius)[source]¶ For each seed finds all seeds in its neighborhood defined by:
\[U_{(d, a)} = \{(d', a'): |d - d'| < r_d, |a - a'| < r_a \}\]This is done using a Quad-Tree in \(O(m lg m)\) time where m is the number of seeds.
Returns: - tuples
((d, a), neighs)
whereneighs
is a list of - neighbor indices.
Return type: list - tuples
-
score_seeds
(K)[source]¶ Find the neighbors of each seed in the sense of
find_all_neighbors()
and estimates the match probability of the segment centered at the coordinates of each seed.Parameters: K (int) – the similarity legnth of interest that dictates neighborhood shapes and estimated match probabilities. Returns: List of dictionaries with keys: seed
(coordinates of exactly matching kmer in diagonal coordinates),neighs
(list of indices of neighbors of this seed in the appropriate diagonal strip),p
the estimated match probability of a segment centered at the seed.Return type: list(dict)
-
similar_segments
(K_min, p_min, at_least_one=False)[source]¶ Find all maximal local similarities of given minium length and match probability. Additionally for each segment, the match probability is estimated and H0/H1 scores are calculated.
Parameters: - K_min (int) – minimum required length of similarity.
- p_min (float) – Minimum required match probability at each position.
- score (bool) – Whether to score the segment against H1 by counting seeds inside.
Yields: dict – dictionary with keys:
segment
(coordinates of similar region in diagonal coordinates((d_min, d_max), (a_min, a_max))
)),p
the estimated match probability, andscore
the H1 z-score if keyword argumentscore
is true.
-
-
class
biseqt.blot.
WordBlotOverlap
(S, T, g_max=None, sensitivity=None, **kw)[source]¶ Bases:
biseqt.blot.WordBlot
A specialized version of WordBlot for detecting overlap (suffix-prefix) similarities between sequences (e.g. in a sequencing context).
-
score_seeds
()[source]¶ For each seed finds all seeds in its neighborhood defined by:
\[U_{(d, a)} = \{(d', a'): |d - d'| < r_d(d), |a - a'| < r_a(d) \}\]in such a way that each seed’s neighborhood is the entire diagonal band containing it. Each seed recieves an estimated match probability for a similarity on its diagonal band and a z-score with respect to the H1 model.
Returns: - dicts with keys
seed
(diagonal coordinates of seed), p
(estimated match probability of overlap alignment),L
(the length of an alignment through the seed’s band according to sequence lengths),score
(the z-score with respect to H1), andr
(band radius at the seed’s coordinates).
Return type: list - dicts with keys
-
highest_scoring_overlap_band
()[source]¶ Finds the highest scoring diagonal band according to probabiliy estimations of
score_seeds()
.Returns: - with same keys
p, score, d_band
consistent with the - output of
score_seeds()
for the highest scoring diagonal band.
Return type: dict - with same keys
-
-
class
biseqt.blot.
WordBlotOverlapRef
(ref, allowed_memory=1, **kw)[source]¶ Bases:
biseqt.blot.WordBlotOverlap
An in-memory, SQL-free version of
WordBlotOverlap
for faster comparisons. Due to implementation details the word length is constrained above by the available memory.-
allowed_memory
¶ int|float – allocatable memory in GB for kmers index.
-
seeds
(exclude_trivial=True)[source]¶ Yields all seeds, optionally those within a diagonal band.
Keyword Arguments: d_band (tuple|None) – If specified a (d_min, d_max)
tuple restricting the seed count to a diagonal band.Yields: tuple – seeds coordinates \((i, j)\).
-
seed_count
(d_band=None, a_band=None)[source]¶ Counts the number of seeds either in the whole table or in the specified diagonal band.
Parameters: - d_band (tuple|None) – If specified a \((d_{\min}, d_{\max})\) tuple restricting the seed count to a diagonal band.
- a_band (tuple|None) – If specified a \((a_{\min}, a_{\max})\) tuple restricting the seed count to an antidiagonal band.
Returns: Number of seeds found in the entire table or in the specified diagonal band.
Return type: int
-
-
class
biseqt.blot.
WordBlotLocalRef
(ref, allowed_memory=1, **kw)[source]¶ Bases:
biseqt.blot.WordBlot
An in-memory, SQL-free version of
WordBlot
for faster comparisons. Due to implementation details the word length is constrained above by the available memory.-
allowed_memory
¶ int|float – allocatable memory in GB for kmers index.
-
seeds
(exclude_trivial=True)[source]¶ Yields all seeds, optionally those within a diagonal band.
Keyword Arguments: d_band (tuple|None) – If specified a (d_min, d_max)
tuple restricting the seed count to a diagonal band.Yields: tuple – seeds coordinates \((i, j)\).
-
seed_count
(d_band=None, a_band=None)[source]¶ Counts the number of seeds either in the whole table or in the specified diagonal band.
Parameters: - d_band (tuple|None) – If specified a \((d_{\min}, d_{\max})\) tuple restricting the seed count to a diagonal band.
- a_band (tuple|None) – If specified a \((a_{\min}, a_{\max})\) tuple restricting the seed count to an antidiagonal band.
Returns: Number of seeds found in the entire table or in the specified diagonal band.
Return type: int
-
similar_segments
(seq, K_min, p_min, at_least_one=False)[source]¶ Find all maximal local similarities of given minium length and match probability. Additionally for each segment, the match probability is estimated and H0/H1 scores are calculated.
Parameters: - K_min (int) – minimum required length of similarity.
- p_min (float) – Minimum required match probability at each position.
- score (bool) – Whether to score the segment against H1 by counting seeds inside.
Yields: dict – dictionary with keys:
segment
(coordinates of similar region in diagonal coordinates((d_min, d_max), (a_min, a_max))
)),p
the estimated match probability, andscore
the H1 z-score if keyword argumentscore
is true.
-
-
class
biseqt.blot.
WordBlotMultiple
(*seqs, **kw)[source]¶ Bases:
biseqt.seeds.SeedIndexMultiple
A multiple sequence similarity finder based on m-dependent CLT statistics.
-
g_max
¶ float – Upper bound for indel probabilities in mutation model.
-
sensitivity
¶ float – Desired sensitivity of bands.
-
band_radius
(K)[source]¶ Wraps
band_radius()
with our mutation parameters and sequence lengths.Parameters: K (int) – expected alignment length of interest. Returns: radius of band for desired sensitivity
.Return type: int
-
wall_to_wall_distance
(ds)[source]¶ Wall to wall distance \(L\) for a diagonal position \(d_1, \ldots, d_{n-1}\).
The distance \(L\) is the largest possible value of the antidiagonal coordinate once all diagonal coordinates are fixed.
-
estimate_match_probability
(num_seeds, K, volume)[source]¶ Estimate the edit path match probability given the provided observed number of seeds in given sigment.
\[\hat{p} = \left( \frac{n - Vp_\circ^{w(N-1)}}{K} \right)^{\frac{1}{w(N-1)}}\]where \(n, N, V, p_\circ, w\) are the number of seeds, number of sequences, segment n-d volume, null probability of matching nucleotides, and word length, respectively.
Parameters: - num_seeds (int|float) – number of seeds observed in segment.
- K (int|float) – the expected length of the alignment.
- volume (int|float) – the n-d volume of the region of interest.
Returns: estimated match probability
Return type: float
-
score_seeds
(K)[source]¶ Find the neighbors of each seed in the sense of
find_all_neighbors()
and estimates the match probability of the segment centered at the coordinates of each seed.Parameters: K (int) – the similarity legnth of interest that dictates neighborhood shapes and estimated match probabilities. Returns: - list of tuples
((d, a), neighs, p)
where(d, a)
- is the diagonal coordinates of the seed,
neighs
is a list of integer indices of seeds in its diagonal/antidiagonal neighborhood, andp
is the estimated match probability of a segment of lengthK
centered at the seed.
Return type: list - list of tuples
-
find_all_neighbors
(d_radius, a_radius)[source]¶ For each seed finds all seeds in its neighborhood defined by:
\[U_{(d_1,\ldots,d_{n-1}, a)} = \{(d_1', \ldots, d_{n-1}', a'): |d_k - d_k'| < r_d, |a - a'| < r_a \}\]This is done using a kD-Tree in \(O(nm lg m)\) time where m is the number of seeds and n is the number of sequences.
Returns: - tuples
((ds, a), neighs)
whereneighs
is a list of - neighbor indices.
Return type: list - tuples
-
score_num_seeds
(num_seeds, **kw)[source]¶ The exrtension of
WordBlot.score_num_seeds()
to multiple sequence comparisons.Keyword Arguments: - num_seeds (int) – Number of observed seed in the ROI.
- volume (int|float) – Area of the ROI.
- seglen (int) – Similar segment length in H1 model.
- p_match (float) – Expected match probability at any given position.
Returns: z-scores in H0 and H1 models
Return type: tuple (float)
-
similar_segments
(K_min, p_min, at_least_one=False)[source]¶ Find all maximal local similarities of given minium length and match probability. Additionally for each segment, the match probability is estimated and H0/H1 scores are calculated.
Parameters: - K_min (int) – minimum required length of similarity.
- p_min (float) – Minimum required match probability at each position.
- score (bool) – Whether to score the segment against H1 by counting seeds inside.
Yields: dict – dictionary with keys:
segment
(coordinates of similar region in diagonal coordinates,p
the estimated match probability, andscore
the H1 z-score if keyword argumentscore
is true.
-
-
class
biseqt.blot.
WordBlotMultipleFast
(*seqs, **kw)[source]¶ Bases:
biseqt.blot.WordBlotMultiple
An in-memory, SQL-free version of
WordBlotOverlapMultiple
for faster comparisons. Due to implementation details the word length is constrained above by the available memory.-
allowed_memory
¶ int|float – allocatable memory in GB for kmers index, default is 1.
-
seeds
()[source]¶ Yields all seeds in diagonal coordinates.
Yields: tuple – seeds coordinates \((d_1, \ldots, d_{n-1}, a)\).
-
seed_count
(ds_band=None, a_band=None)[source]¶ Counts the number of seeds either in the whole table or in the specified diagonal band.
Parameters: - ds_band (List|None) – If specified a list of \(n-1\) tuples \((d_{k,\min}, d_{k,\max})\) restricting the seed count to a diagonal hyper-band.
- a_band (tuple|None) – If specified a \((a_{\min}, a_{\max})\) tuple restricting the seed count to an antidiagonal band.
Returns: Number of seeds found in the entire table or in the specified diagonal band.
Return type: int
-