biseqt.blot module

biseqt.blot.find_peaks(xs, rs, threshold)[source]

Finds maximal (disjoint) peak regions in a sequence of real numbers. Each value that is at least as large as the threshold constitutes the center of a peak with radius according to its position. In the output all overlapping peaks are merged into maximal peaks.

Parameters:
  • xs – the 1D data sequence of interest
  • rs – the radii for peaks defined at every point exceeding threshold, could be a 1D sequence or a fixed number,
  • threshold – cutoff value to compare with xs values.
Returns:

A least of “peaks”, each a tuple of (left, right) coordinates in xs. Returned peaks are guaranteed to be disjoint.

Return type:

list (tuple)

biseqt.blot.wall_to_wall_distance(len0, len1, diag)[source]

Wall to wall distance for a diagonal position

\[L = \min(l_0 - d, l_1) + \min(d, 0)\]

with \(l_0,l_1\) being the length of the sequences (i.e len0 and len1 arguments) and \(d\) the starting diagonal (i.e diag argument).

biseqt.blot.expected_overlap_len(len0, len1, diag, gap_prob)[source]

Calculates the expected length of an overlap alignment given its starting coordinates:

\[K = \left(\frac{2}{2 - g}\right) L\]

where \(L\) is the wall_to_wall_distance().

Parameters:
  • len0 (int) – Length of the 1st sequence.
  • len1 (int) – Length of the 2nd sequence.
  • diag (int) – Starting diagonal of alignments to consider.
  • gap_prob (float) – Probability of indels occuring at any position.
Returns:

Expected length of an overlap alignment.

Return type:

int

biseqt.blot.band_radius(expected_len, gap_prob, sensitivity)[source]

Calculates the smallest band radius in the dynamic programming table such that an alignment of given expected length, with the given gap probability, stays entirely within the diagonal band with probability given as sensitivity. This is given by:

\[r = \mathrm{erf}^{-1}\left(1-\epsilon\right) \sqrt{2gK}\]

where \(g\) is the gap probability, \(1-\epsilon\) is the desired sensitivity, and \(K\) is the given expected length.

Parameters:
  • expected_len (int) – minimum expected length of similar region.
  • gap_prob (float) – Probability of indels occuring at any position.
  • sensitivity (float) – The probability that an alignment with given gap probability remains entirely within the band.
Returns:

The smallest band radius guaranteeing the required sensitivity.

Return type:

int

biseqt.blot.band_radii(expected_lens, gap_prob, sensitivity)[source]

Same as band_radius() but for bulk calculations (e.g. finding overlaps).

Parameters:
  • expected_lens (list) – List of integers for which to calculate band radii.
  • gap_prob (float) – As in band_radius().
  • sensitivity (float) – As in band_radius().
biseqt.blot.H0_moments(alphabet_len, wordlen, area)[source]

The mean and standrad deviation of the limiting normal distribution under the \(H_0\) (unrelated) model given by:

\[\begin{split}\begin{aligned} \mu_0 & = Ap^w \\ \sigma_0^2 & = A\left[(1 - p^w)\left(p^w + \frac{2p^{w+1}}{1 - p}\right) - 2wp^{2w}\right] \end{aligned}\end{split}\]

where \(w\) is the word length, \(A\) is the area of the ROI, and \(p = \frac{1}{|\Sigma|}\) with \(|\Sigma|\) being the alphabet length.

biseqt.blot.H1_moments(alphabet_len, wordlen, area, seglen, p_match)[source]

The mean and standrad deviation of the limiting normal distribution under the \(H_1\) (related) model given by:

\[\begin{split}\begin{aligned} \mu_1 & = \mu_0 + Kp^w \\ \sigma_1^2 & = \sigma_0^2 + 2K\left[(1 - p^w)\left(p^w + \frac{2p^{w+1}}{1 - p}\right) - 2wp^{2w}\right] \end{aligned}\end{split}\]

where \(w\) is the word length, \(K\) is the similarity length, and \(p\) is the match probability.

class biseqt.blot.WordBlot(S, T, g_max=None, sensitivity=None, **kw)[source]

Bases: biseqt.seeds.SeedIndex

A similarity finder based on m-dependent CLT statistics.

g_max

float – Upper bound for indel probabilities in mutation model.

sensitivity

float – Desired sensitivity of bands.

score_num_seeds(**kw)[source]

Calculates our key central statistics based on m-dependent CLT. For a given observation of number of seeds in a region of interest (ROI), calculates the z-core against the \(H_0\) (unrelated) and \(H_1\) (related) models. In either case, the distribution of number of seeds, being a sum of m-dependent identically distributed random variables, is asymptotically normal with parameters given by H0_moments(), H1_moments().

Keyword Arguments:
 
  • num_seeds (int) – Number of observed seed in the ROI.
  • area (int|float) – Area of the ROI.
  • seglen (int) – Similar segment length in H1 model.
  • p_match (float) – Expected match probability at any given position.
Returns:

z-scores in H0 and H1 models

Return type:

tuple (float)

band_radius(K)[source]

Wraps band_radius() with our mutation parameters and sequence lengths.

Parameters:K (int) – expected alignment length of interest.
Returns:radius of band for desired sensitivity.
Return type:int
segment_dims(d_band=None, a_band=None)[source]

Calculate the edit path length \(K\) and the area of the given diagonal/antiodiagonal segment.

Keyword Arguments:
 
  • d_band (tuple) – lower and upper diagonals limiting the segment.
  • a_band (tuple) – lower and upper antidiagonal positions limiting the segment.
Returns:

edit path length and segment area.

Return type:

tuple

estimate_match_probability(num_seeds, d_band=None, a_band=None)[source]

Estimate the edit path match probability given the provided observed number of seeds \(n\) in given sigment:

\[\hat{p} = \left(\frac{n - Ap_\circ^w}{K}\right)^{\frac{1}{w}}\]

where \(n, A, p_\circ, K, w\) are the number of seeds, segment area, null probability of matching nucleotides, segment length, and the word length, respectively.

Parameters:num_seeds (int) – number of seeds observed in segment.
Keywords Args:
d_band (tuple):
lower and upper diagonals limiting the segment.
a_band (tuple):
lower and upper antidiagonal positions limiting the segment.
Returns:estimated match probability
Return type:float
classmethod find_all_neighbors(seeds, d_radius, a_radius)[source]

For each seed finds all seeds in its neighborhood defined by:

\[U_{(d, a)} = \{(d', a'): |d - d'| < r_d, |a - a'| < r_a \}\]

This is done using a Quad-Tree in \(O(m lg m)\) time where m is the number of seeds.

Returns:
tuples ((d, a), neighs) where neighs is a list of
neighbor indices.
Return type:list
score_seeds(K)[source]

Find the neighbors of each seed in the sense of find_all_neighbors() and estimates the match probability of the segment centered at the coordinates of each seed.

Parameters:K (int) – the similarity legnth of interest that dictates neighborhood shapes and estimated match probabilities.
Returns:List of dictionaries with keys: seed (coordinates of exactly matching kmer in diagonal coordinates), neighs (list of indices of neighbors of this seed in the appropriate diagonal strip), p the estimated match probability of a segment centered at the seed.
Return type:list(dict)
similar_segments(K_min, p_min, at_least_one=False)[source]

Find all maximal local similarities of given minium length and match probability. Additionally for each segment, the match probability is estimated and H0/H1 scores are calculated.

Parameters:
  • K_min (int) – minimum required length of similarity.
  • p_min (float) – Minimum required match probability at each position.
  • score (bool) – Whether to score the segment against H1 by counting seeds inside.
Yields:

dict – dictionary with keys: segment (coordinates of similar region in diagonal coordinates ((d_min, d_max), (a_min, a_max)))), p the estimated match probability, and score the H1 z-score if keyword argument score is true.

class biseqt.blot.WordBlotOverlap(S, T, g_max=None, sensitivity=None, **kw)[source]

Bases: biseqt.blot.WordBlot

A specialized version of WordBlot for detecting overlap (suffix-prefix) similarities between sequences (e.g. in a sequencing context).

score_seeds()[source]

For each seed finds all seeds in its neighborhood defined by:

\[U_{(d, a)} = \{(d', a'): |d - d'| < r_d(d), |a - a'| < r_a(d) \}\]

in such a way that each seed’s neighborhood is the entire diagonal band containing it. Each seed recieves an estimated match probability for a similarity on its diagonal band and a z-score with respect to the H1 model.

Returns:
dicts with keys seed (diagonal coordinates of seed),
p (estimated match probability of overlap alignment), L (the length of an alignment through the seed’s band according to sequence lengths), score (the z-score with respect to H1), and r (band radius at the seed’s coordinates).
Return type:list
highest_scoring_overlap_band()[source]

Finds the highest scoring diagonal band according to probabiliy estimations of score_seeds().

Returns:
with same keys p, score, d_band consistent with the
output of score_seeds() for the highest scoring diagonal band.
Return type:dict
class biseqt.blot.WordBlotOverlapRef(ref, allowed_memory=1, **kw)[source]

Bases: biseqt.blot.WordBlotOverlap

An in-memory, SQL-free version of WordBlotOverlap for faster comparisons. Due to implementation details the word length is constrained above by the available memory.

allowed_memory

int|float – allocatable memory in GB for kmers index.

seeds(exclude_trivial=True)[source]

Yields all seeds, optionally those within a diagonal band.

Keyword Arguments:
 d_band (tuple|None) – If specified a (d_min, d_max) tuple restricting the seed count to a diagonal band.
Yields:tuple – seeds coordinates \((i, j)\).
seed_count(d_band=None, a_band=None)[source]

Counts the number of seeds either in the whole table or in the specified diagonal band.

Parameters:
  • d_band (tuple|None) – If specified a \((d_{\min}, d_{\max})\) tuple restricting the seed count to a diagonal band.
  • a_band (tuple|None) – If specified a \((a_{\min}, a_{\max})\) tuple restricting the seed count to an antidiagonal band.
Returns:

Number of seeds found in the entire table or in the specified diagonal band.

Return type:

int

score_seeds_(seq)[source]
highest_scoring_overlap_band(seq)[source]

Finds the highest scoring diagonal band according to probabiliy estimations of score_seeds().

Returns:
with same keys p, score, d_band consistent with the
output of score_seeds() for the highest scoring diagonal band.
Return type:dict
class biseqt.blot.WordBlotLocalRef(ref, allowed_memory=1, **kw)[source]

Bases: biseqt.blot.WordBlot

An in-memory, SQL-free version of WordBlot for faster comparisons. Due to implementation details the word length is constrained above by the available memory.

allowed_memory

int|float – allocatable memory in GB for kmers index.

seeds(exclude_trivial=True)[source]

Yields all seeds, optionally those within a diagonal band.

Keyword Arguments:
 d_band (tuple|None) – If specified a (d_min, d_max) tuple restricting the seed count to a diagonal band.
Yields:tuple – seeds coordinates \((i, j)\).
seed_count(d_band=None, a_band=None)[source]

Counts the number of seeds either in the whole table or in the specified diagonal band.

Parameters:
  • d_band (tuple|None) – If specified a \((d_{\min}, d_{\max})\) tuple restricting the seed count to a diagonal band.
  • a_band (tuple|None) – If specified a \((a_{\min}, a_{\max})\) tuple restricting the seed count to an antidiagonal band.
Returns:

Number of seeds found in the entire table or in the specified diagonal band.

Return type:

int

score_seeds_(seq, K)[source]
similar_segments(seq, K_min, p_min, at_least_one=False)[source]

Find all maximal local similarities of given minium length and match probability. Additionally for each segment, the match probability is estimated and H0/H1 scores are calculated.

Parameters:
  • K_min (int) – minimum required length of similarity.
  • p_min (float) – Minimum required match probability at each position.
  • score (bool) – Whether to score the segment against H1 by counting seeds inside.
Yields:

dict – dictionary with keys: segment (coordinates of similar region in diagonal coordinates ((d_min, d_max), (a_min, a_max)))), p the estimated match probability, and score the H1 z-score if keyword argument score is true.

class biseqt.blot.WordBlotMultiple(*seqs, **kw)[source]

Bases: biseqt.seeds.SeedIndexMultiple

A multiple sequence similarity finder based on m-dependent CLT statistics.

g_max

float – Upper bound for indel probabilities in mutation model.

sensitivity

float – Desired sensitivity of bands.

band_radius(K)[source]

Wraps band_radius() with our mutation parameters and sequence lengths.

Parameters:K (int) – expected alignment length of interest.
Returns:radius of band for desired sensitivity.
Return type:int
wall_to_wall_distance(ds)[source]

Wall to wall distance \(L\) for a diagonal position \(d_1, \ldots, d_{n-1}\).

The distance \(L\) is the largest possible value of the antidiagonal coordinate once all diagonal coordinates are fixed.

estimate_match_probability(num_seeds, K, volume)[source]

Estimate the edit path match probability given the provided observed number of seeds in given sigment.

\[\hat{p} = \left( \frac{n - Vp_\circ^{w(N-1)}}{K} \right)^{\frac{1}{w(N-1)}}\]

where \(n, N, V, p_\circ, w\) are the number of seeds, number of sequences, segment n-d volume, null probability of matching nucleotides, and word length, respectively.

Parameters:
  • num_seeds (int|float) – number of seeds observed in segment.
  • K (int|float) – the expected length of the alignment.
  • volume (int|float) – the n-d volume of the region of interest.
Returns:

estimated match probability

Return type:

float

score_seeds(K)[source]

Find the neighbors of each seed in the sense of find_all_neighbors() and estimates the match probability of the segment centered at the coordinates of each seed.

Parameters:K (int) – the similarity legnth of interest that dictates neighborhood shapes and estimated match probabilities.
Returns:
list of tuples ((d, a), neighs, p) where (d, a)
is the diagonal coordinates of the seed, neighs is a list of integer indices of seeds in its diagonal/antidiagonal neighborhood, and p is the estimated match probability of a segment of length K centered at the seed.
Return type:list
find_all_neighbors(d_radius, a_radius)[source]

For each seed finds all seeds in its neighborhood defined by:

\[U_{(d_1,\ldots,d_{n-1}, a)} = \{(d_1', \ldots, d_{n-1}', a'): |d_k - d_k'| < r_d, |a - a'| < r_a \}\]

This is done using a kD-Tree in \(O(nm lg m)\) time where m is the number of seeds and n is the number of sequences.

Returns:
tuples ((ds, a), neighs) where neighs is a list of
neighbor indices.
Return type:list
score_num_seeds(num_seeds, **kw)[source]

The exrtension of WordBlot.score_num_seeds() to multiple sequence comparisons.

Keyword Arguments:
 
  • num_seeds (int) – Number of observed seed in the ROI.
  • volume (int|float) – Area of the ROI.
  • seglen (int) – Similar segment length in H1 model.
  • p_match (float) – Expected match probability at any given position.
Returns:

z-scores in H0 and H1 models

Return type:

tuple (float)

similar_segments(K_min, p_min, at_least_one=False)[source]

Find all maximal local similarities of given minium length and match probability. Additionally for each segment, the match probability is estimated and H0/H1 scores are calculated.

Parameters:
  • K_min (int) – minimum required length of similarity.
  • p_min (float) – Minimum required match probability at each position.
  • score (bool) – Whether to score the segment against H1 by counting seeds inside.
Yields:

dict – dictionary with keys: segment (coordinates of similar region in diagonal coordinates, p the estimated match probability, and score the H1 z-score if keyword argument score is true.

class biseqt.blot.WordBlotMultipleFast(*seqs, **kw)[source]

Bases: biseqt.blot.WordBlotMultiple

An in-memory, SQL-free version of WordBlotOverlapMultiple for faster comparisons. Due to implementation details the word length is constrained above by the available memory.

allowed_memory

int|float – allocatable memory in GB for kmers index, default is 1.

seeds()[source]

Yields all seeds in diagonal coordinates.

Yields:tuple – seeds coordinates \((d_1, \ldots, d_{n-1}, a)\).
seed_count(ds_band=None, a_band=None)[source]

Counts the number of seeds either in the whole table or in the specified diagonal band.

Parameters:
  • ds_band (List|None) – If specified a list of \(n-1\) tuples \((d_{k,\min}, d_{k,\max})\) restricting the seed count to a diagonal hyper-band.
  • a_band (tuple|None) – If specified a \((a_{\min}, a_{\max})\) tuple restricting the seed count to an antidiagonal band.
Returns:

Number of seeds found in the entire table or in the specified diagonal band.

Return type:

int