biseqt.stochastics module

biseqt.stochastics.rand_seq(alphabet, size, p=None)[source]

Generate a random Sequence of the given length from the given Alphabet.

Keyword Arguments:
 
  • alphabet (Alphabet) –
  • size (int) – The length of the randomly generated sequence.
  • p (list) – The discrete probability distribution for letters of alphabet to appear at each position; default is None in which case letters are chosen uniformly.
biseqt.stochastics.rand_read(seq, len_mean=None, len_sd=1, expected_coverage=None, num=None)[source]

Generates a random collection of lossless reads from the given sequence. Each read is generated by taking a substring with Gaussian length and uniformly chosen starting point. The number of reads is either determined as an argument or calculated based on the expected coverage:

\[num = \frac{\mathbb{E}[coverage]\times |seq|}{\mathbb{E}[len]}\]
Parameters:

seq (sequence.Sequence) – The original sequence to be sampled.

Keyword Arguments:
 
  • num (int) – Number of reads to generate; default is None in which case expected_coverage takes precedence. If both are None a single read is returned.
  • expected_coverage (float) – The expected number of times each letter in the sequence ought to appear in the entire read collection; default is None in which case num takes precedence. If both are None a single read is returned.
  • len_mean (float) – The mean of the normal distribution of read lengths. Default is 100.
  • len_sd (float) – The standard deviation of the normal distribution or read lengths. Default is 1.
Yields:

tuple – The read Sequence and its starting position int.

class biseqt.stochastics.MutationProcess(alphabet, subst_probs=None, ge_prob=0, go_prob=0, insert_dist=None)[source]

Bases: object

alphabet

Alphabet – The sequence.Alphabet this mutation process operates on.

subst_probs

list – The probability matrix for the distribution of substitutions such that subst_probs[i][j] is the probability of the i-th letter being replaced by the j-th letter at a non-indel edit operation. Alternatively, a single number can be provided in which case it is treated as the probability of any substitution to occur and all letters and all substitutions are considered equally. For instance, if the single number 0.2 is given and the alphabet has 3 letters the probability matrix will be:

\[\begin{split}\begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.1 & 0.8 & 0.1 \\ 0.1 & 0.1 & 0.8 \end{pmatrix}\end{split}\]
go_prob

float – probability of a single indel following a subtitution, a match, or an indel of a different kind (i.e from insertion to deletion); default is 0 (cf. ge_prob).

ge_prob

float – The probability of an open gap to be extended by an indel of the same kind; default is 0. For consistency, it is always required that the gap-extend probability to be at least as large as the gap-open probability with equality implying a linear model which translates to a linear gap penalty scheme in log_odds_scores().

insert_dist

list – the probability distribution for inserted content; default is None which is taken to mean uniform.

mutate(seq)[source]

Returns a mutant of the given sequence by copying it while at each position:

  • either, the current letter is replaced by a random letter according to the distribution dictated by subst_probs. This could be a match or a substitution.
  • or, a gap is opened (with probability go_prob) the length of which follows a geometric distribution with parameter ge_prob.

Accordingly, a transcript is generated which corresponds to the performed edit sequence.

Parameters:seq (Sequence) – The original sequence.
Returns:The mutant Sequence and the corresponding edit transcript (as a string).
Return type:tuple
noisy_read(seq, **kw)[source]

Wraps rand_read() to generates a collection of lossy reads from the given sequence. First, lossless reads are generated by rand_read() (all keyword arguments are passed as is), then each read is modified by this mutation process, as in mutate().

Parameters:

seq (sequence.Sequence) – The original sequence to be sampled.

Keyword Arguments:
 
  • num (int) – Number of reads to generate; if not given expected_coverage must be specified.
  • expected_coverage (float) – The expected number of times each letter in the sequence ought to appear in the entire read collection; if not given num must be specified.
  • len_mean (float) – The mean of the normal distribution of read lengths.
  • len_sd (float) – The standard deviation of the normal distribution or read lengths; default is 1.
Yields:

tuple – The noisy read Sequence, the starting position int of the original lossless read, and the corresponding EditTranscript.

log_odds_scores(null_hypothesis=None)[source]

Converts the mutation probabilities (substitution and gap) to log-odds scores usable for sequence alignment (cf. AlignScores).

Keyword Arguments:
 null_hypothesis (list) – The probability distribution of letters to use as the null hypothesis; default is None which is taken to mean uniform distribution.
Returns:The substitution score matrix (a list of length equal to the alphabet containing lists each of the same length containing the substitution scores) and the gap scores as a tuple of gap-open and gap-extend scores.
Return type:tuple

Substitution scores

The scores are natural logs of odds ratios, namely the substitution score of letter \(a_i\) to letter \(a_j\) is:

\[S(a_i\rightarrow a_j) = \log[(1-g)\Pr(a_j|a_i)] - \log[\Pr(a_j)]\]

where \(g\) is the gap-extend probability (see below), \(\Pr(a_j|a_i)\) is the substution probability, and \(\Pr(a_j)\) is the null hypothesis distribution.

Gap scores

Gap open and extend probabilities are converted to an affine gap penalty scheme (but reported values are “scores” not penalties, i.e they are negative). The probabilitstic model is as described in MutationProcess() (also cf. mutate()). Accordingly, given the gap-open and gap-extend probabilities \(g_o, g_e\), the score of a gap of length \(n \ge 1\) is

\[\log g_o + (n-1)\log g_e = \log\left( \frac{g_o}{g_e} \right) + n\log g_e\]

where the first term in the RHS is the reported gap-score and the coefficient of \(n\) in the second term is the reported gap-extend score.

Note

  • In the calculation of substitution scores the gap-open probability is ignored and a linear model, with its sole parameter equal to the gap-extend probability, is assumed. This is because under an affine model where the probability of opening a gap differs from that of extending a gap, the substitution probabilities also become context-dependent (see where \(g\) appears in the formula above) which is not supported by pwlib.
  • The above formula for gap scores is the technical reason why we always demand that \(g_e\ge g_o\) so that the gap score is not positive, with equality implying linear gap penalties (i.e the gap-open score is 0 when the gap-open and gap-extend probabilities are identical).