biseqt.stochastics module¶
-
biseqt.stochastics.
rand_seq
(alphabet, size, p=None)[source]¶ Generate a random
Sequence
of the given length from the givenAlphabet
.Keyword Arguments: - alphabet (Alphabet) –
- size (int) – The length of the randomly generated sequence.
- p (list) – The discrete probability distribution for letters of alphabet to appear at each position; default is None in which case letters are chosen uniformly.
-
biseqt.stochastics.
rand_read
(seq, len_mean=None, len_sd=1, expected_coverage=None, num=None)[source]¶ Generates a random collection of lossless reads from the given sequence. Each read is generated by taking a substring with Gaussian length and uniformly chosen starting point. The number of reads is either determined as an argument or calculated based on the expected coverage:
\[num = \frac{\mathbb{E}[coverage]\times |seq|}{\mathbb{E}[len]}\]Parameters: seq (sequence.Sequence) – The original sequence to be sampled.
Keyword Arguments: - num (int) – Number of reads to generate; default is None in which case
expected_coverage
takes precedence. If both are None a single read is returned. - expected_coverage (float) – The expected number of times each letter in
the sequence ought to appear in the entire read collection; default
is None in which case
num
takes precedence. If both are None a single read is returned. - len_mean (float) – The mean of the normal distribution of read lengths. Default is 100.
- len_sd (float) – The standard deviation of the normal distribution or read lengths. Default is 1.
Yields: tuple – The read
Sequence
and its starting positionint
.- num (int) – Number of reads to generate; default is None in which case
-
class
biseqt.stochastics.
MutationProcess
(alphabet, subst_probs=None, ge_prob=0, go_prob=0, insert_dist=None)[source]¶ Bases:
object
-
alphabet
¶ Alphabet – The
sequence.Alphabet
this mutation process operates on.
-
subst_probs
¶ list – The probability matrix for the distribution of substitutions such that
subst_probs[i][j]
is the probability of the i-th letter being replaced by the j-th letter at a non-indel edit operation. Alternatively, a single number can be provided in which case it is treated as the probability of any substitution to occur and all letters and all substitutions are considered equally. For instance, if the single number 0.2 is given and the alphabet has 3 letters the probability matrix will be:\[\begin{split}\begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.1 & 0.8 & 0.1 \\ 0.1 & 0.1 & 0.8 \end{pmatrix}\end{split}\]
-
go_prob
¶ float – probability of a single indel following a subtitution, a match, or an indel of a different kind (i.e from insertion to deletion); default is 0 (cf.
ge_prob
).
-
ge_prob
¶ float – The probability of an open gap to be extended by an indel of the same kind; default is 0. For consistency, it is always required that the gap-extend probability to be at least as large as the gap-open probability with equality implying a linear model which translates to a linear gap penalty scheme in
log_odds_scores()
.
-
insert_dist
¶ list – the probability distribution for inserted content; default is None which is taken to mean uniform.
-
mutate
(seq)[source]¶ Returns a mutant of the given sequence by copying it while at each position:
- either, the current letter is replaced by a random letter according
to the distribution dictated by
subst_probs
. This could be a match or a substitution. - or, a gap is opened (with probability
go_prob
) the length of which follows a geometric distribution with parameterge_prob
.
Accordingly, a transcript is generated which corresponds to the performed edit sequence.
Parameters: seq (Sequence) – The original sequence. Returns: The mutant Sequence
and the corresponding edit transcript (as a string).Return type: tuple - either, the current letter is replaced by a random letter according
to the distribution dictated by
-
noisy_read
(seq, **kw)[source]¶ Wraps
rand_read()
to generates a collection of lossy reads from the given sequence. First, lossless reads are generated byrand_read()
(all keyword arguments are passed as is), then each read is modified by this mutation process, as inmutate()
.Parameters: seq (sequence.Sequence) – The original sequence to be sampled.
Keyword Arguments: - num (int) – Number of reads to generate; if not given
expected_coverage
must be specified. - expected_coverage (float) – The expected number of times each letter
in the sequence ought to appear in the entire read collection;
if not given
num
must be specified. - len_mean (float) – The mean of the normal distribution of read lengths.
- len_sd (float) – The standard deviation of the normal distribution or read lengths; default is 1.
Yields: tuple – The noisy read
Sequence
, the starting positionint
of the original lossless read, and the correspondingEditTranscript
.- num (int) – Number of reads to generate; if not given
-
log_odds_scores
(null_hypothesis=None)[source]¶ Converts the mutation probabilities (substitution and gap) to log-odds scores usable for sequence alignment (cf.
AlignScores
).Keyword Arguments: null_hypothesis (list) – The probability distribution of letters to use as the null hypothesis; default is None which is taken to mean uniform distribution. Returns: The substitution score matrix (a list of length equal to the alphabet containing lists each of the same length containing the substitution scores) and the gap scores as a tuple of gap-open and gap-extend scores. Return type: tuple Substitution scores
The scores are natural logs of odds ratios, namely the substitution score of letter \(a_i\) to letter \(a_j\) is:
\[S(a_i\rightarrow a_j) = \log[(1-g)\Pr(a_j|a_i)] - \log[\Pr(a_j)]\]where \(g\) is the gap-extend probability (see below), \(\Pr(a_j|a_i)\) is the
substution probability
, and \(\Pr(a_j)\) is the null hypothesis distribution.Gap scores
Gap
open
andextend
probabilities are converted to an affine gap penalty scheme (but reported values are “scores” not penalties, i.e they are negative). The probabilitstic model is as described inMutationProcess()
(also cf.mutate()
). Accordingly, given the gap-open and gap-extend probabilities \(g_o, g_e\), the score of a gap of length \(n \ge 1\) is\[\log g_o + (n-1)\log g_e = \log\left( \frac{g_o}{g_e} \right) + n\log g_e\]where the first term in the RHS is the reported gap-score and the coefficient of \(n\) in the second term is the reported gap-extend score.
Note
- In the calculation of substitution scores the gap-open
probability is ignored and a linear model, with its sole
parameter equal to the gap-extend probability, is assumed. This
is because under an affine model where the probability of opening
a gap differs from that of extending a gap, the substitution
probabilities also become context-dependent (see where \(g\)
appears in the formula above) which is not supported by
pwlib
. - The above formula for gap scores is the technical reason why we always demand that \(g_e\ge g_o\) so that the gap score is not positive, with equality implying linear gap penalties (i.e the gap-open score is 0 when the gap-open and gap-extend probabilities are identical).
- In the calculation of substitution scores the gap-open
probability is ignored and a linear model, with its sole
parameter equal to the gap-extend probability, is assumed. This
is because under an affine model where the probability of opening
a gap differs from that of extending a gap, the substitution
probabilities also become context-dependent (see where \(g\)
appears in the formula above) which is not supported by
-