biseqt.kmers module¶
-
biseqt.kmers.
kmer_as_int
(contents, alphabet)[source]¶ Calculates the integer representation of a kmer by treating its contents as digits in the base of alphabet length. For instance, in the DNA alphabet
AGA
becomes \((020)_4\) which is 8. Note that each kmer gives a unique integer as long as all kmers have the same word length (which is the case here). There are two restrictions imposed on the word length and alphabet size (enforced in
__init__()
):
- The alphabet must be such that all letters can be represented by single ASCII characters between
[0-9a-z]
(cf. int). This implies a maximum alphabet size of 36.
- The word length must be such that a single integer can store the entire representation of a kmer. This requires that we have:
\[k < \frac{I-1}{2}\]
where \(k\) is the word length and \(I\) is the number of bits allocated for an integer. For instance, on a 64-bit system the maximum word length is 31.
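The encoding described above can be illustrated with a minimal sketch. It is not the module's implementation: the helper names `kmer_to_int` and `max_wordlen` are hypothetical, and the letter order A, C, G, T (so A=0, C=1, G=2, T=3) is an assumption chosen to match the AGA example.

```python
def kmer_to_int(digits, base):
    """Read a sequence of integer letters as the digits of a base-`base` number."""
    value = 0
    for d in digits:
        if not 0 <= d < base:
            raise ValueError('letter %d is outside the alphabet' % d)
        value = value * base + d
    return value


def max_wordlen(int_bits):
    """Largest integer k satisfying k < (int_bits - 1) / 2."""
    return (int_bits - 2) // 2


# Assumed letter order for the DNA alphabet: A=0, C=1, G=2, T=3.
dna = 'ACGT'
aga = [dna.index(c) for c in 'AGA']      # [0, 2, 0]
aga_int = kmer_to_int(aga, len(dna))     # (020)_4

# The built-in int() accepts bases up to 36 with digits drawn from [0-9a-z],
# which is where the alphabet size limit of 36 comes from.
same_via_builtin = int('020', 4)
```

Here `aga_int` and `same_via_builtin` are both 8, and `max_wordlen(64)` gives 31, in agreement with the text.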
-
biseqt.kmers.
as_kmer_seq
(seq, wordlen, mask=[])[source]¶ A generator for kmer hit tuples of the form
(kmer, pos)
. Kmers are represented in integer form (cf. kmer_as_int()
).
Parameters: - seq (sequence.Sequence) – The sequence to be scanned.
- wordlen (int) – Size of kmers.
- mask (list) – A list of sets of integers
(i_1, ..., i_k)
which mask kmers (represented by None
) if the kmer content (set of letters appearing in the kmer, represented as integers as in Sequence.contents
) matches the set.
Returns: list of integers representing kmers.
Return type: list
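As a rough illustration of the scan and masking behaviour described above, the sketch below works on a plain list of integer letters instead of a `sequence.Sequence`. The function name `kmer_hits` and its `alphabet_size` parameter are hypothetical, and comparing the set of letters in each window against each mask set is an assumption drawn from the parameter description.

```python
def kmer_hits(contents, wordlen, alphabet_size, mask=()):
    """Yield (kmer, pos) for every window of length `wordlen` in `contents`.

    A window whose set of letters equals one of the sets in `mask`
    is reported as None instead of an integer.
    """
    masked = [set(m) for m in mask]
    for pos in range(len(contents) - wordlen + 1):
        window = contents[pos:pos + wordlen]
        if set(window) in masked:
            yield None, pos
            continue
        kmer = 0
        for d in window:
            kmer = kmer * alphabet_size + d
        yield kmer, pos


# AAAGA with A=0 and G=2 (assumed letter order A, C, G, T); mask the all-A kmer.
hits = list(kmer_hits([0, 0, 0, 2, 0], 3, 4, mask=[{0}]))
```

With this input `hits` is `[(None, 0), (2, 1), (8, 2)]`: the window AAA is masked, AAG encodes to 2 and AGA to 8.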
-
class
biseqt.kmers.
KmerDBWrapper
(name='', path=':memory:', alphabet=None, wordlen=None, mask=[], log_level=20, init_script=None)[source]¶ Bases:
object
Generic wrapper for an SQLite database for Kmers.
-
name
¶ str – String name used as a suffix for table names.
-
path
¶ str – Path to the SQLite database (or
:memory:
).
-
alphabet
¶ sequence.Alphabet – The alphabet for sequences in the database.
-
wordlen
¶ int – Length of kmers of interest to this index.
-
mask
¶ list – A list of sets of integers which mask kmers (represented by
None
), cf. as_kmer_seq()
.
-
init_script
¶ str – SQL script to be executed upon initialization; typically creates tables needed by the class.
-
connection
(reset=False)[source]¶ Provides a SQLite database connection that can be used as a context manager. The returned object is always the same connection object belonging to the
KmerIndex
instance (otherwise in-memory connections would reset the database contents upon every invocation).
Returns: apsw.Connection
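The reason for handing out a single connection object can be seen in a small sketch. It uses the standard library `sqlite3` module in place of `apsw`, and the class `SharedConnection` is hypothetical. Treating `reset=True` as "open a new connection" is an assumption, since the documentation does not say what the flag does.

```python
import sqlite3


class SharedConnection(object):
    """Hands out one lazily created connection per instance."""

    def __init__(self, path=':memory:'):
        self.path = path
        self._connection = None

    def connection(self, reset=False):
        # Assumption: reset=True discards the held connection and opens a new one.
        if reset or self._connection is None:
            self._connection = sqlite3.connect(self.path)
        return self._connection


db = SharedConnection()
with db.connection() as conn:   # sqlite3 commits on exit, it does not close
    conn.execute('CREATE TABLE t (x INTEGER)')
    conn.execute('INSERT INTO t VALUES (1)')

# The same object comes back, so the in-memory table is still visible.
same = db.connection()
rows = same.execute('SELECT x FROM t').fetchall()

# A separately opened in-memory connection starts from an empty database.
fresh = sqlite3.connect(':memory:')
tables = fresh.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
```

In this sketch `rows` contains the inserted row while `tables` is empty, which is the data loss the shared connection avoids.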
-
-
class
biseqt.kmers.
KmerCache
(name='', **kw)[source]¶ Bases:
biseqt.kmers.KmerDBWrapper
A cache backed by SQLite for representations of sequences as integer sequences. Upon initialization the following SQL script is executed:
CREATE TABLE seq_kmers (
  'seq' VARCHAR,   -- content identifier of sequence
  'kmres' VARCHAR, -- comma separated representation of sequence
                   -- as integers encoding kmers.
);
This implies that at any time only one
KmerCache
can exist with the same path. FIXME: using the same database for different word lengths / alphabets is quietly accepted (and wrong results are returned).
-
kmers_table
¶ The kmer hits table name
seq_kmers_[name]
, cf. KmerDBWrapper.name
.
-
as_kmer_seq
(seq)[source]¶ Return the integer representation of a given sequence.
Parameters: seq (sequence.Sequence) – input sequence.
Returns: list of integers of length
n-w+1
containing the kmers in the input sequence represented as integers, cf.
as_kmer_seq()
.
Return type: list
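A cache of this kind might be sketched as follows, again with the standard library `sqlite3` module in place of `apsw`. The helper `cached_kmers`, the column name `kmers`, and the `PRIMARY KEY` constraint are all assumptions made for the illustration, and masked kmers (`None`) are left out to keep the serialization simple.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE seq_kmers (seq VARCHAR PRIMARY KEY, kmers VARCHAR)')


def cached_kmers(conn, content_id, compute):
    """Return the kmer list for `content_id`, computing it at most once."""
    row = conn.execute(
        'SELECT kmers FROM seq_kmers WHERE seq = ?', (content_id,)).fetchone()
    if row is not None:
        # Stored as a comma separated string; an empty string means no kmers.
        return [int(x) for x in row[0].split(',')] if row[0] else []
    kmers = compute()
    conn.execute('INSERT INTO seq_kmers VALUES (?, ?)',
                 (content_id, ','.join(str(k) for k in kmers)))
    return kmers


calls = []


def expensive():
    # Stand-in for the real kmer scan; records how often it runs.
    calls.append(1)
    return [2, 8, 33]


first = cached_kmers(conn, 'seq-a', expensive)
second = cached_kmers(conn, 'seq-a', expensive)   # served from the table
```

Both calls return `[2, 8, 33]`, but `expensive` runs only once. Nothing in the table records the word length or alphabet, which is how the mix-up noted in the FIXME above can go unnoticed.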
-
-
class
biseqt.kmers.
KmerIndex
(name='', kmer_cache=None, **kw)[source]¶ Bases:
biseqt.kmers.KmerDBWrapper
An index backed by SQLite for occurrences of kmers in a body of sequences. Upon initialization the following script is executed:
CREATE TABLE kmers_[name] (
  'kmer' INTEGER, -- The kmer in integer representation.
  'seq' INTEGER,  -- integer identifier of sequence
  'pos' INTEGER   -- the position of kmer in sequence.
);
CREATE TABLE IF NOT EXISTS kmer_indexed_[name] (
  'seq' VARCHAR,  -- content id,
  'seqid' INTEGER PRIMARY KEY AUTOINCREMENT -- integer id.
);
-
name
¶ str
-
cache
¶ KmerCache – optional
KmerCache
object to use for retrieving integer representations of sequences.
-
kmers_table
¶ The kmer hits table name
kmers_[name]
, cf. KmerDBWrapper.name
.
-
log_table
¶ The log table name
kmer_indexed_[name]
, cf. KmerDBWrapper.name
.
-
index_kmers
(seq)[source]¶ Indexes all kmers observed in the given sequence in
kmers_table
.
Parameters: - seq (sequence.Sequence) – The sequence just inserted into the database.
- seqid (int) – The integer identifier to use for sequence.
-
hits
(kmer)[source]¶ Returns all hits of a given kmer in indexed sequences.
Parameters: kmer (int) – kmer of interest.
Returns: A list of 2-tuples containing sequence ids (int) and positions.
Return type: list
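The indexing and lookup flow can be sketched against the two-table schema shown for this class. This is an illustration only: it uses the standard library `sqlite3` module in place of `apsw`, fixes the table suffix to `demo`, and takes precomputed kmer lists in place of `sequence.Sequence` objects. The standalone functions `index_kmers` and `hits` merely mirror the method names, and skipping masked kmers (`None`) is an assumption.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE kmers_demo (kmer INTEGER, seq INTEGER, pos INTEGER);
    CREATE TABLE IF NOT EXISTS kmer_indexed_demo (
        seq VARCHAR,
        seqid INTEGER PRIMARY KEY AUTOINCREMENT
    );
""")


def index_kmers(conn, content_id, kmers):
    """Log the sequence, then store one row per unmasked kmer occurrence."""
    cursor = conn.execute(
        'INSERT INTO kmer_indexed_demo (seq) VALUES (?)', (content_id,))
    seqid = cursor.lastrowid
    conn.executemany(
        'INSERT INTO kmers_demo VALUES (?, ?, ?)',
        [(kmer, seqid, pos) for pos, kmer in enumerate(kmers)
         if kmer is not None])   # assumption: masked kmers are not indexed
    return seqid


def hits(conn, kmer):
    """Return (sequence id, position) pairs for every occurrence of `kmer`."""
    return conn.execute(
        'SELECT seq, pos FROM kmers_demo WHERE kmer = ? ORDER BY seq, pos',
        (kmer,)).fetchall()


id_a = index_kmers(conn, 'seq-a', [2, 8, None, 2])
id_b = index_kmers(conn, 'seq-b', [8, 2])
hits_for_2 = hits(conn, 2)
hits_for_8 = hits(conn, 8)
```

In this sketch the two sequences receive ids 1 and 2, `hits_for_2` is `[(1, 0), (1, 3), (2, 1)]` and `hits_for_8` is `[(1, 1), (2, 0)]`.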
-