biseqt.kmers module

biseqt.kmers.kmer_as_int(contents, alphabet)[source]

Calculates the integer representation of a kmer by treating its contents as digits in a base equal to the alphabet size. For instance, in the DNA alphabet AGA becomes \((020)_4\), which is 8. Note that each kmer gives a unique integer as long as all kmers have the same word length (which is the case here). Two restrictions are imposed on the word length and alphabet size (enforced in __init__()):

  • The alphabet must be such that all letters can be represented by single ASCII characters between [0-9a-z] (cf. int). This implies a maximum alphabet size of 36.

  • The word length must be such that a single integer can store the entire representation of a kmer. This requires that we have:

    \[k < \frac{I-1}{2}\]

    where \(k\) is the word length and \(I\) is the number of bits allocated for an integer. For instance, on a 64-bit system the maximum word length is 31.
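
The encoding and the word length bound above can be illustrated with a minimal standalone sketch. It does not call biseqt; the helper names kmer_to_int and max_wordlen are hypothetical, and the letter order A, C, G, T is an assumption chosen to agree with the AGA example.

```python
def kmer_to_int(contents, alphabet_size):
    """Read a kmer, given as a list of letter indices, as a number
    written in base `alphabet_size` (hypothetical helper)."""
    value = 0
    for digit in contents:
        if not 0 <= digit < alphabet_size:
            raise ValueError("letter index out of range: %r" % (digit,))
        value = value * alphabet_size + digit
    return value


def max_wordlen(int_bits):
    """Largest integer k with k < (I - 1) / 2, where I = int_bits.

    k < (I - 1) / 2  <=>  2k <= I - 2  <=>  k <= (I - 2) // 2
    """
    return (int_bits - 2) // 2


# Assumed letter order: A=0, C=1, G=2, T=3.
DNA = "ACGT"
aga = [DNA.index(letter) for letter in "AGA"]  # [0, 2, 0]
print(kmer_to_int(aga, len(DNA)))  # 8
print(max_wordlen(64))             # 31
```

Because every kmer has the same word length, leading zeros are not ambiguous: AGA and GA would both map to 8, but they never occur in the same index.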

biseqt.kmers.as_kmer_seq(seq, wordlen, mask=[])[source]

A generator for kmer hit tuples of the form (kmer, pos). Kmers are represented in integer form (cf. kmer_as_int()).

Parameters:
  • seq (sequence.Sequence) – The sequence to be scanned.
  • wordlen (int) – Size of kmers.
  • mask (list) – A list of sets of integers (i_1, ..., i_k) which mask kmers (represented by None) if the kmer content (set of letters appearing in the kmer, represented as integers as in Sequence.contents) matches the set.
Returns:list of integers representing kmers.
Return type:list
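
As a rough illustration of the scan described above, the following sketch yields (kmer, pos) tuples from a list of letter indices. It is not the library's implementation: the function name kmer_hits is hypothetical, the sequence is passed as a plain list rather than a sequence.Sequence, and "matches" is assumed to mean that the set of letters in the window equals one of the mask sets.

```python
def kmer_hits(contents, wordlen, alphabet_size, mask=()):
    """Yield (kmer, pos) for every window of length `wordlen`.

    A window whose set of letters equals one of the sets in `mask`
    is reported as None (assumed interpretation of masking).
    """
    masked = [set(m) for m in mask]
    for pos in range(len(contents) - wordlen + 1):
        window = contents[pos:pos + wordlen]
        if set(window) in masked:
            yield None, pos
            continue
        value = 0
        for digit in window:
            value = value * alphabet_size + digit
        yield value, pos


# AGAAA over the assumed letter order A=0, C=1, G=2, T=3,
# masking kmers that consist only of A.
hits = list(kmer_hits([0, 2, 0, 0, 0], wordlen=3, alphabet_size=4, mask=[{0}]))
print(hits)  # [(8, 0), (32, 1), (None, 2)]
```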

class biseqt.kmers.KmerDBWrapper(name='', path=':memory:', alphabet=None, wordlen=None, mask=[], log_level=20, init_script=None)[source]

Bases: object

Generic wrapper for an SQLite database for Kmers.

name

str – String name used as a suffix for table names.

path

str – Path to the SQLite database (or :memory:).

alphabet

sequence.Alphabet – The alphabet for sequences in the database.

wordlen

int – Length of kmers of interest to this index.

mask

list – A list of sets of integers which mask kmers (represented by None), cf. as_kmer_seq().

init_script

str – SQL script to be executed upon initialization; typically creates tables needed by the class.

connection(reset=False)[source]

Provides a SQLite database connection that can be used as a context manager. The returned object is always the same connection object belonging to this instance (otherwise in-memory connections would reset the database contents upon every invocation).

Returns:apsw.Connection
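
The reason for reusing one connection object can be seen in a small sketch. It uses the standard library sqlite3 module in place of apsw, the class SharedConnection is a hypothetical stand-in for the wrapper, and the reset parameter is not modelled.

```python
import sqlite3


class SharedConnection:
    """Hypothetical stand-in that hands out one shared connection."""

    def __init__(self, path=":memory:"):
        self.path = path
        self._connection = None

    def connection(self):
        # Open lazily, then always return the same object.
        if self._connection is None:
            self._connection = sqlite3.connect(self.path)
        return self._connection


db = SharedConnection()

# sqlite3 connections work as context managers: the block is committed
# on success and rolled back on error; the connection stays open.
with db.connection() as conn:
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")

with db.connection() as conn:
    rows = conn.execute("SELECT x FROM t").fetchall()

# A second, independent :memory: connection starts with an empty database.
other = sqlite3.connect(":memory:")
tables = other.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
```

Here rows contains the inserted value while tables is empty, which is the behavior the shared connection is meant to avoid.
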
log(*args, **kwargs)[source]

Wraps Logger.log.

class biseqt.kmers.KmerCache(name='', **kw)[source]

Bases: biseqt.kmers.KmerDBWrapper

A cache backed by SQLite for representations of sequences as integer sequences. Upon initialization the following SQL script is executed:

CREATE TABLE seq_kmers (
    'seq' VARCHAR,   -- content identifier of sequence
    'kmers' VARCHAR  -- comma separated representation of sequence
                     -- as integers encoding kmers.
);

This implies that at any time only one KmerCache can exist with the same path. FIXME: using the same database for different word lengths / alphabets is quietly accepted (and wrong results are returned).

kmers_table

The kmer hits table name seq_kmers_[name], cf. KmerDBWrapper.name.

cached_seqs()[source]

Returns content identifiers for all cached sequences.

as_kmer_seq(seq)[source]

Return the integer representation of a given sequence.

Parameters:seq (sequence.Sequence) – input sequence.
Returns:
list of integers of length n-w+1 (n being the sequence length and w the word length) containing the kmers of the input sequence in integer representation, cf. as_kmer_seq().
Return type:list
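
The caching idea can be sketched with the standard library sqlite3 module. This is only an illustration under assumptions: the helper cached_kmer_seq, the compute callback, and the table and column names are hypothetical, and masked kmers (None) are not handled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Illustrative schema: content id plus comma separated integers.
conn.execute("CREATE TABLE seq_kmers (seq VARCHAR, kmers VARCHAR)")


def cached_kmer_seq(conn, content_id, compute):
    """Return the kmer list for `content_id`, computing it at most once."""
    row = conn.execute(
        "SELECT kmers FROM seq_kmers WHERE seq = ?", (content_id,)
    ).fetchone()
    if row is not None:
        # An empty string stands for a sequence with no kmers.
        return [int(x) for x in row[0].split(",")] if row[0] else []
    kmers = compute()
    conn.execute(
        "INSERT INTO seq_kmers (seq, kmers) VALUES (?, ?)",
        (content_id, ",".join(str(k) for k in kmers)),
    )
    return kmers


calls = []


def compute():
    calls.append(1)  # record how often the expensive step runs
    return [8, 32, 0]


first = cached_kmer_seq(conn, "abc", compute)
second = cached_kmer_seq(conn, "abc", compute)
print(first == second, len(calls))  # True 1
```
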
class biseqt.kmers.KmerIndex(name='', kmer_cache=None, **kw)[source]

Bases: biseqt.kmers.KmerDBWrapper

An index backed by SQLite for occurrences of kmers in a body of sequences. Upon initialization the following script is executed:

CREATE TABLE kmers_[name] (
  'kmer'  INTEGER,      -- The kmer in integer representation.
  'seq'   INTEGER,      -- integer identifier of sequence
  'pos'   INTEGER       -- the position of kmer in sequence.
);

CREATE TABLE IF NOT EXISTS kmer_indexed_[name] (
  'seq'  VARCHAR,                           -- content id,
  'seqid' INTEGER PRIMARY KEY AUTOINCREMENT -- integer id.
);

name

str

cache

KmerCache – optional KmerCache object to use for retrieving integer representations of sequences.

kmers_table

The kmer hits table name kmers_[name], cf. KmerDBWrapper.name.

log_table

The log table name kmer_indexed_[name], cf. KmerDBWrapper.name.

index_kmers(seq)[source]

Indexes all kmers observed in the given sequence in kmers_table.

Parameters:
  • seq (sequence.Sequence) – The sequence just inserted into the database.
  • seqid (int) – The integer identifier to use for sequence.

create_sql_index()[source]

Creates an SQL index over the kmer column of the kmers table.

hits(kmer)[source]

Returns all hits of a given kmer in indexed sequences.

Parameters:kmer (int) – kmer of interest.
Returns:A list of 2-tuples containing sequence ids (int) and positions.
Return type:list

kmers()[source]

Returns all observed kmers.

Returns:list of kmers in integer representation.
Return type:list
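
How the two tables fit together can be sketched with the standard library sqlite3 module. The functions below are hypothetical stand-ins that mirror the documented method names but take plain lists of integers, use the suffix demo for the table names, and skip masked kmers (None), which is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE kmers_demo (kmer INTEGER, seq INTEGER, pos INTEGER);
CREATE TABLE kmer_indexed_demo (
  seq VARCHAR,
  seqid INTEGER PRIMARY KEY AUTOINCREMENT
);
""")


def index_kmers(conn, content_id, kmers):
    """Log the sequence, then store one row per kmer occurrence."""
    cur = conn.execute(
        "INSERT INTO kmer_indexed_demo (seq) VALUES (?)", (content_id,)
    )
    seqid = cur.lastrowid
    conn.executemany(
        "INSERT INTO kmers_demo (kmer, seq, pos) VALUES (?, ?, ?)",
        [(k, seqid, pos) for pos, k in enumerate(kmers) if k is not None],
    )
    return seqid


def hits(conn, kmer):
    """All (sequence id, position) pairs where `kmer` occurs."""
    return conn.execute(
        "SELECT seq, pos FROM kmers_demo WHERE kmer = ? ORDER BY seq, pos",
        (kmer,),
    ).fetchall()


def kmers(conn):
    """All distinct kmers observed so far."""
    rows = conn.execute("SELECT DISTINCT kmer FROM kmers_demo ORDER BY kmer")
    return [row[0] for row in rows]


index_kmers(conn, "id-1", [8, 32, 0])
index_kmers(conn, "id-2", [32, 8])
# An index on the kmer column keeps lookups by kmer fast.
conn.execute("CREATE INDEX kmers_demo_kmer ON kmers_demo (kmer)")

print(hits(conn, 8))  # [(1, 0), (2, 1)]
print(kmers(conn))    # [0, 8, 32]
```
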
drop_data()[source]

Drop all tables created by this object.