Documentation for biseqt

BiSeqt is a biological sequence analysis tool.

Wiki

Overview

(1) Alphabets and Sequences

An Alphabet is defined by a list of letters and a Sequence is a list of letters from an alphabet.

>>> from biseqt.sequence import Alphabet, Sequence
>>> A = Alphabet('ACGT')
>>> S = A.parse('AACTTCG')
>>> print S.contents
(0, 0, 1, 3, 3, 1, 2)
>>> print S[:3]
AAC
>>> print S.content_id[:8]
'd9f235b1a44358cc80237d5e5c46c06d81a83e46' # SHA1 of sequence contents

The letters in an alphabet not need be single characters but all must have the same length.

>>> A = Alphabet(['A1', 'A2', 'A3', 'A4'])
>>> S = A.parse('A1A1A3A2')
>>> print len(S)
4

[source: biseqt.sequence]

(2) Random Processes

The biseqt.stochastics module provides a collection of tools for statistical calculations and for simulating random processes.

>>> from biseqt.sequence import Alphabet
>>> from biseqt.stochastics import rand_seq, MutationProcess
>>> A = Alphabet('ACGT')
>>> seq = rand_seq(A, 10)
>>> print(seq)
CTTGGAAAAA
>>> M = MutationProcess(A, go_prob=.1, ge_prob=.2, subst_probs=.3)
>>> mutant, transcript = M.mutate(seq)
>>> print transcript
SMIMMMMSSMS

[source: biseqt.stochastics]

(3) Pairwise Alignment

The module exposes the pairwise sequence alignment algorithms implemented in C in pwlib.

>>> from biseqt.sequence import Alphabet, Sequence
>>> from biseqt.pw import Aligner
>>> A = Alphabet('ACGT')
>>> S = A.parse('AAACGCGT')
>>> T = A.parse('AACGCCTT')
>>> with Aligner(S, T) as aligner:
...     score = aligner.solve()
...     aln = aligner.traceback()
>>> print type(aln)
<class 'biseqt.pw.Alignment'>
>>> print aln
origin[0]: AAACGC-GT-
mutant[0]: AA-CGCC-TT
>>> print aln.render_term() # same as above but with colored output

[source: biseqt.pw]

(4) Handling Kmers

The biseqt.kmers module provides tools for k-mer analysis. Kmers are represented as integer (and hence a maximum word length is imposed by integer size of the machine).

>>> from biseqt.sequence import Alphabet, Sequence
>>> from biseqt.kmers import as_kmer_seq, KmerIndex
>>> A = Alphabet('ACGT')
>>> S = A.parse('AAACGCGT')
>>> wordlen = 3
>>> as_kmer_seq(S, wordlen)
[0, 1, 6, 25, 38, 27]
>>> kmer_index = KmerIndex(path=':memory:', alphabet=A, wordlen=wordlen)
>>> kmer_index.index_kmers(S)
1 # the internal 'id' assigned to sequence S
>>> kmer_index.kmers()
[0, 1, 6, 25, 27, 38]
>>> kmer_index.hits(27)
[(1, 5)] # (seqid, position)

[source: biseqt.kmers]

(5) Seeds (exactly matching kmers)

The biseqt.seeds module provides tools for storing and analyzing exactly matching kmers, aka seeds.

>>> from biseqt.seeds import SeedIndex
>>> from biseqt.sequence import Sequence, Alphabet
>>> A = Alphabet('ACGT')
>>> S, T = A.parse('TAAGCGT'), A.parse('GGCGTAA')
>>> seed_index = SeedIndex(S, T, path=':memory:', wordlen=3, alphabet=A)
>>> list(seed_index.seeds())
[(4, 2), (3, 1), (0, 4)]

[source: biseqt.seeds]

(6) Alignment-free local similarity detection with Word-Blot

The biseqt.blot module implements the WordBlot algorithm.

>>> from biseqt.blot import WordBlot
>>> from biseqt.sequence import Sequence, Alphabet
>>> from biseqt.stochastics import rand_seq, MutationProcess
>>> A = Alphabet('ACGT')
>>> S = rand_seq(A, 100)
>>> M = MutationProcess(A, go_prob=.1, ge_prob=.1, subst_probs=.2)
>>> T, _ = M.mutate(S)
>>> WB = WordBlot(S, T, wordlen=5, alphabet=A, g_max=.5,                             sensitivity=.99)
>>> list(WB.similar_segments(50, .7))
[(((-28, 7), (0, 99)), 1.1988215486845972, 0.7630903108462143)]
>>>

[source: biseqt.blot]

Developer Notes

Insert-or-Append Queries

It may be useful to perform an SQL query which either inserts a new record or appends the given value to some column in case of conflict. This is achived by using one of SQLite’s conflict resolution mechanisms ON CONFLICT REPLACE or in short OR REPLACE. For instance, consider a table:

id | field
1  | foo

where we wish INSERT INTO ... (id, field) VALUES (2, 'bar') to give:

id | field
1  | foo
2  | bar

and INSERT INTO ... (id, field) VALUES (1, 'bar') to give:

id | field
1  | foo,bar

This can be implemented by using the following query format:

INSERT INTO ... (id, field) VALUES
SELECT ?, IFNULL(SELECT field FROM ... WHERE id = ?, "") || ?

invoked like this:

id, field = ...
conn = apsw.Connection('example.db')
conn.cursor().execute(query, (id, id, ',' + field))

Note that this pattern only works if the id column has a unique constraint on it. Otherwise, no conflict will arise to be resolved and new values will appear in new records instead of being appended to old ones.

[source: biseqt.kmers]

Integer Sizes

All kmers are stored as their integer representation to save on space and processing time. Python is flexible with the maximum size of integers, as integers automatically switch to longs, which have “unlimited precision”. SQLite, too, is flexible but has a maximum integer cap of 64-bits: integers take 2, 4, 6, or 8 bytes per integer dependning on the size it needs.

The checks resulting from maximum integer size are performed in KmerIndex.__init__ which basically block kmers taking more than 64-bit integers to represent.

[source: biseqt.kmers]

SQLite Performance Tuning

Tuning strategy naturally depends on the balance between the volume of insert vs. select queries and the concurrency requirements. Here we will assume:

  • The volume of inserts is much larger than selects,
  • Application logic can be trusted with respecting unique constraints (i.e the code creating data does not violate semantic constraints).
  • Usage pattern consists of bulk of inserts followed by bulk of selects (e.g. not interleaved).

Under these circumstances, the following guidelines are suggested:

  • Create indices after bulk inserts not before. For instance, instead of:

    CREATE TABLE foo ('f1' int, 'f2' int, UNIQUE(f1, f2))
    -- INSERT INTO foo VALUES ...
    

    it’s more performant to say:

    CREATE TABLE foo ('f1' int, 'f2' int)
    -- INSERT INTO foo VALUES ...
    CREATE UNIQUE INDEX foo_index ON foo (f1, f2)
    
  • If there is no concern about data corruption upon application or operating sytem crash journaling can be turned off as well, from docs:

    If the application crashes in the middle of a transaction when the
    OFF journaling mode is set, then the database file will very likely
    go corrupt.

    To turn off journaling:

    PRAGMA journaling_mode = OFF
    

    Note that turning off journaling breaks rollbacks:

    The OFF journaling mode disables the rollback journal completely. No
    rollback journal is ever created and hence there is never a rollback
    journal to delete. The OFF journaling mode disables the atomic commit
    and rollback capabilities of SQLite.
  • When a table has a unique integer key it should be declared as INTEGER PRIMARY KEY so that it would take over the default rowid field. This saves space (and thus a small amount of time) on both the field and the corresponding index.

  • Foreign key constraints, if enforced, slow down bulk inserts significantly. However, by default, foreign key checks are turned off. To turn it on:

    PRAGMA foreign_keys = ON;
    

    Note that this default maybe modified by compile time flags (i.e foreign keys may be turned on by default). Furthermore, if foreign keys are turned on, consider deferring foreign key enforcement to transaction commits and and keep in mind that pysqlite (following python’s DB-API) fudges with transaction beginning and ends.

  • Larger page sizes can marginally improve read/write performance. To increase the page size:

    PRAGMA page_size = 65536
    

[source: biseqt.kmers]