# Experiments¶

## (§1) Seed Statistics Simulations¶

experiments.num_seeds.exp_count_seeds()[source]

Shows theoretical and simulation results for the mean and variance of the number of exactly matching kmers between related and unrelated sequences as a function of sequence length. Theoretical predictions are based on m-dependent Central Limit Theorem which suggests a limiting Normal distribution with mean and variance given by biseqt.blot.H0_moments() and biseqt.blot.H1_moments().

Supported Claims

• The theoretical calculations of mean and variance for the number of seeds, given by biseqt.blot.H0_moments() and biseqt.blot.H1_moments() agree with simulations at least upto 25kbp sequence lengths.
• Although the expected number of seeds is technically quadratic, the highest order coefficient is so small that it can be considered effectively linear in sequence length. Furthermore, we note that word length provides expoenentially strong control on the number of seeds as sequence lengths increase; hence maintaining a small quadratic coefficient across biologically relevant sequence lengths (up to 1 Gbp) is feasible with reasonable word lenghts (up to 30)
experiments.num_seeds.exp_count_seeds_segment()[source]

Shows simulation results for alignment-free estimated match probability for globally homologous sequence pairs of length up to 25kbp.

Supported Claims

• The match probability estimator provided by biseqt.blot.WordBlot.estimate_match_probability() using the band radius provided by biseqt.blot.band_radius() are accurate for similarities of lengths up to 25kbp regardless of gap probability upper bound. Therefore, we can justify using generous overestimates of gap probability (e.g. $$g_{\max}=0.4$$) in typical contexts.
• For unrelated sequences our estimator reports a number close to .25 (one standard deviation range $$[0, .4]$$).

## (§2) Diagonal Band Radius Simulations¶

experiments.band_radius.exp_time_in_band()[source]

Shows theoretical and empirical results for proportion of alignment time (i.e. number of edit operations) spent within diagonal bands of varying radius.

Supported Claims

• The random walk model for capturing the probability distribution of diagonal positions is accurate.
• Calculating the band radius such that the endpoint of an alignment is within diagonal band with probability at least $$1 - \epsilon$$ is a good conservative estimate for the more reasonable requirement that the average proportion of time spent in band is at least $$1 - \epsilon$$.
• Calculated band radii are accurate and reliable when applied to simulated as well as biological edit sequences.
experiments.band_radius.exp_gap_probs()[source]

Shows gap probability variability in simulated and biological edit sequences.

Supported Claims

• Gap probabilities are nonstationary in biological data.
• Removing long indel stretches mitigates this nonstationarity for the most part and allows us to treat gap probability in biological alignments as stationary.

## (§3) Basic Word-Blot Usage Experiments¶

experiments.blot_rearrangement.exp_rearrangement()[source]

Example demonstrating of Word-Blot for pairwise local similarity search on two randomly generated sequencees with motif sequences violating collinearity $$S=M_1M_2M_3, T=M'_1M'_1M'_3M'_2$$ where motif pairs $$(M_i, M'_i)_{i=1,2,3}$$ have lengths 200, 400, 600 and are related by match probabilities 0.95, 0.85, and 0.75, respectively.

## (§4) Word-Blot Performance (pairwise)¶

experiments.blot_stats.exp_simulated_K_p()[source]

Performance of Word-Blot and associated statistical scores for pairwise local similarity search in simulations:

• z-scores assigned to diagonal strips with and without similarities under the corresponding limiting Normal distributions ($$H0, H1$$), for varying similarity lengths and match probabilities.
• Estimated coordinates, lengths and match probabilities of local similarities for varying similarity lengths and match probabilities. In each trial with similarity length $$K$$ and match probability $$p$$, two pairs of input sequences are provided to Word-Blot: a pair of sequences of length $$2K$$ whose substrings at positions $$[\frac{K}{2}, \frac{3K}{2}]$$ are homologies of length $$K$$ and match probability $$p$$, and two unrelated sequences of length $$2K$$ are considered.

Supported Claims

experiments.blot_stats.exp_comp_aligned_genes()[source]

Performance of Word-Blot and associated statistical scores for pairwise local similarity search on biological data. Given aligned copies of a gene in multiple species we consider the following for each pair:

• estimated match probability $$\hat{p}$$ at homologous seeds (i.e. those exactly matching kmers that lie on the known global pairwise alignment) compared to local match probability of said global alignment.
• discrimination power of $$\hat{p}$$ to separate homologous seeds from non-homologous seeds.
• all identified similar segments between the two sequences verified by comparison to banded dynamic programming alignment in the identified diagonal strip.

Supported Claims

• Word-Blot accurately estimates match probabilities, length, and coordinates of local similarities in biological data.
• estimated match probabilities $$\hat{p}$$ are effective statistics to separate homologous and non-homologous seeds.
• Dynamic programming alignment confirms that Word-Blot correctly identifies local similarities, even those that are inevitably missed by global alignments.

## (§5) Overlap Discovery Experiments¶

experiments.blot_overlaps.exp_overlap_simulations()[source]

Performance of Word-Blot in overlap discovery simulations. In each trial for overlap length $$K$$, a pair of sequences are presented to biseqt.blot.WordBlotOverlap which have either of:

• an overlap similarity of length $$K$$ with gap and substitution probabilities both equal to 0.1 (overlap pair),
• an overlap similarity of length $$K$$ with twice as much mutation probabilities (noisy overlap pair),
• a similarity of length $$K$$ with 0.1 substitution and gap probabilities, flanked by two unrelated regions of length $$\frac{K}{2}$$ such that it does not qualify as a proper overlap similarity (incomplete overlap pair),
• no similarities.

In each case the highest overlap band match probability reported by Word-Blot is used as the discriminating statistic.

Supported Claims

• In simulations Word-Blot accurately detects overlaps, estimates their diagonal position, and their match probability and thus can be used for quality-sensitive overlap detection and as a guide for further banded alignment.
experiments.blot_overlaps.exp_overlap_sequencing_reads()[source]

Performance of Word-Blot in overlap discovery on Pac Bio sequencing reads. Pac Bio reads from chromosome 1 of Leishmenia Donovani ($n=1055$ reads) were mapped to a reference genome using BLASR which provides a ground truth for classification of reads into overlapping and non-overlapping pairs. Word-Blot overlap discovery is then applied to all pairs of reads and the performance of the highest overlap band match probability as a classifier is shown.

Supported Claims

• Word-Blot overlap discovery is a good classifier for overlapping/non-overlapping pairs among Pac Bio sequencing reads.

## (§6) Immunoglobin VDJ Genotyping¶

experiments.blot_ig_genotyping.exp_ig_genotyping()[source]

Shows the performance of Word-Blot in genotyping Immunoglobin heavy chain genotyping. A thousand reads (average length 240nt) covering the mature VDJ region of chromosome 14 from the stanford S22 dataset are genotyped by Word-Blot against the IMGT list of Immunoglobin heavy chain genes:

• 358 genes and variants for V with average length 292nt,
• 44 genes and variants for D with average length 24nt,
• 13 genes and variants for J with average length 53nt.

For each read and each gene type (V, D, or J) 3 candidates are offered by Word-Blot which are then compared against the same output of IgBlast.

For each gene type we consider two measures of accuracy:

• the percentage of Word-Blot top 3 candidates that are also IgBlast top 3 candidates.

• a more forgiving measure which accepts the following candidates as “at least as good as IgBlast candidates”:

• those genes reported by Word-Blot that have a longer alignment than the shortest IgBlast candidate alignment while their percent identity is at most 1 point below that of the worst IgBlast candidate,
• those genes with a higher percent identity than the worst IgBlast candidate with an aligned length at most 2 nucleotides shorter than the shortest IgBlast candidate.

The relevant measures for Word-Blot candidates (exact value for percent identity and aligned length) are obtained by banded local alignment in the diagonal strip suggested by Word-Blot. To ensure comparability of results the same alignment scores are used as those of IgBlast (gap open -5, gap extend -2, and mismatch -1, -3, and -2 for V, D, and J respectively).

Supported Claims

• Despite relying on statistical properties of kmers, by using short word lengths Word-Blot performs well and acceptably fast even for small sequence lengths and for discriminating highly similar sequences.

## (§7) Structrual Variant Calling¶

experiments.blot_sv.call_svs(ref, reads, margin=50, K_min=200, p_min=0.8, **WB_kw)[source]

Calls structural variants based on single read mappings. For each read, the algorithm proceeds as follows:

• all local similarities are found via Word-Blot (biseqt.blot.WordBlotLocalRef.similar_segments() with given $$K_{\min}, p_{\min}$$).
• Two chain graphs are built based on the projections of similarities on the read and on the reference genome, henceforth the read graph and the reference graph.
• Reads containining SVs are identified using the following rules:
• Normal reads are characterized by a shortest path in both the read and reference graphs with exactly two edges between the start and the end of projections.
• deletions and duplications produce shortest paths with four edges in the read graph.
• insertions produce shortest paths with four edges in the reference graph.
experiments.blot_sv.exp_structural_variants()[source]

Performance of Word-Blot in detecting structural variations (SV) in simulations:

• Given a single read containing part of a SV and a reference sequence, one can deduce the presence of a SV from the topological arrangement of local similarities. The simple algorithm used here, call_svs(), simply pays attention to shortest paths in either of two directed graphs obtained by chaining projected similarities on either the reference or the read sequences.
• For each of three copy number variation SVs, insertion, deletion, duplication, similar segments detected by Word-Blot are used to detect whether each read contains an SV. In each case the true positive rate and false positive rate are plotted as a function of similarity between individual and reference sequences.

Supported Claims

• The simple topological algorithm in call_svs() can accurately distinguish normal reads from those containing a SV.
• The crux of our simple, single-read SV calling algorithm is that, roughly, SVs can be detected in a single read whenever there are more than one read/reference local similarities that, when chained, span the entire read (for duplication and deletion) or a whole region on reference (for insertion). This shows that Word-Blot accurately recovers local similarities at their maximal length without producing unnecessary fragmentation.
• Since Word-Blot is a general purpose local similarity search and thus makes no assumption of collinearity it can accurately identify simple SVs (copy number variations) by simply looking at mapping of individual reads to a reference sequence.

## (§8) Transposable Element Annotation¶

experiments.blot_te.exp_transposable_elements()[source]

Comparison between Word-Blot, Blast, and Hmmer in identifying simulated transposable elements (TE). TEs of length 1000nt are simulated (20 TEs), 20 genomes are simulated, each containing mutated versions of 10 TEs with varying match probabilities. Each algorithm (Blast or Word-Blot) is used with word length 8, minimum percent identity %60, and minimum length 500nt to identify local similarities between all known TEs and the genomic sequences and a TE is called if a local similarity of above properties is found between a genome and a TE. Example blastn command is:

$makeblastdb -dbtype nucl -in TE.fa -out blast/te.db$ rmblast -gapextend 2 \  # necessary for rmblast to not crash!
-db blast/te.db \
-word_size 8 \
-evalue 1e3 \
-outfmt '6 qseqid sseqid length pident' \
-query 'genome[0.69].fa'


and example nhmmer command is:

\$ nhmmer --qformat fasta \
--tblout /dev/stdout -o /dev/null \
TE.fa 'genome[0.69.fa]'


Supported Claims

• Word-Blot greatly outperforms rmblast (blastn tuned for RepeatMasker) in sensitivity (with roughly 5x time) while maintaining reasonable specificity (>%98) specifically when identifying distant TE homologos sequences (between %40 and %70 identity).
• Word-Blot outperforms nhmmer in speed (more than 2x) and to some extent in sensitivity (>%10) when identifying distant TEs.

## (§9) Word-Blot Performance (multiple)¶

experiments.multiple_sequence.exp_biological_multiple_sequences()[source]

Multiple sequence similarities with Word-Blot on biological data.

Supported Claimes

• Word-Blot estimated match probabilities are good classifiers for homologous and non-homologous seeds in multiple sequences.
experiments.multiple_sequence.exp_simulated_K_p()[source]

Performance of Word-Blot and associated statistical scores for multiple local similarity search in simulations:

• z-scores assigned to diagonal n-dimensional volumes with and without similarities under the corresponding limiting Normal distributions ($$H0, H1$$), for varying similarity lengths and match probabilities.
• Estimated coordinates, lengths and match probabilities of local similarities for varying similarity lengths and match probabilities. Same procedure is used as in pairwise comparison simulations (i.e. trials containing a similarity consists of multiple sequences of length $$2K$$ whose substrings at positions $$[\frac{K}{2}, \frac{3K}{2}]$$ are homologies of length $$K$$ and match probability $$p$$ in the multiple sequence sense).
• Word lengths are automatically adjusted to ensure effectively linear growth of complexity with respect to sequence lengths.

Supported Claims

• z-scores calculated by biseqt.blot.WordBlotMultiuple.score_num_seeds() against the limiting normal distributions for multiple sequences are stable and comparable across different similarity lengths and match probabilities; thus both scores are reliable statistics.
• Local similarities found by Word-Blot as per biseqt.blot.WordBlotMultiple.similar_segments() are accurate in length, estimated match probability, and coordinates.
• Adjusting kmer word lengths in logarithmic relationship with sequence length maintains effectively linear complexity and up to sequences of length 6.4 kbp behaves well.