# Experiments¶

## (§1) Seed Statistics Simulations¶

experiments.num_seeds.exp_count_seeds()[source]

Shows theoretical and simulation results for the mean and variance of the number of exactly matching kmers between related and unrelated sequences as a function of sequence length. Theoretical predictions are based on m-dependent Central Limit Theorem which suggests a limiting Normal distribution with mean and variance given by biseqt.blot.H0_moments() and biseqt.blot.H1_moments().

Supported Claims

• The theoretical calculations of mean and variance for the number of seeds, given by biseqt.blot.H0_moments() and biseqt.blot.H1_moments() agree with simulations at least upto 25kbp sequence lengths.
• Although the expected number of seeds is technically quadratic, the highest order coefficient is so small that it can be considered effectively linear in sequence length. Furthermore, we note that word length provides expoenentially strong control on the number of seeds as sequence lengths increase; hence maintaining a small quadratic coefficient across biologically relevant sequence lengths (up to 1 Gbp) is feasible with reasonable word lenghts (up to 30) Time to find all exactly matching 8-mers (left) for related (green) and unrelated (red) sequences of varying lengths (n=50 samples for each length; shaded regions show one standard deviation). Related sequences are mutations of each other with substitution and gap probabilities both equal to 0.1. For the same simulations, mean (middle) and standard deviation (right) of the number of seeds as a function of sequence length are shown (solid lines) with theoretical predictions for each case (dashed lines). All axes are in log scale, and the dotted black lines in the left and middle plots are y=x lines for comparison.

experiments.num_seeds.exp_count_seeds_segment()[source]

Shows simulation results for alignment-free estimated match probability for globally homologous sequence pairs of length up to 25kbp.

Supported Claims

• The match probability estimator provided by biseqt.blot.WordBlot.estimate_match_probability() using the band radius provided by biseqt.blot.band_radius() are accurate for similarities of lengths up to 25kbp regardless of gap probability upper bound. Therefore, we can justify using generous overestimates of gap probability (e.g. $$g_{\max}=0.4$$) in typical contexts.
• For unrelated sequences our estimator reports a number close to .25 (one standard deviation range $$[0, .4]$$). For multiple values of maximum allowed gap probability $$g_{\max}$$ (which dictates diagonal band radius), estimated match probability (using word length 6) is shown as a function of sequence length (left) for globally homologous sequences (solid lines) and unrelated sequences (dashed lines), n=50 samples, shaded regions show one standard deviation. Homologous sequences were simulated by mutations with gap probability 0.1 and substitution probability 0.15 (hence a match probability of 0.77 indicated by a solid arrow (note agreement with Word-Blot estimation), the dashed arrow shows the 0.25 point). For each value of $$g_{\max}$$, the corresponding band radius is shown as a function of similarity length (right). Horizontal axes in both plots is in log scale.

## (§2) Diagonal Band Radius Simulations¶

experiments.band_radius.exp_time_in_band()[source]

Shows theoretical and empirical results for proportion of alignment time (i.e. number of edit operations) spent within diagonal bands of varying radius.

Supported Claims

• The random walk model for capturing the probability distribution of diagonal positions is accurate.
• Calculating the band radius such that the endpoint of an alignment is within diagonal band with probability at least $$1 - \epsilon$$ is a good conservative estimate for the more reasonable requirement that the average proportion of time spent in band is at least $$1 - \epsilon$$.
• Calculated band radii are accurate and reliable when applied to simulated as well as biological edit sequences. Predicted proportion of time spent in diagonal band (top left) in alignments of total length K=500 according to full (dashed) and approximate (solid) probabilistic model and for two values of gap probabilities. Vertical lines indicate the calculated band radius for %.01 accuracy. Simulated alignments (n=500 samples) are drawn in diagonal coordinates as a function of alignment time for gap probability 0.05 (right top) and 0.15 (right middle) with bounds obtained from biseqt.blot.band_radius() for two sensitivity values. Proportion of time spent in band as a function of band radius is shown for simulated sequences (bottom left) and real biological data (bottom right), and in each case the empirical percentage of time spent outside of band is reported; shaded regions indicate one standard deviation. Biological alignments are chosen from projected pairwise alignments of alpha-Actinin 2 (ACTN2) for 7 vertebrates obtained from UCSC genome browser. Each sample biological edit sequence is chosen such that it has the desired length and gap probability; gaps of length more than 10 nucleotides are removed to avoid nonstationary gap probabilities.

experiments.band_radius.exp_gap_probs()[source]

Shows gap probability variability in simulated and biological edit sequences.

Supported Claims

• Gap probabilities are nonstationary in biological data.
• Removing long indel stretches mitigates this nonstationarity for the most part and allows us to treat gap probability in biological alignments as stationary. Gap probability variability along alignments (length K=1200) in simulated (black) and biological edit sequences (green), before (left) and after (right) removing long indel stretches (deletion or insertion stretches of length at least 10) from biological samples; n=50 samples in both cases. Local gap probabilites are calculated in a window of radius 100 nt. Simulated edit sequences (black) are generated using stationary gap probability (i.e constant along alignment) of 0.15. Biological edit sequences are sampled from projected pairwise alignments of Actinin 2 for 7 vertebrates obtained from UCSC genome browser such that the overall gap probability (i.e over K=1200 operations) is 0.15.

## (§3) Basic Word-Blot Usage Experiments¶

experiments.blot_rearrangement.exp_rearrangement()[source]

Example demonstrating of Word-Blot for pairwise local similarity search on two randomly generated sequencees with motif sequences violating collinearity $$S=M_1M_2M_3, T=M'_1M'_1M'_3M'_2$$ where motif pairs $$(M_i, M'_i)_{i=1,2,3}$$ have lengths 200, 400, 600 and are related by match probabilities 0.95, 0.85, and 0.75, respectively. Dynamic programming scores of the forward pass of Smith Waterman are shown in color code (left) with seeds (word length 6) grey intensity coded according to the local match probability assigned by Word-Blot (minimum similarity length 200). Similar segments reported by Word-Blot are shown as grey diagonal strips (left) and schematically (right) color coded by their Word-Blot estimated match probabilities (note agreement with true match probabilities).

## (§4) Word-Blot Performance (pairwise)¶

experiments.blot_stats.exp_simulated_K_p()[source]

Performance of Word-Blot and associated statistical scores for pairwise local similarity search in simulations:

• z-scores assigned to diagonal strips with and without similarities under the corresponding limiting Normal distributions ($$H0, H1$$), for varying similarity lengths and match probabilities.
• Estimated coordinates, lengths and match probabilities of local similarities for varying similarity lengths and match probabilities. In each trial with similarity length $$K$$ and match probability $$p$$, two pairs of input sequences are provided to Word-Blot: a pair of sequences of length $$2K$$ whose substrings at positions $$[\frac{K}{2}, \frac{3K}{2}]$$ are homologies of length $$K$$ and match probability $$p$$, and two unrelated sequences of length $$2K$$ are considered.

Supported Claims Z-scores of the number of seeds in diagonal strip for related (solid lines) and unrelated (dashed lines) segments of varying lengths, calculated against the limiting distribution for unrelated pairs of sequences (left) and related pairs (right) as a function of similarity length (top) and match probability (bottom), n=50 samples, shaded regions indicate one standard deviation. Note that, as desired, H0 score is stable for unrelated sequences of any length and H1 score is stable for related sequences of any length or match probability. Estimated length (top left), match probability (bottom left), diagonal position (top right) and antidiagonal position (bottom right) of local similarities, word length is 7, n=20 samples, shaded regions indicate one standard deviation. Dashed black lines are ground truth for comparison. For length estimates both the sum of the lengths of all reported similar segments (solid) and the length of the longest similar segment (dashed) are shown. Estimated match probablity and Length estimates for unrelated pairs are also shown (dotted). Estimated match probability, diagonal and antidiagonal coordinates correspond to the longest local similarity.

experiments.blot_stats.exp_comp_aligned_genes()[source]

Performance of Word-Blot and associated statistical scores for pairwise local similarity search on biological data. Given aligned copies of a gene in multiple species we consider the following for each pair:

• estimated match probability $$\hat{p}$$ at homologous seeds (i.e. those exactly matching kmers that lie on the known global pairwise alignment) compared to local match probability of said global alignment.
• discrimination power of $$\hat{p}$$ to separate homologous seeds from non-homologous seeds.
• all identified similar segments between the two sequences verified by comparison to banded dynamic programming alignment in the identified diagonal strip.

Supported Claims

• Word-Blot accurately estimates match probabilities, length, and coordinates of local similarities in biological data.
• estimated match probabilities $$\hat{p}$$ are effective statistics to separate homologous and non-homologous seeds.
• Dynamic programming alignment confirms that Word-Blot correctly identifies local similarities, even those that are inevitably missed by global alignments. Word-Blot performance in identifying local similarities among Iroquois homeobox protein 1 (IRX1) genes for 10 amniota obtain from Ensembl with word length 8, minimum match probability 0.8, and minimum similarity length 200nt. An example pairwise comparison (top left) shows seeds (grey) color coded by intensity according to their estimated match probability with those exceeding the threshold shown larger for clarity. Known global alignment is shown in green and aligned local similarities identified by Word-Blot are shown in blue. Estimated similarity profile for the same example pair (bottom left) is shown at positions exceeding the match probability minimum (black dots) and compared to the profile obtained from the known global alignment (green). For all pairwise comparisons estimated and true match probability at homologous seeds are shown (top middle, green), the diagonal shows ground truth for comparison (dashed black). Similarly for all identified similar segments, estimated match probability and true probability obtained from banded global alignment are shown (top right, blue), the diagonal showing ground truth (dashed black) and the horizontal line shows the cutoff used by Word-Blot (dashed black). Performance of Word-Blot estimated match probability at discriminating between homologous and non-homologous seeds in the same context as the figure above. The cumulative distribution of estimated match probability in each case (bottom right), the resulting ROC curve (left) and the predictive ROC curve (top right). The threshold 0.8 is indicated in blue in all three plots. Word-Blot performance in identifying local similarities among Beta Actin (ACTB) genes for 7 vertebrates obtain from UCSC genome browser with word length 6, minimum match probability 0.5, and minimum similarity length 200nt. All subplots are similar to those of IRX1 experiment shown above. Performance of Word-Blot estimated match probability at discriminating between homologous and non-homologous seeds in the same context as the figure above.

## (§5) Overlap Discovery Experiments¶

experiments.blot_overlaps.exp_overlap_simulations()[source]

Performance of Word-Blot in overlap discovery simulations. In each trial for overlap length $$K$$, a pair of sequences are presented to biseqt.blot.WordBlotOverlap which have either of:

• an overlap similarity of length $$K$$ with gap and substitution probabilities both equal to 0.1 (overlap pair),
• an overlap similarity of length $$K$$ with twice as much mutation probabilities (noisy overlap pair),
• a similarity of length $$K$$ with 0.1 substitution and gap probabilities, flanked by two unrelated regions of length $$\frac{K}{2}$$ such that it does not qualify as a proper overlap similarity (incomplete overlap pair),
• no similarities.

In each case the highest overlap band match probability reported by Word-Blot is used as the discriminating statistic.

Supported Claims

• In simulations Word-Blot accurately detects overlaps, estimates their diagonal position, and their match probability and thus can be used for quality-sensitive overlap detection and as a guide for further banded alignment. Highest overlap band match probability reported by Word-Blot (left) for each of the four cases; word length 6, n=20, shaded regions indicate one standard deviation. For overlap and noisy overlap pairs the known truth is shown in thick dashed lines of corresponding color. For each case the cumulative probability distribution of the match probability reported by Word-Blot is shown (right), the clear separation of which indicates that Word-Blot is capable of detecting overlaps of a certain quality with high accuracy. Percentage of cases where the correct diagonal is not contained in reported diagonal band: %0.44 (noisy overlap), %0 (overlap), %0 (incomplete overlap).

experiments.blot_overlaps.exp_overlap_sequencing_reads()[source]

Performance of Word-Blot in overlap discovery on Pac Bio sequencing reads. Pac Bio reads from chromosome 1 of Leishmenia Donovani ($n=1055$ reads) were mapped to a reference genome using BLASR which provides a ground truth for classification of reads into overlapping and non-overlapping pairs. Word-Blot overlap discovery is then applied to all pairs of reads and the performance of the highest overlap band match probability as a classifier is shown.

Supported Claims

• Word-Blot overlap discovery is a good classifier for overlapping/non-overlapping pairs among Pac Bio sequencing reads. ROC curve for Word-Blot (word length = 10) overlap band match probability estimates as a classifier statistic for overlapping/non-overlapping pairs in Pac Bio sequencing reads. Average comparison time for reads of average length 7kbp is 60 ms. With match probability threshold 0.75 (indicated in all subplots in blue) we obtains %99 specificity and %98 negative predictive value (the crucial factors in overlap detection) and %45 sensitivity and %46 positive predictive value.

## (§6) Immunoglobin VDJ Genotyping¶

experiments.blot_ig_genotyping.exp_ig_genotyping()[source]

Shows the performance of Word-Blot in genotyping Immunoglobin heavy chain genotyping. A thousand reads (average length 240nt) covering the mature VDJ region of chromosome 14 from the stanford S22 dataset are genotyped by Word-Blot against the IMGT list of Immunoglobin heavy chain genes:

• 358 genes and variants for V with average length 292nt,
• 44 genes and variants for D with average length 24nt,
• 13 genes and variants for J with average length 53nt.

For each read and each gene type (V, D, or J) 3 candidates are offered by Word-Blot which are then compared against the same output of IgBlast.

For each gene type we consider two measures of accuracy:

• the percentage of Word-Blot top 3 candidates that are also IgBlast top 3 candidates.

• a more forgiving measure which accepts the following candidates as “at least as good as IgBlast candidates”:

• those genes reported by Word-Blot that have a longer alignment than the shortest IgBlast candidate alignment while their percent identity is at most 1 point below that of the worst IgBlast candidate,
• those genes with a higher percent identity than the worst IgBlast candidate with an aligned length at most 2 nucleotides shorter than the shortest IgBlast candidate.

The relevant measures for Word-Blot candidates (exact value for percent identity and aligned length) are obtained by banded local alignment in the diagonal strip suggested by Word-Blot. To ensure comparability of results the same alignment scores are used as those of IgBlast (gap open -5, gap extend -2, and mismatch -1, -3, and -2 for V, D, and J respectively).

Supported Claims

• Despite relying on statistical properties of kmers, by using short word lengths Word-Blot performs well and acceptably fast even for small sequence lengths and for discriminating highly similar sequences. For each of the gene types, V (left, word length 8), D (middle, word length 4), and J (right, word length 6), alignment lengths (top) and match probabilities (bottom) of top 3 candidates reported by Word-Blot and IgBlast are compared. For each Word-Blot candidate the corresponding IgBlast coordinate is that of the worst of the three reported by IgBlast for the same read. The two measures of accuracy for each gene type are shown (more forgiving measure in parentheses). Each gene mapping is color coded, greens are those that are in some way “at least as good as IgBlast candidates” and reds are the rest. Average mapping time for each read is 0.7 second.

## (§7) Structrual Variant Calling¶

experiments.blot_sv.call_svs(ref, reads, margin=50, K_min=200, p_min=0.8, **WB_kw)[source]

Calls structural variants based on single read mappings. For each read, the algorithm proceeds as follows: Schematic representation of the chain graphs used to call structural variants. Each read (vertical axis) is compared against the reference sequence (horizontal axis) and in each case two chain graphs are created on each of the read and reference axes where two similar segments are connected with an edge if their projections on the corresponding axis can be consistently chained.

• all local similarities are found via Word-Blot (biseqt.blot.WordBlotLocalRef.similar_segments() with given $$K_{\min}, p_{\min}$$).
• Two chain graphs are built based on the projections of similarities on the read and on the reference genome, henceforth the read graph and the reference graph.
• Reads containining SVs are identified using the following rules:
• Normal reads are characterized by a shortest path in both the read and reference graphs with exactly two edges between the start and the end of projections.
• deletions and duplications produce shortest paths with four edges in the read graph.
• insertions produce shortest paths with four edges in the reference graph.
experiments.blot_sv.exp_structural_variants()[source]

Performance of Word-Blot in detecting structural variations (SV) in simulations:

• Given a single read containing part of a SV and a reference sequence, one can deduce the presence of a SV from the topological arrangement of local similarities. The simple algorithm used here, call_svs(), simply pays attention to shortest paths in either of two directed graphs obtained by chaining projected similarities on either the reference or the read sequences.
• For each of three copy number variation SVs, insertion, deletion, duplication, similar segments detected by Word-Blot are used to detect whether each read contains an SV. In each case the true positive rate and false positive rate are plotted as a function of similarity between individual and reference sequences.

Supported Claims

• The simple topological algorithm in call_svs() can accurately distinguish normal reads from those containing a SV.
• The crux of our simple, single-read SV calling algorithm is that, roughly, SVs can be detected in a single read whenever there are more than one read/reference local similarities that, when chained, span the entire read (for duplication and deletion) or a whole region on reference (for insertion). This shows that Word-Blot accurately recovers local similarities at their maximal length without producing unnecessary fragmentation.
• Since Word-Blot is a general purpose local similarity search and thus makes no assumption of collinearity it can accurately identify simple SVs (copy number variations) by simply looking at mapping of individual reads to a reference sequence. True positive rate (solid) and false positive rate (dashed) for discriminating normal reads from those containing evidence for SV, duplications (left), deletion (middle), insertion (right) using the simple algorithm based on Word-Blot local similarities. SV length is 500nt, reference sequence length is 2500nt, sequencing coverage is 10, word length is 8, and each condition (match probability and SV type) is repeated n=5 times, average computation time for each read is 0.8 seconds.

## (§8) Transposable Element Annotation¶

experiments.blot_te.exp_transposable_elements()[source]

Comparison between Word-Blot, Blast, and Hmmer in identifying simulated transposable elements (TE). TEs of length 1000nt are simulated (20 TEs), 20 genomes are simulated, each containing mutated versions of 10 TEs with varying match probabilities. Each algorithm (Blast or Word-Blot) is used with word length 8, minimum percent identity %60, and minimum length 500nt to identify local similarities between all known TEs and the genomic sequences and a TE is called if a local similarity of above properties is found between a genome and a TE. Example blastn command is:

$makeblastdb -dbtype nucl -in TE.fa -out blast/te.db$ rmblast -gapextend 2 \  # necessary for rmblast to not crash!
-db blast/te.db \
-word_size 8 \
-evalue 1e3 \
-outfmt '6 qseqid sseqid length pident' \
-query 'genome[0.69].fa'


and example nhmmer command is:

\$ nhmmer --qformat fasta \
--tblout /dev/stdout -o /dev/null \
TE.fa 'genome[0.69.fa]'


Supported Claims

• Word-Blot greatly outperforms rmblast (blastn tuned for RepeatMasker) in sensitivity (with roughly 5x time) while maintaining reasonable specificity (>%98) specifically when identifying distant TE homologos sequences (between %40 and %70 identity).
• Word-Blot outperforms nhmmer in speed (more than 2x) and to some extent in sensitivity (>%10) when identifying distant TEs. Sensitivity (left), specificity (middle) and computation time (right) for Blast (blue), nhmmer (green), and Word-Blot (black) for varying similarities (horizontal axes) between simulated TEs and their occurences in simulated genomes.

## (§9) Word-Blot Performance (multiple)¶

experiments.multiple_sequence.exp_biological_multiple_sequences()[source]

Multiple sequence similarities with Word-Blot on biological data.

Supported Claimes

• Word-Blot estimated match probabilities are good classifiers for homologous and non-homologous seeds in multiple sequences. Multiple sequence comparison on Beta Actin (ACTB) gene on the 7 vertebrate USCS dataset (excluding monodelphis domestica in which the gene is split between chromosomes 1 and 6) with word length 6, minimum similarity length 100nt, and minimum match probability 0.75. For an arbitrary group of 3 sequences a dot plot similar to those of pairwise comparisons is shown (left). Seeds are color coded by intensity to represent their estimated match probability, blue ligns show local similarities identified by Word-Blot. The performance of estimated match probabilities in the full data set to discriminate between homologous and non-homologous seeds is shown by an ROC curve (right). Same as figure above but with the IRX1 data set from Ensembl (excluding rattus norvegicus whose gene copy is much shorter than the rest) with word length 10.

experiments.multiple_sequence.exp_simulated_K_p()[source]

Performance of Word-Blot and associated statistical scores for multiple local similarity search in simulations:

• z-scores assigned to diagonal n-dimensional volumes with and without similarities under the corresponding limiting Normal distributions ($$H0, H1$$), for varying similarity lengths and match probabilities.
• Estimated coordinates, lengths and match probabilities of local similarities for varying similarity lengths and match probabilities. Same procedure is used as in pairwise comparison simulations (i.e. trials containing a similarity consists of multiple sequences of length $$2K$$ whose substrings at positions $$[\frac{K}{2}, \frac{3K}{2}]$$ are homologies of length $$K$$ and match probability $$p$$ in the multiple sequence sense).
• Word lengths are automatically adjusted to ensure effectively linear growth of complexity with respect to sequence lengths.

Supported Claims

• z-scores calculated by biseqt.blot.WordBlotMultiuple.score_num_seeds() against the limiting normal distributions for multiple sequences are stable and comparable across different similarity lengths and match probabilities; thus both scores are reliable statistics.
• Local similarities found by Word-Blot as per biseqt.blot.WordBlotMultiple.similar_segments() are accurate in length, estimated match probability, and coordinates.
• Adjusting kmer word lengths in logarithmic relationship with sequence length maintains effectively linear complexity and up to sequences of length 6.4 kbp behaves well. Z-scores of the number of seeds in diagonal volumes for related (solid lines) and unrelated (dashed lines) multiple sequence (N=6) segments of varying lengths, calculated against the limiting distribution for unrelated pairs of sequences (left) and related pairs (right) as a function of similarity length (top) and match probability (bottom), n=50 samples. Note that, as desired, H0 score is stable for unrelated sequences of any length and H1 score is stable for related sequences of any length or match probability. Estimated length (top left), match probability (bottom left), diagonal position (top right) and antidiagonal position (bottom right) of multiple (N=6) local similarities, word length is adapted to sequence lengths, n=20 samples, shaded regions indicate one standard deviation. Dashed black lines are ground truth for comparison. For length estimates both the sum of the lengths of all reported similar segments (solid) and the length of the longest similar segment (dashed) are shown. Estimated match probablity and Length estimates for unrelated pairs are also shown (dotted). Estimated match probability, diagonal and antidiagonal coordinates correspond to the longest local similarity. Time to find and score all multiple sequence (N=6) seeds as a function of varying similarity length (left) and varying match probability (right), n=20 samples, shaded regions indicated one standard deviation. Word lengths are logarithmically adjusted to sequence length.