Scalable Algorithms for String Kernels with Inexact Matching Scalable
Pavel P. Kuksa, Pai-Hsi Huang, Vladimir Pavlovic
GOAL: fast & accurate inexact string matching for classification.

Poster

W23

Sequence data unordered

Theoretical results Benefits  New algorithmic approach for string kernels  Alphabet-size independent with inexact matching (mismatch, gapped) O(km||mn)  O(ck,mn)  Improves complexity bounds over existing  Applies to mismatch, gapped, substring algorithms kernels  Excellent performance on Empirical results  Protein classification  Several orders of magnitude speed-up  Music Genre Recognition  Explore more complex matching processes  T xt classification e with improved prediction accuracy  Large datasets (>1M, semi-supervised)
Relative running time vs. alphabet size 10,000x faster for large (||=1000) alphabets
Protein fold prediction (multi-class)

features ordered

index

Original Relaxe d Supervised 46.7 8 52.8 7 15% better than state-of-the-art 30% better than profile kernel Semi-su pervised 70.8 7 77.5 1

features

index

Speed-up [x]:

Protein DNA Text 100 75 >106

Music 104

Remote homology detection

similarity matrix
Music genre classification (multi-class)

Original Relaxed Supervised 4 1 .9 2 5 2 .0 0 Semi-supervised 81.05 86.3 2

Original Relaxed

5 7 .4 6 4 .4

http://seqam.rutgers.edu/projects/new-inexact