Scalable Algorithms for String Kernels with Inexact Matching Scalable Pavel P. Kuksa, Pai-Hsi Huang, Vladimir Pavlovic GOAL: fast & accurate inexact string matching for classification. Poster W23 Sequence data unordered Theoretical results Benefits New algorithmic approach for string kernels Alphabet-size independent with inexact matching (mismatch, gapped) O(km||mn) O(ck,mn) Improves complexity bounds over existing Applies to mismatch, gapped, substring algorithms kernels Excellent performance on Empirical results Protein classification Several orders of magnitude speed-up Music Genre Recognition Explore more complex matching processes T xt classification e with improved prediction accuracy Large datasets (>1M, semi-supervised) Relative running time vs. alphabet size 10,000x faster for large (||=1000) alphabets Protein fold prediction (multi-class) features ordered index Original Relaxe d Supervised 46.7 8 52.8 7 15% better than state-of-the-art 30% better than profile kernel Semi-su pervised 70.8 7 77.5 1 features index Speed-up [x]: Protein DNA Text 100 75 >106 Music 104 Remote homology detection similarity matrix Music genre classification (multi-class) Original Relaxed Supervised 4 1 .9 2 5 2 .0 0 Semi-supervised 81.05 86.3 2 Original Relaxed 5 7 .4 6 4 .4 http://seqam.rutgers.edu/projects/new-inexact