SEQL (SEQuence Learner) is an implementation of a greedy coordinate-wise gradient descent technique for efficiently learning sequence classifiers. The theoretical framework focuses on discriminative sequence classification, where linear classifiers work directly in the rich (but very high dimensional) predictor space of all subsequences in the training set, as opposed to string-kernel-induced feature spaces. This is computationally challenging, in particular if subsequences with wildcards are allowed as features. It is made feasible by a greedy coordinate-descent algorithm coupled with a bound on the magnitude of the gradient, which allows discriminative subsequences to be selected efficiently. In our KDD11 paper (see References below) we characterize the loss functions to which our generic learning algorithm can be applied and present concrete implementations for logistic regression (binomial log-likelihood loss) and support vector machines (squared hinge loss). Applied to protein remote homology detection and remote fold recognition, this learning framework achieves classification quality comparable to that of state-of-the-art methods (e.g., kernel support vector machines). Unlike those methods, the resulting classification models are linear (i.e., simply lists of weighted discriminative subsequences) and can thus be interpreted and related back to the target problem, an asset highly relevant for many applications.
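The idea of greedy coordinate descent over subsequence features can be illustrated with a small Python sketch. This is not SEQL's code or interface: it enumerates all contiguous substrings up to a hypothetical `max_len` by brute force, uses a fixed step size, and leaves out both the gradient bound and wildcard features that make the real method scale. The names `substrings`, `train`, and `score` are invented for this illustration.

```python
import math
from collections import defaultdict

def substrings(seq, max_len):
    """All distinct contiguous substrings of seq with 1..max_len characters."""
    return {seq[i:i + n] for n in range(1, max_len + 1)
            for i in range(len(seq) - n + 1)}

def train(seqs, labels, max_len=3, iters=20, step=0.5):
    """Greedy coordinate descent on the logistic loss; labels are +1 or -1.

    Assumption: brute-force search over every feature in each iteration,
    instead of the bounded search described in the text.
    """
    # Inverted index: feature -> indices of the training sequences containing it.
    index = defaultdict(list)
    for i, s in enumerate(seqs):
        for f in substrings(s, max_len):
            index[f].append(i)

    weights = defaultdict(float)
    margins = [0.0] * len(seqs)  # margins[i] = labels[i] * (w . x_i)
    for _ in range(iters):
        # Derivative of log(1 + exp(-m)) with respect to the margin m.
        slope = [-1.0 / (1.0 + math.exp(m)) for m in margins]
        # Gradient of the total loss with respect to each feature weight.
        grads = {f: sum(labels[i] * slope[i] for i in idx)
                 for f, idx in index.items()}
        # Greedy choice: the coordinate with the largest gradient magnitude.
        best = max(grads, key=lambda f: abs(grads[f]))
        delta = -step * grads[best]
        weights[best] += delta
        for i in index[best]:
            margins[i] += labels[i] * delta
    return dict(weights)

def score(model, seq):
    """Linear score: sum of the weights of model features occurring in seq."""
    return sum(w for f, w in model.items() if f in seq)
```

Calling `train(["ACGGT", "TGGCA", "ACCTA", "TTCAA"], [1, 1, -1, -1])` returns a dictionary mapping a handful of substrings to weights, and `score` is positive for sequences the model assigns to the positive class. The model stays readable because only the coordinates that were actually selected appear in it.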
:: REQUIREMENTS
- C++ compiler
- POSIX getopt library
:: MORE INFORMATION
G. Ifrim, C. Wiuf:
“Bounded Coordinate-Descent for Biological Sequence Classification in High Dimensional Predictor Space”,