Sequence-based classification using discriminatory motif feature selection

Hao Xiong, Daniel Capurso, Saunak Sen, Mark R. Segal

Research output: Contribution to journalArticle

8 Citations (Scopus)

Abstract

Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of length ≤, such that potentially important, longer (>k) predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning, a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters. A Python pipeline implementing the approach is available at http://www.epibiostat.ucsf.edu/biostat/sen/dmfs/.

Original languageEnglish (US)
Article numbere27382
JournalPloS one
Volume6
Issue number11
DOIs
StatePublished - Nov 10 2011

Fingerprint

Feature extraction
Classifiers
Boidae
Nucleosomes
Pipelines
Solubility
Python
nucleosomes
Software
protein solubility
Labels
Tuning
methodology
Proteins
prediction
Datasets

All Science Journal Classification (ASJC) codes

  • Biochemistry, Genetics and Molecular Biology(all)
  • Agricultural and Biological Sciences(all)

Cite this

Sequence-based classification using discriminatory motif feature selection. / Xiong, Hao; Capurso, Daniel; Sen, Saunak; Segal, Mark R.

In: PloS one, Vol. 6, No. 11, e27382, 10.11.2011.

Research output: Contribution to journalArticle

Xiong, Hao ; Capurso, Daniel ; Sen, Saunak ; Segal, Mark R. / Sequence-based classification using discriminatory motif feature selection. In: PloS one. 2011 ; Vol. 6, No. 11.
@article{a5dbfb4a9e524876a3ac91261284c7db,
title = "Sequence-based classification using discriminatory motif feature selection",
abstract = "Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of length ≤, such that potentially important, longer (>k) predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning, a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters. A Python pipeline implementing the approach is available at http://www.epibiostat.ucsf.edu/biostat/sen/dmfs/.",
author = "Hao Xiong and Daniel Capurso and Saunak Sen and Segal, {Mark R.}",
year = "2011",
month = "11",
day = "10",
doi = "10.1371/journal.pone.0027382",
language = "English (US)",
volume = "6",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "11",

}

TY - JOUR

T1 - Sequence-based classification using discriminatory motif feature selection

AU - Xiong, Hao

AU - Capurso, Daniel

AU - Sen, Saunak

AU - Segal, Mark R.

PY - 2011/11/10

Y1 - 2011/11/10

N2 - Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of length ≤, such that potentially important, longer (>k) predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning, a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters. A Python pipeline implementing the approach is available at http://www.epibiostat.ucsf.edu/biostat/sen/dmfs/.

AB - Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of length ≤, such that potentially important, longer (>k) predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning, a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters. A Python pipeline implementing the approach is available at http://www.epibiostat.ucsf.edu/biostat/sen/dmfs/.

UR - http://www.scopus.com/inward/record.url?scp=80755126015&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80755126015&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0027382

DO - 10.1371/journal.pone.0027382

M3 - Article

VL - 6

JO - PLoS One

JF - PLoS One

SN - 1932-6203

IS - 11

M1 - e27382

ER -