An evaluation of approaches for using unlabeled data with domain adaptation

Nic Herndon1, Doina Caragea1
1Kansas State University, Manhattan, USA

Abstract

We consider the problem of adding a large unlabeled sample from the target domain to boost the performance of a domain adaptation algorithm when only a small set of labeled examples is available from the target domain. In particular, we consider the problem setting motivated by the tasks of splice site prediction and protein localization. For example, in splice site prediction, annotating a genome with machine learning requires a large amount of labeled data, whereas for non-model organisms only a small amount of labeled data is available alongside abundant unlabeled data. With domain adaptation, one can leverage the large amount of data from a related model organism, together with the labeled and unlabeled data from the organism of interest, to train a classifier for the latter. Our goal is to analyze three approaches to incorporating the unlabeled data: with soft labels only (i.e., Expectation-Maximization), with hard labels only (i.e., self-training), or with both soft and hard labels. We analyze them for splice site prediction and protein localization in particular, and more broadly for a general iterative domain adaptation setting. We provide empirical results on splice site prediction and protein localization indicating that using a combination of soft and hard labels performs as well as the best of the other two approaches to integrating unlabeled data.
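
To make the distinction between the three labeling strategies concrete, the following minimal sketch shows how class weights for unlabeled target examples might be derived from a classifier's posterior probabilities in one iteration. It is an illustration under stated assumptions, not the authors' implementation: the function name `label_weights`, the confidence `threshold`, and the rule used for the combined mode (hard labels for confident examples, soft labels for the rest) are all hypothetical choices made for this example.

```python
import numpy as np

def label_weights(posteriors, mode, threshold=0.9):
    """Turn class posteriors for unlabeled examples into training weights.

    posteriors: array of shape (n_examples, n_classes); each row sums to 1.
    mode: "soft" (EM-style), "hard" (self-training-style), or "both".
    threshold: assumed confidence cutoff for assigning a hard label.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    n_classes = posteriors.shape[1]
    # One-hot encoding of the most probable class for each example.
    one_hot = np.eye(n_classes)[posteriors.argmax(axis=1)]
    # An example is "confident" if its top posterior reaches the cutoff.
    confident = posteriors.max(axis=1) >= threshold

    if mode == "soft":
        # Every example contributes fractionally to every class.
        return posteriors.copy()
    if mode == "hard":
        # Only confident examples are used, each with a single class;
        # the remaining rows are all zeros (left out of this iteration).
        return one_hot * confident[:, None]
    if mode == "both":
        # Assumed combination: hard labels where confident, soft elsewhere.
        return np.where(confident[:, None], one_hot, posteriors)
    raise ValueError(f"unknown mode: {mode!r}")

example = [[0.95, 0.05], [0.60, 0.40]]
soft = label_weights(example, "soft")
hard = label_weights(example, "hard")
both = label_weights(example, "both")
```

In an iterative setting, weights like these would be fed back into the classifier's parameter estimates before the posteriors are recomputed; that loop is omitted here because the abstract does not specify it.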
