Structural XML Classification in Concept Drifting Data Streams

New Generation Computing - Tập 33 - Trang 345-366 - 2015
Dariusz Brzezinski1, Maciej Piernik1
1Institute of Computing Science, Poznan University of Technology, Poznan, Poland

Tóm tắt

Classification of large, static collections of XML data has been intensively studied in the last several years. Recently however, the data processing paradigm is shifting from static to streaming data, where documents have to be processed online using limited memory and class definitions can change with time in an event called concept drift. As most existing XML classifiers are capable of processing only static data, there is a need to develop new approaches dedicated for streaming environments. In this paper, we propose a new classification algorithm for XML data streams called XSC. The algorithm uses incrementally mined frequent subtrees and a tree-subtree similarity measure to classify new documents in an associative manner. The proposed approach is experimentally evaluated against eight state-of-the-art stream classifiers on real and synthetic data. The results show that XSC performs significantly better than competitive algorithms in terms of accuracy and memory usage.

Tài liệu tham khảo

Baena-García, M., del Campo-Ávila, J., Fidalgo, R., Bifet, A., Gavaldá, R. and Morales-Bueno, R., “Early drift detection method,” in Fourth International Workshop on Knowledge Discovery from Data Streams, 2006. Bifet A. et al.: “MOA: Massive Online Analysis,”. J. Mach. Learn. Res. 11, 1601–1604 (2010) Bifet, A. and Gavaldà, R., “Learning from time-changing data with adaptive windowing,” in SDM (Skillicorn, D., Liu, B., Apte, C. and Parthasarathy, S. eds.), SIAM, 2007. Bifet, A. and Gavalda, R. “Adaptive xml tree classification on evolving data streams,” in Proc. ECML/PKDD 2009 (1), LNCS, 5781, pp. 147–162, 2009. Bifet, A., Holmes, G. and Pfahringer, B., “Leveraging bagging for evolving data streams,” in ECML/PKDD (1), pp. 135–150, 2010. Bouchachia, A. and Hassler, M., “Classification of XML Documents,” in CIDM 2007 Proceedings, pp. 390–396, 2007. Brzezinski, D. and Piernik, M., “Adaptive XML stream classification using partial tree-edit distance,” in Foundations of Intelligent Systems - 21st International Symposium, ISMIS 2014, Roskilde, Denmark, June 25-27, 2014. Proceedings, ser. LNCS (Andreasen, T., Christiansen, H., Talavera, J. C. C. and Ras, Z. W. eds.), 8502, Springer, pp. 10–19, 2014. Brzezinski D., Stefanowski J.: “Combining block-based and online methods in learning ensembles from concept drifting data streams,”. Information Sciences 265, 50–67 (2014) Brzezinski D., Stefanowski J.: “Reacting to different types of concept drift: The accuracy updated ensemble algorithm,”. IEEE Trans. on Neural Netw. Learn. Syst. 25(1), 81–94 (2014) Cohen E., Strauss M. J.: “Maintaining time-decaying stream aggregates,”. J. Algorithms 59(1), 19–36 (2006) Costa, G., Ortale, R. and Ritacco, E., “X-class: Associative classification of XML documents by structure,” ACM Trans. Inf. Syst., 31, 1, 3, pp. 1–40, 2013. De Vries, C. M. et al., “Overview of the inex 2010 xml mining track : clustering and classification of xml documents,” in Initiative for the Evaluation of XML Retrieval (INEX) 2010, 2011. Demsar J.: “Statistical comparisons of classifiers over multiple data sets,”. J. Machine Learning Research 7, 1–30 (2006) Domingos, P. and Hulten, G., “Mining high-speed data streams,” in Proc. 6th ACM SIGKDD Int. Conf. Knowl. Disc. Data Min., pp. 71–80, 2000. Elwell R., Polikar R.: “Incremental learning of concept drift in nonstationary environments,”. IEEE Trans. Neural Netw. 22(10), 1517–1531 (2011) Gama, J., Medas, P., Castillo, G. and Rodrigues, P., “Learning with drift detection,” in Proc. 17th SBIA Brazilian Symp. Art. Intel., pp. 286–295, 2004. Gama, J., Knowledge Discovery from Data Streams, Chapman and Hall, 2010. Gama J., Sebastião R., Rodrigues P. P.: “On evaluating stream learning algorithms,”. Machine Learning 90(3), 317–346 (2013) Gama, J., Zliobaite, I., Bifet, A., Pechenizkiy, M. and Bouchachia A., “A survey on concept drift adaptation,” ACM Computing Surveys, 46, 4, 2014. Garboni, C. et al., “Sequential Pattern Mining for Structure-based XML Document Classification,” in INEX 2005 Proceedings, pp. 458–468, 2006. Hulten, G., Spencer, L. and Domingos, P., “Mining time-changing data streams,” in KDD, pp. 97–106, 2001. Kolter J. Z., Maloof M. A.: “Dynamic weighted majority: An ensemble method for drifting concepts,”. J. Mach. Learn. Res. 8, 2755–2790 (2007) Kuncheva, L. I., “Classifier ensembles for changing environments,” in Proc. 5th MCS Int. Workshop on Mult. Class. Syst., ser. LNCS, 3077, Springer, pp. 1–15, 2004. Kuncheva, L. I., “Classifier ensembles for detecting concept change in streaming data: Overview and perspectives,” in 2nd Workshop SUEMA 2008 (ECAI 2008), pp. 5–10, 2008. Mayorga, V. and Polyzotis, N., “Sketch-based summarization of ordered XML streams,” in ICDE (Ioannidis, Y. E., Lee, D. L. and Ng, R. T. eds.), IEEE, pp. 541–552, 2009. Nayak, R. et al., “Overview of the INEX 2009 XML Mining Track: Clustering and Classification of XML Documents,” in Focused Retrieval and Evaluation, 6203, pp. 366–378, 2010. Oza, N. C. and Russell, S. J., “Experimental comparisons of online and batch versions of bagging and boosting,” in Proc. 7th ACM SIGKDD Int. Conf. Knowl. Disc. Data Min., pp. 359–364, 2001. Page, E. S., “Continuous inspection schemes,” Biometrika, 41, 1/2, pp. 100–115, 1954. Piernik M., Brzezinski D., Morzy T., Lesniewska A.: “XML clustering: A review of structural approaches,”. The Knowledge Engineering Review 30(03), 297–323 (2015) Piernik, M. and Morzy, T., “Partial tree-edit distance,” Poznan University of Technology, Tech. Rep. RA-10/2013, 2013, available at: http://www.cs.put.poznan.pl/mpiernik/publications/PTED.pdf. Ross G.J., Adams N. M., Tasoulis D. K., Hand D. J.: “Exponentially weighted moving average charts for detecting concept drift,”. Pattern Recognition Letters 33(2), 191–198 (2012) Street, W. N. and Kim, Y., “A streaming ensemble algorithm (SEA. for largescale classification,” in Proc. 7th ACM SIGKDD Int. Conf. Knowl. Disc. Data Min. New York, NY, USA: ACM Press, pp. 377–382, 2001. Thasleena N., Varghese S.: “Enhanced associative classification of XML documents supported by semantic concepts,”. Procedia Computer Science 46, 194–201 (2015) Wang, H. et al., “Mining concept-drifting data streams using ensemble classifiers,” in Proc. 9th ACM SIGKDD Int. Conf. Knowl. Disc. Data Min., pp. 226–235, 2003. Yang, J. and Wang, S., “Extended VSM for XML Document Classification Using Frequent Subtrees,” in INEX 2009 Proceedings, pp. 441–448, 2010. Zaki M.J., Aggarwal C.C.: “XRules: An effective algorithm for structural classification of xml data,”. Machine Learning 62(1-2), 137–170 (2006)