Web mining in soft computing framework: relevance, state of the art and future directions

IEEE Transactions on Neural Networks - Volume 13, Issue 5 - Pages 1163-1177 - 2002
S.K. Pal (1), V. Talwar (2), P. Mitra (1)
(1) Indian Statistical Institute, Calcutta, India
(2) Department of Computer Science, Netaji Subhas Institute of Technology, New Delhi, India

Abstract

The paper summarizes the characteristics of Web data, the basic components and types of Web mining, and the current state of the art. It explains why Web mining is considered a field separate from data mining. The limitations of some existing Web mining methods and tools are set out, and the significance of soft computing (comprising fuzzy logic (FL), artificial neural networks (ANNs), genetic algorithms (GAs), and rough sets (RSs)) is highlighted. A survey of the existing literature on "soft Web mining" is provided, along with the commercially available systems. The prospective areas of Web mining where the application of soft computing needs immediate attention are outlined with justification. The scope for future research in developing "soft Web mining" systems is explained. An extensive bibliography is also provided.

Keywords

#Web mining #Fuzzy logic #Artificial intelligence #Data mining #Artificial neural networks #Genetic algorithms #Computer networks #Rough sets #Information retrieval #Search engines
