Automating Content Extraction of HTML Documents

Springer Science and Business Media LLC - Volume 8 - Pages 179-224 - 2005
Suhit Gupta (1), Gail E. Kaiser (1), Peter Grimm, Michael F. Chiang (2), Justin Starren (3)
(1) Department of Computer Sciences, Columbia University, New York, USA
(2) Departments of Ophthalmology and Biomedical Informatics, Columbia University, New York, USA
(3) Departments of Biomedical Informatics and Radiology, Columbia University, New York, USA

Abstract

Web pages often contain clutter (such as unnecessary images and extraneous links) around the body of an article that distracts the user from the actual content. Extracting “useful and relevant” content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, and text summarization. Most approaches to making content more readable involve changing the font size or removing HTML and data components such as images, which takes away from a webpage’s inherent look and feel. Unlike “Content Reformatting,” which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses “Content Extraction.” We have developed a framework that employs an easily extensible set of techniques and incorporates the advantages of previous work on content extraction. Our key insight is to work with DOM trees, a W3C-specified interface that allows programs to dynamically access document structure, rather than with raw HTML markup. We have implemented our approach in a publicly available Web proxy that extracts content from HTML web pages. The proxy can be used either centrally, administered for groups of users, or by individuals for their personal browsers. After receiving user feedback on the proxy, we also created a revised version with improved performance and accessibility in mind.
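
To make the DOM-based idea concrete, the following is a minimal sketch of pruning clutter from a document tree instead of pattern-matching on raw markup. It is not the authors' proxy: the abstract does not list the filters, so the set of clutter tags, the link-ratio threshold, and the function names (text_of, link_ratio, prune, extract_content) are assumptions made for illustration. It uses Python's standard xml.dom.minidom, which only accepts well-formed XHTML; real web pages would first need a tolerant HTML parser.

```python
from xml.dom import minidom

# Hypothetical filter settings, chosen only for illustration.
CLUTTER_TAGS = {"img", "script", "iframe"}
LINK_LIST_TAGS = {"ul", "ol", "table"}
MAX_LINK_RATIO = 0.5  # drop a list/table if more than half of its text is link text


def text_of(node):
    """Concatenate all text beneath a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def link_ratio(element):
    """Fraction of an element's text that sits inside <a> descendants."""
    total = len(text_of(element).strip())
    if total == 0:
        return 0.0
    linked = sum(len(text_of(a).strip()) for a in element.getElementsByTagName("a"))
    return linked / total


def prune(node):
    """Remove clutter elements from the subtree rooted at node, in place."""
    for child in list(node.childNodes):  # copy: we mutate while iterating
        if child.nodeType != child.ELEMENT_NODE:
            continue
        tag = child.tagName.lower()
        is_link_list = tag in LINK_LIST_TAGS and link_ratio(child) > MAX_LINK_RATIO
        if tag in CLUTTER_TAGS or is_link_list:
            node.removeChild(child)
        else:
            prune(child)


def extract_content(xhtml):
    """Parse well-formed XHTML, prune clutter, and return the remaining text."""
    document = minidom.parseString(xhtml)
    prune(document.documentElement)
    return " ".join(text_of(document.documentElement).split())


PAGE = """<html><body>
<ul><li><a href="/home">Home</a></li><li><a href="/sports">Sports</a></li></ul>
<img src="banner.gif" alt="ad"/>
<p>The article body stays, including an <a href="/ref">inline link</a>.</p>
<script>track();</script>
</body></html>"""

print(extract_content(PAGE))
```

Because each filter is a decision about a node and its subtree, a navigation list can be dropped as a unit while an inline link inside the article paragraph is kept, a distinction that is awkward to express against raw markup. Further filters could be added as extra conditions in prune.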
