Automating Content Extraction of HTML Documents
Abstract
Web pages often contain clutter (such as unnecessary images and extraneous links) around the body of an article that distracts a user from the actual content. Extraction of “useful and relevant” content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, and text summarization. Most approaches to making content more readable involve changing font size or removing HTML and data components such as images, which takes away from a webpage’s inherent look and feel. Unlike “Content Reformatting,” which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses “Content Extraction.” We have developed a framework that employs an easily extensible set of techniques and incorporates the advantages of previous work on content extraction. Our key insight is to work with DOM trees, a W3C-specified interface that allows programs to dynamically access document structure, rather than with raw HTML markup. We have implemented our approach in a publicly available Web proxy that extracts content from HTML web pages. This proxy can be used either centrally, administered for groups of users, or by individuals with their personal browsers. After receiving user feedback on the proxy, we also created a revised version designed for improved performance and accessibility.
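To illustrate what working on a document tree rather than raw markup can look like, the following is a minimal sketch in Python and not the proxy's actual implementation. It builds a small DOM-like tree with the standard library's `html.parser`, then prunes it with two filters that are assumptions made for this example: one drops image, script, style, and iframe elements, and one drops container elements whose text is mostly link text. The names `Node`, `TreeBuilder`, `prune`, and `extract`, the tag sets, and the 0.5 link-ratio threshold are all hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}   # elements with no end tag
REMOVED_TAGS = {"img", "script", "style", "iframe"}        # assumed clutter tags
CONTAINER_TAGS = {"div", "td", "ul", "table"}              # assumed link-list hosts


class Node:
    """A tiny stand-in for a DOM element: a tag plus child nodes or text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # each child is a Node or a str


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML instead of operating on raw markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def text_len(node):
    """Total number of text characters under a node."""
    return sum(len(c) if isinstance(c, str) else text_len(c) for c in node.children)


def link_text_len(node):
    """Number of text characters under a node that sit inside <a> elements."""
    if node.tag == "a":
        return text_len(node)
    return sum(link_text_len(c) for c in node.children if isinstance(c, Node))


def prune(node, max_link_ratio=0.5):
    """Remove clutter subtrees in place (both filters are illustrative assumptions)."""
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
            continue
        if child.tag in REMOVED_TAGS:
            continue
        total = text_len(child)
        if (child.tag in CONTAINER_TAGS and total > 0
                and link_text_len(child) / total > max_link_ratio):
            continue  # mostly links: treat as a navigation list
        prune(child, max_link_ratio)
        kept.append(child)
    node.children = kept


def to_text(node):
    """Flatten the remaining tree to plain text."""
    parts = [c if isinstance(c, str) else to_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def extract(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    return to_text(builder.root)
```

Because each filter is a separate check inside `prune`, further filters could be added without touching the parsing step, which is one way to read the "easily extensible set of techniques" described above. A production system would more likely use a full, error-tolerant DOM parser than this simplified tree.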