TACIT: An open-source text analysis, crawling, and interpretation tool

Springer Science and Business Media LLC - Volume 49 - Pages 538-547 - 2016

Anurag Singh¹, Joe Hoover¹, Niki Jitendra Parmar¹, Linda Pulickal¹, Aswin Rajkumar¹, Reihane Boghrati¹, Kate M. Johnson¹, Justin Garten¹, Yuvarani Shankar¹, Morteza Dehghani¹, Vijayan Balasubramanian¹

¹University of Southern California, Los Angeles, USA

Abstract

As human activity and interaction increasingly take place online, the digital traces of these activities provide a valuable window into a range of psychological and social processes. Much progress has been made in exploiting these opportunities; however, the complexity of managing and analyzing the volume of data now available has limited both the types of analysis used and the number of researchers able to make use of these data. Although fields such as computer science have developed many techniques and methods for handling these difficulties, using those tools has often required specialized knowledge and programming experience. The Text Analysis, Crawling, and Interpretation Tool (TACIT) is designed to bridge this gap by providing an intuitive tool and interface for applying state-of-the-art methods in text analysis and large-scale data management. Furthermore, TACIT is implemented as an open, extensible, plugin-based architecture, which will allow other researchers to extend and build on these capabilities as new methods are introduced.

Keywords

#text analysis #big data #computer science #open-source tools #data management
