Using neural-network based paragraph embeddings for the calculation of within and between document similarities
Abstract
Science mapping with document networks often rests on the implicit assumption that scientific papers are indivisible units with unique links to neighbouring documents. Research on proximity in co-citation analysis, together with studies of the lexical properties of sections and citation contexts, indicates that this assumption does not always hold. Moreover, the meaning of words and co-words depends on the context in which they appear. This study proposes using a neural network architecture for word and paragraph embeddings (Doc2Vec) to measure similarity among these smaller units of analysis. It is shown that paragraphs in the “Introduction” and “Discussion” sections are more similar to the abstract, and that the similarity among paragraphs is related, though not linearly, to the distance between them. The “Methodology” section is the least similar to the other sections. Abstracts of citing-cited document pairs are more similar than those of random pairs, and the context in which a reference appears is most similar to the abstract of the cited document. This novel approach, with its higher granularity, can be used for bibliometric-aided retrieval and can assist in measuring interdisciplinarity through the application of network-based centrality measures.
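As a rough illustration of how similarity between such smaller units might be computed, the sketch below compares paragraph vectors with cosine similarity. The abstract does not name the similarity measure, so cosine is an assumption here, and the three-dimensional vectors are hypothetical placeholders standing in for embeddings that a trained Doc2Vec model would produce. They are not results from the study.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors: no direction, no similarity
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors (illustration only); in practice each would be
# the embedding of one paragraph or abstract inferred by a Doc2Vec model.
embeddings = {
    "abstract": [0.9, 0.1, 0.2],
    "introduction_p1": [0.8, 0.2, 0.3],
    "methodology_p1": [0.1, 0.9, 0.4],
    "discussion_p1": [0.7, 0.1, 0.4],
}


def similarity_to(reference, vectors):
    """Similarity of every other unit to the reference unit."""
    ref = vectors[reference]
    return {
        name: cosine_similarity(ref, vec)
        for name, vec in vectors.items()
        if name != reference
    }


scores = similarity_to("abstract", embeddings)
# Print the units from most to least similar to the abstract.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The same function applies unchanged to within-document pairs (paragraph to paragraph) and between-document pairs (for example, a citation context and the abstract of the cited document), provided all vectors come from the same embedding model.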