Time-varying graph representation learning via higher-order skip-gram with negative sampling
Springer Science and Business Media LLC - Volume 11 - Pages 1-21 - 2022
Simone Piaggesi, André Panisson
Representation learning models for graphs are a successful family of techniques that project nodes into feature spaces that can be exploited by other machine learning algorithms. Since many real-world networks are inherently dynamic, with interactions among nodes changing over time, these techniques can be defined both for static and for time-varying graphs. Here, we show how the skip-gram embedding approach can be generalized to perform implicit tensor factorization on different tensor representations of time-varying graphs. We show that higher-order skip-gram with negative sampling (HOSGNS) is able to disentangle the role of nodes and time, with a small fraction of the number of parameters needed by other approaches. We empirically evaluate our approach using time-resolved face-to-face proximity data, showing that the learned representations outperform state-of-the-art methods when used to solve downstream tasks such as network reconstruction. Good performance on predicting the outcome of dynamical processes such as disease spreading shows the potential of this method to estimate contagion risk, providing early risk awareness based on contact tracing data.
Privacy-by-design in big data analytics and social mining
Springer Science and Business Media LLC - Volume 3 - Pages 1-26 - 2014
Anna Monreale, Salvatore Rinzivillo, Francesca Pratesi, Fosca Giannotti, Dino Pedreschi
Privacy is an ever-growing concern in our society and is becoming a fundamental aspect to take into account when one wants to use, publish and analyze data involving sensitive personal information. Unfortunately, it is increasingly hard to transform the data in a way that protects sensitive information: we live in the era of big data, characterized by unprecedented opportunities to sense, store and analyze social data describing human activities in great detail and resolution. As a result, privacy preservation simply cannot be accomplished by de-identification alone. In this paper, we propose the privacy-by-design paradigm to develop technological frameworks for countering the threats of undesirable, unlawful effects of privacy violation, without obstructing the knowledge discovery opportunities of social mining and big data analytical technologies. Our main idea is to inscribe privacy protection into the knowledge discovery technology by design, so that the analysis incorporates the relevant privacy requirements from the start.
Higher-order structures of local collaboration networks are associated with individual scientific productivity
Springer Science and Business Media LLC - 2024
Wenlong Yang, Yang Wang
The prevalence of teamwork in contemporary science has raised new questions about collaboration networks and their potential impact on research outcomes. Previous studies primarily focused on pairwise interactions between scientists when constructing collaboration networks, potentially overlooking group interactions among scientists. In this study, we introduce a higher-order network representation from algebraic topology, the simplicial complex, to capture multi-agent interactions. Our main objective is to investigate the influence of higher-order structures in local collaboration networks on the productivity of the focal scientist. Leveraging a dataset comprising more than 3.7 million scientists from the Microsoft Academic Graph, we uncover several intriguing findings. Firstly, we observe an inverted U-shaped relationship between the number of disconnected components in the local collaboration network and scientific productivity. Secondly, there is a positive association between the presence of higher-order loops and individual scientific productivity, indicating the intriguing role of higher-order structures in advancing science. Thirdly, these effects hold across various scientific domains and scientists with different impacts, suggesting strong generalizability of our findings. The findings highlight the role of higher-order loops in shaping the development of individual scientists, and thus may have implications for nurturing scientific talent and promoting innovative breakthroughs.
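As an illustration of the first quantity mentioned in this abstract, the number of disconnected components in a scientist's local collaboration network could be counted along the following lines. This is a minimal sketch and not the authors' method: it assumes each paper is given as a set of author identifiers, that the local network consists of the focal scientist's coauthors with the focal scientist removed, and that two coauthors are linked when they share any paper. The function name `local_components` and the toy data are hypothetical. Detecting higher-order loops would require homology computations that are not shown here.

```python
def local_components(papers, focal):
    """Count connected components among the focal scientist's coauthors.

    Assumption (not from the abstract): the focal scientist is removed,
    and coauthors are linked if they appear together on any paper.
    """
    # Coauthors are everyone who shares at least one paper with the focal author.
    coauthors = set()
    for authors in papers:
        if focal in authors:
            coauthors |= set(authors) - {focal}

    # Union-find over the coauthors.
    parent = {c: c for c in coauthors}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Every paper links all coauthors that appear on it.
    for authors in papers:
        members = [a for a in authors if a in coauthors]
        for other in members[1:]:
            union(members[0], other)

    return len({find(c) for c in coauthors})


# Hypothetical toy data: each set is the author list of one paper.
papers = [{"A", "B", "C"}, {"A", "D"}, {"D", "E"}, {"A", "F"}, {"F", "D"}]
n_components = local_components(papers, "A")
```

In this toy example the coauthors of "A" are B, C, D and F; B and C share a paper, as do D and F, so the local network splits into two components.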
Linking Twitter and survey data: asymmetry in quantity and its impact
Springer Science and Business Media LLC - Volume 10, Issue 1 - 2021
Tarek Al Baghal, Alexander Wenz, Luke Sloan, Curtis Jessop
Linked social media and survey data have the potential to be a unique source of information for social research. While the potential usefulness of this methodology is widely acknowledged, very few studies have explored methodological aspects of such linkage. Respondents produce planned amounts of survey data, but highly variable amounts of social media data. This study explores this asymmetry by examining the amount of social media data available to link to surveys. The extent of variation in the amount of data collected from social media could affect the ability to derive meaningful linked indicators and could introduce possible biases. Linked Twitter data from respondents to two longitudinal surveys representative of Great Britain, the Innovation Panel and the NatCen Panel, show that there is indeed substantial variation in the number of tweets posted and the number of followers and friends respondents have. Multivariate analyses of both data sources show that only a few respondent characteristics have a statistically significant effect on the number of tweets posted, with the number of followers being the strongest predictor of posting in both panels, women posting less than men, and some evidence that people with higher education post less, but only in the Innovation Panel. We use sentiment analyses of tweets to provide an example of how the amount of Twitter data collected can impact outcomes using these linked data sources. Results show that the number of negatively coded tweets is related to general happiness, whereas the number of positively coded tweets is not. Taken together, the findings suggest that the amount of data collected from social media which can be linked to surveys is an important factor to consider and indicate the potential for such linked data sources in social research.
Design and analysis of tweet-based election models for the 2021 Mexican legislative election
Springer Science and Business Media LLC - Volume 12 - Pages 1-17 - 2023
Alejandro Vigna-Gómez, Javier Murillo, Manelik Ramirez, Alberto Borbolla, Ian Márquez, Prasun K. Ray
Modelling and forecasting real-life human behaviour using online social media is an active endeavour of interest in politics, government, academia, and industry. Since its creation in 2006, Twitter has been proposed as a potential laboratory that could be used to gauge and predict social behaviour. During the last decade, the user base of Twitter has been growing and becoming more representative of the general population. Here we analyse this user base in the context of the 2021 Mexican Legislative Election. To do so, we use a dataset of 15 million election-related tweets in the six months preceding election day. We explore different election models that assign political preference to either the ruling parties or the opposition. We find that models using data with geographical attributes determine the results of the election with better precision and accuracy than conventional polling methods. These results demonstrate that analysis of public online data can outperform conventional polling methods, and that political analysis and general forecasting would likely benefit from incorporating such data in the immediate future. Moreover, the same Twitter dataset with geographical attributes is positively correlated with results from official census data on population and internet usage in Mexico. These findings suggest that we have reached a period in time when online activity, appropriately curated, can provide an accurate representation of offline behaviour.
Unveiling public perception of AI ethics: an exploration on Wikipedia data
Springer Science and Business Media LLC - 2024
Mengyi Wei, Yu Feng, Chuan Chen, Peng Luo, Chenyu Zuo, Liqiu Meng
Artificial Intelligence (AI) technologies have exposed more and more ethical issues while providing services to people. In most cases, it is challenging for people to recognize when AI ethical issues occur. The lower the public awareness, the more difficult it is to address AI ethical issues. Many previous studies have explored public reactions and opinions on AI ethical issues through questionnaires and social media platforms like Twitter. However, these approaches primarily focus on categorizing popular topics and sentiments, overlooking the public’s potential lack of knowledge underlying these issues. Few studies have revealed the holistic knowledge structure of AI ethical topics and the relations among the subtopics. As the world’s largest online encyclopedia, Wikipedia encourages people to jointly contribute and share their knowledge by adding new topics and following a well-accepted hierarchical structure. Through public viewing and editing, Wikipedia serves as a proxy for knowledge transmission. This study aims to analyze how the public comprehends the body of knowledge of AI ethics. We adopted the community detection approach to identify the hierarchical community of the AI ethical topics, and further extracted the AI ethics-related entities, which are proper nouns, organizations, and persons. The findings reveal that the primary topics at the top-level community, most pertinent to AI ethics, predominantly revolve around knowledge-based and ethical issues. Examples include transitions from Information Theory to Internet Copyright Infringement. In summary, this study makes three contributions: (1) it presents the holistic knowledge structure of AI ethics, (2) it evaluates and improves the existing body of knowledge of AI ethics, and (3) it enhances public perception of AI ethics to mitigate the risks associated with AI technologies.
Instagram photos reveal predictive markers of depression
Springer Science and Business Media LLC - Volume 6 - Pages 1-12 - 2017
Andrew G Reece, Christopher M Danforth
Using Instagram data from 166 individuals, we applied machine learning tools to successfully identify markers of depression. Statistical features were computationally extracted from 43,950 participant Instagram photos, using color analysis, metadata components, and algorithmic face detection. Resulting models outperformed general practitioners’ average unassisted diagnostic success rate for depression. These results held even when the analysis was restricted to posts made before depressed individuals were first diagnosed. Human ratings of photo attributes (happy, sad, etc.) were weaker predictors of depression, and were uncorrelated with computationally-generated features. These results suggest new avenues for early screening and detection of mental illness.
Exposure to parks through the lens of urban mobility
Springer Science and Business Media LLC - Volume 11 - Pages 1-21 - 2022
Ariel Salgado, Ziyun Yuan, Inés Caridi, Marta C. González
This work presents a portable framework to estimate potential park demand and park exposure through bipartite weighted networks. We use mobility information and open spatial information. Mobility information comes in the form of daily activities sampled from a model based on Call Detail Records (CDR). Spatial information comprises parks represented through OpenStreetMap polygons and census tracts from the 2010 decennial US Census. The framework summarizes each city’s information into one bipartite weighted network with the link weights representing the number of potential visits to a park from each census tract on an average weekday. We compare park exposure and park demand in Greater Los Angeles and Greater Boston in a pre-pandemic scenario. The park exposure of a census tract is calculated as the number of parks surrounding the daily activities of its inhabitants. The demand of a park is calculated as the number of daily activities surrounding it. We find that the distributions of park exposure in the two cities have a similar shape, with Boston having a higher average. On the other hand, the distribution of park demand is very similar in both cities, although their park spatial distributions are different. We include racial/ethnic information from the Census to explore how park exposure connects tracts of different racial/ethnic groups. We associate parks with racial/ethnic groups based on the number of visitors from each group. Parks within minorities’ tracts are mostly used by majority groups. Finally, by detecting communities in the network, we find that park exposure connects the cities locally, linking parks to their nearby tracts. Furthermore, we find a significant spatial correlation between network communities and different racial/ethnic composition in Los Angeles. In this way, patterns of park exposure reproduce the separation among demographic groups of the city.
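The bipartite weighted network described in this abstract can be pictured with a small sketch. It is a simplified reading under stated assumptions, not the authors' implementation: input records are hypothetical `(tract, park, count)` triples of potential visits, park demand is taken as the total link weight arriving at a park, and tract exposure is taken as the number of distinct parks linked to a tract. The function names and toy identifiers are illustrative only.

```python
from collections import defaultdict


def build_bipartite(visits):
    """Aggregate (tract, park, count) records into weighted tract-park links."""
    weights = defaultdict(float)
    for tract, park, count in visits:
        weights[(tract, park)] += count
    return dict(weights)


def park_demand(weights):
    """Assumed demand measure: total link weight arriving at each park."""
    demand = defaultdict(float)
    for (_tract, park), w in weights.items():
        demand[park] += w
    return dict(demand)


def tract_exposure(weights):
    """Assumed exposure measure: number of distinct parks linked to each tract."""
    parks = defaultdict(set)
    for (tract, park), w in weights.items():
        if w > 0:
            parks[tract].add(park)
    return {tract: len(linked) for tract, linked in parks.items()}


# Hypothetical toy records: potential visits from census tracts to parks.
visits = [("T1", "P1", 3), ("T1", "P2", 1), ("T2", "P1", 2), ("T1", "P1", 1)]
weights = build_bipartite(visits)
demand = park_demand(weights)
exposure = tract_exposure(weights)
```

With these toy records, tract T1 is linked to two parks and T2 to one, while park P1 accumulates the visits from both tracts.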