HT '18: Proceedings of the 29th ACM Conference on Hypertext and Social Media


SESSION: Keynote I

Lessons in Search Data

I will discuss ways to use search data to better understand important topics in politics, health, abortion, child abuse, and sexuality.

SESSION: Session 1: Computational Social Science

Detecting the Correlation between Sentiment and User-level as well as Text-Level Meta-data from Benchmark Corpora

Do tweets from users with similar Twitter characteristics have similar sentiments? What meta-data features of tweets and users correlate with tweet sentiment? In this paper, we address these two questions by analyzing six popular benchmark datasets where tweets are annotated with sentiment labels. We consider user-level as well as tweet-level meta-data features, and identify patterns and correlations of these features with the log-odds for sentiment classes. We further strengthen our analysis by replicating this set of experiments on recent tweets from users present in our datasets, finding that most of the patterns are consistent across our analysis. Finally, we use our identified meta-data features as features for a sentiment classification algorithm, which results in around a 2% increase in F1 score for sentiment classification, compared to text-only classifiers, along with a significant drop in KL-divergence. These results have the potential to improve sentiment analysis applications on social media data.
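As an illustration of the kind of analysis described above (a minimal sketch, not the authors' code), the log-odds of a sentiment class conditioned on a binary meta-data feature can be computed as follows; the feature and label names are hypothetical.

```python
import math

def log_odds_ratio(tweets, feature, cls="positive"):
    """Log-odds ratio of sentiment class `cls` between tweets whose
    binary meta-data `feature` is set and those where it is not."""
    def odds(group):
        pos = sum(1 for t in group if t["sentiment"] == cls)
        neg = len(group) - pos
        return (pos + 0.5) / (neg + 0.5)  # Haldane-Anscombe smoothing

    with_f = [t for t in tweets if t[feature]]
    without_f = [t for t in tweets if not t[feature]]
    return math.log(odds(with_f) / odds(without_f))

# Toy annotated tweets; `verified` stands in for any user-level feature.
tweets = [
    {"verified": True,  "sentiment": "positive"},
    {"verified": True,  "sentiment": "positive"},
    {"verified": False, "sentiment": "negative"},
    {"verified": False, "sentiment": "positive"},
]
print(log_odds_ratio(tweets, "verified"))  # > 0: feature leans positive
```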

Mining and Forecasting Career Trajectories of Music Artists

Many musicians, from up-and-comers to established artists, rely heavily on performing live to promote and disseminate their music. To advertise live shows, artists often use concert discovery platforms that make it easier for their fans to track tour dates. In this paper, we ask whether digital traces of live performances generated on those platforms can be used to understand the career trajectories of artists. First, we present a new dataset we constructed by cross-referencing data from such platforms. We then demonstrate how this dataset can be used to mine and predict important career milestones for musicians, such as being signed to a major music label or performing at a certain venue. Finally, we perform a temporal analysis of the bipartite artist-venue graph, and demonstrate that high centrality in this graph is correlated with success.
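The centrality analysis in the last step can be sketched with networkx on a toy bipartite graph; the artist and venue names below are invented, and degree centrality stands in for whichever centrality measure the paper actually uses.

```python
import networkx as nx

# Toy bipartite artist-venue graph; an edge means the artist performed
# at the venue. Names are made up for illustration.
B = nx.Graph()
artists = ["artist_a", "artist_b", "artist_c"]
venues = ["venue_x", "venue_y"]
B.add_nodes_from(artists, bipartite=0)
B.add_nodes_from(venues, bipartite=1)
B.add_edges_from([("artist_a", "venue_x"), ("artist_a", "venue_y"),
                  ("artist_b", "venue_x"), ("artist_c", "venue_y")])

# Bipartite degree centrality on the artist side; the abstract's claim
# is that high centrality correlates with career success.
centrality = nx.bipartite.degree_centrality(B, artists)
for a in artists:
    print(a, round(centrality[a], 3))
```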

Predicting Twitter User Socioeconomic Attributes with Network and Language Information

Inferring socioeconomic attributes of social media users such as occupation and income is an important problem in computational social science. Automated inference of such characteristics has applications in personalised recommender systems, targeted computational advertising and online political campaigning. While previous work has shown that language features can reliably predict socioeconomic attributes on Twitter, employing information coming from users' social networks has not yet been explored for such complex user characteristics. In this paper, we describe a method for predicting the occupational class and the income of Twitter users given information extracted from their extended networks by learning a low-dimensional vector representation of users, i.e. graph embeddings. We use this representation to train predictive models for occupational class and income. Results on two publicly available datasets show that our method consistently outperforms the state-of-the-art methods in both tasks. We also obtain further significant improvements when we combine graph embeddings with textual features, demonstrating that social network and language information are complementary.
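A minimal sketch of such a pipeline, assuming a DeepWalk-style embedding (uniform random walks fed to a skip-gram model, via gensim 4.x) rather than the authors' exact method; networkx's karate-club graph stands in for a Twitter network, and its club labels stand in for occupational class.

```python
import random
import networkx as nx
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

def random_walks(G, num_walks=10, walk_len=20):
    """DeepWalk-style uniform random walks, one 'sentence' per walk."""
    walks = []
    for _ in range(num_walks):
        for node in G.nodes():
            walk = [node]
            while len(walk) < walk_len:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(random.choice(nbrs))
            walks.append([str(n) for n in walk])
    return walks

G = nx.karate_club_graph()                     # stand-in for a social network
model = Word2Vec(random_walks(G), vector_size=32, window=5,
                 min_count=1, sg=1, epochs=5)  # skip-gram node embeddings
X = [model.wv[str(n)] for n in G.nodes()]
y = [G.nodes[n]["club"] for n in G.nodes()]    # stand-in for occupational class
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", clf.score(X, y))
```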

SESSION: Session 2: ML and RecSys

Joint Distributed Representation of Text and Structure of Semi-Structured Documents

The majority of textual data on the web is in the form of semi-structured documents; the structural skeleton of such documents thus plays an important role in determining the semantics of the data content. The presence of structure sometimes allows us to write simple rules to extract such information, but this is not always possible due to flexibility in the structure and the frequency with which such structures are altered. In this paper, we propose joint modeling of text and the associated structure to effectively capture the semantics of semi-structured documents. The model simultaneously learns dense continuous representations for word tokens and the structure associated with them. We utilize the context of structures for projection, such that similar structures containing semantically similar topics are close to each other in vector space. We explore two semantic text mining tasks over web data to test the effectiveness of our representation: document similarity, and table semantic component identification. In the context of traditional rule-based approaches, both these tasks demand rich, domain-specific knowledge sources, a homogeneous schema for the documents, and rules that capture the semantics. Our approach, on the other hand, is unsupervised and resource-conscious in nature. Despite working without knowledge resources and large training data, it performs on par with state-of-the-art rule-based and other unsupervised approaches.

As Stable As You Are: Re-ranking Search Results using Query-Drift Analysis

This work studies the merits of using query-drift analysis for search re-ranking. We establish a relationship between the ability to predict the quality of a result list retrieved by an arbitrary method, as manifested by its estimated query-drift, and the ability to improve that method's initial retrieval by re-ranking documents in the list based on such prediction. A novel document property, termed "aspect-stability", is identified as the main enabler for transforming the output of an aspect-level query-drift analysis into concrete document scores for search re-ranking. Using an evaluation over various TREC corpora with common baseline retrieval methods, we demonstrate the potential of the proposed re-ranking approach.

Embedding Networks with Edge Attributes

Predicting links in information networks requires deep understanding and careful modeling of network structure. Network embedding, which aims to learn low-dimensional representations of nodes, has been used successfully for the task of link prediction in recent years. Existing methods utilize the observed edges in the network to model the interactions between nodes and learn representations which explain the behavior. Beyond the mere presence of edges, however, networks often carry information which can be used to improve the embedding. For example, in author collaboration networks, the bag of words representing the abstract of a co-authored paper can be used as an edge attribute. In this paper, we propose a novel approach which uses the edges and their associated labels to learn node embeddings. Our model jointly optimizes the reconstruction error of higher-order node neighborhoods, social roles, and edge attributes using a deep architecture which can model highly non-linear interactions. We demonstrate the efficacy of our model over existing state-of-the-art methods on two real-world datasets. We observe that such attributes can improve the quality of embeddings and yield better performance in link prediction.

Collaborative Filtering Method for Handling Diverse and Repetitive User-Item Interactions

Most collaborative filtering models assume that the interaction of users with items takes a single form, e.g., only ratings, clicks, or views. In fact, in most real-life recommendation scenarios, users interact with items in diverse ways. This, in turn, generates complex usage data that contains multiple and diverse types of user feedback. In addition, within such a complex data setting, each user-item pair may occur more than once, implying repetitive preferential user behaviors. In this work we tackle the problem of building a collaborative filtering model that takes such complex datasets into account. We propose a novel factor model, CDMF, that is capable of incorporating arbitrary and diverse feedback types without any prior domain knowledge. Moreover, CDMF is inherently capable of accounting for user-item repetitions. We evaluate CDMF against state-of-the-art methods with highly favorable results.

Privacy-Aware Tag Recommendation for Image Sharing

Image tags are very important for indexing, sharing, searching, and surfacing images with private content that needs protection. As the tags are at the sole discretion of users, they tend to be noisy and incomplete. In this paper, we present a privacy-aware approach to automatic image tagging, which aims at improving the quality of user annotations while also preserving the images' original privacy sharing patterns. Specifically, we recommend potential tags for each target image by mining privacy-aware tags from the most similar images to the target image, obtained from a large collection. Experimental results show that our privacy-aware approach is able to predict accurate tags that improve the performance of a downstream application on image privacy prediction, and a crowd-sourced evaluation confirms the quality of the recommended tags.

Recommending Teammates with Deep Neural Networks

The effects of team collaboration on performance have been explored in a variety of settings. Online games enable people with significantly different skills to cooperate and compete within a shared context. Players can affect teammates' performance either via direct communication or by influencing teammates' actions. Understanding such effects can help us provide insights into human behavior as well as make team recommendations. In this work, we aim at recommending teammates to each individual player for maximal skill growth. We study the effect of collaboration in online games using a large dataset from Dota 2, a popular Multiplayer Online Battle Arena game. To this end, we construct an online co-play teammate network of players, whose links are weighted based on the gain in skill achieved through team collaboration. We then use this performance network to devise a recommendation system based on a modified deep neural network autoencoder.

SESSION: Keynote II

Data and Design in International Development

Foreign aid is a $150 billion industry[1], but the skills and tools for using data in international development are in their infancy. Herculean efforts to overhaul systems of agriculture, education, sanitation, and health go unexamined because collecting that information is hard. Governments and NGOs are tracking information that doesn't begin in electronic format, like school refurbishments and job trainings, then storing it in unstructured documents on inaccessible hard drives around the globe. DevResults is a private software company with the objective of providing best-in-class tools for managing data in international development. The aim is to enable data-driven decision-making by international development practitioners and the communities they serve. If properly captured, shared, and interpreted, international development data offer insights on how to reduce disease, improve education, and lift people out of poverty. DevResults is approaching this problem with a Software as a Service (SaaS) model, paired with consulting and training on designing metrics, structuring data, and using web-based tools. Over the last decade, DevResults has developed an iterative approach based on feedback from thousands of users across over 100 product instances. This has produced an increasingly sophisticated application and complex data model as demand grows for linked, interoperable data. Among other lessons learned, DevResults has identified a key precept of only revealing complexity where needed, as organizations and users express a wide range of needs and capacities. The result is a commercially viable product that's modular in design. DevResults' software is in use at all organizational levels and has dramatically improved data management and analysis for user organizations.

SESSION: Keynote III

Insecure Machine Learning Systems and Their Impact on the Web

Increasingly powerful machine learning models are often seen as a panacea for a wide range of computational problems today. There is an unsustainable level of excitement over recent results in solving systems problems using deep learning techniques, leading to a rush to deploy ML-based systems in countries around the world. In this talk, I will consider some of the negative implications of these powerful but opaque models from two angles. I will discuss vulnerabilities inherent in many of today's deep learning models, as well as the dangers of advanced ML tools in the hands of malicious attackers. I believe that these critical issues must be addressed adequately before the widespread adoption of deep learning tools in today's security-critical applications.

SESSION: Session 3: Temporal and Semantic

Bootstrapping Web Archive Collections from Social Media

Human-generated collections of archived web pages are expensive to create, but provide a critical source of information for researchers studying historical events. Hand-selected collections of web pages about events shared by users on social media offer an opportunity for bootstrapping archived collections. We investigated whether collections generated automatically and semi-automatically from social media sources such as Storify, Reddit, Twitter, and Wikipedia are similar to Archive-It human-generated collections. This is a challenging task because it requires comparing collections that may cater to different needs, and because there are many possible measures to use as a baseline for collection comparison: how does one narrow this list down to metrics that reflect whether two collections are similar or dissimilar? We identified social media sources that may provide collections similar to Archive-It human-generated collections in two main steps. First, we explored the state of the art in collection comparison and defined a suite of seven measures (the Collection Characterizing Suite - CCS) to describe individual collections. Second, we calculated the distances between the CCS vectors of Archive-It collections and the CCS vectors of collections generated automatically and semi-automatically from social media sources, to identify the social media collections most similar to Archive-It collections. The CCS distance comparison was done for three topics: "Ebola Virus," "Hurricane Harvey," and "2016 Pulse Nightclub Shooting." Our results showed that social media sources such as Reddit, Storify, Twitter, and Wikipedia produce collections that are similar to Archive-It collections. Consequently, curators may consider extracting URIs from these sources in order to begin or augment collections about various news topics.
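The second step reduces to a distance computation between fixed-length feature vectors. A minimal sketch with hypothetical 7-dimensional CCS vectors and Euclidean distance (the paper's actual measures and distance function may differ):

```python
import numpy as np

# Hypothetical 7-dimensional CCS vectors (one value per measure) for an
# Archive-It collection and two candidate social-media collections.
ccs = {
    "archive_it_ebola": np.array([0.8, 0.3, 0.5, 0.9, 0.2, 0.7, 0.4]),
    "reddit_ebola":     np.array([0.7, 0.4, 0.5, 0.8, 0.3, 0.6, 0.5]),
    "twitter_ebola":    np.array([0.2, 0.9, 0.1, 0.3, 0.8, 0.2, 0.9]),
}

# Smaller distance to the Archive-It vector = more similar collection.
ref = ccs["archive_it_ebola"]
for name, vec in ccs.items():
    if name != "archive_it_ebola":
        print(name, "distance:", round(np.linalg.norm(ref - vec), 3))
```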

Studying the Spatio-Temporal Dynamics of Small-Scale Events in Twitter

Small-scale events are emerging as attractive objects of research. On Twitter, small-scale events act as weak sensors that report things happening at specific times and places. While previous work addressed the detection of such events, very little is known so far about their inherent properties. In this paper, our main objective is to analyse the spatio-temporal peculiarities of small-scale events with respect to different levels of location granularity, and to understand the general trend of their propagation over their lifetimes. Our findings suggest that (1) users involved in small-scale events mostly remain close to the event's geographical focus; (2) such events do not exhibit major peaks; and (3) there exist distinct events, identifiable from users' posts, that differ significantly in topic distribution, focus concentration, and propagation distance over time.

The Utility Problem of Web Content Popularity Prediction

The ability to generate and share content on social media platforms has changed the Internet. With the growing rate of content generation, efforts have been directed at making sense of such data. One of the most researched problems concerns predicting web content popularity. We argue that the evolution of state-of-the-art approaches has been optimized towards improving the predictability of the average behaviour of the data: items with low levels of popularity. We demonstrate this effect using a utility-based framework for evaluating numerical web content popularity prediction tasks, focusing on highly popular items. Additionally, we demonstrate that gains in predictive and ranking ability for such cases can be obtained via naïve approaches based on strategies for learning from imbalanced domains.
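Utility-based evaluation for this setting typically weights each item's error by a relevance function that emphasizes rare, highly popular items. A minimal sketch with a hypothetical sigmoid relevance and made-up thresholds, not the paper's exact formulation:

```python
import numpy as np

def relevance(y, threshold, scale):
    """Sigmoid relevance: ~0 for common low-popularity items,
    ~1 for rare, highly popular ones (hypothetical parameters)."""
    return 1.0 / (1.0 + np.exp(-(y - threshold) / scale))

def weighted_mae(y_true, y_pred, threshold=1000.0, scale=200.0):
    # Errors on popular items dominate; errors on unpopular items vanish.
    w = relevance(y_true, threshold, scale)
    return np.sum(w * np.abs(y_true - y_pred)) / np.sum(w)

y_true = np.array([10.0, 50.0, 2000.0, 5000.0])   # invented visit counts
y_pred = np.array([12.0, 40.0, 500.0, 1500.0])
print(weighted_mae(y_true, y_pred))
```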

Know Thy Neighbors, and More!: Studying the Role of Context in Entity Recommendation

Knowledge graphs capture the semantic relations between real-world entities and can thus allow end-users to explore different aspects of an entity of interest by traversing the edges in the graph. Most state-of-the-art methods in entity recommendation are limited in the sense that they allow users to search only in the immediate neighborhood of the entity of interest. This is mainly for efficiency reasons, as the search space increases exponentially as we move further away from the entity of interest in the graph. Users often perform the search task in the context of an information need, and we investigate the role this context can play in overcoming the scalability issue and improving knowledge graph exploration. Intuitively, only a small subset of entities in the graph are relevant to a user's interest. We show how to efficiently select this subset by utilizing contextual clues, and how to use graph-theoretic measures to further re-rank this set to offer highly relevant graph exploration capabilities to end-users.

Content Driven Enrichment of Formal Text using Concept Definitions and Applications

Formal text is objective and unambiguous, and tends to have complex sentence constructions intended to be understood by the target demographic. In the absence of domain knowledge, however, it is imperative to define the key concepts and their relationships in the text so that general readers can interpret it correctly. To address this, we propose a text enrichment framework that identifies the key concepts in input text, highlights definitions, and fetches the definition from external data sources when a concept is undefined. Beyond concept definitions, the system enriches the input text with concept applications and a pre-requisite concept graph that showcases the inter-dependencies among the extracted concepts. While the problem of learning definition statements has been attempted in the literature, the task of learning application statements is novel. We manually annotated a dataset for training a deep learning network to identify application statements in text. We quantitatively compared the results of both the application and definition identification models with standard baselines. To validate the utility of the proposed framework for general readers, we report enrichment accuracy and show promising results.

Modeling Semantics between Programming Codes and Annotations

It is common practice for programmers to leave annotations during program development. Most annotated documentation is used predominantly as an archive of coding events for a limited set of developers. We hypothesize that these annotations capture a large amount of valuable information which can be utilized to identify similar code or to examine code quality. However, because annotating behaviors vary and the language composition can be complex, this work sets out to investigate a systematic method for examining annotation semantics and their relations with code. We designed a semantic parser to extract concepts from code and the corresponding annotations. Additionally, text mining techniques are applied to summarize linguistic features from the annotations. We then build models to predict concepts in programming code annotations. Results show that the proposed semantic modeling method achieves higher performance than a random-guess baseline.

SESSION: Session 4: User Behaviour

IntelliEye: Enhancing MOOC Learners' Video Watching Experience through Real-Time Attention Tracking

Massive Open Online Courses (MOOCs) have become an attractive opportunity for people around the world to gain knowledge and skills. Despite the initial enthusiasm of the first wave of MOOCs and the subsequent research efforts, MOOCs today suffer from retention issues: many MOOC learners start but do not finish. A main culprit is the lack of oversight and directions: learners need to be skilled in self-regulated learning to monitor themselves and their progress, keep their focus and plan their learning. Many learners lack such skills and as a consequence do not succeed in their chosen MOOC. Many of today's MOOCs are centered around video lectures, which provide ample opportunities for learners to become distracted and lose their attention without realizing it. If we were able to detect learners' loss of attention in real-time, we would be able to intervene and ideally return learners' attention to the video. This is the scenario we investigate: we designed a privacy-aware system (IntelliEye) that makes use of learners' Webcam feeds to determine---in real-time---when they no longer pay attention to the lecture videos. IntelliEye makes learners aware of their attention loss via visual and auditory cues. We deployed IntelliEye in a MOOC across a period of 74 days and explore to what extent MOOC learners accept it as part of their learning and to what extent it influences learners' behaviour. IntelliEye is open-sourced at https://github.com/Yue-ZHAO/IntelliEye.

SimilarHITs: Revealing the Role of Task Similarity in Microtask Crowdsourcing

Workers in microtask crowdsourcing systems typically consume different types of tasks. In the most popular platforms, such as Amazon Mechanical Turk and CrowdFlower, task consumption is driven by workers' self-selection, and workers typically complete tasks one after another in a chain. Prior work has revealed the impact of ordering tasks while considering aspects such as task complexity. However, little is understood about the benefits of considering task similarity in microtask chains. In this paper, we investigate the role of task similarity in microtask crowdsourcing and how it affects market dynamics. We identified different dimensions that affect the perception of task similarity among workers, and propose a supervised machine learning model to predict the overall similarity of a task pair. Leveraging task similarity, we studied its effects on worker retention, satisfaction, boredom and fatigue, and we reveal the impact of chaining tasks according to their similarity on worker accuracy and task completion time. Our findings enrich the current understanding of crowd work and bear important implications for structuring workflows.

Penny Auctions are Predictable: Predicting and Profiling User Behavior on DealDash

We study user behavior and the predictability of penny auctions, a class of auction sites often criticized for misrepresenting themselves as low-price auction marketplaces. Using a 166-day trace of 134,568 auctions involving 174 million bids on DealDash, the largest penny auction site in service, we show that a) both the timing and source of bids are highly predictable, and b) users are easily classified into clear behavioral groups by their bidding behavior, and such behaviors correlate highly with the eventual profitability of their bidding strategies. This suggests that penny auction sites are vulnerable to modeling and adversarial attacks.

SESSION: Session 5: Hypertext

The StoryPlaces Platform: Building a Web-Based Locative Hypertext System

Locative narrative systems have been a popular area of research for nearly two decades, but they are often bespoke systems, developed for particular deployments or to demonstrate novel technologies. This has meant that they are short-lived, that the narratives have been constructed by the creators of the system, and that the barrier to creating locative experiences has remained high due to a lack of common tools. We set out to create a platform based on the commonalities of these historic systems, with a focus on hypertext structure, designed to enable locative narratives to be created, deployed, and experienced in-the-wild. The result is StoryPlaces, an open source locative hypertext platform and authoring tool designed around a sculptural hypertext engine and built with existing Web technologies. As well as providing an open platform for future development, StoryPlaces offers novelty in its management of location, including the separation of locations from nodes, of descriptions from locations, and of content from pages, as well as run-time caching and disconnection resilience. It also advances the state of the art in sculptural hypertext delivery through conditional functions and nested, geographic, and temporal conditions. The StoryPlaces platform has been used for the public deployment of over twenty locative narratives, and demonstrates the effectiveness of a general platform for delivering complex locative narrative experiences. In this paper we describe the process of creating the platform and our insights on the design of locative hypertext platforms.

Narrative Plot Comparison Based on a Bag-of-actors Document Model

Comparing documents based on their semantic plot structure or narrative is an important problem in several application areas. Approaches based on information retrieval methods, latent semantic indexing, sentence embedding, and topic modeling are inadequate to capture the structural elements of the narrative. In this work, we present an abstract "bag-of-actors" document model, meant for comparing, indexing and retrieving documents based on their narrative structures. This model is based on resolving the main entities or actors in the plot, and the corresponding actions associated with them. We use this to compare movie plot summaries from IMDB (Internet Movie Database) to identify movie plots that are remakes of, or were inspired by, one another. Evaluation over a wide range of movie plots from different genres shows encouraging results.
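A crude sketch of building such a bag-of-actors with spaCy, mapping each grammatical subject to the verbs it governs; real actor resolution would also need coreference handling, so treat this as a simplified stand-in, not the paper's implementation (requires the en_core_web_sm model).

```python
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")

def bag_of_actors(plot):
    """Map each grammatical subject to the actions (verb lemmas)
    it participates in -- a crude stand-in for actor resolution."""
    actors = defaultdict(set)
    for tok in nlp(plot):
        if tok.dep_ in ("nsubj", "nsubjpass") and tok.head.pos_ == "VERB":
            actors[tok.lemma_.lower()].add(tok.head.lemma_)
    return dict(actors)

plot = ("A retired boxer trains a young fighter. "
        "The fighter wins the championship and thanks the boxer.")
print(bag_of_actors(plot))
# e.g. {'boxer': {'train'}, 'fighter': {'win'}}
```

Two plots can then be compared by, for example, the overlap of their actor-action sets.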

Mother: An Integrated Approach to Hypertext Domains

The idea of associating information with so-called links was developed by hypertext pioneers in the 1960s. In the 1990s, the Dexter Hypertext Reference Model was developed with the goal of providing a general model for node-link hypertext systems. In the 1990s and 2000s, important steps were made towards hypertext infrastructures, which led to component-based open hypermedia systems (CB-OHSs). In this paper we provide a detailed description of node-link structures. We argue that Dexter does not match the needs of CB-OHSs, as it supports a mix of multiple structure domains. Based on the implementation of link support in our system Mother, we demonstrate how Dexter needs to be tailored accordingly. We further describe the ability of Mother's node-link structures to interoperate with other available structure services and vice versa.

VAnnotatoR: A Framework for Generating Multimodal Hypertexts

We present VAnnotatoR, a framework for generating so-called multimodal hypertexts. Based on Virtual Reality (VR) and Augmented Reality (AR), VAnnotatoR enables the annotation and linkage of semiotic aggregates (texts, images and their segments) with walk-on-able animations of places and buildings. In this way, spatial locations can be linked, for example, to temporal locations and Discourse Referents (ranging over temporal locations, agents, objects, or instruments etc. of actions) or to texts and images describing or depicting them, respectively. VAnnotatoR represents segments of texts or images, discourse referents and animations as interactive, manipulable 3D objects which can be networked to generate multimodal hypertexts. The paper introduces the underlying model of hyperlinks and exemplifies VAnnotatoR by means of a project in the area of public history, the so-called Stolperwege project.

SESSION: Keynote IV

The US National Library of Medicine: A Platform for Biomedical Discovery & Data-Powered Health

This talk will address the role of the National Library of Medicine in fostering data-powered health and serving as a platform for biomedical discovery. It will examine emerging trends in the biomedical research landscape, such as the rapid growth of biomedical data sources, shifting paradigms for data sharing and open science, and the changing role of libraries in providing access to digital information. The talk will cover the newly developed long-range vision of the National Library of Medicine, with its triple aim of: 1) accelerating discovery and advancing health through data-driven research; 2) reaching more people in more ways through enhanced dissemination and engagement; and 3) building a workforce and populace that is empowered to conduct data-driven research and enabled to optimize health and healthcare delivery through access to new types of health information. This talk will also examine the importance of co-creation, through partnerships and crowdsourcing, to achieve solutions that optimize the roles of government and stakeholders in improving health and healthcare through information resources.

SESSION: Session 6: Privacy, Bots and Automatic Methods

Understanding Privacy Dichotomy in Twitter

Balancing personalization and privacy is one of the challenges marketers commonly face. The privacy dilemmas associated with personalized services are particularly concerning in the context of social networking websites, where the privacy dichotomy problem is widely observed. To prevent potential privacy violations, businesses need to employ safeguards beyond the current privacy settings of users. As a possible solution, companies can utilize users' social footprints to detect their privacy preferences. To take a step towards this goal, we first ran a series of experiments to examine whether the privacy preference attribute is homophilous in social media. We found evidence that users' privacy preferences are similar to the privacy behaviour of their social contacts, signaling that privacy homophily exists in social networks. We further studied users located in neighbourhoods with varying degrees of privacy and found a set of characteristics specific to public users located in private neighbourhoods. These identified features can be used in a predictive model to identify public user accounts that are intended to be private, supporting companies in making an informed decision about whether or not to exploit a user's publicly available data for personalization purposes.

Securing Social Media User Data: An Adversarial Approach

Social media users generate tremendous amounts of data. To serve users better, such user-related data needs to be shared with researchers, advertisers and application developers, but publishing it raises concerns about user privacy. To encourage data sharing and mitigate these concerns, a number of anonymization and de-anonymization algorithms have been developed to help protect the privacy of social media users. In this work, we propose a new adversarial attack specialized for social media data. We further provide a principled way to assess the effectiveness of anonymizing different aspects of social media data. Our work sheds light on new privacy risks in social media data due to the innate heterogeneity of user-generated data, which requires striking a balance between sharing user data and protecting user privacy.

Search Rank Fraud De-Anonymization in Online Systems

We introduce the fraud de-anonymization problem, which goes beyond fraud detection to unmask the human masterminds responsible for posting search rank fraud in online systems. We collect and study search rank fraud data from Upwork, and survey the capabilities and behaviors of 58 search rank fraudsters recruited from 6 crowdsourcing sites. We propose Dolos, a fraud de-anonymization system that leverages traits and behaviors extracted from these studies to attribute detected fraud to crowdsourcing-site fraudsters, and thus to real identities and bank accounts. We introduce MCDense, a min-cut dense component detection algorithm that uncovers groups of user accounts controlled by different fraudsters, and leverage stylometry and deep learning to attribute them to crowdsourcing site profiles. Dolos correctly identified the owners of 95% of fraudster-controlled communities, and uncovered fraudsters who promoted as many as 97.5% of the fraud apps we collected from Google Play. When evaluated on 13,087 apps (820,760 reviews), which we monitored over more than 6 months, Dolos identified 1,056 apps with suspicious reviewer groups. We report orthogonal evidence of their fraud, including fraud duplicates and fraud re-posts.

Learning to Rank Social Bots

Software robots, or simply bots, have often been regarded as harmless programs confined to cyberspace. However, recent events in our society have proved that they can have important effects on real life as well. Bots have in fact become one of the key tools for disseminating information through online social networks (OSNs), influencing their members and eventually changing their opinions. With a focus on classification, social bot detection has lately emerged as a major topic in OSN analysis; nevertheless, more research is needed to enhance our understanding of such automated behaviors, particularly to unveil the characteristics that better differentiate legitimate accounts from bots. We argue that this calls for learning behavioral models trained on a large and heterogeneous set of behavioral features, so as to detect and characterize OSN accounts according to their status as bots. Within this view, we push forward research on bot analysis by proposing a machine-learning framework for identifying and ranking OSN accounts based on their degree of bot relevance. Our framework exploits the best-known existing methods for bot detection for enhanced feature extraction, together with state-of-the-art learning-to-rank methods using different optimization and evaluation criteria. Results obtained on Twitter data show the significance and effectiveness of our approach in detecting and ranking bot accounts.
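The ranking step can be sketched with an off-the-shelf learning-to-rank implementation such as LightGBM's lambdarank objective; the features and graded labels below are random placeholders, not the authors' feature set or pipeline.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)

# Hypothetical behavioral features for 100 accounts (e.g. posting rate,
# follower/friend ratio, burstiness) and graded bot-relevance labels 0-2.
X = rng.normal(size=(100, 3))
y = rng.integers(0, 3, size=100)

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50)
# A single query group: rank all accounts against one another.
ranker.fit(X, y, group=[100])

scores = ranker.predict(X)
print("most bot-like account index:", int(np.argmax(scores)))
```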

An Approximately Optimal Bot for Non-Submodular Social Reconnaissance

The explosive growth of Online Social Networks (OSNs) in recent years has led to many individuals relying on them to keep up with friends and family. This, in turn, makes them prime targets for malicious actors seeking to collect sensitive, personal data. Prior work has studied the ability of socialbots, i.e. bots which pretend to be humans on OSNs, to collect personal data by befriending real users. However, this prior work has been hampered by the assumption that the likelihood of users accepting friend requests from a bot is non-increasing -- a useful constraint for theoretical purposes but one contradicted by observational data. We address this limitation with a novel curvature-based technique, showing that an adaptive greedy bot is approximately optimal within a factor of 1 - 1/e^(1/δ) ≈ 0.165. This theoretical contribution is supported by simulating the infiltration of the bot on OSN topologies. Counter-intuitively, we observe that when the bot is incentivized to befriend friends-of-friends of target users it outperforms a bot that focuses on befriending targets.

SESSION: Session 7: News and Community Detection

A Deep Joint Network for Session-based News Recommendations with Contextual Augmentation

Session-based recommendations have drawn more and more attention in many recommendation settings of modern online services. Unlike domains such as books and music, news recommendation suffers from the fast update rate and recency issues of news articles and the lack of user profiles. In this paper, we propose a method that combines user click events within a session and news contextual features to predict the next click behavior of a user. The model consists of two kinds of hierarchical neural networks that learn article contextual properties and temporal sequential patterns in streams of clicks. Character-level embedding over input features is adopted to allow the integration of different types of data and reduce feature-engineering effort. We also introduce a time-decay method to compute the freshness of news articles within a sliding time window. Experimental results on two real-world datasets show significant improvements over several baselines and state-of-the-art session-based neural network methods.
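The time-decay freshness idea can be sketched as an exponential decay over article age; the half-life parameter below is a hypothetical choice, not the paper's value.

```python
import math
import time

def freshness(published_ts, now_ts, half_life_hours=6.0):
    """Exponential time decay: an article loses half its freshness
    every `half_life_hours` (hypothetical parameter)."""
    age_hours = (now_ts - published_ts) / 3600.0
    return 0.5 ** (age_hours / half_life_hours)

now = time.time()
for hours_old in (0, 3, 6, 24):
    score = freshness(now - hours_old * 3600, now)
    print(f"{hours_old:>2}h old -> freshness {score:.3f}")
```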

Dynamics and Prediction of Clicks on News from Twitter

Social networks are a major gateway to news content. It is estimated that a third of all web visits originate on social media, and that about half of users rely on social media to keep up to date with world events. Surprisingly, no model has been proposed and validated to study how to reproduce and interpolate the clicks generated by social media. Here we study news posted on Twitter, leveraging public information as well as private data from a popular online publisher. We propose and validate a simple two-step model of information diffusion that can be easily interpreted, and that can be applied using only public information to determine current and future clicks.

To Post or Not to Post: Using Online Trends to Predict Popularity of Offline Content

Predicting the popularity of online content has attracted much attention in the past few years. In news rooms, for instance, journalists and editors are keen to know, as soon as possible, the articles that will bring the most traffic into their website. In this paper, we propose a new approach for predicting the popularity of news articles before they go online. Our approach complements existing content-based methods, and is based on a number of observations regarding article similarity and topicality. First, the popularity of a new article is correlated with the popularity of similar articles of recent publication. Second, the popularity of the new article is related to the recent historical popularity of its main topic. Based on these observations, we use time series forecasting to predict the number of visits an article will receive. Our experiments, conducted on a real data collection of articles in an international news website, demonstrate the effectiveness and efficiency of the proposed method.
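A minimal sketch of the forecasting step, using simple exponential smoothing over a topic's recent popularity series as a stand-in for the paper's time series method; the visit counts are invented.

```python
def exp_smooth_forecast(series, alpha=0.4):
    """One-step-ahead forecast by simple exponential smoothing;
    alpha is a hypothetical smoothing parameter."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

# Hypothetical daily visit counts for recent articles on the same topic.
topic_history = [1200, 1500, 900, 1800, 2100, 1700]
print("forecast for the next article:",
      round(exp_smooth_forecast(topic_history)))
```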

Stance Classification through Proximity-based Community Detection

Numerous domains have an interest in studying the viewpoints expressed online, be it for marketing, cybersecurity, or research purposes with the rise of computational social science. Current stance detection models are usually grounded in the specificities of particular social platforms. This rigidity is unfortunate, since it does not allow the integration of the multitude of signals that inform effective stance detection. We propose SCSD, a Sequential Community-based Stance Detection model: a semi-supervised ensemble algorithm which considers these signals by modeling them as a multi-layer graph representing proximities between profiles. We use a handful of seed profiles, whose stance we know, to classify the remaining profiles by exploiting like-minded communities: communities of profiles close enough to assume that they share a similar stance on a given subject. Using datasets from two different social platforms, containing two to five stances, we show that combining several types of proximity achieves excellent results. Moreover, we compare the proximities to find those which convey useful information in terms of stance detection.
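A simplified stand-in for this idea (not the SCSD algorithm itself): propagate stance labels from seed profiles over a weighted proximity graph, assigning each profile the stance of its nearest seed; the profiles, weights, and stances are invented.

```python
import networkx as nx

# Toy proximity graph: edge weights combine several proximity signals
# (e.g. retweets, shared hashtags); profile names are hypothetical.
G = nx.Graph()
G.add_weighted_edges_from([
    ("u1", "u2", 0.9), ("u2", "u3", 0.8),   # a like-minded community
    ("u4", "u5", 0.9), ("u3", "u4", 0.1),   # weak bridge between stances
])

seeds = {"u1": "pro", "u5": "anti"}  # profiles with known stance

# Convert similarity weights to distances, then label each unlabeled
# profile with the stance of its nearest seed by weighted shortest path.
for u, v in G.edges():
    G[u][v]["dist"] = 1.0 - G[u][v]["weight"]
for node in G.nodes():
    if node in seeds:
        continue
    nearest = min(seeds, key=lambda s: nx.shortest_path_length(
        G, node, s, weight="dist"))
    print(node, "->", seeds[nearest])
```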

Sentiment-driven Community Profiling and Detection on Social Media

Web 2.0 helps to expand the range and depth of conversation on many issues and facilitates the formation of online communities, which draw various individuals together based on their common opinions on a core set of issues. Most existing community detection methods merely focus on discovering communities without providing any insight into the collective opinions of community members or the motives behind the formation of the communities. Several efforts have been made to tackle this problem by presenting a set of keywords as a community profile. However, they neglect the positions of community members towards the keywords, which play an important role in understanding communities in the highly polarized atmosphere of social media. To this end, we present a sentiment-driven community profiling and detection framework which aims to provide community profiles presenting the positive and negative collective opinions of community members separately. Our framework initially extracts key expressions in users' messages as representatives of issues and then identifies users' positive/negative attitudes towards these key expressions. Next, it uncovers a low-dimensional latent space in order to cluster users according to their opinions and social interactions (i.e., retweets). We demonstrate the effectiveness of our framework through quantitative and qualitative evaluations.

SESSION: Blue Sky Ideas

Intelligent Generative Locative Hyperstructure

Locative Hypertext Narrative has seen a resurgence in the Hypertext and Interactive Narrative research communities over the last five years. However, while locative hypertext provides significant opportunities for rich locative applications for both education and entertainment, many applications in this space are tied to very specific locations, restricting their utility to local users. While this is necessary for some locative applications (such as tour guides), others use location as a thematic or contextual backdrop, and as such could be effectively read in similar locations elsewhere. Yet many locative systems are restricted to specific prescribed locations, and systems that do generate locations do so in a simplistic manner, often with mixed results. In this paper we propose a more intelligent generative approach to locative hypertext that generates a locative structure for the user's local area, one that respects both the thematic location demands of the piece and the effective patterns and structures of locative narrative.

As We May Hear: Our Slaves of Steel II

Our slaves of steel [4] explored some moral questions that arise from narrative with persistent digital agents. If we propose to her on the holodeck, can Ophelia conceivably consent to marry us? Here, we propose simple audio agents that are well within the capacity of current technology, and we explore the reader's responsibility, if any, to care for persistent agents.

A Villain's Guide To Social Media And Web Science

If we have not yet achieved planetary super-villainy on the desktop, it may be feasible to fit it into a suburban office suite. Social media and Web science permit the modern villain to deploy traditional cruelties to great and surprising effect. Because the impact of villainous techniques is radically asymmetric, our fetid plots are difficult and costly to foil.

SESSION: Tutorials

Efficient Auto-Generation of Taxonomies for Structured Knowledge Discovery and Organization

This tutorial introduces the audience to the latest breakthroughs in interpreting unstructured content, through an analysis of the key enabling scientific results along with their real-world applications. With technical presentations of problems such as named-entity disambiguation and dynamically updating a knowledge hierarchy with domain-specific vocabulary, it provides the fundamental building blocks of various applications in Artificial Intelligence, Natural Language Processing, Machine Learning, and Data Mining.