Connection Discovery

Reference: Discovering User Connections using Their Shared Images on Social Media [V1]

Abstract

Human beings are born social. In the social media era, we share and interact with others digitally, forming online social graphs and sharing billions of images. Many social media applications, such as recommendation, virality prediction, and marketing, make use of social graphs, as similar users (e.g., users with similar interests) tend to be friends. However, the social graph may not be explicitly specified by users or may be kept private due to privacy concerns. Meanwhile, billions of images are shared by individuals, and these images are widely accessible to others due to their sharing nature. These user-shared images have proved to be a more effective alternative for discovering user connections. This thesis introduces a novel way to detect social signals from low-level visual features, and to represent them with unbiased machine-generated labels to discover user connections. Based on 11 million user-shared images from 11 real social media platforms, a phenomenon is observed: related users, who have online friendships or follower/followee relationships on those platforms, share more similar images. This phenomenon is independent of the network origins, the content sharing mechanisms and the image processing/computer vision techniques that encode the images. Hence, an analytic framework is proposed to measure, formulate and utilize the phenomenon for follower/followee recommendation. The framework is optimized for social signal detection using deep learning. Different applications are also discussed. To the best of our knowledge, this framework is the first attempt to discover connections by detecting social signals from user-shared images.

Dataset

The images are collected from a set of randomly selected users on each social network, either through the official API (Twitter, Skyrock, 163 Weibo and Tencent Weibo) or with a Ruby-based web scraper (Flickr, Pinterest, Duitang and Digu). The details of those images are shown in the table below. As the maximum number of requests (the number of images that can be downloaded) is limited by the operators of the social networks, the user-generated images are collected in parallel with virtual machines running on cloud platforms such as Amazon EC2. The datasets can be downloaded from the Download section of this page.

Origin	Social network	User-shared images	Image size, min/mean/median/max (KB)
US	Flickr	201,006	2.1/193.8/178/1960
US	Pinterest	314,369	0.671/77.1/51.4/1020
US	Twitter	150,696	0.338/58.5/41.3/1340
France	Skyrock	176,547	0.0684/75.4/37.1/9020
China	Duitang	396,534	0.947/116/68.2/6910
China	Digu	148,337	0.181/179/86.0/14400
China	163 Weibo	187,491	0.0791/102/56.4/7500
China	Tencent Weibo	490,624	0.616/37.2/26.3/963
	Total	2,075,604

Data Processing and Storing

The scraped user-generated images are encoded as vectors with ResNet.

There are about 2 million user-shared images from 8 social networks. Each image is encoded by ResNet, taking the output of the last layer before the 'softmax' layer as its feature vector.

The data of each social network contains 3 files:

1) x.mat: encoded images
2) dataFd.mat: follower/followee relationships
3) dataImageBy: the uploader of each image

Note that the ID of an image in 'dataImageBy' starts with '2', while the ID of a user starts with '3'. For example:

"200150685,373741227"

This means that the 150685-th image in 'x.mat' is shared by user '373741227', and the relationships of this user can be found in 'dataFd.mat'.
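As a minimal sketch of reading such a record, the snippet below splits one line into the image index and the user ID. It assumes the records are available as plain "imageId,userId" text, that the user ID keeps its leading '3' (as in the example above), and that the digits after the leading '2' give the position in 'x.mat'. Whether that position is 0-based or 1-based is not stated here, and the function name is hypothetical.

```python
def parse_image_owner(record: str) -> tuple[int, str]:
    """Split one 'imageId,userId' record into (image index, user id).

    Assumption: image ids carry a leading '2' marker and user ids a
    leading '3' marker, as described in the text above.
    """
    image_id, user_id = record.strip().strip('"').split(",")
    if not image_id.startswith("2") or not user_id.startswith("3"):
        raise ValueError(f"unexpected id prefixes in record: {record!r}")
    # Drop the leading '2'; the remaining digits are the image's position in x.mat.
    image_index = int(image_id[1:])
    # The user id is kept as a string, including its leading '3'.
    return image_index, user_id


print(parse_image_owner('"200150685,373741227"'))  # (150685, '373741227')
```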

Download

163: 1. Friends 2. Data 3. User-data Relationship

Duitang: 1. Friends 2. Data 3. User-data Relationship

Digu: 1. Friends 2. Data 3. User-data Relationship

Flickr: 1. Friends 2. Data 3. User-data Relationship

Pinterest: 1. Friends 2. Data 3. User-data Relationship

Skyrock: 1. Friends 2. Data 3. User-data Relationship

Tencent: 1. Friends 2. Data 3. User-data Relationship

Twitter: 1. Friends 2. Data 3. User-data Relationship

Please contact us for more data and collaborations (email: cpming@ust.hk)

Research Challenges

Based on machine-generated labels, an analytic framework that acts as a social signal detector to discover user connections from shared images is proposed. A set of images is first encoded into vectors with $R$ dimensions using image processing and computer vision techniques, such as a CNN. The vectors are then clustered by a clustering technique, such as $k$-means. Each cluster corresponds to a machine-generated label. Those labels represent the social signal, such that images shared by related users are more likely to be in the same cluster. The user profile, or label distribution, is built from the occurrences of machine-generated labels. Once the user profiles are obtained, the next step is to discover user connections from the similarity of their label distributions. The user profile reflects the interests, connections and behaviors of a user. For example, a pair of users is more likely to be follower/followee, of the same gender, and/or have similar interests when their user profiles are more similar. The same profile can be used to predict several kinds of information and interests. Hence, applications can be built on those discovered connections, even without access to social graphs (SGs).

Based on the proposed framework, the first research objective is to evaluate whether the framework is applicable to any social media, regardless of its origins, content sharing mechanisms and image encoding techniques. The second objective is to propose an analytic approach to formulate and utilize the proposed framework. The third objective is to optimise the image encoding of the proposed framework for detecting social signals from user-shared images. The fourth objective is to extend the proposed framework to applications. These objectives are introduced in the coming chapters. The next section discusses the collected datasets from different social media applications used in the experiments.

Data Analytics

The figure below shows the proposed analytical framework. Each image, $x_m$, is encoded into an output feature vector, $f(x_m)$, using image processing and computer vision techniques, as shown in step 1. The feature vectors, $f(x_m)$, of the user-shared images from all users are clustered into $K$ clusters, and each image is annotated with the unique machine-generated label of the cluster it belongs to, as shown in step 2 in the figure. A user $i$ is then profiled by a $K$-dimensional vector, $L_i$, that describes the distribution of the $K$ unique labels tagged on the images shared by this user.

[Figure: overall analytical framework]
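The profiling step can be sketched as follows, assuming the encoding and clustering of steps 1 and 2 have already produced one cluster label per image. The function name, the input layout (two parallel lists) and the normalisation to frequencies are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict


def build_profiles(image_labels, image_owners, num_labels):
    """Return {user: K-dimensional list of label frequencies}.

    image_labels[m] is the machine-generated label (0..K-1) of image m,
    image_owners[m] is the user who shared image m (assumed inputs).
    """
    counts = defaultdict(Counter)
    for label, owner in zip(image_labels, image_owners):
        counts[owner][label] += 1

    profiles = {}
    for owner, counter in counts.items():
        total = sum(counter.values())
        # Normalise the counts so that each profile L_i sums to 1.
        profiles[owner] = [counter[k] / total for k in range(num_labels)]
    return profiles


# Toy example: six images, two users, K = 3 labels.
labels = [0, 0, 1, 2, 2, 2]
owners = ["u1", "u1", "u1", "u2", "u2", "u2"]
profiles = build_profiles(labels, owners, num_labels=3)
print(profiles["u2"])  # [0.0, 0.0, 1.0]
```

Normalising the counts is optional for a similarity measure that ignores vector length, but it keeps profiles comparable between users who share very different numbers of images.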

Given the profiles, $L_i$ and $L_j$, of users $i$ and $j$ respectively, the similarity of users $i$ and $j$, $S_{i,j}$, can be evaluated through their shared images by:

$S_{i,j} = \frac{L_i \cdot L_j}{\|L_i\| \, \|L_j\|}$

where $\cdot$ is the dot product of two vectors and $\|\cdot\|$ is the L2 norm of a vector.
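A small sketch of this similarity in plain Python is given below. Returning 0 for a user with an all-zero profile is an assumption made here to avoid a division by zero; the formula itself does not define that case.

```python
import math


def cosine_similarity(l_i, l_j):
    """Cosine similarity S_ij between two label-distribution vectors."""
    dot = sum(a * b for a, b in zip(l_i, l_j))
    norm_i = math.sqrt(sum(a * a for a in l_i))
    norm_j = math.sqrt(sum(b * b for b in l_j))
    if norm_i == 0 or norm_j == 0:
        # Assumption: a user with no labelled images is similar to no one.
        return 0.0
    return dot / (norm_i * norm_j)


# Profiles with no label in common give 0; proportional profiles give about 1.
print(cosine_similarity([1, 0, 0], [0, 1, 0]))
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
```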

Data Measurement

The following is the CDF of the similarity, $S_{i,j}$, for related pairs (those who are follower/followee, $C_{i,j}=1$) and unrelated pairs (those who are not follower/followee, $C_{i,j}=0$). It is observed that the two groups have different CDFs.
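One way to reproduce this kind of measurement is to compute an empirical CDF separately for the two groups, as sketched below. The similarity values in the example are made up purely for illustration and are not taken from the datasets.

```python
from bisect import bisect_right


def empirical_cdf(samples):
    """Return a function s -> fraction of samples that are <= s."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    n = len(ordered)

    def cdf(s):
        # bisect_right counts how many sorted values are <= s.
        return bisect_right(ordered, s) / n

    return cdf


# Illustrative (made-up) similarities for pairs with C_ij = 1 and C_ij = 0.
related = [0.6, 0.7, 0.8, 0.9]
unrelated = [0.1, 0.2, 0.3, 0.7]
cdf_related = empirical_cdf(related)
cdf_unrelated = empirical_cdf(unrelated)
print(cdf_related(0.5), cdf_unrelated(0.5))  # 0.0 0.75
```

In this toy example the related pairs concentrate at higher similarity values, so their CDF stays below that of the unrelated pairs.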

Connection discovery for follower/followee recommendation [J1, J7, J8]

User connections, or social graphs, can take any form of online social relationship, such as online friendships, follower/followee relationships and community memberships. The aim of the discovery is to enhance service relevance using the discovered user connections and communities. Based on $S_{i,j}$, the optimized follower/followee recommendation list, $U_{i,J^*}$, can be obtained by:

$U_{i,J^*} = \arg\max_{U_{i,J}} \prod\limits_{j \in U_{i,J}} P(C_{i,j}=1 \mid S_{i,j}=s)$
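If the list size $J$ is fixed and every factor is a probability, the product is largest when the list holds the $J$ candidates with the highest $P(C_{i,j}=1|S_{i,j}=s)$, so the maximisation reduces to a top-$J$ selection. The sketch below assumes this reading and takes the probability as a caller-supplied function (here called prob_connected, a hypothetical name); how that function is estimated, for example from the measured CDFs, is left open.

```python
def recommend(similarities, prob_connected, list_size):
    """Pick the list_size candidates with the highest connection probability.

    similarities: {candidate j: S_ij} for one target user i.
    prob_connected: function s -> P(C_ij = 1 | S_ij = s), assumed given.
    """
    scored = [(prob_connected(s), j) for j, s in similarities.items()]
    # With non-negative factors, the J largest maximise the product.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [j for _, j in scored[:list_size]]


# Toy example with a placeholder probability model P = s.
sims = {"a": 0.9, "b": 0.2, "c": 0.5}
print(recommend(sims, lambda s: s, list_size=2))  # ['a', 'c']
```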

Social signal detection [S3]

In a CNN, the weight, $\textbf{W}$, is the key to encoding. It can be changed or optimised for different applications. The motivation is to optimise the CNN to detect low-level visual features that relate to user connections. An example can be found below. The connections among users 1, 2 and 3 can be easily observed, as they share images with similar objects or themes. Detecting social signals from those images helps to discover the connections among them. The CNN can be optimised by:

$ \textbf{W}^*=\arg\max_{\textbf{W}} A(\textbf{W}) $

where $A(\textbf{W}) $ is an analytic function to evaluate a $\textbf{W}$.

System for connection discovery [J6]

Supporting multimedia big data for connection discovery is always difficult, especially for real-time applications. Systems are implemented and analysed to support multimedia big data applications, and a newly proposed computation framework is shown to reduce computation time by 90%.

Applications [J2]

Many social media applications, such as recommendation, virality prediction and marketing, make use of user gender, which may not be explicitly specified or may be kept private. A multimedia big data system that utilizes this observed phenomenon is proposed for user gender identification, and it achieves 74% accuracy.

Connection Discovery through User-shared Images: from Multimedia Big Data Characterization, Analytics to Applications


Publications