1 Introduction

As location-based social networks become increasingly popular, the locations of social network users attract more attention than before. Location information helps narrow the gap between the virtual and the real world. Examples include monitoring residents’ public health problems through online networks [1], recommending local activities or attractions to tourists [2, 3], and identifying emergencies and even the location of a disaster [4,5,6]. In addition, users’ offline activity areas and trajectories can be analyzed through their locations in social networks. With growing awareness of privacy protection, however, people submit their personal location information cautiously or restrict the visibility of the location attached to their messages, which makes their real location information difficult to acquire. Therefore, how to accurately predict the actual location of social network users is an important and meaningful research question.

This paper proposes a location prediction algorithm for social network users based on label propagation. It addresses the following two key problems:

  1. The accuracy of the traditional label propagation algorithm is not high in user location prediction, and a “countercurrent” phenomenon appears in the iterative process, which increases the time overhead.

  2. How to improve the accuracy of location prediction for social network users by using their offline activity locations.

2 Related Work

There are three scenarios for user location prediction in social networks: prediction of a user’s frequent location, prediction of the location from which a message is posted, and prediction of the locations mentioned in messages. The main methods of location prediction are based on the content of the messages published by users, on users’ friend relationships, and so on.

Laere et al. chose two types of local vocabulary and extracted valid words to predict the location of users [7]. Ren [8] and Han et al. [9], inspired by inverse document frequency, used the inverse location frequency (ILF) and the inverse city frequency (ICF) to select location vocabulary. They assumed that location words should be distributed over fewer locations and have large ILF and ICF values. Mahmud et al. [10] applied a series of heuristics to select local vocabulary. Cheng [1] fitted the distribution of location words to the spatial variation model proposed by Backstrom [11]. They then marked 19,178 dictionary words as local or non-local and used the labeled vocabulary to train a classification model that discriminates all words in the tweet dataset.

Backstrom [12] established a probability model based on the physical distance between users to express the likelihood of a relationship between them. The model does not account for the different degrees of closeness between friends when predicting location. Kongl [13] extended Backstrom’s work by adding edge weights to predict the user’s position, where the weight of an edge is determined by a social closeness coefficient. Li [14] considered the locations of a user’s neighbors, based on the intuition that the neighbors’ information reflects the user’s location. User locations are first assigned randomly; each user’s location is then iteratively updated from the user’s neighbors and the location names mentioned, and the update parameters are improved by measuring the prediction error on users with known locations. Davis Jr et al. [15] used the locations that appear most frequently in a user’s social network as the basis for predicting the user’s location. Jurgens et al. [16] extended location prediction into location label propagation, interpreting label propagation over a label space of locations; in their view, a user’s position is obtained through many iterations.

Li et al. [17] argued that assuming a user has only one home location, as the literature does, is a defect, and that a user should be related to multiple locations. They therefore defined the location profile of a user as a set of locations. Each such location is a geographic region rather than a point, and it is related to the user over the long term rather than temporarily. They built the Multiple Location Profiling (MLP) model, which constructs a location profile containing multiple locations for a user, based on the target user’s relationships and the content of the tweets the user publishes.

The label propagation algorithm can deal effectively with large data sets, so this paper predicts user locations on the basis of label propagation. On closer study, however, we found that location labels exhibit a “countercurrent” phenomenon during propagation and that label updates and node selection are random, so the algorithm cannot guarantee the accuracy of user location prediction. To improve the accuracy of location prediction and reduce the time overhead, this paper proposes a user location prediction algorithm based on label propagation (Label Propagation Algorithm-Location Prediction, LPA-LP).

3 Related Concept and Problem Definition

Definition 1

Social Network. A social network can be represented by a graph G = (V, E, A), where V is the set of users in the social network, with n = |V|; E is the set of relationships between users, with m = |E|; and A is the set of activities, with a = |A|. In addition, L is the set of locations, including users’ locations and activities’ locations, with nl = |L|. U0 is the set of users whose locations are known, and Un is the set of users whose locations are unknown.

Definition 2

Shortest Path Length. It is the length of the shortest path between two nodes i and j in the social network graph, that is, the minimum number of edges that must be traversed to reach node j from node i. The shortest path length between two nodes is denoted by d(i, j).

Definition 3

K-Hop Neighbors. A node is a k-hop neighbor of a user if k hops are needed to reach it from the user, that is, the shortest path length between the two nodes is k.
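Definitions 2 and 3 can be illustrated with a breadth-first search on an unweighted graph. The following is a minimal sketch, not code from the paper; the function names `hop_distances` and `k_hop_neighbors` and the toy friendship graph are hypothetical and chosen only for illustration.

```python
from collections import deque

def hop_distances(adj, source):
    """Return {node: d(source, node)} for every node reachable from source (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:          # first visit gives the shortest hop count
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def k_hop_neighbors(adj, source, k):
    """Return the nodes whose shortest path length from source is exactly k."""
    return {v for v, d in hop_distances(adj, source).items() if d == k}

# Hypothetical toy friendship graph (undirected, stored as adjacency lists).
adj = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

print(hop_distances(adj, "a")["e"])          # shortest path length d(a, e)
print(sorted(k_hop_neighbors(adj, "a", 2)))  # 2-hop neighbors of a
```

In this toy graph, node e is three hops from a, and the only 2-hop neighbor of a is d.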

Definition 4

K-Hop Public Neighbors. Let G = (V, E, A) be a social network graph, where V is the set of users in the graph, E = {(vi, vj, wij)} is the set of weighted relations between user nodes, and wij is the weight of the edge between nodes vi and vj. The set of k-hop public neighbors of two nodes is defined as follows:

$$ \Gamma_{k}\left( v_{i}, v_{j} \right) = \left\{ v \mid d\left( v_{i}, v \right) = d\left( v_{j}, v \right) = k \right\} = \Gamma\left( v_{i}, k \right) \cap \Gamma\left( v_{j}, k \right), \quad k \ge 1 $$
(1)

In formula (1), \( \Gamma \left( {v_{i} ,k} \right) \) represents the set of k-hop neighbors of node vi, \( \Gamma \left( {v_{j} ,k} \right) \) represents the set of k-hop neighbors of node vj, and \( \Gamma_{k} \left( {v_{i} ,v_{j} } \right) \) represents the set of k-hop public neighbors of vi and vj.
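Formula (1) reduces to a set intersection once the k-hop neighbor sets are available. The sketch below illustrates this on an unweighted toy graph; the helper names `k_hop_set` and `k_hop_public_neighbors` and the graph itself are assumptions made for illustration, not part of the paper.

```python
from collections import deque

def k_hop_set(adj, source, k):
    """Gamma(source, k): nodes at shortest path length exactly k from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k:               # nodes beyond k hops are not needed
            continue
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return {v for v, d in dist.items() if d == k}

def k_hop_public_neighbors(adj, vi, vj, k):
    """Gamma_k(vi, vj) = Gamma(vi, k) intersected with Gamma(vj, k), for k >= 1."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k_hop_set(adj, vi, k) & k_hop_set(adj, vj, k)

# Hypothetical toy friendship graph (undirected adjacency lists).
adj = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

print(sorted(k_hop_public_neighbors(adj, "b", "c", 1)))
print(sorted(k_hop_public_neighbors(adj, "a", "e", 2)))
```

Here b and c share the 1-hop public neighbors a and d, while a and e have no 2-hop public neighbors because Γ(a, 2) = {d} and Γ(e, 2) = {b, c} are disjoint.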

Definition 5

Similarity of K-Hop Public Neighbors. The value of k is determined by the network itself and is defined by formula (2).

$$ \overline{k} = \frac{\sum\limits_{i \ne j} k_{\max} \left| \Gamma\left( i \right) \cap \Gamma\left( j \right) \right|}{\left| V \right|} $$
(2)

In formula (2), \( k_{\max} \left| {\Gamma \left( i \right) \cap\Gamma \left( j \right)} \right| \) represents the maximum number of hops at which two nodes have public neighbors. The value of k for the network is the average of this quantity over any two nodes in the network. The similarity of the k-hop public neighbors of two nodes is defined by formula (3).

$$ S\left( {v_{i} ,v_{j} } \right) = \frac{{\left| {\Gamma \left( {v_{i} ,k} \right) \cap\Gamma \left( {v_{j} ,k} \right)} \right|}}{{\left| {\Gamma \left( {v_{i} ,k} \right) \cup\Gamma \left( {v_{j} ,k} \right)} \right|}},k \ge 1 $$
(3)

Definition 6

Similarity of Nodes. It is the similarity of the k-hop public neighbors of two nodes with the two nodes themselves subtracted from the size of the denominator. It is defined by formula (4).

$$ \gamma = \frac{{\left| {\Gamma \left( {v_{i} ,k} \right) \cap\Gamma \left( {v_{j} ,k} \right)} \right|}}{{\left| {\Gamma \left( {v_{i} ,k} \right) \cup\Gamma \left( {v_{j} ,k} \right)} \right| - 2}},k \ge 1 $$
(4)
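Formulas (3) and (4) can be checked numerically on a small graph. The following sketch assumes an unweighted graph; the function names, the toy graph, and the choice to return 0.0 when a denominator is not positive are assumptions made here, since the paper does not state how that case is handled.

```python
from collections import deque

def k_hop_set(adj, source, k):
    """Gamma(source, k): nodes at shortest path length exactly k from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return {v for v, d in dist.items() if d == k}

def public_neighbor_similarity(adj, vi, vj, k=1):
    """S(vi, vj) from formula (3): |intersection| / |union| of the k-hop sets."""
    gi, gj = k_hop_set(adj, vi, k), k_hop_set(adj, vj, k)
    union = len(gi | gj)
    return len(gi & gj) / union if union > 0 else 0.0   # assumption: 0.0 if empty

def node_similarity(adj, vi, vj, k=1):
    """gamma from formula (4): same numerator, denominator reduced by 2."""
    gi, gj = k_hop_set(adj, vi, k), k_hop_set(adj, vj, k)
    denom = len(gi | gj) - 2
    return len(gi & gj) / denom if denom > 0 else 0.0   # assumption: 0.0 if <= 0

# Hypothetical toy graph in which b and c are adjacent and share neighbors a and d.
adj = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

print(public_neighbor_similarity(adj, "b", "c"))  # 2 shared / 4 in union
print(node_similarity(adj, "b", "c"))             # 2 shared / (4 - 2)
```

For the adjacent pair b and c, the union {a, b, c, d} contains the two nodes themselves, so subtracting 2 leaves exactly the other candidates and γ = 1. For a non-adjacent pair such as a and d, the union does not contain the two nodes, so with this literal reading of formula (4) γ can exceed 1.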

Definition 7

The Max Degree between Nodes and User Sets. Suppose the users are divided into different sets \( L_{1} ,L_{2} , \ldots ,L_{e} \) according to their locations, and the nodes are the users who have no location label. The max degree between a node and the user sets is the maximum of the degrees between that node and each of the sets. It is defined by formula (5).

$$ d\left( v_{i}, L_{i} \right) = \max \left\{ d\left( v_{i}, L_{1} \right), d\left( v_{i}, L_{2} \right), \ldots, d\left( v_{i}, L_{e} \right) \right\} $$
(5)
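Formula (5) can be sketched as follows, under the assumption that the degree d(vi, Lj) between a node and a location set is the number of the node's direct neighbors that belong to that set. The paper does not spell this out, so the interpretation, the function names, and the example data are all hypothetical.

```python
def degree_to_set(adj, node, members):
    """Assumed meaning of d(node, L_j): number of neighbors of node inside L_j."""
    return sum(1 for v in adj.get(node, ()) if v in members)

def max_degree_location(adj, node, location_sets):
    """Return (location, degree) maximizing d(node, L_j), as in formula (5).

    Ties are broken by location name so the result is deterministic.
    """
    degrees = {loc: degree_to_set(adj, node, members)
               for loc, members in location_sets.items()}
    best = min(degrees, key=lambda loc: (-degrees[loc], loc))
    return best, degrees[best]

# Hypothetical example: user u has no location label; its neighbors do.
adj = {"u": ["x1", "x2", "y1", "z1"]}
location_sets = {
    "Beijing": {"x1", "x2", "x3"},
    "Shanghai": {"y1"},
}

print(max_degree_location(adj, "u", location_sets))
```

In this example two of u's neighbors are in the Beijing set and one is in the Shanghai set, so the max degree is 2 and is attained for Beijing. The neighbor z1 belongs to no set and does not contribute.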

Definition 8

K-Hop Weight. We believe that a user’s 1-hop neighbors have the most important impact on the user’s location, so the weight of the edge to a 1-hop neighbor is 1. The offline location of a user also has a great impact on the user’s location, so its weight is also set to 1. For k > 1, the edge weight is attenuated at a rate of 1/5 per hop, that is, the weight of a 1-hop neighbor is 1, the weight of a 2-hop neighbor is 0.8, and so on.
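The weights in Definition 8 can be written as a small function. The sketch below assumes that "and so on" means a linear decrease of 1/5 per additional hop, clamped at zero; the paper gives only the values for k = 1 and k = 2, so the linear continuation and the name `hop_weight` are assumptions.

```python
def hop_weight(k):
    """Weight of a k-hop neighbor: 1 for k = 1, reduced by 1/5 per extra hop.

    Assumption: the decrease is linear and never goes below 0.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return max(0.0, (6 - k) / 5)

# Assumption from Definition 8: an offline activity location counts like a 1-hop neighbor.
OFFLINE_ACTIVITY_WEIGHT = 1.0

for k in range(1, 7):
    print(k, hop_weight(k))
```

Under this reading the weight reaches 0 at k = 6, so neighbors six or more hops away would not influence the prediction.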

The location prediction problem is now defined as follows: in the social network G, given a user u whose location information is unknown, predict the location of u as a probability over the locations in L, based on the known location information and on the k-hop neighbors of u.

4 Label Propagation Based User Location Prediction Algorithm

This section proposes an algorithm for predicting the locations of social network users whose location information is unknown. The proposed location prediction algorithm based on label propagation (Label Propagation Algorithm-Location Prediction, LPA-LP) consists of two parts: a data preprocessing algorithm that runs before label propagation, and a location prediction algorithm that uses label propagation.

Figure a. Algorithm 1 (data preprocessing before label propagation)

Algorithm 1 preprocesses and initializes the data set before the label propagation algorithm runs. According to Definition 5, the node with the maximum similarity and its k-hop neighbors are taken as an initial set for user location prediction, and the data in each set are labeled according to the known labels. The purpose is to let the label propagation algorithm predict the locations of user nodes with unknown location information quickly and accurately. After the data set is preprocessed, the location prediction algorithm based on label propagation is used to predict the locations of the users in the processed data set who have no location label. Algorithm 2 describes the location prediction algorithm based on label propagation (LPA-LP).

Figure b. Algorithm 2 (location prediction algorithm based on label propagation, LPA-LP)

In Algorithm 2, user location labels are updated during the iterative process, and both the locations of a user’s neighbors and the locations of the offline activities in which the user participates are taken into account, which improves the accuracy of user location prediction. In addition, preprocessing the data set before label propagation improves the performance of the algorithm. Both claims are examined experimentally in the following section.
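The general idea described above, asynchronous label updates that combine neighbor locations with offline activity locations, can be sketched as follows. This is a generic weighted label propagation with fixed seed labels, not a reproduction of Algorithm 2; the function name `propagate_locations`, the `activity_votes` structure, the deterministic tie-breaking, and the example data are all assumptions made for illustration.

```python
from collections import defaultdict

def propagate_locations(adj, known, activity_votes=None, max_iters=20):
    """Generic asynchronous label propagation (illustrative sketch only).

    adj:            {node: {neighbor: edge_weight}}
    known:          {node: location} for users with known locations (kept fixed)
    activity_votes: {node: {location: weight}} from offline activities (assumed input)
    """
    labels = dict(known)
    activity_votes = activity_votes or {}
    unknown = sorted(n for n in adj if n not in known)   # fixed update order
    for _ in range(max_iters):
        changed = False
        for node in unknown:
            votes = defaultdict(float)
            for nbr, w in adj[node].items():
                if nbr in labels:                  # only labeled neighbors vote
                    votes[labels[nbr]] += w
            for loc, w in activity_votes.get(node, {}).items():
                votes[loc] += w                    # offline activity locations vote too
            if not votes:
                continue
            # Highest total weight wins; ties are broken by name instead of randomly.
            best = min(votes, key=lambda loc: (-votes[loc], loc))
            if labels.get(node) != best:
                labels[node] = best                # asynchronous: visible immediately
                changed = True
        if not changed:
            break
    return labels

# Hypothetical data: u1 and u5 have known locations; u4 attended an activity in Shanghai.
adj = {
    "u1": {"u2": 1.0, "u3": 1.0},
    "u2": {"u1": 1.0, "u3": 1.0, "u4": 1.0},
    "u3": {"u1": 1.0, "u2": 1.0},
    "u4": {"u2": 1.0, "u5": 1.0},
    "u5": {"u4": 1.0},
}
known = {"u1": "Beijing", "u5": "Shanghai"}
activity_votes = {"u4": {"Shanghai": 1.0}}

print(propagate_locations(adj, known, activity_votes))
```

In this example u4 has one neighbor vote for each city, and the offline activity vote decides in favor of Shanghai. Keeping the known labels fixed and replacing random tie-breaking with a deterministic rule are design choices of this sketch; they are meant only to hint at how label updates can be made less random, not to describe how LPA-LP handles the “countercurrent” phenomenon.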

5 Experiment Result and Analysis

This section analyzes the experimental results in two parts: the time overhead of the algorithm and the accuracy of user location prediction.

5.1 Data Set Description

The data set used in this paper is the NLPIR microblogging corpus. To compare the accuracy and the execution efficiency of the improved algorithm for user location prediction, we extracted several data sets of different scales from the corpus. The details of these data sets are described in Table 1.

Table 1. Data sets description

5.2 Experimental Results Analysis

The location prediction algorithm based on label propagation (LPA-LP) improves the preprocessing of the data set and the selection strategy for location labels in the iterative process. It avoids the “countercurrent” phenomenon of location labels and reduces the randomness of location label updates, which improves the efficiency and accuracy of prediction. The experiment is divided into two parts. The first part uses the label propagation algorithm (LPA) to predict user locations on four data sets of different sizes. The second part uses the LPA-LP algorithm to predict locations on the four data sets.

In the process of user location prediction, both the LPA algorithm and the LPA-LP algorithm update node labels with some degree of randomness, so different runs of the two algorithms may produce different results. Therefore, on each of the four data sets of different sizes, the algorithms were run 10, 30, 50, 70, and 100 times, and the mean value of the results was taken. The time required to run the experiment on the data sets of different scales is shown in Figs. 1, 2, 3 and 4.

Fig. 1. Time overhead comparison with dataset A

Fig. 2. Time overhead comparison with dataset B

Fig. 3. Time overhead comparison with dataset C

Fig. 4. Time overhead comparison with dataset D

These four figures show that the running times of the improved algorithm LPA-LP and the algorithm LPA are similar when the data set has fewer than 5000 nodes. When the data set has more than 9000 nodes, the running time of LPA-LP is clearly lower than that of LPA. This suggests that the LPA-LP algorithm can be applied effectively to large-scale data sets.

In addition to comparing the running time of the algorithm, it is necessary to compare the accuracy of the algorithm. The results of the experiment are shown in Table 2.

Table 2. Algorithm accuracy comparison

6 Conclusion

This paper proposes a location prediction algorithm for social network users based on label propagation. The algorithm first obtains the k-hop public neighbors of any two nodes in the social network graph, uses the node with the largest similarity and its k-hop neighbors as the initial set for label propagation, and calculates the degree of each node to these sets. In each iteration, nodes are updated asynchronously, and the node with the highest degree is selected to update its location label, which avoids the “countercurrent” phenomenon of location labels and reduces the possibility of random label updates. The experiments show that the algorithm proposed in this paper improves the accuracy of user location prediction and reduces the time cost of the algorithm.