|
1.INTRODUCTIONPredicting unobserved connections or future connections based on existing connections in the network topology as a link prediction problem, it is an important topic in physics and computer science, the basic idea is that the existing connection structure of a network can indicate that new connections are more likely to occur in an evolving network, or that unobserved connections are missing in a partially known network1-4. Link prediction has many applications. For example, friends recommendation in social network, product recommendation in shopping website5-7, pre-experimental analysis in protein interaction networks and metabolic networks8. In addition, link prediction can also reveal the structural growth or formation mechanism of complex networks9. Link prediction has been studied by scientists from different fields because of its wide application value and theoretical significance. Researchers have achieved great success in many similarity algorithms, especially topological similarity algorithms10. There are three kinds of topology similarity algorithms based on path length: local similarity algorithm, global similarity algorithm and semi-local similarity algorithm. The local similarity index considers the influence of the common neighbors of two disconnected nodes on the connection. For example, CN Index11 considers the amount of common neighbors. Adamic-Adar (AA) Index12 study the contribution of the degree of the common neighbor node to the connection; Resource Allocation Index13 considers suppressing the common neighbors of large nodes. The global similarity algorithm considers the topology of the whole network, Katz Index14 computes all paths among two disconnected points under the condition that short paths are given priority. Local Random Walk (LRW) Index15 limits the random number in the semi-local range; Superposed Random Walk Index15 superposed LRW contributions to paths of different lengths. Some researchers can reflect the influence of nodes better by using mixed indicators and improve the prediction results when applying them to SRW. For example, Zhu et al.16 proposed SHI model and HHI model, using node degree and H-index to jointly reflect node influence. In the DCHI model and HCHI model proposed by Tian Yang et al.17, degrees mixed with coreness and H-index mixed with coreness are respectively used to jointly reflect the influence of node. In the HIN model proposed by Gao et al.18, the value of degree and H-index is used to reflect node influence. So far, there is a lot of research on mixed indicators to express node influence, these studies have not fully explored the value of mixed indicators. Through in-depth study, based on the HIN model, we introduce a weight factor β to balance the degree and H-index, so as to propose the DHIN model, which can reflect the node influence more accurately. Figure 1 describes some of the attributes of the node, such as H-index and degree. The degree of node a is 3, the degrees of the three neighboring nodes of node a are 3, 4, and 5 respectively, so the H-index is 3. When only study the degree of endpoints, the influence of endpoint a is 3. When the influence of node a is multiplied by the degree and H-index, the influence of node a is 9. By contrast, the hybrid influence promotes the connection of the two nodes. Figure 1.Note: Endpoint a represents a source node with H-index and degree are both 3. Node b represents a destination node. All solid and dotted lines represent existing and potential links, respectively. In this article, we propose a DHIN model. Through the depth investigation of twelve reference data sets and extensive experiments show that DHIN can improve predictive performance compared to these mainstream standards and HIN. In Section 2, we establish a DHIN model. Sections 3 introduce twelve benchmark experimental data sets. Sections 4 show the methods. Section 5 shows the discussion and results. Section 6 summarizes this research. 2.MODELThe model in this study is oriented to an undirected network G(V, E), where V and E represents the group of points and connections, respectively. For each pair of nodes x, y ⊂ V, there is a fraction sxy that represents the probability of connection between them. In this paper, the value of similarity is directly used as the score of sxy. After the score of all non-existent connection is arranged in descending order, the connection with the highest score is most likely to exist in the future or already exist but not detected. 2.1HIN modelAccording to Lu et al.15, Zhu et al.18 introduced the mixed influence index as a metric into the SRW proposed by Lu et al.15, then constructing the HIN model. Lu et al.15 established a similarity index using a random walk. The probability of one-step transition of two endpoints X and Y through a random walk using a Markov chain is pxy = axy/kx, where kx represents the degree of x, axy = 1 when x is connected to y, and axy = 0 if not. When the step size is t, the nodes x and y are represented as {x = x0 = yt, x1 = yt–1, …, xt–1 = y1, xt = y0 = y}. Therefore, the t-step transition probability from node X to node Y is denoted as and . Then, Zhu et al. replaced the degree effect in SRW by combining the H-index and degree, so as to realize the mixed effect HIN of neighbor index. 3.EXPERIMENTAL DATAOur dataset of 12 real networks in this experiment: (1) US Air97(USAir)19 uses the data from the United States airline network. (2) Yeast PPI(Yeast)20 uses data from the yeast networks on protein-protein relationships. (3) Food Web (Food)21 uses data from carbon exchange relationships. (4) Power Grid (Power)22 uses data from the Electric transmission Network in the Western US. (5) NetScience (NS)23 uses data based on the collaboration of scientists by paper in scientific networks. (6) Jazz24 uses the data from the musicians network. (7) e-mail network (e-mail)25 uses the data from the e-mail communication network of URV. (8) Slavko26 uses the data from Slavko Zitnik’s friendship network on social network. (9) UC social network (UCsocial)27 uses the data from an online social network formed by UC Irvine students. (10) Infectious (Infec)28 uses the data from a network of offline contacts made by visitors through an exhibition called “Infectious: Stay Away”. (11) EuroSiS web (EuroSiS)29 uses data from an interactive network of “social science” participants. (12) C. elegans (CE)22 uses the data from the C. elegans worm neuron network. Table 1 gives the basic topology characteristics of the twelve networks. Table 1.Seven fundamental topological characteristics of the 12 benchmark networks.
Notes: |V| describes the number of nodes, |E| represents links, 〈k〉 represents the average degree, 〈d〉 denotes the average distance, C denotes the clustering coefficient, r describes the assortativity coefficient, H defined as H = 〈k2〉/〈k〉2, denotes the degree heterogeneity. In order to realize the preprocessing, the arc is changed into a directionless link. To ensure that networks have no authority and no direction, we need to eliminate loops and multilaterals. Then, on the premise of ensuring link connectivity, the maximum link is extracted to simplify the network subgraph. First, the network connection set is randomly divided into two parts, 90% of the training set ET and 10% of the test set EP while the connectivity of ET is ensured1. In addition, there are 30 identical and independent branches on the network. Next, to achieve average accuracy in a statistical manner, we performed the experimental procedure on 30 separate training and test sets to, and recalled over 30 implementation measures. 4.EXPERIMENTAL METHODS4.1MetricWe use AUC30 to measure the accuracy of algorithms. It can be interpreted as the probability that a potential link (a link in EP) score higher than a non-existent link (a link in U-E,). In algorithm implementation, we usually calculate a score for each unobserved link. Then, each time we randomly select a missing link and a non-existent link to compare their scores, if in n independent comparisons, there are n′ missing links with higher scores and n″ missing links with the same scores, then the AUC value is For purely probabilistic algorithms, the AUC tends to 0.5 when each result is independent. Therefore, as long as the accuracy of the algorithm is greater than 0.5, the performance of the algorithm is better than the pure probability algorithm. 4.2BaselinesWe describe the following six basic models: CN Index11 focusing the number of common neighbors, which is defined as where Г(X), X ∈ {x, y}, describes the set of neighbors of X and |Г(x) ∩ Г(y)| represent the common neighbors between nodes x and y. AA Index12, which is improved from CN, using the inverse logarithm to suppress the contribution of a large degree of common neighbors, it defined as where kz describe the endpoint degree of z. Resource-Allocation (RA)13, originated from AA. The main idea is that if the degree of neighbor is larger, the possibility of nodes being connected is smaller, it defined as Local Path (LP) Index15 takes into account the similarity between specific step-size paths(two and three step), which is defined as where A describes the adjacency matrix, ε represents a penalty parameter. SRW Index15 limits the random number within the quasi-local range then superposed contributions to paths of different lengths, which is defined as The probability of one-step transition of two endpoints X and Y through a random walk using a Markov chain is pxy = axy/kx, where kx describe the degree of node X, axy = 1 when x is connected to Y, and axy = 0 if not. When the step size is t, the nodes x and y are represented as {x = x0 = yt, x1 = yt–1, …, xt–1 = y1, xt = y0 = y}. Therefore, the t-step transition probability from node X to node Y is denoted as and . 5.RESULTSTo examine the performance of our model, we ran simulations on twelve data sets and compared the resulting data with the primary baseline, and the results are discussed below. According to the description in Sections 1 and 2, we believe that integrating the endpoints H-index and degree can better describe node influence and improve link prediction performance. To verify our thoughts, we propose a new model DHIN. First, we obtain the value of random walk t corresponding to the maximum of the predicted results when there is no weight (excluding the influence of inhibiting factor β). The corresponding values of t are shown in parentheses after DHIN in Figure 2; DHIN gets the optimal AUC value with the least steps t, which is 3 in Figure 2a USAir, 3 in Figure 2b Yeast, 3 in Figure 2c Food, 14 in Figure 2d Power, 8 in Figure 2e NS, 2 in Figure 2f Jazz, 7 in Figure 2g Email, 4 in Figure 2h Slavko, 9 in Figure 2i UCsocial, 4 in Figure 2j Infec, 5 in Figure 2k Eurosis and 4 in Figure 2l CE. Then we set the weighting factor β from 0.1 to 0.9 at 0.1 intervals. In Figure 2, DHIN shows its optimal values of AUC at certain inhibitory factor of β ∈ [0,1) on random walk steps t in the different datasets, i.e., β = 0.3 in Figure 2a USAir, β = 0.7 in Figure 2b Yeast, β = 0.9 in Figure 2c Food, β = 0.8 in Figure 2d Power, β = 0.4 in Figure 2e NS, β = 0.6 in Figure 2f Jazz, β = 0.6 in Figure 2g Email, β = 0.8 in Figure 2h Slavko, β = 0.1 in figure 2i UCsocial, β = 0.4 in figure 2j Infec, β = 0.7 in Figure 2k Eurosis and β = 0.1 in Figure 2l CE. DHIN introduced the idea of weight into hybrid influence index based on HIN. We study the relationship between weights β and AUC of two endpoints with typical predictive length L=100, under the condition that the maximum random walk step t (t is from 1 to 15) of each data set obtained by HIN remains unchanged. The optimal values are calculated by the parameter β from 0.1 to 0.9. The detailed data in Figure 2 are shown in Table 2 and compares the AUC on WHIN with six models. DHIN obtained the optimal value on seven data sets, four for SRW (Power, Slavko, Ucsocial, EuroSiS) and one for LP (Food). By comparison, DHIN improved predictive performance. Table 2.AUC of seven models on twelve benchmark network under the condition of L = 100.
In Table 2, the value of L represents the number of candidate links. Each data point is an average of over 30 separate implementation processes, and each point represents a random partition of 90-10% of the training and test sets. The maximum values are bold. On SRW, HIN and DHIN the first value in the parentheses represent the corresponding optimal random walk step t. On DHIN the second value is optimal weight β. It can be concluded from Table 2 that DHIN’s prediction results are better than other models in more than half of the data sets. All results represent the optimal situation by adjusting the coefficient. In addition, computational complexity needs to be considered on describe link prediction performance. The time complexity of CN, AA, RA have O(N3) and LP, SRW, HIN, DHIN have W × O(N3) with coefficient W. To sum up, DHIN showed a significant improvement with remain the time complexity. 6.CONCLUSIONSThe value of the mixing coefficient has not been fully explored in the existing link prediction studies using node hybrid influence. Through analysis, we propose a new DHIN index focusing on DHIN. By comparing the test results on twelve datasets with six indices, we investigate the utility of the mixed effects of tenure weight factors in link prediction. Therefore, the accuracy of DHIN proposed in this paper is obviously superior to other indexes. REFERENCESLü, L. and Zhou, T., Physica A: Statistical Mechanics and Its Applications, 390
(6), 1150
–70
(2011). https://doi.org/10.1016/j.physa.2010.11.027 Google Scholar
Lü, L., Medo, M., Yeung, C. H., Zhang, Y. C., Zhang, Z. K. and Zhou, T., Physics Reports, 519
(1), 1
–49
(2012). https://doi.org/10.1016/j.physrep.2012.02.006 Google Scholar
Chi, K., Yin, G., Dong, Y. and Dong, H., Knowledge-Based Systems, 181 104792
(2019). https://doi.org/10.1016/j.knosys.2019.05.035 Google Scholar
Wang, Y., Wang, Y., Lin, X. and Wang, W., Discrete Dynamics in Nature and Society, 6148273
(2020). Google Scholar
Wang, W., Liu, Q. H., Liang, J., Hu, Y. and Zhou, T., Physics Reports, 820 1
–51
(2019). https://doi.org/10.1016/j.physrep.2019.07.001 Google Scholar
Gurini, D. F., Gasparetti, F., Micarelli, A. and Sansonetti, G., Future Generation Computer Systems, 78 430
–39
(2018). https://doi.org/10.1016/j.future.2017.03.020 Google Scholar
Xiong, F., Wang X, Pan S, Yang H, Wang, H. and Zhang, C., IEEE Transactions on Systems, Man, and Cybernetics: Systems, 50
(10), 3804
–16
(2018). Google Scholar
Battiston, F., et al, Physics Reports, 874 1
–92
(2020). https://doi.org/10.1016/j.physrep.2020.05.004 Google Scholar
Zhu, B. and Xia, Y., PloS one, 11
(2), e0148265
(2016). https://doi.org/10.1371/journal.pone.0148265 Google Scholar
Lü, L., Pan, L., Zhou, T., Zhang, Y. C. and Stanley, H. E.,
in Proceedings of the National Academy of Sciences,
2325
–30
(2015). Google Scholar
Newman, M. E., Physical Review E, 64
(2), 025102
(2001). https://doi.org/10.1103/PhysRevE.64.025102 Google Scholar
Adamic, L. A. and Adar, E., Social Networks, 25
(3), 211
–30
(2003). https://doi.org/10.1016/S0378-8733(03)00009-1 Google Scholar
Zhou, T., Lü, L. and Zhang, Y. C., The European Physical Journal B, 71
(4), 623
–30
(2009). https://doi.org/10.1140/epjb/e2009-00335-8 Google Scholar
Katz, L., Psychometrika, 18
(1), 39
–43
(1953). https://doi.org/10.1007/BF02289026 Google Scholar
Liu, W. and Lü, L., EPL (Europhysics Letters), 89
(5), 58007
(2010). https://doi.org/10.1209/0295-5075/89/58007 Google Scholar
Zhu, X., Li, W., Tian, H. and Cai, S., EPL (Europhysics Letters), 122
(6), 68003
(2018). https://doi.org/10.1209/0295-5075/122/68003 Google Scholar
Tian, Y., Wang, Y., Tian, H. and Cui, Q., Complexity, 1544912
(2021). Google Scholar
Gao, T. and Zhu. X., International Journal of Modern Physics B, 34
(05), 2050018
(2020). https://doi.org/10.1142/S0217979220500186 Google Scholar
Batagelj, V. and Mrvar, A., Connections, 21
(2), 47
–57
(1998). Google Scholar
Bu D, et al, Nucleic Acids Research, 31
(9), 2443
–50
(2003). https://doi.org/10.1093/nar/gkg340 Google Scholar
Melián, C. J. and Bascompte, J., Ecology, 85
(2), 352
–358
(2004). https://doi.org/10.1890/02-0638 Google Scholar
Yan, G., Zhou, T., Hu, B., Fu, Z. Q. and Wang, B. H., Physical Review E, 73
(4), 046108
(2006). https://doi.org/10.1103/PhysRevE.73.046108 Google Scholar
Holme, P. and Newman, M. E., Physical Review E, 74
(5), 056108
(2006). https://doi.org/10.1103/PhysRevE.74.056108 Google Scholar
Gleiser, P. M. and Danon, L., Advances in Complex Systems, 6
(04), 565
–73
(2003). https://doi.org/10.1142/S0219525903001067 Google Scholar
Guimera, R., Danon, L., Diaz-Guilera, A., Giralt, F. and Arenas, A., Physical Review E, 68
(6), 065103
(2003). https://doi.org/10.1103/PhysRevE.68.065103 Google Scholar
Opsahl, T. and Panzarasa, P., Social Networks, 31
(2), 155
–63
(2009). https://doi.org/10.1016/j.socnet.2009.02.002 Google Scholar
Blagus, N., Šubelj, L. and Bajec, M., Physica A: Statistical Mechanics and Its Applications, 391
(8), 2794
–802
(2012). https://doi.org/10.1016/j.physa.2011.12.055 Google Scholar
Isella, L., Stehlé, J., Barrat, A., Cattuto, C., Pinton, J. F. and Van den Broeck, W., Journal of Theoretical Biology, 271
(1), 166
–80
(2011). https://doi.org/10.1016/j.jtbi.2010.11.033 Google Scholar
Ermiş, B., Acar, E. and Cemgil, A. T., Data Mining and Knowledge Discovery, 29
(1), 203
–36
(2015). https://doi.org/10.1007/s10618-013-0341-y Google Scholar
Kitsak, M., Gallos, L. K., Havlin, S., Liljeros, F., Muchnik, L., Stanley, H. E. and Makse, H. A., Nature Physics, 6
(11), 888
–93
(2010). https://doi.org/10.1038/nphys1746 Google Scholar
|