Link prediction setup for the Cora dataset
What is link prediction
From here, link prediction can be summarized as follows:
Given a snapshot G = <V, E> of a social network at some moment and two nodes v_i and v_j, link prediction estimates the probability of a link between v_i and v_j. This definition shows that the task falls into two categories. The first predicts new links that will appear in the future. The second predicts hidden links that already exist in the current graph but have not been observed.
Traditional link prediction algorithms
The simplest framework for link prediction is based on the similarity between nodes. Every pair of nodes x and y is assigned a score Similarity(x, y), given by a similarity function between the two nodes; pairs with higher scores are considered more likely to be linked.
Two algorithms are introduced here:
- Common Neighbors: the number of neighbors shared by node x and node y.
- Random Walk with Restart (RWR): 1) P_uv is the steady-state probability that a random walker starting from node u is at node v. 2) Similarity(u, v) = P_uv + P_vu.
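The two scores above can be sketched in a few lines of plain Python. This is only an illustration, not code from any of the referenced papers: the helper names (build_adjacency, common_neighbors, rwr_probabilities, rwr_similarity), the restart probability of 0.15, and the fixed number of power iterations are all assumptions made for this example, and the graph is assumed to be undirected with no isolated nodes.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency map {node: set of neighbors} from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def common_neighbors(adj, x, y):
    """Common Neighbors score: how many nodes are adjacent to both x and y."""
    return len(adj.get(x, set()) & adj.get(y, set()))


def rwr_probabilities(adj, source, restart=0.15, iters=100):
    """Approximate steady-state visit probabilities of a random walker that
    starts at `source` (assumed to be a node of `adj`). At each step it jumps
    back to `source` with probability `restart` (an assumed value) and
    otherwise moves to a uniformly chosen neighbor."""
    p = {n: 0.0 for n in adj}
    p[source] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for u, mass in p.items():
            if mass == 0.0:
                continue
            # Spread the non-restarting mass evenly over the neighbors of u.
            share = (1.0 - restart) * mass / len(adj[u])
            for w in adj[u]:
                nxt[w] += share
        # The restarting mass returns to the source node.
        nxt[source] += restart
        p = nxt
    return p


def rwr_similarity(adj, u, v, restart=0.15):
    """Similarity(u, v) = P_uv + P_vu, as defined in the list above."""
    p_uv = rwr_probabilities(adj, u, restart)[v]
    p_vu = rwr_probabilities(adj, v, restart)[u]
    return p_uv + p_vu


# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = build_adjacency([(0, 1), (0, 2), (1, 2), (2, 3)])
print(common_neighbors(adj, 0, 1))
print(rwr_similarity(adj, 0, 3))
```

On a graph the size of Cora, a sparse-matrix implementation would be the more practical choice; the dictionary version here is only meant to make the definitions concrete.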
More detail can be found here
Link prediction based on graph neural networks
See [1]
How to split the data?
I read several papers, and most of them do not make clear how to set up link prediction on the Cora dataset.
This paper gave a pretty clear description.
Setup. Link prediction is a commonly used task to demonstrate the meaningfulness of the embeddings. To evaluate the performance we hide a set of edges/non-edges from the original graph and train on the resulting graph. Similarly to Kipf & Welling (2016b) and Wang et al. (2016) we create a validation/test set that contains 5%/10% randomly selected edges respectively and equal number of randomly selected non-edges. We used the validation set for hyper-parameter tuning and early stopping and the test set only to report the performance. As by convention we report the area under the ROC curve (AUC) and the average precision (AP) scores for each method.
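To make the quoted setup concrete, here is a minimal sketch of such a split in plain Python. It is not the code from the paper or from the VGAE repository: the function names split_edges and auc, the fixed seed, and the rejection sampling of non-edges are assumptions for illustration. It assumes an undirected graph whose nodes are numbered 0..num_nodes-1 and which is sparse enough that random non-edges are easy to find. The auc helper computes the area under the ROC curve as the fraction of (positive, negative) pairs ranked correctly, with ties counted as one half.

```python
import random


def split_edges(edges, num_nodes, val_frac=0.05, test_frac=0.10, seed=0):
    """Hide val_frac / test_frac of the edges and sample as many non-edges."""
    rng = random.Random(seed)
    # Store each undirected edge once as (min, max); drop self-loops.
    edge_set = {(min(u, v), max(u, v)) for u, v in edges if u != v}
    shuffled = sorted(edge_set)
    rng.shuffle(shuffled)

    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val_pos = shuffled[:n_val]
    test_pos = shuffled[n_val:n_val + n_test]
    train_pos = shuffled[n_val + n_test:]  # the graph the model is trained on

    def sample_non_edges(k, forbidden):
        """Rejection-sample k distinct node pairs that are not edges."""
        out = set()
        while len(out) < k:
            u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
            if u == v:
                continue
            pair = (min(u, v), max(u, v))
            if pair in edge_set or pair in forbidden:
                continue
            out.add(pair)
        return sorted(out)

    val_neg = sample_non_edges(n_val, set())
    test_neg = sample_non_edges(n_test, set(val_neg))
    return {"train_pos": train_pos,
            "val_pos": val_pos, "val_neg": val_neg,
            "test_pos": test_pos, "test_neg": test_neg}


def auc(pos_scores, neg_scores):
    """AUC as the probability that a hidden edge outscores a non-edge."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy graph: 100 nodes on a ring, plus chords to second neighbors (200 edges).
ring = [(i, (i + 1) % 100) for i in range(100)]
chords = [(i, (i + 2) % 100) for i in range(100)]
split = split_edges(ring + chords, num_nodes=100)
print({name: len(pairs) for name, pairs in split.items()})
```

Any link scoring function can then be fit on train_pos only, tuned with the validation pairs, and evaluated once on the test pairs by passing its scores for test_pos and test_neg to auc.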
VGAE code
The code can be found here; the code for the test split is in preprocessing.py.
Reference
Links provided in the article.
[1] Variational Graph Auto-Encoders, NIPS 2016
[2] Revisiting Semi-Supervised Learning with Graph Embeddings, ICML 2016
[3] Learning Deep Representations for Graph Clustering, AAAI 2014