Link prediction setup for Cora dataset

May 8, 2020

What is link prediction?

From here, we can summarize link prediction as:

Given a snapshot graph G = <V, E> of a social network at some moment, and two nodes v_i and v_j, link prediction is the task of predicting the probability of a link between v_i and v_j. This definition splits the task into two categories. The first is predicting new links that will appear in the future. The second is predicting hidden links that already exist in the network but have not been observed.

Traditional link prediction algorithms

The simplest framework for link prediction is based on the similarity between nodes. Every pair of nodes x and y is assigned a score Similarity(x, y), given by a similarity function between the two nodes. Pairs with higher scores are considered more likely to be linked.

Two algorithms are introduced here:

Common Neighbors: the number of neighbors that node x and node y have in common.
Random Walk with Restart (RWR): 1) P_uv is the steady-state probability that a random walker starting from node u is at node v. 2) Similarity(u, v) = P_uv + P_vu.
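
To make the two scores concrete, here is a minimal sketch in plain Python. It is not taken from any of the papers mentioned in this post: the helper names, the toy graph, the restart probability of 0.15, and the fixed number of power-iteration steps are all assumptions chosen for illustration.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def common_neighbors(adj, x, y):
    """Common Neighbors score: how many neighbors x and y share."""
    return len(adj[x] & adj[y])


def rwr(adj, source, restart=0.15, iters=200):
    """Approximate steady-state probabilities of a walker that jumps back
    to `source` with probability `restart` at every step (power iteration).
    The values of `restart` and `iters` are illustrative assumptions."""
    p = {v: 0.0 for v in adj}
    p[source] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for u, mass in p.items():
            # Spread the non-restarting mass evenly over u's neighbors.
            share = (1.0 - restart) * mass / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        nxt[source] += restart  # total restart mass, since p sums to 1
        p = nxt
    return p


def rwr_similarity(adj, u, v):
    """Similarity(u, v) = P_uv + P_vu."""
    return rwr(adj, u)[v] + rwr(adj, v)[u]


# Toy graph (hypothetical): a triangle 0-1-2 with a tail 2-3-4.
adj = build_adjacency([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)])
print(common_neighbors(adj, 0, 1))  # nodes 0 and 1 share neighbor 2
print(rwr_similarity(adj, 0, 3), rwr_similarity(adj, 0, 4))
```

On this toy graph, node 3 gets a higher RWR similarity to node 0 than node 4 does, because a walker can only reach node 4 by passing through node 3.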

More detail can be found here

Link prediction based on graph neural networks

See [1]

How to split the data?

I read several papers, and it was still not clear to me how to set up link prediction on the Cora dataset.

This paper gave a pretty clear description.

Setup. Link prediction is a commonly used task to demonstrate the meaningfulness of the embeddings. To evaluate the performance we hide a set of edges/non-edges from the original graph and train on the resulting graph. Similarly to Kipf & Welling (2016b) and Wang et al. (2016) we create a validation/test set that contains 5%/10% randomly selected edges respectively and equal number of randomly selected non-edges. We used the validation set for hyper-parameter tuning and early stopping and the test set only to report the performance. As by convention we report the area under the ROC curve (AUC) and the average precision (AP) scores for each method.
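
The setup quoted above can be sketched as follows. This is my own reading of the description, not the authors' code: the function names, the fixed seed, the rejection sampling of non-edges, and the toy graph are assumptions, and loading Cora itself is not shown. The AUC and AP helpers are simple pure-Python versions written for illustration.

```python
import random


def split_edges(nodes, edges, val_frac=0.05, test_frac=0.10, seed=0):
    """Hide val/test edges and sample an equal number of non-edges for each."""
    rng = random.Random(seed)
    # Treat the graph as undirected and drop duplicate edges.
    edges = sorted({(min(u, v), max(u, v)) for u, v in edges})
    rng.shuffle(edges)
    n_val = round(len(edges) * val_frac)
    n_test = round(len(edges) * test_frac)

    # Rejection-sample non-edges; fine for sparse graphs such as Cora,
    # but it would stall on a nearly complete graph.
    nodes = sorted(nodes)
    seen, non_edges = set(edges), []
    while len(non_edges) < n_val + n_test:
        u, v = rng.sample(nodes, 2)
        pair = (min(u, v), max(u, v))
        if pair not in seen:
            seen.add(pair)
            non_edges.append(pair)

    return {
        "train_edges": edges[n_val + n_test:],  # the graph the model trains on
        "val_pos": edges[:n_val],
        "val_neg": non_edges[:n_val],
        "test_pos": edges[n_val:n_val + n_test],
        "test_neg": non_edges[n_val:],
    }


def roc_auc(pos_scores, neg_scores):
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def average_precision(pos_scores, neg_scores):
    """Mean precision at the rank of each positive pair.
    Tied scores keep positives first, which is slightly optimistic."""
    ranked = [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores]
    ranked.sort(key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / len(pos_scores)


# Toy graph (hypothetical): 50 nodes on a ring plus chords, 100 edges in total.
toy_nodes = range(50)
toy_edges = [(i, (i + 1) % 50) for i in range(50)]
toy_edges += [(i, (i + 7) % 50) for i in range(50)]
split = split_edges(toy_nodes, toy_edges)
print({name: len(pairs) for name, pairs in split.items()})
```

To evaluate a method, train it on `train_edges` only, score every pair in `test_pos` and `test_neg`, and pass the two score lists to `roc_auc` and `average_precision`. The validation pairs are used the same way for hyper-parameter tuning and early stopping.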