The Secure Link Prediction Problem

Link prediction is an important and well-studied problem for social networks. Given a snapshot of a graph, the link prediction problem asks which new interactions between members are most likely to occur in the near future. As networks grow in size, data owners are forced to store the data on remote cloud servers, which reveals sensitive information about the network. The graphs are therefore stored in encrypted form. We study the link prediction problem on encrypted graphs. To the best of our knowledge, this secure link prediction problem has not been studied before. We use the number of common neighbors for prediction. We present three algorithms for the secure link prediction problem. We design prototypes of the schemes and formally prove their security. We execute our algorithms on real-life datasets.


Introduction
Social networks have become an integral part of our lives. These networks can be represented as graphs with nodes being entities (members) of the network and edges representing the association between entities (members). As the size of these graphs increases, it becomes quite difficult for small enterprises and business units to store the graphs in-house. So, there is a desire to store such information in cloud servers.
In order to protect the privacy of individuals (as is now mandatory in EU and other places), data is often anonymized before storing in remote cloud servers. However, as pointed out by Backstrom et al. [3], anonymization does not imply privacy. By carefully studying the associations between members, a lot of information can be gleaned.
The data owner, therefore, has to store the data in encrypted form. Trivially, the data owner can upload all data in encrypted form to the cloud. Whenever a query is made, the data owner then has to download all the data, do the necessary computations and re-upload the re-encrypted data. This is very inefficient and defeats the purpose of a cloud service. Thus, we need to store the data in the cloud in encrypted form in such a way that we can compute efficiently on the encrypted data. Some basic queries for a graph are the neighbor query (given a vertex, return the set of vertices adjacent to it), the vertex degree query (given a vertex, return the number of adjacent vertices) and the adjacency query (given two vertices, return whether there is an edge between them). It is important that when an encrypted graph supports other queries, like shortest distance queries, it does not stop supporting these basic queries.
Liben-Nowell and Kleinberg [10] first defined the link prediction problem for social networks. The link prediction problem asks whether, given a snapshot of a graph, we can predict which new interactions between members are most likely to occur in the near future. For example, given a node A at an instant, the link prediction problem tries to find the most likely node B with which A would like to connect at a later instant. Different types of distance metrics are used to measure the likelihood of the formation of new links. These measures are called scores [10]. Liben-Nowell and Kleinberg [10] considered several metrics, including common neighbors, Jaccard's coefficient, Adamic/Adar, preferential attachment and Katz β. For example, if A and B (with no edge between them) have a large number of common neighbors, they are more likely to be connected in the future. In this paper, for simplicity, we consider the common neighbors metric to predict the emergence of a link.
Though there is a large body of literature on link prediction, to the best of our knowledge the secure version of the problem has not been studied to date. The Secure Link Prediction (SLP) problem is to run link prediction algorithms over secure, i.e., encrypted, data.
Our Contribution We introduce the notion of secure link prediction and present three constructions. In particular, we ask and answer the question, "Given a snapshot of a graph G = (V, E) (V is the set of vertices and E ⊆ V × V) at a given instant and a vertex v ∈ V, which is the most likely vertex u such that u is a neighbor of v at a later instant and vu ∉ E?" The score metric we consider is the number of common neighbors of the two vertices v and u. This can be used to answer the question, "Given a snapshot of a graph G = (V, E) at a given instant and a vertex v ∈ V, which are the k most likely neighbors of v at a later instant such that none of these k vertices were neighbors of v in G?" Note that the data owner outsources an encrypted copy of the graph G to the cloud and sends an encrypted vertex v as a query. The cloud runs the secure link prediction algorithm and returns an encrypted result, from which the client can obtain the most likely neighbor of v. The cloud knows neither the graph G nor the queried vertex v.
It is to be noted that the client has limited computational and storage capacity. We propose three schemes (SLP-I, SLP-II and SLP-III), in all of which the client takes the help of a proxy server, which makes it efficient to obtain query results. At a high level:
1. SLP-I is the most efficient, with almost no computation at the client side, and leaks only the scores to the proxy server.
2. SLP-II has a little more communication at the client side than SLP-I but leaks the scores of only a subset of vertices to the proxy server.
3. SLP-III is a very efficient scheme with almost no computation and communication at the client side and leaks almost nothing to the proxy. This is achieved at an extra computational and communication cost between the cloud and the proxy.
In all three schemes, the client leaks nothing to the cloud except the number of vertices in the graph. Each of the previous schemes on encrypted graphs is designed to support a specific query (for example, shortest distance query, focused subgraph query etc.). In contrast, we have designed more general schemes that support not only link prediction queries but also basic queries, including neighbor, vertex degree and adjacency queries. All our schemes are shown to be adaptively secure in the real-ideal paradigm. Further, we have analyzed the performance of the schemes in terms of storage requirement, computation cost and communication cost, and estimated the execution time of the schemes assuming benchmark implementations of some underlying cryptographic primitives. We have implemented prototypes of SLP-I and SLP-II and measured their performance on different real-life datasets to study feasibility.
From the experiments, we see that SLP-I and SLP-II take 12.15s and 13.75s, respectively, to encrypt, and 8.87s and 8.59s to process a query, for a graph with $10^2$ vertices.
Organization The rest of the paper is organized as follows. Related work is discussed in Section 2. Preliminaries and cryptographic tools are discussed in Section 3. The link prediction problem and its security are described in Section 4. Section 5 describes our proposed scheme SLP-I. Two improvements of SLP-I, namely SLP-II and SLP-III, are discussed in Section 6 and Section 7 respectively. In Section 8, a comparative study of the complexities of our proposed schemes is given. In Section 9, details of our implementation and experimental results are shown. A variant of the link prediction problem, $SLP_k$, is introduced in Section 10. Finally, a summary of the paper and future research directions are given in Section 11.

Related Work
Graph algorithms are well studied when the graph is not encrypted. Since the need to outsource graph data in encrypted form is growing rapidly and encryption makes it difficult to run those algorithms, further study is required to enable them. Only a few works deal with queries on outsourced encrypted graphs.
Chase and Kamara [6] introduced the notion of graph encryption while presenting structured encryption as a generalization of searchable symmetric encryption (SSE), proposed by Song et al. [20]. They presented schemes for neighbor queries, adjacency queries and focused subgraph queries on labeled graph-structured data. In all of their proposed schemes, the graph is considered as an adjacency matrix and each entry is encrypted separately using symmetric key encryption. The main idea of their schemes is that, given a vertex and the corresponding key, the scheme returns the adjacent vertices. However, complex queries require complex operations (like addition, subtraction, division etc.) on the adjacency matrix, which makes the scheme unsuitable.
A parallel secure computation framework, GraphSC, has been designed and implemented by Nayak et al. [16]. This framework computes functions like histogram, PageRank, matrix factorization etc. To run these algorithms, GraphSC introduces parallel programming paradigms to secure computation. The parallel and secure execution enables the algorithms to run even on large datasets. However, they adopt Path-ORAM [21] based techniques, which are inefficient if the client has little computation power or does not have a very large RAM.
Sketch-based approximate shortest distance queries over encrypted graph have been studied by Meng et al. [14]. In the pre-processing stage, the client computes the sketches for every vertex that is useful for efficient shortest distance query. Instead of encrypting the graph directly, they encrypted the pre-processed data. Thus, in their scheme, there is no chance of getting information about the original graph.
Shen et al. [19] introduced and studied cloud-based approximate constrained shortest distance queries in encrypted graphs which finds the shortest distance with a constraint such that the total cost does not exceed a given threshold.
Exact distance has been computed on dynamic encrypted graphs in [22]. Similar to our paper, this paper uses a proxy to reduce client-side computation and information leakage to the cloud. In the scheme, adjacency lists are stored in an inverted index. However, in a single query, the scheme leaks all the nodes reachable from the queried vertex which is a lot of information about the graph. For example, if the graph is complete, it reveals the whole graph.
A graph encryption scheme, that supports top-k nearest keyword search queries, has been proposed by Liu et al. [12]. They have made an encrypted index using order preserving encryption for searching. Together with lightweight symmetric key encryption schemes, homomorphic encryption is used to compute on encrypted data.
Besides, Zheng et al. [24] proposed link prediction in decentralized social network preserving the privacy. Their construction split the link score into private and public parts and applied sparse logistic regression to find links based on the content of the users. However, the graph data was not considered to be encrypted in the privacy preserving link prediction schemes.
In this paper, we outsource the graph in encrypted form. In most previous works, the schemes are designed to answer a single specific query, like neighbor query ([6]), shortest distance query ([14,19,22]), focused subgraph query ([6]) etc. So, either it is hard to get information about the source graph ([14], [19]), as they do not support basic queries, or a lot of information is leaked by a single query ([22]). One trivial approach is to take different schemes and use all of them together to support all required types of queries. In this paper, our target is to let the client obtain as much information about the graph as required while supporting the link prediction query and leaking as little information as possible. To the best of our knowledge, the secure link prediction problem has not been studied before. We study the issues of link prediction on encrypted outsourced data and give three possible solutions that overcome them.

Preliminaries
Let G = (V, E) be a graph and $A = (a_{ij})_{N \times N}$ be its adjacency matrix, where N is the number of vertices. Let λ be the security parameter. The set of positive integers {1, 2, ..., n} is denoted by [n]. By $x \xleftarrow{\$} X$, we mean choosing an element x at random from the set X. Dlog denotes the discrete logarithm. $id : \{0,1\}^* \to \{0,1\}^{\log N}$ gives the identifiers corresponding to the vertices. A function $negl : \mathbb{N} \to \mathbb{R}$ is said to be negligible in n if $\forall c \in \mathbb{N}$, $\exists N_c \in \mathbb{N}$ such that $\forall n > N_c$, $negl(n) < n^{-c}$.
A probabilistic polynomial-time (PPT) permutation $\{0,1\}^* \times \{0,1\}^n \to \{0,1\}^n$ is said to be a pseudo-random permutation (PRP) if it is indistinguishable from a random permutation by any PPT adversary. We consider two PRPs, $F_{k_{perm}}$ and $\pi_s$, where $k_{perm}$ and s are their keys (or seeds) respectively.

Bilinear Maps
Let G and G1 be two (multiplicative) cyclic groups of order n and g be a generator of G. A map e : G × G → G1 is said to be an admissible non-degenerate bilinear map if
1. ∀u, v ∈ G and ∀a, b ∈ Z, we have $e(u^a, v^b) = e(u, v)^{ab}$,
2. $e(g, g) \neq 1$, and
3. e can be computed efficiently.
Our algorithms use the bilinear-map-based BGN encryption scheme [4], so we discuss it first.

BGN Encryption Scheme
Boneh et al. [4] proposed a homomorphic encryption scheme (henceforth referred to as the BGN encryption scheme) that allows an arbitrary number of additions and one multiplication. The scheme consists of three algorithms: Gen(), Encrypt() and Decrypt().
Key generation: This takes a security parameter λ as input and outputs a public-private key pair (pk, sk) (see Algo. 1, Gen($1^{\lambda}$)). Here, pk = (n, G, G1, e, g, h) and sk = q1. In pk, e is a bilinear map from G × G to G1, where both G and G1 are groups of order n = q1q2. Note that, given λ, the group generator 𝒢 returns (q1, q2, G, G1, e) (see [4]), where q1 and q2 are two large primes.
Encryption: An integer a is encrypted in G using Algo. 2. Let a1 and a2 be two integers that are encrypted in G as c1 and c2. Then the pairing e(c1, c2), which belongs to G1, is an encryption of the product a1a2. Note that arbitrary additions of plaintexts are also possible in the group G1. If g is a generator of the group G, then e(g, g) acts as a generator of the group G1. Thus, an integer a can also be encrypted in G1 in a similar manner (see Algo. 4).
Let BGN be the encryption scheme described above. Then it is a tuple of five algorithms (Gen, $Encrypt_G$, $Decrypt_G$, $Encrypt_{G_1}$, $Decrypt_{G_1}$), as described in Algo. 1, 2, 3, 4 and 5 respectively.
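As a worked check of the "one multiplication" property, the following derivation may help. It assumes, as in the construction of [4], that a ciphertext in G has the form $c = g^{a} h^{r}$ and that h generates the subgroup of order q1, written here as $h = g^{\alpha q_2}$. The symbols α, r1 and r2 are introduced only for this illustration.

```latex
\begin{align*}
c_1 &= g^{a_1} h^{r_1}, \qquad c_2 = g^{a_2} h^{r_2}, \qquad h = g^{\alpha q_2},\\
c_1 c_2 &= g^{a_1 + a_2}\, h^{r_1 + r_2}
  && \text{(addition of plaintexts in } G\text{)},\\
e(c_1, c_2) &= e(g,g)^{a_1 a_2}\;
  e(g,g)^{\alpha q_2 \left(a_1 r_2 + a_2 r_1 + \alpha q_2 r_1 r_2\right)}
  && \text{(bilinearity)},\\
e(c_1, c_2)^{q_1} &= \bigl(e(g,g)^{q_1}\bigr)^{a_1 a_2}
  && \text{since } e(g,g)^{q_1 q_2} = 1 .
\end{align*}
```

Raising to the secret exponent q1 removes every term that carries a factor q2, so only the product a1a2 remains in the exponent of $e(g,g)^{q_1}$.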

Garbled Circuit (GC)
Let us consider two parties, with input x and y respectively, who want to compute a function f (x, y). Then, a garbled circuit [23,11] allows them to compute f (x, y) in such a way that none of the parties get any 'meaningful information' about the input of the other party and none, other than the two parties, is able to compute f (x, y).
Kolesnikov et al. [8] introduced an optimization of garbled circuit that allows XOR gates to be computed without communication or cryptographic operations [22]. Kolesnikov et al. [7] presented efficient GC constructions for several basic functions using the garbled circuit construction of [8]. In this paper, we use garbled circuit blocks for subtraction (SUB), comparison (COMP) and multiplexer (MUX) functions from [8].

The Secure Link Prediction (SLP) Problem
Given G = (V, E), let $N_v$ denote the set of vertices adjacent to v ∈ V. Let score(v, u) be a measure of how likely the vertex v is to be connected to another vertex u in the near future, where vu ∉ E. A variant of the link prediction problem states that, given a vertex v ∈ V, return a vertex u with vu ∉ E for which score(v, u) is maximum. Thus, given a vertex v, we find the vertex it is most likely to connect with. There are various metrics to measure the score, like the number of common neighbors, Jaccard's coefficient, the Adamic/Adar metric etc. In this paper, we take score(v, u) to be the number of common neighbors of v and u, i.e., $score(v, u) = |N_v \cap N_u|$. Let A be the adjacency matrix of the graph G. If $i_v$ and $i_u$ are the rows corresponding to the vertices v and u respectively, then the score is the inner product of these rows, i.e., $score(v, u) = \sum_{k=1}^{N} a_{i_v k}\, a_{i_u k}$. In this paper, we use the BGN encryption scheme to securely compute this inner product.
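The plaintext computation that the schemes later carry out on encrypted data can be sketched as follows. This is a minimal illustration on an unencrypted adjacency matrix; the function names and the example graph are assumptions made here, not part of the schemes.

```python
def common_neighbor_scores(adj, v):
    """score(v, u) for every u, as the inner product of rows v and u."""
    n = len(adj)
    return [sum(adj[v][k] * adj[u][k] for k in range(n)) for u in range(n)]


def predict_link(adj, v):
    """Return a non-adjacent vertex u != v with maximum score, or None."""
    scores = common_neighbor_scores(adj, v)
    candidates = [u for u in range(len(adj)) if u != v and adj[v][u] == 0]
    if not candidates:
        return None
    return max(candidates, key=lambda u: scores[u])


# Hypothetical example graph with edges 0-1, 0-2, 1-3, 2-3, 3-4.
adj = [
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
scores = common_neighbor_scores(adj, 0)
best = predict_link(adj, 0)
```

Here vertex 3 shares the neighbors 1 and 2 with vertex 0, so it is the predicted link. Note that the entry for u = v is simply the degree of v, which is why v itself is excluded from the candidates.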

System Overview
Here, we describe the system model considered for the link prediction problem and goals which we want to achieve.
System Model: In our model (see Fig. 1), there is a client, a cloud server, and a proxy server. Each of them communicates with others to execute the protocol.
The client is the data owner and is considered to be trusted. It outsources the graph in encrypted form to the cloud server and generates link prediction queries. Given a vertex v, it queries for the vertex u which is most likely to be connected in the future.

Figure 1: System model
The cloud server (CS) holds the encrypted graph and computes over the encrypted data when the client issues a query. We assume that the cloud server is honest-but-curious: it is curious to learn and analyze the encrypted data and queries, but it is honest and follows the protocol.
The proxy server (PS) helps the cloud server and the client to find the most likely vertex securely. It reduces the computational overhead of the client by performing decryptions. The proxy server is also assumed to be honest-but-curious.
All channels connecting the client, the cloud and the proxy servers are assumed to be secure. An adversary can eavesdrop on a channel but cannot tamper with the messages sent along it. We also assume that the cloud and the proxy servers do not collude.
This system model aims to outsource as much computation as possible without leaking information about the data, assuming the client has very low computation power (like a mobile device). This kind of model for outsourcing computation was previously used by Wang et al. [22] for secure comparison. That the proxy and cloud servers do not collude is a standard assumption.
Design Goals: In this paper, under the assumption of the above system model, we aim at providing a solution for the secure link prediction problem. In our design, we want to achieve the following objectives.

1. Confidentiality: The cloud and proxy servers should not get any information about the graph structure, i.e., the servers should not be able to construct a graph which is isomorphic to the source graph.
2. Efficiency: In our model, the client is weak with respect to storage and computation. Since the cloud server has a large amount of storage and computation power, the client outsources the data to it.
Moreover, the client should be able to efficiently perform neighbor, vertex degree and adjacency queries. These are the basic queries that every graph should support. The client should leak as little information as possible.

Secure Link Prediction Scheme
In this section, we present the definition of a link prediction scheme for a graph G and its security against adaptive chosen-query attacks.
Definition 1. A secure link prediction (SLP) scheme for a graph G is a tuple (KeyGen, EncMatrix, TrapdoorGen, LPQuery, FindMaxVertex) of algorithms as follows.
• (PK, SK) ← KeyGen(1 λ ) : is a client-side PPT algorithm that takes λ as a security parameter and outputs a public key PK and a secret key SK.
• T ← EncMatrix(G, SK, PK) : is a client-side PPT algorithm that takes a public key PK, a secret key SK and a graph G as inputs and outputs a structure T that stores the encrypted adjacency matrix of G.
• τv ← TrapdoorGen(v, SK) : is a client-side PPT algorithm that takes a secret key SK and a vertex v as inputs and outputs a query trapdoor τv.
• ĉ ← LPQuery(τv, T) : is a PPT algorithm run by the cloud server that takes a query trapdoor τv and the structure T as inputs and outputs the list ĉ of encrypted scores with all vertices.
• ires ← FindMaxVertex(pk, sk, ĉ) : is a PPT algorithm run by the proxy server that takes pk, sk and ĉ as inputs and outputs the identifier ires of the vertex most likely to connect with the queried vertex.
Correctness: An SLP scheme is said to be correct if, ∀λ ∈ N, ∀(PK, SK) generated using KeyGen(1 λ ) and all sequences of queries on T , each query outputs a correct vertex identifier except with negligible probability.
Adaptive security: An SLP scheme should have two properties: 1. Given T , the cloud servers should not learn any information about G and 2. From a sequence of query trapdoors, the servers should learn nothing about corresponding queried vertices.
The security of an SLP scheme is defined in the real-ideal paradigm. In the real scenario, the challenger C generates keys. The adversary A generates a graph G, which it sends to C. C encrypts the graph with its secret key and sends it to A. Then, q times, A adaptively chooses a query vertex based on the previous results and receives the trapdoor for the current query. Finally, A outputs a guess bit b. In the ideal scenario, on receiving the graph G, the simulator S generates a simulated encrypted matrix. For each adaptive query of A, S returns a simulated token. Finally, A outputs a guess bit b′. The security definition (Definition 2) ensures that A cannot distinguish C from S.
We have assumed that the communication channels between the client and the servers are secure. Since the CS and the PS do not collude, they do not share the information they collect. So, the simulator can treat the CS and the PS separately.
In our scheme, the proxy server does not have the encrypted data or the trapdoors. During query operation, it gets a set of scrambled scores of the queried vertex with other vertices. So, we can consider only the cloud server as the adversary (see [5]). Let us define security as follows.

Overview of our proposed schemes
A graph can be encrypted in several representations, like an adjacency matrix, an adjacency list, an edge list etc. Each of them has advantages and disadvantages depending on the application. In our schemes, we define the score as the number of common neighbors, which can be calculated just by computing the inner product of the rows corresponding to the two vertices. The basic idea is that, given a vertex, to predict the vertex it is most likely to connect with, we compute its scores with all other vertices and sort the vertices according to their scores. However, calculating the inner products and sorting them on the cloud server are expensive operations, and no single scheme provides all of this functionality over encrypted data. So, we use the BGN homomorphic encryption scheme, which enables us to compute inner products on encrypted data. Choosing BGN gives the client the power to issue not only link prediction queries but also neighbor, vertex degree and adjacency queries. Besides, computing the scores, decrypting them and sorting them in encrypted form is non-trivial, keeping in mind that the client has low computation power. So, we propose three schemes that perform score computation as well as sorting on encrypted data with the help of an honest-but-curious proxy server which does not collude with the cloud server. The three schemes show a trade-off between computation cost, communication cost and leakage in computing the vertex most likely to connect with the queried vertex.

Our Proposed Protocol for SLP
In this section, we propose an efficient scheme SLP-I and analyze its security. The scheme is divided into three phases-key generation, data encryption, and query phase. The client first generates required secret and public keys. Then it encrypts the adjacency matrix of the graph in a structure and uploads it to the CS. To query for a vertex, the client generates a query trapdoor and sends it to the CS. The CS computes encrypted score (i.e., inner products of the row corresponding to the queried vertex with the other vertices on the encrypted graph). The PS decrypts the scores, finds the vertex with highest score and sends the result to the client.
Key Generation: In this phase, given a security parameter λ, the client chooses a bilinear map e : G × G → G1. Then, the permutation key $k_{perm}$ is chosen at random for the PRP $F : \{0,1\}^* \times \{0,1\}^{\log N} \to \{0,1\}^{\log N}$. It executes BGN.Gen() to get sk and pk. After generating the private key SK and the public key PK, a part sk of SK is shared with the PS. This part of the key helps the PS to compute secure comparisons. Key generation is described in Algo. 8.
Data Encryption: In this phase, the client encrypts the adjacency matrix with its private key and uploads the encrypted matrix to the CS (see Algo. 9). Each entry aij in the adjacency matrix A of G is encrypted using Algo. 2. Let M = (mij)N×N be the encrypted matrix. Then, each row of M is stored in the structure T . The PRP F gives the position in T corresponding to vertices. Finally, the structure T is sent to the CS.
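The layout of the structure T can be sketched as follows. This is only an illustration under loud assumptions: `make_prp` is a key-seeded shuffle standing in for the PRP F (it is not cryptographically secure), and `encrypt` is passed in as a parameter, with an identity placeholder used in the example instead of BGN encryption. All names are hypothetical.

```python
import random


def make_prp(key, n):
    """Stand-in for the PRP F on {0, ..., n-1}: a key-seeded shuffle.

    NOT cryptographically secure; the scheme assumes a real PRP.
    """
    table = list(range(n))
    random.Random(key).shuffle(table)
    return lambda i: table[i]


def enc_matrix(adj, prp, encrypt):
    """Encrypt each entry and store row i at position prp(i) of T."""
    n = len(adj)
    T = [None] * n
    for i, row in enumerate(adj):
        T[prp(i)] = [encrypt(a) for a in row]
    return T


adj = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
prp = make_prp(key=2024, n=len(adj))
# Identity placeholder: provides no secrecy, used only to show the layout.
T = enc_matrix(adj, prp, encrypt=lambda a: a)
```

Without the key, the position of a row in T does not reveal which vertex it belongs to, which is the role the PRP plays in the text above.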
Algorithm 9 (fragment): $m_{ij} \leftarrow Encrypt_G(PK.pk, a_{ij})$ for all i, j; construct a structure T of size N.
Query: On receiving τv, the CS computes the encrypted scores $(c_1, c_2, \ldots, c_N)$ (see Algo. 11) and takes the entries $(m_1, m_2, \ldots, m_N)$ of the row corresponding to the queried vertex. Using $\pi_s$, the CS shuffles the order of the encrypted scores and of the $m_i$'s. Finally, the CS sends the shuffled encrypted scores and the scrambled queried-row entries $(m_{\pi_s(1)}, m_{\pi_s(2)}, \ldots, m_{\pi_s(N)})$ to the PS.
Correctness: If $c_{ij}$ is the encryption of the score $s_{ij}$, then $c_{ij} = e(g, h)^{r} \prod_{k=1}^{N} e(m_{ik}, m_{jk})$. Again, since $e(g, g)^{q_1 q_2} = 1$, we get $(c_{ij})^{q_1} = \left(e(g, g)^{q_1}\right)^{\sum_{k=1}^{N} a_{ik} a_{jk}} = \hat{g}^{s_{ij}}$, where $\hat{g} = e(g, g)^{q_1}$. Thus, the discrete logarithm of $(c_{ij})^{q_1}$ to the base $\hat{g}$ gives $s_{ij}$. If the powers of $\hat{g}$ are pre-computed, the score $s_{ij}$ can be found in constant time. Otherwise, Pollard's lambda method [13] can be used to find the discrete logarithm of $(c_{ij})^{q_1}$ to the base $\hat{g}$.
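The constant-time recovery of a score from pre-computed powers can be sketched as a table lookup. The sketch below uses the multiplicative group modulo a prime as a toy stand-in for G1; the modulus, the base and the function names are assumptions made for illustration and have nothing to do with the pairing groups used by BGN.

```python
def build_dlog_table(g_hat, modulus, max_score):
    """Pre-compute g_hat^s mod modulus for s = 0..max_score."""
    table = {}
    x = 1
    for s in range(max_score + 1):
        table[x] = s
        x = (x * g_hat) % modulus
    return table


def recover_score(value, table):
    """Constant-time lookup of the discrete logarithm of value."""
    return table[value]


P = 2**61 - 1          # toy prime modulus (assumption, not a BGN parameter)
G_HAT = 3              # toy base standing in for g_hat
table = build_dlog_table(G_HAT, P, max_score=20)
score = recover_score(pow(G_HAT, 7, P), table)
```

Since a score never exceeds the number of vertices N, a table with N + 1 entries suffices in this setting, which keeps the pre-computation linear in N.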

Security Analysis
In the security definition, a small amount of leakage is allowed. The adversary knows the algorithms and possesses the encrypted data and the queried trapdoors. Only SK is unknown to it. The leakage function L is a pair $(L_{bld}, L_{qry})$ (associated with EncMatrix and LPQuery respectively), where $L_{bld}(G) = \{|T|\}$ and $L_{qry}(v) = \{\tau_v\}$.
Theorem 1. If BGN is semantically secure and F is a PRP, then SLP-I is L-secure against adaptive chosen-query attacks.
Proof. The proof of security is based on the simulation-based CQA-II security (see Definition 2). Given the leakage $L_{bld}$, the simulator S generates a randomized structure $\tilde{T}$ which simulates the structure T of the challenger C. Given a query trapdoor $\tau_v$, S returns a simulated trapdoor $\tilde{\tau}_v$, maintaining consistency with future queries by the adversary. To prove the theorem, it is enough to show that the trapdoors generated by C and S are indistinguishable to A. Semantic security of BGN guarantees that $m_{ij}$ and $\tilde{m}_{ij}$ (the entries of $\tilde{T}$) are indistinguishable. Since F is a PRP, $\tau_v$ and $\tilde{\tau}_v$ are indistinguishable. This completes the proof.

SLP-II with less leakage
Though the SLP-I scheme is efficient, it has a few disadvantages. Firstly, in SLP-I, the number of common neighbors between the queried vertex and every other vertex is leaked to the PS, which gives it partial knowledge of the graph. Since the PS is semi-honest, we want to leak as little information as possible. In this section, we propose another scheme, SLP-II, that hides most of the scores from the PS, which reduces leakage. Secondly, the client has a high communication cost with the PS while processing a link prediction query. Our proposed SLP-II scheme has an advantage here, with reduced communication cost from the CS to the PS. We achieve these by using extra storage of the size of the matrix M and extra bandwidth of O(N) from the PS to the CS.

Proposed Protocol
In SLP-II, after computing the scores, the CS increases the scores of the vertices adjacent to the queried vertex by a random amount that is at least the maximum possible score, i.e., the degree of the queried vertex. For example, let s be a score in the form $g_1^{s}$; then a random number r, greater than or equal to the degree, is added to it, so that the score becomes $g_1^{s} \cdot g_1^{r} = g_1^{s+r}$. Since the lower bound of r is known to the client, it can eliminate the scores of adjacent vertices. The PS only decrypts the scores and sends the sorted list to the client. Since the degree is hidden from the PS and known to the client, the client can eliminate the vertices with score larger than the degree. The algorithms are as follows.
Key Generation: Same as Algo. 8.
Algorithm 15: FindMaxVertexII(sk, c, m)
Correctness: For all i, the decrypted entry $s'_i$ (line 3, Algo. 15) equals $s_i + b_{i'i}$, where $s_i$ is the actual score. Since $s_i \le deg_v$ and $b_{i'i}$ is zero when $v_{i'}$ and $v_i$ are not connected, we can see that $s'_i$ becomes greater than $deg_v$ only when $v_{i'}$ and $v_i$ are connected. So, the client can eliminate these entries from the list.
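The effect of the masking can be illustrated on plaintext values. The sketch below is not the protocol itself: the CS performs the addition homomorphically and never sees the adjacency bits, whereas here both are in the clear. The random offset is drawn strictly above the degree so that the strict comparison in the correctness argument holds; that choice, the function names and the example data are assumptions of this sketch.

```python
import random


def mask_adjacent(scores, adj_row, deg, rng):
    """Add a random r > deg to the score of every adjacent vertex."""
    masked = []
    for s, a in zip(scores, adj_row):
        r = rng.randint(deg + 1, 2 * deg + 1)  # assumption: r strictly above deg
        masked.append(s + a * r)
    return masked


def client_pick(masked, deg, v):
    """Client side: drop entries above deg (adjacent) and v itself."""
    kept = [(s, u) for u, s in enumerate(masked) if u != v and s <= deg]
    return max(kept)[1] if kept else None


# Scores and adjacency row of a hypothetical queried vertex v = 0.
scores = [2, 0, 0, 2, 0]
adj_row = [0, 1, 1, 0, 0]
deg = sum(adj_row)
masked = mask_adjacent(scores, adj_row, deg, random.Random(7))
best = client_pick(masked, deg, v=0)
```

Whatever offsets are drawn, the entries of vertices 1 and 2 end up above the degree and are discarded, so the client is left with vertex 3 as the best candidate.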

Security Analysis
SLP-II does not leak any more information to the CS than SLP-I does. The leakage $L = (L_{bld}, L_{qry})$ is the same as in SLP-I.

Theorem 2.
If BGN is semantically secure and F is a PRP, then SLP-II is L-secure against adaptive chosen-query attacks.
Proof. As in the proof of Theorem 1, the simulator needs to simulate T , T and τv. To simulate the structure T , given $L_{bld}(A)$, S takes an empty structure T of size |T|. Then it sets $m_{ij} \leftarrow BGN.Encrypt_{G_1}(PK.pk, 0^{\lambda})$ for $(i, j) \in [N] \times [N]$. The rest of the proof is similar to that of Theorem 1.

SLP scheme using garbled circuit (SLP-III)
In SLP-II, the PS still learns the scores of many vertices, and there is a considerable communication cost from the PS to the client. In this section, we propose SLP-III, in which the PS does not learn any score. Besides, the proxy needs to send only the result to the client, which reduces the communication overhead for the client.

Protocol Description
In SLP-III, after generating the keys, the client encrypts the adjacency matrix of the graph and uploads it to the CS. At the same time, it shares a part of its secret key with the PS. In the query phase, the CS computes the encrypted scores on receiving the query trapdoor from the client. However, it masks each score with a random number of its own choosing before sending the scores to the PS. The PS decrypts the masked scores and evaluates a garbled circuit, constructed by the CS (as described in Section 7.2), to find the vertex with the maximum score. Finally, the PS returns the index corresponding to the evaluated identifier of the vertex with the maximum score.
Query: To query for a vertex v, the client generates a query trapdoor $\tau_v = (i', s)$ (see Algo. 10) and sends it to the CS. On receiving $\tau_v$, the CS computes the encrypted scores $(c_1, c_2, \ldots, c_N)$. It then considers the row $T[i'] = (m_{i'1}, m_{i'2}, \ldots, m_{i'N})$ corresponding to the queried vertex. Then, with random $r_i$ and $r'_i$, it computes $\bar{c}_i \leftarrow c_{\pi_s(i)} \cdot BGN.Encrypt_{G_1}(PK.pk, r_i)$ and $\bar{m}_i \leftarrow m_{i'\pi_s(i)} \cdot BGN.Encrypt_{G}(PK.pk, r'_i)$, for all i. If the encrypted scores were sent directly, the PS could decrypt them, as it holds the partial secret key sk. That is why the CS chooses random $r_i$'s and $r'_i$'s to mask them.

Maximum Garbled Circuit (MGC)
We want minimum information to be leaked to both the servers. Without the knowledge of values, it is hard to find the maximum value because it is an iterative comparison process and requires several round of communication if we use only secure comparison. However, building a maximum garbled circuit allows cloud and proxy servers to find the maximum without knowing the value by anyone. Kolesnikov and Schneider [7] first presented a garbled circuit that computes minimum from a set of distance. In their scheme, one party holds a set of points and the second party holds a single point. They used homomorphic encryption to compute the the distances from the single points to the set of points and sort them using the garble circuit. However, the original value of the points belongs to them were known to them. In this paper, we have introduced a novel maximum garbled circuit (M GC) by which one party computes the maximum from a set of numbers, without the knowledge their values, with the help of another party without leaking them to it. Given a set of scores M GC outputs only the identity of the vertex with maximum score. Computing vertex with max score: In SLP-III, the CS computes a garbled circuit M GC (an example is shown in Fig. 2) for each query to find the maximum scored vertex identifier. Before computing M GC, in SLP-III, the PS gets (s1,s2, . . .  The circuit is constructed layer by layer. The idea is to compare pair of scores every time in a layer and pass the result for the next until the resulted vertex is found. If |V | = N , M GC has (log N + 1) layers starting from 0 to N . In the 0th layer, there are N number of NSS blocks and the rest of the blocks are Max block. The NSS blocks is for the 1st layers and computes the scores securely without knowing them. Thus, each NSS block corresponds to some vertex. Max computes the maximum score and corresponding index without knowing them. 
An example of an MGC that computes the maximum for N = 7, using Max blocks and NSS blocks, is shown in Fig. 2. The MGC for any N is constructed similarly.
Max blocks: There are four types of Max blocks to compute the maximum: Max1, Max2, Max3 and Max4 (see Fig. 3). The blocks differ in order to handle extreme cases. These blocks use COMP and MUX blocks (see Section 3.3).
NSS blocks: Each NSS block has four inputs: $\bar{s}_i$, $r_i$, $a_i$ and $r'_i$. The inputs $r_i$ and $r'_i$ come from the CS, while $\bar{s}_i$ and $a_i$ come from the PS. The block first subtracts $r_i$ from $\bar{s}_i$ using a SUB block to get the score $s_i$. Then, using a SUB′ block, it computes the flag bit that tells whether the vertex is adjacent to the queried vertex. A MUL block (see Fig. 4b) is used in the NSS block, as shown in Fig. 4a, to set the score $s_i$ to zero if the vertex is adjacent and to leave it unchanged otherwise.
Figure 4: A few circuit blocks. (a) NSS block, (b) MUL block, (c) SUB′ block.
Elimination of scores for adjacent vertices: It can be seen from the encryption that $\bar{s}_i = s_i + r_i$, where $s_i$ is the actual score corresponding to the ith row and $r_i$ randomizes the score. Each bit $r'_i$ is taken to indicate whether the corresponding random value is odd or even. Similarly, each bit $a_i$ indicates whether the decrypted $\bar{a}_i$ is odd or even. If the bits $r'_i$ and $a_i$ differ, the vertex corresponding to the ith row is connected to the queried vertex, and in that case we take the score $s_i = 0$. The block SUB′ (Fig. 4c) outputs 1 if the two bits are equal and 0 otherwise. Since $(\bar{s}_i - r_i)$ gives the score, the SUB block (see Section 3.3) is used in MGC to compute the scores, where the PS provides $\bar{s}_i$ and the CS provides $r_i$. Note that SUB′ operates on a single bit only, which makes it very efficient.
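To make the data flow through the circuit concrete, the following is a minimal plaintext sketch of what MGC computes, with no garbling or OT involved. The function names (`nss_block`, `max_block`, `mgc_plain`) are hypothetical, the masks are assumed to be small integers so that no wrap-around occurs, and ties are broken in favour of the left input, which is an assumption of this sketch rather than something the scheme specifies.

```python
import random

def nss_block(s_bar, r, a_bit, r_prime_bit):
    """Plaintext model of one NSS block.

    s_bar       -- masked score held by the PS (s_bar = s + r)
    r           -- score mask held by the CS
    a_bit       -- parity of the decrypted masked adjacency entry (PS side)
    r_prime_bit -- parity of the adjacency mask r' (CS side)
    """
    score = s_bar - r                                  # SUB: remove the mask
    not_adjacent = 1 if a_bit == r_prime_bit else 0    # SUB': 1 when the bits are equal
    return score * not_adjacent                        # MUL: zero out adjacent vertices

def max_block(left, right):
    """Plaintext model of a Max block on (score, index) pairs (left wins ties)."""
    return left if left[0] >= right[0] else right

def mgc_plain(s_bars, rs, a_bits, r_prime_bits):
    """Layer 0 applies NSS blocks; later layers form a tournament of Max blocks."""
    inputs = zip(s_bars, rs, a_bits, r_prime_bits)
    layer = [(nss_block(sb, r, a, rp), idx) for idx, (sb, r, a, rp) in enumerate(inputs)]
    while len(layer) > 1:
        nxt = [max_block(layer[j], layer[j + 1]) for j in range(0, len(layer) - 1, 2)]
        if len(layer) % 2 == 1:        # an unpaired entry passes to the next layer
            nxt.append(layer[-1])
        layer = nxt
    return layer[0][1]                 # only the winning index is output

# Example with N = 7: true scores and adjacency flags of the queried vertex.
scores = [3, 5, 2, 5, 0, 4, 1]
adjacent = [0, 1, 0, 0, 1, 0, 0]
rng = random.Random(7)
rs = [rng.randrange(1, 1000) for _ in scores]          # CS masks for the scores
r_primes = [rng.randrange(1, 1000) for _ in scores]    # CS masks for the adjacency bits
s_bars = [s + r for s, r in zip(scores, rs)]           # what the PS decrypts
a_bits = [(a + rp) % 2 for a, rp in zip(adjacent, r_primes)]
r_prime_bits = [rp % 2 for rp in r_primes]

winner = mgc_plain(s_bars, rs, a_bits, r_prime_bits)
```

In this example vertex 1 has the highest raw score but is already adjacent to the queried vertex, so its score is zeroed and vertex 3 wins the tournament.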

Security Analysis
In SLP-III, though the PS has almost no leakage, the CS has slightly more leakage than in SLP-I. This extra leakage occurs when the CS interacts with the PS through the OT protocol to provide the encodings corresponding to the inputs of the PS. Since OT is secure and does not leak any meaningful information, we can ignore this leakage. In SLP-III, the leakage $L = (L_{bld}, L_{qry})$ is therefore the same as in SLP-I.
Theorem 3. If BGN is semantically secure and F is a PRP, then SLP-III is L-secure against adaptive chosen-query attacks.
Proof. The proof is the same as that of Theorem 1.

Basic Queries
All three schemes support the basic queries, which include the neighbor query, the vertex degree query and the adjacency query.
Neighbor query: Given a vertex, a neighbor query returns the set of vertices adjacent to it. Since the adjacency matrix of the graph is stored in encrypted form, it is enough for the client to obtain the decrypted row corresponding to the queried vertex. To query the neighbors of a vertex v, the client generates $\tau_v = (i', s)$ as in Algo. 10 and sends it to the CS. The CS permutes the entries of row $i'$ and sends the permuted row $\bar{m} \leftarrow (m_{\pi_s(1)}, m_{\pi_s(2)}, \ldots, m_{\pi_s(N)})$ to the PS. The PS decrypts the entries and sends the decrypted vector $(a_1, a_2, \ldots, a_N)$ to the client. The client then computes the inverse permutation for the positions whose entries are 1. Here, the CS learns only the queried vertex and the PS learns the degree of the vertex.
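The permutation step of the neighbor query can be sketched on unencrypted data as follows. This is an illustration only: `random.Random(seed).shuffle` is a non-cryptographic stand-in for the keyed permutation, and all function names are hypothetical.

```python
import random

def permutation_from_seed(seed, n):
    """Stand-in for pi_s: position j of the permuted row holds original entry perm[j]."""
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    return perm

def permute_row(row, perm):
    """CS side: reorder the row entries (unencrypted here for readability)."""
    return [row[perm[j]] for j in range(len(row))]

def recover_neighbors(permuted_bits, perm):
    """Client side: map every position holding a 1 back to its original column."""
    return sorted(perm[j] for j, bit in enumerate(permuted_bits) if bit == 1)

row = [0, 1, 0, 0, 1, 1, 0]                 # adjacency row of the queried vertex
perm = permutation_from_seed(seed=42, n=len(row))
permuted = permute_row(row, perm)           # what the PS would see after decryption
neighbors = recover_neighbors(permuted, perm)
```

The PS sees how many entries are 1 (the degree) but not which columns they belong to, while the client, who knows the seed, recovers the neighbor set.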
Vertex degree query: To query the degree of a vertex v, the client similarly sends $\tau_v = i'$ to the CS. The CS computes the encrypted degree as $m \leftarrow \prod_{i=1}^{N} m_{i'i}$ (multiplying BGN ciphertexts adds the underlying plaintexts) and sends m to the proxy. The proxy decrypts m and sends the result to the client. The seed s is not needed here, since the elements of the row do not have to be permuted.
Here, the degree is leaked to the PS. This can be prevented by randomizing the result: the CS randomizes the encrypted degree and sends the randomization secret to the client, and the client obtains the degree by subtracting this secret from the result returned by the PS.
However, this leakage can be avoided easily, without randomizing the encrypted degree, if the client performs the decryption.
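The randomized degree query can be illustrated with a toy additively homomorphic scheme. The sketch below is not BGN: it has no pairing, it works in a composite-order subgroup of Z_p^* with tiny, insecure parameters, and it gives the decryptor the whole factor q1 rather than a partial key. All names and parameter values are assumptions chosen only to show that multiplying ciphertexts adds plaintexts and that an additive mask hides the degree from the decrypting party.

```python
import random

def is_prime(n):
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def keygen(q1=1009, q2=1013, seed=1):
    """Toy group of order n = q1*q2 inside Z_p^* (far too small to be secure)."""
    rng = random.Random(seed)
    n = q1 * q2
    k = 2
    while not is_prime(k * n + 1):     # find a prime p with n dividing p - 1
        k += 2
    p = k * n + 1
    while True:                        # pick g of order exactly n
        g = pow(rng.randrange(2, p - 1), (p - 1) // n, p)
        if pow(g, q1, p) != 1 and pow(g, q2, p) != 1:
            break
    h = pow(g, q2, p)                  # h has order q1 and acts as the blinding base
    return {"p": p, "n": n, "g": g, "h": h}, {"q1": q1, "q2": q2}

def encrypt(pk, m, rng):
    t = rng.randrange(1, pk["n"])
    return pow(pk["g"], m, pk["p"]) * pow(pk["h"], t, pk["p"]) % pk["p"]

def decrypt(pk, sk, c):
    target = pow(c, sk["q1"], pk["p"])         # raising to q1 removes the h component
    base = pow(pk["g"], sk["q1"], pk["p"])
    acc = 1
    for m in range(sk["q2"]):                  # brute-force discrete log for small m
        if acc == target:
            return m
        acc = acc * base % pk["p"]
    raise ValueError("plaintext out of range")

rng = random.Random(2024)
pk, sk = keygen()
row = [0, 1, 1, 0, 1, 0, 0, 1]                 # adjacency row of the queried vertex
enc_row = [encrypt(pk, a, rng) for a in row]   # produced by the client at upload time

# CS: multiply the ciphertexts (adds the plaintexts) and add a random mask.
mask = rng.randrange(1, 500)
enc_degree = 1
for c in enc_row:
    enc_degree = enc_degree * c % pk["p"]
enc_masked = enc_degree * encrypt(pk, mask, rng) % pk["p"]

masked_degree = decrypt(pk, sk, enc_masked)    # PS obtains degree + mask, not the degree
degree = masked_degree - mask                  # client removes the mask it got from the CS
```

The mask is kept below 500 here so that degree + mask stays within the decryptable range of the toy scheme.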
Adjacency Query: Given two vertices, an adjacency query (edge query) tells whether there is an edge between them. To perform an adjacency query for a pair of vertices v1 and v2, the client sends $(i'_1, i'_2)$ (generated as in Algo. 10) to the CS. The CS returns $m_{i'_1 i'_2}$. The client can either get the randomized result from the PS or decrypt $m_{i'_1 i'_2}$ by itself.

Performance Analysis
In this section, we discuss the efficiency of our proposed schemes. Efficiency is measured in terms of computation and communication complexity, together with the storage requirement and the allowed leakage. A summary is given in Table 1. Since there is no earlier work on secure link prediction, we do not compare the complexities of our schemes with those of other encrypted computations.

Complexity analysis
Let the graph be G = (V, E) and N = |V|. Let each BGN encryption output a ρ-bit string. We describe the complexities below.
Leakage Comparison: As Table 1 shows, every scheme leaks the same information to the CS, namely the number of vertices of the graph and the query trapdoors. None of the schemes leaks information about the edges of the graph to the CS. In SLP-I, since the PS is able to decrypt the scores, it learns $S_v = \{score(v, u) : u \in V\}$. SLP-II, however, reveals only a subset $S'_v$ of $S_v$, and SLP-III hides all scores from the PS. SLP-I therefore has the maximum leakage to the PS.
Storage Requirement: One of the major goals of a secure link prediction scheme is that the client should require very little storage. All our schemes have a very low storage requirement for the client, which only has to store a key of λ bits. In all schemes, the PS stores only a part of the secret key, which is also of λ bits.
In SLP-I, the CS is required to store $|V|^2\rho$ bits for the structure T, whereas the PS is required to store only the secret key. SLP-II reduces the leakage, but the CS storage doubles. SLP-III requires the same amount of storage as SLP-I.
Computation Complexity: In all schemes, the client computes $|V|^2$ BGN encryptions to encrypt A, while SLP-II additionally computes $|V|^2$ encryptions to encrypt B. To compute each of the |V| encrypted scores, the CS requires |V| bilinear map (e) computations and |V| multiplications.
Additionally, SLP-I randomizes the encrypted entries of the row that has been queried, which requires |V| exponentiations and |V| multiplications. SLP-II randomizes the encrypted scores, which requires |V| multiplications, and computes the encrypted degree of the queried vertex, which requires another |V| multiplications.
In all, the PS decrypts |V| scores. Each decryption requires log |V| multiplications on average. To find the vertex with the maximum score, in SLP-I and SLP-II, the PS compares |V| numbers. The |V| encrypted entries are decrypted by the PS in SLP-I and SLP-III. In addition, the PS evaluates the garbled circuit MGC in SLP-III.
Communication Complexity: To upload the encrypted matrices, SLP-I and SLP-III require $|V|^2\rho$ bits and SLP-II requires $2|V|^2\rho$ bits of communication. To query, the client sends only the trapdoor, which is of approximately 2ρ bits.
The CS sends 2|V| entries to the PS in SLP-I and SLP-III, and only |V| entries in SLP-II. Each of these entries is of ρ bits. In addition, in SLP-III the CS sends the garbled circuit MGC. PS-to-CS communication happens only when the PS evaluates MGC. In SLP-I and SLP-III, the PS sends only $i_{res}$, which is of log |V| bits, to the client. In SLP-II, however, the PS sends 2|V| log |V| bits to the client.

Experimental Evaluation
In this section, we present the experimental evaluation of our schemes SLP-I and SLP-II. In our experiment, we used a single machine for both the client and the server. All data is assumed to reside in main memory. The machine has an Intel Core i7-4770 CPU with 8 cores operating at 3.40 GHz, is equipped with 8 GB of RAM and runs a 64-bit Ubuntu 16.04 LTS operating system. The open-source PBC library [17] is used in our implementation to support BGN. The code is available in the repository [18].

Datasets
For our experiment, we used real-world datasets taken from the SNAP collection [9]. The collection consists of various kinds of real-world network data, including social networks, citation networks, collaboration networks and web graphs. We considered the undirected graph datasets bitcoinalpha, ego-Facebook, Email-Enron, email-Eu-core and Wiki-Vote. The numbers of nodes and edges of the graphs are shown in Table 2.
Instead of the full graphs, we considered subgraphs of them. Each subgraph consists of a fixed number of leading vertices of the dataset together with the edges joining them. For example, for a subgraph of size 1000, the vertices with identifier < 1000 are taken.
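The subgraph selection described above can be sketched as follows, assuming the dataset is available as a list of integer vertex pairs. The function name and the choice to drop self-loops and duplicate edges are assumptions of this sketch, not details taken from the experiments.

```python
def induced_subgraph(edges, n):
    """Keep the edges whose two endpoints both have identifier < n."""
    kept = set()
    for u, v in edges:
        if u < n and v < n and u != v:
            kept.add((min(u, v), max(u, v)))   # undirected: store each edge once
    return sorted(kept)

edges = [(0, 1), (1, 0), (2, 5), (3, 1200), (999, 4), (1000, 2)]
sub = induced_subgraph(edges, 1000)
```

Here the edges touching vertices 1000 and 1200 are discarded, and the two directions of the edge between 0 and 1 collapse into one entry.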

Experiment Results
In our experiment, five datasets are used. For each dataset, the experiment is run on extracted subgraphs with 50 to 1000 vertices, in increments of 50. The number of edges in the subgraphs is shown in Fig. 5. For the pairing, prime pairs of 128, 256 and 512 bits are taken. In our proposed schemes, the most expensive operation for the client is encrypting the matrix (EncMatrix). For the cloud and the proxy, computing the scores (LPQuery) and finding the maximum vertex (FindMaxVertex), respectively, are the most expensive operations. Hence, throughout this section, we mainly discuss these three operations.
In the proposed protocols, the main operation of encryption is encrypting each entry of the adjacency matrix. The number of edges therefore does not affect the encryption time in either SLP-I or SLP-II, because the number of operations is independent of the number of edges in both schemes.
However, the time taken by the proxy to decrypt the scores depends on the number of vertices. In SLP-I, the proxy has to decrypt |V| entries in G as well as |V| scores in $G_1$, whereas in SLP-II it decrypts only |V| scores in $G_1$. So the proxy takes more time in SLP-I than in SLP-II, as can be observed in Fig. 6c.
For a query in SLP-II, the proxy decrypts scores only for the vertices that are not adjacent to the queried vertex. So, only in this case, the computational time depends on the number of edges in the graph: as the density of edges increases, the computational time is more likely to decrease. In Fig. 7, we compare the computational time taken by the proxy in SLP-II for different datasets.
In the above figures, we have considered only 128-bit primes. The experiment shows that the computational time depends on the security parameter. As we increase the size of the primes, the computational time grows exponentially. We compare the change in computational time for the client, the cloud and the proxy in SLP-I and SLP-II (see Fig. 8 and Fig. 9, respectively). In practice, since the security parameter is kept fixed, keeping it as low as possible improves the performance.

Estimation of computational cost in SLP-III
In the previous section, we presented the experimental results for SLP-I and SLP-II. In this section, we estimate the computational cost of SLP-III. The encryption algorithm of SLP-III is the same as that of SLP-I, so both require the same amount of time to encrypt the same dataset. To estimate the query time, we consider a random graph with $10^3$ vertices.
Query Time: In SLP-III, the cloud computes the encrypted scores and the proxy decrypts the scores as well as the random numbers. The number of decryptions in each group is the same as in SLP-I. However, SLP-III requires an extra garbled circuit computation. For this, 1000 OTs at 128-bit ECC security are required, which takes approximately 138 × 1000 ms = 138 s ([2,15]). In addition, the PS evaluates the GC with 1000 × (11 × 257 + 4) = 2831000 XOR gates and 1000 × (5 × 257 + 1) = 1286000 AND gates. Assuming that the encryption used in the GC is AES (128-bit), GC evaluation requires 2 AES decryptions and the CS requires 8 encryptions per AND gate. According to [1], AES requires 0.57 cycles per byte. Thus, for evaluation on a single-core processor, the PS requires 2 × (1286000 × 256/8) × 0.57 = 46913280 cycles, which takes 46913280/(2.5 × 10^9) ≈ 0.019 s. Similarly, the CS requires 0.078 s to construct the GC.
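The arithmetic behind the gate counts and the PS evaluation time can be reproduced with a few lines of Python. The constants are the ones quoted in the text; the variable names are only illustrative, and the per-gate cost model (two 256-bit AES operations per AND gate, XOR gates free) is the one implied by the estimate.

```python
N = 1000                      # vertices in the random graph
OT_MS = 138                   # time per OT in milliseconds, as quoted in the text
AES_CYCLES_PER_BYTE = 0.57    # AES cost figure quoted in the text
CLOCK_HZ = 2.5e9              # single-core 2.5 GHz processor
BITS_PER_AES_OP = 256         # bits processed per AES operation in the estimate

xor_gates = N * (11 * 257 + 4)
and_gates = N * (5 * 257 + 1)
ot_seconds = N * OT_MS / 1000

bytes_per_op = BITS_PER_AES_OP // 8
ps_cycles = 2 * and_gates * bytes_per_op * AES_CYCLES_PER_BYTE   # 2 decryptions per AND gate
ps_seconds = ps_cycles / CLOCK_HZ

print(xor_gates, and_gates, ot_seconds, round(ps_cycles), round(ps_seconds, 3))
```

The OT phase (138 s) dominates the estimate by several orders of magnitude compared with the circuit evaluation itself.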
The estimated costs are measured with respect to a single-core 2.5 GHz processor. In practice, however, the CS provides a large number of multi-core processors. Since all the computations can be performed in parallel, the query cost can be reduced dramatically. Each of the above-mentioned costs can be improved to cost/p s with p processors, where cost is the single-core cost in seconds.
Introduction to SLP_k
Let us define another variant of the secure link prediction problem, SLP_k. Instead of returning the vertex with the highest score, SLP_k returns the indices of the k top-scored vertices.
Let a graph G = (V, E) be given. The top-k link prediction problem states that, given a vertex v ∈ V, it returns a set of vertices {u1, u2, . . . , uk} such that score(v, ui) is among the top-k elements in Sv. A top-k link prediction scheme is said to be secure, i.e., it is a secure top-k link prediction scheme (SLP_k), if the servers do not get any meaningful information about G from its encryption or from a sequence of queries.
Our proposed schemes SLP-I and SLP-II can be extended to support SLP_k queries. In SLP-I, the only change is that, instead of returning only the index of the vertex with the highest score, the proxy returns the indices of the k highest scores to the client.
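The change on the proxy side for SLP-I can be sketched as a top-k selection over the decrypted scores. The function name is hypothetical, and the order in which tied scores are returned is left to the library and is not specified by the scheme.

```python
import heapq

def top_k_indices(scores, k):
    """Return the indices of the k largest scores, highest score first."""
    best = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    return [idx for idx, _ in best]

decrypted_scores = [3, 0, 2, 5, 0, 4, 1]     # scores as seen by the proxy in SLP-I
result = top_k_indices(decrypted_scores, 3)
```

With k = 1 this reduces to the original query that returns only the index of the highest score.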

Conclusion
In this paper, we have introduced the secure link prediction problem and discussed its security. We have presented three constructions of SLP. The first scheme, SLP-I, has the least computational time but the maximum leakage to the proxy. The second, SLP-II, reduces the leakage by randomizing the scores, but suffers from a high communication cost from the proxy to the client. The third scheme, SLP-III, has minimum leakage to the proxy. Though the garbled circuit helps to reduce leakage, it increases the communication and computational cost of the cloud and the proxy servers.
Performance analysis shows that the schemes are practical. We have implemented prototypes of the first two schemes and measured their performance through experiments with different real-life datasets. We have also estimated the cost of SLP-III. In the future, we want to build a library that supports multiple queries, including the neighbor query, edge query, degree query and link prediction query.
It is to be noted that computation without privacy and security is far more efficient. The performance has degraded because we have added security; that is, security comes at the cost of performance.
Throughout the paper, we have considered unweighted graphs. As future work, the schemes can be extended to weighted graphs. Moreover, we have initiated the study of the secure link prediction problem and considered only common neighbors as the score metric. As future work, we will consider other metrics such as Jaccard's coefficient, Adamic/Adar, preferential attachment and Katz_β, and compare the efficiency of each.