PRIVATE SET INTERSECTION: NEW GENERIC CONSTRUCTIONS AND FEASIBILITY RESULTS

(Communicated by Alfred Menezes)
Abstract. In this paper we focus on protocols for private set intersection (PSI), through which two parties, each holding a set of inputs drawn from a ground set, jointly compute the intersection of their sets. Ideally, no further information than which elements are actually shared is compromised to the other party, yet the input set sizes are often considered as admissible leakage.
In the unconditional setting we show that PSI is impossible to realize and that unconditionally secure size-hiding PSI is possible assuming a set-up authority is present in a set-up phase. In the computational setting we give a generic construction using smooth projective hash functions for languages derived from perfectly-binding commitments. Further, we give two size-hiding constructions: the first one is theoretical and establishes the equivalence between PSI, oblivious transfer and the secure computation of the AND function. The second one is a twist on the oblivious polynomial evaluation construction of Freedman et al. from EUROCRYPT 2004. We further sketch a generalization of the latter using algebraic-geometric techniques. Finally, assuming again there is a set-up authority (yet not necessarily trusted) we present very simple and efficient constructions that only hide the size of the client's set.

Introduction
The Private Set Intersection (PSI) problem is, in a nutshell, concerned with two parties, each holding a set of inputs drawn from a ground set, that wish to jointly compute the intersection of their sets without leaking any additional information [20]. Cryptographic solutions to the PSI problem allow interaction between a client C and a server S, with respective private input sets $C = \{c_1, \ldots, c_v\}$ and $S = \{s_1, \ldots, s_w\}$, both drawn from the ground set U. At the end of the interaction, C learns S ∩ C and |S|, while S learns nothing beyond |C|.
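As a point of reference, the input/output behaviour just described can be written down as an ideal functionality. The following is only a minimal sketch; the function name `ideal_psi` is a hypothetical label, and the code says nothing about how a real protocol would compute these outputs privately.

```python
def ideal_psi(client_set, server_set):
    """Ideal PSI functionality: what each party is allowed to learn."""
    client_set, server_set = set(client_set), set(server_set)
    # The client learns the intersection and the size of the server's set.
    client_output = (client_set & server_set, len(server_set))
    # The server learns nothing beyond the size of the client's set.
    server_output = len(client_set)
    return client_output, server_output
```

For instance, with client set {1, 2, 3} and server set {2, 3, 5, 8}, the client's output is ({2, 3}, 4) and the server's output is 3.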
Plenty of potential applications for PSI have been suggested. The Department of Homeland Security wishing to check a list of suspicious persons against the passenger list of a flight operated by a foreign air carrier, the federal tax authority checking if any suspect tax evader has a foreign bank account, and other folklore case scenarios have been the leading examples in many papers (see [13] for a detailed list). Remarkably, in many of these scenarios it is in the interest of the client (and maybe also of the server), that the size of its input set is not leaked through the interaction. Consider for instance the case in which two competing companies wish to find out the set of common clients in a certain geographical area, while they might rather not disclose to what extent they are already established there. If size-hiding is a requirement, many of the techniques proposed for private set intersection fail to do the job.
History of PSI: Results and Techniques. Freedman et al. [20] introduced the first PSI protocol based on oblivious polynomial evaluation (OPE). The key intuition is that the elements of the client's private set can be represented as the roots of a polynomial, i.e., $P(x) = \prod_{i=1}^{v} (x - c_i) = \sum_{i=0}^{v} a_i x^i$. Leveraging any additively homomorphic encryption scheme (e.g., [37]), the client sends the encrypted coefficients $E(a_i)$ to the server, and the encrypted polynomial is obliviously evaluated by S on each element of its data set: S computes $\{u_j\}_{j=1,\ldots,w} = \{E(r_j P(s_j) + s_j)\}_{j=1,\ldots,w}$, where $E(\cdot)$ denotes additively homomorphic encryption and $r_j$ is chosen at random. If $s_j \in C \cap S$, then C learns $s_j$ upon decryption of the corresponding ciphertext $u_j$; otherwise C learns a random value. OPE-based PSI protocols have been extended in [31,21,14,15] to support multiple parties and other set operations such as union, element reduction, etc. Hazay et al. [25] proposed an Oblivious Pseudo-Random Function (OPRF) [19] as an alternative primitive to achieve PSI. In [25], given a secret index k to a pseudo-random function family, S evaluates $\{u_j\}_{j=1,\ldots,w} = \{f_k(s_j)\}_{j=1,\ldots,w}$ and sends it to C. Later, C and S engage in v executions of the OPRF protocol where C is the receiver with private input C and S is the sender with private input k. As a result, C learns $\{f_k(c_i)\}_{i=1,\ldots,v}$ such that $c_i \in C \cap S$ if and only if $f_k(c_i) \in \{u_j\}_{j=1,\ldots,w}$. Improvements using the same approach were provided in [27,28].
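The OPE idea can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the protocol of [20] as published: it instantiates the additively homomorphic scheme with a textbook Paillier cryptosystem built from two small Mersenne primes (far too small for real security), and all function names (`enc`, `dec`, `client_encrypt_poly`, `server_eval`, `client_output`) are hypothetical. Set elements are assumed to be small non-negative integers.

```python
import math
import secrets

# Toy Paillier parameters from two known Mersenne primes (assumption: demo only).
P1, P2 = 2**31 - 1, 2**61 - 1
N = P1 * P2
N2 = N * N
LAM = (P1 - 1) * (P2 - 1) // math.gcd(P1 - 1, P2 - 1)
MU = pow(LAM, -1, N)  # modular inverse, needs Python 3.8+

def enc(m):
    """Paillier encryption with generator N + 1."""
    r = secrets.randbelow(N - 1) + 1
    return (1 + (m % N) * N) * pow(r, N, N2) % N2

def dec(c):
    """Paillier decryption (only the client holds LAM and MU)."""
    return (pow(c, LAM, N2) - 1) // N * MU % N

def client_encrypt_poly(client_set):
    """Coefficients a_0..a_v of P(x) = prod (x - c_i), encrypted one by one."""
    coeffs = [1]
    for c in client_set:
        nxt = [0] * (len(coeffs) + 1)
        for i, a in enumerate(coeffs):
            nxt[i] = (nxt[i] - c * a) % N      # contributes -c * a * x^i
            nxt[i + 1] = (nxt[i + 1] + a) % N  # contributes a * x^(i+1)
        coeffs = nxt
    return [enc(a) for a in coeffs]

def server_eval(enc_coeffs, server_set):
    """For each s_j return E(r_j * P(s_j) + s_j), computed on ciphertexts only."""
    replies = []
    for s in server_set:
        acc = enc(0)
        for i, ca in enumerate(enc_coeffs):
            acc = acc * pow(ca, pow(s, i, N), N2) % N2   # adds a_i * s^i
        r = secrets.randbelow(N - 1) + 1
        replies.append(pow(acc, r, N2) * enc(s) % N2)    # r * P(s) + s
    secrets.SystemRandom().shuffle(replies)
    return replies

def client_output(client_set, replies):
    """Decrypt and keep the values that are elements of the client's own set."""
    return {m for m in map(dec, replies) if m in client_set}
```

A run with client set {3, 7, 12} and server set [7, 12, 40, 99] leaves the client with {7, 12}: for the two shared elements P(s_j) = 0, so the plaintext is s_j itself, while the other two ciphertexts decrypt to values that look random.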
Given U as the ground set from which the elements of C and S are drawn (i.e., C, S ⊆ U), none of the above techniques prevents a client from running a PSI protocol on private input C = U in order to learn the elements in S. To this end, Camenisch et al. extended PSI to Certified Sets [5], where a Trusted Third Party (TTP) ensures that private inputs are valid and binds them to each participant. The certification issue was also addressed in [11], where a problem related to PSI was considered. Moreover, the same extension, under a different name, Authorised PSI, was considered in [13], where protocols that use modular exponentiation, multiplication and hash evaluation were described.
Efficiency of the protocols on large data sets is an important practical issue, and it has been addressed in several papers in the last years. One of the most relevant applications in this respect comes from genomic computing: the user wants to protect the privacy of sensitive information coded in her genomic sequence but at the same time wishes to engage in private computations with other parties, in order to get some advantage, e.g., understand whether she has a predisposition to certain diseases or whether some medication could benefit her state of health. Baldi et al. [4] and De Cristofaro et al. [10] offer an in-depth introduction to this area. In [12] linear-complexity private set intersection protocols were proposed for malicious adversaries. Along the line of [36], to gain efficiency, Bloom filters have been applied in [30,17]. The protocols proposed in [30] are elegant and one of them is designed for an outsourced scenario. The solutions described in [17], [39] and [38] are currently, with respect to semi-honest adversaries, the most efficient ones available.
Related work. All of the above techniques reveal the size of the participants' sets. That is, C (resp. S) learns |S| (resp. |C|), even if C ∩ S = ∅. To protect the size of private input sets, Ateniese et al. [3] proposed a size-hiding PSI protocol where C can privately learn C ∩ S and |S| without leaking the size of C. Their scheme is based on RSA accumulators and the unpredictability of the RSA function. The authors proved security against honest-but-curious adversaries in the random oracle model. Very recently, general results on hiding the input size in two-party computation were given [32]; Lindell et al. prove that hiding the input size of both participants involved in a set intersection protocol is impossible in the cryptographic model, considering honest-but-curious adversaries, if after the interaction:
• both C and S get C ∩ S and nothing else,
• the client only gets C ∩ S and the server gets nothing,
• the client gets C ∩ S and the server gets its size, |C ∩ S|.
Further, they show that hiding one party's input size is possible using fully homomorphic encryption; this construction is mainly theoretical, but is the first to encourage searching for new designs without random oracles.
Another (loosely related) work in which the input-size issue has been addressed is [6], where the authors present a simple database commitment functionality where besides the standard security properties the input size of the committed database set is not revealed. They consider malicious users and use as a main tool universal arguments of quasi-knowledge.
Contributions. This paper focuses on PSI protocols between honest-but-curious parties under different security models. It is an improved and revised version of [16]: as the most remarkable changes, we have refined definitions in Section 2, extended the results of Section 3, and added the generic constructions from Section 4. Section 5 has been rewritten, excluding constructions that are outperformed by the ones depicted now in the paper.
We start by considering the size-hiding scenario in Section 3, where both parties hide the size of their input sets. In the unconditional setting, we prove that PSI protocols are not achievable, while size-hiding PSI is possible involving a trusted third party which disappears after the setup phase (see Figure 1). Then, in Section 4 we deal with computational security. First we present a new generic construction for PSI from smooth projective hashing, standing on the seminal ideas of Cramer-Shoup [9] and Gennaro-Lindell [22] which have already found applications in several other contexts, such as password-based authenticated key exchange and oblivious transfer. From this generic construction one can derive PSI protocols that can be proved secure in the common reference string model under different number theoretic assumptions, including decisional Diffie-Hellman, quadratic residuosity and Paillier's decision composite residuosity. The communication complexity of the resulting schemes is quadratic in the input size, thus unfortunately higher than that of other known public-key constructions (like the OPE from [20]).
Further, we establish the existence of a PSI protocol where both parties hide the size of their sets (Figure 3); this does not contradict [32], for our protocol cannot be used unless the universe size |U| is polynomial (as computations have to actually run over the full universe). Then, we describe a generalization of the oblivious polynomial evaluation construction from Freedman et al. [20], which hides the input sizes if a bound M on both the client's and the server's set is known, and sketch a generic algebraic-geometric construction that can be seen as an (at this point, only theoretical) generalization of it. The latter poses an algebraic-geometric question that we believe to be of interest in its own right.
Finally, in Section 5 we provide two explicit constructions of unbalanced protocols where only the client hides the size of its input set. These assume the aid of an authority (not necessarily trusted in one of them) which is only present in a setup phase and then disappears.

Preliminaries
In this section we provide definitions and tools used in the rest of the paper, essentially following the model from [3]. However, we refine their definitions slightly in order to deal with both computationally and unconditionally secure protocols, and to introduce the size-hiding constraint on both the client and the server side. Moreover, we generalize the underlying scenario by considering a finite set of clients $C = \{C_1, \ldots, C_n\}$ and a finite set of servers $S = \{S_1, \ldots, S_k\}$ (possibly not disjoint) holding corresponding data sets $C_i$ for $i = 1, \ldots, n$ and $S_j$ for $j = 1, \ldots, k$, whose elements are drawn from a common universe U. Each client $C_i$ may interact with each server $S_j$ in order to compute the intersection $C_i \cap S_j$. Note that as the set of clients and the set of servers may not be disjoint, we capture the situation in which the same participants may be enrolled in various executions playing a different role each time.
In some cases we introduce a setup authority SA which is fully trusted and interacts with the parties in a setup phase before the actual protocol execution takes place (so, it is in a sense a Trusted Initializer as introduced by Rivest in [41]). This authority might provide secret information to the parties. Its presence is actually unavoidable if participants cannot be assumed honest-but-curious: it is needed to certify the sets of secrets held by the parties, in order to prevent any party from executing the protocol with a different set than the one it owns (for instance, the full universe).

Definition 1. A party is referred to as honest-but-curious, HBC for short, if it correctly follows the steps of the protocol but eventually tries to get extra knowledge from the transcript of the execution.
A size-hiding private set intersection protocol (SH-PSI), enabling two participants $C_i \in C = \{C_1, \ldots, C_n\}$ and $S_j \in S = \{S_1, \ldots, S_k\}$ to compute the intersection of their sets of inputs, while revealing no further information (in particular, keeping secret all sizes of the involved sets), can be defined as follows:
Definition 2. A SH-PSI protocol is a scheme involving $C_i \in C = \{C_1, \ldots, C_n\}$, $S_j \in S = \{S_1, \ldots, S_k\}$ and (possibly) a fully trusted setup authority SA, together with two components, Setup and Interaction, where
• Setup is an algorithm that selects all global parameters, run by the server $S_j$ and the client $C_i$, and possibly involving the setup authority SA;
• Interaction is a protocol involving only $S_j$ and $C_i$ on respective input sets which are subsets of a ground set $U = \{u_1, \ldots, u_{|U|}\}$,
satisfying correctness, client privacy, and server privacy.
Correctness is formalized by:
Definition 3. A scheme specified in Definition 2 is correct if, when run by HBC parties C ∈ C and S ∈ S, at the end of Interaction, run on corresponding inputs S and C, with overwhelming probability S outputs ⊥ and C outputs S ∩ C, or ⊥ if the intersection is empty.
Notice that, compared to the definition of correctness provided in [3], we do not require that |S| is part of the client's output. In Section 5 in which we will apply the size-hiding restriction only on the client's side, we will stick to the definition of correctness from [3] and require the client C to output S ∩ C and |S| or just |S| if the intersection is empty. In the sequel, we will refer to such protocols as unbalanced size-hiding, thus coining the term USH-PSI to address schemes fulfilling the definition of correctness from [3]. Unless otherwise specified, the definition of correctness used in the sequel is the one above.
Concerning client privacy, since the server does not get any output from the protocol, it is enough to require that the server, from the interaction, does not distinguish between cases in which the client has different input sets.
Definition 4. Let $\mathrm{View}_S(C, S)$ be a random variable representing the view of S during the execution of Interaction with inputs C (from a client in C) and S. A scheme specified as in Definition 2 guarantees client privacy if, for every $S^*$ that plays the role of S, for every set S, and for any two possible client input sets $C, C'$, the views $\mathrm{View}_{S^*}(C, S)$ and $\mathrm{View}_{S^*}(C', S)$ are indistinguishable.
Notice that in the above definition, when considering the unconditional setting, the parties S, C and $S^*$ are unbounded and indistinguishable means that the two views are perfectly indistinguishable, that is, they are identically distributed. On the other hand, in the computational setting, S, C and $S^*$ are probabilistic polynomial time Turing machines. Hence, indistinguishable means that the two views are computationally indistinguishable.
Server privacy needs a bit more: the client gets the output of the protocol, and by using its input and the output, by analyzing the transcript of the execution, could obtain extra-knowledge about the server's secrets. Nevertheless, if the transcript can be simulated by using only input and output, then server privacy is achieved.
Definition 5. Let C ∈ C and let $\mathrm{View}_C(C, S)$ be a random variable representing C's view during the execution of Interaction with inputs C and S. Then, a scheme defined as in Definition 2 guarantees server privacy if, for any C ∈ C, there exists an algorithm $C^*$ such that the transcript produced by $C^*$, given only the client's input C and the output S ∩ C, is indistinguishable from $\mathrm{View}_C(C, S)$.
As before, in the unconditional setting the parties are unbounded and the transcript produced by $C^*$ and the real view need to be identically distributed. In the computational setting, the parties are probabilistic polynomial time Turing machines and the transcript produced by $C^*$ and the real view are required to be computationally indistinguishable. We will always consider HBC parties, even though Definitions 4 and 5 are written without this restriction.

SH-PSI: the unconditionally secure case
In this section we deal with the unconditionally secure setting and show two results: first, we evidence that unconditionally secure 2-party PSI protocols do not exist. Further, we prove that unconditionally secure size-hiding PSI can indeed be achieved if we allow the involvement of SA in the Setup algorithm.
3.1. Impossibility of unconditionally secure 2-party PSI. For a universe U such that |U| > 2, there is no unconditionally secure 2-party PSI protocol, no matter if one or both players receive the intersection as an output, or if the size of the input sets is revealed. To prove this claim, we first show that from a secure PSI protocol Π that outputs the intersection and the set sizes to both participants, one can construct a protocol that securely computes the AND function. The latter does not exist in the unconditional setting [8, Page 22], so Π cannot exist. The remaining output variants of PSI protocols can be obtained from a protocol like Π, thus completing our argument.
Theorem 1. Let $F = (F_1, F_2)$ be the functionality that, on input a pair of subsets (C, S) of a universe U with |U| > 2, outputs $F_1(C, S) = (C \cap S, |S|)$ to the first participant and $F_2(C, S) = (C \cap S, |C|)$ to the second one. Then an unconditionally secure 2-party protocol which computes F does not exist.
Proof. Let us reason by contradiction and assume that a secure 2-party protocol Π which computes F does exist. From Π we will build a secure private AND protocol Γ. A private AND protocol is a two-party protocol, run by users A and B, at the end of which the players get the logical AND of their input bits and nothing else (i.e., a secure protocol for computing a · b from private inputs a, b ∈ {0, 1}).
Indeed, assume that, for $i \in \{1, 2\}$, $P_i$ is holding as input for Γ a bit $b_i \in \{0, 1\}$. For the PSI protocol Π, choose three different elements $a_1, a_2, a_3 \in U$. Each player $P_i$ constructs its input set $X_i$ as follows: $X_i = \{a_i\}$ if $b_i = 0$, and $X_i = \{a_3\}$ if $b_i = 1$. Then $P_1$ and $P_2$ run the protocol Π for PSI with inputs $X_1$ and $X_2$ respectively. Each player, depending on the output of Π, sets the output of Γ as follows:
• If the intersection $X_1 \cap X_2 = \emptyset$, then the output of Γ is set to 0.
• If the intersection $X_1 \cap X_2 = \{a_3\}$, then the output of Γ is set to 1.
It is easy to check that Γ is a secure protocol for the computation of $\mathrm{AND}(b_1, b_2)$:
Security. First note that all the possibly involved sets have precisely one element, therefore the size of the other player's set does not provide any information. Let us analyze the information leaked from $X_1 \cap X_2$ to player $P_i$. If the input of player $P_i$ is $b_i = 0$ then $X_1 \cap X_2 = \emptyset$, regardless of the input of the other player. Thus $P_i$ does not gain any information. On the other hand, if $b_i = 1$, the intersection $X_1 \cap X_2$ allows $P_i$ to learn the input of the other player, but this is also what happens in an ideal execution of an AND protocol. As Π is supposed to be an unconditionally secure protocol, players do not learn anything beyond the output, so Γ is also an unconditionally secure protocol, which concludes the proof.
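The reduction used in this proof can be checked mechanically. The sketch below is only an illustration: `psi_ideal` is a hypothetical stand-in for an ideal execution of Π (it simply intersects the two sets), and the element labels are arbitrary placeholders for three distinct elements of U.

```python
def psi_ideal(x1, x2):
    """Stand-in for an ideal run of the PSI protocol: both players get the intersection."""
    return x1 & x2

def and_via_psi(b1, b2, a1="a1", a2="a2", a3="a3"):
    """Compute AND(b1, b2) from one PSI execution on singleton sets."""
    x1 = {a3} if b1 else {a1}   # player P_1 encodes its bit
    x2 = {a3} if b2 else {a2}   # player P_2 encodes its bit
    # The intersection is {a3} exactly when both bits are 1, and empty otherwise.
    return 1 if psi_ideal(x1, x2) == {a3} else 0
```

Both input sets always have exactly one element, which is why the set sizes revealed by the PSI functionality carry no information in this reduction.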
Then an unconditionally secure 2-party protocol which computes G does not exist.
Proof. From a protocol Λ which securely computes G we can derive a protocol Π which securely computes F . Namely, the player holding the intersection C ∩ S sends it to the other player, if needed, and each player sends the size of its input to the other player. Theorem 1 states no such protocol Π can exist; as a result, neither can Λ.

3.2. Unconditionally secure SH-PSI with a trusted initializer. Allowing the involvement of a setup authority, we can achieve size-hiding in the unconditional setting. Figure 1 shows a construction. In the Setup, the SA chooses two random bijections $f, g : \mathcal{P}(U) \to \{0, 1\}^{|U|}$, which are kept secret from client and server. After the SA has received the client's set of secrets, she sends the client an identifier, computed by means of f, and a list of sub-identifiers, one for each possible subset of the client's set of secrets, computed through the second random function g. When the SA receives a server's set of secrets, she constructs a two-column table: in the first column there is, for each possible subset E of the ground set, an identifier of E, computed with the first random function; the second column has an identifier, computed with the second random function, of the intersection E ∩ S of the subset E and the server's set of secrets S. The table is given to the server. The Interaction between a client and a server is a simple two-round protocol: the client sends its set identifier; the server looks up in the table the row with the received identifier, and returns the identifier in the second column. Finally, the client looks up the server's reply in the list of sub-identifiers and determines the intersection. We emphasize that the goal of this construction is only to show the existence of SH-PSI schemes in the unconditional setting. The scheme we provide is not practical for a universe of large size.
Theorem 2. Let $f, g : \mathcal{P}(U) \to \{0, 1\}^{|U|}$ be two random bijections. The protocol described in Figure 1 is a SH-PSI protocol achieving correctness, client privacy, and server privacy in the unconditional model.
Proof. Correctness. The table T contains the pair (R, g(C ∩ S)), so the client will indeed retrieve C ∩ S by looking up in L.
Server Privacy. The client only gets a (random) identifier from the interaction, which can be simulated without S. Namely, the algorithm $C^*$ mentioned in Definition 5 may simply generate values uniformly at random in $\{0, 1\}^{|U|}$ to (perfectly) simulate the client's view.
Client Privacy. The server does not get any information about the correspondence value-subset, since the construction of the table is completely blind to it. Moreover, independently of the client's set of secrets, the table the server gets has in the second column exactly $2^{|S|}$ different random values, that is, the number of all possible subsets of S. Each of these values appears exactly the same number of times, namely $2^{|U|-|S|}$, as for every F ⊆ S we have $|\{E \subseteq U : E \cap S = F\}| = 2^{|U|-|S|}$. Hence, a request from a client only allows the server to learn the two values (f(C), g(C ∩ S)), which do not leak any information about the client's set of secrets nor its size.
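The counting argument used for client privacy (every subset F of S is hit by exactly $2^{|U|-|S|}$ subsets E of U) can be verified by brute force on a small universe. The helper name `fibre_sizes` below is a hypothetical label chosen for this sketch.

```python
from collections import Counter
from itertools import combinations

def powerset(elems):
    """All subsets of elems, as frozensets."""
    elems = sorted(elems)
    return [frozenset(c) for k in range(len(elems) + 1)
            for c in combinations(elems, k)]

def fibre_sizes(universe, server_set):
    """For each F ⊆ S, count the subsets E ⊆ U with E ∩ S = F."""
    s = frozenset(server_set)
    return Counter(E & s for E in powerset(universe))
```

With |U| = 5 and |S| = 2, the counter has $2^2 = 4$ keys and every count equals $2^{5-2} = 8$, since the elements of E outside S can be chosen freely.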

Cryptographic setting
In this section we move to the cryptographic setting; we start by introducing a new generic construction based on smooth projective hashing.
4.1. PSI from smooth projective hashing. The main tool behind our construction is smooth projective hash functions (SPH) from commitment schemes, standing on the seminal ideas of Cramer-Shoup [9] and Gennaro-Lindell [22], which have already found applications in several other contexts, such as password-based authenticated key exchange and oblivious transfer. We summarize informally the main definitions and results, and refer to [22,1] for precise definitions and concrete constructions.
Let Com be a non-interactive commitment scheme that is computationally hiding and perfectly binding in the common reference string model. This may be derived from any IND-CPA encryption scheme, like ElGamal encryption. For other applications of smooth projective hashing, non-malleability of the commitments is required, and they are typically derived from IND-CCA encryption schemes (cf. [1] for results on non-interactive commitment schemes for smooth projective hashing constructions). By $\mathrm{Com}_\rho(u, r)$ we denote a commitment to an element u using the common reference string ρ and the random coins r; further, denote by $\mathrm{Com}_\rho$ the space of all strings that may be output by Com when the common reference string is ρ. Now, define the sets $U_\rho = U \times \mathrm{Com}_\rho$ and $L_\rho = \{(u, com) \in U_\rho : com = \mathrm{Com}_\rho(u, r) \text{ for some } r\}$.
We consider a subset membership problem defined as follows. For each ℓ ∈ N a common reference string ρ (of polynomial size in ℓ) is selected. Further, for each u ∈ U define $D(U_\rho \setminus L_\rho)$, respectively $D(L_\rho)$, as the distribution induced by choosing random r and computing $(\mathrm{Com}_\rho(0^{|u|}, r), u)$, respectively $(\mathrm{Com}_\rho(u, r), u)$. Now consider the above subset membership problem I, where for each ρ the set $U_\rho$ is partitioned by the sets $\{\{u\} \times \mathrm{Com}_\rho\}_{u \in U}$. Then, similarly as in [22], from the hiding property of the commitment scheme it is easy to see that I is a hard partitioned subset membership problem.
In such a situation, consider at hand a smooth projective hash family F for the induced hard partitioned subset membership problem. This essentially means we have a hash family F indexed by a key space K, that is $F = \{f_k\}_{k \in K}$, so that for every k ∈ K we have $f_k : U_\rho \to G$ for a set G of superpolynomial size in the security parameter. Let P be a fixed set and α an (efficiently computable) projection function defined over $K \times \mathrm{Com}_\rho$ with range in P. Now given all these ingredients we get the following:
1. Efficient hashing from k: there exists an efficient algorithm KeyHash which on input a pair (u, com) from $U_\rho$ and a key k ∈ K outputs $f_k(u, com)$.
2. Efficient hashing from projection and witness: there exists an efficient algorithm ProjHash which on input
• a pair $(u, com) \in U_\rho$,
• a random value r so that $com = \mathrm{Com}_\rho(u, r)$,
• and a projection α(k, com),
outputs $f_k(u, com)$.
3. Smoothness: for every u ∈ U, $com \in \mathrm{Com}_\rho$, with $(u, com) \notin L_\rho$, the distributions $\{com, u, \alpha(k, com), f_k(u, com)\}$ and $\{com = \mathrm{Com}_\rho(u, r), u, \alpha(k, com), g\}$ are computationally indistinguishable, where g is drawn uniformly at random from G.
Let us now introduce our protocol, building on a non-interactive commitment scheme Com and a smooth projective hash family $F = \{f_k\}_{k \in K}$ as described above.
The key ideas in our design are, in a nutshell:
• During setup, SA fixes and publishes the necessary information for implementing a non-interactive commitment scheme (which can actually be seen as the common reference string ρ). Also the associated projective hash family is made explicit.
• In the first round, the client sends to the server a commitment $com_i$ to each element $c_i$ in his set.
• The server selects uniformly at random a key k ∈ K and evaluates the corresponding hash function $f_k$ on each pair $(s_j, com_i)$, for 1 ≤ i ≤ v, 1 ≤ j ≤ w. Further, she sends them to the client together with the projection keys $\alpha(k, com_i)$ for $i = 1, \ldots, v$.
• Finally, the client can identify matching sets C \ {s} for each s ∈ C ∩ S, and, as a result, retrieve the intersection. Note that this is all he learns, as due to the smoothness property he gains no information from the evaluations of $f_k$ in any pair (u, com) for which com is not a valid commitment on u.
A detailed description can be seen in Figure 2; note that evaluations of $f_k$ computed by the server are actually executions of KeyHash, while evaluations by the client are executions of ProjHash.
Figure 2. A generic PSI protocol from smooth projective hashing.
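The message flow above can be made concrete with a toy Python sketch. Everything specific in it is an assumption made for illustration and is not taken from Figure 2 or from [22]: commitments are ElGamal-style pairs in a small prime-order subgroup, the hash is the standard DDH-style projective hash $f_k(u, (c_1, c_2)) = c_1^x (c_2 / g^u)^y$ with projection $g^x h^y$, and, as a simplification that deviates from the single key k described above, the server draws a fresh key for every pair $(s_j, com_i)$ so that this simple hash stays smooth for each pair. Elements are assumed to be integers in [0, Q), and the group is far too small for real use.

```python
import secrets
from itertools import count
from math import isqrt

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, isqrt(n) + 1))

# Toy group: subgroup of prime order Q inside Z_P^* (assumption: demo sizes only).
Q = 2**31 - 1                                    # a Mersenne prime
P = next(k * Q + 1 for k in count(2, 2) if is_prime(k * Q + 1))
_h = 2
while pow(_h, (P - 1) // Q, P) == 1:
    _h += 1
G = pow(_h, (P - 1) // Q, P)                     # generator of the subgroup
H = pow(G, secrets.randbelow(Q - 1) + 1, P)      # CRS element; exponent is discarded

def commit(u, r):
    """ElGamal-style commitment: perfectly binding, hiding under DDH."""
    return pow(G, r, P), pow(H, r, P) * pow(G, u, P) % P

def keygen():
    return secrets.randbelow(Q), secrets.randbelow(Q)

def project(k):
    x, y = k
    return pow(G, x, P) * pow(H, y, P) % P

def key_hash(k, u, com):
    """Server side: evaluate the hash from the secret key k."""
    x, y = k
    c1, c2 = com
    d = c2 * pow(G, (-u) % Q, P) % P             # c2 / g^u
    return pow(c1, x, P) * pow(d, y, P) % P

def proj_hash(alpha, r):
    """Client side: evaluate the hash from the projection and the witness r."""
    return pow(alpha, r, P)

def psi(client_set, server_set):
    elems = sorted(client_set)
    rs = [secrets.randbelow(Q) for _ in elems]
    coms = [commit(c, r) for c, r in zip(elems, rs)]        # round 1: client -> server
    replies = []
    for com in coms:                                         # round 2: server -> client
        row = []
        for s in server_set:
            k = keygen()                                     # fresh key per pair
            row.append((project(k), key_hash(k, s, com)))
        secrets.SystemRandom().shuffle(row)
        replies.append(row)
    # Client output: c_i is in the intersection iff some hash value matches.
    return {c for c, r, row in zip(elems, rs, replies)
            if any(proj_hash(alpha, r) == phi for alpha, phi in row)}
```

When $c_i = s_j$ both evaluations equal $g^{rx} h^{ry}$, so they agree; otherwise the server's value carries an extra factor $g^{(c_i - s_j) y}$, which in this toy group equals 1 only if y = 0. The number of hash values sent is v · w, matching the quadratic communication cost noted for the generic construction.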
Theorem 3. Let Com be a non-interactive perfectly binding and computationally hiding commitment scheme, and F a family of smooth projective hash functions. Then, the protocol given in Figure 2 is a PSI protocol achieving correctness, client privacy, and server privacy.
Proof. Correctness. To see that the protocol is correct, let us first argue that if there are indices $i \in \{1, \ldots, v\}$ and $j \in \{1, \ldots, w\}$ such that $c_i = s_j$, the client will set $c_i \in C \cap S$. Indeed, in this case $(s_j, com_i) \in L_\rho$, and as a result the KeyHash and ProjHash executions carried out by the server and client (respectively) will output the same value $\varphi_{ij}$.
Furthermore, if $c_i$ is not in S, then for each $j = 1, \ldots, w$, due to the smoothness of F we have that with overwhelming probability $test_i \neq \varphi_{ij}$; as a result there is only negligible probability that the client includes $c_i$ in the intersection.
Client Privacy. As the commitment scheme is computationally hiding, only |C| is leaked from the interaction.
Server Privacy. To simulate the client's view, $C^*$ uniformly at random selects a key $k^*$ for the projective hash family F, chooses nonces $r^*_i$, sets $com_i = \mathrm{Com}_\rho(c_i, r^*_i)$, and computes the projections $\alpha(k^*, com_i)$ for $i = 1, \ldots, v$. Assume w.l.o.g. that $C \cap S = \{c_1, \ldots, c_I\}$, for 0 ≤ I ≤ v. Now, for $i = 1, \ldots, I$, set $\varphi^*_{i,j} = f_{k^*}(c_i, com_i)$ and choose all other $\varphi^*_{ij}$ uniformly at random from G. Due to the smoothness property, this view is computationally indistinguishable from the real client's view.

Remark 1. As efficient smooth projective hashing as above can be derived from different number theoretical assumptions (decisional Diffie-Hellman, quadratic residuosity, and Paillier's decision composite residuosity), so can the above PSI protocol. As a result, in scenarios where PSI is to be combined with other (public key) cryptographic tools, it could be of interest to have at hand a SPH construction that allows for reusing global parameters and underlying assumptions.

Table 1 (columns: Protocol, Comm. Overhead, Server Exp., Client Exp.).
On the other hand, the computation and communication complexity of this scheme is (globally) higher than that of other known constructions. In Table 1 we give a brief summary of how an implementation of our scheme compares to previous proposals which are also public key, standard model (i.e., without random oracles) and only consider honest-but-curious adversaries ([20,31]). The cost for Client and Server is measured in terms of exponentiations. Here we see that our scheme is outperformed due to its high communication overhead. Recent proposals outside the public key setting are indeed much more efficient, see [39,38]. The cost of our scheme is evaluated considering a DDH-based construction for SPH described in [22], and assuming the corresponding commitments are implemented via CCA2-encryptions under the same assumption. Thus, the scenario considered is the same as in Section 7.2 of [22].

4.2. The size-hiding case. In this section we show that in the computational case size-hiding private set intersection is possible. We give two constructions. The first one is valid when the size of the ground set is polynomial in the security parameter and allows us to state the equivalence between PSI, oblivious transfer (OT), and the secure computation of the AND function. This protocol was independently proposed in [26]. The second one is tailored for relatively small sets of secrets with an upper bound on their size being known a priori.

AND-based SH-PSI protocol.
An oblivious transfer protocol is a two-party protocol involving a sender and a receiver; the sender has two secrets, $s_0$ and $s_1$, while the receiver is interested in one of them. Its choice is represented by a bit σ.
After running the protocol the receiver gets $s_\sigma$ and nothing else, while the sender does not learn which secret the receiver has recovered. Introduced by Rabin [40], and later on redefined in different equivalent ways, it is a key tool in secure two-party and multi-party computation. We will denote this primitive by $\mathrm{OT}(s_0, s_1, \sigma)$. It is well-known folklore that a private AND protocol can be realized by using an $\mathrm{OT}(b_0, b_1, s)$ protocol: the bit $b_s$ that the receiver gets can be expressed as $b_s = (1 \oplus s) b_0 \oplus s b_1$. As a result, invoking the instance OT(0, a, b), the receiver will get a · b. Now, the key idea underlying our construction is that, if the sets of secrets of C and S are represented by means of two characteristic vectors $I_C$ and $I_S$ of elements of U, then, by running an $\mathrm{AND}(I_{c_i}, I_{s_i})$ protocol for each bit of the vectors, C and S get the intersection and nothing else. Each $\mathrm{AND}(I_{c_i}, I_{s_i}) = 1$ means that they share the i-th element of the ground set U. Details are given in Figure 3.
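The chain from OT to AND to set intersection can be sketched in a few lines of Python. The sketch only shows the data flow: `ot_ideal` is a hypothetical stand-in for an ideal OT (a real instantiation would replace it with a cryptographic protocol), and the assignment of roles (server as sender, client as receiver) is an assumption made here so that the client obtains the result.

```python
def ot_ideal(s0, s1, sigma):
    """Ideal 1-out-of-2 OT: the receiver with choice bit sigma obtains s_sigma."""
    return s1 if sigma else s0

def and_via_ot(a, b):
    """Sender holds bit a, receiver holds bit b; OT(0, a, b) yields a * b."""
    return ot_ideal(0, a, b)

def characteristic_vector(subset, universe):
    """Bit j is 1 iff the j-th element of the universe is in the subset."""
    return [1 if u in subset else 0 for u in universe]

def sh_psi(universe, client_set, server_set):
    """One AND per element of the universe; the vectors hide both set sizes."""
    i_c = characteristic_vector(client_set, universe)
    i_s = characteristic_vector(server_set, universe)
    bits = [and_via_ot(s_bit, c_bit) for c_bit, s_bit in zip(i_c, i_s)]
    return {u for u, bit in zip(universe, bits) if bit}
```

The number of AND executions is |U| regardless of the inputs, which is why this approach only makes sense for a universe of polynomial size.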
Let $\ell$ be a security parameter and let the universe $U = \{u_1, \dots, u_{|U|}\}$ be a ground set of size $\mathrm{poly}(\ell)$ (both known and fixed at the setup phase). Assume that C (resp. S) can be encoded in a characteristic vector $I_C$ (resp. $I_S$), such that $I_C[j] = 1$ (resp. $I_S[j] = 1$) iff the j-th element of U is in C (resp. S). It is easy to check that the protocol is correct. Moreover, it is secure as long as the AND protocol is. Thus, if we use the OT construction proposed in [18] [...]

Proof. Correctness. The SH-PSI protocol from Figure 3 is correct as long as the OT protocol is. Just note that [...]

Client Privacy. Intuitively, client privacy is guaranteed because in the OT protocol the Sender, independently of the bit $i$ held by the Receiver, gets uniformly distributed values of $D_\alpha$. Indeed, $y_{1-i} = e_{1-i}$ is chosen uniformly at random in $D_\alpha$, and $y_i = f_\alpha(e_i)$ is still uniformly distributed in $D_\alpha$ since $e_i$ is chosen uniformly at random and $f_\alpha$ is a permutation on $D_\alpha$. More precisely, a simulator $\mathrm{Sim}_S$ for the sender in the OT protocol will choose $y_0, y_1 \in D_\alpha$ uniformly at random. From this one can construct a simulator for the server by running $\mathrm{Sim}_S$ independently for each instance of the AND computation. Note that the view produced by this simulator is identically distributed to the view of the sender in a real execution of the protocol, and it is independent of the set of secrets held by the client. It then easily follows that the server cannot distinguish between two executions in which the client has two different sets of secrets.

Server Privacy. On the other hand, in the OT protocol the Receiver gains no knowledge of the bit $b_{1-i}$ because $b(\cdot)$ is a hardcore predicate for $f_\alpha$; as a result, from the triplet ($\alpha$, [...] From this, we may construct a simulator $\mathrm{Sim}_R$ which acts as the receiver in the real OT protocol except that it chooses $c_{1-i} \in \{0, 1\}$ uniformly at random. This in turn yields a simulator for the client of the SH-PSI protocol, by running $\mathrm{Sim}_R$ for each execution of the AND protocol.

Remark 2. In this section we have described a reduction from PSI to AND. Given the known reductions from AND to OT and from OT to PSI (the latter constructed in [20]), it follows that OT, PSI and AND are equivalent.

4.2.2. Threshold-based protocol. Assuming some a priori information on the sizes of the input sets is known, more efficient protocols may be achieved. Here we assume that a known value M upper-bounds the sizes of all clients' and servers' sets. The smaller M is with respect to |U|, the greater the interest of this construction (we need M to be of polynomial size, but |U| can be larger). As our scheme is a simple twist of [20], the main tool used is also additively homomorphic encryption. Informally, a public-key encryption scheme is (additively) homomorphic if, for any two encryptions $\mathrm{Enc}(m_1)$ and $\mathrm{Enc}(m_2)$ of any two messages $m_1$ and $m_2$, it holds that $\mathrm{Enc}(m_1) \cdot \mathrm{Enc}(m_2) = \mathrm{Enc}(m_1 + m_2)$, where $\cdot$ is a group operation on ciphertexts. By repeated application of the property, for any integer $c$, it follows that $\mathrm{Enc}(m_1)^c = \mathrm{Enc}(c\, m_1)$; as a result, given encryptions $\mathrm{Enc}(a_0), \dots, \mathrm{Enc}(a_k)$ of the coefficients $a_0, \dots, a_k$ of a polynomial $P$ of degree $k$, and a plaintext value $y$, it is possible to compute $\mathrm{Enc}(P(y))$, i.e., an encryption of $P(y)$. Paillier's cryptosystem [37] exhibits such properties and achieves semantic security under chosen plaintext attacks under the decisional composite residuosity assumption (see, for instance, [29, Section 11.3]).
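To make the homomorphic evaluation of an encrypted polynomial concrete, here is a minimal sketch of textbook Paillier with generator g = n + 1, followed by Horner's rule carried out on ciphertexts. The function names and the two toy primes (2^31 - 1 and 2^61 - 1) are our own illustrative choices and are far too small for any real deployment; `pow(x, -1, n)` requires Python 3.8 or later.

```python
import math
import secrets


def keygen(p: int, q: int):
    """Toy Paillier key generation with g = n + 1 (p, q distinct primes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # since (1 + n)^lam = 1 + lam * n (mod n^2)
    return n, (lam, mu)


def enc(n: int, m: int) -> int:
    """Enc(m) = (1 + n)^m * r^n mod n^2 for a random unit r modulo n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + (m % n) * n) * pow(r, n, n2) % n2


def dec(n: int, sk, c: int) -> int:
    """Dec(c) = L(c^lam mod n^2) * mu mod n, with L(x) = (x - 1) / n."""
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def add(n: int, c1: int, c2: int) -> int:
    """Enc(m1) * Enc(m2) = Enc(m1 + m2)."""
    return c1 * c2 % (n * n)


def scalar_mul(n: int, c: int, k: int) -> int:
    """Enc(m)^k = Enc(k * m)."""
    return pow(c, k % n, n * n)


def eval_encrypted_poly(n: int, enc_coeffs, y: int) -> int:
    """Given [Enc(a_0), ..., Enc(a_k)] and plaintext y, return Enc(P(y)).

    Horner's rule on ciphertexts: acc <- acc^y * Enc(a_i), from a_k down to a_0.
    """
    acc = enc_coeffs[-1]
    for c in reversed(enc_coeffs[:-1]):
        acc = add(n, scalar_mul(n, acc, y), c)
    return acc
```

For instance, encrypting the coefficients of P(y) = 5 + 3y^2 and evaluating at y = 7 yields a ciphertext that decrypts to 152, without the evaluator ever seeing the coefficients.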
The proposed scheme is depicted in Figure 5, where Enc and Dec denote the Paillier encryption and decryption algorithms, respectively. As the client is assumed to be honest, he executes the key generation algorithm for Paillier encryption in the setup phase; in particular, two $\ell$-bit primes $p, q$ are fixed, with modulus $n = pq$. At this point, he specifies an encoding of U into $\mathbb{Z}_n \setminus \{0\}$ with $|U|/n$ being negligible. For the sake of readability, in Figure 5 elements of U are assumed to belong to $\mathbb{Z}_n \setminus \{0\}$ (cf. Remark 3 below for the exclusion of 0).
Theorem 5. The protocol given in Figure 5 is an SH-PSI protocol achieving correctness, client privacy, and server privacy, assuming that the Paillier encryption scheme is secure in the sense of IND-CPA.
Proof. In the sequel, we set the notation $I := |C \cap S|$ and $L := w - I$.

Correctness. The client's output is constructed by comparing his set C with the one consisting of $S \cap C$ plus the decryptions of $M - I$ encryptions of random values. Because $|U|/n$ is negligible, the probability that any of the latter actually encodes an element of U (thus disrupting the computation of the intersection) is negligible.

Server Privacy. In order to argue the existence of a probabilistic polynomial-time algorithm $C^*$ that is able to simulate the client's view on input C and $C \cap S$, we modify C's view by replacing the true input values from the server, constructed as encryptions involving values $s \in S \setminus C$, with encryptions of elements chosen uniformly and independently at random from $\mathbb{Z}_n \setminus \{0\}$. Consider thus the true distribution where, for $i = 0, \dots, M$, each $\rho_i$ denotes the random value involved in the Paillier encryption yielding $\mathrm{Enc}(a_i)$ (namely, these are values chosen uniformly and independently at random from $\mathbb{Z}_n^*$), and $\mathrm{Enc}(r_{s_j} P(s_j) + s_j)$ for $j = 1, \dots, w$ are computed as in the protocol description of Figure 5. Further, w.l.o.g., we assume that $C \cap S = \{s_1, \dots, s_I\}$, and the elements $\xi_1, \dots, \xi_{M-w}$ are chosen as encryptions of plaintexts chosen uniformly at random in $\mathbb{Z}_n$. Now, consider the distribution where $\alpha_j = \mathrm{Enc}(r_{s_j} P(s_j) + s_j)$ for $j = 1, \dots, I$, and the elements $\nu_1, \dots, \nu_L$ are encryptions of elements chosen uniformly and independently at random from $\mathbb{Z}_n \setminus \{0\}$. As above, $\xi_1, \dots, \xi_{M-w}$ are encryptions of elements chosen uniformly and independently at random from $\mathbb{Z}_n \setminus \{0\}$ as well. From the semantic security of Paillier encryption, it follows that these two distributions are computationally indistinguishable.
Remark 3. Having Paillier encryption in mind, we have designed the polynomial $P$ in the interaction to have a small support, i.e., to have many 0-coefficients (as encryptions of 0 with Paillier are cheap). That is the reason for excluding 0 from the domain when defining the encoding of U into $\mathbb{Z}_n$. Different refinements of this step may be more suitable if another encryption scheme is used, always ensuring that the resulting polynomial has no roots that may correspond to an encoding of an element outside C and yet in U. In [20], abstracting from the concrete homomorphic encryption scheme used, several modifications of their basic protocol are proposed in order to boost efficiency, all of them geared toward reducing the number of multiplications and exponentiations performed by the server. These ideas would also yield a significant speed-up over a naïve implementation of the above protocol.
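The message flow of the threshold-based protocol can be sketched as follows. Since Figure 5 is not reproduced here, several details are assumptions on our part and are flagged as such in the comments: the padded polynomial is taken to be x^(M-|C|) times the product of (x - c) over c in C, which has many zero coefficients and whose only extra root is 0, in line with Remark 3; the server's replies are shuffled; and all function names are hypothetical. Paillier is instantiated with g = n + 1 and two toy Mersenne primes that are far too small for real use.

```python
import math
import secrets

P_PRIME, Q_PRIME = 2**31 - 1, 2**61 - 1  # toy primes, illustration only
N = P_PRIME * Q_PRIME
N2 = N * N
LAM = (P_PRIME - 1) * (Q_PRIME - 1) // math.gcd(P_PRIME - 1, Q_PRIME - 1)
MU = pow(LAM, -1, N)  # requires Python 3.8+


def enc(m: int) -> int:
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            return (1 + (m % N) * N) * pow(r, N, N2) % N2


def dec(c: int) -> int:
    return (pow(c, LAM, N2) - 1) // N * MU % N


def padded_poly(client_set, M: int):
    """Coefficients a_0..a_M of x^(M-|C|) * prod_{c in C} (x - c) mod N.

    ASSUMPTION: this padding is one choice consistent with Remark 3.
    """
    coeffs = [1]
    for c in client_set:
        nxt = [0] * (len(coeffs) + 1)
        for i, a in enumerate(coeffs):
            nxt[i + 1] = (nxt[i + 1] + a) % N  # contribution of a * x
            nxt[i] = (nxt[i] - a * c) % N      # contribution of a * (-c)
        coeffs = nxt
    return [0] * (M - len(client_set)) + coeffs


def client_round1(client_set, M: int):
    """Always M + 1 ciphertexts, independently of |C|."""
    return [enc(a) for a in padded_poly(client_set, M)]


def server_round(enc_coeffs, server_set, M: int):
    """Enc(r_s * P(s) + s) for each s in S, padded to M ciphertexts."""
    out = []
    for s in server_set:
        acc = enc_coeffs[-1]
        for c in reversed(enc_coeffs[:-1]):    # Horner's rule: Enc(P(s))
            acc = pow(acc, s, N2) * c % N2
        r = secrets.randbelow(N - 1) + 1
        out.append(pow(acc, r, N2) * enc(s) % N2)
    # "Fake" encryptions of random non-zero plaintexts hide |S|.
    out += [enc(secrets.randbelow(N - 1) + 1) for _ in range(M - len(server_set))]
    secrets.SystemRandom().shuffle(out)        # ASSUMPTION: order is randomized
    return out


def client_output(client_set, responses):
    """Decrypt everything and keep the plaintexts that lie in C."""
    return set(client_set) & {dec(c) for c in responses}
```

Both messages have a length that depends only on M, so neither party learns the size of the other's set from the transcript length. A decrypted random value collides with an element of C only with probability roughly |C|/n.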

Algebraic PSI.
In this section we give a theoretical generalization of the polynomial-based construction depicted in the previous section. Let us assume we are in the same scenario as above, namely, there exists a known value M such that $|C|, |S| \le M$. We summarize some facts about algebraic curves and their function fields used in the sequel. Tools of this type were used by Chen and Cramer for secret sharing schemes and multi-party computation constructions in [7], which can be used as a more extensive reference.

Consider a smooth, projective, absolutely irreducible plane curve $\Upsilon$ defined over a prime field $K = \mathbb{F}_q$, and let $\bar{K}$ be the algebraic closure of $K$. Such a curve can be represented by a polynomial $F \in K[X, Y]$ that is irreducible over $\bar{K}[X, Y]$, where the affine part of the curve is the set of points $P \in \bar{K}^2$ such that $F(P) = 0$. We will only consider the rational points of the curve, that is, the points with coordinates in $K$. A divisor over $\Upsilon$ is a formal sum $D = \sum_{P \in \Upsilon} m_P \cdot P$, with only finitely many non-zero integer coefficients $m_P$. The sum of this finite number of integers is called the degree of the divisor and is denoted by $\deg D$. Let $K(\Upsilon)$ be the function field of the curve. Given a function $f \in K(\Upsilon)$, the divisor of $f$ is defined as $\mathrm{div}(f) = \sum_{P \in \Upsilon} \mathrm{mul}_f(P) \cdot P$, where $\mathrm{mul}_f(P)$ is the multiplicity of $f$ at $P$ (positive if $P$ is a zero of $f$, negative if it is a pole, and zero otherwise). The divisor of a function always has degree 0, that is, the sum of the multiplicities of the zeroes of $f$ equals the sum of the multiplicities of its poles.
The Riemann-Roch space associated to a divisor $D$ is the set
\[ L(D) = \{ f \in K(\Upsilon) \setminus \{0\} : \mathrm{div}(f) + D \ge 0 \} \cup \{0\}, \]
where the partial order relation between divisors is determined by a divisor being $\ge 0$ if all its coefficients are non-negative integers. $L(D)$ is a finite-dimensional vector space over $K$ and the Riemann-Roch theorem provides its dimension $\ell(D)$. In particular, if $\deg D \ge 2g - 1$ then $\ell(D) = \deg(D) - g + 1$, where $g$ is the genus of $\Upsilon$. A divisor is rational if it is invariant under the Galois group $\mathrm{Gal}(\bar{K}/K)$. This includes divisors whose support consists only of rational points. Moreover, if $D$ is rational then $L(D)$ has a basis over $K$. Therefore we will restrict to the subspace of $L(D)$ which is fixed under $\mathrm{Gal}(\bar{K}/K)$ and consists of $q^{\ell(D)}$ different functions. Additionally, $f(P) \in K$ if $P$ is rational.

Let us fix an algebraic curve $\Upsilon$ of genus $g$ as described above, denote by $R_\Upsilon$ the set of rational points of the curve, and fix $Q \in R_\Upsilon$. Choose $N = M + g - 1$, where $M \ge |C|, |S|, g$, and fix the divisor $D = N \cdot Q$. For our protocol to work we need the following [...]

Given a concrete curve $\Upsilon$, it is not obvious how realistic this assumption is. We believe this question to be of interest in its own right, beyond the cryptographic application depicted here, but with this assumption it is relatively straightforward to generalize the protocol from Section 4.2.2. By definition, the space $L(D)$ consists of functions with no poles other than $Q$ and such that the multiplicity of $Q$ as a pole is less than or equal to $N$.

Figure 6. Algebraic PSI construction for $|C|, |S| \le M$.

The public parameters for the scheme are params $= (\Upsilon, Q, M, B, H)$. In the setup phase, the client executes the key generation algorithm for a homomorphic IND-CPA secure encryption scheme defined over $K$ (see, e.g., [2]) and sends the public key to the server. The interaction phase is analogous to the protocol in Section 4.2.2. A description is provided in Figure 6. Note that given a point $s \in S$, from $\{\mathrm{Enc}(a_i)\}_{i=1}^{M}$, sent by the client, and $\{\mathrm{Enc}(f_i(s))\}_{i=1}^{M}$, which can be evaluated by the server, she can compute $\mathrm{Enc}(f(s))$.
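As a quick check on the parameter choice (a short derivation from the Riemann-Roch formula quoted above, under the stated condition $M \ge g$), the divisor $D = N \cdot Q$ with $N = M + g - 1$ falls in the range where the dimension formula applies:

```latex
\[
\deg D \;=\; N \;=\; M + g - 1 \;\ge\; 2g - 1
\quad\Longrightarrow\quad
\ell(D) \;=\; \deg D - g + 1 \;=\; M .
\]
```

The space of admissible functions therefore has dimension exactly $M$, which is consistent with the $M$ ciphertexts $\{\mathrm{Enc}(a_i)\}_{i=1}^{M}$ exchanged in the interaction. For $g = 0$ this count is one less than the $M + 1$ coefficients $a_0, \dots, a_M$ of the polynomial used in Section 4.2.2.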

Remark 4.
If there is no need to hide the size of the server's set, no "fake" encryptions need to be sent in the last round of the protocol (saving communication complexity) and also the condition |S| ≤ M can be dropped. With these two modifications we would obtain a protocol which is unbalanced size-hiding in the sense defined in the next section.
Remark 5. For the above construction we would need to be able to compute rational functions from certain divisors. Already, the problem of deciding whether a divisor $D$ of degree 0 is actually the divisor of a rational function seems to be highly nontrivial. If the curve $\Upsilon$ is defined over the field $\mathbb{C}$, that is, $\Upsilon$ is a Riemann surface, Abel's theorem (see [33], Chapter VIII, Section 2) provides a characterization: $D$ is the divisor of a meromorphic function on $\Upsilon$ if and only if $A_0(D) = 0$ in the Jacobian of $\Upsilon$, where $A_0$ is the Abel-Jacobi map.
For elliptic curves over C, this translates to the following result: D is the divisor of a meromorphic function if and only if D = 0 (when the formal sum which defines the divisor is seen as an actual sum in the elliptic curve). As far as we know, this problem has not been studied for algebraic curves over finite fields (we refer here to [34] as a complete introduction to this field).

Unbalanced size-hiding aided by a setup authority
In this section, we follow the spirit of [3] and provide unbalanced private set intersection protocols, i.e., protocols in which the client learns |S| from the interaction, while keeping |C| secret. We follow the definitions of correctness, client privacy, and server privacy from [3]. We propose two different versions of essentially the same protocol, which is based on the ideas in [25], where a function chosen from a PRF family is obliviously evaluated by the server on each of the client's elements (as roughly described in Section 1). In our constructions, the function evaluation is delegated to the setup authority SA in the setup phase. Depending on the degree of trust the participants put in SA, the function evaluation is carried out in a different way, leading to the two versions of our protocol. Namely, if SA is fully trusted, each participant sends its input set to SA for direct function evaluation (in the sequel we will refer to the protocol using this setup as the PRF-PSI-protocol, see Figure 7). If, instead, participants do not wish to reveal their sets, a two-party protocol between SA and each participant P is run, resulting in SA obliviously evaluating the chosen function on P's inputs (yielding our so-called OPRF-PSI-protocol, see Figure 8).

Later on, in the interaction phase (where SA is no longer available), the two participants (playing the roles of client C and server S, respectively) engage in a one-round protocol where S sends the values she obtained from SA to C, who, by comparison with his own evaluated values, obtains the intersection. Note that USH is trivially achieved, as the client sends no message to the server.
It is worth mentioning that our PRF protocol is conceptually very close to the protocol by Hazay et al. [24], where SA is implemented by a trusted smart card which is first used by the server to compute the evaluations of his items and then sent to the client to do the same. When we originally proposed this protocol in [16] we were not aware of the existence of this other protocol.
Remark 6. Note that for correctness of our protocols, the range of the functions from the family $\{f_r\}_{r \in K}$ should be chosen such that the probability of collision for every $f_r$ is negligible in the security parameter.

Theorem 6. The protocol given in Figure 7 is a USH-PSI achieving correctness, client privacy, and server privacy under the assumption that $\{f_r\}_{r \in K}$ is a PRF family.

Proof. Correctness. Every $c_j \in C \cap S$ is output by the protocol. An element $c_j \notin C \cap S$ is output if and only if $f_r(c_j) = f_r(s_i)$ with $c_j \ne s_i$, and this happens only with negligible probability if $\{f_r\}_{r \in K}$ is chosen as in Remark 6.

Client Privacy. This is trivially achieved as the client sends nothing to the server.

Server Privacy. This follows from the pseudorandomness property of the function family $\{f_r\}_{r \in K}$. Here, a probabilistic polynomial-time algorithm $C^*$ can simulate the client's view on input $C$, $C \cap S$, and $|S|$, by constructing a sequence $\{R^*_1, \dots, R^*_w\}$ so that, for each $u_i \in C \cap S$, a corresponding $R^*_i$ is defined as $f_r(u_i)$, while the rest are values chosen independently and uniformly at random in the range of $f_r$.

Now let us briefly describe our OPRF protocol. The setup phase is depicted in Figure 8, while the interaction is the same as in Figure 7. The only difference between the two protocols is that in the OPRF one SA does not learn anything about the participant's set (beyond its size) as long as the OPRF two-party protocol used in the setup is secure. Therefore we obtain a security result for this protocol similar to Theorem 6.

Theorem 7. The protocol with setup given in Figure 8 and interaction from Figure 7 is a USH-PSI achieving correctness, client privacy, and server privacy under the assumption that $\{f_r\}_{r \in K}$ is a PRF family.
Proof. Same as Theorem 6.
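The PRF-PSI flow with a fully trusted setup authority can be illustrated with a short sketch. As an assumption made purely for convenience, the PRF is instantiated with HMAC-SHA256 from the Python standard library rather than with a block cipher; the class and function names are hypothetical and do not come from Figure 7. Sorting the server's message is also our choice, so that the order of the tags does not reflect the order of S.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Iterable, List, Set


def prf(key: bytes, element: str) -> bytes:
    """f_r(x), instantiated here (as an assumption) with HMAC-SHA256."""
    return hmac.new(key, element.encode("utf-8"), hashlib.sha256).digest()


class SetupAuthority:
    """Fully trusted SA: holds the PRF key r and evaluates f_r on request."""

    def __init__(self) -> None:
        self._key = secrets.token_bytes(32)

    def evaluate(self, elements: Iterable[str]) -> Dict[str, bytes]:
        # Setup phase: a participant hands its set to SA and gets back f_r(x).
        return {x: prf(self._key, x) for x in elements}


def server_message(server_tags: Dict[str, bytes]) -> List[bytes]:
    """Interaction phase: S sends only the PRF values (w = |S| of them)."""
    return sorted(server_tags.values())


def client_intersection(client_tags: Dict[str, bytes],
                        message: List[bytes]) -> Set[str]:
    """C keeps the elements whose PRF value appears in the server's message."""
    received = set(message)
    return {x for x, tag in client_tags.items() if tag in received}
```

The client sends nothing in the interaction phase, so |C| is never exposed to the server, whereas the length of the server's message reveals |S| to the client.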
In order to implement the PRF-PSI-protocol in a very efficient way, the PRF family could be instantiated by using a block cipher like AES. For the OPRF-PSI-protocol, the efficient proposal from [35] can be used. Let us briefly describe this pseudorandom function family: let $G$ be a cyclic group of prime order $q$ and $g$ a generator of $G$ for which the DDH assumption holds. We will denote by $K = (\mathbb{Z}_q)^n$ a set of keys for a function family. Elements in $K$ have the form $r = (r_1, \dots, r_n)$. The family of functions $\{f_r\}_{r \in K}$, defined as follows, is proven to be pseudorandom under the DDH assumption (see [35] for details):
\[ f_r(x) = g^{\prod_{i=1}^{n} r_i^{x_i}}, \qquad x = (x_1, \dots, x_n) \in \{0, 1\}^n . \]
Therefore, we will additionally need an encoding of the ground set U into the set $\{0, 1\}^n$ for big enough $n$. The protocol proposed in [19] can be used to evaluate $f_r$ in an oblivious way, suitable for the protocol in Figure 8.
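A direct (non-oblivious) evaluation of such a DDH-based function can be sketched as follows. The sketch assumes the product-in-the-exponent form, in which f_r(x) equals g raised to the product of those r_i whose input bit x_i is 1; this form is our reading of the family in [35] and should be checked against that reference. The group is the order-1019 subgroup of squares modulo the safe prime 2039, a toy choice with no security, and `encode` is a hypothetical fixed-width binary encoding of element indices.

```python
from typing import List, Sequence

P = 2039  # toy safe prime, P = 2 * Q + 1 (illustration only)
Q = 1019  # prime order of the subgroup of squares modulo P
G = 4     # 4 = 2^2 is a square different from 1, hence a generator of that subgroup


def encode(index: int, n: int) -> List[int]:
    """Hypothetical encoding of the index of an element of U into {0,1}^n."""
    if not 0 <= index < 2 ** n:
        raise ValueError("index does not fit in n bits")
    return [(index >> i) & 1 for i in range(n)]


def nr_prf(key: Sequence[int], x_bits: Sequence[int]) -> int:
    """f_r(x) = G^(prod_{i : x_i = 1} r_i mod Q) mod P  (ASSUMED form)."""
    if len(key) != len(x_bits):
        raise ValueError("key and input must have the same length n")
    exponent = 1
    for r_i, x_i in zip(key, x_bits):
        if x_i:
            exponent = exponent * r_i % Q  # exponents live in Z_Q
    return pow(G, exponent, P)
```

In the OPRF setting the holder of the key would not see `x_bits`; the oblivious evaluation itself is outside the scope of this sketch.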
Finally, let us compare our two protocols with [25]. Our main motivation for introducing our protocols is to obtain the size-hiding property, which [25] does not provide (at the expense of introducing a third party, the SA). For the OPRF protocol, the efficiency and communication complexity remain the same as those in [25], as the same messages and computations are made; the only difference is that they are split between the server and the SA. When we consider the PRF protocol, the communication and the number of function evaluations are again the same, but in this case these can be direct evaluations (they do not need to be oblivious). The performance of this protocol is the same as that of the one from [24].

Conclusion
We have explored the private set intersection problem when hiding the sizes of the input sets is relevant. For SH-PSI, we provide a (conceptual) protocol in the unconditional setting (with the help of a trusted party in a setup phase). Furthermore, we proved that PSI is impossible in the unconditional setting, making explicit its relation to AND. In addition, in the computational scenario, we have given both a theoretical construction (only applicable for polynomial-size universes) and a practical one, which is a simple twist of the well-known polynomial scheme of Freedman et al. [20].
For USH-PSI we have proposed two practical and efficient one-round protocols based on the ideas from [25]. The additional size-hiding property is achieved thanks to the introduction of a trusted (or semi-trusted) setup authority which interacts independently once with each participant in a setup phase and whose intervention is not needed any more for later PSI computations between the participants.