Modelling dynamic network evolution as a Pitman-Yor process

Dynamic interaction networks frequently arise in biology, communications technology and the social sciences, representing, for example, neuronal connectivity in the brain, internet connections between computers and human interactions within social networks. The evolution and strengthening of the links in such networks can be observed through sequences of connection events occurring between network nodes over time. In some of these applications, the identity and size of the network may be unknown a priori and may change over time. In this article, a model for the evolution of dynamic networks based on the Pitman-Yor process is proposed. This model explicitly admits power-laws in the number of connections on each edge, often present in real world networks, and, for careful choices of the parameters, power-laws for the degree distribution of the nodes. A novel empirical method for the estimation of the hyperparameters of the Pitman-Yor process is proposed, and some necessary corrections for uniform discrete base distributions are carefully addressed. The methodology is tested on synthetic data and in an anomaly detection study on the enterprise computer network of the Los Alamos National Laboratory, and successfully detects connections from a red-team penetration test.


Introduction.
A network can be represented as a directed graph G = (V, E), where V is a set of nodes and E ⊆ V × V is a set of links, or edges, indicating the pairs of nodes which have interacted. Statistical models for networks are well studied and understood in the literature [8]. Less attention is devoted to dynamic networks. In dynamic networks, a stochastic process is observed on V × V . Graphs having this structure are commonly observed in the social network literature; for example, emails sent between employees within an enterprise network, or messages between users on Facebook. The motivating application for this article is in cybersecurity, where the nodes are computers or Internet Protocol (IP) addresses and the objective is to detect anomalous, potentially compromised nodes. Networkwide modelling of large computer networks typically requires a trade-off between mathematical complexity and computational feasibility, which is addressed in this article by proposing a simple model for network evolution. The main advantage of the proposed model over other approaches to dynamic network modelling suggested in the literature [21,17] is that the model is simpler, and allows to directly obtain a predictive probability for a link.
Computer network graphs have directed edges, since every connection is initiated by a source node making a request to another destination node, so (x, y) ∈ E =⇒ (y, x) ∈ E. Let V S ⊆ V be the subset of nodes which are ever observed to initiate a connection, and V D ⊆ V the set of nodes arising as a destination of one or more connections. This article examines a joint network-wide generative model for the sequence of links (x 1 , y 1 ), (x 2 , y 2 ), . . . ∈ V S × V D , based on the Bayesian nonparametric Pitman-Yor process [23,12]. The Pitman-Yor process, also known as the two-parameter Poisson-Dirichlet process, provides a natural extension of the Dirichlet process [6] where the probability of observing infrequently observed events can be further discounted. The Pitman-Yor process has been successfully used in natural language processing and computational linguistics applications [9,29] to appropriately model the frequency of words in languages, which often exhibit powerlaw behaviour [2]. The power-law is explicitly modelled by the interplay between a strength parameter α and a discount parameter d ∈ [0, 1).
Statistical modelling of categorical sequences is a well established branch of statistics and natural language processing, with relevant applications in speech recognition. Common methods mainly include n-gram models, based on n-order Markov assumptions (see the survey of [13]), neural models, in particular recurrent neural networks [3,18], maximum entropy or exponential models [25] and positional models [16]. The model presented in this article is much simpler, but well-suited for an efficient network-wide implementation on networks with large numbers of nodes and high frequency activity.
In the present article, the Pitman-Yor is shown to have interesting properties for suitably modelling events within real-world networks. The article also proposes an empirical procedure for estimation of the parameters within a "big data" framework. Furthermore, approximations and corrections for parameter estimation in Pitman-Yor processes with uniform discrete base distributions are also carefully addressed.

2.
Modelling sequences using the Pitman-Yor process. Consider an infinitely exchangeable sequence of nodes x 1 , x 2 , . . . ∈ V . After observing n events x n = (x 1 , . . . , x n ), let (x 1 , . . . , x Kn ) be the K n unique observations in x n . Let F 0 be a base probability distribution on V . Then the distribution of observed nodes is a Pitman-Yor process PY(α, d, F 0 ), if the predictive distribution for the next observed node is is the number of occurrences of the value x j in x n . Note that d = 0 corresponds to the more familiar Dirichlet process, PY(α, 0, F 0 ) ≡ DP(α, F 0 ).
The dynamics of the process can be most easily understood using the Chinese Restaurant metaphor [1], which will be extensively used in this article. Starting from an empty restaurant with a potentially infinite number of tables, the first customer sits at the first table and then for n = 1, 2, . . . the (n + 1)-th customer either sits at a new table or chooses an already occupied table with the following  probabilities:   P(new table) = α + dK n α + n , P(j-th table) = N jn − d α + n , Figure 1. Cartoon example of the Chinese Restaurant metaphor for the Pitman-Yor process, where C j represents the jth customer and x j is the unique dish served at table T j . The vector x n = (x 1 , . . . , x n ) denotes the observed sequence of dishes eaten by each customer. Customer C i being seated at table T j is denoted C i → T j . Then, for example, the tenth customer will sit at T 1 with probability (3 − d)/(α + 9).
for j = 1, . . . , K n . In this metaphor, N jn is the number of customers sitting at the j-th table when n customers are in the restaurant. If each table is associated with a random dish drawn from a non-atomic base distribution F 0 , the sequence x 1 , x 2 , . . . corresponding to the dishes eaten by successive customers is an exchangeable stochastic process with De Finetti measure PY(α, d, F 0 ). Figure 1 presents a cartoon representation of this metaphor for the Pitman-Yor process. In this article, for an observed sequence of links (x 1 , y 1 ), . . . , (x N , y N ) ∈ V S × V D , it is assumed that the joint distribution has the following hierarchical structure: where {F x|y } and G are unknown probability mass functions on the node set V . The probability of obtaining a link (x, y) is decomposed in two parts: p(x, y) = p(x|y)p(y). The sequence of destination nodes y 1 , . . . , y N ∈ V D is assumed to be exchangeable with a Pitman-Yor hierarchical distribution. Similarly, conditional on the destination node y ∈ V D , it is assumed that the sequence of source nodes x 1 , x 2 , . . . ∈ V S connecting to y is again exchangeable with a Pitman-Yor hierarchical distribution, with parameters depending on the specific value of y. For the application to computer networks, the decomposition p(x, y) = p(x|y)p(y) is preferred over p(x, y) = p(y|x)p(x): for identifying red-team behaviour, anomalous client computers must be selected, hence each event is first given a score measuring how surprising the server found the connection from the corresponding client, and then all such measures are combined for a given client. Note that care is needed when the base distribution of the process is atomic. In this case, using the Chinese Restaurant metaphor, it is possible that two or more tables serve the same dish, which means that the same draw from the base distribution is associated with multiple tables. Hence, the predictive probability (1) cannot be identified, since the underlying K n , and consequently N jn , are not determinable from x n . In contrast, with continuous base distributions, the draws are distinct with probability 1. Attention in the literature is mostly devoted to non-atomic base distributions. For the case of atomic F 0 , [4] suggest introducing latent multiplicities and table indicators, corresponding to the number of tables contributing to the total number of times a value x j is observed. An efficient sampler for this representation is derived in [5]. In this article, the node set will be assumed to be countable, and simple adaptation techniques will be used in the estimation of the parameters to take this problem into account.
3. Empirical estimation of the hyperparameters. Consider a Pitman-Yor process with non-atomic base distribution and 0 ≤ d < 1, and let K n be the number of occupied tables after n customers have entered the restaurant. [22] shows where the subscripts represent the Pochhammer symbol. For large n, using Stirling's approximation, n d ≈ Γ(n + α + d)/Γ(n + α), this expectation can be approximated as Similarly, the expectation of the number of tables of size m after observing n customers enter, H mn , can be approximated for large n [22,31] as: Therefore, after N observations, simple method of moments or empirical Bayes estimates (α,d) of the hyperparameters α and d are obtained by solving the following non-linear system of equations: A further estimate could be based on an alternative approximation of E(K n ), often used in the literature [22,31], obtained by noting that lim n→∞ n d = ∞ when d > 0. Therefore, when N is large, the value of Γ(1 + α)n d /dΓ(d + α) will dominate α/d in (3), and henced The estimate (6) is particularly illuminating: asymptotically, d is the long term ratio between the tables with just one customer and the total number of tables. Under this approximation, ifd = 0 then the process can be approximated by a Dirichlet process, and from (3),α = K N log(N ) ; otherwise, from (5)α can be obtained numerically as a solution to the equation 4. A correction for discrete uniform base distributions. Under the Pitman-Yor generative process, only the sequence of ordered dishes x n = (x 1 , . . . , x n ) is observed. For discrete base distriubtions, the number of unique dishes,K n , observed in x n provides only a lower bound for the number of occupied tables, K n , since the same dish may be drawn from the base distribution multiple times. Similarly, the number of of dishes eaten by only one customer,H 1n , provides only a lower bound for the number of single-customer tables, H 1n . Assuming a uniform discrete base distribution on a sufficiently large set of nodes V , the approximations in (3) and (4) might be acceptable, but in this particular scenario it is also possible to use a simple correction. Conditioning on the true but unobserved number of draws from the base distribution (equivalently, the number of tables) K n , a uniform base distribution implies that the expectation of the number of unique drawsK n can be obtained as a generalisation of the birthday problem with |V | days in a year: Substituting E(K n |K n ) withK n and solving for K n yields the approximation Similarly, and hence the approximation The performance of (8) and (9) as rival estimates toK n andH 1n will be assessed via simulations.

5.
Power-laws and Pitman-Yor dynamic graphs. In this section, some properties of the graph generated from distinct Pitman-Yor processes on each destination node are analysed. After an observation period [0, T ), it is possible to construct an adjacency matrix A ∈ {0, 1} |V S |×|V D | , where A ij = 1 if the source node x i connected to the destination node y j at least once, and 0 otherwise. The row and column sums of A correspond to the out-degree and in-degree sequences: Real world graphs are commonly characterised by power-law degree distributions: [19]. The indegree and out-degree distributions in a graph generated from a Pitman-Yor process for each destination node can be controlled by the discount and strength parameters and the base distribution.
First, note that κ in j =K Nj , where N j is the total number of connections to the destination node y j . Conditional on the parameters α and d of the process, and for a sufficiently large number of atoms, , which holds for a suitable choice of α 0 and d 0 in (2), then The out-degree distribution is slightly more difficult to control: it is highly dependent on the choice of the base distribution F 0 . For a discrete uniform base distribution over V S : which is not a power law. In order to generate a power law distribution, one can set a non-uniform discrete base distribution and a similar formula can be derived for E(κ in j |K Nj , {θ i }). Hence, the distribution of κ out i strongly depends on the interplay between N j , which contributes to K Nj , and θ i , which means that it is controlled by α 0 , d 0 and G 0 in (2).
6. Calculating p-values for anomaly detection. For any given destination node y ∈ V D , a p-value can be computed for each observed source node x n+1 , given the history x n . From the posterior predictive (1), the p-value p n+1 associated with the (n + 1)-th connection to the destination node y is The p-values (10) are discrete and stochastically larger than standard uniform random variables. For this reason, it can be preferable to consider mid p-values [15]. Defining the stochastically smaller quantity where the indicator function instead acts on the half-open interval [0, p(x n+1 |x n )) and thus sums all possibilities strictly less probable than the event observed, the mid p-value is given by In many problems, mid-p-values have been shown to outperform standard p-values, in particular in anomaly detection procedures in computer networks [26]. Similarly, the p-value for the joint event (x n+1 , y n+1 ), conditional on (x n , y n ), is where p(x n+1 , y n+1 |x n , y n ) = p(x n+1 |y n+1 , x n )p(y n+1 |y n ) by conditional independence assumptions. The calculation of (11) is intractable, but the p-value could be suitably approximated using a p-value combiner. Given a sequence of p-values p 1 , p 2 , . . . , p , common choices for p-value combiners are [11]: • Fisher's method [7]: • Stouffer's method [28]: is the inverse cumulative distribution function of a standard normal distribution. For approximation of the p-value in (11), = 2, with p 1 and p 2 the corresponding p-values for y n+1 and x n+1 |y n+1 .

7.
Applications and results. The proposed model and estimation procedure have been validated and tested on synthetic networks. Furthermore, the goodnessof-fit of the Pitman-Yor process has been assessed on a real-world network released by the Los Alamos National Laboratory [14].
7.1. Estimation of the Pitman-Yor hyperparameters. The different techniques proposed for the estimation of the Pitman-Yor hyperparameters have been extensively compared using a simulation study, constructed as follows: S = 10,000 sequences of table allocations, each of length N = 100,000, have been simulated using the Chinese Restaurant metaphor of the Pitman-Yor process, setting α = 7 and d = 0.25. For each of the S sequences, the tables have been assigned a label drawn at random, with replacement, from a discrete uniform distribution with |V | = 1,000 and |V | = 16,230 atoms (corresponding to the number of source machines in the Los Alamos National Laboratory enterprise network). Matching the sequence of table allocations with the table labels gives a sample sequence from a Pitman-Yor process with uniform discrete base distribution F 0 with |V | atoms. From the resulting sequence, the values ofK N andH 1N are calculated, and their corrected counterparts in (8) and (9),K N andĤ 1N . The values are compared with the true K N and H 1N , available from the simulated table allocation. The parameter estimatesα andd obtained using the pair (6) and (7) are referred to as Method 1 and (5) as Method 2; both are computed using either the uncorrected estimates (H 1N ,K N ) or the corrected pair (Ĥ 1N ,K N ), producing four different estimates of the parameters. Kernel density estimates across the simulations for each parameter and each estimation method are plotted in Figures 2 and 3. The kernel bandwidth choice is based on Silverman's rule of thumb [27].
In Figure 2 and 3, the plots (a) and (b) show that parameter estimation based on the most popular approximation for E(K n ) in the literature, Method 1, gives biased results. On the other hand, when Method 2 is used to estimate the pair (α, d), and the corrections (8) and (9) are used, the results are excellent, and the proposed estimation procedure on average correctly recovers the exact values of the parameters used to simulate the data. When the correction is not applied, the performance of Method 2 is still fairly reliable for α, but performs poorly for estimating d.  observed valueK N is on average close to the true K N . This does not happen for |V | = 1,000, and the approximation becomes crucial.
Overall, the simulation confirms the reliability of the parameter estimation procedure based on (5) for a Pitman-Yor process. The performance of the proposed procedure is far superior to the method based on the estimates (6) and (7), which are obtained from the most common approximation of E(K n ) used in the literature. Furthermore, for discrete uniform base distributions, the corrections proposed in equations (8) and (9) are shown to be highly beneficial for parameter estimation, especially when the number of atoms in the base distribution |V | is not large. 7.2. Description of the LANL authentication data. The proposed model and estimation procedure have been applied to the user-authentication data [14] released by the Los Alamos National Laboratory 1 . An example entry is: 1,C567$@DOM1,C567$@DOM1,C574,C988,Kerberos,...
The entries of interest in the data-line above are the source computer C574 and destination computer C988, and the arrival time 1 of the event. In total, the data contain 1,051,430,459 events, involving 16,230 source computers and 15,895 destination computers, for a total of 17,684 unique machines and 419,744 unique observed edges. Interestingly, the data also contain 48,079 records labelled as redteam events, resulting from a simulated intrusion. The compromised source nodes are C17693, C18025, C19932 and C22409. A reliable model must be able to identify unusual activity associated with those source nodes and therefore associate a high anomaly score. 7.3. Empirical assessment of the goodness-of-fit. In this section, the empirical estimation procedure of the hyperparameters and goodness-of-fit of the Pitman-Yor process is evaluated on selected nodes in the Los Alamos National Laboratory network. In order to assess the model fit, it is possible to examine the Q-Q plot of the p-values, which are approximately uniformly distributed in (0, 1) under a correct model specification. Examples of the Q-Q plots observed for six destination nodes are plotted in Figure 4, obtained using the estimate (5) and the corrections (8) and (9). The base distribution F 0 of the Pitman-Yor process was chosen to be uniform on the set of source computers V S : F 0 (x) = 1/16,230 × 1 V S (x). The estimated values of the parameters of the Pitman-Yor process, and additional summary statistics, are reported in Table 1. The Q-Q plots show, for some of the nodes, an excellent fit on real data under the strong assumptions made in this modelling framework (see plots for C5716, U7, C2525, and C1877). On the other hand, in some other cases (see plots for C395 and C423) the distribution of the p-values is not uniform, but follows the theoretical distribution only on the left tail.
It is also possible to compare the performance of the method for different choices of α and d. A destination node C1438 is used as an example. The computer C1438 appears as destination in N = 136,743 connections, with K N = 394 unique edges 1 The data set is available online at https://csr.lanl.gov/data/cyber1/.   (8) and (9). The three plots in Figure 5 present the sequential values ofK n andH 1n and their ratioH 1n /K n , corresponding corrected estimates, and average sample paths obtained from Pitman-Yor processes with a number of different values of the parameters. It is immediately clear from the plots that the only sample path which correctly captures the behaviour ofK n ,H 1n andH 1n /K n simultaneously is the Pitman-Yor process with estimatesα andd obtained using Method 2, with or without correction for discrete F 0 . The fit is excellent, and the observed trajectories almost coincides with the average estimates obtained from multiple simulations of the process. The Dirichlet process fails to trackK n andH 1n individually, and gives a slightly better performance when modelling the ratioH 1n /K n . The Pitman-Yor process with estimates obtained from Method 1 only marginally tracks the data, reaching approximately the observed valuesK N andH 1N only at the end of the process (N = 136,743). On the other hand, the ratio is not modelled in a satisfactory way.
Overall, the Pitman-Yor fit is far superior to the Dirichlet process, even without an optimal choice of the parameters, showing that in this case adding the discount parameter d is beneficial. The simulation empirically confirms that the Pitman-Yor process seems to be a suitable choice, but the parameters should be estimated carefully.
7.4. Network-wide anomaly detection. Let us suppose that p 1 , . . . , p N are the p-values computed for the N events observed on a given edge (x, y) ∈ E, and q 1 , . . . , q N are the corresponding mid-p-values. Those p-values can be obtained from the sequence y n+1 |y n , or from x n+1 |y n+1 , x n , or from a combination of the  Given the sequence of p-values, it is possible to use combine the scores into a single distribution for each edge. [10] suggests to use the minimum p-value method, or Tippett's method. Note that in this framework the distributional result is only approximate, because of the discreteness of the p-values. Then, the lower tail area of the Beta(1, N ) distribution evaluated at min{p 1 , . . . , p N } gives the p-value p xy associated with the edge (x, y).
Following [10], we define as E x set of edges in the network graph with source node x on which connections have been observed: E x = {(x, y) : y ∈ V D ∩ (x, y) ∈ E}. The p-values p xy on each edge can be combined into a single score s x for each source node using one of the combiners previously presented. For example, using the Pearson's combiner, under a normal behaviour of the network, the theoretical distribution of s x is s x d ∼ χ 2 2|Ex| . Therefore, a p-value p x for the source computer x is given by the left tail probability of the χ 2 2|Ex| distribution, given the observed s x . From the list {p x , x ∈ V S }, where V S is the set of source nodes, it is then possible to identify the most anomalous source computers by ranking the p-values. A similar procedure can be carried out sequentially using the mid-p-values q 1 , . . . , q N , or different combiners at the event, edge or node level. The procedure is fully parallelisable and particularly suitable for implementation on standard platforms for Big Data analysis like Hadoop MapReduce, and has been applied to the Los Alamos National Laboratory data.
For each destination computer y ∈ V D , α y and d y are estimated using (5) and the corrections (8) and (9), and similarly for α 0 and d 0 . Table 2 reports the results obtained using the minimum p-value method at the edge level and a number of different combiners at the event and node level.
Two of the compromised computers are consistently ranked among the top-1% most anomalous nodes. In particular, C17693 is sometimes ranked as most anomalous machine overall. For the two remaining nodes, associated with more subtle activity within the network, impressive improvements are achieved when using mid-p-values, but the malicious activity on these nodes is still difficult to detect. Overall, the results confirm the conclusions in [26] about the improved detection performance when using mid-p-values. This is particularly evident in the case of the two least anomalous compromised machines: using mid-p-values, the ranking improves significantly. The activity associated with C17693 is anomalous even when only the p-values associated with the sequence of destination nodes y 1 , y 2 , . . . is considered. For the other three machines, the rankings significantly improve when the edge activity from the p-values for x n+1 |y n+1 is used for computing the anomaly scores. Using the combined scores at the event level does not significantly improve the results, showing that most of the malicious activity might be due to edge-related anomalies.
8. Conclusion. In this article, a model for events in dynamic networks is presented. The sequence of edges is modelled by using the Pitman-Yor process, a two-parameter extension of the Dirichlet process. Empirical methods for choosing the hyperparameters, based on large sample results, have been proposed and tested on the Los Alamos National Laboratory authentication data, showing excellent fit on the activity of real destination computers. Furthermore, corrections for discrete base distributions are carefully addressed. The Pitman-Yor process allows for more flexible and accurate modelling of the tails of the predictive distribution, which also implies more accurate modelling of the new links in the network. The model has been tested on synthetic data and applied to the Los Alamos National Laboratory authentication dataset, showing good performance in a network-wide anomaly detection study.
The Pitman-Yor model might be used as a building block for more complex models. For example, in computer networks, the arrival time of each event is also available, and suitable models for arrival times on each edge have been proposed in the literature [24]. The overall performance of the method is impressive, since only two pieces of information are used in this article: the source and destination node. The methodology can be potentially extended to consider any additional information which is usually available for each event: for example, in computer network data, authentication type and logon type.