Unsupervised robust nonparametric learning of hidden community properties

We consider learning of fundamental properties of communities in large noisy networks, in the prototypical situation where the nodes or users are split into two classes, e.g., according to their opinions or preferences on a topic. We propose a nonparametric, unsupervised, and scalable graph scan procedure that is, in addition, robust against a class of powerful adversaries. In our setup, one of the communities can fall under the influence of a strong and knowledgeable adversarial leader, who knows the full network structure, has unlimited computational resources and can completely foresee our planned actions on the network. We prove strong consistency of our results in a setup with minimal assumptions. In particular, the learning procedure estimates the baseline activity of normal users asymptotically correctly with probability 1; the only assumption being the existence of a single implicit community of asymptotically negligible logarithmic size. We provide experiments on real and synthetic data to illustrate the performance of our method, including examples with adversaries.


Introduction
We develop robust and scalable methods to uncover global properties of communities hidden in large noisy networks. Consider the fundamental situation where the nodes or users in the network are split into two classes according to their opinion or preferences on a specific topic. Examples include support of a particular candidate in elections [1], a level of interest in a particular topic, or a degree of support of a certain statement. We call these two classes the "active" and "inactive" users, respectively.
Motivated by real-world settings, we assume that the network of interest is too large to be processed manually, especially for each possible topic of interest. Therefore, activity observations of users are determined and delivered to us by a third-party algorithm called the crawler. Naturally, the crawler has its own classification and learning errors that are not known to us. We therefore treat a general nonparametric case of the crawler error probabilities. Our goal is to learn global properties of communities of active and inactive users despite such noise and errors, in an unsupervised way, while additionally being robust to a strong adversary.

Distinctive features of our framework
We treat the setup where active users can fall under the influence of a strong adversary who is capable of directly altering their activity values, pursuing the goal of spoiling our uncovering of communities. The adversary knows the true values on active vertices and which vertices are inactive, knows the full graph structure, and has unlimited computational power and memory. Moreover, the adversary has a special deal with hackers and can completely foresee the actions we will perform on the network, including the outcomes of all our randomized procedures, in case we choose to use any random or pseudo-random number generator.
The only limitation of the adversary is that it influences the active vertices only once, before we run our inference, and cannot act at any later steps. This assumption is not restrictive, as we are working on a static network (even if it is a particular observation of a dynamic network or of a random network); as the adversary knows everything we would do on the network in advance, its single action can be a combined response incorporating a series of individual responses to any sequence of our inferences.
For clarity, it is assumed that the adversary does not know the outcomes of the crawling algorithm. The adversary can only command its supporters, and therefore does not influence inactive vertices. This is a natural assumption that is met, for example, by political parties and their leaders, or by communities of bloggers under the influence of a particular opinion leader.
It is also assumed that the adversary cannot completely tone down the active nodes, so that a separation between (active) supporters and (inactive) non-supporters is maintained, even though the separation rule is not known. This last assumption is met in many applications such as, for example, analysis of the blogosphere [1], as the crawler forms its evaluations using the history of users' activities; therefore, past activities of the user are accounted for and do not let the user completely mask his activity level.
As to our noise model, notably, we do not assume binary noise (often used for community recovery), do not assume Gaussian noise (engineering literature, and most of the statistical literature on community detection and recovery), and do not even ask for the noise distribution to be known (so that community detection in this setup would be a composite nonparametric hypothesis testing problem, for example). Our noise model allows for observations where the active vertices need not form a completely, or even pairwise, independent collection of random variables; the model also permits strong and long-distance correlations between observations on active vertices.
Finally, learning the actual performance of the crawler can also be addressed as a by-product of the results of this paper. We note that our results are, of course, valid in their present form in the special case when there is no adversarial action.
In summary, our key contributions are:
- A robust scalable graph k-NN scan estimator (§3). The k-NN scans presented here generalize sliding windows, moving averages, and scan statistics, extending them to the case of general graphs. Our probabilistic analysis of the estimators is related to extreme order statistics for dependent random variables.
- Sufficient conditions for k-NN graph scan estimators to be consistent for learning global properties of communities in an adversarial framework. This is a remarkable property of these local scans, as it can be seen that most existing methods can easily be spoiled either by such a strong adversary or by the sheer dimensionality of the very general nonparametric framework that we are considering.
- Discussion of aspects of our estimator's computation that allow processing of large-scale graphs, in particular via a highly decentralized implementation, allowing scalable distributed and parallel graph processing.
Our estimator is a nonparametric yet scalable method for learning properties of communities in noisy graphs. Moreover, it operates under minimal assumptions about the structure of the network and the communities. Unlike many existing methods, it does not require graph sparsity or the presence of hidden highly connected communities. Moreover, we consider general graphs without limiting attention to lattices, Erdős-Rényi random graphs, or preferential attachment structures. The only condition that we impose on the network is that it has at least one compact locally connected community of a size that grows at least as a logarithm of the total number of vertices, and therefore is asymptotically negligible in the large-scale setting. Therefore, we believe that these results are foundational to fast consistent algorithms for community detection via percolation on general graphs. For the special case of lattices, the automated detection theorem [13] serves as an illustration of this approach.

Related work
Scan statistics have long been used for detection of unexpected events and for nonparametric estimation. The initial idea and the first development of the underlying theory, for the one-dimensional discrete case, goes back at least to [20], who studied longest runs of successes in Bernoulli sequences. In the two-dimensional case, a surge of interest in continuous scan statistics was sparked by [6]. This and related types of Euclidean spatial scan statistics found numerous applications in geostatistics, medicine, epidemiology and ecological studies; see examples in [19], [7,8], [16]. The scan statistic methodology is compatible with the Bayesian paradigm as well [15]. Notably, despite substantial efforts, most of the work in this area is based on heuristics and experimentation rather than on rigorous probabilistic analysis.
Recently, a new line of research emerged where discrete versions of multidimensional scan statistics were applied to discrete structures such as pixelized images or lattices [10], [9], [11], [2]. The idea of discrete scans proved useful in application areas like anomaly detection and automated detection of unknown objects in extremely noisy images [13], [12]. Surprisingly, discrete scan statistics were rigorously analyzed by means of random graph and percolation theories [5].
As a natural extension of this idea, several variations of scan statistics for graphs were proposed. Examples include nonparametric scan statistics for event detection and forecasting in heterogeneous social media graphs [4], changepoint detection over graphs with the spectral scan statistic [17], anomaly detection in graphs [18], and the graph topic scan statistic for spatial event detection [14].
In the present paper, we extend the previous research in this area by constructing a fully nonparametric unsupervised graph scan-based learning procedure that is robust against an extremely malicious adversary. Moreover, we prove consistency of our results in a framework with minimal assumptions.
The outline of this paper is as follows. In Section 2, we introduce the framework for analyzing noisy or indirectly observed graphs with users polarized into two types of communities according to their opinion on a particular matter, and introduce our concept of a strong adversarial leader influencing communities of one of these types. In Section 3, we introduce the concept of k-NN graph scan estimators and define the main notions related to their construction. These estimators are one of the main ingredients for robust consistent inference in this paper. In Section 4, we establish consistency and derive important properties of these estimators. Proofs of all the results can be found in the Appendix. Experimental results for real and synthetic graphs, with and without adversaries, can be found in Section 6. Scalability of the method is established and the algorithm's distributed realization is discussed in Section 5.

Formulation: model and adversaries
Let $G_n = (V_n, E)$ be a graph with $n$ vertices and $|E|$ edges. We do not impose restrictions on $G_n$ such as sparsity, so $|E| \gg n$ is possible. For any vertex $v \in V_n$ we observe a real-valued random variable $X_v : \Omega \to \mathbb{R}$, defined on an appropriate probability space $\Omega$.
We call the random variables $X_v$ the observed activities. These observations are noisy realizations of the true activity $A_v$ (of each vertex $v$), which we only observe with an additive (nonparametric) noise. More precisely,
$$X_v = A_v + \varepsilon_v, \qquad v \in V_n, \tag{1}$$
where we assume that the noise $\{\varepsilon_v\}$ satisfies
$$\varepsilon_v \sim F \ \text{i.i.d.}, \qquad \mathbb{E}\,\varepsilon_v = 0, \qquad \operatorname{Var}\varepsilon_v = \sigma^2 < \infty. \tag{2}$$
We do not assume knowledge of either the distribution $F$ or the variance $\sigma^2$. Moreover, $F$ need not have any particular parametric form, or be continuous, or be limited to being discrete with finite or countable support. Given this model, our aim is to robustly learn, without supervision, as much as possible about the collection of the true underlying activities $\{A_v\}_{v \in V}$ from the noisy observations (1); and possibly, to also learn $F$ nonparametrically.
In this model, $A_v$ denotes the (unknown) value on the vertex $v$. Intuitively, we can think of $A_v$ as the level of activity or support for a certain topic that $v$ intends to reveal to an observer. This level is observed imprecisely due to noise incurred by our network crawling algorithm. Moreover, we admit the presence of a powerful adversary who can corrupt the graph to hinder inference (§2.1).
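To make the additive observation model concrete, here is a minimal Python sketch that generates noisy observations from given true activities. The function name `observe`, the toy activity values, and the uniform noise are illustrative assumptions and not part of the paper; the noise distribution is deliberately passed in as a black box, since the model leaves it unspecified.

```python
import random

def observe(true_activity, noise_sampler, seed=0):
    """Return noisy observations X_v = A_v + eps_v for every vertex v."""
    rng = random.Random(seed)  # fixed seed so the toy example is reproducible
    return {v: a_v + noise_sampler(rng) for v, a_v in true_activity.items()}

# Hypothetical toy instance: three inactive vertices at level 2, one active at 10.
true_activity = {0: 2.0, 1: 2.0, 2: 10.0, 3: 2.0}

# Centered uniform noise on [-1, 1], an arbitrary stand-in for the unknown F.
X = observe(true_activity, lambda rng: rng.uniform(-1.0, 1.0))
```

Only `X` would be available to an estimator; `true_activity` and the noise sampler play the role of the unknown quantities.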
Example 1. Consider a large-scale network where manual processing is not possible and the activities of users (nodes) are determined using algorithms that may either be operating on outdated data or have large classification errors. For instance, say in a social network the observed activity $X_v$ measures the level of support of a statement by user $v$ to a specified topic. It might be difficult for an algorithm to realize whether a statement "Topic X is crazy" is supportive of the topic or the opposite.
When estimating baseline activities within communities of a-priori unknown configurations and in the presence of nonparametric noise, it is important to assume that there exists an observability threshold. Thus, we assume that
$$A_v = a \ \text{ for inactive } v, \qquad A_v \ge b > a \ \text{ for active } v. \tag{3}$$
Vertices with activity level $a$ are normal (inactive). Those with levels at or above the unknown threshold $b$ are active, and one of the goals of this work is to suggest an estimator of their baseline activity.
This setup with two distinctly different types of communities is typical for applications where users have to be considered in terms of their opinion or preference on an important polarizing topic.Examples include political elections with two candidates, or important issues such as a healthcare reform, etc.
We explicitly remark that we do assume two types of communities, but not the existence of two large communities covering the whole network. In fact, our setup includes the case when there is a large number of small communities of each type. The results of this paper allow strongly consistent inference for the cases when the number of communities of each type can be $O(n \log^{-1} n)$.
The goal of this paper can now be formulated as unsupervised learning of the infinite-dimensional vector $(a, b, F)$, where $a$ and $b$ are the baseline activities of inactive and active users, respectively. They are fundamental global intrinsic properties of topical communities of users and are needed to detect and recover active and inactive communities in subsequent analysis. The unknown probability distribution function $F$ represents the performance of the crawling algorithm for the particular topic.
We specifically note here that the problem of estimating global baseline activity levels of communities is not the same as the problem of identifying active and inactive vertices in a network. For each particular vertex, the latter problem amounts to classification (or to clustering, in the unsupervised scenario), while the former problem is an infinite-dimensional parameter learning problem. Moreover, as follows from our results in this paper, global properties of hidden communities can be learned (in a strongly consistent way) without guessing activity levels of individual vertices. On the contrary, it can be easily seen that, in the general setting of nonparametric noisy graphs, it is impossible to guarantee a strong form of consistency for uncovering true activity levels of individual vertices, and it is impossible to even consistently classify any individual vertex as an active or inactive one.
The lower bound $b$ for the active vertex intensity is assumed to be unknown, and we propose an unsupervised procedure for learning $b$ as well, even though we do not analyze its performance in this paper. The difficulty of these estimation problems is that the locations, shapes and exact sizes of clusters of active users are assumed to be unknown. The number of active users or active clusters is also unknown, and we make no probabilistic assumptions about this number or about the distribution of cluster locations. There can be anywhere between $O(\log n)$ and $n$ inactive users, and between 1 and $O(n \log^{-1} n)$ inactive communities. The contribution of inactive (and active) users and communities can range from negligible to dominant.
It is important to note that we consider the case of a fully nonparametric noise of unknown level and having an unknown distribution; even within the setup with independent identically distributed and bounded noise, this model is far more general than traditional models with normally distributed errors and graphs with simple regular structure or parametric types of degree distributions.

Strong adversary
Our model and inference algorithms permit the presence of a strong adversary. In particular, we assume that active vertices may be under the influence of an adversarial leader who is capable of directly altering their values $A_v$, pursuing the goal of spoiling our inference. The adversary knows the true values on active vertices and which vertices are inactive, knows the full graph structure of $G_n$, and has unlimited computational power and memory. Moreover, the adversary has a special deal with hackers and can completely foresee the actions we will be performing on the network, including the outcomes of all our randomized procedures.
Therefore, potentially, for all active $v$, the value $A_v$ may depend on $G_n$ and on $\mathcal{A}(G_n)$, where $\mathcal{A}$ is the algorithm we use for inference, and $\mathcal{A}(G_n)$ is the collection of all the steps we would perform together with the values of all the random variables that we will generate. Unlike the case of inactive vertices, the values $A_v$ on active vertices are not a collection of numbers, but rather a collection of random variables with a complex and unknown mutual dependence structure. Therefore, $\{X_v \mid v \in V_n\}$ forms a collection of random variables too. We do not require this collection to be completely or even pairwise independent, or to have identical marginal distributions. In fact, we allow for strong and long-distance correlations (within bounds given by (3)) between observations on active vertices.
The adversary has some limitations, though (otherwise, clearly, no consistent inference would be possible for us). It influences the active vertices only once, before we run our inference, and cannot act at any later steps. This is an assumption that we make to derive strong consistency theorems; the assumption is reasonable, as we are considering an effectively static network (even if it is a particular realization of a random network), and we ourselves can only act on the network once and without any supervision or prior knowledge. However, in our experiments in Section 6 we allow an adversary that is as strong and can act in multiple steps, and we are still able to demonstrate empirical consistency of our method.
We also assume that the adversary does not influence inactive vertices, and cannot completely tone down the active ones, so that condition (3) still holds. Moreover, the adversary does not know and cannot influence the outcomes of the crawling algorithm, so that $\{\varepsilon_v \mid v \in V_n\}$ is a completely independent collection of identically distributed random variables. These assumptions are met in a number of applications. For example, when types of communities correspond to members of political parties supporting particular candidates in the elections, the leader of one of the parties can command his supporters in a variety of ways and can also gain access to hidden information about these supporters; meanwhile, all this information will be completely inaccessible for our inference.
Within this adversarial framework, we present below (Section 3) new k-nearest-neighbors based graph scan estimators and establish sufficient conditions for them to be consistent. This consistency is a strong type of robustness, and a particular strength of our approach, as it can be easily seen that many estimators can be spoiled by such a strong adversary, as the example below suggests.
Indeed, suppose we are using an ingenious method that allows us to select a "nice" set of local subgraphs on which we run some consistent estimator.The adversary, knowing our method fully, can increase the values on the few active vertices contained in the selected local subgraphs, thus either skewing our averaging on those subgraphs, or knocking off good subgraphs in case model selection is involved.
Of course, all the consistency results of the present paper are still true in the neutral case when there is no adversary spoiling our inference.

k-NN graph scan estimators
Recall that the true activity level $a$ corresponds to normal (inactive) vertices, while active vertices have a true activity level of at least $b$. Since our estimators will be based on a scan by k-nearest-neighbors, we call these estimators k-NN scan estimators, or k-NN graph scan estimators. These estimators can be used for doing nonparametric statistics and unsupervised learning on the graph (network) $G_n$.
Our sole assumption on the graph structure and on the active and inactive communities can be formulated in a local form concerning only a negligibly small sub-community of inactive users.
Assumption 2. For our search of inactive users, assume that there is an inactive vertex $v \in G$ with a full $k(n)$-neighborhood of inactive nearest neighbors, and that $k(n)$ grows at least logarithmically in $n$.
Right below we clarify the terminology used in the Assumption. Let $v \in V$ be a vertex of $G$ and let $m \in \mathbb{N}$ be any number. We define nested (multilevel) neighborhoods $N_i(v)$ of $v$ in $G$ as follows: first set $N_0(v) := \{v\}$, and then define $N_i(v)$ recursively for every natural number $i$. The descendance level sets $D_i(v)$ of $i$-th order for $v$ in $G$ are defined for $i \ge 1$, with a separate convention for $i = 0$. For any $k \in \mathbb{N}$, a (full) neighborhood $\Omega_k(v)$ of $k$-th order of $v$ in $G$ is defined as a union of these sets.
Lemma 3 states a simple relation between the sets $D_i$ and $\Omega_k$ defined above. For any $m \in \mathbb{N}$, the same construction gives the notions of a full $m$-neighborhood and of an exact $m$-neighborhood of $v$ that are used below. In a graph $G = (V, E)$, for a set of vertices $K \subseteq V$, we write $\sum_{v \in K} X_v$ for the total sum of values observed over $K$.
Definition 4 (k-NN scan estimator). Let $\mathcal{K}_0$ be any collection of exact $k(n)$-neighborhoods of all vertices in $G$. Set
$$\hat K := \arg\min_{K \in \mathcal{K}_0} \sum_{v \in K} X_v \tag{12}$$
and define a (sublevel) k-NN scan estimator as
$$\hat a := \bar X_{\hat K} = \frac{1}{k(n)} \sum_{v \in \hat K} X_v.$$
The k-NN scan estimator for inactive vertices can be computed via the following Algorithm 1.
Algorithm 1 k-NN scan estimator
1: Phase 1 (decentralized). Pick one arbitrary $k(n)$-neighborhood per vertex in $V$, and compute the average of $X_v$ over this neighborhood.
2: Phase 2 (collaborative). Identify a node with the smallest average over its neighborhood $\hat K$.
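The two phases of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is given as an adjacency dictionary, a neighborhood of size k is taken to be the vertex itself plus the first vertices reached by breadth-first search (ties broken arbitrarily by adjacency order), and vertices whose connected component has fewer than k vertices are skipped. The helper names `knn_neighborhood` and `knn_scan_estimate` are hypothetical.

```python
from collections import deque

def knn_neighborhood(adj, v, k):
    """Collect v and its nearest vertices in BFS order until k vertices are found."""
    seen, order, queue = {v}, [v], deque([v])
    while queue and len(order) < k:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                order.append(w)
                queue.append(w)
                if len(order) == k:
                    break
    return order  # shorter than k if the component of v is too small

def knn_scan_estimate(adj, X, k):
    """Phase 1: one neighborhood average per vertex. Phase 2: keep the smallest."""
    best_avg, best_K = float("inf"), None
    for v in adj:
        K = knn_neighborhood(adj, v, k)
        if len(K) < k:
            continue  # assumption: skip vertices without k reachable vertices
        avg = sum(X[u] for u in K) / k
        if avg < best_avg:
            best_avg, best_K = avg, K
    return best_avg, best_K

# Toy path graph 0-1-2-3-4-5 with noiseless values: inactive at 2, active at 10.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
X = {0: 2.0, 1: 2.0, 2: 2.0, 3: 10.0, 4: 10.0, 5: 10.0}
a_hat, K_hat = knn_scan_estimate(adj, X, k=3)
```

In this toy case the inactive run {0, 1, 2} is the only neighborhood of size 3 without active vertices, so the scan selects it and returns the inactive level.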
Example 5. An important special case of graphs is given by lattices. Suppose we have a noisy two-dimensional pixelized image. We are interested in detection of objects that have an unknown color. This color has to be different from the color of the background. It is also assumed that on each pixel we have random noise that has an unknown nonparametric distribution. This type of model is typical for cryo-electron microscopy. Typically, each cryo-EM picture contains a large number of particles. Particles have unknown, irregular, nonconvex and different shapes and sizes. The only common property for all cryo-EM pictures is that particles are darker than the background and that the noise has a completely unknown irregular distribution. However, both the background intensity and the particle intensity vary from image to image and are not known in advance. It is possible to view digital images as networks, where individual pixels correspond to vertices. Since the initial noisy image can be naturally viewed as a square lattice graph, where $k^2$-nearest neighbors correspond to a $k \times k$ subsquare on the screen, we see that the popular sliding window estimator is a special case of the k-NN scan estimator.
This paradigm was used to solve a number of nonparametric unsupervised learning problems in image analysis and cryo-EM applications (see [12], [13]). Many results on discrete spatial scan estimators from [12] and [13] are special cases of results of the present paper.
It is possible to define scan estimators that use either full $k(n)$-neighborhoods of all vertices, or all exact $k(n)$-neighborhoods. These estimators are consistent under rather general assumptions as well, even though the corresponding limiting distributions differ from those of the scan estimator of this paper. However, the active adversary assumption needs to be altered to ensure consistency for each of these estimators.
In view of the symmetry of the results between the sublevel and superlevel cases, only the sublevel case is considered in detail in this paper. For completeness, the superlevel scan estimator and the estimator of the crawler's performance are described below.
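The lattice special case from Example 5 can be illustrated with a small sliding window scan in Python. The sketch below assumes the image is a rectangular list of lists of numbers and uses a summed-area table to get every $k \times k$ window sum in constant time; the function name `sliding_window_scan` is hypothetical, and the code is not taken from [12] or [13].

```python
def sliding_window_scan(image, k):
    """Smallest mean over all k-by-k windows of a rectangular 2-D list."""
    rows, cols = len(image), len(image[0])
    # Summed-area table: S[i][j] is the sum of image[r][c] for r < i, c < j.
    S = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            S[i + 1][j + 1] = image[i][j] + S[i][j + 1] + S[i + 1][j] - S[i][j]
    best = float("inf")
    for i in range(rows - k + 1):
        for j in range(cols - k + 1):
            # Sum over the window with top-left corner (i, j).
            total = S[i + k][j + k] - S[i][j + k] - S[i + k][j] + S[i][j]
            best = min(best, total / (k * k))
    return best

# Toy 4x4 image: background intensity 5 with a darker 2x2 particle of intensity 1.
image = [[5.0] * 4 for _ in range(4)]
for r in (1, 2):
    for c in (1, 2):
        image[r][c] = 1.0
darkest = sliding_window_scan(image, 2)
```

With a window that matches the particle size, the scan returns the particle intensity; a window covering the whole image returns the overall mean instead.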

Superlevel graph scans for active vertices
For inference on active clusters, we would have to assume that there exists at least one active cluster that contains a full $\varphi(n)$-neighborhood of vertices, with $\varphi(n)$ subject to a growth condition analogous to the one in Assumption 2. Construction of superlevel k-NN scan estimators for the active level ($b$ in our notation) also requires an assumption dual to (3). The above graph scan algorithm can then be inverted to get a k-NN scan estimator for active vertices instead: the only modification is that this time the neighborhood with the largest, rather than the smallest, average is selected.

Graph scans for learning the crawler's performance
Our framework involves a third-party network crawling algorithm that provides us with the data about the users of the network. We propose here a modified graph scan estimator to estimate the unknown distribution $F$ of the misclassification rate of the crawling algorithm. The difficulty here is again that we do not know which vertices are active or inactive. However, this problem is solved by a combination of the graph scan estimator with the empirical distribution function estimator.
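Let $\hat K$ be defined by (12) of Definition 4. A graph scan estimator for the crawler's error distribution $F$ would be the empirical distribution function of the residuals over $\hat K$,
$$\hat F(t) := \frac{1}{k(n)} \sum_{v \in \hat K} \mathbf{1}\{X_v - \hat a \le t\}.$$
The following variation gives an unbiased consistent estimator for the noise variance $\sigma^2$:
$$\hat\sigma^2 := \frac{1}{k(n) - 1} \sum_{v \in \hat K} (X_v - \hat a)^2.$$
Both scan estimators $\hat\sigma^2$ and $\hat F(t)$ can be easily calculated once $\hat a$ is calculated. We conjecture that these are consistent estimators of $\sigma^2$ and $F$, respectively. Moreover, there is evidence that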
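As a rough illustration of how the crawler's performance could be estimated once a neighborhood has been selected, the Python sketch below computes the neighborhood mean, a sample variance, and an empirical distribution function of the residuals. It assumes that the selected neighborhood is already available as a list of vertices, and that the variance estimator uses the usual $k - 1$ normalization; the function name `crawler_estimates` is hypothetical.

```python
def crawler_estimates(X, K_hat):
    """Estimate the baseline, the noise variance and the noise CDF on K_hat."""
    k = len(K_hat)
    a_hat = sum(X[v] for v in K_hat) / k          # scan estimate of the baseline
    residuals = [X[v] - a_hat for v in K_hat]     # proxies for the noise values
    sigma2_hat = sum(r * r for r in residuals) / (k - 1)

    def F_hat(t):
        """Empirical distribution function of the residuals."""
        return sum(1 for r in residuals if r <= t) / k

    return a_hat, sigma2_hat, F_hat

# Toy observations on a selected neighborhood of three vertices.
X = {0: 1.0, 1: 2.0, 2: 3.0}
a_hat, sigma2_hat, F_hat = crawler_estimates(X, [0, 1, 2])
```

The returned `F_hat` is a step function, so it can be evaluated at any threshold `t` without storing more than the residuals.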

Consistency of k-NN scan estimators
In this section, we establish strong consistency of the proposed k-NN scan estimator under the assumption of bounded noise. Suppose, therefore, that there is a constant $M > 0$ such that
$$|\varepsilon_v| \le M \quad \text{almost surely, for all } v \in V_n. \tag{20}$$
The bounded noise case is not the only case when the k-NN scan estimator is strongly consistent. We mainly use condition (20) to establish tight nonasymptotic performance guarantees for the estimator. Some form of asymptotic consistency can be established for unbounded noise from large nonparametric classes as well. This is illustrated in Section 6, where all of our experiments are performed for an unbounded noise.
Let $K \subseteq V_n$ be any collection of vertices. Denote by $S_1(K, n)$ the number of active vertices in $K$. We prove the following statement that provides the foundation of model selection on graph neighborhoods.
Proposition 6. Let $K_0$ be any set of inactive vertices with $|K_0| = k(n)$, and let $\mathcal{K}$ be any collection of exact $k(n)$-neighborhoods. Then the bound (21) holds.
Notice that the statement of Proposition 6 is actually non-asymptotic: even though the bound depends on $n$, it is valid for small $n$ as well. Additionally, the bound (21) depends on the subgraph $K$, and automatically tightens for those subgraphs that have more active vertices. This can be used to show that (21) is essentially tight, with or without the adversary, also in finite-sample, non-asymptotic cases.
The bound in Proposition 6 is based on the first Bernstein inequality [3], which substantially improves on the Hoeffding inequality in the case $\operatorname{Var} \varepsilon_v \ll M^2$, which holds for many distributions of interest. Additionally, this makes the bound (21) extensible to weakly dependent random variables as well, unlike the corresponding bound that relies on the Hoeffding inequality.
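The gain from a Bernstein-type bound when the variance is much smaller than $M^2$ can be checked numerically. The Python sketch below compares the classical two-sided Hoeffding and Bernstein tail bounds for the average of $k$ independent, zero-mean random variables bounded by $M$ in absolute value. These are the textbook forms of the two inequalities, used here as an assumption; the exact form of the inequality from [3] used in the proofs may differ.

```python
import math

def hoeffding_bound(k, t, M):
    """Bound on P(|mean of k terms| >= t) for independent terms in [-M, M]."""
    return min(1.0, 2.0 * math.exp(-k * t * t / (2.0 * M * M)))

def bernstein_bound(k, t, M, var):
    """Same tail, but also using the variance `var` of each zero-mean term."""
    return min(1.0, 2.0 * math.exp(-k * t * t / (2.0 * (var + M * t / 3.0))))

# Illustrative numbers: noise bounded by M = 10 but with variance only 1.
k, t, M, var = 100, 0.5, 10.0, 1.0
print(hoeffding_bound(k, t, M))       # trivial bound in this regime
print(bernstein_bound(k, t, M, var))  # much smaller, since var << M**2
```

For these values the Hoeffding bound is vacuous (it is capped at 1), while the Bernstein bound is already below 0.02.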
The following is the master theorem governing consistency of graph scan estimators.We understand consistency in its strong classical form: as the sample size increases, the algorithm learns the true value of the parameter correctly with probability approaching 1.
Then the scanning algorithm $\mathcal{A}$ leads to the graph scan estimator $\hat a = \bar X_{\hat K}$ that is a consistent estimator of $a$, regardless of the adversary's strategy.
As a corollary of this general statement, we derive consistency of the estimator of the present paper.
Theorem 8. Suppose that $k(n)$ satisfies Assumption 2. Then, the graph scan estimator $\hat a = \bar X_{\hat K}$ is a consistent estimator of $a$, regardless of the adversary's strategy.
Proofs of all the results can be found in the Appendix. Experimental results for real and synthetic graphs, with and without adversaries, can be found in Section 6. Scalability of the method is established and the algorithm's distributed realization is discussed in Section 5.

Scalable algorithms and computation
In view of the huge size of many networks of current interest, and also due to the fact that we typically do not get the chance to observe the network in its entirety but rather can only query it in small parts, we need fast and scalable algorithms to implement our estimators. Moreover, these algorithms should be provably statistically reliable and consistent. In this section we discuss some scalability aspects of our estimators; however, to avoid detracting from our primary modeling focus, we defer a thorough empirical evaluation to the future.
Distributed computation. Note that the computation of our proposed estimator shown in Algorithm 1 is fully decentralized, breaking up into the computation of one neighborhood per graph node (Phase 1). However, there is a necessary centralized communication phase at the end that collects the average (a single number) of each neighborhood and keeps only the smallest one (Phase 2). This second phase is easily implemented as a typical map-reduce operation over the nodes, the 'reduce' operation being the min operator.
Proposition 9. Let $k(n)$ be the neighborhood size. One can compute a k-NN scan estimator using $O(k(n))$ operations per node in the decentralized phase (Phase 1) of Algorithm 1, followed by a single communication round to compute the minimum over the graph.
Parallelization. The main computational cost is encountered in Phase 1 of Algorithm 1. Fortunately, the computations in this phase decouple completely, enabling our algorithm to be directly applicable to huge networks. In this setting, we assume the nodes of the graph are partitioned over a set of compute agents, each responsible only for the estimators over its own nodes. In the extreme case, in an IoT setting as noted below, every graph node has its own compute agent responsible for it. All agents independently execute the same BFS-style algorithm to build an average over the $k(n)$-sized neighborhood of each node (naively needing the communication of $k(n)$ numbers back to node $v$).
In an IoT framework, we naturally have a processor built in at each graph node. We therefore readily use those processors (that have immediate access to local observations, and communicate in small neighborhoods) for parallel estimation using Algorithm 1.
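The partitioned computation described above can be mimicked on a single machine. In the Python sketch below, the nodes are split round-robin over a few hypothetical "agents" (threads from the standard library), each agent computes the smallest neighborhood average over its own nodes, and a final `min` plays the role of the reduce step. The names `local_average`, `partition_minimum` and `distributed_scan` are assumptions made for illustration, and Python threads only demonstrate the structure of the computation rather than a real speed-up.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def local_average(adj, X, v, k):
    """BFS from v; average X over the first k vertices reached (inf if fewer)."""
    seen, queue, total, count = {v}, deque([v]), 0.0, 0
    while queue and count < k:
        u = queue.popleft()
        total += X[u]
        count += 1
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return total / k if count == k else float("inf")

def partition_minimum(adj, X, nodes, k):
    """Phase 1 on one agent: smallest neighborhood average over its own nodes."""
    return min((local_average(adj, X, v, k) for v in nodes), default=float("inf"))

def distributed_scan(adj, X, k, n_agents=4):
    nodes = list(adj)
    parts = [nodes[i::n_agents] for i in range(n_agents)]  # round-robin partition
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        local_minima = pool.map(lambda part: partition_minimum(adj, X, part, k), parts)
        return min(local_minima)  # Phase 2: the reduce step is the min operator

# Toy path graph 0-1-2-3-4-5: inactive level 2 on the left, active level 10.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
X = {0: 2.0, 1: 2.0, 2: 2.0, 3: 10.0, 4: 10.0, 5: 10.0}
estimate = distributed_scan(adj, X, k=3)
```

Each agent returns a single number, so the communication in the final step is one value per agent regardless of the graph size.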
Experimental results

Simulated large networks, no adversary
To illustrate the performance of our method on a larger scale, we constructed an artificial network that consists of two subgraphs. The first is made of a million nodes, where each of these nodes is connected to 3 randomly chosen nodes within this subgraph. The other subgraph consists of a thousand nodes and, in a similar way, each of these nodes is connected to 3 randomly chosen nodes within this second subgraph. All the 1000 nodes in the smaller subgraph are labeled as inactive, while nodes in the bigger subgraph are labeled as active or inactive randomly with equal probability. This way, we guaranteed that there is an inactive community of size $k(n) = 1000$. Furthermore, we pick 20 random pairs of nodes, one from the bigger subgraph and the other from the smaller one, and connect the two. The number of nodes in the graph is 1,001,000 and the number of edges is up to 3,003,002.
Activity levels in the network are set to $a = 2$ for the inactive nodes and $A_v = 10$ for the active ones. These weights are corrupted by adding Gaussian noise of mean 0 and variance $\sigma^2 = 1$ before feeding them to the algorithm.
The estimates returned for the above example using our proposed approach are in accordance with the theory. The algorithm underestimates $a$ when we pick a value of the parameter $k$ that is too small (this value corresponds to the size of the neighborhood). The estimate's value increases with an increased value of $k$. Halfway between the two extreme values we obtain very accurate estimates. Naturally, the values of $k$ that return the optimal estimate ($k = 500$ or $k = 1000$ in this case) can also be considered suggestive of the inactive community sizes inside the network.
If we change the size of the smaller group to 10,000, we observe that the $k = 1000$ setting of the algorithm returns an estimate $\hat a = 1.75$ on average over 400 experiments, implying that we have to scan with bigger neighborhoods to achieve consistent inference. Indeed, in this case $k(n) = 10{,}000$.
The histogram in Figure 1 illustrates the distribution of the k-NN graph scan estimators for different values of $k$. In particular, for $k = 500$ the empirical mean of the estimator is 1.91914586 with variance 0.0341272087, while for $k = 1000$ we have estimated mean 2.03303998 with variance 0.031305042. The true value $a = 2$ is, indeed, very close.
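A scaled-down version of this synthetic setup is easy to generate. The Python sketch below follows the description above under a few assumptions of its own: edges are undirected, self-loops and repeated edges are silently dropped, and all sizes and seeds are parameters so that the construction can be run at a small scale. The function names `build_network` and `noisy_activities` are hypothetical.

```python
import random

def build_network(n_big, n_small, n_bridges, degree=3, seed=0):
    """Two random subgraphs joined by a few bridges, plus active/inactive labels."""
    rng = random.Random(seed)
    n = n_big + n_small
    adj = {v: set() for v in range(n)}

    def connect(u, w):
        if u != w:  # assumption: ignore self-loops
            adj[u].add(w)
            adj[w].add(u)

    for v in range(n_big):                      # bigger subgraph
        for _ in range(degree):
            connect(v, rng.randrange(n_big))
    for v in range(n_big, n):                   # smaller, fully inactive subgraph
        for _ in range(degree):
            connect(v, rng.randrange(n_big, n))
    for _ in range(n_bridges):                  # random links between the two
        connect(rng.randrange(n_big), rng.randrange(n_big, n))
    # Nodes of the bigger subgraph are active with probability 1/2.
    active = {v: v < n_big and rng.random() < 0.5 for v in range(n)}
    return adj, active

def noisy_activities(active, a=2.0, b=10.0, sigma=1.0, seed=1):
    """Observed value: level a or b plus Gaussian noise with standard deviation sigma."""
    rng = random.Random(seed)
    return {v: (b if is_active else a) + rng.gauss(0.0, sigma)
            for v, is_active in active.items()}

# Small-scale analogue of the experiment: 2000 + 50 nodes and 5 bridges.
adj, active = build_network(n_big=2000, n_small=50, n_bridges=5)
X = noisy_activities(active)
```

The resulting `adj` and `X` can be passed to any implementation of Algorithm 1 to reproduce the qualitative dependence of the estimate on the neighborhood size.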

Real network structure, no adversary
In this example, we use the real community structure of political blogs from the 2004 U.S. Presidential Election. The study [1] examined the linking patterns of political blogs. These blogs can be naturally classified into two classes, the liberal ones and the conservative ones, which suits our framework. The number of libertarian, independent, or moderate blogs was negligible at the time.
In [1], a description of the network of over 1000 blogs is presented, based on a single-day snapshot that included blogrolls. The blogs were categorized manually. It turned out that neither the directory labels, which rely on self-reported or automated categorizations, nor the manual labels were 100% accurate, and the error probability has an unknown distribution.
There were 1494 blogs in total, with 759 liberal and 735 conservative. The structure of the underlying linking graph was rather complex: 91% of the links originating within either the conservative or liberal communities stay within that community, but the number of intercommunity links was non-negligible as well. The structures of the two communities were noticeably different: 82% of conservative blogs receive a link from at least one other blog, while for liberal blogs the corresponding number is only 67%. Surprisingly, the big graph was disconnected, with one big component of 1222 vertices and a large number of individual blogs that were not part of a bigger community. Figure 2 illustrates the structure of the political blogging graph.
To create a challenge for our method, we add a further complication to this dataset by corrupting the values attributed to blogs with Gaussian white noise. Notice that this type of noise is unbounded and so does not satisfy the conditions of the consistency theorem. The graph is also relatively small, so we are far from the asymptotic regime. Nevertheless, the graph scan estimator produced surprisingly accurate results.
The histogram in Figure 3 illustrates the distribution of the k-NN graph scan estimators for different values of k. It is apparent that in this graph there is an inactive neighborhood of size close to 150, but not more. The estimator for k = 150 is surprisingly accurate.

Simulated large network, two types of adversaries
To study the impact of an adversary on the performance of the proposed algorithm, we designed the following experiment with two scenarios. Scenario 1: the adversary decides to act locally and chooses to influence those active nodes that are present in the best neighborhood K of our choice, hoping to throw off our specific algorithm.
Scenario 2: the adversary is strong and decides to use a brute-force approach, influencing all the active nodes in the network.
For our experiment, we assume that when the adversary 'influences' a node, the activity level of this node is changed to a large number, 10^6. We then run our algorithm on the altered activity levels and generate a new estimate â(2) corresponding to a new 'best neighborhood' K(2).
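The two scenarios can be mimicked on a toy graph as follows. This is a sketch under assumptions: the helper names are hypothetical, and the estimator is taken to be the smallest BFS-neighborhood average, which stands in for the actual algorithm only for the purpose of illustration.

```python
from collections import deque


def bfs_neighborhood(adj, v, k):
    """First k nodes in BFS order from v, or None if fewer are reachable."""
    seen, queue, nodes = {v}, deque([v]), []
    while queue and len(nodes) < k:
        u = queue.popleft()
        nodes.append(u)
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return nodes if len(nodes) == k else None


def scan(adj, obs, k):
    """Return (smallest neighborhood average, that neighborhood).

    ASSUMPTION: a minimum over neighborhood averages plays the role of
    the estimator and its 'best neighborhood'.
    """
    best = None
    for v in range(len(adj)):
        nodes = bfs_neighborhood(adj, v, k)
        if nodes is None:
            continue
        avg = sum(obs[u] for u in nodes) / k
        if best is None or avg < best[0]:
            best = (avg, nodes)
    return best


def influence(obs, active, targets, level=1e6):
    """Set the activity of every active node in `targets` to `level`."""
    return [level if (active[v] and v in targets) else x for v, x in enumerate(obs)]


# Toy path graph 0-1-2-3-4-5; nodes 2, 3, 4 form the inactive community.
adj = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
active = [True, True, False, False, False, True]
obs = [10.0, 10.0, 2.0, 2.0, 2.0, 10.0]

a_hat, K = scan(adj, obs, k=3)
# Scenario 1: influence only the active nodes inside the best neighborhood K.
a_hat_local, _ = scan(adj, influence(obs, active, set(K)), k=3)
# Scenario 2: influence every active node in the network.
a_hat_brute, _ = scan(adj, influence(obs, active, set(range(len(adj)))), k=3)
```

In this toy case the best neighborhood contains no active node, so neither adversary can move the estimate: raising activity levels elsewhere only increases the competing averages.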
Over 100 such experiments with k = 500, there are only 23 cases where the neighborhood K obtained from the first run of our algorithm contains at least one active node. In the remaining 77 cases, nothing changes for the algorithm under either adversary (Scenario 1 or Scenario 2): the new best neighborhood K(2) remains the same as the one obtained previously (K), so the estimate â(2) is unchanged. For the 23 cases where K contains an active node, the estimate â from the first run has a mean of 1.9320 and a standard deviation of 0.0339. After the action of the weak adversary (Scenario 1), the estimate â(2) has a mean of 1.9399 and a standard deviation of 0.0333. Under the strong adversary (Scenario 2), the estimate â(2) has a mean of 1.9423 and a standard deviation of 0.0322.
For k = 1000, however, we observed 97 cases where the neighborhood K contains one or more active nodes, and only 3 cases with no active nodes within K. The estimates â from the first run had a mean of 2.0293 with a standard deviation of 0.0318. After the influence of the weak adversary, the estimates â(2) had a mean of 2.0472 and a standard deviation of 0.0438, and in the presence of the strong adversary they had a mean of 1.9220 and a standard deviation of 0.0354.
This shows that our method exhibits remarkable stability against both crafty and brute force adversaries, even under conditions that are more general than the ones in our consistency theorem.

Figure 1: Graph scan estimators for the artificial large graph.

Figure 2: Political blogosphere graph for the 2004 Elections.

Figure 3: Graph scan estimators for the 2004 Elections graph.