A Posterior Probability Approach for Gene Regulatory Network Inference in Genetic Perturbation Data

Inferring gene regulatory networks is an important problem in systems biology. However, these networks can be hard to infer from experimental data because of the inherent variability in biological data as well as the large number of genes involved. We propose a fast, simple method for inferring regulatory relationships between genes from knockdown experiments in the NIH LINCS dataset by calculating posterior probabilities, incorporating prior information. We show that the method is able to find previously identified edges from TRANSFAC and JASPAR and discuss the merits and limitations of this approach.


1. Introduction. Gene regulatory networks are central to understanding the biological functioning of cells. Identifying the interactions between genes can help biologists understand how the cell functions, both in steady state and in response to external stimuli. Unfortunately, due to the complexity of cells and the large number of genes involved, discovering the true networks is very difficult. In most cases the number of genes measured far exceeds the number of observations, as is typical in microarray or sequencing experiments. Any method for analyzing such data must take this fact into account. Often this is done by enforcing a sparsity constraint, either via an added penalty on non-sparseness or via priors placed on the model. Even with these constraints, the ability to make valid inference is limited in these high-dimensional regimes as the number of genes grows relative to the number of observations [55,54].
Many methods have been developed for inferring relationships between genes from gene expression data. One approach is to model the network holistically using Bayesian networks [40,21,22,62,24,48]. This yields interpretable models, but often does not scale well and is hard to apply at the whole-genome level. Regression-based methods, where the expression level of a target gene is modeled as a function of the expression levels of that gene's regulators, can be applied to much larger sets of genes but do not provide an overarching model of the entire network. Inference for these models generally becomes a statistical variable selection or model selection problem. Common methods for this include Significance Analysis of Microarrays [53], the Least Absolute Shrinkage and Selection Operator (LASSO) [52,17,36], and Bayesian Model Averaging (BMA) [42,19,57,29,59]. Another class of methods looks at mutual information among the measured genes [2,34,13,37].
When looking for regulatory relationships, knockout experiments, in which a single gene is fully suppressed, have been found to be highly informative because they provide a way to identify a causal pathway, direct or indirect. In the DREAM4 in silico network challenge [32,33,10], the winning method used only the data from the knockout experiments to infer the true networks, ignoring the time-series data entirely [41]. In real biological experiments, full knockouts are not possible for many essential genes, but knockdown experiments, where the target gene is partially suppressed, are often available. Methods for analyzing knockdown or knockout data include correlation-based approaches [45], implicit latent variable scoring [58], and Bayesian network scoring [14,44,15]. In addition, there has been some work on combining steady-state data with knockdown data to improve results [51,5].
In this paper, we propose a simple, fast method for inferring gene regulatory relationships from knockdown data alone. Our method fits a simple linear regression model to each single regulator-target gene pair. It allows the incorporation of prior knowledge about the relationships and generates posterior probabilities, which can be used to form a ranked edge list or as part of a more expansive analysis.
2. Data. Our data come from the National Institutes of Health (NIH) Library of Integrated Network-based Cellular Signatures (LINCS) program [28,11]. The aim of this program is to generate genetic and molecular signatures of cells in response to various perturbations. One thrust of this program is the large-scale generation of gene expression profiles using L1000 technology. This technology has resulted in measurements from over one million experiments to date on over fifty human cell lines. A cell line is a population of cells descended from a single source cell, sharing the same genetic makeup and kept alive by growing it in culture apart from the original source.
Each of the L1000 experiments measures the expression levels of 1000 landmark genes in the human genome. These genes were selected to capture as much of the variation in cellular gene expression as possible, since measuring all 20,000+ genes is impractical. These experiments have measured cellular responses to more than 20,000 different chemical perturbagens. In addition, knockdown and over-expression experiments, where a single gene is targeted to control its expression level, have been performed on thousands of individual genes, both within and outside of the 1000 landmark genes.
The L1000 experiments were performed using Luminex Bead technology [12], in which color-coded microspheres were designed to attach to specific RNA sequences corresponding to a landmark gene and to fluoresce according to the level of that gene's expression. Sets of beads for measuring the 1000 landmark genes were added to the solution for a single experiment along with the perturbing agent. The experiment was left for a specified period of time and then the gene expression levels were measured.
Experiments were performed in sets on a single plate with individual wells for 384 experiments. This minimizes some external sources of error, such as environmental conditions, across these experiments. A small subset of these experiments were used as controls with no perturbation, giving a baseline distribution of expression levels for each gene against which deviations in the other experiments can be measured. A common approach in this setup is to look at deviations in the perturbation experiments from the controls on the same plate, again recognizing that experiments on the same plate are likely to be more similar than those on different plates. Multiple plates, typically three or four, are prepared and analyzed together as a batch. These plates are prepared as technical replicates, with a given perturbation being prepared and then put into the same well of each plate. This gives additional power for removing systematic biases. Any given perturbation is also performed in multiple different batches, resulting in biological replicates since those sets of experiments were not prepared together.
3. Methods. We want to use the L1000 data to infer gene regulatory networks. This means that we need a method for inferring causality. One way to do this would be to use a causal time-series model, but the L1000 data include very few time points (drug perturbation experiments include only one or two). Instead, we use knockdown experiments to identify a single gene as a putative causal agent. Although this limits the amount of information we can gain from a single experiment, it allows us to use a straightforward model with a clearly defined regulatory gene.
When looking at the knockdown experiments, it is important to understand that not all experiments are equally useful. The efficacy of the perturbation varies between target genes and even between experiments for the same target. The experimental setup of the LINCS data is helpful in identifying these differences. We use the control experiments on a plate to get an estimate of the normal variability of a gene. This eliminates some of the variability due to effects we cannot control or even measure, including differences in experimental conditions such as ambient temperature and the scientist performing the experiment, since these are captured in plate-level effects.
To take advantage of this aspect of the LINCS data, we calculate plate-level z-values for each gene in a knockdown experiment. To do this, we first calculate the baseline mean, \bar{x}_{hp}, and standard deviation, s_{hp}, for each gene h across all control experiments on plate p. Then the z-value for gene h in knockdown experiment i on plate p is
    z_{hpi} = (x_{hpi} - \bar{x}_{hp}) / s_{hp}.
Once we have transformed the data in this way, we use a simple linear regression model to model the change in a target gene t as dependent on the change in the knockdown gene h:
    z_{ti} = \beta_0 + \beta_1 z_{hi} + \epsilon_i,   \epsilon_i ~ N(0, \sigma^2),   i = 1, ..., n_h.    (1)
Here, n_h is the number of available knockdown experiments for gene h. This model specifies a linear relationship between the z-score of the knockdown gene h and that of the target gene t. This is a simplification of the true process underlying the relationship between genes h and t, but it can still be effective for discovering relationships. We estimate this model with a Bayesian approach using Zellner's g-prior [60] for the model parameters (\beta_0, \beta_1, \sigma^2), namely
    (\beta_0, \beta_1)^T | \sigma^2 ~ N(0, g \sigma^2 (X^T X)^{-1}),   p(\sigma^2) \propto 1/\sigma^2,
where X is the design matrix whose columns are an intercept and the z-scores of gene h. The parameter g specifies the expected size of the regression parameter \beta_1 relative to the standard error of the OLS estimate of \beta_1. The choice g = 1 indicates that the regression parameter is expected to be nearly indistinguishable from the noise, and thus gives a practical lower bound for g. Also, n_h/g is the effective number of data points in the prior, with g = n_h corresponding to a unit information prior and giving results similar to BIC. We do not want a prior with more spread than a unit information prior [43], and thus we choose g in the range 1 <= g <= n_h. In this case, we used g = \sqrt{n_h}, which we have found to be a good compromise between the two extremes. Indeed, when we estimated g using an Expectation-Maximization algorithm [7,59], the estimated value was often close to \sqrt{n_h}.
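As an illustration of the standardization step, the following R sketch computes plate-level z-values under stated assumptions: expr is a hypothetical genes-by-wells matrix of raw expression values for one plate, and is.control is a hypothetical logical vector flagging that plate's control wells; neither name comes from the LINCS data itself.
    # Plate-level z-values: standardize each gene by the mean and standard
    # deviation of the control wells on the same plate.
    plate.z <- function(expr, is.control) {
      ctrl <- expr[, is.control, drop = FALSE]
      m <- rowMeans(ctrl)          # baseline mean for each gene h on plate p
      s <- apply(ctrl, 1, sd)      # baseline standard deviation s_hp
      sweep(sweep(expr, 1, m, "-"), 1, s, "/")  # z_hpi = (x_hpi - mean) / sd
    }
Running a single pass of this function per plate yields the standardized values used by the regression model (1).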
The regression model with the g-prior allows us to quickly calculate the posterior probability that gene h regulates gene t [6]. We first calculate the ratio of the posterior probability that h regulates t given the data x, Pr(h -> t | x), to the posterior probability of no regulatory relationship, Pr_0. Further, we can incorporate a prior probability of regulation, \pi_{ht}, which reflects prior information about a regulatory relationship between genes h and t. This gives us
    Pr(h -> t | x) / Pr_0 = [\pi_{ht} / (1 - \pi_{ht})] (1 + g)^{(n_h - 2)/2} [1 + g(1 - R^2)]^{-(n_h - 1)/2},
where R^2 is the coefficient of determination for the simple linear regression model (1). Writing O_{ht} for this posterior odds ratio, we obtain the posterior probability that h regulates t, or posterior edge probability,
    Pr(h -> t | x) = O_{ht} / (1 + O_{ht}).
We use this posterior edge probability to rank potential edges and find likely edges for further investigation.
Two advantages of this method are its speed and simplicity. To compute the z-scores, we first obtain plate-level means and standard deviations, which can be done in a single pass through the baseline data by keeping track of sums and sums of squares. From there, standardization of the knockdown data is quick, and we need only calculate correlations between the knockdown gene and each other gene to get the posterior edge probabilities. Additionally, including external knowledge through the prior edge probability can provide a significant boost in accuracy and precision [59]. Finally, these posterior probabilities have a straightforward interpretation, namely the probability that a given regulator-target pair reflects a true relationship given the data.
4. Results. We computed posterior probabilities for edges in the LINCS data for cell line A375, a human skin melanoma cell line with over 100,000 experiments in the LINCS data, including approximately 15,000 knockdown experiments on landmark genes. This gives a good set of data with which to evaluate our method. We set the prior probability of any edge being present to 0.0005, reflecting the average expected number of regulators (parents) per node determined by Guelzim et al. [16] for yeast, together with the assumption that transcription factors regulate approximately the same proportion of target genes regardless of the total number of genes available.
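To make the computation concrete, here is a minimal R sketch of the posterior edge probability for a single regulator-target pair, following the odds formula reconstructed above with g = \sqrt{n_h} and the 0.0005 prior used in this section; the z-score vectors z.h and z.t are hypothetical inputs, and the sketch should be read as an illustration of the calculation rather than as the reference implementation.
    # Posterior probability that gene h regulates gene t, given z-scores from
    # the n_h knockdown experiments targeting gene h.
    edge.prob <- function(z.h, z.t, prior = 0.0005) {
      n <- length(z.h)
      g <- sqrt(n)                       # g = sqrt(n_h), as chosen in Section 3
      r2 <- cor(z.h, z.t)^2              # R^2 of the simple regression (1)
      bf <- (1 + g)^((n - 2) / 2) / (1 + g * (1 - r2))^((n - 1) / 2)
      odds <- prior / (1 - prior) * bf   # posterior odds O_ht that h -> t
      odds / (1 + odds)                  # posterior edge probability
    }
Because only the correlation between the two z-score vectors is needed, this calculation scales easily to all candidate pairs.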
To assess our results, we need a reference standard. In our case, we looked at the Enrichr website [4], which has collected numerous gene-set libraries, including some that list gene regulatory relationships. We used the TRANSFAC and JASPAR lists of edges; these list transcription factors along with putative binding sites on other genes, identified using position weight matrices [56,47]. This is not a comprehensive gold standard, since these regulatory relationships are limited to well-studied transcription factors. However, assessment of gene networks in mammalian systems is non-trivial due to incomplete knowledge.
The TRANSFAC and JASPAR (T&J) dataset includes 37 transcription factors that overlap with the LINCS landmark genes. Thus we limit our assessment to only those genes as potential regulators. For these 37 transcription factors, the T&J dataset has 4,303 regulator-target pairs where both the transcription factor and the target are among the landmark genes. For each transcription factor in T&J, there is a variable number of knockdown experiments in the L1000 data targeting that particular gene, generally between 10 and 20 per gene. These are used to calculate posterior probabilities for about 42,000 pairs of genes to be compared with the T&J reference dataset.
To further evaluate our method, we compared our results with those from applying Significance Analysis of Microarrays (SAM) [53] and mutual information methods to the data. SAM is an adaptable method for identifying significant changes in gene expression while estimating the false discovery rate. It is widely used for microarray data and is implemented in the R package samr. Mutual information methods, grounded in information theory, have also been used extensively to identify relationships among genes. We used the minet package in R [38] to analyze our data with three different mutual information methods: Context Likelihood of Relatedness (CLR) [13], Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) [34], and minimum redundancy maximum relevance (MRMR) [9,37]. These three methods offer differing approaches to identifying relationships between genes.
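For reference, a minimal sketch of this comparison step using minet: z.mat is a hypothetical experiments-by-genes matrix of z-scores, and the Spearman estimator is one plausible choice (the paper does not state which estimator was used). Note that minet exposes the MRMR criterion through its mrnet function.
    library(minet)
    # Mutual information matrix from the expression z-scores (estimator
    # choice is an assumption; the paper does not specify it).
    mim <- build.mim(z.mat, estimator = "spearman")
    w.clr    <- clr(mim)      # Context Likelihood of Relatedness weights
    w.aracne <- aracne(mim)   # ARACNE weights
    w.mrnet  <- mrnet(mim)    # network inference based on the MRMR criterion
Each call returns a matrix of edge weights over all gene pairs, which can then be ranked as described below.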
Each method produces a list of gene pairs along with some measure of the strength of their relationship: SAM returns p-values, the mutual information methods produce weights, and our method gives posterior probabilities. We can sort these to produce ranked lists and evaluate them against the reference dataset.
First, we looked at two-by-two tables from each method. For the posterior probability method, we used probability cutoffs of 0.5 and 0.95 to define found edges. SAM provides a list of relationships found to be significant. The mutual information methods do not define a particular cutoff for significance, so all relationships returned with non-zero weight were included. The resulting two-by-two tables are shown in Table 1.
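As an illustration of how such a table is formed, a short R sketch with hypothetical vectors over the same candidate pairs: post holds posterior edge probabilities and in.tj flags whether each pair appears in the T&J reference.
    # Two-by-two table at the 0.5 posterior probability cutoff.
    found <- post > 0.5
    table(called = found, reference = in.tj)
    # Repeating with found <- post > 0.95 gives the table for the second cutoff.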
To assess whether the inferred lists and the T&J dataset are related, we also report approximate (non-Bayesian) p-values, computed as the probability of obtaining at least the observed number of true positives under a binomial(n, p) distribution, where n is the number of pairs in the inferred list and p is the probability of selecting a true edge at random from the total number of possible edges. Note that these p-values are distinct from the posterior probability thresholds, and inference is based on the posterior probability thresholds. The p-values are 0.02 for both thresholds, indicating that the posterior edge probabilities are related to the T&J results at conventional levels of significance. The competing methods are not able to accurately identify a small number of edges as true, returning many more edges than the posterior probability method.
We expect errors in the assessment, in the form of both false positives and false negatives, due not only to limitations of our method but also to the nature of the data and of the TRANSFAC and JASPAR reference standard by which we evaluate our edges. In part this is because the T&J reference standard is not specific to the A375 cell line. False positives might also arise because the expression levels of target genes change due to indirect effects: the path from the transcription factor to the target gene may pass through intermediate genes. In fact, since only about 5% of human genes are measured in the LINCS experiments, there are likely to be many unmeasured genes in relevant pathways. If we had measurements for all 22,000 genes, a link removal procedure could be very useful [23,41]. We also expect false negatives, since the T&J dataset is not a set of confirmed regulatory relationships; rather, it is informed by attributes of the transcription factor and of the target gene. This means that many of the relationships designated as true by the T&J dataset may not reflect true interactions at the cellular level. In general, we do not expect a transcription factor to affect 10% of the possible targets, which is what the T&J dataset reports, so the T&J data likely overestimate the number of regulatory relationships.
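Returning to the binomial calculation described above, a minimal R sketch of the tail probability; the counts in the example call are hypothetical, with p taken as the 4,303 true pairs out of roughly 42,000 candidates.
    # P(at least tp true positives) when each of n called edges independently
    # hits a true edge with probability p0.
    binom.pval <- function(tp, n, p0) pbinom(tp - 1, n, p0, lower.tail = FALSE)
    binom.pval(tp = 10, n = 50, p0 = 4303 / 42000)   # hypothetical counts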
Another way of looking at the results is via the precision-recall curve. Precision and recall are both calculated by truncating our ranked list of edges and considering only those proposed edges. Precision is the proportion of proposed edges that are true edges; recall is the proportion of true edges that are in the proposed set. The precision-recall curve takes a ranked list of edges from a procedure and shows how the precision varies as more and more edges are included from that list. High precision at low recall indicates that the procedure is good at identifying true edges among those it ranks highest. This is important in many cases, particularly genetic studies, because it tells researchers where to focus their efforts in subsequent studies.
Figure 1 shows the precision-recall curves generated by the different methods. The edges ranked most highly by posterior probability yield better results than expected from random guessing, by a factor of 1.5 to 2. The precision declines as more edges are added, returning to hover near random guessing. The MRMR and ARACNE results fare worse than random guessing, and although CLR ranks a few true edges highly, it returns to random much faster than the posterior probability ranking. The ranked list from SAM performs comparably to the posterior probability method, but it is unable to differentiate among the edges at the very top of its list, with 168 edges sharing the same lowest p-value. From a scientific point of view, high precision among the most highly ranked edges is important, since resources for designing and executing experiments to investigate particular edges more closely are limited. It would of course be preferable to see even better precision, but the discussion above shows why that may not be achievable with this dataset and reference standard.
5. Discussion. We have demonstrated a straightforward approach to inferring gene regulatory network edges from knockdown data. The approach is simple to apply to large datasets and can incorporate prior information when available. It is able to find confirmable regulatory relationships between genes in the L1000 data, and we showed that it performs comparably to or better than popular approaches at identifying the regulatory relationships in the TRANSFAC & JASPAR evaluation dataset.
One key benefit of this approach is that it can be applied to extremely large datasets, requiring only one pass through the data to compile all the sufficient statistics for computing the posterior probabilities. There is no need to retain the data after reading it, and no iterative methods, such as Expectation-Maximization or Markov chain Monte Carlo, are required. Methods that model the entire network [24,48,31,46] may yield more comprehensive results but are generally restricted to smaller sets of genes due to computational constraints. Our approach is complementary to these, potentially narrowing down the set of edges for further investigation.
We have applied the method to knockdown data in order to identify causal regulatory relationships. The method can also be applied to over-expression data or even steady-state data, although for steady-state data the resulting edges would lack directionality [39]. It could also be used to infer differential expression under a perturbation such as a drug treatment, using a two-class model in which the predictor variable indicates whether the expression measurements come from a perturbation experiment or a control experiment. An implementation of our method is available as an R package, BayesKnockdown, on Bioconductor, including functions for both knockdown and perturbation data.
We considered using an edge reduction technique, such as that used by Pinna et al. [41], but ultimately decided against it. Since we used only 37 genes as regulators due to our assessment data, the resulting networks did not tend to have multiple pathways from one gene to another. In cases where the resulting networks have more multi-gene pathways, using an edge reduction method may be appropriate.
Another possible use of this method is to use the resulting edge probabilities as an informative prior for another method using a different type of data. This could allow the integration of multiple data sources and increase the usefulness of knockdown data that provide only a small amount of evidence within a larger experimental context.