Foundations of semialgebraic gene-environment networks

Gene-environment network studies rely on data originating from different disciplines such as chemistry, biology, psychology or social sciences. Sophisticated regulatory models are required for a deeper investigation of the unknown and hidden functional relationships between genetic and environmental factors. At the same time, various kinds of uncertainty can arise and interfere with the system's evolution. The aim of this study is to go beyond traditional stochastic approaches and to propose a novel framework of semialgebraic gene-environment networks. A foundation is laid for future research, methodology and application. This approach is a natural extension of interconnected systems based on stochastic, polyhedral, ellipsoidal or fuzzy (linguistic) uncertainty. It allows for a reconstruction of the underlying network from uncertain (semialgebraic) data sets and for a prediction of the uncertain future states of the system. In addition, aspects of network pruning for large regulatory systems in genome-wide studies are discussed, leading to mixed-integer programming (MIP) and continuous programming.


1. Introduction. Gene-environment networks provide a mathematical framework for the analysis and investigation of regulatory networks in genetics, bioinformatics and medicine [21]. The identification of the underlying hidden mechanisms and pathways allows for a deeper understanding of the system under consideration. Typically, different types of environmental factors play a significant role in the internal organisation and development of complex interconnected systems. In particular, environmental variation can be caused by measurable chemical, physical and biological parameters as well as life events or behavioral patterns often expressed in linguistic terms. Data uncertainty associated with such measurements has to be addressed in a sophisticated mathematical model.
It is a major advantage of the mathematical network model that it allows for an investigation of the system by graph-theoretic approaches and that it supports research and analysis in various ways. Both genetic and environmental factors contribute to the etiology of many complex diseases. For this reason, an important application consists in the search for novel genes related to diseases [47]. Other studies put a particular focus on risk factors and pathways for the development of diseases such as depression [44,45] or multiple sclerosis [8]. Such studies are potentially relevant for clinical application. For example, better prognostic models in pharmacogenetics may rely on genotype information and can lead to a more precise prognosis, targeted intervention and personalized treatment [23]. In particular, research can lead to new therapeutic approaches to cognitive and affective disorders in neuropharmacology [31] and to precision medicine in the case of Alzheimer's disease and other kinds of dementia [10].
1.1. Gene-environment network models under uncertainty. In previous studies, we addressed various mathematical gene-environment network models for many different types of data uncertainty [53,54,55]. For metric data, interval arithmetic as well as polyhedral and ellipsoidal uncertainty regions have been considered [17,18]. In the case of linguistic variables such as life events or behavioral traits, fuzzy gene-environment networks can be applied [15,16]. In addition, models based on stochastic differential equations [37] and spline regression approaches [25,26,27] have been discussed previously.
1.2. Semialgebraic uncertainty regions. In this study, we propose a new framework of semialgebraic gene-environment networks. Semialgebraic uncertainty regions for measurements and predictions provide a higher flexibility for the design of regulatory models. They aim at an improvement of accuracy and precision in the reconstruction of the network and the prediction of future states. In particular, this framework includes and further extends polyhedral and ellipsoidal regression approaches [17,18,55].
1.3. Genome-wide association studies. In recent years, genome-wide association studies aimed at an identification of the loci associated with complex diseases [3,22,34]. From a mathematical point of view, very large gene-environment networks arise. In order to reduce the overall complexity and to avoid the curse of dimensionality, methods for network pruning are required. We derive two distinct approaches for gene-environment networks with semialgebraic uncertainty sets. These methods eventually lead to regression approaches based on mixed-integer programming (MIP) and continuous programming.

1.4. Outline. The paper is organized as follows: In Section 2, the generic regulatory model for gene-environment interaction is introduced. Gene-environment networks for clusters of genes and environmental factors are presented in Section 3. Then, in Section 4, we review some basic facts about semialgebraic sets and semialgebraic functions. Section 5 turns to regulatory models based on semialgebraic uncertainty regions. Section 6 states some basic assumptions about the required semialgebraic measurements. Thereafter, Section 7 introduces the regression model for semialgebraic gene-environment networks.
Several forms of network pruning for very large datasets are discussed in Section 8. The corresponding regression problems can be tackled by strategies from MIP and continuous programming. Selected classes of optimization problems and associated algorithmic techniques are presented in Section 9. Finally, we conclude with an outlook on future directions of research along with areas of real-world application problems.
2. Gene-environment networks. Gene-environment regulatory networks distinguish two major groups of entities: a) the genes and b) the environmental factors. The environmental factors modulate the system of genes by external conditions acting as regulatory components, additional controls or disturbances of the genes and the regulatory system itself. The interconnection between the single genes and environmental factors is encoded in the adjacency matrix A. In addition, the functional relationship between genes and/or environmental factors has to be integrated into the modeling of the gene-environment network.
connections between genes and/or environmental factors, where
3. Gene-environment networks for clustered partitions. The regulatory network consists of $n$ genes $G = (G_1, \dots, G_n)^T \in \mathbb{R}^n$ and $m$ environmental factors $E = (E_1, \dots, E_m)^T \in \mathbb{R}^m$. In gene-environment models, functionally related groups of genes or environmental factors often influence the interactions between their components. For this reason, the set of genes, $G$, is divided into $R \in \mathbb{N}$ genetic clusters and the set of all environmental items, $E$, is split into $S \in \mathbb{N}$ environmental clusters. Here, $(C, D)$ is a disjoint cluster partition if $C_{r_1} \cap C_{r_2} = \emptyset$ for all $r_1, r_2 \in \{1, \dots, R\}$ with $r_1 \neq r_2$ and $D_{s_1} \cap D_{s_2} = \emptyset$ for all $s_1, s_2 \in \{1, \dots, S\}$ with $s_1 \neq s_2$. Otherwise, the cluster partition $(C, D)$ is called overlapping. The (crisp) states of the elements of each cluster are given by subsets of the vectors $G$ and $E$. For each $r \in \{1, \dots, R\}$ the $|C_r|$-subvector $G_r$ of $G$ given by the indices of $C_r$ is a cluster vector of the genetic cluster $C_r$. For each $s \in \{1, \dots, S\}$ the $|D_s|$-subvector $E_s$ of $E$ defined by the indices of $D_s$ is a cluster vector of the environmental cluster $D_s$. After these preparations, the notion of a gene-environment network with cluster partition $(C, D)$ can be introduced.
connections between target clusters and/or environmental clusters, v) the parameter-dependent functional relationships, which are set-valued functions depending on the cluster partitions of genetic clusters and/or environmental clusters. Here, $(\theta_1, \theta_2) \in \Theta_1 \times \Theta_2 \subset \mathbb{R}^{\ell_1} \times \mathbb{R}^{\ell_2}$, $\ell_1, \ell_2 \in \mathbb{N}$, denotes the unknown parameter vector.
4. Semialgebraic uncertainty regions. In this study, semialgebraic sets represent the uncertain states of each genetic and environmental cluster. We recall that a semialgebraic set in $\mathbb{R}^p$, $p \in \mathbb{N}$, has the form
$$\bigcup_{i=1}^{s} \bigcap_{j=1}^{r_i} \{ x \in \mathbb{R}^p \mid f_{i,j}(x) \, *_{i,j} \, 0 \},$$
where $f_{i,j} \in \mathbb{R}[X_1, \dots, X_p]$ and $*_{i,j} \in \{>, <, =\}$ for $i = 1, \dots, s$ and $j = 1, \dots, r_i$.
The set of semialgebraic sets in $\mathbb{R}^p$ is denoted by $\mathcal{S}_p$. The family of semialgebraic sets in $\mathbb{R}^p$ is closed with respect to finite intersections, finite unions and complements. For this reason, it allows us to introduce a semialgebraic calculus on semialgebraic sets. In terms of the sign conditions $f(x) > 0$, $f(x) < 0$ or $f(x) = 0$ of a polynomial $f$, a semialgebraic set $S \subset \mathbb{R}^p$ is obtained by a finite boolean combination (i.e., by disjunction, conjunction and negation) of sign conditions on a finite number of polynomials; see [5]. Let $A \subset \mathbb{R}^p$ and $B \subset \mathbb{R}^q$ be semialgebraic sets. A mapping $f : A \to B$ is called semialgebraic if its graph is a semialgebraic subset of $\mathbb{R}^{p+q}$. In particular, each polynomial mapping $f : A \to B$ is semialgebraic. In addition, each regular rational mapping $g : A \to B$ (all its coordinates are rational fractions whose denominators do not vanish on $A$) is semialgebraic. Furthermore, the image and the inverse image of a semialgebraic set under a semialgebraic mapping are semialgebraic. For further details on semialgebraic structures we refer to [5,7].
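The closure properties recalled above can be illustrated by a small Python sketch. It is not part of the framework itself: the class name `SemialgebraicSet`, its clause representation and the example regions are assumptions made for illustration only. A set is stored as a finite union of finite intersections of sign conditions, and union, intersection and complement again produce sets of this form.

```python
import operator
from dataclasses import dataclass

# The three admissible sign conditions f(x) > 0, f(x) < 0, f(x) = 0.
SIGN = {">": operator.gt, "<": operator.lt, "=": operator.eq}
# Negating one sign condition yields the union of the two others.
NEGATED = {">": ("<", "="), "<": (">", "="), "=": (">", "<")}


@dataclass(frozen=True)
class SemialgebraicSet:
    """Finite union of finite intersections of polynomial sign conditions.

    clauses[i] is a tuple of (polynomial, sign) pairs that must all hold;
    a point belongs to the set if at least one clause is satisfied.
    """
    clauses: tuple

    def contains(self, x):
        return any(all(SIGN[s](f(x), 0) for f, s in clause)
                   for clause in self.clauses)

    def union(self, other):
        return SemialgebraicSet(self.clauses + other.clauses)

    def intersection(self, other):
        # Distribute the intersection over both unions of clauses.
        return SemialgebraicSet(tuple(a + b for a in self.clauses
                                      for b in other.clauses))

    def complement(self):
        # De Morgan: start from the whole space (one empty clause) and
        # intersect with the negation of every clause.
        result = SemialgebraicSet(((),))
        for clause in self.clauses:
            negated = SemialgebraicSet(tuple(((f, t),) for f, s in clause
                                             for t in NEGATED[s]))
            result = result.intersection(negated)
        return result


def disc_polynomial(x):
    # 1 - x1^2 - x2^2 is positive exactly inside the open unit disc.
    return 1 - x[0] ** 2 - x[1] ** 2


def first_coordinate(x):
    return x[0]


# Hypothetical example regions in the plane.
open_disc = SemialgebraicSet((((disc_polynomial, ">"),),))
half_plane = SemialgebraicSet((((first_coordinate, ">"),),))
right_half_disc = open_disc.intersection(half_plane)
outside_disc = open_disc.complement()
```

Note that the equality condition is evaluated in floating-point arithmetic here, so membership tests on boundaries are only reliable for exactly representable values; a symbolic representation would be needed for exact computations.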

5. Gene-environment networks under semialgebraic uncertainty. The uncertain states of the gene-environment network are represented in terms of semialgebraic sets. Here, we assume that $(C, D)$ denotes a static cluster partition with $R$ genetic clusters and $S$ environmental clusters that does not change over the time interval $\mathcal{T} \subset \mathbb{N}$, where $\mathcal{T} = \{0, 1, \dots, T\}$ and $\mathcal{T}^{+} = \{0, 1, \dots, T, T + 1\}$ for a given $T \in \mathbb{N}$. In the following, measurements are indicated by a bar and predictions are given by a hat (e.g., $\overline{G}_i$ and $\widehat{G}_i$ denote a measurement and a prediction of gene $i$, respectively). In addition, for each $t \in \mathcal{T}$ we set [...]. Semialgebraic prediction models of degree $D = 1$ are defined as evolving semialgebraic network models that only depend on the previous state in time.
Definition 5.1. A parameter-dependent gene-environment network model of degree $D = 1$ under semialgebraic uncertainty is defined as: [...] where the functional relationships are semialgebraic functions depending on the semialgebraic representations of genetic clusters and/or environmental clusters. The initial values $G_r^{(0)} \in \mathcal{S}_{|C_r|}$, $r \in \{1, \dots, R\}$, and $E_s^{(0)} \in \mathcal{S}_{|D_s|}$, $s \in \{1, \dots, S\}$, are semialgebraic uncertainty sets. Figure 3 illustrates the various interactions between genetic clusters and/or environmental clusters as well as the corresponding semialgebraic uncertainty sets.
The above definitions can be further extended to models that take $T + 1$ previous states of the system into account.
6. Semialgebraic measurements and predictions. In the sequel, a regression approach is introduced for parameter identification of semialgebraic prediction models. The semialgebraic regression strategy compares semialgebraic measurements and parameter-dependent semialgebraic predictions obtained with model (SPM). At a given time $\kappa \in \mathbb{N}$, the number of previous measurements required for the prediction depends on two factors: a) the number of measurements, $K \in \mathbb{N}$, needed to determine the parameter vector $\theta \in \Theta$, and b) the number of previous states, $D \in \mathbb{N}_0$, used in the corresponding model to compute a prediction.
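To make the comparison of a measured set with a parameter-dependent predicted set more tangible, the following Python sketch estimates the size of their overlap in the plane and selects, from a finite list of candidates, the parameter with the largest overlap. Everything in it is an illustrative assumption rather than the regression model of this paper: the sets are open discs, the prediction depends on a single scalar parameter `theta` that shifts the disc, and the size of a set is approximated by counting grid-cell midpoints, which is one simple measure that cannot decrease when a set is enlarged.

```python
def grid_midpoints(lo, hi, n):
    """Midpoints of n equal cells covering the interval [lo, hi]."""
    step = (hi - lo) / n
    return [lo + (k + 0.5) * step for k in range(n)]


def overlap_size(measurement, prediction, lo=-3.0, hi=3.0, n=60):
    """Approximate area of the intersection of two planar sets.

    Both sets are given as membership tests; the area is estimated by
    counting the grid-cell midpoints that lie in both sets.
    """
    cell_area = ((hi - lo) / n) ** 2
    pts = grid_midpoints(lo, hi, n)
    hits = sum(1 for x in pts for y in pts
               if measurement((x, y)) and prediction((x, y)))
    return hits * cell_area


def open_disc(cx, cy, radius):
    """Membership test of the set {r^2 - (x-cx)^2 - (y-cy)^2 > 0}."""
    return lambda p: radius ** 2 - (p[0] - cx) ** 2 - (p[1] - cy) ** 2 > 0


# Hypothetical measured uncertainty set.
measurement = open_disc(0.5, 0.0, 1.0)


def prediction(theta):
    """Hypothetical predicted set: a unit disc shifted by theta."""
    return open_disc(theta, 0.0, 1.0)


# Pick the candidate parameter whose prediction overlaps most.
candidates = [k / 10 for k in range(-10, 21)]
best_theta = max(candidates,
                 key=lambda th: overlap_size(measurement, prediction(th)))
```

In this toy setting the search returns the shift at which the predicted disc coincides with the measured one. A realistic model would replace the grid count by a suitable size measure and the candidate list by an optimization method.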
During the regression process, the measurements have to be compared to the semialgebraic predictions (see Figure 4). The optimal parameter vector $\theta \in \Theta$ maximizes the intersection of the semialgebraic measurements and the semialgebraic predictions. The intersections of measurement values and predictions (both semialgebraic sets) are again semialgebraic sets. The size of the semialgebraic intersections is measured by a so-called criteria function. Here, regular criteria functions are applied that are monotonically increasing with respect to inclusion.
Network pruning is known as an efficient solution in different real-world applications. For example, deep convolutional neural networks (CNNs) have recently become deeper to attain high efficiency in various applications. In spite of their remarkable success, it is not always practical to apply them to resource-constrained problems. An effective way to deal with this problem is to apply network pruning in order to decrease the computational cost of over-parameterized CNNs [9]. In the sequel, two different methods are introduced for network pruning. Since network nodes are often regulated by only a few entities, the number of ingoing and outgoing branches should be constrained. We propose two approaches: The first method imposes binary constraints and leads to a mixed-integer regression problem. In this way, the number of branches can be considerably reduced. Nevertheless, binary constraints are very strict and important branches may be deleted if the constraints are chosen inappropriately. For this reason, the second method relies on continuous constraints that may be interpreted as probabilities, membership assignments or fuzzy values and leads to a problem that can be solved by continuous programming.
The binary values $\chi^{GG}_{jr}, \chi^{EG}_{js}, \chi^{GE}_{ir}, \chi^{EE}_{is} \in \{0, 1\}$ indicate whether or not pairs of clusters in the gene-environment regulatory network are directly connected.
For a specific genetic cluster $C_j$, the indegrees $\deg(C_j)^{GG}$ and $\deg(C_j)^{EG}$ count the number of genetic and environmental clusters that regulate cluster $C_j$. In the same way as before, bounds on the outdegree (i.e., the number of outgoing branches) can be introduced. Firstly, binary values are defined to determine whether or not there is an outgoing connection. The outdegrees $\deg(C_j)^{GG}_{out}$ and $\deg(C_j)^{EG}_{out}$ count the number of genetic and environmental clusters regulated by cluster $C_j$.
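The counting of ingoing and outgoing branches can be sketched in a few lines of Python. The storage convention is an assumption made only for this illustration: `chi[j][r] = 1` means that source cluster `r` regulates target cluster `j`, so that indegrees are row sums and outdegrees are column sums. The matrices and bounds are invented example values.

```python
def indegree(chi, j):
    """Number of source clusters regulating target cluster j (row sum)."""
    return sum(chi[j])


def outdegree(chi, r):
    """Number of target clusters regulated by source cluster r (column sum)."""
    return sum(row[r] for row in chi)


def within_bounds(chi, max_in, max_out):
    """Check upper bounds on all indegrees and outdegrees of one matrix."""
    rows_ok = all(indegree(chi, j) <= max_in[j] for j in range(len(chi)))
    cols_ok = all(outdegree(chi, r) <= max_out[r]
                  for r in range(len(chi[0])))
    return rows_ok and cols_ok


# Hypothetical network with 3 genetic and 2 environmental clusters.
# chi_GG[j][r] = 1: genetic cluster r regulates genetic cluster j.
chi_GG = [[0, 1, 0],
          [1, 0, 1],
          [0, 0, 0]]
# chi_EG[j][s] = 1: environmental cluster s regulates genetic cluster j.
chi_EG = [[1, 0],
          [0, 0],
          [1, 1]]

# Genetic cluster 1 has two genetic regulators, so a bound of one
# ingoing genetic branch per cluster is violated.
feasible_loose = within_bounds(chi_GG, max_in=[2, 2, 2], max_out=[1, 1, 1])
feasible_tight = within_bounds(chi_GG, max_in=[1, 1, 1], max_out=[1, 1, 1])
```

In a pruning problem the binary entries would be decision variables of a mixed-integer solver rather than fixed data; the sketch only shows how the degree constraints are evaluated for a given connection pattern.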

8.1.3. Restrictions on ingoing/outgoing branches for each cluster. As mentioned before, upper bounds on the indegrees and the outdegrees of the nodes (clusters) are introduced with the aim of network pruning. These values have to be given by the practitioner and they can depend on any a priori information. If these additional constraints are integrated into the regression model, a mixed-integer optimization problem for network pruning, referred to as (MI1), is obtained. In model (MI1), individual bounds on the indegrees and outdegrees of each genetic and environmental cluster are imposed in order to control the connectivity of the gene-environment network. This approach is an extension of the regression problems with bounds on the indegrees discussed in [17,18]. Mixed-integer problems for an analysis of gene-environment networks based on interval arithmetic were presented in [50,51,53].
8.1.4. Restrictions on the total number of ingoing/outgoing branches. In model (MI1), the numbers of connections of each cluster with other genetic and environmental clusters are considered separately. In a further step, these bounds can be combined to impose restrictions on the total number of all ingoing or outgoing branches of each cluster, which leads to model (MI2).
8.2. Continuous programming. The binary constraints in (MI1) and (MI2) are very strict and, if the constraints are not chosen in an appropriate way, important branches of the regulatory network could be deleted. For this reason, continuous optimization is applied for a relaxation of (MI1) by replacing the binary variables $\chi^{GG}_{jr}, \chi^{EG}_{js}, \chi^{GE}_{ir}, \chi^{EE}_{is} \in \{0, 1\}$ with real variables $P^{GG}_{jr}, P^{EG}_{js}, P^{GE}_{ir}, P^{EE}_{is} \in [0, 1]$. These variables can also be interpreted as probabilities or fuzzy membership values (see [35] for optimization models with probabilistic constraints). In addition, the variables should depend linearly on the corresponding subvectors of the parameter vector $\theta \in \Theta$ denoted by $\theta^{GG}_{jr}, \theta^{EG}_{js}, \theta^{GE}_{ir}, \theta^{EE}_{is}$.
Similarly, the indegree of cluster $D_i$ is adapted to take real values. In addition, the outdegree of environmental clusters can be adapted to comprise real values.
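The relaxation from binary to real-valued connection variables can be sketched as follows in Python. The sketch rests on assumptions that are not taken from the paper: the linear dependence on a parameter subvector is modelled by a hypothetical weight vector, and the values are stored so that row `j` collects all variables of the branches entering cluster `j`. A real-valued indegree is then simply the sum of these values.

```python
def linear_membership(theta_sub, weights):
    """Connection value in [0, 1] that depends linearly on a subvector.

    The weights are a hypothetical choice; the function only checks that
    the resulting value is admissible for the relaxed problem.
    """
    value = sum(w * t for w, t in zip(weights, theta_sub))
    if not 0.0 <= value <= 1.0:
        raise ValueError("connection value must lie in [0, 1]")
    return value


def soft_indegree(P, j):
    """Real-valued indegree of target cluster j (row sum of P)."""
    return sum(P[j])


def within_soft_bounds(P, max_in):
    """Check real-valued upper bounds on all indegrees."""
    return all(soft_indegree(P, j) <= max_in[j] for j in range(len(P)))


# Hypothetical relaxed connection values for two target clusters.
P_GG = [[0.5, 0.25, 0.25],   # three weak branches, total weight 1.0
        [1.0, 1.0, 0.0]]     # two full branches, total weight 2.0

# With a bound of 1.0 the first cluster keeps all three weak branches,
# whereas a binary model with bound 1 would have to delete two of them.
row_feasible = [soft_indegree(P_GG, j) <= 1.0 for j in range(len(P_GG))]
example_value = linear_membership([2.0, 4.0], [0.125, 0.125])
```

This illustrates why the continuous formulation is less strict: several weakly supported branches can coexist as long as their combined weight respects the bound.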

8.2.3. Restrictions on ingoing/outgoing branches for each cluster. By replacing the strict binary constraints of the mixed-integer problem (MI1) with the aforementioned 'soft constraints', the following continuous programming problem for network pruning is obtained: [...] ($j = 1, \dots, R$, $i = 1, \dots, S$)

8.2.4. Restrictions on the total number of ingoing/outgoing branches. It is also possible to combine these real-valued restrictions into bounds on the total number of all ingoing or outgoing branches of each cluster and thereby to obtain a relaxation of model (MI2) in terms of a continuous programming problem.
9. Selected classes of optimization problems and their algorithmic techniques. This paper is devoted to the foundations of semialgebraic gene-environment networks, i.e., to the basis for the future development of algorithmic methods to assess those networks and for real-world applications. The following refined classes of techniques naturally suggest themselves for the identification, optimization and extension of our new class of networks under uncertainty: (i) Tchebychev Approximation [53], (ii) Semi-Infinite Optimization [29], (iii) Generalized Semi-Infinite Optimization [48,49,56], (iv) Bi-level and Multilevel Optimization [36], (v) Disjunctive Optimization [2], (vi) Robust Optimization [24,39,42], (vii) Conic Optimization [4], (viii) Optimal Control [1], and (ix) Stochastic Optimal Control [33]. Concerning classes of future real-world applications, we would like to recommend emerging challenges of, for example, (a) Collaborative Games under Ellipsoidal Uncertainty or (per inner or outer approximations) Hypercube Uncertainty [11,52], (b) Transportation ("Piano Mover's" and many more) problems [32,40], (c) Supply Chain and Inventory Management [12,20,30,41,43,57] and Production Planning [38], (d) various kinds of Design problems [46], (e) Artificial Intelligence and Machine Learning (e.g., "Infinite Kernel Learning") [6,13,28], and (f) Finance, Actuarial Sciences and Pension Fund Systems [14,19,55].
10. Conclusion. In this study, we lay the foundations of the novel concept of semialgebraic gene-environment networks. In this approach, the uncertain states of genes and environmental factors are represented by semialgebraic sets.
The functional relationships between the genes and/or environmental factors are given by semialgebraic functions. This approach further extends previous set-theoretic regression strategies for gene-environment networks under uncertainty based on intervals, polyhedral and ellipsoidal uncertainty and the corresponding continuous programming. Gene-environment studies often rely on a mixture of measurable chemical, physical and biological data as well as data on behavioral patterns and life events. The system states have to reflect different types of uncertainty such as probabilities and fuzzy values. It is a major advantage of the proposed semialgebraic gene-environment networks that the generic structure of semialgebraic uncertainty sets allows for a simultaneous consideration of such data. In addition, it provides a major computational advantage. Uncertain system states can be described by a finite set of intersections and unions of underlying regions of uncertainty. In this way, a semialgebraic calculus is introduced that forms the basis for the semialgebraic regression model. Future studies will further explore graph-theoretic characteristics of gene-environment networks under semialgebraic uncertainty. In addition, set-theoretic notions of boundedness, stability and reachability can support the analysis since they are often associated with a critical deviation in bio-systems. Furthermore, the effects of exogenous disturbances have to be discussed in terms of robustness of the gene-environment regulatory system. The combination of graph-theoretic network analysis and the behaviour of trajectories of the underlying regulatory system offers a promising avenue for gene-environment studies with more complex uncertainty regions in genetics, neuroscience and psychobiology.