STATISTICAL QUERY-BASED RULE DERIVATION SYSTEM BY BACKWARD ELIMINATION ALGORITHM

Abstract. Computers play a significant role in human life, especially through web-based applications running twenty-four hours a day. These applications are built on relational database management systems and receive many queries from their users. In commercial systems, these queries are executed one by one, without any consideration of past experience or data analysis. Query execution could be faster if rules were derived from past queries. In this paper, we propose a statistical query-based rule derivation system using the backward elimination algorithm, which analyses the data retrieved by past queries in order to derive new rules, and then uses these rules in the execution of new queries. Computational results are presented and analysed, showing that the system is efficient and promising.


1. Introduction. The approach of Semantic Query Optimisation (SQO) was first researched as semantic reasoning by Hammer and McLeod [8]. It was then advanced by King [13], and complete systems were developed in the late 1980s [7,3,24]. The main advantage of SQO is its ability to derive rules and then use them to reduce the execution time of future queries. The use of rules from past experience makes the optimiser more intelligent. The approach has four components: Query Representation, Query Optimisation, Rule Maintenance and Automatic Rule Derivation.
The first component of the approach represents a query in a query language in order to express it in a way that can be understood by the chosen database environment. Many Relational Database Management Systems use the Structured Query Language (SQL); internally, relational algebra and query trees are mainly used for this purpose [3,20,10,6].
The second component optimises a given query. In the optimisation process, two special situations may occur, named Query Answering and Query Refutation. In the first situation, the answer set can be found from the matching rules alone, without any need to access the database. In the second situation, part of the user query conflicts with a matching rule in the rule set, so the answer set must be empty. These special situations are the most profitable occurrences of the approach, with substantial time savings (99%) reported in [20]. Otherwise, the query optimiser constructs alternative queries using the rule set and then selects the best alternative as the optimum query. In most situations, the execution time of queries can be reduced, especially when the consequent of a rule contains the primary key or an indexed attribute [13,7,3,20,10,6,1]. Moreover, if n is the number of matching rules, the number of alternative queries is given by (2^n - 1). For example, if 10 matching rules are found for the original query, then the total number of alternative queries is 1023; the number of alternative queries therefore increases rapidly with n. For this reason, two time-consuming problems occur: how to construct the alternative queries, and how to optimise the alternative queries to select the optimum query. Although these problems are not directly within the scope of this paper, if the number of rules can be limited in the rule derivation process, this reduces the construction time and the query optimisation time indirectly. For this purpose, our system can be used efficiently.
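To illustrate this growth, a short sketch can verify the (2^n - 1) count; the function names here are our own illustration, not part of any described system:

```python
from itertools import combinations

def alternative_query_count(n_rules: int) -> int:
    # Every non-empty subset of the n matching rules yields one
    # alternative query, giving 2^n - 1 alternatives in total.
    return 2 ** n_rules - 1

def rule_subsets(rules):
    # Enumerate the non-empty subsets that generate the alternatives.
    for size in range(1, len(rules) + 1):
        yield from combinations(rules, size)

print(alternative_query_count(10))                       # 1023
print(sum(1 for _ in rule_subsets(["r1", "r2", "r3"])))  # 7
```

Even a modest number of matching rules thus makes exhaustive enumeration of alternatives expensive, which motivates limiting the rule set.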
The third component keeps the rules up to date as the database changes, because these rules represent a time-dependent property of the database at a particular database state. When updates are made to the database, these rules may become invalid. Therefore, derived rules need to be maintained so that they stay consistent with the current database state [25,12,14,18].
The last component is automatic rule derivation, which learns new rules based on the conditions of the given query and the query answer from the database. There are a number of automatic rule derivation techniques [8,13,7,3,24,9,23,2,22]. Despite the advantages of the SQO approach, automatic rule derivation takes a considerably long time, and this time increases with the database size. Another problem is that the total number of rules in the rule set may grow rapidly. In order to solve these problems, many researchers suggest using statistics to measure the dependencies in the database [8,13,7,3,24,20,9,23,2,22,4,15,11,5,16].
The traditional approach of Semantic Query Optimisation (SQO) in Figure 1 starts with the user's query, which is then rewritten in the rule representation [8,13,7,3,24]. If the represented query contains a SELECT clause of the view definition language, query optimisation takes the parts of the query in order to execute them in the relational database system. After the query execution, the answer set of the query and the parts of the represented query are used to trigger automatic rule derivation in order to learn new rules (if possible), which are then saved in the rule set. These rules are used for query transformation and optimisation in upcoming queries: when the next query enters the system, its parts are compared with the rules from the rule set in the rule matching step, and when a match is found, the matched rule(s) can be used to construct alternative queries in order to select the optimum query, which can be executed faster than the original query given by the user. If the clause is one of the INSERT, UPDATE and DELETE statements of the data manipulation language, it is passed to the rule maintenance manager, which ensures the accuracy of the rule set. Another task of the rule maintenance manager in Figure 1 is to keep the rule set accurate for newly arriving rules that have the same antecedent attribute and consequent attribute but different comparison operators and values, so that their boundaries are manipulated correctly. For this purpose, our previous papers can be taken into account [21,19,17].
In this paper, a Statistical Query-based Rule Derivation System by the Backward Elimination Algorithm is proposed. Our system takes the condition(s) of the original query and the answer set of the query, then groups the attributes of the relation as dependent and independent attributes. After that, it examines the values of these attributes to see whether they are related, by the use of the backward elimination algorithm. If they are not, the first elimination is done.
Otherwise, the system continues with multiple regression analysis and calculates the standardised regression coefficients. It then estimates the coefficient ratio, which is the arithmetic mean of the coefficients of the linear multiple regression analysis. The second elimination is done by the use of this calculated ratio. Last but not least, the system ends with rule construction using the remaining attributes.
Our system is efficient, fast and dynamic. It can work on any table in any database at any time, without any need for reconstruction. Computational results show that it limits the number of rules easily and can reduce both the rule derivation time and the query optimisation time. The main concepts of the SQO approach are explained in Section 2. Our automatic rule derivation system is then introduced in Section 3. Computational results are given in Section 4. Finally, conclusions are drawn in Section 5.
2. Main concepts of semantic query optimisation. There are two main concepts of SQO, named Simple Rules (described in Subsection 2.1) and Semantically Equivalent Queries (described in Subsection 2.2).

2.1. Simple rules. The vast majority of researchers on this approach use the simple rule, which has one condition on the left-hand side (called the Antecedent) and one condition on the right-hand side (called the Consequent). This type of rule cannot contain more than one condition on either side. Other rules can be regarded as complex, having more conditions on one side. Although complex rules can be expressed, using them in query optimisation becomes too difficult, and rule maintenance becomes problematic. Therefore the rule representation is limited to simple rules [8,13,7,3,24,20,10,6,1,25,12,14,18,9,23,2,22]. Assume a table R(x, y, z) where x, y and z are attributes of R. A rule can be formed as x (+) a -> y (x) b, where a and b are constant values and (+) and (x) are comparison operators from <, <=, >, >=, =, !=. When the antecedent is satisfied, the consequent has to be true. A rule can be defined in the rule set as in Table 1.

2.2. Semantically equivalent queries. Another important concept of the SQO approach is Semantically Equivalent Queries, described as the heart of the SQO approach [13], because the main idea of SQO is to use rules to transform a given query into alternative queries that have the same answer set as the given query but are constructed semantically. For example, assume that the following two rules are learned for the EMPLOYEE table and kept in the rule set:
R1: If the employee's name is Ayla SAYLI, the salary is higher than 40K: Name = 'Ayla SAYLI' -> Salary > 40K
R2: If the employee's name is Ayla SAYLI, the social security number (SSN) is 98765: Name = 'Ayla SAYLI' -> SSN = 98765
If we are looking for the details of the employee whose name is Ayla SAYLI, the original query can be represented in SQL as follows: SELECT * FROM EMPLOYEE WHERE Name = 'Ayla SAYLI'; Using R1 and R2, alternative queries Q1, Q2 and Q3 can be built that are semantically equivalent to the original. These alternatives are then compared with each other in the query optimisation component to determine the most cost-efficient query.
Further information on cost estimation can be found in [20].
Two other situations can happen during query processing, as mentioned in Section 1: Query Answering and Query Refutation. For Query Answering, if we are looking for the social security number of the employee whose name is Ayla SAYLI, the query can be represented as follows: SELECT SSN FROM EMPLOYEE WHERE Name = 'Ayla SAYLI'; The answer set of this query is found as 98765 by the use of R2, without accessing the database.
For Query Refutation, if we are looking for the details of the employee whose social security number is 98764 and whose name is Ayla SAYLI, the query can be represented as follows: SELECT * FROM EMPLOYEE WHERE SSN = 98764 AND Name = 'Ayla SAYLI'; The answer set of the query has to be NULL by the use of R2, which states that this employee's SSN is 98765. Therefore we do not need to search the database, as in Query Answering.
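A minimal sketch of how a simple rule can support Query Answering and Query Refutation follows. The EMPLOYEE example and rule R2 come from the text above, but the class names and matching logic are our own illustrative assumptions (restricted to equality comparisons for brevity):

```python
from dataclasses import dataclass

@dataclass
class SimpleRule:
    # A simple rule: antecedent attribute/value -> consequent attribute/value.
    ante_attr: str
    ante_val: object
    cons_attr: str
    cons_val: object

R2 = SimpleRule("Name", "Ayla SAYLI", "SSN", 98765)

def answer_or_refute(rule, conditions):
    """conditions: dict attr -> required value from the WHERE clause.
    Returns ('answer', value), ('refute', None) or ('no-match', None)."""
    if conditions.get(rule.ante_attr) != rule.ante_val:
        return ("no-match", None)
    if rule.cons_attr in conditions and conditions[rule.cons_attr] != rule.cons_val:
        return ("refute", None)       # query conflicts with the rule -> NULL
    return ("answer", rule.cons_val)  # consequent answers the query

# Query Answering: SELECT SSN ... WHERE Name = 'Ayla SAYLI'
print(answer_or_refute(R2, {"Name": "Ayla SAYLI"}))                # ('answer', 98765)
# Query Refutation: ... WHERE SSN = 98764 AND Name = 'Ayla SAYLI'
print(answer_or_refute(R2, {"Name": "Ayla SAYLI", "SSN": 98764}))  # ('refute', None)
```

In both special cases, the database is never accessed, which is why these situations are the most profitable for the optimiser.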
It is necessary to underline the differences between the concept of Semantically Equivalent Queries and that of Logically Equivalent Queries: in some situations, queries can be semantically equivalent but not logically equivalent. The first difference is that logical equivalence is based on established logical rules such as De Morgan's laws, composition, decomposition and transitivity. This makes logically equivalent queries give the same answer set for all database states, whereas semantic equivalence changes according to the database state. The second difference is that semantic equivalence is based on rules derived before the construction of the alternative queries, and these derived rules may be violated during SQO when database updates are made. Therefore, logically equivalent queries are clearly semantically equivalent, but semantically equivalent queries may not be logically equivalent [8,13,7,3,24,20,10,6,1,25,12,14,18,9,23,2,22].
3. Statistical query-based rule derivation system by backward elimination algorithm. As mentioned before, many researchers suggest using statistics to analyse the data in order to derive simple rules [8,13,7,3,24,20,9,23,2,22,4,15,11,5,16]. In the statistical literature, there are different ways to find the relationship between dependent and independent attributes, which are mainly calculated with correlations. In this paper, we aim to eliminate the weak independent attribute(s) in order to limit the number of newly derived rules. Therefore we use multicollinearity from statistics for the elimination, as described in Subsection 3.2. There are three approaches to eliminating attributes: "Backward Elimination", "Forward Selection" and "Bidirectional Elimination". Backward elimination starts with all candidate variables, tests the deletion of each variable using a chosen model comparison criterion, deletes the variable (if any) whose deletion improves the model the most, and repeats this process until no further improvement is possible. Forward selection starts with no variables in the model, tests the addition of each variable using a chosen model comparison criterion, adds the variable (if any) that improves the model the most, and repeats this process until none of the remaining variables improves the model. Bidirectional elimination, a combination of the first two, tests at each step whether variables should be included or excluded. In this study, we use backward elimination, which is shown in Subsection 3.2. Our system is pictured in Figure 2. It has six levels, namely the User Level, Query Parsing Level, First Elimination Level, Regression Level, Second Elimination Level and Rule Construction Level. The User Level takes the original query of the user and the answer set of the query.
The Query Parsing Level decomposes the original query into the requested attribute(s), the resource(s) and the condition(s) of the query, in order to specify the dependent and independent attributes using the answer set. The First Elimination Level takes the dependent and independent attributes with their data in the answer set and executes the backward elimination algorithm given in Subsection 3.2. By the use of this algorithm, the first elimination is done, removing independent attributes with high VIF values (this factor is described briefly in Subsection 3.2). The Regression Level takes the remaining attributes and calculates their standardised regression coefficients. The Second Elimination Level uses these coefficients to compute the Coefficient Ratio; independent attributes are then eliminated if their coefficient value is lower than the ratio. Last but not least, the Rule Construction Level constructs new rules based on the remaining independent attributes, which become the attributes of the consequents of the rules. This level is detailed in Subsection 3.5. The system is built in the Visual Basic programming language on top of a SQL Server database. The data analysis is done dynamically, using backward elimination to analyse the records. Our software system, based on this automatic elimination, also works with the up-to-date information in any given database, which can change at any time, without any need to reconstruct the complete system.
3.1. Query Parsing Level. This level decomposes the original query, given in SQL, into the requested attribute(s), the resource(s) and the condition(s) of the query. The attribute of each condition is then chosen as a dependent attribute, a candidate for the antecedent attribute of a new rule, and the attributes of the result set of the query are chosen as independent attributes, in order to check whether any of them can be the consequent attribute of the new rule after all levels have been executed.

3.2. First Elimination Level. In statistics, a simple way to detect multicollinearity among the independent attributes is to estimate the Variance Inflation Factors (VIF) [22]: the higher the value, the higher the multicollinearity. The VIF for a single independent attribute is obtained from the R-squared value of the regression of that attribute against all other independent attributes:

VIF_n = 1 / (1 - R_n^2),    (1)

where R_n^2 is the coefficient of determination of the regression for attribute n. The VIF is calculated for each independent attribute, and those with high values are removed. The definition of "high" is somewhat arbitrary, but values in the range 5-10 are commonly used. We opted to use 10 as the threshold in our experiments; when we used 5 and several other thresholds, the results did not change. The backward elimination algorithm based on the VIF is given in Figure 3 and can be described in five steps.
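As a sketch, the VIF-based backward elimination of the First Elimination Level can be implemented as below. The pure-Python least-squares helpers and the made-up data are our own illustration (the paper's implementation is in Visual Basic), and the code is numerically naive:

```python
def _solve(A, y):
    # Gaussian elimination with partial pivoting for A b = y.
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [u - f * v for u, v in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def _r_squared(y, X):
    # R^2 of regressing y on the columns of X (with an intercept).
    rows = [[1.0] + list(x) for x in X]
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    b = _solve(XtX, Xty)
    yhat = [sum(bi * ri for bi, ri in zip(b, r)) for r in rows]
    ybar = sum(y) / len(y)
    ss_t = sum((yi - ybar) ** 2 for yi in y)
    ss_e = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    return 1.0 - ss_e / ss_t if ss_t else 0.0

def vif_backward_elimination(columns, data, threshold=10.0):
    # Repeatedly drop the attribute with the largest VIF above the
    # threshold, where VIF_n = 1 / (1 - R_n^2) comes from regressing
    # attribute n on all remaining attributes.
    cols = list(columns)
    while len(cols) > 1:
        vifs = {}
        for c in cols:
            y = [row[c] for row in data]
            X = [[row[o] for o in cols if o != c] for row in data]
            r2 = min(_r_squared(y, X), 0.999999)  # guard: keep VIF finite
            vifs[c] = 1.0 / (1.0 - r2)
        worst = max(vifs, key=vifs.get)
        if vifs[worst] <= threshold:
            break
        cols.remove(worst)
    return cols

# 'b' is (almost exactly) 2*'a', so one of that collinear pair is
# eliminated, while the unrelated attribute 'c' survives.
noisy_b = [0.05, 2.01, 3.98, 6.02, 8.03, 9.97, 12.01, 13.96]
rows = [{"a": i, "b": noisy_b[i], "c": c}
        for i, c in enumerate([3, 1, 4, 1, 5, 9, 2, 6])]
print(vif_backward_elimination(["a", "b", "c"], rows))
```

The guard on R^2 is needed because a near-perfect fit would otherwise make the VIF overflow.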
3.3. Regression Level. Multiple regression analysis is a natural extension of two-attribute regression analysis: instead of a single independent attribute, two or more independent attributes are used to estimate the values of a dependent attribute, while the fundamental concepts of the analysis remain the same. Using the least squares method, the regression equation is derived. The method uses the linear multiple regression equation

Y = b_0 + b_1 X_1 + b_2 X_2 + ... + b_k X_k,    (2)

where b_0, b_1, ..., b_k are coefficient values, which reflect each independent attribute's influence on the dependent attribute. It becomes difficult to estimate the coefficients as the number of attributes grows, so for the general case matrices are used. Assuming the matrix of independent attribute values X, the vector of dependent attribute values Y and the unknown coefficient vector b, the b matrix can be calculated as

b = (X^T X)^(-1) X^T Y.    (3)

The elements of the b matrix are the coefficients of the multiple regression. After calculating these elements, the linear function can be constructed. In order to make the elimination, the coefficients have to be standardised. Therefore the coefficients of multiple determination are given in Subsection 3.3.1 and the standardised regression coefficients are described in Subsection 3.3.2 in detail.
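The normal-equations estimate b = (X^T X)^(-1) X^T Y can be sketched in a few lines of code; the helper below and its toy data are our own illustration, not the paper's implementation:

```python
def fit_multiple_regression(X, y):
    # Least-squares coefficients for y ~ b0 + b1*x1 + ... + bk*xk,
    # obtained by solving the normal equations (X^T X) b = X^T y.
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    # Solve XtX * b = Xty by Gaussian elimination with partial pivoting.
    M = [XtX[i][:] + [Xty[i]] for i in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(p):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [u - f * v for u, v in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]

X = [(1, 0), (2, 1), (3, 0), (4, 1)]
y = [5, 13, 11, 19]                    # generated from y = 2 + 3*x1 + 5*x2
print(fit_multiple_regression(X, y))   # ~ [2.0, 3.0, 5.0]
```

Since the data was generated exactly from the linear model, the fitted coefficients recover b0 = 2, b1 = 3, b2 = 5 up to floating-point error.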

3.3.1. Coefficients of multiple determination. A measure of association between a dependent attribute and several independent attributes is the multiple correlation coefficient R, where 0 <= R <= 1. The multiple coefficient of determination can be calculated from

R^2 = SS_R / SS_T = 1 - SS_E / SS_T,    (4)

where

SS_T = sum_{i=1..n} (y_i - ybar)^2,  SS_E = sum_{i=1..n} (y_i - yhat_i)^2,  SS_R = sum_{i=1..n} (yhat_i - ybar)^2,

and SS_T = SS_R + SS_E. In Equation 4, y_i denotes the actual values of the dependent attribute, ybar the mean of the actual values, yhat_i the predicted values of the dependent attribute and n the total number of records.

3.3.2. Standardised regression coefficients. Standardised regression coefficients determine the relation between the dependent and independent attributes. The raw regression coefficients cannot, by themselves, give an idea of the effect of each independent attribute on the dependent attribute; if we transform them into standardised regression coefficients, this effect becomes clear. When the model has more than two attributes, Equation 5 is used, with the correlations

r_kj = S_kj / (S_kk S_jj)^(1/2),    (5)

where the S_kj are the sums of squares and cross-products. The correlation matrix formed by the correlations among the k independent attributes is

R = [r_kj],  k, j = 1, ..., k,    (6)

and the vector formed by the correlations between the dependent attribute and each of the k independent attributes is

g = (r_1y, r_2y, ..., r_ky)^T.    (7)

With the help of these two matrices, Equations 6 and 7, the standardised regression coefficients b* for the k independent attributes are calculated from

b* = R^(-1) g.    (8)

After obtaining the standardised regression coefficients, the coefficient ratio can be estimated.
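A sketch of this computation follows: build the correlation matrix of the independent attributes and the vector of their correlations with the dependent attribute, then solve the linear system of Equation 8. The helper names and data are our own illustrative assumptions:

```python
def corr(u, v):
    # Pearson correlation r_uv = S_uv / (S_uu * S_vv)^(1/2).
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

def standardised_coefficients(X_cols, y):
    # Solve R b* = g, where R is the correlation matrix of the
    # independent attributes and g their correlations with y.
    k = len(X_cols)
    R = [[corr(X_cols[i], X_cols[j]) for j in range(k)] for i in range(k)]
    g = [corr(X_cols[i], y) for i in range(k)]
    M = [R[i][:] + [g[i]] for i in range(k)]
    for c in range(k):  # Gaussian elimination with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]

x1, x2 = [1, 2, 3, 4], [1, -1, -1, 1]   # orthogonal predictors
y = [2, 1, 2, 5]
print(standardised_coefficients([x1, x2], y))  # ~ [0.745, 0.667]
```

For orthogonal predictors the correlation matrix is the identity, so each standardised coefficient reduces to the simple correlation of that predictor with y, which makes the sketch easy to check.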

3.4. Second Elimination Level. By the use of the standardised regression coefficients, the coefficient ratio of the independent attributes can be calculated by averaging the standardised coefficients arithmetically:

Coefficient Ratio = (1/k) sum_{j=1..k} b*_j.    (9)

If the coefficient of an independent attribute is greater than or equal to the ratio, the attribute is used for new rule construction; otherwise, the independent attribute is eliminated.
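The second elimination is a one-pass filter against this mean; a minimal sketch, with illustrative attribute names echoing the EPISODE table, follows:

```python
def second_elimination(std_coeffs):
    # Keep only the attributes whose standardised coefficient reaches
    # the coefficient ratio, i.e. the arithmetic mean of all coefficients.
    ratio = sum(std_coeffs.values()) / len(std_coeffs)
    return [attr for attr, c in std_coeffs.items() if c >= ratio]

coeffs = {"age_0_14": 0.41, "age_15_59": 0.32, "bed_days": 0.05, "male": 0.02}
print(second_elimination(coeffs))  # -> ['age_0_14', 'age_15_59']
```

With these coefficients the ratio is 0.2, so the two weakly related attributes are dropped and only two candidates remain for rule construction.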

3.5. Rule Construction Level. After the second elimination level, the system constructs the consequents for the un-eliminated independent attributes using their minimum and maximum values in the result set of the query. Finally, the consequents are found and the new rules are saved in the rule set.

4. Computational results. Our experiments were run on the EPISODE table, whose attributes are listed in Table 2. The main reason for using this database was the difficulty of obtaining any other suitable database, owing to the confidentiality of data. A sample set with 10 records from EPISODE is given in Table 3.

Table 2. Attributes of EPISODE
EPISODE attributes: id#, startyear, finishyear, epcodes, episodes, admissions, male, emergency, waiting list, mean waiting, median waiting, mean length, median length, mean age, age 0 14, age 15 59, age 60 74, age 75, day case, bed days (id# is the primary key).

In Subsection 4.1, an example is used to explain how the system works for a new query. Computational results are given for rule derivation and query optimisation in order to compare the number of derived rules and their effects on query optimisation, contrasting our proposed system with traditional rule derivation by Siegel's system. In Subsection 4.2, the rule derivation results are reported as the number of derived rules and the derivation times of these rules. In Subsection 4.3, the execution times of the optimisation using the rules are illustrated graphically.

4.1. Example. A query SELECT * FROM EPISODE WHERE Episodes=1384 is executed on EPISODE; the rules derived by Siegel's system are shown in Table 4. For the condition of the query, with the dependent attribute Episodes, the attributes remaining after running our system are ID, AGE 0 14, AGE 15 59, AGE 60 74 and AGE 75. According to these remaining attributes and the condition of the given query, the derived rules can be seen in Table 5. It can be seen from Tables 4 and 5 that our system limits the rule set from 32 to 10 rules, a 68.75% decrease for the query of this example. Further experiments were made and results obtained to give an overall view of the rule derivation and its outcomes.
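The Rule Construction Level applied to remaining attributes, as in this example, can be sketched as follows; the rule format, function name and sample rows are our own illustrative assumptions, not the paper's exact representation:

```python
def construct_rules(condition, answer_set, kept_attrs):
    """condition: the antecedent, e.g. 'Episodes=1384';
    answer_set: list of row dicts; kept_attrs: attributes that
    survived both eliminations."""
    rules = []
    for attr in kept_attrs:
        values = [row[attr] for row in answer_set]
        lo, hi = min(values), max(values)
        if lo == hi:
            rules.append(f"{condition} -> {attr} = {lo}")
        else:
            # Bound the consequent by the attribute's min and max
            # values observed in the answer set.
            rules.append(f"{condition} -> {attr} >= {lo}")
            rules.append(f"{condition} -> {attr} <= {hi}")
    return rules

rows = [{"id": 7, "age_0_14": 120}, {"id": 9, "age_0_14": 150}]
print(construct_rules("Episodes=1384", rows, ["age_0_14"]))
# -> ['Episodes=1384 -> age_0_14 >= 120', 'Episodes=1384 -> age_0_14 <= 150']
```

Each surviving attribute thus contributes at most two simple rules, which is why limiting the surviving attributes directly limits the rule set.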

4.2. Rule derivation. Experiment 1. During the derivation, 55635 different queries were executed. Comparing the rule derivation of Siegel's system and of our system, two comparisons can be seen in Figure 4 for the number of rules and in Figure 5 for the learning time. After the execution of these queries, the total number of rules derived by Siegel's system was 1233224, while our system derived 365634, a 70.35% decrease on average. A similar conclusion can be drawn for the learning time, with a 71.58% reduction. Deriving all possible rules therefore takes time, and the question is "Are they all useful?". The experiments in Subsection 4.3 address this question. During our experiments, it was also observed that the rule derivation process depends on the number of different values of the attribute contained in the query and the number of records satisfied by each value; for example, bed days has the maximum number of different values and also the largest number of records satisfied by each value.
Experiment 2. This experiment was done to find out other effects on the learning process, such as the content of the query, which can be given in four different cases:
Case I. The consequent attribute is the primary key and its condition contains an equality comparison.
Case II. The consequent attribute is the primary key and its condition contains a range comparison.
Case III. The consequent attribute is not the primary key and its condition contains an equality comparison.
Case IV. The consequent attribute is not the primary key and its condition contains a range comparison.
For these cases, the results in Table 6 are estimated. It can be seen that Case I takes the minimum learning time and Case IV the maximum. Therefore the learning process is also affected by the content of the query.
4.3. Query optimisation. Experiment 3. 878 different queries were randomly chosen and their total execution times calculated in three different execution modes, in order to find out the significance of the rules for query execution: no rules used during query optimisation, the 1233224 rules of Siegel's system used, and the 365634 rules of our system used. The results are shown in Table 7.
Our system is 69.28% faster than the original query execution and 57.00% faster than the optimum query execution by Siegel's system. These results show that the original query execution can be supported by rules in order to obtain a better system. Moreover, the usefulness of the rules differs: the larger number of rules did not reduce the query execution time but, on the contrary, increased it during the long optimisation process. Our learning system derived 867590 fewer rules and gave the better query optimisation.
Experiment 4. For this experiment, 50 different queries were randomly chosen to find out the computational results of the query optimisation, in order to give more detail graphically for each individual query execution with and without the use of rules; the results are shown in Figure 6. From Figure 6, most of the optimum queries produced by our system could be executed in less time than the original queries. Query 20, Query 36 and Query 46 gave similar results because the contents of these queries remained the same after query optimisation.
Experiment 5. This experiment is designed to find out other effects of the optimisation process related to the content of the query (described as four cases in Experiment 2), in order to point out the optimisation results for these cases. In the experiment, between 200 and 300 queries were executed for each case, and the average execution times for each case are given in Table 8. The results of this experiment are very similar to those of Experiment 2: Case I takes the minimum query execution time and Case IV the maximum. Therefore the query execution is also related to the content of the queries and their matched rules in optimisation. It is very important to derive the most useful rules instead of learning them all.

5. Conclusion. From the experiments, the main conclusions can be listed as follows:
• Learning all possible rules does not give better query optimisation.
• Deriving rules based on data analysis can help us to select useful rules.
• Query optimisation depends on the content of the original query and on the content of the rule used for this query.
• Last and most importantly, the use of rules can make query execution faster than the original query execution.
Although several researchers are working on data analysis for rule derivation, open problems remain in the components of SQO, such as new learning methods, rule evaluation, rule maintenance and query optimisation. For example, we next wish to extend our system implementation using Forward Selection and/or Bidirectional Elimination and carry out further experiments. Finally, we hope that this paper can be an inspiration for future work on rule derivation and query optimisation.