In machine learning and data mining field, feature selection is a dimensionality reduction technique (
1). In model construction the feature selection methods select a subset of relevant features. In feature selection techniques the evaluation methods are divided into five types: filter, wrapper, embedded, hybrid, and ensemble (
1). The goal of feature selection is to determine the most critical features mainly (descriptors) more than hundred descriptors (
2). In this paper the wrapper type among feature selection methods is used. Feature selection problem is an NP-Hard problem and for solving this problem different meta-heuristic algorithms have been used. In QSAR modeling different feature selection algorithms have been proposed. In QSAR modeling each feature indicates a molecular property while it is a number that denotes the properties of molecules like molecular weight, solvent accessible surface or other molecular properties. In other words any feature is considered as a single number that explains an aspect of a molecule (
2). Ant Colony Optimization (ACO) algorithm (
3) has been used for modeling of anti-HIV-1 activities of 3-(3,5-dimethylbenzyl) Uracil derivatives using MLR, PLS, and SVM regressions. Particle Swarm Optimization (PSO) and genetic algorithm (
4) have been used for modeling of imidazo[1,5-a]pyrido[3,2-e]pyrazines, inhibitors of phosphodiesterase 10A. Modified ant colony optimization algorithm (
5) had been used for variable selection in QSAR modeling on cyclooxygenase inhibitors. Monte Carlo algorithm (
6) had been used for QSAR modeling on aldose reductase inhibitors. Particle swarm optimization and genetic algorithm (
7) have been used for QSAR modeling of peptide biological activity. In this work two novel hybrid meta-heuristic wrapper algorithms i.e. Sequential GA and LA (SGALA) and Mixed GA and LA (MGALA), which are based on Genetic algorithm and learning automata for feature selection in QSAR model are proposed. SGALA algorithm uses advantages of Genetic algorithm and Learning Automata sequentially and the MGALA algorithm uses advantages of Genetic Algorithm and Learning Automata simultaneously. For evaluation of selected features for our proposed algorithms the MLR classification technique was used. Our proposed algorithms were executed on three different datasets (Laufer
et al.(
8), Guha
et al.(
9) and Calm
et al.(
10)). For evaluation and assessment of our proposed algorithms we implemented our proposed algorithms along with GA, ACO, PSO, and LA algorithms in MATLAB environment. Through implementing and running all the algorithms with different datasets, it was observed that the rate of converging to optimal result in MGALA-MLR and SGALA algorithms are better than GA, ACO, PSO, and LA algorithms and also the rate of MGALA algorithm is even better than SGALA and all other algorithms. A very important difference between LA and GA is that the GA tries to find the most appropriate chromosome from the population, but in LA the position of action is very important and therefore by combining these two algorithms (MGALA) the rate of convergence is improved. Error values in MGALA and SGALA algorithms are smaller than GA, ACO, PSO, and LA algorithms and
R2 values in SGALA and MGALA algorithms are more than GA, ACO, PSO, and LA algorithms in most runs as well.
Material and method
Genetic Algorithm (GA)
Among the bio-inspired optimization algorithms, the Genetic Algorithm (GA), an algorithm based on the principles of natural selection, is believed to be one of the best and the most efficient ones (
11). GA is a random search optimization algorithm that simulates the natural evolutionary theory. To this aim, it applies a fitness function and modeled the data into some chromosomes as initial population(
11,
12). In this algorithm, the search process starts from initial population and by applying two operators (mutation and crossover) on the chromosomes the algorithm tries to generate new population and move to the optimal point of the search space. In each step, the distance of each chromosome to the optimal solution is measured by fitness function. Consequently, Optimization is the most critical function of the Genetic Algorithm(
11,
13).
Learning Automata (LA)
Learning Automata (LA) is perceived as an abstract model that selects an operation from a set of specific operations randomly. this algorithm employs the selected operation on the environment and informs the evaluated results by using a reinforcement signal (
14). LA updates its interior states by means of selected operation and reinforcement signals. Then the algorithm selects the next operation in an iterative manner(
15). The communication of LA and the environment is shown in
Figure 1(
16) .
The environment is shown by
E={
α,β,c} where
α={
α1 ,α2 ,...,αr} is a set of inputs,
β={
β1 , β2 ,... βr} is a set of outputs, and
c={ c1 ,
c2 ,...
cr } is penalty probabilities. When
β is a set of binary, so the environment is a P type. In this kind of environment
β1=1 is considered as penalty and
β2=0 as reward (
17,
18).
Proposed Algorithms
In this paper two novel hybrid algorithms for QSAR feature selection problem are proposed. These two new hybrid algorithms take advantage of both genetic algorithm and learning automata. In below sections the application of genetic algorithm and learning automata are described for feature selection in QSAR problem and then, the MGALA and SGALA algorithms are explained.
Feature selection using GA
This algorithm tries to solve QSAR feature selection problem using Genetic Algorithm. The flowchart of this algorithm is depicted in
Figure 2. At first this algorithm produces an initial population and then tries to converge to optimal result using genetic operations.
Figure 3 shows a QSAR sample and corresponding chromosome for this algorithm.
The fitness function, crossover, and mutation operators are described in below sections:
Fitness function
To obtain the fitness value, first by using Multiple Linear Regression (MLR), the activity is predicted and after that by using Root Mean Square Error (RMSE) equation, the fitness value of each chromosome/automaton is calculated. For example, for sample Table and chromosome of
Figure 3, the fitness value is determined using below steps:
Step1: Predicting activity using MLR. By using MLR, the activity values can be predicted. R1 relation shows the application of MLR for the sample demonstrated in
Figure 3.
For this specific example, the predicted activity values could be calculated using below function:
f=-7.0514+0.3969*f2+8.4445*f3 –5.0770*f4
Step 2: calculating chromosome fitness value using RMSE equation. After predicting activity values using MLR, the fitness value must be calculated using RMSE equation. R2 relation below shows the RMSE equation. In this function n is the number of sample molecules.
R2: fitness value using RMS function
Crossover operator
Regarding genetic algorithm, crossover operator applied to modify the contents of chromosomes from one generation to the next ones. It is similar to the biological crossover process that the GA is based. The crossover procedure takes more than one parent solutions and generating the same number of child solutions from them. The crossover operator in this algorithm uses single point crossover. In this type of crossover two random chromosomes were selected and half of each chromosome was attached to the other chromosome and vice versa. This operator is depicted in
Figure 4.
Mutation operator
Similar to biological mutations, mutation operator is applied to sustain genetic variety from one generation of population to the next one. In mutation, the solution may alter completely from the previous solution. Therefore, GA can be improved to a better solution by using mutation. Mutation takes place in the course of evolution according to a user-defined mutation probability. The mutation operator type in this algorithm is order-based mutation. In this type of mutation two random genes are selected and the positions of them are swapped .This operator is illustrated in
Figure 5.
GA Termination
In Genetic Algorithm there are some different conditions for termination of algorithm. In this paper at first the generation number is declared and then the algorithm executes according to this number.
Feature selection using LA
For a QSAR feature selection with
n features, different 2
n states exist and if LA is applied to solving QSAR feature selection problem, LA must involve 2
n actions. In this article, the Object Migration Automata (OMA) method, proposed by Oommen and Ma, is utilized to reduce convergence speed. More precisely, the proposed algorithm utilizes Tsetlin automata, an OMA based algorithm, for solving QSAR selection problem (
19).
In our proposed algorithms each chromosome is equal to an automaton and each gene is equivalent to an action of an automaton. The automaton illustrated in
Figure 6 is equal to the chromosome which was brought in
Figure 3. The flowchart of Learning Automata for solving this problem is depicted in
Figure 7. In this algorithm at first the initial population consisting of some random automata is generated, and then by using LA method it tries to converge to the optimal result. By repeating the process of learning, the LA selects the suitable position of actions.
Reward and penalize Operator
One of the important subjects in learning automata is reward and penalize operator. In this method in every epoch for every automaton an action is randomly selected and it is rewarded or penalized. At first the fitness value of automaton is calculated (suppose it is
S1), after that if the selected action value is zero it changes to one and vice versa and then the fitness value of the altered automaton is calculated once again (suppose it is
S2). Reward operator occurs when the value of
S1 is equal to or smaller than the value of
S2 and penalize operator occurs when the
S1 value is bigger than
S2 value (
Figure 8).
R3 relation shows the reward and penalizes.
For
Figure 8 automaton, the selected action is penalized, because by changing the selected action value from zero to one its fitness value is minimized (the error is minimized). If the fitness value for the changed action is larger than original automaton fitness value, therefore the automaton is penalized.
Figure 9 shows the reward operation and
figures 10 and
11 show penalize operation.
Two possibilities are likely when penalizing an action:
(a) The action might occur in a position other than frontier position. In this case, penalizing makes it less important. The way the action of feature f2 is penalized, is shown in
Figure 10.
(b) The action could occur in frontier position. In this case, provided that the value of action is
zero it is turned into
one and Vice versa.
Figure 11 shows how feature f1 is penalized.
Learning termination
For termination of learning process there are different methods such as: predefined epoch number and obtained suitable result and etc. In this paper we use predefined epoch number and at first before the start of algorithm, the epoch number is defined and the algorithm repeats learning process using epoch number value.
Mixed GALA Algorithm (MGALA)
GA tries to find the best chromosome in the population. In GA the location of genes, in chromosomes are random.
The optimal solution can be found in fewer generations if the position of the genes in the chromosomes discover optimally. Consequently, our algorithm tries to obtain the optimal solution in fewer generations utilizing the advantages of both GA and LA. In this algorithm the LA operator (reward/penalize) is added to GA. Generation number in GA and epoch number in LA in this algorithm are equivalent. The flowchart of this algorithm is shown in
Figure 12.
Sequential GALA Algorithm (SGALA)
Another algorithm that we have proposed in this paper is SGALA. In this Algorithm at first the GA tries to converge to optimal result after a number of GA generations, the last population of GA is applied as the initial population of LA and next the LA tries to improve the last GA results. In this algorithm we could optimize the initial population of GA using less generation numbers and then by using LA the result could be improved. Generation number in GA and Epoch number in LA are distinct. The flowchart of this algorithm is exhibited in
Figure 13.