This section describes the materials and methods applied in the proposed approach. First, the breast cancer dataset used in this work is introduced. Then, the GWO algorithm and the SOF classifier are explained. Finally, the proposed approach is presented.
2.2. Feature Selection
Machine learning employs methods that allow large amounts of data to be analyzed automatically (25). Classification algorithms are a type of supervised learning technique used to identify the category of new observations based on training data. In classification, a program learns from a given dataset and then classifies new observations into predefined classes. The primary objective of classification is to generalize from the training patterns to accurately categorize new patterns. The process of ML for classification begins with prerequisites, such as datasets, data cleaning procedures, FS techniques, and classification models (26-28).
Feature selection is a crucial aspect of ML. Feature selection techniques for classification problems are based on identifying significant features, and these techniques can enhance various standard ML methods (29, 30). Therefore, this process plays a vital role in the development of learning models. Selecting the appropriate and optimal feature subset can be challenging due to the complex and unpredictable interrelationships between features (31, 32). Since FS is a non-deterministic polynomial-time (NP)-hard problem (33, 34), several optimization algorithms have been proposed to tackle it. These algorithms include particle swarm optimization (PSO) (35, 36), ant colony optimization (ACO) (37, 38), whale optimization (WO) (39, 40), and GWO (41). Feature selection approaches based on optimization algorithms can efficiently explore large search spaces and often yield results that closely approximate the global optimum. These approaches systematically eliminate unnecessary and redundant features, which can, in many cases, enhance the performance of learning models by reducing uncertainty and overfitting (42-44).
2.3. Grey Wolf Optimization Algorithm
Optimization algorithms are procedures for finding near-optimal solutions to complex, multi-dimensional optimization problems. One such algorithm is the GWO algorithm, a promising optimization technique based on swarm intelligence (SI) (45). This algorithm has attracted the attention of many researchers across various optimization domains (46-48). Several characteristics set the GWO algorithm apart from other evolutionary and swarm intelligence techniques: it requires minimal parameter tuning, effectively balances global and local search, and demonstrates favorable convergence. Moreover, it is known for its simplicity of implementation, adaptability, and scalability (49).
The GWO algorithm mimics the leadership hierarchy and hunting mechanism observed in grey wolves. It employs four types of grey wolves to simulate the leadership hierarchy: alpha (α), beta (β), delta (δ), and omega (ω). Grey wolves tend to live in packs, typically of 5 to 12 members. The hierarchy among grey wolves is highly structured: in mathematical terms, α represents the best solution, followed by β and δ as the second and third best solutions, and all remaining candidate solutions are regarded as ω. The GWO algorithm posits that the α, β, and δ wolves lead the hunt (optimization), while the ω wolves follow these three leaders (45). The crucial phase of the hunt occurs when the wolves encircle their prey. Equations are employed to represent the encircling behavior of the GWO algorithm.
In these equations, D_i is the distance term, t is the current iteration, and p and y are the positions of the prey and of a grey wolf, respectively.
A and C are coefficient vectors computed from r1 and r2, which are random vectors in the range [0, 1], and from lb, whose components are linearly lowered from 2 to 0. The hunt is usually directed by α, with β and δ occasionally taking part. However, the exact position of the best solution (the prey) in the abstract search space of the problem is unknown. The simulation of the hunting behavior therefore assumes that α is the best candidate solution and that β and δ also have useful knowledge of where the prey might be. As a result, the three best solutions obtained so far are retained, and the omegas and all other search agents must update their positions according to the positions of these best search agents.
The updated position is obtained from W1, W2, and W3. These are computed from Wα, Wβ, and Wδ, the three best solutions of the GWO at iteration t, from the coefficient vectors A1, A2, and A3, and from the distance terms Diα, Diβ, and Diδ, which in turn depend on the coefficient vectors C1, C2, and C3.
Finally, the parameter lb is decreased from 2 to 0 over the iterations to balance global and local search, where t denotes the current iteration and Maxit denotes the maximum number of iterations permitted in the GWO.
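For reference, the update rules described in this subsection correspond to the standard GWO equations. The block below is a sketch that transcribes the widely used GWO formulation into the symbols of this section (P and Y for the prey and wolf positions, lb for the linearly decreasing parameter). The mapping of symbols, the componentwise products, and the absence of equation numbers are assumptions, and the exact form used in this paper may differ in detail.

```latex
\begin{align*}
Y(t+1) &= P(t) - A \cdot D_i, \\
D_i &= \lvert C \cdot P(t) - Y(t) \rvert, \\
A &= 2\,lb \cdot r_1 - lb, \qquad C = 2\,r_2, \\
Y(t+1) &= \frac{W_1 + W_2 + W_3}{3}, \\
W_1 &= W_\alpha - A_1 \cdot D_{i\alpha}, \quad
W_2 = W_\beta - A_2 \cdot D_{i\beta}, \quad
W_3 = W_\delta - A_3 \cdot D_{i\delta}, \\
D_{i\alpha} &= \lvert C_1 \cdot W_\alpha - Y \rvert, \quad
D_{i\beta} = \lvert C_2 \cdot W_\beta - Y \rvert, \quad
D_{i\delta} = \lvert C_3 \cdot W_\delta - Y \rvert, \\
lb &= 2 - t \cdot \frac{2}{Maxit}.
\end{align*}
```

With this schedule, lb equals 2 at t = 0 and reaches 0 at t = Maxit, so early iterations favor exploration and later iterations favor exploitation.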
2.4. Binary GWO
The binary version of the GWO algorithm is called BGWO; each of its solutions is a vector of 0s and 1s. Emary et al. proposed a BGWO algorithm for FS tasks (50). In this algorithm, the update equation is a function of three position vectors, Wα, Wβ, and Wδ, which attract each wolf toward the three best solutions. The position of a given wolf is updated using the GWO principle while maintaining the binary constraint. The update used in this method is detailed in the equations up to Equation 23, and its core step combines three binary vectors, w1, w2, and w3.
Here, w1, w2, and w3 are binary vectors that reflect the influence of a wolf's movement toward the α, β, and δ grey wolves, respectively.
To compute w1, the position of the α wolf in dimension d is combined with a binary step in dimension d. The binary step is obtained by comparing urand, a random number drawn from a uniform distribution in the range (0, 1), with the continuous-valued step size in dimension d, which is computed using a sigmoidal function of the GWO coefficient and distance terms for that leader.
The vectors w2 and w3 are computed in the same way from the position vectors of the β and δ wolves, each with its own binary step and its own sigmoidal continuous-valued step size in dimension d.
Finally, a simple stochastic crossover is applied in each dimension to combine the outputs w1, w2, and w3.
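The binary update described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the sigmoid constants (10 and 0.5) and the equal-thirds crossover follow the commonly cited BGWO formulation and are assumptions here.

```python
import numpy as np


def sigmoid_step(coef, dist):
    """Continuous-valued step size in (0, 1) for every dimension."""
    # Assumed constants 10 and 0.5 (commonly cited BGWO form).
    return 1.0 / (1.0 + np.exp(-10.0 * (coef * dist - 0.5)))


def leader_vector(leader, wolf, lb, rng):
    """Binary vector pulling `wolf` toward one leader (alpha, beta or delta)."""
    n = wolf.size
    A = 2.0 * lb * rng.random(n) - lb      # coefficient vector A
    C = 2.0 * rng.random(n)                # coefficient vector C
    dist = np.abs(C * leader - wolf)       # distance term
    cstep = sigmoid_step(A, dist)          # continuous-valued step
    bstep = (cstep >= rng.random(n)).astype(int)   # binary step
    return ((leader + bstep) >= 1).astype(int)     # result stays binary


def stochastic_crossover(w1, w2, w3, rng):
    """Pick each dimension from w1, w2 or w3 with equal probability."""
    u = rng.random(w1.size)
    return np.where(u < 1 / 3, w1, np.where(u < 2 / 3, w2, w3))


def bgwo_update(wolf, alpha, beta, delta, lb, rng):
    """One binary position update for a single wolf."""
    w1 = leader_vector(alpha, wolf, lb, rng)
    w2 = leader_vector(beta, wolf, lb, rng)
    w3 = leader_vector(delta, wolf, lb, rng)
    return stochastic_crossover(w1, w2, w3, rng)
```

A call such as `bgwo_update(wolf, alpha, beta, delta, lb=1.0, rng=np.random.default_rng(0))` returns a new 0/1 vector of the same length as `wolf`.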
2.5. Self-Organizing Fuzzy Logic
Gu and Angelov introduced a novel classifier model based on SOF (51). The SOF classifier utilizes non-parametric statistical operators to objectively reveal essential data patterns, even in the absence of empirically acquired data samples. It identifies local peaks within the multi-modal data distribution to serve as prototypes. Additionally, the SOF classifier is highly objective and non-parametric: it does not rely on a predefined model with parameters but derives all associated meta-parameters directly from the data. Depending on the complexity of the problem and the availability of computational resources, the SOF classifier can address problems at various levels of granularity or detail. Furthermore, it supports both online and offline learning and can classify data using various dissimilarity/distance measures. The SOF is therefore a versatile classifier known for its excellent performance across a range of problems. In this paper, the offline learning mode of the SOF classifier is used.
The offline method of the SOF classifier independently identifies prototypes for each class and, from these prototypes, constructs one zero-order fuzzy rule of the AnYa type per class.
The AnYa-type fuzzy rule-based scheme was introduced in (52) as an alternative to the commonly used fuzzy rule-based schemes, such as the Takagi-Sugeno (53) and Mamdani (54) models. Compared with these two models, the antecedent (IF) part of an AnYa-type fuzzy rule is simplified into a more concise, objective, and non-parametric vector form that does not require ad hoc membership functions to be defined.
In a zero-order fuzzy rule of the AnYa type, xin denotes the input vector and "∼" denotes similarity, which can also be considered a fuzzy degree of membership/satisfaction (55); pro_i (i = 1, 2, ..., Np) represents the ith prototype of the class, and Np is the number of prototypes identified from the data samples of this class. Different strategies, such as the "fuzzily weighted average", can be used to determine the label of a given data sample.
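For reference, a zero-order AnYa-type rule is commonly written in the form below. This is a sketch in the symbols defined above; the consequent label "class" is an assumed placeholder, and the exact typesetting used in this paper may differ.

```latex
\[
\text{IF } (x_{in} \sim pro_1) \text{ OR } (x_{in} \sim pro_2) \text{ OR } \dots
\text{ OR } (x_{in} \sim pro_{N_p}) \text{ THEN } (\text{class})
\]
```

Each OR-connected term compares the input with one prototype, so a rule fires strongly whenever the input is similar to at least one prototype of its class.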
The fuzzy rule training procedures of the separate classes do not affect one another. For the remainder of this section, the training procedure is assumed to use the data samples of the cth class (c = 1, 2, ..., C), of which there are Kc, together with the set of unique data samples of that class and their frequencies of occurrence. Taken over all the classes, the Kc values sum to the total number of training samples.
Prototypes are found using the densities and mutual distributions of the data samples. To begin with, the multi-modal densities (52, 56) at all the unique data samples are computed. Then, the data samples are ordered in a list, denoted {r}, according to their multi-modal density values and their mutual distances.
The first element of list {r}, r1, is the data sample with the largest multi-modal density. The second element, r2, is the data sample closest to r1 in terms of distance. The third element, r3, is the remaining data sample with the minimal distance to r2.
The entire list {r} is built by repeating this procedure until every data sample has been chosen, and the multi-modal densities are then ordered according to list {r} (57, 58).
It is important to note that once a data sample has been selected into list {r}, it cannot be selected a second time.
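The ordering procedure above can be sketched as follows. The function name is hypothetical, the densities are taken as precomputed inputs, and Euclidean distance is an assumption, since the SOF classifier supports several distance measures.

```python
import numpy as np


def rank_by_density_and_distance(samples, densities):
    """Return sample indices in the order r1, r2, ... described in the text."""
    samples = np.asarray(samples, dtype=float)
    remaining = set(range(len(samples)))

    # r1: the sample with the largest multi-modal density.
    current = int(np.argmax(densities))
    order = [current]
    remaining.remove(current)

    # Each next element is the not-yet-selected sample closest to the last one.
    while remaining:
        candidates = sorted(remaining)
        dists = np.linalg.norm(samples[candidates] - samples[current], axis=1)
        current = candidates[int(np.argmin(dists))]
        order.append(current)
        remaining.remove(current)  # a sample is never selected twice
    return order
```

The ordered densities are then obtained by indexing the density array with the returned order.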
Prototypes, denoted {p}0, are then identified as the local maxima of the ordered multi-modal densities by Condition 1:
Condition 1:
After all of the prototypes have been identified using Condition 1, some less representative ones may remain in {p}0, so a filtering process is needed to remove them from {p}0.
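The local-maximum test used to pick the initial prototypes {p}0 can be illustrated with a short sketch. The function name is hypothetical. Requiring a strict inequality against both neighbors and treating the two ends of the list as having no neighbor on one side are assumptions made for this illustration.

```python
def local_maxima(ordered_densities):
    """Indices (positions in list {r}) whose density exceeds both neighbors."""
    d = list(ordered_densities)
    peaks = []
    for i, value in enumerate(d):
        # Assumption: a missing neighbor at either end never blocks a peak.
        left = d[i - 1] if i > 0 else float("-inf")
        right = d[i + 1] if i < len(d) - 1 else float("-inf")
        if value > left and value > right:
            peaks.append(i)
    return peaks
```

The returned positions refer to the ordered list {r}, so they must be mapped back through that ordering to recover the prototype samples.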
Before the filtering process begins, the prototypes are used to attract nearby data samples and form data clouds (55), in a manner similar to a Voronoi tessellation (58).
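Forming data clouds in this Voronoi-like manner amounts to assigning every sample to its nearest prototype. The sketch below assumes Euclidean distance and uses hypothetical names. It also returns the center and the support (member count) of each cloud, and it assumes every prototype attracts at least one sample, which holds when the prototypes are themselves data samples.

```python
import numpy as np


def form_data_clouds(samples, prototypes):
    """Assign each sample to its nearest prototype (Euclidean distance assumed)."""
    samples = np.asarray(samples, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)

    # Pairwise distances with shape (n_samples, n_prototypes).
    dists = np.linalg.norm(samples[:, None, :] - prototypes[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    n_clouds = len(prototypes)
    # Center of each cloud: mean of its member samples.
    centers = np.array([samples[labels == k].mean(axis=0) for k in range(n_clouds)])
    # Support of each cloud: number of member samples.
    support = np.bincount(labels, minlength=n_clouds)
    return labels, centers, support
```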
After all the data clouds have been formed around the available prototypes {p}0, the data cloud centers, denoted {φ}0, are obtained, and the multi-modal densities at these centers are computed using the support of each data cloud, where the support of the ith data cloud is the number of its members.
Following that, for each data cloud (say, the ith one), the set of centers of its neighboring data clouds is identified by Condition 2:
Condition 2:
In Condition 2, the threshold is the average radius of the local influential region around each data sample, which corresponds to the Lth level of granularity (L = 1, 2, 3, ...) and is derived from the data of the cth class in offline mode. Finally, the most representative prototypes of the cth class, denoted {p}c, are chosen from the centers of the available data clouds that fulfill Condition 3 (58):
Condition 3:
Once the representative prototypes of the cth class, {p}c, have been identified, the fuzzy rule of the AnYa type can be constructed from them, where Nc denotes the number of prototypes in {p}c.
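As an illustration of how such per-class rules can be used at decision time, the sketch below scores a sample against the prototypes of each class and picks the class with the strongest match. The similarity exp(-d²) with squared Euclidean distance and the winner-takes-all decision are assumptions made for this example and are not confirmed by the text above; all names are hypothetical.

```python
import numpy as np


def classify(x, prototypes_by_class):
    """Return (label, scores) using the best-matching prototype of each class."""
    x = np.asarray(x, dtype=float)
    scores = {}
    for label, protos in prototypes_by_class.items():
        protos = np.asarray(protos, dtype=float)
        d2 = np.sum((protos - x) ** 2, axis=1)       # squared distances
        scores[label] = float(np.exp(-d2).max())     # firing strength of the rule
    best = max(scores, key=scores.get)               # winner takes all
    return best, scores


# Toy prototypes (hypothetical values, two features).
rules = {"benign": [[0.0, 0.0], [1.0, 0.0]], "malignant": [[5.0, 5.0]]}
label, scores = classify([0.9, 0.1], rules)
```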
2.6. The Hybrid Intelligent Method
Taking into consideration the advantages of BGWO and SOF and the importance of breast cancer classification, this study proposes an intelligent approach for distinguishing benign from malignant breast cancers. In the proposed approach, BGWO acts as an FS technique to select the effective and optimal features, while SOF functions as the classifier that evaluates the performance of these features. The procedure of the proposed approach (BGWO-SOF) is described below.
The procedure begins by normalizing the values of the WDBC dataset and initializing the parameters of the BGWO algorithm. Then, K-fold cross-validation (with K = 10) is employed to assess how effectively the classification approach can predict the tumor characteristics of an unseen instance. The dataset is divided into 10 equally sized subsets. In each fold, 9 subsets (90% of the dataset) serve as the training data, and the remaining subset (10% of the dataset) is reserved for testing.
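The normalization and 10-fold split can be sketched as follows. The text does not state which normalization is used, so min-max scaling is an assumption, as are the function names and the shuffling seed.

```python
import numpy as np


def min_max_normalize(X):
    """Scale every feature to [0, 1] (assumed normalization scheme)."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    return (X - lo) / span


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx
```

When the number of samples is not divisible by K, `np.array_split` makes the folds differ in size by at most one sample, so the subsets are only approximately equal.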
In each fold, the algorithm executes the optimization process by generating an initial population of candidate solutions (i.e., individuals) within the search space. The position of each individual is represented as a vector with N elements, where N is the number of features in the dataset. A value of 0 indicates that the corresponding feature is not selected, whereas a value of 1 indicates that it is selected. Each individual in the feature space therefore constitutes a set of candidate features. The BGWO solution is depicted in Figure 1 (for example, N = 10).
Solution representation of feature selection
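A binary solution of this kind can be applied to a data matrix as a column mask. The example below is a minimal sketch with N = 10; the mask values and the toy data are hypothetical and are not taken from Figure 1.

```python
import numpy as np


def apply_mask(X, mask):
    """Keep only the columns of X whose mask entry is 1."""
    mask = np.asarray(mask).astype(bool)
    return np.asarray(X)[:, mask]


# Hypothetical solution vector with N = 10 (1 = feature selected).
mask = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
X = np.arange(20).reshape(2, 10)   # toy data: 2 samples, 10 features
selected = apply_mask(X, mask)     # columns 0, 3, 4 and 8 are kept
```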
Then, the fitness of each solution (X_i) is calculated. Because the primary goal of the approach is to improve classification performance, the quality of a solution is determined by two key criteria: the number of features selected by the BGWO in the solution and the error rate of the SOF classifier. Therefore, the optimal solution is the combination of features with the fewest selected features and the highest classification performance. In this paper, a fitness function based on the SOF classifier is applied to evaluate the quality of the features selected by the BGWO.
In this fitness function, the error term is the error rate of the SOF classifier on the WDBC dataset when the selected features are used, F is the total number of features in the breast cancer dataset, and f is the number of selected features in the solution. ω ∈ (0, 1) and φ = 1 − ω are balance factors that weight the classification error rate against the number of selected features, controlling the relative importance of classification performance and feature space reduction. φ is set to 0.05 in the present experiments.
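A fitness function consistent with these definitions can be sketched as a weighted sum of the error rate and the fraction of selected features. The weighted-sum form is an assumption based on the symbols defined above, and the function and argument names are hypothetical.

```python
def fitness(error_rate, n_selected, n_total, phi=0.05):
    """Weighted sum of classification error and feature-subset size.

    Assumed form: omega * error_rate + phi * (f / F), with omega = 1 - phi.
    Lower values are better.
    """
    omega = 1.0 - phi
    return omega * error_rate + phi * (n_selected / n_total)
```

With φ = 0.05, the classification error carries a weight of 0.95, so reducing the number of features only matters between solutions whose error rates are close.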
At each iteration, the fitness of the new solution is compared with the fitness of the previous solution, and the new solution is kept if it shows an improvement. The process is repeated until the number of iterations reaches the limit Maxit. Finally, the selected feature set is used in the SOF learning model.
This procedure is repeated until every subset has been used for both the training and testing phases. Finally, the evaluation metrics from the 10 folds are averaged to produce reliable statistical results.
Figure 2 shows a flowchart of the proposed approach.
Flowchart of the proposed approach