1.1. Diabetes Mellitus
Given the rising prevalence of diabetes, the high cost of treating diabetic patients, and the many associated deaths and medical errors, preventing diabetes is more important than treating it (
1). The number of people with diabetes has risen from 108 million in 1980 to over 422 million globally (
2). Furthermore, the global prevalence of diabetes among adults over 18 years old rose from 4.7% in 1980 to 8.5% in 2014, and it has been growing more quickly in low- and middle-income countries (
2). Diabetes is a chronic disease and a major cause of kidney failure, blindness, heart attacks, lower limb amputation, and stroke (
2). In 2012, diabetes directly caused 1.5 million deaths, and a further 2.2 million deaths were attributable to high blood glucose levels (BGL) (
2). BGL is a statistical concept, defined in terms of the distribution of fasting plasma glucose (FPG), which is used to diagnose type 2 diabetes mellitus (T2DM) when it exceeds 130 mg/dl (
3). The World Health Organization (WHO) projects that diabetes will be the seventh leading cause of death in 2030 (
4). Thus, preventing diabetes through early diagnosis is our main research priority.
Diabetes mellitus is a metabolic disease that affects the body’s ability to regulate BGL (
5). In diabetes, blood glucose rises abnormally (hyperglycemia), which leads to serious medical conditions, including ischemic heart disease, stroke, nephropathy, neuropathy, retinopathy, and peripheral vascular disease (
5,
6).
One cause of hyperglycemia in humans is insulin deficiency, which occurs when the beta cells in the pancreas are no longer able to produce insulin. This condition is known as type 1 diabetes mellitus (T1DM) (
5,
7,
8). People with T1DM need daily insulin injections to regulate their blood glucose levels and cannot survive without access to insulin. The cause of T1DM is unknown and the disease is not currently preventable; its symptoms include excessive urination and thirst, persistent hunger, visual changes, weight loss, and fatigue (
2).
The other type, and the most common form of diabetes mellitus, is type 2 diabetes mellitus (T2DM), in which the body does not produce sufficient insulin and the cells do not respond properly to it (
2,
9). The symptoms of T2DM are similar to those of T1DM but are often less marked or entirely absent. As a result, the disease may go undetected for several years, until its complications have already appeared (
2). Approximately 50% to 80% of T2DM cases are undiagnosed (
7,
10).
The interaction of genetic and metabolic factors determines the risk of T2DM. A family history of diabetes, overweight, obesity, ethnicity, and previous gestational diabetes, combined with older age, physical inactivity, unhealthy diet, and smoking, increase the risk of diabetes (
2). Both genetic and environmental factors, such as race, obesity, age, gender, and lack of exercise, play significant roles in diabetes diagnosis, and overweight and obesity are particularly influential risk factors for T2DM (
2,
5,
7,
9,
11-
13).
Marinov et al. (
14) reviewed 17 articles describing different data mining methods used in diabetes research. They concluded that data mining can play a major role in diabetes research and ultimately improve the quality of health care for diabetes patients. The significance of our study, in turn, is that early diagnosis of diabetes reduces the cost of treatment.
If diabetes is not treated promptly and appropriately, it leads to severe complications, including death. The disease is therefore one of the main priorities in medical research, which generates a wealth of clinical data (
14). Data mining is the process of extracting useful knowledge from large amounts of clinical data for prediction, using techniques such as classification, clustering, and association; this makes it an effective method in diabetes research (
12). Data mining can significantly help diabetes research and ultimately improve the quality of health care (
14,
15). In disease diagnosis, data mining methods use complex machine learning algorithms to discover hidden patterns and increase detection accuracy; the identified patterns are then used to predict future events (
12,
15).
This paper aims to develop an ensemble system for the early diagnosis of diabetes that predicts whether a patient has diabetes. The system uses classification methods from data mining, namely weighted k-nearest neighbor, simple decision tree, and logistic regression, to extract knowledge from clinical patient data.
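To make the first of these classifiers concrete, the following is a minimal sketch of a distance-weighted k-nearest neighbor vote. It is an illustration only, not the system developed in this paper: the inverse-distance weighting, the function name weighted_knn_predict, the value of k, and the toy feature vectors are all assumptions made for the example.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train_X, train_y, query, k=3, eps=1e-9):
    """Predict a label for `query` by inverse-distance-weighted voting
    among its k nearest training points (hypothetical helper)."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])

    # Closer neighbours contribute larger votes; eps avoids division by zero
    votes = defaultdict(float)
    for dist, label in distances[:k]:
        votes[label] += 1.0 / (dist + eps)
    return max(votes, key=votes.get)


# Made-up, already scaled features (glucose, BMI); 1 = diabetic, 0 = non-diabetic
train_X = [(0.2, 0.3), (0.25, 0.35), (0.8, 0.7), (0.85, 0.9), (0.3, 0.2)]
train_y = [0, 0, 1, 1, 0]

print(weighted_knn_predict(train_X, train_y, (0.75, 0.8)))
```

Weighting by inverse distance lets a very close neighbor outweigh several distant ones, which is the usual motivation for preferring the weighted variant over plain majority voting among neighbors.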
Section 1 introduces diabetes mellitus and the importance of applying data mining techniques in the medical field, and briefly reviews the related research background. Section 2 covers data acquisition, pre-processing and data normalization, data analysis, the different classifiers, the cross-validation technique, the confusion matrix, and the proposed method. Section 3 presents the results obtained by the proposed method. Section 4 compares the results of different studies with those of our proposed method. Section 5 concludes the research, Section 6 offers suggestions and future work, and Section 7 addresses compliance with ethical standards.
1.2. Research Background
Most diabetes research that uses data mining for early diagnosis follows one of two general approaches. The first applies a specific classification algorithm to patient data to estimate the risk of diabetes; the second applies hybrid algorithms. For example, Thirumal and Nagarajan (
12) compared different data mining algorithms, namely decision tree, k-nearest neighbor, naïve Bayes, and support vector machine (SVM), on the Pima Indian diabetes dataset. Their main goal was to find the algorithm that achieves the highest accuracy on the given data. Likewise, Lee and Wang (
9) presented a novel fuzzy expert system for a diabetes diagnosis support application, which can give a semantic description of diabetes. In contrast, Tafa et al. (
16) proposed a joint implementation of two algorithms, SVM and naïve Bayes, to offset their individual weaknesses; the accuracy of the joint method improved to 97.6%. This methodology can reduce false negatives, which is crucial in medical diagnosis. Han et al. (
17) used an SVM together with an ensemble learning module that turns the “black box” of SVM decisions into logical rules to monitor diabetes. Their study shows that the hybrid system is efficient and can serve as a tool for diabetes diagnosis. Likewise, Barakat et al. (
7) used the same technique and showed that intelligible SVMs provide a promising tool for the prediction of diabetes. Furthermore, De Silva et al. (
15) developed an ensemble system that obtains its final result by voting among three classification methods, namely decision tree, naïve Bayes, and SVM.
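The voting step used by such ensembles can be sketched as follows. This is a generic hard majority vote, assumed here for illustration; the cited study may combine its classifiers differently, and the function name majority_vote and the prediction lists are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per patient
    by taking the most frequent label (hypothetical helper)."""
    combined = []
    # zip(*predictions) groups the labels that all classifiers gave one patient
    for labels in zip(*predictions):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


# Made-up outputs of three classifiers for four patients (1 = diabetic)
tree_pred = [1, 0, 1, 0]
bayes_pred = [1, 1, 0, 0]
svm_pred = [0, 1, 1, 0]

final = majority_vote([tree_pred, bayes_pred, svm_pred])
print(final)
```

With an odd number of binary classifiers a tie cannot occur, which is one practical reason ensembles of this kind often use three members.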
Lee and Kim (
11) evaluated the predictive power of different phenotypes using the hypertriglyceridemic waist (HW) phenotype and specific measurements, such as waist circumference (WC) and triglyceride (TG) levels, that constitute the HW phenotype. The study showed that WC was more strongly associated with T2DM than TG was. However, the results cannot be generalized to other populations, because the study population consisted only of Korean women and men (
11). Likewise, Lee et al. (
3) used a combination of anthropometric measures as input to two different machine learning algorithms to predict fasting plasma glucose (FPG) status. Their results indicate that using normalized data from the high and normal FPG groups can improve the prediction and reduce the model’s intrinsic bias toward the majority class. Moreover, Simon et al. (
5) aimed to discover sets of diabetes risk factors by applying association rule mining to electronic medical records (EMR). Similarly, Purushottam et al. (
18) designed a system that efficiently discovers rules for estimating the risk level of diabetes, using C4.5 rules and partial trees to evaluate the system.
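The two quantities on which association rule mining rests, support and confidence, can be illustrated with a short sketch. The patient records, item names, and helper functions below are invented for the example and are not taken from any of the cited studies.

```python
def count(transactions, itemset):
    """Number of records that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for record in transactions if items <= record)


def support(transactions, itemset):
    """Fraction of records that contain the whole itemset."""
    return count(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    n_antecedent = count(transactions, antecedent)
    if n_antecedent == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / n_antecedent


# Made-up records: each set lists the conditions noted for one patient
records = [
    {"obese", "hypertension", "diabetes"},
    {"obese", "diabetes"},
    {"obese"},
    {"hypertension"},
    {"obese", "hypertension", "diabetes"},
]

rule_support = support(records, {"obese", "diabetes"})
rule_confidence = confidence(records, {"obese"}, {"diabetes"})
print(rule_support, rule_confidence)
```

A rule such as "obese implies diabetes" is typically kept only if both values exceed user-chosen thresholds; in this toy data the rule covers three of five records and holds for three of the four obese patients.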