Proceedings Article | 21 April 2023
Zhihe Xu, Yifeng Zhu, Guojun Li, Jingpeng Yang
KEYWORDS: Random forests, Data modeling, Machine learning, Decision trees, Education and training, Evolutionary algorithms, Plasma, Glucose, Statistical modeling, Performance modeling
Background: The World Health Organization has identified diabetes as a non-communicable disease that requires constant attention. From 2006 to 2016, the number of deaths from diabetes worldwide increased by 31.1%. Diabetes is only slightly less harmful to the human body than cancer, a disease the general public widely fears, and it is the disease with the largest number of known complications. Once complications occur, drug treatment rarely restores the patient's health, and the direct death rate is high. Objective: To apply the random forest algorithm and the XGBoost algorithm to study the risk factors of diabetes and to build a high-quality risk prediction model. Methods: The data come from the Pima Indians diabetes dataset, which includes a total of eight features, such as the number of pregnancies, blood pressure, plasma glucose concentration, body mass index, and serum insulin concentration. The response variable is whether the patient has diabetes. The random forest and XGBoost algorithms were both applied, the prediction performance of the two models was compared, and feature importance scores were computed. Results: In the importance scores of the random forest model, plasma glucose concentration, body mass index, age, and the diabetes pedigree function scored particularly high, and the prediction accuracy of the random forest model was higher than that of the XGBoost model. Conclusion: The features that the random forest algorithm identifies as strongly associated with diabetes can be used to screen high-risk groups and to guide risk intervention measures, so that potential diabetic patients in the population are found early, receive early intervention and treatment, and the waste of medical resources is reduced.
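The idea of scoring features by importance in a random forest can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation and it does not cover XGBoost. It assumes synthetic data in place of the Pima Indians dataset, depth-1 trees (stumps) instead of full decision trees, and importance measured as the summed Gini impurity decrease, which may differ from the score used in the paper. The feature names, the coefficients in `make_data`, and all helper functions are hypothetical.

```python
import random

# Hypothetical stand-in features; this is NOT the Pima Indians dataset.
FEATURES = ["glucose", "bmi", "age", "noise"]

def make_data(n, rng):
    """Synthetic records (assumption): label depends strongly on glucose, weakly on BMI."""
    rows, labels = [], []
    for _ in range(n):
        glucose, bmi = rng.gauss(120, 30), rng.gauss(32, 7)
        age, noise = rng.uniform(21, 70), rng.gauss(0, 1)
        score = (glucose - 120) / 30 + 0.5 * (bmi - 32) / 7 + rng.gauss(0, 0.5)
        rows.append([glucose, bmi, age, noise])
        labels.append(1 if score > 0 else 0)
    return rows, labels

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def majority(labels):
    """Majority class of 0/1 labels (ties and empty lists give 0)."""
    return 1 if 2 * sum(labels) > len(labels) else 0

def fit_stump(rows, labels, candidates):
    """Best single split among the candidate features, by Gini impurity decrease."""
    parent, n = gini(labels), len(rows)
    best = None  # (gain, feature, threshold, left_class, right_class)
    for f in candidates:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or gain > best[0]:
                best = (gain, f, t, majority(left), majority(right))
    return best

def fit_forest(rows, labels, n_trees, rng):
    """Bagged stumps with a random feature subset per tree, plus normalized importances."""
    n, k = len(rows), len(rows[0])
    stumps, importance = [], [0.0] * k
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample of the rows
        candidates = rng.sample(range(k), 2)         # random subset of 2 features
        gain, f, t, left_cls, right_cls = fit_stump(
            [rows[i] for i in idx], [labels[i] for i in idx], candidates)
        stumps.append((f, t, left_cls, right_cls))
        importance[f] += gain                        # credit the feature that was split on
    total = sum(importance)
    return stumps, [v / total for v in importance]

def predict(stumps, row):
    """Majority vote over all stumps."""
    return majority([left if row[f] <= t else right for f, t, left, right in stumps])

rng = random.Random(0)
train_rows, train_labels = make_data(200, rng)
test_rows, test_labels = make_data(100, rng)
stumps, scores = fit_forest(train_rows, train_labels, n_trees=51, rng=rng)
importance = dict(zip(FEATURES, scores))
accuracy = sum(predict(stumps, r) == y
               for r, y in zip(test_rows, test_labels)) / len(test_rows)

for name, value in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} {value:.3f}")
print(f"held-out accuracy: {accuracy:.2f}")
```

Because the synthetic labels are driven mainly by the hypothetical `glucose` column, that feature should receive far more importance than `noise`. On real data, a library implementation with full-depth trees and a proper validation scheme would be the more appropriate choice.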