Data Mining

Dataset description

The Diabetes Health Indicators Dataset is derived from the Behavioral Risk Factor Surveillance System (BRFSS), a health-related telephone survey conducted annually by the Centers for Disease Control and Prevention (CDC). The dataset contains health information collected from over 250,000 individuals across the United States in 2015 and focuses on health indicators associated with the risk of diabetes.

The dataset includes the following key elements:

  • Number of Observations: 253,680 survey responses.
  • Number of Features: 21 feature variables related to health behaviors, chronic health conditions, and the use of preventive services.
  • Target Variable: Diabetes_binary, a binary variable where 1 indicates prediabetes or diabetes and 0 indicates no diabetes.
  • Classes: The target variable is imbalanced: respondents without diabetes (class 0) substantially outnumber those in class 1.
  • Data Source: The dataset is a cleaned version of the BRFSS 2015 data, focusing specifically on indicators related to diabetes.

The feature variables include a mix of categorical and continuous data, covering aspects such as body mass index (BMI), physical activity, smoking status, and other lifestyle-related factors that are crucial in predicting diabetes. The dataset is particularly valuable for building predictive models to assess diabetes risk and exploring the relationship between various health behaviors and diabetes.
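The counts listed above (number of observations, number of features, class balance) are easy to verify once the file is downloaded. The following is a minimal sketch using only the Python standard library. It assumes the data is a CSV with a header row and a Diabetes_binary column; the helper name summarize and the toy rows are illustrative only and are not taken from the real file.

```python
import csv
import io
from collections import Counter

TARGET = "Diabetes_binary"  # target column name as given in the dataset description


def summarize(csv_file, target=TARGET):
    """Return (n_rows, n_features, class_shares) for a CSV with a binary target."""
    reader = csv.DictReader(csv_file)
    counts = Counter()
    n_rows = 0
    for row in reader:
        # Labels may be stored as "0.0"/"1.0", so normalize via float -> int.
        counts[int(float(row[target]))] += 1
        n_rows += 1
    n_features = len(reader.fieldnames) - 1  # every column except the target
    shares = {label: count / n_rows for label, count in counts.items()}
    return n_rows, n_features, shares


# Toy stand-in for the real file (made-up rows, for illustration only).
sample = io.StringIO(
    "Diabetes_binary,BMI,Smoker,PhysActivity\n"
    "0.0,26,0,1\n"
    "0.0,31,1,0\n"
    "0.0,24,0,1\n"
    "1.0,35,1,0\n"
)
n_rows, n_features, shares = summarize(sample)
print(n_rows, n_features, shares)
```

To run the same check on the downloaded data, open the file with open(path, newline="") and pass the file object to summarize. If the description above is accurate, it should report 253,680 rows and 21 features, with class 0 holding the larger share.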

Justification

We selected the Diabetes Health Indicators Dataset for this project because of its relevance to public health and the opportunity it offers for building predictive models. Diabetes is a major chronic disease in the United States, where it affects millions of people and places a substantial burden on the healthcare system and the economy. The dataset is derived from the CDC's Behavioral Risk Factor Surveillance System (BRFSS), a well-established source of health-related data.

This dataset is particularly suitable for our project as it contains a large number of observations (over 250,000) and a variety of health indicators that are essential for predicting diabetes. The diversity and depth of the data allow for comprehensive analysis and model building, which are critical for understanding the risk factors associated with diabetes and developing effective predictive models.

Relevance

The Diabetes Health Indicators Dataset fits the project's goal of applying data mining techniques to a real-world problem with substantial societal impact. Its focus on diabetes-related health indicators makes it well suited to exploring the relationships between these indicators and the risk of developing diabetes. BRFSS survey data has been used successfully in similar studies, including the one by Zidian Xie et al., who applied machine learning techniques to predict type 2 diabetes. Their study demonstrated that such data can support predictive models that facilitate early diagnosis and intervention, which is a key objective of our project [1].

The value of this kind of modeling is further supported by public health research, such as the study published in the Morbidity and Mortality Weekly Report on the incidence of end-stage renal disease attributed to diabetes. That study emphasizes the importance of early detection and prevention of diabetes-related complications, reinforcing the value of predictive modeling efforts like ours [2].

By using this dataset, our project aims not only to replicate the success of previous studies but also to contribute new insights into diabetes risk prediction. The large sample size and broad set of health indicators provide a solid foundation for developing robust predictive models, making the dataset an excellent choice for our project.

References

  1. Xie, Z., Nikolayeva, O., Luo, J., & Li, D. (2019). Building Risk Prediction Models for Type 2 Diabetes Using Machine Learning Techniques. Preventing Chronic Disease, 16. DOI: 10.5888/pcd16.190109.

  2. Burrows, N. R., Hora, I., Geiss, L. S., & Gregg, E. W. (2017). Incidence of End-Stage Renal Disease Attributed to Diabetes Among Persons with Diagnosed Diabetes — United States and Puerto Rico, 2000-2014. Morbidity and Mortality Weekly Report, 66(43), 1165-1170. DOI: 10.15585/mmwr.mm6643a2.