Data Mining

Exploratory Data Analysis

Data Overview

This report summarizes a dataset of health indicators and diabetes status in a population. This initial analysis gives an overview of the data's structure, volume, and variable types.

Number of Observations and Variables

The dataset contains 253,680 observations and 22 variables: one target (Diabetes_binary) and 21 features that represent health indicators or demographic characteristics relevant to diabetes analysis.
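These counts can be checked directly from the loaded data. The following is a minimal sketch, assuming pandas is available; the helper name `describe_shape` and the file name mentioned in the comment are hypothetical, and a tiny in-memory frame stands in for the real file so the example runs on its own.

```python
import pandas as pd


def describe_shape(df: pd.DataFrame) -> dict:
    """Return the number of observations (rows) and variables (columns)."""
    n_obs, n_vars = df.shape
    return {"observations": n_obs, "variables": n_vars}


# Stand-in data so the sketch is self-contained. With the real file one would
# load it first, e.g. pd.read_csv("diabetes_health_indicators.csv")
# (hypothetical file name).
demo = pd.DataFrame(
    {
        "Diabetes_binary": [0, 1, 0],
        "HighBP": [1, 1, 0],
        "BMI": [28, 35, 22],
    }
)

shape_summary = describe_shape(demo)
print(shape_summary)
```

Run on the full dataset, the same call would be expected to report 253,680 observations and 22 variables.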

Variables and Data Types

| Variable | Role | Data Type | Description |
| --- | --- | --- | --- |
| Diabetes_binary | Target | Binary | 0 = no diabetes, 1 = prediabetes or diabetes |
| HighBP | Feature | Binary | 0 = no high blood pressure, 1 = high blood pressure |
| HighChol | Feature | Binary | 0 = no high cholesterol, 1 = high cholesterol |
| CholCheck | Feature | Binary | 0 = no cholesterol check in the past 5 years, 1 = cholesterol check in the past 5 years |
| BMI | Feature | Integer | Body Mass Index |
| Smoker | Feature | Binary | Smoked at least 100 cigarettes in lifetime. 0 = no, 1 = yes |
| Stroke | Feature | Binary | Ever told you had a stroke. 0 = no, 1 = yes |
| HeartDiseaseorAttack | Feature | Binary | Coronary heart disease (CHD) or myocardial infarction (MI). 0 = no, 1 = yes |
| PhysActivity | Feature | Binary | Physical activity in the past 30 days (excluding job). 0 = no, 1 = yes |
| Fruits | Feature | Binary | Consumes fruit 1 or more times per day. 0 = no, 1 = yes |
| Veggies | Feature | Binary | Consumes vegetables 1 or more times per day. 0 = no, 1 = yes |
| HvyAlcoholConsump | Feature | Binary | Heavy drinker (men > 14 drinks/week, women > 7 drinks/week). 0 = no, 1 = yes |
| AnyHealthcare | Feature | Binary | Has any healthcare coverage. 0 = no, 1 = yes |
| NoDocbcCost | Feature | Binary | Did not see a doctor due to cost in the past 12 months. 0 = no, 1 = yes |
| GenHlth | Feature | Integer | Self-assessed general health, scale 1-5 (1 = excellent, 5 = poor) |
| MentHlth | Feature | Integer | Days of poor mental health in the past 30 days (0-30) |
| PhysHlth | Feature | Integer | Days of poor physical health in the past 30 days (0-30) |
| DiffWalk | Feature | Binary | Serious difficulty walking or climbing stairs. 0 = no, 1 = yes |
| Sex | Feature | Binary | 0 = female, 1 = male |
| Age | Feature | Integer | 13-level age category (1 = 18-24, 13 = 80 or older) |
| Education | Feature | Integer | Education level, scale 1-6 (1 = no school, 6 = college graduate) |
| Income | Feature | Integer | Income level, scale 1-8 (1 = less than $10,000, 8 = $75,000 or more) |
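The coded ranges listed for each variable can be turned into a simple sanity check. The sketch below is illustrative only: the `EXPECTED_RANGES` mapping and the helper `find_range_violations` are hypothetical names, only a subset of variables is listed, and MentHlth and PhysHlth are assumed to run from 0 to 30, consistent with the 31 unique values reported for each.

```python
import pandas as pd

# Allowed (min, max) per variable, following the variable descriptions.
# Binary variables are expressed as the range (0, 1).
EXPECTED_RANGES = {
    "Diabetes_binary": (0, 1),
    "HighBP": (0, 1),
    "GenHlth": (1, 5),
    "MentHlth": (0, 30),  # assumed 0-30 (31 distinct values)
    "PhysHlth": (0, 30),  # assumed 0-30 (31 distinct values)
    "Age": (1, 13),
    "Education": (1, 6),
    "Income": (1, 8),
}


def find_range_violations(df: pd.DataFrame, expected: dict) -> dict:
    """Map each checked column to the number of values outside its range."""
    violations = {}
    for column, (low, high) in expected.items():
        if column not in df.columns:
            continue  # skip variables absent from this frame
        # between() is inclusive on both ends; missing values count as violations
        out_of_range = ~df[column].between(low, high)
        count = int(out_of_range.sum())
        if count:
            violations[column] = count
    return violations


demo = pd.DataFrame(
    {
        "HighBP": [0, 1, 1],
        "GenHlth": [1, 5, 7],  # 7 is outside the 1-5 scale
        "MentHlth": [0, 30, 12],
    }
)
violations = find_range_violations(demo, EXPECTED_RANGES)
print(violations)
```

An empty result on the real data would suggest the encoding matches the documented scales for the variables checked.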

Data Visualization

Unique Value Counts

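| Variable | Unique Values |
| --- | --- |
| Diabetes_binary | 2 |
| HighBP | 2 |
| HighChol | 2 |
| CholCheck | 2 |
| BMI | 84 |
| Smoker | 2 |
| Stroke | 2 |
| HeartDiseaseorAttack | 2 |
| PhysActivity | 2 |
| Fruits | 2 |
| Veggies | 2 |
| HvyAlcoholConsump | 2 |
| AnyHealthcare | 2 |
| NoDocbcCost | 2 |
| GenHlth | 5 |
| MentHlth | 31 |
| PhysHlth | 31 |
| DiffWalk | 2 |
| Sex | 2 |
| Age | 13 |
| Education | 6 |
| Income | 8 |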
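Counts like these can be produced with a per-column distinct-value count. The following is a minimal sketch, assuming pandas; the helper name `unique_value_counts` is hypothetical and the small frame is stand-in data, not the real dataset.

```python
import pandas as pd


def unique_value_counts(df: pd.DataFrame) -> pd.Series:
    """Count distinct non-missing values in each column."""
    return df.nunique()


demo = pd.DataFrame(
    {
        "Smoker": [0, 1, 1, 0],
        "GenHlth": [1, 3, 3, 5],
        "BMI": [22, 27, 31, 40],
    }
)
counts = unique_value_counts(demo)
print(counts.to_dict())
```

A count of 2 is a quick confirmation that a column is binary, while larger counts (such as those for BMI, MentHlth, and PhysHlth) point to variables worth examining for distribution shape and outliers.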

Outliers

Outliers are data points that differ significantly from the rest of the data and can distort the results of statistical analysis. Identifying and handling them appropriately is important for the accuracy and reliability of the analysis.

[Figure: distribution]
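The report does not state which detection method was used. As one common option, the sketch below applies the interquartile range (IQR) rule, flagging values more than 1.5 IQRs beyond the quartiles. The helper name `iqr_outlier_mask`, the multiplier `k`, and the sample BMI values are all illustrative assumptions.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Made-up BMI values with one extreme entry.
bmi = pd.Series([22, 24, 25, 27, 28, 30, 31, 95], name="BMI")
mask = iqr_outlier_mask(bmi)
outliers = bmi[mask].tolist()
print(outliers)
```

The rule suits roughly continuous variables such as BMI; applying it to binary or short ordinal columns is usually not meaningful, since a rare but valid category can be flagged as an outlier.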

Data Distribution

As the graph shows, the classes of the target variable are imbalanced. Techniques that compensate for this, such as resampling or class weighting, will therefore be needed.

[Figure: distribution]
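The report does not say which balancing technique will be applied. As a hedged illustration, the sketch below measures the class shares of the target and then applies random undersampling of the majority class, which is only one of several options. The helper names `class_shares` and `undersample_majority`, the seed, and the sample data are assumptions made for this example.

```python
import pandas as pd


def class_shares(target: pd.Series) -> dict:
    """Return the fraction of observations in each class."""
    return target.value_counts(normalize=True).to_dict()


def undersample_majority(df: pd.DataFrame, target_col: str, seed: int = 0) -> pd.DataFrame:
    """Randomly downsample every class to the size of the smallest one."""
    smallest = df[target_col].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=seed)
        for _, group in df.groupby(target_col)
    ]
    return pd.concat(parts).reset_index(drop=True)


# Made-up data: six negatives and two positives.
demo = pd.DataFrame(
    {
        "Diabetes_binary": [0, 0, 0, 0, 0, 0, 1, 1],
        "BMI": [22, 24, 25, 27, 28, 30, 33, 41],
    }
)
before = class_shares(demo["Diabetes_binary"])
balanced = undersample_majority(demo, "Diabetes_binary")
after = class_shares(balanced["Diabetes_binary"])
print(before, after)
```

Undersampling discards majority-class rows, so with a strongly skewed target it can throw away a large part of the data. Oversampling the minority class or passing class weights to the model are alternatives that keep every observation, and any resampling should be applied to the training split only.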

Selected Variables

The graph shows the variables selected after a cleaning process.

[Figure: distribution]