Introduction

This insurance dataset contains information on insurance customers, such as age, BMI, region and their insurance charges. I am tasked with exploring these customer characteristics and identifying the factors that contribute to insurance charges. I will demonstrate my exploratory data analysis skills by answering the following questions about this dataset:

Goals for the project:

Find the average age of the patients, the average insurance charge and the average BMI.
Where are the majority of the individuals from?
How do smoking, age, BMI and sex affect insurance charges?
What is the average insurance cost for people with at least one child?
Who pays more in insurance charges, males or females?
What is the relationship between BMI and insurance charges? (Find the correlation between the two.)
What is the relationship between smoking status and insurance charges? (Find the association between the two.)

Exploratory Data Analysis:

In [2]:

#Importing relevant libraries
    import scipy.stats
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    import seaborn as sns
    

In [3]:

#Load csv file into a dataframe
    insurance = pd.read_csv('insurance.csv')
    
    #Let's look at the first few rows
    insurance.head()
    #So in this dataset, we have index, age, sex, bmi, children, smoker, region and charges as our columns.
    #The index column appears to just repeat the row number, and the other columns are self-explanatory,
    #so I will not spend time explaining them.
    

Out[3]:

	index	age	sex	bmi	children	smoker	region	charges
0	0	19	female	27.900	0	yes	southwest	16884.92400
1	1	18	male	33.770	1	no	southeast	1725.55230
2	2	28	male	33.000	3	no	southeast	4449.46200
3	3	33	male	22.705	0	no	northwest	21984.47061
4	4	32	male	28.880	0	no	northwest	3866.85520

In [4]:

#Okay, let us look at how many null values, missing data, and data types of each column in our dataset.
    insurance.info()
    

<class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1338 entries, 0 to 1337
    Data columns (total 8 columns):
     #   Column    Non-Null Count  Dtype  
    ---  ------    --------------  -----  
     0   index     1338 non-null   int64  
     1   age       1338 non-null   int64  
     2   sex       1338 non-null   object 
     3   bmi       1338 non-null   float64
     4   children  1338 non-null   int64  
     5   smoker    1338 non-null   object 
     6   region    1338 non-null   object 
     7   charges   1338 non-null   float64
    dtypes: float64(2), int64(3), object(3)
    memory usage: 68.0+ KB

In [7]:

#So, we can see that we have 8 columns and 1338 rows of data. At first glance, we have no missing values in our dataset.
    #Also, the data type of each column is correct.
    
    #Let's check whether any cell in our dataset is NaN, which would mean there is missing data.
    np.where(pd.isnull(insurance)) #if the result is a pair of empty arrays, then there is no missing data.
    

Out[7]:

(array([], dtype=int32), array([], dtype=int32))

In [8]:

#So we have no rows with NaN as the entry. 
    #Let's count the number of missing values in each column
    insurance.isna().sum()
    

Out[8]:

index       0
    age         0
    sex         0
    bmi         0
    children    0
    smoker      0
    region      0
    charges     0
    dtype: int64

Okay, so the sum for each column is zero, which means there are no missing values. Great, this dataset is fairly clean.

In [276]:

#To answer our first question, what is the average age of the patients?
    average_age = insurance['age'].mean()
    print('The average age is' + ' ' + str(average_age))
    

The average age is 39.20702541106129

In [37]:

#So, the average age is 39 years old.
    #I want to see the distribution of the ages of the people in our dataset to get an idea of the variability.
    plt.hist(insurance.age, bins=40)
    plt.show()
    

In [280]:

#It was appropriate to use mean to find the average age in our dataset because the distribution for age is not skewed.
    #Examining the distribution, we see a spike just before age 20. 
    #So let's quickly calculate the mode.
    mode_age = insurance.age.mode()
    mode_age
    #Here we see that 18 is the most common age in the dataset, even though the average age is 39 years.
    

Out[280]:

0    18
    Name: age, dtype: int64

In [281]:

#Let's find out what is the average BMI for the people in our dataset.
    avg_bmi = insurance.bmi.mean()
    print('The average BMI in our dataset is' + ' ' + str(avg_bmi))
    #The average BMI is 30.66.
    

The average BMI in our dataset is 30.663396860986538

In [186]:

#Let's find out the average insurance charge in our dataset
    avg_charge = insurance.charges.mean()
    avg_charge
    #The average insurance charge is 13,270.
    

Out[186]:

13270.422265141257
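The three averages above can also be pulled together in one table, next to the medians, which helps show whether a mean is being pulled up by large values. The sketch below is only an illustration: it rebuilds the five rows shown by insurance.head() into a small frame called head_rows (a name introduced here), and summarize_numeric is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# The five rows printed by insurance.head() earlier, rebuilt by hand for illustration.
head_rows = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})


def summarize_numeric(df, columns=("age", "bmi", "charges")):
    """Return a table with the mean and median of each requested column."""
    return df[list(columns)].agg(["mean", "median"])


summary = summarize_numeric(head_rows)
print(summary.round(2))
```

On the full dataset, summarize_numeric(insurance) would produce the same kind of table for all 1338 rows.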

Where are the majority of the individuals from?

In [51]:

#Now let's see what region most of our patients are from.
    insurance.region.value_counts()
    

Out[51]:

southeast    364
    southwest    325
    northwest    325
    northeast    324
    Name: region, dtype: int64
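To check whether any region really holds a majority, the counts above can be turned into shares. This is a small sketch that copies the four counts from the output above into a Series; region_counts and region_share are names introduced here for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
region_counts = pd.Series(
    {"southeast": 364, "southwest": 325, "northwest": 325, "northeast": 324}
)

# Share of all people that each region accounts for.
region_share = region_counts / region_counts.sum()
print(region_share.round(3))
```

On the full frame, insurance.region.value_counts(normalize=True) should return the same shares directly. No region is above 50%, so the southeast is the largest group rather than a strict majority.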

In [282]:

#It is worth pointing out that there are over 300 patients from each region.
    #So we don't have to worry much about one region dominating the results, because each region is well represented.
    #Let's visualize the number of people in each region
    insurance.region.value_counts().plot.bar()
    plt.ylabel('# of People')
    plt.show()
    

The Effect of Smoking on Insurance Costs

In [144]:

#Okay, let's now see how insurance costs differ for smokers vs non smokers.
    #So, let's find the mean and median insurance charges for smokers and non smokers.
    cost_smokers_mean = insurance.groupby('smoker').charges.mean()
    cost_smokers_mean
    #We see that there is a large difference in insurance charges between those who smoke and those who don't.
    #People who do not smoke pay 8434 on average and people who smoke pay 32050 on average.
    #People who smoke pay more in insurance charges.
    

Out[144]:

smoker
    no      8434.268298
    yes    32050.231832
    Name: charges, dtype: float64

The mean charge for non smokers is 8,434 dollars. The mean charge for smokers is 32,050 dollars.

In [187]:

#print(insurance.to_string())

In [103]:

#Here we can see that smokers pay way more in insurance charges. 
    #But let's visualize the distribution of charges individually for smokers and non smokers to see what's happening.
    smoker_no = insurance[insurance['smoker'] =='no'].charges
    smoker_yes = insurance[insurance['smoker'] =='yes'].charges
    
    #Histogram of people who do not smoke
    
    plt.subplot(1,2,1)
    plt.hist(smoker_no)
    plt.subplots_adjust(wspace=0.3)
    plt.title('Non smokers')
    #Histogram of people who do smoke
    plt.subplot(1,2,2)
    plt.hist(smoker_yes)
    plt.title('Smokers')
    
    plt.show()
    plt.clf()
    

<Figure size 640x480 with 0 Axes>

Looking at these two graphs side by side, the distribution for the non smokers, on the left, is right skewed. This suggests that there are outliers in the data. The distribution for the smokers, on the right, is not all that skewed. Because the non smokers' distribution is right skewed, we will compare the median charge instead of the mean charge. The median is robust, so it handles outliers well.

In [168]:

#So let's find the median charge for smokers and non smokers.
    cost_smokers_median = insurance.groupby('smoker').charges.median()
    cost_smokers_median
    

Out[168]:

smoker
    no      7345.40530
    yes    34456.34845
    Name: charges, dtype: float64

So, the median insurance charge for people who do not smoke is 7,345, and the median insurance charge for people who do smoke is 34,456. The insurance charge is still much higher for people who smoke. This suggests that smoking is a major contributing factor to insurance charges.

Let's go a step further and test the association between smoker and charges using the median difference. I will find the median charge for those who smoke and for those who do not, and then subtract one from the other.

In [284]:

smoker_no_median = insurance[insurance['smoker'] =='no'].charges.median()
    smoker_yes_median = insurance[insurance['smoker'] =='yes'].charges.median()
    median_diff_smoker = smoker_yes_median - smoker_no_median
    print('The median difference between smokers and non smokers is' + ' ' + str(median_diff_smoker))
    

The median difference between smokers and non smokers is 27110.943150000006

Highly associated variables have a large mean or median difference between groups. Here, the median difference between smokers and non smokers is 27,110. But what counts as "large" in this case? That depends on the spread of charges within each group: a difference is large when it is big relative to that spread. A good way of seeing the spread of the two groups is to plot a side by side boxplot.
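The idea of judging the median difference against the spread can be sketched numerically as well. The helper names quartiles_by_group and boxes_overlap and the toy charges below are made up for illustration, and checking whether the interquartile ranges overlap is just one informal way to read a boxplot, not a formal test.

```python
import pandas as pd


def quartiles_by_group(df, group_col, value_col):
    """Return the 25th, 50th and 75th percentiles of value_col for each group."""
    q = df.groupby(group_col)[value_col].quantile([0.25, 0.5, 0.75]).unstack()
    q.columns = ["q1", "median", "q3"]
    return q


def boxes_overlap(quartiles, group_a, group_b):
    """True if the interquartile ranges (the boxes of a boxplot) of two groups overlap."""
    a, b = quartiles.loc[group_a], quartiles.loc[group_b]
    return bool(a["q1"] <= b["q3"] and b["q1"] <= a["q3"])


# Hypothetical toy charges, NOT rows from insurance.csv.
toy = pd.DataFrame({
    "smoker": ["no"] * 5 + ["yes"] * 5,
    "charges": [1000, 2000, 3000, 4000, 5000, 20000, 25000, 30000, 35000, 40000],
})

toy_quartiles = quartiles_by_group(toy, "smoker", "charges")
print(toy_quartiles)
print("Boxes overlap:", boxes_overlap(toy_quartiles, "no", "yes"))
```

On the real data, the same helpers could be called as quartiles_by_group(insurance, 'smoker', 'charges').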

In [174]:

#Let's see the spread of the data with smoker and non smoker side by side.
    sns.boxplot(data=insurance,x='smoker',y='charges')
    plt.show()
    #Amazing plot below! Notice the huge gap between the two boxes, this is what a "large" mean/median difference looks like. 
    #There is no overlap AT ALL between the boxes in our boxplot, signifying a huge median difference and a strong association between these two variables.
    

Effect of Region on Insurance Charges

In [108]:

#Now, let's see if there is any difference in insurance cost based on region. We will do this by finding the mean and median
    region_charges_mean = insurance.groupby('region').charges.mean()
    region_charges_mean
    

Out[108]:

region
    northeast    10057.652025
    northwest     8965.795750
    southeast     9294.131950
    southwest     8798.593000
    Name: charges, dtype: float64

In [109]:

region_charges_median = insurance.groupby('region').charges.median()
    region_charges_median
    

Out[109]:

region
    northeast    10057.652025
    northwest     8965.795750
    southeast     9294.131950
    southwest     8798.593000
    Name: charges, dtype: float64

In this case, the mean and median outputs are the same for every region, which would mean that there aren't any significant outliers within the regions. An exact match to every decimal place is unusual, though, so the mean cell is worth re-running to confirm. Taking the outputs as shown, the typical insurance charge is 10,057 for the Northeast region, 8,965 for the Northwest region, 9,294 for the Southeast region and 8,798 for the Southwest region. People in the Northeast region pay the most in insurance charges. But the charges are not that far apart across the regions. So, region does not appear to be a major contributing factor to insurance charges.

Effect of Sex on Insurance Charges

In [147]:

#Let's see how insurance costs differ for males and females by finding the mean for each sex
    
    sex_mean = insurance.groupby('sex').charges.mean()
    sex_mean
    

Out[147]:

sex
    female    12569.578844
    male      13956.751178
    Name: charges, dtype: float64

In [176]:

#Let's also look at the median
    
    sex_median = insurance.groupby('sex').charges.median()
    sex_median
    

Out[176]:

sex
    female    9412.96250
    male      9369.61575
    Name: charges, dtype: float64

Whoa, the mean is greater than the median for both sexes. When this happens, the distribution is right skewed, so we have outliers in the data. Because of these outliers, we will use the median instead of the mean. The median insurance charge is 9,412 for females and 9,369 for males. This means that there isn't much of a difference in insurance charge by sex, so sex does not appear to be a contributing factor to insurance charges. If we were to plot a boxplot of male and female insurance charges, we would expect to see a lot of overlap between the two boxes.

In [293]:

#Let's test and see if there will be overlap in the boxplots between male and female charges
    sns.boxplot(data=insurance, x = 'sex', y='charges')
    plt.title('Insurance Charges vs Sex')
    plt.show()
    

See, the line in the middle of each box, which denotes the median, is almost aligned, and there is plenty of overlap between the two boxes. The median difference is small. Therefore, we can conclude that there is no strong association between sex and insurance charges.
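The rule of thumb used in this section (a mean well above the median points to a right skewed distribution) can be checked with a quick table of mean, median and skewness per group. The numbers below are hypothetical toy charges, not rows from insurance.csv, and center_and_skew is a helper name introduced here.

```python
import pandas as pd

# Hypothetical toy charges: the female group has one very large value, the male group is symmetric.
toy = pd.DataFrame({
    "sex": ["female"] * 5 + ["male"] * 5,
    "charges": [1000, 2000, 3000, 4000, 30000, 1000, 2000, 3000, 4000, 5000],
})


def center_and_skew(df, group_col, value_col):
    """Return the mean, median and sample skewness of value_col for each group."""
    return df.groupby(group_col)[value_col].agg(["mean", "median", "skew"])


skew_table = center_and_skew(toy, "sex", "charges")
print(skew_table)
```

A positive skew value together with a mean above the median is consistent with a right skewed group, while a skew near zero suggests the mean and median will roughly agree.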

Effect of Children on Insurance Charges

In [23]:

#Next, we want to know: do people with no children pay more than people with at least 1 child?
    #To do this, we will find the mean and median insurance charges for each value of the children variable.
    children_mean = insurance.groupby(insurance.children).charges.mean()
    print(children_mean)
    

children
    0    12365.975602
    1    12731.171832
    2    15073.563734
    3    15355.318367
    4    13850.656311
    5     8786.035247
    Name: charges, dtype: float64

In [24]:

#So, here we have a breakdown of the mean insurance charge for people with 0 to 5 children. 
    #Let's have a look at the median.
    children_median = insurance.groupby(insurance.children).charges.median()
    children_median
    

Out[24]:

children
    0     9856.95190
    1     8483.87015
    2     9264.97915
    3    10600.54830
    4    11033.66170
    5     8589.56505
    Name: charges, dtype: float64

So, we see that the mean is greater than the median in every group. When this happens, it means the distribution is right skewed, which means we have some outliers. Because we do have outliers, we are going to use the median to summarize insurance charges for people with and without children. The median charge is 9,856 for people with no children, 8,483 with 1 child, 9,264 with 2 children, 10,600 with 3 children, 11,033 with 4 children and 8,589 with 5 children. So people with 4 children pay the most in insurance charges, even though 4 is not the largest number of children in our dataset (five is). I would conclude that people with more children do not consistently pay more in insurance charges, because the median is fairly even across the board. There isn't any huge difference between the six categories.

In [287]:

#Let's visualise charges against number of children.
    plt.scatter(insurance.children,insurance.charges)
    plt.title('Insurance Charges vs Number of Children')
    plt.xlabel('# of children')
    plt.ylabel('Insurance Charges')
    plt.show()
    

In [184]:

#Let's calculate the pearson correlation between children and charges
    corr_child_charges,p = scipy.stats.pearsonr(insurance.children, insurance.charges)
    corr_child_charges
    
    #This result of 0.07 suggests that there is almost no linear relationship between the number of children you have and your insurance charges.
    

Out[184]:

0.06799822684790481
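One of the project goals asks for the average insurance cost for people with at least one child, which the per-count tables above do not give directly. A minimal sketch of that comparison follows, using only the five rows shown by insurance.head(), so the numbers are illustrative and not the answer for the full dataset; head_rows, has_child and charges_by_parent are names introduced here.

```python
import pandas as pd

# The children and charges columns of the five insurance.head() rows, rebuilt by hand.
head_rows = pd.DataFrame({
    "children": [0, 1, 3, 0, 0],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Label each person by whether they have at least one child.
has_child = head_rows["children"].ge(1).map({True: "has children", False: "no children"})

# Mean, median and group size of charges for each label.
charges_by_parent = (
    head_rows.assign(has_child=has_child)
    .groupby("has_child")["charges"]
    .agg(["mean", "median", "count"])
)
print(charges_by_parent.round(2))
```

Replacing head_rows with the full insurance frame would give the figure the goal asks for.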

Effect of BMI on Insurance Charges

To see how BMI affects insurance charges, I am going to group BMI into 4 categories, following the adult body mass index ranges on the CDC website. If your BMI is less than 18.5, it falls within the underweight range. If your BMI is 18.5 to <25, it falls within the healthy weight range. If your BMI is 25.0 to <30, it falls within the overweight range. If your BMI is 30.0 or higher, it falls within the obesity range.

In [288]:

#So, let's group each person into a category and find the median of their charges.
    underweight = insurance.groupby(insurance.bmi[insurance.bmi < 18.5]).charges.median()
    h_weight = insurance.groupby(insurance.bmi[(insurance['bmi'] >= 18.5) & (insurance['bmi'] < 25.00)]).charges.median()
    overweight = insurance.groupby(insurance.bmi[(insurance['bmi'] >= 25) & (insurance['bmi'] < 30.00)]).charges.median()
    obese = insurance.groupby(insurance.bmi[insurance['bmi'] >= 30.00]).charges.median()
    
    
    #Each variable above holds one median per distinct BMI value within its range, because groupby groups on the exact BMI value. So we go a step further and take the median of those medians, which is not necessarily the same as the median of all charges in the category.
    underweight_median = underweight.median()
    h_weight_median = h_weight.median()
    overweight_median = overweight.median()
    obese_median = obese.median()
    print('The median insurance charge for underweight people is' +  ' ' + str(underweight_median))
    print('The median insurance charge for healthy weight people is' +  ' ' + str(h_weight_median))
    print('The median insurance charge for overweight people is' +  ' ' + str(overweight_median))
    print('The median insurance charge for obese people is' +  ' ' + str(obese_median))
    
    # pd.set_option('display.max_rows', None)
    #print(h_weight.mean())
    

The median insurance charge for underweight people is 5116.5004
    The median insurance charge for healthy weight people is 8605.3615
    The median insurance charge for overweight people is 9253.8685
    The median insurance charge for obese people is 10767.387579999999

Based on the above print statements, overweight and obese people pay the most in insurance charges, with obese people paying the most at 10,767. This was expected, because obesity is often associated with more health problems, which would lead to higher insurance charges. While obese people pay the most, the differences between categories are modest: about 3,500 separates the charges for underweight and healthy weight people, and about 1,500 separates the charges for overweight and obese people.
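An alternative way to get one median per CDC category is to label each row with its category first and then group on that label, which gives the median of all charges in the category rather than a median of per-BMI medians. This is a sketch on the five insurance.head() rows only, so its numbers are not comparable to the printed results above; head_rows, bmi_category and the label strings are choices made here.

```python
import pandas as pd

# The bmi and charges columns of the five insurance.head() rows, rebuilt by hand.
head_rows = pd.DataFrame({
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Left-closed bins: [0, 18.5), [18.5, 25), [25, 30), [30, inf), matching the ranges quoted earlier.
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["underweight", "healthy weight", "overweight", "obese"]
bmi_category = pd.cut(head_rows["bmi"], bins=bins, labels=labels, right=False)

# observed=True keeps only the categories that actually occur in the data.
median_by_category = head_rows.groupby(bmi_category, observed=True)["charges"].median()
print(median_by_category)
```

With right=False a BMI of exactly 25 or 30 falls into the higher category, which matches the "18.5 to <25" and "25.0 to <30" wording of the ranges.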

In [159]:

#To really determine the relationship, let's check the Pearson correlation between BMI and insurance charges.
    corr_bmi_charges, p_value = scipy.stats.pearsonr(insurance.bmi, insurance.charges)
    corr_bmi_charges
    #The Pearson correlation is 0.198. Let's say 0.2
    #I am disappointed that the correlation is this low. It's close to zero.
    #0.2 tells us that there is a positive linear association between BMI and charges, but it is a weak one.
    

Out[159]:

0.1983409688336289

In [294]:

#Let's visualize the relationship between BMI and insurance charges
    plt.scatter(insurance.bmi, insurance.charges)
    plt.title('Insurance Charges vs BMI')
    plt.xlabel('BMI')
    plt.ylabel('Insurance charges')
    plt.show()
    #While there is a lot of variation in the plot, there is also a small trend showing people with higher BMI have higher insurance charges.
    

Effect of Age on Insurance Charges

In [33]:

#Let's find the youngest and the oldest age in the dataset, find the median insurance charge for each of those ages, and then compare them.
    #So first, I'll find the median insurance charge for 18 year olds.
    youngestage_median = insurance.groupby(insurance.age[insurance['age'] == 18]).charges.median()
    print(youngestage_median)
    #Next we'll find the median charge for 64 year olds, the oldest age in our dataset
    oldestage_median = insurance.groupby(insurance.age[insurance['age'] == 64]).charges.median()
    print(oldestage_median)
    #Here we see that for the youngest people, the median insurance charge is 2,198.
    #For the oldest people, the median insurance charge is 15,528.
    #We can conclude that older people pay more in insurance, which makes sense because they are more likely to be prone to illnesses.
    

age
    18.0    2198.18985
    Name: charges, dtype: float64
    age
    64.0    15528.758375
    Name: charges, dtype: float64

In [290]:

#Let's visualize the relationship between insurance charges and age
    plt.scatter(insurance.age,insurance.charges)
    plt.title('Insurance Charges vs Age')
    plt.xlabel('Age')
    plt.ylabel('Insurance Charges')
    plt.show()
    

We see a slight upward trend in this graph. As age increases, so do insurance charges.

In [291]:

#Let's see what is the correlation between age and insurance charges though.
    corr_age_charges, p = scipy.stats.pearsonr(insurance.age, insurance.charges)
    corr_age_charges
    #We have a correlation of 0.3 here. It's not a strong linear association but there is still one as demonstrated
    #by our analysis where we found 64 year olds paying more in insurance cost than 18 year olds on average.
    #The relationship was also demonstrated by our scatter plot where we see the upward trend.
    

Out[291]:

0.29900819333064765

In [163]:

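#I will create a heatmap of the correlation between each pair of numeric variables to quickly summarize their linear association.
    #select_dtypes keeps only the numeric columns, since recent pandas versions raise an error when corr() meets text columns like sex, smoker and region.
    sns.heatmap(insurance.select_dtypes(include='number').corr())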
    plt.title('Correlation Heatmap of Variables')
    

Out[163]:

Text(0.5, 1.0, 'Correlation Heatmap of Variables')

Examining our heatmap to see what correlates with charges, we see that both age and BMI have a correlation with charges. From the calculations above, the correlation with charges is 0.3 for age and 0.2 for BMI.
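The heatmap only covers numeric columns, so smoker, the variable with the largest group difference in this analysis, does not appear in it. One common workaround is to encode smoker as 0/1 and correlate that flag with charges. The sketch below does this on the five insurance.head() rows purely to show the mechanics; with so few rows the value itself is not meaningful, and head_rows and smoker_flag are names introduced here.

```python
import pandas as pd

# The smoker and charges columns of the five insurance.head() rows, rebuilt by hand.
head_rows = pd.DataFrame({
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# 1 for smokers, 0 for non smokers.
smoker_flag = (head_rows["smoker"] == "yes").astype(int)

# Pearson correlation between the 0/1 flag and the numeric charges column.
corr_smoker_charges = smoker_flag.corr(head_rows["charges"])
print(round(corr_smoker_charges, 3))
```

On the full frame, the flag could be added as a new column before calling corr(), so that smoking shows up alongside age and BMI in the heatmap.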

FINDINGS:

The average age of the population in the dataset is 39 years old. The average insurance charge is 13,270 and the average BMI is 30.7.
The southeast is the most represented region with 364 of the 1,338 people, but the four regions are close to evenly represented.
Smoking is the biggest contributor to insurance charges. Age and BMI are slight factors, with only weak correlations with charges (about 0.3 and 0.2). Children and sex are the least contributing factors to insurance charges.
People with four children have the highest median insurance charge (11,033), while people with five children, the most in the dataset, have one of the lowest (8,589). There is almost no correlation between number of children and insurance charges (0.07).
When the median is used to limit the effect of outliers, females pay slightly more in insurance charges (9,412 vs 9,369). However, sex as a standalone variable is not a strong contributing factor to insurance charges.
The relationship between BMI and charges is not a strong association, but there exists an association. I found that people with a BMI that falls within the obese category pay the most in insurance, while underweight people pay the least.
There is a strong relationship between smoker and charges. I found that people who smoke pay way more in insurance charges than people who do not smoke. Smoking is the biggest contributor to high insurance charges.

Based on my analysis, quitting smoking and having a healthy BMI could reduce insurance charges.

For future consideration:

I could analyze how many of the smokers live in the Northeast and see if this contributes to it being the region paying the most in insurance charges.
Assess the proportion of those who smoke in the dataset and how many that do not smoke.
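Both follow-up ideas can be sketched with value_counts and crosstab. The example below runs on the five insurance.head() rows only, so the shares it prints say nothing about the full dataset; head_rows, smoker_share and smoker_share_by_region are names introduced here.

```python
import pandas as pd

# The smoker and region columns of the five insurance.head() rows, rebuilt by hand.
head_rows = pd.DataFrame({
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
})

# Proportion of smokers and non smokers overall.
smoker_share = head_rows["smoker"].value_counts(normalize=True)
print(smoker_share)

# Proportion of smokers and non smokers within each region (each row sums to 1).
smoker_share_by_region = pd.crosstab(
    head_rows["region"], head_rows["smoker"], normalize="index"
)
print(smoker_share_by_region)
```

Run on the full insurance frame, the second table would show whether the Northeast has a higher share of smokers than the other regions.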
