The Final project in this class

 Issaiah Jennings 

Final Project


Introduction


My project aimed to determine whether smoking status and region (northeast, southeast, southwest, northwest) significantly impact insurance charges. Specifically, I wanted to see if smokers incur higher charges compared to non-smokers and whether insurance costs differ across regions. Understanding these relationships could help insurance providers identify how lifestyle and geographic factors influence healthcare costs, ultimately guiding pricing strategies and market targeting.


Methods


This project closely relates to concepts from class, particularly hypothesis testing and statistical analysis. For example, I used a t-test to compare the average insurance charges between smokers and non-smokers, a method we learned to analyze differences between two groups. Similarly, I applied ANOVA (Analysis of Variance) to test for differences in average charges across multiple groups, in this case, the regions. These methods align with the class discussions on testing means across two or more categories. Additionally, I explored correlations and regression analysis, which helped evaluate how variables like age and BMI influence charges, further reinforcing concepts from class.
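The correlation and regression step mentioned above does not appear in the transcript later in this post, so here is a minimal sketch of what it could look like. It runs on a small simulated data frame (`toy`, a hypothetical stand-in with made-up values and an assumed linear rule for charges) rather than on insurance.csv, so the numbers it prints say nothing about the real data.

```r
# Simulated stand-in for the insurance data (all values are made up).
set.seed(1)
n <- 200
toy <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  bmi    = round(rnorm(n, mean = 30, sd = 6), 1),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)))
)

# Assumption: charges follow a simple linear rule plus random noise.
toy$charges <- 250 * toy$age + 300 * toy$bmi +
  20000 * (toy$smoker == "yes") + rnorm(n, sd = 3000)

# Pairwise Pearson correlations among the numeric variables
cor_matrix <- cor(toy[, c("age", "bmi", "charges")])
print(round(cor_matrix, 2))

# Multiple linear regression of charges on age, BMI, and smoking status
fit <- lm(charges ~ age + bmi + smoker, data = toy)
print(summary(fit))
```

With the real dataset loaded as `data`, the same calls would presumably be `cor(data[, c("age", "bmi", "charges")])` and `lm(charges ~ age + bmi + smoker, data = data)`.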

Results



To approach the problem, I began by preparing the dataset. It contains 1,338 records with no missing values and includes the variables age, sex, BMI, children, smoking status, region, and insurance charges. I converted the categorical variables (sex, smoker, and region) into factors so that R would treat them as groups in the statistical tests. For the t-test, I compared the mean insurance charges of smokers and non-smokers to evaluate whether smoking status is associated with a significant difference in costs.


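For the t-test, the null hypothesis (H₀) was that mean charges are the same for smokers and non-smokers, and the alternative hypothesis (H₁) was that smokers have higher charges. The Welch two-sample t-test in the transcript below, which R runs as a two-sided test by default, found a significant difference: smokers averaged about 32,050 in charges compared with about 8,434 for non-smokers (t = -32.75, df ≈ 312, p < 2.2e-16). The 95% confidence interval for the difference in means (non-smokers minus smokers) runs from about -25,035 to -22,197. For the ANOVA, the null hypothesis (H₀) was that mean charges are equal across the four regions, and the alternative hypothesis (H₁) was that at least one region's mean differs. This test indicated that region has a statistically significant effect on insurance charges.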
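The ANOVA call itself is not included in the transcript below, so the following is only a sketch of how a one-way ANOVA of charges by region can be run in R. The data frame `toy` is simulated (made-up charges drawn from one distribution for every region), so its output is not the result reported above.

```r
# Simulated stand-in: 50 made-up charges per region, all from one distribution.
set.seed(42)
regions <- c("northeast", "northwest", "southeast", "southwest")
toy <- data.frame(
  region  = factor(rep(regions, each = 50)),
  charges = rlnorm(200, meanlog = 9, sdlog = 0.8)
)

# One-way ANOVA: H0 says all four regional means are equal
anova_fit   <- aov(charges ~ region, data = toy)
anova_table <- summary(anova_fit)
print(anova_table)

# Pull the p-value for the region term out of the ANOVA table
p_value <- anova_table[[1]][["Pr(>F)"]][1]
print(p_value)
```

On the real data the equivalent call would presumably be `aov(charges ~ region, data = data)`. A small p-value only says that at least one regional mean differs; it does not say which regions differ.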


To visualize these relationships, I created boxplots. The first boxplot showed the distribution of charges for smokers and non-smokers, highlighting the stark cost difference. The second boxplot compared charges across regions, demonstrating variations in pricing based on geographic location. These visualizations made it easier to see how the data supported the statistical findings.



> # Set the working directory to the Desktop

> setwd("~/Desktop")

> # Load the dataset

> data <- read.csv("insurance.csv")

> # View the first few rows of the dataset

> head(data)

  age    sex    bmi children smoker    region   charges

1  19 female 27.900        0    yes southwest 16884.924

2  18   male 33.770        1     no southeast  1725.552

3  28   male 33.000        3     no southeast  4449.462

4  33   male 22.705        0     no northwest 21984.471

5  32   male 28.880        0     no northwest  3866.855

6  31 female 25.740        0     no southeast  3756.622

> # Convert categorical variables to factors

> data$sex <- as.factor(data$sex)

> data$smoker <- as.factor(data$smoker)

> data$region <- as.factor(data$region)

> # Check for missing values

> sum(is.na(data))

[1] 0

> # Descriptive statistics

> summary(data)

      age            sex           bmi           children     smoker           region       charges     

 Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064   northeast:324   Min.   : 1122  

 1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274   northwest:325   1st Qu.: 4740  

 Median :39.00                Median :30.40   Median :1.000              southeast:364   Median : 9382  

 Mean   :39.21                Mean   :30.66   Mean   :1.095              southwest:325   Mean   :13270  

 3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000                              3rd Qu.:16640  

 Max.   :64.00                Max.   :53.13   Max.   :5.000                              Max.   :63770  

> # Mean charges by smoker status

> aggregate(charges ~ smoker, data = data, FUN = mean)

  smoker   charges

1     no  8434.268

2    yes 32050.232

> # Perform a t-test to compare charges between smokers and non-smokers

> t_test_result <- t.test(charges ~ smoker, data = data)

> # View the t-test result

> t_test_result


Welch Two Sample t-test


data:  charges by smoker

t = -32.752, df = 311.85, p-value < 2.2e-16

alternative hypothesis: true difference in means between group no and group yes is not equal to 0

95 percent confidence interval:

 -25034.71 -22197.21

sample estimates:

 mean in group no mean in group yes 

         8434.268         32050.232 
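The test above is two-sided, while the stated alternative hypothesis is directional (smokers have higher charges). As a hedged sketch, a one-sided version can be requested with the `alternative` argument of `t.test`. Because the factor levels are ordered "no" then "yes", R tests the difference "no minus yes", so "smokers higher" corresponds to `alternative = "less"`. The data frame `toy` here is simulated with made-up values, not the insurance data.

```r
# Simulated stand-in: 80 non-smokers and 20 smokers with made-up charges.
set.seed(7)
toy <- data.frame(
  smoker  = factor(rep(c("no", "yes"), times = c(80, 20))),
  charges = c(rnorm(80, mean = 8400,  sd = 6000),
              rnorm(20, mean = 32000, sd = 11500))
)

# One-sided Welch t-test. H1: mean(no) - mean(yes) < 0, i.e. smokers are higher.
one_sided <- t.test(charges ~ smoker, data = toy, alternative = "less")
print(one_sided)
```

On the real data this would presumably be `t.test(charges ~ smoker, data = data, alternative = "less")`. Given how extreme the two-sided result already is, the conclusion should not change.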


> # Boxplot visualization

> library(ggplot2)

> # Boxplot visualization for smoker vs. charges with comma formatting on y-axis

> ggplot(data, aes(x = smoker, y = charges, fill = smoker)) + 

+   geom_boxplot() + 

+   labs(title = "Medical Charges by Smoking Status", 

+        x = "Smoking Status", 

+        y = "Medical Charges") + 

+   theme_minimal() + 

+   scale_y_continuous(labels = scales::comma)







# Boxplot for region vs. charges with comma formatting on y-axis

ggplot(data, aes(x = region, y = charges)) + 

  geom_boxplot() + 

  labs(title = "Insurance Charges by Region", 

       x = "Region", 

       y = "Insurance Charges") + 

  theme_minimal() + 

  scale_y_continuous(labels = scales::comma)






In conclusion, this study showed that smoking status and region are both significantly associated with insurance charges. Smokers incurred much higher charges on average (about 32,050 versus about 8,434 for non-smokers), and the ANOVA indicated that mean charges also differ across regions. These insights emphasize the importance of considering both lifestyle and geographic variables when setting insurance premiums, since both are related to healthcare costs.



