Monday, November 27, 2023

Final Project

 

The compiled file is attached to the assignment submission itself; as far as I'm aware, Blogger does not allow Word documents to be attached. If there are any formatting issues, please let me know.

Statistics-Final-Project.R

mohle

2023-11-27

#STEP 1
urlToRead <- "https://www.fueleconomy.gov/feg/epadata/vehicles.csv"
fuel_economy <- read.csv(url(urlToRead))
fuel_economy



## [printed data frame output omitted]

summary(fuel_economy$displ)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   0.000   2.200   3.000   3.276   4.200   8.400     650

dim(fuel_economy)

## [1] 47075    84

mode(fuel_economy)

## [1] "list"
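mode() reports "list" because a data frame is stored internally as a list of columns. A minimal sketch of some more informative checks, using a small made-up data frame (`toy`, hypothetical) in place of the full download:

```r
# Hypothetical stand-in for fuel_economy; the values are invented for illustration.
toy <- data.frame(
  displ     = c(2.0, 3.5, NA, 5.7),
  cylinders = c(4L, 6L, NA, 8L),
  make      = c("A", "B", "C", "D")
)

mode(toy)             # storage mode: a data frame is a list of columns
class(toy)            # the class is what identifies it as a data frame
sapply(toy, class)    # class of each column
colSums(is.na(toy))   # number of missing values per column
str(toy)              # compact overview of columns, types, and first values
```

The same calls should work unchanged on `fuel_economy`, although `str()` on 84 columns produces long output.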

#STEP 3


#I: I wanted to determine whether there is a relationship between the emissions a car produces, as measured by its tailpipe CO2 output in grams per mile (GPM), and its greenhouse gas (GHG) score. I wanted to test this rather than another relationship because of what heavier tailpipe emissions imply for overall greenhouse gas emissions.


#II: I was mostly drawn to this resource because we used the mtcars data set in this class, and I wanted to explore something familiar whose terminology I could recognize. Without prior knowledge or experience, it is sometimes unclear what acronyms and variables refer to. Even in this case I was unable to understand some of the variables, because they are undefined on the website and do not have descriptive names in the original file. Additionally, this data set is enormous. I thought that with an abundance of variables and observations I would be able to quickly discern the most applicable and interesting relationships. After tweaking different grouping methods and experimenting with the relationships, I realized that this was unwise. Many parts of the data do not directly relate to one another. Instead of showing meaningful, distinguishing patterns over time or within the data itself, many variables are quite isolated from the others; this report therefore functions more to describe than to analyze.


#III: At first I was unsure which variables I wanted to test, so I threw together every variable combination I could and ran either an ANOVA or a t-test to try to glean some results. I realized this was time consuming and made it difficult to identify potential relationships, so I used different R functions to better understand the data and chose the variables I was most interested in. These functions included creating data frames, computing means and standard deviations, and experimenting with different tools within the software. In the end, I decided on the variables listed with the hypotheses: GPM and GHG score. I ran a t-test because both variables are numeric. The results are shown below. R ran a Welch two-sample t-test, which gave me a few impressions. First, the p-value is extremely small (p < 2.2e-16). For this test, a small p-value means only that the two sample means differ, which is expected because the variables are measured on different scales; it does not measure whether the two variables are associated, so at this point I saw no statistical evidence of a relationship between them. The test also reported the means of both GPM and GHG score, which I had already computed before running the test.
##With these impressions, I still opted to represent the data graphically, as it still bears on the null hypothesis that there is no significant relationship between these variables. I chose a scatter plot because it makes clear how distinct these variables are from one another, and because it is a good way of comparing a rating-scale variable with a continuous numeric one. The resulting plot shows the two variables following clearly unrelated trends.

mean(fuel_economy$cylinders, na.rm=TRUE)

## [1] 5.705426

mean(fuel_economy$displ, na.rm =TRUE)

## [1] 3.276073

mean(fuel_economy$co2TailpipeAGpm)

## [1] 16.30315

sd(fuel_economy$co2TailpipeAGpm)

## [1] 90.07946

max(fuel_economy$co2TailpipeAGpm)

## [1] 713

testframe <- data.frame(gpm = fuel_economy$co2TailpipeGpm, ghg = fuel_economy$ghgScore)

mean(fuel_economy$ghgScore)

## [1] 0.9491875

sd(fuel_economy$ghgScore)

## [1] 3.06082

max(fuel_economy$ghgScore)

## [1] 10
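A mean GHG score of about 0.95 on a scale that runs from 1 to 10 implies that some rows hold values below 1, which are presumably placeholders, not real ratings. A minimal sketch of how summary statistics change once such values are excluded, assuming (this is not confirmed by the source) that non-positive scores mark unrated vehicles. The `scores` vector is invented for illustration:

```r
# Invented example scores; -1 is assumed here to mean "not rated".
scores <- c(-1, -1, -1, 5, 7, 10, -1, 3)

# Treat non-positive values as missing rather than as real ratings.
rated <- ifelse(scores > 0, scores, NA)

mean(scores)                # pulled down by the placeholder values
mean(rated, na.rm = TRUE)   # mean of the rated entries only
sum(is.na(rated))           # how many entries were treated as missing
```

Whether the same treatment is appropriate for `ghgScore` depends on how the data provider encodes missing ratings, which would need to be checked against the data set's documentation.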

t_test <-t.test(fuel_economy$co2TailpipeAGpm, fuel_economy$ghgScore)
t_test

##
##  Welch Two Sample t-test
##
## data:  fuel_economy$co2TailpipeAGpm and fuel_economy$ghgScore
## t = 36.961, df = 47183, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  14.53975 16.16818
## sample estimates:
##  mean of x  mean of y
## 16.3031544  0.9491875
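The statistic reported by a Welch test can be reproduced by hand, which makes clear that the test compares two means. A minimal sketch on simulated data (the vectors `x` and `y` are invented and unrelated to the EPA file):

```r
set.seed(1)
# Simulated stand-ins on very different scales.
x <- rnorm(200, mean = 16, sd = 90)
y <- rnorm(200, mean = 1,  sd = 3)

# Welch t statistic: difference in means over its standard error.
se_x2 <- var(x) / length(x)
se_y2 <- var(y) / length(y)
t_manual <- (mean(x) - mean(y)) / sqrt(se_x2 + se_y2)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_manual <- (se_x2 + se_y2)^2 /
  (se_x2^2 / (length(x) - 1) + se_y2^2 / (length(y) - 1))

res <- t.test(x, y)   # Welch is the default (var.equal = FALSE)
c(manual = t_manual, t.test = unname(res$statistic))
c(manual = df_manual, t.test = unname(res$parameter))
```

In the project output, the 95 percent confidence interval (14.53975 to 16.16818) is centred on the difference between the two sample means, 16.3031544 - 0.9491875, or about 15.354.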

# Plot GHG score on the x-axis and CO2 GPM on the y-axis so the axes match the fitted line
plot(fuel_economy$ghgScore, fuel_economy$co2TailpipeGpm, main = "Scatterplot of GPM VS Score", xlab = "GHG Score", ylab = "GPM")
abline(lm(co2TailpipeGpm ~ ghgScore, data = fuel_economy), col = "blue", lwd = 2)
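A scatterplot with a fitted line can be complemented by a test of association, which a two-sample t-test does not provide. This is a sketch of one option, `cor.test`, and not the method used in this project; the data frame `toy` and its columns are invented for illustration:

```r
set.seed(42)
# Invented data: a score from 1 to 10 and an emissions value that falls as the score rises.
toy <- data.frame(score = sample(1:10, 300, replace = TRUE))
toy$gpm <- 700 - 50 * toy$score + rnorm(300, sd = 40)

# Spearman's rank correlation suits an ordinal rating scale.
# exact = FALSE avoids a warning about ties in the ranks.
assoc <- cor.test(toy$gpm, toy$score, method = "spearman", exact = FALSE)
assoc$estimate   # rho: sign and strength of the monotonic association
assoc$p.value    # p-value for the null hypothesis of no association
```

Spearman's method is chosen here because a 1 to 10 rating is ordinal; Pearson's correlation (the default for `cor.test`) would be the alternative if both variables were treated as continuous.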



#STEP 4
#The aim of this project was to understand the relationship between two variables in the fuel economy data set provided by the U.S. government. Through trial and error, I decided to test whether a relationship exists between two variables: a car's CO2 GPM (grams per mile) and its GHG rating, which rates a vehicle's greenhouse gas emissions on a scale of 1 to 10, with 10 meaning little to no significant greenhouse gas emissions.
##Throughout the procedure I found it difficult to apply the statistical skills I have acquired in R over the semester, not because I am unfamiliar with the concepts presented in this course but because of the unexpected challenge of choosing the best research question and focus for such a large data set.
###This data set includes 47075 observations and 84 variables, so choosing which ones I was most interested in, and forming a question that yields testable hypotheses about their relationship, was difficult. Many of the variables were obviously unrelated or unfit for testing, as they would be too narrow and would produce untrustworthy results and weak analyses.
####I chose these variables because they share a data mode (both are numeric) and could plausibly be linked, since CO2 figures into the calculation of both.
#####My conclusion is that this analysis produced no statistical evidence of a relationship between GPM and the GHG score. The Welch t-test shows only that the two means differ, which is not a test of association, so I must fail to reject the null hypothesis that there is no relationship between the two variables.
