The compiled file is attached to the submission for this assignment itself. Blogger does not allow Word documents to be attached, as far as I'm aware. If there are any formatting issues, please let me know.
Statistics-Final-Project.R
mohle
2023-11-27
#STEP 1
urlToRead <- "https://www.fueleconomy.gov/feg/epadata/vehicles.csv"
fuel_economy <- read.csv(url(urlToRead))
fuel_economy
## [full data frame printout omitted]
summary(fuel_economy$displ)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   0.000   2.200   3.000   3.276   4.200   8.400     650
dim(fuel_economy)
## [1] 47075    84
mode(fuel_economy)
## [1] "list"
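#A minimal sketch of inspecting a large data frame without printing every row. It uses the built-in mtcars data set as a stand-in so that it runs without a download; with the real data, fuel_economy would take the place of df. It also shows why mode() reports "list": a data frame is stored as a list of columns, and class() is the function that reports "data.frame".

```r
# Stand-in for fuel_economy (assumption: any data frame behaves the same way here)
df <- mtcars

# Look at the first few rows instead of printing the whole table
head(df, 3)

# Compact overview: one line per column with its type and first values
str(df)

# Number of rows and columns
dim(df)
## [1] 32 11

# A data frame is stored as a list of columns, so mode() reports "list";
# class() is what identifies it as a data frame
mode(df)
## [1] "list"
class(df)
## [1] "data.frame"
```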
#STEP 3
#I: I wanted to determine whether there is a
relationship between the CO2 a car emits from its tailpipe, measured in grams
per mile (GPM), and its greenhouse gas (GHG) score. I chose this relationship
over others because of what tailpipe emissions imply for overall greenhouse
gas emissions.
#II: I was mostly drawn to this resource because
we used the mtcars data set in this class, and I wanted to explore something
familiar, with terminology I could recognize. Without prior knowledge or
experience, it is sometimes unclear what acronyms and variables refer to. Even
here I could not understand some of the variables, because they are not defined
on the website and do not have descriptive names in the original file.
Additionally, this data set is enormous. I thought that an abundance of
variables and observations would let me quickly discern the most applicable and
interesting relationships. After trying different grouping methods and
experimenting with the relationships, I realized that this was unwise. Many
parts of the data do not relate directly to one another. Rather than showing
meaningful patterns over time or across the data, many variables are quite
isolated from the rest, so this report describes more than it analyzes.
#III: At first I was unsure which variables I
wanted to test, so I tried many variable combinations and ran either an ANOVA
or a t-test on each to see what turned up. I realized this was time consuming
and made it hard to identify potential relationships, so I narrowed the choice
by using different R functions to better understand the data. These included
creating data frames, computing means and standard deviations, and
experimenting with other tools in the software. In the end, I decided on the
two variables named in my hypotheses: GPM and GHG. I ran a t-test because both
variables are numeric. The results are shown below. R ran a Welch Two Sample
t-test, which gave me a few impressions. First, the p-value is extremely small
(below 2.2e-16). A p-value this small means the test rejects the hypothesis
that the two means are equal, which is expected when one variable is measured
in grams per mile and the other on a 1-10 scale. A t-test compares means and
does not measure whether two variables move together, so this result did not
give me evidence of a relationship between them. The test also reported the
means of both GPM and GHG, which I had already computed before running the
test.
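#A hedged sketch, not the method used in this report: a two-sample t-test asks whether two means differ, whereas a correlation test asks whether two variables move together. The example runs both on the built-in mtcars data set, where mpg and wt are only stand-ins for the GPM and GHG columns, to show how the two questions differ.

```r
# Stand-in variables from the built-in mtcars data (assumption: illustration only)
x <- mtcars$mpg   # miles per gallon
y <- mtcars$wt    # weight in 1000 lbs

# Welch two-sample t-test: are the two MEANS different?
means_test <- t.test(x, y)
means_test$p.value

# Pearson correlation test: do the two variables vary TOGETHER?
assoc_test <- cor.test(x, y)
assoc_test$estimate   # correlation coefficient, between -1 and 1
assoc_test$p.value
```

#Here the t-test only reflects that mpg and wt sit on different scales, while the correlation coefficient describes how strongly they are associated.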
##With these impressions, I still chose to
represent the data graphically as a further check on the null hypothesis that
there is no significant relationship between these variables. I chose a
scatter plot because it makes clear how distinct these variables are from one
another, and because it is a good way to compare a rating-scale variable with
a numeric one. The plot shows no clear shared trend between the two variables.
mean(fuel_economy$cylinders, na.rm=TRUE)
## [1] 5.705426
mean(fuel_economy$displ, na.rm =TRUE)
## [1] 3.276073
mean(fuel_economy$co2TailpipeAGpm)
## [1] 16.30315
sd(fuel_economy$co2TailpipeAGpm)
## [1] 90.07946
max(fuel_economy$co2TailpipeAGpm)
## [1] 713
testframe <- data.frame(gpm = fuel_economy$co2TailpipeGpm, ghgScore = fuel_economy$ghgScore)
mean(fuel_economy$ghgScore)
## [1] 0.9491875
sd(fuel_economy$ghgScore)
## [1] 3.06082
max(fuel_economy$ghgScore)
## [1] 10
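#A possible check before averaging, shown on a small made-up vector rather than the real column. It assumes that a value of -1 is used as a placeholder for a missing score, which should be confirmed against the data documentation; if so, those entries would pull the mean down and should be set to NA first.

```r
# Hypothetical scores, NOT taken from the data set;
# -1 is assumed to mean "not available"
scores <- c(-1, -1, 5, 7, -1, 10, 4)

# Mean with the placeholder values left in
mean(scores)
## [1] 3.285714

# Replace the assumed placeholder with NA, then ignore NAs when averaging
scores_clean <- replace(scores, scores == -1, NA)
mean(scores_clean, na.rm = TRUE)
## [1] 6.5

# How many entries were placeholders
sum(is.na(scores_clean))
## [1] 3
```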
t_test <- t.test(fuel_economy$co2TailpipeAGpm, fuel_economy$ghgScore)
t_test
##
##  Welch Two Sample t-test
##
## data:  fuel_economy$co2TailpipeAGpm and fuel_economy$ghgScore
## t = 36.961, df = 47183, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  14.53975 16.16818
## sample estimates:
##  mean of x  mean of y
## 16.3031544  0.9491875
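#A short sketch of reading individual values out of a t-test result instead of the printed summary. It uses the built-in mtcars data as a stand-in; the same fields exist on any object returned by t.test(). The 5% significance level is an assumed choice, not one stated in this report.

```r
# Stand-in test on the built-in mtcars data (assumption: illustration only)
res <- t.test(mtcars$mpg, mtcars$wt)

res$statistic   # t value
res$parameter   # degrees of freedom (Welch approximation)
res$p.value     # p-value as a number
res$conf.int    # confidence interval for the difference in means
res$estimate    # the two sample means

# Decision rule at an assumed 5% significance level:
# a p-value below alpha rejects the hypothesis that the two means are equal
alpha <- 0.05
res$p.value < alpha
## [1] TRUE
```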
attach(testframe)
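# Plot the same two columns that the regression line is fitted on:
# GHG score on the x-axis, tailpipe CO2 (GPM) on the y-axis
plot(fuel_economy$ghgScore, fuel_economy$co2TailpipeGpm,
     main = "Scatterplot of GPM VS Score", xlab = "GHG Score", ylab = "GPM")
abline(lm(co2TailpipeGpm ~ ghgScore, data = fuel_economy), col = "blue", lwd = 2)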
#STEP 4
#The aim of this project was to understand the
relationship between two variables in the fuel economy data set provided by the
U.S. government. Through trial and error, I decided to test whether a
relationship exists between two variables: a car's CO2 output in GPM (grams per
mile) and its GHG rating, which scores a vehicle's impact on greenhouse gas
production on a 1-10 scale, with 10 meaning little to no significant production
of greenhouse emissions.
##Throughout my procedure I found it
difficult to apply the statistical skills I have acquired in R over the course
of the semester. This was not because I am unfamiliar with the concepts
presented in this course, but because of the unexpected challenge of choosing
the best research question and focus for a large data set.
###This data set included 47075
observations and 84 variables, so choosing which ones I was most interested in,
and framing a question that led to testable hypotheses about their
relationships, was a difficult task. Many of the variables were obviously
unrelated or unfit for testing, as they would be too narrow and would produce
untrustworthy results and weak analyses.
####I chose the variables I did because
they shared data modes and had the potential to be linked, since CO2 is part of
how each of them is calculated.
#####The conclusion I came to by the end of
this was that there was no statistical evidence to suggest a relationship
between GPM and the GHG score, and as such I must fail to reject the null
hypothesis, which was that there is no relationship between them.