Monday, November 27, 2023

Final Project

 

The compiled file is attached to the submission for this assignment itself. As far as I'm aware, Blogger does not allow Word documents to be attached. If there are any formatting issues, please let me know.

Statistics-Final-Project.R

mohle

2023-11-27

#STEP 1
urlToRead <- "https://www.fueleconomy.gov/feg/epadata/vehicles.csv"
fuel_economy <- read.csv(url(urlToRead))
head(fuel_economy)

## (wide data-frame output omitted: 47075 rows x 84 columns)

summary(fuel_economy$displ)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   0.000   2.200   3.000   3.276   4.200   8.400     650

dim(fuel_economy)

## [1] 47075    84

mode(fuel_economy)

## [1] "list"

#STEP 3


#I: I wanted to determine whether there is a relationship between the emissions a car produces at the tailpipe, measured in grams per mile (GPM), and its greenhouse gas (GHG) score. I chose this relationship over others because of the implication that heavier tailpipe emissions should correspond to greater overall greenhouse gas output.


#II: I was mostly drawn to this resource because we used the mtcars data set in this class, and I wanted to explore something familiar whose terminology I could recognize. It is sometimes unclear what acronyms and variables refer to without prior knowledge or experience; even so, I was unable to understand some of the variables in this data set because they are undefined on the website and do not have descriptive names in the original file. Additionally, this data set is enormous. I thought that with an abundance of variables and observations I would be able to quickly discern the most applicable and interesting relationships, but after tweaking different grouping methods and fiddling with the relationships, I realized this was unwise. Many parts of the data do not directly relate to one another. Instead of providing meaningful, distinguishing qualities either over time or within the data itself, the variables are actually quite isolated from each other; as a result, this report functions more to describe than to analyze.


#III: At first I was unsure which variables I wanted to test, so I threw together every variable combination and ran either an ANOVA or a t-test to try to glean some results. I realized this would be time consuming and would make it difficult to identify potential relationships, so I chose the variables I was most interested in by using different R functions to better understand the data: creating data frames, computing means and standard deviations, and experimenting with different tools in the software. In the end, I settled on the variables in my hypotheses: GPM and GHG score. I ran a t-test because both variables are numeric. The Welch two-sample t-test was the result, and it gave me a few impressions. First, the p-value is incredibly minuscule, far below any conventional alpha level, meaning the two sample means differ significantly. However, since the t-test only compares the means of two variables measured on entirely different scales, a significant difference in means does not by itself demonstrate that the variables move together; at this point I was fairly certain the test was not showing a meaningful relationship. The test also reported the means of GPM and GHG score, which I had already computed before running it.
##With these impressions, I still opted to represent the data graphically; it is still evidence bearing on the null hypothesis that there is no significant relationship between these variables. I chose a scatter plot because it makes clear how distinct these variables are from one another, and it is a good way to compare a rating scale against a continuous numeric variable. The product shows the clearly unrelated trends and behaviors of the two variables.

mean(fuel_economy$cylinders, na.rm=TRUE)

## [1] 5.705426

mean(fuel_economy$displ, na.rm =TRUE)

## [1] 3.276073

mean(fuel_economy$co2TailpipeAGpm)

## [1] 16.30315

sd(fuel_economy$co2TailpipeAGpm)

## [1] 90.07946

max(fuel_economy$co2TailpipeAGpm)

## [1] 713

testframe <- data.frame(gpm = fuel_economy$co2TailpipeGpm, ghg = fuel_economy$ghgScore)

mean(fuel_economy$ghgScore)

## [1] 0.9491875

sd(fuel_economy$ghgScore)

## [1] 3.06082

max(fuel_economy$ghgScore)

## [1] 10

t_test <- t.test(fuel_economy$co2TailpipeAGpm, fuel_economy$ghgScore)
t_test

##
##  Welch Two Sample t-test
##
## data:  fuel_economy$co2TailpipeAGpm and fuel_economy$ghgScore
## t = 36.961, df = 47183, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  14.53975 16.16818
## sample estimates:
##  mean of x  mean of y
## 16.3031544  0.9491875

# Plot GPM against GHG score (the original call plotted GPM against the row
# index, and the abline() regressed a different GPM column).
plot(fuel_economy$ghgScore, fuel_economy$co2TailpipeAGpm,
     main = "Scatterplot of GPM vs. GHG Score",
     xlab = "GHG Score", ylab = "CO2 Tailpipe GPM")
abline(lm(fuel_economy$co2TailpipeAGpm ~ fuel_economy$ghgScore), col = "blue", lwd = 2)



#STEP 4
#The aim of this project was to understand the relationship between two variables within the fuel-economy data set provided by the U.S. government. Through trial and error, I decided to determine whether a relationship exists between two variables: a car's CO2 GPM (grams per mile) and its GHG rating, a 1-10 scale that scores a vehicle's greenhouse gas impact, where 10 means little to no significant production of greenhouse emissions.
##Throughout my procedure I found it difficult to apply the statistical skills I have acquired in R over the course of the semester, not because I am unfamiliar with the concepts presented in this course, but because of the unexpected challenge of choosing the best research question and focus for a large data set.
###This data set includes 47075 observations and 84 variables, so choosing the ones I was most interested in, in addition to forming a question with testable hypotheses about their relationships, was a difficult task to overcome. Many of the variables were obviously unrelated or unfit for testing, as they would be too narrow and would produce untrustworthy results and weak analyses.
####I chose these variables because they share data modes and had the potential to be linked through their shared dependence on CO2 in their calculations.
#####The conclusion I reached by the end is that there is no statistical evidence of a relationship between GPM and the GHG scale, and as such I fail to reject the null hypothesis that no relationship exists.
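#As an aside on testing a "relationship" directly: a correlation test, rather than a two-sample t-test, measures whether two numeric variables move together. Below is a minimal sketch using simulated stand-in vectors (the real columns live in the fuel_economy data frame from Step 1, so the names gpm and score here are hypothetical):

```r
# Simulated stand-ins for the two columns; on the real data this would be
# cor.test(fuel_economy$co2TailpipeGpm, fuel_economy$ghgScore).
set.seed(42)
gpm   <- rnorm(200, mean = 400, sd = 60)  # hypothetical grams-per-mile values
score <- rnorm(200, mean = 5, sd = 2)     # hypothetical 1-10-style scores

ct <- cor.test(gpm, score)  # Pearson correlation by default
ct$estimate                 # correlation coefficient, always between -1 and 1
ct$p.value                  # a small p-value would indicate a linear association
```

#A near-zero estimate with a large p-value would support the same "no relationship" conclusion, but on the scale-free footing that comparing raw means cannot provide.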

Sunday, November 12, 2023

Module #12 Assignment

This assignment required knowledge of time series and their application within R to forecast. The first step I took was to code a data frame containing the information provided in the assignment details: 


This was identical to the table provided, but it gave me a hard time when applying the forecasting model, since the separate data series were not sequential according to the plots; the plot() function produced two separate plots to illustrate the data rather than a single, chronological one.
To solve this problem, I created a new data frame that bound the vectors into a single series. 
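The binding step can be sketched like this (the monthly values below are hypothetical placeholders, since the assignment's actual table is not reproduced here):

```r
# Hypothetical monthly charge values for two years (not the assignment's data)
year1 <- c(120, 125, 118, 130, 135, 132, 140, 145, 142, 150, 155, 158)
year2 <- c(160, 165, 162, 170, 175, 172, 180, 185, 182, 190, 195, 198)

# Combine the two vectors into a single 24-month time series so that plot()
# draws one chronological line instead of two separate plots.
charges <- ts(c(year1, year2), start = c(1, 1), frequency = 12)
plot(charges, main = "Monthly charges over 24 months")
```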

This allowed me to use the plot function to produce a more articulate and useful plot for me to glean information from:
This provides a better understanding of the charging behavior over the course of 2 years, or 24 months, as demonstrated by the plot. I applied the same method as the linked example, the HoltWinters() function, to understand the plot and its characteristics.
In these circumstances the alpha value is close to 1, implying that the model weighs the more recent points more heavily when forecasting. The smoothing coefficient is also calculated. When applying the new forecasting data, the resulting line appears as shown:
From this, we can observe that the forecast is quite similar to the actual data. Unlike the example provided, the smoothing line is not very different from the observed line. This could be due to qualities of the data sample, such as n, variance, etc.; still, it fulfills its job of presenting a more compact iteration of the data.
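Using the same hypothetical stand-in data (not the assignment's actual numbers), the HoltWinters() step and its forecast can be sketched as simple exponential smoothing, which matches the alpha-only discussion above:

```r
# Hypothetical 24-month series (stand-in for the assignment's data)
charges <- ts(c(120, 125, 118, 130, 135, 132, 140, 145, 142, 150, 155, 158,
                160, 165, 162, 170, 175, 172, 180, 185, 182, 190, 195, 198),
              frequency = 12)

# Simple exponential smoothing: level only, no trend or seasonal component.
fit <- HoltWinters(charges, beta = FALSE, gamma = FALSE)
fit$alpha   # an alpha near 1 means recent points dominate the forecast

# Forecast the next 6 months and overlay the fit on the observed series.
fc <- predict(fit, n.ahead = 6)
plot(fit, predicted.values = fc)
```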



Sunday, November 5, 2023

Module 11 Assignment

1. The first demand of this assignment is to load the "ashina" data set from the "Introduction to Statistics with R" (ISwR) package. After loading the package, I ran the file and received this result on my console:

This data set represents a trial of a headache medication, a nitric oxide (NO) synthase inhibitor. The study included 16 patients; a baseline pain rating on a scale of 5 was used, and the difference between the baseline and the pain recorded after a certain period of time determined a score. Six patients were treated with the medication in the first session and given the placebo during the second; ten patients were given the placebo first and the medication second. The order in which this method was applied was randomized. 
I wanted to see the structure and a summary of the data, so I used the str() and summary() functions after assigning the ISwR data to an object called "ashina" to make it more accessible. The output of these commands is:

The str() function gives us the number of observations (the number of participants) and the number of variables used within the experiment (active medication, placebo, and group number, respectively). The summary gives us an idea of the quantitative characteristics of the data set for each variable. This information is helpful, but not the goal of this assignment; using the hint provided, applying the code allows us to split the data into two groups, treated and untreated, regardless of session or group. This lets us work more directly with the data and apply logistic regression and t-tests on treatment. Placing the treated data into the "act" data frame and the untreated into "plac", we can make inferences from the data. 
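The hinted reshaping can be sketched as follows (this assumes the ISwR package is installed; vas.active, vas.plac, and grp are the columns of the ashina data frame):

```r
library(ISwR)   # provides the ashina data set
data(ashina)

# Stack the active-treatment and placebo scores into one long data frame,
# with an indicator for treatment, ignoring session order.
act  <- data.frame(vas = ashina$vas.active, treat = 1, grp = ashina$grp)
plac <- data.frame(vas = ashina$vas.plac,  treat = 0, grp = ashina$grp)
combined <- rbind(act, plac)

# Two-sample t-test of VAS score by treatment. (The paired version,
# t.test(ashina$vas.active, ashina$vas.plac, paired = TRUE), would
# respect the crossover design.)
t.test(vas ~ treat, data = combined)
```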

Using these new data frames, we can run our t tests and regression analyses.
2.  For this question, we are given a series of numeric vectors.
a <- c(2, 2, 8)
b <- c(2, 4, 8)
x <- c(1:8)
y <- c(1:4, 8:5)
z <- rnorm(8)

Using the rnorm() function, we are given a random series of normally distributed numbers. This means z contains eight draws from a standard normal distribution, so repeated calls produce different samples that nonetheless follow the typical bell-curve behavior of the distribution. After initializing the vectors, I experimented with the rnorm function to see what the results were:

I don't really understand what the model.matrix function is meant to accomplish in regard to the question or the vectors. When I try to apply the same equations or relationships used as examples, I get error messages. 
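For what it's worth, model.matrix() builds the design matrix that a linear model would use from a formula. A minimal sketch with the question's vectors:

```r
a <- c(2, 2, 8)
b <- c(2, 4, 8)

# An intercept column of 1s plus one column per numeric predictor:
mm1 <- model.matrix(~ a + b)
mm1   # 3 rows, 3 columns: (Intercept), a, b

# Treating a as a factor expands it into dummy (indicator) columns instead:
mm2 <- model.matrix(~ factor(a) + b)
mm2
```

This is why the function appears alongside regression questions: lm() constructs exactly this matrix internally before solving for the coefficients.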



