Monday, November 27, 2023

Final Project

 

The compiled file is attached to the submission for this assignment itself. As far as I'm aware, Blogger does not allow Word documents to be attached. If there are any formatting issues, please let me know.

Statistics-Final-Project.R

mohle

2023-11-27

#STEP 1
urlToRead <- "https://www.fueleconomy.gov/feg/epadata/vehicles.csv"
fuel_economy <- read.csv(url(urlToRead))
fuel_economy



## (output truncated: printing fuel_economy displays the full 47,075 x 84 data frame)

summary(fuel_economy$displ)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   0.000   2.200   3.000   3.276   4.200   8.400     650

dim(fuel_economy)

## [1] 47075    84

mode(fuel_economy)

## [1] "list"

#STEP 3


#I: I wanted to determine whether there is a relationship between the emissions a car produces, as measured by tailpipe CO2 in grams per mile (GPM), and its greenhouse gas (GHG) score; I wanted to test this relationship rather than another because of the implied connection between tailpipe output and overall greenhouse gas emissions.


#II: I was mostly drawn to this resource because we used the mtcars data set in this class, and I wanted to explore something I could be familiar with and whose terminology I would recognize. It is sometimes unclear what acronyms and variables refer to without prior knowledge or experience. Even in this case, I was unable to understand some of the variables in this data set because they are undefined on the website and do not have descriptive titles in the original file. Additionally, this data set is enormous. I thought that with an abundance of variables and observations, I would be able to quickly discern the most applicable and interesting relationships. However, after tweaking, creating different grouping methods, and fiddling with the relationships, I realized this was unwise. Many parts of the data do not directly relate to one another. Instead of providing meaningful and distinguishing qualities either over time or within the data itself, the data is actually quite isolated variable from variable; this report functions more to describe than to analyze.


#III: At first I was unsure which variables I wanted to test, so I threw together every variable combination I could and ran either an ANOVA or a t-test to try to glean some results. I realized this would be time-consuming and would make it difficult to identify potential relationships, so I chose the variables I was most interested in by using different R functions to better understand the data. These functions included creating data frames, computing means and standard deviations, and experimenting with different tools within the software. In the end, I decided on the variables listed in my hypotheses: GPM and GHG score. I ran a t-test because the variables are numeric. The results are as shown. The Welch two-sample t-test was the result provided, and it gave me a few impressions. First, I noticed that the p-value is incredibly minuscule; however, comparing the means of two different variables does not by itself demonstrate an association between them, and at this point I was fairly certain there was no statistically significant relationship of the kind I was looking for. The test also provided the means of both GPM and GHG, which I had already determined before running the test.
##With these impressions, I still opted to represent the data graphically anyway; it is still evidence bearing on the null hypothesis, namely that there is no significant relationship between these variables. I chose a scatter plot because it clearly shows how distinct these variables are from one another, and it is a good way of comparing a scaled rating against a numerical measurement. The product shows the clearly unrelated trends and behaviors of the two variables.

mean(fuel_economy$cylinders, na.rm=TRUE)

## [1] 5.705426

mean(fuel_economy$displ, na.rm =TRUE)

## [1] 3.276073

mean(fuel_economy$co2TailpipeAGpm)

## [1] 16.30315

sd(fuel_economy$co2TailpipeAGpm)

## [1] 90.07946

max(fuel_economy$co2TailpipeAGpm)

## [1] 713

testframe <- data.frame(gpm = fuel_economy$co2TailpipeGpm, ghg = fuel_economy$ghgScore)

mean(fuel_economy$ghgScore)

## [1] 0.9491875

sd(fuel_economy$ghgScore)

## [1] 3.06082

max(fuel_economy$ghgScore)

## [1] 10

t_test <- t.test(fuel_economy$co2TailpipeAGpm, fuel_economy$ghgScore)
t_test

##
##  Welch Two Sample t-test
##
## data:  fuel_economy$co2TailpipeAGpm and fuel_economy$ghgScore
## t = 36.961, df = 47183, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  14.53975 16.16818
## sample estimates:
##  mean of x  mean of y
## 16.3031544  0.9491875

# Plot GPM on the x-axis and GHG score on the y-axis, matching the axis labels:
plot(fuel_economy$co2TailpipeAGpm, fuel_economy$ghgScore,
     main = "Scatterplot of GPM vs. GHG Score", xlab = "GPM", ylab = "GHG Score")
abline(lm(fuel_economy$ghgScore ~ fuel_economy$co2TailpipeAGpm), col = "blue", lwd = 2)



#STEP 4
#The aim of this project was to understand the relationship between two variables within the data set provided by the U.S. government on fuel economy. Through trial and error, I decided to determine whether a relationship existed between two variables: a car's CO2 GPM (grams per mile) and its GHG rating, a 1-10 scale that rates a vehicle's greenhouse gas impact, where 10 means little to no significant production of greenhouse emissions.
##Throughout my procedure I found it difficult to apply the statistical skills I have acquired in R over the course of the semester; not because I am unfamiliar with the concepts or ideas presented in this course, but due to the unexpected challenge of choosing the best research question and focus for a large data set.
###This data set included 47,075 observations and 84 variables, so choosing the ones I was most interested in, in addition to creating a question that drew testable hypotheses about their relationships, was a difficult task. Many of the variables were obviously unrelated or unfit for testing, as they would be too narrow and produce untrustworthy results and weak analyses.
####I chose the variables I did because they shared data modes and had the potential to be linked, since both involve CO2 in their calculations.
#####The conclusion I came to was that there was no statistical evidence to suggest a relationship between GPM and the GHG scale, and as such I must fail to reject the null hypothesis, which was that there is no evidence implying a relationship.

Sunday, November 12, 2023

Module #12 Assignment

This assignment required knowledge on time series and the application of them within R to forecast. The first step I decided to take was to code a data frame containing the information provided in the assignment details: 
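
A hypothetical reconstruction of that first attempt; the original screenshot is not reproduced here, and the charge values below are invented stand-ins:

month <- 1:12
year1 <- c(110, 105, 98, 120, 130, 125, 118, 122, 131, 127, 140, 135)   # invented values
year2 <- c(138, 129, 133, 145, 150, 148, 152, 149, 158, 160, 162, 170)  # invented values
charges_df <- data.frame(month, year1, year2)
plot(charges_df$year1)  # plotted separately, each year gets its own, non-sequential plot
plot(charges_df$year2)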


This was identical to the table provided; however, it gave me a hard time when applying the forecasting model, since the separate data series were not sequential according to the plots; the "plot" function produced two separate plots to illustrate the data rather than a single, chronological one:
To solve this problem, I created a new data frame that bound the vectors into a single series. 
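
A sketch of that step, continuing from the invented values above:

charges <- ts(c(charges_df$year1, charges_df$year2), frequency = 12)  # one 24-month series
plot(charges, main = "Monthly charges over 24 months")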

This allowed me to use the plot function to produce a more articulate and useful plot for me to glean information from:
This provides a better understanding of the charge behavior over the course of two years, or 24 months, as demonstrated by the plot. I decided to apply the same method as the linked example, the HoltWinters() function, to understand the plot and its characteristics.
In these circumstances, the alpha value is close to 1, implying that the model weighs the more recent points more heavily when forecasting. The coefficient is also calculated. When the forecast is applied, the line that appears looks like this:
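
A sketch of the smoothing and forecasting step under the same assumptions (simple exponential smoothing, since only the alpha value is discussed):

charges_hw <- HoltWinters(charges, beta = FALSE, gamma = FALSE)  # level-only smoothing
charges_hw$alpha                                    # weight on the most recent observations
plot(charges_hw, predict(charges_hw, n.ahead = 6))  # smoothed fit plus a six-month forecast
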
From this, we can observe that the forecast is quite similar to the actual data. Unlike the example provided, the smoothing line is not very different from the observed line. This could be due to qualities of the data sample, such as n, variance, etc.; however, it still fulfills its job of presenting a more compact iteration of the data.



Sunday, November 5, 2023

Module 11 Assignment

1. The first demand of this assignment is to load the "ashina" data set from the "Introductory Statistics with R" (ISwR) package. After loading the package, I ran the file and received this result on my console:

This data set represents a trial of a headache medication, an NO (nitric oxide) synthase inhibitor. The study included 16 patients; a baseline of pain was recorded on a scale of 5, and the difference between the baseline and the pain recorded after a certain period of time was used to determine a score. Six patients were treated with the medication in the first session and given the placebo in the second; ten patients were given the placebo first and the medication second. The order in which this method was applied was randomized.
I wanted to see the structure and a summary of the data, so I used the str() and summary() functions after assigning the ISwR data to an object called "ashina" to make it more accessible. The outputs of these commands are:

The str() function gives us the number of observations (the number of participants) and the number of variables used within the experiment (active medication, placebo, and group number, respectively). The summary gives us an idea of the quantitative characteristics of the data set for each variable. This information is helpful, but not the goal of this assignment; using the hint provided, applying the code allows us to split the data into two groups, treated and untreated, regardless of session or group. This lets us work more directly with the data and apply regression and t-tests to the treatment. Placing the treated data into the "act" data frame and the untreated into "plac", we can make inferences based on the data.

Using these new data frames, we can run our t tests and regression analyses.
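
A sketch of that split and the paired test, assuming the standard ISwR column names vas.active, vas.plac, and grp:

library(ISwR)
data(ashina)
act  <- data.frame(vas = ashina$vas.active, grp = ashina$grp)  # treated scores
plac <- data.frame(vas = ashina$vas.plac, grp = ashina$grp)    # placebo scores
t.test(act$vas, plac$vas, paired = TRUE)  # each patient serves as their own control
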
2.  For this question, we are given a series of vectors containing numbers and lists.
a <- c(2, 2, 8)
b <- c(2, 4, 8)
x <- c(1:8)
y <- c(1:4, 8:5)
z <- rnorm(8)

Using the rnorm() function, we are given a random series of normally distributed numbers. This implies that the draws will follow the behavior of the typical normal curve. After initializing the vectors, I experimented with the rnorm() function to see what the results were:

I don't really understand what the model.matrix() function is meant to accomplish with regard to the question or the vectors. When I try to apply the same equations or relationships used as examples, I get error messages.
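
One likely source of those errors, offered as a guess: in the ISwR version of this exercise, a and b are usually factors created with gl() rather than plain numeric vectors, and model.matrix() builds the dummy-coded design matrix that lm() uses internally. A sketch under that assumption:

a <- gl(2, 2, 8)       # factor: 1 1 2 2 1 1 2 2
b <- gl(2, 4, 8)       # factor: 1 1 1 1 2 2 2 2
y <- c(1:4, 8:5)
model.matrix(~ a + b)  # the dummy-coded design matrix
lm(y ~ a + b)          # fits a model using that design matrix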




Sunday, October 29, 2023

Module 10 Assignment

9.1 

For this assignment, we are expected to use the data set from the Introductory Statistics with R package titled "cystfibr", which contains patient information about people who have cystic fibrosis (ages 7-23) and their lung capacity. The goal of this assignment is to understand the relationship between the variables (age, weight, sex, height, bmp, fev1, rv, frc, tlc, and pemax).

The first thing I wanted to do was explore the data. To do this, I used the "str" and "summary" functions in R to better understand what the data looked like and how to approach it. The resulting products are as shown:

From this we can see basic trends and qualities of the data.
The first test is to determine the coefficients of the data. Using the relationship between pemax, age, and height, the coefficients are 2.7178 (age) and 0.3397 (height); the intercept for these variables is 17.8600.


The primary test I am interested in running concerns the relationship between pemax, height, and age in the data set. As shown, the sum of squares is listed as 231.1695, with the residuals being 84.19767. The degrees of freedom are 1.
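
A sketch of the fits these numbers came from, assuming the model was pemax ~ age + height:

library(ISwR)
data(cystfibr)
cf_fit <- lm(pemax ~ age + height, data = cystfibr)
summary(cf_fit)  # coefficients and intercept
anova(cf_fit)    # sums of squares and degrees of freedom
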
9.2

As with the first, I chose to use the str() and summary() functions to better understand the data and its structure. The ISwR::secher data set describes ultrasonographic measurements of babies prior to and following their births. It has 107 rows and 4 columns of data, and the summary/structure looks like this:



From here I am a bit confused, as the model is not something I am familiar with in R. When inputting the model into R as a vector using the same variables, the next step becomes unclear. The regression line looks something like this; however, the result of using the provided formula simply initializes it as a vector. Additionally, it throws an error, though this issue is likely from improper syntax. Without using logarithmic transformations, this is the resulting output:
It's apparent that this is an improper correlation, as the negative intercept is an impossibility. More practice and information are required to assess the linear regression in statistical and graphical circumstances.
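
A sketch of the log-log fit this exercise usually points toward, assuming the secher columns bwt (birth weight) and ad (abdominal diameter):

library(ISwR)
data(secher)
sec_fit <- lm(log10(bwt) ~ log10(ad), data = secher)
summary(sec_fit)
plot(log10(bwt) ~ log10(ad), data = secher)
abline(sec_fit)  # regression line on the log-log scatter plot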






Monday, October 16, 2023

Module #8 Assignment

 For this assignment, we are expected to run an ANOVA hypothesis test. 

1. Firstly, it is necessary to combine the individual response data into three separate vectors to identify the ratings of the high, moderate, and low stress groups.



Once this is done, the next step is to bind these vectors into a data frame using the as.data.frame() and cbind() functions, structuring the data for a single command. The stack() function then lays the data out in a more readable and accessible form before running the ANOVA.



Using the oneway.test() function provides the F value, the numerator df, the denominator df, and the p-value of the data under the assumption that the variances are equal.
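
A sketch of that workflow; the rating values here are invented placeholders for the assignment's data:

high     <- c(10, 9, 8, 9, 10)  # invented ratings
moderate <- c(8, 10, 6, 7, 8)   # invented ratings
low      <- c(4, 6, 6, 4, 2)    # invented ratings
stress  <- as.data.frame(cbind(high, moderate, low))
stacked <- stack(stress)  # two columns: values, ind (the group label)
oneway.test(values ~ ind, data = stacked, var.equal = TRUE)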




2. The second question asks us to use the ISwR::zelazo data set. The data matrix is as follows:
I wasn't sure how to approach this question, so I opted to use a t-test to glean some information so I could determine some useful characteristics about the set. Since I am writing this after the due date, and for my own benefit, I used the answer key to help guide me to the correct next step. Using the same process as in Question 1, I created a data frame from the zelazo data.
Next, I stacked the data, as before. 
Then, I conducted a one-way test (oneway.test()).

Although my numbers differ from those posted, I chalk this up to a misinput of the data or a mistake on my end. Again, this is being done to practice the process and become familiar with conducting these tests.
When the ANOVA test is run, the results show that the null hypothesis, that there is no evidence of significant differences between babies that are trained and those that are not, cannot be rejected, as the p-value is greater than the significance level (0.2239 > 0.05).
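
A sketch of that process, assuming zelazo's standard form as a named list (active, passive, none, ctr.8w) of walking ages in months:

library(ISwR)
data(zelazo)
walk <- stack(zelazo)  # one column of values plus a group label
oneway.test(values ~ ind, data = walk)
t.test(zelazo$active, zelazo$ctr.8w)  # active training vs. 8-week controls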




Sunday, October 8, 2023

Module #7 Assignment

 1. 

The data set for this question is as follows:

x <- c(16, 17, 13, 18, 12, 14, 19, 11, 11, 10)

y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

1.1 The input (x) is assumed to be fixed, while the output (y) is a response to x, and is therefore inherently random in contrast to the input. Under these circumstances, the relationship is linear.

1.2

Intercept: 19.21; slope: 3.269
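
A sketch of how those values fall out of lm():

x <- c(16, 17, 13, 18, 12, 14, 19, 11, 11, 10)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
coef(lm(y ~ x))  # intercept and slope reported above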

2.

2.1. In this scenario, the relationship is between the duration of a geyser's eruptions and the waiting time between eruptions (the faithful data set).

2.2. eruption.lm <- lm(eruptions ~ waiting, data = faithful)

coefficients: -1.874016 (intercept) and 0.075628 (waiting)

2.3 The predicted eruption duration is 4.172 minutes.
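
A sketch of the prediction step; the 80-minute waiting time is my assumption about the question's input:

eruption.lm <- lm(eruptions ~ waiting, data = faithful)
predict(eruption.lm, newdata = data.frame(waiting = 80))  # about 4.17 minutes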

3. The coefficients for the 'mtcars' data frame produce this in R, using the head() function and limiting the variables to just the first 5.
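
A sketch of one reading of this step; treating columns 1-5 of mtcars as "the first 5" is my assumption:

head(mtcars[, 1:5])  # mpg, cyl, disp, hp, drat
coef(lm(mpg ~ cyl + disp + hp + drat, data = mtcars))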



4.

Using the lm() function, as well as the plot() and abline() functions, this is the data represented graphically.
The inputs for this in R look like:
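
A sketch of such inputs; the mpg-versus-weight pairing is my assumption, since the original screenshot is not included:

wt_fit <- lm(mpg ~ wt, data = mtcars)
plot(mpg ~ wt, data = mtcars, main = "MPG vs. Weight")
abline(wt_fit, col = "blue", lwd = 2)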





Sunday, October 1, 2023

Module #6 Assignment

 A.

a. The mean is 11.8

b. Randomly selected 14 and 10

c. The mean for the sample is 7. The standard deviation of the sample is 2.8284271247462 (or 2.83 to 3 s.f.).

d. In comparison, the sample mean is significantly different from the population's, in contrast to the standard deviation. The standard deviation for the population is 2.8565713714171 (or 2.86 to 3 s.f.). The difference in the SD between the sample and population is much less apparent than that of the means.

B.

n=100

p=0.95

1. Yes, the population has a normal distribution due to its proximity to 1, which suggests a higher confidence in the probability that the result is true.

2. As the value moves further from 1, the chances of the correlation being statistically significant drastically decline. Anything below 0.80 should likely be retested or judged untrustworthy as verifiable evidence. In my opinion, around 0.76 is where statisticians and mathematicians should place the cut-off for what can be used as proof.

B.ii.

A. 5

B. 100

C. z = (xBar - µ) / (σ / sqrt(n))

6.165939194, or 6.17 to 3 s.f. This suggests that the sample is very far off and does not represent the entire population properly. This is likely due to the limited sample size and the resulting variance. Despite the standard deviation being nearly unaffected, the other statistical qualities are negatively impacted by the small sample.
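
A sketch of the calculation with that formula, assuming part A's 5 is σ and part B's 100 is n; xBar and µ are invented placeholders since the question's values are not reproduced here:

n <- 100; sigma <- 5             # assumed from parts A and B above
xbar <- 103; mu <- 100           # invented placeholders
(xbar - mu) / (sigma / sqrt(n))  # the z-statistic for these placeholder values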

I do not know which exercise the last question refers to; I checked all three textbooks I bought or have access to and am confused about what it is referring to.
