Question
Theoretical Overview
Suppose we have a set of data consisting of ordered pairs and we suspect the x and y coordinates are related. It is natural to try to find the best line that fits the data points. If we can find this line, then we can use it to make all sorts of other predictions. In this project, we're going to use several functions to find this line using a technique called least squares regression. The result will be what we call the least squares regression line (or LSRL for short).
In order to do this, you'll need to program a statistical computation called the correlation coefficient, denoted by r in statistical symbols:
NOTE: Equation is written assuming you start at the value 1. Arrays start at index 0.
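The equation itself is not present in this copy of the assignment. A reasonable reconstruction is the standard sample correlation coefficient, which is also what the posted answer's corr_coef function computes. The symbols below are assumptions chosen for illustration: n is the number of data points, \bar{x} and \bar{y} are the sample means, and s_x and s_y are the sample standard deviations (with an n - 1 divisor).

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i ,
\qquad
s_x = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2 }

r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```

The definitions of \bar{y} and s_y are the same with y in place of x. In C++ the same sums run over indices 0 through n - 1.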
Once you have the correlation coefficient, you use it along with the sample means and sample standard deviations of the x and y-coordinates to compute the slope and y-intercept of your regression line via these formulas:
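The slope and intercept formulas are likewise missing from this copy. The standard forms, which agree with the posted answer (b = r*sy/sx and a = my - b*mx) and with the "intercept + slope x" layout of the sample output, are sketched here. The names a and b for the y-intercept and slope are taken from the answer code, not from the assignment.

```latex
b = r \, \frac{s_y}{s_x} ,
\qquad
a = \bar{y} - b \, \bar{x} ,
\qquad
\hat{y} = a + b x
```

As a small worked case, the points (1, 2), (2, 3), (3, 5) give \bar{x} = 2, \bar{y} = 10/3, s_x = 1, s_y \approx 1.5275 and r \approx 0.9820, so b = 1.5 and a = 1/3.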
Project Specifications
In this project, you must read the x- and y-coordinate pairs in from a data file of unknown length. Each line in the file must contain both coordinates, separated by whitespace, as shown here. In addition, you must use C++ functions in this project, splitting the work up into smaller components (like parts of Project 3) and reinforcing your skills with parameter passing and arrays.
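Reading a file of unknown length into fixed-size parallel arrays can be sketched as below. This is a minimal illustration, not the assignment's required function: the name readPairs, the capacity parameter, and the choice to take a generic std::istream are all assumptions made so the loop can be tried on in-memory text as well as on a file.

```cpp
#include <istream>
#include <sstream>   // std::istringstream, handy for trying the helper without a file

// Hypothetical helper: reads whitespace-separated "x y" pairs from any input
// stream into parallel arrays. capacity is the physical size of both arrays;
// n is an output parameter that receives the logical size.
void readPairs(std::istream& in, double x[], double y[], int capacity, int& n)
{
    n = 0;
    // Stop when the arrays are full, at end of input, or on malformed data.
    // The capacity test comes first so nothing is written past the arrays.
    while (n < capacity && in >> x[n] >> y[n])
        ++n;
}
```

Because operator>> skips whitespace, the same loop handles one pair per line or several pairs on one line. Passing a std::ifstream opened on the data file works unchanged, since it is also a std::istream.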
You are required to create the following C++ functions, and you must list them in this order above the main program (no prototypes, please!):
(Creation and order of C++ functions is worth 5 Points)
(Each C++ function is worth 12.5 Points. Each C++ function should satisfy this table)
# (for reference) | Role | C++ Function's Objective | Input Parameters | Output Parameters | Return Values
----------------- | ---- | ------------------------ | ---------------- | ----------------- | -------------
1 | Input | To read the input file, line by line, and store the x- and y-coordinates in parallel arrays. | N/A | an array of x-coordinates from the file; a parallel array of y-coordinates from the file; logical size of the arrays | N/A
2 | Process | To compute the mean of the data set. | an array of data; logical size of the array | N/A | the mean of the data in the array
3 | Process | To compute the standard deviation of the data set. | an array of data; logical size of the array; the mean of the data | N/A | the standard deviation of the data in the array
4 | Process | To compute the correlation coefficient. Call the Mean and Standard Deviation functions, where needed. | an array of x-coordinates from the file; a parallel array of y-coordinates from the file; logical size of the array | N/A | the correlation coefficient of the input arrays
5 | Process | To compute the least-squares regression line. Call the Mean, Standard Deviation, and Correlation Coefficient functions, where needed. | an array of x-coordinates from the file; a parallel array of y-coordinates from the file; logical size of the array | the y-intercept of the line; the slope of the line | N/A
6 | Output | To display a mathematical representation of a line to the screen. | the y-intercept of the line; the slope of the line | N/A | N/A
You will also need a main program to drive this program (Test Driver). All computation should be done in the six C++ Functions; the main program should be extremely short.
Sample Screen Output
Regression line: y = 1166.93 + -0.586788x
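The sample line always prints " + " before the slope, even when the slope is negative, and its numbers show six significant digits, which matches the default formatting of C++ output streams. A minimal sketch of producing that text follows; formatLine is a hypothetical name, and returning a string rather than printing directly is an assumption made so the result can be checked.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: builds the text of the regression line in the same
// layout as the sample output, "y = intercept + slope x".
std::string formatLine(double intercept, double slope)
{
    std::ostringstream out;
    // Default stream formatting is used, so values show up to 6 significant digits.
    out << "Regression line: y = " << intercept << " + " << slope << "x";
    return out.str();
}
```

Sending the same expression to cout instead of an ostringstream gives the on-screen version the assignment asks for.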
Testing and Report (15 Points)
When you are finished, test your program with four different input files (Test Driver):
Data File 1
Data File 2 (density in pounds per cubic foot vs. stiffness in pounds per square inch of particleboards; taken from p. 391 of Probability and Statistics for Scientists and Engineers, 6th ed., Walpole/Myers/Myers)
Data File 3 (daily rainfall in 0.01 cm vs. air pollution particulate removed in mcg/cum; taken from p. 365 of Walpole/Myers/Myers)
A data file you've created yourself. Ideally this will be something in the context of your major. Provide information on where the data came from.
Project 4 Submission (5 Points)
Submit a Project Report to the Project 4 Dropbox.
The code should be included in the Code section.
Submit the results from the sample runs in the Sample Runs section.
Submit the input file you create for the 4th sample run.
Explanation / Answer
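#include <cmath>
#include <fstream>
#include <iostream>
using namespace std;

const int MAX_SIZE = 100;  // physical size of the data arrays

// 1. Input: reads x y pairs from test.txt into parallel arrays.
//    n is an output parameter that receives the logical size.
void read(double x[], double y[], int &n)
{
    n = 0;
    ifstream in("test.txt");
    if (!in)
    {
        cerr << "Could not open test.txt" << endl;
        return;
    }
    // Stop at end of file, on malformed input, or when the arrays are full.
    while (n < MAX_SIZE && in >> x[n] >> y[n])
        n++;
}

// 2. Process: returns the mean of the first n elements of a.
double mean(const double a[], int n)
{
    double sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum / n;
}

// 3. Process: returns the sample standard deviation (n - 1 divisor),
//    given the mean m of the data.
double stddev(const double a[], double m, int n)
{
    double sum = 0;
    for (int i = 0; i < n; i++)
        sum += (a[i] - m) * (a[i] - m);
    return sqrt(sum / (n - 1));
}

// 4. Process: returns the correlation coefficient r of x and y.
double corr_coef(const double x[], const double y[], int n)
{
    double mx = mean(x, n);
    double my = mean(y, n);
    double sx = stddev(x, mx, n);
    double sy = stddev(y, my, n);
    double res = 0;
    for (int i = 0; i < n; i++)
        res += ((x[i] - mx) / sx) * ((y[i] - my) / sy);
    return res / (n - 1);
}

// 5. Process: computes the least squares regression line y = a + bx.
//    a (y-intercept) and b (slope) are output parameters.
void calculate_line(const double x[], const double y[], int n, double &a, double &b)
{
    double mx = mean(x, n);
    double my = mean(y, n);
    double sx = stddev(x, mx, n);
    double sy = stddev(y, my, n);
    double r = corr_coef(x, y, n);
    b = r * sy / sx;
    a = my - b * mx;
}

// 6. Output: displays the line in the format of the sample screen output.
void display(double a, double b)
{
    cout << "Regression line: y = " << a << " + " << b << "x" << endl;
}

int main()
{
    double x[MAX_SIZE], y[MAX_SIZE];
    int n;
    double a, b;
    read(x, y, n);
    if (n < 2)
    {
        // The n - 1 divisor needs at least two data points.
        cerr << "Need at least two data points" << endl;
        return 1;
    }
    calculate_line(x, y, n, a, b);
    display(a, b);
    return 0;
}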
input file: test.txt
output:
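One way to sanity-check the program's result, offered here as a sketch rather than as part of the posted answer, is to compute the line a second way. Substituting the definitions of r, s_x and s_y into b = r*sy/sx cancels the standard deviations and leaves the slope as a ratio of two sums of deviations. The function name directLine and its parameter names are hypothetical.

```cpp
// Hypothetical cross-check: computes the least squares line directly from
// sums of deviations, without the correlation coefficient.
// slope = sum((x - mx)(y - my)) / sum((x - mx)^2), intercept = my - slope * mx.
// Assumes n >= 2 and that the x values are not all equal.
void directLine(const double x[], const double y[], int n,
                double& intercept, double& slope)
{
    double mx = 0, my = 0;
    for (int i = 0; i < n; i++)
    {
        mx += x[i];
        my += y[i];
    }
    mx /= n;
    my /= n;

    double sxy = 0, sxx = 0;
    for (int i = 0; i < n; i++)
    {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
    }
    slope = sxy / sxx;
    intercept = my - slope * mx;
}
```

For the points (1, 2), (2, 3), (3, 5) this gives a slope of 1.5 and an intercept of 1/3. Running the regression program on the same three pairs should agree up to rounding.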