Question

Python

When analyzing data, we often have some idea that two quantities are related and would like to quantify this relationship. We will eventually learn to use linear least-squares approximation to do so. For now, we want to see an example of correlated data and how a least-squares fit of that data can give us a good quantitative picture of the relationship. This problem will have some similarities with the power law problem, but in this case we are using actual data available online.

The file [baseball-data.csv] contains the batting average and RBIs from 2014 for each of the MLB teams. Each row represents data for a single team, with the first column containing the RBI count and the second column containing the batting average.

Write Python code to read in the baseball data using np.loadtxt(). Your code will see a variable called baseball_csv, which you can treat as the filename when calling np.loadtxt. You will not need quotes around baseball_csv in this case.

Plot the data points (with AVG on the x-axis and RBI on the y-axis) using matplotlib. Additionally, plot the line y=mx+b, where m=4074.611 and b=-398.302, on the same graph as the data points. This line is the line of best fit through the data. Which data point is closest to the line (vertically)? Which data point is farthest from the line (vertically)?

Your code should produce a variable called closest_pt, which is a 2-tuple of the form (x,y) where x is the AVG and y is the RBI for the closest data point, and a variable called closest_dist, which is the vertical distance between the closest point and the line through the data. Your code should likewise produce a variable called farthest_pt, a 2-tuple of the same form, and a variable called farthest_dist, which is the vertical distance between the farthest point and the line through the data.

RBI  AVG
729  0.259
731  0.277
721  0.276
686  0.244
690  0.259
686  0.265
675  0.254
681  0.256
635  0.253
659  0.259
644  0.253
636  0.255
625  0.253
604  0.263
617  0.25
614  0.253
597  0.256
601  0.244
600  0.244
591  0.245
596  0.242
602  0.239
584  0.242
585  0.253
573  0.248
590  0.239
586  0.247
562  0.238
545  0.241
500  0.226

Explanation / Answer

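# import libraries
import numpy as np
from matplotlib import pyplot as plt

# baseball_csv is provided by the environment and is used as the filename (no quotes)
a = np.loadtxt(baseball_csv, delimiter=",")
# column 0 holds RBI, column 1 holds AVG
x = a[:,1]  # AVG (x-axis)
y = a[:,0]  # RBI (y-axis)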

fig = plt.figure()
coord = fig.add_subplot(1,1,1)
# plot the data points as blue dots with no connecting line
fig1 = coord.plot(x, y, marker=".", linestyle="", color="blue")

m = 4074.611
b = -398.302

x_line = np.linspace(0.22,0.28,2)
y_line = m*x_line + b
fig2 = coord.plot(x_line, y_line, color="blue")

def distToLine(x, y):
    dist = abs(y - m*x - b)
    return dist

# start both searches from the first data point so that it is also
# a candidate for the farthest point
first_point = tuple(a[0,:])
closest_pt = (first_point[1], first_point[0])
closest_dist = distToLine(first_point[1], first_point[0])
farthest_pt = closest_pt
farthest_dist = closest_dist
# compare each remaining point with the current closest and farthest
for i in range(1, len(x)):
    point = tuple(a[i,:])
    currDist = distToLine(point[1], point[0])
    if currDist < closest_dist:
        closest_dist = currDist
        closest_pt = (point[1], point[0])
    if currDist > farthest_dist:
        farthest_dist = currDist
        farthest_pt = (point[1], point[0])


plt.show()
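
As a cross-check on the loop above, the same search can be sketched with NumPy array operations instead of an explicit loop. This is only an illustrative variant, not the required solution: it assumes the data is pasted into a string (the name raw is hypothetical and stands in for the file behind baseball_csv), and it uses np.argmin and np.argmax to locate the smallest and largest vertical distances.

```python
import numpy as np

# Data copied from the question as flat "RBI AVG" pairs.
# Assumption: this string stands in for the contents of baseball_csv.
raw = """
729 0.259 731 0.277 721 0.276 686 0.244 690 0.259 686 0.265
675 0.254 681 0.256 635 0.253 659 0.259 644 0.253 636 0.255
625 0.253 604 0.263 617 0.25 614 0.253 597 0.256 601 0.244
600 0.244 591 0.245 596 0.242 602 0.239 584 0.242 585 0.253
573 0.248 590 0.239 586 0.247 562 0.238 545 0.241 500 0.226
"""

# Convert the tokens to floats and reshape into rows of (RBI, AVG).
data = np.array([float(tok) for tok in raw.split()]).reshape(-1, 2)
rbi = data[:, 0]
avg = data[:, 1]

# Slope and intercept of the best-fit line given in the question.
m = 4074.611
b = -398.302

# Vertical distance of every point from the line, computed at once.
dist = np.abs(rbi - (m * avg + b))

# Indices of the smallest and largest vertical distances.
i_close = int(np.argmin(dist))
i_far = int(np.argmax(dist))

closest_pt = (float(avg[i_close]), float(rbi[i_close]))
closest_dist = float(dist[i_close])
farthest_pt = (float(avg[i_far]), float(rbi[i_far]))
farthest_dist = float(dist[i_far])

print(closest_pt, closest_dist)
print(farthest_pt, farthest_dist)
```

With the data listed in the question, this picks (0.277, 731) as the closest point, at a vertical distance of about 0.63, and (0.244, 686) as the farthest, at about 90.1. The loop version should agree with these values.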