Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.

0.1 Getting rid of meaningless rows. Using python! Before we start, we see that

ID: 3824797 • Letter: 0

Question

0.1 Getting rid of meaningless rows. Using python!

Before we start, we see that there are a few rows that contain column titles instead of actual data, e.g. ['RK', 'Player', 'Pos', 'Age' .....]because the data was concatenated using multiple excel pages and each page has one line of titles, we want to get rid of those rows. This column title row appears once every 100 rows of actual data.

The first step of cleaning your data is to get rid of the values that do not exist, or nan values. Since some players did not play enough games or had enough valid data points stored during this season, we observe that one or more of their categories such as 'FT' and 'FTA' are both 00, resulting in 'nan' under 'FT%', such non-numerical values will result in errors when we try to compute any further calculations. Therefore, we want to remove those players that have one or more 'nan' values from our dataset.

(1) Create a table that only keeps the rows from basketball_data that have numerical values (or getting rid of column title rows) called version_1.

(2) Create a new table called version_2 that only keeps players that have no 'non' values in any of the categories from version_1.

In [6]:

Rk Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% eFG% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS 119 Goran Dragic SG 29 MIA 70 70 2304 400 844 .474 64 202 .317 336 642 .523 .512 120 165 .727 50 220 270 412 69 15 183 179 984 120 Andre Drummond C 22 DET 80 80 2633 548 1049 .522 2 6 .333 546 1043 .523 .523 207 582 .356 386 794 1180 67 117 111 153 241 1305 Rk Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% eFG% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS 121 Jared Dudley PF 30 WAS 79 39 2042 224 473 .474 99 233 .425 125 240 .521 .578 69 94 .734 50 224 274 162 72 18 79 185 616 122 Tim Duncan C 39 SAS 60 59 1502 210 427 .492 0 2 .000 210 425 .494 .492 90 127 .709 111 327 438 160 47 77 89 123 510

Explanation / Answer

from xlrd import open_workbook

class Arm(object):
def __init__(self, id, name, age, tm, g, gs, mp, fg,FG, p3, pa, p31, p2, pa, p21, efg, ft, fta, ft1, orb, drb, ast, stl, blk, tov, pts):
self.id = id
self.name = name
self.age = age
self.tm = tm
self.g = g
self.gs=gs
self.mp=mp
self.fg=fg
self.FG=FG
self.p3=p3
self.pa=pa
self.p31=p31
self.p2=p2
self.p21=p21
self.efg=efg
self.ft=ft
self.fta=fta
self.ft1=ft1
self.orb=orb
self.drb=drb
self.ast=ast

def __str__(self):
return format(self.id,
self.name, self.age,self.tm, self.g, self.gs,self.mp,self.fg,self.FG,self.p3,
self.pa,
self.p31,
self.p2, self.p21,
self.efg,
self.ft,
self.fta,
self.ft1,
self.orb,
self.drb,
self.ast)

wb = open_workbook('player.xls')
for sheet in wb.sheets():
number_of_rows = sheet.nrows
number_of_columns = sheet.ncols

items = []

rows = []
for row in range(1, number_of_rows):
values = []
for col in range(number_of_columns):
value = (sheet.cell(row,col).value)
try:
value = str(int(value))
except ValueError:
pass