Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.



Question

Hello, I am looking for a solution to the Classifier: Income Predictor programming project (PP1 of Chapter 10) on pg. 473 of The Practice of Computing Using Python, 2nd ed. I would be grateful for assistance with this. The full question is as follows:

Income predictor

Using a dataset (the "Adult Data Set") from the UCI Machine Learning Repository, we can predict, based on a number of factors, whether someone's income will be greater than $50,000.

The technique

The approach is to create a 'classifier' - a program that takes a new example record and, based on previous examples, determines which 'class' it belongs to. In this problem we consider attributes of records and separate them into two broad classes, <50K and >=50K.

We begin with a training data set - examples with known outcomes. The classifier looks for patterns that indicate classification, and these patterns can be applied against new data to predict outcomes. If we already know the outcomes of the test data, we can check the reliability of our model; if it proves reliable, we could then use it to classify data with unknown outcomes.

We must train the classifier to establish an internal model of the patterns that distinguish our two classes. Once trained, we can apply this model against the test data, which has known outcomes. We take our data and split it into two groups - training and test - with most of the data in the training set. We need to write a program to find the patterns in the training set.

Building the classifier

Look at the attributes and, for each of the two outcomes, compute an average value for each attribute. Then average these two results for each attribute to obtain a midpoint, or 'class separation value'. For each record, test whether each attribute is above or below its midpoint value and flag it accordingly. The overall result for a record is whichever class (<50K, >=50K) receives the greater count of individual flags. You'll know your model works if you achieve the same results as the known outcomes for the records. You should track the accuracy of your model, i.e. how many correct classifications you made as a percentage of the total number of records.

Process overview

1. Create the training set from the data.
2. Create the classifier, using the training set to determine separator values for each attribute.
3. Create the test set.
4. Use the classifier to classify the data in the test set while maintaining an accuracy score.

The data

The data is presented in the form of a comma-delimited text file (CSV) with the following attributes:

1. Age: number.
2. Workclass: one of Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
3. fnlwgt: number. This is NOT NEEDED for our study.
4. Education: one of Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. This is NOT NEEDED for our study.
5. Education-number: number indicating level of education.
6. Marital-status: one of Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
7. Occupation: one of Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
8. Relationship: one of Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
9. Race: one of White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
10. Sex: either Female or Male.
11. Capital-gain: number.
12. Capital-loss: number.
13. Hours-per-week: number.
14. Native-country: one of United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. This is NOT NEEDED for our study.
15. Outcome for this record: >50K or <=50K.

The data is available from http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data. You should be able to read this directly from the Internet.

Fields that have 'discrete' attributes, such as 'Relationship', can be given a numeric weight by counting the number of occurrences as a fraction of the total number of positive records (outcome >= 50K) and negative records (outcome < 50K). So, if we have 10 positive records with the values Wife: 2, Own-child: 3, Husband: 2, Not-in-family: 1, Other-relative: 1 and Unmarried: 1, this yields factors of 0.2, 0.3, 0.2, 0.1, 0.1 and 0.1 respectively.

I would be grateful if your answer could produce the output below:

"The number of tested records is ...?
The number of correct predictions is ...?
The number of incorrect predictions is ...?
The percentage of correct predictions is: ... %"
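To make the technique concrete, here is a minimal sketch of the midpoint ('class separation value') classifier described above. It reads adult.data directly from the UCI URL given in the question, uses only the five numeric attributes for brevity (the discrete attributes could be folded in using the fractional-weight scheme just described), skips rows with missing values, and uses an 80/20 train/test split. The split ratio and the restriction to numeric attributes are simplifying assumptions, not part of the assignment text.

import urllib.request

URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

# 0-based indexes of the numeric attributes used here:
# age, education-num, capital-gain, capital-loss, hours-per-week (fnlwgt is skipped).
NUMERIC = [0, 4, 10, 11, 12]

def read_records(url=URL):
    """Read the Adult data directly from the Internet into a list of field lists."""
    records = []
    with urllib.request.urlopen(url) as response:
        for line in response.read().decode("utf-8").splitlines():
            fields = [f.strip() for f in line.split(",")]
            if len(fields) < 15 or "?" in fields:
                continue  # skip blank or incomplete rows
            records.append(fields)
    return records

def train(records):
    """Return the midpoint ('class separation value') for each numeric attribute."""
    sums = {"<=50K": [0.0] * len(NUMERIC), ">50K": [0.0] * len(NUMERIC)}
    counts = {"<=50K": 0, ">50K": 0}
    for rec in records:
        label = rec[14]
        counts[label] += 1
        for j, col in enumerate(NUMERIC):
            sums[label][j] += float(rec[col])
    # Midpoint = average of the two per-class averages, attribute by attribute.
    return [(sums["<=50K"][j] / counts["<=50K"] + sums[">50K"][j] / counts[">50K"]) / 2
            for j in range(len(NUMERIC))]

def classify(record, midpoints):
    """Each attribute votes: a value above its midpoint counts toward '>50K'."""
    above = sum(1 for j, col in enumerate(NUMERIC) if float(record[col]) > midpoints[j])
    return ">50K" if above > len(NUMERIC) - above else "<=50K"

def main():
    records = read_records()
    split = int(len(records) * 0.8)          # 80% training, 20% test
    training, test = records[:split], records[split:]
    midpoints = train(training)
    correct = sum(1 for rec in test if classify(rec, midpoints) == rec[14])
    total = len(test)
    print("The number of tested records is", total)
    print("The number of correct predictions is", correct)
    print("The number of incorrect predictions is", total - correct)
    print("The percentage of correct predictions is: %.2f %%" % (100.0 * correct / total))

if __name__ == "__main__":
    main()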

Explanation / Answer

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import joblib


class Data_Modeling(object):

    def __init__(self):
        # 'cleanData.csv' is a pre-cleaned copy of the Adult data set whose
        # column names match those used in transformData() below.
        self.data = pd.read_csv('cleanData.csv')

    def setData(self, newFileName):
        self.data = pd.read_csv(newFileName)

    def getData(self):
        return self.data

    def transformData(self, data):
        # Select the relevant features (fnlwgt, Education and Native-country
        # are dropped, as the assignment allows).
        relevantFeatures = ['Martial_Status', 'Occupation', 'Relationship', 'Race', 'Sex',
                            'Age', 'Education_Num', 'Capital_Gain', 'Capital_Loss', 'Hours_Per_Week']

        # Construct the feature matrix X and the target array y.
        X = data[relevantFeatures].values
        y = data['Income'].values

        # Split the data set into training (80%) and testing (20%) sets.
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

        # Transform the five categorical variables into integer labels.
        # Each encoder is fitted on the training set only.
        transformers = [LabelEncoder() for _ in range(5)]
        for i, le in enumerate(transformers):
            X_train[:, i] = le.fit_transform(X_train[:, i])
            X_test[:, i] = le.transform(X_test[:, i])

        # Dummy-code (one-hot encode) the five categorical columns and pass the
        # numeric columns through unchanged; sparse_threshold=0 forces a dense
        # array so StandardScaler can be applied directly.
        dummy_code = ColumnTransformer(
            [('onehot', OneHotEncoder(handle_unknown='ignore'), list(range(5)))],
            remainder='passthrough',
            sparse_threshold=0
        )
        X_train = dummy_code.fit_transform(X_train)
        X_test = dummy_code.transform(X_test)

        # Normalize all features.
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)

        # Encode y ("<=50K" -> 0, ">50K" -> 1).
        class_le = LabelEncoder()
        y_train = class_le.fit_transform(y_train)
        y_test = class_le.transform(y_test)

        return X_train, X_test, y_train, y_test

    # Logistic Regression
    def buildLogisticRegression(self, X_train, X_test, y_train, cv=5, save=False):
        # max_iter is raised so the default lbfgs solver converges.
        lr = LogisticRegression(max_iter=1000)

        # Tune the regularization strength with a grid search.
        param_grid = {
            'C': [10**-5, 10**-4, 0.001, 0.01, 0.1, 1, 10, 100]
        }
        lr_optimized = GridSearchCV(
            estimator=lr,
            param_grid=param_grid,
            scoring='f1',
            cv=cv
        )
        lr_optimized.fit(X_train, y_train)

        if save:
            joblib.dump(value=lr_optimized, filename='lr_optimized.pkl', compress=1)

        print("Best parameter: %s" % lr_optimized.best_params_)
        print("Best average cross-validated F1 score: %0.4f" % lr_optimized.best_score_)
        print("--------------------------------------------")
        print(lr_optimized.best_estimator_.coef_)

        # Predictions
        predicted_y_train = lr_optimized.predict(X_train)
        predicted_y_test = lr_optimized.predict(X_test)
        return predicted_y_train, predicted_y_test

    # Random Forest
    def buildRandomForest(self, X_train, X_test, y_train, cv=3, n_iter=5, save=False):
        # n_jobs=-1 uses all CPU cores; it is an estimator option rather than
        # a hyperparameter, so it is set here instead of being searched over.
        rf = RandomForestClassifier(random_state=9, n_jobs=-1)

        # Tune the model with a randomized search over these ranges.
        param_distributions = {
            'n_estimators': list(range(1, 50)),
            'max_depth': list(range(1, 70)),
            'max_features': list(range(6, 15)),
            'min_samples_split': [2, 3, 4],
            'min_samples_leaf': [1, 2, 3, 4]
        }
        rf_optimized = RandomizedSearchCV(
            estimator=rf,
            param_distributions=param_distributions,
            n_iter=n_iter,
            scoring='f1',
            cv=cv,
            random_state=1
        )
        rf_optimized.fit(X_train, y_train)

        if save:
            joblib.dump(value=rf_optimized, filename='rf_optimized.pkl', compress=1)

        print("Best parameter: %s" % rf_optimized.best_params_)
        print("Best average cross-validated F1 score: %0.4f" % rf_optimized.best_score_)
        print("--------------------------------------------")

        # Predictions
        predicted_y_train = rf_optimized.predict(X_train)
        predicted_y_test = rf_optimized.predict(X_test)
        return predicted_y_train, predicted_y_test

    # Evaluate model performance
    def evaluatePerformance(self, actual, prediction, title):
        print(title)
        print("Accuracy is %.4f" % accuracy_score(actual, prediction))
        print("F1 score is %0.4f" % f1_score(actual, prediction))
        print(classification_report(actual, prediction, target_names=["<=50K", ">50K"]))
        matrix = confusion_matrix(actual, prediction)
        print("Confusion Matrix:")
        print(matrix)
        print("----------------------------------------------------")

        # Plot the confusion matrix as a heat map.
        plt.figure(1)
        sns.heatmap(
            data=matrix,
            annot=True,
            fmt="d",
            xticklabels=["<=50K", ">50K"],
            yticklabels=["<=50K", ">50K"],
            square=True
        )
        plt.title(title)
        plt.xlabel("Prediction")
        plt.ylabel("Actual")
        plt.show()
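As a usage sketch, the class above could be driven as follows to produce the figures the question asks for. This assumes a pre-cleaned 'cleanData.csv' with the column names used in transformData(); the driver below and its variable names are illustrative, not part of the original answer.

if __name__ == "__main__":
    model = Data_Modeling()                       # loads cleanData.csv
    X_train, X_test, y_train, y_test = model.transformData(model.getData())

    # Fit the tuned logistic regression and classify the held-out test set.
    _, predicted_y_test = model.buildLogisticRegression(X_train, X_test, y_train)
    model.evaluatePerformance(y_test, predicted_y_test, "Logistic Regression (test set)")

    # Report the results in the format requested in the question.
    total = len(y_test)
    correct = int(accuracy_score(y_test, predicted_y_test, normalize=False))
    print("The number of tested records is", total)
    print("The number of correct predictions is", correct)
    print("The number of incorrect predictions is", total - correct)
    print("The percentage of correct predictions is: %.2f %%" % (100.0 * correct / total))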
