Question

This is a question in relation to creating a classifier in Python:

Classifier for sample breast cancer dataset

Process overview
1. Create training set from data
    Here we read our data file directly from the web and split it out into a list of tuples, one tuple per record. We do some conversion of factors from string to int and a malignant/benign code, '2'/'4' to 'm'/'b' respectively. We also test for value errors and silently drop any malformed rows.

    We then break our dataset out into training and test sets. The training set holds the share of the records given by the PERCENT value, and the test set holds the remaining records.

2. Create classifier using training dataset to determine separator values for each attribute
    Across the training records we average the values of each attribute, once over the records known to be benign and, separately, once over the records known to be malignant. The benign and malignant averages are then averaged against each other to compute midpoint values. These are used to compare each attribute in a record and assign it a status, benign or malignant. The overall result for a record is whichever status occurs more often among its attributes.

3. Create test dataset
    We apply the classifier list against each record in the test set. We compare each attribute against its equivalent value in the classifier list. Based on this, the attribute gets a status - 'b' or 'm'. The count of the status values for a record determines the result.

4. Use classifier to classify data in test set while maintaining accuracy score
    Given that we know the outcome for each test record we can verify the classifier.
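
A minimal sketch of steps 2 to 4 on a toy dataset may help here. The records and their values are invented for illustration, and the helper names (class_averages, build_classifier, classify) are hypothetical, not part of the answer below. Each attribute gets a midpoint between the two class averages, each attribute of a new record votes for the class on its side of the midpoint, and the majority wins.

```python
from collections import Counter

# Toy training records: (id, attribute 1..3, label). All values are invented.
TRAINING = [
    ("p1", 1, 2, 1, "b"),
    ("p2", 3, 2, 1, "b"),
    ("p3", 8, 7, 9, "m"),
    ("p4", 6, 9, 7, "m"),
]


def class_averages(records, label):
    """Average each attribute over the records carrying the given label."""
    rows = [r[1:-1] for r in records if r[-1] == label]
    if not rows:
        raise ValueError("no records with label {!r}".format(label))
    return [sum(column) / len(rows) for column in zip(*rows)]


def build_classifier(records, label_a, label_b):
    """Per attribute: (midpoint, label below the midpoint, label at or above it)."""
    classifier = []
    for a, b in zip(class_averages(records, label_a),
                    class_averages(records, label_b)):
        low, high = (label_a, label_b) if a < b else (label_b, label_a)
        classifier.append(((a + b) / 2, low, high))
    return classifier


def classify(record, classifier):
    """Let every attribute vote, then return the label with the most votes."""
    votes = [low if value < midpoint else high
             for value, (midpoint, low, high) in zip(record[1:-1], classifier)]
    return Counter(votes).most_common(1)[0][0]


classifier = build_classifier(TRAINING, "b", "m")
print([midpoint for midpoint, _, _ in classifier])
print(classify(("p5", 2, 6, 1, "?"), classifier))
```

Unlike a fixed "below the midpoint means one class" rule, this sketch reads the direction of each comparison from the class averages, so it does not depend on which class happens to have the lower values.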

Each data row consists of a patient id followed by nine indicators followed by an overall result. Sample data row: '1000025','5','1','1','1','2','1','3','1','1','2'. In this case '1000025' is the patient id and the final value is the overall result, indicated as '2' - malignant or '4' - benign.
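
As an illustration of that layout, here is a small sketch that parses the sample row into a tuple. The function name parse_row is hypothetical, and stripping the single quotes is an assumption made so that the sample row as printed above can be parsed; rows without quotes are handled the same way.

```python
def parse_row(line):
    """Return (id, nine int indicators, result code), or None if the row is malformed."""
    fields = [f.strip().strip("'") for f in line.strip().split(",")]
    if len(fields) != 11:
        return None  # wrong number of fields
    try:
        indicators = [int(f) for f in fields[1:-1]]
    except ValueError:
        return None  # a non-numeric indicator
    return (fields[0], *indicators, fields[-1])


sample = "'1000025','5','1','1','1','2','1','3','1','1','2'"
print(parse_row(sample))
```

The id and the result code are kept as strings; only the nine indicators are converted to int.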

DATA_URL = http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data

This is the answer:

DATA_URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
PERCENT = 75

import httplib2

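def create_data(data_url):
    """Download the data file and return a list of tuples, one per valid record."""
    ts_list = []

    try:
        h = httplib2.Http(".cache")
        headers, fh = h.request(data_url)
        # The response body is bytes: decode it and split it into lines.
        fh = fh.decode().splitlines()

        for row in fh:
            row = row.strip()
            if not row:
                continue  # skip blank lines
            row_list = row.split(",")
            try:
                # Convert the nine indicators to int. The patient id (first)
                # and the result code (last) stay as strings.
                for i in range(1, len(row_list) - 1):
                    row_list[i] = int(row_list[i])

                if row_list[-1] == "2":
                    row_list[-1] = "m"
                elif row_list[-1] == "4":
                    row_list[-1] = "b"
                else:
                    row_list[-1] = ""
                ts_list.append(tuple(row_list))
            except ValueError as v:
                # Malformed row: report the patient id and drop the row.
                print(row_list[0], v)
                continue

    except IOError as e:
        print(e)
    except ValueError as v:
        print(v)

    print(ts_list)
    return ts_list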

def create_classifier(training_list):
    """Return the midpoint between the benign and malignant averages for each attribute."""
    benign_attrs = [0] * 9
    malignant_attrs = [0] * 9
    benign_count = 0
    malignant_count = 0
    classifier_list = [0] * 9

    # Compute the totals
    for record in training_list:
        if record[-1] == "b":
            benign_count += 1
            for attribute in range(len(record[1:-1])):
                benign_attrs[attribute] += record[attribute + 1]
        elif record[-1] == "m":
            malignant_count += 1
            for attribute in range(len(record[1:-1])):
                malignant_attrs[attribute] += record[attribute + 1]

    # Both classes must be present, otherwise the averages are undefined.
    if benign_count == 0 or malignant_count == 0:
        raise ValueError("training set needs both benign and malignant records")

    # Compute the averages
    for attribute in range(len(benign_attrs)):
        benign_attrs[attribute] = benign_attrs[attribute] / benign_count
    for attribute in range(len(malignant_attrs)):
        malignant_attrs[attribute] = malignant_attrs[attribute] / malignant_count

    # Compute the midpoints
    for attribute in range(len(classifier_list)):
        classifier_list[attribute] = (benign_attrs[attribute] + malignant_attrs[attribute]) / 2

    print(classifier_list)
    return classifier_list

def create_test(test_list, classifier_list):
    """Classify each test record and print a per-record verdict and a final tally."""
    false_count = 0
    true_count = 0
    total_count = 0

    for record in test_list:
        # Start from a fresh result row so that the previous record's
        # overall result is not counted in this record's vote.
        temp_result_list = [""] * 11
        temp_result_list[0] = record[0]
        for attribute in range(len(record[1:-1])):
            if record[attribute + 1] < classifier_list[attribute]:
                temp_result_list[attribute + 1] = "m"
            else:
                temp_result_list[attribute + 1] = "b"

        # Majority vote over the nine attribute statuses.
        if temp_result_list.count("m") >= 5:
            temp_result_list[-1] = "m"
        else:
            temp_result_list[-1] = "b"

        print(temp_result_list, end=" ")
        total_count += 1
        if record[-1] == temp_result_list[-1]:
            true_count += 1
            print("CORRECT")
        else:
            false_count += 1
            print("FALSE")

    if total_count:
        print("\nCORRECT: {}, {:.2%}, INCORRECT: {}, {:.2%}, TOTAL COUNT: {}"
              .format(true_count, true_count / total_count, false_count,
                      false_count / total_count, total_count))

def main():
    # Make a list of tuples from the raw data
    data_list = create_data(DATA_URL)
    # Break out our dataset into training and test sets.
    split_index = int(len(data_list) * PERCENT / 100)
    training_list = data_list[:split_index]
    test_list = data_list[split_index:]
    # Create the classifier values
    classifier_list = create_classifier(training_list)
    # Apply classifier against test file.
    create_test(test_list, classifier_list)


if __name__ == "__main__":
    main()

I have a similar question below and would like an answer in the same format as the breast cancer classifier answer. The main difference is that some of the data fields in the income question have discrete attributes.

Income predictor

Using a dataset (the "Adult Data Set") from the UCI Machine-Learning Repository, we can predict, based on a number of factors, whether someone's income will be greater than $50,000.

The technique

The approach is to create a 'classifier' - a program that takes a new example record and, based on previous examples, determines which 'class' it belongs to. In this problem we consider attributes of records and separate these into two broad classes, <=50K and >50K.

We begin with a training data set - examples with known solutions. The classifier looks for patterns that indicate classification. These patterns can be applied against new data to predict outcomes. If we already know the outcomes of the test data, we can test the reliability of our model. If it proves reliable we could then use it to classify data with unknown outcomes.

We must train the classifier to establish an internal model of the patterns that distinguish our two classes. Once trained we can apply this against the test data - which has known outcomes.

We take our data and split it into two groups - training and test - with most of the data in the training set.
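
A minimal sketch of such a split, mirroring the slicing used in the breast cancer answer. The function name split_records is hypothetical, and the 75 percent share is only an example value. The sketch keeps the records in file order, which assumes the file is not sorted in a way that would bias either set.

```python
def split_records(records, percent):
    """Return (training, test): the first percent% of the records, then the rest."""
    cut = int(len(records) * percent / 100)
    return records[:cut], records[cut:]


training, test = split_records(list(range(20)), 75)
print(len(training), len(test))  # 15 5
```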

We need to write a program to find the patterns in the training set.

Building the classifier

Look at the attributes and, for each of the two outcomes, compute an average value for each one. Then average these two results for each attribute to compute a midpoint or 'class separation value'.

For each record, test whether each attribute is above or below its midpoint value and flag it accordingly. For each record the overall result is whichever of the individual results (<=50K or >50K) has the greater count.

You'll know your model works if you achieve the same results as the known results for the records. You should track the accuracy of your model, i.e. how many correct classifications you made as a percentage of the total number of records.
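
Tracking accuracy could look like the following sketch. The function name accuracy and the two short lists are made up for illustration; the labels follow the outcome values used in the data file.

```python
def accuracy(known, predicted):
    """Share of records whose predicted outcome equals the known outcome."""
    if len(known) != len(predicted):
        raise ValueError("known and predicted must have the same length")
    if not known:
        return 0.0
    correct = sum(1 for k, p in zip(known, predicted) if k == p)
    return correct / len(known)


known = [">50K", "<=50K", "<=50K", ">50K"]
predicted = [">50K", "<=50K", ">50K", ">50K"]
print("{:.2%}".format(accuracy(known, predicted)))  # 75.00%
```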

Process overview

Create training set from data

Create classifier using training dataset to determine separator values for each attribute

Create test dataset

Use classifier to classify data in test set while maintaining accuracy score

The data

The data is presented in the form of a comma-delimited text file (CSV) which has the following structure:

Listing of attributes:

1. Age: Number.
2. Workclass: Can be one of -- Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
3. fnlwgt: number. This is NOT NEEDED for our study.
4. Education: Can be one of -- Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. This is NOT NEEDED for our study.
5. Education-number: Number -- indicates level of education.
6. Marital-status: Can be one of -- Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
7. Occupation: Can be one of -- Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
8. Relationship: Can be one of -- Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
9. Race: Can be one of -- White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
10. Sex: Either Female or Male.
11. Capital-gain: Number.
12. Capital-loss: Number.
13. Hours-per-week: Number.
14. Native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. This is NOT NEEDED for our study.
15. Outcome for this record: Can be >50K or <=50K.
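
One way to turn a line with this structure into a record is sketched below. The field names, the parse_record function and the choice of a dict are all assumptions made for illustration. The sketch also assumes that values are separated by a comma plus optional spaces and that a missing value is marked with "?", in which case the row is dropped, much as the breast cancer answer drops malformed rows. The three fields marked NOT NEEDED are skipped.

```python
FIELD_NAMES = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "outcome",
]
SKIPPED = {"fnlwgt", "education", "native_country"}   # NOT NEEDED for the study
NUMERIC = {"age", "education_num", "capital_gain", "capital_loss", "hours_per_week"}


def parse_record(line):
    """Return a dict of the needed fields, or None if the line is unusable."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELD_NAMES):
        return None
    record = {}
    for name, value in zip(FIELD_NAMES, values):
        if name in SKIPPED:
            continue
        if value == "?":
            return None  # assumed marker for a missing value
        try:
            record[name] = int(value) if name in NUMERIC else value
        except ValueError:
            return None  # a numeric field that does not hold a number
    return record


line = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
        "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
print(parse_record(line))
```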

Data is available from http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data. You should be able to read this directly from the Internet.

Fields that have 'discrete' attributes such as 'Relationship' can be given a numeric weight by counting the number of occurrences of each value as a fraction of the total number of positive records (outcome >50K) and, separately, of negative records (outcome <=50K). So, if we have 10 positive records and they have values Wife:2, Own-child:3, Husband:2, Not-in-family:1, Other-relative:1 and Unmarried:1 then this would yield factors of 0.2, 0.3, 0.2, 0.1, 0.1 and 0.1 respectively.
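
The worked example above can be reproduced with a short sketch. The function name value_fractions and the dict-based record layout (keys "relationship" and "outcome") are assumptions for illustration only; the ten records are built from the counts given in the example.

```python
from collections import Counter


def value_fractions(records, field, outcome):
    """For records with the given outcome, the fraction carrying each value of field."""
    values = [r[field] for r in records if r["outcome"] == outcome]
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


# Ten positive records with the value counts from the example.
example_counts = {"Wife": 2, "Own-child": 3, "Husband": 2,
                  "Not-in-family": 1, "Other-relative": 1, "Unmarried": 1}
positive_records = [{"relationship": value, "outcome": ">50K"}
                    for value, n in example_counts.items() for _ in range(n)]

fractions = value_fractions(positive_records, "relationship", ">50K")
for value in example_counts:
    print(value, fractions[value])
```

Running the same function with the outcome "<=50K" on the negative records would give the second set of factors, so each discrete value ends up with one weight per class.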

Explanation / Answer

Answer :

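def report_accuracy(test_list, result_list):
    # Hypothetical wrapper around the fragment given in the answer:
    # result_list is assumed to hold one temp_result_list per test record,
    # in the same order, with the predicted outcome as its last element.
    false_count = 0
    true_count = 0
    total_count = 0

    for record, temp_result_list in zip(test_list, result_list):
        print(temp_result_list, end=" ")
        total_count += 1
        if record[-1] == temp_result_list[-1]:
            true_count += 1
            print("CORRECT")
        else:
            false_count += 1
            print("FALSE")

    if total_count:
        print("\nCORRECT: {}, {:.2%}, INCORRECT: {}, {:.2%}, TOTAL COUNT: {}"
              .format(true_count, true_count / total_count, false_count,
                      false_count / total_count, total_count))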
