1. Given the data set below, apply the k-Nearest Neighbor algorithm to classify
Question
1. Given the data set below, apply the k-Nearest Neighbor algorithm to classify the test data for k=1 and k=3. Use the Euclidean distance metric.
2. Compute the confusion matrix, accuracy, precision, recall, and F1 measures given your answers to problem 1.
3. Assume you have the data set given below, which provides hypothetical examples of instances when people did or did not get hired for a job. It consists of three categorical attributes and a label that indicates "hired" or "not hired". Using this data, induce a decision tree using information gain for splitting the nodes, showing the calculations at each step.
4. Download and install the WEKA data mining toolkit. It is available through this link:
http://www.cs.waikato.ac.nz/ml/weka/downloading.html
Then use the Explorer GUI interface and open the "credit-g.arff" data set that is included in WEKA (data directory). This is a data set from Germany describing credit-worthiness (good or bad) of customers based on 20 different attributes. Go to the classify tab, select "Percentage Split" and enter 50%.
Then run each of the following classification algorithms:
trees.J48
trees.SimpleCart
trees.RandomForest
meta.AdaBoostM1
a) For each algorithm, report the accuracy, precision, and recall values with default parameters
b) Adjust the parameters of each algorithm to maximize the accuracy. Report the algorithm and parameter settings that maximized the accuracy measure and provide the maximum accuracy value.
Training Set
#     x1          x2          true label
1     0.453705    -0.0106     1
2     3.258589    0.169734    1
3     3.184656    -0.83691    0
4     -0.42561    1.385033    0
5     0.658765    -1.87715    0
6     -0.40507    -1.9574     0
7     -4.52775    4.123102    1
8     2.538689    -1.5386     1
9     -1.04649    -3.59664    1
10    2.967113    0.505111    0
Explanation / Answer
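Here we consider the first test sample, starting with k=3.
Steps:
1. Set k=3. The query sample is x1=-4.69237, x2=-4.77898, with true label 1.
2. Calculate the Euclidean distance between the query sample and each training sample. The values listed below are squared distances (the square root is omitted); this does not change the ranking, because the square root is a monotonic function.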
Squared Euclidean distance between each training sample and the query sample
#     x1          x2          Squared Euclidean distance
1     0.453705    -0.0106     (0.453705-(-4.69237))^2+(-0.0106-(-4.77898))^2 = 49.21953573
2     3.258589    0.169734    (3.258589-(-4.69237))^2+(0.169734-(-4.77898))^2 = 87.707519
3     3.184656    -0.83691    77.5874544896
4     -0.42561    1.385033    56.2002971618
5     0.658765    -1.87715    37.0552631371
6     -0.40507    -1.9574     26.3422549864
7     -4.52775    4.123102    79.2741636791
8     2.538689    -1.5386     62.7863326679
9     -1.04649    -3.59664    14.69036885
10    2.967113    0.505111    86.58929752357
3. Sort the distances from minimum to maximum and give each row a rank (rank 1 is the smallest distance). The ranks are listed in the last column of the ranked table at the end of this answer.
4. For k=3, take the three smallest distances. These belong to rows 9, 6, and 5.
5. The labels of these rows in the training set are 1, 0, and 0. Take the label that occurs most often: 0 occurs twice and 1 occurs only once.
So the predicted label for the first test sample is 0 when k=3.
For k=1, we consider only the single nearest neighbor (row 9), which has label 1.
So the predicted label for the first test sample is 1 when k=1.
For this test sample, k=1 gives the correct label (the true label is 1) while k=3 does not, so on this sample k=1 is the more accurate setting. One sample alone is not enough to compare the two values of k overall.
The other test samples can be classified in the same way.
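The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the original answer: the names TRAIN, squared_distance, and knn_predict are hypothetical, and the sketch ranks neighbors by squared Euclidean distance, which gives the same order as the true distance.

```python
from collections import Counter

# Training set from the question: (x1, x2, true label)
TRAIN = [
    (0.453705, -0.0106, 1),
    (3.258589, 0.169734, 1),
    (3.184656, -0.83691, 0),
    (-0.42561, 1.385033, 0),
    (0.658765, -1.87715, 0),
    (-0.40507, -1.9574, 0),
    (-4.52775, 4.123102, 1),
    (2.538689, -1.5386, 1),
    (-1.04649, -3.59664, 1),
    (2.967113, 0.505111, 0),
]


def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points (hypothetical helper)."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def knn_predict(train, query, k):
    """Return the majority label among the k training rows nearest to query."""
    # Step 3: sort the training rows by distance to the query.
    ranked = sorted(train, key=lambda row: squared_distance(row[:2], query))
    # Steps 4-5: count the labels of the k nearest rows and take the most common.
    votes = Counter(label for _, _, label in ranked[:k])
    return votes.most_common(1)[0][0]


query = (-4.69237, -4.77898)  # first test sample, true label 1
print("k=1:", knn_predict(TRAIN, query, 1))
print("k=3:", knn_predict(TRAIN, query, 3))
```

With an odd k and two classes there can be no tie in the vote, so no tie-breaking rule is needed here. Calling knn_predict with another query point repeats the same procedure for the remaining test samples.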
Squared Euclidean distance between each training sample and the query sample, with ranks
#     x1          x2          Squared Euclidean distance                                       Rank (1 = nearest)
1     0.453705    -0.0106     (0.453705-(-4.69237))^2+(-0.0106-(-4.77898))^2 = 49.21953573     4
2     3.258589    0.169734    (3.258589-(-4.69237))^2+(0.169734-(-4.77898))^2 = 87.707519      10
3     3.184656    -0.83691    77.5874544896                                                    7
4     -0.42561    1.385033    56.2002971618                                                    5
5     0.658765    -1.87715    37.0552631371                                                    3
6     -0.40507    -1.9574     26.3422549864                                                    2
7     -4.52775    4.123102    79.2741636791                                                    8
8     2.538689    -1.5386     62.7863326679                                                    6
9     -1.04649    -3.59664    14.69036885                                                      1
10    2.967113    0.505111    86.58929752357                                                   9