Question

1. Given the data set below, apply the k-Nearest Neighbor algorithm to classify the test data for k=1 and k=3. Use the Euclidean distance metric.

2. Compute the confusion matrix, accuracy, precision, recall, and F1 measures given your answers to problem 1.

3. Assume you have the data set given below, which provides hypothetical examples of instances when people did or did not get hired for a job. It consists of three categorical attributes and a label that indicates "hired" or "not hired". Using this data, induce a decision tree using information gain for splitting the nodes, showing the calculations at each step.

4. Download and install the WEKA data mining toolkit. It is available through this link:
http://www.cs.waikato.ac.nz/ml/weka/downloading.html
Then use the Explorer GUI interface and open the "credit-g.arff" data set that is included in WEKA (data directory). This is a data set from Germany describing credit-worthiness (good or bad) of customers based on 20 different attributes. Go to the classify tab, select "Percentage Split" and enter 50%.
Then run each of the following classification algorithms:

trees.J48

trees.SimpleCart

trees.RandomForests

meta.AdaBoostM1

a) For each algorithm, report the accuracy, precision, and recall values with default parameters

b) Adjust the parameters of each algorithm to maximize the accuracy. Report the algorithm and parameter settings that maximized the accuracy measure and provide the maximum accuracy value.

Training Set

#    x1          x2          true label
1    0.453705    -0.0106     1
2    3.258589    0.169734    1
3    3.184656    -0.83691    0
4    -0.42561    1.385033    0
5    0.658765    -1.87715    0
6    -0.40507    -1.9574     0
7    -4.52775    4.123102    1
8    2.538689    -1.5386     1
9    -1.04649    -3.59664    1
10   2.967113    0.505111    0

Explanation / Answer

Here we consider the first test sample, starting with k=3.

Steps:

1. Set k=3. The query point is x1=-4.69237, x2=-4.77898, with true label 1.

2. Calculate the Euclidean distance between the query sample and each training point.

Distance between each training point and the query sample. The values listed are squared Euclidean distances; the square root is omitted because it does not change the ordering.

#    x1          x2          Squared Euclidean distance
1    0.453705    -0.0106     (0.453705-(-4.69237))^2+(-0.0106-(-4.77898))^2 = 49.21953573
2    3.258589    0.169734    87.7075193
3    3.184656    -0.83691    77.5874544896
4    -0.42561    1.385033    56.2002971618
5    0.658765    -1.87715    37.0552631371
6    -0.40507    -1.9574     26.3422549864
7    -4.52775    4.123102    79.2741636791
8    2.538689    -1.5386     62.7863326679
9    -1.04649    -3.59664    14.69036885
10   2.967113    0.505111    86.58929752357
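The arithmetic behind the first row can be written out explicitly. This is only a worked restatement of the entry above, with p_1 denoting training point 1 and q the query point (symbols introduced here for convenience):

```latex
\[
d(\mathbf{p},\mathbf{q})^2 = (p_{x_1} - q_{x_1})^2 + (p_{x_2} - q_{x_2})^2
\]
\[
d(\mathbf{p}_1,\mathbf{q})^2
  = \bigl(0.453705 - (-4.69237)\bigr)^2 + \bigl(-0.0106 - (-4.77898)\bigr)^2
  = 5.146075^2 + 4.76838^2
  \approx 26.48209 + 22.73745
  = 49.21954
\]
\[
d(\mathbf{p}_1,\mathbf{q}) = \sqrt{49.21954} \approx 7.0157
\]
```

Because the square root is monotonic, ranking by the squared value gives the same nearest neighbors as ranking by the distance itself.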

3. Sort the distances from minimum to maximum and assign ranks, as shown in the last column of the ranked table at the end of this answer.

4. For k=3, take the three smallest distances. These belong to rows 9, 6, and 5.

5. The labels of these rows in the training set are 1, 0, and 0. Take the majority label: 0 occurs twice, while 1 occurs only once.

So the predicted label for the first test sample is 0 when k=3.

For k=1, only the single nearest neighbor is considered. That is row 9, which has label 1.

So the predicted label for the first test sample is 1 when k=1.
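The procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names (`squared_distance`, `nearest_rows`, `knn_predict`) are hypothetical, the training rows are copied from the question, and the query is the first test sample used in this answer.

```python
from collections import Counter

# Training set copied from the question: (x1, x2, label), rows 1..10 in order.
TRAIN = [
    (0.453705, -0.0106, 1),
    (3.258589, 0.169734, 1),
    (3.184656, -0.83691, 0),
    (-0.42561, 1.385033, 0),
    (0.658765, -1.87715, 0),
    (-0.40507, -1.9574, 0),
    (-4.52775, 4.123102, 1),
    (2.538689, -1.5386, 1),
    (-1.04649, -3.59664, 1),
    (2.967113, 0.505111, 0),
]


def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def nearest_rows(query, k, train=TRAIN):
    """Return the 1-based row numbers of the k training points closest to query."""
    order = sorted(
        range(len(train)),
        key=lambda i: squared_distance(train[i][:2], query),
    )
    return [i + 1 for i in order[:k]]


def knn_predict(query, k, train=TRAIN):
    """Majority vote over the labels of the k nearest training points."""
    votes = Counter(train[i - 1][2] for i in nearest_rows(query, k, train))
    return votes.most_common(1)[0][0]


query = (-4.69237, -4.77898)  # first test sample from this answer
print(nearest_rows(query, 3))  # [9, 6, 5]
print(knn_predict(query, 1))   # 1
print(knn_predict(query, 3))   # 0
```

With two classes and an odd k there can be no tied vote, so no tie-breaking rule is needed in this sketch.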

For this test sample, whose true label is 1, k=1 gives the correct prediction and k=3 does not, so k=1 is the more accurate setting here.

The other test samples can be classified in the same way.
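Once predictions exist for every test sample, the measures asked for in problem 2 can be computed along the following lines. This is a hedged sketch: `binary_metrics` is a hypothetical helper, label 1 is assumed to be the positive class, and the demonstration uses only the single test sample worked through in this answer, since the rest of the test set is not shown here. Ratios with a zero denominator are returned as `None` rather than guessed.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix counts and derived measures for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as None.
        return num / den if den else None

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)

    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Only the first test sample is known from this answer (true label 1).
metrics_k1 = binary_metrics([1], [1])  # k=1 predicted 1
metrics_k3 = binary_metrics([1], [0])  # k=3 predicted 0
print(metrics_k1["accuracy"], metrics_k3["accuracy"])  # 1.0 0.0
```

On the full test set the same function would be called with the complete lists of true and predicted labels, once for k=1 and once for k=3.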

Distance between each training point and the query sample, with ranks. The values listed are squared Euclidean distances, which give the same ordering as the distances themselves.

#    x1          x2          Squared Euclidean distance                                       Rank (1 = nearest)
1    0.453705    -0.0106     (0.453705-(-4.69237))^2+(-0.0106-(-4.77898))^2 = 49.21953573    4
2    3.258589    0.169734    87.7075193                                                       10
3    3.184656    -0.83691    77.5874544896                                                    7
4    -0.42561    1.385033    56.2002971618                                                    5
5    0.658765    -1.87715    37.0552631371                                                    3
6    -0.40507    -1.9574     26.3422549864                                                    2
7    -4.52775    4.123102    79.2741636791                                                    8
8    2.538689    -1.5386     62.7863326679                                                    6
9    -1.04649    -3.59664    14.69036885                                                      1
10   2.967113    0.505111    86.58929752357                                                   9