
Question

Currently I'm trying to classify spam emails with kNN classification. The dataset is represented in bag-of-words notation and contains approximately 10,000 observations with approximately 900 features. MATLAB is the tool I use to process the data.

Over the last few days I have played with several machine learning approaches: SVM, Bayes, and kNN. From my point of view, kNN beats SVM and Bayes when it comes to minimizing the false positive rate. Checking with 10-fold cross-validation, I obtain a false positive rate of 0.0025 using k = 9 and Manhattan distance. Hamming distance performs in the same region.
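For reference, this baseline can be reproduced in MATLAB roughly as follows; a minimal sketch assuming the Statistics and Machine Learning Toolbox, with X (the 10,000-by-900 bag-of-words matrix, assumed full) and Y (the labels, 0 = ham, 1 = spam) as placeholder names:

    % kNN baseline: k = 9, Manhattan ('cityblock') distance.
    mdl   = fitcknn(X, Y, 'NumNeighbors', 9, 'Distance', 'cityblock');
    cvmdl = crossval(mdl, 'KFold', 10);   % 10-fold cross-validation
    pred  = kfoldPredict(cvmdl);

    % False positive rate: fraction of ham wrongly flagged as spam.
    cm  = confusionmat(Y, pred);          % rows = true, cols = predicted
    fpr = cm(1, 2) / sum(cm(1, :));
    fprintf('10-fold CV FPR: %.4f\n', fpr);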

To further improve my FPR I tried preprocessing the data with PCA, but that blew up my FPR: a value of 0.08 is not acceptable.
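One thing worth checking before giving up on PCA is how many components are kept. A hedged sketch, where the 95% explained-variance cutoff is an arbitrary assumption to tune:

    % Keep enough principal components to explain 95% of the variance.
    [~, score, ~, ~, explained] = pca(full(X));
    nComp = find(cumsum(explained) >= 95, 1);
    Xred  = score(:, 1:nComp);

    % Re-run the kNN baseline on the reduced representation.
    mdl   = fitcknn(Xred, Y, 'NumNeighbors', 9, 'Distance', 'cityblock');
    cvmdl = crossval(mdl, 'KFold', 10);

Note that fitting PCA on the full dataset before cross-validation leaks information across folds; strictly, the projection should be refit on each training fold.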

Do you have any idea how to tune the dataset to get a better FPR?

Explanation / Answer

You could consider querying the k nearest neighbours and weighting them according to some scheme; there are quite a few in the literature (inverse-distance weighting is a common one). The general idea is to give more relevance to those neighbours lying closer to your sample. It has the effect of regularizing your classifier (smoothing out your decision surface).
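A minimal sketch of this in MATLAB, reusing the placeholder names above; 'inverse' weights each neighbour's vote by 1/d (use 'squaredinverse' for 1/d^2):

    % Distance-weighted kNN: closer neighbours count more in the vote.
    mdl = fitcknn(X, Y, ...
        'NumNeighbors',   9, ...
        'Distance',       'cityblock', ...
        'DistanceWeight', 'inverse');
    cvmdl = crossval(mdl, 'KFold', 10);
    fprintf('Weighted kNN 10-fold CV loss: %.4f\n', kfoldLoss(cvmdl));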

You evaluated your algorithm using 10-fold CV. What is the dispersion of your measurements? If you see wide dispersion, you could use bagging. It is very easy to implement, and it is especially meaningful when overfitting is strong.
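Both steps can be sketched in MATLAB. Per-fold losses give the dispersion directly; for bagging, note that fitcensemble's 'Bag' method expects tree learners, so a small hand-rolled loop is used here (nBags = 25 and the held-out set Xtest are assumptions):

    % Dispersion across folds: a wide spread hints at high variance.
    foldLoss = kfoldLoss(cvmdl, 'Mode', 'individual');
    fprintf('Per-fold loss: mean %.4f, std %.4f\n', ...
            mean(foldLoss), std(foldLoss));

    % Manual bagging: train kNN on bootstrap resamples, majority-vote.
    nBags = 25;
    n     = size(X, 1);
    votes = zeros(size(Xtest, 1), nBags);
    for b = 1:nBags
        idx = randi(n, n, 1);             % bootstrap sample with replacement
        m   = fitcknn(X(idx, :), Y(idx), ...
                      'NumNeighbors', 9, 'Distance', 'cityblock');
        votes(:, b) = predict(m, Xtest);  % assumes numeric labels
    end
    pred = mode(votes, 2);                % majority vote across bags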
