I am working with a dataset (data: emails, size: 650) and want to classify these
Question
I am working with a dataset (data: emails, size: 650) and want to classify these emails into human and automated types. I am aware of the standard classification algorithms (e.g. Naive Bayes), but the issue is that I don't have any kind of tagged data available.
The final classification has to be performed on 6000+ emails on a daily basis. This means that even if I tag the 650 emails manually, they will be a very small sample of the actual data. What would be the best way to approach this problem? Any unsupervised or supervised algorithm that gives an accuracy better than 80% should work.
Explanation / Answer
Because the training set is very small (650 emails) compared with the data to be classified (6000+ emails per day), this is a somewhat challenging case. The accuracy of your predictions depends on the quality of your data, i.e. how representative your training examples are and how your classes are distributed.
Overall, the choice of algorithm also depends on your data and your objectives (including constraints on time and resources). Since I do not have direct access to the data, I suggest the following:
1. Unsupervised algorithms: The k-means clustering algorithm may be useful here because it needs no label information, and clustering is commonly applied to email classification problems. If you have tried k-means and the results are not satisfactory, you can use other clustering algorithms such as spectral clustering, which is better able to capture non-convex structure in the data. (Note that k-nearest neighbour is a supervised classifier rather than a clustering algorithm, although k-nearest-neighbour graphs are often used to build the affinity matrix for spectral clustering.) Further, if training time is not a concern, you can try kernel versions of these algorithms. Details of all of the above algorithms are widely available.
Further, the Apriori algorithm for association rule learning can also be used to learn inter-class and intra-class relationships.
2. Supervised learning: Manually labelling all 650 (or more) emails is not a great idea, since the result would still be a small sample of the daily volume. But if you want to go ahead with this, you can try a Support Vector Machine.
3. Semi-supervised algorithms: I highly recommend using semi-supervised learning. It has been established in the literature that the performance of unsupervised methods can often be improved considerably by adding a few labelled training points. So, if an unsupervised algorithm does not achieve the expected result, you can consider a semi-supervised algorithm such as Transductive Support Vector Machines, where you manually label 10-15% of the data points, which may boost performance. Semi-supervised variants of graph-based methods (for example, those built on k-nearest-neighbour graphs) and of spectral clustering are available as well, and can also be tried.
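To make the semi-supervised idea in point 3 concrete, here is a minimal sketch of self-training with a small multinomial Naive Bayes classifier, using only the Python standard library. Note that this is self-training, a simpler technique than the Transductive SVM mentioned above, chosen only because it fits in a few lines. All function names (`train_nb`, `predict_proba`, `self_train`, `classify`), the 0.9 confidence threshold, and the toy emails are assumptions made for illustration; real emails would need proper tokenisation and a much larger labelled seed.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokeniser (assumption: real emails need more cleaning).
    return text.lower().split()

def train_nb(docs, labels):
    """Collect per-class word counts for a multinomial Naive Bayes model."""
    word_counts, vocab = {}, set()
    doc_counts = Counter(labels)
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def predict_proba(model, text):
    """Return {label: probability}, using add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((counts[tok] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exp.values())
    return {label: v / norm for label, v in exp.items()}

def self_train(labelled, unlabelled, threshold=0.9, max_rounds=5):
    """Repeatedly add confidently predicted unlabelled emails to the training set."""
    docs = [text for text, _ in labelled]
    labels = [label for _, label in labelled]
    pool = list(unlabelled)
    for _ in range(max_rounds):
        model = train_nb(docs, labels)
        remaining, added = [], 0
        for text in pool:
            probs = predict_proba(model, text)
            best = max(probs, key=probs.get)
            if probs[best] >= threshold:
                docs.append(text)      # accept the pseudo-label
                labels.append(best)
                added += 1
            else:
                remaining.append(text)  # too uncertain, keep for a later round
        pool = remaining
        if added == 0 or not pool:
            break
    return train_nb(docs, labels)

def classify(model, text):
    probs = predict_proba(model, text)
    return max(probs, key=probs.get)

# Toy data, invented for illustration only.
labelled = [
    ("hi john are we still meeting for lunch tomorrow", "human"),
    ("thanks for your help with the report yesterday", "human"),
    ("your order has shipped click here to track your package", "automated"),
    ("do not reply this is an automated notification unsubscribe here", "automated"),
]
unlabelled = [
    "automated notification your package has shipped",
    "are we meeting tomorrow thanks john",
    "click here to unsubscribe",
    "lunch yesterday",
]

model = self_train(labelled, unlabelled)
print(classify(model, "this is an automated notification do not reply"))
print(classify(model, "are we still meeting for lunch"))
```

In practice you would hold out part of the manually labelled emails to measure accuracy before trusting the pseudo-labels, since a poorly chosen threshold can reinforce early mistakes.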