
Question

I am new to machine learning, so forgive me if I am doing something absolutely absurd.

I have a classification task (~100 classes) and about 2 million training data points in a 2000-dimensional space. The coordinates of the data points are integers (discrete). Each point has non-zero coordinates in fewer than 10 dimensions; that is, each point can be uniquely defined in a sub-space of fewer than 10 dimensions.

If I use a Gaussian Mixture Model (GMM) for each class, I will end up with ~100 GMMs in a 2000-dimensional space. Given that each point is uniquely definable in fewer than 10 dimensions, I feel there must be a better way of doing this.

What am I missing here?

Explanation / Answer

Since your data are extremely sparse, fitting GMMs or a traditional SVM in the full 2000-dimensional space will tend to over-fit. You should get much better results from methods that exploit the sparse structure directly. Such methods add a penalty term that discourages non-zero coefficients; this is usually referred to as "regularization". Penalizing the count of non-zero values exactly (the L0 norm) is computationally hard, so relaxations are used instead: the L1 norm (lasso) and the L2 norm (ridge).
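To make that concrete (this is a sketch in my own notation, not something from the original answer), the regularized objective has the general form

    \min_w \; \sum_i \ell(y_i, w^\top x_i) + \lambda \, \|w\|_p

where p = 0 counts the non-zero coefficients directly but is intractable to optimize, p = 1 gives the lasso relaxation, and p = 2 (usually squared) gives ridge. Larger lambda means a stronger penalty; in practice only the L0 and L1 penalties actually drive coefficients to zero, while ridge merely shrinks them.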

If you are comfortable with Python, scikit-learn has implementations of several approaches that use either lasso or ridge penalties; one option is a linear SVM with an L1 penalty (see the sketch below). I would try that and see how well it works. It has a parameter that adjusts the amount of regularization, and you can use cross-validation to find the value that works best.
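The original linked example is not reproduced here, so the following is only a minimal sketch of what such a model could look like in scikit-learn: a linear SVM with an L1 penalty (LinearSVC with penalty="l1"), cross-validated over the regularization strength C. The data here is a random placeholder, the parameter grid is an assumption, and X is stored as a SciPy sparse matrix, which is the natural format for points with fewer than 10 non-zero coordinates out of 2000.

    import numpy as np
    import scipy.sparse as sp
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import LinearSVC

    # Toy stand-in for the real data: ~10 non-zero features per row out of 2000,
    # stored sparsely so 2 million rows stay manageable in memory.
    rng = np.random.default_rng(0)
    X = sp.random(10_000, 2000, density=10 / 2000, format="csr", random_state=0)
    y = rng.integers(0, 100, size=10_000)  # ~100 class labels (placeholder)

    # L1-penalized linear SVM; the L1 penalty requires the primal formulation
    # (dual=False) with the squared hinge loss. Multi-class is handled
    # one-vs-rest automatically.
    svm = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, max_iter=5000)

    # C controls the amount of regularization (smaller C = stronger penalty,
    # sparser weights). Cross-validation picks the value that generalizes best.
    grid = GridSearchCV(svm, param_grid={"C": [0.01, 0.1, 1.0, 10.0]}, cv=3, n_jobs=-1)
    grid.fit(X, y)

    print("best C:", grid.best_params_["C"])
    print("cross-validated accuracy:", grid.best_score_)

If the SVM turns out to be too slow on 2 million points, SGDClassifier with penalty="l1" (or "elasticnet") is a scikit-learn alternative that scales better to data of that size.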