Hi,

I'm facing a classification problem like this: given measurements of 18K different variables for 42 samples, where each sample is labeled class_0 or class_1 and the classes are nearly balanced (19 belong to class_0, 23 to class_1), what is the right approach for reducing the features to a minimal set so that the classifier still predicts the correct classes?

I am not providing any domain knowledge for now, but I can give a few more hints if needed.

Comments


HateRedditCantQuitit t1_j2sinhq wrote on January 3, 2023 at 5:31 PM

I think people often see this sort of p >> N data in genetics?

ESL II has a whole chapter on p >> N problems (ch 18) https://hastie.su.domains/ElemStatLearn/

magical_mykhaylo t1_j2sj2fc wrote on January 3, 2023 at 5:34 PM

This is a very general issue, often called "the curse of dimensionality", or the "short and wide" problem. There are a number of ways to address it, which generally fall under the umbrella term of "dimensionality reduction". It's really tricky not to over-fit these types of models, but here are some things you can try:

You can reduce the number of features using Principal Component Analysis (PCA), Independent Component Analysis (ICA), or UMAP. Using PCA or ICA, and speaking in broad terms, you are not training your model on the individual variables themselves, but rather on linear combinations of those variables, which act as "latent variables".
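A minimal sketch of the PCA route, assuming scikit-learn and synthetic stand-in data. The arrays `X` and `y`, the planted signal, the reduced feature count, and the choice of 5 components are all illustrative assumptions, not the OP's data. The scaler and PCA sit inside the pipeline so they are refit on each training fold, which keeps held-out samples out of the projection:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 42 samples (19 vs 23), 2000 features instead of 18K for speed.
rng = np.random.default_rng(0)
n_samples, n_features = 42, 2000
y = np.array([0] * 19 + [1] * 23)
X = rng.normal(size=(n_samples, n_features))
X[:, :10] += 1.5 * y[:, None]  # assumption: only the first 10 features carry signal

# Scale -> project onto 5 principal components -> linear classifier.
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    LogisticRegression(max_iter=1000),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)
print("fold accuracies:", scores)

# The latent variables the classifier actually sees: one row per sample.
latent = pipe[:-1].fit_transform(X)
print("latent shape:", latent.shape)
```

Note that the model now depends on all original variables through the components, so this reduces dimensionality without telling you which measurements you could stop collecting.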

You can select the most relevant features, using feature or variable selection prior to training your algorithm. This can be done in the context of Random Forests using Gini importance (mean decrease in impurity) or any number of other similar metrics.
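One way this could look, as a sketch rather than a recipe: scikit-learn's `SelectFromModel` keeps the features a random forest ranks highest by impurity-based importance. The data is synthetic, and the cutoff of 20 features and the downstream logistic regression are assumptions made for illustration. The selector is placed inside the pipeline so that selection is repeated on each training fold; selecting on all 42 samples first and cross-validating afterwards would give optimistic scores.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in: 42 samples (19 vs 23), 2000 features instead of 18K for speed.
rng = np.random.default_rng(0)
y = np.array([0] * 19 + [1] * 23)
X = rng.normal(size=(42, 2000))
X[:, :10] += 1.5 * y[:, None]  # assumption: only the first 10 features carry signal

# threshold=-inf plus max_features=20 means "keep exactly the 20 most important".
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0),
    threshold=-np.inf,
    max_features=20,
)
pipe = make_pipeline(selector, LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)
print("fold accuracies:", scores)

# Refit on everything to see which feature indices were kept.
pipe.fit(X, y)
selected = np.flatnonzero(pipe[0].get_support())
print("selected feature indices:", selected)
```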

If you are training a linear model, such as Linear Discriminant Analysis (LDA), there are generally higher-dimensional variants that incorporate elastic net regularisation to better handle problems with dimensionality. Look up "sparse regression" for more information. Some of these algorithms also use Partial Least Squares (PLS) as a way around it, but it has fallen out of fashion in most fields.
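As a rough illustration of the sparse-regression idea, here is an elastic-net-penalised logistic regression in scikit-learn (a swapped-in linear model, not an LDA variant). The data is synthetic and the values of `C` and `l1_ratio` are arbitrary assumptions that would normally be tuned by cross-validation. The L1 part of the penalty drives many coefficients to exactly zero, so the surviving features form the selected subset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 42 samples (19 vs 23), 2000 features instead of 18K for speed.
rng = np.random.default_rng(0)
y = np.array([0] * 19 + [1] * 23)
X = rng.normal(size=(42, 2000))
X[:, :10] += 1.5 * y[:, None]  # assumption: only the first 10 features carry signal

# saga is the scikit-learn solver that supports the elastic-net penalty.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        penalty="elasticnet",
        solver="saga",
        l1_ratio=0.5,  # assumed mix of L1 and L2
        C=0.1,         # assumed regularisation strength (smaller = sparser)
        max_iter=5000,
    ),
)
model.fit(X, y)

coef = model[-1].coef_
kept = np.flatnonzero(coef[0])  # indices of features with non-zero weight
n_selected = kept.size
print(f"{n_selected} of {X.shape[1]} features have non-zero coefficients")
```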

If you are building a neural network (generally a bad idea if you have this few samples), you might consider using regularisation coefficients for the hidden layers.

[deleted] t1_j2ryttu wrote on January 3, 2023 at 3:22 PM

[deleted]

qazokkozaq OP t1_j2sfwoj wrote on January 3, 2023 at 5:14 PM

No, I didn't solve the problem, I'm looking for ideas.

xx14Zackxx t1_j2tnl3k wrote on January 3, 2023 at 9:41 PM

Sounds like a good fit for an SVM (support vector machine) to me.
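A minimal sketch of what that might look like with scikit-learn, on synthetic stand-in data. The linear kernel, `C=1.0`, and ranking features by absolute weight are assumptions for illustration, not something stated above:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 42 samples (19 vs 23), 2000 features instead of 18K for speed.
rng = np.random.default_rng(0)
y = np.array([0] * 19 + [1] * 23)
X = rng.normal(size=(42, 2000))
X[:, :10] += 1.5 * y[:, None]  # assumption: only the first 10 features carry signal

# With far more features than samples, a linear kernel is usually enough.
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)
print("fold accuracies:", scores)

# For a linear kernel the fitted weights give a crude feature ranking.
pipe.fit(X, y)
weights = np.abs(pipe[-1].coef_[0])
top20 = np.argsort(weights)[::-1][:20]
print("largest-weight features:", top20)
```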

ResponsibilityNo7189 t1_j2sgk3g wrote on January 3, 2023 at 5:18 PM

Decision trees would help in this precise case, by selecting the right features.

Random forests to improve results.

PassionatePossum t1_j2st7br wrote on January 3, 2023 at 6:37 PM

42 examples is obviously very little to go on. But one way to do it would be to use AdaBoost. You can use the classifier weights to assign feature importance.
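A sketch of that idea with scikit-learn's `AdaBoostClassifier`, whose default weak learner is a depth-1 decision stump, so each boosting round picks a single feature. The data is synthetic and the 50 rounds are an arbitrary assumption. Features with non-zero importance are the ones some stump actually used:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in: 42 samples (19 vs 23), 2000 features instead of 18K for speed.
rng = np.random.default_rng(0)
y = np.array([0] * 19 + [1] * 23)
X = rng.normal(size=(42, 2000))
X[:, :10] += 1.5 * y[:, None]  # assumption: only the first 10 features carry signal

# Default base learner is a decision stump: one feature per boosting round.
booster = AdaBoostClassifier(n_estimators=50, random_state=0)
booster.fit(X, y)

# Importances combine each stump's chosen feature with its classifier weight.
importances = booster.feature_importances_
used = np.flatnonzero(importances)
ranked = used[np.argsort(importances[used])[::-1]]
print(f"{used.size} features used; ranked by importance: {ranked}")
```

At most one feature is chosen per round, so the number of rounds caps the size of the selected subset.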

A very similar problem is addressed in the Viola-Jones face detector (see Viola and Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features", 2001), where they select a couple thousand features out of over 100k.

philipmlong t1_j2svfj3 wrote on January 3, 2023 at 6:50 PM

Having many features can be a blessing if they complement one another; see https://jmlr.csail.mit.edu/papers/v13/helmbold12a.html.