DrXaos

DrXaos t1_j7ig8eq wrote

I guess I don't get your point. The images reflect the phenomenon I suggest.

Look at the younger images. In the smiling & young side there are more relatively high spatial frequency light to dark transitions, interpreted as a higher probability of wrinkles, vs the non-smiling side. I conjecture those contribute to higher age estimation.

3

DrXaos t1_j7ia833 wrote

It's fairly well known that common ML systems for image processing (layers of convolutional networks followed by max-pooling or the like) are more sensitive to texture and less sensitive to larger scale shape and topology than humans.

It's likely that smiling triggered more 'wrinkle' detector units, and the classifier effectively added up the density of these texture detections for age prediction, whereas humans know better where wrinkles from aging vs smiling are placed on the face and compensate.

6

DrXaos t1_ix66waj wrote

The introduction of the paper is explanatory and not particularly technical.

Conventionally, artificial neural networks learn only when old and new data are shuffled together when presented for training. People don’t learn like that: they can concentrate and learn new skills while not forgetting old ones, but conventional neural network algorithms fail to do that. This paper presents a model of sleep in a biologically inspired neural network, in which the sleep-phase algorithm overcomes the problem.

27

DrXaos t1_iw7o3ef wrote

In my typical use, I’ve found that changing random init seeds (and also random seeds for shuffling examples during training, don’t forget that one) in many cases induces a larger variance in performance than many algorithmic or hyperparameter changes. This is most prominent with imbalanced classification, which is often the reality of the valuable problem.

I guess it’s better to be lucky than smart.

Avoiding looking at the results of random init can make you think you’re smarter than you are, and you will tell yourself false stories.
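One crude way to act on this is to repeat every configuration over several seeds and compare the gap between configurations with the spread across seeds. The sketch below assumes you already have one metric value per seed; the function names, the threshold `k`, and the AUC numbers are all made up for illustration, and this is a sanity check rather than a formal significance test.

```python
import statistics

def seed_spread(scores):
    """Mean and sample standard deviation of a metric across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

def exceeds_seed_noise(baseline_scores, variant_scores, k=2.0):
    """True if the gap in means is larger than k times the pooled seed std.

    This is a rough heuristic, not a formal hypothesis test.
    """
    base_mean, base_std = seed_spread(baseline_scores)
    var_mean, var_std = seed_spread(variant_scores)
    pooled = ((base_std ** 2 + var_std ** 2) / 2) ** 0.5
    return abs(var_mean - base_mean) > k * pooled

# Hypothetical AUC values from five seeds per configuration (made-up numbers).
baseline = [0.81, 0.84, 0.79, 0.83, 0.80]
variant = [0.83, 0.82, 0.85, 0.80, 0.84]
print(exceeds_seed_noise(baseline, variant))
```

With these made-up numbers the difference in means is smaller than the seed-to-seed spread, which is exactly the situation where a single lucky run can mislead you.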

1

DrXaos t1_iw04agd wrote

That’s a different scenario and clearly dynamically justified.

Any recurrent neural network is like a nonlinear dynamical system. Learning happens best on the boundary between dissipation and chaos (vanishing vs. exploding gradients).

The additive incorporation of new info in LSTM/GRU greatly ameliorates the usual problem of RNNs with random transition matrices, where perturbations evolve multiplicatively. Initializing an RNN to a zero Lyapunov exponent via an identity transition matrix is helpful.
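To make the multiplicative picture concrete, here is a small sketch that estimates the largest Lyapunov exponent of a purely linear recurrence h -> W h by repeated multiplication and renormalization. The linear map is my simplifying assumption standing in for the linearized dynamics of an RNN; it does not model LSTM/GRU gating, and the sizes and scales are arbitrary.

```python
import math
import random

def lyapunov_estimate(W, steps=200, seed=0):
    """Average log growth rate of a perturbation under repeated h -> W h."""
    rng = random.Random(seed)
    n = len(W)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    log_growth = 0.0
    for _ in range(steps):
        v = [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        log_growth += math.log(norm)
        v = [x / norm for x in v]  # renormalize to avoid overflow/underflow
    return log_growth / steps

def random_matrix(n, scale, seed=0):
    """Gaussian matrix whose spectral radius is roughly `scale`."""
    rng = random.Random(seed)
    std = scale / math.sqrt(n)
    return [[rng.gauss(0.0, std) for _ in range(n)] for _ in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

n = 32
print("identity    :", lyapunov_estimate(identity(n)))            # essentially zero
print("large random:", lyapunov_estimate(random_matrix(n, 3.0)))  # positive: growth
print("small random:", lyapunov_estimate(random_matrix(n, 0.3)))  # negative: decay
```

The identity case sits on the boundary (exponent of zero), while the random matrices land on the exploding or vanishing side depending on their scale.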

1

DrXaos t1_iw03k6k wrote

I’m not entirely convinced it eliminates every random choice. There is usually a permutation symmetry on tabular inputs, and among hidden nodes.

If I’m reading it correctly, then for a single scalar output of a regressor or classifier coming from hiddens or inputs directly (logreg), it would set the coefficient of the first node to 1 and all others to 0, that being a truncated identity.

But what’s so special about that first element? Nothing. The same applies to the Hadamard matrices: it’s making one choice from an arbitrary ordering.

In my opinion, there still could/should be a random permutation of columns on interior weights, and I might init the final linear layer of the classifier to equal but nonzero values like 1/sqrt(Nh), with random sign if the hidden activations are nonnegative like relu or sigmoid rather than symmetric like tanh.

Maybe also random +1/-1 signs times random permutation times identity?

For that matter, any orthogonal rotation also preserves dynamical isometry, and so a random orthogonal before truncated identity should also work as init, and we’re back to an already existing suggested init method.
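A minimal sketch of the two variations suggested above, under my own assumptions about shapes: a truncated identity with randomly permuted columns and random +1/-1 signs, and a readout layer with equal magnitude 1/sqrt(Nh) and optional random signs. The function names are hypothetical and this is not the init from the paper under discussion.

```python
import math
import random

def signed_permuted_identity(n_out, n_in, seed=0):
    """Truncated identity with columns permuted and random +1/-1 signs.

    Each row has at most one nonzero entry and no column is used twice,
    so the nonzero rows stay orthonormal.
    """
    rng = random.Random(seed)
    cols = list(range(n_in))
    rng.shuffle(cols)
    W = [[0.0] * n_in for _ in range(n_out)]
    for row in range(min(n_out, n_in)):
        W[row][cols[row]] = rng.choice((-1.0, 1.0))
    return W

def equal_magnitude_readout(n_hidden, random_sign=True, seed=0):
    """Final-layer weights of magnitude 1/sqrt(n_hidden), optionally with random signs."""
    rng = random.Random(seed)
    mag = 1.0 / math.sqrt(n_hidden)
    if random_sign:
        return [mag * rng.choice((-1.0, 1.0)) for _ in range(n_hidden)]
    return [mag] * n_hidden

W = signed_permuted_identity(3, 5)
for row in W:
    print(row)
print(equal_magnitude_readout(4))
```

Different seeds give different but equally valid choices, which is the point: the ordering of nodes carries no meaning.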

Training for enhanced sparsity is interesting, though.

3

DrXaos t1_iugsjf4 wrote

Not only are Vantagescore and FICO different algorithms, but any given score algorithm may be “calibrated” differently, so the same apparent risk at 650 on one score is at a different number on another score.

The machine learning goal of the models is to rank-order customers by probability of future default by using information in the reports. That is the “predictiveness of the score” goal.

Next, the scores are transformed onto the scale that is actually observed and reported numerically, and there is arbitrary freedom here. Generally they are approximately linear in log risk odds, so that Score = A + B * log(non defaults/defaults). The A and B are arbitrary, and also change in an economic cycle (mostly A). Different calibration means different score for the same predicted risk (predicted ratio of good customers to default customers).
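As a small illustration of the Score = A + B * log(odds) form, the sketch below maps the same predicted good:bad odds through two different calibrations. The A and B values are invented for the example and are not the constants of any real FICO or VantageScore version; I also assume the natural log.

```python
import math

def score_from_odds(good_to_bad_odds, offset, factor):
    """Score = A + B * ln(non-defaults / defaults)."""
    return offset + factor * math.log(good_to_bad_odds)

def odds_from_score(score, offset, factor):
    """Invert the calibration to recover the implied good:bad odds."""
    return math.exp((score - offset) / factor)

# Two hypothetical calibrations (A, B); the numbers are made up.
calibration_1 = (500.0, 50.0)
calibration_2 = (450.0, 60.0)

odds = 20.0  # same predicted risk: 20 non-defaulters per defaulter
print(score_from_odds(odds, *calibration_1))
print(score_from_odds(odds, *calibration_2))
```

The rank ordering of customers is identical under both calibrations; only the number printed for a given level of risk changes.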

No reason that Vantagescore is calibrated the same as any FICO score, and different subversions may also be calibrated differently.

5

DrXaos t1_it0w81o wrote

This is a political advertisement on a construction project: the implication being “we paid for this so be grateful”.

The entire point is promoting the politicians, and the more important they are, the more names they have.

Like Mayor Fiorello LaGuardia airport, with a whole bunch of extra middle names.

132

DrXaos t1_isn4fs5 wrote

Top 1 accuracy is a noisy measurement, particularly if it's a binary 0/1 measurement.

A continuous performance statistic will more likely show the expected behavior of train perf being better than test. Note that for loss functions, lower is better.

There's lots of regularization possible, but start with L2, weight decay, and/or limiting the size of your network.

1

DrXaos t1_isn3k9e wrote

Certainly could be dropout. Dropout is on during training, stochastically perturbing activations in its usual form in packages, and off during test.

Take out dropout, use other regularization and report directly on your optimized loss function, train and test, often NLL if you're using a conventional softmax + CE loss function which is the most common for multinomial outcomes.
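To show why reporting NLL alongside accuracy is informative, here is a sketch that computes both from predicted class probabilities. The two prediction sets are toy values I made up: they get the same top 1 accuracy but quite different NLL.

```python
import math

def mean_nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood; probs[i][k] is P(class k) for example i."""
    total = sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels))
    return -total / len(labels)

def top1_accuracy(probs, labels):
    """Fraction of examples whose highest-probability class is the true label."""
    hits = 0
    for p, y in zip(probs, labels):
        predicted = max(range(len(p)), key=p.__getitem__)
        hits += predicted == y
    return hits / len(labels)

labels = [0, 1]
confident = [[0.90, 0.10], [0.20, 0.80]]  # toy predictions
hesitant = [[0.55, 0.45], [0.45, 0.55]]   # toy predictions

for name, probs in (("confident", confident), ("hesitant", hesitant)):
    print(name, top1_accuracy(probs, labels), round(mean_nll(probs, labels), 4))
```

Computing the same two numbers on train and on test, with dropout switched off for both, gives a like-for-like comparison.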

3

DrXaos t1_ismok0x wrote

In this case the train and test sets probably weren't split stratified by class, and there's an imbalance in the relative proportions of classes and there is some bias in the predictor.

And it's probably measuring top 1 accuracy which isn't the loss function being optimized directly.

  1. do a stratified test/train split
  2. measure more statistics on train vs test
  3. check dropout or other regularization differences
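
For the first step, a dependency-free sketch of a stratified split is below; it shuffles indices within each class so both sides keep roughly the same class proportions. The function name and the toy labels are mine. If you use scikit-learn, `train_test_split(..., stratify=y)` does the same job.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy imbalanced labels: 90 of class "a", 10 of class "b".
labels = ["a"] * 90 + ["b"] * 10
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```
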
2

DrXaos t1_irl3dr7 wrote

It's information theory. If the prior is uniform across the 100 classes (i.e. 1/100) (worst case), it takes -log2(p) = log2(100) bits hypothetically to specify one actual label. Imagine it were 64 labels, then the explicit encoding is obvious, 6 bits. Information theory still works without explicit physical encoding in the appropriate limit. If priors are non-uniform it's even lower. There are 6865 examples. That's all the independent information about the labels which exists.

If you were to write out all the labels in a file, it could be compressed to no less than 45.5k bits if their probability distribution were uniform. So with hypothetically 45.5k bits in arbitrary free params you could memorize the labels. Of course in modeling there are practical constraints and regularization so this doesn't happen at that level but it should give you some pause. I know there are non-classical statistical behaviors with big models like double descent but I'm not sure we're there in this problem.
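The arithmetic behind that estimate can be checked in a few lines. The sketch below uses the 100 classes and 6865 examples from this thread; the function names are mine, and the non-uniform prior is an arbitrary toy example.

```python
import math

def entropy_bits(priors):
    """Shannon entropy of a label distribution, in bits per label."""
    return -sum(p * math.log2(p) for p in priors if p > 0)

def label_information_bits(n_examples, priors):
    """Upper bound on the information content of n independent labels."""
    return n_examples * entropy_bits(priors)

uniform_100 = [1 / 100] * 100
print(entropy_bits(uniform_100))                  # bits per label, about 6.64
print(label_information_bits(6865, uniform_100))  # total bits for all labels

# A toy non-uniform prior: one class takes half the mass, the rest share it.
skewed = [0.5] + [0.5 / 99] * 99
print(entropy_bits(skewed))                       # lower than the uniform case
```

The total comes out close to the roughly 45.5k bits quoted above, and any skew in the class priors only lowers it.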

I think you may be trying to do too much blind modeling without thinking. If you had to classify or cluster the signals by eyeball, what would you look at? Can you start with a linear model? What features would you put in for that? If you're doing something like the MFCC from 'librosa' (as the youtube) there are all sorts of complex time-domain and frequency-domain signal processing parameters in there that will strongly influence the results; I would concentrate on those foremost. As a first cut, instead of going directly to a high-parameter classifier which requires iterative stochastic training, I would use a preliminary but fast-to-compute and (almost) deterministically optimizable criterion to help suggest your input space and signal processing parameters. What about clustering? If you had to do simple clustering in a Euclidean input space, what space would you use? (You could literally program this and measure performance: how many observations are closer to their own class centroid than to someone else's centroid? Or just measure distances if it's not the correct centroid.) Can you optimize to get good performance on that? Once you do that, a high-effort complex classifier like a deep net would have a good head start and would help push performance further.
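The centroid criterion is cheap enough to write out directly. This is a sketch under the assumption that each signal has already been reduced to a fixed-length feature vector; the function names and the four toy points are hypothetical.

```python
import math
from collections import defaultdict

def class_centroids(X, y):
    """Mean feature vector per class."""
    sums = {}
    counts = defaultdict(int)
    for x, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(x)
        sums[label] = [s + v for s, v in zip(sums[label], x)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def nearest_centroid_accuracy(X, y, centroids):
    """Fraction of points whose closest centroid (Euclidean) is their own class."""
    correct = 0
    for x, label in zip(X, y):
        predicted = min(centroids, key=lambda c: math.dist(x, centroids[c]))
        correct += predicted == label
    return correct / len(y)

# Toy feature vectors for two well-separated classes.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
y = [0, 0, 1, 1]
centroids = class_centroids(X, y)
print(nearest_centroid_accuracy(X, y, centroids))
```

Because this score is fast and deterministic, you can sweep the signal processing parameters and watch how it moves before training anything expensive. Fit the centroids on the train split and score on held-out data to keep the comparison honest.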

Or even what would a Naive Bayes model look like? Can you make/select features for that?

Also, one big consideration, often in audio classification there is a time translation invariance, in that the exact moment of the start isn't a physically important parameter; akin to image subset classification with 2-d x-y spatial translational invariance. If that's true then you could do lots of augmentation and make more signals of the same class with some translation operators applied for your train set.
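A minimal sketch of translation augmentation on a 1-D signal follows. I use a circular shift for simplicity, which is an assumption on my part: for real audio you may prefer to pad with silence or crop instead of wrapping around. The function names are hypothetical.

```python
import random

def time_shift(signal, shift):
    """Circularly shift a 1-D sequence by `shift` samples (positive = later)."""
    signal = list(signal)
    if not signal:
        return signal
    shift %= len(signal)
    if shift == 0:
        return signal
    return signal[-shift:] + signal[:-shift]

def augment_with_shifts(signal, n_copies, max_shift, seed=0):
    """Make n_copies randomly shifted versions of the same signal (same class label)."""
    rng = random.Random(seed)
    return [time_shift(signal, rng.randint(-max_shift, max_shift))
            for _ in range(n_copies)]

print(time_shift([1, 2, 3, 4], 1))
for copy in augment_with_shifts([0, 0, 1, 2, 1, 0, 0, 0], n_copies=3, max_shift=2):
    print(copy)
```

Apply this to the train set only, and keep all shifted copies of one recording on the same side of the train/test split.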

Also consider performance measures different from 0/1 accuracy. Is that 'top 1' accuracy? And if the background accuracy is 0.01 (1/100 chance to get it right) then 0.2 might be considered good.

The no-information background performance is making a score proportional to the prior probabilities or maybe logodds thereof. Measure lift above that.

2

DrXaos t1_irjz674 wrote

log2(100) is about 6.64 and with 6865 samples that's 45.5K bits needed to fully encode/memorize the labels. You have way more than that in the effective # of bits in the free parameters. 25 million parameters? I train models on binary classification with 5000 params and a million observations.

You need some feature engineering and simplification of the model.

Are you doing something like this? https://en.wikipedia.org/wiki/Mel-frequency_cepstrum

Your frequency grid might be far too fine and you may need some windowing/filtering processing first. What's the structure of the 1723,13 input?

Given this, is there some sort of informed unsupervised transformation to lower dimensionality you could use before the supervised classifier?

What you're seeing is the limits of purely blind statistical modeling, and since your dataset size isn't so big you'll have to build in some priors about the underlying 'physics' somehow through processing or structuring your model.

2