DrXaos t1_j7ia833 wrote
Reply to comment by keithcody in Researchers tested a large sample of the prominent major AI technologies available today and found not only did they reproduce human biases in the recognition of facial age, but they exaggerated those biases by giuliomagnifico
It's fairly well known that common ML systems for image processing (layers of convolutional networks followed by max-pooling or the like) are more sensitive to texture and less sensitive to larger scale shape and topology than humans.
It's likely that smiling triggered more 'wrinkle' detector units, and the classifier effectively summed up the density of that texture response when predicting age, while humans know where wrinkles from aging versus smiling sit on the face and compensate.
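To illustrate the kind of cue I mean, here's a minimal sketch (my own illustration, not anything from the paper) of a crude high-frequency "texture density" feature; a texture-biased model could end up correlating something like this with age.

```python
import numpy as np
from scipy import ndimage

def texture_density(gray_face: np.ndarray) -> float:
    """Crude proxy for fine-wrinkle texture: mean absolute Laplacian response.

    gray_face: 2-D float array (grayscale face crop).
    """
    high_freq = ndimage.laplace(gray_face)      # emphasizes fine light/dark transitions
    return float(np.mean(np.abs(high_freq)))    # more micro-texture -> larger value

# Toy comparison: a smooth synthetic "face" vs. the same one with fine texture added.
rng = np.random.default_rng(0)
smooth = ndimage.gaussian_filter(rng.random((128, 128)), sigma=8)
textured = smooth + 0.05 * rng.standard_normal((128, 128))  # stand-in for wrinkle-like texture

print(texture_density(smooth), texture_density(textured))   # textured crop scores higher
```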
DrXaos t1_j4w9vav wrote
Reply to comment by tsgiannis in Why a pretrained model returns better accuracy than the implementation from scratch by tsgiannis
The training data for the pre-trained model was likely enormously larger than yours, and that overcomes the distribution shift.
DrXaos t1_j3jb1f2 wrote
Reply to comment by mixer99 in Revolutionary Violence and Counterrevolution ["revolutions involving more violence are less at risk of counterrevolution and that this relationship exists primarily because violence lowers the likelihood of counterrevolutionary success—but not counterrevolutionary emergence"] by i_have_thick_loads
“Ruthless revolutionary murderers kill their opponents and keep power.”
See: Lenin, Castro, Khomeini, Kim
DrXaos t1_ixxkkoa wrote
Reply to comment by littleMAS in Record efficiency of 26.81% for large silicon solar cells by Wagamaga
Photosynthesis in plants is typically around 3% efficient. Solar cells are fighting thermodynamics, so almost 27% is extraordinarily good.
DrXaos t1_ix66waj wrote
Reply to comment by [deleted] in Sleep prevents catastrophic forgetting in spiking neural networks by forming a joint synaptic weight representation by chromoscience
The introduction of the paper is explanatory and not particularly technical.
Conventional artificial neural networks learn well only when old and new data are shuffled together during training. People don't learn like that: they can concentrate on and learn new skills without forgetting old ones, which conventional neural network algorithms fail to do. This paper presents a biologically inspired neural network model in which a sleep-phase algorithm overcomes the problem.
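A minimal sketch of the phenomenon itself (my own toy example, nothing like the paper's spiking model): train a small network on task A, then only on a conflicting task B, and task-A accuracy typically collapses, whereas training on the interleaved data does not.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def blobs(c0, c1, n=200):
    # Two Gaussian blobs, class 0 around c0 and class 1 around c1.
    x = torch.cat([torch.randn(n, 2) + torch.tensor(c0),
                   torch.randn(n, 2) + torch.tensor(c1)])
    y = torch.cat([torch.zeros(n), torch.ones(n)]).long()
    return x, y

def train(model, x, y, epochs=200):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

def accuracy(model, x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

xa, ya = blobs([0.0, 0.0], [0.0, 4.0])   # task A: class 1 "above"
xb, yb = blobs([6.0, 4.0], [6.0, 0.0])   # task B: class 1 "below" (conflicting rule)

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
train(model, xa, ya)
print("task A after A:", accuracy(model, xa, ya))

train(model, xb, yb)                                     # sequential training on B only
print("task A after B:", accuracy(model, xa, ya))        # typically degrades: forgetting

train(model, torch.cat([xa, xb]), torch.cat([ya, yb]))   # shuffled/interleaved data
print("task A after interleaved:", accuracy(model, xa, ya))
```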
DrXaos t1_iw7o3ef wrote
Reply to comment by elcric_krej in [R] ZerO Initialization: Initializing Neural Networks with only Zeros and Ones by hardmaru
In my typical use, I've found that changing the random init seed (and also the random seed for shuffling examples during training, don't forget that one) often induces a larger variance in performance than many algorithmic or hyperparameter changes. Most prominently with imbalanced classification, which is often the reality of the valuable problem.
I guess it’s better to be lucky than smart.
Not looking at the variation across random inits can make you think you're smarter than you are, and you will tell yourself false stories.
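A minimal sketch of what I mean, on a made-up imbalanced problem with scikit-learn (the dataset and settings are illustrative only): the data is fixed, only the seed changes, yet the spread in test AUC can rival a "real" modeling change.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

# Imbalanced toy problem; the data split is fixed, only the model's seed varies.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

aucs = []
for seed in range(10):
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=seed)
    clf.fit(X_tr, y_tr)  # the seed controls both the init and the example shuffling here
    aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

print(f"AUC mean {np.mean(aucs):.3f}, std {np.std(aucs):.3f}, range {max(aucs) - min(aucs):.3f}")
```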
DrXaos t1_iw04agd wrote
Reply to comment by vjb_reddit_scrap in [R] ZerO Initialization: Initializing Neural Networks with only Zeros and Ones by hardmaru
That’s a different scenario and clearly dynamically justified.
Any recurrent neural network is like a nonlinear dynamical system. Learning happens best near the boundary between dissipation and chaos (vanishing vs. exploding gradients).
The additive incorporation of new information in LSTM/GRU cells greatly ameliorates the usual problem of RNNs with random transition matrices, where perturbations evolve multiplicatively. Initializing a plain RNN's transition matrix to the identity, i.e. to a zero Lyapunov exponent, helps for the same reason.
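A minimal PyTorch sketch of that identity-initialization idea (in the spirit of the "IRNN" of Le et al., not the ZerO paper's method): a vanilla ReLU RNN whose hidden-to-hidden matrix starts as the identity, so perturbations initially neither explode nor vanish.

```python
import torch
import torch.nn as nn

hidden = 64
rnn = nn.RNN(input_size=32, hidden_size=hidden, nonlinearity='relu', batch_first=True)

with torch.no_grad():
    nn.init.eye_(rnn.weight_hh_l0)     # identity transition: zero Lyapunov exponent at init
    nn.init.zeros_(rnn.bias_hh_l0)
    nn.init.zeros_(rnn.bias_ih_l0)
    rnn.weight_ih_l0.mul_(0.1)         # small input weights keep early dynamics near the identity

x = torch.randn(8, 100, 32)            # (batch, time, features)
out, h = rnn(x)
print(out.shape, h.shape)              # torch.Size([8, 100, 64]) torch.Size([1, 8, 64])
```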
DrXaos t1_iw03k6k wrote
Reply to comment by elcric_krej in [R] ZerO Initialization: Initializing Neural Networks with only Zeros and Ones by hardmaru
I’m not entirely convinced it eliminates every random choice. There is usually a permutation symmetry on tabular inputs, and among hidden nodes.
If I'm reading it correctly, then for a single scalar output of a regressor or classifier coming directly from hiddens or inputs (logreg), it would set the coefficient of the first node to 1 and all the others to 0, i.e. a truncated identity.
But what's so special about that first element? Nothing. The same applies to the Hadamard matrices: it's making one choice from an arbitrary ordering.
In my opinion, there still could/should be a random permutation of columns on interior weights and I might init the final linear layer of the classifier to equal but nonzero values like 1/sqrt(Nh), and with random sign if hidden activations are nonnegative like relu or sigmoid, instead of symmetric like tanh.
Maybe also random +1/-1 signs times random permutation times identity?
For that matter, any orthogonal rotation also preserves dynamical isometry, so a random orthogonal matrix before the truncated identity should also work as an init, and we're back to an already existing suggested init method.
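A sketch of the variant I'm suggesting (my own idea, not the paper's ZerO recipe): a truncated identity with a random column permutation and random signs for a hidden layer, plus a constant-magnitude 1/sqrt(Nh) random-sign init for the final linear layer feeding from ReLU activations.

```python
import torch
import torch.nn as nn

def signed_perm_identity_(weight: torch.Tensor) -> None:
    """In-place init: random signs * random column permutation * truncated identity."""
    out_f, in_f = weight.shape
    w = torch.zeros(out_f, in_f)
    perm = torch.randperm(in_f)
    signs = torch.randint(0, 2, (min(out_f, in_f),)) * 2 - 1   # random +1/-1
    for i in range(min(out_f, in_f)):
        w[i, perm[i]] = float(signs[i])
    with torch.no_grad():
        weight.copy_(w)

n_in, n_hidden, n_out = 20, 64, 1
model = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU(), nn.Linear(n_hidden, n_out))

signed_perm_identity_(model[0].weight)
nn.init.zeros_(model[0].bias)

with torch.no_grad():
    # Final layer: equal magnitude 1/sqrt(Nh), random sign (hidden activations are nonnegative).
    sign = torch.randint(0, 2, model[2].weight.shape) * 2 - 1
    model[2].weight.copy_(sign * (1.0 / n_hidden ** 0.5))
    model[2].bias.zero_()

print(model[0].weight.abs().sum().item())   # 20.0: one +/-1 entry per used row of the identity
```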
Training for enhanced sparsity is interesting, though.
DrXaos t1_iugsjf4 wrote
Reply to comment by Cruian in How accurate is Credit Karma by CASTePhYRusIO
Not only are Vantagescore and FICO different algorithms, but any given score algorithm may be “calibrated” differently, so the same apparent risk at 650 on one score is at a different number on another score.
The machine learning goal of the models is to rank-order customers by probability of future default by using information in the reports. That is the “predictiveness of the score” goal.
Next, the scores are transformed onto the scale that is actually observed and reported numerically, and there is arbitrary freedom here. Generally they are approximately linear in log risk odds, so that Score = A + B * log(non defaults/defaults). The A and B are arbitrary, and also change in an economic cycle (mostly A). Different calibration means different score for the same predicted risk (predicted ratio of good customers to default customers).
No reason that Vantagescore is calibrated the same as any FICO score, and different subversions may also be calibrated differently.
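A toy illustration of the calibration point (the A and B values are made up, not FICO's or VantageScore's actual constants): the same predicted default risk maps to different reported numbers under different calibrations.

```python
import math

def score(p_default: float, A: float, B: float) -> float:
    """Score = A + B * log(odds of not defaulting); A and B are arbitrary calibration constants."""
    log_odds = math.log((1.0 - p_default) / p_default)
    return A + B * log_odds

p = 0.05  # same predicted default risk under both calibrations
print(round(score(p, A=600, B=30)))   # calibration 1 (made-up constants)
print(round(score(p, A=650, B=20)))   # calibration 2: same risk, different reported number
```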
DrXaos t1_iu5j1mj wrote
They care about cost for certain. Speed and hardware may relate to that.
DrXaos t1_itik4ct wrote
Reply to comment by oxtoacart in Sennie day - to me the goat is the 58X by keynzeev
How is it different from HD600?
I had expected it to be very similar.
DrXaos t1_it0w81o wrote
Reply to comment by SLMZ17 in One of the longest ancient Roman inscriptions ever discovered in Britain is to go on display for the first time. by Demderdemden
This is a political advertisement on a construction project: the implication being “we paid for this so be grateful”.
The entire point is promoting the politicians, and the more important they are, the more names they have.
Like Mayor Fiorello LaGuardia airport, with a whole bunch of extra middle names.
DrXaos t1_isn4fs5 wrote
Reply to comment by redditnit21 in Testing Accuracy higher than Training Accuracy by redditnit21
Top-1 accuracy is a noisy measurement, particularly if it's a binary 0/1 measurement.
A continuous performance statistic is more likely to show the expected behavior of train performance being better than test. Note that for loss functions, lower is better.
There's lots of regularization possible, but start with L2, weight decay, and/or limiting the size of your network.
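Something like this in PyTorch (generic sketch, not your code): add L2 via the optimizer's weight_decay and report the continuous NLL on both splits rather than just top-1 accuracy.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
loss_fn = nn.CrossEntropyLoss()   # softmax + NLL; lower is better

# L2 regularization / weight decay through the optimizer, one common place to start.
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

def nll(model, x, y):
    model.eval()
    with torch.no_grad():
        return loss_fn(model(x), y).item()

# ... after training, compare the continuous statistic on both splits:
# print("train NLL", nll(model, x_train, y_train), "test NLL", nll(model, x_test, y_test))
```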
DrXaos t1_isn3k9e wrote
Reply to comment by redditnit21 in Testing Accuracy higher than Training Accuracy by redditnit21
Certainly could be dropout. Dropout, as usually implemented in packages, is on during training, stochastically perturbing activations, and off during test.
Take out dropout, use other regularization, and report your optimized loss function directly on both train and test, often the NLL if you're using a conventional softmax + CE loss, which is the most common choice for multinomial outcomes.
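A quick sketch of that train/test asymmetry in PyTorch (generic example): in train mode dropout perturbs the activations stochastically, in eval mode it's a no-op.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 20), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(20, 10))
x = torch.randn(4, 20)

model.train()                                           # dropout active: stochastic outputs
print(torch.equal(model(x), model(x)))                  # False: two forward passes differ

model.eval()                                            # dropout off: deterministic at test time
print(torch.equal(model(x), model(x)))                  # True
```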
DrXaos t1_ismok0x wrote
Reply to comment by pornthrowaway42069l in Testing Accuracy higher than Training Accuracy by redditnit21
In this case the train and test sets probably weren't split stratified by class, there's an imbalance in the relative proportions of the classes, and there is some bias in the predictor.
And it's probably measuring top-1 accuracy, which isn't the loss function being optimized directly.
- do a stratified test/train split (see the sketch below)
- measure more statistics on train vs test
- check dropout or other regularization differences
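A sketch of the first bullet with scikit-learn (generic, since I don't know the actual data): stratify the split and then check that the class proportions actually match.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data; y is the label array. stratify=y keeps class proportions equal in both splits.
X = np.random.default_rng(1).normal(size=(1000, 16))
y = np.random.default_rng(0).integers(0, 5, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

for split_name, labels in [("train", y_tr), ("test", y_te)]:
    print(split_name, np.bincount(labels) / len(labels))   # proportions should match closely
```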
DrXaos t1_irl3dr7 wrote
Reply to comment by perfopt in Help regularization and dropout are hurting accuracy by perfopt
It's information theory. If the prior is uniform across the 100 classes (i.e. 1/100, the worst case), it takes -log2(p) = log2(100) ≈ 6.64 bits hypothetically to specify one actual label. Imagine it were 64 labels, then the explicit encoding is obvious: 6 bits. Information theory still works without an explicit physical encoding in the appropriate limit. If the priors are non-uniform it's even lower. There are 6865 examples. That's all the independent information about the labels which exists.
If you were to write out all the labels in a file, it could be compressed to no less than about 45.6k bits if their distribution were uniform. So with hypothetically 45.6k bits in arbitrary free params you could memorize the labels. Of course in modeling there are practical constraints and regularization so this doesn't happen at that level, but it should give you some pause. I know there are non-classical statistical behaviors with big models like double descent, but I'm not sure we're there in this problem.
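The arithmetic, spelled out for the uniform worst case:

```python
import math

n_classes, n_examples = 100, 6865
bits_per_label = math.log2(n_classes)        # ~6.64 bits per label under a uniform prior
total_bits = n_examples * bits_per_label
print(bits_per_label, total_bits)            # ~6.64, ~45,610 bits (~45.6k)
```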
I think you may be trying to do too much blind modeling without thinking. If you had to classify or cluster the signals by eyeball, what would you look at? Can you start with a linear model? What features would you put in for that?

If you're doing something like the MFCC from 'librosa' (as in the youtube), there are all sorts of complex time-domain and frequency-domain signal processing parameters in there that will strongly influence the results; I would concentrate on those foremost.

As a first cut, instead of going directly to a high-parameter classifier which requires iterative stochastic training, I would use a preliminary but fast-to-compute and (almost) deterministically optimizable criterion to help choose your input space and signal processing parameters. What about clustering? If you had to do simple clustering in a Euclidean input space (you could literally program this and measure performance: how many observations are closer to their own class centroid than to some other class's centroid? or just measure distances to the correct centroid), what space would you use? Can you optimize to get good performance on that? Once you do, a high-effort complex classifier like a deep net would have a good head start and could push performance further.
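A sketch of that centroid check with scikit-learn (illustrative only; X here is synthetic, and in practice it would be whatever candidate feature space you're evaluating):

```python
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestCentroid

# Stand-in features; substitute the output of your signal processing / candidate input space.
X, y = make_blobs(n_samples=6865, centers=100, n_features=13, random_state=0)

clf = NearestCentroid()     # fast, deterministic, no iterative stochastic training
clf.fit(X, y)
print("fraction closest to own class centroid:", (clf.predict(X) == y).mean())
```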
Or even what would a Naive Bayes model look like? Can you make/select features for that?
Also, one big consideration: audio classification often has a time-translation invariance, in that the exact moment the sound starts isn't a physically important parameter, akin to 2-D x-y spatial translation invariance in image subset classification. If that's true, you could do lots of augmentation and make more training signals of the same class by applying translation operators.
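A sketch of that kind of augmentation on a raw waveform (numpy only, illustrative; you could equally shift the frame axis of an MFCC matrix):

```python
import numpy as np

def random_time_shift(signal: np.ndarray, max_shift: int, rng: np.random.Generator) -> np.ndarray:
    """Circularly shift a 1-D signal by a random offset; the class label stays the same."""
    shift = rng.integers(-max_shift, max_shift + 1)
    return np.roll(signal, shift)

rng = np.random.default_rng(0)
clip = np.sin(np.linspace(0, 40 * np.pi, 22050))                       # stand-in 1-second clip
augmented = [random_time_shift(clip, max_shift=2205, rng=rng) for _ in range(5)]
print(len(augmented), augmented[0].shape)                              # 5 new training examples
```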
Also consider performance measures different from 0/1 accuracy. Is that 'top 1' accuracy? And if the background accuracy is 0.01 (1/100 chance to get it right) then 0.2 might be considered good.
The no-information baseline is a score proportional to the prior probabilities, or maybe the log-odds thereof. Measure lift above that.
DrXaos t1_irjz674 wrote
log2(100) is about 6.64, and with 6865 samples that's about 45.6K bits needed to fully encode/memorize the labels. You have way more than that in the effective number of bits in the free parameters. 25 million parameters? I train models on binary classification with 5000 params and a million observations.
You need some feature engineering and simplification of the model.
Are you doing something like this? https://en.wikipedia.org/wiki/Mel-frequency_cepstrum
Your frequency grid might be far too fine and you may need some windowing/filtering first. What's the structure of the (1723, 13) input?
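A sketch of where a shape like (1723, 13) could come from with librosa (the parameter values are guesses, not your actual settings): 13 MFCC coefficients per frame, with the frame count set by the clip length and the hop length.

```python
import numpy as np
import librosa

sr = 22050
y = np.sin(2 * np.pi * 440 * np.arange(sr * 40) / sr)    # stand-in: 40 s of audio at 22.05 kHz

# 13 coefficients per frame; frame count ~ 1 + len(y) // hop_length, about 1723 frames here.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=2048, hop_length=512)
print(mfcc.shape)   # (13, n_frames); transpose to (n_frames, 13) for a frames-by-features layout
```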
Given this is there some sort of informed unsupervised transformation to lower dimensionality you could use before the supervised classifier?
What you're seeing are the limits of purely blind statistical modeling; since your dataset isn't that big, you'll have to build in some priors about the underlying 'physics', either through preprocessing or by structuring your model.
DrXaos t1_j7ig8eq wrote
Reply to comment by keithcody in Researchers tested a large sample of the prominent major AI technologies available today and found not only did they reproduce human biases in the recognition of facial age, but they exaggerated those biases by giuliomagnifico
I guess I don't get your point. The images reflect the phenomenon I suggest.
Look at the younger images. On the smiling side there are more relatively high-spatial-frequency light-to-dark transitions, interpreted as a higher probability of wrinkles, than on the non-smiling side. I conjecture those contribute to a higher age estimate.