trajo123
trajo123 t1_j3i3dy1 wrote
Reply to comment by Zyansheep in [R] Greg Yang's work on a rigorous mathematical theory for neural networks by IamTimNguyen
I don't mean applications _of Deep Learning_, I mean what are the applications of this specific theory to real life Deep Learning problems. Can this theory help a Deep Learning practitioner, or is it applied only to proving some abstract bounds on some theoretical, abstracted and simplified neural nets?
trajo123 t1_j3ev6wz wrote
Can anyone ELI5? More specifically, what are the practical applications to Deep Learning problems?
trajo123 t1_j3c3cvf wrote
Reply to comment by trajo123 in Why didn't my convolutional image classifier network learn anything! by AKavun
...let me know if it works any better!
trajo123 t1_j3c38rx wrote
Reply to comment by trajo123 in Why didn't my convolutional image classifier network learn anything! by AKavun
Several things I noticed in your code:
- your model doesn't use any transfer function
- the combination of final activation function and loss function is incorrect
- for a CNN you should be using BatchNorm2d layers
The code should look something like this:
import torch
import torch.nn as nn

class CNNClassifier(nn.Module):
    def __init__(self, input_size, num_classes):
        super(CNNClassifier, self).__init__()
        self.input_size = input_size
        self.num_classes = num_classes
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, stride=1, padding=1)  # increase the number of channels
        self.bn1 = nn.BatchNorm2d(32)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=128, kernel_size=3, stride=1, padding=1)  # in_channels must match the out_channels of conv1; increase the number of channels
        self.bn2 = nn.BatchNorm2d(128)
        self.fc1 = nn.Linear(128, 256)  # note the smaller numbers
        self.fc2 = nn.Linear(256, num_classes)
        self.final_pool = nn.AdaptiveAvgPool2d(1)  # before flatten, you should use AdaptiveMaxPool2d or AdaptiveAvgPool2d to get rid of the spatial dimensions, essentially treating each filter as one feature
        # self.softmax = nn.Softmax(dim=1) - not needed, see below. Also, Softmax is not correct for use with NLLLoss; the correct one would be LogSoftmax(dim=1)
        self.f = nn.ReLU()

    def forward(self, x):
        x = self.conv1(x)
        x = self.pool(x)
        x = self.f(x)  # apply the transfer function
        x = self.bn1(x)  # apply batch norm (this can also be placed before the transfer function)
        x = self.conv2(x)
        x = self.pool(x)
        x = self.f(x)  # apply the transfer function
        x = self.bn2(x)  # apply batch norm (this can also be placed before the transfer function)
        # since you are now using batchnorm, you could add a few more blocks like the one above; vanishing gradients are less of a concern now
        x = self.final_pool(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = self.f(x)  # apply the transfer function, here you could try tanh as well
        x = self.fc2(x)
        # x = self.softmax(x)  # no need for a function here because it is incorporated into the loss function for numerical/computational efficiency reasons
        return x
Also, the loss should be
# criterion = nn.NLLLoss()
criterion = nn.CrossEntropyLoss() # the more natural choice of loss function for classification; actually, for binary classification the more natural choice would be BCEWithLogitsLoss, but then you need to set the number of output units to 1.
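To make the point about the activation/loss pairing concrete, here is a minimal framework-free sketch (plain Python, not PyTorch) of why probabilities from a softmax should not be fed into a negative log-likelihood loss. The helper names and the example logits are made up for illustration; the only assumption is the usual definition that NLL returns the negated entry at the target index and expects log-probabilities.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax via the log-sum-exp trick.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def softmax(logits):
    return [math.exp(lp) for lp in log_softmax(logits)]

def nll_loss(inputs, target):
    # NLL simply negates the entry at the target index,
    # so it expects log-probabilities as input.
    return -inputs[target]

def cross_entropy(logits, target):
    # Cross entropy on raw logits = log-softmax followed by NLL.
    return nll_loss(log_softmax(logits), target)

logits = [2.0, -1.0, 0.5]   # hypothetical raw outputs of the last layer
target = 0

correct = cross_entropy(logits, target)        # positive, equals -log(p_target)
wrong = nll_loss(softmax(logits), target)      # equals -p_target, never reaches 0
```

With plain probabilities the "loss" is just the negated probability, bounded between -1 and 0, which is a much weaker training signal than the log-based loss.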
trajo123 t1_j3busy6 wrote
First of all, the dataset size is way too small to train a model from scratch and get meaningful results on this relatively complex task (more complex than MNIST for example, which has a training set of 60000 images). Second, your model is way too small/simple for this task even if you had 100 times more data. I strongly suggest "Transfer Learning" - fine-tuning a pre-trained model by replacing the classification head, freezing the rest of the model in place and training on your dataset.
Something along these lines:
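import torch.nn as nn
from torchvision import transforms, models
# ...
model = models.swin_b(weights=models.Swin_B_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False  # freeze the pre-trained weights
# torchvision's Swin models expose the classifier as `head`; the new layer is trainable by default
model.head = nn.Linear(model.head.in_features, 1, bias=True)
# ...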
In the pre-trained model documentation you will see what training recipe was used and what transforms were applied to the image. Typically:
transforms.Normalize(
mean=(0.485, 0.456, 0.406),
std=(0.229, 0.224, 0.225),
)
transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BICUBIC)
See more at <https://pytorch.org/vision/stable/models.html#table-of-all-available-classification-weights>. You can also find pre-trained models on HuggingFace / VisionModels.
Hope this helps, good luck!
trajo123 t1_j13gu3z wrote
Reply to comment by techni_24 in [D] Techniques to optimize a model when the loss over the training dataset has a Power Law type curve. by Dartagnjan
Reducing the batch size to 1 can allow you to train a bigger model, allowing you to reach a lower loss on the training set. Note that accumulate_grad_batches takes on the meaning of batch_size when the latter is set to 1.
trajo123 t1_j101qp8 wrote
Reply to [D] Why are we stuck with Python for something that require so much speed and parallelism (neural networks)? by vprokopev
>Why not *just* move to C++ or something new...?
Moving to a different language is never a "just" in any non-trivial organization. With Python you have the option but not the obligation to optimize: you can write slow pythonic code or faster framework-y code. You also have the option to write Python extensions in whatever language you want for performance-critical parts of the code. The latter seems like a much more pragmatic approach to me than completely switching languages.
trajo123 t1_j1001dl wrote
Reply to How to train a model to distinguish images of class 'A' from images of class 'B'. The model can only be trained on images of class 'A'. by 1kay7
Look into OOD (Out of distribution) sample detection. If you go down the auto-encoder route then this paper can give you some pointers: Detecting Out-of-distribution Samples via Variational Auto-encoder with Reliable Uncertainty Estimation
Please note that OOD sample detection is an open problem and active research topic.
trajo123 t1_j0zyd7i wrote
Reply to [D] Techniques to optimize a model when the loss over the training dataset has a Power Law type curve. by Dartagnjan
You are maxing out your GPU RAM even for a batch size of 1? If not, then you can set the batch size to 1 and set accumulate_grad_batches (or whatever that is in your DL framework) to whatever you want your effective batch size to be. https://pytorch-lightning.readthedocs.io/en/stable/advanced/training_tricks.html
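As a rough illustration of why batch size 1 plus gradient accumulation behaves like a larger batch, here is a plain-Python sketch with a one-parameter linear model and a mean squared error loss. The function names and the toy data are hypothetical; in a real framework the accumulation is handled for you by the option mentioned above.

```python
def grad_single(w, x, y):
    # Gradient of the squared error (w*x - y)**2 with respect to w.
    return 2.0 * (w * x - y) * x

def full_batch_grad(w, xs, ys):
    # Gradient of the mean loss over the whole batch at once.
    return sum(grad_single(w, x, y) for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, accumulate_steps):
    # Process one sample at a time (batch size 1), scale each gradient
    # by 1/accumulate_steps and sum them before the optimizer step.
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += grad_single(w, x, y) / accumulate_steps
    return acc

xs = [1.0, 2.0, 3.0, 4.0]   # toy inputs
ys = [2.0, 4.1, 5.9, 8.2]   # toy targets
w = 0.5

g_full = full_batch_grad(w, xs, ys)
g_acc = accumulated_grad(w, xs, ys, accumulate_steps=len(xs))
```

The two gradients agree up to floating point error, so the optimizer step is the same while only one sample has to fit in memory at a time. One caveat: layers that compute statistics over the batch, such as batch norm, still see a batch of 1, so the equivalence does not extend to them.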
Note that your loss will never be 0 unless you run into numerical issues. However, your metric of interest, such as accuracy, F1 score, etc., can still be perfect for the training set even if the loss is not 0. Can you get a perfect score on the training set? If not, then it seems that your model is not big/complex enough for your training data. Actually, this is a good sanity check for your model building and training - being able to get a perfect score on the training set.
Depending on the problem you can also look into focal loss, hard-example mining, etc. But not achieving a perfect score on the training set is not necessarily a bad thing. For instance, if you have mislabelled examples in your training set then you actually want the model to assign a high loss to those. Are you sure your high-loss training examples are labelled correctly?
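For reference, a minimal plain-Python sketch of the binary focal loss mentioned above, in its basic form without the optional class-weighting term. The function name and the example probabilities are illustrative only; `gamma` is the focusing parameter, and `gamma=0` gives back ordinary cross entropy.

```python
import math

def focal_loss(p, y, gamma=2.0):
    # p: predicted probability of the positive class, y: label in {0, 1}.
    # p_t is the probability the model assigns to the true class.
    p_t = p if y == 1 else 1.0 - p
    # The (1 - p_t)**gamma factor shrinks the loss of easy,
    # already well-classified examples.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy_ce = focal_loss(0.9, 1, gamma=0.0)   # plain cross entropy, easy example
hard_ce = focal_loss(0.1, 1, gamma=0.0)   # plain cross entropy, hard example
easy_fl = focal_loss(0.9, 1, gamma=2.0)
hard_fl = focal_loss(0.1, 1, gamma=2.0)
```

Relative to cross entropy, the hard example dominates the total loss much more strongly, which is the intended effect. It also means mislabelled examples get amplified, so the question about label quality matters even more here.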
trajo123 t1_j0l5hwj wrote
Reply to comment by zeyus in [D] Is softmax a good choice for confidence? by thanderrine
You will get some results, for sure. Depending on your application, they may even be good enough. But as a general probability that an image is something other than cat or dog, not really.
As other commenters have mentioned, the general problem is known as OOD (out of distribution) sample detection. There are Deep Learning models which model probabilities explicitly and can in principle be used for OOD sample detection - Variational Autoencoders. The original formulation of this model performs poorly in practice at OOD sample detection, but there is work addressing some shortcomings, for instance Detecting Out-of-distribution Samples via Variational Auto-encoder with Reliable Uncertainty Estimation. But with VAEs things get very mathematical, very fast.
Coming back to your initial question: no, softmax is not appropriate for "confidence", but this is an open problem in Deep Learning.
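A small plain-Python sketch of the underlying issue: softmax distributes all probability mass over the known classes, so there is nothing left for "neither". The logits below are made up, standing in for whatever a cat/dog classifier might output for an unrelated image.

```python
import math

def softmax(logits):
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits [cat, dog] for an image that is neither a cat nor a dog.
ood_logits = [4.0, 0.5]
probs = softmax(ood_logits)
```

The output still sums to 1 and puts about 97% on "cat", simply because one logit happens to be larger than the other. A high softmax value only says which known class wins, not that the input resembles the training data.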
trajo123 t1_j0kullx wrote
Reply to comment by zeyus in [D] Is softmax a good choice for confidence? by thanderrine
Yes, that's an option, but you have absolutely no guarantees about its ability to produce anything meaningful. What images do you introduce in the "other" class? There are infinitely more images falling in the other category than there are cat-or-dog images. For any training set you come up with for the "other" class, the model can still be tested with an image totally different from your training set, and the model output will have no reason whatsoever to favour "other" for the new image.
trajo123 t1_j031wsx wrote
Reply to comment by alkaway in [P] Are probabilities from multi-label image classification networks calibrated? by alkaway
Perhaps scikit-learn's "Probability calibration" section would be a good place to start. Good luck!
trajo123 t1_j029y2r wrote
Reply to comment by alkaway in [P] Are probabilities from multi-label image classification networks calibrated? by alkaway
Ah, yes it doesn't really make sense for more than a couple of classes. So if you can't make your problem multi-class, have you tried any probability calibration on the model outputs? This should make them "more comparable", I think this is the best you can do with a deep learning model.
But why do you want to rank the outputs per pixel? Wouldn't some per-image aggregate over the channels make more sense?
trajo123 t1_j023qfb wrote
Reply to comment by alkaway in [P] Are probabilities from multi-label image classification networks calibrated? by alkaway
You could reformulate your problem to output 4 channels: "only disease A", "only disease B", "both disease A and disease B" and "no disease". This way a softmax can be applied to these outputs, their probabilities summing to 1.
[EDIT] corrected number of classes
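A minimal plain-Python sketch of that reformulation, assuming two binary labels per pixel. The class ordering, the helper names and the example logits are arbitrary choices for illustration.

```python
import math

# Hypothetical ordering of the 4 mutually exclusive classes.
CLASSES = ["no disease", "only disease A", "only disease B", "both A and B"]

def to_joint_class(has_a, has_b):
    # Encode the two binary labels as a single class index 0..3.
    return int(has_a) + 2 * int(has_b)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def marginals(probs):
    # Recover per-disease probabilities from the joint distribution.
    p_a = probs[1] + probs[3]
    p_b = probs[2] + probs[3]
    return p_a, p_b

probs = softmax([0.2, 1.5, -0.3, 0.8])   # made-up 4-channel output for one pixel
p_a, p_b = marginals(probs)
```

Because the four outputs now form one distribution, they are directly comparable to each other, and the per-disease probabilities can still be read off by summing the relevant channels.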
trajo123 t1_izwd9xi wrote
Reply to Getting started with Deep Learning by MightyDuck35
The Coursera Deep Learning specialization is great. It starts with the basics, including a gentle introduction to the intuition behind the maths, then goes on to cover many important application areas. If you like a more structured approach (e.g. assignments, quizzes), then this is for you. It's quite a lot of work, but it will get you from completely clueless to comfortable with most of the concepts and ready to explore the field on your own.
I found the FastAI course too light on details and the Jupyter Notebook based deep learning framework they built abstracts too many details away ...and is yet another (not very popular / used in practice) framework to learn.
trajo123 t1_iymuivu wrote
Reply to Doubt regarding activation functions by Santhosh999
To answer your question concretely: in classification you want your model output to reflect a probability distribution over the classes. If you have only 2 classes this can be achieved with 1 output unit producing values ranging from 0 to 1. If you have more than 2 classes then you need 1 unit per class so that each one produces a value in the (0,1) range and also that the sum of all units adds up to 1 to pass as a probability distribution. In case of 1 output unit the sigmoid function ensures that the output lies in the (0,1) range, and in case of multiple output units softmax ensures the conditions mentioned above. Now, in practice, classification models don't use an explicit activation function after the last layer; instead the loss incorporates the appropriate activation function for efficiency and numerical stability reasons. So in case of binary classification you have two equivalent options:
- use 1 output unit with torch.nn.BCEWithLogitsLoss
>This loss combines a Sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability.
- use 2 output units with torch.nn.CrossEntropyLoss
>This criterion computes the cross entropy loss between input logits and target
Both of these approaches are mathematically equivalent and should produce the same results up to numerical considerations. If you get wildly different predictions, it means you did something wrong.
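The equivalence can be checked numerically without any framework. The sketch below re-implements both losses in plain Python (these are illustrative helpers, not the PyTorch classes themselves): a 2-unit softmax with logits `z0` and `z1` gives the same loss as a single sigmoid unit with logit `z1 - z0`.

```python
import math

def bce_with_logits(z, y):
    # Binary cross entropy on a single raw logit z, label y in {0, 1}.
    # Stable form: max(z, 0) - z*y + log(1 + exp(-|z|)).
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def ce_two_logits(z0, z1, y):
    # Cross entropy on two raw logits, using log-sum-exp for stability.
    m = max(z0, z1)
    log_sum_exp = m + math.log(math.exp(z0 - m) + math.exp(z1 - m))
    return log_sum_exp - (z1 if y == 1 else z0)

# Made-up logit pairs and labels for the comparison.
cases = [(-1.0, 2.0, 1), (-1.0, 2.0, 0), (0.3, 0.3, 1), (5.0, -4.0, 0)]
diffs = [abs(ce_two_logits(z0, z1, y) - bce_with_logits(z1 - z0, y))
         for z0, z1, y in cases]
```

All differences are at the level of floating point noise, which is what "the same results up to numerical considerations" means in practice.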
On another note, using accuracy when looking at credit card fraud detection is not a good idea because the dataset is most likely highly unbalanced. Probably more than 99% of the data samples are labelled as "not fraud". In this case, having a stupid model always produce "not fraud" regardless of input will already give you 99% accuracy. You may want to look into metrics for unbalanced datasets, e.g. F1 score, false positive rate, false negative rate, etc.
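A quick plain-Python illustration with made-up numbers (990 "not fraud" and 10 "fraud" samples) of how the always-"not fraud" model scores. The helper functions are written out by hand here; in practice a metrics library would provide them.

```python
def confusion(y_true, y_pred):
    # Count true/false positives and negatives for binary labels.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion(y_true, y_pred)
    return (tp + tn) / len(y_true)

def f1_score(y_true, y_pred):
    tp, fp, fn, tn = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    # Convention used here: F1 is 0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0

y_true = [0] * 990 + [1] * 10    # 1% fraud
y_pred = [0] * 1000              # model that always says "not fraud"

acc = accuracy(y_true, y_pred)   # 0.99
f1 = f1_score(y_true, y_pred)    # 0.0
```

The accuracy looks excellent while the F1 score shows that not a single fraud case was caught.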
Have fun on your (deep) learning journey!
trajo123 t1_iylxuu0 wrote
Come on, I can't believe that Schmidhuber wasn't picked as one of the "combatants"!
trajo123 t1_iyciqd9 wrote
For users, Nvidia's monopoly on ML/DL compute acceleration is quite expensive. People replying with "don't bother, just use Nvidia&CUDA" only make the problem worse ...music to Nvidia's ears.
I would say, by all means try it out and share your experience, just be aware that it's likely going to be more hassle than using Nvidia&CUDA.
trajo123 t1_j3lwv4f wrote
Reply to comment by AKavun in Why didn't my convolutional image classifier network learn anything! by AKavun
> 90% accuracy in my test
Looking at accuracy can be misleading if your dataset is imbalanced. Let's say 90% of your data is labelled as False and only 10% of your data is labelled as True, so even a model that doesn't look at the input at all and just predicts False all the time will have 90% accuracy. A better metric for binary classification is the F1 score, but that also depends on where you set the decision threshold (the default is 0.5, but you can change that to adjust the confusion matrix). Perhaps the most useful metric to see how much your model learned is the Area under the ROC curve aka ROC_AUC score (where 0.5 is the same as random guessing and 1 is a perfect classifier).
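To make the ROC AUC numbers concrete, here is a plain-Python sketch that uses the ranking interpretation of AUC: the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counting as one half. The function name, data and scores are made up for illustration.

```python
def roc_auc(y_true, scores):
    # Compare every positive score against every negative score.
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

y_true = [0] * 9 + [1]                 # 90% False, 10% True

constant_scores = [0.0] * 10           # model that ignores the input
perfect_scores = [0.1] * 9 + [0.9]     # model that ranks the positive on top

auc_constant = roc_auc(y_true, constant_scores)   # 0.5
auc_perfect = roc_auc(y_true, perfect_scores)     # 1.0
```

The input-ignoring model that reaches 90% accuracy on this data gets an AUC of 0.5, the same as random guessing, and the score does not depend on any decision threshold.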