trajo123 t1_j3lwv4f wrote

> 90% accuracy in my test

Looking at accuracy can be misleading if your dataset is imbalanced. Say 90% of your data is labelled False and only 10% is labelled True: then even a model that doesn't look at the input at all and just predicts False all the time will have 90% accuracy. A better metric for binary classification is the F1 score, but that also depends on where you set the decision threshold (the default is 0.5, but you can change it to adjust the confusion matrix). Perhaps the most useful metric for seeing how much your model has learned is the area under the ROC curve, a.k.a. the ROC AUC score (where 0.5 is the same as random guessing and 1 is a perfect classifier).
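
For instance, with scikit-learn (a quick sketch; `y_true` holds the ground-truth labels and `y_score` the model's predicted probabilities, both hypothetical names):

    from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

    y_pred = (y_score >= 0.5).astype(int)    # default decision threshold of 0.5
    print(accuracy_score(y_true, y_pred))    # misleading on imbalanced data
    print(f1_score(y_true, y_pred))          # depends on the chosen threshold
    print(roc_auc_score(y_true, y_score))    # threshold-free: 0.5 = random, 1.0 = perfect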

1

trajo123 t1_j3i3dy1 wrote

I don't mean applications _of Deep Learning_, I mean applications of this specific theory to real-life Deep Learning problems. Can this theory help a Deep Learning practitioner, or does it only apply to proving abstract bounds on simplified, theoretical neural nets?

0

trajo123 t1_j3c38rx wrote

Several things I noticed in your code:

  • your model doesn't use any transfer (activation) function
  • the combination of final activation function and loss function is incorrect
  • for CNNs you should be using BatchNorm2d layers

The code should look something like this:

    import torch
    import torch.nn as nn

    class CNNClassifier(nn.Module):
        def __init__(self, input_size, num_classes):
            super(CNNClassifier, self).__init__()
            self.input_size = input_size
            self.num_classes = num_classes
            self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, stride=1, padding=1)  # increase the number of channels
            self.bn1 = nn.BatchNorm2d(32)
            self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
            self.conv2 = nn.Conv2d(in_channels=32, out_channels=128, kernel_size=3, stride=1, padding=1)  # in_channels must match conv1's out_channels
            self.bn2 = nn.BatchNorm2d(128)
            self.fc1 = nn.Linear(128, 256)  # note the smaller numbers
            self.fc2 = nn.Linear(256, num_classes)
            self.final_pool = nn.AdaptiveAvgPool2d(1)  # before flatten, use AdaptiveMaxPool2d or AdaptiveAvgPool2d to get rid of the spatial dimensions, essentially treating each filter as one feature
            # self.softmax = nn.Softmax(dim=1) - not needed, see below. Also, Softmax is not correct for use with NLLLoss; the correct one would be LogSoftmax(dim=1)
            self.f = nn.ReLU()

        def forward(self, x):
            x = self.conv1(x)
            x = self.pool(x)
            x = self.f(x)  # apply the transfer function
            x = self.bn1(x)  # apply batch norm (this can also be placed before the transfer function)

            x = self.conv2(x)
            x = self.pool(x)
            x = self.f(x)  # apply the transfer function
            x = self.bn2(x)  # apply batch norm (this can also be placed before the transfer function)

            # since you are now using batchnorm, you could add a few more blocks like the one above; vanishing gradients are less of a concern now

            x = self.final_pool(x)
            x = torch.flatten(x, 1)
            x = self.fc1(x)
            x = self.f(x)  # apply the transfer function; here you could try tanh as well
            x = self.fc2(x)
            # x = self.softmax(x)  # not needed: softmax is incorporated into the loss function for numerical/computational efficiency reasons
            return x

Also, the loss should be

    # criterion = nn.NLLLoss()
    criterion = nn.CrossEntropyLoss()

CrossEntropyLoss is the more natural choice of loss function for classification. Actually, for binary classification the more natural choice would be BCEWithLogitsLoss, but then you need to set the number of output units to 1.

1

trajo123 t1_j3busy6 wrote

First of all, the dataset size is way too small to train a model from scratch and get meaningful results on this relatively complex task (more complex than MNIST, for example, which has a training set of 60,000 images). Second, your model is way too small/simple for this task, even if you had 100 times more data. I strongly suggest "Transfer Learning": fine-tuning a pre-trained model by replacing the classification head, freezing the rest of the model in place, and training on your dataset (see the freezing sketch below).

Something along these lines:

    from torchvision import transforms, models
    import torch.nn as nn

    # ...

    model = models.swin_b(weights=models.Swin_B_Weights.IMAGENET1K_V1)
    model.head = nn.Linear(model.head.in_features, 1, bias=True)  # replace the classification head (for swin_b it is model.head)
    # ...
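
The freezing part mentioned above could look something like this (a sketch; the backbone is frozen first so that only the newly added head gets trained):

    import torch

    for param in model.parameters():
        param.requires_grad = False  # freeze the whole pre-trained backbone

    model.head = nn.Linear(model.head.in_features, 1, bias=True)  # a fresh head is trainable by default
    optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-3)  # optimize only the head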

In the pre-trained model's documentation you will see what training recipe was used and what transforms were applied to the images. Typically:

    transforms.Normalize(
        mean=(0.485, 0.456, 0.406),
        std=(0.229, 0.224, 0.225),
    )

    transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BICUBIC)

See more at <https://pytorch.org/vision/stable/models.html#table-of-all-available-classification-weights>. You can also find pre-trained vision models on HuggingFace.

Hope this helps, good luck!

3

trajo123 t1_j101qp8 wrote

>Why not *just* move to C++ or something new...?

Moving to a different language is never a "just" in any non-trivial organization. With Python you have the option, but not the obligation, to optimize: you can write slow pythonic code or faster framework-y code. You also have the option to write Python extensions in whatever language you want for performance-critical parts of the code. The latter seems like a much more pragmatic approach to me than completely switching languages.

1

trajo123 t1_j1001dl wrote

Look into OOD (out-of-distribution) sample detection. If you go down the auto-encoder route, then this paper can give you some pointers: "Detecting Out-of-distribution Samples via Variational Auto-encoder with Reliable Uncertainty Estimation".
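
As a rough illustration of the basic auto-encoder idea (not the paper's method; `autoencoder`, `x` and `threshold` are hypothetical names): train the auto-encoder on in-distribution data only, then flag inputs that it reconstructs poorly.

    import torch

    with torch.no_grad():
        recon_error = ((autoencoder(x) - x) ** 2).mean(dim=(1, 2, 3))  # per-sample MSE over an image batch
    is_ood = recon_error > threshold  # threshold picked from in-distribution validation errors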

Please note that OOD sample detection is an open problem and active research topic.

9

trajo123 t1_j0zyd7i wrote

Are you maxing out your GPU RAM even with a batch size of 1? If not, you can set the batch size to 1 and set accumulate_grad_batches (or whatever it is called in your DL framework) to whatever you want your effective batch size to be. https://pytorch-lightning.readthedocs.io/en/stable/advanced/training_tricks.html
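
For reference, this is roughly what gradient accumulation does under the hood in plain PyTorch (a sketch; `model`, `criterion`, `optimizer` and `loader` are assumed to exist):

    accum_steps = 8  # effective batch size = 8 * per-step batch size
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(loader):
        loss = criterion(model(inputs), targets)
        (loss / accum_steps).backward()  # scale so the accumulated gradient matches one big batch
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()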

Note that your loss will never be 0 unless you run into numerical issues. However, your metric of interest, such as accuracy or F1 score, can still be perfect on the training set even if the loss is not 0. Can you get a perfect score on the training set? If not, it seems that your model is not big/complex enough for your training data. Actually, this is a good sanity check for your model building and training: being able to get a perfect score on the training set.

Depending on the problem you can also look into focal loss, hard-example mining, etc. But not achieving a perfect score on the training set is not necessarily a bad thing. For instance, if you have mislabelled examples in your training set then you actually want the model to assign a high loss to those. Are you sure your high-loss training examples are labelled correctly?

7

trajo123 t1_j0l5hwj wrote

You will get some results, for sure. Depending on your application they may even be good enough. But as a general probability that an image is something other than a cat or a dog, not really.

As other commenters have mentioned, the general problem is known as OOD (out-of-distribution) sample detection. There are deep learning models which model probabilities explicitly and can in principle be used for OOD sample detection: Variational Autoencoders. The original formulation of this model performs poorly in practice at OOD sample detection, but there is work addressing some of its shortcomings, for instance "Detecting Out-of-distribution Samples via Variational Auto-encoder with Reliable Uncertainty Estimation". But with VAEs things get very mathematical, very fast.

Coming back to your initial question: no, softmax is not appropriate as a measure of "confidence", but this is an open problem in Deep Learning.

1

trajo123 t1_j0kullx wrote

Yes, that's an option, but you have absolutely no guarantees about its ability to produce anything meaningful. What images do you put in the "other" class? There are infinitely more images falling into the "other" category than there are cat-or-dog images. For any training set you come up with for the "other" class, the model can still be tested with an image totally different from your training set, and the model output will have no reason whatsoever to favour "other" for the new image.

4

trajo123 t1_j029y2r wrote

Ah, yes, it doesn't really make sense for more than a couple of classes. So if you can't make your problem multi-class, have you tried any probability calibration on the model outputs? This should make them "more comparable"; I think this is the best you can do with a deep learning model.
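
One simple calibration option is temperature scaling, fitted on held-out data (a minimal sketch; `logits` and `labels` are hypothetical validation tensors, and for per-pixel outputs you would flatten the spatial dimensions first):

    import torch
    import torch.nn.functional as F

    T = torch.ones(1, requires_grad=True)  # temperature parameter
    optimizer = torch.optim.LBFGS([T], lr=0.01, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / T, labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    probs = F.softmax(logits / T.detach(), dim=1)  # calibrated probabilities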

But why do you want to rank the outputs per pixel? Wouldn't some per-image aggregate over the channels make more sense?

3

trajo123 t1_izwd9xi wrote

The Coursera Deep Learning specialization is great. It starts with the basics, including a gentle introduction to the intuition behind the maths, then goes on to cover many important application areas. If you like a more structured approach (e.g. assignments, quizzes), then this is for you. It's quite a lot of work, but it will get you from completely clueless to comfortable with most of the concepts and ready to explore the field on your own.

I found the FastAI course too light on details, and the Jupyter-Notebook-based deep learning framework they built abstracts too many details away ...and it is yet another framework to learn, one not widely used in practice.

3

trajo123 t1_iymuivu wrote

To answer your question concretely: in classification you want your model output to reflect a probability distribution over the classes. If you have only 2 classes, this can be achieved with 1 output unit producing values ranging from 0 to 1. If you have more than 2 classes, then you need 1 unit per class, so that each one produces a value in the (0, 1) range and the sum over all units adds up to 1, to qualify as a probability distribution. In the case of 1 output unit, the sigmoid function ensures that the output is in (0, 1); in the case of multiple output units, softmax ensures the conditions mentioned above.

Now, in practice, classification models don't use an explicit activation function after the last layer; instead, the loss incorporates the appropriate activation function for efficiency and numerical stability reasons. So in the case of binary classification you have two equivalent options:

  • use 1 output unit with torch.nn.BCEWithLogitsLoss

>This loss combines a Sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability.

  • use 2 output units with torch.nn.CrossEntropyLoss

>This criterion computes the cross entropy loss between input logits and target

Both of these approaches are mathematically equivalent and should produce the same results up to numerical considerations. If you get wildly different predictions, it means you did something wrong.
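
A minimal sketch of the two equivalent setups (random tensors just to show the expected shapes and dtypes):

    import torch
    import torch.nn as nn

    logits1 = torch.randn(8, 1)                      # 1 output unit per sample
    targets1 = torch.randint(0, 2, (8, 1)).float()   # float targets in {0, 1}
    loss_bce = nn.BCEWithLogitsLoss()(logits1, targets1)

    logits2 = torch.randn(8, 2)                      # 2 output units per sample
    targets2 = torch.randint(0, 2, (8,))             # integer class indices
    loss_ce = nn.CrossEntropyLoss()(logits2, targets2)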

On another note, using accuracy when looking at credit card fraud detection is not a good idea because the dataset is most likely highly unbalanced. Probably more than 99% of the data samples are labelled as "not fraud". In this case, having a stupid model always produce "not fraud" regardless of input will already give you 99% accuracy. You may want to look into metrics for unbalanced datasets, e.g. F1 score, false positive rate, false negative rate, etc.

Have fun on your (deep) learning journey!

2

trajo123 t1_iyciqd9 wrote

For users, Nvidia's near-monopoly on ML/DL compute acceleration is quite costly. People replying with "don't bother, just use Nvidia&CUDA" only make the problem worse ...music to Nvidia's ears.

I would say, by all means try it out and share your experience, just be aware that it's likely going to be more hassle than using Nvidia&CUDA.

33