Rediggo t1_itjua9p wrote on October 24, 2022 at 4:16 AM

It sounds like you are using a somewhat high-level interface for training. If that is the case, I can only help with these three points:

1 when asking for help, try to provide some details about the specific implementation (for example: are you using Hugging Face? PyTorch Lightning? Something else? Did you check that the usual suspects are not causing any trouble?)

2 it is important to read the documentation for the tool you are using. Are you sure the training didn't stop because the loss wasn't improving and that's the default behavior?

3 Stack Overflow usually has most of the basic questions solved. If your question was already asked by someone like 6 years ago and it has no replies, then it's probably just a mistake solvable by reading the docs a little bit (but that last point is just my experience)
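To make point 2 concrete, here is a minimal sketch of what "stop because the loss wasn't improving" typically means. It is plain Python, not the API of any particular library: `EarlyStopping`, `should_stop`, `patience`, `min_delta` and `run` are hypothetical names made up for illustration, and whether your tool does this by default is something only its docs can tell you.

```python
class EarlyStopping:
    """Hypothetical helper: signal a stop once the monitored loss has
    failed to improve for `patience` consecutive evaluations."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # evaluations to wait without improvement
        self.min_delta = min_delta    # smallest decrease that counts as improvement
        self.best = float("inf")      # best (lowest) loss seen so far
        self.bad_evals = 0            # consecutive evaluations without improvement

    def should_stop(self, loss):
        if loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best = loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


def run(losses, patience=3):
    """Feed one validation loss per epoch; return the epoch (1-based)
    at which training would stop, or None if it never stops early."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return None
```

If your tool has something like this switched on, a run that ends "early" without any error is expected behavior and not a crash, so that setting is worth checking first.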

Good luck with your project :)

BarcaStranger OP t1_itk5u1y wrote on October 24, 2022 at 6:30 AM

During evaluation at epoch 18 it got stuck at 99/100 and kept running without errors, that's why I want to retrain from the checkpoint