Submitted by BarcaStranger t3_yc13da in MachineLearning
Rediggo t1_itjua9p wrote
It sounds like you are using a fairly high-level interface for training. If that is the case, I can only help with these three points:
1 When asking for help, try to provide some details about the specific implementation (for example: are you using Hugging Face? PyTorch Lightning? Something else? Have you checked that the usual suspects are not causing trouble?)
2 It is important to read the documentation for the tool you are using. Are you sure training didn't stop simply because the loss was no longer improving, and stopping in that case is the default behavior?
3 Stack Overflow usually has most of the basic questions solved. If your question was already asked by someone about 6 years ago and it has no replies, then it's probably just a mistake that can be solved by reading the docs a little more (but that last point is just my experience)
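To make point 2 concrete, here is a rough sketch of how a patience-based early-stopping rule typically behaves. It is not taken from any particular library; the function name `early_stop_epoch` and the parameters `patience` and `min_delta` are assumptions chosen for illustration, and the tool you are using may use different names and defaults.

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None.

    Hypothetical helper: stops once the validation loss has failed to
    improve by more than `min_delta` for `patience` epochs in a row.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # New best loss: reset the counter of non-improving epochs.
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None


# The loss improves until epoch 1, then stalls for three epochs in a row.
losses = [1.0, 0.9, 0.95, 0.92, 0.91, 0.5]
stopped_at = early_stop_epoch(losses, patience=3)
```

If a rule like this is active by default, a run can end before the configured number of epochs without raising any error, so it is worth checking the docs for it.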
Good luck with your project :)
BarcaStranger OP t1_itk5u1y wrote
During evaluation at epoch 18 it got stuck at 99/100 and kept running without errors; that's why I want to retrain from the checkpoint
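For reference, resuming from a checkpoint generally means saving the epoch number together with the training state, then starting the loop at the next epoch. Below is a minimal, framework-free sketch of that idea. The helpers `save_checkpoint`, `load_checkpoint`, and `train`, the JSON file format, and the dummy `weight` state are all assumptions for illustration; a real framework stores model and optimizer state in its own format and has its own resume option.

```python
import json
import os
import tempfile


def save_checkpoint(path, epoch, state):
    # Write to a temporary file first so a crash cannot leave a
    # half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    # No checkpoint yet: start from epoch 0 with a fresh state.
    if not os.path.exists(path):
        return 0, {"weight": 0.0}
    with open(path) as f:
        ckpt = json.load(f)
    # Resume at the epoch after the last one that finished.
    return ckpt["epoch"] + 1, ckpt["state"]


def train(path, total_epochs, stop_after=None):
    start_epoch, state = load_checkpoint(path)
    for epoch in range(start_epoch, total_epochs):
        state["weight"] += 1.0  # stand-in for one epoch of training
        save_checkpoint(path, epoch, state)
        if stop_after is not None and epoch == stop_after:
            break  # simulate an interrupted run
    return state


ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
first = train(ckpt_path, total_epochs=20, stop_after=17)  # run is cut short
resumed = train(ckpt_path, total_epochs=20)               # picks up at epoch 18
```

The second call does not repeat the first 18 epochs; it only runs the remaining ones, which is the behavior you would want after killing a hung evaluation.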