[D] Switching optimizers for faster convergence

Well, talking about a "global optimum" in deep learning is delicate. We'll probably never find a global optimum for large datasets and large models.

Anyway, in 2020 I would suggest using RAdam, or Ranger (RAdam + Lookahead); I've had great results with RAdam. To answer your question, I wouldn't be concerned about that specifically. There are a thousand reasons a training run can fail when you change something as important as the optimizer in the middle of training: maybe your new optimizer has bad hyperparameters, maybe it overfits less so you need less data augmentation, maybe your parameters needed the original optimizer's momentum to keep making progress, etc.
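For reference, here's roughly what dropping in RAdam looks like, assuming a PyTorch version recent enough to ship torch.optim.RAdam (Ranger itself usually comes from a third-party package); the model and hyperparameters below are just placeholders:

```python
import torch
import torch.nn as nn

# Minimal sketch: using RAdam where you'd normally use Adam.
# Assumes torch.optim.RAdam is available (older PyTorch versions
# need a standalone radam/ranger package instead).
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

# Toy data standing in for a real dataloader.
x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))

for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```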

It's not common practice to change the optimizer during training because you lose important state (momentum buffers, adaptive learning-rate statistics), and usually an LR scheduler is easier to manipulate and has a similar kind of impact, but maybe there are situations I'm not aware of where switching the optimizer mid-training is better.
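To make the "lose important state" point concrete, here's a toy sketch of the scheduler alternative (arbitrary numbers): the scheduler reshapes the learning rate on top of the same optimizer, so the momentum/second-moment buffers are preserved, whereas constructing a new optimizer would start those buffers from zero.

```python
import torch
import torch.nn as nn

# Keep one optimizer for the whole run and let a scheduler change the LR,
# instead of swapping optimizers mid-training and discarding their state.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))

for epoch in range(100):
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()
    scheduler.step()  # adjusts the LR; the optimizer's buffers stay intact
```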

/r/MachineLearning Thread