Submitted by joossss t3_107c95i in MachineLearning
deephugs t1_j3n3qwj wrote
Reply to comment by rlvsdlvsml in [D] Deep Learning Training Server by joossss
I think Ray is great! But Ray will not click your GPUs into a motherboard, install Linux on all the machines, set up nvidia-docker, power-cycle them if there are issues, periodically clear up space on HDDs, etc. It's the non-software part of cluster management that ends up being the most annoying and time-consuming.
rlvsdlvsml t1_j3nd87h wrote
I have always felt that the network/security work and integration with internal IT systems were worse than the physical maintenance. People should expect to invest time in integrating with an on-prem data center environment or in physical maintenance. I think small teams benefit more from a small GPU cluster with a fixed budget than from large cloud GPU training costs. Mid-size to large companies do better with cloud than on-prem because they get better separation of environments, though cloud costs more.