Submitted by joossss t3_107c95i in MachineLearning
deephugs t1_j3mt7re wrote
Reply to comment by TrueBirch in [D] Deep Learning Training Server by joossss
Cloud is almost always better imo. At small scale you can prototype faster and spend less time messing with hardware by using cloud services. Once you actually need to scale your product, a cloud solution makes that really easy. The "but it's cheaper" argument gets less valid every year, and it often doesn't account for the time and effort spent setting up a local cluster.
rlvsdlvsml t1_j3n2it2 wrote
If you use Ray you can set up a GPU cluster in less than 30 min.
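A minimal sketch of what that looks like, assuming Ray is already installed on every node and the nodes can reach each other over the network (`train_shard` is just a hypothetical placeholder task, not part of Ray):

```python
import ray

# On the head node (shell):    ray start --head --port=6379
# On each worker node (shell): ray start --address='<head-ip>:6379'

# Connect this driver script to the running cluster.
ray.init(address="auto")

@ray.remote(num_gpus=1)
def train_shard(shard_id):
    # Hypothetical placeholder for real training work on one GPU.
    import torch
    return shard_id, torch.cuda.get_device_name(0)

# Fan out one task per GPU currently visible to the cluster.
num_gpus = int(ray.cluster_resources().get("GPU", 0))
results = ray.get([train_shard.remote(i) for i in range(num_gpus)])
print(results)
```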
deephugs t1_j3n3qwj wrote
I think Ray is great! But Ray will not click your GPUs into a motherboard, install Linux on all the machines, set up nvidia-docker, power cycle when there are issues, periodically clear space on the HDDs, etc. It's the non-software part of cluster management that ends up being the most annoying and time-consuming.
rlvsdlvsml t1_j3nd87h wrote
I have always felt like network/security and integration with internal IT systems was worse than the physical maintenance. People should expect to invest time either in integrating with an on-prem data center environment or in the physical maintenance. I think small teams are better served by a small GPU cluster with a fixed budget than by large cloud GPU training costs. Mid-to-large companies do better with cloud than on-prem because they get better separation of environments, but it costs more.
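A back-of-envelope sketch of the fixed-budget argument, with purely hypothetical prices and utilization (plug in your own numbers):

```python
# Hypothetical assumptions -- none of these figures come from the thread.
GPU_COUNT = 4
ON_PREM_COST = 4 * 2500 + 4000     # 4 GPUs plus chassis/PSU/storage, one-time
CLOUD_RATE_PER_GPU_HR = 1.50       # on-demand price per GPU-hour
TRAIN_HOURS_PER_MONTH = 300        # training hours per GPU per month

cloud_monthly = GPU_COUNT * CLOUD_RATE_PER_GPU_HR * TRAIN_HOURS_PER_MONTH
breakeven_months = ON_PREM_COST / cloud_monthly
print(f"cloud: ${cloud_monthly:.0f}/mo, on-prem pays off after ~{breakeven_months:.1f} months")
```

With steady utilization the box pays for itself within a year or so; with bursty or low utilization the cloud wins, which is roughly the small-team vs. mid-large-company split above.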