JackBlemming t1_j3997nh wrote

Couple thoughts:

  1. Setting up an environment is typically harder than cloning the repo and running pip install on the requirements.txt file. Many Python packages require certain Linux packages to be installed first, and your service should ideally take care of that for me. Obvious examples are OpenCV, CUDA/GPU drivers, MySQL clients, etc. (a Dockerfile sketch of what I mean follows this list).

  2. Dataset management is the most annoying part of machine learning for me, not setting up environments, which is typically a Dockerfile or docker-compose file and maybe one shell script to bootstrap everything. By dataset management I mean letting my models access the dataset quickly, updating the dataset, and so on. Ideally your service should make it easy to upload data and then make it accessible to the training code. This is assuming you want to allow people to train models on the service.
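
For illustration, here is a minimal sketch of the kind of Dockerfile point 1 is about, assuming a hypothetical project that needs OpenCV and a MySQL client; the base image tag and package names are examples, not recommendations:

```dockerfile
# Start from a CUDA base image so GPU/driver plumbing is already handled.
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

# System packages that common pip packages silently assume are present:
# libgl1/libglib2.0-0 for opencv-python, MySQL client headers for mysqlclient.
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip \
        libgl1 libglib2.0-0 \
        default-libmysqlclient-dev build-essential \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY . .
CMD ["python3", "train.py"]
```

A service that infers that apt layer from requirements.txt is exactly the value-add being asked for here.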

34

jrmylee OP t1_j39ara0 wrote

  1. Great point, we have this covered. We intelligently install apt dependencies alongside pip dependencies. CUDA drivers are also all installed properly.
  2. This makes sense. If I understand you correctly, the difficult parts are easily uploading/managing datasets on the server, plus writing data loaders to feed the data into the model?
5

JackBlemming t1_j39baqq wrote

Per 2, yes, exactly right. Some of my datasets are millions of images with metadata. As you can imagine, uploading and consuming data at that scale is slow and tedious, and so is integrating it with the remote machine actually running the training script.
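
To make the "consuming" half concrete, this is roughly the loader-side code that has to sit between a mounted dataset and the model. It's a sketch under assumed conventions (a metadata.jsonl file plus an images/ directory, both made up for the example), using PyTorch:

```python
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset, DataLoader

class ImageMetadataDataset(Dataset):
    """Images plus per-image metadata from a mounted dataset directory.

    Assumed (illustrative) layout:
        <root>/metadata.jsonl   # one JSON record per image: {"id": ..., "label": ...}
        <root>/images/<id>.jpg
    """

    def __init__(self, root, transform=None):
        self.root = Path(root)
        self.transform = transform
        # With millions of images, keep only lightweight metadata in memory
        # and open image files lazily in __getitem__.
        with open(self.root / "metadata.jsonl") as f:
            self.records = [json.loads(line) for line in f]

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        record = self.records[idx]
        image = Image.open(self.root / "images" / f"{record['id']}.jpg").convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, record["label"]

# Throughput depends on the mount more than the loader: parallel workers
# amortize per-file latency against remote storage.
loader = DataLoader(ImageMetadataDataset("/mnt/dataset"), batch_size=64,
                    num_workers=8, shuffle=True)
```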

7

jrmylee OP t1_j39dpr0 wrote

Got it, appreciate the feedback!

4

i_ikhatri t1_j3xlleh wrote

Just to add to this feedback (because I think /u/JackBlemming is 100% correct): you would probably benefit from storing some of the most popular datasets (ImageNet, MS COCO, whatever is relevant to the fields you're targeting) somewhere in the cloud where you can provide fast read access (or fast copies) to any number of training workers that get spun up.

Research datasets tend to be fairly standardized, so I think you could get broad coverage just by having a few common datasets available. I only gave computer vision examples because that's what I'm most familiar with, but with a few CV datasets, a few NLP ones, etc., you should be able to provide a killer UX.

Bonus points if you can somehow get the repos to read from the centralized datastore automatically (though this is probably difficult/impossible).
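
Even without full automation, a well-known read-only mount plus stock torchvision code gets most of the way there. A minimal sketch, assuming a hypothetical DATASETS_ROOT environment variable and mount path (neither is an existing convention):

```python
import os

import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Hypothetical: the service mounts curated datasets read-only at a
# well-known location and advertises it via an environment variable.
DATASETS_ROOT = os.environ.get("DATASETS_ROOT", "/mnt/shared-datasets")

# Stock torchvision datasets only need `root`/`annFile` pointed at the
# shared copy; no per-user download or upload step.
coco = datasets.CocoDetection(
    root=os.path.join(DATASETS_ROOT, "coco/train2017"),
    annFile=os.path.join(DATASETS_ROOT, "coco/annotations/instances_train2017.json"),
    transform=transforms.ToTensor(),
)
```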

2

the__itis t1_j3dogz7 wrote

Also, if libraries aren't 100% native, compiling them for different architectures is a big complication.

1