Submitted by angkhandelwal749 t3_10lxwgd in MachineLearning

Versioning and collaboration on code is a reasonably solved problem for software engineers through GitHub, since the task mostly involves maintaining different versions of plain code. ML engineers, on the other hand, face the huge task of versioning not just code but also hyperparameters, data, models, data lineage and labels, and storing all of this on GitHub does not let you track the changes to each of them well. What software/open-source tools are currently used for this? Is there space for a new company to be built here?

18

Comments

Delicious-View-8688 t1_j60s6lt wrote

git for versioning code

dvc for versioning data (and other ML things)

mlflow for managing ml pipelines (overlaps with some parts of dvc)

conda for environment management (yes, it can be slow...)
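To make the idea behind this stack a bit more concrete, here is a rough, stdlib-only sketch of what "versioning data and params alongside code" boils down to: hash the data file by content, then store that hash next to the code revision and the hyperparameters. This is not the dvc or mlflow API; `file_digest`, `snapshot_run`, `run_id`, the SHA-256 choice and the `abc1234` revision string are all made up for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot_run(data_path: Path, params: dict, code_rev: str) -> dict:
    """Tie one code revision, one dataset state and one set of params together."""
    return {
        "code_rev": code_rev,  # assumed to come from e.g. `git rev-parse HEAD`
        "data_sha256": file_digest(data_path),
        "params": params,
    }


def run_id(record: dict) -> str:
    """Derive a short, stable ID from the record (sorted keys keep it deterministic)."""
    blob = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "train.csv"  # hypothetical dataset
        data.write_text("x,y\n1,2\n")
        rec = snapshot_run(data, {"lr": 0.01, "epochs": 5}, code_rev="abc1234")
        print(run_id(rec))
        print(json.dumps(rec, indent=2))
```

If the data file changes, the digest and therefore the run ID change too, which is the property that lets you tell two otherwise identical runs apart. The real tools add remote storage, caching and a UI on top of this.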

13

metric_logger t1_j61xa7a wrote

Comet.ml does everything you listed! Free for individuals!

2

Vivid-Ad6077 t1_j62n4k1 wrote

https://wandb.ai/site - Weights & Biases does everything you listed, from versioning code, datasets and models to visualizing experiments, managing hyperparameters and even running hyperparameter searches. It can be used to fully reproduce and recreate the entire state of your ML workflow. It's free for individuals and academics.
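For anyone unsure what "hyperparameter search with tracked runs" means in practice, here is a minimal plain-Python sketch of a grid sweep that records one entry per run. It does not use the wandb API; `grid`, `sweep` and `toy_objective` are hypothetical names, and the objective is a stand-in for a real training run.

```python
import itertools
import json


def grid(space: dict) -> list:
    """Expand {'lr': [...], 'depth': [...]} into a list of config dicts."""
    keys = sorted(space)
    combos = itertools.product(*(space[k] for k in keys))
    return [dict(zip(keys, values)) for values in combos]


def sweep(space: dict, objective) -> list:
    """Evaluate every config and return run records, lowest loss first."""
    runs = [
        {"run": i, "config": cfg, "loss": objective(cfg)}
        for i, cfg in enumerate(grid(space))
    ]
    return sorted(runs, key=lambda r: r["loss"])


def toy_objective(cfg: dict) -> float:
    # Hypothetical stand-in for training: smallest at lr=0.1, depth=3.
    return (cfg["lr"] - 0.1) ** 2 + (cfg["depth"] - 3) ** 2


if __name__ == "__main__":
    results = sweep({"lr": [0.01, 0.1, 1.0], "depth": [2, 3, 4]}, toy_objective)
    print(json.dumps(results[0]))  # best run with its config and loss
```

A hosted tracker stores these per-run records for you, adds plots, and can pick the next config more cleverly than an exhaustive grid.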

2

conv3d t1_j62o8yn wrote

I can’t believe nobody has mentioned MLflow

2

angkhandelwal749 OP t1_j62urqo wrote

>https://adataanalyst.com/wp-content/uploads/2021/05/Infra-Tooling3.png

Understood! Thanks so much for that. I also wanted to understand the core thinking process of an ML engineer: what do they prioritise when choosing a tool? User experience or service? Lots of features, or just a few quality features done well?

1

Dry-Tomatillo449 t1_j6mnkh3 wrote

GitLab is an alternative to GitHub for hosting ML projects and code, with an open-source Community Edition and a free tier. It's used by many organizations for software development, data analysis, and machine learning. It offers a wide range of features, including an integrated CI/CD pipeline, version control, issue tracking, and project management. GitLab also supports Jupyter Notebooks and data science projects.

1