chatterbox272 t1_j40jame wrote on January 12, 2023 at 9:29 AM

Not publishing the dataset is becoming less common as we start inching our way slowly to reproducible science. Public code with public data is the simplest form of reproducible research, where we can re-run your experiments with the same code and should get the same result (modulo some extremely low-level randomness or hardware differences that we may not be able to control).

That alone isn't enough to kill a paper, but it doesn't help. As another commenter said, showing your approach on public datasets and other approaches on your dataset will help, as it gives the rest of the community something that is reproducible.

It's more common in medical venues for a few reasons:

Difficulties around safely releasing medical data, such as proper anonymisation and informed consent.
It is more common in medical science to go for a higher level of reproducibility, where the same or a similar study will be done on a different population (i.e. same method, different data). This is pretty uncommon in ML, where it's hard to get papers accepted in this format.

Insighteous t1_j430uuc wrote on January 12, 2023 at 8:47 PM

Publishing everything is a good thing. At the moment I am trying to reproduce some results of a paper and have to work with "we created X datasets by three methods". And NOWHERE in the paper is it stated what these three methods are. Also no code.

It is so annoying. I cannot put it into words.

newperson77777777 OP t1_j437zvq wrote on January 12, 2023 at 9:30 PM

Thanks for your perspective.