Submitted by SuchOccasion457 t3_11bvmia in MachineLearning

Say one wanted to model how much getting access to data would cost, how should one go about that? If labeling costs for say CIFAR10 are known with SageMaker and Google Cloud, what is the cost of getting the data in the first place?

Furthermore, say we move into the space of medical images, e.g. MRI scans. What is the cost of getting MRI scans with a given disease? Where do I even find such information?

4

Comments

PassionatePossum t1_ja0llsg wrote

The data is always the most expensive part. I work in the medical device industry and it strongly depends on the type of data and how much effort it is for the physicians to collect it.

In the simplest case you can just run a recording device while they are doing their procedures. But of course it rarely is that simple: You need to be careful not to capture any data that can be used to personally identify the patient (and the definition of personally identifying information is - at least in Europe - extremely wide).

The next question is: Do you need any lab data as ground truth? If the answer is "yes", it will create a lot of effort for the physician because he/she cannot simply record the data. They will have to keep track of the patients, recordings and diagnoses and annotate them later accordingly.

Another thing to keep in mind is: In many cases you cannot just connect a non-certified device to a medical device. You often need special recording hardware that is medically certified. That is probably mostly the case for surgical devices. The rules for MRI images might be more relaxed. I don't know.

As a rough guideline you can expect to pay physicians around 200€ / hour (in the U.S. likely even more than that). And as I said: How much data you get for that strongly depends on the type of data that you collect.

5

SuchOccasion457 OP t1_ja0nas2 wrote

thank you very much for the elaborate response! are you aware of any literature, even if informal, that goes into details about the whole process with any type of rough estimates?

2

PassionatePossum t1_ja0sovs wrote

Sorry, I am not aware of any literature. The contractual stuff is mostly handled between our lawyers and the hospital's. I'm not really involved in all of that.

And I'm not even sure that you'll find a one-size-fits-all answer. The requirements for a collaboration can vary wildly. A private practice usually has much more flexibility when it comes to technical infrastructure. In a hospital, the IT department usually wants to know when you are planning to connect stuff to their systems. But there are certain diseases that you are very unlikely to see in a private practice.

I would do the following: Once you know what kind of data you need, talk to physicians to understand their workflow. Then make a proposal how to collect the data and talk it through with them or their lawyers. If there is a potential problem with the plan, they'll tell you.

Once you know the workflow, you'll probably also have an idea how long it will take for them to collect the data you are looking for and from there you can make an educated guess how much it is going to cost you.

The rest is up for negotiation. As far as I know we have contracts that have built-in safeguards for both the physicians and us. They get a fixed price for a fixed number of hours they work for us. And they guarantee a certain minimum number of recorded procedures. If they can deliver more in the allotted time, even better.
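
To make the "educated guess" concrete, here is a rough sketch of how such a contract could be turned into a per-recording cost. It assumes the only inputs are an hourly rate, a fixed number of contracted hours, a guaranteed minimum number of recordings, and an optional lump sum for everything else (certified hardware, legal fees). The class name, field names, and the 40 hours / 100 recordings in the usage lines are made-up placeholders; only the 200€ / hour figure comes from the rough guideline given earlier in this thread.

```python
from dataclasses import dataclass


@dataclass
class CollectionContract:
    """Hypothetical model of a fixed-price data collection contract."""

    hourly_rate_eur: float      # what the physician is paid per hour
    contracted_hours: float     # fixed number of hours in the contract
    min_recordings: int         # guaranteed minimum number of recordings
    overhead_eur: float = 0.0   # assumed lump sum for hardware, legal, etc.

    def __post_init__(self) -> None:
        if self.min_recordings <= 0:
            raise ValueError("min_recordings must be positive")

    def total_cost(self) -> float:
        # Fixed price: paid regardless of how many recordings are delivered.
        return self.hourly_rate_eur * self.contracted_hours + self.overhead_eur

    def worst_case_cost_per_recording(self) -> float:
        # Upper bound per recording: only the guaranteed minimum is delivered.
        return self.total_cost() / self.min_recordings

    def cost_per_recording(self, delivered: int) -> float:
        # Delivering more than the minimum lowers the per-recording cost.
        return self.total_cost() / max(delivered, self.min_recordings)


# Placeholder numbers, not real quotes.
contract = CollectionContract(
    hourly_rate_eur=200.0, contracted_hours=40.0, min_recordings=100
)
print(contract.total_cost())
print(contract.worst_case_cost_per_recording())
print(contract.cost_per_recording(delivered=160))
```

The guaranteed minimum is what gives you an upper bound on the per-recording cost before any data exists; anything delivered on top of it only pushes the number down.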

2

jobeta t1_ja08xy1 wrote

I don’t think there is a general answer to that. For labeling there are multiple services that you can use. You could just contact them and ask or look if they advertise how much they pay people to label to get a proxy. For the data itself, it completely depends on the data. I would imagine medical data would be hard to obtain and require some legal consideration around privacy (at least I would hope so).

3

SuchOccasion457 OP t1_ja0c7x9 wrote

have you seen anyone selling datasets? I found one webpage that openly lists prices, everything else seems rather closed :( Everyone does per-user pricing

1

jobeta t1_ja0m0og wrote

Yes, but just pick two or three and ask? Also check Amazon Mechanical Turk to see if you can find labeling jobs listed and their rates. I have only needed this once, and I used Upwork. We paid well and it was a while ago, so I don’t think the price I would give you would be a good reference.

1

bubudumbdumb t1_ja414tz wrote

My sweet summer child, MRI data is medical data. The only way you can have that is by having patients (being a clinic or a hospital) and making sure they are ok with you labeling the data and using it for training models. Medical data is very, very sensitive and very protected. You probably won't be able to use third-party labeling services, as you might be required to keep the data on your own infrastructure. Of course all of this depends on jurisdiction and you should consult lawyers.

1

SuchOccasion457 OP t1_ja4rn7n wrote

thank you for this! am not trying to get hold of such data, but rather trying to understand how one would even approach modeling associated costs. people usually talk about labeling services, but nobody mentions the costs for actually getting the data itself. just looking for a reference to quote an order of magnitude ...

1