Submitted by lattecoffeegirl t3_101hi2v in MachineLearning

So I will directly dive into the problem setting and will later describe the background:

data: survival data of ~3000 patients with several clinical and lab(blood) parameters.

question: does one of the parameters have any influence on the survival time?

what I have done so far: a non-proportional multivariate hazards model (Cox regression)

problem: highly correlated variables, strong time interaction, and some variables that are strongly non-normally distributed (even after transformation)

QUESTION: Is there a machine learning / AI solution for this problem question?

background: I am a PhD student in medicine and have worked through the mathematics intensively with my colleagues. But we only had one "old-fashioned" statistics professor, whose answer to our problem was: "seems like your data isn't good enough, and you can't explore anything there, because it's far too complex." We first want to get an intuition for whether our theoretical findings can be confirmed in the data before we plan a new study. I reformulated our problem a bit: we are not dealing with the death of patients, but with the time to a specific event.

I am really grateful for any ideas, any sources to look at, and anything else that could help 😊😊 Thanks in advance!

14

Comments

PredictorX1 t1_j2nk8sa wrote

My recollection (feel free to correct me) is that statistical survival models are (can be?) created as a series of logistic regressions, one for each of several forecast horizons. One could keep that same structure, substituting any classifier (induced decision tree; neural network, ...) which produces an estimated probability for those logistic regressions.
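A minimal sketch of the structure described above, assuming the usual "person-period" layout for discrete-time survival models: each patient contributes one row per forecast horizon they were observed through, labelled 1 only in the interval where the event happened. The function name `expand_person_period` and the tuple layout are illustrative choices, not from any library. Any classifier that outputs a probability could then be trained on these rows, with the interval index as an extra feature.

```python
def expand_person_period(records, horizons):
    """Turn survival records into rows for a per-interval classifier.

    records  : list of (time, event, features), event is 1 if observed, 0 if censored
    horizons : sorted interval end points, e.g. [1, 2, 3] (same unit as time)
    returns  : list of (features + [interval_index], label)
    """
    rows = []
    for time, event, features in records:
        start = 0.0
        for k, end in enumerate(horizons):
            if time <= start:
                break  # patient is no longer at risk in this interval
            had_event = bool(event) and time <= end
            if (not event) and time < end:
                break  # censored inside the interval: outcome unknown, skip it
            rows.append((list(features) + [k], int(had_event)))
            if had_event:
                break  # nothing to predict after the event
            start = end
    return rows


# Hypothetical toy data: one event at t=2.5, one patient censored at t=1.5.
rows = expand_person_period([(2.5, 1, [0.3]), (1.5, 0, [0.7])], [1, 2, 3])
for features, label in rows:
    print(features, label)
```

The predicted probability for a row is then an estimate of the hazard in that interval, given survival up to its start.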

3

Worth-Advance-1232 t1_j2nvwv9 wrote

If I understood correctly, rather than actually predicting survival time, your goal is to see whether certain parameters have an influence on the survival time. Besides the other things already mentioned, one approach might be to predict survival time with a black-box algorithm and then try to explain the algorithm's decisions. One paper that covers this topic would be this one.

3

khashitk t1_j2ojgc3 wrote

There are a lot of ways to tackle this problem. But it would really help if you could share a data sample. How many samples do you have? How many attributes are you studying? And so on. The options range from a simple decision tree to a new experimental algorithm.

1

lattecoffeegirl OP t1_j2onc3y wrote

You are right, I want to know whether our "new" parameter is a good predictor or not.

1

sitmo t1_j2oz219 wrote

Whatever you do to make a model, I would create benchmark datasets in which you randomly shuffle the survival info between patients. Then, whatever model fitting and testing you do, also do it on these randomised datasets. This will give you good insight into the statistical significance of anything you find in your data.
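A rough sketch of that shuffling idea as a permutation test, using only the standard library. The statistic here is a simple concordance index between one parameter and the event times; that choice, the function names, and the default of 200 shuffles are assumptions for illustration, and any other fit measure could be substituted. The (time, event) pairs are shuffled together so that the censoring pattern stays intact.

```python
import random


def concordance(values, times, events):
    """Share of comparable pairs in which the higher value has the earlier event."""
    concordant, comparable = 0.0, 0
    n = len(values)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the "earlier event" of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if values[i] > values[j]:
                    concordant += 1.0
                elif values[i] == values[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else 0.5


def permutation_p_value(values, times, events, n_perm=200, seed=0):
    """Compare the observed statistic with the same statistic on shuffled outcomes."""
    rng = random.Random(seed)
    observed = abs(concordance(values, times, events) - 0.5)
    outcomes = list(zip(times, events))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(outcomes)  # reassign survival info between patients
        t, e = zip(*outcomes)
        if abs(concordance(values, t, e) - 0.5) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data in which the parameter fully determines the event time.
values = list(range(20))
times = [20 - v for v in values]
events = [1] * 20
print(concordance(values, times, events), permutation_p_value(values, times, events))
```

A small p-value means the shuffled datasets rarely reach the observed level of association; it says nothing yet about the direction of cause and effect.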

2

Biggzlar t1_j2p5q4p wrote

I recently finished Pearl's "Book of Why" and iirc BNNs do not capture causal relationships between variables. So finding a strong association between a lab value and years of survival does not necessarily indicate that one led to the other.

In other words, a strongly associated lab or clinical value might actually be the result of a longer survival (or of other confounding factors) and not the other way around.

I'm not an expert at all, but this question sounds exactly like a problem for causal inference as described in the book. Of course, there is also the issue that only observational data seems to be available, so CI may not actually be possible. Maybe it's worth checking out this entry point to the subject.

6

DJ_laundry_list t1_j2q6fzs wrote

When you say "has any influence", I'm assuming you mean causal influence, rather than just being correlated with a particular outcome. This puts us in the domain of causal inference. I suggest you go through a causal inference tutorial or two to get some domain knowledge. See https://causalinference.gitlab.io/kdd-tutorial/ and https://economics.mit.edu/sites/default/files/inline-files/causal_tutorial_1.pdf. Econometric modeling revolves heavily around this, so you're probably going to find more sources that are econometric rather than medical.

My personal approach: Train an xgboost model (or really any ML model) using the appropriate hazard function and Bayesian optimization for hyperparameter tuning, then compare the log-likelihood of the model with the parameter at its actual values vs. at counterfactual values. If the counterfactual values provide a similar fit, you're looking at something that is not likely causal.
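One possible reading of that comparison, sketched with the standard library only. It assumes the fit is measured by the Cox partial log-likelihood and that the "counterfactual" values are produced by shuffling the parameter's column between patients; both are assumptions, since the comment does not specify either. `risk_score` is a hypothetical stand-in for the trained model's predicted log-risk, and ties in event times are not treated specially.

```python
import math
import random


def cox_partial_loglik(scores, times, events):
    """Cox partial log-likelihood of log-risk scores (no special tie handling)."""
    total = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored patients only appear in risk sets
        risk_set = sum(math.exp(scores[j]) for j in range(len(times)) if times[j] >= t_i)
        total += scores[i] - math.log(risk_set)
    return total


def loglik_gap(risk_score, rows, times, events, column, seed=0):
    """Fit with actual values minus fit with the column shuffled between patients."""
    rng = random.Random(seed)
    actual = cox_partial_loglik([risk_score(r) for r in rows], times, events)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)  # assumed way of building counterfactual values
    swapped = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    counterfactual = cox_partial_loglik([risk_score(r) for r in swapped], times, events)
    return actual - counterfactual


# Hypothetical toy data: a higher value of the single feature means an earlier event.
rows = [[float(i)] for i in range(10)]
times = [10 - i for i in range(10)]
events = [1] * 10
print(loglik_gap(lambda r: r[0], rows, times, events, column=0))
```

A gap near zero would match the "similar fit" case in the comment; repeating the shuffle with several seeds gives a sense of how much the gap varies by chance.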

3

hopsauces t1_j2r0n39 wrote

The ML methods people are talking about are mostly focused on prediction. The collinearity and confounding problems you describe will mess with those methods too, though it'll be harder to recognize and diagnose. "Feature importance" measures are generally as vague as the name implies. You're doing it the right way; just keep working within that feedback loop of analyzing your data and improving your model.

2

khashitk t1_j2r4vrk wrote

The sample would help a lot. You have a good number of entries. So there could be quite a few ways to go.

Edit: Give "weka" a try; it's free to use and has a lot of options. It's great for getting a good overall view of the direction the project should take.

2