Submitted by lattecoffeegirl t3_101hi2v in MachineLearning

I'll dive straight into the problem setting and give the background afterwards:

data: survival data of ~3,000 patients with several clinical and lab (blood) parameters.

question: does one of the parameters have any influence on the survival time?

what I have done so far: a multivariable Cox regression (with non-proportional hazards). Problems: highly correlated variables, strong time interactions, and some variables that are far from normally distributed (even after transformation).
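For concreteness, a minimal sketch of fitting such a model and testing the proportional-hazards assumption, e.g. with the Python lifelines package (file and column names are placeholders):

```python
# Minimal sketch with the lifelines package; file/column names are placeholders.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("patients.csv")  # duration, event indicator, ~10 predictors

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()

# Schoenfeld-residual-based checks of the proportional-hazards assumption;
# prints per-covariate advice (stratification, time-varying terms, ...).
cph.check_assumptions(df, p_value_threshold=0.05)
```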

QUESTION: Is there a machine learning / AI approach to this problem?

background: I am a PhD student in medicine and studied mathematics intensively together with my colleagues. But we only had one "old-fashioned" statistics professor, whose answer to our problem was: "Seems like your data isn't good enough, and you can't discover anything there, because it's far too complex." We first want to get an intuition of whether our theoretical findings are supported by the data before we plan a new study. I have reformulated our problem slightly: we are not dealing with the death of patients, but with the time to a specific event.

I am really grateful for any ideas, any sources to look at, and anything else that could help 😊😊 Thanks in advance!

14

Comments


avocado-bison t1_j2nmiff wrote

Bayesian networks spring to mind, given that you want to identify what contributes to the event.
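A minimal sketch of learning such a network with the pgmpy package (purely an illustration; it assumes the variables have been discretized, since this score-based learner works on discrete data):

```python
# Sketch only: assumes predictors and the outcome have been discretized
# (e.g., binned lab values), since BIC-scored structure learning is discrete.
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore

df = pd.read_csv("patients_discretized.csv")  # placeholder file

# Greedy hill-climbing over candidate network structures, scored by BIC.
search = HillClimbSearch(df)
model = search.estimate(scoring_method=BicScore(df))

# Edges indicate statistical dependencies, not necessarily causal ones.
print(sorted(model.edges()))
```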

7

Biggzlar t1_j2p5q4p wrote

I recently finished Pearl's "Book of Why", and iirc Bayesian networks do not by themselves capture causal relationships between variables. So finding a strong association between a lab value and years of survival does not necessarily indicate that one caused the other.

In other words, a strongly associated lab or clinical value might actually be the result of longer survival (or of other confounding factors), not the other way around.

I'm not an expert at all, but this question sounds exactly like a problem for causal inference as described in the book. Of course, there is also the issue that only observational data seems to be available, so causal inference may not actually be possible. Maybe it's worth checking out this entry point to the subject.
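If one did attempt it, a minimal sketch with the dowhy package might look like this (all names are placeholders, and the assumed confounders would have to come from domain knowledge, which is exactly where this can go wrong):

```python
# Sketch only: variable names are placeholders; the confounder list encodes
# untestable domain assumptions, which is the crux of causal inference.
import pandas as pd
from dowhy import CausalModel

df = pd.read_csv("patients.csv")  # placeholder file

model = CausalModel(
    data=df,
    treatment="new_marker",              # hypothetical lab parameter
    outcome="survival_time",
    common_causes=["age", "sex", "comorbidity_score"],  # assumed confounders
)
estimand = model.identify_effect()
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")

# Sanity check: replacing the treatment with noise should kill the effect.
refutation = model.refute_estimate(
    estimand, estimate, method_name="placebo_treatment_refuter"
)
print(estimate.value, refutation)
```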

6

ajt9000 t1_j2p0goc wrote

This is what I wanted to say too. Bayesian nets are good for identifying how strongly a parameter affects the outcome.

2

PredictorX1 t1_j2nk8sa wrote

My recollection (feel free to correct me) is that statistical survival models can be built as a series of logistic regressions, one for each of several forecast horizons. One could keep that same structure and substitute any classifier that produces an estimated probability (an induced decision tree, a neural network, ...) for those logistic regressions.
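A rough sketch of that discrete-time idea (my own illustration, assuming event times come discretized into integer periods and the predictors are numeric):

```python
# Sketch of discrete-time survival: expand each patient into one row per
# period at risk, then fit any probabilistic classifier on the expanded data.
# Assumes integer-period event times and numeric predictors.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def person_period(df, time_col="time", event_col="event", horizon=10):
    rows = []
    for _, r in df.iterrows():
        last = int(min(r[time_col], horizon))
        for t in range(1, last + 1):
            row = r.drop([time_col, event_col]).to_dict()
            row["period"] = t
            # The event fires only in the final observed period, if uncensored.
            row["y"] = int(t == r[time_col] and r[event_col] == 1)
            rows.append(row)
    return pd.DataFrame(rows)

df = pd.read_csv("patients.csv")  # placeholder file
pp = person_period(df)
X, y = pp.drop(columns="y"), pp["y"]

clf = LogisticRegression(max_iter=1000).fit(X, y)  # swap in any classifier here
hazard = clf.predict_proba(X)[:, 1]  # estimated per-period event probability
```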

3

Worth-Advance-1232 t1_j2nvwv9 wrote

If I understood correctly, rather than actually predicting survival time, your goal is to see whether certain parameters have an influence on it. Besides the other things already mentioned, one approach might be to predict survival time with a black-box algorithm and then explain the algorithm's decisions. One paper that covers this topic would be this one.
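One hedged sketch of that idea, using SHAP to explain a gradient-boosted model (note the naive regression target ignores censoring, so this is only illustrative):

```python
# Sketch: fit a black-box model, then ask SHAP which features drive it.
# Caveat: regressing directly on observed time ignores censoring.
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingRegressor

df = pd.read_csv("patients.csv")        # placeholder file
X = df.drop(columns=["time", "event"])
y = df["time"]                          # naive target; see caveat above

model = GradientBoostingRegressor().fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)  # global per-feature contribution overview
```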

3

DJ_laundry_list t1_j2q6fzs wrote

When you say "has any influence", I'm assuming you mean causal influence, rather than just correlation with a particular outcome. This puts us in the domain of causal inference. I suggest you go through a causal inference tutorial or two to build some domain knowledge. See https://causalinference.gitlab.io/kdd-tutorial/ and https://economics.mit.edu/sites/default/files/inline-files/causal_tutorial_1.pdf. Econometric modeling revolves heavily around this, so you will probably find more econometric sources than medical ones.

My personal approach: train an xgboost model (or really any ML model) using the appropriate hazard function, with Bayesian optimization for hyperparameter tuning, then compare the log-likelihood with the parameter at its actual values vs. counterfactual values. If the counterfactual values provide a similar fit, you're looking at something that is unlikely to be causal.
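A minimal sketch of the xgboost part, assuming the built-in survival:cox objective (hyperparameter tuning and the likelihood comparison are only stubbed; "new_marker" is a placeholder for the parameter in question):

```python
# Sketch: xgboost's survival:cox objective encodes right-censoring by
# negating the label of censored patients. Names are placeholders.
import numpy as np
import pandas as pd
import xgboost as xgb

df = pd.read_csv("patients.csv")  # placeholder file
X = df.drop(columns=["time", "event"])
y = np.where(df["event"] == 1, df["time"], -df["time"])

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "survival:cox", "eta": 0.05, "max_depth": 3}
model = xgb.train(params, dtrain, num_boost_round=200)

# Counterfactual probe: destroy the information in the parameter of interest
# and compare the resulting risk scores / model fit against the originals.
X_cf = X.copy()
X_cf["new_marker"] = np.random.permutation(X_cf["new_marker"].values)
risk_actual = model.predict(xgb.DMatrix(X))
risk_counterfactual = model.predict(xgb.DMatrix(X_cf))
```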

3

sitmo t1_j2oz219 wrote

Whatever you do to build a model, I would also create benchmark datasets in which the survival information is randomly shuffled between patients. Then repeat any model fitting and testing on these randomized datasets. This will give you a good sense of the statistical significance of anything you find in your data.
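For example (a sketch using lifelines; file and column names are placeholders), one could collect a null distribution of concordance scores from shuffled copies of the data:

```python
# Sketch: shuffle the (time, event) pairs across patients to break any real
# signal, refit, and record the score; real results are judged against this.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("patients.csv")  # placeholder file, numeric covariates
rng = np.random.default_rng(0)

null_scores = []
for _ in range(200):
    shuffled = df.copy()
    perm = rng.permutation(len(df))
    shuffled[["time", "event"]] = df[["time", "event"]].values[perm]
    cph = CoxPHFitter().fit(shuffled, duration_col="time", event_col="event")
    null_scores.append(cph.concordance_index_)

# A real-data concordance well above this range suggests genuine signal.
print(np.quantile(null_scores, [0.5, 0.95]))
```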

2

hopsauces t1_j2r0n39 wrote

The ML methods people are suggesting here are mostly focused on prediction. The collinearity and confounding problems you describe will interfere with those methods too; it will just be harder to recognize and diagnose. "Feature importance" measures are generally as vague as the name implies. You're going about it the right way; just keep working within that feedback loop of analyzing your data and improving your model.

2

khashitk t1_j2ojgc3 wrote

There are a lot of ways to tackle this problem, but it would really help if you could share a data sample. How many samples do you have? How many attributes are you studying? And so on. The range runs from a simple decision tree to a new experimental algorithm.

1

lattecoffeegirl OP t1_j2onsfp wrote

I can try to extract a small data sample. I have around 3,000 patients, their "time till death" or "time till end of study", and around 10 predictors.

1

khashitk t1_j2r4vrk wrote

The sample would help a lot. You have a good number of entries, so there could be quite a few ways to go.

Edit: Give "Weka" a try; it's free to use and has a lot of options. It's great for getting an overall view of the direction the project should take.

2

lattecoffeegirl OP t1_j2onc3y wrote

You are right: I want to know whether our "new" parameter is a good predictor or not.

1