eltorrido23 t1_j6c4bwq wrote on January 29, 2023 at 7:14 AM

I’m currently starting to pick up ML, coming from a quant-focused social science background. I am wondering what I am and am not allowed to do in EDA (on the whole data set) to avoid "data leakage", i.e. information gain that might eventually ruin my predictive model. Specifically, I am wondering about running linear regressions in the data inspection phase (this is what I would often do in my previous work, which was more about hypothesis testing than prediction). From what I have read and understood, one shouldn’t really do that, because I might obtain too much information and end up changing my model in a way that ruins its predictive power? However, in the course I am doing (Jose Portilla's DS Masterclass) they regularly look at correlations before separating the train/test samples. But linear regressions are essentially just (multiple/corrected) correlations, so I am a bit confused about where to draw the line in EDA. Thanks!

trnka t1_j6ce4td wrote on January 29, 2023 at 9:26 AM

I try not to think of it as right versus wrong, but in terms of risk. If you have a big data set, do EDA over the full thing before splitting off the test data, and intend to build a model, then yes, you're learning a little about the test data, but it probably won't bias your findings.
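
If you'd rather not take on even that small risk, the cautious option is to set the test rows aside first and run the correlations and exploratory regressions on the training rows only. A minimal sketch with NumPy, where the data is synthetic and the 80/20 split, the seed, and all variable names are just assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 1,000 rows, 3 predictors, one outcome.
n = 1000
X = rng.standard_normal((n, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.standard_normal(n)

# 1. Set the test rows aside before looking at anything.
idx = rng.permutation(n)
n_test = n // 5
test_idx, train_idx = idx[:n_test], idx[n_test:]
X_train, y_train = X[train_idx], y[train_idx]

# 2. EDA on the training rows only: correlations of predictors and outcome.
corr = np.corrcoef(np.column_stack([X_train, y_train]), rowvar=False)
print(np.round(corr, 2))

# 3. The exploratory linear regression, also on training rows only.
A = np.column_stack([np.ones(len(X_train)), X_train])  # intercept column
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
print("intercept and slopes:", np.round(coef, 2))
```

The test rows are never touched here, so whatever you decide based on the correlation matrix or the regression can't have been influenced by them.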

If you have a small data set and do EDA over the full thing, there's more risk that your modeling decisions are shaped by the not-yet-held-out data, which makes the later test score look better than it should.
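
To get a feel for how that can go wrong, here is a rough simulation, not a rigorous benchmark. Everything in it is made up: features and target are pure noise, and the sizes (60 rows, 5,000 features, top 10 kept) are deliberately extreme. It picks the features that correlate best with the target, once using all rows and once using only the training rows, and then checks how much of that apparent relationship shows up on the test rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 60, 5000, 10
X = rng.standard_normal((n, p))   # pure-noise features
y = rng.standard_normal(n)        # pure-noise target, unrelated to X

test = np.arange(n) >= 40         # last 20 rows are held out
train = ~test

def corr_with_target(X, y):
    """Pearson correlation of every column of X with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

def held_out_signal(select_rows):
    """Pick the k features that look best on `select_rows`, then measure
    how much of that apparent relationship appears on the test rows."""
    r_sel = corr_with_target(X[select_rows], y[select_rows])
    top = np.argsort(-np.abs(r_sel))[:k]
    r_test = corr_with_target(X[test][:, top], y[test])
    # Signed so that agreement with the direction seen at selection counts as positive.
    return float(np.mean(np.sign(r_sel[top]) * r_test))

leaky = held_out_signal(np.ones(n, dtype=bool))  # selection saw the test rows
honest = held_out_signal(train)                  # selection saw train rows only
print(f"leaky: {leaky:.2f}  honest: {honest:.2f}")
```

Since there is no real signal, the honest number should hover around zero, while the leaky one tends to come out clearly positive because the test rows helped choose the features. With far more rows than candidate features the gap shrinks, which is the "big data set" case.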

In real-world problems though, ideally you're getting more data over time, so your test data will change and it won't be as risky.