IHaque_Recursion t1_j7mjumw wrote on February 7, 2023 at 9:48 PM

Reply to comment by SandwichNo5059 in We’re Recursion and we’re using AI to decode biology and industrialize drug discovery! by ShakeNBakeGibson

Batch effects are probably the most annoying part about doing machine learning in biology – if you’re not careful, ML methods will preferentially learn batch signal rather than the “real” biological signal you want.

We actually put out a dataset, RxRx1, back in 2019, to address this question. You can check this here.Here is some of what we learned (ourselves, and via the crowdsourced answers we got on Kaggle).

Handling batch effects takes a combination of physical and computational processes. To answer at a high level:

We’ve carefully engineered and automated our lab to minimize experimental variability (you’d be surprised how clearly the pipetting patterns of different scientists can come out in the data – which is why we automate).
We’ve scaled our lab, so that we can afford to ($ and time!) collect multiple replicates of each data point. This can be at multiple levels of replication – exactly the same system, different batches of cells, different CRISPR guides targeting the same gene, etc. – which enables us to characterize different sources of variation. Our phenomics platform can do up to 2.2 million experiments per week!
We’ve both applied known computational methods and built custom ML methods to control / exclude batch variability. Papers currently under review!