
IHaque_Recursion t1_j7n2ly1 wrote

I can’t comment on all of our internal technologies. But! We did recently publish work with our collaborators at Genentech on benchmarking methods to build maps of biology, which we evaluated on both our phenomics data and (publicly available) 10x scRNA-seq (Perturb-seq) data – check it out here. So, draw your own conclusions…

2

IHaque_Recursion t1_j7n1vgs wrote

We run our experiments in-house so that we can control the quality and relevance of the data. This attention to detail requires a lot of unsexy, behind-the-scenes operational improvements to control for as many 'exogenous' factors as possible that can influence what actually takes place in our experimental wells. To manage this, we have (to an extent) backward-integrated with our supply chain so that we can (i) anticipate where possible or (ii) correct for changes in the media our vendors supply, different coatings that suppliers may put on plates, etc. Additionally, we have built an incredibly robust tracking process that captures the metadata from every step in our multi-day assay, so that we maintain precise control over things like volume transfers, compound dwell times, and plate movements to further ensure this relatability. I also wrote more earlier in the AMA about how we handle batch effects!

1

IHaque_Recursion t1_j7n1l9p wrote

We build maps of biology in a range of cell types for exactly this reason – different cell types express different genes. For example, in our partnership with Roche and Genentech, we are building maps in a range of neuroscience-relevant cell types to capture their unique biology.

1

IHaque_Recursion t1_j7n0rk7 wrote

GNNs are in the suite of methods that we use and evaluate. But it’s useful to recognize that although we often draw molecules as graphs, that is not necessarily the only useful (or best) representation for molecules in machine learning. We recently published (poster and talk, paper) research using DeBERTa-style representations and self-supervision over molecular graphs, achieving SOTA results on 9 of the 22 Therapeutics Data Commons ADMET tasks.
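
To give a flavor of what masked-token self-supervision can look like, here is a minimal PyTorch sketch. To be clear, this is illustrative, not the published architecture: the published work operates over molecular graphs with a DeBERTa-style encoder, while this toy uses SMILES strings, a vanilla transformer, and made-up vocabulary, masking rate, and model sizes.

```python
# Minimal sketch: masked-token self-supervision over SMILES strings.
# Illustrative assumptions throughout; not the published architecture.
import random
import torch
import torch.nn as nn

random.seed(0)
torch.manual_seed(0)

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # toy corpus
vocab = sorted({ch for s in smiles for ch in s}) + ["<pad>", "<mask>"]
stoi = {ch: i for i, ch in enumerate(vocab)}
PAD, MASK = stoi["<pad>"], stoi["<mask>"]

def encode(s, max_len=32):
    ids = [stoi[ch] for ch in s][:max_len]
    return ids + [PAD] * (max_len - len(ids))

class SmilesEncoder(nn.Module):
    """Tiny transformer with a masked-token prediction head.
    (A real model would also add positional encodings; omitted for brevity.)"""
    def __init__(self, vocab_size, d_model=64, nhead=4, nlayers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        return self.head(self.encoder(self.emb(x)))

model = SmilesEncoder(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 marks unmasked tokens

batch = torch.tensor([encode(s) for s in smiles])
inputs, targets = batch.clone(), torch.full_like(batch, -100)
for i in range(batch.size(0)):            # mask ~15% of non-pad tokens
    for j in range(batch.size(1)):
        if batch[i, j] != PAD and random.random() < 0.15:
            targets[i, j] = batch[i, j]   # the model must recover this token
            inputs[i, j] = MASK
targets[0, 0] = batch[0, 0]; inputs[0, 0] = MASK  # guarantee >=1 masked token

logits = model(inputs)                    # (batch, seq, vocab)
loss = loss_fn(logits.reshape(-1, len(vocab)), targets.reshape(-1))
loss.backward(); opt.step()
print(f"masked-token loss: {loss.item():.3f}")
```

The appeal of this objective is that it needs no labels: the molecules themselves supply the supervision, so representations can be pretrained at scale before fine-tuning on tasks like ADMET prediction.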

4

IHaque_Recursion t1_j7n0hnm wrote

We actually don’t start our drug discovery efforts from single targets – check out my earlier reply in the AMA for more details. ChEMBL certainly is an excellent source of structural information, but our insights come not from these data, but rather from high-dimensional relationships between cells treated with compounds and cells with genetic knockouts. We advance compound series using these data prior to having any information about the target itself.
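
To make those "high-dimensional relationships" concrete, here is a hypothetical sketch of the general idea: represent each perturbation (compound or gene knockout) as an embedding vector, then rank gene knockouts by similarity to a compound’s profile to nominate what biology it is hitting. The embeddings below are simulated stand-ins; real phenomic embeddings come from models over microscopy images.

```python
# Minimal sketch: relating a compound's profile to gene-knockout profiles
# by similarity in an embedding space. All data here are simulated.
import numpy as np

rng = np.random.default_rng(0)
genes = [f"GENE_{i}" for i in range(1000)]     # hypothetical knockouts
gene_emb = rng.normal(size=(1000, 128))        # one profile per knockout

# A compound whose (hidden) mechanism resembles knocking out GENE_42.
compound_emb = gene_emb[42] + rng.normal(scale=0.3, size=128)

def cosine(a, B):
    """Cosine similarity between one vector and each row of a matrix."""
    return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a))

sims = cosine(compound_emb, gene_emb)
top = np.argsort(sims)[::-1][:5]
for i in top:
    print(f"{genes[i]}: similarity {sims[i]:.2f}")  # GENE_42 should rank first
```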

12

IHaque_Recursion t1_j7mzd3e wrote

Conformal prediction is indeed an interesting method (or family of methods). I can’t comment on our undisclosed internal machine learning research, but what I can say is that machine learning on biological problems tends to be much, much harder than on common toy or benchmark datasets. Uncertainty quantification is usually an even harder problem than pure accuracy measurement, especially when you have a mix of known and unknown systematic and random effects in your data-generating process.
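
For readers unfamiliar with the method the question refers to, here is a minimal sketch of split conformal prediction, the simplest member of that family. The model, data, and miscoverage level are illustrative assumptions; the point is that a simple calibration step turns any point predictor into one with marginal coverage guarantees – guarantees that rest on an exchangeability assumption which batch effects and distribution shift in biological data can easily break.

```python
# Minimal sketch: split conformal prediction intervals for regression.
# Model, data, and alpha are illustrative; the coverage guarantee assumes
# calibration and test points are exchangeable - often violated in biology.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=500)  # toy nonlinear target

# Split: fit on one half, calibrate residuals on the other.
X_fit, X_cal, y_fit, y_cal = X[:250], X[250:], y[:250], y[250:]
model = RandomForestRegressor(random_state=0).fit(X_fit, y_fit)

alpha = 0.1                                    # target 90% coverage
resid = np.abs(y_cal - model.predict(X_cal))   # calibration scores
n = len(resid)
q = np.quantile(resid, np.ceil((n + 1) * (1 - alpha)) / n)

x_new = rng.normal(size=(1, 5))
pred = model.predict(x_new)[0]
print(f"90% prediction interval: [{pred - q:.2f}, {pred + q:.2f}]")
```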

1

IHaque_Recursion t1_j7my3wq wrote

1 - We aim to close the loop between high-dimensional biological profiling of compounds and rapidly learning how to drive a compound series’ evolution toward higher potency, lower risk, and better kinetics. This is a huge and critical component of the overall vision of industrializing drug discovery. In practice, we are dedicating major effort to ML-guided SAR, and figuring out how automated synthesis integrates into this plan is part of our roadmap.

2,3 - Given the highly custom nature of the automation systems we have built, and the need for ultra-high control over experimental precision, we have relationships with several automation experts in this space. As far as partnerships are concerned, we can’t comment on specific business development plans or transactions until we announce them publicly. What I can say is that we recognize the work it has taken over the last decade to map and navigate biology, we believe there are many other teams and technologies that have been developing in parallel, and we’re always exploring options to bring in additional capabilities that may accelerate our mission.

4 - The “Recursion 101” video we released in October of 2022 has some of the most current footage of our automation labs — if you haven’t seen the video, we (selfishly) think it’s worth the watch. We have also released “Recursion's Mapping & Navigating Demonstration” which shows footage of our laboratories.

4

IHaque_Recursion t1_j7mws6x wrote

The majority of drugs don’t fail because we can’t engage the target with a small or large molecule - they fail because we pick the wrong target. Hence our focus on mapping and navigating causal biology. Our platform is exceptionally well-suited to target-agnostic identification of compounds that impact biology, which absolutely means we don’t always know the target of our compounds. However, one of the major advantages of our map is that it can often uncover the real targets of our active compounds, enabling us to use advancements in structure-based drug design. Additionally, the underlying learnings in this field are useful even in the target-agnostic space, as we try to featurize compounds and learn how to make molecules not only more potent against their primary target, but also better in their overall efficacy, safety, and metabolic profile.
That said, we actually do make use of structure-based methods where appropriate. What we don’t do is limit ourselves solely to identifying particular targets (and their structures) ahead of time when initiating discovery programs.
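
As a heavily simplified illustration of what "featurizing compounds" can mean, here is a common public baseline: turning molecules into fixed-length vectors with RDKit Morgan fingerprints and fitting a toy activity model. This is a standard open technique, not a description of our internal featurizers, and the molecules and labels below are made up.

```python
# Minimal sketch: featurizing compounds as Morgan fingerprints (a common
# public baseline) and fitting a toy potency model. Not our internal stack.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.linear_model import Ridge

def featurize(smiles, radius=2, n_bits=2048):
    """SMILES -> fixed-length bit vector via Morgan (ECFP-like) hashing."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(list(fp))

# Toy data: a few molecules with made-up activity labels.
train_smiles = ["CCO", "CCN", "c1ccccc1O", "CC(=O)O"]
train_activity = [0.2, 0.3, 0.9, 0.1]  # hypothetical values

X = np.stack([featurize(s) for s in train_smiles])
model = Ridge().fit(X, train_activity)
print(model.predict(featurize("c1ccccc1N").reshape(1, -1)))
```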

1

IHaque_Recursion t1_j7mv6ze wrote

I did my PhD in the Folding@home lab, so I like this one. There’s a distinction between what’s formally called “ground-state structure” and “structural dynamics”. “Ground state structure” is the lowest-energy, most stable structure of a protein; for me, the ground state structure is “lying in bed”. But only knowing that doesn’t tell you how the structure moves around, which it turns out is important. For example, when I sprained my shoulder, the movement of my arm was highly restricted, but you wouldn’t have known that from looking at one position in which I sleep (you creep). Folding@home is more focused on modeling the dynamics of proteins than their ground state structures. For example, the most effective recent COVID vaccines used a modification to the spike protein called “S-2P”/”prefusion-stabilized” that effectively froze the protein in one particular shape rather than allowing it to fluctuate, which enhanced its ability to generate a useful immune response.
That said, dynamics is the obvious next step for ML methods in protein structure, so I would not be surprised to see new developments here!

2

IHaque_Recursion t1_j7mu67h wrote

It’s an interesting idea, but we think our unique advantage is being able to generate scalable, relatable, and reliable data in-house. Clinical data are extremely challenging to work with from a statistical perspective (the number of confounders is astounding, and once you stratify, you may be left with very few samples). That said, real-world evidence is certainly interesting from a clinical development perspective for understanding the patient landscape, longitudinal disease progression, and for informing patient selection strategies in clinical trials; and other population-scale datasets may be of interest for advancing our discovery and development pipelines.

3

IHaque_Recursion t1_j7ms2y4 wrote

Directing the evolution of bacteria to change their small-molecule output is indeed a great example of the utility of AI, and it’s definitely similar to how we view AI in the overall evolution of a compound series. Today, our core applications of AI are at a lower level in the stack – for example, taking raw images from our microscopes and projecting them into biologically meaningful embedding spaces. That said, we’re building our discovery technologies with an eye towards building closed-loop optimization cycles in small-molecule discovery. We actually just presented more about this a couple weeks ago – if you’re curious, see more here in the Recursion OS section from our recent Download Day.
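
To make "closed-loop optimization" concrete, here is a minimal sketch of one generic design-make-test-learn cycle: a surrogate model is trained on the assay results so far, and the next batch of compounds is chosen by an acquisition rule that balances predicted activity against model uncertainty. Everything here – the surrogate, the acquisition rule, the simulated "assay" – is an illustrative stand-in, not our pipeline.

```python
# Minimal sketch: a generic closed-loop (active learning) screening cycle.
# The surrogate, acquisition rule, and simulated assay are all stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
library = rng.random((1000, 32))               # hypothetical compound features
true_activity = library @ rng.normal(size=32)  # hidden "ground truth"

def assay(idx):
    """Simulated wet-lab measurement with noise."""
    return true_activity[idx] + rng.normal(scale=0.5, size=len(idx))

tested = list(rng.choice(len(library), 20, replace=False))  # seed screen
results = list(assay(np.array(tested)))

for cycle in range(5):
    model = RandomForestRegressor(random_state=0).fit(library[tested], results)
    # Per-tree predictions give a cheap uncertainty estimate.
    per_tree = np.stack([t.predict(library) for t in model.estimators_])
    score = per_tree.mean(0) + per_tree.std(0)  # exploit + explore (UCB-like)
    score[tested] = -np.inf                     # don't re-test compounds
    batch = np.argsort(score)[-10:]             # next 10 compounds to make/test
    tested += batch.tolist()
    results += assay(batch).tolist()
    print(f"cycle {cycle}: best activity so far = {max(results):.2f}")
```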

4

IHaque_Recursion t1_j7mqu1g wrote

I have genetics on the brain, so yes: I definitely think that data from both germline GWAS and somatic variation studies can be valuable for drug discovery. We don’t work on antibodies at Recursion today (though we have piloted them and they worked great on the platform), but we certainly make use of genetics data to inform our directions. As far as canonical targets, our platform allows us to be agnostic and to explore without having to select a target. As we move through our drug discovery process we aim to understand as much as possible about the target and its mechanism of action.

1

IHaque_Recursion t1_j7mp91v wrote

Though there have been a lot of painful layoffs in biotech and tech lately, we and many other companies are still hiring. Computational chemistry is without a doubt going to be a critical component of the future of drug discovery, and it’s awesome that you’re kicking off your career in this space. We will certainly continue to grow here and would love to hear more about your work and journey in this field. As you can probably tell, we look to hire innovators who are passionate about their work and committed to bold, outside-the-box thinking in pursuit of our mission.

23

IHaque_Recursion t1_j7mn89n wrote

So, data sharing in industrial science is complicated. I’ve spent my career in biotech driving for greater openness and data release in the companies where I’ve been. The “natural” state of data is to be siloed. This isn’t just an industrial thing – I’ve read plenty of papers from academic groups with “data available on request” (lol nope, I tried) – and the driver is always the same: a fear that “we spent this money to make the data, how do we get value out of it?”

One of the reasons I joined Recursion in 2019 was that Chris and the team shared that commitment to sharing learnings back to the world. The balance we’ve struck, supporting open science while also using these data to drive internal research and develop therapeutics as a public company, is to share a huge dataset that is partially blinded. In RxRx3 we are revealing ~700 genes and 1,600 compounds. We’ve sometimes chosen different points on that balance; for example, our COVID datasets RxRx19a and RxRx19b were released completely openly (CC-BY) because we thought the public health crisis was more important than any commercial interest we might have in the data. Our current aim is to continue to unblind parts of the RxRx3 dataset over time, so please stay tuned for additional releases.

We have also contributed to open science by releasing not just datasets, but tools. Associated with our COVID datasets, we released a data explorer allowing folks to explore the results from our COVID screens. Along with RxRx3, we released a tool (MolRec) where people outside of Recursion can explore some of the same insights that our scientists use to generate novel therapeutic hypotheses and advance new discovery programs, and get a look at how Recursion is turning drug discovery from a trial-and-error process into a search problem.

16

IHaque_Recursion t1_j7mm269 wrote

I’ve been super excited to see how our datasets have driven academic research out in the world. Recursion has been on the cutting edge of developing phenomics as a high-throughput biological modality, and the RxRx datasets are among the largest and best-organized public datasets out there for folks to work with. I’ve seen blog posts, conference posters, MS theses, and more written on our datasets. (We’ve also hired a number of folks to our team based on their work on these data!)

1

IHaque_Recursion t1_j7mjumw wrote

Batch effects are probably the most annoying part about doing machine learning in biology – if you’re not careful, ML methods will preferentially learn batch signal rather than the “real” biological signal you want.

We actually put out a dataset, RxRx1, back in 2019 to address this question – you can check it out here. Here is some of what we learned (ourselves, and via the crowdsourced answers we got on Kaggle).

Handling batch effects takes a combination of physical and computational processes. To answer at a high level:

  1. We’ve carefully engineered and automated our lab to minimize experimental variability (you’d be surprised how clearly the pipetting patterns of different scientists can come out in the data – which is why we automate).
  2. We’ve scaled our lab so that we can afford (in $ and time!) to collect multiple replicates of each data point. This can be at multiple levels of replication – exactly the same system, different batches of cells, different CRISPR guides targeting the same gene, etc. – which enables us to characterize different sources of variation. Our phenomics platform can run up to 2.2 million experiments per week!
  3. We’ve both applied known computational methods and built custom ML methods to control for or exclude batch variability (papers currently under review!) – a minimal illustration of the computational side is sketched below.
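
To give a flavor of the simplest computational approach (and only that – as noted, our production methods are unpublished): a common public baseline is to center and scale each embedding dimension per experimental batch against that batch’s negative controls, so that downstream models see biology rather than batch. A hypothetical sketch on simulated data:

```python
# Minimal sketch: per-batch normalization of embeddings against negative
# controls - a common public baseline for batch correction, not
# Recursion's production method. All data here are simulated.
import numpy as np

rng = np.random.default_rng(0)
n_wells, dim = 6000, 128
batch_ids = rng.integers(0, 20, size=n_wells)  # 20 experimental batches
is_control = rng.random(n_wells) < 0.1         # ~10% negative-control wells

# Simulated embeddings: biology plus an additive per-batch nuisance shift.
biology = rng.normal(size=(n_wells, dim))
batch_shift = rng.normal(scale=2.0, size=(20, dim))
emb = biology + batch_shift[batch_ids]

corrected = np.empty_like(emb)
for b in np.unique(batch_ids):
    in_batch = batch_ids == b
    ctrl = emb[in_batch & is_control]          # this batch's controls
    if len(ctrl) == 0:                         # fall back to whole batch
        ctrl = emb[in_batch]
    mu, sd = ctrl.mean(axis=0), ctrl.std(axis=0) + 1e-8
    corrected[in_batch] = (emb[in_batch] - mu) / sd

# Cross-batch spread should shrink sharply after correction.
print(f"before: {emb.mean(axis=1).std():.2f}, "
      f"after: {corrected.mean(axis=1).std():.2f}")
```
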
4

IHaque_Recursion t1_j7mg5db wrote

Might be some personal bias here – I came to Recursion from a sequencing background – but I don’t necessarily think metabolomics or proteomics are more established than transcriptomics (especially in a research context; clinical testing is different!). The past 10-15 years have seen an absolute _explosion_ in the ability to generate (and analyze/interpret) sequencing data at scale. One of our core principles is being able to generate high-dimensional data at scale, and from that perspective, transcriptomics is a great complement to phenomics. Metabolomic and proteomic technologies (whether affinity- or MS-based) are still more expensive and smaller-scale than what you can achieve by sequencing. That being said, as technology advances and we find the right application areas, we’re interested in exploring what these other readouts can do for us.

10