Submitted by tmblweeds t3_zn0juq in MachineLearning

tl;dr: I built a site that uses GPT-3.5 to answer natural-language medical questions using peer-reviewed medical studies.

Live demo: https://www.glaciermd.com/search

Background

I've been working for a while on building a better version of WebMD, and I recently started playing around with LLMs, trying to figure out if there was anything useful there.

The problem with the current batch of "predict-next-token" LLMs is that they hallucinate—you can ask ChatGPT to answer medical questions, but it'll either

  1. Refuse to answer (not great)
  2. Give a completely false answer (really super bad)

So I spent some time trying to coax these LLMs to give answers based on a very specific set of inputs (peer-reviewed medical research) to see if I could get more accurate answers. And I did!

The best part is you can actually trace the final answer back to the original sources, which will hopefully instill some confidence in the result.

Here's how it works:

  1. User types in a question
  2. Pull the top ~800 studies from Semantic Scholar and PubMed
  3. Re-rank using sentence-transformers/multi-qa-MiniLM-L6-cos-v1
  4. Ask text-davinci-003 to answer the question based on the top 10 studies (if possible)
  5. Summarize those answers using text-davinci-003
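For the curious, here's a minimal sketch of what that pipeline might look like (a simplified reconstruction, not the production code: the prompt wording, helper names, and single-page Semantic Scholar fetch are my own, and it uses the legacy openai<1.0 Completion API that text-davinci-003 shipped with):

```python
import requests
import openai
from sentence_transformers import SentenceTransformer, util

reranker = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

def fetch_studies(question, limit=100):
    # Step 2 (simplified): one page of Semantic Scholar results; the real
    # pipeline pulls ~800 studies from Semantic Scholar plus PubMed.
    resp = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": question, "fields": "title,abstract", "limit": limit},
    )
    resp.raise_for_status()
    return [p["abstract"] for p in resp.json()["data"] if p.get("abstract")]

def rerank(question, abstracts, top_k=10):
    # Step 3: embed the question and abstracts, keep top_k by cosine similarity.
    q_emb = reranker.encode(question, convert_to_tensor=True)
    doc_embs = reranker.encode(abstracts, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, doc_embs)[0]
    best = scores.argsort(descending=True)[:top_k]
    return [abstracts[int(i)] for i in best]

def answer(question, top_abstracts):
    # Step 4: instruct the model to answer from the excerpts only.
    context = "\n\n".join(top_abstracts)
    prompt = (
        "Answer the question using ONLY the study excerpts below. "
        "If they don't contain an answer, say you can't answer.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    completion = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=256, temperature=0
    )
    return completion.choices[0].text.strip()

question = "How does coffee affect blood pressure?"
print(answer(question, rerank(question, fetch_studies(question))))
```

The real app also summarizes the per-study answers with a second text-davinci-003 call (step 5); that part is elided here.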

Would love to hear what people think (and if there's a better/cheaper way to do it!).

---

UPDATE 1: So far the #1 piece of feedback has been that I should be way more explicit about the fact that this is a proof-of-concept and not meant to be taken seriously. To that end, I've just added a screen that explains this and requires you to acknowledge it before continuing.


[Screenshot of the new acknowledgement screen: https://preview.redd.it/jrt0yv3rfb6a1.png?width=582&format=png&auto=webp&s=38021decdfc7ed4bc3fe8caacaee2d09cd9b541e]

Thoughts?

Update 2: Welp that's all the $$$ I have to spend on OpenAI credits, so the full demo isn't running anymore. But you can still follow the link above and browse existing questions/answers. Thanks for all the great feedback!

178

Comments


w0lph t1_j0el9zd wrote

Curious to know how you do step 4.

Also, I think it’s a dangerous use case. I don’t think you will ever eliminate the risk of hallucinations, and that can be catastrophic when it comes to health.

55

Nowado t1_j0fdi2f wrote

As much as I would like to agree, all use cases (even sensitive and important ones, like health) have to be compared against a realistic alternative. Currently that alternative is Google/WebMD, and previously it was asking friends or guessing. Not doctors.

Adding a disclaimer not to treat this as medical advice would, for legal reasons, of course be a good call for OP personally.

20

Seon9 t1_j0foo1j wrote

I think it's worse than Google or WebMD because people are biased toward believing detailed (richer) information sources, especially when they're fronted by AI. You might receive vague advice from a friend or WebMD, but that information is weighted against the source's trustworthiness, perceived soundness, information richness, etc. I think this oversteps a lot of that for most people, and they might be more inclined to believe it.

It also overlooks that research ≠ clinical practice, and that translating research into practice involves numerous hurdles. There can be a decade of research and numerous publications for a drug that then collapses in clinical testing. Or misattribution: deaths diagnosed as shaken baby syndrome that are probably due to prior non-shaking-related trauma. I think access to research is a good thing, but stripped of context like this, it isn't.

I also wonder if the hallucinations increase as the pool of information shrinks. I'm not going to use this if I have a common cold, but if I have a rare S3 colorectal cancer then I might find it useful. Cool project, but medical applications are always tough.

16

NotMyMain007 t1_j0i3nwz wrote

I do agree with you, but at the same time, there are some really shitty places on earth where this would be a good option to have instead of just lying on the ground and dying. Even more so if this keeps evolving.

1

trnka t1_j0emdr8 wrote

Very cool! I worked closely with our doctors on ML features at a telemedicine startup, so let me check some of the things I know about:

- What's the most effective antibiotic for a UTI? -> "Effective" was a poor word choice on my part; it gave one drug we didn't use, one last-line-of-defense drug, and then a drug class

- What's the best first-line antibiotic for a UTI -> agreed with our clinical best practices (Nitrofurantoin)

- I tried asking when to diagnose bacterial vs. viral common colds if a lab test can't be done - no results (best practice was to treat as bacterial if symptoms aren't improving after >10 days)

- "When is tamiflu effective?" If I remember right, our guidelines were first day of infection or first two days if the patient's immunocompromised, lives with someone immunocompromised, or works in healthcare. The system was sorta right: "Oseltamivir (Tamiflu) is effective for the treatment and prevention of influenza in adults, adolescents, and children, and early initiation of treatment provides greater clinical benefits."

- How does coffee affect blood pressure? I remembered that it increased right after drinking, from a BP test we ran. That showed up in the results, but the results mixed 1) studies about immediate effects and 2) studies about long-term effects, which have different conclusions.

When a query fails, I wish it wouldn't just delete it - it'd be nice to have it still there so I can share it with you, or revise it myself. I also had one query get "stuck" even though all the steps had transitioned to checked.

If you can get access to a site like UpToDate, many of our doctors used that for clinical best practices. Very few searched the medical literature directly.


I'll share with some of my doctor friends and hopefully they'll give you actual medical feedback rather than the secondhand knowledge I have.

49

tmblweeds OP t1_j0hg7su wrote

Super helpful! This broadly aligns with other feedback I've gotten, that I should be focusing on guidelines in addition to/instead of clinical research.

2

trnka t1_j0hr6zn wrote

Yeah our doctors spent a lot of time building and revising clinical guidelines for our practice.

I'm not sure what your background is, but some tips from working with them on clinical guidelines:

  • There were some guidelines that were generally-accepted best practices in medicine, but it was more common to have clinic-specific guidelines
  • My team ran into some resistance to the idea of ML-created guidelines. Physicians were more receptive to technology that assisted them in creating guidelines
  • Many guidelines are aspirational, like when to order lab tests. Many patients just won't get the tests, or the test results will come back after the current condition has resolved. Likewise, if you're worried about a patient taking the antibiotic to term, it may be better to use a 1-dose second-line antibiotic rather than a multi-dose first-line antibiotic. In the long term I expect that clinical guidelines will adapt somewhat to patient adherence; they aren't a one-time thing. Plus research changes too.
  • For any evidence, there needs to be vetting of how it's gathered, like whether it's a proper randomized control trial, how the statistics are done, how the study is designed, what population was studied, etc
5

tmblweeds OP t1_j0ieog3 wrote

Ah yeah, I wasn't necessarily thinking about creating guidelines with ML; more like highlighting/synthesizing relevant excerpts from existing guidelines (e.g. NICE, American College of Cardiology, etc.). But I didn't know that individual clinics had their own guidelines in addition to the "official" ones.

2

take_eacy t1_j0hh73q wrote

Yes, as an MD, I agree with the points made above

UpToDate is a widely used summary of guidelines (doctors are paid to summarize and synthesize the latest research studies). "Widely used" is almost an understatement, honestly: when your doctor steps outside or has to look something up, it's one of the top resources (if not #1, though nobody likes to admit it) since it's such a great aggregator.

4

Top-Perspective2560 t1_j0ez2z9 wrote

The first thing I’d say, and this is really important: You need to put a disclaimer on the site clearly stating that it’s not medical advice of any kind.

Explainability is always the sticking point in healthcare. This is pretty cool, but unless you can explicitly state why the model is giving that advice/output, it can never be truly useful, and worse, can open you up to all sorts of issues around accountability and liability. Tracing back to the original studies is a good thing, but doesn’t necessarily answer the question of why the model thinks that study should result in that advice.

Deep Learning models in healthcare are typically relegated to the realms of decision-support at best for the moment because of these issues. Even then, they’re often ignored by clinicians on the whole for a variety of reasons.

The methodology for determining what advice to give is quite shaky too. There is usually a bit more to answering these kinds of questions. What are the effect sizes given in the studies, for example? What kind of studies are they?

Anyway, I hope that doesn’t come across as overly-critical and is constructive in some way. AI/ML for healthcare can be a bit of a minefield, but it’s my area of research so just thought I’d pass on my thoughts.

Edit just to add: It would probably be really beneficial for you to talk to a clinician or even a med student about your project. From my experience, it's pretty much impossible to build effective tools or produce good, impactful research in this domain without input from actual clinicians.

34

tmblweeds OP t1_j0hgp9v wrote

Definitely not overly critical—the whole reason I posted was to get critiques! I think you're right that I can go further with explainability, and I also think that there are ways to use NER, etc., to give more interesting answers (e.g., a table of treatments sorted by effect size or adverse events). I'll keep working in this direction.

3

Top-Perspective2560 t1_j0ho0hv wrote

Sounds good! The table of treatments sounds like a good starting point - but further down the road it's definitely an issue of making sure that it actually corresponds to the model's "answer" somehow, because the advantage of providing it is to validate the output. Quite a lot of these issues around explainability are very deep-rooted in the models themselves - I'm sure you're familiar with the general state of play on that. However, there are definitely ways to take steps in the right direction.

If you'd like any input at any point feel free to fire over a DM!

1

take_eacy t1_j0hhxrm wrote

Agreed! Clinicians are often the gatekeepers in clinical practice and have an understanding of the actual workflow

2

alekosbiofilos t1_j0f0uuq wrote

Great tech skills. But honestly, I think it is a bad idea!

If it works most of the time, that's even worse! Thing is, these models are basically fancy autocorrect apps. They don't understand anything. Research papers are fairly structured, but not as structured as this application needs. For example, the app might be very inclined to end with "but more research is needed," or to start "discussing" with itself on scientific questions that can be studied from several angles. Not to mention things like gene names, gene x environment interactions, and the nuance of what an "interaction" even is (is it genetic, regulatory, physical?).

Maybe for researchers this could work as an easier-to-use search engine for papers. The problem is that the "curation" of the answer is abstracted away from users, and one might spend more time trying to figure out what the thing meant than doing the lit search.

9

tmblweeds OP t1_j0hhgpd wrote

I hear you! I definitely want to "do no harm" here—I think while I'm still testing things out I need to plaster a lot more warnings around the site like "THIS IS A PROOF-OF-CONCEPT, NOT MEDICAL ADVICE, DO NOT TRUST."

My ultimate goal would be to make the "curation" of the answer much clearer, so that this would be more of a research tool (like Pubmed) and less of a magic oracle.

1

farmingvillein t1_j0ejk8l wrote

Check out // compare with https://crfm.stanford.edu/2022/12/15/pubmedgpt.html, if you haven't yet.

7

tmblweeds OP t1_j0eow0t wrote

Yeah just saw that! Definitely going to give it a test drive in the next few days.

3

SulszBachFramed t1_j0g22xz wrote

> (...) a new state of the art performance of 50.3% accuracy on the MedQA biomedical question answering task.

Oof, the fact that the accuracy is only 50% does not inspire confidence.

1

Taenk t1_j0eqjgc wrote

I have no domain knowledge in medicine, but when playing with ChatGPT one of the more annoying limitations was not being able to get primary sources or book references.

Could this approach be easily expanded or adapted to other domains? I am thinking of maybe having an AI like this check the sources for Wikipedia articles or conversely suggesting sources for when it hits a [citation needed].

5

JanneJM t1_j0ejc6i wrote

Out of curiosity, how well does it work if you simply ask it to base the answer on Pubmed sources only, without any ranking or anything?

4

memberjan6 t1_j0g470y wrote

GPT-3 is always at some risk of hallucinating its responses, due to its architecture. Your steps to prevent hallucinations in this medical application are in the right direction, and may turn out to be helpful guidance for other application developers. Your steps toward traceability of the model's answers are also wise moves.

By contrast, the Deepset.ai Haystack QA pipeline framework (and perhaps others) is designed specifically around non-hallucination and answer-provenance transparency. In the medical context, I think you'd need to demonstrate empirical evaluations of both types of systems to medical stakeholders, after getting such evidence privately for your own decision about the better architecture for a medical app.

I can say the slower responses of GPT-3-type LLMs are also a potential challenge. By contrast, the Haystack design uses a combination of two or more model types in a pipeline to dramatically speed up responses, and shows you exactly where in the document base it sourced its answer.
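For reference, here's what a minimal Haystack 1.x extractive pipeline looks like (my illustration, not OP's or Deepset's production setup; `abstracts` is a placeholder list of study texts):

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

abstracts = ["Nitrofurantoin is a recommended first-line antibiotic for uncomplicated UTI..."]

# Index the study texts (each document needs a "content" field).
store = InMemoryDocumentStore(use_bm25=True)
store.write_documents([{"content": text} for text in abstracts])

# A fast sparse retriever narrows the corpus; an extractive reader then pulls
# an answer span verbatim from a document, so provenance comes built in.
retriever = BM25Retriever(document_store=store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
pipe = ExtractiveQAPipeline(reader, retriever)

result = pipe.run(
    query="What is a first-line antibiotic for a UTI?",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 3}},
)
for ans in result["answers"]:
    print(ans.answer, "|", ans.context)
```

Because the reader can only extract spans that literally appear in the indexed documents, it can't hallucinate free text the way a generative model can.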

4

CrossroadsDem0n t1_j0emmzy wrote

Something that comes to mind for improvements.

Not all PubMed citations are of equivalent value, medically speaking. So somebody producing an in-house compendium of research likely spends a fair bit of time curating their database. Those with domain knowledge recognize which researchers or organizations are historically more authoritative, or currently more relevant, in particular areas.

Also, PubMed is not likely to help you with assessing complicating factors, or the relationships between different-sounding conditions that share pathway commonality and thus influence treatment (random example: melatonin can be suggested as support for some forms of cancer, but that serotonin pathway apparently has relevance to lung inflammation, so melatonin may not be appropriate for lung cancers, which squares with findings from studies). Crafting databases to walk that kind of network is very involved.

Cool project though. It'll be interesting to see where you take it next.

3

jms74 t1_j0fgtxx wrote

I asked "How does LSD interact with serotonin?" and got a BS answer with mixed phrases that didn't make sense.

3

jms74 t1_j0fgyy7 wrote

Now I got a true answer..

2

ReginaldIII t1_j0gdsis wrote

This tool is such an unbelievably bad idea.

It really upsets me when i see people using unrestrained models to do what only a safety critical system should do.

With no clinical study or oversight. No ethics review before work on the project can start. No consideration for the collateral damage that can be caused.

Really really unethical behaviour.

If someone hooked up a bare CNN trained via RL to a real car and put it on the roads, everyone would rightfully be screaming that OP is an unethical fool for endangering the public. But somehow people think it's okay to screw around with medical data... The mind boggles.

0

JClub t1_j0fwb4d wrote

Nice project!

How do you prevent the model from hallucinating? I did not get that. Do you just hope that the model will copy from the top 10 searches you give it?

3

take_eacy t1_j0hisvl wrote

I think framing this as a better WebMD (which, as a clinician, is a shitty resource I kind of hate, but IMO is probably better than having no such resource) is the best way to go. Go more in the direction of consumer wellness than real medical advice (which is a high bar, and risk aversion there is extremely high).

IMO, medical research is too bogged down with moving slowly and being risk-averse. This is why AI-in-medicine efforts tend to really lag their mainstream CS counterparts. I'm glad there are efforts to be careful, but as someone in academic medicine, I think there can be way too much inertia.

3

tmblweeds OP t1_j0hk6fs wrote

Yeah I feel like right now the answers are in a weird spot...it's understandable/usable by motivated consumers/patients (like WebMD), but looking at primary research is more of a clinician/doctor thing (like UpToDate). Truthfully I'm more interested in making a better WebMD, since I think most health decisions (diet, exercise, sleep, supplements, OTC meds, etc.) are made without any MD input.

3

take_eacy t1_j0hlpw6 wrote

FYI, there's also a "For Patients" section and a "Beyond the Basics" section in UpToDate, meant for well-read patients, that doctors will print out for patients (they're designed for that use case).

3

ktpr t1_j0eo896 wrote

Since doctors also have to "trace the[ir] final answer back to the original sources" and the context of the case, how does this help doctors, who must do the same due diligence either way?

2

weightloss_coach t1_j0ewuub wrote

Have you tried elicit or consensus?

2

tmblweeds OP t1_j0hhwf0 wrote

Indeed! I'm interested in trying to make a health-specific version of these tools. Elicit/Consensus are general-purpose research tools, which means it's harder for them to add health-specific views (e.g. a table of treatments sorted by effect size, a list of symptoms and their prevalence, a list of side effects and their prevalence, etc.). Obviously I haven't built any of that yet, but I'm working on it.

1

kreuzguy t1_j0eyt37 wrote

Very interesting. Have you thought about adding a bias to most recent and most cited papers?

2

tmblweeds OP t1_j0hi246 wrote

Yeah definitely working on that for "most recent"—but I've read conflicting things about using "most cited" as a proxy for trustworthiness (there are studies in some disciplines showing that citations are negatively correlated with reproducibility).
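For illustration, a recency bias could be as simple as an exponential age decay on the reranker score (the 5-year half-life below is an arbitrary placeholder, not a validated weighting):

```python
from datetime import date

def recency_biased_score(cos_sim: float, pub_year: int, half_life_years: float = 5.0) -> float:
    # Halve a paper's relevance score for every half_life_years since publication.
    age = max(0, date.today().year - pub_year)
    return cos_sim * 0.5 ** (age / half_life_years)

# e.g. in 2022, a 2012 paper with cosine similarity 0.80 scores 0.80 * 0.25 = 0.20
```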

1

rafgro t1_j0fpg80 wrote

Do you embed some special clauses or verification to limit hallucination? In my experience with splicing primary sources into the input, it can sometimes induce even more hallucinations (which can be more believable with sources attached, but still false!). To test it out here, I deliberately asked a few questions with no obvious answers, such as "What genes cause brain cancer?", and got a nice response in the form of "there's no answer yet".

2

StarInABottle t1_j0fzlzw wrote

While the engineering on this project sounds super interesting, I hope you realise how dangerous this tool is. Other responses have already discussed the possible dangers better than I would have. But the risk of letting a convincing-sounding statistical text predictor trained on research data act as a substitute for curated medical advice from practitioners should not be underestimated. Bad medical advice can at worst cost lives. Please be careful with this tool and add clear disclaimers that it does not constitute professional medical advice.

2

tmblweeds OP t1_j0hid0a wrote

Agreed! I think "add more/bigger/better disclaimers" is my biggest takeaway from this post.

2

ksblur t1_j0h0q1a wrote

This is awesome. I was just daydreaming about a service like this.

Sadly it might be broken at the moment:

> Unexpected status 500: { "error": { "message": "The server had an error while processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID fb2aca926dc155161eb029fd8191b425 in your message.)", "type": "server_error", "param": null, "code": null } }

1

tmblweeds OP t1_j0hj6tv wrote

Hah, yeah, that's from OpenAI (they've been having API issues ever since ChatGPT launched). I've switched the reranker to a different provider (for now), so it should be working again.

2

singularineet t1_j0hjtub wrote

Just asked it:

>Does varicela zoster cause DVTs?

and it replied:

>Varicella zoster is rarely associated with deep venous thrombosis (DVTs).
>
>Supporting excerpt
>
>“Frequently, these complications involve intracranial and extracranial arteries leading to ischemic stroke, cerebral aneurysm formation, subarachnoid hemorrhage, cerebral vasculitis, and giant cell arteritis. Symptoms involving veins, such as cerebral sinus thrombosis, deep venous thrombosis, and pulmonary embolism, are rare.”
>
>Study
>
>Varicella-Zoster Virus Vasculitis: A Case Report of Enteric Reactivation with Vasculopathy Leading to Arterial Dissection, Stroke, and Subarachnoid Hemorrhage
>
>Donohoe et al.
>
>Archives of Clinical and Medical Case Reports
>
>Invalid Date

This is all completely wrong. Varicella zoster (aka the chickenpox virus) causes a period of hypercoagulability in adults, which causes DVTs and pulmonary embolisms reasonably often. It's well documented, although most doctors are not familiar with it. So that part of the response is wrong. And the study it cites (a) does not support its answer, and (b) is not relevant.

edit: this is the kind of wrong answer that can kill people.

1

tmblweeds OP t1_j0hkcn9 wrote

Noted! I'll work on fixing the underlying problem, and more importantly I'll add bigger/better disclaimers to make sure nobody is taking these answers too seriously.

1

singularineet t1_j0i47td wrote

Right.

Obviously there's an NLP issue going on, where the "rare" in the quoted snippet is scoped to the complication under discussion.

2

Own-Plantain8065 t1_j0i1n4v wrote

You should prioritize papers based on the type of paper. For example, healthcare professionals put more trust in a systematic review vs. a case study. Just something to consider in your algorithm. Cred: medical student
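For illustration, such a prior could be a simple multiplier on the retrieval score (the weights below are made-up placeholders, not clinically validated ones):

```python
# Hypothetical prior weights by study design, roughly following the
# evidence hierarchy; these numbers are illustrative only.
STUDY_TYPE_WEIGHT = {
    "systematic review": 1.0,
    "meta-analysis": 1.0,
    "randomized controlled trial": 0.9,
    "cohort study": 0.6,
    "case report": 0.3,
}

def evidence_weighted_score(cos_sim: float, study_type: str) -> float:
    # Scale the reranker's similarity score by the design's evidence weight.
    return cos_sim * STUDY_TYPE_WEIGHT.get(study_type.lower(), 0.5)
```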

1

farmingvillein t1_j0ifmkt wrote

This also would probably be a good way to gather data on where the model may not be working.

If a relatively recent systematic review is giving a different result than a contemporaneous and/or older set of papers, it is probably (would need to verify this empirically) more likely that something is being processed incorrectly.

(Reviews obviously also aren't perfect--but my guess is that you'd find that they are pretty robust indicators of something being off.)

1

rjtannous t1_j0xs168 wrote

I am curious whether running queries on this is subject to the $0.12/1K-token rate?

1

elbiot t1_j1tjpg7 wrote

The source I found this post through also referenced Retrieval-Augmented Generation (https://ai.facebook.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/), and it seems like they've integrated document selection into the backpropagation of model training. You couldn't do this with ChatGPT, but maybe a smaller pretrained LLM that can be fine-tuned on consumer hardware would be enough for just that part.
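For anyone who wants to poke at it, pretrained RAG checkpoints are available through Hugging Face transformers; a minimal sketch (this loads Facebook's public facebook/rag-token-nq model with a dummy index, nothing specific to OP's setup):

```python
from transformers import RagRetriever, RagTokenForGeneration, RagTokenizer

tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
# use_dummy_dataset avoids downloading the full Wikipedia index for a quick test
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

inputs = tokenizer("What is a first-line antibiotic for a UTI?", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```

Fine-tuning for this use case would swap the dummy index for a custom one built from medical abstracts.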

1

Dupuytren t1_j1u8zyr wrote

What approach do you recommend for using machine learning on a local collection of thousands of relevant full-text publications rather than scraping PubMed?

1

race2tb t1_j0ho83k wrote

I think you're wasting your time: LLMs are too dumb for this, and the risk/reward is bad. Leave the medical questions to doctors. What doctors really need is tools to help them with administrative tasks, so they spend more time being doctors and less time and expense on admin. I think that's where current LLMs would be helpful.

−1