  • Hongjian Zhou

Newsletter from The Neural Medwork: Issue 8



Welcome back to the 8th newsletter of The Neural Medwork! In this issue, we will discuss the Naive Bayes classifier with a dive into probabilities and predictions. Next, we introduce a paper discussing diagnostic reasoning prompts with large language models. Finally, we show another simple yet effective prompting technique - knowledge prompting.


Core Concept: Naïve Bayes Classifier

Welcome back to our exploration of artificial intelligence (AI) in healthcare. After delving into the intricacies of KNN algorithms in our last discussion, today we pivot to a different yet equally fascinating type of AI algorithm: the Naive Bayes classifier. This powerful tool is a testament to the versatility and depth of AI applications in the medical field, offering insights and efficiencies that were once beyond our reach.

The Naive Bayes classifier belongs to the family of supervised learning models, which means it learns from a dataset containing inputs along with their corresponding outputs. At its heart, Naive Bayes is grounded in Bayes' Theorem, a principle that calculates the probability of an event based on prior knowledge of conditions related to the event. What sets the Naive Bayes classifier apart is its assumption of independence among predictors. In simpler terms, it naively assumes that the presence (or absence) of a particular feature in a class is unrelated to the presence (or absence) of any other feature.
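In symbols, the idea can be written roughly as follows. The notation is ours rather than the newsletter's: C stands for the class (for example, diabetes or no diabetes) and x_1, ..., x_n for the patient's features.

```latex
% Bayes' theorem for a class C given features x_1, ..., x_n
P(C \mid x_1, \dots, x_n) = \frac{P(C)\, P(x_1, \dots, x_n \mid C)}{P(x_1, \dots, x_n)}

% Naive assumption: features are independent of each other given the class
P(x_1, \dots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C)

% Combined: the score used to compare classes
P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

The denominator is the same for every class, so it can be ignored when the only goal is to pick the most probable class.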

This simplicity is not a drawback but a strength, making Naive Bayes fast, efficient, and particularly suited for large datasets where the relationships between features are complex or unknown.

As always, the best way to understand such algorithms is by working through a healthcare example. Imagine we're faced with a dataset from a recent health survey, containing patient records with various attributes (age, weight, blood pressure, cholesterol levels, etc.) and a binary outcome: whether the patient developed a specific condition, say, Type 2 Diabetes. Our goal is to predict the likelihood of new or unseen patients developing this condition based on their health attributes. Here is how we can apply the Naïve Bayes Algorithm to this scenario.

  1. Data Preparation: We start by organizing our dataset into a training set (to teach the Naive Bayes model) and a test set (to evaluate its predictions). Each record in our training set includes the patient's attributes (e.g., age, weight) and whether they were diagnosed with Type 2 Diabetes.

  2. Model Training: The Naive Bayes classifier is trained on the training set. For each outcome (diabetes or no diabetes), it estimates how likely each attribute value is among patients with that outcome (e.g., the probability of falling in a certain age range or weight category given that the patient has diabetes). It also calculates the overall probability of developing diabetes and the probability of not developing it, based on the training data.

  3. Making Predictions: When a new patient's data is entered into the model, Naive Bayes uses the probabilities it learned during training to predict the likelihood of this patient developing Type 2 Diabetes. For each outcome, it multiplies the overall probability of that outcome by the probability of each of the patient's attributes given that outcome, and then classifies the patient according to whichever result (diabetes or no diabetes) is higher.

  4. Interpreting Results: The output is a probabilistic assessment of risk. For instance, the model might indicate that a patient with a certain profile has a 70% likelihood of developing Type 2 Diabetes. This information can be invaluable for preventive measures or early interventions.
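The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a clinical tool: the eight training records, the category names, and the helper functions `train` and `predict_proba` are all hypothetical, and the features are treated as simple categories with add-one (Laplace) smoothing.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (age_group, weight_category, blood_pressure) -> outcome.
TRAIN = [
    (("over_50", "obese", "high"), "diabetes"),
    (("over_50", "obese", "normal"), "diabetes"),
    (("over_50", "normal", "high"), "diabetes"),
    (("under_50", "obese", "high"), "diabetes"),
    (("under_50", "normal", "normal"), "no_diabetes"),
    (("under_50", "normal", "high"), "no_diabetes"),
    (("over_50", "normal", "normal"), "no_diabetes"),
    (("under_50", "obese", "normal"), "no_diabetes"),
]


def train(records):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(label for _, label in records)
    # value_counts[label][feature_index][value] -> number of matching records
    value_counts = defaultdict(lambda: defaultdict(Counter))
    # Distinct values seen for each feature (needed for smoothing).
    vocab = defaultdict(set)
    for features, label in records:
        for i, value in enumerate(features):
            value_counts[label][i][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab


def predict_proba(model, features):
    """Return P(class | features) for every class, with add-one smoothing."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(class)
        for i, value in enumerate(features):
            # Smoothed likelihood P(feature_i = value | class).
            score *= (value_counts[label][i][value] + 1) / (n_label + len(vocab[i]))
        scores[label] = score
    norm = sum(scores.values())  # normalize so the probabilities sum to 1
    return {label: s / norm for label, s in scores.items()}


model = train(TRAIN)
probs = predict_proba(model, ("over_50", "obese", "high"))
prediction = max(probs, key=probs.get)
print(probs, prediction)
```

With many features, multiplying small probabilities can underflow, so practical implementations usually add log-probabilities instead. Continuous attributes such as blood pressure are often modeled with a Gaussian variant rather than binned categories.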

The Naive Bayes classifier, with its ability to handle vast amounts of data and provide probabilistic predictions, is an excellent tool for healthcare professionals. It aids in risk assessment, decision-making, and developing personalized treatment plans, ultimately leading to improved patient outcomes.


Relevant Research Paper

Title: "Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine"


This study, published in npj Digital Medicine (2024), explores the capability of Large Language Models (LLMs), specifically GPT-3.5 and GPT-4, to mimic clinical reasoning in forming diagnoses. A significant barrier to LLM adoption in healthcare is the perceived lack of interpretable methods that align with clinicians' cognitive processes. By developing diagnostic reasoning prompts, the authors investigate whether LLMs can accurately and interpretably form diagnoses, potentially offering physicians a means to evaluate LLM responses for patient care.


The primary objective is to assess if GPT models can imitate clinical reasoning using specialized instructional prompts without compromising diagnostic accuracy. This could address the "black box" limitations of LLMs, bringing them closer to practical, trustworthy application in medicine.


  • LLM Prompt Development: Iterative prompt engineering was employed, focusing on different clinical reasoning strategies: traditional chain of thought, differential diagnosis, intuitive reasoning, analytic reasoning, and Bayesian reasoning.

  • Case Sources: The study utilized a modified MedQA USMLE dataset for initial testing, covering Step 2 and Step 3 questions focused on diagnosis. Additionally, GPT-4's performance was evaluated against the NEJM Case Records series.

  • Evaluation: LLM responses were evaluated for diagnostic accuracy against a test set of 518 MedQA questions and 300 NEJM cases, using both traditional and diagnostic reasoning prompts.

Diagnostic Reasoning Prompts

  1. Traditional Chain of Thought (CoT): Simple step-by-step reasoning.

  2. Differential Diagnosis CoT: Forming a differential diagnosis list followed by step-by-step deduction to identify the correct diagnosis.

  3. Intuitive Reasoning CoT: Employing symptom-sign-laboratory associations to deduce the diagnosis.

  4. Analytic Reasoning CoT: Using pathophysiology to deduce the diagnosis logically.

  5. Bayesian Reasoning CoT: Applying Bayesian inference to update prior probabilities based on new information to deduce the diagnosis.
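To make the Bayesian reasoning strategy more concrete, here is a minimal sketch of a single Bayesian update for a diagnostic test. It illustrates the general rule only and is not the prompt wording or method used in the paper; the function name `posterior_probability` and all the numbers are hypothetical.

```python
def posterior_probability(prior, sensitivity, specificity, test_positive=True):
    """Update a pre-test probability with one test result via Bayes' theorem."""
    if test_positive:
        p_result_given_disease = sensitivity
        p_result_given_healthy = 1.0 - specificity
    else:
        p_result_given_disease = 1.0 - sensitivity
        p_result_given_healthy = specificity
    numerator = prior * p_result_given_disease
    evidence = numerator + (1.0 - prior) * p_result_given_healthy
    return numerator / evidence


# Hypothetical numbers: 10% pre-test probability, 90% sensitivity, 80% specificity.
post_positive = posterior_probability(0.10, 0.90, 0.80, test_positive=True)
post_negative = posterior_probability(0.10, 0.90, 0.80, test_positive=False)
print(round(post_positive, 3), round(post_negative, 3))
```

In this made-up case a positive result raises the probability from 10% to about 33%, while a negative result lowers it to roughly 1%. The posterior from one finding can then serve as the prior for the next.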


  • GPT-3.5: Showed improved accuracy using intuitive reasoning prompts over traditional CoT, but performed worse with analytic reasoning and differential diagnosis prompts. GPT-3.5 was 46% correct for CoT, 48% for intuitive reasoning, 40% for analytic reasoning, 38% for differential diagnosis, and 42% for Bayesian inference.

  • GPT-4: Demonstrated overall improved accuracy across all prompts compared to GPT-3.5. Traditional CoT and diagnostic reasoning prompts showed similar performance, indicating GPT-4's ability to mimic clinical reasoning processes. GPT-4 was 76% correct for CoT, 77% for intuitive reasoning, 78% for analytic reasoning, 78% for differential diagnosis, and 72% for Bayesian inference.
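The comparison against traditional CoT is easier to see as percentage-point differences. The short sketch below uses only the accuracy figures quoted in the two bullets above; the dictionary layout and the helper `delta_vs_cot` are our own. These raw differences say nothing about statistical significance.

```python
# Accuracy (%) per prompt type, as quoted in the summary above.
ACCURACY = {
    "GPT-3.5": {"CoT": 46, "intuitive": 48, "analytic": 40, "differential": 38, "Bayesian": 42},
    "GPT-4": {"CoT": 76, "intuitive": 77, "analytic": 78, "differential": 78, "Bayesian": 72},
}


def delta_vs_cot(scores):
    """Percentage-point difference of each reasoning prompt from traditional CoT."""
    baseline = scores["CoT"]
    return {name: acc - baseline for name, acc in scores.items() if name != "CoT"}


deltas = {llm_name: delta_vs_cot(scores) for llm_name, scores in ACCURACY.items()}
for llm_name, d in deltas.items():
    print(llm_name, d)
```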


The study concludes that GPT-4 can effectively imitate clinical reasoning processes through specialized diagnostic reasoning prompts, providing interpretable rationales that align with physicians' cognitive processes. This signifies a step towards making LLM responses more trustworthy and interpretable for clinical use. However, it also highlights that despite GPT-4's advancements, the application of clinical reasoning in LLMs does not enhance accuracy as it would for a human provider, suggesting inherent differences in reasoning mechanisms between humans and LLMs.


This research underlines the potential of prompt engineering in enhancing the interpretability and utility of LLMs like GPT-4 in clinical settings. By leveraging diagnostic reasoning prompts, clinicians may better assess the trustworthiness of LLM-generated diagnoses, mitigating the "black box" nature of these models. This approach represents a promising avenue for integrating advanced AI technologies into healthcare, offering a method to harness LLM capabilities in a manner that aligns with clinical reasoning and decision-making processes.

Savage, T., Nayak, A., Gallo, R. et al. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. npj Digit. Med. 7, 20 (2024).


Tips and Tricks: Mastering Knowledge Prompting in Healthcare

As advancements in Large Language Models (LLMs) continue, new techniques emerge to enhance their capabilities, including knowledge prompting, introduced by Liu et al. (2022). Because LLMs can incorporate and leverage knowledge and context in their responses, this method has an LLM such as ChatGPT first generate relevant knowledge, which is then incorporated into the prompt, enabling the model to tackle tasks that require commonsense reasoning.

What is Knowledge Prompting: Knowledge prompting involves asking the model to generate knowledge that can be used as part of the prompt. By prompting the model with the necessary background information or context, healthcare professionals can guide LLMs to provide more informed and relevant responses.

Practical Example:

Consider using ChatGPT to diagnose a patient with abdominal pain. A knowledge prompt might include:

"Given the following symptoms of abdominal pain: [Symptom 1], [Symptom 2], [Symptom 3]. Generate knowledge about potential underlying causes, such as appendicitis, diverticulitis, or pancreatitis. Now, consider a patient presenting with these symptoms. What is the most likely diagnosis based on the generated knowledge?"

In this scenario, the model uses the generated knowledge to understand the symptoms, consider differential diagnoses, and make an informed diagnosis. This approach leverages the LLM's ability to learn and apply specialized medical knowledge, supporting healthcare professionals in their diagnostic decision-making and treatment planning.
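The same idea can be split into two explicit model calls: one to generate knowledge and one to answer with that knowledge in the prompt. The sketch below is a hedged illustration rather than a reference implementation. The `llm` argument is a hypothetical callable that takes a prompt string and returns text, `fake_llm` is a stand-in so the example runs offline, and the prompt wording is our own.

```python
from typing import Callable, List


def knowledge_prompt(symptoms: List[str]) -> str:
    """Step 1: ask the model to generate background knowledge."""
    return (
        "Generate concise medical knowledge about potential underlying causes of "
        "abdominal pain with the following symptoms: " + ", ".join(symptoms) + "."
    )


def answer_prompt(symptoms: List[str], knowledge: str) -> str:
    """Step 2: feed the generated knowledge back in as context for the question."""
    return (
        "Knowledge: " + knowledge + "\n"
        "Patient symptoms: " + ", ".join(symptoms) + "\n"
        "Based on the knowledge above, what is the most likely diagnosis?"
    )


def diagnose(symptoms: List[str], llm: Callable[[str], str]) -> str:
    """Run both steps: generate knowledge, then answer using it."""
    knowledge = llm(knowledge_prompt(symptoms))
    return llm(answer_prompt(symptoms, knowledge))


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns canned text."""
    if prompt.startswith("Generate"):
        return "Appendicitis often presents with right lower quadrant pain and fever."
    return "Most likely diagnosis: appendicitis."


result = diagnose(["right lower quadrant pain", "fever", "nausea"], fake_llm)
print(result)
```

Keeping the two steps separate makes it possible to read, and if needed correct, the generated knowledge before it influences the final answer.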

Thanks for tuning in,

Sameer & Michael

