Fudan University team releases a Chinese medical and health personal assistant and open-sources a high-quality dataset of 470,000 samples

The model shows clear advantages in evaluations of both single-turn QA and multi-turn dialogue for medical and health consultation.

With the rise of telemedicine, online consultation has increasingly become the first choice for patients seeking convenient and efficient medical support. Recently, large language models (LLMs) have demonstrated powerful natural language interaction capabilities, raising hopes that AI health assistants can enter people's everyday lives.

Medical and health consultation scenarios are usually complex: a personal assistant needs rich medical knowledge, the ability to understand a patient's intent over multiple turns of dialogue, and the ability to give professional, detailed responses. Faced with medical consultations, general-purpose LLMs often evade the question or answer incorrectly due to a lack of medical knowledge; they also tend to wrap up the consultation within the current turn, lacking adequate multi-turn questioning ability. In addition, high-quality Chinese medical datasets are currently very rare, which makes training strong medical LLMs challenging.

The Fudan University Data Intelligence and Social Computing Laboratory (FudanDISC) has released DISC-MedLLM, a Chinese medical and health personal assistant. In evaluations of single-turn QA and multi-turn dialogue for medical consultation, the model shows clear advantages over existing large medical dialogue models. The research team also released DISC-Med-SFT, a high-quality supervised fine-tuning (SFT) dataset containing 470,000 samples. The model weights and a technical report are open-sourced as well.

  • Homepage:
  • GitHub:
  • Technical report:

1. Examples

Figure 1: Dialogue Example

When patients feel unwell, they can describe their symptoms to the model, which suggests possible causes and recommended treatment options as a reference. When information is insufficient, the model proactively asks for a more detailed description of the symptoms.

Figure 2: Dialogue in a consultation scenario

Users can also ask the model specific consultation questions based on their own health conditions; the model gives detailed, helpful answers and proactively asks follow-up questions when information is lacking, improving the relevance and accuracy of its responses.

Figure 3: Dialogue on consultation about one's own health

Users can also ask about medical knowledge unrelated to their own situation, in which case the model answers as professionally as possible so that users gain a comprehensive and accurate understanding.

Figure 4: Dialogue on medical knowledge unrelated to the user

2. Introduction to DISC-MedLLM

DISC-MedLLM is a large medical model trained on top of Baichuan-13B, a general-domain Chinese LLM, using our high-quality dataset DISC-Med-SFT. Notably, our training data and training methods can be adapted to any base LLM.
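For readers who want to try the released weights, below is a minimal inference sketch. It assumes the weights are published on Hugging Face under a repo ID like Flmc/DISC-MedLLM; the repo ID, prompt, and generation settings are illustrative assumptions, not taken from the report.

```python
# A minimal inference sketch; the Hugging Face repo ID and generation
# settings are illustrative assumptions, not taken from the report.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Flmc/DISC-MedLLM"  # assumed repo ID
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

# "I keep feeling dizzy and slightly nauseous lately; what could be the cause?"
prompt = "最近总是头晕，还伴有轻微恶心，可能是什么原因？"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```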

DISC-MedLLM has three key features:

  • Reliable and rich medical knowledge. We use a medical knowledge graph as the information source, sample triples from it, and leverage the language capabilities of a general-purpose LLM to construct dialogue samples.
  • Multi-turn inquiry ability. We use real consultation dialogue records as the information source and employ an LLM to reconstruct the dialogues; during construction, the model is required to stay fully aligned with the medical information in the original dialogue.
  • Responses aligned with human preferences. Patients hope to receive richer supporting information and background knowledge during a consultation, but human doctors' answers are often terse; through manual screening, we construct high-quality, small-scale instruction samples aligned with patients' needs.

The strengths of the model and the data construction framework are shown in Figure 5. We computed the real distribution of patients from real consultation scenarios to guide sample construction and, based on the medical knowledge graph and real consultation data, used two ideas to build the dataset: LLM-in-the-loop and human-in-the-loop.

Figure 5: Structure of DISC-Med-SFT

3. Method: Construction of the DISC-Med-SFT dataset

In the process of model training, we supplemented DISC-Med-SFT with general-domain datasets and data samples from existing corpora, forming DISC-Med-SFT-ext; details are presented in Table 1.

Table 1: Contents of DISC-Med-SFT-ext

Reconstructed AI Doctor-Patient Dialogues

Datasets. We randomly selected 400,000 and 20,000 samples from the two public datasets MedDialog and cMedQA2, respectively, as source samples for SFT dataset construction.

Reconstruction. To adapt real-world doctor responses into high-quality responses with a unified format, we use GPT-3.5 to carry out the reconstruction. The prompt requires the rewriting to follow these principles (a prompting sketch follows the list):

  • Remove colloquial expressions, unify the phrasing, and correct inconsistencies in doctors' language use.
  • Retain the key information in the original doctor's answer and add appropriate explanations to make it more comprehensive and logical.
  • Rewrite or delete responses an AI doctor should not send, such as asking patients to make an appointment.
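A minimal sketch of this rewriting step is shown below; the prompt wording is illustrative, since the paper's exact prompt is not reproduced here.

```python
# A minimal sketch of the GPT-3.5 rewriting step; the instructions below are
# illustrative, not the paper's actual prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

REWRITE_INSTRUCTIONS = """You are rewriting a doctor's reply for an AI medical assistant.
1. Remove colloquial expressions and unify the phrasing.
2. Keep the key medical information and add brief explanations where helpful.
3. Rewrite or drop content an AI assistant should not send (e.g., booking appointments).
Return only the rewritten reply."""

def rewrite_reply(patient_msg: str, doctor_reply: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTIONS},
            {"role": "user", "content": f"Patient: {patient_msg}\nDoctor: {doctor_reply}"},
        ],
        temperature=0.7,
    )
    return resp.choices[0].message.content
```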

Figure 6 shows a reconstruction example. The adjusted answer is consistent with the identity of an AI medical assistant: it preserves the key information provided by the original doctor while offering the patient more comprehensive help.

Figure 6: Example of dialogue rewriting

Knowledge Graph QA Pairs

A medical knowledge graph contains a large amount of well-organized medical expertise, from which QA training samples with little noise can be generated. Based on CMeKG, we sampled the knowledge graph according to the department information of disease nodes and used appropriately designed GPT-3.5 prompts to generate more than 50,000 diverse medical-scenario dialogue samples in total.
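The sketch below illustrates the idea: sample triples weighted by the real patient department distribution, then turn each triple into a generation prompt. The triple format, department weights, and prompt wording are illustrative assumptions, not CMeKG's actual schema.

```python
# A minimal sketch of sampling KG triples by department and turning them into
# generation prompts; triples, weights, and the prompt are illustrative.
import random

# hypothetical (head, relation, tail) triples grouped by department
triples_by_dept = {
    "cardiology": [("高血压", "常见症状", "头晕"), ("冠心病", "治疗药物", "阿司匹林")],
    "neurology": [("偏头痛", "诱发因素", "睡眠不足")],
}

def sample_triples(dept_weights, k):
    """Sample k triples, weighting departments by the real patient distribution."""
    depts = random.choices(list(dept_weights), weights=list(dept_weights.values()), k=k)
    return [random.choice(triples_by_dept[d]) for d in depts]

def to_prompt(triple):
    head, rel, tail = triple
    # "Given the medical fact (head, relation, tail), generate a Chinese
    # patient-doctor dialogue grounded in it."
    return f"根据医学事实（{head}，{rel}，{tail}），生成一段患者提问、医生回答的中文医疗对话。"

for t in sample_triples({"cardiology": 0.7, "neurology": 0.3}, k=3):
    print(to_prompt(t))  # each prompt would then be sent to GPT-3.5
```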

Behavioral Preference Dataset

In the final stage of training, to further improve performance, we use a dataset that better matches human behavioral preferences for a second round of supervised fine-tuning. About 2,000 high-quality, diverse samples were manually selected from MedDialog and cMedQA2. After rewriting several of these examples with GPT-4 and revising them manually, we provided them as few-shot exemplars to GPT-3.5 to generate the high-quality behavioral preference dataset.
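A minimal few-shot sketch of this step follows; the system instruction and exemplar stand in for the GPT-4-rewritten, manually revised examples described above and are not the actual prompt.

```python
# A minimal few-shot sketch; the instruction and exemplar below are
# placeholders for the manually revised, GPT-4-rewritten examples.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# One illustrative exemplar: terse doctor reply -> preference-aligned reply
FEW_SHOT = [
    {"role": "user", "content": "患者：最近血压偏高怎么办？\n原回答：注意休息。"},
    {"role": "assistant", "content": "建议您低盐饮食、规律作息并坚持适量运动，同时每天定时测量血压；若持续偏高，请及时到心内科就诊。"},
]

def rewrite_with_preference(patient_msg, original_reply):
    messages = (
        [{"role": "system", "content": "参考示例风格，把简短的医生回答改写为信息更丰富、更贴合患者需求的回答。"}]
        + FEW_SHOT
        + [{"role": "user", "content": f"患者：{patient_msg}\n原回答：{original_reply}"}]
    )
    resp = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    return resp.choices[0].message.content
```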

Other Data

General data. To enrich the diversity of the training set and reduce the risk of degrading the model's general capabilities during the SFT stage, we randomly selected samples from two common supervised fine-tuning datasets, moss-sft-003 and alpaca_gpt4_data_zh.
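A minimal sketch of this mixing step, assuming the general-domain corpora are stored as JSONL files (file names and the sample count are illustrative):

```python
# A minimal sketch of mixing in general-domain SFT samples; the file names
# and sample count are illustrative, not the report's settings.
import json
import random

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

general = load_jsonl("moss-sft-003.jsonl") + load_jsonl("alpaca_gpt4_data_zh.jsonl")
random.seed(0)
mixed_extra = random.sample(general, k=50_000)  # illustrative count
```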

MedMCQA. To enhance the model's question-answering ability, we use MedMCQA, an English medical multiple-choice dataset; with GPT-3.5 we refine the questions and correct answers from the multiple-choice items to generate about 8,000 professional Chinese medical QA samples.
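A minimal sketch of this conversion, with an illustrative prompt (the paper's exact wording is not reproduced):

```python
# A minimal sketch of converting a MedMCQA item into a Chinese QA sample;
# the prompt is illustrative, not the paper's actual wording.
from openai import OpenAI

client = OpenAI()

def mcq_to_chinese_qa(question, options, answer_idx):
    prompt = (
        "把下面的英文医学选择题改写成一个中文问答样本：先给出自然的中文提问，"
        "再基于正确选项写出专业、完整的中文回答。\n"
        f"题目：{question}\n选项：{options}\n正确选项：{options[answer_idx]}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```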

4. Experiments

Training. As shown in the figure below, the training of DISC-MedLLM is divided into two SFT stages.

Figure 7: Two-stage training process
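The schedule can be summarized as in the sketch below: a first large-scale SFT pass over the full data mixture, followed by a small second pass on the behavioral preference data. Dataset names and epoch counts are illustrative placeholders, not the report's exact settings.

```python
# A minimal sketch of the two-stage SFT schedule; dataset names and epoch
# counts are illustrative placeholders, not the report's exact settings.
from dataclasses import dataclass

@dataclass
class SFTStage:
    name: str
    datasets: list
    epochs: int

stages = [
    # Stage 1: large-scale SFT on reconstructed dialogues, KG QA, and general data
    SFTStage("stage1", ["reconstructed_dialogues", "kg_qa", "general"], epochs=1),
    # Stage 2: small-scale SFT on the behavioral preference data
    SFTStage("stage2", ["behavioral_preference"], epochs=1),
]

def run_sft(base_model, stages):
    model = base_model  # e.g., "Baichuan-13B-Base"
    for s in stages:
        print(f"Fine-tuning {model} on {s.datasets} for {s.epochs} epoch(s)")
        model = f"{model}+{s.name}"  # placeholder for an actual SFT run
    return model

run_sft("Baichuan-13B-Base", stages)
```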

Evaluation. The performance of medical LLMs is evaluated in two scenarios: single-turn QA and multi-turn dialogue.

  1. Single-turn QA evaluation: To assess the accuracy of the model's medical knowledge, we extracted over 1,500 multiple-choice questions from the Chinese National Medical Licensing Examination (NMLEC) and the National Entrance Examination for Postgraduates (NEEP) Western Medicine 306 subject, and evaluated the model's performance in single-turn QA.
  2. Multi-turn dialogue evaluation: To systematically evaluate the model's conversational ability, we randomly sampled cases from three public datasets, the Chinese Medical Benchmark Evaluation (CMB-Clin), the Chinese Medical Dialogue Dataset (CMD), and the Chinese Medical Intent Dataset (CMID), had GPT-3.5 play the patient in a dialogue with the model, and scored four metrics with GPT-4: proactivity, accuracy, helpfulness, and language quality (a judging sketch follows this list).
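Below is a minimal sketch of the GPT-4-as-judge scoring referenced in item 2; the rubric wording is illustrative, not the paper's actual judging prompt.

```python
# A minimal GPT-4-as-judge sketch; the rubric is illustrative, not the
# paper's actual judging prompt.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "请从四个维度对下面这段医患对话中模型（医生角色）的表现打 1-5 分："
    "主动性、准确性、帮助性、语言质量。"
    '只以 JSON 返回，例如 {"proactivity": 4, "accuracy": 5, "helpfulness": 4, "language_quality": 5}。'
)

def judge(dialogue):
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": dialogue},
        ],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)
```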

Evaluation results

Baseline models. We compare our model with three general-purpose LLMs (OpenAI's GPT-3.5 and GPT-4, and Baichuan-13B-Chat) and two Chinese medical conversational LLMs (BianQue-2 and HuatuoGPT-13B).

Single-turn QA results. The overall multiple-choice results are shown in Table 2. GPT-3.5 holds a clear lead. DISC-MedLLM ranks second in the few-shot setting and third, behind Baichuan-13B-Chat, in the zero-shot setting. Notably, it outperforms HuatuoGPT (13B), which was trained with reinforcement learning.

Table 2: Multiple-choice question evaluation results

Multi-turn dialogue results. In the CMB-Clin evaluation, DISC-MedLLM achieves the highest composite score, followed by HuatuoGPT. Our model scores highest on the proactivity criterion, highlighting the effectiveness of our training approach's bias toward medical behavioral patterns. The results are shown in Table 3.

Table 3: CMB-Clin results

On the CMD samples, as shown in Figure 8, GPT-4 obtains the highest score, followed by GPT-3.5. The medical-domain models DISC-MedLLM and HuatuoGPT achieve the same overall performance score, with each performing strongly in different departments.

Figure 8: CMD results

The CMID results are similar to CMD's, as shown in Figure 9: GPT-4 and GPT-3.5 maintain the lead. Apart from the GPT series, DISC-MedLLM performs best, outperforming HuatuoGPT on the three intents of disease, treatment regimen, and drug.

Figure 9: CMID results

The inconsistent performance of the models between CMB-Clin and CMD/CMID may stem from the different data distributions of the three datasets. CMD and CMID contain more explicit question samples: patients may already have a diagnosis and express clear needs when describing symptoms, and their questions and needs may even be unrelated to their personal health status. The general-purpose models GPT-3.5 and GPT-4, which excel in many respects, are better at handling such cases.

5. Summary

The DISC-Med-SFT dataset leverages the strengths of real-world dialogues and general-domain LLMs, with targeted enhancements in three aspects: domain knowledge, medical dialogue skills, and human preference. Trained on this high-quality dataset, the resulting medical LLM DISC-MedLLM achieves significant improvements in medical interaction, shows high usability, and exhibits great application potential.

Research in this field offers prospects for reducing the cost of online medical care and promoting a more balanced distribution of medical resources. DISC-MedLLM can bring convenient and personalized medical services to more people and contribute to the cause of universal health.
