Fudan University team releases a Chinese medical and health personal assistant and open-sources a high-quality dataset of 470,000 samples
With the rise of telemedicine, online inquiry and consultation have increasingly become the first choice for patients seeking convenient and efficient medical support. Recently, large language models (LLMs) have demonstrated powerful natural language interaction capabilities, raising hopes that medical and health assistants can enter people's daily lives.
Medical and health consultation scenarios are usually complex. A personal assistant needs rich medical knowledge, the ability to understand the patient's intent over multiple rounds of dialogue, and the ability to give professional and detailed responses. When faced with medical and health consultations, general-purpose language models often give evasive or irrelevant answers because they lack medical knowledge. They also tend to wrap up the consultation within the current round of questions and lack a satisfactory ability to ask follow-up questions over multiple rounds. In addition, high-quality Chinese medical datasets are currently very rare, which makes it challenging to train powerful language models for the medical field.
The Fudan University Data Intelligence and Social Computing Laboratory (FudanDISC) has released a Chinese medical and health personal assistant, DISC-MedLLM. In medical and health consultation evaluations covering single-round question answering and multi-round dialogue, the model shows clear advantages over existing large medical dialogue models. The research team also released a high-quality supervised fine-tuning (SFT) dataset, DISC-Med-SFT, containing 470,000 samples. The model parameters and technical report are also open source.
1. Sample display
When patients feel unwell, they can describe their symptoms to the model. The model will give possible causes and recommended treatment options as a reference. When information is lacking, the model will actively ask for a more detailed description of the symptoms.
Users can also ask the model specific consultation questions based on their own health conditions, and the model will give detailed and helpful answers, and proactively ask questions when information is lacking to enhance the pertinence and accuracy of the responses.
Users can also ask about medical knowledge that has nothing to do with their own condition. In that case, the model answers as professionally as possible so that users can understand the topic comprehensively and accurately.
2. Introduction to DISC-MedLLM
DISC-MedLLM is a large medical model obtained by training the general-domain Chinese large model Baichuan-13B on our high-quality dataset DISC-Med-SFT. It is worth noting that our training data and training methods can be adapted to any base large model.
DISC-MedLLM has three key features:
The strengths of the model and the data construction framework are shown in Figure 5. We computed the real distribution of patients from real consultation scenarios to guide the sample construction of the dataset. Based on the medical knowledge graph and real consultation data, we used two approaches to construct the dataset: LLM-in-the-loop and human-in-the-loop.
3. Method: Construction of the data set DISC-Med-SFT
In the process of model training, we supplemented DISC-Med-SFT with general domain datasets and data samples from existing corpora to form DISC-Med-SFT-ext, the details of which are presented in Table 1.
Reconstructed AI doctor-patient dialogue
Dataset. 400,000 and 20,000 samples are randomly selected from two public datasets, MedDialog and cMedQA2, respectively, as source samples for SFT dataset construction.
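The random selection step can be sketched as follows. This is a minimal illustration rather than the team's actual pipeline: the function name `sample_sources`, the fixed seed, and the in-memory lists standing in for MedDialog and cMedQA2 are all assumptions. Only the two sample counts come from the text.

```python
import random


def sample_sources(meddialog, cmedqa2, n_meddialog=400_000, n_cmedqa2=20_000, seed=0):
    """Draw source samples without replacement from the two corpora.

    The default counts follow the article; the seed is an assumption added
    only to make the draw reproducible.
    """
    rng = random.Random(seed)
    # Cap each draw at the corpus size so small toy inputs also work.
    picked_a = rng.sample(meddialog, min(n_meddialog, len(meddialog)))
    picked_b = rng.sample(cmedqa2, min(n_cmedqa2, len(cmedqa2)))
    return picked_a + picked_b


# Toy stand-ins for the real corpora (hypothetical records).
toy_meddialog = [f"dialog-{i}" for i in range(10)]
toy_cmedqa2 = [f"qa-{i}" for i in range(5)]
sources = sample_sources(toy_meddialog, toy_cmedqa2, n_meddialog=4, n_cmedqa2=2)
print(len(sources))
```

Sampling without replacement keeps every source record unique, which matters when each record is later rewritten into exactly one training sample.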
Reconstruction. To adapt real-world doctor responses into the desired high-quality responses in a unified format, we use GPT-3.5 to carry out the reconstruction of this dataset. The prompt requires the rewrite to follow these principles:
Figure 6 shows an example of reconstruction. The adjusted doctor's answer is consistent with the identity of an AI medical assistant: it preserves the key information provided by the original doctor while giving patients more comprehensive help.
Knowledge graph question-answer pairs
The medical knowledge graph contains a large amount of well-organized medical expertise, from which less noisy QA training samples can be generated. Based on CMeKG, we sample the knowledge graph according to the department information of disease nodes, and use appropriately designed GPT-3.5 prompts to generate a total of more than 50,000 diverse medical-scenario dialogue samples.
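Sampling disease nodes by department might look roughly like the sketch below. The flat `(disease, department)` pair format, the per-department cap, and the function name `sample_by_department` are assumptions made for illustration. The article does not describe CMeKG's schema or the exact sampling scheme, so this should not be read as the team's method.

```python
import random
from collections import defaultdict


def sample_by_department(disease_nodes, per_department, seed=0):
    """Pick up to `per_department` disease nodes from each department.

    `disease_nodes` is a list of (disease_name, department) pairs; this flat
    shape is an assumption, not CMeKG's real schema.
    """
    rng = random.Random(seed)
    by_dept = defaultdict(list)
    for name, dept in disease_nodes:
        by_dept[dept].append(name)
    sampled = {}
    # Iterate departments in sorted order so the result is reproducible.
    for dept in sorted(by_dept):
        names = by_dept[dept]
        sampled[dept] = rng.sample(names, min(per_department, len(names)))
    return sampled


# Hypothetical toy nodes, not taken from the knowledge graph itself.
toy_nodes = [
    ("hypertension", "cardiology"),
    ("arrhythmia", "cardiology"),
    ("angina", "cardiology"),
    ("asthma", "respiratory"),
]
picked = sample_by_department(toy_nodes, per_department=2)
print({dept: len(names) for dept, names in picked.items()})
```

Grouping by department before sampling keeps small departments represented instead of letting the largest ones dominate the generated dialogues.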
Behavioral Preference Dataset
In the final stage of training, to further improve the model's performance, we use a dataset more closely aligned with human behavioral preferences for a second round of supervised fine-tuning. About 2,000 high-quality, diverse samples were manually selected from the two datasets MedDialog and cMedQA2. After several examples were rewritten by GPT-4 and manually revised, we supplied them to GPT-3.5 as few-shot examples to generate a high-quality behavioral preference dataset.
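The few-shot step can be illustrated by how such a prompt might be assembled. This is only a sketch under assumptions: the function name `build_few_shot_prompt`, the `Original:`/`Rewritten:` layout, and the placeholder texts are hypothetical, and the actual prompt and curated examples are not given in the article. No model call is made here.

```python
def build_few_shot_prompt(instruction, examples, new_dialogue):
    """Assemble a few-shot prompt: instruction, worked examples, then the new input."""
    parts = [instruction.strip(), ""]
    for i, (original, rewritten) in enumerate(examples, start=1):
        parts.append(f"Example {i}")
        parts.append(f"Original: {original}")
        parts.append(f"Rewritten: {rewritten}")
        parts.append("")
    # End with an open "Rewritten:" slot for the model to complete.
    parts.append("Original: " + new_dialogue)
    parts.append("Rewritten:")
    return "\n".join(parts)


# Hypothetical placeholder texts; the real curated examples are not public in the article.
examples = [
    ("<original doctor reply 1>", "<revised reply 1>"),
    ("<original doctor reply 2>", "<revised reply 2>"),
]
prompt = build_few_shot_prompt(
    "Rewrite the doctor's reply in the style of the examples.",
    examples,
    "<new doctor reply>",
)
print(prompt)
```

Keeping the examples in a fixed layout lets the same handful of manually revised samples steer every generated item toward the preferred response style.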
Other
General data. To enrich the diversity of the training set and reduce the risk of degrading the model's basic capabilities during the SFT training stage, we randomly selected several samples from two common supervised fine-tuning datasets, moss-sft-003 and alpaca_gpt4_data_zh.
MedMCQA. To enhance the model's question-answering ability, we chose MedMCQA, a multiple-choice dataset in the English medical field, refined the questions and correct answers of the multiple-choice items using GPT-3.5, and generated about 8,000 professional Chinese medical question-answer samples.
4. Experiment
Training. As shown in the figure below, the training process of DISC-MedLLM is divided into two SFT stages.
Evaluation. The performance of medical LLMs is evaluated in two scenarios, namely single-round QA and multi-round dialogue.
Evaluation results
Compared models. Our model is compared with three general LLMs and two Chinese medical conversational LLMs: OpenAI's GPT-3.5 and GPT-4, and Baichuan-13B-Chat, as well as BianQue-2 and HuatuoGPT-13B.
Single-round QA results. The overall results of the multiple-choice assessment are shown in Table 2. GPT-3.5 shows a clear lead. DISC-MedLLM ranks second in the few-shot setting and third, behind Baichuan-13B-Chat, in the zero-shot setting. Notably, we outperform HuatuoGPT (13B), which was trained in a reinforcement learning setting.
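As a rough illustration of how a multiple-choice assessment like this is typically scored, the sketch below computes accuracy over predicted option letters. It assumes accuracy is the metric and that each answer is a single letter. The function name and the toy predictions are hypothetical and are not results from the paper.

```python
def multiple_choice_accuracy(predictions, answers):
    """Fraction of questions where the predicted option letter matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    # Compare case-insensitively and ignore surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Hypothetical toy outputs, not results from the paper.
gold = ["A", "C", "D", "B"]
zero_shot_preds = ["A", "c", "B", "D"]
few_shot_preds = ["A", "C", "B", "B"]
zero_shot_acc = multiple_choice_accuracy(zero_shot_preds, gold)
few_shot_acc = multiple_choice_accuracy(few_shot_preds, gold)
print(zero_shot_acc, few_shot_acc)
```

Running the same scorer on zero-shot and few-shot outputs gives directly comparable numbers for the two settings.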
Multi-round dialogue results. In the CMB-Clin evaluation, DISC-MedLLM achieved the highest composite score, followed by HuatuoGPT. Our model scored highest on the proactivity criterion, highlighting the effectiveness of our training approach, which is biased towards medical behavioral patterns. The results are shown in Table 3.
On the CMD samples, as shown in Figure 8, GPT-4 obtained the highest score, followed by GPT-3.5. Among the medical-domain models, DISC-MedLLM and HuatuoGPT achieve the same overall performance score, and each stands out in different departments.
The situation of CMID is similar to that of CMD, as shown in Figure 9, where GPT-4 and GPT-3.5 maintain the lead. Except for the GPT series, DISC-MedLLM performed best. It outperformed HuatuoGPT in the three intents of disease, treatment regimen and drug.
The inconsistent performance of each model between CMB-Clin and CMD/CMID may be due to the different data distribution between the three datasets. CMD and CMID contain more explicit question samples, and patients may have obtained a diagnosis and expressed clear needs when describing symptoms, and the patient's questions and needs may even have nothing to do with their personal health status. The general-purpose models GPT-3.5 and GPT-4, which excel in many aspects, are better at handling this situation.
5. Summary
The DISC-Med-SFT dataset draws on the strengths of real-world dialogue and general-domain LLMs, and applies targeted enhancements in three aspects: domain knowledge, medical dialogue skills, and human preference. Trained on this high-quality dataset, the large medical model DISC-MedLLM achieves significant improvements in medical interaction, shows high usability, and demonstrates great application potential.
Research in this field will open up more possibilities for reducing online medical costs and promoting a more balanced distribution of medical resources. DISC-MedLLM will bring convenient and personalized medical services to more people and play a role in the cause of general health.