West China Medical Publishers
Search results for keyword "Large language models": 3 results
  • Evaluation of the accuracy of the large language model for risk of bias assessment in analytical studies

    Objective To systematically review the accuracy and consistency of large language models (LLMs) in assessing risk of bias in analytical studies. Methods Cohort and case-control studies related to COVID-19 were drawn from the team's previously published systematic review of the clinical characteristics of COVID-19. Two researchers independently screened the studies, extracted data, and assessed the risk of bias of the included studies; the LLM-based BiasBee model (non-RCT version) was used for automated evaluation. Kappa statistics and score differences were used to analyze the agreement between LLM and human evaluations, with subgroup analyses for Chinese- and English-language studies. Results A total of 210 studies were included. Meta-analysis showed that LLM scores were generally higher than those of human evaluators, particularly for representativeness of exposed cohorts (Δ=0.764) and selection of external controls (Δ=0.109). Kappa analysis indicated slight agreement on items such as exposure assessment (κ=0.059) and adequacy of follow-up (κ=0.093), while showing significant discrepancies on more subjective items such as control selection (κ=−0.112) and non-response rate (κ=−0.115). Subgroup analysis revealed higher scoring consistency for the LLM in English-language studies than in Chinese-language studies. Conclusion LLMs show potential for risk of bias assessment; however, notable differences from human raters remain on more subjective items. Future research should focus on optimizing prompt engineering and model fine-tuning to improve LLM accuracy and consistency in complex tasks.

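    To make the agreement statistics above concrete, here is a minimal sketch in Python, assuming hypothetical binary ratings on a single risk-of-bias item; the data, and the way the score difference Δ is formed, are illustrative and not taken from the study.

    ```python
    # Sketch: agreement between human and LLM risk-of-bias ratings (hypothetical data).
    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        """Chance-corrected agreement between two raters (Cohen's kappa)."""
        n = len(rater_a)
        # Observed agreement: fraction of items both raters scored identically.
        p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        # Expected chance agreement from each rater's marginal label frequencies.
        fa, fb = Counter(rater_a), Counter(rater_b)
        p_e = sum(fa[k] * fb[k] for k in fa) / n ** 2
        return (p_o - p_e) / (1 - p_e)

    # Hypothetical per-study scores on one item (1 = criterion met, 0 = not met).
    human = [1, 1, 0, 1, 0, 1, 1, 0]
    llm = [1, 1, 1, 1, 0, 0, 1, 1]
    print(f"kappa = {cohens_kappa(human, llm):.3f}")
    print(f"delta = {sum(llm) / len(llm) - sum(human) / len(human):.3f}")  # mean score difference
    ```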
  • Performance comparison of ChatGPT-4.5 and DeepSeek-V3 in rehabilitation guidance for knee osteoarthritis

    Objective To compare the performance of ChatGPT-4.5 and DeepSeek-V3 across five key domains of physical therapy for knee osteoarthritis (KOA), evaluating the accuracy, completeness, reliability, and readability of their responses and exploring their potential for clinical application. Methods Twenty-one core questions were extracted from 10 authoritative KOA rehabilitation guidelines published between September 2011 and January 2024, covering five task categories: rehabilitation assessment, physical agent modalities, exercise therapy, assistive device use, and patient education. Responses were generated with both ChatGPT-4.5 and DeepSeek-V3 and rated by four physical therapists, each with over five years of clinical experience, on Likert scales (5-point for accuracy and completeness; 7-point for reliability). Scale scores were compared between the two large language models, and language-style clustering was also performed. Results Most scale scores were not normally distributed and are presented as median (lower quartile, upper quartile). ChatGPT-4.5 outperformed DeepSeek-V3 in accuracy [4.75 (4.75, 4.75) vs. 4.75 (4.50, 5.00), P=0.018], completeness [4.75 (4.50, 5.00) vs. 4.25 (4.00, 4.50), P=0.006], and reliability [5.75 (5.50, 6.00) vs. 5.50 (5.50, 5.50), P=0.015]. Clustering of language styles showed that ChatGPT-4.5 had a more diverse linguistic style, whereas DeepSeek-V3's responses were more standardized. ChatGPT-4.5 scored higher than DeepSeek-V3 in lexical richness [4.792 (4.720, 4.912) vs. 4.564 (4.409, 4.653), P<0.001] but lower in syntactic richness [2.133 (2.072, 2.154) vs. 2.187 (2.154, 2.206), P=0.003]. Conclusions ChatGPT-4.5 demonstrated superior accuracy, completeness, and reliability, indicating a stronger capacity for task execution; it used a more diverse vocabulary and showed greater flexibility in language generation. DeepSeek-V3 exhibited greater syntactic richness and more standardized language. ChatGPT-4.5 is better suited to content-rich tasks that require detailed explanation, while DeepSeek-V3 is more appropriate for standardized question-answering applications.

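    The abstract does not define its lexical- and syntactic-richness measures, so the sketch below substitutes two common stand-ins: type-token ratio for lexical diversity and mean sentence length as a crude syntactic proxy. Both metrics and the sample answer are assumptions for illustration, not the study's method.

    ```python
    # Sketch: simple language-style metrics for a model's answer (illustrative proxies).
    import re

    def type_token_ratio(text):
        """Lexical-diversity proxy: distinct words / total words."""
        tokens = re.findall(r"[a-z']+", text.lower())
        return len(set(tokens)) / len(tokens) if tokens else 0.0

    def mean_sentence_length(text):
        """Crude syntactic proxy: average number of words per sentence."""
        sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        return sum(len(s.split()) for s in sents) / len(sents) if sents else 0.0

    answer = ("Quadriceps strengthening is recommended for knee osteoarthritis. "
              "Begin with isometric holds, then progress the load as tolerated.")
    print(f"TTR = {type_token_ratio(answer):.3f}")
    print(f"mean sentence length = {mean_sentence_length(answer):.1f} words")
    ```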
  • Interpreting the TRIPOD-LLM guideline: a reporting standard for large language model research in healthcare

    The burgeoning application of large language models (LLMs) in healthcare shows immense potential, yet it also poses new challenges for the standardization of research reporting. To enhance the transparency and reliability of medical LLM research, an international expert group published the TRIPOD-LLM reporting guideline in Nature Medicine in January 2025. As an extension of the TRIPOD+AI guideline, TRIPOD-LLM provides detailed reporting items tailored to the distinctive characteristics of LLMs, covering both general-purpose foundation models (e.g., GPT-4) and domain-specific fine-tuned models (e.g., Med-PaLM 2). It addresses critical aspects such as prompt engineering, inference parameters, generative evaluation, and fairness considerations. Notably, the guideline introduces a modular design and a "living guideline" mechanism. This paper provides a systematic, item-by-item interpretation and example-based analysis of the TRIPOD-LLM guideline. It is intended as a clear and practical handbook for researchers in the field, as well as for journal reviewers and editors who assess the quality of such studies, thereby fostering the high-quality development of medical LLM research in China.

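    As a practical illustration of one TRIPOD-LLM theme, transparent reporting of prompts and inference settings, here is a hypothetical sketch of a run record an author might archive alongside a study; the field names are illustrative and are not taken from the guideline's checklist.

    ```python
    # Sketch: archiving the inference settings a TRIPOD-LLM-style report describes.
    # All field names and values below are hypothetical examples.
    import json

    run_record = {
        "model": "gpt-4-0613",                    # exact model version, not just "GPT-4"
        "system_prompt": "You are a clinical guideline assistant.",
        "prompt_template": "Question: {question}\nAnswer:",
        "temperature": 0.0,                       # decoding / inference parameters
        "top_p": 1.0,
        "max_tokens": 512,
        "runs_per_item": 3,                       # repeated runs for consistency checks
        "accessed": "2025-01-15",
    }
    print(json.dumps(run_record, indent=2))
    ```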