No. | Item
    Standardized terminology used in REFINE
  • Training: The process of optimizing a model’s parameters using data and an objective function to minimize a defined loss metric through iterative gradient updates.
  • Pretraining: Training performed on large-scale, general, or weakly supervised datasets, often via self-supervised learning, to develop broad representations and foundational capabilities.
  • Post-training: The subsequent optimization phase that includes supervised fine-tuning, instruction tuning, task-specific or domain-specific adaptation, and alignment methods such as reinforcement learning from human feedback (RLHF), reinforcement learning from AI feedback (RLAIF), and direct preference optimization (DPO). This phase refines the pretrained model’s behavior toward human intent, safety, or specialized performance.
  • Fine-tuning: A targeted form of post-training focused on adapting model weights for a particular task, dataset, or domain to achieve an explicit objective such as classification accuracy or instruction adherence.
  • Inference-time adaptation: Adaptation without weight updates, achieved through context conditioning or external mechanisms such as prompting, retrieval-augmented generation, tool integration, or dynamic hyperparameter selection.
  • Alignment: Steering model behavior toward human intent or domain goals, achieved through weight-updating post-training methods such as RLHF, RLAIF, or DPO, or through inference-time methods such as prompting or retrieval.
  • Testing: Final, strictly unseen hold-out evaluation performed once, with no design choices informed by test feedback.
  • Validation: Any confirmation process that does not involve a held-out data partition (i.e., not a validation or testing split). Throughout this checklist, “validation” refers to verifying the correctness or adequacy of procedures; to avoid ambiguity, it is never used to denote a machine learning validation dataset.
Reported
  • Yes: fully reported
  • Partial: partly reported
  • No: not reported
  • N/A: not applicable
Location (if Yes/Partial)
1. Model Specification
1.1 Model name, vendor/developer, version/identifier, release date, and training/knowledge cutoff date
Report the full model identity. Specify the model name (e.g., GPT-5), vendor or developer (e.g., OpenAI or Stanford University), version or application programming interface (API) identifier (e.g., 5 or GPT-5-2025-08-07), and model release date. If an official release date is unavailable, report the date accessed and cite the model card or changelog version. Also report the training or knowledge cutoff date (if provided), which defines the latest point up to which the model was trained.
1.2 Model architecture and key characteristics
Report the model’s foundational architecture (e.g., transformer or state-space models) and any key design features (e.g., mixture-of-experts or diffusion-based models). Include parameter count if available. For multimodal models, specify the image encoder architecture (e.g., ResNet or vision transformer) and the fusion strategy (e.g., cross-attention) if available.
1.3 Model pretraining, post-training, and inference-time adaptation strategy
Describe how the model was developed or adapted for the study task. Indicate whether the study involved training a new foundation model (pretraining from scratch or continued pretraining) or post-training an existing one. For pretraining, summarize the corpus and objectives. For post-training, specify whether it involved tuning or alignment and whether model weights were updated. If weights were updated, state the method (e.g., supervised fine-tuning or reinforcement learning) and the technique (e.g., full fine-tuning or parameter-efficient fine-tuning), including key details such as dataset size, steps/epochs, batch size, and optimizer. If no weight updates were performed, describe the inference-time adaptation strategy (e.g., prompting or retrieval-augmented generation), and report any external tools or APIs used at inference. Indicate whether clinical data were included, and clearly specify the stage(s) at which they were used (pretraining, post-training, or inference-time adaptation).
1.4 Modality support details (input and output) and limitations
Report the input and output modalities supported by the model (e.g., text, image, audio, and video). Describe any technical limitations, such as maximum context length (e.g., up to 128k tokens), image resolution constraints (e.g., ≤1,024 × 1,024 pixels), patch size, or input-pipeline restrictions (e.g., DICOM files require prior conversion to image tensors; only 2D images are accepted).
1.5 Language capabilities
Report the languages in which the model has been evaluated or explicitly tested, and specify any known limitations or domain specialization (e.g., medical English, radiology-specific Turkish, or German for lay explanations). If applicable, indicate whether performance across languages was assessed or remains unverified.
1.6 Model access
Describe how the model was accessed during the study. Specify whether access was through a graphical user interface (e.g., chat platform) or an API. For API access, indicate whether it was locally hosted, securely hosted (e.g., an enterprise cloud), or publicly hosted.
1.7 Sharing of code, data, and model artifacts
State the availability of study materials that support reproducibility, including code repositories (e.g., GitHub), any artifacts such as processed datasets, model weights, or checkpoints (e.g., Hugging Face), and study datasets (e.g., Kaggle, Zenodo, or Hugging Face). Provide access links where applicable. If any component is not publicly available, include a clear statement of unavailability and the reason (e.g., provider restrictions, licensing limitations, or institutional policy).
1.8 Computational requirements
Report the computational resources required for model development and use. Specify hardware and resource needs for different stages, including training, fine-tuning, alignment, and inference (e.g., graphics processing unit/tensor processing unit type, number of compute nodes, memory requirements, runtime, or cloud compute specifications).
2. Prompt Design
2.1 Prompt engineering protocol with versioning
Report the protocol used for prompt engineering and the development process (e.g., iterative testing of prompt variants with predefined success metrics, human-in-the-loop review with inter-rater checks, or automated prompt optimization [e.g., DSPy]). Describe contributors involved (e.g., domain experts or engineers), data partitioning used during prompt development, and any automated tools or optimization frameworks applied. Provide a version history of prompts, summarizing major changes across iterations.
2.2 Prompting strategy, format, and length
Describe how prompts were constructed and used. Specify the prompting strategy (e.g., zero-shot with task instructions, few-shot using exemplar inputs, or chain-of-thought prompting), the prompt format (e.g., structured templates with placeholders or open-ended queries for text prompts; a single image, a multi-image set, or region annotations/bounding boxes for image prompts), and prompt length (e.g., short directive prompts vs. long multi-context prompts).
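As one illustration of the zero-shot vs. few-shot distinction above, a minimal sketch of a structured template with placeholders is shown below; the template wording, labels, and example reports are hypothetical and not part of the checklist itself.

```python
# Hypothetical sketch: assembling a structured prompt from a template with
# placeholders, optionally prepending few-shot exemplars.
TEMPLATE = (
    "You are assisting with radiology report classification.\n"
    "Task: label the finding as NORMAL or ABNORMAL.\n"
    "{exemplars}"
    "Report: {report}\n"
    "Label:"
)

def build_prompt(report, exemplars=None):
    """Return a zero-shot prompt, or a few-shot prompt if exemplars are given."""
    shots = ""
    if exemplars:
        shots = "".join(f"Report: {r}\nLabel: {label}\n\n" for r, label in exemplars)
    return TEMPLATE.format(exemplars=shots, report=report)

zero_shot = build_prompt("No acute cardiopulmonary abnormality.")
few_shot = build_prompt(
    "Right lower lobe consolidation.",
    exemplars=[("Clear lungs bilaterally.", "NORMAL")],
)
```

Reporting the verbatim template plus the exemplar-selection rule (item 2.3) makes the prompting strategy reproducible.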
2.3 Prompt modality, language, technical input specification, and full content
Specify the prompt modality (e.g., text, image, or audio), type of input(s) (e.g., chest X-ray images, discharge summary text, or pathology reports), and the language (if text prompts are used or images contain text). Include relevant technical characteristics such as resolution, encoding, or preprocessing applied before input conversion. Specify whether and how prior studies or longitudinal data were included in the prompt for comparison. Provide the full prompt content used in the study (e.g., verbatim text for text prompts or representative examples for non-text prompts).
2.4 Integration of relevant patient clinical context
Describe how patient-specific clinical information was incorporated into prompts. Specify what types of context were included (e.g., age, sex, key comorbidities, prior treatments, and relevant medical history), how this information was selected (e.g., based on predefined criteria, guideline-driven relevance, or expert curation), and how it was standardized (e.g., ICD codes for diagnoses, SNOMED CT for procedures, RxNorm for medications, and BI-/LI-/PI-RADS for imaging). Also state the source of the clinical information (e.g., electronic health records, radiology reports, or patient summaries).
2.5 Interaction style and session memory policy
Specify how users interacted with the model. Indicate whether the workflow used a single-turn interaction (independent one-shot queries) or a multi-turn conversation in which previous messages influence later responses. Also report the memory policy: whether prior context was retained across messages within the same conversation and reset when the session closed (session memory policy), or retained across different conversations (persistent memory policy). This clarifies how much conversational history shaped the model’s outputs.
2.6 Output handling
Report how model outputs were controlled and managed after generation. Describe the output format used (e.g., structured JSON, free text, or tabular form). Specify the level of control applied: (i) no control; (ii) control in the prompt only; (iii) control during generation (e.g., guided generation or JSON mode with schema validation); or (iv) control after generation (e.g., checking for information completeness, validating against a ground truth, or validating against a schema). Describe how constraints or output schemas were enforced and any further validation schemes applied (e.g., clinical plausibility review).
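A minimal sketch of post-generation control (level iv): parsing a model’s raw text as JSON and checking it against a simple expected schema. The field names and types here are hypothetical examples, not a prescribed schema.

```python
import json

# Expected output fields and their types (illustrative only).
EXPECTED_FIELDS = {"finding": str, "confidence": float}

def validate_output(raw_text):
    """Return (ok, parsed_or_error) for a model output expected to be JSON."""
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc}"
    for field, ftype in EXPECTED_FIELDS.items():
        if field not in parsed:
            return False, f"missing field: {field}"
        if not isinstance(parsed[field], ftype):
            return False, f"wrong type for {field}"
    return True, parsed

ok, result = validate_output('{"finding": "nodule", "confidence": 0.87}')
bad, err = validate_output('{"finding": "nodule"}')
```

Reporting both the schema and the rejection/repair policy for failed validations makes the output-handling pipeline auditable.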
3. Stochasticity Control
3.1 Generation parameters
Report all model generation settings used during output generation for all modalities (e.g., text, image, and others). For text generation, specify parameters such as temperature, top-k/top-p sampling, maximum output tokens, repetition or frequency penalties, and any random seed used. For image generation, include the number of inference steps, scheduler type, output resolution, and guidance scale.
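One way to make these settings reportable exactly is to record them in a single structured object stored alongside each output. The field names below mirror common API parameters but are illustrative; actual parameter names and availability (notably seed support) vary by provider.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class GenerationConfig:
    """Illustrative record of text-generation settings for reporting."""
    temperature: float = 0.0
    top_p: float = 1.0
    max_output_tokens: int = 512
    frequency_penalty: float = 0.0
    seed: int = 42  # fixed seed, where the provider supports one

# Serialize the exact configuration used, for inclusion in the study record.
config = GenerationConfig(temperature=0.2, seed=1234)
config_record = json.dumps(asdict(config), sort_keys=True)
```

Archiving such a record per experiment run supports item 5.6 (reproducibility constraints) as well.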
3.2 Prompt operator characteristics and number of prompt attempts
Specify who interacted with the model (the prompt operators). Report the operators’ roles (e.g., clinician, researcher, attending radiologist, resident, or technologist) and experience levels (e.g., years of practice, AI familiarity, or training status). Also specify how many prompt attempts were made.
3.3 Output selection
Describe how the final model output(s) were selected. If multiple generations were sampled, detail the selection criteria. Specify whether final outputs were expert-reviewed (e.g., radiologist selected best response), randomly selected (e.g., first output used without filtering), consensus-based (e.g., agreement among multiple reviewers), algorithmic (e.g., ranking by confidence score), or AI-automated (e.g., pipeline-selected output). Report tie-break rules and any rejection filters.
4. Dataset Integrity
4.1 Dataset name, version, access type, source citation, license, and compliance statement
Clearly report the dataset used to ensure transparency and traceability. Include the dataset name and version (e.g., MIMIC-CXR v2.0, LIDC-IDRI, or BraTS 2023), access type (public, restricted, or private), and citation of the data source. For private datasets, specify the institution or repository name and how the data were accessed (e.g., through departmental electronic health records, institutional PACS, or a trial repository). Also include license details (e.g., PhysioNet Credentialed Health Data License; RSNA data agreement) and a statement confirming compliance with data use agreements or institutional approvals.
4.2 Dataset origin
Specify whether the dataset was collected from a single site or multiple centers. For multi-center datasets, indicate the number and type of institutions (e.g., academic medical centers or community hospitals) and whether the data were international or regional.
4.3 Ethics and consent statements
Specify institutional review board or ethics committee approval status, including the approval number if applicable. If informed consent was waived, provide a brief justification (e.g., retrospective anonymized data).
4.4 Prior dataset usage and publication date
Report any prior usage of the dataset that could affect the evaluation of independence, particularly prior work by the authors or institution. For widely used public datasets, a general acknowledgment of their established use is sufficient. Also state the dataset’s publication or public release date, as datasets made public before a model’s pretraining cutoff may carry a risk of contamination. If usage details are limited by proprietary restrictions, explicitly state the limitations and their source.
4.5 Dataset composition and data synthesis details
Report the composition of the dataset, specifying whether it includes real clinical data, synthetic data, or a combination of both. For public datasets, citing the original source and providing a brief description is sufficient. For private or newly published datasets, disclose the data generation method (e.g., LLM-based report synthesis, diffusion model for MR image synthesis, or generative adversarial network-based augmentation of CT scans) and the proportion of synthetic data in the dataset.
4.6 Sample characteristics and representational bias analysis
Report key characteristics of the dataset population to assess fairness and generalizability. Include demographics (e.g., age distribution and sex balance), clinical characteristics (e.g., disease types and severity levels), and data composition (e.g., imaging modality, number of classes, and case distribution). Also evaluate representational bias by analyzing subgroup coverage (e.g., underrepresentation of certain age groups, sex imbalance, or limited geographic diversity).
4.7 Reference standard and annotator qualifications
Define how the reference standard in the dataset was established. Specify the type of reference standard used (e.g., pathology-confirmed diagnosis, radiology report, clinical follow-up, or expert panel consensus) and describe the annotator qualifications (e.g., board-certified radiologist with 10 years of experience, pathology fellow, or multidisciplinary tumor board). Include the number of annotators and any disagreement resolution process.
4.8 Preprocessing and data pairing/registration
Describe all preprocessing steps applied to the dataset before model use. For text data, specify methods for de-identification/anonymization (e.g., removal of protected health information or natural language processing-based redaction). For imaging data, report pairing or registration procedures (e.g., linking reports to imaging studies, spatial alignment across sequences, or longitudinal time points). Report specific processing pipelines such as DICOM window/level settings, normalization techniques, and multi-sequence handling. Also mention any filtering, resizing, artifact removal, or other normalization procedures.
4.9 Missing data extent, mechanism, and handling
Report how missing data were assessed and managed. Specify the extent of missing data (e.g., percentage of missing values per variable), the mechanism if known (e.g., missing completely at random, missing at random, or missing not at random), and the handling strategy used (e.g., exclusion of incomplete cases, mean/median imputation, model-based imputation, or no imputation). Provide a brief rationale for the chosen method to support transparency and reproducibility.
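Quantifying the extent of missingness per variable can be sketched in a few lines; the records and variable names below are fabricated for illustration, and real pipelines would also distinguish structurally absent from truly missing values.

```python
# Fabricated example records; None marks a missing value.
records = [
    {"age": 64, "sex": "F", "bmi": 27.1},
    {"age": 71, "sex": "M", "bmi": None},
    {"age": None, "sex": "F", "bmi": 31.4},
    {"age": 58, "sex": "M", "bmi": None},
]

def missingness(rows):
    """Return the fraction of missing (None) values per variable."""
    variables = rows[0].keys()
    n = len(rows)
    return {v: sum(r[v] is None for r in rows) / n for v in variables}

report = missingness(records)
```

Reporting such per-variable fractions, together with the presumed mechanism and the handling strategy, satisfies the item above.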
4.10 Separation of training, fine-tuning, internal testing, and external testing datasets
Report how datasets were separated to prevent information leakage. Clearly describe the distinction between training data, fine-tuning data, internal testing data (i.e., held-out cases from the same institution), and external testing data (i.e., from a different institution). State how independence was ensured (e.g., separation by patient ID, study date, or site). Confirm that no internal or external test data were used during training, fine-tuning, or prompt optimization.
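Separation by patient ID, mentioned above, can be sketched as follows: all cases from one patient land in the same partition, so no patient straddles the train/test boundary. IDs and study names are fabricated.

```python
import random

# (patient_id, study_id) pairs; several patients contribute multiple studies.
cases = [("P1", "study_a"), ("P1", "study_b"), ("P2", "study_c"),
         ("P3", "study_d"), ("P4", "study_e"), ("P4", "study_f")]

def split_by_patient(all_cases, test_fraction=0.5, seed=0):
    """Split cases so every patient appears in exactly one partition."""
    patients = sorted({pid for pid, _ in all_cases})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, int(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = [c for c in all_cases if c[0] not in test_patients]
    test = [c for c in all_cases if c[0] in test_patients]
    return train, test

train_set, test_set = split_by_patient(cases)
train_ids = {pid for pid, _ in train_set}
test_ids = {pid for pid, _ in test_set}
```

Reporting the grouping key (patient ID, date, or site) and the seed makes the split reproducible and verifiably leak-free.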
5. Output Evaluation
5.1 Output evaluation method and performance metrics
Report the methods used to evaluate the model and list all relevant performance metrics. State their appropriateness for the task(s) of interest. Specify whether the evaluation was based on human review (e.g., Likert ratings, readability scores, or clinical usefulness ratings), task-performance metrics (e.g., AUROC, F1-score, sensitivity, specificity, or calibration measures), text or semantic similarity metrics (e.g., BLEU, ROUGE, or BERTScore), or model-based evaluation (e.g., LLM-as-a-judge scoring).
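For the task-performance metrics named above, a minimal sketch of sensitivity, specificity, and F1 computed from binary confusion-matrix counts (the counts are fabricated):

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}

metrics = binary_metrics(tp=40, fp=10, fn=10, tn=40)
```

In practice, established libraries (e.g., scikit-learn) would be used; the point is that reports should state exactly which formula and which positive-class convention each metric follows.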
5.2 Human evaluator characteristics and reliability analysis
Report the characteristics of human evaluators involved in the assessment of model outputs, including the number of evaluators, their role or specialty (e.g., radiologist, clinician, or domain expert), experience level, and any formal training provided for the evaluation task. Report the methods used to measure evaluator consistency, including inter-rater or intra-rater reliability statistics (e.g., Cohen’s kappa, Fleiss’ kappa, or intraclass correlation coefficient).
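As a concrete instance of the inter-rater statistics mentioned above, Cohen’s kappa for two raters is the chance-corrected agreement κ = (pₒ − pₑ)/(1 − pₑ); the ratings below are fabricated.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    # Expected agreement by chance from each rater's marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "neg"]
kappa = cohens_kappa(a, b)
```

Here observed agreement is 5/6 and chance agreement 0.5, giving κ = 2/3; reports should state the statistic, the number of rated items, and the interpretation scale used.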
5.3 Statistical analysis of evaluation results
Report the statistical methods used to analyze model evaluation outcomes. Specify any hypothesis testing or alternative approaches, the interval estimates of uncertainty used (e.g., confidence intervals, credible intervals, or bootstrap intervals), and effect size measures where applicable. Describe the criteria used to interpret statistical evidence (e.g., significance thresholds, Bayesian decision rules, or equivalence margins).
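Of the interval estimates named above, the percentile bootstrap is often the most practical for per-case model outcomes. A minimal sketch over fabricated per-case correctness indicators:

```python
import random

def bootstrap_ci(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy over 0/1 correctness indicators."""
    rng = random.Random(seed)
    n = len(outcomes)
    stats = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

correct = [1] * 80 + [0] * 20  # fabricated: 80% observed accuracy on 100 cases
low, high = bootstrap_ci(correct)
```

Reports should state the number of resamples, the resampling unit (case vs. patient), and the seed, since all three affect the interval.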
5.4 Subgroup performance and output bias assessment
Report model performance across predefined subgroups relevant to the clinical task, such as age groups, sex, disease categories, imaging modality, institution, or geographic region. Describe any observed performance differences between subgroups to examine potential output bias.
5.5 Failure analysis and error metrics
Report how model errors were identified, reviewed, and categorized. Describe the procedure used to detect and classify failures (e.g., hallucination, reasoning error, bias-related error, factual inaccuracy, or formatting issue). Provide representative examples of failure types. Report key error metrics (e.g., hallucination rate, factual inaccuracy rate, or omission rate) and include summaries of error distribution where available.
5.6 Output stochasticity and reproducibility constraints
Report the results of generative variability assessment based on repeated generations for identical prompts and randomness settings. Provide examples of output differences. Based on the analysis, state any factors limiting exact reproducibility, such as inherent stochastic model behavior, lack of seed control, proprietary or closed-source model access, updates to model versions over time, or restricted access environments where full configuration cannot be disclosed.
5.7 Performance effects of different prompt strategies and revisions
Report how different prompting strategies affected model performance, including comparisons across approaches such as zero-shot, few-shot, chain-of-thought prompting, or prompting using different languages. Report the impact of any prompt revisions (e.g., initial prompt version vs. revised version) on output quality or task performance.
5.8 Model version comparisons and temporal performance variation
Report comparisons of model performance across different versions of the model (e.g., Llama-3.1-8b v1.0 or Llama-3.1-8b v1.1) or across different release dates of the same version, if such evaluations were performed or relevant to the study period. Describe how performance changed over time or between versions, and specify the conditions used for comparison.
5.9 Methods for explainability and interpretability of model outputs
Report the methods used to interpret or explain model outputs and describe how they were applied, as applicable to the model type and study context. Describe the scope and known limitations of the chosen explainability approach in the study context.
5.10 Comparison with clinically relevant benchmarks
Report comparisons between model performance and appropriate clinical benchmarks. These benchmarks may include the established clinical standard of care (e.g., expert human performance or clinical guidelines). Comparisons to specialist task-specific artificial intelligence models should be reported as technical benchmarks, distinct from clinical reference standards. Describe the comparison framework and reference standards used.
6. Implementation
6.1 Declared intended application and scope of use
Report the intended purpose of the model. Describe the target application, such as diagnostic decision support, medical question answering, report generation, workflow triage, patient communication, or educational support.
6.2 Clinical workflow integration
Report whether, and if applicable, how and where the model was integrated into the clinical workflow. Specify the integration mode (e.g., embedded in PACS or RIS, decision-support dashboard, web-based interface, or standalone software) and the interaction point in the workflow (e.g., prereading triage, concurrent reading, post-report review, or quality assurance). State the intended user role (e.g., radiologist, resident, or technologist) and whether patient-facing interaction was involved.
6.3 Measured clinical utility or added value
Report measures of clinical utility or added value obtained from model use. Describe how the model contributed to outcomes such as improvement in diagnostic accuracy, reduction in reporting or decision time, enhancement of workflow efficiency, increased clinician confidence, or improved patient understanding.
6.4 Model limitations, explicit clinical non-use cases, and potential misuse considerations
Report known limitations of the model and clearly state clinical scenarios where it should not be used. Describe explicit non-use cases (e.g., high-risk decision-making without human oversight, unsupported imaging modalities, or unsupported patient populations) and identify foreseeable risks of misuse. Include considerations related to potential patient harm, safety concerns, or health system risks.
6.5 Safety testing and monitoring protocols
Report procedures used to identify and manage harmful, clinically unsafe, or medically inaccurate outputs. Describe any safety testing performed before deployment (e.g., screening for harmful recommendations or toxic responses and sandboxing) and monitoring protocols used during model interaction (e.g., automated safety filters or human review of unsafe outputs).
6.6 Data security and privacy safeguards
Report measures used to protect data security and patient privacy during model use if patient or sensitive data were processed. Describe how prompt and user data were handled, including storage policies, duration of retention, and whether data were reused for model improvement. Specify data routing and regional processing locations if applicable (e.g., EU vs. US data centers) and report safeguards for live handling of protected health information, such as de-identification, access control, and encryption.
6.7 Governance, auditability, and oversight
Report governance measures and institutional oversight applied to the use of the model in the study, distinct from the initial ethics approval. Specify procedures for auditability (e.g., logging of model interactions or traceability of outputs) and formal oversight, especially when using external or API-based models (e.g., use of secure institutional accounts vs. public APIs).

Summary of Checklist Completion by Section

(Interactive completion summary omitted: per-section counts of Yes, Partial, No, N/A, and empty responses. Percentages are calculated independently and may not sum to exactly 100% due to rounding.)

Note: Please cite the following article when using this checklist: Mese I, Akinci D'Antonoli T, Bluethgen C, Bressem K, Cuocolo R, Chaudhari A, Tejani AS, Isaac A, Ponsiglione A, Meddeb A, Khosravi B, Le Guellec B, Kahn CE Jr, Suh CH, Pinto Dos Santos D, Koh DM, Tzanis E, Kotter E, Colak E, Kitamura F, Busch F, Nensa F, Yang G, Müller H, Kather JN, Nawabi J, Kleesiek J, Zhong J, Santinha J, Haubold J, de Almeida JG, Lekadir K, Marias K, Reiner LN, Maier-Hein L, Moy L, Adams LC, Martí-Bonmatí L, Paschali M, Moassefi M, Dietzel M, Huisman M, Ingrisch M, Klontzas ME, Papanikolaou N, Diaz O, Kuriki P, Seeböck P, Rouzrokh P, Strotzer QD, Park SH, Faghani S, Tayebi Arasteh S, Kim SH, Venugopal VK, Kim W, Kocak B. Reporting checklist for foundation and large language models in medical research (REFINE): an international consensus guideline. Diagn Interv Radiol. 2026 Feb 26. doi: 10.4274/dir.2026.263812. Epub ahead of print. PMID: 41742713.