Dear Editor,

We would like to discuss “Translating musculoskeletal radiology reports into patient-friendly summaries using ChatGPT-4” [1]. The study evaluated layperson summaries of 60 radiology reports generated by ChatGPT-4. Its primary goal was to assess the accuracy and completeness of these summaries, which were required to be succinct, well organized, and written at an eighth- to ninth-grade reading level. To this end, three independent readers rated each summary on a scale of 1 to 3 for both accuracy and completeness.
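As an aside on how such a readability requirement can be verified, the sketch below is an illustration only, not the authors’ method: it uses the open-source textstat package to estimate the Flesch-Kincaid grade level of a hypothetical summary.

```python
import textstat

# Hypothetical layperson summary (illustrative only, not from the study)
summary = (
    "Your MRI shows a small tear in one of the tendons of your shoulder. "
    "The bones and joints look normal. Your doctor may suggest physical "
    "therapy or other treatment to help with pain and movement."
)

# Flesch-Kincaid grade level; a value of about 8-9 corresponds to the
# eighth- to ninth-grade reading level targeted in the study
grade = textstat.flesch_kincaid_grade(summary)
print(f"Estimated reading grade level: {grade:.1f}")
```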

The study reported that all 60 summaries met the word-count and readability standards. However, the three readers’ accuracy ratings differed: readers two and three rated accuracy significantly higher, with mean scores of 2.71 and 2.77, respectively, than reader one, whose mean score was 2.58. Completeness ratings also varied, with mean scores of 2.87, 2.73, and 2.87 for readers one, two, and three, respectively. Agreement between the two primary readers was low, with kappa values of 0.29 for accuracy and 0.33 for completeness. Unexpectedly, the addition of a third reader had no discernible effect on inter-reader agreement.
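For context on the agreement statistic, the sketch below (using hypothetical ratings, not the study’s data) shows how Cohen’s kappa for two readers’ 1-to-3 ratings can be computed with scikit-learn; values in the range of roughly 0.2 to 0.4 are conventionally interpreted as only fair agreement beyond chance.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-3 accuracy ratings from two readers for ten summaries
# (illustrative values only, not the study's data)
reader_one = [3, 2, 3, 3, 2, 3, 1, 3, 2, 3]
reader_two = [3, 3, 3, 2, 2, 3, 2, 3, 3, 3]

# Cohen's kappa corrects raw percent agreement for the agreement
# expected by chance alone
kappa = cohen_kappa_score(reader_one, reader_two)
print(f"Cohen's kappa: {kappa:.2f}")
```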

Several of the study’s shortcomings should be noted. First, the differences in accuracy and completeness scores suggest inconsistencies in the three readers’ assessment processes. Second, the poor inter-reader agreement indicates that judgments of the quality of the layperson summaries were not consistent. Finally, the study provides no information about factors that might have influenced the ratings or the inter-reader agreement.

Given these limitations, several directions for further research are worth considering. First, the review process should be standardized to ensure greater consistency among readers. Second, further work is needed to identify the factors driving the differences in accuracy and completeness ratings across readers. Third, future studies would benefit from examining which prompts or features of the radiology reports may have influenced the readers’ evaluations. It is also worth considering additional guidelines or safeguards to raise the quality of layperson summaries. Finally, expanding the research to a larger and more varied set of radiology reports would strengthen the generalizability of the results. Establishing norms or guidelines and examining the ethical concerns surrounding the use of AI-generated content would further help ensure a responsible and transparent use of this technology [2].