Introduction

Known for their remarkable conversational capabilities and extensive knowledge spanning numerous disciplines, large language model (LLM) chatbots such as ChatGPT have garnered significant interest in education1, research2, and clinical practice3. In the field of bioinformatics, ChatGPT serves as an instrumental aid for learning basic bioinformatics4,5. Several studies offer recommendations on harnessing the chatbot for more efficient data analysis4,5,6,7,8.

The figure in the “YY1” case study marked chromatin–chromatin interactions involving chromatin domains (Supplementary Fig. S4) as well as three promoter regions (denoted by blue circles). Each circle signifies the interaction between two genomic regions, which are pointed to by dashed lines originating from the circle. For instance, chromatin domains “A1” and “A2” are connected to the leftmost blue circle in Supplementary Fig. S4, denoting their physical proximity. GPT-4V could not accurately discern the colors of the two types of circles and reported non-existent purple circles. While manually counting the circles is a straightforward task for human analysts, GPT-4V consistently failed to count the circles in all three replicates. When challenged to identify the interacting chromatin regions indicated by the circles, none of GPT-4V’s reports was correct. This underscored GPT-4V’s limitations not only in counting simple visual elements like circles but also in deducing relationships between connected visual elements.

Figure legends from the basal assessment consistently overlooked explanations for custom markings on plots, such as rectangles or dashed lines (top three chat histories in Supplementary Notes S4). The accompanying summaries lacked sufficient detail to support concrete conclusions. During the in-depth assessment, GPT-4V incorporated its detailed responses to previous specific questions into the figure legends and result summaries, enriching their depth. This approach, however, also propagated errors and misleading information from those earlier responses, emphasizing the need for rigorous human review to ensure accuracy.

Our qualitative analyses of the four case studies revealed two primary strengths of GPT-4V in figure interpretation. First, it competently identified and explained various plot types. Second, it effectively leveraged domain knowledge to elucidate or substantiate observations. Regarding limitations, GPT-4V struggled with color perception and faced challenges in discerning the positional relationships between visual elements. Notably, in the “Clonal” and “YY1” case studies, GPT-4V also showed limitations in tasks involving the counting of visual elements. To further gauge the significance of these findings, we conducted a comprehensive quantitative analysis, described in the next section.

Quantitative evaluation

Six categories emerged from our qualitative assessment to characterize GPT-4V’s responses in scientific figure interpretation: Plot Recognition, Domain Knowledge, Color Perception, Positional Inference, Counts, and Others (see the “Methods” section for definitions). Together, the six categories covered 80.3 ± 7.9% of the statements. We devised a summative indexing table for each case study to encapsulate the true/false evaluation of the statements, their categorizations, and the origins of replicates and panels (Supplementary Tables S1–S4).

As an illustration of this process, Fig. 1a shows an overview, starting from the inputs (figure and prompts) and progressing through parsing GPT-4V’s responses into discrete statements, true/false annotation, categorization, and, finally, compilation of a summative indexing table for downstream comparative analysis. Figure 1b, using extracts from GPT-4V’s explanation of the plot type of a sub-panel from the “ZNF71” case study, demonstrates the workflow from the initial GPT-4V responses through the indexed and annotated statements to their allocation in the corresponding summative table. In this example, statements “<2-3-3>” and “<2-3-4>” explained the plot type and were annotated as correct (see the “Methods” section for details on statement indexing); their statement numbers, “<3>” and “<4>”, were therefore placed in the Plot Recognition category of the summative table and shown in blue. Statement “<2-3-5>” contained correct information on the number of groups, so its statement number (“<5>”) was marked in blue in the Counts category; the same statement contained incorrect information on color coding and was thus marked in red in the Color Perception category.
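To make this bookkeeping concrete, the following is a minimal Python sketch, with hypothetical field and function names (the paper does not specify its tooling beyond spaCy), of how annotated statements could be collected and pivoted into such a summative indexing table:

```python
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class Statement:
    """One parsed GPT-4V statement, e.g. index <2-3-5>."""
    case: int                     # 1 = RNA, 2 = ZNF71, 3 = Clonal, 4 = YY1
    replicate: int                # 1, 2, or 3
    number: int                   # sequential position within the replicate
    # Per-category verdicts; a statement may span several categories.
    verdicts: dict = field(default_factory=dict)

def summative_table(statements):
    """Pivot statements into {category: {replicate: [(number, is_true), ...]}}."""
    table = defaultdict(lambda: defaultdict(list))
    for s in statements:
        for category, is_true in s.verdicts.items():
            table[category][s.replicate].append((s.number, is_true))
    return table

# Mirrors statement <2-3-5> from Fig. 1b: correct on the number of groups,
# incorrect on the color coding.
stmts = [Statement(case=2, replicate=3, number=5,
                   verdicts={"Counts": True, "Color Perception": False})]
print(summative_table(stmts))
```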

Fig. 1: Analytical framework for summarizing GPT-4V’s responses into a summative indexing table for quantitative analysis.

a Workflow illustrating the process from the figure and prompt input to GPT-4V, followed by parsing responses into statements and indices, annotating with true/false color codes, and culminating in a summative indexing table with categories. b Example using GPT-4V’s explanation of plot type in a sub-panel from the “ZNF71” case study to demonstrate statement generation, index creation, true/false annotation, categorization, and allocation in a summative table. Color coding: black for non-informative chitchat; blue for correct statements; red for incorrect statements, with curly-bracketed comments on inaccuracies. Oval text: GPT-4V quotes used for category determination.

These summative tables formed the basis for subsequent comparative analyses, which evaluated GPT-4V’s performance within specific categories and across replicates. Note that the basal inquiries were excluded from the analyses, as their corresponding in-depth inquiries elicited more comprehensive responses from GPT-4V.

Overall performance

We calculated the percentages of true statements for each case based on the annotations provided in Supplementary Notes S1–S4. As indicated in Fig. 2, the “RNA” case achieved the highest accuracy (95.1 ± 1.3%). This may be attributed to the prevalent use of the RNA-Seq technique and the routine nature of the analyses covered in the figure, potentially leading to more effective training of GPT-4V on such figures. Conversely, the overall accuracies of the other three case studies were less impressive, ranging from 64.4 ± 0.1% in the “ZNF71” case to 77.3 ± 5.9% in the “YY1” case (Fig. 2). A noteworthy observation in the “ZNF71” case was the consistent misspelling of “CD27” as “CD271” and “IKBKB” as “IKKBK” (Supplementary Notes S2), which accounted for 40–55% of the false statements. This type of error, present only in the “ZNF71” case, was a key contributor to its lower accuracy: excluding those statements raised the adjusted accuracy to 79.2 ± 2.8%.
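The per-case accuracies above are means ± standard deviations across the three replicates. A minimal sketch of this computation (the counts below are hypothetical placeholders; the actual annotations are in Supplementary Notes S1–S4):

```python
from statistics import mean, stdev

def accuracy(true_count, total_count):
    """Percentage of true statements in one replicate."""
    return 100 * true_count / total_count

# Hypothetical per-replicate (true, total) statement counts for one case.
replicates = [(78, 82), (80, 84), (77, 81)]
accs = [accuracy(t, n) for t, n in replicates]
print(f"{mean(accs):.1f} +/- {stdev(accs):.1f}%")
```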

Fig. 2: Statement accuracy by replicate and case study.

Columns represent each replicate, while rows depict the four case studies: “RNA”, “ZNF71”, “Clonal”, and “YY1”. Accuracy levels indicated by color gradient: higher accuracy in yellow, lower accuracy in deeper shades of purple.

Performance by category

The six categories derived from the qualitative analyses were observed in all cases. The summative indexing tables (Supplementary Tables S1–S4) effectively organized statements by category and replicate for each case study. Notably, 23.3% of the statements spanned two categories, and 1.5% intersected three categories. For simplicity, the true/false status of these overlapping statements was assessed independently in each category during their allocation in the summative indexing tables (see Fig. 1b for an example).

Supplementary Table S6 details the counts of true and total statements for each category and case study combination. This calculation consolidated replicates from each case to ensure a sufficient statement count per category. Statement accuracy for each case study, stratified by category, is visualized in Fig. 3a. These results highlighted GPT-4V’s proficiency in plot type recognition and domain knowledge recall, with accuracies surpassing 85% in all cases (Fig. 3a, left two columns). Notably, these accuracies were markedly higher (p-value < 0.05; t-test) than those for Color Perception, which consistently remained below 60% across all cases (Fig. 3a, third column).
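A sketch of the corresponding significance test, assuming per-case accuracies for two categories are compared with an unpaired two-sided t-test via SciPy (the numbers below are hypothetical placeholders; the real values can be derived from Supplementary Table S6):

```python
from scipy.stats import ttest_ind

# Hypothetical per-case accuracies (%) for two categories.
plot_recognition = [97.0, 92.5, 88.0, 90.1]
color_perception = [55.2, 48.9, 57.4, 51.0]

t_stat, p_value = ttest_ind(plot_recognition, color_perception)
print(f"t = {t_stat:.2f}, two-sided p = {p_value:.4f}")
```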

Fig. 3: Statement accuracy stratified by category.

a Accurate statement percentages by category, aggregated by replicates from each of the four case studies: “RNA” (circle), “ZNF71” (square), “Clonal” (triangle), “YY1” (diamond). P-values from two-sided t-tests indicated as *p < 0.05, **p < 0.01, ***p < 0.001. b–f Accurate statement percentages for each replicate across case studies, categorized into plot recognition (b), domain knowledge (c), color perception (d), positional inference (e), and counts (f). Color coding for replicates: Blue for replicate 1, Red for replicate 2, and Green for replicate 3.

Statements in the Positional Inference category describe spatial relationships between visual elements or extract numerical values inferred from coordinate axes. In this category, the “RNA” case exhibited a high accuracy of 89.3%, whereas the accuracies in the “Clonal” (61.7%) and “YY1” (53.1%) cases were notably lower (Fig. 3a, fourth column). Regarding the Counts category, we observed a bimodal distribution in accuracies: high in the “RNA” (88.9%) and “ZNF71” (100%) cases, and low in the “YY1” (57.9%) and “Clonal” (37.2%) cases (Fig. 3a, fifth column). A closer inspection revealed that the counting tasks in the “RNA” and “ZNF71” cases were relatively simple, whereas the “Clonal” and “YY1” cases represented complex scenarios with multiple groups and/or overlap with color perception issues.

Statements under the “Others” category were predominantly suggestions for improving figure presentation or sentences summarizing previous statements. In this category, the “RNA”, “Clonal”, and “YY1” cases all demonstrated accuracies of 95% or above, while the “ZNF71” case displayed a low accuracy of 72.7% (Fig. 3a, sixth column). The “ZNF71” case had only eleven statements in this category, and all of its inaccurate statements were attributed to the consistent misspelling of “CD27” as “CD271”.

Performance by category across replicates

To assess the robustness of performance across replicates, we defined a replicate’s performance in a specific category as unsatisfactory if its accuracy fell below 80%, mirroring a concerning “C” grade in graduate-level evaluation. For each category in every case study, we tabulated the numbers of correct and total statements per replicate in Supplementary Table S7. To ensure a robust analysis, we focused on replicates with at least six statements in the relevant category and on case studies with at least two such replicates. Figure 3b–f illustrates the accuracies for all case studies sorted by category, with key observations summarized as follows (a sketch of these inclusion criteria follows the list):

1. Plot Recognition: All replicates showed satisfactory performance for all cases (Fig. 3b).
2. Domain Knowledge: All replicates showed satisfactory performance for all cases (Fig. 3c).
3. Color Perception: All replicates showed unsatisfactory performance for all included cases (Fig. 3d).
4. Positional Inference: Unsatisfactory performance was prevalent across replicates except for those from the “RNA” case (Fig. 3e).
5. Counts: All replicates showed unsatisfactory performance for all included cases (Fig. 3f).
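A minimal sketch of the inclusion criteria and the unsatisfactory-performance flag described above (the counts in the example are hypothetical; the real tallies are in Supplementary Table S7):

```python
UNSATISFACTORY = 80.0   # percent; the "C"-grade threshold defined above
MIN_STATEMENTS = 6      # minimum statements per replicate to be included
MIN_REPLICATES = 2      # minimum qualifying replicates per case study

def flag_unsatisfactory(case_counts):
    """case_counts maps replicate -> (true, total) for one category/case.
    Returns replicate -> unsatisfactory flag, or None if the case is
    excluded from the comparison."""
    reps = {r: (t, n) for r, (t, n) in case_counts.items()
            if n >= MIN_STATEMENTS}
    if len(reps) < MIN_REPLICATES:
        return None
    return {r: 100 * t / n < UNSATISFACTORY for r, (t, n) in reps.items()}

# Hypothetical counts: replicate 3 is dropped (< 6 statements), replicate 1
# is flagged (57% < 80%), replicate 2 passes (89%).
print(flag_unsatisfactory({1: (4, 7), 2: (8, 9), 3: (3, 5)}))
```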

We next explored to what extent GPT-4V may repeatedly fail to address a question. To this end, we distilled 52 specific sub-questions by reviewing all incorrect statements (Supplementary Table S8). A response to a sub-question in a replicate was deemed incorrect if it contained one or more inaccurate statements. Our observations revealed the following (Supplementary Table S8): 77.8% of responses to sub-questions in the Color Perception category were consistently incorrect across replicates, followed by Counts (75.0%) and Positional Inference (55%). In contrast, the percentages of consistently inaccurate responses in the Plot Recognition, Domain Knowledge, and Others categories were lower, at 37.5%, 25.0%, and 20.0%, respectively. This indicates that GPT-4V’s responses in the Color Perception, Counts, and Positional Inference categories, when incorrect, tend to be more persistently incorrect than those in the Plot Recognition, Domain Knowledge, and Others categories (p-value = 0.01; two-sided t-test).
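A minimal sketch of this consistency measure (the verdicts below are hypothetical; the real per-sub-question outcomes are in Supplementary Table S8):

```python
def consistently_incorrect_pct(subquestions):
    """subquestions maps sub-question id -> list of per-replicate verdicts
    (True = response correct). A response counts as incorrect if it contains
    one or more inaccurate statements; a sub-question is consistently
    incorrect if it is wrong in every replicate in which it was asked."""
    consistent = sum(1 for verdicts in subquestions.values()
                     if not any(verdicts))
    return 100 * consistent / len(subquestions)

# Hypothetical verdicts for three Color Perception sub-questions.
pct = consistently_incorrect_pct({
    "q1": [False, False, False],   # wrong in all three replicates
    "q2": [False, True, False],    # recovered once
    "q3": [False, False, False],
})
print(f"{pct:.1f}%")  # 66.7%
```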

This replicate-based analysis underscored GPT-4V’s competency in plot recognition and in citing domain knowledge. However, it confirmed a significant limitation of GPT-4V in tasks involving color perception, positional inference, and counts, with consistently poor performance across replicates and cases.

Confirmation bias

During the quantitative evaluation of the “YY1” case, we noted a substantial number of instances where GPT-4V applied valid domain knowledge to rationalize flawed observations, a phenomenon known as “confirmation bias”24. This occurred in about 5–15% of the statements that cited valid domain knowledge from the in-depth inquiries. Specific instances included statements 22, 25, 29, 49, 53, 56, 86, 218, 220 from replicate one; statements 27, 29, 57, 61, 64, 93, 95, 96, 98, 100, 102, 207, 220, 222 from replicate two; and statements 29, 64, 91, 95 from replicate three (Supplementary Table S4). A notable example was in the interpretation of the “I” regions from replicate two: GPT-4V inaccurately identified them as intronic regions rather than inactive promoters and interpreted them as intronic enhancers (as in the statement “<4-2-95>”), with further explanations about their functions in transcription regulation (“<4-2-96>”), alternative splicing (“<4-2-98>”), and 3D chromatin organization (“<4-2-100>”). This finding underscored the essential role of a human-in-the-loop approach to mitigate potential misinformation from “confirmation bias” and ensure accuracy from GPT-4V’s assistance in figure interpretation.

Discussion

Data visualization is crucial for conveying the results of bioinformatics analyses. LLM chatbots such as ChatGPT have demonstrated an ability to transform natural language prompts into relevant visual representations through coding25,26. The newly introduced image-input capability of ChatGPT, namely GPT-4V, offers a promising avenue for identifying patterns within an image, offering interpretations, summarizing findings, and beyond.

Methods

For the “RNA” case study, gene expression data30 were used to generate a principal component analysis (PCA) plot. Fold change (FC) in expression (T/M; log2) and false discovery rate (FDR; ‒log10), which measures the significance of differential expression, were utilized to craft a volcano plot. Additionally, a dot (bubble) plot illustrating hits in a pathway enrichment analysis of up-regulated genes (log2FC > 1 and FDR < 0.05) was produced using ShinyGO31.
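As a loose illustration of how such PCA and volcano plots can be generated, here is a sketch on synthetic data (the study’s actual expression matrix, tools, and styling are not specified here):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
expr = rng.normal(size=(20, 500))       # samples x genes (synthetic)
log2_fc = rng.normal(size=500)          # tumor vs. matched normal (T/M)
fdr = rng.uniform(1e-6, 1.0, size=500)  # false discovery rates (synthetic)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# PCA plot: project samples onto the first two principal components.
pcs = PCA(n_components=2).fit_transform(expr)
ax1.scatter(pcs[:, 0], pcs[:, 1])
ax1.set_xlabel("PC1"); ax1.set_ylabel("PC2")

# Volcano plot: log2 fold change vs. significance, highlighting
# up-regulated genes at the case study's thresholds.
up = (log2_fc > 1) & (fdr < 0.05)
ax2.scatter(log2_fc, -np.log10(fdr), s=5, c=np.where(up, "red", "grey"))
ax2.set_xlabel("log2 FC (T/M)"); ax2.set_ylabel("-log10 FDR")

plt.tight_layout(); plt.show()
```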

For the “ZNF71” case study, gene expression data with survival information from 194 non-small cell lung cancer (NSCLC) patients32 were used to generate Kaplan–Meier (K–M) curves. We combined gene expression data and PRISM drug-screening data of NSCLC cell lines from the DepMap data portal33 to relate docetaxel sensitivity to ZNF71 KRAB expression and to examine the correlation between drug response and CD27 expression. Gene expression in tumors and cell lines was quantified as transcripts per million (TPM). We applied the Boolean implication network algorithm34 to construct gene association networks in the context of tumors and normal tissues adjacent to the tumors (NATs) using Xu’s lung adenocarcinoma (LUAD) cohort35.
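A minimal sketch of K–M curve generation with the lifelines package, using synthetic survival data (the grouping variable and survival values are placeholders, not the study’s data):

```python
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(1)
n = 194                                    # cohort size in the case study
months = rng.exponential(36, size=n)       # synthetic survival times
death = rng.integers(0, 2, size=n)         # 1 = event observed, 0 = censored
high = rng.integers(0, 2, size=n).astype(bool)  # placeholder expression group

ax = plt.subplot(111)
for label, mask in [("high expression", high), ("low expression", ~high)]:
    kmf = KaplanMeierFitter()
    kmf.fit(months[mask], event_observed=death[mask], label=label)
    kmf.plot_survival_function(ax=ax)
ax.set_xlabel("Months"); ax.set_ylabel("Survival probability")
plt.show()
```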

For the “Clonal” case study, somatic mutations in a pair of primary and recurrent tumors from a multiple myeloma (MM) patient (ID: 1201) in the MMRF-CoMMpass study were sourced from the GDC data portal36. Clonal and subclonal mutations were identified using the MAGOS method37. The relative prevalence of (sub)clones and their evolutionary relationships were inferred using the ClonEvol method38.

For the “YY1” case study, a screenshot showing gene expression, genomic distributions of histone modifications, and chromatin–chromatin interactions in the genomic region encompassing YY1 in GM12878 cells was sourced from the WashU Epigenome Browser39. Specifically, chromatin immunoprecipitation followed by sequencing (ChIP-seq) data for the histone modifications H3K27me3, H3K4me3, and H3K27ac, along with strand-specific RNA sequencing (RNA-Seq) data for gene expression, were loaded from the Encyclopedia of DNA Elements (ENCODE) data hub. Proximity Ligation-Assisted ChIP-Seq (PLAC-seq) data for chromatin–chromatin interactions were loaded from the 4D Nucleome (4DN) Network. Regions denoting active and inactive chromatin domains, as well as key areas of chromatin–chromatin interactions, were annotated manually.

The input figures to GPT-4V from the above four case studies are listed in Supplementary Figs. S1–S4. Legends for the figures were drafted by the authors. Additional case background and our interpretations of the figures are provided in Supplementary Methods S1 for reference.

Prompts for ChatGPT

We instructed GPT-4V to function as a bioinformatics expert for each case study, introducing the research question associated with each figure at a high level to provide context. GPT-4V’s interpretive capabilities were examined under two assessment models. In the basal model, GPT-4V operated with minimal guidance to provide an overview of a figure, while in the in-depth model, it was guided through a sequence of questions addressing details. The questions were prompted to the chatbot one by one, which yielded more detailed responses than prompting all questions at once. At the conclusion of each evaluation, GPT-4V was optionally tasked with drafting a figure legend and a summary paragraph for the figure. The prompts used for each case study are detailed in Supplementary Methods S2. We repeated each assessment three times using the web interface of ChatGPT-4 Plus (version dated Sep 25, 2023). All experiments were conducted under the default settings of ChatGPT-4 Plus.
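Although the study used the ChatGPT Plus web interface, the same turn-by-turn protocol can be sketched against the OpenAI API; everything below (file name, prompt wording, model identifier) is an illustrative assumption rather than the study’s setup:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("figure_S1.png", "rb") as f:     # hypothetical input figure
    image_b64 = base64.b64encode(f.read()).decode()

messages = [
    {"role": "system", "content": "You are a bioinformatics expert."},
    {"role": "user", "content": [
        {"type": "text",
         "text": "This figure summarizes a differential gene expression "
                 "analysis. Please give an overview of the figure."},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
    ]},
]

resp = client.chat.completions.create(model="gpt-4-vision-preview",
                                      messages=messages)
print(resp.choices[0].message.content)

# For the in-depth model, append the reply to `messages` and pose the
# follow-up questions one by one, keeping the full conversation history.
```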

Summative indexing tables for quantitative assessment

We compiled the chat histories from each case study into a unified document (Supplementary Notes S1–S4). This process began with the segmentation of GPT-4V’s responses into discrete statements using spaCy v3.6.140. These statements were then indexed sequentially to facilitate easy referencing. The indexing format adopted is <Case-Replicate-Statement>. In this format, “Case” represents the case study number, with 1 for “RNA”, 2 for “ZNF71”, 3 for “Clonal”, and 4 for “YY1”. “Replicate” indicates the replicate number (1, 2, or 3). “Statement” corresponds to the statement’s sequential position within a specific replicate of a case study. For example, the third statement (3) in the second replicate (2) of the “RNA” case study (1) is indexed as <1-2-3>. Furthermore, the letter “B” is appended to the “Case” slot if the chat history originates from prompts associated with the basal assessment model.
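A minimal sketch of this segmentation and indexing with spaCy (the helper name and example text are hypothetical):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any English pipeline with a parser

def index_statements(response_text, case, replicate, basal=False):
    """Split one GPT-4V response into statements indexed as
    <Case-Replicate-Statement>; "B" marks the basal assessment model."""
    case_label = f"{case}B" if basal else str(case)
    doc = nlp(response_text)
    return [(f"<{case_label}-{replicate}-{i}>", sent.text.strip())
            for i, sent in enumerate(doc.sents, start=1)]

text = "The left panel is a PCA plot. Samples form two clusters."
for idx, stmt in index_statements(text, case=1, replicate=2):
    print(idx, stmt)
# <1-2-1> The left panel is a PCA plot.
# <1-2-2> Samples form two clusters.
```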

Each statement from a chat history was then assigned a color code based on evaluation outcome: blue signifies true statements, red indicates false statements (with specific errors highlighted in yellow), and black denotes statements that were excluded from the assessment, such as chitchat (small talk or gossip). Additional explanations were provided following the statements classified as false to offer clarity on the nature of their inaccuracies.

To facilitate quantitative assessment, we defined six categories to characterize statements from GPT-4V:

1. Plot Recognition: Statements that identify and/or explain plot types.
2. Domain Knowledge: Statements that utilize external knowledge to interpret results.
3. Color Perception: Statements that perceive the colors of visual elements.
4. Positional Inference: Statements that analyze the positional relationships between visual elements.
5. Counts: Statements that count the number of visual elements.
6. Others: Statements that do not fall into the above five categories.

A summative indexing table was then generated for each case study (Supplementary Tables S1–S4). This table documents the categorization of statements, their true/false status, and their origins in terms of replicates and panels. In instances where a statement falls into multiple categories, it is assigned a true/false status for each category independently.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.