1 Introduction

Model-driven engineering (MDE) has been used in diverse engineering fields such as software engineering [1], robotics [2], and the automotive domain [3]. The promised benefits of using this approach include increased development speed, earlier system analysis, and more manageable complexity [4]. However, managing interrelated models of different domains is challenging [5]. Qamar et al. [6] recommend the explicit modeling of the relationships between these models as an approach to manage them.

A number of technologies have been proposed to model the relationships between models explicitly [6,7,8,9]. However, automatic identification of these relationships remains an open problem. The main reason is the heterogeneity of modeling notations: we believe that it is not feasible to develop a tool that parses all existing notations. Moreover, even if such a tool were developed, it would have to be updated every time a modeling notation evolves or a new notation emerges, as stated by Ruscio et al. [10]. Metamodels change over time [11], and the need to update tools/models to support this change is known as the model co-evolution problem [12]. The literature reports that the time spent on this maintenance represents more than 25% of the total effort involved in creating a Domain Specific Language (DSL) [12, 13]. Thus, we believe that managing interrelated models of different domains requires a technology independent of the modeling language(s) used.

We observed that graphical models, independent of the engineering domain, typically combine textual and graphical elements such as boxes, lines, and arrows. Such models can be designed using different tools, which usually can export the model in a structured format such as XML, or in an image format such as PNG or JPEG [14]. From the data extraction perspective, it would be ideal if all models were available in a common structured format. However, this is often not the case, as models might only be available as images [14, 15], as described in the following cases:

  • Unavailability of the source code of the model. Akdur et al. [16] identified that some models are discarded shortly after the engineer presents the model to a colleague. In order to increase the lifespan of the model, engineers take a picture of the model and store it as an image [16, 17].

  • Models are stored as images. Some companies prefer to store models as images due to the impossibility, in the future, of opening the models in their original modeling tools [18]. This issue can occur when the source code of the model is available but the modeling tool that created the model cannot open it due to version incompatibility. Maintaining version compatibility between the source code of the model and the modeling tool can be an expensive and resource-intensive process [18]. Similarly, engineers store 3D CAD models by exporting them to 2D drawings and saving the drawings as images [18]. This way of working is also common practice in free/open source software (FOSS) projects. Hebig et al. [19] investigated 3,295 GitHub projects and identified that more than 50% of the UML files present in those projects are stored in image formats (jpeg, png, gif, svg, bmp). Furthermore, a number of repositories reported in the literature store the models as images [20, 21]. These findings suggest that it is common practice to store models as images [14, 20, 22].

  • Engineers do not use formal modeling languages. Akdur et al. [23] conducted a survey with 627 software engineers from 27 different countries on modeling and model-driven engineering practices in the embedded software industry. They identified that 65% of the participants designed some of their models without using any formal modeling language. Examples of such models include those designed using analog media (paper, whiteboard), and also those designed using computer tools that are not necessarily modeling tools, such as Microsoft PowerPoint. In this case, the engineers can either recreate the models using formal modeling tools, or they can easily export the model as an image or take a picture of it.

Therefore, to ensure maintenance of multi-domain systems, we need a uniform approach that is independent of the peculiarities of the notation. This also means that such a uniform approach can only be based on something that is present in all those models, i.e., text, boxes, and lines. We believe that the initial step toward identifying these relationships is a uniform approach that extracts the data presented in models from different domains.

In the first part of this work, we investigate the suitability of optical character recognition (OCR) as part of this uniform approach independent of the peculiarities of the notation. OCR is a collection of techniques aiming at recognizing text in handwritten or printed documents and exporting the result as machine-encoded text. We start by evaluating two of the best off-the-shelf OCR services, Google Cloud Vision and Microsoft Cognitive Services, for extracting text from a collection of 43 models from different domains. We use precision, recall, and f-measure as metrics to evaluate the results. In our context, precision and recall are the fractions of OCR-extracted texts that are also manually extracted compared to either all OCR-extracted texts (precision), or all manually extracted texts (recall). F-measure is the harmonic mean of precision and recall. The following research questions guide the first part of this work:

Fig. 1 Xamã is built on top of Google Cloud Vision. Xamã processes the output provided by Google Cloud Vision to identify and merge the text fragments that are positioned on multiple lines. As output, Xamã produces the final text fragment list

  • RQ1 How accurate are off-the-shelf OCR services for extracting text from graphical models?

    • Motivation: While OCR techniques have been around since the 1930s, they have not been applied in the context of text extraction from graphical models. Hence, it is crucial to evaluate the accuracy of state-of-the-art off-the-shelf OCR services with respect to this task.

    • Answer: We observe that Google Cloud Vision outperforms Microsoft Cognitive Services, being able to detect 70% of textual elements as opposed to 30% by Microsoft.

  • RQ2 What are the common errors made by OCR services on models from different domains?

    • Motivation: Taking a closer look at the common errors made by Google Cloud Vision is a prerequisite to designing techniques that can improve the precision and recall of OCR services when applied to text extraction from graphical models.

    • Answer: We organize the common errors made by Google Cloud Vision into four categories. The first category of errors is related to non-alphanumerical characters used in the models such as [, \(\{\), < or \(\_\). The second category is mathematical notation commonly used in equations, such as subscripts and Greek letters. The third category of errors is related to spacing and relative positioning of the textual elements. Finally, the last group of errors is related to single-character errors such as characters being wrongly added, removed, or recognized. We observed that the main OCR challenges are related to recognizing text that contains equations, Greek letters, multi-line text, i.e., text fragments positioned on multiple lines, and subscripts.

Based on these findings, in the second part of this work we aim to improve the precision and recall of OCR, focusing on fixing the multi-line text error. We chose to correct this error for two main reasons: the first is that OCR failed to detect textual elements that are positioned on multiple lines. The second reason is related to the long-term goal of our research, which is to support model management focusing on managing interrelated models of different domains. We believe that correcting errors related to mathematical formulas might not be as beneficial as correcting the multi-line text error. This is because even a small difference in one equation, such as the presence of “-” instead of “+”, can lead to a completely unrelated equation.

In order to address the multi-line text error, we developed Xamã (Fig. 1) as an extra layer on top of Google Cloud Vision. This tool includes two approaches that identify whether the elements are positioned on a single line or on multiple lines. As a consequence, we merge those identified as positioned on multiple lines, avoiding the multi-line text error. To achieve this goal, i.e., identifying whether the text is positioned on multiple lines or not, we first investigated the similarities of textual elements present in the models used in the first part of this work. Based on this analysis, we defined a set of heuristics and applied them to a new collection of models to evaluate the accuracy of our approach. The second approach is a combination of a modified version of these heuristics with a shape detection feature. The shape detection feature is a collection of image processing algorithms used to identify shapes, such as boxes, present in the models.

Additionally, we evaluate the overall improvement of using Xamã and compare the results to a state-of-the-art domain-specific tool. We selected Img2UML [14, 22] as the domain-specific tool, and we compared the results by applying Img2UML to a collection of 20 UML class diagrams. In this comparison, we do not take the identification of classes and relationships into account because the focus of this paper is to evaluate the use of OCR in extracting text from models. The following research questions guide the second part of this work:

  • RQ3 How accurate are the heuristics of Xamã in identifying multi-line problems?

    • Motivation: Evaluating the accuracy of the heuristics of Xamã is important to guarantee that they do not cause worse precision and recall values.

    • Answer: We applied our heuristics to a collection of 51 models (10 Matlab Simulink models, 20 UML diagrams, and 21 models from scientific papers). Our heuristics without shape detection correctly identified 905 out of 1171 elements, presenting a precision of 75%, recall of 77%, and f-measure of 76%. With shape detection, the number of correctly identified elements increases to 956, improving the precision, recall, and f-measure to 84%, 82%, and 83%, respectively. Evaluating the results by model domain, we observed that Xamã, with the shape detection feature, presents statistically higher precision and recall on Matlab Simulink models and the models from the scientific papers than without this feature. The results are inconclusive for UML diagrams.

  • RQ4 What is the overall impact, in terms of precision and recall, of using our approaches?

    • Motivation: Due to the existence of other kinds of errors, correcting the multi-line text error does not guarantee an improvement of the overall accuracy.

    • Answer: Xamã, without shape detection, presents better results in 22 (out of 51) models in terms of precision, and in 16 models in terms of recall. With shape detection, it presents better results in terms of precision in 29 models, and recall in 22 models. Evaluating these results by domain, we observed that Xamã presents statistically better precision on Matlab Simulink models and the models from the scientific papers. Regarding recall, Xamã presents statistically better results on these two groups of models when using the shape detection feature. For UML diagrams, the results are inconclusive.

  • RQ5 How accurate is Xamã compared to a state-of-the-art domain specific tool?

    • Motivation: Since it is expected that a domain-specific tool would outperform a generic tool, we would like to know how well Xamã performs compared to a domain-specific tool with respect to text recognition.

    • Answer: We observed that Xamã correctly recognized 433/431 (without/with shape detection) out of 614 elements in class diagrams, while Img2UML correctly recognized 171 elements.

This paper is an extension of our previous work [24]. While the conference paper focused on the evaluation of the existing OCR techniques (RQ1 and RQ2), in this extension, we propose two approaches to improve the OCR precision and recall focusing on correcting the error caused by the misinterpretation of one textual element as two separate elements (RQ3, RQ4 and RQ5).

To encourage replication of our work, the data we have collected and the source code we have used to perform the analysis are available at: bit.ly/DataOCRExtension

The remainder of this paper is organized as follows. Section 2 presents our previous study aimed at investigating the suitability of OCR in extracting text from models from different domains. In Sect. 3, we present the second study in which we address the multi-line text error. Section 4 presents the threats to validity, and the actions we took in order to mitigate the threats. In Sect. 5, we present a summary of our studies, discussing the results and indicating possible future work directions. Section 6 presents the related work. Finally, the conclusion is presented in Sect. 7.

2 Suitability of optical character recognition for text extraction from graphical models

In this section, we present our previous study [24] aimed at investigating the suitability of optical character recognition (OCR) as a uniform approach to extract data from models from different domains. Specifically, we investigate the accuracy of off-the-shelf OCR services for extracting text from graphical models (RQ1), and the common errors made by OCR (RQ2). In Sect. 3, we build upon this study and propose two approaches to correct the most common error produced by OCR.

2.1 Methodology

To answer RQ1 we apply Google Cloud Vision and Microsoft Cognitive Services to a collection of 43 models from different domains. To answer RQ2, we focus on the OCR service that has been shown to perform better on RQ1 and inspect the errors made by the service.

2.1.1 Models Selection

For reproducibility reasons, we arbitrarily select models from two open UML repositories [25, 26], three control system engineering papers [27,28,29], and the example catalog of Matlab Simulink. In total, we analyzed 43 models, as presented in Table 1. We only require the models to be graphical models, i.e., they must contain a mix of textual and graphical elements. Diagrams are graphical representations of parts of a model [30]. Therefore, in the context of this study, we use the term “model” to represent diagrams as well.

We select Matlab Simulink models because of their high adoption in industry. These models are available on the official website as an example catalog, and they describe control systems from different domains including automatic climate control, robot arm control, and fault-tolerant fuel control. We also include models from three scientific papers on control system engineering. The models from these papers are an intelligent control architecture of a small-scale unmanned helicopter, an actuator control system, and an x-ray machine.

Among the UML models, we focus on class diagrams, sequence diagrams, and use case diagrams. These models are stored in two repositories: Git UML [26] and Models-db [25]. The former automatically generates diagrams from source code stored in git repositories. Models-db is automatically populated by crawlers identifying models in public GitHub repositories.

Table 1 List of models used to answer RQ1

2.1.2 Text Extraction

In order not to bias the evaluation toward a specific engineering domain, we opt for general-purpose OCR techniques. Several OCR services are available off the shelf, including Google Cloud Vision, Microsoft Cognitive Services, and Amazon AWS Rekognition. For this work, we select Google Cloud Vision and Microsoft Cognitive Services: these services have been shown to be effective in recognizing text from photographs of the pages of the Bible [57], and to outperform Amazon AWS on images of business names or movie names [58].

2.1.3 Measures for Accuracy

The validation consists of manually identifying the text in the graphical models, and comparing the text extracted by OCR to the manually identified text. When deciding whether the OCR-extracted text matches the manually extracted one, we do not distinguish between letter cases, i.e., Velocity is seen as the same as veLoCitY because these words still have the same meaning. We do distinguish between differently chunked texts, i.e., given the manually identified text Velocity control, an OCR extraction of Velocity and Control as two separate texts is counted as wrong.

As common in information retrieval tasks, we report precision, recall, and f-measure, i.e., the harmonic mean of precision and recall. In our context, precision is the fraction of OCR-extracted texts that are also manually extracted compared to all OCR-extracted texts, and recall is the fraction of OCR-extracted texts that are also manually extracted compared to all manually extracted texts.
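Writing \(O\) for the set of OCR-extracted texts of a model and \(M\) for the set of manually extracted texts (matched case-insensitively as described above), these definitions can be restated in formula form:

\[ \textit{precision} = \frac{|O \cap M|}{|O|}, \qquad \textit{recall} = \frac{|O \cap M|}{|M|}, \qquad \textit{f-measure} = \frac{2 \cdot \textit{precision} \cdot \textit{recall}}{\textit{precision} + \textit{recall}}. \]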

2.2 Results

2.2.1 RQ1: How accurate are off-the-shelf OCR services for extracting text from graphical models?

Overall, Google Cloud Vision correctly detected 854 out of 1232 elements, while Microsoft Cognitive Services correctly detected 388 elements. This observation concurs with previous evaluations of these OCR services. Indeed, on photographs of the pages of the Bible, Reis et al. [57] observed that Google Cloud Vision had a relative effectiveness of 86.5% as opposed to 77.4% for Microsoft Cognitive Services. On images of business names or movie names [58], Google Cloud Vision achieved 80% in both precision and recall as opposed to 65% precision and 44% recall for Microsoft Cognitive Services.

Hence, we hypothesize that also on our dataset Google Cloud Vision will outperform Microsoft Cognitive Services in terms of both precision and recall. Formally, we state the following hypotheses:

  • \(H^p_0\): The median difference between the precision for Google Cloud Vision and Microsoft Cognitive Services is zero.

  • \(H^p_a\): The median difference between the precision for Google Cloud Vision and Microsoft Cognitive Services is greater than zero.

  • \(H^r_0\): The median difference between the recall for Google Cloud Vision and Microsoft Cognitive Services is zero.

  • \(H^r_a\): The median difference between the recall for Google Cloud Vision and Microsoft Cognitive Services is greater than zero.

To test these hypotheses, we perform two paired Wilcoxon signed-rank tests, one for precision and another one for recall. The p values obtained for precision and recall are \(1.9\times 10^{-7}\) and \(2.8\times 10^{-9}\), respectively. Hence, we can reject \(H^p_0\) and \(H^r_0\) and state that Google Cloud Vision outperforms Microsoft Cognitive Services.
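As an illustration of this analysis (not the authors' original script), the paired one-sided test can be run with SciPy; the per-model precision lists below are hypothetical placeholders, not values from our dataset:

```python
from scipy.stats import wilcoxon

# Hypothetical per-model precision values for the two OCR services
# (paired: entry i of both lists refers to the same model).
google_p = [0.82, 0.75, 0.64, 0.91, 0.58]
microsoft_p = [0.41, 0.52, 0.30, 0.66, 0.25]

# One-sided paired Wilcoxon signed-rank test:
# H_a: Google Cloud Vision precision is greater than Microsoft's.
stat, p_value = wilcoxon(google_p, microsoft_p, alternative="greater")
print(f"W = {stat}, p = {p_value:.4g}")
```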

To illustrate this argument, consider Fig. 2. It summarizes precision (y-axis) and recall (x-axis) organized by the type of model. Indeed, we can observe that while the precision and recall obtained by Google Cloud Vision mostly exceed 0.5, the precision and recall obtained by Microsoft Cognitive Services are mostly below 0.5. Moreover, the data for both Google Cloud Vision and Microsoft Cognitive Services suggest a linear relation between precision and recall: indeed, while the number of textual elements extracted by the OCR tools is often close to the number of manually identified textual elements, the textual elements themselves are imprecise.

Finally, while Google Cloud Vision extracted some textual information from all models, Microsoft Cognitive Services failed on two models: a Matlab Simulink model [51] and Fig. 4.b from the paper by Ai et al. [27].

Fig. 2 Precision (y-axis) and recall (x-axis) obtained by Google Cloud Vision (top) and Microsoft Cognitive Services (bottom)

Table 2 OCR service that presents statistically better results organized by the domain

2.2.2 Performance on models of different domains

While the previous discussion indicates that overall Google Cloud Vision outperforms Microsoft Cognitive Services, a priori this does not imply that this should also be the case for models of different domains. This is why we formulate the corresponding hypotheses separately for UML diagrams, Matlab Simulink models, and models from scientific papers. We test these hypotheses using paired Wilcoxon signed-rank tests, one for precision and another one for recall. However, since we perform multiple comparisons, we need to adjust the p values to control for the false discovery rate. We use the method proposed by Benjamini and Hochberg [59].
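For completeness, a minimal sketch of this adjustment using statsmodels follows; the raw p values shown are placeholders, not the ones obtained in our tests:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p values for the six comparisons
# (UML, Simulink, papers) x (precision, recall).
raw_p = [0.001, 0.004, 0.020, 0.003, 0.310, 0.008]

# Benjamini-Hochberg adjustment controlling the false discovery rate.
reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
for p_raw, p_adj, rej in zip(raw_p, adjusted_p, reject):
    print(f"raw={p_raw:.3f}  adjusted={p_adj:.3f}  significant={rej}")
```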

The adjusted p values are below the commonly used threshold of 0.05 for five of the six comparisons (three types of models \(\times \) precision or recall). We conclude that Google Cloud Vision outperforms Microsoft Cognitive Services: for models of all domains in terms of recall; for UML diagrams and Matlab Simulink models in terms of precision as presented in Table 2.

Fig. 3 F-measure for all analyzed models. The mapping of paper model IDs to the models from the selected scientific papers is presented in Table 11 in the “Appendix”

2.2.3 Performance on individual models

Figure 3 shows that the f-measure for Google Cloud Vision is higher than for Microsoft Cognitive Services on 33 models, as opposed to five models where Microsoft Cognitive Services scores higher. For the remaining five models, the f-measures are equal.

Inspecting Fig. 3, we also notice that for six models Microsoft Cognitive Services has a precision equal to zero, i.e., either no textual elements have been extracted (Matlab Simulink 4 and Paper model 3.3) or all extracted textual elements are wrong (UML class diagram 8, UML use case 4, UML sequence diagrams 2 and 4). Unfortunately, we cannot precisely state the reasons why Microsoft Cognitive Services failed to process these models. Possible reasons could be related to the quality of the images or the font size. However, these are unlikely explanations, since all used images are of good quality and Google Cloud Vision managed to process the same images. In this study, we did not investigate the reasons for the poor performance of Microsoft Cognitive Services further.


2.2.4 RQ2: what are the common errors made by OCR services on models from different domains?

Based on our answer to RQ1, one would prefer Google Cloud Vision as the OCR service to be integrated in a multi-domain model management solution. In this section, we take a closer look at the errors made by Google Cloud Vision: addressing these errors is necessary in order to make OCR suited for multi-domain model management.

Table 3 Number of models affected by the identified problems

Table 3 summarizes the results of manual analysis of the errors made by Google Cloud Vision:

  • The first category of errors is related to non-alphanumerical characters used in the models such as [, \(\{\), <, or \(\_\). These characters are sometimes confused with each other or missed by the OCR, e.g., the name of the element is ‘file_version’ and OCR detects ‘file version’, without the underscore.

  • Engineering models can involve mathematical formulas such as equations, including subscripts and Greek letters.

  • The next group of errors is related to spacing and relative positioning of the textual elements. For example, due to space limitations text can be positioned on multiple lines, causing OCR to misinterpret one textual element as two separate elements; we call this error Multi-line Text. When this misinterpretation happens in a textual element positioned on one single line, we call this error Split Element. The difference between Multi-line Text and Split Element is that the latter occurs in a textual element written on one single line that has an empty space between the words, causing this misinterpretation. The opposite of the Split Element error is Mix of Elements. Mix of Elements occurs when OCR mixes the names of different elements due to their proximity.

  • Finally, the last group of errors is related to single-character errors such as characters being wrongly added, removed, or recognized. An example of such an error is Character Confusion. This error occurs when OCR is not capable of identifying a letter due to its similarity to other letters. For instance, the name of the element is ‘DeleteNodeById()’. However, OCR interprets the capital letter “I” as the lowercase “l,” returning ‘DeleteNodeByld()’. The difference between Character Confusion and Wrong Char is that the former occurs between similar characters, e.g., the letter “o” and the number “0,” whereas Wrong Char can occur between any characters.

Table 3 shows that the errors present in the largest number of models are Multi-line Text, Empty Space between Letters, Missing Char, and Wrong Char. However, the number of models affected by the errors should be compared to the number of models that can be affected by those errors: while wrong characters might appear in any model, errors related to underscores can only be present if the models contain underscores.

Hence, Table 4 summarizes the number of models that can be affected (candidate models) and the models that are affected by errors. Similarly, it includes the number of elements that can be affected (candidate elements) and are affected by the errors.

Inspecting Table 4, we observe that the Curly Brackets, Equations, Greek letters, Multi-line String, Parentheses, Subscript, and Underscore errors occur in every single model that has the corresponding elements.

Even though the Parentheses and Underscore problems arise in 100% of the candidate models, Google Cloud Vision correctly identified 92% of the textual elements that have parentheses and 60% of the textual elements that have underscores; this is in sharp contrast with Equations, Greek Letters, Multi-line String, and Subscript, which could not be recognized by Google Cloud Vision at all.

Table 4 Candidate Elements are the elements that contain characters that can cause a problem

3 Addressing an OCR limitation: the multi-line error

The long-term goal of our research is to support model management, focusing on managing interrelated models of different domains. It is known that explicit modeling of the relationships between models from different domains can support this goal. Indeed, explicit modeling of the relationships facilitates the identification of the models affected by a change. While explicit modeling of the relationships can be done using a number of approaches, little is known about how to do this automatically. Toward this goal, we first need an approach to extract the information presented in various kinds of models. In Sect. 2, we investigated the use of OCR as a basis for such an approach, and the common errors produced by OCR.

In this section, we improve the precision of OCR by correcting the errors caused by the misinterpretation of one textual element as two separate elements. We call this the multi-line text error. We decided to correct this kind of error first because OCR failed to recognize these elements in 100% of the cases. Although the identification of mathematical formulas also represents a main OCR challenge, we believe that fixing errors related to mathematical formulas might not be as beneficial to the long-term goal of our research as correcting the multi-line text error. The reason is that even a small difference in one equation, such as the presence of “-” instead of “+”, can lead to a completely unrelated equation, making it difficult to identify relationships between models.

3.1 Methodology

We observed that OCR services fail to detect textual elements that are positioned on multiple lines. In order to address this error, we developed Xamã. This tool includes two approaches: the first approach is the use of a collection of heuristics that take into consideration parameters such as the text alignment, the distance between the text fragments, and the size of the words and letters. The second approach is the combination of these heuristics with image processing. When defining these approaches, we aimed at solutions that would be independent of OCR engines and that would not introduce considerable overhead in terms of running time or memory consumption.

3.1.1 Identifying similarities

The main goal of the second part of this study is to propose a generic solution to the multi-line text error. We need a generic solution because of the heterogeneity of models. The first step to fix the multi-line text error is to identify the text that is positioned on multiple lines. For the sake of generalizability of the approach to be designed, we focus on common characteristics of the text positioned on multiple lines. To identify these characteristics, we analyzed the collection of models used to answer RQ1 and RQ2, and checked the characteristics of the multi-line text. Based on these characteristics, we defined a collection of heuristics capable of detecting text fragments positioned on multiple lines.

A possible approach to resolving multi-line errors might consist in merging all text fragments that are located on top of and close to each other. However, this approach would be overly eager and would merge unrelated text fragments, e.g., attributes and operations in a UML class diagram such as the one in Fig. 4.

Fig. 4 Example of a UML class diagram

We propose two approaches to correct the multi-line text error. The first approach applies a set of heuristics to classify whether the text should be merged, taking into consideration the coordinates of the text fragments. The following is the description of these heuristics, which we call regular heuristics (RH):

  • Alignment. We observed that the text fragments that are positioned on multiple lines use centered alignment in most of the models (25 out of 28). Therefore, we opted to take this alignment format into consideration when evaluating whether the text fragments should be merged or not.

  • Text Distance. Text fragments positioned below each other do not necessarily represent the same element. To decide whether the text fragments should be merged, we take into consideration the font size (height) and the distance (horizontal and vertical) between the text fragments. The threshold for the vertical distance between the text fragments is half of the font size. The horizontal distance is the distance between the centers of both text fragments. In this study, we consider that the horizontal distance between the centers of the text fragments cannot be greater than 3 times the length of the letters. These thresholds have been chosen in order to obtain optimal performance on the data set described in Sect. 2. For example, we observed that increasing the threshold of the horizontal distance to 4 times the length of the letters leads to an increase in false positives. Figure 5 illustrates how we calculate the text distance.

Fig. 5 Example of how we calculate the variables used in our heuristics for two text fragments that represent the element “Application Controller.” “CP” means the central point of the word. “HD” and “VD” mean horizontal and vertical distances, respectively. “LL” means the length of the letters. “N” means the number of letters in a word
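To make the Text Distance heuristic concrete, the sketch below checks the two thresholds (vertical gap at most half the font height, horizontal center offset at most three times the letter length) for a pair of text fragments described by their bounding boxes. The data structure and helper names are our own illustration, not the actual Xamã implementation:

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    text: str
    x: float       # left coordinate of the bounding box (pixels)
    y: float       # top coordinate of the bounding box (pixels)
    width: float   # bounding box width
    height: float  # bounding box height (used as the font size)

    @property
    def center_x(self) -> float:
        return self.x + self.width / 2

    @property
    def bottom(self) -> float:
        return self.y + self.height

    @property
    def letter_length(self) -> float:
        # Average width of one letter in this fragment.
        return self.width / max(len(self.text), 1)


def should_merge(top: Fragment, bottom: Fragment) -> bool:
    """Text Distance heuristic (RH): merge if 'bottom' lies just below 'top'."""
    vertical_gap = bottom.y - top.bottom
    horizontal_offset = abs(top.center_x - bottom.center_x)
    font_size = top.height
    return (0 <= vertical_gap <= font_size / 2 and
            horizontal_offset <= 3 * top.letter_length)


# Example: the two lines of the element "Application Controller"
# (coordinates are illustrative).
upper = Fragment("Application", x=100, y=50, width=110, height=14)
lower = Fragment("Controller", x=105, y=68, width=100, height=14)
print(should_merge(upper, lower))  # True under these illustrative coordinates
```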

The second approach consists of a slightly modified version of the previous heuristics, combined with image processing algorithms that detect the shapes present in the models. For the image processing, we use OpenCV [60], an open source computer vision and machine learning software library. This library has more than 2500 optimized algorithms and is used by companies such as Google, Yahoo, Microsoft, Intel, IBM, Sony, Honda, and Toyota, with an estimated number of downloads exceeding 18 million [60].

In order to improve the quality of the shape detection, we first convert the images to black and white, and then we blur them using a Gaussian kernel filter [61], reducing the high-frequency noise in the images. To identify the shapes present in the images, we use the findContours algorithm with simple approximation as the contour approximation method. We opt for simple approximation to reduce memory consumption. Once we have the list of identified shapes, we use the pointPolygonTest algorithm to perform a point-in-contour test to determine whether the text fragments (provided by Google Cloud Vision) are inside the identified shapes.
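A minimal OpenCV sketch of this pipeline could look as follows; the thresholding step and parameter values are our assumptions, since the text above only names the blur, findContours, and pointPolygonTest steps:

```python
import cv2

def fragments_by_shape(image_path, fragment_centers):
    """Group text-fragment center points (x, y) by the contour that contains them."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Gaussian blur reduces high-frequency noise before contour detection.
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    # Binarize so that shapes such as boxes stand out (threshold value is illustrative).
    _, binary = cv2.threshold(blurred, 200, 255, cv2.THRESH_BINARY_INV)
    # Simple approximation keeps only contour end points, reducing memory consumption.
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

    groups = {i: [] for i in range(len(contours))}
    for center in fragment_centers:  # each center is a (float, float) tuple
        for i, contour in enumerate(contours):
            # pointPolygonTest returns >= 0 when the point is inside or on the contour.
            if cv2.pointPolygonTest(contour, center, False) >= 0:
                groups[i].append(center)
                break
    return groups
```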

Next, we apply a series of heuristics. Instead of applying the RH as defined above, we slightly modify them. Indeed, grouping the text fragments into the shapes they belong to allows us to apply the heuristics in a less strict manner. While in the RH the threshold for the vertical distance is half of the font size, for the modified heuristics, which we call shape detection heuristics (SDH), the new threshold becomes the font size. This value was derived by trying different values to obtain optimal accuracy. This is the only difference between the heuristics used in the two approaches.

The merging process is done on two text fragments at a time. For example, consider the model in Figure 6. There are three elements in this model: Navigation System, Application Controller, and Engine Speed Selector. Because the text is positioned on multiple lines, the OCR service recognizes Application Controller as two elements, “Application” and “Controller,” and Engine Speed Selector as three elements, “Engine,” “Speed,” and “Selector.” In this example, Xamã starts by checking whether “Navigation System” should merge with “Application,” “Controller,” “Engine,” “Speed,” and “Selector.” In case of a merge, the merged text fragments become one, and we repeat the operation until all text fragments are checked. The merging is only considered correct if the order of the text fragments is correct. Taking the element Application Controller as an example, the merging is correct if the final result is “Application Controller” and wrong if it is “Controller Application.”
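A compact sketch of this pairwise merging loop, reusing the Fragment structure and should_merge check from the earlier sketch, is shown below; again, this is our own illustration rather than the actual Xamã code:

```python
def merge_fragments(fragments, merge_check):
    """Repeatedly merge pairs of fragments until no pair satisfies the check."""
    fragments = list(fragments)
    merged = True
    while merged:
        merged = False
        for i, top in enumerate(fragments):
            for j, bottom in enumerate(fragments):
                if i != j and merge_check(top, bottom):
                    # Keep the reading order: the upper fragment comes first.
                    combined = Fragment(
                        text=f"{top.text} {bottom.text}",
                        x=min(top.x, bottom.x),
                        y=top.y,
                        width=max(top.width, bottom.width),
                        height=bottom.bottom - top.y,
                    )
                    fragments = [f for k, f in enumerate(fragments) if k not in (i, j)]
                    fragments.append(combined)
                    merged = True
                    break
            if merged:
                break
    return fragments

# Usage: merged_list = merge_fragments(ocr_fragments, should_merge)
```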

Fig. 6 An example of a model in which the multi-line text error occurs on the Application Controller and Engine Speed Selector elements

We selected Google Cloud Vision as the OCR technology because of its high precision and recall, as shown in the previous section. It is worth mentioning that Xamã works with any OCR service, as long as the service also provides the coordinates of the text fragments present in the image. Our approach works as follows: we submit a collection of models to Google Cloud Vision, which returns the list of text fragments and their coordinates with respect to their models. Xamã processes the list of text fragments following a set of heuristics (with or without the shape detection feature) to identify and merge the text that is positioned on multiple lines, and exports a new list of text fragments. Every text fragment is one element of the model. Finally, we compare the final version of the text fragments with the ones manually extracted. These steps are summarized in Fig. 7.
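To illustrate the first step, the sketch below obtains text fragments and their bounding-box coordinates with the official google-cloud-vision Python client; error handling and authentication setup are omitted, and this is an illustration rather than the authors' implementation:

```python
from google.cloud import vision

def extract_fragments(image_path):
    """Return (text, bounding-box vertices) pairs detected by Google Cloud Vision."""
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.text_detection(image=image)
    fragments = []
    # The first annotation is the full text of the image; skip it.
    for annotation in response.text_annotations[1:]:
        vertices = [(v.x, v.y) for v in annotation.bounding_poly.vertices]
        fragments.append((annotation.description, vertices))
    return fragments
```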

Fig. 7 Google Cloud Vision extracts a list of text fragments from a set of models. Xamã receives this list as input and processes it to identify and merge the text fragments that are positioned on multiple lines. As output, Xamã produces a final text fragment list to be compared with the text fragments extracted manually

3.1.2 Limitations

When we submit an image to be processed by Google Cloud Vision, we obtain as a response the recognized text and the coordinates (x, y), in pixels, of the text in the image. Xamã uses these coordinates to verify whether the text should be merged. However, in our preliminary experiments we observed that these coordinates are not always reliable. For example, consider the coordinates (x, y) of two textual elements that are aligned to the left. We would expect the x values to be the same for both textual elements. However, this is not always true. Thus, we have to take a tolerance range into account.

Another limitation of Google Cloud Vision is that one cannot specify the version of Google Cloud Vision they want to use. This can be problematic because the results of Google Cloud Vision, including coordinates computed on the same input model, changed over time. We conjecture that this happened due to an update of the service. However, this limitation would occur when using any other external API/service where it is not possible to specify the version to be used.

3.1.3 Models selection

The methodology used to select the models to answer RQ3, RQ4, and RQ5 is similar to the one used to answer the first two research questions. The difference is the number of models. For this study, we selected 51 models instead of 43, and we used a different repository for UML sequence diagrams. In the previous model selection, we found five sequence diagrams in the Models-DB repository. However, for this study we could not find five additional sequence diagrams; we found only one in the Models-DB repository. The other four sequence diagrams were arbitrarily selected from the Lindholmen Dataset [19].

The selected models are organized as follows: 21 models from the scientific papers, 10 models from the example catalog of Matlab Simulink, and 20 UML diagrams (10 class diagrams, five use case diagrams, and five sequence diagrams). The majority (70%) of the models in this dataset present text fragments positioned on multiple lines. It is worth mentioning that the presence of multi-line text was not a selection criterion for the models. Table 5 presents the models used in this study and the number of models per type that have text on multiple lines.

We also decided to include models that do not present text fragments positioned on multiple lines. The reasoning behind this decision is that there is a possibility of merging text fragments that should not be merged, causing false positives.

Table 5 List of models used in this study
Fig. 8 Precision (y-axis) and recall (x-axis) obtained by Xamã without shape detection (top) and with shape detection (bottom). In this analysis we focus on multi-line text and ignore other error types. There are 12 models (7 class diagrams, 2 sequence and use case diagrams, and 1 Matlab Simulink) with perfect precision and recall without shape detection, and 13 models (5 class diagrams, 2 sequence diagrams and use cases, 3 Matlab Simulink, and 1 model from a scientific paper) with shape detection

3.2 Results

3.2.1 RQ3: how accurate are our heuristics in identifying multi-line problems?

Xamã was designed to identify whether the text is positioned on multiple lines and to merge the fragments that should be merged, correcting the multi-line text error. It is important to stress that in the following analysis we ignore other errors such as character confusion and wrong char. Therefore, to evaluate precision and recall we consider as true positives those text fragments that Xamã correctly identified as positioned on multiple lines, and also those that Xamã correctly identified as positioned on one single line. The reason for this is that it is possible for Xamã to misinterpret two different text fragments as if they were one fragment positioned on multiple lines, causing a false positive.

Fig. 9 F-measure values focusing on the identification and merging of text positioned on multiple lines. The mapping of paper model IDs to the models from the selected scientific papers is presented in Table 11 in the “Appendix”

3.2.2 Without shape detection

We first evaluate our approach without the shape detection feature. Xamã correctly identified 905 out of 1171 elements, with an overall precision of 75%, recall of 77%, and f-measure of 76%. Figure 8 (top) presents the distribution of these values. We observe that the majority of models that present a precision lower than 70% are models from scientific papers. The reasons for low precision are: most of the text fragments are composed of long sentences, for instance “Store the best value for the regularization parameter”; the presence of mathematical equations or subscripts; and text distances between the text fragments higher than the threshold defined in the RH. Most of the models (\(\approx \)66%) are located on the upper right side of the plot, meaning that they present precision and recall higher than 70%.

We also used f-measure as a metric to evaluate the accuracy of Xamã. The f-measure values are presented in Fig. 9. We observe that \(\approx \)66% of the analyzed models have f-measure values greater than 70%. UML diagrams are the models that present the highest values, and models from scientific papers are the models that present the lowest values.

3.2.3 With shape detection

When applying the shape detection feature, Xamã correctly identified 956 elements, improving the overall precision, recall, and f-measure to 84%, 82%, and 83%, respectively. Figure 8 (bottom) presents the distribution of these values. We observe that five models present precision and recall below 50%. Without shape detection, the number of models presenting values below 50% is 11 for precision and eight for recall.

Table 6 Precision, recall, and f-measure obtained by our tool with and without shape detection (SD)

3.2.4 With or without?

The performance of Xamã with and without shape detection is presented in Table 6. By using shape detection, we improved the precision in 19 models (seven Matlab Simulink models and 12 models from the scientific papers), and the recall in 18 models (seven Matlab Simulink models and 11 models from the scientific papers).

We hypothesize that Xamã with the shape detection feature outperforms the version without shape detection in terms of both precision and recall. Formally, we state the following hypotheses:

  • \(H^p_0\): The median difference between the precision for Xamã with and without shape detection is zero.

  • \(H^p_a\): The median difference between the precision for Xamã with and without shape detection is greater than zero.

  • \(H^r_0\): The median difference between the recall for Xamã with and without shape detection is zero.

  • \(H^r_a\): The median difference between the recall for Xamã with and without shape detection is greater than zero.

To test these hypotheses, we perform two paired Wilcoxon signed-rank tests, one for precision and another one for recall. Since we perform repeated tests we need to adjust the p values to control for the false discovery rate. We use the method proposed by Benjamini and Yekutieli [95]. The adjusted p values obtained for precision and recall are \(3.2\times 10^{-5}\) and \(7.3\times 10^{-3}\), respectively. Hence, we can reject \(H^p_0\) and \(H^r_0\) and state that Xamã with shape detection outperforms Xamã without shape detection.

Table 7 Improvement approach that presents statistically better results organized by the domain

We also investigated the corresponding hypotheses separately for UML diagrams, Matlab Simulink models, and models from scientific papers, using the same Wilcoxon signed-rank tests. The adjusted p values are below the commonly used threshold of 0.05 for four of the six comparisons (three types of models \(\times \) precision or recall). In conclusion, the use of shape detection improved the results for Matlab Simulink models and models from the scientific papers in terms of precision and recall. For UML diagrams, the results were inconclusive, as presented in Table 7.

Fig. 10 Ratio between the two approaches (with and without shape detection). The ratio is calculated by dividing the median execution time with shape detection by the median execution time without shape detection. The mapping of paper model IDs to the models from the selected scientific papers is presented in Table 11 in the “Appendix”

As expected, adding shape detection on top of the heuristics incurs additional costs. Therefore, we also measured the time spent executing each approach. For every model, we ran our approaches 40 times, excluded the first 10 results, and calculated the median. The results are presented in Tables 12, 13, and 14 in the “Appendix.” Finally, we use the medians to calculate the ratio presented in Figure 10. Our experiments were executed on a MacBook Pro 16” (2019) with an Intel 2.4-GHz 8-core i9 processor (turbo boost 5.0 GHz), 32 GB 2666-MHz DDR4, AMD Radeon Pro 5300M 4 GB, Intel UHD Graphics 630 1536 MB, and MacOS Catalina.
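A sketch of this timing protocol is shown below; the function names for the two Xamã configurations are placeholders:

```python
import statistics
import time

def median_runtime(run_once, warmup=10, runs=40):
    """Run `run_once` 40 times, discard the first 10 measurements, return the median."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings[warmup:])

# Hypothetical usage, assuming run_with_sd / run_without_sd wrap the two configurations:
# ratio = median_runtime(run_with_sd) / median_runtime(run_without_sd)
```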

Even though shape detection makes the tool 111.28 (median) times slower, both approaches still present reasonable execution times. For example, Figure 3 from the paper by Zhou et al. [64] presents the slowest execution time (with shape detection), around 0.02 seconds. Therefore, using shape detection not only produces better results but also performs reasonably well.


3.2.5 RQ4: what is the overall impact, in terms of precision and recall, of using our approaches?

To answer this research question, we compared the results produced by Google Cloud Vision with the results produced by Xamã with and without shape detection. For the following analyses, even if the identification and merging of the text fragments positioned on multiple lines is performed correctly, the “final” version of the text might still be wrong due to another kind of error, such as missing char or extra char. For example, assume that the OCR service wrongly recognizes the element “Application Controller” from Figure 6 as the two elements “Applicati0n” and “Contr0ller.” By applying Xamã, these elements become “Applicati0n Contr0ller.” Even though the merging is correct, the text fragment is wrong because of another kind of error (character confusion). Therefore, a priori, the correction of the multi-line text error might not lead to higher precision and recall in general.

3.2.6 Without shape detection

We observe the same results in 27 out of 51 models. Xamã presents better results in 22 models in terms of precision and f-measure, and in 16 models in terms of recall. Google Cloud Vision presents better results in one model in terms of precision, and in two models in terms of recall and f-measure. The overall precision and recall are presented in Table 8.

Table 8 Precision, recall, and f-measure obtained by Google Cloud Vision (GCV) and our tool (H/SD), where “H” means without shape detection and “SD” means with shape detection

As expected, the overall improvements are better noticed on those models that present more text fragments positioned on multiple lines. The best improvements on precision, recall, and f-measure are on the following models:

  • Figure 2 from the paper by Dounis et al. [63] with values improving from 13 to 50% (precision), 9 to 27% (recall) and 11 to 35% (f-measure);

  • Figures 1b and 3 from the paper by Zhou et al. [64] with values improving from 17 to 40% (precision), 29 to 57% (recall) and 21 to 47% (f-measure) for Figure 1b, and from 2 to 30% (precision), 3 to 47% (recall) and 2 to 37% (f-measure) for figure 3.

We hypothesize that Xamã will outperform Google Cloud Vision in terms of both precision and recall. Formally, we state the following hypotheses:

  • \(H^p_0\): The median difference between the precision for Xamã and Google Cloud Vision is zero.

  • \(H^p_a\): The median difference between the precision for Xamã and Google Cloud Vision is greater than zero.

  • \(H^r_0\): The median difference between the recall for Xamã and Google Cloud Vision is zero.

  • \(H^r_a\): The median difference between the recall for Xamã and Google Cloud Vision is greater than zero.

To test these hypotheses, we perform two paired Wilcoxon signed-rank tests, one for precision and another one for recall, as described previously in RQ3.

The adjusted p values obtained for precision and recall are \(5.7\times 10^{-6}\) and \(2.3\times 10^{-4}\), respectively. Hence, we can reject \(H^p_0\) and \(H^r_0\) and state that Xamã without shape detection outperforms Google Cloud Vision.

This discussion indicates that while overall Xamã outperforms Google Cloud Vision, this does not imply that this should also be the case for models of different domains. This is why we formulate the corresponding hypotheses separately for UML diagrams, Matlab Simulink models, and models from scientific papers. They were tested using the same paired Wilcoxon signed-rank tests as before.

The adjusted p values are below the commonly used threshold of 0.05 for two of the six comparisons (three types of models \(\times \) precision or recall). In conclusion, Xamã improved the results for Matlab Simulink models and models from the scientific papers in terms of precision. For UML diagrams, the results were inconclusive, as presented in Table 9.

3.2.7 With shape detection

We observe that Xamã presents the same results as Google Cloud Vision in terms of both precision and recall in 19 models. Xamã presents better results in terms of precision in 29 models, and recall in 22 models. Google Cloud Vision presents better results in three models in terms of precision, and in five models in terms of recall and f-measure. The best improvements in precision, recall, and f-measure are in the following models:

  • Figure 1 from the paper by Blasera et al. [62] with values improving from 11 to 60% (precision), 17 to 63% (recall) and 13 to 61% (f-measure);

  • Figures 1b and 3 from the paper by Zhou et al. [64], which also presented the best improvements when using Xamã without the shape detection feature. However, using shape detection, Xamã could improve the results even further. For Figure 1b, the precision is 75%, recall is 86%, and f-measure is 80%. For Figure 3, the precision is 78%, recall is 88%, and f-measure is 82%.

Since Xamã without the shape detection feature outperformed Google Cloud Vision, we expected that Xamã with the shape detection feature would also outperform Google Cloud Vision. Therefore, we tested these hypotheses using the same two paired Wilcoxon signed-rank tests. As a result, the tests confirm our hypothesis, presenting adjusted p values of \(1.3\times 10^{-7}\) and \(3.1\times 10^{-5}\) for precision and recall, respectively.

We also tested these hypotheses separately for UML diagrams, Matlab Simulink models, and models from scientific papers. Using the shape detection feature, Xamã improved the results for Matlab Simulink models and models from scientific papers, but for UML models the tests presented inconclusive results. It is worth mentioning that the number of text fragments positioned on multiple lines, causing the multi-line text error, is not high in UML diagrams (17 out of 604 elements). Therefore, there are not enough multi-line text errors in the UML diagrams for Xamã to correct that could lead to statistically better results. The distribution of the elements that could cause the multi-line text error is the following: nine elements in three class diagrams, three elements in two sequence diagrams, and five in two use cases.

Table 9 OCR service that presents statistically better results organized by the domain

3.2.8 RQ5: how accurate is Xamã compared to a domain specific tool?

We investigate how Xamã, which is a combination of a general-purpose OCR technique and our generic approach (with and without the shape detection feature), performs compared to a domain-specific tool. To answer this question, we select Img2UML as the domain-specific tool because of its accuracy in extracting UML models from images [22].

Img2UML [14, 22] is a system written in VB.NET that uses the MODI library to extract text from images. The system takes an image of a class diagram as input, and as output it generates an XMI file that represents the class diagram, which can be visualized with the StarUML CASE tool.

Since Img2UML has been designed specifically for UML class diagrams, we compare precision and recall only for this type of model. To answer this research question, we analyzed 20 UML class diagrams: the UML class diagrams used to answer RQ1 and RQ2, and 10 additional UML class diagrams [96,97,98,99,100,101,102,103,104,105]. It is worth mentioning that we only compare the accuracy related to text recognition. The identification of classes and relationships is out of the scope of this study.

Overall, Xamã correctly recognized 433/431 (without/with shape detection) out of 614 elements, while Img2UML correctly recognized 171 elements. Table 10 presents the precision, recall, and f-measure obtained for these class diagrams. We observe that Xamã presented better results than Img2UML in all cases. Therefore, we hypothesize that Xamã outperforms Img2UML in terms of both precision and recall. Formally, we state the following hypotheses:

  • \(H^p_0\): The median difference between the precision for Xamã and img2UML is zero.

  • \(H^p_a\): The median difference between the precision for Xamã and img2UML is greater than zero.

  • \(H^r_0\): The median difference between the recall for Xamã and img2UML is zero.

  • \(H^r_a\): The median difference between the recall for Xamã and img2UML is greater than zero.

To evaluate these hypotheses, we used the same statistical tests as before. Xamã with or without the shape detection feature does not present significantly different values for precision and recall on UML class diagrams. Therefore, we expect the p values obtained from the tests between Img2UML and Xamã with shape detection, and between Img2UML and Xamã without shape detection, to be the same. The adjusted p value obtained for both precision and recall is \(1.4\times 10^{-6}\). Hence, we can reject \(H^p_0\) and \(H^r_0\) and state that Xamã outperforms img2UML.

We also calculate Cliff’s delta [106], a nonparametric effect size measure for ordinal data. We consider the effect size values \(|d|<0.147\) “negligible,” \(|d|<0.33\) “small,” \(|d|<0.474\) “medium,” and “large” otherwise, as defined by Romano et al. [107]. We obtained a large Cliff’s delta value of -0.8625, with a 95% confidence interval between -0.98 and -0.25.
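For reference, Cliff’s delta can be computed directly from its definition, \(d = (\#\{x_i > y_j\} - \#\{x_i < y_j\}) / (mn)\); the sketch below, with hypothetical per-diagram precision values, is our own illustration rather than the script used in this study:

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs x>y - #pairs x<y) / (len(xs) * len(ys))."""
    greater = sum(1 for x in xs for y in ys if x > y)
    smaller = sum(1 for x in xs for y in ys if x < y)
    return (greater - smaller) / (len(xs) * len(ys))

# Example with hypothetical per-diagram precision values.
img2uml_precision = [0.30, 0.25, 0.40, 0.35]
xama_precision = [0.70, 0.65, 0.80, 0.75]
print(cliffs_delta(img2uml_precision, xama_precision))  # -1.0: all Xamã values are higher
```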

Table 10 Precision, recall, and f-measure obtained by Img2UML and Xamã

4 Threats to validity

As with any empirical study, our work is subject to threats to validity. Wohlin et al. [108] provide a list of possible threats that researchers can face during scientific research. In this section, we describe the actions we took in order to increase the validity and decrease the threats.

Internal validity concerns the unknown influences that independent variables can have on studies. In order to mitigate this concern, we selected OCR services that have been evaluated by previous studies on different text recognition tasks. While the manual extraction of textual elements has been performed by one author only, the task is simple for an engineer and is unlikely to be affected by the subjectivity of their judgment. Another threat is that we do not know which version of Google Cloud Vision we used, nor can we choose the version. As a consequence, it is possible that a future Google Cloud Vision update might break our heuristics. In order to mitigate this problem, we opted to be as conservative as possible regarding the size of the tolerance range used to identify whether the text is positioned on multiple lines or not.

External validity concerns the generalizability of the results and findings of the study. In order to mitigate this concern, we have diversified the collection of models analyzed to include models from different domains and different sources.

Construct validity concerns issues related to the design of the experiment. In order to address this issue, we used metrics that were sufficiently defined in previous studies. Examples of such metrics are precision, recall, and f-measure. We used these metrics to indicate which OCR service presents better performance.

Conclusion validity concerns the relations between the conclusions that we draw and the analyzed data. In order to mitigate this concern, we paid special attention to using appropriate statistical techniques, and we described all decisions we made. Thus, this study can be replicated by other researchers, and we expect our results to be quite robust.

5 Discussion and future work

The results described in this paper can serve as a starting point for future research on the use of OCR for multi-domain model management, as well as for the design of tools supporting multi-domain model management. Our expectation is that OCR can be used to support multi-domain model management by identifying the relationships between models from different domains, in particular in those scenarios where only the images of the models are available, or where the source code of the models is available but there is no communication between the modeling tools.

We started by investigating the accuracy of the off-the-shelf OCR services for extracting text from graphical models. Consistent with the previous studies [57, 58], Google Cloud Vision outperformed Microsoft Cognitive Services on both precision and recall. However, the precision and recall values of Google Cloud Vision were not as high as the ones presented in the previous studies [58]. We believe this is due to the difference between the analyzed items: graphical models vs. business names. As opposed to business names, graphical models often include mathematical elements such as Greek letters and subscripts, and non-alphanumeric characters. Moreover, extracting text from models that do not follow the same design rules incurs additional challenges. Indeed, the precision and recall scores for models from scientific papers are much more spread out in Figure 2 than for models from other data sources.

Next, we investigated the common errors produced by Google Cloud Vision. We identified 17 different types of errors organized into four categories: non-alphanumeric characters, mathematical formulas, spacing, and character confusion. The most common errors are related to spacing and character confusion; however, the main challenges seem to be related to the multi-line text error and mathematical formulas: not a single Greek letter, subscript, or equation appearing in the models could be correctly identified. Therefore, we proposed two approaches to correct the multi-line text error. To achieve this goal, we investigated a set of models that have this problem and defined a collection of heuristics to correct them. The main challenge in our approach was to identify the right heuristics that would present good accuracy independently of the domain of the model under study. Hence, we opted to be as conservative as possible. One approach consists of using the position (coordinates) of the text fragments, and the other adds a shape detection feature to group the text fragments, improving the results.
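To make the coordinate-based idea concrete, the sketch below merges OCR text fragments whose bounding boxes are roughly left-aligned and vertically adjacent within a tolerance. The data structure, threshold values, and helper names are illustrative assumptions and do not reproduce Xamã’s actual heuristics.

```python
# Illustrative sketch of a coordinate-based merging heuristic: two OCR
# fragments are merged when they are (approximately) left-aligned and the
# vertical gap between them is small. Tolerances are placeholders, not the
# values used in Xamã.
from dataclasses import dataclass

@dataclass
class Fragment:
    text: str
    x: float       # left edge of the bounding box
    y: float       # top edge of the bounding box
    height: float  # bounding-box height

def same_block(a: Fragment, b: Fragment,
               x_tol: float = 5.0, gap_tol: float = 0.6) -> bool:
    """True if b looks like the next line of the same text block as a."""
    left_aligned = abs(a.x - b.x) <= x_tol
    vertical_gap = b.y - (a.y + a.height)
    adjacent = 0 <= vertical_gap <= gap_tol * a.height
    return left_aligned and adjacent

def merge_fragments(fragments):
    """Greedily merge top-to-bottom fragments that belong to one block."""
    fragments = sorted(fragments, key=lambda f: (f.y, f.x))
    merged, current = [], None
    for frag in fragments:
        if current is not None and same_block(current, frag):
            current = Fragment(current.text + " " + frag.text,
                               current.x, current.y,
                               frag.y + frag.height - current.y)
        else:
            if current is not None:
                merged.append(current)
            current = frag
    if current is not None:
        merged.append(current)
    return [f.text for f in merged]
```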

We evaluated the accuracy of Xamã in identifying whether a text fragment is positioned on multiple lines and whether the merging of the text fragments is performed correctly. In this evaluation, we ignored other error types that Xamã could not fix, such as character confusion. We observed that Xamã, without the shape detection feature, correctly identified 905 out of 1171 elements positioned on single or multiple lines. We consider these numbers good results, especially because it is a lightweight approach that uses the coordinates to identify whether the text fragments should be merged or not. We improved these numbers when we combined a slightly modified version of these heuristics with the identification of shapes such as boxes. Using the shape detection feature, the tool correctly identified 956 elements and improved the precision from 75% to 84%, the recall from 77% to 82%, and the f-measure from 76% to 83%. We consider these good improvements, and the results were statistically better on Matlab Simulink models and models retrieved from scientific papers.
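As an illustration of how shape detection could support this grouping, the sketch below uses OpenCV contour detection to find box-like regions and assigns each text fragment to the box containing its centre point. The preprocessing steps, thresholds, and fragment representation are assumptions for illustration, not Xamã’s actual pipeline.

```python
# Illustrative sketch: detect rectangular shapes with OpenCV and group OCR
# fragments by the box that contains their centre point. Thresholds and
# preprocessing are assumptions, not Xamã's actual pipeline.
import cv2

def detect_boxes(image_path, min_area=500):
    """Return bounding rectangles (x, y, w, h) of box-like contours."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for contour in contours:
        approx = cv2.approxPolyDP(contour,
                                  0.02 * cv2.arcLength(contour, True), True)
        if len(approx) == 4 and cv2.contourArea(contour) >= min_area:
            boxes.append(cv2.boundingRect(approx))
    return boxes

def group_by_box(fragments, boxes):
    """Map each detected box to the fragments whose centre lies inside it."""
    groups = {i: [] for i in range(len(boxes))}
    for text, cx, cy in fragments:          # fragment: (text, centre_x, centre_y)
        for i, (x, y, w, h) in enumerate(boxes):
            if x <= cx <= x + w and y <= cy <= y + h:
                groups[i].append(text)
                break
    return groups
```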

As expected, the combination of these techniques demands more computational processing, making the tool slower. Thus, we calculated the costs (time) of using the shape detection feature, and we observed that this feature makes the tool 111.28 (median) times slower. Although this number might represent a high ratio, in practice it is not noticeable to human beings because the measurement unit of time is nanoseconds. Taking the slowest execution time as an example: it took 0.02 seconds, representing negligible costs. Finally, we conclude that using the shape detection feature improves the text recognition and presents a reasonable execution time. It is worth mentioning that this feature was used only to identify whether the text fragments are positioned on multiple lines or not. The reason is that this paper is mainly about text extraction and how to improve the OCR results. For future work, we intend to use shape detection features to extract semantics from the models.
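The overhead ratio reported above can be measured with a standard nanosecond timer, as in the minimal sketch below; the two configuration functions are placeholders standing in for Xamã with and without shape detection, not the tool’s real API.

```python
# Minimal timing sketch: measure both configurations with a nanosecond
# timer and report the ratio. run_without_shapes and run_with_shapes are
# placeholders, not Xamã's actual API.
import time

def timed_ns(func, *args):
    """Run func(*args) and return (result, elapsed time in nanoseconds)."""
    start = time.perf_counter_ns()
    result = func(*args)
    return result, time.perf_counter_ns() - start

# Example usage with the placeholder configuration functions:
# _, t_plain  = timed_ns(run_without_shapes, "model.png")
# _, t_shapes = timed_ns(run_with_shapes, "model.png")
# print(f"overhead: {t_shapes / t_plain:.2f}x, "
#       f"absolute: {t_shapes / 1e9:.3f} s")
```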

Next, we compared the overall accuracy of Xamã and Google Cloud Vision. We performed two paired Wilcoxon signed-rank tests and concluded that our approaches outperformed Google Cloud Vision. As expected, the accuracy presented by Xamã was higher on those models that have more text fragments positioned on multiple lines. This analysis shows the importance of correcting this error. Taking the models from scientific papers as an example: we observed that for Figure 2 from the paper by Dounis et al. [63] the precision increased by almost 40%. We obtained even better improvements when using the shape detection feature; for instance, for Figure 3 from the paper by Zhou et al. [64] the improvement was 76% on precision and 84% on recall.

The imprecision of the coordinates provided by Google Cloud Vision partially explains why some models present low precision and recall. Due to this imprecision, we had to add a tolerance range when evaluating whether a text fragment is positioned on multiple lines or not. When increasing the tolerance range, we also increase the number of identified texts positioned on multiple lines. However, it also raises the number of false positives due to the misinterpretation of texts that are not positioned on multiple lines. We decided not to increase the tolerance range because we opted for high precision to the detriment of recall.

Finally, we compared the results produced by Xamã with the ones presented by img2UML, a domain-specific tool. We also performed two paired Wilcoxon signed-rank tests and concluded that Xamã outperformed img2UML in both precision and recall. Text recognition is only one part of img2UML; this tool is also capable of detecting shapes in order to identify the classes and the relationships between them. In this evaluation, we only compared the text recognition. We conjecture two possible reasons for the weak accuracy of img2UML. The first reason might be the fact that img2UML uses an outdated OCR engine for text recognition. Hence, we believe that img2UML can improve significantly once it updates the OCR engine to a newer one such as Google Cloud Vision. The second reason might be that img2UML is optimized for class diagrams exported by one specific CASE tool.

As future work, we intend to focus on the main challenges we identified in Sect. 2.2. We investigated the possibility of using OCR to extract text from models from different domains, and we proposed two approaches to correct the multi-line text error produced during the text extraction. However, all analyzed graphical models were designed using computer tools, and we believe that extracting information from hand-written models also plays an important role in model management. Furthermore, we want to evaluate different OCR techniques and our approaches on additional kinds of graphical models, including, for instance, SysML models, models drawn on whiteboards, and hand-written models. We also want to evaluate whether combining Xamã with a Bayesian algorithm [109] can improve the text recognition, in particular fixing the multi-line text error. Simultaneously, we intend to combine OCR with image processing to analyze additional graphical elements such as lines and arrows present in the models. There are additional research questions that need to be answered for a better evaluation of OCR as a technology to support the automatic identification of relationships. These questions include: what are the consequences of a missed recognition by the OCR tool? How much effort is needed to identify errors made by the OCR tool? We also want to evaluate the accuracy of using only the text extracted from the models to identify possible relationships. In parallel, we can compare the results of detecting possible relationships with and without semantics detection.

6 Related work

To the best of our knowledge, there are no studies on the use of off-the-shelf OCR services on models from different domains, nor studies proposing techniques to correct multi-line text errors. There are, however, a number of studies in which OCR has been applied to domain-specific models, and studies proposing post-processing techniques to correct misspellings produced by OCR. These studies are described below:

Img2UML [14, 22] extracts UML class diagrams from images, identifying, e.g., class names, fields, and methods. Img2UML uses Microsoft Office Document Imaging as the OCR technique for text recognition. While Img2UML is geared toward and evaluated on a specific domain, the techniques we have analyzed have been applied to models of multiple domains. Several studies have used OCR as part of a tool classifying images as UML diagrams: targeting class diagrams [110, 111], sequence diagrams [112], and component diagrams [111].

Going beyond engineering models, Reis [57] compares Google Cloud Vision and Microsoft Cognitive Services in recognizing text from photographs of the pages of the Bible. Additional comparison studies have been published by Mello and Dueire Lins [113] and Vijayarani and Sakila [114].

Regarding post-processing techniques, Cacho et al. [115] propose combining the Levenshtein edit distance with the concept of context given by trigrams, i.e., contiguous sequences of three words. The authors use the Google Web 1T database as context to support the acceptance of words with a larger Levenshtein edit distance in cases where the candidate word makes more sense in the context of the sentence.

Bassil and Alwani [116] propose using the Google online spelling suggestion to identify and correct words that have been misspelled during the OCR process. When applying this approach, the authors observed a significant improvement in the OCR error correction rate.

Kanjanawattana and Kimura [117] propose an approach that uses ontologies, natural language processing (NLP), and edit distance to improve the correction of OCR results. Differently from the other approaches, in this study the authors use a collection of two-dimensional bar graphs from journal articles as dataset. The authors concluded that their solution is effective because of its high accuracy compared to other methods.

Lasko and Hauser [109] evaluated five methods to correct misspelled words produced during the OCR process: the edit distance algorithm with a probabilistic substitution matrix, an adaptation of the cross-correlation algorithm, Bayesian analysis, Bayesian analysis on an actively thinned reference dictionary, and the generic edit distance algorithm. The authors concluded that the Bayesian algorithm is the most accurate one.

Khirbat [118] proposes the use of a support vector machine (SVM) to identify misspellings, and the use of Levenshtein’s edit distance and a simulated annealing (SA) algorithm to correct the words. In the end, the corrected words are validated by checking whether they are present in the Google Web n-gram corpus.

Perianez-Pascual et al. [15] evaluate the use of OCR in recognizing OCL expressions and propose strategies to improve the quality of this recognition. They evaluate these strategies on a collection of 3250 images of OCL expressions and conclude that their strategies improve the OCR recognition by 15.63%.

7 Conclusion

We presented a study of the suitability of off-the-shelf OCR services in the context of multi-domain model management. We evaluated the performance of two well-known services, Google Cloud Vision and Microsoft Cognitive Services, on a collection of 43 models from different domains: 17 UML diagrams, 9 Matlab Simulink models, and 17 models from scientific papers from the control system engineering domain.

We observed that Google Cloud Vision overall outperforms Microsoft Cognitive Services both in terms of precision and in terms of recall. This observation is consistent both with the previous work [57, 58] and with a follow-up study investigating the performance of the two OCR services on models of different domains.

Focusing on Google Cloud Vision, we identified a list of 17 kinds of errors distributed over four categories: non-alphanumeric characters, mathematical formulas, spacing, and character confusion. Among these errors, the most common are related to text written on multiple lines, wrong/missing characters, and empty spaces between letters. It is also important to highlight multi-line texts, Greek letters, subscripts, and equations, because Google Cloud Vision failed on them every single time.

We presented Xamã, a tool that includes two approaches to correct the multi-line text error by identifying and merging textual elements that are positioned on multiple lines. We evaluated these approaches on a collection of 51 models from different domains: 20 UML diagrams, 10 Matlab Simulink models, and 21 models from scientific papers from the system engineering domain. For this goal, we defined a set of heuristics that take into consideration the location, alignment, and distance of the text fragments. For the second approach, we combined a slightly modified version of these heuristics with a shape detection feature using well-known image processing techniques.

We compared these two approaches in terms of precision, recall, and f-measure, and we concluded that Xamã with the shape detection feature presents statistically better results on Matlab Simulink models and models from scientific papers. We also compared our tool with Google Cloud Vision, and Xamã outperformed Google Cloud Vision.

Additionally, we compared the results produced by Xamã with the ones presented by img2UML. We observed that Xamã outperformed img2UML in both precision and recall. We conjecture that the reason might be img2UML’s use of an outdated OCR engine.

To conclude, we observed that correcting the multi-line text error can significantly increase the overall accuracy, especially when the model contains a high number of text fragments positioned on multiple lines. Xamã produces good results with both approaches (with and without shape detection).