1 Introduction

In the face of unrestrained international economic competition, an increasing interest in decision support systems has emerged over the last two decades. A decision support system assists decision makers in extracting data pertinent to their analysis interests from their transactional systems (called on-line transactional processing systems). Most decision support systems rely on on-line analytical processing systems where data is often stored in multidimensional databases. The latter can be organized in terms of two types of storage areas: a data warehouse that regroups all data required for any potential analysis requirements, and/or a set of data marts, each regrouping the data extracts required for one particular analytical requirement, e.g. analyzing the sales of the enterprise. Given that a data warehouse can be developed from a set of data marts, in the remainder of this paper we will not distinguish between the two types and will mean data mart development whenever we talk about decision support systems development.

The literature provides several methods (notations and processes) proposed to develop and/or enhance decision support systems (cf. [6, 17, 21, 34, 41, 45]). This area of research, however, still faces critical challenges to foster the efficient development of this type of system. As with any software system, the efficient development of a decision support system depends on the quality of its models (specification and design, in particular). In their attempts to model a decision support system based on the data warehouse approach, the proposed methods have often relied on various multidimensional models to specify data mart schemas, including the star, constellation, and snowflake schema models [20, 25]. One main advantage of these multidimensional models is their relatively simple graphical notations (cf. [2, 27]). However, current development processes may produce ambiguous, and sometimes inconsistent, schemas. By inconsistent, we mean a schema that is either ill-structured (for instance, having a circular functional dependency among its elements) or not amenable to loading from a data source (for instance, using elements that cannot be matched within the data source). These inconsistencies may result in non-functional decision support systems. The present work investigates the feasibility and potential gains of an approach based on a formal language (Prolog) for the design of a valid star schema.

Our design approach keeps the advantage of a graphical notation by allowing designers to draw their multidimensional schema. It also complements existing design approaches by offering a framework for analyzing whether the schemas they produce satisfy structural, conformity, and semantic constraints. It aims at assisting designers by detecting constraint violations and proposing repair solutions based on a number of rules formalized in Prolog. Its efficiency stems from three sources: (i) the formalization strategy facilitates tracing violated constraints back to the schema elements concerned [37]; (ii) the use of linguistic/semantic similarity computation (widely used in various artificial intelligence applications), together with a set of heuristics developed by bottom-up design methods, assists in the analysis of the conformity and semantic constraints; and (iii) the definition of a set of error-based repair solutions offers vital feedback to designers for repairing their schema based on the errors detected.

That is, our approach is capable of detecting violations of various classes of constraints and suggesting repair solutions for any violated structural and/or conformity constraint. In addition, for every violated semantic constraint, it produces warning messages and information about the impact of the constraint violation on the quality of the analysis results. This paper presents our proposed approach with special focus on repair rules as they represent the underlying knowledge base from which repair solutions can be suggested.

The remainder of this paper is organized as follows: Sect. 2 first overviews the multidimensional schema concept through an example; it then provides a review of the related works currently available in the literature on the topic at hand. Section 3 describes our proposed design validation approach. Section 4 presents an illustrative case study and Sect. 5 reports on the results of a preliminary evaluation of the proposed approach. Finally, Sect. 6 summarizes the herein presented propositions and highlights their future extensions.

2 Related works

2.1 Multidimensional modeling: the star schema

The multidimensional star schema model is the basic building block of the remaining models (including the constellation and snowflake schema models [20, 25]); therefore, any work on the star schema can be generalized to the remaining models.

A star schema represents one subject of analysis, called fact, as a point in a multidimensional space. The fact itself contains a set of measures that represent different numerical analysis indicators. The analysis can be performed along various axes called dimensions; the latter organize data in terms of hierarchies representing different levels of data detail/aggregation [13]. Each level (also called attribute) can be further described with weak attributes.

Let us consider the car rental example depicted in Fig. 1 to illustrate the star schema concepts mentioned above. In this example, a decision-maker is interested in analyzing the fact renting in terms of the amount measure. The analysis can be conducted along three dimensions, namely cars, agency and date. Each of these dimensions offers several levels of amount aggregation, along one hierarchy for the date and agency dimensions and along one of two possible hierarchies (hcars1 and hcars2) for the cars dimension. For example, the hierarchy hcars1 offers aggregations along reg, model, brand, and/or all, respectively. The all attribute is the closure of each hierarchy, representing the highest level of aggregation. The attribute power is a weak/descriptive attribute associated with the level reg.

Fig. 1 Example of multidimensional star schema: car rental fact

2.2 Constrained multidimensional schema generation

The quality of a multidimensional schema (and/or a data warehouse) was extensively addressed in the literature on data warehousing, particularly through metrics defined in terms of the structural elements of the schemas (cf. [38, 39]) and through sets of constraints defined at the model level.

Inspired by the software engineering domain, the metrics proposed so far offer a heuristic view of the quality of data warehouse schemas based on empirical studies and best practices (cf. [15, 23]). While they can be used as indicators of the structural and, to a limited extent, semantic quality of data warehouse schemas, these metrics do not offer a means of validating the loadability of a schema. On the other hand, several researchers (cf. [7, 20, 25]) have proposed sets of constraints that a data warehouse schema must respect to be both syntactically correct and loadable. The work presented herein focuses on constraint-based consistency checking, given its coverage of the syntactic, semantic and loadability aspects.

The literature on multidimensional modeling provides several types and classes of constraints (cf. [2, 9, 11, 12, 19, 20, 27, 28, 36]). Our approach supports all existing constraints presented in the literature; in addition, it uses a simple classification into structural, conformity, and semantic constraints.

The three existing approaches to multidimensional schema design account for only some of the constraints when generating the schemas, depending on their objective. More specifically, being focused on generating schemas that meet the decision makers’ needs, the top-down approach [1, 22, 32, 43] disregards the data source involved in the definition of a multidimensional model; as previously reported in [30], this may result in unloadable schemas where the conformity constraints may also not hold. In addition, this approach does not explicitly address the structural well-formedness of the produced schemas.

On the other hand, starting from the data sources [8, 14, 16, 31], the bottom-up approach ensures that the resulting multidimensional schema is loadable and respects the conformity constraints; however, it does not guarantee the structural well-formedness of the schema nor the satisfaction of decision makers’ needs. In this approach, the satisfaction of the structural constraints depends on both the identification rules of the multidimensional elements and the design quality of the source schema.

To overcome these problems, the mixed approach [5, 21, 40, 42] combines the two previous approaches by confronting the constructed schema with the data source schema.

Except for the work of Bargui et al. [5], which uses linguistic relationships, existing mixed methods rely on an exact match between the names of the multidimensional schema and data source schema elements; such an assumption is too restrictive given the differences between the two domains (decision making and information systems). In addition, none of the mixed approaches offers repair solutions when a constraint is violated. The work presented in this paper complements all three types of design approaches by explicitly verifying all types of constraints in an automated way. In addition, it assists designers by suggesting repair solutions. Finally, this approach integrates semantic similarity computation [18] to solve the problem of the differences between the names used in the design of the star schema and the names existing in the data source schema.

3 Proposed verification and validation methodology

Our approach relies on a Prolog-based formalization to facilitate the automated analysis of star schemas. This formal language was selected particularly because of its multiple advantages: its high expressive power covers all the multidimensional concepts and constraints as well as the relational database concepts and constraints. Moreover, its notation is relatively user-friendly and close to mathematical notation. Last but not least, it offers a variety of useful editing and verification tools.

3.1 Overview

As illustrated in Fig. 2, our approach to multidimensional star schema analysis proceeds by transforming the graphical star schema into Prolog predicates. For validation purposes, the source database schema must also be translated into Prolog predicates. Afterward, the verification and validation (V&V) process can be invoked via the interrogation of Prolog. The results of this latter step are then interpreted to produce a report with the list of errors (if any), suggested repairs, and quality warnings. To make use of our approach, we have developed a toolset that offers a graphical editor through which the designer can draw his/her schema, invoke the Prolog translation and analyzer, and receive graphical and/or textual analysis reports.

The remainder of this section details the steps of our analysis approach while discussing their theoretical bases.

Fig. 2 Verification and validation framework of multidimensional star schemas

3.2 Semantic similarity integration

When using our approach, designers can draw their star schemas independently of the sources. They can, therefore, use multidimensional elements having names that differ from those existing in the source schemas, as long as they belong to the same context. Accordingly, the names used in the star schema must be semantically similar to those existing in the source schemas. This similarity determination problem can be resolved by computing the semantic similarity between the names existing in the sources and the names the designer used in the star schema.

The exploitation of computers to quantify and compare semantic similarities is an important area of research in artificial intelligence. In particular, the development of efficient measures to compute concept similarity is fundamental for computational semantics. In this research area, several similarity measures rely on knowledge resources to quantify semantic resemblance, such as the WordNet “is a” taxonomy; the latter quantifies the common characteristics between concepts expressed by the specialization/generalization relation. Some of these measures are based on topological parameters extracted from the taxonomical hierarchy to express the semantic content as faithfully as possible. Among these parameters, we cite the depth, the number of ancestors and the number of descendants, which are exploited to quantify the specificity of a concept. These measures are evaluated using benchmarks based on datasets formed by word pairs rated by experts, such as RG65 [35], MC30 [29] and AG203 [3].

Our method uses WordNet to discover the semantic similarity among the names used in the star and source schemas. More specifically, it uses WordNet synsets by adapting the similarity measure proposed in [26] and applying the information content quantification method of [17].

WordNet represents a polysemous word by a set of synsets, each of which contains cognitively synonymous nouns. For a given word w, its synsets can be retrieved through the WordNet “is a” taxonomy. Given a word \(w_1\) used to name a star schema element and a word \(w_2\) used to name a source schema element, the similarity measure proceeds as follows: it extracts their two sets of synsets [Syn(\(w_1\)), Syn(\(w_2\))] from WordNet; then it computes the semantic similarity between the words (\(w_1\), \(w_2\)) as follows:

$$\begin{aligned} SemSim(w_{1},w_{2})=\underset{(s_{1},s_{2})\in Syn(w_{1})\times Syn(w_{2})}{max} ICSemSim(s_{1},s_{2}) \end{aligned}$$

where ICSemSim(\(s_1\),\(s_2\)) represents the semantic similarity estimation based on the quantification method of information content [17] and the similarity measure of Lin [26] as follows:

$$\begin{aligned} ICSemSim(s_{1},s_{2})= \frac{2 \times IC(LCS(s_{1},s_{2}))}{IC(s_{1})+IC(s_{2})} \end{aligned}$$

where IC(\(s_i\)) represents the information content of each synset \(s_i\) representing the word \(w_i\), computed using the method from [17], which is based on the ancestors’ subgraphs in the “is a” taxonomy; and LCS(\(s_1\),\(s_2\)) represents the lowest common subsumer of the two synsets \(s_1\) and \(s_2\) in the WordNet “is a” taxonomy. Note that the computed values lie in the interval [0, 1].
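To make the measure concrete, the following minimal Prolog sketch shows one possible way of computing these values, assuming that the synsets, their information content and their lowest common subsumers have been precomputed from the WordNet “is a” taxonomy and stored as facts; the predicate names syn/2, ic/2 and lcs/3 are ours and purely illustrative.

% syn(Word, Synset), ic(Synset, Value) and lcs(Synset1, Synset2, Subsumer)
% are assumed precomputed facts (illustrative names, not the tool's own).
ic_sem_sim(S1, S2, Sim) :-
    lcs(S1, S2, L),
    ic(L, ICL), ic(S1, IC1), ic(S2, IC2),
    Sim is (2 * ICL) / (IC1 + IC2).

sem_sim(W1, W2, Sim) :-
    findall(S, (syn(W1, S1), syn(W2, S2), ic_sem_sim(S1, S2, S)), Sims),
    max_list(Sims, Sim).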

When a name w belonging to the star schema does not exist in the source schema, the semantic similarity is computed between w and each name \(w_i\) in the source schema that may play the role of w in the star schema. For example, the name of a dimension or a fact should be compared with relationship names only; this restriction reduces the computational complexity of the similarity measurement process and respects the heuristics of the bottom-up approach to star schema design.

Let CandidatesWords be the set of such candidate names in the data source schema. Among these candidates, we pick the name \(w_i\) as the correspondent to the name w in the star schema such that \(w_i\) has the maximum \(SemSim(w,w_{i})\) value.

Note that when computing the semantic similarity degrees between a name w and the names in the CandidatesWords set, we may obtain only low values; that is, none of the candidate words expresses the same meaning as w. To make sure that we only retain a name that is reasonably similar to w, we use a threshold below which no \(w_i\) is chosen. As discussed in Sect. 5, the value of this semantic similarity threshold can be determined experimentally. It can also be fine-tuned by the star schema designer in consultation with the data source administrator. The retained couple must then be added to the Prolog knowledge base; to do so, we add the couple (w, \(w_i\)) to the Prolog file containing the description of the source and star schemas using a dedicated predicate.
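A minimal sketch of this selection step is given below; the predicate names (similarity_threshold/1, retain/3, similar/2) are illustrative rather than the tool's actual vocabulary, and the 0.2 threshold anticipates the experiments reported in Sect. 5.

% Pick the candidate Wi with the maximum similarity to W, provided it
% reaches the threshold; the rule fails when no candidate is similar enough.
similarity_threshold(0.2).

retain(W, Candidates, Wi) :-
    findall(Sim-C, (member(C, Candidates), sem_sim(W, C, Sim)), Pairs),
    max_member(BestSim-Wi, Pairs),
    similarity_threshold(T),
    BestSim >= T.

% The retained couple is then asserted into the knowledge base as a fact of
% an (assumed) similar/2 predicate, e.g.:
similar(publisher, editor).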

3.3 Formalization in Prolog

Our formalization strategy uses a set of transformation rules that preserve the “informal” semantics of the graphical star schema. It also ensures the traceability between the Prolog predicates and the analyzed star schema. This advantage is crucial for giving feedback on potentially violated constraints to the designer. Table 1 summarizes the transformation rules of the star schema concepts into Prolog.

Table 1 Transformation rules of star schema concepts
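As an illustration of what the resulting predicates could look like, the following sketch encodes part of the Fig. 1 car rental schema; the predicate names and argument orders are ours and need not match those of Table 1.

% Hypothetical encoding of (part of) the car rental star schema.
fact(renting, [amount]).                           % fact and its measures
dimension(cars, renting, [hcars1, hcars2]).        % dimension, its fact, its hierarchies
dimension(agency, renting, [hagency]).
dimension(date, renting, [hdate]).
hierarchy(hcars1, cars, [reg, model, brand, all]). % hierarchy and its ordered levels
weak_attribute(power, reg).                        % weak attribute attached to a level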

To provide for the validation of a star schema with respect to a relational data source, we have also defined transformation rules from relational schemas into Prolog. Briefly, each relation is expressed in Prolog by the relation predicate with four arguments: the name of the relation, its primary key, its foreign key list, and its non-key attribute list. The attributes of the foreign key list are not necessarily primary keys; for this reason, we distinguish two lists: a primary key list and a foreign key list. In the case where a foreign key is also part of the primary key, it must belong to both lists.
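The sketch below illustrates this encoding on two relations loosely inspired by the bookstore case study of Sect. 4; the relation contents are illustrative.

% relation(Name, PrimaryKeyList, ForeignKeyList, NonKeyAttributeList).
relation(editor, [codEd], [], [city, country]).
relation(sale, [codDoc, codEd, datSale], [codDoc, codEd], [amount]).
% codDoc and codEd are foreign keys that also belong to the primary key,
% so they appear in both lists, as required above.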

3.4 Prolog interrogation

This step consists in executing three processes: the structural verification, the validation, and the verification of the semantic constraints. The structural verification process is a syntactic test: it checks the structural constraints presented in Table 2.

Table 2 List of multidimensional star schema structural constraints
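As an illustration, two of these structural checks could be written as follows, assuming the fact/2 and hierarchy/3 encodings sketched in Sect. 3.3; the constraint names (CNEf, CNCh) are those used in the case study of Sect. 4.

violates_cnef(F) :-                  % non-empty fact: a fact must own measures
    fact(F, []).

violates_cnch(H) :-                  % non-cyclic hierarchy: no level may reappear
    hierarchy(H, _, Levels),
    msort(Levels, WithDuplicates),   % msort/2 keeps duplicates
    sort(Levels, Deduplicated),      % sort/2 removes them
    WithDuplicates \== Deduplicated.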

As far as the validation process is concerned, it necessarily involves testing the conformity constraints (see Table 3). The latter are inspired by bottom-up approaches. Thanks to the conformity constraints, the validation of the designers’ schemas is reduced to a simple conformity check against the source. To do so, we need a connection to the source database to extract its relational schema. In fact, to check the conformity constraints, we need the names of the primary key, foreign key, and non-key attributes of each relation, as well as the type of each attribute. For this reason, we define a type predicate with two arguments: the attribute name and its type. The naming conflicts between the star schema and the data source schema are resolved by interaction with the WordNet dictionary [24] to account for other linguistic relationships, such as synonymy and similarity [18].

Table 3 List of multidimensional star schema conformity constraints
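The following sketch gives one possible reading of the measure conformity check, reusing the relation/4, type/2 and similar/2 predicates sketched earlier; the actual rules of our approach are those of Table 3.

type(amount, numeric).               % type/2 facts extracted from the source

conform_measure(Fact, Measure) :-
    ( similar(Fact, Rel) ; Rel = Fact ),        % resolve the fact to a source relation
    relation(Rel, _PK, _FK, NonKeys),
    ( similar(Measure, Attr) ; Attr = Measure ),% resolve the measure to an attribute
    member(Attr, NonKeys),
    type(Attr, numeric).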

The semantic constraints can be divided into two categories (see Table 4): summarizability and domain constraints. Our approach verifies these two categories of semantic constraints based on a data source. Accordingly, we defined a set of queries inspired by re-engineering works [4, 44]. The results of these queries enrich the Prolog knowledge base.

Table 4 List of multidimensional star schema semantic constraints

To illustrate the verification process of the semantic constraints, let us consider the verification process of the disjunction constraint for the supplier dimension shown in Fig. 3.

Fig. 3 Supplier dimension

The supplier dimension contains one hierarchy, called hSupp, composed of the following levels: id, city, country, continent, and all. Table 5 shows three instances of the supplier dimension: Ins1, Ins2, and Ins3. We note that the instance Tripoli of the city level is associated with two instances of the country level: Libya and Lebanon. This instance sharing means that the disjunction constraint is violated. Our approach detects this problem at the source level and warns the designer that the violation of the disjunction semantic constraint may engender inconsistencies in the results of the future decisional system. Indeed, since an instance of the city level is associated with more than one instance of the country level, certain measures are probably counted several times when aggregating along this hierarchy.

Table 5 Example of supplier dimension instances
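A sketch of the disjunction check over dimension instances is given below; rollup/3 is an illustrative predicate storing the (child, parent) pairs extracted from the source database. Querying violates_disjunction(hsupp, C) on these facts returns C = tripoli.

% rollup(Hierarchy, ChildInstance, ParentInstance), extracted from the source.
rollup(hsupp, tripoli, libya).
rollup(hsupp, tripoli, lebanon).

violates_disjunction(H, Child) :-    % a child instance rolls up to two parents
    rollup(H, Child, P1),
    rollup(H, Child, P2),
    P1 \== P2.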

3.5 Interpretation and repairing solutions

The interpreter generates two types of reports: error and warning reports. After the transformation of the multidimensional star schema and the source schema into Prolog, the error reports are generated through a mapping between the interpreter and Prolog. An error report consists of two zones: error messages and repair solutions.

The warning messages inform the designers about the quality of the results of the generated cube. They are the output of the semantic verification process. They identify the percentage of instances that do not satisfy each condition of the summarizability constraint, the most accurate results, and the reasons for the incoherence. As for the domain constraints, the interpreter returns to the designer a list of the violated constraints, indicating the fact, dimension, hierarchy and level names.
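For instance, the percentage reported for the completeness (successor) condition could be computed as in the following sketch, which reuses the rollup/3 facts of Sect. 3.4 and assumes an illustrative instance_of/2 predicate listing the loaded instances of a level.

completeness_gap(H, Level, Pct) :-
    findall(I, instance_of(Level, I), Is),
    length(Is, N), N > 0,
    findall(I, (member(I, Is), \+ rollup(H, I, _)), Orphans),
    length(Orphans, M),
    Pct is 100 * M / N.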

The repair solutions are computed by a set of rules we defined in Prolog. These rules are inspired by the conformity constraints (see Table 3). Each rule takes as a starting point the predecessor of the element that violates the constraint in the structure of the multidimensional star schema. The repairing rules are:

  • Rule of searching facts (RSF): the role of RSF is to provide designers with propositions as to which relations could serve as facts. The RSF selects each relation that respects the fact conformity constraints. For each selected fact, the interpreter interrogates Prolog to execute the RSD rule; the selected fact is rejected when the result of the RSD rule is “No”, meaning that the proposed fact cannot be related to any dimension.

  • Rule of searching measures (RSM): it returns to designers all attributes respecting the measure conformity constraint. RSM checks all attributes of the fact relation and its parallel relations (a sketch is given after Table 6).

  • Rule of searching dimensions (RSD): to search all possible dimensions, RSD selects each relation that respects the dimension conformity constraints. After detecting all dimensions, the orthogonality test is performed. The returned results represent all possible combinations of dimensions.

  • Rule of searching levels (RSL): RSL searches for all attributes that depend functionally and directly on a given level and that respect the level conformity constraint. Applied recursively, RSL allows the deduction of new hierarchies; in the case where the identifier is taken as the given level, RSL returns all possible hierarchies. From a given level, the RSL rule thus provides designers with all possible successors and, possibly, with new hierarchies. The RSL rule is applied when certain constraints are not satisfied, including the non-empty dimension, non-cyclic hierarchy, level conformity, and hierarchic dependency constraints.

  • Rule of searching weak attributes (RSW): RSW provides all the attributes that directly depend on a given level and that do not satisfy the level conformity constraint. The interpreter applies RSW when a weak attribute conformity constraint is violated.

The repairing rules are used after the V&V process. Table 6 indicates the rules appropriate to fix the violated constraints.

Table 6 Repair rules according to the violated constraints
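As an example, the RSM rule could be sketched as follows, reusing the relation/4, type/2 and similar/2 predicates of Sect. 3 and assuming an illustrative parallel/2 predicate that links a relation to its parallel relations (cf. sale and command in Sect. 4). A query such as findall(M, rsm(sale, M), Measures) then collects all candidate measures.

parallel(sale, command).

rsm(Fact, Measure) :-
    ( similar(Fact, Rel) ; Rel = Fact ),         % resolve the fact name
    ( CandRel = Rel ; parallel(Rel, CandRel) ),  % the fact relation or a parallel one
    relation(CandRel, _PK, _FK, NonKeys),
    member(Measure, NonKeys),
    type(Measure, numeric).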

4 Case study

To explain the technical details of our approach, we present the V&V of the sale fact analyzed along the period, documents, and publisher dimensions (Fig. 4).

Fig. 4 Sale fact

Table 7 describes the source schema after its transformation into Prolog. The attribute types are extracted and added to the Prolog knowledge base.

Table 7 Prolog code related to the relational source schema

The sale fact presents three structural errors:

  • The fact does not respect the CNEf: it does not contain any measures. RSM is executed and searches the source schema for the relation whose name is similar to the fact name. It then looks for all attributes that belong to this relation and that respect the CCM. The repair solution is: amount and quantity. The first is a numeric attribute belonging to the sale relation; the second belongs to the command relation and is detected because command is parallel to the sale relation.

  • The CNCh is violated in the hierarchy docAut of the documents dimension: one occurrence of specialty must be replaced or eliminated. The RSL is automatically executed. In its first execution, RSL searches for the relation having a name similar to that of the dimension. In its second execution, it takes specialty as a starting point and checks that docAut is not a successor of specialty. To validate the correctness of the second occurrence of specialty, RSL is executed once more with codAut as a starting point. To search for a possible substitute for the first occurrence of specialty, RSL is re-executed with the predecessor of the specialty level, codDoc, as a starting point. The repair solution is: codDoc, codAut, specialty, and all. Overall, the RSL must be executed at least three times, taking as starting points the first occurrence of the level violating the CNCh, its predecessor, and the predecessor of the second occurrence.

  • The CNEd is not respected in the publisher dimension: the publisher dimension does not contain any hierarchy. To compute a repair solution, the relation having a name similar to that of the dimension is selected; in our example, this relation is editor. The RSL is then executed, with the identifier of the editor relation taken as a starting point. A new hierarchy is provided for each detected level, and each extracted level is considered as the starting point of a new execution of the RSL. For the running example, two hierarchies are detected: hier1: codEd, city and all; hier2: codEd, country and all.

After the revision of the sale fact, the validation process is required. The checking of the conformity constraints shows three errors: COS, CDC and CLC.

The COS error relates to the existence of a functional dependency between the codDoc identifier of the documents dimension and the codEd identifier of the publisher dimension. To repair this error, the RSD is initially executed to provide all combinations of orthogonal dimensions. The designer, however, needs to analyze sale by documents and publisher. Accordingly, and considering that codDoc determines codEd, the RSL is re-executed to provide a new hierarchy belonging to documents, with codDoc used as a starting point. The proposed hierarchy must contain codEd as a level. The repair solutions are:

Fact dimensions: documents, period, and bookstore.

The hierarchies in the documents dimension are:

hierarchy 1: codDoc, codEd, city and all.

hierarchy 2: codDoc, codEd, country and all.

Concerning the CDC error, publisher must be eliminated or replaced because the editor relation, whose name is similar to publisher, cannot be a dimension for the sale fact. The RSD is, therefore, re-executed to provide a new dimension that substitutes publisher. The proposed solution eliminates the publisher dimension or replaces it with the bookstore dimension.

The third error, CLC, refers to the fact that price cannot be considered as a level. After the RSL is executed with codDoc taken as a starting point, the proposed solutions are:

hierarchy 1: codDoc, codEd, city and all.

hierarchy 2: codDoc, codEd, country and all.

hierarchy 3: codDoc, specialty and all.

Among the results of the running example of summarizability verification, the following message is returned:

Warning—completeness (successor):

Fact: sale

Dimension: documents

Hierarchy: docAut

Level: codDoc

Next level : codAut

Number: 2.6 %

2.6 % of codDoc instances have no successor (codAut) in the hierarchy docAut of the documents dimension of the sale fact. Consequently, the most accurate results on the measures of sale are those relative to codDoc; according to codAut, certain measures are probably not included.

For the running example, the following warning message is provided:

Warning—domain constraint

Fact: sale

Dimension: documents

Hierarchy: docAut

Level: specialty

Constraint: specialty = “chemistry”

The intersection between the domain restriction and the set of the level instances is empty

The message indicates that the domain constraint introduced by the designer yields an empty intersection between the domain restriction and the set of the specialty level instances.

5 Evaluation

To evaluate our approach, we developed a tool supporting the automatic V&V process. To guide the designers during error revision, the tool provides error messages and repair solutions; warning messages are generated to inform the designers about the quality of the results of the future cube. To evaluate our approach experimentally, we adopted a black box testing methodology to assess: (1) its capacity to detect errors and warnings, and (2) the quality of its repair solution reports. This methodology was used to test the structural verification and error detection; it consists in testing several star schemas and comparing the detected errors with the expected ones.

The tests we conducted for the structural constraints demonstrated that our approach is capable of detecting all structural errors. The detection of conformity errors requires a connection to the data sources and depends on the semantic similarity threshold. For a threshold near 1, the validation process becomes stricter; that is, the elements of the star schema whose names do not match (exactly or as synonyms) names in the source schema are considered non-conform. Reducing this threshold reduces the number of elements considered non-conform despite having similar elements in the source. The best results were found by fixing the similarity threshold at 0.2. We determined this value experimentally on a set of 150 name pairs collected from the literature and student projects. After applying our similarity measure while varying the threshold, we manually analyzed the results to identify the threshold that selects the word pairs expressing the closest meanings. The source schemas used in our validation tests include Media-planning [32], Faculty members load [10], the bookstore, etc. Table 8 shows samples of name pairs. In this list, we cannot replace student by teacher or sale by buying, substitutions that would be allowed with a threshold of less than 0.2.

Table 8 Example of similar words

The warning messages are the results of the semantic constraint verification process. To test these constraints, we constructed source databases for the above-mentioned case studies, each containing a limited number of records (from 100 to 3000). Comparing the obtained results with the multiplicities between the entities in the conceptual schemas, we found them coherent.

To check the repair solutions, we tested each search rule in some of its worst-case and intermediate scenarios. For the intermediate case, the example presented in the previous section illustrates the quality of the results. For the worst-case scenarios, because the behavior of our approach is similar to that of the bottom-up approach, we compared our results with those generated by this type of approach. The methods in this approach extract facts based on two criteria: the existence of a numeric attribute and/or the cardinality of the association between entities. The methods based on the first criterion can extract facts with no dimensions, and those based only on the cardinality may extract a fact with no numerical attribute (assuming the count as a default measure). Our validation approach takes both criteria into consideration.

When we applied several bottom-up and automatic methods, the dimension extraction process provided all possible dimensions without respecting the orthogonality constraint. With our approach, the worst-case scenario of RSD execution occurs when the drawn fact is not associated with any dimension; the number of detected dimensions per fact is then the same, but our approach offers all possible combinations of orthogonal dimensions. From a given level, RSL detects all possible levels that depend functionally and directly on the given level and that respect the level conformity constraint; applied recursively, RSL returns all possible hierarchies.

Compared to other approaches, our approach is characterized by its interactivity. In fact, guided by the provided repairing solutions, the designers can repair their schemas according to their needs. In summary, the approach presented in this work complements all three types of design approaches by explicitly verifying all types of constraints in an automated way. Besides its automation, this approach has two main merits. The first merit relates to its ability to integrate the semantic similarity computation [18] to solve the problem pertinent to the difference between the names used in the design of the star schema and the names existing in the data source schema. The second merit pertains to the solutions proposed to correct a schema based on the errors provided by the check module and hence avoiding its redesign.

6 Conclusion

In this paper, we proposed an approach for the design of valid multidimensional schemas. The approach has three distinct features: it handles structural, conformity, and semantic constraints; it uses linguistic similarity computation in the validation process; and it provides designers with possible repair solutions. It is advantageous compared to existing mixed approaches, where designers cannot intervene after the need definition phase. Our approach is interactive: in case of errors, a complete report is returned to the decision maker wherein errors are indicated and possible repair solutions are offered. To provide the designer, who may not be an expert in formal methods, with the means needed to check their schemas, we developed an automatic multidimensional V&V framework. Moreover, our framework provides designers with three message types: error reports, repair solutions, and warning messages. (It is also worth noting that our repair solution proposition approach can be extended to handle domain-specific constraints, as illustrated in [37].) Finally, the input of our approach is a graphical notation; thereby, in contrast with other works, the use of our framework does not require expertise in a specific language. For example, the use of the approaches described in [32, 42] requires an expert in UML, and the use of the approaches described in [21, 33] requires an expert in SQL.

Considering the promising potential of warning messages, future studies, some of which are underway in our laboratory, are needed to further investigate the potential benefits of these results in the context of extract-transform-load procedure definition.