
Quantitative Analysis of Apache Storm Applications: The NewsAsset Case Study

  • Published in: Information Systems Frontiers

Abstract

The development of Information Systems today faces the era of Big Data. Large volumes of information need to be processed in real time, for example, for Facebook or Twitter analysis. This paper addresses the redesign of NewsAsset, a commercial product that helps journalists by providing services that analyze millions of media items from social networks in real time. Technologies like Apache Storm can help enormously in this context. We have quantitatively analyzed the new design of NewsAsset to assess whether the introduction of Apache Storm can meet the demanding performance requirements of this media product. Our assessment approach, guided by the Unified Modeling Language (UML), leverages, for performance analysis, the software designs already used for development. In addition, we extended UML into a domain-specific modeling language (DSML) for Apache Storm, thus creating a profile for Storm. We then transformed this DSML into a language suitable for performance evaluation, specifically, stochastic Petri nets. The assessment ended with a successful software design that met the scalability requirements of NewsAsset.



Notes

  1. Domain Specific Modeling Language.

  2. Small and medium-sized enterprise.

  3. Modeling and Analysis of Real-Time and Embedded Systems (OMG 2011a).

  4. Dependability Modelling and Analysis (Bernardi et al. 2011).

  5. The Student t-distribution with \(N-1 = 21\) degrees of freedom has been used.

  6. Figure 7 does not show the utilizations of the bolts that are below \(5\%\).

References

  • Apache. (2017a). Apache Storm Website. http://storm.apache.org/.

  • Apache. (2017b). Apache Zookeeper Website. https://zookeeper.apache.org/.

  • Ardagna, D., & et al. (2016). Modeling performance of Hadoop applications: a journey from queueing networks to stochastic well formed nets. In Carretero, J., & et al. (Eds.) Proceedings of the 16th Int. Conf. on algorithms and architectures for parallel processing. ISBN 978-3-319-49583-5 (pp. 599–613). Cham: Springer.

  • ATC. (2018). Athens technology center Website. https://www.atc.gr/default.aspx?page=home.

  • Avizienis, A., Laprie, J.-C., Randell, B., Landwehr, C.E. (2004). Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11–33. https://doi.org/10.1109/TDSC.2004.2.


  • Bernardi, S., Merseguer, J., Petriu, D.C. (2011). A dependability profile within MARTE. Software & Systems Modeling, 10(3), 313–336.


  • Chiola, G., Marsan, M.A., Balbo, G., Conte, G. (1993). Generalized stochastic Petri nets: a definition at the net level and its implications. IEEE Transactions on Software Engineering, 19(2), 89–107.


  • DICE Consortium. (2016). Requirement Specification. Technical report, European Union’s Horizon 2020 research and innovation program. http://wp.doc.ic.ac.uk/dice-h2020/wp-content/uploads/sites/75/2015/08/D1.2_Requirement-specification.pdf.

  • DICE Consortium. (2016). Storm profile. https://github.com/dice-project/DICE-Profiles.

  • DICE Consortium. (2017). DICE Simulation tool. https://github.com/dice-project/DICE-Simulation/.

  • Dipartimento di informatica, Università di Torino. (2015). GRaphical Editor and Analyzer for Timed and Stochastic Petri Nets. http://www.di.unito.it/greatspn/index.html.

  • Diplaris, S., & et al. (2012). SocialSensor: sensing user generated input for improved media discovery and experience. In Proceedings of the 21st international conference on World Wide Web (pp. 243–246). ACM.

  • Flexiant. (2017). Flexiant cloud orchestator Website. https://www.flexiant.com/.

  • Gianniti, E., & et al. (2017). Fluid Petri nets for the performance evaluation of MapReduce and Spark applications. ACM SIGMETRICS Performance Evaluation Review, 44(4), 23–36.


  • ISO. (2008). Systems and software engineering – High-level Petri nets – Part 2: Transfer format. ISO/IEC 15909-2:2011. Geneva: International Organization for Standardization.


  • Kroß, J., Brunnert, A., Krcmar, H. (2015). Modeling big data systems by extending the Palladio component model. Softwaretechnik-Trends, 35(3), 1–3.


  • Kroß, J., & Krcmar, H. (2016). Modeling and simulating Apache Spark streaming applications. Softwaretechnik-Trends, 36(4), 1–3.


  • Lagarde, F., Espinoza, H., Terrier, F., Gérard, S. (2007). Improving UML profile design practices by leveraging conceptual domain models. In Kurt Stirewalt, R.E., Egyed, A., Fischer, B. (Eds.) Proceedings of the 22nd IEEE/ACM international conference on automated software engineering (ASE 2007) (pp. 445–448). Atlanta: ACM.

  • Law, A.M. (2015). Simulation modeling and analysis. McGraw-Hill.

  • Marsan, M.A., Balbo, G., Conte, G., Donatelli, S., Franceschinis, G. (1994). Modelling with generalized stochastic Petri nets, 1st edn. New York: John Wiley & Sons, Inc.


  • Nalepa, F., Batko, M., Zezula, P. (2015a). Model for performance analysis of distributed stream processing applications. In Proceedings of the 20th international conference on database and expert systems applications (pp. 520–533). Springer.

  • Nalepa, F., Batko, M., Zezula, P. (2015b). Performance analysis of distributed stream processing applications through colored Petri nets. In Proceedings of the 10th international doctoral workshop on mathematical and engineering methods in computer science (pp. 93–106). Springer.

  • OMG. (2011a). UML Profile for MARTE: Modeling and Analysis of Real-time Embedded Systems, Version 1.1. http://www.omg.org/spec/MARTE/1.1/.

  • OMG. (2011b). Meta object facility (MOF) 2.0 Query/View/ Transformation specification, version 1.1. http://www.omg.org/spec/QVT/1.1/.

  • Rak, T. (2015). Response time analysis of distributed web systems using QPNs. Mathematical Problems in Engineering, 2015, Article ID 490835, 10. https://doi.org/10.1155/2015/490835.

  • Ranjan, R. (2014). Modeling and simulation in performance optimization of big data processing frameworks. IEEE Cloud Computing, 1(4), 14–19.


  • Requeno, J.I., Merseguer, J., Bernardi, S. (2017). Performance analysis of Apache Storm applications using stochastic Petri nets. In Proceedings of the 5th international workshop on formal methods integration.

  • Samolej, S., & Rak, T. (2009). Simulation and performance analysis of distributed internet systems using TCPNs. Informatica, 33(4), 405–415.


  • Selic, B. (2007). A systematic approach to domain-specific language design using UML. In Proceedings of the 10th IEEE international symposium on object-oriented real-time distributed computing (pp. 2–9). IEEE Computer Society.

  • Singhal, R., & Verma, A. (2016). Predicting job completion time in heterogeneous MapReduce environments. In Proceedings of the 30th IEEE international parallel and distributed processing symposium workshops (pp. 17–27). IEEE.

  • Wang, K., & Khan, M.M.H. (2015). Performance prediction for Apache Spark platform. In 2015 IEEE 17th international conference on high performance computing and communications (HPCC), 2015 IEEE 7th international symposium on cyberspace safety and security (CSS), and 2015 IEEE 12th international conference on embedded software and systems (ICESS) (pp. 166–173). IEEE.

  • Zimmermann, A. (2017). Modelling and performance evaluation with TimeNET 4.4 (pp. 300–303). Cham: Springer International Publishing.



Acknowledgements

This work has been supported by the European Commission under the H2020 Research and Innovation program [DICE, Grant Agreement No. 644869], the Spanish Ministry of Economy and Competitiveness [ref. CyCriSec, TIN2014-58457-R], and the Aragon Government [ref. T21_17R, DIStributed COmputation (DISCO)].

Author information


Corresponding author

Correspondence to José I. Requeno.

Appendices

Appendix A: Transformation of a UML Design to a Performance Model

For evaluating performance metrics (utilization, throughput and response time), we need to transform the UML Apache Storm design into a performance model. We choose Generalized Stochastic Petri Nets (GSPNs; see Appendix C) as the target performance model. In the following, we recall the transformation patterns proposed in Requeno et al. (2017). Each pattern takes as input a part of the Storm design and produces a GSPN subnet.

1.1 A.1 Activity Diagram Transformation

Tables 9, 10 and 11 show the patterns for the activity and deployment diagrams. For each pattern, the left-hand side presents the input, i.e., the UML model elements, possibly stereotyped with the Storm profile. The right-hand side indicates the corresponding GSPN subnet. For an easier understanding of the transformation, we depict: a) text in bold to match input and output elements; and b) interfaces with other patterns as dotted grey elements, because they do not actually belong to the pattern.

Table 9 Transformation patterns for Storm-profiled UML activity diagrams (node elements)
Table 10 Transformation patterns for Storm-profiled UML activity diagrams (stream policies)
Table 11 Transformation patterns for Storm-profiled UML deployment diagrams

Patterns P1 and P2 map spout and bolt tasks, respectively. Both spout and bolt subnets become structurally similar when the bolt subnet is merged with a P3–P5 pattern. The subnet consists of two places, a timed transition, an immediate transition, and a set of arcs. Places \(p_{A1}\) and \(p_{A2}\) represent, respectively, the idle state and the processing state of incoming messages. The place \(p_{A1}\) is marked with as many tokens as the parallelism tagged-value associated with the task denotes ($n0). The rate of the timed transition equals either the emission rate ($rate) of the spout or the inverse of the host demand of the bolt (1/$time). The timed transitions have infinite-server semantics because the production of tuples is already constrained by the number of available threads (tokens) defined by the parallelism. The immediate transition in the spout subnet has no source places because it models the continuous arrival of new messages.
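As an illustrative sketch only (the class and function names below are hypothetical, not taken from the DICE tooling), the structure of the spout pattern P1 can be encoded as plain data: an idle place marked with as many tokens as the parallelism, a processing place, and a timed transition whose rate is the spout's emission rate.

```python
# Hypothetical sketch of the P1 spout subnet; names are illustrative,
# not code from the paper's transformation tool.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Place:
    name: str
    tokens: int = 0

@dataclass
class Transition:
    name: str
    timed: bool                    # True: exponential firing delay; False: immediate
    rate: Optional[float] = None   # firing rate, only meaningful for timed transitions
    inputs: List[Tuple[Place, int]] = field(default_factory=list)   # (place, arc weight)
    outputs: List[Tuple[Place, int]] = field(default_factory=list)

def spout_subnet(parallelism: int, emission_rate: float):
    """Build the P1 pattern: p_A1 (idle threads, marked with `parallelism`
    tokens), p_A2 (processing), and a timed transition firing at the spout's
    emission rate; the token count in p_A1 already bounds the concurrency,
    matching the infinite-server semantics described above."""
    p_idle = Place("p_A1", tokens=parallelism)
    p_proc = Place("p_A2")
    t_emit = Transition("t_A", timed=True, rate=emission_rate,
                        inputs=[(p_idle, 1)], outputs=[(p_proc, 1)])
    return p_idle, p_proc, t_emit
```

A spout with parallelism 4 and an emission rate of 10 tuples per time unit would thus start with four tokens in its idle place.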

Patterns P3, P4 and P5 map the reception of a stream of tuples by a bolt. In P3 the source of the stream is a single task, whereas in P4 and P5 there are multiple sources. In particular, pattern P4 represents the synchronous case and pattern P5 the asynchronous one. In the P3–P5 subnets, the interface transition \(t_{A}\) refers to the transition in P2 with the same name. Pattern P6 maps the final node to a transition without output places. This is a sink transition that represents the end of the stream processing and could potentially act as an interface with subsequent systems (i.e., injecting tuples into another Storm application).

Patterns P8–P11 detail the transformation of the numTuples and grouping tagged-values of a given stream step. Therefore, these patterns refine patterns P3, P4 and P5. The numTuples indicates the number of input tuples that the receiving bolt requires for producing a message. Such a value is mapped to the weight of the arcs \(a_{2}\) (P8 subnet), \(a_{1}\) (P9–P10 subnets), and \(a_{i}\) (P11 subnet).

Additionally, the grouping defines how the stream should be partitioned among the next bolt's threads. If the grouping is set to all, every thread of the receiving bolt will process a copy of the tuple; then the weight of the arc \(a_{1}\) in the GSPN subnet is equal to the parallelism of the bolt B (P8). Otherwise, only one thread of the receiving bolt will process the tuple, and therefore the weight is set to the default value (i.e., 1). If the grouping policy is shuffle, the target execution thread of B is selected randomly among the idle ones (P9). In the case of the global policy, the entire stream goes to the bolt's thread with the lowest id (P10). The initial marking of place \(p_{BG}\), in the GSPN subnet, is set to a single token for restricting the access to just one thread. Finally, the field grouping policy divides the outgoing stream of A by value (P11), and all the messages having the same value are sent to the same threads of the receiving bolt. The transformation creates a GSPN subnet with n basic subnets, where n is the number of different stream values. This number is limited by the number of parallel threads (parallelism tagged-value) in B. When an incoming message arrives at the receiving bolt, it is redirected to one of the basic subnets according to the probabilities \(prob_{i}\) assigned to the immediate transitions \(t_{B1_{i}}\).
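The arc-weight rule induced by the grouping tagged-value can be summarized in a small helper, a sketch of the semantics described above (the function name is hypothetical, not from the paper's transformation):

```python
# Illustrative sketch of the grouping semantics described above;
# not code from the paper's transformation tool.
def receiving_arc_weight(grouping: str, bolt_parallelism: int) -> int:
    """With the 'all' policy, every thread of the receiving bolt processes a
    copy of the tuple, so the arc weight equals the bolt's parallelism; under
    any other policy ('shuffle', 'global', 'field'), a single thread processes
    the tuple, so the weight keeps the default value 1."""
    return bolt_parallelism if grouping == "all" else 1
```

For a bolt with parallelism 4, the 'all' policy yields weight 4, while 'shuffle' or 'global' yield weight 1.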

1.2 A.2 Deployment Diagram Transformation

Pattern P7 (Table 11) illustrates the modifications introduced in the GSPN model by the profile extensions in the deployment diagram. The Storm tasks are first logically grouped into partitions in the activity diagram; later, they are deployed as artifacts and mapped to physical execution nodes (GaExecHost stereotype) in the deployment diagram. In particular, P7 maps the GaExecHost to a new place \(p_{R}\) in the GSPN, with an initial marking that represents the number of computational cores of the node (resMult tagged-value). The addition of such a place restricts the logical concurrency, that is, the number of threads of the Storm tasks, to the number of available cores. The pattern corresponds to the acquire/release operations of the cores by the spouts and bolts.
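The acquire/release behaviour introduced by pattern P7 can be sketched as a counter initialized to the node's resMult value (an illustrative model with hypothetical names, not the tool's code):

```python
class CoreResource:
    """Sketch of the resource place p_R of pattern P7: its marking bounds the
    number of spout/bolt threads running concurrently on an execution node."""
    def __init__(self, res_mult: int):
        self.tokens = res_mult           # one token per computational core

    def acquire(self) -> bool:
        """A thread takes a core token before processing; returns False when
        all cores are busy, so the thread must wait."""
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False

    def release(self) -> None:
        """The core token is returned to p_R when processing finishes."""
        self.tokens += 1
```

On a two-core node, a third concurrent thread is blocked until one of the first two releases its core, exactly the restriction the added place imposes on the net.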

1.3 A.3 Performance Model and Implementation

Figure 11 shows the final GSPN model for the Storm design in Figs. 3 and 4. It has been obtained by applying the patterns and combining the subnets through the interfaces. The image of the GSPN model has been simplified for readability purposes. For instance, the input (output) arcs of the transitions that acquire (release) tokens from (to) the resource places Core_1 and Core_2 are shown as broken arcs.

Fig. 11
figure 11

GSPN for the design in Figs. 3 and 4

The UML Profile for Apache Storm can be downloaded from DICE Consortium (2016b). The transformation of the UML models to GSPN, as well as the evaluation of the performance metrics, have been automated in the DICE Simulation tool (DICE Consortium 2017), which is publicly available. The transformation uses QVT-MOF 2.0 (OMG 2011b) to obtain a Petri Net Markup Language (PNML) file (ISO 2008), an ISO-standard XML-based interchange format for Petri nets. Later on, a model-to-text (M2T) transformation from PNML into a tool-specific GSPN format, concretely that of the GreatSPN tool (Dipartimento di informatica 2015), is performed. Other tools, such as TimeNET (Zimmermann 2017), could also be used, although they are not yet integrated in the framework.

Appendix B: Computation of Reliability Metrics

The DICE Simulation tool (DICE Consortium 2017), used in this work, also computes reliability metrics, in particular, the availability and the mean time to failure (MTTF) for UML Apache Storm designs stereotyped using the Apache Storm profile presented in Section 3.

Many of the technologies covered for data-intensive applications (DIAs), such as Apache Storm, are designed to be fault-tolerant, which means that their failures are internally handled and are not visible to users. Therefore, the metrics are calculated from the properties of the resources used by the technologies, rather than from the activities executed by the applications. The following paragraphs detail how each metric is computed.

1.1 B.1 Computation of the Mean Time to Failure

Apache Storm applications are fault-tolerant, and the progress of the computation of the elements in the input streams is managed and stored by a separate piece of software, Zookeeper (Apache 2017b). Therefore, for the failure of a Storm application, the important concept to consider is the failure of the cluster of Zookeeper nodes, rather than the failure of spouts and bolts. In this case, the resources that execute the Zookeeper service have to be stereotyped with <<StormZookeeper>> to point out their functionality, and with <<DaComponent>>, inherited from the DAM profile, to specify their multiplicity with the "resMult" attribute and the estimated MTTF of the resource with the "failure" attribute.

To compute the MTTF, the DICE Simulation tool traverses all Nodes and Devices represented in the Deployment Diagram of the application, looking for the element stereotyped as <<StormZookeeper>>. When it finds this element, it obtains from its <<DaComponent>> stereotype the number of resources used for the Zookeeper cluster and the estimated MTTF of each of them. Finally, the computation of the global MTTF considers that the failure of the Storm DIA takes place when all the resources dedicated to executing the Zookeeper cluster have failed.
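As a numeric illustration only: if one additionally assumes independent, identically distributed exponential resource lifetimes (an assumption of this sketch; the paper only states that the global MTTF considers the failure of all resources), the mean time until all resMult Zookeeper resources have failed has a simple closed form.

```python
def cluster_mttf(resource_mttf: float, res_mult: int) -> float:
    """Mean time until ALL res_mult resources have failed, assuming
    independent, identically distributed exponential lifetimes (a modelling
    assumption of this sketch). For n i.i.d. exponential lifetimes with mean
    m, the expected maximum is m * (1 + 1/2 + ... + 1/n)."""
    return resource_mttf * sum(1.0 / k for k in range(1, res_mult + 1))
```

Under this assumption, a 3-node Zookeeper cluster whose nodes each have an MTTF of 1000 hours would yield a cluster MTTF of 1000 * (1 + 1/2 + 1/3), roughly 1833 hours.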

1.2 B.2 Computation of the Availability

The availability of a system is defined as its readiness for correct service (Avizienis et al. 2004). We compute the percentage of time that the system is available, i.e., the percentage of time that the system is ready to execute a correct service for its users (steady-state availability). It is defined as the mean time that the system works correctly between two consecutive failures (i.e., the MTTF) with respect to the Mean Time Between two consecutive Failures (MTBF). In turn, the MTBF is defined as the mean time of correct service until a failure plus the Mean Time To Repair (MTTR). Therefore, the steady-state availability is calculated using the formula:

$$Availability=\frac{MTTF}{MTTF+MTTR}\cdot 100, $$

where the value for MTTR is taken from attribute “repair” of the <<DaComponent>> stereotype.
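The availability formula above translates directly into code (the function name is hypothetical; the formula is the one given in the text):

```python
def steady_state_availability(mttf: float, mttr: float) -> float:
    """Steady-state availability as a percentage:
    Availability = MTTF / (MTTF + MTTR) * 100."""
    return mttf / (mttf + mttr) * 100.0
```

For example, a system with an MTTF of 990 hours and an MTTR of 10 hours is available 99% of the time.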

Appendix C: Petri Nets

A Generalized Stochastic Petri Net (GSPN) (Marsan et al. 1994) is a Petri net with a stochastic time interpretation, which makes it a modeling formalism well suited for performance analysis. A GSPN model is a bipartite graph consisting of two types of vertices: places and transitions.

Places are graphically depicted as circles and may contain tokens. A distribution of tokens over the places of a GSPN, called a marking, represents a state of the modeled system. The dynamics of the system are governed by the transition enabling and firing rules, where places represent pre- and post-conditions for transitions. In particular, the firing of a transition removes (adds) as many tokens from its input (output) places as the weights of the corresponding input (output) arcs. Transitions can be immediate, firing in zero time, or timed, firing after a delay sampled from a (negative) exponentially distributed random variable. Immediate transitions are graphically depicted as black thin bars, while timed ones are depicted as white thick bars.
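The enabling and firing rules can be stated compactly in code (an illustrative sketch; here a marking is a dictionary from place names to token counts, and a transition's arcs are dictionaries from place names to weights):

```python
def enabled(marking: dict, inputs: dict) -> bool:
    """A transition is enabled when every input place holds at least as many
    tokens as the weight of the corresponding input arc."""
    return all(marking.get(p, 0) >= w for p, w in inputs.items())

def fire(marking: dict, inputs: dict, outputs: dict) -> dict:
    """Firing removes tokens from input places and adds tokens to output
    places according to the arc weights, yielding the successor marking."""
    assert enabled(marking, inputs)
    m = dict(marking)
    for p, w in inputs.items():
        m[p] -= w
    for p, w in outputs.items():
        m[p] = m.get(p, 0) + w
    return m
```

Firing a transition with input arc p_A1 (weight 1) and output arc p_A2 (weight 1) on the marking {p_A1: 2} yields {p_A1: 1, p_A2: 1}.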


About this article


Cite this article

Requeno, J.I., Merseguer, J., Bernardi, S. et al. Quantitative Analysis of Apache Storm Applications: The NewsAsset Case Study. Inf Syst Front 21, 67–85 (2019). https://doi.org/10.1007/s10796-018-9851-x

