1 Introduction

Suppose you are at the helm of a small manufacturing company looking for improvements in the product packaging process. The old packaging machines are no longer up to the task, while there are five new top models on the market. The challenge? Choosing the best one, with a large investment at stake—in the order of hundreds of thousands of euros—and a major impact on the company’s operations. To solve this complex decision, four experienced engineers and technicians provide assessments based on their technical expertise and (not always very in-depth) information on the five models; each of them, from his/her own perspective, formulates a preference ranking of the five models (e.g., the third model is preferred to the first, which in turn is preferred to the fourth, and so on). But how can these individual evaluations be combined into a collective decision on the most suitable packaging machine? This is the challenge of the so-called ranking-aggregation problem, in which data and science are used to guide towards a well-informed choice. Specifically, this ancient and widespread problem has three characteristic elements (Spohn 2009; Reich 2010; Saari 2011):

  1. A set of objects to be prioritised according to a subjective attribute, i.e., a feature whose perception may depend on the person perceiving the stimulus, his/her technical knowledge and personal taste.

  2. A set of experts formulating preference rankings of the objects of interest. Experts may be regarded as equally important or arranged in a hierarchy of importance, depending on their competence in the evaluation they are asked to carry out.

  3. A collective judgement concerning the objects, resulting from the aggregation of expert rankings through a suitable aggregation technique. In the scientific literature, depending on the field and historical period, one can also encounter alternative expressions such as “collective/consensus judgement/assessment/evaluation”, which can, however, be considered interchangeable.

Traditional fields in which the problem is very popular are social choice, psychometrics, economics, and multi-criteria decision making (MCDM), with relevant contributions from eminent scientists (e.g., de Borda, Pareto, Samuelson, Arrow, Thurstone, Kendall, etc.) (Spohn 2009; Saari 2011; Arrow 2012; Bana e Costa 2012; Köksalan et al. 2013; Franceschini et al. 2022). For instance, despite differences in terminology and application context, typical MCDM applications share the theoretical and methodological foundations of the ranking-aggregation problem. They begin with a finite number of alternatives (analogous to objects), each represented by its performance across multiple criteria (analogous to experts, though not necessarily flesh-and-blood subjects). The objective is to identify the best alternative(s), akin to achieving a collective judgment (Belton and Stewart 2002; Zeleny 1976).

Due to its great generality, cross-disciplinary nature and multiplicity of potential applications, the ranking-aggregation problem has become of interest to many other scientific disciplines and operational contexts, including manufacturing. Some of the many possible manufacturing applications are as follows:

  • Production management, regarding the selection of the most appropriate production system on the basis of productivity, flexibility or other performance attributes (Chuu 2009; Chatterjee and Chakraborty 2014; Nestic et al. 2019; Hakimi-Asl et al. 2018; Qin et al. 2020);

  • Procurement, in which some managers have to identify the most appropriate suppliers or materials for a certain manufacturing system (Giachetti 1998; Yu and Hou 2016);

  • Conceptual design, regarding the opinions of different designers about alternative design concepts, from the perspective of specific technical features (Franceschini and Maisano 2019);

  • Quality control, regarding the prioritization of defects on manufactured parts, aggregating expert judgments by visual inspection (Franceschini and Maisano 2018b);

  • Reliability engineering, regarding the aggregation of the opinions of maintenance/reliability experts on the criticality of (potential) failures in production equipment (Geramian et al. 2019);

  • Customer-driven design, regarding the opinions of a panel of (service/product) customers on the degree of importance of a set of customer needs (Nahm et al. 2013);

  • Analysis of market demand, regarding the opinions of marketing experts about the most appropriate actions for the promotion of a new product/service (Franceschini and Maisano 2018a).

The analyst’s focus is often directed to the aggregation technique, which can be interpreted as a “black box” transforming input data (i.e., experts’ rankings and importance hierarchy) into output data (i.e., collective judgement) (Franceschini et al. 2022). However, this may lead to overlooking other important methodological aspects that characterise the ranking-aggregation problem, such as preliminary assessment of the degree of concordance among experts, verification of the consistency and robustness of output data, etc.

Aimed at scientists and practitioners in the manufacturing field, this work provides a set of useful tools to tackle the ranking-aggregation problem in a practical and effective manner, addressing the following research question: “How can the ranking-aggregation problem be effectively handled in the manufacturing field and what methodologies and tools can enhance the plausibility and robustness of the solution obtained through the expert-ranking aggregation?”. It is hypothesized that manufacturing scientists are often unfamiliar with the problem of interest, even though they may occasionally have to deal with it. Therefore, the article tries to bridge this knowledge gap by providing a relatively straightforward and effective operational methodology. The innovative aspect of this work lies in the integration, within the proposed methodology, of tools that are individually available in the scientific literature but are combined here in an organic manner. Furthermore, the proposed methodology is flexible, adapting to problems with different characteristics and incorporating different tools interchangeably. It is also iterative, including intermediate verifications that allow for adjustments and corrections while addressing the ranking-aggregation problem.

The remainder of this work is organised in three sections. Section 2 briefly introduces a real-world case study, concerning cobot-assisted manual (dis)assembly, which accompanies the description of the proposed methodology. Section 3, which is the heart of the article, provides a step-by-step description of the assisted operational methodology, based on three phases: (i) problem formulation, (ii) collection of expert rankings, and (iii) collective judgment and validation. Section 4 summarises the original contributions of this work, its implications, limitations and insights for future research.

2 Case study

A company in the automotive industry reconditions different types of electrical components, mainly starter motors and alternators. Although the operations required are mostly manual and specific to each component, they can be divided into the following groups:

  • Disassembly

    • Disassembling any external coverings and shells (to access the internal parts);

    • Removal of electrical connectors and cables;

    • Unfastening bolts, screws, and other fasteners;

    • Separation of any electronic circuits (from the motherboard or main body);

    • Extraction of internal components (sensors, relays, transistors, capacitors, diodes, etc.).

  • Reconditioning

    • Identification of parts to be replaced or repaired;

    • Repairing/replacing these parts;

    • Intermediate testing.

  • Reassembly

    • Mounting internal components (repaired or replaced) in their respective housings;

    • Reconnecting electrical cables and connectors;

    • Fastening with bolts, screws, or other fastening elements;

    • Ensuring that electrical connections are securely fastened;

    • Reassembling external shells and coverings;

    • Testing and diagnostics to verify the proper functioning of the reconditioned unit;

    • Cleaning, polishing, and final marking.

Because of the wide variety of components and the complexity of (dis)assembly and repair operations, the company has been assisting human operators with collaborative robots, or simply cobots (see Fig. 1), which are particularly useful for assisting manual operations that require great precision, dexterity and strength (Gervasi et al. 2022). Cobots are extremely versatile for multiple tasks, such as (i) picking up, clamping and handing over the tools and parts to be machined/assembled, (ii) supporting dimensional inspection, online quality control, etc., and (iii) guiding less experienced operators, like virtual tutors.

Fig. 1 Cobot-supported operator for a manual assembly task

The current market includes a relatively wide range of cobot models that could be adapted to the context of interest. The company management decided to identify the most appropriate cobot on the basis of the programming-practicality attribute, which is crucial in making task preparation faster and easier, while reducing the level of technical skills required of operators (El Zaatari et al. 2019). The following five cobot models were selected from those at the forefront of the market, as they all (i) have a similar payload (around 5–10 kg), (ii) are designed for precision assembly and machining applications, and (iii) are relatively cost-effective:

  • (o1) Techman Robot TM5-700;

  • (o2) ABB GoFa10;

  • (o3) Universal Robots UR10E;

  • (o4) Yaskawa Motoman HC10DTP Classic;

  • (o5) Kinova Link 6.

In order to carry out a comprehensive evaluation, the company set up a panel of eight experts (mostly engineers, technicians and external consultants) from different technical areas and with diverse and complementary skills, a brief description of which follows:

  • (e1) Industrial-automation expert with in-depth skills in industrial process design and optimization;

  • (e2) Electrical engineer with comprehensive knowledge of electrical components and the technical specifications required for their safe assembly and disassembly;

  • (e3) Artificial-vision specialist capable of integrating advanced vision systems onto cobots, for precise recognition and positioning of electrical components;

  • (e4) Ergonomics expert with skills to define ergonomic and intuitive interaction modes with cobots for operators;

  • (e5) Robot-programming specialist with significant experience in both traditional industrial robots and collaborative robots;

  • (e6) Workplace-safety expert with in-depth knowledge of safety regulations and protocols for safe human-machine collaboration;

  • (e7) Maintenance expert with skills for planning and managing preventive and corrective maintenance activities on cobots;

  • (e8) Quality engineer with relevant experience to ensure that the robot-assisted assembly/disassembly process complies with quality standards.

3 Assisted operational methodology

This section illustrates an assisted operational methodology for tackling the ranking-aggregation problem in a practical, comprehensive and critical manner. The flowchart in Fig. 2 summarises the proposed methodology, which is divided into three operational phases illustrated in the corresponding subsections: “problem formulation”, “collection of expert rankings” and “collective judgment and validation”. The multiple feedback loops denote the iterative nature of the proposed procedure, which includes several intermediate verifications, with possible in-progress corrections and adjustments.

Fig. 2 Flow chart summarising the assisted operational methodology for ranking aggregation

3.1 Problem formulation

First, the specific problem and its characteristics should be identified clearly and unambiguously; on this basis, the specific ranking-aggregation problem can be formulated. With reference to the case study, the cobot models are the n = 5 objects (o1–o5, cf. Sect. 2) that will be evaluated in terms of programming practicality, i.e., the attribute of interest. This attribute encompasses a range of desiderata, many of which are related to subjective perceptions, as detailed below:

  • An intuitive and easy-to-learn programming language will reduce development time and programming errors.

  • The user interface of the teach pendant should be intuitive and user-friendly to simplify and speed up the programming and control phase of the cobot.

  • It would be desirable to be able to programme and simulate the behaviour of the cobot even offline, without necessarily being connected to it.

  • The cobot should be integrable with external sensors (such as cameras, force sensors, etc.), so as to be more versatile for complex tasks.

  • The cobot programming should include advanced safety features to avoid accidents and ensure safe collaboration between the cobot and human operators.

  • Some cobots support the use of third-party programming languages, such as Python or C++, making the import of external routines more versatile.

  • Tutorials, documentation and technical support should help make operator learning quicker and easier.

As seen in Sect. 2, the m = 8 experts (e1–e8) are technicians, engineers and external consultants who formulate their individual preference rankings of the cobot models. In general, when selecting experts, (at least) two aspects must be taken into account:

  1. The greater the number of experts formulating their individual rankings, the higher the statistical relevance of the problem output (Friedman 1940; Kendall 1962; Gibbons and Chakraborti 2010). Unfortunately, practical constraints may limit the number of available experts (e.g., they should have a high level of technical expertise). Pragmatically, it would be desirable for m to be no less than 5–6 in order for the results of the study to be relevant (Franceschini et al. 2022).

  2. It may sometimes be appropriate to have a hierarchy of importance of experts, for instance by giving greater weight to those with greater technical expertise. This hierarchy can be constructed in different ways, typically by associating each expert with a weight or defining an importance ranking (Gibbons and Chakraborti 2010; Leo Kumar 2019). In the case study, the technical competences of the experts are notably different and, at the same time, complementary, with no clear superiority of one over the others (cf. Sect. 2). For this reason, all these experts are regarded as equally important. From a practical point of view, this choice simplifies the handling of the problem and broadens the range of applicable aggregation techniques (cf. Sect. 3.3).

Next, the type of expert rankings can be determined depending on several factors, such as the goal of the problem (e.g., identifying the best/worst object(s), drawing up a complete ranking, etc.), the data-collection strategy (e.g., through focus groups, personal telephone/street interviews, online forms, etc.), the literacy level of experts, etc. Complete rankings—i.e., rankings in which experts order all objects by linking them with strict preference (“oi ≻ oj”) or indifference relationships (“oi ~ oj”)—represent a classic scenario, although their formulation requires some effort, especially if the number of objects is large (Lagerspetz 2016). On the other hand, incomplete rankings are more “digestible” for experts, because they can accommodate possible hesitations or doubts; for instance, incomplete rankings may include only a small number of top or bottom objects (e.g., the three most/least preferred), omit an object with which the expert is not familiar, or contain incomparability relationships between objects (“oi || oj”) (Chen et al. 2012). Given the relatively small number of objects, in the present case experts are asked to formulate complete rankings of all five objects. In Sect. 3.2 we will illustrate a way to formulate complete rankings indirectly, through a simplified response mode.

Subsequently, the collective-judgment type must be defined according to the “desirable” properties for the specific problem. There is a wide range of possibilities: rankings, scalings on different scale types (e.g., interval, ratio), clusterings, scorings, or collective judgments designating only the winner/loser object, etc. For the sake of simplicity, in the case study the expected collective judgment is represented by a complete ranking. The analyst must be aware that the choice of input/output data for the problem has implications for both the subsequent formulation of rankings (in Sect. 3.2) and the choice of aggregation technique (in Sect. 3.3).

3.2 Collection of expert rankings

This stage begins with a detailed explanation of the problem to experts, who need to understand exactly which objects are to be evaluated, the attribute against which the evaluation is to be made, and how to formulate individual rankings. In order to make this formulation less laborious, especially when the number of objects being compared is large, experts can formulate ratings of the objects, which can then be converted into a complete ranking (see example in Fig. 3).

Fig. 3 Example of the conversion of judgments (on five objects: o1–o5) from (a) a five-level rating scale to (b) a (complete) ranking (Franceschini et al. 2022). The procedure was applied to expert e1 and can be extended to the other seven experts
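
As a minimal illustration of this rating-to-ranking conversion, the sketch below assumes that higher ratings denote stronger preference; the rating values are hypothetical and not those of Fig. 3.

```python
# Hypothetical ratings by one expert on a five-level scale (5 = most preferred).
# Objects receiving the same rating end up tied (indifferent) in the ranking.
ratings = {"o1": 4, "o2": 2, "o3": 5, "o4": 2, "o5": 4}

def ratings_to_ranking(ratings):
    """Group objects by rating level and order the groups from most to least preferred."""
    tiers = {}
    for obj, r in ratings.items():
        tiers.setdefault(r, []).append(obj)
    # One tier per rating level, best rating first; objects within a tier are tied.
    return [sorted(tiers[r]) for r in sorted(tiers, reverse=True)]

print(ratings_to_ranking(ratings))
# [['o3'], ['o1', 'o5'], ['o2', 'o4']]  i.e., o3 ≻ (o1 ~ o5) ≻ (o2 ~ o4)
```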

Returning to the case study, Fig. 4 reports the resulting (complete) expert rankings, which include relationships of strict preference (“oi ≻ oj”) and indifference (“oi ~ oj”) between objects. At this stage, it must be ensured that the experts’ rankings are formulated consistently with the expected type; if necessary, the formulation must be corrected/revised (see feedback loop from block 2.3 in Fig. 2).

Fig. 4 (a) Complete rankings of n = 5 objects, formulated by m = 8 experts; (b) corresponding rank table. Ti is a correction factor for ties (cf. Eq. 1)

Looking more closely at the rankings of the eight experts in the case study, it should come as no surprise that they sometimes differ from each other, since they are often based on complementary perspectives. For example, experts e4 and e7 seem to express two radically different evaluations of object o1. Probably influenced by their different backgrounds and training, these experts have developed very different perceptions of the cobot model o1, resulting in evaluations in opposite directions.

3.2.1 Concordance among expert rankings

Evaluating the concordance among expert rankings is a preliminary check of the plausibility of input data, which is useful to prevent difficulties such as excessive heterogeneity in the selection of experts, poor understanding of the problem, errors in the formulation of rankings, or other potential obstacles to achieving consensus. The scientific literature includes various statistical indicators, which can be used depending on the problem characteristics (Agresti 2010; Gibbons and Chakraborti 2010; Sato and Tan 2023). Since the present case is characterized by complete expert rankings with equally-important experts, Kendall’s W and Spearman’s ρ can be used (Franceschini et al. 2022).

W, known as the coefficient of concordance, is a multivariate statistic that applies at the level of expert rankings and is related to the dispersion of the ranks associated with each object (Ross 2009; Franceschini et al. 2022). This measure belongs to the range [0, 1], where 1 indicates perfect concordance and 0 indicates independence (Legendre 2010).

Returning to the case study, each ranking can be translated into a set of ranks—that is, a permutation of the integers {1, 2, 3, 4, 5} (or average ranks in the presence of ties)—which are then organized into a so-called rank table, i.e., a matrix of size m × n, with row and column labels designating experts and objects (see Fig. 4b). In the case of tied objects—i.e., pairs of objects with indifference relationships, e.g., “oi ~ oj”—we conventionally use the average ranks that each set of tied objects would occupy if a preference could be expressed (Gibbons and Chakraborti 2010); for example, in a ranking where objects o1 and o3 are tied for 3rd and 4th place (e.g., see the ranking by e6 in Fig. 4a), the average rank of (3 + 4)/2 = 3.5 would be assigned to both.
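
For illustration, the conversion from (tiered) rankings to rows of the rank table can be sketched as follows; rankings are encoded as ordered lists of tiers, with tied objects sharing a tier, and the two example rankings are purely illustrative rather than the full data of Fig. 4a.

```python
# Each ranking is an ordered list of tiers; objects within a tier are tied.
# Example: "o3 ≻ (o1 ~ o5) ≻ o2 ≻ o4"  ->  [["o3"], ["o1", "o5"], ["o2"], ["o4"]]
OBJECTS = ["o1", "o2", "o3", "o4", "o5"]

def to_average_ranks(tiers, objects=OBJECTS):
    """Assign each object the average of the rank positions its tier occupies."""
    ranks, position = {}, 1
    for tier in tiers:
        avg = position + (len(tier) - 1) / 2     # mean of the positions taken by the tier
        for obj in tier:
            ranks[obj] = avg
        position += len(tier)
    return [ranks[o] for o in objects]           # one row of the m x n rank table

# Two illustrative rankings (not the complete data set of the case study)
rank_table = [
    to_average_ranks([["o3"], ["o1", "o5"], ["o2"], ["o4"]]),  # o1, o5 tied at 2nd/3rd -> 2.5
    to_average_ranks([["o2"], ["o5"], ["o1", "o3"], ["o4"]]),  # o1, o3 tied at 3rd/4th -> 3.5
]
print(rank_table)   # [[2.5, 4.0, 1.0, 5.0, 2.5], [3.5, 1.0, 3.5, 5.0, 2.0]]
```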

W is defined as:

$$W = \frac{{\mathop \sum \nolimits_{j = 1}^{n} \left( {R_{j} - \overline{R}} \right)^{2} }}{{\left[ {m^{2} \cdot n \cdot \left( {n^{2} - 1} \right) - m \cdot \mathop \sum \nolimits_{i = 1}^{m} T_{i} } \right]/12}},$$
(1)

where

n the number of objects (i.e., 5 here);

m the number of experts (i.e., 8 here);

Rj the column total related to the j-th column of the rank table;

\(\overline{R}=m\cdot \left(n+1\right)/2\) the average column total (i.e., 24 here);

\({T}_{i}={\sum }_{k=1}^{{g}_{i}}\left({t}_{k}^{3}-{t}_{k}\right)\) a correction factor for ties, in which tk is the number of tied ranks in the k-th group of tied ranks (where a group is a set of objects sharing the same average rank) and gi is the number of groups of ties (ranging from 1 to n) in the set of ranks of expert i. This correction factor ensures that, in the case of perfectly concordant rankings with ties (i.e., when all rankings coincide), W = 1 (or 100%) is obtained (Gibbons and Chakraborti 2010).

With reference to the case study (cf. expert rankings and related object ranks in Fig. 4), W = 22.4% is obtained, denoting a relatively low level of concordance. Not surprisingly, a significance test of the null hypothesis of independence between rankings yields the following test statistic:

$$Q = W \cdot m \cdot \left( {n - 1} \right) = 7.2 < \chi_{n - 1,\alpha }^{2} = 9.49,$$
(2)

where \({\chi }_{n-1,\alpha }^{2}={\chi }_{4,5\%}^{2}\) is the critical value of a chi-square (\({\chi }^{2}\)) distribution with n − 1 degrees of freedom, at a conventional significance level of α = 5%. Equation (2) indicates that the null hypothesis cannot be rejected with a confidence level of 1 − α = 95% (Ross 2009; Gibbons and Chakraborti 2010).
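
For illustration, a minimal computational sketch of Eq. (1) and of the test statistic in Eq. (2) is reported below; it assumes a rank table in which ties are already expressed as average ranks, and the data are hypothetical rather than the case-study rank table of Fig. 4b.

```python
import numpy as np
from scipy.stats import chi2

def kendalls_w(ranks):
    """Kendall's W with tie correction (Eq. 1); ranks is an m x n array
    (rows = experts, columns = objects), with ties expressed as average ranks."""
    ranks = np.asarray(ranks, dtype=float)
    m, n = ranks.shape
    Rj = ranks.sum(axis=0)                      # column totals
    R_bar = m * (n + 1) / 2                     # average column total
    T = 0.0                                     # sum of the tie-correction factors T_i
    for row in ranks:
        _, counts = np.unique(row, return_counts=True)
        T += sum(t**3 - t for t in counts if t > 1)
    return np.sum((Rj - R_bar) ** 2) / ((m**2 * n * (n**2 - 1) - m * T) / 12)

# Hypothetical rank table (m = 4 experts, n = 5 objects), not the case-study data
ranks = [
    [2.5, 4, 1, 5, 2.5],
    [3.5, 1, 3.5, 5, 2],
    [2, 4, 1, 5, 3],
    [1, 3, 2, 5, 4],
]
m, n = len(ranks), len(ranks[0])
W = kendalls_w(ranks)
Q = W * m * (n - 1)                             # test statistic of Eq. (2)
critical = chi2.ppf(0.95, df=n - 1)             # chi-square critical value for alpha = 5%
print(f"W = {W:.3f}, Q = {Q:.2f}, critical value = {critical:.2f}")
```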

To further investigate the reasons for this low inter-expert concordance, the bivariate perspective of Spearman’s correlation coefficient related to each possible pair of rankings (ρ) can be considered. Table 1 contains the ρ coefficients between all the possible \(\left(\begin{array}{c}m\\ 2\end{array}\right)=\frac{m\cdot \left(m-1\right)}{2}=28\) pairs of expert rankings under consideration (Ross 2009).

Table 1 Spearman’s ρ correlation table for the expert rankings in Fig. 4a
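
For illustration, the pairwise ρ coefficients can be computed directly from the rank table, e.g., with scipy’s spearmanr function, which handles ties through average ranks; the rank table below is again hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical m x n rank table (rows = experts, columns = objects)
ranks = np.array([
    [2.5, 4, 1, 5, 2.5],
    [3.5, 1, 3.5, 5, 2],
    [2, 4, 1, 5, 3],
    [1, 3, 2, 5, 4],
])

# spearmanr correlates the columns of its input, so the table is transposed
# to obtain the m x m correlation matrix between experts
rho, _ = spearmanr(ranks.T)
m = ranks.shape[0]
for i in range(m):
    for j in range(i + 1, m):                   # the m*(m-1)/2 pairs of experts
        print(f"rho(e{i + 1}, e{j + 1}) = {rho[i, j]:+.2f}")
```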

Rather pronounced negative correlations (i.e., ρ ≤ −0.4) between certain pairs of expert rankings stand out. Curiously, they often involve the ranking by e5, denoting a sort of “counter-trend” with respect to the other rankings. Upon brief investigation of the reasons for this counter-trend, it turns out that e5 misunderstood the ranking construction, formulating it in the sense of reverse preference; therefore, the correct ranking should be “o3≻(o1 ~ o2 ~ o5)≻o4” instead of “o4≻(o1 ~ o2 ~ o5)≻o3” (see feedback loop from block 2.7 in Fig. 2). After this correction, the new value of W is almost twice as high as the initial one (i.e., W = 40.2% versus 22.4%) and the significance test in Eq. (2) results in \(Q=12.9\ge {\chi }_{n-1,\alpha }^{2}=9.49\), which leads to rejecting the null hypothesis and considering the new level of concordance as statistically significant. Simultaneously, the relatively large negative ρ values for e5 are “reabsorbed” (see Table 2, containing the new ρ values).

Table 2 Spearman’s ρ correlation table for the expert rankings in Fig. 4, after the correction of the ranking by e5 (i.e., “o3≻(o1 ~ o2 ~ o5)≻o4” instead of “o4≻(o1 ~ o2 ~ o5)≻o3”)
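
In terms of ranks, correcting a ranking formulated in the reverse sense of preference simply amounts to mapping each (average) rank r onto n + 1 − r; a minimal sketch for the e5 ranking follows.

```python
# Reversing a ranking expressed as average ranks: r -> n + 1 - r.
# Ranks (o1..o5) of the misunderstood ranking "o4 ≻ (o1 ~ o2 ~ o5) ≻ o3":
n = 5
reversed_ranks = [3, 3, 5, 1, 3]
corrected = [n + 1 - r for r in reversed_ranks]
print(corrected)   # [3, 3, 1, 5, 3], i.e., "o3 ≻ (o1 ~ o2 ~ o5) ≻ o4"
```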

As exemplified, the concordance analysis can be useful in pointing out possible anomalies and “pitfalls” in the formulation of expert rankings (Franceschini et al. 2022).

3.3 Collective judgment and validation

At this point, the ranking-aggregation problem needs to be solved by applying an appropriate aggregation technique and, subsequently, verifying the plausibility of the resulting output.

3.3.1 Ranking aggregation

This is the heart of the ranking-aggregation problem and implies some knowledge of state-of-the-art aggregation techniques. Without any claim to exhaustiveness, Table 3 simply recalls some possible aspects to be taken into account when selecting the aggregation technique (Franceschini et al. 2022).

Table 3 Aspects to consider when selecting the aggregation technique, with reference to a specific ranking-aggregation problem (Franceschini et al. 2022)

For an overview of the aggregation techniques, we refer the reader to relevant surveys and extensive reviews (Figueira et al. 2005; Reich 2010; Herrera-Viedma et al. 2014). For example, Table 4—adapted from (Franceschini et al. 2022)—classifies nine different aggregation techniques according to the aspects listed in Table 3. It can be noted that some techniques are suited to situations with few objects/experts, while others—which can be defined as more “parsimonious” (Kabirifar et al. 2023; Corrente et al. 2024)—are also suitable for situations with a relatively large number of objects/experts. The summary in Table 4 is evidently partial and not intended to be comprehensive. In future research, we aim to provide a more comprehensive overview in this regard. Here we just point out that (i) aggregation techniques are all inherently imperfect (Arrow 2012), (ii) their success depends not only on their efficacy, accuracy, and scientific rigour but also on their simplicity of use (Oukil 2019; Sarwar et al. 2021), and (iii) in general it would be good to avoid "falling in love" with one technique and—when possible—use multiple techniques simultaneously (cf. concept of wisdom of crowds) (Franceschini et al. 2022).

Table 4 Synthetic comparison among nine aggregation techniques illustrated in (Franceschini et al. 2022), according to the aspects in Table 3. The first and fourth techniques will be used for the case study

In line with this consideration, two relatively simple aggregation techniques are applied to the problem of interest (a computational sketch of both is provided after the list):

  • Borda count (BC). For each expert ranking, the first object accumulates one point, the second two points, and so on (Borda 1781; Saari 2011). In case of ties, the average ranks described in Sect. 3.2 can be used. The collective score (BC) of an object can be calculated by cumulating the scores related to each ranking; in this sense, the BC method implements the concept of “average rank position”. BC is used in various contexts, such as engineering design, the “RoboCup” robot soccer competition, the “Eurovision” song contest, etc. (Dym et al. 2002; Franceschini et al. 2022).

  • Best of the best (BoB). For each expert ranking, the most preferred object obtains one point. In case of a tie between leading objects, the point is split equally, dividing it by the number of tied objects (e.g., ½ if there are 2 tied objects, 1/3 if there are 3, and so on). In some contexts, the BoB method is also referred to as “Plurality Voting” or “First Past the Post” (Blais 2008).
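
The sketch below applies both techniques, assuming that the expert rankings have already been converted into an m × n rank table of average ranks; the data are hypothetical and not those of Fig. 5a.

```python
import numpy as np

# Hypothetical m x n rank table (average ranks; rows = experts, columns = objects)
ranks = np.array([
    [2.5, 4, 1, 5, 2.5],
    [3.5, 1, 3.5, 5, 2],
    [2, 4, 1, 5, 3],
    [1, 3, 2, 5, 4],
])
objects = ["o1", "o2", "o3", "o4", "o5"]

# Borda count: cumulate the (average) rank positions; a lower total is better.
# Equal totals would correspond to indifference in the collective ranking.
bc_scores = ranks.sum(axis=0)
bc_order = [objects[i] for i in np.argsort(bc_scores)]

# Best of the best: one point per expert, split equally among tied leading objects
bob_scores = np.zeros(len(objects))
for row in ranks:
    leaders = np.flatnonzero(row == row.min())   # most preferred object(s) of this expert
    bob_scores[leaders] += 1 / len(leaders)
bob_order = [objects[i] for i in np.argsort(-bob_scores)]   # a higher score is better

print("BC totals :", dict(zip(objects, bc_scores)), "->", " > ".join(bc_order))
print("BoB points:", dict(zip(objects, bob_scores)), "->", " > ".join(bob_order))
```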

Figure 5a, b respectively show the results of the application of the BC and BoB techniques to the expert rankings (after the correction of the ranking by e5, cf. Sect. 3.2.1). These two aggregation techniques—which are simple and well suited to complete rankings by equally-important experts—here result in two similar collective rankings (see the bottom of Fig. 5). Both techniques lead to the same “trio” of most suitable cobot models: o3 (Universal Robots UR10E), followed by o5 (Kinova Link 6) and then o1 (Techman Robot TM5-700).

Fig. 5 (a) Expert rankings, (b) scoring/ranking resulting from the application of the Borda count (BC), and (c) scoring/ranking resulting from the application of the Best of the Best (BoB)

3.3.2 Consistency analysis

Every aggregation technique surely provides a result; but how does one know whether it is plausible? Certainly, the rationale of the aggregation technique represents a conceptual guarantee that it is capable of producing reasonable results. However, the aggregation technique that most consistently reflects the expert rankings cannot be identified ex ante, but only ex post and on a case-by-case basis (Chiclana 2002; Arrow 2012; McComb et al. 2017).

Studies have focused on the concept of consistency of the collective judgment with respect to the input data, defined as “the ability of a collective judgment to reflect the rankings of experts, while taking the importance hierarchy into account” (Franceschini et al. 2022). Among the available tools to assess the degree of consistency of the solution to a certain ranking-aggregation problem, p-indicators are very versatile, as they can be adapted to a variety of contexts, such as those in which expert rankings are (i) not necessarily complete, (ii) formulated by equally-important experts, or (iii) characterized by an importance hierarchy of experts (Franceschini et al. 2022). In general, p-indicators can be divided into two families:

  • pj, indicators of local consistency, which are based on the comparison of each j-th expert’s ranking with the collective judgement.

A preliminary operation for determining pj is the construction of a “paired-comparison table”, in which each ranking (i.e., those from the experts and the one deduced from the collective judgment) is transformed into a set of paired-comparison relationships (see symbols “≻” and “~” in Tables 6a, 7a). Next, a “consistency table”—which turns the paired-comparison relationships of each expert into scores, according to the scoring system in Table 5—is constructed; the conventional assignment of 0.5 points in the case of weak consistency is justified by the fact that this is the intermediate case between full consistency (with score 1) and inconsistency (with score 0) (Franceschini et al. 2022). The consistency table also reports the sum of the scores (xj) obtained by each j-th expert ranking. Tables 6b and 7b exemplify two consistency tables related to the case study of interest, for the two aggregation techniques (BC and BoB respectively). Tables 6c and 7c show that both techniques result in collective rankings that are generally consistent with the individual expert rankings. The least consistent expert rankings (i.e., those with lower pj values) appear to be those formulated by e4 and e6, although the difference is small.

Table 5 Scoring system used in the construction of the “consistency table”
Table 6 (a) Paired-comparison table, (b) consistency table, and (c) p-indicators related to the BC technique (which resulted in the collective ranking o3≻o5≻o1≻o2≻o4, cf. Fig. 5b)
Table 7 (a) Paired-comparison table, (b) consistency table, and (c) p-indicators related to the BoB technique (which resulted in the collective ranking o3≻o5≻o1≻(o2 ~ o4), cf. Fig. 5c)

Next, for each j-th expert, the proportion of “consistent” paired comparisons can be calculated as:

$${p}_{j}=\frac{{x}_{j}}{\left(\begin{array}{c}n\\ 2\end{array}\right)}=\frac{{x}_{j}}{10},$$
(3)

where

xj the total score related to the j-th expert;

\(\left(\begin{array}{c}n\\ 2\end{array}\right)=\frac{n!}{2!\cdot \left(n-2\right)!}=\frac{n\cdot \left(n-1\right)}{2}\) the overall number of paired comparisons (i.e., 10 here).

  • p, i.e., the indicator of global consistency. In the case of equally-important experts, the pj values are aggregated through the arithmetic average (Franceschini et al. 2022):

    $$p = \frac{1}{m} \cdot \mathop \sum \limits_{j = 1}^{m} p_{j} ,\quad p \in \left[ {0,1} \right].$$
    (4)

In this particular case, the two aggregation techniques result in two relatively close p-values: i.e., 75.0% for BC and 73.8% for BoB (see Tables 6c and 7c). This confirms that both techniques yield collective rankings that are relatively consistent with the input data (and vice versa), with a slight predominance of BC over BoB. In the case of non-equally-important experts and/or incomplete expert rankings, the formulation of p-indicators is more complex (Franceschini et al. 2022).
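
A minimal sketch of the computation of pj (Eq. 3) is given below. It assumes that the scoring convention of Table 5 assigns 1 point when the expert and collective relations coincide, 0.5 points when one of the two is an indifference and the other a strict preference (weak consistency), and 0 points when the strict preferences are opposite; rankings are expressed as average ranks and the example data are hypothetical.

```python
from itertools import combinations

OBJECTS = ["o1", "o2", "o3", "o4", "o5"]

def relation(ranks, a, b):
    """Return '>', '<' or '~' for the pair (a, b), given a dict of average ranks."""
    if ranks[a] < ranks[b]:
        return ">"
    if ranks[a] > ranks[b]:
        return "<"
    return "~"

def p_local(expert_ranks, collective_ranks, objects=OBJECTS):
    """Local consistency p_j of one expert ranking w.r.t. the collective ranking (Eq. 3)."""
    score = 0.0
    pairs = list(combinations(objects, 2))       # the C(n, 2) paired comparisons
    for a, b in pairs:
        r_e = relation(expert_ranks, a, b)
        r_c = relation(collective_ranks, a, b)
        if r_e == r_c:
            score += 1.0                         # full consistency (assumed convention)
        elif "~" in (r_e, r_c):
            score += 0.5                         # weak consistency (assumed convention)
        # opposite strict preferences: 0 points
    return score / len(pairs)                    # p_j = x_j / C(n, 2)

# Hypothetical example: one expert ranking vs. a collective ranking
expert = {"o1": 2.5, "o2": 4, "o3": 1, "o4": 5, "o5": 2.5}   # o3 ≻ (o1 ~ o5) ≻ o2 ≻ o4
collective = {"o1": 3, "o2": 4, "o3": 1, "o4": 5, "o5": 2}   # o3 ≻ o5 ≻ o1 ≻ o2 ≻ o4
print(f"p_j = {p_local(expert, collective):.2f}")
# The global indicator p (Eq. 4) is the arithmetic mean of the p_j values over the m experts.
```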

Besides the p-indicators, another tool for assessing consistency is \({W}_{k}^{\left(m+1\right)}\), i.e., an indicator inspired by Kendall’s W (cf. Eq. 1), which is nothing more than W itself applied to (m + 1) rankings consisting of: (i) the m expert rankings, and (ii) the collective ranking obtained after the application of a given aggregation model (k) to those expert rankings. Consistency between the collective ranking and the expert rankings is assessed in relative terms, by comparing \({W}_{k}^{\left(m+1\right)}\) with the traditional W. \({W}_{k}^{\left(m+1\right)}\ge W\) denotes consistency (or positive consistency) between the collective ranking and the m rankings, while \({W}_{k}^{\left(m+1\right)}<W\) denotes inconsistency (or negative consistency) (Franceschini and Maisano 2021). The latter situation can occur when a collective ranking somehow conflicts with the m rankings. To make the consistency assessment easier, another synthetic indicator can be used:

$$b_{k}^{\left( m \right)} = \frac{{W_{k}^{{\left( {m + 1} \right)}} }}{{W^{\left( m \right)} }},\quad b_{k}^{(m)} \in ]0, + \infty ].$$
(5)

For a specific set of m rankings, \({b}_{k}^{\left(m\right)} \ge 1\) indicates that the aggregation model (k) provides a somehow consistent collective ranking (positive consistency), while \({b}_{k}^{\left(m\right)}<1\) indicates that it provides a somehow inconsistent collective ranking (negative consistency). Table 8 exemplifies the calculation of indicators \({W}_{k}^{\left(m+1\right)}\) and \({b}_{k}^{\left(m\right)}\) for the case study, considering the BC and BoB aggregation techniques respectively. Positive consistency is observed for both techniques, with a slight predominance of BC over BoB (e.g., consider the \({b}_{k}^{(m)}\) value of 1.13 for BC versus 1.12 for BoB), confirming the result obtained through p-indicators.

Table 8 W, \({W}_{k}^{\left(m+1\right)}\), and \({b}_{k}^{\left(m\right)}\) indicators for the collective rankings resulting from the application of the BC and BoB aggregation techniques to the problem of interest
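
For illustration, both indicators can be obtained by re-applying the W computation of Eq. (1) to the rank table extended with the collective ranking; the sketch below restates the kendalls_w function of Sect. 3.2.1 and uses hypothetical data.

```python
import numpy as np

def kendalls_w(ranks):
    """Kendall's W with tie correction (Eq. 1), as in the sketch of Sect. 3.2.1."""
    ranks = np.asarray(ranks, dtype=float)
    m, n = ranks.shape
    Rj = ranks.sum(axis=0)
    T = sum(sum(t**3 - t for t in np.unique(row, return_counts=True)[1] if t > 1)
            for row in ranks)
    return np.sum((Rj - m * (n + 1) / 2) ** 2) / ((m**2 * n * (n**2 - 1) - m * T) / 12)

# Hypothetical data: m = 4 expert rankings plus one collective ranking (average ranks)
expert_ranks = np.array([
    [2.5, 4, 1, 5, 2.5],
    [3.5, 1, 3.5, 5, 2],
    [2, 4, 1, 5, 3],
    [1, 3, 2, 5, 4],
])
collective_ranks = np.array([[2, 4, 1, 5, 3]])

W_m = kendalls_w(expert_ranks)                                    # W^(m)
W_m1 = kendalls_w(np.vstack([expert_ranks, collective_ranks]))    # W_k^(m+1)
b_k = W_m1 / W_m                                                  # Eq. (5)
print(f"W = {W_m:.3f}, W_k^(m+1) = {W_m1:.3f}, b_k^(m) = {b_k:.2f}")
# b_k >= 1 denotes positive consistency of aggregation model k; b_k < 1, negative consistency
```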

3.3.3 Robustness of the solution

The formulation of rankings is often affected by inherent variability, which can “propagate” to the variability of the output (Saltelli et al. 2006). Only very few aggregation techniques associate the resulting collective judgment with a corresponding estimate of variability (Franceschini and Maisano 2020). In general, it may be useful to perform a sensitivity analysis to assess the robustness of the solution against small variations in the input data (Saltelli et al. 2006). An example of sensitivity analysis follows.

Table 9 contains three sets of expert rankings: (i) the initial one (cf. Fig. 5a) and (ii, iii) two additional ones, obtained by applying small distortions to the initial one. These distortions can be achieved automatically in multiple ways. In the present case, the procedure described in the following four steps was adopted (a code sketch is provided after the list).

  1. Each expert ranking is translated into a scoring corresponding to the average ranks of individual objects. For example, the ranking by e5, i.e., o3≻(o1 ~ o5)≻o2≻o4, is translated into the scores (s) o1 = 2.5, o2 = 4, o3 = 1, o4 = 5, o5 = 2.5 (cf. Fig. 5b).

  2. Next, the score (s) of each object is distorted by adding to it an error (ε) given by a zero-mean random variable, uniformly distributed within the interval [−1, +1], i.e., \(\varepsilon \sim U(-1,+1)\). Translating this into a formula:

$$s^{\prime} = s + \varepsilon ,$$
(6)

where \({s}^{\prime}\) is the resulting distorted score. The above interval of variability is in line with the idea of small (positive or negative) variations of the input rankings.

Table 9 Set of rankings used for sensitivity analysis
  3. Next, the score (\(s^\prime\)) of each object is rounded to the nearest integer, resulting in the new score:

    $$s^{\prime\prime} = round\left( {s^\prime } \right),$$
    (7)

    where round(·) is an operator that rounds a certain score to the nearest integer.

For example, applying the distortion in Eq. 6 to the scores (s) at step one, we get the scoring (\(s^\prime\)): o1 = 2.1, o2 = 4.6, o3 = 1.6, o4 = 4.8, o5 = 2.6; then, applying the rounding in Eq. 7, we get the new scoring (\({s}^{{\prime}{\prime}}\)): o1 = 2, o2 = 5, o3 = 2, o4 = 5, o5 = 3.

  4. Subsequently, the set of \({s}^{{\prime}{\prime}}\) scores is translated into an “additional” ranking, with relationships of strict preference (“≻”) and indifference (“~”), similarly to the transformation from rating to ranking in Fig. 3. Returning to the above example, the \({s}^{{\prime}{\prime}}\) scores at the previous step are transformed into the (additional) ranking (o3 ~ o1)≻o5≻(o2 ~ o4). The procedure was extended to all initial rankings and repeated twice, resulting in the two additional sets of rankings in Table 9(ii), (iii).
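
The four-step distortion can be sketched as follows; the scores are those of the e5 ranking mentioned at step 1, while the random draws (and hence the resulting additional ranking) are of course not those used here.

```python
import numpy as np

rng = np.random.default_rng(0)                  # fixed seed, only to make the sketch reproducible
objects = ["o1", "o2", "o3", "o4", "o5"]
s = np.array([2.5, 4, 1, 5, 2.5])               # step 1: scores of e5 (average ranks)

s_prime = s + rng.uniform(-1, 1, size=s.size)   # step 2: s' = s + eps, eps ~ U(-1, +1)  (Eq. 6)
s_second = np.round(s_prime)                    # step 3: s'' = round(s')                (Eq. 7)

# Step 4: translate s'' back into a tiered ranking (equal scores -> indifference)
tiers = [[o for o, v in zip(objects, s_second) if v == level]
         for level in sorted(set(s_second))]    # lower score = more preferred

print("s''    :", dict(zip(objects, s_second)))
print("ranking:", " ≻ ".join("(" + " ~ ".join(t) + ")" if len(t) > 1 else t[0] for t in tiers))
```

Repeating the procedure for every expert ranking yields an additional set of rankings, to be aggregated as in Sect. 3.3.1.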

For each set (initial and additional), the collective scoring/ranking was determined by applying the BC and BoB aggregation techniques (see results in Table 10). Next, the average dispersion in the rank position of individual objects can be used as a proxy for the robustness of the resulting collective rankings (see Table 11; a minimal sketch of this calculation is given after Table 11). In this specific case, BC provides somewhat more robust results than BoB (i.e., a lower mean standard deviation of 0.44 against 0.60). However, both solutions appear relatively robust (i.e., mean standard deviation lower than 1), therefore no revision of the aggregation techniques seems necessary (cf. feedback loop from block 3.6 in Fig. 2).

Table 10 Rank tables and collective scorings/rankings resulting from sensitivity analysis
Table 11 Results of sensitivity analysis, in terms of mean standard deviation of the objects’ rank positions
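
As a minimal sketch of the robustness proxy, the mean standard deviation of the objects’ rank positions can be computed as follows; the collective rank positions below are hypothetical and not those of Table 10.

```python
import numpy as np

# Hypothetical collective rank positions of the n = 5 objects across the three sets
# (initial + two distorted), for one aggregation technique; rows = sets, columns = objects
collective_ranks = np.array([
    [3, 4, 1, 5, 2],
    [3, 5, 1, 4, 2],
    [2, 4, 1, 5, 3],
])
std_per_object = collective_ranks.std(axis=0, ddof=1)   # sample std of each object's rank position
print("std per object        :", np.round(std_per_object, 2))
print("mean standard deviation:", round(std_per_object.mean(), 2))
```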

4 Conclusion

This paper focused on the ranking-aggregation problem, highlighting its significance due to the variety of potential applications in the field of manufacturing. By adopting a pragmatic approach based on a case study, the paper has elucidated a sequential and iterative operational methodology to address the problem at various levels:

  • Checking the plausibility of expert rankings in terms of concordance, through multivariate and bivariate statistical measures;

  • Guiding the aggregation-technique selection, depending on the desired types of input and output data;

  • Evaluating the consistency and robustness of the resulting collective judgment.

The case study has demonstrated that approaching the problem systematically necessitates multiple iterations and corrections at the aforementioned levels. Notably, the application of the aggregation technique is just one component of the proposed methodology, with various verifications and corrections required prior to the aggregation phase.

This study not only enhances the understanding of the complexity of the ranking-aggregation problem but also provides practical tools to tackle it in a structured and efficient manner. The outcomes hold value for both scientists and practitioners in the manufacturing domain who encounter decision-making challenges related to ranking aggregation. It is worth mentioning that these actors may not have the extensive expertise needed to deal with the ranking-aggregation problem comprehensively; thus, the proposed procedure helps to fill this gap.

The proposed methodology can be considered modular in that it is able to combine several practical tools interchangeably; however, to avoid excessive length, the discussion provided in this paper was limited to exemplifying a few specific tools (e.g., ρ and W as indicators of concordance of experts’ rankings, and p-indicators as measures of consistency between input and output data). Furthermore, the authors acknowledge that the choice of the aggregation technique remains perhaps the most delicate aspect, which was only marginally addressed in this paper. Future research plans include establishing an extensive taxonomy of aggregation techniques and analytical tools to facilitate their selection for specific problems. It is envisaged to create a step-by-step procedure that, based on the problem’s characteristics specified by the user, will guide the selection of appropriate aggregation techniques tailored to the specific case.