1 Introduction

Friction stir welding (FSW) is a modern solid-state welding process that produces high-quality welds by intermixing material in the plastic state, using the frictional heat and pressure generated by a rotating tool. The process was patented by TWI (The Welding Institute) in 1991 [1]. Because the process operates in the solid state, it bypasses the challenges associated with fusion welding of aluminum alloys and produces welds with superior technological properties, making it a sought-after joining process in the aerospace and rolling stock industries [2]. Increased efforts in lightweight construction, as well as the growing demand for electric vehicles, have increased the number of FSW applications in the automotive industry, for example, for the production of battery trays, heat exchangers, and mixed-material joints of aluminum and copper for electrical systems [3, 4]. As production volumes increase, the need for reliable, cost-efficient, non-destructive inline quality monitoring that can keep pace with increasingly frequent changes in the production environment is growing across all fields of FSW application [5].

The FSW process is generally implemented on specialized machines capable of highly automated processing, using closed-loop axial force control to adapt the process. Although this means that sensors indirectly monitoring certain process parameters are already installed, their accuracy and measurement rate vary widely across manufacturers and machine types, and they generally provide insufficient process feedback for high-resolution quality monitoring. Many different approaches that use external sensors to relate weld data to quality control have been published recently, for example, [6,7,8,9,10,11]. Generally, quality monitoring is realized by analyzing the dynamic behavior of the in-plane welding forces or the dynamic variation of axial force or torque. Although these approaches to indirect FSW data monitoring differ, most of them achieve the goal of defect detection. However, all of the given examples are limited in their applicability, as they only demonstrate feasibility for a single joining task (one alloy and one sheet thickness), sometimes even for a single parameter set. This work addresses this limitation by demonstrating the generalization and improved applicability of inline-capable quality monitoring across different alloys and sheet thicknesses. To achieve this goal, deep learning algorithms are used to analyze and categorize the recorded welding data and to identify internal weld defects. Such internal voids are among the most difficult defects to detect in FSW and require inspection by phased-array ultrasound or computer tomography [12].

Recent developments in deep learning algorithms, an improved understanding of machine learning approaches, and easy access to high-performance computing increase the potential benefit to manufacturers of recording and analyzing production process data [13]. The recurrent neural networks (RNNs) and convolutional neural networks (CNNs) used here are deep learning (DL) algorithms that detect and extract low-, mid-, and high-level features during training. This eliminates the need for time-consuming, brittle, and poorly scalable hand-engineered feature extraction. The automated feature detection and versatile architecture allow classification of diverse input data [14].

In this work, two types of artificial neural networks (ANNs) are used for classification. Both were brought back into focus by recent improvements to architectures, solvers, and activation functions [15, 16], as they perform well on computer vision problems (2D data) and time-series analysis.

1.1 Typical reasons for FSW weld defects

Compared to most fusion welding processes, FSW is regarded as a stable and well-controllable process in industrial production environments, allowing for efficient process integration and joining. However, an implemented, steady-state FSW process is still susceptible to a number of external disturbances that can lead to weld seam defects which are not detectable by the most common quality monitoring procedures, e.g., axial force monitoring and visual inspection coupled with selective destructive testing [17]. These influencing factors are commonly divided into equipment-based factors, welding parameter factors, and workpiece influences. The equipment-based factors depend on the welding machine, the fixture, and the welding tools and can be regarded as constant when investigating weld quality [18]. Tool wear is an exception and needs to be closely monitored and included in any analysis, not only in production but also in research [19]. The main welding parameters are often developed empirically and fixed for a given welding task; therefore, they do not cause process disturbances by themselves, but rather interact with the workpiece to cause defect-inducing deviations. The most common variations are gap tolerances, thickness variations, surface condition changes, and tool wear [20]. Gap tolerances, sheet thickness variations, and material strength differences within a production run influence the material volume underneath the welding tool and, consequently, in force-controlled processes, the plunge depth and the transport of plasticised material [20], leading to process instabilities. Tool wear and the surface condition of the workpiece [21] influence the interface condition and material transport through wear and adhesion. Combinations of these influences destabilize the process and can lead to a process state in which the employed quality control cannot detect the defects developing in the weld seam.

Two main mechanisms that cause defects in FSW have been identified. The first, a change in plunge depth, can be readily identified through machine parameters and related to weld quality. It can be caused by various factors and leads either to insufficient plunge depth, which results in incomplete penetration and reduced mechanical properties [22], or to excessive plunge depth. The latter brings the welding tool into close proximity with, or even direct contact with, the backing plate, causing defects ranging from adhesion between workpiece and backing plate to tool failure.

The second cause of defect initiation is more difficult to detect as it is based on irregular material flow within the stir zone. Changes in energy input, surface condition, workpiece strength/thickness, interface condition, or tool wear can negatively influence the cyclical process condition and resulting material transport around the tool and thereby disrupt the weld seam formation [23]. This can lead to local or prolonged internal weld seam defects such as voids, cavities, or surface defects [24]. Therefore, with the current state of the art, additional quality assessment is required for FSW production, which increases complexity, manufacturing time, and costs.

1.2 FSW process monitoring and weld seam quality assessment

The FSW process is often characterized by its distinctive and comparatively high process forces. The forces in all three spatial directions consist of a static and a dynamic component; in steady state, the dynamic component is cyclical, corresponding to the rotational speed of the welding tool fixed to the spindle [25]. While the periodic nature of these forces is widely agreed upon, its driving causes are still not fully understood, especially regarding the in-plane forces, their interdependencies, and their relation to the welding parameters, tools, and workpiece properties. Despite this uncertainty, many researchers have used various approaches to correlate deviations from uniform force or torque oscillation with weld defects, e.g., [6, 10, 11, 26,27,28,29].

The feasibility of empirical, force-based evaluation approaches has been shown in a number of works. These empirical approaches often reduce anomaly detection to simple incremental features or gradual changes along the weld. Jene correlates the mean lateral force (Fy) with weld seam defects [27], while most other works focus on changes in the dynamic components or on combinations of changes in the static and dynamic components. Hattingh et al. [29] monitor forces and torque to determine the field of plasticised material and calculate the material flow and the resulting flow defects. Luhn [10] plots the axial force over spindle torque as well as the forces in the welding plane (Fx and Fy) to identify weld seam features: smooth, uniform plots indicated defect-free welds, while scattered plots indicated weld seam defects. Based on this approach, feed force oscillation (Fx) was identified as an indicator of internal weld defects [11]. It was also shown that a threaded pin greatly increases oscillation in defective welds [11]. The research described above enables process force-based defect detection but also highlights that empirical monitoring approaches must be adapted to each change in the welding conditions, which limits their applicability.

Besides the welding forces themselves, their frequency spectra can be analyzed to detect voids. Gebhard [18] found that large amounts of low-frequency oscillation indicate internal void defects. Frequency-domain data were also used for the first ANN-based defect detection in FSW. Boldsaikhan et al. [30] transformed torque recordings sampled at 51.2 Hz into the frequency domain and used them to train multiple ANNs. A fully connected artificial neural network (FCNN) correctly identified all internal weld defects in their entire test set, which was strongly biased towards defect-free welds. Later work by the same authors showed ANN-based evaluation of wavelet transformations of the in-plane forces [31]. The trained FCNN correctly identified internal defects > 0.08 mm with an accuracy of 95%. The welds were labeled through cross-section analysis.
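The exact frequency criteria of [18] and [30] are not reproduced here. As a hedged illustration of the general idea only, the sketch below computes the share of spectral energy that lies below an assumed cutoff (here half the tool rotation frequency) for two synthetic signals. The function name, the cutoff choice, and the synthetic 2 Hz disturbance are assumptions for illustration, not values from the cited works.

```python
import numpy as np

def low_frequency_energy_fraction(signal, fs, cutoff_hz):
    """Share of the signal's spectral energy (static component removed) below cutoff_hz."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                               # remove the static (DC) force component
    power = np.abs(np.fft.rfft(x)) ** 2            # one-sided power spectrum
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)    # frequency of each spectral bin in Hz
    total = power.sum()
    if total == 0.0:
        return 0.0
    return float(power[freqs < cutoff_hz].sum() / total)

fs = 2500.0                          # Hz, sampling rate used in this work
t = np.arange(int(3 * fs)) / fs      # one 3 s window, exactly 7500 samples
f_rot = 1200 / 60.0                  # 20 Hz tool rotation at 1200 RPM
cutoff = f_rot / 2                   # assumed cutoff: half the rotation frequency

# Synthetic signals: a clean rotation-frequency oscillation, and the same with an added 2 Hz drift.
clean = np.sin(2 * np.pi * f_rot * t)
disturbed = clean + 0.8 * np.sin(2 * np.pi * 2.0 * t)

clean_share = low_frequency_energy_fraction(clean, fs, cutoff_hz=cutoff)
disturbed_share = low_frequency_energy_fraction(disturbed, fs, cutoff_hz=cutoff)
```

The clean signal puts essentially no energy below the cutoff, whereas the disturbed signal places roughly 0.64 / 1.64 ≈ 39% of its energy there. Any real threshold separating OK from NOK welds would have to be determined empirically, which is exactly the adaptation effort the empirical approaches above require.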

The emerging class of convolutional neural networks (CNNs) was used by Hartl et al. [8] to classify surface images as well as weld data. Varying welding parameters were used to produce 120 welds, each of which was split into 17 sections to obtain a data set of 2040 classified weld sections. The highest classification accuracy for internal defect detection was achieved with a CNN based on the AlexNet [32] architecture evaluating the lateral force (Fy). Over several training cycles, an accuracy of 79.2% was achieved.

Many researchers have found different ways to correlate FSW data with weld quality indicators at high accuracy. However, all of the published works are limited with regard to welding tools, workpiece material, and thickness. Furthermore, the examined welding parameter ranges are quite narrow and, owing to low productivity, often not applicable to industrial use. These works do, however, demonstrate the feasibility of using force data to detect internal defects and provide a reliable basis for correlating process influences with force deviations in the weld data. On the other hand, they do not provide a reliable basis for industrial production, as the training sets required for each welding application are large and require extensive effort to produce and prepare correctly.

In this study, the application range of inline defect detection is extended by using deep learning to examine the temporal sequences of FSW process forces and tool torque and to categorize welds with internal defects. A data set is established across a wide range of welding parameters, two different Al alloys, and two sheet thicknesses, each with its respective welding tool. The data are classified and labeled through analysis of micro-focus computer tomography (µ-ct) images. Deep learning algorithms are then trained to extract features and classify the welds based on force and torque measurements.

2 Experimental setup

2.1 FSW and data acquisition setup

All welding experiments were performed on an FSW machine of portal design with a moving table, type 345C, built by Precision Technologies Group (PTG) Ltd. The machine offers high structural rigidity with position-independent compliance and a low vibration excitation response [33]. The workpieces were fixed on a welding table with a form-fitting cutout for the backing plate. The backing plate was made of 8 mm mild steel that was ground and artificially oxidized.

The workpieces measured 135 × 330 mm², allowing for welds of 300 mm length. Workpieces were cut from sheets of AA5754-H22 (1.5 mm and 3 mm thickness) with an ultimate tensile strength (UTS) of 240 MPa and AA7075-T6 (1.5 mm thickness) with a UTS of 545 MPa. All welds were performed as one-dimensional blind welds with full-thickness penetration in order to exclude edge and gap influences on the resulting weld seams. The welding parameters are given in Section 2.2. Monolithic welding tools were used to produce the welds. For all welding trials, the tool tilt angle was fixed at 1.5°. The pin length was adapted to the sheet thickness to achieve full-penetration welds. The tool geometry is shown in Fig. 1, and the relevant dimensions are given in Table 1.

Fig. 1

Tool geometry

Table 1 Tool dimensions

The welding data are recorded by a sensor unit integrated into the tool holder (Spike® mobile). The process forces are recorded in three spatial directions, Fx and Fy (in the welding plane) and Fz, relative to the welding tool. Fx and Fy are derived from the measured bending moment in the welding plane relative to the measuring point; the axial force and torque are measured directly by strain gauges within the tool holder. The tool holder, tool, and measured variables are shown for an exemplary milling process in Fig. 2. The data are recorded at a sampling frequency of 2.5 kHz and transmitted wirelessly to a receiver integrated into the machine control system and process control, which is also connected to a computer for processing, recording, and evaluating the weld data.

Fig. 2

Spike® instrumented tool holder, data receiver, and processing

2.2 Design of experiments

The welding experiments were designed to analyze the main influences on the force response and the resulting deviations in welds containing defects. The main excitation frequency is imposed by the tool rotation, i.e., the spindle speed (RPM), superimposed with radial runout and discontinuous material transport [34]. The amplitude of the oscillation is mainly due to the temperature- and volume-dependent resistance of the material to the stirring action. To examine these influences, a wide range of feed rates and tool rotational speeds was used during the experiments. The parameter combinations can be grouped into three distinct sets. Each set consists of five welding parameter combinations, with set 2 and set 3 each overlapping set 1 in one combination, for a total of 13 different parameter combinations. The parameter combinations are shown in Fig. 3. For set 1, the ratio between spindle speed and feed rate was fixed at two revolutions per millimeter (2 rev/mm). This allows monitoring of the increase in welding forces caused by higher feed rates and of the associated reduction in welding temperature and thermal softening ahead of the welding zone. Sets 2 and 3 are designed around a fixed RPM to keep the oscillation frequency constant while increasing the amplitude by increasing the feed rate.

Fig. 3

Welding parameter combinations

The spindle speed for set 2 was chosen as 1200 RPM, which is common in industrial applications and yields a high number of measurements (125) per tool revolution to accurately map the welding forces. The spindle speed of set 3 was set to 1800 RPM, as its first higher harmonic of 60 Hz equals the lowest natural frequency of the machine [33]; despite pre-tensioning through the axial force, this should slightly increase oscillation and thereby increase the detectable variations in the force signals. The feed rate was varied from 600 to 2000 mm/min for sets 2 and 3. The lower limit was dictated by the minimum feed rate at which welds could be produced without overheating for both Al alloys at the fixed RPM. The upper limit was set by the load limits of the measurement equipment. For set 1, the minimum feed rate could be extended owing to the reduced spindle speed, and the maximum feed rate was limited by the maximum spindle speed.
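The numbers quoted above follow from simple unit conversions, sketched below as a worked check. The function names are hypothetical helpers; only the 2.5 kHz sampling rate, the spindle speeds, and the 2 rev/mm ratio of set 1 come from the text.

```python
def rotation_frequency_hz(spindle_rpm):
    """Tool rotation (fundamental excitation) frequency in Hz."""
    return spindle_rpm / 60.0

def samples_per_revolution(fs_hz, spindle_rpm):
    """Number of force samples recorded per tool revolution."""
    return fs_hz / rotation_frequency_hz(spindle_rpm)

def harmonic_frequencies(spindle_rpm, n=3):
    """Fundamental and higher harmonics of the rotation frequency in Hz."""
    f0 = rotation_frequency_hz(spindle_rpm)
    return [k * f0 for k in range(1, n + 1)]

def set1_spindle_speed(feed_mm_per_min, rev_per_mm=2.0):
    """Set 1 couples spindle speed to feed rate: rev/mm * mm/min = rev/min (RPM)."""
    return rev_per_mm * feed_mm_per_min

FS = 2500.0  # Hz, tool-holder sampling rate

print(samples_per_revolution(FS, 1200))            # 125.0 (set 2)
print(round(samples_per_revolution(FS, 1800), 2))  # 83.33 (set 3)
print(harmonic_frequencies(1800))                  # [30.0, 60.0, 90.0]
```

At 1800 RPM the fundamental is 30 Hz, so the first higher harmonic (the second entry) is the 60 Hz component that coincides with the machine's lowest natural frequency.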

Position control was chosen as the weld control strategy over the more common force control. Position control ensures a reliable plunge depth relative to the machine coordinates and eliminates the influence of varying plunge depth. Furthermore, effects of the closed-loop force controller on the resulting data, such as lag, force deviation, and machine-imprinted oscillation of the z-axis force, are eliminated. This enables the reliable production of both defect-free and defective welds and of the corresponding weld data. For each parameter combination, two different plunge depths were determined in pre-trials, resulting in 78 plunge depth settings for the data set (13 parameter combinations × 2 plunge depths × 3 welding tasks). The first plunge depth per parameter combination was set to produce defect-free welds; the second, reduced plunge depth was chosen to decrease heat generation and forging pressure and thereby produce welds with internal defects while maintaining defect-free weld surfaces.

Each parameter combination and plunge depth was repeated three times for each alloy and plate thickness. Because several parameter combinations overloaded the measuring device, the number of recorded weld data sets was reduced from the theoretical 234 welds to 203.
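The size of the experimental plan can be reproduced from the counts stated in this section. The short bookkeeping sketch below does so; the variable names are illustrative, and grouping the alloy/thickness pairs into three "welding tasks" follows the description in Section 2.1.

```python
n_param_combinations = 3 * 5 - 2   # three sets of five; sets 2 and 3 each share one combination with set 1
n_plunge_depths = 2                # defect-free depth and reduced depth
n_welding_tasks = 3                # AA5754 1.5 mm, AA5754 3 mm, AA7075 1.5 mm
n_repetitions = 3

plunge_depth_settings = n_param_combinations * n_plunge_depths * n_welding_tasks
planned_welds = plunge_depth_settings * n_repetitions
recorded_welds = 203               # welds with usable force/torque recordings

print(n_param_combinations, plunge_depth_settings, planned_welds)  # 13 78 234
print(planned_welds - recorded_welds)                              # 31 welds lost to sensor overload
```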

3 Results and discussion

3.1 Welding results, classification, and data set

The parameter combinations given in Fig. 3 were successfully welded for AA5754-H22 in 1.5 mm and 3 mm sheet thickness and for AA7075-T6 in 1.5 mm. The forces of some combinations exceeded the measurement range of the device, reducing the final number of weld data sets to 203 individual welds. The welds were visually inspected for surface defects; no open surface defects were found. The welds were inspected for internal defects using micro-focus computer tomography (µ-ct) [35]. The micro focus allows magnification of the specimen by positioning the detector in the diverging beam path (Fig. 4) [36]. A duplex wire image quality indicator was used to determine the spatial resolution of the generated images [37]. In accordance with the cited ISO standard, the detection threshold for internal volumetric defects (voids and tunnel defects) was determined to be 0.08 mm orthogonal to the plate surface.

Fig. 4

Micro-focus computer tomography and µ-ct machine Viscom VT 9225

The µ-ct images were post-processed to adjust brightness and contrast and then analyzed to localize and mark internal defects. Defects and discontinuities at the plunge location were disregarded, as were steel particle adhesions from the backing plate in the same area. Internal weld defects were found in 93 of the 203 welds (45.8%), slightly below the targeted 50%. To validate the µ-ct findings, cross sections were taken from selected specimens and analyzed for defects. Figures 5 and 6 show the compounded µ-ct images of two sheets with two welds each, together with selected cross sections of the welds, including the extraction plane and viewing direction. The cross-section analysis agrees well with the µ-ct analysis, which can therefore be regarded as reliable at the identified detection threshold for both sheet thicknesses. Small defects < 0.2 mm wide and between 0.06 and 0.17 mm high are clearly identifiable in Fig. 6. The µ-ct analysis is therefore used to categorize all welds.

Fig. 5

µ-ct of a specimen 1.5 mm AA5754 with different welding parameters including cross sections for validation

Fig. 6

µ-ct of a specimen 3 mm AA5754 with different welding parameters including cross sections for validation

To increase the training set size and enable inline quality control, the welds are split into shorter sections. A new section starts every second and is three seconds (3 s) long, so that each section overlaps the previous and subsequent sections. To obtain the same number of sections from each weld, the maximum welding speed of 2000 mm/min and the weld length were used as the reference, yielding five sections per weld. For welds with lower feed rates, the sections were taken from the end of the weld, excluding the tool exit location. The sections are shown in Fig. 7 together with a weld made at the maximum feed rate of 2000 mm/min.
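The segmentation can be sketched as a sliding window. The sketch assumes a 7 s analysis window per weld (the full-length input length given in Section 3.2) and that tool-exit samples have already been trimmed. The helper names and the random placeholder recording are hypothetical. A 300 mm weld at 2000 mm/min lasts 9 s, which is why the placeholder is 9 s long.

```python
import numpy as np

FS = 2500  # Hz

def analysis_window(data, fs=FS, duration_s=7.0):
    """Keep the last duration_s seconds of a (channels, samples) recording.
    Assumes the tool-exit samples were already trimmed off."""
    n = int(round(duration_s * fs))
    return data[..., -n:]

def sliding_segments(data, fs=FS, window_s=3.0, step_s=1.0):
    """Cut a (channels, samples) array into overlapping windows: (n_windows, channels, window)."""
    win = int(round(window_s * fs))
    step = int(round(step_s * fs))
    starts = range(0, data.shape[-1] - win + 1, step)
    return np.stack([data[..., s:s + win] for s in starts])

# Placeholder recording: 4 channels (Fx, Fy, Fz, torque) over 9 s of random values.
recording = np.random.default_rng(0).normal(size=(4, 9 * FS))
weld = analysis_window(recording)        # (4, 17500)
segments = sliding_segments(weld)        # (5, 4, 7500)
```

A 7 s window with 3 s sections starting every second gives (7 − 3) / 1 + 1 = 5 sections, so 203 welds yield 1015 segments.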

Fig. 7

Weld at 2000 mm/min and data segments for analysis

The welds are categorized into welds with internal defects (NOK) and welds without internal defects (OK). The categorization is made over the entire weld length as well as for the shorter sections, so that training can be performed on both the full-length weld data and the shorter, overlapping segments. The results of the categorization with the determined defect size threshold of 0.08 mm are shown in Table 2 for the entire set as well as for each individual welding task. The categorization resulted in eight data sets of different sizes and complexities, used for training, comparison, algorithm generalization, and validation.

Table 2 Classification of welds and weld segments

3.2 Modeling, training, and testing of the neural networks

The welding data were used to train deep neural networks to detect characteristic features and classify the welding force response data in order to identify internal weld seam defects. DL networks were chosen for their ability to detect high-level features that go beyond the manually identified and implemented gradients, thresholds, or frequency-based empirical features used for quality control. Deep learning methods replace manual features with multiple layers that weight the inputs (linear combinations of selected inputs), sum the weighted inputs, and apply non-linear activation functions to generate an output [14]. In this work, the classification ability of two types of DL networks was examined. Their differing architectures change the way the data are learned and features are generated.
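The weight–sum–activate step described above can be written in a few lines. The following minimal NumPy sketch uses arbitrary example weights rather than values from the trained networks. It shows one layer and then stacks a second layer on its output, which is what "multiple layers" means in practice.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dense_layer(x, weights, bias, activation=relu):
    """One fully connected layer: each output unit forms a weighted sum of all inputs,
    adds a bias, and passes the result through a non-linear activation."""
    z = weights @ x + bias
    return activation(z)

x = np.array([1.0, -2.0, 0.5])            # three input values
W = np.array([[0.2, 0.1, 0.0],            # weights of output unit 1
              [0.0, -0.3, 0.4]])          # weights of output unit 2
b = np.array([0.0, 0.1])
hidden = dense_layer(x, W, b)             # approx. [0.0, 0.9]
deeper = dense_layer(hidden, np.array([[1.0, -1.0]]), np.array([0.5]), activation=np.tanh)
```

In a trained network, the entries of `W` and `b` are learned from the labeled weld data instead of being set by hand.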

The recorded welding force and torque data were used to train the networks by supervised learning with the categorization described above. The weld data can also be visualized to validate data quality and to determine simple features. Fig. 8 shows an example of the recorded data on which the ML is based. The axial force (Fz), the in-plane forces (feed force Fx and lateral force Fy), and the torque are shown on their respectively scaled axes with their static and dynamic components. As described, an equal length of weld data towards the end of the weld is used in training, shown as “data for analysis.”

Fig. 8

Recorded data of (left) defect free weld at 600 mm/min and (right) weld with internal defects at 2000 mm/min

The first investigated network type is the convolutional neural network (CNN). Features within the 2D data are detected by banks of convolutional filters of various sizes that slide along the input. Because each filter applies the same shared weights at every position, its response is translation equivariant; the resulting outputs are known as feature maps [38]. This helps to recognize and categorize patterns across the input data. The network architecture directly influences the complexity of the features and patterns a CNN can identify. The detectable feature complexity increases with the number of filter layers and with the size and number of filters, as each additional convolutional layer allows the detection of more complex features. Figure 9 shows a representation of one very simple filter (size 3 × 3) with a stride of 1 and another typical layer type, a pooling layer [39, 40].
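A 3 × 3 filter with stride 1 followed by pooling can be sketched in plain NumPy. This is a teaching sketch, not the network used in this work. As in most deep learning libraries, the "convolution" is implemented as cross-correlation without padding, and 2 × 2 max pooling is assumed. The last line demonstrates the translation equivariance mentioned above: shifting the input by one column shifts the feature map by one column.

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Slide one shared-weight filter over the image (cross-correlation, no padding)."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # the same weights are used at every position
    return out

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.zeros((6, 6))
image[2:5, 2:5] = 1.0                 # a small bright square
kernel = np.ones((3, 3)) / 9.0        # simple averaging filter
fmap = conv2d_valid(image, kernel)    # 4 x 4 feature map
pooled = max_pool2d(fmap)             # 2 x 2 after pooling

# Shift the input one column to the right and compare the feature maps.
shifted_fmap = conv2d_valid(np.roll(image, 1, axis=1), kernel)
```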

Fig. 9

Graphic representation of convolutional filter and pooling layer [40]

For the CNN application, the recorded data (Fx, Fy, Fz, and torque) of each segment were reshaped into a one-dimensional pixel string, in which each recorded data point value represents the grayscale value of a single pixel. This reduces computational time, as it combines four separate input streams into one stream of four times the length. The result is a 70,000 × 1 data series for the full weld length or 30,000 × 1 for segments (7 s or 3 s × 2.5 kHz × 4 data streams); the binary classification was used as the output. Based on previous works [26], an architecture based on AlexNet [41] was chosen to maximize feature complexity and detection accuracy. The original architecture was adapted by replacing the rectified linear unit (ReLU) activation functions with exponential linear units (ELUs), which allow negative values and thereby push mean unit activations closer to zero. This enables faster learning and better generalization in networks with more than five layers [16]. Pushing mean activations toward zero also gives ELUs an effect similar to batch normalization at lower computational cost [16]. This is important for large data sets and graphics processing unit (GPU) acceleration, as video memory is limited and smaller batch sizes must be used on consumer hardware. A block diagram of the resulting layer architecture is shown in Fig. 10. The filter numbers, sizes, and strides of the network were optimized over the training cycles to maximize test-set classification accuracy.
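The input reshaping and the ELU activation can be sketched as follows. The channel order Fx, Fy, Fz, torque in the stacked string is an assumption, as are the helper names. The ELU follows its standard definition (x for x > 0, α(eˣ − 1) otherwise), here with α = 1.

```python
import numpy as np

FS = 2500  # Hz
CHANNELS = ("Fx", "Fy", "Fz", "torque")   # assumed stacking order

def to_pixel_string(signals):
    """Concatenate the per-channel signals end to end into a single (4 * samples, 1) column,
    one value per 'pixel'."""
    return np.concatenate([np.asarray(signals[name], dtype=float) for name in CHANNELS]).reshape(-1, 1)

def elu(x, alpha=1.0):
    """Exponential linear unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))  # clip avoids overflow warnings

segment = {name: np.zeros(3 * FS) for name in CHANNELS}   # 3 s segment, placeholder values
full = {name: np.zeros(7 * FS) for name in CHANNELS}      # 7 s full-length window
print(to_pixel_string(segment).shape)  # (30000, 1)
print(to_pixel_string(full).shape)     # (70000, 1)
```

Unlike ReLU, `elu` returns small negative values for negative inputs, for example about −0.63 for an input of −1. This is the property that pulls mean activations toward zero.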

Fig. 10

Architecture of adapted AlexNet CNN network

The second network type used was the recurrent neural network (RNN), designed to learn long- and short-term dependencies in time-series data. To do this, the connections between the nodes form a directed graph along the temporal sequence, which gives the network weighted, dynamic temporal behavior [42].

Figure 11 shows several possible feedback connections. Further information on architecture and operating principles can be found in [42, 43].

Fig. 11

RNN with two inputs one output and direct (blue), indirect (green), and sideways (red) feedback

The RNNs were trained with four input neurons (Fx, Fy, Fz, and torque) and the binary classification as the output. The RNNs contained a bidirectional long short-term memory (BiLSTM) layer, which is equivalent to two long short-term memory layers, one processing the sequence forward and the other backward. The number of hidden units in the BiLSTM layer was varied to find the best trade-off between classification accuracy and overfitting prevention. The BiLSTM layer was followed by one fully connected layer with two outputs, matching the number of output classes, a softmax layer, and finally the classification layer.
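The layer sequence described above (BiLSTM, then a fully connected layer with two outputs, then softmax) can be illustrated with a minimal, untrained NumPy forward pass. The study does not state which software framework was used. This sketch makes several assumptions of its own: the gate ordering, the random initialization, and the choice of feeding the final hidden state of each direction into the fully connected layer. The random input is a placeholder shortened to 750 time steps for speed; a real 3 s segment has 7500.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm(n_in, n_hidden):
    """Random weights for one LSTM direction; gates stacked as [input, forget, cell, output]."""
    scale = 1.0 / np.sqrt(n_hidden)
    return {"W": rng.normal(0, scale, (4 * n_hidden, n_in)),
            "U": rng.normal(0, scale, (4 * n_hidden, n_hidden)),
            "b": np.zeros(4 * n_hidden)}

def lstm_last_state(x_seq, p):
    """Run an LSTM over x_seq with shape (time, features) and return the final hidden state."""
    n_hidden = p["U"].shape[1]
    h, c = np.zeros(n_hidden), np.zeros(n_hidden)
    for x_t in x_seq:
        i, f, g, o = np.split(p["W"] @ x_t + p["U"] @ h + p["b"], 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # update the cell memory
        h = sigmoid(o) * np.tanh(c)                    # expose the gated hidden state
    return h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bilstm_classify(x_seq, fwd, bwd, W_fc, b_fc):
    h_fwd = lstm_last_state(x_seq, fwd)          # reads the sequence forward
    h_bwd = lstm_last_state(x_seq[::-1], bwd)    # reads the sequence backward
    features = np.concatenate([h_fwd, h_bwd])
    return softmax(W_fc @ features + b_fc)       # fully connected layer (2 outputs) + softmax

n_hidden = 8                                     # illustrative; the study used 200 to 525
fwd, bwd = init_lstm(4, n_hidden), init_lstm(4, n_hidden)
W_fc, b_fc = rng.normal(0, 0.1, (2, 2 * n_hidden)), np.zeros(2)
segment = rng.normal(size=(750, 4))              # placeholder: (time, [Fx, Fy, Fz, torque])
probs = bilstm_classify(segment, fwd, bwd, W_fc, b_fc)
label = ["OK", "NOK"][int(np.argmax(probs))]
```

With random weights, the predicted label is meaningless. Training (for example with ADAM, as in this study) would adjust `W`, `U`, `b`, `W_fc`, and `b_fc`.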

For each training setup (each iteration of CNN and RNN), the dataset was randomly divided into 80% training data, 10% validation data, and 10% test data. The training data are used to learn the features and calculate the weights and biases of the network. These weights and biases are checked and reset during training using the validation data at specified intervals. The fully trained network is then applied to the test set, whose data were not used during training, to determine the classification accuracy of the generated network. The resulting classification accuracies depend on the random distribution of the data among the sets as well as on the weight initialization. Therefore, each setup is repeated three times (n = 3) to validate the results and to prevent bias due to unbalanced data sets (OK/NOK) and outliers due to initialization. Furthermore, the order of the data sets was shuffled before each training cycle. The ADAM optimizer [44] was used at different constant learning rates to train both the RNNs and the CNNs.
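A random 80/10/10 split repeated three times can be sketched as follows. The seeds and the rounding convention are assumptions, so the exact set sizes used in the study may differ by one or two samples.

```python
import numpy as np

def split_80_10_10(n_samples, seed):
    """Shuffle sample indices and split them 80/10/10 into train/validation/test."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(round(0.8 * n_samples))
    n_val = int(round(0.1 * n_samples))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

n_welds = 203
# n = 3 repetitions, each with a fresh random allocation of welds to the three sets.
splits = [split_80_10_10(n_welds, seed) for seed in (0, 1, 2)]
for train, val, test in splits:
    print(len(train), len(val), len(test))
```

Because each repetition draws a new permutation, a given weld may fall into the test set in one repetition and the training set in another. The repeated runs average over these allocations.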

3.3 Comparison of neural network architecture classification accuracy, generalization ability, and computational requirements

To analyze the gathered data and categorize the welds according to high-level features and short-time dependencies, CNNs based on the AlexNet architecture were used. The layered architecture of convolutional filters and grouped convolutions allows the detection of multiple features in parallel (grouped convolutions) and of complex features (multiple layers). The filter counts and sizes were reduced for computational ease, as the number of recognizable features and resulting classes in this work is far lower than in the network's original design purpose. All data (Fx, Fy, Fz, and torque) were reshaped into a one-dimensional string of values by stacking the individual inputs. The string length corresponds to the duration and measurement frequency of the four stacked inputs (i.e., 70,000 × 1 and 30,000 × 1 data points). For each data set or subset, the input layer was adjusted accordingly, and various setups of filter size, number, and stride were tested. Special focus was given to the relationship between the initial filter size and stride and the number of data points per revolution at the two main excitation frequencies (measurement frequency/spindle speed). Table 3 shows the test-set classification accuracies achieved for the best filter setups. It covers network training with the full set, for both full-length and shortened, overlapping segment data (Fig. 7), as well as individual training with segments for the different welding tasks, separated by Al alloy and sheet thickness. The resulting categorization accuracy of the CNN is satisfactory for the full-set, full-length data as well as for the individual data sets for both thicknesses of AA5754, exceeding 81% (49 of 60) on the full data set and reaching ~ 90% for the AA5754 subsets.
In contrast, the results for the segments of the entire set are significantly worse: despite the increased number of observations and the overlapping data, only 75.49% of the test data were classified correctly, and only 74.80% of the AA7075 test-set data were classified correctly. No overfitting occurred during training owing to the chosen architecture, and the test-set classification accuracy matched the training accuracy well. No clear tendency toward false positives or false negatives could be identified, as their prevalence shifted with each iteration and test-set allocation. Regarding the joining task (alloy and thickness), the results of the full-set training reflect those of the individual training, with worse performance on data from AA7075 welds and better performance on data from AA5754 welds. Overall, the CNN can generalize across sheet thickness and alloy but loses classification accuracy, especially when categorizing shortened segments. Individual networks with adjusted filter configurations outperformed the generalized network at the cost of computational time for training.

Table 3 Classification accuracy of the respective test data sets for the DNNs for the full training set and subsets

The second network type was the RNN with a trainable bidirectional long short-term memory (BiLSTM) layer. The BiLSTM RNN was used for its main advantage, learning long-term dependencies from time-series data. The three spatial forces and the tool torque (Fx, Fy, Fz, and torque) were used as parallel input sequences (4 × 17,500 data points for full length and 4 × 7500 for segments). The number of hidden units in the layer was adapted for each training subset to optimize classification accuracy without overfitting to the training data. Separate setups with multiple (2–3) BiLSTM layers and interjected dropout layers to prevent overfitting were investigated but did not reliably improve accuracy while significantly increasing computational time. Unlike in previous works based on different data sets [26], the BiLSTM RNNs perform better than the CNNs for the full data set as well as across all subsets, presumably benefitting from the increase in data set size (144 to 1015 segments). The benefit of the increased data can be seen when comparing the full-set accuracy for segments with that for full-length data: the difference is only 1.44 percentage points (88.56% for segments versus 90.00% for full-length data; see Table 3). The RNN shows a high level of generalization ability, only slightly outperformed by the CNN in cross-alloy and cross-thickness feature detection, while performing at a significantly higher accuracy level. When weighted by the number of test-set samples, the average classification accuracy of the individual subsets increases by 2.81% for the RNN, compared to 2.32% for the CNN. Due to the varying length of the data used and the different numbers of observations in each data subset, the number of hidden units in the trainable layer varied between 200 (205 segments of 3 mm AA5754) and 525 (full set, full-length training). Figure 12 shows an example of the 3 mm AA5754 training: a quick initial convergence is followed by asymptotic convergence of the training accuracy over the full training cycle.
Due to limited GPU memory, the training batch size had to be small (here, 33 weld segments) and therefore shows periodically varying accuracy.
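Cutting a full-length recording into fixed-length segments can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' procedure: the channel order (Fx, Fy, Fz, torque) and the 7,500-sample window come from the text, while the function name `segment_recording`, the non-overlapping default stride, and the dropping of trailing samples are hypothetical choices.

```python
from typing import List, Sequence

# One segment: a list of channels (e.g. Fx, Fy, Fz, torque), each a list of samples.
Segment = List[List[float]]


def segment_recording(
    recording: Sequence[Sequence[float]],
    window: int = 7500,
    stride: int = 7500,
) -> List[Segment]:
    """Split a channels x samples recording into fixed-length windows.

    window: samples per segment (7,500 per the text).
    stride: hop between window starts. ASSUMPTION: the default of one full
            window (no overlap) is hypothetical; the text does not say how
            segments were placed within a weld.
    Trailing samples that do not fill a complete window are dropped.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    n_samples = len(recording[0])
    if any(len(channel) != n_samples for channel in recording):
        raise ValueError("all channels must have the same number of samples")
    segments: List[Segment] = []
    # Each start index yields one window that lies fully inside the recording.
    for start in range(0, n_samples - window + 1, stride):
        segments.append([list(channel[start:start + window]) for channel in recording])
    return segments


if __name__ == "__main__":
    # Synthetic 4-channel recording with 17,500 samples per channel.
    rec = [[float(c * 100_000 + i) for i in range(17_500)] for c in range(4)]
    print(len(segment_recording(rec)))               # 2 non-overlapping segments
    print(len(segment_recording(rec, stride=2500)))  # 5 overlapping segments
```

With the default stride, a 4 × 17,500 recording yields two 4 × 7,500 segments; with a hypothetical stride of 2,500 samples the windows overlap and five segments result. Overlapping windows add training samples per weld but make neighboring segments correlated, so it is generally advisable to keep all segments of one weld in the same training or test split.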

Fig. 12

RNN BiLSTM training iterations of AA5754 3 mm data

As with the CNNs, no clear tendency toward false positives or false negatives was found for the test sets over the iterations. When trained on individual subsets, the AA5754 1.5 mm data yielded the best classification network, exceeding 96% accuracy over three iterations. The data from the AA7075 welds again proved the most difficult to classify, with a test-set accuracy of 88.62%.
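Checking whether a network's errors lean toward false positives or false negatives amounts to tallying the confusion-matrix entries of its test predictions, for example after each training iteration. A minimal sketch, assuming the hypothetical label encoding 1 = weld with internal defect and 0 = defect-free weld; the function names are illustrative:

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Tally binary outcomes; ASSUMED encoding: 1 = internal defect, 0 = defect-free."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth not in (0, 1) or pred not in (0, 1):
            raise ValueError("labels must be 0 or 1")
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        elif pred == 1:
            counts["fp"] += 1  # defect-free weld flagged as defective
        else:
            counts["fn"] += 1  # defective weld passed as defect-free
    return counts


def accuracy(counts: Dict[str, int]) -> float:
    """Share of correctly classified samples."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else float("nan")


if __name__ == "__main__":
    c = confusion_counts([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
    print(c)  # {'tp': 2, 'tn': 2, 'fp': 1, 'fn': 1}
    print(accuracy(c))
```

Tracking `fp` against `fn` over the iterations would reveal a systematic bias in either direction; for the networks discussed here, the text reports that no such tendency was found.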

When DL networks based on the AlexNet CNN architecture and BiLSTM RNNs are trained, the RNNs substantially outperform the CNNs in classification accuracy for the investigated filter and layer setups. When generalizing over the data set including different alloys and sheet thicknesses, the RNNs correctly classify 8.33 percentage points more of the test data, increasing to 13.07 percentage points when classifying 3 s long segments of the data set. The same trend appears when each alloy and thickness is evaluated individually, with a classification difference of 8.82 percentage points. In light of previous results and studies by other authors, it can be expected that the CNN could be further adapted to the data set to deliver higher accuracies, rivaling the RNN classification performance.
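The percentage-point figures can be cross-checked with a short calculation. Assuming that the 2.81 and 2.32 points reported for the RNN and the CNN are the gains of the sample-weighted subset average over the respective full-set accuracy, measured on the same full-length data as the 8.33-point gap, the subset difference follows directly. Here $a_i$ denotes the test accuracy on subset $i$ and $n_i$ its number of test samples:

```latex
\begin{aligned}
\bar{a} &= \frac{\sum_i n_i\, a_i}{\sum_i n_i}, \qquad
\Delta = \bar{a} - a_{\text{full}} \\
\bar{a}_{\text{RNN}} - \bar{a}_{\text{CNN}}
  &= \left(a_{\text{full,RNN}} - a_{\text{full,CNN}}\right) + \Delta_{\text{RNN}} - \Delta_{\text{CNN}} \\
  &= 8.33 + 2.81 - 2.32 = 8.82 \ \text{percentage points}
\end{aligned}
```

The result matches the reported 8.82-point difference for the individually evaluated alloys and thicknesses, which supports this reading of the values.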

The higher classification accuracy of the RNNs was accompanied by lower computational requirements. The complex architecture of the CNN leads to large data volumes during training: depending on batch size and filter arrangement, the network and training data exceeded 80 gigabytes (GB). The filter setup strongly influenced the training duration and the number of iterations needed for convergence. On a high-performance compute cluster, network training took between 12 and 36 h using two Intel Xeon Gold 6258R processors (28 cores each, 2.7 GHz base clock) at 50% load, with enough volatile memory (RAM) to hold all required data. The resulting classification networks are about 2.7 GB in size and, once loaded, classify weld data in < 0.02 s on a mobile computer (Intel i5 6200U dual core, 2.3 GHz, 15 W TDP, RAM > network size). The smaller size of the RNN during training allowed GPU acceleration of the matrix computations. With a consumer GPU (Nvidia RTX 3090, 24 GB), training time was reduced to 4–12 h depending on input data size and the number of hidden units in the BiLSTM layer. The resulting classification networks are ~5 MB in size and classify weld data in < 0.01 s on a mobile computer.
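One reason the trained RNNs stay small is that the number of trainable parameters of a BiLSTM layer depends on the number of input channels and hidden units but not on the sequence length, so the same layer processes 17,500-sample and 7,500-sample inputs. The sketch below counts these parameters for the four input channels and the hidden-unit range reported above. It assumes one bias vector per gate (the convention of, e.g., MATLAB's `bilstmLayer` and Keras' `LSTM`; PyTorch's `nn.LSTM` adds a second bias vector) and a hypothetical two-class fully connected output layer. The stored file size additionally depends on numeric precision and file format.

```python
def lstm_params(input_size: int, hidden: int) -> int:
    """Trainable parameters of one unidirectional LSTM layer.

    Four gates, each with input weights (hidden x input_size), recurrent
    weights (hidden x hidden) and one bias vector (hidden). ASSUMPTION:
    single bias per gate; PyTorch's nn.LSTM would add another 4 * hidden.
    """
    return 4 * (hidden * input_size + hidden * hidden + hidden)


def bilstm_classifier_params(input_size: int, hidden: int, n_classes: int = 2) -> int:
    """BiLSTM layer plus a HYPOTHETICAL fully connected output layer.

    The forward and backward LSTMs are independent, so the BiLSTM has twice
    the parameters of one LSTM; their concatenated outputs (2 * hidden values)
    feed the dense layer. The sequence length does not appear anywhere.
    """
    bilstm = 2 * lstm_params(input_size, hidden)
    dense = 2 * hidden * n_classes + n_classes  # weights + biases
    return bilstm + dense


if __name__ == "__main__":
    for hidden_units in (200, 525):  # hidden-unit range reported in the text
        print(hidden_units, bilstm_classifier_params(input_size=4, hidden=hidden_units))
```

Under these assumptions, 200 hidden units give 328,802 parameters and 525 hidden units give 2,228,102, regardless of whether full-length or segmented inputs are used.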

4 Conclusion

For this work, 203 welds with different welding parameters, Al alloys, and sheet thicknesses were produced. Based on µ-CT images, the welds were classified into defect-free welds and welds containing internal defects > 0.08 mm. The recorded torque values and welding forces in three spatial directions relative to the welding tool were used to train two types of DL networks, BiLSTM RNNs and CNNs. The networks were evaluated with respect to their classification accuracy and generalization ability for longer and shorter weld seam segments.

  • It is possible for various DL architectures to identify classifying features in weld data without the need for data pre-processing and explicit feature generation.

  • The force feedback of physical phenomena and process disturbances during FSW can be used to classify welds with internal defects without explicit feature identification.

  • The investigated DL architectures were able to classify welds from the full data set as well as from subsets based on force and torque recordings.

  • The investigated DL architectures were able to generalize feature recognition and classification across a data set comprising multiple Al alloys and sheet thicknesses.

  • For both RNN and CNN, the classification performance dropped measurably when generalizing across different alloys and sheet thicknesses.

  • The classification of segmented welds provides a base for online quality monitoring based on DL of FSW force data.

  • RNNs outperformed CNNs in classifying test data of the entire set as well as of all investigated subsets. The generalization ability of CNNs exceeded that of RNNs in testing.

  • RNNs achieved classification accuracies of 90% on the entire test set and up to 96% on alloy and thickness specific subsets.

This work demonstrated the viability of using DL networks to identify internal weld seam defects in FSW as a means of inline quality control. For the established data set and the µ-CT-based classification, RNNs proved better suited than CNNs for training and detecting internal weld seam defects.

Furthermore, the reduction in classification accuracy that both investigated network types showed when generalizing suggests that supplemental information integrated into the data pre-processing or training stages is expected to improve classification accuracy. This supplemental data should relate to the physical properties of the weld material and its behavior under processing conditions. It would allow the recorded data to be normalized during training so that features signifying weld defects become more comparable for the network across alloys and sheet thicknesses.

As a next step, the use of reinforcement learning to classify further alloys and sheet thicknesses based on the features learned by the trained ANNs will be evaluated.