1 Introduction

An increasing number of consumers now purchase fashion products from online platforms. Forrester Research predicts that the fashion e-commerce market will amount to USD 911 billion, about 36% of global fashion retail sales, by 2022, making fashion the world's largest e-commerce market [1]. To achieve a competitive advantage, fashion e-commerce companies are continuously seeking out innovative technologies.

On e-commerce websites, users aim to find fashion products that satisfy their preferences, but they often cannot easily locate specific items because these sites list an enormous number of products. E-commerce websites therefore offer a variety of technologies to help users find items of interest, and recommendation systems are one of them [2, 3]. In an e-commerce environment, a recommendation system typically learns a recommendation model from user behaviors, such as purchasing and clicking on items, and then proposes to each user the items predicted to be most preferred. Collaborative filtering-based recommenders (CFRs) and content-based recommenders (CBRs) are the most conventional recommendation approaches [25]. CBRs, in particular, include attributes of fashion items in creating recommendations [25, 26]. This use of item attributes implies that CFRs and CBRs are better suited than other recommendation systems for incorporating fashion attributes into recommendation models.

When browsing a web page, users usually select fashion products one after another, and each selection implicitly reveals a preference for the selected product. 'Coordinates', i.e., items worn together as an outfit, are one of the major concepts used in recommendation systems for fashion items [15, 22, 23, 27]. Together, these observations imply that user preference activity contains significant sequential patterns. To reflect the sequential behavioral patterns in which users interact with fashion products on a website, we propose recommenders that exploit these patterns when making recommendations.

However, neither CFRs nor CBRs capture the sequential patterns of items and attributes created by user behavior. Some temporal recommenders perform better than conventional CFRs and CBRs because they include a time factor when building recommendation models [28]. A temporal recommender aims to provide recommendations to users at a suitable time and thus uses time in making the final decision to obtain accurate predictions [28]. Early temporal recommenders considered the time factor only to weight the importance of user interactions with items, whereas much recent research has focused on directly modeling the sequential structure of those interactions [29]. This is discussed in more detail in the next section.

2.2 SAR with RNN

Recently, deep learning has been adopted to extend SARs owing to its significant success in sequence modeling [30,31,32]. Bernhardsson [33] uses an RNN to predict the next track a user may want to play on a music streaming site. Hidasi et al. [14] propose a SAR based on the Gated Recurrent Unit (GRU). The GRU addresses the vanishing gradient problem by controlling the internal state of the RNN with small neural networks called gates, which learn when and by how much to update the hidden state. Devooght and Bersini [13, 34] and Wu et al. [35] adopt a unidirectional, single-layered Long Short-Term Memory (LSTM) network to develop a SAR. They demonstrate that the proposed SAR outperforms other recommendation methods, such as k-NN and MCR, and excels in short-term prediction and item coverage. Table 1 summarizes the basic ideas and characteristics of the RNN, LSTM, and GRU used in the development of SARs.

Table 1 Comparison of RNN, LSTM and GRU

Tan et al. [39] improve SARs by considering data augmentation and temporal shifts in user behavior. Jannach and Ludewig [40] combine the GRU4REC proposed by Hidasi et al. [14] with a k-NN based CFR and demonstrate that this hybrid approach outperforms other recommenders. These studies support the conclusion that SARs outperform conventional recommenders. We adopt the GRU4REC of Hidasi et al. [14] because it addresses the vanishing gradient problem and is therefore suitable for developing SARs for fashion products, whose click histories usually form long sequences.

2.3 SAR with multiple sequence datasets

If several types of sequence datasets can be obtained from the problem domain, it is desirable to combine them when developing recommenders. Based on this intuition, several studies have attempted to improve SARs using multiple sequence datasets; we refer to such recommenders as MSARs. Zhang et al. [41] use an RNN for sequential click prediction in the sponsored search of online search engines. They construct datasets from advertising features (e.g., advertising identifier, location of the advertising display, and advertising text related to the query), user features (e.g., user identifier and query), and sequential features (e.g., time interval since the last impression). Liu et al. [42] propose an MSAR that combines heterogeneous attributes with item sequence data; they introduce a hierarchical attribute combination method to handle text attributes of varying lengths, and an architecture named Heterogeneous Attribute RNN, which learns recommendation models from item and attribute sequence datasets. Smirnova and Vasile [43] suggest a context sequence modeling method for MSARs that combines context embeddings and item embeddings by parameterizing hidden unit transitions as a function of context information, and they show that this method improves prediction of the next event. Wu et al. [44] propose an MSAR that combines ratings, text reviews, and temporal patterns, using an IMDb dataset to show that their model performs better than conventional methods. Hidasi et al. [45] combine image and text sequence datasets with item sequence datasets, proposing two approaches: feature concatenation and parallel processing. The feature concatenation approach concatenates the different sequence datasets and then uses them to train a single MSAR model, whereas the parallel processing approach trains an MSAR on each dataset individually and combines the models using weighted aggregation.

Fashion products have various attributes, and previous studies have utilized these attributes to develop CBRs. This study aims to reflect the attributes of fashion products in the development of MSARs. We generate multiple attribute sequence datasets from the fashion item click history and, to develop multiple-sequence MSARs for fashion products, build on the concepts proposed by Hidasi et al. [45].

3 Method

3.1 Data preparation

The sequence datasets used in our recommenders are generated from a Korean e-commerce website, referred to as Store K in this paper. Store K sells various fashion products manufactured by a leading Korean fashion manufacturer. The sequence datasets are created as follows:

First, we collected log data from the operational database that records which fashion products users click on at Store K. The log data consist of a user identifier, a product identifier, and a click time. An item session dataset is created from this log data, where a session consists of the products that one user clicks sequentially during a day. Within a session, the product click logs are sorted in ascending order of click time. Because sequence data require one input and one corresponding output, a session must have at least two click records. Session lengths vary because the number of clicks per day differs across users and, even for the same user, from day to day.

Second, attribute session datasets are created from the item session data. Because each product has attributes such as brand, season, category, price, and product planning group, a session dataset for each attribute is generated by arranging the corresponding attribute values of the products in the product session dataset. In this way, one session dataset is generated for each product attribute considered in this study.

Finally, session-parallel mini-batches are generated from the session datasets following the method proposed by Hidasi et al. [45]. Figure 1 shows, as an example, how session-parallel mini-batches are generated from five sessions. The first elements of the first \(N\) sessions, where \(N\) is the mini-batch size, form the input of the first mini-batch, and the corresponding next elements form its output. In the example of Fig. 1, the mini-batch size is 3, so the first mini-batch is generated from the first elements of \({Session}_{1}\), \({Session}_{2}\), and \({Session}_{3}\): the inputs are \({\text{a}}_{\text{1,1}}\), \({\text{a}}_{\text{2,1}}\), and \({\text{a}}_{\text{3,1}}\), and the outputs are \({\text{a}}_{\text{1,2}}\), \({\text{a}}_{\text{2,2}}\), and \({\text{a}}_{\text{3,2}}\). The next mini-batch is formed from the next elements of each session in the previous mini-batch. If a session has no next element, the next available session takes its place. For example, when creating mini-batch 3 in Fig. 1, \({\text{a}}_{\text{2,3}}\) has no next element to serve as the output, so the next session, \({Session}_{4}\), comes in instead. The sessions are assumed to be independent, and the corresponding hidden states are therefore reset when such a switch occurs. In this way, mini-batches are generated from both the product sequence and the attribute sequence datasets. We use attribute session data because we expect that combining product and attribute session data will improve the recommender's performance; in particular, attribute session data allow the recommender to address the cold-start problem.

Fig. 1 Session parallel mini-batch generation method
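The session-switching logic described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: sessions are plain lists of element identifiers, the function name is our own, and hidden-state resets (which a full SAR would perform at each switch) are only noted in a comment.

```python
# A minimal sketch of session-parallel mini-batch generation.
# Sessions are lists of element IDs (items or attribute values).

def session_parallel_minibatches(sessions, batch_size):
    """Yield (inputs, outputs) mini-batches over sessions in parallel.

    Each of the `batch_size` slots walks one session; when a session
    is exhausted, the next unused session takes over that slot (and,
    in a full SAR, the corresponding hidden state would be reset).
    """
    slots = list(range(batch_size))   # which session each slot is reading
    cursors = [0] * batch_size        # position within each slot's session
    next_session = batch_size         # next session waiting to be assigned

    while True:
        inputs, outputs = [], []
        for s in range(batch_size):
            sess = sessions[slots[s]]
            # If this session has no next element, switch to a new session.
            while cursors[s] + 1 >= len(sess):
                if next_session >= len(sessions):
                    return            # no sessions left: stop generating
                slots[s] = next_session
                cursors[s] = 0
                next_session += 1
                sess = sessions[slots[s]]
            inputs.append(sess[cursors[s]])
            outputs.append(sess[cursors[s] + 1])
            cursors[s] += 1
        yield inputs, outputs

sessions = [
    ["a11", "a12", "a13", "a14"],
    ["a21", "a22", "a23"],
    ["a31", "a32"],
    ["a41", "a42", "a43"],
    ["a51", "a52"],
]
batches = list(session_parallel_minibatches(sessions, 3))
```

With a mini-batch size of 3, the first batch pairs the first elements of the first three sessions with their successors; when a session runs out, the next session fills its slot, mirroring the behavior around mini-batch 3 in the text.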

3.2 GRU

This research uses a special RNN variant called the GRU [46]. The GRU uses update and reset gates to address the vanishing gradient problem of the standard RNN; these two gates decide what information should be passed to the output.

The update gate \({\text{z}}_{t}\) for time step \(t\) is calculated by:

$${\text{z}}_{t} = ~\sigma \left( {{\text{W}}_{z} x_{t} + {\text{U}}_{z} {\text{h}}_{{t - 1}} } \right)$$
(1)

where \(\sigma\) is a smooth, bounded function such as the logistic sigmoid, \({x}_{t}\) is the input of the unit at time step \(t\), \({W}_{z}\) is the weight matrix for \({x}_{t}\), \({h}_{t-1}\) is the activation from the previous time step, and \({U}_{z}\) is the weight matrix for \({h}_{t-1}\). The update gate helps the model determine how much of the information from previous time steps should be passed along to future steps; because the sigmoid outputs values between 0 and 1, it acts as a soft switch, which is why it is used as the activation function of the update gate.

The reset gate \({r}_{t}\) for time step \(t\) is calculated by:

$${\text{r}}_{t} = \sigma ({\text{W}}_{r} x_{t} + {\text{U}}_{r} {\text{h}}_{{t - 1}} )$$
(2)

where the symbols are defined as in Eq. (1), with \({W}_{r}\) and \({U}_{r}\) the weight matrices for \({x}_{t}\) and \({h}_{t-1}\), respectively. The reset gate is used by the model to decide how much of the past information to forget.

The candidate activation \({\hat{h}}_{t}\) for time step \(t\) is calculated by:

$${\hat{\text{h}}}_{t} = {\text{tanh}}({\text{W}}x_{t} + r_{t} \odot {\text{Uh}}_{{t - 1}} )$$
(3)

where tanh is the nonlinear activation function, \(W\) and \(U\) are the weight matrices for \({x}_{t}\) and \({h}_{t-1}\), respectively, and \(r_{t} \odot U{h}_{t-1}\) denotes the Hadamard (element-wise) product between the reset gate and \(U{h}_{t-1}\). Finally, the activation of the GRU is calculated by:

$${\text{h}}_{t} = \left( {1 - z_{t} } \right){\text{h}}_{{t - 1}} + z_{t} {\hat{\text{h}}}_{t}$$
(4)
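Equations (1)-(4) can be sketched as a single forward step in NumPy. This is an illustrative implementation under our own naming: weight matrices are randomly initialized rather than learned, and bias terms are omitted, matching the equations above.

```python
import numpy as np

# One GRU step implementing Eqs. (1)-(4) of the text.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)           # update gate, Eq. (1)
    r = sigmoid(Wr @ x_t + Ur @ h_prev)           # reset gate, Eq. (2)
    h_cand = np.tanh(W @ x_t + r * (U @ h_prev))  # candidate activation, Eq. (3)
    return (1 - z) * h_prev + z * h_cand          # new hidden state, Eq. (4)

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
# Alternate input-to-hidden (d_h x d_in) and hidden-to-hidden (d_h x d_h)
# matrices in the order Wz, Uz, Wr, Ur, W, U.
params = [rng.standard_normal((d_h, d_in)) if i % 2 == 0
          else rng.standard_normal((d_h, d_h)) for i in range(6)]

h = np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):          # run 5 time steps
    h = gru_step(x, h, *params)
```

Because Eq. (4) is a convex combination of the previous state and a tanh-bounded candidate, the hidden state stays within (-1, 1) when initialized at zero.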

3.3 SAR architectures

This work proposes four SAR architectures with GRU layers, as described in Fig. 2. Figure 2(a) shows the Item-GRU4REC (I-GRU4REC) architecture, which is built using an item sequence dataset. Mini-batches of the item sequence dataset are fed to the embedding layer, and the GRU layer processes the output of the embedding layer. The SoftMax layer combines the outputs of the GRU layer to generate a set of candidate recommendation items. Negative sampling and ranking loss calculations are performed to evaluate the correctness of the recommendations. Finally, the weights of the embedding, GRU, and SoftMax layers are updated for the next iteration.

Fig. 2 Modeling architectures of different SARs

Figure 2(b) illustrates the Attribute-GRU4REC (A-GRU4REC) architecture, constructed from attribute sequence datasets. Mini-batches of the attribute sequence datasets (AttSeq 1, AttSeq 2, …, AttSeq N) are processed by separate embedding layers (Embedding Layer 1, Embedding Layer 2, …, Embedding Layer N). The outputs of the embedding layers are combined by the feature combining layer, whose output is processed by the GRU layer and then by the SoftMax layer to generate recommendations.

Figure 2(c) shows the Concatenate MSAR (C-MSAR) architecture, which concatenates the item and attribute sequence datasets at the input. Whereas A-GRU4REC combines only attribute sequence data, C-MSAR combines attribute sequence data with item sequence data. The concatenated embeddings are processed by the GRU layer, whose output has N + 1 components, where N is the number of attributes. Each component is forwarded independently to its own SoftMax layer, and the SoftMax combining layer linearly combines the N + 1 SoftMax results.

Figure 2(d) shows the Parallel MSAR (P-MSAR) architecture, which builds I-GRU4REC and A-GRU4REC independently using the item and attribute sequence datasets. The weight combining layer then learns the optimal combination of the two models' outputs; before this combination is learned, I-GRU4REC and A-GRU4REC are first trained with their own optimal parameters.
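The two combination ideas can be contrasted with toy data. This sketch uses random vectors in place of trained networks: the embedding sizes, the mixing weight, and all variable names are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Toy illustration of the C-MSAR and P-MSAR combination ideas.

rng = np.random.default_rng(1)
n_items = 6

# C-MSAR idea: the embeddings of an item and its attributes are
# concatenated before being fed to a single GRU layer.
item_emb = rng.standard_normal(8)                       # item embedding
attr_embs = [rng.standard_normal(4),                    # e.g. brand embedding
             rng.standard_normal(4)]                    # e.g. category embedding
gru_input = np.concatenate([item_emb] + attr_embs)      # one vector to one GRU

# P-MSAR idea: separately trained models each score the items,
# and a weight-combining layer mixes the score vectors.
scores_item_model = rng.standard_normal(n_items)        # from I-GRU4REC
scores_attr_model = rng.standard_normal(n_items)        # from A-GRU4REC
w = 0.7                                                 # learned mixing weight
combined = w * scores_item_model + (1 - w) * scores_attr_model
top5 = np.argsort(-combined)[:5]                        # Top-5 item indices
```

The design difference is where fusion happens: C-MSAR fuses at the input (one model sees everything), while P-MSAR fuses at the output (each model specializes, then scores are aggregated).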

4 Experiment

4.1 Dataset

We collected item click log data, consisting of a user identifier, an item identifier, and a date-time, from Store K over four weeks. From this data, we constructed the session datasets summarized in Table 2. First, we constructed a dataset for parameter setting using the full click log data, consisting of 52,556 sessions with 18,637 items. From the click log data for the first three weeks, we constructed a training dataset of 40,371 sessions with 17,525 items. We then constructed the first test dataset, named Test I, from the last week of click data; its sessions include new items that do not appear in the training data, and it consists of 9,137 sessions with 11,310 items. We also constructed a second test dataset, named Test II, from the same week of click log data; its sessions contain no items outside the training data, and it consists of 3,048 sessions with 11,267 items. The Training, Test I, and Test II datasets are used to compare the different SARs.

Table 2 Dataset summary

The fashion products used in this research have five attributes: brand, season, category, price level, and product planning group. The products are provided by 20 fashion brands owned by the e-commerce company, and there are a total of 192 product categories across five seasons (spring, summer, fall, winter, and non-seasonal). Because we assumed that people tend to buy products within particular price levels, we used categorized price values instead of actual numerical values: the price of each fashion product is classified into one of 10 levels using equal-width binning within its product category, because products in different categories have different price ranges. Product planning groups are assigned according to the year and season for which the product was planned; there are a total of 175 product planning groups. The attributes used in this study and the number of values for each attribute are summarized in Table 3.
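The within-category equal-width binning can be sketched as follows. The prices, the function name, and the bin count argument are illustrative; the paper's actual preprocessing pipeline is not published.

```python
import numpy as np

# Equal-width binning of prices into 10 levels within one product
# category (each category would be binned separately, since price
# ranges differ across categories).

def price_levels(prices, n_bins=10):
    prices = np.asarray(prices, dtype=float)
    lo, hi = prices.min(), prices.max()
    edges = np.linspace(lo, hi, n_bins + 1)   # equal-width bin edges
    # Interior edges map each price to a 1..n_bins level; clip keeps
    # the minimum and maximum prices inside the valid range.
    return np.clip(np.digitize(prices, edges[1:-1], right=True) + 1,
                   1, n_bins)

prices = [12000, 15000, 19000, 52000, 99000, 101000]  # made-up KRW prices
levels = price_levels(prices)
```

With these toy prices the range 12,000-101,000 is split into ten 8,900-wide bins, so the three cheapest items fall in level 1 and the two most expensive in level 10.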

Table 3 Feature Summary

4.2 Baseline recommender

We compared the proposed SARs with a k-NN CFR, which recommends similar items measured by the cosine similarity between session vectors. For two given sessions A and B, the cosine similarity \(cosine\_similarity(\text{A},\text{B})\) is defined as the dot product of the session vectors divided by the product of their Euclidean norms.

$$ cosine\_similarity (\text{A},\text{B}) = \frac{{\sum\nolimits_{{i = 1}}^{n} {{\text{A}}_{i} } {\text{B}}_{i} }}{{\sqrt {\sum\nolimits_{{i = 1}}^{n} {\left( {{\text{A}}_{i} } \right)^{2} } } \sqrt {\sum\nolimits_{{i = 1}}^{n} {\left( {{\text{B}}_{i} } \right)^{2} } } }}$$

The k-NN CFR is one of the most common item-to-item solutions in practical recommenders and provides recommendations in the "others who viewed this item also viewed these" setting. Despite its simplicity, the k-NN CFR is generally considered a reliable baseline for comparing the performance of recommenders [47, 48].
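The similarity formula above can be computed directly. In this sketch, sessions are encoded as binary vectors over a toy five-item catalog (the encoding is our assumption; the formula itself is as defined in the text).

```python
import numpy as np

# Cosine similarity between two session vectors, as used by the
# k-NN CFR baseline.

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy catalog of 5 items; a 1 means the session contains that item.
session_a = [1, 1, 0, 1, 0]
session_b = [1, 0, 0, 1, 1]
sim = cosine_similarity(session_a, session_b)  # 2 / (sqrt(3) * sqrt(3))
```

Here the sessions share two of their three items, giving a similarity of 2/3.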

4.3 Performance measures

Because each session in the test dataset contains multiple sequence items, multiple evaluations are performed per session. For the \(i{\text{th}}\) item of a particular test session, the recommender proposes Top-N recommendations sorted by predicted preference, so the items predicted to be most preferred come first. If the \((i+1){\text{th}}\) session item is among the recommendations, the recommendation is considered successful; otherwise, it is considered a failure. Figure 3 shows an example in which the recommender proposes the Top-N recommendations (\({r}_{x,1}\), \({r}_{x,2}\), …, \({r}_{x,N}\)) based on the sequence of items \({a}_{x,1}\), \({a}_{x,2}\), and \({a}_{x,3}\). If the next item in the session, \({a}_{x,4}\), is among the Top-N recommendations, the recommendation is considered successful; otherwise, it is considered unsuccessful.

Fig. 3 Testing example

Because we focus on the number of successful recommendations in the test, we selected the Hit Ratio and the Mean Reciprocal Rank (MRR) over the Top-N recommendations to measure recommender performance. The Hit Ratio measures the percentage of successful recommendations among the Top-N recommendations provided and is defined as:

$${\text{Hit}}\;{\text{Ratio}} = \frac{{\left| {{\text{hitset}}} \right|}}{{\left| {{\text{test}}} \right|}}$$
(5)

where \(\left| {{\text{test}}} \right|\) is the number of test evaluations and \(\left| {{\text{hitset}}} \right|\) is the number of successful recommendations.

The MRR measures recommendation performance by considering the ranking of the recommended items. Whereas the Hit Ratio simply measures whether the test item appears among the recommendations, the MRR considers the rank at which the test item appears. The MRR is defined as:

$${\text{MRR}} = \frac{1}{N}\sum\limits_{{i = 1}}^{N} {\frac{1}{{{\text{rank}}_{i} }}}$$
(6)

where \({rank}_{i}\) is the rank of the correct item in the \(i{\text{th}}\) recommendation list and \(N\) is the number of evaluations.

For each session, the number of performance evaluations is one less than the session length. For example, if a session has four items, as in Fig. 3, the Hit Ratio and MRR are calculated three times, the last item serving only as a target. Finally, the overall performance is measured by averaging the Hit Ratio and MRR over all sessions in the test dataset.
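The evaluation procedure above can be sketched as a loop over session positions. The recommender here is a stand-in that always returns the same ranked list, so the session items and recommendation list are illustrative, not real model output.

```python
# Per-session evaluation: for each position, the next item must
# appear in the Top-N list; a hit contributes 1/rank to the MRR.

def evaluate_session(session, recommend, top_n):
    """Return per-step (hit, reciprocal_rank) pairs for one session."""
    results = []
    for i in range(len(session) - 1):        # last item has no "next"
        recs = recommend(session[: i + 1])[:top_n]
        target = session[i + 1]
        if target in recs:
            rank = recs.index(target) + 1    # 1-based rank of the hit
            results.append((1, 1.0 / rank))
        else:
            results.append((0, 0.0))
    return results

# Toy recommender: always proposes the same ranked Top-5 list.
recommend = lambda prefix: ["i3", "i7", "i2", "i9", "i5"]

session = ["i1", "i3", "i2", "i8"]           # 4 items -> 3 evaluations
steps = evaluate_session(session, recommend, top_n=5)
hit_ratio = sum(h for h, _ in steps) / len(steps)
mrr = sum(rr for _, rr in steps) / len(steps)
```

In this example, "i3" is hit at rank 1, "i2" at rank 3, and "i8" is missed, giving a Hit Ratio of 2/3 and an MRR of (1 + 1/3 + 0)/3 = 4/9.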

5 Results

5.1 Parameter settings

It is important to set appropriate parameters for SARs, and this section describes experimental results under different parameter settings. The SARs have six parameters: learning rate, loss function, hidden activation function, final activation function, dropout, and RNN size. As the hidden activation function, tanh exhibited consistently high performance, so it is used throughout the following parameter settings without further discussion.

5.1.1 Parameter settings for I-GRU4REC

We consider the learning rate, loss function, final activation function, and dropout as the hyperparameters of I-GRU4REC. Two learning rates are compared: 0.001 and 0.0005. Bayesian Personalized Ranking (BPR) [49] and TOP1 [14] are considered as loss functions; BPR is a pairwise ranking loss originating in matrix factorization, whereas TOP1 is a regularized approximation of the relative rank of the relevant item. Linear and Rectified Linear Unit (ReLU) functions are considered for the final activation. Finally, dropout is set to 0.5 for all comparisons. Performance is evaluated using the Hit Ratio and MRR of the Top-N recommendations, and the results are summarized in Table 4. I-GRU4REC performs best when the learning rate is 0.0005, the loss function is TOP1, the RNN size is 1,000, and the final activation function is Linear. For the Top-5 recommendations, the Hit Ratio is 29.20% and the MRR is 20.30%; for the Top-20 recommendations, the Hit Ratio is 43.50% and the MRR is 21.80%.

Table 4 Experimental results of I-GRU4REC

5.1.2 Parameter settings for attribute embedding size

Because the number of values differs across attributes, the appropriate embedding size for each attribute must be determined before modeling A-GRU4REC. With the learning rate set to 0.001, the dropout to 0.5, the loss function to BPR, and the final activation to Linear, the Hit Ratio and MRR are measured for different embedding sizes and different numbers of recommendations for each attribute. Because the number of values differs across attributes, the set of embedding sizes evaluated also differs across attributes. The results are summarized in Table 5, where the best-performing embedding size is shown in bold.

Table 5 Experimental results on Attribute Embedding Size

5.1.3 Parameter settings for A-GRU4REC

After selecting the appropriate embedding size for each attribute, we compare the performance of A-GRU4REC under different loss functions (BPR and TOP1), final activation functions (Linear and ReLU), learning rates (0.001 and 0.0005), and dropout rates (0.2, 0.5, and 0.8). The results are summarized in Table 6, where the Hit Ratio and MRR are measured over the Top-N recommendations and the best settings are marked in bold. The Hit Ratio and MRR are optimal when the loss function is BPR, the final activation function is Linear, the learning rate is 0.001, and the dropout is 0.5. For the Top-5 recommendations, the Hit Ratio is 9.90% and the MRR is 5.20%; for the Top-20 recommendations, the Hit Ratio is 20.80% and the MRR is 6.30%.

Table 6 Experimental results of A-GRU4REC

5.2 Performance comparison of recommenders

We compared the recommenders using the Training, Test I, and Test II datasets. Table 7 summarizes the experimental results on the Test I dataset, which contains new items not present in the training dataset. All SARs proposed in this study significantly outperform the conventional k-NN CFR. When the top 5 items are recommended, the Hit Ratio of the k-NN CFR is only 3.5%, while those of I-GRU4REC, A-GRU4REC, C-MSAR, and P-MSAR are 14.8%, 8.2%, 15.1%, and 17.4%, respectively. Although the SARs show significantly higher performance, the Hit Ratio improvement decreases as the number of recommendations increases. Most SARs also exhibit higher MRR than the k-NN CFR: for the top 5 recommendations, the MRRs of the k-NN CFR, I-GRU4REC, A-GRU4REC, C-MSAR, and P-MSAR are 5.1%, 9.5%, 4.2%, 9.8%, and 10.5%, respectively. The MRR improvement also decreases as the number of recommendations increases. Among the SARs, P-MSAR shows the largest improvement in Hit Ratio and MRR. I-GRU4REC, based solely on the item sequence dataset, performs much better than A-GRU4REC, based on the attribute sequence datasets, which suggests that item sequences contribute more to SAR performance than attribute sequences. However, this does not mean that attribute sequences are meaningless: C-MSAR and P-MSAR, which consider item and attribute sequences together, perform better than I-GRU4REC.

Table 7 Recommendation results of test I

Table 8 summarizes the experimental results when new items are excluded from the test dataset (Test II). For the top 5 recommendations on Test II, the Hit Ratio and MRR of the k-NN CFR are higher than on Test I, and the Hit Ratio and MRR of all RNN-based SARs also improve significantly. When the top 5 items are recommended, the Hit Ratio of the k-NN CFR is only 8.1%, while those of I-GRU4REC, A-GRU4REC, C-MSAR, and P-MSAR are 23.7%, 12.1%, 24.7%, and 27.8%, respectively; the corresponding MRR values are reported in Table 8. Although the Hit Ratio and MRR improve significantly on the Test II dataset, the relative improvement over the baseline is smaller than on Test I.

For example, the Hit Ratio of P-MSAR on Test I is 4.97 times that of the k-NN CFR, while on Test II it is 3.43 times that of the k-NN CFR. As on Test I, P-MSAR shows the largest improvement in Hit Ratio and MRR among the SARs, and it improves significantly over the other MSARs. The performance improvement is largely due to item sequences, because I-GRU4REC performs significantly better than A-GRU4REC; however, as in the earlier results, C-MSAR and P-MSAR, which use both item and attribute sequences, perform better than I-GRU4REC.

Table 8 Recommendation results of test II

6 Discussion

This research proposes SARs for fashion product recommendation. Unlike existing recommendation algorithms such as CFRs and CBRs, SARs explicitly consider the sequence of user behaviors when constructing recommenders. The sequence of user behavior implicitly reflects its temporal aspect, similar to temporal recommenders [28, 50]; however, while temporal recommenders use temporal information to weight ratings, SARs focus on the sequence itself. Most SARs in e-commerce use a single sequence dataset of items, whereas this study proposes three SARs, namely A-GRU4REC, C-MSAR, and P-MSAR, that use multiple sequence datasets.

This research demonstrates that P-MSAR outperforms the k-NN CFR and the other SARs. All SARs outperform the k-NN CFR in Hit Ratio and MRR, consistent with previous studies [51,52,53,54]. Even A-GRU4REC, which shows the lowest performance among the SARs, achieves significantly higher performance than the k-NN CFR. Since CFRs generally outperform CBRs, it is an interesting result that A-GRU4REC, using only attribute sequences, can provide better recommendations for cold-start items than both CFRs and CBRs.

I-GRU4REC improves performance significantly; its performance is almost twice that of A-GRU4REC. This result is consistent with the fact that CFRs generally outperform CBRs. However, I-GRU4REC cannot suggest recommendations for cold-start items, which do not exist in the training dataset, suggesting that there is room for improvement through hybrid SARs. The performance of I-GRU4REC is best when there are no new items in the test dataset (Test II), which is natural because I-GRU4REC can only recommend items that exist in the training dataset. Interestingly, the performance of A-GRU4REC is also enhanced when there are no new items in the test dataset (Test II).

Both C-MSAR and P-MSAR combine item and attribute sequences and outperform the k-NN CFR, I-GRU4REC, and A-GRU4REC, with P-MSAR showing the larger improvement. This result is consistent with previous studies that use attributes in fashion recommendation [15, 18, 23, 55, 56].

7 Conclusion

This study advocates SARs for recommendation in the e-commerce domain, particularly for the fashion product recommendation problem. Consistent with previous studies, I-GRU4REC outperforms the existing recommender, the k-NN CFR. However, I-GRU4REC has an inherent limitation: it cannot suggest recommendations for new items, for which no sequence data exist. Therefore, this study combined item and attribute sequence data. Our experimental results show that the proposed hybrid SARs, C-MSAR and P-MSAR, outperform I-GRU4REC as well as the k-NN CFR. This result confirms that SARs using both item and attribute sequence data can improve recommendation performance.

Despite these results, our current approach has the following limitations. First, although our work has demonstrated that SARs are suitable for fashion product recommendation, additional effort is required to apply them in the real world. The attributes we consider are catalog attributes of fashion products; the unique design attributes of fashion products (e.g., color and style) are not reflected. As suggested in previous studies [16,17,18, 23, 57], such visual attributes can play an important role in recommending fashion products, so it is necessary to develop SARs that include them. Second, this study proved the usefulness of SARs through offline evaluation on historical data, but user studies are required for more reliable performance evaluation, as are simulations on actual e-commerce sites. Third, owing to the nature of RNNs, the proposed SARs make recommendations based on short-term behaviors rather than long-term preferences, so it is necessary to explore ways to reflect long-term preferences. Finally, our current model considers only the next item in the sequence, although the current sequence can also affect subsequent sequences. For this problem, attention-based RNN algorithms are promising and have already been applied to other domains [6, 58, 59]; it is therefore possible to extend this study by adopting them.