1 Introduction

Genome-wide association study (GWAS) is an observational study for discovering associations between genetic markers and a phenotypic trait of interest such as coronary artery disease (CAD). Findings from a GWAS can help in reliably predicting an individual’s risk of having the disease and in developing effective ways for prevention or treatment [24]. Different forms of genetic markers may be considered in a GWAS. A simple and widely used one, called single nucleotide polymorphism (SNP), refers to a genetic variant (i.e. a pair of alleles) occurring at corresponding base positions (loci) on a pair of chromosomes in at least 1% of the population. Numerically, a SNP is coded as the number of minor alleles at the loci and thus takes three possible values, 0, 1 and 2, representing the three states of the allele pair: homozygous in the major allele, heterozygous, and homozygous in the minor allele. Typically, observations are available on hundreds of thousands or even millions of SNPs in a GWAS with a sample of hundreds of cases and controls, thanks to the use of modern high-throughput DNA sequencing techniques. Moreover, the SNPs are likely to be highly correlated with one another, given that they dwell in a tiny chromosome space. These complications present significant computational challenges for analysing the intrinsic SNP versus phenotype associations with current statistical and machine learning methods.

Many statistical methods for GWAS formulate the SNPs versus phenotype associations by various regression models for categorical data, and then assess the significance of each individual SNP-phenotype association by single-locus tests [25]. The regression models that have been used in GWAS include ANOVA [13], generalized linear models (GLMs) [16], and (generalized) linear mixed models (LMMs) [27]. Principal components of the SNPs are sometimes used in GLMs to reduce the effects of false positives attributed to the population substructure. Approaches based on LMMs, e.g. CMLM [28] and ECMLM [15], have been shown to be successful in dealing with both population substructure and relatedness in genomic data by treating them as fixed effects and random effects, respectively.

Yet it is well recognized that some phenotypes or diseases have complex genetic aetiologies. In such cases, each individual SNP may have a weak marginal effect or no effect on the disease, but a combination of some SNPs can synergistically contribute to the risk of the disease. This has brought forth the need for including epistatic (a.k.a. gene–gene) interactions in the statistical regression models. However, in the presence of hundreds of thousands of SNPs in a GWAS, the number of interaction terms grows exponentially as the interaction order increases, leading to “large p, small n” scenarios with computational difficulties [12]. Here, p represents the number of variables (SNPs), while n denotes the number of samples. In these situations, exhaustive marginal hypothesis testing methods cannot be scaled to interactions of order higher than pairwise.

To overcome this exponential explosion challenge, two-stage approaches have been developed to search for epistasis through dimension reduction and variable selection. Several penalized regression methods such as LASSO regression [22] and Elastic-Net [29] have been used in GWA studies by [26] and [6], respectively, in a two-stage manner: first screen the whole genome to obtain a small set of SNPs potentially having significant main effects on the disease, then apply variable selection to identify all significant main and interaction effects among the SNPs in the small set obtained from screening. Some other two-stage approaches focus more on the screening stage. For example, Fan and Lv [7] developed a dimension reduction method via correlation learning called Iterative Sure Independence Screening (ISIS). This method has been extended to GWAS, resulting in several enhanced versions of ISIS, such as GWASelect [10], EPISIS [23], and TS-SIS [14]. Nevertheless, most of these methods are only capable of analysing effects involving a small number of selected SNPs that have strong marginal effects but weak interaction effects.

In recent decades, machine learning has been widely used to tackle high-dimensional data analysis problems, which has inspired its application to GWAS. Among machine learning methods, random forest and association rule mining are two typical methods for identifying important SNP-phenotype associations in GWAS. Random forest, a supervised learning method, can rank the importance of each SNP in terms of its association with the phenotype in an ensemble of classification trees. An importance score for each SNP, e.g. based on the Gini index or cross-entropy, is defined through a loss function expressing the error in predicting the phenotype and can be easily computed while growing the trees [4]. Since this importance score is calculated in the presence of other variables in the model, it is particularly suitable for the filtering stage (stage 1) of two-stage GWAS approaches.

As for association rule mining (ARM), it is an unsupervised machine learning method designed to search for important association rules composed of an arbitrary number of items [3]. Unlike the SNP-phenotype association terms identified from a statistical regression or supervised learning model, the SNP-phenotype association rules identified by ARM are formulated in terms of concepts from set theory. Thus, the association rules and the association terms capture different types of SNP-phenotype associations, which need not coincide. This is welcome since they provide complementary information for GWAS.

Early ARM methods are computationally intensive even for mining a moderate-sized basket of items. Recently, several parallel and distributed computing techniques such as GARMS [1] and BPARES [2] have been applied to speed up ARM, but it can still be hard to mine large-scale GWAS data due to the underlying combinatorial explosion in constructing rules. Qian et al. [17] proposed a Gibbs sampling-induced stochastic search approach to mine the rules efficiently. However, this approach treats each item as a binary variable, which does not suit ARM in GWAS because the SNPs there are multinomial variables. In this paper, we develop a new multinomial Gibbs-sampling-induced stochastic search method for ARM in GWAS, which scales well to large-scale GWAS data.

The rest of the paper is structured as follows. In Sect. 2, we introduce the format of GWAS data and the framework of ARM before developing the random (stochastic) search algorithm MultinomRSA. The core of this algorithm is a Gibbs sampler having multinomial marginal conditional distributions. The MultinomRSA algorithm is particularly suitable for mining SNP versus phenotype association rules because each SNP item is better modelled by a multinomial random variable. In Sect. 3, we assess the performance of our method using simulation studies and a real-world CAD dataset. Finally, in Sect. 4, we provide conclusions and an overview of potential areas for further research.

2 Methodology

2.1 Data, conceptions and notations

2.1.1 GWAS data

According to its genetic definition, a SNP can be numerically coded as the number of minor alleles at the corresponding pair of loci. The process of obtaining coded values for all observed SNPs is called genotyping. Normally, there are two types of alleles at a SNP locus in the human genome, the major allele and the minor allele, and there are two alleles dwelling at each pair of loci. Therefore, each genotyped SNP variable has three possible values, 0, 1, and 2, representing the three states of the allele pair: homozygous in the major allele, heterozygous, and homozygous in the minor allele. For example, if the minor allele is denoted as T and the major as A, then \(\hbox {SNP}=0\) if AA is observed; \(=1\) if AT or TA is observed; and \(=2\) if TT is observed at the loci.

A real-world GWAS example to be presented in Sect. 3 uses a large portion of the PennCATH dataset that is collected from a case–control observational study of CAD [20]. The PennCATH data consist of the genotyped values of 861,473 SNPs across 3850 individuals whose CAD information and information on cardiovascular risk factors are also available. If the phenotype CAD and the genotyped SNPs can be regarded as a set of items, relations between CAD and the SNPs can be mined by ARM. In order to see this, we provide a brief review of ARM in the following.

2.1.2 ARM framework

ARM was developed in [3] for mining interesting relations between items from transactions data recorded by point-of-sale systems in supermarkets. For example, the rule \(\{\hbox {ham, cheese}\}\!\rightarrow \! \{\hbox {bread}\}\) found in a supermarket’s sales data would suggest that a customer buying ham and cheese is likely to buy bread as well. Information on how frequently this rule is observed in the sales data can be used as the basis for certain marketing decisions made by the supermarket.

A common formulation of ARM is given in [9], which we summarize here. Define \(\mathcal {I} = \{ I_1, I_2, \ldots , I_m \}\) as a set of m items called the item space and \(\mathcal {D} = \{\textbf{t}_1, \textbf{t}_2, \ldots , \textbf{t}_L\}\) as a list of transactions, where each transaction in \(\mathcal {D}\) is just a subset of items in \(\mathcal {I}\), i.e. \(\textbf{t}_l \subset \mathcal {I}\), \(l = 1, \ldots , L\). An association rule is defined as an implication of the form \(\textbf{X} \rightarrow \textbf{Y}\) where \(\textbf{X}, \textbf{Y} \subset \mathcal {I}\) and \(\textbf{X} \cap \textbf{Y} = \emptyset \). The sets of items (itemsets, for short) \(\textbf{X}\) and \(\textbf{Y}\) are called the antecedent and consequent of the rule, respectively. The support of an itemset \(\textbf{X}\), denoted \(\text {supp}(\textbf{X})\), is defined as the proportion of transactions in \(\mathcal {D}\) which contain \(\textbf{X}\). The confidence of an association rule \(\textbf{X} \rightarrow \textbf{Y}\) is defined as

$$ \begin{aligned} \text {conf}(\textbf{X} \rightarrow \textbf{Y}) = \frac{\text {supp}(\textbf{X} \& \textbf{Y})}{\text {supp}(\textbf{X})} \end{aligned}$$

where \( \textbf{X} \& \textbf{Y}\) is the itemset obtained by amalgamating \(\textbf{X}\) with \(\textbf{Y}\). Define \(\hbox {conf}(\textbf{X} \rightarrow \textbf{Y})=-\infty \) if \(\hbox {supp}(\textbf{X})=0\), to indicate no need to measure \(\hbox {conf}(\textbf{X} \rightarrow \textbf{Y})\) if the itemset \(\textbf{X}\) is not observed in the transactions dataset \(\mathcal {D}\). The support of an itemset measures its commonness and the confidence of an association rule measures its association strength. By the essential meaning of support, we can also define the support for a rule \(\textbf{X} \rightarrow \textbf{Y}\), which is just

$$ \begin{aligned} \text {supp}(\textbf{X} \rightarrow \textbf{Y})\equiv \text {supp}(\textbf{Y} \rightarrow \textbf{X})\equiv \text {supp}(\textbf{X} \& \textbf{Y}). \end{aligned}$$

In ARM, we aim to find the rules that have high support, high confidence, or high values of some other properly defined metric.
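To make these definitions concrete, the following is a minimal R sketch of the support and confidence computations, assuming transactions are stored as a binary incidence matrix with one row per transaction and one column per item; the object and function names (`trans`, `supp`, `conf`) are illustrative and not from the paper's implementation.

```r
# Minimal sketch of support and confidence on a binary transaction matrix.
supp <- function(trans, items) {
  # proportion of transactions containing every item in `items`
  mean(rowSums(trans[, items, drop = FALSE]) == length(items))
}

conf <- function(trans, lhs, rhs) {
  s_lhs <- supp(trans, lhs)
  if (s_lhs == 0) return(-Inf)                  # convention used in the text
  supp(trans, c(lhs, rhs)) / s_lhs
}

# Toy example: 4 transactions over the items {ham, cheese, bread}
trans <- rbind(c(1, 1, 1), c(1, 1, 0), c(0, 1, 1), c(1, 0, 1))
colnames(trans) <- c("ham", "cheese", "bread")
supp(trans, c("ham", "cheese"))                 # 0.5
conf(trans, c("ham", "cheese"), "bread")        # 0.5
```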

Note that for transactions data generated from the item space \(\mathcal {I}\), any itemset \(\textbf{X}\) containing k items in \(\mathcal {I}\) can be equivalently expressed as a binary indicator vector \(\textbf{V}(\textbf{X})=(J_1, \ldots , J_m)\), where \(J_\ell =1\) if \(I_\ell \in \textbf{X}\) and \(J_\ell =0\) if \(I_\ell \not \in \textbf{X}\), \(\ell =1,\ldots ,m\). It is easy to see that the number of distinct transactions that can be generated from \(\mathcal {I}\) is \(2^m-1\).

At first glance, GWAS data do not resemble transactions data, so it may seem that ARM is not applicable to GWAS data mining. However, a SNP variable having three levels can be represented by three indicator variables: \(J_0^{(\tiny \hbox { SNP})}\), \(J_1^{(\tiny \hbox { SNP})}\), and \(J_2^{(\tiny \hbox { SNP})}\), where \(J_\ell ^{(\tiny \hbox { SNP})}=1\) if the SNP is observed at level \(\ell \) and \(J_\ell ^{(\tiny \hbox { SNP})}=0\) otherwise, \(\ell =0,1,2\). This implies that a SNP variable can be regarded as a set of 3 items \(I_0^{(\tiny \hbox { SNP})}\), \(I_1^{(\tiny \hbox { SNP})}\), and \(I_2^{(\tiny \hbox { SNP})}\), corresponding to the respective indicators \(J_0^{(\tiny \hbox { SNP})}\), \(J_1^{(\tiny \hbox { SNP})}\), and \(J_2^{(\tiny \hbox { SNP})}\). Also, the phenotype variable naturally specifies 2 items \(I_{\tiny \hbox { D}}\) and \(I_{\tiny \hbox { ND}}\) corresponding to disease and no disease, respectively. Hence, the observations of each individual in a GWAS dataset containing m SNP variables can be converted into a specific transaction that consists of m items from the 3m SNP items and one item from \(I_{\tiny \hbox { D}}\) and \(I_{\tiny \hbox { ND}}\), such that each SNP variable contributes one and only one item to the transaction. Such a transaction can be equivalently represented by the binary indicators determined by all items in the transaction.
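As an illustration of this conversion, here is a hedged R sketch that turns a genotype matrix coded 0/1/2 and a binary phenotype into the item-indicator representation just described; the names `make_transactions`, `geno` and `pheno` are hypothetical.

```r
# Hedged sketch: one-hot encode SNP genotypes (0/1/2) and a binary phenotype
# into the transaction representation described above.
make_transactions <- function(geno, pheno) {
  m <- ncol(geno)
  items <- matrix(FALSE, nrow = nrow(geno), ncol = 3 * m + 2)
  colnames(items) <- c(paste0("SNP", rep(1:m, each = 3), "_", rep(0:2, m)),
                       "I_D", "I_ND")
  for (v in 1:m)
    for (l in 0:2)
      items[, 3 * (v - 1) + l + 1] <- geno[, v] == l   # level item of SNP v
  items[, "I_D"]  <- pheno == 1                        # disease item
  items[, "I_ND"] <- pheno == 0                        # no-disease item
  items
}

# Toy example: 4 individuals, 2 SNPs
geno  <- cbind(SNP1 = c(0, 1, 2, 1), SNP2 = c(2, 0, 1, 1))
pheno <- c(1, 0, 1, 0)
items <- make_transactions(geno, pheno)
```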

In order to represent not only a transaction but also any itemset of the m SNPs, we introduce an additional indicator \(J_{\tiny \hbox { no}}^{(\tiny \hbox { SNP})}\) which equals 1 or 0 depending on whether or not the SNP is inside the itemset. Now, write

$$\begin{aligned} \textbf{J}^{(\tiny \hbox { SNP})}=(J_{\tiny \hbox { no}}^{(\tiny \hbox { SNP})}, J_0^{(\tiny \hbox { SNP})}, J_1^{(\tiny \hbox { SNP})}, J_2^{(\tiny \hbox { SNP})}) \end{aligned}$$

as a set of 4 binary indicators for a given SNP. Then, an itemset \(\textbf{I}(\hbox { SNPs}\; s_{1:k})\) containing k observations from k SNPs, denoted as \(\{\hbox {SNP} s_1, \ldots , \hbox {SNP} s_k\}\) with \(1\le s_1, \ldots , s_k\le m\) and \(k\le m\), can be represented as

$$\begin{aligned} \textbf{I}^{(\tiny \hbox { SNPs} s_{1:k})}=(I_{\ell (1)}^{(\tiny \hbox { SNP} s_1)}, \ldots , I_{\ell (k)}^{(\tiny \hbox { SNP} s_k)}), \end{aligned}$$

where \(\ell (j)\) is the observed value of SNP \(s_j\), \(j=1,\ldots , k\). Regardless of the value of k, the corresponding indicator vector for \(\textbf{I}^{(\tiny \hbox { SNPs} s_{1:k})}\) can always be expressed as

$$\begin{aligned} \textbf{J}^{(\tiny \hbox { SNPs} s_{1:k})}=(\textbf{J}^{(\tiny \hbox { SNP}_1)}, \ldots , \textbf{J}^{(\tiny \hbox { SNP}_m)}), \end{aligned}$$

in which there are m 1’s and 3m 0’s; but there are exactly \(m-k\) indicators of form \(J_{\tiny \hbox { no}}^{(\tiny \hbox { SNP})}\) that equal 1. It is easy to see that for an item space of m SNPs and a phenotype, there are up to \(2\times 3^m\) distinct transactions that can be observed in a GWAS data set, whereas \(4^m-1\) non-empty distinct itemsets can be generated from an item space of m SNPs. In this paper, we are interested in mining the SNPs-phenotype induced association rules of the following forms:

$$\begin{aligned} (\textbf{J}^{(\tiny \hbox { SNP}_1)}, \ldots , \textbf{J}^{(\tiny \hbox { SNP}_m)})\rightarrow I_{\tiny \hbox { D}} \end{aligned}$$
(1)
$$\begin{aligned} (\textbf{J}^{(\tiny \hbox { SNP}_1)}, \ldots , \textbf{J}^{(\tiny \hbox { SNP}_m)})\rightarrow I_{\tiny \hbox { ND}} \end{aligned}$$
(2)

2.2 ARM processing and metrics

Support and confidence are the key metrics for evaluating how “interesting” or “informative” an association rule \(\textbf{X}\rightarrow \textbf{Y}\) is. But it is computationally infeasible to search for the most interesting rules based on \(\hbox {supp}(\textbf{X}\rightarrow \textbf{Y})\) and/or \(\hbox {conf}(\textbf{X}\rightarrow \textbf{Y})\) by a brute-force approach even if the associated item space has a moderate number of items, because the search space has a cardinality of exponential order. Current approaches for tackling this difficulty use constrained search. A typical such method is the Apriori algorithm [3], in which one sets thresholds \(t_{\tiny \hbox { supp}}\), \(t_{\tiny \hbox { conf}}\) and \(t_{\tiny \hbox { len}}\), respectively, for the support, confidence and length of each of the rules to be searched. Namely, one either searches for the rules of the highest support(s) subject to

$$\begin{aligned} \hbox {length}(\textbf{X}\rightarrow \textbf{Y})\le t_{\tiny \hbox { len}} \quad \hbox {and}\quad \hbox {conf}(\textbf{X}\rightarrow \textbf{Y})\ge {t_{\tiny \hbox { conf}}} \end{aligned}$$

or searches for the rules of the highest confidence(s) subject to

$$\begin{aligned} \hbox {length}(\textbf{X}\rightarrow \textbf{Y})\le t_{\tiny \hbox { len}} \quad \hbox {and}\quad \hbox {supp}(\textbf{X}\rightarrow \textbf{Y})\ge t_{\tiny \hbox { supp}}. \end{aligned}$$

The effectiveness and efficiency of such a constrained search critically depend on the selection of \(t_{\tiny \hbox { supp}}\), \(t_{\tiny \hbox { conf}}\) and \(t_{\tiny \hbox { len}}\), and it is very difficult to scale this approach to ARM on a large item space.
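For reference, a constrained search of this kind can be run with the CRAN package `arules` (assumed available); the sketch below reuses the hypothetical logical transaction matrix `items` from the Sect. 2.1.2 sketch and sets thresholds playing the roles of \(t_{\tiny \hbox { supp}}\), \(t_{\tiny \hbox { conf}}\) and \(t_{\tiny \hbox { len}}\).

```r
# Threshold-constrained Apriori search with the `arules` package (assumed).
library(arules)

trans_obj <- as(items, "transactions")      # `items`: logical matrix of transactions
rules <- apriori(trans_obj,
                 parameter = list(support = 0.05,    # plays the role of t_supp
                                  confidence = 0.5,  # plays the role of t_conf
                                  maxlen = 4))       # plays the role of t_len
inspect(head(sort(rules, by = "support"), 10))
```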

In this paper, we propose to use a stochastic search approach instead for ARM to find the most “interesting” or “informative” rules. For this, we need a different metric to measure the interestingness of an association rule. In Qian et al. [17], an importance measure is proposed for each association rule \(\textbf{X}\rightarrow \textbf{Y}\), which is defined as

$$\begin{aligned} \hbox {imp}(\textbf{X}\rightarrow \textbf{Y})=f(\hbox {supp}(\textbf{X}\rightarrow \textbf{Y}), \hbox {conf}(\textbf{X}\rightarrow \textbf{Y})) \end{aligned}$$
(3)

where \(f(\cdot , \cdot )\) can be any positive and increasing function with respect to \(\hbox {supp}(\textbf{X}\rightarrow \textbf{Y})\) and \(\hbox {conf}(\textbf{X}\rightarrow \textbf{Y})\). Here, we choose

$$\begin{aligned} \hbox {imp}(\textbf{X}\rightarrow \textbf{Y})=\hbox {supp}(\textbf{X}\rightarrow \textbf{Y}) \times \hbox {conf}(\textbf{X}\rightarrow \textbf{Y}) \end{aligned}$$

as the importance of \(\textbf{X}\rightarrow \textbf{Y}\). This retains the virtue of favouring rules of high support and confidence in ARM. With the stochastic search method to be developed, we aim to find the association rules that have the highest importance values.
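A one-line helper suffices to compute this importance measure; the sketch below reuses the hypothetical `supp()` and `conf()` helpers from the earlier sketch and follows the convention \(\hbox {conf}=-\infty \) when the antecedent is unobserved.

```r
# Importance of a rule X -> Y as supp(X -> Y) * conf(X -> Y); hypothetical helpers.
imp <- function(trans, lhs, rhs) {
  cf <- conf(trans, lhs, rhs)
  if (!is.finite(cf)) return(-Inf)                  # rule never observed
  supp(trans, c(lhs, rhs)) * cf
}
```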

Two situations need to be considered separately in this development, depending on whether the item indicators involved in the association rules are binary or polytomous. For binary item indicators, a stochastic Bernoulli Gibbs sampling ARM method is given, while for polytomous item indicators, a stochastic multinomial Gibbs sampling ARM method is developed. The details are given in the following.

2.3 Stochastic Bernoulli Gibbs sampling ARM

For an item space \(\mathcal {I}=\{I_1,\ldots , I_{m+1}\}\), with antecedent itemsets represented by binary indicator vectors of the form \(\textbf{J}_m=(J_1,\ldots , J_m)\) over \(I_1,\ldots , I_m\), consider the following collection of association rules

$$\begin{aligned} \mathcal {R}=\left\{ \textbf{J}_m\rightarrow I_{m+1}\!:\, \textbf{J}_m \in \{0,1\}^m,\, \textbf{J}_m\ne \textbf{0}\right\} . \end{aligned}$$

All rules in \(\mathcal {R}\) can be ranked according to their importance values. Denote \(\textbf{J}_{m(k)}\) as the kth order antecedent such that \(\hbox {imp}(\textbf{J}_{m(k)}\rightarrow I_{m+1})\) is the kth highest. When m is moderate or large, we know it is computationally infeasible to find the rules of the highest-importance values in \(\mathcal {R}\) by a brute-force search method because \(2^m-1\), the cardinality of \(\mathcal {R}\), is of exponential order. On the other hand, consider a probability distribution on \(\mathcal {R}\) defined by a softmax function, i.e.

$$\begin{aligned} P_\xi (\textbf{J}_m)\equiv \hbox {Pr}(\textbf{J}_m\rightarrow I_{m+1}) = \frac{e^{\xi \cdot \hbox {imp}(\textbf{J}_m \rightarrow I_{m+1})}}{\sum _{(\textbf{J}_m^\prime \rightarrow I_{m+1})\in \mathcal {R}} e^{\xi \cdot \hbox {imp}(\textbf{J}_m^\prime \rightarrow I_{m+1})}} \end{aligned}$$
(4)

where \(\xi > 0\) is a tuning parameter for adjusting the probability ratio between any two rules in \(\mathcal {R}\). It is easy to see that the rules in \(\mathcal {R}\) having the kth highest importance value also have the kth largest probability defined in (4) for any \(\xi \) value. This implies that if one can generate a random sample of rules from \(P_\xi (\textbf{J}_m)\), the rules having the highest importance values are more likely to appear in the sample, and to appear earlier, than rules of low importance. Moreover, the larger \(\xi \) is, the higher the frequency of a high-importance rule in the sample relative to that of a low-importance rule. Therefore, the problem of finding high-importance rules in \(\mathcal {R}\), which is computationally not scalable, can be solved by finding high-importance rules in the random samples generated from (4), which we will see is computationally feasible. By the probability law of large numbers for binomial and multinomial distributions, the highest-importance rules in the generated samples converge to the highest-importance rules in \(\mathcal {R}\) with probability 1, with the relevant approximation error being smaller than \(1/\sqrt{\hbox {sample size}}\) with at least 95% probability. This suggests that a sample of polynomial-order size is sufficient to ensure the convergence, and hence the computational complexity is of the same polynomial order, whereas that of the brute-force search is of exponential order.

Now, the question is how to generate a random sample of rules from the probability distribution \(P_\xi (\textbf{J}_m)\) given by (4), which is a multivariate discrete distribution. This question is not trivial because the denominator in (4) involves \(2^m-1\) terms and is intractable to compute even when m is only in the upper tens. Such difficulty can be bypassed by applying Gibbs sampling, a Markov chain Monte Carlo (MCMC) algorithm that generates a Markov chain from a feasible transition probability matrix such that the stationary distribution of this Markov chain equals the target multivariate discrete distribution.

The involved transition probability matrix in Gibbs sampling is determined by the product of all univariate conditional probability functions of \(P_\xi (\textbf{J}_m)\). It is easy to see that the conditional probability function of each \(J_s\) given \(\textbf{J}_{-s}=(\textbf{J}_{1:(s-1)}, \textbf{J}_{(s+1):m})\), \(s=1,\ldots , m\), is

$$\begin{aligned} P_\xi (J_s|\textbf{J}_{-s})=\frac{e^{\xi \cdot \hbox {imp}((\textbf{J}_{1:(s-1)},J_s,\textbf{J}_{(s+1):m}) \rightarrow I_{m+1})}}{\sum _{J_s^\prime =0}^1 e^{\xi \cdot \hbox { imp}( (\textbf{J}_{1:(s-1)},J_s^\prime ,\textbf{J}_{(s+1):m}) \rightarrow I_{m+1})}} \end{aligned}$$
(5)

which is a Bernoulli probability distribution not involving the intractable denominator of (4). Here, \(\textbf{J}_{a:b}=(J_a, J_{a+1}, \ldots , J_b)\) if integers a and b satisfy \(a<b\); \(\textbf{J}_{a:b}=J_a\) if \(a=b\); and \(\textbf{J}_{a:b}=\emptyset \) if \(a>b\).

To implement Gibbs sampling for generating rules from (4), we start from a randomly selected transaction \((\textbf{J}_m^{(0)}, I_{m+1})=(J_1^{(0)},\ldots , J_m^{(0)}, I_{m+1})\) in the transactions dataset and construct the initial rule \(\textbf{J}_m^{(0)}\rightarrow I_{m+1}\). Then, generate \(J_1^{(1)}\) from \(P_\xi (J_1|J_2^{(0)},\ldots , J_m^{(0)})\) in (5) to substitute \(J_1^{(0)}\). Continue to generate \(J_s^{(1)}\) from \(P_\xi (J_s|J_1^{(1)},\ldots \!, J_{s-1}^{(1)},J_{s+1}^{(0)},\ldots \!, J_m^{(0)})\) in (5) to substitute \(J_s^{(0)}\), for \(s=2,\ldots , m\). This ends up with the rule \(\textbf{J}_m^{(1)}\rightarrow I_{m+1}\), an update of \(\textbf{J}_m^{(0)}\rightarrow I_{m+1}\). This procedure is repeated sequentially N times to get the N rules \(\{\textbf{J}_m^{(n)}\equiv (J_1^{(n)}, \ldots ,J_m^{(n)})\rightarrow I_{m+1},\, n=1,\ldots , N\}\). We call the method just described the stochastic Bernoulli Gibbs-ARM algorithm.
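The following R code is a hedged sketch of the stochastic Bernoulli Gibbs-ARM procedure just described, not the authors' implementation. It assumes `trans` is a logical transaction matrix whose first m columns are the candidate antecedent items and whose column index `consequent` points to the fixed consequent item, and it reuses the hypothetical `imp()` helper from the earlier sketch.

```r
# Hedged sketch of the stochastic Bernoulli Gibbs-ARM sampler.
# `trans`: logical transaction matrix; columns 1:m are candidate antecedent
# items; `consequent` is the column index of the fixed consequent item.
bernoulli_gibbs_arm <- function(trans, m, consequent, xi, N) {
  # start from the antecedent part of a randomly selected transaction
  J <- as.integer(trans[sample(nrow(trans), 1), 1:m])
  samples <- matrix(NA_integer_, nrow = N, ncol = m)
  for (n in 1:N) {
    for (s in 1:m) {
      lw <- numeric(2)
      for (val in 0:1) {                           # candidate values of J_s
        J_try <- J; J_try[s] <- val
        ante <- which(J_try == 1)
        if (length(ante) == 0) {
          lw[val + 1] <- -Inf                      # exclude the empty antecedent
        } else {
          lw[val + 1] <- xi * imp(trans, ante, consequent)   # log weight, cf. (5)
        }
      }
      w <- exp(lw - max(lw))                       # stabilize before normalizing
      J[s] <- sample(0:1, 1, prob = w / sum(w))
    }
    samples[n, ] <- J                              # one generated antecedent
  }
  samples
}

# Illustrative call: items 1..m as antecedent candidates, last column as consequent
# samples <- bernoulli_gibbs_arm(trans, m = ncol(trans) - 1,
#                                consequent = ncol(trans), xi = 10, N = 1000)
```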

By the properties of Gibbs sampling, the corresponding N itemsets \(\textbf{J}_m^{(1)}, \ldots , \textbf{J}_m^{(N)}\) constitute a Markov chain having the probability distribution (4) as its unique stationary distribution. Therefore, the most important rules in \(\mathcal {R}\) can be determined (with probability 1) from the generated Markov chain, and there are at least three ways to identify them.

Firstly, the most important rules can be determined by the most frequent itemsets among \(\textbf{J}_m^{(1)}, \ldots , \textbf{J}_m^{(N)}\). However, this could be ineffective if the frequency of each distinct itemset in \(\textbf{J}_m^{(1)}, \ldots , \textbf{J}_m^{(N)}\) is very small, e.g. 1 or 2, which is very likely the case if N and \(\xi \) are not large enough.

The second method is to identify the most important rules among the N generated rules and use them to estimate the most important rules in \(\mathcal {R}\). By the ergodicity theorem for Markov chains, the most important sampled rules among the N generated ones converge to the most important population rules in \(\mathcal {R}\) with probability 1 if the underlying Markov chain has a finite state space and satisfies the detailed balance condition. The most important sampled rules can be easily identified by computing the importance values of all the N generated rules. This second method is most effective if m is not very large and the generated Markov chain is sufficiently ergodic.

The third method applies a two-step approach: first identify the most frequent items in \(\textbf{J}_m^{(1)}\), \(\ldots \), \(\textbf{J}_m^{(N)}\) and use these items to constitute a new item space, then apply Gibbs sampling again to determine the most important rules from the collection of association rules given by the new item space. Since the cardinality of the new item space will usually be smaller than that of the original item space, its most important rules can be identified more efficiently by the second-step Gibbs sampling or by the Apriori algorithm. These most important rules converge to the most important rules in \(\mathcal {R}\) with probability 1 by the probability law of large numbers and the ergodicity theorem for Markov chains.
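The first two identification strategies can be sketched as follows, assuming `samples` is the N-by-m matrix of generated antecedent indicators returned by the sampler sketched above, and `trans`, `consequent` and `imp()` are the hypothetical objects from the earlier sketches.

```r
# Method 1: most frequent antecedents in the generated sample.
rule_keys <- apply(samples, 1, paste, collapse = "")
head(sort(table(rule_keys), decreasing = TRUE))

# Method 2: rank the distinct generated antecedents by importance.
distinct <- samples[!duplicated(rule_keys), , drop = FALSE]
imps <- apply(distinct, 1, function(J) imp(trans, which(J == 1), consequent))
distinct[head(order(imps, decreasing = TRUE), 5), ]
```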

Rationales behind the aforementioned three ways of mining important rules are explained in detail in Qian and Zhao [18]. Finally, note that \(P_\xi (\textbf{J}_m)\) defined by (4) equals 0 if an itemset \((\textbf{J}_m, I_{m+1})\) is not observed in the transaction dataset \(\mathcal {D}\). Thus, \(P_\xi (\textbf{J}_m)\) is in effect defined on the collection of all itemsets of \((I_1,\ldots , I_m)\) observed in \(\mathcal {D}\). Moreover, the Markov chain generated by Gibbs sampling via conditional probability distribution (5) is still ergodic in the association rule space induced from \(\mathcal {D}\). In Sect. 3, we will use simulated data and a real data case study to demonstrate the effectiveness and efficiency of Gibbs sampling for ARM.

2.4 Stochastic multinomial Gibbs-ARM for GWAS

Now, we focus on using Gibbs sampling to mine the SNPs-phenotype association induced rules of the form \((\textbf{J}^{(\tiny \hbox { SNP}_1)}, \ldots , \textbf{J}^{(\tiny \hbox { SNP}_m)})\rightarrow I_{\tiny \hbox { D}}\) that is given in (1). ARM for rules of the form \((\textbf{J}^{(\tiny \hbox { SNP}_1)}, \ldots , \textbf{J}^{(\tiny \hbox { SNP}_m)})\rightarrow I_{\tiny \hbox { ND}}\) that is given in (2) can be performed in the same way, thus is skipped.

Corresponding to (4), the target probability distribution for GWAS-ARM using Gibbs sampling is

$$\begin{aligned} P_{D\xi }(\textbf{J}^{(\tiny \hbox { SNP}_1)}, \ldots , \textbf{J}^{(\tiny \hbox { SNP}_m)}) = \frac{e^{\xi \cdot \hbox {imp}\left( (\textbf{J}^{(\tiny \hbox { SNP}_1)}, \ldots , \textbf{J}^{(\tiny \hbox { SNP}_m)})\rightarrow I_{\tiny \hbox { D}}\right) }}{\sum _{\{\textbf{J}^{\prime (\tiny \hbox { SNP}_v)} \in I^{(4)},\, v=1,\ldots ,m\}} e^{\xi \cdot \hbox {imp}\left( (\textbf{J}^{\prime (\tiny \hbox { SNP}_1)},\ldots , \textbf{J}^{\prime (\tiny \hbox { SNP}_m)})\rightarrow I_{\tiny \hbox { D}}\right) }} \end{aligned}$$
(6)

where \(I^{(4)}\) denotes the set of the four rows of the \(4\times 4\) identity matrix, i.e. the four possible values that each itemset indicator \(\textbf{J}^{\prime (\tiny \hbox { SNP}_{\!v})}\) can take.

Since \(P_{D\xi }\) assigns probability 0 to any itemset \((\textbf{J}^{(\tiny \hbox { SNP}_1)}, \ldots , \textbf{J}^{(\tiny \hbox { SNP}_m)})\) that is not observed in the underlying GWAS dataset (following the convention \(\hbox {conf}=-\infty \) when the support is 0), it follows that the domain \(\mathcal {D}_{\tiny \hbox { SNP}}\) of the 4m-variate distribution function \(P_{D\xi }\) in (6) is the collection of all antecedent itemsets of the rules \((\textbf{J}^{(\tiny \hbox { SNP}_1)}, \ldots , \textbf{J}^{(\tiny \hbox { SNP}_m)})\rightarrow I_{\tiny \hbox { D}}\) that are observed in the GWAS dataset. In other words, \(\mathcal {D}_{\tiny \hbox { SNP}}\subseteq \{I^{(4)}\}^m\).

To generate a Markov chain having (6) as its stationary distribution, we need the conditional distribution of each \(\textbf{J}^{(\tiny \hbox { SNP}_{\!v})}\) given \(\textbf{J}^{(\tiny \hbox { SNPs} 1:m)}_{-v}\) (i.e. all elements of the itemset \((\textbf{J}^{(\tiny \hbox { SNP}_1)}, \ldots , \textbf{J}^{(\tiny \hbox { SNP}_m)})\) except the vth element, \(v=1,\ldots , m\)):

$$\begin{aligned} P_{D\xi }\left( \textbf{J}^{(\tiny \hbox { SNP}_{\!v})}\,|\,\textbf{J}^{(\tiny \hbox { SNPs} 1:m)}_{-v}\right) \equiv P_{D\xi }\left( \textbf{J}^{(\tiny \hbox { SNP}_{\!v})}\,|\,\textbf{J}^{(\tiny \hbox { SNPs} 1:(v-1))}, \textbf{J}^{(\tiny \hbox { SNPs}(v+1):m)}\right) = \frac{e^{\xi \cdot \hbox {imp}\left( (\textbf{J}^{(\tiny \hbox { SNP}_1)}, \ldots , \textbf{J}^{(\tiny \hbox { SNP}_m)})\rightarrow I_{\tiny \hbox { D}}\right) }}{\sum _{\textbf{J}^{\prime (\tiny \hbox { SNP}_{\!v})}\in I^{(4)}} e^{\xi \cdot \hbox {imp}\left( (\textbf{J}^{(\tiny \hbox { SNPs} 1:(v-1))},\textbf{J}^{\prime (\tiny \hbox { SNP}_{\!v})}, \textbf{J}^{(\tiny \hbox { SNPs} (v+1):m)})\rightarrow I_{\tiny \hbox { D}}\right) }} \end{aligned}$$
(7)

which is a 4-category size-1 multinomial distribution.

Recall that each \(\textbf{J}^{(\tiny \hbox { SNP}_{\!v})}\) has four possible values given by (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0) and (0, 0, 0, 1), indicating the following four situations, respectively

1. \(\hbox {SNP}_v\) is not inside the rule \((\textbf{J}^{(\tiny \hbox { SNP}_1)}, \ldots , \textbf{J}^{(\tiny \hbox { SNP}_m)})\rightarrow I_{\tiny \hbox { D}}\) if \(\textbf{J}^{(\tiny \hbox { SNP}_{\!v})}=(1,0,0,0)\).

2. \(\hbox {SNP}_v=0\) is inside the rule \((\textbf{J}^{(\tiny \hbox { SNP}_1)}, \ldots , \textbf{J}^{(\tiny \hbox { SNP}_m)})\rightarrow I_{\tiny \hbox { D}}\) if \(\textbf{J}^{(\tiny \hbox { SNP}_{\!v})}=(0,1,0,0)\).

3. \(\hbox {SNP}_v=1\) is inside the rule \((\textbf{J}^{(\tiny \hbox { SNP}_1)}, \ldots , \textbf{J}^{(\tiny \hbox { SNP}_m)})\rightarrow I_{\tiny \hbox { D}}\) if \(\textbf{J}^{(\tiny \hbox { SNP}_{\!v})}=(0,0,1,0)\).

4. \(\hbox {SNP}_v=2\) is inside the rule \((\textbf{J}^{(\tiny \hbox { SNP}_1)}, \ldots , \textbf{J}^{(\tiny \hbox { SNP}_m)})\rightarrow I_{\tiny \hbox { D}}\) if \(\textbf{J}^{(\tiny \hbox { SNP}_{\!v})}=(0,0,0,1)\).

By applying Gibbs sampling to generate N itemsets \(\{\textbf{J}_{(n)}^{(\tiny \hbox { SNPs} 1:m)}, n=1,\ldots , N\}\) from (6) in the same way as in Sect. 2.3, we obtain the procedure summarized in the following algorithm.

Algorithm 1 Stochastic multinomial Gibbs-ARM
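A hedged R sketch of Algorithm 1 is given below; it is an illustrative implementation under stated assumptions, not the authors' code. The state is a length-m vector `g` with `g[v]` in {-1, 0, 1, 2}, where -1 means \(\hbox {SNP}_v\) is not in the antecedent and 0/1/2 mean the corresponding level item is in the antecedent; `item_col()` maps a SNP level to its item column under the encoding sketched in Sect. 2.1.2, and `imp()` and `trans` are the hypothetical objects from the earlier sketches.

```r
# Hedged sketch of Algorithm 1 (stochastic multinomial Gibbs-ARM).
# `trans`: logical transaction matrix whose first 3m columns are the SNP level
# items (in the order produced by the earlier make_transactions() sketch);
# `consequent` is the column index of the disease item.
item_col <- function(v, l) 3 * (v - 1) + l + 1     # column of level-l item of SNP v

multinom_gibbs_arm <- function(trans, m, consequent, xi, N) {
  # start from the antecedent implied by a randomly chosen transaction
  start <- trans[sample(nrow(trans), 1), ]
  g <- sapply(1:m, function(v) which(start[item_col(v, 0:2)]) - 1)
  ante_cols <- function(g) {                       # item columns of an antecedent
    v <- which(g >= 0)
    if (length(v) == 0) integer(0) else item_col(v, g[v])
  }
  samples <- matrix(NA_integer_, nrow = N, ncol = m)
  for (n in 1:N) {
    for (v in 1:m) {
      lw <- sapply(-1:2, function(val) {           # 4 candidate states of SNP_v
        g_try <- g; g_try[v] <- val
        cols <- ante_cols(g_try)
        if (length(cols) == 0) return(-Inf)        # exclude the empty antecedent
        xi * imp(trans, cols, consequent)          # log weight, cf. (7)
      })
      w <- exp(lw - max(lw))
      g[v] <- sample(-1:2, 1, prob = w / sum(w))
    }
    samples[n, ] <- g                              # one generated antecedent
  }
  samples
}
```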

The returned rules \(\{\textbf{J}_{(n)}^{(\tiny \hbox { SNPs} 1:m)}\rightarrow I_{\tiny \hbox { D}},\, n=1,\ldots , N\}\) constitute a Markov chain with (6) being its stationary distribution. Therefore, the same three methods described in Sect. 2.3 can be used to find the most important rules from the generated Markov chain.

3 Experiments

In the experiments, we test the proposed algorithm on simulated datasets and a real-world dataset. The simulated datasets are used to show that the algorithm can accurately and efficiently find the most important association rules, and the real-world dataset, which contains a much larger number of SNPs, is used to demonstrate the capability of the algorithm to work with big data.

3.1 Simulation studies

In the simulation studies, we apply our algorithm to three simulated transaction datasets to demonstrate its performance under a range of scenarios. The datasets were generated using the R package “SNPSetSimulations”, which simulates GWAS-related data. The simulation setting, specified by the user before starting the simulation, includes the number of SNPs (m), the minor allele frequencies (MAFs) of the SNPs, and the correlation structures among the SNPs (R) and between the SNPs and the phenotype (\(\varvec{\beta }\)). Based on this information, the package generates a matrix of genotyped SNP data, along with binary phenotype values calculated from a logistic regression model with certain genotyped SNPs as its covariates [8].
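Since we do not reproduce the package's interface here, the following is a generic R sketch of the data-generating mechanism just described (correlated genotypes obtained by thresholding latent Gaussians at Hardy–Weinberg quantiles, plus a logistic phenotype model); it is an assumption-laden stand-in for `SNPSetSimulations`, not its API, and all parameter values shown are illustrative.

```r
# Generic sketch of the data-generating mechanism (not the SNPSetSimulations API):
# correlated genotypes via thresholded latent Gaussians and a logistic phenotype.
library(MASS)   # for mvrnorm

simulate_gwas <- function(n, maf, R, beta) {
  m <- length(maf)
  Z <- MASS::mvrnorm(n, mu = rep(0, m), Sigma = R)       # latent Gaussians
  geno <- sapply(1:m, function(v) {
    p <- maf[v]
    # Hardy-Weinberg genotype probabilities for 0, 1, 2 minor alleles
    cuts <- qnorm(cumsum(c((1 - p)^2, 2 * p * (1 - p))))
    findInterval(Z[, v], cuts)                           # genotype 0, 1 or 2
  })
  eta <- beta[1] + geno %*% beta[-1]                     # logistic linear predictor
  pheno <- rbinom(n, 1, plogis(eta))
  list(geno = geno, pheno = pheno)
}

# Illustrative call: 3 uncorrelated SNPs with MAF 0.4, only SNP3 associated
set.seed(1)
dat <- simulate_gwas(n = 100, maf = rep(0.4, 3), R = diag(3),
                     beta = c(-3, 0, 0, 5))
table(dat$pheno)
```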

It is natural that those SNPs used in calculating the phenotype response values will be inside the antecedents of the most important association rules of the corresponding transaction data. However, the SNPs-phenotype association specified by the logistic regression model is not the only important one that can be identified by association rule mining. We expect the most important association rules may contain other SNP items not used in calculating the phenotype response values.

Fig. 1 Sample correlations of the three SNPs in Simulation 1

Table 1 Ten most important rules found by Apriori (\(t_{\tiny \hbox { supp}} \ge 0.01, t_{\tiny \hbox { conf}} \ge 0.01\)) with their frequencies of appearing in the \(N=1000\) rules generated by multinomial and Bernoulli Gibbs-ARM methods (Simulation 1)

In the following, we will use \(I_{vu}\) (\(v=1,\ldots , m\); \(u=1,2,3\)) to represent the item corresponding to level \(u-1\) of \(\hbox {SNP}_v\) (i.e. \(\hbox {SNP}_v=u-1\)), and use \(I_{\hbox {D}}\) and \(I_{\hbox {ND}}\) for the positive and negative phenotype, respectively.

Simulation 1 In the first dataset, we start with a small number of SNPs to show that the algorithm can find the most important association rules by comparing them with the rules obtained from the Apriori algorithm.

Consider a dataset having three SNP variables with a binary phenotype response. The MAFs of the SNPs are set to 0.4, and the SNPs are assumed to be uncorrelated (Fig. 1). In the logistic regression under an additive model, the intercept is set to \(-3\) and the coefficient of SNP3 is set to 5. Namely,

$$\begin{aligned} R = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad \varvec{\beta } = (-3,\, 0,\, 0,\, 5) \end{aligned}$$

Then, we use this genotype configuration to generate a sample of 50 cases and 50 controls, amounting to 100 transactions. In association rule mining, each SNP is represented by three items corresponding to its three levels, giving nine SNP items in total and therefore \(4^3 - 1 = 63\) possible rules with \(I_{\hbox {D}}\) (and respectively \(I_{\hbox { ND}}\)) as the consequent. Ignoring the multinomial constraint within the SNP items, there would be \(2^9-1=511\) possible rules instead.

We first use the Apriori algorithm to find the association rules with \(t_{\tiny \hbox { supp}}\ge {0.01}\) and \(t_{\tiny \hbox { conf}} \ge {0.01}\). In the experiments, we define “importance” as the product of support and confidence. The Apriori algorithm found 12 rules, and the top 10 important ones are listed in column 1 of Table 1 with their importance values in column 4. As expected, the top rules in the table are related to SNP3 (items \(I_{3\cdot }\)). Specifically, they all contain the item \(I_{32}\) (i.e. \(\hbox {SNP}_3=1\)), and \(I_{32}\) alone as the antecedent gives the most important rule.

We then generated \(N=1000\) rules from the simulated dataset using either the stochastic multinomial Gibbs sampling ARM method or the Bernoulli Gibbs-ARM method for the tuning parameter \(\xi =10\) or \(\xi =20\). Frequencies of the generated rules in each case are shown in Table 1, from which we see both multinomial and Bernoulli Gibbs-ARM methods have identified the same top 10 important rules as the Apriori algorithm. There are only very small differences in the order of the frequencies in each case, with the multinomial Gibbs-ARM method performing better than the Bernoulli Gibbs-ARM method.

As mentioned in Sect. 2.3, the tuning parameter \(\xi \) controls the ratio of sampling probabilities between each pair of rules generated by the Gibbs sampler. A low \(\xi \) means this ratio is close to 1, so more distinct rules are generated but they are less distinguishable from each other in terms of importance. On the other hand, a high \(\xi \) leads to fewer distinct rules being generated, and the sampler tends to get stuck at local maxima with relatively low importance. Therefore, the tuning parameter should be carefully chosen. Compared with \(\xi = 10\), when \(\xi = 20\) the algorithms tend to place higher frequencies on the more important rules, which is expected from the sampling properties of our methods.

Simulation 2 The SNPs used in Simulation 1 are uncorrelated with each other, which might make it easy for the algorithms to find the SNPs associated with the phenotype. In the real world, however, SNPs are often correlated. So, in the second simulation, we use the same setup as Simulation 1 except that the SNPs have an autocorrelation structure with coefficient \(\rho =0.7\) (Fig. 2). Namely, the correlation between \(\hbox {SNP}_i\) and \(\hbox {SNP}_j\) is

$$\begin{aligned} R(\hbox {SNP}_i, \hbox {SNP}_j) = 0.7^{|i-j|};\quad i, j \in \{1, 2, 3\} \end{aligned}$$

For this example, the correlation matrix is

$$\begin{aligned} R = \begin{pmatrix} 1 & 0.70 & 0.49\\ 0.70 & 1 & 0.70 \\ 0.49 & 0.70 & 1 \end{pmatrix} \end{aligned}$$

The results are shown in Table 2. Similar to Simulation 1, the algorithms successfully found all the important rules, even when the SNPs are correlated.

Table 2 Ten most important rules found by Apriori (\(t_{\tiny \hbox { supp}} \ge 0.05, t_{\tiny \hbox { conf}} \ge 0.5\)) with their frequencies of appearing in the \(N=1000\) rules generated by multinomial and Bernoulli Gibbs-ARM methods (Simulation 2)
Fig. 2 Sample correlations of the SNPs in Simulation 2

Simulation 3 In the third simulation, we generate a more complex dataset to show the computational advantages of our algorithm.

This dataset contains 100 SNPs with a similar correlation structure as Simulation 2 (Fig. 3):

$$\begin{aligned} R(\hbox {SNP}_i, \hbox {SNP}_j) = 0.7^{|i-j|},\quad i, j \in \{1, 2, \ldots , 100\}, \qquad \varvec{\beta } = (-3, 1, 2, 3, 0, \ldots , 0) \end{aligned}$$

In this example, the MAFs are still set to 0.4; in the logistic model, the intercept is \(-3\), the coefficients for the first 3 SNPs are 1, 2, and 3, respectively, and the coefficients for the other SNPs are 0. So, we expect the rules containing items from SNP1 to SNP3 to be important. Then, a sample of \(L=500\) transactions consisting of 250 cases and 250 controls is generated from this configuration.

Fig. 3 Sample correlations of the SNPs in Simulation 3

When the Apriori algorithm is applied to the 250 cases dataset, we find 216,084 rules satisfying \(t_{\tiny \hbox { supp}} \ge 0.05\) and \(t_{\tiny \hbox { conf}} \ge 0.5\). The top 10 most important ones among them are displayed in Table 3. From the table, we can see that items \(I_{12}\), \(I_{22}\), \(I_{32}\) appear frequently in the top rules. This is because SNP1, SNP2, and SNP3 have nonzero coefficients in the underlying data-generating logistic model and they should have strong associations with the phenotype. Items of SNP4 and SNP5 also appear in the top 10 rules as they are highly correlated with SNP2 and SNP3.

Then, both Gibbs-ARM algorithms are applied to the dataset of the 250 cases, and the associated frequencies among the \(N=100\) generated rules are shown in Table 3. For a large item space, the tuning parameter \(\xi \) needs to be large in order to find the important rules efficiently. So we gradually increased \(\xi \) and found that, by choosing \(\xi = 70\) and \(\xi = 250\) for the multinomial and Bernoulli Gibbs-ARM methods, respectively, they can find some important association rules even when only \(N=100\) rules are generated.

Specifically, the multinomial Gibbs-ARM method found 3 of the top 10 rules among the \(N=100\) generated rules, where the top 1 rule (i.e. \(I_{32}\rightarrow I_{\tiny \hbox { D}}\)) has the highest frequency, 0.79. It is expected that more of the top 10 rules could be found among the generated rules, with higher frequencies, if \(\xi \) were set smaller, but the involved computation would be more intensive.

The Bernoulli Gibbs-ARM method did not perform as well here, in that only the top 1 rule could be found, with frequency 0.98, among the \(N=100\) generated rules. This is expected because the items of each SNP have a multinomial constraint and the Bernoulli Gibbs-ARM ignores this constraint.

Table 3 Ten most important rules found by Apriori (\(t_{\tiny \hbox { supp}} \ge 0.05, t_{\tiny \hbox { conf}} \ge 0.5\)) with their frequencies of appearing in the \(N=100\) rules generated by multinomial and Bernoulli Gibbs-ARM methods (Simulation 3 cases)
Table 4 Ten most important rules found by Apriori (\(t_{\tiny \hbox { supp}} \ge 0.05, t_{\tiny \hbox { conf}} \ge 0.5\)) with their frequencies of appearing in the \(N=100\) rules generated by multinomial and Bernoulli Gibbs-ARM methods (Simulation 3 controls)

Control Class For transactions having the control phenotype (\(I_{\tiny \hbox { ND}}\)), the same methods can be applied to find the most important association rules. For example, in Simulation 3, the phenotype was simulated from logistic regression with coefficients 1, 2, 3 for SNP1, SNP2, and SNP3, respectively. So, we also expect strong associations between certain levels of these SNPs and \(I_{\tiny \hbox { ND}}\). The top 10 rules found by Apriori and the Gibbs-ARM results are shown in Table 4. We see that the results in Table 4 are similar to those in Table 3: the multinomial Gibbs-ARM algorithm found 7 of the top 10 important rules, while the Bernoulli one found only 1.

Marginal frequencies From the rules explored by each of the three algorithms, the marginal item frequencies could also indicate the association strength between each SNP item and \(I_{\tiny \hbox { D}}\) or \(I_{\tiny \hbox { ND}}\). Tables 5 and 6 list the top 10 frequent items for case and control classes, respectively. For the case transactions, \(I_{32}\) and \(I_{22}\) are among the top 10 frequent items, while for the control transactions, \(I_{31}\), \(I_{21}\), \(I_{11}\), and \(I_{41}\) seem to have strong associations with \(I_{\tiny \hbox { ND}}\).

Table 5 Marginal frequencies of the 10 most frequent items identified by each algorithm (Simulation 3 cases)
Table 6 Marginal frequencies of the 10 most frequent items identified by each algorithm (Simulation 3 controls)

The three simulations demonstrate that the multinomial Gibbs-ARM algorithm is able to find the most important rules in all the scenarios considered. In complex scenarios, e.g. Simulation 3, the multinomial Gibbs-ARM method explores a larger number of important rules than the Bernoulli Gibbs-ARM method.

3.2 Real-world dataset

In a GWAS tutorial paper, Reed et al. [19] preprocessed the PennCATH dataset and performed GWAS by fitting an additive model for each SNP variable (with sex, age, and principal components included to correct for population substructure). In this section, we use the same dataset and follow the preprocessing procedure in that tutorial paper, but undertake the association rule mining by the stochastic multinomial Gibbs sampling based method, which allows us to discover the association relationships between the SNPs and the target disease in an efficient way.

3.2.1 Dataset description

The PennCATH dataset [20] consists of the genotype information of 861,473 SNPs across 3850 individuals for GWAS of CAD and cardiovascular risk factors. A set of 1401 individuals in the dataset were de-identified and selected to be used in the aforementioned paper [19]: 933 are positive and 468 are negative for CAD. The dataset contains the clinical data, genotypes, and CAD status for each sample.

3.2.2 Preprocessing

In [19], 6 steps of preprocessing have been applied to the PennCATH dataset before association analysis:

1. Read PLINK data into R

2. SNP-level filtering (part 1)

3. Sample-level filtering

4. SNP-level filtering (part 2)

5. Create principal components to capture population substructure

6. Impute non-typed SNPs with external data

Population substructure correction and SNP imputation are irrelevant to Gibbs-sampling-based ARM; therefore, we only follow the first 4 steps to filter SNP variables and samples with R.

The program first converts the PLINK data (.bed, .bim, and .fam files) into an R object. Then, it filters the SNPs based on their call rate and MAF, and filters the samples based on their call rate, heterozygosity, relatedness, and ancestry. Finally, it filters the SNPs based on Hardy–Weinberg equilibrium (HWE). This preprocessing results in 1401 samples (no individuals filtered out) with 656,890 SNP variables.
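For orientation, the sketch below shows roughly how such SNP- and sample-level filtering can be carried out with the Bioconductor package `snpStats` (which the Reed et al. tutorial builds on); the file names and thresholds are illustrative, not the values used in the tutorial.

```r
# Hedged sketch of SNP- and sample-level filtering with `snpStats`
# (illustrative file names and thresholds).
library(snpStats)

pc   <- read.plink(bed = "penncath.bed", bim = "penncath.bim", fam = "penncath.fam")
geno <- pc$genotypes                                   # SnpMatrix: samples x SNPs

# SNP-level filtering by call rate and MAF
snp_stats <- col.summary(geno)
keep_snp  <- with(snp_stats, !is.na(MAF) & MAF > 0.01 & Call.rate > 0.95)
geno <- geno[, keep_snp]

# Sample-level filtering by call rate and outlying heterozygosity
samp_stats <- row.summary(geno)
het_z      <- as.vector(scale(samp_stats$Heterozygosity))   # z-score of heterozygosity
keep_samp  <- samp_stats$Call.rate > 0.95 & abs(het_z) < 3  # within 3 SD of the mean
geno <- geno[keep_samp, ]
```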

3.2.3 Reducing item space

To speed up computation, further preprocessing could be done to reduce the item space. This includes two steps: select a range of SNPs potentially related to CAD based on historical analysis results, and filter SNPs with low support.

Chromosome 9p21 Research on GWAS of CAD has made substantial progress, indicating significant associations between SNPs on Chromosome 9p21 and CAD [5, 11, 21]. In this paper, we use the SNPs on Chromosome 9p21 and expect to find some strong association rules there. According to the Genome Reference Consortium build GRCh37 (hg19), SNPs on Chromosome 9p21 have positions from 19,900,000 to 33,200,000 on Chromosome 9. This information can be used to extract the relevant SNPs with R, which gives us 3758 SNP variables.

Filtering SNPs As the support of a rule can never exceed the minimum support of its contained items, if we want to generate rules with support \(\ge t_{\tiny \hbox { supp}}\), we can first filter out the items with support lower than \(t_{\tiny \hbox { supp}}\). Since each rule from the GWAS transactions data is of the form given in (1) and (2), we are only able to filter out itemsets of the form \(\textbf{J}^{(\tiny \hbox { SNP})}=(J_{\tiny \hbox { no}}^{(\tiny \hbox { SNP})}, J_0^{(\tiny \hbox { SNP})}, J_1^{(\tiny \hbox { SNP})}, J_2^{(\tiny \hbox { SNP})})\) whose support is below \(t_{\tiny \hbox { supp}}\). Table 7 lists the number of SNPs remaining after filtering out the SNP itemsets with support less than various \(t_{\tiny \hbox { supp}}\) values. It suggests that \(t_{\tiny \hbox { supp}}=0.4\) and 0.3 are good choices for the consequent cases \(\textrm{CAD}= 1\) and \(\textrm{CAD} = 0\), respectively, which leaves 1948 SNPs for \(\textrm{CAD} = 1\) and 1950 SNPs for \(\textrm{CAD} = 0\), a roughly 50% reduction from the 3758 SNPs.

Since each SNP is expressed by three items, the two transaction datasets (for \(\textrm{CAD} = 1\) and \(\textrm{CAD} = 0\)) involve \(1948 \times 3 = 5844\) and \(1950 \times 3 = 5850\) SNP items, respectively.
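One plausible reading of this filtering step is sketched below: within the transactions of a given consequent class, a SNP is dropped when none of its three level items reaches the support threshold, since no rule containing that SNP could then reach the threshold either. The object names `geno` and `pheno` are hypothetical, as in the earlier sketches.

```r
# One plausible reading of the support-based SNP filtering: within the chosen
# consequent class, drop a SNP if none of its three level items has support
# >= t_supp (so no rule containing that SNP could reach the threshold).
filter_snps <- function(geno, pheno, class, t_supp) {
  g <- geno[pheno == class, , drop = FALSE]
  keep <- apply(g, 2, function(x)
    max(table(factor(x, levels = 0:2))) / length(x) >= t_supp)
  which(keep)
}

kept_case    <- filter_snps(geno, pheno, class = 1, t_supp = 0.4)   # CAD = 1
kept_control <- filter_snps(geno, pheno, class = 0, t_supp = 0.3)   # CAD = 0
```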

Table 7 Numbers of remaining SNPs for each consequent case after filtering with different \(t_{\tiny \hbox { supp}}\) values

3.2.4 Tuning parameter

In our experiments, the importance of an association rule is defined as the product of its support and confidence. To mine the most important association rules from the two aforementioned transaction datasets (for the \(\textrm{CAD} = 1\) cases and the \(\textrm{CAD} = 0\) controls, respectively), we used the multinomial Gibbs-ARM algorithm to generate 100 rules for each of several tuning parameter \(\xi \) values in each instance. Results on the number of distinct rules generated in the \(\textrm{CAD}=1\) instance are summarized in Table 8, from which we can see the effect of \(\xi \) on rule generation.

Taking the results for the positive cases (\(\textrm{CAD}= 1\)) as an example, we can see three kinds of \(\xi \) values: possibly too low, possibly too high, and seemingly appropriate. For \(\xi = 300\) and \(\xi = 500\), the tuning parameter is probably too small, so the algorithm almost always generates a different rule at each iteration. On the other hand, with \(\xi = 700\) and \(\xi = 1000\) it only generates 11 and 2 distinct rules, respectively, suggesting that the tuning parameter value is probably too high, so that the rule generation process tends to be trapped in a neighbourhood of locally high-importance rules. Consequently, if we are interested in finding the top k rules instead of only the top 1, we need more distinct rules to be generated by the multinomial Gibbs-ARM algorithm. For the current datasets, \(\xi = 550\) or \(\xi = 600\) might be the best options, although the final choice depends on the needs of the analysis. They give 40 and 18 distinct rules, respectively, implying the algorithm can find some high-importance rules from the generated samples without getting stuck in a local-maximum neighbourhood. For the rest of the experiments, \(\xi = 550\) is chosen to generate rules for the \(\textrm{CAD} = 1\) cases.
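This \(\xi \)-selection heuristic can be scripted by simply counting distinct antecedents across a grid of \(\xi \) values, as in the hedged sketch below, which reuses the hypothetical `multinom_gibbs_arm()` sketch and illustrative object names such as `trans_case`, `n_snps` and `disease_col`.

```r
# Count distinct antecedents among N = 100 generated rules for a grid of xi values.
xis <- c(300, 500, 550, 600, 700, 1000)
n_distinct <- sapply(xis, function(xi) {
  s <- multinom_gibbs_arm(trans_case, m = n_snps, consequent = disease_col,
                          xi = xi, N = 100)
  nrow(unique(s))
})
data.frame(xi = xis, distinct_rules = n_distinct)
```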

Table 8 Number of distinct rules generated with different \(\xi \) for \(\textrm{CAD} = 1\)

3.2.5 Rule generation

Table 9 displays the top 10 important rules generated with \(\xi = 550\) for the \(\textrm{CAD}=1\) cases. From the table, we can see that the frequencies of these rules are ordered similarly to their importance values, and the rule with the highest importance coincides with the most frequent rule in the sample: \((\hbox {rs41474551}=2)\rightarrow I_{\tiny \hbox { CAD=1}}\). Moreover, all the SNPs found in the rules are at level 2, which means that individuals carrying 2 minor alleles at those important SNPs may have a higher risk of the disease. This aligns with what we would expect from the genetic point of view, since minor alleles are deemed more likely to be the risk alleles in GWAS. Therefore, we can conclude that the multinomial Gibbs-ARM algorithm worked well, confirming this biological understanding.

Table 9 Top 10 important rules associated with positive class \(\textrm{CAD} = 1\) (\(\xi \) = 550)
Table 10 Top 10 important rules associated with negative class \(\textrm{CAD} = 0\) (\(\xi \) = 2500)

3.2.6 Rules associated with negative class

For the negative class \(\textrm{CAD} = 0\), we first tried \(\xi = 550\) to generate 100 rules using the multinomial Gibbs-ARM algorithm and found that the generated rules are all distinct from each other and have very low importance values. Thus, \(\xi =550\) did not help us find important rules. By running more tests and increasing \(\xi \) up to 2500, the algorithm was finally able to distinguish some interesting rules from the rest, ending up with 30 distinct rules. The top 10 important rules are listed in Table 10. From the table, we can see that the importance values of the negative-class rules are much lower than those of the positive-class rules. This is because of the imbalanced labels in the dataset: there are only 468 negative (\(\textrm{CAD}=0\)) transactions/individuals, so the support of any negative-class rule cannot exceed \(468 / 1401 = 0.3340\). The confidence of the negative-class rules is also lower than that of the positive ones, which means the SNPs generally have weak associations with \(\textrm{CAD}=0\). Nevertheless, our algorithm is still capable of finding rules with a relatively high support of 0.3291, only slightly lower than the largest possible value of 0.3340. Also, 46% of the generated rules are among the top 10 important rules in the dataset, and the most important rule has the highest frequency. This demonstrates the good performance of the algorithm for the negative class as well.

Table 11 Top 10 marginal frequencies of the generated SNP items by the multinomial Gibbs-ARM sampling method for both cases

3.2.7 Item frequencies

The top 10 marginal frequencies of the items generated by the multinomial Gibbs-ARM sampling algorithm for both the \(\hbox {CAD}=1\) and \(\hbox {CAD}=0\) instances are listed in Table 11. As in the simulation examples, the marginal frequencies of the items can be used to assess the associations between the items and the disease. In addition, the items having high frequencies in the generated samples provide a reduced item space in which important rules can be found by the Apriori algorithm. For example, for the \(\textrm{CAD}=1\) class, there are 20 items appearing more than once in the generated samples, and these items, constituting a new item space, could be used to find association rules with the Apriori algorithm. The Gibbs-ARM sampling method used here thus largely serves to reduce the itemset space, so that the computational cost of applying the Apriori algorithm to this new itemset space is also reduced.
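A hedged sketch of this two-step idea is given below: restrict the transactions to the frequently sampled items (a hypothetical character vector `frequent_items`) and run Apriori from the `arules` package on the reduced item space with the disease item fixed as the consequent; `items` is the hypothetical logical transaction matrix from the earlier sketches.

```r
# Two-step mining: reduced item space from the Gibbs-ARM sample, then Apriori.
library(arules)

reduced <- as(items[, c(frequent_items, "I_D")], "transactions")
rules   <- apriori(reduced,
                   parameter  = list(support = 0.05, confidence = 0.5),
                   appearance = list(rhs = "I_D", default = "lhs"))
inspect(head(sort(rules, by = "confidence"), 10))
```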

4 Conclusion

In this paper, we proposed a stochastic multinomial Gibbs sampling based association rule mining algorithm to analyse transactions data obtained from GWAS. In the experiments, we tested the algorithm on three simulated datasets and a real-world dataset, and it showed good performance in all cases. In the real-world example, we considered 1948 SNPs (5844 items) for the positive class and 1950 SNPs (5850 items) for the negative class, significantly more than the 366 SNPs (1098 items) in the dataset used by Qian et al. [17]. This demonstrates the capacity of the algorithm in big data ARM scenarios. Also, the multinomial Gibbs-ARM sampling algorithm substantially extends the capacity of the Bernoulli Gibbs-ARM algorithm in that the former is capable of incorporating the constraints within the itemsets to generate a Markov chain of rules more accurately and efficiently.

The application of our method is not limited to finding the important SNPs related to a certain disease in GWAS. It can be used to find associations between various combinations of categorical variables with different numbers of levels and a certain level of a response variable in any field of study.