1 Introduction

Automatic detection of polyps and cancer in colonoscopy videos is frequently based on supervised machine learning algorithms [3, 12, 13]. These algorithms build models from training sets, which are created from annotations made manually by gastroenterologists. The annotations are produced by segmenting colonoscopy video files into shots representing endoscopic findings (lesions). For instance, the annotation of an observed lesion (cancer) may consist of a begin time, 10:14:30, and an end time, 10:16:27. It is highly likely that shots annotated as lesions contain frames affected by blur, low contrast, noise and/or brightness artifacts; such frames are called non-informative.

Machine learning is an inverse, ill-posed problem: a learning algorithm makes a set of assumptions about the true function that it is trying to model. In general, a classifier should be trained on a sufficiently large number of frames representing most of the possible frame configurations. Otherwise, the classifier is poorly trained and does not provide the expected result. In our case, annotations of cancer or polyps contain non-informative frames that may affect the learning of a model and, consequently, produce a classifier with low accuracy. Thus, a preprocessing step is needed before annotating colonoscopy videos and building machine learning models for the classification of normal-vs-abnormal frames.

Research has been conducted on identifying non-informative frames in colonoscopy videos. Methods based on edge detection and brightness segmentation in order to remove non-informative frames from colonoscopy videos are presented in [14, 19]. Other techniques based on color transformations [10], lumen detection [6, 9], video tracking framework [11], global features [8, 20], and texture analysis [14] were proposed to address this problem.

Results reported in [14] indicate that precision, sensitivity, specificity, and accuracy for the edge-based and the clustering-based classification techniques are greater than 90% and 95%, respectively, using the specular reflection detection technique. However, the comparison done by Rungseekajee and Phongsuphap in [18] between their proposed approach and the edge-based classification in [14] yields lower values for the edge-based classification presented in [14]. In [18] precision 90%, sensitivity 61%, specificity 50%, and accuracy 60% are reported.

In [2], we proposed a method based on edge detection, using the hypothesis that non-informative frames usually do not contain many edges. Experimental evaluation showed accuracy and precision values over 95%. In this paper, we explore the use of texture analysis to automatically classify informative and non-informative frames in colonoscopy videos. The Local Binary Pattern (LBP) operator [15] is used as a texture descriptor and is calculated in the frequency domain, and Support Vector Machines (SVM) [5] are used to build a classifier. The proposed method aims to preprocess data sets by eliminating non-informative frames before they are used in machine learning algorithms. Experimental evaluation showed accuracy values over 97%.

2 Automatic Classification of Non-Informative Frames

For the sake of completeness, non-informative and informative frames are defined here to clarify their meaning in the application domain. A non-informative frame is out of focus or contains bubbles, light reflection artifacts due to wall contact, light reflections on the water used to clean the colon wall, and/or motion blur. An informative frame has well-defined content spread over the whole frame.

The proposed automatic identification of non-informative frames works as follows. Given a video sequence, each frame is converted to gray scale, and the gray-level frame is transformed into the frequency domain using the Discrete Fourier Transform (DFT). Then, the LBP operator is applied at each pixel and a histogram is built to represent the content of each frame. Finally, a classifier is created using the linear SVM algorithm.

2.1 Frequency Domain Transform

The results shown in Fig. 1 indicate that the informative frame contains more low-frequency than high-frequency components. The non-informative frame contains components of all frequencies, with smaller magnitudes for higher frequencies. The non-informative frame also shows a dominant direction in the Fourier image, passing vertically through the centre, which originates from the regular patterns in the original frame. We can observe that the frequency domain provides discriminant information about the frame content.

Each frame is converted into gray-scale and then transformed into the Fourier domain using the following equation [1].

Fig. 1. Illustration of the frequency domain using a non-informative and an informative frame: (a) spatial domain of a non-informative frame, (b) frequency spectrum of (a), (c) spatial domain of an informative frame, and (d) frequency spectrum of (c).

$$\begin{aligned} F(u,v) = \frac{1}{MN}\sum _{x=0}^{M-1}\sum _{y=0}^{N-1}f(x,y)\exp \left[ -i 2 \pi \left( \frac{ux}{M}+\frac{vy}{N}\right) \right] , \end{aligned}$$
(1)

where \(M \times N\) is the frame dimension, f(x, y) is the value at position (x, y) in the spatial domain, and the exponential term is the basis function corresponding to each point F(u, v) in the Fourier space. The equation can be interpreted as follows: the value of each point F(u, v) is obtained by multiplying the spatial-domain frame by the corresponding basis function and summing the result.
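As a minimal sketch of Eq. (1), the transform can be evaluated directly from its definition. The function names `dft2` and `magnitude` are hypothetical, and the frame is assumed to be a small list of rows; this is an illustration of the formula, not the implementation used in the experiments.

```python
import cmath

def dft2(frame):
    """Direct evaluation of Eq. (1) for a frame given as M rows of N values."""
    M, N = len(frame), len(frame[0])
    spectrum = [[0j] * N for _ in range(M)]
    for u in range(M):
        for v in range(N):
            total = 0j
            for x in range(M):
                for y in range(N):
                    # Basis function exp[-i 2 pi (ux/M + vy/N)]
                    angle = -2.0 * cmath.pi * (u * x / M + v * y / N)
                    total += frame[x][y] * cmath.exp(1j * angle)
            # 1/(MN) normalisation, as in Eq. (1)
            spectrum[u][v] = total / (M * N)
    return spectrum

def magnitude(spectrum):
    """Magnitude of every Fourier coefficient (the displayed spectrum)."""
    return [[abs(c) for c in row] for row in spectrum]
```

The direct sum costs on the order of \((MN)^2\) operations, so for real frame sizes a fast Fourier transform routine would normally be used; the result is the same up to the normalisation convention.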

2.2 Texture Analysis

Initially, the texture analysis on the frequency spectrum was conducted using Haralick features such as Angular Second Moment, Contrast, Correlation, Dissimilarity, Entropy, Energy and Uniformity [17], as was done in [14]. However, the results obtained in the experiments were not discriminant enough for classification using SVM.

Fig. 2. Illustration of the LBP descriptor calculation on the frequency domain.

The Local Binary Pattern (LBP) is a common texture descriptor that offers several advantages, such as invariance to monotonic gray-level changes and low computational cost [15]. In our approach, the LBP works on the frequency spectrum using a \(3\times 3\) kernel, where the pixels surrounding the central pixel are thresholded with the value of the central pixel. The thresholding produces a binary string, which is converted into a decimal number and used to replace the central pixel value. Finally, a histogram of the LBP decimal numbers is calculated. The obtained histograms are used as frame content representations to classify colonoscopy frames. Figure 2 illustrates the LBP calculation.
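The thresholding and histogram steps can be sketched as follows. The neighbour ordering (clockwise from the top-left corner), the "greater than or equal" comparison, and the skipping of border pixels are assumptions made for illustration; the text does not specify them, and the names `lbp_codes` and `lbp_histogram` are hypothetical.

```python
# Neighbour offsets, clockwise from the top-left corner (assumed ordering).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_codes(image):
    """Return the 8-bit LBP code of every interior pixel of a 2-D list."""
    rows, cols = len(image), len(image[0])
    codes = []
    for r in range(1, rows - 1):
        row_codes = []
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            for dr, dc in OFFSETS:
                # A neighbour >= centre contributes a 1 bit (assumed convention).
                bit = 1 if image[r + dr][c + dc] >= center else 0
                code = (code << 1) | bit
            row_codes.append(code)
        codes.append(row_codes)
    return codes

def lbp_histogram(image):
    """256-bin histogram of the LBP codes, used as the frame descriptor."""
    hist = [0] * 256
    for row in lbp_codes(image):
        for code in row:
            hist[code] += 1
    return hist
```

For the \(3\times 3\) patch `[[6, 5, 2], [7, 6, 1], [9, 8, 7]]`, the thresholded neighbours read 10001111 in the assumed order, so the central pixel receives the code 143.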

2.3 Classification Using Support Vector Machines

The original algorithm for SVMs was proposed by Vapnik and Lerner in 1963 [21]. The algorithm solves a classification problem for linearly separable data by finding the separating hyperplane that has the maximum distance from the closest input points. This hyperplane, if it exists, is called the maximum-margin hyperplane. The decision rule depends only on the dot product of the training vectors and the unknown point.
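For reference, the decision rule mentioned above is usually written in the following standard textbook form; the notation is an assumption introduced here and is not taken from [21]:

$$\begin{aligned} f(\mathbf {x}) = \mathrm {sign}\left( \sum _{i=1}^{n}\alpha _i y_i \langle \mathbf {x}_i, \mathbf {x}\rangle + b\right) , \end{aligned}$$

where \(\mathbf {x}_i\) are the training vectors, \(y_i \in \{-1, +1\}\) are their labels, \(\alpha _i \ge 0\) are the multipliers obtained during training (non-zero only for the support vectors), b is the bias, and \(\mathbf {x}\) is the unknown point.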

We chose Support Vector Machines (SVMs) for building the classifier because of the nature of the problem, i.e. a binary classification, and the available training data, i.e. a training data set of limited size. The training data are annotated by gastroenterologists, who usually have little spare time for this task.

3 Experimental Evaluation

The performance of the proposed method is evaluated in this section. The data set, experimental settings and evaluation criteria are presented. Tests were conducted on a laptop with Windows 8 Pro x64, an Intel(R) Core(TM) i5 @ 2.60 GHz and 4.00 GB of RAM, and the implementation was done in the C++ programming language. The classification of colonoscopy frames is performed using the LIBSVM library [4], which requires the following parameters: a feature matrix constructed from the histograms of LBP descriptors; a column vector of labels containing the class values, where 1 is used for informative frames and \(-1\) for non-informative ones; and the cost value C. A low cost value, such as \(C=1\), makes the classifier less sensitive to classification errors than a high one.
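The text does not state how the descriptors were passed to the library. If they were exported to LIBSVM's sparse text format (an assumption), each frame would become one line holding the label followed by `index:value` pairs, with 1-based ascending indices and zero-valued bins omitted. The helper name `to_libsvm_line` is hypothetical.

```python
def to_libsvm_line(label, histogram):
    """Format one descriptor as a line of LIBSVM's sparse text format."""
    # Feature indices are 1-based and ascending; zero-valued bins are omitted.
    features = [f"{i}:{v}" for i, v in enumerate(histogram, start=1) if v != 0]
    return " ".join([str(label)] + features)
```

A file written this way could then be trained with a linear kernel and cost 1 using LIBSVM's command-line tool, for example `svm-train -t 0 -c 1 train.txt`, where `train.txt` is a placeholder file name.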

Several videos of complete colonoscopic procedures were recorded at the Hospital Universitario del Valle, and three videos were selected to present evaluation results. Those videos have lengths of 6, 12 and 14 min, respectively, and a frame resolution of \(636 \times 480\); they were recorded at 10 fps using the MP4 format and H.264 compression. Frames are extracted from the video sequences using the FFmpeg multimedia framework [7]. A total of 600 frames were used, taking 100 informative and 100 non-informative frames per video. The selected frames were manually annotated by a gastroenterologist.

A set of metrics commonly used to evaluate the performance of a binary classification is employed [16]. The metrics are based on the confusion matrix and correspond to Sensitivity, Specificity, Accuracy, Precision and F-Measure.
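These metrics follow directly from the four cells of the confusion matrix. The sketch below uses the generic names `tp`, `fp`, `tn` and `fn`; which class is treated as positive is a convention, and the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    # F-measure: harmonic mean of precision and sensitivity
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f_measure": f_measure,
    }
```

For instance, `binary_metrics(tp=90, fp=5, tn=95, fn=10)` uses made-up counts, not results from this paper, and gives a sensitivity of 0.90, a specificity of 0.95 and an accuracy of 0.925. The sketch does not guard against empty classes, which would cause a division by zero.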

3.1 Evaluation Conditions

The proposed method was evaluated under three evaluation conditions, which are illustrated in Fig. 3 and described as follows.

Fig. 3. Illustration of the evaluation conditions of the proposed method.

  (A) The LBP is calculated using the whole frequency domain image. The obtained descriptor is an array of length 256.

  (B) The frequency domain image is divided into \(4\times 4\) blocks and the LBP is calculated on each block. This yields 16 histograms that are concatenated. The obtained descriptor is an array of length 4096.

  (C) The frequency domain image is divided into \(4\times 4\) blocks and only the \(2\times 2\) central blocks are used in the calculation of the LBP. This yields 4 histograms that are concatenated. The obtained descriptor is an array of length 1024.
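The three descriptor lengths follow from concatenating one 256-bin histogram per selected block. The sketch below illustrates this under several assumptions: the neighbour ordering and comparison in the LBP, the skipping of block border pixels, and the dropping of any rows or columns left over when the image size is not a multiple of 4. All function names are hypothetical.

```python
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_histogram(block):
    """256-bin LBP histogram over the interior pixels of a 2-D list."""
    hist = [0] * 256
    for r in range(1, len(block) - 1):
        for c in range(1, len(block[0]) - 1):
            code = 0
            for dr, dc in OFFSETS:
                bit = 1 if block[r + dr][c + dc] >= block[r][c] else 0
                code = (code << 1) | bit
            hist[code] += 1
    return hist

def split_blocks(image, grid=4):
    """Split the image into grid x grid blocks; blocks[i][j] is a 2-D list."""
    bh, bw = len(image) // grid, len(image[0]) // grid
    return [[[row[j * bw:(j + 1) * bw] for row in image[i * bh:(i + 1) * bh]]
             for j in range(grid)] for i in range(grid)]

def descriptor(image, condition):
    """Concatenated LBP histograms for evaluation condition 'A', 'B' or 'C'."""
    if condition == "A":
        selected = [image]                      # whole image, 1 histogram
    else:
        blocks = split_blocks(image)
        # 'B' uses all 16 blocks, 'C' only the 2 x 2 central ones.
        idx = range(4) if condition == "B" else range(1, 3)
        selected = [blocks[i][j] for i in idx for j in idx]
    out = []
    for block in selected:
        out.extend(lbp_histogram(block))
    return out
```

With any image whose sides are at least 12 pixels, the descriptor has 256 entries for condition A, \(16 \times 256 = 4096\) for condition B and \(4 \times 256 = 1024\) for condition C.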

3.2 Results and Discussion

The SVM classification model is calculated using 80% of the data set—400 frames—as a training set and the remaining frames as test set—200 frames.

Table 1. Confusion matrices calculated under the three evaluation conditions

Table 1 contains the confusion matrices calculated under the three evaluation conditions. It presents the number of correctly classified frames, using TI for True Informative and TN for True Non-Informative, and the number of incorrectly classified frames, using FI for False Informative and FN for False Non-Informative. Evaluation condition C yields the lowest number of false informative frames, whilst evaluation condition B yields the lowest number of false non-informative frames.

Table 2. Performance metric values

The performance measures are calculated to assess the capability of correctly classifying non-informative frames. Table 2 contains the performance measure values obtained under the three evaluation conditions. The highest values of accuracy and precision were obtained under evaluation condition C, and the lowest values were obtained under evaluation condition B.

The obtained results may be interpreted by looking at Fig. 3, where frequency magnitudes get smaller for higher frequencies, which are located in the external blocks. In this way, the LBP calculated using the four central blocks captures the dominant content of the frequency domain.

Table 3 contains the results obtained with the edge-based approach presented in [2]. The proposed approach, using the LBP computed on the central blocks in the frequency domain, outperforms the edge-based approach that we proposed in [2] on the same training image set.

Table 3. Performance metrics reported in [2]

4 Final Remarks

In this paper, a method based on texture analysis was proposed to classify colonoscopy frames into two categories: informative and non-informative. The method uses the LBP descriptor, calculated in the frequency domain, as the texture feature and the SVM as the classifier.

The proposed classification method is able to correctly detect frames without relevant information, which should not be used for training machine learning algorithms in the classification of normal-vs-abnormal frames.

Moreover, if frames classified as non-informative are deleted, the proposed method may be used to significantly reduce the duration of videos before they are analysed by gastroenterologists.

The metrics used to evaluate the performance of the proposed method show that the accuracy and the precision are over 95% when the central blocks in the frequency domain are used to calculate the LBP descriptor and that, in general, the proposed method outperforms the approach proposed in [2].