1 Introduction

Material recognition from visual appearance is an important problem in computer vision, with numerous applications spanning from scene analysis to robotics and industrial inspection. Knowledge of the material of an object or surface can provide useful information about its properties: metallic objects are usually rigid, while fabric is deformable and glass is fragile. Humans use this kind of information daily when they interact with the physical world. For example, they are more careful when they hold a glass bottle than a plastic one, and they avoid walking on ice or other slippery surfaces. It would be useful in many applications, such as robotic manipulation or autonomous navigation, if a computer vision system had the same ability.

Due to the limited amount of data available, the problem of material recognition/classification was not among the most popular computer vision areas [1–7]. But with the advent of the Internet and crowd-sourced annotation, the collection and annotation of large databases of materials and textures became feasible, which in turn brought renewed attention to the problem of material classification [8–12]. The current trend is to use materials captured in unconstrained conditions, also referred to as “in-the-wild”, and mainly at the large scale [8–11]. In this paper, we take a different direction which has, to the best of our knowledge, received limited attention: we study the problem of fine-grained material, and particularly fabric, classification at the fine scale, introducing the first large database suitable for the task.

Recognizing materials from images is a challenging task because the appearance of a surface depends on a variety of factors such as its shape, its reflectance properties, the illumination, the viewing direction, etc. Material recognition is usually treated as texture classification: statistical learning techniques have been employed [1–3] on databases of different materials captured under multiple illumination and viewing directions [13–15]. Despite the very high classification rates achieved this way [3], the introduction of data captured “in-the-wild” has recently shown that the problem is extremely challenging in real-world conditions.

On the other hand, humans do not rely only on vision to recognize materials. The sense of touch can be really useful for discriminating materials with subtle differences. For example, when someone wants to find out what material a garment is made of, they usually rub it to assess its roughness. This gives us the hint that the micro-geometry of the garment surface provides useful information about its material.

Fig. 1. Samples from our database. From left to right, top to bottom: cotton, terrycloth, denim, fleece, nylon, polyester, silk, viscose, and wool.

In this paper, we study how the micro-geometry and the reflectance of a surface can be used to infer its material. To recover the micro-geometry we use photometric stereo, which gives us the normal map and the albedo of the surface from images captured under different illumination conditions. An advantage of photometric stereo over other techniques, such as binocular stereo, is that there is no correspondence problem between images: all images are captured from the same viewpoint and only the illumination differs. That makes photometric stereo a fast and computationally efficient technique, well suited to the time constraints of a real-world application.

To be able to apply photometric stereo in unconstrained environments, we built a portable photometric stereo sensor using low-cost, off-the-shelf components. The sensor recovers the normal map and the albedo of various surfaces quickly and accurately, and has the ability to capture fine micro-structure details, as we can see in Fig. 1. Using the sensor we collected surface patches of over 2000 garments. Knowing the fabric of a garment can help robots manipulate clothes more robustly and facilitate the automation of household chores like laundry [16] and ironing.

In contrast to most datasets [17, 18], where fabrics are treated as a single material class, we investigate the problem of classifying 9 different fabric classes: cotton, denim, fleece, nylon, polyester, silk, terrycloth, viscose, and wool. This fine-grained classification problem is much more challenging due to the large intra-class and small inter-class variance that fabrics exhibit. We perform a thorough evaluation of various feature encoding methodologies, such as Bag of Visual Words (BoVW) [1], Pyramid Histogram Of visual Words (PHOW) [19], Fisher Vectors (FV) [20, 21], and the Vector of Locally Aggregated Descriptors (VLAD) [22], using both SIFT [23] and features from Convolutional Neural Networks (CNNs) [24]. We show that using both normals and albedo improves the classification accuracy. In summary, our main contributions are:

  • The first large, publicly available database of garment surfaces captured under different illumination conditions.

  • An evaluation of texture features and encodings which shows that the micro-geometry and reflectance of a fabric can be used to classify its material more accurately than by using its texture.

2 Related Work

At first, texture classification was studied on 2-D patches from image databases such as the Brodatz textures [25]. One of the first efforts to collect a large database suitable for studying the appearance of real-world surfaces was made in [13]. The so-called CUReT database contains image textures from around 60 different materials, each observed under over 200 different combinations of viewing and illumination directions. Leung and Malik [1] used the CUReT database for material classification, introducing the concept of 3-D textons. They used a filter bank to obtain filter responses for each class of the CUReT database and clustered them to obtain the 3-D textons. They then created a model for every training image by building its texton frequency histogram and classified each test image by comparing its model with the learned models, introducing what is now known as the BoVW representation. In [1] a set of 48 filters was used for feature extraction (first and second derivatives of Gaussians at 6 orientations and 3 scales, making a total of 36, plus 8 Laplacian of Gaussian filters and 4 Gaussians). A similar approach was followed in [2] using the so-called MR8 filter bank. The MR8 filter bank consists of 38 filters but produces only 8 filter responses: to achieve rotation invariance, the bank contains filters (Gaussian and Laplacian of Gaussian derivatives) at multiple orientations, but only the maximum filter response across all orientations is kept. Later, Varma and Zisserman [3] questioned the necessity of filter banks by achieving comparable results with image patch exemplars as small as \(3 \times 3\). The KTH-TIPS (Textures under varying Illumination, Pose and Scale) image database was created to extend the CUReT database by providing variations in scale [4]. Despite the almost perfect classification rates on the CUReT database, the results of the same algorithms on KTH-TIPS showed that performance drops significantly [5].
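To make the texton pipeline above concrete, the following is a minimal sketch (our own illustration, not the implementation of [1], and with a drastically reduced filter bank of Gaussians and Laplacians of Gaussian instead of the full 48-filter set): per-pixel filter responses are clustered with k-means to form textons, and each image is represented by its texton frequency histogram.

```python
import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def texton_histograms(images, n_textons=16, sigmas=(1.0, 2.0, 4.0)):
    """Reduced sketch of the 3-D texton pipeline: filter responses ->
    k-means textons -> per-image texton frequency histograms."""
    responses = []
    for img in images:
        # Per-pixel responses of a small Gaussian / LoG filter bank.
        feats = [ndimage.gaussian_filter(img, s) for s in sigmas]
        feats += [ndimage.gaussian_laplace(img, s) for s in sigmas]
        responses.append(np.stack(feats, axis=-1).reshape(-1, 2 * len(sigmas)))
    # Cluster all responses to obtain the texton vocabulary.
    km = KMeans(n_clusters=n_textons, n_init=10, random_state=0)
    km.fit(np.vstack(responses))
    hists = []
    for r in responses:
        h = np.bincount(km.predict(r), minlength=n_textons).astype(float)
        hists.append(h / h.sum())  # normalised texton frequency histogram
    return np.array(hists)
```

Classification then proceeds by comparing histograms of test images against those of the training images, e.g. with a nearest-neighbour rule.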

Sharan et al. created the more challenging FMD material database [18], with images taken from the Flickr website under unknown real-world conditions (10 materials, 100 samples per material). Liu et al. [26] used the FMD and a Bayesian framework to achieve a 45 % classification rate, while Hu et al. [27] improved the accuracy to 54 % using the Kernel Descriptor framework. Sharan et al. [7] used perceptually inspired features to further improve the results to 57 %.

Recently, the authors of [10] presented the first database of textures “in-the-wild” annotated with various attributes (including the material of the texture), and in [9] a very large-scale, open dataset of materials in the wild, the so-called Materials in Context Database (MINC), was proposed. Another recent database, UBO2014 [28], was created for synthesising training images for material classification (7 material categories, each consisting of measurements of 12 different material samples, measured in a darkened lab environment under controlled illumination). The above databases are suitable for training and testing generic material classification methodologies from appearance information.

In this paper we study the problem of fine-grained material classification using both the reflectance properties (albedo) and the micro-geometry (surface normals) of a surface. The most closely related work on fusing albedo and normals for material classification is [29], which proposes a rotationally invariant classification method. The authors successfully applied their method to real textures, but they used only 30 samples belonging to various material classes (fabric, gravel, wood). In contrast, our database contains around 2000 samples and is suitable for fine-grained classification.

Regarding the recognition of materials using micro-geometry, the only relevant published technology is the tactile GelSight sensor [30, 31]. In [30] the authors used the GelSight sensor to collect a database of 40 material classes with 6 samples per class (the database is not publicly available). The GelSight sensor is able to provide very high resolution height maps but cannot recover albedo [31], contrary to our sensor, which provides both micro-geometry and albedo.

Fig. 2. The photometric stereo sensor on a table (left) and installed in the gripper of an industrial robot (right)

Fig. 3. Bottom view of our photometric stereo sensor with all lights off (left) and with one light on (right)

Fig. 4. Albedo, normals, and 3-D shape of our calibration target

Fig. 5. The 9 main fabric classes with albedo and 3-D shape

3 Photometric Stereo Sensor

Photometric stereo [32] is a computer vision method that uses three or more images of a surface, taken from the same viewpoint but under different illumination conditions, to estimate the normal vector and albedo at each point. By integrating the normal map, the 3-D shape of the surface can be recovered. Photometric stereo has found application in a wide variety of fields, ranging from medicine [33] to security [34] and from robotics [35] to archaeology [36].
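As a minimal sketch of the estimation step (a plain Lambertian least-squares solve for illustration only; our pipeline uses the more robust method of Barsky and Petrou [37] described below), the per-pixel albedo and normal can be recovered as follows:

```python
import numpy as np

def photometric_stereo(images, lights):
    """Recover per-pixel albedo and unit normals from k >= 3 grayscale
    images taken from one viewpoint under known directional lights.

    images: (k, h, w) array of intensities
    lights: (k, 3) array of light direction vectors
    """
    k, h, w = images.shape
    I = images.reshape(k, -1)                       # (k, h*w)
    # Lambertian model: I = L @ g, where g = albedo * normal.
    g, *_ = np.linalg.lstsq(lights, I, rcond=None)  # (3, h*w)
    albedo = np.linalg.norm(g, axis=0)
    normals = np.where(albedo > 1e-8, g / np.maximum(albedo, 1e-8), 0.0)
    return albedo.reshape(h, w), normals.reshape(3, h, w)
```

With more than three lights (as in our four-light sensor) the system is overdetermined and the least-squares solution averages out noise; robust variants additionally discard shadowed or specular measurements per pixel.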

We used this method to acquire the micro-geometry and the reflectance properties of garment surfaces. In order to capture a large number of garments, we built a small, portable photometric stereo sensor using low-cost, off-the-shelf components, Fig. 2. The sensor consists of a webcam with adjustable focus, white light-emitting diode (LED) arrays, and a micro-controller that synchronizes the LED arrays with the camera. It features an easy-to-use USB interface that connects directly to a standard USB port on a computer. All the components are enclosed in a 3-D printed cylindrical case with a diameter of 38 mm and a height of 40 mm. The case also has a rectangular base that allows the sensor to be mounted on the gripper of a robot, Fig. 2 (right).

The sensor captures 4 RGB images of a small patch of the inspected surface, one per light source, in less than a second. The working distance of the camera is 3 cm, and to achieve focus we had to modify the camera optics appropriately. The sensor resolution is \(640 \times 480\) pixels, but we crop all images to \(400 \times 400\) pixels to avoid out-of-focus regions at the corners caused by the very shallow depth of field. The corresponding field of view is approximately 10 mm \(\times \) 10 mm, adequate to capture the micro-structure of most fabrics.

We use near-grazing illumination to maximize contrast and eliminate specularities. Because the light sources are close to the inspected surface, the lighting field is not uniform. To address this issue, we captured with our sensor a flat surface with known albedo and use these images to normalize every new surface we capture.
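The normalization step amounts to a flat-field correction: each captured image is divided, pixel by pixel, by the reference image of the flat surface taken under the same light source. A minimal sketch (our own illustration; the helper name and the uniform-albedo assumption are ours):

```python
import numpy as np

def flat_field_normalize(image, reference, ref_albedo=1.0, eps=1e-6):
    """Correct non-uniform illumination by dividing the captured image
    by the image of a flat reference surface of known, uniform albedo,
    captured under the same light source."""
    return ref_albedo * image / np.maximum(reference, eps)
```

After this correction, intensity variations across the patch are due to the surface itself rather than to the uneven lighting field.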

As can be seen in Fig. 3, we use LED arrays to illuminate the inspected surface uniformly. However, at such a close distance the illumination is not directional, and we have to know the light vector at each position in order to apply photometric stereo. To do so, we placed a chrome sphere (2 mm diameter) at different locations to calibrate our lights and found the light vectors for each pixel by interpolation. Next, we run photometric stereo to obtain the albedo and the normal map, using the method of Barsky and Petrou [37] to deal with shadows and highlights. Finally, we integrate the normals to obtain the 3-D shape of the surface [38].
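For the final integration step, a standard choice is the Frankot-Chellappa algorithm, which projects the gradient field implied by the normals onto the space of integrable surfaces in the Fourier domain. The sketch below is our own illustration of that algorithm and is not necessarily the method of [38]:

```python
import numpy as np

def integrate_normals(normals):
    """Frankot-Chellappa-style integration of a (3, h, w) normal map
    into a height map, up to an additive constant."""
    nx, ny, nz = normals
    nz = np.where(np.abs(nz) < 1e-6, 1e-6, nz)   # guard near-horizontal normals
    p, q = -nx / nz, -ny / nz                    # surface gradients dz/dx, dz/dy
    h, w = p.shape
    wx = np.fft.fftfreq(w) * 2 * np.pi
    wy = np.fft.fftfreq(h) * 2 * np.pi
    WX, WY = np.meshgrid(wx, wy)
    denom = WX**2 + WY**2
    denom[0, 0] = 1.0                            # avoid division by zero at DC
    Z = (-1j * WX * np.fft.fft2(p) - 1j * WY * np.fft.fft2(q)) / denom
    Z[0, 0] = 0.0                                # height defined up to a constant
    return np.real(np.fft.ifft2(Z))
```

The FFT formulation makes the integration fast enough for interactive use, at the cost of implicitly assuming periodic boundary conditions.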

We evaluated the accuracy of our sensor using a calibration target consisting of a \(2 \times 2\) grid of spheres of 2 mm diameter. As we can see in Fig. 4, the sensor can accurately reconstruct the shape of the spheres. Since our goal is to capture the geometry of a surface for classification, we did not assess the reconstruction accuracy further.

4 Dataset

Using the photometric stereo sensor, we collected samples of the surfaces of over 2000 garments and fabrics. We visited many clothes shops with the sensor and a laptop and captured all our images “in the field”. For every garment we kept information (attributes) about its material composition, taken from the manufacturer's label, and its type (pants, shirt, skirt, etc.). The dataset reflects the distribution of fabrics in the real world and is therefore not balanced: the majority of clothes are made of specific fabrics, such as cotton and polyester, while other fabrics, such as silk and linen, are rarer. Also, a large number of clothes are not composed of a single fabric; two or more fabrics are often combined to give the garment the desired properties. Figure 5 presents samples of the major classes, along with their albedo and 3-D micro-geometry.

4.1 Data Collection

The procedure we followed to capture our fabric dataset is depicted in Fig. 6. The photometric stereo sensor is placed on a flat region of the garment equal to the field of view (10 mm \(\times \) 10 mm). Initially, the sensor turns on one light source to allow the camera to focus on the surface of the garment; due to the shallow depth of field, this step is very important. Subsequently, we capture 4 images, one for each light source, and normalize them using reference images of a flat surface with known albedo.

Fig. 6. Capturing setup

Figure 6 shows the graphical user interface we use to collect the data. In the top row the 4 captured images are shown, while in the bottom row we visualize the albedo, the normal map (as an RGB image) and the 3-D shape of the surface. On the right side we can select the fabric of the garment (cotton, wool, etc.) as well as its type (t-shirt, pants, etc.).

5 Experiments

In this section we describe the experiments we ran on our fabric database for fine-grained material and garment classification. A very important aspect of every classification system is the selection of features. In our case, we want features that distinguish fabric classes from one another, and we are mostly interested in features that represent the micro-geometry and the reflectance of the surface, as encoded in the normals and the albedo respectively. To this end, we followed similar steps as [10]. We used handcrafted features, i.e. dense SIFT [23], as well as features coming from deep Convolutional Neural Networks (CNNs) [24, 40]. Deep learning, and especially CNNs, dominates the field of object and speech recognition; a CNN is a hierarchical model built from a succession of convolutions and non-linearities and can learn very complex relationships, provided that a large amount of data exists. In our case we used the pre-trained VGG-M [40] network, which has 8 layers: 5 convolutional plus 3 fully connected. We applied the feature extractors (dense SIFT and CNNs) to (a) the albedo images, (b) the original images under the different illuminants, (c) the normals (concatenating the three components, \(N_x, N_y, N_z\), into a single image), and (d) the fusion of normals and albedo (a larger image consisting of the albedo plus the three components of the normals). The following feature encodings have been applied to the local (dense SIFT) features:

  • Feature encoding via BoVW. BoVW applies Vector Quantisation (VQ) [41] to local features (i.e., responses of linear or non-linear filters) by mapping them to the closest visual word in a dictionary. The visual words (the vocabulary) are prototypical features, i.e. cluster centroids, computed by applying a clustering method (e.g., K-means [42]). The BoVW encoding is a vector of occurrence counts over the above vocabulary of visual words. BoVW was first proposed in [1] for the purpose of material classification.

  • Another popular orderless encoding is VLAD [22]. VLAD, similarly to BoVW, applies VQ to a collection of local features. It differs from the BoVW image descriptor in that it records the residuals from the cluster centers rather than the number of local features assigned to each cluster; that is, VLAD accumulates first-order descriptor statistics instead of the simple occurrence counts of BoVW.

  • A feature encoding that accumulates both first- and second-order statistics of local features is the Fisher Vector (FV) [21]. FV uses a soft clustering assignment based on a Gaussian Mixture Model (GMM) [43] instead of K-means; the soft assignments (posterior probabilities of each GMM component) are used to weight the first- and second-order statistics of the local features.
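The first two encodings above can be sketched in a few lines (our own simplified illustration over a K-means vocabulary, omitting the spatial pyramids and normalisation variants used in practice):

```python
import numpy as np
from sklearn.cluster import KMeans

def bovw_encode(descriptors, kmeans):
    """BoVW: normalised histogram of nearest-visual-word assignments."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def vlad_encode(descriptors, kmeans):
    """VLAD: sum of residuals from the assigned cluster center per word,
    concatenated and L2-normalised."""
    k, d = kmeans.n_clusters, descriptors.shape[1]
    words = kmeans.predict(descriptors)
    v = np.zeros((k, d))
    for i in range(k):
        assigned = descriptors[words == i]
        if len(assigned):
            v[i] = (assigned - kmeans.cluster_centers_[i]).sum(axis=0)
    v = v.ravel()
    return v / max(np.linalg.norm(v), 1e-12)
```

The FV encoding follows the same pattern but replaces the hard K-means assignments with GMM posteriors and also accumulates second-order (variance) statistics.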

In the case of CNN features, we either fed them directly to the classifier or applied FV encoding before classification [11]. The classifier we used was a linear one-versus-all Support Vector Machine (SVM) [44]. We performed four-fold cross-validation (i.e., using \(75\,\%\) of the data for training and \(25\,\%\) for testing in each fold).
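The classification protocol can be sketched as follows (a generic scikit-learn illustration of a linear one-versus-all SVM with four-fold cross-validation; the paper's experiments use VLFeat and MatConvNet, and the `C` value here is an assumption):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score

def evaluate(features, labels):
    """Mean accuracy of a linear one-versus-all SVM under 4-fold
    cross-validation (75% train / 25% test per fold)."""
    clf = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=10000))
    scores = cross_val_score(clf, features, labels, cv=4)
    return scores.mean()
```

Here `features` would be the pooled encodings (BoVW, VLAD, or FV vectors) and `labels` the fabric or garment class of each sample.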

In all experiments, we used the VLFeat library [45] to extract the SIFT features and to compute the feature encodings (BoVW, VLAD, FV). For the CNNs we used the MatConvNet library [46].

5.1 Results

For our classification experiments we did not use garments with blended fabrics; we kept only those whose composition is 95 % one material. We also discarded fabric classes with too few samples. For our experiments we therefore chose a subset of 1266 samples, each belonging to one of the following fabric classes: cotton, terrycloth, denim, fleece, nylon, polyester, silk, viscose, and wool. The number of samples in each class is shown in Table 1.

Table 1. Number of samples per class for fabric classification

Table 2 summarizes the results of our experiments for fabric classification. As we can see, the combination of albedo and normals consistently gives slightly better accuracy. This is true not only for the handcrafted features but also for the learned CNN features, providing evidence that using both geometry and texture is useful for fine-grained material classification problems.

Table 2. Results for fabric classification

We also ran experiments on garment recognition to investigate whether the garment type can be estimated from the micro-geometry or the albedo. Here we split our dataset into 6 classes: blouses, jackets, jeans, pants, shirts, and t-shirts. Table 3 shows the number of samples per garment class.

Table 3. Number of samples per class for garment classification

We used the same approach as for fabric classification, testing SIFT features with different encodings as well as CNN features. The results are presented in Table 4. This problem is clearly more difficult and requires further research, but in this case too the fusion of albedo and normals gives the best results.

Table 4. Results for garment classification

Figure 7 presents the confusion matrix for the combination of albedo and normals using fully connected deep convolutional features from the VGG-M model with FV pooling. This model achieved 79.6 % average accuracy, surpassing both dense SIFT features with FV pooling and convolutional CNN features with FV pooling (74.7 % and 76.1 % respectively). As we can see, cotton, terrycloth, denim, and fleece are recognized with almost no errors. On the other hand, classes like nylon and viscose are more difficult to classify due to their glossy appearance and the small number of samples in our dataset.

Fig. 7. Confusion matrix of the best fabric classifier (albedo and normals with FC-CNN and FV)

The same combination of local features and pooling encoder (FC-CNN + FV) also gives the best result for garment classification. Here the accuracy is 64.6 %, while dense SIFT gives 55.6 % and CNN + FV gives 57.1 %. As the confusion matrix in Fig. 8 shows, only the classification rate for jeans is high. This is expected, since jeans are made of denim and our system can recognize denim accurately (Fig. 7). On the contrary, all other garment types can be made of various fabrics, and there is no one-to-one correspondence between fabric and garment class.

Fig. 8. Confusion matrix of the best garment classifier (albedo and normals with FC-CNN and FV)

6 Conclusion

Material classification is an important but challenging problem for computer vision. This work focuses on recognizing the material of clothes. Inspired by the fact that humans use their sense of touch to infer the fabric of a garment, we investigated the use of the micro-geometry of garments to recognize their materials. We built a portable photometric stereo sensor and introduced the first large-scale dataset of garment surfaces, belonging to 9 different fabric classes. We tested a number of different features for material classification and showed that using both the micro-geometry and the reflectance properties of a fabric can improve the classification rate. Although there is room for improvement, our system is an efficient and practical solution for material classification that can be applied in many different scenarios in robotics and industrial inspection.