
1 Introduction

First-person vision (FPV) uses wearable cameras to record a scene from the point of view of the wearer. FPV applications include lifelogging, video summarisation and activity recognition. Datasets are important to support the development and testing of algorithms and classification pipelines for FPV applications. Publicly available FPV datasets are mainly focused on activities such as cooking. LENA [14], for example, contains 13 activities performed by 10 subjects using Google Glass, with each subject recording two 30-second clips per activity. The activities are grouped into motion (walking and running), social interaction (e.g. talking on a phone and to people), office work (writing, reading, watching videos, browsing the internet), food (eating and drinking) and housework. LENA treats different varieties of walking, namely walking straight, walking back and forth and walking up/down, as distinct activities, and includes challenges such as scene and illumination variations.

In this paper, we present FPV-O, a dataset of office activities in first-person vision available at http://www.eecs.qmul.ac.uk/~andrea/fpvo.html. FPV-O contains 20 activities performed by 12 subjects with a chest-mounted camera (see Fig. 1). The activities include three person-to-person interaction activities (chat, shake and wave), sixteen person-to-object interaction activities (clean and write on a whiteboard; use a microwave; use a drink vending machine, take a drink from the vending machine, open and drink; use a mobile phone; read and typeset on a computer; take a printed paper, staple, skim over and handwrite on a paper; wash and dry hands) and one proprioceptive activity (walk). The more stable chest-mount solution is often preferred [2, 8, 20, 21] over the head-mount solution, which is affected by head motion [7, 11]. The larger number of activities and subjects in FPV-O compared to other datasets creates classification challenges due to intra-activity differences and inter-activity similarities (see Fig. 2). The dataset is distributed with its annotation, features extracted with both hand-crafted methods and deep learning architectures, and classification results with baseline classifiers. FPV-O includes around three hours of video and contains the highest number of both subjects (12) and activities (20) compared to other office activity datasets. The dataset will be made available with the camera-ready paper.

The paper is organized as follows. Section 2 provides the details of the FPV-O dataset, the definition and duration of the activities as well as the contribution of each subject to the dataset. Section 3 describes state-of-the-art features extracted for the recognition of activities in FPV. Section 4 presents the baseline results of the feature groups (and their concatenation) in classifying office activities. Finally, Sect. 5 concludes the paper.

2 The FPV-O Dataset

We used a chest-mounted GoPro Hero3+ camera with \(1280\times 720\) resolution and 30 fps frame rate. Twelve subjects (nine male and three female) participated in the data collection. Each subject recorded a continuous video sequence of approximately 15 min on average, resulting in a total of 3 h of video. The FPV-O activities (see Table 1) extend the scope of existing office activity datasets [9, 10, 14] by including, for example, more object-interaction activities that are commonly performed in a typical office environment, such as using a printer, a microwave, a drink vending machine, a stapler and a computer. The number of classes (20) is larger than in existing office activity datasets [9, 10, 14] (Table 2).

Fig. 1. Keyframes of activities from sample videos in the FPV-O dataset.

The detailed contribution of each subject in the FPV-O dataset for each class is given in Table 3. The annotation includes start and end times of each class segment for each video sequence. The ground-truth labels were generated using ELAN [16].

Table 1. Definition of activities in the FPV-O dataset. P-P: person-to-person interactions; P-O: person-to-object interactions; P: proprioceptive activity.
Table 2. Summary of the FPV-O dataset that describes the number of video segments (# segments) per subject, \(S_i\), \(i \in \{1,\dots,12\}\), and the overall duration (Dur.) in minutes (min). M: male; F: female.
Table 3. Details of the contribution of each subject, \(S_i\), \(i \in \{1,\dots,12\}\), to FPV-O for each class in number of frames. Note that some activities are not performed by some subjects.

The challenges in FPV-O include intra-activity differences, i.e. an activity performed differently by different subjects due to differences in pose (see Fig. 2). Chest-mounting may also result in different fields of view for male and female subjects (see Fig. 2a). Additional challenges include inter-activity similarities, e.g. chat and shake (Fig. 2b), and read and typeset (Fig. 2c). Some inter-activity similarities could be avoided by merging the similar activities into a macro activity, e.g. read and typeset could be merged into a use-a-computer activity. Some activities may occur only for a very short duration, e.g. open has only 146 frames, whereas typeset has 28,561 frames (see Table 3). This exemplifies the class imbalance in FPV-O, as data-scarce activities such as open, wave and shake provide limited information for training. There are also illumination changes, as indoor lighting often mixes with daylight (Fig. 2d). Moreover, feature extraction is made challenging by motion blur and lack of texture (Fig. 2e).

Fig. 2. Sample frames that illustrate the challenges in the FPV-O dataset.

3 The Features

We selected three frequently employed existing methods to extract discriminative features for office activity classification in FPV-O: average pooling (AP) [17,18,19], robust motion features (RMF) [1, 2] and pooled appearance features (PAF) [1, 12], which we describe below.

Let \(\mathbf{V} = (V_1, \cdots, V_n, \cdots, V_N)\) be N temporally ordered activity samples from a subject. Each sample, \(V_n\), contains a window of L frames, i.e. \(V_n = (f_{n,1}, f_{n,2}, \cdots, f_{n,i}, \cdots, f_{n,L})\). Successive \(V_n\) pairs may overlap. Each of AP, RMF and PAF provides a feature representation for \(V_n\). AP and RMF mainly exploit motion using optical flow, whereas PAF encodes appearance information. The grid optical flow of \(V_n\) is \(G_n = (g_{n,1}, g_{n,2}, \cdots, g_{n,i}, \cdots, g_{n,L-1})\), where \(g_{n,i} = g^x_{n,i} + jg^y_{n,i}\) represents a flow vector between successive frames, \(f_{n,i}\) and \(f_{n,i+1}\), \(i \in [1, L-1]\). The superscripts x and y represent the horizontal and vertical components, respectively. \(\gamma\) is the number of grid cells along each of the horizontal and vertical directions, which results in \(\gamma^2\) grid cells per frame. AP [18] averages each element across the \(L-1\) grid flow vectors in \(G_n\), which helps discard noise. After smoothing, the final AP representation is the concatenation of the horizontal and vertical grid components.
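A minimal sketch of grid optical flow and AP-style pooling under the definitions above, assuming grayscale frames and OpenCV's Farneback flow; it is not the authors' implementation, and the flow parameters are illustrative.

```python
import cv2
import numpy as np

def grid_flow(prev_gray, curr_gray, gamma=20):
    """Dense flow between two frames, averaged over a gamma x gamma grid."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    H, W = prev_gray.shape
    gx = np.zeros((gamma, gamma))
    gy = np.zeros((gamma, gamma))
    for r in range(gamma):
        for c in range(gamma):
            cell = flow[r*H//gamma:(r+1)*H//gamma, c*W//gamma:(c+1)*W//gamma]
            gx[r, c] = cell[..., 0].mean()   # horizontal component g^x
            gy[r, c] = cell[..., 1].mean()   # vertical component g^y
    return gx, gy

def ap_feature(frames, gamma=20):
    """AP: average each grid-flow element over the L-1 flow fields, then
    concatenate horizontal and vertical components -> 2*gamma^2 dimensions."""
    flows = [grid_flow(frames[i], frames[i+1], gamma) for i in range(len(frames)-1)]
    gx = np.mean([f[0] for f in flows], axis=0)
    gy = np.mean([f[1] for f in flows], axis=0)
    return np.concatenate([gx.ravel(), gy.ravel()])  # 800-D for gamma=20
```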

RMF [2] extracts more discriminative features by encoding the direction, magnitude and frequency characteristics of \(G_n\). RMF contains two parts: grid optical flow-based features (GOFF) and centroid-based virtual inertial features (VIF).

GOFF is extracted from the histogram and Fourier transform of the motion direction, \(G^\theta_n = \operatorname{arctan2}(G^y_n, G^x_n)\), and motion magnitude, \(|G_n| = \sqrt{|G^x_n|^2 + |G^y_n|^2}\). The histogram representations quantize \(G^\theta_n\) and \(|G_n|\) with \(\beta_d\) and \(\beta_m\) bins, respectively. The frequency representations are derived by grouping the frequency response magnitudes of \(G^\theta_n\) and \(|G_n|\) into \(N_d\) and \(N_m\) bands, respectively. VIF is a virtual-inertial feature extracted from the movement of the intensity centroid, \(C_n = C^x_n + jC^y_n\), across the frames in \(V_n\). The intensity centroid is computed from first-order image moments as \(C^x_n = \mathcal{M}_{01}/\mathcal{M}_{00}\) and \(C^y_n = \mathcal{M}_{10}/\mathcal{M}_{00}\). For a frame, \(f_{n,i}\), which is H pixels high and W pixels wide, the image moments are calculated from the weighted average of all the intensity values as \(\mathcal{M}^i_{pq} = \sum_{r=1}^{H} \sum_{c=1}^{W} r^p c^q f_{n,i}(r,c)\), where \(p,q \in \{0,1\}\). Once the centroid locations are computed across frames, successive temporal derivatives are applied to obtain the corresponding velocity, \(\dot{C}_n\), and acceleration, \(\ddot{C}_n\), components. VIF features are then extracted from \(\dot{C}_n\) and \(\ddot{C}_n\) in the same way that inertial features are extracted from accelerometer and gyroscope data, e.g. minimum, maximum, energy, kurtosis, zero-crossings and low-frequency coefficients.
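The following sketch, under the same definitions, shows the histogram part of GOFF and the centroid track used by VIF; the frequency-band features and the time/frequency statistics on the velocity and acceleration (min, max, energy, kurtosis, zero-crossings, low-frequency coefficients) are omitted, and the function names are illustrative.

```python
import numpy as np

def goff_histograms(gx, gy, beta_d=36, beta_m=15):
    """Quantise grid-flow direction and magnitude into beta_d and beta_m bins."""
    theta = np.arctan2(gy, gx)                     # motion direction per grid cell
    mag = np.sqrt(gx**2 + gy**2)                   # motion magnitude per grid cell
    h_dir, _ = np.histogram(theta, bins=beta_d, range=(-np.pi, np.pi), density=True)
    h_mag, _ = np.histogram(mag, bins=beta_m, density=True)
    return np.concatenate([h_dir, h_mag])

def centroid_series(frames):
    """Intensity centroid of each frame from first-order image moments,
    plus its successive temporal derivatives (velocity and acceleration)."""
    cx, cy = [], []
    for f in frames:
        f = f.astype(np.float64)
        rows = np.arange(f.shape[0])[:, None]
        cols = np.arange(f.shape[1])[None, :]
        m00 = f.sum()
        cx.append((cols * f).sum() / m00)   # C^x = M_01 / M_00
        cy.append((rows * f).sum() / m00)   # C^y = M_10 / M_00
    c = np.array(cx) + 1j * np.array(cy)
    vel = np.diff(c)                        # centroid velocity
    acc = np.diff(vel)                      # centroid acceleration
    return c, vel, acc
```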

PAF [12] exploits appearance information using different pooling operations. We test two types of appearance features: Overfeat and HOG. Overfeat [13] is a high-level appearance feature extracted from the last hidden layer of the Overfeat deep convolutional neural network, which was pretrained on a large image dataset (ImageNet [5]). HOG (histogram of oriented gradients) is a commonly used frame-level appearance descriptor [1]. Simple averaging can be applied to each HOG and Overfeat feature element across frames to obtain a representation of a video sample, \(V_n\). Gradient pooling (GP) can also be applied to encode the variation of the appearance features across frames. The gradient is computed by applying a first-order derivative to each feature element across time, and the pooling operations include the sum and histogram of positive and negative gradients [12].
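A minimal sketch of the pooling step, assuming the frame-level appearance features (HOG or Overfeat) are already stacked in an (L, D) array; only average pooling and the sum of positive/negative gradients are shown here, while the histogram-based gradient pooling of [12] follows the same pattern.

```python
import numpy as np

def pooled_appearance(feat_per_frame):
    """feat_per_frame: (L, D) frame-level appearance features for one sample V_n."""
    avg = feat_per_frame.mean(axis=0)                    # average pooling
    grad = np.diff(feat_per_frame, axis=0)               # first-order temporal derivative
    pos_sum = np.where(grad > 0, grad, 0).sum(axis=0)    # sum of positive gradients
    neg_sum = np.where(grad < 0, -grad, 0).sum(axis=0)   # sum of |negative| gradients
    return np.concatenate([avg, pos_sum, neg_sum])
```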

We follow the parameter settings chosen by the authors of the corresponding methods. Hence, we used \(L = 90\) frames (equivalent to a three-second duration) and \(\gamma = 20\) for each of the horizontal and vertical grid components, so AP is 800-D. For the GOFF part of RMF, \(\beta_d = 36\), \(\beta_m = 15\) and \(N_d = N_m = 25\), resulting in a 137-D feature vector. For VIF, we extracted a 106-D feature vector that includes 60-D frequency features, i.e. 10-D low-frequency coefficients from each of the 6 inertial time-series components (2 velocity, 2 acceleration and their 2 magnitudes). RMF concatenates GOFF and VIF, resulting in a 243-D feature vector. For PAF, HOG is 200-D, extracted using 5-by-5 spatial and 8 orientation bins. Overfeat [13] is extracted from the last hidden layer, resulting in a 4096-D feature.
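For reference, the dimensionalities that follow directly from these parameter choices can be checked with a few lines (the GOFF and VIF totals are taken from the text rather than re-derived):

```python
# Feature dimensionalities implied by the parameters above.
gamma = 20
ap_dim = 2 * gamma**2            # 800-D: horizontal + vertical grid components
hog_dim = 5 * 5 * 8              # 200-D: 5x5 spatial cells, 8 orientation bins
rmf_dim = 137 + 106              # 243-D: GOFF concatenated with VIF
print(ap_dim, hog_dim, rmf_dim)  # 800 200 243
```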

4 Baseline Classification Results

In this section, we describe the setups employed to validate different state-of-the-art features on the FPV-O dataset using multiple classifiers, and discuss the baseline results evaluated with several performance metrics.

We employed support vector machines (SVM) and k-nearest neighbours (KNN), which are the most frequently employed parametric and non-parametric classifiers, respectively, for activity recognition in FPV [2]. We apply the one-vs-all (OVA) strategy for training the SVM. Since FPV-O consists of 12 subjects, we use leave-one-subject-out validation, which reserves one subject for testing and uses the remaining subjects for training in each iteration.
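A minimal sketch of this protocol with scikit-learn, not the original MATLAB pipeline; the hyperparameters (linear SVM with C=1, KNN with 5 neighbours) are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def leave_one_subject_out(X, y, subjects, make_clf):
    """Train on all subjects but one, test on the held-out subject, for each subject."""
    preds, truth = [], []
    for s in np.unique(subjects):
        train, test = subjects != s, subjects == s
        clf = make_clf()
        clf.fit(X[train], y[train])
        preds.append(clf.predict(X[test]))
        truth.append(y[test])
    return preds, truth

# One-vs-all linear SVM and KNN baselines.
svm = lambda: make_pipeline(StandardScaler(),
                            OneVsRestClassifier(LinearSVC(C=1.0)))
knn = lambda: make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
```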

We employ precision (\(\mathcal{P}\)), recall (\(\mathcal{R}\)) and F-score (\(\mathcal{F}\)) to evaluate the classification performance. Other performance metrics, such as accuracy and specificity, are not used as they are less informative of the recognition performance in the OVA strategy [2]. Given the numbers of true positives (TP), false positives (FP) and false negatives (FN), the metrics are computed as \(\mathcal{P} = \frac{TP}{TP+FP}\), \(\mathcal{R} = \frac{TP}{TP+FN}\) and \(\mathcal{F} = \frac{2\mathcal{P}\mathcal{R}}{\mathcal{P}+\mathcal{R}}\). For each leave-one-subject-out iteration, \(\mathcal{P}\), \(\mathcal{R}\) and \(\mathcal{F}\) are evaluated for each activity. The final recognition performance is computed by averaging, first, over all activities and, then, over all subjects. Confusion matrices are also given to visualize the misclassifications among activities. All experiments were conducted using Matlab 2014b on an i7-3770 CPU @ 3.40 GHz with 16 GB RAM, running Ubuntu 14.04.
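The averaging scheme can be sketched as follows, reusing the per-subject predictions from the leave-one-subject-out loop above; this uses scikit-learn's macro averaging over activities within each subject, then averages over subjects (the zero_division argument requires scikit-learn 0.22 or later).

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def fpvo_scores(preds, truth, labels):
    """Per-activity P, R, F for each held-out subject, averaged first over
    activities (macro) and then over subjects."""
    per_subject = []
    for y_pred, y_true in zip(preds, truth):
        p, r, f, _ = precision_recall_fscore_support(
            y_true, y_pred, labels=labels, average='macro', zero_division=0)
        per_subject.append((p, r, f))
    return np.mean(per_subject, axis=0)   # (P, R, F) averaged over subjects
```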

Table 4. Performance of different state-of-the-art features on the FPV-O dataset with an SVM and a KNN classifier. The performance metrics are \(\mathcal {P}\): precision; \(\mathcal {R}\): recall and \(\mathcal {F}\): F-score. Key – AP: average pooling; VIF: virtual inertial features; GOFF: grid optical flow-based features; RMF: robust motion features; PAF: pooled appearance features; GP: gradient pooling applied on the appearance features. RMF [2] + PAF [12] represents a concatenation of existing motion- and appearance-based features.
Fig. 3. Confusion matrices based on the SVM classifier using RMF [2], PAF-GP [12] and the concatenation of RMF and PAF.

The performance of the selected methods on the FPV-O dataset is shown in Table 4. AP [17, 18] only concatenates the smoothed horizontal and vertical grid components, and does not use magnitude and direction information, which is important in this context. As a result, the performance of AP is the lowest with both SVM and KNN. RMF outperforms AP as it encodes motion magnitude, direction and dynamics using multiple feature groups (GOFF and VIF). The VIF component of RMF performs worse than GOFF because the intensity centroid of a frame hardly changes over time, since the subjects remain stationary for several activities, e.g. read, mobile and typeset. While both AP and RMF are designed to encode motion information, PAF exploits appearance information, which is more discriminative for interaction-based activities. As a result, PAF achieves the highest performance among the selected methods. PAF achieves equivalent performance with and without gradient pooling (GP) (see Table 4). This also confirms the superiority of appearance information for this dataset, as encoding variation with GP did not provide significantly more discriminative characteristics. The concatenation of motion and appearance features outperforms all the remaining feature groups.

The confusion matrices shown in Fig. 3 reflect the corresponding performance of the motion-based (RMF [2]), appearance-based (PAF-GP [12]) and concatenated (RMF [2] + PAF [12]) features. The concatenation of RMF [2] and PAF [12] improved the recognition performance of drink from \(9\%\) with RMF and \(7\%\) with PAF-GP to \(22\%\). The same is true for print, whose recognition performance improved from \(11\%\) with RMF and \(56\%\) with PAF-GP to \(76\%\) with the concatenation. On the other hand, the combination of motion and appearance features worsened the recognition performance of write. Note also the frequent misclassifications involving open due to the class imbalance problem (see Table 3).

5 Conclusions

We collected, annotated and distributed a dataset of 20 office activities from a first-person vision perspective (FPV-O) at http://www.eecs.qmul.ac.uk/~andrea/fpvo.html. Moreover, we employed and discussed state-of-the-art features extracted using both handcrafted methods and deep neural architectures, and presented baseline results for different feature groups using SVM and KNN classifiers.

FPV-O covers about three hours of egocentric videos collected by 12 subjects and contains the highest number of office activities (20) compared to existing datasets with similar activities. FPV-O contains challenging intra-activity differences and inter-activity similarities in addition to motion blur and illumination changes. We hope that this dataset and associated baseline results will support and foster research progress in this area of growing interest.