
1 Introduction

First-person vision (FPV) uses wearable cameras to record a scene from the point of view of the wearer. FPV applications include lifelogging, video summarisation and activity recognition. Datasets are important to support the development and testing of algorithms and classification pipelines for FPV applications. Publicly available FPV datasets are mainly focused on activities such as cooking. LENA [14], for example, contains 13 activities performed by 10 subjects using Google Glass, with each subject recording two 30-second clips per activity. The activities are grouped into motion (walking and running), social interaction (e.g. talking on a phone and to people), office work (writing, reading, watching videos, browsing the internet), food (eating and drinking) and housework. LENA treats different varieties of walking, namely walking straight, walking back and forth and walking up/down, as distinct activities, and includes challenges such as scene and illumination variations.

In this paper, we present FPV-O, a dataset of office activities in first-person vision available at http://www.eecs.qmul.ac.uk/~andrea/fpvo.html. FPV-O contains 20 activities performed by 12 subjects with a chest-mounted camera (see Fig. 1). The activities include three person-to-person interaction activities (chat, shake and wave), sixteen person-to-object interaction activities (clean and write on a whiteboard; use a microwave; use a drink vending machine, take a drink from the vending machine, open and drink; use a mobile phone; read and typeset on a computer; take a printed paper, staple, skim over and handwrite on a paper; wash and dry hands) and one proprioceptive activity (walk). The more stable chest-mount solution is often preferred [2, 8, 20, 21] over the head-mount solution, which is affected by head motion [7, 11]. The larger number of activities and subjects in FPV-O compared to other datasets creates classification challenges due to intra-activity differences and inter-activity similarities (see Fig. 2). The dataset is distributed with its annotation, features extracted with both hand-crafted methods and deep learning architectures, and classification results with baseline classifiers. FPV-O includes around three hours of video and contains the highest number of both subjects (12) and activities (20) compared to other office activity datasets. The dataset will be made available with the camera-ready paper.

The paper is organized as follows. Section 2 provides the details of the FPV-O dataset, the definition and duration of the activities as well as the contribution of each subject to the dataset. Section 3 describes state-of-the-art features extracted for the recognition of activities in FPV. Section 4 presents the baseline results of the feature groups (and their concatenation) in classifying office activities. Finally, Sect. 5 concludes the paper.

2 The FPV-O Dataset

We used a chest-mounted GoPro Hero3+ camera with \(1280\times 720\) resolution and 30 fps frame rate. Twelve subjects (nine male and three female) participated in the data collection. Each subject recorded a continuous video sequence of approximately 15 min on average, resulting in a total of 3 h of video. The FPV-O activities (see Table 1) extend the scope of existing office activity datasets [9, 10, 14] by including, for example, more object-interaction activities that are commonly performed in a typical office environment, such as using a printer, a microwave, a drink vending machine, a stapler and a computer. The number of classes (20) is larger than in existing office activity datasets [9, 10, 14] (Table 2).

Fig. 1. Keyframes of activities from sample videos in the FPV-O dataset.

The detailed contribution of each subject in the FPV-O dataset for each class is given in Table 3. The annotation includes start and end times of each class segment for each video sequence. The ground-truth labels were generated using ELAN [16].

Table 1. Definition of activities in the FPV-O dataset. P-P: person-to-person interactions; P-O: person-to-object interactions; P: proprioceptive activity.
Table 2. Summary of the FPV-O dataset that describes the number of video segments (# segments) per subject, \(S_i\), \(i \in \{1,\dots,12\}\), and the overall duration (Dur.) in minutes (min). M: male; F: female.
Table 3. Details of the contribution of each subject, \(S_i\), \(i \in \{1,\dots,12\}\), to FPV-O for each class in number of frames. Note that some activities are not performed by some subjects.

The challenges in FPV-O include intra-activity differences, i.e. an activity performed differently by different subjects due to differences in pose (see Fig. 2). Chest-mounting may also result in different fields of view for male and female subjects (see Fig. 2a). Additional challenges include inter-activity similarities, e.g. chat and shake (Fig. 2b), and read and typeset (Fig. 2c). Some inter-activity similarities could be avoided by merging the similar activities into a macro activity, e.g. read and typeset could be merged into a use-a-computer activity. Some activities may occur only for a very short duration, e.g. open has only 146 frames, whereas typeset has 28,561 frames (see Table 3). This exemplifies the class imbalance in FPV-O, as data-scarce activities such as open, wave and shake provide limited information for training. There are also illumination changes, as indoor lighting often mixes with daylight (Fig. 2d). Moreover, feature extraction is made challenging by motion blur and lack of texture (Fig. 2e).

Fig. 2. Sample frames that illustrate the challenges in the FPV-O dataset.

3 The Features

We selected three frequently employed existing methods to extract discriminative features for office activity classification in FPV-O: average pooling (AP) [17,18,19], robust motion features (RMF) [1, 2] and pooled appearance features (PAF) [1, 12], which we describe below.

Let \(\mathbf{V} = (V_1, \cdots, V_n, \cdots, V_N)\) be N temporally ordered activity samples from a subject. Each sample, \(V_n\), contains a window of L frames, i.e. \(V_n = (f_{n,1}, f_{n,2}, \cdots, f_{n,i}, \cdots, f_{n,L})\). Successive \(V_n\) pairs may overlap. Each of AP, RMF and PAF provides a feature representation for \(V_n\). AP and RMF mainly exploit motion using optical flow, whereas PAF encodes appearance information. The grid optical flow of \(V_n\) is \(G_n = (g_{n,1}, g_{n,2}, \cdots, g_{n,i}, \cdots, g_{n,L-1})\), where \(g_{n,i} = g^x_{n,i} + jg^y_{n,i}\) represents a flow vector between successive frames, \(f_{n,i}\) and \(f_{n,i+1}\), \(i \in [1, L-1]\). The superscripts x and y represent the horizontal and vertical components, respectively. \(\gamma\) is the number of grid cells along each of the horizontal and vertical directions, which results in \(\gamma^2\) grid cells per frame. AP [18] averages each element across the \(L-1\) grid flow vectors in \(G_n\), which helps discard noise. After smoothing, the final AP representation is the concatenation of the horizontal and vertical grid components.
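A minimal sketch of grid optical flow and AP-style pooling under the definitions above, assuming grayscale frames and OpenCV's Farneback flow; it is not the authors' implementation, and the flow parameters are illustrative.

```python
import cv2
import numpy as np

def grid_flow(prev_gray, curr_gray, gamma=20):
    """Dense flow between two frames, averaged over a gamma x gamma grid."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    H, W = prev_gray.shape
    gx = np.zeros((gamma, gamma))
    gy = np.zeros((gamma, gamma))
    for r in range(gamma):
        for c in range(gamma):
            cell = flow[r*H//gamma:(r+1)*H//gamma, c*W//gamma:(c+1)*W//gamma]
            gx[r, c] = cell[..., 0].mean()   # horizontal component g^x
            gy[r, c] = cell[..., 1].mean()   # vertical component g^y
    return gx, gy

def ap_feature(frames, gamma=20):
    """AP: average each grid-flow element over the L-1 flow fields, then
    concatenate horizontal and vertical components -> 2*gamma^2 dimensions."""
    flows = [grid_flow(frames[i], frames[i+1], gamma) for i in range(len(frames)-1)]
    gx = np.mean([f[0] for f in flows], axis=0)
    gy = np.mean([f[1] for f in flows], axis=0)
    return np.concatenate([gx.ravel(), gy.ravel()])  # 800-D for gamma=20
```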

RMF [2] extracts more discriminative features by encoding the direction, magnitude and frequency characteristics of \(G_n\). RMF contains two parts: grid optical flow-based features (GOFF) and centroid-based virtual inertial features (VIF).

GOFF is extracted from the histogram and Fourier transform of the motion direction, \(G^\theta_n = \operatorname{arctan2}(G^y_n, G^x_n)\), and motion magnitude, \(|G_n| = \sqrt{|G^x_n|^2 + |G^y_n|^2}\). The histogram representations quantize \(G^\theta_n\) and \(|G_n|\) with \(\beta_d\) and \(\beta_m\) bins, respectively. The frequency representations are derived by grouping the frequency response magnitudes of \(G^\theta_n\) and \(|G_n|\) into \(N_d\) and \(N_m\) bands, respectively. VIF is a virtual-inertial feature extracted from the movement of the intensity centroid, \(C_n = C^x_n + jC^y_n\), across the frames in \(V_n\). The intensity centroid is computed from first-order image moments as \(C^x_n = \mathcal{M}_{01}/\mathcal{M}_{00}\) and \(C^y_n = \mathcal{M}_{10}/\mathcal{M}_{00}\). For a frame, \(f_{n,i}\), which is H pixels high and W pixels wide, the image moments are calculated from the weighted average of all the intensity values as \(\mathcal{M}^i_{pq} = \sum_{r=1}^{H} \sum_{c=1}^{W} r^p c^q f_{n,i}(r,c)\), where \(p,q \in \{0,1\}\). Once the centroid locations are computed across frames, successive temporal derivatives are applied to obtain the corresponding velocity, \(\dot{C}_n\), and acceleration, \(\ddot{C}_n\), components. VIF features are then extracted from \(\dot{C}_n\) and \(\ddot{C}_n\) in the same way that inertial features are extracted from accelerometer and gyroscope data, e.g. minimum, maximum, energy, kurtosis, zero-crossings and low-frequency coefficients.
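The following sketch, under the same definitions, shows the histogram part of GOFF and the centroid track used by VIF; the frequency-band features and the time/frequency statistics on the velocity and acceleration (min, max, energy, kurtosis, zero-crossings, low-frequency coefficients) are omitted, and the function names are illustrative.

```python
import numpy as np

def goff_histograms(gx, gy, beta_d=36, beta_m=15):
    """Quantise grid-flow direction and magnitude into beta_d and beta_m bins."""
    theta = np.arctan2(gy, gx)                     # motion direction per grid cell
    mag = np.sqrt(gx**2 + gy**2)                   # motion magnitude per grid cell
    h_dir, _ = np.histogram(theta, bins=beta_d, range=(-np.pi, np.pi), density=True)
    h_mag, _ = np.histogram(mag, bins=beta_m, density=True)
    return np.concatenate([h_dir, h_mag])

def centroid_series(frames):
    """Intensity centroid of each frame from first-order image moments,
    plus its successive temporal derivatives (velocity and acceleration)."""
    cx, cy = [], []
    for f in frames:
        f = f.astype(np.float64)
        rows = np.arange(f.shape[0])[:, None]
        cols = np.arange(f.shape[1])[None, :]
        m00 = f.sum()
        cx.append((cols * f).sum() / m00)   # C^x = M_01 / M_00
        cy.append((rows * f).sum() / m00)   # C^y = M_10 / M_00
    c = np.array(cx) + 1j * np.array(cy)
    vel = np.diff(c)                        # centroid velocity
    acc = np.diff(vel)                      # centroid acceleration
    return c, vel, acc
```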

PAF [12] exploits appearance information using different pooling operations. We test two types of appearance features: Overfeat and HOG. Overfeat [13] is a high-level appearance feature extracted from the last hidden layer of the Overfeat deep convolutional neural network, which was pretrained on a large image dataset (ImageNet [5]). HOG (histogram of oriented gradients) is a commonly used frame-level appearance descriptor [1]. Simple averaging can be applied to each HOG and Overfeat feature element across frames to obtain a representation of a video sample, \(V_n\). Gradient pooling (GP) can also be applied to encode the variation of the appearance features across frames. The gradient is computed by applying a first-order derivative to each feature element across time, and the pooling operations include the sum and histogram of positive and negative gradients [12].
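A minimal sketch of the pooling step, assuming the frame-level appearance features (HOG or Overfeat) are already stacked in an (L, D) array; only average pooling and the sum of positive/negative gradients are shown here, while the histogram-based gradient pooling of [12] follows the same pattern.

```python
import numpy as np

def pooled_appearance(feat_per_frame):
    """feat_per_frame: (L, D) frame-level appearance features for one sample V_n."""
    avg = feat_per_frame.mean(axis=0)                    # average pooling
    grad = np.diff(feat_per_frame, axis=0)               # first-order temporal derivative
    pos_sum = np.where(grad > 0, grad, 0).sum(axis=0)    # sum of positive gradients
    neg_sum = np.where(grad < 0, -grad, 0).sum(axis=0)   # sum of |negative| gradients
    return np.concatenate([avg, pos_sum, neg_sum])
```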

We follow the parameter settings chosen by the authors of the corresponding methods. Hence, we used \(L = 90\) frames (equivalent to a three-second duration) and \(\gamma = 20\) for each of the horizontal and vertical grid components, so AP is 800-D. For the GOFF part of RMF, \(\beta_d = 36\), \(\beta_m = 15\) and \(N_d = N_m = 25\), resulting in a 137-D feature vector. For VIF, we extracted a 106-D feature vector that includes 60-D frequency features, i.e. 10-D low-frequency coefficients from each of the 6 inertial time-series components (2 velocity, 2 acceleration and their 2 magnitudes). RMF concatenates GOFF and VIF, resulting in a 243-D feature vector. For PAF, HOG is 200-D, extracted using 5-by-5 spatial and 8 orientation bins. Overfeat [13] is extracted from the last hidden layer, resulting in a 4096-D feature.
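For reference, the dimensionalities that follow directly from these parameter choices can be checked with a few lines (the GOFF and VIF totals are taken from the text rather than re-derived):

```python
# Feature dimensionalities implied by the parameters above.
gamma = 20
ap_dim = 2 * gamma**2            # 800-D: horizontal + vertical grid components
hog_dim = 5 * 5 * 8              # 200-D: 5x5 spatial cells, 8 orientation bins
rmf_dim = 137 + 106              # 243-D: GOFF concatenated with VIF
print(ap_dim, hog_dim, rmf_dim)  # 800 200 243
```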

4 Baseline Classification Results

In this section, we describe the setups employed to validate different state-of-the-art features on the FPV-O dataset using multiple classifiers, and discuss the baseline results evaluated with several performance metrics.

We employed support vector machines (SVM) and k-nearest neighbours (KNN), which are the most frequently employed parametric and non-parametric classifiers, respectively, for activity recognition in FPV [2]. We apply the one-vs-all (OVA) strategy for training the SVM. Since FPV-O consists of 12 subjects, we use leave-one-subject-out validation, which reserves one subject for testing and uses the remaining subjects for training in each iteration.
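A minimal sketch of this protocol with scikit-learn, not the original MATLAB pipeline; the hyperparameters (linear SVM with C=1, KNN with 5 neighbours) are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def leave_one_subject_out(X, y, subjects, make_clf):
    """Train on all subjects but one, test on the held-out subject, for each subject."""
    preds, truth = [], []
    for s in np.unique(subjects):
        train, test = subjects != s, subjects == s
        clf = make_clf()
        clf.fit(X[train], y[train])
        preds.append(clf.predict(X[test]))
        truth.append(y[test])
    return preds, truth

# One-vs-all linear SVM and KNN baselines.
svm = lambda: make_pipeline(StandardScaler(),
                            OneVsRestClassifier(LinearSVC(C=1.0)))
knn = lambda: make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
```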

We employ precision (\(\mathcal{P}\)), recall (\(\mathcal{R}\)) and F-score (\(\mathcal{F}\)) to evaluate the classification performance. Other performance metrics, such as accuracy and specificity, are not used as they are less informative of the recognition performance in the OVA strategy [2]. Given the numbers of true positives (TP), false positives (FP) and false negatives (FN), the metrics are computed as \(\mathcal{P} = \frac{TP}{TP+FP}\), \(\mathcal{R} = \frac{TP}{TP+FN}\) and \(\mathcal{F} = \frac{2\mathcal{P}\mathcal{R}}{\mathcal{P}+\mathcal{R}}\). For each leave-one-subject-out iteration, \(\mathcal{P}\), \(\mathcal{R}\) and \(\mathcal{F}\) are evaluated for each activity. The final recognition performance is computed by averaging, first, over all activities and, then, over all subjects. Confusion matrices are also given to visualize the misclassifications among activities. All experiments were conducted using Matlab 2014b on an i7-3770 CPU @ 3.40 GHz with 16 GB RAM, running Ubuntu 14.04.
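The averaging scheme can be sketched as follows, reusing the per-subject predictions from the leave-one-subject-out loop above; this uses scikit-learn's macro averaging over activities within each subject, then averages over subjects (the zero_division argument requires scikit-learn 0.22 or later).

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def fpvo_scores(preds, truth, labels):
    """Per-activity P, R, F for each held-out subject, averaged first over
    activities (macro) and then over subjects."""
    per_subject = []
    for y_pred, y_true in zip(preds, truth):
        p, r, f, _ = precision_recall_fscore_support(
            y_true, y_pred, labels=labels, average='macro', zero_division=0)
        per_subject.append((p, r, f))
    return np.mean(per_subject, axis=0)   # (P, R, F) averaged over subjects
```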

Table 4. Performance of different state-of-the-art features on the FPV-O dataset with an SVM and a KNN classifier. The performance metrics are \(\mathcal {P}\): precision; \(\mathcal {R}\): recall and \(\mathcal {F}\): F-score. Key – AP: average pooling; VIF: virtual inertial features; GOFF: grid optical flow-based features; RMF: robust motion features; PAF: pooled appearance features; GP: gradient pooling applied on the appearance features. RMF [2] + PAF [12] represents a concatenation of existing motion- and appearance-based features.
Fig. 3. Confusion matrices based on the SVM classifier using RMF [2], PAF-GP [12] and the concatenation of RMF and PAF.

The performance of the selected methods on the FPV-O dataset is shown in Table 4. AP [17, 18] only concatenates the smoothed horizontal and vertical grid components, and does not use magnitude and direction information, which is important in this context. As a result, the performance of AP is the lowest with both SVM and KNN. RMF outperforms AP as it encodes motion magnitude, direction and dynamics using multiple feature groups (GOFF and VIF). The VIF component of RMF performs worse than GOFF because the intensity centroid of a frame hardly changes over time, since the subjects remain stationary for several activities, e.g. read, mobile and typeset. While both AP and RMF are designed to encode motion information, PAF exploits appearance information, which is more discriminative for interaction-based activities. As a result, PAF achieves the highest performance among the selected methods. PAF achieves equivalent performance with and without gradient pooling (GP) (see Table 4). This also confirms the superiority of appearance information for this dataset, as encoding variation with GP did not provide significantly more discriminative characteristics. The concatenation of motion and appearance features outperforms all the remaining feature groups.

The confusion matrices shown in Fig. 3 reflect the corresponding performance of the motion-based (RMF [2]), appearance-based (PAF-GP [12]) and concatenated (RMF [2] + PAF [12]) features. The concatenation of RMF [2] and PAF [12] improved the recognition performance of drink from \(9\%\) with RMF and \(7\%\) with PAF-GP to \(22\%\). The same is true for print, whose recognition performance improved from \(11\%\) with RMF and \(56\%\) with PAF-GP to \(76\%\) with the concatenation. On the other hand, the combination of motion and appearance features worsened the recognition performance of write. Note also the frequent misclassifications involving open due to the class imbalance problem (see Table 3).

5 Conclusions

We collected, annotated and distributed a dataset of 20 office activities from a first-person vision perspective (FPV-O) at http://www.eecs.qmul.ac.uk/~andrea/fpvo.html. Moreover, we employed and discussed state-of-the-art features extracted using both handcrafted methods and deep neural architectures, and presented baseline results for different feature groups using SVM and KNN classifiers.

FPV-O covers about three hours of egocentric videos collected by 12 subjects and contains the highest number of office activities (20) compared to existing datasets with similar activities. FPV-O contains challenging intra-activity differences and inter-activity similarities in addition to motion blur and illumination changes. We hope that this dataset and associated baseline results will support and foster research progress in this area of growing interest.