
Project Title:

i-Treasures: Intangible Treasures – Capturing the Intangible Cultural Heritage and Learning the Rare KnowHow of Living Human Treasures

Contract No:

FP7-ICT-2011-9-600676

Instrument:

Large Scale Integrated Project (IP)

Thematic Priority:

ICT for access to cultural resources

Start of project:

1 February 2013

Duration:

48 months

Deliverable No:

D3.1 First Report on ICH Capture and Analysis

Due date of deliverable:

31 January 2014

Actual submission date:

12 March 2014

Version:

1st Version of D3.1

Main Authors:

Samer Al Kork (UPMC), Bruce Denby (UPMC), Aurore Hakoun (UPMC)

Project funded by the European Community under the 7th Framework Programme for Research and Technological Development.


Project ref. number

ICT-600676

Project title

i-Treasures - Intangible Treasures – Capturing the Intangible Cultural Heritage and Learning the Rare KnowHow of Living Human Treasures

Deliverable title

First report on ICH Capture and Analysis

Deliverable number

D3.1

Deliverable version

Version 15

Previous version(s)

1-14

Contractual date of delivery

31 January 2014

Actual date of delivery

12 March 2014

Deliverable filename

D3.1 Final Version March 11 2014.doc

Nature of deliverable

R = Report

Dissemination level

PP = Restricted to other programme participants (including the Commission Services)

Number of pages

127

Work package

3

Partner responsible

UPMC

Author(s)

Bruce Denby (UPMC), Samer Al Kork (UPMC), Aurore Hakoun (UPMC), Kele Xu (UPMC), Pierre Roussel (UPMC), Maureen Stone (USM), Athanasios Manitsaris (UOM), George Kourvoulis (UOM), Anastasios Katos (UOM), Alina Glushkova (UOM), Vasso Gatziaki (UOM), Christina Volioti (UOM), Nikos Grammalidis (CERTH), Kosmas Dimitropoulos (CERTH), Filareti Tsalakanidou (CERTH), Alexandros Kitsikidis (CERTH), Martine Adda-Decker (CNRS), Lise Crevier-Buchman (CNRS), Claire Pillot-Loiseau (CNRS), Patrick Chawah (CNRS), Angelique Amelot (CNRS), Thibaut Fux (CNRS), Nicolas Audibert (CNRS), Stephane Dupont (UMONS), Joelle Tilmanne (UMONS), Thierry Ravet (UMONS), Benjamin Picart (UMONS), Stelios Hadjidimitriou (AUTH), Vasileios Charisis (AUTH), Leontios Hadjileontiadis (AUTH), G. Sergiadis (AUTH), Sotiris Manitsaris (ARMINES/ENSMP)

Editor

Samer Al Kork (UPMC)

EC Project Officer

Alina Senn

Abstract

The document describes the data capture and analysis techniques employed for the ICH modalities: facial expression capture and recognition; body and gesture recognition; vocal tract capture and analysis; EEG analysis; and sound capture. The sensors, methods, and algorithms used in each case are described in detail.

Keywords

Intangible Cultural Heritage (ICH), ICH capture and analysis, multi-sensory data, vocal tract, ultrasound, electroglottograph, piezoelectric accelerometer, motion capture, Kinect, skeleton fusion, gesture recognition, motion analysis, 3D facial feature tracking, facial deformation measurements, facial action unit detection, electroencephalography analysis (EEG), Emotiv, sound processing, pitch analysis, image processing, deep learning.


Signatures

              Name                             Responsibility - Company        Date
Written by    Samer Al Kork                    UPMC
Verified by   Bruce Denby                      Responsible for D3.1 (UPMC)
              Bruce Denby                      WP3 Leader (UPMC)
Approved by   Nikos Grammalidis                Coordinator (CERTH)
              Yiannis (Ioannis) Kompatsiaris   Quality Manager (CERTH)


Table of Contents

1. Executive Summary ... 7
2. Introduction ... 8
   2.1 Background ... 9
   2.2 Aim of this Report ... 9
   2.3 Report Structure ... 9
3. ICH Capture Analysis and Feature Extraction ... 10
   3.1 Body and Gesture Data Capture and Analysis ... 10
      3.1.1 Full body skeleton capture ... 10
         3.1.1.1 Human skeleton tracking based on fusion of multiple skeletal streams ... 11
         3.1.1.2 Human body tracking based on fusion of multiple depth maps ... 15
      3.1.2 Full body gesture recognition ... 21
         3.1.2.1 Gesture recognition with HCRF based on a depth map extracted skeleton ... 22
         3.1.2.2 Hidden Markov Models for Gesture Recognition ... 24
      3.1.3 Hand/Finger data capture ... 33
         3.1.3.1 Skeleton-based hand joints detection ... 33
         3.1.3.2 Finger gesture recognition without using any skeletal model (UOM) ... 41
         3.1.3.3 Gesture Recognition in Byzantine Music ... 42
      3.1.4 Full Upper body data capture for the traditional craftsmanship use case (UOM) ... 43
         3.1.4.1 Capture, modelling and recognition (UOM, ARMINES) ... 43
         3.1.4.2 Gesture Analysis ... 44
         3.1.4.3 Technical gesture recognition based on inertial sensors ... 46
   3.2 Facial Expression Data Capture and Analysis ... 47
      3.2.1 System Overview ... 47
      3.2.2 System Design and Architecture ... 48
      3.2.3 System Inputs and Outputs Data Formats ... 51
         3.2.3.1 System input ... 51
         3.2.3.2 System output ... 54
      3.2.4 Feature Post processing and Pre processing ... 55
         3.2.4.1 3D face detection and pose estimation ... 57
         3.2.4.2 2D+3D facial feature tracking based on ASM models ... 57
         3.2.4.3 Facial feature localization ... 65
         3.2.4.4 Face and facial feature tracking in 2D and 3D image sequences ... 81
         3.2.4.5 Facial measurement extraction ... 84
         3.2.4.6 Facial Action Unit recognition ... 88
         3.2.4.7 Initial experimental evaluation ... 91
   3.3 EEG Data Capture and Analysis ... 92
      3.3.1 Background and Module Overview ... 92
      3.3.2 System Design and Architecture ... 94
      3.3.3 EEG Data Acquisition Device and Setup ... 94
      3.3.4 Data Processing and Classification ... 96
         3.3.4.1 Input Data and Features Extraction ... 97
         3.3.4.2 Features Classification and Output Data ... 99
         3.3.4.3 Application Realization ... 101
      3.3.5 Future work ... 102
   3.4 Vocal Tract Data Capture and Analysis (UPMC, CNRS, USM) ... 103
      3.4.1 Introduction ... 103
      3.4.2 Vocal tract data capture system architecture ... 104
         3.4.2.1 Helmet design and sensor setup ... 105
      3.4.3 Data Acquisition system design ... 106
         3.4.3.1 Sensor Core Design ... 106
         3.4.3.2 Data visualization ... 107
      3.4.4 Rare singing data collection ... 110
         3.4.4.1 Assessment phase of the hyper-helmet for the different singing types ... 110
         3.4.4.2 Definition of recording material ... 110
      3.4.5 Feature Post processing and Pre processing ... 111
         3.4.5.1 Ultrasound image processing ... 111
         3.4.5.2 Lip image and other sensor processing ... 113
         3.4.5.3 Perspectives ... 113
   3.5 Sound Data Capture and Analysis ... 113
      3.5.1 Data capture ... 113
      3.5.2 Sound data analysis and feature extraction ... 114
         3.5.2.1 Segmentation ... 115
         3.5.2.2 Pitch analysis ... 117
         3.5.2.3 Perspectives ... 120
4. Partners responsible for each Module/Task ... 121
5. Conclusions ... 121
6. References ... 122

1. Executive Summary

The objective of WP3 is to use multi-sensory technology to capture and analyze the different forms of ICH addressed in the i-Treasures project. This document details the status of ICH Capture and Analysis for the tasks associated with each ICH acquisition modality, that is:

- body and gesture capture for the dance use cases;
- facial expression capture and recognition;
- electroencephalographic (EEG) signals for emotion detection;
- vocal tract configuration for the singing use case;
- sound capture.

Each modality is presented in a separate section, where it is outlined in terms of:

- the goals of the task in relation to the specific use case addressed;
- the types of sensors used for this modality;
- the data acquisition system chosen for the given modality;
- issues of data manipulation and storage, where applicable;
- the algorithms and techniques chosen for:
  o calibration;
  o pre- and post-processing of acquired data;
  o feature extraction;
  o data analysis per se;
  o data evaluation protocols.

2. Introduction

The overall objective of WP3, entitled ICH Capture and Analysis, is to capture, process and analyze different forms of ICH based on the knowledge of experts. Its goals are to develop systems capable of recording and subsequently analyzing data on full body and hand gesture movements; facial expressions; the emotional state of composers via EEG analysis; the real-time configuration of the vocal tract for rare-song experts; and sound, which is of course important in several use cases. Data from all modalities have to be captured, cleaned, stored, and analyzed, and specific sensors and algorithms must be developed for each, since each domain has its own particularities. As we shall see, progress towards these goals in the covered period has been achieved in a variety of ways, depending upon the ICH modality involved.

Multiple Kinect depth cameras are used to acquire depth field data of dancers' full body movements, while for hand and upper body movement, the PMD CamBoard Nano depth camera, the Animazoo inertial sensor suit, and the Kinect are used. Acquisition software derives from open-source point-cloud and 3D graphics packages, raw data are registered to hand/body skeleton models (where applicable), and gesture recognition is performed using Random Decision Forests (RDF), Hidden Markov Models (HMM), and principal component analysis (PCA). The relevant software modules are the Kinect Acquisition Tool; the "sar" Toolbox for Depth Camera Calibration and Synchronization; the HMM-based Real-time Gesture Recognition Library; the PianOrasis Finger Gesture Recognition System; and the Anima-OSC Gesture Recognition System.

Facial expression capture is performed using the Kinect camera. A first version of a robust, unobtrusive 2D and 3D face tracker was developed, which analyzes 2D and 3D facial data in real time and recognizes basic facial muscle movements (called Facial Action Units). Several image analysis algorithms for face detection, facial feature tracking, facial feature localization, facial measurement extraction and facial action unit recognition are presented.

Electroencephalographic data are acquired with a lightweight, portable Emotiv EPOC 14-channel EEG device, digitized at 128 Hz. After feature extraction with power-, complexity-, and zero-crossing-based techniques, thresholding, k-Nearest Neighbor (kNN), and Support Vector Machine (SVM) methods are employed to classify the user's emotional state, for use in the Contemporary Music Composition use case.

For the singing use cases, the vocal tract configuration is captured using a lightweight "hyper-helmet" including an ultrasonic (US) transducer for tongue movement, a video camera for the lips, and a microphone. The singer also wears an electroglottograph (EGG, for vocal fold capture), a nasality accelerometer, and a respiration belt. Data are acquired, displayed, and logged at 60 fps (US and video) using the RTMaps data acquisition package. A Deep Learning (DL) approach has been adopted for tongue contour extraction. The software modules used in this task are the Multi-sensor Data Acquisition Module for Rare Singing, the Deep Learning Module for Automatic Contour Extraction, and the i-Coffee Data Display and Analysis Module.

Sound capture has focused mainly on Human Beat Box singing, since novel techniques will have to be developed for this new and not yet widely studied style. The first step is to develop a ground truth by manually segmenting and labeling the unusual sound unit categories employed in beat boxing. Comparative studies of pitch tracking algorithms for beat box recordings have also been undertaken. The corresponding software module for this study is the Pitch Analysis Algorithms Module.

In what follows, the data capture and analysis procedure for each of the WP3 tasks is laid out in detail.

2.1 Background

The aim of the i-Treasures project is to develop an open and extendable platform that provides access to intangible cultural heritage (ICH), and to propose a novel strategic framework for the safeguarding and transmission of ICH, using novel multi-sensory technology to create cultural content that has never been analyzed before.

2.2 Aim of this Report

This document, named “D3.1: First Report on ICH Capture and Analysis”, describes in detail the design and development of the modules for ICH capture and analysis that have been developed in Tasks 3.1-3.5 during the first phase of WP3 (Month 1-12). It also discusses the most important parameters affecting their performance as well as their individual advantages, disadvantages and constraints. This deliverable is the outcome of the first phase of Tasks 3.1-3.5.

2.3 Report Structure

The structure of this document is the following:

- Section 2 is this introduction.
- Section 3 details ICH Capture Analysis and Feature Extraction for each of Tasks 3.1-3.5. More specifically:
  o Section 3.1 presents algorithms and modules for body motion and gesture capture and analysis.
  o Section 3.2 presents algorithms for facial expression analysis.
  o Section 3.3 presents algorithms and modules for electroencephalography (EEG) analysis.
  o Section 3.4 presents modules for vocal tract capture and analysis.
  o Section 3.5 presents algorithms for sound capture and analysis.
- Section 4 presents the partners responsible for each task.
- Section 5 summarizes the conclusions of this deliverable.
- Section 6 contains the references.

3. ICH Capture Analysis and Feature Extraction

3.1 Body and Gesture Data Capture and Analysis

The body and gesture data capture and analysis task covered motion data capture and recognition in different use-case contexts. Full body motion data capture was addressed, as well as hand data capture. In both cases, gesture recognition modules were implemented and tested. During the covered period, the work on the body and gesture data capture and analysis module has mainly been divided amongst the following subtasks:

- Full body skeleton capture, using skeleton tracking based on a) fusion of multiple skeletal streams and b) fusion of multiple depth maps.
- Full body gesture recognition, using both Hidden Conditional Random Fields (HCRF) and Hidden Markov Models (HMMs).
- Hand/finger data capture, addressing skeleton-based hand joint detection, finger gesture recognition without a skeletal model, and gesture recognition in Byzantine music.
- Upper body data capture and gesture analysis and recognition, for the craftsmanship use case.

These subtasks are addressed in detail in the following subsections.

3.1.1 Full body skeleton capture

In the full body motion capture modules, we investigate methods to improve the recording of dance movements by implementing tools that accurately extract the performers' movements using non-intrusive motion capture techniques. We focused on the use of depth cameras. Interest in such systems has grown rapidly since Microsoft introduced the Kinect in 2010. This low-cost depth sensor produces a range image and provides a full-body 3D motion capture solution. The Kinect and its Software Development Kit (Microsoft SDK, http://www.microsoft.com/en-us/kinectforwindows/) were designed to be used in a video game console setup: the users (it is difficult to have more than three people) are supposed to be standing in front of a display screen and a Kinect. To make such technology more robust to occlusions, we propose to use multi-camera setups. We investigate two solutions:

- The first approach has been designed primarily for a scenario where a dancer faces the camera and moves along a line or a semicircle; it is explained in detail in Section 3.1.1.1. The initial skeletal tracking data are acquired from each sensor using the Microsoft Kinect SDK. Subsequently, the skeletal data are fused in order to solve the robustness problems related to occlusions.

- The second approach concerns the use cases where the assumption of one user facing the depth camera is not valid. In this approach, we plan to fuse the depth maps from every sensor, and the skeleton extraction will be computed on this basis. If the users move extensively or perform movements near the ground, many occlusions occur and artifacts can be observed in the motion data. This frequently happens in two use cases studied in this work package: traditional and contemporary dance.

In order for the depth maps of several cameras to be fused properly, the calibration and synchronization steps must be performed very accurately. In Section 3.1.1.1.2, we explain a module for precise calibration and synchronization of several depth cameras, and for the fusion of several depth point clouds. Some parts of the two modules presented separately hereunder are not dependent on the use case and constitute different approaches to the same problem; they will be integrated as options in a common framework as part of future work.

3.1.1.1 Human skeleton tracking based on fusion of multiple skeletal streams

The full body capture and analysis module requirements led to the decision to use multi-camera setups in order to improve the robustness of skeleton tracking, to reduce occlusion and self-occlusion problems, and to increase the coverage of the motion capture space. We have investigated setups including up to 4 Kinect sensors. The process consists of data capture, sensor calibration, skeletal fusion, feature extraction and finally motion recognition. The system architecture is shown in more detail in Figure 1; each part is described in the relevant section. The Kinect Acquisition Tool (Figure 2) has been developed to implement this process. It is a Windows GUI application which facilitates multi-Kinect data acquisition, Kinect pair calibration, fusion of skeletal tracking data acquired from multiple sensors, and recognition of predefined motion patterns.

Figure 1: Body and gesture capture and analysis system architecture


Figure 2: Kinect Acquisition Tool

3.1.1.1.1 Data Capture

The acquisition process consists of capturing time-stamped data streams from each sensor. Color maps, depth maps, segmentation masks and skeletal data streams can be recorded. Color maps are saved as either uncompressed Windows Bitmap (.bmp) or JPEG (.jpg) image files. Depth maps are saved as 16-bit TIFF image files. For skeleton animation, a text format (SKEL) has been developed. The raw data from the skeletal tracking stream (joint positions, joint rotations and confidence levels) together with time stamps are written to skel files.

Figure 3: Preparation for the recording session. 4 Kinect sensors are placed around the performer, each connected to a separate PC.


The data acquisition process can be performed by the Kinect Acquisition Tool installed on multiple PCs, each controlling one or more sensors (Figure 3). Synchronizing the PCs using Network Time Protocol (NTP, http://en.wikipedia.org/wiki/Network_Time_Protocol) server and clients is required in order to obtain correct time stamping during the recording session.

3.1.1.1.2 Sensor Calibration

Sensors need to be calibrated in order to be able to transform the skeletal tracking data into a common coordinate space. A calibration procedure is thus required to estimate the transformations between the coordinate systems of each sensor and the reference sensor. Our proposed procedure does not require checkerboards or similar patterns; the only requirement is that a person is visible from the sensor pair to be calibrated. The calibration is realised by using the Iterative Closest Point (ICP) algorithm (Besl and McKay, 1992) to estimate the rigid transformation (rotation-translation) between two point clouds. The point clouds we use consist of the joint positions of the tracked person as detected by each sensor. The ICP implementation found in the Point Cloud Library (PCL, http://pointclouds.org/) [26] was used. The skeleton joint positions are fed into the algorithm, which minimizes the distance between the transformed positions and those in the reference frame. This transformation is then used to register the skeletons acquired from each sensor in the reference coordinate system. This approach essentially matches two sparse point clouds containing unreliable data (since skeleton joint tracking can be erroneous). The calibration process is fast, but in order to overcome problems in precision it must be applied iteratively. We use two thresholds, Tjoints and TICP, which control the termination of a successful calibration procedure. The first criterion is that the number of joints tracked with high confidence on both sensors must be higher than the threshold Tjoints; the higher this number, the better the expected accuracy of the calibration, since the point clouds will be larger. The second criterion is that the fitness score of the ICP algorithm must be lower than the threshold TICP; the fitness score is a metric of how well the two point clouds are aligned after applying the rigid body transformation. These thresholds can be adjusted to accommodate various setups and recording conditions: a higher Tjoints and a lower TICP will lead to a longer but more accurate calibration procedure.
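As an illustration of this pairwise calibration step, the following sketch (plain C++ against the PCL registration API; the function name, point type and threshold values are illustrative choices, not the actual Kinect Acquisition Tool code) aligns the high-confidence joint positions seen by one sensor to those of the reference sensor and applies the two termination criteria described above.

    #include <cstddef>
    #include <pcl/point_cloud.h>
    #include <pcl/point_types.h>
    #include <pcl/registration/icp.h>
    #include <Eigen/Dense>

    // Pairwise calibration sketch: both clouds contain the joint positions tracked
    // with high confidence by the two sensors for the same person at the same instant.
    bool calibratePair(const pcl::PointCloud<pcl::PointXYZ>::Ptr& jointsSensor,
                       const pcl::PointCloud<pcl::PointXYZ>::Ptr& jointsReference,
                       Eigen::Matrix4f& sensorToReference,
                       std::size_t T_joints = 15,   // minimum number of reliable joints (illustrative)
                       double T_ICP = 1e-4)         // maximum accepted ICP fitness score (illustrative)
    {
        // First criterion: enough joints tracked with high confidence on both sensors.
        if (jointsSensor->size() < T_joints || jointsReference->size() < T_joints)
            return false;

        // Rigid alignment of the two sparse joint clouds with Iterative Closest Point.
        pcl::IterativeClosestPoint<pcl::PointXYZ, pcl::PointXYZ> icp;
        icp.setInputSource(jointsSensor);
        icp.setInputTarget(jointsReference);
        pcl::PointCloud<pcl::PointXYZ> aligned;
        icp.align(aligned);

        // Second criterion: the fitness score must be below the threshold.
        if (!icp.hasConverged() || icp.getFitnessScore() > T_ICP)
            return false;

        // Rotation-translation registering this sensor's skeletons in the reference frame.
        sensorToReference = icp.getFinalTransformation();
        return true;
    }

If either criterion fails, the procedure is simply repeated on a later frame until a transformation satisfying both thresholds is found.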

3.1.1.1.3 Skeleton Fusion

Skeleton fusion is the process of combining skeletal data recorded by multiple sensors into a single, more robust skeletal representation (Figure 4, Figure 5). Prior to fusion, skeletal data from all sensors have to be transformed to a common reference coordinate system of the reference sensor. The logic used in combining the data is called fusion strategy. We have developed a fusion strategy working on joint positional data, which could be extended to rotational data as well with slight modifications.


Figure 4: Skeletal fusion

Figure 5: Tsamiko dance recording with 3 Kinect Sensors. A fused skeleton is produced combining 3 skeletal streams.

Presuming there are N skeletons, and that the position and confidence of the j-th joint of the i-th skeleton are denoted p_ij and c_ij respectively, the sum of all joint confidence levels of each skeleton is first computed and the skeleton with the highest total is selected. This skeleton consists of the most successfully tracked joints for the current frame and is expected to be the most accurate representation of the real person's posture. We take the joints of this skeleton as the base and construct the fused skeleton joints by examining the confidence value of each joint of the base skeleton. There are three possible confidence values returned by the Kinect recognition algorithm: high, medium and low. If the confidence of the base joint is high, it is left as is for the fused skeleton. Otherwise, if the confidence is medium or low, the joint position is corrected by taking the remaining skeletons into account. If corresponding joints with high confidence are found in any of the remaining skeletons, their average position is used to replace the position value of the joint. If there are no corresponding joints with high confidence, the same procedure is applied to joints tracked with medium confidence. Lastly, if only low-confidence joints exist, their average is used as the position value of the fused joint.

Since joint averaging and switching from one base skeleton to another, in conjunction with sensor calibration inaccuracies, can introduce artifacts in the form of sudden rapid changes in joint position, a filtering stabilization step is applied to the fused skeleton stream. A time window of three frames is used in order to keep the last three high-confidence positions per joint. The centroid of these three positions is calculated and updated at each frame. If the Euclidean distance between a joint position and this centroid is higher than a certain threshold, the joint position is replaced by the centroid, so as to avoid rapid changes in joint positions. The thresholds are different for each joint, since some joints (hands and feet) are expected to move more rapidly than others. In our Tsamiko dance experiments, these thresholds were set to 40 cm for the feet joints and 20 cm for the remaining joints.
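The fusion rules described above can be summarised by the following sketch (plain C++ with Eigen; the data structures and the helper function are ours, introduced only to illustrate the strategy, and the stabilization filter is omitted):

    #include <cstddef>
    #include <vector>
    #include <Eigen/Dense>

    enum class Confidence { Low = 0, Medium = 1, High = 2 };
    struct Joint { Eigen::Vector3f position; Confidence confidence; };
    using Skeleton = std::vector<Joint>;   // same joint ordering for every sensor

    // Average the positions of joint j over the skeletons whose confidence equals c.
    static bool averageAt(const std::vector<Skeleton>& all, std::size_t j,
                          Confidence c, Eigen::Vector3f& out)
    {
        Eigen::Vector3f sum = Eigen::Vector3f::Zero();
        int n = 0;
        for (const Skeleton& s : all)
            if (s[j].confidence == c) { sum += s[j].position; ++n; }
        if (n == 0) return false;
        out = sum / static_cast<float>(n);
        return true;
    }

    // Fuse the calibrated skeletons of one frame.
    Skeleton fuseSkeletons(const std::vector<Skeleton>& all)
    {
        // The base skeleton is the one with the highest total confidence.
        auto total = [](const Skeleton& s) {
            int t = 0;
            for (const Joint& j : s) t += static_cast<int>(j.confidence);
            return t;
        };
        std::size_t base = 0;
        for (std::size_t i = 1; i < all.size(); ++i)
            if (total(all[i]) > total(all[base])) base = i;

        Skeleton fused = all[base];
        for (std::size_t j = 0; j < fused.size(); ++j) {
            if (fused[j].confidence == Confidence::High) continue;   // keep reliable base joints
            Eigen::Vector3f avg;
            // Otherwise replace by the average of high-, then medium-, then low-confidence joints.
            if (averageAt(all, j, Confidence::High, avg) ||
                averageAt(all, j, Confidence::Medium, avg) ||
                averageAt(all, j, Confidence::Low, avg))
                fused[j].position = avg;
        }
        return fused;
    }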

3.1.1.1.4 Future improvements

The body skeleton tracking algorithms presented in the previous sections have been implemented in the Kinect Acquisition Tool and validated on the Tsamiko dance recording sessions. The initial version has a number of limitations that will be addressed in future versions. Skeleton capture and fusion is supported for a single person only; this will be extended to concurrent capture of several people. The skeletal fusion algorithm is only applied to joint positional data; joint rotation fusion is planned as a next step, and a combined fusion strategy taking into account both position and rotation data could also be investigated. In addition, an optimization of the data capture process could make it possible to perform motion capture with multiple sensors from a single PC. In its current form, each sensor requires a dedicated PC for capture sessions, for performance reasons.

3.1.1.2 Human body tracking based on fusion of multiple depth maps

As explained in the introduction of Section 3.1.1, the second approach to the multi-camera setup is to fuse the depth maps from every sensor and to compute the skeleton extraction on this basis in a second step. Shotton et al. [25] describe an algorithm to identify the different body parts in a depth map by classifying its points, as can be seen in Figure 6. Afterwards, nodes are computed to represent each skeleton articulation. These nodes could be computed on the basis of the fused depth maps.

Figure 6: Example of a pixel classification with the different body parts

The fusion of data from two or more cameras requires a calibration phase: the position of every sensor in the recording space must be known in order for the different depth maps to be combined. Section 3.1.1.2.1 explains the theory used in this work, and Section 3.1.1.2.1.1 gives a technical description of the applications implemented to record data from the multi depth sensor network.

3.1.1.2.1 Description

The calibration of a camera involves determining its physical properties and its position and orientation in space [24], [27]. Estimation of the physical properties (focal length, image center and some distortion factors) is known as the intrinsic calibration. The position and rotation of the camera give the extrinsic parameters. So, if we know the position of an object, we can predict the image that we will obtain with a camera. As the intrinsic parameters of a Kinect do not change between sessions, we focus on computing the extrinsic matrix: we compute the pose of every camera by using, for example, one of them as the reference. This transformation takes the form of a matrix with a rotation part and a translation vector.

Denoting by (u, v) the pixel coordinates in an image and by (X, Y, Z) the coordinates of the corresponding point in real space, the projection can be written as

    s [u v 1]^T = K [R | t] [X Y Z 1]^T

where K holds the intrinsic parameters and [R | t] is the extrinsic rotation-translation matrix.

Thanks to the depth map, we can invert this relation: indeed, we know the distance in real space between each point and the camera, so we can transform an image acquired by a depth camera into a three-dimensional point cloud. Figure 7 shows one, two and three point clouds computed on the basis of their respective depth maps; the point clouds were aligned after a calibration step.

Figure 7: Visualization of point clouds acquired by depth cameras. One point cloud is shown in (a); (b) and (c) show 2 and 3 calibrated point clouds respectively.
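The inversion described above amounts to back-projecting every depth pixel through the intrinsic parameters. A minimal sketch of this step is given below (plain C++ with PCL types; the intrinsic values shown are typical Kinect defaults used for illustration only, not calibrated values):

    #include <cstdint>
    #include <vector>
    #include <pcl/point_cloud.h>
    #include <pcl/point_types.h>

    // Back-project a depth map (in millimetres) into a 3D point cloud in the camera frame.
    // fx, fy, cx, cy are the intrinsic parameters; the defaults below are illustrative.
    pcl::PointCloud<pcl::PointXYZ>::Ptr depthToCloud(const std::vector<uint16_t>& depth,
                                                     int width, int height,
                                                     float fx = 525.f, float fy = 525.f,
                                                     float cx = 319.5f, float cy = 239.5f)
    {
        pcl::PointCloud<pcl::PointXYZ>::Ptr cloud(new pcl::PointCloud<pcl::PointXYZ>);
        for (int v = 0; v < height; ++v)
            for (int u = 0; u < width; ++u) {
                const uint16_t d = depth[v * width + u];
                if (d == 0) continue;               // no depth measurement at this pixel
                const float z = d * 0.001f;         // millimetres to metres
                // Inversion of the projection equation: X = (u - cx) z / fx, Y = (v - cy) z / fy.
                cloud->push_back(pcl::PointXYZ((u - cx) * z / fx, (v - cy) * z / fy, z));
            }
        return cloud;
    }

The calibrated extrinsic matrix of each camera is then applied to its cloud so that all clouds share the reference coordinate system.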

Different solutions exist to calibrate such a system, requiring more or less user participation. Criteria such as the proximity and alignment of the sensors or the overlap between their fields of view determine the choice of the appropriate calibration method. Figure 8 illustrates examples of setups that will be used for the Walloon traditional dance database recordings. We decided to propose a semi-automatic procedure to calibrate the system: if the automatic detection fails, the user can interact with a graphical interface to help the calibration. The calibration procedure was realized using the Point Cloud Library (PCL), which provides algorithm implementations and graphical tools for manipulating 3D point clouds [26]. The first stage of the procedure establishes an approximate correspondence between points in the depth captures from the different devices. With correspondences between at least 3 non-aligned points in each device, the extrinsic calibration is possible; the transformation computation is based on singular value decomposition. The second stage is a fine-tuning that exploits every point of the clouds. It is necessary because point selection is not accurate with devices such as the Kinect camera. This fine-tuning is based on the Iterative Closest Point (ICP) algorithm proposed by Besl and McKay [23]: ICP iteratively searches for the most likely correspondences between the whole point clouds and computes the affine transformation that minimizes the difference between these clouds. This iterative algorithm provides a local minimum, which is not necessarily the optimal solution. The first stage helps the ICP algorithm find the optimal transformation and avoid this pitfall.

Figure 8: Setup example with 4 or 6 Kinect around the recording area (2mx2m square).
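A minimal sketch of the first, coarse stage is shown below (plain C++ against the PCL API; the function and variable names are ours). The resulting matrix can then be passed as the initial guess to the ICP fine-tuning stage, for example through the align() overload that accepts a starting transformation, as in the sketch given earlier.

    #include <pcl/point_cloud.h>
    #include <pcl/point_types.h>
    #include <pcl/registration/transformation_estimation_svd.h>
    #include <Eigen/Dense>

    // Coarse extrinsic estimate from at least three selected point correspondences:
    // selectedOnSensor[i] and selectedOnReference[i] must designate the same physical point.
    Eigen::Matrix4f coarseExtrinsic(const pcl::PointCloud<pcl::PointXYZ>& selectedOnSensor,
                                    const pcl::PointCloud<pcl::PointXYZ>& selectedOnReference)
    {
        pcl::registration::TransformationEstimationSVD<pcl::PointXYZ, pcl::PointXYZ> svd;
        Eigen::Matrix4f coarse = Eigen::Matrix4f::Identity();
        // Singular-value-decomposition-based rigid transformation estimation.
        svd.estimateRigidTransformation(selectedOnSensor, selectedOnReference, coarse);
        return coarse;   // used to initialise the ICP fine-tuning over the full clouds
    }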

The above-mentioned procedures can be extended with a full reconstruction of the recording room: if we have a 3D reconstructed model, it is possible to position the camera views inside it. By this means, we could decrease the inaccuracy provoked by a short overlap between the fields of view of the cameras.

3.1.1.2.1.1 3D Acquisition software

In this section, we describe the software developed to calibrate the multi depth camera system and to synchronize the data recording. We implemented the sar (Smart Attentive Room) application suite in C++ and tested it on Linux and MacOS. The implementation uses the following libraries:

CMake - open source, cross-platform build system (http://www.cmake.org/)
Qt4 - cross-platform application and UI framework (http://qt-project.org/)
OpenCV - open source computer vision and machine learning software library (http://opencv.org)
PCL - open source library for 2D/3D image and point cloud processing (http://pointclouds.org/)


VTK - open source library for 3D computer graphics, image processing and visualization (http://www.vtk.org/)
OpenNI - open source library used for the development of 3D sensing applications (http://www.openni.org/)

3.1.1.2.1.1.1 Calibration tools: sar_calibration

With this calibration tool, the user can compute the affine transformation that positions in a common space the point clouds captured by the different cameras. The data must be stored in the PCD file format (defined by the Point Cloud Library).

Figure 9: Processing chain to compute the extrinsic calibration matrix in sar_calibration

The graphical user interface version provides three ways of selecting 3D points in order to establish the necessary correspondences between the point clouds. Two 3D windows are displayed: on the left side, we can switch between the original point clouds; on the right side, we visualize the calibration result with the fused point clouds. Patterns such as chessboards can be detected automatically to facilitate point selection. If no pattern is used, the user can manually select the points either in a 3D visualization or in a classical camera view. The selection order is important to define the correspondence. The calibration data are stored in XML files.


Figure 10: sar_calibration graphical user interface

3.1.1.2.1.1.2 Recording tools: sar_commander and sar_recorder

A host computer drives each depth camera.

Figure 11: Client - server architecture in sar system

The OpenNI library defines a file format with the .ONI extension to store the depth maps and the video stream. To fuse the data from the different devices, in addition to a calibrated setup it is necessary to obtain synchronized information. We implemented two applications to ensure this synchronization during a recording session between all the depth camera hosts: sar_recorder and sar_commander. Each host runs sar_recorder, which can display the point cloud in real time in a 3D graphical interface and record the data in the ONI format.


Figure 12: Graphical user interfaces of sar_commander and sar_recorder

Second, we have sar_commander, a master user interface with textual commands. The hosts, through their sar_recorder application, must initiate a TCP LAN connection with this application. Currently, sar_commander can send the following commands to every host:

- startGrabber and stopGrabber: launch and stop, for all the hosts, a real-time display of the point cloud.
- setupRec: command the host to initiate OpenNI for a recording action.
- startRec and stopRec: launch and stop the recording of depth and RGB frames.
- setTime: transmit a common time to every host.

A logging file in ASCII format is added to the ONI file. This file contains a flag for each recorded frame with the temporal information, so that after the recording session it is possible to temporally align the ONI files in a post-processing stage. The timestamp for each frame contains the following information: the current frame identification number; the time in ms that has elapsed since the last frame acquisition; and the date of the current frame acquisition (dd.MM.yyyy.hh:mm). Some other commands are being tested to improve the synchronization.

3.1.1.2.1.2 Future improvements

We now have an adequate system to capture a performer's movement with a depth sensor network. The principles used in the sar applications were validated in a domain other than dancer motion capture: the analysis of social behaviors and proxemics [29], [28]. These tools are adequate for research purposes. The fusion is not processed in real time: this step is performed after the recording. It will be necessary to integrate sar_calibration, sar_commander and sar_recorder into an ergonomic solution if we wish to meet general public use requirements. It is planned to study the calibration of such a system for a wider effective recording area. The calibration system should be improved to remain efficient with less overlap between the fields of view of the sensors. Future work will also include the extraction of body motion features based on these fused depth clouds.

3.1.2 Full body gesture recognition

Human action or gesture recognition is a very challenging problem that has driven very active research work in academia and has recently attracted a lot of attention from industry, with the explosion of natural user interfaces (NUI). However, this problem is still far from being solved. The problem of human action or gesture recognition is usually subdivided into three successive issues to be addressed:

- capture and recording of the human subject;
- feature extraction from raw motion capture data;
- classification of poses or sequences of features.

A usable gesture and action recognition method would ideally be accurate, flexible, easy to extend, real-time, independent of subject identity, robust to occlusions, require minimal effort for capturing the data (e.g. markerless and affordable equipment), be capable of discriminating between large sets of gestures, etc. Since no ideal solution exists, the constraints and mandatory requirements that have to be taken into account in the design of a solution will depend on the considered application, and different custom approaches will be built for different problems. To that extent, the efficiency of a gesture/action recognizer tightly relies on how the given problem has been scoped and understood, and how this understanding has informed the appropriate choices in the three above-mentioned categories of issues. A wide range of applications require gesture or action recognition, in fields such as security, health and medicine, games and entertainment, education, culture, and sports.

The gesture recognition tasks envisioned in the i-Treasures project are particularly challenging. Depending on the use cases, motion recognition modules will be needed for full body, upper body or hand motion analysis, including the segmentation of continuous sequences into meaningful gestures. For both the full/upper body and the hand motion use cases, the motion is extracted and represented as the motion of simplified skeleton models. The gesture recognition tasks encountered in i-Treasures should be robust to different performers and very precise and accurate, as the results have to be stored in the reference database. Moreover, we expect not only the high-level gestures to be recognised, but possibly also the style of the performed gesture. In some cases, a different style will mean different basic gestures (the limb trajectories are different, for instance the arms go up instead of going down), and the gesture recognition problem is unchanged. But in other cases, the motion style lies in the variability of the motion: the functional motion (the limb trajectories) is the same, but the way it is performed is different (for instance brisker, smoother, etc.). In that case, more precise models are needed and the features must be able to represent all the different kinds of motion. Indeed, arbitrary simplifications must be avoided where possible, so as not to lose information that could become meaningful for categorizing the gesture style.

In addition to this analysis task, gesture recognition must also be performed in the learning scenarios. Indeed, if we want to include real-time feedback to the student while he/she is learning the expert's gestures, the student's motion must be recognized in real time in order to assess his/her performance. This requirement brings the current research to the very edge of the available technology: the tracking and recognition of on-going activities and gestures is still largely an open and challenging problem, and existing tools address it only for very limited use cases. Two different approaches to full body gesture recognition have been implemented and are presented hereunder. The first one, exposed in Section 3.1.2.1, is based on HCRF for gesture recognition with the Kinect skeleton and has been tested in the Tsamiko dance use case. The second one, presented in Section 3.1.2.2, is based on Hidden Markov Models for real-time gesture recognition and tracking, with more precise motion capture data. For both approaches, both the pre-processing of the data and the recognition module itself are presented.

3.1.2.1 Gesture recognition with HCRF based on a depth map extracted skeleton

The motion analysis algorithm is independent of the data acquisition process. It can work either on the original skeleton animation streams acquired from a single sensor, or on the fused skeleton stream after the fusion process from multiple sensors. After a view invariance transform, the skeleton is split into five parts. For each part a symbol is produced per frame, which represents the current posture label. A sequence of those labels is fed to a classifier which categorizes this pattern as one of the predefined motion sequences to be detected. This workflow is depicted in Figure 13, which presents the motion analysis subsystem.

Figure 13: Motion Analysis subsystem (skeleton stream, view invariance transform, skeleton parts (torso, left/right hand, left/right foot), multi-class SVM posture classifiers, posture word sequence, HCRF, detected motion pattern)

A pre-processing step for motion recognition is a view invariance transformation of each joint. The skeleton is translated relative to the root joint (so that the root is located at the origin of the coordinate space) and rotated around the vertical axis so as to face in the positive z direction. Next, the skeleton is divided into five sub-parts: Body, Left Hand, Right Hand, Left Foot, Right Foot (Figure 14). This potentially provides more flexibility for the motion detection phase, since we can consider motion patterns where only parts of the body are important for the execution of the motion.
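A sketch of this view-invariance normalisation is given below (plain C++ with Eigen; the joint indices and the use of the hip joints to define the facing direction are illustrative assumptions, not the exact convention of the Kinect Acquisition Tool):

    #include <cstddef>
    #include <vector>
    #include <Eigen/Dense>
    #include <Eigen/Geometry>

    // Illustrative joint indices into the skeleton array (actual indices depend on the SDK).
    constexpr std::size_t ROOT = 0, HIP_LEFT = 12, HIP_RIGHT = 16;

    // Translate the skeleton so that the root joint sits at the origin, then rotate it
    // about the vertical (y) axis so that the body faces the positive z direction.
    void makeViewInvariant(std::vector<Eigen::Vector3f>& joints)
    {
        const Eigen::Vector3f root = joints[ROOT];
        for (Eigen::Vector3f& j : joints) j -= root;              // root at the origin

        // The hip axis, projected onto the horizontal plane, encodes the body orientation.
        Eigen::Vector3f hipAxis = joints[HIP_RIGHT] - joints[HIP_LEFT];
        hipAxis.y() = 0.f;

        // Rotation about y aligning the hip axis with the x axis, i.e. the body with +z
        // (under the left/right and axis conventions assumed here for illustration).
        const Eigen::Quaternionf toFront =
            Eigen::Quaternionf::FromTwoVectors(hipAxis, Eigen::Vector3f::UnitX());
        for (Eigen::Vector3f& j : joints) j = toFront * j;
    }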


Figure 14: Skeleton sub-parts (Body, Left Hand, Right Hand, Left Foot, Right Foot), with child joints linked to the root joint by bones/links

For each of the skeleton parts, a feature vector is constructed. In its simplest form it consists only of the 3D positions of the joints belonging to the sub-part. These feature vectors are then transformed into symbols assigned by multi-class SVM classifiers, trained to partition the posture space into m discrete postures for each part.

3.1.2.1.1 Hidden Conditional Random Fields

Conditional Random Fields (CRFs) are a class of statistical modelling methods often applied to pattern recognition problems and machine learning in general. CRFs are a generalization of Hidden Markov Models and are popular in natural language processing, object recognition and motion recognition tasks. CRFs are discriminative undirected probabilistic graphical models which can encode known relationships between observations. The nodes of a CRF can be divided into two sets, X and Y, denoting the observed and output variables respectively (Figure 15).

Figure 15: HCRF model

For the motion detection step, we have selected the Hidden-state Conditional Random Fields (HCRF) classifier ([30], [31]). The workflow for training and recognition using HCRFs is similar to that of HMMs. For multi-class classification problems, multiple HCRFs are trained, each modeled to detect a specific motion pattern. In the recognition phase, a sequence of symbols is given as input to each HCRF, and the identification of a motion pattern is based on the probability/likelihood assigned by each HCRF model to the observation sequence. For the implementation of HCRF, the Hidden-state Conditional Random Fields Library v2 was used (http://sourceforge.net/projects/hcrf/).
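For reference, the quantity evaluated for a trained HCRF is the conditional probability of the class given the observation sequence, obtained by marginalising over the hidden states h (a standard formulation from the HCRF literature, written here in LaTeX; x is the symbol sequence, y the motion pattern label and theta the model parameters):

    P(y \mid x; \theta) \;=\; \sum_{h} P(y, h \mid x; \theta)
        \;=\; \frac{\sum_{h} \exp\big(\Psi(y, h, x; \theta)\big)}
                   {\sum_{y'} \sum_{h} \exp\big(\Psi(y', h, x; \theta)\big)},
    \qquad \hat{y} \;=\; \arg\max_{y} P(y \mid x; \theta)

Here Psi is the potential function summing the feature functions over the sequence; since one HCRF is trained per motion pattern in our setup, the pattern whose model assigns the highest likelihood to the observation sequence is selected.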

3.1.2.1.2 Experimental results

For the evaluation of our motion recognition algorithm, we applied it to the problem of detecting the three basic Tsamiko dance moves of which the dance consists. Eight repetitions of the basic Tsamiko dance pattern (three dance moves per repetition) executed by three dancers were recorded and manually annotated. We split the recorded data into training and test sets by using half of the repetitions of the basic dance pattern of each dancer (12 repetitions per move) for training the HCRFs and the remainder for testing. We trained HCRFs on the training sequences to distinguish between the three basic Tsamiko dance moves. HCRFs with a varying number of hidden states were trained, as can be seen in Table 1, in which the dance move detection accuracies on the test set are presented per dancer and overall. The best overall detection accuracy achieved is 93.9%, using an HCRF with 11 hidden states.

Hidden states     5      8      11     12     15     20
Dancer A          38.4   61.5   84.6   76.9   76.9   69.2
Dancer B          90.9   90.9   100    100    90.9   72.7
Dancer C          66.6   88.8   100    100    100    77.7
Overall           63.6   78.7   93.9   90.9   87.8   72.7

Table 1: Recognition accuracies (%) of Tsamiko dance moves per dancer and overall, for varying numbers of hidden states in the HCRF classifier

3.1.2.2 Hidden Markov Models for Gesture Recognition

3.1.2.2.1 Data representation and feature extraction

As described in the introduction, the motion data considered in this motion recognition module is skeletal data. This skeletal data can be recorded using any motion capture technology: Kinect, multiple Kinects, inertial motion capture, optical motion capture, etc. The only requirement is that the format of the data on which the recognition must be conducted is the same as the format of the training data. The recognition module is easily adaptable to any number of data dimensions. However, since a complete and segmented dance motion database was not yet available when the HMM-based motion recognition module was being developed, the tests were conducted on a walk recognition task. At this stage, we consider that the accuracy and relevance of the developed technology will be preserved once we acquire a dance motion database later in the project, since HMMs have already been used with much success for dance analysis in other works, and since dance figures are longer than step motion segments and often present recognizable patterns in the motions of all the limbs (and not only in the legs, as is mostly the case for walking). The models were trained on the MOCKEY stylistic walk database ([32]), which aims at studying the expressivity of walk motion. In this database, a single actor walks back and forth while adopting eleven different "styles". These styles corresponded to different emotions, morphology personifications, or situations, and were arbitrarily chosen because of their recognizable influence on walk, as illustrated in Figure 16. The acted styles were the following: proud, decided, sad, cat-walk, drunk, cool, afraid, tiptoeing, heavy, in a hurry, manly.

Figure 16: Four example postures from the MOCKEY database. From left to right: sad, afraid, drunk and decided walks.

The motion was captured with an inertial mocap suit, the IGS-190 from Animazoo ([33]), containing eighteen inertial sensors. The motion data is described by the evolution over time of the 3D Cartesian coordinates of the root of the skeleton, along with the eighteen 3D angles corresponding to the orientation of the skeleton root and of the seventeen joints of the simplified skeleton used to represent the human body (Figure 16). The global position of the skeleton was discarded in our application, since it is extrapolated in the Animazoo system from the angle values, and should in any case be removed in order to make the recognition independent of the position of the skeleton in the considered 3D space. Each body pose is hence described by 18 * 3 = 54 values per frame. We chose to model the rotations of the eighteen captured joints instead of their Cartesian coordinates so as to make the models more robust to skeleton size variations between different subjects. We converted the 3D angles from their original Euler parameterisation to the exponential map parameterisation ([34]), which is locally linear and in which singularities can be avoided. The motion data was captured at a rate of 30 fps. The walk sequences were annotated into right and left steps, thanks to an automatic segmentation algorithm based on the hip joint angles. These two class labels correspond to the basic gestures recognized in our proof-of-concept real-time walk recognition algorithm.
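The conversion from the Euler parameterisation to the exponential map can be sketched as follows (plain C++ with Eigen; the ZYX composition order is an assumption for illustration, the actual order depends on the capture format):

    #include <Eigen/Dense>
    #include <Eigen/Geometry>

    // Convert one joint rotation from Euler angles (radians) to the exponential map:
    // a 3D vector whose direction is the rotation axis and whose norm is the rotation angle.
    Eigen::Vector3f eulerToExpMap(float rx, float ry, float rz)
    {
        // Illustrative ZYX composition; the true convention depends on the capture format.
        const Eigen::Quaternionf q = Eigen::AngleAxisf(rz, Eigen::Vector3f::UnitZ())
                                   * Eigen::AngleAxisf(ry, Eigen::Vector3f::UnitY())
                                   * Eigen::AngleAxisf(rx, Eigen::Vector3f::UnitX());
        const Eigen::AngleAxisf aa(q);
        return aa.angle() * aa.axis();   // 3 values per joint, 18 joints -> 54 values per pose
    }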

3.1.2.2.2 HMMs for gesture recognition

Different approaches can be found for gesture recognition in the literature. There is an inherent variability associated with the execution of human gestures: the duration of a given motion will be different for each execution of that motion. Since motion can be defined as the evolution of body poses over time, the time variability of gestures implies that several realisations of the same gesture will correspond to a different number of parameters, since a different number of poses will have to be considered. Several solutions exist to deal with this issue. One is to cluster the pose space and consider the gesture not as a continuous sequence of poses, but as a sequence of a fixed number of pose classes; this is, for instance, the approach implemented in the HCRF (Hidden Conditional Random Fields) recognition approach. Another popular approach is the dynamic time warping (DTW) procedure and its variants, which consists in realigning the on-going gesture with a reference gesture. A third approach consists in modelling the time series with statistical models that take the time variability into account. The most popular models for such a task are Hidden Markov Models (HMMs). In addition to being able to model the time variability of time series, HMMs have been widely investigated and developed over the last decades for speech recognition applications, which are very close to the motion recognition problem. In this work, we used HMM tools designed for speech applications and adapted them to our motion recognition problem. In previous work ([35]), we have shown that such models can be used to synthesise new gestures while taking the style into account, which means that all of the stylistic information is encompassed in the model. The same models can hence be used for recognition and stylistic recognition.

3.1.2.2.3 Some theory about HMMs

Hidden Markov Model (HMM) classifiers were first used by the speech research community and developed for speech recognition applications. However, their use was rapidly extended to other use cases, such as handwritten word recognition or action recognition. HMMs are widely used for the modelling of time series and have been used for motion modelling and recognition since the nineties. One of the advantages of HMMs is that they avoid the need for the time warping required in most approaches to align sequences prior to analysing them or extracting the style component among them. Thanks to their statistical nature, HMMs integrate directly in their modelling both the time variability and the stylistic variability of the motion. In addition to recognition applications, the last decade has seen rising interest in the use of HMMs for generation, especially with the development of tools such as the HTS toolkit developed for speech synthesis ([36]). This further extends the interest of HMMs as a modelling tool for parameter sequences, as almost the same models can be used to achieve the recognition and generation tasks. We have shown recently that HMM-based generation can be very useful for the real-time exploration of a stylistic motion space ([37]).

A HMM consists of a finite set of states, with transitions between the states governed by a set of probabilistic distributions called transition probabilities. Each state is associated with an outcome (more generally called observation) probability distribution. Only this observation is visible; the state is called hidden or latent: at each time t, the external observer sees one observation o_t, but does not know which state produced it. HMMs are doubly stochastic processes, since both the state transitions and the output distributions are modelled by probabilistic distributions. Particular HMM structures can be designed for specific applications. HMMs have been the subject of very active research and have been adapted to many different use cases; a huge number of variations of the basic HMM can hence be found in the literature. The common ground between all these types of HMMs is that an important number of parameters need to be defined in order to design the topology of the HMM and to train it. The left-to-right HMM with no skip transitions illustrated in Figure 17 is an example of such a specific HMM, and is the one used in our motion modelling and recognition application. A left-to-right model with no skips is a model in which the only possible state transitions at each time step are either to stay in the same state or to go to the next state. We define a basic gesture as the basic sequence of motion data that is always executed with the same combination of limb trajectories. The left-to-right model topology is perfectly suited for modelling such motion entities. As typical dance figures always contain the same sequence of basic gestures, the left-to-right topology also corresponds to the dance figure application.


The complete characterisation of an HMM requires the specification of the number of states and the definition of two probability measures: the state transition probabilities ti,j between each pair of states (si, sj), and the probability density functions (pdfs) ei of the observations in each state si. In continuous HMMs, these pdfs are often modelled by a mixture of Gaussians, or by single Gaussians as illustrated in Figure 17. A compact notation, λ, is often used to refer to the whole set of parameters defining the HMM.

Figure 17: Structure and definition of a three-state left-to-right HMM with no skips.
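To make the notation concrete, the following minimal sketch (Python with NumPy, purely illustrative and not part of the i-Treasures implementation) writes down the parameter set λ of a three-state left-to-right HMM with no skips such as the one of Figure 17. All numerical values are arbitrary placeholders.

import numpy as np

# Illustrative only: parameters lambda of a 3-state left-to-right HMM
# with no skips, with one Gaussian per state.
n_states = 3

# Initial state distribution: the model always starts in the first state.
startprob = np.array([1.0, 0.0, 0.0])

# Transition probabilities t_ij: self-loop or move to the next state only;
# the zeros enforce the "no skip" left-to-right topology.
transmat = np.array([[0.9, 0.1, 0.0],
                     [0.0, 0.9, 0.1],
                     [0.0, 0.0, 1.0]])

# Emission pdfs e_i: a single Gaussian per state (mean and variance),
# here for one-dimensional observations.
means = np.array([[0.0], [1.0], [2.0]])
variances = np.array([[1.0], [1.0], [1.0]])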

3.1.2.2.4 HMM training

The process by which the parameters of the HMMs are determined is called training. The computations related to HMM training were performed using the "Hidden Markov Model Toolkit" (HTK) software, developed by the engineering department of Cambridge University. HTK consists of a set of modular function libraries that make it possible for the user to build and manipulate Hidden Markov Models. HTK is mainly designed and used for speech recognition applications; with a small adaptation of the default training and modelling parameters, however, it can also be used for other applications, such as dance motion modelling in our case. The training data have to be converted to the HTK format, along with a time-aligned transcription of the training data files. Once the data have been prepared for training and the structure of our HMMs has been defined, the next step consists in estimating the parameters of the HMMs from motion data sequences. The training was performed using the Baum-Welch parameter re-estimation formulas. Baum-Welch training can be used to estimate the HMM parameters from a collection of data segments. This collection is built upon the annotation that was provided, i.e. the training function takes all the segments of the continuous motion sequence for which the annotated label is the one corresponding to the HMM being estimated. It can also be used by taking into account only the transcription of the file, and not the beginning and ending times of the basic gesture segmentation. One option is to perform a first estimation of the HMM parameters based on the segmented and labelled gestures before estimating the parameters on the whole motion sequence without taking the hand-labelled gesture boundaries into account. The other option is to skip the first step and estimate the model parameters directly from the whole motion sequence.


A training script has been written (Deliverable 3.2) that enables easy manipulation of the different training parameters: number of observation dimensions, number of iterations for each training phase, addition of derivatives of the observations to the model, etc. This training script also takes into account the training of the so-called "garbage" HMM. This special model corresponds to any motion frame that is not part of one of the gestures to be recognized by our system.
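The actual training in this work is performed with HTK; purely as an illustration of the same Baum-Welch re-estimation on a left-to-right model, the sketch below uses the open-source hmmlearn package. Class names are real hmmlearn API, but the file names, feature dimensions and number of iterations are hypothetical, and this is not the project's training script.

import numpy as np
from hmmlearn import hmm

# Illustrative sketch only: the project uses HTK, not hmmlearn.
# X: motion feature frames of one gesture class, stacked as (n_frames, n_dims);
# lengths: number of frames of each annotated training segment.
X = np.load("gesture_left_step_frames.npy")         # hypothetical file
lengths = np.load("gesture_left_step_lengths.npy")  # hypothetical file

n_states = 5
model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=20, init_params="mc", params="stmc")

# Enforce the left-to-right, no-skip topology: Baum-Welch never turns a
# zero transition probability into a non-zero one, so the structure is kept.
model.startprob_ = np.eye(n_states)[0]
model.transmat_ = 0.5 * (np.eye(n_states) + np.eye(n_states, k=1))
model.transmat_[-1, -1] = 1.0

model.fit(X, lengths)   # Baum-Welch re-estimation of means, covariances, transitions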

3.1.2.2.5 Global motion model

As explained in the previous paragraph, we consider one left-to-right HMM per basic gesture to be recognized, plus one model for "garbage" motion (sometimes called a "filler" model). However, for continuous gesture recognition, a global model of the motion is required. In order to obtain such a model, the basic gesture left-to-right HMMs and the filler model need to be connected. We chose not to make any a priori assumption on the possible sequence of gestures, and hence to connect all the basic HMMs in parallel, which means that each gesture can follow any other gesture. This global model is illustrated for our walk motion use case in Figure 18. Our complete model hence consists of eleven states for the walk step recognition example: five states per step, plus one state for the filler or garbage model.

Figure 18: Global motion model of walk combining the basic gestures HMMs.
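One possible way to realize such a parallel connection is sketched below: the per-gesture transition matrices are placed on the diagonal of a global matrix, and part of the self-loop mass of each model's last state is redistributed to the entry states of all models. The exit probability p_exit and the uniform prior over models are assumptions; the document does not specify how these connection probabilities are set.

import numpy as np

def connect_in_parallel(models, p_exit=0.1):
    """Illustrative sketch: build a global transition matrix from a list of
    (startprob, transmat) pairs, one per basic gesture or filler model.
    From the last state of each model, the decoder may exit (with the
    assumed probability p_exit) to the entry state of any model, so that
    every gesture can follow any other gesture."""
    sizes = [t.shape[0] for _, t in models]
    total = sum(sizes)
    T = np.zeros((total, total))
    offsets = np.cumsum([0] + sizes[:-1])

    for (start, trans), off, n in zip(models, offsets, sizes):
        T[off:off + n, off:off + n] = trans
        # Redistribute part of the last state's self-loop as exit mass.
        T[off + n - 1, off + n - 1] = trans[-1, -1] * (1.0 - p_exit)
        exit_mass = trans[-1, -1] * p_exit
        for (s2, _), off2 in zip(models, offsets):
            T[off + n - 1, off2:off2 + len(s2)] += exit_mass / len(models) * s2
    return T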

3.1.2.2.6 Viterbi algorithm for HMM-based motion recognition

Once a global motion model has been trained and built, we can tackle the recognition task. When using HMMs, the recognition problem consists in decoding the most likely sequence of hidden states corresponding to a new sequence of observations, given the parameters of the model. Since each state corresponds to one label (each label corresponding to one basic gesture HMM), decoding the most likely sequence of states also decodes the most likely sequence of labels. The Viterbi algorithm is a dynamic programming algorithm for decoding this most likely sequence of states, which is also called the Viterbi path. The Viterbi algorithm consists of two parts: the forward decoding and the backward path tracking. If we consider a sequence of observations (motion data in our case), the task of the Viterbi algorithm is to associate each observation with the most likely state of the global model. In our illustration of the problem, we will only consider five states (instead of eleven) for readability reasons. The basic problem is illustrated in Figure 19. At each time t, the observation ot can correspond to any of the five states.

Figure 19: HMM-based motion recognition: illustration of the state decoding problem.

The first step of the algorithm is to associate with each state a probability obtained from the transition and emission probability density functions of the model. For each state, five different probabilities are computed in our case: the probability of being in state i at time t-1 multiplied by the probability of transiting from state i to the considered state. Only the transition corresponding to the highest of the five probabilities is stored, as illustrated by the green line in Figure 20.

Figure 20: HMM-based motion recognition: forward procedure of the standard Viterbi decoding.

The same operation is performed for each state at each time step of the observation sequence, leading to five paths, each one associated with being in one of the five states at the end of the observation sequence (Figure 21).


Figure 21: HMM-based motion recognition: final result of the forward procedure of the standard Viterbi decoding.

Each position is associated with a probability, and only the highest is kept, since the Viterbi algorithm aims at finding the most likely state sequence. Starting from the most likely state at the end of the observation sequence, the algorithm can track backward the path that was taken to arrive there, and hence decode the most likely sequence of states to which it corresponds, as illustrated in blue in Figure 22.

Figure 22: HMM-based motion recognition: back tracking of the Viterbi path in the standard Viterbi procedure.
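The two parts of the algorithm (forward decoding with back-pointers, then backward tracking) can be summarized in the following compact sketch, written in Python with NumPy in the log domain. Variable names and the log-domain formulation are our own choices for the illustration; this is not the project's implementation.

import numpy as np

def viterbi(log_start, log_trans, log_emis):
    """Offline Viterbi decoding (illustrative sketch).
    log_start: (S,) log initial probabilities.
    log_trans: (S, S) log transition probabilities.
    log_emis:  (T, S) log emission probability of each observation in each state.
    Returns the most likely state sequence (the Viterbi path)."""
    T, S = log_emis.shape
    delta = np.zeros((T, S))            # best log-probability of reaching each state
    backptr = np.zeros((T, S), dtype=int)

    # Forward procedure: keep, for each state, only the best incoming transition.
    delta[0] = log_start + log_emis[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans       # (from, to)
        backptr[t] = np.argmax(scores, axis=0)
        delta[t] = scores[backptr[t], np.arange(S)] + log_emis[t]

    # Backward tracking: start from the most likely final state.
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):
        path[t] = backptr[t + 1, path[t + 1]]
    return path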

3.1.2.2.7 Real-time HMM-based motion recognition

One major issue with the use of the Viterbi algorithm is that the whole sequence of observations needs to be known in advance in order to perform the decoding. It can hence not be used as such for real-time gesture recognition. This is why we had to implement and test several adaptations of the Viterbi algorithm for real-time motion recognition. Four different approaches for online gesture recognition have been implemented and tested in our HMM-based gesture recognition module: forward only, sliding window, state stability, and fusion point.

Forward only

The forward only algorithm is quite straightforward: the probability of being in each state at each time t is computed in the same way as for the standard Viterbi algorithm. However, at each time t, a decision is taken and the most likely state is considered as the decoded state. The path is hence defined time stamp by time stamp, and impossible state transitions might be found in the path, since the decision is taken based on the most likely state regardless of transition probabilities. The forward only algorithm is illustrated in Figure 23.

Figure 23: Real-time Viterbi decoding: the forward only approach.
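The sketch below illustrates the forward-only rule in Python; the per-frame log-emission vectors are assumed to be computed elsewhere from the state Gaussians, and this is not the project's code.

import numpy as np

def forward_only_decode(log_start, log_trans, log_emis_stream):
    """Illustrative sketch of the forward-only approach: at each incoming
    frame the most likely state is emitted immediately, so impossible
    transitions may appear in the decoded path."""
    alpha = None
    for log_emis_t in log_emis_stream:          # one (S,) vector per frame
        if alpha is None:
            alpha = log_start + log_emis_t
        else:
            # Same recursion as the standard Viterbi forward pass.
            alpha = np.max(alpha[:, None] + log_trans, axis=0) + log_emis_t
        yield int(np.argmax(alpha))             # hard decision at time t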

Sliding window

The sliding window implementation is illustrated in Figure 24. It consists in computing the standard Viterbi path, but on a window containing a fixed number of observations/states, as displayed in green. Once the Viterbi path has been decoded for that time window, the first state of the path is considered as decoded (as illustrated in blue in Figure 24). Once the decision is taken, the window slides one observation/state into the future and the same procedure is repeated. This procedure is more accurate than the forward only implementation, but it introduces a small delay, since the decision is taken after a duration corresponding to the time window.

Figure 24: Real-time Viterbi decoding: the sliding window approach.
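A minimal sketch of the sliding-window variant is given below, reusing the viterbi() sketch shown earlier. The window length of 10 frames is an assumed value, and restarting each window from the initial distribution is a simplification.

import numpy as np
from collections import deque

def sliding_window_decode(log_start, log_trans, log_emis_stream, window_len=10):
    """Illustrative sketch of the sliding-window approach: the standard
    Viterbi path is decoded on the last window_len frames (an assumed
    value) and only the oldest state of that path is emitted."""
    window = deque(maxlen=window_len)
    for log_emis_t in log_emis_stream:
        window.append(log_emis_t)
        if len(window) == window_len:
            path = viterbi(log_start, log_trans, np.array(window))
            yield int(path[0])     # decision for the oldest frame in the window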

State stability

The state stability algorithm consists in computing the Viterbi path following the standard method, starting from the most likely state, but on time windows of consecutively increasing duration (Figure 25). Once the oldest state of the different Viterbi paths converges to the same state for a given minimal number of consecutive windows, this state is considered as the most likely state, and the same procedure begins again, starting with this newly decoded state as the first state of the Viterbi paths to be decoded. A variable-length delay is introduced with this algorithm. A limit is also set, so that if there is no convergence after a given number of iterations, a decision is taken.


Figure 25: Real-time Viterbi decoding: the state stability approach.

Fusion point

The fusion point algorithm has been described by Bloit and Rodet ([38]) and is illustrated in Figure 26. Its result is equivalent to the offline Viterbi decoding. The standard Viterbi algorithm is computed on a time window of fixed length. That length is increased by one sample until all paths computed in the forward Viterbi procedure converge to a common sequence of states, as illustrated in blue in Figure 26. These states are then considered as decoded, and the time window is shifted so as to begin with the last decoded state. Once again, such an approach introduces a variable-size delay in the decoding process. As for the state stability algorithm, a limit is set so that if there is no convergence after a fixed number of iterations, a decision is taken.

Figure 26: Real-time Viterbi decoding: the fusion point approach.

Figure 27 illustrates a comparison of the recognition procedure on the same walk motion sequence with the four different algorithms. Both steps (left and right) were recognized with 100% accuracy by all four algorithms. However, we observe that the sliding window and fusion point algorithms gave exactly the same result as the offline standard Viterbi, while the state stability approach performed a little worse and the forward only method was the worst.


Figure 27: Real-time Viterbi decoding: comparison of the four different approaches

3.1.2.2.8 Conclusion on HMM-based motion recognition

We have studied an implementation of a complete HMM-based motion recognition procedure, from model training to the gesture recognition itself. The approach we propose in this gesture recognition module has been designed and implemented so as to be easily adapted and tested for the different use cases of the i-Treasures project. So far, it has only been tested on walk motion data, as presented in the current document. However, the 100% recognition accuracy on this simple use case is very promising, even though it is clear that adding more than two different classes will make the problem more interesting and challenging. In addition, this first recognition task was performed on inertial motion capture data, which is very clean compared to skeleton data extracted from depth maps (i.e. coming from the Kinect). The noise coming from the Kinect will add a lot of variability and be much more constraining for the recognition algorithm. Moreover, we are not only interested in the recognition task but also in an accurate following of the motion, for which 100% accuracy was not reached with all implementations, as illustrated in Figure 27. Further investigation and comparison of the different approaches in terms of accuracy and delay will be conducted in the coming months, as soon as the first databases are recorded.

3.1.3 Hand/Finger data capture

3.1.3.1 Skeleton-based hand joints detection

3.1.3.1.1 Introduction

Recent research tendencies show an increasing interest in the identification and recognition of gestures with the use of different types of motion capture technologies: wireless motion sensor-based, marker-based, and marker-less technologies. Various types of wireless motion sensors [39][40][41] or commercial interfaces, such as the Wii remote controller [42] or the IGS-190 inertial motion capture suit from Animazoo, can provide real-time access to motion information. They are usually used for the recognition of gestures performed in space or on tangible objects, and they provide a rotation representation of the motion. Marker-based systems rely on optical-marker technology, such as Vicon Peak or OptiTrack. In [43] for instance, the Vicon system was used to capture the motion of violin players. Marker-based systems often use one of the following: a suit with integrated sensors, small infrared reflectors attached to the hand/body, or a structured pattern drawn on the user's hand/body. Marker-less systems, however, do not require subjects to wear special equipment for tracking and are usually based on passive computer vision approaches.

The research conducted in the scope of the i-Treasures project follows the same trajectory, as the aim is to record the hand gestures of artists and craftsmen without interfering with their ability to move freely. The ultimate goal is to capture fine hand and finger gestures, save them, and allow apprentices to learn them later, hence preserving the intangible heritage. In the literature, a number of sensors have been used to capture hand gestures, but our requirement of marker-less capture narrows the choices down to cameras. For a long time, the cameras used were RGB colour cameras, but this implied being able to tell the skin apart from the scene (i.e. skin detection), which is not possible in i-Treasures since we will be dealing with potters whose hands are often covered with clay. To enlarge the scope of our research, we decided to use a depth sensor, a technology made popular by the Microsoft Kinect. The sensor we picked is the PMD CamBoard Nano depth camera, which provides sub-centimetre depth resolution for a human-skin-like object (>40% reflectivity) within 50 cm of the camera. These constraints are not an issue for our use cases (Traditional Craftsmanship and Contemporary Music), since the hand usually faces the camera at a short distance.

Several methods exist to retrieve the positions of hand subparts from a depth camera. The commonly used framework is to fit a skeleton model so that it matches observable features and to apply inverse kinematics to refine the skeleton. A major reference in this field is the Kinect body retrieval algorithm described in [43], where Random Decision Forests (RDF) are trained to perform pixel-wise body classification. This approach, though initially applied to entire-body gestures due to sensor limitations, has proven robust for retrieving the hand skeleton. Hence, we have based our current research on the application of RDF to the hand skeleton model. While this may be sufficient on its own for the Contemporary Music use case, the Traditional Craftsmanship use case requires a deeper analysis of the scene, which we present below.

3.1.3.1.2 Scene segmentation in the Traditional Craftsmanship use case

In many existing algorithms, the segmentation of the user against the background is based on distance thresholding or on temporal analysis. In the case of traditional craftsmanship, neither of these approaches is acceptable, since the hand is in contact with the material and the scene is non-static. Figure 28 illustrates a typical scene, and Figure 29 shows the same scene as imaged by the PMD depth sensor. There are numerous challenges to address in this use case: self-occlusions and scene occlusions occur, the hand and the objects collide, and, as the object is made of clay, it is temporally non-rigid.


Figure 28 Typical wheel throwing scene of a potter making a bowl

Figure 29: Sample output of a depth sensor recording a wheel throwing bowl-making process

To recognize the gestures of the hand, we first need to segment the scene in order to narrow down the region of interest. We propose a new approach based on some a priori knowledge of the scene, which can be clustered into: a background, the potter's hand, and the clay together with a round plate (attached to the throwing wheel). Our proposition is to use the facts that: a) the round plate has known geometric features, b) it spins on itself, and c) the round plate and the clay object share the same axis of rotation. The complete pipeline we propose is depicted in Figure 30.


Figure 30: Complete pipeline to conduct gesture recognition in the complex case of wheel-throwing pottery (green being: implemented as a preliminary version, red: work in progress, blue: to be done).

The first step is to collect the data from multiple depth sensors positioned around the action. We then compute the 3D data from the distance estimation provided by the depth map and from the optical calibration data, using Zhang's model [44]. The several point clouds can then be registered together with approaches such as Iterative Closest Point (ICP) and refined using the confidence of the distance measurement from each camera. The scene segmentation itself is a three-step stage:

- The round plate is extracted using planar information from an iterative computation with a statistical heuristic and a geometric constraint. In practice we use MSAC (M-estimator Sample and Consensus) [45], a model-based approach derived from RANSAC (RANdom Sample Consensus) [46] that is more robust to outliers and noise (a minimal sketch of this plane extraction step is given after this list). A CAMSHIFT algorithm is used to define the parameters of the model while allowing some segmentation errors. Once retrieved, the round plate parameters give us the rotation axis (recall that this is the same axis as for the clay).

- Using the axis of rotation and the position of the plate, we hope to build the profile of the object. A solution for this problem is to use a Hough-like process and accumulate the data radially. With the profile and the rotation axis, we will be able to reconstruct the complete geometry of the pottery as a surface of revolution.

- The scene segmentation can be further simplified once we have segmented both the clay object and the round plate, as their data points may be removed from the point clouds. Hence, the hand positions will be inferred from the scene understanding.

Once we have the complete scene segmentation, we will be able to segment the hands and use our hand skeleton model to extract the skeleton and perform gesture recognition.
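As a rough illustration of the plate-extraction step referred to in the list above, the sketch below uses Open3D's RANSAC-based plane segmentation (plain RANSAC rather than the MSAC estimator actually described, so it is only an approximation of the proposed approach). The point-cloud file name and the thresholds are placeholders.

import numpy as np
import open3d as o3d

# Rough illustration only: extract the round plate as the dominant plane.
pcd = o3d.io.read_point_cloud("registered_scene.ply")   # hypothetical input

plane_model, inlier_idx = pcd.segment_plane(distance_threshold=0.005,
                                            ransac_n=3,
                                            num_iterations=1000)
a, b, c, d = plane_model                  # plane ax + by + cz + d = 0

plate = pcd.select_by_index(inlier_idx)
rest = pcd.select_by_index(inlier_idx, invert=True)

# The plane normal gives the rotation axis shared by the plate and the clay;
# a point on the axis can be approximated by the centroid of the plate inliers.
axis_direction = np.array([a, b, c]) / np.linalg.norm([a, b, c])
axis_point = np.asarray(plate.points).mean(axis=0)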

3.1.3.1.3 Hand Skeleton Extraction

Our development investigates the complex use case where the hand is executing music-like finger gestures that can be of a high order of complexity. The method used to extract the skeleton of the hand is identical for the pottery case (Traditional Craftsmanship use case); only the training databases and the segmentation process differ. The first step of the proposed method is to segment the scene, as discussed above. The second step is the training phase, in which several Random Decision Trees are constructed from a training database. The distribution of the probabilities for each label – corresponding to the hand segments – constitutes the partition. The classification step can then be done from this partition. Each pixel of each image is classified independently, directed in real time by the partition, and finally labelled. At last, computing the centroid of the corresponding label distribution estimates the position of each segment.

3.1.3.1.4 Hand model

We built a hand model with 12 labels that encompasses the hand base (palm and wrist) as well as the fingers and fingertips, as depicted in Figure 31. We believe that this model is complex enough to analyze fine hand configurations while remaining simpler than the 19-label model often used in the literature. Furthermore, working with the former (12 labels) leads to fewer classification errors than with the latter (19 labels). In the first model, depicted in Figure 31, the two lower phalanges are grouped together as one label and the fingertips are set apart for better tip position estimation. However, further experiments have shown that grouping the two lower phalanges may lead to a dead end, as they exhibit very different distance offsets when the segment is articulated. Thus, we agreed on a new model which is still articulated around 12 segments but fuses the two extreme phalanges of each finger into one segment (Figure 32). Here, the fingertip labels include two phalanges and are therefore a little less precise. In return, we believe the actual fingertips will be more easily detected and labelled in the depth images.

Figure 31 Hand model manually labeled (previous model)

Figure 32: Current synthetic model


3.1.3.1.5 Training databases

Training the implemented hand skeleton model requires a training database. We have been working on two different types of databases, which we describe below.

3.1.3.1.6 Hand labelled images

For the Contemporary Music use case only, we have extracted 500 images from a piano-like gesture recording made with the PMD depth camera. We then labelled 12 segments following the model in Figure 32. This process is long and tedious and is also prone to operator bias (two operators may label the same image differently).

3.1.3.1.7 Synthetic generation

In order to improve the response and robustness of our classification algorithm for various hand postures, we need a larger training database. As discussed above, labelling the images manually is tedious and error-prone, so we decided to generate automatically labelled images using a synthetic 3D hand model in Autodesk Maya. Starting from a physiologically realistic model initially used for 3D animation, we wrote a Python script that can generate N depth images and N automatically labelled images of N different hand positions. We thus generated databases of different sizes (100, 1000, 5000, 20000 images) and will increase this number according to our needs and to the results of the training step. One important aspect of our image generation is that the generated positions follow functions that emulate the essential movements of a piano-like gesture. To do so, we have written a function for each basic movement, such as opening the hand, closing the hand, and spreading the fingers apart and bringing them back together. The function that opens the hand is depicted, as an example, in the following pseudo-code.

Pseudo-code of the opening-hand function:

def opening_hand(number_of_images):
    # Generate number_of_images frames that gradually open the hand.
    for i in range(number_of_images):
        delta = (1.0 / number_of_images) * i       # opening amount for frame i
        for j in range(len(list_of_fingers)):
            # close_fingers and list_of_fingers are helpers defined elsewhere
            # in the Maya script (they drive the finger joints of the model).
            close_fingers(list_of_fingers[j], delta)

With these two simple motion prototypes, we are able to iteratively reconstruct all the intermediate steps and obtain all the positions that are likely to occur in a piano-like gesture. Furthermore, we have been able to generate these images from three different points of view. It is important to train the algorithm to recognize these positions, given that during recordings the depth camera is steady while the hands are moving. Finally, the colours selected for the labels are spread across the perceptual colour space (HSV), which improves the classification step.


3.1.3.1.8 Training with Random Decision Forests (RDF)

Our algorithm trains Random Decision Forests (RDF) to perform pixel-wise classification. An RDF is an extension of the Decision Tree machine learning approach, in which a complex problem is split into simple decisions depicted as the nodes of a tree (the leaves being the final decisions). For each tree of the RDF, a subset of pixels x from the images of the training database is used to train the tree. To limit the processing and memory cost, we use up to 3 decision trees with a maximal depth of 20. Then, for each node, we randomly generate 2000 candidate weak classifiers and, for each of these, 50 candidate thresholds. The weak classifier (i.e. feature) we use compares depth offsets in a specific neighbourhood: the feature at pixel x of depth image I is computed as a difference of depth levels at offsets u and v, normalized w.r.t. the depth at x. We then simulate the data partition at this node by thresholding this feature response with each candidate threshold, and compute the entropy resulting from the partition of the dataset. Finally, we keep the combination of feature and threshold that maximizes the information gain, i.e. the difference between the current node's entropy and the weighted sum of the entropies of the sub-trees resulting from the data partition. Data partition and weak classifier selection are then computed recursively on the left and right subtrees in prefix order, until either the maximum depth is reached or the information gain falls under a fixed threshold. We eventually store in the tree leaves the probability distributions of the different hand subparts. Those distributions can be recomputed afterwards using all the images of the training database, which yields slightly higher recognition rates. Our near-future aim is to perform the training stage on the 20000-image synthetic database discussed above.
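The depth-difference feature and the information-gain criterion used to select weak classifiers can be sketched as follows. The background depth constant and the pixel/offset conventions are assumptions for illustration, not the project's implementation.

import numpy as np

BACKGROUND = 10.0  # assumed large depth value for out-of-image probes

def depth_feature(depth, x, u, v):
    """Depth-difference weak classifier: difference of the depths at two
    offsets u and v, normalized by the depth at pixel x so the feature is
    (approximately) invariant to the distance to the camera."""
    def probe(offset):
        px = (int(x[0] + offset[0] / depth[x]), int(x[1] + offset[1] / depth[x]))
        if 0 <= px[0] < depth.shape[0] and 0 <= px[1] < depth.shape[1]:
            return depth[px]
        return BACKGROUND
    return probe(u) - probe(v)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, responses, threshold):
    """Gain obtained by splitting the pixel set with (feature, threshold)."""
    left, right = labels[responses < threshold], labels[responses >= threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted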

3.1.3.1.9 Pixel-wise classification

For each pixel of an input depth image, the trained decision trees independently output a probability distribution over the hand subpart assignments. A colour is thus allocated to each pixel, assigning it to a segment. This colour is a mixture of the label colours associated with the different segments; when a colour is "pure" (identical to a unique label's colour), the segment is classified without ambiguity. These distributions are then averaged over all trees of the decision forest and form the final probability of the pixel belonging to each subpart of the model.

Figure 33: Real-time hand segment classification from PMD stream

Finally, we estimate each subpart position from its probability map using the Mean Shift algorithm, which has the advantage of converging very quickly. It also allows us to filter the noise of the pixel-level classification by thresholding the probabilities before computing the density estimator.
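A minimal sketch of this last step is given below: the subpart probability map is thresholded and a flat-kernel mean-shift iteration moves towards the densest region, after which the depth map provides the third coordinate. The threshold, kernel radius and iteration count are assumed values.

import numpy as np

def subpart_position(prob_map, depth, prob_thresh=0.5, radius=15, n_iter=20):
    """Illustrative sketch: estimate one hand-subpart position from its
    pixel-wise probability map with a flat-kernel mean shift."""
    ys, xs = np.nonzero(prob_map > prob_thresh)      # keep confident pixels only
    if len(xs) == 0:
        return None
    w = prob_map[ys, xs]
    centre = np.array([xs.mean(), ys.mean()])        # start from the pixel centroid
    for _ in range(n_iter):
        d2 = (xs - centre[0]) ** 2 + (ys - centre[1]) ** 2
        inside = d2 < radius ** 2
        if not inside.any():
            break
        new_centre = np.array([np.average(xs[inside], weights=w[inside]),
                               np.average(ys[inside], weights=w[inside])])
        if np.allclose(new_centre, centre):
            break
        centre = new_centre
    # Depth at the converged 2D position gives the third coordinate.
    z = depth[int(round(centre[1])), int(round(centre[0]))]
    return centre[0], centre[1], z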

3.1.3.1.10 Performance and assessment

To measure the performance of the proposed method, we will compare the output of our pixel-wise classification on depth image recordings against a set of labelled images extracted from our database. Another, more direct and intuitive, way to assess it is to test our program RT extraction, which gives a real-time segment estimation from the PMD video stream. The following images are screenshots of the video stream obtained with RT extraction. As the user plays simple arpeggios at medium speed, the finger joints are tracked in real time.


Figure 34 Simple Piano-like gesture on RT extraction

3.1.3.2 Finger gesture recognition without using any skeletal model (UOM)

The following methodology concerns the recognition of musical gestures and, more specifically, finger gestures performed in space in the Contemporary Music Composition use case. It is based on previous works [47][48], in which computer vision techniques are applied to segment and detect the fingertips using optical cameras. The gesture recognition is based on stochastic modelling of high-level features using Hidden Markov Models (HMM) and Gaussian Mixture Models (GMM). More precisely, finger motions are captured in multiple frames with an image-depth camera. The camera is placed several centimetres above the input surface, with a well-defined angle facing the working area. To make the system able to detect hand regions in a video, we extract the bit-planes containing the necessary hand information from each frame. After the hand detection, morphological and smoothing filters (erode, dilate, Gaussian blur, etc.) are applied to the image in order to extract the hand from the background and the noise. It is very important to apply these filters with the appropriate parameters to obtain the best result on the image. Afterwards, the fingers are identified based on the geometric properties of the hand posture. The identification of fingers in the image becomes extremely difficult, especially when the distance between the fingertips is very small. To address this problem, the binary image is imported into the hand segmentation algorithm and a set of image processing methods is applied: (a) the simplification of the binary image by reducing the noise and extracting the silhouette of the hand, and (b) the decomposition of the image by extracting the contour of the hand and fingertips. The candidate finger points are found by computing the Euclidean distances between the centroid and the coordinates of the pixels belonging to the contour of the fingers; the local maxima of these Euclidean distances identify the fingers. To increase the quality of the finger identification, scale and rotation normalization techniques are applied to the captured frames. The centroid of the hand silhouette is calculated as:


xc = (1/s) Σi xi   and   yc = (1/s) Σi yi,

where xi and yi are the x and y coordinates of the i-th pixel in the hand region and s is the total number of pixels in the hand region. Finally, we calculate the Δx and Δy distances between the centroid and the fingertips (Figure 35).

Figure 35: Distances calculation
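The contour, centroid and distance-maxima steps described above can be sketched with OpenCV (v4 contour API) and SciPy as follows. The input file, thresholds and the local-maximum order are assumptions, and this is not the PianOrasis implementation.

import cv2
import numpy as np
from scipy.signal import argrelextrema

# Illustrative sketch: fingertip candidates from a binary hand mask.
mask = cv2.imread("hand_mask.png", cv2.IMREAD_GRAYSCALE)        # placeholder input
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
mask = cv2.GaussianBlur(mask, (5, 5), 0)
_, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)

contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
hand = max(contours, key=cv2.contourArea)                        # hand silhouette

m = cv2.moments(hand)
xc, yc = m["m10"] / m["m00"], m["m01"] / m["m00"]                # centroid

pts = hand[:, 0, :].astype(float)                                # contour pixels (x, y)
dist = np.hypot(pts[:, 0] - xc, pts[:, 1] - yc)                  # Euclidean distances

# Local maxima of the distance profile are the candidate fingertip points.
peaks = argrelextrema(dist, np.greater, order=15)[0]             # order is assumed
fingertips = pts[peaks]
dx_dy = fingertips - np.array([xc, yc])                          # the dx, dy of Figure 35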

The above methodology is implemented in Max/MSP, in a finger gesture recognition software package called PianOrasis (Figure 36).

Figure 36: PianOrasis

As an improvement over our previous work, real-time fingertip detection has been integrated into PianOrasis for both optical and depth image sequences, without any skeleton extraction. The cropping of the region of interest (hand and fingers), together with rotation invariance for depth images, is also considered an improvement.

3.1.3.3 Gesture Recognition in Byzantine Music

For hand gesture recognition in Byzantine music, a motion capture system consisting of either one Kinect camera or the Animazoo motion capture suit, together with the Max/MSP programming language and its Gesture Follower (GF) object, can be used. This system has also been tested in other use cases, such as traditional craftsmanship. The system captures and recognizes hand gestures in real time. Although the technology is comprehensive, further research should be conducted in the near future to find out whether the chanter's hands provide important information for the capture and transmission of rare know-how. If hand gestures do convey meaningful information, a capture and recognition phase will be conducted. According to [49], the chanter's hand gestures relate exclusively to rhythm (analogous to the meter in European music), the most important being the "disimos", "trisimos" and "tetrasimos" rhythmic gestures. However, the most important modality in Byzantine music is the vocal tract, and hand gestures are of secondary importance. At this phase of i-Treasures, GF is used only for testing and not for development or integration into the platform. GF has already been used for gesture recognition in the traditional craftsmanship and contemporary music composition use cases. The purpose of using it here is to conduct only preliminary tests in gesture recognition. The findings indicated high and satisfactory results in recognizing music-like gestures and gestures performed by a potter.

3.1.4 Full Upper body data capture for the traditional craftsmanship use case (UOM)

3.1.4.1 Capture, modelling and recognition (UOM, ARMINES)

Some methodologies based on existing technologies have also been used in the project for capturing and data collection. More specifically, a previously developed methodology, to which ARMINES contributed, is presented below. This methodology for gesture capture, modelling and preservation has been defined based on gesture capturing technologies and a gesture recognition system, and has been applied to the wheel-throwing pottery use case. The gestures of the upper part of the body of two (2) expert potters (potters A and B), necessary for the creation of a simple bowl, have been recorded with the Animazoo suit, which contains 11 inertial sensors capturing the rotations of different members of the human body. Five (5) repetitions have been recorded for each gesture. The data are then analyzed and used for machine learning. The machine learning phase is performed with the Gesture Follower system, developed at IRCAM [50][51], which is based on Hidden Markov Models and a Dynamic Time Warping technique permitting time alignment between the model and the data used as input for the recognition. Figure 37 shows the gesture recognition pipeline.


Figure 37: Gesture recognition pipeline based on the Animazoo suit of wireless motion sensors for the upper-part of the potter’s body.

3.1.4.2 Gesture Analysis

The following research methodology can be applied before gesture recognition and investigates the relationship between different types of gestures and body segments. More specifically, once the data have been collected from the sensors, a pre-processing step of data analysis can be applied before gesture recognition is conducted. This step is referred to as gesture analysis and uses multidimensional data analysis. Its purpose is to categorize/classify body segments into two types of gestures, effective and accompanying ones, according to Delalande's typology [52]. This step is crucial for multilevel gesture recognition because, apart from recognizing each gesture, the machine learning algorithms would also recognize to which type/category each gesture belongs. The machine learning algorithms that would be used are HMM, DTW, etc. Statistical techniques such as Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) [53] are used for classification and dimensionality reduction; PCA was found more appropriate for this specific research methodology. For motion capture, the Animazoo suit was used, with 11 inertial sensors recording the upper part of the potter's body. More specifically, the x, y, z rotations (Euler angles) were recorded for each body segment. The IBM SPSS 20.0 software was then used for the multidimensional data analysis. The three variables of each sensor/body segment were transformed into one new variable, taking their weights into account. Equation (1) calculates this new variable (nvar) by multiplying each angle (x, y, z) by its respective weight (w), obtained from Factor Analysis, and dividing by the sum of the weights.

nvar = (wx · x + wy · y + wz · z) / (wx + wy + wz)    (1)

Then, PCA is applied, as mentioned before, and two (2) factors were extracted, explaining 61.826% of the total variance (Figure 38).
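The following sketch reproduces equation (1) and the two-factor extraction with NumPy and scikit-learn (the project used SPSS, so this is only an illustration); the weights and motion data are random placeholders.

import numpy as np
from sklearn.decomposition import PCA

def new_variable(angles, weights):
    """Equation (1): weighted combination of the x, y, z Euler angles of one
    body segment. angles: (n_frames, 3); weights: (3,) factor-analysis weights."""
    return angles @ weights / weights.sum()

# Placeholder data: Euler angles of 11 sensors over n_frames frames.
n_frames, n_sensors = 500, 11
rotations = np.random.randn(n_frames, n_sensors, 3)     # stands in for recorded data
weights = np.abs(np.random.randn(n_sensors, 3))         # stands in for FA weights

nvar = np.stack([new_variable(rotations[:, s, :], weights[s])
                 for s in range(n_sensors)], axis=1)     # (n_frames, n_sensors)

pca = PCA(n_components=2)
scores = pca.fit_transform(nvar)
print(pca.explained_variance_ratio_.sum())               # ~0.62 in the SPSS analysis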


Figure 38: 2 Factors, explaining the 61.826% of the total variance

Moreover, the coefficients are sorted by size, and those with small absolute values are suppressed.

To detect the inner boundary of the upper lip, we search for points satisfying dgy>0 and abs(dgy)>Ty_upper. Such points lie on the inner line of the upper lip, but also on the outer line of the lower lip. When the mouth is closed, points on the inner line of the upper lip usually have a lower abs(dgy) value than points on the outer line of the lower lip. On the contrary, when the mouth is open, points on the inner line of the upper lip have a very high abs(dgy) value. To avoid selecting points on the lower lip, we use two thresholds: a high threshold Ty_upper_1 and a lower threshold Ty_upper_2.

o First, we search for points satisfying abs(dgy)>Ty_upper_1 and dgy>0 and maximizing abs(dgy). If more than 4 or 5 points are found, we can safely assume that the mouth is open. A 2nd order polynomial curve Lu is then fit to these points using least squares.

o If no points are found using Ty_upper_1, we search for points satisfying abs(dgy)>Ty_upper_2 and dgy>0. In each scan column, we select the first point that satisfies the previous criteria and also has a higher dgy value than the next point in the scan column. This way, we can avoid selecting points on the outer line of the lower lip. If more than 4 or 5 points are found, we fit a 2nd order polynomial curve Lu to these points using least squares.

In both cases, before fitting the curve, we remove outliers based on the distances between neighbouring selected points. Although this technique accurately identifies the inner boundary of the upper lip, this boundary is sometimes misplaced because the selected points are located in the area between the lower lip and the pit above the chin. To avoid this, we can further constrain our search by demanding that the maximum (k1) and minimum (k2) principal curvatures of the candidate points satisfy experimentally defined criteria.

Next, we try to define the left lateral mouth boundary by performing a search along axis x for points maximizing abs(dgx) and satisfying dgx>0 and abs(dgx)>Tx_left. We also impose another constraint: candidate points should lie in the zone defined by the extrema of Lu and Ll, i.e. by the upper and lower points of the detected inner lip boundary. If more than 4 or 5 points are found, we fit a 2nd order polynomial curve Lleft to these points using least squares. This curve should have a < or ( shape. If its shape is > or ), then the lateral boundary has not been estimated correctly.

Next, we try to define the right lateral mouth boundary by performing a search along axis x for points maximizing abs(dgx) and satisfying dgx<0 and abs(dgx)>Tx_right. We also impose another constraint: candidate points should lie in the zone defined by the extrema of Lu and Ll. If more than 4 or 5 points are found, we fit a 2nd order polynomial curve Lright to these points using least squares. This curve should have a > or ) shape. If its shape is < or (, then the lateral boundary has not been estimated correctly.

AU1: Raises the inner eyebrow part.
    IF inc(M1)>10 OR inc(M4)>10 OR inc(M24)>30 THEN AU1=true

AU2: Raises the outer eyebrow part.
    IF inc(M2)>12 THEN AU2=true

AU4: Lowers the eyebrows.
    IF (dec(M1)>10 OR dec(M4)>10) AND (dec(M3)>10 OR inc(M251)>15) THEN AU4=true

AU5: Raises the upper eyelid, widens the eye opening.
    IF inc(M5)>12 AND inc(M6)>10 THEN AU5=true

AU7: Raises the lower eyelid, narrows the eye opening.
    IF dec(M5)>10 AND dec(M6)>10 THEN AU7=true

AU9: Wrinkles the nose.
    IF dec(M4)>10 AND (inc(M251)>15 OR dec(M3)>10) AND { [NW1>=2 AND NW2>=3] OR [NW1>=2 AND (dec(M7)>10 OR inc(M8)>10)] OR [NW2>=3 AND (dec(M7)>10 OR inc(M8)>10)] } THEN AU9=true,
    where NW1 = H(inc(M252)>20) + H(inc(M253)>20) + H(inc(M254)>20) and NW2 = H(inc(M255)>30) + H(inc(M255)>20) + H(inc(M257)>30) + H(inc(M258)>10)

AU12: Pulls the lip corners upwards obliquely.
    IF inc(M11)>5 AND inc(M12) AND M12>5 AND dec(M17)>5 AND inc(M16)>8 THEN AU12=true

AU15: Presses the lip corners downwards.
    IF M14=0 AND dec(M12) AND M128 AND (NOT dec(M19)>15) THEN AU15=true

AU25: Parts the lips slightly.
    IF inc(M14) AND M14>0.3cm AND M1410) THEN AU25=true

AU26: Parts the lips, parts the jaws.
    IF M14>1cm AND inc(M20)>10 AND (NOT inc(M20)>80) AND (NOT dec(M19)>10) THEN AU26=true

AU27: Stretches the mouth and pulls the lower jaw downwards.
    IF M14>1cm AND inc(M20)>80 THEN AU27=true

Table 4: Rules for recognizing Facial Action Units. Mi is the value of measurement i computed in the current frame, while Ri is the corresponding reference measurement. inc(Mi)>a (dec(Mi)>a) denotes an increment (decrement) of more than a% in the value of Mi compared to Ri. inc(Mi)/dec(Mi) denotes that the value of Mi has increased/decreased. H(h1) equals 1 if the hypothesis h1 is correct and 0 otherwise.
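As an illustration, the sketch below evaluates a small subset of the Table 4 rules (AU1, AU2, AU5, AU7, whose thresholds are fully legible above) from the current and reference measurement vectors. Representing the measurements as dictionaries keyed by name is an assumption for the example; the actual module is implemented in Visual C++.

def inc(M, R, key, a):
    """True if measurement `key` increased by more than a% w.r.t. the reference."""
    return (M[key] - R[key]) / abs(R[key]) * 100.0 > a

def dec(M, R, key, a):
    """True if measurement `key` decreased by more than a% w.r.t. the reference."""
    return (R[key] - M[key]) / abs(R[key]) * 100.0 > a

def detect_aus(M, R):
    """Evaluate a subset of the Table 4 rules on the current measurements M
    and the neutral-face reference R ('M1', 'M2', ... keys assumed)."""
    aus = set()
    if inc(M, R, "M1", 10) or inc(M, R, "M4", 10) or inc(M, R, "M24", 30):
        aus.add("AU1")
    if inc(M, R, "M2", 12):
        aus.add("AU2")
    if inc(M, R, "M5", 12) and inc(M, R, "M6", 10):
        aus.add("AU5")
    if dec(M, R, "M5", 10) and dec(M, R, "M6", 10):
        aus.add("AU7")
    return aus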

The facial feature tracking and facial action unit detection algorithms presented in Section 3.2.4 were implemented in Visual C++, and a software application was developed to demonstrate them. The images captured by the 3D sensor are processed in real time, and the results of facial feature tracking are displayed in the application window, overlaid on the depth and color images. Different colors are used to indicate the global and local detectors' estimations, as can be seen in Figure 71. In addition, the values of several facial measurements are displayed in the same window as bar graphs changing with time. Each bar graph corresponds to a different facial measurement (eye opening, inner and outer eyebrow displacement, mouth opening, mouth shape, nose-to-mouth distance, curvature measurements, etc.) and represents the value of this measurement as a function of time. Each time a new frame is processed and displayed in the application window, a new bar is added to each graph sub-window. The result of the facial action unit detection algorithm is also overlaid on the color image as a text message. The set of action units detected in the current frame can be seen in the screenshots shown in Figure 71. The facial expression recognition software is presented in detail in deliverable D3.2 "First Version of ICH Capture and Analysis Modules".

Figure 71: Facial action units are recognized using a set of rules that compare the extracted facial measurements to reference measurements computed for a neutral face. Here, we can see two examples of action unit recognition. These are screenshots of the facial expression recognition software presented in deliverable D3.2.



3.2.4.7 Initial experimental evaluation

In this section, we present some initial results from the experimental evaluation of the facial action unit detection algorithm presented in the previous sections. To evaluate the developed algorithm, a 2D+3D image database was recorded using an MS Kinect sensor. The resolution of the recorded images is 640x480. The two data streams (depth and color) are synchronized automatically and are also registered (a 1-to-1 pixel correspondence is established). The subjects sit at about 60-70 cm distance from the sensor. The database consists of 64 sequences of 6 subjects, 25 to 40 years old. In each sequence, the human subject displays a single action unit (11 in total) 2-3 times. Facial action periods last approximately 5-10 sec and are preceded and followed by short neutral-state periods. The duration of each recording is about 30-40 sec and the frame rate is about 10 fps. Facial action and neutral face periods were manually identified in each of these sequences and an appropriate tag was assigned to each frame. An example of a subject displaying different action units is shown in Figure 72.

Figure 72 Example of male subject displaying different facial action units.

To train the global active shape model as well as the local detectors, we used a set of 300 image pairs of these 6 subjects depicting different action units. To evaluate our algorithm, we use the first 10 frames of each test image sequence to extract a reference measurement vector. In each of the remaining frames, we first localize the positions of the 81 facial landmarks, next we extract a set of facial measurements and, finally, we detect a set of action units based on the rule-based approach presented in Section 3.2.4.6. Using this procedure, we assign to each frame one or more action unit tags. These tags are subsequently compared against the ground truth. The action unit recognition algorithm was tested on the 64 image sequences (about 6 sequences per action unit). The evaluation results are illustrated in Figure 73. The mean recognition rate is 85.8%. The highest recognition rates are observed for AU1, AU2 and AU26. The lowest recognition rate is observed for AU15. It seems that the ASM cannot accurately track the downward-curved shape of the inner lip boundary when the mouth corners are pressed downwards. On the contrary, the upward-curved shape of the inner lip boundary when someone pulls the mouth corners upwards (AU12) is detected with more accuracy. Regarding AU25, sometimes when the lips are just slightly apart, the open/closed mouth classifier fails and erroneously classifies the mouth as closed.

Figure 73 Facial action units detection rates.

Extensive evaluation of the proposed algorithms will be performed in the context of Task 7.2 “Laboratory testing of modules” using sequences that will be recorded in the context of Task 3.6 “Data collection”. These experimental results will be presented in deliverable D7.2 “First evaluation report”.

3.3 EEG Data Capture and Analysis

3.3.1 Background and Module Overview

Within the i-Treasures project, the scope of electroencephalography analysis is the recognition of users' different affective states based on their brain activity, in real time. In general, the flow of the recognition process consists of the following steps: electroencephalogram (EEG) signals are acquired, fed to a processing and classification algorithm, and mapped to an affective state or to a quadrant of the valence-arousal plane, a two-dimensional plane for emotion characterization in which valence denotes whether an emotion is positive or negative and arousal expresses the intensity of the emotion [54].


During the past years, a vast amount of research has been conducted to investigate the neuroanatomical basis of human affective responses. A detailed review of pivotal research can be found in [55]. The acquired knowledge fostered the development of methodologies for EEG-based emotion recognition employing advanced signal processing techniques and machine-learning algorithms [56]. Although these methodologies achieved satisfactory recognition rates, they are based on offline processing and classification. To develop a practical real-time system for emotion recognition, however, additional parameters have to be taken into account besides recognition accuracy, such as the sensor configuration, the application development environment, and compromises between processing complexity and computational time [57].

A few attempts towards the realization of a practical real-time EEG-based system have been made, most of them focusing on the recognition of two affective states, one positive- and one negative-valenced. In [58], fast Fourier transform (FFT)-computed signal power features and the AdaBoost.M1 classification technique were used to classify EEG responses into two affective states, amusement and fear. In [59], the PSDs of five frequency bands of the EEG signals were used as features fed to a Gaussian support vector machine (SVM) classifier in order to discriminate between happy and unhappy states; the real-time system was implemented using BCI2000 [60] and Matlab (The MathWorks Inc., Natick, MA). Finally, in [61], the fractal dimension (FD) of EEG signals arising from the frontal cortex was used as a feature in order to map brain activity to six emotions, i.e., fear, frustration, sadness, happiness, pleasantness, and satisfaction. FD thresholds were used as decision rules, rather than machine learning algorithms, and the real-time application was implemented in C++. Although the number of detectable emotions was larger, the system required a priori training of the algorithm by each user in order to achieve satisfactory recognition accuracy.

Here, the ultimate goal of the EEG Capturing and Analysis task is to develop a user-independent brain-computer interface for the recognition of at least two affective states (positive and negative) that possesses two critical characteristics, i.e., portability and real-time functioning. In this section, the research conducted towards this direction during the first development cycle of i-Treasures and the proposed system architecture are presented. It must be noted that the EEG Capturing and Analysis module will function as part of the Contemporary Music Composition use case, where the user will be able to create music based on his/her gestures, facial expressions and affective states. In this vein, the EEG acquisition device and the application environment are selected to comply with the aforementioned requirements. Moreover, a number of feature extraction and classification approaches are examined in order to achieve an optimal balance between recognition accuracy and computational time.


Figure 74: Architecture of the real-time EEG-based affective state recognition system

3.3.2 System Design and Architecture

The system comprises an EEG data acquisition and transmission device and a standalone application for data processing and classification. An overview of the system architecture is presented in Figure 74. For the EEG data acquisition, the Emotiv EPOC device (Emotiv Systems, Inc., San Francisco, CA) is used; a detailed description of the device features and its setup is provided in Section 3.3.3. EEG data are then fed to a .NET-based application that is responsible for the EEG-based feature extraction and classification into affective states. The whole process occurs at refresh rates that verge on real-time behavior. The application comprises a buffer where the stream of EEG data is stored and then fed to the feature extraction function every time a read command is triggered. Subsequently, a feature classification function is called to classify the calculated features using an already trained model. The methodology behind feature extraction and classification is described in Subsections 3.3.4.1 and 3.3.4.2. The output of the application is a value corresponding to an affective state, which serves as input to the associated user interfaces.
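The buffer / feature extraction / classification loop just described can be sketched as follows. The sketch is written in Python purely for illustration (the actual module is a .NET application), and the window size, feature function and classifier file are assumptions.

import collections
import numpy as np
import joblib

FS = 128                       # EPOC output sampling rate (Hz)
WINDOW = 4 * FS                # assumed 4-second analysis window
N_CHANNELS = 14

clf = joblib.load("affect_classifier.pkl")       # hypothetical pre-trained model
buffer = collections.deque(maxlen=WINDOW)        # ring buffer of EEG samples

def on_new_samples(samples, extract_features):
    """Called whenever new (n, 14) EEG samples arrive from the headset.
    When the buffer is full, features are extracted and mapped to an
    affective state by the already trained classifier."""
    for s in samples:
        buffer.append(s)
    if len(buffer) == WINDOW:
        window = np.asarray(buffer)                           # (WINDOW, 14)
        features = extract_features(window)                   # e.g. per-channel power
        return int(clf.predict(features.reshape(1, -1))[0])   # affective state label
    return None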

3.3.3 EEG Data Acquisition Device and Setup

EEG data acquisition is conducted using the Emotiv EPOC device (Emotiv Systems, Inc., San Francisco, CA). EPOC is a portable headset that offers multichannel EEG recording (Figure 75a) and wireless data transmission to a personal computer (PC). The main features of the device are described below [62]:

- EEG Channels: The headset bears 14 sensors (electrodes) for simultaneous 14-channel EEG signal acquisition. Sensor positions correspond to the AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8, and AF4 positions of the 10/20 system of EEG electrode placement [63] (see Figure 75b). Two additional electrodes serve as common mode sense (CMS)/driven right leg (DRL) references, which can be placed either at P3 and P4 or over the right and left mastoid, respectively (see Figure 75b).


Figure 75: (a) The Emotiv EPOC headset along with the detachable sensors. (b) Electrode positions of the EPOC device according to the 10/20 system of electrode placement.


- Sensors Configuration: Sensors are detachable and bear sponge-like tips. A saline solution is applied to the tips to foster good conductivity. Sensors are placed in sockets at the end of the device branches (see Figure 75a), which are relatively adjustable, in order to achieve the aforementioned electrode scheme.

- Data Sampling: EEG data are acquired and digitized by the device using an analog-to-digital converter at an internal sampling frequency of 2048 Hz with 14-bit resolution (least significant bit voltage = 0.51 μV). The output EEG data are downsampled to 128 Hz.

- Filtering: The device includes a 0.16 Hz hardware high-pass filter on the input of each channel and two digital notch filters at 50 Hz and 60 Hz.

- Impedance Measurement: There is a patented system for real-time contact quality assessment of the recording sensors.

- Power: The device is powered by a rechargeable lithium-polymer battery that typically offers 12 hours of continuous use.

- Connectivity: The device uses proprietary wireless connectivity to transmit data in the 2.4 GHz band. A USB receiver accompanying the device must be plugged into the PC in order for the data to be received.

- Safety: The device has been certified for safety under the EN 60950-1:2006, IEC 60950-1:2005 (2nd Edition), AS/NZS 60950.1:2003, and JPTUV-029914 (TUV Rheinland) standards.

Setting up and wearing the device is a simple task that can be performed by the user himself/herself, without any expertise required. The following steps describe the device setup process needed for it to function properly.

Before placement:


1. A small amount of saline solution must be placed on the sensor tips. After absorption, the sensors must be placed in the sensor sockets of the device (they click into place). To maintain good conductivity, the solution has to be reapplied when the sensor tips get dry.

Figure 76: Placement of the Emotiv EPOC headset on the user’s head. The red circle indicates the CMS reference electrode over the left mastoid.

2. The USB receiver must be plugged into the PC. While holding the headset close to the receiver, the device must be turned on using the switch at the rear of the headset. A steady LED on the receiver indicates proper pairing.

During and after placement (Figure 76):

3. By carefully expanding the headset, it can now be placed on the user's head. If needed, the positions of the sensors have to be adjusted in order to ensure good contact. The front sensors of the headset must be 50-60 mm above the eyebrows.

4. Good contact has to be ensured for the reference sensors; pressing them gently against the scalp establishes a good conductive path.

The Emotiv EPOC headset was selected based on four criteria: 1) it is a wireless, portable device that is easy for the user to put on without assistance, which is crucial because, in the context of i-Treasures, EEG-based affective state recognition will be applied in the Contemporary Music Composition use case, where the user must be free to perform gestures and movements; 2) it is a relatively low-cost device compared to other multichannel EEG recording devices, which is critical for the system to be widely available; 3) it offers more EEG recording channels, and consequently more available data, than any other commercial low-cost and portable EEG recording device [64]; and 4) it is accompanied by an open-license SDK, facilitating the development of brain-computer interfaces such as the one described here.

3.3.4 Data Processing and Classification

In this section, the methodology used to map EEG activity to affective responses is presented. As far as feature extraction is concerned, three approaches are examined, based on signal power, complexity and higher-order crossings (HOC). Additionally, for the mapping process, two types of feature vectors and three classification algorithms are tested. Finally, in the last subsection, the realization of the methodology in a practical real-time application is presented.


3.3.4.1 Input Data and Features Extraction

The system input data consist of 14 discrete EEG signals, one per recording channel, acquired and streamed by the EPOC headset. Let $x^i(t)$ be the EEG signal acquired from channel $i$, and let $x_w^i(t)$ denote its restriction to a time window $w$; the windowed signal is subjected to the feature extraction process. Three feature extraction approaches are adopted and tested.

Power-based feature extraction

The power spectral density (PSD) $S_i(f)$ of $x_w^i(t)$, where $f$ denotes the frequency, is computed using an $N_{FFT}$-point FFT, with $N_{FFT}$ equal to the next power of 2 greater than the window length $w$. The summation of $S_i(f)$ yields the power of $x_w^i(t)$ and constitutes the feature extracted in this approach, i.e.,

$F^i = \sum_f S_i(f)$,   (1)
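A minimal sketch of this power feature for one windowed channel signal is given below; the one-sided periodogram normalization is an assumption, since only the FFT length rule and the summation of Eq. (1) are specified above.

import numpy as np

def power_feature(x_w, fs=128):
    """Power-based feature F^i for one windowed channel signal x_w (Eq. 1)."""
    # N_FFT: next power of two strictly greater than the window length.
    n_fft = int(2 ** np.ceil(np.log2(len(x_w) + 1)))
    spectrum = np.fft.rfft(x_w, n=n_fft)
    # One-sided periodogram estimate of the PSD (normalization is an assumption).
    psd = (np.abs(spectrum) ** 2) / (fs * len(x_w))
    return psd.sum()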

The PSD was also used to compute features in [59], while, in general, fluctuations of EEG signal power are correlated with brain activation [65].

Complexity-based feature extraction

The complexity of an EEG signal serves as a measure of the degree of activation of the brain area from which the signal has been captured [66]. Here, the complexity of EEG signals is measured using the fractal dimension (FD) of the signal. In particular, Higuchi's algorithm for FD estimation is used, as in [61]. An epitomized description of Higuchi's method is given below [67].

From a time sequence $x(1), x(2), \ldots, x(N)$, new time series $x_m^k$ are constructed as

$x_m^k = \{x(m),\, x(m+k),\, x(m+2k), \ldots, x(m + \lfloor (N-m)/k \rfloor\, k)\}$, for $m = 1, 2, \ldots, k$,

where $m$ indicates the initial time value, $k$ indicates the discrete time interval between points (delay), and $\lfloor \cdot \rfloor$ denotes the integer part. For each of the $k$ time series constructed, the average length is computed as

$L_m(k) = \frac{1}{k} \left[ \sum_{i=1}^{\lfloor (N-m)/k \rfloor} \left| x(m+ik) - x(m+(i-1)k) \right| \right] \frac{N-1}{\lfloor (N-m)/k \rfloor\, k}$,   (2)

where $N$ is the total length of the data sequence and $(N-1)/(\lfloor (N-m)/k \rfloor\, k)$ is a normalization factor. An average length is computed for all time series having the same delay (or scale) $k$, as the mean of the $k$ lengths $L_m(k)$ for $m = 1, \ldots, k$. This procedure is repeated for each $k$ ranging from 1 to $k_{max}$, yielding an average length $L(k)$ for each $k$, i.e.,

$L(k) = \frac{1}{k} \sum_{m=1}^{k} L_m(k)$,   (3)

The total average length for scale $k$, $L(k)$, is proportional to $k^{-FD}$. In the curve of $\ln(L(k))$ versus $\ln(1/k)$, the slope of the least-squares linear best fit provides Higuchi's FD estimate. Here, the discrete time sequence is $x_w^i(t)$, and the value of $k_{max}$ is set equal to 5 ($k_{max} = 5$), after testing on signals with known FD. The computed feature is

$F^i = FD(x_w^i(t))$.   (4)
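The following sketch implements Higuchi's FD estimate as described by Eqs. (2)-(4); it is an illustrative re-implementation, not the Matlab routine used in the project.

import numpy as np

def higuchi_fd(x, k_max=5):
    """Estimate the fractal dimension of a 1-D signal with Higuchi's method:
    average curve lengths L(k) are computed for delays k = 1..k_max and the FD
    is the slope of the least-squares fit of log(L(k)) versus log(1/k)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    L = np.empty(k_max)
    for k in range(1, k_max + 1):
        Lm = []
        for m in range(1, k + 1):
            n_max = (N - m) // k                      # number of usable increments
            if n_max < 1:
                continue
            idx = m - 1 + np.arange(n_max + 1) * k    # 0-based indices of x_m^k
            dist = np.abs(np.diff(x[idx])).sum()
            norm = (N - 1) / (n_max * k)              # normalization factor of Eq. (2)
            Lm.append(dist * norm / k)
        L[k - 1] = np.mean(Lm)
    k_vals = np.arange(1, k_max + 1)
    slope, _ = np.polyfit(np.log(1.0 / k_vals), np.log(L), 1)
    return slope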

HOC-based feature extraction

HOC is a sequence of the numbers of zero-crossings observed after a sequence of high-pass filters is applied to a finite zero-mean time series $\{X_t\},\ t = 1, \ldots, N$. HOC provides a measure of the oscillation of a signal; the more pronounced the oscillation is, the higher the expected number of zero-crossings will be, and vice versa. HOC have already been used as features in methodologies for EEG-based emotion recognition [68], [69]. An epitomized description of the HOC sequence extraction method is provided below [70]. Let $\nabla$ be the backward difference operator defined by

$\nabla X_t = X_t - X_{t-1}$.   (5)

The difference operator is a high-pass filter. If we define the following sequence of high-pass filters

$\mathcal{F}_k \equiv \nabla^{k-1}, \quad k = 1, 2, 3, \ldots$   (6)

we can estimate the number of zero-crossings for the $k$-th order by initially constructing a binary time series as

$Z_t(k) = \begin{cases} 1, & \mathcal{F}_k(X_t) \geq 0 \\ 0, & \mathcal{F}_k(X_t) < 0 \end{cases}, \quad k = 1, 2, 3, \ldots;\ t = 1, \ldots, N$   (7)

and estimating the desired number of zero-crossings by counting symbol changes in $Z_t(k)$, i.e.,

$D_k = \sum_{t=2}^{N} \left[ Z_t(k) - Z_{t-1}(k) \right]^2$.   (8)

Here, $X_t$ corresponds to $x_w^i(t)$ and the maximum order of HOC is set equal to 30, as in [68]. The HOC sequence forms feature $F^i$, i.e.,

$F^i = \{D_1, D_2, \ldots, D_{30}\}$.   (9)
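A compact sketch of the HOC feature of Eqs. (5)-(9) is given below; the zero-mean removal and the non-negativity test in Eq. (7) follow the usual HOC formulation and are assumptions here.

import numpy as np

def hoc_features(x, max_order=30):
    """HOC-based feature F^i for one windowed channel signal: D_k counts the
    zero-crossings of the (k-1)-times backward-differenced signal."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                      # the definition assumes a zero-mean series
    D = np.empty(max_order)
    filtered = x.copy()
    for k in range(1, max_order + 1):
        z = (filtered >= 0).astype(int)               # binary series Z_t(k), Eq. (7)
        D[k - 1] = np.sum((z[1:] - z[:-1]) ** 2)      # symbol changes, Eq. (8)
        filtered = np.diff(filtered)                  # backward difference for next order
    return D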

By adopting one of the three aforementioned feature extraction approaches, features are computed for each channel $i$ ($i = 1, \ldots, 14$) and the feature vector $FV_w$ from the time window $w$ to be classified is constructed as

$FV_w = \{F^1, F^2, \ldots, F^{14}\}$.   (10)

In the case of HOC, $F^i$ represents a vector, and the $FV$ is the concatenation of 14 such vectors. For the $FV$ construction, another approach is also examined, based on the asymmetry of brain activation during emotion elicitation [71]. According to the evidence, strong left frontal hemispheric activation is observed during the experience of positive emotions, while eminent contralateral activity is observed during negative emotions. In this vein, the second FV type is constructed based on the differences of features computed from signals of symmetric channel pairs about the nasion-inion axis, i.e., AF3-AF4, F7-F8, F3-F4, FC5-FC6, T7-T8, P7-P8, and O1-O2 (see Figure 75b). Thus, the differential feature vector $dFV_w$ can be written as

$dFV_w = \{F^1 - F^{14}, F^2 - F^{13}, \ldots, F^7 - F^8\}$,   (11)

where indices 1 to 14 denote the channels AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8, and AF4, respectively. Figure 77 provides an illustration of the construction process.
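The construction of FV_w and dFV_w (Eqs. (10) and (11)) can be sketched as follows for any of the three per-channel feature functions above; feature_fn is a placeholder for one of them.

import numpy as np

# Channel order used throughout this section (indices 1..14 in Eq. 11).
CHANNELS = ['AF3', 'F7', 'F3', 'FC5', 'T7', 'P7', 'O1',
            'O2', 'P8', 'T8', 'FC6', 'F4', 'F8', 'AF4']

def build_fv(window, feature_fn):
    """FV_w of Eq. (10): concatenate the per-channel features of a (14, w) window."""
    return np.concatenate([np.atleast_1d(feature_fn(window[ch])) for ch in range(14)])

def build_dfv(window, feature_fn):
    """dFV_w of Eq. (11): feature differences over the 7 symmetric channel pairs
    AF3-AF4, F7-F8, F3-F4, FC5-FC6, T7-T8, P7-P8 and O1-O2."""
    feats = [np.atleast_1d(feature_fn(window[ch])) for ch in range(14)]
    return np.concatenate([feats[i] - feats[13 - i] for i in range(7)])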

3.3.4.2 Features Classification and Output Data

For the mapping process, three classification methods are examined: thresholding, SVM [72], and k-nearest neighbors (k-NN) [73]. A model based on each of the three methods was built using training data; training data refers to a set of FVs with known classes, or affective labels. The three classification approaches are epitomized below:

Thresholding: Two thresholds are defined, one for the valence level and one for the arousal level. As far as valence is concerned, the sum of the elements of the differential FV (dFV) to be classified is computed and compared to zero, based on the asymmetry theory [71]. A positive value indicates stronger left hemispheric activity and, consequently, a positively valenced state, while a negative value denotes the opposite. For the arousal level, the mean training FVs corresponding to the high and low arousal classes are used as decision criteria. The distances between them and the FV to be classified are computed, and the classification result (arousal level) is the class corresponding to the minimum distance. This approach is very simple and requires less computational time than the other two.

k-NN: For an FV to be classified, the k nearest FVs from the available training dataset are selected based on a distance metric. The number of these k nearest neighbors corresponding to each class is computed, and the classification result is the class to which most of the nearest neighbors belong. Here, the Euclidean distance was used as the distance metric and the number of neighbors k was set to three (3-NN). This approach involves the calculation of more distances than thresholding and therefore requires more computational time.

Figure 77: Overview of the feature vector (FV) construction process.

SVM: A support vector machine constructs a hyperplane or set of hyperplanes in a high-dimensional space, which are used for the discrimination. A good separation is achieved by the hyperplane that has the largest distance to the nearest training FV of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. A kernel function is used for the projection of FVs to higher-dimensional spaces; for more information on SVMs, the reader is encouraged to consult [72]. Here, the Gaussian kernel was used. SVM is the most complex and computationally heavy of the three methods.
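As a concrete illustration of the two lighter-weight decision rules described above, the sketch below implements the thresholding criteria and the 3-NN rule with Euclidean distance; the mean training feature vectors and the (n, d) training matrix are assumed to be available from the offline training data.

import numpy as np

def valence_by_threshold(dfv):
    """Thresholding rule for valence: a positive sum of the differential feature
    vector indicates stronger left-hemispheric activity, i.e. positive valence."""
    return 'positive' if np.sum(dfv) > 0 else 'negative'

def arousal_by_mean_distance(fv, mean_fv_high, mean_fv_low):
    """Thresholding rule for arousal: pick the class whose mean training FV is closest."""
    d_high = np.linalg.norm(fv - mean_fv_high)
    d_low = np.linalg.norm(fv - mean_fv_low)
    return 'high' if d_high < d_low else 'low'

def knn_classify(fv, train_fvs, train_labels, k=3):
    """3-NN rule with Euclidean distance; train_fvs is an (n, d) array of training FVs."""
    dists = np.linalg.norm(np.asarray(train_fvs) - fv, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(np.asarray(train_labels)[nearest], return_counts=True)
    return labels[np.argmax(counts)]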

The FV to be classified is fed to one of the three trained models (classifiers), which outputs a value corresponding to an affective label, or an area on the valence-arousal plane. So far, offline classification has been performed with the number of target classes ranging from two to four. The two classes correspond to positive/negative-valenced affective states, while the four classes correspond to positive valence – high arousal, positive valence – low arousal, negative valence – high arousal, and negative valence – low arousal states (Figure 78a).

To obtain offline training datasets, EEG signals were acquired from subjects through an experiment in the laboratory environment. Affective images were used as emotion evocation media. In addition, participants provided self-reported assessments of their experienced affective states through appropriate questionnaires, which served as ground truth. An epitomized description of the training data collection procedure is provided below.

Data Collection Procedure

Subjects (10 so far) participated in an experiment targeting the evocation of four affective states, i.e., anger (negative valence – high arousal), sadness (negative valence – low arousal), happiness (positive valence – low arousal), and surprise (positive valence – high arousal). The experimental protocol (Figure 78b) consisted of five sessions and was implemented using the Experiment Wizard software tool (BeTA Lab, Amsterdam, The Netherlands). Each session targeted the elicitation of one of the aforementioned emotions, while the fifth session included neutral stimuli. In particular, during each session, a series of affective pictures with the appropriate


emotional label, taken from the International Affective Picture System (IAPS) [74], was presented to the participant on a computer screen. Each picture was presented for 2 seconds. Between pictures there was a 1-second interval, while prior to the beginning of each session there was a 60-second period during which the participant was asked to relax. Sessions were carried out with a 5-minute interval between them, in order for the preceding emotional state to fade away as much as possible.

Figure 78: (a) Discretization of the valence-arousal (V-A) plane, from positive/negative-valenced states (first panel) to positive/negative-valenced states with high/low arousal (second panel), to discrete emotions (third panel). The first two panels constitute the target classes examined here. (b) The experimental protocol followed during the data collection procedure.

At the beginning of the procedure, each participant was briefly introduced to the experimental protocol and the self-assessment questionnaire. He/she was asked to sit relaxed in front of the screen. During the procedure, his/her EEG activity was recorded using the EPOC headset. During the presentation of a picture, the participant was instructed to press a key on the keyboard to indicate whether he/she experienced an emotion. At the end of each session, he/she was asked to assess the affective state experienced during the whole session, using a questionnaire based on the Self-Assessment Manikin [75]. The latter provides 9-level scales for measuring both valence and arousal. All participants signed a consent form prior to their involvement, and they had the right to quit the procedure at any time.

3.3.4.3 Application Realization

In order to realize the aforementioned data processing algorithm, a .NET application was built in C# using Visual Studio 2010 (Microsoft Corp., Redmond, WA). A .NET-based realization was adopted because it is suitable for the development of real-time standalone applications. In addition, the SDK of the EPOC headset is also provided in C#, facilitating the merging of the headset functions with the developed feature extraction/classification algorithm. So far, the FD-based feature extraction approach and the threshold-based classification of positive/negative-valenced affective states have been realized, because their low complexity requires little computational time. The main functionalities of the alpha-phase application are provided below.


Control and Reception: The code of the EPOC SDK concerning the 'Start/Stop' control of the recording, as well as the capturing of the raw EEG data stream, is used. Additionally, information about the quality of the sensor contacts and the headset battery life is also included.

Temporary Storage: For each recording channel, a buffer is used to temporarily store the streamed raw EEG data; there are 14 buffers in total. Every time a read command is triggered, a number of samples corresponding to the time window w (currently set to 1024 samples) is read from the buffers and fed to the processing function.

Processing and Classification: The samples read from the buffers are fed to a function that computes the FD using Higuchi's method. The FD for each EEG epoch is computed sequentially for each recording channel. The FD estimation function was first realized in Matlab; using the Matlab compiler, a .dll file was then produced that can be called within the main .NET application. After the estimation of the FD values for all recording channels, the differences between values corresponding to symmetric channel pairs (see Equation 11) are computed sequentially, and the sum of the differences is then calculated and compared to zero.

Output: If the produced sum is positive, the output is equal to 0, corresponding to a positive affective state; if the sum is negative, the output is equal to 1, denoting a negative affective state. As soon as an output is produced, a new read command is triggered. The result can be streamed from the application via the user datagram protocol (UDP) to any associated interface that supports this transfer protocol.
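A minimal sketch of this output step is shown below; the destination address and port of the associated interface are placeholders, and the plain-text payload format is an assumption.

import socket

# Hypothetical destination for the associated user interface (address/port are placeholders).
UI_ADDRESS = ('127.0.0.1', 9000)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def emit_affective_state(fd_values):
    """Map the 14 per-channel FD values to 0 (positive) or 1 (negative valence),
    then stream the result over UDP, mirroring the output step described above."""
    diffs = [fd_values[i] - fd_values[13 - i] for i in range(7)]  # symmetric pairs of Eq. (11)
    output = 0 if sum(diffs) > 0 else 1
    sock.sendto(str(output).encode('ascii'), UI_ADDRESS)
    return output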

In general, the affective state recognition application will function as a medium between the EPOC headset and the associated interfaces, e.g., the music composition interface. The .NET executable, along with the required files and the manual, is included in the deliverable D3.2 “First Version of ICH Capture and Analysis Modules”.

3.3.5 Future work

So far, systematic research has been conducted on the feature extraction and classification methodology, training data have been collected, and an alpha version of the application for affective state recognition has been developed. Our future work will focus on the points below.

Offline processing:

- Improvement of the trained model and, consequently, of the offline recognition accuracy for the two affective states, and expansion to the four-class problem as mentioned in Subsection 3.3.4.2.
- Given the methodologies described in Subsection 3.3.4.1, feature extraction from specific EEG frequency bands (delta (1-4 Hz), theta (4-8 Hz), alpha (8-13 Hz), beta (13-30 Hz), and low gamma (30-49 Hz)) will be examined.
- Feature selection based on discrimination performance, which will potentially improve classification accuracy and reduce processing complexity.
- Collection of additional training data by conducting the experiment on more subjects, and possible use of audiovisual stimuli, e.g. affective video clips, for more effective emotion elicitation.

Real-time application:

- Possible realization of the remaining feature extraction/classification methods as .NET assemblies, depending on their recognition performance.
- Code optimization, e.g., parallel processing of the data from each recording channel in order to reduce computational time.
- Integrated user interface design.

Valuable feedback concerning the issues to be addressed in the 2nd development cycle of the EEG Capturing and Analysis module is also expected after the completion of the 1st Case Studies phase at the end of the second year.

3.4 Vocal Tract Data Capture and Analysis (UPMC, CNRS, USM)

3.4.1 Introduction

In i-Treasures, vocal tract capture during rare song performances will ultimately be used to provide reliable vocal tract features to a subsequent animation of articulator movement, for educational purposes, so that students can learn the exact articulatory strategies necessary for a specific type of rare singing. For this reason, it is essential to build a system which can record the configuration of the vocal tract – essentially the tongue and lips, but also the vocal folds and the soft palate – in real time, with sufficiently rich data to be able to establish a link between image features and actual, physiological elements of the vocal tract.

Vocal tract modeling and sensing have for many years been an active research area in the field of speech production and recognition. Non-invasive methods that have been used for vocal tract capture are now sufficiently reliable to be applied to singing as well. The i-Treasures project will deal with a number of traditional European singing techniques from the UNESCO Inventory of Intangible Cultural Heritage in need of urgent safeguarding [77]. The selected singing styles are the "Cantu in Paghjella" of Corsica (France), the "Canto a Tenore" pastoral songs from Sardinia (Italy), and Byzantine hymns from Mount Athos (Greece) [76]. These songs tend to be performed at festive, social and religious occasions, and the number of their active practitioners is decreasing. Our study will also include a newly expanding contemporary singing style, the "Human Beat Box", in which the vocalist imitates drum beats, percussion, and other musical sounds.

In the 1980s, ultrasound (US) imaging techniques began to become popular for vocal tract sensing studies [78], [79]. Indeed, the underside of the chin provides a very convenient aperture for the study of tongue movement using US waves in the range of 1 to 10 MHz. With this technique, the upper surface of the tongue, as an air-tissue boundary, gives a very strong reflection of the ultrasonic energy, providing a clear tongue contour which can be studied and tracked in real time, at frame rates of up to 100 images per second, which is perfectly adequate for speech production research. As it is non-invasive, portable, and requires no external magnetic field, ultrasound can also be readily complemented with other sensors, such as an Electroglottograph (EGG) to measure and record vocal fold contact during speech; a piezoelectric accelerometer (manufactured by K&K Sound) mounted on the nasal bridge to detect nasal resistance in speech sounds; a video camera to follow lip movement; a breathing belt sensor to determine breathing modalities and position; and a standard microphone.

The i-Treasures project will use advanced sensing and modeling techniques to help preserve disappearing intangible cultural heritage. Indeed, the rare singing application can make quite appropriate use of the above-mentioned sensing techniques when coupled with automatic feature extraction algorithms and data-driven machine learning


analysis techniques, as has been done for speech recognition [79]. Furthermore, the combination of sensors chosen, as listed above, has the potential to greatly enhance our knowledge of the chosen rare song types. For example, the Sardinian Canto a Tenore consists of four voices singing overlaid melodies. Two of the singers use a traditional laryngeal phonation, and two use a method that pitch-doubles the fundamental frequency [80]. It is not known, as of today, whether this is done by vibrating both the vocal folds and the ventricular folds, as is found in diplophonia, or by amplifying an overtone, as is done in Tuvan throat singing. The combination of ultrasound and EGG, for example, should enable us to record the tongue, anterior pharyngeal wall and vocal folds at the same time. Our chosen multiple sensor modalities will thus allow us to perform visual and auditory documentation of the singing technique for archiving and future teaching purposes.

3.4.2 Vocal tract data capture system architecture

Figure 79: Overview of the vocal tract capture system for the rare singing sub-use cases. On the left, the main sensors (2 image and 4 electrical time signals) are displayed. Simultaneously recorded data streams are synchronized using RTMaps toolkit (see section 3.4.3). On the right, sensor-specific feature extraction modules produce low (and medium-low) level features for further processing (see for example WP4 classification tasks, and WP5 avatar vocal tract animation).

We first present a schematic overview of the different modules contributing to a successful capture system (see Figure 79). On the left, snapshots of the main sensors (ultrasound, camera, microphone, EGG, piezo and breathing belt) are displayed. Some of these snapshots are "notional"; for example, the actual ultrasound and video devices used are in some cases miniaturized versions of those shown in the table, as will be detailed in the next section. Each sensor requires specific gain tuning and/or zero calibration protocols. Simultaneously recorded data streams from these sensors are synchronized via the RTMaps toolkit, which will be described in Section 3.4.3. On the right hand side

Page 104 of 127

D3.1 First Report on ICH Capture and Analysis

i-Treasures ICT-600676

of the table, an indication of sensor-specific features obtained from the feature extraction modules is given. These modules produce the low (and medium-low) level features for subsequent processing.

3.4.2.1 Helmet design and sensor setup

As we have described, the vocal tract capture module consists of a suite of non-invasive sensors including ultrasound, camera, microphone, piezoelectric accelerometer, EGG, and respiratory belt sensor. The Terason t3000™ ultrasound system was selected for i-Treasures. This is a portable ultrasound system with high image quality, the ability to interface to a PC via FireWire, and the possibility of developing ultrasound applications using the Terason Software Development Kit (SDK). A 128-element microconvex 8MC4 (Terason) probe with a 140° opening angle (and with the handle removed in order to provide a more compact package) is placed underneath the singer's chin in order to study tongue movement during a performance. A USB commercial inspection camera (Imaging Source, DMM22BUC03ML), with a visible-blocking filter and infrared LED ring, is placed in front of the mouth to record both lip and tongue tip movements. IR lip imaging is preferred in order to remain as insensitive as possible to background lighting conditions. A commercial lapel microphone (Audio-Technica Pro 70) is used to record sound. These sensors are integrated directly onto a lightweight "hyper-helmet", as shown in Figure 80.

Figure 80: Multi-sensor Hyper-Helmet: 1) Adjustable headband, 2) Probe height adjustment strut, 3) Adjustable US probe platform, 4) Lip camera with proximity and orientation adjustment, 5) Microphone.


The piezoelectric accelerometer (Figure 81) is attached with adhesive tape to the nasal bridge of the singer, in order to capture nasal bone vibration, which is related to the nasal tract airway resistance, in turn an indicator of nasality production during singing. Nasality is an important acoustic feature in voice perception and has been the topic of numerous phonetic and speech processing studies. It is also involved in some singing techniques that use the nasal cavity as a resonator in order to modify the timbre of the voice. An EGG (Electroglottograph EG2-PCX2 from Glottal Enterprises) is placed on the singer's neck to measure DEGG (Derivative Electroglottograph) signal peaks, which are reliable indicators of glottal opening and closing instants [81]. This signal is also very helpful for advanced analyses such as inverse filtering, aimed at predicting the output signal from the glottis, which is essential in the speech production and perception process. Finally, a breathing belt sensor is affixed to the singer's chest to measure breathing modalities during singing (see Figure 81).

Figure 81: Schematic of the placement of non-helmet sensors, including the nasality Piezo, EGG sensor, and respiration belt.

3.4.3 Data Acquisition System Design

3.4.3.1 Sensor Core Design

To meet the requirements of the i-Treasures project for the rare singing use case, the acquisition system developed must be able to synchronously record ultrasound and video data at sufficiently high frame rates to correctly characterize the movements of these articulators, and at the same time log the acoustic speech, EGG, piezoelectric accelerometer and breathing belt sensor waveforms, also in a synchronous fashion. The acquisition platform has been developed using the Real-Time Multisensor Advanced Prototyping Software (RTMaps) commercialized by Intempora Corporation [82]. A screenshot of the acquisition platform that has been developed


for i-Treasures, including all of the aforementioned sensors, is shown in Figure 82. The platform is still under development.

Figure 82: Schematic of the developed RTMaps Data Acquisition System Architecture

3.4.3.2 Data visualization

3.4.3.2.1 RTMaps real-time user interface

The platform developed has the ability both to record and to display data in real time. Figure 83 shows a screenshot of the data acquisition platform in action, recording some test data for the rare song application. Additionally, the acquired data can be either stored locally or transferred over a network if desired. Ultrasound and video images are streamed at a rate of 60 frames per second and stored in either .bmp or .jpeg format. Image sizes for the ultrasound and the camera are 320 by 240 pixels and 640 by 480 pixels, respectively. The EGG, microphone, piezoelectric accelerometer and respiratory belt are interfaced to a four-input USB sound card (AudioBox 44VSL) whose output feeds into the acquisition system. These four analog input signals are sampled at 44100 Hz with 16-bit encoding. The sampled analog signals are saved as .wav output files.


Figure 83: RTMaps Data Acquisition User Interface, showing simultaneous recording and visualization of ultrasound tongue images, lip video, and the four analog sensors.

3.4.3.2.2 Data display, validation and analysis module

It is crucial to be able to monitor the quality of our acquired data regularly and in an efficient way. To this end, a dedicated tool, called i-Coffee, is being developed to carry out the following functions:

- Validate the synchronicity of all data streams. In particular, we need to check for potential image data loss due to system overload during capture.
- Display synchronized signals and images, including the tongue contour.
- Check for noise due to sensor movement, temperature, etc.
- Check for possible level saturation of signals.
- Provide an interface for standard signal analysis routines (e.g. spectrogram for audio, fundamental frequency curves, etc.).
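Two of these checks can be sketched as follows; the full-scale value of the 16-bit analog channels and the nominal 60 fps image rate follow the acquisition parameters given in Section 3.4.3.2.1, while the tolerance values are illustrative rather than those of the actual i-Coffee implementation.

import numpy as np

def check_saturation(signal, full_scale=32767, tolerance=0.999):
    """Count samples of a 16-bit waveform (audio, EGG, piezo, belt) that reach full scale."""
    saturated = np.abs(np.asarray(signal)) >= tolerance * full_scale
    return int(saturated.sum())

def check_frame_drops(timestamps, nominal_fps=60.0, slack=1.5):
    """Count gaps in image timestamps larger than `slack` nominal frame periods,
    which would indicate frames lost under system overload."""
    periods = np.diff(np.asarray(timestamps, dtype=float))
    return int(np.sum(periods > slack / nominal_fps))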

The operation of i-Coffee is illustrated in the two screenshots below. Figure 84 shows data for a singing performance in the left half of the image, and a user dialog box on the right. In the data display, the lips (upper left) and the ultrasound image of the tongue (upper right) are shown above the display of the four time signals: audio, EGG, nasality piezo and respiration belt.


Figure 84: i-Coffee data display, validation and analysis module, showing, on the left half of the screen, the ultrasound tongue image, lip image, and the 4 analog signal waveforms; and, on the right, a user dialog box.

Figure 85: Screen shot from i-Coffee for a Corsican Paghjella recording. A spectrogram, formant tracking versus time, and EGG information are shown in the displayed windows.

The second snapshot, Figure 85, shows several types of analysis performed on the data of a Corsican Paghjella singer producing a sustained sung vowel /i/. The upper panel shows a narrow-band spectrogram of the vowel, where the harmonics are visible, and the vibrato of the voice with ~5 cycles per second can also be identified.


3.4.4 Rare singing data collection

In the following, we present a summary of the data collection activity. Before starting the data collection proper (planned for the M15-M31 period), our activities are focused on preparatory steps dealing with technical, operational and functional requirements. These steps are listed below, and some details are provided for the major ones.

3.4.4.1 Assessment phase of the hyper-helmet for the different singing types

The hyper-helmet has been tested with one expert singer each for the Human Beat Box, HBB (Davox), Paghjella (Benedetto Sarocchi), and Byzantine (Dimitrios Manousis) musical styles. Each singer participated in a recording session to validate the helmet for his or her style, and to assist us in specifying an appropriate data collection protocol. A recording session consists of three phases: 1) singer preparation (putting on the hyper-helmet and arranging sensors, cables, etc.); 2) sensor calibration; and 3) the data collection proper. The three phases need to be optimized with respect to time delays, ease of use, etc.

The Byzantine expert singer (Manousis) first recorded with the i-Treasures hyper-helmet and RTMaps platform at LPP/CNRS in Paris, producing vowels in both singing and speaking mode before singing a dozen segments of Byzantine chant in both the Mount Athos and Ecumenical Patriarchate of Constantinople styles. The Corsican Paghjella singer (Sarocchi) was recorded at ESPCI/UPMC in Paris. He first produced spoken and sung material using isolated vowels and connected CV syllables with the major Corsican vowels and consonants, and then performed three Paghjella songs. As Paghjella is a polyphonic singing type combining three different voices, the singer interpreted these three voices (secunda, terza and bassu) sequentially, starting with the main voice (secunda) and then proceeding to the terza and bassu voices while listening to the secunda. For the HBB case, we undertook several testing and recording sessions with our expert, Davox. A specific problem for HBB is the difficulty of stabilizing the ultrasound probe, in view of the large range of motion of the jaw in this singing style as compared to the other styles.

A point which still needs to be addressed, for all singers, is the comfort of the performer during the acquisition session, and particularly during the initial setup and calibration period, in which the singer is wearing the hyper-helmet but is not yet able to produce data. At the time of this writing, each of the three singing styles tested has produced about 30 minutes of singing material, which is being used to develop and assess the next steps in the continuing development of our synchronous data collection platform, as well as of our data calibration, data display and analysis modules.

3.4.4.2 Definition of recording material

In order to study the different rare singing techniques, and to extract information and features for automatic classification (WP4) and for pedagogical activities and transmission (WP5), we have decided to collect material of different degrees of complexity: isolated vowels (/i/, /u/, /e/, /o/, /a/), CV syllables (/papapapapa/, /tatatatata/, /kakakakaka/, ...), sung phrases and entire pieces. The material is to be produced in both spoken and singing modes. For Byzantine chant, different styles (Mount Athos vs Ecumenical Patriarchate of Constantinople, for example) have been selected. For Corsican Paghjella, we propose to study versa (melodies) from three different locations famous for their traditional singing styles: Rusio, Sermanu and Tagliu-Isolacciu. The situation for Sardinian Canto a Tenore is still under discussion. For HBB, basic material will be recorded as defined in [83] (see also the acoustic recordings by U. Mons).


Short HBB phrases and longer performances in different styles will be recorded (with details still to be defined).

3.4.5 Feature Post-processing and Pre-processing

Sensor data are of two types: images, from the ultrasound machine and lip camera; and 1D time waveforms from the microphone, EGG, accelerometer and breathing sensor. By far the most challenging data are those deriving from the US transducer, as US tongue data are well known for being noisy, variable, and difficult to interpret. Although a number of tools have been developed over the years since the US modality became popular – for example, tongue contour finding algorithms – the driving of a real-time model of the vocal tract directly from ultrasound data has never been attempted. The analyses carried out to date have therefore targeted techniques that will be able to provide reliable tongue features to the subsequent animation steps.

3.4.5.1 Ultrasound image processing

The features we aim to extract from the ultrasound images are reliable coordinates of points on the tongue contour. Ultimately, these should be actual tissue points, but obtaining reliable tissue points from ultrasound images is a difficult problem. As a first step, we strive for a robust, reliable, real-time, and fully automatic tongue contour extraction algorithm. A technique based on machine learning has been chosen.

Before applying our machine learning algorithm, we first pre-process the ultrasound images to reduce their dimensionality. This pre-processing consists of an optional filtering (denoising) of the ultrasound images, a reduction of the image dimensions, a binarization of the images and an adjustment of edges (Figure 86). Subsequently, we adopt a Deep Learning (DL) neural network approach, which requires training a network using a database containing both input and desired output data. Thus, in order to train our network to reconstruct contours from ultrasound images, we use both the raw ultrasound image and a reconstructed contour image as inputs to the network. The training contours are obtained using a semi-automated but relatively computationally intensive algorithm; the DL network, however, once trained, will be able to predict contours in real time. This is the advantage of the proposed approach. The training contour images also require a pre-processing step before being used as inputs to our deep learning algorithm (Figure 87). The dimensions of the input contour images are fixed to be the same as the dimensions of the reduced ultrasound images, and the image pre-processing is identical to that for the ultrasound training images.

The tongue contour extraction using DL networks is carried out in two passes: a learning pass in which the network is trained on both ultrasound and contour images, and a second pass in which the network is trained to reconstruct tongue shape images from ultrasound data only. This second step is made possible by the fact that the DL structure has already, in the first learning pass, built an internal representation of the salient variables necessary for the prediction of tongue contours. This is one of the major advantages of the DL approach.

After the complete training phase, the network is able to predict tongue shape images. These images, which are still defined on the reduced image scale, are cleaned and processed so that they may be converted into pixel coordinates. Post-processing steps, including skeletonization, are applied, and the pixel locations are then converted into 2D coordinates. Gaussian smoothing is further applied to this set of coordinates, corresponding to a minimization of the influence of outliers; in practice, data outside six mean absolute deviations are assigned zero weight. A spline interpolation of the


coordinates can also optionally be applied, in order to obtain regular spacing of the x coordinates as well as to choose the desired number of contour points.

Figure 86: Preprocessing performed on ultrasound images to reduce dimensionality: (a) initial image; (b) image obtained after region of interest selection; (c) rescaled image; (d) binarized image; (e) image after isolated point removal; (f) image after connection of neighboring points. For a 240x320 initial image (a), a region of interest (b) is selected. The image is then resized to 30x33 pixels (c) and binarized (d). Isolated points, considered as noise, are removed (e). Finally, in order to avoid gaps in the image due to binarization, neighboring pixels are connected, as shown in (f).

Figure 87: Conversion of tongue contour coordinates into binary images: (a) initial contour coordinates projected on the initial ultrasound image; (b) projection of the same points on the ultrasound image after selection of a region of interest; (c) projection of the contour coordinates on the ultrasound image after resizing; (d) conversion of the contour coordinates into pixels, yielding a 30x33 contour image. The contour coordinates are undersampled so that they fit the rescaled image (c); each pixel belonging to the contour is then set to 1, while all other pixels are set to 0 (d).
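The dimensionality-reduction steps of Figure 86 can be sketched as follows; the region of interest, the mean-intensity threshold used for binarization and the neighbour-based cleaning rule are assumptions, since the report does not fix these details.

import numpy as np
import cv2  # used only for resizing

def preprocess_ultrasound(frame, roi, out_size=(33, 30), thresh=None):
    """Crop a region of interest, resize to about 30x33 pixels, binarize,
    and remove isolated pixels. `roi` is (row0, row1, col0, col1)."""
    r0, r1, c0, c1 = roi
    crop = frame[r0:r1, c0:c1].astype(np.float32)
    small = cv2.resize(crop, out_size, interpolation=cv2.INTER_AREA)  # out_size is (width, height)
    if thresh is None:
        thresh = small.mean()                      # assumed binarization threshold
    binary = (small > thresh).astype(np.uint8)
    # Remove isolated points: keep a pixel only if at least one 8-neighbour is set.
    padded = np.pad(binary, 1)
    neighbours = sum(np.roll(np.roll(padded, dr, 0), dc, 1)
                     for dr in (-1, 0, 1) for dc in (-1, 0, 1)) - padded
    cleaned = binary * (neighbours[1:-1, 1:-1] > 0)
    return cleaned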


3.4.5.2 Lip image and other sensor processing

Image processing algorithms for lip and mouth shape processing are well established and have not been a major point of concern in this period. EGG, nasal accelerometer, and respiration belt signals are one-dimensional temporal waveforms and also do not require special processing.

3.4.5.3 Perspectives

A remaining challenge is to identify fixed anchor “tissue points” on the tongue contour that can be provided to a subsequent tongue animation module. Block-matching, optical flow, and model-based constraint methods are currently being tested to achieve this result. Final implementation of all algorithms into the RTMaps framework is also underway.

3.5 Sound Data Capture and Analysis

3.5.1 Data capture

For the purposes of this research work, a new beat box database was recorded. It consists of beat box sounds produced by a single male beat boxer. The database contains 4 sets: individual drum sounds, rhythms, instruments and freestyle. The beat boxer was placed inside a soundproof room, equipped with a computer for recording and with a Rode Podcaster microphone. The audio was captured at a sampling rate of 48 kHz.

For the first part of the database, the beat boxer was asked to pronounce each of the 17 beat boxing drum sounds described in [83] several times. The total number of repetitions of each sound is detailed in Table 5. We noted that repeating each sound many times led to a certain weariness in the voice of the beat boxer. Based on this observation, the recording protocol should be slightly modified in the future if required: it seems better for the beat boxer to pronounce the entire set of 17 individual sounds in one pass, and to repeat this set several times.

Effect     Beatbox notation     Number of repetitions
Kick       bf                   21
Kick       bi                   16
Kick       bu                   21
Rimshot    k                    32
Rimshot    kh                   21
Rimshot    khh                  38
Rimshot    suckin               17
Snare      clap                 11
Snare      pf                   32
Snare      ksh                  12
Hi-hat     kss                  16
Hi-hat     tss                  14
Hi-hat     t                    18
Hi-hat     th                   18
Hi-hat     h                    17
Cymbal     tsh                  13
Cymbal     kshh                 14

Table 5: Musical classification, notation and repetition count of the beat boxing drum sounds in the repertoire of the study subject.

Then, for the second part of the database, the beat boxer was asked to combine some of these individual sounds into continuous rhythms at different tempi (slow and fast). Unfortunately, for this set, only a single rhythm (at both slow and fast speed) was collected. After that, for the third part of the database, he was asked to imitate the sound of various instruments, with rhythm and pitch of his own choosing. These are listed in Table 6 together with the duration of the corresponding recordings.

Instrument name                Amount [s]
Electric guitar (egressive)    15
Electric guitar (ingressive)   16
Guitar bass                    16
Saxophone                      16
Trumpet                        4
Trumpet corked                 9
Trumpet trilled                10
Voice scratch                  21

Table 6: Musical classification and collected amount (expressed in seconds) of beat boxing instruments.

Finally, for the fourth and last part of the database, the beat boxer was asked to produce continuous rhythms superimposed on sung instrument sounds (i.e. a complete freestyle musical performance). This set also contains some singing and speech. We recorded 3 minutes and 30 seconds of such freestyle. Interestingly, we noted new sounds and instrument imitations which were not present in the first and third sets of the database.

3.5.2 Sound data analysis and feature extraction

The goal of this task is to develop sound analysis methods enabling the extraction of relevant features of vocal performances from the several singing heritage use cases of the project. These features will include characteristics of the performances, such as the onsets of musical events, the pitch of identified notes, as well as the identification of


sound categories and special singing effects relevant to the given style, up to a complete automated transcription, ideally in real time. For the moment, we have focused on the beatbox use case only. Later in the project, we will also consider the other use cases, while continuing to work on beatbox, benefiting from ongoing data collection sessions.

As already mentioned in Deliverable 2.2, various features will be important as descriptors to be evaluated for the classification of beatbox sounds (HMM use case) and vocal effects (other use cases): for instance, MFCCs, MGCs, LSPs, GCI, Maximum Voicing Frequency, etc. (see the complete feature list in Deliverable 2.2). However, there is no way to tell a priori which of those features will be best for such a task. Therefore, a comparative experimental approach will be necessary in the second year of the project. Until now, we have only focused on a complementary aspect: pitch tracking analysis.

To enable the application of machine learning methods to the task, as well as to enable the evaluation of various methods, a ground truth reference annotation needs to be prepared. This has been done essentially manually, and in some cases by applying an automated algorithm followed by a thorough manual check and correction. Each set of the recorded beatbox data has first been manually segmented, as described below.

3.5.2.1 Segmentation

The first set of the database, containing the individual drum sounds, has been labeled according to the notation introduced in Table 5 and segmented following these labels, so that the onset of each drum sound event is annotated and each drum sound is labeled as one of the 17 possible sound categories.

The second set of the database, containing the rhythms, has been divided into two parts: the slow- and fast-speed rhythms. Each rhythm has also been segmented (onsets identified) and labeled according to the notation introduced in Table 5.

The third set of the database, containing the instrument sounds, has been labeled according to the notation introduced in Table 6 and segmented following these labels. Furthermore, each instrument sound was segmented according to its own characteristics. Indeed, some performances including musical instrument imitations actually contain different instrument timbres, or specific sounds produced to increase the realism of the performance. This includes unvoiced sounds to simulate more muted guitar notes, exhalation sounds preceding saxophone notes, as well as vibrato and tremolo effects in trumpet imitations. Besides, some performances include clearly audible inhalations. These characteristics are summarized in Table 7. Note that voice scratch (i.e. the imitation of disc jockey turntable scratching) was left aside for the moment, as this sound requires a particular focus and further discussion with the performers, because it is neither a standard drum sound, nor a standard instrument, nor standard speech.


Instrument name                Characteristics & number of examples                                                  Total number of examples
Electric guitar (egressive)    Guitar (33), silence (3)                                                              36
Electric guitar (ingressive)   Guitar voiced (34), guitar unvoiced (19), silence (4)                                 57
Guitar bass                    Bass (41), inhalation (9), silence (0)                                                50
Saxophone                      Saxophone (42), pre_breath (24), silence (12)                                         78
Trumpet                        Trumpet (16), inhalation (1), silence (10)                                            27
Trumpet corked                 Trumpet corked (13), trumpet corked vibrato (2), silence (1)                          16
Trumpet trilled                Trumpet trilled (21), trumpet trilled tremolo (7), inhalation_tss (3), silence (18)   49

Table 7: Acoustic characteristics for each beat boxing instrument.

The fourth and last set of the database, containing the freestyle musical performance, has been segmented and labeled according to the individual drum and instrument sounds already defined. We noted other new sound categories which were not present in the other sets of the database: e.g. didgeridoo, laugh, sustained instrument sound (voiced) superimposed with beat, etc. These characteristics are summarized in Table 8.

Effect                     Beatbox notation       Total number of examples

From [83]
Kick                       'bi'                   106
Rimshot                    'kh'                   40
Hi-hat                     'th'                   19
Hi-hat                     't'                    39
Hi-hat                     'tss'                  15
Kick                       'bu'                   79
Rimshot                    'k'                    15
Kick                       'bf'                   30
Rimshot                    'suckin'               60
Snare                      'pf'                   23
Snare                      'clap'                 13
Rimshot                    'khh'                  18

Multiple (sustained instrument sound superimposed with beat)
Hi-hat + Voiced sound      'th_ion'               1
Kick + Singing             'bu_sing'              56
Hi-hat + Singing           't_sing'               75
Kick + Singing             'bf_sing'              59
Hi-hat + Voiced sound      't_ion'                5
Kick + Singing             'bi_sing'              6
Hi-hat + Singing           'th_sing_t'            1

Miscellaneous
Singing                    'sing'                 8
Voiced sound               'ion'                  19
Voiced sound               'wa'                   104
Unvoiced sound             'wa_breath'
Voiced sound               'ing'                  10
Unvoiced sound             'ha'                   5
Unvoiced sound             'ha_breath'            4
Voiced sound               'dr'                   7
Burp                       'burp'                 2
Didgeridoo                 'didgeridoo'           71
Laugh                      'laugh'                30
Laugh                      'laugh_breath'         28
Silence                    'silence'              15
Inhalation                 'inhalation'           15
Inhalation                 'inhalation_noisy'     2
Breath                     'breath'               3
Phones 't@'                'pho_t@'               3
Phone '@'                  'pho_@'                1
Phone 'k@'                 'pho_k@'               2
Phone 'e'                  'pho_e'                1
Speech                     'speech'               6

Table 8: Acoustic characteristics for each effect appearing in the freestyle musical performance.


3.5.2.2 Pitch analysis

Extracting pitch makes sense for the third (instruments) and fourth (freestyle musical performance) sets of our database, as drum sounds (first and second sets) do not carry any pitch. As mentioned earlier, various pitch tracking methods are evaluated against a ground truth reference. Similarly to [84], we compare the performance of 4 of the most representative state-of-the-art techniques for pitch extraction:

- RAPT: Released in the ESPS package [85], RAPT [86] is a robust algorithm that uses a multi-rate approach. Here, we use the implementation found in the SPTK 3.3 package [87].
- SRH: As explained in [88], the Summation of Residual Harmonics (SRH) method is a pitch tracker exploiting a spectral criterion on the harmonicity of the residual excitation signal. Here, we use the implementation found in the GLOAT package [89].
- SSH: This technique is a variant of SRH which works on the speech signal directly, instead of the residual excitation.
- YIN: One of the most popular pitch estimators, YIN is based on the autocorrelation method, with several refinements to reduce possible errors [90]. Here, we use the implementation freely available at [91].

It should be noted that these algorithms provide, in addition to the pitch values, Voiced/Unvoiced (VUV) decisions as a by-product. These two aspects of pitch extraction should be evaluated separately, in order to find the most appropriate method for each of them. Usually, the ground truth reference pitch is obtained by means of Electroglottograph (EGG) recordings. Unfortunately, EGG signals were not recorded in our database, which consists of audio signals only. Therefore, the reference pitch has been obtained by applying an automated pitch tracking algorithm (Praat [92]) followed by a thorough manual check and correction. The ground truth reference pitch is then compared to the other state-of-the-art pitch tracking methods, for each instrument of our database. For assessing the performance of a given method, the three following measures are used [93]:

- Voicing Decision Error (VDE): the proportion of frames for which an error of the voicing decision is made. It is computed in percent [%].
- Gross Pitch Error (GPE): the proportion of frames for which the relative error of F0 is higher than a threshold of 20%. This is computed in the regions where the ground truth is voiced (whatever the VUV decision of the other method). Note that it is reported in cents [cents]; a cent is a logarithmic unit used for musical intervals [94] (100 cents correspond to a semitone, and twelve semitones correspond to an octave, i.e. a doubling of the frequency). Note also that our implementation of RAPT unfortunately applies VUV decisions directly to its output pitch values.
- F0 Frame Error (FFE): the proportion of frames for which an error (according to either the GPE or the VDE criterion) is made. FFE can be seen as a single measure for assessing the overall performance of a pitch tracker. It is computed in percent [%].
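A sketch of these frame-level measures is given below for two aligned pitch tracks in which unvoiced frames are marked with F0 = 0. GPE is computed here as a frame proportion over reference-voiced frames using the 20% criterion, and a separate helper converts frequency ratios to cents as defined above; the exact unit conversions used in the tables are not reproduced.

import numpy as np

def ratio_to_cents(f_est, f_ref):
    """Convert a frequency ratio to cents (1200 cents per octave)."""
    return 1200.0 * np.log2(f_est / f_ref)

def pitch_errors(f0_ref, f0_est, gpe_threshold=0.2):
    """Return (VDE, GPE, FFE) in percent for aligned tracks with 0 marking unvoiced frames."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_est = np.asarray(f0_est, dtype=float)
    voiced_ref = f0_ref > 0
    voiced_est = f0_est > 0

    vde_frames = voiced_ref != voiced_est
    vde = 100.0 * vde_frames.mean()

    # Relative F0 error on frames where both tracks provide a value.
    both = voiced_ref & voiced_est
    rel_err = np.zeros_like(f0_ref)
    rel_err[both] = np.abs(f0_est[both] - f0_ref[both]) / f0_ref[both]
    # Gross errors are evaluated on reference-voiced frames only.
    gpe_frames = voiced_ref & ((~voiced_est) | (rel_err > gpe_threshold))
    gpe = 100.0 * gpe_frames[voiced_ref].mean()

    ffe = 100.0 * (vde_frames | gpe_frames).mean()
    return vde, gpe, ffe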

A fourth measure is proposed in [93]: the Fine Pitch Error (FPE), which is defined as the standard deviation (in %) of the distribution of the relative error of F0 for frames where this error is below the threshold of 20%. However, this last measure does not apply in our case, since we do not have any EGG reference.

The ground truth reference pitch was manually checked and corrected for VUV decisions and GPE. The detailed pitch tracking results are summarized in Table 9. The best performing approaches, for each musical instrument imitation and each error type, are highlighted in red and bold. The last line of Table 9 corresponds to the mean scores, computed independently of the musical instrument type, for each pitch tracking algorithm and each error type.

                               GPE [cents]                     VDE [%]                         FFE [%]
Algorithm                      RAPT   SRH    SSH    YIN        RAPT   SRH    SSH    YIN        RAPT   SRH    SSH    YIN
Electric guitar (egressive)    16.12  21.07  9.88   14.82      10.14  26.03  37.54  38.91      14.65  28.97  39.24  40.16
Electric guitar (ingressive)   6.33   1.88   1.06   3.63       4.41   9.19   9.98   11.67      5.50   9.37   9.98   12.52
Guitar bass                    4.20   8.83   7.53   5.07       8.55   15.88  26.46  23.26      9.60   16.68  27.51  23.69
Saxophone                      2.69   0.12   0.12   0.00       6.94   6.94   10.87  20.17      7.00   6.94   10.93  20.17
Trumpet                        3.10   0.00   0.00   0.00       6.74   8.48   5.87   14.13      7.17   8.48   5.87   14.13
Trumpet corked                 0.62   0.15   0.00   0.00       3.30   6.08   4.90   7.25       3.30   6.08   4.90   7.25
Trumpet trilled                2.65   6.44   6.25   2.46       11.75  9.08   14.12  21.72      11.94  10.27  14.61  22.31
Freestyle                      99.08  52.15  26.84  5.57       34.66  14.96  19.11  19.66      35.31  22.88  22.85  20.13
Global                         16.85  11.33  6.46   3.95       10.81  12.08  16.11  19.60      11.81  13.71  16.99  20.04

Table 9: Detailed pitch tracking results. The best performances are highlighted in red. The last line corresponds to the mean scores, computed independently of the musical instrument type and freestyle, for each pitch tracking algorithm and each error type.

Regarding the GPE, it is observed that the best performance is achieved by YIN for all kinds of trumpet, for the saxophone and for the freestyle; by SSH for both kinds of electric guitar; and by RAPT for the guitar bass. More generally, SSH and YIN lead to the best performance, as the RAPT score is followed by those of YIN and SSH in the case of the guitar bass. Regarding the VDE, the best performance is achieved by RAPT for both kinds of electric guitar, the guitar bass and the corked trumpet; by SRH for the saxophone, the trilled trumpet and the freestyle; and by SSH for the trumpet. Regarding the FFE, the same tendency as for the VDE can be highlighted, except that YIN achieves the best performance for the freestyle. Comparing only SRH and SSH, we interestingly see that the lowest GPE and the lowest VDE are achieved by SSH and SRH, respectively. As these algorithms are basically identical, a hybrid SRH-SSH method can thus be implemented, selecting SSH when extracting pitch values and SRH when computing VUV decisions. The detailed pitch tracking results for this new hybrid SRH-SSH method are summarized in Table 10.

Algorithm: Hybrid SRH-SSH       GPE [cents]    VDE [%]    FFE [%]
Electric guitar (egressive)         9.88        26.03      26.88
Electric guitar (ingressive)        1.06         9.19       9.43
Guitar bass                         7.53        15.88      16.98
Saxophone                           0.12         6.94       6.94
Trumpet                             0.00         8.48       8.48
Trumpet corked                      0.00         6.08       6.08
Trumpet trilled                     6.25         9.08      10.27
Freestyle                          26.84        14.96      21.03
Global                              6.46        12.08      13.26

Table 10: Detailed pitch tracking results for the hybrid SRH-SSH method. The last line corresponds to the mean scores, computed over all instrument imitations and freestyle, for each error type.

This new hybrid SRH-SSH method does not perform better than RAPT regarding VDE and FFE. However, RAPT has the disadvantage of applying VUV decisions directly to its output pitch values. The hybrid SRH-SSH method nevertheless achieves better overall scores (VDE and FFE) than YIN, and than SRH and SSH taken separately.
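As an illustration of this combination, the sketch below merges the two trackers frame by frame: pitch values are taken from SSH and voiced/unvoiced decisions from SRH. It is a minimal sketch, not the actual toolbox implementation used here; it assumes the per-frame outputs of both trackers are already available as arrays of F0 values in Hz, with 0 marking unvoiced frames, and the fallback to the SRH value when SSH has no estimate is an added assumption, not something described in the text.

    import numpy as np

    def hybrid_srh_ssh(f0_ssh, f0_srh):
        """Combine two aligned F0 tracks: pitch from SSH, VUV decisions from SRH."""
        f0_ssh = np.asarray(f0_ssh, dtype=float)
        f0_srh = np.asarray(f0_srh, dtype=float)

        voiced = f0_srh > 0                     # SRH decides which frames are voiced
        f0_hybrid = np.where(voiced, f0_ssh, 0.0)

        # If SRH calls a frame voiced but SSH returned no pitch there, fall back
        # to the SRH value so the hybrid track has no spurious unvoiced gaps.
        fallback = voiced & (f0_ssh <= 0)
        f0_hybrid[fallback] = f0_srh[fallback]
        return f0_hybrid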

3.5.2.3 Perspectives

Our future work encompasses:

- Training machine learning techniques (HMMs) on the different types of sounds in our HBB database, in order to automatically recognize and annotate new beatbox sounds (a minimal sketch of this approach is given after this list).
- Collecting larger sets of vocal performances.
- Applying the sound analysis techniques to the other singing use cases of the project.
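For the first point, a minimal sketch of what HMM-based recognition of beatbox sound classes could look like is given below. It is an illustration only, not the project's implementation: it relies on the third-party hmmlearn package, it assumes MFCC feature matrices (one array of shape (n_frames, n_coeffs) per recording) have already been extracted and grouped by sound class, and the class labels, variable names and model sizes are purely illustrative.

    import numpy as np
    from hmmlearn import hmm  # third-party package, pip install hmmlearn

    def train_class_hmms(features_per_class, n_states=3):
        """Train one Gaussian HMM per sound class.

        features_per_class: dict mapping a class label (e.g. "kick", "snare")
        to a list of MFCC arrays, each of shape (n_frames, n_coeffs).
        """
        models = {}
        for label, sequences in features_per_class.items():
            X = np.vstack(sequences)                   # all frames stacked
            lengths = [len(seq) for seq in sequences]  # frames per recording
            model = hmm.GaussianHMM(n_components=n_states,
                                    covariance_type="diag", n_iter=50)
            model.fit(X, lengths)
            models[label] = model
        return models

    def classify(models, mfcc):
        """Label a new recording by the HMM giving the highest log-likelihood."""
        return max(models, key=lambda label: models[label].score(mfcc))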

4. Partners responsible for each Module/Task

Task                                                      Partner
ICH Capture and Analysis
 - Task 3.1: Facial Expression Analysis and Modeling      Leader: CERTH, Contributors: UPMC
 - Task 3.2: Body and Gesture Recognition                 Leader: CERTH, Contributors: UMONS, UOM, ARMINES
 - Task 3.3: Electroencephalography Analysis              Leader: AUTH
 - Task 3.4: Vocal Tract Sensing and Modeling             Leader: UPMC, Contributors: CNRS, USM, ARMINES
 - Task 3.5: Sound Processing                             Leader: UMONS, Contributors: UPMC, AUTH

Table 11: Partner Roles

5. Conclusions

In this First Report on ICH Capture and Analysis, substantial progress has been made on the data acquisition and analysis modules for the different ICH modalities studied in i-Treasures. Use has been made of some of the latest techniques, such as depth cameras and multi-element inertial sensor modules (the full-body and hand/upper-body movement ICH capture modalities) and new, more user-friendly EEG systems (EEG capture). Moreover, cutting-edge software tools such as point-cloud 3D graphics packages and professional real-time multi-capture data acquisition systems (vocal tract capture) have been developed. A wide range of developments has been made in facial recognition analysis, with an ASM-based generic face model superimposed in real time on a displayed face image. In addition to professional high-performance analysis techniques such as machine learning (electroencephalography analysis) and HMMs (full-body and hand/upper-body analysis), newer, cutting-edge analysis techniques such as the use of multiple synchronous depth cameras (full-body movement) and Deep Learning (vocal tract capture) have also played a role. Finally, challenging, never-before-undertaken analyses such as human beatbox singing analysis (vocal tract; sound analysis) and real-time tongue modeling (vocal tract) are underway.

6. References

[1] P. Ekman and W. V. Friesen, "The Facial Action Coding System: A technique for measurement of facial movement," Consulting Psychologists Press, Palo Alto, CA, 1978.
[2] P. Ekman, "Emotions in the Human Faces", Studies in Emotion and Social Interaction, Cambridge University Press, 2nd edition, 1982.
[3] F. Tsalakanidou and S. Malassiotis, "Real-time 2D+3D facial action and expression recognition", Pattern Recognition, Vol. 43, No 5, pp. 1763-1775, May 2010.
[4] Kinect: www.microsoft.com/en-us/kinectforwindows/
[5] S. Zafeiriou and L. Yin, "3D facial behaviour analysis and understanding", Image and Vision Computing, Vol. 30, No 10, pp. 681-682, October 2012.
[6] M. Pantic and L. Rothkrantz, "Automatic analysis of facial expressions: The state of the art", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No 12, pp. 1424-1445, December 2000.
[7] M. Pantic and L. Rothkrantz, "Facial action recognition for facial expression analysis from static face images", IEEE Transactions on Systems, Man, and Cybernetics - Part B, Vol. 34, No 3, pp. 1449-1461, June 2004.
[8] M. Pantic and I. Patras, "Dynamics of facial expression: recognition of facial actions and their temporal segments from face profile image sequences", IEEE Transactions on Systems, Man, and Cybernetics - Part B, Vol. 36, No 2, pp. 433-449, April 2006.
[9] T. Fang, X. Zhao, O. Ocegueda, S. K. Shah and I. A. Kakadiaris, "3D Facial Expression Recognition: A Perspective on Promises and Challenges", in Proc. 9th IEEE International Conference on Automatic Face and Gesture Recognition (FG'11), Special Session: 3D Facial Behavior Analysis and Understanding, pp. 603-610, Santa Barbara, CA, USA, March 2011.
[10] G. Sandbach, S. Zafeiriou, M. Pantic and D. Rueckert, "Recognition of 3D facial expression dynamics", Image and Vision Computing, Vol. 30, No 10, pp. 762-773, October 2012.
[11] T. Beeler, B. Bickel, P. Beardsley, B. Sumner, and M. Gross, "High-Quality Single-Shot Capture of Facial Geometry", ACM Transactions on Graphics, Vol. 29, No 40, pp. 1-9, 2010.
[12] D. Bradley, W. Heidrich, T. Popa, and A. Sheffer, "High Resolution Passive Facial Performance Capture", ACM Transactions on Graphics, Vol. 29, No 41, 2010.
[13] T. Weise, S. Bouaziz, H. Li and M. Pauly, "Realtime Performance-Based Facial Animation", ACM Transactions on Graphics (Proceedings SIGGRAPH 2011), Vol. 30, No 41, July 2011.
[14] S. Malassiotis and M. G. Strintzis, "Robust face recognition using 2D and 3D data: pose and illumination compensation", Pattern Recognition, Vol. 38, No 12, pp. 2537-2548, 2005.
[15] S. Malassiotis and M. G. Strintzis, "Robust real-time 3D head pose estimation from range data", Pattern Recognition, Vol. 38, No 8, pp. 1153-1165, 2005.

[16] P. Rousseeuw and A. Leroy, "Robust Regression and Outlier Detection", Wiley, New York, 1987.
[17] A. Lanitis, C. J. Taylor, and T. F. Cootes, "Automatic Interpretation and Coding of Face Images Using Flexible Models", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No 7, pp. 743-756, July 1997.
[18] B. Horn, "Closed-form solution of absolute orientation using unit quaternions", Journal of the Optical Society of America A, Vol. 4, No 4, pp. 629-642, 1987.
[19] G. Chiou and J. Hwang, "Lip reading from Color Video", IEEE Transactions on Image Processing, Vol. 6, No 8, pp. 1192-1195, 1997.
[20] S. Lucey, S. Sridharan and V. Chandran, "Robust Lip Tracking using Active Shape Models and Gradient Vector Flow", Australian Journal of Intelligent Information Processing Systems, Vol. 6, No 3, pp. 175-179, 2000.
[21] S. Lucey, S. Sridharan and V. Chandran, "Adaptive Mouth Segmentation using Chromatic Features", Pattern Recognition Letters, Vol. 23, No 11, pp. 1293-1302, September 2002.
[22] www.openni.org/
[23] P. J. Besl and N. D. McKay, "A Method for Registration of 3-D Shapes", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 14, No 2, pp. 239-256, February 1992.
[24] Z. Zhang, "A Flexible New Technique for Camera Calibration", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No 11, pp. 1330-1334, 2000.
[25] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images", in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR '11), Washington, DC, USA, pp. 1297-1304, 2011.
[26] R. B. Rusu and S. Cousins, "3D is here: Point Cloud Library (PCL)", in Proc. IEEE International Conference on Robotics and Automation, 2011.
[27] J. Smisek, M. Jancosek and T. Pajdla, "3-D with Kinect", in Proc. IEEE ICCV Workshops, pp. 1154-1160, 2011.
[28] R. Ben Madhkour, J. Leroy, and F. Zajega, "KOSEI: a kinect observation system based on kinect and projector calibration", QPSR of the numediart research program, Vol. 4, No 4, pp. 71-81, December 2011.
[29] J. Leroy, F. Rocca, and B. Gosselin, "Proxemics Measurement During Social Anxiety Disorder Therapy Using a RGBD Sensors Network", in Proceedings of the MICCAI 2013 Workshop on Bio-Imaging and Visualization for Patient-Customized Simulations, September 2013.
[30] A. Quattoni, M. Collins and T. Darrell, "Conditional Random Fields for Object Recognition", Neural Information Processing Systems, 2004.
[31] S. Wang, A. Quattoni, L.-P. Morency, D. Demirdjian and T. Darrell, "Hidden Conditional Random Fields for Gesture Recognition", in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[32] J. Tilmanne and T. Ravet, The Mockey Database, http://tcts.fpms.ac.be/~tilmanne/, 2010.
[33] Animazoo, IGS-190, http://www.animazoo.com, 2008.

[34] F. S. Grassia, "Practical Parameterization of Rotations Using the Exponential Map", Journal of Graphics Tools, Vol. 3, No 3, pp. 29-48, 1998.
[35] J. Tilmanne, A. Moinet and T. Dutoit, "Stylistic Gait Synthesis Based on Hidden Markov Models", EURASIP Journal on Advances in Signal Processing, 2012:72(1), pp. 1-14, 2012.
[36] K. Tokuda et al., HMM-Based Speech Synthesis System (HTS), http://hts.sp.nitech.ac.jp, 2008.
[37] J. Tilmanne, N. d'Alessandro, M. Astrinaki and T. Ravet, "Exploration of a Stylistic Motion Space Through Realtime Synthesis", 9th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2014), Lisbon, Portugal, January 2014.
[38] J. Bloit and X. Rodet, "Short-time Viterbi for online HMM decoding: Evaluation on a real-time phone recognition task", in Proc. ICASSP 2008, pp. 2121-2124, 2008.
[39] R. Aylward, S. Daniel, J. Lovell, and A. Paradiso, "A compact, wireless, wearable sensor network for interactive dance ensembles", in Proceedings of the International Workshop on Wearable and Implantable Body Sensor Networks, MIT, USA, 2006.
[40] T. Coduys, C. Henry, and A. Cont, "Toaster and Kroonde: High-Resolution and High-Speed Real-time Sensor Interfaces", in Proceedings of the International Conference on New Interfaces for Musical Expression (NIME04), Hamamatsu, Japan, 2004.
[41] T. Todoroff, "Wireless digital/analog sensors for music and dance performances", in Proc. Conference on New Interfaces for Musical Expression (NIME'11), Oslo, Norway, pp. 515-518, May 30 - June 1, 2011.
[42] D. Grunberg, "Gesture Recognition for Conducting Computer Music", 2008. Retrieved July 11, 2011, from http://music.ece.drexel.edu/research/gestureRecognition
[43] N. Rasamimanana and F. Bevilacqua, "Effort-based analysis of bowing movements: evidence of anticipation effects", Journal of New Music Research, Vol. 37, No 4, pp. 339-351, 2009.
[44] Z. Zhang, "A Flexible New Technique for Camera Calibration", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No 11, pp. 1330-1334, 2000.
[45] P. H. S. Torr and A. Zisserman, "Feature Based Methods for Structure and Motion Estimation", in Vision Algorithms: Theory and Practice, Springer, Vol. 1883, pp. 278-294, 2000.
[46] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography", Communications of the ACM, Vol. 24, No 6, pp. 381-395, June 1981.
[47] S. Manitsaris, A. Tsagaris, V. Matsoukas and A. Manitsaris, "Vision par ordinateur et apprentissage statistique : vers un instrument de musique immatériel" [Computer vision and statistical learning: towards an immaterial musical instrument], Actes des Journées d'Informatique Musicale (JIM 2012), Mons, Belgium, pp. 17-22, 2012.
[48] S. Manitsaris, A. Tsagaris, K. Dimitropoulos, and A. Manitsaris, "Finger musical gesture recognition in 3D space without any tangible instrument for performing arts", International Journal of Arts and Technology, in press.
[49] Β. Περουλάκης, "Theory and Practice of Byzantine Music: General Exercises" (in Greek), 2010.
[50] F. Bevilacqua, F. Guédy, N. Schnell, E. Fléty and N. Leroy, "Wireless sensor interface and gesture-follower for music pedagogy", in Proceedings of the International Conference on New Interfaces for Musical Expression, New York, USA, pp. 124-129, 2007.

[51] F. Bevilacqua, B. Zamborlin, A. Sypniewski, N. Schnell, F. Guédy, and N. Rasamimanana, "Continuous realtime gesture following and recognition", LNAI 5934, pp. 73-84, 2010.
[52] F. Delalande, "La gestique de Gould : éléments pour une sémiologie du geste musical" [Gould's gestures: elements for a semiology of musical gesture], in G. Guertin (ed.), Glenn Gould Pluriel, Louise Courteau Editrice Inc., pp. 83-111, 1998.
[53] S. Balakrishnama and A. Ganapathiraju, "Linear discriminant analysis - A brief tutorial", Institute for Signal and Information Processing, Mississippi State University, MS State, MS, USA, 1998.
[54] J. A. Russell, "A circumplex model of affect", Journal of Personality and Social Psychology, Vol. 39, pp. 1161-1178, 1980.
[55] K. A. Lindquist, T. D. Wager, H. Kober, E. Bliss-Moreau, and L. Feldman Barrett, "The brain basis of emotion: a meta-analytic review", Behavioral and Brain Sciences, Vol. 35, pp. 121-202, 2012.
[56] M.-K. Kim, M. Kim, E. Oh, and S.-P. Kim, "A review on the computational methods for emotional state estimation from the human EEG", Computational and Mathematical Methods in Medicine, Vol. 2013, Article ID 573734, 13 pages, 2013, doi:10.1155/2013/573734.
[57] E. I. Konstantinidis, C. A. Frantzidis, C. Pappas, and P. D. Bamidis, "Real-time emotion aware applications: a case study employing emotion evocative pictures and neuro-physiological sensing enhanced by graphic processor units", Computer Methods and Programs in Biomedicine, Vol. 107, pp. 16-27, 2012.
[58] T. D. Pham and D. Tran, "Emotion recognition using the Emotiv Epoc device", in Lecture Notes in Computer Science, Vol. 7667, Neural Information Processing, T. Huang, Z. Zeng, C. Li, and C. S. Leung, Eds., Berlin-Heidelberg: Springer, pp. 394-399, 2012.
[59] N. Jatupaiboon, S. Pan-Ngum, and P. Israsena, "Real-time EEG-based happiness detection system", The Scientific World Journal, Vol. 2013, Article ID 618649, 12 pages, 2013, doi:10.1155/2013/618649.
[60] G. Schalk, D. J. McFarland, T. Hinterberger, N. Birbaumer, and J. R. Wolpaw, "BCI2000: a general-purpose brain-computer interface (BCI) system", IEEE Transactions on Biomedical Engineering, Vol. 51, No 6, pp. 1034-1043, 2004.
[61] Y. Liu, O. Sourina, and M. K. Nguyen, "Real-time EEG-based emotion recognition and its applications", in Lecture Notes in Computer Science, Vol. 6670, Transactions on Computational Science XII, M. L. Gavrilova, C. J. Kenneth Tan, A. Sourin, and O. Sourina, Eds., Berlin-Heidelberg: Springer, pp. 256-277, 2011.
[62] Emotiv Systems Inc., Emotiv EPOC specifications, 2012. [Online]. Available: http://www.emotiv.com/epoc/download_specs.php
[63] E. Niedermeyer and F. H. Lopes da Silva, Electroencephalography: basic principles, clinical applications and related fields, Philadelphia, PA: Lippincott and Wilkins, pp. 139-141, 2004.
[64] Wikipedia contributors, "Comparison of consumer brain–computer interfaces", January 13, 2014. [Online]. Available: en.wikipedia.org/wiki/Comparison_of_consumer_brain-computer_interfaces
[65] G. Pfurtscheller and F. H. Lopes da Silva, "Event-related EEG/MEG synchronization and desynchronization: basic principles", Clinical Neurophysiology, Vol. 110, pp. 1842-1857, 1999.

[66] A. Accardo, M. Affinito, M. Carrozzi, and F. Bouquet, "Use of the fractal dimension for the analysis of electroencephalographic time series", Biological Cybernetics, Vol. 77, No 5, pp. 339-350, 1997.
[67] T. Higuchi, "Approach to an irregular time series on the basis of the fractal theory", Physica D, Vol. 31, pp. 277-283, 1988.
[68] P. C. Petrantonakis and L. J. Hadjileontiadis, "Emotion recognition from EEG using higher order crossings", IEEE Transactions on Information Technology in Biomedicine, Vol. 14, No 2, pp. 186-197, 2010.
[69] P. C. Petrantonakis and L. J. Hadjileontiadis, "Emotion recognition from EEG brain signals using hybrid adaptive filtering and higher order crossings analysis", IEEE Transactions on Affective Computing, Vol. 1, No 2, pp. 81-97, 2010.
[70] B. Kedem, Time series analysis by higher order crossings, Piscataway, NJ: IEEE Press, 1994.
[71] R. J. Davidson, "What does the prefrontal cortex 'do' in affect: perspectives on frontal EEG asymmetry research", Biological Psychology, Vol. 67, pp. 219-233, 2004.
[72] N. Cristianini and J. Shawe-Taylor, An introduction to support vector machines and other kernel-based learning methods, Cambridge: Cambridge University Press, 2000.
[73] T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification", IEEE Transactions on Information Theory, Vol. 13, No 1, pp. 21-27, 1967.
[74] P. J. Lang, M. M. Bradley, and B. N. Cuthbert, "International affective picture system (IAPS): Affective ratings of pictures and instruction manual", University of Florida, Gainesville, FL, Technical Report A-8, 2008.
[75] M. M. Bradley, "Measuring emotion: the self-assessment manikin and the semantic differential", Journal of Behavior Therapy and Experimental Psychiatry, Vol. 25, No 1, pp. 49-59, 1994.
[76] The i-Treasures Project, 2013. [Online]. Available: http://www.i-treasures.eu/
[77] UNESCO, 1995-2012. [Online]. Available: http://www.unesco.org/culture/ich/en/convention

[78] M. Stone, "Evidence for a rhythm pattern in speech production: Observations of jaw movement", Journal of Phonetics, 1991.
[79] B. Denby and M. Stone, "Speech synthesis from real time ultrasound images of the tongue", in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 1, pp. I-685, 2004.
[80] N. Henrich, B. Lortat-Jacob, M. Castellengo, L. Bailly and X. Pelorson, "Period-doubling occurrences in singing: the 'bassu' case in traditional Sardinian 'A Tenore' singing", International Conference on Voice Physiology and Biomechanics, 2006.
[81] N. Henrich, C. d'Alessandro, M. Castellengo, and B. Doval, "On the use of the derivative of electroglottographic signals for characterization of non-pathological voice phonation", Journal of the Acoustical Society of America, Vol. 115, No 3, pp. 1321-1332, 2004.
[82] INTEMPORA S.A., 2011. [Online]. Available: http://www.intempora.com/
[83] Proctor, Bresch, Byrd, Nayak and Narayanan, "Paralinguistic mechanisms of production in human 'beatboxing': A real-time magnetic resonance imaging study", Journal of the Acoustical Society of America (JASA), Vol. 133, No 2, pp. 1043-1054, 2013.
[84] O. Babacan, T. Drugman, N. d'Alessandro, N. Henrich and T. Dutoit, "A comparative study of pitch extraction algorithms on a large variety of singing sounds", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, May 26-31, 2013.
[85] "ESPS software package", http://www.speech.kth.se/software/#esps
[86] D. Talkin, "Speech Coding and Synthesis", Elsevier Science B.V., 1995.
[87] "Speech Signal Processing Toolkit (SPTK)", http://sourceforge.net/projects/sp-tk/
[88] T. Drugman and A. Alwan, "Joint robust voicing detection and pitch estimation based on residual harmonics", in Proc. Interspeech, Firenze, Italy, 2011.
[89] "GLOAT Matlab toolbox", http://tcts.fpms.ac.be/~drugman/Toolbox/
[90] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music", Journal of the Acoustical Society of America, Vol. 111, No 4, pp. 1917-1930, 2002.
[91] "YIN pitch estimator", http://audition.ens.fr/adc/sw/yin.zip
[92] P. Boersma, "Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound", IFA Proceedings, Institute of Phonetic Sciences, University of Amsterdam, pp. 97-110, 1993.
[93] W. Chu and A. Alwan, "Reducing F0 frame error of F0 tracking algorithms under noisy conditions with an unvoiced/voiced classification frontend", in Proc. ICASSP, pp. 3969-3972, 2009.
[94] A. Ellis, "On the Musical Scales of Various Nations", Journal of the Society of Arts, Vol. 33, No 1688, pp. 485-527, 1885.
