Thumb Inclination-Based Manipulation and Exploration, a Machine Learning Based Interaction Technique for Virtual Environments

In the context of Virtual Reality (VR), interactions refer to the plausible actions in a Virtual Environment (VE). Hand-gesture interactions are becoming prominent as a means to an engrossing interface. In this research work, a novel interaction technique is proposed in which interactions are performed on the basis of the position of the thumb in a dynamic image stream. The technique needs no expensive tracker; an ordinary camera suffices to trace hand movements and the position of the thumb. The interaction tasks are enacted in distinct interaction states, where the Angle of Inclination (AOI) of the thumb is used for state-to-state transition. The angle is computed dynamically between the tip of the thumb and the base of the Region of Interest (ROI) of an input image. The technique works in two phases: a learning phase and an application phase. In the learning phase, user-defined fist-postures with distinct AOIs are learnt, and a Support Vector Machine (SVM) classifier is trained on the AOIs of the postures. In the application phase, interactions are performed in distinct interaction states, where a particular state is activated by posing the known posture. To follow the trajectory of the thumb, dynamic mapping is performed so that a virtual hand is controlled by the position of the thumb in the input image. The technique is implemented in a Visual Studio project called Thumb-Based Interaction for Virtual Environments (TIVE). The project was evaluated by a group of 15 users in a moderate lighting condition. The average accuracy rate of 89.7% proves the suitability of the technique for a wide range of VR applications.


INTRODUCTION
VE is the computer-generated emulation of the real world or of an imaginary space. In the context of VR, interaction is the man-machine dialogue inside a VE. With the privilege of interaction, a user sustains the belief of being there and takes a virtual world as a near-to-real world. By now, it has been proved that gesture-based interactions are suitable for 3D (Three Dimensional) interaction in a VE [1,2]. 3D interaction can either be direct or indirect. The former is to select and/or manipulate objects through natural, instinctive gestures, while the latter is to generate commands using menus and button-clicks. Interactions via indirect techniques have little consistency in a VE because of their difficulty and imperceptibility in the virtual space [1].

Department of Computer Science and IT, University of Malakand, Chakdara, KPK, Pakistan. Email: visitrais@yahoo.com (Corresponding Author), sehatullah@hotmail.com, inam.btk@gmail.com, azharitteacher@gmail.com
Interfaces based on perceptive gestures ensure naturalism; therefore, gestural interfaces are suitable for man-machine communication [3]. Although such interfaces raise the degree of realism of a VE [4], hand-gesture-based systems are comparatively more error-prone due to machine-side and user-side challenges [4,5]. In the proposed technique, a user can associate a posture with any interaction by clicking on an interaction task in the learning phase.
In case of an inappropriate posture, the system can be reverted by availing the appropriate option. In the application phase, switching to an interaction state is performed by posing a known fist-posture making the appropriate AOI. On detection of a fist-posture, the AOI is forwarded to the SVM classifier for classification. After an interaction state S is activated, the movements of the hand along the x-, y- and/or z-axis are traced to perform the interaction I about the respective axis. Linking the OpenGL and OpenCV libraries, the TIME technique is implemented in the case-study project, TIVE. A satisfactory accuracy rate of 89.7% was achieved for a total of 240 interaction attempts.
The rest of the paper is organized into six sections. Previous work is discussed in Section 2. The TIME technique is presented in Section 3. Section 4 is about the implementation of the technique for interactions in a VE. Evaluation of the technique without the use of ML is discussed in Section 5. Discussion of the interaction technique is covered in Section 6. Finally, Section 7 presents conclusion and future work.

RELATED WORK
By dint of interactions, a user sustains the belief to be an active actor of the virtual environment. To ensure the naturalness of a VE, the interface of a VR application should be consistent and coordinated [6].
Interactions by traditional interactive tools like the keyboard and mouse are insufficient and lag behind in completely involving users [7,8] in an interactive 3D VE. Hand gestures are adaptable and flexible, and hence suitable for interactions in a VE [2,9].
In the literature of VR, several gesture-based techniques have been proposed so far to ensure feasible 3D interactions. The earlier systems based on magnetic [10] and mechanical [11] tools are becoming inadequate due to their intrinsic limitations and cumbersome setups. Nevertheless, recent advancements in image processing have paved the way for more efficient and natural interfaces. The systems proposed by [12,13,14] utilize static hand postures for interaction. However, static-posture-based systems suffer from orientation obstruction [15]. For dynamic gestural interfaces, colored markers have been proposed for hand-gesture recognition [12,16,17].

TIME: THE PROPOSED TECHNIQUE
This research work intends to incorporate an ML classifier in the design of a gesture-based direct interface. The Time-Tick procedure of the system checks the stillness of a fist-posture before calculating the AOI. The AOI is calculated if a known fist-posture is posed for about 500 ms. A user needs to click on an interaction task (displayed in the form of a dialogue box) to reserve a fist-posture for that task. The AOI and the selected interaction task are forwarded to the SVM classifier for learning. In the application phase, coordinates mapping is performed for real-time interaction with the VE. Based on the AOI, the SVM classifier recognizes the posture and an interaction task is enacted. After initiating an interaction task, the dynamic position of the thumb is traced to interact with the VE. To cancel an interaction task and switch the system back to the default state, the fist-posture with the thumb pointing downward needs to be posed. The entire process is shown in Fig. 1.

Pre Processing
Both in the learning and application phases, a scanned frame image Fr is converted to the YCbCr color space. As proved by [35], YCbCr is the optimal model to separate skin color from non-skin colors. The YCbCr model is, therefore, followed for the segmentation of the hand.
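As a minimal sketch of this step, the chroma thresholding can be expressed as follows. The Cr/Cb ranges below are commonly cited values for skin detection, not the paper's own thresholds, and plain NumPy is used in place of the project's OpenCV calls:

```python
import numpy as np

def skin_mask(frame_bgr):
    """Convert a BGR frame to YCbCr and threshold the chroma channels
    to produce a binary mask (255 = skin, 0 = non-skin)."""
    f = frame_bgr.astype(np.float32)
    b, g, r = f[..., 0], f[..., 1], f[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b     # luma
    cr = 0.713 * (r - y) + 128.0              # red-difference chroma
    cb = 0.564 * (b - y) + 128.0              # blue-difference chroma
    # Widely used skin ranges (assumed, not taken from the paper).
    skin = (cr >= 133) & (cr <= 173) & (cb >= 77) & (cb <= 127)
    return (skin * 255).astype(np.uint8)
```

In the TIVE project the equivalent conversion would be performed with OpenCV's cvtColor followed by a range threshold on the chroma planes.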

Extraction by Sliding Scan
The ROI is extracted using our designed sliding-scan algorithm [37]. The algorithm traces the first white pixels at the top, left and right. The region enclosed by the boundary pixels Top-most (T), Left-most (L) and Right-most (R) is extracted as the ROI, see Fig. 2.

The same algorithm is followed to trace the TT (Thumb Template). To accurately trace the TT, a user needs to pose a fist-posture with the thumb up in the initial frame. In order to avoid false detection of the thumb, the skin region (white) with 5 non-skin pixels at the top, left and right is extracted as the TT. Moreover, an empirical constant ξ is added to the thumb's top-most pixel (TTm) to scan enough of the thumb's region.
The region (white/skin) enclosed by the thumb boundary pixels TL, TR, TT and TD, as shown in Fig. 3, is treated as the TT for onward processing. Once the TT is extracted, template matching [38,39] is performed to locate the thumb position in the dynamic image frames. The area of the thumb (TA) is calculated using the zeroth-order moment [40], whereas the mid-point of the thumb, MT, is obtained from the zeroth- and first-order moments [40]. For a binary thumb image I(x, y), the zeroth-order moment gives the area, TA = M00 = Σx Σy I(x, y), and the mid-point is MT = (M10/M00, M01/M00), where M10 = Σx Σy x·I(x, y) and M01 = Σx Σy y·I(x, y).
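A sketch of the moment computation on a binary thumb mask, using NumPy for illustration (OpenCV's moments() would return the same M00, M10 and M01 for a 0/1 image):

```python
import numpy as np

def thumb_area_and_midpoint(mask):
    """Zeroth-order moment M00 gives the thumb area TA;
    the first-order moments M10, M01 give the mid-point MT."""
    ys, xs = np.nonzero(mask)            # coordinates of white (skin) pixels
    m00 = xs.size                        # TA: number of white pixels
    if m00 == 0:
        return 0, None                   # no thumb found in this frame
    m10, m01 = xs.sum(), ys.sum()
    return m00, (m10 / m00, m01 / m00)   # MT = (M10/M00, M01/M00)
```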

Coordinates Mapping
Parallel processing is performed in the TIVE project to capture real-time video streams and perform interactions in the VE at the same time. However, the coordinate systems of OpenCV and OpenGL are quite different; hence coordinates mapping is required for seamless interaction. The origin O(0,0,0) of the OpenGL rendering frame lies at the middle of the clipping area. In the proposed system, the mapping function w [37] transforms the MT ∈ ℝ² to OpenGL coordinates. During the application phase, let MT_i ∈ ℝ² denote the MT position in the first frame and let MT_d ∈ ℝ² be the dynamic position of the MT in any following input frame. The position of the VH, VHP ∈ ℝ³, is then computed from the normalized displacement ((ΔPx/Tc), (ΔPy/Tr)), where ΔPx and ΔPy are the differences between MT_d and MT_i along x and y, and Tc and Tr represent the total numbers of columns and rows respectively. To keep the VH visible during navigation, the look-at vector of the Virtual Camera (VC) is constantly assigned the z-axis value of the VHP.
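The mapping can be sketched as below; the normalization by frame size and the y-axis flip are the essential parts, while the exact scaling used by the w function in TIVE is an assumption here:

```python
def map_to_gl(mt_dynamic, mt_initial, total_cols, total_rows):
    """Map the thumb mid-point displacement, measured in OpenCV image
    coordinates (origin top-left, y pointing down), to normalized OpenGL
    coordinates (origin at the centre of the clipping area, y pointing up)."""
    dpx = mt_dynamic[0] - mt_initial[0]   # displacement along x (columns)
    dpy = mt_dynamic[1] - mt_initial[1]   # displacement along y (rows)
    # Divide by Tc and Tr to normalize; negate y to flip the axis direction.
    return (dpx / total_cols, -dpy / total_rows)
```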

Fist Detection
The FD module of the system traces static fist-postures. A posture is traced for learning (in the learning phase) or for testing (in the application phase) if posed for about 500 ms. To avoid the gesture-spotting issue [41], extraction of the AOI feature is performed after the expiry of the time-slice. To ensure this, the time-tick module of the system measures the dynamic variation between any two successive image frames, the Next Frame (NF) and the Previous Frame (PF). The absolute bitwise difference between NF and PF is checked against a background stop-watch. A slight hand and/or thumb movement resets the stop-watch. If no variation is detected for approximately 500 ms, the posture is assumed to be posed properly. At the beginning of making a posture, at time t = 0, the first NF is set as PF. With each following tick t > 0, a scanned NF is compared with the PF. If the difference is high, the stop-watch is reset; otherwise it is incremented. At t = 500 ms, the skin-color-based detection from the last NF is performed. The process is shown in Fig. 7 (Schematic of the Time-Tick Module).
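The time-tick logic described above can be sketched as follows, assuming a fixed frame interval of about 33 ms and an arbitrary difference threshold (both values are illustrative, not from the paper):

```python
import numpy as np

class TimeTick:
    """Detect that a posture has been held still for ~500 ms by comparing
    successive frames; a significant inter-frame difference resets the clock."""
    def __init__(self, hold_ms=500, frame_ms=33, diff_thresh=8.0):
        self.hold_ms, self.frame_ms, self.diff_thresh = hold_ms, frame_ms, diff_thresh
        self.prev = None
        self.elapsed = 0

    def update(self, frame):
        """Feed the next frame (NF); returns True once the posture is still."""
        if self.prev is None:
            self.prev = frame              # first NF becomes PF
            return False
        diff = np.mean(np.abs(frame.astype(np.int16) - self.prev.astype(np.int16)))
        self.prev = frame
        if diff > self.diff_thresh:
            self.elapsed = 0               # movement: reset the stop-watch
        else:
            self.elapsed += self.frame_ms  # stillness: advance the stop-watch
        return self.elapsed >= self.hold_ms
```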

Fist Learning
The FL module learns a scanned posture on the basis of the AOI feature. In order to allow the user to set a captured posture for an interaction, a list of interaction tasks is displayed over the posture image. The list contains the basic interactions, except navigation, which is accomplished by the perceptive movement of the hand in the default state.
From the displayed interaction tasks, a user may select one interaction at a time to associate it with a posture. By selecting an interaction task from the list, a posture is reserved for that particular interaction task. Taking the Bottom-Mid (BM) position of the frame image as the origin (see Fig. 8), the single lightweight feature AOI is calculated as the angle of the line from BM to the thumb tip, AOI = tan⁻¹(Δy/Δx), where Δx and Δy are the horizontal and vertical distances between the thumb tip and BM. A unique Fist ID (Fid) is assigned to the posture; the Fid is used as the label of the class representing the feature vector.

The classifier is designed to learn features F_i ∈ ℝ and a set Y of class labels y_i ∈ Y, where i ∈ {1, 2, …, n}. Using the feature vectors, the SVM builds an optimal hyperplane. The hyperplane is used to predict a class label y_K for a feature vector F_K using the set S of features and class labels, S = {(F_1, y_1), …, (F_n, y_n)}, such that in the binary case, with y ∈ {−1, 1}, most of the features belonging to y_K lie on one side of the hyperplane. In the proposed technique, the classifier learns the features AOI_i ∈ ℝ and class labels y_i for n distinct fist-postures, i ∈ {1, 2, …, n}. Hence, the set S is given as S = {(AOI_1, y_1), …, (AOI_n, y_n)}. The inner-product space X ⊆ ℝ is computed to get the scoring function between AOI ∈ X and y ∈ Y = {1, 2, …, n}. The function f, measuring the similarity of an input instance AOI ∈ X in the defined prototype space D, is given as f : X × D → ℝ. During the learning phase, a unique natural number, Fid, is assigned dynamically to a fist-posture when a user opts to associate the posture with an interaction task. The same Fid is used as the class label for the detected feature (AOI) of the fist-posture. Therefore, with the dataset D the SVM classifier associates each Fid with its feature vector AOI.
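The learning phase can be sketched with scikit-learn's SVC as a stand-in for the paper's SVM implementation; the AOI values and Fids below are made-up examples:

```python
import math
from sklearn import svm

def compute_aoi(thumb_tip, bottom_mid):
    """AOI in degrees: inclination of the line from the ROI's Bottom-Mid
    point BM (taken as origin) to the thumb tip. Image y grows downward,
    so it is negated to obtain a conventional angle."""
    dx = thumb_tip[0] - bottom_mid[0]
    dy = bottom_mid[1] - thumb_tip[1]
    return math.degrees(math.atan2(dy, dx))

# One (AOI, Fid) pair per user-defined fist-posture -- illustrative values.
samples = [[170.0], [90.0], [10.0], [-90.0]]   # e.g. thumb left, up, right, down
fids = [1, 2, 3, 4]                            # class labels (Fist IDs)
clf = svm.SVC(kernel="linear")
clf.fit(samples, fids)
```

In the application phase, `clf.predict([[aoi]])` returns the Fid of the recognized posture.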

Fist Handling
After learning the postures for different interactions, the FH module identifies a known posture using the One-Versus-Rest (OVR) approach instead of One-Versus-One [44]. To obtain the label y_K with the OVR approach, each of the n classes is compared against the remaining n − 1 classes for an unknown extracted feature AOI_K as y_K = argmax_{i ∈ {1,…,n}} (w_i · γ(AOI_K) + b_i), where γ is the decision function, w_i the weight vector and b_i the intercept of the hyperplane [45]. The predicted class label y_K is used as the Fid to get the associated State-ID (SID). By posing a known fist-posture Fid_K, the interaction task T_K bearing SID_K is performed. The process of initiating an interaction task from a fist-posture is shown in Fig. 9. An object is selected for manipulation as the VH enters into the aura of the object.
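A sketch of the OVR decision rule with the scalar AOI feature; the per-class weights and intercepts would come from the trained SVM, and the identity map stands in for the decision function γ:

```python
def ovr_predict(aoi_k, weights, biases):
    """One-Versus-Rest prediction: score each class i as
    w_i * gamma(AOI_K) + b_i and return the argmax label.
    Class labels (Fids) are assumed to start at 1."""
    scores = [w * aoi_k + b for w, b in zip(weights, biases)]
    return scores.index(max(scores)) + 1
```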

Navigation
Navigation refers to the sense of locomotion inside a VE. For exploring a VE, navigation is supported in the default state. As is conceivable, the inside (forward) navigation is performed by the forward hand movement, see Figs. 10-11. The reverse (backward) navigation is carried out by moving the hand away from the camera. To deduce forward or backward hand movement in a scanned 2D image, the initial TA is compared with the Dynamic Thumb Area (DTA), DTA = γ·TA for γ > 1. To prevent the possibility of unintentional hand movement, an increase or decrease of up to 8 units is ignored. The movement of the hand along the x-axis for translation along the x-axis is shown in Fig. 12.
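The area comparison with the 8-unit dead zone can be sketched as follows (the function and state names are illustrative, not from the project):

```python
def navigation_step(initial_area, dynamic_area, dead_zone=8):
    """Decide the navigation direction from the change in thumb area:
    the thumb appears larger as the hand moves toward the camera (forward)
    and smaller as it moves away (backward). Changes within the dead zone
    are treated as unintentional and ignored."""
    delta = dynamic_area - initial_area
    if abs(delta) <= dead_zone:
        return "idle"                      # unintentional jitter
    return "forward" if delta > 0 else "backward"
```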

Scaling
Scaling is to increase (scale-up) or decrease (scale-down) the size of an object. Scaling up about the x- and y-axes is performed by hand movements along the positive x- and y-axes respectively. As is conceivable, scaling down is carried out by hand movements in the opposite (negative) directions.

IMPLEMENTATION AND EVALUATION
The TIME technique is implemented in the case-study application TIVE using a Core i5 laptop with a 2.60 GHz processor and 4GB of RAM. In a Visual Studio project, the OpenGL library was used for the front-end VE. At the back end, image processing was performed by the OpenCV library. Offering a first-person view, the VH represents the user's position in the VE, see Fig. 14. Different 3D objects are rendered at different points of the VE so as to engross the users during interaction. The system provides both a textual and an audio (beep) signal whenever a user initiates an interaction. With the 'r' key-press event, the entire system is reset and the VC eye is set to look at the origin of the scene, O(0,0,0). The users were familiarized with the system by demonstrating how to interact and make the postures. Moreover, all the participants performed pre-trials for the basic interaction tasks. They were guided to press the Enter key to reset the system for a new trial. All the experiments were performed in the University IT lab in an average lighting condition with an illumination level of approximately 110 lux [46]. A user interacting with the system with his thumb is shown in Fig. 15.

The Interaction Tasks
Participants were asked to perform the following four tasks in the designed 3D environment. The tasks are arranged so as to assess the basic interactions: Selection, Scaling, Navigation and Translation. In the middle of the VE, a Teapot is rendered to be picked (selected) and manipulated. Each of the users performed two trials of the tasks. In a single trial, scaling is assessed three times, selection and navigation are evaluated twice, and translation is evaluated once. False detections and inappropriate interactions were deemed errors.
The overall accuracy achieved for the 240 interaction attempts, as shown in Table 1, was 89.7%. The means of the percentage accuracy rates of the two trials are shown in Fig. 16.

Learning Effect
Outcomes of the evaluation revealed that the performance of the users improves with practice. The learning effect was measured from the error occurrence rate. To analyze differences in the means of the two trials, a paired two-sample t-test was used. It was assumed that the means were the same (H0: μ_d = 0). The hypothesis H0 was rejected after a significant difference was found between the outcomes of the Trial-1 (M=63.03, SD=5.4) and Trial-2 (M=39.92, SD=5.9) conditions; t(6)=−9.08, p=0.009.
The graph showing the error percentages of Trial-1 and Trial-2 is shown in Fig. 17. During translation and navigation, the MT was sometimes wrongly traced, and hence comparatively more errors were counted, as shown in the figure.

Subjective Analysis
At the end of the evaluation, a questionnaire was presented to the participants to measure three factors: Ease of Use, Fatigue and Suitability in a VE. Most of the participants opted in favor of the technique. The percentages of the users' responses acquired by the questionnaire are shown in Fig. 18.

RECOGNITION BY EXPLICIT PROGRAMMING
To evaluate the recognition of fist-postures without the use of ML, a separate project, TIVE-2, was designed by modifying the code of the TIVE project. As no learning is involved in TIVE-2, the FL and FH modules were replaced by a single module, FR (Fist Recognizer). Using a nested if-else structure, explicit programming was performed to recognize the different fist-postures. The algorithms for dynamic interaction by the movements of the thumb (as discussed under subsection 3.7) were kept unchanged. With TIVE-2, the same tasks were performed by twelve participants in the same environment (the university IT lab). Each of the users performed two trials of the tasks. An average accuracy rate of 82.8% was achieved for 192 interaction attempts (see Table 2). As the AOIs for the different fist-postures were explicitly set, some postures were not correctly identified. The reason behind the lower accuracy was variation in the size (length and width) of the users' hands and/or thumbs. Moreover, some of the participants faced difficulties in posing the exact fist-postures with the required AOI.

DISCUSSION
According to recent research works in VR interaction, it has been proved that gesture-based interactions are suitable for a VE. However, it is also a fact that such systems are difficult to design. As hand size and finger length vary from individual to individual, interactions by whole-hand gestures are susceptible to false recognition.
With the TIME technique, a novel interaction approach is proposed in which users are able to interact with a 3D environment by simple postures of the thumb.
With the inclusion of the ML classifier (SVM), the system is made intelligent enough to associate a fist-posture with an interaction task at run time. To analyze the outcomes of the technique with and without the ML classifier, two projects, TIVE and TIVE-2, were designed. With TIVE, a user trains the system with different fist-postures at run time. In the TIVE-2 project, the AOIs for the postures are pre-defined during coding. As a user trains and tests the system with his/her own fist-postures, a comparatively high accuracy rate was achieved for the TIVE project. However, due to dissimilarities in the hand and thumb sizes of the users, a lower accuracy rate was reported for TIVE-2.
The distinguishing feature of the proposed technique is that it frees a VR user from remembering gestures set by others. Once an interaction state is activated, the perceptive horizontal and vertical movements of the hand are traced for translation, selection, navigation and scaling. Unlike the costly and complex setups of data-gloves [47] and armbands [48,49], an ordinary camera is used for the detection of the hand and thumb. The outcomes support the applicability of the technique in the VR domain. It is pertinent to add that during the evaluation, most of the errors were observed to be due to quick movements of the user's hand; in such cases an ordinary camera misses some of the required frame data. The challenge of quick hand movement can be addressed by using a high-quality camera. Moreover, with the use of a high-speed processor, a faster frame rate and timely extraction of frame data are possible. In short, the accuracy rate of the system can be raised with a high-speed processor and a quality camera.

CONCLUSION
To cope with the rampant pace of VR development, a simple and natural interface is needed for intuitive 3D interaction. With this contribution, we propose an ML-based interaction technique where interactions are performed by simple movements of the hand. Based on the positions of the thumb, different fist-postures are learnt and recognized using the lightweight feature AOI. To increase the accuracy of the technique, the SVM classifier may be trained with additional features as well. For instance, the angle of declination of the thumb may be used to unambiguously specify the position of the thumb in the input stream. Similarly, image-based features after adaptive tiling may be used to improve the accuracy rate. However, the single lightweight feature AOI is used to ensure quick processing. The technique was evaluated twice: with and without the ML classifier. An average accuracy of 89.7% was achieved for the TIVE project, where the ML algorithm is used to recognize the fist-postures. In the TIVE-2 project, postures are identified without the use of the ML classifier. In a separate evaluation session, a comparatively low accuracy (82.8%) was reported for TIVE-2. It is pertinent to note that the TIVE-2 project was evaluated by 12 users. By increasing the number of users, the probability of dissimilarities among hand/thumb sizes would increase; hence, the possibility of low accuracy would also increase when the technique is used without the ML classifier.
On the whole, the outcomes of the evaluations suggest the suitability of the technique in a wide spectrum of man-machine interactions, particularly in 3D gaming, robotics, virtual prototyping and simulation. The work also presents an integration of image processing, ML and VR. With little effort, the technique can be made implementable on other sensing platforms. As our future strategy, we intend to enhance the system for collaborative VEs.