Quill: a Gesture Design Tool for Pen-based User Interfaces


Quill: a Gesture Design Tool for Pen-based User Interfaces

by

Allan Christian Long, Jr.

B.S. (University of Virginia) 1992
M.S. (University of California, Berkeley) 1996

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the GRADUATE DIVISION of the UNIVERSITY OF CALIFORNIA, BERKELEY

Committee in charge:
Professor James A. Landay, Chair
Professor Lawrence A. Rowe
Professor Richard Ivry

Fall 2001

The dissertation of Allan Christian Long, Jr. is approved:

________________________________________________________________
Chair                                                       Date

________________________________________________________________
                                                            Date

________________________________________________________________
                                                            Date

University of California, Berkeley

Fall 2001

Quill: A Gesture Design Tool for Pen-based User Interfaces

Copyright 2001
by
Allan Christian Long, Jr.

Abstract

Quill: A Gesture Design Tool for Pen-based User Interfaces

by Allan Christian Long, Jr.

Doctor of Philosophy in Computer Science

University of California, Berkeley

Professor James A. Landay, Chair

This dissertation describes the motivation, design, and development of a tool for designing gestures for pen-based user interfaces. Pens and other styli have been ubiquitous for recording information for centuries. Recently, pen-based computers have become common, especially small devices such as the Palm Pilot. One benefit pens provide in computer interfaces is the ability to draw gestures: marks that invoke commands. Gestures can be intuitive and faster than other methods of invoking commands. However, our research shows that gestures are sometimes misrecognized and hard to remember. We believe these problems are due in part to the difficulty of designing good gestures (that is, gestures that are easy to remember and are recognized well) and the lack of tools for helping designers create good gestures. We believe that an improved gesture design tool can help interface designers create good gestures for their applications. Since people confuse similar objects and misremember them, we performed experiments to measure why people perceived gestures as similar. We derived computational metrics for predicting human perception of gesture similarity. Based on the results of our experiments, we developed a gesture design tool, quill. The tool warns designers about gestures that may be hard to remember or recognize, and provides advice about how to improve the gestures. It also provides a convenient way to test recognition of gestures. To evaluate quill, a user study was performed with 10 professional user interface designers and one professional web designer. All designers were able to create gesture sets using

quill, but not all designers benefited from quill's suggestions. More work is needed to make suggestions useful for most designers. The primary contributions of this work are:

- Improved understanding of the gesture design process, including the types of problems people encounter when designing gestures.
- Computational models for predicting human-perceived gesture similarity.
- Confirmation of the importance of good naming for gesture memorability.
- An intelligent gesture design tool, quill, which automatically warns designers about potential problems with their gestures and advises them about how to fix these problems.

This work also suggests several directions for future work in gesture design tools and in gesture similarity and memorability.

Dedication

To Alexis

Table of Contents

Abstract 1
Dedication i
Table of Contents ii
List of Figures iv
List of Tables vii
Acknowledgements ix
I. Introduction 1
   1. Related Work 5
II. PDA and Gesture Usage Survey 17
   1. Method 17
   2. Results 23
   3. Discussion 31
   4. Summary 36
III. Development and Evaluation of a Simple Gesture Design Tool 37
   1. Gesture Recognition Algorithm 37
   2. Gesture Design Tool Description 46
   3. Experimental Method 51
   4. Results and Analysis 55
   5. Discussion 59
   6. Summary 63
IV. Gesture Similarity Experiments 64
   1. Similarity Experiment 1 64
   2. Similarity Experiment 2 81
   3. Similarity Experiment 3: Pairwise Similarity Survey 89
   4. Discussion 96
   5. Summary 100
V. Gesture Memorability Experiment 102
   1. Participants 103
   2. Equipment 104
   3. Procedure 110
   4. Analysis 114
   5. Results 115
   6. Discussion 119
   7. Summary 121
VI. quill: An Intelligent Gesture Design Tool 122
   1. Goals 122
   2. Gesture Hierarchy and Naming 124
   3. Active Feedback 125
   4. quill Example 127
   5. User Interface Issues 128
   6. User Interface Evolution 134

   7. Implementation Issues 145
   8. Summary 159
VII. quill Evaluation 160
   1. Participants 160
   2. Equipment 161
   3. Procedure 161
   4. Quantitative Results and Analysis 163
   5. Qualitative Results 167
   6. Discussion 169
   7. Proposed quill Experiment 172
   8. Summary 177
VIII. Future Work 179
IX. Conclusions 183
Bibliography 186
A. PDA User Survey 194
   1. PDA User Survey 194
   2. Results 200
B. Evaluation of gdt 204
   1. Experiment Overview 204
   2. GDT Tutorial 205
   3. Practice Task 210
   4. Experimental Task 211
   5. Post-experiment Handout 213
   6. gdt Experiment Questionnaire 215
   7. GDT Experiment Script 221
   8. Consent Form 225
   9. Results 226
C. Gesture Similarity Experiments 228
   1. Gesture Similarity Experiment Overview 228
   2. Similarity Experiment Script 228
   3. Post-experiment Questionnaire 230
D. Gesture Memorability Experiment 234
   1. Gesture Memorability Experiment 234
   2. Memorability Experiment Script 235
   3. Memorability Experiment Questionnaire 238
E. quill Documentation 242
   1. quill Tutorial 242
   2. quill Reference Manual 254
F. quill Evaluation 276
   1. Overview 276
   2. Long Task: Presentation Editor 277
   3. Short Task: Web Browser 278
   4. Post-experiment Handout 279
   5. Experimenter Script 280
   6. quill Experiment Questionnaire 283
   7. Experimental Results 290

9 List of Figures Figure 1-1 Example of a delete gesture. 2 Figure 1-2 Example multistroke gesture (x from Graffiti). The dot is not part of the gesture, but only specifies the starting point, as is conventional in Graffiti. 2 Figure 2-1 Agree/disagree statements about gestures. 21 Figure 2-2 PDA user technical sophistication. Fractions are separated by type of PDA. used. 24 Figure 2-3 PDA user education level. Fractions are separated by type of PDA used. 25 Figure 2-4 Time using current PDA. 26 Figure 2-5 Representative delete gesture usage frequency for Newton users. 27 Figure 2-6 Application use: higher numbers indicate more frequent use. Thin lines show standard deviation. 29 Figure 2-7 Tasks done on paper instead of on PDA. Some respondents listed more than one task, others listed none. 30 Figure 2-8 Types of notes taken in meetings. 32 Figure 3-1 List of features from Rubines recognizer. 39 Figure 3-2 Example gesture with feature annotations (from [Rub91b]). 40 Figure 3-3 gdt main window. 47 Figure 3-4 gdt class window. 47 Figure 3-5 gdt distance matrix. To find the distance between two classes, find one class name in the column labels and the other class name in the row labels, and find the entry at that column and row. Larger distances are grayed out, based on the value of the threshold slider on the right. 48 Figure 3-6 gdt classification matrix. 49 Figure 3-7 gdt feature graph. All feature values of all training examples for all classes are in the graph (only 2 features fit in the window at once, however). 50 Figure 3-8 Baseline gesture set used in gdt experiment. 53 Figure 3-9 Operations for which participants invented gestures. 54 Figure 3-10 Recognition rates by participant for different stages of the gdt experiment. 56 Figure 3-11 Angle between first and last points visualization. S is the start point. E is the end point. 1 is a horizontal ray from S. 2 connects S and E. 3 represents the value of the feature, which is the angle between 1 and 2. 62 Figure 4-1 Gesture set for first gesture similarity experiment. A dot indicates where each gesture starts. (Gestures are not numbered consecutively because they were chosen from a larger set.) 65 Figure 4-2 Triad program for the first and second similarity experiments. 66 Figure 4-3 Practice gesture set for the first similarity experiment. 67 Figure 4-4 First similarity experiment, stress vs. dimension. This graph shows a knee a dimension (D) equals 3. It also shows ordinal with proximities gave the best fit to the data (i.e., the lowest stress value). 68 Figure 4-5 First similarity experiment, vs. dimension. There is no obvious knee in the curve, but the graph shows that ordinal with proximities is the best fit (i.e., highest ). 69 iv

10 Figure 4-6 MDS plot of dimensions 1 and 2 for first similarity experiment. 75 Figure 4-7 MDS plot of dimensions 1 and 2 for first similarity experiment, with hand annotations showing groupings based on geometric properties, such as straightness vs. curviness. 76 Figure 4-8 MDS plot of dimensions 3 and 4 for first similarity experiment. Features for dimension 4 are given in Table 4-3. 77 Figure 4-9 MDS plot of dimensions 4 and 5 for first similarity experiment. Features the dimensions represent are given in Table 4-3. 78 Figure 4-10 Dimensions 1 and 2 of 3D configuration for similarity experiment 1. 79 Figure 4-11 Dimensions 2 and 3 of 3D configuration for similarity experiment 1. 80 Figure 4-12 Similarity experiment 2, gesture set 1. It was used to explore absolute angle and aspect. 81 Figure 4-13 Similarity experiment 2, gesture set 2. It was used to explore length and area. 82 Figure 4-14 Similarity experiment 2, gesture set 3. It was used to explore rotation. 83 Figure 4-15 Similarity experiment 2, gesture set 4. It includes gestures from Figures 4- 12, 4-13, and 4-14. 84 Figure 4-16 Similarity experiment 2, stress vs. dimension in MDS analysis. 85 Figure 4-17 Similarity experiment 2, vs. dimension in MDS analysis. 86 Figure 4-18 MDS plot of dimensions 1 and 2 for second similarity experiment (combined data). 87 Figure 4-19 The 37 pen gestures used in the similarity web survey. 90 Figure 4-20 Survey web page for third similarity experiment with sample gesture pair. 91 Figure 4-21 The discriminant function for predicting human-perceived similarity. 94 Figure 4-23 The single most significant feature: sine of initial angle. 94 Figure 4-22 Accuracy of logistic regression similarity model as a function of percentage of similar gestures in the population. 95 Figure 5-1 Commands for memorability experiment. 104 Figure 5-2 Screen that the participant sees while waiting for the experimenter to judge a gesture. 111 Figure 5-3 Screen that the experimenter uses to judge gestures. 112 Figure 5-4 Screen showing participant the correct gesture after an incorrect answer. 112 Figure 5-5 Example iconicness question. 113 Figure 6-1 quill UI overview. 128 Figure 6-2 quill with multiple windows, showing several gesture categories open at once. 129 Figure 6-3 quill recognizes a gesture and shows the result several different ways. 130 Figure 6-4 quill tree view of a training gesture set. 130 Figure 6-5 quill desktop view of a gesture set, gesture group, example (training gesture), and gesture category (clockwise from top left). 131 Figure 6-6 Sketch of quill prototype. 134 Figure 6-7 Storyboard of quill prototype, showing creation of a new gesture category. Top-left shows an empty gesture package. Top-right shows a new gesture v

11 category added. Bottom-left shows the category renamed and one gesture added. Bottom-right shows several gestures added. 135 Figure 6-8 Sketch of quill prototype. This shows a bad training example being reported. 135 Figure 6-9 Storyboard of quill prototype. At the top left the user clicks more info to show the top right. From there the user can click sharpness to show the bottom left screen or Plot sharpness to show the bottom right. 136 Figure 6-10 A way to show training gestures and test gestures for a category together. 136 Figure 6-11 quill low-fi prototype, scenario 1. Initial report of a bad training example. The notice is shown in the log at the bottom, and a warning icon is placed next to the category in the tree and desktop views. 137 Figure 6-12 quill low-fi prototype, scenario 1. A bad training example is highlighted in red. 138 Figure 6-13 quill low-fi prototype, scenario 2. The designer saw the warning at the bottom and has clicked on the More info button. angle of the bounding box is a blue hyperlink. 138 Figure 6-14 quill low-fi prototype, scenario 2. The designer clicked on angle of the bounding box and is shown a graphical explanation (in the window above: Feature: angle of the bounding box). 139 Figure 6-15 Empty gesture category overlay, used in paper prototype evaluation. 139 Figure 6-16 Dialog box from the paper prototype evaluation. Used to choose a test set. 140 Figure 6-17 quill UI overview. 143 Figure 6-18 Misrecognized gestures. 144 Figure 6-19 quill class diagram. Light gray boxes are classes, medium gray boxes are abstract classes, and dark gray boxes are interfaces. 146 Figure 6-20 quill notice classes. 156 Figure 6-21 quill classes for displaying gestures on the desktop. 158 Figure 7-1 Gesture groups and categories for the long experimental task (presentation editor). 162 Figure 7-2 Gesture groups and categories for the short experimental task (web browser). 162 Figure 7-3 Interaction between feedback and task order. Values are average human goodness (scale is 01000, higher is better). 164 Figure 7-4 Example questions from quill post-experimental questionnaire. 166 Figure E-1 Main window regions. 243 Figure E-2 Initial, empty main window. 244 Figure E-3 Main window resizing and layout controls. 250 vi

12 List of Tables Table 2-1 Operations asked about in survey and corresponding gestures on the Apple Newton and Palm Pilot, where gestures exist. Blank spaces indicate that no gesture exists. 20 Table 2-2 PDA user professions. 24 Table 2-3 PDA usage frequency, as a percent of users of each type of PDA. 25 Table 2-4 Gesture usage frequency. 1 = Never, 2 = Rarely, 3 = Often, 4 = Very often. 26 Table 2-5 Opinions about gestures. 1 = Strongly agree, 4 = Strongly disagree. Av- erages indicating disagreement are in bold. 28 Table 2-6 Frequency of use of PDAs during meetings. 1 = never, 4 = very often. 31 Table 4-1 Possible predictors for similarity. Features 1-11 are taken from Rubines rec- ognizer (see 3.1.1.1, p. 39). 71 Table 4-2 Gesture coordinates from MDS for similarity experiment 1, 5D analysis. 74 Table 4-3 Predictor features for similarity experiment 1, listed in decreasing order of im- portance for each dimension. 75 Table 4-4 Coefficients of similarity model from first similarity experiment. 76 Table 4-5 Gesture coordinates from MDS for experiment 1, 3D analysis. 79 Table 4-6 Coefficients of similarity model for experiment 1 from 3D MDS analysis. 80 Table 4-7 Predictor features for similarity experiment 2 (using data from all experiment 2 gesture sets). 87 Table 4-8 Coefficients for overall similarity model for second similarity experiment.88 Table 4-9 Frequency of similarity ratings for experiment 3. 91 Table 4-10 Summary of similarity models. Shows how each model correlates with data from each experiment. The model from experiment 3 is not shown because it produces a similar/not similar answer, so correlation is not appropriate. 96 Table 4-11 Evaluation of similarity model 3 using data from experiments 1 and 2. For each experiment, ther is a significant difference in dissimilarity rating report- ed by participants between gestures that model 3 predicted as similar vs. non- similar. 97 Table 5-1 Memorability experiment gestures, along with the command mappings for the three experimental conditions. 105 Table 5-2 Number of memorability experiment participants by mapping and phase. 115 Table 5-3 Summary of memorability experiment. The left side gives average values. The right side gives p values where there is a significant difference, or a blank if there is no significant difference (p > 0.1). 116 Table 5-4 Correlations of iconicness with other variables. All values are significant at p < 0.015 except italicized values, which are not significant. See Table 5-5 for exact p values. 117 Table 5-5 Significance of correlation of iconicness with other variables. Blank p values indicate the correlation is not statistically significant. See Table 5-4 for corre- lations. 117 Table 5-6 Coefficients for learnability and memorability predictions. (See 3.1.1.1, vii

13 p. 39, for a description of the features). 118 Table 5-7 Correlation of memorability and learnability prediction with model data (used to create the model) and test data (data set aside and not used in model cre- ation). Blank p values indicate no statistical significance. Italic p values indi- cate weak statistical significance. 119 Table 6-1 Strategies for handling analysis in a background thread. 151 Table 6-2 Notice summary. 155 Table 7-1 Means for tasks in the quill evaluation. Goodness is on a scale of 01000, where 1000 is perfect. Time is in minutes. 163 Table 7-2 Summary of questions in post-experiment survey. 165 Table 7-3 Correlations of questionnaire answers with performance. Statistically signifi- cant correlations are in bold. (Responses to questions on the long task and the short task (2834 and 3642) are correlated with performance on the long and short tasks, respectively.) 167 Table B-1 gdt evaluation results (part 1). 226 Table B-2 gdt evaluation results (part 2). 227 Table F-1 Correlations of questionnaire with performance. Statistically significant cor- relations are bold. (Responses to questions on suggestions (8-13) are correlat- ed with performance on suggestion-enabled task. Responses to questions on long task and short task (28-34 and 36-42) are correlated with performance on the long and short tasks, respectively.) 291 viii

Acknowledgements

So many people and experiences have helped me reach the point where I am now that it is difficult to know where to begin. First, I would like to thank my wife, Alexis. The support she has provided and the joy she has brought to my life have helped me immeasurably during the arduous task of finishing my dissertation. Many thanks are also due to my advisor, James Landay, and my committee member and former advisor, Lawrence Rowe. I feel very fortunate to have had such great mentors and advisors to help me with my research. I would also like to thank my parents, without whose constant love and support I would likely have never had the courage to attempt a PhD in the first place. They have always supported my interests and encouraged me to do my best in everything, even when it meant moving over 2,000 miles away to come to Berkeley. Thanks, Mom and Dad. I have been blessed with many wonderful friends in my time at Berkeley, and I cannot mention them all, so I'll just say thanks to a few of the closest ones: Eric Zylstra, Rick Bukowski, and Laura Downs. You helped make the good times great and eased the pain of the not-so-good times. Another great blessing in my life in recent years has been my small group Bible study. Their friendship and love have meant a lot to me, and their prayers have helped me through many difficult times. I would especially like to thank Jason and Julie Hotchkiss, Jacqueline Huen, and Joel and Barbie Kleinbaum. I would also like to thank everyone from Crossroads, the young adult fellowship that started our small group, and where my wife and I met. You were a wonderful example of what Christian community can be, and the source of many great friendships. Many fellow graduate students and colleagues have been supportive and helpful over the years, especially fellow members of the Group for User Interface Research. I would like to single out a few, but I'm grateful to you all. Professor Marti Hearst was always a helpful source of insights and critiques. Rashmi Sinha was a great help with experimental design

and data analysis. Also, Joseph Michiels was immensely helpful with the first two gesture similarity experiments and the gesture memorability experiment. I would also like to thank the Microsoft Usability Lab, and Ken Hinckley especially, for their invaluable help recruiting participants and providing space for the quill evaluation. I also owe thanks to all the people who participated in my many experiments over the years. Thank you all. Berkeley has been a wonderful place to be a graduate student also because of the excellent staff here, both technical and administrative. I am very grateful for their help throughout my graduate student tenure. Finally, I feel a debt to my undergraduate advisor, Randy Pausch, and the wonderful research group I worked with at the University of Virginia (now at Carnegie Mellon University and known as the Stage 3 Research Group). My experiences there showed me how much fun research can be.

Chapter 1
Introduction

Interest in pen-based user interfaces is growing rapidly, and with good reason. Pen and paper has been an important, widely used technology for centuries. It is versatile and can easily express text, numbers, tables, diagrams, and equations [Mey95]. Many authors list the benefits pen-based computer interfaces could enjoy on desktop and portable computing devices [BDBN93, FHM95, HB92, Mey95, MS90, WRE89]. In particular, commands issued with pens (i.e., gestures) are desirable because they are commonly used and iconic, which makes them easier to remember than textual commands [MS90]. They are also faster, because command and operand are specified in one stroke. Recently, more computer users have adopted pen-based computers. In 2000, 3.5 million handheld computers were sold, a substantial increase over 1999 sales of 1.3 million [Lue01]. The use of pen-based input for desktop systems is also growing as the cost of tablets and integrated display tablets falls. As pen-based devices proliferate, pen-based user interfaces become increasingly important. Gestures have also become more prevalent recently. The Opera web browser supports a small set of gestures for navigation [Sof01]. Sensiva and KGesture provide gestures for common desktop applications and allow users to create their own gestures under Microsoft Windows and the K Desktop Environment (KDE), respectively [Sen01, Pil01]. A great deal of interface research has dealt with conventional windows, icons, menus, and pointing (WIMP) graphical user interfaces (GUIs). There is also a substantial body of work on pen-based interfaces, but many hard problems remain. To a first approximation, a pen can be used to replace a mouse, but what is easy with a mouse is not necessarily easy with a pen, and vice versa. For any given task, the ideal pen-based UI will not be limited to techniques developed for GUIs, but will incorporate pen-specific techniques that take advantage of the unique capabilities of pens. To take full advantage of pen-based UIs, new interaction techniques will have to be invented that are tailored to the characteristics of pens.

Figure 1-1 Example of a delete gesture.

Figure 1-2 Example multistroke gesture (x from Graffiti). The dot is not part of the gesture, but only specifies the starting point, as is conventional in Graffiti.

The pen-specific technique on which this dissertation focuses is gestures. A gesture is a mark or stroke that causes a command to execute. An example of a delete gesture is shown in Figure 1-1. Gestures may be single-stroke or multi-stroke. A single-stroke gesture is produced by a single pen down, pen motion, and pen up. The gesture in Figure 1-1 is an example of a single-stroke gesture. A multi-stroke gesture is multiple single strokes used together as one gesture. An example of a multi-stroke gesture is the x from Graffiti [Lee94], shown in Figure 1-2. The work described in this dissertation has been about single-stroke gestures because the recognition algorithms generally available recognize only single-stroke gestures. Gestures are useful on displays ranging from small screens, where space is at a premium, to large screens, where controls can be out of reach [PL92]. In this work we have decided to concentrate on gestures in the spirit of copy editing [Lip91, Rub91b] rather than marking menus [TK95],¹ because we believe that traditional marks are more useful in some circumstances. For example, they can specify operands at the same time as the operation, and they can be iconic (i.e., memorable because the shape of the gesture corresponds with its operation).

1. Marking menus are an enhancement of pie menus [CHWS88] in which the menu itself is not immediately drawn. Therefore, a trained user can make a selection quickly because there is no menu-painting delay.
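As a concrete illustration of the single-stroke definition above, the sketch below collects the points sampled between pen down and pen up into one stroke; the class and method names are illustrative, not taken from the dissertation or from any particular toolkit.

```python
from dataclasses import dataclass, field

@dataclass
class Stroke:
    """One single-stroke gesture: every point sampled between pen down and pen up."""
    points: list = field(default_factory=list)  # (x, y, time) samples

class StrokeCapture:
    """Accumulates pen events into Stroke objects. A multi-stroke gesture,
    such as the Graffiti x, would simply be a short list of these strokes."""

    def __init__(self):
        self.current = None
        self.strokes = []

    def pen_down(self, x, y, t):
        self.current = Stroke([(x, y, t)])

    def pen_move(self, x, y, t):
        if self.current is not None:
            self.current.points.append((x, y, t))

    def pen_up(self, x, y, t):
        if self.current is not None:
            self.current.points.append((x, y, t))
            self.strokes.append(self.current)
            self.current = None
```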

To determine how users perceived gestures, we surveyed Personal Digital Assistant (PDA) users [LLR97]. The survey showed that users value gestures, and they want more gestures in their applications. At the same time, there are problems with gestures. Specifically, users found that gestures were difficult to learn and remember and that they were often misrecognized by the computer. For any recognition system, misrecognition errors are undesirable, but for gestures they are especially problematic, because gestures perform actions. When a gesture is misrecognized, the user must notice that an error has occurred and take special steps to undo the changes. It might seem that the solution to the gesture set design problem is to invent one gesture set and use it in all applications. We believe standard gestures will be developed for common operations (e.g., cut, copy, and paste). However, pens and gestures will be used for a wide variety of application areas, and each area will have unique operations that will require specialized gestures. For example, a user will want different gestures for a text markup application than for an architectural CAD application. It is also possible that the best gestures for small devices will be different from the best gestures for large devices. My thesis is that designers of pen-based UIs would benefit from tools that support them in the tasks of inventing gesture sets and augmenting existing sets with new gestures. The goal of this work is to improve pen-based user interfaces by improving gestures through better gesture design tools. The steps to accomplish this goal are outlined below. As a first step toward understanding the gesture design process and improving it, we built a prototype tool for designing gesture sets, called gdt. gdt incorporated some functionality that we thought would help designers make gestures that were easier for the computer to recognize, but we did not have any models related to human similarity at that time. An experiment was performed to gain insight into the gesture design process and to evaluate the tool. We found that while gdt improved the process of gesture set design, the task was still difficult and there was a great deal more a tool could do to aid the designer. The experiment suggested several ways in which current tools could be improved. The most important improvement was to make the tools more active in giving feedback and guidance to designers.

The survey of PDA users revealed problems learning and remembering gestures. Therefore, another feature we wanted the gesture design tool to have was advice to the designer about how to make gestures easier for people to learn and remember. Unfortunately, a search of the psychological literature did not reveal anything directly relevant to human memorability or learnability of gestures. In other areas, similar objects are sometimes confusable with one another when people try to remember them, so we ran several studies to determine why people perceive gestures to be similar. From these studies, we derived two different computational models to predict human-perceived gesture similarity. The goal was to incorporate these models into a gesture design tool so it could predict when people would likely be confused by a set of gestures. We also ran an experiment to determine what geometric factors may influence the memorability of gestures directly. Unfortunately, we found that the names of the gestures or commands had a large impact as a confounding variable, and there were not enough participants to discover the smaller effect due to geometry. The results of the experiment with gdt and the models of human similarity were incorporated into the design of a new gesture design tool, named quill. Based on the gdt experiment, quill emphasized active feedback to designers about possible problems with their gestures, and suggestions for improvement. For example, if a designer enters gestures that are likely to be confused by the recognizer or that humans are likely to perceive as similar, quill warns the designer about this situation, and can provide suggestions to the designer about how to change the gestures to improve them. quill also warns designers about other problems, such as training gestures that are very different from others of their type, which often indicates a mistakenly entered gesture. quill is designed to be used by people with no technical background in recognition technology, so it presents advice in plain English with diagrams. After quill was prototyped and implemented, a study was run with it, similar to the experiment with gdt. This study found that the feedback quill provided did help some designers improve their gestures, but it did not help all designers. Although this experiment did not conclusively show that the feedback was helpful, we believe that the

approach of an assistive tool is valuable. More work needs to be done to improve the interface and make the advice accessible to non-technical designers. This dissertation is organized as follows. The remainder of this chapter describes related work. Chapter 2 describes the PDA user survey and discusses its results. The survey showed that users liked using gestures, but experienced problems with recalling them and with having them misrecognized on their devices. The following chapter describes the prototype gesture design tool (gdt), gives background on gesture recognition, and describes the experiment that was performed with gdt. From the development and evaluation of gdt, we concluded that designers needed active feedback to help them improve their gestures. The next two chapters describe psychological studies that were carried out to help generate such feedback. The first of these describes the psychological studies on gesture similarity, from which we derived two models of human-perceived similarity of gestures. It is followed by a chapter about a gesture memorability experiment, which showed the match between gesture shapes and names to be the single biggest memorability predictor. Chapter 6 describes a new tool called quill, and is followed by a chapter describing an experiment evaluating quill. Chapter 8 discusses future work. The final chapter gives conclusions and summarizes this dissertation.

1.1 Related Work

This section discusses prior work related to this dissertation. The section is broken down into three major parts: technology, people, and multi-dimensional scaling, which is a data analysis technique. The following subsections discuss each of these topics in turn.

1.1.1 Technology

This subsection describes related technology. It begins at the highest level, with pen-based devices, followed by pen applications, then discusses toolkits for building pen applications, and ends with a description of gesture recognizers.

1.1.1.1 Pen-based devices

Several commercial devices have been made that use pen input primarily or exclusively. This subsection describes these devices and how they used pens.

The device that really popularized pen input was the Apple Newton MessagePad [App97]. Although a keyboard was available for later models and those models supported applications such as a spreadsheet and word processor, the Newton was designed primarily for pen input. The interface was very different from a traditional GUI. It minimized the use of overlapping windows and encouraged the user to focus on one application and one document at a time. It used sound to give feedback when its interactors were used, which was very helpful since there was no cursor as with a mouse and sometimes one could mis-tap with the pen. Audio feedback was especially useful on the first model, where some operations took longer than the user might expect. Its core applications were a notepad, to-do list, calendar/scheduler, and address book. The notepad could take three kinds of notes: freeform, outline, and checklist. It allowed the user to write normal English, either longhand or printed, to input text. A soft keyboard was also available. The default mode was to recognize text as it was entered, but the user could elect to enter ink and go back and recognize it later. The notepad allowed the user to mix text and drawings, although the user had to change modes between the two. It also supported a small number of gestures for editing text and drawings (shown in Table 2-1). The Newton's handwriting recognition was widely criticized when it was first introduced, but by the last model the recognition had greatly improved. My experience with the Newton was very positive, and I would have continued using it except that it was the wrong size: too big to carry everywhere and much smaller than my 8.5x11 notebook. More recently, the Palm Pilot has become a popular pen-based platform [Com01]. Its display is even smaller than the Newton's, and it is designed very differently. Most applications have very few on-screen controls, and the pull-down menu at the top of the screen is normally hidden. The Pilot's core applications are basically the same as the Newton's (i.e., memo pad, calendar/scheduler, to-do list, address book), although the memo pad does not support drawings. (Third party drawing programs are available, however.) The Pilot does not recognize normal English, but uses the Graffiti character set [Lee94], which must be entered in a dedicated area of the screen. It uses even fewer gestures than the Newton (shown in Table 2-1). Perhaps most importantly, the Pilot is small enough to fit in a shirt or pants pocket.

One of the earliest systems targeted at pen-based computing was the PenPoint operating system [CS91]. It supported pen ink, handwriting, and gestures. It used a notebook metaphor rather than a desktop metaphor, in which an electronic notebook was the central organizing principle for all the user's data. It supported a number of built-in gestures and allowed applications to interpret pen input themselves, as well. The developers of PenPoint spent hundreds of hours on user testing to improve their gesture sets [Lan01, Lan96]. The work described here seeks to lessen the need for such expensive testing. The PadMouse is an input device developed by Balakrishnan and Patel that combines a touchpad with a mouse [BP98]. With their software, users can invoke marking menus on the touchpad with their finger. Their experiment shows that up to 32 commands can be used effectively with this technique. Enns and MacKenzie discuss possible benefits of adding a touchpad to a remote control [EM98]. We think these devices would also be suitable for traditional gestures. The following subsection describes applications that used pen input.

1.1.1.2 Pen applications

In the physical world, pens are used for a wide variety of applications. Similarly, pens have been used for many computer applications, some of which incorporated gestures. This subsection describes applications that use pen input. Goldberg and Richardson proposed a simple scheme for entering text using a pen, called unistrokes [GR93]. To represent each letter, it uses a simple shape designed to be quick to draw and easy to recognize. Briggs et al. compared pen and keyboard interfaces in a spreadsheet, word processor, drawing application, and disk manager [BDBN93]. Users in their study liked using a pen for navigation and positional control, but not for text entry. The authors cited the following advantages for pen input: 1) one-handed, 2) no moving parts, 3) light, 4) silent, 5) fine motor control, 6) direct manipulation, and 7) simple and flexible (at least for some applications). Their pen interface was mostly based on pointing; it used a tablet with an overlay of buttons and a few boxes for handwritten characters. It is interesting that even with no pen-specific interaction techniques users still preferred the pen.

Wolf et al. added pen input and gestures for common commands (e.g., insert and delete) to a drawing application, a spreadsheet program, a music editor, and an equation editor [WRE89]. They found that the pen was most useful in the spreadsheet task, in which editing with a pen was 30% faster than editing with keyboard commands. Users reported that gestures were easier to recall than keyboard commands. Burnett and Gottfried developed a language and implementation for adding graphical objects to a spreadsheet [BG98]. Users of their system can draw gestures to create new graphical objects and edit their properties. The set of available gestures varies depending on the current selection. Zhao describes a structured drawing program that uses a pen interface [ZKKM95]. The interface is interesting in that it combines gestures and menus to select the type of object to draw. The menu could be used after an object creation gesture to further specify the type of the new object, or prior to a sequence of object creation gestures to set the type of the new objects. This technique could be useful for an application where rapid access to large numbers of commands is desired, while still limiting the number of gestures that the user is required to learn. Landay developed SILK, an interface design tool based on sketching and gestures that is well suited for pen input [LM95, LM01]. Landay used iconic gestures for creating and editing the UI elements. Designers draw UI widgets, such as buttons, and gestures to invoke editing commands, and SILK recognizes them. Designers can also draw free-form, unrecognized ink to show the appearance of custom widgets. Different screen shots are put into a storyboard, and the designer draws arrows from widgets such as buttons to new screens in the storyboard to indicate screen transitions. After the interface is drawn in SILK, it can be simulated, which allows the designer to interact with the interface as the actual user would. Once the interface is sufficiently prototyped, it can be exported as source code, which can be enhanced to produce the actual user interface. Inspired by SILK, Lin and others have developed a sketch-based tool for web-site design, DENIM [LNHL00], based on observations of actual web site design practice by Newman and Landay [NL00]. With DENIM, designers sketch web pages and links between pages at different levels of detail. The sketchy site design can be exported as HTML and image

files and run using a web browser, which allows designers to get a feel for the site while it is still sketchy. DENIM supports a small number of editing and command gestures. Damm, Hansen, and Thomsen built a system, Knight, to help software developers draw and edit diagrams of software structure (UML) on an electronic whiteboard [DHT00]. The authors use gestures in their system because they resemble what software developers draw on traditional media, such as blackboards. Their system also uses marking menus for some commands. Chatty and Lecoanet discuss how pen input and gestures are useful for air-traffic control [CL96]. In an experiment with their system they found that although a few gestures had to be modified because of confusion with direct manipulation actions like moving objects, gestures were very useful. Development of this application could probably have benefited greatly from a gesture design tool. Forsberg, Dieterich, and Zeleznik developed an application well-suited to pen input: a music editor [FDZ98]. Their editor uses gestures to create new notes, rests, and other musical symbols, and to manipulate those symbols. Some users tried their system with a mouse, but found gestures to be awkward, especially circular ones. Interestingly, they had two different groups of gestures for entering notes, one which was replaced by a formatted note after it was drawn and one which was left as drawn. The authors believed this latter method would be less distracting for users. Gestures are used in an unusual manner in Baudel's system to edit free-hand drawings [Bau94]. The metaphor in this system is based on how artists clean up drawings. That is, the user sketches close to the curve to be changed and the curve moves toward the new stroke. Their gesture analysis technique is highly specialized to curve editing and is not suitable for general gesture recognition. It would be interesting to extend their drawing program to include iconic gestures. Another unusual gesture system was developed by Arvo and Novins for drawing simple shapes [AN00]. It eagerly recognizes shapes as they are drawn. Their approach works well for a small number of shapes, but it may be hard to use with a large number of gestures. Freehand drawing is useful, but structure is also helpful in many situations, such as note-taking in meetings. A group at Xerox PARC added rudimentary perceptual understanding

to a pen-based whiteboard application [M+95]. Specifically, the system can group items on the electronic whiteboard using alignment. Simple gestures can be used to edit the drawing. Lopresti and Tomkins advocated treating electronic ink as a first-class datatype [LT95]. They developed a prototype system that supported input and searching. We could apply this technique to search a library of gestures using the gesture as the key. Kato and Nakagawa showed the benefits of using lazy recognition of both ink and gestures [KN95]. For example, if one person is editing another person's document, the gestures remain visible and unexecuted so the original author can decide which changes to make. The applications described above exemplify the types of applications for which the gesture design tool and interaction techniques described in this dissertation can be used. The following subsection briefly describes toolkits that include support for building applications that use pen input.

1.1.1.3 UI toolkits

Although few research toolkits have been developed specifically for pen-based UIs, several toolkits do support pen input. Henry et al. described a highly customizable GUI toolkit that included gesture input and snap-dragging [HHN90]. As in other drawing applications that include gestures, gestures are used for object creation and editing operations. Different gestures can be recognized in different parts of the screen. In this system, gestures are first segmented and then recognized. The segmentation can be done independently of the gesture set being recognized, but the recognition step must be tailored to the gesture set. Hong and Landay created a Java toolkit for developing pen-based applications called SATIN [HL00]. It supports pen input interpreters and provides libraries for manipulating ink and widgets optimized for use with a pen. Several applications have been built using SATIN, including DENIM [LNHL00] (described above). Myers and his coworkers included gesture recognition in their Garnet [MGD+90] and Amulet [M+97] toolkits. Amulet includes a widget for capturing and recognizing gestures,

and for automatically dispatching a command based on a recognized gesture. Both toolkits use the Rubine recognizer [Rub91b] (described in 3.1.1, p. 38).

1.1.1.4 Recognizers

This subsection surveys recent gesture recognition research. This background on the details of gesture recognition, especially feature-based recognizers, is very useful in understanding the proposed gesture design tool. Rubine invented a single-stroke gesture recognizer that matches gestures based on built-in geometric features (e.g., initial direction, length, and size of bounding box) [Rub91b]. His recognizer allows new gestures to be added on-the-fly. It can also recognize a gesture as soon as it is unambiguous, even if it is not yet completed (i.e., eager recognition). This recognizer is the primary recognizer used in this dissertation research. The recognizer developed by Lipscomb is also trainable on-the-fly, but is based on different recognition technology [Lip91]. Rather than measuring geometric features, his recognizer progressively refines an input gesture by removing points where small changes in direction occur and matching them to prototypes. This method is insensitive to the scale of the gesture and where appropriate can recognize mirrored and rotated gestures. Since it is computationally inexpensive and can be trained with few gestures, it would be suitable for prototyping gestures in a system like the one proposed here. Higher-level information can be beneficial in gesture recognition, as [Zha93] showed. Zhao's system is based on two recognizers: a low-level one changes point coordinates into symbols, and a high-level one translates those symbols into application-level commands, subject to appropriate contextual constraints. Ulgen and colleagues described a more complicated recognizer [UFA95]. It combines feature extraction, fuzzy functions, and neural networks to recognize not only gestures, but also shapes. Like Lipscomb's recognizer, it is orientation and scale independent. A problem with recognition technology is that it is not always correct. Mankoff, Hudson, and Abowd developed a toolkit that allows application developers to handle ambiguous input more easily than with traditional toolkits [MHA00].
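To make the feature-based approach concrete, here is a minimal sketch of the kind of computation a Rubine-style recognizer performs: a handful of geometric features are extracted from the stroke and a linear classifier picks the class with the highest score. The particular features and names shown are an illustrative subset chosen for this example, not the exact feature set of [Rub91b].

```python
import math

def gesture_features(points):
    """Compute a few geometric features of a single-stroke gesture.

    `points` is a list of (x, y) samples along the stroke. This is an
    illustrative subset of Rubine-style features, not the full set.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    # Diagonal of the bounding box (overall size).
    bbox_diag = math.hypot(max(xs) - min(xs), max(ys) - min(ys))

    # Initial direction: cosine and sine of the angle of the first segment.
    k = min(2, len(points) - 1)
    dx0, dy0 = points[k][0] - points[0][0], points[k][1] - points[0][1]
    d0 = math.hypot(dx0, dy0) or 1.0
    cos_init, sin_init = dx0 / d0, dy0 / d0

    # Total path length of the stroke.
    length = sum(math.hypot(x2 - x1, y2 - y1)
                 for (x1, y1), (x2, y2) in zip(points, points[1:]))

    # Distance between the first and last points.
    end_dist = math.hypot(points[-1][0] - points[0][0],
                          points[-1][1] - points[0][1])

    return [cos_init, sin_init, bbox_diag, end_dist, length]

def classify(features, weights):
    """Linear classification: each class c has weights (w_c0, w_c1, ...);
    the class with the highest score wins, as in Rubine-style recognizers.
    Training (not shown) estimates the weights from labeled examples."""
    def score(c):
        w = weights[c]
        return w[0] + sum(wi * fi for wi, fi in zip(w[1:], features))
    return max(weights, key=score)
```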

More information on handwriting and shape recognition can be found in an extensive survey by Tappert, Suen, and Wakahara [TSW90]. A survey of pen and hand gesture recognition was written by Watson in 1993 [Wat93].

1.1.2 People

This subsection describes related work about people. The first subsection describes work on similarity judgements of shapes. The second subsection describes experiments that measured relevant interaction techniques or properties of user interfaces.

1.1.2.1 Perceptual similarity

Psychologists have investigated similarity of simple geometric shapes, which are less complex than our gestures. Attneave studied how changes in geometric and perceptual properties of different kinds of simple figures influenced their perceived similarity [Att50]. Participants in one experiment reported how similar they perceived parallelograms of differing sizes and tilts to be. Attneave found that similarity was correlated with the log of the area of the parallelograms and with their tilt. Also, parallelograms that were mirror images of one another were perceived as similar. Another study by Attneave indirectly measured perceived similarity by measuring how easily names of triangles and squares were remembered. The assumption was that similar shapes would be misremembered. The squares varied in reflectance and area; the triangles varied in tilt and area. The result of these experiments indicated that similarity of form caused more confusion than similarity of area. Based on these experiments, Attneave concluded the following. In general, the logarithm of quantitative metrics was found to correlate with similarity. Also, if the range of differences in stimuli is small, these differences are linearly related to perceived similarity, and when multiple dimensions of the stimuli change, the dimensions combine nearly linearly to give the change in perceived similarity. When the range of differences is large, the relationship between stimulus value and similarity is not linear, and the stimuli do not combine linearly. Gestures in interfaces typically vary greatly, which means that their features probably do not combine linearly in determining similarity. We used the best fit for each similarity data set that we collected (see Chapter 4).

Lazarte and colleagues studied how rectangle height and width affected perceived similarity [LS91]. They found that reported similarity was related to rectangle width and height, and they derived a model to fit the reported similarity data. Also, they found that not only did different people use different similarity metrics, but that the same participant may have used different metrics for different stimuli. More recently, Santini and Jain developed more sophisticated models of perceived similarity [SJ99]. They discuss the failure of metric similarity models to account for experimental evidence of how perceptual similarity functions. They propose a model that uses fuzzy sets and better matches experimental evidence.

1.1.2.2 Experiments

This subsection describes experimental results related to this dissertation. First, experiments on interaction techniques and input devices are described, followed by experiments related to human performance.

Techniques and Devices

Many experiments have been done on interaction techniques and input devices. This subsection describes experiments involving the interaction technique of marking menus and experiments comparing pens to other devices. Kurtenbach and Buxton studied two users using marking menus in a real application, to which they added a marking menu containing the six most popular commands [KB94]. They found that the marking menu was 3-4 times faster than a traditional menu. Iconic gestures are very similar to marking menus, so it is likely that a similar speedup would be achieved. MacKenzie's group has run several experiments to compare pens with other devices and pen-based interaction techniques with one another [MSB91]. They compared the pen with a mouse and trackball for pointing and dragging tasks and found that the mouse and pen were equivalent for pointing tasks and that the pen was faster for dragging. The trackball was slower than both pen and mouse for pointing and for dragging. The same group compared the performance of digit entry using handwriting, a stationary pie menu (a one-level pie menu of the digits), a moving pie menu (that popped up at the

cursor/pen location), and a soft keypad [MNR94]. They found that the soft keypad and handwriting were faster than both types of pie menus. Later they did a more refined experiment comparing handwriting and a pie pad for numeric entry, where users were allowed much more practice [MMZ95]. A pie pad is a pie menu that always appears in the same place, which allows people to select from it easily. They found that the trained users were able to use the pie pad more quickly than handwriting and that they preferred the pie pad. MacKenzie and Zhang measured how easily the Graffiti alphabet could be learned [MZ97]. They found that after only five minutes of practice, people could recall the Graffiti strokes with 97% accuracy. The same participants returned a week later and could still recall the strokes with 97% accuracy. A comparison of pen and mouse for precise and imprecise pointing and dragging found that the pen was faster than the mouse for dragging when precision is not required, but that they were comparable when precision was required [KFN95]. They also observed that it took longer for right-handed people to move the pen down-right than in other directions. These experiments show that as a pointing device the pen performs comparably to and in some cases better than a mouse. Also, they show that pen-specific interactions such as marking menus can be very efficient.

Psychology of pens

This subsection describes experiments that investigated psychological and psychophysical properties of pens. Kao et al. measured how long it took people to write letters and how much pressure they exerted while doing so [KSL83]. They found that writing time and pressure both increased as the complexity of the letter increased. Complexity is a potentially useful factor to use in a gesture recognizer, and it is also a factor in human learning and recall. Although it is unclear what the best way would be for a computer to measure complexity, pressure is easily measured given an appropriate tablet. Kao and others revealed another significant aspect of pressure [KHW86]. Their experiment compared English longhand, which is relatively curvy, and Chinese writing, which is relatively straight. They found that writing with higher pressure required more attention

from the participants, and that higher pressure was used with the Chinese (linear) characters than with English (curvy) ones. This implies that gestures composed of straight lines are likely to be learned more quickly than curvy ones. van Mier and Hulstijn measured reaction times and error rates for drawing patterns, figures, and lines of varying complexities [vMH93]. They found that short-term practice decreased reaction time and error rates for all types of drawings of all complexities, and was strongest for figures and patterns. Complex drawings resulted in more errors.

1.1.3 Multi-dimensional scaling

Multi-dimensional scaling (MDS) is a technique for reducing the number of dimensions of a data set so that patterns can be more easily seen by viewing a plot of the data, usually in two or three dimensions. It takes as input one or more sets of pairwise proximity measurements of the stimuli. It outputs coordinates and/or a plot of the stimuli in a predetermined number of dimensions (typically 2-3) such that the pairwise inter-stimulus distances in the new space correlate with the input proximities of the stimuli. There are several decisions to make in using MDS. One is how to use data from multiple participants. A simple method is to average the pairwise proximities and analyze the resulting proximity matrix as if it came from a single participant. However, there is evidence that this method does not give good results [AML94], and it also prohibits analyzing the differences among participants. Fortunately, we were able to use a version of MDS, INDSCAL [You87], that takes as input a proximity matrix for each participant and takes individual differences into account. Another decision when using MDS is how many dimensions to use in the analysis. Like other MDS methods, INDSCAL outputs how well the distances it produces correlate with the input proximities. A graph of this correlation vs. dimension ideally has an obvious knee, which indicates the number of dimensions to use. Also, a rule of thumb for standard MDS is to use no more than a quarter as many dimensions as there are stimuli [KW78].¹

1. INDSCAL uses more information than standard MDS, so it is reasonable to think that more dimensions would be valuable. Unfortunately, we were unable to find an analysis of how many.
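For illustration, the sketch below runs non-metric MDS on a small made-up dissimilarity matrix using scikit-learn. This is ordinary MDS, not the INDSCAL variant used in this work (INDSCAL additionally weights the dimensions per participant), and the matrix values are invented for the example.

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical pairwise dissimilarities among four gestures
# (symmetric, zero diagonal); real data would come from participants.
dissim = np.array([
    [0.0, 0.3, 0.8, 0.9],
    [0.3, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
])

# Non-metric (ordinal) MDS on precomputed dissimilarities, 2 output dimensions.
mds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
          random_state=0)
coords = mds.fit_transform(dissim)

print(coords)       # 2-D coordinates for plotting the stimuli
print(mds.stress_)  # lower stress indicates a better fit; plotting stress
                    # against n_components reveals the "knee" described above
```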

How to measure distance is another issue in doing MDS analysis. The most obvious, most often used metric is Euclidean distance (e.g., $d_{ij}^2 = (x_j - x_i)^2 + (y_j - y_i)^2$ in 2D). This is a special case of the Minkowski distance metric, which is:

$$d_{ij}^p = \sum_{a=1}^{r} \left| x_{ia} - x_{ja} \right|^p$$

where $d_{ij}$ is the distance, there are $r$ dimensions, and $x_{ia}$ is the coordinate of point $i$ on dimension $a$ [You87]. When $p$ is 2, this is Euclidean distance. Another common $p$ value is 1, in which case it is called city-block or Manhattan distance. Infinity is also sometimes used for $p$, which makes the distance equal to the difference along the dimension that differs most. There are sometimes psychological reasons to prefer city-block or Euclidean distance [Tho68], but generally researchers use the metric that fits their data best. In standard MDS, once the analysis is done, one may have to rotate the solution to find meaningful axes. Fortunately, INDSCAL results in the best-fit rotation, and so rotation was not a factor in this work. The final step in MDS analysis is assigning meaning to the axes. Sometimes the experimenter may know the axes in advance. In this work that was not the case, and so two methods were used to determine the axes: 1) inspecting plots of the stimuli and 2) correlation with measurable quantities. More details of MDS analysis are available in [You87].
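A minimal sketch of the Minkowski family of distances defined above; the function name and example points are only for illustration.

```python
def minkowski(x_i, x_j, p):
    """Minkowski distance between two points given as coordinate sequences.

    p = 1 gives city-block (Manhattan) distance, p = 2 gives Euclidean
    distance, and p -> infinity approaches the largest per-dimension
    difference.
    """
    if p == float("inf"):
        return max(abs(a - b) for a, b in zip(x_i, x_j))
    return sum(abs(a - b) ** p for a, b in zip(x_i, x_j)) ** (1.0 / p)

# The same pair of points under three metrics:
print(minkowski((0, 0), (3, 4), 1))             # 7.0 (city-block)
print(minkowski((0, 0), (3, 4), 2))             # 5.0 (Euclidean)
print(minkowski((0, 0), (3, 4), float("inf")))  # 4   (max difference)
```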

Chapter 2
PDA and Gesture Usage Survey

At the beginning of my work on gestures, I wanted to know what problems users were having with gestures and what benefits they saw from gestures. To investigate these issues, I surveyed Palm Pilot and Apple Newton users in the summer of 1997. I expected to find that gestures were infrequently used because people had difficulty learning or remembering them, or because they were misrecognized by the Personal Digital Assistant (PDA). The survey showed that PDA users find gestures valuable and would like to use them for more operations and applications. At the same time, the survey showed that gesture recognition and memorability can be improved. This chapter describes how the survey was performed, the questions that were asked, and the results of the survey. It then analyzes the results and discusses their implications for gesture design tools.

2.1 Method

Survey participants were solicited from several Usenet newsgroups related to general pen-based user interfaces and to the Newton and Pilot PDAs in particular. We posted the call for participation to the following Usenet newsgroups:

alt.comp.sys.palmtops.pilot
comp.sys.pen
comp.sys.newton.misc
comp.sys.palmtop

The newsgroup message contained a very general description of the research and a URL for the questionnaire itself, which was a World Wide Web page. The complete questionnaire and results are given in Appendix A. Although respondents were asked to submit the form only once, we could not determine a simple method to enforce this constraint. Instead, after the data was collected it was sorted by respondent IP address and examined by hand for multiple submissions. Three respondents submitted the form twice and two submitted the form three times. Multiple submissions were deleted, so the analyzed data contains one entry per respondent.
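A sketch of the duplicate screening described above. The file name and column names are assumptions for the example; the actual survey data were stored by FrontPage and processed with Perl and Excel.

```python
import csv
from collections import defaultdict

# Group submissions by respondent IP address (hypothetical CSV layout).
by_ip = defaultdict(list)
with open("survey_responses.csv", newline="") as f:
    for row in csv.DictReader(f):
        by_ip[row["ip_address"]].append(row)

# Flag addresses with multiple submissions for hand review and keep
# only the first submission from each respondent.
screened = []
for ip, rows in by_ip.items():
    if len(rows) > 1:
        print(f"{ip}: {len(rows)} submissions - review by hand")
    screened.append(rows[0])
```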

33 2.1.1 Questionnaire overview

The questionnaire asked about the following topics:
- Frequency of use of
  - PDA
  - various gestures
  - common PDA applications
- Opinions of
  - handwriting accuracy
  - attributes of gestures
- General PDA usage
  - length of time using current PDA
  - type of PDA used
  - number of PDAs used
  - length of time current PDA used
- Paper vs. PDA usage
- PDA usage in meetings or discussions1
- User demographics
  - age
  - gender
  - occupation
  - technical sophistication

In addition, fields were provided for respondent name and contact information for possible follow-up. This information was optional but we encouraged respondents to provide it by offering a free Berkeley T-shirt to a randomly chosen respondent who provided contact information. The majority of the questions were multiple choice, but free response questions were also included for general comments about gestures and the survey itself. Respondents were required to answer all multiple choice and demographic questions before the questionnaire could be submitted.

1. This data was collected for another research project, NotePals [LD99].

34 2.1.2 Questionnaire implementation The questionnaire was created using emacs and Microsoft FrontPage. It was put on a FrontPage-enabled web server and FrontPage mechanisms were used to save the entered data as HTML. A Perl script was used to extract the data from the HTML file and format it suitably for import into MS Excel. Excel was used to analyze the data. 2.1.3 Questionnaire details Answers to most frequency questions (e.g., How often do you use the delete gesture?) were multiple choice: never, rarely, often, and very often. We decided to use a four choice scale rather than a five choice scale because we wanted to force respondents to state a preference rather than pick the middle choice. In retrospect, this may have been a mistake given that most psychologists use the 5-point Likert scale [Lik32] for this type of question. The only question for which these four choices were not used asked about the frequency of PDA use in general. To get a somewhat quantitative measure, the following answers were provided: less than once per day, once per day, 2-5 times per day, more than 5 times per day. For questions involving a value judgement (e.g., How would you rate the accuracy of your PDA's gesture recognition?), the multiple choice answers provided were terrible, bad, good, and excellent. One section of the survey asked respondents if they used built in handwriting recognition and how they would rate its accuracy. Also in this section, we asked if respondents used Graffiti,1 and if so how they would rate its accuracy. A significant part of the survey concerned thirteen operations that a gesture might invoke, shown in Table 2-1. Some operations in Table 2-1 exist on a particular PDA or in specific applications. Of those operations that do exist, some can be invoked using a gesture but some cannot. We 1. Graffiti is an alphabet, each of whose characters is a single stroke [Lee94], that has been popularized by the Palm Pilot. 19

35 Operations (the Newton gesture and Pilot gesture columns are drawings in the original table):
Delete
Select
Insert line/paragraph
Insert letters/words
Move cursor
Next field
Previous field
Open record
Undo
Close
Scroll up
Scroll down
Transpose

Table 2-1 Operations asked about in survey and corresponding gestures on the Apple Newton and Palm Pilot, where gestures exist. Blank spaces indicate that no gesture exists.

36 included operations that have no gesture to measure how much respondents know about gestures. For each operation, respondents were asked how often they used a gesture for that operation and why they did not use it more often. The questionnaire did not give any indication as to whether there was a gesture for each operation or if the operation was even possible on the PDA. We carefully considered reasons why PDA users might not use gestures and asked about the following reasons on the questionnaire:
- Operation is not available to my knowledge.
- Gesture is not available to my knowledge.
- Cannot remember the gesture.
- Poor computer recognition of the gesture.
- Not applicable.

Another section asked respondents to what extent they agreed or disagreed with eight positive statements about gestures, given in Figure 2-1. For each statement, respondents were asked to indicate agreement or disagreement on a scale of 1 to 4, where 1 meant agree strongly and 4 meant disagree strongly.

Gestures are powerful.
Gestures are easy to learn.
Gestures are efficient.
Gestures are easy to use.
The computer always recognizes the gestures I make.
A gesture is available for every operation for which I want a gesture.
Gestures are convenient.
Gestures are easy to remember.

Figure 2-1 Agree/disagree statements about gestures.

The questionnaire also asked how often respondents used six common applications: calendar, address book, to-do list, electronic mail, drawing, and note taking. Respondents were asked to rank these applications according to how often they were used. Instead of solely using a numeric rank, respondents could also select a "not available/never used"

37 answer. Respondents were asked to give each application a unique numeric rank. Unfortunately, the form did not enforce this restriction and some respondents chose the same rank for multiple applications.

To find out about tasks that PDAs might support better, we asked respondents to type in a common task they performed on paper but did not perform on their PDA. In addition, we asked how often they performed this task, and why they did not use the PDA for it.

Several questions asked how PDAs are used in meetings or discussions.1 Specifically, we asked respondents:
- How often they used their PDA in meetings or discussions
- How often respondents were in meetings where others used PDAs
- What kinds of notes respondents entered on their PDAs during meetings
- What kinds of notes respondents take during meetings, both on PDAs and on paper, that they share with others after the meeting

For the third question, we provided a list of seven note types from which respondents could select, including an "other" type for which respondents could specify their own. The answer for the fourth question was free form.

Finally, demographic information about respondents was gathered. These questions asked users to specify the following:
- Age
- Gender
- Level of education
- Technical sophistication
- Occupation

Respondents specified age in years in a free-response box. Five responses were provided for education: high school, some college, college degree, masters/professional degree, and PhD/MD. For technical sophistication, four numbered choices were provided, with one end labeled "not at all" and the other "extremely."

1. These questions were for a related research project on shared note-taking [LD99].

38 2.2 Results We collected data with this survey from June 10, 1997, through July 14, 1997. Our questionnaire included questions on several different topics. The following subsections present the results about the following topics: demographics, PDA usage, gesture usage, opinions about gestures, handwriting, application usage, paper vs. PDAs, and PDA meeting usage. 2.2.1 Demographics One hundred forty-two users responded to the survey. Of these, 42 (30%) used Newtons at the time of the survey, 99 (70%) used Pilots, and one (< 1%) used another PDA. For many questions, responses differed substantially depending on the type of PDA used, therefore Newton users and Pilot users will be analyzed separately and we ignore the other PDA response. The most common profession was computer programmer/software engineer (38% of Newton users, 27% of Pilot users). The next most common was sales/marketing (10%) for Newton users and manager/executive (20%) for Pilot users. Significantly, one third of Pilot users and half of Newton users had a technical job dealing with computers. Profession counts and percentages are shown in Table 2-2. Respondents as a whole were technically sophisticated, as shown in Figure 2-2. On a technical sophistication scale of 1 to 4, only 4% of Pilot users ranked themselves in the lower (less sophisticated) half. Newton users were even more sophisticated, with all but two of the Newton users (5%) giving themselves the most sophisticated rating. The most common education level for both groups was a bachelors degree (43% of Newton users, 44% of Pilot), although masters and professional degrees were also common (26% of Newton users, 30% of Pilot). Education levels are shown in Figure 2-3 In terms of gender, the Newton respondents were 7% female and Pilot users were 9% female. This distribution is substantially more skewed than the Internet at large in 1997, whose users were 31.30% female, according to GVUs WWW User Survey [GVU97]. 23

39 
Profession                      Newton  Pilot  Newton %  Pilot %
Programmer/software Engineer    16      27     38.1      27.3
Student                         3       3      7.1       3.0
Computer Scientist              1       3      2.4       3.0
System Administrator            1       4      2.4       4.0
Network Engineer                3       0      7.1       0.0
Education                       2       3      4.8       3.0
Health care                     3       2      7.1       2.0
Sales/marketing                 4       4      9.5       4.0
Manager/executive               3       20     7.1       20.2
Math/science/engineering        3       4      7.1       4.0
Webmaster                       0       2      0.0       2.0
Analyst                         0       3      0.0       3.0
Other                           2       16     4.8       16.2
Total                           41      91     97.6      91.9

Table 2-2 PDA user professions.

Figure 2-2 PDA user technical sophistication. Fractions are separated by type of PDA. (Bar chart: percentage of Newton and Pilot users at each technical sophistication level, 1 to 4.)

40 Figure 2-3 PDA user education level. Fractions are separated by type of PDA used. (Bar chart: fraction of Newton and Pilot users at each education level, from high school through MD/PhD.)

2.2.2 General PDA usage

Respondents reported using their PDAs very frequently, as shown in Table 2-3. Most of the respondents to our survey had been using their current PDA for less than one year. The duration of PDA usage is shown in Figure 2-4. The average time was 7.5 months for Newton users and 5.4 for Pilot users. It is interesting that so many Newton users started using their PDA recently, possibly due to the introduction of the (then) newest model, the MessagePad 2000.

Times/day                 Newton  Pilot
More than 5 per day       71%     74%

Table 2-3 PDA usage frequency, as a percent of users of each type of PDA.

2.2.3 Gesture usage

A summary of gesture usage is shown in Table 2-4. Reported gesture usage frequencies were distributed very differently between Pilot and Newton users. Many Newton users reported using most gestures very often. The distributions for insert line and insert space

41 were close to normal, but for other gestures the number of users listing other frequencies of use were much lower than those responding very often. A representative gesture usage distribution for Newton users is given in Figure 2-5.

Figure 2-4 Time using current PDA. (Bar chart: fraction of Newton and Pilot users by months of current PDA use, in 3-month bins from 0-3 to more than 24.)

                       Newton              Pilot
Gesture                Average  Std. dev.  Average  Std. dev.
Delete                 3.8      0.6        3.3      0.9
Select                 3.5      1.0        3.0      1.2
Move cursor            3.5      0.9        2.8      1.1
Insert letters/words   3.3      0.8        2.6      1.3
Insert line            2.5      1.0        2.8      1.2
Next field             1.8      1.2        2.3      1.1
Previous field         1.6      1.0        2.0      1.0
Open record            1.8      1.2        1.9      1.2

Table 2-4 Gesture usage frequency. 1 = Never, 2 = Rarely, 3 = Often, 4 = Very often.

Unlike those given by Newton users, the gesture use frequencies reported by Pilot users were different for each gesture. Another difference between Newton and Pilot users is that Pilot users did not use gestures as often as Newton users. Newton users reported fewer problems with gestures than Pilot users. Several Newton users who never or rarely used the gesture for insert line indicated a problem with

42 bad recognition by the computer or inability to remember the gesture. Similarly, the infrequent users of insert letters/words cited poor recognition. Difficulty remembering gestures was the most common reason given by Pilot users for infrequent gesture use. Poor recognition of gestures was also frequently reported.

Figure 2-5 Representative delete gesture usage frequency for Newton users. (Bar chart, percent of Newton users: Never 2%, Rarely 10%, Often 17%, Very often 71%.)

A surprising result was the relationship between gesture existence and frequency of use. One would expect that users would answer that they never used gestures that do not actually exist. Most Newton users did answer never for gestures that did not exist. For Newton users, frequency of use and gesture existence were highly correlated (0.94). What is surprising is that this was not the case with Pilot users. For Pilot users, frequency of use and gesture existence were completely uncorrelated (0.02).

2.2.4 Opinions about gestures

As Table 2-5 shows, respondents had generally positive feelings about gestures. Newton users agreed with all but two of the eight positive statements made about gestures and Pilot users agreed with all but three of the eight. The statement with which users disagreed most was that a gesture exists for every command for which they would like one. That is, many users would like gestures for a wider range of operations.

43 
                                          Newton              Pilot
Statement                                 Average  Std. dev.  Average  Std. dev.
Gestures are powerful                     1.5      0.6        1.7      0.8
Gestures are efficient                    1.7      0.7        1.7      0.8
Gestures are easy to use                  1.8      0.7        2.1      0.8
Gestures are convenient                   1.8      0.8        1.9      0.8
Gestures are easy to learn                1.9      0.7        2.3      0.9
Gestures are easy to remember             2.0      0.7        2.6*     0.9
Gestures are always recognized            2.5      0.9        2.7*     0.8
A gesture is available for every
operation for which I want a gesture      3.0*     0.9        3.1*     0.7

Table 2-5 Opinions about gestures. 1 = Strongly agree, 4 = Strongly disagree. Averages indicating disagreement are marked with an asterisk (bold in the original).

Pilot users slightly disagreed that gestures are easy for them to remember and always recognized by the computer. Newton users neither agreed nor disagreed with this statement. Overall, Newton users were slightly more positive about gestures than Pilot users. For all agree/disagree questions, Newton users agreed as much or more than Pilot users. The responses for both groups of users for all opinion questions were close to normal distributions.

2.2.5 Handwriting

The majority of Newton and Pilot users rated the handwriting recognition on their PDA positively. The average for both sets of users was between good and excellent. On a scale of 1 to 4, the average ratings were 3.4 and 3.1 for Newton and Pilot users, respectively. Only 7 percent of Newton users and 11 percent of Pilot users rated handwriting recognition negatively. Graffiti was used by two Newton users and all Pilot users. Interestingly, 13% of Pilot users did not rate their PDA's handwriting recognition and Graffiti identically, even though Graffiti was at that time the only handwriting recognition available. Consequently, Graffiti was rated by Pilot users as slightly more accurate than handwriting recognition, at 3.4.

44 2.2.6 Application usage

One part of the survey asked how often a set of common PDA applications is used. As seen in Figure 2-6, the most popular Newton applications are note taking, calendar, to-do list, and address book, which are ranked approximately the same. Pilot users ranked calendar, address book, and to-do list as the most often used. Pilot users did note taking substantially less often than other applications and less often than Newton users did. Users of both PDAs ranked drawing and email as the least often used applications. The application rankings were normally distributed, except for note taking by Newton users, which had spikes at first place (i.e., most often used) and fourth place and very low frequencies elsewhere.

Figure 2-6 Application use: higher numbers indicate more frequent use. Thin lines show standard deviation. (Bar chart comparing Newton and Pilot users for calendar, address book, to-do list, email, drawing, and note taking.)

2.2.7 Paper vs. PDAs

Respondents were asked about tasks for which they used paper but did not use their PDA. For users of both PDA types, the single most common response to this question was note taking, as seen in Figure 2-7. Some respondents were specific about the type of note taking they did and some were not. The specific types ranged from short notes of the type

45 typically put on post-it notes to longer notes of the type taken in meetings, lectures, presentations, etc. We put all of these in one category: note taking. For Newton users, drawing was the other task reported as frequently done on paper but not a PDA. For Pilot users, the tasks next most often named were taking telephone messages and drawing. This question used a free-form response which respondents were not required to answer, but most did (60% of Newton users, 74% of Pilot users).

Figure 2-7 Tasks done on paper instead of on PDA. Some respondents listed more than one task, others listed none. (Bar chart, percent of Newton and Pilot users, for the tasks note taking, drawing, mark up, to-do list, email or letters, telephone messages, and mathematics.)

The questionnaire also asked why the task was not done on a PDA. The single most common reason given by Newton users was that the screen is too small (19% of Newton users listed this reason). The other two common reasons listed by Newton users were slow or inaccurate recognition (12%) and inadequate connectivity or compatibility with other computers and applications (10%). Pilot users gave a wider variety of reasons. The two most popular were that it is faster to use paper (18%) and the small PDA screen (13%). Most respondents were not specific about what they meant by "faster to use paper." However, some specific reasons given by a few are: they do not write quickly with Graffiti, and paper is faster due to the time required to find the Pilot, turn it on, and select the appropriate application.

46 The two next most common reasons given by Pilot users were that it has poor support for drawing and it is easier to use physical paper or notes, such as post-it notes (10% for both). Some Pilot users elaborated on this, saying they prefer physical paper since it is easier to leave a note with a person or in a particular place.

2.2.8 PDA meeting usage

Both Newtons and Pilots are used often in meetings. On a scale of 1 (Never) to 4 (Very often), the averages for Newton and Pilot users were 3.0 and 3.1, respectively, and the standard deviation was 0.7 for both. Raw frequency use data are shown in Table 2-6. Both types of users reported they were less frequently in meetings where others used PDAs. The average frequencies were 2.0 and 2.3 and the standard deviations were 0.8 and 0.7, respectively.

Frequency        Newton %  Pilot %
1 (Never)        0         1
2                19        22
3                50        51
4 (Very often)   31        26

Table 2-6 Frequency of use of PDAs during meetings. 1 = never, 4 = very often.

The types of notes taken by users in meetings are shown in Figure 2-8. The total usage percentage is greater than 100 since respondents could indicate more than one type of note. As seen in the figure, there is a group of four note types that are used substantially more than other note types. It is interesting that there is little difference between Newton and Pilot users for all note types.

2.3 Discussion

Three conclusions can be drawn from the results presented in the previous section:
- Gestures are valuable in current interfaces.
- PDAs do not have enough gestures.
- People use Newtons and Pilots differently.

47 The following subsubsections discuss the benefits and shortcomings of gestures, why more gestures are needed, how the two PDAs are used differently, and the limitations of this survey.

Figure 2-8 Types of notes taken in meetings. (Bar chart, percent of Newton and Pilot users, for the note types to-do/reminders, contact info, ideas to review, events to review, ideas to share, events to share, and other.)

2.3.1 Benefits of gestures

Users of Pilots and Newtons alike were very positive about gestures. Of the eight opinion questions asked, respondents were most critical of gestures because of the small number available. Both groups agreed that gestures are powerful, efficient, easy to use, and convenient. This positive view of gestures was very surprising to us, since we thought users had more problems with gestures than they reported in this survey. When one considers how the survey data was gathered and the resulting high technical sophistication of the respondents, the result is less surprising.

2.3.2 Shortcomings of gestures

In spite of the technical sophistication of the respondents, there were two areas in which they were neutral or negative about gestures, and one area in which they were negative

48 about PDA interfaces (see Table 2-5). This subsubsection will discuss the negative opinions about gestures and the next will discuss the PDA interface. 2.3.2.1 Gesture recognition Both Newton and Pilot users believe that gestures are not always recognized correctly. When PDAs were first popularized, they were criticized, fairly or unfairly, for their poor handwriting recognition. Clearly, it is important for handwriting to be recognized accurately, but it is even more important for gestures to be correctly recognized, because gestures invoke operations. Whereas misrecognition of a character is easily detected by the user, misrecognition of an operator may not be. That is, if a gesture is misrecognized it will cause an unintended operation to be performed, and users may have difficulty determining what happened. Furthermore, an unintended operation is likely to be more difficult to correct than an incorrectly recognized character. As one Pilot user commented, cut/copy gestures are risky. 2.3.2.2 Gesture memorability Users were also dissatisfied with gesture memorability. Newton users agreed that gestures are easy to remember, but Pilot users disagreed. A few users specifically commented that memorability was a problem. A Newton user wrote, Need a pop-up list of available gestures. Another commented, PDA needs to have [a] small reference sticker about gestures. Before conducting the survey, we hypothesized that PDA users might have difficulty with gestures because they are difficult to remember. Unlike many interaction techniques, gestures use recall rather than recognition, which implies that pen-based UI designers must make gestures easy to remember. This goal can be achieved by, for example, designing gestures that are easier to remember and using interaction techniques that help users remember gestures. 2.3.3 The need for more gestures Even more than the two areas discussed in the previous section, users were dissatisfied with the number of gestures available. One Newton user wrote, Need to be able to define new gestures, and another wrote, Wish there was a way to add gestures or have a few undefined gestures I could map to specific text-editing tasks. 33

49 Gestures could be very useful on a PDA, where screen space is at a premium and the primary (and often only) input device is a pen. Surprisingly, therefore, the two most popular PDAs support very few gestures. There are several possible reasons for this paucity of gestures. It is possible that designers believed it would be too difficult for novices to learn gestures, so designers want to minimize the number of gestures. However, in spite of difficulties novices may have learning gestures, the additional method of invoking operations would still be advantageous for expert users. Another potential reason for the lack of gestures is that it is difficult for the PDA to recognize gestures from a large set. Although this may have been the case for early PDAs, it is no longer an obstacle considering the processing power of modern PDAs. Finally, it is possible that it is difficult to design good gestures, so designers have only chosen simple, obvious ones. Although gesture input is not a new idea, interface designers do not have the same experience with designing interactions as with traditional graphical user interface components. Furthermore, designing good gestures requires expertise beyond that needed for traditional GUIs, namely knowledge of gesture recognition and of human ability to learn and remember the gestures. The tools for UI design do not provide assistance in gesture design. In short, the novelty of gestures for many designers could explain, at least in part, why current pen-based UIs have so few gestures. 2.3.4 PDA usage models Surprisingly, the results on application usage only show a slight difference in usage between Newtons and Pilots. There are several reasons people might use the two devices differently. The Newton is better suited to be a notebook than the Pilot for several reasons, primarily because it has a significantly larger screen. Users might also prefer the Newton for note taking because it recognizes normal English printing and script, whereas the Pilot only recognizes Graffiti. Also, text may be entered and stored on the Newton as unrecognized ink, which the user may or may not decide to have recognized later. In addition, the Newtons built-in software allows the user to draw and include text with the drawings. Conversely, the 34

50 Pilots smaller size makes it more convenient to carry everywhere, which is desirable for a datebook or address book. Although Newtons are used as notebooks more often than Pilots are, when users do take notes, they take the same kinds of notes on both PDAs (at least in meetings), as shown in Figure 2-8. It is interesting that the two kinds of shared notes (i.e., events to share and ideas to share) are the two least used note types. This lack of shared note-taking indicates a need for more and better collaborative software [LD99]. 2.3.5 Survey limitations An oddity in the Pilot gesture usage is the low correlation between usage and existence. As mentioned in the results section, a surprisingly large number of Pilot users reported using gestures that do not exist on the Pilot. Although we attempted to make it clear what we meant by gesture, it is possible that Pilot users misunderstood, perhaps because Graffiti is composed of single strokes that are similar to gestures. The main limitation of this survey is that the results only have qualitative value because the respondents of our survey are self-selected and thus not representative. We could not locate a representative sample of PDA users and ask that they all complete our survey; we posted a request for participation on several Usenet newsgroups. As mentioned earlier, readers of these newsgroups are likely to be technically sophisticated and highly motivated about the technology. Since PDAs are still relatively new, many if not most owners at the time of the survey were early adopters. Due to the nature of the respondents, we believe they are more enthusiastic about the technology and more sympathetic to its shortcomings than most users. A broader survey might paint a less rosy picture of PDAs and gestures. Another limitation of conducting this survey over the web is that no data verification could be done. Even had it been done in person, some of the demographic data may not have been conclusively verified, but with a web-based survey, any respondent could claim to be any age, gender, or have any profession.1 We have no reason to believe our 1. As the famous cartoon put it, On the Internet, nobody knows youre a dog. 35

51 respondents were dishonest, but lack of verification is a potential problem with the results presented here. 2.4 Summary This chapter presented the results of a survey of Pilot and Newton users. Three important findings are: Users appreciate the benefits that currently available gestures afford. Users want applications to support more gestures. Gestures should be more recognizable and easier to remember. 36

52 Chapter 3 Development and Evaluation of a Simple Gesture Design Tool The survey of PDA users described in Chapter 2 indicated that users want their PDAs to support more gestures. One approach to providing more gestures is to define a canonical set of gestures that are known to be easy for people to learn and remember and that are recognized well by the computer. However, there are many different application areas for pen-based computing, from text annotation to graphic layout to architectural CAD. Each area has its own domain-specific operations that one might want to invoke using a gesture. Due to the wide variety of different operations, and the fact that new application domains may arise, it is not possible to develop a single, canonical gesture set.1 Instead, our strategy was to develop a tool to aid designers in creating good gesture sets, where good means easy for humans to learn and remember and easy for the computer to recognize. This chapter describes the design and evaluation of a simple gesture design tool. The first section gives background about gesture recognition. Next, the gesture design tool is described. The next section describes an experiment designed to explore the gesture design process and to evaluate the tool. Experimental results and analysis are then given, followed by a discussion of the experiment. The chapter ends with some conclusions and lessons learned. 3.1 Gesture Recognition Algorithm There are different types of gesture recognition algorithms. Two common ones are neural network- and feature-based. Neural network recognizers generally have a higher recognition rate, but they require hundreds if not thousands of training examples [Pit91]. This requirement virtually prohibits iterative design, so instead we chose to use the Rubine feature-based recognition algorithm [Rub91a, Rub91b]. This algorithm: 1. We do believe that a standard set of gestures will be used for those operations that are common to many applications, such as cut, copy, and paste, but there will still be a need for many domain- specific gestures. 37

53 1. requires only a small number of training examples (15-20) per gesture class, 2. is freely available, 3. is easy to implement, 4. has a reference implementation available (from Amulet [M+97]), and 5. has been successfully used by other research systems [CL96, LM95, LM01]. A gesture set is composed of gesture classes, each of which is a single type of gesture (such as a circle to perform the select operation). A class is defined by a collection of training examples. The goal of a recognizer is to correctly decide to which class in the set a new, unknown gesture belongs. A feature-based algorithm does this classification by measuring properties, called features, such as initial angle and total length, and computing statistically which class it is most like. Rubines algorithm computes eleven different features about each gesture. Unfortunately, his algorithm only works with single stroke gestures. I decided that the challenge of single-stroke gesture design was sufficiently difficult that this limitation would not constrain the research, and that did turn out to be the case. Before the recognizer can classify gestures, the designer must first train it by providing example gestures for each class. During recognition, the feature values of the unclassified gesture are statistically compared with the feature values of the training examples to determine which class the new gesture is most like. The following subsection describes Rubines algorithm in more detail. The next subsection describes a preexisting gesture recognizer training program, called Agate. 3.1.1 Rubines gesture recognition algorithm For reasons cited above, Rubines algorithm was nearly ideal for use in this research. This subsection gives details about how the algorithm works. For more details, see [Rub91b]. Gesture recognition, or classification, is the process of deciding into which predefined class of gestures a new gesture belongs. Rubines algorithm takes a statistical approach to recognition, based on a set of measurable geometric features about the gesture. First, the recognizer is trained on a set of example gestures whose classes are known. Then, when a new gesture is presented, its features are measured and compared to the feature values of the known gesture classes. 38
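The structure just described maps onto a small data model. The following Java sketch is one way to represent it; the type and field names are illustrative assumptions, not gdt's or Agate's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of the data model described above: a gesture is a single
// stroke of time-stamped points, a gesture class is a named collection of
// training examples, and a gesture set is a collection of classes.
final class GesturePoint {
    final double x, y;   // pen position
    final long t;        // timestamp in milliseconds
    GesturePoint(double x, double y, long t) { this.x = x; this.y = y; this.t = t; }
}

final class Gesture {
    // One single-stroke gesture, as sampled from the pen.
    final List<GesturePoint> points = new ArrayList<>();
}

final class GestureClass {
    final String name;                                        // e.g. "delete"
    final List<Gesture> trainingExamples = new ArrayList<>(); // typically 15-20
    GestureClass(String name) { this.name = name; }
}

final class GestureSet {
    final List<GestureClass> classes = new ArrayList<>();
}
```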

54 1. Cosine of the initial angle
2. Sine of the initial angle
3. Length of the bounding box diagonal
4. Angle of the bounding box diagonal
5. Distance between the first and last points
6. Cosine of the angle between the first and last points
7. Sine of the angle between the first and last points
8. Length of the gesture
9. Total angle traversed
10. Sum of the absolute value of the angle at each point
11. Sum of the squared value of the angle at each point (sharpness)
12. Maximum speed (squared) of the gesture

Figure 3-1 List of features from Rubine's recognizer.

The remainder of this subsection describes specific aspects of Rubine's recognizer. First, the fundamental features used in the recognizer are described. Next, the recognition algorithm is given. (Although in practice the recognizer must be trained before it can recognize gestures, the recognition process will be described first because it is simpler and will make the training process clearer.) Finally, the training process is described.

3.1.1.1 Rubine's features

The features Rubine chose are summarized in Figure 3-1 and described below (based on the gesture shown in Figure 3-2). $P$ is the number of points in the gesture, which are numbered from 0 to $P-1$. Each point $p_i$ has spatial coordinates $x_i$ and $y_i$ and a time coordinate $t_i$.

- the cosine and sine of the initial angle of the gesture1

$$f_1 = \cos\alpha = \frac{x_2 - x_0}{\sqrt{(x_2 - x_0)^2 + (y_2 - y_0)^2}}$$

$$f_2 = \sin\alpha = \frac{y_2 - y_0}{\sqrt{(x_2 - x_0)^2 + (y_2 - y_0)^2}}$$

1. The angle itself cannot be used because it is discontinuous at 0° (a.k.a. 360°).

55 Figure 3-2 Example gesture with feature annotations (from [Rub91b]).

- the length and angle of the bounding box diagonal

$$f_3 = \sqrt{(x_{\max} - x_{\min})^2 + (y_{\max} - y_{\min})^2}$$

$$f_4 = \arctan\frac{y_{\max} - y_{\min}}{x_{\max} - x_{\min}}$$

- the distance between the first and last points

$$f_5 = \sqrt{(x_{P-1} - x_0)^2 + (y_{P-1} - y_0)^2}$$

- the cosine and sine of the angle between the first and last points

$$f_6 = \cos\beta = (x_{P-1} - x_0) / f_5$$

$$f_7 = \sin\beta = (y_{P-1} - y_0) / f_5$$

- the length of the gesture

Let $\Delta x_p = x_{p+1} - x_p$ and $\Delta y_p = y_{p+1} - y_p$.

$$f_8 = \sum_{p=0}^{P-2} \sqrt{\Delta x_p^2 + \Delta y_p^2}$$

56 - the total angle traversed

Let $\theta_p = \arctan\dfrac{\Delta x_p \Delta y_{p-1} - \Delta x_{p-1} \Delta y_p}{\Delta x_p \Delta x_{p-1} + \Delta y_p \Delta y_{p-1}}$.

$$f_9 = \sum_{p=1}^{P-2} \theta_p$$

- the sum of the absolute value of the angle at each point

$$f_{10} = \sum_{p=1}^{P-2} |\theta_p|$$

- the sum of the squared value of the angle at each point

$$f_{11} = \sum_{p=1}^{P-2} \theta_p^2$$

- the maximum speed (squared) of the gesture

Let $\Delta t_p = t_{p+1} - t_p$.

$$f_{12} = \max_{p=0}^{P-2} \frac{\Delta x_p^2 + \Delta y_p^2}{\Delta t_p^2}$$

- the duration of the gesture

$$f_{13} = t_{P-1} - t_0$$

Rubine chose these features because they relate to observable geometric properties, which is helpful for people understanding the recognition, and they are continuous, which is assumed by feature-based recognition. In practice, discrete features work with feature-based recognition, including Rubine's recognizer, if they are normally distributed. In his work he did not actually use the time-based features (e.g., $f_{12}$ and $f_{13}$) because recognition worked well enough without them.

3.1.1.2 Recognition

This section will describe how gesture recognition works in two ways. It begins with a geometric explanation and follows with a formal explanation.
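Before turning to those explanations, the sketch below shows how a subset of the features defined above ($f_1$, $f_2$, $f_5$ through $f_{11}$) might be computed from a single stroke. It is a rough Java sketch with illustrative names: it omits guards for degenerate strokes, and it uses atan2 so that the sign and quadrant of each angle are handled, which is one common way to realize the arctangents written above.

```java
// A rough sketch of computing a subset of Rubine-style features from one
// stroke. x[] and y[] hold the P sampled points of a single-stroke gesture
// (P >= 3). No guards are included for degenerate strokes (e.g. repeated points).
final class FeatureSketch {

    // Returns {f1, f2, f5, f6, f7, f8, f9, f10, f11}.
    static double[] compute(double[] x, double[] y) {
        int P = x.length;

        // f1, f2: cosine and sine of the initial angle (points 0 and 2).
        double d02 = Math.hypot(x[2] - x[0], y[2] - y[0]);
        double f1 = (x[2] - x[0]) / d02;
        double f2 = (y[2] - y[0]) / d02;

        // f5, f6, f7: distance and angle between the first and last points.
        double f5 = Math.hypot(x[P - 1] - x[0], y[P - 1] - y[0]);
        double f6 = (x[P - 1] - x[0]) / f5;
        double f7 = (y[P - 1] - y[0]) / f5;

        // f8: total path length; f9, f10, f11: signed, absolute, and squared turning.
        double f8 = Math.hypot(x[1] - x[0], y[1] - y[0]);   // segment 0
        double f9 = 0, f10 = 0, f11 = 0;
        for (int p = 1; p <= P - 2; p++) {
            double dxPrev = x[p] - x[p - 1], dyPrev = y[p] - y[p - 1];
            double dx = x[p + 1] - x[p],     dy = y[p + 1] - y[p];
            double theta = Math.atan2(dx * dyPrev - dxPrev * dy,
                                      dx * dxPrev + dy * dyPrev);
            f9  += theta;
            f10 += Math.abs(theta);
            f11 += theta * theta;
            f8  += Math.hypot(dx, dy);                      // segments 1 .. P-2
        }
        return new double[] { f1, f2, f5, f6, f7, f8, f9, f10, f11 };
    }
}
```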

57 Geometric explanation

Geometrically, the thirteen features described above can be thought of as axes in a 13-dimensional space. For this description, we will assume that the space is Euclidean, although it is not. The formal description below describes the space correctly. A particular feature vector corresponds to a point in this space. By computing a feature vector for a gesture, we can map a gesture into this space.

A gesture class can be described by a feature vector that is the average of the feature vectors of all its training examples, plus a description of their standard deviation. In the feature space, we can think of the gesture class as a spherical cloud centered on the mean feature vector for the class. The cloud is dense near the center and becomes thinner as one moves away from the center. A class with a small standard deviation is point-like because its cloud thins quickly, whereas a class with a large standard deviation is spread out over more space.

To perform recognition of an unknown gesture, the feature vector is computed to determine where the gesture is in the space. The gesture is recognized as the gesture category with the most dense cloud at that point. Usually gesture classes have similar standard deviations, so this result is identical to the gesture class closest in the space to the unknown gesture. If the clouds of two or more gesture classes overlap, it means the gesture classes are similar in terms of one or more features, and they will be more difficult for the recognizer to differentiate.

Formal explanation

When an unknown gesture is presented to the recognizer, it computes the feature values for the gesture. During the training process, the recognizer computes a set of weights for each class, $w_{ci}$ for $0 \le i \le F$, where $F$ is the number of features, $f_i$ is the value of feature

58 $i$, and $c$ denotes the gesture class. The recognizer computes the following objective function, $v_c$, for each of the $C$ classes:

$$v_c = w_{c0} + \sum_{i=1}^{F} w_{ci} f_i, \qquad 0 \le c < C$$

The candidate gesture is classified as the class $c$ with the largest value of $v_c$.

The weights are estimated from the training examples. For each class $c$ with $E_c$ training examples, where $f_{cei}$ is the value of feature $i$ in the $e$-th example of class $c$, the mean feature vector and the per-class covariance matrix are:

$$\bar{f}_{ci} = \frac{1}{E_c} \sum_{e=0}^{E_c - 1} f_{cei}$$

$$\Sigma_{c\,ij} = \sum_{e=0}^{E_c - 1} (f_{cei} - \bar{f}_{ci})(f_{cej} - \bar{f}_{cj})$$
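A minimal sketch of the recognition step is given below, assuming the weights have already been estimated (their estimation is described next). The array layout is an assumption made for illustration, not the layout used by gdt: weights[c][0] holds $w_{c0}$, weights[c][i] holds $w_{ci}$ for $1 \le i \le F$, and features[i-1] holds $f_i$.

```java
// A minimal sketch of evaluating v_c = w_c0 + sum_{i=1..F} w_ci * f_i for
// every class and returning the class with the largest value.
final class LinearClassifierSketch {

    static int classify(double[][] weights, double[] features) {
        int best = -1;
        double bestV = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < weights.length; c++) {
            double v = weights[c][0];                  // w_c0
            for (int i = 1; i <= features.length; i++) {
                v += weights[c][i] * features[i - 1];  // w_ci * f_i
            }
            if (v > bestV) { bestV = v; best = c; }
        }
        return best;   // index of the class that maximizes v_c
    }
}
```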

59 The common covariance matrix, $\Sigma_{ij}$, is estimated by averaging all of the $\Sigma_{c\,ij}$:

$$\Sigma_{ij} = \frac{\displaystyle\sum_{c=0}^{C-1} \Sigma_{c\,ij}}{\displaystyle -C + \sum_{c=0}^{C-1} E_c}$$

The estimated common covariance matrix is inverted to give $(\Sigma^{-1})_{ij}$. The weights, $w_{cj}$, are computed from the estimates as follows:

$$w_{cj} = \sum_{i=1}^{F} (\Sigma^{-1})_{ij}\, \bar{f}_{ci}, \qquad 1 \le j \le F$$

$$w_{c0} = -\frac{1}{2} \sum_{i=1}^{F} w_{ci}\, \bar{f}_{ci}$$

If the common covariance matrix cannot be inverted as-is, it can usually be made invertible by removing some elements and replacing their rows and columns with dummy values (in effect, not using some of the features).

The recognizer does not always correctly recognize the candidate gesture. One can examine the probability that the gesture was correctly classified and from that decide whether to accept or reject the classification. The probability that a gesture $g$ with feature vector $f$ is correctly classified as class $i$ is:1

$$P(i \mid g) = \frac{1}{\displaystyle\sum_{j=0}^{C-1} e^{v_j - v_i}}$$

Rubine recommends rejecting gestures for which $P(i \mid g) < 0.95$. He also recommends measuring the difference between a candidate gesture and the mean gesture for its chosen

1. $e$ is the base of the natural logarithm (approximately equal to 2.71828).

60 class and rejecting gestures that are too far away. Specifically, the difference is a Mahalanobis distance in feature space, which is given by:

$$\delta^2 = \sum_{j=1}^{F} \sum_{k=1}^{F} (\Sigma^{-1})_{jk} (f_j - \bar{f}_{ij})(f_k - \bar{f}_{ik})$$

Rubine recommends rejecting gestures for which $\delta^2 > \frac{1}{2} F^2$, although this strategy will unfortunately reject some good gestures as well. Obviously, these thresholds should be adjusted depending on the specific context (e.g., whether the operation the gesture invokes can be undone).

Assuming that the values of each feature are normally distributed across the gestures in each class, the computed weights will result in an optimal recognizer. This restriction is often not completely met and sometimes not even close to satisfied. I believe this restriction is a source of some of the recognition difficulties participants in my experiments have encountered. Nevertheless, the recognizer is robust enough on non-normal data to be useful in prototyping gesture sets.

For more detail on Rubine's algorithm, see [Rub91a]. For more information on statistical classification, see a book on multivariate analysis, such as [Krz88].

3.1.2 Experiences with Agate

The Garnet [MGD+90] and Amulet [M+97] toolkits include implementations of Rubine's algorithm, in LISP and C++, respectively, and a training tool for the recognizer called Agate [M+97, LM93]. Before beginning this research, we used these toolkits and Agate to better understand current generation tools and how they could be improved. Although Agate is a fine recognizer training tool, it was not intended to be a gesture design tool. Agate allows the designer to enter examples to be recognized, so it is possible for a designer to discover that a recognition problem exists. Unfortunately, Agate provides no support for discovering why a recognition problem exists or advice on how to fix it. As a first step toward solving this problem, we decided to build a new tool that exposes some of the information about the recognition process. We hoped that exposing recognition-related information would have two benefits. First, we believed that by showing designers some

61 of the details underlying the recognition, they could more easily determine why recognition problems exist and gain insight into how to fix them. Second, we thought being able to see recognition-related information would improve our intuition about how the recognition algorithm worked and thereby suggest better interfaces for interacting with it. 3.2 Gesture Design Tool Description We built a prototype gesture design tool (gdt) that is loosely based on Agate. This section discusses the differences between gdt and Agate and describes the different parts of the gdt UI in detail. The most significant improvement of gdt over Agate is a collection of tables intended to help designers discover and fix recognition problems. Other enhancements include: multiple windows for viewing more information at once; cut, copy, and paste of training examples and gesture classes; and the ability to save and reuse individual classes. The remainder of this section describes gdt in more detail. (For more about how to use gdt, see the tutorial in B.2, p. 205). gdt allows designers to enter and edit training examples, train the recognizer, and recognize individual examples. Figure 3-3 shows the gdt main window with a gesture set loaded. Each gesture class is represented by its name and an exemplar gesture (currently, the first training example for the class). In this example, the only fully visible classes are select, delete, undo, redo, and bigger font (in the top part of the window). The user has drawn a gesture to be recognized, which is shown in the white region at the bottom of the window. The recognition result can be seen across the middle of the window. The example shows that the gesture was correctly recognized as delete and gives some additional information about how well it was recognized. From the main window the user can, among other things, see all gesture classes in the set, open a gesture class to examine its training examples, call up data about the gesture set, and enter test gestures to be recognized. gdt allows the designer to examine training examples for a class and enter new ones. Figure 3-4 shows some training examples for delete. The individual examples can be deleted, cut, copied, and pasted. New examples can be added by drawing them in the 46

62 white area at the bottom. Also, gesture classes can be individually saved and loaded to or from files. The most important use for this operation is to copy a class from one gesture set to another.

Figure 3-3 gdt main window. (Screenshot; callouts mark the gesture classes, the selected gesture class, and the recognized gesture.)

Figure 3-4 gdt class window. (Screenshot; callouts mark the training examples and the area for drawing new training examples.)

63 Unlike Agate, gdt also provides tables and a graph to aid the designer in discovering and fixing computer recognition problems. One table, called the distance matrix, shown in Figure 3-5, highlights gesture classes that are difficult to differentiate. Distances are a weighted Euclidean distance (specifically, Mahalanobis distance [Krz88]) in the recognizer's feature space. This table is helpful because the designer can easily detect classes that have little distance between them, that is, a low number in the table. Gesture classes that are close to each other in feature space are more likely to be confused with one another by the recognition algorithm. To enable the user to more easily find similar gesture classes, distances above a threshold value are grayed out. The slider on the right side of the window may be used to set the threshold. For example, one can see in the figure that the most similar classes are "smaller font" and "undo" because the distance between them (i.e., 6.5) is the smallest.

Figure 3-5 gdt distance matrix. To find the distance between two classes, find one class name in the column labels and the other class name in the row labels, and find the entry at that column and row. Larger distances are grayed out, based on the value of the threshold slider on the right.

Another table provided by gdt is the classification matrix (shown in Figure 3-6), which tallies what the training examples are recognized as. The row and column names list gesture classes. Each cell contains the percentage of training examples of the class specified by the row name that were recognized as the class specified by the column name. A misrecognized training example is either a poorly entered training example or a sign

64 that the gesture class to which it belongs is too similar to another class. To make interesting entries stand out, misrecognized entries are colored red (dark shaded in this dissertation) and diagonal entries that are less than 100% are colored yellow (light shaded in this dissertation). In this example, six percent of "select" examples were misrecognized as "delete" and six percent of "smaller font" examples were misrecognized as "undo".

Figure 3-6 gdt classification matrix. (Screenshot; rows are the class each training example belongs to, columns are the class it was recognized as.)

The graph provided by gdt is a graphical display of raw feature values for all training examples (see Figure 3-7). We thought such a display might help designers to determine why classes were similar and how to change them to make them less similar. Unfortunately, although it did prove useful for the author, it was too complicated for others to comprehend and so it was not used in the experiment (described below).

In addition, gdt has a test procedure in which it asks the user to draw gestures from the current set. The tool tallies how well the test gestures are recognized. In a single test run, gdt displays each class name and exemplar five times in a random order and asks the user to draw it each time. After the test, gdt displays the overall recognition rate, or test score, and shows how the entered gestures were recognized in a format identical to the classification matrix (Figure 3-6). Also, after the test gdt allows the user to examine the gestures drawn during the test. This feature was not available during the experiment.

3.2.1 Usage example

This section describes how gdt was used by the author to find a recognition problem with a gesture set.

65 Figure 3-7 gdt feature graph. All feature values of all training examples for all classes are in the graph (only 2 features fit in the window at once, however). During the experiment described in the next section, people other than the author used gdt to create gesture sets. The author did some informal evaluation of these gesture sets, and discovered one in which gestures were misrecognized for no apparent reason. I thought perhaps there were bad training examples, so I brought up the classification matrix. However, it revealed no training problems. I next looked at the distance matrix to see how far apart the gesture classes were. Several classes seemed more similar to the recognizer than I thought they would be based on looking at them. To see why the recognizer thought they were similar, I used the feature graph. It showed that several gesture categories whose features related to length (e.g., total length, total angle, total absolute angle, and sharpness) were all nearly identical. It was because too many features were very similar that the recognizer was having trouble disambiguating them. It was also clear why that was the case in the feature graph: the average length of training examples in two gesture classes was very high compared to the rest. The extreme length of examples in those two classes made the other classes indistinguishable. 3.2.2 gdt Limitations A tension in designing a gesture design tool is the extent to which it should be recognizer- independent versus recognizer-dependent. The benefits of recognizer-independence are obvious: the tool can be run with any recognition technology at design time and a different 50

66 technology at run-time. On the other hand, by using recognizer-dependent features, the tool may offer better advice, but at the cost of non-portability. In the design of gdt, we decided to include some of both types of features. The Rubine algorithm is good for prototyping gesture sets, but designers may want to use different recognition technology in a final product. Some features of gdt will apply to many types of recognizers, while others are specific to the Rubine algorithm. Recognizer- independent features are: entry and editing of training examples (Figure 3-4), the classification matrix visualization (Figure 3-6), and the test mode. Conversely, the distance matrix (Figure 3-5) and feature visualizations (Figure 3-7) may not apply to other recognizers. 3.2.3 Implementation details gdt was implemented entirely in Java, using the Swing user interface toolkit. Including the recognizer, gdt is about 11,600 lines long. During the experiment, it was run on a 200MHz Pentium Pro with 64MB RAM using the Visual Caf 2.1 Java run-time environment. Of all the opinions we solicited about the system in the post-experiment questionnaire, system speed was ranked the lowest. In other words, the participants were most upset with the speed and responsiveness of the tool. We suspect the system did not have enough main memory and so gdt often stalled while swapping.1 3.3 Experimental Method The previous section described a simple gesture design tool, gdt. This section describes an experiment whose purpose was to characterize this style of gesture design tool. We ran this experiment to evaluate gdt and, more importantly, to gain insight into the process of designing gestures. Prior to the experiment, we formulated the following hypotheses: Participants could use gdt to improve their gesture sets. The tables gdt provided would aid designers. PDA users and non-PDA users would perform differently. 1. Interestingly, people who had used PDAs were more forgiving of system sluggishness and unreliability than those who had not used them. 51

67 We recruited two types of participants: technical (mostly computer science undergraduates) and artistic (architects and artists). We paid each participant $25 for participating. We ran ten pilot participants and ten in the experiment proper. The experiment was performed in a small lab where participants could work without distraction. A video camera recorded participant utterances and facial expressions. The computer screen was videotaped using a scan converter. All computer interaction except for the post-experimental questionnaire was done on a Wacom display tablet1 using a stylus. The experimenter was present in the room during the experiment, observing the participant and taking notes. The experimenter was allowed to answer questions if the answer was contained in the materials given to the participant. The total time for each participant ranged from 1.5 to 2.5 hours. All participants were required to sign a consent form based on the standard one provided by the U.C. Berkeley campus Committee for the Protection of Human Subjects (see B.8, p. 225). The rest of this section describes the different steps of the experiment. 3.3.1 Demonstration Participants were shown a demonstration of gdt. They were shown how to enter gestures to be recognized, how to examine gesture classes and training examples, and how to read the distance and classification matrices. 3.3.2 Tutorial Next, participants were given a printed tutorial about gdt that gave a simple description of the Rubine algorithm and showed how to perform the tasks necessary to do the experiment. The tutorial also described the distance and classification matrices. 3.3.3 Practice Task To allow participants to familiarize themselves with gdt, we asked them to perform a practice task. This task was their first opportunity to actually use the tool. In this task, they were given a gesture set containing one gesture class and asked to add two new gesture classes of fifteen examples each with specified names and shapes. After adding them, participants were to draw each of the two new gestures five times to be recognized. 1. A display tablet combines a pen tablet with an LCD. 52

68 Figure 3-8 Baseline gesture set used in gdt experiment. 3.3.4 Baseline Task We wanted to compare recognition rates from the experimental task across participants, but recognition rates will vary across participants (e.g., due to being neat or sloppy). To account for this variance, we measured the recognition rate of a standard gesture set for each participant. The gesture set used was the same one used for the experimental task, which had fifteen gesture classes, each of which already had fifteen training examples. This set is shown in Figure 3-8. Since users were not familiar with the gdt test procedure, we did not want to rely on a single test. We asked participants to perform the test twice with the experimental set. A drawback of the Rubine algorithm is that a gesture drawn by one person (such as the participant) may not be recognized well if the gesture set was trained by a different person (such as the experimenter). We wanted to know whether participants could improve their recognition rate by adding their own examples to the preexisting gesture set. To test this issue, participants were asked to add five examples of their own to each gesture class in the initial experimental set and do another test. The resulting gesture set we term the 53

69 baseline gesture set. We recorded the recognition rate for this test and used it as the target recognition rate for the experimental task.

3.3.5 Experimental Task

The experimental task was to invent gestures for ten new operations, shown in Figure 3-9, and add these gestures to the baseline gesture set. Participants were told to design gestures that were recognizable by the computer and were easy for people to learn and remember. As an incentive, we offered $100 to the creator of the best gesture set, as judged by the experimenters.

1. Cut
2. Copy
3. Paste
4. Align left
5. Align center vertically
6. Thicker lines
7. Thinner lines
8. Eraser (i.e., switch to the eraser tool)
9. Pen (i.e., switch to the pen tool)

Figure 3-9 Operations for which participants invented gestures.

The participants entered training examples for the new gesture classes. Some participants used the tables or did informal testing to find recognition problems. After entering all the classes, each participant ran a test in gdt. If a participant did not reach the target recognition rate, the experimenter asked the participant to try to improve the recognition rate. Participants were asked to work until they had either achieved a recognition rate equal to or better than the target recognition rate or until they had spent ninety minutes on the experimental task. Participants were not told that there was a time limit until five minutes before the end, when they were asked to finish up and take the recognition test (again).

3.3.6 Post-experiment Questionnaire

After the experiment, participants were led to a second computer (to avoid negative effects suggested by [RN96]) where they used a web browser (with a mouse and keyboard) to fill out a questionnaire. The questionnaire asked for three basic types of information: 1) opinions about various aspects of gdt and the experiment, 2) PDA usage, and 3) general demographics.

70 3.4 Results and Analysis This section describes and analyzes the results of the experiment. First, we will discuss evidence for or against our proposed hypotheses. Then we will discuss general results related to the gesture design process. 3.4.1 Hypotheses One important question we wanted to answer was whether participants could use gdt to improve their gesture sets. We measured improvement as the difference between the best recognition rate achieved during the experimental task and the recognition rate of the first test done during the experimental task, called the initial test. We found that on average participants improved the average recognition rate of their gesture sets by 4.0% (from 91.4% to 95.4%).1 This difference was statistically significant (p < 0.006, 2 tailed t test). The difference between the best recognition rate during the experiment and the baseline rate was not statistically significant. This finding is encouraging because it means that participants were able to add gestures to the set and retain the original level of recognition. Figure 3-10 shows participants performance on the baseline test, initial test, and the best score received during the experimental task. We also wanted to know if the distance matrix, classification matrix, or test result tables were helpful in designing gesture sets. Six of the ten participants used the classification matrix. Eight used the distance matrix. Seven looked at the test results. For each table, including the test results, we compared the performance of those who used them and those who did not. Surprisingly, usage of none of the three tables had a significant effect. As a metric of gesture set goodness, we measured the average pairwise distance between all gesture classes in the final gesture set of each participant. This metric is a reasonable measure of goodness because classes that are farther apart are less likely to be misrecognized. Among other things, we asked participants on the post-experiment questionnaire to rate their user interface design experience and if and for how long they had used a PDA. We found that average pairwise distance correlated both with UI design 1. Recognition rates are the fraction of gestures drawn by the participant during a test run that were correctly recognized by the program. 55

experience and with length of PDA usage (correlation coefficients 0.67 and 0.97). In other words, participants who had designed UIs or used PDAs designed objectively better gesture sets.

3.4.2 Gesture Design Process

This subsection describes qualitative observations made during the experiment. First we discuss general gesture design strategies. Next we list problems that participants had in the gesture design process.

3.4.2.1 Overall strategies and observations

No specific strategy was given to participants for how to design good gestures. Most participants followed this general strategy:
1. Think of one or more new gesture classes.
2. Enter training examples for the class.
3. Informally enter examples of the new class(es) to see if they are recognized.
4. Look at the new class(es)' statistics in the classification matrix and/or distance matrix.
5. Possibly modify the class(es) just entered.
6. Repeat until all new classes are entered.

Figure 3-10 Recognition rates by participant for different stages of the gdt experiment (percent correctly recognized for each participant on the target, initial, and best tests).

72 Not all participants used all evaluation methods. Specifically, many participants skipped steps 3, 4, and/or 5, especially before their first test run. Many participants attempted to use metaphors to help design gestures. For example, two said they were trying to make the cut gesture scissor-like. Two said they wanted to base the paste gesture on glue somehow. Another wanted the copy gesture to be something where youre drawing a double and to mimic what one would do with a real eraser for the eraser gesture. Participants noticed that some commands have more direct representations than others. One commented, I have an easier time with gestures that are geometrical as opposed to ones that are more abstract like copy and paste. Another said, Some of these [operations] have real-world examples, and Some metaphors are simply visual but others are trying to represent concepts. 3.4.2.2 Specific problems Although participants could use gdt to improve their gesture sets, it was not an easy task. This section discusses specific problems participants encountered in designing gestures. 1. Finding and fixing recognition problems. Participants had difficulty finding and fixing recognition problems. On the post-experiment questionnaire, using a scale of 1 (difficult) to 9 (easy), finding recognition problems was ranked 5.8 and fixing them was ranked 4.6. The average best recognition rate was 95.4%, which we do not believe is good enough for commercial applications.1 These problems were likely due to a lack of understanding of the recognizer, which many participants expressed verbally. 2. Adding new classes. We also found that adding new gesture classes caused a statistically significant drop of 2.4% in the recognition rate of the preexisting gestures (p < 0.041, 2 tailed t test). Most participants did not seem aware that this problem might occur. Many participants thought a low recognition rate was a problem with how they drew the gestures during the test, both for preexisting gesture classes and the new classes they entered. 3. New similar class. One way new classes were observed to cause a problem is by being too similar to one or more existing classes. Sometimes the participant noticed this problem 1. Other researchers report that users found a 98% recognition rate inadequate [BDBN93]. 57

73 by informally testing the recognition (i.e., just drawing in the main window and seeing how it was recognized) or with the distance matrix. However, not all participants watched for this problem. 4. Outlier feature values. Another way new classes were seen to cause recognition problems is by having feature values that were significantly different than the values of many old classes. The outlier values caused the values for old classes, which were close together by comparison, to clump together. Unfortunately, these features were important for disambiguating the old classes, and so by adding the new classes the old ones became harder to recognize correctly. Although this problem is an issue for Rubines algorithm, it may not be for other recognition algorithms. 5. Drawing gestures backwards. Since several features used in the Rubine recognizer depend on the starting point, it is important for users to be consistent about the placement of the starting point and the initial direction. Unfortunately, some participants drew test gestures backwards (i.e., starting at the end and going to the beginning), either because they had not learned the gesture well enough or because the start and end points of the gesture were too close together, and it was unclear which direction was the correct one. 6. Radical changes. Participants also varied by what strategy they used to try to solve recognition problems. When they thought two classes were confused with one another, some participants made a small change in one of the two classes. Other participants made a dramatic change to one of the problem classes. One of the success metrics in the experimental task was how much the recognition rate improved from the beginning of the experimental task to the best recognition rate achieved during the experimental task. The improvement in recognition rate of participants who made radical changes was lower than the improvement of those who did not make radical changes (1.4% vs. 6.6%), and this difference was significant (p < 0.006, 2 tailed t test). 7. Over-testing. When faced with a test score lower than the target, some participants elected to take the test again, because they thought they had been sloppy when entering the gesture. They thought if they were neater they would do better. Sometimes this strategy succeeded and other times it did not. 58

74 8. Limited test support. Participants in the experiment relied heavily on the test procedure. At present, the tool has only rudimentary support for testing how well a gesture set is recognized. The only test results available were the count of how many gestures of each class were recognized and the overall recognition rate. 9. Multiple gestures for one operation. Several participants wanted to experiment with different gestures for the same operation. For example, a participant wanted to experiment with several gestures for the pen operation and so made three classes with three different gestures: pen, pen 2, and pen 3. Unfortunately, the alternative classes affect the recognition of one another, which is undesirable since the final set will contain at most one of them. We learned a great deal about the gesture design process from this experiment. Based on its results, we think that a tool like gdt is valuable, but it falls short of an ideal gesture design tool in many ways. The next section discusses implications of the experiment for building a better gesture design tool. 3.5 Discussion This section discusses results from the experiment and what features a better gesture design tool might have. The first subsection discusses our experimental hypotheses. The second subsection discusses implications of our experiment for building a better gesture design tool. 3.5.1 Hypotheses This subsection discusses why the experimental hypotheses were validated or refuted. Participants could use gdt to improve their gesture sets. The confirmation of this hypothesis did not surprise us, but we were surprised that on average participants were only able to reach a 95.4% recognition rate, although some reached as high as 98% (see Figure 3-10). We believed that this low performance was because the participants did not understand how the recognizer worked and because the tool was not very sophisticated. What we did not expect was that none of the tables provided by gdt had a statistically significant effect on the performance of participants. We anticipated that the distance matrix, in particular, would be useful to participants in finding what we expected to be the most common cause of recognition problems: two (or more) gesture classes that are too 59

75 close together. We believe that it was not useful because it was too abstract and because users did not have a good understanding of how the recognizer works. The tool should not require them to understand how the recognizer works, but instead it should provide higher level feedback. For example, rather than show the user n2 distance numbers whose units are completely abstract, tell the user, Class A and class B are too similar. We also expected that the classification matrix would be useful because we expected some training examples in every set to be misrecognized. In fact, training examples were rarely misrecognized. Although a fair number of participants consulted the distance matrix and classification matrix, the majority focused much more on the test results and seemed to base many decisions about where recognition problems were on it. We believe this did not improve their performance because the test results are determined not only by the quality of the gesture set, but by the goodness of the test examples themselves. From looking at the test results, participants could not tell what the reason for an unsatisfactory test was. As expected, we found that performance correlated with participants self-ranked UI design experience. We believe this result is due to experience with the design, prototype, evaluate cycle, which is common to UI and gesture design. Also as expected, PDA usage correlated with performance, which is likely due to familiarity with gestures. 3.5.2 Gesture Design Tool Implications Both the experiment and our own experiences with gesture design and gdt have given us ideas about what capabilities a gesture design tool should have, which are discussed in this subsection. We describe the lessons we learned from the experiment and what they imply for future gesture design tools. Next, we discuss design ideas that arose from our own experiences. 3.5.2.1 Experimental Lessons The single biggest lesson we drew from the experiment was that users do not understand recognition, so the tool must take an active role in warning them about recognition problems and in guiding them to solutions to those problems. We think that many problems participants encountered in the gesture design process (especially problems 37, described above) could be ameliorated by making the tool more active. 60

76 As well as active feedback mechanisms, the experiment also suggested other capabilities that would enhance a gesture design tool. One such capability is better support for testing gesture sets. The testing feature was very popular in the experiment. A gesture design tool should make it easy to create a set of test gestures that are to be recognized against a recognizer trained with another gesture set. These test gesture sets should be as easy to edit, load, and save as the training gesture sets. This enhancement addresses problem 8. Another desirable capability is the ability to enable or disable individual gesture classes without deleting and re-entering them. In particular, it would greatly ameliorate problems 8 and 9. In addition, the experiment suggested the idea of allowing gesture examples and classes to be dragged and dropped between classes and sets, respectively. This capability would solve problem 9 (via drag-and-drop between the main set and a scratch set). We also learned from the experiment that the lack of understanding about how the recognizer worked greatly hindered participants both in discovering and in fixing recognition problems. The tool should enable UI designers to make a gesture set for an application without being experts on gesture recognition. Unfortunately, this knowledge is required to successfully use current tools. One capability that would aid designers in understanding the recognition process is to graphically explain the features. For example, superimposing a graphic representation of each feature on top of a gesture might help designers understand what the features are and thus how to change one or more classes to make them less similar. For example, Figure 3- 11 shows a graphic representation of the angle between the first and last points feature. The lighter dotted lines are guidelines and the darker dotted lines are the value of the feature.1 1. They could be drawn in different colors on-screen. 61

77 3 E 2 1 S Figure 3-11 Angle between first and last points visualization. S is the start point. E is the end point. 1 is a horizontal ray from S. 2 connects S and E. 3 represents the value of the feature, which is the angle between 1 and 2. 3.5.2.2 Additional Capabilities Besides the results that arose directly out of the experiment, we have ideas from our own experience about capabilities that should improve a gesture design tool. We describe those features below. One useful capability is assistance in making a class size- or rotation-independent. If the user indicated that a class should be size-independent, for example, the tool could generate training examples of different sizes by transforming existing examples. This feature could be extended to other types of independence besides size and rotation, such as drawing direction. A weakness of the Rubine algorithm is that it works best if the values of each feature within each gesture class are normally distributed. That is, for a given class and feature, the distribution of a features value across training examples in that class should be approximately normal (ideally, the degenerate case of all values being identical or nearly so). If a feature that is important for disambiguating two classes is not normally distributed in some third class, the recognition of the first two classes might suffer. For example, one might have a gesture set whose classes are disambiguated based on size. If a class is added that is intended to match squares of two very different sizes, then some of its examples will be large and others small, which will make its size feature bimodally distributed. This non-normal distribution may hinder the other classes from being disambiguated based on size, because the recognizer will treat size as an unimportant feature since it is irrelevant to 62

recognizing the class with examples of widely varying size. Therefore the recognizer will not be able to use size to disambiguate classes that are different from one another primarily because of differences in size. This problem could be solved by breaking the square class into a big-square class and a small-square class. The tool could notice the need for such a split and do it automatically (or after designer confirmation). At application run-time, the recognition system could combine the split classes into one so they appear to be one class to the application. Such a split may be necessary after automatically generating examples to make a class size- or rotation-independent, as discussed above.

Other recognition systems, especially voice recognition systems, show the user the n-best recognized results instead of just the single best one [MNAC93]. This feature would be useful for a gesture design tool as well.

3.6 Summary

It is difficult to design gesture sets that are well recognized. There are many pitfalls of which to be wary, and many of them are all but invisible to those unfamiliar with recognition technology. It was very difficult for the participants in our experiment to attain a good recognition rate for their gesture sets, and we believe this was due in large part to difficulty in understanding the recognizer. For example, it was common for participants to make radical changes to their gestures to fix recognition problems, but this strategy was worse than making gradual changes. We also found that experience with PDAs made a difference in gesture quality: people who had used PDAs or designed PDA interfaces before performed better. A good design tool would enable people who do not have such experience to design good gestures. To perform the difficult task of gesture design, designers will require significantly better gesture design tools than are currently available.

79 Chapter 4 Gesture Similarity Experiments Gestures should be easy to learn and remember. Gestures that are perceived as similar to one another by people may be confused more often by users than gestures perceived as different. Currently, the only way gesture designers can test whether two gestures will be perceived as similar by people is with their subjective judgement or by performing time- consuming experiments where people judge the similarity of their gestures. To make it easier for designers to create gestures that users would not perceive as too similar, we wanted our gesture design tool to predict similarity. Then, it could warn designers when their gestures would be seen as similar by people, and advise them how to make the gestures less similar. We performed three experiments designed to investigate human-perceived gesture similarity. The first experiment presented participants with triads1 of gestures, and for each triad participants selected the gesture that was most different from the others. The second experiment was the same, except that different gestures were used to investigate the effect of particular geometric features. The third experiment was a web-based survey in which participants rated the similarity of pairs of gestures on a 5-point Likert scale. This chapter describes these three experiments, in turn. It concludes with a summary. 4.1 Similarity Experiment 1 In the first experiment, we attempted to make a gesture set consisting of gestures that varied widely in terms of how people would perceive them. The gesture set is shown in Figure 4-1. It was designed by the author based on personal intuition with three criteria in mind: 1) to span a wide range of possible gesture types, 2) to have differences in orientation (e.g., 14 and 15, 20 and 22) and 3) to be small enough so that the total number of trials could be viewed in a thirty-minute session (364 triads in this case). 1. A triad is a group of three things. 64

80 Figure 4-1 Gesture set for first gesture similarity experiment. A dot indicates where each gesture starts. (Gestures are not numbered consecutively because they were chosen from a larger set.) We considered whether the participants should draw the gestures or not. Drawing them would mimic actual usage more closely, but it would have lengthened the experiment. In order to accommodate more participants and more gestures, we decided not to have participants draw the gestures. Instead, the test program animated the gestures to show participants the dynamic nature of the gestures. 4.1.1 Participants We recruited twenty-one people from the general student population at U.C. Berkeley to participate in the experiment. We required that they be able to operate the computer and tablet. Each participant was paid $10 (US). 4.1.2 Equipment For the main part of the experiment participants used a display tablet (a Mutoh MVT-12P) attached to a PC running Windows NT. The experimental application was written in Java. 65

81 4.1.3 Procedure Participants were asked to sign a consent form and were shown an overview of the experiment: Thank you for agreeing to participate in this experiment. This experiment is about how people judge gesture similarity A gesture is a mark made with a pen to cause a command to be executed, such as the copy-editing pigtail mark for delete. (Some people use gesture to mean motions made in three dimensions with the hands, such as pointing with the index finger. This is not what is meant by gesture here.) During this experiment, we will ask you to look at gestures and decide how similar you think they are. There is no right or wrong answer. This task is completely voluntary. We do not believe it will be unpleasant for you, but if you wish to quit you may do so at any time for any reason. Participants were then shown the tablet and the task was explained to them. The program displayed all possible combinations of three gestures, called triads. The order of the triads was randomized independently for each participant, as was the on-screen ordering of gestures within each triad. Figure 4-2 is a representative screen shot of the triad program. For each triad, participants selected the gesture that seemed most different from the others by tapping on it with a pen. The program recorded the selections of the users and computed the dissimilarity matrix. A dissimilarity matrix is a table of how dissimilar each gesture is compared to every other gesture. It is symmetric, with zero values on the Figure 4-2 Triad program for the first and second similarity experiments. 66

diagonal (since each gesture is assumed to be not at all dissimilar to itself1). It is similar to the distance matrix provided by gdt (Figure 3-5), except that the values are human-perceived dissimilarity rather than distances in the feature space of the recognizer.

The program was run using a practice gesture set of five gestures, shown in Figure 4-3, so that participants could become familiar with the program and the tablet. Participants were asked to select the gesture in each triad that seemed most different to them. After the practice task, they performed the task again using the experimental gesture set of fourteen gestures (Figure 4-1). All possible triads of gestures were shown to each participant exactly once, for a total of C(14, 3) = 364 triads.

When the experimental task was completed, participants filled out a questionnaire which asked for: 1) their impressions of the task and program and 2) demographic information about themselves. To avoid bias, this questionnaire was a web form that was filled out on a different computer than the one used for the experimental task, as suggested by Reeves and Nass [RN96].

4.1.4 Analysis

The goals of the analysis were: 1) to determine what measurable geometric properties of the gestures influenced their perceived similarity and 2) to produce a model of gesture similarity that, given two gestures, could predict how similar the two gestures would be perceived to be by people.

Figure 4-3 Practice gesture set for the first similarity experiment.

1. There is evidence that an asymmetric measure of (dis)similarity is more appropriate in some circumstances [SJ99], but for simplicity our metric is symmetric. Also, there is evidence that people sometimes perceive identical items to have non-zero dissimilarity [SJ99], but for simplicity we assume self-dissimilarity is always zero.
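To make the tallying concrete, the sketch below shows one way the triad selections could be accumulated into such a dissimilarity matrix. The scoring scheme (increment the dissimilarity between the gesture picked as most different and each of the other two) and all names are assumptions for illustration; the actual triad program may have scored the selections differently.

/** Sketch of tallying triad judgements into a dissimilarity matrix.
 *  Hypothetical scoring scheme, not necessarily the one used in the experiment. */
public class TriadTally {
    private final int[][] dissimilarity;   // indexed by gesture id

    public TriadTally(int numGestures) {
        dissimilarity = new int[numGestures][numGestures];
    }

    /** a, b, c are the gesture ids in the triad; mostDifferent is one of them. */
    public void record(int a, int b, int c, int mostDifferent) {
        int[] others = (mostDifferent == a) ? new int[] { b, c }
                     : (mostDifferent == b) ? new int[] { a, c }
                     : new int[] { a, b };
        for (int other : others) {
            dissimilarity[mostDifferent][other]++;   // keep the matrix symmetric,
            dissimilarity[other][mostDifferent]++;   // with zeros on the diagonal
        }
    }

    public int[][] matrix() { return dissimilarity; }
}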

Figure 4-4 First similarity experiment, stress vs. dimension (curves for the interval and ordinal models, with and without the proximities preprocessing). This graph shows a knee at dimension (D) = 3. It also shows that the ordinal model with proximities gave the best fit to the data (i.e., the lowest stress value).

The first goal was addressed using plots of gestures generated by multi-dimensional scaling (MDS), using the INDSCAL procedure in SPSS [You87]. MDS is commonly used in psychology to reduce the dimensionality of large data sets so they can be interpreted. In these plots, the Euclidean inter-gesture distances corresponded to the gesture dissimilarities reported by the participants. By examining these plots, we were able to determine some geometric features that contributed to similarity. To determine the number of dimensions to use, we did MDS in two through six dimensions and examined plots of stress1 and goodness-of-fit (r²) versus dimension, shown in Figure 4-4 and Figure 4-5, respectively,

1. Kruskal's stress, from the INDSCAL procedure.

Figure 4-5 First similarity experiment, r² vs. dimension (curves for the interval and ordinal models, with and without the proximities preprocessing). There is no obvious knee in the curve, but the graph shows that the ordinal model with proximities is the best fit (i.e., highest r²).

to find the knee1 in the curve2. Similarity data was analyzed with MDS using interval/ratio and ordinal models, and with and without the SPSS proximities procedure as a preprocessing stage. The ordinal model with proximities gave the best fit, so it was used for all subsequent analysis. We tried Euclidean and city-block (or Manhattan) distance metrics, and used Euclidean distances for subsequent analysis since it provided a better fit to our data than the city-block metric.

1. The knee is the place where the slope of the graph changes dramatically. For example, in Figure 4-4, there is a knee in the interval curve at dimension (D) = 3.
2. Examination of stress vs. dimension and r² vs. dimension is a standard MDS technique for determining the dimensionality to use for further analysis.
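Such a knee can also be located programmatically. The following is a simple heuristic offered only as an illustration; the analysis reported here chose the dimensionality by inspecting the plots in Figures 4-4 and 4-5, not with code like this.

/** Sketch: pick an MDS dimensionality by finding the "knee" in a stress curve.
 *  A heuristic for illustration only. */
public class KneeFinder {

    /**
     * @param stress  stress values indexed by dimension (entries minDim..maxDim used)
     * @param minDim  smallest dimension tried (e.g., 2)
     * @param maxDim  largest dimension tried (e.g., 6)
     * @param cutoff  a later improvement smaller than cutoff * (previous improvement)
     *                marks the knee (e.g., 0.5)
     */
    static int kneeDimension(double[] stress, int minDim, int maxDim, double cutoff) {
        for (int d = minDim + 1; d < maxDim; d++) {
            double gainHere = stress[d - 1] - stress[d];   // improvement gained at d
            double gainNext = stress[d] - stress[d + 1];   // improvement after d
            if (gainNext < cutoff * gainHere) {
                return d;   // adding a dimension beyond d buys relatively little
            }
        }
        return maxDim;
    }
}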

The second goal, predicting human perception of similarity, was achieved by running regression analysis to determine which of many measurable geometric features of a gesture correlated with the reported similarity. Regression also produced weights indicating how much each feature contributed to the similarity. To compute the similarity of two gestures, their feature values are computed. Feature values and weights together give the positions of the gestures in feature space. The similarity of the gestures is given by the Euclidean distance between the two points in the feature space, where smaller distance means greater similarity.

Let F be the number of features and D the number of dimensions in the model, and let C be the matrix of coefficients as computed by regression:

C = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1F} \\ w_{21} & w_{22} & \cdots & w_{2F} \\ \vdots & \vdots & \ddots & \vdots \\ w_{D1} & w_{D2} & \cdots & w_{DF} \end{pmatrix}

If f is the feature vector for a gesture g, then the coordinates c_g of g in feature space are given by:

c_g = C f

Then, the distance d between two gestures g and h is the Euclidean distance between their coordinates:

d = \sqrt{ (c_g - c_h) \cdot (c_g - c_h) }

The features used in our regression analysis came from a few sources. Features 1-11 were taken from Rubine's gesture recognizer [Rub91b]. Others were inspired by plots from the MDS analysis. That is, we looked at the plots of gestures in the MDS output and tried to see what features gestures on one side of the plot had in common with one another, and how they differed from gestures on the other side of the plot. The list of features that we thought might predict similarity is given in Table 4-1. We wanted our model to be computable, so we did not include features whose values were

only obtainable by subjective judgement. New features used in the final regression analysis are described below. (See Figure 3-2 and 3.1.1.1, p. 39, for definitions of Rubine's features.)

Aspect

f_{12} = \lvert 45^\circ - f_4 \rvert

Curviness

f_{13} = \sum_{p : \lvert \theta_p \rvert < t} \lvert \theta_p \rvert

where \theta_p is the inter-segment angle at point p and the threshold t was chosen to be 19° to fit the data.

1. Cosine of initial angle
2. Sine of initial angle
3. Size of bounding box
4. Angle of bounding box
5. Distance between first and last points
6. Cosine of angle between first and last points
7. Sine of angle between first and last points
8. Total length
9. Total angle
10. Total absolute angle
11. Sharpness
12. Aspect [abs(45° - #4)]
13. Curviness
14. Total angle / total length
15. Density metric 1 [#8 / #5]
16. Density metric 2 [#8 / #3]
17. Non-subjective openness [#5 / #3]
18. Area of bounding box
19. Log(area)
20. Total angle / total absolute angle
21. Log(total length)
22. Log(aspect)

Table 4-1 Possible predictors for similarity. Features 1-11 are taken from Rubine's recognizer (see 3.1.1.1, p. 39).

Total angle divided by total length

f_{14} = f_9 / f_8

Density metric 1

f_{15} = f_8 / f_5

Density metric 2

f_{16} = f_8 / f_3

Non-subjective openness

f_{17} = f_5 / f_3

Area of bounding box

f_{18} = (x_{max} - x_{min}) \cdot (y_{max} - y_{min})

Log(area)

f_{19} = \log(f_{18})

Total angle divided by total absolute angle

f_{20} = f_9 / f_{10}

Log(total length)

f_{21} = \log(f_8)

Log(aspect)

f_{22} = \log(f_{12})
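To make the model concrete, the sketch below shows how a few of these derived features could be computed from the Rubine base features, and how the coefficient matrix C from 4.1.4 maps a gesture's feature vector into the model space, where Euclidean distance predicts dissimilarity. All names are hypothetical, only a handful of features are shown, and the constant terms reported later (Tables 4-4 and 4-6) are omitted; this is not the actual quill or gdt code.

/** Sketch of the similarity model from 4.1.4: derived features are built from
 *  the base features, C projects a feature vector into the MDS-derived space,
 *  and Euclidean distance in that space predicts dissimilarity. */
public class SimilarityModel {
    private final double[][] C;   // D x F coefficient matrix from the regression

    public SimilarityModel(double[][] coefficients) {
        this.C = coefficients;
    }

    /** A few derived features from Table 4-1. Indices are 0-based, so base[8]
     *  is f9 (total angle), base[7] is f8 (total length), base[4] is f5. */
    static double totalAngleOverLength(double[] base) { return base[8] / base[7]; } // f14
    static double density1(double[] base)             { return base[7] / base[4]; } // f15
    static double boundingBoxArea(double xMin, double xMax, double yMin, double yMax) {
        return (xMax - xMin) * (yMax - yMin);                                        // f18
    }

    /** c_g = C f : project a full feature vector into the model space. */
    double[] project(double[] features) {
        double[] coords = new double[C.length];
        for (int d = 0; d < C.length; d++)
            for (int i = 0; i < features.length; i++)
                coords[d] += C[d][i] * features[i];
        return coords;
    }

    /** Predicted dissimilarity: Euclidean distance between projected gestures. */
    double predictedDissimilarity(double[] featuresG, double[] featuresH) {
        double[] cg = project(featuresG), ch = project(featuresH);
        double sum = 0;
        for (int d = 0; d < cg.length; d++) {
            double diff = cg[d] - ch[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}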

88 4.1.5 Results The data were initially analyzed in 5 dimensions because at that time we were not aware that some MDS experts advise against using more dimensions than one quarter of the number of stimuli, because of the risk of modeling noise in the data [KW78]. This experiment used 14 gestures, which makes the maximum number of dimensions 3. The 5 dimensional analysis yielded a model of gesture similarity that correlated 0.782 (p < 0.00095, 2-tailed t test) with the reported gesture similarities. We also analyzed the data in 3 dimensions. This model correlated 0.701 (p < 0.0052, 2-tailed t test) with the reported gesture similarities. Below is the analysis in 5 dimensions, followed by the 3-dimensional analysis. 4.1.5.1 5 Dimensions We were able to derive a model of gesture similarity that correlated 0.782 (p < 0.00095, 2- tailed t test) with the reported gesture similarities. The 21 participants took an average of 26 minutes to complete the experimental task. The total time for each participant was approximately 40 minutes. The multi-dimensional scaling indicated that the optimal number of dimensions was five (stress = 0.112, r2 = 0.864). The coordinates of the gestures in these five dimensions are given in Table 4-2. For ease of comprehension, we plotted the gesture positions two dimensions at a time (shown in Figure 4-6, Figure 4-8, and Figure 4-9). Examination of the plot of dimensions 1 and 2 (Figure 4-6) quickly showed that dimension 1 was strongly correlated with how curvy the gestures were for example, g5 and g40 are curvy and g32 and g28 are straight, as shown in Figure 4-7. The curviness metric was derived in an attempt to capture our intuitive notion of curviness and to match this dimension from the MDS plot.1 Features for dimension 2 were determined by using linear regression to predict the dimension 2 coordinate from the predictor features (Table 4-1). The features that predict dimension 2 are total absolute angle and log(aspect). 1. Curviness of a gesture was computed by adding up all inter-segment angles within the gesture whose absolute value was below a threshold ( 19 ). The threshold was chosen so that the metric would agree with the authors curviness judgements of gestures in experiment 1. 73

89 We also plotted dimensions 3 and 4, shown in Figure 4-8, and dimensions 4 and 5, shown in Figure 4-9. It was observed in the first MDS plot (Figure 4-6) that short, wide gestures were perceived as being very similar to narrow, tall ones and that both types were perceived as different from square gestures. Angle of bounding box represented the difference between thin and square gestures, but not the similarity of tall vertical and short horizontal ones. We created the aspect feature to represent this relationship. Table 4-3 shows which features strongly correlate with each dimension, based on a regression analysis. Although the most important (i.e., lower numbered) dimensions are predicted by relatively few features, the other dimensions require many features. A separate regression analysis was done for each dimension, using the computed feature values as the independent variables and the coordinates on each MDS dimension as the dependent variable. The coefficients from the regression analysis are given in Table 4-4. These coefficients are used in the equations from 4.1.4 to compute gesture similarity. Dimension Gesture number 1 2 3 4 5 4 1.009046 0.916723 0.72938 0.533006 1.58836 5 1.21778 0.738405 1.271805 0.751999 0.96763 6 0.839775 1.072349 0.28303 1.02865 1.357475 13 0.226127 1.49624 0.87677 0.16051 0.855659 14 1.14816 1.24596 0.39531 1.2416 0.48689 15 1.11076 1.30201 0.44437 1.07989 0.21252 18 1.27402 0.25838 1.647313 0.957723 0.56165 19 0.785083 0.69337 2.319796 0.96253 0.2376 20 1.015448 1.197832 0.70001 0.66532 1.076016 22 0.997235 1.12641 0.72374 0.536185 1.268327 28 0.794889 0.908604 0.16613 0.371846 1.98364 32 1.104321 0.76245 0.452271 1.512996 0.424277 36 0.88544 0.90709 1.11347 1.21948 0.65227 40 1.13576 0.705168 0.25897 1.694223 0.07093 Table 4-2 Gesture coordinates from MDS for similarity experiment 1, 5D analysis. 74

The derived model predicts the reported gesture similarities with correlation 0.782 (p < 0.00095, 2-tailed t test). The MDS model upon which it is based fits the data only slightly better, so this is a good fit.

Figure 4-6 MDS plot of dimensions 1 and 2 for first similarity experiment (dimension 1: Curviness & Angle/distance; dimension 2: Total absolute angle & Log(aspect)).

Dimension  Correlated features (in order of decreasing importance)
1          Curviness, Angle/distance
2          Total absolute angle, Log(aspect)
3          Density 1, Cosine of initial angle
4          Cosine of angle between first and last points, Cosine of initial angle, Sine of initial angle, Distance between first and last points, Angle of bounding box
5          Aspect, Sharpness, Cosine of initial angle, Total angle

Table 4-3 Predictor features for similarity experiment 1, listed in decreasing order of importance for each dimension.

91 Dimension 2: Total absolute andgle & Log(aspect) 1.5 simple lines g20 g6 1.0 g22 g5 g28 g4 g40 .5 circular 0.0 Straight g18 -.5 Curvy g19 g32 g36 -1.0 complex lines g14 squiggles g15 g13 -1.5 -2.0 -1.5 -1.0 -.5 0.0 .5 1.0 1.5 Dimension 1: Curviness & Angle/distance Figure 4-7 MDS plot of dimensions 1 and 2 for first similarity experiment, with hand annotations showing groupings based on geometric properties, such as straightness vs. curviness. Dimension Feature 1 2 3 4 5 Cosine of initial 0.580199004 0.814554697 0.604946888 angle Sine of initial 0.480121986 angle Angle of 0.53175652 bounding box Distance between first 0.004912005 and last points Table 4-4 Coefficients of similarity model from first similarity experiment. 76

92 Dimension Feature 1 2 3 4 5 Cosine of angle between first 1.224196927 and last points Total angle 0.048878923 Total absolute 0.204093791 angle Sharpness 0.138386161 Aspect 5.706718273 Curviness2 0.29285916 Angle/dist 0.29285916 Density 1 0.053814637 Table 4-4 Coefficients of similarity model from first similarity experiment. (Continued) 2.0 g40 g32 1.5 Dimension 4: various features (see table) 1.0 g18 g5 .5 g4 g22 g28 0.0 g13 -.5 g20 g19 -1.0 g15 g6 g36 -1.5 g14 -1.5 -1.0 -.5 0.0 .5 1.0 1.5 2.0 2.5 Dimension 3: Density 1 & cos(initial angle) Figure 4-8 MDS plot of dimensions 3 and 4 for first similarity experiment. Features for dimension 4 are given in Table 4-3. 77

93 1.5 g6 g22 g20 1.0 g13 Dimension 5: various features (see table) g36 .5 g32 g19 0.0 g40 g15 g14 g18 -.5 g5 -1.0 -1.5 g4 g28 -2.0 -2.5 -1.5 -1.0 -.5 0.0 .5 1.0 1.5 2.0 Dimension 4: various features (see table) Figure 4-9 MDS plot of dimensions 4 and 5 for first similarity experiment. Features the dimensions represent are given in Table 4-3. Dimension Feature 1 2 3 4 5 Log(aspect) 0.342099977 CONSTANT 2.082550756 1.372313843 0.401207817 1.545189697 2.602023183 Table 4-4 Coefficients of similarity model from first similarity experiment. (Continued) 4.1.5.2 3 Dimensions The three-dimensional MDS analysis fit the reported gesture similarities with stress = 0.154, r2 = 0.814. The coordinates of the gestures in the MDS output space are given in Table 4-5. This configuration is plotted as two 2-dimensional figures, Figures 4- 10 and 4-11. The dimensions were identified by linear regression. 78

Gesture number   Dimension 1   Dimension 2   Dimension 3
4                -0.98229      -1.13674      -0.61719
5                 1.13958      -1.22919       0.46115
6                -1.12452       0.23433      -0.92749
13                0.00475       1.50688      -0.21413
14                1.21324       1.24094       0.09008
15                1.14117       1.22544       0.12895
18                1.30765      -0.58324       1.12053
19               -0.51436       0.52218       2.37394
20               -1.18059      -0.16320      -1.23774
22               -1.16684      -0.43558      -1.14896
28               -0.65028      -1.39429      -0.16012
32               -0.96996       0.09565       1.37945
36                0.80289       1.29138      -0.73480
40                0.97957      -1.17456      -0.51367

Table 4-5 Gesture coordinates from MDS for experiment 1, 3D analysis.

Figure 4-10 Dimensions 1 and 2 of 3D configuration for similarity experiment 1 (dimension 1: Curviness & Angle/distance; dimension 2: Total angle / total absolute angle).

Figure 4-11 Dimensions 2 and 3 of 3D configuration for similarity experiment 1 (dimension 2: Total angle / total absolute angle; dimension 3: Total length).

From the output of MDS we used regression to model the gesture dissimilarities. This simpler model correlated 0.701 (p < 0.0052, 2-tailed t test) with the reported gesture similarities. The coefficients for this model are given in Table 4-6. The constant term is derived from the regression and added to the distance regardless of the values of the features.

Feature                              Dimension 1   Dimension 2   Dimension 3
Total length                                                       0.006024
Curviness                             0.294231
Angle/dist                           47.39248
Total absolute angle / total angle                  -1.14127
CONSTANT                             -2.18976       0.096681     -1.78323

Table 4-6 Coefficients of similarity model for experiment 1 from 3D MDS analysis.

96 4.2 Similarity Experiment 2 The results of the first similarity experiment were encouraging, but we wanted to test the predictive power of our model for new people and different gestures. We also wanted to explore how systematically varying different features would affect perceived similarity. To investigate how feature variations would affect perceived similarity, three new gesture sets of nine gestures each were created. The first set was designed to explore the effect of total absolute angle and aspect (see Figure 4-12). The second set was designed to explore length and area (see Figure 4-13). The third set was designed to explore rotation-related features such as the cosine and sine of initial angle (see Figure 4-14). Figure 4-12 Similarity experiment 2, gesture set 1. It was used to explore absolute angle and aspect. 81

97 Figure 4-13 Similarity experiment 2, gesture set 2. It was used to explore length and area. In addition to examining the effects of particular features, we wanted to determine the relative importance of the features. The most straightforward way to perform this test is to combine all gestures into one big set and have participants look at all possible triads from the combined set. Unfortunately, combining all of these gesture sets into one set results in far too many triads, based on the time per triad taken for the first experiment. To allow us to compare the three sets against one another without a prohibitively large gesture set, two gestures from each of the three gesture sets were chosen and added to a fourth set of gestures (see Figure 4-15). All participants were shown all possible triads from all four gesture sets. 82

Figure 4-14 Similarity experiment 2, gesture set 3. It was used to explore rotation.

4.2.1 Participants

Twenty new participants were recruited from the general student population. As in experiment one, the only requirement was that they be physically able to use the computer and stylus. Each participant was paid $15 (US).

4.2.2 Equipment

The same equipment was used as in the first experiment.

4.2.3 Procedure

The procedure was the same as in experiment one, except that participants saw triads from four gesture sets rather than one. Each participant saw all possible triads of gestures from each set, for a total of 3 × C(9, 3) + C(13, 3) = 3 × 84 + 286 = 538 triads.

4.2.4 Analysis

This experiment was analyzed using the same techniques as the first experiment, MDS and regression. First, a combined analysis was done, using the data from all four gesture sets. The goal of the combined analysis was the same as experiment one: to determine

99 Figure 4-15 Similarity experiment 2, gesture set 4. It includes gestures from Figures 4-12, 4-13, and 4-14. what features were used for similarity judgements and to derive a model for predicting similarity. Many pairwise dissimilarity measures were missing from the data, because not all possible triads of all gestures were presented to the participants. Fortunately, this was not a problem because the INDSCAL version of MDS can accommodate missing data. In addition to the combined analysis, data from each of the first three sets was analyzed independently. The focus of the independent analyses was to determine how the targeted 84

features affected similarity judgements. Set four was not analyzed separately since it did not target any specific features; its purpose was to provide similarity judgements that overlapped the other three sets and allowed their data to be combined.

Figure 4-16 and Figure 4-17 show the stress and r², respectively, for the data for each gesture set and for the combined data. We decided to analyze the individual data sets in 3 dimensions, due to the small number of gestures in each set, and because the knee in many of the curves in these figures falls there. We used 4 dimensions for the combined data set since the combined data set has many more stimuli (i.e., gestures), and because the knee in its curve falls there.

Figure 4-16 Similarity experiment 2, stress vs. dimension in MDS analysis (one curve per gesture set plus the combined data).

Figure 4-17 Similarity experiment 2, r² vs. dimension in MDS analysis (one curve per gesture set plus the combined data).

Lastly, the model derived from experiment 1 was used to predict the perceived similarity of the gestures in experiment 2. These predictions were compared with the similarities reported by participants in experiment 2.

4.2.5 Results

The best number of dimensions for MDS for the experiment 2 data was 3 (stress = 0.08663, r² = 0.89539). Unfortunately, when the data was plotted, the meaning of the dimensions was not as obvious as in experiment 1. (A plot of dimensions 1 and 2 is shown in Figure 4-18.) Table 4-7 shows which features correlate with each dimension, based on a regression analysis. The derived model predicts the reported gesture similarities with correlation 0.71

Figure 4-18 MDS plot of dimensions 1 and 2 for second similarity experiment (combined data; dimension 1: Log(aspect) & Density 1; dimension 2: Total absolute angle & Sine of angle between first and last points).

(p < 0.000003, 2-tailed t test). The coefficients from the regression analysis are given in Table 4-8. Separate analyses of individual gesture sets (shown in Figures 4-12, 4-13, and 4-14) revealed that: 1) bounding box angle is an important feature and 2) alignment or non-alignment with the normal coordinate axes is significant for similarity.

Dimension  Correlated features (in order of decreasing importance)
1          Log(aspect), Density 1
2          Total absolute angle, Sine of angle between first & last points
3          Density 2, Non-subjective openness

Table 4-7 Predictor features for similarity experiment 2 (using data from all experiment 2 gesture sets).

Feature                                        Dimension 1   Dimension 2   Dimension 3
Log(aspect)                                    -0.807630
Density 1                                       0.0488007
Density 2                                                                  -1.66162
Total absolute angle                                          0.126439
Sine of angle between first & last points                    -0.646946
Non-subjective openness                                                    -2.16772
CONSTANT                                       -1.72921      -1.17378      4.53146

Table 4-8 Coefficients for overall similarity model for second similarity experiment.

Analysis of the first gesture set (i.e., the 50 series, shown in Figure 4-12) gave a three-dimensional MDS plot. This gesture set was intended to show the effects of absolute angle and aspect. We found that absolute angle is highly correlated with the first dimension (0.81), and aspect is highly correlated with the second dimension (0.77). Unfortunately, the absolute angles of gestures in this set covaried greatly with the values of several other features, so it was not possible to determine whether absolute angle was significant. Strong covariance with other features was not a problem for aspect. However, bounding box angle correlated even more strongly with dimension two (0.92) than aspect did.

Data from the second gesture set (i.e., the 60 series, shown in Figure 4-13) were surprising. Its analysis was done in four dimensions. It was intended to discover the effect of length and area, but although length and area correlate well with dimension four (0.83 and 0.92, respectively), they are both only weakly correlated with the first three dimensions (highest value of 0.46 for length and dimension 3). Since INDSCAL dimensions are ranked in order of importance, it appears that neither length nor area is a very significant contributor to similarity judgement.

The third gesture set (i.e., the 70 series, shown in Figure 4-14) also provided interesting results. One might expect similarity among gestures that are rotations of one another to be proportional to the amount of rotation, but this was not the case. Instead, the gestures whose lines were horizontal and vertical were perceived as more similar to one another than to those gestures whose components were diagonal. The perceived similarity of

104 gestures whose components are aligned in the same directions is consistent with findings on texture in the vision community [BPR83]. This set was analyzed in five dimensions. There were differences among participants in their similarity judgements. A graph of participants showed a clump with outliers trailing off. We removed these outliers and redid the analysis, but this change did not appreciably improve the MDS model. In this experiment, we derived another model for predicting gesture similarity. The model fit the data well, but it was not as simple as the model from the first similarity experiment. 4.3 Similarity Experiment 3: Pairwise Similarity Survey The previous similarity studies were useful, but all the similarity judgements participants made in those experiments were relative. That is, they were all of the form gestures A and B are more similar than gestures C and D. In addition to a relative metric about gesture similarity, we wanted our gesture design tool to be able to determine when two gestures were very similar to each other. To create an absolute model, we needed absolute similarity judgements. We chose 37 realistic gestures with a wide variety of shapes (see Figure 4-19), for a total of 1332 pairs. We asked participants to make similarity judgements about them by showing a pair at a time. The remainder of this subsection describes this experiment and its results. 4.3.1 Participants We recruited participants from newsgroups related to design, pen-based applications, and pen-based UIs, and from members of our department. We collected similarity judgements from 266 participants. 4.3.2 Procedure The survey was conducted on the web. The first web page gave an introduction to the survey and asked for simple demographic information. It was followed by the actual gesture survey. Participants were shown pairs of gestures; for each pair they were told to judge how similar they were on a scale of 1 (different) to 5 (similar). An example survey question is 89

105 shown in Figure 4-20. Participants were asked to judge as many or as few pairs as they desired, but they were encouraged to judge at least 20. Gesture pairs were randomly distributed among the participants. The survey page contained a quit button so they could stop at any time. 4.3.3 Results and Analysis Two hundred and sixty-six people participated in the survey and they made 5451 similarity judgements in total, for an average of 20.5 judgements per participant. Interestingly, people ranked the gesture pairs as dissimilar much more than similar, as shown in Table 4-9. We predicted the similarity of a pair of gestures g a and g b as follows. We began with the feature values of g a and g b for each feature in the set of 22 that we used in the previous similarity analysis (see Table 4-1). These 22 features were the basic features we used to try to predict similarity. Figure 4-19 The 37 pen gestures used in the similarity web survey. 90

Figure 4-20 Survey web page for third similarity experiment with sample gesture pair.

Similarity rating    # of Responses   % of Responses   Cumulative %
1 (different)        2235             41.0              41.0
2                    1464             26.9              67.9
3                     812             14.9              82.8
4                     617             11.3              94.1
5 (similar)           323              5.9             100.0

Table 4-9 Frequency of similarity ratings for experiment 3.

We wanted to directly account for differences between gestures in a pair, so we also composed the basic features in three different ways to generate three more sets of 22 features each. Let f_i(g) be feature i of gesture g and let g_a and g_b be two gestures in a pair. For each basic feature, we created three composition features using the following equations:

1. The absolute value of the simple difference:

D(f_i) = \lvert f_i(g_a) - f_i(g_b) \rvert

2. The absolute value of the fractional difference:

F(f_i) = \left\lvert \frac{f_i(g_a) - f_i(g_b)}{f_i(g_a) + f_i(g_b)} \right\rvert

3. The log of the absolute difference:1

L(f_i) = \log( \lvert f_i(g_a) - f_i(g_b) \rvert + 1 )

The original features for each of the two gestures plus the features composed in the three ways above gave us 22 × 4 = 88 separate features. All of these features were used as possible predictors in our analysis. Since participants rated similarity on a scale of 1 to 5, we decided that a similarity rating of 4 or 5 indicated similar gestures, and 1-3 indicated not similar gestures.

We analyzed the data using several techniques, initially with all the data collected in the survey. The first technique used was logistic regression, which was chosen because our dependent variable (similar vs. not similar) is binary. Overall, the equation derived from this method was correct 86.6% of the time. Using the same data that was used to create the model, its accuracy predicting not similar gestures was 94.3%; for similar gestures it was 35.5%.2

1. The +1 term is included to avoid an infinite value when the absolute difference is zero.
2. The overall correctness is a weighted average of the two cases and not a simple average, because 4511 gesture pairs were rated not similar and 940 were rated similar.
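As a concrete illustration, the sketch below computes the three compositions D, F, and L for one basic feature of a gesture pair. The names are hypothetical; the original analysis was done in a statistics package, not in code like this.

/** Sketch: the three pairwise feature compositions described above.
 *  fa and fb are the values of one basic feature for the two gestures in a pair. */
public class PairwiseFeatures {

    /** D(f): absolute value of the simple difference. */
    static double simpleDifference(double fa, double fb) {
        return Math.abs(fa - fb);
    }

    /** F(f): absolute value of the fractional difference (assumes fa + fb != 0). */
    static double fractionalDifference(double fa, double fb) {
        return Math.abs((fa - fb) / (fa + fb));
    }

    /** L(f): log of the absolute difference; the +1 avoids log(0). */
    static double logDifference(double fa, double fb) {
        return Math.log(Math.abs(fa - fb) + 1.0);
    }

    public static void main(String[] args) {
        double fa = 0.80, fb = 0.25;   // made-up values of one basic feature
        System.out.println(simpleDifference(fa, fb));       // 0.55
        System.out.println(fractionalDifference(fa, fb));   // about 0.524
        System.out.println(logDifference(fa, fb));          // about 0.438 (natural log)
    }
}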

We also used logistic regression with only the composite features as predictors. It was correct 86.8% overall, 97.5% for not similar pairs, and 35.1% for similar pairs.

The next analysis was multidimensional scaling followed by linear regression, as we previously reported [LLRM00]. Very little data was collected for each participant, so average similarity across participants was used as input to MDS. Its predictions correlated 0.45 with reported similarity (p < 10^-23). The new MDS model was not nearly as good as the 5D model from the first experiment (presented in 4.1.5, p. 73), which correlated 0.782 with its own data. We hypothesize that this is because the previous study had a full data set for each participant, while this study had few data points per participant.

Finally, we used linear regression using all the features as predictors. It was correct 85.8% overall, 100% for not similar pairs, and 17.4% for similar pairs.

Logistic regression gave the most promising results, so we tested it on novel data. We randomly chose 10% of the participants and independently chose 10% of the gestures and separated all 1484 judgements related to either the evaluation participants or the evaluation gestures (32.5% of the total judgements) into an evaluation data set. Using the remaining data, we built models using logistic regression with either all features or with pairwise features only (derived using D(f_i), F(f_i), and L(f_i), described above) and compared their predictions with the evaluation data. The better model, which used only pairwise features, had accuracies of 87.7% overall, 99.8% for not similar pairs, and 22.4% for similar pairs. The discriminant function for this model, using the composition functions defined above, is shown in Figure 4-21. Two gestures are considered to be similar if the discriminant function applied to them is greater than or equal to 0.5.

The logistic regression model assumes that similar and not similar gestures are distributed with the same frequency as in the experiment (i.e., 940 similar vs. 4511 not similar, or 17.2% similar and 82.8% not similar). If this distribution is different, the model may perform differently. Figure 4-22 is a plot of the likelihood of the model being correct for different percentages of similar gestures in the population. As the left side of the graph shows, if the fraction of similar gestures is very low, similar gestures will not be predicted well.
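The classification mechanism described above (compute the pairwise features, evaluate the discriminant, and compare it with 0.5) can be sketched as follows. The weights and feature names here are placeholders, not the published coefficients from Figure 4-21; only the mechanism is illustrated.

import java.util.Map;

/** Sketch: applying a logistic-regression-style discriminant to a gesture pair.
 *  Placeholder weights; not the actual coefficients or the quill code. */
public class SimilarityDiscriminant {
    private final Map<String, Double> weights;   // pairwise-feature name -> coefficient

    public SimilarityDiscriminant(Map<String, Double> weights) {
        this.weights = weights;
    }

    /** pairwiseFeatures holds the D, F, and L compositions keyed by the same names. */
    public double discriminant(Map<String, Double> pairwiseFeatures) {
        double sum = 0;
        for (Map.Entry<String, Double> w : weights.entrySet()) {
            Double value = pairwiseFeatures.get(w.getKey());
            if (value != null) {
                sum += w.getValue() * value;
            }
        }
        return sum;
    }

    /** The rule stated above: a pair is flagged as similar at or above 0.5. */
    public boolean predictedSimilar(Map<String, Double> pairwiseFeatures) {
        return discriminant(pairwiseFeatures) >= 0.5;
    }
}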

discrim(g_i, g_j) =
  0.498 D(f_4 [angle of bounding box])
  + 0.0935 D(f_15 [density 1])
  - 0.3536 D(f_6 [cos(angle between first and last points)])
  - 0.0079 D(f_5 [distance between first and last points])
  - 1.8936 D(f_22 [log(aspect)])
  - 0.3006 F(f_13 [curviness])
  - 1.0105 F(f_15 [density 1])
  + 0.0003 F(f_2 [sin(initial angle)])
  - 0.2606 L(f_3 [size of bounding box])
  - 0.7285 L(f_13 [curviness])
  - 0.3862 L(f_1 [cos(initial angle)])
  - 1.1698 L(f_2 [sin(initial angle)])
  + 2.8473 L(f_22 [log(aspect)])
  - 0.5458 L(f_11 [sharpness])
  - 0.5261 L(f_9 [total angle])

Figure 4-21 The discriminant function for predicting human-perceived similarity.

Conversely, if the fraction of dissimilar gestures in the population is very low, dissimilar gestures will not be predicted well, as shown by the right side of the graph. Fortunately, the model has good overall accuracy for any distribution of similar vs. not similar gestures, with a minimum value of just below 80% when there are approximately equal numbers of similar and dissimilar gestures in the population (which corresponds to the middle of the graph).

The single most significant feature was L(f_2), the log of the absolute difference between the sines of the initial angles of the two gestures. Figure 4-23 shows examples of high, medium, and low sine of initial angle.

Figure 4-23 The single most significant feature: sine of initial angle (example gestures with high, medium, and low values; the start direction is marked).

Another way to look at the model is to determine how likely it is to be correct, given two novel gestures and its prediction about them. Assuming the same distribution of similar

vs. not similar gestures as in our survey, and using the accuracy of our best model on its evaluation data, we can predict how likely the model is to be correct. When it predicts that a pair is similar, it will be right 95.9% of the time. When it predicts that a pair is not similar, it will be right 86.1% of the time.

Our motivation for developing this model is so that our gesture design tool can warn designers about gestures that people are likely to perceive as similar. For this purpose, it is beneficial that our model predicts not similar pairs very well (99.8%) and so generates very few false positives of predicted similarity (100% - 99.8% = 0.2%). It does not predict similar pairs as well (22.4%), so it has a high miss rate (100% - 22.4% = 77.6%). However, we believe that for a gesture design tool it is better to miss a possible similarity than to identify a perception that does not exist.

Figure 4-22 Accuracy of logistic regression similarity model as a function of the percentage of similar gestures in the population (separate curves for similar pairs, not similar pairs, and overall accuracy).
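The 95.9% and 86.1% figures quoted above can be checked with Bayes' rule, using the survey base rate (940 of 5451 pairs, about 17.2%, rated similar) and the best model's per-class accuracies (22.4% for similar pairs, 99.8% for not similar pairs). The arithmetic below is a worked version of that computation, not a calculation taken from the original analysis.

% Worked check of the conditional accuracies, via Bayes' rule.
% Base rate: P(sim) = 940/5451 \approx 0.172, so P(not) \approx 0.828.
% Per-class accuracies: P(pred sim | sim) = 0.224, P(pred not | not) = 0.998.
\begin{align*}
P(\text{sim} \mid \text{pred sim})
  &= \frac{P(\text{pred sim} \mid \text{sim})\,P(\text{sim})}
          {P(\text{pred sim} \mid \text{sim})\,P(\text{sim})
           + P(\text{pred sim} \mid \text{not})\,P(\text{not})} \\
  &= \frac{0.224 \times 0.172}{0.224 \times 0.172 + 0.002 \times 0.828}
   \approx 0.959 \\[4pt]
P(\text{not} \mid \text{pred not})
  &= \frac{0.998 \times 0.828}{0.998 \times 0.828 + 0.776 \times 0.172}
   \approx 0.861
\end{align*}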

4.4 Discussion

This section discusses the results of the experiments described above and the challenges involved in designing the experiments and analyzing their results.

4.4.1 Results

Human perception of similarity is very complicated, even for simple shapes [LS91]. Shapes like pen gestures can be viewed as similar or dissimilar based on many different perceptual cues. In the face of this difficulty, we are pleased at how well our model predicts similarity. The correlation of the similarity models with the data from the two experiments is shown in Table 4-10, with significance levels. The best is the 5D model, which correlates 0.782 with its own data (p < 0.00095, 2-tailed t test), 0.589 with experiment 2 data (p < 0.0002, 2-tailed t test), and 0.387 with experiment 3 data (p < 8.96 x 10^-24, 2-tailed t test).

The model built from experiment three data is unlike the models from experiments 1 and 2 because it produces a binary (i.e., similar or not similar) result instead of a continuously-valued similarity estimate. Also, experiments 1 and 2 collected relative

           Correlation                        p
Model      Expr. 1   Expr. 2   Expr. 3(a)     Expr. 1    Expr. 2    Expr. 3(b)
1 (5D)     0.782     0.589     0.387          0.000947   0.000250   8.96 x 10^-24
1 (3D)     0.701     0.507     0.350          0.00524    0.00220    8.96 x 10^-24
2          0.466     0.449     0.339          0.0932     0.00771    8.96 x 10^-24

Table 4-10 Summary of similarity models. Shows how each model correlates with data from each experiment. The model from experiment 3 is not shown because it produces a similar/not similar answer, so correlation is not appropriate.
a. Unlike the experiment 1 and 2 data, higher numbers in the experiment 3 data mean more similar instead of less. Thus these correlations are negative.
b. Experiment 3 collected many data points (5451), so even a low correlation is highly statistically significant. The fact that all three values are identical probably means that this value is the lowest our statistical tool (SPSS) could report.

similarity judgements, not absolute ones, as in experiment 3. Therefore we cannot directly test the model from experiment 3 against the data from experiments 1 and 2. However, we would expect that gesture pairs that model 3 predicts as similar would be rated more similar by people than gesture pairs that model 3 predicts as dissimilar.

We tested this hypothesis as follows. First, we divided all pairs of gestures from experiments 1 and 2 into a similar group and a not-similar group, based on the predictions of model 3. Next, for each pair of gestures, we tabulated the similarity ratings derived directly from the data from experiments 1 and 2. We then compared the average similarity ratings of the similar group with the average similarity rating of the not-similar group, using a t test (2-tailed). The results of this comparison are given in Table 4-11. As it shows, gestures predicted as similar by model 3 were rated significantly differently from gestures predicted to be non-similar by model 3.

Statistic | Experiment 1 Avg. | Experiment 1 Std. dev. | Experiment 2 Avg. | Experiment 2 Std. dev.
Relative dissimilarity rating for non-similar | 8.024 | 3.581 | 6.30 | 2.82
Relative dissimilarity rating for similar | 2.185 | 3.799 | 2.68 | 2.67
Significance of difference (p) | 1.05 x 10^-21 (Experiment 1) | 1.65 x 10^-23 (Experiment 2)

Table 4-11 Evaluation of similarity model 3 using data from experiments 1 and 2. For each experiment, there is a significant difference in dissimilarity rating reported by participants between gestures that model 3 predicted as similar vs. non-similar.

In addition to providing us with model 3, experiment 3 also allowed more effective use of model 1 because it allowed us to determine a threshold similarity value. For our tool to use model 1 to predict when gestures are too similar, we needed to know what threshold value for model 1 means too similar. We computed the value of model 1 for all pairs of gestures in experiment 3. We then used logistic regression to get an equation for predicting when gestures are similar (i.e., when the reported similarity is 4 or 5, as in the creation of model 3) with model 1 similarity as the only predictor. Based on this regression, we

113 determined that the threshold value for model 1 is 0.3487. That is, any two gestures whose model 1 similarity is less than this threshold will probably be perceived as similar by people. Model 1 and model 3 can be used together to detect similar gestures, by considering a pair similar if either model says it is similar. This algorithm was evaluated using all the data from experiment 3, and its predictions are quite close to those of model 3 alone. It is correct overall 86.5%, 99.3% for not similar pairs, and 25.4% for similar pairs. It is probably a better metric to use than model 3 alone, because we believe the tiny drop in overall correctness is offset by greater robustness due to being based on more data and on different types of models. We were pleased to find that a small number of features explain the three most salient dimensions. In experiment 1, we saw that dimensions one through three can be predicted based on only two features each. Several possible explanations exist for the larger number of features needed for dimensions four and five. One is that the underlying perceptual model is complex. Another explanation is that the gesture set used in the experiment was not complex enough or did not vary in the right ways to illuminate those dimensions. It was surprising to us that neither length nor area were significant factors in experiment 1, so the 60 series (Figure 4-13) in experiment 2 was designed to investigate the effect of these two features. Experiment 2 confirmed the results of experiment 1; neither length nor area was a significant feature for similarity. To validate the models produced by the first two experiments, each model was used to predict the similarities between all pairs of gestures used in the other experiment. These predictions were compared with the reported similarities from the other experiment. The correlation between the prediction of experiment 1 and the data from experiment 2 was 0.56 (p < 0.0005, 2-tailed t test). The correlation between the prediction of experiment 2 and the data from experiment 1 was 0.51 (p < 0.058, 2-tailed t test). Based on these correlations, the model derived from experiment 1 is a slightly better predictor of gesture similarity than the model from experiment 2. Our results are consistent with Attneaves study of simple geometric shapes [Att50], which found that sometimes the logarithm of basic features was a better predictor of 98

114 similarity than the feature itself. We found that the logarithm of aspect had more influence on similarity than aspect itself. Also, the range of distances among feature values of our gestures was large and did not combine linearly, as shown by the better fit of the Euclidean distance metric over the city-block metric. 4.4.2 Design and Analysis The primary challenge in designing the similarity and memorability experiments was creating good stimuli (i.e., gesture sets). For the first similarity experiment, we wanted the stimuli to span the perceptual feature space. However, this goal was difficult to achieve because we did not know the structure of the perceptual feature space in advance. We culled gestures from an informal survey of colleagues and from the gdt experiment (see Chapter 3) in an attempt to create a well-rounded gesture set. For the second similarity experiment, we wanted gestures that varied with respect to particular features. gdt, our first gesture design tool was modified to display values for these features, but the process was still difficult. In particular, some of the features we wanted to investigate covaried with other features, which made the results difficult to interpret. For the third similarity experiment, we went back to a general set of gestures. It was created by taking two gestures from each of the other three and then adding a few more gestures to give the set a wide variety of gesture shapes. We were concerned at the outset that developing a model for similarity would be complicated by differences among the participants. However, in spite of the individual differences, the model from experiment 1 does have predictive power. Although analyzing different groups of participants separately was not useful for our data, more data might make it feasible to create multiple models, each of which models a subset of users well. In that case, a gesture design tool could use multiple similarity metrics and notify the designer about similar gestures along any metric. The designer may want two gestures to be similar or dissimilar, depending on the semantics of the operations they are used for. The first two similarity experiments each resulted in a model for similarity, and they are different. It is difficult to say which is better, but we think the model from experiment one 99

115 is slightly preferable. It predicts the data from the other experiment slightly better than experiment two predicts its data. Also, it uses more features, and thus may capture more about the underlying psychological model. The model from experiment 3 is different from the other two models in two ways: 1) it predicts absolute similarity instead of relative similarity and 2) it only gives a boolean result instead of a continuous-valued similarity metric. Model 3 is not very helpful for making gestures more or less similar because it can only distinguish relative similarity very coarsely (i.e., similar vs. not similar) whereas the other two models provide a continuous value for similarity. For example, suppose a designer changes two gestures slightly to make them either more or less similar. Models 1 and 2 can tell the designer whether the gestures are more or less similar than they were before, but model 3 is likely to give the same boolean answer it gave before (i.e., similar or not similar). However, for warning about when two gestures are too similar in an absolute sense, model 3 is probably the best one since it is based on absolute similarity data. Models 1 and 3 are used in quill to predict when gestures will be perceived by people as similar, so that quill can warn the designer. We found MDS to be useful, but also limited. It was extremely helpful in the early stages of analyzing experiment one, when we had little idea of what features might affect similarity. It inspired us to invent several significant features, including curviness, aspect, and density. Another potential benefit of MDS is the ability to analyze differences in participants, which are discussed above. Although it was useful for discovering candidate predictors for similarity, our use of MDS was qualitative. For the quantitative analysis, when we needed to create a predictive model, we used standard linear regression (experiments 1 and 2) or logistic regression (experiment 3). 4.5 Summary Gesture set designers may want their gestures to be similar or dissimilar depending on the semantics of the operations. We have shown that perceptual similarity of gestures is correlated with well-defined computable features such as curviness. With these features, we have derived a computable, quantitative model for perceptual similarity of gestures that correlates 0.56 with reported similarity (model 1). Using our model, we can predict 100

how similar people will perceive gestures to be relative to one another. We also have a model that can predict in absolute terms whether people will perceive gestures as similar or not, with an accuracy of 87.7% (model 3). We expect similarity predictions to be a useful addition to our gesture design tool. Our models and our experiences of experimental design and analysis should provide an excellent starting point for further investigation into gesture similarity. When integrated with our gesture design tool, our models will allow designers to create gestures that are less confusing to users.
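Before moving on, it is worth making the intended use of these models concrete. Below is a minimal sketch (not quill's actual code) of the combined check described in 4.4.1: a pair is flagged as perceptually similar if model 1's similarity value falls below the 0.3487 threshold or if model 3 classifies the pair as similar. The Gesture type and the method names are placeholders.

    /** Placeholder for a gesture and its computed geometric features. */
    class Gesture { /* captured points, computed features, ... */ }

    class PerceptualSimilarityChecker {
        /** Threshold below which model 1 similarity means the pair is probably perceived as similar. */
        private static final double MODEL1_THRESHOLD = 0.3487;

        /** Continuous similarity from model 1 (smaller means more similar); implementation not shown. */
        double model1Similarity(Gesture a, Gesture b) { throw new UnsupportedOperationException(); }

        /** Boolean similar/not-similar prediction from model 3; implementation not shown. */
        boolean model3PredictsSimilar(Gesture a, Gesture b) { throw new UnsupportedOperationException(); }

        /** A pair is flagged if either model considers it similar (the rule described in 4.4.1). */
        boolean shouldWarn(Gesture a, Gesture b) {
            return model1Similarity(a, b) < MODEL1_THRESHOLD || model3PredictsSimilar(a, b);
        }
    }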

117 Chapter 5 Gesture Memorability Experiment A useful feature for a gesture design tool would be the ability to advise the designer when gestures may be difficult for users to learn and remember. We designed and conducted an experiment to investigate gesture memorability. The experiment had two goals: 1. Learnability. To determine whether gestures are easier to learn and remember if similar operations are assigned dissimilar or similar gestures. 2. Memorability. To determine what geometric factors influence gesture learnability and memorability. As with the similarity experiment, we wanted to formulate an equation for estimating learnability and memorability based on computable features of the gestures. To summarize the above goals: we wanted to discover the effects of 1) gesture grouping and 2) gesture geometry on gesture learning and recall. Unfortunately, the learnability and memorability of many named items is often strongly dependent on how well the name and the item go together. In our work, we refer to this property as iconicness. Since iconicness is dependent in part on the meaning of the names, it is outside the scope of this work. We wanted to attempt to separate its effects from the gesture grouping and geometry. We used two tactics for disentangling iconicness from the grouping and geometry results we wanted. The first tactic was to run the experiment with three different mappings between gestures and gesture names/commands (shown in Table 5-1). The first set of mappings (the iconic set) was designed to have each gesture assigned to the command best suited for it (i.e., most iconic) and to have similar gestures assigned to similar commands. For example, it contains several natural groups of commands, such as open file and new file, and nudge up, nudge down, nudge right, and nudge left. Commands within these groups were assigned similar gestures. The second set of mappings (the non- iconic set) was designed to be not as iconic as the first, but to use the same gestures and the same commands, and to keep similar gestures with similar commands. To satisfy these constraints, we reassigned the commands to gestures by group to keep similar gestures 102

118 with similar commands. For example, the similar commands bring to front and send to back were assigned to similar gestures in the first mapping and were assigned to two different similar gestures in the second mapping. The third set of mappings (the random set) was randomly chosen by the computer. Our second tactic to disentangle iconicness from grouping and geometry was to include a questionnaire at the end of the experimental session that asked each participant for their subjective judgement of the iconicness of the mappings they saw. We knew that the memorability of an object is often influenced by its name, so we wanted to collect information about how names influenced memorability of our gestures. We wanted to make the command and gesture sets coherent, so they were all chosen to be operations for a drawing program (e.g., MacDraw, xfig, and Microsoft PowerPoint). We created a list of drawing program commands that one might want to invoke with a gesture and asked members of our research group to each independently create what they thought would be a good gesture for each command. Gestures for the experiment were then chosen from those created by the research group, by participants in the gdt experiment (see Chapter 3), and by the author. Commands were chosen by the author. The list of commands is shown in Figure 5-1. All three sets of gesture-command mappings used the same gestures and the same commands. The gestures used and how they mapped to commands in each of the three conditions are shown in Table 5-1. The following subsections describe different aspects of the experiment in detail. First, the participants are described, followed by the equipment. Then, the experimental procedure is given. Next, the analysis methods are described, followed by the results. Finally, the experiment and its results are discussed. 5.1 Participants We recruited participants from the general U.C. Berkeley student population. The only restriction was that they be able to physically perform the experiment. 103

119 bring to front Bring an object to the front of the stack (in front of all other objects). send to back Send an object to the back of the stack (behind all other objects). nudge up Move an object a small amount up. nudge down Move an object a small amount down. nudge right Move an object a small amount right. nudge left Move an object a small amount left. group Make a new object that combines several objects. ungroup Change the group back into its component objects. change color Change the color of an object. thicker lines Make the lines of an object thicker. thinner lines Make the lines of an object thinner. bigger text Make the text of an object bigger. smaller text Make the text of an object smaller. zoom in Zoom the view in (more magnified). zoom out Zoom the view out (less magnified). show/hide grid Show or hide the grid lines. open file Open a file. save file Save the current file. new file Make a new file. flip vertical Flip an object vertically. flip horizontal Flip an object horizontally. rotate Rotate an object. cut Put an object on the system clipboard. copy Copy an object to the system clipboard. paste Copy an object from the system clipboard. delete Delete an object. duplicate Copy an object onto the canvas. undo Undo the previous action. redo Redo a previously undone action. align left Align some objects along their left edges. align right Align some objects along their right edges. align top Align some objects along their top edges. align bottom Align some objects along their bottom edges. scroll up Scroll the view up. scroll down Scroll the view down. scroll right Scroll the view right. scroll left Scroll the view left. Figure 5-1 Commands for memorability experiment. 5.2 Equipment The experiment used the same equipment as in the similarity experiment (see 4.1.2, p. 65). 104

Case 1 (more iconic) | Case 2 (less iconic) | Case 3 (random)
bring to front | thinner lines | undo
send to back | thicker lines | scroll left
nudge up | align top | align left
nudge down | align bottom | align right
nudge right | align right | group
nudge left | align left | paste
group | smaller text | new file
ungroup | bigger text | bigger text
change color | show/hide grid | redo
thicker lines | zoom out | bring to front
thinner lines | zoom in | thicker lines
bigger text | bring to front | copy
smaller text | send to back | send to back
zoom in | save file | nudge up
show/hide grid | new file | open file
open file | cut | duplicate
new file | copy | ungroup
flip vertical | delete | smaller text
flip horizontal | duplicate | align top
rotate | change color | thinner lines
cut | rotate | flip horizontal
copy | undo | zoom in
paste | redo | delete
delete | group | scroll right
duplicate | ungroup | zoom out
undo | flip horizontal | nudge left
redo | flip vertical | nudge down
align left | scroll left | cut
align right | scroll right | scroll down
align top | scroll up | change color
align bottom | scroll down | scroll up
zoom out | open file | flip vertical
scroll up | nudge up | nudge right
scroll down | nudge down | save file
scroll right | nudge right | show/hide grid
scroll left | nudge left | align bottom
save file | paste | rotate

Table 5-1 Memorability experiment gestures, along with the command mappings for the three experimental conditions. (Each row corresponds to one gesture; the gesture drawings themselves are not shown.)

5.3 Procedure

Unlike in the similarity experiment, each participant was asked to participate on two different days, one week apart. On the first day, they were taught the gestures to a predetermined level of proficiency, and on the second day their recall was tested, and they relearned the gestures to the previous level. Also, at the end of the experiment on the second day, they filled out questionnaires about their demographics and the experiment. The two subsections below describe the procedure for both days.

5.3.1 First Day

Participants were asked to sign a consent form. They were then given an overview of the experiment. They were shown the program used in the experiment and it was explained to them. Participants were shown a list of the commands to be used in the experiment that explained the meanings of the commands.

126 Figure 5-2 Screen that the participant sees while waiting for the experimenter to judge a gesture. The first experimental phase was an introduction to the gestures and the experimental software. The computer showed participants each command with its gesture, and participants drew each gesture on the display tablet. In this introductory phase, the participants gestures were not marked right or wrong. The second phase was the training phase. The participant sat at one computer and the experimenter sat at another computer in the same room, so that the participant was facing away from the experimenter. The goal of training was for the participant to learn all the gestures to a predetermined level. The computer showed the participant all the commands, one at a time, in a random order. For each command, the participant drew what they thought was the correct gesture for that command and pressed the Ok button (or just pressed Ok if they had no idea what the correct answer was), as shown in Figure 5-2. The experimenters computer then displayed the correct answer along with what the participant drew (see Figure 5-3). The experimenter marked the answer right or wrong, unless the participant did not draw anything, in which case the computer could automatically mark it wrong. If it was right, the computer simply went on to the next command. If it was wrong, the computer showed the participant the correct answer for five seconds before going on to the next command (see Figure 5-4). 111

127 Figure 5-3 Screen that the experimenter uses to judge gestures. Figure 5-4 Screen showing participant the correct gesture after an incorrect answer. After showing all the commands, the computer randomly shuffled the gestures and started the process again. Once a gesture was remembered four times in a row, it was removed from the training set and not shown again during training. The third phase was the test phase. Participants were shown each command once and asked to draw the corresponding gesture. As in the training phase, the experimenter marked the answers right or wrong on a separate computer, but participants were not told whether their answers were correct or not. 112
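The training-to-criterion loop just described (show the remaining commands in random order, mark each answer, show the correct gesture after a wrong answer, and retire a command after four consecutive correct answers) can be sketched as follows. This is only an illustration of the control flow, not the software used in the experiment; the Judge interface stands in for the experimenter's judging screen.

    import java.util.*;

    class TrainingPhase {
        private static final int CRITERION = 4; // consecutive correct answers required

        /** Runs training until every command has been answered correctly four times in a row. */
        void run(List<String> commands, Judge judge) {
            Map<String, Integer> streak = new HashMap<>();
            List<String> remaining = new ArrayList<>(commands);
            while (!remaining.isEmpty()) {
                Collections.shuffle(remaining);              // random order each round
                Iterator<String> it = remaining.iterator();
                while (it.hasNext()) {
                    String command = it.next();
                    boolean correct = judge.participantDrewCorrectGesture(command);
                    if (correct) {
                        int consecutive = streak.merge(command, 1, Integer::sum);
                        if (consecutive >= CRITERION) {
                            it.remove();                     // learned: drop from further training
                        }
                    } else {
                        streak.put(command, 0);              // streak broken
                        judge.showCorrectGesture(command);   // display the correct answer for five seconds
                    }
                }
            }
        }
    }

    /** Stand-in for the experimenter's judging user interface. */
    interface Judge {
        boolean participantDrewCorrectGesture(String command);
        void showCorrectGesture(String command);
    }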

5.3.2 Second Day

The purpose of the second day was to determine how well participants remembered the gestures after one week and to measure how easily they were able to relearn the gestures to the previously tested level. Thus, the fourth phase in the experiment was a retest, which was identical to the third phase from the first day. The fifth and last phase of the experiment proper was relearning. Participants once again learned the gestures until they could correctly remember them all four times in a row, as in the previous training phase (the second phase).

After the relearning phase, participants were shown a questionnaire about the gesture-command mappings. They were shown a web page with all the gestures and their commands and asked to rate each one on a scale of one to five according to how well the gesture went with the command, where one was the worst and five was the best. An example question from this survey is shown in Figure 5-5. This survey provided a subjective measure of iconicness. Finally, participants filled out another web form that asked about their overall experience of participating in the experiment and about basic demographic information such as their age and major (see Appendix D).

Figure 5-5 Example iconicness question. The questionnaire reads: "In this section, rate each gesture/command pair based on how well you think they go together. Higher numbers mean they go together better than lower numbers. There are no right or wrong answers." The first item, "1. bring to front", is followed by radio buttons labeled 1 through 5.

5.4 Analysis

We sought to answer several questions from our analysis:
1. Was there a difference in learnability and/or memorability among the three sets of mappings?
2. Are iconic gestures more easily learned or remembered than non-iconic ones?
3. Does the similarity metric derived from the first similarity experiment (4.1, p. 64) predict which gestures were confused with one another?¹
4. Can we derive equations to predict learnability and memorability based on computable geometric properties of the gestures?

¹ This memorability experiment took place before the third similarity experiment (see 4.3, p. 89).

To answer question 1, we used a t test. For questions 2 and 3 we used correlation. For question 4 we used linear regression to derive equations for learnability and memorability based on geometric features.

For question 4, we wanted not just equations that modeled our data, but equations that could predict the learnability and memorability of novel gestures. To test the predictive capabilities of the equations, we removed a random segment of the experimental data to form a test set. Specifically, for each of the three name-gesture mappings, two participants were chosen at random (of 12, 13, and 3 participants, respectively). The test data was not used to build the model; instead, we attempted to predict the values in the test set using the model built with the remainder of the data.

One of the variables we wanted to measure was the false positive rate, that is, how often each gesture was incorrectly recalled by the participants. The raw experimental data said whether each participant's answer was correct or not and whether it was blank or not, but it did not say what the correct answer was, because it would have slowed down the experiment too much if the experimenter had to classify each incorrect answer during the experiment. That is, if the participant did not draw anything, that was recorded as a blank answer. If the participant drew something that was incorrect, all that was recorded was that it was incorrect, but not what it was. Therefore, after the experiment, we manually examined every non-blank incorrect answer for all participants. We classified each wrong

answer as either another gesture or as none of the gestures. We also recorded if the participant drew the gesture rotated, flipped horizontally or vertically, or backwards compared to the actual gesture in the set.

5.5 Results

A number of participants did not participate in all phases of the experiment. Several who participated in the first day did not return for the second day (despite email or phone reminders by the experimenter). The participant breakdown by mapping and by phase is given in Table 5-2.

Mapping | Day 1: Learning | Day 1: Test | Day 2: Retest | Day 2: Relearning
1 (more iconic) | 17 | 16 | 12 | 12
2 (less iconic) | 18 | 16 | 13 | 13
3 (random) | 8 | 7 | 3 (a) | 3 (a)
Total | 43 | 39 | 28 | 28

Table 5-2 Number of memorability experiment participants by mapping and phase.
a. Even though 3 is a small number of participants, we found statistically significant differences between the random mapping and the other two. Because of this finding and because of the difficulty of recruiting and retaining participants for such a lengthy experiment (approximately 2 hours for the first day), we decided we did not need more participants for the random mapping.

We measured many different variables in the experiment:
1. Training time (per gesture)
2. Number of training trials needed (per gesture)
3. Percent of gestures correct in test (per gesture)
4. Percent of gestures correct in retest (per gesture)
5. Number of retraining trials (per gesture)
6. Subjective iconicness of gesture-name mappings (per gesture), on a scale of 1 to 5

7. Number of misses (for each gesture, number of times, on average, that a participant did not choose it when it was the correct choice)
8. Number of bad hits (for each gesture, number of times, on average, that a participant chose it instead of the correct answer)

We compared the three name-gesture mappings with one another across the above variables. The only significant differences between mappings 1 and 2 were test accuracy and number of misses. There were significant differences between mappings 1 and 3 and between 2 and 3 for many of the variables. The data are summarized in Table 5-3.

It is interesting that the number of retraining trials needed was very close to the minimum possible (i.e., four) for all three mappings. The lower bound on retraining trials may explain why there was no significant difference of this metric between mappings 1 and 3, and only a marginally significant difference between mappings 2 and 3.

Variable | Avg. 1 (more iconic) | Avg. 2 (less iconic) | Avg. 3 (random) | Avg. overall | p 1 vs. 2 | p 1 vs. 3 | p 2 vs. 3
1 Training time per trial (seconds) | 8.43 | 8.70 | 19.2 | 10.6 | | 3.10 x 10^-11 | 1.38 x 10^-10
2 Training trials | 6.36 | 6.59 | 9.61 | 7.08 | | 1.54 x 10^-11 | 2.67 x 10^-9
3 Test % | 95.3 | 86.7 | 86.5 | 90.1 | 0.000757 | 0.000000471 |
4 Retest % | 69.5 | 70.5 | 49.3 | 67.4 | | 0.000161 | 0.0000210
5 Retraining trials | 4.07 | 4.04 | 4.31 | 4.09 | | 0.127 | 0.0695
6 Iconicness | 3.69 | 3.73 | 2.69 | 3.64 | | 0.0000134 | 0.00000218
7 Misses | 2.30 | 2.67 | 10.1 | 3.51 | 0.0391 | 6.10 x 10^-14 | 6.04 x 10^-13
8 Bad hits | 1.74 | 1.65 | 6.11 | 2.29 | | 6.11 x 10^-9 | 4.74 x 10^-9

Table 5-3 Summary of memorability experiment. The left side gives average values. The right side gives p values where there is a significant difference, or a blank if there is no significant difference (p > 0.1).
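The comparisons above use t tests, and the analyses that follow report Pearson correlations between per-gesture measures (for example, iconicness versus misses). The actual analyses were run in a statistics package (SPSS is named in Chapter 4); the small helper below is only a hypothetical illustration of the correlation computation itself.

    /** Pearson product-moment correlation between two equal-length arrays of per-gesture measures. */
    final class Stats {
        static double pearson(double[] x, double[] y) {
            if (x.length != y.length || x.length < 2) {
                throw new IllegalArgumentException("need two equal-length samples");
            }
            double meanX = mean(x);
            double meanY = mean(y);
            double cov = 0.0, varX = 0.0, varY = 0.0;
            for (int i = 0; i < x.length; i++) {
                double dx = x[i] - meanX;
                double dy = y[i] - meanY;
                cov += dx * dy;
                varX += dx * dx;
                varY += dy * dy;
            }
            return cov / Math.sqrt(varX * varY);
        }

        private static double mean(double[] values) {
            double sum = 0.0;
            for (double v : values) {
                sum += v;
            }
            return sum / values.length;
        }
    }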

Correlation with iconicness per mapping:

Variable | Mapping 1 | Mapping 2 | Mapping 3 | Overall
1 Training time | -0.794 | -0.729 | -0.593 | -0.772
2 Training trials | -0.768 | -0.741 | -0.596 | -0.793
3 Test % | 0.402 | 0.132 | 0.217 | 0.232
4 Retest % | 0.759 | 0.733 | 0.444 | 0.796
5 Retraining trials | 0.703 | 0.597 | 0.401 | 0.715
6 Iconicness | n/a | n/a | n/a | n/a
7 Misses | 0.835 | 0.780 | 0.585 | 0.810
8 Bad hits | 0.643 | 0.703 | 0.265 | 0.578

Table 5-4 Correlations of iconicness with other variables. All values are significant at p < 0.015 except italicized values, which are not significant. See Table 5-5 for exact p values.

Significance of iconicness correlation (p) per mapping:

Variable | Mapping 1 | Mapping 2 | Mapping 3 | Overall
1 Training time | 4.40 x 10^-9 | 3.09 x 10^-7 | 0.000111 | 2.24 x 10^-8
2 Training trials | 2.87 x 10^-8 | 1.60 x 10^-7 | 0.0000995 | 4.80 x 10^-9
3 Test % | 0.0136 | | |
4 Retest % | 5.08 x 10^-8 | 2.44 x 10^-7 | 0.00585 | 3.84 x 10^-9
5 Retraining trials | 0.00000125 | 0.0000945 | 0.0140 | 6.61 x 10^-7
6 Iconicness | n/a | n/a | n/a | n/a
7 Misses | 1.36 x 10^-10 | 1.30 x 10^-8 | 0.000143 | 1.29 x 10^-9
8 Bad hits | 0.0000176 | 0.00000124 | | 0.000180

Table 5-5 Significance of correlation of iconicness with other variables. Blank p values indicate the correlation is not statistically significant. See Table 5-4 for correlations.

We found that iconicness was significantly correlated with most performance-related variables in most conditions. Table 5-4 shows the correlations and Table 5-5 shows the significance levels as t test p values. These strong correlations indicate a relationship of iconicness with learnability and memorability.

We expected that the similarity of gestures, as measured by our metric derived from the first similarity experiment (see 4.1.5, p. 73), would correlate with their confusability in

the memory task. To test this hypothesis, we created a confusion matrix by counting how many times each gesture was misremembered as every other gesture. We then computed the correlation of this matrix with the gesture similarities computed by model 1, and found that the similarity correlated with the confusability 0.19, 0.19, and 0.18 for mappings 1, 2, and 3, respectively, and 0.19 overall. These correlations are not statistically significant. This may be because of the strong effect of iconicness.

We also wanted to predict learnability and memorability based on gesture geometry. Equations were derived to predict several features: misses, bad hits, retest accuracy, training trials, and training time. As described above, we removed the data from six participants and used the remaining data to derive the equations, using linear regression. Data used to derive the model are the model data. Data excluded from the model derivation and used to evaluate the model are the test data. The derived coefficients for the equations are given in Table 5-6. The correlation and statistical significance of the match between the model and a) the data used to make the model and b) the test data are given in Table 5-7. As shown in the table, misses in the test data were predicted well, and bad hits and number of training trials were predicted fairly well. Unfortunately, the most important metric, the retest accuracy rate, was not predicted well by the model, nor was training time.

Coefficients (columns: Misses, Bad Hits, Retest %, Learning trials, Learning time):
(Constant): 1.304, 1.571, 89.29, 4.579, 4678
cos(initial angle): 2.102, 17.96, 1.839, 4655
cos(ending angle): 2.184, 18.71, 1.953, 4775
total length: 0.006053, 13.42
total angle: 0.1244, 1.137, 0.1145, 276.5
total absolute angle: 0.1604, 0.1108

Table 5-6 Coefficients for learnability and memorability predictions. (See 3.1.1.1, p. 39, for a description of the features.) Not every feature has a coefficient in every model.
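Each column of Table 5-6 defines an ordinary linear model: a predicted value is the column's constant plus the sum of each coefficient times the corresponding geometric feature of the gesture. The sketch below shows how such a model would be applied; the class and method names are ours, and the coefficients must be taken from the table.

    import java.util.LinkedHashMap;
    import java.util.Map;

    /** Applies one column of Table 5-6: predicted value = intercept + sum(coefficient * feature value). */
    class LinearGestureModel {
        private final double intercept;
        private final Map<String, Double> coefficients = new LinkedHashMap<>();

        LinearGestureModel(double intercept) {
            this.intercept = intercept;
        }

        /** Adds one term, e.g. withTerm("cos(initial angle)", coefficientFromTable). */
        LinearGestureModel withTerm(String featureName, double coefficient) {
            coefficients.put(featureName, coefficient);
            return this;
        }

        /** featureValues maps feature names to the gesture's computed feature values. */
        double predict(Map<String, Double> featureValues) {
            double prediction = intercept;
            for (Map.Entry<String, Double> term : coefficients.entrySet()) {
                prediction += term.getValue() * featureValues.getOrDefault(term.getKey(), 0.0);
            }
            return prediction;
        }
    }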

Statistic | Model data: Correlation | Model data: p value | Test data: Correlation | Test data: p value
Misses | 0.774 | 0.0000000196 | 0.524 | 0.000875
Bad Hits | 0.4907 | 0.00205 | 0.393 | 0.0162
Retest % | 0.440 | 0.00641 | 0.0856 |
Training trials | 0.750 | 0.0000000910 | 0.374 | 0.0226
Training time | 0.718 | 0.000000572 | 0.279 | 0.0940

Table 5-7 Correlation of memorability and learnability prediction with model data (used to create the model) and test data (data set aside and not used in model creation). Blank p values indicate no statistical significance. Italic p values indicate weak statistical significance.

We examined how often participants misrecalled a gesture and drew a rotated version of the correct gesture, or drew it horizontally flipped, vertically flipped, or backwards. These types of mistakes were very rare. Of all misremembered gestures, only 3.6% were rotated, 7.2% were horizontally flipped, 10.4% were vertically flipped, and 3.1% were backwards. A gesture design tool could reduce this problem by automatically creating new training gestures that were rotated, flipped, etc., based on training examples that the designer entered (see Chapter 8 for more discussion of this feature).

5.6 Discussion

The first two mappings were designed to be different; the first was supposed to be more iconic and thus easier to learn and remember than the second. It appeared to be easier to learn, as shown by higher test accuracy, but no other differences between these two sets were found. In retrospect, it is clear that we should have pilot tested the first two mappings to ensure that the second was less iconic than the first. It is interesting that the test accuracy and miss count are significantly different between the first two mappings, when none of the other variables are significantly different. In particular, it is surprising that iconicness is not also different, given the strong correlation of iconicness and miss count.

We had hoped to determine the effect of iconicness on learnability and memorability based on differences between the first two mappings. That was not possible, due to the virtually identical iconicness of the first two mappings. However, we had iconicness, learnability, and memorability data for each gesture individually, so we were able to compare iconicness with learnability and memorability between individual gestures. These comparisons clearly showed that iconicness is related to almost all the

135 memorability and learnability metrics we considered. This outcome is gratifying since it is consistent with the study of the memorability and learnability of other named entities [Shi99]. We expected to find a relationship between the predicted similarity (using the similarity metric derived from the previous experiment) and learnability and memorability, but the data did not show a statistically significant relationship. This result may be due to inadequacies in the similarity metric or to insufficient data from the memorability experiment. After the success of the similarity model, it was disappointing that the memorability model does not predict how well a gesture will be remembered after one week. This result may be due to insufficient data, or to the large effect of iconicness, which we are at present unable to model. We knew that iconicness was an important factor in learnability and memorability for other things and that it would likely be significant for gestures as well, but we hoped that we could factor out iconicness to some extent by using three different mappings with varying degrees of iconicness. Unfortunately, since it turned out that mappings one and two were equally iconic, we only had two degrees of iconicness (i.e., good and random). Our advice to designers is to make their gestures iconic when possible. This advice may be the key to learnability and memorability. It was also surprising that the model is able to predict misses and bad hits but not retest accuracy. We collected little data for the random case because little was needed to show that the non- random cases were significantly better, statistically. Also, the random case was very time- consuming for both participants and experimenters, so much so that several participants for the random case did not finish the experiment. However, this meant that we did not have very much data from the case where the impact of iconicness was least. In the future it would be useful to collect more data to improve the model. For this purpose, a random mapping would probably be best. 120

136 5.7 Summary This chapter described experiments on gesture learnability and memorability. The memorability experiment confirmed that iconicness is important in the learnability and memorability of gestures. We built a computational model that was able to predict misses, bad hits, and training trials based on geometric properties of the gestures. However, it was not able to predict recall accuracy or training time. Based on the partial success of prediction based on geometry, we believe it may be possible with more data to partially predict memorability. 121

Chapter 6
quill: An Intelligent Gesture Design Tool

We found that designers could create gestures with gdt and that, by trial and error, they could improve the recognition of their gestures. However, we believe users of pen-based user interfaces will demand better recognition than most participants in our study were able to achieve.¹ We created a new gesture design tool, called quill, that was inspired by gdt and the results of the gdt evaluation. It also takes advantage of the results from the similarity experiments to give feedback about human-perceived similarity of gestures. It was prototyped on paper and in PhotoShop and implemented in Java using JDK 1.3 and the Swing UI toolkit. It consists of approximately 26,000 lines of code spread across 93 classes.

¹ Conventional wisdom is that even a 98% recognition rate is inadequate [BDBN93].

This chapter describes the development of quill. It begins with a description of the goals for quill. Following is a discussion of how the gesture data in quill is organized and named. Next, the most important feature of quill, active feedback, is described. Then an example of using quill is given. The next two sections describe issues with the interface and how the interface evolved. Then implementation issues are discussed, followed by a summary. A human factors experiment to evaluate quill is presented in the next chapter.

6.1 Goals

Based on our experience with gdt, three goals were developed for quill: 1) to provide active feedback, 2) to provide advice about human similarity and memorability, and 3) to be easy for designers to use.

6.1.1 Active feedback

In the gdt evaluation, many participants had difficulty discovering that recognition problems existed in the gesture set they were designing. gdt included tables based on how the gestures would be recognized by the computer to help designers discover recognition

138 problems. However, many participants did not consult these tables, and the tables were confusing to use. The goal of active feedback was to help designers find problems, with both recognition and human perception, by informing the designer of possible problems without waiting for the designer to ask for the advice. Also, the active feedback should be in plain English and use diagrams so that designers with no recognition background can understand it. Active feedback also addresses the problem designers had of not knowing how to fix recognition problems once they were discovered. When quill provides feedback to the designer about problems, it should also give advice about how the problem can be fixed. 6.1.2 Human learnability and memorability It is frustrating to a user when gestures are misrecognized, but gestures are also not very useful if the user cannot remember them. If two gestures invoke different operations, the designer probably does not want them to appear similar since users may easily confuse one gesture for another, although our experiments have yet to show this correlation (see Chapter 5). To help the designer create gestures that will be easier for people to learn and remember, quill should give feedback to the designer about gestures that people may perceive to be similar (see 6.3.5, p. 127). 6.1.3 Easy to use Ease of use is a standard goal for user interfaces, and quill is no exception. This goal is challenging in quill because the application must explain to designers how to modify their gestures to better fit the requirements of the recognizer and human perception, both of which are complex systems. Few designers are trained in recognition or perception, so jargon and equations from those fields are inappropriate. Instead, quill should avoid explaining how the recognizer works or how our perceptual model works as much as possible. Where we need to explain technical details, quill should use drawings and plain English that is intelligible to designers. 123

139 6.2 Gesture Hierarchy and Naming Gestures are the actual objects that quill users want to manipulate, so it is important that the scheme used to organize them is easy for users to understand. This section describes how gestures are organized and how gesture structures are named. 6.2.1 Gestures and structures of gestures In quill, we created the following 5-level hierarchy: Gesture: A single mark or glyph. Gesture category: A collection of gestures that define a type of gesture (i.e., a gesture class). For example, a collection of left-to-right straight lines might define the scroll right gesture category. Gesture group: A collection of gesture categories, typically ones that perform related operations. For example, one might have an Edit group that contains the cut, copy, and paste gesture categories, and a View group that contains the zoom in and zoom out categories. Gesture set: A collection of gesture categories and gesture groups, typically all the categories and groups for a particular application (or a mode in an application for applications in which different sets of gestures are valid in different modes). Gesture package: A training set (i.e., a gesture set used to train the recognizer) and zero or more test sets (i.e., gesture sets used to test the recognition of the training set). Gesture groups and gesture packages are not strictly necessary, but experience suggests they are useful for quill users. If an application has a large number of gesture categories, the designer may wish to organize them by type, which gesture groups support. Gesture packages were introduced to allow tight coupling of test sets with a training set, which was suggested in the the gdt evaluation. Collectively, gesture packages, gesture sets, gesture groups, gesture categories, and gestures are called gesture objects. Gesture objects that may have children (i.e., all except gestures) are gesture containers. 124
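The hierarchy above maps naturally onto a small set of container classes. The sketch below is illustrative only and does not reproduce quill's actual classes; it simply mirrors the terminology of this section.

    import java.util.ArrayList;
    import java.util.List;

    /** A single drawn mark (one training or test example). */
    class Gesture { /* captured points, computed features, ... */ }

    /** A gesture class: the examples that define, e.g., the "scroll right" gesture category. */
    class GestureCategory {
        String name;
        final List<Gesture> gestures = new ArrayList<>();
    }

    /** Related categories, e.g., an "Edit" group holding cut, copy, and paste. */
    class GestureGroup {
        String name;
        final List<GestureCategory> categories = new ArrayList<>();
    }

    /** All categories and groups for one application (or one mode of an application). */
    class GestureSet {
        String name;
        final List<GestureGroup> groups = new ArrayList<>();
        final List<GestureCategory> ungroupedCategories = new ArrayList<>();
    }

    /** One training set plus zero or more test sets, kept together. */
    class GesturePackage {
        GestureSet trainingSet;
        final List<GestureSet> testSets = new ArrayList<>();
    }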

140 6.2.2 How to name gesture objects Given the levels of hierarchy presented in the previous section, we had to decide how to name the different levels. This issue is important because people see the names of the levels in the interface and in the documentation, and if the names are confusing it will prevent users from generating a mental model that matches what quill does. Some of the issues we considered in naming the levels of hierarchy are: Do the names suggest the right containment? That is, would people expect a package to contain sets to contain groups to contain categories? Which level of the hierarchy should get the special name gesture? Two different schemes were considered. In the first, the lowest level of the hierarchy, the one denoting an individual mark, is example and the second level (now called gesture category) is gesture. The advantage of this scheme is that in discussing gestures and interfaces using them, we have found it convenient to talk about the copy gesture rather than the copy gesture category, which sounds cumbersome. The second scheme is the current scheme, where the lowest level is gesture and the one above is gesture category. To resolve the first issue, we used our best intuition about which of these collection names was bigger than the others. This may not be how other people see it. For the second issue, we decided to use the current scheme because: 1) we thought example for the lowest level would be confusing, 2) it was more consistent to have gesture in all the names, and 3) it made more sense for gesture to be the building block. We sometimes use example to describe the lowest level of the hierarchy, since it is natural to talk about those gestures as training examples for the recognizer. 6.3 Active Feedback In our experiment with gdt, we found that people did not use the tables and graph that may have helped them find and fix recognition problems. We addressed this problem in quill by automatically detecting problems and alerting the user to them. This section describes the different types of problems quill detects. 125

141 6.3.1 Recognizer similarity If two gestures are similar to the recognizer, it is more likely that they will be misrecognized. To help designers find this potential problem, quill compares all pairs of gesture categories and reports when two of them are very similar, in terms of the metrics used by the gesture recognizer. See 3.1.1.2, p. 41, for more information about recognition. 6.3.2 Outlying category The outlying category problem was observed during the gesture design study using gdt (see 3.4.2.2, p. 57, outlying feature value problem). It occurs when two categories are very similar because a third, outlying, gesture category is very different. When quill detects that this problem has occurred, it reports the problem as an outlying category problem, not a recognizer similarity problem, so the designer knows that the outlying category is causing the problem. The implication is that the outlier is causing the problem, not the two categories that are similar. Only very experienced gesture designers understand this problem, so quill can make a significant difference for less experienced designers. 6.3.3 Misrecognized training example Normally, all training gestures should be recognized as the category of which they are a member. quill tests all training gestures and reports on ones that are not correctly recognized. Most misdrawn gestures are caught this way. This test also may indicate when two gesture categories are too similar for good recognition, since its likely that if two categories are similar their training gestures will be easily confused. 6.3.4 Outlying gesture Usually if a training example is misdrawn it will be misrecognized and flagged as described in the previous section (6.3.3). However, sometimes a training example may be misdrawn yet still be correctly recognized. Misdrawn gestures give the recognizer false data about what the gesture category is supposed to be like, which may result in recognition problems. For this reason, quill looks for training examples that are very different from others in their category, and reports these outlying gestures to the user. 126
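Two of these checks are straightforward to express in outline. The sketch below (reusing the illustrative Gesture and GestureCategory classes from the sketch in 6.2) covers only the misrecognized-training-example and outlying-gesture checks; the Recognizer interface and the distance calculation are placeholders for quill's actual recognition machinery, and the outlier cutoff would have to be chosen empirically.

    import java.util.ArrayList;
    import java.util.List;

    class TrainingSetChecker {
        /** Stand-in for the recognizer trained on the whole training set. */
        interface Recognizer {
            GestureCategory classify(Gesture g);
        }

        /** 6.3.3: every training gesture should be recognized as its own category. */
        List<String> findMisrecognizedExamples(List<GestureCategory> categories, Recognizer recognizer) {
            List<String> notices = new ArrayList<>();
            for (GestureCategory category : categories) {
                for (Gesture g : category.gestures) {
                    if (recognizer.classify(g) != category) {
                        notices.add("A training example in '" + category.name + "' is recognized as another category.");
                    }
                }
            }
            return notices;
        }

        /** 6.3.4: flag training examples that are unusually far from the rest of their category. */
        List<String> findOutlyingGestures(List<GestureCategory> categories, double cutoff) {
            List<String> notices = new ArrayList<>();
            for (GestureCategory category : categories) {
                for (Gesture g : category.gestures) {
                    if (distanceToCategoryMean(g, category) > cutoff) {
                        notices.add("Possible outlying gesture in '" + category.name + "'.");
                    }
                }
            }
            return notices;
        }

        /** Placeholder: distance in feature space from a gesture to its category's mean. */
        double distanceToCategoryMean(Gesture g, GestureCategory category) {
            throw new UnsupportedOperationException("depends on the feature set and recognizer");
        }
    }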

6.3.5 Human similarity

Previous warnings relate to how the recognizer sees the gestures. Another important factor in gesture usage is how easily gestures can be learned and remembered by people. To make their gestures easier to learn and remember, the designer may want gestures that invoke unrelated actions to appear dissimilar. To help the designer do this, quill checks all pairs of categories to determine if it is likely that people will perceive the two categories in the pair as very similar, based on the gesture similarity studies described in Chapter 4. If the gestures are in different groups, quill reports this similarity to the user and suggests how to make the gestures less similar. If the gestures are in the same group, there is no warning, because it is likely that gestures in the same group perform similar functions, and it may be helpful for gestures with similar functions to be perceptually similar. For example, the operations scroll up and scroll down should have similar gestures.

6.3.6 Duplicate category or group name

Users of quill and end-users of the gesture sets designed in quill are likely to be confused if categories and groups are not named uniquely. For this reason, quill warns the designer if two (or more) categories or groups have the same name.

6.4 quill Example

This section describes the quill interface and gives an example of using quill. An overview of the quill interface is given in Figure 6-1. The first step in using quill is creating some gesture categories, and optionally some gesture groups to organize the categories. Then the designer enters training gestures by drawing them in the drawing area. The designer can view the categories and training gestures simultaneously in quill subwindows on the quill desktop, as shown in Figure 6-2. Once training gestures have been entered, recognition can be tested. When the designer selects the training set and draws a gesture, quill recognizes it and displays the result, as shown in Figure 6-3. For a detailed example of creating gesture categories and entering gestures, and an example of entering a gesture group and a test set, see the quill Tutorial in Appendix E.

Figure 6-1 quill UI overview. The screenshot's callouts label the menu bar, the training/test selector, the tree view, the drawing area, the goodness metrics, the gesture windows, and the log.

6.5 User Interface Issues

This section describes some of the issues we encountered while designing the user interface for quill.

6.5.1 Display of gesture objects

The goal of quill is to allow the construction of effective gesture sets. How gesture packages and their components are displayed is thus very important. quill displays gesture objects in two views: tree or desktop.
1. Tree view. Gesture objects are displayed in a standard tree widget (see Figure 6-4). It is easy to browse the structure of the package with this view.
2. Desktop view. The right part of the quill main window is a desktop where graphical views of gesture objects are displayed (see Figure 6-5). In addition to

144 Figure 6-2 quill with multiple windows, showing several gesture categories open at once. windows displaying gesture objects, windows displaying advice are displayed on the desktop when requested by user clicks in the summary log. quill allows an arbitrary number of views of the same object at different places. This was inspired by emacs [Sta87], which allows an arbitrary number of views and which has shown to be very useful to programmers. 6.5.2 Display of active feedback The most important feature in quill is active feedback, which is provided for potential problems like misrecognized training examples or gestures that people will perceive as being similar (see 6.3, p. 125). How, where, and when to present active feedback are important interface issues. We call feedback about a specific problem a notice. 6.5.2.1 How In the gdt experiment, we observed that people had difficulty understanding the tables and graph about recognition that gdt provided. Based on this finding, we decided the feedback 129

Figure 6-3 quill recognizes a gesture and shows the result several different ways. Callouts in the screenshot point out the gesture that was drawn, the recognition result, the recognized gesture highlighted in the tree, and the recognized gesture shown in green.

Figure 6-4 quill tree view of a training gesture set.

146 Figure 6-5 quill desktop view of a gesture set, gesture group, example (training gesture), and gesture category (clockwise from top left). in quill should be easily intelligible, so quill uses plain English where possible. When unfamiliar terms are presented, they are hyperlinked to descriptions in the on-line reference manual. As well as normal English, the reference manual uses diagrams to explain terms. 6.5.2.2 Where At any given time the system may have several different feedback items for the designer. For this reason, quill has a log at the bottom of the main window to display the feedback in summary form. We wanted to keep the messages in the log terse so experts would not have to read many words to understand a message, and to conserve screen space. However, most feedback in quill is not immediately intelligible from the message in the log, so notices also include hyperlinks that the user can click to obtain additional information. This additional information is usually shown in new windows in the desktop area, although two types of information, how to and reference, are shown in dedicated top-level windows (i.e., independent and outside of the quill main window). 131

147 6.5.2.3 When There are several options for when to provide feedback to users. One is to give feedback as soon as possible. This approach has the advantage that the user knows what action caused the feedback. However, for feedback related to recognition, this strategy is likely to produce premature and incorrect feedback, because the way the recognizer views the gestures will change over time, as gestures are added to the set. We decided to treat feedback differently depending on whether it was related to recognition or not. Feedback not related to recognition is shown immediately, so that users know what caused it. Feedback related to recognition is delayed until the user explicitly trains the recognizer or draws a gesture to be recognized, because we assume that at that time the user has stopped entering gestures and is starting to test the recognition. 6.5.2.4 Expiration Equally important as showing feedback at the right time is removing feedback at the right time. Feedback should be removed when the user tells the program to ignore the problem or when the problem is fixed. One problem is determining what fixed means. One strategy is when a notice is created, remember the current state of the relevant gesture objects and when the state changes more than some amount, remove the notice. A problem with this strategy is deciding how much change is enough. Another strategy is to evaluate the whole gesture set every time it changes to determine if notices are still valid. Initially, the Java version of quill evaluated the entire gesture set any time any part changed, and reported all notices that applied to the new state of the set. We soon realized that this design was not workable, however, because users were deluged with the same notices over and over again and ignored them all. We changed quill so that now it only displays notices in the log that have not been displayed before. Warning icons in the tree and desktop view are shown for all notices that are applicable, not just new notices. 6.5.3 Test vs. training gesture sets The gdt experiment showed that explicit support for test gesture sets was important. This section discusses the issues of how tightly coupled test and training sets should be and where in the gesture hierarchy test gestures should be. 132
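Returning to the display and expiration strategies of 6.5.2.2 through 6.5.2.4, the adopted behavior can be captured in a few lines: after each re-evaluation of the gesture set, every applicable notice keeps its warning icon, but only notices that have not been shown before are appended to the log. The sketch below is hypothetical; in particular, the string key standing in for a notice's identity is a simplification.

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    class NoticeLog {
        /** Keys of notices that have already been shown in the log. */
        private final Set<String> alreadyLogged = new HashSet<>();

        /**
         * Called after each re-evaluation of the gesture set. The map goes from a key
         * identifying the notice (its kind plus the affected gesture objects) to its summary text.
         */
        void update(Map<String, String> applicableNotices) {
            for (Map.Entry<String, String> notice : applicableNotices.entrySet()) {
                showWarningIcon(notice.getKey());          // icons reflect all applicable notices
                if (alreadyLogged.add(notice.getKey())) {  // but the log gets only new notices
                    appendToLog(notice.getValue());
                }
            }
        }

        void showWarningIcon(String noticeKey) { /* mark the affected items in the tree and desktop views */ }

        void appendToLog(String summary) { /* add a line, with hyperlinks, to the log at the bottom of the window */ }
    }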

148 6.5.3.1 Coupling of test and training sets In gdt, the recognizability of one gesture set could be tested by using it as a training set and using another gesture set as a test set. A single file stored a single gesture set, so there was no coupling by the system between a training set and associated test sets. The original paper prototype had this same design. However, users of the paper prototype strongly preferred that the program support linking test sets with the training set for which they were made. For this reason, in quill we decided to have every test set bound to a training set inside a gesture package. 6.5.3.2 Location of training gestures in the hierarchy In the original quill design, the gesture object hierarchy had four levels: gesture set, gesture group, gesture category, and gesture. A gesture set was the highest level, and the unit of storage. We considered putting test gestures in two different places: 1) in gesture categories, so that each gesture category would have training gestures and possibly test gestures as well; and 2) in a separate hierarchy parallel to the training set. We discussed how test sets would be used, and thought that designers would at any given time be working primarily on training gestures or on test gestures and would not closely interleave the two. Therefore, we decided to have a separate test set hierarchy, where each test set has its own gesture groups and categories and those categories contain the test gestures. 6.5.4 One vs. many windows An early decision we made about the quill UI was whether it should have multiple top- level windows for displaying and editing gestures, or if it should have only one window (i.e., Single Document Interface (SDI) vs. Multiple Document Interface (MDI)). gdt opened a new top-level window for every view of a gesture set or gesture category that was opened (e.g., see Figure 3-3 and Figure 3-4). gdt was designed mainly with X Windows [QO93] in mind, and under X it is common for applications to have multiple top-level windows and the environment for managing windows is very flexible and customizable. However, many more people, including GUI designers, use Microsoft Windows than X Windows, and under Windows it is more difficult to manage windows. And, most applications have only one top-level window. For these reasons, we decided 133

that quill would have one top-level window per gesture package, which normally corresponds to the collection of gestures that are applicable to one application.

6.6 User Interface Evolution

This section describes how the quill user interface evolved from sketches, through a paper prototype and its evaluation, to the final Java version.

6.6.1 Sketches

The first phase of the quill interface was a low-fidelity prototype consisting of a set of sketches. Figure 6-6 shows the basic concept for the interface. Figure 6-7 is a storyboard showing a new gesture category being added and gestures added to the category. Figure 6-8 shows a bad training example being reported to the user. Figure 6-9 shows how the user could obtain more information about a warning notice. An early idea for showing training gestures and test gestures for the same category together is shown in Figure 6-10. We constructed two simple scenarios and generated storyboards. In the first scenario, the designer enters a gesture set, and one training example for the "copy" gesture category in

Figure 6-6 Sketch of quill prototype.

150 Figure 6-7 Storyboard of quill prototype, showing creation of a new gesture category. Top-left shows an empty gesture package. Top-right shows a new gesture category added. Bottom-left shows the category renamed and one gesture added. Bottom-right shows several gestures added. Figure 6-8 Sketch of quill prototype. This shows a bad training example being reported. 135

Figure 6-9 Storyboard of quill prototype. At the top left the user clicks "more info" to show the top right. From there the user can click "sharpness" to show the bottom left screen or "Plot sharpness" to show the bottom right.

Figure 6-10 A way to show training gestures and test gestures for a category together.

the set is bad. Figure 6-11 shows how the interface initially reports the problem to the designer. If the designer then opens the "copy" category, quill shows the training example gestures with the bad one outlined in red (see Figure 6-12).

Figure 6-11 quill low-fi prototype, scenario 1. Initial report of a bad training example. The notice is shown in the log at the bottom, and a warning icon is placed next to the category in the tree and desktop views.

In the second scenario, quill has told the designer that two categories are too similar for the recognizer. Figure 6-13 shows the interface after the designer has clicked the "More info" button. The designer may not know what "angle of the bounding box" is, and so clicks on "angle of the bounding box". (The designer knows it is a hyperlink because it is blue.) quill displays additional information as shown in Figure 6-14.

6.6.2 Evaluation of paper prototype

Next, we made a paper prototype based on the sketches, and performed an informal user study using the paper prototype [Ret94]. The paper prototype used printouts of the online sketches with hand drawings for additional elements (e.g., pull-down menus, dialog boxes, and subwindows). Participants had two tasks in the study using the paper prototype. For the first task, they were shown the interface as in Figure 6-6 and told to add a new category and draw training

153 Bad training example Figure 6-12 quill low-fi prototype, scenario 1. A bad training example is highlighted in red. Figure 6-13 quill low-fi prototype, scenario 2. The designer saw the warning at the bottom and has clicked on the More info button. angle of the bounding box is a blue hyperlink. 138

154 Figure 6-14 quill low-fi prototype, scenario 2. The designer clicked on angle of the bounding box and is shown a graphical explanation (in the window above: Feature: angle of the bounding box). examples for it. An example of a hand-drawn element is Figure 6-15, which was overlaid on top of Figure 6-6 when the user created the normal zoom category. The second task was to test the recognition of one gesture set using another gesture set. They started with two copies of Figure 6-6, one to represent the training set and another to represent the test set (the title bar in one printout was changed to say igdt - exampleset1test to indicate that it was a test set). When they did the Test Gesture Set Figure 6-15 Empty gesture category overlay, used in paper prototype evaluation. 139

Figure 6-16 Dialog box from the paper prototype evaluation. Used to choose a test set.

operation, the interface brought up the dialog box in Figure 6-16 to allow them to choose the test set to use. Four graduate students from our research group participated in the study. During and after the study, participants made a number of suggestions about how to improve the interface:

- Combine the menu items "Set/Evaluate recognition" and "Set/Evaluate similarity" into one: "Set/Analyze". We made this change.
- Make sure that the visual indications that a gesture object is selected and that there is a problem with it are distinct. This was difficult to do in the paper prototype, but was included in the Java version.
- Make hyperlinks underlined as well as blue [all participants]. We did this in the next version.
- The dialog box for choosing a test set (Figure 6-16) was confusing. Users were not sure if it was used to choose the set of training gestures or test gestures. The next version eliminated this dialog entirely.
- Move the "Set/New" operation to the File menu. We significantly reorganized the menus for the next version.
- Rename "gesture category" to "gesture". Only one of our four evaluators made this comment and we decided against it. (See 6.2.2 for a discussion of naming.)

- When the system first discovers a possible problem, show only the warning icon next to the gesture object, and only show text at the bottom when requested, such as with a mouse click on the icon. We thought that designers should have more initial information than just an icon, so we ignored this request.
- Rename the "More info" button to "Details" [2 participants]. We thought "More info" was more descriptive.

The evaluators also suggested some new features that they thought would improve the interface:

- In the tree view, show the thumbnail instead of a static icon. We agreed that this is desirable, and it was implemented in a later version.
- Have the program be able to change a gesture to fix it instead of (or as an alternative to) telling the user how to. For example, if the program realizes that two gestures are too similar because of their length, have an option to automatically make one shorter or the other longer. This would be a great feature, but would be very difficult to implement.
- Drag and drop of gestures in and between the tree and the desktop. Also a good feature, but not implemented because its cost was high relative to its benefit.
- Pare down the number of menus. We did this in the next version.
- Couple test sets more tightly with their training set [all participants]. Originally, quill was designed to have no knowledge of whether a gesture set was for training or for testing and treated them identically except when explicitly requested to test one set using another. However, since all four participants wanted a tighter coupling, quill was redesigned to have it. (See 6.5.3 for more discussion of training and test sets.)

The experiences of the participants raised some questions that were difficult to answer:

- Should groups of groups be allowed? We decided not to allow them since we believed their use would be rare and would only confuse designers.
- Should all gesture categories be in groups or can they be in gesture sets directly? We decided on the latter since it seemed useful for small applications.

- Allow renaming a gesture category or group in the tree view simply by typing, rather than by invoking a menu item? This feature was not implemented for some time because it is non-standard, but after some pilot tests with the Java version, it was added.
- Should quill have a toolbar? If so, what should be in it? Some obvious candidates are new gesture category, cut, copy, and paste. We were not sure if it was worthwhile considering the screen space it would consume. Also, we didn't know which operations would be the most common.
- Originally the message area at the bottom of the screen was a single-message area, like those many applications use to report that they are currently saving a file. This evaluation suggested that a log, such as the compilation log in most IDEs (integrated development environments), would be more appropriate. The next version used a log.
- When do warning messages go away? This is a thorny issue and was only resolved after trying different strategies and doing pilot testing (see 6.5.2.4).

6.6.3 Java implementation

After the evaluation of the paper prototype, we discussed whether the next prototype should be on paper or in software. We decided that it should be in software because a large component of the interface is the interactions on the tree and desktop views, and these are cumbersome to capture in a paper prototype. In retrospect, it probably would have been better to do at least one more paper iteration, since the effort of the Java implementation was much higher than expected. Even with the effort required to make an interactive paper prototype, another paper version would have been easier than the Java version turned out to be.

Figure 6-17 shows the main window for the Java version with annotations. The following sections briefly describe the parts of the interface.

6.6.3.1 Training area

This area shows the training set and the groups, and the gesture categories they contain. Individual gestures are not shown here. Clicking on a name, folder icon, or gesture icon selects an object (and deselects everything else in the tree). Clicking on a notice icon

scrolls the log at the bottom of the window so that the relevant warning is in view. Double-clicking on an object creates a new window that shows the object.

Figure 6-17 quill UI overview. (Annotated regions: menu bar, training/test selector, training area, windows, goodness metrics, drawing area, and log.)

6.6.3.2 Windows

Windows appear in the right part of the main window and are used to browse gesture categories, groups, and sets and to show information about suggestions in the log window (such as misrecognized training examples). Most windows show gesture sets, gesture groups, gesture categories, or individual gestures, as shown in Figure 6-17. However, windows may show other things, such as gestures that are not recognized correctly. Figure 6-18 shows several misrecognized gestures. Each gesture has a green textual object label and a red-labeled button below it. The green label says which gesture category the gesture is supposed to be. The red-labeled button says which gesture category it was recognized as. Clicking on the button creates a

159 window that shows the gesture category it was recognized as. This feature helps the designer decide whether the gesture is a mistake and should be deleted, or whether it should remain. 6.6.3.3 Drawing area This area is used for drawing gestures. If a gesture category is selected in the tree view or if a gesture category window is active, a gesture drawn here will be added to the selected gesture category. If a gesture window is active, a gesture drawn here will replace it. If a gesture group or set is selected in the tree or a gesture group or set window is active, an example drawn here will be recognized. Results of the recognition will be shown in the log, and the label for the recognized gesture will turn green in the training area and in any windows in which it appears. During certain operations (e.g., training the recognizer), the application is unable to accept drawn gestures, and during this time this area will turn gray. 6.6.3.4 Log This view is the primary means for the application to communicate to the user about what it is doing. Many different types of messages may appear here. Some examples are: Notification of autosave. Notices about problems that the program has detected with the gesture set. Figure 6-18 Misrecognized gestures. 144

160 Recognition results. Errors. 6.7 Implementation Issues quill is a large, complex application, as can be seen in the class diagram in Figure 6-19. Many classes developed for gdt were generalized and became a library for generic gesture manipulation, display, and interaction. The figure shows only classes developed specifically for the quill application, not the generic library classes. Portions of this diagram are shown in more detail in related subsections below. This section describes interesting issues that arose in the implementation of quill. 6.7.1 Generic property mechanism In implementing gdt, basic data structures, such as gesture and gesture category, became encrusted with data that was specific to the gdt application and probably not useful for other gesture-using applications. We took this lesson from gdt1 and for quill modified the gesture objects to have a generic property mechanism. All gesture objects can be assigned arbitrary name-value pairs where the name is a string and the value is any Java object. Interested parties can register a listener with a gesture and will receive an event any time the objects properties change. Gesture objects have some built-in properties. For example, all gesture objects have a parent property, and gesture containers (i.e., gesture package, gesture set, gesture group, and gesture category) have a children property. Support for cloning properties along with the rest of a gesture object was also implemented. Initially there was only one kind of property. However, once properties started to be used to store notices, it became clear that this was inadequate. For example, notices should not be copied when a gesture object is cloned (which happens, for example, when it is copied to the clipboard and then pasted). Therefore we added to the property mechanism the ability to set a property as transient, in which case it would not be saved to disk or cloned. 1. Generic features were also inspired in part by SUIT [PYD91], with which I worked as an undergraduate, and by the generic property mechanism in Java/Swing [WC99]. 145
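As a concrete illustration of the kind of mechanism described above, the sketch below shows one way a generic name-value property system with change listeners and transient properties could look in Java. The names (GestureObject, PropertyListener, setTransient, persistentProperties) are invented for this example and are not quill's actual API; it is a minimal sketch, not the implementation.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    /** Hypothetical sketch of a generic name-value property mechanism. */
    class GestureObject {
        /** Listener notified whenever a property on this object changes. */
        interface PropertyListener {
            void propertyChanged(GestureObject source, String name,
                                 Object oldValue, Object newValue);
        }

        private final Map<String, Object> properties = new HashMap<>();
        private final Set<String> transientNames = new HashSet<>(); // not saved or cloned
        private final List<PropertyListener> listeners = new ArrayList<>();

        void addPropertyListener(PropertyListener l) { listeners.add(l); }

        Object getProperty(String name) { return properties.get(name); }

        void setProperty(String name, Object value) {
            Object old = properties.put(name, value);
            for (PropertyListener l : listeners) {
                l.propertyChanged(this, name, old, value);
            }
        }

        /** Mark a property (e.g., a notice list) as transient: skipped when saving or cloning. */
        void setTransient(String name) { transientNames.add(name); }

        /** Copy only the persistent properties, as a save or clone operation might. */
        Map<String, Object> persistentProperties() {
            Map<String, Object> copy = new HashMap<>(properties);
            copy.keySet().removeAll(transientNames);
            return copy;
        }
    }

Under this scheme, built-in relationships could be stored the same way as application-specific data, for example a "parent" property on each child and a "children" list on each container.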

Figure 6-19 quill class diagram. Light gray boxes are classes, medium gray boxes are abstract classes, and dark gray boxes are interfaces.

6.7.2 Problem detection

In the gdt experiment, we saw that users had great difficulty discovering recognition problems. For quill, we implemented detection of and feedback about several different types of potential problems with gestures. Most of these problems are detected in the class Analyzers, except for recognizer similarity, which has its own class, RecognizerSimilarityTester, because of its complexity. This section describes issues involved in detecting the different kinds of problems about which quill provides active feedback.

6.7.2.1 Recognizer similarity

One problem is when gestures are very similar to each other with respect to the recognizer, which increases the chance of the recognizer confusing them. For this problem, similarity is measured using the Mahalanobis distance in the feature space used by Rubine's recognizer. We were not sure what the right threshold was for considering two categories to be too similar. We had to experiment with different values to see which one gave useful results without too many false positives. When we first created gdt, we created gesture sets and used the distance matrix to see how far apart in feature space the gestures were. Informal experimentation led us to think that a distance of 8 was a good threshold. However, later, when we created gesture sets in quill, that value appeared to cause too many warnings when the gestures actually seemed to be recognized well. Therefore, for the quill evaluation, we changed the threshold to 4. In that evaluation, it may be that this number was too low and should be adjusted to approximately 6, which is the value the released version of quill uses.

6.7.2.2 Outlying category

If two categories are detected as being too similar (see previous section), it may be due to another gesture category being too different. More formally, suppose two or more categories are differentiated primarily based on one feature, f. If another category is added whose value for f is extremely different from the value of f for the pre-existing categories, then the pre-existing categories will become squeezed together in feature space, and thus hard for the recognizer to differentiate.

To detect this, we need to compare the complete training set against the training set without each of the categories, in turn. It's not obvious exactly what to look at in this comparison. Possibilities include:

1. Inter-category distances. What is the right distance? Only warn if two categories are closer than the usual threshold?
2. Distance of the candidate bad category from the other (non-candidate) categories, based on the variance of the feature across non-candidate categories. Warn if the candidate is farther than some number of standard deviations from the non-candidate categories?
3. Feature weights. If a feature weight is substantially less when one category is enabled than when it is disabled, that category is a problematic outlier.

The solution we decided on is, when two categories are too similar for the recognizer, to try disabling all other categories, one at a time, and see whether that makes the two categories farther apart. If disabling any category significantly increases the distance between the original two, suggest changing it (see #2, below). This strategy raises the following issues:

1. How much of a distance increase is significant? If the two categories get just over the usual similarity threshold, we may not want to count that, but instead require that they get 1.5 to 2 times farther apart than the usual threshold. However, we do not want to miss a chance to improve things by setting the threshold too high.
2. How should quill suggest changing the problematic category? The underlying problem is that the original two categories are too similar. quill reports a category as outlying if removing it from the set causes the two similar categories not to be too similar any more, and if their new distance apart is more than twice what it was with the potential outlier in the gesture set, in order to prevent a small change near the distance threshold from counting.

6.7.2.3 Misrecognized training example

Finding misrecognized training examples is straightforward: for each enabled training gesture, check that it is recognized as the category of which it is a member. To fix it, the designer needs to know which feature is causing the gesture to be different from its category. To determine which feature is most significant, quill takes the difference of feature values between the gesture and the average for its category and multiplies each

164 difference by the weights that the classifier uses for that feature and category. The product with the highest absolute value indicates the feature along which the gesture is most different from the category it is in. 6.7.2.4 Human similarity quill examines all pairs of gesture categories and reports if any are too similar for people, using metrics derived from the similarity experiments described in Chapter 4. It uses the similarity metrics from the first similarity experiment (4.1.5, p. 73), with the threshold determined using data from the similarity survey (4.4.1, p. 96), and the pairwise similarity survey (4.3.3, p. 90). If a pair of categories is similar based on either metric, it is reported as too similar. It also computes which feature is most responsible for the two being similar, but in different ways for the two metrics. The first metric (from the first triad trial) is like the recognizer similarity metric in that it is based on computing coordinates in a feature space for the two categories and then computing the distance between these coordinates. To determine the most significant feature, quill first determines the dimension in feature space along which the two categories differ least. It reports the feature that contributes most to that dimension. The second metric (from the pairwise survey) is based on logistic regression, and is a simple linear predictor. To determine which feature is most responsible for similarity, quill computes the value of each term and finds the term with the smallest absolute value. The feature corresponding to that term is the one that should be changed to make the gestures different. 6.7.2.5 Outlying gesture quill decides if a training example is an outlier based on its distance in recognizer feature space from the centroid of the gesture category to which it belongs. The only issue is how far away it needs to be to be considered an outlier. Informal pilot testing revealed that it needs to be surprisingly far to avoid false positives (i.e., to avoid reporting a training example as an outlier when it is not). Five standard deviations was not enough, so quill uses ten. 149
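One plausible reading of the outlier test in 6.7.2.5 is sketched below: a training example is flagged when its distance from the category centroid in feature space is more than a given number of standard deviations of the within-category distances. The feature extraction itself and the exact distance definition quill uses are not shown; the feature vectors are taken as input, and the names here are illustrative only.

    import java.util.ArrayList;
    import java.util.List;

    /** Hypothetical sketch of the outlying-gesture check described in 6.7.2.5. */
    class OutlierCheck {
        /** Returns indices of training examples whose distance from the category
         *  centroid exceeds stdDevThreshold standard deviations of those distances. */
        static List<Integer> findOutliers(double[][] featureVectors, double stdDevThreshold) {
            int n = featureVectors.length;
            int d = featureVectors[0].length;

            // Centroid of the category in feature space.
            double[] centroid = new double[d];
            for (double[] v : featureVectors) {
                for (int j = 0; j < d; j++) centroid[j] += v[j] / n;
            }

            // Distance of each example from the centroid, and the mean distance.
            double[] dist = new double[n];
            double mean = 0;
            for (int i = 0; i < n; i++) {
                double sum = 0;
                for (int j = 0; j < d; j++) {
                    double diff = featureVectors[i][j] - centroid[j];
                    sum += diff * diff;
                }
                dist[i] = Math.sqrt(sum);
                mean += dist[i] / n;
            }

            // Standard deviation of the within-category distances.
            double var = 0;
            for (double x : dist) var += (x - mean) * (x - mean) / n;
            double std = Math.sqrt(var);

            // Flag examples that are unusually far from the centroid.
            List<Integer> outliers = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if (std > 0 && (dist[i] - mean) > stdDevThreshold * std) outliers.add(i);
            }
            return outliers;
        }
    }

With stdDevThreshold set to 10, this corresponds to the threshold mentioned above, although whether quill measures the deviation per feature or over the distances is not specified here.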

165 6.7.3 Goodness metrics To give designers an idea of how good quill thinks their gestures are, quill reports two goodness metrics, one for human goodness and one for recognizer goodness. These metrics are based on the same problems about which quill gives feedback (see 6.1.1, p. 122 and 6.3, p. 125). We decided to use a system in which each problem that quill detects is worth a certain number of points. The points for all problems that exist at a given time are added up to give a penalty that counts against the goodness. Problems related to recognition count against recognizer goodness and problems related to human perception count against human goodness. We thought about making zero the best value and either adding or subtracting the penalty to it. However, after discussion with members of our research group, we decided that it was more natural for the best score to be a high number and subtract the penalty from it so that lower numbers mean lower goodness. We chose 1000 as the best, a perfect score. These metrics are computed by the classes HumanGoodness and RecognizerGoodness. 6.7.4 Notices Notices are a core feature of quill, and also a source of many implementation issues. The previous section discussed how the analysis was conducted. This section discusses engineering considerations about how notice computation affects the rest of the application, how notices are propagated up the gesture hierarchy, how to avoid showing the same notice multiple times, and how to handle notices with different levels of severity. 6.7.4.1 Background analysis One issue with notices is that computing them may take a long time, so early in the design of quill we considered different strategies for using threads to do this computation. Table 6-1 shows the strategies that were considered and the pros and cons of each. We analyzed the options as follows, by strategy number: 1. This strategy would be easy to implement, but would create an annoying interface where too much of the time the user would not have control. 150

Strategy 1: Lock all user actions.
  Pros: Simple mental model.
  Cons: Prohibits the user from taking any actions during analysis.
  Implementation issues: Have to keep a count of the number of long-running threads. The UI is locked any time more than zero such threads are active.

Strategy 2: Disable all user actions that would change any state.
  Pros: Allows read-only actions during long-running operations.
  Cons: Users may be confused about why some actions are permitted but others are not. Would have to provide feedback so users knew what was going on.
  Implementation issues: Same as #1, except only modification operations get disabled.

Strategy 3: Allow all actions, but if a change occurs that affects a long-running operation, cancel it.
  Pros: Users can take any action, any time.
  Cons: Users may not realize certain actions will cancel a long-running operation, which they may not want to do. The UI could prompt them (e.g., "This action will cancel [long-running op foo]. Do it anyway?"), but that would probably be very annoying.
  Implementation issues: Have to catch all changes and cancel the long-running operations. Should canceled operations automatically restart?

Strategy 4: Allow all actions. Whatever happens, happens.
  Pros: Users can take any action, any time. Easy to implement.
  Cons: Results of a long-running operation will not be consistent with the state of the world, which may confuse users.
  Implementation issues: Easiest of them all.

Strategy 5: Disable all user actions that would change state currently in use by long-running operations.
  Pros: Allows the maximum number of non-conflicting actions possible.
  Cons: Users may be confused about why some actions are permitted but others are not. Would have to provide feedback so users knew what was going on. The set of enabled actions may change while a long-running operation is running.
  Implementation issues: Have to keep track of what pieces of state are being used at any given time. Have to know what pieces of state every command could change and enable/disable them all appropriately.

Table 6-1 Strategies for handling analysis in a background thread.

2. This strategy would not be much more work to implement than #1, and would be much better for the user since many commands do not modify the state of the gesture objects.

3. This strategy is better than #4, but without confirmations long-running operations will often be canceled and the user would have to restart them. Using confirmations would annoy users too much to be useful.

167 4. This strategy is not good at all. Allowing the results of analyses to be inconsistent with the state of the world could too easily confuse users. 5. This strategy offers maximum flexibility for the user, but at a substantial implementation overhead, since all pieces of state would have to be tagged and queried for every operation. It would probably only be advantageous if background operations are modifying the state while others are reading it, in which case a semaphore type of system would have to be used. quill has two different types of long-running operations: user-initiated and system- initiated. We decided to use different strategies for these two different types. For user- initiated operations, we chose #2 because we thought disabling some actions in response to a user action would not confuse users. Also, the implementation effort (i.e., disabling operations that modify the data) was not extraordinarily difficult, compared to option #5. For system-initiated operations, we chose #3, with the system automatically restarting actions that are canceled due to modifications. We wanted users to be able to act freely without actions being enabled and disabled for no apparent reason. Since the user did not initiate these long-running actions, users will not know and thus not be distracted by these processes stopping and restarting when changes are made. The TaskManager class schedules the background analysis tasks. It also monitors the gesture package and restarts the analysis tasks when a change is made. 6.7.4.2 Propagation Gesture objects with notices are displayed with an icon next to them in the tree and desktop views. However, if an object at a low level of the hierarchy, such as a gesture category or a gesture, has a notice, it and thus the notice icon next to it may not be visible, so users might look at their gesture set and think that it is fine when in fact there are hidden notices. To avoid this problem, we decided to have notice icons show next to a gesture object if it or if any of its descendents have a notice. Correctly propagating the notices was somewhat complicated and raised several issues, which are discussed below. 152

168 Who is responsible for (un)setting Notices on gesture containers? One option for setting and unsetting notices on gesture container is for the gesture container to do it, by listening to its children for changes to their properties. The advantage is that property changes propagate up the tree for other reasons, already, and containers already listen to them. Alternatively, every gesture object could modify all its ancestors when its notices change. We decided that the gesture container should manage this. What is stored as the value of a notice property? The issue here is what is actually stored in the notice property with each gesture container about its descendents notices. We considered three different options: 1. A reference count of the total number of notices its descendents have. It is straight- forward and easy to implement. However, it is also simplistic and it would be easy to mistakenly forget to update it and corrupt it. Also, a sophisticated interface may require more information than this. 2. A list of all notices that its descendents have. This makes it easy to iterate over every notice in the subtree exactly once, but inefficient to iterate over all descendents that have at least one notice exactly once (since some descendents may be referenced by more than one notice). 3. A list of all descendents that have notices. The opposite of #2: easy to iterate over descendents with notices, but inefficient to iterate over all notices. We decided against reference counting since we thought we would probably want a more sophisticated interface than it would allow. Since notices nearly always refer to multiple gesture objects, but gesture objects will have multiple notices much less often, we decided on option #3. Whether to use the same property for intrinsic notices vs. inherited notices (from descendents) Using the same property for both intrinsic notices and inherited notices would have several advantages and disadvantages. The advantages are that it would mean fewer properties to manage, and it is likely that the display of and interaction with intrinsic and inherited properties should be the same. However, there are significant disadvantages: 153

169 1. It makes it very difficult to display or interact with the two types differently if that is ever desired. 2. A gesture container cannot tell if its descendents have intrinsic notices or not, so if a descendent inherits notices, the ancestor will see the notice multiple times (once in the gesture object that has it intrinsically, and also in inherited lists). 3. We may want different kinds of values for the inherited and intrinsic notice properties. With different properties for intrinsic and inherited notices, it would be easier to handle the two differently in the interface, however it would be slightly harder to treat them identically. The property system does not support property inheritance. At the time of this decision, we did not know whether treating intrinsic and inherited notices the same or different would be the common case, although it was planned to be the same. However, it was important that notices be listed only once in a container, which would not occur if inherited and intrinsic notices used the same property (disadvantage #2 above), so we decided to use different properties for intrinsic and inherited notices. It also allowed the two properties to have different types of values, which was useful. (The inherited notices property is a list of descendents with notices and the intrinsic property is a list of notices.) 6.7.4.3 Avoiding duplicates Every time any gesture object changes, quill analyzes the entire gesture package to look for new notices. As originally implemented, this method caused many notices to be shown multiple times in the log. It was clear that this design would be annoying to the designer, and that a notice should not be shown again if it has already been shown (or at least if it had been shown recently). To fix this problem, after running an analysis to generate new notices, quill looks at the list of new notices and compares it with the current set of notices. Only notices that are in the new set and not in the current set are shown in the log. This policy does mean that a notice that becomes invalid will be printed again if it becomes valid again, but this is probably desirable. 154
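The de-duplication policy just described amounts to a set difference between the freshly computed notices and those already known. The sketch below illustrates the idea under the assumption that equivalent notices compare equal (equals/hashCode); the class and method names are illustrative and do not correspond to quill's actual code.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    /** Hypothetical sketch of showing only notices that are new since the last analysis. */
    class NoticeLogFilter {
        private Set<Notice> current = new HashSet<>();

        /** Called after each full analysis of the gesture package. */
        void update(List<Notice> freshlyComputed, Log log) {
            Set<Notice> fresh = new HashSet<>(freshlyComputed);

            // Log only notices that were not present after the previous analysis.
            for (Notice n : fresh) {
                if (!current.contains(n)) {
                    log.show(n);
                }
            }

            // A notice that disappears and later reappears will be logged again,
            // which matches the behavior described above.
            current = fresh;
        }

        interface Notice { }
        interface Log { void show(Notice n); }
    }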

- Duplicate name (informational, severity 1): displayed immediately.
- Human similarity, different groups (possible problem, severity 2): displayed immediately (will not improve or change based on later gestures).
- Human similarity, same group (informational, severity 1): displayed immediately.
- Misrecognized training examples (likely problem, severity 3): display delayed.
- Outlying category (definite problem, severity 4): display delayed.
- Outlying gesture (possible problem, severity 2): display delayed.
- Recognizer similarity (definite problem, severity 4): display delayed.

Table 6-2 Notice summary. Each entry gives the notice name, its general type, its severity level (higher is more severe), and when it is displayed.

6.7.4.4 Severity

Initially, all notices had the same severity. However, it became clear in pilot testing that designers need to be told that some notices are more important than others, so that they do not spend time on unimportant ones. An implementation issue was whether to use subclasses for different severities. The advantages of using subclasses are:

- Severity can be checked with instanceof.
- Common functionality can be unified (e.g., getSeverityLevel(), getIcon()).
- It will be easier to renumber severity levels based on general type, if necessary.

Disadvantages of using subclasses are:

- Extra complexity, since it adds classes to the system.
- Adding a new severity level adds a new class.

A summary of the types of notices, their general types, their severity levels, and when they should be displayed is given in Table 6-2. (For a discussion of when notices should be displayed, see 6.5.2.3, p. 132.) We decided that the advantages of subclasses outweighed the disadvantages, so we made classes for the different severity levels and subclassed those for each notice type. The classes that make up the notice hierarchy are shown in Figure 6-20, along with classes used to help display notices.
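The subclass approach can be illustrated with a skeleton of the hierarchy in Table 6-2 and Figure 6-20. The class names below appear in the dissertation's class diagram and table, but the method names, severity accessors, and bodies are a sketch under that assumption, not quill's actual code.

    /** Sketch of a notice hierarchy with severity encoded in abstract subclasses. */
    abstract class AbstractNotice {
        abstract int getSeverityLevel();
        abstract String getDescription();
    }

    abstract class AbstractInfoNotice extends AbstractNotice {
        int getSeverityLevel() { return 1; }   // informational
    }

    abstract class AbstractPossibleNotice extends AbstractNotice {
        int getSeverityLevel() { return 2; }   // possible problem
    }

    abstract class AbstractLikelyNotice extends AbstractNotice {
        int getSeverityLevel() { return 3; }   // likely problem
    }

    abstract class AbstractDefiniteNotice extends AbstractNotice {
        int getSeverityLevel() { return 4; }   // definite problem
    }

    /** One concrete notice type per row of Table 6-2, for example: */
    class DuplicateNameNotice extends AbstractInfoNotice {
        private final String name;
        DuplicateNameNotice(String name) { this.name = name; }
        String getDescription() { return "Two gesture categories are both named \"" + name + "\"."; }
    }

    class RecognizerSimilarityNotice extends AbstractDefiniteNotice {
        private final String categoryA, categoryB;
        RecognizerSimilarityNotice(String a, String b) { categoryA = a; categoryB = b; }
        String getDescription() {
            return categoryA + " and " + categoryB + " may be confused by the recognizer.";
        }
    }

Severity can then be checked either with instanceof (e.g., instanceof AbstractDefiniteNotice) or through getSeverityLevel(), as noted in the advantages above.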

Figure 6-20 quill notice classes.

172 6.7.5 Annotations Displays of gesture objects in quill may be annotated for several reasons, such as having a notice on it or its descendent, being correctly recognized, being selected, or being disabled. A problem that arose early in quill was how to implement different annotations in a flexible, scalable, maintainable way. This problem is complicated by the fact that some annotations are specific to a particular view of a gesture object, such as whether it is selected or not, and some are independent of view, such as whether the gesture object is disabled. 6.7.5.1 Where to store annotations The general property mechanism added to gesture objects in quill was the obvious way to store view-independent annotations. View-dependent information is stored with the class that implements the view itself. 6.7.5.2 Where to put the code Originally, we had hoped to develop a generic annotation mechanism that would allow annotations to be added to displays of gesture objects and composed in a general way. However, due to the difficulty of designing such a system, how complex it was almost certain to be, and the fairly small number of annotations in quill, we decided to use a monolithic strategy, instead. Every class that shows a view of gesture objects knows about all annotations that could be displayed on those objects, except for selection. Selection was common enough that it was factored out into a general selection class. 6.7.6 Layout issues Two problems arose when implementing a widget layout for the desktop section of the display as they were originally designed in the sketches (for example, Figure 6-6 and Figure 6-12). 6.7.6.1 Desktop windows We wanted users to be able to have multiple windows open on the desktop at a time and have control over their size. At the same time, we did not want users to have to spend a lot of time managing the windows. To reduce management, we originally decided to allow windows to resize only vertically and to keep them tiled (see Figure 6-12 for a desktop with two windows on it). 157

Unfortunately, it turned out to be extremely difficult to implement this. The built-in behavior of windows on a desktop is like the standard multi-document interface (MDI) that many Microsoft applications use (which is similar to many windowing systems): windows can be moved and resized arbitrarily and can overlap. We attempted to override this behavior, but it was hard to make it exactly right, and some early pilot testing indicated that users might not find it intuitive since it was different from the standard. For these reasons, we decided to abandon the custom window layout policy and use the standard MDI policy. The high-level classes used to display gesture objects on the desktop are shown in Figure 6-21. Low-level classes, such as GestureDisplay, which draws an individual gesture, are not included because they are part of a generic gesture-manipulation package we developed, and are not part of quill proper.

Figure 6-21 quill classes for displaying gestures on the desktop.

6.7.6.2 Scrollable lists of gesture objects

In general, quill cannot display all the gesture objects in a container (e.g., gestures in a gesture category) at once. Those objects must be in a scrollable list, and to save screen space they should be laid out left-to-right and wrap around, the way that text does in a text

174 editor. One might expect this feature to be easy to implement but it turned out to be surprisingly difficult with our toolkit (Java/Swing 1.3). After significant work, we arrived at a solution that does most of what we wanted. Unfortunately, it was unable to produce the layout shown in Figure 6-11. Instead, we decided that a view of a gesture set will show each group as a thumbnail gesture that is representative of the group (specifically, the first gesture from the first gesture category in the group), as in Figure 6-5. 6.7.7 Factory-style display generator A common operation that many parts of the quill code perform is to generate a view of a gesture object, sometimes of the entire object, sometimes a thumbnail of it. There are many different kinds of gesture objects, some of which are displayed differently as thumbnails vs. whole and some which are not. A factory design pattern [GHJV95] was used to provide flexibility in binding gesture objects to the objects that display them. That is, the factory class, DisplayFactory, takes a gesture object and whether it should be displayed normally or as a thumbnail. The factory class returns a widget object that is able to display the specified gesture object appropriately, based on its class and whether it should be a thumbnail or not. For example, if a GestureCategory object were passed to DisplayFactory, it would return a GestureCategoryPanel object, which is a widget that displays a GestureCategory. 6.8 Summary The design of quill is based on gdt, on our evaluation of gdt, and on the results of our experiments on gesture similarity1. It includes active feedback and a greatly improved interface. Its features include: gesture set management, including test set support; active feedback; and natural visual display for gestures. Managing active feedback appropriately involved many challenges in interface design and implementation. The next chapter describes an evaluation of quill. 1. quill is available for download from http://guir.berkeley.edu/quill. 159

Chapter 7  quill Evaluation

We believe that quill is a substantial improvement over gdt and existing tools for designing gestures. A human factors experiment was performed to evaluate the usability of quill and determine whether quill helps designers produce measurably better gestures. This experiment had two goals: 1) to observe designers using quill and qualitatively assess their usage and experiences and 2) to determine whether gesture sets designed with the aid of the active feedback in quill are measurably better than gesture sets designed without active feedback. We judged gesture set quality in terms of how confusable the gestures would be for the recognizer and how similar people would perceive them to be, based on our human similarity metrics.

The rest of this chapter describes the experiment. The first section describes the participants, and the next the equipment. Next, the procedure is described. Then the quantitative results and analysis are given, which include the finding that the human goodness of a participant's gesture set correlated with their overall rating of quill. Following are the qualitative results, which include the observation that some participants understood and used feedback productively, but some participants did not understand it. The results are then discussed. The next section describes a follow-up experiment that could be performed to further investigate the utility of active feedback in quill. The chapter concludes with a summary.

7.1 Participants

We wanted to find only professional designers. We chose to carry out the experiment at the Microsoft Usability Lab because they were able to recruit the participants we wanted. The Microsoft Usability Lab recruited 11 participants: 10 professional user interface designers and one professional web designer.1 Participants were given $50 gift certificates2 for their participation.

1. Originally, 13 participants were recruited, but one canceled and another did not show up.

176 7.2 Equipment Participants used quill running on a Fujitsu Lifebook S Series laptop computer. They primarily interacted with the computer using a Wacom PL-300 display tablet, although they also used the keyboard. A scan converter and VCR were used to record the screen, and a Hi-8 camera on a tripod recorded the participants. To fill out the post-experiment questionnaire, participants used Internet Explorer running on a different Windows computer. 7.3 Procedure The experimental procedure consisted of four parts: 1) a training exercise, 2) the first experimental task, 3) the second experimental task, and 4) a post-experiment questionnaire. The two experimental tasks were a long gesture design task and a short gesture design task. For each participant, the active feedback in quill was enabled for one task, and for the other task it was disabled. The order of tasks and the feedback condition were randomized so that participants were evenly divided among the conditions (i.e., Latin Square design). Although each participant performed both tasks, it was not a true within-subjects design, because the tasks were not of equal difficulty or length, so performance across them could not be compared. In the future, a strict within-subject design should be followed, where both tasks are as equal in length and difficulty as possible. Before the tasks began, participants were given an opportunity to get comfortable with the physical setup of the display tablet and chair. Then participants were given an overview of the experiment to read themselves (see F.1, p. 276). The training exercise was to read the quill tutorial (see E.1, p. 242) and perform the tasks listed in it, which were short. Participants retained the tutorial for the remainder of the experiment. The long experimental task was to design a gesture set for a presentation editing application (e.g., Microsoft PowerPoint). The gesture groups and categories are shown in 2. Participants chose whether they would receive a gift certificate from either Eddie Bauer or Tower Records and Video. This type of compensation is standard at the MS Usability Lab. 161

Figure 7-1. Participants were given at most one and a half hours to complete this experimental task. After that time they were told to quit even if they had not finished creating and editing all the gestures.

Figure 7-1 Gesture groups and categories for the long experimental task (presentation editor):
- Outline: New bullet, Select item, Indent, Unindent
- Font: Increase font size, Decrease font size, Change font, Bold, Italic, Underline
- Format: Increase line spacing, Decrease line spacing, Left justify, Center justify, Right justify
- Misc: Insert picture, New slide

The short experimental task was to create a gesture set for a web browser, with the groups and categories as shown in Figure 7-2. Participants were given one hour to complete this task.

Figure 7-2 Gesture groups and categories for the short experimental task (web browser):
- Navigate: Back, Forward, Home, Reload, Email page
- Bookmarks: Add bookmark, Edit bookmarks
- Misc: Add annotation

After the second experimental task, participants filled out an online questionnaire using a different computer than the one used for the other tasks (to avoid negative effects suggested by [RN96]). At the end, participants were given a post-experimental handout that gave more details about quill and the experiment. Also, the experimenter answered any remaining questions they had.

7.4 Quantitative Results and Analysis

The quantitative data from one participant, #4, had to be removed from consideration, because he did not follow instructions. He discovered a flaw in the human goodness metric that caused its value to be inflated when a very small number of training examples are entered (3 to 4 per gesture category). The experimenter instructed him to enter the same number of examples as the other participants for consistency, but he did not.

Three dependent variables were measured for each task for each participant:
1. The human goodness of the gesture set
2. The recognizer goodness of the set
3. The time to finish the task

The data for the tasks are summarized in Table 7-1. A multivariate ANOVA was performed to find significant differences due to a) feedback, b) whether the task was first, and/or c) the task (long vs. short). Few statistically significant effects were found. The long task took participants significantly longer to perform than the short task (51.1 vs. 22.7 minutes, p < 0.0092), and had significantly lower human goodness (690 vs. 940, p < 0.011), with or without active feedback.

Table 7-1 Means for tasks in the quill evaluation. Goodness is on a scale of 0 to 1000, where 1000 is perfect. Time is in minutes. Each entry gives the mean for tasks performed first = No / first = Yes / overall.

Human goodness, long task: without feedback 667 / 550 / 637; with feedback 550 / 900 / 725; overall 608 / 812 / 690.
Human goodness, short task: without feedback 1000 / 800 / 900; with feedback 1000 / 1000 / 1000; overall 1000 / 900 / 940.
Recognizer goodness, long task: without feedback 999 / 997 / 998; with feedback 996 / 992 / 994; overall 998 / 993 / 996.
Recognizer goodness, short task: without feedback 999 / 999 / 999; with feedback 1000 / 1000 / 1000; overall 999 / 999 / 999.
Time (minutes), long task: without feedback 47.0 / 44.0 / 46.2; with feedback 43.7 / 65.0 / 54.3; overall 45.3 / 59.8 / 51.1.
Time (minutes), short task: without feedback 16.0 / 25.7 / 20.8; with feedback 13.0 / 29.7 / 25.5; overall 15.2 / 27.7 / 22.7.

Figure 7-3 Interaction between feedback and task order. Values are average human goodness (scale is 0 to 1000, higher is better). (The plot shows human goodness versus the With Feedback condition, with separate lines for tasks performed first and second.)

Also, when the long and short tasks were considered together, there was a strong interaction between task order and whether the task had feedback, as shown in Figure 7-3. For tasks performed first, the human goodness was higher if the task had feedback than if it did not. However, when the second task had feedback, it had slightly worse human goodness than when it did not. The effect of feedback alone was not statistically significant, and was positive in some cases and negative in others. In the long task, it had a positive effect on human goodness overall, but not for those who had feedback on their second task. In the short task it had a positive effect on human goodness for the first task, and no effect for the second (because human goodness was at the maximum regardless of feedback). Recognizer goodness was slightly worse with feedback in the long task, but slightly better with feedback in the short task.

Quantitative questions from the post-experimental questionnaire are summarized in Table 7-2. The questionnaire had eight sections: 1) overall reactions to quill, 2) suggestions, 3) learning, 4) system capabilities, 5) tutorial, 6) long task, 7) short task, and 8) demographics.

Overall reactions to quill:
- Q1 (terrible, wonderful)
- Q2 (frustrating, satisfying)
- Q3 (dull, stimulating)
- Q4 (difficult, easy)
- Q5 (inadequate power, adequate power)
- Q6 (rigid, flexible)

Suggestions:
- Q8 (terrible, wonderful)
- Q9 (frustrating, satisfying)
- Q10 (dull, stimulating)
- Q11 (difficult, easy)
- Q12 (inadequate power, adequate power)
- Q13 (rigid, flexible)

Learning:
- Q15 Exploration of features by trial and error (difficult, easy)
- Q16 Remembering names and use of commands (discouraging, encouraging)
- Q17 Tasks can be performed in a straightforward manner (difficult, easy)
- Q18 Tasks can be performed in a straightforward manner (never, always)

System capabilities:
- Q20 System speed (too slow, fast enough)
- Q21 The system is reliable (never, always)
- Q22 Correcting your mistakes (easy, difficult)
- Q23 Ease of operation depends on your level of experience (never, always)

Tutorial:
- Q25 The tutorial is (confusing, clear)
- Q26 Information from the tutorial is easily understood (never, always)

Long task:
- Q28 The task was (confusing, clear)
- Q29 Overall, the task was (difficult, easy)
- Q30 Thinking of new gestures was (difficult, easy)
- Q31 Finding recognition problems was (difficult, easy)
- Q32 Fixing recognition problems was (difficult, easy)
- Q33 Entering new gesture categories was (difficult, easy)
- Q34 Testing recognizability of gestures was (difficult, easy)

Table 7-2 Summary of questions in post-experiment survey. (For each question, the scale endpoints for answers 1 and 9 are shown in parentheses.)

Short task:
- Q36 The task was (confusing, clear)
- Q37 Overall, the task was (difficult, easy)
- Q38 Thinking of new gestures was (difficult, easy)
- Q39 Finding recognition problems was (difficult, easy)
- Q40 Fixing recognition problems was (difficult, easy)
- Q41 Entering new gesture categories was (difficult, easy)
- Q42 Testing recognizability of gestures was (difficult, easy)

Demographics:
- Q44 How many kinds of PDAs have you regularly used (including your current one, if any)?
- Q45 How many years has it been since you first used a PDA regularly?
- Q47 For how many months have you used your current PDA?
- Q48 How often do you use your PDA?
- Q49 What is your age (in years)?
- Q57 How many years have you spent designing UIs?

Table 7-2 Summary of questions in post-experiment survey. (Continued)

Figure 7-4 Example questions from quill post-experimental questionnaire.

Questions in all sections except the last asked participants for ratings on a 9-point Likert scale with different scales for each question. For example, question 1 asked participants what their overall reaction to quill was on a scale of terrible (answer 1) to wonderful (answer 9), as shown in Figure 7-4. For the full questionnaire, see F.6, p. 283. Answers from the post-experimental questionnaire were correlated with performance (i.e., human goodness, recognizer goodness, and time). Correlations and the significance levels

of questions with statistically significant correlations are given in Table 7-3. (See F.7, p. 290 for the full table.)

Table 7-3 Correlations of questionnaire answers with performance. For each question, the Pearson correlations with human goodness, recognizer goodness, and time are listed, followed by the corresponding two-tailed significance levels. Statistically significant correlations are marked with an asterisk. (Responses to questions on the long task and the short task (28–34 and 36–42) are correlated with performance on the long and short tasks, respectively.)

- Q1: correlations 0.767*, 0.205, -0.222; significance 0.010, 0.570, 0.538
- Q5: correlations 0.902*, 0.704*, -0.560; significance 0.000358, 0.023, 0.092
- Q28: correlations -0.115, 0.808*, 0.393; significance 0.752, 0.005, 0.261
- Q31: correlations -0.439, 0.134, 0.836*; significance 0.205, 0.712, 0.003
- Q33: correlations 0.327, 0.757*, 0.130; significance 0.356, 0.011, 0.721
- Q34: correlations 0.504, 0.843*, 0.461; significance 0.167, 0.004, 0.212
- Q40: correlations 0.452, 0.747*, -0.192; significance 0.261, 0.033, 0.649
- Q49: correlations -0.296, -0.828*, 0.163; significance 0.406, 0.003, 0.653

As shown in the table, performance was significantly correlated with several questions about the tasks and quill. Human goodness of produced gesture sets correlates with overall reaction to quill as terrible vs. wonderful and as inadequate power vs. adequate power. Recognizer goodness correlates with several questions about the long task, but with only one about the short task. This disparity is probably due to the nearly uniform high scores on the short task. Interestingly, younger people created gestures with significantly higher human goodness than older people. Human goodness is based on similarity metrics (described in Chapter 4) whose data were mostly acquired from college students. The negative correlation of age with human goodness in the current experiment may indicate an age effect in the similarity metric.

7.5 Qualitative Results

This section describes common themes derived from comments participants made during the experiment and on the post-experiment questionnaire. Participants had many comments about the suggestions quill gave about how to improve their gestures. Some participants read, understood, and used the feedback productively.

However, some participants disagreed with some of the feedback, especially for gestures that were letters or contained letters, probably because the human similarity metric is derived from non-letter data. Other participants did not understand what quill was trying to tell them. In describing what was wrong with the gestures, quill used some mathematical terms, which prompted comments from some participants that the feedback was too technical, such as "That's not too good for humans; [it] might be good for mathematicians." quill used diagrams to explain what the technical terms meant, but a few participants said that it would have been more useful if it had been more visual. One participant suggested explaining the geometric features (e.g., size of bounding box and sine of initial angle) using the actual gestures the user was working on, rather than stock images: "The error/suggestion text could potentially show my problem gesture with the error highlighted or overlaid."

Few participants read the quill tutorial (see Appendix E) carefully. As an example of the problems this behavior caused, most participants skipped a step instructing them to turn off feedback that was labeled "Before you begin" in bold. A few participants had to be reminded to do the tasks in the tutorial, and one had to be reminded twice. A few participants thought that the tutorial was overly wordy, which may explain why most participants did not read it carefully.

The naming of objects in quill (i.e., gesture set, gesture group, gesture category, gesture) confused a few participants. One said, "Having gesture in all the names makes the names difficult to distinguish." Another commented, "Test Set and Training set are too similar - probably because there are always tests in training courses and tests are a part of training." Some participants seemed unclear on the difference between the training set and test sets. The tutorial discusses test sets and gives an example of their use, but this explanation was clearly inadequate for some participants. (Or perhaps they did not read it carefully enough.)

Participants had several suggestions about how to improve the quill interface:

- When something is selected in the tree, display it on the desktop automatically. Currently, the user creates new views of objects explicitly with the menu item View/New View. We discussed this feature in the design of quill, but decided

184 that explicitly creating views was better, which may not have been the right decision in light of this experiment.
- Right-button context menus to operate on objects.
- Support for undo/redo.
- Synchronized selection between tree and desktop. (Currently, there is an active selection in the tree or the desktop, but not both.)
- Add a time stamp to the suggestions, so that it's easier to tell what is current.

A few participants wanted more information on or support for a larger context. For example, one participant wanted more information about the application in which the gestures would be used. Another wanted to be able to run the application for which the gestures were being designed and try out the gestures in context. She believed it would help her discover whether the gestures would be memorable and whether they made sense.

7.6 Discussion

gdt provided a small amount of help to designers in figuring out how to make their gestures easier for the computer to recognize, but its tables of numbers were incomprehensible to most users. The suggestions in quill were designed to be more accessible by putting the information in English and supplementing it with pictures. In spite of this effort, many participants did not understand the suggestions, and the suggestions did not consistently help participants make better gesture sets. We believe the suggestions can be made more accessible by changing the language to be less technical, by using more diagrams, and by using dynamically generated diagrams based on the actual gestures entered by the designer.

We also believe that more training would be helpful in getting the most value from quill. The tutorial used for the experiment did not cover all the features in quill, and those that it did include were not covered in depth. We decided to keep the tutorial as short as possible to keep the length of the experiment as short as possible. (The long task took some participants nearly ninety minutes. A few participants took over 2.5 hours for the whole experiment.) In retrospect, participants may have been able to use quill more effectively if they had gone through longer training. In particular, some participants clearly could have 169

185 benefited from more time spent on testing vs. training sets and on a step-by-step example of how to use quill's suggestions. A professional technical writer helped edit an early version of the tutorial. Nevertheless, most participants did not read it carefully or absorb it well, so it is unclear how much a longer tutorial would have helped.

Overall, the effect of feedback was mixed and in no case statistically significant. For the first task, feedback had a positive effect, but when feedback was provided in the second task, it had no effect or a negative effect. Unfortunately, it is impossible to determine the effect of feedback using a within-subjects analysis due to the difference in difficulty between the long and short tasks.

There was great variance among participants in their understanding of quill's feedback. Although it seemed more approachable than the tables and graph in gdt, it was still too technical for many participants.

Several subjective judgements about quill that were collected in the post-experiment questionnaire correlated with performance measures. For the long task, higher recognizer goodness correlated with thinking that the task was clear as opposed to confusing, that entering new gesture categories was easy, and that testing recognizability of gestures was easy. Also, people with a higher overall reaction to quill and people who thought quill had adequate power created gestures with higher human goodness. These correlations are not surprising, since we would expect that people who do better on the task would think it is easier. What is surprising is that people who took longer to perform the long task also tended to think that finding recognition problems was easy. We might expect that people who easily find recognition problems finish the task more quickly.

A problem with the human similarity suggestions is that they were not always right. The models used in quill to predict human similarity are not perfect, and participants rightly disagreed with them at times. quill seemed especially prone to overestimate similarity when a gesture was or contained a letter. This flaw is probably due to the similarity models being based entirely on non-letter gestures. It seems likely that people would perceive letter and non-letter shapes differently. A small difference in a non-letter shape might be perceived by people as a large difference in a letter. Ideally, quill would have separate similarity models for letter and non-letter gestures. Future experiments to measure similarity of letters are required to provide models for letter similarity. 170
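The separate-model idea can be made concrete with a small sketch. The following is illustrative only and is not quill's implementation: the function names, the letter-likeness test, and the warning threshold are all hypothetical, and a real letter model would require the letter-similarity data described above.

```python
# A minimal sketch (not quill's actual code) of routing a gesture pair to a
# letter-specific similarity model when either gesture looks like a letter,
# and falling back to the existing non-letter model otherwise.

SIMILARITY_WARNING_THRESHOLD = 0.5   # assumed scale: 0 = very dissimilar, 1 = identical

def similarity_warning(g1, g2, nonletter_model, letter_model, is_letter_like):
    """Return a warning message if a pair of gestures is predicted to look
    confusably similar to people, or None otherwise."""
    if is_letter_like(g1) or is_letter_like(g2):
        # Use the (hypothetical) letter model when available; otherwise fall
        # back to the non-letter model, accepting that it may overestimate
        # similarity for letter shapes, as observed in the evaluation.
        model = letter_model if letter_model is not None else nonletter_model
    else:
        model = nonletter_model
    score = model(g1, g2)
    if score >= SIMILARITY_WARNING_THRESHOLD:
        return "These gestures may look very similar to people; consider changing one of them."
    return None
```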

186 Naming has been an issue with quill since the first version of the tutorial. There was much discussion with the technical writer who helped edit the tutorial about how to name the different gesture containers. The present system was the best that we could create, but it was confusing to some participants. A diagram of the relationship between the different containers might be helpful, although it would lengthen the tutorial.

There is a long list of features that quill was designed to have, and many of them have not been implemented due to lack of time. Most of the suggestions for improvement that participants made were already on the list of desired features, but there were a few new ones, such as timestamping the suggestions. Also, some issues were illuminated by participants' comments. For example, we were unsure how best to handle selection across the tree and desktop, but based on this evaluation it seems clear that they should be synchronized.

One important change that needs to be made to quill before another evaluation is to fix the goodness metrics so that they are not biased toward few training examples per gesture category. The way the current metrics are computed, there is a penalty for training examples that are bad in some way (e.g., misrecognized). Thus, the more training examples, the more opportunity for a penalty on goodness. However, maximizing the current goodness metrics will result in fewer examples per category (around 2-3) than is ideal for the recognizer (typically 10-15). A possible approach is to penalize goodness based on the fraction of problematic training examples rather than the absolute number of them.

In retrospect, based on the results of this experiment, there are several aspects that should have been done differently. The single most important aspect is to enable within-subjects comparisons, which was not possible in this experiment because the difficulty of the long and short tasks was too different. Differences between conditions for the same participant could be attributed to different difficulties of the task instead of to an effect of having feedback or not. Any future evaluation of quill should use two tasks of equal difficulty so that a strict within-subjects comparison can be done. Another problem with this experiment is that the short task was too easy, as shown by the high goodness scores. In contrast, the scores in the long task did vary, so it is probably an 171

187 appropriate length. Also, the tutorial needs to be improved, both in terms of including more information and in making it easier to read and absorb.

It is difficult to determine the ideal length for an evaluation of quill. On one hand, quill is not intended to be walk-up-and-use, but rather to be a tool like Photoshop, which designers use extensively and with which they become familiar. Therefore, an experiment such as the one we performed, in which participants spend approximately twenty minutes learning the tool, is unrealistic. On the other hand, it is difficult to perform an experiment with expert users of a research system. Ideally, the experiment would include a longer training session so that participants could learn the tool better. Unfortunately, the longer the experiment is, the more difficult it is to recruit participants, especially when the best participants are highly trained professionals.

7.7 Proposed quill Experiment

To further investigate the effect of feedback on gesture set quality, we designed another experiment to evaluate quill that we have not carried out. This section describes the new experiment and discusses the issues involved in designing it. The goal of this experiment is to determine how effective active feedback is in helping designers create good gesture sets. It differs from the previous experiment primarily in that it uses a within-subjects design.

7.7.1 Participants

Students who have taken a UI course or UI professionals will be recruited.

7.7.2 Method

The experiment has a training phase, followed by two experimental tasks. In the training phase, the participant works through a tutorial that shows how to use quill. (The tutorial is an expanded version of the one used in the first experiment.) It will probably take 30-40 minutes.

In each of the two experimental tasks, each participant will be given a gesture set that has known problems, and will be asked to improve it. For one task, a participant will use quill without suggestions and for the other task will use quill with suggestions (on a different 172

188 gesture set). The order of suggestions/no suggestions, the order of gesture sets, and the mapping of tool to gesture set will be randomized (via a Latin square design). Participants will be asked to improve the gesture set until they believe they can make no more progress. (Based on the previous experiment, this will probably take between one and two hours.) They will be instructed to keep the gestures as much like the originals as possible, since otherwise they might simply discard all of the original gestures and replace them rather than improve them, which would make the task just like creating a gesture set from scratch. quill will give participants feedback about what it thinks is wrong with the gestures and suggestions for improving them. Participants will be encouraged to consider its suggestions, but will not be required to abide by them.

The two gesture sets to be improved will be created beforehand. They will have the same number of groups and gestures, and will have the same number and types of problems. As in the first quill experiment, participants will fill out a post-experiment questionnaire about their experience with the tasks and the tool, and some basic demographics.

7.7.3 Analysis

Recognizer goodness, human goodness, and task time will be compared across the two experimental conditions (i.e., feedback vs. no feedback). Also, subjective ratings of the tool and the experiment from the post-experiment questionnaire will be compared across the two conditions. The null hypothesis is that the goodness metrics, the time required, and subjective ratings are identical for the feedback and no-feedback conditions.

7.7.4 Experimental Design

This section discusses issues that arose in designing this experiment.

7.7.4.1 How to evaluate the hypothesis that feedback has an effect

Determining whether or not feedback has an effect on the quality of gestures produced, the time required to design the gestures, or the subjective experience of gesture designers is the central purpose of the experiment. The basic idea is for designers to use quill and either another tool or quill with its feedback disabled to improve two gesture sets, one with each tool, either to some preset quality level or for a certain amount of time. Time, gesture set 173

189 quality, and/or subjective evaluation of the tools by participants will be measured and compared.

7.7.4.2 How to measure gesture set quality

Intuitively, we want to know if quill helps designers make their gesture sets better, but it is not obvious what "better" means. We considered the following metrics:
1. Recognizer goodness
2. Human goodness
3. How well the computer can recognize test gestures drawn by the designer
4. How well other people can learn and remember the gestures

For this experiment, we think the recognizer and human goodness are the best options. Recognizer goodness is easy to implement and it predicts how easily the gestures will be recognized. Human goodness has the disadvantage that it is based on gesture similarity and not directly on memorability or learnability, but it may correlate with memorability and learnability.

Another possibility is to measure how well the computer can recognize test gestures drawn by the designer, as in the experiment with gdt (see 3.3, p. 51). However, we think this metric should not be used because it introduces individual variation, since some participants may draw test gestures more carefully than others.

We decided against option #4 for two reasons. One is that it requires a great deal of time and effort and could not realistically be used to evaluate all gesture sets from the experiment. The other reason is that the effect of quill on learnability and memorability is probably small and hard to measure, because its only mechanism for this effect is its similarity metric, which does not correlate perfectly with perceived similarity. Even if the similarity metric were perfect, the memorability experiment (see Chapter 5) showed that the match between gesture name and shape has a large effect on learnability and memorability, and this effect is likely greater than that of geometric similarity.

7.7.4.3 Stopping condition

Another issue is how to decide when the participant should stop trying to improve their gesture set. We considered these options:
1. Time limit 174

190 2. Gesture set quality, as measured by the metrics
3. When the participant is happy with the gestures

We decided on #3 rather than #1 because we found it difficult to motivate participants in the first quill experiment to use all the time allotted for the experiment. We are ethically obligated to allow participants to leave at any time, and to tell them they may leave at any time, so without a good reason to continue, participants will stop when they feel their gestures are good enough, or if they cannot see a way to make them better. Option #2 is tempting because it is not subjective, but the human similarity metric in quill makes mistakes too often. Also, if there is not adequate information about how to get a higher score on the metric, participants will get frustrated and give up, which some participants did in the no-feedback condition in the first quill experiment.

7.7.4.4 What participants

We considered these groups of participants:
1. Professional UI designers
2. Students who have taken a user interface course (or otherwise have UI background)
3. Any student

In an ideal world, we would recruit only professional UI designers, since they are the target audience for quill. Unfortunately, they are too hard to recruit exclusively, so we think students with UI experience would be acceptable participants, as well. Students with no UI experience are not suitable, however, because they are too far from the target quill user, and because they might not understand quill or the task.

7.7.4.5 What to compare quill with?

An important issue is what tool quill should be compared with in the experiment. We considered using these tools:
1. Agate (the recognizer training tool from Amulet)
2. gdt (our prototype gesture design tool)
3. quill with feedback disabled 175

191 The third option is the best one, because it focuses the experiment on the feedback, which is the unique contribution of quill. A comparison with other tools might only prove that quill has a better UI, since it inherits years of usage experience with the first two tools.

7.7.4.6 What task

It is very important to choose a good task for this experiment. We considered two types of tasks: (a) a straightforward comparison of quill and quill without feedback, in which each is used for equivalent tasks, and (b) using quill without feedback to create a gesture set and then improving it with quill. We decided that option (a) would be better because it is a fairer comparison of quill and quill without feedback, since both are being used for the same kind of task.

We considered several tasks:
1. Create entirely new gesture sets
2. Improve existing gesture sets by changing the gestures in the set
3. Take existing sets and only add gestures
4. Take existing sets and improve them by adding gestures and possibly changing original ones

We decided on #2 because improving gesture sets is what quill is supposed to help designers do. Also, it allows us to avoid a problem that occurred in the first quill experiment: some participants in the feedback condition received little feedback, because quill did not find any problems with their gestures. By using an initial gesture set known to have problems, we can ensure that participants in the feedback condition actually get feedback.

7.7.4.7 How much to improve quill before this experiment

In the first experiment, it became clear that quill has a number of usability issues. Before another experiment, it would be worthwhile to test its usability more and improve the quill user interface.

7.7.4.8 Within- or between-participants?

In light of individual differences shown in the first experiment, the second experiment must be within-participants. Therefore, the two tasks done with each tool should be made as nearly identical in difficulty as possible. 176
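To make the counterbalancing described in 7.7.2 concrete, here is a minimal sketch of one way to rotate condition order and gesture-set assignment across participants in the within-subjects design. It illustrates the Latin-square idea only and is not the actual experiment code; the set names and the round-robin cycling scheme are assumptions.

```python
# Every participant does both conditions (feedback on / off) and both prepared
# gesture sets, one set per condition. The four (condition order, set-to-
# condition mapping) combinations are rotated across participants.

from itertools import product

CONDITIONS = ("feedback", "no_feedback")
GESTURE_SETS = ("set_A", "set_B")   # hypothetical names for the two prepared sets

# Four counterbalancing cells: which condition comes first, and which set is
# paired with the feedback condition.
CELLS = list(product([CONDITIONS, CONDITIONS[::-1]], GESTURE_SETS))

def assignment(participant_index):
    """Return the ordered (condition, gesture set) pairs for one participant."""
    (first_cond, second_cond), feedback_set = CELLS[participant_index % len(CELLS)]
    other_set = GESTURE_SETS[1] if feedback_set == GESTURE_SETS[0] else GESTURE_SETS[0]
    set_for = {"feedback": feedback_set, "no_feedback": other_set}
    return [(first_cond, set_for[first_cond]), (second_cond, set_for[second_cond])]

# Example: the schedule for the first four participants covers all four cells.
for p in range(4):
    print(p, assignment(p))
```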

192 7.7.4.9 How much training on quill to give

quill should be easy to use. However, gesture design is a complicated task and quill is a complicated tool. We believe that one reason that feedback did not help more in the first quill study is that participants did not have much training with quill. In the real world, it is reasonable to expect designers to spend more than 20-30 minutes learning quill. To keep the experiment from lasting an excessively long time, some potentially helpful features were not explained in the tutorial, such as being able to disable gesture categories without removing them. Also, some topics clearly need to be explained more for people to understand them, such as training vs. test sets.

Unfortunately, a longer tutorial would make the experiment even longer, which would reduce the number of participants we could run and make it harder to recruit participants. In the first experiment, most participants did not read the tutorial carefully, and some skipped exercises in the tutorial and had to be instructed to do them. This problem would likely increase with a longer tutorial. However, in spite of the problems associated with a longer tutorial, we think one is required for participants to truly benefit from quill, and also for the evaluation to be fair and realistic.

7.8 Summary

This chapter described a study of the usage of quill by ten professional user interface designers and one web designer. In this study, each participant created two gesture sets with quill, one large set and one small set. Each participant had feedback enabled for one task, but not for the other.

All participants were able to create gesture sets using quill, but the effect of feedback was mixed and not statistically significant. Some participants were helped by quill's active feedback, but some were not. Also, feedback was helpful for the first task, but for the second task it was either neutral (for the long task) or slightly detrimental (for the short task). The unclear effect of feedback is due in part to the feedback being too technical for some participants. The feedback needs to be improved so that it is more accessible to non-technical designers. 177

193 The experiment also highlighted shortcomings in the similarity metric. People did not always agree with the similarity metric, especially about its judgements of gestures that included letter shapes. The similarity metric needs to include a component that measures the similarity of letters or letter-like shapes, which will require collecting data about their perceived similarity. We also found aspects of the experiment that should be done differently in future evaluations. The most significant improvement is to change to a strict within-subjects experimental model, which was not possible in this experiment due to the differing difficulty of the long and short tasks. 178

194 Chapter 8

Future Work

There are several areas for future work related to this dissertation. One is further evolution of gesture design tools. The experiment on gesture design with quill suggested a number of ways in which it could be improved. The most important of these is making the suggestions easier to understand, and bringing them to the designer's attention at the appropriate time.

There is also more work to be done on human perception of gesture similarity. Similarity experiment 3 (4.3, p. 89) collected a lot of data, but little from each individual. It would be useful to have a study where more data was collected for each participant, to reveal differences between individuals. Both the triad approach, used in similarity experiments 1 (4.1, p. 64) and 2 (4.2, p. 81), and the pairwise approach, used in similarity experiment 3, have advantages. Triads allow a larger number of gestures to be compared with the same number of judgements. Pairs allow for absolute similarity judgements rather than only relative ones. To know when to provide feedback in a gesture design tool it is very useful to have an absolute measure of gesture similarity, so the pair approach would be better. In similarity experiment 3, judgements tended to cluster towards low similarity. It may be that the 1-5 range was too small and a larger range, such as 1-9, would give finer results.

One area where more data is needed is for gestures that incorporate letters. In the quill evaluation, we observed participants using gestures with letter or letter-like parts. It seems virtually certain that because of experience with letters, people perceive them differently than they perceive non-letter shapes, and judge their similarity differently. The gestures used in our similarity experiments purposely contained no letters. quill should probably differentiate between letter and non-letter shapes and use different similarity metrics for each type of shape.

In addition to more data, another approach is to analyze the similarity data with different perceptual models. The similarity models developed in this dissertation are based on metric similarity functions, specifically the Euclidean distance metric. However, there is 179

195 evidence that other methods, such as fuzzy sets, are a better model for perceptual similarity [SJ99]. This technique may produce a model that predicts similarity more accurately. Another approach is to use machine learning techniques, such as neural networks or support vector machines [DPHS98], to classify gestures as similar or not similar. We decided to use a simple statistical classifier for gesture recognition, because it requires few examples to train. However, we have thousands of data points from the similarity survey, so machine learning may be suitable for this classification task.

Also, this dissertation studied only visual similarity of gestures. It is possible that the motor similarity of gestures would be different from the visual similarity, and would also have an impact on the memorability of the gestures. Related to motor similarity is the motor difficulty of drawing gestures. It would be useful for a gesture design tool to warn a designer if a gesture were difficult or error-prone to draw. This prediction might be accomplished using a modeling approach like the one taken by Accot and Zhai, who have measured and modeled the performance of drawing pre-defined paths [AZ01, AZ99, AZ97].

Gesture memorability is another area for future work. This dissertation attempted to find a link between gesture geometry and gesture memorability, but was unsuccessful. Due to the strong effect of naming on memorability, it is impossible for geometry to predict memorability completely, but it would be very useful to find a model that could predict even part of memorability based on geometry. A possible approach is suggested by the study of handwriting instruction by Karlsdottir [Kar97], who compared the learning times of different styles of handwriting. The study found that styles with regular entry strokes are learned most quickly, and that the fastest to produce are styles with short ascenders and descenders and strokes with curvature that is not too high. It is possible that similar effects would occur when learning gestures, and if so, analysis of these effects would be valuable for a gesture design tool.

This dissertation did not investigate prediction of iconicness, but it would be useful in predicting the memorability of gestures. Using a linguistic analysis of gesture names may enable the tool to understand the names in some way and determine whether the gesture 180

196 matches the name or not. As a very simple example, a gesture whose name contains the word "left" or "right" should probably have a horizontal line segment. Icons are another way concepts are graphically represented in user interfaces. It would be interesting to study how guidelines for icon design might apply to gesture design [Hor94].

gdt and quill were both developed as stand-alone tools, but a designer creating gestures with one of them would not be creating gestures for their own sake. As one of the participants in the quill study said, the tool would be much more useful if it were situated in a broader context. Interesting future work could be done to integrate quill into a pen-based user interface framework, possibly with a tool such as SILK [LM95, LM01]. Also, the interfaces for both tools were designed to be as modeless as possible, so that designers could take any action at any time. However, it may be useful to adopt a more workflow-oriented approach, such as the one Klemmer and colleagues used in SUEDE for helping speech interface designers prototype their interfaces [KSC+00].

Although it is easy to add new features to Rubine's recognizer, gdt and quill use only the default features. In the similarity experiments, we discovered features such as curviness, aspect, and density that could be used in Rubine's recognizer. These features may improve accuracy and may make the recognizer behave more intuitively, since these features are used by humans in their similarity judgements. As well as extending Rubine's recognizer, it would be interesting to extend quill to use a different recognizer, such as one based on neural networks and/or a multistroke recognizer. Changing the recognition technology would change what gesture relationships constitute recognition problems, the way quill detects recognition problems, and the type of advice quill can offer about how to fix them. However, the general quill framework for editing, training, and offering advice could remain the same.

There are a number of features that would be helpful in quill. One would be the ability to specify a set of gesture categories for a command, rather than just one category. This could be helpful for recognition in a case where one gesture category is size independent and others in the set are not. If one gesture category is trained with gestures of greatly varying sizes, the recognizer may have difficulty differentiating other gestures based on size. If the 181

197 designer could specify multiple sizes of gestures for the same command, this problem could be avoided. Currently, this must be done in application code.

Another useful feature is the ability to show advice using gestures the designer has entered as examples, rather than static images. Based on the quill evaluation, we believe this would help designers understand the advice in the context of their own design.

There are also various features that would make quill more usable. For example, some users were frustrated by the time they had to spend managing the many internal windows that quill creates as it is used. quill would be easier to use if it required fewer windows or if it were easier to manage the windows.

A challenging but potentially useful feature for quill is automatic repair of recognition problems. For many recognition problems, quill can determine the gesture(s) causing the problem and can also determine which geometric feature(s) of the gesture(s) need to change and in what direction in order to fix the problem. Currently, the strategy is to tell the designer and let the designer change the gesture, because the designer can keep other properties of the gesture constant, such as its iconicness. However, it might be possible for quill to change the gesture(s) to fix the problem. Suppose gesture g needs to be changed so that feature f is smaller. A simple and general way to choose a new g is to mutate g in many different ways and choose the mutation that is most similar to the original g except that its value for f is smaller (a sketch of this mutate-and-select loop appears at the end of this chapter). One issue with this approach is what the best way to measure similarity is. Two possibilities are the human similarity metric from similarity experiment 1 and recognizer similarity (i.e., distance in recognizer feature space).

It would also be helpful if there were a database of known gestures. New gestures could be automatically compared against this database to find potential conflicts. Also, this database could be searched by a designer by shape or keyword to find gestures that might be useful in a new application. 182
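The mutate-and-select repair idea above can be sketched as follows. Everything in this sketch is an illustrative assumption rather than quill's implementation: bounding-box size stands in for "feature f", random point jitter stands in for the mutation operator, and mean point-to-point distance stands in for the similarity measure.

```python
import math
import random

def feature_size(gesture):
    """Stand-in for feature f: diagonal length of the gesture's bounding box."""
    xs = [x for x, y in gesture]
    ys = [y for x, y in gesture]
    return math.hypot(max(xs) - min(xs), max(ys) - min(ys))

def mutate(gesture, jitter=5.0):
    """Produce a randomly perturbed copy of a gesture (a list of (x, y) points)."""
    return [(x + random.uniform(-jitter, jitter),
             y + random.uniform(-jitter, jitter)) for x, y in gesture]

def dissimilarity(g1, g2):
    """Crude proxy for a similarity metric: mean distance between corresponding
    points (assumes the two gestures have equal-length point lists)."""
    return sum(math.dist(p, q) for p, q in zip(g1, g2)) / len(g1)

def repair(gesture, feature=feature_size, candidates=200):
    """Return the mutated gesture most similar to the original among those
    whose value for the target feature is smaller than the original's."""
    target = feature(gesture)
    best, best_d = None, float("inf")
    for _ in range(candidates):
        m = mutate(gesture)
        if feature(m) < target:
            d = dissimilarity(gesture, m)
            if d < best_d:
                best, best_d = m, d
    return best  # may be None if no candidate reduced the feature
```

In a real tool, the dissimilarity function could be replaced by either of the two measures mentioned above (the human similarity metric or distance in recognizer feature space), and the mutation operator would need to preserve properties such as iconicness.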

198 Chapter 9 Conclusions This dissertation described the motivation, development, and evaluation of quill, an intelligent tool for gesture design. We were interested in gesture design because pen-based computing is becoming more popular and more prevalent, and gestures can be an effective way to interact using a pen-based interface. To determine user opinions of gestures, we surveyed PDA users (see Chapter 2). This survey revealed that people like using gestures, but that they sometimes have difficulty remembering them and the computer often misrecognizes them. We hypothesized that designers of pen-based user interfaces could create gestures that were easier for people to learn and remember and easier for the computer to recognize if they had a tool to support gesture design. We built a prototype gesture design tool, namely gdt (see Chapter 3), and used it to investigate the gesture design process by observing people designing gesture sets. We found that it was difficult for people to create good gesture sets. In particular, we believed it would be useful to designers if the tool gave them active feedback about possible problems and how to fix them. We wanted to create a new gesture design tool that would tell designers when their gestures might be difficult for the computer to recognize or difficult for people to learn and remember. However, we did not have a way to measure difficulty of learning and remembering gestures. People sometimes find it difficult to remember items if they are too similar to one another, so we decided to investigate why people perceive gestures to be similar. We performed three experiments to measure human perception of gesture similarity. From these experiments, we derived three computational models of gesture similarity based on gesture geometry, the first and last of which we decided to use (see Chapter 4). The first model predicts how similar a pair of gestures will seem to people relative to another pair of gestures. The second model predicts whether or not a pair of gestures will be perceived as similar by people. These models are used in our second gesture design tool, named 183

199 quill, to warn designers when their gestures will seem similar to people. They are also used to tell the designer how to make their gestures seem less similar to people.

We also performed an experiment to determine what factors affect gesture memorability. This experiment showed iconicness (that is, how well the gesture and its name matched) to be the single most important factor for gesture memorability, which is consistent with memorability of other objects. We were unable to derive a model for predicting memorability because iconicness is not something we can measure (at least, not yet).

Using the results of the gdt study and the similarity experiments, we designed, prototyped, and built quill, an intelligent gesture design tool. It gives unsolicited feedback to designers about potential problems with their gestures and advises designers on how to fix these problems. quill detects many types of problems, including gestures that are likely to be misrecognized and gestures that people may perceive to be similar and therefore find confusing. Its feedback about these problems and advice about how to fix them were designed to be intelligible to non-technical designers, so quill uses plain English and simple diagrams.

Professional user interface designers were recruited, and we observed them using quill. We found that the feedback provided by quill was helpful to some designers, but was not helpful to others because it was too technical. We also believe that the training provided in the experiment did not explain the features of quill well enough for designers to use them most effectively. This experiment revealed that the similarity metrics did not predict similarity of letters well, which caused participants to disagree with the tool at times. Overall, participants had a positive impression of quill and were able to use it to create gesture sets.

In summary, the main contributions of this dissertation are as follows. We discovered roadblocks in the gesture design process, such as the difficulty of finding and fixing recognition problems. We confirmed the importance of iconicness for gesture memorability. We discovered important features for gesture similarity judgements by humans and derived a computational model for predicting human-perceived gesture similarity. We built the first intelligent tool for gesture design, quill, which helps designers 184

200 improve the computer recognition of their gestures. Our evaluation of quill showed that advice can be helpful to designers in improving their gestures. This dissertation enables designers to create better gestures for pen-based user interfaces. It also allows a wider group of designers to create good gestures for pen-based UIs. Improving gestures and making gesture design more widely accessible is important because gestures are a powerful interaction technique, especially for pen-based user interfaces. People frequently use gestures on paper and other traditional media to communicate with other people. This dissertation advances the state-of-the-art to allow people to more easily use gestures to communicate with computers. 185

201 Bibliography [AML94] F. Gregory Ashby, W. Todd Maddox, and W. William Lee. On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5(3):144151, May 1994. [AN00] James Arvo and Kevin Novins. Fluid sketches: Continuous recognition and morphing of simple hand-drawn shapes. CHI Letters: Human Factors in Computing Systems, 2(2):7380, November 2000. [App97] Apple Computer, Inc. MessagePad 2000 Users Manual. Apple Computer, Inc., Cupertino, CA, 1997. [Att50] Fred Attneave. Dimensions of similarity. American Journal of Psychology, 63:516556, 1950. [AZ97] Johnny Accot and Shumin Zhai. Beyond Fitts law: Models for trajectory- based HCI tasks. In Human Factors in Computing Systems (SIGCHI Proceedings), pages 295302, New York, March 1997. ACM. [AZ99] Johnny Accot and Shumin Zhai. Performance evaluation of input devices in trajectory-based tasks: An application of the steering law. In Human Factors in Computing Systems (SIGCHI Proceedings), pages 466472, New York, May 1999. ACM. [AZ01] Johnny Accot and Shumin Zhai. Scale effects in steering law tasks. CHI Letters: Human Factors in Computing Systems, 3(1):18, April 2001. [Bau94] Thomas Baundel. A mark-based interaction paradigm for free-hand drawing. In Proceedings of the ACM Symposium on User Interface and Software Technology (UIST), pages 185192. ACM, ACM Press, November 1994. [BDBN93] Robert Briggs, Alan Dennis, Brenda Beck, and Jay Nunamaker, Jr. Whither the pen-based interface? Journal of Management Information Systems, 9(3):7190, 1992-1993. [BG98] Margaret M. Burnett and Herkimer J. Gottfried. Graphical definitions: expanding spreadsheet languages through direct manipulation and gestures. ACM Transactions on Computer-Human Interaction, 5(1):133, March 1998. [BP98] Ravin Balakrishnan and Pranay Patel. The PadMouse: Facilitating selection and spatial positioning for the non-dominant hand. In Human Factors in Computing Systems (SIGCHI Proceedings), pages 916. ACM, April 1998. [BPR83] Jacob Beck, K. Prazdny, and Azriel Rosenfeld. A theory of textural segmentation. In Jacob Beck, Barbara Hope, and Azriel Rosenfeld, editors, 186

202 Human and Machine Vision, volume 8 of Notes and Reports in Computer Science and Applied Mathematics, chapter 1, pages 138. Academic Press, New York, NY, 1983. Proceedings of the Conference on Human and Machine Vision, Aug. 1981. [CHWS88] Jack Callahan, Don Hopkins, Mark Weiser, and Ben Shneiderman. An empirical comparison of pie vs. linear menus. In Human Factors in Computing Systems (SIGCHI Proceedings), pages 95100, New York, 1988. ACM. [CL96] Stphane Chatty and Patrick Lecoanet. Pen computing for air traffic control. In Human Factors in Computing Systems (SIGCHI Proceedings), pages 87 94. ACM, Addison-Wesley, April 1996. [Com01] Palm Computing, 2001. http://www.palm.com/. [CS91] Robert Carr and Dan Shafer. The Power of PenPoint. Addison-Wesley, Reading, MA, 1991. [DHT00] Christian Heide Damm, Klaus Marius Hansen, and Michael Thomsen. Tool support for cooperative object-oriented design: Gesture based modeling on an electronic whiteboard. CHI Letters: Human Factors in Computing Systems, 2(1):518525, April 2000. [DPHS98] Susan Dumais, John Platt, David Heckerman, and Mehran Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of ACM CIKM, pages 148155. ACM, November 1998. [EM98] Neil R. N. Enns and I. Scott MacKenzie. Touchpad-based remote control devices. In Human Factors in Computing Systems (SIGCHI Extended Abstracts), pages 229230. ACM, April 1998. [FDZ98] Andrew Forsberg, Mark Dieterich, and Robert Zeleznik. The music notepad. In Proceedings of the ACM Symposium on User Interface and Software Technology (UIST), pages 203210, New York, NY, November 1998. ACM, ACM Press. [FHM95] Clive Frankish, Richard Hull, and Pam Morgan. Recognition accuracy and user acceptance of pen interfaces. In Human Factors in Computing Systems (SIGCHI Proceedings), pages 503510. ACM, Addison-Wesley, April 1995. [GHJV95] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley professional computing series. Addison-Wesley, 1995. [GR93] David Goldberg and Cate Richardson. Touch-typing with a stylus. In Human Factors in Computing Systems (SIGCHI Proceedings), pages 95 100. ACM SIGCHI, Addison Wesley, April 1993. 187

203 [GVU97] GVUs WWW user survey, April 1997. Available at http:// www.gvu.gatech.edu/user_surveys/survey-1997-04. [HB92] Karl-Heinz Hanne and Hans-Jrg Bullinger. Multimedia Interface Design, chapter 8, pages 127138. ACM Press, 1992. [HHN90] Tyson Henry, Scott Hudson, and Gary Newell. Integrating gesture and snapping into a user interface toolkit. In Proceedings of the ACM Symposium on User Interface and Software Technology (UIST), pages 112 122. ACM, ACM Press, October 1990. [HL00] Jason I. Hong and James A. Landay. SATIN: A toolkit for informal ink- based applications. CHI Letters: UIST, 2(2):6372, November 2000. [Hor94] William Horton. The Icon Book: Visual Symbols for Computer Systems and Documentation. John Wiley & Sons, 1994. [Kar97] Ragnheidur Karlsdottir. Comparison of cursive models for handwriting instruction. Perceptual and Motor Skills, 85(3):11711184, 1997. [KB94] Gordon Kurtenbach and William Buxton. User learning and performance with marking menus. In Human Factors in Computing Systems (SIGCHI Proceedings), pages 258264. ACM, Addison Wesley, April 1994. [KFN95] Naoki Kato, Natsuko Fukuda, and Masaki Nakagawa. An experimental study of interfaces exploiting a pens merits. In Yuichiro Anzai, Katsuhiko Ogawa, and Hirohiko Mori, editors, International Conference on Human- Computer Interaction, volume 1 of Advances in Human Factors/ Ergonomics, pages 555560. Information Processing Society of Japan and others, Elsevier Science, July 1995. [KHW86] Henry S.R. Kao, Mak Ping Hong, and Lam Ping Wah. Handwriting pressure: Effects of task complexity, control mode and orthographic differences. In Henry S.R. Kao, Gerard P. Van Galen, and Rumjahn Hoosian, editors, Graphonomics, volume 37 of Advances in Psychology, pages 4766. Elsevier Science Publishers, 1986. [KN95] Naoki Kato and Masaki Nakagawa. The design of a pen-based interface SHOSAI for creative work. In Yuichiro Anzai, Katsuhiko Ogawa, and Hirohiko Mori, editors, International Conference on Human-Computer Interaction, volume 1 of Advances in Human Factors/Ergonomics, pages 549554. Information Processing Society of Japan and others, Elsevier Science, July 1995. [Krz88] W. J. Krzanowski. Principles of Multivariate Analysis: A Userss Perspective, volume 3 of Oxford Statistical Science Series. Oxford University Press, New York, NY, 1988. 188

204 [KSC+00] Scott R. Klemmer, Anoop K. Sinha, Jack Chen, James A. Landay, Nadeem Aboobaker, and Annie Wang. SUEDE: A Wizard of Oz prototyping tool for speech user interfaces. CHI Letters: UIST, 2(2):110, November 2000. [KSL83] Henry Kao, Daniel Shek, and Elbert Lee. Control modes and task complexity in tracing and handwriting performance. Acta Psychologica, 54(1-3):6977, 1983. [KW78] Joseph B. Kruskal and Myron Wish. Multidimensional Scaling. Number 07- 011 in Quantitative applications in the social sciences. Sage Publications, Beverly Hills, CA, 1978. [Lan96] James A. Landay. Interactive Sketching for the Early Stages of User Interface Design. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, December 1996. #CMU-CS-96-201. Also tech report CMU-HCII-96-105. [Lan01] James A. Landay, 2001. Personal communication. [LD99] James A. Landay and Richard C. Davis. Making sharing pervasive: Ubiquitous computing for shared note taking. IBM Systems Journal, 38(4):531550, 1999. Available at http://www.research.ibm.com/journal/sj/ 384/landay.html. [Lee94] Yvonne L. Lee. PDA users can express themselves with Graffiti. InfoWorld, 16(40):30, October 1994. [Lik32] Rensis A. Likert. A technique for the measurement of attitudes. Archives of Psychology, 1932. [Lip91] James Lipscomb. A trainable gesture recognizer. Pattern Recognition, 24(9):895907, September 1991. [LLR97] Allan Christian Long, Jr., James A. Landay, and Lawrence A. Rowe. PDA and gesture use in practice: Insights for designers of pen-based user interfaces. Technical Report UCB//CSD-97-976, U.C. Berkeley, 1997. Available at http://bmrc.berkeley.edu/papers/1997/142/142.html. [LLRM00] A. Chris Long, Jr., James A. Landay, Lawrence A. Rowe, and Joseph Michiels. Visual similarity of pen gestures. CHI Letters: Human Factors in Computing Systems, 2(1):360367, April 2000. http://www.acm.org/pubs/ citations/proceedings/chi/332040/p360-long/. [LM93] James A. Landay and Brad A. Myers. Extending an existing user interface toolkit to support gesture recognition. In Stacey Ashlund, Kevin Mullet, Austin Henderson, Erik Hollnagel, and Ted White, editors, Adjunct Proceedings of Human Factors in Computing Systems (SIGCHI), pages 91 92. ACM, ACM Press, April 1993. 189

205 [LM95] James Landay and Brad Myers. Interactive sketching for the early stages of user interface design. In Human Factors in Computing Systems (SIGCHI Proceedings), pages 4350. ACM, Addison-Wesley, April 1995. http:// www.acm.org/pubs/citations/proceedings/chi/223355/p63-landay/. [LM01] James A. Landay and Brad A. Myers. Sketching interfaces: Toward more human interface design. IEEE Computer, 34(3):5664, March 2001. http:// www.cs.berkeley.edu/%7Elanday/research/publications/silk-ieee-publish% ed.pdf. [LNHL00] James Lin, Mark W. Newman, Jason I. Hong, and James A. Landay. DENIM: Finding a tighter fit between tools and practice for web site design. CHI Letters: Human Factors in Computing Systems, 2(1):510517, 2000. http://www.acm.org/pubs/citations/proceedings/chi/332040/p510-lin/. [LS91] Alejandro A. Lazarte and Peter H. Schonemann. Saliency metric for subadditive dissimilarity judgments on rectangles. Perception and Psychophysics, 49(2):142158, February 1991. [LT95] D. Lopresti and A. Tomkins. Computing in the ink domain. In Yuichiro Anzai, Katsuhiko Ogawa, and Hirohiko Mori, editors, International Conference on Human-Computer Interaction, volume 1 of Advances in Human Factors/Ergonomics, pages 543548. Information Processing Society of Japan and others, Elsevier Science, July 1995. [Lue01] Erich Luening. Study charts sharp rise in handheld sales. CNET WWW site, January 2001. http://news.cnet.com/news/0-1006-200-4601431.html. [Mey95] Andr Meyer. Pen computing. SIGCHI Bulletin, 27(3):4690, July 1995. [MGD+90] Brad A. Myers, Dario Giuse, Roger B. Dannenberg, Brad Vander Zanden, David Kosbie, Ed Pervin, Andrew Mickish, and Philippe Marchal. Garnet: Comprehensive support for graphical, highly-interactive user interfaces. IEEE Computer, 23(11), November 1990. [MHA00] Jennifer Mankoff, Scott Hudson, and Gregory D. Abowd. Providing integrated toolkit-level support for ambiguity in recognition-based interfaces. CHI Letters: Human Factors in Computing Systems, 2(1):368 375, April 2000. [MMZ95] J. Craig McQueen, I. Scott MacKenzie, and Shawn Zhang. An extended study of numeric entry on pen-based computers. In Proceedings of Graphics Interface, pages 215222. Canadian Information Processing Society, Canadian Information Processing Society, 1995. [MNAC93] I. Murray, A. Newell, J. Arnott, and A. Cairns. Interactive Speech Technology: Human factors issues in the application of speech input/output to computers, chapter 10, pages 99107. Taylor & Francis, 1993. 190

206 [MNR94] I. Scott MacKenize, Blair Nonnecke, and Stan Riddersma. Alphanumeric entry on pen-based computers. International Journal of Human-Computer Studies, 41:775792, 1994. [MS90] Palmer Morrel-Samuels. Clarifying the distinction between lexical and gestural commands. International Journal of Man-Machine Studies, 32:581590, 1990. [MSB91] I. Scott MacKenzie, Abigail Sellen, and William Buxton. A comparison of input devices in elemental pointing and dragging tasks. In Human Factors in Computing Systems (SIGCHI Proceedings), pages 161166. ACM SIGCHI, Addison Wesley, April 1991. [M+95] Thomas Moran et al. Implicit structures for pen-based systems within a freeform interaction paradigm. In Human Factors in Computing Systems (SIGCHI Proceedings), pages 487494. ACM, Addison-Wesley, April 1995. [M+97] Brad A. Myers et al. The Amulet environment: New models for effective user interface software development. IEEE Transactions on Software Engineering, 23(6):347365, June 1997. [MZ97] I. Scott MacKenzie and Shawn X. Zhang. The immediate usability of Graffiti. In Proceedings of Graphics Interface, pages 129137, Toronto, 1997. Canadian Information Processing Society. [NL00] Mark W. Newman and James A. Landay. Sitemaps, storyboards, and specifications: A sketch of web site design practice. In Designing Interactive Systems (DIS 2000), pages 263274. ACM, August 2000. [Pil01] Mike Pilone. KGesture, 2001. Available at http://www.slac.com/ %7Empilone/projects/. [Pit91] James A. Pittman. Recognizing handwritten text. In Human Factors in Computing Systems (SIGCHI Proceedings), pages 271275, New York, NY, Apr-May 1991. ACM SIGCHI, ACM Press. [PL92] Ken Pier and James A. Landay. Issues for location-independent interfaces. Technical Report ISTL92-4, Xerox Palo Alto Research Center, December 1992. [PYD91] Randy Pausch, Nathaniel Young, and Robert DeLine. SUIT: The Pascal of user interface toolkits. In Proceedings of the ACM Symposium on User Interface and Software Technology (UIST), pages 117126. ACM, ACM Press, November 1991. [QO93] Valerie Quercia and Tim OReilly. X Window System users guide for X11 Release 5, volume 3 of The definitive guides to the X Window System. OReilly and Associates, fourth edition, November 1993. 191

207 [Ret94] Marc Rettig. Prototyping for tiny fingers. Communications of the ACM, 37(4):2127, April 1994. [RN96] Byron Reeves and Clifford Nass. The media equation: how people treat computers, television, and new media like real people and places. Center for the Study of Language and Information; Cambridge University Press, Stanford, Calif.: Cambridge [England]; New York, 1996. [Rub91a] Dean Rubine. The Automatic Recognition of Gestures. PhD thesis, Carnegie Mellon University, December 1991. Tech report CMU-CS-91-202. [Rub91b] Dean Rubine. Specifying gestures by example. In Computer Graphics (SIGGRAPH), pages 329337. ACM SIGGRAPH, Addison Wesley, July 1991. [Sen01] Sensiva, Inc. Sensiva product brochure, 2001. Available at site http:// www.sensiva.com/. [Shi99] Arthur Shimamura, 1999. Personal communication. [SJ99] Simone Santini and Ramesh Jain. Similarity measures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(9):113, September 1999. [Sof01] Opera Software. Mouse gestures in opera, 2001. Available at http:// www.opera.com/windows/mouse.html. [Sta87] Richard Stallman. GNU Emacs manual. Free Software Foundation, 6th ed., emacs version 18 for unix users edition, March 1987. [Tho68] Hoben Thomas. Spatial models and multidimensional scaling of random shapes. American Journal of Psychology, 81(4):551558, 1968. [TK95] Mark Tapia and Gordon Kurtenbach. Some design refinements and principles on the appearance and behavior of marking menus. In Proceedings of the ACM Symposium on User Interface and Software Technology (UIST), pages 189195. ACM, November 1995. [TSW90] Charles C. Tappert, Ching Y. Suen, and Toru Wakahara. The state of the art in online handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(8):787808, August 1990. [UFA95] Figen Ulgen, Andrew Flavell, and Norio Akamatsu. Recognition of on-line handdrawn geometric shapes by fuzzy filtering and neural network classification. In Yuichiro Anzai, Katsuhiko Ogawa, and Hirohiko Mori, editors, International Conference on Human-Computer Interaction, volume 1 of Advances in Human Factors/Ergonomics, pages 567572. Information Processing Society of Japan and others, Elsevier Science, July 1995. 192

208 [vMH93] Hanneke van Mier and Wouter Hulstijn. The effects of motor complexity and practice on initiation time in writing and drawing. Acta Psychologica, 84(3):231251, 1993. [Wat93] Richard Watson. A survey of gesture recognition techniques. Technical Report TCD-CS-93-11, Department of Computer Science, Trinity College, Dublin 2, 1993. [WC99] Kathy Walrath and Mary Campione. The JFC Swing Tutorial: A Guide to Constructing GUIs. Addison-Wesley, July 1999. [WRE89] Catherine Wolf, James Rhyne, and Hamed Ellozy. The paper-like interface. In Gavriel Salvendy and Michael Smith, editors, Designing and Using Human-Computer Interfaces and Knowledge Based Systems, volume 12B of Advances in Human Factors/Ergonomics, pages 494501. Elsevier, September 1989. [You87] Forrest W. Young. Multidimensional Scaling: History, Theory, and Applications. Lawrence Erlbaum Associates, Hillsdale, NJ, 1987. [Zha93] Rui Zhao. Incremental recognition in gesture-based and syntax-directed diagram editors. In Human Factors in Computing Systems (SIGCHI Proceedings), pages 95100. ACM SIGCHI, Addison Wesley, April 1993. [ZKKM95] R. Zhao, H.-J. Kaufmann, T. Kern, and W. Mller. Pen-based interfaces in engineering environments. In Yuichiro Anzai, Katsuhiko Ogawa, and Hirohiko Mori, editors, International Conference on Human-Computer Interaction, volume 20B of Advances in Human Factors/Ergonomics, pages 531536. Information Processing Society of Japan and others, Elsevier Science, July 1995. 193

209 Appendix A

PDA User Survey

This appendix contains the PDA user survey and its results. See Chapter 2 for more information about this survey.

A.1 PDA User Survey

If you use a pen-based Personal Digital Assistant (PDA), we would like to ask a favor of you. We are researching ways to improve pen-based user interfaces, but to do so we need feedback from you, the users. Please take a few minutes to answer the questions below. By taking a few minutes to answer the questions below you will be greatly assisting us in improving the pen-based interfaces of tomorrow. Be assured that your answers will be kept strictly confidential. Submit the form only once per user. One respondent, selected at random, will receive a U.C. Berkeley t-shirt as a token of our appreciation for participating. Only respondents who supply contact information will be eligible.

Please answer the following questions only if you use a pen-based Personal Digital Assistant (PDA). This survey does not apply to keyboard-based PDAs or portable computers.

1. How many kinds of PDAs have you regularly used (including your current one)?
2. How many years has it been since you first used a PDA regularly?
3. Which PDA do you currently use? (Apple Newton / US Robotics Pilot / Sony MagicLink / Other)
4. For how many months have you used your current PDA?
5. How often do you use your PDA? (Less than once per day / Once per day / 2-5 times per day / More than 5 times per day)
6. Please use the box below to tell us what applications or features you would like your PDA to have that it does not.
7. How often do you use a PDA while talking or meeting with other people? 194

210 (Never / Rarely / Often / Very often)
8. How often are you in meetings where other people are using PDAs? (Never / Rarely / Often / Very often)
9. When you enter notes into your PDA during meetings, what sort of notes do you enter? (Check all that apply.)
   - Names, addresses, phone numbers, etc.
   - Personal to-do items or reminders
   - Meeting events for you to review later
   - Meeting events to be shared with others
   - Ideas you have that you will review later
   - Ideas you have to be shared with others
   - Other (please specify)
10. What sorts of things do you write in your notes (both PDA and paper) that you share with others after a meeting?
11. Do you use your PDA's built-in handwriting recognition? (Yes / No)
12. How would you rate the accuracy of your PDA's handwriting recognition? (Terrible / Bad / Good / Excellent)
13. Do you use Graffiti? (Yes / No / I don't know)
14. If you use Graffiti, how would you rate its accuracy? (Terrible / Bad / Good / Excellent)
15. Please rank the following types of PDA application according to how often you use them, with 1 being the most often used. Please give each application a different number. (Note taking and editing / Drawing / Email / To-do list / Addressbook / Appointments/Calendar)
16. What is the single most common task that you perform on paper that you do not 195

211 perform on your PDA?
17. How often do you perform this task (on paper or PDA)? (Never / Rarely / Often / Very often)
18. Why don't you perform this task on your PDA more often?

Some of the following questions involve "gestures". A gesture is a mark or stroke made with a pen that invokes an operation rather than enters data. For example, many pen-based applications have a gesture for deleting words or graphical objects.

19. How often do you use gestures to invoke operations versus using other methods (e.g., menus, buttons)? (Always use gestures / Mostly use gestures / Mostly use other methods / Always use other methods)
20. How would you rate the accuracy of your PDA's gesture recognition? (Terrible / Bad / Good / Excellent)

In the next section, please mark an answer in each column for each of the listed operations. In the left column, indicate how often you use a gesture for the operation relative to the times that it would be useful or appropriate to do so. In the right column, indicate the primary reason you do not use it more often than you do in those instances (or "Not applicable" if there is no such reason). Please be sure to choose an item from each column for each operation. (For each of the operations in questions 21-33, the left column offered: Never / Rarely / Often / Very often. The right column offered: Operation is not available / Gesture is not available / Cannot remember the gesture / Poor computer recognition of the gesture / Not applicable.)

21. Delete
22. Scroll up
23. Next field 196

23. Next field
24. Open record
25. Select
26. Scroll down
27. Transpose (swap two adjacent characters)
28. Move cursor
29. Previous field

30. Insert line or paragraph
31. Undo
32. Close
33. Insert letters or words

In the next section, please indicate to what extent you agree or disagree with the statements. (1 = agree strongly, 4 = disagree strongly)

34. Gestures are powerful.
35. Gestures are easy to learn.
36. Gestures are efficient.

37. Gestures are easy to use.
38. The computer always recognizes the gestures I make.
39. A gesture is available for every operation for which I want a gesture.
40. Gestures are convenient.
41. Gestures are easy to remember.
42. Please use the box below to make any additional comments you have about gestures.
43. What is your age (in years)?
44. What is your gender?
   Male / Female
45. What is the highest level of education you have reached?
   High school / Some college / College degree / Master's/Professional degree / PhD/MD
46. How technically sophisticated do you consider yourself to be? (1 = not at all, 4 = extremely)
47. What is your occupation?
48. Please use the box below for comments about the survey itself.
49. Optional: It might be useful for us to contact you for more information about your answers. If you would not mind being contacted, please enter your name, email address, and telephone number below. All information you provide will remain absolutely confidential. This is completely optional.

   Name
   Email address
   Telephone number

Please double-check to make sure you have answered all applicable questions.

Submit the survey

Thank you very much for your assistance.

A.2 Results

The following three tables show the raw results from the survey. Participant ID has been added for finding corresponding rows across the tables.

[Raw survey results, table 1 of 3: not reproduced legibly here. Its columns are: Participant ID, PDA_types, first_pda_use, pda, current_pda_time, use_frequency, meeting_use_frequency, meeting_frequency, the note types entered in meetings (names, todo, events to review, events to share, ideas to review, ideas to share, other), other note types, use_hw_recognition, hw_recognition_accuracy, use_graffiti, graffiti_accuracy, the application rankings (note_taking_rank, drawing_rank, email_rank, to_do_list_rank, addressbook_rank, calendar_rank), and paper task.]

[Raw survey results, table 2 of 3: not reproduced legibly here. Its columns are: Participant ID, gesture_use_frequency, gesture_recognition_accuracy, a frequency and reason pair for each operation (delete, scroll_up, next_field, open_record, select, scroll_down, transpose, move_cursor, previous_field, insert_line, undo, close, insert), the agreement ratings (powerful, easy_to_learn, efficient, easy_to_use, always_recognized, one_for_every_command, convenient, easy_to_remember), paper_use_frequency, age, gender, education, and tech_sophistication.]

Raw survey results, table 3 of 3: participant occupations.

Participant ID   Occupation
1   Computer Programmer/Senior Engineer
2   Student
3   Student
4   System Administrator
5   Software Engineer
6   Professor
7   Software Engineer
8   Technical Marketing Engineer
9   Engineer
10   Computer Progammer
11   technical sales
12   Software Engineer
13   Human Factors Engineer (Ergonomics)
14   Systems Engineer
15   software engineer
16   Student/Student Body President
17   systems engineer
18   Project Officer
19
20   Software Developer
21   IT Consultant
22   vision scientist
23   sales
24   environmental manager
25   Pre-sales Systems Engineer
26   Technical Manager
27   Physicst
28   software engineer
29   Consultant (Information Services)
30   Laparoscopic Surgeon
31
32   Client/Server Programmer
33   computer scientist
34   Computing and Network Engineer
35   healthcare
36
37   physician/scientist
38   Attorney
39   Software Test Engineer (Newton Inc.)
40   Educational Multimedia Developer
41   Technical Writer
42   Systems Programmer
43   software engineer
44   Student
45   Software Engineer - Sun Microsystems
46   Technical Assistant/System Administrator
47   Technology Coordinator for a school district

Appendix B

Evaluation of gdt

This appendix contains documents and results from the gdt experiment (see Chapter 3): an overview of the experiment that was given to participants; a gdt tutorial; the practice task; the experimental task; the post-experiment handout; the post-experiment questionnaire [1]; the experiment script; the consent form; and results of the questionnaire.

B.1 Experiment Overview

Thank you for agreeing to participate in this experiment. This experiment is about the Gesture Design Tool (gdt), whose purpose is to aid designers in creating gesture sets for pen-based user interfaces (PUIs). A gesture is a mark made with a pen to cause a command to be executed, such as the copy-editing pigtail mark for delete. (Some people use "gesture" to mean motions made in three dimensions with the hands, such as pointing with the index finger. This is not what is meant by "gesture" here.)

We will be asking you to do the following things during this experiment:

Observe a demonstration of gdt
Read the gdt tutorial
Perform the practice task
Perform the familiarization task
Perform the experimental task
Fill out a questionnaire about yourself and your experiences

In this experiment we are not testing you. If you become frustrated or have difficulty with any of the tasks, it is not your fault. The tools you will be using are still under development, and the task is not an easy one.

This task is totally voluntary. We do not believe it will be unpleasant for you (in fact we hope you will find it interesting), but if you wish to quit you may do so at any time for any reason.

[1] Questions about the system and the experiment were based on Shneiderman's Questionnaire for User Interaction Satisfaction [Shn91].

We would like to know how you feel and what you are thinking as you do the task. For that purpose, we will be videotaping you. We can get even more information if you talk about what you are doing and thinking as you perform the task. You may find it awkward at first, but it would be very helpful for us and we think you will get used to it after a few minutes. The experimenter may remind you to think aloud during the task.

For our results to be as valid as possible, it is important that all participants are treated identically and given the same information. Because of this, the experimenter may not be able to answer all of your questions. We have attempted to provide all of the information you will need to do the task in the tutorial and demonstration. This does not mean you should not ask questions. If they cannot be answered immediately, the experimenter will note them and answer them for you later.

B.2 GDT Tutorial

Purpose

The purpose of the Gesture Design Tool (gdt) is to aid designers in creating gesture sets for pen-based user interfaces (PUIs). A gesture is a mark made with a pen to cause a command to be executed, such as the copy-editing pigtail mark for delete. (Some people use "gesture" to mean motions made in three dimensions with the hands, such as pointing with the index finger. This is not what is meant by a gesture here.)

gdt helps designers of gestures as follows. It allows a designer to explore attributes of their gestures that influence recognition, so the designer can determine if their gestures will be difficult for the computer to disambiguate.

Terminology and Concepts

This section introduces terms that are useful for talking about gestures and concepts that are needed to understand gdt.

Terminology

A pen-based application that uses gestures allows the user to draw a number of different kinds of gestures to perform different tasks. For example, a pigtail may indicate delete, a caret insert, and a circle select.

[Figure: three example gestures, labeled pigtail, caret, and circle.]

gdt only supports single-stroke gestures. (A stroke is made each time the pen is put down, moved, and let up.) We call each of these different kinds of gestures a gesture class. A collection of gesture classes we call a gesture set.

How feature-based recognition works

Although we hope that in the future designers can use gdt effectively without knowing anything about recognition, that is presently not the case. Therefore this section will introduce some basic concepts from feature-based recognition that gdt users will need to understand. The description is intentionally not mathematically rigorous in the hope that it will therefore be more accessible.

Feature-based recognizers categorize gestures using certain attributes of the gestures. These attributes are called features and may include such properties as the total length of the gesture, the total angle of the gesture, and the size of the gesture's bounding box. Total length is the sum of the length of each segment of the gesture. Total angle is the sum of all angles between the segments. One such angle is shown below.

The size of the bounding box is shown as the length of the diagonal line in the figure below.

gdt has thirteen features built in, although the two related to time are not generally used. gdt uses an implementation of Rubine's recognizer, as described in his SIGGRAPH '91 paper, "Specifying Gestures by Example," and uses the features described in that paper:

1. Cosine of the initial angle
2. Sine of the initial angle
3. Length of the bounding box diagonal
4. Angle of the bounding box diagonal
5. Distance between first and last point
6. Cosine of the angle between the first and last points
7. Sine of the angle between the first and last points
8. Total length
9. Total angle traversed
10. Sum of the absolute value of the angle at each point
11. Sum of the squared value of those angles
12. Maximum speed (squared) [normally not used]
13. Duration [normally not used]

For every gesture, the recognizer computes a vector of these features, called the feature vector.
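To make the feature computation concrete, the sketch below computes several of the geometric features listed above for a single-stroke gesture given as a list of (x, y) points. It is written in Python purely for illustration; it is not gdt's implementation, and the function name and point-list format are assumptions.

    import math

    def gesture_features(points):
        # points: list of (x, y) tuples for one single-stroke gesture.
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]

        # Features 3-4: length and angle of the bounding box diagonal.
        dx_box = max(xs) - min(xs)
        dy_box = max(ys) - min(ys)
        bbox_diag_len = math.hypot(dx_box, dy_box)
        bbox_diag_angle = math.atan2(dy_box, dx_box)

        # Feature 5: distance between the first and last point.
        first_last_dist = math.hypot(xs[-1] - xs[0], ys[-1] - ys[0])

        # Features 8-11: total length and the three angle sums.
        total_len = 0.0
        total_angle = 0.0          # signed sum of turning angles
        total_abs_angle = 0.0      # sum of |angle|
        total_sq_angle = 0.0       # sum of angle squared
        for i in range(1, len(points)):
            dx = xs[i] - xs[i - 1]
            dy = ys[i] - ys[i - 1]
            total_len += math.hypot(dx, dy)
            if i >= 2:
                pdx = xs[i - 1] - xs[i - 2]
                pdy = ys[i - 1] - ys[i - 2]
                # Signed turning angle between consecutive segments.
                angle = math.atan2(dx * pdy - dy * pdx, dx * pdx + dy * pdy)
                total_angle += angle
                total_abs_angle += abs(angle)
                total_sq_angle += angle * angle

        return [bbox_diag_len, bbox_diag_angle, first_last_dist,
                total_len, total_angle, total_abs_angle, total_sq_angle]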

The feature vector is used in training and recognition as follows.

Training

The recognizer works by first being trained on a gesture set. Then it is able to compare new gestures with the training set to determine to which gesture class the new gesture belongs.

During training, for each class the recognizer uses the feature vectors of the examples to compute a mean feature vector and covariance matrix (i.e., a table indicating how the features vary together and what their standard deviations are) for the class.

Recognition

When a gesture to be recognized is entered, its feature vector is computed and compared to the mean feature vector of each gesture class in the gesture set. The candidate gesture is recognized as belonging to the gesture class whose mean feature vector is closest to the feature vector of the candidate gesture.

For a feature-based recognizer to work perfectly, the values of each feature should be normally distributed within a class, and between classes the values of each feature should vary greatly. The next section describes how to use gdt to determine if this is the case.
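The training and recognition steps just described can be summarized in a short sketch. It takes "closest mean feature vector" literally, using a plain Euclidean distance; the actual recognizer also uses the covariance information to weight the features, which is omitted here for clarity. The gesture_features() helper is the hypothetical one from the previous sketch, and the data layout is an assumption.

    import math

    def train(examples_by_class):
        # examples_by_class: {class name: [list of point lists]}
        # Returns {class name: mean feature vector}.
        means = {}
        for name, examples in examples_by_class.items():
            vectors = [gesture_features(ex) for ex in examples]
            n = len(vectors)
            means[name] = [sum(v[i] for v in vectors) / n
                           for i in range(len(vectors[0]))]
        return means

    def recognize(points, means):
        # Return the class whose mean feature vector is closest.
        fv = gesture_features(points)
        def dist(mean):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(fv, mean)))
        return min(means, key=lambda name: dist(means[name]))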

Usage

First, a general overview of using gdt will be given. Then the method of entering gestures will be explained, along with other common operations. The last section will discuss how to use gdt to find potential recognition problems.

Overview

gdt is used to evaluate a proposed gesture set. The designer enters examples of each type of gesture in the gesture set. (The designer may also, or instead, have others enter examples to get more variation.) Then the designer uses the visualizations provided by gdt to determine what possible problems may exist in the gesture set. The designer then modifies an example or class and reexamines the set.

Entering the gestures

gdt starts up with an empty gesture set. To make a new class, use the Class->New menu item (i.e., select the New command from the Class menu). First, enter the class name in the Name box. To add example gestures to the class, simply draw them in the white area at the bottom of the class window, as shown in the figure below.

Individual gestures or classes may be selected by clicking on their icons. The Edit menu contains the common cut, copy, paste, and delete operations, which can be used on gestures or classes. You could use these operations for tasks such as removing mistakenly entered examples. Note that in the gesture set display, the first example of each class is shown.

Finding potential recognition problems

This section discusses how gdt can be used to find different aspects of a gesture set that may cause problems for the recognizer.

A potential problem that may arise in training feature-based recognizers is bad example gestures. You can use gdt to detect this by using the Set->Classification matrix menu item to classify the training examples. It will create a table that shows what percentage of examples from each class is classified into each category. The vertical axis is the class an example really belongs to, and the horizontal axis is how it is classified by the recognizer. If there are no extreme outlying examples, this table will simply have 100s along the diagonal. If there are outlying examples shown in the table, clicking on the table cell will bring up the appropriate category, with the offending example(s) highlighted. You can then delete it (with Edit->Delete) and enter a new example.
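A classification matrix of this kind could be computed along the following lines. This is only an illustrative sketch reusing the hypothetical train() and recognize() helpers from the earlier sketches, not gdt's own code.

    def classification_matrix(examples_by_class):
        # matrix[true_class][predicted_class] = percentage of the true
        # class's training examples that were classified as predicted_class.
        means = train(examples_by_class)
        classes = sorted(examples_by_class)
        matrix = {true: {predicted: 0.0 for predicted in classes}
                  for true in classes}
        for true, examples in examples_by_class.items():
            for ex in examples:
                matrix[true][recognize(ex, means)] += 1.0
            for predicted in classes:
                matrix[true][predicted] *= 100.0 / len(examples)
        return matrix

A gesture set with no outlying examples yields 100s on the diagonal of this table, as described above.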

Another problem may be that two gesture classes are simply too similar to one another for the recognizer to disambiguate. gdt can compute the distance between classes, which is proportional to how different they are for recognition purposes. Use the Set->Distance matrix menu item to see this table. It is read like the distance tables on maps: cross-index class A with class B to see the distance between the classes. (The table is symmetric, but the entire table is shown for convenience.) There is a threshold slider on the right to gray out uninteresting inter-class distances (i.e., large distances). As a rule of thumb, distances less than about 7 may be problematic.

If it happens that two classes are too close together, you can look at the individual features for the two classes by clicking on their cell in the table. The feature graph shows the values of each feature for each class. Using this graph, you can see which features are too similar and change one of the gestures to differentiate them.

The way the weights are assigned to features may also cause the recognizer to have difficulty classifying some gestures. This may happen if, for example, two classes (let's call them A and B) are distinguished from each other by their size while another class (C) in the same set has examples at two or more substantially different sizes. In order to classify C correctly, the features related to size will probably be given a low weight, which will make disambiguating A and B difficult. This can be discovered using the feature graph (Set->Feature graph) and corrected by breaking C up into two or more separate classes (e.g., C-big and C-little).
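A pairwise class-distance table in the spirit of the Distance matrix can be sketched in the same style as the earlier examples. The metric gdt actually reports is tied to its recognizer (so the rule-of-thumb threshold of about 7 applies to gdt's metric, not necessarily to this one); here, purely for illustration, the distance between two classes is simply the Euclidean distance between their mean feature vectors.

    import math

    def distance_matrix(examples_by_class):
        # table[(a, b)] = distance between the mean feature vectors of
        # classes a and b; symmetric, with zeros on the diagonal.
        means = train(examples_by_class)   # hypothetical helper above
        classes = sorted(means)
        table = {}
        for a in classes:
            for b in classes:
                table[(a, b)] = math.sqrt(
                    sum((x - y) ** 2 for x, y in zip(means[a], means[b])))
        return table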

B.3 Practice Task

You will be given a gesture set with one gesture already in it, a circle. Your task is as follows:

1. Create a new class and name it squiggle.
2. Add 15 examples to the class. Make them look like this:
3. Make another new class and name it caret.
4. Add 15 examples to the caret class that look like this:
5. Go back to the main (i.e., gesture set) window. Draw the squiggle gesture in the white area at the bottom of the window. gdt should recognize it as the squiggle gesture. Draw it 4 more times and see if it is recognized correctly.
6. Draw the caret gesture and see if it is recognized correctly. Do this 5 times.

B.4 Experimental Task

For this task, we would like you to augment an existing gesture set to include new types of gestures. We have created a gesture set for a drawing program (e.g., MacDraw or Canvas) with the following gesture classes: select, delete, undo, redo, bigger font, smaller font, bring to front, send to back, shuffle up, shuffle down, zoom in, zoom out, rotate clockwise, rotate counterclockwise, and select all.

We would like you to invent new gestures for the following operations:

cut
copy
paste
align left
align center vertically
thicker lines
thinner lines
eraser (i.e., switch to the eraser tool)
pen (i.e., switch to the pen tool)
save (the current file)

Your goal is to make gestures that you think will be easy for people to learn and remember and that will be easy for the computer to recognize. As an incentive to design good gestures, we will be awarding $100 to the person who designs the best gestures. (Our criteria will include how well the gestures can be recognized by the computer and how learnable and memorable they are for people.)

Important: As you do the task, please think aloud. That is, say what you are thinking. It is very important to us to know what you are thinking as you are doing the task. Say what you are trying to do. If you find a part of the task easy or difficult, say so. You may find it awkward at first, but we think you will quickly get used to it.

Use the following general outline to perform this task:

1. Decide what shape you want for each of the new gestures.
2. Enter 15 examples of each new gesture into gdt. Do not forget to name the new classes, using the appropriate operation name given above (e.g., align left, pen).
3. Check the recognition of the new gestures by entering examples to be recognized and/or by looking at the visualizations gdt provides, such as the Distance Matrix and the Classification Matrix.
4. If you are unhappy with the recognition, modify your gestures. You may not modify the predefined gestures, except for the 5 examples that you added.
5. If you think that the new gestures are sufficiently well recognized and will be easy to learn and remember, you should test the recognition. Start a test with the Test menu. If the overall recognition rate it reports at the end is not as good as the rate you achieved with the original set, go back to step 3. (The experimenter will tell you the target rate if you do not remember it.)
6. If you wish to continue to work with the gesture set (to make it more learnable, for example), you may. (If you change it further, you will have to go back to step 5 to make sure it is still sufficiently recognizable.) Otherwise, you are finished.

B.5 Post-experiment Handout

The purpose of this experiment was to investigate the difficulty of creating gestures for a gesture set and to what extent our tool (gdt) aids this task. Some of the questions we are trying to answer are:

How easy is it to think of new gestures?
How good are the gesture sets made by people who are new to the task?
How easy is it to tell if there are recognition problems?
When there are recognition problems, how easy are they to fix?
Is gdt easy to train?
Are the visualizations provided by gdt useful?

Our broader goal in this research is to improve gestures for pen-based user interfaces. We believe that a tool such as gdt will enable interface designers to create better gestures.

B.6 gdt Experiment Questionnaire

We would like you to tell us about your experiences using gdt. Please be honest. We really want to know what you think. After that, the questionnaire asks some questions about your background.

Please select the numbers which most appropriately reflect your impressions about using this computer system. Not Applicable = NA.

Overall reactions to gdt:
1. terrible (1) to wonderful (9)
2. frustrating (1) to satisfying (9)
3. dull (1) to stimulating (9)
4. difficult (1) to easy (9)
5. inadequate power (1) to adequate power (9)
6. rigid (1) to flexible (9)
7. Enter any general comments about gdt below

Learning:
8. Learning to operate gdt: difficult (1) to easy (9)
9. Exploration of features by trial and error: discouraging (1) to encouraging (9)
10. Remembering names and use of commands: difficult (1) to easy (9)
11. Tasks can be performed in a straightforward manner: never (1) to always (9)
12. Enter any general comments about learning below

System capabilities:
13. System speed: too slow (1) to fast enough (9)

14. The system is reliable: never (1) to always (9)
15. Correcting your mistakes: difficult (1) to easy (9)
16. Ease of operation depends on your level of experience: never (1) to always (9)
17. Enter any general comments about system capabilities below

Tutorial:
18. The tutorial is: confusing (1) to clear (9)
19. Information from the tutorial is easily understood: never (1) to always (9)
20. Enter any general comments about the tutorial below

The experimental task:
21. The task was: confusing (1) to clear (9)
22. Overall, the task was: difficult (1) to easy (9)
23. Thinking of new gestures was: difficult (1) to easy (9)
24. Finding recognition problems was: difficult (1) to easy (9)
25. Fixing recognition problems was: difficult (1) to easy (9)
26. Entering new gesture classes was: difficult (1) to easy (9)

27. Testing recognizability of gestures was: difficult (1) to easy (9)
28. The classification matrix visualization was: unhelpful (1) to helpful (9)
29. The distance matrix visualization was: unhelpful (1) to helpful (9)
30. Enter any general comments about the experimental task below

The following section asks you about yourself. All information will be kept confidential.

31. How many kinds of PDAs have you regularly used (including your current one, if any)? If zero, skip to question 35.
32. How many years has it been since you first used a PDA regularly?
33. Which PDA do you currently use?
   Apple Newton / Palm Pilot / Sony MagicLink / WinCE-based / Other / N/A
34. For how many months have you used your current PDA?
35. How often do you use your PDA?
   Less than once per day / Once per day / 2-5 times per day / More than 5 times per day / N/A
36. What is your age (in years)?

37. What is your gender?
   Male / Female
38. What is the highest level of education you have reached?
   High school / Some college / College degree / Master's/Professional degree / PhD/MD
39. How technically sophisticated do you consider yourself to be? (1 = not at all, 5 = extremely)
40. Have you ever designed a user interface (or part of one) before?
   Yes / No
41. What is your occupation?
42. Please use the box below for comments about the survey itself.
43. Please enter your name, email address, and telephone number below. All information you provide will remain absolutely confidential.
   Name
   Email address
   Telephone number

Please double-check to make sure you have answered all applicable questions.

Submit the questionnaire

Thank you very much for your assistance.

B.7 GDT Experiment Script

Setup

To set up before the participant arrives:

1. Start a copy of gdt and load data/practice.gs in it. This will be used for the practice task.
2. Start two copies of gdt and load data/experiment.gs in both. One will be used for the demo and the other for the experimental task. Be sure to start the correct version of gdt.
3. Enter the participant's name into the two copies of gdt that will be used for the tasks. Save one as name-p.gs and the other as name.gs.
4. Make sure paper is available for the experimenter and the participant.

Notes

If during the tasks the subject has a question that is directly answered in the materials given to the subject (i.e., the tutorial or other handouts), the experimenter will point out the relevant part of the printed material (and make a note so that the materials may be improved). All other questions should be written down and not answered until the very end (in the Post-experiment section). The experimenter should remind the subject to think aloud while performing the experimental task as necessary.

Script

Introduction

Experimenter (E): Thank you for coming to participate in our experiment. First, I'd like you to sign these consent forms. [Participant signs them] You may keep a copy. [hands them another copy] Here is an overview of the experiment. [Gives pre-experiment handout to participant and waits for participant to read it.]

Demonstration

E: A large part of this experiment will involve using a software program called Gesture Design Tool, or gdt. I will show you a demonstration of gdt now.

[E sits at the computer with participant next to E. E gives demonstration as follows.]

[E deiconifies a copy of gdt that has the initial gesture set for the experimental task already loaded.]

E: Here is an example gesture set. You can recognize a gesture like this [E draws the delete gesture in the recognition area and points out where the recognition result appears. (If it is not correctly recognized, E says, "Sometimes it does not correctly recognize the gesture.") E draws a bad example on purpose.]

E: Sometimes it does not correctly recognize the gesture.

E: You can look at the training examples for a gesture class by pressing the barrel button on the pen. [E brings up the training examples for the delete gesture. E scrolls through the examples.]

E: You can select and deselect examples by pressing them [clicks on an example]. When one is selected, you can use the edit menu to delete, cut, or copy it and paste it in later [brings up edit menu, waits a few seconds, then deselects the selected example].

E: You can close the gesture class window when you are finished examining the examples. [closes the window with File->Close.]

E: gdt has a number of visualizations designed to help find problems with gestures. One shows how different the gesture classes are from one another and is called the Distance matrix. [brings up the distance matrix] To see how different two classes are, cross-index them in the table. You can use the slider to make similar ones stand out. [Slides slider up and down, pausing in between so the participant can see how the display changes.]

E: gdt can help find bad training examples by showing how they are recognized. This visualization is called the Classification Matrix [brings up classification matrix]. The rows are labelled according to what class the training example is an example of [points to row names] and the columns are labelled according to how the examples were actually classified [points to column names]. Each cell shows the percentage of a class's examples that were classified into each class.

Practice task

[E quits demo gdt and deiconifies practice task gdt. E brings up gdt with practice set loaded.]

E: Now we have a small task for you to do to help familiarize yourself with gdt. Most of what you need to know was covered in the demonstration, but we have a tutorial here that you should read before doing the task. Feel free to refer to the tutorial during this task. [Hands participant the tutorial and the handout on the practice task. Starts recordings.]

[Participant does practice task.]

Baseline Task

E: Now we have a small task that will help you become familiar with the gesture set you will be using for the following task. Draw the different gestures as gdt asks you to. [E brings up experimental gdt and starts test mode. P does task.]

E: Thanks. [E records recognition rate and saves test data.] Again, please. [E starts test mode. P does task. E records recognition rate and saves test data.]

E: Thanks. Now add 5 examples to each class. [P adds the examples.]

E: Ok. One more test for now. [E starts test mode. P does task. E records recognition rate and saves test data.] Good.

Experimental task

E: Now we have a longer task for you. It involves designing new gestures, so we have scratch paper you can use if you wish. [Gives participant handout on experimental task and scratch paper and pen. Important: Brings up experimental task version of gdt.] Please read the entire task before beginning. Try to think aloud as you work.

[Participant does experimental task. Experimenter stops participant after one hour if necessary. E checks to make sure the recognition rate is met. If not, asks P to improve the gesture set.]

Questionnaire

E: Now we have a short questionnaire we'd like you to answer. Please come over to this computer. [Directs participant to questionnaire computer and brings up Netscape with the questionnaire page loaded.]

E: Please fill this out.

[Participant fills out questionnaire. At the end, experimenter tells them what ID number to use.]

E: That's all. Thank you very much.

Post-experiment

Give the participant the Post-experiment Handout.
Save the data from the practice and experimental tasks.
Answer any of the participant's questions that could not be answered during the task.
Ask the participant if they have any more questions and answer them.
Ask the participant any questions noted during the experiment.

B.8 Consent Form

[The consent form used for this experiment is not reproduced here.]

B.9 Results

The following two tables show participants' answers to the questionnaire.

Table B-1 gdt evaluation results (part 1). [The raw data are not reproduced legibly here. The columns are: id, Age, gender, education, major, creative, PDA_types, first_pda_use, current_pda_time, learning_1 through learning_4, and capabilities_1 through capabilities_4.]
242 tech_sophistication use_frequency ui_design tutorial_1 tutorial_2 overall_1 overall_2 overall_3 overall_4 overall_5 overall_6 task_1 task_2 task_3 task_4 task_5 task_6 task_7 task_8 task_9 id occupation pda 1 artist 8 8 6 9 8 8 newton 9 7 5 NA NA 7 8 NA NA 3 9 8 4 2 2 student 8 8 7 8 9 9 , NA 9 3 1 5 1 9 9 NA 9 9 9 3 0 3 student 9 8 6 9 6 9 9 4 3 9 9 9 9 NA 9 5 9 9 3 4 student 8 6 9 5 7 9 9 3 1 8 4 9 9 9 9 5 9 9 2 5 computer consultant 9 9 8 9 9 8 9 9 8 8 9 9 9 NA NA 3 6 6 3 6 sw engineer 7 5 7 7 5 7 9 5 3 8 3 9 9 7 NA 5 9 9 2 7 student 6 4 3 8 6 4 7 8 1 6 3 8 3 NA NA 4 4 7 1 8 college student 8 8 8 9 7 7 , NA 9 9 6 2 NA 9 8 7 7 5 9 9 1 9 Student 7 3 6 8 8 7 newton 7 8 7 6 4 8 5 1 1 5 4 7 1 1 10 studnet 7 3 8 4 5 8 7 7 2 3 2 2 9 5 5 5 7 7 2 11 web designer 7 8 9 9 7 7 pilot 8 5 3 8 8 9 8 8 8 4 5 5 4 3 12 grad student 7 6 6 8 7 5 pilot 8 6 2 5 4 7 9 NA 9 4 8 8 4 1 13 grad student 7 7 9 8 9 9 newton 8 4 2 3 2 9 4 NA 8 4 8 3 1 14 Artist 8 4 4 6 NA 3 none 8 6 5 4 5 8 5 3 4 3 7 7 1 0 15 student/filmmaker/producer 6 4 7 3 2 6 1 2 2 2 2 9 5 1 1 3 2 2 2 16 student 7 7 5 8 4 8 8 4 8 7 8 7 3 8 5 8 8 2 17 student 8 6 5 7 7 6 8 4 3 8 5 7 7 4 7 5 7 7 2 18 Student 7 7 6 7 5 5 pilot 9 8 7 8 8 8 8 8 NA 4 8 7 2 3 19 student 7 3 7 4 8 7 pilot 9 6 2 3 5 6 7 4 8 5 9 4 1 3 20 Student, undergrad 7 6 6 8 8 7 8 5 3 9 3 9 6 3 8 4 7 8 2 Table B-2 gdt evaluation results (part 2). 227

243 Appendix C Gesture Similarity Experiments This appendix contains the overview, the experiment script, and the post-experiment questionnaire1 from the gesture similarity experiments. C.1 Gesture Similarity Experiment Overview Thank you for agreeing to participate in this experiment. This experiment is about how people judge gesture similarity A gesture is a mark made with a pen to cause a command to be executed, such as the copy-editing pigtail mark for delete. (Some people use gesture to mean motions made in three dimensions with the hands, such as pointing with the index finger. This is not what is meant by gesture here.) During this experiment, we will ask you to look at gestures and decide how similar you think they are. There is no right or wrong answer. This task is completely voluntary. We do not believe it will be unpleasant for you, but if you wish to quit you may do so at any time for any reason. C.2 Similarity Experiment Script Consent Form E: Thanks for coming. Here is a consent form I'd like you to read. The second line is optional. Give consent form to participant(P). Wait for P to sign it. Overview E: Here is an overview of the experiment. Give overview to participant. Wait for P to read it. Bring up triad program with practice set. 1. Questions about the system and the experiment were based on Shneidermans Questionnaire for User Interaction Satisfaction [Shn91]. 228

244 Practice Task E: We have a short practice task for you to do to become familiar with the program youll be using for the experimental task. [pointing out gestures on the screen] These are gestures, or marks you could make with a pen. The dot shows the start and they animate to show their direction. The program will display groups of three gestures, one set after another. For each group of three, tap on the one that you think is most different from the other two. Once the program starts, you will see an indication here [point to lower left corner of window] of how many you have done an how many are left. Bring up practice version of triad program. Start test. Wait for P to finish. E: Thanks. Experimental Task Bring up experimental version of triad program. E: Now do the same thing again. Start test. Wait for P to finish. Save results. E: Now there is a short questionnaire Id like you to fill out on this computer [indicate other computer]. Wait for P to do questionnaire. E: Thank you for participating. 229

245 C.3 Post-experiment Questionnaire Experiment Questionnaire We would like you to tell us about your experiences in the experiment. Please be honest. We really want to know what you think. After that, the questionnaire asks some questions about your background. Please select the numbers which most appropriately reflect your impresssions about this experiment. Not Applicable = NA. 1. Overall reactions terrible wonderful j1 n k l m n j2 n k l m j3 n k l m j4 k l m j5 n k l m n j6 n k l m j7 n k l m j8 n k l m j9 k l m j NA k l m n frustrating satisfying j1 n k l m n j2 n k l m j3 k l m j4 n k l m n j5 n k l m j6 n k l m j7 n k l m j8 n k l m j9 k l m j NA k l m n dull stimulating j1 k l m n j2 n k l m n j3 n k l m j4 n k l m j5 n k l m j6 n k l m j7 n k l m j8 n k l m j9 n k l m j NA k l m difficult easy j1 n k l m n j2 k l m j3 n k l m n j4 n k l m j5 n k l m j6 n k l m j7 n k l m j8 n k l m j9 k l m j NA k l m n inadequate power adequate power j1 n k l m n j2 n k l m j3 n k l m j4 k l m j5 n k l m n j6 n k l m j7 n k l m j8 n k l m j9 k l m j NA k l m n rigid flexible j1 n k l m n j2 k l m j3 n k l m n j4 n k l m j5 n k l m j6 n k l m j7 n k l m j8 n k l m j9 k l m j NA k l m n 2. Enter any general comments below 230

246 System capabilities: 3. System speed too slow fast enough j1 n k l m n j2 n k l m j3 n k l m j4 k l m j5 n k l m n j6 n k l m j7 n k l m j8 n k l m j9 k l m j NA k l m n 4. The system is reliable never always j1 n k l m n j2 n k l m j3 n k l m j4 n k l m j5 k l m j6 n k l m n j7 n k l m j8 n k l m j9 k l m j NA k l m n 5. Correcting your mistakes difficult easy j1 n k l m n j2 n k l m j3 n k l m j4 n k l m j5 n k l m j6 k l m j7 n k l m n j8 n k l m j9 k l m j NA k l m n 6. Ease of operation depends on your level of experience never always j1 n k l m n j2 n k l m j3 n k l m j4 n k l m j5 n k l m j6 n k l m j7 n k l m j8 n k l m j9 n k l m j NA k l m 7. Enter any general comments about system capabilities below The experimental task: 8. The task was confusing clear j1 n k l m n j2 n k l m j3 k l m j4 n k l m n j5 n k l m j6 n k l m j7 n k l m j8 n k l m j9 k l m j NA k l m n 9. Overall, the task was 231

247 10. Enter any general comments about the experimental task below The following section asks you about yourself. All information will be kept confidential. 11. How many kinds of PDAs have you regularly used (including your current one, if any)? If zero, skip the next 4 questions. 12. How many years has it been since you first used a PDA regularly? 13. Which PDA do you currently use? j Apple Newton n k l m n j Palm Pilot n k l m j Sony MagicLink n k l m j WinCE-based k l m j Other k l m n j N/A k l m n 14. For how many months have you used your current PDA? 15. How often do you use your PDA? j Less than once per day k l m n j Once per day k l m n j 2-5 times per day k l m n j More than 5 times per day k l m n j N/A k l m n 16. What is your age (in years)? 17. What is your gender? j Male n k l m n j Female k l m 18. What is the highest level of education you have reached? j High school n k l m n j Some college n k l m j College degree k l m j Master's/Professional degree n k l m n j PhD/MD k l m 19. How technically sophisticated do you consider yourself to be? (1 = not at all, 5 = extremely) Not at all nj1 n k l m j2 n k l m j3 n k l m j4 n k l m j 5 Extremely k l m 232

248 20. Have you ever designed a user interface (or part of one) before? j Yes n k l m n j No k l m 21. What is your occupation? 22. Please use the box below for comments about the survey itself. 23. Please enter your name, email address, and telephone number below. All information you provide will remain absolutely confidential. Name Email address Telephone number Please double-check to make sure you have answered all applicable questions. Submit the questionnaire Thank you very much for your assistance. 233

249 Appendix D Gesture Memorability Experiment This appendix contains documents from the gesture memorability experiment: the overview handout given to participants, the script used by the experimenter, and the post- experiment questionnaire about participants experience taking the experiment1 and their demographic information. D.1 Gesture Memorability Experiment Thank you for coming to participate in our experiment. This experiment is about how peo- ple learn and remember gestures (i.e., marks made with a pen to indicate a command). Specifically, we will be looking at gestures that you might use in a drawing application (e.g., MacDraw, Adobe Illustrator). You will use a computer tablet with an electronic pen for the experiment. The experiment will take place today and one week from today. Day 1 The first day's experiment will consist of three parts Introduction. You will be shown many gestures with their names, one at a time, and asked to draw the gesture. Learning. You will be shown the name of each gesture, one at a time. Each time you are shown a name, we would like you to draw the gesture that goes with the name. If you draw the wrong gesture, you will be shown the correct one before going on. You will practice this task until you have memorized the gestures. (The computer will tell you when to stop.) Test. You will again be shown the gesture names, one at a time, and asked to draw the associate gestures. In this phase, you will not be told whether what you have drawn is correct or not. We expect the first day's activities to take less than a half-hour. Day 2 On the second day we will do the following: 1. Questions about the system and the experiment were based on Shneidermans Questionnaire for User Interaction Satisfaction [Shn91]. 234

250 Test. You will once again be shown the gesture names, one at a time, and asked to draw the associate gestures. In this phase, you will not be told whether what you have drawn is correct or not. Re-learning. You will be shown the name of each gesture, one at a time. Each time you are shown a name, we would like you to draw the gesture that goes with the name. If you draw the wrong gesture, you will be shown the correct one before going on. You will practice this task until you have memorized the gestures. (The computer will tell you when to stop.) Questionnaire. You will be asked to fill out a questionnaire on a different computer. We expect the second days activities to take less than a half-hour. You will receive a check in the mail in several weeks. (We wish it could be sooner, but we have to pay through the University and that unfortunately takes time.) D.2 Memorability Experiment Script Consent form E: [experimenter] Thanks for coming to participate in our experiment. Have a seat over here [indicate the seat at the tablet]. First, Id like you to read and sign this consent form. Hand consent form to participant (P). Wait for P to read and sign. E: Here is a copy for you to keep. Give copy of consent form to P. Introduction E: Unfortunately, the experiment may take longer than we originally thought, but we do not expect it to take longer than an hour. You are free to leave at any time, but we hope you will finish the session because we cannot use the data from your participation unless you do. Ok? Wait for some response. E: This experiment has three phases. In the first phase, we will show you a list of commands that you might use in a computer drawing program. Along with each command we will present a gesture, which is a mark you could make with a pen to invoke a command. The direction the gesture goes in important, so we show the start of each gesture with a dot. Draw 235

each gesture as it is shown. During this phase, we'd like you to try to memorize which gesture goes with each command.

E: In the second phase, we will show you the names of the commands and ask you to draw the gesture that goes with the command. We will go over them until you have learned them well enough to get them right a few times in a row. The computer will tell you when you are done.

E: In the third phase we will test your memory by showing you the commands for the gestures and asking you to draw the corresponding gestures.

E: This is an overview of the experiment.

Hand overview sheet to P. Wait for P to read it (or decide not to).

E: And this is a list of the operations whose gestures we will ask you to memorize.

Hand command list to P. Wait for P to read it (or decide not to).

Phase One - Teaching

E: Before we get started, I want to make sure you are comfortable. Feel free to raise or lower the chair, or to prop the tablet up.

Wait for P to adjust environment.

E: Now we are going to do the first phase of the experiment.

Start first phase. Wait for it to finish.

E: Good.

Phase Two - Training

E: The next part is the longest part. The computer will show you the name of many commands, one at a time. When it shows each command name, draw the gesture that goes with that command. If you make a mistake, you can redraw the gesture. When you are happy with the gesture, press the Ok button. If you don't draw a gesture and press Ok within 5 seconds of seeing it, you will automatically go on to the next one. We would like you to guess if you can, but if you have no idea you may press the Ok button without drawing a gesture first. If you get it wrong, the right gesture will show up for a few seconds. It will go away automatically and the program will show you the next command. This is a difficult task that no one gets right at first, so try not to be frustrated.

Do training. If there are any really close calls about whether something is right or not, make a note.

E: Thanks. The last part is much shorter.

Phase Three - Test

E: This is a lot like the last part except that you will only see each command once and you won't be told whether your answers are right or not.

Do testing.

Post-experiment

E: Ok, that's it. Thanks for coming. Do you have any questions?

Answer questions, if any. Schedule next week's session if you haven't already done so.

D.3 Memorability Experiment Questionnaire

We would like you to tell us about your experiences in the experiment. Please be honest. We really want to know what you think. After that, the questionnaire asks some questions about your background. Please select the numbers which most appropriately reflect your impressions about this experiment. Not Applicable = NA.

Overall reactions
1. terrible ... wonderful: 1 2 3 4 5 6 7 8 9 NA
2. frustrating ... satisfying: 1 2 3 4 5 6 7 8 9 NA
3. dull ... stimulating: 1 2 3 4 5 6 7 8 9 NA
4. difficult ... easy: 1 2 3 4 5 6 7 8 9 NA
5. rigid ... flexible: 1 2 3 4 5 6 7 8 9 NA
6. Enter any general comments below

System capabilities:
7. System speed (too slow ... fast enough): 1 2 3 4 5 6 7 8 9 NA
8. The system is reliable (never ... always): 1 2 3 4 5 6 7 8 9 NA
9. Correcting your mistakes (difficult ... easy): 1 2 3 4 5 6 7 8 9 NA
10. Enter any general comments about system capabilities below

The experimental task:
11. The task was (confusing ... clear): 1 2 3 4 5 6 7 8 9 NA

12. Overall, the task was (difficult ... easy): 1 2 3 4 5 6 7 8 9 NA
13. Enter any general comments about the experimental task below
14. We would like to know why you thought the gestures were similar or dissimilar. In the three boxes below, list the three most important things you used to decide gesture similarity, in order of importance (most important first). If you aren't sure, put your best guess.
1.
2.
3.

The following section asks you about yourself. All information will be kept confidential.

15. How many kinds of Personal Digital Assistants (e.g., Apple Newton, 3Com PalmPilot) have you regularly used (including your current one, if any)? If zero, skip the next four questions.
16. How many years has it been since you first used a PDA regularly?
17. Which PDA do you currently use? Apple Newton / PalmPilot / Sony MagicLink / WinCE-based / Other / N/A
18. For how many months have you used your current PDA?
19. How often do you use your PDA? Less than once per day / Once per day / 2-5 times per day / More than 5 times per day / N/A

20. What is your age (in years)?
21. What is your gender? Male / Female
22. What is the highest level of education you have reached? High school / Some college / College degree / Master's/Professional degree / PhD/MD
23. How technically sophisticated do you consider yourself to be? (1 = not at all, 5 = extremely) 1 2 3 4 5
24. Have you ever designed a user interface (or part of one) before? Yes / No
25. What is your occupation (or your major, if a student)?
26. Please use the box below for comments about the survey itself.
27. Please enter the following information below. Your social security number is required for payment.
Name
Address
City: Berkeley
State: CA
Zip Code: 947
Social Security Number
Email address
Telephone number
Participant # (Get from Test Administrator):

Please double-check to make sure you have answered all applicable questions. Submit the questionnaire. Thank you very much for your assistance.

Appendix E

quill Documentation

This appendix contains the documentation for quill: a tutorial and a reference manual.

E.1 quill Tutorial

This is a tutorial for the quill program. quill helps designers of pen-based user interfaces create good gestures for their interfaces. A gesture is a mark made with a pen or stylus that causes the computer to perform an action. For example, if you were editing text, you might use the following gesture to delete a word:

In this tutorial, you will learn how to use quill to:
Enter gestures
Recognize gestures
Edit collections of gestures
Organize gestures
Test recognizability of your gestures

Main Window

Figure E-1 shows the quill main window and highlights its main areas. The rest of the tutorial will refer to these areas. The remainder of the tutorial will show you how to do tasks with quill. It starts with the simplest, most basic tasks and then describes more advanced features you can take advantage of.

Figure E-1 Main window regions (labeled areas: goodness metrics, menu bar, windows, Training/Test selector, tree view, drawing area, log).

The Basics

In your interface, you want the computer to recognize different gestures for each operation. For example, you may want to use gestures for the cut, copy, and paste operations. To be able to recognize gestures, the recognizer needs many examples of each type of gesture. The collection of examples for one type of gesture is a gesture category. You usually want to enter 10 to 15 example gestures for each gesture category. A collection of gesture categories is a gesture set.

Once gestures have been entered, quill gives the designer suggestions on improving the gestures so that the computer can recognize them more easily and so they are easier to learn and remember.

259 Training the recognizer This section will lead you through an example of using quill to create gesture categories for the editing commands cut, copy, and paste. Then it will show you how to recognize gestures. Figure E-2 shows how quill looks when you first start it. Figure E-2 Initial, empty main window. Before you begin: To help you create good gestures, quill will offer you advice, which will show up in the log area at the bottom of the window. However, we do not want to see those suggestions right now, so turn them off by executing the menu item View/Show Non-Recognition Suggestions. Initially, quill starts with an empty package. A package contains gesture information for one application. It includes a training set, which is a special gesture set that is used to train the recognizer. The training set is shown in the tree view when the Training Set tab is selected, as in Figure E-1. The training set may also appear in windows. A package may 244

260 also contain one or more test sets, which are gesture sets that are used to test how recognizable the training set is. Test sets are shown in the tree view when the Test Set tab is selected, and may also appear in windows. You can ignore test sets for now. You can start to use quill by opening an existing gesture package, but for this tutorial we will create a new package. For each operation we will create a gesture category to contain example gestures for that operation: Click on the Training folder in the tree view to select it and select the menu item Gesture/New Gesture Category to create a new gesture category. You will see that the training set now has the gesture category named gesture1 in it. Select gesture1 and type the name cut and press enter. The gesture category will be renamed to cut. Now you can begin to train the recognizer by adding training examples: Create a new view of cut by single-clicking on it to select it and using the View/New View menu item. You can also double-click on it to avoid using the menu, but double-clicking with a pen is sometimes difficult. A new window 245

261 will open that looks like this: This window is where the training gestures for cut will be displayed. Now draw ten training gestures in the white gesture drawing area. Drawing Area You draw all gestures in the drawing area. It has a different border to show what kind of gestures it expects: Blue: training set gesture Green: test set gesture Red: gesture for immediate recognition Draw the cut gestures like this: In general, we recommend drawing 10-15 gestures for each gesture category. If you draw a gesture you don't like, you can remove it by clicking on it and executing the Edit/ Delete menu item. Create a new gesture category the same way and name it copy. When you open the view for this gesture category, you will get another window with the 246

262 Drawing gestures For most written letters, numbers, symbols, etc., people don't care which direction you draw it in, as long as it looks correct. For example, you could draw the number 1 by starting at the top and moving your pen down, or starting at the bottom and moving it up, and it would be the same to people. The gesture recognizer does not see gestures this way. To it, which end you start at and the direction you draw in are important. For example, it would treat a 1 drawn top-to- bottom and a 1 drawn bottom-to-top differently. In quill, gestures are always drawn with a dot at the beginning. new gesture category in it. You can move and resize the windows the same way that you move and resize desktop windows. (That is, move the windows by dragging their title bars, and resize them by dragging their corners.) Draw 10 gestures for the copy gesture that look something like this: Create another gesture category called paste and draw 10 gestures that look something like this: Viewing Gestures Sometimes it's useful to see an overview of your gestures. 247

263 Click on the Training Set folder in the tree view and execute the menu item View/New View. You should see a new window that shows the gesture categories (and groups) in the training set. Gesture views You can create as many views of a gesture set (i.e., the training set or a test set), gesture group (described in Groups section below), or gesture category as you want. For example, if a gesture set has a lot of categories and groups in it, you can make two views of it to see the ones at the beginning and the ones at the end at the same time. Recognizing Gestures Now that you have a training set, you can recognize gestures Select the training set by clicking on Training in the training area. Draw a gesture in the drawing area. The program may pause briefly while the recognizer is being trained, then it will recognize your gesture. It is possible that the recognizer will not be able to train. If that happens, try drawing more gestures for each of the gesture categories. The recognized gesture category will be displayed in the log at the bottom of the window, and in the drawing area in green. It will also turn green in the training area. (Only the most recently recognized gesture category will be green.) Those are the basics of using quill. You may want to save your gesture package now, using File/Save As... from the menu. The rest of the tutorial describes other features of quill. 248

Suggestions

You may see suggestions in the log at the bottom of the window. You may also see one of several suggestion icons displayed somewhere; they indicate that quill has a suggestion for you. See the section below on Suggestions for more information.

Editing

Gestures, gesture categories, and groups (described in the Groups section below) can all be edited using the standard editing commands other applications use: cut, copy, paste, and delete. You can select an object by clicking on it and then operate on it using the Edit menu or keyboard shortcuts (shown in the menu). You can select a range of objects by selecting the first or last one, and then holding down shift and clicking on the other end of the range. You can also toggle the selection of individual objects by holding down control and clicking.

Managing the display

If you create new views in the right side of the window it may get cluttered. You can use the controls shown in Figure E-3 to manage the display. You can move the windows by dragging on their title bars. You can resize them by dragging their corners, or by clicking the minimize or maximize buttons. You can close them with their close boxes. Also, you can resize the different parts of the main window by dragging on the gray separators between them.

Groups

You don't need groups to make a set and have gestures be recognized, but if you have many gesture categories it may be useful to organize them based on their general type. For example, you might have gesture categories for edit operations (e.g., the ones you entered above) and others for file and formatting operations. quill allows you to organize your gesture categories by creating groups and putting gesture categories in them. (If you find it helpful, you can think of groups as folders or directories in file systems.) To create a group, use the Gesture/New Group menu item. You can rename it the same way you rename a gesture category.

Figure E-3 Main window resizing and layout controls (labeled: minimize, maximize, title bar, close, resize bars).

Example group use

Here is an example of using groups with the gesture categories you made earlier in the tutorial:

Select the training set by clicking on Training in the tree.
Create a new group with the Gesture/New Group menu item.
Select the new group and type Edit to rename it.
Select the first gesture category.
Shift-click on the last gesture category to select all the gesture categories.
Cut the gesture categories with the Edit/Cut menu item.
Select the group you created (Edit).
Paste the gesture categories with the Edit/Paste menu item.

You now have a group containing gesture categories.

Test Sets

You can see how well your gestures can be recognized by drawing gestures one at a time, like you did above in the Recognizing Gestures section. However, it is tedious to keep drawing the same gestures over again. Instead, you can create a test set, and use it to test the recognizability of your training set. You probably want to draw gestures in the test set that you think should be recognized, and you can then test them and see if they are recognized. If they aren't, you might want to edit your training set. You use the Gesture/New Test Set menu item to create new test sets, and add gestures to the set the same way you add them to the training set.

Example test set use

You can create test sets this way:

Execute the Gesture/New Test Set menu item. You should see a new test set appear in the test window, named testset1.
If you want to, rename testset1 by clicking on it, typing its new name, and pressing enter.
Click the expansion icon left of the folder icon to show the groups and categories inside the test set.
Expand the Edit group the same way.
Double-click (or use View/New View) on Cut to make a new view.
Draw ten test gestures in the gesture drawing area. Make them look like the Cut gesture, but make some of them sloppy.
If you want, add test gestures to the other gesture categories (Copy and Paste).
Click on the test set (testset1, or whatever you have renamed it to) in the test set window to select it.
Execute the Gesture/Test recognition menu item.

The log will tell you if all the gestures in the test set were correctly recognized, or if some gestures were misrecognized. If you have open windows showing your test gestures, you may

see that some of them are misrecognized (if any were). If not, you can click on the Misrecognized gesture link to see which gestures were misrecognized.

The Log

The log, at the bottom of the main window, shows information about what the application is doing: for example, results of recognizing individual gestures or of testing recognition of test sets. Suggestions about your set may also be displayed there.

Suggestions

Suggestions come in four different priority levels:

Informational. You do not need to take any action from this.
Low-priority suggestion. There is something about your gestures that you may want to change.
Medium-priority suggestion. The suggested change would probably improve your gesture set.
High-priority suggestion. The suggested change would almost certainly improve your gesture set.

As in a web browser, text that is underlined and blue is a link, usually to an example, gesture, or group. Some links bring up the help window with relevant information. You may also see suggestion icons in other places in the interface, such as in the tree view of your training or test sets. In the tree, you can click on a suggestion icon, and if there is only one suggestion for that gesture category (or group or set), the log will scroll to the suggestion and highlight it. If there is more than one suggestion for that category (or group or set), then you will get a pop-up menu of suggestions, and when you select one the log will scroll to it and highlight it.

Controlling suggestion display

You can control what types of suggestions you want quill to give you. You can enable or disable recognition-related suggestions using the menu item View/Show Recognition Suggestions. You can enable or disable all other suggestions with the menu item View/Show Non-recognition Suggestions. Initially, quill starts with only non-recognition suggestions enabled, because recognition suggestions are usually different depending on

whether you have entered only a few of the gesture categories in your set or most of them. quill automatically enables recognition suggestions when you recognize a gesture or test recognition of a test set.

Buttons

Sometimes you will see the informational icon in a button. Pressing it will give more information about its context.

Goodness Metrics

quill evaluates your gestures and may give you suggestions in the log. quill also shows you a summary of how good it thinks your gestures are with two goodness metrics, displayed between the menu bar and the training area.

Human Goodness

One metric is for how good your gestures are in terms of humans, namely whether your gesture categories are different enough from each other that people will not be likely to get confused by them. (If they are in the same group, it's ok for them to be similar.)

Recognizer Goodness

The other metric is how good your gestures are for the recognizer. The recognizer goodness metric goes down if you have training gestures that are misrecognized or extremely different from others in their category, or gesture categories that are too similar to each other. (It is not affected by the test set; test sets are for your own use to test the training set.) The maximum for both metrics is 1000. The higher the goodness value, the better quill thinks your gestures are.

When the Metrics Update

If you have changed your gesture package but quill has not had time to analyze it, it puts a ? after the goodness metrics. To avoid interrupting you while you work and analyzing intermediate stages of your work, quill does not analyze your gesture package while you are making changes. If you wait a few seconds without making any changes, quill will analyze your gesture package and update the metrics.
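The deferred analysis described above is essentially a debounce: every change postpones analysis until the designer has been idle for a moment. The sketch below shows that idea in Python; the class name, the callback, and the 3-second delay are illustrative assumptions, not quill's actual implementation.

```python
import threading

class DeferredAnalyzer:
    """Run `analyze` only after `delay` seconds pass with no further changes."""

    def __init__(self, analyze, delay=3.0):
        self.analyze = analyze      # callback that re-analyzes the package and updates the metrics
        self.delay = delay          # assumed idle period, in seconds
        self._timer = None
        self._lock = threading.Lock()

    def notify_change(self):
        """Call whenever the gesture package changes; restarts the idle timer."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()            # still editing: postpone the analysis
            self._timer = threading.Timer(self.delay, self.analyze)
            self._timer.daemon = True
            self._timer.start()
```

Each edit would call notify_change(); the metrics (and the trailing "?") would only be refreshed once the timer fires after a quiet period.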

269 E.2 quill Reference Manual Introduction This is the reference manual for the quill application, which is a program to help designers of pen-based user interfaces create better gestures. A gesture is a mark made with a pen or stylus that causes the computer to perform an action. For example, if you were editing text, you might use the following gesture to delete a word: This reference manual describes the features of quill. Table of Contents Reference Manual The Gesture Hierarchy Terminology The Main Window Analysis and Suggestions How Recognizers Work Human Perception of Gesture Similarity Appendix: Logarithm Reference Manual The Gesture Hierarchy This section describes the different types of objects that the user can operate on with quill. In increasing size, they are: Gestures Gesture categories 254

Groups Sets Packages

Gestures

A gesture is a particular mark that invokes a command. For example, the following mark might invoke the paste command:

Gesture Categories

A gesture category is a collection of gestures that tells the recognizer how a type of gesture should be drawn. For example, an application might have gesture categories for cut, copy, and paste operations. Usually approximately 15 examples are adequate, although sometimes more may be required. A typical application will have a gesture category for each operation that the designer wants to be available using a gesture. However, if two gestures with very different shapes should invoke the same command, the recognition may be better if they are in separate gesture categories than if they are all in the same gesture category. For example, if a square category includes very small squares and very large squares, it might be recognized better if it were two categories, big square and small square.

Groups

A group is used by the designer to organize gestures. For example, one might have a File group for gestures dealing with file operations, a Format group for formatting gestures, etc. The recognizer ignores groups.

Sets

A set is a collection of gesture categories and/or groups. quill uses two different kinds of sets for two different purposes. A training set is a set that is used to train the recognizer. A test set is used to test the recognizability of the training set. For example, you might have a

test set whose gestures are very neatly drawn and one whose gestures are sloppily drawn, to see how well neat vs. sloppy gestures are recognized. There must be a training set for the recognizer to do recognition.

Packages

A package holds all the gesture information for one application. (Applications that have multiple modes may require more than one package.) A package contains a training set and may contain one or more test sets. quill creates one top-level window for each package. All quill data files store exactly one package each (although older legacy gesture files may each store a set or a gesture category).

Terminology

Before the elements of the interface can be described, some interaction terms need to be defined.

click: Click the left mouse button or tap the pen.
right click: Click the right mouse button or press the barrel button on the pen.
double-click: Click twice rapidly. If at first it does not work, try to do it faster. It is sometimes difficult to perform, especially with a pen, so all double-click operations can also be done with the menu.
shift click: Hold down the shift key and click.
control click: Hold down the control key and click.

You can use a pen or a mouse with quill, although many people find it difficult to draw gestures using a mouse.
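The hierarchy described above maps naturally onto a simple containment data structure. The sketch below shows one way it could be modeled in Python; the class and field names are illustrative only and are not quill's actual classes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

@dataclass
class Gesture:
    points: List[Tuple[float, float]]   # (x, y) samples of one drawn mark

@dataclass
class GestureCategory:
    name: str                           # e.g. "cut", "copy", "paste"
    examples: List[Gesture] = field(default_factory=list)   # roughly 15 training examples

@dataclass
class Group:
    name: str                           # organizational only; the recognizer ignores groups
    categories: List[GestureCategory] = field(default_factory=list)

@dataclass
class GestureSet:
    members: List[Union[Group, GestureCategory]] = field(default_factory=list)

@dataclass
class Package:
    training_set: GestureSet            # required for recognition
    test_sets: List[GestureSet] = field(default_factory=list)   # zero or more test sets
```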

272 The Main Window The quill main window is shown below: The remainder of this section describes the various parts of the main window. Tree View This area shows the training set and its groups and gesture categories, or the test sets and their groups and categories, depending on the Training/Test selector. Individual gestures are not shown in the tree. Clicking on the name, folder icon, or gesture icon selects an object (and deselects everything else in the tree). Clicking on a suggestion icon scrolls the log at the bottom of the window so that the relevant warning is in view if there is only one suggestion for that line. If there is more than one suggestion, a pop-up menu of suggestions is displayed and when one is chosen the log is scrolled to display it. Shift-clicking extends the current selection. Control-clicking toggles the selection of one object. Double-clicking creates a new window that shows the object. 257

273 Selected objects in the tree view may be edited using the Edit menu. The placement of newly created objects (see Gesture menu section) and the behavior of the gesture drawing area are determined by the selection. Windows Windows appear in the right part of the main window and are used to browse gesture categories, groups, and sets and to show information about suggestions in the log window (such as misrecognized training examples). Windows can be moved by dragging their title bars and resized by dragging their edges or corners. Clicking anywhere in a window will select it. Clicking the close box on the right side of the title bar will close the window. In windows that display objects, individual objects may be selected using the same mechanisms as in the tree view. That is, clicking on an object selects it (and deselects everything else in the subwindow). Shift-clicking extends the selection. Control-clicking toggles the selection of one object. Double-clicking creates a new view of the object in a new window. (Double-clicking with a pen is sometimes difficult. The menu command View/New View can be used, instead.) Selected objects in windows may be edited using the Edit menu. The placement of newly created objects (see Gesture menu section) and the behavior of the gesture drawing area are determined by the selection. The following subsections describe specifics of particular types of windows. 258

274 Set window This window shows the gesture categories and groups contained in a gesture set (e.g., the training set or a test set). 259

275 Group window This window shows the gesture categories contained in a group. Gesture category window This window shows the gestures for a gesture category. 260

276 Gesture window This window shows a single gesture. Misrecognized gesture window This window shows gestures that were misrecognized. Each gesture has a label and a button. The green label says which gesture category the gesture is supposed to be. The red 261

277 button says which gesture category it was recognized as. Clicking on the red button creates a window that shows the gesture category it was recognized as. Drawing area This area is used for drawing gestures. If a gesture category is selected in the tree view or if a gesture category window is active, a gesture drawn here will be added to the selected gesture category. If a gesture window is active, a gesture drawn here will replace it. If a gesture group or set is selected in the tree or a gesture group or set window is active, an example drawn here will be recognized. Results of the recognition will be shown in the log, and the label for the recognized gesture will turn green in the training area and in any windows in which it appears. During certain operations (e.g., training the recognizer), the application is unable to accept drawn gestures, and during this time this area will turn gray. Log This view is the primary means for the application to communicate to the user about what it is doing. Many different types of messages may appear here. Some examples are: Notification of autosave. Suggestions for improving your gesture set. Recognition results. Errors. Menu bar This section describes the menus and their operations. File menu Operations: New package. Create a new package. (Note: each file contains exactly one package.) Open. Open a package file. If another package is open, this creates a new top- level window. Save. Save the current package. If the current package does not need to be saved, this is greyed out. 262

278 Save As... Save the current package using a different name or in a different directory. Page Setup. Set up printing and printer properties. Print. Print the active window. Close. Close the current package. It will prompt the user if the package is unsaved. Quit. Quit quill. It will prompt the user if any packages are unsaved. Edit menu Operations: Cut. Remove the current selection and put it on the clipboard. Copy. Copy the current selection to the clipboard. Paste. Copy the contents of the clipboard into the package. Delete. Delete the current selection. For rules about where in the package pasted objects are placed, see the rules for new objects in Gesture menu. View menu Operations: New View. Create a new view of the selected object(s). Close All Views. Closes all desktop windows. Show Non-recognition Suggestions. Enable/disable the display of new suggestions not related to recognition in the log, such as suggestions related to human-perceived similarity. Show Recognition Suggestions. Enable/disable the display of new suggestions related to recognition in the log Clear suggestions. Clear all suggestions in the log. Gesture menu Operations: New category. Create a new gesture category. 263

279 New group. Create a new group. New test set. Create a new test set. Rename. Rename a gesture category, group, or test set. Enable. Enable the current selection. All objects are enabled by default. Enabled objects are used to train the recognizer and are analyzed by the program for possible suggestions. Disable. Disable the current selection. Unselected objects are not used in training the recognizer, and are not analyzed for most types of problems. Train set / Set already trained. Train the recognizer with the current training set. This is only enabled if the recognizer is not trained or being trained. Any change to the training set makes the recognizer untrained. Analyze set. Examine the set using the enabled analyzers. Test recognition... Test the recognition of the training set by trying to recognize the selected test set(s). Problems are reported in the log. Newly created objects will be placed in the selected object, or in the container object closest to the selected object that can contain the new object. For example, a group cannot be inside of another group, so if gesture A inside group B is selected and the New Group operation is performed, the new group will be added after B (i.e., as a sibling of B), not inside of (i.e., as a child of) A or B. Help menu Operations: Tutorial. Creates a separate window that shows the tutorial. Reference. Creates a separate window that shows this document. About quill. Gives brief information about the program. Analyses and Suggestions To help the designer create good gestures, quill periodically analyzes the training set and provides suggestions. There are several different analyzers, which are described below. 264

280 Training Example Misrecognition One of the analyzers looks for gestures in the training set that are recognized as something other than the gesture category they belong to. Such gestures are usually caused by one of two things: Spurious gesture. The designer may have lifted the pen too soon and truncated the gesture, accidentally put in a single dot, or made some other mistake which caused a gesture to be in the training set that was not intended. This is easy to fix by simply removing the mistaken gesture and drawing another one. Gestures that are very similar to the recognizer. Sometimes the designer may create two gesture categories that are very similar to the recognizer. This problem will normally be detected explicitly (see Gesture Categories Too Similar for Recognizer), but may also cause misrecognized gestures. For example, if gesture categories A and B are similar to each other, then some gestures from A may be misrecognized as B and vice versa. Gesture Categories Too Similar for Recognizer Another analyzer looks at all possible pairs of gesture categories and computes how similar they are to the recognizer. If they are too similar, it will be more difficult for the recognizer to tell them apart, and so more likely to misrecognize gestures. This can be corrected by changing the gesture categories so that they are more different to the recognizer, usually by changing their training gestures to have different values for one or more features. When quill detects that two gesture categories are too similar, it will provide information about how to make them more different. Duplicate Names You rarely want to have multiple gesture categories with the same name. quill often refers to gesture categories by name in its displays and messages, so multiple gesture categories with the same name will be confusing to users and to other designers. Also, when using test sets, the recognizer uses gesture category names to determine if the recognition is correct or not. If gesture category names are not unique, test gestures which should be marked incorrect may be marked correct. 265

Gestures Too Similar for Humans

Another analyzer looks at all possible pairs of gesture categories and computes how similar humans will perceive them to be. Although it is unproved, we think it likely that if dissimilar operations have similar gesture categories, people may confuse them and find it harder to learn and remember them. On the other hand, it is likely fine for similar operations to have similar gesture categories, and it may even be beneficial for learning and remembering them.

Outlying Gesture Category

Normally, it is easier for the recognizer to recognize gesture categories if they are very different from one another. However, sometimes the type of recognizer that quill uses can have difficulty recognizing gesture categories if many gesture categories are similar in some way but one is very different from them (i.e., is an outlier). When this problem occurs, quill will notify you and tell you how to change the gesture category to make it more like the others and improve the recognition.

Outlying Gesture

Sometimes designers misdraw training gestures, especially if they are unused to using a pen interface. quill looks for training gestures that are very different from others in the same gesture category, and notifies the designer of those gestures since they may be misdrawn. If you see a gesture labeled an outlier that is not misdrawn, you should press the Gesture Ok button to tell the computer that this outlier is ok.

How Recognizers Work

Designers can take advantage of most of the features of quill without knowing how the gesture recognizer works. However, in some cases it will be useful to know how the recognizer sees gestures in order to improve their recognition. In the future, quill may support multiple recognizers, but as of now it only supports the recognizer by Dean Rubine.¹ This recognizer is described in the following section.

Rubine Recognizer Overview

Rubine's recognizer is a feature-based recognizer because it categorizes gestures by measuring different features of the gestures. Features are measurable attributes of gestures, and are often geometric. Examples are length and the distance between the first and the last points. To recognize an unknown gesture, the values of the features are computed for it, and those feature values are compared with the feature values of the gestures in the training set. The unknown example is recognized as the gesture category whose feature values are most like the feature values of the unknown example.

The following sections describe in more detail how the recognizer works. First, the features are explained. Then the recognizer training process is described. Finally, recognition is described in more detail.

Features

Rubine's recognizer categorizes gestures by measuring different features of them. quill uses the following features:

Bounding box

This is not really a feature in itself, but several of the features use it. The bounding box for a gesture is the smallest upright rectangle that encloses the gesture.

283 Cosine of the initial angle This feature is how rightward the gesture goes at the beginning. This feature is highest for a gesture that begins directly to the right, and lowest for one that begins directly to the left. Only the first part of the gesture (the first 3 points) is significant. Sine of the initial angle This feature is how upward the gesture goes at the beginning. This feature is highest for a gesture that begins directly up, and lowest for one that begins directly down. Only the first part of the gesture (the first 3 points) is significant. Size of the bounding box This feature is the length of the bounding box diagonal. 268

284 Angle of the bounding box This feature is the angle that the bounding box diagonal makes with the bottom of the bounding box. Distance between first and last points This feature is the distance between the first and last points of the gesture. Cosine of angle between first and last points This feature is the horizontal distance that the end of the gesture is from the start, divided by the distance between the ends. If the end is to the left of the start, this feature is negative. 269

Sine of angle between first and last points

This feature is the vertical distance that the end of the gesture is from the start, divided by the distance between the ends. If the end is below the start, this feature is negative.

Total length

This feature is the total length of the gesture.

Total angle

This feature is the total amount of counter-clockwise turning. It is negative for clockwise turning.

Total absolute angle

This feature is the total amount of turning that the gesture does in either direction.

Sharpness

This feature is intuitively how sharp, or pointy, the gesture is. A gesture with many sharp corners will have a high sharpness. A gesture with smooth, gentle curves will have a low sharpness. A gesture with no turns or corners will have the lowest sharpness. More precisely, all gestures are composed of many small line segments, even the parts that look curved. This feature measures the angular change between each pair of adjacent line segments, squares those changes, and adds them all together. The angular change is shown here:

Training

For every gesture, the recognizer computes a vector of these features called the feature vector. The feature vector is used in training and recognition as follows. The recognizer works by first being trained on a gesture set. Then it is able to compare new examples with the training set to determine to which gesture the new example belongs.
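To make the feature vector concrete, here is a rough sketch in Python that computes simplified versions of several of the features above for a gesture given as a list of (x, y) points (at least three), and then classifies an unknown gesture by the nearest per-category mean feature vector. It is an approximation for illustration only: the actual recognizer follows Rubine's exact feature definitions and uses a classifier built from the mean feature vectors and covariance matrix described next, not this plain nearest-mean rule.

```python
import math

def feature_vector(points):
    """Simplified Rubine-style features for a gesture of at least three (x, y) points."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]

    # Bounding box: smallest upright rectangle enclosing the gesture.
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    bbox_size, bbox_angle = math.hypot(w, h), math.atan2(h, w)

    # Initial direction, taken here from the first and third points.
    dx0, dy0 = xs[2] - xs[0], ys[2] - ys[0]
    d0 = math.hypot(dx0, dy0) or 1.0
    cos_init, sin_init = dx0 / d0, dy0 / d0

    # Distance and direction between the first and last points.
    dxe, dye = xs[-1] - xs[0], ys[-1] - ys[0]
    end_dist = math.hypot(dxe, dye)
    cos_end = dxe / end_dist if end_dist else 0.0
    sin_end = dye / end_dist if end_dist else 0.0

    # Total length, signed total angle, total absolute angle, and sharpness.
    length = total_angle = abs_angle = sharpness = 0.0
    for i in range(1, len(points)):
        ax, ay = xs[i] - xs[i - 1], ys[i] - ys[i - 1]
        length += math.hypot(ax, ay)
        if i < len(points) - 1:
            bx, by = xs[i + 1] - xs[i], ys[i + 1] - ys[i]
            turn = math.atan2(ax * by - ay * bx, ax * bx + ay * by)
            total_angle += turn        # counter-clockwise turning is positive
            abs_angle += abs(turn)
            sharpness += turn * turn   # squared turning angles, summed

    return [cos_init, sin_init, bbox_size, bbox_angle, end_dist, cos_end,
            sin_end, length, total_angle, abs_angle, sharpness]

def nearest_mean_classify(unknown, training):
    """Classify by the closest per-category mean feature vector.

    `training` maps a category name to a list of example gestures (point lists).
    """
    fv = feature_vector(unknown)
    best, best_dist = None, float('inf')
    for name, examples in training.items():
        vectors = [feature_vector(e) for e in examples]
        mean = [sum(col) / len(vectors) for col in zip(*vectors)]
        dist = math.dist(fv, mean)
        if dist < best_dist:
            best, best_dist = name, dist
    return best
```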

During training, for each gesture the recognizer uses the feature vectors of the examples and computes a mean feature vector and covariance matrix (i.e., a table indicating how the features vary together and what their standard deviations are) for the gesture.

Recognition

When an example to be recognized is entered, its feature vector is computed and compared to the mean feature vectors of all gestures in the gesture set. The candidate example is recognized as being part of the gesture whose mean feature vector is closest to the feature vector of the candidate example. For a feature-based recognizer to work perfectly, the values of each feature should be normally distributed within a gesture and should vary greatly between gestures. In practice, this is rarely exactly true, but it is usually close enough for good recognition.

Human Perception of Gesture Similarity

quill tries to predict when people will perceive gestures as very similar. Its prediction is based on geometric features, some of which the recognizer also uses. Some features used for similarity prediction are not used for recognition. These features are described below.

Aspect

This feature is the extent to which the bounding box differs from a square. A gesture with a square bounding box has an aspect of zero.

Curviness

This feature is how curvy, as opposed to straight, the gesture is. Gestures with many curved lines have high curviness, while ones composed of straight lines have low curviness. A gesture with no curves has zero curviness. There is no upper limit on curviness.

Roundaboutness

This feature is the length of the gesture divided by its endpoint distance. The lowest value it can have is 1. There is no upper limit.

Density

This feature is how intuitively dense the lines in the gesture are. Formally, it is the length divided by the size of the bounding box. The lowest value it can have is 1. There is no upper limit.

Log(area)

This feature is the logarithm of the area of the bounding box.

Log(aspect)

This feature is the logarithm of the aspect.

Appendix: Logarithm

The logarithm is a mathematical function on positive numbers². The logarithm of a number x is written as log(x). For our purposes, its important properties are:

It makes numbers smaller: log(a) < a.
It preserves ordering: if a > b, then log(a) > log(b).
The logarithm of 1 is 0. For all positive numbers less than one, the logarithm is negative. For all numbers greater than one, the logarithm is positive.

The logarithm function rises steeply for small arguments and flattens out as its argument grows.
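As a rough illustration of the similarity features above, the sketch below computes aspect, roundaboutness, density, and the two logarithmic variants from a gesture's points. It is a sketch under stated assumptions: in particular, the definition of aspect used here (the distance of the bounding-box diagonal angle from 45 degrees) is one plausible reading of the description above, not a confirmed formula, and degenerate gestures are handled crudely.

```python
import math

def similarity_features(points):
    """Sketch of a few similarity-prediction features for a list of (x, y) points."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    w, h = max(xs) - min(xs), max(ys) - min(ys)

    bbox_size = math.hypot(w, h)                 # length of the bounding-box diagonal
    bbox_angle = math.atan2(h, w)                # angle of the diagonal
    aspect = abs(math.radians(45) - bbox_angle)  # zero for a square bounding box (assumed definition)

    length = sum(math.hypot(xs[i + 1] - xs[i], ys[i + 1] - ys[i])
                 for i in range(len(points) - 1))
    end_dist = math.hypot(xs[-1] - xs[0], ys[-1] - ys[0])

    roundaboutness = length / end_dist if end_dist else float('inf')
    density = length / bbox_size if bbox_size else float('inf')

    # Logarithms compress large values while preserving ordering; degenerate
    # cases (zero area or zero aspect) are flagged with -inf here for simplicity.
    log_area = math.log(w * h) if w * h > 0 else float('-inf')
    log_aspect = math.log(aspect) if aspect > 0 else float('-inf')

    return {'aspect': aspect, 'roundaboutness': roundaboutness,
            'density': density, 'log_area': log_area, 'log_aspect': log_aspect}
```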

1. Rubine, D. Specifying Gestures by Example. Proceedings of SIGGRAPH '91, pp. 329-337.

2. For this reference, we'll pretend that you cannot take the logarithm of a negative number.

Appendix F

quill Evaluation

This appendix contains documents and results of the post-experiment questionnaire from the quill evaluation. The documents come first: an overview of the experiment, descriptions of the long and short experimental tasks, the post-experiment handout, the experimenter script, and the post-experiment questionnaire.¹ The results from the post-experiment questionnaire follow. For more details on this evaluation, see Chapter 7.

F.1 Overview

The purpose of this experiment is to evaluate quill, a program for helping designers of pen-based user interfaces create gestures for their interfaces. A gesture is a mark made with a pen that causes a command to execute. Copy-edit marks used in text editing are examples of gestures. (Some people use gesture to mean motions made in three dimensions with the hands, such as pointing with the index finger. This is not what is meant by gesture here.)

We will be asking you to do the following things during this experiment:
Read the quill tutorial and perform the tasks it describes
Create two new gesture sets
Fill out a questionnaire about yourself and your experiences

In this experiment we are not testing you. If you become frustrated or have difficulty with any of the tasks, it is not your fault. The tools you will be using are still under development, and the task is not an easy one. This task is totally voluntary. We do not believe it will be unpleasant for you (in fact, we hope you will find it interesting), but if you wish to quit you may do so at any time for any reason.

1. Questions about the system and the experiment were based on Shneiderman's Questionnaire for User Interaction Satisfaction [Shn91].

We would like to know how you feel and what you are thinking as you do the task. We can get even more information if you talk about what you are doing and thinking as you perform the task. You may find it awkward at first, but it would be very helpful for us and we think you will get used to it after a few minutes. The experimenter may remind you to think aloud during the task.

For our results to be as valid as possible, it is important that all participants are treated identically and given the same information. Because of this, the experimenter may not be able to answer all of your questions. We have attempted to provide all of the information you will need to do the task in the tutorial, reference manual, and on-line help. This does not mean you should not ask questions. On the contrary, we encourage you to ask any questions you may have so we can learn what parts of the task are difficult or what parts of the program are confusing. If a question cannot be answered immediately, the experimenter will note it and answer it for you later.

F.2 Long Task: Presentation Editor

For this task, we would like you to use quill to create gestures for a presentation editing application (e.g., Microsoft PowerPoint). There are no right or wrong answers. The program may give you suggestions about how to improve your gestures. We recommend that you strongly consider them, although you may not want to follow all of them. The groups and gesture categories we would like you to make are listed below (each group is followed by its categories, which are indented):

Outline
  New bullet
  Select item
  Indent
  Unindent
Font
  Increase font size
  Decrease font size

  Change font
  Bold
  Italic
  Underline
Format
  Increase line spacing
  Decrease line spacing
  Left justify
  Center justify
  Right justify
Misc
  Insert picture
  New slide

Your goal is to make gestures that you think will be easy for people to learn and remember, and that will be easy for the computer to recognize.

Important: As you do the task, please think aloud. That is, say what you are thinking or what you are trying to do. If you find a part of the task particularly easy or difficult, say so. You may find it awkward at first, but we think you will quickly get used to it.

We would like you to work on improving your gesture set until the end of the time allotted for this task (1.5 hours). However, you may stop early if you want to.

F.3 Short Task: Web Browser

For this task, we would like you to use quill to create gestures for a web browser (e.g., Netscape Communicator, Internet Explorer). There are no right or wrong answers. The program may give you suggestions about how to improve your gestures. We recommend that you strongly consider them, although you may not want to follow all of them. The groups and gesture categories we would like you to make are listed below (each group is followed by its categories, which are indented):

Navigate
  Back
  Forward
  Home
  Reload
Bookmarks
  Add bookmark
  Edit bookmarks
Misc
  Add annotation
  Email page

Your goal is to make gestures that you think will be easy for people to learn and remember, and that will be easy for the computer to recognize.

Important: As you do the task, please think aloud. That is, say what you are thinking or what you are trying to do. If you find a part of the task particularly easy or difficult, say so. You may find it awkward at first, but we think you will quickly get used to it.

We would like you to work on improving your gesture set until the end of the time allotted for this task (45 minutes). However, you may stop early if you want to.

F.4 Post-experiment Handout

The purpose of this experiment was to investigate the difficulty of creating gestures for a gesture set and to what extent our tool (quill) aids this task. Some of the questions we are trying to answer are:

How easy is it to think of new gestures?
How good are the gesture sets made by people who are new to the task?
How easy is it to tell if there are recognition problems?
When there are recognition problems, how easy are they to fix?
Is quill easy to train?
Are the feedback and suggestions provided by quill useful? Annoying? Something else?

Our broader goal in this research is to improve gestures for pen-based user interfaces. We believe that a tool such as quill will enable interface designers to create better gestures. quill is the second gesture design tool we created. We ran a study very much like this one two years ago to investigate gesture design. We incorporated what we learned from that study (and others) into quill.

F.5 Experimenter Script

Overall guidelines

Be polite, even if the participant is late, rude, etc.
Treat all participants the same. Follow the script.
On questions the participant asks: Questions about the experiment itself are fine and you should answer them. However, do not answer questions about quill itself or give advice about gesture design or similar things. If a question is covered in the tutorial or reference, point out the appropriate section and let the participant read it. If you cannot answer a question, write it down and answer it after the end of the experiment.
Pay close attention to the participant and take notes on things that seem hard, easy, nice, confusing, frustrating, etc.

Script

Setup

Have ready these copies:
2 generic consent forms
2 A/V consent forms
Experiment overview
Experimental long task description
Experimental short task description
Post-experiment handout
quill tutorial
quill reference

Load a browser window on the questionnaire computer with the post-experiment questionnaire.
Prepare Excel with the template.
Start the Python control program if not already started.
Make sure paper is available for the participant.
Get the log ready to record what happens.
Set up camera, scan converter, and VCR.

Preliminaries

E: Thanks for coming to participate in our experiment. Before we begin the experiment itself, you need to sign these two consent forms.

Hand generic consent form and A/V consent form to P. Wait for P to read and sign them. Give P a blank copy of each.

E: You can have a copy of the consent form for yourself. Here is an overview of the experiment.

Hand overview to P. Wait for P to read it.

Tutorial

E: First, we have a tutorial that explains how to use our gesture design program, quill. Read it and do the exercises in it.

Wait for P to do the tutorial.

Experimental task 1

E: To help us learn about the program and your experiences using it, we would like you to talk aloud about what you are doing and how you feel. It may seem awkward to you at first, but we think you will get used to it. If you have questions, please ask them. I may not be able to answer all of them at the time, but I will at the end of the experiment. You will be designing gestures, so here is some scratch paper you can use if you want to.

Put scratch paper and a pen on the desk near P.

E: Here is a description of the first experimental task. Read the entire task before starting. Start whenever you're ready. Let me know when you're done.

Hand first task description to P. Start camera and VCR while P is reading the handout. Wait for P to do task 1. Don't forget to take notes, and remind them to talk through what they're doing.

E: Thanks.

Experimental task 2

Save their task 1 gesture package and load the package for task 2.

E: Here is a description of the second experimental task. Start when you're ready. Tell me when you are done.

Hand second task description to P. Wait for them to do the task.

E: Thanks. Just one more part, a questionnaire.

Questionnaire

Maximize the questionnaire window on the questionnaire computer.

E: Please fill out this questionnaire.

Save the gesture package from task 2 while waiting for P to do the questionnaire. If interpretation questions come up, say, "Interpret it as best you can. If you want to make a comment about how you interpreted it, there is a general comments box near the end."

E: Thanks.

Post-experiment

Give P the post-experiment handout. If any questions came up during the experiment that you could not answer then, answer them now. Ask P if they have any other questions.

F.6 quill Experiment Questionnaire

We would like you to tell us about your experiences using quill. Please be honest. We really want to know what you think. After that, the questionnaire asks some questions about your background. Please select the numbers which most appropriately reflect your impressions about using this computer system. Not Applicable = NA.

Overall reactions to quill
1. terrible ... wonderful: 1 2 3 4 5 6 7 8 9 NA
2. frustrating ... satisfying: 1 2 3 4 5 6 7 8 9 NA
3. dull ... stimulating: 1 2 3 4 5 6 7 8 9 NA
4. difficult ... easy: 1 2 3 4 5 6 7 8 9 NA
5. inadequate power ... adequate power: 1 2 3 4 5 6 7 8 9 NA
6. rigid ... flexible: 1 2 3 4 5 6 7 8 9 NA
7. Enter any general comments about quill below

Suggestions in quill

In one of your tasks, you received suggestions about your gestures from quill. Please rate these suggestions in this section.

8. terrible ... wonderful: 1 2 3 4 5 6 7 8 9 NA
9. frustrating ... satisfying: 1 2 3 4 5 6 7 8 9 NA
10. dull ... stimulating: 1 2 3 4 5 6 7 8 9 NA
11. difficult ... easy: 1 2 3 4 5 6 7 8 9 NA
12. inadequate power ... adequate power: 1 2 3 4 5 6 7 8 9 NA
13. rigid ... flexible: 1 2 3 4 5 6 7 8 9 NA
14. Enter any general comments about suggestions below

Learning:
15. Learning to operate quill (difficult ... easy): 1 2 3 4 5 6 7 8 9 NA

16. Exploration of features by trial and error (discouraging ... encouraging): 1 2 3 4 5 6 7 8 9 NA
17. Remembering names and use of commands (difficult ... easy): 1 2 3 4 5 6 7 8 9 NA
18. Tasks can be performed in a straightforward manner (never ... always): 1 2 3 4 5 6 7 8 9 NA
19. Enter any general comments about learning below

System capabilities:
20. System speed (too slow ... fast enough): 1 2 3 4 5 6 7 8 9 NA
21. The system is reliable (never ... always): 1 2 3 4 5 6 7 8 9 NA
22. Correcting your mistakes (difficult ... easy): 1 2 3 4 5 6 7 8 9 NA

23. Ease of operation depends on your level of experience (never ... always): 1 2 3 4 5 6 7 8 9 NA
24. Enter any general comments about system capabilities below

Tutorial:
25. The tutorial is (confusing ... clear): 1 2 3 4 5 6 7 8 9 NA
26. Information from the tutorial is easily understood (never ... always): 1 2 3 4 5 6 7 8 9 NA
27. Enter any general comments about the tutorial below

The first experimental task (presentation editor):
28. The task was (confusing ... clear): 1 2 3 4 5 6 7 8 9 NA
29. Overall, the task was (difficult ... easy): 1 2 3 4 5 6 7 8 9 NA

30. Thinking of new gestures was: difficult ... easy
31. Finding recognition problems was: difficult ... easy
32. Fixing recognition problems was: difficult ... easy
33. Entering new gesture categories was: difficult ... easy
34. Testing recognizability of gestures was: difficult ... easy
35. Enter any general comments about the first experimental task below.

The second experimental task (web browser):

36. The task was: confusing ... clear
37. Overall, the task was: difficult ... easy

38. Thinking of new gestures was: difficult ... easy
39. Finding recognition problems was: difficult ... easy
40. Fixing recognition problems was: difficult ... easy
41. Entering new gesture categories was: difficult ... easy
42. Testing recognizability of gestures was: difficult ... easy
43. Enter any general comments about the second experimental task below.

The following section asks you about yourself. All information will be kept confidential.

44. How many kinds of PDAs have you regularly used (including your current one, if any)? If zero, skip to question 49.
45. How many years has it been since you first used a PDA regularly?

46. Which PDA do you currently use? (Apple Newton / Palm Pilot / Sony MagicLink / WinCE-based / Other / N/A)
47. For how many months have you used your current PDA?
48. How often do you use your PDA? (Less than once per day / Once per day / 2-5 times per day / More than 5 times per day / N/A)
49. What is your age (in years)?
50. What is your gender? (Male / Female)
51. What is your preferred hand (i.e., your handedness)? (Right / Left / Neither (ambidextrous))
52. What is the highest level of education you have reached? (High school / Some college / College degree / Master's/Professional degree / PhD/MD)
53. How technically sophisticated do you consider yourself to be? (1 = not at all, 5 = extremely)
54. How artistic do you consider yourself to be? (1 = not at all, 5 = extremely)
55. Have you ever designed a user interface (or part of one) before? (Yes / No)
56. What is your occupation?
57. If you create UIs or have done so as part of your career, how many years have you spent designing UIs?
58. Please use the box below for comments about the survey itself.

59. Please enter your name, email address, and telephone number below. All information you provide will remain absolutely confidential. (Name / Email address / Telephone number)
60. Enter your participant ID (ask the experimenter).

Please double-check to make sure you have answered all applicable questions, then submit the questionnaire. Thank you very much for your assistance.

F.7 Experimental Results

The following table shows how answers to the above questionnaire correlated with experimental performance, as measured by human goodness, recognizer goodness, and time to complete the experiment. Pearson correlations and two-tailed significance levels are given. Statistically significant correlations (p < .05) and the corresponding significance levels are marked with an asterisk (*).
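As a concrete illustration of the statistic reported in Table F-1, the sketch below shows how a Pearson correlation and its two-tailed significance level could be computed. It is not the study's actual analysis code; it assumes Python with SciPy, and the variable names and data values are hypothetical examples only.

```python
# Sketch: computing a Pearson correlation and its two-tailed p-value,
# the statistics reported in Table F-1. The data below are hypothetical,
# not the study's actual measurements.
from scipy import stats

# Hypothetical per-participant values: a questionnaire rating (1-9 scale)
# and a performance measure (e.g., human goodness of the gesture set).
ratings = [7, 8, 5, 9, 6, 7, 4, 8, 6, 7]
human_goodness = [3.9, 4.2, 3.1, 4.6, 3.5, 4.0, 2.8, 4.3, 3.4, 3.8]

# pearsonr returns the correlation coefficient r and the two-tailed p-value.
r, p_two_tailed = stats.pearsonr(ratings, human_goodness)
print(f"Pearson r = {r:.3f}, two-tailed p = {p_two_tailed:.3f}")

# An entry would be flagged as statistically significant when its p-value
# falls below the chosen threshold (here assumed to be 0.05).
print("significant" if p_two_tailed < 0.05 else "not significant")
```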

Question #     Pearson correlation (r)                Significance (2-tailed, p)
               Human       Recognizer    Time         Human       Recognizer    Time
               goodness    goodness                   goodness    goodness
 1              0.767*      0.205       -0.222         0.010*      0.570        0.538
 2              0.477       0.109       -0.157         0.163       0.764        0.665
 3              0.321      -0.068       -0.044         0.366       0.852        0.903
 4              0.460       0.109       -0.257         0.181       0.765        0.474
 5              0.902*      0.704*      -0.560         0.000358*   0.023*       0.092
 6              0.325       0.287        0.002         0.394       0.454        0.997
 8              0.498      -0.245        0.138         0.172       0.525        0.724
 9              0.367      -0.171        0.245         0.332       0.659        0.525
10             -0.010       0.164       -0.118         0.982       0.697        0.780
11              0.249      -0.103       -0.072         0.518       0.793        0.853
12              0.443       0.446       -0.516         0.232       0.229        0.155
13              0.307       0.182        0.512         0.459       0.666        0.195
15              0.012      -0.260       -0.019         0.973       0.469        0.958
16              0.425      -0.106       -0.076         0.221       0.771        0.836
17              0.539      -0.263       -0.141         0.108       0.464        0.697
18              0.165      -0.133        0.048         0.649       0.714        0.896
20              0.259       0.038        0.352         0.471       0.918        0.319
21              0.062      -0.175        0.411         0.865       0.629        0.237
22              0.473      -0.233       -0.112         0.167       0.516        0.758
23              0.464       0.043        0.066         0.247       0.919        0.876
25             -0.040      -0.219        0.165         0.912       0.543        0.648
26             -0.002      -0.483        0.158         0.996       0.157        0.662
28             -0.115       0.808*       0.393         0.752       0.005*       0.261
29              0.000       0.554        0.420         1.000       0.097        0.227
30             -0.239       0.186        0.016         0.506       0.606        0.966
31             -0.439       0.134        0.836*        0.205       0.712        0.003*
32             -0.148       0.477        0.386         0.704       0.194        0.304
33              0.327       0.757*       0.130         0.356       0.011*       0.721
34              0.504       0.843*       0.461         0.167       0.004*       0.212
36             -0.195       0.000        0.028         0.590       1.000        0.938
37              0.141       0.005        0.076         0.698       0.990        0.834
38              0.227      -0.066        0.229         0.528       0.857        0.524
39              0.134       0.388       -0.037         0.752       0.343        0.931
40              0.452       0.747*      -0.192         0.261       0.033*       0.649
41              0.411       0.292        0.310         0.238       0.413        0.383
42              0.605       0.189        0.017         0.085       0.626        0.965
44             -0.047      -0.064       -0.083         0.898       0.861        0.820
45              0.149      -0.117        0.140         0.682       0.748        0.701
47              0.402       0.336       -0.599         0.249       0.343        0.067
48              0.135      -0.065        0.087         0.710       0.858        0.811
49             -0.296      -0.828*       0.163         0.406       0.003*       0.653
57             -0.303       0.021        0.036         0.428       0.957        0.926

Table F-1. Correlations of questionnaire responses with experimental performance. Statistically significant correlations (p < .05) and their significance levels are marked with an asterisk (*). Responses to the questions on suggestions (8-13) are correlated with performance on the suggestion-enabled task; responses to the questions on the long task and the short task (28-34 and 36-42) are correlated with performance on the long and short tasks, respectively.
