Shared Input Multimodal Mobile Interfaces: Interaction Modality Effects on Menu Selection in Single-task and Dual-task Environments

Shengdong ZHAO, Duncan BRUMBY, Mark CHIGNELL, Dario SALVUCCI, and Sahil GOYAL


Audio and visual modalities are two common output channels in the user interfaces embedded in today's mobile devices. However, these user interfaces typically center on the visual modality as the primary output channel, with audio output serving a secondary role. This paper argues for an increased need for shared input multimodal user interfaces on mobile devices. A shared input multimodal interface can be operated independently using a specific output modality, allowing users to choose their preferred method of interaction in different scenarios. We evaluate the value of a shared input multimodal menu system both in a single-task desktop setting and in a dynamic dual-task setting, in which the user was required to interact with the shared input multimodal menu system while driving a simulated vehicle. Results indicate that users were faster at locating a target item in the menu when visual feedback was provided in the single-task desktop setting, but in the dual-task driving setting, visual output presented a significant source of visual distraction that interfered with driving performance. In contrast, auditory output mitigated some of the risk associated with menu selection while driving. A shared input multimodal interface allows users to take advantage of multiple feedback modalities as the situation demands, providing a better overall experience.

KEYWORDS: earPod, eyes-free, shared-input multimodal interfaces


In today’s technology-rich world, devices are increasingly powerful and multi-functional. For instance, a single handheld device can act as a cell phone, music player, digital camera, GPS navigation system, and Personal Digital Assistant (PDA). With increased functionality to support day-to-day tasks, people increasingly turn to such devices in any number of different contexts and settings. An important distinction that can be made between usage scenarios is between those where the user is in a relatively isolated and static environment and those where the user is on the move.

Most interfaces today rely on the visual modality to present information to users. This works well in a relatively isolated and static environment where visual attention is available; for users on the move, however, interacting with visual interfaces creates competition for limited visual resources. For example, interacting with an iPod while driving may be distracting and constitutes a potential safety hazard [Salvucci et al., 2007]. Multiple Resource Theory [Wickens, 2002] suggests that using auditory output for the secondary task may alleviate interference in a dual-task setting where the primary task is visually demanding.

Given the diverse usage contexts a mobile device is likely to encounter, allowing access to computing tasks using either output modality lets users choose the most desirable interaction method in a specific context and enhances the overall user experience. To achieve this effect, existing approaches often provide two different interfaces: for instance, the same mobile device supports dialing a phone number either via voice commands or by pressing digits on a keypad.

Instead of using two different (manual vs. voice) interfaces, we propose to use two related interfaces with a shared input mechanism. The two interfaces differ only in their output modalities, resulting in a shared input multimodal interface that can be independently operated using either audio or visual feedback.

We apply this new interface design approach to the design of mobile menus by extending a touch-based auditory menu technique called earPod [Zhao, et al., 2007] into an integrated interface that has both an audio and a visual interface (Figure 1 and Figure 2).


Figure 1. Using earPod. (a, b) Sliding the thumb on the circular touchpad allows discovery of menu items; (c) the desired item is selected by lifting the thumb; (d) faster finger motions cause partial playback of audio. Size of the touchpad has been exaggerated for illustration purposes. 


Figure 2. Our earPod prototype uses a headset and a modified touchpad 

The original earPod technique is designed for an auditory device controlled by a circular touchpad whose output is experienced via a headset (Figure 2), as found, for example, on an Apple iPod. Figure 3 shows how the touchpad area is functionally divided into an inner disc and an outer track called the dial. The dial is divided evenly into sectors, similar to a Pie menu [Callahan, et al., 1988] or Marking menu [Kurtenbach, 1993; Zhao, et al., 2006; Zhao, et al., 2004]. Using the earPod technique for menu selection is illustrated in Figure 1. When a user touches the dial, the audio menu responds by saying the name of the menu item located under the finger (Figure 1a). Users may continue to press their finger on the touch surface, or initiate an exploratory gesture on the dial (Figure 1b). Whenever the finger enters a new sector on the dial, playback of the previous menu item is aborted. In addition to speech playback of menu items, we use non-speech audio to provide rapid navigational cues to the user: each boundary crossing is reinforced by a click sound before the new menu item is played. Once the desired menu item has been reached, users select it by lifting the touching finger, which is confirmed by a "camera-shutter" sound (Figure 1c). Users can abort an item selection by releasing the operating finger on the center of the touchpad. If a selected item has submenus, users repeat the above process to drill down the hierarchy until they reach the desired leaf item. Users can skip items rapidly using fast dialing gestures (Figure 1d). All speech sounds used in earPod are human voices recorded in CD quality (16-bit, 44.1 kHz) using professional equipment.
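The dial-and-disc geometry described above can be sketched as a simple hit-testing function. This is an illustrative reconstruction, not the authors' implementation; the radii and the eight-sector count are assumed values:

```python
import math

def classify_touch(x, y, pad_radius=1.0, inner_radius=0.35, n_sectors=8):
    """Map a touch point (coordinates relative to the pad center) onto
    earPod's functional areas: the inner cancel disc or a dial sector.
    All geometry constants here are illustrative assumptions."""
    r = math.hypot(x, y)
    if r > pad_radius:
        return None          # touch landed outside the pad
    if r < inner_radius:
        return "cancel"      # releasing here aborts the selection
    # Angle measured clockwise from the 12 o'clock position.
    angle = math.degrees(math.atan2(x, y)) % 360
    return int(angle // (360 / n_sectors))
```

Crossing from one sector index to another would trigger the click sound and restart speech playback for the newly entered item.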


Figure 3. The functional areas of earPod’s touchpad. Up to 12 menu items can be mapped to the track. The inner disc is used for canceling a selection.  

A previous study [Zhao et al., 2007] showed earPod to be an effective eyes-free menu selection technique, with performance comparable to iPod-style linear menus (Figure 4) for menu selection tasks in a single-task desktop context. This suggested that earPod was a compelling technique for eyes-free scenarios. However, today's mobile devices can be used in many different contexts and settings (e.g., in a static environment, or in changing environments). A technique optimized for a particular scenario may work poorly in others, reducing its overall usefulness. In order to design menu techniques that work well across scenarios, we expand the design space of earPod [Zhao et al., 2007] into a family of menu techniques that differ in feedback modality and menu style. We then investigate how different points in this design space (modality: visual, audio, and audio-visual feedback; menu layout style: linear and radial) affect user performance and preference for menu selection in single- and dual-task environments. Finally, based on these results we draw out design recommendations and guidelines for shared input multimodal mobile interfaces that are suitable for both single- and dual-task contexts.


Figure 4. The iPod visual menu (left) and its interaction technique (right)

This research attempts to address the following research questions related to the design of a shared input multimodal mobile menu.

  1. In what situation should each interface in a shared input multimodal menu be used?
  2. What is the benefit (if any) of allowing users to choose which modality to use for mobile input?
  3. Should the two modalities of feedback be provided simultaneously, or separately?
  4. How do different menu styles affect the design of shared input multimodal interfaces?

 To answer these questions, we performed two rounds of experiments, which evaluated the alternative output modalities under the single-task desktop and dual-task driving conditions respectively. The results, along with design recommendations for both desktop and driving scenarios and general discussions of interface design for mobile and ubiquitous computing, are presented and discussed in later sections of this paper.


The research literature on hierarchical menu layout and on output modality in hierarchical menus is reviewed in the following subsections. We begin by reviewing previous work that has explored the design space of different menu layout conventions, and that has examined how various output modalities can be used to support user interactions with a system.

Menu Layout

Many menus have been developed for diverse applications and platforms. They can be classified in different ways, but this paper focuses on two contrasting menu styles: linear menus vs. radial menus. Linear style menus lay out their items linearly, so the cost (effort) to access each item differs (Figure 5, right); radial style menus lay out their items radially in a polar coordinate system, so each item is a constant distance from the center of the circle in which the menu is embedded (Figure 5, left). Items in linear menus are also relative to each other in the sense that they have to be traversed sequentially in order to reach the target item. In contrast, items in radial menus have absolute locations in the sense that, with sufficient skill and knowledge, users can go directly to the target item without having to traverse other items on the way. Radial menus also have the advantage of allowing items to be placed in meaningful locations: for example, "Open" and "Close" can be placed in opposite directions, and "Previous" and "Next" can be placed in a way that reflects the semantics of those words. In this paper, the linear vs. radial terminology will be used throughout to distinguish between menu types; it should be kept in mind, however, that linear menus are also relative, and that radial menus are also absolute.


Figure 5. Screenshots of the visual radial interface (left) and visual linear interface (right)

Callahan et al. [Callahan et al., 1988] summarized the strengths and weaknesses of each menu style. Linear style menus make it easier to arrange items, are more flexible in the number of choices in a single menu/submenu, and are more familiar to users [Sears and Shneiderman, 1994]. However, because items are arranged sequentially, access time to each item is uneven: depending on the initial placement of the cursor, items closer to the cursor are quicker to select than items further away. Radial style menus, on the other hand, lay out items at an equal distance from the center, require constant access time, and show better performance than linear style menus [Callahan et al., 1988; Kurtenbach and Buxton, 1994]. However, placing labels in a circular layout requires more space (Figure 5, left), and the number of items allowed in one circular array is typically limited to no more than 12 due to performance concerns [Kurtenbach and Buxton, 1993; Zhao and Balakrishnan, 2004; Zhao, et al., 2006].
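The uneven access cost of linear menus can be made concrete with a toy calculation (assuming the cursor starts at the first item and one scroll step per item; these simplifying assumptions are ours, not Callahan et al.'s):

```python
def linear_avg_steps(n):
    """Average scroll steps to reach a uniformly random target in an
    n-item linear menu when the cursor starts on the first item."""
    return sum(range(n)) / n   # (0 + 1 + ... + (n-1)) / n

def radial_steps(n):
    """In a radial menu every item owns a sector, so a skilled user
    reaches any item in a single direct movement."""
    return 1
```

For an eight-item menu the linear average is 3.5 steps, while the radial cost stays flat; the gap widens as the menu grows, consistent with the performance differences reported above.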

Output Modality and Menu Design

Interfaces typically require some form of output feedback to guide and inform users. Visual output is very common in current graphical user interfaces (GUIs). Haptic output is another possibility: haptic output is by nature a "private" display and can communicate information even in noisy environments [Wagner, et al., 1999; Luk, et al., 2006]. However, most users are not familiar with haptic-based languages such as the Braille alphabet, making it difficult to use haptics to communicate rich information effectively.

Auditory output, on the other hand, may use both speech and non-speech audio, allowing it to communicate semantically rich information to users with less learning. Sound travels through space and is omnidirectional, making it particularly suitable for delivering important messages such as alarms and alerts.

There is a large body of work on audio-based icons, for example, using non-speech audio segments that are much shorter than the equivalent speech messages (e.g., [Brewster et al., 2003]). This is similar to the use of space-efficient icons to represent labels in graphical interfaces, an approach first introduced by Gaver and colleagues [Gaver and Smith, 1991; Gaver, 1989], who designed an auditory system called Sonic Finder for the Apple Macintosh computer, using everyday sounds to represent objects, tasks, and events. These symbolic sounds, which are analogous to what they represent, are called auditory icons.

Motivated by the needs of blind users, Mynatt [Mynatt, 1995] discussed methods for creating auditory equivalents of desktop user interface elements. Other work has examined the use of auditory interfaces for specific tasks. Arons [Arons, 1997] described the SpeechSkimmer system for efficiently browsing through recorded speech. Minoru and Schmandt [Minoru and Schmandt, 1997] and Roy and Schmandt [Roy and Schmandt, 1996] also described systems for navigating through audio information. Other interactive methods for interacting with audio were described by Schmandt [Schmandt, 1998] and Schmandt et al. [Schmandt, et al., 2004]. Sawhney and Schmandt described the Nomadic Radio system, which used an audio interface to access communication and information services while on the move. Stifelman et al. [Stifelman, et al., 1993, 2001] examined the problem of note taking and annotation using an auditory interface.

However, relatively few studies have focused on the interaction design of auditory menus. One notable exception is Brewster's work [Brewster, 1998], which investigated the effectiveness of non-speech audio navigational cues in voice menus. He found that distinctive non-speech audio elements (earcons) were a powerful method of communicating hierarchy information, since it is difficult to match auditory icons with suitable iconic sounds for events in an interface: not every interface action or state has a corresponding sound-producing event in the real world. However, in this work Brewster used earcons to indicate the current tool and tool changes, not as a form of audio feedback for an imminent tool selection by the user.

Dingler et al. [Dingler, et al., 2008] investigated the learnability of sonification techniques such as auditory icons, earcons, speech, and spearcons for representing common environmental features. Spearcons are speech stimuli that have been greatly sped up [Walker, et al., 2006]. They found that speech and spearcons were easily learnable compared with earcons and auditory icons; in fact, earcons were much more difficult to learn than speech. With this study in mind, we used speech as the audio feedback for earPod.

Recently, the emergence of mobile computing has inspired researchers to rethink the interaction model of audio command selection. Pirhonen et al. [Pirhonen, et al., 2002] investigated the use of simple gestures and audio-only feedback to control music playback in mobile devices. Brewster et al. [Brewster, et al., 2003] also investigated the use of head gestures to operate auditory menus. Both techniques have demonstrated effectiveness in the mobile environment. However, they have only been investigated with a very limited number of commands. For example, the head gesture menu Brewster et al. created used only 4 options, which is insufficient for the wide range of functionality that exists in today’s devices.

One prior technique that is similar to earPod is Rinnot's Sonic Texting [Rinnot, 2005], a mobile text input method that leverages touch input and auditory output. It uses a two-level radial menu layout to organize the alphabet, and uses a "keybong" joystick for input. While it is similar to earPod in its use of a radial menu layout and auditory feedback, the earPod technique differs from it in several respects. Sonic Texting uses a joystick instead of a touchpad for gestural input, and the gestures supported for selection are quite different: Sonic Texting uses back and forth movements of the joystick to select text, while in earPod, finger gliding, tapping, and lifting are the primary gestures for interaction. Sonic Texting is designed for text input and is optimized for a specific alphabet, whereas earPod is a general menuing technique that can support multiple hierarchies of menu items. Finally, Sonic Texting has not been formally evaluated, making it difficult to assess its viability as a mobile text input method. As mentioned by the author, Sonic Texting's goal is not to maximize words-per-minute efficiency, but to create an engaging audio-tactile experience; in the informal evaluation session in which it was used, many users regarded it as a game or musical instrument rather than a mobile text input method.

Nigay and Coutaz [Nigay and Coutaz, 1993] proposed a design space for multimodal systems. Their taxonomy defined three dimensions of a multimodal system: level of abstraction, concurrency, and fusion. According to their definition, the difference between multimodal and multimedia lies in whether or not the system can interpret the meaning of the different modality channels. However, their analysis mostly focused on input modalities; no examples were given of how to apply this principle to systems with multiple output modalities. For example, suppose a system plays a movie clip, containing both audio and video, to an audience. By the conventional definition, this system should be classified as a multimedia system. However, if the system uses text-to-speech instead of raw recordings to play the audio dialogue, then according to Nigay and Coutaz's taxonomy the system knows the meaning of the output and would be classified as a multimodal system. This can be a bit misleading, since the text-to-speech is not used to provide system feedback about the user's input, but to deliver content to users.

We would like to amend Nigay and Coutaz's taxonomy to include systems with multiple channels of output. If the multiple channels of output provide system feedback about user input, then the system is considered multimodal; otherwise, it is considered multimedia, regardless of whether the system understands the meaning of the output.

It is not a simple task to decide which modality is best, as each has its own advantages and disadvantages. Salmen and colleagues [Salmen, et al., 1999], who weighed the pros and cons of using audio, visual, and dual modalities in a driving scenario, found that the audio modality is beneficial because drivers do not have to refer to the screen and can focus on the road. However, if a list of audio instructions is too long, drivers may have difficulty recalling the full set, as oral presentation typically takes three times longer to process than reading. Thus, the visual modality may be useful when drivers want to get information faster, but it raises its own issues, such as whether scrolling down a page or paging through text is better. Finally, the combination of visual and audio modalities may seem like the perfect solution, but if used simultaneously there may be too many different audio and visual commands to remember, which may lead to frequent mix-ups [Salmen, et al., 1999].

Within our "shared input exclusive multimodal system", there are two single-modality interfaces designed to function independently for different scenarios. However, the two interfaces share the same input mechanism as well as the same mental model, so training received with one interface transfers to the other, since they differ only in output modality.

In multimodal user interface design, input may be provided using multiple modalities. For instance, phone numbers may be entered either via voice commands or by pressing digits on a keypad. Different feedback modalities can also co-exist, with different modalities used depending on the context of use. For instance, previous studies have examined and compared unimodal, bimodal, and trimodal feedback conditions [Akamatsu, et al., 1995; Vitense, et al., 2002]. Jacko et al. [Jacko, et al., 2003] examined multimodal feedback (for persons who are older and possess either normal or impaired vision) in drag-and-drop tasks relating to daily computer use. In addition, the game industry has demonstrated a successful union of three types of feedback (audio, haptic, and visual), which can provide players with an "immersive" experience in a simulation game [Jacko, et al., 2003].

Sodnik et al. [Sodnik, et al., 2008] compared the use of auditory versus visual interfaces for interaction with a mobile device while driving. The proposed auditory interfaces consisted of spatialized auditory cues for menu selection. Though their results indicated that the task completion rate was the same for both audio and visual interfaces, they found that driving performance was better and perceived cognitive load was lower when interacting with the auditory interfaces. Not surprisingly, users were distracted by the visual interface while driving, and preferred the audio interface. Pfleging et al. [Pfleging, et al., 2011] presented a prototype that combines speech and multi-touch gestures for multimodal input in an automotive environment.

There are many scenarios in which a user might desire or prefer eyes-free interaction [Yi, et al., 2012]. Apart from contexts where eyes-free interaction may be less demanding, Yi et al. noted that users are willing to use eyes-free interaction as a form of social acceptance, or even a form of self-expression.

The purpose of the present study was to compare the effectiveness of alternative modalities of audio, visual and audio-visual feedback for menu selection tasks in single-task and dual-task scenarios.


If we consider modality and menu style as two dimensions in a design space, a simple analysis (Table I) reveals that there are a number of design alternatives. If we label an interface firstly by its primary feedback modality, followed by its menu style, the popular iPod will fit within the “visual linear” category, whereas the earPod [Zhao, et al., 2007] will reside within the “audio radial” cell. The two alternative interfaces here are the “audio linear” and “visual radial”. Additionally, since audio and visual feedback can co-exist and are not mutually exclusive (unlike menu style), there is a third possible choice in the modality dimension, which is the audio-visual or dual modality. This gives us a 3×2 matrix of six design possibilities. These alternative designs cover a variety of interesting properties and thus warrant further investigation in a multitasking context.

[Table I. The 3×2 design space: feedback modality (audio, visual, dual) × menu style (linear, radial)]

Visual Linear (iPod) and Audio Radial (earPod)

The two interfaces (earPod and the iPod-like menu) differ in two aspects: the modality of feedback, and the menu style for presenting and navigating menu items. For modality, the iPod-like menu primarily relies on a visual display to present menu options and navigational cues, while earPod does both entirely with audio. In terms of menu style, the iPod uses linear menus (Figure 4), where items are placed linearly and there is no one-to-one mapping between specific input areas and menu items; earPod, on the other hand, adopts a radial menu layout where each menu item is directly mapped to a physical location on the touchpad (Figure 5, left, shows an example of the radial layout for the visual interface), allowing expert users to access any item in the list in constant time.

Audio Linear

In Table I, audio linear is the cell next to the iPod-like visual menu. It provides spoken-word auditory feedback to users as they scroll up or down a menu list. In some respects the interface is similar to that of the popular Apple iPod digital music player, except that in the absence of a visual display it provides auditory feedback on the user's actions. Moreover, such an audio linear interface could easily be integrated with the existing iPod interface; one is currently available through an open-source solution from [Rockbox]. However, linear menus are often slower than radial ones [Callahan, et al., 1988], and because auditory feedback is serial, the audio linear interface could be even slower to operate than the visual linear interface. It will be informative to systematically evaluate it against the other cells in the design space.

Visual Radial

As discussed above, this interface has a radial input area that supports a radial menu layout; that is, specific spatial regions on the input device have a one-to-one mapping with items in the menu. Figure 5 (left) shows our design of the visual radial interface. Although the interaction method differs, its appearance is similar to that of a marking menu [Kurtenbach, 1993; Zhao, et al., 2007]. Notice that this is in contrast to the linear menu layout [Brumby, et al., 2009], where the input device supports a vertical scroll of a focus point through the menu (see Figure 5, right). We might speculate that the performance advantages of the earPod interface discussed earlier [Zhao, et al., 2007] may, in part, be due to the radial menu layout used. We aim to evaluate more carefully the potential performance benefit of using a radial menu layout for selecting items from reasonably sized static menus.

Dual (audio-visual) Linear and Dual (audio-visual) Radial

By providing both audio and visual feedback simultaneously, the interface may combine the best of both worlds: it can be operated using either modality, giving users a choice of which modality to attend to in different circumstances. For example, if the device is operated inside one's pocket, the visual feedback can be ignored; if the device is in a noisy environment, the visual feedback prevails and the audio feedback becomes less useful. Since both channels of feedback use the same menu style, training received in either modality can be used in the other. However, simultaneously providing both modalities might waste resources (such as battery power), and the non-preferred source of feedback has the potential to be distracting: for example, a user who prefers visual feedback could be annoyed by the simultaneous audio feedback.

Modality (audio, visual and dual) and menu style (linear vs. radial) are two dimensions for describing an interesting design space of menu selection. To disentangle the individual effects of the two design dimensions, and further explore the properties of the other four design alternatives relative to the iPod and earPod interfaces in the baseline desktop conditions, we decided on the following 3×2 experimental design that employed all six interfaces from Table I.


The aim of Experiment 1 was to systematically evaluate the design space for the menu selection task along the dimensions outlined above, namely feedback modality and layout style. Participants attempted to locate and select a pre-defined item from an eight-item menu as quickly and accurately as possible. In terms of selection time, based on earlier work we would expect the radial layout to be faster, because it allows direct access to menu items whereas the linear layout requires sequential access [Callahan et al., 1988; Kurtenbach and Buxton, 1994]. We also expect the visual output modality to be faster, because auditory information is serial and temporal in nature while visual information can be scanned and compared quickly [Zhao, et al. 2007].



Twelve right-handed participants (3 females) ranging in age from 18 to 29 years (mean 22), recruited within a university community, volunteered for the experiment.


A menu selection task was used that required participants to select a target item from a menu. Each menu contained eight items, all belonging to the same natural category. Materials were developed from examples of natural categories taken from KidsClick! and Wikipedia. Across the set of materials there were eight categories, describing types of Clothing, Fish, Instrument, Job, Animal, Color, Country, and Fruit. For each of these categories there were eight items; for instance, Carp, Cod, Eel, Haddock, Pollock, Redfish, Salmon, and Sardine were used as types of Fish. Each item was a single word, and no word appeared more than once in the database.

The experimental software ran on a Compaq Presario V2000 laptop with 2 GB of RAM running Microsoft Windows XP. Input was controlled by a Cirque EasyCat USB external touchpad. The touchpad was made circular by placing a thin plastic overlay over the touchpad area. Both radial and linear menu layout styles were implemented on the circular touchpad. In the linear design, movement of the thumb around the circular touchpad scrolls through the list of items in the menu (much like the interaction technique used on the Apple iPod). In contrast, the radial design subdivides the circular touchpad into discrete regions, so that each menu item is located at a particular location. In both cases, the participant selected the currently highlighted item in the menu by lifting their thumb off the touchpad. A short click sound provided feedback that a selection had been made.
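The two mappings from thumb position to menu state can be sketched as follows. This is our reconstruction of the contrast, with the scroll step size an assumed constant rather than a value reported by the experiment:

```python
N_ITEMS = 8

def radial_item(angle_deg):
    """Radial layout: each 45-degree sector of the circular pad maps
    one-to-one to a menu item, so position alone determines the item."""
    return int(angle_deg % 360 // (360 / N_ITEMS))

def linear_scroll(prev_deg, new_deg, highlighted, step_deg=30):
    """Linear layout: circular thumb travel scrolls the highlight,
    iPod-style; every step_deg degrees of travel moves it one item."""
    delta = (new_deg - prev_deg + 180) % 360 - 180  # signed shortest travel
    return max(0, min(N_ITEMS - 1, highlighted + int(delta / step_deg)))
```

In the radial mapping a target's location is absolute, while in the linear mapping only relative travel matters, which is exactly the relative-versus-absolute distinction drawn in the earlier menu layout discussion.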

In addition to the different layout styles, different output modalities could be used as the participant explored the menu (audio, visual, or audio-visual). For visual output, menu items were arranged in a vertical list, one item per line. All text was presented on a 19-inch LCD monitor in Helvetica, bold, size 16. A colored box highlighted the currently selected item. For audio output, auditory information was generated using a real-time simulation library. When the user scrolled over an item in the menu, the item's label was spoken in a female human voice. Each audio clip took approximately 1 second to play back. The audio recording was interruptible, such that if the user quickly scrolled to the next item, playback of the first would terminate and the next item would be output. Audio output was presented to the participant through standard stereo headphones. For audio-visual feedback, both output streams were presented simultaneously to the user.
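The interruptible playback behavior can be modeled as a small state machine. The class below is a hypothetical sketch of that logic (the actual audio backend is stubbed out; the class and method names are ours):

```python
class InterruptiblePlayer:
    """Models interruptible speech feedback: starting a new clip cuts off
    whichever clip is still playing, as when the user scrolls quickly."""
    def __init__(self):
        self.current = None   # label of the clip now playing, if any
        self.aborted = []     # labels whose playback was cut short

    def play(self, label):
        if self.current is not None:
            self.aborted.append(self.current)  # interrupt previous clip
        self.current = label  # a real device would start audio output here

    def finish(self):
        self.current = None   # clip played through to completion
```

Scrolling rapidly past an item before its clip finishes leaves it in `aborted`, mirroring the truncated playback participants heard during fast exploration.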


A 2 × 3 (layout style × output modality) within-subjects design was used to systematically explore the design space for the menu selection task. Menu items were presented in either a radial or linear layout style. The output was given as audio-only, visual-only, or audio-visual. The order in which each output modality was used was counterbalanced between participants, while the ordering of menu layout style was randomized within each modality. The main dependent variables of interest were the time taken to select a target item from the menu and the number of errors made. Subjective feedback on participants' preferences for each point in the design space was also gathered during a post-experiment interview.


Participants were informed that they would perform a menu selection task using different design alternatives. They were told that the whole experiment would take about an hour to complete, and that they were free to withdraw at any time without loss of credit. After receiving these instructions, participants were instructed to put on the headphones and to hold the touchpad in their right hand, keeping the thumb off the touchpad.

Participants completed a series of menu selections with each interface type (i.e., each combination of layout style and output modality). Before each condition, participants received 8 practice trials (1 block) with the particular interface type to familiarize themselves with it. This was to ensure that any prior experience users had with a certain menu type would not affect the findings of the experiment. Once familiar with the interface, each participant completed 96 trials, consisting of 12 blocks of trials for each of the eight possible target positions. In total, each participant completed 576 experimental trials (2 layout styles × 3 output modalities × 8 target items × 12 blocks) along with 48 practice trials.

For each trial, the to-be-selected item was presented in the center of the monitor (e.g., “Animal”). Once they had encoded the to-be-selected item, the participant could start the trial by pressing the touchpad (i.e., making a selection gesture). The participant then searched the menu for the target. Depending on the condition, participants received audio-only, visual-only, or audio-visual output as they searched. In the audio conditions, participants heard the spoken name of each traversed menu item through their headphones. In the visual conditions, menu items were displayed on the screen (Figure 6). Participants selected the currently highlighted item by lifting their thumb off the touchpad. If an incorrect selection was made, a visual prompt notified participants of their error; they were then required to make another selection from the menu and did not progress to the next trial until the target was selected. The trial ended when the participant selected the target, and participants were instructed to locate the target as quickly and as accurately as possible. After each trial, a visual message in the center of the screen instructed participants to press the spacebar to proceed to the next trial. Participants were allowed to take breaks between trials, and breaks were enforced between interface conditions. After completing all of the trials, participants were asked to answer a set of questions about their preferences for each interface design.


For each trial, we considered data from when the menu first appeared to when the participant selected an item. For statistical analysis, a 2 × 3 × 12 (layout style × output modality × block) repeated-measures ANOVA was used, and effects were judged significant if they reached the .05 significance level.
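For readers unfamiliar with the machinery, the F ratio in a repeated-measures ANOVA can be computed as in the simplified one-factor sketch below; the study's full 2 × 3 × 12 analysis extends this to multiple within-subject factors and their interactions.

```python
# Minimal one-way repeated-measures ANOVA. data[s][c] is the score of
# subject s in condition c. Returns the F ratio for the condition effect,
# with subject variability removed from the error term.
def rm_anova_f(data):
    n = len(data)         # number of subjects
    k = len(data[0])      # number of conditions
    grand = sum(sum(row) for row in data) / (n * k)
    cond_means = [sum(row[c] for row in data) / n for c in range(k)]
    subj_means = [sum(row) / k for row in data]
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_cond - ss_subj   # residual after removing subjects
    df_cond, df_error = k - 1, (k - 1) * (n - 1)
    return (ss_cond / df_cond) / (ss_error / df_error)
```

In practice one would use a statistics package (e.g., `AnovaRM` in statsmodels) rather than hand-rolled sums, but the sketch shows where the error term comes from.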


Participants made very few selection errors, which occurred on only 5% of trials (SD = .005). There was no reliable difference in error rate for menu layout style (radial 5.3% vs. linear 4.7%) or for output modality (audio 4.8%, visual 5.4%, audio-visual 4.9%). Nor was there any evidence of a change in error rate over consecutive blocks of trials. Statistical analysis showed all effects to be non-significant (all Fs < 1.03). We next consider the response time data.

multimodal mobile fig.6

Figure 6. Visual stimulus at top of the screen

Response Time

For the response time data, trials in which an incorrect item was selected on the first selection were removed; thus, we only consider response times for correct selections. Figure 7 shows the mean response time for each experimental condition. Participants were significantly slower at selecting target items when the menu used a linear layout (M = 2.33 s) rather than a radial layout (M = 1.58 s), F(1, 11) = 3.06, p < .001. Participants were also significantly slower when they received only audio feedback (M = 2.12 s) than when they received visual feedback, either in the visual-only condition (M = 1.7 s) or the audio-visual condition (M = 1.83 s), F(2, 22) = 8.69, p < .001. As can be seen in the figure, the difference in response time between the radial and linear layouts increased when audio feedback was used. Indeed, statistical analysis showed a significant two-way interaction between output modality and layout style, F(2, 22) = 8.69, p < .01. This suggests that a radial layout carries performance benefits mainly when the participant has to rely on audio feedback alone.

multimodal mobile fig.7

Figure 7. Response time for all 6 interfaces, sorted by modalities 

Observations & Subjective Preference

Feedback from the post-experiment interviews indicated that the visual radial and audio-visual radial interfaces were the most promising, while the audio linear interface had the lowest user satisfaction score. Almost all of the participants (10/12) reported that they preferred visual feedback to audio feedback. There were, however, two participants who said that they preferred audio feedback to visual feedback. It is interesting to note that, in terms of performance metrics, these participants were nonetheless faster at completing the selection task when they received visual feedback. This suggests that these two participants' preference for audio feedback might stem from the novelty of the interaction technique.


The aim of Experiment 1 was to evaluate different points in the design space for devices that support menu selection. The results show that the visual feedback modality affords faster selections, presumably because audio output takes time to listen to, whereas a visual display can be searched quickly for the to-be-selected item. In terms of layout, the radial style confers some benefit over the traditional linear style, but these benefits arise mainly with audio feedback, presumably because participants learn the fixed spatial positions of items in the radial layout. These results are consistent with previous work [Yin and Zhai, 2006; Callahan et al., 1988], and suggest that visual feedback is optimal for supporting menu selection in a single-task desktop environment. We next consider how the different design alternatives fare in other contexts of use. In particular, we consider how each of the above design alternatives fares when the user is engaged in an ongoing safety-critical primary task, as would be the case if the driver of a car were to use a menu system on a secondary or in-car device.


Although the desktop evaluation results strongly favored visual or dual interfaces, the desktop condition is not the only place where menu selection occurs. Today’s mobile devices are often used while a person is performing another task. In particular, the driver of a car typically drives the vehicle while performing other tasks or dealing with various distractions, and this dual-tasking environment has attracted considerable research attention.

According to a recent survey of American drivers [GMAC2006], menus on mobile devices (specifically the iPod) are commonly used while driving, especially by young drivers aged 18-24. Although dangerous to use as a secondary task while driving, and although their use is widely prohibited at the time of writing this paper (legislation has been introduced in many countries, including Australia, France, Germany, Japan, Russia, Singapore, and the UK), people continue to use their cell phones while driving. For instance, compliance with the UK ban has slipped from 90% at its introduction in 2003 to around 75% in 2007. As of this writing, some 10 million UK motorists admit to using a phone while driving, even though this activity is against the law [Careless talk].

Previous work on driver distraction resulting from cell phone use shows that it competes for limited visual attention resources, thus harming performance [e.g., Alm and Nilsson, 1994; McKnight and McKnight, 1993]. Other research suggests that cognitive load alone, separate from perceptual/motor load, is sufficient to produce distraction effects. For instance, Strayer and colleagues in a series of studies [Strayer and Johnson, 2001; Strayer, et al., 2003] indicated that the cognitive act of generating a word is sufficient to cause noticeable distraction effects. It is unclear then whether designing a mobile device so that it does not place additional demands on visual attention resources would mitigate the harmful effects of distraction. The increased cognitive load of interacting, even with an eyes-free device such as the earPod, may be sufficient to result in adverse effects for driving performance.

Given that it is difficult to stop people from engaging in secondary tasks while driving, there may be substantial value in directing efforts toward designing mobile devices so that their use by the driver of a car is less dangerous. That is, we advocate a user-centered design approach that is sensitive to the environmental constraints imposed by using a mobile device in the context of an ongoing dynamic task.

Ho et al. [Ho et al., 2007] found that multisensory (audio + vibrotactile) feedback reduced drivers' brake response times. They suggest that the simultaneous use of audio and tactile warning signals can help enhance the braking response. In contrast, we suggest using auditory feedback while driving but visual feedback in single-task scenarios; results from Experiment 2 show no additional benefit of combining visual and audio feedback for menu selection in the driving scenario.

Experiment 1 provided empirical results for various menu interaction techniques in a desktop setting. It strongly suggested using the visual modality as the primary means of feedback in such an environment. However, it is an open empirical question whether visual interfaces also offer performance gains when the user is concurrently engaged in an ongoing dynamic task, such as driving a car. In particular, because visual interfaces demand visual attention, we might expect them to lead to greater driver distraction than audio interfaces. In the next section of this paper, we describe an experiment designed to address this question.

multimodal mobile fig.8



Another twelve participants (1 female), ranging in age from 20 to 35 years (mean 27) and recruited from the university community, volunteered for the experiment.


The driving experiment was conducted using a desktop driving simulator. The simulation environment, coded in Java with OpenGL graphics, incorporates a three-lane highway with the driver's vehicle in the center lane, as shown in Figure 8. The highway includes alternating straight segments and curved segments with varying curvatures, all of which can be driven at normal highway speeds. A second, automated vehicle, visible in the rear-view mirror, follows behind the driver's car at a distance of roughly 50 feet (15 m), encouraging the driver to maintain an adequate speed and an adequate distance from the car behind. Construction cones are placed on each side of the driver's lane to motivate lane keeping that is as accurate as possible. Previous versions of a very similar environment have been used successfully to study various aspects of driver behavior [e.g., Salvucci, 2001; Salvucci, 2005; Salvucci, et al., 2007].

The hardware setup comprised a desktop computer and a Logitech MOMO® steering wheel with force feedback. The simulation ran on an Apple desktop computer with an Intel Xeon CPU running at 2.00 GHz, 2 GB of RAM, and an NVIDIA GeForce 7300 GT graphics card. The environment was displayed on a 30" (69 cm) monitor at a distance of roughly 33" (85 cm) from the driver. The earPod was held in the participant's dominant hand. For added realism, a soundtrack of real driving noise was played on a continuous loop during the driving portions of the study. A rough setup of the experiment is shown in Figure 9.

multimodal mobile fig.9

Figure 9. Setup of the experiment (not drawn to scale)

The experiment was divided into two parts. The first part (the desktop condition) replicated Experiment 1, except that it was much shorter. This part allowed users to become familiar with the menu and the interaction techniques before moving to the second, and arguably more difficult, part: driving and performing menu selection at the same time (the driving condition). The experiment was designed to simulate a realistic usage scenario.

For the desktop condition, the setup was exactly the same as in Experiment 1, with the following difference: in Experiment 2, both audio and visual stimuli were provided simultaneously, allowing users to pick their preferred stimulus for different interfaces. Because driving is very different from desktop interaction, visual stimuli could be a possible source of distraction, complicating the interpretation of results. However, using only audio stimuli would not permit analysis of modality effects. To address these issues, we decided to provide both types of stimuli and allow users to pick the one they would attend to during the experiment. This allowed us to find out which stimuli they actually used. To allow better comparison between the two settings, we also used dual stimuli in the desktop condition.

The dual-task simulated driving condition is the focus of this experiment, but the desktop condition is also essential, since it provides the necessary training for users to become familiar with the techniques. This closely simulates real-world scenarios, where users typically already have some experience with their devices before using them inside vehicles.


A within-participants design was used, summarized as follows. Desktop condition: 12 participants × 6 techniques × 8 target items × 5 blocks (4 experimental blocks + 1 practice block) = 2880 menu selections. Driving condition: 12 participants × 6 techniques × 8 consecutive selections × 2 blocks (1 experimental block + 1 practice block) = 1152 menu selections. This gives 4032 menu selections in total (2880 + 1152).
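The trial counts above can be verified with simple arithmetic:

```python
# Sanity check on the trial counts reported for Experiment 2.
participants, techniques, items = 12, 6, 8

desktop = participants * techniques * items * 5  # 4 blocks + 1 practice block
driving = participants * techniques * items * 2  # 1 block + 1 practice block

assert desktop == 2880
assert driving == 1152
assert desktop + driving == 4032
```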


During the desktop condition, the instructions remained the same as in Experiment 1: participants were asked to complete the menu selections as quickly and as accurately as possible. During the driving condition, for each trial, participants were asked to complete the menu selection task as quickly and as accurately as possible while following an automated lead car traveling at a constant speed of 65 miles/hour (~105 km/h) and maintaining a reasonable, realistic following distance.


After completing the desktop trials, participants were asked to answer a set of questions regarding their experience in the desktop conditions. They then moved to the driving simulator and completed the menu selection tasks while driving. At least 10 seconds elapsed between the end of one menu-selection trial and the start of the next; this time allowed participants to perform any necessary corrective steering after each trial and to re-center the vehicle to a normal driving state. (Note that this constraint reduced the number of trials possible in the driving context, but was absolutely necessary to maintain the integrity of the driver performance data.) Participants were allowed to take breaks between trials, and breaks were enforced after a maximum of 15 minutes of driving to avoid fatigue. Before each of the desktop and driving conditions, participants received 8 practice trials (1 block) with the particular interface. Each participant performed the entire experiment in one sitting, which took approximately 90 minutes (the desktop condition typically finished within 20 minutes and the driving condition typically lasted 40 minutes, with the extra time used for questionnaires and breaks). After completing the driving session, the same set of questions as in the desktop condition was asked again regarding the user experience during the trials.


Both the accuracy and response time results for the desktop setting were consistent with those from Experiment 1. The actual numbers varied, but the pattern of significant effects was unchanged.

Observations & Subjective Preference

Although this experiment differed little from Experiment 1, a set of additional questions in the post-experiment questionnaire allowed us to gain more insight into users' experience. Since both visual and audio stimuli were used in this experiment, users were asked, “Which stimuli did you attend to during the experiment?” The answers were consistently “visual” (11/12 subjects), with only 1 subject saying both. For the question, “Which feedback modality did you use under the dual modality conditions?”, the answers were again consistently “visual” or “primarily visual”. For the question, “If you only used one type of feedback or primarily used only one type of feedback, did you find the other kind of feedback (audio or visual) distracting?”, 3/12 users answered, “Yes, I found the audio feedback a bit distracting”, while most subjects (9/12) said “No”. Based on this feedback, it is clear that the visual modality is preferred in the single-task desktop environment.


There were no significant differences in accuracy for modality (audio 88.9%, visual 88.3%, dual 88.8%) or menu style (radial 88.3%, linear 89.0%). This is consistent with the findings from the desktop setting.

Selection Time

Tests in the driving setting showed some unexpected findings. There was no significant main effect of modality on response time while driving. This is somewhat surprising, since response times for audio were significantly slower than for visual in the desktop conditions. However, there was still a significant main effect of menu style, F(1, 11) = 32.86, p < .001: radial (3.34 s) was significantly faster than linear (4.12 s), which is also consistent with the findings from the desktop conditions. The average selection times for the six interfaces were: audio radial (3.27 s), audio linear (4.09 s), audio-visual radial (3.53 s), audio-visual linear (4.15 s), visual radial (3.27 s), and visual linear (4.10 s).

multimodal mobile fig.10

Figure 10. Lateral velocity and following distance by modality.  

Lateral Velocity

In testing interaction in the driving context, arguably the most important consideration is the effect on driver performance. One common way to measure performance is to analyze the vehicle's lateral (side-to-side) velocity as an indicator of vehicle stability. We computed the average lateral velocity over a time window that included both the interaction with the device and a period of 5 seconds after the completion of the interaction; this latter period accounts for the vehicle "correction" that typically takes place after a distraction, during which the driver corrects the lateral position of the vehicle, and which is best attributed to the immediately preceding interaction trial.
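The windowed measure might be computed along the lines of the sketch below; the sample format and function name are our assumptions, not the simulator's actual API.

```python
# Hypothetical sketch: mean absolute lateral velocity over a window that
# spans the interaction plus a 5 s post-trial correction period.
# samples: chronologically ordered (time_s, lateral_position_m) pairs.
def mean_lateral_velocity(samples, trial_start, trial_end, correction_s=5.0):
    window_end = trial_end + correction_s
    speeds = []
    for (t0, x0), (t1, x1) in zip(samples, samples[1:]):
        if t0 >= trial_start and t1 <= window_end:
            # finite-difference estimate of lateral speed for this interval
            speeds.append(abs(x1 - x0) / (t1 - t0))
    return sum(speeds) / len(speeds)
```

The same window can be reused unchanged for other per-trial driving measures, simply by averaging a different logged quantity over it.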

For our experiment, we found a significant effect of modality, F(2, 22) = 6.99, p < .01, but no significant effect of menu style, F(1, 11) = 1.01, n.s., and no significant interaction between modality and menu style, F(2, 22) = 0.03, n.s. The effect of modality is shown in Figure 10. Pairwise comparisons showed no significant difference between the audio and dual modalities, but both differed significantly from the visual modality (p < .05). The lower lateral velocity (i.e., higher stability) for the audio condition relative to the visual condition indicates, not surprisingly, that the visual attention needed in the visual condition causes additional distraction and reduces performance. Interestingly, the dual condition produced essentially the same reduced distraction as the audio condition, suggesting that drivers relied on the audio portion of the dual interaction while driving (which is supported by the drivers' post-experiment reports, as discussed below).

Following Distance

Lateral velocity measures the effect of distraction on driver performance. Another measure of distraction is the following distance to the lead car: in essence, as drivers feel themselves becoming distracted, they tend to back away from the car in front of them for safety reasons. We computed the average following distance using the same time window around each trial as used for the analysis of lateral velocity.

Overall, as with lateral velocity, there was a significant effect of modality on following distance, F(2, 22) = 6.66, p < .01, but no significant main effect of menu style, F(1, 11) = 0.07, n.s., nor a significant interaction between modality and menu style, F(2, 22) = 1.44, n.s. The average following distances by modality are shown in Figure 10. The similarity of this graph to the graph for lateral velocity strongly suggests that drivers have a sense of the distraction potential of the three modalities: increased distraction, as measured by larger lateral velocities, led to increased following distances. Thus, drivers responded to the increased distraction by backing off from the lead car, giving themselves, in essence, more room for error.

Observations & Subjective Preference

For the same set of questions asked after the single-task conditions, preferences changed completely for driving. For the question, “Which stimuli did you attend to during the experiment?”, the answers were consistently “audio” (10/12 subjects), with only 2 subjects saying both. For the question, “Which feedback modality did you use under the dual modality conditions?”, the answers were again consistently “audio”. All participants felt that audio was much safer to use than visual while driving. For the question, “If you only used one type of feedback or primarily used only one type of feedback, did you find the other kind of feedback (audio or visual) distracting?”, most users (9/12) reported that they totally ignored the visual feedback, effectively turning the dual modality interface into an audio-only interface. However, users who occasionally glanced at the visual interface found the visual feedback not just distracting but dangerous; this point is revisited below in the discussion section.

Desktop vs. Driving

Further interesting observations come from comparing the desktop conditions with the driving conditions. A new variable, experiment type, was introduced into the analysis, with two possible values: desktop or driving.


There was a significant main effect of experiment type, F(1, 11) = 27.26, p < .001. Mean accuracy in the desktop conditions (94.4%) was significantly higher than in the driving conditions (88.6%). This is not surprising, since in the driving conditions the user had to perform a more difficult task (2 selections in a row) while also dealing with the secondary driving task.

multimodal mobile fig.11

Figure 11. Experiment type x modality interaction  

Response Time

There was a significant main effect of experiment type on response time, F(1, 11) = 133.74, p < .001. The mean selection time in the driving conditions (3.73 s) was significantly slower than in the desktop conditions (2.63 s). This delay is likely due to the secondary driving task.

There was a significant experiment type × modality interaction, F(2, 22) = 12.13, p < .001. While response times were significantly slower in the audio conditions in the desktop setting, audio was no slower than the other conditions while driving (Figure 11). This finding, together with the empirical data on lateral velocity and following distance, strongly suggests that the audio modality may be useful while driving, since it may increase safety without harming performance when interacting with a device.

There was also a significant experiment type × menu style interaction, F(1, 11) = 32.98, p < .001. Closer examination indicates that the radial menu style has a larger response time advantage over the linear menu style in the desktop setting (Figure 12).


Audio vs. visual vs. dual

The most dramatic differences were found for the modality of feedback. Both the quantitative results and the subjective feedback differed markedly between the desktop and driving settings.

multimodal mobile fig.12

Figure 12. Experiment type x menu style interaction   

Desktop setting  

Visual feedback was generally preferred by users, although 2/12 participants told us they preferred audio even in the desktop setting. Even for them, however, performance with the visual and dual interfaces was much faster than with the audio interfaces; thus, visual feedback is advantageous in this setting. Users' reported experience with the dual modality interfaces was interesting. These interfaces received the highest overall ranking, and were ranked either as the favorites or immediately next to the favorite interfaces. However, for users who strictly preferred one kind of feedback (perhaps because they are either audio learners or visual learners), reactions toward the other modality differed. Users who strictly preferred visual interfaces often found the audio slightly annoying, whereas users who preferred audio feedback were not bothered by the presence of the visual interface and tended to rank the dual interfaces as equally preferable to the audio-only interfaces. This is perhaps because people can close their eyes, or simply not look at the screen, if they are tired of looking [Gaver, 1997].

Overall, visual and dual interfaces were the favorites in the desktop setting. Perhaps the best strategy for the desktop setting is a dual interface with the ability to turn off the audio or visual feedback when needed.

Driving setting

The change in user reaction between the desktop and driving settings was dramatic. While preferences for modality still varied slightly in the desktop conditions, audio was consistently judged to be much better than visual for driving. This was true even for users who had strongly preferred visual feedback in the desktop setting. One such user said after completing the driving conditions: “Although I prefer visual feedback for desktop, I found it completely useless while driving, where audio is much better.” With the dual interfaces in the desktop setting, users tended to use both modalities of feedback while performing trials. While driving, most users (9/12) completely ignored the visual feedback. Even users who occasionally glanced at the screen for extra information felt negative about it. As one subject put it, “having the option to look at the visual information while driving is potentially a safety hazard. I found myself tending to look at it while I was having difficulty finding the desirable item through audio, but it felt very dangerous, and I prefer an audio-only interface since it doesn’t allow me to look at all.”

Linear vs. radial

Compared to modality, menu style had a less dramatic effect, but still generated some interesting findings. Under both desktop and driving conditions, the radial menu style yielded better performance than the linear style and was preferred by users. Moreover, compared to the desktop setting, the radial menu style did relatively better in terms of speed in the driving conditions, as described earlier by the experiment type × menu style interaction (Figure 12). This indicates that in a more difficult or complex environment, there is actually more incentive to switch to the radial menu style if possible.

Design for multiple scenarios

As devices become more powerful in terms of the number of features, and more portable in terms of size and form factor, they are increasingly likely to be used in a variety of scenarios. This poses serious challenges for interface designers, since different scenarios often have very different requirements and constraints, as demonstrated by our study. The best solution for one scenario may be the worst for another, and designing an overall winner across conditions presents significant challenges to HCI researchers and interface designers. Our exploration here represents one step toward further understanding user interaction with multiple modalities across multitasking environments.

Implications for design in individual scenarios

Based on the results of the experiments reported in this paper, we offer the following recommendations for the design of menu selection interfaces for use in single-task (i.e., desktop-like) conditions or while the user is engaged in an ongoing dynamic task (such as driving a car).

For desktop settings, the radial and linear menu styles each have their advantages: the radial style is quicker to access, while the linear style is more flexible and makes it easier to design the structure and content of the menu. If the designer has a reasonably sized static menu, a visual or dual radial layout would likely yield better performance. Otherwise, visual or dual linear menus are also quite usable and are perhaps more suitable for menus that are longer or that have dynamic content.

For driving conditions, we recommend audio radial menus for reasonably sized static menus. If the menu is longer or contains dynamic content, audio linear is probably more suitable. We do not recommend visual interfaces at all; even the dual interfaces should be excluded if possible to avoid potential danger. In addition, although audio interfaces are safer under driving conditions, they still impose a cognitive load that can affect the user's driving performance.

Implications for design in the integrated scenario and shared input multimodal mobile interfaces

Multitasking in mobile scenarios has been heavily investigated in recent years [Pascoe, et al., 2000], and significant effort has been devoted to the design of eyes-free interfaces and interaction techniques for mobile devices. However, as Pascoe and colleagues [Pascoe, et al., 2000] have pointed out, while mobile devices can be accessed on the move, they are frequently used in stationary scenarios where users have the majority of their visual attention available for mobile HCI tasks. In such cases, the visual interface will often have an advantage. While earlier research tended to focus on either the general use or the eyes-free use of mobile devices (Figure 13, usage scenario 2), we believe that both stationary and eyes-free usage scenarios are equally important for mobile interface design and need to be considered as an integrated whole rather than as two separate scenarios.

multimodal mobile fig.13

Figure 13. Three approaches for designing mobile interfaces and interaction techniques. The first two represent more traditional approaches while the third one is our proposed approach. 

In this integrated scenario, users will use their mobile devices either while stationary or on the move, and will often need to switch between these scenarios according to context. Designing for this integrated scenario requires designers to consider interface solutions that are suitable for both scenarios and that support easy switching between them, seeking a balanced design.

multimodal mobile fig.14

Figure 14. Two interfaces with different input-output modalities (left) vs. the shared input multimodal interface

However, designing this type of integrated interface can be difficult, since the design requirements for stationary use are quite different from those for on-the-move use. As demonstrated in our experiments, an eyes-free auditory interface is more suitable in the driving scenario, while a visual interface is optimal in the desktop usage scenario. A possible and seemingly promising solution for this type of integrated mobile scenario is the shared input multimodal mobile interface (Figure 14, right). Since shared input multimodal interfaces share the same input mechanism, they require less additional effort to learn. Furthermore, because the input mechanism is shared between the two interfaces, the motor skill required to operate both is the same: using either interface also trains the use of the other, which could potentially reduce the learning time needed for users to achieve expert performance with both. The concept of combining common input with multiple output modalities is not new, as seen in Brewster and Crease's audio-enhanced widgets [Brewster and Crease, 1999], which also used multiple output modalities with the same input mechanism. In their work, however, the multiple output modalities complement each other for the same task in the same scenario. As stated by Brewster and Crease, their “aim was to enhance standard graphical menus with more salient feedback to see if menu errors could be solved and also to see if sound was effective as the feedback”. Their approach is not designed for the diverse usage scenarios (which often include both stationary and on-the-move use [Pascoe, et al., 2000]) encountered by users of mobile devices today. In our approach, the multiple modalities work independently, and the audio modality has a role equivalent to that of the visual modality; the approach is designed for both stationary and on-the-move usage of mobile devices.

The type of multimodal system we propose is one in which the audio and visual feedback channels are independent and used non-concurrently. This is an exclusive system according to Nigay and Coutaz's taxonomy [Nigay and Coutaz, 1993], and an equivalent-output multimodal system according to Coutaz et al. [1995]. We envision that this type of interface is particularly useful for mobile devices because of their diverse usage scenarios, and we therefore call our proposed approach "shared input multimodal mobile interfaces".

This type of interface can be accessed independently through either modality alone (without excluding the possibility of using both at once); this differs from a multimodal interface in which feedback from several modalities complements, but cannot replace, the others. Our results showed that performance and user preference can change dramatically from one scenario to another; coercing users to adopt a single modality is therefore, in our opinion, undesirable. Moreover, we advocate that the method for completing a task on the device should be invariant to the interaction technique used. For instance, in the case of earPod, the visual and auditory feedback methods operate in a consistent manner, so the user need learn only a single mental model of how to select an item from the menu. Offering the same interaction style under different modalities also enables dual-modality interfaces that are not possible when the interaction style differs across modalities (just as speech dialing and keypad dialing with a visual display of phone numbers may interfere with each other if used simultaneously). At the time of this writing, very few interfaces support such a shared input multimodal style of interaction, and further exploration and evaluation of such interfaces in other application domains could provide a rich and fruitful avenue of research. We believe shared input multimodal systems deserve more attention from both mobile researchers and designers.
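The invariance argued for above can be made concrete in code. The following is a minimal sketch (hypothetical class and method names, not an implementation of earPod): one shared input handler maps the same touch gesture to the same menu item, while the active output modality is an interchangeable backend, so either channel can drive the interface alone.

```python
from abc import ABC, abstractmethod

class OutputModality(ABC):
    """One interchangeable feedback channel; input handling is shared."""
    @abstractmethod
    def feedback(self, item: str) -> str: ...

class AudioOutput(OutputModality):
    def feedback(self, item):
        return f"speak:{item}"       # stand-in for text-to-speech output

class VisualOutput(OutputModality):
    def feedback(self, item):
        return f"highlight:{item}"   # stand-in for an on-screen highlight

class SharedInputMenu:
    """Radial menu: the same touch angle selects the same item,
    whichever output modality is currently active."""
    def __init__(self, items, output: OutputModality):
        self.items = items
        self.output = output

    def touch(self, angle_deg: float) -> str:
        sector = 360 / len(self.items)          # equal angular sectors
        index = int(angle_deg % 360 // sector)  # shared input mapping
        return self.output.feedback(self.items[index])

items = ["Contacts", "Music", "Maps", "Settings"]
audio_menu = SharedInputMenu(items, AudioOutput())
visual_menu = SharedInputMenu(items, VisualOutput())
# The same gesture maps to the same item under either modality:
assert audio_menu.touch(100) == "speak:Music"
assert visual_menu.touch(100) == "highlight:Music"
```

Because the selection logic lives entirely in the shared `touch` method, practicing with one modality rehearses exactly the motor mapping used by the other, which is the transfer-of-skill property discussed above.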


To keep the experimental procedure within a reasonable time limit, we did not examine possible learning behavior over a prolonged period of usage; the demands of the driving task did not allow time to include this interesting factor. It is therefore difficult to rule out the possibility that some users could eventually learn to drive more "safely" even in the visual condition. Indeed, it is well known that performance improves following a power law of practice [Newell and Rosenbloom, 1981], and further research is required to investigate asymptotic performance. However, since driving is a high-risk task, the potential cost associated with in-car training is extremely high: even if an interface can eventually be operated safely in a car, any mistakes made during practice could be catastrophic. The current experimental setting may therefore have practical value in guiding safe vehicle interface design.
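The power law of practice cited above predicts that task time falls as a power function of the number of practice trials, commonly written T(n) = T1 * n^(-alpha). The following sketch uses illustrative coefficients (not values fitted to our data) to show why early practice yields large gains while later gains are marginal:

```python
def power_law_time(trial: int, t1: float = 4.0, alpha: float = 0.4) -> float:
    """Predicted task time (s) on the nth practice trial: T(n) = T1 * n**(-alpha).
    t1 and alpha are illustrative values, not estimates from our experiment."""
    return t1 * trial ** (-alpha)

times = [round(power_law_time(n), 2) for n in (1, 10, 100)]
# Times shrink steeply at first, then flatten toward an asymptote:
assert times[0] > times[1] > times[2]
print(times)  # e.g. [4.0, 1.59, 0.63]
```

The flattening curve is precisely why asymptotic (expert) performance cannot be observed in a short session, motivating the call for longer-term studies.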


In conclusion, we investigated the effects of feedback modality (audio, visual, and audio-visual) and menu layout (linear or radial) on user performance and preference in menu selection tasks, in both a single-task desktop setting and a dual-task driving setting. The results indicated that the operational environment strongly affects menu selection performance under the different feedback modalities and menu layouts. Visual feedback produced better performance and was preferred under single-task conditions, but in dual-task conditions it presented a significant source of driver distraction. In contrast, auditory feedback mitigated some of the risk associated with menu selection while driving.

Although driving is an important mobile scenario, there are other common usage scenarios that we have not evaluated here. While not formally tested, our results are likely to apply to walking and running scenarios, since in both cases users need to attend to the path and the surrounding environment. We expect audio menus to benefit users more when running than when walking, because visually checking the status of the mobile device becomes increasingly inconvenient. It would be interesting to verify this hypothesis in future studies.


  1. AKAMATSU, M., MACKENZIE, I. S., AND HASBROUCQ, T. 1995. A comparison of tactile, auditory, and visual feedback in a pointing task using a mouse-type device. Ergonomics, 38(4), 816-827.
  2. ALM, H., AND NILSSON, L. 1994. Changes in driver behaviour as a function of hands-free mobile phones–A simulator study. Accident Analysis & Prevention, 441-451.
  3. ARONS, B. 1997. SpeechSkimmer: a system for interactively skimming recorded speech. ACM Transactions on Computer-Human Interaction, 4(1), 3-38.
  4. BACH, K. M., JÆGER, M. G., SKOV, M. B. AND THOMASSEN, N. G. 2008. You can touch, but you can’t look: interacting with in-vehicle systems. Proc. CHI 2008, ACM Press, 1139-1148.
  5. BAILLY, G., LECOLINET, E., AND NIGAY, L. 2007. Wave menus: Improving the novice mode of marking menus. Proc. INTERACT2007, 475-488.
  6. BAILLY, G., LECOLINET, E., AND NIGAY, L. 2008. Flower menus: A new type of marking menus with large menu breadth, within groups and efficient expert mode memorization. Proceedings of advanced visual interfaces (AVI), 15-22.
  7. BERNSTEIN, L.R. 1997. Detection and discrimination of interaural disparities: Modern earphone-based studies, in Binaural and Spatial Hearing in Real and Virtual Environments, R.H. Gilkey and T.R.A. Eds., Editors. Lawrence Erlbaum Associates: NJ, USA.  117-138.
  8. BLATTNER, M., SUMIKAWA, D., AND GREENBERG, R. 1989. Earcons and icons: Their structure and common design principles. Human Computer Interaction. 4(1), 11-44.
  9. BREWSTER, S.A., LUMSDEN, J., BELL, M., HALL, M., AND TASKER, S. 2003. Multimodal "eyes-free" interaction techniques for wearable devices. Proc. CHI 2003, ACM Press, 473-480.
  10. BREWSTER, S.A. 1998. Using nonspeech sounds to provide navigation cues. ACM Transactions on Computer-Human Interaction (TOCHI), 5(3), 224-259.
  11. BREWSTER, S.A. AND CREASE, M.G. 1999. Correcting menu usability problems with sound. Behaviour and Information Technology, 18(3), 165-177.
  12. BREWSTER, S.A., AND CRYER, P.G. 1999. Maximising ScreenSpace on Mobile Computing Devices. ACM CHI Extended Abstracts on Human Factors in Computing Systems. ACM Press, 224-225.
  14. BREWSTER, S.A., WRIGHT, P.C., AND EDWARDS, A.D.N. 1993. An evaluation of earcons for use in auditory human-computer interfaces. Proc. CHI1993, ACM Press, 222-227.
  15. BRUMBY, D.P., SALVUCCI, D.D. AND HOWES, A. 2009. Focus on Driving: How Cognitive Constraints Shape the Adaptation of Strategy when Dialing while Driving. Proc. CHI 2009, ACM Press, 1629-1638.
  16. CALLAHAN, J., HOPKINS, D., WEISER, M., SHNEIDERMAN, B. 1988. An empirical comparison of pie vs. linear menus. Proc. CHI 1988, ACM Press, 95-100.
  17. COUTAZ, J., NIGAY, L., SALBER, D., BLANDFORD, A., MAY, J., AND YOUNG, R.M. 1995. Four easy pieces for assessing the usability of multimodal interaction: the CARE properties. Proc. INTERACT 1995, 115-120.
  18. DOEL, K.V.D. AND PAI, D.K. 2001. A Java Audio Synthesis System for Programmers. International Conference on Auditory Display. 150-154.
  19. FOLEY, J.D., WALLACE, V.L., AND CHAN, P. 1984. The human factors of computer graphics interaction techniques. IEEE Computer Graphics Application, 4(11), 13-48.
  20. GAVER, W. 1989. The Sonic Finder: An interface that uses auditory icons. Human Computer Interaction, 4(1), 67-94
  21. GAVER, W.W., AND SMITH, R.B. 1991. Auditory icons in large-scale collaborative environments. SIGCHI Bulletin 1991, 23(1), 96.
  22. GAVER, W. 1997. Auditory Interfaces. Handbook of Human-Computer Interaction, 1003-1041.
  23. GRAF, S., SPIESSL, W., SCHMIDT, A., WINTER, A. AND RIGOLL, G. 2008. In-car interaction using search-based user interfaces. Proc.CHI 2008, ACM Press, 1685-1688.
  24. HO, C., REED, R. AND SPENCE, C. 2007. Multisensory In-Car Warning Signals for Collision Avoidance. Human Factors: The Journal of the Human Factors and Ergonomics Society, 49(6), 1107-1114
  25. JACKO, J., SCOTT, I., SAINFORT, F., BARNARD, L., EDWARDS, P., EMERY, V., KONGNAKORN, T., MOLONEY, K., AND ZORICH, B. 2003. Older adults and visual impairment: what do exposure times and accuracy tell us about performance gains associated with multimodal feedback? Proc. CHI 2003, 33-40.
  26. KRISTENSSON, P., AND ZHAI, S. 2004. SHARK2: A large vocabulary shorthand writing system for pen-based computers. Proc. UIST 2004, ACM Press, 43-52.
  27. KURTENBACH, G. 1993. The design and evaluation of marking menus. Ph.D. Thesis, University of Toronto.
  28. KURTENBACH, G., AND BUXTON, W.S. 1993. The limits of expert performance using hierarchic marking menus. Proc. CHI 1993, ACM Press, 482-487.
  29. KURTENBACH, G., AND BUXTON, W.S. 1994. User learning and performance with marking menus. Proc.CHI 1994, ACM Press, 258-264.
  30. LEE, J. D., CAVEN, B., HAAKE, S. AND BROWN, T. L. 2001. Speech-based interaction with In-vehicle Computers: The Effect of Speech-Based E-Mail on Drivers’ Attention to the Roadway. Human Factors: The Journal of the Human Factors and Ergonomics Society, 43(4), 631-640.
  31. LEE, J., FORLIZZI, J. AND HUDSON, S.E. 2005. Studying the effectiveness of MOVE: a contextually optimized in-vehicle navigation system. Proc. CHI 2005, ACM Press, 571-580.
  32. LIAO, C.,GUIMBRETIERE, F., AND LOECKENHOFF, C.E. 2006. Pen-top feedback for paper-based interfaces, Proc. UIST2006. ACM Press, 201-210.
  33. LUK, J., PASQUERO, J., LITTLE, S., MACLEAN, K., LEVESQUE, V., AND HAYWARD, V. 2006. A role for haptics in mobile interaction: initial design using a handheld tactile display prototype. Proc. CHI2006, ACM Press, 171-180.
  34. MARICS, M.A., AND ENGELBECK, G. 1997. Designing voice menu applications for telephones. Handbook of Human-Computer Interaction, M.G. Helander, T.K. Landauer, and P.V. Prabhu, Editors. Elsevier: Amsterdam.  1085-1102.
  35. MCKNIGHT, A. J., AND MCKNIGHT, A. S. 1993. The effect of cellular phone use upon driver attention.  Accident Analysis & Prevention, 25, 259-265.
  36. MILLER, D.P. 1981. The depth/breadth tradeoff in hierarchical computer menus. Human Factors Society. 296-300.
  37. KOBAYASHI, M., AND SCHMANDT, C. 1997. Dynamic Soundscape: mapping time to space for audio browsing. Proc. CHI 1997, ACM Press, 194-201.
  38. MYNATT, E.D. 1995. Transforming graphical interfaces into auditory interfaces. Proc. CHI1995, ACM Press. 67-68.
  39. NEWELL, A., AND ROSENBLOOM, P.S. 1981. Mechanisms of skill acquisition and the law of practice. J. R. Anderson (Ed.), Cognitive skills and their acquisition. 1-51. Hillsdale, NJ: Lawrence Erlbaum Associates.
  40. NIGAY, L., AND COUTAZ, J. 1993. A design space for multimodal systems: concurrent processing and data fusion. Proc. INTERCHI 1993, ACM Press, 172-178.
  41. NORMAN, K.L. 1991. The Psychology of Menu Selection: Designing Cognitive Control at the Human/Computer Interface 1991: Ablex Publishing Corporation.
  42. PAAP, K.R., AND COOKE, N.J. 1997. Designing menus, in Handbook of Human-Computer Interaction, M.G. Helander, T.K. Landauer, and P.V. Prabhu, Editors. Elsevier: Amsterdam. 533-572.
  43. PASCOE, J., RYAN, N., AND MORSE, D. 2000. Using while moving: HCI issues in fieldwork environments. ACM Transaction on Computer-Human Interaction, 7(3), 417-437.
  44. PIRHONEN, A., BREWSTER, S. AND HOLGUIN, C. 2002. Gestural and audio metaphors as a means of control for mobile devices. Proc. CHI2002, ACM Press. 291-298.
  45. RESNICK, P., AND VIRZI, R.A. 1992. Skip and scan: cleaning up telephone interface. Proc. CHI1992, ACM Press, 419-426.
  46. RINOTT, M. 2005. Sonic Texting. ACM CHI Extended Abstracts on Human Factors in Computing Systems, ACM Press, 1144-1145.
  47. ROBERTS, T.L., AND ENGELBECK, G. 1989. The effects of device technology on the usability of advanced telephone functions. Proc. CHI 1989, ACM Press, 331-337.
  48. ROY, D.K. AND SCHMANDT, C. 1996. NewsComm: a hand-held interface for interactive access to structured audio. Proc. CHI1996, ACM Press. 173-180.
  49. SALMEN, A., GROßMANN, P., HITZENBERGER, L., AND CREUTZBURG, U. 1999. Dialog systems in traffic environment. Proc. ESCA Tutorial and Research Workshop on Interactive Dialogue in Multi-Modal Systems.
  50. SALVUCCI, D.D. 2001. Predicting the effects of in-car interface use on driver performance: An integrated model approach. International Journal of Human-Computer Studies, 55, 85-107.
  51. SALVUCCI, D.D. 2005. A multitasking general executive for compound continuous tasks. Cognitive Science, 29, 457-492.
  52. SALVUCCI, D.D., AND MACUGA, K. L. 2002. Predicting the effects of cellular-phone dialing on driver performance. Cognitive Systems Research, 3, 95-102.
  53. SALVUCCI, D. D., MARKLEY, D., ZUBER, M., AND BRUMBY, D. P. 2007. iPod distraction: Effects of portable music-player use on driver performance. Proc. CHI 2007, ACM Press, 243-250.
  54. SANTOS, J., MERAT, N., MOUTA, S., BROOKHUIS, K., AND DE WAARD, D. 2005. The interaction between driving and in-vehicle information systems: Comparison of results from laboratory, simulator and real-world studies. Transportation Research Part F: Traffic Psychology and Behaviour, 8(2), 135-146.
  55. SAWHNEY, N. AND SCHMANDT, C. 2000. Nomadic radio: speech and audio interaction for contextual messaging in nomadic environments. ACM Transactions Computer-Human Interaction, 7(3), 353-383.
  56. SCHMANDT, C. 1998. Audio hallway: a virtual acoustic environment for browsing. Proc. UIST1998, ACM Press, 163-170.
  57. SCHMANDT, C., LEE, K., KIM, J., AND ACKERMAN, M. 2004. Impromptu: managing networked audio applications for mobile users. Proc. International conference on Mobile Systems, Applications, and Services. ACM Press, 59-69.
  58. SCHNEIDER, M., AND KIESLER, S. 2005. Calling While Driving: Effects of Providing Remote Traffic Context. Proc CHI 2005, ACM Press, 561-569.
  59. SEARS, A., AND SHNEIDERMAN, B. 1994. Split menus: Effectively using selection frequency to organize menus. ACM Transactions on Computer-Human Interaction, 1(1), 27-51.
  60. STIFELMAN, L., ARONS, B., AND SCHMANDT, C. 2001. The audio notebook: paper and pen interaction with structured speech. Proc. CHI 2001, ACM Press, 182-189.
  61. STIFELMAN, L.J., ARONS, B., SCHMANDT, C., AND HULTEEN, E.A. 1993. VoiceNotes: a speech interface for a hand-held voice notetaker. Proc. INTERCHI1993, ACM Press, 179-186.
  62. STRAYER, D.L., AND JOHNSTON, W.A. 2001. Driven to distraction: Dual-task studies of simulated driving and conversing on a cellular phone. Psychological Science, 12, 462-466.
  63. STRAYER, D.L., DREWS, F.A., AND JOHNSTON, W.A. 2003. Cell phone-induced failures of visual attention during simulated driving. Journal of Experiment Psychology: Applied, 9, 1, 23-32.
  64. SUHM, B., FREEMAN, B. AND GETTY, D. 2001. Curing the menu blues in touch-tone voice interfaces. ACM CHI Extended Abstracts on Human Factors in Computing Systems, ACM Press, 131-132.
  65. VITENSE, H. S., JACKO, J. A., AND EMERY, V. K. 2002. Foundation for improved interaction by individuals with visual impairments through multimodal feedback. Universal Access in the Information Society, 2(1), 76-87.
  66. WAGNER, C.R., LEDERMAN, S.J., AND HOWE, R.D. 2004. Design and performance of a tactile shape display using RC servomotors. Haptics-e, Electronic Journal of Haptics Research, 3(4).
  67. WALKER, J., ALICANDRI, E., SEDNEY, C., AND ROBERTS, K. 1991. In-vehicle navigation devices: Effects on the safety of driver performance. Vehicle Navigation and Information Systems Conference, 2, 499-525.
  68. WICKENS, C. D. 2002. Multiple resources and performance prediction. Theoretical Issues in Ergonomics Science, 3(2), 159-177.
  69. YIN, M., AND ZHAI, S. 2006. The benefits of augmenting telephone voice menu navigation with visual browsing and search. Proc. CHI2006, ACM Press, 319-328.
  70. ZHAO, S., AGRAWALA, M., AND HINCKLEY, K. 2006. Zone and polygon menus: using relative position to increase the breadth of multi-stroke marking menus. Proc. CHI2006, ACM Press. 1077-1086.
  71. ZHAO, S., AND BALAKRISHNAN, R. 2004. Simple vs. compound mark hierarchical marking menus. Proc. UIST 2004, ACM Press, 33-42.
  72. ZHAO, S., DRAGICEVIC, P., CHIGNELL, M., BALAKRISHNAN, R., AND BAUDISCH, P. 2007. earPod: Eyes-free Menu Selection with Touch Input and Reactive Audio Feedback. Proc. CHI 2007, ACM Press, 1395-1404.
  73. DINGLER, T., LINDSAY, J., AND WALKER, B.N. 2008. Learnability of sound cues for environmental features: auditory icons, earcons, spearcons, and speech. Proc. International Conference on Auditory Display (ICAD 2008), Paris, France.
  74. WALKER, B.N., NANCE, A., AND LINDSAY, J. 2006. Spearcons: speech-based earcons improve navigation performance in auditory menus. Proc. International Conference on Auditory Display (ICAD 2006), London, England, 63-68.
  75. SODNIK, J., DICKE, C., TOMAZIC, S., AND BILLINGHURST, M. 2008. A user study of auditory versus visual interfaces for use while driving. International Journal of Human-Computer Studies, 66(5), 318-332.
  76. YI, B., CAO, X., FJELD, M., AND ZHAO, S. 2012. Exploring user motivations for eyes-free interaction on mobile devices. Proc. CHI 2012, ACM Press, 2789-2792.
  77. PFLEGING, B., KIENAST, M., SCHMIDT, A., DÖRING, T., ET AL. 2011. SpeeT: a multimodal interaction style combining speech and touch interaction in automotive environments. Adjunct Proc. 3rd International Conference on Automotive User Interfaces and Vehicular Applications.

