It would be good to talk!

Peter Cochrane & Fred Westall

Introduction

To date, human-machine interfaces have been technology limited and dominated by the switch, button, keyboard and screen. Our natural mode would see us talking to the TV, video recorder, washing machine, car and computer. The inconvenience of typing, remembering obscure commands and navigating GUIs may soon look quaint, as even a modest degree of speech recognition and synthesis can now afford a reasonable control regime. Whilst truly conversational interfaces are still some way off, the integration of speech recognition, natural language processing, speech synthesis and search engines will allow the creation of new paradigms. In this contribution we discuss some of the issues relating to human interaction with machines from the standpoint of user needs.

Challenges And Opportunities

[Figure 1: Speech technology disciplines]

Speech and language processing is an amalgam of many broadly based disciplines (Fig 1). Real-world applications using spoken commands therefore require a broad-based, holistic approach to realise systems that are acceptable for public use. Engineering and computing need to be complemented by expertise in man-machine interaction, human perception, physiology, acoustics, linguistics, natural language processing and many other fields.

Let's talk!

Interactive Voice-Response (IVR) Services

In most current applications the dialogue flow is system controlled. Information is typically accessed an item at a time, and a rigid interaction enforces particular strategies for confirmation, rejection, error detection, correction and reprompting. It can be frustrating to be 'steered' through a complex dialogue by a machine, only to find yourself lost or furnished with the wrong information.
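The flavour of this lock-step style can be captured in a few lines of code. The Python fragment below is a minimal, illustrative sketch only, assuming canned prompts and a simple confirm-or-reprompt policy; it is not modelled on any particular deployed IVR system.

```python
# Minimal sketch of a system-directed IVR exchange: one item at a time,
# explicit confirmation, and reprompting on rejection. All prompts and
# policies here are illustrative assumptions.

MAX_ATTEMPTS = 3

def ask(prompt, recognise, confirm):
    """Prompt for one item, then explicitly confirm it - the lock-step
    strategy that makes such dialogues feel rigid."""
    for _ in range(MAX_ATTEMPTS):
        print(f"SYSTEM: {prompt}")
        heard = recognise()                       # one item at a time
        print(f"SYSTEM: I heard '{heard}'. Is that correct?")
        if confirm():                             # explicit yes/no check
            return heard
        print("SYSTEM: Sorry, let us try that again.")  # reprompt
    raise RuntimeError("Too many failed attempts")

if __name__ == "__main__":
    # Canned user behaviour so the sketch runs without a real recogniser.
    utterances = iter(["train times", "Ipswich"])
    confirmations = iter([True, True])
    service = ask("Which service do you require?",
                  lambda: next(utterances), lambda: next(confirmations))
    place = ask("For which town?",
                lambda: next(utterances), lambda: next(confirmations))
    print(f"SYSTEM: Fetching {service} for {place}...")
```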
This calls for a different approach to dialogue design, and systems are beginning to emerge that let us respond more spontaneously and flexibly, and take control of the interaction. Consider the kind of 'conversation' a next-generation system might hold with you. Such an interaction illustrates some desirable features of 'more natural' interactive applications. The system does not constantly prompt you for explicit information. It uses discourse and domain knowledge to resolve ambiguous expressions such as 'give me the next one', and is able to resolve references to earlier information. You can interrupt (barge in) at any stage, and the interrupting phrase is recognised. You are in control of whether the full message is heard or only a summary is given. The system discards hesitations and grunts, but is able to identify words of significance wherever they occur (directed word spotting). Such a dialogue shifts the initiative in the conversation from the computer to the user. More fluent interactions of this type inevitably place greater demands on the underlying technology and require more language knowledge than more structured dialogues do. A sketch of the idea follows.
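To make directed word spotting and simple discourse state concrete, here is a minimal Python sketch. It assumes the recogniser has already produced a word string; the filler list, command lexicon and canned messages are illustrative assumptions rather than any fielded system.

```python
# A minimal sketch of directed word spotting over a recognised utterance,
# with enough discourse state to resolve 'the next one'. The filler list,
# command set and canned messages are illustrative assumptions only.

FILLERS = {"um", "er", "ah", "well", "please"}    # hesitations and grunts
COMMANDS = {"next", "repeat", "summary", "stop"}  # words of significance

class MessageDialogue:
    def __init__(self, messages):
        self.messages = messages
        self.index = 0          # discourse state: which message is current

    def handle(self, utterance):
        # Discard fillers, then spot significant words wherever they occur.
        words = [w.strip(",.") for w in utterance.lower().split()]
        content = [w for w in words if w not in FILLERS]
        spotted = [w for w in content if w in COMMANDS]
        if not spotted:
            return "Sorry, what would you like to do?"
        command = spotted[0]
        if command == "next":   # discourse knowledge resolves 'the next one'
            self.index = (self.index + 1) % len(self.messages)
            return self.messages[self.index]
        if command == "repeat":
            return self.messages[self.index]
        if command == "summary":
            return f"You have {len(self.messages)} messages."
        return "Goodbye."

if __name__ == "__main__":
    d = MessageDialogue(["Message one: Anna called.",
                         "Message two: Bob called.",
                         "Message three: Carol called."])
    print(d.handle("er, um, just give me the next one"))  # spots 'next'
    print(d.handle("well... could you repeat that"))      # spots 'repeat'
    print(d.handle("give me a quick summary"))            # spots 'summary'
```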
Future Directions

The current drive to improve the naturalness of text-to-speech systems will continue, with the emphasis on a number of areas.

Undoubtedly the current emphasis on large-vocabulary, speaker-independent systems will also continue. To make such systems scalable to network applications, more work will be needed on improving noise robustness and out-of-vocabulary rejection. There is also interest in developing systems that can adapt to the characteristics of the user's voice and the transmission environment. Techniques such as 'barge-in' and word spotting will also grow in importance in the drive to make systems more natural and more acceptable.

To enhance significantly the basic performance of recognisers, more knowledge is needed in areas such as feature extraction, and in building in knowledge, as it becomes available, of the physiology of the ear and brain as a coupled 'system'. Improvements to the modelling of speech can be expected in a drive to reduce the limitations of the existing statistical model-based approaches. New paradigms based on mixed-mode (acoustic, speech and visual) cues may offer one way forward.

As interactive speech systems go operational they will generate new field experience and data on customers' perception of the service. Such 'real-world' data will provide a new R&D focus for technology improvement. Statistical design and advanced data visualisation techniques will also help researchers interpret this volume of data. By examining large numbers of interactions, statistical models of dialogues can be generated and used to optimise deployed dialogues developed using traditional techniques. This offers the potential for a unified understanding from low-level signal processing to higher-level semantics. There will also be increasing interest in multilingual recognition and the associated ability to identify the language of the speaker from just a sample of speech. Ultimately, the ability to identify the topic domain, as a means of focusing the recognition resources, will have to be developed.

Future directions in natural language processing

We have already reached the state where people simply cannot cope with the information mountains. Even with oxygen and 'siege tactics', these mountains are becoming too high to scale! Such challenges require true language interpretation and generation capabilities, which in turn have to be supported by contextual reasoning. Indeed, an increasing range of tasks calls for combinations of these speech, language and discourse functions.

Recent developments have seen Information Agents that 'learn' the interests of the user and select appropriate items from databases. A text summariser has also been produced, which interactively abridges articles by extracting the most important sentences. Tests have shown that an abridgement of only 5% of the original will contain roughly 70% of the information in a written abstract, while a 25% abridgement contains nearly 100%. A sketch of this extractive approach is given below.
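The abridgement idea can be illustrated with a simple extractive scheme. The Python sketch below scores each sentence by the frequency of its content words and keeps the highest-scoring fraction in their original order. The stop-word list and scoring rule are assumptions for illustration; the real summariser's method is not described in this article.

```python
# A minimal sketch of extractive abridgement: score sentences by the
# frequency of their content words and keep the top-scoring fraction,
# preserving original order. The stop-word list and scoring rule are
# illustrative assumptions, not the method of the system described above.

import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "are",
              "in", "it", "that", "on", "for", "with", "can"}

def abridge(text, fraction=0.25):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokens = re.findall(r"[a-z']+", text.lower())
    freq = Counter(t for t in tokens if t not in STOP_WORDS)

    def score(sentence):
        words = [w for w in re.findall(r"[a-z']+", sentence.lower())
                 if w not in STOP_WORDS]
        return sum(freq[w] for w in words) / (len(words) or 1)

    keep = max(1, round(len(sentences) * fraction))
    chosen = set(sorted(sentences, key=score, reverse=True)[:keep])
    return " ".join(s for s in sentences if s in chosen)

if __name__ == "__main__":
    article = ("Speech interfaces are natural. Speech can take the user "
               "directly to the information needed. Typing obscure commands "
               "is slow. An integrated speech interface complements the GUI.")
    print(abridge(article, fraction=0.25))  # keeps the single best sentence
```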
It is widely recognised that navigating through hypertext with a mouse can be tedious and confusing. Speech is more direct and natural, and can take the user straight to the information needed. Speech does not replace GUIs; instead, the best solution may be an integrated interface that allows the different input and output modes to complement and enhance each other. An experimental system which integrates continuous speech input and output with a WWW browser, to provide direct access to a subset of the BT Business catalogue, is now on line. This covers a range of products such as telephones, answering machines and phone systems. Users can ask questions of the type 'which phones have on-hook dialling and cost less than 60 pounds?' and 'which ones come in grey?'. These technologies offer the user the attractive prospect of receiving summarised information, customised to individual needs, over the mobile phone from anywhere.

Person-To-Person

'A person with one eye and one ear who lives in a bathroom with stroboscopic lighting. He/she has a mouthful of marbles, severe hearing loss above 3 kHz and a 35 km long throat, which explains the delay between the brain and mouth. This person has no body, and a single finger which is used for signalling to the rest of the world. The primary speaking interface to the world's largest machine is by means of a plastic replica of a banana which is held close to the side of the face.'

It is remarkable that we are able to communicate as well as we do, given the constraints we impose on natural communication. Now imagine a world where the environment provides the gateway to seemingly infinite quantities of information. A place where you can interact with real and virtual worlds, and where you can see, hear, sense and touch the person you are communicating with. Imagine roaming from reality into virtual space, or working in a virtual business. Imagine being able to use eyes- and hands-free natural language, spoken commands in any language of choice, to instruct intelligent agents to seek out what you require from anywhere in the world. Such a system could recognise who you are simply by looking at your face or recognising your voice, and instantly configure and customise your communications needs.

In the research environment, teleconferencing can already be made very realistic, as with electronic workspaces offering eye-to-eye video conferencing that maintains gaze awareness. When augmented by directional, speaker-tracking audio reproduction, the illusion of 'being there' is almost complete. Speech and language processing has the potential to make such interfaces more natural and easier to use through the exploitation of visual, acoustic and gesture cues. Herein lies the true destiny of speech and language processing: not just one mode of communication, speech, but all the senses orchestrated to meet the real underlying communication needs of people.

Conclusion

Stop bending people into the technology, and start bending the technology into people.

About The Authors

Fred Westall received degrees in Electrical and Electronic Engineering from University College London and in Communication Engineering from the University of Manchester, in 1973 and 1975 respectively. In 1975 he joined BT Laboratories to undertake research and development of speech-band modems for the public switched telephone network, and he has been closely associated with digital signal processing ever since. In 1982 he became head of the Speech Coding Applications Group, with specific responsibility for novel speech-and-data multiplexers incorporating low bit-rate and high-quality (7 kHz) speech codecs. In 1986 he was appointed to manage the Data Products Development Section, where he was responsible for packet-terminal development and high-speed modem R&D. He is currently responsible for downstreaming speech-band applications onto DSP speech platforms, and for signal processing R&D in speech recognition, coding, analysis and synthesis. He is a Fellow of the IEE and a Senior Member of the IEEE.