It would be good to talk!

Peter Cochrane & Fred Westall

Introduction

To date, human-machine interfaces have been technology limited and dominated by the switch, button, keyboard and screen. Our natural mode would see us talking to the TV, video recorder, washing machine, car and computer. The inconvenience of typing, remembering obscure commands and navigating GUIs may soon look quaint, as even a modest degree of speech recognition and synthesis can now afford a reasonable control regime. Whilst truly conversational interfaces are still some way off, the integration of speech recognition, natural language processing, speech synthesis and search engines will allow the creation of new paradigms. In this contribution we discuss some of the issues relating to human interaction with machines from the standpoint of user needs.

Challenges And Opportunities

[Figure 1: Speech technology disciplines]

Speech and language processing is an amalgam of many broadly based disciplines (Fig 1). Real-world applications using spoken commands therefore require a broad-based, holistic approach to realise systems that are acceptable for public use. Engineering and computing need to be complemented by expertise in man-machine interaction, human perception, physiology, acoustics, linguistics, natural language processing and many other fields.

Let's talk!

Interactive Voice-Response (IVR) Services

In most current applications the dialogue flow is system controlled. Information is typically accessed an item at a time, and a rigid interaction enforces particular strategies for confirmation, rejection, error detection, correction and reprompting. It can be frustrating to be 'steered' through a complex dialogue by a machine, only to find yourself lost or furnished with the wrong information.
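The flavour of this lock-step style can be captured in a few lines of code. The Python fragment below is a minimal, illustrative sketch only, assuming canned prompts and a simple confirm-or-reprompt policy; it is not modelled on any particular deployed IVR system.

```python
# Minimal sketch of a system-directed IVR exchange: one item at a time,
# explicit confirmation, and reprompting on rejection. All prompts and
# policies here are illustrative assumptions.

MAX_ATTEMPTS = 3

def ask(prompt, recognise, confirm):
    """Prompt for one item, then explicitly confirm it - the lock-step
    strategy that makes such dialogues feel rigid."""
    for _ in range(MAX_ATTEMPTS):
        print(f"SYSTEM: {prompt}")
        heard = recognise()                       # one item at a time
        print(f"SYSTEM: I heard '{heard}'. Is that correct?")
        if confirm():                             # explicit yes/no check
            return heard
        print("SYSTEM: Sorry, let us try that again.")  # reprompt
    raise RuntimeError("Too many failed attempts")

if __name__ == "__main__":
    # Canned user behaviour so the sketch runs without a real recogniser.
    utterances = iter(["train times", "Ipswich"])
    confirmations = iter([True, True])
    service = ask("Which service do you require?",
                  lambda: next(utterances), lambda: next(confirmations))
    place = ask("For which town?",
                lambda: next(utterances), lambda: next(confirmations))
    print(f"SYSTEM: Fetching {service} for {place}...")
```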
This calls for a different approach to dialogue design, and systems are beginning to emerge that let us respond more spontaneously and flexibly, and take control of the interaction. Consider the kind of 'conversation' a next-generation system might hold with you. Such an interaction illustrates some desirable features of 'more natural' interactive applications. The system does not constantly prompt you for explicit information. It uses discourse and domain knowledge to resolve ambiguous expressions such as 'give me the next one', and is able to resolve references to earlier information. You can interrupt (barge in) at any stage, and the interrupting phrase is recognised. You are in control of whether the full message is heard or only a summary is given. The system discards hesitations and grunts, but is able to identify words of significance wherever they occur (directed word spotting). Such a dialogue shifts the initiative in the conversation from the computer to the user. More fluent interactions of this type inevitably place greater demands on the underlying technology and require more language knowledge than more structured dialogues do. A sketch of the idea follows.
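To make directed word spotting and simple discourse state concrete, here is a minimal Python sketch. It assumes the recogniser has already produced a word string; the filler list, command lexicon and canned messages are illustrative assumptions rather than any fielded system.

```python
# A minimal sketch of directed word spotting over a recognised utterance,
# with enough discourse state to resolve 'the next one'. The filler list,
# command set and canned messages are illustrative assumptions only.

FILLERS = {"um", "er", "ah", "well", "please"}    # hesitations and grunts
COMMANDS = {"next", "repeat", "summary", "stop"}  # words of significance

class MessageDialogue:
    def __init__(self, messages):
        self.messages = messages
        self.index = 0          # discourse state: which message is current

    def handle(self, utterance):
        # Discard fillers, then spot significant words wherever they occur.
        words = [w.strip(",.") for w in utterance.lower().split()]
        content = [w for w in words if w not in FILLERS]
        spotted = [w for w in content if w in COMMANDS]
        if not spotted:
            return "Sorry, what would you like to do?"
        command = spotted[0]
        if command == "next":   # discourse knowledge resolves 'the next one'
            self.index = (self.index + 1) % len(self.messages)
            return self.messages[self.index]
        if command == "repeat":
            return self.messages[self.index]
        if command == "summary":
            return f"You have {len(self.messages)} messages."
        return "Goodbye."

if __name__ == "__main__":
    d = MessageDialogue(["Message one: Anna called.",
                         "Message two: Bob called.",
                         "Message three: Carol called."])
    print(d.handle("er, um, just give me the next one"))  # spots 'next'
    print(d.handle("well... could you repeat that"))      # spots 'repeat'
    print(d.handle("give me a quick summary"))            # spots 'summary'
```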
Future Directions

The current drive to improve the naturalness of text-to-speech systems will continue, with the emphasis on a number of areas.

Undoubtedly the current emphasis on large-vocabulary, speaker-independent systems will also continue. To make such systems scalable to network applications, more work will be needed on improving noise robustness and out-of-vocabulary rejection. There is also interest in developing systems that can adapt to the characteristics of the user's voice and the transmission environment. Techniques such as 'barge-in' and word spotting will also grow in importance in the drive to make systems more natural and more acceptable.

To enhance significantly the basic performance of recognisers, more knowledge is needed in areas such as feature extraction, and in building in knowledge, as it becomes available, of the physiology of the ear and brain as a coupled 'system'. Improvements to the modelling of speech can be expected in a drive to reduce the limitations of the existing statistical model-based approaches. New paradigms based on mixed-mode (acoustic, speech and visual) cues may offer one way forward.

As interactive speech systems go operational they will generate new field experience and data on customers' perception of the service. Such 'real-world' data will provide a new R&D focus for technology improvement. Statistical design and advanced data visualisation techniques will also help researchers interpret this volume of data. By examining large numbers of interactions, statistical models of dialogues can be generated and used to optimise deployed dialogues developed using traditional techniques. This offers the potential for a unified understanding from low-level signal processing to higher-level semantics. There will also be increasing interest in multilingual recognition and the associated ability to identify the language of the speaker from just a sample of speech. Ultimately, the ability to identify the topic domain, as a means of focusing the recognition resources, will have to be developed.

Future directions in natural language processing

We have already reached the state where people simply cannot cope with the information mountains. Even with oxygen and 'siege tactics', these mountains are becoming too high to scale! Such challenges require true language interpretation and generation capabilities, which in turn have to be supported by contextual reasoning. Indeed, an increasing range of tasks calls for combinations of these speech, language and discourse functions.

Recent developments have seen Information Agents that 'learn' the interests of the user and select appropriate items from databases. A text summariser has also been produced, which interactively abridges articles by extracting the most important sentences. Tests have shown that an abridgement of only 5% of the original will contain roughly 70% of the information in a written abstract, while a 25% abridgement contains nearly 100%. A sketch of this extractive approach is given below.
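The abridgement idea can be illustrated with a simple extractive scheme. The Python sketch below scores each sentence by the frequency of its content words and keeps the highest-scoring fraction in their original order. The stop-word list and scoring rule are assumptions for illustration; the real summariser's method is not described in this article.

```python
# A minimal sketch of extractive abridgement: score sentences by the
# frequency of their content words and keep the top-scoring fraction,
# preserving original order. The stop-word list and scoring rule are
# illustrative assumptions, not the method of the system described above.

import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "are",
              "in", "it", "that", "on", "for", "with", "can"}

def abridge(text, fraction=0.25):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokens = re.findall(r"[a-z']+", text.lower())
    freq = Counter(t for t in tokens if t not in STOP_WORDS)

    def score(sentence):
        words = [w for w in re.findall(r"[a-z']+", sentence.lower())
                 if w not in STOP_WORDS]
        return sum(freq[w] for w in words) / (len(words) or 1)

    keep = max(1, round(len(sentences) * fraction))
    chosen = set(sorted(sentences, key=score, reverse=True)[:keep])
    return " ".join(s for s in sentences if s in chosen)

if __name__ == "__main__":
    article = ("Speech interfaces are natural. Speech can take the user "
               "directly to the information needed. Typing obscure commands "
               "is slow. An integrated speech interface complements the GUI.")
    print(abridge(article, fraction=0.25))  # keeps the single best sentence
```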
It is widely recognised that navigating through hypertext with a mouse can be tedious and confusing. Speech is more direct and natural, and can take the user straight to the information needed. Speech does not replace GUIs; instead, the best solution may be an integrated interface that allows the different input and output modes to complement and enhance each other. An experimental system which integrates continuous speech input and output with a WWW browser, to provide direct access to a subset of the BT Business catalogue, is now on line. This covers a range of products such as telephones, answering machines and phone systems. Users can ask questions of the type 'which phones have on-hook dialling and cost less than 60 pounds?' and 'which ones come in grey?'. These technologies offer the user the attractive prospect of receiving summarised information, customised to individual needs, over the mobile phone from anywhere.

Person-To-Person

'A person with one eye and one ear who lives in a bathroom with stroboscopic lighting. He/she has a mouthful of marbles, severe hearing loss above 3 kHz and a 35 km long throat, which explains the delay between the brain and mouth. This person has no body, and a single finger which is used for signalling to the rest of the world. The primary speaking interface to the world's largest machine is by means of a plastic replica of a banana which is held close to the side of the face.'

It is remarkable that we are able to communicate as well as we do, given the constraints we impose on natural communication. Now imagine a world where the environment provides the gateway to seemingly infinite quantities of information. A place where you can interact with real and virtual worlds, and where you can see, hear, sense and touch the person you are communicating with. Imagine roaming from reality into virtual space, or working in a virtual business. Imagine being able to use eyes- and hands-free natural language, spoken commands in any language of choice, to instruct intelligent agents to seek out what you require from anywhere in the world. Such a system could recognise who you are simply by looking at your face or recognising your voice, and instantly configure and customise your communications needs.

In the research environment, teleconferencing can already be made very realistic, as with electronic workspaces offering eye-to-eye video conferencing that maintains gaze awareness. When augmented by directional, speaker-tracking audio reproduction, the illusion of 'being there' is almost complete. Speech and language processing has the potential to make such interfaces more natural and easier to use through the exploitation of visual, acoustic and gesture cues. Herein lies the true destiny of speech and language processing: not just one mode of communication, speech, but all the senses orchestrated to meet the real underlying communication needs of people.

Conclusion

Stop bending people into the technology, and start bending the technology into people.

About The Authors

Fred Westall received degrees in Electrical and Electronic Engineering from University College London and in Communication Engineering from the University of Manchester, in 1973 and 1975 respectively. In 1975 he joined BT Laboratories to undertake research and development of speech-band modems for the public switched telephone network, and he has been closely associated with digital signal processing ever since. In 1982 he became head of the Speech Coding Applications Group, with specific responsibility for novel speech-and-data multiplexers incorporating low bit-rate and high-quality (7 kHz) speech codecs. In 1986 he was appointed to manage the Data Products Development Section, where he was responsible for packet-terminal development and high-speed modem R&D. He is currently responsible for downstreaming speech-band applications onto DSP speech platforms, and for signal processing R&D in speech recognition, coding, analysis and synthesis. He is a Fellow of the IEE and a Senior Member of the IEEE.