Setting Up a Masters Programme in Intelligent MultiMedia - Approach and Applications

Thomas B. Moeslund* and Lars Bo Larsen†
Institute for Electronic Systems, Aalborg University, Fred. Bajers Vej 7, DK-9220 Aalborg, Denmark
E-mail: [email protected], [email protected]

* Dept. of Medical Informatics and Image Analysis
† Center for PersonKommunikation


Abstract

The goal of this paper is to describe the establishment of a new interdisciplinary research field and postgraduate study programme at Aalborg University. The field is Intelligent MultiMedia (IMM), in which we mainly focus on advanced processing and integration of information from a variety of modalities, most notably visual and spoken information sources. The paper illustrates the approach taken, partly by describing the establishment of a common platform (the Intellimedia WorkBench), and partly by describing a number of applications developed in student projects utilizing the WorkBench. Image processing plays an important role in the applications, e.g. for tracking, for classifying objects and for combined audio-visual speech recognition. The paper focuses on presenting the overall approach and a number of different applications; the aim is to give an overview rather than to present many details.

1 Introduction

This paper describes the approach taken at Aalborg University to build up a new area of research and teaching activities in the field of Multi Modal and MultiMedia User Interaction (MMUI), called "Intellimedia 2000+" [15]. The general background is first outlined, followed by a description of the Intellimedia WorkBench [9]. A number of student projects are then described to illustrate the variety of applications that can be implemented on the WorkBench, as well as to give an impression of the subjects the students cover in the education. Finally, we conclude by drawing some lines into the future.

1.1 Background

In order to comply with the ever increasing demand from industry for skilled engineers within the general field of Information Technology, and more specifically in interface design, the technical faculty at Aalborg University in late 1996 allocated resources to the Institute of Electronic Systems (IES) to form a new masters programme in Electrical Engineering within the field of "Intelligent MultiMedia", or Intellimedia. Traditionally, multimedia systems include only a shallow understanding (if any) of what is presented to the user. In intelligent multimedia we focus on the computer processing and understanding of input from sound, speech, text and images in terms of their semantic representations. This includes a deeper understanding of the various input modalities. Moreover, the initiative is directed at developing new and more user-friendly interaction methods between humans and computers (HCI), by using novel modalities or by combining already established ones. This means that output modalities are also of great interest. At IES a number of strong, but hitherto unrelated, research groups have evolved over the last decade. These include the Laboratory for Image Analysis (LIA), the Center for PersonKommunikation (CPK), Computer Science (CS) and Medical Informatics (MI). A coordination group and a research group with persons from each of the above groups were charged with two tasks: 1) to merge relevant aspects of their respective research fields into a common platform and research activity, and 2) to set up a new post-graduate masters programme within this interdisciplinary field. One of the ideas behind the first task is to bootstrap the second task by developing a demonstrator platform. In the following sections the two tasks are described in more detail.

1.2 The Study programme

An important part of the purpose of the Intellimedia 2000+ initiative is to set up a new masters study programme within Intelligent MultiMedia (IMM) [20].

The students are given advanced courses in computer vision and graphics, speech technology, spoken dialogue systems, multi modal human-computer interaction, readings in intelligent multimedia, and computer science and electrical engineering in general (e.g. pattern recognition, database systems, decision support systems, OOP, Java, networking). Students with a bachelor degree in electrical engineering or computer science are admitted. The programme is offered internationally, and currently a third of the students come from abroad. An important aspect of the studies at Aalborg University is the problem oriented, or project based, study form. All students work in groups of 3-6 persons. The study is divided equally between courses and a student project adhering to a common theme for the particular semester. In the IMM programme this means that the students are actually able to design, implement and test large systems during one semester. Examples of these are given in section 3. In order for the students to actually build and test complex systems, the WorkBench plays a vital role by making ready-to-use modules available (such as e.g. a speech recognizer). Thus, the students can concentrate on the topic of their project, rather than being caught up in establishing the underlying technology.

1.3 The Intellimedia WorkBench - a common basis for research in multi modal communication

In order to achieve the integration of research done by different groups, it was decided to initiate work on a common platform for building multi modal systems, called the Intellimedia WorkBench [9][10][11]. The practical goal of the platform is not only to enable fast development and test of a variety of applications, but also to get researchers to work closely together on a concrete project. In this way we felt that we could best achieve the benefits of combining the very diverse research fields involved in this new area. In concrete terms, the WorkBench is an experimental setup in a specially designed laboratory. The room is soundproofed and has facilities for mounting equipment like cameras in the ceiling. A dividing wall can be placed at various positions, thus adjusting the size of the room individually for each experiment. Equipment like computers can be placed in an adjacent laboratory, from which experiments with test subjects can be monitored and controlled. A table, on which objects can be placed, stands in the middle of the room; in some cases the table can be removed, or replaced by e.g. a pool table (see the descriptions of the example applications in section 3 below). A laser is placed in the ceiling and is used for drawing on the table surface, acting e.g. as a "system pointer". A WorkBench setup is illustrated in figure 2. The WorkBench is a physical as well as a software platform enabling both research and education within the area of multi modal user interfaces. The WorkBench makes a set of modules available which can be used in a variety of applications. The devices are a mixture of commercially available products (e.g. a speech recognizer), custom made devices (e.g. a laser system) and modules developed by in-house researchers (e.g. a natural language parser). The different devices are described in greater detail in the succeeding section.

2 Workbench Devices

As mentioned above there is a strong need for devices which are ready-to-use in order to design and test multi modal systems and new interaction methods. The focus of research and student projects can then be shifted from establishing low level devices to high level interface and integration design. In this section the different hardware and software devices present in the WorkBench are presented.

2.1 Cameras and Microphones

When working with sound and images, transducers are required to capture and convert the signals into a digital form which can be used in a computer. Therefore a number of microphones (hand held and wearable) and cameras are present. To convert the signals from the transducers into digital form, a framegrabber is used for the images and a sound card for the sound. It is important that these devices are of good quality, since the processes which use these information sources depend solely upon their quality. Several high-level software routines are made available to make the underlying hardware and low-level software transparent to the user.

2.2 Pan and Tilt Camera

In many systems there is a need for a pan/tilt camera, e.g. in surveillance tasks, video conferencing, tracking, and generally in all tasks where the system is trying to copy some human head/eye actions. A camera which has the ability to move is the SONY EVI-D31 [22]. The camera can, using a remote control, be moved (pan/tilt), zoomed, focused, moved to a preset position, and a few more things. It also has a built-in tracking function which, based on color histogram matching, can be set up to follow a moving object. It is possible to control all the functions of the camera from a computer using a serial link. A driver has been developed [22] which provides a system developer with a high level library for controlling the functionality of the camera from a PC.
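As a rough illustration of the kind of high-level library such a driver can expose, the sketch below wraps a serial connection in a small camera class (Python, using the pyserial package). The class, its methods and the command framing are assumptions made for this example; the actual EVI-D31 protocol and driver are documented in [22].

    # Hypothetical high-level pan/tilt camera wrapper (illustrative only).
    # The command framing below is a placeholder, not the real SONY
    # EVI-D31 protocol; see [22] for the actual driver.
    import serial

    class PanTiltCamera:
        def __init__(self, port="/dev/ttyS0", baudrate=9600):
            # The camera is reached over an RS-232 serial link.
            self.link = serial.Serial(port, baudrate=baudrate, timeout=1)

        def _send(self, payload: bytes):
            # A real driver would build protocol packets here.
            self.link.write(payload)

        def pan_tilt(self, pan_deg: float, tilt_deg: float):
            """Move to an absolute pan/tilt position given in degrees."""
            self._send(f"PT {pan_deg:.1f} {tilt_deg:.1f}\n".encode())

        def zoom(self, factor: float):
            self._send(f"ZOOM {factor:.2f}\n".encode())

    # Example: point the camera and zoom in.
    # cam = PanTiltCamera("/dev/ttyS0")
    # cam.pan_tilt(-30.0, 10.0)
    # cam.zoom(2.0)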

2.3 Eye-gaze Tracker

When designing and experimenting with interfaces it is desirable to know where the user is looking at a certain point in time, e.g. to determine the focus of attention of the user. It would also be interesting to investigate devices which could replace the mouse and thereby alleviate the carpal tunnel syndrome problem. A device with these abilities is an eye-gaze tracker. Several trackers are currently on the market, but most of them require the user to wear special, uncomfortable or intrusive equipment like helmets or electrodes. Furthermore they are very expensive, in the range from 10,000 to 50,000 USD. Instead of an eye-gaze tracker we chose to purchase a head-gaze tracker. It has most of the same abilities and is much cheaper, less than 1,500 USD. The device we decided on was the Madenta Tracker [19], a product intended for handicapped people. It is roughly twice the size of an ordinary mouse and is placed on top of a monitor. Using an infrared camera, it tracks the position of a small reflective dot which the user places on his forehead or glasses. The tracker comes with a driver which, after calibration, can be used as easily as an ordinary mouse.

2.4 Controllable Laser Beam

One of the main issues when trying to create new types of interfaces is how the system should give feedback to the user when a screen is not used. A solution is to use a laser which can point, draw and write information to the user independently of a computer screen. To do so, the laser beam must be movable. A laser like this has been constructed. In figure 1 a schematic representation of the laser is shown. The laser beam is generated in the laser diode, and the more current this diode receives, the more power the laser beam will have. This is known as the modulation of the laser. The diode is a red (640nm) laser capable of generating 15mW at 100% modulation. It is small (1cm) and cheap (500-1000 USD) but not very focused; therefore lenses are added to compensate for this. After the beam has been focused it hits the two small mirrors (5x10mm) whose positions determine where the laser beam ends up. The control of the modulation and the control of the mirrors are extremely fast: the laser can scan 600 points at 50Hz, which corresponds to drawing a path with 100 corners and updating it without the human eye noticing. The modulation and mirror positions are controlled from a DA-converter in e.g. a standard PC. A high level driver is built on top to make the laser a plug-in device [21].

Figure 1: The internal structure of the laser: the unfocused beam from the laser diode passes through the lenses and is deflected by the x- and y-mirrors; the modulation and the x- and y-positions are the control inputs.
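A minimal sketch of how a drawing path might be streamed to the laser is given below, assuming a hypothetical dac_write(channel, value) function for the DA-converter, table coordinates normalised to [0,1], and a 12-bit DAC. The channel assignment and resolution are assumptions; the actual plug-in driver is described in [21].

    # Sketch: convert a path of normalised table coordinates into mirror DAC
    # codes plus a modulation value, and stream the path repeatedly so it
    # appears as a steady drawing.  'dac_write(channel, value)' is assumed.
    import time

    DAC_MAX = 4095          # assumed 12-bit DA-converter
    REFRESH_HZ = 50         # the laser redraws the full path at 50 Hz

    def to_dac(value_0_to_1):
        """Map a normalised coordinate in [0,1] onto the DAC range."""
        return int(max(0.0, min(1.0, value_0_to_1)) * DAC_MAX)

    def draw_path(points, dac_write, duration_s=1.0):
        """points: list of (x, y, on) with x,y in [0,1] and on in {0,1}."""
        frames = int(duration_s * REFRESH_HZ)
        for _ in range(frames):
            for x, y, on in points:
                # x-mirror, y-mirror and modulation are three DAC channels.
                dac_write(channel=0, value=to_dac(x))             # x-position
                dac_write(channel=1, value=to_dac(y))             # y-position
                dac_write(channel=2, value=DAC_MAX if on else 0)  # beam power
            time.sleep(1.0 / REFRESH_HZ)

    # Example: a square drawn on the table surface.
    # square = [(0.2, 0.2, 1), (0.8, 0.2, 1), (0.8, 0.8, 1), (0.2, 0.8, 1)]
    # draw_path(square, dac_write=my_board.write)   # 'my_board.write' is assumed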

2.5 Speech Synthesizer

Another solution to the problem mentioned in 2.4, of how a system can give feedback to the user, is to use a speech synthesizer. A speech synthesizer converts a text string into an acoustic representation which is played back on a loudspeaker and comes out as synthetic speech. Fairly good and cheap products exist today, and we are using Infovox from Telia and TruVoice from Lernout & Hauspie (see [2] for details). Both are rule-based formant synthesizers. The Infovox can simultaneously cope with multiple languages, e.g. pronounce a Danish name within an English utterance.

2.6 Speech Recognizer

A speech recognizer can be used in many interfaces since speech is a very natural form of communication for humans. Basically it converts a spoken sentence into a string of words drawn from the vocabulary of the recognizer. We are using two different products, grapHvite from Entropic and Whisper from Microsoft (see [2] for details). Both are realtime continuous speech recognizers based on HMMs (Hidden Markov Models) of triphones for acoustic decoding of English or Danish. The recognition process most often focuses on recognition of speech concepts and ignores non-content words or phrases. A finite state network representation of the phrases is created by hand in accordance with a domain model; this process can also be done automatically by a grammar converter in the NLP module.

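To illustrate what a hand-written finite state network over a small domain vocabulary might look like, a toy example is sketched below. The states, transitions and vocabulary are invented for the example and do not reproduce the actual networks used with the recognizers.

    # Illustrative finite state network for a small command phrase set.
    # Each state maps to a list of (word, next_state) transitions; a None
    # word marks an allowed end of phrase.
    NETWORK = {
        "START":      [("show", "VERB"), ("where", "WH")],
        "VERB":       [("me", "OBJ")],
        "WH":         [("is", "IS")],
        "OBJ":        [("the", "DET")],
        "IS":         [("the", "DET")],
        "DET":        [("office", "PLACE"), ("lab", "PLACE")],
        "PLACE":      [("of", "NAME_INTRO"), (None, "END")],
        "NAME_INTRO": [("thomas", "END"), ("lars", "END")],
        "END":        [],
    }

    def accepts(words):
        """Check whether a word sequence is covered by the network."""
        state = "START"
        for w in words:
            for token, nxt in NETWORK[state]:
                if token == w:
                    state = nxt
                    break
            else:
                return False
        return state == "END" or any(t is None for t, _ in NETWORK[state])

    # accepts("where is the office of thomas".split())  -> True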

2.7 Natural Language Processing (NLP)

Speech recognition and speech understanding are not the same. A speech recognizer 'simply' converts a speech signal into a text string; therefore there is a need to extract the meaning (semantics) from the text string. The NL-parser [8] we use is based on a compound feature based (so-called unification) grammar formalism for extracting semantics from the N-best text outputs from the speech recognizer. The parser carries out a syntactic constituent analysis of the input and subsequently maps values into semantic frames. The rules used for syntactic parsing are based on a subset of the EUROTRA formalism [6], i.e. in terms of lexical rules and structure building rules.

2.8 Microphone Array

An important issue when working with sound is to be able to determine the location from where it originated, e.g. a person speaking. A microphone array [18] which has this capability has been built. Depending upon the placement of a maximum of 12 microphones, the microphone array calculates sound source positions in 2D or 3D, see figure 2. It is based on measurement of the delays with which a sound wave arrives at the different microphones. From this information the location of the sound source can be identified. Another application of the array is to use it to focus at a specific location, thus enhancing any acoustic activity at that location, or, in other words, to suppress sounds (noise) from all other directions. A high level API and a configuration tool have been developed for the microphone array [17].
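The delay-based principle can be sketched as follows: given the arrival-time differences of a sound relative to a reference microphone, search for the 2D position that best explains the measured delays. This is only an illustration of the idea, assuming known microphone positions in metres and delays obtained e.g. by cross-correlation; the actual driver and API are described in [17].

    # Sketch of delay-based (time-difference-of-arrival) localization.
    import numpy as np

    SPEED_OF_SOUND = 343.0  # m/s at room temperature

    def locate_source(mic_positions, delays, x_range=(0, 5), y_range=(0, 5), step=0.05):
        """mic_positions: (N,2) array in metres; delays: (N,) seconds, relative to mic 0."""
        mic_positions = np.asarray(mic_positions, dtype=float)
        delays = np.asarray(delays, dtype=float)
        best_pos, best_err = None, np.inf
        for x in np.arange(*x_range, step):
            for y in np.arange(*y_range, step):
                dist = np.linalg.norm(mic_positions - np.array([x, y]), axis=1)
                predicted = (dist - dist[0]) / SPEED_OF_SOUND   # predicted delays
                err = np.sum((predicted - delays) ** 2)
                if err < best_err:
                    best_pos, best_err = (x, y), err
        return best_pos

    # Example with four microphones along a wall (positions in metres):
    # mics = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
    # locate_source(mics, measured_delays)   # delays come from cross-correlation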

2.9 Summary

The list of modules is not exhaustive, and an important ongoing activity is to update and add new components, create well defined interfaces, etc.

3 Applications

Both research projects and student projects have been developed within the multimedia initiative. As mentioned earlier, the students at Aalborg University use approximately half of each semester to work on a project. This, together with the fact that students usually work in groups of 3-6 persons, results in rather comprehensive projects. In the following, some of the different projects made within the initiative are presented.

3.1 Multimodal Campus Information System

This work is the result of a larger project involving researchers with different backgrounds. The focus of the work is to integrate different modalities and build a system which can be used as a demonstrator. The results have been demonstrated successfully, and this has given the wanted effect of boosting the entire initiative by illustrating ideas and concepts, and serving as an inspiration to researchers and students. The application is a multimodal campus information system [9][10][11]. A model (blueprint) of a building layout is placed on top of the WorkBench table, see figure 2.

Figure 2: An illustration of the Campus Information System. Notice the camera and laser mounted in the ceiling, and the microphone array on the far wall.

The system allows the user to ask questions about the locations of persons, offices, labs, etc. Typical inquiries are about routes from one location to another, where a given person's office is located, or whose office this is (together with a pointing gesture), etc. Input is simultaneous speech and/or gestures (pointing to the building model). Output is synchronized speech synthesis and pointing/drawing (using the laser beam to point and draw routes on the map). The pointing gestures are found by analyzing the image captured by a camera mounted above the table. The camera and laser are both carefully calibrated with respect to the model of the building, using a homogeneous representation and standard linear techniques [14]. The central module within the system is a blackboard, which stores information about the system's current state, history, etc. All modules communicate through the exchange of semantic frames with other modules in the system or the blackboard. A frame contains information about who (which device) produced it as well as input/output information together with timestamps. The inputs are either spoken (e.g. "Whose office is this?") and/or gestures (e.g. pointing coordinates).

Outputs are either spoken (e.g. "This is Thomas' office"), gestures (e.g. laser coordinates), or a combination of both. Timestamps can include the times a given event commenced and completed, and are used to integrate the different inputs and synchronize the output. The synchronization and interprocess communication in the system are based on the DACS IPC platform, developed by the SFB360 project at Bielefeld University [13]. DACS allows the modules to be distributed across a number of servers.
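The following sketch illustrates the general shape of such a frame and a crude temporal test for fusing an input gesture with an input utterance. The field names and the overlap rule are assumptions made for the example, not the system's actual frame format.

    # Illustrative shape of a semantic frame exchanged via the blackboard.
    from dataclasses import dataclass
    from typing import Any, Dict

    @dataclass
    class Frame:
        producer: str             # which module/device created the frame
        direction: str            # "input" or "output"
        content: Dict[str, Any]   # e.g. recognised words, pointing coordinates
        t_start: float            # time the event commenced (seconds)
        t_end: float              # time the event completed (seconds)

    def overlaps(a: Frame, b: Frame, slack: float = 0.5) -> bool:
        """Crude temporal fusion test: do two frames refer to the same moment?"""
        return a.t_start <= b.t_end + slack and b.t_start <= a.t_end + slack

    # speech = Frame("recognizer", "input", {"words": "whose office is this"}, 10.2, 11.4)
    # point  = Frame("camera", "input", {"xy": (0.41, 0.63)}, 10.9, 11.1)
    # overlaps(speech, point)  -> True, so the gesture resolves the deictic "this"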

3.2 Virtual Air-hockey

The main concern of this project, done by the research group, is to test the laser and the camera in a combined setup. The choice of application fell on a virtual game of air-hockey. Air-hockey is like ice-hockey but with the following differences: there is only one player on each team, the puck floats on air (pumped up through multiple holes in the table) instead of sliding on ice, and the game is played on a table with a hand held stick device used to hit the puck. In the application the playing field is defined as a rectangular table (2x1m) divided into a number of colored fields which makes the borders easy to detect by the camera. A virtual puck is implemented as a laser dot, whose position is updated at 15Hz, making it appear to move like a real puck. The hand held stick device is implemented as a piece of colored cardboard. The position and orientation of this device are found by prediction, color thresholding and filtering. The game can then be played between two humans, or between a human and the computer, whose stick device is implemented as a short laser line. The game has been tested, and people became skilled very quickly due to its similarity with the real air-hockey game. Something was found to be missing, as the game had no sound effects. Therefore some illustrative sounds were included and played each time the 'puck' hit a border or a stick device. Also, a speech synthesizer is used to give the players comments during the game, e.g. "very nice goal!" or "how could you miss that shot?". These improvements (especially the sound effects) made the game much more natural and greatly improved the illusion.
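A minimal sketch of the puck update behind the laser dot could look as follows, assuming a simple constant-velocity model with reflection off the table borders at the 15Hz update rate (the table size is taken from the description above, the restitution factor is an assumption).

    # Sketch of the virtual puck: constant-velocity motion with reflection
    # off the table borders, run at the 15 Hz laser update rate.
    TABLE_W, TABLE_H = 2.0, 1.0      # metres, as in the application
    DT = 1.0 / 15.0                  # the laser dot position is updated at 15 Hz

    def step_puck(pos, vel, restitution=0.95):
        """pos, vel: (x, y) tuples in metres and metres/second."""
        x, y = pos[0] + vel[0] * DT, pos[1] + vel[1] * DT
        vx, vy = vel
        if x < 0.0 or x > TABLE_W:          # bounce on the short borders
            vx = -vx * restitution
            x = min(max(x, 0.0), TABLE_W)
        if y < 0.0 or y > TABLE_H:          # bounce on the long borders
            vy = -vy * restitution
            y = min(max(y, 0.0), TABLE_H)
        return (x, y), (vx, vy)

    # pos, vel = (1.0, 0.5), (0.8, 0.3)
    # for _ in range(15):                   # one second of play
    #     pos, vel = step_puck(pos, vel)    # then move the laser dot to 'pos'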

The following applications have all been developed within 4th year or thesis student projects.

3.3 Face Recognition in Access Control

The work in this project [7] deals with face recognition as a biometric substitution of, or supplement to, ID cards and PIN codes in the task of identifying a person. Face recognition can be utilized in a number of applications. One of them is physical access control, which is used for an imaginary case-study. The purpose of the case-study is to analyze the requirements when implementing a technology like face recognition. A structure for the different parts of an access control system from the case-study is proposed, together with a design of a user interface implemented as a prototype. The techniques used in the recognition process are principal component analysis (PCA) and Bayes classification theory. The variant parameters of PCA (scaling, rotation and translation) and the background problem are analyzed together with the access control problem, and a novel solution is found. A mirrored version of the input image is shown to the user of the system on an access console. The image is masked by a static ellipse with a cross in the middle, see figure 3. It is then very intuitive for the user to fit his face inside the ellipse and locate his nose at the cross.


Figure 3: The figure to the left shows a masked image. The two other figures show the first two eigenvectors/eigenfaces.

The system is trained on 12 different persons (50 images per person) using the first 10 eigenvectors, see figure 3 for examples. The system is tested on 120 unknown images and the recognition rate is approximately 97%.
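A minimal eigenface sketch along these lines is given below (Python/numpy). For brevity a nearest-class-mean rule replaces the Bayes classification used in the project; the number of eigenvectors follows the description above.

    # Eigenface sketch: project masked, vectorised face images onto the
    # first 10 eigenvectors and classify with a nearest-class-mean rule.
    import numpy as np

    def train(images, labels, n_components=10):
        """images: (N, pixels) array of masked, vectorised faces."""
        X = np.asarray(images, dtype=float)
        mean = X.mean(axis=0)
        # Eigenvectors of the covariance via SVD of the centred data.
        _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
        basis = Vt[:n_components]                      # the "eigenfaces"
        coeffs = (X - mean) @ basis.T                  # projections of training data
        class_means = {c: coeffs[np.asarray(labels) == c].mean(axis=0)
                       for c in set(labels)}
        return mean, basis, class_means

    def recognise(image, mean, basis, class_means):
        w = (np.asarray(image, dtype=float) - mean) @ basis.T
        return min(class_means, key=lambda c: np.linalg.norm(w - class_means[c]))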

3.4 Virtual Wheel

The aim of the project [4] is to develop a system which makes it possible to control a Windows 95 racing game in a more natural manner, namely by using gestures. Different gestures are analyzed to be obvious, comfortable, concurrent and identifiable from different camera views. The parameters which should be controlled by the gestures are: turn, brake, speed and gear shift. A frontal camera view is chosen. The hands control the wheel while the right foot controls speed and braking by rotating left or right. The gear shift is controlled by the right thumb (for gear up) and the left thumb (for gear down). To make the detection of the gestures easier the user wears red gloves and socks which can be segmented using simple color thresholds. The hands are found and represented by their centers of mass, from which the turning position of the wheel can be calculated. The orientation of the foot is found as the relation between the length and width of the segmented foot-blob. The detection of a gear shift is somewhat more difficult due to rotation of the hands and noise in the image. A solution to this problem has been proposed. First each hand blob is found. Then the contour is dilated to remove noise. Next the contour is vectorised, filtered and vectorised again. Each hand blob is now represented by a set of angles (between the vectors). Three Markov models based on these angles are set up: one for a closed fist and two for a closed fist with an outstretched thumb. The system is implemented and tested. It runs at 7Hz and is able to recognize the wheel and speed/brake gestures with good accuracy. The method used for gear detection turned out to be unable to deal with the amount of noise within the system; more Markov models, or more detailed ones, seem to be needed, if not an entirely new method. The system was, however, found to have a very intuitive interface.
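The wheel-angle estimate can be sketched as follows: threshold the red gloves, compute the centre of mass of the left and right blobs, and take the angle of the line between them. The RGB thresholds and the left/right split are assumptions made for the example.

    # Sketch of the wheel-angle estimate from the red gloves.
    import numpy as np

    def red_mask(rgb_image):
        """Very simple colour thresholding for the red gloves (assumed values)."""
        r, g, b = rgb_image[..., 0], rgb_image[..., 1], rgb_image[..., 2]
        return (r > 150) & (g < 90) & (b < 90)

    def wheel_angle(rgb_image):
        ys, xs = np.nonzero(red_mask(rgb_image))
        if xs.size < 2:
            return None                      # gloves not visible
        mid = np.median(xs)
        left_sel, right_sel = xs <= mid, xs > mid
        if not right_sel.any():
            return None                      # hands overlap; cannot split the blobs
        left = (xs[left_sel].mean(), ys[left_sel].mean())
        right = (xs[right_sel].mean(), ys[right_sel].mean())
        # The angle of the line through the two hand centres gives the wheel rotation.
        return np.degrees(np.arctan2(right[1] - left[1], right[0] - left[0]))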

3.5 Automatic Camera Control in Videoconferencing

The work [5] deals with video conferencing and how to control a camera in a videoconference system. Most video conferencing systems either have one or more stationary cameras which you can choose between, or they have a camera man. Both are suboptimal solutions. A camera man can do the job but is a very expensive solution. With only one camera you have the problem of focus/zoom: the speaker is not enhanced and therefore it can be hard to follow a conversation. This can, to some extent, be solved by including more cameras with different zoom settings, but then someone needs to control these cameras. Therefore it would be very convenient to have automatic control. One solution to this problem is to set up a microphone array which can pinpoint the 3D location of a speaker in a video conferencing scenario. The position from the microphone array is then used to control (pan, tilt and zoom) a camera. After having calibrated the microphone array to the camera, the system is able to keep a speaker in focus even when he moves around. The system has been implemented such that the microphone array picks up the position and the speech of the speaker. The position is used to control the pan, tilt and zoom parameters of a SONY EVI-D31 camera. The video signal from the camera and the speech signal from the microphone array follow the H.320 videoconferencing standard set up by the International Telecommunication Union [5] and can be sent to a remote site.
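The mapping from a speaker position to camera parameters is essentially geometry. The sketch below assumes the array reports the position in the camera's own coordinate frame (in metres) and uses an invented zoom law; the actual calibration between array and camera is described in [5].

    # Geometry sketch: convert a 3D speaker position into pan/tilt angles
    # and a distance-dependent zoom setting.
    import math

    def position_to_pan_tilt(x, y, z):
        """x: right, y: up, z: forward from the camera (metres)."""
        pan = math.degrees(math.atan2(x, z))                   # horizontal angle
        tilt = math.degrees(math.atan2(y, math.hypot(x, z)))   # vertical angle
        distance = math.sqrt(x * x + y * y + z * z)
        zoom = min(4.0, max(1.0, distance / 1.5))              # assumed zoom law
        return pan, tilt, zoom

    # pan, tilt, zoom = position_to_pan_tilt(0.5, 0.2, 3.0)
    # The resulting values are sent to the EVI-D31 via its serial driver [22].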

3.6 The Automatic Pool Trainer

The aim of this application [12] is to provide guidance for novice pool players. A pool table is placed, instead of the WorkBench, directly under the laser and a camera, which are mounted in the ceiling. The system locates the position of the table edges (for calibration), the balls and the cue using the camera. The trainer can help the user in two ways: showing guidelines or giving lessons. When the user points the cue towards the cue ball for a specified amount of time, the pool trainer calculates the trajectories of the balls given the direction of the cue. The trajectories are drawn directly on the baize using the laser, see figure 4. Thus, the player is given feedback in a very direct and natural way, simply by having the predicted result of his shot shown to him directly on the surface of the table.

Figure 4: Pool table with laser guidelines shown.

The other way of using the pool trainer is through lessons. The system provides a number of lessons in a manner similar to e.g. chess exercises. The user is instructed to place balls at specific positions (told by the speech synthesizer and shown with the laser). When all have been set up correctly (controlled by the camera), the user is instructed to e.g. pot a specific ball. The system monitors the shot and gives comments, e.g. "you need to move the cue stick a little to the left". Both applications can be initiated by spoken commands, and all questions asked by the system (using the speech synthesizer) can be answered using speech, making the system independent of keyboard, mouse and screen.
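A stripped-down version of the guideline computation could look as follows: step the cue ball along the cue direction and reflect the path off the cushions. Friction, spin and ball-ball collisions are ignored, and the table dimensions and ball radius are assumptions made for the example.

    # Sketch of the guideline prediction for the cue ball.
    TABLE_W, TABLE_H = 2.24, 1.12        # assumed playing-field size (metres)
    BALL_R = 0.028                       # assumed ball radius (metres)

    def cue_ball_path(pos, direction, length=4.0, step=0.01):
        """pos: (x, y); direction: unit vector taken from the cue orientation."""
        x, y = pos
        dx, dy = direction
        points = []
        travelled = 0.0
        while travelled < length:
            x, y = x + dx * step, y + dy * step
            if x < BALL_R or x > TABLE_W - BALL_R:   # cushion on the short side
                dx = -dx
            if y < BALL_R or y > TABLE_H - BALL_R:   # cushion on the long side
                dy = -dy
            points.append((x, y))
            travelled += step
        return points                    # drawn on the baize with the laser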

3.7 Natural Interface for Videoconferencing

The purpose of this project [1] is to create a natural interface for videoconferencing by integrating the processing of different modalities to control cameras, light, and sound. This is done to make it more natural for people to participate in a videoconference; the user of the system is therefore only required to have a basic knowledge of how to participate in a videoconference. A speech directed camera control is implemented. The user interacts with the system by speech utterances, e.g. "turn camera to the left", "zoom out" and "focus on Thomas" (where the location of the person is known a priori by the system). These input sentences are interpreted by a speech recognizer and a natural language parser with a fixed vocabulary. The semantics of the speech signal is, by a decision module, converted into control signals used to control a movable camera, the SONY EVI-D31. The system also includes a novel graphical user interface where the state of the system is illustrated, together with an overview of the different persons participating in the videoconference.

3.8 Audio-Visual Speech Recognition

The aim of the work [16] is to investigate whether the recognition rate of a speech recognizer in noisy environments can be improved by integrating visual data. An audio-visual speech recognizer is designed and implemented based on feature extraction from synchronized audio and video data. A sound card is used to sample the audio signal at 8kHz and images are grabbed at 25Hz. Mel-scaled cepstrum coefficients are used for feature extraction from the audio signal and principal component analysis for feature extraction from the images. The synchronized features from the two modalities are concatenated and applied to a recognizer based on Hidden Markov Models. This idea of integrating the two modalities at the feature level is called early integration, as opposed to late integration where the signals from each modality are processed individually before they are integrated. The system is tested on the numbers from zero to nine. As expected, the recognition rate did not improve in normal conditions by integrating the two modalities, but test results show a more stable recognition rate under noisy conditions.
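Early integration can be sketched as follows, assuming audio features at 100 frames per second and visual features at the 25Hz video rate; the two feature extractors are stand-ins for the mel-cepstrum and PCA front ends described above.

    # Sketch of "early integration": align the audio features to the 25 Hz
    # video rate and concatenate them into one observation vector per frame.
    import numpy as np

    VIDEO_HZ = 25

    def early_integration(audio_features, video_features, mfcc_per_second=100):
        """audio_features: (Ta, Da) at mfcc_per_second frames/s;
           video_features: (Tv, Dv) at VIDEO_HZ frames/s."""
        audio_features = np.asarray(audio_features, dtype=float)
        hop = mfcc_per_second // VIDEO_HZ            # audio frames per video frame
        fused = []
        for t, visual in enumerate(video_features):
            chunk = audio_features[t * hop:(t + 1) * hop]
            if len(chunk) < hop:
                break                                # ran out of audio
            # Average the audio frames inside the video frame, then concatenate.
            fused.append(np.concatenate([chunk.mean(axis=0), visual]))
        return np.asarray(fused)                     # fed to the HMM recogniser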

3.9 Improved HCI for a WIMP-based Environment

This thesis project [2][3] concerns an attempt to enhance a windows based (WIMP - Windows Icon Menu Pointer) environment. The goal is to establish whether user interaction on the common desktop PC can be augmented by adding new modalities to the WIMP interface, thus bridging the gap between today's interaction patterns and future interfaces comprising e.g. advanced conversational capabilities, VR technology, etc. A user survey was carried out to establish the trouble spots of the WIMP interface on the most common desktop work station, the Windows 95 PC. On the basis of this, a number of new modalities were considered. Spoken input and output and (head-)gaze tracking were selected, together with the concept of an interface agent, for further investigation. A system was developed to control the interaction of the input and output modalities, and a set of five scenarios was constructed to test the proposed ideas. In these, a number of test subjects used the existing and added modalities in various configurations. The chosen combination of modalities was recommended by all test subjects without exception. The enhanced interface was also rated as being between 'useful' and 'very useful' on a general level, whereas the usability of the standard interface was rated to be between 'medium' and 'useful'. Except for the methods involving the gaze tracker (at its current performance), the proposed interaction methods were all considered easier and more natural than the traditional methods. The work clearly indicates that spoken interaction can be expected to be widely accepted very quickly, when applications including it begin to appear. However, more investigation is needed to determine whether gaze control is viable. This work indicates that head-gaze is not viable (at least at its present level of development), and it remains to be shown whether eye-gaze will be accepted, and whether technology suitable for mass production will appear.

4 Discussion

As the previous sections clearly illustrate, a great diversity of applications has been developed using the WorkBench, or selected components from it. The concept of developing generic libraries and APIs for the devices associated with the WorkBench has proven a success. Students and researchers are able to select and plug in hardware and software modules with a minimum of effort and thus to focus directly on developing and experimenting with advanced human computer interaction issues, as was the intention. Efforts are continuously made to preserve and further add components to the WorkBench, often through the work of students. Modules which have been developed by a researcher or within a student project are documented, and well-defined APIs are created, turning them into readily available WorkBench components. This has e.g. been the case for the interfaces to the laser and microphone array components.

4.1 Perspectives

As mentioned in the introduction there is a great demand for skilled engineers, especially within the general area of the Internet, multimedia and similar fields. We believe that the near future will see an upsurge in the demand for advanced processing of user input, like speech and image recognition, e.g. in Virtual Reality-like environments. Today, any PC has sufficient computational power to do large vocabulary, real-time speech recognition, and this may soon be the case for image processing also. Therefore, we believe that our Intellimedia initiative is central to this development, as human resources, and not computer hardware, will be the bottleneck for the future evolution of this field.

Acknowledgments

We would like to thank all the students who participated in the different projects described in this paper, and the following people who have been involved in the IMM initiative: Tom Brøndsted, Paul Dalsgaard, Flemming K. Fink, Mike Manthey, Paul Mc Kevitt and Kristian G. Olesen.

References

[1] L. Bakman, M. Blidegn, S. Carrasco and T.D. Nielsen. NIVICO - Natural Interface for Videoconferencing. Student report, Institute for Electronic Systems, Aalborg University, 1997.
[2] L. Bakman, M. Blidegn and M. Wittrup. Improving Human-Computer Interaction by adding Speech, Gaze Tracking, and Agents to a WIMP-based Environment. Master Thesis, Institute for Electronic Systems, Aalborg University, 1998.
[3] L. Bakman, M. Blidegn, M. Wittrup, L.B. Larsen and T.B. Moeslund. Enhancing a WIMP based interface with Speech, Gaze tracking and Agents. In Proc. ICSLP-98, Sydney, Australia, Nov. 1998.
[4] J. Bang, C.B. Larsen, T. Madsen, B.C. Petersen and G. Rosset. Virtual Wheel. Student report, Institute for Electronic Systems, Aalborg University, 1998.
[5] J. Bang, U.O. Koch, C.B. Larsen, T. Madsen and B.C. Petersen. VITAL - VIdeokonferencesystem med TALestyret kamera. Student report, Institute for Electronic Systems, Aalborg University, 1997.
[6] C. Copeland, J. Durand, S. Krauwer and B. Maegaard. Description of the EUROTRA framework. The Eurotra Formal Specifications, Studies in Machine Translation and Natural Language Processing, vol. 2, 7-40, 1991.
[7] P. Bondesen, S.H.B. Poulsen and M.L. Andersen. Face Recognition in Access Control. Student report, Institute for Electronic Systems, Aalborg University, 1998.
[8] T. Brøndsted. The Natural Language Parsing Modules in REWARD and IntelliMedia 2000+. In S. Kirchmeier-Andersen, H.E. Thomsen (eds.): Proceedings from the Danish Society for Computational Linguistics (DALF), Copenhagen Business School, Dep. of Computational Linguistics, 1998. In press.
[9] T. Brøndsted, L.B. Larsen, M. Manthey, P. Mc Kevitt, T. Moeslund and K.G. Olesen. A platform for developing Intelligent MultiMedia applications. Technical Report R-98-1004, CPK, Aalborg University, May 1998.
[10] T. Brøndsted, L.B. Larsen, M. Manthey, P. Mc Kevitt, T. Moeslund and K.G. Olesen. The Intellimedia WorkBench - an environment for building Multi Modal systems. International Conference on Cooperative Multimodal Communication, Tilburg, The Netherlands, January 1998.
[11] T. Brøndsted, L.B. Larsen, M. Manthey, P. Mc Kevitt, T. Moeslund and K.G. Olesen. The Intellimedia WorkBench - a generic environment for multimodal systems. In Proc. ICSLP-98, Sydney, Australia, Nov. 1998.
[12] J. Buck, S.B. Christiansen, A. Cohen, S. Muhammad, S. Ortega and S. Thorvaldsdottir. Intelligent Multimedia Based Pool Trainer. Student report, Institute for Electronic Systems, Aalborg University, 1998.
[13] G.A. Fink et al. A Distributed System for Integration of Speech and Image Understanding. In Rogelio Soto (ed.): Proceedings of the Int. Symposium on Artificial Intelligence, Cancun, Mexico, 1996, pp. 117-126.
[14] R.C. Gonzalez and R.E. Woods. Digital Image Processing. Reading, Massachusetts: Addison Wesley.
[15] The Intellimedia 2000+ initiative at Aalborg University, http://www.cpk.auc.dk/speech/chameleon.html
[16] J. Krogh, H.H. Pedersen, L. Skyt and H. Wang. Audio-Visual Speech Recognition. Student report, Institute for Electronic Systems, Aalborg University, 1998.
[17] C.B. Larsen and B.C. Petersen. Microphone Array Driver. Technical report, Institute for Electronic Systems, Aalborg University, 1998.
[18] P. Leth-Espensen and B. Lindberg. Application of microphone arrays for remote voice pick-up - RVP project, final report. Center for PersonKommunikation, Aalborg University, 1995.
[19] Madenta Inc., 3022 Calgary Trail South, Edmonton, Alberta T6J 6V4, Canada, http://www.mandenta.com
[20] Master's Programme (M.Eng./M.Sc.) in Intelligent MultiMedia (IMM), Aalborg University, www.kom.auc.dk/ESN/masters/multimedia/mmui.html
[21] T.B. Moeslund, L. Bakman and M. Blidegn. Controlling a Movable Laser from a PC. Technical report, Institute for Electronic Systems, Aalborg University, 1998.
[22] T.B. Moeslund. An application programmer's guide to the protocol and driver for computer control of the SONY EVI-D31 tracking camera. Technical report, Institute for Electronic Systems, Aalborg University, 1997.