Attention Driven Auditory Display
Thorsten Mahler, Pierre Bayerl, Michael Weber & Heiko Neumann (Universität Ulm)

The interdisciplinary field of image sonification aims at the transformation of images into auditory signals. The basic question is how information from the image domain can be transformed to the sound domain to produce an intuitively comprehensible audio signal. Three classes of sonification approaches have previously been proposed [1,2]: first, parameter mapping, where image data (e.g. position, luminance) are directly mapped to the parameters of the sound signal (e.g. amplitude, frequency, duration); second, model-based sonification, where virtual sound objects (e.g. instruments) are controlled by the visual input; third, auditory scene generation, where the input data are used to generate an auditory scene. The starting point for our work was the sonification system “vOICe” introduced by Meijer [3]. This system is a variation of the parameter mapping approach in which image luminance steers the sound amplitude, vertical image location steers the sound frequency, and horizontal location steers time and stereo position. A drawback of this approach is that all data contained in the image are sonified, regardless of their relevance. Other systems are designed for a very specific purpose (e.g. [2]). Unlike previous approaches, we aim to sonify images of any kind. We propose that models of visual attention [4] and visual grouping [5] can be used to dynamically select the relevant visual information to be sonified. For the auditory synthesis we employ an approach that takes advantage of the sparseness of the selected input data. Horizontal image locations are mapped directly into the sound signal using auditory stereo. Vertical information is encoded by time. Additional audio parameters, such as frequency, can be controlled by local image features, such as orientation. Furthermore, we introduce a sequential playback mode in which image features at different locations are played successively instead of simultaneously. This enhances the perception of relative differences in the stereo signal. In conclusion, the presented approach combines data sonification approaches, such as auditory scene generation, with models of human visual perception. It extends previous pixel-based transformation algorithms by incorporating mid-level vision coding and high-level control. The mapping uses elaborated sound parameters that allow non-trivial orientation and positioning in 3D space.
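
The mapping described above (horizontal position to stereo, vertical position to time, orientation to frequency, with optional sequential playback) might be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names Feature, feature_to_event and render are hypothetical, the frequency range, sweep duration, tone length and constant-power panning law are arbitrary choices, and the attention and grouping stage is not modelled (the features are assumed to have been selected already).

```python
import math
from dataclasses import dataclass


@dataclass
class Feature:
    """A selected image feature (hypothetical container, not from the abstract)."""
    x: float            # horizontal position in pixels
    y: float            # vertical position in pixels
    orientation: float  # local orientation in radians


def feature_to_event(f, width, height, sweep_s=2.0, f_lo=220.0, f_hi=880.0):
    """Map one feature to (onset in seconds, stereo pan in [0, 1], frequency in Hz)."""
    pan = f.x / max(width - 1, 1)                 # 0 = far left, 1 = far right
    onset = sweep_s * f.y / max(height - 1, 1)    # assumed: top of image plays first
    # Assumed linear map of orientation in [0, pi) onto [f_lo, f_hi).
    freq = f_lo + (f_hi - f_lo) * (f.orientation % math.pi) / math.pi
    return onset, pan, freq


def render(features, width, height, sample_rate=8000, tone_s=0.1,
           sweep_s=2.0, sequential=False):
    """Return (left, right) sample lists for the given features."""
    events = sorted(feature_to_event(f, width, height, sweep_s) for f in features)
    if not events:
        return [], []
    if sequential:
        # Sequential playback: keep the onset order but re-space the tones
        # so that no two of them overlap in time.
        events = [(i * tone_s, pan, freq) for i, (_, pan, freq) in enumerate(events)]
    tone_n = int(round(tone_s * sample_rate))
    n = int(round(max(on for on, _, _ in events) * sample_rate)) + tone_n
    left, right = [0.0] * n, [0.0] * n
    for onset, pan, freq in events:
        start = int(round(onset * sample_rate))
        # Constant-power panning (an assumed choice of panning law).
        gain_l = math.cos(pan * math.pi / 2)
        gain_r = math.sin(pan * math.pi / 2)
        for k in range(tone_n):
            s = math.sin(2 * math.pi * freq * k / sample_rate)
            left[start + k] += gain_l * s
            right[start + k] += gain_r * s
    return left, right
```

Calling render(features, width, height, sequential=True) plays two features that lie on the same image row one after the other rather than mixed together, which is the situation in which the abstract expects relative stereo differences to become easier to hear.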

[1] Hermann, Ritter (2004): Proceedings of the IEEE, Special Issue Engineering and Music, 92
[2] Hermann, Nattkemper, Ritter, Schubert (2000): Proceedings of the Mathematical and Engineering Techniques in Medical and Biological Sciences, CSREA Press
[3] Meijer (1992): IEEE Transactions on Biomedical Engineering, 39
[4] Itti, Koch, Niebur (1998): IEEE Transactions on Pattern Analysis and Machine Intelligence, 20
[5] Fischer, Bayerl, Neumann, Cristóbal, Redondo (2004): European Conference on Computer Vision 2004, LNCS 3023