Timo Baumann and David Schlangen, Interactional Adequacy as a Factor in the Perception of Synthesized Speech, to appear in Proceedings of the 8th ISCA Speech Synthesis Workshop, Barcelona, Spain, August 2013
Speaking as part of a conversation is different from reading out aloud.
Speech synthesis systems, however, are typically developed using assumptions
(at least implicitly) that are more true of the latter than the former situation.
We address one particular aspect, which is the assumption that
a fully formulated sentence is available for synthesis.
We have built a system that does not make this assumption
but rather can synthesize speech given incrementally extended input.
In an evaluation experiment, we found that in a dynamic domain
where what is talked about changes quickly,
subjects rated the output of this system as more ‘naturally pronounced’ than that
of a baseline system that employed standard synthesis,
despite the synthesis quality objectively being degraded.
Our results highlight the importance of considering a synthesizer’s ability to
support interactive use-cases when determining the adequacy of synthesized speech.
Casey Kennington and David Schlangen, Situated Incremental Natural Language Understanding Using Markov Logic Networks, to appear in Computer Speech and Language
We present work on understanding natural language in a situated domain in an incremental, word-by-word fashion.
We explore a set of models specified as Markov Logic Networks and show that a model that has access to information about the visual context during an utterance, its discourse context, the words of the utterance, as well as the linguistic structure of the utterance performs best and is robust to noisy speech input. We explore the incremental properties of the models and offer some analysis. We conclude that MLNs provide a promising framework for specifying such models in a general, possibly domain-independent way.
Casey Kennington, Spyros Kousidis and David Schlangen, Interpreting Situated Dialogue Utterances: an Update Model that Uses Speech, Gaze, and Gesture Information, to appear in Proceedings of SIGdial 2013, Metz, France, August 2013
In situated dialogue, speakers share time and space. We present a statistical model for understanding natural language that works incrementally (i.e., in real, shared time) and is grounded (i.e., links to entities in the shared space). We describe our model with an example, then establish that our model works well on non-situated, telephony application-type utterances, show that it is effective in grounding language in a situated environment, and further show that it can make good use of embodied cues such as gaze and pointing in a fully multi-modal setting.
Spyros Kousidis, Casey Kennington and David Schlangen, Investigating speaker gaze and pointing behaviour in human-computer interaction with the mint.tools collection, to appear in Proceedings of SIGdial 2013, Metz, France, August 2013
Can speaker gaze and speaker arm movements be used as a practical information source for naturalistic conversational human–computer interfaces? To investigate this question, we’ve recorded (with eye tracking and motion capture) a corpus of interactions with a (wizarded) system. In this paper, we describe the recording and analysis infrastructure that we’ve built for such studies, and the particular analyses we performed on these data.
We find that with some initial calibration, a “minimally invasive”, stationary camera-based setting provides data of sufficient quality to support interaction.
Timo Baumann and David Schlangen, Open-ended, Extensible System Utterances Are Preferred, Even If They Require Filled Pauses, to appear in Proceedings of SIGdial 2013, Metz, France, August 2013
In many environments (e.g. sports commentary), situations unfold incrementally over time, and the future occurrence of a relevant event can often be predicted, but not in all its details or precise timing. We have built a simulation framework that uses our incremental speech synthesis component to assemble complex commentary utterances in a timely manner. In our evaluation, the resulting output is preferred over that from a baseline system that uses a simpler commenting strategy. Even in cases where the incremental system overcommits temporally and requires a filled pause to wait for the upcoming event, the system is preferred over the baseline.
PDFs to follow (or, should I forget to add links here, will be available under “Publications” in any case).