We had two papers accepted to the Interspeech Conference in Dresden in August:
Title: Micro-Structure of Disfluencies: Basics for Conversational Speech Synthesis
Authors: Simon Betz, Petra Wagner and David Schlangen
Abstract: Incremental dialogue systems can produce fast responses and can interact in a human-like fashion. However, these systems occasionally produce erroneous material or run out of things to say. Humans in such situations use disfluencies to remedy their ongoing production and to signal this to the listener. We devised a new model for inserting disfluencies into synthesis and evaluated this approach in a perception test. The test showed that lengthenings and silent pauses can be built for speech synthesis with low effort and high output quality. Synthesized word fragments and filled pauses, while potentially useful in incremental dialogue systems, appear more difficult for listeners to handle. While we obtained consistently high ratings for certain types of disfluencies, the need for more basic research on their micro-structure became apparent in order to synthesize the fine phonetic detail of disfluencies. To this end, we analysed corpus data with regard to distributional and durational aspects of lengthenings, word fragments and pauses. Based on these natural speaking strategies, we further explored to what extent speech can be delayed using disfluency strategies, and how to handle difficult disfluency elements by determining the appropriate amount of durational variation applicable.
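To give a flavour of the idea, here is a minimal sketch (not the paper's actual system) of how an incremental synthesizer might buy time by inserting the two "low-effort, high-quality" disfluency types named in the abstract, a lengthening and a silent pause, into its output plan. The token representation, the `insert_hesitation` function, the 1.5x stretch factor and the 300 ms pause are all hypothetical illustrative choices, not values from the paper.

```python
# Illustrative sketch: delaying speech with a lengthening plus a silent
# pause. A synthesis plan is modelled as (word, duration_ms) pairs; all
# names and numbers here are hypothetical, not taken from the paper.

def insert_hesitation(plan, index, stretch=1.5, pause_ms=300):
    """Lengthen the word at `index` and follow it with a silent pause."""
    word, dur = plan[index]
    lengthened = (word, int(dur * stretch))   # stretched final word
    pause = ("<sil>", pause_ms)               # silent pause token
    return plan[:index] + [lengthened, pause] + plan[index + 1:]

plan = [("the", 120), ("next", 200), ("train", 260), ("leaves", 280)]
# Stall after "train" while the system waits for the next content word:
delayed = insert_hesitation(plan, 2)
print(delayed)
# -> [('the', 120), ('next', 200), ('train', 390), ('<sil>', 300), ('leaves', 280)]
```

The point of combining the two strategies is that the lengthening masks the upcoming gap while the pause itself supplies most of the delay; how much durational variation a given word tolerates is exactly the kind of question the corpus analysis in the paper addresses.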
Title: Recurrent Neural Networks for Incremental Disfluency Detection
Authors: Julian Hough and David Schlangen
Abstract: For dialogue systems to become robust, they must be able to detect disfluencies accurately and with minimal latency. To meet this challenge, we frame incremental disfluency detection as a word-by-word tagging task and, following the recent success of Recurrent Neural Networks (RNNs) in Spoken Language Understanding tasks, test their performance on it. We experiment with different inputs to the RNNs to explore the effect of context on their ability to detect edit terms and repair disfluencies effectively, and also experiment with different tagging schemes. Although they do not eclipse the state of the art in utterance-final performance, RNNs achieve good detection results, requiring no feature engineering and using simple input vectors representing the incoming utterance as their training input. Furthermore, RNNs show very good incremental properties, with low latency and very good output stability, surpassing previously reported results in these measures.
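The word-by-word framing can be sketched as follows. This is a toy, untrained Elman-style RNN, not the paper's model: the vocabulary, the four-way tag set (fluent, edit term, reparandum, repair) and all weights are illustrative assumptions. What it does show is the incremental property the abstract highlights: a tag is emitted for each word the moment it arrives, so latency is one word.

```python
import math
import random

# Illustrative sketch of incremental word-by-word disfluency tagging with
# a tiny Elman RNN. Weights are random (untrained); a real system would
# learn them. Tag set and vocabulary are hypothetical.
TAGS = ["F", "E", "RM", "RP"]   # fluent, edit term, reparandum, repair
VOCAB = {"i": 0, "want": 1, "uh": 2, "a": 3, "flight": 4, "ticket": 5}
H = 8  # hidden-layer size

random.seed(0)
def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

W_xh = rand_matrix(len(VOCAB), H)   # one-hot word -> hidden
W_hh = rand_matrix(H, H)            # hidden -> hidden (recurrence: context)
W_hy = rand_matrix(H, len(TAGS))    # hidden -> tag scores

def step(word_id, h_prev):
    """One incremental step: consume a word, emit a tag, update the state."""
    h = [math.tanh(W_xh[word_id][j]
                   + sum(h_prev[k] * W_hh[k][j] for k in range(H)))
         for j in range(H)]
    scores = [sum(h[k] * W_hy[k][t] for k in range(H)) for t in range(len(TAGS))]
    return TAGS[scores.index(max(scores))], h

def tag_incrementally(words):
    h = [0.0] * H
    tags = []
    for w in words:
        tag, h = step(VOCAB[w], h)
        tags.append(tag)   # output available immediately after each word
    return tags

print(tag_incrementally("i want uh a a flight".split()))
```

Because the hidden state carries the history of the utterance, the network can in principle recognize a repair ("a a flight") only from the words seen so far, which is what makes the approach incremental and low-latency.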