A spoken language understanding (SLU) system receives a sequence of words corresponding to one or more spoken utterances of a user and passes the sequence through a spoken language understanding module to produce a sequence of intentions. The sequence of words is passed through a first subnetwork of a multi-scale recurrent neural network (MSRNN), and the sequence of intentions is passed through a second subnetwork of the MSRNN. The outputs of the first subnetwork and the second subnetwork are then combined to predict a goal of the user.
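
The two-subnetwork data flow described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the disclosed implementation: the class names `SimpleRNN` and `TwoScaleGoalPredictor`, the Elman-style recurrence, the vocabulary and hidden sizes, and in particular the choice to combine the subnetwork outputs by concatenating their final hidden states before a softmax layer. The weights are random and untrained, so the sketch shows only the shape of the computation, not a meaningful prediction.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights (untrained)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class SimpleRNN:
    """Hypothetical Elman-style subnetwork over a sequence of integer ids."""

    def __init__(self, vocab_size, hidden_size, rng):
        self.hidden_size = hidden_size
        self.embed = make_matrix(vocab_size, hidden_size, rng)   # one row per id
        self.recur = make_matrix(hidden_size, hidden_size, rng)  # hidden-to-hidden

    def run(self, ids):
        """Consume the whole sequence and return the final hidden state."""
        hidden = [0.0] * self.hidden_size
        for token_id in ids:
            recurrent = matvec(self.recur, hidden)
            hidden = [math.tanh(e + r)
                      for e, r in zip(self.embed[token_id], recurrent)]
        return hidden


class TwoScaleGoalPredictor:
    """Hypothetical two-subnetwork model: words (fine scale), intentions (coarse scale)."""

    def __init__(self, n_words, n_intents, n_goals, hidden_size=8, seed=0):
        rng = random.Random(seed)
        self.word_net = SimpleRNN(n_words, hidden_size, rng)      # first subnetwork
        self.intent_net = SimpleRNN(n_intents, hidden_size, rng)  # second subnetwork
        self.out = make_matrix(n_goals, 2 * hidden_size, rng)     # goal output layer

    def predict(self, word_ids, intent_ids):
        # Assumption: "combined" is implemented here as concatenation.
        combined = self.word_net.run(word_ids) + self.intent_net.run(intent_ids)
        return softmax(matvec(self.out, combined))


# Six word ids from the utterances and two intention ids from the SLU module.
model = TwoScaleGoalPredictor(n_words=50, n_intents=5, n_goals=3)
probs = model.predict(word_ids=[4, 17, 9, 23, 4, 31], intent_ids=[2, 0])
predicted_goal = max(range(len(probs)), key=probs.__getitem__)
```

The word sequence is typically much longer than the intention sequence, which is why the two inputs are processed by separate subnetworks operating at different time scales before their outputs are merged. Other ways of combining the outputs, such as a weighted sum of per-subnetwork scores, would fit the same description.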