Systems and methods are provided for handling concurrent speech. First speech data is received from a first participant of a session, and second speech data is received from a second participant of the session. The second speech data includes a pause and temporally overlaps the first speech data. A determination is made as to whether the first speech data exceeds a predetermined length. When the first speech data exceeds the predetermined length, the first speech data is output, and the second speech data is then output without the pause. When the first speech data does not exceed the predetermined length, the first speech data is output, and the second speech data is then output with the pause.
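
The decision described above can be illustrated with a minimal sketch. It is not the claimed implementation: the `Segment` type, the `schedule_output` function, and the `max_length` parameter are hypothetical names introduced here, and the sketch assumes that "length" means duration in seconds and that pauses in the second speech data have already been detected and marked.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One chunk of a participant's speech (hypothetical representation)."""
    duration: float          # assumed unit: seconds
    is_pause: bool = False   # assumed to be set by an upstream pause detector


def total_duration(segments: List[Segment]) -> float:
    """Sum the durations of all segments, pauses included."""
    return sum(s.duration for s in segments)


def schedule_output(first: List[Segment],
                    second: List[Segment],
                    max_length: float) -> List[Segment]:
    """Return the segments in output order for two overlapping utterances.

    The first speech data is always output first. The second speech data
    follows, with its pauses removed only when the first speech data
    exceeds the predetermined length (max_length, an assumed parameter).
    """
    if total_duration(first) > max_length:
        # First speaker ran long: drop pauses from the delayed second speech.
        queued = [s for s in second if not s.is_pause]
    else:
        # First speaker was short: keep the second speech unchanged.
        queued = list(second)
    return list(first) + queued
```

For example, with `max_length=3.0`, a 5-second first utterance causes a second utterance of 1.0 s speech, 0.5 s pause, and 2.0 s speech to be queued as 1.0 s and 2.0 s only, whereas a 2-second first utterance leaves the 0.5 s pause in place.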