Methods and systems for modifying a media presentation are presented. In one example, a first media item including first audio data is played. While the first media item is playing, a user indication is received. In response to the indication, detection data is generated for a first audio segment of the first audio data. A second media item including second audio data is then played. Based on the detection data, a second audio segment in the second audio data is detected that corresponds to at least a portion of the first audio segment. A determination is then made as to whether a location in the second audio segment corresponds to the location in the detection data that is associated with the user indication. If it does, the playing of the second media item is altered during at least a portion of the second audio segment.
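The flow described above can be illustrated with a minimal sketch. It is not the claimed implementation: the per-frame sum used as a fingerprint, the frame size, the names (`fingerprint`, `DetectionData`, `make_detection_data`, `play_with_alteration`), and the choice of muting as the alteration are all assumptions made for illustration. The sketch also assumes the repeated audio is frame-aligned and bit-identical, which real audio would not be.

```python
from dataclasses import dataclass
from typing import List, Sequence

FRAME = 4  # samples per frame; a toy value chosen for illustration


def fingerprint(samples: Sequence[int]) -> List[int]:
    """Reduce audio to one value per frame (toy stand-in for a real audio fingerprint)."""
    return [sum(samples[i:i + FRAME])
            for i in range(0, len(samples) - FRAME + 1, FRAME)]


@dataclass
class DetectionData:
    frames: List[int]  # fingerprint of the first audio segment
    mark: int          # frame index within `frames` where the user indication fell


def make_detection_data(first_audio: Sequence[int], indication_frame: int,
                        before: int = 2, after: int = 3) -> DetectionData:
    """Build detection data for a segment surrounding the user indication."""
    fp = fingerprint(first_audio)
    start = max(0, indication_frame - before)
    end = min(len(fp), indication_frame + after + 1)
    return DetectionData(frames=fp[start:end], mark=indication_frame - start)


def play_with_alteration(second_audio: Sequence[int],
                         data: DetectionData) -> List[str]:
    """Return one action per frame of the second media item: 'play' or 'mute'."""
    fp = fingerprint(second_audio)
    actions = ["play"] * len(fp)
    # Detect a portion of the first segment: the frames up to and including the mark.
    lead_in = data.frames[:data.mark + 1]
    for start in range(len(fp) - len(lead_in) + 1):
        if fp[start:start + len(lead_in)] == lead_in:
            # The location matching the user indication was reached:
            # alter playback from there to the end of the matched segment.
            stop = min(start + len(data.frames), len(fp))
            for i in range(start + data.mark, stop):
                actions[i] = "mute"
    return actions
```

Under these assumptions, calling `make_detection_data` while the first item plays and then passing the result to `play_with_alteration` for the second item yields a per-frame list in which only the frames from the marked location to the end of the recurring segment are muted. A practical system would use a noise-tolerant fingerprint and process the second item as a stream rather than as a complete buffer.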