A multimodal system and process processes one or more sensor signals and extracts features from the one or more sensor signals through a spatiotemporal correlation between consecutive frames of an image or video sequence. The multimodal system and process determines the movement and direction of the features through an image subtraction or a coherence measure, and synthesizes an imaginary musical instrument signal in response to the detected movement and direction or in response to triggers. The imaginary musical instrument signal is added to an infotainment signal within a vehicle.
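
The image-subtraction path described above can be illustrated with a minimal sketch. It is not the claimed implementation: the function names, the brightness threshold, the direction-to-pitch table, and the sample rate are all hypothetical choices made for illustration, frames are assumed to be small grayscale grids given as lists of rows, and the coherence-measure alternative is not shown. The sketch subtracts two consecutive frames, estimates the direction of movement from where brightness appeared and disappeared, synthesizes a sine tone for that direction, and adds the tone to an infotainment sample buffer.

```python
import math

# Hypothetical mapping from detected direction to a pitch in Hz (assumption).
NOTE_FOR_DIRECTION = {"left": 220.0, "right": 440.0, "up": 660.0, "down": 330.0}


def motion_vector(prev, curr, threshold=30):
    """Estimate (dx, dy) of a bright feature between two grayscale frames.

    Uses signed image subtraction: pixels that brightened mark where the
    feature arrived, pixels that darkened mark where it departed.
    Returns None when no movement is detected.
    """
    arrived, departed = [], []
    for y, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(row_prev, row_curr)):
            diff = c - p
            if diff > threshold:
                arrived.append((x, y))
            elif diff < -threshold:
                departed.append((x, y))
    if not arrived or not departed:
        return None
    ax = sum(x for x, _ in arrived) / len(arrived)
    ay = sum(y for _, y in arrived) / len(arrived)
    dx = sum(x for x, _ in departed) / len(departed)
    dy = sum(y for _, y in departed) / len(departed)
    return (ax - dx, ay - dy)


def classify_direction(vec):
    """Reduce a motion vector to 'left', 'right', 'up', 'down', or 'none'."""
    if vec is None or vec == (0.0, 0.0):
        return "none"
    dx, dy = vec
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"  # image rows grow downward


def synthesize_tone(freq_hz, n_samples, sample_rate=8000, amplitude=0.5):
    """Generate a plain sine tone as a list of float samples."""
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * n / sample_rate)
        for n in range(n_samples)
    ]


def mix(infotainment, instrument):
    """Add the instrument samples to the infotainment samples, clipped to [-1, 1]."""
    mixed = []
    for i, sample in enumerate(infotainment):
        extra = instrument[i] if i < len(instrument) else 0.0
        mixed.append(max(-1.0, min(1.0, sample + extra)))
    return mixed


def process_frames(prev, curr, infotainment):
    """Run detection, synthesis, and mixing for one pair of frames."""
    direction = classify_direction(motion_vector(prev, curr))
    if direction == "none":
        return direction, list(infotainment)
    tone = synthesize_tone(NOTE_FOR_DIRECTION[direction], len(infotainment))
    return direction, mix(infotainment, tone)
```

In this sketch, a single bright pixel that shifts one column to the right between two frames yields the direction "right", and the corresponding tone is summed sample by sample into the infotainment buffer. A real system would likely smooth the detection over more than two frames and use a richer instrument model than a sine wave.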