Frames of a video frame sequence capturing one or more skin regions of a body are provided to a first neural network. The first neural network generates a respective appearance representation for each of the frames. The appearance representation generated from a particular frame is indicative of a spatial distribution of a physiological signal across that frame. Simultaneously with being provided to the first neural network, the frames are also provided to a second neural network, which determines the physiological signal based on the frames. In determining the physiological signal, the second neural network applies the appearance representations generated by the first neural network to outputs of one or more of its layers, so as to emphasize regions in the frames that exhibit a relatively stronger presence of the physiological signal and deemphasize regions in the frames that exhibit a relatively weaker presence of the physiological signal.
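The following is a minimal sketch of how an appearance representation might be applied to a layer output of the second neural network. The abstract does not specify how the representation is computed or applied, so everything here is an assumption made for illustration: the function names `attention_mask` and `apply_mask` are hypothetical, the mask is formed by a per-pixel weighted sum over channels followed by a sigmoid, and "applying" is taken to mean element-wise multiplication.

```python
import math


def attention_mask(appearance_features, weights, bias=0.0):
    """Collapse C x H x W appearance features into one H x W soft mask.

    Assumed form: a per-pixel weighted sum over channels, then a sigmoid,
    so every mask value lies strictly between 0 and 1.
    """
    channels = len(appearance_features)
    height = len(appearance_features[0])
    width = len(appearance_features[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            z = bias + sum(
                weights[c] * appearance_features[c][y][x] for c in range(channels)
            )
            row.append(1.0 / (1.0 + math.exp(-z)))
        mask.append(row)
    return mask


def apply_mask(layer_output, mask):
    """Multiply each channel of a C x H x W layer output by the H x W mask."""
    return [
        [
            [value * m for value, m in zip(feature_row, mask_row)]
            for feature_row, mask_row in zip(channel, mask)
        ]
        for channel in layer_output
    ]


# Toy 2 x 2 frame: one appearance channel from the first network (made-up values).
appearance = [[[4.0, -4.0], [0.0, 0.0]]]
mask = attention_mask(appearance, weights=[1.0])

# Uniform layer output from the second network, so the effect of the mask is visible.
layer_output = [[[1.0, 1.0], [1.0, 1.0]]]
gated = apply_mask(layer_output, mask)
```

In this toy case the top-left pixel, where the appearance feature is large, keeps most of its activation, while the top-right pixel is strongly attenuated. That is the emphasize and deemphasize behavior the abstract describes, under the stated assumptions.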