A method, system and computer program product for tracking movement of an object, such as a hand. Speakers of a device to be controlled transmit frequency-modulated continuous wave (FMCW) audio signals. These signals are reflected by the object and received by microphones of the controlled device. The received audio signals are mixed with the transmitted audio signals, and a fast Fourier transform (FFT) is then performed on the mixed audio signals. One or more peak frequencies in the frequency-domain output of the FFT are selected and used to estimate the distance between the object and the speakers of the controlled device. The velocity of the object is also estimated. The coordinates of the object are then computed using the estimated distance between the object and the speakers and microphones of the controlled device and the estimated velocity of the object.
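
The mix, FFT, and peak-selection steps can be illustrated with a minimal Python sketch. It is not the claimed implementation and covers only the distance estimate, not the velocity or coordinate steps. It assumes a single reflector, a speaker and microphone at the same position (so the echo delay is twice the distance divided by the speed of sound), and a linear sweep of bandwidth B over duration T, for which the mixed signal has a peak at f_peak = B * delay / T. The function names, the Hann window, the zero-padding factor, the `max_range` limit, and all numeric values are illustrative assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def chirp(t, f_min, bandwidth, sweep_time):
    """One FMCW sweep: frequency rises linearly from f_min to f_min + bandwidth."""
    return np.cos(2 * np.pi * (f_min * t + bandwidth * t**2 / (2 * sweep_time)))


def simulate_echo(t, distance, f_min, bandwidth, sweep_time, gain=0.3):
    """Hypothetical test signal: the chirp delayed by the round-trip time."""
    delay = 2 * distance / SPEED_OF_SOUND
    shifted = t - delay
    echo = gain * chirp(shifted, f_min, bandwidth, sweep_time)
    return np.where(shifted >= 0, echo, 0.0)  # silence before the echo arrives


def estimate_distance(tx, rx, bandwidth, sweep_time, sample_rate,
                      max_range=2.0, zero_pad=8):
    """Mix transmitted and received sweeps, FFT, pick the peak, convert to metres."""
    mixed = tx * rx                              # mixing step
    windowed = mixed * np.hanning(len(mixed))    # reduce spectral leakage
    n_fft = zero_pad * len(mixed)                # zero-padding gives a finer frequency grid
    spectrum = np.abs(np.fft.rfft(windowed, n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)

    # Search only the peak frequencies that correspond to distances up to
    # max_range; this also skips the high-frequency sum term produced by mixing.
    max_beat = 2 * max_range * bandwidth / (SPEED_OF_SOUND * sweep_time)
    candidates = np.flatnonzero((freqs > 0) & (freqs <= max_beat))
    peak_freq = freqs[candidates[np.argmax(spectrum[candidates])]]

    # f_peak = bandwidth * delay / sweep_time and delay = 2 * distance / c
    return SPEED_OF_SOUND * peak_freq * sweep_time / (2 * bandwidth)


# Usage with assumed parameters: a 17-19 kHz sweep lasting 40 ms at 48 kHz sampling.
sample_rate = 48_000
f_min, bandwidth, sweep_time = 17_000.0, 2_000.0, 0.04
t = np.arange(int(round(sample_rate * sweep_time))) / sample_rate
tx = chirp(t, f_min, bandwidth, sweep_time)
rx = simulate_echo(t, 0.5, f_min, bandwidth, sweep_time)  # reflector placed at 0.5 m
estimated = estimate_distance(tx, rx, bandwidth, sweep_time, sample_rate)
print(f"estimated distance: {estimated:.3f} m")
```

In this sketch the range resolution is limited by the sweep bandwidth (roughly c / (2B), about 8.6 cm for the assumed 2 kHz sweep); zero-padding only refines where the peak is read off the spectrum and does not separate closely spaced reflectors.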