Systems and methods for identifying and learning emotions in conversation utterances are described. The system receives at least one of textual utterance data, audio utterance data, and visual utterance data. A set of facial expressions is extracted from the visual utterance data. The system annotates the set of facial expressions with a corresponding set of emotions using predictive modeling. Upon annotating, labelled data is generated by tagging the textual utterance data and the audio utterance data with the set of emotions. The labelled data, along with non-labelled data, is fed into a self-learning model of the system, where the non-labelled data is new textual utterance data. The self-learning model learns the set of emotions from the labelled data. Further, the self-learning model determines a new set of emotions corresponding to the new textual utterance data using a recurrent neural network. The self-learning model then generates new labelled data and updates itself accordingly.
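The self-learning step described above resembles a pseudo-labelling loop: train a recurrent classifier on the labelled utterances, then tag confident predictions on new utterances as fresh labelled data. The following is a minimal sketch of that loop, assuming a PyTorch LSTM classifier over tokenized text; the names EmotionRNN, EMOTIONS, self_train, and the confidence threshold are illustrative assumptions, not drawn from the source.

```python
# Minimal sketch of a self-learning (pseudo-labelling) loop with an RNN
# text classifier. EMOTIONS, EmotionRNN, and self_train are hypothetical
# names; the emotion label set is an assumed example.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMOTIONS = ["anger", "joy", "sadness", "neutral"]  # assumed label set

class EmotionRNN(nn.Module):
    """Recurrent classifier mapping a token-id sequence to emotion logits."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, len(EMOTIONS))

    def forward(self, token_ids):
        _, (h, _) = self.rnn(self.embed(token_ids))
        return self.out(h[-1])  # logits from the final hidden state

def self_train(model, labelled, unlabelled, epochs=5, threshold=0.9):
    """Learn emotions from labelled utterances, then pseudo-label new ones."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    model.train()
    for _ in range(epochs):
        for tokens, label in labelled:  # tokens: 1-D LongTensor of token ids
            opt.zero_grad()
            loss = F.cross_entropy(model(tokens.unsqueeze(0)),
                                   torch.tensor([label]))
            loss.backward()
            opt.step()
    # Tag new utterances whose predicted emotion is sufficiently confident;
    # these become new labelled data with which the model updates itself.
    new_labelled = []
    model.eval()
    with torch.no_grad():
        for tokens in unlabelled:
            probs = F.softmax(model(tokens.unsqueeze(0)), dim=-1)
            conf, pred = probs.max(dim=-1)
            if conf.item() >= threshold:
                new_labelled.append((tokens, pred.item()))
    return new_labelled
```

In practice, `new_labelled` would be appended to the labelled pool and `self_train` called again, iterating until the unlabelled pool yields no further confident predictions.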