Images of a plurality of users are captured concurrently with the plurality of users evincing a plurality of expressions. The images are captured using one or more eye tracking sensors implemented in one or more head mounted devices (HMDs) worn by the plurality of first users. A machine learnt algorithm is trained to infer labels indicative of expressions of the users in the images. A live image of a user is captured using an eye tracking sensor implemented in an HMD worn by the user. A label of an expression evinced by the user in the live image is inferred using the machine learnt algorithm that has been trained to predict labels indicative of expressions. The images of the users and the live image can be personalized by combining the images with personalization images of the users evincing a subset of the expressions.