Systems and methods for capturing media content in accordance with viewer expression are disclosed. In some implementations, a method is performed at a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors. The method includes: detecting display of a media content item; while the media content item is being displayed, detecting a viewer expression corresponding to a predefined viewer expression; and, in response to detecting the viewer expression, identifying a portion of the media content item corresponding in time to the viewer expression. In some implementations, the viewer expression is selected from: a facial expression, a body movement, a voice, or an arm, leg, or finger gesture, and is presumed to be a viewer reaction to the portion of the media content item.
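The identifying step can be illustrated with a minimal sketch, offered only as one possible reading of the method and not as the disclosed implementation. It assumes that some upstream detector has already labeled the viewer expression and recorded the playback time at which it occurred. The label set, the `ExpressionEvent` type, the `identify_portion` function, and the `lead_in` and `lead_out` window sizes are all hypothetical names and values introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical labels standing in for the "predefined viewer expressions".
PREDEFINED_EXPRESSIONS = {"smile", "laugh", "gasp", "thumbs_up"}


@dataclass(frozen=True)
class ExpressionEvent:
    """A viewer expression reported by an assumed upstream detector."""
    label: str            # e.g. a facial expression or gesture label
    playback_time: float  # seconds into the media content item


def identify_portion(
    event: ExpressionEvent,
    media_duration: float,
    lead_in: float = 5.0,   # assumed: seconds kept before the expression
    lead_out: float = 1.0,  # assumed: seconds kept after the expression
) -> Optional[Tuple[float, float]]:
    """Return the (start, end) of the portion corresponding in time to the
    expression, or None if the expression is not a predefined one or falls
    outside the media content item."""
    if event.label not in PREDEFINED_EXPRESSIONS:
        return None
    if not 0.0 <= event.playback_time <= media_duration:
        return None
    # Clamp the window to the bounds of the media content item.
    start = max(0.0, event.playback_time - lead_in)
    end = min(media_duration, event.playback_time + lead_out)
    return (start, end)
```

In this sketch, the window extends further back than forward on the assumption that a reaction trails the content that prompted it. A laugh detected 62 seconds into a 120-second item would therefore map to the portion from 57 to 63 seconds, while an expression outside the predefined set yields no portion at all.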