Eye gaze is an important non-verbal cue in human-human and human-machine interactions.
In this master's thesis, we explore optical
flow as a new feature carrying temporal information, added to
face and eyes to perform 3D gaze estimation from remote cameras in a mid-distance scenario. We
propose new models that combine face, optical
flow from the face between the last two frames,
eyes, and face landmarks as individual streams in a CNN to estimate gaze using the last two
images. We also develop a recurrent model that exploits the dynamic nature of gaze by feeding
the learned features of all the frames in a sequence to a many-to-one recurrent module that
predicts the 3D gaze vector of the last frame. Our experiments show that, with the addition
of temporal information from optical
flow, static models can perform just as well as recurrent
models while maintaining lower complexity and faster inference.
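
To make the many-to-one idea concrete, the following is a minimal sketch and not the architecture used in this thesis. It assumes a plain Elman recurrence in place of the actual recurrent module, random weights in place of trained ones, and random vectors in place of the per-frame CNN features. All names and sizes (many_to_one_gaze, FEATURE_DIM, HIDDEN_DIM, SEQ_LEN) are hypothetical. The only point illustrated is that every frame updates the hidden state, while a single 3D gaze vector is read out after the last frame.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def many_to_one_gaze(frame_features, w_in, w_rec, w_out):
    """Run a plain Elman recurrence over per-frame features and predict
    a unit-length 3D gaze vector from the final hidden state only."""
    hidden = [0.0] * len(w_rec)
    for features in frame_features:
        # Combine the current frame's features with the previous hidden state.
        pre = [a + b for a, b in zip(matvec(w_in, features), matvec(w_rec, hidden))]
        hidden = [math.tanh(v) for v in pre]
    # Many-to-one: only the last hidden state is mapped to an output.
    gaze = matvec(w_out, hidden)
    norm = math.sqrt(sum(g * g for g in gaze))
    return [g / norm for g in gaze] if norm > 0 else gaze


rng = random.Random(0)
FEATURE_DIM, HIDDEN_DIM, SEQ_LEN = 8, 16, 4  # hypothetical sizes
w_in = random_matrix(HIDDEN_DIM, FEATURE_DIM, rng)
w_rec = random_matrix(HIDDEN_DIM, HIDDEN_DIM, rng)
w_out = random_matrix(3, HIDDEN_DIM, rng)

# Stand-in for the learned CNN features of each frame in the sequence.
sequence = [[rng.uniform(-1, 1) for _ in range(FEATURE_DIM)] for _ in range(SEQ_LEN)]
gaze = many_to_one_gaze(sequence, w_in, w_rec, w_out)
print(gaze)
```

A static model, by contrast, would skip the loop and map the features of the last two frames directly to the gaze vector, which is where the lower complexity comes from.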