Video Object Segmentation by Tracking Structured Key Points and Contours
Tutor / director / evaluator: Pont Tuset, Jordi
Carried out at/with: Eidgenössische Technische Hochschule Zürich
Document type: Official Master's Thesis
Access conditions: Open access
In this thesis, we tackle the problem of video object segmentation, in which every pixel of every frame in a video sequence must be classified as background or foreground. Our algorithms fall in the semi-supervised category, i.e., they start with the object of interest annotated in the first frame and then track and segment that object in the following frames. The first algorithm that we have implemented describes the object of interest as a set of points distributed over the object and tracks them in the following frames. To make the tracking robust, we require the spatial distribution of these points to remain stable across frames. To do so, we place a mesh on top of the object mask, whose vertices are the interest points to track and whose edges define the spatial structure among them. We then compute a descriptor of the appearance of each point and look for the displacements that move each point to a location in the following frame with a similar descriptor. We enforce that the displacements of neighboring points are similar, which favors coherent deformations of the object. This algorithm may experience difficulties at the contours of the object, as the point descriptors there might be influenced by the background. To overcome this problem, our second algorithm tracks the contour of the object by imposing smooth deformations between frames. Starting from a polygonal representation of the object contour, we look for the locations in the following frame that have a strong edge-detector response while minimizing the deformation of the shape. Specifically, we build a multiscale pyramid of segments of the contour polygon and look for the displacement of every segment that matches the edge response while being coherent with the rest of the elements of the pyramid.
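Both trackers described above balance two goals: each tracked element (a mesh vertex or a contour segment) should land where the image evidence matches, and neighboring elements should move coherently. A minimal sketch of such an objective follows, assuming a simple sum of a matching cost and a squared-difference coherence penalty. The function name `tracking_energy`, the weight `smooth_weight`, and the quadratic penalty are illustrative assumptions, not the formulation used in the thesis.

```python
import numpy as np


def tracking_energy(match_cost, displacements, edges, smooth_weight=1.0):
    """Score one candidate set of displacements (lower is better).

    match_cost:    (N,) mismatch of each element at its displaced location
                   (hypothetical input, e.g. a descriptor distance or a
                   negated edge-detector response).
    displacements: (N, 2) candidate displacement of each element, in pixels.
    edges:         (E, 2) index pairs of neighboring elements.
    smooth_weight: hypothetical weight balancing the two terms.
    """
    match_cost = np.asarray(match_cost, dtype=float)
    displacements = np.asarray(displacements, dtype=float)
    edges = np.asarray(edges, dtype=int)

    # Data term: how poorly each element matches at its new location.
    data_term = match_cost.sum()

    # Coherence term: squared difference between displacements of neighbors.
    diff = displacements[edges[:, 0]] - displacements[edges[:, 1]]
    coherence_term = (diff ** 2).sum()

    return float(data_term + smooth_weight * coherence_term)


# Three elements in a chain, linked as 0-1 and 1-2.
edges = [(0, 1), (1, 2)]
costs = [0.2, 0.1, 0.3]

rigid = [[2, 0], [2, 0], [2, 0]]        # all elements move together
incoherent = [[2, 0], [-3, 4], [2, 0]]  # middle element moves against its neighbors

energy_rigid = tracking_energy(costs, rigid, edges)
energy_incoherent = tracking_energy(costs, incoherent, edges)
```

With equal matching costs, the rigid motion scores lower than the incoherent one, which is the behavior the coherence constraint is meant to favor. How the candidates are searched is left out of this sketch.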
This second algorithm can be understood as complementary to the first one, since it might fail on objects with low-contrast contours or against cluttered backgrounds. As an overall trade-off, we propose a combination of the two algorithms that tries to make the most of each of them and compensate for their weaknesses. To validate our approaches, we perform an extensive evaluation on DAVIS, a recently published database that provides fifty sequences with ground truth annotated in every frame. We sweep the parameters of the algorithms to find the best performance on this database. The results show that the contour algorithm outperforms the mesh algorithm, so the weaknesses described above are more prominent in the mesh algorithm. For the combination of both, although we have not been able to do a full search of the parameter space, the results obtained are promising and suggest that a more extensive search would let the combination outperform either standalone method. We also compare against six state-of-the-art algorithms. The comparison shows that, although we are still behind the best-performing ones, our approach might become competitive with further tuning and experimentation.
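The abstract does not name the measure used to compare a predicted segmentation against the ground truth. The DAVIS benchmark reports region similarity as the Jaccard index (intersection over union), so the sketch below assumes that measure. The function names `jaccard` and `sequence_score` and the handling of empty masks are illustrative choices, not taken from the thesis.

```python
import numpy as np


def jaccard(pred_mask, gt_mask):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks empty: treated as perfect agreement (an assumed convention).
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def sequence_score(pred_masks, gt_masks):
    """Mean per-frame Jaccard index over one sequence."""
    return float(np.mean([jaccard(p, g) for p, g in zip(pred_masks, gt_masks)]))


# Ground truth: a 2x2 object in a 4x4 frame.
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True

# Prediction: the same object shifted one pixel to the right.
shifted = np.zeros((4, 4), dtype=bool)
shifted[1:3, 2:4] = True

frame_score = jaccard(shifted, gt)                # 2 shared pixels out of 6 in the union
seq_score = sequence_score([gt, shifted], [gt, gt])
```

A parameter sweep like the one described above could then keep the parameter setting with the highest mean `sequence_score` across all sequences.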