Ponències/Comunicacions de congressos (Conference papers and communications)

http://hdl.handle.net/2117/3335
Wed, 17 Apr 2024 15:34:15 GMT

Uso de redes neuronales convolucionales para la detección remota de frutos con cámaras RGB-D (Use of convolutional neural networks for remote fruit detection with RGB-D cameras)
http://hdl.handle.net/2117/385563
Gené Mola, Jordi; Vilaplana Besler, Verónica; Rosell Polo, Joan Ramon; Morros Rubió, Josep Ramon; Ruiz Hidalgo, Javier; Gregorio López, Eduard
Remote fruit detection will be an indispensable tool for the optimized and sustainable agronomic management of future fruit orchards, with applications in yield forecasting, robotic harvesting, and production mapping. This work proposes the use of RGB-D depth cameras for the detection and subsequent 3D localization of fruits. The data acquisition equipment consists of a self-propelled ground platform fitted with two Microsoft Kinect v2 sensors and an RTK-GNSS positioning system. With this equipment, 3 rows of Fuji apple trees in a commercial orchard were scanned. The acquired dataset comprises 110 captures containing a total of 12,838 Fuji apples. Fruit detection was performed on the RGB data (the color images provided by the sensor). To this end, the Faster R-CNN object detection convolutional neural network was implemented and trained; it consists of two modules: a region proposal network and a classification network. Both modules share the first convolutional layers, which follow the VGG-16 model pre-trained on the ImageNet database. Test results show a detection rate of 91.4% of the fruits with 15.9% false positives (F1-score = 0.876).
Qualitative evaluation of the detections shows that the false positives correspond to image regions with a pattern very similar to an apple, where even the human eye finds it hard to determine whether an apple is present. Missed apples, on the other hand, are those that were almost entirely hidden by other vegetative organs (leaves or branches) or were cut off by the image borders. The experimental results lead to the conclusion that the Kinect v2 sensor has great potential for the detection and 3D localization of fruits. The main limitation of the system is that the performance of the depth sensor degrades under high illumination.
Mon, 27 Mar 2023 12:16:51 GMT
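The detection figures reported in the abstract above can be checked against each other with a short calculation. The sketch below assumes that the 91.4% detection rate is the recall and that the 15.9% of false positives is the share of detections that are wrong, so precision is 1 − 0.159; that reading is an assumption, not something the abstract states explicitly. The function and variable names are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures taken from the abstract.
recall = 0.914                 # detection rate (assumed to be recall)
false_positive_share = 0.159   # assumed: fraction of detections that are false

# Assumption: precision is the complement of the false-positive share.
precision = 1.0 - false_positive_share

f1 = f1_score(precision, recall)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Under this reading the computed value rounds to the F1-score of 0.876 given in the abstract, which suggests the interpretation is at least consistent with the reported numbers.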

Comparative study of upsampling methods for super-resolution in remote sensing
http://hdl.handle.net/2117/375123
Salgueiro Romero, Luis Fernando; Marcello Ruiz, Javier; Vilaplana Besler, Verónica
Many remote sensing applications require high spatial resolution images, but the elevated cost of these images makes some studies unfeasible. Single-image super-resolution algorithms can improve the spatial resolution of a low-resolution image by recovering feature details learned from pairs of low- and high-resolution images. In this work, several configurations of ESRGAN, a state-of-the-art algorithm for image super-resolution, are tested.
We compare several scenarios with different upsampling modes and channel combinations. The best results are obtained by training a model with RGB-IR channels and using progressive upsampling.
Thu, 27 Oct 2022 08:28:25 GMT

Sign language video retrieval with free-form textual queries
http://hdl.handle.net/2117/373923
Cardoso Duarte, Amanda; Albanie, Samuel; Giró Nieto, Xavier; Varol, Gül
Systems that can efficiently search collections of sign language videos have been highlighted as a useful application of sign language technology. However, the problem of searching videos beyond individual keywords has received limited attention in the literature. To address this gap, in this work we introduce the task of sign language retrieval with textual queries: given a written query (e.g. a sentence) and a large collection of sign language videos, the objective is to find the signing video that best matches the written query. We propose to tackle this task by learning cross-modal embeddings on the recently introduced large-scale How2Sign dataset of American Sign Language (ASL).
We identify that a key bottleneck in the performance of the system is the quality of the sign video embedding, which suffers from a scarcity of labelled training data. We therefore propose SPOT-ALIGN, a framework for interleaving iterative rounds of sign spotting and feature alignment to expand the scope and scale of available training data. We validate the effectiveness of SPOT-ALIGN for learning a robust sign video embedding through improvements in both sign recognition and the proposed video retrieval task.
Tue, 04 Oct 2022 11:56:25 GMT
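The retrieval task described in the abstract above, finding the video that best matches a written query in a shared embedding space, can be illustrated with a minimal ranking sketch. It assumes the text and video encoders have already produced vectors of the same dimension and that cosine similarity is the matching score. The toy vectors, the video identifiers and the function names are hypothetical, and nothing here reproduces SPOT-ALIGN or the authors' models.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_videos(query_embedding, video_embeddings):
    """Return (video_id, score) pairs sorted from best to worst match."""
    scored = [
        (video_id, cosine_similarity(query_embedding, embedding))
        for video_id, embedding in video_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings standing in for the outputs of a video encoder.
video_embeddings = {
    "video_a": [0.9, 0.1, 0.0],
    "video_b": [0.1, 0.8, 0.3],
    "video_c": [0.0, 0.2, 0.9],
}
# Hypothetical embedding of the written query from a text encoder.
query_embedding = [0.2, 0.9, 0.2]

ranking = rank_videos(query_embedding, video_embeddings)
best_video = ranking[0][0]
print(ranking)
```

In this toy setup the query vector points mostly along the second axis, so the video whose embedding does the same is ranked first. In a real system the quality of that ranking depends entirely on how well the two encoders are aligned, which is the bottleneck the abstract identifies.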

Channel-wise early stopping without a validation set via NNK polytope interpolation
http://hdl.handle.net/2117/366180
Bonet Solé, David; Ortega, Antonio; Ruiz Hidalgo, Javier; Sarath Shekkizhar, Sarath
State-of-the-art neural network architectures continue to scale in size and deliver impressive generalization results, although this comes at the expense of limited interpretability. In particular, a key challenge is to determine when to stop training the model, as this has a significant impact on generalization. Convolutional neural networks (ConvNets) comprise high-dimensional feature spaces formed by the aggregation of multiple channels, where analyzing intermediate data representations and the model's evolution can be challenging owing to the curse of dimensionality. We present channel-wise DeepNNK (CW-DeepNNK), a novel channel-wise generalization estimate based on non-negative kernel regression (NNK) graphs with which we perform local polytope interpolation on low-dimensional channels. This method leads to instance-based interpretability of both the learned data representations and the relationship between channels. Motivated by our observations, we use CW-DeepNNK to propose a novel early stopping criterion that (i) does not require a validation set, (ii) is based on a task performance metric, and (iii) allows stopping to be reached at different points for each channel. Our experiments demonstrate that our proposed method has advantages as compared to the standard criterion based on validation set performance.
Thu, 21 Apr 2022 13:13:39 GMT

H3D-Net: Few-shot high-fidelity 3D head reconstruction
http://hdl.handle.net/2117/357627
Ramon Maldonado, Eduard; Triginer Garcés, Gil; Escurt i Gelabert, Janna; Pumarola Peris, Albert; García Giráldez, Jaime; Giró Nieto, Xavier; Moreno-Noguer, Francesc
Recent learning approaches that implicitly represent surface geometry using coordinate-based neural representations have shown impressive results in the problem of multi-view 3D reconstruction. The effectiveness of these techniques is, however, subject to the availability of a large number (several tens) of input views of the scene, and computationally demanding optimizations.
In this paper, we tackle these limitations for the specific problem of few-shot full 3D head reconstruction, by endowing coordinate-based representations with a probabilistic shape prior that enables faster convergence and better generalization when using few input images (down to three). First, we learn a shape model of 3D heads from thousands of incomplete raw scans using implicit representations. At test time, we jointly overfit two coordinate-based neural networks to the scene, one modeling the geometry and another estimating the surface radiance, using implicit differentiable rendering. We devise a two-stage optimization strategy in which the learned prior is used to initialize and constrain the geometry during an initial optimization phase. Then, the prior is unfrozen and fine-tuned to the scene. By doing this, we achieve high-fidelity head reconstructions, including hair and shoulders, and with a high level of detail that consistently outperforms both state-of-the-art 3D Morphable Models methods in the few-shot scenario, and non-parametric methods when large sets of views are available.
Thu, 02 Dec 2021 09:12:55 GMT

Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data
http://hdl.handle.net/2117/357619
Mañas Sánchez, Óscar; Lacoste, Alexandre; Giró Nieto, Xavier; Vázquez Bermúdez, David; Rodriguez López, Pau
Remote sensing and automatic earth monitoring are key to solving global-scale challenges such as disaster prevention, land use monitoring, or tackling climate change. Although there exist vast amounts of remote sensing data, most of it remains unlabeled and thus inaccessible for supervised learning algorithms. Transfer learning approaches can reduce the data requirements of deep learning algorithms. However, most of these methods are pre-trained on ImageNet and their generalization to remote sensing imagery is not guaranteed due to the domain gap.
In this work, we propose Seasonal Contrast (SeCo), an effective pipeline to leverage unlabeled data for in-domain pre-training of remote sensing representations. The SeCo pipeline is composed of two parts. First, a principled procedure to gather large-scale, unlabeled and uncurated remote sensing datasets containing images from multiple Earth locations at different timestamps. Second, a self-supervised algorithm that takes advantage of time and position invariance to learn transferable representations for remote sensing applications. We empirically show that models trained with SeCo achieve better performance than their ImageNet pre-trained counterparts and state-of-the-art self-supervised learning methods on multiple downstream tasks. The datasets and models in SeCo will be made public to facilitate transfer learning and enable rapid progress in remote sensing applications.
Thu, 02 Dec 2021 08:37:28 GMT

How2Sign: A large-scale multimodal dataset for continuous American sign language
http://hdl.handle.net/2117/356423
Cardoso Duarte, Amanda; Palaskar, Shruti; Ventura Ripol, Lucas; Ghadiyaram, Deepti; DeHaan, Kenneth; Metze, Florian; Torres Viñals, Jordi; Giró Nieto, Xavier
One of the factors that have hindered progress in the areas of sign language recognition, translation, and production is the absence of large annotated datasets. Towards this end, we introduce How2Sign, a multimodal and multiview continuous American Sign Language (ASL) dataset, consisting of a parallel corpus of more than 80 hours of sign language videos and a set of corresponding modalities including speech, English transcripts, and depth. A three-hour subset was further recorded in the Panoptic studio enabling detailed 3D pose estimation. To evaluate the potential of How2Sign for real-world impact, we conduct a study with ASL signers and show that synthesized videos using our dataset can indeed be understood. The study further gives insights on challenges that computer vision should address in order to make progress in this field.
Dataset website: http://how2sign.github.io/
Mon, 15 Nov 2021 11:42:57 GMT

Refinement network for unsupervised on the scene foreground segmentation
http://hdl.handle.net/2117/332324
Pardàs Feliu, Montse; Canet Tarrés, Gemma
Unsupervised learning represents one of the most interesting challenges in computer vision today. The task has an immense practical value with many applications in artificial intelligence and emerging technologies, as large quantities of unlabeled images and videos can be collected at low cost. In this paper, we address the unsupervised learning problem in the context of segmenting the main foreground objects in single images. We propose an unsupervised learning system, which has two pathways, the teacher and the student, respectively.
The system is designed to learn over several generations of teachers and students. At every generation the teacher performs unsupervised object discovery in videos or collections of images, and an automatic selection module picks good frame segmentations and passes them to the student pathway for training. At every generation multiple students are trained, with different deep network architectures to ensure better diversity. The students at one iteration help in training a better selection module, forming together a more powerful teacher pathway at the next iteration. In experiments, we show that the improvement in the selection power, the training of multiple students and the increase in unlabeled data significantly improve segmentation accuracy from one generation to the next. Our method achieves top results on three current datasets for object discovery in video, unsupervised image segmentation and saliency detection. At test time, the proposed system is fast, being one to two orders of magnitude faster than published unsupervised methods. We also test the strength of our unsupervised features within a well-known transfer learning setup and achieve competitive performance, proving that our unsupervised approach can be reliably used in a variety of computer vision tasks.
Tue, 17 Nov 2020 14:34:07 GMT

Explore, discover and learn: unsupervised discovery of state-covering skills
http://hdl.handle.net/2117/332308
Campos Camúñez, Víctor; Trott, Alex; Xiong, Caiming; Socher, Richard; Giró Nieto, Xavier; Torres Viñals, Jordi
Acquiring abilities in the absence of a task-oriented reward function is at the frontier of reinforcement learning research. This problem has been studied through the lens of empowerment, which draws a connection between option discovery and information theory.
Information-theoretic skill discovery methods have garnered much interest from the community, but little research has been conducted in understanding their limitations. Through theoretical analysis and empirical evidence, we show that existing algorithms suffer from a common limitation: they discover options that provide poor coverage of the state space. In light of this, we propose 'Explore, Discover and Learn' (EDL), an alternative approach to information-theoretic skill discovery. Crucially, EDL optimizes the same information-theoretic objective derived from the empowerment literature, but addresses the optimization problem using different machinery. We perform an extensive evaluation of skill discovery methods on controlled environments and show that EDL offers significant advantages, such as overcoming the coverage problem, reducing the dependence of learned skills on the initial state, and allowing the user to define a prior over which behaviors should be learned.
Tue, 17 Nov 2020 11:28:47 GMT

Weakly supervised semantic segmentation for remote sensing hyperspectral imaging
http://hdl.handle.net/2117/192482
Moliner, Eloi; Salgueiro Romero, Luis Fernando; Vilaplana Besler, Verónica
This paper studies the problem of training a semantic segmentation neural network with weak annotations, for application to aerial vegetation images from Teide National Park. It proposes a Deep Seeded Region Growing system, which consists of training a semantic segmentation network from a set of seeds generated by a Support Vector Machine. A region growing algorithm module is applied to the seeds to progressively increase the pixel-level supervision. The proposed method performs better than an SVM, which is one of the most popular segmentation tools in remote sensing image applications.
Mon, 06 Jul 2020 09:58:41 GMT
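
The region growing step mentioned in the last abstract can be pictured with a small, generic sketch: labelled seed pixels spread their label to 4-connected neighbours that look similar enough. The similarity rule used here (absolute difference of a single per-pixel value within a tolerance) is an assumption chosen for illustration. The abstract does not state the actual growing criterion, and all names and the toy grid are hypothetical.

```python
from collections import deque

UNLABELED = -1


def grow_regions(values, seeds, tolerance):
    """Spread seed labels over a 2D grid by breadth-first region growing.

    values:    2D list with one scalar per pixel (hypothetical feature).
    seeds:     dict mapping (row, col) to an integer class label.
    tolerance: maximum allowed difference between neighbouring pixels
               (assumed similarity criterion).
    """
    rows, cols = len(values), len(values[0])
    labels = [[UNLABELED] * cols for _ in range(rows)]
    queue = deque()
    for (r, c), label in seeds.items():
        labels[r][c] = label
        queue.append((r, c))

    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and labels[nr][nc] == UNLABELED:
                # Grow only into neighbours that resemble the current pixel.
                if abs(values[nr][nc] - values[r][c]) <= tolerance:
                    labels[nr][nc] = labels[r][c]
                    queue.append((nr, nc))
    return labels


# Toy 3x3 image with two homogeneous areas and one ambiguous pixel.
values = [
    [0.1, 0.1, 0.9],
    [0.1, 0.5, 0.9],
    [0.1, 0.9, 0.9],
]
seeds = {(0, 0): 0, (0, 2): 1}
labels = grow_regions(values, seeds, tolerance=0.05)
for row in labels:
    print(row)
```

Pixels that resemble no neighbouring region stay unlabeled, which matches the idea of increasing pixel-level supervision progressively: only confident pixels are added to the training labels, and ambiguous ones are left out.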