Abstract
Detecting and pursuing various targets in space and time is a key scientific question in many vision-based perception scenarios. Recent developments in deep learning offer enhanced ways to represent targets in terms of their location, shape, appearance and motion. Learning can capture the significant variations seen in the training data while retaining class- or target-specific cues. It even allows for discovering specific correlations within an image of a 3D scene, since a perspective image contains many hints about an object's 3D location, orientation, size and identity. This single-image spatial reasoning task is the subject of ongoing research; however, detection failures, occlusion, and the presence of multiple interacting targets render it complex and still unsolved.

In this thesis, the integration of multiple learning-based representational enhancements is proposed to mitigate these problems and perform 3D multi-target detection and tracking more accurately. In these tasks, an attention mechanism can facilitate discovering the correlation between image features and spatial attributes. An attention-based representational enhancement is formulated to guide learning towards spatially aware features in both the backbone network and the re-identification branch of the neural architecture. As a second contribution, a representation is introduced that extends multi-task learning to incrementally learn new classes from a few (1-10) image samples without forgetting.

As monocular 3D estimation is an evolving field, the proposed scientific concepts take existing datasets, research methodologies and evaluation concepts into account. Additionally, a synthetic multi-target trajectory generation scheme was developed to complement the evaluation task, offering a variable number of moving and interacting targets with computed ground truth. The proposed method is evaluated on the KITTI multi-target tracking benchmark dataset, where it demonstrates competitive results against a baseline relying solely on a Kalman-filter-based kinematic association step. The elaborated research concept has also been validated in an applied scenario (the Bike2CAV project), where the time-varying spatial configuration of traffic participants is estimated from the viewpoint of a moving vehicle.

The main findings of this research indicate that, despite the monocular view ambiguity, the introduced representational enhancements lead to more accurate spatial localisation. Results also demonstrate that target re-identification is advantageous beyond simple kinematic modelling, leading to temporally more stable multi-target tracking performance. This advantage might become more pronounced with larger datasets containing extensive 3D pose and tracking annotations, indicating future research opportunities.
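The baseline mentioned above relies on a Kalman-filter-based kinematic association step. As a purely illustrative sketch (not the thesis's actual implementation), a minimal 1-D constant-velocity Kalman filter predict/update cycle in Python might look like this; all function names and noise parameters are hypothetical:

```python
# Minimal 1-D constant-velocity Kalman filter (illustrative only; the
# thesis's actual kinematic association step is not reproduced here).
# State x = [position, velocity]; P is its 2x2 covariance.

def kf_predict(x, P, dt=1.0, q=1e-2):
    """Propagate state and covariance one time step (F = [[1, dt], [0, 1]])."""
    x_pred = [x[0] + dt * x[1], x[1]]
    # P' = F P F^T + Q, with process noise q on the velocity component
    p00 = P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1]
    p01 = P[0][1] + dt * P[1][1]
    p10 = P[1][0] + dt * P[1][1]
    p11 = P[1][1] + q
    return x_pred, [[p00, p01], [p10, p11]]

def kf_update(x, P, z, r=1e-1):
    """Fuse a noisy position measurement z (H = [1, 0], noise variance r)."""
    y = z - x[0]                        # innovation
    s = P[0][0] + r                     # innovation covariance
    k0, k1 = P[0][0] / s, P[1][0] / s   # Kalman gain
    x_new = [x[0] + k0 * y, x[1] + k1 * y]
    P_new = [[(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
             [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]]]
    return x_new, P_new

# Track a target moving at roughly 1 unit/step from noisy position readings.
x, P = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
for z in [1.1, 1.9, 3.2, 4.0, 5.1]:
    x, P = kf_predict(x, P)
    x, P = kf_update(x, P, z)
```

In a multi-target setting, such a filter would supply predicted positions for matching detections to tracks; the thesis argues that appearance-based re-identification adds robustness beyond this kinematic cue alone.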
| Original language | English |
|---|---|
| Publication status | Published - 16 Apr 2022 |
Research Field
- Assistive and Autonomous Systems