Performing many simultaneous tasks on a resource-limited device is challenging because computational resources are scarce. Efficient and universal model architectures are key to solving this problem. Existing sub-fields of machine learning, such as Multi-Task Learning (MTL), have shown that learning multiple tasks with a single neural network architecture is possible and can even improve sample efficiency and memory efficiency while being less prone to overfitting. In Visual Question Answering (VQA), a model ingests multi-modal input to produce text-based responses in the context of an image. Our proposed architecture, TaskNet, merges the MTL and VQA concepts. TaskNet solves the visual MTL problem by using an input task to provide context to the network and guide its attention mechanism toward a relevant response. Our approach saves memory without sacrificing performance relative to naively training independent models. TaskNet efficiently provides multiple fine-grained classifications on a single input image and seamlessly incorporates context-specific metadata to further boost performance under high-variance operating conditions.
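As a rough illustration of the task-conditioned attention idea described above, the sketch below shows how a task embedding could query shared spatial image features and route the pooled result to a task-specific classification head. This is a minimal PyTorch sketch; the class name (TaskConditionedClassifier), dimensions, and routing scheme are assumptions for illustration, not the published TaskNet implementation.

import torch
import torch.nn as nn

class TaskConditionedClassifier(nn.Module):
    """Minimal sketch: a task embedding queries spatial features from a shared
    backbone, and the attention-pooled result goes to a per-task head.
    Layer names and sizes are illustrative only."""

    def __init__(self, num_tasks, classes_per_task, feat_dim=512, embed_dim=128):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, embed_dim)
        self.query_proj = nn.Linear(embed_dim, feat_dim)
        # One lightweight output head per task (fine-grained label sets may differ).
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, c) for c in classes_per_task]
        )

    def forward(self, feats, task_id):
        # feats: (B, feat_dim, H, W) spatial features from a shared backbone.
        # task_id: (B,) long tensor identifying the requested task per sample.
        b, d, h, w = feats.shape
        feats = feats.flatten(2).transpose(1, 2)            # (B, H*W, feat_dim)
        query = self.query_proj(self.task_embed(task_id))   # (B, feat_dim)
        attn = torch.softmax(
            torch.bmm(feats, query.unsqueeze(-1)).squeeze(-1) / d ** 0.5, dim=1
        )                                                    # (B, H*W) spatial weights
        pooled = (attn.unsqueeze(-1) * feats).sum(dim=1)     # (B, feat_dim)
        # Route each sample through its task's head; class counts may differ per task.
        logits = [self.heads[t](pooled[i]) for i, t in enumerate(task_id.tolist())]
        return logits, attn

In this sketch the same backbone features are reused for every task, so only the small query projection and output heads grow with the number of tasks, which is the memory-saving behavior the abstract describes.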
Multi-object tracking (MOT) is a crucial component of situational awareness in military defense applications. With the growing use of unmanned aerial systems (UASs), MOT methods for aerial surveillance are in high demand. Applying MOT to UAS imagery presents specific challenges, such as a moving sensor, changing zoom levels, dynamic backgrounds, illumination changes, obscurations, and small objects. In this work, we present a robust object tracking architecture designed to accommodate the noise of real-time operation. Our work follows the tracking-by-detection paradigm: an independent object detector is first applied to isolate all potential detections, and an object tracking model is then applied to link unique objects across frames. Object trajectories are constructed within a multiple hypothesis tracking (MHT) framework that selects the best hypothesis based on kinematic and visual scores. We propose a kinematic prediction model, called Deep Extended Kalman Filter (DeepEKF), in which a sequence-to-sequence architecture predicts entity trajectories in latent space. DeepEKF combines a learned image embedding with an attention mechanism trained to weight the importance of image regions when predicting future states. For the visual score, we experiment with different similarity measures that compute distances between entity appearances, including a convolutional neural network (CNN) encoder pre-trained using Siamese networks. In initial evaluation experiments, we show that our method, which combines the kinematic and visual scores within an MHT framework, improves performance, especially in edge cases where entity motion is unpredictable or the data contains frames with significant gaps.
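To make the visual scoring step concrete, the sketch below shows one way an appearance encoder could be pre-trained in a Siamese setup with a contrastive loss, and how its embedding similarity could be fused with a kinematic score into a single hypothesis score. All names (AppearanceEncoder, contrastive_loss, track_score), the layer sizes, and the convex fusion weighting are assumptions for illustration, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AppearanceEncoder(nn.Module):
    """Minimal sketch of a CNN encoder producing appearance embeddings of the
    kind that could be pre-trained in a Siamese setup; sizes are illustrative."""

    def __init__(self, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, x):
        # x: (B, 3, H, W) image chips cropped around each detection.
        z = self.features(x).flatten(1)
        return F.normalize(self.fc(z), dim=1)  # unit-norm embeddings


def contrastive_loss(z1, z2, same, margin=1.0):
    # Standard contrastive loss for Siamese pre-training: pull embeddings of the
    # same entity together, push different entities at least `margin` apart.
    # `same` is a float tensor of 1s (same entity) and 0s (different entities).
    d = (z1 - z2).pow(2).sum(dim=1).sqrt()
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()


def track_score(kinematic_score, z1, z2, alpha=0.5):
    # Illustrative fusion of kinematic and appearance evidence into one hypothesis
    # score; the convex weighting is an assumption, not the paper's scoring rule.
    visual_sim = F.cosine_similarity(z1, z2, dim=-1)  # in [-1, 1] for unit vectors
    return alpha * kinematic_score + (1 - alpha) * visual_sim

In an MHT-style tracker, a score of this form would be evaluated for each candidate detection-to-track association, and the hypothesis with the highest combined score would be kept.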