Camera work is a critical aspect of conveying the atmosphere and impressions in movies, and it plays a vital role in video analysis. This research proposes a method to estimate camera work from monocular videos by analyzing the optical flow within the video frames. Our method facilitates the estimation of camera work in videos featuring dynamic subjects by incorporating semantic segmentation. Additionally, it is capable of distinguishing between zoom and dolly movements, which previous works have not achieved. The method uses the relationship between image depth, optical flow, and image coordinates to perform such classification.
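A minimal sketch, not the authors' implementation: dense optical flow is computed with OpenCV, pixels belonging to dynamic subjects are removed with a segmentation mask (assumed to come from any semantic segmentation model), and the remaining flow is decomposed into translational (pan/tilt) and radial (zoom/dolly-like) components. Separating zoom from dolly additionally requires depth, as the abstract notes, which is omitted here.

    import cv2
    import numpy as np

    def classify_camera_work(prev_gray, curr_gray, person_mask):
        # Dense optical flow between consecutive frames (Farneback method).
        flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        h, w = prev_gray.shape
        ys, xs = np.mgrid[0:h, 0:w]
        static = person_mask == 0                     # keep only background pixels
        u, v = flow[..., 0][static], flow[..., 1][static]
        # Translational component of the flow -> pan/tilt.
        pan, tilt = u.mean(), v.mean()
        # Radial component around the image center -> zoom/dolly-like motion.
        cx, cy = w / 2.0, h / 2.0
        rx, ry = xs[static] - cx, ys[static] - cy
        radial = (u * rx + v * ry) / (np.hypot(rx, ry) + 1e-6)
        zoom_like = radial.mean()                     # > 0: zoom-in / dolly-forward pattern
        return pan, tilt, zoom_like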
The catcher framing technique in baseball has recently gained attention in game analysis. This technique involves a catcher adjusting their catching motion to increase the likelihood of an umpire calling a pitch a strike. Its success is typically evaluated based on the strike rate at the boundary of the strike zone, calculated from pitch trajectory data obtained through a tracking system. However, evaluating catcher framing in games without a tracking system is challenging, and alternative methods based on different types of information are needed. This research proposes a method to detect the catcher's mitt movement trajectory during catcher framing, which is considered useful information apart from the pitch trajectory. The method applies object detection, pose estimation, and deep learning to videos of baseball pitching scenes.
3D spatial recognition is a fundamental technology that supports automated driving. For example, the accuracy of the vehicle's processing depends on the accuracy of the depth information around the vehicle body. While methods that geometrically measure the depth of the captured space by applying stereo vision to images taken by multiple cameras have become widely used, it is difficult to measure depth in poorly textured or occluded regions. On the other hand, the advent of deep-learning-based depth estimation from monocular images has made it possible to estimate depth in such regions. However, if the observation conditions differ between training and estimation, the accuracy of the estimation declines. This paper proposes a complementary method that integrates both approaches by using a convolutional autoencoder.
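A minimal sketch, not the paper's network: a convolutional autoencoder that takes a stereo depth map and a monocular-estimation depth map as two input channels and outputs a single fused depth map. Layer sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    class DepthFusionAE(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
            )

        def forward(self, stereo_depth, mono_depth):
            # Stack the two depth estimates as channels and reconstruct a fused map.
            x = torch.cat([stereo_depth, mono_depth], dim=1)   # (B, 2, H, W)
            return self.decoder(self.encoder(x))               # (B, 1, H, W)

    # Example: fused = DepthFusionAE()(torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64))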
In this study, we proposed a method for detecting a player's movement trajectory from video images to quantitatively evaluate the defensive range of an outfielder in baseball. Using this method, we succeeded in accurately estimating and visualizing the movement trajectory of a player by identifying only specific players and by estimating the homography induced by changes in the viewing angle through feature-point matching with SIFT.
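A minimal sketch of the view-change compensation step: SIFT feature matching between two frames and homography estimation with RANSAC, using standard OpenCV calls (thresholds are the usual defaults, not values from the paper).

    import cv2
    import numpy as np

    def estimate_homography(img1, img2):
        sift = cv2.SIFT_create()
        kp1, des1 = sift.detectAndCompute(img1, None)
        kp2, des2 = sift.detectAndCompute(img2, None)
        matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
        # Lowe's ratio test to keep only distinctive matches.
        good = [m for m, n in matches if m.distance < 0.75 * n.distance]
        src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        return H   # maps points in img1 to img2, cancelling camera view changes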
This research proposes a method based on image recognition technology to guide the visually impaired to the nearest Braille block in situations where they cannot find Braille blocks on their own. In this paper, we make four proposals. First, we propose a method for detecting Braille blocks using SSD, a deep learning network, trained on our own image dataset. The dataset covers various kinds of Braille blocks, including out-of-specification ones, whose appearance differs indoors and outdoors due to weathering. Second, we propose a detection method that can handle differences in camera height. By constructing a training dataset with images of Braille blocks taken at the height of a person or of a typical robot, we can achieve detection with cameras at different heights. Third, we propose a standalone method for real-time recognition of Braille blocks using the cameras of mobile devices. We incorporate MobileNet, a lightweight deep-learning network, into the SSD network. As a result, we achieve standalone, real-time Braille block recognition by optimizing the network for mobile devices. Finally, we propose a method of estimating the orientation of and shortest distance to the Braille blocks. It exploits the continuous and linear arrangement pattern of the guiding blocks that indicate the direction of travel.
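A minimal sketch of the final step, assuming SSD has already returned bounding boxes of guiding blocks: a line is fitted to the box centers to obtain the path direction and the perpendicular distance from the user's position, here taken (as an assumption) to be the bottom-center of the image.

    import numpy as np

    def path_direction_and_distance(boxes, image_w, image_h):
        # boxes: list of (x1, y1, x2, y2) for detected guiding blocks
        centers = np.array([[(x1 + x2) / 2, (y1 + y2) / 2] for x1, y1, x2, y2 in boxes])
        mean = centers.mean(axis=0)
        # Principal direction of the block arrangement (largest singular vector).
        _, _, vt = np.linalg.svd(centers - mean)
        direction = vt[0]
        user = np.array([image_w / 2, image_h])        # assumed user position in the frame
        to_mean = mean - user
        # Perpendicular distance from the user point to the fitted line (in pixels).
        dist = abs(to_mean[0] * direction[1] - to_mean[1] * direction[0])
        angle = np.degrees(np.arctan2(direction[1], direction[0]))
        return angle, dist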
Advanced, safe, and secure walking support could be possible if the next step of a visually impaired person could be estimated before the foot actually touches the ground. We define the next step as the step that is about to touch the ground, and we aim to predict its behavior. In this study, we investigate a method to realize next-step landing position prediction using an IMU on a smartphone and discuss the possibility of next-step prediction. Since it is difficult for a person to walk while holding a smartphone steadily against the body, next-step prediction using only an IMU is considered an ill-posed problem. In this study, we assume that there is a pattern in the sway of a person's gait. To learn the pattern, we created a training dataset consisting of pairs of IMU outputs, next-step positions, and smartphone positions to realize the next-step landing position prediction. We used angle and distance errors as the evaluation indices to validate the next-step landing position prediction.
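A minimal sketch of the learning setup under stated assumptions (the study does not specify the model here): a regressor maps a fixed window of IMU samples to the next-step landing position, and angle/distance errors are one plausible way to compute the evaluation indices mentioned above.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def train_step_predictor(imu_windows, landing_positions):
        # imu_windows: (N, T*6) flattened accel+gyro windows; landing_positions: (N, 2)
        model = MLPRegressor(hidden_layer_sizes=(128, 64), max_iter=500)
        model.fit(imu_windows, landing_positions)
        return model

    def angle_and_distance_error(pred, true):
        # Distance error: Euclidean gap between predicted and true landing points.
        dist_err = np.linalg.norm(pred - true, axis=1)
        # Angle error: difference of the direction angles of the landing vectors.
        ang_pred = np.arctan2(pred[:, 1], pred[:, 0])
        ang_true = np.arctan2(true[:, 1], true[:, 0])
        ang_err = np.degrees(np.abs(np.arctan2(np.sin(ang_pred - ang_true),
                                               np.cos(ang_pred - ang_true))))
        return ang_err.mean(), dist_err.mean()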
KEYWORDS: Video, Head, Cameras, Image quality, RGB color model, 3D modeling, Education and training, Deep learning, Image processing, 3D image processing
This paper introduces a method to generate a 4D portrait of a person that can be played over a long period. A 4D portrait is a free-viewpoint video of a person with temporal changes in facial expression. In our proposed method, the parameters that represent the person's facial expressions and head poses are obtained from video captured by a monocular RGB camera with a continuously moving viewpoint. A neural radiance field (NeRF) is trained from the captured video and the estimated parameters. Using the radiance field, the 4D portrait is generated based on the similarity of the person's facial expressions.
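A minimal sketch, not the paper's network: a NeRF-style MLP conditioned on per-frame expression parameters, so the radiance field can reproduce expression changes. Positional encoding, pose handling, and volume rendering are omitted; all dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ExpressionNeRF(nn.Module):
        def __init__(self, pos_dim=63, dir_dim=27, expr_dim=76, hidden=256):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Linear(pos_dim + expr_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.sigma = nn.Linear(hidden, 1)                   # volume density
            self.color = nn.Sequential(
                nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
                nn.Linear(hidden // 2, 3), nn.Sigmoid(),        # RGB
            )

        def forward(self, pos_enc, dir_enc, expr):
            # Condition the field on the expression parameters of the current frame.
            h = self.trunk(torch.cat([pos_enc, expr], dim=-1))
            return self.sigma(h), self.color(torch.cat([h, dir_enc], dim=-1))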
Owing to the impact of COVID-19, the venues for dancers to perform have shifted from the stage to the media. In this study, we focus on the creation of dance videos that allow audiences to feel a sense of excitement without disturbing their awareness of the dance subject and propose a video generation method that links the dance and the scene by utilizing a sound detection method and an object detection algorithm. The generated video was evaluated using the Semantic Differential method, and it was confirmed that the proposed method could transform the original video into an uplifting video without any sense of discomfort.
One of the sports training methods is VR training, where users watch a video image to recognize the situation, estimate the timing to move, and move their bodies according to the situation. In this paper, we propose a real-time visual feedback method for sports motion information in a VR experience of Alpine skiing. We prepare a carefully designed visual feedback panel that presents the user's center of gravity and head height as sports motion information. An HMD presents the skiing situation on the slopes in VR180 format. The load sensor of our preliminary system is placed under the user's feet and acquires the center-of-gravity position. The tracking function of the HMD estimates head height. In the evaluation experiment, we investigated the appropriate parameters to realize good visibility of the visual feedback panel during VR training.
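A minimal sketch of the center-of-gravity computation, assuming four load cells at known (x, y) positions under the user's feet (the sensor layout is illustrative, not from the paper): the center of gravity is the load-weighted mean of the sensor positions.

    import numpy as np

    SENSOR_POS = np.array([[-0.15, 0.10], [0.15, 0.10],     # front-left, front-right (m)
                           [-0.15, -0.10], [0.15, -0.10]])  # rear-left, rear-right

    def center_of_gravity(loads):
        loads = np.asarray(loads, dtype=float)               # four load-cell readings
        return (SENSOR_POS * loads[:, None]).sum(axis=0) / loads.sum()

    # Example: center_of_gravity([20.0, 25.0, 30.0, 25.0]) -> weighted (x, y) position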
At resource mining sites, drilling into the ground and bedrock is frequently performed for geological surveys and for filling blast holes with explosives. However, extracting core samples from underground boreholes is time-consuming, labor-intensive, and difficult to evaluate quantitatively. Visualization of boreholes, realized by computer vision technology, allows engineers to evaluate the geological characteristics of underground rocks to determine trajectories and overall budgets. With the help of VR (Virtual Reality) simulation, this research develops a multi-view fiberscope camera system that obtains videos of a borehole to generate a high-quality 3D borehole model using Structure from Motion (SfM), one of the 3D photogrammetric techniques.
KEYWORDS: Cameras, Video, Head, Visualization, Distributed computing, Data processing, Associative arrays, Video surveillance, Video processing, Scientific research
We propose TWIN HEAD SLAM, a method to build an integrated and consistent environment map from two camera heads that move freely relative to each other. Visual SLAM with multiple cameras is called cooperative SLAM. The updated environment map is shared by two tracking modules that are attached to the two cameras separately. Our contribution is that the integrated environment map is updated frame by frame from both camera inputs. This means that the key features obtained by one camera are instantly available to the tracking module of the other camera. We have implemented the proposed method based on OpenVSLAM and confirmed that the two video inputs from the two cameras are used to build a single consistent environment map that is shared by the localization modules of the two cameras.
We propose a three-stage navigation method that guides a visually impaired person to a hand-sized target object within walking distance using sound guidance. The advantage of our proposed method is that it lets visually impaired people reach a target object they should touch, using only a camera-equipped wearable device. It can be applied to any indoor situation because our proposed system needs only a vision-based pre-registration process in which a single video trajectory is set in advance. The navigation is decomposed into three stages: path navigation, body navigation, and hand navigation. For the walking stage, we utilize the Clew app, which is sufficient for this purpose. For the two successive stages, we introduce an AR anchor, which must be registered on the target object in advance. Our sound guidance is carefully designed to let the subject reach the target with hand-size resolution, and the stage changes are signaled by vibration. We have conducted a preliminary evaluation with our smartphone-based system and confirmed that the proposed method can navigate users to a hand-sized target starting from a position 5 meters away.
Visually impaired people avoid dangers by using a white cane. Because they use it as an extension of their arm, we consider it a part of their body. Our goal is to recognize, in images, visually impaired people using a white cane and how they use it. We propose a method to estimate the posture of a person with a white cane by extending an existing pose estimation model, OpenPose. In our method, we incorporate the white cane as a part of the human skeleton model. We constructed a database of images of visually impaired people with a white cane to train the network for the extended human skeleton model. We also developed a method to determine whether the left or right hand holds the white cane in the training images, because right-handed and left-handed users must be trained separately. The motion of the white cane can be analyzed from the result of posture estimation; we focus on the angle of the white cane and analyze its swing frequency. Throughout our experiments, we confirmed that our preliminary system successfully estimated the human posture with the white cane and the swing frequency of the white cane.
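A minimal sketch of the swing-frequency analysis: given the cane angle per frame (assumed to come from the extended skeleton model), the dominant swing frequency is taken from the peak of the FFT of the mean-removed angle signal.

    import numpy as np

    def swing_frequency(angles_deg, fps):
        a = np.asarray(angles_deg, dtype=float)
        a = a - a.mean()                                   # remove the constant offset
        spectrum = np.abs(np.fft.rfft(a))
        freqs = np.fft.rfftfreq(len(a), d=1.0 / fps)
        return freqs[1:][np.argmax(spectrum[1:])]          # skip the DC component

    # Example: swing_frequency(angle_series, fps=30) -> swings per second (Hz)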
In the field of competitive swimming, performance investigation in official games is important and useful for performance development. We have been working on swimmer position estimation from a wide video view of official games. The challenges are water splash and complicated reflections of light that may hide swimmers from the camera. To overcome these problems, we utilize YOLOv3 and prepare a dedicated dataset of swimmers' heads in real games. The trained YOLOv3 detects heads with 48.1% mAP. In addition to position estimation, we also propose a new method to investigate the status of the strokes over time by detecting two head classes: over the water and under the water. We prepare another dedicated dataset for this two-class training. With the trained YOLOv3, we successfully visualize the status change of a swimmer over a whole game.
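A minimal sketch of the stroke-status step under stated assumptions: per frame, the trained detector returns head detections labeled "over" or "under" the water; taking the most confident label per frame yields a timeline that can be plotted over the whole race.

    def stroke_status_timeline(per_frame_detections):
        # per_frame_detections: list (one entry per frame) of lists of
        # (class_name, confidence, box) tuples from the head detector.
        timeline = []
        for dets in per_frame_detections:
            if not dets:
                timeline.append("unknown")      # splash or occlusion hid the head
            else:
                best = max(dets, key=lambda d: d[1])
                timeline.append(best[0])        # "over" or "under"
        return timeline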
Laparoscopic surgery offers patients such advantages as small incisions and quick postoperative recovery. Unfortunately, surgeons struggle to grasp 3D spatial relationships in the abdominal cavity. Methods have been proposed to present the 3D information of the abdominal cavity using AR or VR. Although 3D geometric information is crucial for such methods, it is difficult to reconstruct dense 3D organ shapes using a feature-point-based 3D reconstruction method such as structure from motion (SfM) due to the appearance characteristics of organs (e.g., texture-less and glossy surfaces). Our research solves this problem by estimating depth information from laparoscopic images using deep learning. We constructed a training dataset of paired RGB and depth images captured with an RGB-D camera, implemented a depth image generator by applying a generative adversarial network (GAN), and generated a depth image from a single-shot RGB image. By calibrating the laparoscopic camera against the RGB-D camera, the laparoscopic image was transformed into an RGB image for the generator. We generated depth images by inputting the transformed laparoscopic images into the GAN generator. The scale parameter that gives the depth image real-world dimensions was calculated by comparing the depth values with the 3D information estimated by SfM. Consequently, the density of the organ model was increased by back-projecting the depth image into 3D space.
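A minimal sketch of the last two steps: the metric scale is estimated by comparing the GAN-predicted depth with sparse SfM depths at the same pixels (median ratio is an assumption, not the paper's stated estimator), and the scaled depth map is back-projected into 3D using the camera intrinsics.

    import numpy as np

    def estimate_scale(pred_depth, sfm_points_px, sfm_depths):
        # sfm_points_px: (N, 2) pixel coordinates of SfM points; sfm_depths: (N,)
        pred_at_pts = pred_depth[sfm_points_px[:, 1], sfm_points_px[:, 0]]
        return np.median(sfm_depths / (pred_at_pts + 1e-9))

    def back_project(depth, fx, fy, cx, cy):
        # Pinhole back-projection of a depth map into an (H, W, 3) point cloud.
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        x = (u - cx) * depth / fx
        y = (v - cy) * depth / fy
        return np.stack([x, y, depth], axis=-1)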
When detecting people in a video of a wide area, there are three problems. First, the apparent sizes of people are likely to be very different depending on their positions in the scene. Second, the density of people is unevenly distributed in the scene, and processing the entire image implies a high video processing cost. Third, there are regions where people never appear, and these regions should not be processed. To solve these problems, we aim to detect people regardless of their positions and sizes, focusing on the regions where people can appear. To achieve this aim, we propose a video dividing method, based on people's positions and sizes in the video, for high-resolution video detection, and a method to eliminate the divided regions where people will not appear. We adopt a pyramid representation for video division and integration. We skip processing the regions where no people have been detected for a certain amount of time. Furthermore, we use semantic segmentation to eliminate regions where people never appear. We choose 4K high-resolution videos looking down on wide areas from a fixed camera for the experiment. We test the proposed method on the PANDA dataset and on videos taken at Tsukuba.
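A minimal sketch of the region-skipping idea (tile granularity and the idle timeout are illustrative assumptions): the frame is divided into tiles, each tile remembers when a person was last detected in it, and tiles that have stayed empty beyond the timeout are skipped. Permanently person-free regions removed by semantic segmentation would simply never be added to the scheduler.

    import time

    class TileScheduler:
        def __init__(self, n_rows, n_cols, idle_timeout=30.0):
            # Every tile starts as "active" so it is processed at least once.
            self.last_hit = {(r, c): time.time()
                             for r in range(n_rows) for c in range(n_cols)}
            self.idle_timeout = idle_timeout

        def tiles_to_process(self):
            now = time.time()
            return [t for t, last in self.last_hit.items()
                    if now - last < self.idle_timeout]

        def report_detection(self, tile):
            # A person was found in this tile, so keep processing it.
            self.last_hit[tile] = time.time()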
We propose a new verification approach for the subjective evaluation of actions and reactions in virtual reality. We have been working on subject behavior analysis in VR. A head-tracking HMD allows users to move around and perform actions within a room-sized space, and their experiences can be measured by head position, motion estimation, and gaze tracking when these are available. In most cases, however, subjective evaluations of the total VR experience are conducted only after the experience is over. We have succeeded in integrating a new EEG device, which can measure EEG and mind indexes in real time, with a gaze-trackable and head-trackable HMD. By referring to the gaze tracking results and the EEG analysis and/or mind indexes when a situation is presented in VR, we can check that the subject is actually watching the relevant objects in the situation and measure their attention level. This may support the reliability of the scores of subjective evaluations, which is very important when we need to conduct a number of subjective evaluations with a small number of subjects. We have developed our preliminary system for two applications, sport action analysis and traffic safety evaluation, and confirmed that it is a promising approach.
We are researching navigation for the visually impaired. We propose a new interface that utilizes sound and vibration to support turn-by-turn navigation for visually impaired people. In our proposed interface, the target path is divided into straight segments and points of direction change. The navigation instructions given by sound and vibration are carefully designed to give minimum yet sufficient cues for visually impaired walking. We have implemented a preliminary system based on our proposal and conducted a subject experiment with visually impaired people. The results imply that our proposed approach is useful for visually impaired people.
This paper proposes a shot detection method using the poses of a player in a badminton video sequence. In the proposed method, the hit timing is detected by focusing on the arm movements of the player and analyzing the swing movement using skeletal information. Simple shot information is then estimated by connecting the player positions in the hit-timing frames. Through an experiment to verify hit timing detection, we confirmed that the detection is highly accurate and that shot information detection is achieved.
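A minimal sketch of one way to realize hit-timing detection from skeletal information (the paper's exact criterion is not reproduced here): frames where the wrist speed shows a prominent local peak are taken as hit-timing candidates; the thresholds are illustrative.

    import numpy as np
    from scipy.signal import find_peaks

    def detect_hit_frames(wrist_xy, min_speed=15.0, min_gap_frames=10):
        wrist_xy = np.asarray(wrist_xy, dtype=float)         # (T, 2) wrist positions per frame
        speed = np.linalg.norm(np.diff(wrist_xy, axis=0), axis=1)
        peaks, _ = find_peaks(speed, height=min_speed, distance=min_gap_frames)
        return peaks + 1                                      # frame indices of candidate hits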
Pitching with the correct form is essential for preventing injury and improving skills, yet it is not easy for athletes and instructors to check whether a pitcher is throwing with the correct form. In this study, we record a pitcher from the direction of the catcher with a monocular camera and estimate the skeleton pose of the pitcher using OpenPose. We propose a new method to evaluate whether the pitcher pitches with the correct form by examining the estimated pose. We use the SSE (Shoulder, Shoulder, Elbow) line as an evaluation index: when the upper body of the pitcher faces the batter, the SSE line should be straight. To find the frame at which the pitcher's body turns squarely to the batter, the distance between the shoulders in each video frame is used; when it becomes largest, the shape of the SSE line is measured. Since the motion of the pitcher is fast, we used a 240 fps camera to investigate the relationship between the shape of the SSE line and the shoulder distance, and we discussed pitching properties based on this evaluation.
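A minimal sketch of the SSE-line evaluation, assuming OpenPose keypoints for both shoulders and the throwing-arm elbow: the frame with the largest shoulder distance is selected, and straightness is measured as the angle at the throwing-side shoulder (180 degrees would mean a perfectly straight SSE line). The specific angle measure is an assumption for illustration.

    import numpy as np

    def shoulder_distance(l_sh, r_sh):
        return np.linalg.norm(np.asarray(l_sh) - np.asarray(r_sh))

    def sse_angle(glove_shoulder, throw_shoulder, throw_elbow):
        # Angle at the throwing-side shoulder between the two SSE-line segments.
        a = np.asarray(glove_shoulder) - np.asarray(throw_shoulder)
        b = np.asarray(throw_elbow) - np.asarray(throw_shoulder)
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
        return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    # Usage: pick the frame where shoulder_distance() is largest, then check how
    # close sse_angle() is to 180 degrees in that frame.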
The availability of pedestrian location estimation is one of the critical issues in realizing a reliable navigation system for pedestrians in daily scenes. We propose a new pedestrian location estimation system that utilizes both the image-retrieval approach we have developed and a SLAM (Simultaneous Localization and Mapping) approach. Both approaches need only a single camera unit as a sensor, and in both, the location is estimated by computer vision technology. The problem here is that a high processing cost is required to operate the two approaches simultaneously, which could be impractical on a single wearable computing unit. We solve the problem by executing the two approaches on two separate computers connected by a computer network. We have implemented a preliminary system that unites the two approaches in a hybrid fashion over two computers and measured its performance in typical daily scenes on our campus. The result is promising for further implementation.
A new method of estimating swimmer position in swimming pool video is proposed. The video of swimming games is taken from a high seat row in the audience seating area, so that it can cover the whole field of a swimming pool. The swimming pool video is transformed so that each lane can be analyzed along the lane direction. The foreground region, which includes both the swimmer and their water splash, is extracted by adaptive background modeling and by setting a mask region to cope with the influence of the non-planar water surface. Then, based on color analysis of the water splash, the swimmer region is extracted. The position is estimated as the center of a Gaussian distribution fitted to the swimmer region. The proposed method was applied to a nationwide swimming game.
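A minimal sketch of the extraction step: adaptive background modeling with OpenCV's MOG2 subtractor, followed by estimating the swimmer position as the mean of the foreground pixel coordinates (the color-based splash removal and the lane mask described above are omitted here).

    import cv2
    import numpy as np

    bg_model = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)

    def swimmer_position(lane_frame):
        fg = bg_model.apply(lane_frame)                     # adaptive background model
        ys, xs = np.nonzero(fg > 0)
        if len(xs) == 0:
            return None
        return float(xs.mean()), float(ys.mean())           # Gaussian mean of foreground region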