Deep Learning-Based Object Detection in Augmented Reality: A Systematic Review


University of Illinois at Chicago; Chonnam National University

2022

Yalda Ghasemi, Heejin Jeong, Sung Ho Choi, Kyeong-Beom Park, Jae Yeol Lee

Tags

deep learning, augmented reality, object detection, computer vision

Summary

This paper systematically reviews studies that integrate augmented/mixed reality and deep learning for object detection. It analyzes application areas and the use of different computation platforms (remote servers vs. local AR devices), and discusses the advantages and limitations of the technology.

Full Transcript


Computers in Industry 139 (2022) 103661. Contents lists available at ScienceDirect. Journal homepage: www.elsevier.com/locate/compind

Deep learning-based object detection in augmented reality: A systematic review

Yalda Ghasemi (a), Heejin Jeong (a,*), Sung Ho Choi (b), Kyeong-Beom Park (b), Jae Yeol Lee (b,*)
(a) Department of Mechanical and Industrial Engineering, University of Illinois at Chicago, USA
(b) Department of Industrial Engineering, Chonnam National University, South Korea
* Corresponding authors. E-mail addresses: [email protected] (H. Jeong), [email protected] (J.Y. Lee).
https://doi.org/10.1016/j.compind.2022.103661

Article history: Received 21 December 2021; Received in revised form 18 March 2022; Accepted 19 March 2022; Available online 6 April 2022.

Keywords: Deep learning; Object detection; Augmented reality; Mixed reality; Computation platform; Systematic review

Abstract

Recent advances in augmented reality (AR) and artificial intelligence have made these technologies pioneers of innovation and change across fields and industries. The fast-paced developments in computer vision (CV) and augmented reality have facilitated analyzing and understanding the surrounding environment. This paper systematically reviews studies that integrated augmented/mixed reality and deep learning for object detection over the past decade. Five sources, including Scopus, Web of Science, IEEE Xplore, ScienceDirect, and ACM, were used to collect data. A total of sixty-nine papers were analyzed from two perspectives: (1) application analysis of deep learning-based object detection in the context of augmented reality, and (2) analysis of the use of servers or local AR devices to perform the object detection computations, to understand the relation between object detection algorithms and AR technology. Furthermore, the advantages of using deep learning-based object detection to solve AR problems, and the limitations hindering the ultimate use of this technology, are critically discussed. Our findings affirm the promising future of integrating AR and CV. © 2022 Elsevier B.V. All rights reserved.

Contents

1. Introduction
2. Augmented reality technologies and devices
2.1. Wearable devices
2.2. Handheld devices (HHD)
2.3. Projector-based displays
2.4. Holographic displays
3. Deep learning-based object detection
3.1. Convolutional neural network (CNN)
3.2. Region-based convolutional neural network (R-CNN)
3.3. Fast R-CNN
3.4. Faster R-CNN
3.5. Mask R-CNN
3.6. YOLO
3.7. Single-shot multi-box detector (SSD)
4. Review methodology
4.1. Study selection
5. Object detection in AR
5.1. Deep learning-based object detection in AR
5.1.1. Applications
5.1.2. Computation platform (local vs. server)
6. Summary of review results
7. Discussion
8. Conclusion
Declaration of Competing Interest
Acknowledgments
References

1. Introduction

The advent of the digitalized world has enhanced the quality of information and resulted in a tremendous amount of data generation. A wide variety of technologies leverage these data to provide novel solutions and perform traditional tasks more innovatively. Augmented reality (AR) is a prominent example of such technologies and has become one of the most popular technology trends of the current era (del Amo et al., 2018; Geng et al., 2020). AR can be defined as an extended version of the physical world overlaid with digital content, bridging the real and virtual environments. AR systems should accurately identify the real environment and its components to work best. Most AR technologies can recognize a 3D spatial map of the components by real-time scanning of the environment.

There is a rapid growth of marker-based and markerless techniques to identify real-world content in AR systems (Katiyar et al., 2015). In marker-based AR, objects can be localized and tracked using physical markers attached to real objects. This technique often suffers from inaccurate detections, especially when there is a considerable distance between the AR camera and the real object, or when an obstacle between the camera and the object causes occlusion. In addition, the markers should not reflect light, and their black and white colors should have a strong contrast. To overcome such limitations, markerless AR was proposed; it recognizes the real environment based on spatial geometry and does not require any trigger markers attached to the space. This technique often results in more accurate detections by obtaining the spatial map of the environment and its components. However, markerless applications need to recognize a textured flat surface to augment the digital content effectively.

Deep learning techniques can overcome typical marker-based and markerless AR deficiencies and provide faster and more accurate detections. The final goal of deep learning-based object detection is to recognize and locate one or multiple objects in a specific frame. The entire frame is the input, and the output is each object's bounding box together with the probability of its recognition. Deep learning-based object detection identifies the existing objects in an image or video and demonstrates where they are located (i.e., object localization) and which category they belong to (i.e., object classification) (Sharma et al., 2020; He et al., 2016). A bounding box discriminates the detected object from the background and other objects in the frame. Furthermore, image segmentation can be applied to assign particular class labels to each pixel of an image (Chen et al., 2014). The performance of image segmentation is hampered when one image includes multiple objects of the same class. To overcome this issue, instance segmentation was proposed to provide a pixel-level localization distinction among objects by discerning between objects of the same class.

Deep learning has been reviewed and investigated extensively for object detection applications, and previous studies used deep learning for object detection in various applications (Zamora-Hernández et al., 2021; Park et al., 2021; Khan et al., 2021). However, less attention has been paid to reviewing deep learning-based object detection techniques in AR and their current and future directions. This paper provides a comprehensive literature review and an in-depth discussion of deep learning-based object detection in AR and describes significant research trends in this field. In addition, this study elaborates on how these techniques can be used to enhance the performance of an AR system. It also examines the effectiveness of deep learning algorithms for object detection in AR, along with their advantages and disadvantages.

The remainder of this paper is organized as follows. In Section 2, AR technology and its standard devices are introduced. Section 3 presents an introduction to deep learning-based object detection algorithms. In Section 4, the review method of this paper is described.
In Section 5, previous studies on object detection in AR are reviewed, and important information from each reviewed article is summarized in Section 6. Section 7 presents a discussion on the limitations of current deep learning-based object detection approaches in AR. Finally, Section 8 includes conclusions and future directions of this area of research.

2. Augmented reality technologies and devices

This section provides an introduction to AR technologies and devices along with their advantages and disadvantages. AR was first introduced in the 1960s with limited functionalities. Still, it has made tremendous progress over the past decades and is becoming increasingly popular, being used in many industries and applications. Generally, many devices support AR technology, although their capabilities may differ to some degree from one to another. Despite this difference, all AR devices consist of two elements: an image-generating optical unit producing the virtual content, and a projection surface displaying the produced virtual content to the users. AR devices can generally be divided into four categories: wearable devices, handheld devices, projection-based displays (also known as spatial AR), and holographic displays.

2.1. Wearable devices

Wearable AR devices are the most advanced devices and are typically worn on the user's head. These devices often take the form of helmets, goggles, or glasses. Wearables are also called optical heads-up displays since they superimpose computer-generated content onto a see-through display in front of the user's eyes.
These devices do not impede the user's vision and are only responsible for augmenting the digital content onto the scene. Wearables give users the freedom to use both of their hands while interacting with AR content. Popular wearables (e.g., Microsoft HoloLens, Google Glass, Epson Moverio, and Magic Leap) are being used in the military to train soldiers in simulated environments (Livingston et al., 2011; Hidalgo et al., 2021), in industry to provide real-time smart task assistance to workers (Park et al., 2020a), or to provide data entry interfaces to office workers (Singh et al., 2021). On the other hand, these devices are generally less likely to be accepted by the public, considering their current design.

2.2. Handheld devices (HHD)

Handheld devices (HHD), or mobile devices, are among the most popular and easy-to-use devices for AR applications. Mobile AR allows users to create snapshots of the enhanced environments. Many camera-equipped devices such as smartphones or tablets can support AR and make the AR experience widely available. Since the acceptability of some AR devices, especially in public, is still a challenge, HHDs are a practical solution for everyday uses of AR technology such as navigation and gaming. HHDs are lightweight and do not need sophisticated hardware to offer an AR experience (Chowdhury et al., 2013).

2.3. Projector-based displays

Projectors directly overlay information onto the physical world without any mediator device. A projector-based display can turn any surface into a screen using the projection mapping technique, which scans the environment by combining visible-light cameras with depth sensors to map the shape of the objects.
It then overlays 2D content, ranging from images and videos to simple guidance lights, onto the environment. Users have the freedom of working with both hands when interacting with elements created by projector-based displays, and they do not need to wear a bulky headset. Furthermore, working with projector-based displays does not require an extensive amount of training. However, this approach suffers from some disadvantages. Since projectors are not equipped with Inertial Measurement Unit (IMU) sensors, other external sensors are needed to make interactions possible. Moreover, physical objects can occlude the AR information. In addition, since the projector is fixed to a particular location, all parts of the real environment may not be overlaid with AR content in complex physical spaces.

2.4. Holographic displays

A holographic display is a form of display that creates 3D digital content using diffraction of light. Like projector-based displays, holographic displays do not need a device to mediate showing the augmented information to the user (Lin and Wu, 2017). Holographic displays use holograms instead of graphic images to produce projected pictures: they beam white light or laser light onto the holograms, and the projected light produces bright two- or three-dimensional images. While plain daylight is sufficient for simple holograms, true 3D images require laser-based holographic projectors. Such images can be viewed from different angles and with a true perspective.

3. Deep learning-based object detection

This section provides an introduction to the most common algorithms used in deep learning-based object detection, as well as their advantages and disadvantages. Object detection is important in computer vision since it intelligently identifies and analyzes a scene in a given frame. Depending on the context, the detection task can be divided into several categories, such as face detection, pose detection, and pedestrian detection, all of which have been used in various applications such as autonomous vehicles (Y. Li et al., 2020), robotics (Choi et al., 2022; Zhou et al., 2022; Chen et al., 2008), and security (Jain, 2019). Applying deep learning-based object detection in AR has been a challenge for researchers. The collected data is usually used to train a model if it is unique and the existing models are unsatisfactory; otherwise, a pre-trained model is used to make the predictions. There are mainly two approaches to implement these computations: servers and local devices. Depending on the complexity of the model and the amount of training data, the training process can happen either on a local device or on a remote server.

3.1. Convolutional neural network (CNN)

Convolutional neural networks, or CNNs in short, are the simplest and most widely used deep learning algorithms for object detection. In a plain CNN approach, an image is first divided into separate pieces. The algorithm takes each of these pieces as input and, after passing it through convolution and pooling layers, outputs the object's class. The problem with this approach is that objects may cover different portions of the frame, so it requires many regions, which results in a massive amount of computation time. To overcome these issues, a faster approach is needed that reduces the number of regions by obtaining them through proposal methods.
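Since this region-wise classification scheme underlies all the detectors below, a minimal sketch may help. The following PyTorch code (ours, purely illustrative, with an assumed 64x64 window and toy architecture) classifies crops with a small CNN and scans an image with a brute-force sliding window, which is exactly the costly step that the proposal-based methods of Sections 3.2-3.4 were designed to avoid.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Toy classifier for 64x64 RGB crops (illustrative, not from the paper)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.head(x.flatten(1))

def sliding_window_detect(model, image, win=64, stride=32, thresh=0.9):
    """Classify every window of a [3, H, W] tensor. This brute-force region
    scan is the computational bottleneck that proposal methods remove."""
    model.eval()
    hits = []
    _, H, W = image.shape
    with torch.no_grad():
        for y in range(0, H - win + 1, stride):
            for x in range(0, W - win + 1, stride):
                crop = image[:, y:y + win, x:x + win].unsqueeze(0)
                probs = torch.softmax(model(crop), dim=1)[0]
                score, cls = probs.max(0)
                if score.item() > thresh:
                    hits.append((x, y, win, win, cls.item(), score.item()))
    return hits
```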
3.2. Region-based convolutional neural network (R-CNN)

The region-based convolutional neural network, or R-CNN, works on a specific number of regions. The algorithm extracts a group of boxes, or regions of interest (ROI), using proposal methods such as Selective Search (Uijlings et al., 2013), and checks whether an object exists in each specific region. To use the R-CNN algorithm, first, a pre-trained convolutional neural network is chosen. Then, based on the number of classes that need to be detected, the network's last layer is re-trained. Next, the regions are reshaped to match the network input size. After features are extracted for the regions, conventional classifiers such as a Support Vector Machine (SVM) can be used to classify the objects. Finally, techniques such as linear regression are used to refine the bounding box of each predicted class. Although R-CNN is a practical algorithm for object detection, it often suffers from low computational speed.

3.3. Fast R-CNN

Fast R-CNN was proposed to reduce the computational time of the R-CNN algorithm (Girshick, 2015). In this method, after taking an image as input, a convolutional neural network is applied to generate the ROIs. Next, an ROI pooling layer is applied to all regions to reshape them into a fixed size. Lastly, a softmax layer and linear regression are used to output classes and bounding boxes, respectively.

3.4. Faster R-CNN

The major difference between Fast R-CNN and Faster R-CNN is that the former uses selective search to generate region proposals, whereas the latter leverages a Region Proposal Network (RPN), discussed extensively in Ren et al. (2015). Faster R-CNN takes an image as input and passes it through a convolutional backbone, generating the feature map of the image. The RPN is then applied to these feature maps, returning the object proposals and objectness scores. An ROI pooling layer is applied to these proposals to bring them all down to the same size. Finally, the proposals are passed to a fully connected layer with a softmax layer and a linear regression layer on top, to classify the objects and output their bounding boxes. R-CNN-based algorithms rely on sub-regions, and none of them considers the complete image at once. Since these systems operate in consecutive stages, the performance of each stage heavily depends on the performance of the previous stage.
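Many of the reviewed systems start from a pre-trained detector rather than designing one (see the introduction of Section 3). As a hedged illustration of how a Faster R-CNN model is typically consumed in practice, the sketch below uses torchvision's COCO pre-trained implementation; the image path and score threshold are placeholders, and the reviewed studies used their own models and data.

```python
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Pre-trained on COCO; newer torchvision releases use the `weights=` argument
# instead of `pretrained=True`.
model = fasterrcnn_resnet50_fpn(pretrained=True).eval()

img = Image.open("frame.jpg").convert("RGB")  # placeholder path
x = to_tensor(img)                            # [3, H, W], floats in [0, 1]

with torch.no_grad():
    # The model takes a list of image tensors and returns one dict per image
    # with 'boxes' (x1, y1, x2, y2), 'labels', and 'scores'.
    pred = model([x])[0]

for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
    if score > 0.8:  # confidence threshold; an arbitrary choice here
        print(label.item(), round(score.item(), 3), box.tolist())
```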
3.5. Mask R-CNN

The Mask R-CNN algorithm is an extension of Faster R-CNN that adds a mask network branch, predicting a segmentation mask for each ROI in parallel with object classification and bounding-box regression (He et al., 2017). It involves two stages. The first stage consists of two networks, a backbone and a region proposal network, which run once per image to produce a set of region proposals. In the second stage, the bounding boxes and object classes are predicted for each of the regions proposed in the first stage.

3.6. YOLO

You Only Look Once, or YOLO, is one of the most popular object detection algorithms for real-time applications and often works best in terms of speed and outcome. It was first introduced by Redmon et al. (2016). Unlike region-based algorithms, this algorithm looks at the whole input frame at once and predicts how to identify, classify, and localize the objects in it. This approach suits real-time object detection since it is faster than region-based methods. YOLO takes an image as input and divides it into an S×S grid, predicting for each cell whether an object is centered in it; using this information, YOLO can predict the class of the objects. Unlike region-based methods, which require thousands of network evaluations per image, in YOLO the input frame passes through a single network evaluation. At the time of this review, YOLO had three versions, each improving detection accuracy over the previous one.
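To make the S×S grid prediction concrete, the following sketch (ours, not the authors' code) decodes a YOLO-v1-style output tensor into image-space boxes. The tensor layout, B boxes of five values per cell plus C class probabilities, follows Redmon et al. (2016); the threshold is illustrative.

```python
import numpy as np

def decode_yolo_grid(pred, img_w, img_h, S=7, B=2, C=20, thresh=0.25):
    """pred: array of shape (S, S, B*5 + C), one forward pass of the network.
    Each box stores (x, y, w, h, conf); x and y are offsets within the cell,
    w and h are relative to the whole image, as in YOLO v1."""
    boxes = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[B * 5:]
            cls = int(np.argmax(class_probs))
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:b * 5 + 5]
                score = conf * class_probs[cls]   # class-specific confidence
                if score < thresh:
                    continue
                cx = (col + x) / S * img_w        # cell offset -> image coords
                cy = (row + y) / S * img_h
                bw, bh = w * img_w, h * img_h
                boxes.append((cx - bw / 2, cy - bh / 2, bw, bh, cls, float(score)))
    return boxes  # a real pipeline would apply non-maximum suppression here
```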
3.7. Single-shot multi-box detector (SSD)

The SSD algorithm was first introduced by Liu et al. (2016). Unlike RPN-based approaches, which need two shots to generate region proposals and detect objects, this method needs only one shot. The SSD algorithm operates in two steps: extracting the feature maps, and applying convolutional filters to detect objects. SSD is faster than region-based algorithms since it eliminates the need for a region proposal network. This elimination may affect detection accuracy, but SSD compensates for the drawback with improvements such as multi-scale feature maps and default boxes, achieving higher accuracy even at lower input resolutions.

4. Review methodology

This review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) (Moher et al., 2010) to explore the literature in the field of deep learning-based object detection in augmented/mixed reality over the past ten years. Five databases were selected for the search: Scopus, Web of Science, IEEE Xplore, ACM, and ScienceDirect. These databases provide a wide variety of research studies in the corresponding fields.

4.1. Study selection

The search terms in this study were "deep learning" AND "object detection" AND ("augmented" OR "mixed reality"). These terms were combined to extract the studies that leveraged deep learning-based object detection in the context of augmented or mixed reality technology. The literature search was limited to articles published in the past decade, from 2011 to 2020; however, most of the articles directly related to this review were published after 2016. The initial search resulted in 4835 papers, and after deleting duplicates, 4136 unique articles remained. After screening titles and skimming abstracts, 3727 papers were excluded from the study due to irrelevant titles and abstracts. The full texts of the 409 remaining articles were then screened, and 69 research articles that met our inclusion criteria were selected as the most relevant papers for the current study's purpose and included in the review. Fig. 1 shows the summary of the PRISMA process used for this study.

Fig. 1. PRISMA process for selecting related studies.

For this systematic review, the following information was extracted from the 69 papers:

- Types of deep learning algorithms
- Types of AR devices
- Computation platform
- Dependent and independent variables
- User study
- Year of publication
- Application or scope

5. Object detection in AR

Similar to many other areas, object detection has been used extensively in AR. Depending on the available amount of data, the configuration of the AR device, and the goal of detection, several types of algorithms can be used in AR applications. Traditional object detection in AR mainly consists of marker-based methods and statistical classifiers. One of the first studies on picture properties and pictorial pattern recognition was introduced by Rosenfeld and Pfaltz (1966), which elaborates on techniques for computer processing of pictorial information. Another early work in this field is an image processing approach that decomposes an image into primitive pieces as a basis for reference component description (Fischler and Elschlager, 1973). These studies were primarily based on matching techniques and part-based algorithms. Later studies focused on classifiers such as the Viola-Jones Haar cascade, Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), and Speeded-Up Robust Features (SURF). However, traditional object detection algorithms used in AR often suffer from issues such as lack of accuracy or being computationally intensive. Recent research in AR leveraged deep learning-based object detection to alleviate such challenges, either replacing traditional object detection methods or complementing them to mitigate their shortcomings.

5.1. Deep learning-based object detection in AR

Object detection based on deep neural networks has been used extensively in AR. Compared to the traditional techniques explained in Section 5, deep learning-based object detection is a novel and successful method. The main difference between detection based on neural networks and statistical classifiers is that the latter is designed by hand, whereas a neural network must learn from a large sample of data to represent the response variable successfully. This section divides the previous works into several categories, including applications and computation platforms, and presents a comprehensive review of each category.

5.1.1. Applications

Previous studies on deep learning-based object detection in AR are centered on several areas including, but not limited to, education, manufacturing, aerospace, and robotics. Among all areas, manufacturing, autonomous vehicles, and assistance are the most common. Some papers are not focused on specific areas and only provide a framework for object detection. Table 1 summarizes the applications used in the reviewed studies (Paper ID); note that the Paper ID is used in the upcoming tables and figures to indicate each reviewed study. This section introduces the use of deep learning-based object detection in AR for the three most common applications: manufacturing, driving, and assistive technologies.
Table 1. Application areas used in the reviewed studies.

- Manufacturing: Ramakrishna et al. (2016), Subakti and Jiang (2018), S. Wang et al. (2018), Židek et al. (2019a), Židek et al. (2019b), Tao et al. (2019), Corneli et al. (2019), Sun et al. (2019), Kim and Lee (2019), Zheng et al. (2020), Park et al. (2020a), Park et al. (2020b), Lai et al. (2020), Kästner et al. (2021), Konstantinidis et al. (2020)
- Navigation assistance for elderly and disabled people, and shopping: Advani et al. (2017), Eckert et al. (2018), Lin et al. (2018), Cruz et al. (2019), Park et al. (2019), Fuchs et al. (2019), McKelvey et al. (2019), Fuchs et al. (2020), Nilwong and Capi (2020)
- Driving: Abdi and Meddeb (2017), Abdi et al. (2017), Abdi and Meddeb (2018), Alhaija et al. (2018), Anderson et al. (2019), Pai et al. (2020), Deore et al. (2020), Zhou et al. (2020)
- Robotics: R. Wang et al. (2018), De Gregorio et al. (2019), Farasin et al. (2020), Kästner et al. (2020)
- Education: Karambakhsh et al. (2019), Huynh et al. (2019), Plecher et al. (2020)
- Healthcare: Wang et al. (2019), Waithe et al. (2020)
- Search and rescue: Llasag et al. (2019)
- Firefighting: Bhattarai et al. (2020)
- Geometric perception: Han et al. (2020)
- Sudoku puzzle: Syed et al. (2020)
- Children guessing game: Putze et al. (2020)
- Geovisualization: Rao et al. (2017)
- Obstacle detection: Połap et al. (2017)
- Virtual agent positioning: Lang et al. (2019)
- Not specified: Tobías et al. (2016), Ran et al. (2017), Sutanto et al. (2017), Liu and Han (2018), Mahurkar (2018), Rodrigues et al. (2018), Bahri et al. (2019), Liu et al. (2019), Apicharttrisorn et al. (2019), Huang et al. (2019), Li et al. (2019), X. Li et al. (2020), Hu et al. (2020), Rathnayake et al. (2020), Le et al. (2020), Ahn et al. (2020), Dasgupta et al. (2020), Golnari et al. (2020), Cheng et al. (2020), Lomaliza and Park (2020)

5.1.1.1. Manufacturing. A major application of deep learning-based object detection in AR is in the context of manufacturing, where AR and computer vision have been used to make factories smarter while facilitating tasks for workers. Subakti and Jiang (2018) proposed a mobile AR system that recognizes three different industrial machines and their components; digital information is sent to the smartphone to be superimposed on the machine images shown on the display. By leveraging the touch screen and distance perception of smartphones, this system provides two modes of interaction: touch and distance-aware interactions. Assembly is a major task carried out in manufacturing environments. To reduce the issues of manual assembly, S. Wang et al. (2018) proposed a new assembly fault detection approach based on deep learning and mixed reality (MR), which requires training a pretreatment model, detecting targets via deep learning, and extracting feature information. This method could significantly improve equipment efficiency and reduce assembly errors. Židek et al. (2019b) used deep learning to identify assembly parts and speed up the assembly process with an AR application and dynamic recognition, demonstrating the potential of their approach for improving assembly tasks. They also presented a methodology for speeding up the CNN training process based on the automated generation of input sample data, removing monotonous manual work and significantly shortening sample preparation time. A portable visual device based on binocular vision and deep learning was developed by Zheng et al. (2020) to realize fast detection and recognition of cable brackets installed on aircraft airframes. It consisted of three subsystems: bracket inspection, cable text reading, and AR-based assembly process guidance. It could assist workers in quickly inspecting the state of brackets by showing the installation path of the cables to be assembled. This approach improved the assembly efficiency and quality of the aircraft cable assembly process.
A worker-centered training and assistance system for intelligent manufacturing was proposed in Tao et al. (2019). The worker's state was perceived with multi-modal sensing and deep learning methods and used to determine potential guiding demands; active instructions were then provided through AR to suit the worker's needs. The experiment showed the feasibility and promising results of applying the proposed system for training and assisting frontline workers. Park et al. (2020a) introduced a deep learning-based mobile AR system for smart task assistance, and in a further study they proposed a user-centered AR method that proved to be faster than marker-based AR while overcoming the limitations of existing interactions in wearable AR, such that complex tasks can be performed more accurately and effectively (Park et al., 2020b). An AR instructional system integrated with Faster R-CNN for mechanical assembly was proposed in Lai et al. (2020). A synthetic tool dataset was developed using data augmentation with CAD models and successfully deployed to detect real tools. The experimental results on the assembly task indicated a considerable improvement in assembly performance compared to conventional methods. An AR-based human assistance system for complex manual tasks incorporating deep neural networks was proposed by Kästner et al. (2020); AR was combined with object and action detectors to make workflows more intuitive and flexible. In another study, AR and computer vision (CV) techniques were utilized to support novice operators in maintenance procedures: a mobile AR maintenance assistant using a handheld device's camera was introduced by Konstantinidis et al. (2020) to generate AR maintenance instructions, and the performance of this system showed promising results in a real-world scenario. Fig. 2 provides an example of (a) assembly part detection using AR and deep learning and (b) task instructions in an AR display.

Fig. 2. Results of tool/part detection and providing instructions in manufacturing using the R-CNN algorithm (Tao et al., 2019).

5.1.1.2. Driving and autonomous vehicles. Deep learning-based object detection has been used in driving and autonomous vehicles for obstacle avoidance and for increasing drivers' performance by enhancing their awareness of the environment. An AR head-up display (HUD) and deep learning can be used to recognize road obstacles and to interpret and predict complex traffic situations, which can significantly improve the driving experience (Abdi and Meddeb, 2017). A real-time approach for traffic sign recognition has been employed using deep learning. This approach improves the accuracy of the traffic sign detector to assist the driver in various driving situations, increase driving comfort, and reduce traffic accident risks. Experimental results showed that the suggested method was comparable to state-of-the-art approaches but with less computational complexity and shorter training time. It was also noted that AR affects the allocation of visual attention most strongly during the decision-making phase (Abdi and Meddeb, 2018). In a study by Alhaija et al. (2018), synthetic data for urban driving scenes were generated by combining AR and computer vision, suitable for training deep neural networks. However, synthetic objects can only be placed on top of real images; they thus cannot be partially occluded by real objects. In addition, Deore et al. (2020) developed an algorithm combining deep learning and AR in the context of autonomous vehicles. The trained deep learning model performed well on detecting the AR artificial navigational signs; as these objects were clearer than real signs, they were easier for the algorithm to detect. However, the artificial objects created using AR during testing hid important surrounding details, and in real self-driving vehicles such details can be any human or moving object that must be detected to avoid an accident. An augmented reality environment for drivers, where important information is displayed in holograms, was proposed in Anderson et al. (2019); real-time object and lane detection was studied to enhance the driver's ability to avoid collisions. A driver assistance system that uses a deep learning-based network model was developed by Pai et al. (2020). A camera captured the image of the vehicle ahead, and trained network models predicted the type and position of the forward object and the ground road signs.
This gave the driver a view of the road ahead with information about possible hazards and warnings of danger to enhance safety. Fig. 3 shows an example of object detection results using (a) a traditional method (Viola-Jones) and (b) a deep learning-based method (YOLO).

Fig. 3. Results of traditional and deep learning-based object detection algorithms used in the autonomous driving study (Anderson et al., 2019).

5.1.1.3. Assistance. AR applications with deep learning are not only used in workplaces and modern technologies; they have also been used to assist people in their daily routine activities. Some critical applications of these assistive approaches pertain to helping the elderly or enhancing navigation for visually impaired people. Using deep learning approaches including pose estimation, object and face detection, and a spatial AR technology, Park et al. (2019) provided alerts and assistance for the daily activities of older adults in their real environment. In addition, deep learning-based object detection in AR can be leveraged to enhance the navigation experience for all people regardless of impairment, especially when navigating a new environment for the first time. A novel campus navigation app that uses AR to provide users with a new wayfinding experience was introduced in Lin et al. (2018). Using the combination of AR and deep learning-based object recognition, information about the campus environment was overlaid on the real world, creating an interactive interface. To improve the app's efficiency, the paper presented a virtual terrain modeling interface with deep learning to improve the object recognition ability. Studies were also conducted to address robot navigation. Nilwong and Capi (2020) presented a deep reinforcement learning-based robot outdoor navigation method using visual information. Experimental results showed the high effectiveness of the navigation system inside the simulation, and the real-world experiments showed the potential of the game-based Deep Q-Network (DQN) and a simple marker-based AR method for simple navigation tasks over short distances. The implemented DQN was trained in a game-based simulation environment and then directly deployed on the real robot without any changes.

To enhance the shopping experience and guide visually impaired people, Advani et al. (2017) developed a system providing tactile feedback from a custom glove equipped with a camera and vibration motors, as well as auditory and visual feedback from a pair of smart glasses. They described the various features incorporated into this visual-assistance system in multiple contexts while highlighting the efficiency of personal visual assistance systems in day-to-day activities. A system showing AR models to users directly in the store was proposed by Cruz et al. (2019). This system provides user navigation and improves the localization of certain products in the store. However, limitations such as requiring an Internet connection, high power consumption on mobile devices, a dynamically changing environment, and the localization difference between the captured picture and the system response may cause failures to guide the user accurately. MR headset-mediated technologies for food and grocery selection are feasible, but little is known about their impact on user choice and other outcomes in the real world, or about users' opinions on the efficiency of such systems. Fuchs et al. (2020) presented a novel framework that combines several research streams into a user support system for healthy food choices, addressing the research gap in the joint application of CV-based detection of food items. They used an MR wearable headset to explore the technical feasibility and potential impact of MR food labels in affecting beverage and food purchasing choices. They assessed whether visual cues in the form of front-of-package labels (i.e., Nutri-Score) influence consumers in preferring and selecting healthy or unhealthy beverages and foods, and analyzed consumers with low food literacy. They also included an in-depth discussion on the latency of product detection via CV to assess the technical feasibility of detecting packaged products under realistic circumstances. Fig. 4 shows an implementation of AR and deep learning for enhancing (a) the shopping experience and (b) navigation in large retail stores.

Fig. 4. An example of object detection to assist with shopping navigation using ResNet (Cruz et al., 2019).
5.1.2. Computation platform (local vs. server)

Deep learning is capable of making AR/MR systems smarter. However, to implement deep learning-based object detection, several considerations arise when choosing between a remote server and a local device. Fig. 5 shows the two processes for the object detection computation. This section discusses the advantages and disadvantages of each approach and explains some use cases from the literature. Capability, computation cost, complexity, and size of the model are important factors in choosing the computation method. Table 2 provides a summary of the computation platforms used in each reviewed paper.

Fig. 5. Two processes of the object detection computation: (a) remote server-based process, (b) local device-based process.

Table 2. Computation platforms used in the reviewed studies.

- Server [Paper ID]: 2, 3, 6, 7, 10, 11, 12, 17, 18, 19, 21, 22, 23, 27, 29, 30, 34, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 60, 62, 63, 64, 65, 66, 67, 68
- Local [Paper ID]: 8, 9, 13, 14, 15, 16, 20, 24, 25, 26, 28, 31, 32, 35, 59, 61, 69
- Server and local (both) [Paper ID]: 1, 4, 5, 33

For remote server-based computation, the client (i.e., the AR device) captures frames and sends them to the remote server; the model processes the data on the server, and the output is sent back to the client. Network connection and delays are important factors during this process: operating in an environment with high delay can increase the object detection latency. Nevertheless, this method is generally faster than using a local device since it can process a larger amount of data in a shorter time. Due to their high memory bandwidth and their ability to conduct numerous parallel computations, GPUs have become a widely accepted platform for training deep learning models, and the quality of the GPU strongly affects the performance of the models. In contrast, using a local device to implement deep learning computations is more convenient and flexible but requires lighter algorithms to execute effectively; otherwise, time and accuracy must be sacrificed.
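As a concrete illustration of the remote server-based process in Fig. 5(a), the following minimal sketch (assumed, not taken from any reviewed system) shows both halves of the loop: an OpenCV client that JPEG-encodes each frame and posts it to a hypothetical /detect endpoint, and a small Flask server that runs a detector and returns the boxes for overlay. The endpoint, address, threshold, and model choice are all illustrative assumptions.

```python
# --- server.py: runs on the GPU machine (illustrative sketch) ---
import cv2
import numpy as np
import torch
from flask import Flask, request, jsonify
from torchvision.models.detection import fasterrcnn_resnet50_fpn

app = Flask(__name__)
model = fasterrcnn_resnet50_fpn(pretrained=True).eval()

@app.route("/detect", methods=["POST"])
def detect():
    # Decode the JPEG bytes sent by the AR client (BGR -> RGB, CHW, [0, 1]).
    frame = cv2.imdecode(np.frombuffer(request.data, np.uint8), cv2.IMREAD_COLOR)
    x = torch.from_numpy(frame[:, :, ::-1].copy()).permute(2, 0, 1).float() / 255
    with torch.no_grad():
        pred = model([x])[0]
    keep = pred["scores"] > 0.7  # arbitrary confidence threshold
    return jsonify(boxes=pred["boxes"][keep].tolist(),
                   labels=pred["labels"][keep].tolist())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

# --- client.py: runs on the AR device side (illustrative sketch) ---
# import cv2, requests
#
# SERVER = "http://192.168.0.10:5000/detect"   # hypothetical server address
# cap = cv2.VideoCapture(0)
# while True:
#     ok, frame = cap.read()
#     if not ok:
#         break
#     ok, jpg = cv2.imencode(".jpg", frame)    # compress before sending
#     resp = requests.post(SERVER, data=jpg.tobytes(),
#                          headers={"Content-Type": "application/octet-stream"})
#     detections = resp.json()                 # boxes come back for AR overlay
#     print(detections)
```

Note that every frame pays one network round trip, which is exactly the latency-versus-throughput tradeoff the reviewed studies weigh below.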
Out of the 69 studies collected for this review, 48 used cloud, edge, or GPU servers; 17 implemented their computations on the local AR device; and 4 used both server and local devices. Different methods and approaches were used to implement object detection either on a server or on a local device.

Since mobile devices have become powerful enough to handle the computations in near real-time, Tobías et al. (2016) proposed a method in which three lightweight CNN models, AlexNet, GoogLeNet, and Network In Network (NIN), are used for object recognition on GPU, CPU, and mobile devices. AlexNet requires large memory storage, making it less desirable for implementation on mobile devices; in contrast, the GoogLeNet and NIN models require less memory. This study showed that GoogLeNet led to the longest processing time, while the NIN model demonstrated the shortest. A near real-time mobile outdoor AR system was proposed by Rao et al. (2017). They used the SSD algorithm in a vision-based approach that can use the natural features of geographic objects under various conditions; however, using this method in outdoor environments under poor signal conditions is challenging. Since SSD is still too slow to perform the required computations efficiently without a powerful GPU, they modified the algorithm into a lightweight SSD that is more mobile-friendly, with a less accurate base network, fewer feature layers, and smaller input sizes. Compared to the original SSD and a fast YOLO approach, this method was more robust while having a lower mean average precision (mAP).
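The lightweight-model strategy described above can be approximated today with off-the-shelf mobile-oriented detectors. As a hedged sketch (using torchvision's SSDLite with a MobileNetV3 backbone, not Rao et al.'s modified SSD), comparing parameter counts shows why such models are candidates for on-device use:

```python
import torch
from torchvision.models.detection import (fasterrcnn_resnet50_fpn,
                                          ssdlite320_mobilenet_v3_large)

heavy = fasterrcnn_resnet50_fpn(pretrained=True).eval()
light = ssdlite320_mobilenet_v3_large(pretrained=True).eval()

def count(m):
    return sum(p.numel() for p in m.parameters())

print(f"Faster R-CNN params: {count(heavy) / 1e6:.1f}M")  # tens of millions
print(f"SSDLite params:      {count(light) / 1e6:.1f}M")  # a few million

# Same inference contract as the heavier detector, at a 320x320 input scale.
with torch.no_grad():
    pred = light([torch.rand(3, 320, 320)])[0]
print(pred["boxes"].shape, pred["scores"].shape)
```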
A new approach for object detection using deep learning networks trained remotely on 3D virtual models was proposed in Židek et al. (2019a), where Faster R-CNN Inception v2 was selected due to its high accuracy. However, given the disadvantages of its network size and recognition speed, it was not suitable for embedded devices. MobileNet v2, a reduced CNN, was more appropriate for embedded devices, since it lowers the processing delay and is optimized for low-performance computing devices such as smart glasses running Android OS. This study showed that AR devices with embedded processing units can reach a decent number of frames per second, which is satisfactory if the task does not require too many movements. Other variations of deep learning algorithms can be used to accommodate the processors of less powerful devices. In Corneli et al. (2019), a lightweight variation of YOLO (i.e., Tiny YOLO v2) was used to perform the whole process on site and in real-time. Bhattarai et al. (2020) proposed an embedded system where the processed images were analyzed and returned through wireless streaming in real-time using an embedded GPU development platform. A quantized version of SSD MobileNet v2 was used because this network had a detection speed suitable for AR applications and an mAP ideal for the dataset they used. A quantized network means that less memory space is required for the weights and the model's loading times are faster, which is necessary for mobile applications (Plecher et al., 2020).
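To show what quantization means in code, here is a minimal sketch using PyTorch's dynamic quantization API. This is an illustrative assumption rather than the toolchain of the reviewed studies: convolution-heavy detectors are usually quantized with static or quantization-aware training methods, while the dynamic API below targets linear layers, but the storage effect it demonstrates is the same.

```python
import os
import torch
import torch.nn as nn

# A small float32 model standing in for a network head (illustrative only).
model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 91))

# Replace Linear layers with int8 equivalents; weights are stored in 8 bits.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

torch.save(model.state_dict(), "fp32.pt")
torch.save(quantized.state_dict(), "int8.pt")
print(os.path.getsize("fp32.pt"), "bytes (fp32)")
print(os.path.getsize("int8.pt"), "bytes (int8, roughly 4x smaller)")
```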
In a study by Sutanto et al. (2017), a markerless AR system was proposed using a Faster R-CNN algorithm, with a mobile device capturing images of objects and sending them to the server where the computations are performed. The results were displayed on the mobile device, based on the object detections stored in the database, and could be viewed through a 3D lenslet array case. By incorporating 3D integral imaging, they successfully implemented this application on non-high-spec devices. Lang et al. (2019) used a HoloLens to perform the user's computations; however, due to the limited computing power of the HoloLens, they ran optimizations on a PC using other processors.

Performance is an essential factor when an object detection model is implemented. A single-network-evaluation CNN predicted regions of interest and class probabilities directly from full images in one evaluation (Eckert et al., 2018); this method significantly improved performance over other state-of-the-art models. An Android application enabling real-time AR was developed to perform object detection (Ran et al., 2017). The authors performed the object detection on both server and mobile devices and showed that the factors affecting detection performance in these scenarios include model size, offloading decision, and video resolution. While the server-side results depend highly on the network condition, they concluded that offloading to the server improved frame rate and accuracy. Since time consumption is considered one of the main challenges of deep learning, some studies used servers to increase the speed of computations even in the training process (Karambakhsh et al., 2019). Advani et al. (2017) used a high-performance cloud computing technique to leverage both GPUs and field-programmable gate arrays (FPGAs). While they could accelerate the process by exploiting parallel algorithms, the server could not meet real-time computation demands since it had to handle multiple connections at once. To achieve real-time processing, the system must leverage all computing power available at edge devices and local infrastructure. A server-based object detection has also been implemented using near real-time deep learning, with a pre-trained YOLO v2 model performing the object detection on the server. Liu et al. (2018) showed the importance of characterizing the tradeoffs between augmentation quality and latency when implementing on an edge server, and also implemented a protocol to maximize augmentation quality under varying network conditions and computation workloads. Bahri et al. (2019) developed an object detection system that recognizes objects via HoloLens and applies the YOLO algorithm on the server side, transmitting the data from the user/client side. To increase the detection speed on the server side and display results to the user, they created a pipeline between the HoloLens as a client and a desktop as a server using the Transmission Control Protocol/Internet Protocol (TCP/IP). Huang et al. (2019) proposed a framework using both server and local methods: high-complexity tasks were offloaded to the edge servers, while low-complexity tasks were executed on the mobile device or the edge server depending on the network latency. However, dynamic network condition changes made the process unstable, and the limitations of the on-device deep learning models were considerable. To control these factors, a cache-and-matching algorithm was designed on the mobile device to enhance the performance of the recognition tasks. This solution improved the quality of the mobile AR application.

Based on the reviewed studies and the tradeoffs each method entails, server-based object detection is more common and easier to implement. However, there are still drawbacks that should be considered. Although the difficulties during the implementation process may differ case by case, the limitations and capabilities should be identified and tested before implementation. Cost and latency due to network conditions are two critical challenges. Predicting and providing real-time object detection, especially on relatively complex video frames, is challenging, and there is no guarantee that the results will be received in real-time; this can negatively affect the user experience. On the other hand, client-side models are much cheaper than server-side models and, in theory, should have lower latencies since they do not send requests to and receive responses from a server: all computations are performed in one place, the client or user side. In practice, however, due to the hardware limitations of existing devices, the latencies can actually be larger than those of the server side. Client-side models can also be more challenging to implement, since many optimization operations must be performed to allow the system to run smoothly without hindrance.

6. Summary of review results

For this study, publications of the last ten years, from 2011 to 2020, were extracted from different resources. The papers were characterized by the type of input used for object detection, evaluation metrics, publication year, type of AR devices, and type of algorithms. Table 3 shows the types of input used in the reviewed studies. The papers mainly employed frames of videos or images onto which the content should be augmented. In these cases, the AR device records and sends the videos of the environment to the object detection algorithm in real-time, and the object detection and augmentation are performed based on the frames captured from the streaming videos. Another, less frequently used, input is point clouds, i.e., 3D object detection instead of 2D images. This method provides an efficient way of localizing and characterizing objects and obtaining depth information, such as measuring distance using the 3D data of the objects. After registering all of the detected point clouds in the scene, a complete capture of the scene can be obtained. As shown in Table 3, image/video inputs are more common than point clouds since they are easier to capture and analyze.

Table 3. Input data used in the reviewed studies.

- Images/Videos [Paper ID]: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 63, 64, 65, 66, 67, 68, 69
- Point clouds [Paper ID]: 45, 62

For evaluating the proposed object detection methods in AR, the reviewed studies generally focused on three evaluation approaches: computation efficiency measures such as runtime and latency; performance measures such as accuracy and error, for both systems and users; and subjective measures such as user acceptance, cognitive workload, ease of use, enjoyment, and usefulness, collected through self-reported surveys in the studies that evaluated their approaches with human subjects. Table 4 summarizes the evaluation metrics used in the reviewed studies. Each evaluation method has its advantages and disadvantages; using a combination of efficiency, performance, and subjective measurements provides a more robust and comprehensive evaluation.

Table 4. Metrics used in the reviewed studies.

Computation efficiency:
- Runtime [1, 4, 9, 10, 15, 17, 18, 19, 23, 32, 33, 36, 39, 41, 44, 52, 57, 62, 63, 67, 68]
- Latency [4, 5, 9, 13, 24, 29, 33, 52, 54, 55, 56, 60]
- Energy consumption [1, 2, 5, 29, 30, 54, 56, 60]

Performance:
- Detection accuracy [1, 2, 3, 4, 6, 7, 8, 10, 11, 12, 15, 21, 22, 25, 27, 29, 30, 34, 35, 36, 40, 41, 42, 47, 49, 51, 52, 53, 57, 60, 61, 64, 66, 68, 69]
- Precision [4, 5, 8, 9, 13, 14, 17, 20, 23, 24, 25, 26, 28, 32, 33, 39, 42, 43, 44, 48, 51, 55, 56, 57, 58, 59, 67]
- Recall [6, 8, 9, 12, 26, 32, 42, 43, 49, 51, 56]
- Error rate [6, 16, 22, 29, 37, 38, 46, 48, 50, 61, 65, 68]
- Loss [4, 12, 18, 31, 45, 49, 60, 64]
- Intersection over Union (IoU) [26, 30, 62, 66]
- Task completion time [3, 27, 47, 48, 50]
- Users' accuracy [27, 48]
- Users' error [27, 50, 69]

Subjective measurements: [31, 46, 47, 50, 57, 69]
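Among the metrics in Table 4, Intersection over Union (IoU) has a simple closed form worth stating precisely. The following reference implementation (ours, for illustration) treats boxes as (x1, y1, x2, y2) corner pairs:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection is commonly counted as correct when IoU >= 0.5.
print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # ~0.143
```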
Fig. 6 shows the number of studies published each year. The results show that deep learning-based object detection in AR was not studied before 2016. Since 2016, this topic has received remarkable attention, and the number of publications has been increasing over the years, with the highest number of published papers corresponding to the year 2020. It should be noted that the concept of deep learning-based object detection was introduced in 2014; before that, traditional methods such as Viola-Jones, HOG, and SVM were used. The current trend of this topic shows the interest of researchers in solving associated problems using these technologies.

Fig. 6. Number of publications per year.

Across the 69 studies reviewed in this paper, six different categories of AR devices, including five distinct categories for displaying AR information, were identified: wearable devices, mobile devices, projection-based AR (including HUDs), monitors, and RGB cameras, plus an "Other" category for devices that are only used in specific applications. Wearable devices such as the HoloLens are the most frequently used, with mobile devices second; these two categories were used in more than half of the papers. This may result from the quality and simplicity of these devices for providing interactive experiences and from the availability of mobile devices for AR applications. Fig. 7 shows the summary of devices used in the studies per year. It can be observed that the use of AR devices is increasing, especially for wearables and mobile devices, while projection-based AR shows a cyclic pattern. The use of cameras and monitors does not show a specific pattern per year, but their use has also been increasing over time.

Fig. 7. Number and type of devices used per year.

The combination of deep learning algorithms and devices used in the 69 collected papers is demonstrated in Fig. 8. Based on this result, among all devices and algorithms, YOLO has been used more than the other deep learning algorithms, having a dense intersection with both wearables and mobile devices.

Fig. 8. Combination of algorithms and devices used in reviewed papers.

Out of the 69 papers, only 11 mentioned that they conducted user studies and evaluated their proposed systems with human subjects (Farasin et al., 2020; Fuchs et al., 2020; Lai et al., 2020; Park et al., 2020a, 2020b; Tao et al., 2019; Kästner et al., 2020; Lang et al., 2019; Putze et al., 2020; Ramakrishna et al., 2016; Rathnayake et al., 2020). As shown in Table 5, these studies involved 10–61 participants. However, in all of the conducted studies, the number of female participants was significantly smaller than the number of male participants, which can possibly bias the results toward one gender, especially given that research has shown that females and males react differently when using AR technology (Dirin et al., 2019). Surprisingly, most of the studies did not mention participants' age ranges, but from the few available data, ages ranged from 18 to 50 with an average of 25. Also, most of the studies did not clarify whether the user population was representative of the real target users. Based on the available information from some studies, we can conclude that most of them recruited university students, who were not necessarily representative of a larger population of users.

Table 5. Studies that conducted human subject experiments.

- Paper ID 3, 27, 48, 56, 69: 20 subjects
- Paper ID 31, 50: 30 subjects
- Paper ID 44: 10 subjects
- Paper ID 46: 20, 20 (two separate experiments)
- Paper ID 47: 25, 20 (two separate experiments)
- Paper ID 57: 61 subjects
7. Discussion

Based on the existing data in the field of deep learning-based object detection in AR, it can be observed that there are still many challenges that should be addressed to enhance the performance of the detection algorithms in the future. This section provides an overview of the challenges we observed based on this review.

A powerful computation resource, a large training dataset, and a suitable machine learning model can significantly improve deep learning performance. However, implementation on mobile devices remains a challenge: algorithms need to be optimized to reduce computational time and improve their effectiveness, making them more lightweight, fast, and accurate. In addition, mobile applications generate noisy data if the algorithm is not robust enough and the dataset is small. In general, energy consumption is the primary concern for mobile AR applications, and additional work must be done to reduce detection and segmentation execution time. Moreover, while several datasets can be used as benchmarks for training object detection algorithms, many studies still use manual labeling, which is not cost- and labor-effective.

Another challenge is dealing with the limitations of wearable devices such as the HoloLens. The HoloLens's hardware is unpleasant and uncomfortable for prolonged durations of wear. The HoloLens also has a limited battery capacity and sometimes requires continuous and stable access to the network. These limitations may hinder its performance in many real-world or in-the-wild applications. In addition, due to the limited range of the HoloLens's depth sensor, when identifying an object at a long distance it is necessary to move closer to the target to create a 3D mesh of the physical space before detection. Another limitation that needs improvement is detection accuracy when there is a reflective medium or a shiny surface, such as glass, in the environment. Detection may also fail when the object being scanned moves.
Another challenge is dealing with the limitations of wearable devices such as the HoloLens. Its hardware is uncomfortable for prolonged durations of wear, its battery capacity is limited, and it sometimes requires continuous and stable access to the network; these limitations may hinder its performance in many real-world or in-the-wild applications. In addition, due to the limited range of the HoloLens depth sensor, identifying an object at a long distance requires moving closer to the target to create a 3D mesh of the physical space before detection. Detection accuracy also needs improvement when a reflective medium or a shiny surface, such as glass, is present in the environment, and detection may fail when the object being scanned moves.

The most important limitation of previous studies is the lack of human subject experiments. Many of the earlier efforts did not evaluate the performance of the proposed AR systems with human subjects. Park et al. (2019) mentioned the importance of usability and user experience when designing a system for humans and considered testing their system with users in future studies. Similarly, as a future direction for a driving-related task aiming to increase drivers' awareness, Anderson et al. (2019) discussed the necessity of conducting a user study to evaluate the impact of holograms on drivers. Moreover, Park et al. (2021) presented the limitations and future directions of their research, aiming to use subjective approaches to evaluate the physical and mental workload of users in a human-robot collaboration scenario. Hu et al. (2020) also noted the importance of the interaction modes in their proposed system; to provide a better user experience in this respect, further user studies need to be conducted. Running experiments with human subjects could help evaluate these systems from the users' perspective; after all, such applications are ultimately developed to be used by humans. Conducting user studies can also reveal whether a system is suitable for everyday use and can help identify the strengths and weaknesses of proposed systems. Most of the previous studies either did not include such evaluations or conducted them with a small number of participants, making it difficult to validate the results. In addition, researchers should be more attentive to balancing the number of male and female participants when conducting a user study; in most existing studies, the number of male participants is significantly greater than the number of female participants. Moreover, while some works conducted field studies, in-the-wild or in-situ, most studies performed their tasks in controlled lab environments with consistent lighting, static workplaces, and a limited number of objects, which may not reflect the true utility and effectiveness of their approaches. Since deep learning-based object detection in AR is applied to many real-world situations, it is worth exploring these systems in a more uncontrolled format to better understand their usefulness in real tasks and environments.
8. Conclusion

This paper provided a comprehensive review of deep learning-based object detection in AR, including an overview of current AR technologies and devices as well as frequently used algorithms for object detection. The review showed how deep learning differs from statistical classifiers for object detection and presented the many advantages of deep learning over traditional detection methods. It also described the current state of deep learning-based object detection in AR across different applications and implementation methods. It was observed that the number of publications in this field is increasing, making this a pervasive area of study. Depending on the type of algorithm, the model size, network conditions, and the computing power of AR devices, care must be taken when deciding whether to implement the computations on the server side or on the local device. Future studies in this field can be improved by designing more powerful mobile devices that can process the computations locally in real time and by designing wearable devices that remain pleasant for long durations of use; there is room for developing more lightweight devices in the future. In addition, depending on the application, methods such as inertial odometry for pose estimation and optical flow can help reduce energy consumption and processing time, respectively. Whether using a remote server or a local device, the tradeoffs between computation time, accuracy, latency, and battery drain should be weighed according to the task and the characteristics of the AR device (a simple decision sketch follows this section). To deal with manual labeling, increasing the number of datasets with reliable and accurate labels that require fewer samples for the learning stage can be a viable solution to enhance the accuracy and speed of detection. In general, after taking the considerations mentioned above into account, it is worthwhile for future studies to investigate their proposed approaches in uncontrolled or in-the-wild environments while testing them on human subjects, to validate their usability for real users and applications.
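As a concrete illustration of the server-versus-local tradeoff raised in the conclusion, the sketch below offloads a frame to a remote detection service only when the expected round-trip is cheaper than on-device inference and the battery budget allows it. The endpoint URL, the timing constants, the response schema, and `local_detector` are all hypothetical placeholders under assumed conditions, not the interface of any reviewed system.

```python
import cv2
import requests

SERVER_URL = "http://example.com/detect"  # hypothetical endpoint
LOCAL_MS = 180.0   # assumed on-device inference time; measure per device
REMOTE_MS = 60.0   # assumed upload + server inference + download time

def detect(frame, local_detector, battery_low=False):
    # Prefer the server when it is expected to be faster, unless the
    # radio energy cost matters more than latency (low battery).
    if REMOTE_MS < LOCAL_MS and not battery_low:
        ok, jpeg = cv2.imencode(".jpg", frame)
        if ok:
            try:
                resp = requests.post(
                    SERVER_URL,
                    files={"frame": jpeg.tobytes()},
                    timeout=1.0,  # give up quickly if the network stalls
                )
                if resp.ok:
                    return resp.json()["boxes"]  # assumed response schema
            except requests.RequestException:
                pass  # network failure: fall through to local inference
    return local_detector(frame)  # on-device fallback path
```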
Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was supported by the Republic of Korea's MSIT (Ministry of Science and ICT), under the High-Potential Individuals Global Training Program (2020-0-01532 & 2021-0-01548) supervised by the IITP (Institute of Information and Communications Technology Planning & Evaluation).

References

Abdi, L., Meddeb, A., 2017, April. Deep learning traffic sign detection, recognition and augmentation. In: Proceedings of the Symposium on Applied Computing. pp. 131–136.
Abdi, L., Meddeb, A., 2018. Driver information system: a combination of augmented reality, deep learning and vehicular Ad-hoc networks. Multimed. Tools Appl. 77 (12), 14673–14703.
Abdi, L., Takrouni, W., Meddeb, A., 2017, June. In-vehicle cooperative driver information systems. In: Proceedings of the 2017 13th International Wireless Communications and Mobile Computing Conference (IWCMC). pp. 396–401. IEEE.
Advani, S., Zientara, P., Shukla, N., Okafor, I., Irick, K., Sampson, J., Datta, S., Narayanan, V., 2017. A multitask grocery assist system for the visually impaired: smart glasses, gloves, and shopping carts provide auditory and tactile feedback. IEEE Consum. Electron. Mag. 6 (1), 73–81.
Ahn, J., Lee, J., Niyato, D., Park, H.S., 2020. Novel QoS-guaranteed orchestration scheme for energy-efficient mobile augmented reality applications in multi-access edge computing. IEEE Trans. Veh. Technol. 69 (11), 13631–13645.
Alhaija, H.A., Mustikovela, S.K., Mescheder, L., Geiger, A., Rother, C., 2018. Augmented reality meets computer vision: efficient data generation for urban driving scenes. Int. J. Comput. Vis. 126 (9), 961–972.
Anderson, R., Toledo, J., ElAarag, H., 2019, April. Feasibility study on the utilization of Microsoft HoloLens to increase driving conditions awareness. In: Proceedings of the 2019 SoutheastCon. pp. 1–8. IEEE.
Apicharttrisorn, K., Ran, X., Chen, J., Krishnamurthy, S.V., Roy-Chowdhury, A.K., 2019, November. Frugal following: power thrifty object detection and tracking for mobile augmented reality. In: Proceedings of the 17th Conference on Embedded Networked Sensor Systems. pp. 96–109.
Bahri, H., Krčmařík, D., Kočí, J., 2019, December. Accurate object detection system on HoloLens using YOLO algorithm. In: Proceedings of the 2019 International Conference on Control, Artificial Intelligence, Robotics & Optimization (ICCAIRO). pp. 219–224. IEEE.
Bhattarai, M., Jensen-Curtis, A.R., Martínez-Ramón, M., 2020, December. An embedded deep learning system for augmented reality in firefighting applications. In: Proceedings of the 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA). pp. 1224–1230. IEEE.
Chen, I.Y., MacDonald, B., Wünsche, B., 2008, December. Markerless augmented reality for robots in unprepared environments. In: Proceedings of the Australasian Conference on Robotics and Automation (ACRA08). pp. 3–5.
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2014. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062.
Cheng, Q., Zhang, S., Bo, S., Chen, D., Zhang, H., 2020. Augmented reality dynamic image recognition technology based on deep learning algorithm. IEEE Access 8, 137370–137384.
Choi, S.H., Park, K.B., Roh, D.H., Lee, J.Y., Mohammed, M., Ghasemi, Y., Jeong, H., 2022. An integrated mixed reality system for safety-aware human-robot collaboration using deep learning and digital twin generation. Robot. Comput.-Integr. Manuf. 73, 102258.
Chowdhury, S.A., Arshad, H., Parhizkar, B., Obeidy, W.K., 2013. Handheld augmented reality interaction technique. In: Proceedings of the International Visual Informatics Conference. Springer, Cham, pp. 418–426.
Corneli, A., Naticchia, B., Carbonari, A., Bosché, F., 2019. Augmented reality and deep learning towards the management of secondary building assets. In: ISARC, Proceedings of the International Symposium on Automation and Robotics in Construction, vol. 36. IAARC Publications, pp. 332–339.
Cruz, E., Orts-Escolano, S., Gomez-Donoso, F., Rizo, C., Rangel, J.C., Mora, H., Cazorla, M., 2019. An augmented reality application for improving shopping experience in large retail stores. Virtual Real. 23 (3), 281–291.
Dasgupta, A., Manuel, M., Mansur, R.S., Nowak, N., Gračanin, D., 2020, March. Towards real time object recognition for context awareness in mixed reality: a machine learning approach. In: Proceedings of the 2020 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW). pp. 262–268. IEEE.
De Gregorio, D., Tonioni, A., Palli, G., Di Stefano, L., 2019. Semiautomatic labeling for deep learning in robotics. IEEE Trans. Autom. Sci. Eng. 17 (2), 611–620.
del Amo, I.F., Erkoyuncu, J.A., Roy, R., Palmarini, R., Onoufriou, D., 2018. A systematic review of augmented reality content-related techniques for knowledge transfer in maintenance applications. Comput. Ind. 103, 47–71.
Deore, H., Agrawal, A., Jaglan, V., Nagpal, P., Sharma, M.M., 2020. A new approach for navigation and traffic signs indication using map integrated augmented reality for self-driving cars. Scalable Comput.: Pract. Exp. 21 (3), 441–450.
Karambakhsh, A., Kamel, A., Sheng, B., Li, P., Yang, P., Feng, D.D., 2019. Deep gesture interaction for augmented anatomy learning. Int. J. Inf. Manag. 45, 328–336.
Kästner, L., Frasineanu, V.C., Lambrecht, J., 2020, May. A 3D-deep-learning-based augmented reality calibration method for robotic environments using depth sensor data. In: Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 1135–1141.
Kästner, L., Eversberg, L., Mursa, M., Lambrecht, J., 2021, January. Integrative object and pose to task detection for an augmented-reality-based human assistance system using neural networks. In: Proceedings of the 2020 IEEE Eighth International Conference on Communications and Electronics (ICCE). pp. 332–337. IEEE.
Katiyar, A., Kalra, K., Garg, C., 2015. Marker based augmented reality. Adv. Comput. Sci. Inf. Technol. (ACSIT) 2 (5), 441–445.
Khan, N., Saleem, M.R., Lee, D., Park, M.W., Park, C., 2021. Utilizing safety rule correlation for mobile scaffolds monitoring leveraging deep convolution neural networks. Comput. Ind. 129, 103448.
Kim, Y.H., Lee, K.H., 2019. Pose initialization method of mixed reality system for inspection using convolutional neural network. J. Adv. Mech. Des. Syst., Manuf. 13 (5), JAMDSM0093.
Konstantinidis, F.K., Kansizoglou, I., Santavas, N., Mouroutsos, S.G., Gasteratos, A., 2020. MARMA: a mobile augmented reality maintenance assistant for fast-track repair procedures in the context of industry 4.0. Machines 8 (4), 88.
Lai, Z.H., Tao, W., Leu, M.C., Yin, Z., 2020. Smart augmented reality instructional system for mechanical assembly towards worker-centered intelligent manufacturing. J. Manuf. Syst. 55, 69–81.
