Euro Truck Simulator 2 Autosave Interval


Shawna Erholm

Jul 24, 2024, 10:10:26 AM
to diadrywimna

The task of drawing a bounding box around an object of interest is known as object detection within the computer vision literature. It is one of several methods that computer algorithms can now perform with high accuracy. Other tasks include image classification (Himabindu & Kumar, 2021, e.g. deciding whether an image contains a White or a Black person), image segmentation (Minaee et al., 2021, e.g. drawing a contour around a person), pose estimation (Chen et al., 2020, e.g. localising the position of feet, knees, hips, and shoulders of a person in an image), and object tracking (Chen et al., 2022, i.e. object detection while also tracking the identity of a person or object in the image). The accuracy of these techniques has improved substantially over the past years (Feng et al., 2019) due to improved algorithms, improved technology (particularly the introduction of graphical processing units, GPUs), and larger annotated datasets (e.g. Deng et al., 2009; Yang et al., 2015).

Until recently, some level of programming skill was required to apply these computer vision methods, which can be an issue for behavioural researchers. This changed with the release of the Ultralytics package (Jocher et al., 2023), which is easy to install and use for all of the above computer vision tasks. Applying computer vision methods with the Ultralytics package involves installing the package, sorting files into the appropriate folder structure, issuing commands from the command line (Jocher et al., 2023), and saving the results.
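As a concrete illustration, a minimal detection run with the Ultralytics package might look as follows (a sketch, not the study's own code; the image path and confidence handling are placeholders, and the pre-trained weights are downloaded automatically on first use):

```python
# Minimal sketch of object detection with the Ultralytics package
# (install with: pip install ultralytics). The image path passed to
# detect() is a placeholder, not a file from the study.

def detect(image_path, weights="yolov8n.pt"):
    """Run a COCO-pretrained YOLOv8 model on a single image and return
    a list of (class_name, confidence, [x1, y1, x2, y2]) detections."""
    from ultralytics import YOLO  # imported lazily so the sketch parses without the package
    model = YOLO(weights)         # pre-trained weights are fetched on first use
    result = model(image_path)[0]
    return [
        (result.names[int(box.cls)], float(box.conf), box.xyxy[0].tolist())
        for box in result.boxes
    ]
```

The same step can be run from the command line with `yolo predict model=yolov8n.pt source=image.jpg`; by default, results are saved to a `runs/` folder.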


Some object detection tasks can be performed with pre-trained models, and such detection can be run on a standard personal computer (e.g. with an i5 processor and 8 GB of RAM). Pre-trained models tend to be based on the COCO dataset (Lin et al., 2014), which contains 80 types of objects. When the aim is to detect an object that is not in these pre-trained models (e.g. a plunger, a surgical instrument), a new model needs to be trained. For such training, a computer with a graphical processing unit is recommended (Google currently offers free processing time on their Colab servers; Bisong, 2019).

To train a new object detector, a set of examples is needed. This involves finding images of the object and drawing bounding boxes around the object in each of these images, for example using the labelMe software (Wada, 2018). Training a new object detector often starts with a pre-trained network (often based on the COCO dataset, Lin et al., 2014) that takes advantage of pre-trained weights for the feature recognition stages of object recognition, a process known as transfer learning.
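Transfer learning of this kind can be sketched with the Ultralytics Python API as follows (the dataset configuration path `data.yaml` and the hyperparameter values are illustrative placeholders, not the settings used in the study):

```python
# Sketch of transfer learning with YOLOv8: training starts from
# COCO-pretrained weights rather than from scratch. The "data.yaml"
# path and the epoch/image-size values are placeholders.

def train_detector(data_yaml="data.yaml", weights="yolov8n.pt",
                   epochs=100, imgsz=640):
    """Fine-tune a COCO-pretrained YOLOv8 model on a custom annotated dataset."""
    from ultralytics import YOLO  # lazy import: the sketch parses without the package
    model = YOLO(weights)         # loads pre-trained weights (transfer learning)
    model.train(data=data_yaml, epochs=epochs, imgsz=imgsz)
    return model
```

Because the feature-extraction layers start from the pre-trained weights, far fewer annotated images are typically needed than when training from scratch.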

Typical object detection often involves highly variable contexts (e.g. various outdoor scenes, different weather and lighting conditions) and highly variable objects (e.g. various shapes, sizes, and colours of cars and trucks). A common strategy is to use an already-existing dataset of annotated images (e.g. Yang et al., 2015; Krizhevsky et al., 2009; Deng et al., 2009). Using such an existing dataset may not always be feasible for objects used in the lab, as existing datasets may not contain the class of object that you wish to detect (e.g. a plunger, Cohen and Rosenbaum, 2004). The question therefore arises as to what is required to train a new object detector for the object(s) under study.

Behavioural research may have an important advantage in this context. Experiments are often done in a much more controlled setting than found in typical object detection. Participants are all tested in the same lab, under the same lighting conditions, with the same camera viewpoint, manipulating the same object. Guidelines on, for example, how many images to annotate for training from typical object detection contexts may therefore not automatically apply to object detection in behavioural research (particularly when conducted in the lab).

Participants stood in front of a surgical simulator box with their chin in a chin rest (a). Inside the surgical box were two dishes containing beads and a webcam (b). Participants entered a surgical instrument (c) into one of the holes in the top of the box and moved as many beads from one dish to the other in the three allotted minutes while the webcam recorded the view inside the box (d)

A recent study examined the effect of the number of annotated images used to train an object detector for playing cards. For object detection, the authors used the You Only Look Once (YOLO) algorithm, in two of its older versions (YOLOv1 and YOLOv2). They found that precision and recall (measures of detection accuracy) improved up to around 300 training images and with at least 300 training epochs (Li et al., 2018). This suggests that a relatively small number of annotated images may suffice for reliable object detection, but it is unclear how these results extend to more recent object detectors such as YOLOv8 (Jocher et al., 2023), which reports substantially improved object detection over the earlier versions (for an overview, see Jiang et al., 2022).
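Precision and recall can be computed directly from counts of true positives (TP), false positives (FP), and false negatives (FN); a minimal sketch (the example counts are illustrative, not data from any of the cited studies):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN).
    Returns (precision, recall), with 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# e.g. 90 correct detections, 10 spurious boxes, 30 missed objects:
# precision_recall(90, 10, 30) → (0.9, 0.75)
```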

The present study will therefore examine the conditions required to train object detectors for behavioural research. It will focus on YOLOv8 because this version is particularly easy to use compared to other object detectors. The main set of experiments will focus on surgical tool detection, as this type of object detection is of substantial interest to the medical community and studies have demonstrated that accurate detection can be achieved with earlier YOLO versions (e.g. Choi et al., 2017, 2021; Wang et al., 2021). The present study will use images of a surgical tool inside a simulator box to mimic the low-variability contexts typically found in lab-based studies (the experiment from which these images were sampled is described in the Methods section below).

To examine how well the results generalise to other objects, the YOLOv8 detector will also be applied to a second dataset in which participants moved a transparent bowl between rings (Hermens et al., 2014). This particular application could pose an additional challenge to the algorithm due to the transparency of the bowl, additional occlusion, and the low image quality of this older dataset.

In the experiments, the following research questions will be addressed: (1) How many annotated images are needed to train an object detector in a low variability setting? (2) How well does the object detector perform on unseen videos of the object? (3) Does the YOLO version and the pre-trained model size affect performance of the detector? (4) Does an object detector trained for one background perform adequately when used for the same object but a different background? (5) If performance drops with a change of background, does it suffice to train a new detector in a new context? (6) How well does an object detector perform when trained on different contexts and how many images are needed per context? (7) How do results depend on the random selection of images for training? (8) Are similar image set sizes needed for other types of objects?

a Labelling images with LabelImg. By using the single object and autosave options, labelling can be performed efficiently. To label around 450 images each with a single object, around 45 min was needed. b Example tree structure (with a total of 12 images) required for training with YOLOv8 from Ultralytics. c Output from the algorithm during training
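A tree structure of this kind can be created programmatically. The following sketch (standard library only; the root path and the single class name `tooltip` are illustrative) builds the folder layout and a minimal `data.yaml` in the format Ultralytics expects:

```python
# Sketch of the folder layout Ultralytics expects for training:
# images/ and labels/ folders, each with train/val splits, plus a
# data.yaml describing the dataset. The class name "tooltip" mirrors
# the surgical-tool example; adjust names and paths as needed.
from pathlib import Path

def make_yolo_tree(dataset_root):
    """Create the images/labels train/val folders and a minimal data.yaml."""
    root = Path(dataset_root)
    for split in ("train", "val"):
        (root / "images" / split).mkdir(parents=True, exist_ok=True)
        (root / "labels" / split).mkdir(parents=True, exist_ok=True)
    (root / "data.yaml").write_text(
        f"path: {root.resolve()}\n"
        "train: images/train\n"
        "val: images/val\n"
        "names:\n"
        "  0: tooltip\n"
    )
    return root / "data.yaml"
```

Each image in `images/train` then needs a matching `.txt` annotation file in `labels/train` with one line per bounding box.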

Eye-tracking data and video recordings were collected from a total of 38 participants (ten males), recruited by word of mouth among students and staff at the University of Aberdeen (UK), none of whom had experience in the use of surgical instruments. All provided written consent for their participation in the study, which was approved by the local ethics committee (School of Psychology, University of Aberdeen, UK).

The experimental setup for data collection is shown in Fig. 1. Participants stood with their head resting in a chin rest (UHCOT Tech Headspot, mounted on a wooden frame) and performed a task with a simulator box (Ethicon Endo-Surgery Inc.) while an EyeLink 1000 system (SR Research, Ontario, Canada) recorded their eye movements. A webcam recorded the inside of the simulator box, and its image was projected on a Dell 19-inch flatscreen monitor. The task involved a single-use surgical grasper, shown in Fig. 1c, and required participants to move coloured beads from one dish inside the box to another dish (see Fig. 1d) for a total of 3 min. In the present context, only the video recordings are used, not the eye-tracking data.

A total of 436 images were extracted from a large portion of the videos, taking one frame every 10 s. This sampling frequency was chosen such that around 450 images were obtained from the various videos while avoiding images that looked too similar. Using the LabelImg software, bounding boxes were drawn around the tooltip in each image, as illustrated in Fig. 2a. Bounding boxes were drawn so that they included all of the metal region of the tooltip, which could sometimes lead to part of the (black) shaft of the instrument being included in the bounding box. The target object for this analysis is therefore the tooltip, as this is the part of the instrument that participants are expected to fixate during the task. Most images contained the tooltip (386 images), but some images without the tool were also kept, to determine whether the model correctly reported no instrument when none was present.
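Sampling one frame every fixed number of seconds can be sketched as follows (the `extract_frames` helper and its video path are illustrative; OpenCV is imported lazily so the index arithmetic can be used without the library):

```python
# Sketch of extracting one frame every `interval` seconds from a video.
# The index helper is pure Python; extract_frames() needs OpenCV
# (pip install opencv-python) and a real video path.

def sample_indices(n_frames, fps, interval=10.0):
    """Frame indices spaced `interval` seconds apart."""
    step = max(1, round(fps * interval))
    return list(range(0, n_frames, step))

def extract_frames(video_path, out_dir, interval=10.0):
    """Save one frame every `interval` seconds as PNG files in out_dir."""
    import cv2  # lazy import so the sketch parses without OpenCV
    from pathlib import Path
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for i in sample_indices(n, fps, interval):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)  # seek to frame i
        ok, frame = cap.read()
        if ok:
            cv2.imwrite(str(Path(out_dir) / f"frame_{i:06d}.png"), frame)
    cap.release()
```

For a 30-fps recording, `sample_indices(900, 30, 10)` yields `[0, 300, 600]`, i.e. one frame every 10 s over a 30-s clip.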

The mAP values reported during training are computed on the validation set, which consists of 20% of the annotated images (e.g. 20 images when using 100 images in total). Because the validation set can be small when fewer annotated images are used, performance on the entire annotated set of 436 images was also compared across image set sizes. Because models trained on larger numbers of images see more of the original 436 images during training, models based on more images may have an advantage on the entire annotated set. Models were therefore also compared on two as yet unseen videos. Performance on these videos was determined by counting: how often the detected bounding box included most of the tooltip, as in the examples in Fig. 6 (corresponding to an IoU over 50%); the number of times the bounding box was clearly not around the tooltip (as in the example in Fig. 9); the number of false positives (a tooltip was detected where there was none); and the number of false negatives (a tooltip was not detected where there was one).
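The intersection-over-union (IoU) criterion can be computed directly from two boxes in (x1, y1, x2, y2) format; a minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (empty when the boxes do not intersect).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Identical boxes give 1.0; disjoint boxes give 0.0:
# iou((0, 0, 10, 10), (0, 0, 10, 10)) → 1.0
```

A detection is then counted as correct when `iou(detected_box, annotated_box) > 0.5`, matching the 50% criterion above.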
