I have a drone in a Gazebo environment with a RealSense D435 camera on it. My plan is to use YOLO to find the center of an object of interest, and then find the depth of that point from the depth image. I heard that the depth camera outputs an image where the depth values are encoded in the RGB values. When I looked into this further online, I found that there is a pyrealsense2 library that has functions for everything I need.
The implementations I've seen online need you to create a pyrealsense2.pipeline() and get your frames from that. The issue is that this seems to work only if you have a RealSense camera connected to your computer. Since mine exists in the Gazebo environment, I need a way to get and use the depth frame in a ROS callback. How would I do this? Any pointers would be greatly appreciated.
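One possible approach, sketched under assumptions: in simulation the Gazebo camera plugin publishes the depth image as a ROS sensor_msgs/Image, so the depth can be read from that message instead of from a pyrealsense2 pipeline. The helper names depth_at_pixel and depth_callback are hypothetical. The sketch assumes the depth image is either float (32FC1, meters) or uint16 (16UC1, millimeters), and that it is aligned with the color image so the YOLO box center (u, v) indexes the same pixel.

```python
import numpy as np


def depth_at_pixel(depth_image, u, v, window=2):
    """Return the median valid depth in meters around pixel (u, v), or None.

    Assumes depth_image is a 2-D array holding either float values in
    meters (32FC1) or uint16 values in millimeters (16UC1).
    """
    depth = np.asarray(depth_image)
    if depth.dtype == np.uint16:
        depth = depth.astype(np.float32) * 0.001  # millimeters to meters
    height, width = depth.shape[:2]
    # Clip the sampling window to the image bounds.
    u0, u1 = max(u - window, 0), min(u + window + 1, width)
    v0, v1 = max(v - window, 0), min(v + window + 1, height)
    patch = depth[v0:v1, u0:u1]
    # Zero, NaN and inf mark pixels without a depth reading.
    valid = patch[np.isfinite(patch) & (patch > 0)]
    if valid.size == 0:
        return None
    return float(np.median(valid))


def depth_callback(msg, u, v):
    """Hypothetical callback body for a sensor_msgs/Image depth message."""
    # Imported here so depth_at_pixel can be used and tested without ROS.
    from cv_bridge import CvBridge

    depth_image = CvBridge().imgmsg_to_cv2(msg, desired_encoding="passthrough")
    return depth_at_pixel(depth_image, u, v)
```

In a node, depth_callback would be registered on whatever depth topic the simulated camera publishes; the topic name depends on the plugin configuration. Taking the median over a small window rather than a single pixel is a design choice that tolerates holes in the depth image.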
The YOLO-World Model introduces an advanced, real-time Ultralytics YOLOv8-based approach for Open-Vocabulary Detection tasks. This innovation enables the detection of any object within an image based on descriptive texts. By significantly lowering computational demands while preserving competitive performance, YOLO-World emerges as a versatile tool for numerous vision-based applications.
YOLO-World tackles the challenges faced by traditional Open-Vocabulary detection models, which often rely on cumbersome Transformer models requiring extensive computational resources. These models' dependence on pre-defined object categories also restricts their utility in dynamic scenarios. YOLO-World revitalizes the YOLOv8 framework with open-vocabulary detection capabilities, employing vision-language modeling and pre-training on expansive datasets to excel at identifying a broad array of objects in zero-shot scenarios with unmatched efficiency.
Efficiency and Performance: YOLO-World slashes computational and resource requirements without sacrificing performance, offering a robust alternative to models like SAM but at a fraction of the computational cost, enabling real-time applications.
Inference with Offline Vocabulary: YOLO-World introduces a "prompt-then-detect" strategy, employing an offline vocabulary to enhance efficiency further. This approach enables custom prompts computed a priori, including captions or categories, to be encoded and stored as offline vocabulary embeddings, streamlining the detection process.
Powered by YOLOv8: Built upon Ultralytics YOLOv8, YOLO-World leverages the latest advancements in real-time object detection to facilitate open-vocabulary detection with unparalleled accuracy and speed.
Benchmark Excellence: YOLO-World outperforms existing open-vocabulary detectors, including MDETR and GLIP series, in terms of speed and efficiency on standard benchmarks, showcasing YOLOv8's superior capability on a single NVIDIA V100 GPU.
Versatile Applications: YOLO-World's innovative approach unlocks new possibilities for a multitude of vision tasks, delivering speed improvements by orders of magnitude over existing methods.
The YOLO-World models provided by Ultralytics come pre-configured with COCO dataset categories as part of their offline vocabulary, enhancing efficiency for immediate application. This integration allows the YOLOv8-World models to directly recognize and predict the 80 standard categories defined in the COCO dataset without requiring additional setup or customization.
The YOLO-World framework allows for the dynamic specification of classes through custom prompts, empowering users to tailor the model to their specific needs without retraining. This feature is particularly useful for adapting the model to new domains or specific tasks that were not originally part of the training data. By setting custom prompts, users can essentially guide the model's focus towards objects of interest, enhancing the relevance and accuracy of the detection results.
You can also save a model after setting custom classes. By doing this you create a version of the YOLO-World model that is specialized for your specific use case. This process embeds your custom class definitions directly into the model file, making the model ready to use with your specified classes without further adjustments. Follow these steps to save and load your custom YOLOv8 model:
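The save-and-load steps might look like the following minimal sketch. It assumes the ultralytics package is installed and that the "yolov8s-world.pt" weights can be downloaded; the wrapper function names are hypothetical and exist only to keep the example self-contained.

```python
def save_custom_world_model(classes, weights="yolov8s-world.pt",
                            out_path="custom_yolov8s.pt"):
    """Set a custom vocabulary on a YOLO-World model and save it."""
    from ultralytics import YOLOWorld  # requires the ultralytics package

    model = YOLOWorld(weights)   # weights are fetched on first use
    model.set_classes(classes)   # for example ["person", "bus"]
    model.save(out_path)         # class definitions are stored with the model
    return out_path


def load_custom_model(path="custom_yolov8s.pt"):
    """Load the saved model; it predicts only the classes stored with it."""
    from ultralytics import YOLO

    return YOLO(path)
```

A typical use would be to call save_custom_world_model(["person", "bus"]) once, then load_custom_model() in the application and call predict on the returned model.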
After saving, the custom_yolov8s.pt model behaves like any other pre-trained YOLOv8 model but with a key difference: it is now optimized to detect only the classes you have defined. This customization can significantly improve detection performance and efficiency for your specific application scenarios.
This approach provides a powerful means of customizing state-of-the-art object detection models for specific tasks, making advanced AI more accessible and applicable to a broader range of practical applications.
WorldTrainerFromScratch is highly customized to allow training YOLO-World models on both detection datasets and grounding datasets simultaneously. For more details, check out ultralytics/models/yolo/world/train_world.py.
For further reading, the original YOLO-World paper is available on arXiv. The project's source code and additional resources can be accessed via their GitHub repository. We appreciate their commitment to advancing the field and sharing their valuable insights with the community.
The YOLO-World model is an advanced, real-time object detection approach based on the Ultralytics YOLOv8 framework. It excels in Open-Vocabulary Detection tasks by identifying objects within an image based on descriptive texts. Using vision-language modeling and pre-training on large datasets, YOLO-World achieves high efficiency and performance with significantly reduced computational demands, making it ideal for real-time applications across various industries.
YOLO-World supports a "prompt-then-detect" strategy, which utilizes an offline vocabulary to enhance efficiency. Custom prompts like captions or specific object categories are pre-encoded and stored as offline vocabulary embeddings. This approach streamlines the detection process without the need for retraining. You can dynamically set these prompts within the model to tailor it to specific detection tasks, as shown below:
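Setting prompts and running detection could be sketched as follows, assuming the ultralytics package and the "yolov8s-world.pt" weights. The function name detect_centers is hypothetical; it also returns each box center, since boxes.xywh reports boxes as (center x, center y, width, height).

```python
def detect_centers(image_path, prompts, weights="yolov8s-world.pt"):
    """Detect the prompted classes and return (class name, cx, cy) tuples."""
    from ultralytics import YOLOWorld  # requires the ultralytics package

    model = YOLOWorld(weights)
    model.set_classes(list(prompts))    # custom prompts, no retraining
    results = model.predict(image_path)

    first = results[0]                  # one Results object per input image
    centers = []
    for (cx, cy, _w, _h), cls_id in zip(first.boxes.xywh.tolist(),
                                        first.boxes.cls.tolist()):
        centers.append((first.names[int(cls_id)], cx, cy))
    return centers
```

For example, detect_centers("path/to/image.jpg", ["person", "bus"]) would return one tuple per detected person or bus.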
To reproduce the official results from scratch, you need to prepare the datasets and launch the training using the provided code. The training procedure involves creating a data dictionary and running the train method with a custom trainer:
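A rough sketch of that procedure, with the caveat that every dataset path passed in is a placeholder to be replaced with your own files: the data dictionary pairs detection datasets (yolo_data) with grounding datasets (grounding_data), and the trainer is passed to train. The helper names and default hyperparameters are assumptions for illustration.

```python
def build_world_data(yolo_train, grounding_train, yolo_val):
    """Build the data dictionary expected by WorldTrainerFromScratch.

    grounding_train is a list of (image directory, annotation JSON) pairs.
    """
    return {
        "train": {
            "yolo_data": list(yolo_train),
            "grounding_data": [
                {"img_path": img_path, "json_file": json_file}
                for img_path, json_file in grounding_train
            ],
        },
        "val": {"yolo_data": list(yolo_val)},
    }


def train_world_from_scratch(data, cfg="yolov8s-worldv2.yaml",
                             epochs=100, batch=128):
    """Launch training with the custom trainer (needs ultralytics and data)."""
    from ultralytics import YOLOWorld
    from ultralytics.models.yolo.world.train_world import WorldTrainerFromScratch

    model = YOLOWorld(cfg)
    return model.train(data=data, batch=batch, epochs=epochs,
                       trainer=WorldTrainerFromScratch)
```

The dictionary would be built with something like build_world_data(["Objects365.yaml"], [("path/to/images", "path/to/annotations.json")], ["lvis.yaml"]) and then handed to train_world_from_scratch.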
Hello,
I just tried using the Intel RealSense camera as an input camera for YOLOv4, but whenever I run it, it shows the following error after loading the weights. The Intel RealSense camera shows the output video perfectly when opened using Intel realsense-viewer. I was able to use the Intel RealSense as an input camera (or as a webcam) on the Jetson AGX Xavier, but on the Jetson TX2 it gives me an error. Any help appreciated. Thanks.
Please refer to the installation guidelines at Python Installation. Please refer to the instructions at Building from Source. These Examples demonstrate how to use the Python wrapper of the SDK. Sample source code is available on GitHub. For full Python...
Follow the RealSense SDK documentation to add the RealSense public key and list of repositories. For x86, refer to _linux.md. For Jetson platforms, refer to _jetson.md. You may add the Ubuntu 20.04 source path to the apt repository list to install dependencies.
Plug in an Intel RealSense Depth D400 series USB camera. To verify the camera and upgrade firmware, run realsense-viewer and update from the UI. Navigate to the folder sources/apps/sample_apps/deepstream-3d-depth-camera.
ds3d::dataloader loads custom lib libnvds_3d_dataloader_realsense.so and creates a RealSense dataloader through the createRealsenseDataloader function. This specific loader is configured for output streams of [color, depth]. Gst-appsrc connects the dataloader into the deepstream pipeline.
ds3d::datarender loads custom lib libnvds_3d_gl_datarender.so and creates a GLES renderer through the createDepthStreamDataRender function. This specific renderer is configured for 2D depth and color images. Gst-appsink connects the datarender into the deepstream pipeline.
ds3d::dataloader is the same as the 1st pipeline for depth and color capture. It also outputs the intrinsic parameters of the color/depth sensors, and extrinsic parameters from depth to color sensor module.
ds3d::datafilter is loaded by the nvds3dfilter Gst-plugin, which accepts in_caps as sink_caps and out_caps as src_caps. It creates a custom ds3d::datafilter instance and processes data as ds3d/datamap.
ds3d::datarender loads custom lib libnvds_3d_gl_datarender.so and creates a GLES renderer through the createPointCloudDataRender function. This specific renderer is configured for rendering 3D points and colors. It also supports 3D scene rotation based on mouse-drag.
realsense_dataloader captures color and depth streams along with the intrinsic and extrinsic parameters of the sensor in ds3d/datamap. It then delivers the data to the next component, point2cloud_datafilter. This component generates 3D point-cloud data and a UV coordinate map into a new output ds3d/datamap. Finally, the data is delivered to the ds3d::datarender component point-render for 3D rendering.
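To illustrate what turning a depth image into a point cloud involves, here is a minimal sketch of the standard pinhole back-projection. It is not the DeepStream implementation; it assumes depth in meters, no lens distortion, and intrinsics fx, fy, cx, cy given in pixels. The function name depth_to_points is hypothetical.

```python
import numpy as np


def depth_to_points(depth_m, fx, fy, cx, cy):
    """Back-project a depth image (meters) into an (N, 3) array of XYZ points."""
    depth = np.asarray(depth_m, dtype=np.float32)
    height, width = depth.shape
    # u runs along columns, v along rows.
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    # Skip pixels without a usable depth reading.
    valid = np.isfinite(depth) & (depth > 0)
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

Points are returned in row-major pixel order, in the camera frame of the depth sensor; mapping them to the color sensor would additionally need the depth-to-color extrinsics.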
The depth-render custom lib displays the depth data and color data together in the same window. Update min_depth/max_depth to remove foreground and background objects in the depth rendering. According to the RealSense specification, the D435 camera's depth range is between 0.3 and 3 meters.
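The effect of a min_depth/max_depth window can be sketched in a few lines. This is only an illustration of the idea, not the renderer's code; it assumes depth in meters and uses the 0.3 to 3 meter range quoted above as defaults. The function name clip_depth_range is hypothetical.

```python
import numpy as np


def clip_depth_range(depth_m, min_depth=0.3, max_depth=3.0):
    """Zero out depth values outside [min_depth, max_depth] (meters)."""
    depth = np.asarray(depth_m, dtype=np.float32)
    # NaN compares as False, so missing readings are zeroed as well.
    in_range = (depth >= min_depth) & (depth <= max_depth)
    return np.where(in_range, depth, 0.0).astype(np.float32)
```

Narrowing the window, for instance to 0.5 and 1.5 meters, keeps only objects at that distance and drops nearer and farther ones.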
RealSense (librealsense SDK v2) is integrated into Open3D (v0.12+) and you can use it through both C++ and Python APIs without a separate librealsense SDK installation on Linux, macOS and Windows. Older versions of Open3D support RealSense through a separate install of librealsense SDK v1 and pyrealsense.
Here is a C++ code snippet that shows how to read a RealSense bag file recorded with Open3D or the realsense-viewer. Note that general ROSbag files are not supported. See more details and available functionality (such as getting timestamps, aligning the depth stream to the color stream and getting intrinsic calibration) in the C++ API in the RSBagReader documentation.
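The same reader is also exposed through the Python API. A minimal Python sketch, assuming Open3D v0.12+ built with RealSense support; the function name count_bag_frames is hypothetical and simply walks the file frame by frame.

```python
def count_bag_frames(bag_filename):
    """Read a RealSense bag file with Open3D and count its RGBD frames."""
    import open3d as o3d  # requires Open3D built with RealSense support

    bag_reader = o3d.t.io.RSBagReader()
    bag_reader.open(bag_filename)
    frames = 0
    im_rgbd = bag_reader.next_frame()
    while not bag_reader.is_eof():
        # im_rgbd.color and im_rgbd.depth hold the current frame's images.
        frames += 1
        im_rgbd = bag_reader.next_frame()
    bag_reader.close()
    return frames
```

Per-frame processing, such as saving or displaying im_rgbd.depth, would go inside the loop where the frame counter is incremented.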