Using libyuv on Android Devices with NEON support

technix

Aug 7, 2012, 1:09:43 PM

to discuss...@googlegroups.com

Hi,

First of all, thank you for the library, and in particular for the time you spend on it!

I want to use libyuv for color conversion from YUV420 to RGBA. My application scenario is real-time video streaming with a server and the Android device as the client. Latency is my enemy and I fight for every single millisecond :).
For video decoding I'm using ffmpeg (H.264 decoder), which works very fast so far. Up to now I used swscale for the color conversion, which took about 9 ms (no hardware acceleration for swscale). Using libyuv my results are about 10 ms, which is disappointing because I expected better results from the NEON extension. My test device is a Samsung Galaxy S2 (2x 1.2 GHz) and the test image is 800x480 YUV420.

Maybe I did something wrong in my Android.mk and NEON is not actually enabled?

Android.mk
###########################################################################
LOCAL_PATH := $(call my-dir)
LIBYUV_DIR := $(LOCAL_PATH)/libyuv

FFMPEG_DIR := $(LOCAL_PATH)/ffmpeg

#ifeq ($(HOST_OS), windows)
# FFMPEG_DIR := $(shell cygpath -m $(FFMPEG_DIR))
#endif

include $(CLEAR_VARS)
LOCAL_MODULE := ffmpeg
# These need to be in the right order
FFMPEG_LIBS := $(addprefix ffmpeg/, \
android/armv7-a/lib/libavformat.a \
android/armv7-a/lib/libavcodec.a \
android/armv7-a/lib/libswscale.a \
android/armv7-a/lib/libavutil.a )
# ffmpeg uses its own deprecated functions liberally, so turn off that annoying noise
LOCAL_CPP_EXTENSION := .cc
LOCAL_CFLAGS += -g -Iffmpeg -Iffmpeg-rtsp -Wno-deprecated-declarations -D__STDC_CONSTANT_MACROS
LOCAL_LDLIBS += -llog -lz $(FFMPEG_LIBS)
LOCAL_EXPORT_C_INCLUDES := ffmpeg
LOCAL_EXPORT_C_INCLUDES += $(LOCAL_PATH)/include
LOCAL_C_INCLUDES := ffmpeg
LOCAL_C_INCLUDES += $(LIBYUV_DIR)/include $(LIBYUV_DIR)
LOCAL_SRC_FILES := ffmpeg-jni.cc \
                $(addprefix libyuv/, \
        source/compare.cc.neon \
        source/convert.cc.neon \
        source/convert_argb.cc.neon \
        source/convert_from.cc.neon \
        source/cpu_id.cc.neon \
        source/format_conversion.cc.neon \
        source/planar_functions.cc.neon \
        source/rotate.cc.neon \
        source/rotate_argb.cc.neon \
        source/rotate_neon.cc.neon \
        source/row_common.cc.neon \
        source/row_neon.cc.neon \
        source/row_posix.cc.neon \
        source/scale.cc.neon \
        source/scale_argb.cc.neon \
        source/video_common.cc.neon )
LOCAL_ARM_NEON := true
LOCAL_STATIC_LIBRARIES := cpufeatures
include $(BUILD_SHARED_LIBRARY)
###############################################################

As you can see, I'm using libyuv directly without any prebuilts. I've never seen the .neon suffix before, but it's related to the information provided here: https://groups.google.com/forum/?fromgroups#!topic/discuss-libyuv/q4CkZ__aSj0.

Is it possible that the color conversion with NEON is really that slow?

technix

Aug 8, 2012, 7:53:42 AM

to discuss...@googlegroups.com

I tried "yuv2rgb" from pinknoise (http://wss.co.uk/pinknoise/yuv2rgb/). In the same setting as for libav above, it takes about 17 ms on average. In my last test the conversion with libyuv took about 7 ms on average (I optimized other threads of my app), so it's the best choice up to now! Next I'll try the OpenCV NEON YUV-to-RGB conversion.
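
For anyone comparing numbers like the averages above, here is a minimal sketch of how such a per-call average could be measured. It assumes a C++11 toolchain; `AverageMillis` is a hypothetical helper name, not part of libyuv, swscale or yuv2rgb, and the callable you pass in would wrap whichever conversion you are testing.

```cpp
#include <chrono>

// Hypothetical helper: runs fn() `iterations` times and returns the mean
// wall-clock time per call in milliseconds (0.0 if iterations <= 0).
template <typename Fn>
double AverageMillis(Fn&& fn, int iterations) {
  using Clock = std::chrono::steady_clock;
  const Clock::time_point start = Clock::now();
  for (int i = 0; i < iterations; ++i) {
    fn();
  }
  const std::chrono::duration<double, std::milli> elapsed =
      Clock::now() - start;
  return iterations > 0 ? elapsed.count() / iterations : 0.0;
}
```

Averaging over many frames on an otherwise idle device keeps other threads from skewing the result, which seems to matter here given that the libyuv figure changed once the rest of the app was optimized.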

Frank Barchard

Aug 9, 2012, 2:42:43 AM

to discuss...@googlegroups.com

Hi!

7 ms does seem slow. For reference, on an x86 i7 (Nehalem) with SSSE3, a 1280x720 frame takes:

ConvertToARGBBuffer (0.783 ms)

My first concern would be whether you're actually running the Neon code. One way to tell is to profile it.

You've built it with Neon, but the code still checks the CPU at runtime by scanning /proc/cpuinfo for Neon.
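
To see what such a scan would find on the device, something along these lines can be run from the app. This is a rough sketch and not libyuv's actual detection code; the function names are made up for illustration, and it assumes a 32-bit ARM Linux kernel, where the "Features" line of /proc/cpuinfo lists `neon`.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: returns true if any "Features" line in the given
// /proc/cpuinfo text lists "neon" as a whole word.
bool CpuInfoTextHasNeon(const std::string& cpuinfo_text) {
  std::istringstream lines(cpuinfo_text);
  std::string line;
  while (std::getline(lines, line)) {
    if (line.compare(0, 8, "Features") != 0) continue;  // other keys
    const std::string::size_type colon = line.find(':');
    if (colon == std::string::npos) continue;
    std::istringstream words(line.substr(colon + 1));
    std::string word;
    while (words >> word) {
      if (word == "neon") return true;
    }
  }
  return false;
}

// Hypothetical helper: reads /proc/cpuinfo and applies the check above.
// Returns false if the file cannot be opened.
bool DeviceReportsNeon() {
  std::ifstream file("/proc/cpuinfo");
  if (!file) return false;
  std::stringstream buffer;
  buffer << file.rdbuf();
  return CpuInfoTextHasNeon(buffer.str());
}
```

Logging the result of `DeviceReportsNeon()` once at startup would show whether the runtime check can succeed on the Galaxy S2. If your checkout's `libyuv/cpu_id.h` exposes `TestCpuFlag` and `kCpuHasNEON`, querying those directly is likely the more reliable check.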

A profile on a neon machine is posted in this issue

http://code.google.com/p/libyuv/issues/detail?id=67

It shows a shift taking time, which doesn't make sense. But perhaps some tweaks in the code could help.

And load/store are slow, which is more expected. Load/store can be sped up with alignment, more registers and preload.

The current method uses 3 vector multiplies for contributions to R, G, B, the same as the SSSE3 code.
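
As a scalar illustration of those per-channel contributions, here is a sketch of one pixel converted with the widely used fixed-point BT.601 video-range approximation. The coefficients (298, 409, 100, 208, 516) are an assumption for illustration and are not necessarily the constants libyuv uses; the vector code does the equivalent arithmetic on many pixels at once.

```cpp
#include <cstdint>

struct Rgb {
  uint8_t r, g, b;
};

// Clamps an intermediate result to the 0..255 range of a byte.
static inline uint8_t ClampToByte(int v) {
  return static_cast<uint8_t>(v < 0 ? 0 : (v > 255 ? 255 : v));
}

// Converts one video-range YUV sample to RGB in 8.8 fixed point.
// Assumed coefficients: common BT.601 approximation, not libyuv's own.
Rgb YuvToRgbPixel(uint8_t y, uint8_t u, uint8_t v) {
  const int c = 298 * (y - 16);  // Y contribution, shared by R, G and B
  const int d = u - 128;         // U contributes to G and B
  const int e = v - 128;         // V contributes to R and G
  Rgb out;
  out.r = ClampToByte((c + 409 * e + 128) / 256);
  out.g = ClampToByte((c - 100 * d - 208 * e + 128) / 256);
  out.b = ClampToByte((c + 516 * d + 128) / 256);
  return out;
}
```

A scalar reference like this is also handy for spot-checking the output of an optimized path on a few known pixels, allowing for small rounding differences.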

Previously the x86 code used 3 table lookups (6k). That was faster on older CPUs where math was slow and tables/memory were fast. It was also faster on large images, where the tables became cached, but slow on small images where cache misses

Make sure your buffers are 16 byte aligned.

It should help now, but may be the first thing I could optimize for. The code will still work, but if the buffers are unaligned it falls back on slower code.
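
A minimal sketch of checking and obtaining 16-byte alignment, assuming a C++11 toolchain. `IsAligned16` and `AlignedBuffer` are hypothetical names, not libyuv APIs; the buffer over-allocates by 15 bytes and offsets the start pointer, which avoids relying on any platform-specific aligned allocator.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if the pointer is a multiple of 16 bytes.
bool IsAligned16(const void* p) {
  return (reinterpret_cast<uintptr_t>(p) & 15) == 0;
}

// Hypothetical helper: owns a byte buffer whose data() pointer is
// 16-byte aligned. Copying is disabled because data_ points into storage_.
class AlignedBuffer {
 public:
  explicit AlignedBuffer(size_t size) : storage_(size + 15), size_(size) {
    const uintptr_t raw = reinterpret_cast<uintptr_t>(storage_.data());
    const size_t offset = (16 - (raw & 15)) & 15;  // bytes to skip, 0..15
    data_ = storage_.data() + offset;
  }
  AlignedBuffer(const AlignedBuffer&) = delete;
  AlignedBuffer& operator=(const AlignedBuffer&) = delete;

  uint8_t* data() { return data_; }
  size_t size() const { return size_; }

 private:
  std::vector<uint8_t> storage_;
  size_t size_;
  uint8_t* data_;
};
```

For the 800x480 case, `AlignedBuffer rgba(800 * 480 * 4);` would give an aligned destination. For every row to start aligned, the stride must also be a multiple of 16, which holds for 800-wide Y rows, 400-wide U/V rows and 3200-byte RGBA rows. It is worth running `IsAligned16` on the plane pointers coming out of the decoder as well.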
