KYC-Documents-Verif-SDK & GPGPU

Luka Pribanić

May 22, 2025, 1:13:31 PM
to doubango-ai
Dear all,

I got the KYC SDK working: yesterday with the pre-built C binary, today with Python (3.11.2).
I've built the Python extension, tested it on my laptop CPU, and got small test batches processing at about 1 s/image (a 12-image batch in about 11 s, give or take a bit depending on the run).

Then I proceeded to GPGPU testing. At the moment I have:
- NVIDIA-SMI 575.51.03
- Driver Version: 572.16
- CUDA Version: 12.8
- NVIDIA GeForce RTX 3060 (Mobile)
- cuda-toolkit-12-8 (12.8.1-1)
- ...

But I can't get the code to run on GPU.

I've tried some other bits of Python code, unrelated to this SDK, just to verify GPU accessibility, and I get:
1)
import tensorflow as tf
gpus = tf.config.list_physical_devices('GPU')
...
- GPUs detected: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

2)
pip install torch torchvision
...
import torch

print("CUDA available:", torch.cuda.is_available())
print("CUDA devices count:", torch.cuda.device_count())
...
CUDA available: True
CUDA devices count: 1
Tensor on device: cuda:0
Computation succeeded on GPU!

3)
...
All devices: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
GPUs: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
I0000 00:00:1747920668.431684   12091 gpu_device.cc:2019] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3586 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
Result device: /job:localhost/replica:0/task:0/device:GPU:0
Computation succeeded on GPU!
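
For reference, a minimal self-contained version of these checks (just a sketch, not the exact scripts I ran) would be:

import tensorflow as tf
import torch

# List the GPUs visible to TensorFlow
print("GPUs detected:", tf.config.list_physical_devices('GPU'))

# Check CUDA through PyTorch and run a tiny computation on the GPU
print("CUDA available:", torch.cuda.is_available())
print("CUDA devices count:", torch.cuda.device_count())
if torch.cuda.is_available():
    x = torch.rand(1000, 1000, device='cuda')
    y = x @ x  # simple matmul on the GPU
    print("Tensor on device:", y.device)
    print("Computation succeeded on GPU!")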

While running these samples I also see activity in nvidia-smi (in a second terminal window), where the running process shows up for at least a moment.

I've followed all the instructions I could find on your GitHub and tried the TensorFlow libs linked there.

No change in execution speed with either, and nothing I can catch with nvidia-smi even when I run it with a high refresh rate, e.g.:
nvidia-smi --loop-ms=200

I've looked for config options and read the docs here:
https://www.doubango.org/SDKs/kyc-documents-verif/docs/
... but apart from GPU memory control in the JSON config I don't see anything special needed to enable GPU processing.

I've been testing in Debian 12 (bookworm) on WSL inside my Windows 11.

I've set up Python 3.11.2 with python3.11-venv.

ldd libKYCDocumentsVerifSDK.so
returns:
        linux-vdso.so.1 (0x00007ffe9bbfb000)
        libturbojpeg.so.0 (0x00007fb9b0e00000)
        libpng16.so.16 (0x00007fb9b0a00000)
        libtensorflow.so.1 (0x00007fb983a5b000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb9b1f77000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb9b1f70000)
        libheatmap.so (0x00007fb983400000)
        libtps.so (0x00007fb983000000)
        libusac.so (0x00007fb982a00000)
        libumeyama.so (0x00007fb982600000)
        libiconv.so.2 (0x00007fb982200000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fb981fe6000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb9b1e8e000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb9b1e6e000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb981e05000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fb9b1f8c000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fb9b1e4f000)
        libtensorflow_framework.so.2 => /mnt/c/#Local/DoubangoTelecom/KYC-Documents-Verif-SDK/binaries/linux/x86_64/libtensorflow_framework.so.2 (0x00007fb979a80000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb9b1e48000)

The Python extension compiled successfully, though I had to modify setup.py a little by adding:
Extension(.....
     extra_compile_args=['-std=c++11']
)
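
For context, here is roughly what the full Extension entry looks like with that flag (module and source names below are illustrative, not the SDK's actual setup.py):

from setuptools import setup, Extension

setup(
    name='KycVerifSdk',
    ext_modules=[
        Extension(
            'KycVerifSdk',                          # illustrative module name
            sources=['kyc_verif_sdk_python.cxx'],   # illustrative SWIG wrapper source
            libraries=['KYCDocumentsVerifSDK'],     # link against the SDK .so shown above
            extra_compile_args=['-std=c++11'],      # the flag I had to add
        )
    ],
)
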
Note that I compiled the extension BEFORE I installed some of the NVIDIA and TensorFlow packages and libs; could that be an issue?

At the moment I have these installed in my venv (mostly for testing with TensorFlow and Torch; the rest were installed as dependencies):
pip list
Package                      Version
---------------------------- ------------
absl-py                      2.2.2
astunparse                   1.6.3
certifi                      2025.4.26
charset-normalizer           3.4.2
Cython                       3.1.1
filelock                     3.13.1
flatbuffers                  25.2.10
fsspec                       2024.6.1
gast                         0.6.0
google-pasta                 0.2.0
grpcio                       1.71.0
h5py                         3.13.0
idna                         3.10
Jinja2                       3.1.3
keras                        3.10.0
libclang                     18.1.1
Markdown                     3.8
markdown-it-py               3.0.0
MarkupSafe                   3.0.2
mdurl                        0.1.2
ml_dtypes                    0.5.1
mpmath                       1.3.0
namex                        0.0.9
networkx                     3.3
numpy                        2.1.3
nvidia-cublas-cu11           11.11.3.6
nvidia-cuda-cupti-cu11       11.8.87
nvidia-cuda-nvrtc-cu11       11.8.89
nvidia-cuda-runtime-cu11     11.8.89
nvidia-cudnn-cu11            9.1.0.70
nvidia-cufft-cu11            10.9.0.58
nvidia-curand-cu11           10.3.0.86
nvidia-cusolver-cu11         11.4.1.48
nvidia-cusparse-cu11         11.7.5.86
nvidia-nccl-cu11             2.21.5
nvidia-nvtx-cu11             11.8.86
opt_einsum                   3.4.0
optree                       0.15.0
packaging                    25.0
pillow                       11.0.0
pip                          23.0.1
protobuf                     5.29.4
Pygments                     2.19.1
requests                     2.32.3
rich                         14.0.0
setuptools                   66.1.1
six                          1.17.0
sympy                        1.13.3
tensorboard                  2.19.0
tensorboard-data-server      0.7.2
tensorflow                   2.19.0
tensorflow-io-gcs-filesystem 0.37.1
termcolor                    3.1.0
torch                        2.7.0+cu118
torchaudio                   2.7.0+cu118
torchvision                  0.22.0+cu118
triton                       3.3.0
typing_extensions            4.13.2
urllib3                      2.4.0
Werkzeug                     3.1.3
wheel                        0.45.1
wrapt                        1.17.2

These are the ones I installed manually:
cython==3.1.1
tensorflow==2.19.0
torchaudio==2.7.0+cu118
torchvision==0.22.0+cu118

Cython to build the Python extension, and the other three to test whether my GPU is accessible in WSL and Python (which it is), but that was only after I failed to make the SDK run on GPU.

I also installed a bunch of Intel oneAPI packages:
apt install intel-oneapi-...
OpenVINO seems to be working fine according to the logs.
If needed, I can list all Intel packages currently on the system.

My laptop has an AMD Ryzen 5800H and an RTX 3060 Laptop/Mobile GPU.
The actual hardware that the future SDK-based app will run on is to be decided based on these tests.

One of the pages mentioned checking these dependencies as well, and I see two missing, but the code works, unless that's somehow tied to the CUDA/NVIDIA issue (unlikely, since the lib has "vino" in its name). If I can fix this somehow please let me know, I'd be happy to!
ldd libplugin_vino.so
        linux-vdso.so.1 (0x00007ffcc35aa000)
        libinference_engine_legacy.so => not found
        libinference_engine.so => not found
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fa138fd8000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fa138fb8000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa138dd7000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fa138cf7000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fa13943b000)

GCC used to compile the extension:
gcc --version
gcc (Debian 12.2.0-14) 12.2.0

There are several GPU/CUDA-related lines (with errors) in the run log, but I don't know how to interpret them; e.g. the /home/ultimate/... paths come from the SDK build and don't exist in my environment, so I can't work out how I could fix them. Here is an excerpt (full log linked below):
...
*[COMPV INFO]: [CompVGpu] Initializing [gpu] module (v 1.0.0)...
***[COMPV ERROR]: function: "CompVGpu_isCudaSupported()"
file: "/home/ultimate/compv/gpu/compv_gpu.cxx"
line: "114"
message: [CompVGpu] cuInit failed with error code 100
...
*[COMPV INFO]: [CompVGpu] GPU enabled: true
*[COMPV INFO]: /!\ Code in file '/home/ultimate/ultimateBase/lib/source/ultimate_base_engine.cxx' in function 'init' starting at line #84: Not optimized for GPU -> GPGPU computing not enabled or deactivated
...
***[COMPV ERROR]: function: "isCudaAvailable()"
file: "/home/ultimate/IdentityOCR/SDK_dev/lib/source/kyc_verif_sdk_private_engine.cxx"
line: "661"
message: [KycVerifSdkEnginePrivate] cuInit failed with error code 100
...


Note that I've tried setting the JSON config to output only errors, but I still get info messages as well. This isn't preventing the code from running, just a note.

If you need me to provide any other info please let me know.

Thank you in advance,
Luka

Luka Pribanić

May 22, 2025, 1:25:45 PM
to doubango-ai
Hi all,

I have found the reason!

The WSL environment was missing this in the library path:
export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH

Now I see in log:
*[COMPV INFO]: [CompVGpu] === Number of CUDA devices: 1 ===
...
*[COMPV INFO]: [KycVerifSdkEnginePrivate] === Number of CUDA devices: 1 ===
...
*[PLUGIN_VINO INFO]: numInstances_Embeddings=1
I0000 00:00:1747934029.704048     728 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 614 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
I0000 00:00:1747934030.050662     728 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 614 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
I0000 00:00:1747934030.227400     728 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 614 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6

And Task Manager shows almost full GPU usage, and I can see the process on the GPU with nvidia-smi, e.g.:
PID 728
Process name /python3.11

... it just takes so long ...
→ Done in 27.066 sec

The CPU without CUDA was doing the same task in ~11-12 s; this now needs more than twice as long instead of taking less time, as I had expected.
I'll try lowering the workers from 12 to 3, as it seems to create 3 instances, if I'm reading the debug output correctly (?)

I wanted to reply ASAP so no one wastes time looking for solutions to the original issue.

If someone has suggestions as to why the GPU processing is so slow, please let me know.

Kind regards,
Luka

Mamadou DIOP

May 22, 2025, 1:35:57 PM
to Luka Pribanić, doubango-ai

Hi,

Good to see you have fixed one issue.

1/ Tensorflow 2.18 requires CUDA 12.5 while you're using 12.8

2/ Using Tensorflow on Linux on a Windows 11 host via WSL and trying to configure GPU is a disaster in the making

3/ Your GPU is stalling. If you write simple Python code doing some benchmarking you'll have the same issue.

4/ My advice: don't use WSL


Mamadou DIOP

May 22, 2025, 1:42:35 PM
to Luka Pribanić, doubango-ai

How did you get the "27.066 sec"?

You should use the benchmark app; if you didn't, please explain how you got that duration.

A GPU takes much more time to process the first image; this is why there is a warm-up in all the benchmark apps you can find on the internet. Check https://github.com/DoubangoTelecom/KYC-Documents-Verif-SDK/blob/fb4bd2bf7fbfaa207560f77feb235d5e0c328f5a/samples/cpp/benchmark/benchmark.cxx#L210

Luka Pribanić

May 22, 2025, 1:45:59 PM
to doubango-ai
Hello, and - thank you for a VERY quick reply!

OK, WSL was just meant to provide a quick testing ground; it wasn't meant for anything serious. I've seen it mentioned on one of your pages in the context of "better WSL than Windows", but that probably wasn't meant for GPU, so I was misled a little.

Still, I learned a good lesson and finally got everything to work. A good base to continue from. Next up is running everything on bare metal; I'll try to set up dual boot on my own laptop for now.

Would you suggest Ubuntu or Debian, and which version, from your own experience?
Or do you think GPGPU is better off running directly on Windows?
For this project we will be setting up new hardware with an OS of our choice, so any advice from you would be appreciated; I'd like to test the best possible OS & hardware combination without wasting too much time on benchmarks across different CPU or GPU vendors, etc.

Thanks again,
Luka

Mamadou DIOP

May 22, 2025, 1:53:58 PM
to Luka Pribanić, doubango-ai

Google dropped TensorFlow GPU support on native Windows (after TF 2.10), so you won't be able to run it directly on Windows unless you use TF 2.6 + CUDA 11. We only support TF 2.6.0, 2.14.0, 2.16.1 and 2.18.0

We use Ubuntu, so I'd recommend that OS

Luka Pribanić

May 22, 2025, 1:55:26 PM
to doubango-ai
Regarding timing:
At first I modified the verify script so that I don't time anything at the start (engine load takes 15+ seconds); I let the engine load first, then only time the calls to:
    KycVerifSdk.KycVerifSdkEngine_process(data, size)

I later experimented with creating worker threads, which worked well on CPU and let me process 12 images at once, easily taking my CPU from 20% to 80-85% load:

import time
import concurrent.futures

# process_image() and the images list are defined earlier in my script

# get starting time to calculate how long the processing took
start = time.perf_counter()

with concurrent.futures.ThreadPoolExecutor(max_workers=12) as executor:
    futures = [executor.submit(process_image, img) for img in images]
    for f in concurrent.futures.as_completed(futures):
        f.result()

# get final time and display it
elapsed = time.perf_counter() - start
print(f"→ Done in {elapsed:.3f} sec")

This code would process 12 images in a bit more than 11 s, e.g. in my CPU log:
→ Done in 11.416 sec

I left that worker code in place when I did this trial GPGPU run, and tested with 12, 3 and just 1 worker, each time with 12 images; each time I get about 26-27 s.
→ Done in 27.066 sec
So a bit more than 2 s per image (sorry if I confused you with such a large number, it is NOT 26 s per image!).

Still, the CPU at about 85%, never reaching 100% even for a moment, does under 1 s per image. The GPU (for whatever reason: WSL, stalling, etc.) did about 2.25 s per image.

I hope that makes it clear.
I'll look for ways to re-test tomorrow on my own laptop without WSL and emulation/virtualization, just bare metal. But as I'm not in the office till Monday, I don't have access to any other bare-metal GPU right now.

Regards,
Luka

Luka Pribanić

May 22, 2025, 1:57:43 PM
to doubango-ai
Good to know; a lot of time saved in a quick exchange of information.
I'll try Ubuntu on bare metal then. I see the NVIDIA repo supports 24.04; would that be OK, or should I hold off and use 22.04 instead?

Regards,
Luka

Mamadou DIOP

May 22, 2025, 2:05:51 PM
to Luka Pribanić, doubango-ai

Using workers will slow down the code. Check https://www.doubango.org/SDKs/kyc-documents-verif/docs/Architecture_overview.html#thread-safety

The process function is auto-locked, which means only one thread can run it at a time; all the others will be blocked.

The C++ code looks like this:

int process() {
    COMPV_AUTOLOCK(mutex); // <- all your workers will be locked here
    ....
}

and we don't support parallel processing (https://www.doubango.org/SDKs/kyc-documents-verif/docs/Parallel_processing.html) with Python
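
In Python, a plain sequential loop does the same work without fighting that lock. A rough sketch, reusing the process_image helper and timing from your earlier snippet:

import time

# Since process() is auto-locked, one call at a time is all you can get anyway;
# process_image() and images come from your own script.
start = time.perf_counter()
for img in images:
    process_image(img)
elapsed = time.perf_counter() - start
print(f"→ Done in {elapsed:.3f} sec")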

Luka Pribanić

May 22, 2025, 2:17:57 PM
to doubango-ai
OK, workers worked with Python on CPU; I just ran the same code to see if the GPU would "light up". I'll keep all this in mind.
I'm not a pro dev, so Python was my first choice. If I get good enough performance from it, good;
if not... we'll look for a dev to take a look.

At the moment it's all about confirming the SDK satisfies our needs, and so far it looks great.

Thanks a million for this discussion. If you're in France it's getting late, so I wish you a nice evening; if I hit any particular showstoppers tomorrow I'll write here. You've been really nice and helpful!

Kind regards,
Luka

Luka Pribanić

May 23, 2025, 9:59:55 AM
to doubango-ai
Just some quick info before I log off for the weekend...

I got Ubuntu 24.04 running with:
Driver 555.42.06
Cuda 12.5
Tensorflow (GPU) 2.18.0 
Python 3.11.12

I used TensorFlow 2.18.0 as the latest supported version, then CUDA 12.5 as you pointed out (and indeed the NVIDIA website says the same), and the recommended driver from the same official NVIDIA page (555.42.06).

I used the latest Python 3.11.12, as I got errors with Python 3.12.x (something deprecated was causing issues when running the Python SWIG extension),
... and I didn't want to complicate things further, so I reverted to 3.11.x.

As for results, with the same 12 images I got the GPU loop (with 1 worker) running just fine, at around 14 s for 12 images. So a little slower than the CPU in WSL (I forgot to test the CPU on bare metal, sorry!).
I could see the GPU running in nvidia-smi, hitting 85-95 W out of the 115 W limit.

I guess next week I'll see if I can get a dev to make a similar demo using C++ or C#. If you have any suggestions on what would be preferred language for CUDA/GPGPU vs CPU please let me know.

Thank you for your help, have a nice weekend!
Luka

Mamadou DIOP

May 23, 2025, 10:17:11 AM
to Luka Pribanić, doubango-ai

RTX3060 can process 7 images per second. Your numbers show you're 8 to 9 times slower.

Benchmark numbers: https://github.com/DoubangoTelecom/KYC-Documents-Verif-SDK/tree/main/samples/cpp/benchmark#peformance-numbers

Collect logs for 3 scenarios: GPU only, CPU only, and both with work-balancing. Use the California driving license or any other public image (an image you can share).

GPU only:

LD_LIBRARY_PATH=../../../binaries/linux/x86_64:$LD_LIBRARY_PATH ./benchmark \
    --image "../../../assets/images/United States - California Driving License (2017).jpg" \
    --assets ../../../assets \
    --loops 20 \
    --vino_activation "off" \
    --gpu_ctrl_mem false \
    --parallel true

CPU only:

LD_LIBRARY_PATH=../../../binaries/linux/x86_64:$LD_LIBRARY_PATH ./benchmark \
    --image "../../../assets/images/United States - California Driving License (2017).jpg" \
    --assets ../../../assets \
    --loops 20 \
    --vino_activation "on" \
    --gpu_ctrl_mem false \
    --parallel true

Both:

LD_LIBRARY_PATH=../../../binaries/linux/x86_64:$LD_LIBRARY_PATH ./benchmark \
    --image "../../../assets/images/United States - California Driving License (2017).jpg" \
    --assets ../../../assets \
    --loops 20 \
    --vino_activation "auto" \
    --gpu_ctrl_mem false \
    --parallel true

Luka Pribanić

May 23, 2025, 1:50:54 PM
to doubango-ai
You're right; running the C++ binary from /KYC-Documents-Verif-SDK/binaries/linux/x86_64/ on the California sample I get:
# GPU only benchmark
*[KYC_VERIF_SDK INFO]: *** elapsedTimeInMillis: 2327.433085, notified: 20, estimatedFps: 8.593158 ***
(I also tried with my own JPG sample and got: *[KYC_VERIF_SDK INFO]: *** elapsedTimeInMillis: 2193.192963, notified: 20, estimatedFps: 9.119125 ***)
# CPU only
*[KYC_VERIF_SDK INFO]: *** elapsedTimeInMillis: 13937.574390, notified: 20, estimatedFps: 1.434970 ***
# Both
*[KYC_VERIF_SDK INFO]: *** elapsedTimeInMillis: 2585.830447, notified: 20, estimatedFps: 7.734459 ***

I did those runs without any changes to the environment.

I then tried my Python code with just these changes to the OpenVINO setting in the JSON config, no other changes to the environment:
- OpenVINO OFF = 16.7 s (activity on the GPU, over 100 W at moments)
- OpenVINO AUTO + CPU = 14.6 s (what I had earlier; over 100 W on the GPU as well)
- OpenVINO ON + CPU = 8.1 s -> ran only on CPU, no activity in nvidia-smi (expected)

But as you said, you don't support parallel processing with Python, so that may be why.
Though it could also be that your benchmark loads one image and processes it 20x, while I want to load different images in each processing run, as that would be more realistic.

I'll give the C++ code a try on Monday and make a benchmark loop that loads 20 images and processes each once, then compare that to loading a single image and processing it 20x.

Though I have to say, 8 s for 12 images in my Python code run on CPU only is already good for our general needs; putting this on a 16-24-core server would give 2-3x the speed.
But, indeed, why waste a server if a 3060/4060-class GPU should do even better... We'll see.

Thanks,
Luka

Luka Pribanić

May 23, 2025, 3:04:18 PM
to doubango-ai
OK, it wasn't that complicated; the code is now running in C++. With GPT's help I modified the benchmark to load a folder (containing 20 images) and process each image just once.

Ran it on GPU with:

LD_LIBRARY_PATH=../../../binaries/linux/x86_64:$LD_LIBRARY_PATH ./benchmark \
    --assets ../../../assets \
    --vino_activation "off" \
    --gpu_ctrl_mem false \
    --parallel true

1st run with my pictures:
*[KYC_VERIF_SDK INFO]: *** elapsedTimeInMillis: 12524.062984, notified: 20, estimatedFps: 1.596926 ***
2nd run, same pictures:
*[KYC_VERIF_SDK INFO]: *** elapsedTimeInMillis: 12741.215321, notified: 20, estimatedFps: 1.569709 ***
3rd run was with 20 copies of your California licence (but still 20 files):
*[KYC_VERIF_SDK INFO]: *** elapsedTimeInMillis: 4119.747141, notified: 20, estimatedFps: 4.854667 ***

I can see 20 JSONs with sample data so it seems to run fine.

I then ran it on CPU with:

LD_LIBRARY_PATH=../../../binaries/linux/x86_64:$LD_LIBRARY_PATH ./benchmark \
    --assets ../../../assets \
    --vino_activation "on" \
    --gpu_ctrl_mem false \
    --parallel true

1st run with my images:
*[KYC_VERIF_SDK INFO]: *** elapsedTimeInMillis: 14788.718700, notified: 20, estimatedFps: 1.352382 ***
2nd run with 20x California license:
*[KYC_VERIF_SDK INFO]: *** elapsedTimeInMillis: 15911.949754, notified: 20, estimatedFps: 1.256917 ***

Looking at the CPU numbers, that's comparable to your original binary, so on CPU this code runs about as fast as the original benchmark, just a bit slower (probably due to reloading the images).

The GPU bench with 20 copies of the California license is about 43% slower than the original benchmark binary iterating 20x over the same file, but I could be OK with that; it can be attributed to the multiple file loads, since the file now needs to be loaded to the GPU each time, probably with much higher latency than loading an image for the CPU.

But my own samples are way slower on GPU. Note that I did not change the binary or the environment at all, just re-ran with a different set of 20 images in the folder.
Since the image change alone caused the speedup, I can only guess that my own samples are "somewhat imperfect" (I'm being polite here; these are photos people took with a phone, so not perfect at all: shadows, reflections, etc.). Probably a lot of time is spent aligning the images, finding the actual ID card (photos were taken of IDs lying on a table and such), cleaning up reflections, etc. Your sample is nicely cropped and clean, so probably much easier to process.

Based on this I'd say YES, the C++ code is obviously faster on GPU: 12.5-12.7 s for 20 imperfect images, i.e. 1.56-1.59 FPS, vs Python at 0.82 FPS on the same samples, so a 2x speedup.
Still, for some reason the CPU doesn't get hit nearly as hard by the "imperfect" sample photos. I also noticed the CPU is slightly less accurate, while the GPU is almost perfect, so that has to be taken into consideration.

Anyway, since a CPU close to a 5800X (~170 €) is maybe 15-20% slower than a GPU that also costs about 170 €, I'd say it's a tie in this early benchmark.
On the other hand, your RTX 4060 benchmarks (a 300 € GPU) handily beat a 500 € CPU (7950X), so we'll probably give the GPU another try with a proper developer next week.

It was interesting nonetheless, and as I said, at 1.5+ FPS on a laptop GPU this already easily satisfies our needs; with better hardware and a proper developer, I'm sure this will be fine. Any improvements we squeeze out from here will just be a bonus for future needs.

I should really pack it in; it's 9 PM here as it is with you, and I should be enjoying my weekend :-) Cheers, I'll report back next week!

Luka

Mamadou DIOP

May 23, 2025, 3:13:50 PM
to Luka Pribanić, doubango-ai
Share your C++ code.




Luka Pribanić

May 23, 2025, 4:03:35 PM
to doubango-ai
Hey,

just so you know, I don't expect you to "fix" my code... But since you asked, I've copied it to a gist here:
https://gist.github.com/luxzg/98077666ba85b8e5c71d84d56b98f405

It's the weekend, take some time off :-) We can do this next week just the same; the work won't go away...

Regards,
Luka

Mamadou DIOP

May 25, 2025, 9:38:01 PM
to Luka Pribanić, doubango-ai

The issue is that you've included the JPEG decoding in the timing (https://gist.github.com/luxzg/98077666ba85b8e5c71d84d56b98f405#file-benchmark-cxx-L174), plus the disk access to read the file on each loop iteration.

The jpeg decoder used is at https://github.com/DoubangoTelecom/KYC-Documents-Verif-SDK/blob/main/samples/cpp/stb_image.h and not optimized at all.

The process function has 3 variants:

    1/ https://www.doubango.org/SDKs/kyc-documents-verif/docs/cpp-api.html#_CPPv4N8KycVerif17KycVerifSdkEngine7processEK24KYC_VERIF_SDK_IMAGE_TYPEPKvK6size_tK6size_tK6size_tKi

    2/ https://www.doubango.org/SDKs/kyc-documents-verif/docs/cpp-api.html#_CPPv4N8KycVerif17KycVerifSdkEngine7processEK24KYC_VERIF_SDK_IMAGE_TYPEPKvPKvPKvK6size_tK6size_tK6size_tK6size_tK6size_tK6size_tKi

    3/ https://www.doubango.org/SDKs/kyc-documents-verif/docs/cpp-api.html#_CPPv4N8KycVerif17KycVerifSdkEngine7processEPKvK6size_t

You're using the 1st version, which requires raw/uncompressed data. The 3rd version accepts compressed data and uses libjpeg-turbo to decode the image. Use the 3rd version in your benchmark app and check whether it's faster.

If your images are not very large (in terms of resolution, not size on disk), I'd bet the slowdown is caused by disk access, as the JPEG decoding would be fast.

It's common practice not to include image decoding in benchmarking.
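
The same idea in Python terms, as a sketch (process(data, size) is the call from your earlier script and already takes the compressed bytes; the path below is illustrative): read the files into memory up front, then time only the process() calls.

import glob
import time

import KycVerifSdk  # the SDK's Python module; engine assumed to be initialized elsewhere

# Read the compressed JPEG bytes up front, outside the timed region
blobs = [open(p, 'rb').read() for p in sorted(glob.glob('images/*.jpg'))]

start = time.perf_counter()
for data in blobs:
    # Only the SDK call is timed: no disk access and no decoding inside the loop
    KycVerifSdk.KycVerifSdkEngine_process(data, len(data))
elapsed = time.perf_counter() - start
print(f"{len(blobs)} images in {elapsed:.3f} s -> {len(blobs) / elapsed:.2f} FPS")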

Mamadou DIOP

May 25, 2025, 9:49:10 PM
to Mamadou DIOP, Luka Pribanić, doubango-ai
Another important issue in your code is that you’re waiting after each call: https://gist.github.com/luxzg/98077666ba85b8e5c71d84d56b98f405#file-benchmark-cxx-L180
This adds significant delay and completely disables multi-threading.
You must not wait for the result after calling process(). The result will be delivered asynchronously via the callback.
To disable parallel processing and use sequential calls: "--parallel false" https://github.com/DoubangoTelecom/KYC-Documents-Verif-SDK/tree/main/samples/cpp/benchmark#usage

Luka Pribanić

May 26, 2025, 4:05:52 AM
to doubango-ai
OK, thanks for these suggestions. I tried changing the code, but I don't gain much: if I push too hard I hit memory limits, and if I go too low I essentially disable parallelism altogether.

The thing is, the original benchmark is somewhat idealistic, and I don't mean "false": it was made to show off the speed of the SDK itself, and it does that beautifully.
In reality we will never process the same image 20x in a row, and we need to load/fetch the image data somehow (be it from disk, some API, or whatever).
So some loading & decoding delay will always happen, and I will always need to account for it to get a more realistic speed for the whole processing pipeline.

I've run the original benchmark, and indeed you've shown me (and I've confirmed) that the idealized speed of the SDK itself is really fast, around 8.6 FPS on my laptop, which is great.

It seems that when combined with loading images from some source, decoding, and processing, plus some kind of data writes on our side, this will be more in the range of maybe 2 FPS.

This is still OK; my laptop isn't a server and is constrained by hardware power limits, the amount of memory (both RAM and VRAM), etc. We'll play with this some more, get an actual dev to optimize, and see where we land.

My next goal is to include this in a fake/demo pipeline where the image data arrives from something like Redis, and then experiment with whether to stick to serial processing or try more parallelism.

Please don't waste your precious time on further assistance for now; I think we mostly have all the info, the SDK is working, and now it's down to tweaking and optimizing.

Thank you a lot for everything!
Kind regards,
Luka

Luka Pribanić

May 26, 2025, 6:21:06 AM
to doubango-ai
Me again,

I made a small test loop with a Redis queue and a small C++ service with your SDK listening on that queue; when I fill the queue with 20 images I get them processed and the replies sent back in about 5 seconds. The whole thing may not be ideal to measure with such small timings, but everything together (reading 20 images, encoding them in Base64, sending them to Redis, picking them up from the Redis queue, processing with your SDK, returning the results as JSON to a second Redis queue, and reading and printing the formatted JSON from that queue) takes about 5 seconds, locally, all running on a single laptop, including using the same GPU for the display and assorted apps like Chrome and a text editor.

For example, I've added a date/time print with milliseconds when the "image sender" starts, at the very start of the script (imperfect for benchmarking the SDK, but realistic for my needs):
=== START 2025-05-26 12:01:17.230 ===
and then made a listener that captures and decodes the JSON results from the SDK (received via Redis), exits once it gets the 20th image, and prints the date/time again:
=== END 2025-05-26 12:01:22.463 ===

That means 5 seconds and 233 milliseconds for 20 images, full workflow. I think that's a fair number for the whole pipeline, about 4 FPS (but including EVERYTHING, so literally the worst case).
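
For the curious, the sender/listener half of that loop looks roughly like this in Python with redis-py (queue names and paths are illustrative; the SDK worker itself is the separate C++ service):

import base64
import glob
import json
from datetime import datetime

import redis

r = redis.Redis()
print(f"=== START {datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]} ===")

# Sender: Base64-encode the images and push them onto the input queue
paths = sorted(glob.glob('samples/*.jpg'))
for p in paths:
    r.rpush('kyc:in', base64.b64encode(open(p, 'rb').read()))

# Listener: wait for one JSON result per image on the output queue
for _ in paths:
    _, payload = r.blpop('kyc:out')
    result = json.loads(payload)  # JSON produced by the C++ SDK service

print(f"=== END {datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]} ===")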

I also noticed that even when the service starts and loads everything, there is still some kind of warm-up on the FIRST image, where it says it's loading cuDNN. So the first loop takes 16.5 s, but the second and every subsequent loop takes about 5 seconds, and I don't get the cuDNN message on later runs.
That would probably explain why my initial code ran "slow", as that's about the speed I was getting before.
That was with the OpenVINO setting on auto, so GPU processing.

When run with OpenVINO = on (CPU) I got about the same for the first run, and the next run was again close to my earlier numbers; the CPU probably doesn't need that "warm-up" since it doesn't load cuDNN.

Anyway, the GPU wins, the SDK is fast, and the demo/test "integration" with Redis and a "web API" works; what's left is to talk to the rest of the team and hopefully start doing business with you.

Thanks again!
Luka