Brotli vs Gzip: Brotli decompression takes lot of time

1,329 views
Skip to first unread message

vinoth eswaran

unread,
Jul 13, 2016, 10:21:47 AM7/13/16
to Brotli

Dear Group Users,

  Currently I am working on an embedded project using Minnowboard Turbot. As part of this project, I need to optimize the time taken to load the binary image from SD card and to extract the binary image to RAM. I am considering different compression algorithms like Gzip, LZMA, LZ4 etc, and came across Brotli algorithm. Many test results and documents shows Brotli performing better than Gzip, so I tried to see how it will perform in my test case.

The results that I got are pretty disappointing, as decompression takes lot of time in case of Brotli.

Uncompressed Size:14565941 bytes

Algorithm Type                                      Size in bytes                                          Decompression time

Gzip_level6                                               5678633                                                     228ms
Gzip_level9                                               5665389                                                     226ms
Brotli_level1                                              6327922                                                     534ms
Brotli_level11                                            4782061                                                     568ms
Brotli_level9                                              5205221                                                     509ms


I even had tried different window sizes:15bit,22bit,24bit and almost seeing the same decompression times(+/- 20ms) which is twice the time taken for Gzip. Let me know if I doing anything wrong here.


Analysis on the code:


 I tried to see what is happening in the decompressing part and why it is taking so much time. I identified that there are so many states internally and transition from state 5 to state 14 is taking so much time around 200ms --> ProcessCommandsInternal(). The function ProcessCommandsInternal() has so many states  within and I am seeing that the code is looping around the states which causes the delay. I don#t know how to optimize this,as I am not familiar with Brotli algorithm, any help on this topic will be of great help

[0.553972 0.003998]    Uncompressing Kernel Image ...
[0.557990 0.004018]
[0.558017 0.000027]  brotli decompression start
[0.561002 0.002985] evk state:0
[0.561983 0.000981] ring buffer size:4194304 window_size:22
[0.566076 0.004093] evk state:15
[0.567029 0.000953] evk state:15
[0.568975 0.001946] evk state:15
[0.569994 0.001019] evk state:15
[0.572101 0.002107] evk state:3
[0.573990 0.001889] evk state:21
[0.576027 0.002037] evk state:21
[0.577042 0.001015] evk state:5
[0.579011 0.001969] Process command start
[0.791033 0.212022] Process command stop
[0.793964 0.002931] evk state:14

[0.796992 0.003028] evk state:8
[0.799028 0.002036] Process command start
[0.880119 0.081091] Process command stop
[0.882969 0.002850] evk state:12
[0.884113 0.001144] evk state:1
[0.885157 0.001044] evk state:15
[0.887072 0.001915] evk state:15
[0.888140 0.001068] evk state:15
[0.889967 0.001827] evk state:15
[0.891043 0.001076] evk state:3
[0.893942 0.002899] evk state:21
[0.896050 0.002108] evk state:21
[0.897950 0.001900] evk state:5
[0.899022 0.001072] Process command start
[1.024052 0.125030] Process command stop
[1.026982 0.002930] evk state:14

[1.030050 0.003068] evk state:8
[1.032011 0.001961] Process command start
[1.128030 0.096019] Process command stop
[1.130155 0.002125] evk state:14
[1.133978 0.003823] evk state:8
[1.136037 0.002059] Process command start
[1.169060 0.033023] Process command stop
[1.171197 0.002137] evk state:12
[1.173017 0.001820] evk state:1
[1.174133 0.001116] evk state:15
[1.175165 0.001032] evk state:15
[1.177112 0.001947] evk state:15
[1.178322 0.001210] evk state:15
[1.180104 0.001782] evk state:3
[1.181293 0.001189] evk state:21
[1.183471 0.002178] evk state:21
[1.184169 0.000698] evk state:5
[1.185288 0.001119] Process command start
[1.194015 0.008727] Process command stop
[1.196042 0.002027] evk state:12
[1.201011 0.004969] brotli decompression end
[1.203181 0.002170] BRO 14548992

Thanks & Regards,
Vinothkumar Eswaran

Evgenii Kliuchnikov

unread,
Jul 13, 2016, 11:33:03 AM7/13/16
to Brotli
Hello.

  Thank you for a report. We haven't explored Intel Atom performance... so have no specific optimizations for it.
  Could you, please evaluate performance with some of the compilation options:
  • BROTLI_BUILD_32_BIT disables 64-bit optimizations
  • BROTLI_BUILD_64_BIT forces to use 64-bit optimizations
  • BROTLI_BUILD_ENDIAN_NEUTRAL disables endian-aware optimizations
  • BROTLI_BUILD_LITTLE_ENDIAN forces to use little-endian optimizations
  • BROTLI_BUILD_MODERN_COMPILER forces to use modern compilers built-ins, features and attributes
  • BROTLI_BUILD_PORTABLE disables dangerous optimizations, like unaligned read and overlapping memcpy

  First try them separately, then, if some of them provide performance gain, try co combine them.

  BTW what compiler is used?

Best regards,
  Eugene.

vinoth eswaran

unread,
Jul 14, 2016, 7:44:36 AM7/14/16
to Brotli
Hello,

    The compiler I am using is gcc version 4.6.3.

I tried all the configurations mentioned, but didn't see any performance improvements.

With the BROTLI_BUILD_BIG_ENDIAN -- the decompression fails, it makes sense as Intel is little endian based

With the BROTLI_BUILD_64_BIT , I am getting the following error messages :

[1.399921 0.114735] 4782061 bytes read in 115 ms (39.7 MiB/s)
[1.403965 0.004044]    Uncompressing Kernel Image ...
[1.407913 0.003948]
[1.407983 0.000070]  brotli decompression start
[1.411025 0.003042] General Protection
[1.412884 0.001859] EIP: 0010:[<7b568599>] EFLAGS: 00010002
[1.416906 0.004022] Original EIP :[<fff13599>]
[1.419852 0.002946] EAX: 00000001 EBX: 7833e748 ECX: 00000000 EDX: fffffff9
[1.425035 0.005183] ESI: 00000001 EDI: 01000000 EBP: 7833e848 ESP: 7833e6bc
[1.430915 0.005880]  DS: 0018 ES: 0018 FS: 0020 GS: 0018 SS: 0018
[1.435062 0.004147] CR0: 00000033 CR2: 00000000 CR3: 00000000 CR4: 00000600
[1.440859 0.005797] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[1.446064 0.005205] DR6: ffff0ff0 DR7: 00000400
[1.449103 0.003039] Stack:
[1.449966 0.000863]     0x7833e6fc : 0x7833e848
[1.452891 0.002925]     0x7833e6f8 : 0x5c391d06
[1.455755 0.002864]     0x7833e6f4 : 0x7833fae8
[1.458056 0.002301]     0x7833e6f0 : 0x7833e750
[1.461008 0.002952]     0x7833e6ec : 0x7833fae4
[1.463962 0.002954]     0x7833e6e8 : 0x7833fae0
[1.466876 0.002914]     0x7833e6e4 : 0xf9ad562e
[1.469842 0.002966]     0x7833e6e0 : 0x6eb7781b
[1.472083 0.002241]     0x7833e6dc : 0xe3d562de
[1.475045 0.002962]     0x7833e6d8 : 0x0756abd8
[1.477961 0.002916]     0x7833e6d4 : 0x875fabcb
[1.480914 0.002953]     0x7833e6d0 : 0x957d8f8e
[1.483862 0.002948]     0x7833e6cc : 0x7b58085f
[1.486759 0.002897]     0x7833e6c8 : 0x7833e848
[1.489059 0.002300]     0x7833e6c4 : 0x01000000
[1.492064 0.003005]     0x7833e6c0 : 0x00000001
[1.494956 0.002892] --->0x7833e6bc : 0x7833e748
[1.497860 0.002904]     0x7833e6b8 : 0x00010002
[1.500719 0.002859]     0x7833e6b4 : 0x00000010
[1.503109 0.002390]     0x7833e6b0 : 0x7b568599
[1.505943 0.002834] ### ERROR ### Please RESET the board ###


Thanks & Regards,
Vinothkumar

Evgenii Kliuchnikov

unread,
Jul 14, 2016, 4:18:38 PM7/14/16
to Brotli
64-bit option gave a confusing outcome.
Could you dump "echo | gcc -dM -E -" output, please.
Also, it would be interesting to see if using gcc 5.2+ would make any difference.

vinoth eswaran

unread,
Jul 15, 2016, 3:09:58 AM7/15/16
to Evgenii Kliuchnikov, Brotli
Hi,

The output of "echo | gcc -dM -E -" is,  

I have also attached this in text file for your reference

#define-sh: vies7605@eso9265:~$: not found
 __DBL_-sh: gcc: not found
root@minnowturbot:~# #define __DBL_MIN_EXP__ (-1021)
root@minnowturbot:~# #define __UINT_LEAST16_MAX__ 65535
root@minnowturbot:~# #define __FLT_MIN__ 1.17549435082228750797e-38F
root@minnowturbot:~# #define __UINT_LEAST8_TYPE__ unsigned char
root@minnowturbot:~# #define __INTMAX_C(c) c ## L
root@minnowturbot:~# #define __CHAR_BIT__ 8
root@minnowturbot:~# #define __UINT8_MAX__ 255
root@minnowturbot:~# #define __WINT_MAX__ 4294967295U
root@minnowturbot:~# #define __ORDER_LITTLE_ENDIAN__ 1234
root@minnowturbot:~# #define __SIZE_MAX__ 18446744073709551615UL
root@minnowturbot:~# #define __WCHAR_MAX__ 2147483647
root@minnowturbot:~# #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1
root@minnowturbot:~# #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1
root@minnowturbot:~# #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1
root@minnowturbot:~# #define __DBL_DENORM_MIN__ ((double)4.94065645841246544177e-324L)
root@minnowturbot:~# #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1
root@minnowturbot:~# #define __FLT_EVAL_METHOD__ 0
root@minnowturbot:~# #define __unix__ 1
root@minnowturbot:~# #define __x86_64 1
e __UINTroot@minnowturbot:~# #define __UINT_FAST64_MAX__ 18446744073709551615UL
root@minnowturbot:~# #define __SIG_ATOMIC_TYPE__ int
root@minnowturbot:~# #define __DBL_MIN_10_EXP__ (-307)
root@minnowturbot:~# #define __FINITE_MATH_ONLY__ 0
root@minnowturbot:~# #define __GNUC_PATCHLEVEL__ 3
root@minnowturbot:~# #define __UINT_FAST8_MAX__ 255
eroot@minnowturbot:~# #define __DEC64_MAX_EXP__ 385
root@minnowturbot:~# #define __INT8_C(c) c
root@minnowturbot:~# #define __UINT_LEAST64_MAX__ 18446744073709551615UL
root@minnowturbot:~# #define __SHRT_MAX__ 32767
root@minnowturbot:~# #define __LDBL_MAX__ 1.18973149535723176502e+4932L
root@minnowturbot:~# #define __UINT_LEAST8_MAX__ 255
root@minnowturbot:~# #define __UINTMAX_TYPE__ long unsigned int
root@minnowturbot:~# #define __linux 1
root@minnowturbot:~# #define __DEC32_EPSILON__ 1E-6DF
root@minnowturbot:~# #define __unix 1
root@minnowturbot:~# #define __UINT32_MAX__ 4294967295U
#droot@minnowturbot:~# #define __LDBL_MAX_EXP__ 16384
root@minnowturbot:~# #define __WINT_MIN__ 0U
root@minnowturbot:~# #define __linux__ 1
root@minnowturbot:~# #define __SCHAR_MAX__ 127
root@minnowturbot:~# #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1)
root@minnowturbot:~# #define __INT64_C(c) c ## L
root@minnowturbot:~# #define __DBL_DIG__ 15
efroot@minnowturbot:~# #define _FORTIFY_SOURCE 2
root@minnowturbot:~# #define __SIZEOF_INT__ 4
nroot@minnowturbot:~# #define __SIZEOF_POINTER__ 8
root@minnowturbot:~# #define __USER_LABEL_PREFIX__
root@minnowturbot:~# #define __STDC_HOSTED__ 1
root@minnowturbot:~# #define __LDBL_HAS_INFINITY__ 1
root@minnowturbot:~# #define __FLT_EPSILON__ 1.19209289550781250000e-7F
root@minnowturbot:~# #define __LDBL_MIN__ 3.36210314311209350626e-4932L
root@minnowturbot:~# #define __DEC32_MAX__ 9.999999E96DF
root@minnowturbot:~# #define __INT32_MAX__ 2147483647
efroot@minnowturbot:~# #define __SIZEOF_LONG__ 8
root@minnowturbot:~# #define __UINT16_C(c) c
root@minnowturbot:~# #define __DECIMAL_DIG__ 21
root@minnowturbot:~# #define __gnu_linux__ 1
root@minnowturbot:~# #define __LDBL_HAS_QUIET_NAN__ 1
root@minnowturbot:~# #define __GNUC__ 4
root@minnowturbot:~# #define __MMX__ 1
root@minnowturbot:~# #define __FLT_HAS_DENORM__ 1
root@minnowturbot:~# #define __SIZEOF_LONG_DOUBLE__ 16
deroot@minnowturbot:~# #define __BIGGEST_ALIGNMENT__ 16
iroot@minnowturbot:~# #define __DBL_MAX__ ((double)1.79769313486231570815e+308L)
root@minnowturbot:~# #define __INT_FAST32_MAX__ 9223372036854775807L
root@minnowturbot:~# #define __DBL_HAS_INFINITY__ 1
root@minnowturbot:~# #define __DEC32_MIN_EXP__ (-94)
root@minnowturbot:~# #define __INT_FAST16_TYPE__ long int
root@minnowturbot:~# #define __LDBL_HAS_DENORM__ 1
root@minnowturbot:~# #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL
neroot@minnowturbot:~# #define __INT_LEAST32_MAX__ 2147483647
root@minnowturbot:~# #define __DEC32_MIN__ 1E-95DF
root@minnowturbot:~# #define __DBL_MAX_EXP__ 1024
root@minnowturbot:~# #define __DEC128_EPSILON__ 1E-33DL
root@minnowturbot:~# #define __SSE2_MATH__ 1
root@minnowturbot:~# #define __PTRDIFF_MAX__ 9223372036854775807L
#droot@minnowturbot:~# #define __amd64 1
root@minnowturbot:~# #define __LONG_LONG_MAX__ 9223372036854775807LL
eroot@minnowturbot:~# #define __SIZEOF_SIZE_T__ 8
root@minnowturbot:~# #define __SIZEOF_WINT_T__ 4
root@minnowturbot:~# #define __GCC_HAVE_DWARF2_CFI_ASM 1
root@minnowturbot:~# #define __GXX_ABI_VERSION 1002
root@minnowturbot:~# #define __FLT_MIN_EXP__ (-125)
root@minnowturbot:~# #define __INT_FAST64_TYPE__ long int
root@minnowturbot:~# #define __DBL_MIN__ ((double)2.22507385850720138309e-308L)
root@minnowturbot:~# #define __LP64__ 1
root@minnowturbot:~# #define __DECIMAL_BID_FORMAT__ 1
root@minnowturbot:~# #define __DEC128_MIN__ 1E-6143DL
root@minnowturbot:~# #define __REGISTER_PREFIX__
#droot@minnowturbot:~# #define __UINT16_MAX__ 65535
root@minnowturbot:~# #define __DBL_HAS_DENORM__ 1
root@minnowturbot:~# #define __UINT8_TYPE__ unsigned char
root@minnowturbot:~# #define __NO_INLINE__ 1
root@minnowturbot:~# #define __FLT_MANT_DIG__ 24
root@minnowturbot:~# #define __VERSION__ "4.6.3"
root@minnowturbot:~# #define __UINT64_C(c) c ## UL
root@minnowturbot:~# #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__
root@minnowturbot:~# #define __INT32_C(c) c
root@minnowturbot:~# #define __DEC64_EPSILON__ 1E-15DD
root@minnowturbot:~# #define __ORDER_PDP_ENDIAN__ 3412
root@minnowturbot:~# #define __DEC128_MIN_EXP__ (-6142)
root@minnowturbot:~# #define __INT_FAST32_TYPE__ long int
root@minnowturbot:~# #define __UINT_LEAST16_TYPE__ short unsigned int
root@minnowturbot:~# #define unix 1
root@minnowturbot:~# #define __INT16_MAX__ 32767
root@minnowturbot:~# #define __SIZE_TYPE__ long unsigned int
root@minnowturbot:~# #define __UINT64_MAX__ 18446744073709551615UL
root@minnowturbot:~# #define __INT8_TYPE__ signed char
root@minnowturbot:~# #define __ELF__ 1
root@minnowturbot:~# #define __FLT_RADIX__ 2
root@minnowturbot:~# #define __INT_LEAST16_TYPE__ short int
root@minnowturbot:~# #define __LDBL_EPSILON__ 1.08420217248550443401e-19L
firoot@minnowturbot:~# #define __UINTMAX_C(c) c ## UL
root@minnowturbot:~# #define __SSE_MATH__ 1
root@minnowturbot:~# #define __k8 1
root@minnowturbot:~# #define __SIG_ATOMIC_MAX__ 2147483647
root@minnowturbot:~# #define __SIZEOF_PTRDIFF_T__ 8
root@minnowturbot:~# #define __x86_64__ 1
root@minnowturbot:~# #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF
root@minnowturbot:~# #define __INT_FAST16_MAX__ 9223372036854775807L
root@minnowturbot:~# #define __UINT_FAST32_MAX__ 18446744073709551615UL
root@minnowturbot:~# #define __UINT_LEAST64_TYPE__ long unsigned int
root@minnowturbot:~# #define __FLT_HAS_QUIET_NAN__ 1
root@minnowturbot:~# #define __FLT_MAX_10_EXP__ 38
root@minnowturbot:~# #define __LONG_MAX__ 9223372036854775807L
root@minnowturbot:~# #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL
root@minnowturbot:~# #define __FLT_HAS_INFINITY__ 1
root@minnowturbot:~# #define __UINT_FAST16_TYPE__ long unsigned int
root@minnowturbot:~# #define __DEC64_MAX__ 9.999999999999999E384DD
root@minnowturbot:~# #define __CHAR16_TYPE__ short unsigned int
root@minnowturbot:~# #define __PRAGMA_REDEFINE_EXTNAME 1
#droot@minnowturbot:~# #define __INT_LEAST16_MAX__ 32767
root@minnowturbot:~# #define __DEC64_MANT_DIG__ 16
root@minnowturbot:~# #define __INT64_MAX__ 9223372036854775807L
root@minnowturbot:~# #define __UINT_LEAST32_MAX__ 4294967295U
root@minnowturbot:~# #define __INT_LEAST64_TYPE__ long int
root@minnowturbot:~# #define __INT16_TYPE__ short int
root@minnowturbot:~# #define __INT_LEAST8_TYPE__ signed char
root@minnowturbot:~# #define __DEC32_MAX_EXP__ 97
root@minnowturbot:~# #define __INT_FAST8_MAX__ 127
root@minnowturbot:~# #define __INTPTR_MAX__ 9223372036854775807L
root@minnowturbot:~# #define linux 1
root@minnowturbot:~# #define __SSE2__ 1
root@minnowturbot:~# #define __LDBL_MANT_DIG__ 64
root@minnowturbot:~# #define __DBL_HAS_QUIET_NAN__ 1
root@minnowturbot:~# #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1)
root@minnowturbot:~# #define __k8__ 1
root@minnowturbot:~# #define __INTPTR_TYPE__ long int
root@minnowturbot:~# #define __UINT16_TYPE__ short unsigned int
root@minnowturbot:~# #define __WCHAR_TYPE__ int
root@minnowturbot:~# #define __SIZEOF_FLOAT__ 4
nroot@minnowturbot:~# #define __UINTPTR_MAX__ 18446744073709551615UL
root@minnowturbot:~# #define __DEC64_MIN_EXP__ (-382)
root@minnowturbot:~# #define __INT_FAST64_MAX__ 9223372036854775807L
efroot@minnowturbot:~# #define __FLT_DIG__ 6
root@minnowturbot:~# #define __UINT_FAST64_TYPE__ long unsigned int
root@minnowturbot:~# #define __INT_MAX__ 2147483647
root@minnowturbot:~# #define __amd64__ 1
root@minnowturbot:~# #define __INT64_TYPE__ long int
root@minnowturbot:~# #define __FLT_MAX_EXP__ 128
root@minnowturbot:~# #define __ORDER_BIG_ENDIAN__ 4321
root@minnowturbot:~# #define __DBL_MANT_DIG__ 53
root@minnowturbot:~# #define __INT_LEAST64_MAX__ 9223372036854775807L
root@minnowturbot:~# #define __DEC64_MIN__ 1E-383DD
root@minnowturbot:~# #define __WINT_TYPE__ unsigned int
root@minnowturbot:~# #define __UINT_LEAST32_TYPE__ unsigned int
root@minnowturbot:~# #define __SIZEOF_SHORT__ 2
root@minnowturbot:~# #define __SSE__ 1
root@minnowturbot:~# #define __LDBL_MIN_EXP__ (-16381)
root@minnowturbot:~# #define __INT_LEAST8_MAX__ 127
root@minnowturbot:~# #define __SSP__ 1
root@minnowturbot:~# #define __SIZEOF_INT128__ 16
root@minnowturbot:~# #define __LDBL_MAX_10_EXP__ 4932
root@minnowturbot:~# #define __DBL_EPSILON__ ((double)2.22044604925031308085e-16L)
root@minnowturbot:~# #define _LP64 1
root@minnowturbot:~# #define __UINT8_C(c) c
root@minnowturbot:~# #define __INT_LEAST32_TYPE__ int
root@minnowturbot:~# #define __SIZEOF_WCHAR_T__ 4
root@minnowturbot:~# #define __UINT64_TYPE__ long unsigned int
root@minnowturbot:~# #define __INT_FAST8_TYPE__ signed char
root@minnowturbot:~# #define __DBL_DECIMAL_DIG__ 17
root@minnowturbot:~# #define __DEC_EVAL_METHOD__ 2
root@minnowturbot:~# #define __UINT32_C(c) c ## U
root@minnowturbot:~# #define __INTMAX_MAX__ 9223372036854775807L
root@minnowturbot:~# #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__
root@minnowturbot:~# #define __FLT_DENORM_MIN__ 1.40129846432481707092e-45F
root@minnowturbot:~# #define __INT8_MAX__ 127
root@minnowturbot:~# #define __UINT_FAST32_TYPE__ long unsigned int
root@minnowturbot:~# #define __CHAR32_TYPE__ unsigned int
root@minnowturbot:~# #define __FLT_MAX__ 3.40282346638528859812e+38F
root@minnowturbot:~# #define __INT32_TYPE__ int
root@minnowturbot:~# #define __SIZEOF_DOUBLE__ 8
root@minnowturbot:~# #define __FLT_MIN_10_EXP__ (-37)
root@minnowturbot:~# #define __INTMAX_TYPE__ long int
root@minnowturbot:~# #define __DEC128_MAX_EXP__ 6145
root@minnowturbot:~# #define __GNUC_MINOR__ 6
root@minnowturbot:~# #define __UINTMAX_MAX__ 18446744073709551615UL
root@minnowturbot:~# #define __DEC32_MANT_DIG__ 7
root@minnowturbot:~# #define __DBL_MAX_10_EXP__ 308
root@minnowturbot:~# #define __LDBL_DENORM_MIN__ 3.64519953188247460253e-4951L
root@minnowturbot:~# #define __INT16_C(c) c
root@minnowturbot:~# #define __STDC__ 1
root@minnowturbot:~# #define __PTRDIFF_TYPE__ long int
root@minnowturbot:~# #define __UINT32_TYPE__ unsigned int
root@minnowturbot:~# #define __UINTPTR_TYPE__ long unsigned int
root@minnowturbot:~# #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD
root@minnowturbot:~# #define __DEC128_MANT_DIG__ 34
root@minnowturbot:~# #define __LDBL_MIN_10_EXP__ (-4931)
root@minnowturbot:~# #define __SIZEOF_LONG_LONG__ 8
root@minnowturbot:~# #define __LDBL_DIG__ 18
root@minnowturbot:~# #define __FLT_DECIMAL_DIG__ 9
root@minnowturbot:~# #define __UINT_FAST16_MAX__ 18446744073709551615UL
root@minnowturbot:~# #define __GNUC_GNU_INLINE__ 1
root@minnowturbot:~# #define __UINT_FAST8_TYPE__ unsigned char


Mit Freundlichen Grüßen
VinothKumar
+49 1798909072

--
You received this message because you are subscribed to a topic in the Google Groups "Brotli" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/brotli/dg2gNlo61yI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to brotli+un...@googlegroups.com.
To post to this group, send email to bro...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/brotli/20607ad9-86c9-41eb-8c3d-ebb60ea4f8dd%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

minnow_gcc.txt

Evgenii Kliuchnikov

unread,
Jul 15, 2016, 4:05:05 AM7/15/16
to Brotli
That is interesting: __x86_64__ should imply BROTLI_BUILD_64_BIT, so it should be no-op. It seems that this is host machine defines. We need to find a way to capture target machine defines.
Also, due to dump, CPU seems to be not switched to 64-bit mode at the moment decompression is started...

On Wednesday, July 13, 2016 at 4:21:47 PM UTC+2, vinoth eswaran wrote:

vinoth eswaran

unread,
Jul 18, 2016, 5:17:04 AM7/18/16
to Evgenii Kliuchnikov, Brotli

Hi,

 The target is defined for 32 Bit. Actually I am using Brotli with u-boot. U-Boot on x86 platform is defined currently only for 32 bit.

The reply from u-boot community for your reference,

http://lists.denx.de/pipermail/u-boot/2016-July/260736.html


Mit Freundlichen Grüßen
VinothKumar
+49 1798909072

--
You received this message because you are subscribed to a topic in the Google Groups "Brotli" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/brotli/dg2gNlo61yI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to brotli+un...@googlegroups.com.
To post to this group, send email to bro...@googlegroups.com.

Evgenii Kliuchnikov

unread,
Jul 18, 2016, 6:29:41 AM7/18/16
to Brotli
Please, could you compile with "-g3" option and attach brotli object files?
Compilation log (with unwrapped compiler options) would also be nice.
BTW:
  • the maximal slowdown (1.5x) I've achieved with "-m32 -DBROTLI_BUILD_64_BIT", i.e. force 64-bit simulation
  • "-O3" is slower than "-O2

On Wednesday, July 13, 2016 at 4:21:47 PM UTC+2, vinoth eswaran wrote:

vinoth eswaran

unread,
Jul 21, 2016, 3:27:40 AM7/21/16
to Brotli

Hello,

 The -g3 option is added to KBUILD_CFLAGS as shown below,

KBUILD_CFLAGS    += -g3
# $(KBUILD_AFLAGS) sets -g, which causes gcc to pass a suitable -g<format>
# option to the assembler.
KBUILD_AFLAGS    += -g3

KBUILD_CFLAGS    += -O2  # -O2 flag is used

I have attached the compilation logs with unwrapped compiler options and brotli object files for your reference
brotli_buildlog.txt
bit_reader.o
decode.o
huffman.o
state.o
dictionary.o

Joe Duarte

unread,
Jul 22, 2016, 10:40:55 AM7/22/16
to Brotli
Vinoth, it looks like you ran BROTLI_BUILD_BIG_ENDIAN when Evgenii suggested BROTLI_BUILD_LITTLE_ENDIAN. You might try the latter, though I have no idea how much of an optimization difference it makes.

The Minnowboard uses a Bay Trail CPU, which should support SIMD up through SSE4.2. You might try compiling with the SSE4.2 flag, or at the very least SSE3, to see if GCC can do more for you than the vanilla compile it's currently doing. I'd also use a much newer version of GCC if possible, or the Intel compiler.

Cheers,

Joe

Evgenii Kliuchnikov

unread,
Jul 25, 2016, 9:32:32 AM7/25/16
to Brotli
Hello.

  Currently I have not tested on Atom, but here is what I got up to time:
  • replacing -Os with -O2 provides considerable speedup; the cost of it is ~+4800 bytes of binary; if binary size is an issue, it could be reduced by ~100k if you cut off static dictionary (I think I'll add compilation flag for that)
  • more speed-up (on my PC with 32-bit code) is gained via memmove16 patch (see below); togethter with O2 it gives 1.25x speedup.
Please check if it helps in your case.
(NB: I use gcc 5.2)

decode.c patch:
@@ -121,9 +121,7 @@ static BROTLI_INLINE void memmove16(uint8_t* dst, uint8_t* src) {
 #if defined(__ARM_NEON__)
   vst1q_u8(dst, vld1q_u8(src));
 #else
-  uint32_t buffer[4];
-  memcpy(buffer, src, 16);
-  memcpy(dst, buffer, 16);
+  *((uint32_t*)dst) = *((uint32_t*)src);
 #endif
 }
 
@@ -1777,13 +1775,13 @@ postReadDistance:
       goto CommandPostWrapCopy;
     }
     pos += i;
-    if (i > 16) {
-      if (i > 32) {
-        memcpy(copy_dst + 16, copy_src + 16, (size_t)(i - 16));
+    if (i > 4) {
+      if (i > 8) {
+        memcpy(copy_dst + 4, copy_src + 4, (size_t)(i - 4));
       } else {
         /* This branch covers about 45% cases.
            Fixed size short copy allows more compiler optimizations. */
-        memmove16(copy_dst + 16, copy_src + 16);
+        memmove16(copy_dst + 4, copy_src + 4);
       }
     }
   }

Evgenii Kliuchnikov

unread,
Jul 25, 2016, 1:10:14 PM7/25/16
to Brotli
Is MMX allowed for this build? I've got some speedup using it in memmove16.
If MMX is allowed, then I would invest some time in trying to use it "bit reader".

Currently what I see is that 32-bit brotli starves on registers -> slows down saving and loading pieces of data again and again...

vinoth eswaran

unread,
Jul 26, 2016, 5:15:32 AM7/26/16
to Brotli
Hi,

1. With compilation flag change -os to -o2 , I am not seeing any big difference in the decompression speed in Minnowboard, though the image size is increased.

2. The patch helps, now the decompression time improved considerably. (From 568 ms to 354ms) . Thanks :)

May I know how does the window size affects the decompression speed? The result mentioned above is with the ring buffer size of around 4 MB, window_size is 22 bits.
I tried using the windows_bit 24 which has 16MB ring buffer, but the decompression time increased by another 10 to 15ms.

Thanks & Regards,
Vinothkumar

Joe Duarte

unread,
Aug 3, 2016, 1:50:22 AM8/3/16
to Brotli


On Tuesday, July 26, 2016 at 2:15:32 AM UTC-7, vinoth eswaran wrote:
Hi,

1. With compilation flag change -os to -o2 , I am not seeing any big difference in the decompression speed in Minnowboard, though the image size is increased.

2. The patch helps, now the decompression time improved considerably. (From 568 ms to 354ms) . Thanks :)

May I know how does the window size affects the decompression speed? The result mentioned above is with the ring buffer size of around 4 MB, window_size is 22 bits.
I tried using the windows_bit 24 which has 16MB ring buffer, but the decompression time increased by another 10 to 15ms.

Increasing the window size will improve compression (to some limit). On another forum, Jyrki said that going from 22 to 23 bits would probably reduce file size by 3%. Unlike gzip, Brotli's decompressor has to work harder as compression is improved, which reduces decompression speed per byte. How this affects total decompression time depends on the extent of the offset from the reduced file size. From what I've seen, brotli usually takes longer to decompress a more compressed file. It might actually be an always thing, not just "usually" – I haven't tested it enough, but the file size reduction seems to lose to slower decompression speed. So you can expect increasing the window size to increase total decompression time / wall time (because of the denser compression and resultant increased complexity of decompression.)

Have you tried the CPU instruction set flags I mentioned? -march=Nehalem would be a good try. I think your Bay Trail processor supports everything Nehalem supported, including SSE 4.2. A more conservative flag would be -march=core2, which would make available not just MMX, but SSE, SSE2, SSE3, and SSSE3 (note the extra S). Also, a newer version of GCC. Evgenii was using v5.2.

Cheers,

Joe

vinoth eswaran

unread,
Aug 4, 2016, 7:30:39 AM8/4/16
to Brotli
Hi,

  I have analyzed the Brotli decompression speed in the Minnowboard Turbot with the following flags enabled -march=core2 and -02.
  The default u-boot setup is -march=i386 and -os
  
Brotli settings: ring buffer size:4194304 window_bits:22 quality:11 (default)

With -o2:
                      -march=i386         -march=core2
with gcc:4.6     354ms                   342 ms
with gcc:5.4     334ms                   323 ms
with gcc:6.1     334ms                   324 ms

with -march=nehalem, the u-boot build is failing, so I couldn't analyze further.

Gzip is performing better in my analysis.

Regards,
Vinothkumar

Joe Duarte

unread,
Aug 5, 2016, 12:06:10 AM8/5/16
to Brotli
Hi Vinoth,

Looks like a small gain, 8 percent or so.

Switch brotli to quality: 9 – it should be faster overall. Note your original results where -11 took 568 ms to decode while -9 took only 509 ms (even though it was decoding a bigger file).

Using GCC 5.4 only from here on out, try: -march=silvermont -flto

I think you'll get under 300 ms with those steps, maybe 200. Your CPU is a Silvermont, and that flag will unlock a wide range of instructions beyond core2, and possibly more precise knowledge of cache sizes. (Intel has multiple confusing names for a given product: Silvermont, Bay Trail, Atom, etc. all apply to your chip. My earlier suggestion of Nehalem flag was a rough guess at SIMD and other feature level equivalence. Silvermont is the exact flag for you, but I didn't remember it before. https://gcc.gnu.org/onlinedocs/gcc-5.4.0/gcc/x86-Options.html#x86-Options)

Of course, gzip should be faster with these compiler optimizations too. One thing at a time. I think you can get some gains on brotli with the above steps. Last resorts will be to give exact cache line and cache sizes to GCC, to use profiling, and to let the compiler take longer with the flags at the bottom of this page. We've just gotten warmed up, but hopefully you'll see some gains today.

JD
Reply all
Reply to author
Forward
0 new messages