Weird memory corruption on serialized strings


Alessandro Cosentino

Oct 14, 2016, 7:50:44 AM
to Redis DB
Hi,

We use Redis as an in-memory cache for our data engineering project. The instance we use in production is deployed on RedisCloud.

We're experiencing a weird issue with data corruption. Namely, strings that are inserted into Redis lists come back with one bit flipped when they are fetched. Very curiously, the flipped bit is always in the same position.

More details:
 - we use the redis-py library to interface our Python+Celery code with Redis;
 - the strings we add to the list are serializations of Python objects of the form "x|a|b|c|y|d", where x and y are integers and a, b, c, and d are strings.

It happens very rarely, so it's hard to reproduce. I don't have a precise figure for how often this happens, but it's on the order of less than 1 in 10^6.

It's important to me that the data is not corrupted, because I use the fields of the deserialized string as keys into Python dictionaries, and I get a KeyError if the Redis data is corrupted.
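
For illustration, here is a minimal sketch of that pipe-delimited round trip; the field names and the lookup dictionary are hypothetical, not our real code:

```python
# Hypothetical sketch of the "x|a|b|c|y|d" format described above.
LOOKUP = {"iphone": 1, "entertainment": 2}  # made-up dictionary keyed by one field

def serialize(x, a, b, c, y, d):
    return "|".join([str(x), a, b, c, str(y), d])

def deserialize(raw):
    x, a, b, c, y, d = raw.split("|")
    return int(x), a, b, c, int(y), d

raw = serialize(42, "iphone", "foo", "bar", 7, "baz")
_, a, *_ = deserialize(raw)
value = LOOKUP[a]  # a single flipped bit ("ipho.e") makes this raise KeyError
```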

Is Redis memory data supposed to be 100% error-free? Can this be a problem with the deployment? Or with the Python library I am using?

Itamar Haber

Oct 16, 2016, 5:50:42 AM
to Redis DB
Hi Alessandro,

This indeed sounds like a weird issue that shouldn't happen; it's definitely the first time I've heard of this kind of data corruption.

Please contact our support team (sup...@redislabs.com) so we can work on getting to the bottom of this together.

Thanks,


Meir Guttman

Oct 16, 2016, 5:55:50 AM
to redi...@googlegroups.com
And I would suggest running memory diagnostics...

  -- Meir

Alessandro Cosentino

Oct 17, 2016, 4:09:53 AM
to Redis DB
Hi Itamar and Meir,

Thanks for the replies!
Can I run memory diagnostics on a server hosted by RedisLabs?
I know of the command `redis-server --test-memory`, but that's for a local installation, isn't it?

@Itamar: OK, I am going to email the RedisLabs support team, thanks!


Alessandro



Salvatore Sanfilippo

Oct 17, 2016, 4:18:51 AM
to redi...@googlegroups.com
Hello Alessandro,

what version of Redis are you running in RedisCloud?
Do you use LSET in your application?

Thanks,
Salvatore

Alessandro Cosentino

Oct 17, 2016, 4:35:39 AM
to Redis DB
Ciao Salvatore!

No, I don't use LSET in my application. For that specific section of the code I only use RPUSH (with multiple value arguments) and LRANGE.
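
For reference, a minimal sketch of that access pattern with redis-py; the key name and connection details are placeholders, not our real configuration:

```python
import redis

# Placeholder connection; the real instance is a RedisCloud endpoint.
r = redis.StrictRedis(host="localhost", port=6379, db=0)

# RPUSH with multiple value arguments, then LRANGE to read the list back.
r.rpush("events:queue", "42|iphone|foo|bar|7|baz", "7|entertainment|a|b|3|c")
items = r.lrange("events:queue", 0, -1)  # list of byte strings
```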

Here is the output of INFO on the instance where I experience that problem.

# Server
redis_version:3.0.3
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:0000000000000000000000000000000000000000
redis_mode:standalone
os:Linux 3.2.0-48 virtual x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.6.3
process_id:7994324
run_id:20ff23062d45408c8ec67262d4fe03733725a3cd
tcp_port:17761
uptime_in_seconds:4120845
uptime_in_days:47
hz:10
lru_clock:0
config_file:

# Clients
connected_clients:6
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:6605673392
used_memory_human:6.15G
used_memory_rss:6605673392
used_memory_peak:12377759792
used_memory_peak_human:11.52G
used_memory_lua:69632
mem_fragmentation_ratio:1
mem_allocator:jemalloc-3.2.0

# Persistence
loading:0
rdb_changes_since_last_save:8445098528
rdb_bgsave_in_progress:0
rdb_last_save_time:1472572005
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok

# Stats
total_connections_received:5270462
total_commands_processed:1342042463
instantaneous_ops_per_sec:1
total_net_input_bytes:498971577755
total_net_output_bytes:2665380521055
instantaneous_input_kbps:0.09
instantaneous_output_kbps:1.46
rejected_connections:0
sync_full:1
sync_partial_ok:0
sync_partial_err:0
expired_keys:25555016
evicted_keys:0
keyspace_hits:727297900
keyspace_misses:184477264
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0

# Replication
role:master
connected_slaves:1
slave0:ip=0.0.0.0,port=0,state=online,offset=0,lag=0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:0.00
used_cpu_user:0.00
used_cpu_sys_children:0.00
used_cpu_user_children:0.00

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=87555,expires=87232,avg_ttl=5587825

Salvatore Sanfilippo

Oct 17, 2016, 4:40:43 AM
to redi...@googlegroups.com
Thank you Alessandro. Since this is Redis 3.0 and not 3.2, we are talking
about a list implementation that is N years old and has never had such a
problem, so IMHO the cause must be looked for outside of the Redis core.
I'll talk with my colleagues at Redis Labs to investigate on the
RedisCloud side. If possible, please also look into your client side to
check whether the corruption could be happening at the application level. I think
Itamar or somebody else from Redis Labs will update here ASAP. Cheers.

Alessandro Cosentino

Oct 17, 2016, 4:55:44 AM
to Redis DB
OK, I'll double-check the client side.
In the meantime, let me add an actual example of what is happening.
Instead of the word "iphone", I get back the word "ipho.e". Second case: "entertainment" --> "en4ertainment".
Notice the pattern in the binary encodings of the replaced letters:
_____________________________
"n" --> "."  | 0b1101110 --> 0b101110
"t" --> "4"  | 0b1110100 --> 0b110100

(The values in the second column are given by the Python code `bin(ord(c))`)
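
A quick check that it is always the same bit being cleared (the 0x40 bit, i.e. bit 7 counting from 1), assuming the two pairs above are representative:

```python
# XOR each original character with its corrupted counterpart to see which bit differs.
pairs = [("n", "."), ("t", "4")]
for good, bad in pairs:
    print(good, "->", bad, "differs in bit mask", bin(ord(good) ^ ord(bad)))
# Both lines print 0b1000000 (0x40): the same single bit is cleared every time.
```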

--Alessandro

Salvatore Sanfilippo

Oct 17, 2016, 5:01:02 AM
to redi...@googlegroups.com
OK, thanks. A few random things:

1. If this is memory corruption, from time to time it should also happen
on pointers and crash the server. Redis Labs operations will
notice if that is the case.
2. However, if no crashes happen, seeing the corruption only in the
strings makes it a lot more likely that this is a client-side issue.
3. A bit always being cleared at the same position (bit 7) could indeed
happen because of faulty memory, but if you are on EC2, AFAIK the
servers all run error-corrected memory modules.

Cheers,
Salvatore


Salvatore Sanfilippo

Oct 17, 2016, 5:05:12 AM
to redi...@googlegroups.com
Another random idea, if it is feasible in your specific case: what about
modifying the application layer so that it stores an 8-byte CRC64
prefix before the serialized data?
This way you can tell whether or not it is Redis that corrupts the
data: when the corruption happens again, you will see whether the CRC64
sum still matches.
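
A minimal sketch of that idea; since the Python standard library has no CRC64, this uses zlib.crc32 (4 bytes) as a stand-in, and the helper names are hypothetical:

```python
import struct
import zlib

def pack(payload: bytes) -> bytes:
    # Prefix the serialized data with a fixed-size checksum of the payload.
    return struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF) + payload

def unpack(blob: bytes) -> bytes:
    stored, payload = struct.unpack(">I", blob[:4])[0], blob[4:]
    if stored != (zlib.crc32(payload) & 0xFFFFFFFF):
        raise ValueError("checksum mismatch: value corrupted in transit or at rest")
    return payload

# r.rpush("events:queue", pack(b"42|iphone|foo|bar|7|baz"))
# payload = unpack(r.lrange("events:queue", 0, -1)[0])
```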

Pedro Melo

Oct 17, 2016, 5:06:20 AM
to redi...@googlegroups.com
Hi,

The corruption could also happen in the network stack… Similar situations
have been known to happen, like this one:
http://mina.naguib.ca/blog/2012/10/22/the-little-ssh-that-sometimes-couldnt.html

One way to make sure: ask Redis Labs to run the same commands on a locally
hosted instance, or even on the same instance. If they cannot reproduce the
corruption, then the data is being corrupted someplace else…

Bye,


Salvatore Sanfilippo

Oct 18, 2016, 2:56:01 AM
to Redis DB
Yes, that could indeed be it, thanks to the weak checksums we have in TCP packets ;-)
However, since the problem happens only rarely and randomly, it's very hard to reproduce.
Unfortunately there is no easy way, by putting a CRC64 into the data itself, to distinguish whether it's the network or the Redis servers, but my bet is on the client-side code, for a few reasons.
Btw, if something is corrupting packets so often that from time to time a corrupted packet still passes the TCP checksum, I bet that running tcpdump in promiscuous mode with a filter that shows only packets with bad checksums should do the trick of revealing that some corruption is going on.

Cheers,
Salvatore

Alessandro Cosentino

Oct 19, 2016, 8:37:31 AM
to Redis DB
Hi,

Thanks for all the ideas!
We added a CRC64 check in the client code, but since we added it, the problem hasn't shown up yet.
We are monitoring the situation and we have set up a system that will alert us in case the CRC check fails.
I'll keep you guys posted here.

Alessandro

Salvatore Sanfilippo

Oct 19, 2016, 8:49:46 AM
to redi...@googlegroups.com
Thanks Alessandro! Let's hope it's not a Heisenbug that disappears as soon
as you add a CRC :-)

Alessandro Cosentino

Nov 4, 2016, 12:57:57 PM
to Redis DB
After many days and many gigabytes of data flowing through, the issue has finally shown up again. It was again a cleared bit, and the checksum did not match.
Actually, now that I have thought about the problem again, I am no longer sure I understand what information can be gathered from the client-side check. Salvatore, can you please elaborate on that?

Thanks,
Alessandro

Salvatore Sanfilippo

Nov 4, 2016, 1:31:23 PM
to redi...@googlegroups.com
Thanks Alessandro. Before I reply, could you please elaborate on how
exactly the checksum is computed and applied, and so forth? This is
important for understanding what a failing checksum tells us. Thanks.

Alessandro Cosentino

Nov 8, 2016, 4:00:48 AM
to Redis DB
Hi Salvatore,

Here is a gist of what the code looks like: https://gist.github.com/cosenal/168c81fed721ea9fb4809e8b80b17480

When I wrote "checksum did not match" in my last email, I meant that the exception you see on line 26 of the gist was raised.

Alessandro

Salvatore Sanfilippo

Nov 8, 2016, 4:10:28 AM
to redi...@googlegroups.com
Hello Alessandro,

Looking at the code, I think it is starting to look more likely that
there is some corruption problem outside the client. I'm passing all the
info to my colleagues at Redis Labs so that they can inspect the issue
more closely.
AFAIK it is very unlikely that this is due to Redis itself: I have never seen
this before and there is no simple mechanism that would explain a bug like
that. So this may instead come from the networking or other systems
inside the Redis Labs infrastructure, or from the network between the
clients and the Redis Labs servers, in a way that lets a corrupted packet
pass the checksum from time to time (very rarely indeed). I just pinged my
colleagues internally again; they'll evaluate the situation and give you some feedback.
Thanks for helping.

Salvatore Sanfilippo

Nov 8, 2016, 4:15:38 AM
to redi...@googlegroups.com
Just to put things in better context: it is still possible that the
problem is in the client, because when data is written to the pipeline,
the software may corrupt it before it is queued to the network stack.
However, now that the checksum is in place, we have narrowed down the amount of
client-side code where this could happen, so an investigation on the
server side starts to become more interesting.

Salvatore Sanfilippo

Nov 8, 2016, 4:20:43 AM
to redi...@googlegroups.com
Alessandro: another random idea, have you tried incrementally scanning
the dataset to find broken data before it is fetched? For example, by
creating a slave (which Redis Labs may be able to provide to you for free
for testing purposes) and scanning it for issues across the
whole dataset. This could be interesting because you could timestamp the
data and find broken entries soon after they are created. On top of
this ability to reproduce, we may be able to gather more state (adding
client-side logging as well for the latest N writes) until we have
enough state to track it down.
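
A sketch of what such an incremental scan could look like with redis-py, assuming the checksum-prefixed values and the unpack() helper from the earlier sketch; the slave hostname and key handling are placeholders:

```python
import redis

# Placeholder: connect to the test slave rather than the production master.
slave = redis.StrictRedis(host="slave.example.com", port=6379)

corrupted = []
# SCAN iterates the keyspace incrementally, so the check can run continuously
# without blocking the server the way KEYS would.
for key in slave.scan_iter(count=1000):
    if slave.type(key) != b"list":
        continue
    for index, blob in enumerate(slave.lrange(key, 0, -1)):
        try:
            unpack(blob)  # checksum verification from the earlier sketch
        except ValueError:
            corrupted.append((key, index))

print("corrupted entries:", corrupted)
```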