How to deduplicate logEntry in Cloud Logging

Jerry Gao

Sep 18, 2020, 2:53:45 PM
to Google Stackdriver Discussion Forum
Hi,
I'm using the Cloud Logging API to push logEntry records to a project. I went through the documentation for insertId, which says:
"A unique identifier for the log entry. If you provide a value, then Logging considers other log entries in the same project, with the same timestamp, and with the same insertId to be duplicates which are removed in a single query result. However, there are no guarantees of de-duplication in the export of logs." 
For now, my implementation leaves the timestamp empty. Does that mean insertId can be used as the only key to deduplicate logEntry records?

Thanks!

Summit Tuladhar

Sep 21, 2020, 8:23:48 AM
to Jerry Gao, Google Stackdriver Discussion Forum
Hi Jerry,

If you leave the timestamp field in the log entry empty, the Logging API will add a timestamp equal to the received time. Since two log entries with the same insert_id and empty timestamps are likely to be received at different times by the Logging API, they are not considered duplicates.

To de-duplicate the entries, you will need to provide the same timestamp on both entries. Is that something that's possible on your end? 

Also, can you tell us a bit more about your use-case for de-duplication?

Thanks,
Summit
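
To make the rule above concrete, here is a toy model in Python. It is only a sketch of the documented behavior (same project, same timestamp, same insertId), not how Cloud Logging is implemented; the `stored_key` and `dedupe` helpers and the sample values are made up for illustration.

```python
from datetime import datetime, timezone


def stored_key(entry, received_at):
    """Return the (timestamp, insertId) pair used to compare entries in this toy model."""
    # An empty timestamp is filled in with the time the entry was received.
    timestamp = entry.get("timestamp") or received_at
    return (timestamp, entry["insertId"])


def dedupe(arrivals):
    """Keep the first entry for each (timestamp, insertId) pair.

    `arrivals` is a list of (entry, received_at) tuples, all for one project.
    """
    seen = set()
    kept = []
    for entry, received_at in arrivals:
        key = stored_key(entry, received_at)
        if key not in seen:
            seen.add(key)
            kept.append(entry)
    return kept


t1 = datetime(2020, 9, 21, 8, 0, 0, tzinfo=timezone.utc)
t2 = datetime(2020, 9, 21, 8, 0, 5, tzinfo=timezone.utc)

# Same insertId, no timestamp: received at different times, so both survive.
without_ts = [({"insertId": "abc"}, t1), ({"insertId": "abc"}, t2)]
print(len(dedupe(without_ts)))  # prints 2

# Same insertId and the same explicit timestamp: treated as one entry.
with_ts = [({"insertId": "abc", "timestamp": t1}, t1),
           ({"insertId": "abc", "timestamp": t1}, t2)]
print(len(dedupe(with_ts)))  # prints 1
```

The second case is the one a retry should reproduce: the sender fixes the timestamp, so the receive time no longer matters.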

Jerry Gao

Sep 21, 2020, 10:20:44 AM
to Summit Tuladhar, Google Stackdriver Discussion Forum
Hi Summit,
Thanks for your reply.
Basically, I have a server that collects event logs from external APIs.
I don't want to parse the HTTP response, since other jobs handle that, so I may not have the timestamp on my end.
As for de-duplication, I just want to make sure there are no duplicate entries in Cloud Logging when an error or retry occurs.

Igor Peshansky

Sep 21, 2020, 12:15:45 PM
to Jerry Gao, Summit Tuladhar, Google Stackdriver Discussion Forum
Jerry,

If you just want to de-duplicate retries, you can assign an arbitrary timestamp in your code (e.g., the time you've collected the log entry) and send that. When retrying, the code will send an identical record, which means it'll have an identical timestamp and insert_id (and should be de-duplicated).
        Igor
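
A minimal Python sketch of that suggestion, under some assumptions: the entry is built once at collection time as a plain dict using the LogEntry field names (`logName`, `timestamp`, `insertId`, `textPayload`), and `send` is a hypothetical callable standing in for whatever client call actually writes the entry. Generating the insertId with `uuid4` is one possible choice, not something this thread prescribes.

```python
import uuid
from datetime import datetime, timezone


def build_entry(log_name, raw_body):
    """Build the entry once, when the event is collected."""
    collected_at = datetime.now(timezone.utc)
    return {
        "logName": log_name,
        # RFC 3339 UTC timestamp fixed at collection time, not at send time.
        "timestamp": collected_at.isoformat().replace("+00:00", "Z"),
        # Generated once; every retry reuses the same value.
        "insertId": uuid.uuid4().hex,
        # The unparsed response body is stored as text.
        "textPayload": raw_body,
    }


def send_with_retry(entry, send, attempts=3):
    """Call send(entry) up to `attempts` times, always with the same entry."""
    for attempt in range(1, attempts + 1):
        try:
            return send(entry)
        except Exception:
            # Re-raise after the final attempt; otherwise resend the identical entry.
            if attempt == attempts:
                raise
```

The point of the design is that nothing inside the retry loop touches `timestamp` or `insertId`, so every attempt carries the same pair.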

Jerry Gao

Sep 23, 2020, 12:00:30 AM
to Igor Peshansky, Summit Tuladhar, Google Stackdriver Discussion Forum
Thanks Igor,
That's what I thought. I may assign a fake identical timestamp to all entries.
Also, I'd like to ask how the deduplication is done. Will it create two entries and remove the second one later, will it just ignore or reject the second one, or will it throw an exception?

Summit Tuladhar

Sep 23, 2020, 10:48:51 AM
to Jerry Gao, Igor Peshansky, Google Stackdriver Discussion Forum
Hi Jerry,

How the deduplication happens internally is an implementation detail that is subject to change. 

If you are still interested: during ingestion, we can deduplicate logs in a buffer for only a short time before writing them to storage, so that logs become available for queries quickly. There's also some background work that merges files and removes duplicates at certain intervals. At query time, we merge the results and remove any duplicates.

Regards,
Summit

Jerry Gao

Sep 23, 2020, 11:11:56 AM
to Summit Tuladhar, Igor Peshansky, Google Stackdriver Discussion Forum
Thanks for the info!
I'm asking because we are also using the Pub/Sub functionality of Cloud Logging.
When Cloud Logging receives a log, it immediately sends the message to another server.
I want to confirm that Pub/Sub would not send out two messages with different message IDs that contain exactly the same log information.

Summit Tuladhar

Sep 23, 2020, 11:15:48 AM
to Jerry Gao, Igor Peshansky, Google Stackdriver Discussion Forum
As mentioned in your original post: "... there are no guarantees of de-duplication in the export of logs"
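
Since the export gives no such guarantee, one option is for the receiving server to drop repeats itself. The Python sketch below assumes each message body is a JSON-encoded log entry carrying `timestamp` and `insertId`; the `ExportDeduplicator` class is hypothetical, and its in-memory set is for illustration only (a real consumer would likely need shared storage with expiry).

```python
import json


class ExportDeduplicator:
    """Drop exported log entries whose (timestamp, insertId) pair was already seen."""

    def __init__(self):
        # In-memory only: state is lost on restart and not shared across workers.
        self._seen = set()

    def is_new(self, message_data):
        """Return True the first time an entry is seen, False for repeats.

        `message_data` is the raw message body (bytes or str), assumed to be
        a JSON-encoded log entry.
        """
        entry = json.loads(message_data)
        key = (entry.get("timestamp"), entry["insertId"])
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


dedup = ExportDeduplicator()
body = json.dumps({"insertId": "abc", "timestamp": "2020-09-23T15:15:48Z"}).encode("utf-8")
print(dedup.is_new(body))  # prints True
print(dedup.is_new(body))  # prints False
```

Keying on the entry's own fields rather than the Pub/Sub message ID matters here, because two deliveries of the same log entry can arrive with different message IDs.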
