
WAT Title bug report


Robert Waksmunski

Oct 12, 2024, 1:30:28 AM
to Common Crawl
Hello,

There appears to be a bug in WAT file generation: the title of a document is taken from the last <title> tag in the document instead of the first. This is a problem for documents that embed <svg> images directly, since those sometimes have <title> tags of their own. For example, the WAT title for foodnavigator.com shows up as "Linkedin" rather than "Food Ingredients & Food Science - Additives, Flavours, Starch". "Linkedin" is the last <svg><title> on that page.
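
To illustrate the difference, here is a minimal Python sketch. It is not the actual WAT extraction code; the class and function names (`TitleCollector`, `last_title`, `first_document_title`) and the sample page are made up for this example. It collects every <title> element, records whether each one sits inside an <svg>, and then compares "last title wins" with "first title outside <svg> wins".

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the text of every <title> and whether it was inside an <svg>."""

    def __init__(self):
        super().__init__()
        self.titles = []        # list of (text, inside_svg) tuples, in document order
        self._svg_depth = 0     # > 0 while we are inside an <svg> element
        self._in_title = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "svg":
            self._svg_depth += 1
        elif tag == "title":
            self._in_title = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "svg" and self._svg_depth > 0:
            self._svg_depth -= 1
        elif tag == "title" and self._in_title:
            self._in_title = False
            text = "".join(self._buf).strip()
            self.titles.append((text, self._svg_depth > 0))

    def handle_data(self, data):
        if self._in_title:
            self._buf.append(data)


def _collect(html):
    parser = TitleCollector()
    parser.feed(html)
    parser.close()
    return parser.titles


def last_title(html):
    """Mimics the reported behaviour: the last <title> seen wins."""
    titles = _collect(html)
    return titles[-1][0] if titles else None


def first_document_title(html):
    """One possible fix: the first <title> that is not inside an <svg>."""
    for text, inside_svg in _collect(html):
        if not inside_svg:
            return text
    return None


# Hypothetical page with a document title and an inline SVG icon title.
SAMPLE = (
    "<html><head><title>Example Page</title></head>"
    "<body><svg><title>Linkedin</title></svg></body></html>"
)

if __name__ == "__main__":
    print(last_title(SAMPLE))
    print(first_document_title(SAMPLE))
```

Simply taking the first <title> would also fix this sample, but skipping titles inside <svg> additionally covers pages where an inline SVG appears before the document's own title.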

I'm not sure where to properly report this bug, so I'm starting here. If anyone could point me to the proper place or person, I would really appreciate it.

Thank you

Sebastian Nagel

Oct 12, 2024, 7:34:41 AM
to common...@googlegroups.com
Hi Robert,

thanks for reporting!

Would you mind opening an issue at
https://github.com/commoncrawl/ia-web-commons/issues ?

Best and thanks,
Sebastian

Robert W.

Oct 13, 2024, 2:12:41 AM
to common...@googlegroups.com
Hello Sebastian,

Thank you for the quick reply. I've filed the bug as https://github.com/commoncrawl/ia-web-commons/issues/36

Thanks again, Robert.
