Question about HTML parsing

Yossi

Aug 7, 2017, 11:16:03 AM

to Common Crawl

Hi,

I see a problem with HTML parsing in the June crawl: whitespace between elements seems to be dropped.

For example, in CC-MAIN-20170629154125-20170629174125-00719.warc.wet, the parsed text for the URL http://awaywithwords.co/category/general/ contains the line:

February 25, 2017by Catherine Heath9 min readAdd Comment One thing I’m surprised by in my career (in less than a year at professional blogging) is the haters.

It is parsed from:

<div class="meta-item"><i class="fa fa-calendar"></i><span class="updated">February 25, 2017</span></div><div class="meta-item"><i class="fa fa-user"></i><span class="vcard author"><span class="fn">by <a href="http://awaywithwords.co/author/catherine-j-heathgmail-com/">Catherine Heath</a></span></span></div><div class="meta-item"><i class="fa fa-clock-o"></i>9 min read</div><div class="meta-item"><i class="fa fa-comments-o"></i><a href="http://awaywithwords.co/2017/02/25/789/#respond">Add Comment</a></div> </div>

</header>

</div>

<div class="entry-content">
<p>One thing I’m surprised by in my career (in less than a year at professional blogging) is the haters.
Really, people hate bloggers. I have a hypothesis about why this might be.
People don’t really know what blogging is. It...</p>

The main problem for me is that separate words are fused into one, e.g. "Heath9", but the missing newline before "One thing" is also strange.

I ran the same page through Nutch with parse-html, and the problem did not reproduce. What parser is CC using? Is this the intended behavior?

Thanks,

Yossi.

Sebastian Nagel

Aug 8, 2017, 8:44:49 AM

to common...@googlegroups.com

Hi Yossi,

The software used to generate the WET files is described in this discussion:
https://groups.google.com/d/msg/common-crawl/hsb90GHq6to/SSVocyq8AAAJ

In your case I would consider it a bug: there should definitely be a space between these words.
Please open an issue at
https://github.com/commoncrawl/ia-web-commons/issues
The code that constructs the text from the HTML parse tree is in

https://github.com/commoncrawl/ia-web-commons/blob/master/src/main/java/org/archive/resource/html/ExtractingParseObserver.java

Adding meaningful whitespace based on HTML elements/tags is tricky: block elements (p, div, etc.)
imply a space or a newline, while inline elements (a, span, i, etc.) do not add
extra space. However, CSS can be used to add space around inline elements.
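
To illustrate the block/inline distinction, here is a minimal sketch in Python using the standard library's html.parser. It is not the ia-web-commons code; the class name, the function name, and the small BLOCK_TAGS set are assumptions made for this example, and CSS is ignored entirely.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of block-level tags chosen for this sketch.
BLOCK_TAGS = {"p", "div", "header", "section", "article", "li", "ul", "ol",
              "h1", "h2", "h3", "h4", "h5", "h6", "br", "tr", "table"}


class BlockAwareTextExtractor(HTMLParser):
    """Collects text, breaking lines at block elements only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Block elements start on a new line; inline elements add nothing.
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        # Collapse runs of whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self.parts).split("\n"))
        return "\n".join(line for line in lines if line)


def extract_text(html):
    parser = BlockAwareTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = ('<div><span>February 25, 2017</span></div>'
          '<div>by <a href="#">Catherine Heath</a></div>'
          '<div><i></i>9 min read</div>')
print(extract_text(sample))
```

With this approach, "Heath" and "9 min read" end up on separate lines because a div boundary lies between them, while "by" and "Catherine Heath" stay on one line because only inline tags separate them.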

By the way, Nutch's parse-html adds a space around all tags, while parse-tika comes close to the
best possible result, given that it is sometimes hard to decide.
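
The "space around all tags" strategy can be sketched as follows. This is an illustration of the idea only, not Nutch's actual parse-html code; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class SpaceAroundTagsExtractor(HTMLParser):
    """Collects text, inserting a space at every tag boundary."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(" ")

    def handle_endtag(self, tag):
        self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def text_with_space_around_tags(html):
    parser = SpaceAroundTagsExtractor()
    parser.feed(html)
    parser.close()
    # Collapse all whitespace runs into single spaces.
    return " ".join("".join(parser.parts).split())


# Words in adjacent divs are kept apart ...
print(text_with_space_around_tags('<div>Catherine Heath</div><div>9 min read</div>'))
# ... but a word that is split by an inline tag is broken up as well.
print(text_with_space_around_tags('<p><b>H</b>ello</p>'))
```

The trade-off is visible in the second call: words are never fused across tags, but a word that is only partly wrapped in an inline element gets split.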

Thanks,
Sebastian



Yossi

Aug 9, 2017, 11:41:07 AM

to Common Crawl

Opened an issue: https://github.com/commoncrawl/ia-web-commons/issues/13

Sebastian Nagel

Aug 9, 2017, 11:41:58 AM

to common...@googlegroups.com

Hi Yossi,

Thanks!

Sebastian
