parseHtml() not working or hidden login?


Stanislav Vasko

Dec 5, 2016, 12:46:26 PM
to OpenRefine
Greetings,

I have a list of URLs and created another column containing their HTML code. In a further column I want to parse this HTML to extract specific data, but I still can't get it to work. The HTML looks like this:

<div class="desc">
  <p class="top"><strong>Garance spokojenosti:</strong> 30 dní na vrácení zboží!</p>
  <p class="top">V ceně zahrnuto:<br></p>
  <ul>
    <li>sluneční brýle</li>
    <li><span class="originalCase infoButton2">Dárek zdarma: originální pouzdro<div class="over" style="margin-left: -106.5px;"><img src="xxx.png"></div></span></li>
    <li>záruka vrácení peněz do 30 dnů</li>
  </ul>
  <div class="priceItem" itemprop="offers" itemscope="" itemtype="http://schema.org/Offer">
  <a href="http://.cz/do-kosiku/740/" class="btn btn-pink btn-small to-basked pull-right">
      Koupit            </a>
      <span class="pull-right price">
        <meta itemprop="price" content="1 883">
        <strong>1 883</strong> 
        <meta itemprop="priceCurrency" content="Kč">
        Kč          </span>
        <meta itemprop="availability" content="http://schema.org/InStock">
        <span class="pull-right stockInfo" style="width:auto;">
          Skladem                      </span>
        </div>
        
        <p class="top">Cena brýlí bez skel</p>
        <div class="priceItem" itemprop="offers" itemscope="" itemtype="http://schema.org/Offer">
          <a href="http://.cz/do-kosiku/740/1" class="btn btn-pink btn-small to-basked pull-right">
            Koupit              </a>
            <span class="pull-right price">
              <meta itemprop="price" content="1 483">
              <strong>1 483</strong> 
              <meta itemprop="priceCurrency" content="Kč">
              Kč            </span>
              <meta itemprop="availability" content="http://schema.org/InStock">
              <span class="pull-right stockInfo" style="width:auto;">
                Skladem                          </span>
              </div>
            </div>


If I try:

value.parseHtml().select("div[class=desc]")

I get the whole DIV, as expected.

With:

value.parseHtml().select("p[class=top]")

I get only one P.top, although there are more.

But with:

value.parseHtml().select("div[class=priceItem]")
value.parseHtml().select("div[class=priceItem]")[0]

I get nothing, or an error.

Can somebody explain how to target elements? I need to parse the price in the <strong> tag, but I also want to understand how this works in general.

Thanks for any help or a link to an explanation. I have found a lot of tips, but nothing that describes why it works only for some elements.

Owen Stephens

Dec 5, 2016, 1:35:47 PM
to OpenRefine
Hi Stanislav,

Using the HTML here and the GREL expressions you posted, this all seems to work for me:

value.parseHtml().select("div[class=desc]")  - selects all of the HTML and puts it in an array
value.parseHtml().select("p[class=top]") - selects the 3 paragraphs with class 'top' and puts them in an array
value.parseHtml().select("div[class=priceItem]") - selects the 2 divs with class 'priceItem' and puts them in an array
value.parseHtml().select("div[class=priceItem]")[0] - selects the 2 divs with class 'priceItem' and puts them in an array, then selects the first item in the array and returns it as a string
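
For the price inside the <strong> tag, a minimal sketch along the same lines, assuming the cell holds the sample HTML from the first post. Each line is a separate GREL expression; htmlText() and htmlAttr() are the GREL functions for reading an element's text and an attribute value:

```
value.parseHtml().select("div.priceItem strong")[0].htmlText()
forEach(value.parseHtml().select("div.priceItem strong"), e, e.htmlText()).join(" | ")
value.parseHtml().select("meta[itemprop=price]")[0].htmlAttr("content")
```

The first expression takes the text of the first <strong> nested anywhere inside a priceItem div (the space in the selector means "descendant"). The second collects the text of every such <strong> and joins the results into one string. The third reads the price from the content attribute of the <meta itemprop="price"> tag instead. Note that [0] fails with an index error whenever the selector matches nothing, so it is worth checking the bare select() result first.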

I'm using OpenRefine 2.6rc2

Are you using a different version of OpenRefine?
Can you post some screenshots of where it isn't working?

Thanks

Owen


Stanislav Vasko

Dec 5, 2016, 2:48:05 PM
to OpenRefine
I'm not so lucky, even though I'm running the same version as you. I did a fresh reinstall to be sure I'm not missing something.

If I use:

value.parseHtml().select("p[class=top]")

I get only one P.top:


When I try value.parseHtml().select("div[class=priceItem]")


I get:


And finally, value.parseHtml().select("div[class=priceItem]")[0]

raises an error: Error: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0

Any tips on what is wrong? Thanks for helping!




Thad Guidry

Dec 5, 2016, 6:52:36 PM
to OpenRefine
You can optionally add more .toString() calls to figure out where the issue is.

I wrote up a wiki page on this several years ago:

Your Jsoup selector syntax can take advantage of the easier syntax for finding a class. Instead of [class=xxx] you can just write, for example, p.top

Classes can be prefixed with DOT .
IDs can be prefixed with HASH #
Elements with an attribute can be found with [attr]
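
Applied to the sample HTML from the first post, those shorthands might look like the sketch below. Each line is a separate GREL expression, and the selectors are only illustrations built from class and attribute names that appear in that sample:

```
value.parseHtml().select("p.top")
value.parseHtml().select("div.priceItem")
value.parseHtml().select("span.pull-right.price")
value.parseHtml().select("meta[itemprop]")
```

The sample has no id attributes, so something like div#content would be a purely hypothetical example of the HASH form. One practical difference between the two styles: [class=price] compares the whole attribute value, so it would not match class="pull-right price", whereas .price matches any element that has price among its classes.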


-Thad

Stanislav Vasko

Dec 6, 2016, 2:42:03 AM
to OpenRefine
I know, but I tried several syntax variants just to be sure. .toString() makes no difference. When .select() returns a result, toString() and join() work fine; when .select() fails, join() and toString() don't work either.

I can't find out what the problem is. Some tags and classes work fine, some don't :( Maybe it is a MacOS or local-setup-specific problem.

But thanks for the help, maybe the next release will fix it.



Owen Stephens

Dec 6, 2016, 4:19:43 AM
to OpenRefine
I'm using MacOS (Sierra). However, I'm running the Linux version of OR rather than the Mac-specific one, so you could try that?

Trying to debug from here is difficult. A couple of suggestions:

Look at the output of value.parseHtml() and check that it looks as you'd expect (all HTML present, no issues with characters)
Write GREL to just select the <p> tags - and see if you get all of them or not

Just trying to narrow down where the problem actually lies
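
As a rough sketch, those checks could be written as the following GREL expressions, each tried separately in the expression preview. The class name priceItem is taken from the sample HTML in the first post:

```
value.contains("priceItem")
value.parseHtml().select("p").length()
value.parseHtml().select("div.priceItem").length()
```

If the first expression returns false, the class is not present in the cell's HTML at all, which would point to a problem upstream of parseHtml(), such as the fetched page differing from the one viewed in the browser. The two counts show how many elements each selector actually matches before any [0] indexing is attempted.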

Owen

Stanislav Vasko

Dec 6, 2016, 5:35:27 AM
to OpenRefine
I installed OpenRefine on another computer running Windows 7 and got exactly the same problem.

But when I try value.parseHtml(), I don't get the same output as the input. At least some minor changes occur, for example:

<meta name="viewport" content="width=device-width, initial-scale=1">
is changed to
<meta name="viewport" content="width=device-width, initial-scale=1" />

or Czech characters are escaped:

Nenechte se zmást klasickým zevnějškem, tyto černé brýle v sobě skrývají jedinečné vlastnosti. Extrémní pružnost a neuvěřitelnou lehkost.
is changed to
Nenechte se zm&aacute;st klasick&yacute;m zevnějškem, tyto čern&eacute; br&yacute;le v sobě skr&yacute;vaj&iacute; jedinečn&eacute; vlastnosti. Extr&eacute;mn&iacute; pružnost a neuvěřitelnou lehkost.

I will try to install an Ubuntu virtual machine and test there, but I think it will be the same there too.


Stanislav Vasko

Dec 6, 2016, 6:07:05 AM
to OpenRefine
And it is exactly the same on Ubuntu, with a fresh install and everything up to date.

Do you get something for:
value.parseHtml().select("span.price")

I get only "[ ]", and if I use toString() I get an empty cell.

It doesn't work for me either. I can get the


Owen Stephens

Dec 6, 2016, 6:10:17 AM
to OpenRefine
That GREL works for me - see screenshot


Thad Guidry

Dec 6, 2016, 10:03:45 AM
to openr...@googlegroups.com

Stanislav,

Let's back up.

How are you CREATING the OpenRefine project?
What encoding options, or any other options, are you ticking?
What steps are you taking to create the OpenRefine project?
Can you export your Undo/Redo project history and attach it so we can take a look (it's JSON)?

-Thad

Stanislav Vasko

Dec 9, 2016, 4:10:02 PM
to OpenRefine
I finally found it. The whole problem was that Refine was fetching the HTML from the server without cookies (development version), while I was reading the HTML directly from the page source. I found it by comparing my HTML with the HTML downloaded by Refine. Sorry for such a stupid problem. But maybe it is good to know, and I will remember it for the rest of my life :)

Many thanks for the help, now everything works as it should.


