Scraping from a website using Node-RED?


Henrik Kamvåg

Aug 25, 2016, 3:59:32 PM
to Node-RED
OK, so I just started playing with Node-RED, and this should be fairly simple.

I'd like to retrieve two values (the first two minute values) from this site using Node-RED, and I then intend to pass them on to my Amazon Echo.

The trigger should be a simple HTTP POST.

I'd guess the flow could be as simple as this:


So how do I 'Extract values'?

Mark Setrem

Aug 25, 2016, 4:24:12 PM
to Node-RED
Does it work up to that point?

i.e. if you replace the function node with a debug node, does it return what you would expect? (Generally I write the output to a file and view it in a text editor, given the length of the data returned.)

Usually you need to delete "msg.headers" between http requests. Once the http request node returns what you expect, there is an HTML node that lets you extract an item from the HTML that is returned.
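
As a minimal sketch of the header advice above: the http request node stores the response headers on msg.headers, and it also reads msg.headers as request headers, so a second request further down the flow would otherwise send the first response's headers back out. Inside a function node only the two lines in the function body are needed; the wrapper name `dropHeaders` and the sample message are made up here so that the snippet runs on its own.

```javascript
// Sketch of a function node placed between two http request nodes.
// `dropHeaders` is a hypothetical wrapper; in Node-RED the function node
// body would just be `delete msg.headers; return msg;`.
function dropHeaders(msg) {
    // Remove the response headers left behind by the previous request so
    // they are not reused as request headers by the next one.
    delete msg.headers;
    return msg;
}

// Made-up sample message shaped like the output of an http request node.
const sampleMsg = {
    payload: "<html>...</html>",
    statusCode: 200,
    headers: { "content-type": "text/html" }
};
console.log(dropHeaders(sampleMsg));
```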

Henrik Kamvåg

Aug 26, 2016, 2:58:33 AM
to Node-RED
Nah, I'm stuck.

This is my flow;
[
    {
        "id": "17f399e0.ce0ee6",
        "type": "http request",
        "z": "14436238.a7959e",
        "name": "Faan",
        "method": "GET",
        "ret": "txt",
        "tls": "",
        "x": 390,
        "y": 160,
        "wires": [
            [
                "cfaf9e83.56b0b"
            ]
        ]
    },
    {
        "id": "cfaf9e83.56b0b",
        "type": "html",
        "z": "14436238.a7959e",
        "name": "extract",
        "tag": "#minutes",
        "ret": "attr",
        "as": "single",
        "x": 530,
        "y": 160,
        "wires": [
            [
                "2a669e13.5ec4d2"
            ]
        ]
    },
    {
        "id": "4503c16c.416f1",
        "type": "inject",
        "z": "14436238.a7959e",
        "name": "",
        "topic": "trigg",
        "payload": "gris",
        "payloadType": "flow",
        "repeat": "",
        "crontab": "",
        "once": false,
        "x": 240,
        "y": 160,
        "wires": [
            [
                "17f399e0.ce0ee6"
            ]
        ]
    },
    {
        "id": "6d0f1448.6abaac",
        "type": "debug",
        "z": "14436238.a7959e",
        "name": "",
        "active": true,
        "console": "false",
        "complete": "true",
        "x": 810,
        "y": 160,
        "wires": []
    },
    {
        "id": "2a669e13.5ec4d2",
        "type": "json",
        "z": "14436238.a7959e",
        "name": "",
        "x": 670,
        "y": 160,
        "wires": [
            [
                "6d0f1448.6abaac"
            ]
        ]
    }
]

This is my debug output:
 {"_msgid": "571edddc.a8e124", "topic": "trigg", "payload": "[{\"width\":\"7%\",\"id\":\"minutes\",\"title\":\"Realtid.\"},{\"width\":\"7%\",\"id\":\"minutes\",\"title\":\"Realtid.\"}]", "statusCode": 200, "headers": { "connection": "close", "date": "Fri, 26 Aug 2016 06:49:08 GMT", "server": "Microsoft-IIS/6.0", "x-powered-by": "ASP.NET", "content-type": "text/html", "set-cookie": [ "ASPSESSIONIDQSDTRRBT=JCBPICPDPGKBLOOGANBPKBGC; path=/" ], "cache-control": "private", "transfer-encoding": "chunked" }}

What I want to extract would be something like:
{ minutes: 7 },{ minutes: 15 }



Julian Knight

Aug 26, 2016, 4:18:37 AM
to Node-RED
This may help:

[{"id":"ab30d271.85f66","type":"http request","z":"462586cc.55d938","name":"Faan","method":"GET","ret":"txt","url":"http://www.thoreb.se/webdeparture/VL/dep.asp?stopno=2886&stopname=R%F6nnby%20C","tls":"","x":350,"y":1760,"wires":[["95484369.1f527"]]},{"id":"8dacb32b.d8a75","type":"inject","z":"462586cc.55d938","name":"","topic":"trigg","payload":"gris","payloadType":"flow","repeat":"","crontab":"","once":false,"x":200,"y":1760,"wires":[["ab30d271.85f66"]]},{"id":"aea290a2.cdce7","type":"debug","z":"462586cc.55d938","name":"","active":true,"console":"true","complete":"payload","x":690,"y":1760,"wires":[]},{"id":"95484369.1f527","type":"html","z":"462586cc.55d938","name":"","tag":"td#minutes","ret":"text","as":"multi","x":510,"y":1760,"wires":[["aea290a2.cdce7"]]}]

It uses the html node to extract the data from the two table cells that have the ID "minutes". Actually, that is bad coding, since all IDs are supposed to be unique and they should have used a class to identify the cells instead, but hey-ho. The way I've done it, it returns 2 messages, but you can easily change that to return an array; it is only one setting in the html node.

Of course, I don't know how robust that is since I don't use that site and my Swedish is rather rusty - well, to be frank, though I've just come back from 16d in Sweden and Norway, it is non-existent!
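
To illustrate the "2 messages versus an array" setting mentioned above, here is a rough sketch of what the two output modes of the html node deliver downstream. It does not parse any HTML; the helper `deliver` and the sample cell values are made up for illustration. The flow posted earlier in the thread had the html node set to return attributes ("ret": "attr"), which would explain why its debug output listed width, id and title rather than the cell text.

```javascript
// Hypothetical helper that mimics how matched cell texts are delivered in
// the two output modes ("as": "multi" or "as": "single"). It only models
// the message shapes, not the HTML parsing done by the html node.
function deliver(matches, as) {
    if (as === "multi") {
        // One message per matched element.
        return matches.map(text => ({ payload: text }));
    }
    // "single": one message whose payload is the array of all matches.
    return [{ payload: matches }];
}

// Made-up sample values standing in for the two td#minutes cells.
const cells = ["7", "15"];
console.log(deliver(cells, "multi"));
console.log(deliver(cells, "single"));
```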

sebasti...@gmail.com

Aug 26, 2016, 4:55:12 AM
to Node-RED
If you want the output as you suggested, try this flow. It also trims control characters from the result, using a very small function node. I had assumed the change node would be enough, but unfortunately it cannot change the data type by choosing a JSON output type.
That means you can apply a regex to the number in the extracted html output and put it in a JSON structure like {"minutes":$1}, but because the original type is a string, the result is something like "[object Object]", a not very pretty string representation of the JS object.

I don't think this is the intended behaviour of the change node, which is why a function node is needed.

Here is the flow:

[{"id":"97a218de.eaf478","type":"http request","z":"8d30d201.0f52d","name":"Faan","method":"GET","ret":"txt","url":"http://www.thoreb.se/webdeparture/VL/dep.asp?stopno=2886&stopname=R%F6nnby%20C","tls":"","x":730,"y":280,"wires":[["23db2835.de8ff8"]]},{"id":"23db2835.de8ff8","type":"html","z":"8d30d201.0f52d","name":"extract","tag":"td#minutes","ret":"text","as":"single","x":870,"y":280,"wires":[["a97d1184.936d"]]},{"id":"d572d58a.dbaaa8","type":"inject","z":"8d30d201.0f52d","name":"","topic":"trigg","payload":"gris","payloadType":"flow","repeat":"","crontab":"","once":false,"x":580,"y":280,"wires":[["97a218de.eaf478"]]},{"id":"11d9a5a.f748f5a","type":"debug","z":"8d30d201.0f52d","name":"","active":true,"console":"false","complete":"true","x":1170,"y":280,"wires":[]},{"id":"a97d1184.936d","type":"function","z":"8d30d201.0f52d","name":"Trim data","func":"msg.payload = [{minutes:msg.payload[0].trim()}, {minutes:msg.payload[1].trim()}]\nreturn msg;","outputs":1,"noerr":0,"x":1020,"y":280,"wires":[["11d9a5a.f748f5a"]]}]
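
The function node in the flow above handles exactly two cells by index. A slightly more general sketch, under the assumption that the html node is in "single" mode and hands over an array of strings, maps over however many cells were matched. The wrapper name `toMinutes`, the digit check and the sample input are additions for illustration and are not part of the posted flow. The first line also shows the string coercion that produces the "[object Object]" text described above.

```javascript
// Coercing a plain object to a string gives "[object Object]", which is
// the unhelpful result described for the change node approach.
console.log(String({ minutes: 7 }));

// Hypothetical wrapper around a function node body. Assumes msg.payload is
// an array of strings, as produced by the html node in "single" mode.
function toMinutes(msg) {
    msg.payload = msg.payload.map(text => {
        const trimmed = text.trim();
        // Convert to a number only when the cell holds plain digits;
        // anything else is passed through as the trimmed string.
        return { minutes: /^\d+$/.test(trimmed) ? Number(trimmed) : trimmed };
    });
    return msg;
}

// Made-up sample input with surrounding whitespace and line breaks.
console.log(toMinutes({ payload: ["\r\n 7 \t", " 15\n"] }).payload);
```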

Henrik Kamvåg

Aug 27, 2016, 11:54:34 AM
to Node-RED
Thanks a lot. Comparing with what you gave me, I was almost there, but I just could not figure out those last steps. :)

Thanks again.

Henrik Kamvåg

Aug 27, 2016, 12:44:52 PM
to Node-RED
That was quite helpful. However, now I face the challenge of renaming the "minutes" keys to "first" and "second". :)
I'm looking into whether _msgid could be unique, or whether anything else could be used to separate them.

Henrik Kamvåg

Aug 27, 2016, 1:28:32 PM
to Node-RED
Fixed it. Thanks for the function.

Cheers!
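
The thread does not show how the renaming was solved. One possible approach, offered only as a sketch and not necessarily what was used, relies on the position of each value in the array rather than on _msgid. It assumes the html node is in "single" mode so that both values arrive in one message; the wrapper name `nameDepartures` and the sample input are made up.

```javascript
// Hypothetical function node body wrapped in a function so it runs alone.
// Assumes msg.payload is an array with at least two strings.
function nameDepartures(msg) {
    // Array position decides which value is "first" and which is "second".
    const [first, second] = msg.payload.map(text => text.trim());
    msg.payload = { first, second };
    return msg;
}

// Made-up sample input.
console.log(nameDepartures({ payload: [" 7 ", "15\n"] }).payload);
```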