Web data to Node-RED


gorem...@gmail.com

Dec 10, 2017, 12:56:01 PM12/10/17
to Node-RED

I am new to web scraping. How do I get this data into Node-RED from this link: http://publicinfobanjir.water.gov.my/View/OnlineFloodInfo/PublicWaterLevel.aspx?scode=SEL


Do I need to use third-party software?



Thank you

gorem...@gmail.com

Dec 10, 2017, 1:01:53 PM12/10/17
to Node-RED
It's live data, and I need to trigger an event if the water level rises above the normal level.

Dave C-J

Dec 10, 2017, 6:08:41 PM12/10/17
to node...@googlegroups.com
Hi

This simple flow should get you started:

[{"id":"6356e541.11a4dc","type":"inject","z":"f03b57d5.e525f8","name":"","topic":"","payload":"","payloadType":"date","repeat":"","crontab":"","once":false,"x":120,"y":1200,"wires":[["f10c4c74.0c258"]]},{"id":"f10c4c74.0c258","type":"http request","z":"f03b57d5.e525f8","name":"","method":"GET","ret":"txt","url":"http://publicinfobanjir.water.gov.my/View/OnlineFloodInfo/PublicWaterLevel.aspx?scode=SEL","tls":"","x":290,"y":1200,"wires":[["6deab1bb.58e8a"]]},{"id":"6deab1bb.58e8a","type":"html","z":"f03b57d5.e525f8","name":"","tag":"table[class=Grid]","ret":"html","as":"multi","x":490,"y":1200,"wires":[["4eb5946f.7eae7c"]]},{"id":"4eb5946f.7eae7c","type":"html","z":"f03b57d5.e525f8","name":"","tag":"tr","ret":"html","as":"single","x":690,"y":1200,"wires":[["80bd91fa.77ce5"]]},{"id":"80bd91fa.77ce5","type":"debug","z":"f03b57d5.e525f8","name":"","active":true,"complete":false,"x":850,"y":1200,"wires":[]}]

It gets the web page - then slices out the relevant table, then puts each row into an array.
You can then slice these up however you want.

gorem...@gmail.com

Dec 26, 2017, 10:40:01 AM12/26/17
to Node-RED

Hi

Thank you for the flows. I have tried them but I do not know how to proceed further. Is there any prebuilt node that can make the output more readable? Basically the output should look like this:

Thank you

STATE :  SELANGOR
Sg.Bernam di Jambatan SKC
Hulu Selangor, SELANGOR 
Date: 26/12/2017 23:15
Water Level 18.23 m has exceeded the ALERT!   level 18.10 m by 0.13m
WL Trend : No Change

[Attachments: node.jpg, end.jpg]

Julian Knight

Dec 26, 2017, 12:58:32 PM12/26/17
to Node-RED
OK, so I think the real problem is converting from an HTML table to JSON data for processing. This is certainly a non-trivial task. All of the examples I have found make use of jQuery or a similar node.js equivalent such as Cheerio.

While there is a cheerio based node, as far as I can see, its configuration interface is too simplistic to be useful here and there are no examples to help.

The format of the table is quite complex - the easiest approach, unfortunately, would seem to be to install the cheerio npm package into ~/.node-red, reference it from a global variable in settings.js, and then use that in a function node to extract the data to JSON. There are a number of examples on the web of how to extract a table to JSON using jQuery, which should work with Cheerio.
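For reference, the settings.js part looks something like this sketch - assuming you've already run npm install cheerio inside ~/.node-red:

```javascript
// ~/.node-red/settings.js (fragment)
// Expose the locally-installed cheerio module to function nodes,
// where it becomes available as global.get('cheerio').
functionGlobalContext: {
    cheerio: require('cheerio')
},
```

Restart Node-RED after changing settings.js so the new global takes effect.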

This certainly seems to highlight a need for a new node to do this as it is a very common requirement for any data processing task.

This is going to annoy me now as this really should be the kind of thing that Node-RED should be good for.

Dave C-J

Dec 26, 2017, 1:16:49 PM12/26/17
to node...@googlegroups.com
Julian
The example I gave sliced up the page OK. It now needs to be re-rendered back into HTML, which is fairly standard template stuff.
--
Sent from phone.

Julian Knight

Dec 26, 2017, 2:10:35 PM12/26/17
to Node-RED
Sorry Dave, but I don't believe it gives the level of information that he needs to do what he indicated. I tried it and it returned an array of rows, so each row would have to be processed and then everything put back together - not easy in itself.


Here is my first take on a function node that uses cheerio to parse an HTML table - set to the table in question. You only need the inject and request nodes from your example - obviously, you also have to add a global reference to the cheerio module.

const tableSelector = '#ContentPlaceHolder1_grdStation'

const cheerio = global.get('cheerio')
const $ = cheerio.load(msg.payload)
const options = {
    rowForHeadings: 0,  // extract th cells from this row for column headings (zero-based)
    ignoreHeadingRow: true, // Don't tread the heading row as data
    ignoreRows: [],
}
const jsonReponse = []
const columnHeadings = []

$(tableSelector).each(function(i, table) {
    var trs = $(table).find('tr')

    // Set up the column heading names
    getColHeadings( $(trs[options.rowForHeadings]) )

    // Process rows for data
    $(table).find('tr').each(processRow)
})

msg.payload = {
    columnHeadings: columnHeadings,
    rows: jsonReponse,
}

return msg;

function getColHeadings(headingRow) {
    const alreadySeen = {}

    $(headingRow).find('th').each(function(j, cell) {
        let tr = $(cell).text().trim()

        if ( alreadySeen[tr] ) {
            let suffix = ++alreadySeen[tr]
            tr = `${tr}_${suffix}`
        } else {
            alreadySeen[tr] = 1
        }

        columnHeadings.push(tr)
    })
}

function processRow(i, row) {
    const rowJson = {}

    if ( options.ignoreHeadingRow && i === options.rowForHeadings ) return
    // TODO: Process options.ignoreRows

    $(row).find('td').each(function(j, cell) {
        rowJson[ columnHeadings[j] ] = $(cell).text().trim()
    })

    // Skip blank rows
    if (JSON.stringify(rowJson) !== '{}') jsonReponse.push(rowJson)
}

//EOF

gorem...@gmail.com

Dec 27, 2017, 1:06:01 PM12/27/17
to Node-RED
Hi

Can you share the full flows? I'm a newbie.

Thank you

Julian Knight

Dec 29, 2017, 6:23:15 AM12/29/17
to Node-RED
Sorry, I've been travelling. I will try to do something today. I think I will be turning this into a node when I get a chance.

Dave C-J

Dec 29, 2017, 7:45:52 AM12/29/17
to node...@googlegroups.com
We do have an HTML node (based on Cheerio) - but it's a PITA to be honest.... so if there was a way of improving that - that would be good.

Nick O'Leary

Dec 29, 2017, 7:49:46 AM12/29/17
to node...@googlegroups.com

Dave,

I'm sure the maintainers of the html node would love to get some feedback on the pains it presents you. I'm not aware of any issues, and nothing appears on the whiteboard for looking at it.

Nick



Julian Knight

Dec 29, 2017, 12:49:25 PM12/29/17
to Node-RED
I'll admit to having not gotten around to really using that in anger for anything complex. But clearly the html node is very simplistic.

That is great for people just getting going, and I am not at all criticising it. But it certainly doesn't provide the capability to extract anything reasonably complex, as we've seen from this thread. However, it works well for getting rid of any outer cruft before going on to process the innards.

In truth, I'm not sure how you would change the node to allow more options since, looking at how Cheerio works, it is clearly using a separate grammar - great for writing into code, not so good for translating into something visual.

If anyone can suggest a way to fold a table extractor into it, I'd be happy to create a PR. But currently, for simplicity, I think creating a simple table extractor node is going to be best. As always, open to suggestions though.

Julian Knight

Dec 29, 2017, 4:36:54 PM12/29/17
to Node-RED
Right, so here is the first part of the answer as a flow:

[{"id":"815bf28f.d0691","type":"inject","z":"e7463dd2.db517","name":"","topic":"","payload":"","payloadType":"date","repeat":"","crontab":"","once":false,"x":140,"y":100,"wires":[["46f8420f.a312dc"]]},{"id":"46f8420f.a312dc","type":"http request","z":"e7463dd2.db517","name":"","method":"GET","ret":"txt","url":"http://publicinfobanjir.water.gov.my/View/OnlineFloodInfo/PublicWaterLevel.aspx?scode=SEL","tls":"","x":290,"y":100,"wires":[["d0660df5.4f339"]]},{"id":"fe447848.e25938","type":"debug","z":"e7463dd2.db517","name":"","active":true,"complete":false,"x":602.0000228881836,"y":99.50000262260437,"wires":[]},{"id":"d0660df5.4f339","type":"function","z":"e7463dd2.db517","name":"","func":"const tableSelector = '#ContentPlaceHolder1_grdStation'\n\nconst cheerio = global.get('cheerio')\nconst $ = cheerio.load(msg.payload)\nconst options = {\n    rowForHeadings: 0,  // extract th cells from this row for column headings (zero-based)\n    ignoreHeadingRow: true, // Don't tread the heading row as data\n    ignoreRows: [],\n}\nconst jsonReponse = []\nconst columnHeadings = []\n\n$(tableSelector).each(function(i, table) {\n    var trs = $(table).find('tr')\n    \n    // Set up the column heading names\n    getColHeadings( $(trs[options.rowForHeadings]) )\n\n    // Process rows for data\n    $(table).find('tr').each(processRow)\n})\n\nmsg.payload = {\n    columnHeadings: columnHeadings,\n    rows: jsonReponse,\n}\n\nreturn msg;\n\nfunction getColHeadings(headingRow) {\n    const alreadySeen = {}\n    \n    $(headingRow).find('th').each(function(j, cell) {\n        let tr = $(cell).text().trim()\n        \n        if ( alreadySeen[tr] ) {\n            let suffix = ++alreadySeen[tr]\n            tr = `${tr}_${suffix}`\n        } else {\n            alreadySeen[tr] = 1\n        }\n        \n        columnHeadings.push(tr)\n    })\n}\n\nfunction processRow(i, row) {\n    const rowJson = {}\n    \n    if ( options.ignoreHeadingRow && i === options.rowForHeadings ) return\n    // 
TODO: Process options.ignoreRows\n    \n    $(row).find('td').each(function(j, cell) {\n        rowJson[ columnHeadings[j] ] = $(cell).text().trim()\n    })\n    \n    // Skip blank rows\n    if (JSON.stringify(rowJson) !== '{}') jsonReponse.push(rowJson)\n}\n\n//EOF","outputs":1,"noerr":0,"x":430,"y":100,"wires":[["fe447848.e25938"]]}]

That outputs an object that contains two arrays: the first is an array of the heading names, the second an array of objects, one entry for each of the returned stations.

The next part of the answer would be to turn that into your Dashboard. To do that, unless you only want a single station, we would need to understand how you want multiple stations shown in the Dashboard.

If you only want a single station, you would need to know which row it is - there are ways to search for a specific entry, but you'd need to let me know how you would identify the station you want (e.g. by Station ID). In simple terms, something like payload.rows[40] would return the 40th row, and then each property of the object is accessed like payload.rows[40].District, or payload.rows[40]['Station ID'] for properties with names containing spaces.
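For example, a search by Station ID could look like this sketch - the rows here are made-up stand-ins for the real msg.payload produced by the function node, and the IDs are illustrative:

```javascript
// Find one station in the parsed output by its "Station ID" column,
// instead of hard-coding a row index. Sample data, not real readings.
const payload = {
    columnHeadings: ['Station ID', 'Station Name', 'District', 'River Level (m)', 'Alert'],
    rows: [
        { 'Station ID': '3516422', 'Station Name': 'Sg.Bernam di Jambatan SKC',
          'District': 'Hulu Selangor', 'River Level (m)': '18.23', 'Alert': '18.10' },
        { 'Station ID': '2917401', 'Station Name': 'Another station',
          'District': 'Petaling', 'River Level (m)': '2.10', 'Alert': '4.50' },
    ],
}

// Locate the row whose Station ID matches, then read its properties.
const station = payload.rows.find(r => r['Station ID'] === '3516422')
const level = Number(station['River Level (m)'])
const exceeded = level > Number(station['Alert'])  // 18.23 > 18.10, so true
```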

As a bonus, here is an extended version of the above but with 2 different methods of extracting or manipulating the data. In addition, it has a small caching function that speeds up development since you don't have to keep re-requesting the data from the server each time:

[{"id":"815bf28f.d0691","type":"inject","z":"e7463dd2.db517","name":"get data","topic":"","payload":"","payloadType":"date","repeat":"","crontab":"","once":false,"x":120,"y":100,"wires":[["46f8420f.a312dc"]]},{"id":"46f8420f.a312dc","type":"http request","z":"e7463dd2.db517","name":"","method":"GET","ret":"txt","url":"http://publicinfobanjir.water.gov.my/View/OnlineFloodInfo/PublicWaterLevel.aspx?scode=SEL","tls":"","x":330,"y":100,"wires":[["808d9c98.e305d"]]},{"id":"fe447848.e25938","type":"debug","z":"e7463dd2.db517","name":"","active":true,"complete":false,"x":810,"y":60,"wires":[]},{"id":"d0660df5.4f339","type":"function","z":"e7463dd2.db517","name":"","func":"/*jshint sub:true,asi:true,maxerr:1000*/\n\nconst tableSelector = '#ContentPlaceHolder1_grdStation'\n\nconst cheerio = global.get('cheerio')\nconst $ = cheerio.load(msg.payload)\nconst options = {\n    rowForHeadings: 0,  // extract th cells from this row for column headings (zero-based)\n    ignoreHeadingRow: true, // Don't tread the heading row as data\n    ignoreRows: [],\n}\nconst jsonReponse = []\nconst columnHeadings = []\n\n$(tableSelector).each(function(i, table) {\n    var trs = $(table).find('tr')\n    \n    // Set up the column heading names\n    getColHeadings( $(trs[options.rowForHeadings]) )\n\n    // Process rows for data\n    $(table).find('tr').each(processRow)\n})\n\nmsg.payload = {\n    columnHeadings: columnHeadings,\n    rows: jsonReponse,\n}\n\nreturn msg\n\nfunction getColHeadings(headingRow) {\n    const alreadySeen = {}\n    \n    $(headingRow).find('th').each(function(j, cell) {\n        let tr = $(cell).text().trim()\n        \n        if ( alreadySeen[tr] ) {\n            let suffix = ++alreadySeen[tr]\n            tr = `${tr}_${suffix}`\n        } else {\n            alreadySeen[tr] = 1\n        }\n        \n        columnHeadings.push(tr)\n    })\n}\n\nfunction processRow(i, row) {\n    const rowJson = {}\n    \n    if ( options.ignoreHeadingRow && i === 
options.rowForHeadings ) return\n    // TODO: Process options.ignoreRows\n    \n    $(row).find('td').each(function(j, cell) {\n        rowJson[ columnHeadings[j] ] = $(cell).text().trim()\n    })\n    \n    // Skip blank rows\n    if (JSON.stringify(rowJson) !== '{}') jsonReponse.push(rowJson)\n}\n\n//EOF","outputs":"1","noerr":0,"x":650,"y":100,"wires":[["fe447848.e25938","c35dcf95.5cc9","641996d3.fe8118"]]},{"id":"641996d3.fe8118","type":"change","z":"e7463dd2.db517","name":"","rules":[{"t":"set","p":"selected","pt":"msg","to":"{}","tot":"json"},{"t":"set","p":"selected.station","pt":"msg","to":"payload.rows[40]['Station ID']","tot":"msg"},{"t":"set","p":"selected.level","pt":"msg","to":"payload.rows[40]['River Level (m)']","tot":"msg"},{"t":"move","p":"selected","pt":"msg","to":"payload","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":820,"y":100,"wires":[["f0152260.70e1d"]]},{"id":"f0152260.70e1d","type":"debug","z":"e7463dd2.db517","name":"","active":true,"complete":false,"x":990,"y":100,"wires":[]},{"id":"c35dcf95.5cc9","type":"change","z":"e7463dd2.db517","name":"","rules":[{"t":"set","p":"payload","pt":"msg","to":"$$.payload.rows.(\t   {\t       \"station\": $.'Station ID',\t       \"name\": $.'Station Name',\t       \"level\": $.'River Level (m)',\t       \"alert\": $number($.'River Level (m)') > $number($.Alert) ? 
true : false\t   }\t)","tot":"jsonata"}],"action":"","property":"","from":"","to":"","reg":false,"x":820,"y":140,"wires":[["d505285a.26bcb8"]]},{"id":"d505285a.26bcb8","type":"debug","z":"e7463dd2.db517","name":"","active":true,"complete":false,"x":990,"y":140,"wires":[]},{"id":"808d9c98.e305d","type":"function","z":"e7463dd2.db517","name":"cache","func":"/*jshint sub:true,asi:true,maxerr:1000*/\n// Expects input msgs with topic set \n\nconst contextVarName = 'httpReqMsgs' // homeMsgs\n\n// saved context\nvar cachedMsgs = context.get(contextVarName) || {}\n\n// Only send to single client if needed\nvar socketId = null\nif ( msg.hasOwnProperty('_socketId') ) {\n    socketId = msg._socketId\n}\n\n// Replay cache if requested\nif ( msg.hasOwnProperty('cacheControl') && msg.cacheControl.toUpperCase() === 'REPLAY' ) {\n    for (var topic in cachedMsgs) {\n        let newMsg = {\n            \"topic\": topic, \n            \"payload\": cachedMsgs[topic]\n        }\n        // Only send to a single client if we can\n        if ( socketId !== null ) newMsg._socketId = socketId\n        node.send(newMsg)\n    }\n    return null\n}\n// -- else if --\n// Empty cache if requested\nif ( (msg.hasOwnProperty('cacheControl') && msg.cacheControl === 'RESET')  ||\n     (msg.payload.hasOwnProperty('cacheControl') && msg.payload.cacheControl === 'RESET') ) {\n    cachedMsgs = {}\n    context.set(contextVarName, cachedMsgs)\n    return null\n}\n// -- else --\n\n// ignore cacheControl and uibuilder control messages\nif ( msg.hasOwnProperty('cacheControl') || msg.hasOwnProperty('uibuilderCtrl') ) return null\n\n// Add a counter for each device name\nif ( msg.topic.endsWith('$name') ) {\n    let topic = msg.topic.replace('$name', '$count')\n    let count = cachedMsgs[topic] || 0\n    count = count + 1\n    cachedMsgs[topic] = count\n    let newMsg = {\n        \"topic\": topic, \n        \"payload\": count\n    }\n    // Only send to a single client if we can\n    if ( socketId !== null ) 
newMsg._socketId = socketId\n    node.send(newMsg)\n}\n\n// Keep the last msg.payload by topic\ncachedMsgs[msg.topic] = msg.payload\n\n// save context for next time\ncontext.set(contextVarName, cachedMsgs)\n\nreturn msg;","outputs":1,"noerr":0,"x":510,"y":100,"wires":[["d0660df5.4f339","c0cacccb.9e3c7"]]},{"id":"afc218de.706f88","type":"inject","z":"e7463dd2.db517","name":"replay","topic":"","payload":"","payloadType":"str","repeat":"","crontab":"","once":false,"x":110,"y":160,"wires":[["eddcc57e.277418"]]},{"id":"c0cacccb.9e3c7","type":"debug","z":"e7463dd2.db517","name":"","active":false,"complete":"true","x":650,"y":180,"wires":[]},{"id":"eddcc57e.277418","type":"change","z":"e7463dd2.db517","name":"","rules":[{"t":"set","p":"cacheControl","pt":"msg","to":"REPLAY","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":300,"y":160,"wires":[["808d9c98.e305d"]]}]

The "Change 4 rules" change node extracts a subset of a single row of the table. The "set msg.payload" change node uses JSONata to show how you could simplify the data for all rows in a single statement; it also adds a flag which is true if the current river level is above the "alert" level.
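That JSONata rule is roughly equivalent to this plain-JavaScript map - sketched with illustrative rows, not real gauge readings:

```javascript
// Reduce each parsed table row to four fields and flag rows whose
// river level exceeds the Alert threshold (sample data only).
const rows = [
    { 'Station ID': 'A', 'Station Name': 'One', 'River Level (m)': '18.23', 'Alert': '18.10' },
    { 'Station ID': 'B', 'Station Name': 'Two', 'River Level (m)': '1.20', 'Alert': '4.50' },
]

const simplified = rows.map(r => ({
    station: r['Station ID'],
    name: r['Station Name'],
    level: r['River Level (m)'],
    alert: Number(r['River Level (m)']) > Number(r['Alert']),
}))
// simplified[0].alert is true, simplified[1].alert is false
```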

gorem...@gmail.com

Jan 2, 2018, 9:31:44 AM1/2/18
to Node-RED

Thank you, Julian, for the flows, but I have errors. The first error was from the first flow and the second from the one below.

Julian Knight

Jan 2, 2018, 4:52:57 PM1/2/18
to Node-RED
You will need to go through your function nodes and check them for problems. Quite possibly a small typo or something like that.

A couple of hints for debugging:
  • Change the node name settings for each function and debug node so that it is obvious which node errors and debug output are from even in an image.

  • Inside a function that is giving an error, you might need additional debugging output in order to identify where in the function code the error occurs. You can use "node.warn('position 1')" or similar statements throughout the code.

Colin Law

Jan 2, 2018, 5:06:07 PM1/2/18
to node...@googlegroups.com
In Julian's flow, cheerio is set to global.get('cheerio'), and then the code calls cheerio.load(msg.payload). The errors you are getting say that you are calling load on an undefined variable, suggesting that the global context variable cheerio has not been set up. Perhaps you have not set up the global variable at the time the function is being called. I have no idea what the flow is actually trying to do, so I don't know what the significance of that variable is.
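A defensive version of the first lines of the function node would make that failure obvious. Sketched here with stand-ins for the function-node environment so it runs standalone - in a real function node, global, node and msg already exist:

```javascript
// Stand-ins for the Node-RED function-node environment.
const global = { get: name => undefined }   // simulates the missing settings.js setup
const node = { error: (text, m) => console.log('ERROR:', text) }
const msg = { payload: '<html></html>' }

// Fail with a clear message instead of "cannot read property 'load'
// of undefined" when cheerio was never added to functionGlobalContext.
const cheerio = global.get('cheerio')
let result = null
if (!cheerio) {
    node.error('cheerio missing from global context - add it to functionGlobalContext in settings.js', msg)
} else {
    result = cheerio.load(msg.payload)
}
```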

Colin


Julian Knight

Jan 2, 2018, 5:29:23 PM1/2/18
to Node-RED
Good spot Colin, yes that is quite likely.

The flow is getting a page from the web that contains a fairly complex HTML table. My example flows extract the table into a JavaScript object and demonstrate how to select data from it. The lower flow has a caching function that simply lets you avoid having to requery the web all the time (which is quite slow).
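The caching idea boils down to something like this - context is stubbed here so the sketch runs outside Node-RED:

```javascript
// Stand-in for Node-RED's per-node context store.
const store = {}
const context = { get: k => store[k], set: (k, v) => { store[k] = v } }

// Keep the last payload per topic so a later "replay" can resend it
// without re-fetching the (slow) web page.
function cache(msg) {
    const cached = context.get('httpReqMsgs') || {}
    cached[msg.topic] = msg.payload
    context.set('httpReqMsgs', cached)
    return msg
}

const out = cache({ topic: 'water/SEL', payload: '<table>...</table>' })
```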