Scrap data by using google sheets

863 views
Skip to first unread message

Le pirate noir

unread,
Jan 25, 2022, 11:17:30 AM1/25/22
to Google Apps Script Community
Hi, I would like to scrap datas on that website by using google apps script. I've tried many things but nothing works so far. Here is my code : 

function mainFunction() {

var str = UrlFetchApp.fetch(url).getContentText()
  const mainRegex = /<table id=\"manitouTable\" aria-label=\"Liste des postes disponibles\">([\s\S]*?)<\/table>/gi;
 var results = str.match(mainRegex);

Logger.log(results)

It comes *null* when i log results. I don't know why...

What i would like to do is to exctract is :
the description of the job
the city
the date
and put each of this information listed in 3 separate columns


Important : I must apply a filter to extract the datas that i want (see file attached)
What should i do to make it work ?

Thanks



12.png

Martin Hawksey

unread,
Jan 25, 2022, 12:14:01 PM1/25/22
to Google Apps Script Community
As it happens I recently came across the cheerio JS library that can be used with Google Apps Script. This is able to digest raw HTML and then select elements from the page similar to jquery syntax. Issue you still might encounter is it's not a headless browser so if the data you are trying to get is added async or with javascript it won't work. There is an apps script example of cheerio here https://pulse.appsscript.info/p/2022/01/google-apps-script-release-notes-as-an-rss-feed-scraping-web-pages-with-cheerio/ 

Le pirate noir

unread,
Jan 26, 2022, 4:09:45 PM1/26/22
to Google Apps Script Community
Thanks for the answer. I tried with cheerio, but i can't access the datas. Do you have a solution for me ?

Braja Patnaik

unread,
Jan 26, 2022, 10:42:29 PM1/26/22
to Google Apps Script Community
Have you checked if the URL - https://www.quebecemploi.gouv.qc.ca/plateforme-emploi/tables/ is correct or not. I loaded the given URL using my chrome browser. It does not show any data, rather some broken links.. Even if I do a view source of the page in browser, I cannot see the data. There are no tr/tds which contain the data. The URL  https://www.quebecemploi.gouv.qc.ca/plateforme-emploi/ actually renders the required data on the browser.  

Le pirate noir

unread,
Jan 31, 2022, 12:47:52 PM1/31/22
to Google Apps Script Community
Thanks for the answer, 

(note :  in my last message, i made a *typing mistake* by putting the URL. As you mentionned, i wanted to say that url : https://www.quebecemploi.gouv.qc.ca/plateforme-emploi/


i can't access the datas with cheerio. 

Can you help me please

Braja Patnaik

unread,
Jan 31, 2022, 12:59:12 PM1/31/22
to google-apps-sc...@googlegroups.com
Yes, I did some research after sending my previous email. The data is not being displayed in view source even with the correct URL.This means that the data is being dynamically generated through some callback event within the page. A further tinkering game me this URL:
If you check this directly in your browser, it is returning all the job postings. Infact, you can also apply filters like here:
I have filtered on CISS.
This data can easily be read and updated into your google sheet.

 

--
You received this message because you are subscribed to a topic in the Google Groups "Google Apps Script Community" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/google-apps-script-community/UWgf0tz-Ey8/unsubscribe.
To unsubscribe from this group and all its topics, send an email to google-apps-script-c...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-apps-script-community/4d42bf96-e047-4422-a801-a156732dc990n%40googlegroups.com.

Braja Patnaik

unread,
Jan 31, 2022, 1:01:57 PM1/31/22
to google-apps-sc...@googlegroups.com
Also, please change the en to fr in the URL.
https://www.quebecemploi.gouv.qc.ca/search/postingFilteredAI/fr/7b8d28ca-fa33-0b4f-baa2-46fd12521ec9/*/0/0/999/0/false/0/0/0/0/0/0
https://www.quebecemploi.gouv.qc.ca/search/postingFilteredAI/fr/7b8d28ca-fa33-0b4f-baa2-46fd12521ec9/CISS/0/0/999/0/false/0/0/0/0/0/0

Le pirate noir

unread,
Feb 2, 2022, 9:13:00 PM2/2/22
to Google Apps Script Community
Thank you very much for this very good answer.
For my personnal knowledge, what does this sequence means :  search/postingFilteredAI/en/7b8d28ca-fa33-0b4f-baa2-46fd12521ec9/*/0/0/999/0/false/0/0/0/0/0/0       ?
Second, what iwould be the best way to read that data and updated the ifo into my google sheet.

I would like to extrack the expiration date, the city name and the description of the job and list these infos in 3 seprarate columns

Thanks for helping

Braja Patnaik

unread,
Feb 3, 2022, 2:47:17 AM2/3/22
to google-apps-sc...@googlegroups.com
The sequence is a request to the server to search on certain filter conditions and provide the data. 
Each term highlighted below corresponds to one of the filter criteria that you see on the main search page.  
search/postingFilteredAI/en/7b8d28ca-fa33-0b4f-baa2-46fd12521ec9/*/0/0/999/0/false/0/0/0/0/0/0
For example, * in the sequence means to get all. In place of *, if you mention COOK, it will filter the results and display those records that have the word COOK in it. Similarly, the 'false' that you see in the sequence is for the check-box on the search page, when it is unchecked. If the check-box in the search criteria is checked, it will be true. Hope this helps.

The below code can help you retrieve the data to your excel sheet. But please note that the server is returning a maximum of 300 rows at a time only. So, you cannot get all the records together. If you wish to get more records, then the filter criteria needs to be updated accordingly and change the webURL in the code below.

function scrapeQuebec() {
  var r = JSON.parse(UrlFetchApp.fetch(webURL).getContentText());
  var ss = SpreadsheetApp.getActiveSpreadsheet().getSheets()[0];
  var val = [];
  var max_ret = r.meta.total_hits > 299 ? 300 : r.meta.total_hits;
  for (i = 0; i < max_ret; i++)
    val.push([r.items[i].expiration_date, r.items[i].nom_ville, r.items[i].titre]);
  ss.getRange(ss.getLastRow() + 1, 1, max_ret, 3).setValues(val);
}

Reply all
Reply to author
Forward
0 new messages