Scraping from each link - beginner.

Dutch Dragonser

Jun 11, 2023, 12:07:53 PM
to Web Scraping
I am trying to automate my process of copying links from my huge YouTube Watch Later playlist and pasting each link into a site that summarizes the videos. This way I can empty my Watch Later of videos I have been procrastinating on watching that were not so good, but still have a summary of the good ones.

I have already scraped a list of URLs and have them in a CSV file under the column header URLS.
But now I am stuck: how can I scrape the summary from each link? Each link gets entered in the "yt link" box, then Summarize is pressed, and a text with the summary I want to copy appears on the same page.
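For reference, the per-link flow described here (enter URL, press Summarize, copy the summary text) could be sketched in Python. This is only a hedged sketch: the browser work is left as a pluggable `summarize` function, and the Selenium selectors in the comment are made-up placeholders, not the real page's.

```python
import csv
import io

def read_urls(csv_text):
    """Pull the URLS column out of the already-scraped CSV."""
    return [row["URLS"] for row in csv.DictReader(io.StringIO(csv_text))]

def summarize_all(urls, summarize):
    """Run the per-link flow and collect url -> summary.

    `summarize` is whatever does the browser work, so the loop itself
    stays easy to test without a browser.
    """
    return {url: summarize(url) for url in urls}

# A Selenium-based `summarize` might look roughly like this (untested sketch,
# hypothetical selectors):
#   def summarize(url):
#       box = driver.find_element(By.CSS_SELECTOR, "input#yt-link")
#       box.clear(); box.send_keys(url)
#       driver.find_element(By.CSS_SELECTOR, "button.summarize").click()
#       return WebDriverWait(driver, 120).until(
#           EC.presence_of_element_located((By.CSS_SELECTOR, ".summary"))).text
```

The actual selectors would have to be found by inspecting the summarizer page first.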

Andrew11

Jun 11, 2023, 12:17:35 PM
to Web Scraping
Have you tried https://takeout.google.com/ and deselecting everything but Youtube -> playlist?

Andrew11

Jun 11, 2023, 12:19:11 PM
to Web Scraping
Oops, you were talking about the next step where you get summaries from a separate site. What's the URL?

Dutch Dragonser

Jun 11, 2023, 12:27:49 PM
to Web Scraping
Yeah, well, either way, I have tried Takeout just now, but that doesn't export a list of video URLs; rather it gives Playlist ID, Channel ID, date created, etc.

The URL of the website?

Andrew11

Jun 11, 2023, 12:53:52 PM
to Web Scraping
Go to Projects -> Import project and use the unzipped file attached here... To edit the list of URLs, go to the gear icon at the top of the left sidebar and edit the Start Value in Project settings.
clipnote.ai_Project.zip

Dutch Dragonser

Jun 11, 2023, 1:58:35 PM
to Web Scraping
Great, it works! I thought it had something to do with the Start Value, but the whole setup looks different from what I expected; I thought there had to be a loop somewhere. Could you explain what you have done, so I can understand it better and edit it for future use?

Thanks very much; it seems like you are an advanced user.

Andrew11

Jun 11, 2023, 3:36:47 PM
to Web Scraping
Sure! If you have Google Chrome, open the clipnote.ai page, press CTRL+SHIFT+J, and click the Elements tab in the Developer window that opens; you can then right-click things on the page and the matching HTML gets selected in the Dev window so you can take a better look. W3Schools has some CSS references that can help you learn about the selectors that let you tell ParseHub which elements you're looking for. You can change a regular ParseHub auto Select statement into CSS using the green button in the left sidebar when the command is selected.

The big one here is the CSS selector for the summary, .flex:has(input) + div:contains("Title:"). It means: find an element with class "flex" which has an input box inside it (the yt link text box), then choose the following sibling if it's a div element containing the text "Title:", which means the summary is ready. In ParseHub the command also says to wait for such an element, if it isn't found, for up to 2 minutes.

There is a loop, actually: For each item in URLs, the array in the Start Value. You can fill it in by opening the CSV with your URLs in Excel, copying the column, and pasting it into Visual Studio Code. Then find & replace with regular expressions enabled, replacing \n with ",\n" (and add one " at the very start and one at the very end yourself), then copy and paste the result into the Start Value between the [] brackets.
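As an aside, the Excel + VS Code find-and-replace step could also be scripted. A minimal sketch, assuming the CSV has a URLS column header as described earlier in the thread:

```python
import csv
import io
import json

def start_value_from_csv(csv_text):
    """Turn the URLS column of a CSV into the JSON array string
    that goes into the ParseHub Start Value."""
    urls = [row["URLS"] for row in csv.DictReader(io.StringIO(csv_text))]
    # json.dumps quotes and comma-separates every URL for us,
    # brackets included, so no manual quoting is needed.
    return json.dumps(urls)

sample_csv = "URLS\nhttps://youtu.be/abc\nhttps://youtu.be/def\n"
print(start_value_from_csv(sample_csv))
# -> ["https://youtu.be/abc", "https://youtu.be/def"]
```

The output string can be pasted straight into the Start Value as the URLs array.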

Also, if URLs had been a list like the list1 our project adds entries to, made earlier in the scrape, you could run a loop For each item in list1 later in the same project run, and then refer to item.URL in the expression box in the left sidebar for Extracts, Ifs, etc. You can even make a new project that grabs the last known data from your old YouTube-URLs-only project using
https://www.parsehub.com/api/v2/projects/{PROJECT_TOKEN}/last_ready_run/data?api_key={API_KEY}
filling in the {} items with info gathered from Project Settings on your old YouTube URL project. Make a Go to template in the new project using this as the GET address between double quotes, then add Extract obj as the first command in the new template and choose JSON object as the thing to extract in the drop-down list where it usually says Element text. Then you can do things like For each item in obj.list1 and use item.URL in later commands. When you're done, Extract obj again but this time set it to 0; this cleans the helper object out of your results. Hope this helps!
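The same idea can be sketched outside ParseHub: build the last-ready-run data URL and walk the returned JSON the way the For each item in obj.list1 loop does. This is an assumption-laden sketch: PROJECT_TOKEN and API_KEY are placeholders you would copy from Project Settings, and the list1/URL field names are taken from the project described above.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://www.parsehub.com/api/v2"

def last_run_data_url(project_token, api_key):
    """Build the GET address used in the Go to template."""
    return (f"{API_BASE}/projects/{project_token}/last_ready_run/data?"
            + urlencode({"api_key": api_key}))

def urls_from_run_data(raw_json):
    """Pull item.URL out of every entry in obj.list1."""
    obj = json.loads(raw_json)
    return [item["URL"] for item in obj.get("list1", [])]

# Example payload shaped like the run data described above (assumed fields):
sample = '{"list1": [{"URL": "https://youtu.be/abc"}, {"URL": "https://youtu.be/def"}]}'
print(urls_from_run_data(sample))
# A real fetch would be something like:
#   urllib.request.urlopen(last_run_data_url(token, key)).read()
```

Keeping the URL building and the JSON walking in separate functions makes it easy to test the parsing without touching the network.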

Andrew11

Jun 11, 2023, 3:41:52 PM
to Web Scraping
PS: If you're on the free plan, don't use the parsehub.com/api with your key, because they say anyone can see your projects if you're not on a paid plan, and then it's as if they have your username and password, in a way.

Dutch Dragonser

Jun 12, 2023, 2:03:35 AM
to Web Scraping
Yes, everything is much clearer now.
While scraping, however, I found one problem: when the inserted YouTube link has no transcription, the site of course can't make a summary and returns an error, but this error messes up the whole scraping format, so now parts of summaries end up in one place and other parts elsewhere.

Here is a random video without captions to recreate the problem: https://youtu.be/z8WOstt3Za0

Andrew11

Jun 12, 2023, 6:33:54 PM
to Web Scraping
OK, try adding a Select just inside the summary, with Rooted selection unchecked, using
> div:has(p:contains("Something went wrong."))
and then add an Extract that says Delete element.