Forums.hardwarezone

Albina Hickel

Aug 3, 2024, 5:30:53 PM

to cyathebeagen

I am trying to scrape multiple links/sources from an online social forum, but the posts come from different dates. For instance, one forum topic might have been opened in Dec 2020 and another in July 2021, and it's crucial for me to organize the posts chronologically.

The scraping code below runs without errors, but for some reason the "date" variable only contains dates from 2021-09-21 onwards. That is incorrect: the topic under url_2 below dates from November 2020, so I would expect the dataset to start with the post written in Nov 2020 rather than Sept 2021.

If this is the actual code, the for-loop generates 300 requests, 100 for each forum thread (url_1, url_2 & url_3), requesting pages 1 ... 100 of each thread. Parsing is only applied to the pages of url_3, because soup, the object storing the page content, gets overwritten (twice) in the early stages of each cycle. This means that all your collected posts come from a single thread.
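To make the overwrite issue concrete, here is a minimal Python sketch. It is not your code: fetch() and the url_1/url_2/url_3 strings are hypothetical stand-ins that return labels instead of real pages, so it only illustrates the control flow described above.

```python
# Hypothetical stand-ins: these labels replace the real thread URLs.
THREADS = ["url_1", "url_2", "url_3"]


def fetch(url: str) -> str:
    """Pretend to download a page; returns a label instead of real HTML."""
    return f"content of {url}"


def scrape_overwriting(pages: int) -> list[str]:
    """Mimics the described bug: three requests per cycle, one shared variable."""
    parsed = []
    for page in range(1, pages + 1):
        soup = fetch(f"{THREADS[0]}/page-{page}")
        soup = fetch(f"{THREADS[1]}/page-{page}")  # overwrites the url_1 page
        soup = fetch(f"{THREADS[2]}/page-{page}")  # overwrites the url_2 page
        parsed.append(soup)  # only the url_3 page is ever kept
    return parsed


def scrape_each_thread(pages: int) -> list[str]:
    """One possible fix: loop over threads, then pages, and keep every response."""
    parsed = []
    for thread in THREADS:
        for page in range(1, pages + 1):
            parsed.append(fetch(f"{thread}/page-{page}"))
    return parsed
```

With pages=100, the first function still issues 300 requests but keeps only 100 results, all from url_3, while the second keeps all 300.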

As you generate URLs for 100 pages, have you checked what happens when you go past the last page?
Take url_3 as an example: it currently lists 10 pages, with 20 posts per page and the remainder on the 10th. When your request goes beyond the last page, e.g. you request the 100th page in the final cycle of that for-loop ( forums.hardwarezone.com.sg/threads/.../page-100 ), what you get back instead is the actual last page, .../page-10. So your resulting dataset is not only limited to a single thread, but the posts on the 10th page are also repeated 91 times (the responses for loop cycles 10 .. 100 are identical).
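The repeat count can be checked with a few lines of Python. This is only a sketch of the arithmetic: served_page() is a hypothetical model of the behaviour described above (requests past the end return the last page), and the page count of 10 is the value url_3 listed at the time.

```python
from collections import Counter

LAST_PAGE = 10  # number of pages url_3 currently lists


def served_page(requested: int, last_page: int = LAST_PAGE) -> int:
    """Model of the forum's behaviour: past the end, the last page is returned."""
    return min(requested, last_page)


# Which page actually comes back for each of the 100 loop cycles.
counts = Counter(served_page(p) for p in range(1, 101))

print(counts[10])  # page 10 is returned for requests 10 .. 100, i.e. 91 times
print(counts[9])   # every earlier page is returned exactly once
```

Dropping duplicated posts afterwards would hide the symptom, but the 90 extra requests per thread would still be made.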

The answer I left for your previous question on the same topic extracts the pagination details from the page content to avoid such issues. It was also tested and proved fully reproducible, though only with up-to-date packages. So if you had issues with map(), you might want to consider updating your packages.
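For illustration only, the idea of reading the page count from the page itself could look like the Python sketch below. It is not the code from that earlier answer. It assumes the pagination links in the HTML end in /page-N, which matches the URL pattern above but is otherwise unverified, and sample_html is made-up markup rather than a real response.

```python
import re


def last_page_number(html: str) -> int:
    """Return the highest N found in href="…/page-N" links, or 1 if there are none."""
    numbers = [int(n) for n in re.findall(r'href="[^"]*/page-(\d+)"', html)]
    return max(numbers, default=1)


# Made-up pagination markup, standing in for the first page of a thread.
sample_html = """
<a href="/threads/example.123/page-2">2</a>
<a href="/threads/example.123/page-3">3</a>
<a href="/threads/example.123/page-10">10</a>
"""

# Request only the pages that exist instead of a fixed 1 .. 100.
pages = range(1, last_page_number(sample_html) + 1)
```

In a real scraper you would fetch page 1 of each thread first, compute its own page range this way, and only then request the remaining pages. A proper HTML parser would be more robust than a regular expression.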