HTML to WordPress Scraper?


Matt @ GEEK, with a personality

May 20, 2019, 12:35:19 PM
to Minneapolis St. Paul WordPress User Group
Howdy!

I'm looking for an efficient way to scrape a client's existing HTML website (250 pages/posts) and import it into WordPress.  Has anyone had good success with this?  If so, what tool did you use?   Appreciate any insights!

Thanks!
Matt

barbara schendel

May 20, 2019, 12:36:48 PM
to mpls-stpau...@googlegroups.com
I use a little app called SiteSucker for this. Does a pretty good job!


--
You received this message because you are subscribed to the Google Groups "Minneapolis St. Paul WordPress User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mpls-stpaul-word...@googlegroups.com.
To post to this group, send email to mpls-stpau...@googlegroups.com.
Visit this group at https://groups.google.com/group/mpls-stpaul-wordpress.
To view this discussion on the web visit https://groups.google.com/d/msgid/mpls-stpaul-wordpress/d8117b31-ac43-4651-8bf6-2283ed7dd8fe%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jodi Stammer

May 20, 2019, 1:33:53 PM
to mpls-stpau...@googlegroups.com

Toby C

May 21, 2019, 7:47:54 AM
to Minneapolis St. Paul WordPress User Group
Is SiteSucker able to convert/push to WordPress?  I've been using it to pull down a site as a last-resort backup, but I haven't tried restoring to WP.

Toby

Nick Ciske

May 21, 2019, 9:24:37 AM
to Mpls-Stpaul-Wordpress
There are a few plugins that claim to do this, but your results will vary based on the quality of the old code and how much massaging the code will need to work on the new site. 

All of the options shared are great for archiving a site, and could be paired with a PHP script that crawls the local archive and creates posts based on the extracted content, but that’s a custom job at that point.  
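As a rough illustration of the "crawl the local archive" half of that idea, here is a minimal sketch in Python rather than PHP (my choice for illustration only). The function name `find_pages` and the slug rules are assumptions, not part of any tool mentioned in this thread.

```python
from pathlib import Path
import re


def find_pages(archive_root):
    """Yield (slug, path) for every .html/.htm file under archive_root.

    Hypothetical helper: the slug rules below are an assumption about
    how you might want to name the imported posts.
    """
    root = Path(archive_root)
    for path in sorted(root.rglob("*")):
        if path.suffix.lower() not in {".html", ".htm"}:
            continue  # skip images, CSS, directories, etc.
        rel = path.relative_to(root).with_suffix("")
        # Treat about/index.html as the page "about".
        parts = [p for p in rel.parts if p.lower() != "index"]
        slug = "-".join(parts) or "home"
        # Lowercase and collapse anything that is not a-z or 0-9 into "-".
        slug = re.sub(r"[^a-z0-9]+", "-", slug.lower()).strip("-")
        yield slug, path
```

Each `(slug, path)` pair could then be handed to whatever code extracts the content and creates the post.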

I’ve done this just about every way, including hiring an offshore company to do a manual conversion. 

There is no silver bullet. The right approach depends on the client budget, quality of the old code, and value of the content. Sometimes keeping most of the old content around as static files in a subfolder or subdirectory is the right approach, sometimes it’s not. 

Often I’ve had to resort to a custom-coded approach, using an HTML scraping library to do the content extraction, then custom code to massage the content and create the posts themselves. 
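To make the extraction step concrete, here is a minimal sketch that uses only Python's built-in `html.parser` in place of a dedicated scraping library. It pulls the page title and the inner HTML of one element chosen by `id`. The names `ContentExtractor` and `extract` are hypothetical, and the sketch assumes the old markup has balanced tags; messy legacy HTML with unclosed `<p>` or `<li>` tags would need a more forgiving parser.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class ContentExtractor(HTMLParser):
    """Collect the <title> text and the inner HTML of one element by id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0          # > 0 while inside the target element
        self.chunks = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.chunks.append(self.get_starttag_text())
            if tag not in VOID:
                self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1
        elif tag == "title":
            self._in_title = True

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self.depth:
            self.chunks.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        if self.depth and tag not in VOID:
            self.depth -= 1
            if self.depth:  # do not emit the target's own closing tag
                self.chunks.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            # The parser unescapes entities, so re-escape on output.
            self.chunks.append(escape(data, quote=False))
        elif self._in_title:
            self.title += data


def extract(html_text, target_id):
    """Return (title, inner_html) for the element with the given id."""
    parser = ContentExtractor(target_id)
    parser.feed(html_text)
    parser.close()
    return parser.title.strip(), "".join(parser.chunks).strip()
```

Because the links and images inside the chosen element are kept as markup, they survive the trip into the new post body; fixing their URLs is a separate step.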

One newer approach is web scraping as a service, which has become more popular in recent years. Often those services can export a series of URLs as a spreadsheet, which could enable you to import via WP All Import or similar. 

YMMV

_______________________ 
Nick Ciske 
CTO / Web Engineer
@nciske


joe hobson

May 21, 2019, 2:24:41 PM
to Minneapolis St. Paul WordPress User Group
We built a plugin a few years ago to help a Department of Education division migrate from SharePoint to WordPress. Originally it was set up to mirror content on a nightly basis, with options to grab specific parts of pages (stripping out navigation), merge multiple pages together, and run regexes to fix some styling and classes. Not the greatest code, but it got the job done.
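For anyone curious what that kind of cleanup can look like, here is a small Python sketch. It is not the plugin's code; the function names are hypothetical, and the choice to simply delete `class` and `style` attributes is an assumption made to keep the example short. Regexes on HTML are fragile, so treat this as a starting point only.

```python
import re


def strip_navigation(html_text):
    # Drop <nav>...</nav> blocks (non-greedy, case-insensitive, spans newlines).
    return re.sub(r"<nav\b[^>]*>.*?</nav>", "", html_text, flags=re.I | re.S)


def drop_presentational_attrs(html_text):
    # Remove class="..." and style="..." attributes, quoted either way.
    # Caveat: this would also hit the same pattern appearing in body text.
    pattern = r"""\s+(?:class|style)\s*=\s*(?:"[^"]*"|'[^']*')"""
    return re.sub(pattern, "", html_text, flags=re.I)


def merge_pages(fragments):
    # Clean each fragment, skip empty ones, and join them into one body.
    cleaned = (
        drop_presentational_attrs(strip_navigation(f)).strip()
        for f in fragments
    )
    return "\n".join(c for c in cleaned if c)
```

Renaming old classes to new ones instead of deleting them would follow the same pattern with a different replacement string.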


Of course they were supposed to just use it temporarily until they could get off of SharePoint completely, but that has taken years. We're working on an import process now, so that you don't have to manually set up the mirror configs. Supposedly we're going to get them off of SharePoint for good this summer. 

Feel free to reach out directly if you want some help getting it running. Not much documentation, but it could be useful.

Matt @ GEEK, with a personality

May 21, 2019, 2:42:43 PM
to Minneapolis St. Paul WordPress User Group
Thanks for your input.  I suspected it was more involved, based on my research and Nick's reply.  I did find something that was super close (https://data-miner.io).  You can actually specify HTML tags, IDs, and classes for the specific content you want it to grab, and it can export to CSV (importable by WP All Import).  500 free grabs per month and they also offer paid options. BUT if there are links in the content it doesn't bring them in. :(  Of course, images would be a problem too. Close but no cigar.

Matt

Matt @ GEEK, with a personality

May 21, 2019, 2:56:03 PM
to Minneapolis St. Paul WordPress User Group
Thanks Joe!  I will take a look.

Matt @ GEEK, with a personality

Jul 22, 2019, 9:15:22 PM
to Minneapolis St. Paul WordPress User Group
Long overdue, but thought I would update you on this. https://data-miner.io actually worked. I wasn't doing it quite right. I was trying to have it scrape and save the text, but instead I needed to scrape the HTML. It saved it all in a nice little spreadsheet (CSV) and I was able to use WP All Import to import all of the content right into WordPress.  Worked great.  Hope this helps.
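If you end up producing the spreadsheet yourself rather than with a scraping tool, the same idea (one row per page, with the raw HTML in a column) can be sketched with Python's `csv` module. The column names `title`, `slug`, and `content` are assumptions; WP All Import lets you map whatever columns you have during import.

```python
import csv
import io


def rows_to_csv(rows, fieldnames=("title", "slug", "content")):
    """Serialize page rows to CSV text, keeping HTML markup intact.

    Quoting every field lets commas, quotes, and newlines inside the
    HTML survive the round trip.
    """
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

To write a file instead, open it with `open("pages.csv", "w", newline="", encoding="utf-8")` and write the returned text. Storing the HTML rather than the plain text is what keeps links and image tags in the imported content.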

Matt

Toby C

Jul 24, 2019, 7:02:03 AM
to Minneapolis St. Paul WordPress User Group
Thanks for the update, Matt!

Toby


elbigpa...@gmail.com

Feb 15, 2021, 2:25:01 PM
to Minneapolis St. Paul WordPress User Group
I'm looking to add a viewer, like a live viewer, to a stucco construction company website in Toronto. I have seen other sites implement this kind of live viewer for paint and color in WordPress.