Extract Text from HTML and Crawling with CefSharp


dirk_...@gmx.de

Feb 24, 2015, 1:32:14 PM
to cefs...@googlegroups.com
I've reviewed the Off Screen Browser sample:
https://github.com/cefsharp/CefSharp/blob/master/CefSharp.OffScreen.Example/Program.cs

and derived a version that can extract text from a URL:
https://github.com/Dirkster99/KB/blob/master/00_HelloWorld/KnowledgeBase%20-%20Sample%208%20ExtractText/ExtractHTML/Program.cs


My problem with this approach is the level of indirection: an event is fired when
the page has loaded, and a task is then processed when the text is finally extracted.

Is this a sensible (that is, the most efficient) way to implement it, or is there a better way to extract text from HTML?

Is it also possible to extract a set of links so that they can be followed later on?

What would be the correct code pattern to browse more than one page with a given browser instance
(e.g., can we do a foreach over string[] testUrl in the sample code)?

In other words, what would be the best way to replace the 'Console.ReadKey();' call in the following block

using (browser = new ChromiumWebBrowser(testUrl[0]))
{

    ...

    Console.ReadKey();
}

so that the program waits for a complete text extraction cycle before browsing to the next URL?
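
To make the question concrete, here is a minimal sketch of the kind of pattern I have in mind: wrap the "page loaded" event in a TaskCompletionSource so that each URL can be awaited inside a foreach. Note that FakeBrowser, PageLoaded, Crawler, LoadAsync and ExtractAllAsync are hypothetical names invented for this illustration; FakeBrowser is only a stand-in and is not the CefSharp API, so the event and method names would have to be mapped onto whatever the real ChromiumWebBrowser exposes.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for an off-screen browser; NOT the CefSharp API.
public sealed class FakeBrowser
{
    // Raised once a page has finished loading; the argument is the URL.
    public event EventHandler<string> PageLoaded;

    // Starts loading in the background and raises PageLoaded when done.
    public void Load(string url)
    {
        Task.Run(async () =>
        {
            await Task.Delay(10); // simulate network latency
            PageLoaded?.Invoke(this, url);
        });
    }

    // Simulates asynchronous text extraction from the loaded page.
    public Task<string> GetTextAsync(string url)
    {
        return Task.FromResult("text of " + url);
    }
}

public static class Crawler
{
    // Turns the one-shot "page loaded" event into an awaitable Task.
    public static Task<string> LoadAsync(FakeBrowser browser, string url)
    {
        var tcs = new TaskCompletionSource<string>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        EventHandler<string> handler = null;
        handler = (sender, loadedUrl) =>
        {
            browser.PageLoaded -= handler; // unsubscribe so it fires once
            tcs.TrySetResult(loadedUrl);
        };
        browser.PageLoaded += handler;     // subscribe before loading
        browser.Load(url);
        return tcs.Task;
    }

    // Visits each URL in order; extraction finishes before the next load.
    public static async Task<List<string>> ExtractAllAsync(
        FakeBrowser browser, IEnumerable<string> urls)
    {
        var results = new List<string>();
        foreach (var url in urls)
        {
            var loaded = await LoadAsync(browser, url);
            results.Add(await browser.GetTextAsync(loaded));
        }
        return results;
    }
}
```

With something like this, the 'Console.ReadKey();' line could in principle become a blocking wait such as `Crawler.ExtractAllAsync(browser, testUrl).GetAwaiter().GetResult();` in a console Main, assuming the real browser raises its load event on a thread other than the one that is blocked.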

Thanks for all the good work. CefSharp really is a great project once these 101-type questions are solved :-)

Alex Maitland

Feb 25, 2015, 7:46:06 PM
to cefs...@googlegroups.com