interpreted language for scraping html


Pork

May 14, 2012, 12:58:43 PM
to Australian Cocoaheads
My app scrapes html and provides a native interface to a website.

Now I don't control the site, so whenever it changes, my app breaks
and I have to go through an app review process.

I'm now thinking of using an interpreted language for the scraping
logic. Let's not get into the T&Cs of this; although they are
important, I understand what is at stake. So for my sake, let's just
talk about technicals.

I'm just struggling on choice of interpreted language and interpreter.

For myself, it really looks like javascript is the way to go, but that
requires a loaded webview which makes the engineering a little
funny...

I also love lua but am worried about how overkill that is...

it seems like a few cross platform APIs use Lua or javascript but... i
really only want it for the business logic.

Jesse Collis

May 14, 2012, 5:22:46 PM
to cocoah...@googlegroups.com
Why not just have your own web service/API that your app accesses rather than scrape from the app itself? place the scrape/business logic/code at the API level and when the site changes you can update your API and the app continues.

This is my approach for similar problems.

-JC

Sent from my iPhone
> --
> You received this message because you are subscribed to the Google Groups "Australian Cocoaheads" group.
> To post to this group, send email to cocoah...@googlegroups.com.
> To unsubscribe from this group, send email to cocoaheadsau...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/cocoaheadsau?hl=en.
>

Aidan Steele

May 14, 2012, 5:57:33 PM
to cocoah...@googlegroups.com
Jesse's got the right idea. Doing it server-side has even more benefits than the continuity that he outlined: you can batch/"compress" multiple scrape calls into a single API request and you can return the data in a format more amenable to your app's internal data structures. 

Chris Miles

May 14, 2012, 6:19:20 PM
to cocoah...@googlegroups.com, Chris Miles
+1 for scraping the html from an intermediary web service that you control and provides a consistent API to your app. This allows you to quickly make changes to your scraping code if the source html changes. Also gives you the option to cache data, if you need to (usually recommended).

I use this approach for such an app, hosting my web service on Google App Engine. It scales insanely well, is free up to a certain usage threshold (I've never paid a cent for my modest project, after like 3 years) and gives you the option of coding in either Python (my personal recommendation) or Java.

Another subtle benefit of using App Engine is that http requests to the 3rd party site(s) being scraped are all made from Google infrastructure, so they are less likely to cause any suspicion (maybe important to some, probably less so for most).

Cheers,
Chris


Stewart Gleadow

May 14, 2012, 6:38:06 PM
to cocoah...@googlegroups.com
Trust Aidan and Jesse, the scraping experts... I'm the same, I usually just whack a little Ruby app written in Sinatra on Heroku. You can have it written and deployed in minutes.

- Stew

Donny Kurniawan

May 14, 2012, 8:13:42 PM
to cocoah...@googlegroups.com
+1 for all replies.

I open sourced some frontend/backend code (App Engine).
https://github.com/donny/ Look for MelbJourney, Meltrams,
melb-journey-back-end, meltrams-back-end

The apps scrape the journey planner and the real-time tram information.
Note, the apps are written for iOS 2.0 and I think they are leaking
memory. My memory management wasn't that good at that time.

Although, I think I got an email from YarraTrams, suspecting a DDoS. So yeah...

Also, don't try to regex the string. It's fragile. Use XPath.
http://cocoawithlove.com/2008/09/cocoa-application-driven-by-http-data.html

With XPath (disclaimer: I haven't tried it), it's possible for the
devices to download "recipes" (i.e. XPath string) and perform the
scraping on the device (rather than on the server). If the HTML
changes, just update the recipes on the server and make the devices
redownload the recipes.


Cheers,
Donny
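
A minimal sketch of the "recipe" idea above, written in JavaScript so it could run inside a web view. The recipe format, the field names and the XPath strings are all hypothetical; the only real API used is the standard DOM call `document.evaluate`. Treat it as an illustration under those assumptions, not a tested scraper.

```javascript
// Hypothetical recipe: field name -> XPath expression. In the scheme
// described above, this data would be downloaded from your own server.
var recipe = {
  title: "string(//h1)",
  nextTram: "string(//td[@class='eta'][1])"
};

// XPathResult.STRING_TYPE is 2; the literal is used so the function does
// not depend on the XPathResult global being defined.
var XPATH_STRING_TYPE = 2;

// Apply every XPath in the recipe to `doc` (a DOM Document) and collect
// the trimmed string results into a plain object.
function applyRecipe(doc, recipe) {
  var out = {};
  Object.keys(recipe).forEach(function (field) {
    var result = doc.evaluate(recipe[field], doc, null, XPATH_STRING_TYPE, null);
    out[field] = result.stringValue.trim();
  });
  return out;
}

// In a web view: JSON.stringify(applyRecipe(document, recipe))
```

When the site's markup changes, only the recipe data has to change; `applyRecipe` itself stays the same.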

Cameron Barrie

May 14, 2012, 9:25:56 PM
to cocoah...@googlegroups.com
That reminds me of this wonderful SO post about parsing HTML with a Regex.

The long and the short of it is: unless your name is Chuck Norris, don't parse HTML with a Regex.

-C

P.S. I agree with everyone else's comments, regarding where in the stack this logic should reside.

Pork

May 15, 2012, 12:51:33 PM
to Australian Cocoaheads
It's not at all feasible to put an intermediary between the app and
the server that is being scraped. The data volume is really HUGE.

So far, I'm pumping javascript via a UIWebView [self.webView
stringByEvaluatingJavaScriptFromString:script];

But so far it's been painful. Mostly because I'm not used to
javascript and there is no way to debug or log while it's running. In
fact a huge hurdle was even passing that data into that call in the
first place.

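+ (NSString *)encodeToPercentEscapeString:(NSString *)string {
    // Under ARC, __bridge_transfer hands ownership of the newly created
    // CFString to ARC; the input only needs a plain __bridge cast.
    return (__bridge_transfer NSString *)
        CFURLCreateStringByAddingPercentEscapes(NULL,
            (__bridge CFStringRef)string,
            NULL,
            (CFStringRef)@"!*'();:@&=+$,/?%#[]",
            kCFStringEncodingUTF8);
}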

With that, I was able to get the full html page into a string literal in
javascript, while this unpacks it:
function process(json_escaped) {
var json = decodeURIComponent(json_escaped);
return json;
}


Then I'm returning JSON to Objective-C.


But I'm still working through it. The error handling is non-existent in
this framework and it's been in the back of my head to switch to Lua.

The biggest reason not to though is that javascript outputs JSON like
a BOSS and decodes HTML like a BOSS.
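
One way to get some error handling out of this setup is to catch exceptions on the JavaScript side and report them inside the JSON that goes back to Objective-C. The sketch below assumes that approach; `processPage` plays the role of `process` above, and `scrape` is a hypothetical placeholder for the real scraping logic.

```javascript
// Hypothetical placeholder for the real scraping logic: takes the page
// HTML as a string and returns a plain object.
function scrape(html) {
  if (html.indexOf("<html") === -1) {
    throw new Error("input does not look like an HTML page");
  }
  return { length: html.length };
}

// Entry point called with the percent-escaped page. It always returns a
// JSON string, so failures arrive as data instead of silently vanishing.
function processPage(escaped) {
  try {
    var html = decodeURIComponent(escaped); // throws URIError on bad escapes
    return JSON.stringify({ ok: true, data: scrape(html) });
  } catch (e) {
    return JSON.stringify({ ok: false, error: String((e && e.message) || e) });
  }
}
```

The native side can then check the `ok` field and log `error` when something goes wrong.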


On May 15, 8:25 am, Cameron Barrie <camwritesc...@gmail.com> wrote:
> That reminds me of this wonderful SO post about parsing HTML with a Regex. http://stackoverflow.com/questions/1732348/regex-match-open-tags-exce...

Cameron Barrie

May 15, 2012, 7:58:30 PM
to cocoah...@googlegroups.com
So besides the lack of feasibility surrounding a server not being able to scrape data, which I cannot comprehend, let's get back to your original problem.

> Now I don't control the site, so whenever it changes, my app breaks
> and I have to go through an app review process.

So there are 3 issues here:
1. You need to update your app all the time to fix the breakages.
2. Your users need to wait a week to get an update so it all works again.
3. You've got no guarantee that users will update your app, so you run the risk of users giving up using your app because it's broken.

You can't really avoid point number 1, but which part of the stack that update resides in, you can help. I'm not certain how any client side approach will fix these issues. How is using javascript on the device going to help you with these issues?

If you're thinking of loading that scraping code from a server and running it on device, you might get bounced for that:
"Apps that download code in any way or form will be rejected"

I may be wildly misguided about what you're trying to do, but I think you're heading down a rabbit hole trying to do all this client side.

Ruby also parses HTML like a BOSS and produces JSON like a BOSS, as does pretty much every programming language. Take a look at the Nokogiri and JSON gems for Ruby, and the Sinatra or Rails frameworks. Deploy a site to Heroku (heroku.com), scale out as many dynos as you need to deal with the load from the incoming requests. In my experience moving that logic off the client is the only way to really solve your issues.

Oliver Jones

May 15, 2012, 8:00:17 PM
to cocoah...@googlegroups.com
If you run iOS5 you can debug Javascript in UIWebViews by turning on the remote inspector. It is a little flaky at times but is a real boon to have.

http://atnan.com/blog/2011/11/17/enabling-remote-debugging-via-private-apis-in-mobile-safari/

An excellent tip by everybody's favourite Apple Engineer.

Regards


Jesse Collis

May 15, 2012, 10:36:05 PM
to cocoah...@googlegroups.com
Your data is too LARGE for a web service? How is an iPhone going to handle it then?

Sent from my iPhone

Nathan de Vries

May 16, 2012, 3:29:27 AM
to cocoah...@googlegroups.com
On Tuesday, 15 May 2012 at 9:51 AM, Pork wrote:
> So far, I'm pumping javascript via a UIWebView [self.webView
> stringByEvaluatingJavaScriptFromString:script];
>
> But so far it's been painful. Mostly because I'm not used to
> javascript and there is no way to debug or log while it's running. In
> fact a huge hurdle was even passing that data into that call in the
> first place.

The trick is to minimize the raw JavaScript you're attempting to pass over the bridge. A UIWebView can load any HTML, CSS, JavaScript etc. file from within the application bundle on the device, or anywhere in the Simulator. Loading your JS from the filesystem will make things much easier.

To create a minimal bridge for loading JS, you can do this:

- (void)webViewDidFinishLoad:(UIWebView *)webView {
  // Define a tiny JS object that appends a <script> tag for a given URL.
  NSString *bridge = @""
    "var NDVWebViewBridge = {"
      "loadScriptURL: function(scriptURL) {"
        "var script = document.createElement('script');"
        "script.type = 'text/javascript';"
        "script.src = scriptURL;"
        "document.getElementsByTagName('head')[0].appendChild(script);"
      "}"
    "}";

  // Install the bridge, then use it to load the real script from disk.
  [webView stringByEvaluatingJavaScriptFromString:bridge];
  [webView stringByEvaluatingJavaScriptFromString:@"NDVWebViewBridge.loadScriptURL('file:///Users/pork/path/to/my.js')"];
}


You can go even further to reduce the chance of mistakes by saving the bridge script into a NDVWebViewBridge.js file in your application bundle, and use stringWithContentsOfFile:encoding:error: to read it into a string for evaluation.


Cheers,

Nathan
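
For illustration, this is roughly what a standalone NDVWebViewBridge.js file could look like. It is a sketch, not Nathan's actual file: the optional `doc` parameter and the return value are additions made here so the function can be exercised outside a web view.

```javascript
// Sketch of NDVWebViewBridge.js. `doc` is an optional parameter added for
// illustration; it defaults to the page's global `document`.
var NDVWebViewBridge = {
  // Append a <script> element pointing at scriptURL to the page's <head>,
  // which makes the web view fetch and evaluate that script.
  loadScriptURL: function (scriptURL, doc) {
    doc = doc || document;
    var script = doc.createElement('script');
    script.type = 'text/javascript';
    script.src = scriptURL;
    doc.getElementsByTagName('head')[0].appendChild(script);
    return script;
  }
};
```

Keeping the bridge in a real .js file means it can be linted and edited like any other script instead of living inside Objective-C string literals.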

Pork

May 16, 2012, 6:38:05 AM
to Australian Cocoaheads
Wow people here are quite derogatory! Instead of discussing technicals
about interpreted languages, most here simply jumped to the conclusion
that I hadn't thought about this process.

Defending my choice
---
OK so the reason the data is "HUGE" is because of the sheer number of
users of the app. I couldn't afford to keep the app free if I had to
use an intermediate server. Also, just like Chris mentioned in his
case, the potential for the site shutting down your one big mammoth
connection is very high regardless of how friendly your relationship
is with the server. It also puts an extra breaking point to the
availability of the app. All show stoppers.

Server vs client parsing. If the server format changes, both solutions
break anyway. Responsiveness is much faster doing it client side than
doing it on a central server. So on a technical level, I don't see any
benefit to using a central intermediary when in reality that central
intermediary is just processing (not bandwidth or storage) the exact
same code (scraping html into json).

The only problem which has been alluded to, which is not a technical
problem, is the grey area about "downloaded code". I'm not going to
discuss this apart from acknowledging that it's grey. eg. safari is
doing almost nothing but "downloading code". Now I don't need a flame
war here, just want to talk to people who are willing to discuss
interpreted languages and their use in iOS. And even a week of
downtime when the server does change format while apple approve is
still better than making the app economically unfeasible and less
responsive.

Back on topic!
--
Thank you Oliver for that link! Debugging javascript was so painful
and if I had to do it again. The remote debugger looks "creative",
I'll give it a shot.

Thanks Nathan for that code. Being able to load a file directly is
much more efficient!

One thing that I had trouble with, although mostly a JS question than
Obj-C, how do I get the file loaded into the DOM object? I tried all
sorts of hoopla but everything failed. The closest I've seen involved
creating an iframe and then loading the JS into that. But because I'm
not familiar with traversing the DOM tree (and no debugging to help me
figure out my basic syntax mistakes) I abandoned doing it this time
around...



Oliver Jones

May 16, 2012, 7:39:52 AM
to cocoah...@googlegroups.com
Hi Pork,

I don't think people were intentionally trying to be derogatory, rather they just didn't have enough context of your app's specifics and were making incorrect assumptions.

If all you're using Javascript for is parsing (X)HTML then I would suggest you look at a C or C++ HTML parsing library.

A quick google turned up these:

https://github.com/rofldev/pugihtml
http://htmlcxx.sourceforge.net/
http://www.netsurf-browser.org/projects/hubbub/
http://tidy.sourceforge.net/

I've only done one app that ever needed to screen scrape HTML and that was written in C# so I can't vouch for any of these libs but on initial inspection they all look like legitimate options for C/C++/ObjC apps.

Of course they will only help you if the pages you're parsing don't need JavaScript to build their page content.

If you need to download recipes/scripts to adjust the parsing logic as the web content you're scraping changes without visiting the app store review queue then I would suggest using Lua to define the recipes. Interfacing C/C++ with Lua is pretty easy. And loads of apps ship from the app store embedding Lua.

In my opinion you will get a lot more control and determinism by going that path than by using UIWebView.

Regards

Nathan de Vries

May 16, 2012, 12:28:41 PM
to cocoah...@googlegroups.com
On Wednesday, 16 May 2012 at 3:38 AM, Pork wrote:
> One thing that I had trouble with, although mostly a JS question than
> Obj-C, how do I get the file loaded into the DOM object? I tried all
> sorts of hoopla but everything failed. The closest I've seen involved
> creating an iframe and then loading the JS into that. But because I'm
> not familiar with traversing the DOM tree (and no debugging to help me
> figure out my basic syntax mistakes) I abandoned doing it this time
> around...

Do you mean how do you get your JS file into the UIWebView? If you use the sample code I provided, your JS file (which can be located at a file://, http:// or even data: URL) is added to the DOM and evaluated automatically. Any JS within that file will have access to the DOM of the content loaded in your UIWebView.

My suggestion would be to avoid worrying about UIWebView for now, and simply do everything in a desktop browser using the WebKit Web Inspector. You can use your browser's developer features to set your user agent to Mobile Safari if the sites you're working with rely on user agent detection.

Once you have your scraping working in a desktop browser, then you can worry about how to re-appropriate that work for UIWebView.


Cheers,

Nathan de Vries
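
A small sketch of the workflow described above: write the scraping logic as a function that takes a DOM document, try it in a desktop browser console first, then load the same file into the web view. The `table.timetable` selector and the `route`/`eta` field names are hypothetical and stand in for whatever the real page contains.

```javascript
// Scrape rows from a (hypothetical) timetable table into plain objects.
// `doc` is any DOM Document: the global `document` in a desktop browser
// console, or the page loaded in the web view.
function scrapeRows(doc) {
  var rows = doc.querySelectorAll('table.timetable tr');
  var out = [];
  for (var i = 0; i < rows.length; i++) {
    var cells = rows[i].querySelectorAll('td');
    if (cells.length < 2) {
      continue; // skip header rows and rows without enough cells
    }
    out.push({
      route: cells[0].textContent.trim(),
      eta: cells[1].textContent.trim()
    });
  }
  return out;
}

// In the Web Inspector console: JSON.stringify(scrapeRows(document))
```

Because the function only touches the document passed in, the same code runs unchanged on the desktop and on the device.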

davesag

May 16, 2012, 7:51:58 PM
to Australian Cocoaheads
If the volume of data is too huge for a server-side solution then I am
at a loss as to how an iOS device will handle it. The way I'd attack
this problem (and the way I have in the past) is to write a simple
Ruby app (Use Nokogiri to do this, it's awesome, fast and really
simple — http://www.rubyinside.com/nokogiri-ruby-html-parser-and-xml-parser-1288.html)
that does the core parsing and expose the data you need in a nice
consistent format via a REST api. I use Sinatra as my web-app
framework of choice due to its incredible simplicity — see
http://www.sinatrarb.com/ and host the web-app on http://heroku.com
for free (or pay a little bit of money when you need heavier load to
spin up more power.)

Your iOS app then only needs to know how to talk to your REST API
which, ideally, will not have to change — certainly not as often as
your target website might change. And when the website does change,
then it's a few minutes work to modify the web-app and push it back up
to Heroku.

The other advantage is every instance of your app can query the
API and only get the minimal data it needs when it needs it, rather
than having to scrape the website each time. If you have 1000 users
of your app then whoever owns the website you are scraping will see
1000s of requests coming from your app and may decide to lock you out
via a simple edit to their web server's htaccess file. If you have 10's
of 1000s of users this is even more likely. You avoid this by moving
that logic to a simple (and by simple I mean it could be done in 20
lines of code) web-app that functions as a cache. Another advantage of
course is once your web-app scrapes the html it can manipulate that
data into a nicer data format allowing partial queries and all manner
of good things.

You can't expect your users to download a new app to their devices
every time some website not under your control changes.

Dave
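
The caching behaviour described above can be sketched in a few lines. This is shown in JavaScript for consistency with the rest of the thread rather than in Ruby, and the names `makeCachedScraper`, `scrapeFn`, `ttlMs` and `now` are all made up for the example. A real service would most likely scrape asynchronously; this version is synchronous to keep the idea visible.

```javascript
// Wrap a scrape function so repeated requests for the same key within
// `ttlMs` milliseconds reuse the stored result instead of hitting the
// target site again. `now` is injectable for testing (default: Date.now).
function makeCachedScraper(scrapeFn, ttlMs, now) {
  now = now || Date.now;
  var cache = Object.create(null); // key -> { value, expires }
  return function (key) {
    var entry = cache[key];
    var t = now();
    if (entry && entry.expires > t) {
      return entry.value; // still fresh: no request to the target site
    }
    var value = scrapeFn(key);
    cache[key] = { value: value, expires: t + ttlMs };
    return value;
  };
}
```

With a wrapper like this, a thousand app instances asking for the same page inside the TTL window cause one request to the scraped site, not a thousand.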

Sean Woodhouse

May 16, 2012, 8:01:42 PM
to cocoah...@googlegroups.com
Pork, you haven't really clarified whether the web site and content you're trying to scrape requires a user account to access. If that is the case, then I agree that the best place to do the scraping is on the client because you don't want to store the user's credentials and scrape on their behalf on your intermediary server. Is that what you meant when you said the data volume is too huge?

Cheers

Sean

Simon Harris

May 16, 2012, 8:16:46 PM
to cocoah...@googlegroups.com
I haven't been following closely enough to know if someone has already suggested this, or even if it is a viable option. However, if you are doing your processing IN javascript, could you have the application download essentially static HTML+JS from a server each time (or as necessary)? Then, if the web site you are scraping changes, you can update the HTML/JS serverside without needing to deploy your app again.

My 2c worth :/

Pork

May 17, 2012, 3:35:57 PM
to Australian Cocoaheads
Just to make clear, I'm not looking for anybody's approval with
regards to my view that an intermediate server is not suitable.
Please don't take this as a slight, I'm just here to talk about
interpreted languages in Obj-C.

JS from a webview is a tad funky in terms of engineering but it is
what I settled upon. E.g. if your processing is done in a business
level object rather than the view controller, it means passing a
UIWebView down to it, not to mention processing has to be done on
the UI thread. Not a big deal in my case because the processing is
simple, but I can imagine it quickly becoming a problem.

So my JS scrapes the site and produces JSON which is more uniformly
consumed by Obj-C.

So what other languages are out there that play well within iOS? I've
seen "Wax" for lua which seems to go all the way, and might be a bit
of a pain to isolate it to simply "do business logic"

I'm kind of missing python.

Stewart Gleadow

May 17, 2012, 5:13:09 PM
to cocoah...@googlegroups.com
If you're just processing Javascript, and not actually presenting the web view, I'd be running headless JavaScriptCore instead of running it inside a UIWebView. I think it's currently private, but I wouldn't be surprised if it becomes public soon, since I know a few recent apps that use it that have been approved.

I don't know whether things like Rhodes/Rhomobile allow you to call out to the scripting environment, or whether they only allow you to run the entire app in that environment.

- Stew

Jesse Collis

May 17, 2012, 6:00:35 PM
to cocoah...@googlegroups.com
You beat me to it Stew, but I was going to say that JavaScriptCore would be my preference over throwing UIWebviews around in the background. 

There's a project going on that is compiling the Webkit's source for iOS here: http://www.phoboslab.org/log/2011/06/javascriptcore-project-files-for-ios



Sent from the new iPad

Chris Miles

May 17, 2012, 7:09:57 PM
to cocoah...@googlegroups.com, Chris Miles

On 18/05/2012, at 5:35 AM, Pork wrote:

> So what other languages are out there that play well within iOS? ...
>
> I'm kind of missing python.

Funny you say that. I've been recently thinking more about embedding a Python interpreter within iOS apps to use for scripting. It is certainly doable, and has been done before, like in the Python IDE for iOS app http://pythonforios.com/ . Although, AFAIK, he hasn't released his port of the interpreter as a library for others to use in their own apps.

Cheers,
Chris

Jesse Collis

May 17, 2012, 7:34:45 PM
to cocoah...@googlegroups.com
I can't wait for you to get it working Chris then mention how it "wasn't that hard" ;)

-JC

Pork

May 18, 2012, 3:08:03 AM
to Australian Cocoaheads
Yeah I noticed the pythonforios. It certainly was about the only thing
when searching google.

It surprises me overall just how few interpreted languages are
integrated into iOS. There really is no obvious answer or even good
answer?

The JavaScriptCore looks promising but I wouldn't describe it as
mature (although that isn't a main requirement, I just want it to
work). I like how they claim to be getting full speed turbo mode or
whatever.

Maybe the real answer is LuaJIT, which I've heard many use, and good
old fashioned C bridging code. But we must be being myopic here;
Obj-C isn't young, surely there are a lot of options, we're just not
finding them.



Chris Suter

May 18, 2012, 3:29:30 AM
to cocoah...@googlegroups.com
Hi Pork,

On 18/05/2012, at 5:08 PM, Pork <kvil...@gmail.com> wrote:

> Yeah I noticed the pythonforios. It certainly was about the only thing
> when searching google.
>
> It surprises me overall just how few interpreted languages are
> integrated into iOS. There really is no obvious answer or even good
> answer?

Well the obvious answer to me would be that there's little point when you can't download any code, and given that's the case you're usually best off writing in Objective-C.

-- Chris
