Streaming HTML parser?


Itamar Gilad

Nov 4, 2016, 10:37:47 AM
to openresty-en
Hi everyone,
I saw the Cloudflare blog post about on-the-fly HTML rewrites (https://blog.cloudflare.com/how-we-brought-https-everywhere-to-the-cloud-part-1/), which I can safely assume was implemented using openresty :-).
The entire post is very interesting, but I found the mention of a streaming HTML parser especially exciting. While we all know it could be done, I'm not aware of any current open source libraries that support streaming processing (and would be suitable for this use case).

Does anyone know of any projects I should look into?
Does anyone know if Cloudflare has any plans to open source their parser?

Best regards,
Itamar

Yichun Zhang (agentzh)

Nov 6, 2016, 12:47:06 PM
to openresty-en
Hello!

On Fri, Nov 4, 2016 at 7:37 AM, Itamar Gilad wrote:
> I saw the Cloudflare blog post about on-the-fly HTML rewrites (https://blog.cloudflare.com/how-we-brought-https-everywhere-to-the-cloud-part-1/), which I can safely assume was implemented using openresty :-).
> The entire post is very interesting, but I found the mention of a streaming HTML parser especially exciting. While we all know it could be done, I'm not aware of any current open source libraries that support streaming processing (and would be suitable for this use case).
>
> Does anyone know of any projects I should look into?

It's possible to build simple streaming HTML/CSS/etc parsers atop
streaming regex engines like my sregex library:

https://github.com/openresty/sregex

OpenResty (note, not Cloudflare!) will probably provide such parsers
in the near future.
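
To illustrate what "streaming" means here, below is a minimal Python sketch, not the sregex API and not Cloudflare's parser. It rewrites a fixed literal (http:// to https://) in data that arrives in arbitrary chunks, even when the literal is split across a chunk boundary. The class name ChunkedRewriter and its feed/flush methods are hypothetical names chosen for this example. The sketch holds back a short tail and rescans it with the next chunk; a real streaming engine such as sregex is designed to carry matcher state across chunks instead, and handles general regexes rather than one literal.

```python
import re


class ChunkedRewriter:
    """Rewrite a fixed literal in a stream that arrives in arbitrary chunks.

    Hypothetical helper for illustration only; it is not part of sregex.
    """

    def __init__(self, needle: str, replacement: str) -> None:
        if not needle:
            raise ValueError("needle must be non-empty")
        self._pattern = re.compile(re.escape(needle))
        self._replacement = replacement
        # A partial match left at the end of a chunk can be at most
        # len(needle) - 1 characters long, so that is all we hold back.
        self._holdback = len(needle) - 1
        self._tail = ""

    def feed(self, chunk: str) -> str:
        """Consume one chunk and return the output that is safe to emit."""
        buf = self._tail + chunk
        out = []
        pos = 0
        for m in self._pattern.finditer(buf):
            out.append(buf[pos:m.start()])
            out.append(self._replacement)
            pos = m.end()
        # Keep back the last few characters after the final match: they
        # might be the start of a match completed by the next chunk.
        keep = min(len(buf) - pos, self._holdback)
        cut = len(buf) - keep
        out.append(buf[pos:cut])
        self._tail = buf[cut:]
        return "".join(out)

    def flush(self) -> str:
        """Return whatever is still held back once the stream has ended."""
        tail, self._tail = self._tail, ""
        return tail


# Usage: the literal "http://" is split across the two chunks.
rw = ChunkedRewriter("http://", "https://")
chunks = ['<a href="ht', 'tp://example.com/">x</a>']
result = "".join(rw.feed(c) for c in chunks) + rw.flush()
```

Memory use stays bounded by the chunk size plus the held-back tail, which is the property that makes this kind of processing suitable for rewriting responses on the fly.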

> Does anyone know if Cloudflare has any plans to open source their parser?
>

Not that I'm aware of :)

Regards,
-agentzh

lypanov

Dec 16, 2016, 5:30:20 AM
to openresty-en
On Sunday, November 6, 2016 at 6:47:06 PM UTC+1, agentzh wrote:
> Hello!
>
> On Fri, Nov 4, 2016 at 7:37 AM, Itamar Gilad wrote:
> > I saw the Cloudflare blog post about on-the-fly HTML rewrites (https://blog.cloudflare.com/how-we-brought-https-everywhere-to-the-cloud-part-1/), which I can safely assume was implemented using openresty :-).
> > The entire post is very interesting, but I found the mention of a streaming HTML parser especially exciting. While we all know it could be done, I'm not aware of any current open source libraries that support streaming processing (and would be suitable for this use case).
> >
> > Does anyone know of any projects I should look into?
>
> It's possible to build simple streaming HTML/CSS/etc parsers atop
> streaming regex engines like my sregex library:

Good day!

Does sregex have Lua bindings? I looked hard but couldn't find any. I'd really prefer not to write C.

Thank you!
Alex

Ingvar Stepanyan

Jan 24, 2017, 12:27:32 PM
to openresty-en
Hi, I just saw this post. I would have seen it much sooner if you had left a comment on the original article, but anyway :)

> which I can safely assume was implemented using openresty :-)

It wasn't, actually. As the last paragraph of the post mentions, it's done with a self-written parsing / transformation pipeline in C and Ragel.

> Does anyone know of any projects I should look into?
> Does anyone know if Cloudflare has any plans to open source their parser?

Some ideas for handling streaming edge cases were taken from the parse5 project. We do plan to open-source the parser at some point, but some things still need cleanup, and the API has to be finalized for public use.

Hope this answers your question.