Redevelopment Plans

James Titcumb

Jul 26, 2013, 5:02:28 AM

to browsc...@googlegroups.com, RAD Moose, Joshua Estes

From what I understand, we need to focus on two projects initially (with scope for improvements coming later). Let's start small and work our way up...

INI Files Generation

Source/Resource format

We need to make a new resource format for the INI files. Currently the database is SQL Server 2000. I think we should change these to flat files to make contribution on GitHub easier. I haven't had a look at the existing database files yet, so I don't know exactly how we could do this.

I think this is probably the most important part of the redevelopment - making it easy to contribute, easy to understand, but also relatively easy for the tool to parse.

Tool

I see Joshua you've already started a Symfony Console-based application, which is ideal. I think the "resources" folder needs to be reworked (as per my comments above) - the "browsers" folder is crazy slow to load... I don't think it would be too difficult to make the tool - we just need it to do one thing, and do it well (i.e. generate the files). We can expand with more functionality later if need be.

New Website

The Website Itself

To start with, the website only needs the ability to download the files (and check the current version). This should be very simple. I think we can probably use a lightweight framework like Silex to make a very basic site to begin with? Silex allows easy expansion using Symfony / Zend Framework modules.

Where to host the files?

Currently the files are hosted by Moose, but this is costly. I'm thinking the generated files themselves could even reside in a repository on GitHub? Do you think that could work? We'd have the added bonus of people being able to download old versions then too...

Bandwidth

Gary has mentioned the issue he had (and I think Rad Moose noticed the same issue) of people repeatedly hitting the files to download them, so we need to think carefully about how we can rate-limit the download of the files. Using a library like browscap-php (the one GaretJax made) should help alleviate this problem though. Perhaps if the project perks up a bit, people will be more willing to contribute to libraries in other languages (Ruby, Python etc.)

James Titcumb

Jul 31, 2013, 12:34:12 PM

to browsc...@googlegroups.com, RAD Moose, Joshua Estes, ja...@asgrim.com

Ok guys - I have put together an idea for the "source" files, based loosely on the SQL database structure:

https://github.com/browscap/browscap/wiki/RFC-Source-File-Formats

Please can you take a look and let me know what you all think.

Thanks

James

Mikolaj Misiurewicz

Aug 3, 2013, 12:56:30 PM

to browsc...@googlegroups.com, RAD Moose, Joshua Estes, ja...@asgrim.com

From what I can tell, the current browscap format uses only * and ? for wildcards. If you're planning to rewrite it in PHP, a natural way to go would be to use the preg (PCRE) functions for user agent matches.

You can just convert * and ? to preg format and leave it at that, but there are some problems with browscap which can be addressed if you actually modify the source data.
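A minimal sketch of that straight conversion, written in Python rather than PHP purely for illustration. The function name `wildcard_to_regex` is hypothetical, and the sketch assumes that `*` means "any run of characters" and `?` means "exactly one character", with everything else matched literally:

```python
import re


def wildcard_to_regex(pattern):
    """Translate a browscap-style pattern ('*' and '?') into a compiled regex."""
    # Escape everything first so characters such as '(' or '.' match literally,
    # then turn the escaped wildcards back into their regex equivalents.
    escaped = re.escape(pattern)
    body = escaped.replace(r"\*", ".*").replace(r"\?", ".")
    return re.compile(body)


rx = wildcard_to_regex("Mozilla/5.0 (*Windows NT 6.?*)*Firefox/*")
ua = "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/21.0.1"
print(bool(rx.fullmatch(ua)))
```

Using `fullmatch` keeps the semantics of the original patterns, which describe the whole user agent string rather than a fragment of it.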

In my opinion you should convert the data to use preg matches in results for 'version' and similar fields. At the very least you could:

1) Merge many entries that differ only by version number into one entry. For example, you can create an entry for 3 versions of the browser at once: "My Browser/(11|12|13)" and then use the "%1$s" match to fill the version field.

2) Start to identify minor and patch/build version parts more precisely. For example "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/21.0.1" should be identified as Firefox 21.0.1, not as Firefox 21.0. You don't even have to use \d+ for that matching, as that would mean also accepting fake patch versions. You can still just do "(21\.0\.1|21\.0|20\.0)" to do the match and then split the version by dots if you want to put them into separate fields.
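The two points above can be sketched as follows, in Python for illustration only. The `FIREFOX` pattern and the `firefox_version` helper are hypothetical names, and the trailing lookahead is an added assumption that rejects versions not listed in the alternation instead of truncating them:

```python
import re

# One entry covering several known versions. The most specific version comes
# first, and the lookahead refuses anything that continues with more digits
# or dots, so unknown ("fake") patch versions are not accepted.
FIREFOX = re.compile(r"Firefox/(21\.0\.1|21\.0|20\.0)(?![\d.])")


def firefox_version(user_agent):
    """Return the matched version and its dot-separated parts, or None."""
    match = FIREFOX.search(user_agent)
    if match is None:
        return None
    version = match.group(1)
    return {"version": version, "parts": version.split(".")}


ua = "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/21.0.1"
print(firefox_version(ua))
```

The captured group plays the role of the "%1$s" substitution: one entry supplies the version field for every version it lists.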

This way you will be able to provide more specific user agent identification and also make the source format more compact. Right now the files used by get_browser() are terribly huge, and if you don't do anything with that they will keep growing fast, as with each new browser you will have to keep adding new full user agent matches.

Not to mention that I find this way of storing data more readable. Simple preg usage doesn't add much complexity, and you could see at a glance which browsers have the same features and differ only by version. But that's just how I see this issue. Perhaps other people prefer to have each single user agent separately?

James Titcumb

Aug 3, 2013, 3:21:54 PM

to Mikolaj Misiurewicz, browsc...@googlegroups.com, RAD Moose, Joshua Estes

Hi Mikolaj

Thanks for your very insightful feedback, much appreciated.

I just want to clarify that the proposal I made for the JSON source files is for what I call the "source files". Currently the data I am proposing to convert to JSON is in a set of SQL Server 2000 databases. Currently a script is run that converts these to the .ini files (known just as "the files") that people download and use with libraries.

We absolutely *must* maintain these generated INI files for backwards compatibility. There are many legacy Classic ASP applications that use the unfettered browscap.dll, which we must continue to support. Additionally, this is built into PHP's Zend engine, as you are aware, with get_browser. This, however, does not stop us considering a slimmer format in the future, but for the time being the focus is on making the generation of the ini files - in their current format - as simple and as open as possible. That includes allowing people to make pull requests on GitHub in an open format - hence my proposal to use JSON as the source format.

What you've suggested is a great idea though, and will allow us to simplify the source format. I think I can work something into this, although with limitations, so using PCRE may not really be viable. We need to include strict version constraints in order to generate the INI files, which as you have observed have a block for each version. However, I think we don't have to have a source file for each version; we can just say "this match applies from version x to y" and the match might be like

Mozilla * (My Browser #version#, #render#, #platform#)

And we could use #version# as the version constraint... So in the JSON we might add two properties, "version_from" and "version_to". If a certain browser changed their UA format significantly, we would have to make a new matcher.
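A rough sketch of how such a range could expand into one INI block per version, in Python for illustration. Only "version_from", "version_to" and the #version# placeholder come from the idea above; the "browser" and "match" keys, the whole-number major versions, the "N.0" version strings and the output property names are all assumptions made for this example:

```python
import json

# Hypothetical source entry. #render# and #platform# are left out to keep
# the sketch small.
SOURCE = json.loads("""
{
  "browser": "My Browser",
  "match": "Mozilla * (My Browser #version#)",
  "version_from": 11,
  "version_to": 13
}
""")


def expand_versions(entry):
    """Generate one INI block per major version in the inclusive range."""
    blocks = []
    for major in range(entry["version_from"], entry["version_to"] + 1):
        version = f"{major}.0"
        pattern = entry["match"].replace("#version#", version)
        blocks.append(
            f"[{pattern}]\nBrowser={entry['browser']}\nVersion={version}\n"
        )
    return blocks


print("\n".join(expand_versions(SOURCE)))
```

A browser that changed its UA format significantly would simply get a second entry with its own "match" and version range.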

You'll notice I added #render# and #platform# - we could use these to reduce any repetition on rendering engine, device and platform matching too - so each of those would add their own matches. This may or may not work, I'll have to investigate further.

The key to making this new source format is that it MUST be able to generate the INI files in their current format.

I'll work on updating the proposal with some of the suggestions you've made, great ideas :)

Exciting!

Thanks
James

James Titcumb

Aug 3, 2013, 4:33:17 PM

to browsc...@googlegroups.com, Mikolaj Misiurewicz, RAD Moose, Joshua Estes

Hi folks,

I have updated my RFC with my ideas above, any feedback would be appreciated:

https://github.com/browscap/browscap/wiki/RFC-SourceFileFormats

Thanks

James

James Titcumb

Aug 4, 2013, 7:31:01 AM

to browsc...@googlegroups.com, Mikolaj Misiurewicz, RAD Moose, Joshua Estes

I have written a very basic example of how the new tool will work: https://github.com/asgrim/browscap

James Titcumb

Aug 15, 2013, 11:03:44 AM

to browsc...@googlegroups.com, Mikolaj Misiurewicz, RAD Moose, Joshua Estes

Does anyone have any further feedback? Especially on my prototype tool. Has anyone managed to run it to generate the small "sample" browscap?

My repository with my prototype can be found here: https://github.com/asgrim/browscap

Mikolaj Misiurewicz

Aug 16, 2013, 5:14:47 PM

to browsc...@googlegroups.com

I've looked over the revised source format and I don't see any more obvious issues. It looks logical and people should be able to easily contribute with new user agents, which was the main point, as I understand.

As for the build tool - I haven't checked it. An obvious test here would be to convert the whole database to the new source format and then run the tool. If it creates files identical to the old ones (modulo order of some entries?) then we're golden. If it doesn't then it will need to be modified.

James Titcumb

Aug 16, 2013, 5:37:41 PM

to Mikolaj Misiurewicz, browsc...@googlegroups.com

That's my plan! The tool I've built is really a prototype, and I've built an "ini diff" command too which compares ini files regardless of the order and "format", so that will help to start with. I still need to put some work into it to get it working properly but the general idea is there so far. I also need to write a script to import the old database to the new JSON format, that should be fun ;)
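To illustrate the idea of an order-insensitive comparison (this is not the prototype's actual "ini diff" command, just a sketch in Python under stated assumptions): the helper names `load_ini` and `ini_diff` are hypothetical, and "format" differences are assumed to mean comments, blank lines and optional double quotes around values.

```python
import configparser


def load_ini(text):
    """Parse INI text into {section: {property: value}}, ignoring layout."""
    parser = configparser.ConfigParser(
        interpolation=None,        # '%' in values is literal
        delimiters=("=",),         # only '=' separates key and value
        comment_prefixes=(";",),   # ';' lines are comments
        strict=False,
    )
    parser.optionxform = str       # keep property names case-sensitive
    parser.read_string(text)
    return {
        section: {key: value.strip('"') for key, value in parser.items(section)}
        for section in parser.sections()
    }


def ini_diff(left_text, right_text):
    """Report sections that are missing on either side or whose values differ."""
    left, right = load_ini(left_text), load_ini(right_text)
    shared = set(left) & set(right)
    return {
        "only_left": sorted(set(left) - set(right)),
        "only_right": sorted(set(right) - set(left)),
        "changed": sorted(s for s in shared if left[s] != right[s]),
    }
```

Because both files are reduced to plain dictionaries before comparison, the order of sections and of properties within a section has no effect on the result.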

I'll keep working on it then :) thanks for your feedback Mikolaj.

James

Mikolaj Misiurewicz

Aug 16, 2013, 5:44:45 PM

to ja...@asgrim.com, browsc...@googlegroups.com

Although... does the order really NOT matter? I haven't really analysed the browscap file in detail, but I can imagine a situation in which browser A would have a

I'm a browser A (some random stuff)

string, and browser B would have

I'm a browser A (some random stuff) oh, wait, no, I'm actually browser B (more random stuff)

string.

In such a situation the regular expression matching order would matter.

James Titcumb

Aug 16, 2013, 6:05:57 PM

to Mikolaj Misiurewicz, browsc...@googlegroups.com

The thought did cross my mind - one step at a time though :) iirc, the original database does have a sort of rough ordering pattern, although I have not looked into it too much yet.

Mikolaj Misiurewicz

Aug 26, 2013, 7:03:08 PM

to browsc...@googlegroups.com, Mikolaj Misiurewicz, ja...@asgrim.com

Just an update on the regular expressions order:

Yes, the order does matter, but it doesn't have to be and isn't currently enforced by the source files. The order in the source files can be random and each of the parsers will have to reorder the expressions by itself anyway. That's what get_browser and phpbrowscap currently do.

So - no need to worry about order in the source database format.
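A small sketch of a parser doing its own reordering, in Python for illustration. Trying the longest pattern first is an assumed heuristic for "most specific first"; it is not a description of what get_browser or phpbrowscap actually do, and `to_regex` and `first_match` are hypothetical names:

```python
import re


def to_regex(pattern):
    """Convert a '*' / '?' wildcard pattern into a compiled regex."""
    escaped = re.escape(pattern)
    return re.compile(escaped.replace(r"\*", ".*").replace(r"\?", "."))


def first_match(user_agent, patterns):
    """Return the first matching pattern, trying the longest patterns first."""
    # The input order is ignored: the parser imposes its own ordering.
    for pattern in sorted(patterns, key=len, reverse=True):
        if to_regex(pattern).fullmatch(user_agent):
            return pattern
    return None


patterns = [
    "I'm a browser A*",
    "I'm a browser A*I'm actually browser B*",
]
ua = ("I'm a browser A (some random stuff) oh, wait, no, "
      "I'm actually browser B (more random stuff)")
print(first_match(ua, patterns))
```

Both patterns match the browser B string, so only the ordering applied at match time decides which one wins, whatever order the source files list them in.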

James Titcumb

Aug 28, 2013, 5:40:03 PM

to Mikolaj Misiurewicz, browsc...@googlegroups.com

Hi folks,

An update on my progress.

I've written a bit about it here: https://github.com/browscap/browscap/wiki/Proposal%3A-JSON-Source-File-Schema

As I mention in that wiki document, I'm having problems with the schema - it is too strict, so I/we need to come up with a new structure for the JSON schema.

My issue is that the current proposal assumes that the "children" UAs are just variations by platform alone - but this is not the case (for example the Google+ App UA - this has a GooglePlus/* match, followed by several seemingly unrelated Mobile Safari 5.1 UAs...). The format needs to allow for much more flexibility. It may be that we start with a JSON schema that is very close to the current INI format (i.e. just a JSON representation of the INI files) and go from there.

Unfortunately, I've not found the database itself to be a huge amount of use - the data contained in the INI files is more readable and structured, so I've mostly been working off that anyway.

The sooner this "open source" Browscap tool can be used, the better - I think it will encourage public contributions and limit issues of only one person (Gary) knowing how to generate the files.

If anyone has any feedback, it'd be really appreciated.

Thanks

James

Mikolaj Misiurewicz

Aug 28, 2013, 7:41:13 PM

to James Titcumb, browsc...@googlegroups.com

Download this https://gist.github.com/quentin389/6372455 and look at the HTML file.

Entries with 'YYY' are grouping entries where the section name (the part starting with ";;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;" in browscap.ini) is different than the entry name.
Entries with "XXX" are match entries where the section that they are in is not the same as the section of their parent.

If you look at all of them you will see that some are just slightly different names used in two places. For example "Maemo Browser" in the "Maemo" section.

Then there is the Kindle section which is unnecessarily separated into several sections.

But the rest are incorrect entries (!). For example:

"vlc/*" match in Nintendo section - Even if a user agent like that exists (I found "vlc/1.1.9", which may or may not be an actual UA), it's not specific to Nintendo. It should either be removed or moved to the "Media Players" section.

All those "Facebook App" and "Google+ App" matches: either they are the Safari browser or an app, never both. Even if the app actually ran Safari internally (that would be strange), not just emulated the UA, it would still have to be called an app, because when we say "Safari" we mean a user-operated browser. But if you look closely, you'll see that the entry:
Mozilla/5.0 (iPhone*CPU*OS 5_1* like Mac OS X*)*GooglePlus/*(*KHTML, like Gecko*)*Mobile/* makes sense as Google+ App
but
Mozilla/5.0 (iPod*CPU*OS 5_0* like Mac OS X*)*AppleWebKit/*(*KHTML, like Gecko*)* does not. At all.

I'm pretty sure that it's a mistake in classification and it's actually a genuine Safari on iPod. So it should be moved to the iPod section.

There aren't that many section names that differ from what they should be, so in my opinion, the best way to proceed is to clean all those entries up. That way you will be able to create section names in the files automatically, based on the grouping entry name.

Not to mention that the sections aren't actually parsed, they are just comments, so you could completely ignore them.

Cheers,

Mikolaj

James Titcumb

Aug 29, 2013, 3:11:33 AM

to Mikolaj Misiurewicz, browsc...@googlegroups.com

Mikolaj

Thanks for your feedback - yes this is what I see also. However my primary objective at the moment is to duplicate, like-for-like, the generation of the INI files, including any existing "oddities".

For this reason I'm going to completely rip up my JSON schema and the tools I've written so far and start more or less again. Looking at the data presented in a tree the way you have has given me an idea to start with, which we can iterate on to improve.

So the first step of my new plan is to create a tiered folder structure containing JSON files that uses the parent as the name of the file (don't do any device/platform/rendering engine parsing yet). Basically I want to be able to import from the full_asp_browscap.ini file to a JSON format, then export it and produce more or less exactly the same INI file as the "imported" INI. After that we can examine the new JSON schema and see what we can change/improve/optimise.
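The import/export round trip could look roughly like this, sketched in Python with an in-memory dictionary standing in for the folder of JSON files. The function names, the "match" / "properties" keys and the choice to file entries without a Parent under their own name are assumptions for illustration, not the actual schema:

```python
import configparser
import json
from collections import defaultdict


def parse_ini(text):
    """Read INI text into a list of (section name, properties) pairs."""
    parser = configparser.ConfigParser(
        interpolation=None, delimiters=("=",), comment_prefixes=(";",)
    )
    parser.optionxform = str  # keep property names case-sensitive
    parser.read_string(text)
    return [(name, dict(parser.items(name))) for name in parser.sections()]


def to_json_files(entries):
    """Group entries by their Parent: one JSON document per parent name."""
    grouped = defaultdict(list)
    for match, props in entries:
        grouped[props.get("Parent", match)].append(
            {"match": match, "properties": props}
        )
    return {parent: json.dumps(items, indent=2) for parent, items in grouped.items()}


def to_ini(json_files):
    """Regenerate INI text from the JSON documents."""
    lines = []
    for document in json_files.values():
        for item in json.loads(document):
            lines.append(f"[{item['match']}]")
            lines.extend(f"{k}={v}" for k, v in item["properties"].items())
            lines.append("")
    return "\n".join(lines)
```

The check for success is then simply that parsing the exported text yields the same sections and properties as parsing the imported file, leaving any restructuring of the JSON for later iterations.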

James

Mikolaj Misiurewicz

Aug 29, 2013, 3:52:58 AM

to browsc...@googlegroups.com, Mikolaj Misiurewicz, ja...@asgrim.com

That should be ok, as long as the section names are treated as just another property and you don't add another degree of complication to the source format just because several entries are in incorrect sections.
It would be a shame to create a bad or overly complicated format just because of those names that we already know we want to correct.

James Titcumb

Aug 29, 2013, 6:11:58 PM

to Mikolaj Misiurewicz, browsc...@googlegroups.com

Another update.

I now have a fully working prototype - I am able to import a `full_asp_browscap.ini` into a very basic JSON format, and then generate a browscap.ini file that matches. The code is really straightforward stuff at the moment - with no complexities (which means the JSON format isn't particularly simple).

You can see more info here: https://github.com/asgrim/browscap/blob/new-json-format/README.md#demonstrating-functionality

My next plan is to see how we can simplify the JSON structure, whilst maintaining exactly the same output as it currently generates. However, even though at this point I am 99% confident we can generate our own browscap.ini files now (the aim of the first step of the rewrite), I would like to do a lot more work on simplifying the JSON files.

Mikolaj - do you get on IRC at all? During UK working hours I am usually on #browscap on Freenode.