Hi,
the current problem with browscap is that it doesn't detect recent/popular user agents.
It's great that people report them one by one, and that there's a page where you can submit a single entry.
But wouldn't it be better to accept entries in bulk? I'm sure there are people who, like me, collect user agents from high-traffic websites. If some of them agreed to upload their logs on a regular basis, it could really help the project.
That's only half the job, because even if you collect bulk data, you still have to prioritize the collected user agents and then analyse and add them to browscap one by one. Still, in my opinion, having bulk data as a source could be very beneficial to the project, and creating a system to parse and prioritize that data is a simple, one-day task.
What needs to be done:
- we need to agree on an input data format
- a program analysing this data has to be created
- this program has to prioritize the data and automatically surface unrecognised user agents, most important first
But hey, that's easy. Here's a proposal for such a system:
=== 1 === format
A single gzipped tar archive named <username>-ua-<date_from>-<date_to>.tar.gz
Inside, one or more text files named <username>-ua-<listname>-<date_from>-<date_to>.txt
Each text file has a simple tabular (TSV) structure.
Row separator - any kind of newline.
Column separator - \t
No quotes or escaping.
Column 1 - weight - a positive integer giving the weight of the user agent entry relative to the other entries in the file. The higher the number, the more important the entry. This is easy to generate: just do a COUNT() or SUM() on a relevant field when creating the file from the database.
Column 2 - user agent - can this contain newlines? If so, they should probably be escaped or converted to spaces. Although any user agent with a newline in it is a spoof anyway...
Rows have to be ordered by weight descending. User agents have to be unique.
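For illustration, a few made-up rows might look like this (a single tab between the two columns):

18734	Mozilla/5.0 (Windows NT 6.1; rv:9.0.1) Gecko/20100101 Firefox/9.0.1
2205	Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
17	ExampleBot/0.1 (+http://example.com/bot)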
Generating such a file should be a breeze for anyone who wants to contribute. They could even write a five-line script to get this data from their database and run it each week - see the sketch below.
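For instance, a minimal sketch in Python (the database file, table, and column names are made up - adjust the query to your own schema):

import sqlite3  # or any other DB-API driver

conn = sqlite3.connect('logs.db')  # hypothetical log database
rows = conn.execute(
    "SELECT COUNT(*) AS weight, user_agent FROM access_log"
    " WHERE request_date BETWEEN ? AND ?"
    " GROUP BY user_agent ORDER BY weight DESC",
    ('2012-01-01', '2012-01-07'))

with open('alice-ua-main-2012-01-01-2012-01-07.txt', 'w') as out:
    for weight, user_agent in rows:
        # strip newlines/tabs so they can't break the format
        out.write('%d\t%s\n' % (weight, user_agent.replace('\n', ' ').replace('\t', ' ')))

Tar and gzip the resulting .txt files into <username>-ua-<date_from>-<date_to>.tar.gz and it's ready to send.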
Ideally each contributor would send a new file with just new data, but to be honest, even if they send everything they have in their logs, new and old, that would be good too.
I've attached my logs, from two different tables, in said format.
=== 2 === upload
Those files need to be uploaded somewhere, but to be honest we could just let people send them to an e-mail address. I don't think there will suddenly be hundreds of programmers sharing their logs.
=== 3 === parsing
A simple program can be created first and new functionality added later. A base version, sketched here in Python (with the browscap lookup stubbed out), would look like this:
import sys
import tarfile
from collections import defaultdict

def is_recognized(user_agent):
    # stub - plug in a real browscap lookup here
    return False

totals = defaultdict(int)

for archive_path in sys.argv[1:]:                   # foreach archive...
    with tarfile.open(archive_path) as archive:
        for member in archive.getmembers():         # ...foreach file
            if not member.isfile():
                continue
            lines = archive.extractfile(member).read().decode('utf-8').splitlines()
            rows = [line.split('\t', 1) for line in lines if line]
            # rows are sorted by weight descending, so the first line holds
            # max_weight and the last line holds min_weight
            max_weight, min_weight = int(rows[0][0]), int(rows[-1][0])
            span = max(max_weight - min_weight, 1)  # guard against division by zero
            for weight, user_agent in rows:
                # linear rescale: min_weight -> 100, max_weight -> 10,000
                normalized = round(100 + (int(weight) - min_weight) * 9900 / span)
                totals[user_agent] += normalized

# drop everything browscap already recognizes, sort the rest by weight descending
unknown = [(ua, w) for ua, w in totals.items() if not is_recognized(ua)]
unknown.sort(key=lambda pair: pair[1], reverse=True)

with open('todo.txt', 'w') as out:
    for user_agent, weight in unknown:
        out.write('%d\t%s\n' % (weight, user_agent))
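If the sketch above were saved as, say, analyse_ua.py (the script name and the todo.txt output file are placeholders), running it over everything collected so far would be as simple as:

python analyse_ua.py *-ua-*.tar.gz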
This way you'll get a list of the most important user agents to add to browscap.
So, in my opinion, creating a system to collect bulk data is a simple task - as long as you don't require it to be great from the beginning, just good enough to tell you which user agent to look at next.
What do you think?