Automatic submission and prioritization of user agents, based on bulk data from websites.


Mikolaj Misiurewicz

Jul 30, 2013, 12:06:20 PM
to browsc...@googlegroups.com
Hi,
the current problem with browscap is that it doesn't detect recent or popular user agents.

It's great that people report them one by one, and great that there is a page to which you can go and submit a single entry.

But wouldn't it be better to accept entries in bulk? I'm sure there are people who, like me, collect user agents from high-traffic websites. If some of them agreed to upload their logs on a regular basis, this could really help the project.

That's only half of the job, because even if you collect bulk data, you still have to prioritize the collected user agents and then analyse and add them to browscap one by one. However, in my opinion, having bulk data as a source could be very beneficial to the project, and creating a system to parse and prioritize that data is a simple one-day task.

What needs to be done:
- we need to agree on an input data format
- a program analysing this data has to be created
- this program has to be able to prioritize the data and automatically show unrecognised user agents based on their priority

But hey, that's easy. Here's a proposal for such a system:


=== 1 === format

A single compressed archive named <username>-ua-<date_from>-<date_to>.tar.gz

Inside, one or more text files named <username>-ua-<listname>-<date_from>-<date_to>.txt

Each text file has a CSV-like structure.
Row separator - any kind of new line.
Column separator - \t (tab)
No quotes or escaping

Column 1 - weight - a positive integer that signifies a weight of each user agent entry relative to other entries in the file. The higher the number the more important that entry is. This is easy to generate by just doing COUNT() or SUM() on a relevant field when you create the file from the database.

Column 2 - user agent - can this contain new lines? If so, then perhaps that should be escaped or converted to space. Although any user agent with new line in it is a spoof anyway...

Rows have to be ordered by weight descending. User agents have to be unique.
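
To make the format rules above concrete, here is a minimal Python sketch of a reader that checks them. The function name `parse_ua_rows` is made up for illustration, and rejecting bad input with an error (rather than skipping or repairing the row) is an assumption, not something the proposal decides.

```python
def parse_ua_rows(text):
    """Parse 'weight<TAB>user agent' rows and check the proposed rules."""
    rows = []
    seen = set()
    previous_weight = None
    # splitlines() accepts \n, \r\n and \r (plus a few rarer line
    # boundaries), which covers "any kind of new line".
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        # Split on the first tab only: there is no quoting or escaping.
        weight_text, separator, user_agent = line.partition("\t")
        if not separator:
            raise ValueError(f"line {line_number}: missing tab separator")
        weight = int(weight_text)  # raises ValueError if not an integer
        if weight <= 0:
            raise ValueError(f"line {line_number}: weight must be positive")
        if previous_weight is not None and weight > previous_weight:
            raise ValueError(f"line {line_number}: rows not sorted by weight")
        if user_agent in seen:
            raise ValueError(f"line {line_number}: duplicate user agent")
        seen.add(user_agent)
        previous_weight = weight
        rows.append((weight, user_agent))
    return rows
```

A stricter or more forgiving reader is equally possible; the point is only that the whole format fits in a couple of dozen lines.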

Generating such a file should be a breeze for those who want to contribute. They could even write a five-line script to get this data from their database and run it each week.
Ideally each contributor would send a new file containing only new data, but to be honest, if they send everything they have in their logs, new and old, that would be good too.
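
As a rough illustration of how small such a script can be, here is a Python sketch that turns a list of logged user agent strings into the file body. The name `build_ua_file` is hypothetical, a real contributor would more likely run COUNT() in their database, and replacing newlines and tabs with spaces is only one possible answer to the open question under Column 2.

```python
from collections import Counter


def build_ua_file(logged_user_agents):
    """Return the file body: 'weight<TAB>user agent' rows, heaviest first."""
    # Assumption: newlines and tabs inside a user agent become spaces,
    # because the format has no quoting or escaping.
    cleanup = str.maketrans("\r\n\t", "   ")
    counts = Counter()
    for user_agent in logged_user_agents:
        cleaned = user_agent.translate(cleanup)
        if cleaned.strip():
            counts[cleaned] += 1  # weight = number of occurrences
    # Weight descending; ties broken alphabetically for a stable order.
    rows = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return "".join(f"{weight}\t{user_agent}\n" for user_agent, weight in rows)
```

Counting in a Counter also guarantees that each user agent appears only once in the output.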

I've attached my logs, from two different tables, in said format.


=== 2 === upload

Those files need to be uploaded somewhere, but to be honest we can just let people send them to an e-mail address. I don't think there will suddenly be hundreds of programmers sharing their logs.


=== 3 === parsing

A simple program can be created and then extended with new features. The base version would look like this:

foreach zip, foreach file
  get first and last line and save min_weight and max_weight

  foreach line
    normalize the weight in a way that min_weight would be scaled to 100 and max_weight to 10,000
    round the normalized_weight to nearest integer

    if user_agent does not exist in local_array
      add an entry user_agent => normalized_weight
    else
      local_array[user_agent] = local_array[user_agent] + normalized_weight
    endif
  endforeach

  foreach entry in the local_array
    remove entry from the array if it is recognized by browscap
  endforeach

  sort the local_array by weight descending
  save the data to file
endforeach, endforeach



This way you'll get a list of the most important user agents to add to browscap.


So, in my opinion, creating a system to collect bulk data is a simple task, as long as you don't require it to be great from the beginning, but just to give you a list of which user agents to look at next.
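
The outline above could be sketched in Python roughly as follows. This is an illustration under several assumptions, not a finished tool: the names `normalize_weight` and `prioritize` are invented, the scaling is taken to be linear, a file whose weights are all equal is mapped to 100, weights are summed across all submitted files, and `is_recognized` stands in for whatever browscap lookup is actually available.

```python
def normalize_weight(weight, min_weight, max_weight, low=100, high=10_000):
    """Linearly scale weight so min_weight -> low and max_weight -> high."""
    if max_weight == min_weight:
        return low  # assumption: a file with one distinct weight maps to low
    span = max_weight - min_weight
    scaled = low + (weight - min_weight) * (high - low) / span
    return round(scaled)


def prioritize(files, is_recognized):
    """Merge files of (weight, user_agent) rows into a ranked list.

    Each file must already be sorted by weight descending, as the format
    requires. is_recognized is a placeholder callable for the browscap
    check; it is not a real browscap API.
    """
    totals = {}
    for rows in files:
        if not rows:
            continue
        max_weight = rows[0][0]   # first line carries the largest weight
        min_weight = rows[-1][0]  # last line carries the smallest weight
        for weight, user_agent in rows:
            score = normalize_weight(weight, min_weight, max_weight)
            totals[user_agent] = totals.get(user_agent, 0) + score
    unknown = [(ua, total) for ua, total in totals.items()
               if not is_recognized(ua)]
    # Highest combined weight first; ties broken alphabetically.
    unknown.sort(key=lambda item: (-item[1], item[0]))
    return unknown
```

Normalizing per file before summing keeps one very high-traffic contributor from drowning out everyone else, which seems to be the reason for the 100 to 10,000 range.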


What do you think?


quentin389-ua-1.07-30.07.tar.gz

James Titcumb

Jul 31, 2013, 4:20:57 AM
to browsc...@googlegroups.com
I like this idea - do you think you could make some kind of prototype (preferably in PHP, as that is the chosen language of the rewrite) to accept submissions?

I can't guarantee if or when it would be included in the project, but it would be cool to see something like this working.

Mikolaj Misiurewicz

Aug 3, 2013, 12:27:15 PM
to browsc...@googlegroups.com

James Titcumb

Aug 3, 2013, 4:44:26 PM
to Mikolaj Misiurewicz, browsc...@googlegroups.com, RAD Moose
Awesome, Mikolaj - this looks like something we'll be able to use in the near future, at least to help identify which UAs urgently need adding!

Moose - do you have any feedback on this also?

