Common Protocol for Tagging Web Links?

Ali Rantakari

Nov 7, 2009, 7:16:51 AM
to OpenMeta
I'd like to start a discussion about tagging web links in applications
that support OpenMeta because I think it'd be in the best interests of
users if different applications were compatible with each other in
this respect as well.

I just released version 0.9 of Tagger where I added this feature, and
the way I implemented it was that for every tagged URL I create
a .webloc file in "~/Library/Metadata/Tagger/Web Links/" and add the
tags to that. I use the title of the tagged page as the name of the
file (replacing forward slashes with hyphens) because I think it'll be
intuitive for the users to see the page titles (instead of, say, their
URLs) when they do Spotlight searches. Whenever two different pages
with the same title are tagged, I append " (URL)" to the file name,
replacing forward slashes with colons in the URL.

I noticed that Gravity Apps' Tags has this feature, but it does it a
little differently: it uses
"~/Library/Caches/Metadata/Tags/Bookmarks/" for the .webloc storage,
also replaces colons and backslashes with hyphens in the page title
when using it as the .webloc filename, and deals with page title
collisions by appending incrementing numbers to the filenames.

If different apps were to standardize on this feature, here are some
of the issues I think are important to think about:


Location for storing the .webloc files
-----------------------------------------------------------
I propose some path under "~/Library/Metadata/" because this path is
both indexed by Spotlight and backed up by Time Machine by default.
"~/Library/Caches/Metadata" is indexed by Spotlight but not backed up
by Time Machine.

How does "~/Library/Metadata/OpenMeta/Web Links/" sound?


Naming of the .webloc files
-----------------------------------------------------------
I think using page titles for naming the files is intuitive and clear.
The only possible issue that I can think of is that tagging
applications would always need to have the title as well, and not just
the URL, but I don't see this as a big problem.

The second thing is how to 'clean up' the page title so that it'd be
suitable for use as a file name. What needs to be taken into account
here are the restrictions imposed by different file systems. The most
common one is of course HFS+, but I suppose one can also use a
different file system on the volume where their home folder is
located. An optimal solution would probably replace every character
that is forbidden on any of the possible Mac filesystems (or at least
the most common ones) with some allowed character, and restrict the
length of the filename to the lowest maximum filename length among
those filesystems. Taking restrictions for filesystems common on
other platforms (e.g. FAT32, NTFS, ext2/3/4) into account would be
nice as well (although not imperative) since users might want to copy
the .webloc files onto Windows or Linux systems and this could make it
a bit easier.
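
A minimal Python sketch of this kind of clean-up follows. The forbidden
character set, the 255-byte limit and the name sanitize_title are all
assumptions made for illustration; the thread does not settle on exact
values, and real filesystems count name length in different units.

```python
import unicodedata

# Assumed union of characters that are forbidden or troublesome on HFS+,
# FAT32, NTFS and ext2/3/4. This set is illustrative, not agreed upon.
FORBIDDEN = set('/\\:*?"<>|')
# Assumed conservative limit for a single file name, counted in UTF-8 bytes.
MAX_BYTES = 255


def sanitize_title(title, extension=".webloc", replacement="-"):
    """Return a file name derived from a page title (hypothetical helper)."""
    cleaned = []
    for ch in title:
        if ch in FORBIDDEN:
            cleaned.append(replacement)
        elif unicodedata.category(ch).startswith("C"):
            # Control and other non-printing characters (newline, tab, ...)
            cleaned.append(" ")
        else:
            cleaned.append(ch)
    # Collapse runs of whitespace and trim both ends.
    name = " ".join("".join(cleaned).split())
    # Leave room for the extension, then cut whole characters until it fits.
    budget = MAX_BYTES - len(extension.encode("utf-8"))
    while len(name.encode("utf-8")) > budget:
        name = name[:-1]
    return name.rstrip() + extension
```

Replacing forbidden characters with a hyphen mirrors what Tagger and Tags
do for slashes; any other allowed character would work the same way.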

Then there is the issue of page title collisions: if the user wants to
tag two different web links that have the same title, there needs to
be a predictable system for making the filenames unique. I can think
of three different options here:

(1) Appending incrementing numbers to the filenames (this is what Tags
does). This is a simple solution for making the filenames unique and
doesn't make them 'ugly' (in my opinion). The only problem I can think
of is that it won't be obvious to the user which link points to which
page if you have more than one with the same name, with the only
difference of an additional number at the end.

(2) Appending the page URLs to the filenames (this is what Tagger 0.9
does). This is an even simpler (?) solution for making the filenames
unique but it kind of clutters them up (i.e. makes them 'ugly'). The
problem with #1 is solved here, though, since the URL is in the
filename. One problem with this is that it requires a lot of space for
additional characters in the filename (especially if the URL is long)
in order to guarantee uniqueness, as opposed to #1. This might become
a problem if the maximum filename length is low.

(3) Creating a folder tree that follows the structures of the URLs and
saving the .webloc files at the leaf folders in this tree (see below
for example). This allows one to keep using the page titles as
filenames without modifying them further and still avoid collisions,
but the problem with #1 still applies here and this one requires more
work (you'd need to make sure that empty parts of the folder tree are
removed whenever .webloc files are deleted). File system folder name
length limits (and path length limits? total file number limits?) will
probably cause problems here as well. I'm not rooting for this one at
the moment.

https/www.google.com/dogs/cats/?q=dogs%20cats/Dogs and Cats.webloc
http/hasseg.org/tagger/Tagger.webloc
ftp/hasseg.org/tagger/Tagger.webloc
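
The layout in the example above could be derived from a URL roughly as
in the following Python sketch. The function name tree_path is
hypothetical, and the sketch does no sanitizing of the individual path
components, which a real implementation would need.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def tree_path(url, title):
    """Map a URL and page title to a relative path in the folder tree."""
    parts = urlsplit(url)
    # Scheme and host become the first two folder levels.
    segments = [parts.scheme, parts.netloc]
    # Each non-empty path component becomes one more folder level.
    segments += [s for s in parts.path.split("/") if s]
    # The query string, if any, becomes the leaf folder.
    if parts.query:
        segments.append("?" + parts.query)
    return PurePosixPath(*segments) / (title + ".webloc")
```

For instance, tree_path("https://www.google.com/dogs/cats/?q=dogs%20cats",
"Dogs and Cats") yields the first path in the example.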


Handling of anchors in page URLs
-----------------------------------------------------------
In Tagger I decided to remove the 'anchor' part from all URLs before
doing anything else with them since I want to apply tags to full
pages, not parts thereof. I think this is a reasonable approach but I
wanted to mention this as well in case there are compelling arguments
to the contrary.
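
As an illustration (not Tagger's actual code), removing the anchor is a
one-liner with Python's standard library; strip_anchor is a hypothetical
name:

```python
from urllib.parse import urldefrag


def strip_anchor(url):
    """Return the URL without its '#fragment' part, if it has one."""
    return urldefrag(url).url
```

With this, "http://hasseg.org/tagger/#download" and
"http://hasseg.org/tagger/" would both map to the same tagged page.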


Cleaning up
-----------------------------------------------------------
This is not a big issue but Tags seems to delete the .webloc files
whenever they don't have any tags assigned to them anymore. I made
Tagger do the same. I think this is a good idea.


Thoughts?

Jon Stovell

Nov 7, 2009, 8:30:56 AM
to open...@googlegroups.com
Regarding file systems and illegal characters in file names: in my
experience, the system automatically handles the conversion of illegal
characters on its own, both when creating files and when moving them.
Of course, as an AppleScripter, my experience is limited to using
Finder or shell commands to manipulate files, so this may no longer
hold true when a Cocoa application is manipulating files itself.

Jon

Tom Andersen

Nov 7, 2009, 2:10:24 PM
to open...@googlegroups.com
Ali,

Sounds good.

Some comments:
You can't store anything in Library/Caches, as that can get wiped out. (I toss mine sometimes to look for bugs).

Users will rarely (NEVER) look in Library for documents. But yes, Spotlight searching will find them.

These webloc files are documents. In my mind they should be in the documents folder.

I don't know if this will help, but Yep stores stuff like this in a day-by-day folder system that we put at
~/Documents/Filed Documents/ (Yep does an alias/symlink resolve on that path in case the user wants the actual filed documents folder somewhere else).

Then when they, say, make a PDF from a web page, or do a scan, by default it ends up in 'today's' folder.
~/Documents/Filed Documents/2009/11/7

Then all documents that the user does not want to bother filing are filed by date, which makes a sort of 'digital paper trail' of the day's activities. It also largely solves name collisions.

You are of course free to add it to 'today's' folder if you would like - I like to keep the folders neat, and only create them if there are files inside them.

But your users may not like a 'Filed Documents' folder? I have had a few complaints (very few) about creating a folder in that location.

Here then are my possible solutions for name collisions:
1) Use Filed Documents - append 1, 2, etc for collisions on the same day
2) Use Filed Documents - if it exists, otherwise make a date based folder hierarchy in ~/Library/Metadata or App Support. ~/Library/Metadata/2009/11/7/This webloc file.webloc

3) Just always put them in ~/Library/Metadata/2009/11/7/ etc.
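
A rough Python sketch of the date-based folder plus numeric suffixes
described in these options. The helper names dated_folder and
unique_name are hypothetical, and the space before the number is an
assumed formatting choice:

```python
from datetime import date
from pathlib import PurePosixPath


def dated_folder(root, day):
    """Build root/YYYY/M/D without zero padding, e.g. .../2009/11/7."""
    return PurePosixPath(root) / str(day.year) / str(day.month) / str(day.day)


def unique_name(stem, taken, extension=".webloc"):
    """Append 1, 2, ... to stem until the name is not in the taken set."""
    candidate = stem + extension
    n = 1
    while candidate in taken:
        candidate = f"{stem} {n}{extension}"
        n += 1
    return candidate
```

Here taken would be the set of file names already present in that day's
folder, so collisions only need resolving within a single day.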

I think that the last thing you want to do is ask the user where to file them. With a barrier like that, users just won't bother saving them.


Yep also supports webloc files. We save them to the Filed Documents folder, using the name taken from the pasteboard URL name

public.url-name

Then I replace everything that I don't like with spaces.
And yes return characters can be in file names - but it fouls up a lot of software so I don't put them in... no tabs either!

: / \ \n \t

But I can change that in a future release if you settle on something.

--Tom

Ali Rantakari

Nov 7, 2009, 5:08:36 PM
to OpenMeta
Hi Tom,

Thanks for your insights on the matter.

> These webloc files are documents. In my mind they should be in the documents folder.

This is actually a very good way to think about this. I didn't
consider them documents per se, but simply as 'containers' for tags
that users wish to assign to arbitrary web pages -- whenever all tags
are removed from one of these .webloc files, it is deleted because
it's not needed anymore (i.e. there's nothing to 'contain' anymore
since all tags for that page are gone).

So I think this is a very fundamental question regarding this issue:
should we consider the .weblocs 'documents' or 'temporary tag
containers'? From my point of view both options sound okay -- right
now Tagger considers them the latter, but I don't see any problems in
changing this behaviour for my app; I'd just need to refrain from
deleting them when all tags are removed from them and of course change
the storage path.

If we go the way of '.weblocs as documents', then I agree that we
should use a path under ~/Documents to store them. For the opposite
case I'd still suggest "~/Library/Metadata/OpenMeta/Web Links/" or
somesuch.

Regarding your suggestions about handling name collisions, I am not
opposed to the combination of date paths (like "2009/11/7/") and
incrementing numbers at the ends of filenames, regardless of the
location where the .weblocs would be stored.

Replacing all the characters you mention with spaces when sanitizing
the page titles sounds ok to me, although I'd prefer to replace
forbidden non-whitespace characters with the 'closest equivalent'
non-whitespace character, like a hyphen. There are a couple of other
filename sanitization issues that came to my mind just now, though:

- If we end up with an empty string after page title sanitization,
should we use some sort of sanitized version of the URL instead, or
just save the file with the name "teh-internets.webloc" or some other
default? Do you handle this somehow in Yep?

- We should probably strip away any leading dots so as not to
inadvertently make the files hidden.
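
Both points could be handled along the lines of this Python sketch. The
default name, the helper name fallback_stem and the way the URL is
flattened are assumptions for illustration only:

```python
from urllib.parse import urlsplit

# Hypothetical default used when neither title nor URL yields a name.
DEFAULT_NAME = "Untitled Web Link"


def fallback_stem(sanitized_title, url):
    """Pick a non-empty, non-hidden file name stem (without extension)."""
    # Strip leading dots so the file does not become hidden.
    stem = sanitized_title.strip().lstrip(".").strip()
    if stem:
        return stem
    # Title was empty after sanitizing: fall back to host plus path.
    parts = urlsplit(url)
    from_url = (parts.netloc + parts.path).replace("/", "-").strip("-.")
    return from_url or DEFAULT_NAME
```

Note that a host with an explicit port would still contain a colon here,
so the result would need the same forbidden-character pass as a title.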

Gravity

Nov 9, 2009, 9:13:05 AM
to OpenMeta
Hello,

Right now we store tag information about web links in the Application
Support folder. We then create Spotlight-indexed elements in the
caches folder. So whenever the caches folder is cleaned up, it's
rebuilt on the next run.
For us the URL is the primary key and not the title.

I'm also not sure if I want to have tons of links in the Documents
folder (or a subfolder of it). I would never browse such folders in
the Finder. So what's the benefit of having it in the Documents
folder? The Address Book data, like many others, is also stored in
Application Support.
For me the Documents folder is a place where the user is supposed to
put his files.

Martin

Tom Andersen

Nov 9, 2009, 2:12:41 PM
to open...@googlegroups.com
Ah yes - I see.

For us, a user drags a URL to Yep/Leap, then I make a webloc file for that URL, which will likely relate to other events of the day - and so we file it in today's files. But I don't know how many of these you are dealing with, and if users would consider them automatically generated or not. I guess it depends on how the user thinks of the webloc - as a document, url, bookmark, etc.

--Tom

Ali Rantakari

Nov 29, 2009, 5:10:13 PM
to OpenMeta
Thanks for taking the time to participate in the discussion, guys.

It seems we should clearly separate the approaches of ".weblocs are
documents" and ".weblocs are temporary tag containers" -- I think it's
clear to everyone that these should be handled quite differently.

I would like to take this opportunity to limit this discussion purely
to ".weblocs are temporary tag containers" for the following reasons:

- This is where I think a common protocol would be more beneficial to
end users (or to put it the other way around, the *lack* of a common
protocol would be detrimental to end users.)
- This is what Tagger does, so it's relevant to me personally

To reiterate the main purpose here, I would simply want the users to
be able to tag a website with app A and then later edit the tags with
app B, without directly referencing the .webloc file itself (i.e.
instead of saying "tag this .webloc file", the user would say "tag
this site I have open in my browser" or "tag this link I have here"
etc.).

In order to get the discussion moving again, I'll give a suggestion
which can be criticized and commented on:

- a "catalog" plist file in a known per-user location specifies a
  key-value mapping of URLs to .webloc file names (e.g.
  "~/Library/Application Support/OpenMeta/weblocCatalog.plist")
- the .webloc files themselves would be located in a known per-user
  location that is indexed by Spotlight and backed up by Time Machine
  (e.g. "~/Library/Metadata/OpenMeta/Web Links/")
- whenever an app wants to tag a website, it goes through the
  following steps:
    - remove "anchor" part of URL (?)
    - read contents of catalog plist, see if URL exists as a key
    - if URL exists, read value for key to get the .webloc filename
    - if URL does not exist, determine a "user-friendly" filename for
      the .webloc file (employing incrementing number suffixes to
      avoid name collisions) and create the .webloc file with this
      name, appending the corresponding key/value pair (URL/.webloc
      filename) to the catalog
    - open the .webloc file for tagging
- whenever all tags are removed from a .webloc file, it should be
  deleted and its corresponding catalog entry removed
- the protocol implementation should be added to the OpenMeta library
  codebase in order to minimize compatibility issues

The purpose of the catalog file in this suggestion is to minimize the
overhead required for finding the correct .webloc file for a given
URL, if one already exists.
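
To make the suggestion concrete, here is a rough Python sketch of the
lookup and clean-up steps. It is only an illustration, not OpenMeta
code: the function names are hypothetical, the .webloc is assumed to be
a plist with a single "URL" key, title clean-up is reduced to replacing
slashes, and the actual tagging step is left out.

```python
import plistlib
from pathlib import Path
from urllib.parse import urldefrag


def _load(catalog_path):
    """Read the URL -> file name catalog, or return an empty one."""
    if catalog_path.exists():
        with catalog_path.open("rb") as f:
            return plistlib.load(f)
    return {}


def _save(catalog, catalog_path):
    catalog_path.parent.mkdir(parents=True, exist_ok=True)
    with catalog_path.open("wb") as f:
        plistlib.dump(catalog, f)


def webloc_for_url(url, title, catalog_path, links_dir):
    """Return the .webloc path for url, creating file and entry if needed."""
    catalog_path, links_dir = Path(catalog_path), Path(links_dir)
    url = urldefrag(url).url                 # remove the anchor part
    catalog = _load(catalog_path)            # read the catalog
    if url in catalog:                       # known URL: reuse its file
        return links_dir / catalog[url]
    # New URL: pick a user-friendly name with an incrementing suffix.
    stem = title.replace("/", "-") or "Untitled"
    name, n = stem + ".webloc", 1
    taken = set(catalog.values())
    while name in taken or (links_dir / name).exists():
        name = f"{stem} {n}.webloc"
        n += 1
    links_dir.mkdir(parents=True, exist_ok=True)
    with (links_dir / name).open("wb") as f:
        plistlib.dump({"URL": url}, f)       # assumed .webloc layout
    catalog[url] = name
    _save(catalog, catalog_path)
    return links_dir / name


def forget_url(url, catalog_path, links_dir):
    """Delete the .webloc and its catalog entry once all tags are gone."""
    catalog_path = Path(catalog_path)
    catalog = _load(catalog_path)
    name = catalog.pop(urldefrag(url).url, None)
    if name is None:
        return
    target = Path(links_dir) / name
    if target.exists():
        target.unlink()
    _save(catalog, catalog_path)
```

The sketch ignores concurrent access: two apps writing the catalog at
the same time could lose entries, which a shared implementation would
have to guard against.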

What do you think?
