GSoC project: VCards search engine

Thomas Koch

Mar 27, 2011, 1:02:19 PM
to onesoc...@googlegroups.com
Hi,

I've dived deeper into osw and your GSoC project ideas[1]. I'd like to work on
the "User search and discovery across networks" idea. I've written a social
media monitoring crawler in my last job and used lucene[2] for that.

Given that osw servers provide two pieces of information, the implementation
could be rather easy:
- osw servers provide a list of all accounts registered at this server
- osw servers provide a list of all other osw servers they know

I presume that there is some way to fetch the public vcard information of an
account.

So my project would be to crawl all osw servers, fetch the account lists,
fetch the available public vcard information for each account, push that into
lucene and write an integration into the GWT web app.
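
A minimal sketch of the crawl described above, assuming each server can be asked for its known peers and for the public vcards of its accounts. The FAKE_NETWORK dictionary, the host names and the toy inverted index are all hypothetical stand-ins: osw does not necessarily expose these lists today, and a real implementation would push documents into Lucene rather than into a dict.

```python
from collections import deque

# Hypothetical stand-in for the network: each server maps to the peers
# it knows and the public vcards of its accounts (assumed data).
FAKE_NETWORK = {
    "a.example": {"peers": ["b.example"],
                  "vcards": {"alice@a.example": {"name": "Alice Smith"}}},
    "b.example": {"peers": ["a.example", "c.example"],
                  "vcards": {"bob@b.example": {"name": "Bob Jones"}}},
    "c.example": {"peers": [],
                  "vcards": {"carol@c.example": {"name": "Carol Smith"}}},
}

def crawl(seed, network):
    """Breadth-first crawl from one seed server; returns {jid: vcard}."""
    seen, queue, profiles = {seed}, deque([seed]), {}
    while queue:
        server = queue.popleft()
        info = network.get(server)
        if info is None:  # server down or unknown: skip it
            continue
        profiles.update(info["vcards"])
        for peer in info["peers"]:
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return profiles

def build_index(profiles):
    """Toy inverted index: lowercase name token -> set of jids."""
    index = {}
    for jid, vcard in profiles.items():
        for token in vcard.get("name", "").lower().split():
            index.setdefault(token, set()).add(jid)
    return index

profiles = crawl("a.example", FAKE_NETWORK)
index = build_index(profiles)
print(sorted(index["smith"]))
```

Starting from a single seed, the crawl reaches c.example only through b.example's peer list, which is why the second list (known servers) matters as much as the first.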

The search engine I propose would not be decentralized. It would collect all
available public account information in one central database and make it
searchable from there. However every osw installation would be free to provide
its own search server.

It would not provide a good user experience to have a decentralized search
solution. Such a solution would need to contact maybe hundreds or even
thousands of osw servers and aggregate the search results. This, however,
leads to high latency, and results would be incomplete because at any given
time some servers will be down or unreachable.
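
The availability argument can be made concrete with a small calculation. This is only a sketch: the 99.9% per-server availability is an assumed figure, and the servers are assumed to fail independently of each other.

```python
def p_all_up(servers, availability):
    """Probability that every one of `servers` hosts answers a query,
    assuming independent failures (an assumption, not a measurement)."""
    return availability ** servers

for n in (10, 100, 1000):
    print(n, round(p_all_up(n, 0.999), 3))
```

Under these assumptions, a query fanned out to 1000 servers gets a complete answer only about 37% of the time, even though each single server is almost always up.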

Having a searchable database of even billions of osw users is doable since
every profile contains only a few bytes. For some millions of users even plain
lucene is good enough.
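
As a back-of-the-envelope check of the size claim, here is a short sketch. The 200 bytes per indexed profile is purely an assumed figure (name, country, account name); photos and index overhead are ignored.

```python
def index_size_gb(profiles, bytes_per_profile):
    """Raw profile data in gigabytes (decimal GB), ignoring index overhead."""
    return profiles * bytes_per_profile / 1e9

print(index_size_gb(5_000_000, 200))       # a few million users
print(index_size_gb(1_000_000_000, 200))   # a billion users
```

With that assumption, a few million users fit in about 1 GB and a billion users in about 200 GB of raw data.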

What do you think?

[1] https://github.com/onesocialweb/osw-openfire-plugin/wiki/GSoC-Ideas-Page
[2] http://lucene.apache.org/

Best regards,

Thomas Koch, http://www.koch.ro

Maxi

Mar 27, 2011, 6:25:46 PM
to onesoc...@googlegroups.com
Hi,

I'm not affiliated with OSW, but just wanted to let you know that I'm
working on a similar project. I'm using Python, however.
My project has two differences to what you describe: I use
synchronization between search servers, so each server has the same
database.
Also, if I get you right, you want to maintain a list of OSW servers
that should be crawled. This will make things difficult for users who
want to run instances of OSW/Diaspora/Friendika just for themselves. I
suggest that OSW servers simply submit their VCard addresses to any
search server, which will fetch the VCards. Synchronization makes sure
that every search server gets these VCards.
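
The submit-and-synchronize idea can be sketched roughly as follows. This is not the protocol used in the Diaspora-User-Directory repository: the SearchServer class, its method names and the example URLs are hypothetical, and the synchronization is reduced to a plain set union.

```python
class SearchServer:
    """Toy search server that stores submitted vcard addresses."""

    def __init__(self, name):
        self.name = name
        self.addresses = set()

    def submit(self, vcard_address):
        # Any OSW server may submit an address to any search server.
        self.addresses.add(vcard_address)

    def sync_with(self, other):
        # After syncing, both servers hold the union of their addresses.
        merged = self.addresses | other.addresses
        self.addresses = set(merged)
        other.addresses = set(merged)

s1, s2 = SearchServer("s1"), SearchServer("s2")
s1.submit("https://tiny.example/alice.vcf")   # single-user instance
s2.submit("https://big.example/bob.vcf")
s1.sync_with(s2)
print(s1.addresses == s2.addresses)
```

Because the small instance pushes its own address, no search server needs to know about it in advance, which is the point of submitting rather than crawling from a fixed server list.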

Here is my repository:
http://github.com/Leberwurscht/Diaspora-User-Directory
And here some discussion on the diaspora-dev mailing list:
http://groups.google.com/group/diaspora-dev/browse_thread/thread/b7e168187160f2b4

Note that I'm currently focussing on synchronization and spam
prevention. I have not yet concentrated on efficient searching in the
database, which is still SQLite.

Maxi

Diana Cheng

Mar 28, 2011, 11:50:12 AM
to onesocialweb
Hello everyone,

This discussion came up in the federated social web mailing list: See
--> [1]

From there, it seems that a decentralized search solution (as in a
"pure" P2P case) does not satisfy some, because it is time
consuming and therefore provides poor UX. Centralized approaches
discussed in [1] are relying on public data, and Thomas also mentions
he wants to crawl all publicly available information in a profile. I
think we need to go back to the definition of what public is. Since we
are working on interop with OStatus, I assume as public everything
which can be provided as an hCard profile linked to a webfinger
account, i.e. accessible as a web page by anyone. Another level of
visibility I see for profile data is "visible to OSW users only". In
the context of activities also "OSW followers only" was discussed, and
in the future, confirmed relationships too: http://onesocialweb.org/spec/1.0/osw-relations.html

I think, as has also been mentioned in the thread, that ideally,
within OSW for example, you should be able to discover a contact who
has an OSW-private profile, i.e., one not visible outside the federation (if
he decided he wants it that way). In any case, I guess the minimum
amount of information which should be visible to anyone in the
federation are jid and Jabber Name (the one the XMPP Server stores as
part of the XMPP account), which is what's included in an activity
payload as part of the actor/author. Any comments on this?

From the thread in [1], I gather that ideally, if you have the email
address of an OSW contact for example, but the user keeps this field
private, you could still search using this field, find the profile and
see only the information he marked as "public" and "visible to any OSW
user". Am I right here, and would this make sense?
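
One way to read this is "match on any field, display only what the viewer may see". The sketch below illustrates that reading under stated assumptions: the three visibility levels, the profile layout and the function names are hypothetical and not taken from the OSW specification.

```python
PUBLIC, OSW_ONLY, PRIVATE = "public", "osw", "private"

# Hypothetical profile store: field -> (value, visibility level).
PROFILES = {
    "alice@a.example": {
        "name": ("Alice Smith", PUBLIC),
        "country": ("DE", OSW_ONLY),
        "email": ("alice@mail.example", PRIVATE),
    },
}

def visible_fields(profile, viewer_is_osw_user):
    """Return only the fields this viewer is allowed to see."""
    allowed = {PUBLIC, OSW_ONLY} if viewer_is_osw_user else {PUBLIC}
    return {key: value for key, (value, level) in profile.items()
            if level in allowed}

def search_by_email(email, viewer_is_osw_user, profiles=PROFILES):
    """Exact match on the email field, even if it is private; the
    result never includes fields above the viewer's visibility."""
    results = {}
    for jid, profile in profiles.items():
        value, _level = profile.get("email", (None, PRIVATE))
        if value == email:
            results[jid] = visible_fields(profile, viewer_is_osw_user)
    return results

print(search_by_email("alice@mail.example", viewer_is_osw_user=True))
print(search_by_email("alice@mail.example", viewer_is_osw_user=False))
```

The private email address is used for matching but is never returned, so someone who already knows the address learns which profile it belongs to, while nobody learns the address from the search results.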

Something else we need a solution for is to find a way to discover
contacts based on your contacts in existing networks (users will want
that at some point). See first message in [1]. Maybe it would be good
to tackle Gmail contacts first (XMPP-based)?

That said, Thomas' solution would in any case be a plus for the project,
since we don't have any way of discovering users and the only way to
retrieve their profile is searching by JID. But perhaps we can
converge towards the best approach?

Best regards,
Diana.

[1] http://goo.gl/gL5NN

Thomas Koch

Mar 30, 2011, 4:16:29 AM
to onesoc...@googlegroups.com, Diana Cheng
Diana Cheng:
Hi Diana,

there is no public search engine for email addresses, but people still manage
to share their email addresses offline and use email for communication. So I
think that osw could still work even for people who don't want to make even
their name public.
However, with email there are two workarounds:
a) When I want to contact somebody, I do a web search for his name and maybe
affiliation (like "Debian") and I'll probably find an email he sent to a
public mailing list and grab his email address from the search result.
b) Many contacts (like the contact with you) got established by joining a
mailing list (a channel in buddycloud terms) and receiving a response to a
message.

I propose to encourage people to reveal a small part of their profile as
public data: name, country, account name, maybe small photo(?). The great
majority of social network users reveals even more data today to public search
engines. I don't see how the mere fact of my existence being known
could do me any harm.
Information which is far more sensitive and which should not be publicly
exposed is my social graph, contact data, interests, channel subscriptions,
CV, affiliations, ...

As an optional addition to the public search engine we could later on provide
a (slower) distributed search: ask all my contacts whether they know a person
matching my search request. So my profile would have an option: "reveal my
profile for 2nd-degree searches".
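
A rough sketch of such a search through contacts, assuming each contact can be asked for its own contact list and each profile carries an opt-in flag. The CONTACTS and PROFILES data and the second_degree flag are hypothetical; no such option exists in osw as far as this thread states.

```python
# Hypothetical contact lists: jid -> jids of that account's contacts.
CONTACTS = {
    "me@a.example": ["bob@b.example", "carol@c.example"],
    "bob@b.example": ["dave@d.example", "erin@e.example"],
    "carol@c.example": ["erin@e.example"],
}

# Hypothetical profiles with the opt-in flag for contact-of-contact search.
PROFILES = {
    "dave@d.example": {"name": "Dave Miller", "second_degree": True},
    "erin@e.example": {"name": "Erin Miller", "second_degree": False},
}

def second_degree_search(me, query, contacts=CONTACTS, profiles=PROFILES):
    """Ask each of my contacts for their contacts whose name matches
    the query, honouring each candidate's opt-in flag."""
    hits = set()
    for friend in contacts.get(me, []):
        for candidate in contacts.get(friend, []):
            profile = profiles.get(candidate)
            if (profile and profile["second_degree"]
                    and query.lower() in profile["name"].lower()):
                hits.add(candidate)
    return sorted(hits)

print(second_degree_search("me@a.example", "miller"))
```

Erin matches the query but has not opted in, so only Dave is returned.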

Still people have of course the option to not publish any information about
themselves.

@Diana: Any word on whether the XSF would accept you as a mentor for a GSoC
project?

Jacob Maldonado

Mar 30, 2011, 5:59:08 AM
to onesoc...@googlegroups.com, Thomas Koch
When you say Lucene, do you mean that you use Nutch and Hadoop? The idea is
good if so. If not, you will run into many problems, such as database storage
and slow robots.