One more thing in addition to my previous post


Wicks

Dec 28, 2006, 4:00:44 AM12/28/06
to okopipi-dev
I forgot to mention this in my previous post.

The Ice framework may help quite a bit with the underlying P2P
communications. The framework was primarily built for grid computing:
http://www.zeroc.com/

Stuff like:
1) Programming-language-independent, platform-independent object
serialization across TCP streams, through firewalls. It automatically
takes care of things like little-endian to big-endian object serialization.

This means the comms layer will interoperate with everything from C++,
Java, C#, Visual Basic, Python, PHP and Ruby.

A full overview is at:
http://www.zeroc.com/ice.html

but with specific respect to this project:

Glacier with IceSSL: Secure comms layer through firewalls

IceStorm: Broadcast messages from master nodes to the entire mesh of
nodes.

Freeze: Persistence and management of an object database (whitelists,
work items, etc.).

IceBox: Might work as the application host, where the client is
deployed to IceBox as a DLL. If this works it would make life pretty
smooth, as IceBox is cross-platform, though there might be issues
trying to do this.

IcePatch: Auto-update of client applications. A possible vulnerability
for saboteurs: a DNS hijack of the update server, pointing it at
something else.

Ice: Well this is the core library.
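
For what it's worth, ZeroC ships Python bindings too, so even a quick
prototype of a client talking to a master node could be thrown together
in a few lines. A minimal sketch, assuming an Ice runtime is installed;
the "WorkDispatcher" identity, host and port below are purely
hypothetical placeholders we would have to define ourselves, not
anything Ice provides:

import sys
import Ice   # ZeroC Ice Python binding

# Hypothetical proxy string: "WorkDispatcher" would be an identity we
# define in our own Slice file; host and port are placeholders.
communicator = Ice.initialize(sys.argv)
try:
    proxy = communicator.stringToProxy(
        "WorkDispatcher:tcp -h master.example.org -p 10000")
    proxy.ice_ping()   # round-trip check that the remote object is reachable
    print("master node reachable")
finally:
    communicator.destroy()

Real calls would of course go through interfaces generated from a Slice
definition rather than a bare ice_ping, but it shows how thin the client
side could be.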

Wicks

Dec 28, 2006, 4:03:04 AM12/28/06
to okopipi-dev
Ah, I see my previous post was too long for this forum, so I will cut
it into two:
-----Part 1 of 2-------

Not trying to blow my own trumpet or anything here.
I really respect the initiative that you have taken.
I've been monitoring this for about a month or so and it kind of looks
like the project is dying.

I would like to volunteer, if you feel it is appropriate, to help with
this project. In terms of being a coder, I'm a systems architect /
network engineer / .NET developer and have been doing this for about
11 years now.

Sadly, however, my GCC is beyond rusty and Java is a realm that I avoid,
which means that apart from systems design and architecture I can only
help out with coding in .NET for Windows platforms. I am aware of Mono,
but Mono does not yet have full support for the .NET 2.0 spec, so the
same code base cannot be cross-compiled.

In the name of progress, what came out of your project meetings with
respect to the core design? There is nothing posted on the site.
Overall, I understood that the idea was to be based on a pure P2P
architecture so as to avoid any single central point of failure from
DDoS, hackers or whatever.

However, food for thought: the validation of what constitutes spam and
what does not needs to be fully automated, so as to avoid dependence on
any single entity. One suggestion is perhaps:
Step 1) Cross-validation of the source of the spam against good RBLs
such as SpamCop (in the RBL = blacklist, not in the RBL = whitelist).
Step 2) Automated parsing of the email in order to detect the presence
of the spammer's website, remotely loaded tracker images, etc.
Step 3) Validation of the target spammer websites against a database of
good and bad targets and, in the case of unknown ones, putting them up
for a vote by the community of users (70% or better majority wins). A
few fail-safes can be built in here. For example, the sites listed on
DMOZ are probably a good whitelist. Perhaps the good folks at DMOZ would
be happy, if requested, to also add a whitelist tag to their listings,
which would make it even easier. (A rough sketch of steps 1-3 follows
this list.)
Step 4) Automated spidering of known target spam sites and automated
identification of contact forms and contact email addresses.
Step 5) Once the above steps are completed, submission of opt-out emails
on a given scale (perhaps 1 spam = 10 opt-out messages), where each
email/opt-out is submitted by a random client on the network. Perhaps,
in the case of sites that have hidden away all contact details, 1 spam =
100 complete page downloads of their website, with the contents of the
opt-out message appended to the URL as a query string; this way it
appears in their web logs. I do not know about the legality of this.
Step 6) Every tracker image embedded in an identified spam email should
be hit many hundreds of times, perhaps at a ratio of 1 tracker image to
100 hits by every single client on the network. Nested tracker images
are a complete violation of privacy laws in any civilised country. My
choice to read an email or not is my personal information; embedding a
tracker image in that email to detect when and how many times I read it
is a complete violation of my privacy. Plus, the spammer's intention in
placing the tracker image is that it be downloaded; the creator wanted
it downloaded. I do not know if there are any legal issues here. In
essence this is a large number of hits on a tracker service.
Unfortunately, it is possible that a tracker service will be effectively
crippled by a large number of hits, because every single hit probably
generates a log entry in some database and uses a small but significant
amount of CPU power.
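
To make steps 1-3 a bit more concrete, here is a rough Python sketch of
the per-message checks. It is only an illustration of the idea: the RBL
zone is SpamCop's published one, but the function names and the 70%
threshold wiring are my own assumptions, not an existing code base.

import socket
from email import message_from_string
from html.parser import HTMLParser

def listed_on_rbl(sender_ip, zone="bl.spamcop.net"):
    """Step 1: query a DNS-based blocklist. An A-record answer means the
    sending IP is listed (blacklist); NXDOMAIN means it is not (whitelist)."""
    query = ".".join(reversed(sender_ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)
        return True
    except socket.gaierror:
        return False

class _ImgCollector(HTMLParser):
    """Step 2: collect <img> sources, i.e. candidate tracker images and
    links back to the spammer's site."""
    def __init__(self):
        super().__init__()
        self.images = []
    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.extend(value for name, value in attrs if name == "src")

def extract_tracker_images(raw_email):
    msg = message_from_string(raw_email)
    collector = _ImgCollector()
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            body = part.get_payload(decode=True) or b""
            collector.feed(body.decode("latin-1", "replace"))
    return collector.images

def vote_passes(yes_votes, no_votes, threshold=0.70):
    """Step 3: community vote on an unknown target site, 70% or better wins."""
    total = yes_votes + no_votes
    return total > 0 and yes_votes / total >= threshold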

A point to note: step 5 may also be on the border of legality. Imagine a
spammer sending out 1,000,000 spams, out of which 100,000 hit the
community. This would result in 1,000,000 opt-out messages. However,
consider the case where the spammer's website has no detectable means of
contact: that would result in 10,000,000 full page downloads against
random pages of the spammer's website, and in terms of raw hits this
could be 20 times that figure.
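
Just to spell out the arithmetic behind those figures (the "times 20" at
the end is my own rough guess at embedded resources per page, not a
measured number):

spams_sent         = 1_000_000
spams_hitting_us   = 100_000   # the portion that reaches community members
optouts_per_spam   = 10
pages_per_spam     = 100       # fallback when no contact details are found
resources_per_page = 20        # rough guess: images, CSS, scripts per page

print(spams_hitting_us * optouts_per_spam)                     # 1,000,000 opt-outs
print(spams_hitting_us * pages_per_spam)                       # 10,000,000 page downloads
print(spams_hitting_us * pages_per_spam * resources_per_page)  # ~200,000,000 raw hits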

Having said that:
point 1) Spammers themselves are not legal.
point 2) No single node on the network would ever submit the same
opt-out message more than once for a single site within a given period
of time (say 1 hour). I am led to believe (from a Google search) that
this would probably fall within the accepted fair-use policies of the
internet.
point 3) No single node would do a full page download of the same page
from a single website more than once within a given period of time (say
1 hour), the keyword being "same page". The same client could download a
few (say under 10) random pages from the same site, spaced out over
time. I am led to believe (from a Google search) that this would also
probably fall within the accepted fair-use policies of the internet. (A
throttling sketch along these lines follows this list.)
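
A sketch of how points 2 and 3 might look inside a client, just to show
the idea is simple to enforce locally. The one-hour window and ten-page
cap are the numbers suggested above, not tested values, and the class
name is my own invention:

import random
import time
from collections import defaultdict

class PerSiteThrottle:
    """Per-node fair-use limits: at most one opt-out per site per window,
    and only a handful of distinct random pages per site per window."""
    def __init__(self, window_seconds=3600, max_pages_per_site=10):
        self.window = window_seconds
        self.max_pages = max_pages_per_site
        self.last_optout = {}                  # site -> timestamp of last opt-out
        self.page_history = defaultdict(list)  # site -> [(timestamp, url), ...]

    def may_send_optout(self, site):
        last = self.last_optout.get(site)
        if last is not None and time.time() - last < self.window:
            return False
        self.last_optout[site] = time.time()
        return True

    def pick_page(self, site, candidate_urls):
        """Return a page not fetched recently, or None if the cap is reached."""
        now = time.time()
        recent = [(t, u) for t, u in self.page_history[site] if now - t < self.window]
        self.page_history[site] = recent
        if len(recent) >= self.max_pages:
            return None
        fetched = {u for _, u in recent}
        fresh = [u for u in candidate_urls if u not in fetched]
        if not fresh:
            return None
        choice = random.choice(fresh)
        self.page_history[site].append((now, choice))
        return choice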

I am no legal expert. These are just my personal thoughts and in no way
constitute legal advice. Proper legal counsel should be sought on all
these points. I have simply outlined the obvious points of contention.

Any legal experts with feedback?

Wicks

Dec 28, 2006, 4:04:43 AM12/28/06
to okopipi-dev
-----Part 2 of 2-------


Distributed computing versus P2P

For something along these lines, I personally feel that a distributed
computing technology, such as what is used in the SETI project but
adapted to a true P2P topology, would perhaps work better than a vanilla
P2P framework.

The reason I say this is that, as there is no central point of control
and no central server, there is a certain amount of data, such as
whitelists and blacklists, that needs to be held by the clients. Yes, I
realize that in theory this is volatile storage of that data, but what
is the chance that every single user goes offline at the same time and
stays offline for a while? If that happens, the network is dead anyway.

Along the same lines, this data would need to be propagated
automatically via the network itself. This would be rather advanced data
replication, however, and would require quite some thought, especially
since we are talking about a large number of distributed databases that
need to remain in sync, plus the added complication that it is not
possible to rely on complete round-trip messages across the entire
network.

One possible solution would be a given number of master nodes on the
network, where the network topology would be arranged as a double mesh:
an inner mesh of master nodes and an outer mesh of non-master nodes.
This is based on the simple concept employed some time ago on Windows
networks called the master browser. Each Windows PC on the network
contends for the position of master browser. Each PC also has a rank,
which is determined by its uptime and type of OS. For example, a server
OS on the network would always have a much higher base rank; if there
are two server OSes on the network, the contention would be decided by
uptime.

Something similar could be employed to decide which nodes get to be the
master nodes. The rank of each node could be a base rank plus an
additive, where the base rank would be calculated from the effective
bandwidth of the node as measured by the client application, and the
additive rank could be based on uptime.

The same rank could be used for positioning nodes within the outer mesh,
where higher-ranked nodes would sit closer to the master nodes, and
promotion to master node would be from the circle of next-hop neighbours
of the master node mesh.
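
A toy version of the rank calculation and master contention, just to pin
the idea down; the weighting of bandwidth versus uptime is completely
arbitrary here and would need proper tuning:

def node_rank(bandwidth_kbps, uptime_hours):
    """Base rank from measured bandwidth plus an additive for uptime.
    The factor of 10 is an arbitrary placeholder weight."""
    return bandwidth_kbps + uptime_hours * 10

def elect_masters(nodes, master_count):
    """nodes: iterable of (node_id, bandwidth_kbps, uptime_hours).
    The highest-ranked nodes win the contention, much like the Windows
    master-browser election this is modelled on."""
    ranked = sorted(nodes, key=lambda n: node_rank(n[1], n[2]), reverse=True)
    return [node_id for node_id, _, _ in ranked[:master_count]]

# Three nodes contending for two master slots: the high-bandwidth node
# and the long-uptime node win.
print(elect_masters([("a", 2000, 5), ("b", 512, 200), ("c", 8000, 1)], 2))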

"Master nodes" is most definitely plural. There are many reasons for
this; the most obvious is simply network load. From this point of view,
however, the number of masters does NOT need to be proportional to the
number of client nodes on the network. The reason I say this is that a
client node that is not a master does not need to be directly connected
to a master node; it can connect to a master node through a neighbour,
which in turn could connect through a neighbour. However, each master
node would have to be connected to either all, or a significant segment,
of the other master nodes.

Another key point is the potential for sabotage. There is no central
server to hit; however, this type of network can be effectively
sabotaged by the introduction of rogue clients, and since it is an
open-source project, rogue clients would not be that hard to build. A
rogue client acting as a non-master node could effect its sabotage by:
a) Perhaps introducing fake messages into the network.
b) Perhaps messing with the peer-to-peer routing tables of its
neighbours (and connecting to more neighbours); this would have a sort
of decoupling effect on the network.
etc.
As a master node, however, a rogue client would be able to do a hell of
a lot more damage. For this reason, all the possible means of sabotage
would need to be thought through carefully, and for each one a counter,
or at least a detection algorithm, would have to be devised. For
example: cross-validation of each message against 2 (or more) other
nodes with higher rank, where the rank of a node is obtained from its
nearest neighbours' opinion (a sketch of this follows below). In all
cases, numbers are on the side of this application, so they should be
used to our advantage:
1) There will always be more spammed people than spammers.
2) The true clients will be up and running before any rogue-client
infection can take place.
3) Simultaneous injection of hundreds of clients from a single IP
address would be quite easy to detect and counter.

Therefore, by induction: at any given point in time when an infection is
attempted, there will always be more true clients on the network than
rogue clients.
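
A rough sketch of the cross-validation check mentioned above. Everything
here is an assumption about how the data would be structured (a message
id, a map of which nodes have relayed it, and per-node rank reports from
neighbours); it just shows that the rule itself is cheap to evaluate:

import statistics

def neighbour_rank(node_id, rank_reports):
    """A node's rank as seen by its neighbours: the median of their reported
    values, so a single rogue neighbour cannot inflate (or sink) it alone."""
    reports = rank_reports.get(node_id, [])
    return statistics.median(reports) if reports else 0

def accept_broadcast(message_id, sender_id, confirmations, rank_reports,
                     min_confirmers=2):
    """Act on a broadcast only if at least `min_confirmers` nodes, each ranked
    above the sender by their own neighbours, have relayed the same message.
    `confirmations` maps message_id -> set of node ids that relayed it."""
    sender_rank = neighbour_rank(sender_id, rank_reports)
    higher_ranked = [n for n in confirmations.get(message_id, set())
                     if neighbour_rank(n, rank_reports) > sender_rank]
    return len(higher_ranked) >= min_confirmers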

That's all my ideas for now.

Please do let me know if there is a project team, whether anyone is
working on anything, where I could contribute, etc.

I am currently working two full-time contract jobs, but I will make time
for this.

Cheers

Wicks

Wicks

Dec 28, 2006, 4:22:14 AM12/28/06
to okopipi-dev
I forgot to mention, for tagging as spam: combining RBLs with
distributed filtering of some sort, such as Bayesian filtering. Also, to
save time and effort, the SpamAssassin code base could be used instead
of reinventing the wheel. Although open source, it's not GPL, and I
haven't checked the terms of the Artistic License, but I doubt the
authors would have any issue.

A second thought: if image filtering is also used, then yes, it is CPU
intensive, but once one message has been identified as spam by one of
the clients on the network, the remaining clients do not necessarily
have to recheck the same or a similar message. MD5 hashes of the
messages about to be processed could be broadcast over the network,
followed by, say, a one-minute wait for responses. The problem with an
MD5 hash, of course, is that only identical messages would be matched.
On the other hand, if something like the Google shingles algorithm can
be used, then all contextually similar messages would also be identified
as a hit. The problem is that the Google stuff is patented. (A quick
sketch of both ideas follows.)
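
A quick sketch of both ideas, just to show their shape. The shingling
here is the generic k-word variant with Jaccard overlap, deliberately
not any particular patented scheme; the k value and what counts as
"similar" would need experimentation:

import hashlib
import re

def exact_fingerprint(message_text):
    """MD5 of a whitespace/case-normalised body: catches only messages that
    are effectively identical."""
    normalised = re.sub(r"\s+", " ", message_text).strip().lower()
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()

def shingles(message_text, k=4):
    """Set of k-word shingles: two messages with a high overlap are
    'contextually similar' even when they are not byte-identical."""
    words = re.findall(r"\w+", message_text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}

def similarity(a, b):
    """Jaccard overlap of the two shingle sets, 0.0 (unrelated) to 1.0 (same)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0

A client could broadcast exact_fingerprint() first and fall back to a
shingle comparison only when no exact match comes back within the
one-minute wait.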
