Previously I wrote up a sketch for how blog readers might guard their privacy by forming reading clubs. The club would reveal the union of what everybody is reading, but inside the club it wouldn’t be possible to discern what each member was reading. Last time I stated that engineering this wouldn’t be particularly difficult, but I didn’t reveal my sketch for how it might be done.
My idea is/was that the club members would form into a circle. A stream of encrypted traffic would arrive from the left and be passed on to the right. This traffic would be chat about club business, i.e. what blogs the club is interested in along with notices of the state of those blogs. For example an entry might state “The club is interested in blog vile.example.com as of Jan 12 2006.” or “The blog at embaress.example.com was checked at 11:14am Jan 21; it last changed at 2:37 Dec 17th.” If you record the assertions in the stream for a sufficient period of time you can form a complete model of the blog reading club’s interests and the state of the blogs it reads.
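To make the "record the assertions and form a model" step concrete, here is a minimal sketch. The assertion layout (a dict with `kind`, `blog` and timestamp fields) and the names `BlogState` and `ClubModel` are assumptions for illustration; the design above does not specify a wire format.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional


@dataclass
class BlogState:
    interested_as_of: Optional[datetime] = None
    last_checked: Optional[datetime] = None
    last_changed: Optional[datetime] = None


@dataclass
class ClubModel:
    # blog host name -> everything the stream has told us about it so far
    blogs: Dict[str, BlogState] = field(default_factory=dict)

    def record(self, assertion: dict) -> None:
        """Fold one assertion from the passing stream into the model."""
        state = self.blogs.setdefault(assertion["blog"], BlogState())
        if assertion["kind"] == "interest":
            state.interested_as_of = assertion["as_of"]
        elif assertion["kind"] == "status":
            state.last_checked = assertion["checked"]
            state.last_changed = assertion["changed"]


model = ClubModel()
model.record({"kind": "interest", "blog": "vile.example.com",
              "as_of": datetime(2006, 1, 12)})
# The example entry gives no years (and no am/pm for 2:37); the values
# below are assumed purely so the sketch has concrete timestamps.
model.record({"kind": "status", "blog": "embaress.example.com",
              "checked": datetime(2006, 1, 21, 11, 14),
              "changed": datetime(2005, 12, 17, 2, 37)})
```

A listener that runs `record` on every assertion it sees ends up holding the club's complete records, whether or not anyone intended it to.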
My presumption was that individual club members would inject their interests into the stream. This turns out to be harder than I thought. If I inject my interest in lame.example.com into the stream my upstream neighbor can only tell that the club he’s a member of has increasingly lame interests, but if he collaborates with my downstream neighbor then he can pin the blame on me for that. Not good. Ben Laurie pointed this out to me.
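The collusion is easy to sketch. Assuming the two neighbors can compare assertions in some comparable form (plain strings here, and the function name `newly_injected` is made up for the example), anything that left me but never entered me must have come from me:

```python
def newly_injected(seen_upstream, seen_downstream):
    """Return assertions that left a member without ever having entered it."""
    upstream = set(seen_upstream)
    return [a for a in seen_downstream if a not in upstream]


# What my upstream neighbor passed to me ...
sent_to_me = ["interest vile.example.com"]
# ... and what my downstream neighbor then received from me.
received_from_me = ["interest vile.example.com", "interest lame.example.com"]

print(newly_injected(sent_to_me, received_from_me))
# prints ['interest lame.example.com']
```

Re-encrypting the traffic at each hop does not help by itself, since both neighbors are legitimate club members and can read what they handle.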
The full set of assertions collected by listening to the passing stream is a substitute for a centralized club house where the club keeps its records. That club house is a substitute for the ping aggregation service, i.e. the intermediary the club was meant to avoid. The whole point of this exercise is to hide the reading interests of individual club members from the intermediary.
Ben’s somewhat spontaneous suggestion for how to organize this club is to build a club house but run it so the individual members’ interactions are kept anonymous. Systems like TOR or the anonymous email remailer systems illustrate how to let the members communicate with the club house anonymously.
The club house could be a simple web site that enumerates all the blogs the club is keeping an eye on. It drops blogs off the list if nobody signals an interest in that blog for a period of time. Club members randomly poll blogs on the list and report what they find back to the club house, including the RSS feed should it change. When a member wants to read his blogs he syncs his copy of the club house data. This synchronization can be done in public if the club member is willing to reveal that he is a member of the club. Of course he should then synchronize the full database from the club house, since otherwise he’d reveal his peculiar interests. By establishing an anonymous connection to the club house, à la TOR, the member could avoid pulling down the entire database.
While last time I wrote that engineering a system like this is straightforward, this time I’m less confident of that.
I’m not particularly happy with the introduction of a central club house into the design. Who’s going to volunteer for that thankless task? I’d rather liked the idea that the club members were all asked to carry the same proportion of the load. But now I’m thinking that the streaming around the circle approach is just a scam for relocating the club’s records, and that I’ve gotten myself out on a limb.
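A rough sketch of the club house bookkeeping described above, kept in memory with no networking or anonymity layer. The class name `ClubHouse`, its method names and the `ttl_seconds` parameter are all assumptions for illustration, not part of Ben’s suggestion.

```python
import random
import time


class ClubHouse:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.last_interest = {}  # blog URL -> time of most recent interest signal
        self.feeds = {}          # blog URL -> latest feed content reported

    def signal_interest(self, blog, now=None):
        """A member (anonymously) says somebody still cares about this blog."""
        self.last_interest[blog] = time.time() if now is None else now

    def expire(self, now=None):
        """Drop blogs nobody has signalled interest in for longer than the TTL."""
        now = time.time() if now is None else now
        stale = [b for b, t in self.last_interest.items() if now - t > self.ttl]
        for blog in stale:
            del self.last_interest[blog]
            self.feeds.pop(blog, None)

    def pick_blog_to_poll(self, rng=random):
        """Pick a random listed blog for a member to check (list must be non-empty)."""
        return rng.choice(sorted(self.last_interest))

    def report(self, blog, feed):
        """Store a polled feed, but only for listed blogs and only if it changed."""
        if blog in self.last_interest and self.feeds.get(blog) != feed:
            self.feeds[blog] = feed

    def full_sync(self):
        """Return a copy of the whole database, as a public sync would."""
        return dict(self.feeds)


house = ClubHouse(ttl_seconds=10)
house.signal_interest("vile.example.com", now=0)
house.signal_interest("lame.example.com", now=8)
house.expire(now=15)                 # vile.example.com is 15s stale and is dropped
blog = house.pick_blog_to_poll()     # only lame.example.com is left
house.report(blog, "<rss>...</rss>")
```

Because members poll random blogs from the list rather than the ones they read, a poll says nothing about the poller’s own interests.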
Designing distributed anonymous peer to peer databases that enable clubs like this to form looks like a more meaty design problem than I expected. While I’m sure that’s fun for some folks it’s a bit of a barrier to making progress on the problem I care about.
From a technical point of view it might be possible to do this peer to peer over multicast with repeater nodes. It is still possible, though relatively unlikely, that someone on your own network subnet might finger you, but it would at least be somewhat obscure. What if every node broadcast fragmented packets over UDP multicast with a stream ID/part ID? Only nodes that managed to put the stream together would rebroadcast it to another subnet by tunnelling over TCP/IP. Nodes in the system would repeat packets with their stream and part ID until they had put together the entire stream. They would be reasonably unsure what was from whom (as any might have been repeated). The data would as a whole be encrypted twice: first with a subnet key and then with a club key. Ironically members on the same subnet would not be able to decrypt the message. Members on a different subnet would decrypt it and rebroadcast it at least once, re-encrypted with their own subnet key and the club key (which would finally allow the nodes on the network to be able to read it). Local nodes would be unable to tell whether the message was a retransmit of a local message or a transmission of a new message from afar.
The problem with this (up until now) is that eventually the (re)transmission rate grows without bound. To solve this we must keep some state. Once we’ve decrypted a message, we store a reduction (some sort of hash or short numeric representation, CRC, whatever) of the contents in a local database with a count. We will only retransmit any given message twice.
There remains a problem of identification along the branch route (the subnet chatter being the leaf route and the TCP/IP tunnel to another branch of the tree being the branch route). In order to reduce this risk every node transmits the data to a random place along the branch route (meaning to a different leaf in the other branch) once it has the full message.
This still exposes a problem of “elimination”, meaning that if you’re the only one who did not do a TCP transmission then I know that you sent it. Some structuring of the physical network could address this, but let us assume that we have little control over that. This might be solved by having the leaves retransmit packets over UDP multicast upon receipt, and having the original broadcaster put together its message as if it were receiving it for the first time and then retransmit to the branch nodes. There might be some ms or ns latency, but it is likely that other network events would add sufficient randomness to make this delay of transmit a difficult factor to use for identification without a large picture of the transmission patterns over time. It is still a question as to whether that could be an identifying factor.
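The stream ID/part ID idea can be sketched without any networking or encryption. The packet layout used here (a tuple that also carries the total part count) and the names `fragment` and `Reassembler` are assumptions; the description above only says that each packet carries a stream ID and a part ID.

```python
import random
from collections import defaultdict


def fragment(stream_id, payload, size):
    """Split payload into packets of (stream_id, part_id, total_parts, chunk)."""
    chunks = [payload[i:i + size] for i in range(0, len(payload), size)]
    return [(stream_id, idx, len(chunks), chunk) for idx, chunk in enumerate(chunks)]


class Reassembler:
    def __init__(self):
        self.parts = defaultdict(dict)  # stream_id -> {part_id: chunk}

    def accept(self, packet):
        """Store one packet; return the whole payload once every part is present."""
        stream_id, part_id, total, chunk = packet
        got = self.parts[stream_id]
        got[part_id] = chunk            # a repeated packet just overwrites itself
        if len(got) == total:
            return b"".join(got[i] for i in range(total))
        return None


payload = b"The club is interested in blog vile.example.com"
packets = fragment("stream-1", payload, size=8)
noisy = packets + packets[:2]           # simulate repeats from other nodes
random.shuffle(noisy)                   # multicast gives no ordering guarantee

node = Reassembler()
results = [node.accept(p) for p in noisy]
message = next(r for r in results if r is not None)
```

Since repeats are harmless to the receiver, a node cannot tell an original packet from one that some other node repeated, which is the property the scheme leans on.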
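A minimal sketch of that bookkeeping, assuming SHA-256 as the "reduction" (any hash or CRC would do, as noted) and an in-memory dict standing in for the local database. The name `RetransmitLimiter` is made up for the example.

```python
import hashlib


class RetransmitLimiter:
    def __init__(self, limit=2):
        self.limit = limit
        self.counts = {}  # message digest -> times we have retransmitted it

    def should_retransmit(self, message: bytes) -> bool:
        """Return True and count the send, unless the limit is already reached."""
        digest = hashlib.sha256(message).hexdigest()
        sent = self.counts.get(digest, 0)
        if sent >= self.limit:
            return False
        self.counts[digest] = sent + 1
        return True


limiter = RetransmitLimiter()
decisions = [limiter.should_retransmit(b"decrypted message") for _ in range(3)]
print(decisions)  # prints [True, True, False]
```

Only the digest is stored, so the database itself does not reveal message contents to someone who later inspects the node.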
The ideas behind this network are:
a. You are more likely [by content] to identify a member geographically close to you (a simplification of the “subnet” and network layout I know, but it works)
b. the act of transmission is a primary identifier
c. obscuring your geographical network location enhances your anonymity.
d. there are at least 3 members on every subnet (more than 4 is optimal)
Wow… You make me think too much when I’m sick… I’ll see you in GITMO 😉