Monday, June 2, 2008

CopyWench, a File Format for Publicly Discussing Undistributable Material

by Travis Goodspeed <travis at utk.edu>
at the Extreme Measurement Communications Center
of the Oak Ridge National Laboratory

Suppose that an author has written a document describing the reverse engineering of a ROM image, and he wishes to publish his work. Further, let us suppose that the author is an honest and law-abiding citizen of a nation with intellectual property laws that prevent him from distributing software copyrighted by another. He cannot distribute his work in its entirety, as it will cite numerous lines of the original. In this short article, I will discuss a format for a utility which--much like diff--allows an author to distribute only his changes to a document. Unlike diff, my format will not include so much as a single line of the original.

First, suppose this to be our secret file. It is from Wikipedia's BASIC article. We'll call it secret.txt:
10 INPUT "What is your name: ", U$
20 PRINT "Hello "; U$
30 INPUT "How many stars do you want: ", N
40 S$ = ""
50 FOR I = 1 TO N
60 S$ = S$ + "*"
70 NEXT I
80 PRINT S$
90 INPUT "Do you want more stars? ", A$
100 IF LEN(A$) = 0 THEN 90
110 A$ = LEFT$(A$, 1)
120 IF A$ = "Y" OR A$ = "y" THEN 30
130 PRINT "Goodbye ";U$
140 END
Each line of secret.txt has a one-way MD5 hash, which is what will be published in lieu of the line itself. To someone who does not have the secret file, 12b9bdf57db1a1c9e4fb2fb1b35e9c41 means nothing. But to someone who does have the file, it's rather easy to find that the hash belongs to line 10 above.
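
The per-line hashing described above can be sketched in a few lines. This is only an illustration in Python rather than the Perl used later in this article; the names `line_hash` and `build_table` are hypothetical, and encoding each line as UTF-8 before hashing is an assumption.

```python
import hashlib

def line_hash(line):
    """Return the hex MD5 digest of one line, without its trailing newline."""
    return hashlib.md5(line.encode("utf-8")).hexdigest()

def build_table(secret_text):
    """Map each secret line's hash back to the line itself."""
    return {line_hash(line): line for line in secret_text.splitlines()}

# Two lines of secret.txt stand in for the whole file.
secret = '10 INPUT "What is your name: ", U$\n20 PRINT "Hello "; U$\n'
table = build_table(secret)

some_hash = line_hash('20 PRINT "Hello "; U$')
print(table.get(some_hash))   # a holder of secret.txt recovers the line
print(table.get("0" * 32))    # a hash with no matching line yields None
```

Only someone who can build the table, that is, someone who already has secret.txt, can turn a published hash back into its line.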

Suppose this to be our private commentary:
Commentary of a BASIC program.
by John Doe.

10 INPUT "What is your name: ", U$
My name is John!
20 PRINT "Hello "; U$
Hello to you, too!
30 INPUT "How many stars do you want: ", N
Tons of stars!
40 S$ = ""
50 FOR I = 1 TO N
60 S$ = S$ + "*"
70 NEXT I
80 PRINT S$
90 INPUT "Do you want more stars? ", A$
Always.
100 IF LEN(A$) = 0 THEN 90
110 A$ = LEFT$(A$, 1)
120 IF A$ = "Y" OR A$ = "y" THEN 30
130 PRINT "Goodbye ";U$
Adieu.
140 END
By replacing every secret line--that is to say every line which is found in secret.txt--with a one-way hash of the original, we will arrive at something like the following, which is free of copyrighted material.
Commentary of a BASIC program.
by John Doe.

COPYWENCH 12b9bdf57db1a1c9e4fb2fb1b35e9c41
My name is John!
COPYWENCH 8a80ddc83dc0592951d29151d62487a4
Hello to you, too!
COPYWENCH cbbf650ae187427f3cd74a58a7b0edf4
Tons of stars!
COPYWENCH f842a481f76099503c63c611dc84784e
COPYWENCH 9376ab1f0bed02260134dc9d6b723433
COPYWENCH 5cbb0ef3dbb1b75c047b278b3480fa46
COPYWENCH 95cfeba9a3004c3e6b2db6ce3222341b
COPYWENCH 402c331f990b99054568e72315e5e01b
COPYWENCH f8ee49bfcd41e6a43266105e65964247
Always.
COPYWENCH 24407c853d6df83fecd3074dc44e5ef5
COPYWENCH cc67be8c43810e63859ae75abaf7b1c2
COPYWENCH 36c25fd5bf40b1a5a0ef8d68af1580c7
COPYWENCH 80a1d74d3d5b1321fe3805a6d6f6011c
Adieu.
COPYWENCH 811061ce8478b3e56543e103ccd52c67
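
Before publishing, an author might want to confirm that no secret line slipped through unhashed. The following is a minimal sketch of such a check in Python, not part of the CopyWench format itself; `leaked_lines` is a hypothetical helper, and it ignores blank lines on the assumption that they carry no copyrighted content.

```python
def leaked_lines(public_text, secret_text):
    """Return the public lines that also appear verbatim in the secret file."""
    # Blank or whitespace-only lines are skipped (an assumption of this sketch).
    secret_lines = {line for line in secret_text.splitlines() if line.strip()}
    return [line for line in public_text.splitlines() if line in secret_lines]

secret = '40 S$ = ""\n140 END\n'
public = "Tons of stars!\nCOPYWENCH f842a481f76099503c63c611dc84784e\n140 END\n"

# The last line of this public file was left unhashed by mistake.
print(leaked_lines(public, secret))
```

An empty list means that every line shared with the secret file has been replaced by its hash.
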
The following algorithm will translate between public and private files:
  1. For each line of secret.txt:
    1. Store the line in an associative array, keyed by its hash.
  2. Then for each line of the input file:
    1. If the line begins with COPYWENCH, print the secret line whose hash matches the second word.
    2. Else if the line's hash is a key of the associative array, print "COPYWENCH" followed by a space and the line's hash.
    3. Else print the input line unchanged.
I recommend this file format, which I've named the CopyWench format, because it is terribly easy to implement. A few minutes in any scripting language will result in a working implementation.
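
To give a sense of how little is needed, the algorithm above might be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the names `translate`, `line_hash` and `MARKER` are hypothetical, lines are hashed as UTF-8, a marker must occupy the whole line, and a marker whose hash is unknown is passed through unchanged.

```python
import hashlib
import re

# A marker line: the word COPYWENCH, a space, and a 32-digit hex MD5 hash.
MARKER = re.compile(r"^COPYWENCH ([0-9a-f]{32})$")

def line_hash(line):
    return hashlib.md5(line.encode("utf-8")).hexdigest()

def translate(input_text, secret_text):
    """Translate private text to public text, or public back to private."""
    # Step 1: store every secret line in a table keyed by its hash.
    table = {line_hash(line): line for line in secret_text.splitlines()}
    out = []
    # Step 2: handle each input line according to what it is.
    for line in input_text.splitlines():
        match = MARKER.match(line)
        if match and match.group(1) in table:
            out.append(table[match.group(1)])      # marker: restore the line
        elif line_hash(line) in table:
            out.append("COPYWENCH " + line_hash(line))  # secret: publish hash
        else:
            out.append(line)                       # commentary: unchanged
    return "".join(line + "\n" for line in out)

secret = '10 INPUT "What is your name: ", U$\n20 PRINT "Hello "; U$\n'
private = 'by John Doe.\n\n10 INPUT "What is your name: ", U$\nMy name is John!\n'

public = translate(private, secret)
print(public)
print(translate(public, secret) == private)
```

Because the same function handles both directions, running it twice with the same secret file returns the original private text.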

A sample implementation follows. As all implementations ought to be licensed so as to be distributable with the documents with which they are intended to be used, this one is in the public domain. Do with it as you will.

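#!/usr/bin/perl

#CopyWench 0.1
#Authored for the Public Domain
#by Travis Goodspeed <travis at utk.edu>

use strict;
use warnings;

#Either Digest::MD5 or Digest::Perl::MD5 is needed.
BEGIN {
    eval {
        require Digest::MD5;
        Digest::MD5->import('md5_hex');
    };
    if ($@) { # no Digest::MD5
        require Digest::Perl::MD5;
        Digest::Perl::MD5->import('md5_hex');
    }
}

#The usage message goes to STDERR so that it never lands in output.txt.
if (@ARGV != 1) {
    print STDERR "Usage: $0 secret.txt <input.txt >output.txt
Where secret.txt is the secret being written about.\n";
    exit 1;
}

my $secret = $ARGV[0];
my %lines;
my ($line, $hash);
open(my $fh, '<', $secret) or die "Cannot open $secret: $!\n";

#Build an associative array of secret lines, keyed by hash.
while (<$fh>) {
    chomp($_);
    $line = $_;
    #Blank lines are left readable rather than hashed.
    next if $line eq '';
    $hash = md5_hex($line);
    #print "$hash $line\n";
    $lines{$hash} = $line;
}
close $fh;

#Translate each line of standard input in whichever direction it needs.
while (<STDIN>) {
    chomp($_);
    $line = $_;
    $hash = md5_hex($line);
    if (exists $lines{$hash}) {
        #A secret line: publish only its hash.
        print "COPYWENCH $hash\n";
    } elsif ($line =~ m/^\s*COPYWENCH ([0-9a-f]{32})\s*$/ && exists $lines{$1}) {
        #A marker whose hash is known: restore the secret line.
        print "$lines{$1}\n";
    } else {
        #Commentary, or a marker for a line we do not have.
        print "$line\n";
    }
}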

1 comment:

Travis Goodspeed said...

This doesn't play well with the asm-mode of Emacs. Perhaps a complete version ought to drop all surrounding whitespace before hashing a line?
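
The whitespace idea in this comment could look something like the following. It is a sketch in Python rather than a change to the Perl script; `normalized_hash` is a hypothetical name, the assembly line is invented for illustration, and stripping with `str.strip()` is one possible reading of "drop all surrounding whitespace".

```python
import hashlib

def normalized_hash(line):
    """Hash a line after dropping its leading and trailing whitespace."""
    return hashlib.md5(line.strip().encode("utf-8")).hexdigest()

original = "MOV R1, R2"              # the line as it appears in the secret file
reindented = "\t    MOV R1, R2   "   # the same line after an editor reindents it

# Both spellings hash identically, so the reindented line is still recognized.
print(normalized_hash(original) == normalized_hash(reindented))
```

Hashes computed this way would not match those produced by the script above, and a restored line would carry the secret file's indentation rather than the editor's, so both sides of an exchange would need to agree on the rule.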