Diplomatic cables - a fast browsing and searching program


In the summer of 2011, a large number of diplomatic cables from US embassies around the world became available on the internet for anyone to read. The USA obviously did not intend this, but neither did those who held the data, such as WikiLeaks, who were in the process of releasing the cables in edited form, with some sensitive data (such as the names of some people) removed.

The problem is that the data is stored in one very big text file of more than 1.7 GB (1730507223 bytes, to be precise), which makes it hard to search.

Something fast and offline is needed to manipulate this file. Here is the result of my quick and dirty work.


This program works under Unix. It may work under Windows or other operating systems too, but that was not tested.


Unfortunately, it's not point-and-click-easy to use.

1. Fix cables.csv

Obviously, you need the file cables.csv. Search the internet, it's easy to find.

First, you should fix it (it has garbage in it) by running the fix.sh script. The an program must exist as ./an/an and cables.csv must be in the current directory. Maybe an works with an unfixed cables.csv; I didn't try. (I guess the dictionaries would be bigger and the process a bit slower, but that's all. Maybe...)

Anyway, here is what to do, in a shell:

/tmp/cables> ls
an-0.1.tar.gz  cables.csv
/tmp/cables> tar xf an-0.1.tar.gz
/tmp/cables> cd an-0.1
/tmp/cables/an-0.1> make
/tmp/cables/an-0.1> cd ..
/tmp/cables> ln -s an-0.1 an
/tmp/cables> an/fix.sh

And there you are: a fixed cables.csv.

2. Generate index

You must then generate the index file, which records where each cable starts in cables.csv.

/tmp/cables> an/an cables.csv -build-index -index-file index.bin
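To give an idea of what such an index holds, here is a minimal Python sketch. It is not the code of an, and the format of index.bin is not described here. The sketch assumes that a record ends at a newline outside double quotes and that a quote inside a quoted field is written as two quotes; both are assumptions about cables.csv. The name record_offsets is made up for the example.

```python
def record_offsets(data: bytes) -> list[int]:
    """Return the byte offset at which each record starts."""
    offsets = []
    in_quotes = False
    at_record_start = True
    for pos, byte in enumerate(data):
        if at_record_start:
            offsets.append(pos)
            at_record_start = False
        if byte == ord('"'):
            # A doubled quote toggles twice, so the state is preserved.
            in_quotes = not in_quotes
        elif byte == ord("\n") and not in_quotes:
            # A newline outside quotes ends the record.
            at_record_start = True
    return offsets


# Two made-up records; the first one contains a newline inside a field.
sample = b'"1","first cable\nsecond line"\n"2","he said ""hi"""\n'
offsets = record_offsets(sample)
print(offsets)
```

With such a list of offsets, reading cable number n is a seek to the n-th offset instead of a scan of the whole file. A byte-by-byte loop in Python would be slow on 1.7 GB; the sketch only shows the idea.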

3. Generate dictionaries

Then the dictionaries.

This step is the longest (one minute on my computer, a 2 GHz PC with 2 GB of memory).

It uses a lot of memory, so don't surf the web or play video games at the same time, unless you have plenty to spare.

/tmp/cables> an/an cables.csv -dict -index-file index.bin -dict-file dict.8 -table-file table.8 -dict-fields 10000000

You should get the following output:

252116581 words, 675305 unique words, 14118 collisions

The option -dict-fields decides which field(s) are used to build the dictionary. cables.csv is a file of records, and each record has eight fields; the option takes one digit per field. In the example above only field 8, which contains the text of the cables, is used. To include another field, replace its '0' by '1'.
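As an illustration, here is a small Python sketch that decodes such a mask into field numbers. It assumes that the leftmost digit stands for field 8 and the rightmost for field 1. That ordering is only a guess that fits the example (10000000 selects field 8); check it against the program before relying on it. The function name selected_fields is made up.

```python
def selected_fields(mask: str) -> list[int]:
    """Decode a -dict-fields style mask into a sorted list of field numbers.

    Assumption: the leftmost digit is field 8, the rightmost is field 1.
    """
    if len(mask) != 8 or set(mask) - {"0", "1"}:
        raise ValueError("expected eight characters, each '0' or '1'")
    n = len(mask)
    return sorted(n - i for i, digit in enumerate(mask) if digit == "1")


print(selected_fields("10000000"))  # [8]
print(selected_fields("10000001"))  # [1, 8] under the assumed ordering
```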

4. Simple queries

Let's go by example, it's simpler.

Which cables contain "war"?

/tmp/cables> an/an cables.csv -index-file index.bin -dict-file dict.8 -table-file table.8 -word-search war

It returns the numbers of the cables containing the word "war".

Note that the dictionary only has words made of letters and digits. In the dictionaries, the words are all in upper case, so searching for "war" is the same as searching for "WAR" or "War". And a hyphenated word such as "anti-fraud" won't exist in the dictionary; there will be two words, "anti" and "fraud".
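The word-splitting rule described above can be sketched in a few lines of Python. This is an illustration, not the code of an; it assumes that only ASCII letters and digits count, and the name dictionary_words is made up.

```python
import re


def dictionary_words(text: str) -> list[str]:
    """Split text into runs of ASCII letters and digits, in upper case."""
    return [w.upper() for w in re.findall(r"[A-Za-z0-9]+", text)]


print(dictionary_words("The anti-fraud unit met in 2009."))
# ['THE', 'ANTI', 'FRAUD', 'UNIT', 'MET', 'IN', '2009']
```

The hyphen is dropped like any other punctuation, which is why "anti-fraud" ends up as two dictionary words.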

Then you can look at the cable:

/tmp/cables> an/an cables.csv -index-file index.bin -dict-file dict.8 -table-file table.8 -get-cable 2 | less

You can also view a single field if you prefer by adding -get-cable-field, as in:

/tmp/cables> an/an cables.csv -index-file index.bin -dict-file dict.8 -table-file table.8 -get-cable 2 -get-cable-field 8 | less
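Every command repeats the same file options. If you drive an from a script, a small helper can build the argument list for you. The sketch below is only a convenience wrapper; it uses the file names from the examples above, and the names COMMON and an_command are made up.

```python
import shlex

# Options shared by all the query commands shown in this page.
COMMON = [
    "an/an", "cables.csv",
    "-index-file", "index.bin",
    "-dict-file", "dict.8",
    "-table-file", "table.8",
]


def an_command(*extra: str) -> list[str]:
    """Return the full argument list for one invocation of an."""
    return COMMON + list(extra)


cmd = an_command("-get-cable", "2", "-get-cable-field", "8")
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, check=True), after "import subprocess".
```

Passing the arguments as a list means no shell is involved, so nothing in a query gets altered on the way to the program.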

5. Complex queries

I made a simple language for complex queries.

If you want the cables containing the word "war" and the word "peace", type:

/tmp/cables> an/an cables.csv -index-file index.bin -dict-file dict.8 -table-file table.8 -search "/and[war,peace]"

There are a lot of them: 10107 (if my program is correct).

Instead of "/and" you can use "/or". There is also "/not". You can nest queries, like "/or[/not[love], /and[sex,clinton]]", which returns the cables that don't contain the word "love" or that contain both "sex" and "clinton" (249141, says the program).

Don't overuse "/not"; it tends to return a lot of cables.
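One way to picture these operators is as set operations on cable numbers. The Python sketch below uses a toy index with made-up data; it illustrates the semantics only and says nothing about how an computes the result. All names in it are hypothetical.

```python
# Toy inverted index: upper-case word -> set of cable numbers (made-up data).
index = {
    "WAR": {1, 2, 5},
    "PEACE": {2, 3, 5},
    "LOVE": {4},
}
all_cables = {1, 2, 3, 4, 5, 6}


def word(w: str) -> set[int]:
    """Cables containing the word (case does not matter)."""
    return index.get(w.upper(), set())


def q_and(*sets: set[int]) -> set[int]:
    return set.intersection(*sets)


def q_or(*sets: set[int]) -> set[int]:
    return set.union(*sets)


def q_not(s: set[int]) -> set[int]:
    # Everything except s: this is why /not tends to return a lot of cables.
    return all_cables - s


print(sorted(q_and(word("war"), word("peace"))))  # [2, 5]
print(sorted(q_or(q_not(word("love")), q_and(word("war"), word("peace")))))
# [1, 2, 3, 5, 6]
```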

You can also look for sequences of words, like "bin laden".

/tmp/cables> an/an cables.csv -index-file index.bin -dict-file dict.8 -table-file table.8 -search "/seq[bin,laden]"

(1029 cables.)

You can search for long sentences, like "I want to kill Bin Laden":

/tmp/cables> an/an cables.csv -index-file index.bin -dict-file dict.8 -table-file table.8 -search "/seq[I,want,to,kill,Bin,Laden]" -verbose|less

(Hmm, bad luck. 0 cables.)

If you pass -verbose, you get some more information that may be useful (like the number of cables matching the search, handy for writing nice web pages like the one you are reading).
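Here is a minimal Python sketch of what a "/seq" match means, assuming the words must appear next to each other and in the given order. It scans the text of each cable directly, which is surely not how an does it with its dictionary files; the names and the sample cables are made up.

```python
import re


def words(text: str) -> list[str]:
    """Upper-case runs of ASCII letters and digits."""
    return [w.upper() for w in re.findall(r"[A-Za-z0-9]+", text)]


def contains_sequence(cable_text: str, phrase: list[str]) -> bool:
    """True if the words of phrase appear consecutively in the cable."""
    tokens = words(cable_text)
    target = [w.upper() for w in phrase]
    n = len(target)
    return any(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))


cables = {
    1: "Reports mention Bin Laden again.",
    2: "The bin was full; Laden ships waited.",
}
matches = [num for num, text in cables.items()
           if contains_sequence(text, ["bin", "laden"])]
print(matches)  # [1]
```

Cable 2 contains both words but not side by side, so "/and[bin,laden]" would match it while "/seq[bin,laden]" would not.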

Finally, there are regular expressions. For example, to search for all the words derived from "ugly" (like ugliness, uglier, ugliest):

/tmp/cables> an/an cables.csv -index-file index.bin -dict-file dict.8 -table-file table.8 -search "/reg,^ugl[yi].*,"

(792 cables.)

Here, there is some syntax to learn. You type "/reg", followed by a delimiter character (here it's ','), followed by the regular expression, followed by the same character again. Why? So that if you need ',' in your regular expression, you can simply pick another delimiter (like ':' or '/' or whatever) that does not appear in the regular expression.

Beware! The shell may alter what you type. Run with -verbose and look at what the program prints to see which pattern was really used. You might need to escape some characters with '\' for them to reach the program unchanged ('$', for example, has a special meaning for the shell). This is where a GUI program is needed, I guess...
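The delimiter rule can be sketched in Python as follows. This is an illustration of the syntax described above, not the parser used by an; parse_reg is a made-up name. The last line shows how Python's shlex.quote wraps a query in single quotes so that the shell leaves characters like '$' and '*' alone.

```python
import shlex


def parse_reg(query: str) -> tuple[str, str]:
    """Split '/reg<d>pattern<d>rest' into (pattern, rest)."""
    prefix = "/reg"
    if not query.startswith(prefix) or len(query) <= len(prefix):
        raise ValueError("not a /reg query")
    delim = query[len(prefix)]          # the character right after "/reg"
    start = len(prefix) + 1
    end = query.find(delim, start)      # next occurrence closes the pattern
    if end == -1:
        raise ValueError("missing closing delimiter")
    return query[start:end], query[end + 1:]


print(parse_reg("/reg,^ugl[yi].*,"))  # ('^ugl[yi].*', '')
print(parse_reg("/reg:^a,b$:]"))      # ('^a,b$', ']')
print(shlex.quote("/reg,^ugl[yi].*$,"))  # '/reg,^ugl[yi].*$,'
```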

Regular expressions are a bit hard to use if you don't know them. Search the internet; I have no time for more explanation than that. (I might also have used them wrongly in the code. If you find something weird, get in touch so I can fix it.)

Just remember that regular expressions are tested against the words in the dictionary, not against the file cables.csv, so don't try "/reg" queries spanning more than one word.

The good point is that wherever you can put a word, you can also put a regular expression. For example: "/and[/seq[bin,laden],/reg:^ugl[yi].*:]" (14 cables with Bin Laden and "ugly/ugliness/etc.").
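A toy Python sketch of that idea: the pattern is tried on each dictionary word, and the cables of all matching words are merged. The data is made up, Python's re module may differ in details from the regular expression library used by an, and case-insensitive matching is an assumption (the dictionary is upper case while the example pattern is lower case).

```python
import re

# Toy dictionary: upper-case word -> cable numbers (made-up data).
index = {
    "UGLY": {3, 7},
    "UGLINESS": {7, 9},
    "UGANDA": {1},
    "BIN": {3, 4},
    "LADEN": {3},
}


def reg(pattern: str) -> set[int]:
    """Union of the cables of every dictionary word the pattern matches."""
    compiled = re.compile(pattern, re.IGNORECASE)  # assumption: case ignored
    hits: set[int] = set()
    for dict_word, cables in index.items():
        if compiled.search(dict_word):
            hits |= cables
    return hits


print(sorted(reg(r"^ugl[yi].*")))  # [3, 7, 9]
print(sorted(reg(r"bin laden")))   # []
```

The second query finds nothing because no single dictionary word contains a space; for that kind of search, "/seq" is the right tool.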



Related work

There is an online search website; go to WikiLeaks. It's better, but it's online. You can, however, get the sources and run a local web server on your computer. I don't know if that eats a lot of resources.

Contact: sed@free.fr

Created: Tue, 06 Sep 2011 16:29:12 +0200
Last update: Mon, 19 Sep 2011 09:54:17 +0200