Extracting words from a PDF dictionary

Why?

I cheat at Scrabble-like games. I need a dictionary of known words. The Turkish version I play (kelimelik), I mean cheat at, uses a thing called Büyük Türkçe Sözlük (or is it Güncel Türkçe Sözlük?). Unfortunately, there is no word list for it, unlike the French ODS6.

But there is a PDF out there (maybe here, otherwise dig the web).

So I need to extract words from it. And only words, not the definitions.

Okay, no problem, let's use pdftotext.

Have a look at the first page of the PDF file:

[image: BTS page 1]

Here is the beginning of what pdftotext gives us:

Büyük Türkçe Sözlük
Sürüm No: 1.0
Açı
klama

Farabi

(veya ağ nıiçine) bakmak
zın
* ne söyleyeceğ beklemek.
ini
* onun sözüne göre davranmak.
... (bir) hâl almak
* bir duruma gelmek.
... canlı
sı
* düş
künü.
... damgası vurmak
nı
* (biri için) kötü bir yargı varmak.
ya
... -e kuvvet

See? That does not work. At all. (Açıklama is cut. All words with those funny ı, ğ and ş are cut.)

Selecting from inside xpdf does not work either.

And no, I won't try with Adobe's software. No way I run those on my computer.

So, some hacking is needed. We'll extract things by hand. Let's go!

Extracting objects

I already extracted stuff from a PDF file a few years ago (see here).

I just used the programs I wrote at the time to get the objects. I needed to modify extract because the file uses ^M instead of \n to separate lines.

So, here is ieee754.tar.gz, a new version that works with this file.

To run it, create a subdirectory, go into it and type:

  ../ieee754/extract < ../Buyuk.pdf

You get around 12000 obj files. Yes, the dictionary has more than 4000 pages and uses several objects per page.

The obj files are named 00001.obj, 00002.obj, and so on.

Get pages

You need to get the objects defining pages. For that, there is a special object with a /Kids array in it. Looking at the objects with less *.obj, and typing :n to go to next file, we are lucky and that thing is in 00003.obj.

Have a look at the beginning of it:

3 0 obj<</Type/Pages/Count 4041/Kids[ 9 0 R 12 0 R 15 0 R 18 0 R
 2 1 0 R 24 0 R 27 0 R 30 0 R 33 0 R 36 0 R 39 0 R 42 0 R 45 0 R
 48 0 R 51 0 R 54 0 R 57 0 R 60 0 R 63 0 R 66 0 R 69 0 R 72 0 R 
75 0 R 78 0 R 81 0 R 84 0 R 87 0 R 90 0 R 93 0 R 96 0 R 99 0 R 1
02 0 R 105 0 R 108 0 R 111 0 R 114 0 R 117 0 R 120 0 R 123 0 R 1
26 0 R 129 0 R 132 0 R 135 0 R 138 0 R 141 0 R 144 0 R 147 0 R 1
50 0 R 153 0 R 156 0 R 159 0 R 162 0 R 165 0 R 168 0 R 171 0 R 1
74 0 R 177 0 R 180 0 R 183 0 R 186 0 R 189 0 R 192 0 R 195 0 R 1
98 0 R 201 0 R 204 0 R 207 0 R 210 0 R 213 0 R 216 0 R 219 0 R 2
22 0 R 225 0 R 228 0 R 231 0 R 234 0 R 237 0 R 240 0 R 243 0 R 2
46 0 R 249 0 R 252 0 R 255 0 R 258 0 R 261 0 R 264 0 R 267 0 R 2
70 0 R 273 0 R 276 0 R 279 0 R 282 0 R 285 0 R 288 0 R 291 0 R 2
94 0 R 297 0 R 300 0 R 303 0 R 306 0 R 309 0 R 312 0 R 315 0 R 3
18 0 R 321 0 R 324 0 R 327 0 R 330 0 R 333 0 R 336 0 R 339 0 R 3
42 0 R 345 0 R 348 0 R 351 0 R 354 0 R 357 0 R 360 0 R 363 0 R 3
66

And the end:

12063 0 R 12066 0 R 12069 0 R 12072 0 R 12075 0 R 12078 0 R 1208
1 0 R 12084 0 R 12087 0 R 12090 0 R 12093 0 R 12096 0 R 12099 0 
R 12102 0 R 12105 0 R 12108 0 R 12111 0 R 12114 0 R 12117 0 R 12
120 0 R 12123 0 R 12126 0 R 12129 0 R]>>^Mendobj

The first page is defined in object 9 (9 0 R), the second in object 12, and so on.

So, let's put those 9, 12, ..., into a file.

Here comes the magic shell line. Note that other shell commands may be used. I did it that way because I'm used to those commands and they popped up in my brain when I needed to do something.

cat 00003.obj | cut -f 2 -d '['|cut -f 1 -d ']'|sed -e "s/ 0 R//g" |sed -e "s/ //"|tr -s ' ' '\n' > PAGES

Let's explain it, piece by piece.

cat 00003.obj outputs the file.

cut -f 2 -d '[' removes everything before the [.

cut -f 1 -d ']' removes the end of the output after the ].

From now on we are left with 9 0 R ... 12129 0 R.

sed -e "s/ 0 R//g" removes all the 0 R.

We now have the object numbers.

sed -e "s/ //" removes the first space. Not strictly necessary, but without it the produced file starts with an empty line that has to be removed by hand.

tr -s ' ' '\n' replaces spaces by newlines. We end up with one number per line.

> PAGES puts the numbers in the file PAGES.

Ah yes. | is a pipe. It makes the output of a command the input of the next command. That's how you glue commands together in Unix. That's a nice little thing, except it is limited to one output connected to one input.

Get data file of page

Here comes 00009.obj (the object describing the first page).

9 0 obj<</Type/Page/Parent 3 0 R/MediaBox[0 0 595 842]/Resources
<</ProcSet[/PDF/Text]/Font<</F1 6 0 R/F2 7 0 R/F3 8 0 R>>>>/Cont
ents 4 0 R/Annots 12144 0 R>>^Mendobj

From it we find that the content of the page is in object 4 (/Contents 4 0 R).

To extract this 4 (and the same for the 4000 other pages), here is the magic shell line.

for i in `cat PAGES`; do sed -e "s/.*Contents //" `printf "%5.5d" $i`.obj |cut -f 1 -d ' '; done > PAGES_DATA

Let's explain.

for i in `cat PAGES`; do ...; done runs the commands for all the numbers in the file PAGES. Each number is put in the variable i that you use as $i.

`cat PAGES` is used instead of a | because a for loop takes its list of words from the command line, not from its standard input. You can't write:

cat PAGES | for i ; do ...; done

Why? Because the shell is done the way it is. Those who invented it could have chosen another way to do things.

Anyway, we have to put the command between ` and pass it as it is written.
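A for loop cannot read from a pipe, but a while read loop can. Here is a minimal sketch of that alternative. The function name obj_names is made up for this example; it only prints the file names instead of running sed on them.

```shell
# Print the zero-padded .obj file name for every object number read
# from standard input (one number per line, as in the file PAGES).
obj_names() {
  while read -r i; do
    printf '%5.5d.obj\n' "$i"
  done
}

# Usage, assuming PAGES exists in the current directory:
#   cat PAGES | obj_names
#   obj_names < PAGES
```

With `cat PAGES`, the whole list is expanded on the command line before the loop starts. With while read, the numbers are consumed one line at a time.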

sed -e "s/.*Contents //" `printf "%5.5d" $i`.obj removes everything before (and including) Contents from the file we pass to it.

And the file we pass to it is given by `printf "%5.5d" $i` which transforms i into a five digit version of it. We then append .obj to it and get the file name. (%5.5d might not be necessary. Maybe %5d or %.5d is enough. But I never remember what number does what, so I put both, lazy me.)
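To settle the question of which number does what, here is a small sketch that can be pasted into a shell. The helper pad is made up for this example; it simply calls printf with the given format.

```shell
# %5d   : minimum field width of 5, padded with spaces
# %.5d  : precision of 5, that is at least 5 digits, padded with zeros
# %5.5d : both at once; the precision already fills the field
# %05d  : field width of 5 with the zero-padding flag
pad() {
  printf "$1" "$2"
}

# Usage:
#   pad '%5d'   9    prints "    9"
#   pad '%.5d'  9    prints "00009"
#   pad '%5.5d' 9    prints "00009"
#   pad '%05d'  9    prints "00009"
```

So %.5d (or %05d) is enough to build the file names, and %5d alone is not.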

We now have:

4 0 R/Annots 12144 0 R>>^Mendobj

cut -f 1 -d ' ' just keeps the first number (here 4).

> PAGES_DATA puts those numbers into the file PAGES_DATA.

Get length object of page

The file 00004.obj looks like:

4 0 obj<</Length 5 0 R/Filter/FlateDecode>>stream
[ binary data ]
endstream^Mendobj

It's compressed data, using the deflate algorithm, the one used by zip and zlib (/FlateDecode).

I wrote the program deflate (in the .tar.gz given above). But that program needs to get the length of the binary data. Which is... in object 5 (/Length 5 0 R).

So we need to get that length object.

And another magic shell line:

for i in `cat PAGES_DATA`; do head -n 1 `printf "%5.5d" $i`.obj | sed -e "s/.*Length //"| cut -f 1 -d ' '; done > PAGES_LENGTH_INDEX

The head thing outputs only the first line of the file, then we remove everything before Length (the sed command) and then we only keep the number (the cut command), which is 5 in the example above. This is put into the file PAGES_LENGTH_INDEX.

Get length of compressed data

We now get the actual length (in bytes) of the compressed data.

00005.obj contains:

5 0 obj 2349^Mendobj

The length is 2349. Here is the shell command to get it:

for i in `cat PAGES_LENGTH_INDEX`; do cat `printf "%5.5d" $i`.obj | tr -s '\015' ' '|cut -f 4 -d ' '; done > PAGES_LENGTH

It's similar to the previous command. We have tr -s '\015' ' ', which replaces the ^M by a space. Without it, the following cut would return 2349^Mendobj instead of the desired 2349.

Build unzip script

We now have the compressed objects (in the file PAGES_DATA) and their lengths (in the file PAGES_LENGTH). We need to call deflate with the correct arguments.

I didn't find any shell magic, because I have two files and I need to read them at the same time. So I wrote a little C program. (If someone on this planet knows how to do it with shell magic...)
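One possible answer to that question, as a sketch only: paste joins line N of the first file with line N of the second, and read splits the result back into two variables. The function name make_unzip_script is made up for this example. Unlike the C program, it does not abort when the second file is shorter than the first.

```shell
# $1: file with the data object numbers (PAGES_DATA)
# $2: file with the matching lengths (PAGES_LENGTH)
# Prints one deflate command line per page.
make_unzip_script() {
  paste "$1" "$2" | while read -r data size; do
    # The two %s receive '\015' and '\n' verbatim, quotes included.
    printf '../ieee754/deflate %d < %5.5d.obj | tr -s %s %s > %5.5d.txt\n' \
      "$size" "$data" "'\\015'" "'\\n'" "$data"
  done
}

# Usage:
#   make_unzip_script PAGES_DATA PAGES_LENGTH > unzip.sh
```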

cat > unz.c << EOF
#include <stdio.h>
#include <stdlib.h>

int main(int n, char **v)
{
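  FILE *f1, *f2;
  int data, size;
  /* v[1]: file of data object numbers, v[2]: file of their lengths */
  if (n != 3) { fprintf(stderr, "usage: unz DATA LENGTHS\n"); return 1; }
  f1 = fopen(v[1], "r"); if (!f1) abort();
  f2 = fopen(v[2], "r"); if (!f2) abort();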
  while (1) {
    if (fscanf(f1, "%d", &data) != 1) break;
    if (fscanf(f2, "%d", &size) != 1) abort();
    printf("../ieee754/deflate %d < %5.5d.obj | tr -s '\\\015'"
           " '\\\n' > %5.5d.txt\n", size, data, data);
  }
  return 0;
}
EOF

Then I compile and run it.

gcc -o unz unz.c -Wall
./unz PAGES_DATA PAGES_LENGTH > unzip.sh
chmod +x unzip.sh

The file unzip.sh starts with:

../ieee754/deflate 2349 < 00004.obj | tr -s '\015' '\n' > 00004.txt
../ieee754/deflate 2603 < 00010.obj | tr -s '\015' '\n' > 00010.txt
../ieee754/deflate 2120 < 00013.obj | tr -s '\015' '\n' > 00013.txt
../ieee754/deflate 2041 < 00016.obj | tr -s '\015' '\n' > 00016.txt
../ieee754/deflate 1859 < 00019.obj | tr -s '\015' '\n' > 00019.txt
../ieee754/deflate 2316 < 00022.obj | tr -s '\015' '\n' > 00022.txt
../ieee754/deflate 1817 < 00025.obj | tr -s '\015' '\n' > 00025.txt
../ieee754/deflate 2278 < 00028.obj | tr -s '\015' '\n' > 00028.txt

Get text data

00004.txt starts with:

4 0 obj<</Length 12182 0 R>>stream
q
BT
0 0 0 rg
 0 Tc 0 Tw /F1 12 Tf
1 0 0 1 242 749  Tm
[(B)-3(\374)2(y\374)1(k)-249(T)-2(\374)2(r)4(k)1(\347e)-252(S\366)1(zl\374)1(k)]TJ
 0 Tc 0 Tw /F2 9.96 Tf
1 0 0 1 70 737  Tm
[(S)-3(\374)-4(r)-4(\374)-4(m)-253(No)4(:)-251(1.)1(0)-1013(F)-3(a)-4(r)-4(a)-4(b)4(i)]TJ
0 0 1 rg
 0 Tc 0 Tw /F2 9.96 Tf
1 0 0 1 70 726  Tm
[(A)2(\347)]TJ
 0 Tc 0 Tw /F3 9.96 Tf
1 0 0 1 81 726  Tm
<00d6>Tj
 0 Tc 0 Tw /F2 9.96 Tf
1 0 0 1 84 726  Tm
[(kla)-4(ma)]TJ

This prints the beginning of page 1, up to and including "Açıklama".

After reading a few of those objects, we see that PDF commands are put one per line. This eases our life a lot. To parse PDF commands, we can work line by line.

What do we need from that?

We need Tj and TJ lines, which are the commands that display text.

TJ takes an array (between [ and ]) as argument. In this array, there are actual strings (between ( and )) and numbers, which move the pen to the left or to the right.

After printing a string, the position is moved by a width value (more on that later). The number in TJ moves it a little bit more, to the right, or to the left.

Tj is used in this PDF file to display the weird Turkish letters (ı, İ, ş, Ş, ğ and Ğ). It always comes as <XXXX>Tj with XXXX being 00d6, 00f7, 00f8, 00f9, 00fa or 00fb for, respectively, ı, Ğ, ğ, İ, Ş and ş. I'll explain later how I found that out.

But we also need Tm, which moves the pen to the specified position. PDF says it creates a transformation matrix, but in this file it's only used to set the position of the pen. No rotation or scale is applied through the Tm command.

And we also need the Tf lines, which specify which font is used. We could skip that, because the file uses only three fonts: the first one for the title, the second one for ordinary text and the third for the Turkish letters. And since the Turkish letters always come in a Tj command, we could be done with that. For whatever reason I chose to process the Tf lines. My brain likes a bit of autonomy from time to time.

And now we can glue all that together and throw a magic shell line to get the wanted lines from all the *.txt we just produced.

for i in `cat PAGES_DATA`; do grep -e "Tf$" -e "Tm$" -e "Tj$" -e "TJ$" `printf "%5.5d" $i`.txt; echo XXX; done > FULL_TXT

The grep does it all, selecting lines ending with Tf, Tm, Tj or TJ.

We also put XXX after each page (echo XXX). This is used to reset the current state in the next program (retext), the one that will extract text from the file FULL_TXT we just produced (this file is 50822895 bytes long).

Reconstruct text

That part is the hardest.

The PDF file puts letters at given positions on a page. There is no information on words. In fact, there is no space or new line at all.

But we are lucky. The commands come in order, that is line by line and from left to right on each line.

The line's position is given as the last argument of a Tm line. So each time we see a Tm with a modified last argument, we know there is a new line and we print a \n. Each time we see an XXX that we put to separate pages, there is obviously also a new line to put.

We are now faced with the problem of words. How do we know that between the last letter and the next one we must put a space or not?

There we need to know the width of a letter. And we need to check that the next position is "far enough" to insert a space.

The next position is by default "last position + width". But in TJ the number changes that.

The first TJ line in the file is:

[(B)-3(\374)2(y\374)1(k)-249(T)-2(\374)2(r)4(k)1(\347e)-252(S\366)1(zl\374)1(k)]TJ

The numbers are -3 (no space), 2 (no space), 1 (no space) and then we have -249 and there is a space, because that number is "big enough" to force the insertion of a space at that position.

It's all very heuristic. What is "big enough"? -100? -200? -300?

It depends. It depends on the font and the size of the font.

So I tried some values and kept the one that seemed to work. I could not check all the text (too big), so maybe it fails here or there. But that's life.
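To make the heuristic concrete, here is a minimal sketch in shell and awk. It is not the code of retext.c. The function name tj_words and the default limit of 200 are assumptions made for this example, not the value that was actually kept. The sketch ignores character widths and leaves escapes such as \374 undecoded.

```shell
# Read TJ lines on standard input and print their text, inserting a
# space wherever the number between two strings is <= -limit.
# $1: limit (hypothetical default: 200)
tj_words() {
  awk -v limit="${1:-200}" '
  {
    out = ""; num = ""; in_string = 0
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (in_string) {
        if (c == "\\") {                 # keep escapes such as \374 as they are
          out = out c substr($0, i + 1, 1); i++
        } else if (c == ")") in_string = 0
        else out = out c
      } else if (c == "(") {             # a new string starts
        if (num != "" && num + 0 <= -limit) out = out " "
        num = ""; in_string = 1
      } else if (c ~ /[-0-9.]/) num = num c   # collect the number
      else if (c == "]") break           # end of the array
    }
    print out
  }'
}

# Usage:
#   grep "TJ$" 00004.txt | tj_words 200
```

On the first TJ line of the file, -249 and -252 pass the limit and the small numbers do not, so the three words of the title come out separated.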

The Tm command also sets the horizontal position of the pen and may introduce a space too. retext handles that too.

One last thing. How do I know what part of the text is a word and what part is a definition?

Look at the screenshot above. You see that words are on their own line and there is some space between that line and the previous one. (Well, that first page does not define words but some stuff like ... canlısı. The structure of other pages is the same though. Some space before a word line.)

That can be detected. Just check that the new vertical position of the Tm is, there again, "far enough" from the previous one, and you know that the following line is a word, not a definition.
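The vertical check can be sketched the same way. Again this is not retext.c: the function name classify_tm and the gap of 15 are assumptions made for this example. The sixth field of a Tm line is the vertical position, and it decreases when going down the page.

```shell
# Read FULL_TXT-like lines on standard input and say, for every Tm
# line, what kind of line it starts. $1: gap (hypothetical default: 15)
classify_tm() {
  awk -v gap="${1:-15}" '
  $NF == "XXX" { have = 0; next }          # page separator: forget the position
  $NF == "Tm" {
    y = $6                                  # vertical position of the pen
    if (!have)                print "first line of page"
    else if (y == last)       print "same line"
    else if (last - y >= gap) print "word line"
    else                      print "definition line"
    last = y; have = 1
  }'
}

# Usage:
#   classify_tm 15 < FULL_TXT | sort | uniq -c
```

The sketch reports the first line of a page separately, which is exactly the case where the method cannot know whether a word or the rest of a definition follows.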

There is a bug in this method: the first line of a page is always taken as a word. It's a little annoyance. Some later filtering gets rid of all the practical cases.

To ease the extraction of words at the next step, I chose to prepend WORD to word lines.

And here is what I type to extract text from the file FULL_TXT:

cd ..
gcc retext.c -o retext -Wall -O3
cd z
../retext < FULL_TXT > full.txt

The file full.txt is 6090959 bytes long (more than 8 times shorter than FULL_TXT) and has 176717 lines.

Here is retext.c.

(There is a little printf "bug" in there, see it (in TJ())? I should have used fprintf to stderr at that spot. Checking full.txt shows that the bug does not show up. Lucky me.) (Another little bug in oppar, in the fprintf. It should be TJ, not Tj.)

Extract words

full.txt starts with:

WORD Büyük Türkçe Sözlük
Sürüm No: 1.0 Farabi
Açıklama
WORD (veya ağzının içine) bakmak
* ne söyleyeceğini beklemek.
* onun sözüne göre davranmak.
WORD ... (bir) hal almak
* bir duruma gelmek.
WORD ... canlısı
* düşkünü.
WORD ... damgasını vurmak
* (biri için) kötü bir yargıya varmak.
WORD ... -e kuvvet
* herhangi bir şeye ağırlık verildiğinde kullanılır.
WORD ... fırın ekmek yemesi lazım
* bir duruma erişmek için pek çok emek vermesi, çalışması gerekir.
WORD ... gözüyle bakmak
* yerine koymak.

(Yes, there is a first empty line.)

A little farther in the file, we have:

WORD abideleşme
* Anıtlaşma.
WORD abideleşmek
* Anıtlaşmak.
WORD abideleştirme
* Anıtlaştırmak işi.
WORD abideleştirmek
* Anıtlaştırmak.
WORD abidemsi
* Anıt benzeri.
WORD abidevi
* Anıtla ilgili, anıtsal, anıta benzer, anıt gibi.

We want the lines starting with WORD, with that WORD prefix removed, and among them only the lines with no space, because we want simple words only.

So we want to reject WORD (veya ağzının içine) bakmak and accept WORD abideleşme.

Here is the magic shell line:

grep "^WORD " full.txt | sed -e "s/^WORD //"|grep -v ' ' > dict.txt

We want lines starting with WORD (grep "^WORD " full.txt), remove WORD (sed -e "s/^WORD //") and reject lines that contain some space (grep -v ' ').

A note on the little bug (first line of a page always taken as a word even if it's part of a definition starting on the previous page). The space filtering above would leave us with only one possible case: a word followed by a dot, as definitions always end with a dot. Running:

grep "\.$" dict.txt

Returns nothing. (No line in dict.txt ends with a dot.) So, as previously stated, the filtering removes all practical cases of that little bug.

dict.txt is 463347 bytes long and has 47648 lines.

Clean up the word list

Unfortunately, dict.txt contains words not accepted by the kelimelik game. I have to remove them by hand when my cheat program pops them up.

I wrote a shell script (full-check.sh) to use an online dictionary instead of going by hand. It has bugs, in particular the echo lines, because in dict.txt we have lines -e and -n and this is interpreted by echo instead of being echoed. There are surely other bugs. Shell expansion is weird.
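A possible fix for the echo problem, as a sketch: printf with a %s format prints its argument as data and never as an option. The function names below are made up for this example; they are not taken from full-check.sh.

```shell
# Print one word exactly as given, even when it looks like an echo
# option such as -e or -n.
print_word() {
  printf '%s\n' "$1"
}

# Loop over a word list without reinterpreting the lines:
# IFS= keeps leading blanks, -r keeps backslashes.
each_word() {
  while IFS= read -r word; do
    print_word "$word"
  done
}

# Usage:
#   each_word < dict.txt
```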

The script takes several hours to finish. It does something like at most 10 requests per second on my computer / network access. And very often less than 1 request per second.

I'm not satisfied with that script, so I keep my "by hand" approach. (No it's not rational.)

(Edit from 2013/11/22: my programs now play automatically, I don't need to clean up anything at all.)

A note on fonts and characters' widths

As we have seen above, the text is put on the page with Tj and TJ commands with Tf commands all around.

Maybe that's why xpdf and pdftotext fail to extract the text correctly.

Anyway, because of those Tj and TJ, we absolutely need to know the width of characters. We can't just say "oh, a new Tj, let's start a new word" because that Tj may be part of a word. We really must start a new word only if the current letter is "far enough" from the previous one. If the PDF file only used one TJ per line of text and no Tj we could just check the numbers in it and put a space if the number is "big enough". But that's not the case.

Reading various objects we see that Tj always uses the font /F3 (it is always preceded by a Tf command selecting font /F3). And we see that apart from the title which uses the font /F1, all the rest uses the font /F2.

Where are those fonts and how are they defined?

By reading the obj files one by one we see that the font /F1 is defined in 00006.obj. We are lucky, it's not 12000.obj. (We could use a grep to dig for F1 if the fonts were defined in later obj files.)
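For the record, the grep in question could look like this. It is only a sketch: the function name find_objs is made up, and the search string /Name/F1 is taken from the font object shown in this section.

```shell
# List the .obj files of directory $2 that contain the fixed string $1.
# -l prints only the file names, -F disables regular expressions.
find_objs() {
  grep -l -F -- "$1" "$2"/*.obj
}

# Usage, from the directory holding the extracted objects:
#   find_objs '/Name/F1' .
```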

Here is the content of the file 00006.obj.

6 0 obj<</Type/Font/Subtype/TrueType/Name/F1/BaseFont/ZRAQAR+Gar
amond,Bold/Encoding/WinAnsiEncoding/FirstChar 32/LastChar 255/Wi
dths[250 260 552 667 469 833 802 281 354 354 490 667 260 333 260
 552 469 396 469 469 469 469 469 469 469 469 260 260 667 667 667
 417 927 656 677 677 781 708 615 729 865 396 375 677 635 917 844
 792 615 792 698 510 688 760 667 896 688 656 667 365 552 365 583
 500 333 479 552 469 552 469 302 542 552 281 260 531 260 844 552
 521 552 552 344 417 313 552 458 708 500 469 469 396 542 396 667
 750 469 750 250 708 490 1000 500 500 333 1031 510 281 990 750 6
67 750 750 250 250 490 490 354 500 1000 333 1000 417 281 729 750
 469 656 250 260 469 677 688 656 542 500 333 750 302 458 667 333
 750 500 396 667 313 313 333 458 542 333 333 313 333 458 833 833
 833 417 656 656 656 656 656 656 917 677 708 708 708 708 396 396
 396 396 781 844 792 792 792 792 792 667 792 760 760 760 760 656
 615 542 479 479 479 479 479 479 698 469 469 469 469 469 281 281
 281 281 521 552 521 521 521 521 521 667 521 552 552 552 552 469
 552 469]/FontDescriptor 12130 0 R>>^Mendobj

As we can see, the widths are in there.

Still lucky, the font /F2 is in 00007.obj.

7 0 obj<</Type/Font/Subtype/TrueType/Name/F2/BaseFont/LXNZOC+Gar
amond/Encoding/WinAnsiEncoding/FirstChar 32/LastChar 255/Widths[
250 219 406 667 448 823 729 177 292 292 427 667 219 313 219 500 
469 469 469 469 469 469 469 469 469 469 219 219 667 667 667 365 
917 677 615 635 771 656 563 771 760 354 333 740 573 833 771 781 
563 771 625 479 615 708 677 885 698 656 656 271 500 271 500 500 
333 406 510 417 500 417 323 448 510 229 229 469 229 771 510 510 
510 490 333 365 292 490 469 667 458 417 427 479 500 479 667 750 
469 750 219 615 448 1000 427 427 333 1021 479 198 938 750 656 75
0 750 219 219 448 448 354 500 1000 333 979 365 198 698 750 427 6
56 250 219 417 573 677 656 500 427 333 760 260 365 667 313 760 5
00 396 667 313 313 333 500 448 333 333 313 333 365 813 813 823 3
65 677 677 677 677 677 677 854 635 656 656 656 656 354 354 354 3
54 771 771 781 781 781 781 781 667 781 708 708 708 708 656 563 5
00 406 406 406 406 406 406 583 417 417 417 417 417 229 229 229 2
29 521 510 510 510 510 510 510 549 510 490 490 490 490 417 510 4
17]/FontDescriptor 12133 0 R>>^Mendobj

Same structure as for font /F1.

And here comes 00008.obj that defines the font /F3.

8 0 obj<</Type/Font/Subtype/Type0/Name/F3/BaseFont/ZLNTSC+Garamo
nd/Encoding/Identity-H/ToUnicode 12136 0 R/DescendantFonts[ 1213
7 0 R]>>^Mendobj

We are less lucky, no widths.

Two things in there.

First, we have /ToUnicode 12136 0 R. After uncompressing and reformatting 12136.obj, we get its content.

12136 0 obj<</Length 452 0 R>>stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo 
<<
/Registry (ZLNTSC+#47#61#72#61#6D#6F#6E#64) 
/Ordering (Zeon1) 
/Supplement 0
>> def
/CMapName /ZLNTSC+#47#61#72#61#6D#6F#6E#64 def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
6 beginbfchar
<00d6> <0131>
<00f7> <011e>
<00f8> <011f>
<00f9> <0130>
<00fa> <015e>
<00fb> <015f>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
endstream
endobj

Oh by the way, see that /Length 452 0 R? It's wrong. It should be /Length 452. My program "deflate" didn't work with indirect lengths as we have here and thus I did a quick and dirty hack and we now have that little bug. Since we don't reconstruct the PDF file, we don't care.

What do we learn here? We learn that 00d6 maps to Unicode 0131 and so on, up to 00fb that maps to Unicode 015f. These are our funny Turkish letters. See for example that and change the URL for the other Unicode characters.

So we now know what letter to output for the Tj lines.
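Written down as a shell sketch, the table from the beginbfchar block gives the following. The function names are made up for this example, and the script must be saved as UTF-8.

```shell
# Map the four hex digits of a Tj code to the Turkish letter given
# by the ToUnicode table. Returns 1 for an unknown code.
tj_letter() {
  case "$1" in
    00d6) printf '%s\n' 'ı' ;;   # U+0131
    00f7) printf '%s\n' 'Ğ' ;;   # U+011E
    00f8) printf '%s\n' 'ğ' ;;   # U+011F
    00f9) printf '%s\n' 'İ' ;;   # U+0130
    00fa) printf '%s\n' 'Ş' ;;   # U+015E
    00fb) printf '%s\n' 'ş' ;;   # U+015F
    *) return 1 ;;
  esac
}

# Same thing for a whole line such as <00d6>Tj.
tj_line_to_letter() {
  code=$(printf '%s\n' "$1" | sed -e 's/^<\([0-9a-f]*\)>Tj$/\1/')
  tj_letter "$code"
}

# Usage:
#   tj_line_to_letter '<00d6>Tj'
```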

Ah, in TJ we have ASCII characters and stuff like \374. This is an octal (base 8) number. (374 in octal is 3*8*8 + 7*8 + 4 = 252.) In the fonts /F1 and /F2 we have /Encoding/WinAnsiEncoding, which is, according to the PDF documentation, windows-1252 (and that format is described here). So we know all we need to output letters.
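The octal arithmetic and the conversion can be tried from a shell. This is a sketch with made-up function names. The second function assumes an iconv command that knows the encoding name WINDOWS-1252.

```shell
# Turn the three octal digits of a \ddd escape into a decimal code.
# The leading 0 makes printf read its argument as an octal constant.
octal_to_decimal() {
  printf '%d\n' "0$1"
}

# Convert windows-1252 bytes read on standard input to UTF-8.
cp1252_to_utf8() {
  iconv -f WINDOWS-1252 -t UTF-8
}

# Usage:
#   octal_to_decimal 374              prints 252
#   printf '\374' | cp1252_to_utf8    byte 252 is the letter ü in windows-1252
```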

One thing is still missing: the widths in font /F3.

And we now go to /DescendantFonts[ 12137 0 R] found in 00008.obj.

12137.obj is:

12137 0 obj<</Type/Font/Subtype/CIDFontType2/BaseFont/ZLNTSC+Gar
amond/FontDescriptor 12138 0 R/CIDSystemInfo<</Registry(Zeon)/Or
dering(Identity)/Supplement 0>>/DW 1000>>^Mendobj

/FontDescriptor 12138 0 R leads us to 12138.obj.

12138 0 obj<</Type/FontDescriptor/FontName/ZLNTSC+Garamond/Flags
 6/FontBBox[-250 -263 1202 1000]/StemV 78/ItalicAngle 0/CapHeigh
t 862/Ascent 862/Descent -263/StemH 78/XHeight 431/Leading 125/A
vgWidth 387/MaxWidth 1202/MissingWidth 1202/FontFile2 12139 0 R>
>^Mendobj

Still no width in there... But we have /FontFile2 12139 0 R.

12139.obj is a compressed object. After decompression, we get an object with a binary stream. Let's put that binary stream into a file 12139.raw (simply remove the first line of the decompressed object) and run file 12139.raw. It says:

12139.raw: TrueType font data

Okay, so it's a font file.

In another project I had to deal with fonts, so I have some code here and there that plays with the freetype library.

So I took some code from here and there and wrote ttf.c. The file has a lot of #if 0 ... #endif in there, just uncomment the part you need. (It's a hack after all.)

I needed to see how many characters there were (663). (In a font it's called a "glyph", not a character; we'll keep "character" in this web page.) Then what they looked like, so in ttf.c you have the draw and save functions. I could check that 00d6 was indeed at position 214 (00d6 in hexadecimal is d * 16 + 6 = 214, where d is 13) and looked like the Turkish ı, and the same for the other funny characters, as was expected from the file 12136.obj we have seen above.

The final required information is the width and that is given by ft_face->glyph->metrics.horiAdvance in freetype. We must take care to call FT_Load_Glyph with FT_LOAD_NO_SCALE. The value of horiAdvance is in font space and must be converted to PDF text space. So you have to do x * 1000 / font.units_per_EM. I dug through the web to find that information (here). (Don't remember what I looked for in the search engine.) In our case, freetype says units_per_EM is 2048.
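The conversion itself is a one-liner. Here it is as a shell sketch, with a made-up function name. The advance values in the usage lines are arbitrary, not widths measured in this font, and the division truncates to an integer.

```shell
# Convert a horiAdvance value in font units to PDF text-space units
# (thousandths of the font size).
# $1: horiAdvance, $2: units_per_EM
font_units_to_pdf() {
  printf '%d\n' $(( $1 * 1000 / $2 ))
}

# Usage, with units_per_EM = 2048 as in this font:
#   font_units_to_pdf 1024 2048    prints 500
#   font_units_to_pdf 2048 2048    prints 1000
```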

So we're good! We have all the necessary widths!

Some thoughts on PDF

I'll be short.

PDF is a very bad file format. Why so many options and different ways to do the same thing? Why embed compressed data? Why the /Length thing? Why this madness with fonts? Why the index? People at Adobe, you are totally insane.

What would a good file format be? It should be simple. One construct for one result. No compression at all, you leave that for other programs that will certainly do it better. Only one format for embedded fonts. Easy to edit with any text editor. (And all the work I had to do and this web page should not exist at all.) (Hum, wait, if we go that way, what to say of someone who cheats at scrabble-like games?) (Okay, let's forget it.)

That kind of simple thing will never come out of a big corporation or association or whatever structure containing dozens or hundreds of human beings, because every individual in there wants to be "part of it" and you end up with many useless things as in PDF. That also works for free software projects like the linux kernel, firefox, gtk, all the big names in the play that are overly complex for what they do.

Conclusion

I explained (with some details) how I extracted the words found in the Büyük Türkçe Sözlük PDF file.

Here is the README file I wrote while doing this all. Same information as the current web page, just way way shorter... (The file is in UTF-8.)


Contact: sed@free.fr

Created: Mon, 04 Nov 2013 22:43:18 +0100
Last update: Sat, 16 Nov 2013 18:54:13 +0100