Hello.

It was a challenge for me to make a lesson 1 of data-reversing; I don't know if I was successful.

------------------------

Lesson 1 of data-reverse-engineering, by SvD (svd_bg@yahoo.com)

Well, I don't know if I can teach someone how to reverse-engineer data, but I'll try to make a live "unplugged" show of some old things. Maybe along the way I can point out some important things.

I will use "C" style: 0x0000 for hex numbers; WORD=16bit-int, DWORD=32bit. And also I will use square brackets [] and uppercase to enclose important things.

[FIRST: to reverse-engineer data, you must be able to SEE it] (not to watch - "everyone's watching, but nobody sees" (Rainbow'81)).

I will explain what I mean using an anecdote based on an old fairy tale.

Excuse my English; it is a technician's, not a philologist's.

...Once upon a time there was a hero who had to fight the dragon. He walked through 9 lands into the 10th, and found the mountain of the dragon. He entered the biggest stinking cave and shouted:

"- Come on, you silly dragon, to fight with me".

Nobody answered. Then he shouted:

"- Come on, you silly dragon! or I will kill you while you sleep!"

Again silence. Third time he shouted with all his powers:

"- Come out, dragon, I'll kill you".

Now some powerful voice from beyond said:

"- Well, then. But I can't understand why should you scream from inside my ass?"

... /eo story /eo explanation. I.e. one has to see much farther than one's own nose.

[SECOND: this is a technology on how to make your own technology. It is not a ready-cooked fast-food]

Note: most of the techniques here are also very useful when reversing code.

Tools needed:

Reversing data is different from reversing code, because:

There are several ways of reversing data:

Of course, one must know about all these methods, and combine them if and where appropriate.

You have many bitmap fonts of the same family and different sizes/weights (like Bitstream screen fonts), or you have many .pfm's (Printer Font Metrics files, used with PostScript fonts). While switching between them, at the same current pointer (zero), you should see the differences in the headers (AH! they have headers!)

[BOOKMARK 1: there are data-files with Headers, maybe containing auxiliary info about the Contents that follow (this is usual, because it is more flexible); and there are data-files without Headers (or with Headers only). These two kinds can be divided into many other, possibly recursive, subkinds.]

Then, if you still remember which file you are looking at (or YES! it is written on the screen), you can see that the 12-point font file has 0x0C at offset 0x02, that the 10-point font file has 0x0A at the same offset, etc. So you get several data items that look similar AND you can almost imagine what they mean; check your intuition - look into some other file. If your intuition is right, save that info as Rule1.
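The "same offset, known value" check above can also be done mechanically. What follows is only a sketch under assumptions: the files are already loaded into memory, and you know one property (such as the point size) for each of them. The function name find_matching_offsets is made up for this illustration.

```c
#include <stddef.h>

/* Hypothetical helper: find every offset at which sample i holds the byte
 * expected[i], for all samples at once. Returns how many offsets were found;
 * at most max_out of them are stored in out[]. */
size_t find_matching_offsets(const unsigned char *const samples[],
                             const unsigned char expected[],
                             size_t n_samples, size_t len,
                             size_t out[], size_t max_out)
{
    size_t found = 0;
    for (size_t off = 0; off < len; off++) {
        size_t i = 0;
        while (i < n_samples && samples[i][off] == expected[i])
            i++;
        if (i == n_samples) {      /* every sample agreed with its expected value */
            if (found < max_out)
                out[found] = off;
            found++;
        }
    }
    return found;
}
```

Here len should be the length of the shortest sample. The more sample files you feed in, the fewer accidental matches survive, which is the same "check your intuition with another file" step done in bulk.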

[BOOKMARK 2: keep track of the rules you suppose or guess. Never destroy old ones before you are absolutely sure they are wrong, or they have a substitute.] A rule that is less than 20% proved is just a guess; if it is 50-60%, it is a working rule; if it is more than 90%, you are more (or at least not less) clever than the creator of the structure; the hardest thing is to gain that final 10-20%; sometimes it is needless, or impossible (90/10 razor: 10% of the resources go for 90% of a thing, and the final 10% of the thing may eat the remaining 90% OR even MORE).

....etc about the font struct: you see that a chunk of non-zero data starts at offset 0x220, and you notice that this very number is written at some offs, etc.

[BOOKMARK 3: data can contain pointers to itself. That's why some files are not easily changeable...]
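Hunting for such a self-pointer can be sketched as a plain search for the offset value stored inside the file. The sketch assumes the pointer is a 16-bit little-endian number; the width and byte order are guesses you would vary, and find_le16 is a made-up name.

```c
#include <stddef.h>

/* Hypothetical helper: return the first offset >= from at which the 16-bit
 * little-endian number `value` is stored, or (size_t)-1 if there is none. */
size_t find_le16(const unsigned char *buf, size_t len, size_t from, unsigned value)
{
    for (size_t off = from; off + 1 < len; off++) {
        unsigned v = buf[off] | ((unsigned)buf[off + 1] << 8);
        if (v == (value & 0xFFFFu))
            return off;
    }
    return (size_t)-1;
}
```

Calling it with value 0x220 on the font file would list the candidate places where the start of the data chunk is recorded. A hit is still only a guess until a second file with a different chunk offset confirms it.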

You have a kill-them-all-and-try-to-get-to-the-next-level game with a savegame feature. You make a savegame, shoot once, then make another savegame (or rename the first savegame and save over it). Now compare both files.

Better not to move - most of the "clever" games track your hero, and there may be very big differences if you move.

[BOOKMARK 4: always try to compare the smallest possible instances/differences you can, thus avoiding mistakes from "noise". Bigger things may be useful when you think you know enough and you are testing the resulting struct against reality.]
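Comparing two savegames byte by byte needs no special tool. A minimal sketch, assuming both files have the same length and are already in memory (diff_bytes is a made-up name):

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: print every offset where two equally long buffers
 * differ (offset, old byte, new byte) and return the number of differences. */
size_t diff_bytes(const unsigned char *a, const unsigned char *b, size_t len)
{
    size_t ndiff = 0;
    for (size_t off = 0; off < len; off++) {
        if (a[off] != b[off]) {
            printf("%06zx: %02x -> %02x\n", off, (unsigned)a[off], (unsigned)b[off]);
            ndiff++;
        }
    }
    return ndiff;
}
```

If one shot changes one or two bytes, you have probably found the ammo counter; if it changes hundreds, the difference between the two saves was not small enough.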

[BOOKMARK 5: if you are in doubt, you may be right. Some of the data in a data-file could be redundant, i.e. not at all looked through] (e.g. some non-initialized array - if the buffer is 100 bytes long, and the message is 5 byte long, and 0'th byte is its size, why then clear buffer to the end ?)

Target: MS Word (for DOS!) .doc format (do not mess with new stupid winword's format, or with any other .doc). It is a good example of

There are more things in the file than Word understands. Or, some things you see are simply never read by Word itself (they are just some garbage - the state of the stack frame).

Here I will explain shortly how I derived the struct (about 70%) of a Word file.

It took me about 2 weeks, 6 years ago (back then I only needed to know how to replace some text with longer or shorter text without destroying the formatting info, and without understanding it), then 3 more days, just 5 months ago, when I needed to understand what the formatting is. I have used all of the above ways/techniques, except 4), and very much gazing at the screen.

1. compare some (small) .doc files: you must know what text is in, and how long they are.

1.1. they all start with 0x31 0xBE, zeroes, 0xAB. OK, assume 0xBE31 (16bit-word) is an identifier of the file-type

1.2. name of style-sheet is at offs 0x1E; name of printer is at offs 0x62

1.3. they have 128 (0x80) byte long header

1.4. after the header is raw text as it is; look well where it finishes

1.5. look into the header again. Something familiar? The eo-text offset is a DWORD (little-endian - Intel style, 4 bytes) at offset 0x0E. Check with a large file. Is it so?
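Guesses 1.1 to 1.5 can be put to the test with a few lines of C. This is only a sketch of the assumptions made so far (identifier 0xBE31, 128-byte header, text right after it, eo-text DWORD at 0x0E); doc_text_end and the two read helpers are made-up names.

```c
#include <stddef.h>
#include <stdint.h>

/* Read little-endian values byte by byte, so the result does not depend on
 * the byte order or alignment rules of the machine running this code. */
static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical check: returns the end-of-text file offset, or 0 if the
 * buffer does not match the guesses made so far. */
uint32_t doc_text_end(const unsigned char *file, size_t len)
{
    if (len < 0x80 || read_le16(file) != 0xBE31)
        return 0;
    uint32_t eo_text = read_le32(file + 0x0E);
    if (eo_text < 0x80 || eo_text > len)   /* text starts right after the header */
        return 0;
    return eo_text;
}
```

Run it over every .doc you have: each file that returns a plausible offset adds a few percent to the rule, and a single file that returns 0 tells you where to look next.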

1.6. now go to eoText and look carefully. There are some meaningless characters, but they stop when the offset reaches a 128-byte boundary. Sound familiar?

Did you ever see a word4dos file that is not rounded to 512 bytes? OK, let's see what is at the next 128-byte boundary. Hm, it is similar. It starts with 0x80, then zeros, then it looks like an array of offsets,... no, they aren't reasonable.

But there is really a 128 byte chunk, or block, that is repeated. Ok, we'll assume:

MSWord uses these 128-byte blocks to save its formatting info. As the program works with VERY large files in very little memory (256K RAM was more than enough in those good old times... I used it with a 3-meg text file, and it worked fine), it uses some paging technique (do you know what memory paging is? what a virtual memory machine is? hmmm..) to get these things all right. And only whole blocks are saved (and the mess that is on the stack goes with them - why be slow and clear it if it is never used?).

1.7. hm, how many 128-byte blocks are there? Divide the whole file size by 128.

What do you get ? Do you remember where you saw such numbers? Go to the header. See it?

If you take a small file (1-2 lines, NO formatting), you will see it. If you start with a big file, you will not understand. There are several WORDs starting at 0x12, just after the eoText-offs. OK, let's guess again. Multiply the 1st number by 128; what do you get? Go there. Hm, the start of the second block. 2nd number? 3rd number? ... they all point to some block with (2-byte words) 0x12, 0x13, 0x14, ... then some dates as text, and then garbage. OK, there is another number after the printer name. Multiply by 0x80, go there - hm, the end of that same block.

Try several files with different sizes (but without formatting!). OK, that last number will be assumed to be the meaningful size of the file in 128-byte blocks.

The rest up to 512byte boundary is just a filling (anyway, block devices always write full blocks).
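The arithmetic of step 1.7 in C form. It is a sketch that simply encodes the two guesses (128-byte blocks, file padded up to a 512-byte boundary); the names are made up for illustration.

```c
#include <stdint.h>

#define BLOCK_SIZE 128u   /* formatting block size guessed in 1.6 */
#define DISK_ROUND 512u   /* the file length is padded up to a multiple of this */

/* Byte offset at which block number n starts. */
uint32_t block_offset(uint32_t n)
{
    return n * BLOCK_SIZE;
}

/* Expected on-disk file length for a given count of meaningful blocks. */
uint32_t padded_file_size(uint32_t n_blocks)
{
    uint32_t used = n_blocks * BLOCK_SIZE;
    return (used + DISK_ROUND - 1) / DISK_ROUND * DISK_ROUND;
}
```

For example, a file with 5 meaningful blocks holds 640 useful bytes and should be 1024 bytes long on disk. If padded_file_size() of the number found after the printer name equals the real file length for every sample, the rule is about as proved as it can get.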

1.8. Up to here we have used only way 2 (and 1). Now we WILL go inside and format the first 5 letters bold. Save to another file and look at both of them (I am doing the same at the moment, because I cannot explain on-air).

1.9. things have changed, but only in the 1st block after the text. hmmmmm, the second offset is where the formatting ends. then it is 0xFFFF, then .. eoText+1, then 0xFFFF, then mess again. Ah, and the last byte of the block is now 2, but was 1. Let's try harder. Add chars into the bold part. Look at the file again. See? the second offset moves, again to the eoBold. Divide the bold thing into three things - one bold, one normal, one bold. Look again. The last byte becomes 4 (yes, they ARE 4: bold, non-bold, bold, the rest); other offsets move and now are 2 more.. OK, the assumed struct is:

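struct block128 {
DWORD always_0x80; //0x80 in every block seen so far
struct { DWORD offset_of_eo_formatting; WORD some_format; } repeated; //one such entry per formatting
//mess...
BYTE number_of_formattings_in_block; //last byte of the block
};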

1.10. try with many character formats. you will see, no more than 15-20 are in a block. if there are more, a new block is opened, and YES! 1st DWORD is not 0x80, but start of the block! and the numbers in header move!

Add paragraph formats. What changes ? the block[s] after character's.
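The block layout guessed in 1.9 and 1.10 is enough to list where each formatting run ends. A sketch only, assuming the entries start right after the first DWORD and are 6 bytes each (DWORD + WORD, no padding); block_run_ends is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical reader for one 128-byte formatting block: stores up to max_out
 * end-of-formatting offsets in ends[] and returns the entry count taken from
 * the last byte, or 0 if that count cannot fit into the block. */
size_t block_run_ends(const unsigned char block[128], uint32_t ends[], size_t max_out)
{
    size_t n = block[127];
    if (4 + n * 6 > 127)           /* entries would overlap the count byte */
        return 0;
    for (size_t i = 0; i < n && i < max_out; i++)
        ends[i] = read_le32(block + 4 + i * 6);
    return n;
}
```

Printing the offsets for the "bold, normal, bold, the rest" file should give four increasing numbers, the last one near eoText. If they are not increasing, the assumed entry size is wrong.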

------

2. here we know that .doc struct is:

struct MSWordHeader {
WORD BE31, w1_0, ab00, w4_0[4];
DWORD text_end_file_ofs; //0E
WORD Nblock_start_para; //12
WORD Nblock_start_footnote; //14
WORD Nblock_start_dontknowwhat; //16
WORD Nblock_start_divisions; //18
WORD Nblock_start_summary; //1A
WORD Nblock_start_summary2; //1C
BYTE nameStylesheet[68]; //1E
BYTE namePrinter[8]; //62
WORD N_total_blocks; //6A
BYTE _others[0x14]; //6C
};
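A compilable form of the header, as a sketch. Two things here are assumptions not stated above: the typedefs for BYTE/WORD/DWORD, and the use of #pragma pack (accepted by GCC, Clang and MSVC) to stop the compiler from padding the struct, which it would otherwise do because the DWORD at 0x0E is not 4-byte aligned. The static assertions check the offsets written in the comments.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint8_t  BYTE;
typedef uint16_t WORD;
typedef uint32_t DWORD;

#pragma pack(push, 1)            /* no padding between fields */
struct MSWordHeader {
    WORD  BE31, w1_0, ab00, w4_0[4];
    DWORD text_end_file_ofs;          /* 0x0E */
    WORD  Nblock_start_para;          /* 0x12 */
    WORD  Nblock_start_footnote;      /* 0x14 */
    WORD  Nblock_start_dontknowwhat;  /* 0x16 */
    WORD  Nblock_start_divisions;     /* 0x18 */
    WORD  Nblock_start_summary;       /* 0x1A */
    WORD  Nblock_start_summary2;      /* 0x1C */
    BYTE  nameStylesheet[68];         /* 0x1E */
    BYTE  namePrinter[8];             /* 0x62 */
    WORD  N_total_blocks;             /* 0x6A */
    BYTE  _others[0x14];              /* 0x6C */
};
#pragma pack(pop)

/* Compile-time checks of the offsets in the comments above. */
_Static_assert(sizeof(struct MSWordHeader) == 0x80, "header must be 128 bytes");
_Static_assert(offsetof(struct MSWordHeader, text_end_file_ofs) == 0x0E, "eoText at 0x0E");
_Static_assert(offsetof(struct MSWordHeader, N_total_blocks) == 0x6A, "block count at 0x6A");
```

Reading the first 128 bytes of a file straight into this struct gives correct numbers only on a little-endian machine; elsewhere the WORD and DWORD fields have to be byte-swapped after reading.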

//and char & paragraph format block's struct is

struct block128 {
DWORD start_of_block;
struct { DWORD offset_of_eo_formatting; WORD some_format; } repeated;
//meaningless mess... up to offs 126
BYTE number_of_formattings_in_block; //at offs 127
};

and the whole file is cut into 128-byte-blocks:

As the things used most are character and paragraph formats, and only rarely divisions and pages, this was enough to make a hyphenation program that reads a .doc file, inserts discretionary hyphens where needed, and saves the result (without understanding a bit of the formatting, while keeping it consistent).

[BOOKMARK 6: Do not try to understand everything. It may not be necessary to have the job done.]

When I needed to understand the formatting, I started from the beginning, but this time looking at how the file changes when I change one format to another. It was a good long time of gazing, trying, comparing, repeating from the start...

As a result, here is the description of a block:

struct Thing {
DWORD file_ofs; //where the thing/property ends
WORD descr_ofs; //ofs from first thing in block;
// descr is 1 byte sz + sz bytes definition
//thing is applied from prev-thing's ofs upto (not including) its own ofs
//char's are applied from-to; N_of_char_things = N_of_changes_in_char_style
//para's have one thing for every paragraph; no matter same, or different
};
struct ThingBlock {                    //size:0x80; char or para
DWORD prevblock_last_file_ofs; //(1st block begins with ofs 80h)
Thing things[17];
BYTE deflist[21]; //format definition list
BYTE Nvalid_things; //[0x7f]

//and several methods, regarding type of format; plus iterator over the list.
void thingtype( Thing * p, const BYTE * typ ) const;
Thing * eothings() { return things+Nvalid_things; }
const BYTE*thingdef( Thing * p) const { return p->descr_ofs==0xFFFF ? (BYTE*)~0 : (BYTE*)things + p->descr_ofs; }
};

It is a very sophisticated struct: the so-called Things are filled from start to end, the definition list is filled from end to start; different defs have different sizes; where they meet the Things, the block is over and a new one is started.
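How a Thing finds its definition can be sketched in plain C on a raw 128-byte block. It follows the description above (descr_ofs counted from the first thing, 0xFFFF meaning no definition, a definition being 1 size byte plus that many bytes). The bounds checks and the name thing_definition are additions for illustration, not part of the original program.

```c
#include <stddef.h>

#define BLOCK_LEN    0x80u  /* whole block */
#define THINGS_START 4u     /* things begin after the leading DWORD */
#define THING_LEN    6u     /* DWORD file_ofs + WORD descr_ofs */

/* Hypothetical C counterpart of thingdef(): return a pointer to the
 * definition bytes of thing number i and store their count in *sz.
 * Returns NULL if the thing does not exist, has no definition
 * (descr_ofs == 0xFFFF), or points outside the block. */
const unsigned char *thing_definition(const unsigned char block[BLOCK_LEN],
                                      size_t i, size_t *sz)
{
    if (i >= block[BLOCK_LEN - 1])                 /* Nvalid_things is the last byte */
        return NULL;
    if (THINGS_START + (i + 1) * THING_LEN > BLOCK_LEN - 1)
        return NULL;
    const unsigned char *thing = block + THINGS_START + i * THING_LEN;
    unsigned descr_ofs = thing[4] | ((unsigned)thing[5] << 8);
    if (descr_ofs == 0xFFFFu)
        return NULL;
    size_t pos = THINGS_START + descr_ofs;         /* offset counts from the first thing */
    if (pos >= BLOCK_LEN - 1 || pos + 1 + block[pos] > BLOCK_LEN - 1)
        return NULL;
    *sz = block[pos];                              /* first byte is the size */
    return block + pos + 1;
}
```

Dumping the definition bytes of the same text before and after changing one format is then a matter of calling this for every valid thing in every block and comparing the output.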

Using the above, I wrote a preprocessor/parser/converter for DOS WORD's .doc that turned .doc into a little TeX engine. You just define how character and paragraph styles are to be converted (also headings, where to place pictures, markers - into the same Word file!), define the DTP's stylesheet and go on. The thing works automagically. I got several 100+ page documentations turned into Ventura, Frame Maker, and HTML, formatted (to resize pictures), tuned for the best size-versus-cost relation and printed in 3 days!

So, adio for now. I think this is too much for a first attempt. Sorry if you are tired of all this. It is a pleasure for me to stop here too.

Oct'1998

SvD