Structure Recovery as Counter-Example Guided Abstraction Refinement [Archive]

OpenRCE_RolfRolles

October 10th, 2009, 05:17

Counter-Example Guided Abstraction Refinement ("http://www.cs.cmu.edu/~emc/papers/Papers%20In%20Refereed%20Journals/Counterexample-guided%20abstraction%20refinement.pdf") (C.E.G.A.R. for short) is one of those brilliant techniques, like abstract interpretation, that is simple to state, and that in principle applies beyond its original application domain (in this case, model checking). As this blog entry is meant to be an informal exposition of an idea I had, I feel that it is not warranted to set up the necessary formalism to discuss C.E.G.A.R. in precise mathematical terms, but it would be wrong to convey the notion that it is anything other than a mathematical endeavor.

When you're trying to figure something out in a black-box fashion, you probably do something like the following. You begin by inspecting the data at hand and constructing some sort of model (an "abstraction"

that fits it. Then you acquire more data, and test your model against it. If new data causes the model to be falsified (that particular data is called a "counterexample"

, you must "refine" your model so as to account for it. This process is iterated until no more counterexamples are found. Basically, this is C.E.G.A.R. -- you can see that the idea enjoys widespread genericity and more or less applies to any scientific endeavor. Notice I didn't talk about software or anything in particular in this paragraph; I'm just talking about thinking.

To be more specific, C.E.G.A.R. is an idea in model checking that is very similar to the preceeding paragraph. Basically, a model (abstraction) is constructed automatically, and standard model-checking techniques are used to verify that the model satifies the specification which has been provided. If it does not, a counterexample is generated. You then analyze the original hardware or software system to see if the counterexample falsifies the specification. If it does, the counterexample is a genuine witness to the fact that the system does not adhere to the specification. If it does not, then we have a "spurious" counterexample (an indication that our abstraction is deficient), which we use to refine our model accordingly, and then we repeat the process.

Once I learned about C.E.G.A.R., I realized that I already adopted a very similar technique while manually reverse engineering object-oriented applications. I begin by assuming that there are not any structures in the program at all. When I encounter an instruction of the form mov reg, [reg+offset], I define a new structure consisting solely of that member, with any bytes below or above that member being considered as undefined. I then track the structure pointer and repeatedly add members every time I encounter a reference of that form.

In terms of C.E.G.A.R.:

1: Begin with an empty structure as the model.
2: Assert that the structure is a valid model of the program's memory accesses with respect to the structure pointer.
3: If a reference to an as-yet undefined structure member is found, this is a counter-example which is a witness to the falsity of the notion that the model accurately depicts the structure under consideration. Therefore, refine the model to incorporate the new structure member, and go to 2.
4: Output the structure.

I don't mean to imply that this is the way that an automated structure recovery tool should be constructed; I think that an interprocedural version of what Hex-Rays does ("http://www.hexblog.com/2008/02/easy_structure_types_1.html") with global dataflow analysis ought to be sufficient for most purposes. I'm merely suggesting a mathematically precise way to look at the problem of structure recovery.

https://www.openrce.org/blog/view/1506/Structure_Recovery_as_Counter-Example_Guided_Abstraction_Refinement