Faculty of Technology > Electrical and Information Engineering > Computer Engineering Laboratory

OUSPG

[This page is CSS2 enabled. Your browser might not fully support it]

PROTOS Protocol Genome Project

The Search for Information Technology DNA

$RCSfile: index.html,v $ $Revision: 1.1 $ $Date: 2008/09/08 10:31:26 $

ABSTRACT

Computer programs are a part of our daily lives. The programs we use process information from various sources using a pleathora of encodings and protocols. The input processing routines are among the most exposed areas of a program, which is why they should be especially reliable. This is rarely the case. Our previous experiments with protocol implementations have shown that well designed structural mutations are an efficient way to expose errors in networked programs. A similar approach can be applied to any type of content by using automatic structure inference techniques. It is our belief that that automatic structure inference combined with domain specific reasoning capabilities will enable, among other things, efficient and creative automated program robustness testing tools.

Project Overview

The PROTOS Protocol Genome Project was initiated in January of 2004. The original motivation was to enable testing of implementations of arbitrary, possibly unknown, protocols by inferring a model of the packet structure of a protocol from examples, and using the model to generate testing material for implementations of the protocol.

The guiding principle of the project is to build an automatic program robustness testing system based on model assisted fuzzing. Structure of valid inputs for a program is first analyzed by a program to understand what kind of data the program is known to process. This knowledge is then used to build similar data, with various kinds of mutations. The program is then subjected to these mutated inputs in a controlled enviroment, and the cases that cause the program to fail are collected. We currently consider each program as a stateless protocol.

Main areas of the project are structure inference and instrumentation. Our current structure inference tools are based on two of our previous prototypes. Although the tools and testing framework are preliminary versions, we have already found a large number of interesting program failures.

Project Status

We currently have two funders. We are testing, for the most part, security software, compression tools and office suites.

We are looking for more funders, since we can barely implement and test the most crucial ideas. Additional hardware and human resources would significantly speed up the project.

Documents

Viide J., Helin A., Laakso M., Pietikäinen P., Seppänen M., Halunen K., Puuperä R., Röning J. Experiences with Model Inference Assisted Fuzzing. 2008. [publication page]

A. Helin, J. Viide, M. Laakso, J. Röning. Model Inference Guided Random Testing of Programs with Complex Input Domains. 2006. [pdf]

These documents need some serious revision:
- Some notes on a simple structure inference approach (pdf)
- Some notes on our structure description language (pdf)
- A very outdated Research Plan

Todo and Notes

Build a proper testing framework.
Update, partition and write documentation.
~~Design an extended language and a related evaluation function to be able to incorporate different structure inference algorithms in a flexible manner.~~ (probably deprecated)
A separate generalizer/improver. (separate or integrated?)
Build a continuous MDL-based structure analyzer.
Prototype some genes-as-logical-relations approaches. (ILP anyone?)
~~Prototype some (ε + probabilistic) context free grammar inference approaches.~~
Write a small portable functional gene (read: simple backtracking parsing functions) library that could be released as public domain, since it appears to be useful.
Do some more background research on learnability theory.
~~Do some more background research on compression.~~
Add the simplified Ukkonen's algorithm and Qsufsort to code snippets.
Add common file content expert modules.

Recent Changes (newer first)

The old permutation fuzzer now also outputs grammars.
A grammar fuzzer was written.
A new (apparently better) grammar based fuzzer was written.
Started parallelizing structure inference.
Started a graphical version of a reverse-engineering tool.
Started building a testing frameworks.
The permutation fuzzer now requires less memory.
Added metadata to fuzzer output, so that the generated files can be reverse-engineered.
Started making our own training material collection to be able to release test sets without copyright issues.
Wrote a semirandom channel-surfer fuzzer.
Webpage updates, started populating a gallery.
One fuzzer tool is now ready for testing. (it will still be updated)
Made the mutations more diverse.
The fuzzer was ported to a Scheme compiler. It now runs 5-10x faster.
Improved checksum algorithm to improve fuzzer output speed.
We seem to have created a monster. Some of the new tools will not be released publically.
A trivial file format fuzzer was written. This will be used as a reference point for different fuzzing techniques.
Data structure changes: the equivalent of a suffix tree node is now constructed from suffix arrays as needed. This solves a few bottlenecks.
Added a uniqueness filter to fuzzer output.
Made the qsufsort adaptation simpler, more efficient and less memory-intensive.
Updated the third prototype. (3x)
The bottleneck in partitioning was handled by resorting to re-sorting partitions.
Added file fuzzing functionality.
Added a mutation factor to be able to break more programs.
Preliminary prototype 3 made available.
A simple fuzzer was made.
Changes to gene pools: genes have labels and can be mutually recursive.
Modifications to expression scoring: evaluation function is curried to be able to cache results in the gene pool, and scoring happens inside the expert modules instead of the controller.
Simplified data structures. The lcp array is no longer needed.
Simpler and more efficient functional gene matching on suffix arrays.
Suffix array sorting based on "Faster suffix sorting".
Released current source tree as prototype 2.
Linear time suffix tree construction in Scheme.
A basic suffix array based analyzer was written.

Contact Information

In case you are interested in this kind of work, know of a related domain in which similar ideas could be useful etc, feel free to contact us. The preferred way of contacting the project personnel is through our collective OUSPG mailing list. Please remove SPAMLESS before delivery.

ouspg@ee.SPAMLESS.oulu.fi

[This page is CSS2 enabled. Your browser might not fully support it]