
Controls

Using Controls

In the experimental sciences, controls are of utmost importance. This is sometimes neglected in computational experiments, but it is just as important there as in the "wet" sciences.

Using controls and control experiments, you establish

  • that your software is correct and
  • that the results you get are significant and not merely noise.

Controls for correctness

There are several steps you can take to ensure the correctness of your software.

  1. The basic requirement is that your software contains checks ensuring that computations are not carried out on bad data, should any slip in. Use tests that look at, for example, sequence lengths (really short sequences are probably not relevant) and other properties (you don't expect DNA to have 20 letters); see the first sketch after this list.
  2. Set up test cases based on the most trivial input data you can imagine. Such trivial cases can be an empty file, really small datasets, and strange corner cases that you believe will never occur unless there is an error in the input data. Your software should be correct on these examples too! If it does not work on trivial data, why would it work on non-trivial data? The first sketch after this list covers cases like these as well.
  3. Create or (automatically) generate data that resembles the real data you will work with, but where you know what the results should be. For example, if you are predicting genes, create a small chromosome consisting of random nucleotides (uniform distribution perhaps) and insert a real, known gene into this synthetic data. Can you now automatically find this gene? See the second sketch after this list.
  4. Make a test run on a small portion of your data. It is unfortunate to launch full-scale (computational) experiments and only then start looking for errors in huge amounts of output data. By starting small, with a "prototype version" of your project, you can easily take care of the first trivial problems. And they will occur.
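
To make points 1 and 2 concrete, here is a minimal sketch in Python. The function validate_dna, the alphabet, and the length threshold are assumptions chosen for illustration, not part of any particular library.

    # Minimal input validation plus trivial test cases (points 1 and 2).
    # validate_dna and MIN_LENGTH are hypothetical names for this sketch.
    DNA_ALPHABET = set("ACGTN")
    MIN_LENGTH = 30  # assumed cutoff; really short sequences are probably not relevant

    def validate_dna(seq):
        """Return True if seq looks like plausible DNA input."""
        if len(seq) < MIN_LENGTH:
            return False  # too short to be meaningful
        if not set(seq.upper()) <= DNA_ALPHABET:
            return False  # you don't expect DNA to have 20 letters
        return True

    # Trivial cases: empty input, a tiny dataset, and a wrong-alphabet corner case.
    assert not validate_dna("")                 # empty file / empty record
    assert not validate_dna("ACGT")             # really small dataset
    assert not validate_dna("MKWVTFISLLF" * 5)  # protein letters, not DNA
    assert validate_dna("ACGT" * 25)            # plausible DNA passes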
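
For point 3, a sketch of planting a known gene in a random background. The gene sequence here is a placeholder, and my_gene_finder is a hypothetical stand-in for your own predictor.

    # Synthetic data with a known answer (point 3): a random background
    # chromosome with a real gene inserted at a known position.
    import random

    def random_chromosome(length, seed=1):
        rng = random.Random(seed)  # seeded, so the control is reproducible
        return "".join(rng.choice("ACGT") for _ in range(length))

    known_gene = "ATG" + "GCTGAA" * 40 + "TAA"  # placeholder; use a real, known gene
    background = random_chromosome(100_000)
    insert_at = 50_000
    synthetic = background[:insert_at] + known_gene + background[insert_at:]

    # predictions = my_gene_finder(synthetic)  # hypothetical: your own tool
    # assert any(start == insert_at for (start, end) in predictions)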

Using synthetic data

Beyond ensuring correctness, it can be valuable to generate data with your own tools. It may not come up naturally in this course, but many projects in computational biology need to understand the statistics of their results. If you can generate "null data", that is, data which contains no valuable information at all, then you can run your analysis on it and get an understanding of what results such data produces. When you then run the same analysis on real data, you would expect to find results that are far more interesting than on your null data. One common way to produce null data is sketched below.
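
A sketch of the idea: shuffling real sequences preserves length and base composition but destroys any signal, so the shuffled sequences can serve as null data. my_analysis is a hypothetical stand-in for your own pipeline.

    # Null-data control: shuffled sequences keep length and composition
    # but carry no signal, so their scores define a "noise" baseline.
    import random

    def shuffled(seq, seed=0):
        rng = random.Random(seed)
        chars = list(seq)
        rng.shuffle(chars)
        return "".join(chars)

    # null_scores = [my_analysis(shuffled(s, seed=i)) for i, s in enumerate(real_data)]
    # real_scores = [my_analysis(s) for s in real_data]
    # Results on real data should clearly stand out from null_scores.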

If, for example, you are writing a tool for detecting genes in genomes, you can use non-coding genome data to see how your tool handles it. If you find many genes in such gene-free data, you know you will have a problem with false positives. If you assign scores to the detected "genes", you automatically get a distribution of "non-interesting" scores, and you can then require that your tool only report genes with significantly higher scores.
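
One simple way to act on that null distribution, sketched below under the assumption that higher scores are better: report only detections scoring above a high quantile of the null scores. The 99th percentile is an illustrative choice, not a rule.

    # Turn the null-score distribution into a reporting cutoff: keep only
    # detections scoring above, say, the 99th percentile of the null scores.
    def null_cutoff(null_scores, quantile=0.99):
        ordered = sorted(null_scores)
        index = int(quantile * (len(ordered) - 1))
        return ordered[index]

    # cutoff = null_cutoff(null_scores)
    # reported = [gene for gene in detected if gene.score > cutoff]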
