It comes a time in the life of a wet lab geek when one has to simulate the outcome of a proteolytic degradation of a protein of known sequence (those who said that haven't are lying).

So for a problem this common, I did what I always do first: google for an already made solution. Surprisingly, after ten minutes or so, I came out empty-handed.

Now don't get me wrong: there were solutions out there, like PeptideCutter, PoPS, or EMBOSS' digest, but none of them were easily usable as a library. The first two only offered a web front-end, and that is OK for a quick digestion of one or two sequences, but I needed batch-processing capabilities and much more flexibility (ie, being able to predict partial digests, for instance). The last one was indeed available as a command line utility, but it offered very few enzyme specificities, it forced an interactive mode, and writing a wrapper for it would have involved more logic that the one implemented in the application (not to mention the ugly reading/writing of temporary files that would have been required to deal with the app's inflexible interface, and that would have created a nice and shiny IO bottleneck in every downstream application that used it).

So, there. Homework done, I decided to code for a solution myself. Luckily, PeptideCutter had a pretty comprehensive explanation of how their algorithm works, and it is simple enough to be coded quickly.

Basically, a protease specificity is modeled as a regular expression that returns true or false on a sequence sliding window. This sliding window is six residues long in PeptideCutter, I did it 8 residues long to account for potentially more specific proteases. When the substrate matches the regular expression, the sequence should be cut at the siscile bond. Simple enough.

The module is up at my github repo (I called it Bio::Protease for a lack of better name), you can download it/fork it/follow it to your liking. It has quite a robust test suite, I checked every single specificity against PeptideCutter's results on the same input.

I used Perl and Moose for it. One of the things that I like the most about Moose is how easy it is to make really DWIMmy APIs with it.
For instance, in Bio::Protease, you can do:

my $protease = Bio::Protease->new(specificity => 'trypsin');

This will coerce the string 'trypsin' into an internal representation of the trypsin specificity, which is implemented as a code reference. (Currently, there are more than thirty different specificities to choose from, you can list them by doing say Bio::Protease->Specificities).

Also, you could say:

my $cutting_pattern = Bio::Tools::SeqPattern->new(
    -SEQ => 'XXXW[^P]RSX', -TYPE => 'Amino'
my $protease = Bio::Protease->new(specificity => $cutting_pattern);

to get a custom regex-based specificity. The one above would cut sequences between the fourth and fifth residues if they match that pattern.

You could object that the regex-specificity model is too simplistic for your personal use, and you'd be right. Enzymic cleavage of peptide bonds does not share the deterministic character of restriction enzymes, and sometimes protein's structural characteristics get in the way and play an important part in cleavage. So, to account for an arbitrarily complex specificity model, the 'specificity' attribute also accepts a code reference:

my $nobel_prize_winning_specificity_model = sub {
    my $peptide = shift;
    # decide whether to cut $peptide or not

my $protease = Bio::Protease->new(
    specificity => $nobel_prize_winning_specificity_model;

Internally, the module passes a sliding window of peptides to the code reference; if it returns true, it marks the siscile bond as cut.

So back to the original problem, to actually make the cuts, you can do:

my @products = $protease->digest('MAAEEELLKKVVIKP');

The products of a full digestion of the argument sequence will be returned as a list. On the other hand, if you want to get the siscile bonds, you'd say:

my @cut_sites = $protease->cleavage_sites($seq);

And, for partial digests, you can use the method 'cut', that will return the products of a single cleavage if you provide a siscile bond as an argument:

my @products = $protease->cut($seq, $cut_sites[rand @cut_sites]);

In the above expression, $cut_sites[rand @cut_sites] will choose a random cleavage site from @cut_sites, and cut $seq there. This is the basis for a bigger module (and the reason for coding this one), which could make it to a separate blog post in the future.

The take home lesson is: use Moose. I am too lazy to have done all of this had not been so easy. Without it, I would have probably whipped up a half-assed script (not a shiny, reusable module) that took a sequence and a specificity as input from the command line, and that barfed a table with the cleavage sites as output, and not much more (sounds familiar?). But all of the above functionality was implemented in a hundred-or-so lines of clean, declarative, object-oriented code, and I don't ever have to worry about this problem again (and neither do you).