Computing: Website and Database Programming

Chemistry: Quantitative analysis of proteins.


1. Protein sequences introduction.
 
Proteins are polymers composed of monomers called amino acids. In other words, a protein is a chain of amino acids. Because amino acids were once called peptides, the chain is also referred to as a polypeptide. All amino acids have an amino end (-NH2) and a carboxyl end (-COOH). They also have a side chain, called the R (reactive) group. Of the many hundreds of described amino acids, 22 are proteinogenic ("protein-building"): the 20 standard ones, plus selenocysteine and pyrrolysine. It is the 20 standard compounds that combine to give a vast array of peptides and proteins assembled by ribosomes.
In proteins, amino acids are linked together by covalent bonds called peptide bonds. A peptide bond is formed by the linkage of the -NH2 end of one amino acid with the -COOH end of another amino acid. Because of the way in which the peptide bond forms, a polypeptide chain always has an amino end and a carboxyl end.
The primary structure (the amino acid chain) is sufficient to uniquely identify a protein. This means that a protein may be described by a sequence of letters, where each letter codes for a given amino acid. These 1-letter codes are convenient for computer programs; to make a protein sequence more readable for humans, there are also 3-letter codes.
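The relation between the two codings can be illustrated with a small lookup table. The following is only a minimal sketch: the helper name to_three_letter is hypothetical (it is not part of the application), and rendering unknown letters as "Xaa" is a choice made for this example.

```perl
use strict;
use warnings;

# Standard 1-letter to 3-letter codes of the 20 standard amino acids.
my %three_letter = (
    A => 'Ala', C => 'Cys', D => 'Asp', E => 'Glu', F => 'Phe',
    G => 'Gly', H => 'His', I => 'Ile', K => 'Lys', L => 'Leu',
    M => 'Met', N => 'Asn', P => 'Pro', Q => 'Gln', R => 'Arg',
    S => 'Ser', T => 'Thr', V => 'Val', W => 'Trp', Y => 'Tyr',
);

# to_three_letter is a hypothetical helper, not part of the Perl script
# described on this page.
sub to_three_letter {
    my ($protein) = @_;
    # Letters without an entry in the table are rendered as 'Xaa'.
    return join('-', map { $three_letter{$_} // 'Xaa' } split(//, uc $protein));
}

print to_three_letter('MKVL'), "\n";    # prints Met-Lys-Val-Leu
```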
There are various ways to classify amino acids: by their volume (molecule size); by their side chain structure; by their polarity; by their charge; by their hydropathy; by their nutritive requirement (for humans); by their chemical function (metabolism).
For further details, please have a look at my tutorial A general overview of the protein structure.
2. Quantitative analysis of proteins online calculator.
 
"Quantitative analysis of proteins online calculator" is a web application that may be used to determine the composition (amino acid counts and percentages) of a protein sequence. The sequence must be in 1-letter coding; only the codes of the 20 standard proteinogenic amino acids are accepted (the presence of the "uncertain" or "unknown" codes B, Z, and X, as well as O = pyrrolysine and U = selenocysteine, will result in an "Invalid protein sequence" error). The sequence may be entered either as raw data or in FASTA format (cf. my tutorial Biological sequences and genetic code primer). The FASTA header must be a single line; comments are not permitted. Multiple-sequence FASTA is not supported.
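The input rules above (optional single-line FASTA header, no second sequence) might be handled as in the following sketch. The helper name extract_sequence is hypothetical, and removing whitespace inside the sequence is an assumption of this example; the actual script may proceed differently.

```perl
use strict;
use warnings;

# extract_sequence is a hypothetical helper. It returns the sequence as one
# string, or undef if the input contains more than one FASTA sequence.
sub extract_sequence {
    my ($input) = @_;
    my @lines = split(/\r?\n/, $input);
    # A FASTA header is a single line starting with '>'; drop it if present.
    shift @lines if @lines && $lines[0] =~ /^>/;
    # Any further '>' line announces a second sequence (not supported).
    return undef if grep { /^>/ } @lines;
    my $sequence = join('', @lines);
    $sequence =~ s/\s+//g;    # assumption: blanks within the data are ignored
    return $sequence;
}

# The header and sequence below are made up for illustration.
my $fasta = ">example made-up sequence\nMKVLAAGI\nWYHDE\n";
print extract_sequence($fasta), "\n";    # prints MKVLAAGIWYHDE
```

The returned string can then be passed to the validation step, which rejects anything that is not one of the 20 accepted codes.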
You can enter the sequence manually (for example, using Copy/Paste) or upload a file stored on your computer. To use a file, select the corresponding checkbox. Please note that the file size is limited to 25 kB and that filenames may contain only letters, numbers, spaces, underscores and hyphens.
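Checks of this kind could look like the sketch below. The helper names valid_filename and within_size_limit are hypothetical, the limit is assumed to be 25 * 1024 bytes, and allowing one optional file extension (a dot followed by letters or digits) is an assumption of this example, not something the application is known to do.

```perl
use strict;
use warnings;

# Assumption: "25 kB" is taken as 25 * 1024 bytes; the real limit may differ.
my $MAX_BYTES = 25 * 1024;

# valid_filename and within_size_limit are hypothetical helper names.
sub valid_filename {
    my ($filename) = @_;
    # Letters, digits, spaces, underscores and hyphens only; the optional
    # extension is an assumption of this sketch.
    return $filename =~ /\A[A-Za-z0-9 _-]+(?:\.[A-Za-z0-9]+)?\z/ ? 1 : 0;
}

sub within_size_limit {
    my ($content) = @_;
    return length($content) <= $MAX_BYTES ? 1 : 0;
}

for my $name ('my protein_1.fasta', '../etc/passwd', 'seq;rm -rf.txt') {
    printf "%-20s %s\n", $name, valid_filename($name) ? 'accepted' : 'rejected';
}
```

Using a whitelist of allowed characters, rather than trying to list the dangerous ones, keeps path separators and shell metacharacters out of the filename by construction.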
With a valid sequence entered, pushing the Calculate button displays a table with the counts and percentages of the 20 amino acids. To display the counts and percentages for the different classification categories (molecule size, side chain, etc.), click the corresponding link.
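How such counts and percentages can be derived from a sequence is sketched below. The helper name composition is hypothetical; percentages are assumed to be relative to the total sequence length.

```perl
use strict;
use warnings;

# composition is a hypothetical helper: it returns a hash reference mapping
# each 1-letter code to [count, percentage of the sequence length].
sub composition {
    my ($protein) = @_;
    my %result;
    my $length = length($protein);
    for my $aa (split(//, 'ACDEFGHIKLMNPQRSTVWY')) {
        # Count the occurrences of this code (global match in list context).
        my $count = () = $protein =~ /$aa/g;
        my $percent = $length ? 100 * $count / $length : 0;
        $result{$aa} = [$count, $percent];
    }
    return \%result;
}

# Print only the amino acids that actually occur in the example sequence.
my $comp = composition('MKKLAM');
for my $aa (sort keys %$comp) {
    my ($count, $percent) = @{ $comp->{$aa} };
    printf "%s %3d %6.2f%%\n", $aa, $count, $percent if $count;
}
```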
Use the following link to start the online application.
3. Quantitative analysis of proteins Perl script.
 
Click the following link to download the Quantitative analysis of proteins Perl script and all other files needed to run this application on your web server. Have a look at the ReadMe.txt file, included in the download archive, for details about the different files, and where to place them on the server.
The Perl script is rather long and I do not display the entire source code here. Some remarks, concerning how the application works:
  • The script first checks where it should read the protein sequence from.
    • If the Load protein sequence from local file checkbox is selected, the script assumes that the user wants to upload a file containing the sequence. The user therefore has to browse for the file before pushing the Calculate button, so that the script can get the filename and create a file handle (an error message is displayed otherwise). File upload is always a potential security risk: it can be abused to attack the web server and even the operating system. If you are not sure about this, you might want to have a look at my Uploading files using CGI and Perl tutorial (the tutorial example is my DNA molecular weight calculator application). If a valid filename is given, the script creates a file handle and reads the file content into a string variable (thus, no file is saved on the server here).
    • If the Load protein sequence from local file checkbox is not selected, the sequence is read from the text box.
  • The application web page is generated by reading a template file and replacing all custom tags by the corresponding actual values (template lines containing a tag start with '#', and all tags are placed between '#' symbols). In particular, the tag '#counts#' is replaced with an HTML table showing the counts and percentages of the different amino acids present in the sequence, and the tag '#classes#' is replaced by the links that show the corresponding category counts tables. These tables are actually part of the template file; the Perl script only replaces the category name and the count and percentage values.
  • The protein analysis routine iterates through the sequence (the FASTA header has been removed beforehand) and, for each amino acid, increments its counter as well as the counters of the category groups it belongs to. Before the protein is analyzed, another subroutine checks that the sequence consists of valid amino acid codes only.
  • The category tables are fields of an outer table's rows, these rows being defined with a numbered id and the property style="visibility:collapse". Clicking one of the category links calls a Javascript function that makes the corresponding row of the outer table (i.e. the corresponding category table) visible (hiding all others). For details about how this can be implemented, you might want to have a look at my tutorial Using Javascript to hide/show given text paragraphs.
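The template mechanism described in the list above can be sketched as follows. This is not the code of the actual script: the helper name fill_template is hypothetical, and the exact template layout (a leading '#' marking a line that holds tags, each tag written as #name#) is inferred from the description.

```perl
use strict;
use warnings;

# fill_template is a hypothetical helper. Assumed template format: a line
# holding tags starts with '#', and each tag is written as #name#.
sub fill_template {
    my ($template, $values) = @_;
    my @output;
    for my $line (split(/\n/, $template)) {
        # Only lines marked with a leading '#' are searched for tags.
        if ($line =~ s/^#//) {
            # Replace known tags; unknown tags are left unchanged.
            $line =~ s/#(\w+)#/exists $values->{$1} ? $values->{$1} : "#$1#"/ge;
        }
        push @output, $line;
    }
    return join("\n", @output) . "\n";
}

# The table content is a placeholder for the HTML generated by the script.
my $template = "<h1>Protein composition</h1>\n##counts#\n<p>Done.</p>\n";
print fill_template($template, { counts => '<table>...</table>' });
```

Marking tag lines with a leading '#' means that all other template lines can be copied to the output without any pattern matching.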
As mentioned above, the Perl script is too long to display in full here. The two subroutines that deal with the protein sequence itself, validation and analysis, are short enough to be shown.
Protein validation subroutine.
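    sub valid_protein {
        my ($protein) = @_;
        # 1-letter codes of the 20 standard proteinogenic amino acids
        my $aa = 'ACDEFGHIKLMNPQRSTVWY';
        # \A and \z anchor the whole string; unlike $, \z does not
        # tolerate a trailing newline. An empty sequence is invalid.
        return ($protein =~ /\A[$aa]+\z/) ? 1 : 0;
    }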
Protein analysis subroutine.
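    # Counts each amino acid and the category groups it belongs to.
    # $ref_amino_acids: hash keyed by 1-letter code, each entry holding a 'name' field
    # $ref_aa_counts:   hash of counters keyed by amino acid name
    sub analyze {
        my ($protein, $ref_amino_acids, $ref_aa_counts) = @_;
        my %amino_acids = %$ref_amino_acids; my %aa_counts = %$ref_aa_counts;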
        my @size_counts = (0, 0, 0, 0, 0); my @chain_counts = (0, 0, 0, 0, 0, 0, 0); my @polarity_counts = (0, 0);
        my @charge_counts = (0, 0, 0); my @hydro_counts = (0, 0, 0); my @requirement_counts = (0, 0, 0); my @function_counts = (0, 0, 0);
        my @sizes = ( 'AGS', 'CDNPT', 'EHQV', 'IKLMR', 'FWY' );
        my @chains = ( 'AIGLV', 'FHWY', 'P', 'CM', 'DE', 'KR', 'NQST' );
        my @polarities = ( 'DEHKNQRSTY', 'ACFGILMPVW' );
        my @charges = ( 'HKR', 'DE', 'ACFGILMNPSTQVWY' );
        my @hydros = ('ACFILMVW', 'DEKNQR', 'GHPSTY');
        # The third group must contain Q (glutamine); the original string listed N twice and omitted Q
        my @requirements = ('FIKLMTVW', 'HR', 'ACDEGNPQSY');
        my @functions = ('KL', 'FITWY', 'ACDEGHMNPQRSV');
        for (my $i = 0; $i < length($protein); $i++) {
            my $aa = substr($protein, $i, 1);
            $aa_counts{$amino_acids{$aa}{'name'}}++;
            for (my $j = 0; $j <= 4; $j++) {
                if ($aa =~ /[$sizes[$j]]/) {
                    $size_counts[$j]++;
                }
            }
            for (my $j = 0; $j <= 6; $j++) {
                if ($aa =~ /[$chains[$j]]/) {
                    $chain_counts[$j]++;
                }
            }
            for (my $j = 0; $j <= 1; $j++) {
                if ($aa =~ /[$polarities[$j]]/) {
                    $polarity_counts[$j]++;
                }
            }
            for (my $j = 0; $j <= 2; $j++) {
                if ($aa =~ /[$charges[$j]]/) {
                    $charge_counts[$j]++;
                }
                if ($aa =~ /[$hydros[$j]]/) {
                    $hydro_counts[$j]++;
                }
                if ($aa =~ /[$requirements[$j]]/) {
                    $requirement_counts[$j]++;
                }
                if ($aa =~ /[$functions[$j]]/) {
                    $function_counts[$j]++;
                }
            }
        }
        return (\%aa_counts, \@size_counts, \@chain_counts, \@polarity_counts, \@charge_counts, \@hydro_counts, \@requirement_counts, \@function_counts);
    }
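The category counting in analyze can also be written in a table-driven way, which is shown here only as a compact sketch for two of the classifications. The group strings are the ones used in analyze; the helper name class_counts and the labels given to the groups are assumptions made for this example.

```perl
use strict;
use warnings;

# Two of the classifications, with the same group strings as in analyze;
# the labels are assumptions of this sketch, not taken from the script.
my %classes = (
    polarity => { polar => 'DEHKNQRSTY', nonpolar => 'ACFGILMPVW' },
    charge   => { positive => 'HKR', negative => 'DE', neutral => 'ACFGILMNPSTQVWY' },
);

# class_counts is a hypothetical helper returning { class => { label => count } }.
sub class_counts {
    my ($protein) = @_;
    my %counts;
    for my $class (keys %classes) {
        for my $label (keys %{ $classes{$class} }) {
            my $group = $classes{$class}{$label};
            # Count all residues belonging to this group with one global match.
            $counts{$class}{$label} = () = $protein =~ /[$group]/g;
        }
    }
    return \%counts;
}

my $counts = class_counts('MKKLAD');
printf "polar: %d, nonpolar: %d\n",
    $counts->{polarity}{polar}, $counts->{polarity}{nonpolar};    # prints polar: 3, nonpolar: 3
```

Since every amino acid belongs to exactly one group per classification, the counts of each classification add up to the sequence length, which makes a convenient sanity check.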
If you want to place a link to the application on some other page, include the following into that page's HTML:
    <a href="/cgi-bin/proteins.pl">Quantitative analysis of proteins</a>
4. Related stuff on this site.
 
If you are looking for a protein analysis desktop application, you may like my Lazarus/Free Pascal GUI application AAStats, which counts and graphically displays (as charts) a protein's amino acids by number, volume (size), structure (class, side chain), polarity, charge, hydropathy, dietary requirement and metabolism. Click the following link to view the description of the "AAStats" PC application.

If you find this page helpful or if you like the Quantitative analysis of proteins web application, please support me and this website by signing my guestbook.