Working with Perl hashes.
A hash, sometimes called an associative array, is an unordered collection of key/value pairs. The hash keys are unique strings, each of which refers to a particular scalar value. These values make up the elements of the hash.
Hashes are one of Perl’s core data types. Hash variables are prefixed with a percent sign (%), just as array variables are prefixed with an @ sign. You
can declare a hash much as you do with other data types:
my %hash;
my %hash = ();
Both declarations produce an empty hash; the second one makes this explicit by assigning an empty list to it.
A hash can be initialized as a list of key/value pairs. Example: Here is a list of some amino acids, the hash keys being their 3-letter code,
and the hash values being their name:
my %amino_acids = ('Ala', 'alanine', 'Arg', 'arginine', 'Asn', 'aspargine');
Another (easier to read) way to write this initialization is as follows:
my %amino_acids = ('Ala' => 'alanine', 'Arg' => 'arginine', 'Asn' => 'aspargine');
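Note: If the key is a simple identifier (letters, digits and underscores only, not starting with a digit), the quotes around it may be omitted, both to the left of => and inside the curly brackets used to access a hash element.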
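As a minimal sketch of this quoting rule (the %residues hash and its 'Asp/Asn' key are made up for illustration and are not part of the tutorial data):

```perl
use strict; use warnings;

# Simple identifier keys need no quotes to the left of => or inside {}
my %amino_acids = (Ala => 'alanine', Arg => 'arginine');
$amino_acids{Cys} = 'cysteine';

# Keys containing spaces or other special characters must be quoted
# (hypothetical key, for illustration only)
my %residues = ('Asp/Asn' => 'aspartic acid or asparagine');

print "$amino_acids{Cys}\n";
print "$residues{'Asp/Asn'}\n";
```

When in doubt, quoting the key is always safe.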
A hash may also be created dynamically: Just add new key/value pairs. The same statement is used to add a new key/value pair and to
modify the value of an existing key/value pair. Examples:
$amino_acids{'Asn'} = 'asparagine';
$amino_acids{'Cys'} = 'cysteine';
The first assignment modifies the value referred to by the key "Asn" (correcting the typing error); the second one creates a new hash pair with key "Cys" and the corresponding
value "cysteine".
Note the use of curly brackets to access a hash element (remember that for an array, you have to use square brackets). Also note that here the variable name is prefixed with a dollar sign ($): hash elements are scalars (in our case, strings), so the scalar prefix has to be used.
It is not only possible to extract single elements from the hash; you may also extract a slice, i.e. a list of values. Examples:
$arg = $amino_acids{'Arg'};
($amino_acids1, $amino_acids2) = @amino_acids{'Ala', 'Cys'};
In the first example, $arg will contain the string "arginine"; in the second, $amino_acids1 will contain "alanine" and $amino_acids2 will contain "cysteine". Note that a slice is written with the @ prefix, because it yields a list.
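Slices work in both directions. The following sketch (with a small sample hash and the illustrative variable names $first and $second) reads two values at once and then assigns two new pairs in a single statement:

```perl
use strict; use warnings;

my %amino_acids = (Ala => 'alanine', Arg => 'arginine', Cys => 'cysteine');

# Reading a slice: note the @ prefix, because the result is a list
my ($first, $second) = @amino_acids{'Ala', 'Cys'};
print "$first, $second\n";

# Assigning to a slice adds or updates several pairs at once
@amino_acids{'Gly', 'Leu'} = ('glycine', 'leucine');
print scalar(keys %amino_acids), " pairs in the hash\n";
```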
You can remove a key/value pair from the hash by using the delete instruction:
delete $amino_acids{'Asn'};
This removes the key "Asn" with its associated value "asparagine" from the hash.
You can check if a hash key exists using the exists function. Example:
if (exists($amino_acids{$code3})) {
print "$amino_acids{$code3}\n";
}
else {
print "Unknown amino acid code $code3\n";
}
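A key can exist while its value is undefined, so exists and defined do not answer the same question. Here is a small sketch of the difference; the keys "Xaa" and "Zzz" and the %report hash are assumptions made for the example:

```perl
use strict; use warnings;

# 'Xaa' is a hypothetical key that exists but has no value yet
my %amino_acids = (Ala => 'alanine', Xaa => undef);

my %report;
for my $code3 ('Ala', 'Xaa', 'Zzz') {
    my $has_key   = exists($amino_acids{$code3})  ? 'yes' : 'no';
    my $has_value = defined($amino_acids{$code3}) ? 'yes' : 'no';
    $report{$code3} = "$has_key/$has_value";
    print "$code3: key exists = $has_key, value defined = $has_value\n";
}
```

Simply reading $amino_acids{'Zzz'} does not create the key, so the hash still has two pairs after the loop.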
You can get a list of all of the keys from a hash by using the keys function, and a list of all
of the values from a hash by using the values function:
@amino_acid_codes = keys %amino_acids;
@amino_acid_names = values %amino_acids;
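Please note that the hash key/value pairs are stored internally in no particular order, and this order may differ from one run of the script to the next. Thus, the lists above are not sorted either. However, as long as the hash is not modified, keys and values return their elements in the same relative order, so the name at a given position in @amino_acid_names belongs to the code at the same position in @amino_acid_codes.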
Example 1: Access all keys of a hash to display all keys plus their associated value (in our case: access all amino acid codes and display
the code plus the amino acid name):
foreach my $code3 (keys %amino_acids) {
print "$code3 $amino_acids{$code3}\n";
}
Example 2: Access all values of a hash to display them (in our case: access all amino acid names and display a list of them):
foreach my $aa (values %amino_acids){
print "$aa\n";
}
Example 3: Access all key/value pairs of a hash to display them (in our case: access all amino acid codes/names and display a list of them):
while (my ($code3, $name) = each %amino_acids) {
print "$code3 $name\n";
}
You can get the size of a hash, i.e. the number of its key/value pairs, by evaluating the array of its keys or of its values in scalar context (an array assigned to a scalar yields its number of elements):
@amino_acid_codes = keys %amino_acids; $number_of_amino_acids = @amino_acid_codes;
@amino_acid_names = values %amino_acids; $number_of_amino_acids = @amino_acid_names;
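The intermediate array is not strictly needed: keys itself returns the number of keys when evaluated in scalar context. A short sketch with a sample hash (the %empty hash is only there to show the emptiness test):

```perl
use strict; use warnings;

my %amino_acids = (Ala => 'alanine', Arg => 'arginine', Cys => 'cysteine');

# Assigning to a scalar puts keys into scalar context
my $number_of_amino_acids = keys %amino_acids;
print "Number of amino acids: $number_of_amino_acids\n";

# Inside a list (as in print), scalar() forces scalar context
print "Number of amino acids: ", scalar(keys %amino_acids), "\n";

# A hash used as a condition is false when it is empty
my %empty = ();
print "The hash is empty\n" if !%empty;
```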
As mentioned above, the hash key/value pairs are stored in no particular order. This means that if we want to display an alphabetically sorted list of
the hash keys or the hash values, we have to sort the keys or the values. Here are the sorted versions of the first two examples above:
foreach my $code3 (sort keys %amino_acids) {
print "$code3 $amino_acids{$code3}\n";
}
foreach my $code3 (sort {$amino_acids{$a} cmp $amino_acids{$b}} keys %amino_acids){
print "$amino_acids{$code3}\n";
}
Using hashes in a subroutine.
How do we pass a hash to a subroutine, or return a hash from a subroutine? If you know how this is done with arrays, you know the answer: by using a hash reference.
Example 1: Usage of a subroutine to print a hash:
use strict; use warnings;
my %amino_acids = (
'Ala' => 'alanine',
'Cys' => 'cysteine',
'Leu' => 'leucine',
'Val' => 'valine'
);
print_hash(\%amino_acids);
exit;
# Subroutine to print a hash
sub print_hash {
my $ref_hash = shift @_;
foreach my $key (sort keys %$ref_hash) {
print "$key $$ref_hash{$key}\n";
}
}
Example 2: Usage of a subroutine to fill a hash:
use strict; use warnings;
my $ref_amino_acids = fill_hash(); my %amino_acids = %$ref_amino_acids;
foreach my $code3 (sort keys %amino_acids) {
print "$code3 $amino_acids{$code3}\n";
}
exit;
# Subroutine to fill the hash
sub fill_hash {
my %aa = (
'Ala' => 'alanine',
'Cys' => 'cysteine',
'Leu' => 'leucine',
'Val' => 'valine'
);
return \%aa;
}
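Because the subroutine receives a reference, it can also modify the caller's hash directly. The sketch below uses a hypothetical helper named add_amino_acid and the arrow notation $ref->{key}, which is equivalent to the $$ref{key} form used above:

```perl
use strict; use warnings;

# Hypothetical helper: adds a pair to the caller's hash through the reference
sub add_amino_acid {
    my ($ref_hash, $code3, $name) = @_;
    $ref_hash->{$code3} = $name;       # same as $$ref_hash{$code3} = $name
    return scalar(keys %$ref_hash);    # number of pairs after the insertion
}

my %amino_acids = (Ala => 'alanine');
my $count = add_amino_acid(\%amino_acids, 'Gly', 'glycine');
print "$count pairs, Gly = $amino_acids{Gly}\n";
```

No copy of the hash is made here, which matters for large hashes.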
Hashes of arrays and arrays of hashes.
These data structures are manipulated in a similar way to arrays of arrays.
With a hash of arrays (meaning that the hash has values that are arrays), the arrays may be accessed using an
array-reference, and the array itself is obtained by dereferencing this reference. Example (with "Children" being a key with array value of the
hash "%person"):
my $ref_allchildren = $person{Children};
my @allchildren = @$ref_allchildren;
Individual elements of the array may be directly accessed using a "double index" {hash-key}[array-index]. Example:
print "$person{Children}[0]\n";
With an array of hashes (meaning that the array elements are hashes), a given hash may be accessed using a hash-reference,
and the hash itself is obtained by dereferencing this reference. Example (with "@employees" being an array with hash elements):
my $ref_employee3_data = $employees[3];
my %employee3_data = %$ref_employee3_data;
Individual elements of the hash of a given array element may be directly accessed using a "double index" [array-index]{hash-key}. Example:
print "$employees[1]{Lastname}\n";
Some code showing the usage of a hash with array value:
use strict; use warnings;
my %person = (
'Lastname' => 'Smith',
'Firstname' => 'Linda',
'Children' => ['Bob', 'Tom', 'Jennifer']
);
# Adding a new child (direct access to the array element using a "double index")
$person{Children}[3] = 'Kim';
# Changing the name of the third child (direct access to the array element using a "double index")
$person{Children}[2] = 'Jenny';
# Deleting the second child from the array (direct access to the array element using a "double index")
delete($person{Children}[1]);
# Extracting the names of all children (using an array-reference)
my $ref_allchildren = $person{Children}; my @allchildren = @$ref_allchildren;
print "$allchildren[2]\n";
# Extracting the name of the first child (direct access to the array element using a "double index")
print "$person{Children}[0]\n\n";
# Displaying all children (dereferencing the reference to the array)
$ref_allchildren = $person{Children};
for my $child (@$ref_allchildren) {
if (defined($child)) {
print "$child\n";
}
}
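Note: Checking that the array element is defined is necessary, because delete behaves differently on arrays than on hashes. In a hash, the key/value pair is effectively removed. In an array, deleting an element that is not the last one leaves a gap that reads as undef, without changing the size of the array (= the number of its elements). For this reason, the Perl documentation strongly discourages calling delete on array elements; use splice if you really want to remove an element.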
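As an alternative sketch (not the approach used in the script above), push and splice operate on the dereferenced array and leave no undefined gaps behind:

```perl
use strict; use warnings;

my %person = (
    Lastname  => 'Smith',
    Firstname => 'Linda',
    Children  => ['Bob', 'Tom', 'Jennifer'],
);

push @{ $person{Children} }, 'Kim';     # append a child at the end
splice @{ $person{Children} }, 1, 1;    # remove 1 element at index 1 (Tom)

my $number_of_children = scalar @{ $person{Children} };
print "$number_of_children children: @{ $person{Children} }\n";
```

After the splice, the remaining elements move up, so no defined() test is needed when iterating.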
Some code showing the usage of an array of hashes:
use strict; use warnings;
my @employees = (
{ 'Lastname' => 'Smith', 'Firstname' => 'John', 'Salary' => 25000 },
{ 'Lastname' => 'Smith', 'Firstname' => 'Linda', 'Salary' => 22500 },
{ 'Lastname' => 'Jones', 'Firstname' => 'Lisa', 'Salary' => 26400 }
);
# Adding a new employee (using a reference to a hash)
my %new_employee = ( 'Lastname' => 'Burns', 'Firstname' => 'Mark', 'Salary' => 23800 );
$employees[3] = \%new_employee;
# Changing the salary of an employee (direct access to the hash element using a "double index")
$employees[1]{Salary} = 23000;
# Extracting all data concerning a given employee (using a reference to the hash)
my $ref_employee3_data = $employees[3]; my %employee3_data = %$ref_employee3_data;
print "$employee3_data{Lastname} $employee3_data{Firstname} $employee3_data{Salary}\n";
# Extracting data concerning a given employee (direct access to the hash elements using "double indices")
print "$employees[1]{Lastname} $employees[1]{Firstname} $employees[1]{Salary}\n\n";
# Displaying all employees (dereferencing the reference to the hashes)
for my $employee (@employees) {
my %employee_data = %$employee;
print "$employee_data{Lastname} $employee_data{Firstname} $employee_data{Salary}\n";
}
How can we sort the elements of an array stored in a hash? Simply by sorting the array that we obtain by dereferencing the array-reference stored as
hash value. Here is the code to display the children's names in alphabetical order:
use strict; use warnings;
my %person = (
'Lastname' => 'Smith',
'Firstname' => 'Linda',
'Children' => ['John', 'Kim', 'Frank', 'Eve', 'Adam', 'Jessie']
);
my $ref_allchildren = $person{Children};
my @sorted_children = sort @$ref_allchildren;
for my $child (@sorted_children) {
print "$child\n";
}
And how do we sort an array of hashes by one of the values of these hashes? In this case, the data structure to be sorted is an array and its
elements are hash-references. Within the sort block, $a and $b are therefore hash-references, and we can access the values to compare by dereferencing them with the
-> operator. Here is the code to display the employees list, sorted in descending order by salary:
use strict; use warnings;
my @employees = (
{ 'Lastname' => 'Smith', 'Firstname' => 'John', 'Salary' => 25000 },
{ 'Lastname' => 'Smith', 'Firstname' => 'Linda', 'Salary' => 23000 },
{ 'Lastname' => 'Jones', 'Firstname' => 'Lisa', 'Salary' => 26400 },
{ 'Lastname' => 'Burns', 'Firstname' => 'Mark', 'Salary' => 23800 }
);
my @employees_sorted = sort { $b->{Salary} <=> $a->{Salary} } @employees;
for my $employee (@employees_sorted) {
print "$employee->{Lastname} $employee->{Firstname} $employee->{Salary}\n";
}
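When several employees have the same salary, a second criterion can be chained with "or". This is a sketch with made-up data (Mark's salary is set to 25000 here only to produce a tie); the new employee is appended with push and an anonymous hash:

```perl
use strict; use warnings;

my @employees = (
    { Lastname => 'Smith', Firstname => 'John', Salary => 25000 },
    { Lastname => 'Jones', Firstname => 'Lisa', Salary => 26400 },
);
# Appending an anonymous hash (salary chosen to create a tie with John)
push @employees, { Lastname => 'Burns', Firstname => 'Mark', Salary => 25000 };

# Salary descending; for equal salaries, last name ascending
my @employees_sorted = sort {
    $b->{Salary} <=> $a->{Salary}
        or
    $a->{Lastname} cmp $b->{Lastname}
} @employees;

for my $employee (@employees_sorted) {
    print "$employee->{Lastname} $employee->{Firstname} $employee->{Salary}\n";
}
```

The second comparison is only evaluated when the first one returns 0, i.e. when the salaries are equal.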
Bidimensional hashes.
A bidimensional hash is a hash of hashes, i.e. a hash where the value of each key is (a reference to) another hash, itself composed of keys and their values. As in the cases that we saw before, the hashes that are values of the primary hash may be accessed by hash-references. Individual values of the secondary hashes may be accessed using a "double index" {primary-hash-key}{secondary-hash-key}.
Here is some code that shows how to work with a bidimensional hash.
use strict; use warnings;
my %grades = (
'Smith Ben' => {
'Mathematics' => 92, 'Science' => 88, 'Literature' => 73, 'Art' => 67
},
'Jones Kim' => {
'Mathematics' => 75, 'Science' => 78, 'Literature' => 93, 'Art' => 97
}
);
# Adding the student Smith Jenny (using a hash-reference)
my %grades_jenny = (
'Art' => 80, 'Literature' => 85, 'Mathematics' => 90, 'Science' => 91
);
$grades{'Smith Jenny'} = \%grades_jenny;
print "\nSmith Jenny - Science: $grades{'Smith Jenny'}{'Science'}\n";
# Adding the student Burns Tom (direct access to the hash element using a "double index")
$grades{'Burns Tom'}{'Mathematics'} = 84;
$grades{'Burns Tom'}{'Literature'} = 78;
$grades{'Burns Tom'}{'Science'} = 86;
print "Burns Tom - Literature: $grades{'Burns Tom'}{'Literature'}\n";
# Changing a given grade of a given student (direct access to the hash element using a "double index")
$grades{'Smith Ben'}{'Art'} = 64;
# Deleting a given grade of a given student (direct access to the hash element using a "double index")
delete($grades{'Smith Ben'}{'Science'});
# Displaying all grades of a given student (using a hash-reference)
my $ref_grades_ben = $grades{'Smith Ben'}; my %grades_ben = %$ref_grades_ben;
print "\nGrades of Smith Ben:\n";
while (my ($subject, $grade) = each %grades_ben) {
print " $subject: $grade\n";
}
# Displaying the science grades of all students (direct access to the hash element using a "double index")
print "\nScience grades:\n";
for my $student (keys %grades) {
if (exists $grades{$student}{'Science'}) {
print " $student: $grades{$student}{'Science'}\n";
}
else {
print " $student: absent\n";
}
}
Note: If, in the last example, we don't make the if (exists $grades{$student}{'Science'}) test, we'll get a "Use of uninitialized value" warning message if the student was absent for some subject (i.e. when the corresponding secondary key does not exist).
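One caveat with the "double index": testing a secondary key of a primary key that does not exist creates this primary key as a side effect (this is called autovivification). A small sketch, with 'Nobody' as a made-up student name:

```perl
use strict; use warnings;

my %grades = ('Smith Ben' => { Mathematics => 92 });

# Testing a secondary key of an unknown student creates that student
my $has_science = exists $grades{'Nobody'}{Science};
my $created     = exists $grades{'Nobody'};
print "Science grade found: ", ($has_science ? 'yes' : 'no'), "\n";
print "Entry for Nobody created: ", ($created ? 'yes' : 'no'), "\n";

delete $grades{'Nobody'};    # undo the side effect

# Testing the primary key first avoids the problem
my $safe_test    = exists $grades{'Nobody'} && exists $grades{'Nobody'}{Science};
my $still_absent = !exists $grades{'Nobody'};
print "Entry for Nobody still absent: ", ($still_absent ? 'yes' : 'no'), "\n";
```

In the loops of this tutorial the student names come from keys %grades, so the primary key always exists and the problem does not arise.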
Now let's see how to sort a hash of hashes in different ways. For all examples that follow, we'll use the following data
structure:
my %grades = (
'Smith Ben' => {
'Mathematics' => 92,
'Literature' => 73,
'Art' => 64
},
'Jones Kim' => {
'Mathematics' => 75,
'Science' => 78,
'Literature' => 93,
'Art' => 97
},
'Smith Jenny' => {
'Art' => 80,
'Literature' => 85,
'Mathematics' => 90,
'Science' => 91
},
'Burns Tom' => {
'Literature' => 78,
'Mathematics' => 84,
'Science' => 86
}
);
Example 1: Displaying the grades of all students for a given subject, ordered by student name. This is nothing
more than sorting the primary hash by key (and accessing the student's grade using a "double index").
# Displaying the Science grades of all students, ordered by student name
print "\nScience grades:\n";
for my $student (sort keys %grades) {
if (exists $grades{$student}{'Science'}) {
print " $student: $grades{$student}{'Science'}\n";
}
else {
print " $student: absent\n";
}
}
Example 2: Displaying all grades of a given student, ordered by subject. This is nothing more than sorting a given
secondary hash (that we get by dereferencing the reference stored as value for the corresponding primary key) by key.
# Displaying all grades of Ben, ordered by subject
my $ref_grades_ben = $grades{'Smith Ben'}; my %grades_ben = %$ref_grades_ben;
print "\nGrades of Smith Ben:\n";
for my $subject (sort keys %grades_ben) {
print " $subject: $grades_ben{$subject}\n";
}
Example 3: Displaying all grades of all students, ordered by student and subject. This is obviously a combination
of the procedures from the two previous examples. Note that the only reason for using the variable $oldstudent is to avoid printing
the student name together with each subject (in other words, to print it only once).
# Displaying all grades of all students, ordered by student name and subject
my $oldstudent = '';
print "\nAll students grades:\n";
for my $student (sort keys %grades) {
my $ref_grades_student = $grades{$student};
for my $subject (sort keys %$ref_grades_student) {
if ($student ne $oldstudent) {
print " $student:\n";
$oldstudent = $student;
}
print " $subject: $grades{$student}{$subject}\n";
}
}
The screenshot below shows the output of the 3 examples (run as a single script).
Example 4: Displaying all grades of a given student, ordered by grade. This is essentially the same procedure
as in example 2. The difference is that instead of sorting by the secondary hash keys, we sort by these keys' values.
# Displaying all grades of Kim, ordered by grade
my $ref_grades_kim = $grades{'Jones Kim'}; my %grades_kim = %$ref_grades_kim;
print "\nGrades of Jones Kim:\n";
for my $subject (sort {$grades_kim{$b} <=> $grades_kim{$a}} keys %grades_kim) {
print " $subject: $grades_kim{$subject}\n";
}
Example 5: Displaying the grades of all students for a given subject, ordered by grade. In this somewhat more
complex case, we make a sorted list of the students (i.e. the keys of the primary hash), the sort routine comparing the grades for the subject concerned. We then
iterate over the list, displaying each student's name and grade for this subject. Here is a draft of the code (a draft, because I will give an enhanced version of the
code below).
# Displaying the art grades of all students, ordered by grade (draft, resulting in Perl warnings)
print "\nArt grades:\n";
my @students = sort {$grades{$b}{'Art'} <=> $grades{$a}{'Art'} } keys %grades;
foreach my $student (@students) {
if (defined($grades{$student}{'Art'})) {
print " $student: $grades{$student}{'Art'}\n";
}
}
The problem with this code is that, when creating the sorted students list, we compare the art grade of each student with those of the others. That means that if a student has no art grade, we do a numerical comparison of a number with an undefined value. The result is that we'll get several "Use of uninitialized value in numeric comparison" warnings.
One way to solve this problem would be (after the script has been tested and is supposed to run correctly) to simply turn the warnings off. But is this really good practice? The following code shows how I solved the problem by making a copy of the hash, in which all missing art grades are replaced by -1, thus ensuring that all comparisons are between two numbers. Note that the secondary hashes have to be copied, too: a plain %grades_temp = %grades would only copy the references, and setting a grade to -1 would then also modify the original %grades.
# Displaying the art grades of all students, ordered by grade
my %grades_temp;
for my $student (keys %grades) {
    my %copy = %{ $grades{$student} };    # copy the secondary hash, too
    if (!defined($copy{'Art'})) {
        $copy{'Art'} = -1;
    }
    $grades_temp{$student} = \%copy;
}
print "\nArt grades:\n";
my @students = sort { $grades_temp{$b}{'Art'} <=> $grades_temp{$a}{'Art'} } keys %grades_temp;
foreach my $student (@students) {
    unless ($grades_temp{$student}{'Art'} == -1) {
        print " $student: $grades_temp{$student}{'Art'}\n";
    }
}
%grades_temp = ();
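Another possibility, sketched here as an alternative to the -1 placeholder (it is not the solution used in this tutorial), is to filter out the students without an art grade with grep before sorting. The data is a reduced version of the grades hash:

```perl
use strict; use warnings;

my %grades = (
    'Smith Ben' => { Mathematics => 92, Art => 64 },
    'Jones Kim' => { Mathematics => 75, Art => 97 },
    'Burns Tom' => { Mathematics => 84 },
);

# Keep only the students who have an art grade, then sort those by grade
my @students = sort { $grades{$b}{Art} <=> $grades{$a}{Art} }
               grep { defined $grades{$_}{Art} } keys %grades;

print "\nArt grades:\n";
for my $student (@students) {
    print " $student: $grades{$student}{Art}\n";
}
```

Since the sort block only ever sees students with a defined grade, no warning is issued and no copy of the hash is needed.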
Example 6: Displaying all grades of all students, ordered by student and grade. This case is rather simple again.
In fact, all we have to do is to make a list of the students sorted by their names (i.e. the keys of the primary hash), and then proceed, for each student in the list,
the same way as we did in example 4.
# Displaying all grades of all students, ordered by student and grade
print "\nAll students grades:\n";
my @students = sort keys %grades;
foreach my $student (@students) {
my $ref_grades_student = $grades{$student}; my %grades_student = %$ref_grades_student;
print "\nGrades of $student:\n";
for my $subject (sort {$grades_student{$b} <=> $grades_student{$a}} keys %grades_student) {
print " $subject: $grades_student{$subject}\n";
}
}
The screenshot below shows the output of example 6.
Example 7: Displaying all grades of all students, ordered by subject and student name. I think that this one is really tough! The reason is that the first sort criterion is a key of the inner hashes. I guess that there is a solution that is (perhaps a lot) better than mine. On the other hand, my solution works, and I think that the code is rather easy to understand. How do I proceed? I create a sorted list of the student names (which are the keys of the primary hash), and a second one of the subjects (which are the keys of the secondary hashes). Then, for each element in the subjects list, I iterate through the students list and display the name and the grade.
However, there is an additional problem. Which secondary hash should I take the subject keys from? Does this make a difference? Yes, it does! In fact, there are students who have no grade for a given subject, so the key does not exist in their hash. If the student that I choose to create my subjects list is among them, nothing would be displayed for this subject. That's why I iterate over the students array, look at the secondary hash of each student, and build my list from the subjects of the student who has grades for the largest number of subjects. Of course, this is not 100% safe: it only works if at least one student has a grade for every subject. But the case where none of the students has a grade for all subjects may really be considered as one that will probably never occur in real life...
Note the necessity of the if (defined($grades{$student}{$subject})) statement to avoid "Use of uninitialized value" warnings each time a student has no grade for some subject.
By the way, also note that @subjects is declared outside the first foreach block, and that there is no my in front of @subjects within the block! This is a matter of variable scope: a my declaration inside the block would create a new variable that disappears at the end of the block, leaving the outer @subjects empty.
Here is my (probably not the best, but "in normal cases" correctly working) code:
# Displaying all grades of all students, ordered by subject and student name
print "\nAll students grades by subject:\n";
my @students = sort keys %grades; my @subjects = (); # make sorted list of students
foreach my $student (@students) { # make sorted list of subjects (keep the one with the maximum number of elements)
my $ref_subjects_student = $grades{$student};
if (scalar keys %$ref_subjects_student > scalar @subjects) {
@subjects = sort keys %$ref_subjects_student;
}
}
foreach my $subject (@subjects) { # for each subject, iterate the students list and display the student's name and grade
print "\n $subject:\n";
foreach my $student (@students) {
if (defined($grades{$student}{$subject})) { # exclude the students who have no grade for this subject
print " $student: $grades{$student}{$subject}\n";
}
}
}
And here is the screenshot of the output of example 7.
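If you do not want to rely on one student having grades for all subjects, the subjects list can be built as the union of the keys of all secondary hashes, using a temporary hash as a set. This is a sketch of that alternative with reduced data (the %seen hash is an illustrative name, and no student has all three subjects here):

```perl
use strict; use warnings;

my %grades = (
    'Smith Ben' => { Mathematics => 92, Art => 64 },
    'Burns Tom' => { Mathematics => 84, Science => 86 },
);

# Use a hash as a set: every subject of every student becomes a key once
my %seen;
for my $student (keys %grades) {
    for my $subject (keys %{ $grades{$student} }) {
        $seen{$subject} = 1;
    }
}
my @subjects = sort keys %seen;

for my $subject (@subjects) {
    print "\n $subject:\n";
    for my $student (sort keys %grades) {
        next unless defined $grades{$student}{$subject};   # no grade for this subject
        print "  $student: $grades{$student}{$subject}\n";
    }
}
```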
Example 8: Displaying all grades of all students, ordered by subject and grade. As for example 7, I guess that
there are better solutions than mine... The solution given here works correctly (provided that at least one of the students has grades for all subjects). The procedure
is similar to the one in the previous example. The difference is that for each subject, we have to iterate over a different students list, the list having to be sorted by
grade (and having to be created within the outer loop for each subject). Also, as there is a sort operation on the grades, we'll have to avoid the "Use of uninitialized
value in numeric comparison" issue (cf. example 5). Note that the code below writes the -1 placeholders into %grades itself, so it should be the last operation done on the hash (or be run on a copy). Here is my code:
# Displaying all grades of all students, ordered by subject and grade
print "\nAll students grades by subject and grade:\n";
# Make list of all students (we'll create sorted lists for each subject later)
my @students = keys %grades; my @subjects = ();
# Make sorted list of all subjects
foreach my $student (@students) {
my $ref_subjects_student = $grades{$student};
if (scalar keys %$ref_subjects_student > scalar @subjects) { # keep the list with the maximum number of subjects
@subjects = sort keys %$ref_subjects_student;
}
}
# Perform the outer loop for each subject in the sorted subjects list
foreach my $subject (@subjects) {
print "\n $subject:\n";
# Replace undef variables with -1 to avoid undefined numeric values during sort
foreach my $student (@students) {
if (!defined($grades{$student}{$subject})) {
$grades{$student}{$subject} = -1;
}
}
# Using the initial students list, create another students list for the actual subject and sorted by grade
my @students_subject = sort { $grades{$b}{$subject} <=> $grades{$a}{$subject} } @students;
# Perform the inner loop (display) for each student in the students list sorted by grade
foreach my $student (@students_subject) {
unless ($grades{$student}{$subject} == -1) { # exclude the students who have no grade for this subject
print " $student: $grades{$student}{$subject}\n";
}
}
}
The screenshot shows the output of example 8.
Creating a hash database.
A hash, defined in some script, can be made permanent by saving it to disk as a DBM file (DBM, short for "database manager", is a family of simple key/value database libraries; at least one of them is included with virtually all Perl distributions). This allows other scripts to access the hash data from disk, and the nice thing about all this is that we can manipulate the hash on disk in exactly the same way as a hash defined within a script. All that we have to do is to tie the hash to the DBM file. This may be done using the dbmopen function.
To tie a hash to a DBM file with dbmopen, use some code like the following.
my %hash = (); my $filename = ...;
dbmopen %hash, $filename, 0666
or die "Cannot open file $filename:$!\n";
...
dbmclose(%hash);
If the file specified by $filename does not exist, it will be created, and the script will have to fill %hash (this will fill the hash database, too). If the file does exist, the hash gives access to the data stored in the database. In either case, you can use the hash as you would any ordinary hash, and all modifications that you make to the hash will automatically be made to the database.
Concerning the filename that you specify, please note that it is the base filename of the one or two files created on disk (depending on the DBM version). The third parameter of the dbmopen function defines the UNIX file permissions used when the file is created (as modified by the umask): 0666 = anyone can read and write; 0644 = you can read and write, others can just read; 0600 = only you can read and write, others cannot access the file; 0444 = anyone can read, nobody can write. I have no idea what effect this parameter has on a Windows platform...
As an example, here is the script aa1.pl, which creates a DBM file containing a hash with the 20 amino acid 1-letter codes and names.
use strict; use warnings;
my %amino_acids;
dbmopen(%amino_acids, 'amino_acids', 0666)
or die "cannot create amino acids file\n";
%amino_acids = (
'A' => 'alanine', 'C' => 'cysteine', 'D' => 'aspartic acid', 'E' => 'glutamic acid', 'F' => 'phenylalanine',
'G' => 'glycine', 'H' => 'histidine', 'I' => 'isoleucine', 'K' => 'lysine', 'L' => 'leucine',
'M' => 'methionine', 'N' => 'asparagine', 'P' => 'proline', 'Q' => 'glutamine', 'R' => 'arginine',
'S' => 'serine', 'T' => 'threonine', 'V' => 'valine', 'W' => 'tryptophan', 'Y' => 'tyrosine'
);
dbmclose(%amino_acids);
And here is the script aa2.pl, which reads the DBM file and displays the amino acids list.
use strict; use warnings;
my %aa;
dbmopen(%aa, 'amino_acids', 0444)
or die "cannot open amino acids file\n";
print "\nAmino acids list by 1-letter code\n";
for my $code1 (sort keys %aa) {
print " $code1 $aa{$code1}\n";
}
dbmclose(%aa);
The screenshot shows the execution of the two scripts. The directory listing shows that there have actually been two files created: a file with the extension .dir, and a file with the extension .pag.
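The Perl documentation notes that dbmopen has been largely superseded by the more general tie function. The following sketch shows the same idea with tie and the SDBM_File module that ships with Perl; the use of a temporary directory (File::Temp) is an assumption made only so that the example leaves no files behind:

```perl
use strict; use warnings;
use Fcntl;                    # provides O_RDWR and O_CREAT
use SDBM_File;                # one of the DBM modules shipped with Perl
use File::Temp qw(tempdir);

# Assumption: a throwaway directory, removed when the script ends
my $dir      = tempdir(CLEANUP => 1);
my $filename = "$dir/amino_acids";

# Create the database and store two pairs
my %amino_acids;
tie(%amino_acids, 'SDBM_File', $filename, O_RDWR | O_CREAT, 0666)
    or die "Cannot tie $filename: $!\n";
$amino_acids{A} = 'alanine';
$amino_acids{C} = 'cysteine';
untie %amino_acids;           # close the database

# Reopen the same file with another hash: the data comes from disk
my %aa;
tie(%aa, 'SDBM_File', $filename, O_RDWR, 0666)
    or die "Cannot tie $filename: $!\n";
my @codes = sort keys %aa;
my $name_of_c = $aa{C};
for my $code1 (@codes) {
    print " $code1 $aa{$code1}\n";
}
untie %aa;
```

With tie you choose the DBM module explicitly, whereas dbmopen uses whichever one your Perl installation selects by default.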