This class uses Perl on Linux. You can download and install ActivePerl for Windows and Linux. We use the Learning Perl book as the reference.
From the Linux command line, type in the following line:
> perl -e 'print "Hello, world.\n"'
Create a file called hello.pl containing the following:
#!/usr/bin/perl
print "Hello, world!\n";
Then, from the command line, type:
> perl hello.pl
Alternatively, you can make hello.pl an executable file by doing this:
> chmod u+x hello.pl
Then you can simply type
> hello.pl
to run the code. However, this assumes you have the current directory in the environment variable PATH. If you do not, then you can run the code by typing this:
> ./hello.pl
While you are still learning the language, it is a good idea to let Perl warn you when something looks wrong. To turn warnings on, add the -w switch to the first line of your Perl code:
#!/usr/bin/perl -w
LWP is 'libwww-perl': http://lwp.linpro.no/lwp/. Online help: http://www.perldoc.com/perl5.6/lib/LWP/Simple.html.
#!/usr/bin/perl -w
use LWP::Simple;
$AccessionNumber = "AA494997";
$dnaSeq = DNA($AccessionNumber);
# Save the sequence in FASTA format.
open(OUT,">ex_blast.fa") || die;
print OUT ">$AccessionNumber\n";
print OUT $dnaSeq;
close(OUT);
# Run BLAST on the sequence and make the HTML report readable by others.
chdir "/indirect/nrd4/Ming/blast/";
system("./blastall -p blastx -d nr -i ~/II/5004-Fall2004/ex_blast.fa -e 0.001 -o ~/public_html/ex_blast.html -T");
system("chmod go+rx ~/public_html/ex_blast.html");
# Return the first DNA sequence found for an accession number,
# or an empty string if nothing is found.
sub DNA{
    $aNum = $_[0];
    $rtn = "";
    # Search Entrez, then follow the link of the first hit.
    $page = get("http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=nucleotide&cmd=search&term=$aNum");
    if ($page =~ m/<b>1:\s<\/b><a\shref="(.*?)"/){
        $url = "http://www.ncbi.nlm.nih.gov".$1."&view=xml";
    }
    else{
        return $rtn;
    }
    # Fetch the record as XML and pull out the sequence(s).
    $page = get($url);
    @DNAs = ($page =~ m/IUPACna>(.*)<\/IUPACna/gm);
    if (scalar(@DNAs) >= 1){
        $rtn = $DNAs[0];
        if (scalar(@DNAs) > 1){
            print "$aNum has ", scalar(@DNAs), " sequences\n";
        }
    }
    return $rtn;
}
http://www.ncbi.nlm.nih.gov/Class/PowerTools/blast/rules.html
| scalar | $ | an individual value |
| array | @ | a list of values, keyed by number |
| hash | % | a group of values, keyed by string |
| subroutine | & | a callable chunk of Perl code |
double quotes ": interpolation
single quotes ': no interpolation
backquotes `: run an external program and return output
| $answer = 42; | # an integer |
| $pi = 3.14159265; | # a "real" number |
| $avocados = 6.02e23; | # scientific notation |
| $pet = "Camel"; | # string |
| $sign = "I love my $pet"; | # string with interpolation |
| $cost = 'It costs $100'; | # string without interpolation |
| $thence = $whence; | # another variable's value |
| $salsa = $moles * $avocados; | # a gastrochemical expression |
| $exit = system("vi $file"); | # numeric status of a command |
| $cwd = `pwd`; | # string output from a command |
You can choose your own quote delimiters with q//, qq//, qw//, and qx//; see Table 2-3 on page 63 of the camel book.
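As a rough illustration of the four quote operators, the sketch below uses parentheses as the delimiters; any matching pair or repeated character would do. The variable names are made up for this example, and qx(pwd) assumes a Unix-like system where pwd exists.

```perl
use strict;
use warnings;

my $pet = "Camel";

my $single = q(I love my $pet);       # like '...': no interpolation
my $double = qq(I love my $pet);      # like "...": interpolation
my @words  = qw(couch chair table);   # a list of quoted words, no commas needed
my $cwd    = qx(pwd);                 # like `pwd`: output of an external command

print "$single\n$double\n@words\n";
```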
We have seen "\n" for a newline. "\t" is a horizontal tab.
Scalars can be references to complicated structures. For example,
| $ary = \@myarray; | # reference to a named array |
| $hsh = \%myhash; | # reference to a named hash |
| $sub = \&mysub; | # reference to a named subroutine |
| $ary = [1,2,3,4,5]; | # reference to an unnamed array |
| $hsh = {Na => 19, Cl => 35}; | # reference to an unnamed hash |
| $sub = sub { print $state }; | # reference to an unnamed subroutine |
| $fido = new Camel "Amelia"; | # reference to an object |
Perl automatically converts between numbers and strings, depending on the context in which a value is used. For example,
$camels = '123';
print $camels + 1, "\n";
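A small sketch of the same idea: the operator, not the variable, decides whether a value is treated as a number or as a string. The extra variable names are only for illustration.

```perl
use strict;
use warnings;

my $camels = '123';            # a string that looks like a number

my $sum    = $camels + 1;      # + asks for numbers: 124
my $joined = $camels . 1;      # . asks for strings: "1231"
my $repeat = $camels x 2;      # x repeats a string: "123123"

print "$sum\n$joined\n$repeat\n";
```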
@home = ("couch", "chair", "table", "stove");
($potato, $lift, $tennis, $pipe) = @home;
$home[0] = "couch";
$home[1] = "chair";
$home[2] = "table";
$home[3] = "stove";
We can use a "list assignment" to swap two variables:
($alpha,$omega) = ($omega,$alpha);
You can get the length of the array by scalar(@array). Another useful notation: $#array is one less than the length of the array, that is, the subscript of the last element of the array.
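For instance, with the four-element @home array used earlier, the two notations could be compared like this (a minimal sketch; the scalar variable names are made up):

```perl
use strict;
use warnings;

my @home = ("couch", "chair", "table", "stove");

my $length     = scalar(@home);   # number of elements: 4
my $last_index = $#home;          # subscript of the last element: 3
my $last_item  = $home[$#home];   # "stove"; $home[-1] names the same element

print "$length $last_index $last_item\n";   # 4 3 stove
```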
%longday = ("Sun", "Sunday", "Mon", "Monday", "Tue", "Tuesday", "Wed",
"Wednesday", "Thu", "Thursday", "Fri", "Friday", "Sat", "Saturday");
%longday = (
"Sun" => "Sunday",
"Mon" => "Monday",
"Tue" => "Tuesday",
"Wed" => "Wednesday",
"Thu" => "Thursday",
"Fri" => "Friday",
"Sat" => "Saturday",
);
$wife{"Adam"} = "Eve";
You can make an array of hashes, a hash of arrays, or more complicated structures. See pages 12-13 of the camel book.
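As a sketch of what such nested structures might look like, the example below builds a hash of arrays and an array of hashes from anonymous references. The gene and person names are invented purely for illustration.

```perl
use strict;
use warnings;

# Hash of arrays: each value is a reference to an anonymous array.
my %genes_of = (
    human => ["JAK3", "IL2"],
    mouse => ["Il2"],
);
push @{ $genes_of{mouse} }, "Jak3";       # append through the reference

# Array of hashes: each element is a reference to an anonymous hash.
my @people = (
    { firstname => "Adam", lastname => "Smith" },
    { firstname => "Eve",  lastname => "Jones" },
);

print "human genes: @{ $genes_of{human} }\n";
print "second mouse gene: $genes_of{mouse}[1]\n";
print "first person: $people[0]{firstname} $people[0]{lastname}\n";
```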
Global variables are visible everywhere.
$a = 0;
mySub(5);
sub mySub{
my($parameter) = @_;
print "$a\n";
print "$parameter\n";
}
Variables declared with local are visible in the block and in any subroutines called from within the block.
$a = 3;
{ local $a = 0;
mySub(5);
}
mySub(5);
sub mySub{
my($parameter) = @_;
print "$a\n";
print "$parameter\n";
}
Variables declared with my are visible only within the enclosing block; subroutines called from the block do not see them.
$a = 3;
{ my $a = 0;
mySub(5);
}
sub mySub{
my($parameter) = @_;
print "$a\n";
print "$parameter\n";
}
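The three cases can be put side by side in one script. This is a sketch under a few assumptions: it uses a global named $x (declared with our so that it passes use strict) instead of $a, and a hypothetical helper show() that records what it sees.

```perl
use strict;
use warnings;

our $x = 3;                      # a global (package) variable
my @seen;                        # records what show() observes

sub show { push @seen, $x; }     # always reads the global $x

show();                          # sees 3
{
    local $x = 0;                # temporarily changes the global
    show();                      # sees 0
}
{
    my $x = 7;                   # a new variable, private to this block
    show();                      # still sees the global: 3
}
show();                          # the change made by local was undone: 3

print "@seen\n";                 # 3 0 3 3
```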
$_, @ARGV, @INC, %ENV, STDIN, STDOUT, STDERR
An example using these special global variables:
#!/usr/bin/perl -w
while (@ARGV){
$ARGV = shift @ARGV;
print "$ARGV\n";
}
print $INC[0], "\n";
print $ENV{"PATH"}, "\n";
print "enter something\n";
while (<STDIN>){
print "stdout: $_";
print STDERR "stderr: $_";
}
Arithmetic operators: +, -, *, /, **, %, ++, --
String concatenation: .
Assignment operators: =, +=, -=, *=, /=, .=
Numeric comparison operators: ==, !=, <, <=, >, >=, <=>
String comparison operators: eq, ne, lt, gt, le, ge, cmp
Logical operators: &&, ||, !, and, or, not
Conditional operator: ?:
Pattern matching: =~, !~, m//, s///, tr///
#if
if ($a>$b){
    print $a;
}
#if-else
if ($a>$b){
    print $a;
}
else{
    print $b;
}
#if-elsif-else
if ($a>$b){
    print $a;
}
elsif ($a>$c){
    print $c;
}
else{
    print $b;
}
while(<STDIN>){
print;
}
unless ($a>$b){
print;
}
for($I=0;$I<10;$I++){
print $I, "\n";
}
foreach $I (@array) {
print $I, "\n";
}
Alternatively, use the default global variable $_:
# Each element is assigned to $_ in turn
foreach (@array)
{
print $_; # prints the element
print; # prints the element
}
$I = 11;
while ($I>0){
$I--;
if ($I==4) {next;}
# next if ($I==4);
print $I, "\n";
}
$I = 11;
while ($I>0){
if ($I==4) {last;}
# last if ($I==4);
print $I--, "\n";
}
Use the angle operator <> for line input. An example of opening a file for input:
open(IN,"filename") || die;
while (<IN>){
print $_;
}
close(IN);
An example of opening a file for output:
open(OUT,">filename") || die;
for($I=1;$I<=10;$I++){
print OUT $I, "\n";
#note that there is no comma after OUT
}
close(OUT);
In the above example, if there is an existing file of that name, it will be erased. An example of opening a file for appending:
open(OUT,">>filename") || die;
for($I=1;$I<=10;$I++){
print OUT $I, "\n";
}
close(OUT);
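The three modes can also be exercised together in one script. The sketch below is a variation, not the form used in these notes: it uses the three-argument form of open with lexical filehandles and reports $! on failure, and it writes to a scratch file created by the core File::Temp module so that no existing file is touched.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my (undef, $filename) = tempfile();   # name of a scratch file for this demo

# '>' creates the file, or erases it if it already exists.
open(my $out, '>', $filename) or die "cannot write $filename: $!";
print $out "$_\n" for 1 .. 3;
close($out);

# '>>' keeps the existing contents and adds to the end.
open($out, '>>', $filename) or die "cannot append to $filename: $!";
print $out "4\n";
close($out);

# '<' opens for input; in list context <$in> reads all lines at once.
open(my $in, '<', $filename) or die "cannot read $filename: $!";
my @lines = <$in>;
close($in);
chomp @lines;
unlink $filename;

print "@lines\n";   # 1 2 3 4
```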
chomp
chomp VARIABLE
The function chomp deletes a trailing newline from the end of the string contained in a variable. The default variable that chomp operates on is $_.
split
split /PATTERN/
split /PATTERN/, EXPR
split(/[,|]/,EXPR) # using both , and | as the delimiters
The function split splits EXPR into a list of strings, using matches of /PATTERN/ as separators. The default variable that split operates on is $_.
$_ = <STDIN>;
chomp;
@line = split;
@line = split(/ /);
@line = split(/\s+/,$_);
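These three forms of split are not interchangeable when the input has leading or repeated spaces. The sketch below, on a made-up string, shows the difference; split ' ' is what a bare split does to $_.

```perl
use strict;
use warnings;

my $text = "  alpha  beta gamma";   # leading and doubled spaces

my @default = split ' ', $text;     # skips leading whitespace, splits on runs
my @single  = split / /, $text;     # every single space is a separator
my @pattern = split /\s+/, $text;   # runs are one separator, leading empty field kept

print scalar(@default), "\n";       # 3
print scalar(@single), "\n";        # 6
print scalar(@pattern), "\n";       # 4
```

The extra fields in the last two cases are empty strings, which is usually not what you want when reading whitespace-separated columns.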
open(IN,"TabDelineatedFileExportedFromExcel.txt") || die;
<IN>; #skip the first line (column headings)
$lineNum = 1;
while (<IN>){
chomp;
@fields = split(/\t/,$_);
print "line ", ++$lineNum, " has ", scalar(@fields), " fields\n";
}
close(IN);
The function push adds one or more items to the end of an array.
push @array, $a;
push @array, ($a, $b);
push @array, @array;
The function pop deletes the last item of an array and returns it.
$last = pop @array; # last element is gone
$last = pop (@array); # same as above
pop @array; # remove last element without saving it
The function unshift adds one or more items to the beginning of an array.
unshift @array, $a;
unshift @array, ($a, $b);
unshift @array, @array;
The function shift deletes the first item of an array and returns it.
$first = shift @array; # first element is gone
$first = shift (@array); # same as above
shift @array; # remove first element without saving it
The function values returns a list of the values in a hash.
@valuesArray = values %aHash;
The function keys returns a list of the keys in a hash.
@keys = keys %ENV; # keys are in the same order as
@values = values %ENV; # values, as this demonstrates
while (@keys) {
print pop(@keys), '=', pop(@values), "\n";
}
The function each steps through a hash one key/value pair at a time.
while (($key,$value) = each %ENV) {
print "$key=$value\n";
}
# sort passes the two items being compared in the global variables $a and $b, not in @_
sub numerically{
    return $a <=> $b;
}
@sorted = sort numerically 53, 29, 11, 32, 7;
@descending = reverse sort numerically 53, 29, 11, 32, 7;
sub reverse_numerically { $b <=> $a }
@descending = sort reverse_numerically 53, 29, 11, 32, 7;
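The comparison can also be written inline as a block, which avoids naming a subroutine. A minimal sketch with the same numbers follows; note that sort without a comparison routine compares its items as strings.

```perl
use strict;
use warnings;

my @numbers = (53, 29, 11, 32, 7);

my @as_strings = sort @numbers;                 # default: string comparison
my @ascending  = sort { $a <=> $b } @numbers;   # numeric, inline block
my @descending = sort { $b <=> $a } @numbers;   # numeric, largest first

print "@as_strings\n";   # 11 29 32 53 7
print "@ascending\n";    # 7 11 29 32 53
print "@descending\n";   # 53 32 29 11 7
```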
Assume @people is an array of hash references, where each hash contains the fields firstname and lastname. The following code reads the people from a file and sorts them by last name, then by first name.
open(IN,"people.txt") || die;
while (<IN>){
    $person = {}; #this step is important: each person needs a new hash
    chomp;
    ($first,$last) = split;
    $person->{"firstname"} = $first;
    $person->{"lastname"} = $last;
    push @people, $person;
}
close(IN);
print scalar(@people), "\n";
print $people[0]->{"firstname"}, "\n";
print $people[1]->{"firstname"}, "\n";
print $people[2]->{"firstname"}, "\n";
print $people[3]->{"firstname"}, "\n";
@sorted = sort names @people;
print scalar(@sorted), "\n";
print $sorted[0]->{"firstname"}, "\n";
print $sorted[1]->{"firstname"}, "\n";
print $sorted[2]->{"firstname"}, "\n";
print $sorted[3]->{"firstname"}, "\n";
# compare by last name first; if the last names are equal, compare by first name
sub names{
    $a->{lastname} cmp $b->{lastname}
        ||
    $a->{firstname} cmp $b->{firstname}
}
join EXPR, LIST
The function join is the opposite of split. It joins the strings in LIST into one string, separated by the value of EXPR. For example,
$line = join "\t", @fields; # double quotes, so that \t is interpolated as a tab
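A quick round trip shows that join and split undo each other, and why the quoting of the separator matters. The field values below are invented for illustration.

```perl
use strict;
use warnings;

my @fields = ("AA494997", "Homo sapiens", "42");

my $line  = join "\t", @fields;    # fields separated by real tab characters
my @back  = split /\t/, $line;     # split recovers the original fields
my $wrong = join '\t', @fields;    # single quotes: a backslash and a t, not a tab

print scalar(@back), "\n";                         # 3
print length($wrong) - length($line), "\n";        # 2 (one extra character per separator)
```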
exit EXPR
exit
The function exit terminates the program, returning the value of EXPR.
Search Entrez for "human[Organism] jak3".
Compare the results to http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nucleotide&term=human[orgn]+AND+jak3&retmax=100
#!/usr/bin/perl -w
use LWP::Simple;
$baseurl="http://eutils.ncbi.nlm.nih.gov/entrez/eutils/";
$eutil="esearch.fcgi?";
$parameters="db=nucleotide&term=human[orgn]+AND+jak3&retmax=100";
$url=$baseurl.$eutil.$parameters;
$raw=get($url);
#eutility output is given in XML by default
@lines=split(/^/,$raw);
foreach $line (@lines){
if ($line=~/<Id>(\d+)<\/Id>/){
print "$1\n";
}
}
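The line-by-line loop can be shortened, because a match with /g in list context returns every captured group at once. The sketch below runs on a made-up fragment shaped like an esearch reply, so it needs no network access; the second Id is hypothetical.

```perl
use strict;
use warnings;

# A made-up fragment in the shape of an esearch reply (not real output).
my $raw = <<'XML';
<eSearchResult>
<IdList>
<Id>47157314</Id>
<Id>12345</Id>
</IdList>
</eSearchResult>
XML

# In list context, m//g returns all the captured Id numbers.
my @ids = $raw =~ /<Id>(\d+)<\/Id>/g;

print scalar(@ids), " ids: @ids\n";   # 2 ids: 47157314 12345
```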
What can we do with these GI numbers?
Go here: http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&view=xml&val=47157314
A regular expression is a way of describing a set of strings without having to list all the strings in the set.
if (/Windows 95/) { print "Time to upgrade?\n";}
s/Windows/Linux/;
($good, $bad, $ugly) = split(/,/, "vi,emacs,teco");
Suppose we want to collect email addresses on the web. We download a web page and go through its HTML source code with the following:
while ($line = <FILE>){
if ($line =~ /mailto:/) {
print $line;
}
}
Alternatively, a shorter version:
while (<FILE>) {
print if /mailto:/;
}
Matching is from left to right. Matching is greedy, attempting to match as much as possible as long as the whole pattern still matches.
| [a-z] | match one lower case letter |
| [a-zA-Z] | match one letter |
| [0-9] | one digit |
| \d | one digit |
| \s | whitespace: blank, \t, \n etc |
| \w | word characters: [a-zA-Z_0-9]; note that the underscore is included |
| \D | anything that is not a digit |
| \S | anything that is not a whitespace |
| \W | anything that is not a word character |
| . | match any character except \n |
| \. | match . |
| ^ | at the beginning of the pattern, match the beginning of the string; with the /m modifier, also match the beginning of each line |
| \A | match the beginning of the string |
| $ | at the end of the pattern, match the end of the string (or just before a final newline); with the /m modifier, also match the end of each line |
| \z | match the end of the string |
| \Z | match the end of the string, or if there is a newline at the end of string, match before \n |
| \b | match a word boundary |
| \B | match except at the word boundary |
| a+ | one or more a's |
| a* | zero or more a's |
| a? | zero or one a; a ? after an item makes that item optional |
| \d{7,11} | at least 7 digits, no more than 11 digits |
| \d{7} | exactly 7 digits |
| \d{7,} | at least 7 digits |
| \d{0,7} | at most 7 digits |
| ? | when ? appears after a quantifier, it means minimal matching with the quantifier |
| .*: | the : will match the last : in the string |
| .*?: | the : will match the first : in the string |
| | | alternatives |
| () | grouping |
| /(Fred|George|Ron) Weasley/ | |
| [abcd] | a or b or c or d; alternatives on the character level; equivalently, a|b|c|d |
| [^a] | ^ is negation; match anything but a |
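The difference between greedy and minimal matching in the table above is easiest to see on a string with two colons. A minimal sketch on made-up strings:

```perl
use strict;
use warnings;

my $text = "key: value: more";

my ($greedy)  = $text =~ /(.*):/;    # .* runs to the last colon
my ($minimal) = $text =~ /(.*?):/;   # .*? stops at the first colon
my ($digits)  = "tel 5551234567" =~ /(\d{7,11})/;   # between 7 and 11 digits

print "$greedy\n";    # key: value
print "$minimal\n";   # key
print "$digits\n";    # 5551234567
```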
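Backreferences and match variables: \1, \2 (groups referred to inside the pattern); $1, $2 (the same groups, after the match); $` (the text before the match), $& (the matched text), $' (the text after the match)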
1. /bam{2}/ matches "bamm"
2. /(bam){2}/ matches "bambam"
3. /\bFred\b/ matches "The Great Fred", "Fred the Great", but not "Frederick the Great"
4. /^#/ matches a comment line
5. /<(.*?)>.*?<\/\1>/ matches "<B>Bold</B>"
6. /U-?2/ matches "U2" or "U-2"
7. If you say:
$foo = "bar";
/$foo$/;
the pattern will be /bar$/, which matches "bar" at the end of the string.
m// pattern matching; match the pattern between the /'s
if ($shire =~ m/Baggins/) { ... } # search for Baggins in $shire
if ($shire =~ /Baggins/) { ... } # search for Baggins in $shire
if (m/Baggins/) { ... } # search for Baggins in $_
if (/Baggins/) { ... } # search for Baggins in $_
s/// pattern matching and substitution; match the pattern between the first two /'s, and replace it by the string between the last two /'s
=~ read as "matches" or "contains"
!~ read as "does not match" or "does not contain"
Remember, $_ is the default variable that pattern matching operates on.
/i ignore alphabetic case
/m treat the string as multiple lines: ^ and $ also match at embedded newlines
/s treat the string as a single line: . also matches a newline
/g globally find all matches
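The effect of each modifier can be checked on a short two-line string. This is a sketch with invented text and variable names:

```perl
use strict;
use warnings;

my $text = "Perl is fun.\nperl is terse.";

my @any_case   = $text =~ /perl/gi;     # /g: all matches, /i: either case
my @line_start = $text =~ /^perl/gi;    # ^ matches only at the start of the string
my @each_line  = $text =~ /^perl/gim;   # /m: ^ also matches after the newline
my $dot_plain  = ($text =~ /fun\..perl/)  ? "yes" : "no";   # . will not match \n
my $dot_single = ($text =~ /fun\..perl/s) ? "yes" : "no";   # /s: . matches \n too

print scalar(@any_case), " ", scalar(@line_start), " ", scalar(@each_line), "\n";   # 2 1 2
print "$dot_plain $dot_single\n";                                                    # no yes
```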
1. s/(\S+)\s+(\S+)/$2 $1/ swaps the first two words of a string
2. $haystack =~ m/needle/ # match a simple pattern
3. $haystack =~ /needle/ # same thing
4. $recipe =~ s/butter/olive oil/ # a substitution
5. tr/ATCG/TAGC/ # complement the DNA strand in $_
6.
if (@perls = $paragraph =~ /perl/gi) {
printf "Perl mentioned %d times.\n", scalar @perls;
}
7.
$lotr = $hobbit; # Just copy The Hobbit
$lotr =~ s/Bilbo/Frodo/g; # and write a sequel the easy way.
Equivalently,
($lotr = $hobbit) =~ s/Bilbo/Frodo/g;
8.
for (@chapters) { s/Bilbo/Frodo/g } # Do substitutions chapter by chapter.
s/Bilbo/Frodo/g for @chapters; # Same thing.
9.
@oldhues = ('bluebird', 'bluegrass', 'bluefish', 'the blues');
for (@newhues = @oldhues) { s/blue/red/ }
print "@newhues\n"; # redbird redgrass redfish the reds
10.
for ($string) {
s/^\s+//; # discard leading whitespace
s/\s+$//; # discard trailing whitespace
s/\s+/ /g; # collapse internal whitespace
}
Equivalently,
$string = join(" ", split " ", $string);
#!C:/Perl/bin/perl -w
$seq = "acgttgca";
print "$seq\n";
$seq =~ tr/acgt/tgca/;
print "$seq\n";
#!C:/Perl/bin/perl -w
$s = "agct tata acgt acgt acgt acgt acgt acgt a cccc gggg tttt";
print "$s\n";
$s =~ s/ //g;
print "$s\n";
if ($s =~ m/tata/){
print "$`\n$&\n$'\n";
$downstream = $';
$downstream =~ m/.{25}/;
print "$'\n";
}
The IL-2 signature is T-E-[LF]-x(2)-L-x-C-L-x(2)-E-L as a PROSITE pattern. As a Perl regular expression, it is
/TE[LF]..L.CL..EL/
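The translation can be done mechanically. The sketch below is a hypothetical helper, prosite_to_regex, that covers only part of the PROSITE syntax: the - separators, x for any residue, [..] alternatives, {..} exclusions, and (n) or (n,m) repeat counts. It writes x(2) as .{2}, which matches the same strings as the two dots above. The test fragment is an illustrative string, not a database record.

```perl
use strict;
use warnings;

# Translate a subset of PROSITE pattern syntax into a Perl regex string.
# The < and > anchors of PROSITE are not handled here.
sub prosite_to_regex {
    my ($pattern) = @_;
    my $regex = $pattern;
    $regex =~ s/-//g;                        # drop the element separators
    $regex =~ s/\{([A-Z]+)\}/[^$1]/g;        # {AB}   becomes [^AB]
    $regex =~ s/\((\d+(?:,\d+)?)\)/{$1}/g;   # x(2)   becomes x{2}
    $regex =~ s/x/./g;                       # x      becomes .
    return $regex;
}

my $IL2 = prosite_to_regex("T-E-[LF]-x(2)-L-x-C-L-x(2)-E-L");
print "$IL2\n";                              # TE[LF].{2}L.CL.{2}EL

# An illustrative test string, not a database record.
my $fragment = "KKATELKHLQCLEEELKPLE";
my ($hit) = $fragment =~ /($IL2)/;
print "$hit\n";                              # TELKHLQCLEEEL
```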
A Perl script to search for IL-2:
#!C:/Perl/bin/perl -w
$IL2 = "TE[LF]..L.CL..EL";
open(IN,"humanIL2") || die; $humanIL2 = <IN>; chomp $humanIL2; close(IN);
open(IN,"mouseIL2") || die; $mouseIL2 = <IN>; chomp $mouseIL2; close(IN);
open(IN,"chickIL2") || die; $chickIL2 = <IN>; chomp $chickIL2; close(IN);
if ($humanIL2 =~ m/$IL2/){
print "matched human IL-2: $&\n";
}
if ($mouseIL2 =~ m/$IL2/){
print "matched mouse IL-2: $&\n";
}
if ($chickIL2 =~ m/$IL2/){
print "matched chick IL-2: $&\n";
}
#!C:/Perl/bin/perl -w
print 'Please enter your name: ';
$name = <STDIN>;
chomp $name;
print "\n";
@array = split //, $name;
&munch(@array);
sub munch{
foreach(@_){
print "Munching...$_\n";
}
}
#!C:/Perl/bin/perl -w
use Getopt::Std;
getopts('dn:a:');
if ($opt_d){
print "Debugging mode\n";
}
if (!$opt_n || !$opt_a){
print "USAGE:\n\texample [-d] -n name -a age\n";
exit;
}
else{
if ($opt_d){ print "Forming string\n";}
$output = "$opt_n is $opt_a years old\n";
print $output;
}
#!/usr/bin/perl -w
$path = `pwd`;
print "Path is: $path\n";
#BE CAREFUL! This cd doesn't do what you want it to do!
`cd /home/ouyang/5020`;
$path = `pwd`;
print "Path is $path\n";
chdir "/home/ouyang/5020";
$path = `pwd`;
print "Path is $path\n";
#!/usr/bin/perl -w
use LWP::Simple;
# Search the protein database for interleukin-2 and collect the GI numbers.
$page=get("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=interleukin-2[protein]&retmax=200");
@lines=split(/^/,$page);
foreach $line (@lines){
    if ($line=~/<Id>(\d+)<\/Id>/){
        $GI = $1;
        # Fetch the record for this GI number as XML and extract the protein sequence.
        $page=get("http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&view=xml&val=$GI");
        @protein = ($page =~ m/<NCBIeaa>(.*)<\/NCBIeaa/gm);
        if (scalar(@protein)>0){
            print "$GI => $protein[0]\n";
        }
    }
}
PDL - Perl Data Language: http://pdl.perl.org/