Class notes for GINF 5004, Platforms and Applications

Perl on Linux is used in this class. ActivePerl can be downloaded and installed for both Windows and Linux. We use the Learning Perl book as the reference.

1. An example

1.1.

From the Linux command line, type in the following line:

> perl -e 'print "Hello, world.\n"'

1.2.

Create a file called hello.pl containing the following:

#!/usr/bin/perl

print "Hello, world!\n";

Then, from the command line, type:

> perl hello.pl

Alternatively, you can make hello.pl an executable file by doing this:

> chmod u+x hello.pl

Then you can simply type

> hello.pl

to run the code. However, this assumes you have the current directory in the environment variable PATH. If you do not, then you can run the code by typing this:

> ./hello.pl

While you are still learning the language, it is a good idea to let Perl warn you when something is not right. To do so, add the -w switch to the first line of your Perl code:

#!/usr/bin/perl -w

1.3. Blast search

LWP is 'libwww-perl': http://lwp.linpro.no/lwp/. Online help: http://www.perldoc.com/perl5.6/lib/LWP/Simple.html.

#!/usr/bin/perl -w

use LWP::Simple;

$AccessionNumber = "AA494997";

$dnaSeq = DNA($AccessionNumber);

open(OUT,">ex_blast.fa") || die;

print OUT ">$AccessionNumber\n";

print OUT $dnaSeq;

close(OUT);

chdir "/indirect/nrd4/Ming/blast/";

system("./blastall -p blastx -d nr -i ~/II/5004-Fall2004/ex_blast.fa -e 0.001 -o ~/public_html/ex_blast.html -T");

system("chmod go+rx ~/public_html/ex_blast.html");

sub DNA{

$aNum = $_[0];

$rtn = "";

$page = get("http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=nucleotide&cmd=search&term=$aNum");
if ($page =~ m/<b>1:\s<\/b><a\shref="(.*?)"/){

$url = "http://www.ncbi.nlm.nih.gov".$1."&view=xml";

}

else{ return $rtn;}

$page = get($url);

@DNAs = ($page =~ m/IUPACna&gt;(.*)&lt;\/IUPACna/gm);

if (scalar(@DNAs) >= 1){

$rtn = $DNAs[0];

if (scalar(@DNAs) > 1){

print "$aNum has ", scalar(@DNAs), " sequences\n";

}

}

return $rtn;

}

1.4 Useful constructs

http://www.ncbi.nlm.nih.gov/Class/PowerTools/blast/rules.html

2. Data

Type        Character  Is a name for
scalar      $          an individual value
array       @          a list of values, keyed by number
hash        %          a group of values, keyed by string
subroutine  &          a callable chunk of Perl code

2.1 Scalar

double quotes ": interpolation

single quotes ': no interpolation

backquotes `: run an external program and return output

$answer = 42; # an integer
$pi = 3.14159265; # a "real" number
$avocados = 6.02e23; # scientific notation
$pet = "Camel"; # string
$sign = "I love my $pet"; # string with interpolation
$cost = 'It costs $100'; # string without interpolation
$thence = $whence; # another variable's value
$salsa = $moles * $avocados; # a gastrochemical expression
$exit = system("vi $file"); # numeric status of a command
$cwd = `pwd`; # string output from a command

Create your own quotes: q//, qw//, qq//, qx//; see Table 2-3 on page 63 of the camel book.
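The quote-like operators can be compared side by side. The following is a minimal sketch; the variable names and the echo command are chosen only for illustration, and qx// assumes a shell with an echo command is available.

```perl
use strict;
use warnings;

my $pet = "Camel";

my $single = q(I love my $pet);     # q// acts like single quotes: no interpolation
my $double = qq(I love my $pet);    # qq// acts like double quotes: interpolation
my @words  = qw(couch chair table); # qw// builds a list of whitespace-separated words
my $output = qx(echo hello);        # qx// acts like backquotes: runs a command

print "$single\n";
print "$double\n";
print scalar(@words), " words\n";
print $output;
```

Any matching pair of delimiters can be used in place of the parentheses, which is handy when the text itself contains quote characters.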

We have seen "\n" for a newline. "\t" is a horizontal tab.

Scalars can be references to complicated structures. For example,

$ary = \@myarray; # reference to a named array
$hsh = \%myhash; # reference to a named hash
$sub = \&mysub; # reference to a named subroutine
$ary = [1,2,3,4,5]; # reference to an unnamed array
$hsh = {Na => 19, Cl => 35}; # reference to an unnamed hash
$sub = sub { print $state }; # reference to an unnamed subroutine
$fido = new Camel "Amelia"; # reference to an object

Perl automatically converts between numbers and strings, depending on the context in which a value is used. For example,

$camels = '123';

print $camels + 1, "\n";
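To make the role of context more concrete, here is a small sketch (variable names are illustrative) in which the operator, not the variable, decides whether a value is treated as a number or as a string.

```perl
use strict;
use warnings;

my $camels = '123';

my $sum    = $camels + 1;    # + asks for numbers: 124
my $joined = $camels . 1;    # . asks for strings: "1231"

# The comparison operator decides how the operands are treated.
my $as_numbers = ("1.0" == 1)   ? "equal" : "different";
my $as_strings = ("1.0" eq "1") ? "equal" : "different";

print "$sum $joined $as_numbers $as_strings\n";   # 124 1231 equal different
```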

2.2 Arrays

@home = ("couch", "chair", "table", "stove");

($potato, $lift, $tennis, $pipe) = @home;

$home[0] = "couch";

$home[1] = "chair";

$home[2] = "table";

$home[3] = "stove";

We can use a "list assignment" to swap two variables:

($alpha,$omega) = ($omega,$alpha);

You can get the length of the array by scalar(@array). Another useful notation: $#array is one less than the length of the array, that is, the subscript of the last element of the array.
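A short sketch of both notations, reusing the @home array from above; the other variable names are made up for the example.

```perl
use strict;
use warnings;

my @home = ("couch", "chair", "table", "stove");

my $length     = scalar(@home);   # number of elements
my $last_index = $#home;          # subscript of the last element
my $last       = $home[$#home];   # the last element itself

# Loop over subscripts from 0 to the last one.
for my $i (0 .. $#home) {
    print "$i: $home[$i]\n";
}
print "length=$length last_index=$last_index last=$last\n";
```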

2.3 Hashes

%longday = ("Sun", "Sunday", "Mon", "Monday", "Tue", "Tuesday", "Wed",

"Wednesday", "Thu", "Thursday", "Fri", "Friday", "Sat", "Saturday");

%longday = (

"Sun" => "Sunday",

"Mon" => "Monday",

"Tue" => "Tuesday",

"Wed" => "Wednesday",

"Thu" => "Thursday",

"Fri" => "Friday",

"Sat" => "Saturday",

);

$wife{"Adam"} = "Eve";

You can make an array of hashes, a hash of arrays, or more complicated structures. See pages 12-13 of the camel book.
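As a rough illustration of such structures, the sketch below builds an array of hashes and a hash of arrays from unnamed array and hash references. All names and data are invented for the example.

```perl
use strict;
use warnings;

# Array of hashes: each element is a reference to an unnamed hash.
my @people = (
    { firstname => "Adam", lastname => "Smith" },
    { firstname => "Eve",  lastname => "Jones" },
);

# Hash of arrays: each value is a reference to an unnamed array.
my %chores = (
    Sun => ["rest"],
    Mon => ["laundry", "shopping"],
);

push @{ $chores{Sun} }, "cook";           # add to one of the inner arrays

my $second_lastname = $people[1]->{lastname};
my $monday_count    = scalar @{ $chores{Mon} };
my $sunday_list     = join ", ", @{ $chores{Sun} };

print "$second_lastname $monday_count $sunday_list\n";   # Jones 2 rest, cook
```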

2.4 Variable scopes

2.4.1 global variables

Global variables are visible everywhere.

$a = 0;

mySub(5);

sub mySub{

my($parameter) = @_;

print "$a\n";

print "$parameter\n";

}

2.4.2 local variables

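local gives a global variable a temporary value. The temporary value is visible in the block and in the subroutines that are called within the block; the old value is restored when the block ends.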

$a = 3;

{ local $a = 0;

mySub(5);

}

mySub(5);

sub mySub{

my($parameter) = @_;

print "$a\n";

print "$parameter\n";

}

2.4.3 my variables

My variables are visible only in the block in which they are declared. Subroutines called within the block do not see them.

$a = 3;

{ my $a = 0;

mySub(5);

}

sub mySub{

my($parameter) = @_;

print "$a\n";

print "$parameter\n";

}
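The three kinds of variables can be contrasted in a single script. This is only a sketch: it uses $x rather than $a (because $a has a special role in sort), and the subroutine names are hypothetical.

```perl
use strict;
use warnings;

our $x = 3;                       # a global (package) variable

sub show { return $x }            # always reads the global $x

sub with_local {
    local $x = 0;                 # temporary value for the global
    return show();                # show() sees 0
}

sub with_my {
    my $x = 0;                    # a new variable private to this block
    return show();                # show() still sees the global: 3
}

print with_local(), "\n";         # 0
print with_my(), "\n";            # 3
print "$x\n";                     # 3: the local value has been undone
```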

2.5 Special global variables

$_, @ARGV, @INC, %ENV, STDIN, STDOUT, STDERR

An example using these special global variables:

#!/usr/bin/perl -w

while (@ARGV){

$ARGV = shift @ARGV;

print "$ARGV\n";

}

print $INC[0], "\n";

print $ENV{"PATH"}, "\n";

print "enter something\n";

while (<STDIN>){

print "stdout: $_";

print STDERR "stderr: $_";

}

3. Operators

Arithmetic operators: +, -, *, /, **, %, ++, --

String concatenation: .

Assignment operators: =, +=, -=, *=, /=, .=

Numeric comparison operators: ==, !=, <, <=, >, >=, <=>

String comparison operators: eq, ne, lt, gt, le, ge, cmp

Logical operators: &&, ||, !, and, or, not

Conditional operator: ?:

Pattern matching: =~, !~, m//, s///, tr///
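A few of these operators in action, as a minimal sketch with made-up variable names. Note in particular that <=> compares as numbers while cmp compares as strings.

```perl
use strict;
use warnings;

my $power     = 2 ** 10;          # exponentiation: 1024
my $remainder = 17 % 5;           # modulus: 2

my $word = "ab";
$word .= "cd";                    # append: "abcd"

my $numeric = 2 <=> 10;           # -1, because 2 is less than 10
my $stringy = "2" cmp "10";       # 1, because "2" sorts after "1"

my $max = ($power > $remainder) ? $power : $remainder;   # conditional operator

print "$power $remainder $word $numeric $stringy $max\n";
```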

4. Control Structures

4.1 if-elsif-else

#if

if ($a>$b){

print $a;

}

#if-else

if ($a>$b){

print $a;

}

else{

print $b;

}

#if-elsif-else

if ($a>$b){

print $a;

}

elsif ($a>$c){

print $c;

}

else{

print $b;

}

4.2 while

while(<STDIN>){

print;

}

4.3 unless

unless ($a>$b){

print;

}

4.4 for

for($I=0;$I<10;$I++){

print $I, "\n";

}

4.5 foreach

foreach $I (@array) {

print $I, "\n";

}

Alternatively, use the default global variable $_:

# Each element is assigned to $_ in turn

foreach (@array)

{

print $_; # prints the element

print; # prints the element

}

4.6 next

$I = 11;

while ($I>0){

$I--;

if ($I==4) {next;}

# next if ($I==4);

print $I, "\n";

}

4.7 last

$I = 11;

while ($I>0){

if ($I==4) {last;}

# last if ($I==4);

print $I--, "\n";

}

4.8 Quiz 1

5. Built-in Functions

5.1 print

5.2 open/close files

Use the angle operator <> for line input. An example of opening a file for input:

open(IN,"filename") || die;

while (<IN>){

print $_;

}

close(IN);

An example of opening a file for output:

open(OUT,">filename") || die;

for($I=1;$I<=10;$I++){

print OUT $I, "\n";

#note that there is no comma after OUT

}

close(OUT);

In the above example, if there is an existing file of that name, it will be erased. An example of opening a file for appending:

open(OUT,">>filename") || die;

for($I=1;$I<=10;$I++){

print OUT $I, "\n";

}

close(OUT);

5.3 chomp

chomp

chomp VARIABLE

The function chomp deletes a trailing newline from the end of a string contained in a variable. The default variable that chomp operates on is $_.
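A small sketch of chomp on a named variable (the variable names are illustrative). It also shows that chomp returns the number of characters it removed, so calling it a second time does no harm.

```perl
use strict;
use warnings;

my $line = "acgt\n";

my $removed = chomp($line);   # removes the newline; returns the number of characters removed
my $again   = chomp($line);   # nothing left to remove

print "[$line] $removed $again\n";   # [acgt] 1 0
```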

5.4 split

split

split /PATTERN/

split /PATTERN/, EXPR

split(/[,|]/,EXPR) # using both , and | as the delimiters

The function split splits EXPR, using /PATTERN/ as separators. The default variable that split operates on is $_.

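5.4.1 split the keyboard input by whitespace

The three forms below are similar but not identical:

$_ = <STDIN>;

chomp;

@line = split; # split $_ on runs of whitespace, ignoring leading whitespace

@line = split(/ /); # split on every single blank; consecutive blanks give empty fields

@line = split(/\s+/,$_); # split on runs of whitespace; leading whitespace gives an empty first field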
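The difference between the three forms shows up when the input has leading blanks or several blanks in a row. The sketch below uses a fixed string in place of keyboard input; the array names are made up.

```perl
use strict;
use warnings;

$_ = "  one two   three";     # leading blanks and a run of three blanks

my @default = split;              # ("one", "two", "three")
my @single  = split(/ /);         # ("", "", "one", "two", "", "", "three")
my @runs    = split(/\s+/, $_);   # ("", "one", "two", "three")

print scalar(@default), " ", scalar(@single), " ", scalar(@runs), "\n";   # 3 7 4
```

For free-form input typed at the keyboard, the plain split is usually the form that gives the expected list of words.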

5.4.2 split by tabs

open(IN,"TabDelineatedFileExportedFromExcel.txt") || die;

<IN>; #skip the first line (column headings)

$lineNum = 1;

while (<IN>){

chomp;

@fields = split(/\t/,$_);

print "line ", ++$lineNum, " has ", scalar(@fields), " fields\n";

}

close(IN);

5.5 functions for arrays: push, pop, unshift, shift

5.5.1 push

Adds an item to the end of an array.

push @array, $a;

push @array, ($a, $b);

push @array, @array;

5.5.2 pop

Removes the last item of the array and returns it.

$last = pop @array; # last element is gone

$last = pop (@array); # same as above

pop @array; # remove last element without saving it

5.5.3 unshift

Adds an item to the beginning of an array.

unshift @array, $a;

unshift @array, ($a, $b);

unshift @array, @array;

5.5.4 shift

Removes the first item of the array and returns it.

$first = shift @array; # first element is gone

$first = shift (@array); # same as above

shift @array; # remove first element without saving it
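The four functions can be followed on one small array. A minimal sketch, with the array contents after each step shown in the comments:

```perl
use strict;
use warnings;

my @array = (1, 2, 3);

push    @array, 4;            # (1, 2, 3, 4)
unshift @array, 0;            # (0, 1, 2, 3, 4)

my $last  = pop   @array;     # 4; @array is now (0, 1, 2, 3)
my $first = shift @array;     # 0; @array is now (1, 2, 3)

print "first=$first last=$last rest=@array\n";   # first=0 last=4 rest=1 2 3
```

Using push with shift treats the array as a queue; using push with pop treats it as a stack.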

5.6 functions for hashes: values, keys, each

5.6.1 values

The function values returns a list of the values in a hash.

@valuesArray = values %aHash;

5.6.2 keys

The function keys returns a list of the keys in a hash.

@keys = keys %ENV; # keys are in the same order as

@values = values %ENV; # values, as this demonstrates

while (@keys) {

print pop(@keys), '=', pop(@values), "\n";

}

5.6.3 each

The function each steps through a hash one key/value pair at a time.

while (($key,$value) = each %ENV) {

print "$key=$value\n";

}
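Because keys, values, and each return items in no particular order, a common approach is to sort the keys before looping. The sketch below assumes a shortened version of the %longday hash from section 2.3; the other names are made up.

```perl
use strict;
use warnings;

my %longday = (Sun => "Sunday", Mon => "Monday", Tue => "Tuesday");

my @report;
for my $key (sort keys %longday) {        # sorted keys give a repeatable order
    push @report, "$key=$longday{$key}";
}

my $count   = scalar(keys %longday);      # number of key/value pairs
my $has_wed = exists $longday{Wed} ? "yes" : "no";

print join(", ", @report), "\n";          # Mon=Monday, Sun=Sunday, Tue=Tuesday
print "$count pairs; Wed present: $has_wed\n";
```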

5.7 sub

5.8 return

sub numerically{

# sort passes the two items being compared in the global variables $a and $b, not in @_

return $a <=> $b;

}

5.9 sort

5.9.1 sort numbers

@sorted = sort numerically 53, 29, 11, 32, 7;

@descending = reverse sort numerically 53, 29, 11, 32, 7;

sub reverse_numerically { $b <=> $a }

@descending = sort reverse_numerically 53, 29, 11, 32, 7;
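The comparison can also be written inline as a block instead of a named subroutine. A minimal sketch with the same numbers; the last line shows what happens when no comparison is given, in which case sort compares the items as strings.

```perl
use strict;
use warnings;

my @numbers = (53, 29, 11, 32, 7);

my @ascending  = sort { $a <=> $b } @numbers;   # 7 11 29 32 53
my @descending = sort { $b <=> $a } @numbers;   # 53 32 29 11 7
my @as_strings = sort @numbers;                 # 11 29 32 53 7

print "@ascending\n@descending\n@as_strings\n";
```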

5.9.2 sort strings

Assume @people is an array of hash references, where each hash contains fields of firstname and lastname. The following routine sorts these people by their names.

open(IN,"people.txt") || die;

while (<IN>){

$person = {}; # start a new hash for each person; otherwise every element of @people would refer to the same hash

chomp;

($first,$last) = split;

$person->{"firstname"} = $first;

$person->{"lastname"} = $last;

push @people, $person;

}

close(IN);

print scalar(@people), "\n";

print $people[0]->{"firstname"}, "\n";

print $people[1]->{"firstname"}, "\n";

print $people[2]->{"firstname"}, "\n";

print $people[3]->{"firstname"}, "\n";

@sorted = sort names @people;

print scalar(@sorted), "\n";

print $sorted[0]->{"firstname"}, "\n";

print $sorted[1]->{"firstname"}, "\n";

print $sorted[2]->{"firstname"}, "\n";

print $sorted[3]->{"firstname"}, "\n";

sub names{

$a->{lastname} cmp $b->{lastname}

||

$a->{firstname} cmp $b->{firstname}

}

5.10 join

join EXPR, LIST

The function join is the opposite of split. It puts the strings in LIST in one string, separated by the value of EXPR. For example,

$line = join "\t", @fields; # double quotes are needed so that \t is a tab
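A short sketch of join and split as a round trip. The field values are invented for the example. The last join shows why the quoting of the separator matters: inside single quotes, \t is a backslash followed by a t, not a tab.

```perl
use strict;
use warnings;

my @fields = ("AA494997", "blastx", "0.001");

my $line    = join "\t", @fields;     # fields separated by real tabs
my @back    = split /\t/, $line;      # split recovers the original list
my $literal = join '\t', @fields;     # single quotes: a backslash and a t, not a tab

print "$line\n";
print scalar(@back), " fields\n";
print "$literal\n";
```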

5.11 exit

exit EXPR

exit

The function exit terminates the program, returning the value of EXPR as the exit status. If EXPR is omitted, the exit status is 0.

5.12 Example

Search Entrez for "human[Organism] jak3".

Compare the results to http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nucleotide&term=human[orgn]+AND+jak3&retmax=100

#!/usr/bin/perl -w

use LWP::Simple;

$baseurl="http://eutils.ncbi.nlm.nih.gov/entrez/eutils/";

$eutil="esearch.fcgi?";

$parameters="db=nucleotide&term=human[orgn]+AND+jak3&retmax=100";

$url=$baseurl.$eutil.$parameters;

$raw=get($url);

#eutility output is given in XML by default

@lines=split(/^/,$raw);

foreach $line (@lines){

if ($line=~/<Id>(\d+)<\/Id>/){

print "$1\n";

}

}

What can we do with these GI numbers?

Go here: http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&view=xml&val=47157314

5.13 Quiz 2

6. Regular Expressions

A regular expression is a way of describing a set of strings without having to list all the strings in the set.

6.1 Three major uses

6.1.1 Conditionals

if (/Windows 95/) { print "Time to upgrade?\n";}

6.1.2 Replacements

s/Windows/Linux/;

6.1.3 Separators

($good, $bad, $ugly) = split(/,/, "vi,emacs,teco");

6.2 An example

Suppose we want to collect email addresses on the web. We download a webpage and go through the html source code with the following:

while ($line = <FILE>){

if ($line =~ /mailto:/) {

print $line;

}

}

Alternatively, a simpler version:

while (<FILE>) {

print if /mailto:/;

}

6.3 Regular expression constructs

Matching is from left to right. Matching is greedy, attempting to match as much as possible as long as the whole pattern still matches.

Basic patterns:

[a-z]     match one lowercase letter
[a-zA-Z]  match one letter
[0-9]     match one digit
\d        match one digit
\s        whitespace: blank, \t, \n, etc.
\w        word character: [a-zA-Z_0-9]; note the underscore
\D        anything that is not a digit
\S        anything that is not whitespace
\W        anything that is not a word character
.         match any character except \n
\.        match a literal .
^         at the beginning of the pattern, match the beginning of the string (or of a line, with /m)
\A        match the beginning of the string
$         at the end of the pattern, match the end of the string (or of a line, with /m)
\z        match the end of the string
\Z        match the end of the string, or just before a newline at the end of the string
\b        match a word boundary
\B        match anywhere except at a word boundary

Quantifiers:

a+        one or more a's
a*        zero or more a's
a?        zero or one a
\d{7,11}  at least 7 digits, no more than 11 digits
\d{7}     exactly 7 digits
\d{7,}    at least 7 digits
\d{0,7}   at most 7 digits

Metacharacters:

?         after a quantifier, makes the quantifier minimal (match as little as possible)
.*:       greedy: the : matches the last : in the string
.*?:      minimal: the : matches the first : in the string
|         alternatives
()        grouping, as in /(Fred|George|Ron) Weasley/
[abcd]    a or b or c or d; alternatives at the character level; equivalently, a|b|c|d
[^a]      ^ is negation; match any single character except a
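The difference between greedy and minimal matching is easiest to see on a string with more than one colon. A minimal sketch, with a made-up test string:

```perl
use strict;
use warnings;

my $s = "key: value: more";

my ($greedy)  = $s =~ /(.*):/;     # .* takes as much as it can
my ($minimal) = $s =~ /(.*?):/;    # .*? takes as little as it can

print "[$greedy]\n";     # [key: value]
print "[$minimal]\n";    # [key]
```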

Backreferences:

\1, \2, $1, $2, $`, $&, $'
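These variables can be tried out on a string containing a doubled word. In this sketch (the test string and variable names are invented), \1 is used inside the pattern, and $1 and the three match variables are read after the match succeeds.

```perl
use strict;
use warnings;

my $text = "say hello hello world";

my ($word, $before, $matched, $after);
if ($text =~ /(\w+) \1/) {         # \1 inside the pattern: the same word again
    $word    = $1;                 # $1 after the match: what the parentheses captured
    $before  = $`;                 # text before the match
    $matched = $&;                 # the matched text
    $after   = $';                 # text after the match
}

(my $fixed = $text) =~ s/(\w+) \1/$1/;   # $1 in the replacement

print "[$word] [$before] [$matched] [$after]\n";
print "$fixed\n";                        # say hello world
```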

Examples:

1. /bam{2}/ matches "bamm"

2. /(bam){2}/ matches "bambam"

3. /\bFred\b/ matches "The Great Fred", "Fred the Great", but not "Frederick the Great"

4. /^#/ matches a comment line

5. /<(.*?)>.*?<\/\1>/ matches "<B>Bold</B>"

6. /U-?2/ matches "U2" or "U-2"

7. If you say:

$foo = "bar";

/$foo$/;

the pattern will be /bar$/, which matches "bar" at the end of the string.

7. Pattern Matching Operators

Pattern matching operators:

m// pattern matching; match the pattern between the /'s

if ($shire =~ m/Baggins/) { ... } # search for Baggins in $shire

if ($shire =~ /Baggins/) { ... } # search for Baggins in $shire

if (m/Baggins/) { ... } # search for Baggins in $_

if (/Baggins/) { ... } # search for Baggins in $_

s/// pattern matching and substitution; match the pattern between the first two /'s, and replace it by the string between the last two /'s

=~ read as "matches" or "contains"

!~ read as "does not match" or "does not contain"

Remember, $_ is the default variable that pattern matching operates on.

Pattern matching modifiers:

/i ignore alphabetic case

/m treat the string as multiple lines: ^ and $ also match at the beginning and end of each line

/s treat the string as a single line: . also matches \n

/g globally find all matches
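The effect of each modifier can be checked on a two-line string. A minimal sketch; the test string and variable names are made up.

```perl
use strict;
use warnings;

my $text = "first line\nSecond LINE\n";

my $ignore_case = ($text =~ /second/i)        ? 1 : 0;   # 1: case is ignored
my @line_starts = $text =~ /^(\w+)/mg;                   # ("first", "Second")
my $without_s   = ($text =~ /first.*Second/)  ? 1 : 0;   # 0: . stops at \n
my $with_s      = ($text =~ /first.*Second/s) ? 1 : 0;   # 1: . also matches \n
my $count       = () = $text =~ /line/gi;                # 2: every match is counted

print "$ignore_case @line_starts $without_s $with_s $count\n";   # 1 first Second 0 1 2
```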

Examples:

1. s/(\S+)\s+(\S+)/$2 $1/ swaps the first two words of a string

2. $haystack =~ m/needle/ # match a simple pattern

3. $haystack =~ /needle/ # same thing

4. $recipe =~ s/butter/olive oil/ # a substitution

5. tr/ATCG/TAGC/ # complement the DNA strand in $_

6.

if (@perls = $paragraph =~ /perl/gi) {

printf "Perl mentioned %d times.\n", scalar @perls;

}

7.

$lotr = $hobbit; # Just copy The Hobbit

$lotr =~ s/Bilbo/Frodo/g; # and write a sequel the easy way.

Equivalently,

($lotr = $hobbit) =~ s/Bilbo/Frodo/g;

8.

for (@chapters) { s/Bilbo/Frodo/g } # Do substitutions chapter by chapter.

s/Bilbo/Frodo/g for @chapters; # Same thing.

9.

@oldhues = ('bluebird', 'bluegrass', 'bluefish', 'the blues');

for (@newhues = @oldhues) { s/blue/red/ }

print "@newhues\n"; # redbird redgrass redfish the reds

10.

for ($string) {

s/^\s+//; # discard leading whitespace

s/\s+$//; # discard trailing whitespace

s/\s+/ /g; # collapse internal whitespace

}

Equivalently,

$string = join(" ", split " ", $string);

8. Examples

8.1 Complement a DNA sequence

#!C:/Perl/bin/perl -w

$seq = "acgttgca";

print "$seq\n";

$seq =~ tr/acgt/tgca/;

print "$seq\n";
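The same idea can be extended a little. The sketch below is not part of the original example: it assumes the sequence may mix upper and lower case, keeps the original sequence unchanged by working on a copy, reverses the complement, and uses the return value of tr to count G and C bases.

```perl
use strict;
use warnings;

my $seq = "atgCC";

(my $complement = $seq) =~ tr/acgtACGT/tgcaTGCA/;     # copy, then complement the copy
my $reverse_complement  = scalar reverse $complement; # reverse the string
my $gc_count            = ($seq =~ tr/gcGC//);        # tr returns the number of matches

print "$complement $reverse_complement $gc_count\n";  # tacGG GGcat 3
```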

8.2 Find the TATA box and the transcription start site

#!C:/Perl/bin/perl -w

$s = "agct tata acgt acgt acgt acgt acgt acgt a cccc gggg tttt";

print "$s\n";

$s =~ s/ //g;

print "$s\n";

if ($s =~ m/tata/){

print "$`\n$&\n$'\n";

$downstream = $';

$downstream =~ m/.{25}/;

print "$'\n";

}

8.3 Match Prosite IL-2 signature to human and mouse IL-2

The IL-2 signature is T-E-[LF]-x(2)-L-x-C-L-x(2)-E-L in Prosite pattern notation. As a Perl regular expression, it is

/TE[LF]..L.CL..EL/
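The translation can be done by a few substitutions. The subroutine below is a hypothetical helper, not part of the course material, and is only a sketch: it handles the separators, x, repeat counts, and {..} exclusions, and ignores other Prosite features such as the < and > anchors. It writes .{2} where the hand translation writes two dots; the two are equivalent. The peptide is a made-up test string.

```perl
use strict;
use warnings;

# Convert a simple Prosite pattern to a Perl regular expression string.
sub prosite_to_perl {
    my ($pattern) = @_;
    $pattern =~ s/-//g;                          # drop the separators
    $pattern =~ s/\{([A-Z]+)\}/[^$1]/g;          # {AB} (not A or B) becomes [^AB]
    $pattern =~ s/\((\d+(?:,\d+)?)\)/{$1}/g;     # (2) becomes {2}; (2,4) becomes {2,4}
    $pattern =~ s/x/./g;                         # x (any residue) becomes .
    return $pattern;
}

my $regex = prosite_to_perl("T-E-[LF]-x(2)-L-x-C-L-x(2)-E-L");
print "$regex\n";                                # TE[LF].{2}L.CL.{2}EL

my $peptide = "AAATELKHLQCLEEELKKK";             # made-up test string
if ($peptide =~ /$regex/) {
    print "matched: $&\n";                       # matched: TELKHLQCLEEEL
}
```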

A Perl script to search for IL-2:

#!C:/Perl/bin/perl -w

$IL2 = "TE[LF]..L.CL..EL";

open(IN,"humanIL2") || die; $humanIL2 = <IN>; chomp $humanIL2; close(IN);

open(IN,"mouseIL2") || die; $mouseIL2 = <IN>; chomp $mouseIL2; close(IN);

open(IN,"chickIL2") || die; $chickIL2 = <IN>; chomp $chickIL2; close(IN);

if ($humanIL2 =~ m/$IL2/){

print "matched human IL-2: $&\n";

}

if ($mouseIL2 =~ m/$IL2/){

print "matched mouse IL-2: $&\n";

}

if ($chickIL2 =~ m/$IL2/){

print "matched chick IL-2: $&\n";

}

8.4

#!C:/Perl/bin/perl -w

print 'Please enter your name: ';

$name = <STDIN>;

chomp $name;

print "\n";

@array = split //, $name;

&munch(@array);

sub munch{

foreach(@_){

print "Munching...$_\n";

}

}

8.5 An example of using the standard module Getopt::Std

#!C:/Perl/bin/perl -w

use Getopt::Std;

getopts('dn:a:');

if ($opt_d){

print "Debugging mode\n";

}

if (!$opt_n || !$opt_a){

print "USAGE:\n\texample [-d] -n name -a age\n";

exit;

}

else{

if ($opt_d){ print "Forming string\n";}

$output = "$opt_n is $opt_a years old\n";

print $output;

}

8.6

#!/usr/bin/perl -w

$path = `pwd`;

print "Path is: $path\n";

#BE CAREFUL! This cd doesn't do what you want it to do!

`cd /home/ouyang/5020`;

$path = `pwd`;

print "Path is $path\n";

chdir "/home/ouyang/5020";

$path = `pwd`;

print "Path is $path\n";

8.7 Quiz 3

#!/usr/bin/perl -w

use LWP::Simple;

$page=get("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=interleukin-2[protein]&retmax=200");

@lines=split(/^/,$page);

foreach $line (@lines){

if ($line=~/<Id>(\d+)<\/Id>/){

$GI = $1;

$page=get("http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&view=xml&val=$GI");

@protein = ($page =~ m/&lt;NCBIeaa&gt;(.*)&lt;\/NCBIeaa/gm);

if (scalar(@protein)>0){

print "$GI => $protein[0]\n";

}

}

}

9. Miscellaneous

PDL - Perl Data Language: http://pdl.perl.org/