Friday, February 5, 2010

Happy Perl Utilities

This blog is about Perl
utilities I've been happy
to discover.

Here's one such utility:

Extract links from a file [perl]

Just in case the above link
ever disappears, I'll describe
it:

It is a Perl command-line utility
that extracts all the links out
of an HTML file.

Here's the usage line:

Usage: ./extractlink.pl filename.html

This utility relies on the following
Perl module from CPAN:

HTML::SimpleLinkExtor -
Extract links from HTML
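
Just in case the script itself ever
disappears too, here is a minimal
sketch of how such a script can be
written with this module. The file
handling and output format here are
my guesses, not necessarily what the
original script does:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::SimpleLinkExtor;

    # Take the HTML file to scan from
    # the command line.
    my $file = shift @ARGV
        or die "Usage: $0 filename.html\n";

    # Parse the file and collect every
    # link it contains.
    my $extor = HTML::SimpleLinkExtor->new;
    $extor->parse_file($file);

    # Print one link per line.
    print "$_\n" for $extor->links;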


I learned about the above module,
and the script that works with it,
after first finding out about this
module:

HTML::LinkExtor -
Extract links from an HTML document


Looks like LinkExtor is the
original module and SimpleLinkExtor
is the one I ended up using instead.

In other words, you never have to
touch LinkExtor directly to run the
above Perl script; SimpleLinkExtor
builds on it behind the scenes.
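
To see why the "Simple" module earns
its name, here is a rough sketch of
the same job done with the lower-level
HTML::LinkExtor and its callback
interface. Again, this is my sketch,
not the script from the link above:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::LinkExtor;

    my $file = shift @ARGV
        or die "Usage: $0 filename.html\n";

    # LinkExtor hands each link-bearing
    # tag to a callback as it parses,
    # rather than collecting the links
    # for you.
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        print "$_\n" for values %attr;
    });
    $parser->parse_file($file);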

Here are the steps I took to get this
to work. Note that I'm a Debian
Linux user:

  1. I already had Perl installed,
    so there was no need to
    install it
  2. I copied and pasted the above
    script and placed it in a file called
    extract_links.
  3. I downloaded the
    SimpleLinkExtor
    module by clicking on the
    download link. I
    placed it in an empty
    directory called download.
  4. I did a cd to
    download:

    cd download
    
  5. I typed the following
    command to extract it:

    tar xzf HTML-SimpleLinkExtor-1.23.tar.gz
    
  6. As expected, this created
    the following directory:

    HTML-SimpleLinkExtor-1.23
    
  7. I did a cd to the
    following directory:

    cd HTML-SimpleLinkExtor-1.23
    

  8. I read the README file
  9. I did the following command
    as a regular user:

    perl Makefile.PL
    
  10. Next, I logged in as
    root in order to do
    the next few commands on
    a system-wide basis.
  11. I did the following three
    commands as root, though strictly
    only make install needs root
    privileges (a CPAN-shell shortcut
    for steps 3 through 11 appears
    after this list):

    make
    make test
    make install
    
  12. I logged out as root
    so as to go back to being
    a normal user

  13. Next, I gave my script execute
    permission:

    $ chmod u+x extract_links
    

  14. Since I was working with
    a website, I gathered all
    the HTML files into a single
    file to make them easier
    to work with:

    $ cat *.html *.htm >temp
    
  15. Next, I ran the utility
    against the temp file:

    $ ./extract_links temp >temp2
    

  16. Since I was looking for unique
    URLs in the file, I decided to
    put like URLs together by sorting
    them with the Unix sort utility
    (a condensed version of steps 14
    through 16 appears after this
    list):

    $ sort temp2 >temp3
    
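For the record, here are two ways I
might shorten these steps. Both
snippets below are my own shorthand,
not part of the original write-up.
First, the CPAN shell can replace
steps 3 through 11, downloading,
building, testing, and installing
the module in one go (run as root
for a system-wide install):

    # cpan HTML::SimpleLinkExtor

Second, steps 14 through 16 collapse
into a short pipeline. Here sort -u
replaces the separate sort step by
also dropping duplicate lines, wc -l
counts what is left, and unique_links
is just a filename I made up:

    $ cat *.html *.htm > temp
    $ ./extract_links temp | sort -u > unique_links
    $ wc -l unique_links
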
This turned out to be a very
happy Perl utility for me.

I found out that the website had
approximately 1750 links. That was
far too many for me to deal with by
hand, and the count was exactly what
I needed to know.

I like SimpleLinkExtor and
the above script that relies
on it.

I'm sure I'll find use for it
again in the future.

Ed Abbott