Friday, February 5, 2010

Happy Perl Utilities

This blog is about Perl
utilities I've been happy
to discover.

Here's one such utility:

Extract links from a file [perl]

Just in case the above link
ever disappears, I'll describe
it:

It is a Perl command-line utility
that extracts all the links out
of an HTML file.

Here's the usage line:

Usage: ./extractlink.pl filename.html

This utility relies on the following
Perl module from CPAN:

HTML::SimpleLinkExtor -
Extract links from HTML
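
Just in case the script itself ever
disappears too, here is a minimal
sketch of how such a script can be
written with this module. The file
handling and output format here are
my guesses, not necessarily what the
original script does:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::SimpleLinkExtor;

    # Take the HTML file to scan from
    # the command line.
    my $file = shift @ARGV
        or die "Usage: $0 filename.html\n";

    # Parse the file and collect every
    # link it contains.
    my $extor = HTML::SimpleLinkExtor->new;
    $extor->parse_file($file);

    # Print one link per line.
    print "$_\n" for $extor->links;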


I learned about the above module,
and the script that works with it,
after first finding out about this
module:

HTML::LinkExtor -
Extract links from an HTML document


Looks like LinkExtor is the
original module and SimpleLinkExtor
is the one I ended up using instead.

In other words, you never have to
touch LinkExtor directly to run the
above Perl script; SimpleLinkExtor
builds on it behind the scenes.
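
To see why the "Simple" module earns
its name, here is a rough sketch of
the same job done with the lower-level
HTML::LinkExtor and its callback
interface. Again, this is my sketch,
not the script from the link above:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::LinkExtor;

    my $file = shift @ARGV
        or die "Usage: $0 filename.html\n";

    # LinkExtor hands each link-bearing
    # tag to a callback as it parses,
    # rather than collecting the links
    # for you.
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        print "$_\n" for values %attr;
    });
    $parser->parse_file($file);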

Here are the steps I took to get this
to work. Note that I'm a Debian
Linux user:

  1. I already had Perl installed,
    so there was no need to
    install it
  2. I copied and pasted the above
    script and placed it in a file called
    extract_links.
  3. I downloaded the
    SimpleLinkExtor
    module by clicking on the
    download link. I
    placed it in an empty
    directory called download.
  4. I did a cd to
    download:

    cd download
    
  5. I typed the following
    command to extract it:

    tar xzf HTML-SimpleLinkExtor-1.23.tar.gz
    
  6. As expected, this created
    the following directory:

    HTML-SimpleLinkExtor-1.23
    
  7. I did a cd to the
    following directory:

    cd HTML-SimpleLinkExtor-1.23
    

  8. I read the README file
  9. I did the following command
    as a regular user:

    perl Makefile.PL
    
  10. Next, I logged in as
    root in order to do
    the next few commands on
    a system-wide basis.
  11. I did the following three
    commands as root, though strictly
    only make install needs root
    privileges (a CPAN-shell shortcut
    for steps 3 through 11 appears
    after this list):

    make
    make test
    make install
    
  12. I logged out as root
    so as to go back to being
    a normal user

  13. Next, I gave my script execute
    permission:

    $ chmod u+x extract_links
    

  14. Since I was working with
    a website, I gathered all
    the HTML files into a single
    file to make them easier
    to work with:

    $ cat *.html *.htm >temp
    
  15. Next, I ran the utility
    against the temp file:

    $ ./extract_links temp >temp2
    

  16. Since I was looking for unique
    URLs in the file, I decided to
    put like URLs together by sorting
    them with the Unix sort utility
    (a condensed version of steps 14
    through 16 appears after this
    list):

    $ sort temp2 >temp3
    
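For the record, here are two ways I
might shorten these steps. Both
snippets below are my own shorthand,
not part of the original write-up.
First, the CPAN shell can replace
steps 3 through 11, downloading,
building, testing, and installing
the module in one go (run as root
for a system-wide install):

    # cpan HTML::SimpleLinkExtor

Second, steps 14 through 16 collapse
into a short pipeline. Here sort -u
replaces the separate sort step by
also dropping duplicate lines, wc -l
counts what is left, and unique_links
is just a filename I made up:

    $ cat *.html *.htm > temp
    $ ./extract_links temp | sort -u > unique_links
    $ wc -l unique_links
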
This turned out to be a very
happy Perl utility for me.

I found out that the website had
approximately 1750 links. That was
far too many for me to deal with by
hand, and the count was exactly what
I needed to know.

I like SimpleLinkExtor and
the above script that relies
on it.

I'm sure I'll find use for it
again in the future.

Ed Abbott