utilities I've been happy
to discover.
Here's one such utility:
Extract links from a file [perl]
Just in case the above link
ever disappears, I'll describe
it:
It is a Perl utility that extracts
all the links out of an HTML file.
Of course, it is a command line utility.
Here's the usage line:
Usage: ./extractlink.pl filename.html
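And just in case the script itself disappears too, here is a minimal sketch of what such a utility might look like. This is my own reconstruction built on the module's documented methods, not the original script, so the details may differ:

```perl
#!/usr/bin/perl
# Sketch of a link extractor (my reconstruction, not the
# original extractlink.pl) using HTML::SimpleLinkExtor.
use strict;
use warnings;
use HTML::SimpleLinkExtor;

my $file = shift @ARGV or die "Usage: $0 filename.html\n";

my $extor = HTML::SimpleLinkExtor->new();
$extor->parse_file($file);       # parse the HTML file

# links() returns every link the parser saw
# (a href, img src, script src, and so on)
print "$_\n" for $extor->links;
```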
This utility relies on the following
perl module from CPAN:
HTML::SimpleLinkExtor -
Extract links from HTML
I learned about the above module, and
the script that works with it, by
finding out about this module first:
HTML::LinkExtor -
Extract links from an HTML document
Looks like LinkExtor is the
original module and SimpleLinkExtor
is the one I ended up using instead.
In other words, you do not need
LinkExtor to run the above
perl script.
Here are the steps I took to get this
to work. Note that I'm a Debian
Linux user:
- I already have Perl installed
so no need for me to install Perl
- I copied and pasted the above
script and placed it in a file called
extract_links.
- I downloaded the
SimpleLinkExtor
module by clicking on the
download link. I
placed it in an empty
directory called download.
- I did a cd to
download:
cd download
- I typed the following
command to extract it:
tar xzf HTML-SimpleLinkExtor-1.23.tar.gz
- Of course, the following
directory was created:
HTML-SimpleLinkExtor-1.23
- I did a cd to the
following directory:
cd HTML-SimpleLinkExtor-1.23
- I read the README file
- I did the following command
as a regular user:
perl Makefile.PL
- Next, I logged in as
root in order to do
the next few commands on
a system-wide basis.
- I did the following
three commands as root:
make
make test
make install
- I logged out as root
so as to go back to being
a normal user
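To double-check that the install really landed system-wide, you can ask perl to load the module and print its version. This one-liner is my own habit, not part of the original instructions, and the version it prints depends on what you installed:

```shell
perl -MHTML::SimpleLinkExtor -le 'print $HTML::SimpleLinkExtor::VERSION'
```

If the module did not install, this dies with a "Can't locate" error instead of printing a version.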
Next, I gave my script execute
permission:
$ chmod u+x extract_links
Since I was working with
a website, I gathered all
the HTML files into one
file to make the HTML easy
to work with:
$ cat *html *htm >temp
Next, I ran the utility
against the temp file:
$ ./extract_links temp >temp2
Since I was looking for unique
URLs in the file, I decided to
put like URLs together by sorting
them with the Unix sort
utility:
$ sort temp2 >temp3
I found out that the website had
approximately 1750 links. This
was too many for me to deal with
by hand. This is what I needed
to find out.
I like SimpleLinkExtor and
the above script that relies
on SimpleLinkExtor.
I'm sure I'll find use for it
again in the future.
Ed Abbott