NAME

extract-links - build links database


SYNOPSIS

 extract-links [options]
 extract-links [options] URI directory
 extract-links [options] URL


DESCRIPTION

This program extracts links and builds the databases used by the various other LinkController programs.

The first database built is the links database, which contains information about the status of all of the links being checked. The other two are in cdb format and can be used as indexes for identifying which files contain which URIs and vice versa.


FILE MODE

In file mode the program goes through a directory tree. It requires a base URI: the URI that would be used to reference, on the World Wide Web, the directory containing all of the files. This base URI is used to convert internal references into full URIs, which can then be used to check that the files are visible from outside.
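
For example, a document tree stored under /var/www/html and served as http://myserver.example.com/ (both the directory and the URL here are only illustrative) could be processed with:

  extract-links http://myserver.example.com/ /var/www/html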


WWW MODE

In WWW mode the program goes through a set of World Wide Web pages, generating the databases.

The program requires a base URL, which is where it starts from. By default it works only downwards from that URL; that is, it will only collect URIs from WWW pages whose URL starts with the base URL.
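
For example, to start from an illustrative base URL and work downwards from it:

  extract-links http://myserver.example.com/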


FILTERING

There are two regular expressions which can be given for filtering. If the regular expression given with --exclude-regex matches a file name, that file will not be read in. If the regular expression given with --prune-regex matches a directory name, that entire directory and all of its subdirectories are excluded.
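
For example (the patterns, URL and directory are only illustrative), the following would skip GIF images and prune any directory whose name matches old:

  extract-links --exclude-regex '\.gif$' --prune-regex old \
      http://myserver.example.com/ /var/www/html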


CONFIGURATION

By default, extract-links extracts and refreshes all of the infostructures listed in the file $::infostrucs. The file looks like this:

   #mode   url                           directory
   www     http://myserver.example.com   /var/www/html

This is covered in detail in the LinkController reference manual.


FILES

There are several configuration files:

  $HOME/.link-control.pl - base configuration file

This contains configuration variables which point to further files.

  $::links - link database
  $::infostrucs - infostructure configuration file
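
The exact contents depend on the local setup, but as a minimal sketch, assuming the file is loaded as ordinary Perl code (the paths here are purely illustrative), $HOME/.link-control.pl might contain something like:

  $::links      = "$ENV{HOME}/.link-control/links.db";
  $::infostrucs = "$ENV{HOME}/.link-control/infostrucs";
  1;   # return a true value when the file is loaded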

Full details of the format of these configuration files can be found in the LinkController reference manual.


NOTES

Unlike other programs, which tend to resort to closing and re-opening files containing lists of links, or to holding them all in memory, this program uses a file containing a list of links to follow the recursion through the WWW without actually using recursive functions. This relies on output written to an unbuffered file being available for input immediately afterwards.

We also use a temporary database to record which links have been seen before. This could get LARGE.
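
The following is a minimal sketch of that technique, not the program's own code; the work file name, starting URL and use of DB_File for the temporary "seen" database are only illustrative. Discovered links are appended to an unbuffered work file through one handle and read back through another, so the traversal is iterative rather than recursive:

  use strict;
  use DB_File;
  use IO::Handle;

  my $workfile = 'extract-links-queue.tmp';
  my %seen;
  tie %seen, 'DB_File', undef          # anonymous temporary database
      or die "cannot create temporary database";

  open my $out, '>>', $workfile or die "append $workfile: $!";
  $out->autoflush(1);                  # writes must be readable immediately
  open my $in, '<', $workfile or die "read $workfile: $!";

  sub enqueue {
      my ($url) = @_;
      return if $seen{$url}++;         # skip links seen before
      print {$out} "$url\n";
  }

  enqueue('http://myserver.example.com/');

  while (my $url = <$in>) {
      chomp $url;
      # ... fetch $url, extract its links, and call enqueue() on each;
      # the loop picks them up because $out is unbuffered.
  }

  close $in;
  close $out;
  unlink $workfile;                    # the work file is only temporary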


BUGS

The HTML parsing is done by the Perl HTML parser. This provides excellent and controllable results, but a custom parser carefully written in C would be a lot faster. This program takes a long time to run, and since it is run under human control this matters. If anyone knows of an efficient but good C-based parser, suggestions would be gratefully accepted. Direct interface compatibility with the current Perl parser would be even better.

I think the program can get trapped in directories it can change into but can't read (mode o+x-r). This should be fixed.

This program could put a large load on a given server if accidentally let loose where it shouldn't be. This is your responsibility, since it isn't reasonable to slow the program down for the case where it is being used on a local machine or LAN. Some warning should be provided, e.g. a check for straying out of the local domain.

I don't really know whether the tied database is really needed, but I want to allow for massive link collections.


SEE ALSO

the verify-link-control manpage; the extract-links manpage; the build-schedule manpage; the link-report manpage; the fix-link manpage; the link-report.cgi manpage; the fix-link.cgi manpage; the suggest manpage; the configure-link-control manpage

cdbmake, cdbget, cdbmultiget

The LinkController manual, included in the distribution in HTML, info, and PostScript formats.

http://scotclimb.org.uk/software/linkcont - the LinkController homepage.