WWW-Crawl

The WWW::Crawl module provides a simple web crawling utility for
extracting links and other resources from web pages within a single
domain. It can be used to recursively explore a website and retrieve
URLs, including those found in HTML href attributes, form actions,
external JavaScript files, and JavaScript window.open() links.
WWW::Crawl will not stray outside the supplied domain.

EXAMPLES

Basic crawling with HTTP::Tiny:

    use WWW::Crawl;

    my $crawler = WWW::Crawl->new();
    my @visited = $crawler->crawl('https://example.com', \&process_page);

    sub process_page {
        my $url = shift;
        print "Visited: $url\n";
    }

Crawling JavaScript-rendered pages with Chromium:

    use WWW::Crawl::Chromium;

    my $crawler = WWW::Crawl::Chromium->new(
        chromium_path    => '/usr/bin/chromium',
        chromium_timeout => 30,
    );
    my @visited = $crawler->crawl('https://example.com', \&process_page);

    sub process_page {
        my $url = shift;
        print "Visited: $url\n";
    }

A further example, using an anonymous callback to log visited URLs,
appears at the end of this file.

INSTALLATION & TESTING

To run the author tests, set the environment variable RELEASE_TESTING
(for example, RELEASE_TESTING=1 make test).

Installation tests are only run if Test::Mock::HTTP::Tiny is
installed. If you wish to run the full set of tests, ensure this
module is installed before installing WWW::Crawl.

To install this module, run the following commands:

    perl Makefile.PL
    make
    make test
    make install

SUPPORT AND DOCUMENTATION

After installing, you can find documentation for this module with the
perldoc command.

    perldoc WWW::Crawl

You can also look for information at:

    RT, CPAN's request tracker (report bugs here)
        https://rt.cpan.org/NoAuth/Bugs.html?Dist=WWW-Crawl

    Search CPAN
        https://metacpan.org/release/WWW-Crawl

LICENSE AND COPYRIGHT

This software is Copyright (c) 2023 by Ian Boddison.

This program is released under the following license:

    Perl
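
FURTHER EXAMPLE

As a complement to the EXAMPLES above, here is a minimal sketch of
logging visited URLs from the crawl callback. It assumes only the
new() and crawl() calls shown earlier; the anonymous subroutine and
the crawl.log filename are illustrative choices, not part of the
module's API.

    use strict;
    use warnings;
    use WWW::Crawl;

    my $crawler = WWW::Crawl->new();

    # Record each visited URL as the crawl proceeds. The log file
    # name is arbitrary.
    open my $log, '>', 'crawl.log' or die "Cannot open crawl.log: $!";

    my @visited = $crawler->crawl('https://example.com', sub {
        my $url = shift;
        print {$log} "$url\n";
    });

    close $log;

    # Per the examples above, crawl() returns the URLs visited
    # within the domain.
    printf "Crawled %d page(s)\n", scalar @visited;

An anonymous subroutine is interchangeable with the \&process_page
style used in EXAMPLES; both pass a code reference to crawl().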