Community discussion forum

HELP: parsing unicode web sites

  • 4 months ago

    I need help parsing Unicode web pages and downloading JPEG image files with Perl scripts.

    I read http://www.cs.utk.edu/cs594ipm/perl/crawltut.html about fetching pages with the LWP and HTTP modules and the get($url) function, but for Unicode pages the content I get back is always garbled. When I use get($url) on a non-Unicode page, the content comes back as clean ASCII.

    Now I want to parse http://www.tom365.com/movie_2004/html/5507.html, and the page I get back is garbled. I have read about the Encode module but don't know how to use it.

    I need a Perl script to parse the page above and extract the URL for the image in this pattern:

    If anyone knows how to parse Unicode web pages like this, I'd be very grateful. Thank you.
  • 4 months ago

    Thanks to those who helped. Here's my working script:

     


    #!/usr/bin/perl
    # tom365crawl2.pl
    # http://www.cs.utk.edu/cs594ipm/perl/crawltut.html
    # http://perldoc.perl.org/Encode.html
    # http://juerd.nl/site.plp/perluniadvice
    # http://www.perlmonks.org/?node_id=620068

    use warnings;
    use strict;

    use File::stat;
    use Tie::File;

    use LWP::Simple;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Response;
    use HTML::LinkExtor; # Allows you to extract the links off of an HTML page.
    #use File::Slurp;

    use Encode;

    my $site1 = "http://www.tom365.com/"; # Full url like http://www.tom365.com/movie_2004/html/????.html
    my $delim1a = "\<div class=\"movie\"\>\<img src=\"";
    my $delim1b = "\" class=\"mp\" \/\>";
    my $folder1 = "movie_2004/html/";
    my $url1;
    my $start1 = 1000;
    my $end1 = 1000;
    my $contents1;
    my $image1;

    my $browser1 = LWP::UserAgent->new();
    $browser1->timeout(10);
    my $request1;
    my $response1;

    my $count;
    for ($count=$start1; $count<=$end1; $count++) {
      $url1 = $site1 . $folder1 . $count . ".html";
      printf "Downloading %s\n", $url1;

      # Method 1
      #$contents1 = get($url1);

      # Method 2
      $request1 = HTTP::Request->new(GET => $url1);
      $response1 = $browser1->request($request1);
      if ($response1->is_error()) {
        printf "%s\n", $response1->status_line;
        next; # skip this page; there is no usable content to parse
      }
      $contents1 = $response1->decoded_content();

      #open(NEWFILE1, "> Debug.txt");
      #(print NEWFILE1 $contents1)    or die "Can't write to Debug.txt: $!";
      #close(NEWFILE1);

      #print $contents1;

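      # [^"]+ stops at the closing quote so only the image URL is captured.
      if ($contents1 =~ /<div class="movie"><img src="([^"]+)" class="mp" \/>/) {
        $image1 = $1;
        printf "Downloading %s\n", $image1;
        # List-form system() bypasses the shell, so the URL cannot inject commands.
        system('wget', '-q', '-O', "$count.jpg", $image1) == 0
          or warn "wget failed for $image1\n";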

        #if ($image1 =~ /\/([^\/]*)$/m) {
        #  printf "Renaming %s to $count.jpg\n", $1;
        #} else {
        #  printf "Could not rename %s to $count.jpg\n", $image1;
        #}
      } else {
        #open(NEWFILE1, "> $count.txt");
        #(print NEWFILE1 "Download failed.\n")    or die "Can't write to $image1: $!";
        #close(NEWFILE1);
      }
    }
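
    For anyone else who got stuck on Encode: decoded_content() in the script above does the charset conversion for you, but the same step can be done by hand on the raw bytes. Here is a minimal sketch, assuming the page is served as GBK (that is only a guess; check the Content-Type header or the page's meta tag for the real charset). The byte string is a made-up stand-in for what get($url) would return.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: the raw GBK bytes for two Chinese characters,
# standing in for the undecoded body that get($url) returns.
my $raw_bytes = "\xD6\xD0\xCE\xC4";

# decode() turns bytes in a known encoding into a Perl character string.
# 'gbk' is an assumption here; use whatever charset the page declares.
my $text = decode('gbk', $raw_bytes);

# length() and regexes now operate on characters rather than bytes.
printf "bytes: %d, characters: %d\n", length($raw_bytes), length($text);

# encode() goes the other way, e.g. to UTF-8 before printing or saving.
my $utf8 = encode('UTF-8', $text);
print "$utf8\n";
```

    Matching against the decoded string rather than the raw bytes is what keeps multi-byte characters from being split or misread.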

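    The matching step can also be tried without hitting the site. A small sketch using a made-up HTML snippet (the markup and the example.com URLs are hypothetical, modeled on the pattern the script searches for):

```perl
use strict;
use warnings;

# Hypothetical markup modeled on the pattern the script searches for.
my $html = '<div class="movie"><img src="http://example.com/a.jpg" class="mp" />'
         . '<div class="movie"><img src="http://example.com/b.jpg" class="mp" />';

# [^"]+ stops at the closing quote, so each match captures a single URL.
# A greedy (.*) would run on to the last occurrence of " class="mp" /> on the line.
my @images = $html =~ /<div class="movie"><img src="([^"]+)" class="mp" \/>/g;

print "$_\n" for @images;
```

    In list context with /g the match returns every captured URL, which helps if a page ever carries more than one image.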
