Perl and CGI

 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
User Name:
Password:
Remember me



Go Back   Tutorialized ForumsWeb Design & DevelopmentPerl and CGI

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Rate Thread Display Modes
 
Unread Tutorialized Forums Sponsor:
  #1  
Old January 18th, 2009, 11:13 AM
sheff sheff is offline
Registered User
Tutorialized Newbie (0 - 499 posts)
 
Join Date: Jan 2009
Posts: 2 sheff User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 32 m 10 sec
Reputation Power: 0
Page ripper problem

Hello
I have some code that will go to a site and extract data and put it on a page of my choice. It's working fine BUT, the links on the page show my domain as a prefix instead of the original domain. I have adjusted the appropriate code but it doesn't seem to change anything. I have highlited the area below. Any suggestion would be welcome.
-----------
#!/usr/bin/perl
################################################## ###########################

################################################## ###########################

require LWP::UserAgent;

# read command prompt
read(STDIN, $commandprompt, $ENV{'CONTENT_LENGTH'});

# Split the name-value pairs
@theactualdata = split(/&/, $commandprompt);

$newcommandprompt = '';

foreach $actualdata (@theactualdata) {
($name, $value) = split(/=/, $actualdata);
if ($name ne 'page')
{ $newcommandprompt = $newcommandprompt . $actualdata } ;

# change all the + to blank spaces
$name =~ tr/+/ /;
$value =~ tr/+/ /;

# convert %20 to ' ' (a blank space), HEX - character conversion
$value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

# convert < > to '' (NULL)
$value =~ s/<([^>]|\n)*>//g;

# eliminate the less than and greater than signs
$value =~ s/<//g;
$value =~ s/>//g;
$URL{$name} = $value;

}

print "Content-type: text/html\n\n";

# store my url

$url = $URL{'url'};


#================================================= ==============================
# #### VARIABLES THAT MUST BE CHANGED FOR SCRIPT TO WORK ####
#================================================= ==============================

# 1 - MUST CHANGE - Change The URL below to the page you wish to begin extracting from

$url = 'I had to take the link out - the forum won't let me post a url';




# 2 - Starting Point to extract from (must be exactly the way it is on the page)

$start = '<font color="black">';





# 3 - Ending HTML

$finish = '<!-- any content goes here -->';





# 4 - If links don't work change the HTML below to reflect the link source (remove the #'s)

$sourcelink = '<a href="';
$newlink = '<a href="';
(I had to take the links out - the forum won't let me post a url
but the link was to the main domain at $url)





# 5 - If images don't work change the html below to reflect the image source (remove the #'s)

# $sourceimage = 'img src=\"';
# $newimage = 'img src=\"';





#================================================= ==============================
# DO NOT CHANGE ANYTHING BELOW THIS LINE. SCRIPT WILL NOT WORK.
#================================================= ==============================






$start =~ s/ /\\s\+/;
$start =~ s/\//\\\//;
$finish =~s/ /\\s\+/;
$finish =~ s/\//\\\//;

# initialise the mem vars
$t = '';
$content = '';
$contenttype = '';
$body = '';
$head = '';
$title = '';
$keywords = '';
$description = '';
$error = 0 ;
$htmlcode = 0 ;

# assign the browser agent
$agent ='Mozilla/9.0 (Windows 98;US) Opera 4.00 [en]';

# print the HTML header
print "<HTML><HEAD></HEAD><BODY>";

# has the user defined the URL
if (!$url)
{ print "Enter the URL\n"; }
else
{
# EXTRACT the page
&getpage($url);

# if error grabbing the page then print the return CODE RFC 2068.
# HTTP 1.1 Protocol
if ($error)
{ print "Error Getting Page : $url Code : $htmlcode<br>"; }

# ERROR or NO ERROR print whatever is grabbed.
$t = $content;

# convert newline or return to blank spaces globally
$t =~ s/\r/\n/gi;
$t =~ s/\n/ /gi;

# convert multiple blanks to a single blank globally
$t =~ s/ / /gi;

# convert tab to single blank globally
$t =~ s/\t/ /gi;

# get the meta tags, keywords, title, and descriptions
&gettags($t);

# print the return results
print "$stuff";



# print "<tr><td bgcolor=\"#DDDDDD\">Body</td><td bgcolor=\"#CCCCCC\">$body</td></tr></table>";
}

# end HTML

#print "<div align='right'>";
#print " <table border='0' cellpadding='0' cellspacing='0' width='100%'>";
#print " <tr>";
#print " <td width='100%' align='right'><font size='1'>The Webs My Stage";
#print " </font></td>";
#print " </tr>";
#print " </table>";
print "</body></html>";



# GET THE META TAGS
sub gettags {
my $mmyurl = shift;

# store the content to $body variable
$stuff = $mmyurl;



#----------------begin here



# get the page...
$stuff = ($stuff =~s/$start(.+)$finish/ /i) ? $1 : 'No Title';



$stuff =~ s/$sourcelink/$newlink/ig;
$stuff =~ s/$newlink$newlink/$newlink/ig;



$stuff =~ s/$sourceimage/$newimage/ig;






#----------------end here
#================================================= ========================
# DO NOT CHANGE ANYTHING BELOW. SCRIPT WILL NOT WORK
#================================================= ========================



}

# GET THE PAGE
sub getpage {
my $myurl = shift;
# $myurl stores the URL

# define the user agent
my $ua = new LWP::UserAgent;

# $cookies ="";

# define the timeout
$ua->timeout(60);

# set the browser agent
$ua->agent($agent);

# GET THE PAGE
my $req = HTTP::Request->new(GET => $myurl);
my $res = $ua->request($req);

# get the content of the page and the content type
$content = $res->content;
$contenttype = $res->content_type;

# if 200 success, then no error.
if ($res->is_success) {
$error =0;
}

# get the HTML error code RFC 2068 and store in HTMLCODE var.
# dont forget to set the error flag.
else {
$htmlcode = $res->code();
$error =1;
}
}

Reply With Quote
Reply

Viewing: Tutorialized ForumsWeb Design & DevelopmentPerl and CGI > Page ripper problem


Developer Shed Advertisers and Affiliates


Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest News | Latest Threads | Shoutbox
Forum Jump


Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
  
 

Powered by: vBulletin Version 3.0.5
Copyright ©2000 - 2018, Jelsoft Enterprises Ltd.

© 2003-2018 by Developer Shed. All rights reserved. DS Cluster - Follow our Sitemap