Chapter 4: The Socket Library, Part 2
Now we wait for a response from the server. We read in the response and
selectively echo it, checking the $response, $header, and $data variables
to see whether the user is interested in each part of the reply:
# get the HTTP response line
my $the_response = <F>;
print $the_response if ($all || defined $response);

# get the header data
while (<F> =~ m/^(\S+):\s+(.+)/) {
  print "$1: $2\n" if ($all || defined $header);
}

# get the entity body
if ($all || defined $data) {
  print while (<F>);
}
The full source code looks like this:
#!/usr/local/bin/perl -w
# socket based hypertext version of UNIX cat
use strict;
use Socket;                # include Socket module
require 'tcp.pl';          # file with open_TCP routine
require 'web.pl';          # file with parse_URL routine
use vars qw($opt_h $opt_H $opt_r $opt_d);
use Getopt::Std;

# parse command line arguments
getopts('hHrd');

# print out usage if needed
if (defined $opt_h || $#ARGV == -1) {
  help();
}

# go through each URL on the command line
for (@ARGV) {
  hcat($_, $opt_r, $opt_H, $opt_d);
}

# Subroutine to print out usage information
sub usage {
  print "usage: $0 -rhHd URL(s)\n";
  print "  -h    help\n";
  print "  -r    print out response\n";
  print "  -H    print out header\n";
  print "  -d    print out data\n\n";
  exit(-1);
}

# Subroutine to print out help text along with usage information
sub help {
  print "Hypertext cat help\n\n";
  print "This program prints out documents on a remote web server.\n";
  print "By default, the response code, header, and data are printed\n";
  print "but can be selectively printed with the -r, -H, and -d options.\n\n";
  usage();
}

# Given a URL, print out the data there
sub hcat {

  # grab parameters
  my ($full_url, $response, $header, $data) = @_;

  # assume that response, header, and data will be printed
  my $all = !($response || $header || $data);

  # if the URL isn't a full URL, assume that it is a http request
  $full_url = "http://$full_url"
    if ($full_url !~ m/(\w+):\/\/([^\/:]+)(:\d*)?([^#]*)/);

  # break up URL into meaningful parts
  my @the_url = parse_URL($full_url);
  if (!@the_url) {
    print "Please use fully qualified valid URL\n";
    exit(-1);
  }

  # we're only interested in HTTP URL's
  return if ($the_url[0] !~ m/http/i);

  # connect to server specified in 1st parameter
  if (!defined open_TCP('F', $the_url[1], $the_url[2])) {
    print "Error connecting to web server: $the_url[1]\n";
    exit(-1);
  }

  # request the path of the document to get
  print F "GET $the_url[3] HTTP/1.0\n";
  print F "Accept: */*\n";
  print F "User-Agent: hcat/1.0\n\n";

  # print out server's response.

  # get the HTTP response line
  my $the_response = <F>;
  print $the_response if ($all || defined $response);

  # get the header data
  while (<F> =~ m/^(\S+):\s+(.+)/) {
    print "$1: $2\n" if ($all || defined $header);
  }

  # get the entity body
  if ($all || defined $data) {
    print while (<F>);
  }

  # close the network connection
  close(F);
}
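Both hcat and the hgrepurl program below pull open_TCP( ) in from tcp.pl and parse_URL( ) from web.pl, which were built up earlier in the book. If you don't have those files handy, minimal sketches consistent with how the routines are called here look like the following. These are reconstructions under stated assumptions, not necessarily the book's exact code; in particular, parse_URL( ) is assumed to return a (protocol, host, port, path) list, defaulting the port to 80 and the path to /:

# tcp.pl (sketch): open_TCP(FILEHANDLE, host, port) connects the
# named filehandle to the given host and port, and unbuffers output.
use Socket;

sub open_TCP {
  my ($FS, $dest, $port) = @_;

  # create a TCP socket and connect it to the destination
  my $proto = getprotobyname('tcp');
  socket($FS, PF_INET, SOCK_STREAM, $proto) || return undef;
  my $sin = sockaddr_in($port, inet_aton($dest));
  connect($FS, $sin) || return undef;

  # unbuffer the filehandle so the request is sent immediately
  my $old_fh = select($FS);
  $| = 1;
  select($old_fh);
  1;
}

# web.pl (sketch): parse_URL( ) breaks a URL into
# (protocol, host, port, path), supplying defaults.
sub parse_URL {
  my ($url) = @_;

  # same pattern hcat uses to test for a full URL
  my @parsed = $url =~ m@(\w+)://([^/:]+):?(\d*)([^#]*)@;
  return () if ($#parsed < 3);

  $parsed[2] = 80  unless ($parsed[2]);   # default port
  $parsed[3] = '/' unless ($parsed[3]);   # default path

  @parsed;
}

1;   # required files must return a true value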
Shell Hypertext cat
With hcat, one can easily retrieve documents from remote web servers. But
there are times when a client request needs to be more complex than hcat is
willing to allow. To give the user more flexibility in sending client requests,
we'll change hcat into shcat, a shell utility that accepts the method, headers, and
entity-body data on standard input. With this program, you can write shell
scripts that specify different methods, custom headers, and submit form data.
All of this takes only a few changed lines. In hcat, where you
see this:
# request the path of the document to get
print F "GET $the_url[3] HTTP/1.0\n";
print F "Accept: */*\n";
print F "User-Agent: hcat/1.0\n\n";
Replace it with this:
# copy STDIN to network connection
while (<STDIN>) { print F; }
and save it as shcat. Now you can say whatever you want on shcat's STDIN,
and it will forward it on to the web server you specify. This allows you to do
things like HTML form postings with POST, or a file upload with PUT, and
selectively look at the results. At this point, it's really all up to you what you
want to say, as long as it's HTTP compliant.
Here's a UNIX shell script example that calls shcat to do a file upload:
#!/bin/ksh

echo "PUT /~apm/hi.txt HTTP/1.0
User-Agent: shcat/1.0
Accept: */*
Content-type: text/plain
Content-length: 2

hi" | shcat http://publish.ora.com/
Grep out URL References
When you need to quickly get a list of all the references in an HTML page,
here's a utility you can use to fetch an HTML page from a server and print
out the URLs referenced within the page. We've taken the hcat code and
modified it a little. There's also another function that we added to parse out
URLs from the HTML. Let's go over that first:
sub grab_urls {

  my ($data, %tags) = @_;
  my @urls;

  # while there are HTML tags
  skip_others: while ($data =~ s/<([^>]*)>//) {

    my $in_brackets = $1;
    my $key;

    foreach $key (keys %tags) {
      if ($in_brackets =~ /^\s*$key\s+/i) {    # if tag matches, try parms
        if ($in_brackets =~ /\s+$tags{$key}\s*=\s*"([^"]*)"/i) {
          my $link = $1;
          $link =~ s/[\n\r]//g;   # kill newlines, returns anywhere in url
          push (@urls, $link);
          next skip_others;
        }
        # handle case when url isn't in quotes (ie: <a href=url>)
        elsif ($in_brackets =~ /\s+$tags{$key}\s*=\s*([^\s]+)/i) {
          my $link = $1;
          $link =~ s/[\n\r]//g;   # kill newlines, returns anywhere in url
          push (@urls, $link);
          next skip_others;
        }
      }   # if tag matches
    }     # foreach tag
  }       # while there are brackets

  @urls;
}
The grab_urls( ) function has two parameters. The first argument is a scalar
containing the HTML data to go through. The second argument is a hash of
tags and parameters that we're looking for. After going through the HTML,
grab_urls( ) returns an array of the links that matched tags of the form
<tag parameter="url">. The outer if statement looks for HTML
tags, like <body>, <img>, and <a>. The inner if statement looks
for parameters to the tags, like BACKGROUND, SRC, and HREF, followed by the
URL itself. Upon finding a match, the referenced URL is pushed onto an array,
which is returned at the end of the function. We've saved this in web.pl, and will
include it in the hgrepurl program with a require 'web.pl'.
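As a quick check of how grab_urls( ) behaves, here's a small test sketch; the HTML fragment is made up, and web.pl is assumed to be in the current directory so that require can find it:

#!/usr/local/bin/perl -w
# feed grab_urls( ) a made-up HTML fragment
require 'web.pl';          # file with grab_urls routine

my $html = '<html><body background="back.gif">' .
           '<a href="http://www.ora.com/">home</a>' .
           '<img src=logo.gif></body></html>';

# look for <body background=...>, <a href=...>, and <img src=...>
my @found = grab_urls($html, ('body', 'background',
                              'a', 'href', 'img', 'src'));

# prints back.gif, http://www.ora.com/, and logo.gif,
# in the order the tags appear in the HTML
print "$_\n" foreach @found;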
The second major change from hcat to hgrepurl is the addition of:
my $data = '';

# get the entity body
while (<F>) { $data .= $_; }

# close the network connection
close(F);

# fetch images and hyperlinks into arrays, print them out
if (defined $images || $all) {
  @links = grab_urls($data, ('img', 'src', 'body', 'background'));
}
if (defined $hyperlinks || $all) {
  @links2 = grab_urls($data, ('a', 'href'));
}

my $link;
for $link (@links, @links2) { print "$link\n"; }
This appends the entity-body to the scalar $data. From there, we call
grab_urls( ) twice. The first call looks for image references by recognizing
<img src="..."> and <body background="..."> in the HTML. The second
call looks for hyperlinks by searching for instances of <a href="...">. Each
call to grab_urls( ) returns an array of URLs, stored in @links and @links2,
respectively. Finally, we print out the results.
Other than that, there are some smaller changes. For example, we look at the
response code. If it isn't 200 (OK), we skip it.
# if not an "OK" response of 200, skip it
if ($the_response !~ m@^HTTP/\d+\.\d+\s+200\s@) { return; }
We've retrofitted the reading of the response line, headers, and entity-body
so that they are no longer echoed to STDOUT; that isn't needed in the context
of this program. Also, instead of parsing the -r, -H, and -d command-line
arguments, we look for -i, to display image links only, and -l, to display
only hyperlinks.
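For example, the header loop in hgrepurl still reads each header line, but its body is now empty:

# skip over the header data without echoing it
while (<F> =~ m/^(\S+):\s+(.+)/) { }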
So, to see just the image references at www.ora.com, one would do this:
% hgrepurl -i http://www.ora.com
Or just the hyperlinks at www.ora.com:
% hgrepurl -l http://www.ora.com
Or both images and hyperlinks at www.ora.com:
% hgrepurl http://www.ora.com
The complete source code looks like this:
#!/usr/local/bin/perl -w
# socket based hypertext grep URLs. Given a URL, this
# prints out URLs of hyperlinks and images.
use strict;
use Socket;                # include Socket module
require 'tcp.pl';          # file with open_TCP routine
require 'web.pl';          # file with parse_URL routine
use vars qw($opt_h $opt_i $opt_l);
use Getopt::Std;

# parse command line arguments
getopts('hil');

# print out usage if needed
if (defined $opt_h || $#ARGV == -1) {
  help();
}

# go through each URL on the command line
for (@ARGV) {
  hgu($_, $opt_i, $opt_l);
}

# Subroutine to print out usage information
sub usage {
  print "usage: $0 -hil URL(s)\n";
  print "  -h    help\n";
  print "  -i    print out image URLs\n";
  print "  -l    print out hyperlink URLs\n";
  exit(-1);
}

# Subroutine to print out help text along with usage information
sub help {
  print "Hypertext grep URL help\n\n";
  print "This program prints out hyperlink and image links that\n";
  print "are referenced by a user supplied URL on a web server.\n\n";
  usage();
}

# hypertext grep url
sub hgu {

  # grab parameters
  my ($full_url, $images, $hyperlinks) = @_;

  # if neither option is given, show both images and hyperlinks
  my $all = !($images || $hyperlinks);
  my @links;
  my @links2;

  # if the URL isn't a full URL, assume that it is a http request
  $full_url = "http://$full_url"
    if ($full_url !~ m/(\w+):\/\/([^\/:]+)(:\d*)?([^#]*)/);

  # break up URL into meaningful parts
  my @the_url = parse_URL($full_url);
  if (!@the_url) {
    print "Please use fully qualified valid URL\n";
    exit(-1);
  }

  # we're only interested in HTTP URL's
  return if ($the_url[0] !~ m/http/i);

  # connect to server specified in 1st parameter
  if (!defined open_TCP('F', $the_url[1], $the_url[2])) {
    print "Error connecting to web server: $the_url[1]\n";
    exit(-1);
  }

  # request the path of the document to get
  print F "GET $the_url[3] HTTP/1.0\n";
  print F "Accept: */*\n";
  print F "User-Agent: hgrepurl/1.0\n\n";

  # get the HTTP response line
  my $the_response = <F>;

  # if not an "OK" response of 200, skip it
  if ($the_response !~ m@^HTTP/\d+\.\d+\s+200\s@) { return; }

  # skip over the header data without echoing it
  while (<F> =~ m/^(\S+):\s+(.+)/) { }

  my $data = '';

  # get the entity body
  while (<F>) { $data .= $_; }

  # close the network connection
  close(F);

  # fetch images and hyperlinks into arrays, print them out
  if (defined $images || $all) {
    @links = grab_urls($data, ('img', 'src', 'body', 'background'));
  }
  if (defined $hyperlinks || $all) {
    @links2 = grab_urls($data, ('a', 'href'));
  }

  my $link;
  for $link (@links, @links2) { print "$link\n"; }
}