SCREEN SCRAPING YOUR WAY INTO RSS

Introduction
RSS is one of the hottest technologies at the moment, and even big web
publishers (such as the New York Times) are getting into RSS as well.
However, there are still a lot of websites that do not have RSS feeds.

If you still want to be able to read those websites in your
favourite aggregator, you need to create your own RSS feed for
those websites. This can be done automatically with PHP, using a
method called screen scraping. Screen scraping is usually
frowned upon, as it’s mostly used to steal content from other
websites.

I personally think that in this case, to automatically
generate an RSS feed, screen scraping is not a bad thing. Now,
on to the code!

Getting the content
For this article, we’ll use PHPit as an example,
despite the fact that PHPit already has RSS feeds.

We’ll want to create an RSS feed from the content listed on the
front page. The first step in screen scraping is getting the
complete page. In PHP this can be done very easily, by using
implode("", file("[the url here]")), if your web host allows it.
If you can’t use file() you’ll have to use a different method of
getting the page, e.g. using the cURL library.
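
As a minimal sketch of that fallback, the hypothetical helper fetch_page() below (the name and structure are my own, not part of the original script) first tries file_get_contents(), which returns the same string as implode("", file($url)), and then falls back to cURL. It assumes the cURL extension is installed, and the timeout value is an arbitrary choice.

```php
<?php
// Hypothetical helper (not from the original script): fetch a page as a string.
// Returns the page body, or false when neither method works.
function fetch_page($url)
{
    // file_get_contents() can only read remote URLs when allow_url_fopen is on.
    if (ini_get('allow_url_fopen')) {
        $data = @file_get_contents($url);
        if ($data !== false) {
            return $data;
        }
    }

    // Fall back to the cURL extension when it is available.
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed limit: give up after 10 seconds
        $data = curl_exec($ch);
        if ($data !== false) {
            return $data;
        }
    }

    return false;
}
```

Calling $data = fetch_page("http://www.phpit.net/"); would then stand in for the implode()/file() line, with a check for false before any parsing.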

Now that we have the page available, we can parse it for the
content using some regular expressions. The key to screen
scraping is looking for patterns that match the content, e.g.
are all the content items wrapped in <div>’s or something
else? If you can successfully find a pattern, then you can
use preg_match_all() to get all the content items.

For PHPit, the pattern that matches the content is <div
class="contentitem">[Content Here]</div>. You
can verify this yourself by going to the main page of PHPit, and
viewing the source.
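
To see what preg_match_all() returns for a pattern like this, here is a small sketch run against made-up markup rather than the real PHPit page. The function name extract_items() and the sample HTML are assumptions for illustration only.

```php
<?php
// Hypothetical helper: return the inner HTML of every contentitem <div>.
function extract_items($html)
{
    // [^`]*? lazily matches any run of characters except a backtick,
    // so unlike "." it also spans newlines.
    preg_match_all('/<div class="contentitem">([^`]*?)<\/div>/', $html, $matches);

    // $matches[0] holds the full matches (including the <div> tags),
    // $matches[1] holds only the captured content between them.
    return $matches[1];
}

// Made-up sample markup standing in for a downloaded page.
$sample = '<div class="contentitem"><h3><a href="/one">First</a></h3></div>' . "\n"
        . '<div class="contentitem"><h3><a href="/two">Second</a></h3></div>';

foreach (extract_items($sample) as $item) {
    echo $item, "\n";
}
```

Because the match is lazy, it stops at the first </div> it meets. If a content item contains nested <div>’s, the item will be cut short and the pattern needs adjusting.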

Now that we have a match we can get all the content items. The
next step is to retrieve the individual information, i.e. url,
title, author, text. This can be done by using some more regular
expressions and str_replace() on each of the content items.

By now we have the following code:

<?php

// Get page
$url = "http://www.phpit.net/";
$data = implode("", file($url));

// Get content items
preg_match_all('/<div class="contentitem">([^`]*?)<\/div>/', $data, $matches);

Like I said, the next step is to retrieve
the individual information, but first let’s make a start on
our feed, by setting the appropriate header (text/xml) and
printing the channel information, etc.

// Begin feed
header("Content-Type: text/xml; charset=ISO-8859-1");
echo '<?xml version="1.0" encoding="ISO-8859-1" ?>' . "\n";
?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:admin="http://webns.net/mvcb/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel>
<title>PHPit Latest Content</title>
<description>The latest content from PHPit (http://www.phpit.net), screen scraped!</description>
<link>http://www.phpit.net</link>
<language>en-us</language>

Now it’s time to loop through the items, and print
their RSS XML. We first loop through each item, and get all the
information we need, by using more regular expressions and
preg_match(). After that the RSS for the item is printed.

<?php
// Loop through each content item
foreach ($matches[0] as $match) {
    // First, get title
    preg_match('/">([^`]*?)<\/a><\/h3>/', $match, $temp);
    $title = $temp[1];
    $title = strip_tags($title);
    $title = trim($title);

    // Second, get url
    preg_match('/<a href="([^`]*?)">/', $match, $temp);
    $url = $temp[1];
    $url = trim($url);

    // Third, get text
    preg_match('/<p>([^`]*?)<span class="byline">/', $match, $temp);
    $text = $temp[1];
    $text = trim($text);

    // Fourth, and finally, get author
    preg_match('/<span class="byline">By ([^`]*?)<\/span>/', $match, $temp);
    $author = $temp[1];
    $author = trim($author);

    // Echo RSS XML
    echo "\t\t<item>\n";
    echo "\t\t\t<title>" . strip_tags($title) . "</title>\n";
    echo "\t\t\t<link>http://www.phpit.net" . strip_tags($url) . "</link>\n";
    echo "\t\t\t<description>" . strip_tags($text) . "</description>\n";
    echo "\t\t\t<content:encoded><![CDATA[ \n";
    echo $text . "\n";
    echo " ]]></content:encoded>\n";
    echo "\t\t\t<dc:creator>" . strip_tags($author) . "</dc:creator>\n";
    echo "\t\t</item>\n";
}
?>

And finally, the RSS file is finished off.

</channel>
</rss>

That’s all. If you put
all the code together, like in the demo script, then you’ll have
a complete RSS feed.
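
One caveat with the loop: strip_tags() removes markup but does not escape characters such as &, so a title containing a bare ampersand or an HTML-only entity like &nbsp; would make the feed invalid XML. The sketch below shows one possible way to handle this. The helper name xml_text() is hypothetical and not part of the demo script, and it assumes the page uses the same ISO-8859-1 encoding as the feed.

```php
<?php
// Hypothetical helper: make scraped HTML safe to place inside an XML element.
function xml_text($html)
{
    // Remove markup, then decode HTML entities (e.g. &nbsp;, &eacute;)
    // that an XML parser would not recognise.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'ISO-8859-1');

    // Escape &, <, > and quotes so the result is valid XML character data.
    return htmlspecialchars($text, ENT_QUOTES, 'ISO-8859-1');
}

echo "<title>" . xml_text('Tom & Jerry <b>live</b>') . "</title>\n";
```

In the loop, xml_text($title) could take the place of strip_tags($title) for the title, description and author elements. The content:encoded element already sits inside a CDATA section and can stay as it is.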

Conclusion
In this tutorial I have shown you how
to create an RSS feed from a website that does not have an RSS
feed itself yet. Though the regular expression is different
for each website, the principle is exactly the same.

One thing I should mention is that you shouldn’t immediately
screen scrape a website’s content. E-mail them first about an RSS
feed. Who knows, they might set one up themselves, and that
would be even better.

Download sample script
