Note: as of 2008/04/14, the code available here will certainly not work without a lot of hacking, because phpBB has released a new version.
In this document, I explain how to get the messages from a phpBB forum and put them into an RSS feed.
Why not?
Or to state it with more words:
I subscribe to this forum. The problem is that a lot of messages are posted every day, and I can't cope with all these messages and the very bad (in my opinion) navigation offered by phpBB.
On the other hand, I use the Opera browser, which supports RSS feeds in a very nice way (something very similar to email clients like the one I use: pine).
So, a few days ago I said to myself: "Hey! Why not download these horrible HTML pages and convert them into RSS 2.0 stuff?"
Four days (less than 30 hours in total) and 1800 lines of code later, I was done.
You can download the code if you don't believe me.
After I decided to do this, I said to myself: "OK, this is a nice idea, but how the heck am I going to do that?"
Since I am crazy, the answer came quickly: "I will do it purely in C and will write everything I need from scratch."
After some thinking, I realized that the most reliable solution to access the messages would be to write a full HTML parser.
Have you ever written one? Do you think you could do it in less than 30 hours?
It was clear that I would not get anything working quickly that way, and in the meantime messages would keep arriving. I needed a quick solution, so I decided to use an existing parser.
After a quick search on the Internet, I decided to use libxml.
What can you do with libxml? You can parse an HTML document. What do you do with this nice but rather complex tree, hum?
Fortunately for you, there is this XSLT stuff. This is some kind of language that can take an XML document and transform it into another one. And the guy behind libxml also does libxslt.
You also need to explore the structure of the HTML page. Firefox has a very nice DOM inspector for that.
I had my tools: libxml2-2.6.23 to parse the HTML page, and libxslt-1.1.15 together with some XSLT scripts to transform the desired elements into something easy to parse.
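The glue between the two libraries is pretty small. Here is a minimal sketch of the parse-then-transform step (this is only an illustration, not the actual program; the file names are placeholders):

#include <libxml/HTMLparser.h>
#include <libxslt/xslt.h>
#include <libxslt/xsltInternals.h>
#include <libxslt/transform.h>
#include <libxslt/xsltutils.h>

int main(void)
{
    /* Parse the downloaded phpBB page; the HTML parser copes with tag soup. */
    htmlDocPtr page = htmlReadFile("index.html", NULL,
                                   HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (page == NULL)
        return 1;

    /* Load an XSLT stylesheet and apply it to the parsed tree. */
    xsltStylesheetPtr style = xsltParseStylesheetFile((const xmlChar *)"forums.xsl");
    if (style == NULL)
        return 1;
    xmlDocPtr result = xsltApplyStylesheet(style, page, NULL);

    /* Write the small, easy-to-parse XML result to stdout. */
    xsltSaveResultToFile(stdout, result, style);

    xmlFreeDoc(result);
    xmlFreeDoc(page);
    xsltFreeStylesheet(style);
    xsltCleanupGlobals();
    xmlCleanupParser();
    return 0;
}

Compile it against libxml2 and libxslt (for instance with the flags given by xml2-config and xslt-config).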
First of all, I needed to learn a bit of XSLT. I knew it conceptually but had never used it. (I am not an XML fan: too big, too complicated.)
After a few hours, I was done. I learned it and wrote my scripts.
Let me now explain in great detail how I use this XSLT stuff.
A phpBB forum is organized in the following way. The first page contains the list of forums. When you click on a forum, you get a list of topics. When you click a topic, you get the messages posted in this topic. So you have three levels here.
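In terms of URLs, the three levels look like this (these are the names that show up later on this page):

index.php            ->  the list of forums
viewforum.php?f=X    ->  the list of topics of forum X
viewtopic.php?t=Y    ->  the messages posted in topic Y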
First of all, I download index.php, which contains the forums' list. I study its structure and come up with this XSLT code to extract what I want:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"
              omit-xml-declaration="yes" encoding="iso-8859-1"/>

  <xsl:variable name="bstuff">&amp;sid=</xsl:variable>

  <xsl:template match="tr[count(*) = 5]">
    <forum>
      <title>
        <xsl:value-of select="td[2]/span[1]/a"/>
      </title>
      <date>
        <xsl:value-of select="td[5]/span/text()"/>
      </date>
      <author>
        <xsl:value-of select="td[5]/span/a[1]/text()"/>
      </author>
      <link>
        <xsl:text>/forum/fr/</xsl:text>
        <xsl:if test="string-length(substring-before(td[2]/span[1]/a/@href, $bstuff)) = 0">
          <xsl:value-of select="td[2]/span[1]/a/@href"/>
        </xsl:if>
        <xsl:if test="string-length(substring-before(td[2]/span[1]/a/@href, $bstuff)) != 0">
          <xsl:value-of select="substring-before(td[2]/span[1]/a/@href, $bstuff)"/>
        </xsl:if>
      </link>
    </forum>
  </xsl:template>

  <xsl:template match="/">
    <forums>
      <xsl:apply-templates/>
    </forums>
  </xsl:template>

  <xsl:template match="text()"> </xsl:template>
</xsl:stylesheet>
It gives me an XML document with the following structure:
<forums>
  <forum>
    <title>informations et modalités de participation au forum</title>
    <date>Sam 04 Mar 2006, 11:36</date>
    <author>Garlik</author>
    <link>/forum/fr/viewforum.php?f=79</link>
  </forum>
  ...
</forums>
This is damn easy to parse, isn't it?
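To give an idea, walking such a document with libxml boils down to a couple of nested loops. This is only a sketch of the idea, not the original code (the file name is a placeholder):

#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

int main(void)
{
    xmlDocPtr doc = xmlReadFile("forums.xml", NULL, 0);
    if (doc == NULL)
        return 1;

    xmlNodePtr root = xmlDocGetRootElement(doc);          /* <forums> */
    for (xmlNodePtr forum = root->children; forum; forum = forum->next) {
        if (forum->type != XML_ELEMENT_NODE)
            continue;                                      /* skip whitespace text nodes */
        for (xmlNodePtr field = forum->children; field; field = field->next) {
            if (field->type != XML_ELEMENT_NODE)
                continue;
            xmlChar *value = xmlNodeGetContent(field);     /* title, date, author, link */
            printf("%s: %s\n", (const char *)field->name, (const char *)value);
            xmlFree(value);
        }
        printf("\n");
    }

    xmlFreeDoc(doc);
    xmlCleanupParser();
    return 0;
}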
So now, I need to get the topics of a forum (the viewforum.php?f=X stuff). (Go here to see how it looks.)
Here is the XSLT code I wrote to get the topics:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"
              omit-xml-declaration="yes" encoding="iso-8859-1"/>

  <xsl:template match="tr[count(*) = 6]">
    <topic>
      <title>
        <xsl:value-of select="td[2]/span/a[position()=last()]"/>
      </title>
      <date>
        <xsl:value-of select="td[6]/span/text()"/>
      </date>
      <author>
        <xsl:value-of select="td[6]/span/a[1]/text()"/>
      </author>
      <link>
        <xsl:text>/forum/fr/</xsl:text>
        <xsl:value-of select="td[2]/span/a[position()=last()]/@href"/>
      </link>
    </topic>
  </xsl:template>

  <xsl:template match="/">
    <topics>
      <xsl:apply-templates/>
    </topics>
  </xsl:template>

  <xsl:template match="text()"> </xsl:template>
</xsl:stylesheet>
It gives the following XML structure:
<topics>
  <topic>
    <title>Modalités de participation au forum</title>
    <date>Mar 08 Nov 2005, 13:34</date>
    <author>Jean-François Delcamp</author>
    <link>/forum/fr/viewtopic.php?t=1642</link>
  </topic>
  ...
</topics>
It is very similar to the forums' list. Once again, very easy to parse.
A problem arises here. The topics may appear on more than one page. So I need to get the next pages.
After some DOM inspection with Firefox, I wrote this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"
              omit-xml-declaration="yes" encoding="iso-8859-1"/>

  <xsl:template match="a[starts-with(@href,'viewforum.php') and contains(@href, '&amp;start=')]">
    <next>
      <xsl:value-of select="substring-after(@href,'start=')"/>
    </next>
  </xsl:template>

  <xsl:template match="text()"> </xsl:template>

  <xsl:template match="/">
    <nexts>
      <xsl:apply-templates/>
    </nexts>
  </xsl:template>
</xsl:stylesheet>
Which gives me something like:
<nexts>
  <next>50</next>
  <next>50</next>
  <next>50</next>
  <next>50</next>
</nexts>
That is still easy to use: you only need to take the last number, build a URL based on the current one, and do it all again with the newly downloaded page (check the C code for the details).
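Stripped down to its bare idea, the pagination step amounts to something like this (the values are hypothetical; the real code keeps the f=X of the current URL and loops until no <next> is found):

#include <stdio.h>

int main(void)
{
    int forum_id  = 79;   /* hypothetical: forum id of the current page */
    int last_next = 50;   /* hypothetical: last value found in <nexts>  */

    char next_url[256];
    snprintf(next_url, sizeof(next_url),
             "/forum/fr/viewforum.php?f=%d&start=%d", forum_id, last_next);
    printf("%s\n", next_url);   /* /forum/fr/viewforum.php?f=79&start=50 */
    return 0;
}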
Then, the topic's page has to be downloaded and the messages extracted from it. (Go here to see how it looks.)
After some DOM inspection, I wrote this XSLT code:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"
              omit-xml-declaration="yes" encoding="iso-8859-1"/>

  <xsl:template match="@*" mode="links">
    <xsl:choose>
      <xsl:when test="name() = 'href'">
        <xsl:attribute name="href">
          <xsl:text>http://delcamp.net/forum/fr/</xsl:text>
          <xsl:value-of select="string()"/>
        </xsl:attribute>
      </xsl:when>
      <xsl:when test="name() = 'src'">
        <xsl:attribute name="src">
          <xsl:text>http://delcamp.net/forum/fr/</xsl:text>
          <xsl:value-of select="string()"/>
        </xsl:attribute>
      </xsl:when>
      <xsl:otherwise>
        <xsl:copy/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template match="node()" mode="links">
    <xsl:copy>
      <xsl:apply-templates mode="links" select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()" mode="links">
    <xsl:copy/>
  </xsl:template>

  <xsl:template match="tr[count(*) = 2 and
                          ((td[1]/@class='row1' and td[2]/@class='row1' and td[1]/span[1]/@class='name') or
                           (td[1]/@class='row2' and td[2]/@class='row2' and td[1]/span[1]/@class='name'))]">
    <message>
      <author>
        <xsl:value-of select="td[1]/span[1]"/>
      </author>
      <date>
        <xsl:value-of select="substring-after(td[2]/table/tr[1]/td[1]/span[1]/text(), ': ')"/>
      </date>
      <body>
        <xsl:apply-templates mode="links" select="td[2]"/>
      </body>
      <link>
        <xsl:text>/forum/fr/</xsl:text>
        <xsl:value-of select="td[2]/table/tr[1]/td[1]/a/@href"/>
      </link>
    </message>
  </xsl:template>

  <xsl:template match="/">
    <messages>
      <xsl:apply-templates/>
    </messages>
  </xsl:template>

  <xsl:template match="text()"> </xsl:template>
</xsl:stylesheet>
It is a bit more complicated because, in the message's body, we can find links and images that use relative URLs, which I have to translate into absolute URLs (this is done by the mode="links" stuff in the code above).
I obtain an XML structure like:
<messages>
  <message>
    <author>CYRLOUD</author>
    <date>Jeu 16 Fév 2006, 01:12</date>
    <body>
      [...] (body of the message)
    </body>
    <link>/forum/fr/viewtopic.php?p=50172</link>
  </message>
  ...
</messages>
Still easy to parse and handle.
I've got the same problem as for the topics: messages may appear on several pages. Once again, we can extract the 'Next' link and do it all again from the new page. I don't give you the XSLT code; it is pretty similar to the topics' case.
And that's it!
I've got the messages' list. All that remains is to put them into an RSS 2.0 file and tell Opera to use this file as a feed, and we are done.
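The generated feed itself is not shown on this page, but a minimal RSS 2.0 file of the kind the program could produce looks roughly like this (the item reuses the message shown above; the actual output may differ):

<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version="2.0">
  <channel>
    <title>delcamp.net forum</title>
    <link>http://www.delcamp.net/forum/fr/</link>
    <description>Messages extracted from the phpBB forum</description>
    <item>
      <title>Modalités de participation au forum</title>
      <link>http://www.delcamp.net/forum/fr/viewtopic.php?p=50172</link>
      <pubDate>Thu, 16 Feb 2006 01:12:00 +0100</pubDate>
      <description>[...] (body of the message)</description>
    </item>
  </channel>
</rss>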
I didn't mention it, but I have to log in to the server, because some topics can only be accessed by registered users.
I first said to myself: "Let's do it the hardcore way, all by hand: socket(), connect(), write() and read()!"
Because I found libxml and libxslt very useful, I looked for some libs for the network stuff too. But nothing was good for me, so I ended up doing it by hand. It was easier than I first thought.
I need to log in, which means sending an HTTP request to the server. How does this request look? Well, let Opera tell me. For that, set up a fake proxy (on localhost, port 8000, with pipo_server.c listening on this port; I wrote it some years ago and use it from time to time for various projects). Then go to the login page in Opera and see what the fake proxy gets when you log in.
It is something like (I omit useless headers):
POST /forum/fr/login.php HTTP/1.0
Host: www.delcamp.net
Content-Length: XXX
Content-Type: application/x-www-form-urlencoded

username=X&password=Y&redirect=&login=Connexion
OK, I've got the request.
Now, how does the server reply to it?
To find out, just telnet to the server on port 80, send the request and watch the answer. It is something like:
HTTP/1.1 302 Found
Date: Thu, 02 Mar 2006 12:16:55 GMT
Server: Apache
X-Powered-By: PHP/4.3.10-16
Set-Cookie: phpbb2mysql_fr_data=XXX; expires=Fri, 02-Mar-07 12:17:16 GMT; path=/
Set-Cookie: phpbb2mysql_fr_sid=YYY; path=/
Set-Cookie: phpbb2mysql_fr_data=ZZZ; expires=Fri, 02-Mar-07 12:17:16 GMT; path=/
Set-Cookie: phpbb2mysql_fr_sid=AAA; path=/
Location: http://www.delcamp.net/forum/fr/index.php?sid=XX
Connection: close
Content-Type: text/html
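Putting the request and the reply together, the "by hand" network code boils down to something like the following sketch. This is my illustration of the idea with placeholder values, not the original source; error handling is mostly omitted:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    const char *body = "username=X&password=Y&redirect=&login=Connexion";
    char request[1024], reply[4096];
    ssize_t n;

    /* Resolve the server and connect to port 80. */
    struct hostent *he = gethostbyname("www.delcamp.net");
    if (he == NULL)
        return 1;
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(80);
    memcpy(&addr.sin_addr, he->h_addr_list[0], he->h_length);
    if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return 1;

    /* Send the login POST... */
    snprintf(request, sizeof(request),
             "POST /forum/fr/login.php HTTP/1.0\r\n"
             "Host: www.delcamp.net\r\n"
             "Content-Length: %d\r\n"
             "Content-Type: application/x-www-form-urlencoded\r\n"
             "\r\n%s",
             (int)strlen(body), body);
    write(s, request, strlen(request));

    /* ... and dump the reply (status line, Set-Cookie headers, etc.). */
    while ((n = read(s, reply, sizeof(reply))) > 0)
        fwrite(reply, 1, (size_t)n, stdout);

    close(s);
    return 0;
}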
OK, so I need to handle cookies in my requests. But which one? There are four here.
After trying a bit I found that the last one is good, so let's roll with it.
After login, I memorize the cookie and just download all my pages using it.
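Concretely, every later download presumably looks something like this (the cookie name comes from the reply above, the value is abbreviated):

GET /forum/fr/viewforum.php?f=79 HTTP/1.0
Host: www.delcamp.net
Cookie: phpbb2mysql_fr_sid=AAA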
I finish by logging out.
How to log out? The logout link appears in index.php, so one more little XSLT script:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"
              omit-xml-declaration="yes" encoding="iso-8859-1"/>

  <xsl:template match="a[contains(@href,'login.php') and contains(@href, 'logout')]">
    <logout>
      <xsl:text>/forum/fr/</xsl:text>
      <xsl:value-of select="@href"/>
    </logout>
  </xsl:template>

  <xsl:template match="text()"> </xsl:template>

  <xsl:template match="/">
    <logouts>
      <xsl:apply-templates/>
    </logouts>
  </xsl:template>
</xsl:stylesheet>
Which gives:
<logouts>
  <logout>
    /forum/fr/login.php?logout=true&sid=XXXXXXXXXXXXX
  </logout>
</logouts>
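Logging out is then presumably just a matter of requesting that URL with the same cookie, something like:

GET /forum/fr/login.php?logout=true&sid=XXXXXXXXXXXXX HTTP/1.0
Host: www.delcamp.net
Cookie: phpbb2mysql_fr_sid=AAA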
And we are done.
We then need some C code to remember the dates of the latest downloaded messages and fetch only the recent ones, and we're set. (For that purpose, I create the file forums.txt, which contains the latest date of each forum. This file is updated at each run of the program.)
The code, in its current incarnation, is tied to Delcamp's forum.
But you can get the sources, my dear Luke, and hack them as you like.
I guess with some (a lot of?) modifications you can use it for any other web forum (or even any website you want).
I use C; maybe it's overkill for such a job. The bottleneck here is network access, not CPU consumption. Maybe Python or something like that would be enough. But since I only know C...
Damn, jwz also did something similar, but in pure Perl. Check it here. However, he uses regular expressions rather than parsing the document "the real way", so I suspect that my approach is more robust.
Hackers, unite? Hell, no. The more diverse we are, the better.
On this page, I present a method to extract phpBB messages and store them in an RSS feed.
The method uses the libxml and libxslt libraries, plus some custom C code to access the network and keep only the recent messages.
It took less than 30 hours and 1800 lines of code (C and XSLT combined).
So now, instead of wasting my time browsing a phpBB forum, I waste it browsing hundreds of messages from an RSS feed.
Evolution...
Contact: sed@free.fr
Creation time:
Mon, 06 Mar 2006 15:14:10 +0100
Last update:
Mon, 14 Apr 2008 12:04:26 +0200
Powered by a human brain, best viewed with your eyes (or your fingers if you are blind).