Crawling phpBB or other forms-based apps with SharePoint

Out of the box, Microsoft Office SharePoint Server comes with a search engine, which is the 2008 Search Server. It can crawl different content sources over HTTP: it requests a page, indexes it and, if you want it to, follows the links on that page.

Within SharePoint sites this is all pretty straightforward. Crawling internet websites is also pretty easy, as long as they allow anonymous access or support NTLM (Integrated Windows) authentication. Things change when you want to index a web application protected by a forms-based username and password login, like a phpBB forum (and loads of other PHP-based sites).
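To make the difference concrete, here is a minimal Python sketch of what a forms-based login looks like on the wire: the credentials travel as ordinary form fields in an HTTP POST, and the site answers with a session cookie that has to accompany every later request. The field names and URL match the phpBB example used later in this post; the helper names are made up for illustration, and a real login form may require extra hidden fields.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def build_login_request(auth_url, fields):
    """Build (but do not send) the POST request a login form would submit."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(auth_url, data=body, method="POST")


def open_session():
    """Return an opener that stores cookies, so the login survives across requests."""
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )


# Placeholder values, taken from the example rule in this post.
request = build_login_request(
    "http://forum/ucp.php?mode=login",
    {"username": "user", "password": "pass"},
)

# Sending it would look like this (not executed here, it needs a reachable forum):
#   session = open_session()
#   session.open(request)            # log in, the session cookie is stored
#   session.open("http://forum/")    # later requests reuse the cookie
```

A crawler that only knows anonymous or NTLM access never performs that first POST, which is why it only ever sees the login page.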

For this purpose, Microsoft released a hotfix and a downloadable application that update the crawl rules API to allow this.

To enable this functionality, first download the hotfix package and the addrule.exe application.

Install the two on your SharePoint server (make sure to run the configuration wizard after updating SharePoint). To use the addrule.exe application, you’ll need an XML file that tells the application what kind of crawl rule to insert. The XML I used for phpBB looked like this:

<rules ssp="SharedServices">
  <rule>
    <path>http://forum/*</path>
    <type>FORMS</type>
    <auth_url>http://forum/ucp.php?mode=login</auth_url>
    <login_type>post</login_type>
    <error_pages>
      <error_page>http://forum/ucp.php?mode=login</error_page>
    </error_pages>
    <parameters>
      <param name="username" public="true">user</param>
      <param name="password" public="true">pass</param>
    </parameters>
  </rule>
</rules>
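If you would rather generate this file than type it by hand, a small script avoids copy-paste problems such as curly quotes or unescaped characters in a password. The following Python sketch rebuilds the same structure with the standard library. The element and attribute names are copied from the example above; I have not seen a formal schema for addrule.exe, so treat the layout as an assumption, and note that `build_rule_xml` is just an illustrative helper name.

```python
import xml.etree.ElementTree as ET


def build_rule_xml(ssp, path, auth_url, error_urls, params):
    """Return a forms-authentication crawl rule as an XML string.

    The structure mirrors the hand-written example in this post.
    """
    rules = ET.Element("rules", {"ssp": ssp})
    rule = ET.SubElement(rules, "rule")
    ET.SubElement(rule, "path").text = path
    ET.SubElement(rule, "type").text = "FORMS"
    ET.SubElement(rule, "auth_url").text = auth_url
    ET.SubElement(rule, "login_type").text = "post"

    error_pages = ET.SubElement(rule, "error_pages")
    for url in error_urls:
        ET.SubElement(error_pages, "error_page").text = url

    parameters = ET.SubElement(rule, "parameters")
    for name, value in params.items():
        # ElementTree escapes characters such as & and < in the value.
        param = ET.SubElement(parameters, "param", {"name": name, "public": "true"})
        param.text = value

    return ET.tostring(rules, encoding="unicode")


# Placeholder values from the example above; replace them with your own.
xml_text = build_rule_xml(
    ssp="SharedServices",
    path="http://forum/*",
    auth_url="http://forum/ucp.php?mode=login",
    error_urls=["http://forum/ucp.php?mode=login"],
    params={"username": "user", "password": "pass"},
)
print(xml_text)
```

Write the returned string to a file and pass that file to addrule.exe the same way you would pass a hand-written one.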

Make sure you set the correct SSP name, the server your forum is hosted on and the user account that SharePoint should use to crawl the content. If you want to crawl a different application (not phpBB), check the HTML source of the login page to find out which names the username and password text boxes have. Those names must exactly match the names of the parameters in the XML document.
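Reading the HTML source by eye works, but the field names can also be pulled out with a few lines of Python. This is a minimal sketch using only the standard library; the sample markup below is invented for illustration and is not real phpBB output, so feed it the saved source of your own login page instead.

```python
from html.parser import HTMLParser


class InputNameCollector(HTMLParser):
    """Collect (name, type) for every <input> element that has a name."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            self.fields.append((name, attributes.get("type", "text")))


def list_input_fields(html):
    """Return the named input fields found in an HTML document, in page order."""
    collector = InputNameCollector()
    collector.feed(html)
    return collector.fields


# Hypothetical login form, only here to show the output shape.
SAMPLE_LOGIN_PAGE = """
<form action="ucp.php?mode=login" method="post">
  <input type="text" name="username" />
  <input type="password" name="password" />
  <input type="submit" name="login" value="Login" />
</form>
"""

for name, field_type in list_input_fields(SAMPLE_LOGIN_PAGE):
    print(name, field_type)
```

The names reported for the text and password fields are the ones to use in the `name` attributes of the `param` elements. Any extra fields it reports (hidden inputs, submit buttons) may also be required by the application, which is worth checking if the crawl keeps landing on the login page.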

Run addrule.exe. Once it succeeds, the new rule is available in the configuration pages of your Shared Services Provider. When you edit the crawl rule, the access account section is grayed out because you provided these settings manually (the GUI doesn’t support forms-based authentication yet). Proceed with creating a content source and adding a search scope as usual. If the crawl log shows no errors, search results should become visible once everything has been crawled and updated!
