We needed a solution that would integrate search across several independent web applications. In the case of PowerGUI.org, we had the Jive Forums / Jive Integrated site, which housed the discussion forums and a library of downloadable powerpacks (mostly XML files). We also had a MediaWiki wiki that was used to collaboratively internationalize the application, plus a small handful of bloggers who were active participants on the site and blogged extensively about PowerShell.
Our chosen solution was the Google Mini. For the most part, it was a perfect fit, but we ran into one problem: the Google Mini was only licensed for 100,000 pages. Given the number of threads in the PowerGUI forums and the size of the wiki and blogs, we expected the Mini to use perhaps 15,000 pages. Instead, it maxed out the 100,000-page limit on the initial crawl.
After much investigation, we determined that the unusually high page count was caused by the Jive application. Jive exposes an independent URL not only for each thread, but also for each message within each thread. This resulted in a large amount of duplicate content in the Google Mini. User profiles made the problem worse: we had multiple websites sharing the same user base, so each user's profile was crawled again for every website. No wonder our page count was out of sight.
From the Crawl Diagnostics, I was able to determine that the duplicate content came from three pages: message.jspa, accountView.jspa, and profile.jspa. In addition, URLs containing “tstart” accounted for a lot of duplicate content.
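For anyone who wants to repeat this kind of diagnosis outside the Mini, here is a minimal Python sketch of the same idea: take a list of crawled URLs and tally them by page name and by query parameter, so the pages responsible for most of the count stand out. The sample URLs and the parameter names threadID, messageID, and userID are made up for illustration; only the page names and “tstart” come from our crawl.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs


def tally_pages(urls):
    """Count URLs by page name (last path segment) and by query parameter name."""
    pages = Counter()
    params = Counter()
    for url in urls:
        parts = urlsplit(url)
        # Last path segment, e.g. "message.jspa"; fall back to "/" for the root.
        page = parts.path.rsplit("/", 1)[-1] or "/"
        pages[page] += 1
        # Each parameter name is counted once per URL it appears in.
        for name in parse_qs(parts.query, keep_blank_values=True):
            params[name] += 1
    return pages, params


# Hypothetical sample; a real list would come from an export of crawled URLs.
crawled = [
    "http://domainname.com/thread.jspa?threadID=10",
    "http://domainname.com/thread.jspa?threadID=10&tstart=0",
    "http://domainname.com/message.jspa?messageID=101",
    "http://domainname.com/message.jspa?messageID=102",
    "http://domainname.com/profile.jspa?userID=7",
]

pages, params = tally_pages(crawled)
print(pages.most_common())
print(params.most_common())
```

On a real crawl, the pages and parameters at the top of these two lists are the first candidates for a “Do Not Crawl” pattern.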
The solution was in the “Crawl URLs” section of the Google Mini. That section has a subsection called “Do Not Crawl URLs with the following Patterns:”.
From there, we added the following:
http://domainname.com/profile.jspa$
http://domainname.com/accountView.jspa$
http://domainname.com/message.jspa$
contains:tstart
These four lines added to the “Do Not Crawl” list (modified, of course, with the proper domain name) wiped out virtually all of the duplicate content while keeping all of the content that needed to be searchable. The result was a successful implementation of the Google Mini, which we later ported to numerous communities.
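If you want to reason about a pattern list like this before pasting it into the Mini, a rough Python sketch along these lines can help. It rests on two assumptions about how the patterns behave, which are my reading and not something verified against the appliance: a trailing “$” anchors the pattern to the end of the URL, and a “contains:” prefix means a plain substring match. The function name is_blocked and the sample URLs (including the threadID parameter) are made up for illustration.

```python
# The four patterns from the "Do Not Crawl" list above.
DO_NOT_CRAWL = [
    "http://domainname.com/profile.jspa$",
    "http://domainname.com/accountView.jspa$",
    "http://domainname.com/message.jspa$",
    "contains:tstart",
]


def is_blocked(url, patterns=DO_NOT_CRAWL):
    """Approximate the pattern check under the assumptions stated above."""
    for pattern in patterns:
        if pattern.startswith("contains:"):
            # Assumed meaning: substring match anywhere in the URL.
            if pattern[len("contains:"):] in url:
                return True
        elif pattern.endswith("$"):
            # Assumed meaning: the URL must end with the text before "$".
            if url.endswith(pattern[:-1]):
                return True
        else:
            # Other pattern forms are outside the scope of this sketch.
            raise ValueError("unsupported pattern form: " + pattern)
    return False


# Hypothetical URLs for a quick check.
print(is_blocked("http://domainname.com/profile.jspa"))
print(is_blocked("http://domainname.com/thread.jspa?threadID=10&tstart=0"))
print(is_blocked("http://domainname.com/thread.jspa?threadID=10"))
```

This is only a way to think through the list. The real test is still to re-run the crawl and check the page count in Crawl Diagnostics.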