Creating a search engine for a website using OpenSource systems is highly painful, even though it doesn't need to be.
I recently set myself the task of creating a search engine for an internal website for basically no money. Because the application was going to be built in ASP.Net, the code had to be in C#.
Searching around the net, I found a promising Apache project called "Lucene". Lucene looked very powerful and, most of all, was OpenSource. The only problem with Apache's Lucene is that it's all in Java (which I didn't want to use).
So, doing a search for "Lucene C#", I came across two projects, both of which are dead:
It took me a lot more searching to find "dotLucene", which is still OpenSource and still active.
To cut a long story short, dotLucene did not work for me out-of-the-box, and its programming interfaces appear to have inherited the pain of the Java version. Plus dotLucene could not index Word or PDF files out-of-the-box (it only did HTML).
Things I tried to get dotLucene to work:
Through much trial-and-error, I finally got something that appears to "Just Work".
Note: this source code uses sources taken from all over the place. I do not assert any intellectual property rights of my own over any of it.