Question
I am looking for any really free alternatives for implementing an intranet web-search engine.
I know that GSA would probably be the best, but it's extremely expensive, and I want to be able to crawl millions of pages.
I tried SearchBlox, and in addition to finding it poorly documented and counter-intuitive, it also has a limit of 25.000 documents, which is almost nothing compared to the scale I want to reach. Maybe if it were better I would have considered upgrading to a commercial license, but based on that experience, I wouldn't pay for it.
So now I am looking for other approaches.
Explanation / Answer
Solr can do this. With it you can define a data source, have it crawled, and let Solr interpret the data. Solr is gratis and open source.
Solr is built on an extremely fast search engine (Lucene) and can import a lot of data. It is optimized for "field-like" data such as XML, JSON or HTML, but there are converters for almost anything that contains text (Word, PDF, etc.).
If you just need a few intranet sites (<100.000 pages) to be indexed AND you can access the underlying database, you should just set up the Data Import Handler to do the work for you.
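As a rough sketch of what such a Data Import Handler setup can look like — the JDBC driver, database URL, table and column names below are assumptions for illustration, not anything from the question — a minimal `db-data-config.xml` might be:

```xml
<!-- db-data-config.xml: minimal DIH sketch.
     The JDBC driver, credentials, table and column names are assumed. -->
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/intranet"
              user="solr" password="secret"/>
  <document>
    <!-- one Solr document per row of the (assumed) pages table -->
    <entity name="page" query="SELECT id, url, title, content FROM pages">
      <field column="id"      name="id"/>
      <field column="url"     name="url"/>
      <field column="title"   name="title"/>
      <field column="content" name="content"/>
    </entity>
  </document>
</dataConfig>
```

The handler is registered in `solrconfig.xml` and then triggered over HTTP with a request like `/dataimport?command=full-import`.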
Otherwise you will need to write your own client that sends the data to the server.
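As a minimal sketch of such a client — the core name `intranet` and the field names are assumptions, and the mapping function is hypothetical, not part of any Solr API — one could map crawled pages to Solr's JSON update format and POST them:

```python
import json
from urllib.request import Request, urlopen

# Assumed Solr core name and endpoint; adjust to your installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/intranet/update?commit=true"

def to_solr_doc(page):
    """Map a crawled page (a plain dict) onto an assumed Solr schema."""
    return {
        "id": page["url"],               # use the URL as the unique key
        "url": page["url"],
        "title": page.get("title", ""),
        "content": page.get("text", ""),
        "filetype": page.get("filetype", "html"),
    }

def post_docs(pages):
    """Send a batch of documents to Solr's JSON update endpoint."""
    payload = json.dumps([to_solr_doc(p) for p in pages]).encode("utf-8")
    req = Request(SOLR_UPDATE_URL, data=payload,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:          # requires a running Solr server
        return resp.status
```

Batching many documents per request, as above, is much faster than posting them one at a time.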
Keep in mind that this is a development tool, not an end-user program. You will need to create some interface. (It is quite easy, in my experience.)
If you store the information in various fields (like meta, title, url, content, language, filetype...) you can search through these fields specifically, thus having the possibility to narrow your search down. If, for example, all your intranet sites have an author field and you can access and index it, you can search for all documents that are by this author while ignoring all that are merely about him. It also supports fuzzy search ("seach" finding "search").
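To illustrate, the fielded and fuzzy searches described above look like this in Lucene/Solr query syntax (the field names `author`, `content`, `title`, `language` and `filetype` are assumed to exist in the schema):

```
author:"Jane Doe" -content:"Jane Doe"    documents BY the author, excluding ones merely ABOUT her
title:seach~1                            fuzzy match within one edit, so it finds "search"
language:en AND filetype:pdf             narrowing down by combining several fields
```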
I used Solr in one project (and the underlying search engine Lucene in many) and was very impressed by it. The flexibility of the data processing pipeline is incredible. The searching part is so fast that I have it on my list to one day understand how it works :)
If all you need is a simple crawler and search interface, then the configuration overhead of Solr might not be what you want. But if you need a tool that chews through 30.000.000 documents, then this is the tool to use. In the project where I used it (with said amount of documents) we had more trouble with network latency than with the Solr search time. You can replicate the index and use a load-balancing Solr instance that distributes the search requests to the others. And so on — the amount of different optimizations in this tool is staggering. That of course comes with a bit of necessary configuration that might not be super intuitive.
As hinted above, Solr is a wrapper around Lucene, so if you already have a CMS doing the site creation for you, there might already be a Lucene plugin for it that you can tap into.