Natural Language Search Engine
As our data and content grow exponentially, so does our need to find particular items of information quickly and accurately. Earlier search mechanisms used simple, binary-style matching: if you searched for "Software integration", only an article containing that exact phrase would produce a search hit. As you can imagine, this limits your chances of finding the content you are looking for, because the two words may not be adjacent to each other, or the article may use a slightly different variant such as "Software integrator". Natural language searching overcomes these limitations, and many others, to produce far superior matching.
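The difference between the two approaches can be illustrated with a minimal Python sketch. It is not how Lucene or WebDirector is implemented: the function names and the deliberately crude suffix-stripping "stemmer" are assumptions made purely for illustration.

```python
import re

def exact_phrase_match(query: str, text: str) -> bool:
    """Old-style matching: the query must appear verbatim (ignoring case)."""
    return query.lower() in text.lower()

def crude_stem(word: str) -> str:
    """Toy stemmer (illustrative only): strip a few common English suffixes."""
    for suffix in ("ation", "ator", "ion", "ors", "or", "ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stems(text: str) -> set[str]:
    """Lower-case the text, split it into words, and reduce each word to its stem."""
    return {crude_stem(word) for word in re.findall(r"[a-z]+", text.lower())}

def stemmed_match(query: str, text: str) -> bool:
    """Match when every query stem occurs somewhere in the text, in any order."""
    return stems(query) <= stems(text)

article = "Our software team hired a new systems integrator last year."

print(exact_phrase_match("Software integration", article))  # False
print(stemmed_match("Software integration", article))       # True
```

The exact-phrase test misses the article because the words are neither adjacent nor in the same form, while the stemmed test reduces both "integration" and "integrator" to the same root and finds it.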
As you would expect of any enterprise application platform, WebDirector has an embedded natural language search engine. The engine we have chosen for our architecture is Lucene. You may not have heard of it, but the chances are you have used it: it is the natural language search engine that underpins Wikipedia, the well-known online collaborative encyclopedia. Lucene allows lightning-fast searches across enormous data sets and returns relevance-ranked results.
Some of Lucene's main search features are:
- Word stemming
- Proximity
- Results relevance
- Phrase searching
- Wildcard querying
- Simultaneous updating and searching
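Several of these features are exposed through the syntax of Lucene's classic query parser: double quotes mark a phrase, a trailing ~N on a phrase requests a proximity match within N word positions, and * is a multi-character wildcard. The sketch below only builds such query strings. The helper names are hypothetical, the inputs are assumed to contain no characters that need escaping, and how WebDirector itself passes queries to Lucene is not shown here. Word stemming and relevance ranking are not part of the query syntax; in Lucene they come from the configured analyzer and from scoring.

```python
def phrase(text: str) -> str:
    """Phrase search: the words must appear together, in this order."""
    return f'"{text}"'

def proximity(text: str, max_distance: int) -> str:
    """Proximity search: the words must occur within max_distance positions."""
    return f'"{text}"~{max_distance}'

def prefix_wildcard(prefix: str) -> str:
    """Wildcard search: match any term that starts with the given prefix."""
    return f"{prefix}*"

def all_of(*clauses: str) -> str:
    """Require every clause to match by joining them with AND."""
    return " AND ".join(clauses)

print(phrase("software integration"))        # "software integration"
print(proximity("software integration", 5))  # "software integration"~5
print(prefix_wildcard("integrat"))           # integrat*
print(all_of("software", prefix_wildcard("integrat")))  # software AND integrat*
```

A query such as integrat* would match both "integration" and "integrator", and the proximity form would match an article in which "software" and "integration" are separated by a few other words.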
To take further advantage of this functionality, we have built text extraction into the core WebDirector libraries for all well-known file types (all Microsoft file formats, PDF, CSV, XML, plain text, etc.). The extracted content is then fed into the Lucene search engine for indexing. The indexing process is incredibly quick: 95GB/hour. Once indexed, this content can be searched and retrieved in search results just like any other textual content. For any form of document management this is an incredibly powerful built-in feature that you essentially get for free: indexing and re-indexing occur behind the scenes every time an asset in WebDirector is inserted or updated. You don't have to do anything special; we simply integrate with the natural language engine as you perform your normal work.
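The "index on every insert or update" behaviour can be pictured with a small Python sketch. This is a toy model under stated assumptions, not WebDirector's code: SearchIndex, AssetStore and their methods are hypothetical names, the index is a simple in-memory inverted index rather than Lucene, and the text is assumed to have been extracted from the file already.

```python
import re
from collections import defaultdict

class SearchIndex:
    """Toy inverted index (illustrative only): term -> ids of assets containing it."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._terms_by_asset = {}

    def index(self, asset_id, text):
        # Re-indexing replaces whatever was stored for this asset before.
        self.remove(asset_id)
        terms = set(re.findall(r"[a-z0-9]+", text.lower()))
        for term in terms:
            self._postings[term].add(asset_id)
        self._terms_by_asset[asset_id] = terms

    def remove(self, asset_id):
        for term in self._terms_by_asset.pop(asset_id, set()):
            self._postings[term].discard(asset_id)

    def search(self, term):
        # Return a copy so callers cannot modify the index by accident.
        return set(self._postings.get(term.lower(), set()))

class AssetStore:
    """Hypothetical asset store that keeps the index current on every save."""

    def __init__(self, index):
        self._assets = {}
        self._index = index

    def save(self, asset_id, text):
        """Insert or update an asset; indexing happens as a side effect."""
        self._assets[asset_id] = text
        self._index.index(asset_id, text)

index = SearchIndex()
store = AssetStore(index)
store.save("doc-1", "Quarterly software integration report")
store.save("doc-1", "Quarterly hardware audit report")  # update triggers re-index

print(index.search("software"))  # set()
print(index.search("hardware"))  # {'doc-1'}
```

The caller only ever calls save; the stale terms from the first version of the asset disappear from the index and the new ones become searchable without any separate indexing step.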