Full text search and text replace are killer new features of WebRadar 5. In previous posts, we discussed what these features are and what they can do to solve everyday problems. I thought we might take a dive under the hood to see what makes this functionality tick.
I’ll discuss the “search” and “replace” features separately, as they present separate technical challenges.
Full Text Search
Earlier versions of WebRadar only indexed the structured metadata of a content item—name, title, authors, expiry dates, etc. This design is well-suited to a traditional relational database. However, when it comes to searching within freeform textual content, this approach falls down. We can’t simply store the full text in database tables and run “where content contains x” queries: that would be far too slow and wouldn’t deliver the results we need.
Full text searching needs a search engine. In our evaluation, we chose to use ElasticSearch—I’ve heard it described as “Google for the JVM”.
Now, when the WebRadar data extractor runs, it stores metadata in the SQL database as normal, but when it comes to any HTML, Rich Text or certain other text fields, the data is stored in ElasticSearch.
We feed the content along with the content ID to ElasticSearch, which tokenizes it into words and phrases using an HTML analyzer and builds an “inverted index”. This index is organized by these tokens and lists the documents where a token can be found. By default, it indexes whole words and phrases, but it can be configured to index partial words (requiring extra system resources).
When the user searches for some text, ElasticSearch tokenizes the search query into a set of words and phrases to search for in the inverted index. We also store some context of where the word or phrase was found in a document, which we display in the search results.
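The indexing-and-query flow described above can be sketched with a toy inverted index. This is a minimal, stdlib-only illustration of the idea, not ElasticSearch's actual implementation; the "analyzer" here is a naive lowercase/split tokenizer, whereas a real HTML analyzer also strips markup and records token positions:

```java
import java.util.*;

class InvertedIndex {
    // token -> set of document IDs containing that token
    private final Map<String, Set<String>> index = new HashMap<>();

    // A naive analyzer: lowercase, split on non-letters.
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Index a document: record each token against the document ID.
    void add(String docId, String content) {
        for (String token : tokenize(content)) {
            index.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Query: tokenize the search text, return documents containing ALL tokens.
    Set<String> search(String query) {
        Set<String> result = null;
        for (String token : tokenize(query)) {
            Set<String> docs = index.getOrDefault(token, Set.of());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Set.of() : result;
    }
}
```

A real engine also stores the position of each token within the document, which is what makes phrase matching and the contextual result snippets mentioned above possible.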
ElasticSearch is also a clustered and sharded database, so it scales well. If you have a clustered WebSphere Portal environment, each cluster node’s instance of WebRadar will have its own ElasticSearch node. Doing this allows queries to be served from the local instance, spreading the load over the cluster.
WebRadar’s data extractor currently runs on one WebSphere cluster node designated as “primary,” which needs to be configured on installation. The extractor feeds data into its ElasticSearch node, which then replicates automatically to the secondary nodes.
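Sharding and replication of this kind are controlled by per-index settings in ElasticSearch. As a hypothetical illustration (the index name and the counts here are assumptions for the example, not WebRadar's actual configuration), an index spread over three shards with two replica copies of each shard could be created via the REST API like so:

```json
PUT /wcm_content
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}
```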
WebRadar combines the results it obtains from SQL and ElasticSearch into a single set of search results, so the experience is seamless to the user.
Full Text Replace
Once you’ve found the content items you’re searching for, it’s time to do a search and replace on the text content of these content items.
In some ways, this is more challenging—we’re actually changing content, so it has to be solid and dependable. This feature needs to be aware of HTML, WCM’s special component tags and the different Unicode characters we might encounter. We have chosen specific techniques to address each of these aspects.
HTML can’t be handled correctly with regexes alone; we need a proper HTML parser. HTML content is our business—all of our products focus on it. As such, we’ve learned a lot of hard lessons about HTML parsing over the years—it’s a difficult beast. Different people, tools, and browsers have very different ideas of what constitutes HTML. Writing an HTML parser that can deal with all of these variations is quite challenging. After some investigation, we chose the excellent JSoup library.
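A parsed DOM lets you confine replacements to text nodes, leaving tags and attributes untouched. As a stdlib-only toy illustration of that principle (this is not how JSoup works internally, and real HTML—comments, scripts, attributes containing `>`—genuinely needs a full parser), here is a sketch that applies a replacement only to text outside of tags:

```java
class TagAwareReplace {
    // Replace `from` with `to` only in text OUTSIDE of tags.
    // Toy sketch to show the principle; real HTML needs a genuine
    // parser such as JSoup.
    static String replaceOutsideTags(String html, String from, String to) {
        StringBuilder out = new StringBuilder();
        StringBuilder text = new StringBuilder(); // pending text-node content
        boolean inTag = false;
        for (char c : html.toCharArray()) {
            if (!inTag && c == '<') {
                // Flush accumulated text, applying the replacement to it only.
                out.append(text.toString().replace(from, to));
                text.setLength(0);
                inTag = true;
                out.append(c);
            } else if (inTag) {
                out.append(c);
                if (c == '>') inTag = false;
            } else {
                text.append(c);
            }
        }
        out.append(text.toString().replace(from, to));
        return out.toString();
    }
}
```

Note what a blind string replacement would do to `<a href="cat.html">cat</a>` when replacing "cat": it would corrupt the link target as well as the visible text.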
WCM Content Parsing
WCM has “component tags” that allow the content author to embed content which will be expanded when published. Component tags are a good way to reuse assets which are defined in a central place. These tags are similar to HTML, but use square brackets, e.g. [StyleElement source="template"]. We aren’t aware of any existing parsers for this format, so we had to write our own.
We believe that parser combinators represent the best approach to constructing parsers. They allow you to write parsers in a modular fashion that reads like a grammar definition. We’re using Scala’s built-in parser combinators to construct a parser for WCM content tags.
A further challenge in these parsers is that people occasionally use square bracket characters in normal text – we need to distinguish between normal text usage and WCM content tags. Fortunately, the parser combinator approach enables us to write parsers that can handle this situation.
In parsing these WCM content tags, our primary focus is that we preserve them—these are special parts of your content that shouldn’t be touched. We believe we have achieved this.
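To make the preservation idea concrete, here is a hand-rolled sketch of the same behaviour (not WebRadar's actual Scala-combinator parser): the content is split into text segments and component-tag segments, the replacement is applied to text only, and tags are copied through verbatim. The "a tag starts with `[` followed by a letter" heuristic is a simplification for this example; stray brackets that don't fit it are treated as ordinary text:

```java
class WcmReplace {
    // Replace `from` with `to` in content text while preserving WCM
    // component tags such as [StyleElement source="template"] verbatim.
    static String replacePreservingTags(String content, String from, String to) {
        StringBuilder out = new StringBuilder();
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < content.length()) {
            char c = content.charAt(i);
            if (c == '[') {
                int close = content.indexOf(']', i);
                int nextOpen = content.indexOf('[', i + 1);
                // A component tag: '[' followed by a letter, closed by ']'
                // with no intervening '['; otherwise it's ordinary text.
                boolean isTag = close != -1
                        && (nextOpen == -1 || nextOpen > close)
                        && i + 1 < content.length()
                        && Character.isLetter(content.charAt(i + 1));
                if (isTag) {
                    out.append(text.toString().replace(from, to)); // flush text
                    text.setLength(0);
                    out.append(content, i, close + 1);             // tag verbatim
                    i = close + 1;
                    continue;
                }
            }
            text.append(c);
            i++;
        }
        out.append(text.toString().replace(from, to));
        return out.toString();
    }
}
```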
At this point, we can parse HTML and WCM content tags, and now we want to go and replace text. Here we have another set of challenges.
Unicode and languages
Our search-and-replace has several settings:
- Respect word boundaries
- Case sensitive
- Retain first-letter capitalization
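As an illustration of that last option, here is a simplified sketch of a case-insensitive replacement that copies the matched text's first-letter capitalization onto the replacement. This is a hypothetical helper, not WebRadar's implementation, and it is deliberately locale-naive; the discussion that follows explains why real code can't be this simple:

```java
class CapitalizationReplace {
    // Replace occurrences of `from` (matched case-insensitively) with `to`,
    // copying the matched text's first-letter capitalization.
    // Locale-naive sketch; real code must use locale-aware case rules.
    static String replaceRetainingCase(String input, String from, String to) {
        StringBuilder out = new StringBuilder();
        String lowerInput = input.toLowerCase();
        String lowerFrom = from.toLowerCase();
        int i = 0;
        while (true) {
            int hit = lowerInput.indexOf(lowerFrom, i);
            if (hit == -1) break;
            out.append(input, i, hit);
            String replacement = to;
            if (Character.isUpperCase(input.charAt(hit))) {
                replacement = Character.toUpperCase(to.charAt(0)) + to.substring(1);
            }
            out.append(replacement);
            i = hit + from.length();
        }
        out.append(input.substring(i));
        return out.toString();
    }
}
```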
These concepts sound simple enough, but in reality, they are not—especially when you consider the full range of languages supported by WCM. The idea of “case” is different between languages – some don’t use case, some do; some languages use the same characters, but capitalize them differently.
The idea of what constitutes a word differs between languages. Do we delimit on hyphens—is “take-away” one word or two? How about “co-operate” or “e-ink”? Do we delimit on apostrophes—is “l’orange” one word or two? What about “Stephen’s”?
Even full-stops are challenging—is “3.4” one word or two? What about “textbox.io”? Some languages don’t even have word boundary characters—you have to know all of the words to understand which sequence of characters constitutes a word.
In all honesty, this is utterly mind-bending work. Quite simply—the world’s languages are far too complicated and nuanced to fit perfectly into a set of rules. We based our work on the Unicode Text Segmentation annex to the Unicode standard, which deals with grapheme, word, and sentence boundaries. I love this quote from the standard:
“It is not possible to provide a uniform set of rules that resolves all issues across languages or that handles all ambiguous situations within a given language.”
Wow. Just … wow.
So, while everyone might have a simple intuition about what a “word” is, when you try to codify this in an application, it is indeed challenging.
So, what did we do? We adopted two simplifications:
- Hyphens and apostrophes are parts of the word and do not represent word boundaries.
- Full stops are only word boundaries if followed by a space, i.e. the end of a sentence.
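Those two simplifications can be expressed as a small boundary predicate. This is a sketch under the stated assumptions, not WebRadar's actual code:

```java
class WordBoundaries {
    // Is the character at `pos` a word boundary under our simplifications?
    // - Hyphens and apostrophes are part of the word.
    // - A full stop is a boundary only when followed by whitespace
    //   (or at the end of the text), i.e. the end of a sentence.
    static boolean isBoundary(String text, int pos) {
        char c = text.charAt(pos);
        if (Character.isLetterOrDigit(c)) return false;
        if (c == '-' || c == '\'') return false;   // word-internal
        if (c == '.') {
            return pos + 1 >= text.length()
                    || Character.isWhitespace(text.charAt(pos + 1));
        }
        return true;                               // spaces, commas, ...
    }
}
```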
Fortunately, a lot of the recommendations from the Unicode Text Segmentation annex are encoded in the open source ICU library. We use this library to help with capitalization and word boundary detection, to make our search-and-replace better in the presence of the world’s languages.
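For a flavour of what locale-aware segmentation and casing look like, the JDK's own java.text.BreakIterator (whose word-break rules also follow the Unicode Text Segmentation annex; the standalone ICU library provides more current and configurable versions of the same machinery) can find word boundaries, and locale-sensitive casing handles traps like the Turkish dotted/dotless i:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SegmentationDemo {
    // Extract the words from `text` using Unicode word-boundary rules.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> words = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // BreakIterator also yields spaces and punctuation as segments;
            // keep only segments containing letters or digits.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                words.add(segment);
            }
        }
        return words;
    }

    // Casing is locale-dependent: uppercasing "i" in Turkish yields İ (U+0130).
    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }
}
```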
We still have a few improvements to make in this area—particularly in order to properly handle the nuances of Asian and Turkic languages. This task is on our todo list for future releases.
We think that clustered databases are the ideal technology to achieve scale in an analytics tool, and it’s certainly an area we’re interested in exploring further in later releases. We know some of our customers are using WebRadar with large data sets—in the hundreds of thousands of content items—so designing for scale and performance is important to us.
Text replacement is certainly a challenging task, but it’s one we take seriously. We use the best techniques we can to ensure we treat your content with care while fixing your content problems.
To learn more about WebRadar 5 Preview, start with our post introducing WebRadar’s Multi-Edit superpowers. If you’re a content creator, manager, or developer looking for use cases, you’ll enjoy it. If you’re an IBM Web Content Manager customer ready to install or update, start with our documentation.