Demystifying Oak Search Part 1: Exploring Indexes with Luke
Published by Dan Klco
The Oak Indexes and Queries used by AEM and Apache Sling for searching content are powerful and cloaked in an aura of mystery. Let's work together to pierce that veil, so you can develop AEM / Apache Sling applications that take advantage of the strengths of searching in Oak without experiencing the painful pitfalls.
Let's start with Oak Indexes, the feature that makes Oak queries performant.
To understand something, we must observe it. AEM has some features that give us a glimpse into the indexes, but they provide a limited, abstracted view. We need to dive deeper. To do so, we can use a tool called Luke to open and explore the Lucene indexes used by AEM and Apache Sling.
An aside on Lucene vs Other Index Types
This series focuses on Lucene indexes for two reasons: AEM as a Cloud Service only supports specifying Lucene indexes, and Lucene indexes have a much richer and more robust feature set than property indexes. You can read more on property indexes in the Oak documentation. Elastic indexes are implemented with a different tech stack than Lucene indexes, but are (as of now) functionally identical to them.
There is documentation on using Luke on the Oak site. However, Oak uses a relatively old version of Lucene and a custom Lucene codec, so getting the right version of Luke and setting up the classloader correctly to read an Oak index is not a trivial task. To make it easier, I made a bash script. Simply download the Gist and run:
/bin/bash oak-luke.sh [path-to-oak-index]
The indexes themselves are located under the ./crx-quickstart/repository/index folder in AEM and the ./launcher/repository/index folder in Apache Sling. For example:
/bin/bash run-luke.sh ./launcher/repository/index/slingFile-1648209254334
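Since the index directory names carry a numeric suffix, you will usually need to look them up before you can pass one to the script. Here is a minimal sketch for doing that. The function name list_oak_indexes is hypothetical (it is not part of the Gist), and the default folder assumes a local AEM instance started from the current directory:

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the Gist): print every index directory
# found under a repository index folder, one path per line.
list_oak_indexes() {
  # Assumption: default to the AEM location; pass
  # ./launcher/repository/index for Apache Sling instead.
  local index_root="${1:-./crx-quickstart/repository/index}"

  if [ ! -d "$index_root" ]; then
    echo "No index folder found at: $index_root" >&2
    return 1
  fi

  local dir
  for dir in "$index_root"/*/; do
    # Skip the literal glob pattern when the folder has no subdirectories.
    [ -d "$dir" ] || continue
    # Strip the trailing slash so the path matches the example above.
    printf '%s\n' "${dir%/}"
  done
}

# Example usage:
#   list_oak_indexes ./launcher/repository/index
```

Any of the printed paths can then be passed to the script, as in the example command above.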
This will open up the Luke UI:
Let's break down the relevant parts of the Luke UI:
First, the Overview screen allows you to explore the index summary, fields, and terms contained in the index.
- Summary - a quick summary of the index including the number of documents and fields
- Fields - all of the fields defined in the index along with the number of distinct terms for each field
- Decoder - set the decoder to display the terms. Note that you may need to change the decoder to see int or long terms, including Dates (which will be longs) and booleans (which will be ints)
- Top Terms - displays the top Terms for the field in the index. Note that the terms may display incorrectly if you have not set the correct decoder for the field
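When the decoder shows a date term as a raw long, it can help to convert it back into something readable. The sketch below assumes the long is milliseconds since the Unix epoch in UTC, which is my assumption rather than something stated above, so verify it against a node whose date you already know. The function name millis_to_iso is hypothetical:

```shell
#!/usr/bin/env bash

# Hypothetical helper: convert a long term into an ISO-8601 UTC timestamp.
# Assumption: the value is milliseconds since the Unix epoch.
millis_to_iso() {
  local millis="$1"
  # Integer division drops the millisecond part.
  local seconds=$(( millis / 1000 ))

  # GNU date accepts -d "@seconds"; BSD/macOS date accepts -r seconds.
  date -u -d "@${seconds}" '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null ||
    date -u -r "${seconds}" '+%Y-%m-%dT%H:%M:%SZ'
}

# Example usage:
#   millis_to_iso 1648209254334   # prints 2022-03-25T11:54:14Z
```

If the converted value does not match the date you expect for a known node, the epoch-milliseconds assumption does not hold for that field.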
There are several other tabs in Luke, though honestly I've found the Overview to be the most helpful.
As a quick overview, the Documents tab allows you to page through the documents in the index. However, since Oak essentially only stores the path to the node on the index documents, this is not very illuminating. Similarly, since Oak handles index updates and queries, the Commit and Search tabs are also less helpful for most use cases.
What Luke Reveals
Lucene indexes are composed of:
- Documents - each maps to a node in Oak
- Fields - each maps to one or more properties in Oak
- Terms - the values extracted for the fields from the properties in Oak
How do these fields make their way into the Lucene indexes? Next we'll explore Oak Index Definitions and how they are translated into Lucene.