Sometimes one ends up wondering why a Solr search returns the documents that it does, or why the items come back in a confusing order. For a developer, the problem is usually not understanding it but being able to explain it to the editors, who often do not have the technical background that we do.
I will try to go through the basics here. With that said, I am a .NET developer and my expertise is not in Apache or Java, but I don’t think it needs to be. I may also have misunderstood something. If so, please enlighten me in the comments below.
There are two types of boosting: index-time and query-time. The two major differences between them are:
- index-time boosting allows you to boost entire documents
- index-time boosting will have an effect on all your queries
Query-time boosting will allow you to boost specific fields when the query is constructed.
Boosting is relative, so a boost value only matters in comparison to the other boost values. E.g. if all fields that are queried have a boost of 10, it will have the same effect as not boosting anything. The effect comes into play if one field has a boost of 0.5 and another a boost of 1. The second field will then be twice as important as the first one.
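To make the "relative" part concrete, here is a small toy model in Python. It is not how Solr computes scores internally. It only assumes that a document's score is the sum of each field's raw match score multiplied by that field's boost. The documents, raw scores and the `rank` helper are all made up for illustration.

```python
def rank(docs, boosts):
    """Return document ids ordered by boosted score, best first."""
    def score(doc):
        # Toy model: sum of (field boost * raw match score) over all fields.
        return sum(boosts[field] * raw for field, raw in doc["field_scores"].items())
    return [doc["id"] for doc in sorted(docs, key=score, reverse=True)]

# Hypothetical raw match scores per field for two documents.
docs = [
    {"id": "a", "field_scores": {"title": 1.0, "text": 0.2}},
    {"id": "b", "field_scores": {"title": 0.0, "text": 1.5}},
]

no_boost = rank(docs, {"title": 1, "text": 1})
all_ten = rank(docs, {"title": 10, "text": 10})        # same boost everywhere
title_double = rank(docs, {"title": 1, "text": 0.5})   # title counts twice as much

print(no_boost, all_ten, title_double)
```

Boosting every field by 10 leaves the order untouched, while making the title twice as important as the text moves document "a" to the top.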
For example, when articles are searched, a hit in the title is usually more important than a hit in the text of the article. If this behavior is desired, the boost of the title field should be higher than the boost of the text field.
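As a rough sketch of what this could look like at query time, the Python snippet below builds request parameters for Solr's eDisMax query parser, where the `qf` parameter lists the fields to search and `^` sets the boost per field. The field names `title` and `text` and the helper `build_search_params` are assumptions for the example. The real field names in a Sitecore index will most likely look different.

```python
from urllib.parse import urlencode

def build_search_params(user_query, field_boosts):
    """Build eDisMax parameters with a per-field query-time boost."""
    # qf is a space-separated list of fields, each followed by ^boost.
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return {"q": user_query, "defType": "edismax", "qf": qf}

# Hypothetical field names: a title hit counts twice as much as a text hit.
params = build_search_params("solr boosting", {"title": 2, "text": 1})
query_string = urlencode(params)

print(params["qf"])
print(query_string)
```

Keeping the boosts in a dictionary makes it easy to read them from configuration, which fits well if the editors should be able to tweak the values.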
The main scoring functions are:
- Term Frequency (TF)
The frequency of a search term in a document. The higher the frequency, the higher the score of the hit. In plain English: if a search term is found many times in a document, that document scores higher.
- Inverse Document Frequency (IDF)
The rarity of a search term across all documents. The lower the frequency, the higher the score of the hit. If a search term occurs in only a few places in the index, a hit on it scores higher. This avoids high-ranking hits on very common words.
- Coordination Factor (CF)
The number of search terms that are present in a document. The more of the search terms a hit contains, the higher its score. If many search terms are used, documents that contain more of them will score higher.
- Fieldnorm (FN)
The length of a field value. The longer the field value containing the hit, the lower the score. If a search term is found in a very short field value, the hit scores higher than if the text is long.
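The four factors above can be sketched in a few lines of Python. The formulas roughly follow the classic TF-IDF similarity used by older Lucene/Solr versions, and I have left out boosts and query normalization to keep it short. Newer Solr versions use a different default (BM25), so treat this as an illustration of the ideas and not as the exact calculation. All function names and sample numbers are my own.

```python
import math

def tf(freq):
    # Term Frequency: more occurrences in the document give a higher value.
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    # Inverse Document Frequency: terms found in fewer documents score higher.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def coord(matched_terms, query_terms):
    # Coordination Factor: share of the query terms found in the document.
    return matched_terms / query_terms

def field_norm(num_terms_in_field):
    # Fieldnorm: shorter field values give a higher value.
    return 1.0 / math.sqrt(num_terms_in_field)

def score(query_terms, doc_term_freqs, field_length, num_docs, doc_freqs):
    """Simplified score of one document for a multi-term query."""
    matched = [t for t in query_terms if doc_term_freqs.get(t, 0) > 0]
    total = sum(
        tf(doc_term_freqs[t]) * idf(num_docs, doc_freqs[t]) ** 2 * field_norm(field_length)
        for t in matched
    )
    return coord(len(matched), len(query_terms)) * total

# Hypothetical index statistics: 1000 documents, "solr" is rarer than "search".
doc_freqs = {"solr": 10, "search": 400}
query = ["solr", "search"]

both_terms = score(query, {"solr": 1, "search": 1}, 50, 1000, doc_freqs)
one_term = score(query, {"search": 1}, 50, 1000, doc_freqs)
print(both_terms > one_term)
```

Playing with the numbers shows the behavior described in the list: a document matching both terms beats one matching only the common term, and the same hit in a shorter field scores higher.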
When playing around with boosting and scoring, it is a good idea to give yourself and the editors the possibility to see the score of the documents, especially if the boosting values can be set from Sitecore. In a test environment you can simply show the score next to each hit in the search results. The editor can then easily tweak the query-time boosting values and see how the score changes.
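If you query Solr directly, the score can be requested by adding `score` to the `fl` parameter, and `debugQuery=true` adds an explanation of how each score was calculated. The Python sketch below shows the idea. The endpoint URL, the core name `my_core`, the `id` field and the sample response are placeholders I made up, so adjust them to your own setup.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; replace host and core name with your own.
SOLR_SELECT_URL = "http://localhost:8983/solr/my_core/select"

def build_debug_url(user_query):
    """Build a select URL that returns the score for every hit."""
    params = {
        "q": user_query,
        "fl": "*,score",       # all stored fields plus the calculated score
        "debugQuery": "true",  # include a score explanation per document
        "wt": "json",
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)

def scores_by_id(response_json):
    """Map document id to score from a parsed Solr JSON response."""
    return {doc["id"]: doc["score"] for doc in response_json["response"]["docs"]}

# Made-up response in the shape Solr returns for wt=json.
sample_response = {
    "response": {"docs": [{"id": "item-1", "score": 2.4}, {"id": "item-2", "score": 0.9}]}
}

debug_url = build_debug_url("term1 term2")
scores = scores_by_id(sample_response)
print(debug_url)
print(scores)
```

The resulting dictionary is enough to print the score next to each hit on a test page.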
So all this seems fine and logical. There might, however, be a problem if one index is used for many things and contains computed fields. This is where IDF comes into play.
Let’s say that there is a computed field called description. The field is filled with values from different Sitecore fields depending on the template that is being indexed. This is a common construct to achieve searchability over many different item types in the same field.
In a specific search, the query is filtered down to two item types (based on two different templates, template1 and template2), while the index contains 10 different item types overall. All of them have been indexed with the computed description field, but the field contains information from different Sitecore fields depending on the template.
The query contains several search terms, let’s say it is “term1 term2”. term1 is found in a document of template1 and term2 is found in a document of template2. If term1 is more frequent in the entire index, Solr considers the hit on term1 in template1 less important than the hit on term2 in template2. This is because the IDF is calculated relative to the entire index, not relative to the search results.
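A small calculation makes the effect visible. The Python sketch below uses the classic IDF formula from older Lucene/Solr versions, and all document counts are invented for the example. Within the two filtered templates, term1 and term2 are equally common, but in the whole index term1 occurs far more often.

```python
import math

def idf(num_docs, doc_freq):
    # Classic IDF: the fewer documents contain the term, the higher the value.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical counts for the whole index (all 10 item types).
index_size = 10_000
index_doc_freq = {"term1": 2_000, "term2": 50}

# Hypothetical counts for the filtered results (template1 and template2 only).
subset_size = 1_000
subset_doc_freq = {"term1": 20, "term2": 20}

# What Solr uses: statistics from the entire index.
index_idf = {t: idf(index_size, df) for t, df in index_doc_freq.items()}

# What one might intuitively expect: statistics from the filtered results.
subset_idf = {t: idf(subset_size, df) for t, df in subset_doc_freq.items()}

print(index_idf)
print(subset_idf)
```

Looking only at the filtered results, both terms would weigh the same. With the index-wide statistics, term2 weighs clearly more than term1, even though the other eight item types never show up in this search.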