---
date: 2021-09-13
authors: [squidfunk]
readtime: 15
description: >
  How we rebuilt client-side search, delivering a better user experience while
  making it faster and smaller at the same time
categories:
  - Search
  - Performance
links:
  - plugins/search.md
  - insiders/index.md#how-to-become-a-sponsor
---

# Search: better, faster, smaller

__This is the story of how we managed to completely rebuild client-side search,
delivering a significantly better user experience while making it faster and
smaller at the same time.__

The [search] of Material for MkDocs is by far one of its best and most-loved
assets: [multilingual], [offline-capable], and most importantly: _all
client-side_. It lets the users of your documentation find what they're
searching for instantly, without the headache of managing additional servers.
However, even after several iterations, there was still room for improvement,
which is why we rebuilt the search plugin and integration from the ground up.
This article shines some light on the internals of the new search, why it's
much more powerful than the previous version, and what's about to come.

<!-- more -->

_The next section discusses the architecture and issues of the current search
implementation. If you immediately want to learn what's new, skip to the
[section just after that][what's new]._

  [search]: ../../setup/setting-up-site-search.md
  [multilingual]: ../../setup/setting-up-site-search.md#lang
  [offline-capable]: ../../setup/building-for-offline-usage.md
  [what's new]: #whats-new

## Architecture

Material for MkDocs uses [lunr] together with [lunr-languages] to implement
its client-side search capabilities. When a documentation page is loaded and
JavaScript is available, the search index as generated by the
[built-in search plugin] during the build process is requested from the
server:

``` ts
// Only set up search if the page contains a search form
const index$ = document.forms.namedItem("search")
  // Prefer an index that is already available, otherwise request it
  ? __search?.index || requestJSON<SearchIndex>(
      new URL("search/search_index.json", config.base)
    )
  : NEVER
```

  [lunr]: https://lunrjs.com
  [lunr-languages]: https://github.com/MihaiValentin/lunr-languages
  [built-in search plugin]: ../../plugins/search.md

### Search index

The search index includes a stripped-down version of all pages. Let's take a
look at an example to understand precisely what the search index contains from
the original Markdown file:

??? example "Expand to inspect example"

    === ":octicons-file-code-16: `docs/page.md`"

        ```` markdown
        # Example

        ## Text

        It's very easy to make some words **bold** and other words *italic*
        with Markdown. You can even add [links](#), or even `code`:

        ```
        if (isAwesome) {
          return true
        }
        ```

        ## Lists

        Sometimes you want numbered lists:

        1. One
        2. Two
        3. Three

        Sometimes you want bullet points:

        * Start a line with a star
        * Profit!
        ````

    === ":octicons-codescan-16: `search_index.json`"

        ``` json
        {
          "config": {
            "indexing": "full",
            "lang": [
              "en"
            ],
            "min_search_length": 3,
            "prebuild_index": false,
            "separator": "[\\s\\-]+"
          },
          "docs": [
            {
              "location": "page/",
              "title": "Example",
              "text": "Example Text It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true } Lists Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
            },
            {
              "location": "page/#example",
              "title": "Example",
              "text": ""
            },
            {
              "location": "page/#text",
              "title": "Text",
              "text": "It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true }"
            },
            {
              "location": "page/#lists",
              "title": "Lists",
              "text": "Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
            }
          ]
        }
        ```

If we inspect the search index, we immediately see several problems:

1.  __All content is included twice__: the search index contains one entry
    with the entire contents of the page, and one entry for each section of
    the page, i.e., each block preceded by a headline or subheadline. This
    significantly contributes to the size of the search index.

2.  __All structure is lost__: when the search index is built, all structural
    information like HTML tags and attributes is stripped from the content.
    While this approach works well for paragraphs and inline formatting, it
    might be problematic for lists and code blocks. An excerpt:

    ```
    … links , or even code : if (isAwesome) { … } Lists Sometimes you want …
    ```

    - __Context__: for an untrained eye, the result can look like gibberish, as
      it's not immediately apparent what classifies as text and what as code.
      Furthermore, it's not clear that `Lists` is a headline as it's merged
      with the code block before and the paragraph after it.

    - __Punctuation__: inline elements like links that are immediately followed
      by punctuation are separated by whitespace (see `,` and `:` in the
      excerpt). This is because all extracted text is joined with a whitespace
      character during the construction of the search index.

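The punctuation issue is easy to reproduce. The following is a minimal sketch,
not the actual extraction code of the search plugin: the `fragments` array is
an assumed stand-in for the pieces of text extracted from the example
paragraph, which are then joined with a single whitespace character as
described above.

```ts
// Assumed stand-in for the extracted text: plain text, the link label, and
// the inline code end up as separate fragments
const fragments: string[] = [
  "You can even add",
  "links",     // text of the link
  ", or even",
  "code",      // text of the inline code
  ":"
]

// Joining all fragments with whitespace detaches the punctuation from the
// inline element that precedes it
const joined: string = fragments.join(" ")

console.log(joined)
```

The result contains `links ,` and `code :`, just like the excerpt from the
search index.
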
It's not difficult to see that it can be quite challenging for theme authors to
implement a good search experience, which is why Material for MkDocs (up to
now) did some [monkey patching] to be able to render slightly more
meaningful search previews.

  [monkey patching]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/document/index.ts#L68-L71

### Search worker

The actual search functionality is implemented as part of a web worker[^1],
which creates and manages the [lunr] search index. When search is initialized,
the following steps are taken:

  [^1]:
    Prior to <!-- md:version 5.0.0 -->, search was carried out in the main
    thread, which locked up the browser, rendering it unusable. This problem was
    first reported in #904 and, after some back and forth, fixed and released in
    <!-- md:version 5.0.0 -->.

1.  __Linking sections with pages__: The search index is parsed, and each
    section is linked to its parent page. The parent page itself is _not
    indexed_, as it would lead to duplicate results, so only the sections
    remain. Linking is necessary, as search results are grouped by page.

2.  __Tokenization__: The `title` and `text` values of each section are split
    into tokens by using the [`separator`][separator] as configured in
    `mkdocs.yml`. Tokenization itself is carried out by
    [lunr's default tokenizer][default tokenizer], which doesn't allow for
    lookahead or separators spanning multiple characters.

    > Why is this important and a big deal? We will see later how much more we
    > can achieve with a tokenizer that is capable of separating strings with
    > lookahead.

3.  __Indexing__: As a final step, each section is indexed. When querying the
    index, if a search query includes one of the tokens as returned by step 2,
    the section is considered to be part of the search result and passed to the
    main thread.

Now, that's basically how the search worker operates. Sure, there's a little
more magic involved, e.g., search results are [post-processed] and [rescored] to
account for some shortcomings of [lunr], but in general, this is how data gets
into and out of the index.

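The first two steps above can be sketched as follows. This is an illustration
under stated assumptions, not the code of the search worker: `SectionDocument`,
`linkSections`, and `tokenize` are hypothetical names, a page entry is assumed
to precede the entries of its sections, and `String.prototype.split` is used as
a simplification of lunr's tokenizer, which walks the string character by
character.

```ts
// Shape of an entry in `docs`, as shown in the search index example
interface SearchDocument {
  location: string
  title: string
  text: string
}

// Hypothetical type: a section with a reference to its parent page
interface SectionDocument extends SearchDocument {
  parent: SearchDocument
}

// Step 1: link each section to its parent page and drop the page entries
function linkSections(docs: SearchDocument[]): SectionDocument[] {
  const pages = new Map<string, SearchDocument>()
  const sections: SectionDocument[] = []
  for (const doc of docs) {
    const index = doc.location.indexOf("#")
    if (index === -1) {
      // No anchor: this is a page, which is remembered but not indexed
      pages.set(doc.location, doc)
    } else {
      // Anchor: this is a section, which is linked to its page
      const parent = pages.get(doc.location.slice(0, index))
      if (parent) sections.push({ ...doc, parent })
    }
  }
  return sections
}

// Step 2 (simplified): split a value into tokens using the separator
function tokenize(value: string, separator: RegExp): string[] {
  return value.split(separator).filter(token => token.length > 0)
}

const docs: SearchDocument[] = [
  { location: "page/", title: "Example", text: "" },
  { location: "page/#text", title: "Text", text: "It's very easy" },
  { location: "page/#lists", title: "Lists", text: "Sometimes you want lists" }
]

const sections = linkSections(docs)
console.log(sections.map(section => tokenize(section.text, /[\s\-]+/)))
```

Grouping results by page then only requires following the `parent` reference
of each matching section.
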
  [separator]: ../../setup/setting-up-site-search.md#search-separator
  [default tokenizer]: https://github.com/olivernn/lunr.js/blob/aa5a878f62a6bba1e8e5b95714899e17e8150b38/lunr.js#L413-L456
  [post-processed]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/_/index.ts#L249-L272
  [rescored]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/_/index.ts#L274-L275

### Search previews

Users should be able to quickly scan and evaluate the relevance of a search
result in the given context, which is why a concise summary with highlighted
occurrences of the search terms found is an essential part of a great search
experience.

This is where the current search preview generation falls short, as some of the
search previews appear not to include any occurrence of any of the search
terms. This is due to the fact that search previews are [truncated after a
maximum of 320 characters][truncated], as can be seen here:

<figure markdown>

![search preview]

<figcaption markdown>

The first two results look like they're not relevant, as they don't seem to
include the query string the user just searched for. Yet, they are.

</figcaption>
</figure>

A better solution to this problem has been on the roadmap for a very, very long
time, but in order to solve this once and for all, several factors need to be
carefully considered:

1.  __Word boundaries__: some themes[^2] for static site generators generate
    search previews by expanding the text left and right next to an occurrence,
    stopping at a whitespace character when enough words have been consumed. A
    preview might look like this:

    ```
    … channels, e.g., or which can be configured via mkdocs.yml …
    ```

    While this may work for languages that use whitespace as a separator
    between words, it breaks down for languages like Japanese or Chinese[^3],
    as they have non-whitespace word boundaries and use dedicated segmenters to
    split strings into tokens.

2.  __Context-awareness__: Although whitespace doesn't work for all languages,
    one could argue that it could be a good enough solution. Unfortunately, this
    is not necessarily true for code blocks, as the removal of whitespace might
    change meaning in some languages.

3.  __Structure__: Preserving structural information is not a must, but
    clearly beneficial to build more meaningful search previews which allow
    for a quick evaluation of relevance. If a word occurrence is part of a code
    block, it should be rendered as a code block.

  [^2]:
    At the time of writing, [Just the Docs] and [Docusaurus] use this method
    for generating search previews. Note that the latter also integrates with
    Algolia, which is a fully managed server-based solution.

  [^3]:
    China and Japan are both within the top 5 countries of origin of users of
    Material for MkDocs.

  [truncated]: https://github.com/squidfunk/mkdocs-material/blob/master/src/templates/assets/javascripts/templates/search/index.tsx#L90
  [search preview]: search-better-faster-smaller/search-preview.png
  [Just the Docs]: https://pmarsceill.github.io/just-the-docs/
  [Docusaurus]: https://github.com/lelouch77/docusaurus-lunr-search

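To make the word boundary problem more tangible, here's a minimal sketch of a
whitespace-based preview generator. It is not taken from any of the themes
mentioned; `whitespacePreview` and its `context` parameter are hypothetical and
only illustrate the general approach.

```ts
// Hypothetical helper: return the word containing the term, together with
// `context` words to its left and right, using whitespace as word boundary
function whitespacePreview(
  text: string, term: string, context = 2
): string | undefined {
  const words = text.split(/\s+/)
  const hit = words.findIndex(word => word.includes(term))
  if (hit === -1) return undefined
  const from = Math.max(0, hit - context)
  const to = Math.min(words.length, hit + context + 1)
  return words.slice(from, to).join(" ")
}

// Works as intended for whitespace-separated languages
const english = whitespacePreview(
  "which can be configured via mkdocs.yml in the project root",
  "configured"
)

// Without whitespace, the whole string counts as a single word, so the
// preview is never shortened
const sentence = "検索プレビューを生成するのは難しい"
const japanese = whitespacePreview(sentence, "生成")

console.log(english)
console.log(japanese)
```

For the English sentence, the preview is shortened to five words. For the
Japanese sentence, the entire input is returned, no matter how long it is.
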
## What's new?

Now that we have a solid understanding of the problem space, and before we dive
into the internals of our new search implementation to see which of the
problems it already solves, here's a quick overview of the features and
improvements it brings:

- __Better__: support for [rich search previews], preserving the structural
  information of code blocks, inline code, and lists, so they are rendered
  as-is, as well as [lookahead tokenization], [more accurate highlighting], and
  improved stability of typeahead. Also, a [slightly better UX].
- __Faster__ and __smaller__: significant decrease in search index size of up
  to 48% due to improved extraction and construction techniques, resulting in a
  search experience that is up to 95% faster, which is particularly helpful for
  large documentation projects.

  [rich search previews]: #rich-search-previews
  [lookahead tokenization]: #tokenizer-lookahead
  [more accurate highlighting]: #accurate-highlighting
  [slightly better UX]: #user-interface

### Rich search previews

As we rebuilt the search plugin from scratch, we reworked the construction of
the search index to preserve the structural information of code blocks, inline
code, as well as unordered and ordered lists. Using the example from the
[search index] section, here's how it looks:

=== "Now"

    ![search preview now]

=== "Before"

    ![search preview before]

Now, __code blocks are first-class citizens of search previews__, and even
inline code formatting is preserved. Let's take a look at the new structure of
the search index to understand why:

??? example "Expand to inspect search index"

    === "Now"

        ``` json
        {
          ...
          "docs": [
            {
              "location": "page/",
              "title": "Example",
              "text": ""
            },
            {
              "location": "page/#text",
              "title": "Text",
              "text": "<p>It's very easy to make some words bold and other words italic with Markdown. You can even add links, or even <code>code</code>:</p> <pre><code>if (isAwesome){\n return true\n}\n</code></pre>"
            },
            {
              "location": "page/#lists",
              "title": "Lists",
              "text": "<p>Sometimes you want numbered lists:</p> <ol> <li>One</li> <li>Two</li> <li>Three</li> </ol> <p>Sometimes you want bullet points:</p> <ul> <li>Start a line with a star</li> <li>Profit!</li> </ul>"
            }
          ]
        }
        ```

    === "Before"

        ``` json
        {
          ...
          "docs": [
            {
              "location": "page/",
              "title": "Example",
              "text": "Example Text It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true } Lists Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
            },
            {
              "location": "page/#example",
              "title": "Example",
              "text": ""
            },
            {
              "location": "page/#text",
              "title": "Text",
              "text": "It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true }"
            },
            {
              "location": "page/#lists",
              "title": "Lists",
              "text": "Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
            }
          ]
        }
        ```

If we inspect the search index again, we can see how the situation improved:

1.  __Content is included only once__: the search index does not include the
    content of the page twice, as only the sections of a page are part of the
    search index. This leads to a significantly smaller search index and fewer
    bytes to transfer.

2.  __Some structure is preserved__: each section of the search index includes
    a small subset of HTML to provide the necessary structure to allow for more
    sophisticated search previews. Revisiting our example from before, let's
    look at an excerpt:

    === "Now"

        ``` html
        … links, or even <code>code</code>:</p> <pre><code>if (isAwesome){ … }\n</code></pre>
        ```

    === "Before"

        ```
        … links , or even code : if (isAwesome) { … }
        ```

    The punctuation issue is gone, as no additional whitespace is inserted, and
    the preserved markup yields additional context to make scanning search
    results more effective.

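One way to make use of the preserved markup is to split the `text` of a
section into its top-level blocks and to only keep those that mention a search
term. The following is a rough sketch under simplifying assumptions, not the
actual implementation: `contentBlocks` and `blocksWith` are hypothetical
helpers, the regular expression only covers the elements that appear in the
example, and nested lists are not handled.

```ts
// Hypothetical helper: split the HTML of a section into paragraphs, code
// blocks, and lists. The backreference `\1` matches the closing tag that
// belongs to the opening tag.
function contentBlocks(html: string): string[] {
  return html.match(/<(p|pre|ol|ul)>[\s\S]*?<\/\1>/g) ?? []
}

// Hypothetical helper: only keep the blocks that contain the given term
function blocksWith(html: string, term: string): string[] {
  return contentBlocks(html).filter(block => block.includes(term))
}

// The `text` of the `page/#lists` section from the example
const text =
  "<p>Sometimes you want numbered lists:</p> " +
  "<ol> <li>One</li> <li>Two</li> <li>Three</li> </ol> " +
  "<p>Sometimes you want bullet points:</p> " +
  "<ul> <li>Start a line with a star</li> <li>Profit!</li> </ul>"

console.log(contentBlocks(text).length)
console.log(blocksWith(text, "Profit"))
```

The section is split into four blocks, two paragraphs and two lists, of which
only the unordered list contains the term `Profit`.
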
On to the next step in the process: __tokenization__.

  [search index]: #search-index
  [search preview now]: search-better-faster-smaller/search-preview-now.png
  [search preview before]: search-better-faster-smaller/search-preview-before.png

### Tokenizer lookahead

The [default tokenizer] of [lunr] uses a regular expression to split a given
string by matching each character against the [`separator`][separator] as
defined in `mkdocs.yml`. This doesn't allow for more complex separators based
on lookahead or multiple characters.

Fortunately, __our new search implementation provides an advanced tokenizer__
that doesn't have these shortcomings and supports more complex regular
expressions. As a result, Material for MkDocs just changed its own separator
configuration to the following value:

```
[\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;
```

While the first part up to the first `|` contains a list of single control
characters at which the string should be split, the following three sections
explain the remainder of the regular expression.[^4]

  [^4]:
    As a fun fact: the [`separator`][separator] [default value] of the search
    plugin, `[\s\-]+`, has always been somewhat confusing, as it suggests
    that multiple characters can be considered a single separator. However, the
    `+` is completely irrelevant, as regular expression groups involving
    multiple characters were never supported by
    [lunr's default tokenizer][default tokenizer].

  [default value]: https://www.mkdocs.org/user-guide/configuration/#separator

#### Case changes

Many programming languages use `PascalCase` or `camelCase` naming conventions.
When a user searches for the term `case`, it's quite natural to expect
`PascalCase` and `camelCase` to show up. By adding the following match group to
the separator, this can now be achieved with ease:

```
(?!\b)(?=[A-Z][a-z])
```

This regular expression is a combination of a negative lookahead (`\b`, i.e.,
not a word boundary) and a positive lookahead (`[A-Z][a-z]`, i.e., an uppercase
character followed by a lowercase character), and has the following behavior:

- `PascalCase` :octicons-arrow-right-24: `Pascal`, `Case`
- `camelCase` :octicons-arrow-right-24: `camel`, `Case`
- `UPPERCASE` :octicons-arrow-right-24: `UPPERCASE`

Searching for [:octicons-search-24: searchHighlight][q=searchHighlight]
now brings up the section discussing the `search.highlight` feature flag, which
also demonstrates that this now even works properly for search queries.[^5]

  [^5]:
    Previously, the search query was not correctly tokenized due to the way
    [lunr] treats wildcards, as it disables the pipeline for search terms that
    contain wildcards. In order to provide a good typeahead experience,
    Material for MkDocs adds wildcards to the end of each search term not
    explicitly preceded by `+` or `-`, effectively disabling tokenization.

  [q=searchHighlight]: ?q=searchHighlight

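The behavior of the case change expression can be checked with a few lines of
TypeScript. Note that `String.prototype.split` is only used as a stand-in for
the new tokenizer here, which is an assumption made to keep the sketch short.

```ts
// Zero-width separator: split in front of an uppercase character that is
// followed by a lowercase character, except at a word boundary
const caseChange = /(?!\b)(?=[A-Z][a-z])/

const identifiers = ["PascalCase", "camelCase", "UPPERCASE"]
const tokens = identifiers.map(identifier => identifier.split(caseChange))

console.log(tokens)
```

As the expression doesn't consume any characters, no part of the identifier is
lost when splitting.
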
#### Version numbers

Indexing version numbers is another problem that can be solved with a small
lookahead. Usually, `.` should be considered a separator to split words like
`search.highlight`. However, splitting version numbers at `.` will make them
undiscoverable. Thus, we use the following expression:

```
\.(?!\d)
```

This regular expression matches a `.` only if not immediately followed by a
digit `\d`, which leaves version numbers discoverable. Searching for
[:octicons-search-24: 7.2.6][q=7.2.6] brings up the [7.2.6] release notes.

  [q=7.2.6]: ?q=7.2.6
  [7.2.6]: ../../changelog/index.md#726-_-september-1-2021

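The difference to splitting at every `.` can be sketched as follows, again
using `String.prototype.split` as an assumed stand-in for the tokenizer:

```ts
// Split at `.` only if it's not immediately followed by a digit
const dot = /\.(?!\d)/

// Feature flags are still split into separate words
const flag = "search.highlight".split(dot)

// Version numbers remain a single token
const version = "7.2.6".split(dot)

// For comparison: splitting at every `.` tears the version number apart
const naive = "7.2.6".split(/\./)

console.log(flag, version, naive)
```

With the naive expression, the version number ends up as three single-digit
tokens, which are of little use for finding a specific release.
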
#### HTML/XML tags

If your documentation includes HTML/XML code examples, you may want to allow
users to find specific tag names. Unfortunately, the `<` and `>` control
characters are encoded in code blocks as `&lt;` and `&gt;`. Now, adding the
following expression to the separator allows for just that:

```
&[lg]t;
```

Searching for [:octicons-search-24: custom search worker script][q=script]
brings up the section on [custom search] and matches the `script` tag among the
other search terms discovered.

---

_We've only just begun to scratch the surface of the new possibilities
tokenizer lookahead brings. If you found other useful expressions, you're
invited to share them in the comment section._

  [q=script]: ?q=custom+search+worker+script
  [custom search]: ../../setup/setting-up-site-search.md#custom-search

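Coming back to the `&[lg]t;` expression from the section on HTML/XML tags,
here's a small sketch of its effect on an encoded tag. As before,
`String.prototype.split` is an assumed stand-in for the tokenizer, and empty
tokens are dropped by hand.

```ts
// An opening tag as it appears inside a code block of the search index
const encoded = "&lt;script&gt;"

// The default separator leaves the encoded tag as a single token
const before = encoded.split(/[\s\-]+/)

// Treating the entities as separators exposes the tag name
const entity = /&[lg]t;/
const after = encoded.split(entity).filter(token => token !== "")

console.log(before, after)
```

Only in the second case is `script` a token of its own, which makes the tag
name discoverable.
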
### Accurate highlighting

Highlighting is the last step of the search process and marks all search term
occurrences in a given search result. For a long time, highlighting was
implemented through dynamically generated [regular expressions].[^6]

This approach has some problems with non-whitespace languages like Japanese or
Chinese[^3], since it only works if the highlighted term is at a word boundary.
These languages, however, are tokenized using a [dedicated segmenter], whose
behavior cannot be modeled with regular expressions.

  [^6]:
    Using the separator as defined in `mkdocs.yml`, a regular expression was
    constructed that was trying to mimic the tokenizer. As an example, the
    search query `search highlight` was transformed into the rather cumbersome
    regular expression `(^|<separator>)(search|highlight)`, which only matches
    at word boundaries.

Now, as a direct result of the [new tokenization approach], __our new search
implementation uses token positions for highlighting__, making it exactly as
powerful as tokenization:

1.  __Word boundaries__: as the new highlighter uses token positions, word
    boundaries are equal to token boundaries. This means that more complex cases
    of tokenization (e.g., [case changes], [version numbers], [HTML/XML tags])
    are now all highlighted accurately.

2.  __Context-awareness__: as the new search index preserves some of the
    structural information of the original document, the content of a section
    is now divided into separate content blocks – paragraphs, code blocks, and
    lists.

    Now, only the content blocks that actually contain occurrences of one of
    the search terms are considered for inclusion into the search preview. If a
    term only occurs in a code block, it's the code block that gets rendered,
    see, for example, the results of
    [:octicons-search-24: twitter][q=twitter].

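The idea of highlighting by token position can be sketched in a few lines. This
is a simplified illustration, not the actual highlighter: the `TokenPosition`
type and the `highlight` function are hypothetical, positions are assumed not
to overlap, and escaping of HTML is left out.

```ts
// Hypothetical type: where a matching token starts and how long it is
interface TokenPosition {
  start: number
  length: number
}

// Wrap every given position in a `mark` element, leaving the rest untouched
function highlight(text: string, positions: TokenPosition[]): string {
  const sorted = [...positions].sort((a, b) => a.start - b.start)
  let result = ""
  let cursor = 0
  for (const { start, length } of sorted) {
    result += text.slice(cursor, start)
    result += `<mark>${text.slice(start, start + length)}</mark>`
    cursor = start + length
  }
  return result + text.slice(cursor)
}

// The token `Highlight` starts in the middle of a word, which is no
// problem, as no regular expression needs to match a word boundary
console.log(highlight("searchHighlight", [{ start: 6, length: 9 }]))
```

As the positions are produced by the tokenizer, whatever the tokenizer
considers a token can be highlighted, regardless of language or separator.
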
  [regular expressions]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/highlighter/index.ts#L61-L91
  [dedicated segmenter]: http://chasen.org/~taku/software/TinySegmenter/
  [new tokenization approach]: #tokenizer-lookahead
  [case changes]: #case-changes
  [version numbers]: #version-numbers
  [HTML/XML tags]: #htmlxml-tags
  [q=twitter]: ?q=twitter

### Benchmarks

We conducted two benchmarks – one with the documentation of Material for MkDocs
itself, and one with a massive corpus of Markdown files with more than
800,000 words – a size most documentation projects will likely never
reach:

<figure markdown>

|                         |   Before |            Now |     Relative |
| ----------------------- | -------: | -------------: | -----------: |
| __Material for MkDocs__ |          |                |              |
| Index size              |   573 kB |     __335 kB__ |     __–42%__ |
| Index size (`gzip`)     |   105 kB |      __78 kB__ |     __–27%__ |
| Indexing time[^7]       |   265 ms |     __177 ms__ |     __–34%__ |
| __KJV Markdown[^8]__    |          |                |              |
| Index size              |   8.2 MB |     __4.4 MB__ |     __–47%__ |
| Index size (`gzip`)     |   2.3 MB |     __1.2 MB__ |     __–48%__ |
| Indexing time           | 2,700 ms |   __1,390 ms__ |     __–48%__ |

<figcaption>
<p>Benchmark results</p>
</figcaption>

</figure>

  [^7]:
    Smallest value of ten distinct runs.

  [^8]:
    We agnostically use [KJV Markdown] as a tool for testing to learn how
    Material for MkDocs behaves on large corpora, as it's a very large set of
    Markdown files with over 800k words.

The results show that indexing time, which is the time that it takes to set up
the search when the page is loaded, has dropped by up to 48%, which means __the
new search is up to 95% faster__: cutting the time nearly in half almost
doubles the speed. This is a significant improvement, particularly relevant for
large documentation projects.

While 1.4 s may still sound like a long time, using the new client-side search
together with [instant loading] only creates the search index on the initial
page load. When navigating, the search index is preserved across pages, so the
cost only has to be paid once.

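How a drop in indexing time of roughly 48% relates to a search that is up to
95% faster can be reproduced from the indexing times of the larger benchmark.
The variable names in this sketch are made up for illustration:

```ts
// Indexing times of the KJV Markdown benchmark in milliseconds
const beforeMs = 2700
const nowMs = 1390

// Relative reduction in indexing time, in percent
const reduction = (1 - nowMs / beforeMs) * 100

// Relative speedup, in percent: how much more work fits into the same time
const speedup = (beforeMs / nowMs - 1) * 100

console.log(`${reduction.toFixed(1)}% less time`)
console.log(`${speedup.toFixed(1)}% faster`)
```

The reduction comes out at 48.5% and the speedup at 94.2%, which is in line
with the rounded figures given above.
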
  [KJV Markdown]: https://github.com/arleym/kjv-markdown
  [instant loading]: ../../setup/setting-up-navigation.md#instant-loading

### User interface

Additionally, some small improvements have been made, most prominently the
__more results on this page__ button, which now sticks to the top of the search
result list when open. This enables the user to jump out of the list more
quickly.

## What's next?

Our new search implementation is a big improvement to Material for MkDocs. It
solves some long-standing issues that had needed tackling for years. Yet,
it's only the start of a search experience that is going to get better and
better. Next up:

- __Context-aware search summarization__: currently, the first two matching
  content blocks are rendered as a search preview. With the new tokenization
  technique, we laid the groundwork for more sophisticated shortening and
  summarization methods, which we're tackling next.

- __User interface improvements__: as we have now gained full control over the
  search plugin, we can add meaningful metadata to provide more context and
  a better experience. We'll explore some of those paths in the future.

If you've made it this far, thank you for your time and interest in Material
for MkDocs! This is the first blog article that I decided to write after a
short [Twitter survey] prompted me to. You're invited to leave a comment
to share your experiences with the new search implementation.

  [Twitter survey]: https://twitter.com/squidfunk/status/1434477478823743488