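Hotel Data Full-Text Search (5): Implementing Pinyin Autocomplete for the Search Box
Autocomplete example
When a user types characters into the search box, we should suggest search terms related to that input. Suggesting complete terms in this way is autocomplete; the product search on JD.com and Taobao works like this.
Pinyin search example
Searching by pinyin initials or by full pinyin also works. JD.com is again a good example.
Pinyin analyzer
To suggest terms based on the pinyin letters a user types, the documents must be analyzed into pinyin, which requires a pinyin analysis plugin. Plugin repository: https://github.com/medcl/elasticsearch-analysis-pinyin
Download the plugin release that matches your Elasticsearch version and unzip it. Run docker volume inspect es-plugins to find the plugin directory, upload the unzipped files there, then restart Elasticsearch.
Test it as follows: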
POST /_analyze
{
"text": "如家酒店还不错",
"analyzer": "pinyin"
}
The result:
{
"tokens" : [
{
"token" : "ru",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "rjjdhbc",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "jia",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 1
},
{
"token" : "jiu",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 2
},
{
"token" : "dian",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 3
},
{
"token" : "hai",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 4
},
{
"token" : "bu",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 5
},
{
"token" : "cuo",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 6
}
]
}
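Custom analyzer
The default pinyin analyzer converts each Chinese character to pinyin separately, whereas we want one group of pinyin per term. The pinyin analysis therefore has to be customized by defining a custom analyzer.
An Elasticsearch analyzer consists of three parts:
- character filters: process the text before the tokenizer, e.g. deleting or replacing characters
- tokenizer: splits the text into terms according to some rule, e.g. keyword (no splitting) or ik_smart
- token filters: further process the terms emitted by the tokenizer, e.g. case conversion, synonym handling, pinyin conversion
When a document is analyzed, its text passes through these three parts in that order.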
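To make the order of the three stages concrete, here is a minimal plain-Java sketch. It does not use Elasticsearch at all; the class name and the three toy stages (punctuation removal, whitespace splitting, lowercasing) are assumptions chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Toy model of an analyzer: character filter -> tokenizer -> token filter.
// This is NOT Elasticsearch code, only an illustration of the stage order.
class AnalyzerPipelineSketch {

    // Stage 1: character filter, rewrites the raw text before tokenization.
    static final UnaryOperator<String> CHAR_FILTER =
            text -> text.replaceAll("[,.!?]", " ");

    // Stage 2: tokenizer, cuts the filtered text into terms (here: on whitespace).
    static final Function<String, List<String>> TOKENIZER =
            text -> Arrays.asList(text.trim().split("\\s+"));

    // Stage 3: token filter, post-processes each term (here: lowercasing).
    static final UnaryOperator<String> TOKEN_FILTER =
            term -> term.toLowerCase(Locale.ROOT);

    static List<String> analyze(String text) {
        String filtered = CHAR_FILTER.apply(text);
        List<String> terms = new ArrayList<>();
        for (String term : TOKENIZER.apply(filtered)) {
            terms.add(TOKEN_FILTER.apply(term));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Hello, World!")); // prints [hello, world]
    }
}
```

In the custom analyzer of this article, ik_max_word plays the tokenizer role and the pinyin filter plays the token filter role.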
The syntax for declaring a custom analyzer is:
PUT /test
{
"settings": {
"analysis": {
"analyzer": { // custom analyzers
"my_analyzer": { // analyzer name
"tokenizer": "ik_max_word",
"filter": "py"
}
},
"filter": { // custom token filters
"py": { // filter name
"type": "pinyin", // filter type, here pinyin
"keep_full_pinyin": false,
"keep_joined_full_pinyin": true,
"keep_original": true,
"limit_first_letter_length": 16,
"remove_duplicated_term": true,
"none_chinese_pinyin_tokenize": false
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "ik_smart" // use the ik analyzer at search time
}
}
}
}
Test it:
POST /test/_analyze
{
"text": "如家酒店还不错",
"analyzer": "my_analyzer"
}
The result:
{
"tokens" : [
{
"token" : "如家",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "rujia",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "rj",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "酒店",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "jiudian",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "jd",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "还不",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "haibu",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "hb",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "不错",
"start_offset" : 5,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "bucuo",
"start_offset" : 5,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "bc",
"start_offset" : 5,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 3
}
]
}
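Note: to avoid matching homophones, do not use the pinyin analyzer at search time. For example, take two documents, one containing 狮子 (lion) and the other 虱子 (louse). Both have the pinyin shizi, so when the inverted index is built, both documents are indexed under the terms shizi and sz. If the search also used the pinyin analyzer, searching for 狮子 would return 虱子 as well.
Therefore the field should use the my_analyzer analyzer at index time and the ik_smart analyzer at search time.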
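The homophone problem can be reproduced with a tiny in-memory inverted index. The sketch below is plain Java, not Elasticsearch; the PINYIN lookup table and both analyzer functions are hypothetical stand-ins for the pinyin filter and for a search analyzer without pinyin.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class HomophoneSketch {

    // Hypothetical lookup table standing in for the pinyin token filter.
    static final Map<String, String> PINYIN = new HashMap<>();
    static {
        PINYIN.put("狮子", "shizi");
        PINYIN.put("虱子", "shizi");
    }

    // Pinyin-style analysis: keep the original term and add its joined pinyin.
    static List<String> pinyinAnalyzer(String term) {
        String pinyin = PINYIN.get(term);
        return pinyin == null ? Collections.singletonList(term) : Arrays.asList(term, pinyin);
    }

    // Plain analysis: the term is kept exactly as typed.
    static List<String> plainAnalyzer(String term) {
        return Collections.singletonList(term);
    }

    // Builds term -> document ids, analyzing every document with the pinyin-style analyzer.
    static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : pinyinAnalyzer(docs.get(id))) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    // Returns the ids of documents matching any of the query terms.
    static Set<Integer> search(Map<String, Set<Integer>> index, List<String> queryTerms) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : queryTerms) {
            hits.addAll(index.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = buildIndex(Arrays.asList("狮子", "虱子"));
        System.out.println(search(index, pinyinAnalyzer("狮子"))); // [0, 1]: the homophone leaks in
        System.out.println(search(index, plainAnalyzer("狮子")));  // [0]: only the exact word
        System.out.println(search(index, plainAnalyzer("shizi"))); // [0, 1]: pinyin input still works
    }
}
```

Pinyin typed by the user still matches because the pinyin terms were written into the index; only the expansion of Chinese input into pinyin at search time is avoided.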
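Autocomplete query
Elasticsearch provides the Completion Suggester to implement autocomplete. It matches and returns terms that start with the user's input. To keep completion queries fast, it places some constraints on the field type:
- The field used for completion must be of type completion.
- The field value is usually an array of the terms to be completed.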
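The matching rule itself ("return entries that start with the input") can be sketched in plain Java. This is only an illustration of prefix lookup over a sorted map, not how Elasticsearch stores completion fields internally; the class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

class PrefixCompletionSketch {

    // Lowercased entry -> entry as originally written. The sorted map keeps all
    // entries sharing a prefix next to each other and collapses duplicates.
    private final TreeMap<String, String> entries = new TreeMap<>();

    void add(String... inputs) {
        for (String input : inputs) {
            entries.put(input.toLowerCase(Locale.ROOT), input);
        }
    }

    // Returns up to `size` entries that start with `prefix`, ignoring case.
    List<String> suggest(String prefix, int size) {
        String from = prefix.toLowerCase(Locale.ROOT);
        String to = from + Character.MAX_VALUE; // upper bound of the prefix range
        List<String> result = new ArrayList<>();
        for (String original : entries.subMap(from, true, to, false).values()) {
            if (result.size() == size) {
                break;
            }
            result.add(original);
        }
        return result;
    }

    public static void main(String[] args) {
        PrefixCompletionSketch sketch = new PrefixCompletionSketch();
        sketch.add("Sony", "WH-1000XM3");
        sketch.add("SK-II", "PITERA");
        sketch.add("Nintendo", "switch");
        System.out.println(sketch.suggest("s", 10)); // prints [SK-II, Sony, switch]
    }
}
```

Note that each array element is matched from its own beginning, which is why the field is populated with an array of separate terms rather than one long string.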
// create the index
PUT test2
{
"mappings": {
"properties": {
"title":{
"type": "completion"
}
}
}
}
Then insert the following data
// sample data
POST test2/_doc
{
"title": ["Sony", "WH-1000XM3"]
}
POST test2/_doc
{
"title": ["SK-II", "PITERA"]
}
POST test2/_doc
{
"title": ["Nintendo", "switch"]
}
DSL query
The query DSL is as follows
// autocomplete query
GET /test2/_search
{
"suggest": {
"title_suggest": {
"text": "s", // the prefix typed by the user
"completion": {
"field": "title", // field to run the completion query on
"skip_duplicates": true, // skip duplicate suggestions
"size": 10 // return the top 10 results
}
}
}
}
The result:
{
"took" : 51,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"suggest" : {
"title_suggest" : [
{
"text" : "s",
"offset" : 0,
"length" : 1,
"options" : [
{
"text" : "SK-II",
"_index" : "test2",
"_type" : "_doc",
"_id" : "ZtEyi34BIxziSDxgYv4X",
"_score" : 1.0,
"_source" : {
"title" : [
"SK-II",
"PITERA"
]
}
},
{
"text" : "Sony",
"_index" : "test2",
"_type" : "_doc",
"_id" : "ZdEyi34BIxziSDxgWv7q",
"_score" : 1.0,
"_source" : {
"title" : [
"Sony",
"WH-1000XM3"
]
}
},
{
"text" : "switch",
"_index" : "test2",
"_type" : "_doc",
"_id" : "Z9Eyi34BIxziSDxgbv7x",
"_score" : 1.0,
"_source" : {
"title" : [
"Nintendo",
"switch"
]
}
}
]
}
]
}
}
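RestClient query
@Test
public void testSuggest() throws IOException {
// Same request as the DSL above: complete the prefix "s" on the title field of test2
SearchRequest request = new SearchRequest("test2");
request.source().suggest(new SuggestBuilder().addSuggestion("title_suggest",
SuggestBuilders
.completionSuggestion("title")
.prefix("s")
.skipDuplicates(true)
.size(10)
));
SearchResponse response = client.search(request, RequestOptions.DEFAULT);
// Print the text of each suggested option
CompletionSuggestion suggestion = response.getSuggest().getSuggestion("title_suggest");
for (CompletionSuggestion.Entry.Option option : suggestion.getOptions()) {
System.out.println(option.getText().toString());
}
}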
Hands-on
Rebuilding the index
First, the mapping has to be rebuilt.
Delete the existing index before recreating it
DELETE /hotel
Then create the index
PUT /hotel
{
"settings": {
"analysis": {
"analyzer": {
"text_anlyzer": {
"tokenizer": "ik_max_word",
"filter": "py"
},
"completion_analyzer": {
"tokenizer": "keyword",
"filter": "py"
}
},
"filter": {
"py": {
"type": "pinyin",
"keep_full_pinyin": false,
"keep_joined_full_pinyin": true,
"keep_original": true,
"limit_first_letter_length": 16,
"remove_duplicated_term": true,
"none_chinese_pinyin_tokenize": false
}
}
}
},
"mappings": {
"properties": {
"id":{
"type": "keyword"
},
"name":{
"type": "text",
"analyzer": "text_anlyzer",
"search_analyzer": "ik_smart",
"copy_to": "all"
},
"address":{
"type": "keyword",
"index": false
},
"price":{
"type": "integer"
},
"score":{
"type": "integer"
},
"brand":{
"type": "keyword",
"copy_to": "all"
},
"city":{
"type": "keyword"
},
"starName":{
"type": "keyword"
},
"business":{
"type": "keyword",
"copy_to": "all"
},
"location":{
"type": "geo_point"
},
"pic":{
"type": "keyword",
"index": false
},
"isAD":{
"type": "boolean"
},
"adCost":{
"type":"integer"
},
"all":{
"type": "text",
"analyzer": "text_anlyzer",
"search_analyzer": "ik_smart"
},
"suggestion":{
"type": "completion",
"analyzer": "completion_analyzer"
}
}
}
}
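Adding a field
Add a suggestion field to HotelDoc and populate it in the constructor with the brand and the business district, so that both can be autocompleted.
import java.util.Arrays;
import java.util.List;
import lombok.Data;
import lombok.NoArgsConstructor;
@Data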
@NoArgsConstructor
public class HotelDoc {
private Long id;
private String name;
private String address;
private Integer price;
private Integer score;
private String brand;
private String city;
private String starName;
private String business;
private String location;
private String pic;
private Object distance;
private Boolean isAD;
private Integer adCost;
private List<String> suggestion;
public HotelDoc(Hotel hotel) {
this.id = hotel.getId();
this.name = hotel.getName();
this.address = hotel.getAddress();
this.price = hotel.getPrice();
this.score = hotel.getScore();
this.brand = hotel.getBrand();
this.city = hotel.getCity();
this.starName = hotel.getStarName();
this.business = hotel.getBusiness();
this.location = hotel.getLatitude() + ", " + hotel.getLongitude();
this.pic = hotel.getPic();
this.suggestion = Arrays.asList(this.brand, this.business);
}
}
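If brand or business can be null or blank in your data (an assumption; the source does not say either way), it may be worth filtering such values out before they reach the completion field. A minimal sketch, with a hypothetical helper name and hypothetical sample values:

```java
import java.util.ArrayList;
import java.util.List;

class SuggestionListSketch {

    // Hypothetical helper: collects the non-blank values, in order,
    // as inputs for the suggestion field.
    static List<String> buildSuggestion(String... values) {
        List<String> suggestion = new ArrayList<>();
        for (String value : values) {
            if (value != null && !value.trim().isEmpty()) {
                suggestion.add(value.trim());
            }
        }
        return suggestion;
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        System.out.println(buildSuggestion("如家", "人民广场")); // prints [如家, 人民广场]
        System.out.println(buildSuggestion("如家", null));       // prints [如家]
    }
}
```

In the constructor above, this would replace the Arrays.asList call with buildSuggestion(this.brand, this.business).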
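Inspecting the front-end request
When we type text into the search box, the front end sends a request to the back end and expects an array of strings in return, which it displays as completion suggestions.
Building the Controller method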
@GetMapping("/suggestion")
public List<String> suggestion(String key){
return hotelService.suggestion(key);
}
Building the Service method
@Override
public List<String> suggestion(String key) {
List<String> resultList = new ArrayList<>();
SearchRequest request = new SearchRequest("hotel");
request.source().suggest(new SuggestBuilder().addSuggestion("suggestions",
SuggestBuilders.completionSuggestion("suggestion")
.prefix(key)
.skipDuplicates(true)
.size(10)
)
);
try {
SearchResponse response = client.search(request, RequestOptions.DEFAULT);
// read the suggestions from the response
Suggest suggest = response.getSuggest();
CompletionSuggestion suggestions = suggest.getSuggestion("suggestions");
List<CompletionSuggestion.Entry.Option> options = suggestions.getOptions();
for (CompletionSuggestion.Entry.Option option : options) {
String text = option.getText().toString();
resultList.add(text);
}
} catch (IOException e) {
e.printStackTrace();
}
return resultList;
}
Start the project and check the result
That's it, all done!