酒店数据全文检索(5)-搜索框拼音自动补全功能实现

自动补全示例

当用户在搜索框输入字符时,我们应该提示出与该字符有关的搜索项,提示完整词条的功能,就是自动补全了。比如京东、淘宝的商品搜索

image-20220124154044057

拼音搜索示例

我们用拼音首字母全拼也能搜索,还是用京东举例

image-20220124154915188

image-20220124154933258

拼音分词器

如果我们需要根据拼音字母来推断,因此要用到拼音分词功能。

要实现根据字母做补全,就必须对文档按照拼音分词。插件地址:https://github.com/medcl/elasticsearch-analysis-pinyin

下载插件后解压,上传到插件目录。使用 docker volume inspect es-plugins 查看插件目录,将下载的文件解压上传,重启 Elasticsearch

测试用法如下:

POST /_analyze
{
  "text": "如家酒店还不错",
  "analyzer": "pinyin"
}

结果如下

{
  "tokens" : [
    {
      "token" : "ru",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "rjjdhbc",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "jia",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "jiu",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "dian",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "hai",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "bu",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "cuo",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 6
    }
  ]
}

自定义分词器

默认的拼音分词器会将每个汉字单独分为拼音,而我们希望的是每个词条形成一组拼音,需要对拼音分词器做个性化定制,形成自定义分词器。

elasticsearch 中分词器(analyzer)的组成包含三部分:

  • character filters:在 tokenizer 之前对文本进行处理。例如删除字符、替换字符
  • tokenizer:将文本按照一定的规则切割成词条(term)。例如 keyword,就是不分词;还有 ik_smart
  • tokenizer filter:将 tokenizer 输出的词条做进一步处理。例如大小写转换、同义词处理、拼音处理等

文档分词时会依次由这三部分来处理文档:

img

声明自定义分词器的语法如下:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": { // 自定义分词器
        "my_analyzer": {  // 分词器名称
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": { // 自定义tokenizer filter
        "py": { // 过滤器名称
          "type": "pinyin", // 过滤器类型,这里是pinyin
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "ik_smart"  // 搜索时使用ik分词器
      }
    }
  }
}

测试一下

POST /test/_analyze
{
  "text": "如家酒店还不错",
  "analyzer": "my_analyzer"
}

结果如下:

{
  "tokens" : [
    {
      "token" : "如家",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "rujia",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "rj",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "酒店",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "jiudian",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "jd",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "还不",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "haibu",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "hb",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "不错",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "bucuo",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "bc",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}

注意为了避免搜索到同音字,搜索时不要使用拼音分词器。比如:有两个文档,一个文档是狮子,另一个文档是虱子。由于拼音都是shizi在建立倒排索引时,都会有shizisz的分词。如果搜索时用拼音分词器,那么在搜索狮子的时候,也会把虱子搜出来。

image-20220124161132472

因此,字段在创建的时候应该用my_analyzer分词器,字段在搜索的时候应该用ik_smart分词器

自动补全查询

elasticsearch 提供了 Completion Suggester 查询来实现自动补全功能。这个查询会匹配以用户输入内容开头的词条并返回;为了提高补全查询的效率,对于文档中字段的类型有一些约束

  • 参与补全查询的字段必须是 completion 类型。
  • 字段的内容一般是用来补全的多个词条形成的数组。
// 创建索引库
PUT test2
{
  "mappings": {
    "properties": {
      "title":{
        "type": "completion"
      }
    }
  }
}

然后插入下面的数据

// 示例数据
POST test2/_doc
{
  "title": ["Sony", "WH-1000XM3"]
}
POST test2/_doc
{
  "title": ["SK-II", "PITERA"]
}
POST test2/_doc
{
  "title": ["Nintendo", "switch"]
}
DSL查询

查询的 DSL 语句如下

// 自动补全查询
GET /test2/_search
{
  "suggest": {
    "title_suggest": {
      "text": "s", // 关键字
      "completion": {
        "field": "title", // 补全查询的字段
        "skip_duplicates": true, // 跳过重复的
        "size": 10 // 获取前10条结果
      }
    }
  }
}

结果如下

{
  "took" : 51,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "title_suggest" : [
      {
        "text" : "s",
        "offset" : 0,
        "length" : 1,
        "options" : [
          {
            "text" : "SK-II",
            "_index" : "test2",
            "_type" : "_doc",
            "_id" : "ZtEyi34BIxziSDxgYv4X",
            "_score" : 1.0,
            "_source" : {
              "title" : [
                "SK-II",
                "PITERA"
              ]
            }
          },
          {
            "text" : "Sony",
            "_index" : "test2",
            "_type" : "_doc",
            "_id" : "ZdEyi34BIxziSDxgWv7q",
            "_score" : 1.0,
            "_source" : {
              "title" : [
                "Sony",
                "WH-1000XM3"
              ]
            }
          },
          {
            "text" : "switch",
            "_index" : "test2",
            "_type" : "_doc",
            "_id" : "Z9Eyi34BIxziSDxgbv7x",
            "_score" : 1.0,
            "_source" : {
              "title" : [
                "Nintendo",
                "switch"
              ]
            }
          }
        ]
      }
    ]
  }
}
RestClient查询
@Test
public void testSuggest() throws IOException {
    SearchRequest request = new SearchRequest("hotel");
    request.source().suggest(new SuggestBuilder().addSuggestion("mySuggest",
            SuggestBuilders
                    .completionSuggestion("title")
                    .prefix("h")
                    .skipDuplicates(true)
                    .size(10)
    ));
    client.search(request,RequestOptions.DEFAULT);
}

实战

重建索引

首先,我们要重新建立mapping映射

在建立前先删除之前的索引

DELETE /hotel

然后,建立索引

PUT /hotel
{
  "settings": {
    "analysis": {
      "analyzer": {
        "text_anlyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        },
        "completion_analyzer": {
          "tokenizer": "keyword",
          "filter": "py"
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id":{
        "type": "keyword"
      },
      "name":{
        "type": "text",
        "analyzer": "text_anlyzer",
        "search_analyzer": "ik_smart",
        "copy_to": "all"
      },
      "address":{
        "type": "keyword",
        "index": false
      },
      "price":{
        "type": "integer"
      },
      "score":{
        "type": "integer"
      },
      "brand":{
        "type": "keyword",
        "copy_to": "all"
      },
      "city":{
        "type": "keyword"
      },
      "starName":{
        "type": "keyword"
      },
      "business":{
        "type": "keyword",
        "copy_to": "all"
      },
      "location":{
        "type": "geo_point"
      },
      "pic":{
        "type": "keyword",
        "index": false
      },
      "isAD":{
        "type": "boolean"
      },
      "adCost":{
        "type":"integer"
      },
      "all":{
        "type": "text",
        "analyzer": "text_anlyzer",
        "search_analyzer": "ik_smart"
      },
      "suggestion":{
          "type": "completion",
          "analyzer": "completion_analyzer"
      }
    }
  }
}
增加字段

HotelDoc增加suggestion字段,在构造函数中将设置包含品牌和商圈。实现品牌和商圈自动补全。

@Data
@NoArgsConstructor
public class HotelDoc {
    private Long id;
    private String name;
    private String address;
    private Integer price;
    private Integer score;
    private String brand;
    private String city;
    private String starName;
    private String business;
    private String location;
    private String pic;
    private Object distance;
    private Boolean isAD;
    private Integer adCost;
    private List<String> suggestion;

    public HotelDoc(Hotel hotel) {
        this.id = hotel.getId();
        this.name = hotel.getName();
        this.address = hotel.getAddress();
        this.price = hotel.getPrice();
        this.score = hotel.getScore();
        this.brand = hotel.getBrand();
        this.city = hotel.getCity();
        this.starName = hotel.getStarName();
        this.business = hotel.getBusiness();
        this.location = hotel.getLatitude() + ", " + hotel.getLongitude();
        this.pic = hotel.getPic();
        this.suggestion = Arrays.asList(this.brand, this.business);
    }
}
观察前端请求

当我们在搜索框输入文章时,前端就会往后端发送请求,前端会接收一个String的数组,用来补全展示

image-20220125143510170

image-20220125143656680

构造Controller方法
@GetMapping("/suggestion")
public List<String> suggestion(String key){
    return hotelService.suggestion(key);
}

构建Service方法

@Override
public List<String> suggestion(String key) {
    List<String> resultList = new ArrayList<>();
    SearchRequest request = new SearchRequest("hotel");
    request.source().suggest(new SuggestBuilder().addSuggestion("suggestions",
                    SuggestBuilders.completionSuggestion("suggestion")
                            .prefix(key)
                            .skipDuplicates(true)
                            .size(10)
            )
    );
    try {
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        //获取结果
        Suggest suggest = response.getSuggest();

        CompletionSuggestion suggestions = suggest.getSuggestion("suggestions");
        List<CompletionSuggestion.Entry.Option> options = suggestions.getOptions();
        for (CompletionSuggestion.Entry.Option option : options) {
            String text = option.getText().toString();
            resultList.add(text);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return resultList;
}

启动项目,我们观察一下效果

image-20220125155518573

好啦,大功告成~


酒店数据全文检索(5)-搜索框拼音自动补全功能实现
https://www.zhaojun.inkhttps://www.zhaojun.ink/archives/1016
作者
卑微幻想家
发布于
2022-02-08
许可协议