ElasticSearch
一、字段的类型Field type的详解
(资料图片仅供参考)
下面就是所有的字段类型
字符串类型text,keyword整数类型integer,long,short,byte浮点类型double,float,half_float,scaled_float逻辑类型boolean日期类型date范围类型range二进制类型binary复合类型数组类型array对象类型object嵌套类型nested地理类型地理坐标类型geo_point地理地图geo_shape特殊类型IP类型ip范围类型completion令牌计数类型token_count附件类型attachment抽取类型percolator
1、字符串类型
1.1、text
会被分词器解析, 生成倒排索引, 支持模糊、精确查询, 不用于排序, 很少用于聚合,常用于全文检索.
1.2、keyword
不进行分词,直接索引, 支持模糊、精确查询, 支持聚合。如果需要为结构化内容, 比如 id、email、hostnames、状态码、标签等进行索引,则推荐使用 keyword类型以进行完全匹配. 比如我们需要查询"已发货"的订单, 标记为"已发布"的文章等.
ES会对"数值(numeric)"类型的字段(比如, integer, long)进行优化以支持范围(range)查询. 但是, 不是所有的数值类型的数据都需要使用"数值"类型, 比如产品id, 会员id, ISDN(出版社编号), 这些很少会被进行范围查询, 通常都是精确匹配(term query).
keyword类型的查询通常比numeric的要快, 如果不需要范围查询, 则建议使用keyword类型.
如果不确定使用哪一种, 可以通过multi-field同时设置keyword和numeric类型.
2、数值类型(numeric)
类型说明取值范围
byte8位有符号整数-128 ~ 127
short16位有符号整数-32768 ~ 32767
integer32位有符号整数-2,147,483,648 ~ 2,147,483,647 即:-2^31 ~ 2^32 - 1
long64位有符号整数-2^63 ~ 2^63 - 1
float32位单精度IEEE 754浮点类型, 有限值, 24bits2^-149 ~ (2 - 2^-23) · 2^127
double64位双精度IEEE 754浮点类型, 有限值, 53bits2^-1074 ~ (2 - 2^-52) · 2^1023
half_float16位半精度IEEE 754浮点类型, 有限值, 11bits2^-24 ~ 65504
scaled_float带有缩放因子scaling_factor的浮点数, 可以当整型看待
2.1、整型
byte, short, integer, long
2.2、浮点型
float, half_float, scaled_float, double 对于上面展示的3种浮点类型(float, half_float, scaled_float)来说, -0.00和+0.00是不同的值,使用term查询-0.00时不会匹配+0.00, 反之亦然。对于范围查询(range query)来说也是如此:如果上边界是-0.00则无法匹配+0.00,如果下边界是+0.00则不会匹配-0.00
数值类型使用的注意事项:
在满足业务需求的情况下, 尽量选择范围小的类型, 这与mysql等关系型数据库的设计要求是一致的. 字段占用的空间越小, 搜索和索引的效率越高(这个有点废话, 数据量越小肯定越快了).
如果是浮点数, 则也要优先考虑使用scaled_float类型。对于scaled_float类型, 在索引时会乘以scaling_factor并四舍五入到最接近的一个long类型的值.
如果我们存储商品价格只需要精确到分, 两位小数点, 把缩放因子设置为100, 那么浮点价格99.99存储的是9999.
如果我们要存储的值是2.34, 而因子是10, 那么会存储为23, 在查询/聚合/排序时会表现为这个值是2.3. 如果设置更大的因子(比如100), 可以提高精度, 但是会增加空间需求.
3、Object类型
JSON文档天生具有层级关系: 文档可能包含内部对象, 而这个对象可能本身又包含对象, 就是可以包含嵌套的对象.
4、Date
JSON没有日期(date)类型, 在ES中date类型可以表现为:
字符串格式的日期, 比如: “2015-01-01”, “2015/01/01 12:10:30”long类型的自 epoch (1970-1-1) 以来的毫秒数integer类型的自 epoch (1970-1-1) 以来的秒数(时间戳, timestamp)
在ES内部, 日期被转换为UTC格式(如果指定了时区), 并存储为long类型的自 epoch (1970-1-1) 以来的毫秒数.
查询时 , date类型会在内部转换为long类型的范围查询,聚合和存储字段的结果根据与字段关联的日期格式转换回字符串。
5、ip类型
ip类型的字段只能存储ipv4或ipv6地址.
二、属性的详解
1、fielddata字段数据
参考链接:https://www.cnblogs.com/rickie/p/11665168.html
默认情况下是false
作用:把不能聚合、排序、通过脚本访问的字段变为可以聚合、排序、通过脚本访问的字段
text类型是不支持doc_value。所以要想对text类型的字段拥有聚合等功能,只能设置属性的fielddata为true即可
缺点:构建和管理全部在内存,也就是在JVM的内存中。这意味着它本质上是不可扩展的。
fielddata可能会消耗大量的堆空间,尤其是在加载高基数(high cardinality)text字段时。一旦fielddata已加载到堆中,它将在该段的生命周期内保留。此外,加载fielddata是一个昂贵的过程,可能会导致用户遇到延迟命中。这就是默认情况下禁用fielddata的原因。
如果需要对 text 类型字段进行排序、聚合、或者从脚本中访问字段值,则会出现如下异常:
Fielddata is disabled on text fields by default. Set fielddata=true on [your_field_name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory.
但是,在启动fielddata 设置之前,需要考虑为什么针对text 类型字段进行排序、聚合、或脚本呢?通常情况下,这是不太合理的。
text字段在索引时,例如New York,这样的词会被分词,会被拆成new、york 2个词项,这样当搜索new 或 york时,可以被搜索到。在此字段上面来一个terms的聚合会返回一个new的bucket和一个york的bucket,但是你可能想要的是一个单一new york的bucket。
怎么解决这一问题呢?
你可以使用 text 字段来实现全文本查询,同时使用一个未分词的 keyword 字段,且启用doc_values,来处理聚合操作。
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-qlza4utS-1644556258944)()]
(1) 使用my_field 字段用于查询;
(2) 使用my_field.keyword 字段用于聚合、排序、或脚本; 注意:但是我试了试,不行,我也不知道为啥
PUT /user3{"mappings": {"properties": {"name":{"type": "text", "fields": {"keyword":{"type":"keyword" } } } } }}POST /user3/_doc{"name":"hello word 的数据结构 你真棒"}GET /user3/_search{"query": {"match": {"name": "hello" } }, "aggs":{"testq":{"terms":{"field":"name.keyword" } } }}结果输出:{"took" : 1, "timed_out" : false, "_shards" : {"total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : {"total" : {"value" : 1, "relation" : "eq" }, "max_score" : 0.2876821, "hits" : [ {"_index" : "user3", "_type" : "_doc", "_id" : "N_X1n34BgddXgx4111ZC", "_score" : 0.2876821, "_source" : {"name" : "hello word 的数据结构 你真棒" } } ] }, "aggregations" : {"testq" : {"doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ {"key" : "hello word 的数据结构 你真棒", "doc_count" : 1 } ] } }}
也可以使用 PUT mapping API 在现有text 字段上启用 fielddata,如下所示:
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-C0iFF2N6-1644556258949)()]
2、doc_value
doc_values
默认情况下,大部分字段是索引的,这样让这些字段可被搜索。倒排索引(inverted index)允许查询请求在词项列表中查找搜索项(search term),并立即获得包含该词项的文档列表。
倒排索引(inverted index):
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-D0WU7j87-1644556258952)()]
如果我们想要获得所有包含 brown 的文档的词的完整列表,我们会创建如下查询:
GET /my_index/_search{"query" : {"match" : {"body" : "brown" } }, "aggs" : {"popular_terms": {"terms" : {"field" : "body" } } }}
倒排索引是根据词项来排序的,所以我们首先在词项列表中找到 brown,然后扫描所有列,找到包含 brown 的文档。我们可以快速看到 Doc_1 和 Doc_2 包含 brown 这个 token。
然后,对于聚合部分,我们需要找到 Doc_1 和 Doc_2 里所有唯一的词项。用倒排索引做这件事情代价很高: 我们会迭代索引里的每个词项并收集 Doc_1 和 Doc_2 列里面 token。这很慢而且难以扩展:随着词项和文档的数量增加,执行时间也会增加。
Doc values 通过转置两者间的关系来解决这个问题。倒排索引将词项映射到包含它们的文档,doc values 将文档映射到它们包含的词项:
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-deqcJ89G-1644556258957)()]
当数据被转置之后,想要收集到 Doc_1 和 Doc_2 的唯一 token 会非常容易。获得每个文档行,获取所有的词项,然后求两个集合的并集。
Doc values 可以使聚合更快、更高效并且内存友好。Doc values 的存在是因为倒排索引只对某些操作是高效的。
倒排索引的优势:在于查找包含某个项的文档,而对于从另外一个方向的相反操作并不高效,即:确定哪些项是否存在单个文档里,聚合需要这种访问模式。
在 Elasticsearch 中,Doc Values 就是一种列式存储结构,默认情况下每个字段的 Doc Values 都是激活的,Doc Values 是在索引时创建的。当字段索引时,Elasticsearch 为了能够快速检索,会把字段的值加入倒排索引中,同时它也会存储该字段的 Doc Values。
Elasticsearch 中的 Doc Values 常被应用到以下场景:
对一个字段进行排序对一个字段进行聚合某些过滤,比如地理位置过滤某些与字段相关的脚本计算
因为文档值(doc values)被序列化到磁盘,我们可以依靠操作系统的帮助来快速访问。当 working set 远小于节点的可用内存,系统会自动将所有的文档值保存在内存中,使得其读写十分高速;当其远大于可用内存,操作系统会自动把 Doc Values 加载到系统的页缓存中,从而避免了 jvm 堆内存溢出异常。
因此,搜索和聚合是相互紧密缠绕的。搜索使用倒排索引查找文档,聚合操作收集和聚合 doc values 里的数据。
doc values 支持大部分字段类型,但是text 字段类型不支持(因为analyzed)。
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-LgoL0gRs-1644556258959)()]
(1) status_code 字段默认启动 doc_values 属性;
(2) session_id 显式设置 doc_values = false,但是仍然可以被查询;
如果确信某字段不需要排序或者聚合,或者从脚本中访问字段值,那么我们可以设置 doc_values = false,这样可以节省磁盘空间。
3、executionHint
1、global ordinals
(1)what’s this?
当我们使用doc values或者fielddata存储时,在磁盘中存储的值不是真正的字段值,而是一个字典值(ordinal)。当我们进行聚合查询的时候,es会把这个字典值跟真正字段值的映射字典加载到内存中,并对结果集做映射,转化为真正的值。这份映射关系是shard级别的,为这个shard里面是所有segment服务,这也是global的体现。
(2)detail
字典关系是lazy init的,只有第一次使用的时候才会加载到内存中。在es的内存表现中提现成fielddata,这也是全keyword的index为什么也会有fielddata使用的原因。只会加载命中的segment的字典不会加载全部。字典关系在shard被触发refresh以后就会失效。下次使用的时候需要再重新构建。所以可以提高refresh_interval的值,减少fresh频率提高字典的生存时间。
2、eager_global_ordinals
(1)what’s this?
当在global ordinals的时候,refresh以后下一次查询字典就需要重新构建,在追求查询的场景下很影响查询性能。可以使用eager_global_ordinals,即在每次refresh以后即可更新字典,字典常驻内存,减少了查询的时候构建字典的耗时。
(2)使用场景
因为这份字典需要常驻内存,并且每次refresh以后就会重构,所以增大了内存以及cpu的消耗。推荐在低写高查、数据量不大的index中使用。
(3)使用
PUT my_index/_mapping{"properties": {"tags": {"type": "keyword", "eager_global_ordinals": true } }}
3、execution_hint
(1)what’ this?
上面介绍了global ordinal的使用场景,是doc_values以及fileddata的默认数据架构。除了这种模式,还可以选择map模式。即不再使用字典而是直接把值加载到内存中计算,减去了构建字典的耗时。当查询的结果集很小的情况下,可以使用map的模式不去构建字典。使用map还是global_ordinals的取决于构建字典的开销与加载原始字典的开销。当结果集大到一定程序,map的内存开销的代价可能抵消了构建字典的开销。
(2)how to use?
GET /_search{"aggs" : {"tags" : {"terms" : {"field" : "tags", "execution_hint": "map" } } }}
4、Shard Size
为了提高该聚合的精确度,可以通过shard_size参数设置协调节点向各个分片请求的词根个数,然后在协调节点进行聚合,最后只返回size个词根给到客户端,shard_size >= size,如果shard_size设置小于size,ES会自动将其设置为size,默认情况下shard_size建议设置为(1.5 * size + 10)。
三、java实现对Es的curd
//match查询age是20的条件 QueryBuilders.matchQuery("age",20); //term查询age是20的条件 QueryBuilders.termQuery("age",20); //terms查询age是20或者200,或者50的条件 QueryBuilders.termsQuery("age",20,200,50); //query_string全文检索“xiumu” QueryBuilders.queryStringQuery("xiumu"); //query_string检索username是“xiumu” QueryBuilders.queryStringQuery("xiumu").field("username"); //multi_match查询字段username或者description的值是xiumu QueryBuilders.multiMatchQuery("xiumu","username","description"); //range查询字段age的范围是在[18-200]之间 QueryBuilders.rangeQuery("age").gte(18).lte(200); //exits查询字段age有值 QueryBuilders.existsQuery("age"); //wildcard查询字段description是以“男人”结尾的 QueryBuilders.wildcardQuery("description","*男人"); //bool查询 BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery(); //bool查询里添加must多个条件 boolQueryBuilder.must(QueryBuilders.termQuery("age",20)); boolQueryBuilder.must(QueryBuilders.existsQuery("age")); boolQueryBuilder.must(QueryBuilders.wildcardQuery("description","*男人"));
(1)统计某个字段的数量 ValueCountBuilder vcb= AggregationBuilders.count("count_uid").field("uid");(2)去重统计某个字段的数量(有少量误差) CardinalityBuilder cb= AggregationBuilders.cardinality("distinct_count_uid").field("uid");(3)聚合过滤FilterAggregationBuilder fab= AggregationBuilders.filter("uid_filter").filter(QueryBuilders.queryStringQuery("uid:001"));(4)按某个字段分组TermsBuilder tb= AggregationBuilders.terms("group_name").field("name");(5)求和SumBuilder sumBuilder=AggregationBuilders.sum("sum_price").field("price");(6)求平均AvgBuilder ab= AggregationBuilders.avg("avg_price").field("price");(7)求最大值MaxBuilder mb= AggregationBuilders.max("max_price").field("price"); (8)求最小值MinBuilder min=AggregationBuilders.min("min_price").field("price");(9)按日期间隔分组DateHistogramBuilder dhb= AggregationBuilders.dateHistogram("dh").field("date");(10)获取聚合里面的结果TopHitsBuilder thb= AggregationBuilders.topHits("top_result");(11)嵌套的聚合NestedBuilder nb= AggregationBuilders.nested("negsted_path").path("quests");(12)反转嵌套AggregationBuilders.reverseNested("res_negsted").path("kps ");
package com.robin.elasticsearch;import com.alibaba.fastjson.JSON;import com.robin.elasticsearch.common.DataUtil;import com.robin.elasticsearch.entity.User;import com.robin.elasticsearch.entity.User2;import com.robin.elasticsearch.entity.User3;import com.robin.elasticsearch.reponsity.UserMapper;import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;import org.elasticsearch.action.bulk.BulkRequest;import org.elasticsearch.action.bulk.BulkResponse;import org.elasticsearch.action.delete.DeleteRequest;import org.elasticsearch.action.delete.DeleteResponse;import org.elasticsearch.action.get.GetRequest;import org.elasticsearch.action.get.GetResponse;import org.elasticsearch.action.index.IndexRequest;import org.elasticsearch.action.index.IndexResponse;import org.elasticsearch.action.search.SearchRequest;import org.elasticsearch.action.search.SearchResponse;import org.elasticsearch.action.support.master.AcknowledgedResponse;import org.elasticsearch.action.update.UpdateRequest;import org.elasticsearch.action.update.UpdateResponse;import org.elasticsearch.client.RequestOptions;import org.elasticsearch.client.RestHighLevelClient;import org.elasticsearch.client.indices.CreateIndexRequest;import org.elasticsearch.client.indices.CreateIndexResponse;import org.elasticsearch.client.indices.GetIndexRequest;import org.elasticsearch.common.xcontent.XContentType;import org.elasticsearch.index.query.*;import org.elasticsearch.rest.RestStatus;import org.elasticsearch.search.SearchHit;import org.elasticsearch.search.aggregations.Aggregation;import org.elasticsearch.search.aggregations.AggregationBuilders;import org.elasticsearch.search.aggregations.Aggregations;import org.elasticsearch.search.aggregations.BucketOrder;import org.elasticsearch.search.aggregations.bucket.filter.Filters;import org.elasticsearch.search.aggregations.bucket.filter.FiltersAggregationBuilder;import org.elasticsearch.search.aggregations.bucket.filter.FiltersAggregator;import org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramAggregationBuilder;import org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramInterval;import org.elasticsearch.search.aggregations.bucket.histogram.Histogram;import org.elasticsearch.search.aggregations.bucket.histogram.HistogramAggregationBuilder;import org.elasticsearch.search.aggregations.bucket.range.DateRangeAggregationBuilder;import org.elasticsearch.search.aggregations.bucket.range.IpRangeAggregationBuilder;import org.elasticsearch.search.aggregations.bucket.range.Range;import org.elasticsearch.search.aggregations.bucket.range.RangeAggregationBuilder;import org.elasticsearch.search.aggregations.bucket.terms.Terms;import org.elasticsearch.search.aggregations.bucket.terms.TermsAggregationBuilder;import org.elasticsearch.search.aggregations.metrics.*;import org.elasticsearch.search.builder.SearchSourceBuilder;import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;import org.elasticsearch.search.sort.FieldSortBuilder;import org.elasticsearch.search.sort.SortBuilders;import org.elasticsearch.search.sort.SortOrder;import org.junit.jupiter.api.Test;import org.springframework.beans.factory.annotation.Autowired;import org.springframework.boot.test.context.SpringBootTest;import org.springframework.data.domain.PageRequest;import org.springframework.data.elasticsearch.core.AggregationsContainer;import org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate;import org.springframework.data.elasticsearch.core.SearchHits;import org.springframework.data.elasticsearch.core.clients.elasticsearch7.ElasticsearchAggregations;import org.springframework.data.elasticsearch.core.mapping.IndexCoordinates;import org.springframework.data.elasticsearch.core.query.NativeSearchQuery;import org.springframework.data.elasticsearch.core.query.NativeSearchQueryBuilder;import java.io.IOException;import java.util.ArrayList;import java.util.Iterator;import java.util.List;import java.util.Map;@SpringBootTestclass ESTests {@Autowired UserMapper userMapper; @Autowired RestHighLevelClient restHighLevelClient; @Autowired private ElasticsearchRestTemplate elasticsearchRestTemplate; //创建索引 @Test void createIndex() throws IOException {CreateIndexRequest request = new CreateIndexRequest("zkwz"); CreateIndexResponse exists = restHighLevelClient.indices().create(request, RequestOptions.DEFAULT); } //删除索引 @Test void isExistIndex() throws IOException {GetIndexRequest request = new GetIndexRequest("zkwz"); boolean exists = restHighLevelClient.indices().exists(request, RequestOptions.DEFAULT); System.out.println(exists); } //删除索引 @Test void deleteIndex() throws IOException {DeleteIndexRequest request = new DeleteIndexRequest("user"); AcknowledgedResponse delete = restHighLevelClient.indices().delete(request, RequestOptions.DEFAULT); System.out.println(delete.isAcknowledged()); } //创建文档 @Test void createDoc() throws IOException {IndexRequest request = new IndexRequest("user"); User user = new User("111", "cwx", 23); IndexRequest source = request.id("123").source(JSON.toJSONString(user), XContentType.JSON); String id = source.id(); IndexResponse index = restHighLevelClient.index(request, RequestOptions.DEFAULT); System.out.println(index); } //获取文档的信息 @Test void getDocInfo() throws IOException {//创建获取文档的请求 GetRequest request = new GetRequest("user", "123"); boolean exists = restHighLevelClient.exists(request, RequestOptions.DEFAULT); System.out.println(exists); GetResponse documentFields = restHighLevelClient.get(request, RequestOptions.DEFAULT); String source = documentFields.getSourceAsString(); System.out.println(source); } //根据id更新文档 @Test void updateDocById() throws IOException {UpdateRequest request = new UpdateRequest("user", "123"); request.doc(JSON.toJSONString(new User("111", "cwx", 33)), XContentType.JSON); UpdateResponse update = restHighLevelClient.update(request, RequestOptions.DEFAULT); RestStatus status = update.status(); System.out.println(status); } //根据id删除文档 @Test void deleteDocById() throws IOException {DeleteRequest request = new DeleteRequest("user", "123"); DeleteResponse delete = restHighLevelClient.delete(request, RequestOptions.DEFAULT); RestStatus status = delete.status(); System.out.println(status); } //批量创建文档 @Test void createDocBulk() throws IOException {BulkRequest request = new BulkRequest(); User user1 = new User("777", "thegoodmen", 30); User user2 = new User("888", "thegoodwomen", 50); User user3 = new User("999", "thewellboy", 23); User user4 = new User("100", "thewellgirl", 43); User user5 = new User("121", "thegirlmen", 12); ArrayListlist = new ArrayList<>(); list.add(user1); list.add(user2); list.add(user3); list.add(user4); list.add(user5); for (User user : list) {request.add(new IndexRequest("user").source(JSON.toJSONString(user), XContentType.JSON)); } BulkResponse bulk = restHighLevelClient.bulk(request, RequestOptions.DEFAULT); boolean status = bulk.hasFailures(); System.out.println(status); } //基本查询 @Test void selectByAge() throws IOException {SearchRequest request = new SearchRequest(); SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder(); //matchQuery是分词之后再查询的、termQuery是直接进行匹配(但是这个查询不到数据,换这个可以matchPhraseQuery) MatchQueryBuilder builder = QueryBuilders.matchQuery("name", "good"); searchSourceBuilder.query(builder); request.source(searchSourceBuilder); SearchResponse search = restHighLevelClient.search(request, RequestOptions.DEFAULT); Iteratoriterator = search.getHits().iterator(); while (iterator.hasNext()) {System.out.println(iterator.next()); } } //单条件查询 @Test void select1() throws IOException {//左右模糊查询,相当于MySQL的%good% SearchHitssearchHits = elasticsearchRestTemplate.search( new NativeSearchQuery(QueryBuilders.queryStringQuery("good").field("name")), User.class, IndexCoordinates.of("user")); //精准匹配,相当于name="good" SearchHitssearchHits1 = elasticsearchRestTemplate.search( new NativeSearchQuery(QueryBuilders.termQuery("name", "good")), User.class, IndexCoordinates.of("user")); //普通匹配,相当于name like "%good%" SearchHitssearchHits2 = elasticsearchRestTemplate.search( new NativeSearchQuery(QueryBuilders.matchQuery("name", "good")), User.class, IndexCoordinates.of("user")); //模糊匹配 SearchHitssearchHits3 = elasticsearchRestTemplate.search( new NativeSearchQuery(QueryBuilders.fuzzyQuery("name", "good")), User.class, IndexCoordinates.of("user")); //前缀匹配 SearchHitssearchHits6 = elasticsearchRestTemplate.search( new NativeSearchQuery(QueryBuilders.prefixQuery("name", "men")), User.class, IndexCoordinates.of("user")); //多字段的模糊匹配 SearchHitssearchHits7 = elasticsearchRestTemplate.search( new NativeSearchQuery(QueryBuilders.multiMatchQuery("aaa", "name")), User.class, IndexCoordinates.of("user")); //前缀匹配 SearchHitssearchHits8 = elasticsearchRestTemplate.search( new NativeSearchQuery(QueryBuilders.wildcardQuery("name", "m*n")), User.class, IndexCoordinates.of("user")); //多内容多字段的模糊匹配 String[] strings = {"good", "men"}; SearchHitssearchHits9 = elasticsearchRestTemplate.search( new NativeSearchQuery(QueryBuilders.moreLikeThisQuery(new String[]{"aaa"}, MoreLikeThisQueryBuilder.Item.EMPTY_ARRAY)), User.class, IndexCoordinates.of("user")); for (org.springframework.data.elasticsearch.core.SearchHite1 : searchHits9.getSearchHits()) {System.out.println(e1.getContent()); } } //多条件查询 @Test void select2() throws IOException {//数值型、日期、Ip的范围查询,相当于 between 80 and 90 ,80包含,90不包含 SearchHitssearchHits1 = elasticsearchRestTemplate.search( new NativeSearchQuery(QueryBuilders.boolQuery().must(QueryBuilders.rangeQuery("age").from(80).to(90).includeLower(true).includeUpper(false))), User.class, IndexCoordinates.of("user")); //范围查询、大于50小于等于90 SearchHitssearchHits2 = elasticsearchRestTemplate.search( new NativeSearchQuery(QueryBuilders.boolQuery().must(QueryBuilders.rangeQuery("age").gt(50).lte(90))), User.class, IndexCoordinates.of("user")); for (org.springframework.data.elasticsearch.core.SearchHite1 : searchHits1.getSearchHits()) {System.out.println(e1.getContent()); } } //多条件查询 @Test void select3() {// 排序 FieldSortBuilder price = SortBuilders.fieldSort("age").order(SortOrder.DESC); // 高亮显示 HighlightBuilder highlightBuilder = new HighlightBuilder().field("name"); // 分页 PageRequest pageRequest = PageRequest.of(0, 2); // 查询条件 BoolQueryBuilder queryBuilder1 = new BoolQueryBuilder().must(new MatchQueryBuilder("name", "陈万祥")); BoolQueryBuilder queryBuilder2 = new BoolQueryBuilder().must(new MatchPhraseQueryBuilder("loc", "武威市")); BoolQueryBuilder queryBuilder3 = new BoolQueryBuilder().must(QueryBuilders.queryStringQuery("兰州市").field("name")); //求和 SumAggregationBuilder field = AggregationBuilders.sum("sum").field("age"); //求平均值 AvgAggregationBuilder avgAggregationBuilder = AggregationBuilders.avg("avg").field("age"); ValueCountAggregationBuilder countAggregationBuilder = AggregationBuilders.count("count").field("name"); // 组装上述查询条件 NativeSearchQuery searchQuery = new NativeSearchQueryBuilder().withQuery(queryBuilder1) .withSorts(price) .withHighlightBuilder(highlightBuilder) .withPageable(pageRequest) .build(); NativeSearchQuery searchQuery2 = new NativeSearchQueryBuilder() .withAggregations(field) .withAggregations(avgAggregationBuilder) .withAggregations(countAggregationBuilder) .build(); SearchHitsuser = elasticsearchRestTemplate.search(searchQuery, User2.class, IndexCoordinates.of("user")); for (org.springframework.data.elasticsearch.core.SearchHithit : user.getSearchHits()) {System.out.println(hit.getContent()); MaphighlightFields = hit.getHighlightFields(); System.out.println(highlightFields.toString()); } AggregationsContainer aggregations = user.getAggregations(); ElasticsearchAggregations aggregations1 = (ElasticsearchAggregations) aggregations; Aggregations aggregations2; if (aggregations1 != null) {aggregations2 = aggregations1.aggregations(); Sum sum = aggregations2.get("sum"); Avg avg = aggregations2.get("avg"); ValueCount count = aggregations2.get("count"); System.out.println(DataUtil.rmZeroSuper(sum.getValue())); System.out.println(DataUtil.rmZeroSuper(avg.getValue())); System.out.println(DataUtil.rmZeroSuper(count.getValue())); } } //最基础的聚合查询 @Test void select4() throws IOException {SearchRequest request = new SearchRequest("user"); SearchSourceBuilder sourceBuilder = new SearchSourceBuilder(); sourceBuilder.query(QueryBuilders.matchQuery("name", "yyy")); TermsAggregationBuilder termsAggregationBuilder = AggregationBuilders.terms("ageAvg").field("age").size(10); sourceBuilder.aggregation(termsAggregationBuilder); AvgAggregationBuilder sumAggregationBuilder = AggregationBuilders.avg("avg").field("age"); sourceBuilder.aggregation(sumAggregationBuilder); request.source(sourceBuilder); SearchResponse searchResponse = restHighLevelClient.search(request, RequestOptions.DEFAULT); Aggregations aggregations = searchResponse.getAggregations(); Avg ageAvg = aggregations.get("avg"); System.out.println(ageAvg.getValue()); Terms terms = aggregations.get("ageAvg"); for (Terms.Bucket e1 : terms.getBuckets()) {System.out.println("年龄是" + e1.getKeyAsString() + "=====>" + "人数是" + e1.getDocCount()); } } //先分组再得到不同分组的指标查询 @Test void select5() {//先分组,并按指标排序 TermsAggregationBuilder termsAggregationBuilder1 = AggregationBuilders.terms("team_count").field("team").order(BucketOrder.aggregation("max_age", true)); termsAggregationBuilder1.subAggregation(AggregationBuilders.max("max_age").field("age")); termsAggregationBuilder1.subAggregation(AggregationBuilders.avg("avg_age").field("age")); termsAggregationBuilder1.subAggregation(AggregationBuilders.min("min_age").field("age")); termsAggregationBuilder1.subAggregation(AggregationBuilders.sum("sum_age").field("age")); NativeSearchQuery searchQuery = new NativeSearchQueryBuilder() .withAggregations(termsAggregationBuilder1) .build(); SearchHitsuser = elasticsearchRestTemplate.search(searchQuery, User3.class, IndexCoordinates.of("user1")); for (org.springframework.data.elasticsearch.core.SearchHithit : user.getSearchHits()) {System.out.println(hit.getContent()); } AggregationsContainer aggregations = user.getAggregations(); ElasticsearchAggregations aggregations1 = (ElasticsearchAggregations) aggregations; Aggregations aggregations2; if (aggregations1 != null) {aggregations2 = aggregations1.aggregations(); Terms terms = aggregations2.get("team_count"); //获取聚合值的两种方法 for (Terms.Bucket bucket : terms.getBuckets()) {Max max_age = bucket.getAggregations().get("max_age"); Min min_age = bucket.getAggregations().get("min_age"); Sum sum_age = bucket.getAggregations().get("sum_age"); Avg avg_age = bucket.getAggregations().get("avg_age"); System.out.println(bucket.getKeyAsString() + "的最大值是:" + max_age.getValue() + "的最小值是:" + min_age.getValue() + "的和是:" + sum_age.getValue() + "的平均值是:" + avg_age.getValue()); } for (Terms.Bucket bucket : terms.getBuckets()) {MapasMap = bucket.getAggregations().asMap(); double max_age = ((ParsedMax) asMap.get("max_age")).getValue(); double min_age = ((ParsedMin) asMap.get("min_age")).getValue(); double sum_age = ((ParsedSum) asMap.get("sum_age")).getValue(); double avg_age = ((ParsedAvg) asMap.get("avg_age")).getValue(); System.out.println(bucket.getKeyAsString() + "的最大值是:" + max_age + "的最小值是:" + min_age + "的和是:" + sum_age + "的平均值是:" + avg_age); } } } //区间分组查询 @Test void select6() {HistogramAggregationBuilder histogramAggregationBuilder = AggregationBuilders.histogram("his_age").field("age").interval(10); NativeSearchQuery searchQuery = new NativeSearchQueryBuilder() .withAggregations(histogramAggregationBuilder) .build(); SearchHitsuser = elasticsearchRestTemplate.search(searchQuery, User3.class, IndexCoordinates.of("user1")); for (org.springframework.data.elasticsearch.core.SearchHithit : user.getSearchHits()) {System.out.println(hit.getContent()); } AggregationsContainer aggregations = user.getAggregations(); ElasticsearchAggregations aggregations1 = (ElasticsearchAggregations) aggregations; Aggregations aggregations2; if (aggregations1 != null) {aggregations2 = aggregations1.aggregations(); Histogram terms = aggregations2.get("his_age"); for (Histogram.Bucket bucket : terms.getBuckets()) {System.out.println(bucket.getKeyAsString() + "的数量是:" + bucket.getDocCount()); } } } //时间区间分组查询 @Test void select7() {/* calendarInterval表示时间间隔是类型、minDocCount设置最小值、format设置输出时间的格式化、 missing当那个统计字段为null的时候,填充某个值,然后按照填充的值,统计计算 */ DateHistogramAggregationBuilder aggregationBuilder = AggregationBuilders.dateHistogram("his_birth") .field("birth") .calendarInterval(DateHistogramInterval.MONTH) .minDocCount(0) .format("yyyy-MM-dd") .missing("2021-07-27"); NativeSearchQuery searchQuery = new NativeSearchQueryBuilder() .withAggregations(aggregationBuilder) .build(); SearchHitsuser = elasticsearchRestTemplate.search(searchQuery, User3.class, IndexCoordinates.of("user1")); for (org.springframework.data.elasticsearch.core.SearchHithit : user.getSearchHits()) {System.out.println(hit.getContent()); } ElasticsearchAggregations aggregations = (ElasticsearchAggregations) user.getAggregations(); if (aggregations != null) {Histogram terms = aggregations.aggregations().get("his_birth"); for (Histogram.Bucket bucket : terms.getBuckets()) {System.out.println(bucket.getKeyAsString() + "的数量是:" + bucket.getDocCount()); } } } //指定范围区间分组查询 @Test void select8() {/* calendarInterval表示时间间隔是类型、minDocCount设置最小值、format设置输出时间的格式化、 missing当那个统计字段为null的时候,填充某个值,然后按照填充的值,统计计算 */ RangeAggregationBuilder aggregationBuilder = AggregationBuilders.range("range_age") .field("age") .addUnboundedTo(10) .addRange(11, 30) .addRange(31, 50) .addUnboundedFrom(100); NativeSearchQuery searchQuery = new NativeSearchQueryBuilder() .withAggregations(aggregationBuilder) .build(); SearchHitsuser = elasticsearchRestTemplate.search(searchQuery, User3.class, IndexCoordinates.of("user1")); for (org.springframework.data.elasticsearch.core.SearchHithit : user.getSearchHits()) {System.out.println(hit.getContent()); } ElasticsearchAggregations aggregations = (ElasticsearchAggregations) user.getAggregations(); if (aggregations != null) {Range terms = aggregations.aggregations().get("range_age"); for (Range.Bucket bucket : terms.getBuckets()) {System.out.println(bucket.getKeyAsString() + "的数量是:" + bucket.getDocCount()); //起始范围值 Object from = bucket.getFrom(); //终止范围值 Object to = bucket.getTo(); System.out.println(from + "\t" + to); } } } //指定时间范围区间分组查询 @Test void select9() {DateRangeAggregationBuilder aggregationBuilder = AggregationBuilders.dateRange("data_range_birth") .field("birth") .format("yyyy") .addUnboundedTo("2000") .addRange("2001", "2010") .addRange("2011", "2020") .addUnboundedFrom("2021"); NativeSearchQuery searchQuery = new NativeSearchQueryBuilder() .withAggregations(aggregationBuilder) .build(); SearchHitsuser = elasticsearchRestTemplate.search(searchQuery, User3.class, IndexCoordinates.of("user1")); for (org.springframework.data.elasticsearch.core.SearchHithit : user.getSearchHits()) {System.out.println(hit.getContent()); } ElasticsearchAggregations aggregations = (ElasticsearchAggregations) user.getAggregations(); if (aggregations != null) {Range terms = aggregations.aggregations().get("data_range_birth"); for (Range.Bucket bucket : terms.getBuckets()) {System.out.println(bucket.getKeyAsString() + "的数量是:" + bucket.getDocCount()); } } } //指定ip区间分组查询 @Test void select10() {//TODO 这个地方的ip类型不能用来聚合,加上.keyword也不行,不知如何解决 IpRangeAggregationBuilder aggregationBuilder = AggregationBuilders.ipRange("ip_range") .field("ip") .addUnboundedTo("192.168.0.0") .addRange("192.168.0.1", "192.168.0.122") .addRange("192.168.5.0", "192.168.7.134") .addUnboundedFrom("192.168.12.122"); NativeSearchQuery searchQuery = new NativeSearchQueryBuilder() .withAggregations(aggregationBuilder) .build(); SearchHitsuser = elasticsearchRestTemplate.search(searchQuery, User3.class, IndexCoordinates.of("user1")); for (org.springframework.data.elasticsearch.core.SearchHithit : user.getSearchHits()) {System.out.println(hit.getContent()); } ElasticsearchAggregations aggregations = (ElasticsearchAggregations) user.getAggregations(); if (aggregations != null) {Range terms = aggregations.aggregations().get("ip_range"); for (Range.Bucket bucket : terms.getBuckets()) {System.out.println(bucket.getKeyAsString() + "的数量是:" + bucket.getDocCount()); } } } //指定过滤条件分组查询 @Test void select11() {FiltersAggregationBuilder aggregationBuilder = AggregationBuilders.filters("filter", new FiltersAggregator.KeyedFilter("men", QueryBuilders.termQuery("sex", "true")), new FiltersAggregator.KeyedFilter("women", QueryBuilders.termQuery("sex", "false"))); // 高亮显示 HighlightBuilder highlightBuilder = new HighlightBuilder().field("sex").preTags("").postTags(""); NativeSearchQuery searchQuery = new NativeSearchQueryBuilder() .withAggregations(aggregationBuilder) .withHighlightBuilder(highlightBuilder) .build(); SearchHitsuser = elasticsearchRestTemplate.search(searchQuery, User3.class, IndexCoordinates.of("user1")); for (org.springframework.data.elasticsearch.core.SearchHithit : user.getSearchHits()) {MaphighlightFields = hit.getHighlightFields(); System.out.println(hit.getContent()); System.out.println(highlightFields.toString()); } ElasticsearchAggregations aggregations = (ElasticsearchAggregations) user.getAggregations(); if (aggregations != null) {Filters terms = aggregations.aggregations().get("filter"); for (Filters.Bucket bucket : terms.getBuckets()) {System.out.println(bucket.getKeyAsString() + "的数量是:" + bucket.getDocCount()); } } }}
四、命令的方式实现对Es的curd
DELETE /user3PUT /user{"settings": {"number_of_replicas": 2, "number_of_shards": 3 }, "mappings": {"properties": {"name":{"type": "text", "analyzer": "ik_max_word" }, "sex":{"type": "boolean" }, "age":{"type": "long" }, "loc":{"type": "text", "analyzer": "ik_max_word" }, "team":{"type": "keyword" }, "birth":{"type": "date" }, "id":{"type": "ip" }, "country":{"type": "keyword" }, "habby":{"type": "text" }, "balance":{"type": "float" } } }}POST /user/_doc/{"name":"陈万祥", "age":12, "sex":true, "loc":"甘肃省武威市", "team":"red", "birth":"1998-01-10", "ip":"192.168.0.3", "country":"china", "habby":["basketball","sing","dance"], "balance":8700.12}POST /user/_doc/{"name":"陈万祥", "age":122, "sex":true, "loc":"甘肃省武威市", "team":"blue", "birth":"1998-01-10", "ip":"192.168.0.", "country":"france", "habby":["basketball","sing"], "balance":82111}POST /user/_doc/{"name":"劳霞明", "age":21, "sex":true, "loc":"广西钦州市", "team":"red", "birth":"1997-01-10", "ip":"192.168.3.11", "country":"china", "habby":["swming","dance"], "balance":800.12}POST /user/_doc/{"name":"汪明", "age":24, "sex":false, "loc":"天津市", "team":"green", "birth":"1998-02-10", "ip":"192.165.0.3", "country":"china", "habby":["sing","read"], "balance":9000}POST /user/_doc/{"name":"蔡徐坤", "age":12, "sex":true, "loc":"甘肃省兰州市", "team":"green", "birth":"1998-03-10", "ip":"192.167.0.0", "country":"france", "habby":["game","eat"], "balance":2300}POST /user/_doc/{"name":"爱英斯坦", "age":45, "sex":true, "loc":"甘肃省张掖市", "team":"blue", "birth":"2003-06-10", "ip":"192.168.0.32", "country":"england", "habby":["basketball","sing","dance"], "balance":6900}POST /user/_doc/{"name":"欧拉", "age":65, "sex":false, "loc":"甘肃省天水市", "team":"blue", "birth":"2021-08-10", "ip":"192.168.6.3", "country":"china", "habby":["basketball","dance"], "balance":4300}POST /user/_doc/{"name":"上帝", "age":22, "sex":false, "loc":"西方世界", "team":"black", "ip":"192.168.3.3", "country":"england", "habby":["eat","game","sing"], "balance":12000.12}GET /user/_doc/_searchGET /user/_doc/_search{"query":{"term":{"name":"陈万祥" } }, "from":0, "size":10, "sort":[{"age":"desc" }], "highlight":{"fields":{"name":{} } }}#求出user索引的和GET /user/_doc/_search{"aggs":{"user1_sum":{"sum":{"field":"age" } } }}#求出user索引的最大值GET /user/_doc/_search{"size":0, "aggs":{"user1_max":{"max":{"field":"age" } } }}#根据team字段聚合分组、并得到每个分组的求和、最大值、最小值、平均值GET /user/_doc/_search{"size":0, "aggs":{"user1_term":{"terms":{"field":"team" }, "aggs":{"term_sum":{"sum":{"field":"age" } }, "term_max":{"max":{"field":"age" }}, "term_min":{"min":{"field":"age" }}, "term_avg":{"avg":{"field":"age" }} } } }}#查看工资范围的百分比GET /user/_search{"size": 0, "aggs": {"balance_a": {"percentile_ranks": {"field": "balance", "values": [2000,4000,6000,8000,10000] } } }}#查看一些计数、最小值、最大值、平均值、求和分别是多少GET /user/_search{"size": 0, "aggs": {"balance_a": {"stats": {"field": "balance" } } }}#查看一些其他的指标结果GET /user/_search{"size": 0, "aggs": {"balance_a": {"extended_stats": {"field": "age" } } }}#查看对字段sex的聚合GET /user/_search{"size": 0, "aggs": {"sex_a": {"terms": {"field": "sex" } } }}#聚合并排序GET /user/_search{"size": 0, "aggs": {"sex_a": {"terms": {"field": "team", "order": {"max_a": "desc" } }, "aggs":{"max_a":{"max":{"field": "age" } } } } }}#聚合支持脚本GET /user/_search{"size": 0, "aggs": {"sex_a": {"terms": {"script": "doc["team"].value" } } }}#筛选出包含指定词语和不包含的GET /user/_search{"size": 0, "aggs": {"term_a": {"terms": {"field": "age", "include": "*.市" } } }}#------------------------------------------------DELETE user3#多字段聚合的实现方式:#1、通过脚本合并字段#2、使用copy_to方法合并两个字段PUT /user3{"mappings": {"properties": {"name":{"type": "text", "fielddata": true } } }}PUT /user3{"mappings": {"properties": {"name":{"type": "text", "fields": {"keyword":{"type":"keyword" } } } } }}POST /user3/_doc{"name":"hello word"}GET /user3/_search{"query": {"match": {"name": "hello" } }, "aggs":{"testq":{"terms":{"field":"name.keyword" } } }}PUT test3POST test3/stu{"name":"cwx"}POST test3/tes{"name":"lxm"}
五、集群
六、优化
1、聚合为什么慢?
大多数时候对单个字段的聚合查询还是非常快的, 但是当需要同时聚合多个字段时,就可能会产生大量的分组,最终结果就是占用 es 大量内存,从而导致 OOM 的情况发生。 实践应用发现,以下情况都会比较慢: 1)待聚合文档数比较多(千万、亿、十亿甚至更多); 2)聚合条件比较复杂(多重条件聚合); 3)全量聚合(翻页的场景用)。
2、聚合优化方案探讨
参考链接:https://blog.csdn.net/laoyang360/article/details/79253294
优化方案一:默认深度优先聚合改为广度优先聚合。
"collect_mode" : "breadth_first"
depth_first 直接进行子聚合的计算 breadth_first 先计算出当前聚合的结果,针对这个结果在对子聚合进行计算。 优化方案二: 每一层terms aggregation内部加一个 “execution_hint”: “map”。
"execution_hint": "map"
Map方式的结论可简要概括如下: 1)查询结果直接放入内存中构建map,在查询结果集小的场景下,速度极快; 2)但如果待结果集合很大的情况,map方式不一定也快。
3、运用 shard_size 来提高 term aggregation 的精度
参考链接:https://blog.csdn.net/UbuntuTouch/article/details/104141398
请求的大小(size)越大,结果将越准确,但计算最终结果的成本也将越高(这两者都是由于在分片级别上管理的优先级队列更大,并且节点和客户端之间的数据传输也更大)。
shard_size 参数可用于最大程度地减少请求的大小带来的额外工作。 定义后,它将确定协调节点将从每个分片请求多少个术语。 一旦所有分片都做出响应,协调节点便会将它们缩减为最终结果,该最终结果将基于size参数-这样一来,可以提高返回条款的准确性,并避免流回大量存储桶的开销给客户。
注意:shard_size 不能小于 size(因为意义不大)。 启用时,Elasticsearch 将覆盖它并将其重置为等于大小。
缺省 shard_size为(size* 1.5 + 10)。
Terms aggregation 对于大量数据来说通常是不精确的
我们先来看一下如下的一个图:
从上面的图中,在 shard_size 为3的情况下,我们想对 geoip.country_name 这个字段来进行 terms aggregation:
从 shard 0 中提取文档数靠前的前三个,它们分别是 USA,India 及 France。它们的文档数分别是5,4及4。 从 shard 1 中提取文档数靠前的前单个,它们分别是 USA,India 及 Japan。它们的文档数分别是4,5及3。 那么总的文档数是:
USA为:5 + 4 = 9India为:4 + 5 = 9France为:4 + 0 = 4Japan为: 3 + 0 = 3
根据上面的计算,返回的结果将会是 USA,India 及 France。细心的开发者可能马上可以看出来,在上面的统计中国其实是不精确的,这是因为在 shard 0 中,我们可以看见 Japan 有3个文档没有被统计进去。这个统计是基于我们对 shard_size 为3的情况。假如我们把 shard_size 提供到4,情况马上就会不同,而且更加接近我们的实际的统计数据的结果。在这种情况下,Japan 将会有 3 + 6 共6很个文档,应该是排名第3。
我们可以修改我们的请求如下:
GET logs_server*/_search{ "size": 0, "aggs": { "top_10_urls": { "terms": { "field": "geoip.country_name.keyword", "size": 10, "shard_size": 100 } } }}
我们可以通过增加 shard_size 来提高数据的精确性,但是必须注意的是这样的代价是计算的成本增加,特别是针对大量数据而言。