Lucene's IndexSearcher supports a complete family of span queries built on the SpanQuery class, whose subclass hierarchy roughly mirrors Lucene's Query hierarchy. In this chapter we will look at the common SpanQuery subclasses and work through usage examples.

A SpanQuery matches spans: the start and end token positions of terms within a field. Its common subclasses are listed in the table below:
SpanQuery type | Description |
---|---|
FieldMaskingSpanQuery | Masks one field so it appears to be another, making the query look like it runs inside a single field. Lucene's span queries normally combine clauses from one field only and do not support cross-field matching, which is exactly the gap FieldMaskingSpanQuery fills |
SpanTermQuery | Meant to be combined with the other span query types; used on its own it is equivalent to TermQuery, the only difference being that SpanTermQuery also exposes the span (position) information of each matched Term |
SpanNearQuery | Matches Terms within a given distance of each other, i.e. how many positions it takes to get from one Term to the next. The slop factor limits the maximum distance between the Terms. A second parameter, inOrder, controls whether reverse-order matching is allowed: TermA need not appear to the left of TermB; matching right-to-left is a reverse match. inOrder = true means order matters and only forward matches count; false allows either direction. Whether stop words count toward the slop depends on the analyzer: if it preserves position increments for removed stop words, they still consume positions |
SpanFirstQuery | Matches term spans that fall within the range [0, n] at the start of a field; n is the upper bound on where the matched span may end |
SpanContainingQuery | Returns matches of one query (big) that contain a match of another (little); both clauses can be any span query type. When a little match is contained, the span returned is the big match's span. For example, given "a beautiful and boring world", with big = SpanNearQuery(SpanTermQuery("beautiful"), SpanTermQuery("world")) with slop 2 and little = SpanTermQuery("boring"), the document is a hit and the big span is returned, i.e. big takes precedence |
SpanWithinQuery | Similar to SpanContainingQuery, except the little match's span is returned instead. In other words, SpanContainingQuery keeps matches (from big) that contain another span, while SpanWithinQuery keeps matches (from little) that are contained within another span |
SpanNotQuery | Excludes spans that overlap another span query. A typical use: with SpanNearQuery, TermA or TermB may repeat in the index, producing several candidate spans from TermA to TermB. SpanNotQuery restricts the result to spans that do not contain TermC, ruling out unwanted cases for more precise control |
SpanOrQuery | Nests several span subqueries and combines them with OR logic |
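The demo class later in this chapter does not exercise SpanContainingQuery or SpanWithinQuery, so here is a minimal, self-contained sketch of the table's "a beautiful and boring world" example. It assumes the same Lucene 6-era API as the rest of this chapter; the class name is made up for illustration:

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.spans.*;
import org.apache.lucene.store.RAMDirectory;

public class ContainingVsWithinDemo {

    static long[] run() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new WhitespaceAnalyzer()));
        Document doc = new Document();
        doc.add(new TextField("f", "a beautiful and boring world", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // big: "beautiful ... world" within a slop of 2 (covers positions 1..4)
        SpanQuery big = new SpanNearQuery(new SpanQuery[]{
                new SpanTermQuery(new Term("f", "beautiful")),
                new SpanTermQuery(new Term("f", "world"))}, 2, true);
        // little: "boring" at position 3, inside the big span
        SpanQuery little = new SpanTermQuery(new Term("f", "boring"));

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        long containing = searcher.search(new SpanContainingQuery(big, little), 10).totalHits;
        long within = searcher.search(new SpanWithinQuery(big, little), 10).totalHits;
        return new long[]{containing, within};
    }

    public static void main(String[] args) throws Exception {
        long[] hits = run();
        // both queries match this document; they differ only in which span they yield
        System.out.println("containing=" + hits[0] + " within=" + hits[1]);
    }
}
```

Both queries hit the same document; the difference only becomes visible when you enumerate the spans: SpanContainingQuery yields the big span (beautiful … world), while SpanWithinQuery yields the little span (boring).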
As for how to use these SpanQuery types, let's go straight to the code:
```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.search.spans.*;
import org.apache.lucene.store.RAMDirectory;
import org.junit.After;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;
import java.io.StringReader;

public class SpanQueryDemo {

    private RAMDirectory directory;
    private IndexSearcher indexSearcher;
    private IndexReader indexReader;
    private SpanTermQuery quick;
    private SpanTermQuery brown;
    private SpanTermQuery red;
    private SpanTermQuery fox;
    private SpanTermQuery lazy;
    private SpanTermQuery sleepy;
    private SpanTermQuery dog;
    private SpanTermQuery cat;
    private Analyzer analyzer;
    private IndexWriter indexWriter;
    private IndexWriterConfig indexWriterConfig;

    @Before
    public void setUp() throws IOException {
        directory = new RAMDirectory();
        analyzer = new WhitespaceAnalyzer();
        indexWriterConfig = new IndexWriterConfig(analyzer);
        indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        indexWriter = new IndexWriter(directory, indexWriterConfig);

        // TextField indexes documents, frequencies and positions; Store.YES also stores the raw value
        Document document = new Document();
        document.add(new TextField("f", "What's amazing, the quick brown fox jumps over the lazy dog", Field.Store.YES));
        indexWriter.addDocument(document);

        document = new Document();
        document.add(new TextField("f", "Wow! the quick red fox jumps over the sleepy cat", Field.Store.YES));
        indexWriter.addDocument(document);
        indexWriter.commit();

        indexSearcher = new IndexSearcher(DirectoryReader.open(directory));
        indexReader = indexSearcher.getIndexReader();

        quick = new SpanTermQuery(new Term("f", "quick"));
        brown = new SpanTermQuery(new Term("f", "brown"));
        red = new SpanTermQuery(new Term("f", "red"));
        fox = new SpanTermQuery(new Term("f", "fox"));
        lazy = new SpanTermQuery(new Term("f", "lazy"));
        dog = new SpanTermQuery(new Term("f", "dog"));
        sleepy = new SpanTermQuery(new Term("f", "sleepy"));
        cat = new SpanTermQuery(new Term("f", "cat"));
    }

    @After
    public void tearDown() {
        if (indexWriter != null && indexWriter.isOpen()) {
            try {
                indexWriter.close();
            } catch (IOException e) {
                System.out.println(e);
            }
        }
    }

    private void assertOnlyBrownFox(Query query) throws IOException {
        TopDocs search = indexSearcher.search(query, 10);
        Assert.assertEquals(1, search.totalHits);
        Assert.assertEquals("wrong doc", 0, search.scoreDocs[0].doc);
    }

    private void assertBothFoxes(Query query) throws IOException {
        TopDocs search = indexSearcher.search(query, 10);
        Assert.assertEquals(2, search.totalHits);
    }

    private void assertNoMatches(Query query) throws IOException {
        TopDocs search = indexSearcher.search(query, 10);
        Assert.assertEquals(0, search.totalHits);
    }

    /**
     * Print every span the query matches, bracketing the span with < and >.
     */
    private void dumpSpans(SpanQuery query) throws IOException {
        SpanWeight weight = query.createWeight(indexSearcher, true);
        Spans spans = weight.getSpans(indexReader.getContext().leaves().get(0), SpanWeight.Postings.POSITIONS);
        System.out.println(query);
        if (spans == null) { // none of the query's terms exist in this segment
            System.out.println("   No spans");
            return;
        }
        TopDocs search = indexSearcher.search(query, 10);
        float[] scores = new float[indexReader.maxDoc()];
        for (ScoreDoc sd : search.scoreDocs) {
            scores[sd.doc] = sd.score;
        }
        int numSpans = 0;
        // walk all spans in all matching documents
        while (spans.nextDoc() != Spans.NO_MORE_DOCS) {
            while (spans.nextStartPosition() != Spans.NO_MORE_POSITIONS) {
                numSpans++;
                int id = spans.docID();
                Document doc = indexReader.document(id); // retrieve the document
                // re-analyze the stored text so we can walk its tokens
                TokenStream stream = analyzer.tokenStream("f", new StringReader(doc.get("f")));
                OffsetAttribute offsetAttribute = stream.addAttribute(OffsetAttribute.class);
                stream.reset();
                int i = 0;
                StringBuilder sb = new StringBuilder("   ");
                // walk all tokens, marking the span boundaries
                while (stream.incrementToken()) {
                    if (i == spans.startPosition()) {
                        sb.append("<");
                    }
                    sb.append(offsetAttribute.toString());
                    if (i + 1 == spans.endPosition()) {
                        sb.append(">");
                    }
                    sb.append(" ");
                    i++;
                }
                sb.append("(").append(scores[id]).append(")");
                System.out.println(sb);
                stream.close();
            }
        }
        if (numSpans == 0) { // checked after the loop (the original checked inside it)
            System.out.println("   No spans");
        }
    }

    @Test
    public void testSpanTermQuery() throws IOException {
        assertOnlyBrownFox(brown);
        dumpSpans(brown);
        dumpSpans(new SpanTermQuery(new Term("f", "the")));
        dumpSpans(new SpanTermQuery(new Term("f", "fox")));
    }

    /**
     * SpanFirstQuery matches spans that end within the first n positions of a field.
     */
    @Test
    public void testSpanFirstQuery() throws IOException {
        // with the WhitespaceAnalyzer each token is one span; "brown" is the 5th token,
        // so it is not within the first 2 positions...
        SpanFirstQuery spanFirstQuery = new SpanFirstQuery(brown, 2);
        assertNoMatches(spanFirstQuery);
        dumpSpans(spanFirstQuery);

        // ...but it is within the first 5
        spanFirstQuery = new SpanFirstQuery(brown, 5);
        dumpSpans(spanFirstQuery);
        assertOnlyBrownFox(spanFirstQuery);
    }

    @Test
    public void testSpanNearQuery() throws IOException {
        SpanQuery[] queries = new SpanQuery[]{quick, brown, dog};
        SpanNearQuery spanNearQuery = new SpanNearQuery(queries, 0, true);
        assertNoMatches(spanNearQuery);
        dumpSpans(spanNearQuery);

        spanNearQuery = new SpanNearQuery(queries, 4, true);
        assertNoMatches(spanNearQuery);
        dumpSpans(spanNearQuery);

        spanNearQuery = new SpanNearQuery(queries, 5, true);
        assertOnlyBrownFox(spanNearQuery);
        dumpSpans(spanNearQuery);

        // with inOrder=false order is ignored: "lazy" and "fox" match as long as
        // there are no more than 3 positions between them
        spanNearQuery = new SpanNearQuery(new SpanQuery[]{lazy, fox}, 3, false);
        assertOnlyBrownFox(spanNearQuery);
        dumpSpans(spanNearQuery);

        // with inOrder=true ("lazy" before "fox") there is no match
        spanNearQuery = new SpanNearQuery(new SpanQuery[]{lazy, fox}, 3, true);
        assertNoMatches(spanNearQuery);
        dumpSpans(spanNearQuery);

        // the equivalent PhraseQuery needs a slop of 5 to match the reversed terms
        PhraseQuery phraseQuery = new PhraseQuery.Builder()
                .add(new Term("f", "lazy"))
                .add(new Term("f", "fox"))
                .setSlop(4)
                .build();
        assertNoMatches(phraseQuery);

        PhraseQuery phraseQuery1 = new PhraseQuery.Builder()
                .setSlop(5)
                .add(phraseQuery.getTerms()[0])
                .add(phraseQuery.getTerms()[1])
                .build();
        assertOnlyBrownFox(phraseQuery1);
    }

    @Test
    public void testSpanNotQuery() throws IOException {
        // slop of 1: at most one token between "quick" and "fox"
        SpanNearQuery quickFox = new SpanNearQuery(new SpanQuery[]{quick, fox}, 1, true);
        assertBothFoxes(quickFox);
        dumpSpans(quickFox);

        // first argument: spans to include; second argument: spans to exclude
        SpanNotQuery quickFoxDog = new SpanNotQuery(quickFox, dog);
        assertBothFoxes(quickFoxDog);
        dumpSpans(quickFoxDog);

        // excluding "red" leaves only one match
        SpanNotQuery noQuickRedFox = new SpanNotQuery(quickFox, red);
        assertOnlyBrownFox(noQuickRedFox);
        dumpSpans(noQuickRedFox);
    }

    @Test
    public void testSpanOrQuery() throws IOException {
        SpanNearQuery quickFox = new SpanNearQuery(new SpanQuery[]{quick, fox}, 1, true);
        SpanNearQuery lazyDog = new SpanNearQuery(new SpanQuery[]{lazy, dog}, 0, true);
        SpanNearQuery sleepyCat = new SpanNearQuery(new SpanQuery[]{sleepy, cat}, 0, true);

        SpanNearQuery quickFoxNearLazyDog = new SpanNearQuery(new SpanQuery[]{quickFox, lazyDog}, 3, true);
        assertOnlyBrownFox(quickFoxNearLazyDog);
        dumpSpans(quickFoxNearLazyDog);

        SpanNearQuery quickFoxNearSleepyCat = new SpanNearQuery(new SpanQuery[]{quickFox, sleepyCat}, 3, true);
        dumpSpans(quickFoxNearSleepyCat);

        // OR of the two near queries matches both documents
        SpanOrQuery or = new SpanOrQuery(new SpanQuery[]{quickFoxNearLazyDog, quickFoxNearSleepyCat});
        assertBothFoxes(or);
        dumpSpans(or);
    }

    /**
     * Security filtering: restrict results to documents the current user owns.
     */
    @Test
    public void testSecurityFilter() throws IOException {
        Document document = new Document();
        document.add(new StringField("owner", "eric", Field.Store.YES));
        document.add(new TextField("keywords", "A B of eric", Field.Store.YES));
        indexWriter.addDocument(document);

        document = new Document();
        document.add(new StringField("owner", "jobs", Field.Store.YES));
        document.add(new TextField("keywords", "A B of jobs", Field.Store.YES));
        indexWriter.addDocument(document);

        // the original reused the previous Document here, merging two owners into one doc
        document = new Document();
        document.add(new StringField("owner", "jack", Field.Store.YES));
        document.add(new TextField("keywords", "A B of jack", Field.Store.YES));
        indexWriter.addDocument(document);
        indexWriter.commit();

        TermQuery ownedByEric = new TermQuery(new Term("owner", "eric"));
        TermQuery keywordA = new TermQuery(new Term("keywords", "A"));

        // FILTER matches like MUST but does not contribute to scoring:
        // only documents owned by eric can be retrieved
        Query query = new BooleanQuery.Builder()
                .add(keywordA, BooleanClause.Occur.MUST)
                .add(ownedByEric, BooleanClause.Occur.FILTER)
                .build();
        System.out.println(query);
        // reopen the searcher so it sees the newly committed documents
        indexSearcher = new IndexSearcher(DirectoryReader.open(directory));
        TopDocs search = indexSearcher.search(query, 10);
        Assert.assertEquals(1, search.totalHits);
        System.out.println("FILTER query result: "
                + indexSearcher.doc(search.scoreDocs[0].doc).get("keywords"));

        // without the security filter, all three documents match
        query = new BooleanQuery.Builder().add(keywordA, BooleanClause.Occur.MUST).build();
        System.out.println(query);
        search = indexSearcher.search(query, 10);
        Assert.assertEquals(3, search.totalHits);

        // a BooleanQuery with two MUST clauses achieves the same restriction,
        // but the owner clause then participates in scoring
        BooleanQuery booleanQuery = new BooleanQuery.Builder()
                .add(ownedByEric, BooleanClause.Occur.MUST)
                .add(keywordA, BooleanClause.Occur.MUST)
                .build();
        search = indexSearcher.search(booleanQuery, 10);
        Assert.assertEquals(1, search.totalHits);
        System.out.println("BooleanQuery result: "
                + indexSearcher.doc(search.scoreDocs[0].doc).get("keywords"));
    }
}
```
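The table also lists FieldMaskingSpanQuery, which the test class above does not cover. Below is a sketch of the idea, loosely based on the example in Lucene's own javadoc: each document carries two parallel fields, and the clause on one field is masked so that SpanNearQuery will accept it alongside a clause on the other field. The class name, field names and data are made up for illustration, and the trick only works when the parallel fields produce compatible token positions:

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.spans.*;
import org.apache.lucene.store.RAMDirectory;

public class FieldMaskingDemo {

    static long run() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new WhitespaceAnalyzer()));

        // parallel fields: token i of "first" belongs with token i of "last"
        Document doc = new Document();
        doc.add(new TextField("first", "james john", Field.Store.YES));
        doc.add(new TextField("last", "jones smith", Field.Store.YES));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new TextField("first", "james", Field.Store.YES));
        doc.add(new TextField("last", "smith", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // find the person "james jones": first:james and last:jones at aligned positions
        SpanQuery first = new SpanTermQuery(new Term("first", "james"));
        SpanQuery last = new SpanTermQuery(new Term("last", "jones"));
        // mask the "last" clause so it pretends to live in the "first" field;
        // otherwise SpanNearQuery refuses to combine clauses from different fields
        SpanQuery masked = new FieldMaskingSpanQuery(last, "first");
        // same-position spans overlap, so an unordered near with slop 0 matches
        SpanQuery near = new SpanNearQuery(new SpanQuery[]{first, masked}, 0, false);

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        return searcher.search(near, 10).totalHits;
    }

    public static void main(String[] args) throws Exception {
        // only the first document has james/jones at aligned positions
        System.out.println(run());
    }
}
```

Only the first document matches: it has "james" and "jones" at the same token position, while the second document pairs "james" with "smith".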
Some readers ask where these examples come from. The answer is the official documentation, so when you want to learn a new framework well, the official docs are the place to start.
Original source: 业余草: » Lucene in Action Tutorial, Chapter 12: A Detailed Look at Lucene's Advanced Search Techniques