
java static blocks

· One min read

Background

Notes on some simple pieces of Java syntax.

static block

From the Java Language Specification:

8.7 Static Initializers
A static initializer declared in a class is executed when the class is initialized
(§12.4.2). Together with any field initializers for class variables (§8.3.2), static
initializers may be used to initialize the class variables of the class.

A static block runs when the class is initialized, after the class is loaded and before any instance is constructed.
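
A minimal sketch to make the order concrete (the class is made up for illustration):

public class StaticBlockDemo {
    static int value;

    // Runs exactly once, when the class is initialized.
    static {
        System.out.println("static block: class initialized");
        value = 42;
    }

    StaticBlockDemo() {
        System.out.println("constructor: instance created");
    }

    public static void main(String[] args) {
        new StaticBlockDemo(); // the static-block line prints first, then the constructor line
        new StaticBlockDemo(); // the static block does not run again
    }
}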

lucene tim format

· 5 min read

Background

The .tim file is where Lucene stores term-related statistics. It works together with the .tip file (the terms index).

Format and example

File format:

The corresponding documentation can be found under related reading at the bottom.

TermsDict (.tim) --> Header, PostingsHeader, NodeBlock^NumBlocks, Footer
NodeBlock --> (OuterNode | InnerNode)
OuterNode --> EntryCount, SuffixLength, ByteSuffixLength, StatsLength, <TermStats>^EntryCount, MetaLength, <TermMetadata>^EntryCount
InnerNode --> EntryCount, SuffixLength[,Sub?], ByteSuffixLength, StatsLength, <TermStats?>^EntryCount, MetaLength, <TermMetadata?>^EntryCount
TermStats --> DocFreq, TotalTermFreq
Header --> CodecHeader
EntryCount, SuffixLength, StatsLength, DocFreq, MetaLength --> VInt
TotalTermFreq --> VLong
Footer --> CodecFooter
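
Most of the scalar fields above (EntryCount, SuffixLength, StatsLength, DocFreq, MetaLength) are VInts: 7 payload bits per byte, with the high bit set when another byte follows. A minimal standalone decoding sketch of that encoding (my code, not Lucene's DataInput):

// Decode a variable-length int starting at buf[pos]:
// each byte contributes 7 low bits; a set high bit means "more bytes follow".
static int readVInt(byte[] buf, int pos) {
    int value = 0;
    int shift = 0;
    byte b;
    do {
        b = buf[pos++];
        value |= (b & 0x7F) << shift;
        shift += 7;
    } while ((b & 0x80) != 0);
    return value;
}

// e.g. 0x25 -> 37, and the pair 0xbc 0x03 (visible at offset 0x36 in the
// dump below) -> 0x3c | (0x03 << 7) = 444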

Example

hexdump -C  _j_Lucene90_0.tim 

00000000 3f d7 6c 17 12 42 6c 6f 63 6b 54 72 65 65 54 65 |?.l..BlockTreeTe|
00000010 72 6d 73 44 69 63 74 00 00 00 00 fe ea 80 e6 45 |rmsDict........E|
00000020 20 d8 56 64 1b 1b 1b 89 70 fe 67 0a 4c 75 63 65 | .Vd....p.g.Luce|
00000030 6e 65 39 30 5f 30 25 bc 03 61 6d 61 6e 64 62 75 |ne90_0%..amandbu|
00000040 74 63 61 6e 64 6f 68 65 6c 6c 6f 68 69 69 69 73 |tcandohellohiiis|
00000050 69 74 6b 6e 6f 77 6d 61 79 6d 6f 6e 67 6f 6e 6f |itknowmaymongono|
00000060 74 74 72 79 77 68 61 74 77 6f 72 6c 64 79 6f 75 |ttrywhatworldyou|
00000070 24 02 03 03 03 02 05 02 01 02 02 04 03 05 03 03 |$...............|
00000080 04 05 03 10 04 00 09 02 01 04 00 03 02 01 01 02 |................|
00000090 01 07 02 02 26 7a 3d 04 01 02 03 01 01 01 01 01 |....&z=.........| <--- the sixth byte on this line, 0x7a, is where the metadata bytes below begin
000000a0 05 01 01 01 00 02 04 00 02 01 01 01 01 01 02 01 |................|
000000b0 01 01 02 01 01 01 01 05 01 03 01 05 a4 03 2f 68 |............../h|
000000c0 6f 6d 65 2f 75 62 75 6e 74 75 2f 64 6f 63 2f 68 |ome/ubuntu/doc/h|
000000d0 65 6c 6c 6f 2e 74 78 74 2f 68 6f 6d 65 2f 75 62 |ello.txt/home/ub|
000000e0 75 6e 74 75 2f 64 6f 63 2f 6d 6f 6e 67 6f 2e 74 |untu/doc/mongo.t|
000000f0 78 74 05 1a 01 03 04 82 01 01 03 c0 28 93 e8 00 |xt..........(...|
00000100 00 00 00 00 00 00 00 da 02 a3 a3 |...........|

Here ste.in is the .tim file's data:

main[2] list
472 }
473 }
474
475 // metadata
476 => ste.fr.parent.postingsReader.decodeTerm(bytesReader, ste.fr.fieldInfo, state, absolute);
477
478 metaDataUpto++;
479 absolute = false;
480 }
481 state.termBlockOrd = metaDataUpto;
main[2] print ste.in
ste.in = "MMapIndexInput(path="/home/ubuntu/index/_j_Lucene90_0.tim")"

This corresponds to the following bytes (122 = 0x7a, 61 = 0x3d, i.e. the span marked in the hexdump above):

main[2] dump bytesReader.bytes
bytesReader.bytes = {
122, 61, 4, 1, 2, 3, 1, 1, 1, 1, 1, 5, 1, 1, 1, 0, 2, 4, 0, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 5, 1, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
}


Related reading

paper

· One min read

clickhouse parser

· One min read

clickhouse stack trace

(lldb) bt
* thread #3, name = 'HTTPHandler', stop reason = breakpoint 1.1
* frame #0: 0x000000001d9c6522 clickhouse-server`DB::tryParseQuery(parser=0x00007fff2d3ef620, _out_query_end=0x00007fff2d3ecc60, all_queries_end="\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZtext/plain; charset=UTF-8", out_error_message="", hilite=false, query_description="", allow_multi_statements=false, max_query_size=262144, max_parser_depth=1000) at parseQuery.cpp:237:32
frame #1: 0x000000001d9c77bd clickhouse-server`DB::parseQueryAndMovePosition(parser=0x00007fff2d3ef620, pos=0x00007fff2d3ecc60, end="\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZtext/plain; charset=UTF-8", query_description="", allow_multi_statements=false, max_query_size=262144, max_parser_depth=1000) at parseQuery.cpp:343:18
frame #2: 0x000000001d9c7926 clickhouse-server`DB::parseQuery(parser=0x00007fff2d3ef620, begin="show databases\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZtext/plain; charset=UTF-8", end="\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZtext/plain; charset=UTF-8", query_description="", max_query_size=262144, max_parser_depth=1000) at parseQuery.cpp:360:12
frame #3: 0x000000001b95ec13 clickhouse-server`DB::executeQueryImpl(begin="show databases\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZtext/plain; charset=UTF-8", end="\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZtext/plain; charset=UTF-8", context=std::__1::shared_ptr<DB::Context>::element_type @ 0x00007ffff5b23000 strong=4 weak=5, internal=false, stage=Complete, istr=0x00007fff2403d000) at executeQuery.cpp:442:15
frame #4: 0x000000001b965181 clickhouse-server`DB::executeQuery(istr=0x00007fff2403d000, ostr=0x00007fff240393d8, allow_into_outfile=false, context=std::__1::shared_ptr<DB::Context>::element_type @ 0x00007ffff5b23000 strong=4 weak=5, set_result_details=DB::SetResultDetailsFunc @ 0x00007fff2d3f0bf0, output_format_settings= Has Value=false )>, std::__1::optional<DB::FormatSettings> const&) at executeQuery.cpp:1136:30
frame #5: 0x000000001d3b7b17 clickhouse-server`DB::HTTPHandler::processQuery(this=0x00007ffff7154d40, request=0x00007fff2d3f2438, params=0x00007fff2d3f1e10, response=0x00007fff2d3f24f0, used_output=0x00007fff2d3f1ec8, query_scope= Has Value=true ) at HTTPHandler.cpp:822:5
frame #6: 0x000000001d3ba4fe clickhouse-server`DB::HTTPHandler::handleRequest(this=0x00007ffff7154d40, request=0x00007fff2d3f2438, response=0x00007fff2d3f24f0) at HTTPHandler.cpp:960:9
frame #7: 0x000000001d3f1e17 clickhouse-server`DB::HTTPServerConnection::run(this=0x00007ffff5b18000) at HTTPServerConnection.cpp:65:34
frame #8: 0x000000002308e1d9 clickhouse-server`Poco::Net::TCPServerConnection::start(this=0x00007ffff5b18000) at TCPServerConnection.cpp:43:3
frame #9: 0x000000002308e9e6 clickhouse-server`Poco::Net::TCPServerDispatcher::run(this=0x00007fff12875500) at TCPServerDispatcher.cpp:115:20
frame #10: 0x00000000232cecf4 clickhouse-server`Poco::PooledThread::run(this=0x00007ffff702df80) at ThreadPool.cpp:199:14
frame #11: 0x00000000232cb81a clickhouse-server`Poco::(anonymous namespace)::RunnableHolder::run(this=0x00007ffff7001330) at Thread.cpp:55:11
frame #12: 0x00000000232ca5fe clickhouse-server`Poco::ThreadImpl::runnableEntry(pThread=0x00007ffff702dfb8) at Thread_POSIX.cpp:345:27
frame #13: 0x00007ffff7df8b43 libc.so.6`start_thread(arg=<unavailable>) at pthread_create.c:442:8
frame #14: 0x00007ffff7e8aa00 libc.so.6`__clone3 at clone3.S:81

lucene 10 source code analysis

· 15 min read

Background

The Lucene checkout on my home machine is version 10.

Creating and saving the index

### launch the demo with the JDWP debug agent
java -agentlib:jdwp=transport=dt_socket,server=y,address=8000 -cp /home/dai/lucene/lucene/demo/build/libs/lucene-demo-10.0.0-SNAPSHOT.jar:/home/dai/lucene/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar org.apache.lucene.demo.IndexFiles -docs /home/dai/docs
### attach jdb
jdb -attach 8000 -sourcepath /home/dai/lucene/lucene/demo/src/java/:/home/dai/lucene/lucene/core/src/java/

Tokenization

Both inverted indexing and tokenization happen in this code path:

main[1] where
[1] org.apache.lucene.index.IndexingChain$PerField.invert (IndexingChain.java:1,140)
[2] org.apache.lucene.index.IndexingChain.processField (IndexingChain.java:729)
[3] org.apache.lucene.index.IndexingChain.processDocument (IndexingChain.java:620)
[4] org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments (DocumentsWriterPerThread.java:241)
[5] org.apache.lucene.index.DocumentsWriter.updateDocuments (DocumentsWriter.java:432)
[6] org.apache.lucene.index.IndexWriter.updateDocuments (IndexWriter.java:1,531)
[7] org.apache.lucene.index.IndexWriter.updateDocument (IndexWriter.java:1,816)
[8] org.apache.lucene.index.IndexWriter.addDocument (IndexWriter.java:1,469)
[9] org.apache.lucene.demo.IndexFiles.indexDoc (IndexFiles.java:271)
[10] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:212)
[11] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:208)
[12] java.nio.file.Files.walkFileTree (Files.java:2,811)
[13] java.nio.file.Files.walkFileTree (Files.java:2,882)
[14] org.apache.lucene.demo.IndexFiles.indexDocs (IndexFiles.java:206)
[15] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:157)

Step completed: "thread=main", org.apache.lucene.index.TermsHashPerField.add(), line=193 bci=22
193 int termID = bytesHash.add(termBytes);

main[1] print termBytes
termBytes = "[2f 68 6f 6d 65 2f 64 61 69 2f 64 6f 63 73 2f 62 62 62 2e 74 78 74]"

invert

The core of the inverted index is building a term => doc mapping. The central class is lucene/core/src/java/org/apache/lucene/index/FreqProxTermsWriterPerField.java; its addTerm method is shown below.

  @Override
  void addTerm(final int termID, final int docID) {
    final FreqProxPostingsArray postings = freqProxPostingsArray;
    assert !hasFreq || postings.termFreqs[termID] > 0;

    if (!hasFreq) {
      assert postings.termFreqs == null;
      if (termFreqAtt.getTermFrequency() != 1) {
        throw new IllegalStateException(
            "field \""
                + getFieldName()
                + "\": must index term freq while using custom TermFrequencyAttribute");
      }
      if (docID != postings.lastDocIDs[termID]) {
        // New document; now encode docCode for previous doc:
        assert docID > postings.lastDocIDs[termID];
        writeVInt(0, postings.lastDocCodes[termID]);
        postings.lastDocCodes[termID] = docID - postings.lastDocIDs[termID];
        postings.lastDocIDs[termID] = docID;
        fieldState.uniqueTermCount++;
      }
    } else if (docID != postings.lastDocIDs[termID]) {
      assert docID > postings.lastDocIDs[termID]
          : "id: " + docID + " postings ID: " + postings.lastDocIDs[termID] + " termID: " + termID;
      // Term not yet seen in the current doc but previously
      // seen in other doc(s) since the last flush

      // Now that we know doc freq for previous doc,
      // write it & lastDocCode
      if (1 == postings.termFreqs[termID]) {
        writeVInt(0, postings.lastDocCodes[termID] | 1);
      } else {
        writeVInt(0, postings.lastDocCodes[termID]);
        writeVInt(0, postings.termFreqs[termID]);
      }

      // Init freq for the current document
      postings.termFreqs[termID] = getTermFreq();
      fieldState.maxTermFrequency =
          Math.max(postings.termFreqs[termID], fieldState.maxTermFrequency);
      postings.lastDocCodes[termID] = (docID - postings.lastDocIDs[termID]) << 1;
      postings.lastDocIDs[termID] = docID;
      if (hasProx) {
        writeProx(termID, fieldState.position);
        if (hasOffsets) {
          postings.lastOffsets[termID] = 0;
          writeOffsets(termID, fieldState.offset);
        }
      } else {
        assert !hasOffsets;
      }
      fieldState.uniqueTermCount++;
    } else {
      postings.termFreqs[termID] = Math.addExact(postings.termFreqs[termID], getTermFreq());
      fieldState.maxTermFrequency =
          Math.max(fieldState.maxTermFrequency, postings.termFreqs[termID]);
      if (hasProx) {
        writeProx(termID, fieldState.position - postings.lastPositions[termID]);
        if (hasOffsets) {
          writeOffsets(termID, fieldState.offset);
        }
      }
    }
  }
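
The encoding above is easier to see with concrete numbers: lastDocCodes holds the doc-ID delta shifted left one bit, and the low bit flags freq == 1, so single-occurrence terms need one fewer VInt. A small sketch of that scheme (the helper name is mine, not Lucene's):

// Encode a (docDelta, freq) pair the way addTerm does:
// the low bit of the shifted delta signals "freq == 1", saving one VInt.
static int[] encode(int docDelta, int freq) {
    if (freq == 1) {
        return new int[] {(docDelta << 1) | 1};  // single value
    } else {
        return new int[] {docDelta << 1, freq};  // delta, then freq
    }
}

// e.g. encode(3, 1) -> [7]      (3 << 1 | 1)
//      encode(3, 5) -> [6, 5]   (3 << 1, then the freq)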

Generating the termID

Stack trace

main[1] where
[1] org.apache.lucene.index.TermsHashPerField.initStreamSlices (TermsHashPerField.java:150)
[2] org.apache.lucene.index.TermsHashPerField.add (TermsHashPerField.java:198)
[3] org.apache.lucene.index.IndexingChain$PerField.invert (IndexingChain.java:1,224)
[4] org.apache.lucene.index.IndexingChain.processField (IndexingChain.java:729)
[5] org.apache.lucene.index.IndexingChain.processDocument (IndexingChain.java:620)
[6] org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments (DocumentsWriterPerThread.java:241)
[7] org.apache.lucene.index.DocumentsWriter.updateDocuments (DocumentsWriter.java:432)
[8] org.apache.lucene.index.IndexWriter.updateDocuments (IndexWriter.java:1,531)
[9] org.apache.lucene.index.IndexWriter.updateDocument (IndexWriter.java:1,816)
[10] org.apache.lucene.index.IndexWriter.addDocument (IndexWriter.java:1,469)
[11] org.apache.lucene.demo.IndexFiles.indexDoc (IndexFiles.java:271)
[12] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:212)
[13] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:208)
[14] java.nio.file.Files.walkFileTree (Files.java:2,811)
[15] java.nio.file.Files.walkFileTree (Files.java:2,882)
[16] org.apache.lucene.demo.IndexFiles.indexDocs (IndexFiles.java:206)
[17] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:157)

      IntBlockPool intPool,
ByteBlockPool bytePool,
ByteBlockPool termBytePool,

First, the intPool variable: it maintains a two-dimensional array int[][] buffers plus three offsets, and the values it stores are offsets into bytePool.

public final class IntBlockPool {
  ...

  // Core structure. This 2-D array holds offsets into bytePool; the initial
  // capacity is 10 buffers and it grows automatically.
  public int[][] buffers = new int[10][];

  // Offset into the 2-D array, used together with buffers, e.g. buffers[bufferUpto]
  private int bufferUpto = -1;
  // The most recently written one-dimensional array inside buffers,
  // e.g. buffer = buffers[1];
  public int[] buffer;
  // intUpto is the offset relative to the current one-dimensional buffer
  public int intUpto = INT_BLOCK_SIZE;
  // Absolute offset relative to the whole 2-D array, a bit like relative
  // vs. absolute jumps in a CPU
  public int intOffset = -INT_BLOCK_SIZE;
}

Like intPool, bytePool and termBytePool are each described by a two-dimensional array plus a few offset variables:

public final class ByteBlockPool implements Accountable {
  ...
  // Core structure: a two-dimensional array
  public byte[][] buffers = new byte[10][];

  /** index into the buffers array pointing to the current buffer used as the head */
  private int bufferUpto = -1; // Which buffer we are upto
  /** Where we are in head buffer */
  public int byteUpto = BYTE_BLOCK_SIZE;

  /** Current head buffer */
  public byte[] buffer;
  /** Current head offset */
  public int byteOffset = -BYTE_BLOCK_SIZE;
}
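
The pattern shared by both pools: a growable array of fixed-size blocks addressed through a single absolute offset, so existing data never moves when the pool grows. A toy sketch of that addressing (my own code, not Lucene's):

// Toy block pool: absolute offset = blockIndex * BLOCK_SIZE + offsetInBlock.
// Growing just appends a block; previously returned offsets stay valid.
class ToyBytePool {
    static final int BLOCK_SIZE = 1 << 15;    // Lucene's BYTE_BLOCK_SIZE is 32768
    byte[][] buffers = new byte[10][];
    int bufferUpto = -1;                      // index of the head block
    int byteUpto = BLOCK_SIZE;                // offset within head block (starts "full" to force alloc)
    int byteOffset = -BLOCK_SIZE;             // absolute offset of the head block's start

    int append(byte b) {
        if (byteUpto == BLOCK_SIZE) {         // head block full: start a new one
            bufferUpto++;
            if (bufferUpto == buffers.length) // grow the outer array if needed
                buffers = java.util.Arrays.copyOf(buffers, buffers.length * 2);
            buffers[bufferUpto] = new byte[BLOCK_SIZE];
            byteUpto = 0;
            byteOffset += BLOCK_SIZE;
        }
        buffers[bufferUpto][byteUpto] = b;
        return byteOffset + byteUpto++;       // absolute offset of the byte just written
    }

    byte get(int absoluteOffset) {
        return buffers[absoluteOffset / BLOCK_SIZE][absoluteOffset % BLOCK_SIZE];
    }
}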

Querying and searching

Breakpoints

## launch the search demo with the JDWP debug agent
java -agentlib:jdwp=transport=dt_socket,server=y,address=8000 -cp /home/dai/lucene/lucene/demo/build/libs/lucene-demo-10.0.0-SNAPSHOT.jar:/home/dai/lucene/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/home/dai/lucene/lucene/queryparser/build/libs/lucene-queryparser-10.0.0-SNAPSHOT.jar org.apache.lucene.demo.SearchFiles

## attach jdb
jdb -attach 8000 -sourcepath /home/dai/lucene/lucene/demo/src/java/:/home/dai/lucene/lucene/core/src/java/

termState describes the statistics of a term:

main[1] print termState
termState = "TermStates
state=docFreq=1 totalTermFreq=1 termBlockOrd=2 blockFP=0 docStartFP=63 posStartFP=63 payStartFP=0 lastPosBlockOffset=-1 singletonDocID=6
"
main[1] print term
term = "contents:am"
main[1] where
[1] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:233)
[2] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:894)
[3] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[4] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[6] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[7] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Ranking

The default similarity used for ranking is BM25Similarity:

main[1] where
[1] org.apache.lucene.search.similarities.BM25Similarity.scorer (BM25Similarity.java:200)
[2] org.apache.lucene.search.TermQuery$TermWeight.<init> (TermQuery.java:75)
[3] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:233)
[4] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:894)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[6] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[8] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[9] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

The core search call

main[1] list
763 // there is no doc of interest in this reader context
764 // continue with the following leaf
765 continue;
766 }
767 => BulkScorer scorer = weight.bulkScorer(ctx);
768 if (scorer != null) {
769 try {
770 scorer.score(leafCollector, ctx.reader().getLiveDocs());
771 } catch (
772 @SuppressWarnings("unused")
main[1] where
[1] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:767)
[2] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[3] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[4] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[6] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[7] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Getting the reader

Step completed: "thread=main", org.apache.lucene.index.LeafReaderContext.reader(), line=67 bci=0
67 return reader;

main[1] print reader
reader = "_0(10.0.0):c7:[diagnostics={source=flush, os.arch=amd64, java.runtime.version=17.0.3+7-Ubuntu-0ubuntu0.22.04.1, os.version=5.15.0-33-generic, java.vendor=Private Build, os=Linux, timestamp=1656601918836, java.version=17.0.3, java.vm.version=17.0.3+7-Ubuntu-0ubuntu0.22.04.1, lucene.version=10.0.0}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=c276i3vlaza4c6uumuxapfnvf"
main[1] where
[1] org.apache.lucene.index.LeafReaderContext.reader (LeafReaderContext.java:67)
[2] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:770)
[3] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[5] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[7] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[8] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

The reader object itself:

main[1] dump reader
reader = {
si: instance of org.apache.lucene.index.SegmentCommitInfo(id=1531)
originalSi: instance of org.apache.lucene.index.SegmentCommitInfo(id=1532)
metaData: instance of org.apache.lucene.index.LeafMetaData(id=1533)
liveDocs: null
hardLiveDocs: null
numDocs: 7
core: instance of org.apache.lucene.index.SegmentCoreReaders(id=1534)
segDocValues: instance of org.apache.lucene.index.SegmentDocValues(id=1535)
isNRT: false
docValuesProducer: null
fieldInfos: instance of org.apache.lucene.index.FieldInfos(id=1536)
readerClosedListeners: instance of java.util.concurrent.CopyOnWriteArraySet(id=1537)
readerCacheHelper: instance of org.apache.lucene.index.SegmentReader$1(id=1538)
coreCacheHelper: instance of org.apache.lucene.index.SegmentReader$2(id=1539)
$assertionsDisabled: true
org.apache.lucene.index.LeafReader.readerContext: instance of org.apache.lucene.index.LeafReaderContext(id=1540)
org.apache.lucene.index.LeafReader.$assertionsDisabled: true
org.apache.lucene.index.IndexReader.closed: false
org.apache.lucene.index.IndexReader.closedByChild: false
org.apache.lucene.index.IndexReader.refCount: instance of java.util.concurrent.atomic.AtomicInteger(id=1541)
org.apache.lucene.index.IndexReader.parentReaders: instance of java.util.Collections$SynchronizedSet(id=1542)
}

Scoring:

main[1] list
222
223 @Override
224 public int score(LeafCollector collector, Bits acceptDocs, int min, int max)
225 throws IOException {
226 => collector.setScorer(scorer);
227 DocIdSetIterator scorerIterator = twoPhase == null ? iterator : twoPhase.approximation();
228 DocIdSetIterator competitiveIterator = collector.competitiveIterator();
229 DocIdSetIterator filteredIterator;
230 if (competitiveIterator == null) {
231 filteredIterator = scorerIterator;
main[1] where
[1] org.apache.lucene.search.Weight$DefaultBulkScorer.score (Weight.java:226)
[2] org.apache.lucene.search.BulkScorer.score (BulkScorer.java:38)
[3] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:770)
[4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[6] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[8] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[9] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Collecting and ranking hits

  private static class SimpleTopScoreDocCollector extends TopScoreDocCollector {

    ...

    @Override
    public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
      ...
      return new ScorerLeafCollector() {
        ...
        @Override
        public void collect(int doc) throws IOException {
          // <---- no docID needs to be passed to score(): the current doc
          // comes in through this collect callback from TopScoreDocCollector
          float score = scorer.score();

          // This collector relies on the fact that scorers produce positive values:
          assert score >= 0; // NOTE: false for NaN

          totalHits++;
          hitsThresholdChecker.incrementHitCount();

          if (minScoreAcc != null && (totalHits & minScoreAcc.modInterval) == 0) {
            updateGlobalMinCompetitiveScore(scorer);
          }

          if (score <= pqTop.score) {
            if (totalHitsRelation == TotalHits.Relation.EQUAL_TO) {
              // we just reached totalHitsThreshold, we can start setting the min
              // competitive score now
              updateMinCompetitiveScore(scorer);
            }
            // Since docs are returned in-order (i.e., increasing doc Id), a document
            // with equal score to pqTop.score cannot compete since HitQueue favors
            // documents with lower doc Ids. Therefore reject those docs too.
            return;
          }
          pqTop.doc = doc + docBase;
          pqTop.score = score;
          pqTop = pq.updateTop();
          updateMinCompetitiveScore(scorer);
        }
      };
    }
main[1] print scorer
scorer = "scorer(weight(contents:am))[org.apache.lucene.search.TermScorer@290dbf45]"
main[1] dump scorer
scorer = {
postingsEnum: instance of org.apache.lucene.index.SlowImpactsEnum(id=1546)
impactsEnum: instance of org.apache.lucene.index.SlowImpactsEnum(id=1546)
iterator: instance of org.apache.lucene.search.ImpactsDISI(id=1547)
docScorer: instance of org.apache.lucene.search.LeafSimScorer(id=1548)
impactsDisi: instance of org.apache.lucene.search.ImpactsDISI(id=1547)
$assertionsDisabled: true
org.apache.lucene.search.Scorer.weight: instance of org.apache.lucene.search.TermQuery$TermWeight(id=1549)
}
main[1] where
[1] org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1.collect (TopScoreDocCollector.java:76) <--- no doc_id is passed into the scorer here because the collect callback already supplies it; pq is a priority queue holding the docs in ranked order
[2] org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll (Weight.java:305)
[3] org.apache.lucene.search.Weight$DefaultBulkScorer.score (Weight.java:247)
[4] org.apache.lucene.search.BulkScorer.score (BulkScorer.java:38)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:770)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
main[1]

The core scoring function

Score calculation during ranking:

main[1] list
246 // float. And then monotonicity is preserved through composition via
247 // x -> 1 + x and x -> 1 - 1/x.
248 // Finally we expand weight * (1 - 1 / (1 + freq * 1/norm)) to
249 // weight - weight / (1 + freq * 1/norm), which runs slightly faster.
250 => float normInverse = cache[((byte) encodedNorm) & 0xFF];
251 return weight - weight / (1f + freq * normInverse);
252 }
253
254 @Override
255 public Explanation explain(Explanation freq, long encodedNorm) {
main[1] where
[1] org.apache.lucene.search.similarities.BM25Similarity$BM25Scorer.score (BM25Similarity.java:250)
[2] org.apache.lucene.search.LeafSimScorer.score (LeafSimScorer.java:60)
[3] org.apache.lucene.search.TermScorer.score (TermScorer.java:75)
[4] org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1.collect (TopScoreDocCollector.java:73)
[5] org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll (Weight.java:305)
[6] org.apache.lucene.search.Weight$DefaultBulkScorer.score (Weight.java:247)
[7] org.apache.lucene.search.BulkScorer.score (BulkScorer.java:38)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:770)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[10] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[11] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[12] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[13] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[14] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
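
The comment in the listing compresses some algebra: BM25's term score weight * (1 - 1 / (1 + freq * 1/norm)) is expanded to weight - weight / (1 + freq * 1/norm), which saves one multiplication and preserves monotonicity in freq. A quick standalone check that both forms agree (the values are made up):

public class Bm25FormCheck {
    public static void main(String[] args) {
        float weight = 2.0f, freq = 3.0f, normInverse = 0.25f;
        // expanded form used in the listing above:
        float fast = weight - weight / (1f + freq * normInverse);
        // original form from the comment:
        float plain = weight * (1f - 1f / (1f + freq * normInverse));
        System.out.println(fast + " == " + plain); // both ≈ 0.857
    }
}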

The reduce step

main[1] list
60 * Populates the results array with the ScoreDoc instances. This can be overridden in case a
61 * different ScoreDoc type should be returned.
62 */
63 protected void populateResults(ScoreDoc[] results, int howMany) {
64 => for (int i = howMany - 1; i >= 0; i--) {
65 results[i] = pq.pop();
66 }
67 }
68
69 /**
main[1] where
[1] org.apache.lucene.search.TopDocsCollector.populateResults (TopDocsCollector.java:64)
[2] org.apache.lucene.search.TopDocsCollector.topDocs (TopDocsCollector.java:166)
[3] org.apache.lucene.search.TopDocsCollector.topDocs (TopDocsCollector.java:98)
[4] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:526)
[5] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:505)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:694)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

A helper that merges the top-k hits across shards

Stack trace:

main[1] where
[1] org.apache.lucene.search.TopDocs.mergeAux (TopDocs.java:312)
[2] org.apache.lucene.search.TopDocs.merge (TopDocs.java:216)
[3] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:528)
[4] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:505)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:694)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[7] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

  /**
   * Auxiliary method used by the {@link #merge} impls. A sort value of null is used to indicate
   * that docs should be sorted by score.
   */
  private static TopDocs mergeAux(
      Sort sort, int start, int size, TopDocs[] shardHits, Comparator<ScoreDoc> tieBreaker) {

    final PriorityQueue<ShardRef> queue;
    if (sort == null) {
      queue = new ScoreMergeSortQueue(shardHits, tieBreaker);
    } else {
      queue = new MergeSortQueue(sort, shardHits, tieBreaker);
    }

    long totalHitCount = 0;
    TotalHits.Relation totalHitsRelation = TotalHits.Relation.EQUAL_TO;
    int availHitCount = 0;
    for (int shardIDX = 0; shardIDX < shardHits.length; shardIDX++) {
      final TopDocs shard = shardHits[shardIDX];
      // totalHits can be non-zero even if no hits were
      // collected, when searchAfter was used:
      totalHitCount += shard.totalHits.value;
      // If any hit count is a lower bound then the merged
      // total hit count is a lower bound as well
      if (shard.totalHits.relation == TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO) {
        totalHitsRelation = TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO;
      }
      if (shard.scoreDocs != null && shard.scoreDocs.length > 0) {
        availHitCount += shard.scoreDocs.length;
        queue.add(new ShardRef(shardIDX));
      }
    }

    final ScoreDoc[] hits;
    boolean unsetShardIndex = false;
    if (availHitCount <= start) {
      hits = new ScoreDoc[0];
    } else {
      hits = new ScoreDoc[Math.min(size, availHitCount - start)];
      int requestedResultWindow = start + size;
      int numIterOnHits = Math.min(availHitCount, requestedResultWindow);
      int hitUpto = 0;
      while (hitUpto < numIterOnHits) {
        assert queue.size() > 0;
        ShardRef ref = queue.top();
        final ScoreDoc hit = shardHits[ref.shardIndex].scoreDocs[ref.hitIndex++];

        // Irrespective of whether we use shard indices for tie breaking or not, we check
        // for consistent order in shard indices to defend against potential bugs
        if (hitUpto > 0) {
          if (unsetShardIndex != (hit.shardIndex == -1)) {
            throw new IllegalArgumentException("Inconsistent order of shard indices");
          }
        }

        unsetShardIndex |= hit.shardIndex == -1;

        if (hitUpto >= start) {
          hits[hitUpto - start] = hit;
        }

        hitUpto++;

        if (ref.hitIndex < shardHits[ref.shardIndex].scoreDocs.length) {
          // Not done with these TopDocs yet:
          queue.updateTop();
        } else {
          queue.pop();
        }
      }
    }

    TotalHits totalHits = new TotalHits(totalHitCount, totalHitsRelation);
    if (sort == null) {
      return new TopDocs(totalHits, hits);
    } else {
      return new TopFieldDocs(totalHits, hits, sort.getSort());
    }
  }
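
For reference, the public entry point wrapping mergeAux is TopDocs.merge. A minimal usage sketch, assuming the Lucene 9/10 API (the hit values are made up):

import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TotalHits;

public class MergeDemo {
    public static void main(String[] args) {
        // Two per-shard result sets; shardIndex is set explicitly (third arg).
        TopDocs shard0 = new TopDocs(
            new TotalHits(2, TotalHits.Relation.EQUAL_TO),
            new ScoreDoc[] {new ScoreDoc(3, 1.8f, 0), new ScoreDoc(7, 0.9f, 0)});
        TopDocs shard1 = new TopDocs(
            new TotalHits(1, TotalHits.Relation.EQUAL_TO),
            new ScoreDoc[] {new ScoreDoc(2, 1.2f, 1)});

        // Merge by score: expect the docs in order 1.8, 1.2, 0.9.
        TopDocs merged = TopDocs.merge(3, new TopDocs[] {shard0, shard1});
        for (ScoreDoc sd : merged.scoreDocs) {
            System.out.println("shard=" + sd.shardIndex + " doc=" + sd.doc + " score=" + sd.score);
        }
    }
}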

Fetching the document by docID

fieldsStream.seek(startPointer);
decompressor.decompress(fieldsStream, totalLength, offset, length, bytes);
assert bytes.length == length;
documentInput = new ByteArrayDataInput(bytes.bytes, bytes.offset, bytes.length);

Stack trace:

main[1] where
[1] org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.seek (ByteBufferIndexInput.java:576)
[2] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState.document (Lucene90CompressingStoredFieldsReader.java:594)
[3] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.document (Lucene90CompressingStoredFieldsReader.java:610)
[4] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:628)
[5] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
[6] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
[7] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
[8] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)


mmap loads the file into memory:

Breakpoint hit: "thread=main", org.apache.lucene.store.ByteBufferIndexInput.setCurBuf(), line=86 bci=0
86 this.curBuf = curBuf;

main[1] where
[1] org.apache.lucene.store.ByteBufferIndexInput.setCurBuf (ByteBufferIndexInput.java:86)
[2] org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.<init> (ByteBufferIndexInput.java:556)
[3] org.apache.lucene.store.ByteBufferIndexInput.newInstance (ByteBufferIndexInput.java:63)
[4] org.apache.lucene.store.MMapDirectory.openInput (MMapDirectory.java:238)
[5] org.apache.lucene.store.Directory.openChecksumInput (Directory.java:152)
[6] org.apache.lucene.index.SegmentInfos.readCommit (SegmentInfos.java:290)
[7] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:88)
[8] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:77)
[9] org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run (SegmentInfos.java:798)
[10] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:109)
[11] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:67)
[12] org.apache.lucene.index.DirectoryReader.open (DirectoryReader.java:60)
[13] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:105)

Clearly the file is opened in org.apache.lucene.store.MMapDirectory.openInput, whose implementation opens and memory-maps the file.

The first file opened is segments_1:

main[1] print name
name = "segments_1"
main[1] list
228
229 /** Creates an IndexInput for the file with the given name. */
230 @Override
231 public IndexInput openInput(String name, IOContext context) throws IOException {
232 => ensureOpen();
233 ensureCanRead(name);
234 Path path = directory.resolve(name);
235 try (FileChannel c = FileChannel.open(path, StandardOpenOption.READ)) {
236 final String resourceDescription = "MMapIndexInput(path=\"" + path.toString() + "\")";
237 final boolean useUnmap = getUseUnmap();
main[1]
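
Under the hood, openInput boils down to FileChannel.map. A minimal standalone sketch of the same mmap idea (the path is made up):

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapDemo {
    public static void main(String[] args) throws Exception {
        Path path = Path.of("/tmp/segments_1"); // made-up path
        try (FileChannel c = FileChannel.open(path, StandardOpenOption.READ)) {
            // Map the whole file read-only; subsequent reads go through the
            // page cache with no explicit read() calls.
            MappedByteBuffer buf = c.map(FileChannel.MapMode.READ_ONLY, 0, c.size());
            System.out.printf("first byte: 0x%02x%n", buf.get(0));
        }
    }
}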

An example of reading a string:

  private static void readField(DataInput in, StoredFieldVisitor visitor, FieldInfo info, int bits)
      throws IOException {
    switch (bits & TYPE_MASK) {
      case BYTE_ARR:
        int length = in.readVInt();
        byte[] data = new byte[length];
        in.readBytes(data, 0, length);
        visitor.binaryField(info, data);
        break;
      case STRING:
        visitor.stringField(info, in.readString());
        break;
      case NUMERIC_INT:
        visitor.intField(info, in.readZInt());
        break;
      case NUMERIC_FLOAT:
        visitor.floatField(info, readZFloat(in));
        break;
      case NUMERIC_LONG:
        visitor.longField(info, readTLong(in));
        break;
      case NUMERIC_DOUBLE:
        visitor.doubleField(info, readZDouble(in));
        break;
      default:
        throw new AssertionError("Unknown type flag: " + Integer.toHexString(bits));
    }
  }
main[1] where
[1] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.readField (Lucene90CompressingStoredFieldsReader.java:246)
[2] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:640)
[3] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
[4] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
[5] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
[6] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
[7] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
[8] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
main[1]

main[1] list
66 }
67
68 @Override
69 public void stringField(FieldInfo fieldInfo, String value) throws IOException {
70 => final FieldType ft = new FieldType(TextField.TYPE_STORED);
71 ft.setStoreTermVectors(fieldInfo.hasVectors());
72 ft.setOmitNorms(fieldInfo.omitsNorms());
73 ft.setIndexOptions(fieldInfo.getIndexOptions());
74 doc.add(
75 new StoredField(
main[1] print value
value = "/home/dai/docs/aaa.txt"
main[1] where
[1] org.apache.lucene.document.DocumentStoredFieldVisitor.stringField (DocumentStoredFieldVisitor.java:70)
[2] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.readField (Lucene90CompressingStoredFieldsReader.java:246)
[3] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:640)
[4] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
[5] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
[6] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
[7] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
[8] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
[9] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

The string that was read is then loaded into the doc object.

This is the core function: the file is read via mmap, seek positions the stream at the computed offset and length, and the bytes are read out, decompressed, and wrapped into an object:


  /**
   * Get the serialized representation of the given docID. This docID has to be contained in the
   * current block.
   */
  SerializedDocument document(int docID) throws IOException {
    if (contains(docID) == false) {
      throw new IllegalArgumentException();
    }

    final int index = docID - docBase;
    final int offset = Math.toIntExact(offsets[index]);
    final int length = Math.toIntExact(offsets[index + 1]) - offset;
    final int totalLength = Math.toIntExact(offsets[chunkDocs]);
    final int numStoredFields = Math.toIntExact(this.numStoredFields[index]);

    final BytesRef bytes;
    if (merging) {
      bytes = this.bytes;
    } else {
      bytes = new BytesRef();
    }
    ...
      fieldsStream.seek(startPointer); // seek to the computed file offset
      decompressor.decompress(fieldsStream, totalLength, offset, length, bytes); // decompress the content
      assert bytes.length == length;
      documentInput = new ByteArrayDataInput(bytes.bytes, bytes.offset, bytes.length); // wrap the bytes into a DataInput
    }

    return new SerializedDocument(documentInput, length, numStoredFields);
  }
}

Getting the doc

Breakpoint hit: "thread=main", org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum.advance(), line=498 bci=0
498 if (docFreq > BLOCK_SIZE && target > nextSkipDoc) {

main[1] where
[1] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum.advance (Lucene90PostingsReader.java:498)
[2] org.apache.lucene.index.SlowImpactsEnum.advance (SlowImpactsEnum.java:77)
[3] org.apache.lucene.search.ImpactsDISI.advance (ImpactsDISI.java:128)
[4] org.apache.lucene.search.ImpactsDISI.nextDoc (ImpactsDISI.java:133)
[5] org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll (Weight.java:301)
[6] org.apache.lucene.search.Weight$DefaultBulkScorer.score (Weight.java:247)
[7] org.apache.lucene.search.BulkScorer.score (BulkScorer.java:38)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:770)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[10] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[11] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[12] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[13] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[14] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

term query and iteration

Note that ImpactsEnum provides a docID iterator (it extends DocIdSetIterator):

1,138      }
1,139
1,140 @Override
1,141 public ImpactsEnum impacts(int flags) throws IOException {
1,142 => assert !eof;
1,143 // if (DEBUG) {
1,144 // System.out.println("BTTR.docs seg=" + segment);
1,145 // }
1,146 currentFrame.decodeMetaData();
1,147 // if (DEBUG) {
main[1] where
[1] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.impacts (SegmentTermsEnum.java:1,142)
[2] org.apache.lucene.search.TermQuery$TermWeight.scorer (TermQuery.java:114)
[3] org.apache.lucene.search.Weight.bulkScorer (Weight.java:166)
[4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:767)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[7] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Note that PostingsEnum is also a docID iterator.

Top-k ranking


main[1] where
[1] org.apache.lucene.util.PriorityQueue.upHeap (PriorityQueue.java:276)
[2] org.apache.lucene.util.PriorityQueue.add (PriorityQueue.java:161)
[3] org.apache.lucene.search.TopDocs.mergeAux (TopDocs.java:303)
[4] org.apache.lucene.search.TopDocs.merge (TopDocs.java:216)
[5] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:528)
[6] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:505)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:694)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[9] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[10] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[11] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[12] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)


  @Override
  public boolean lessThan(ShardRef first, ShardRef second) {
    assert first != second;
    ScoreDoc firstScoreDoc = shardHits[first.shardIndex][first.hitIndex];
    ScoreDoc secondScoreDoc = shardHits[second.shardIndex][second.hitIndex];
    if (firstScoreDoc.score < secondScoreDoc.score) {
      return false;
    } else if (firstScoreDoc.score > secondScoreDoc.score) {
      return true;
    } else {
      return tieBreakLessThan(first, firstScoreDoc, second, secondScoreDoc, tieBreakerComparator);
    }
  }
Related reading

cpp basics

· One min read

Storage duration

This describes the lifetime of a variable. There are four kinds:

  • automatic: the lifetime is the enclosing block; storage is allocated on entry to the block and released on exit
  • static: the lifetime is the whole program; storage is allocated before the program starts and released after it ends (only loosely related to the static keyword)
  • thread: the lifetime runs from thread start to thread end
  • dynamic: dynamic lifetime, typically via new or malloc

Linkage

Linkage describes the visibility of a name. There are three kinds:

  • no linkage: visible only inside the current block
  • internal linkage: visible inside the current translation unit
  • external linkage: visible from other translation units

Storage-class specifiers

There are four storage-class specifiers:

  • auto
  • register
  • static
  • extern

Each of these four storage-class specifiers maps onto a storage duration and a linkage.

Related reading

lucene: building and installing

· 5 min read

I need to build and understand the Lucene code.

Building

Lucene pins its JDK version, so I had to switch to JDK 17; my machine had JDK 18.

Cloning the code

## clone the code
git clone https://github.com/apache/lucene.git

### change into the directory
cd lucene

### build
./gradlew

## behind a firewall, a SOCKS proxy speeds this up
## specify the host and port
./gradlew -DsocksProxyHost=192.168.1.102 -DsocksProxyPort=1081

Running and testing

### build the demo jar
./gradlew lucene:demo:jar

### run the demo
java -cp /home/ubuntu/lucene-9.1.0/lucene/demo/build/classes/java/main:/home/ubuntu/lucene-9.1.0/lucene/core/build/classes/java/main/ org.apache.lucene.demo.IndexFiles -

On Ubuntu, the commands to switch to JDK 17 are as follows:

### install jdk17
sudo apt install openjdk-17-jdk
# switch java
sudo update-alternatives --config java

# switch javac
sudo update-alternatives --config javac


### verify: java is now 17
java --version
openjdk 17.0.3 2022-04-19
OpenJDK Runtime Environment (build 17.0.3+7-Ubuntu-0ubuntu0.22.04.1)
OpenJDK 64-Bit Server VM (build 17.0.3+7-Ubuntu-0ubuntu0.22.04.1, mixed mode, sharing)

Errors encountered

gradle-wrapper.jar would not download, so skip certificate verification:

wget --no-check-certificate  https://raw.githubusercontent.com/gradle/gradle/v7.3.3/gradle/wrapper/gradle-wrapper.jar

Then put it under {$luceneGitDir}/gradle/wrapper/, where luceneGitDir is the directory you cloned Lucene into.

Related code

    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setUseCompoundFile(false); // write separate files instead of a compound file

Writing the header

The corresponding jdb session:

main[1] stop in  org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter:136
Deferring breakpoint org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter:136.
It will be set after the class is loaded.
main[1] cont
> Set deferred breakpoint org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter:136

Breakpoint hit: "thread=main", org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.<init>(), line=136 bci=180
136 CodecUtil.writeIndexHeader(

main[1] list
132
133 fieldsStream =
134 directory.createOutput(
135 IndexFileNames.segmentFileName(segment, segmentSuffix, FIELDS_EXTENSION), context);
136 => CodecUtil.writeIndexHeader(
137 fieldsStream, formatName, VERSION_CURRENT, si.getId(), segmentSuffix);
138 assert CodecUtil.indexHeaderLength(formatName, segmentSuffix)
139 == fieldsStream.getFilePointer();
140
141 indexWriter =
main[1] print formatName
formatName = "Lucene90StoredFieldsFastData"

Corresponding stack trace

  [1] org.apache.lucene.store.OutputStreamIndexOutput.writeByte (OutputStreamIndexOutput.java:54)
[2] org.apache.lucene.codecs.CodecUtil.writeBEInt (CodecUtil.java:653)
[3] org.apache.lucene.codecs.CodecUtil.writeHeader (CodecUtil.java:82)
[4] org.apache.lucene.codecs.CodecUtil.writeIndexHeader (CodecUtil.java:125)
[5] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.<init> (Lucene90CompressingStoredFieldsWriter.java:128)
[6] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsFormat.fieldsWriter (Lucene90CompressingStoredFieldsFormat.java:140)
[7] org.apache.lucene.codecs.lucene90.Lucene90StoredFieldsFormat.fieldsWriter (Lucene90StoredFieldsFormat.java:154)
[8] org.apache.lucene.index.StoredFieldsConsumer.initStoredFieldsWriter (StoredFieldsConsumer.java:49)
[9] org.apache.lucene.index.StoredFieldsConsumer.startDocument (StoredFieldsConsumer.java:56)
[10] org.apache.lucene.index.IndexingChain.startStoredFields (IndexingChain.java:556)
[11] org.apache.lucene.index.IndexingChain.processDocument (IndexingChain.java:587)
[12] org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments (DocumentsWriterPerThread.java:241)
[13] org.apache.lucene.index.DocumentsWriter.updateDocuments (DocumentsWriter.java:432)
[14] org.apache.lucene.index.IndexWriter.updateDocuments (IndexWriter.java:1,531)
[15] org.apache.lucene.index.IndexWriter.updateDocument (IndexWriter.java:1,816)
[16] org.apache.lucene.index.IndexWriter.addDocument (IndexWriter.java:1,469)
[17] org.apache.lucene.demo.IndexFiles.indexDoc (IndexFiles.java:271)
[18] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:212)
[19] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:208)
[20] java.nio.file.Files.walkFileTree (Files.java:2,725)
[21] java.nio.file.Files.walkFileTree (Files.java:2,797)
[22] org.apache.lucene.demo.IndexFiles.indexDocs (IndexFiles.java:206)
[23] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:157)

Inverted index

main[1] where
[1] org.apache.lucene.index.TermsHashPerField.initStreamSlices (TermsHashPerField.java:150)
[2] org.apache.lucene.index.TermsHashPerField.add (TermsHashPerField.java:198)
[3] org.apache.lucene.index.IndexingChain$PerField.invert (IndexingChain.java:1,224)
[4] org.apache.lucene.index.IndexingChain.processField (IndexingChain.java:729)
[5] org.apache.lucene.index.IndexingChain.processDocument (IndexingChain.java:620)
[6] org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments (DocumentsWriterPerThread.java:241)
[7] org.apache.lucene.index.DocumentsWriter.updateDocuments (DocumentsWriter.java:432)
[8] org.apache.lucene.index.IndexWriter.updateDocuments (IndexWriter.java:1,531)
[9] org.apache.lucene.index.IndexWriter.updateDocument (IndexWriter.java:1,816)
[10] org.apache.lucene.demo.IndexFiles.indexDoc (IndexFiles.java:277)
[11] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:212)
[12] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:208)
[13] java.nio.file.Files.walkFileTree (Files.java:2,725)
[14] java.nio.file.Files.walkFileTree (Files.java:2,797)
[15] org.apache.lucene.demo.IndexFiles.indexDocs (IndexFiles.java:206)
[16] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:157)

Writing the content

main[1] where
[1] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.writeField (Lucene90CompressingStoredFieldsWriter.java:276)
[2] org.apache.lucene.index.StoredFieldsConsumer.writeField (StoredFieldsConsumer.java:65)
[3] org.apache.lucene.index.IndexingChain.processField (IndexingChain.java:749)
[4] org.apache.lucene.index.IndexingChain.processDocument (IndexingChain.java:620)
[5] org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments (DocumentsWriterPerThread.java:241)
[6] org.apache.lucene.index.DocumentsWriter.updateDocuments (DocumentsWriter.java:432)
[7] org.apache.lucene.index.IndexWriter.updateDocuments (IndexWriter.java:1,531)
[8] org.apache.lucene.index.IndexWriter.updateDocument (IndexWriter.java:1,816)
[9] org.apache.lucene.index.IndexWriter.addDocument (IndexWriter.java:1,469)
[10] org.apache.lucene.demo.IndexFiles.indexDoc (IndexFiles.java:271)
[11] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:212)
[12] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:208)
[13] java.nio.file.Files.walkFileTree (Files.java:2,725)
[14] java.nio.file.Files.walkFileTree (Files.java:2,797)
[15] org.apache.lucene.demo.IndexFiles.indexDocs (IndexFiles.java:206)
[16] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:157)

Inspecting the fdt file

hexdump -C _0.fdt
00000000 3f d7 6c 17 1c 4c 75 63 65 6e 65 39 30 53 74 6f |?.l..Lucene90Sto|
00000010 72 65 64 46 69 65 6c 64 73 46 61 73 74 44 61 74 |redFieldsFastDat|
00000020 61 00 00 00 01 85 88 12 2b 0c 73 6b 95 30 38 76 |a.......+.sk.08v|
00000030 c9 0a 2a 52 29 00 00 0a 00 01 00 1c 02 06 03 07 |..*R)...........|
00000040 07 07 07 07 07 07 07 07 20 00 1a 60 2f 68 6f 6d |........ ..`/hom|
00000050 65 2f 60 75 62 75 6e 74 75 60 2f 64 6f 63 2f 6d |e/`ubuntu`/doc/m|
00000060 60 6f 6e 67 6f 2e 74 60 78 74 00 1a 2f 68 60 6f |`ongo.t`xt../h`o|
00000070 6d 65 2f 75 62 60 75 6e 74 75 2f 64 60 6f 63 2f |me/ub`untu/d`oc/|
00000080 68 65 6c 60 6c 6f 2e 74 78 74 c0 28 93 e8 00 00 |hel`lo.txt.(....|
00000090 00 00 00 00 00 00 c8 75 0a 41 |.......u.A|
0000009a

fdt layout

Now analyze the fdt format. The notation [1-4] means bytes one through four.

  • [1-4]: the first four bytes are the big-endian magic number CODEC_MAGIC = 0x3fd76c17
  • [5-33]: the fifth byte is the length of the codec-name string, hex 1c = decimal 28, and [6-33] is the 28-character string itself, Lucene90StoredFieldsFastData
  • [34-37]: after the string comes the hard-coded version number, big-endian 1
  • [38-53]: 16 bytes holding the unique id of this file
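
A small standalone reader for just these header fields, to cross-check the dump above. This is my own sketch, not Lucene's CodecUtil; it assumes the name length fits in one byte (a one-byte VInt), which holds for this file:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class FdtHeaderDump {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream("_0.fdt"))) {
            int magic = in.readInt();                      // bytes [1-4]: big-endian CODEC_MAGIC
            byte[] name = new byte[in.readUnsignedByte()]; // byte [5]: name length (0x1c = 28)
            in.readFully(name);                            // bytes [6-33]: codec name
            int version = in.readInt();                    // bytes [34-37]: version, big-endian
            byte[] id = new byte[16];
            in.readFully(id);                              // bytes [38-53]: unique segment id
            System.out.printf("magic=0x%08x name=%s version=%d%n",
                    magic, new String(name, StandardCharsets.UTF_8), version);
        }
    }
}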

Buffer pools

TermsHashPerField holds three buffer pools: intPool, bytePool, and termBytePool.

  TermsHashPerField(
      int streamCount,
      IntBlockPool intPool,
      ByteBlockPool bytePool,
      ByteBlockPool termBytePool,
      Counter bytesUsed,
      TermsHashPerField nextPerField,
      String fieldName,
      IndexOptions indexOptions) {
    this.intPool = intPool;
    this.bytePool = bytePool;
    this.streamCount = streamCount;
    this.fieldName = fieldName;
    this.nextPerField = nextPerField;
    assert indexOptions != IndexOptions.NONE;
    this.indexOptions = indexOptions;
    PostingsBytesStartArray byteStarts = new PostingsBytesStartArray(this, bytesUsed);
    bytesHash = new BytesRefHash(termBytePool, HASH_INIT_SIZE, byteStarts);
  }

Generating the term

main[1] where
[1] org.apache.lucene.util.BytesRefHash.add (BytesRefHash.java:247)
[2] org.apache.lucene.index.TermsHashPerField.add (TermsHashPerField.java:193)
[3] org.apache.lucene.index.IndexingChain$PerField.invert (IndexingChain.java:1,224)
[4] org.apache.lucene.index.IndexingChain.processField (IndexingChain.java:729)
[5] org.apache.lucene.index.IndexingChain.processDocument (IndexingChain.java:620)
[6] org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments (DocumentsWriterPerThread.java:241)
[7] org.apache.lucene.index.DocumentsWriter.updateDocuments (DocumentsWriter.java:432)
[8] org.apache.lucene.index.IndexWriter.updateDocuments (IndexWriter.java:1,531)
[9] org.apache.lucene.index.IndexWriter.updateDocument (IndexWriter.java:1,816)
[10] org.apache.lucene.demo.IndexFiles.indexDoc (IndexFiles.java:277)
[11] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:212)
[12] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:208)
[13] java.nio.file.Files.walkFileTree (Files.java:2,725)
[14] java.nio.file.Files.walkFileTree (Files.java:2,797)
[15] org.apache.lucene.demo.IndexFiles.indexDocs (IndexFiles.java:206)
[16] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:157)
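
BytesRefHash.add is what hands out the term IDs seen above. A minimal usage sketch, assuming Lucene's public BytesRefHash API: add returns a fresh ord for an unseen term, and -(ord)-1 when the term is already present.

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefHash;

public class BytesRefHashDemo {
    public static void main(String[] args) {
        BytesRefHash hash = new BytesRefHash();
        // add() returns a new ord for unseen terms...
        int id1 = hash.add(new BytesRef("hello")); // 0
        int id2 = hash.add(new BytesRef("world")); // 1
        // ...and -(ord)-1 for terms already present:
        int dup = hash.add(new BytesRef("hello")); // -1, i.e. -(0)-1
        System.out.println(id1 + " " + id2 + " " + dup);
    }
}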

Term lookup (seekExact)


main[1] where
[1] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.seekExact (SegmentTermsEnum.java:476)
[2] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:117)
[3] org.apache.lucene.index.TermStates.build (TermStates.java:102)
[4] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[5] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[7] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Related reading

clickhouse function

· 5 min read

Background

I want to become familiar with ClickHouse internals.

Implementation

// /ClickHouse/src/Functions/isNaN.cpp
// add a struct
struct testtrue
{
    static constexpr auto name = "testtrue"; // function name
    template <typename T>
    static bool execute(const T t) // execution callback
    {
        /// Suppression for PVS-Studio.
        return true || t;
    }
};
// give it an alias
using FunctionTestTure = FunctionNumericPredicate<testtrue>;


void registerFunctionIsNaN(FunctionFactory & factory)
{
    factory.registerFunction<FunctionIsNaN>();
    factory.registerFunction<FunctionTestTure>(); // register the new function here
}

Recompile and call it:

SELECT testtrue(2)

Query id: 87e4625d-8b79-4c3a-8153-b333d6a0614f

┌─testtrue(2)─┐
│           1 │
└─────────────┘

Registration path

(lldb) bt
* thread #1, name = 'clickhouse-serv', stop reason = breakpoint 1.1
* frame #0: 0x00000000148ee2ac clickhouse-server`DB::registerFunctionIsNaN(factory=0x000000002f984a50) at isNaN.cpp:43:5
frame #1: 0x0000000010339e2c clickhouse-server`DB::registerFunctionsMiscellaneous(factory=0x000000002f984a50) at registerFunctionsMiscellaneous.cpp:128:5
frame #2: 0x00000000103391a0 clickhouse-server`DB::registerFunctions() at registerFunctions.cpp:96:5
frame #3: 0x000000000d063d35 clickhouse-server`DB::Server::main(this=0x00007fffffffd9b8, (null)=size=0) at Server.cpp:623:5
frame #4: 0x00000000230abec5 clickhouse-server`Poco::Util::Application::run(this=0x00007fffffffd9b8) at Application.cpp:334:8
frame #5: 0x000000000d06250b clickhouse-server`DB::Server::run(this=0x00007fffffffd9b8) at Server.cpp:461:25
frame #6: 0x00000000230c6c70 clickhouse-server`Poco::Util::ServerApplication::run(this=0x00007fffffffd9b8, argc=1, argv=0x00007ffff70f7038) at ServerApplication.cpp:611:9
frame #7: 0x000000000d05f8e1 clickhouse-server`mainEntryClickHouseServer(argc=1, argv=0x00007ffff70f7038) at Server.cpp:187:20
frame #8: 0x000000000cf7fe63 clickhouse-server`main(argc_=1, argv_=0x00007fffffffdfb8) at main.cpp:409:12
frame #9: 0x00007ffff7d92d90 libc.so.6`__libc_start_call_main(main=(clickhouse-server`main at main.cpp:380), argc=1, argv=0x00007fffffffdfb8) at libc_start_call_main.h:58:16
frame #10: 0x00007ffff7d92e40 libc.so.6`__libc_start_main_impl(main=(clickhouse-server`main at main.cpp:380), argc=1, argv=0x00007fffffffdfb8, init=0x00007ffff7ffd040, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffdfa8) at libc-start.c:392:3
frame #11: 0x000000000cf7fb55 clickhouse-server`_start + 37

Function call path

lldb debugging

(lldb) b DB::(anonymous namespace)::testtrue::execute

Call stack:

(lldb) bt
* thread #3, name = 'TCPHandler', stop reason = breakpoint 3.1
* frame #0: 0x00000000148f3dca clickhouse-server`bool DB::(anonymous namespace)::testtrue::execute<char8_t>(t=0x02 u8'\U00000002') at isNaN.cpp:28:9
frame #1: 0x00000000148f29dd clickhouse-server`COW<DB::IColumn>::immutable_ptr<DB::IColumn> DB::FunctionNumericPredicate<DB::(anonymous namespace)::testtrue>::execute<char8_t>(this=0x00007fff1e44db58, in_untyped=0x00007fff1e447820) const at FunctionNumericPredicate.h:89:31
frame #2: 0x00000000148f1e8a clickhouse-server`DB::FunctionNumericPredicate<DB::(anonymous namespace)::testtrue>::executeImpl(this=0x00007fff1e44db58, arguments=size=1, (null)=std::__1::shared_ptr<const DB::IDataType>::element_type @ 0x00007fff1e4899c8 strong=2 weak=2, (null)=1) const at FunctionNumericPredicate.h:61:22
frame #3: 0x00000000103a611c clickhouse-server`DB::IFunction::executeImplDryRun(this=0x00007fff1e44db58, arguments=size=1, result_type=std::__1::shared_ptr<const DB::IDataType>::element_type @ 0x00007fff1e4899c8 strong=2 weak=2, input_rows_count=1) const at IFunction.h:395:16
frame #4: 0x00000000103a484d clickhouse-server`DB::FunctionToExecutableFunctionAdaptor::executeDryRunImpl(this=0x00007fff1e44dba0, arguments=size=1, result_type=std::__1::shared_ptr<const DB::IDataType>::element_type @ 0x00007fff1e4899c8 strong=2 weak=2, input_rows_count=1) const at IFunctionAdaptors.h:26:26
frame #5: 0x000000001a8fa3d9 clickhouse-server`DB::IExecutableFunction::executeWithoutLowCardinalityColumns(this=0x00007fff1e44dba0, args=size=1, result_type=std::__1::shared_ptr<const DB::IDataType>::element_type @ 0x00007fff1e4899c8 strong=2 weak=2, input_rows_count=1, dry_run=true) const at IFunction.cpp:217:15
frame #6: 0x000000001a8fa00a clickhouse-server`DB::IExecutableFunction::defaultImplementationForConstantArguments(this=0x00007fff1e44dba0, args=size=1, result_type=std::__1::shared_ptr<const DB::IDataType>::element_type @ 0x00007fff1e4899c8 strong=2 weak=2, input_rows_count=1, dry_run=true) const at IFunction.cpp:160:31
frame #7: 0x000000001a8fa2c4 clickhouse-server`DB::IExecutableFunction::executeWithoutLowCardinalityColumns(this=0x00007fff1e44dba0, args=size=1, result_type=std::__1::shared_ptr<const DB::IDataType>::element_type @ 0x00007fff1e4899c8 strong=2 weak=2, input_rows_count=1, dry_run=true) const at IFunction.cpp:209:20
frame #8: 0x000000001a8faf55 clickhouse-server`DB::IExecutableFunction::executeWithoutSparseColumns(this=0x00007fff1e44dba0, arguments=size=1, result_type=std::__1::shared_ptr<const DB::IDataType>::element_type @ 0x00007fff1e4899c8 strong=2 weak=2, input_rows_count=1, dry_run=true) const at IFunction.cpp:267:22
frame #9: 0x000000001a8fbe43 clickhouse-server`DB::IExecutableFunction::execute(this=0x00007fff1e44dba0, arguments=size=1, result_type=std::__1::shared_ptr<const DB::IDataType>::element_type @ 0x00007fff1e4899c8 strong=2 weak=2, input_rows_count=1, dry_run=true) const at IFunction.cpp:337:16
frame #10: 0x000000001b02a7b8 clickhouse-server`DB::ActionsDAG::addFunction(this=0x00007fff1e4ac248, function=std::__1::shared_ptr<DB::IFunctionOverloadResolver>::element_type @ 0x00007fff1e44dde0 strong=2 weak=1, children=size=0, result_name="testtrue(2)") at ActionsDAG.cpp:199:37
frame #11: 0x000000001cdad1a1 clickhouse-server`DB::ScopeStack::addFunction(this=0x00007fff2d1e96f0, function=std::__1::shared_ptr<DB::IFunctionOverloadResolver>::element_type @ 0x00007fff1e44dde0 strong=2 weak=1, argument_names=size=1, result_name="") at ActionsVisitor.cpp:598:51
frame #12: 0x000000001cdb7485 clickhouse-server`DB::ActionsMatcher::Data::addFunction(this=0x00007fff2d1e9698, function=std::__1::shared_ptr<DB::IFunctionOverloadResolver>::element_type @ 0x00007fff1e44dde0 strong=2 weak=1, argument_names=size=1, result_name=<unavailable>) at ActionsVisitor.h:140:27
frame #13: 0x000000001cdb0a2b clickhouse-server`DB::ActionsMatcher::visit(node=0x00007fff2a45c9b8, ast=std::__1::shared_ptr<DB::IAST>::element_type @ 0x00007fff2a45c9b8 strong=1 weak=2, data=0x00007fff2d1e9698) at ActionsVisitor.cpp:1093:14
frame #14: 0x000000001cdad64d clickhouse-server`DB::ActionsMatcher::visit(ast=std::__1::shared_ptr<DB::IAST>::element_type @ 0x00007fff2a45c9b8 strong=1 weak=2, data=0x00007fff2d1e9698) at ActionsVisitor.cpp:655:9
frame #15: 0x000000001cdb125c clickhouse-server`DB::ActionsMatcher::visit(expression_list=0x00007fff1e4895b8, (null)=std::__1::shared_ptr<DB::IAST>::element_type @ 0x00007fff1e4895b8 strong=2 weak=2, data=0x00007fff2d1e9698) at ActionsVisitor.cpp:763:17
frame #16: 0x000000001cdad6b9 clickhouse-server`DB::ActionsMatcher::visit(ast=std::__1::shared_ptr<DB::IAST>::element_type @ 0x00007fff1e4895b8 strong=2 weak=2, data=0x00007fff2d1e9698) at ActionsVisitor.cpp:659:9
frame #17: 0x000000001b1ebce5 clickhouse-server`DB::InDepthNodeVisitor<DB::ActionsMatcher, true, false, std::__1::shared_ptr<DB::IAST> const>::visit(this=0x00007fff2d1e9638, ast=std::__1::shared_ptr<DB::IAST>::element_type @ 0x00007fff1e4895b8 strong=2 weak=2) at InDepthNodeVisitor.h:34:13
frame #18: 0x000000001b1dc0ea clickhouse-server`DB::ExpressionAnalyzer::getRootActions(this=0x00007fff1e47f780, ast=std::__1::shared_ptr<DB::IAST>::element_type @ 0x00007fff1e4895b8 strong=2 weak=2, no_makeset_for_subqueries=false, actions=nullptr, only_consts=false) at ExpressionAnalyzer.cpp:587:48
frame #19: 0x000000001b1e2f9e clickhouse-server`DB::SelectQueryExpressionAnalyzer::appendSelect(this=0x00007fff1e47f780, chain=0x00007fff2d1ea5a0, only_types=false) at ExpressionAnalyzer.cpp:1383:5
frame #20: 0x000000001b1e6da8 clickhouse-server`DB::ExpressionAnalysisResult::ExpressionAnalysisResult(this=0x00007fff2d1eaa88, query_analyzer=0x00007fff1e47f780, metadata_snapshot=std::__1::shared_ptr<const DB::StorageInMemoryMetadata>::element_type @ 0x00007ffff705a800 strong=4 weak=1, first_stage_=true, second_stage_=true, only_types=false, filter_info_=nullptr, source_header=0x00007fff1e4c9550) at ExpressionAnalyzer.cpp:1830:24
frame #21: 0x000000001b57ab9d clickhouse-server`DB::InterpreterSelectQuery::getSampleBlockImpl(this=0x00007fff1e4c9000) at InterpreterSelectQuery.cpp:692:23
frame #22: 0x000000001b5747f9 clickhouse-server`DB::InterpreterSelectQuery::InterpreterSelectQuery(this=0x00007fff2d1ec148, try_move_to_prewhere=true)::$_1::operator()(bool) const at InterpreterSelectQuery.cpp:552:25
frame #23: 0x000000001b5709f6 clickhouse-server`DB::InterpreterSelectQuery::InterpreterSelectQuery(this=0x00007fff1e4c9000, query_ptr_=std::__1::shared_ptr<DB::IAST>::element_type @ 0x00007fff1e48e198 strong=1 weak=2, context_=std::__1::shared_ptr<const DB::Context>::element_type @ 0x00007fff1e495000 strong=3 weak=2, input_pipe_= Has Value=false , storage_=nullptr, options_=0x00007fff1e461770, required_result_column_names=size=0, metadata_snapshot_=nullptr, subquery_for_sets_=size=0, prepared_sets_=size=0) at InterpreterSelectQuery.cpp:555:5
frame #24: 0x000000001b56edf3 clickhouse-server`DB::InterpreterSelectQuery::InterpreterSelectQuery(this=0x00007fff1e4c9000, query_ptr_=std::__1::shared_ptr<DB::IAST>::element_type @ 0x00007fff1e48e198 strong=1 weak=2, context_=std::__1::shared_ptr<const DB::Context>::element_type @ 0x00007fff1e495000 strong=3 weak=2, options_=0x00007fff1e461770, required_result_column_names_=size=0) at InterpreterSelectQuery.cpp:165:7
frame #25: 0x000000001b5f4ce5 clickhouse-server`std::__1::__unique_if<DB::InterpreterSelectQuery>::__unique_single std::__1::make_unique<DB::InterpreterSelectQuery, std::__1::shared_ptr<DB::IAST> const&, std::__1::shared_ptr<DB::Context>&, DB::SelectQueryOptions&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&>(__args=std::__1::shared_ptr<DB::IAST>::element_type @ 0x00007fff1e48e198 strong=1 weak=2, __args=std::__1::shared_ptr<DB::Context>::element_type @ 0x00007fff1e495000 strong=3 weak=2, __args=0x00007fff1e461770, __args=size=0) at unique_ptr.h:725:32
frame #26: 0x000000001b5f2d09 clickhouse-server`DB::InterpreterSelectWithUnionQuery::buildCurrentChildInterpreter(this=0x00007fff1e461700, ast_ptr_=std::__1::shared_ptr<DB::IAST>::element_type @ 0x00007fff1e48e198 strong=1 weak=2, current_required_result_column_names=size=0) at InterpreterSelectWithUnionQuery.cpp:223:16
frame #27: 0x000000001b5f23ed clickhouse-server`DB::InterpreterSelectWithUnionQuery::InterpreterSelectWithUnionQuery(this=0x00007fff1e461700, query_ptr_=std::__1::shared_ptr<DB::IAST>::element_type @ 0x00007fff1e48f218 strong=2 weak=2, context_=std::__1::shared_ptr<const DB::Context>::element_type @ 0x00007fff1e492800 strong=5 weak=9, options_=0x00007fff2d1ef588, required_result_column_names=size=0) at InterpreterSelectWithUnionQuery.cpp:140:13
frame #28: 0x000000001b53b830 clickhouse-server`std::__1::__unique_if<DB::InterpreterSelectWithUnionQuery>::__unique_single std::__1::make_unique<DB::InterpreterSelectWithUnionQuery, std::__1::shared_ptr<DB::IAST>&, std::__1::shared_ptr<DB::Context>&, DB::SelectQueryOptions const&>(__args=std::__1::shared_ptr<DB::IAST>::element_type @ 0x00007fff1e48f218 strong=2 weak=2, __args=std::__1::shared_ptr<DB::Context>::element_type @ 0x00007fff1e492800 strong=5 weak=9, __args=0x00007fff2d1ef588) at unique_ptr.h:725:32
frame #29: 0x000000001b539e5d clickhouse-server`DB::InterpreterFactory::get(query=std::__1::shared_ptr<DB::IAST>::element_type @ 0x00007fff1e48f218 strong=2 weak=2, context=std::__1::shared_ptr<DB::Context>::element_type @ 0x00007fff1e492800 strong=5 weak=9, options=0x00007fff2d1ef588) at InterpreterFactory.cpp:122:16
frame #30: 0x000000001b961577 clickhouse-server`DB::executeQueryImpl(begin="select testtrue(2);", end="", context=std::__1::shared_ptr<DB::Context>::element_type @ 0x00007fff1e492800 strong=5 weak=9, internal=false, stage=Complete, istr=0x0000000000000000) at executeQuery.cpp:658:27
frame #31: 0x000000001b95ee64 clickhouse-server`DB::executeQuery(query="select testtrue(2);", context=std::__1::shared_ptr<DB::Context>::element_type @ 0x00007fff1e492800 strong=5 weak=9, internal=false, stage=Complete) at executeQuery.cpp:1067:30
frame #32: 0x000000001c590f25 clickhouse-server`DB::TCPHandler::runImpl(this=0x00007fff1e46e000) at TCPHandler.cpp:332:24
frame #33: 0x000000001c59f9e5 clickhouse-server`DB::TCPHandler::run(this=0x00007fff1e46e000) at TCPHandler.cpp:1781:9
frame #34: 0x0000000023091f79 clickhouse-server`Poco::Net::TCPServerConnection::start(this=0x00007fff1e46e000) at TCPServerConnection.cpp:43:3
frame #35: 0x0000000023092786 clickhouse-server`Poco::Net::TCPServerDispatcher::run(this=0x00007fff26648600) at TCPServerDispatcher.cpp:115:20
frame #36: 0x00000000232d2a94 clickhouse-server`Poco::PooledThread::run(this=0x00007ffff702df80) at ThreadPool.cpp:199:14
frame #37: 0x00000000232cf5ba clickhouse-server`Poco::(anonymous namespace)::RunnableHolder::run(this=0x00007ffff7001330) at Thread.cpp:55:11
frame #38: 0x00000000232ce39e clickhouse-server`Poco::ThreadImpl::runnableEntry(pThread=0x00007ffff702dfb8) at Thread_POSIX.cpp:345:27
frame #39: 0x00007ffff7dfdb43 libc.so.6`start_thread(arg=<unavailable>) at pthread_create.c:442:8
frame #40: 0x00007ffff7e8fa00 libc.so.6`__clone3 at clone3.S:81

How a function is implemented

All functions inherit from IFunction.

## ClickHouse/src/Functions/IFunction.h

class IFunction
{
public:
    virtual ~IFunction() = default;

    virtual String getName() const = 0;

    virtual ColumnPtr executeImpl(const ColumnsWithTypeAndName & arguments, const DataTypePtr & result_type, size_t input_rows_count) const = 0;

    virtual ColumnPtr executeImplDryRun(const ColumnsWithTypeAndName & arguments, const DataTypePtr & result_type, size_t input_rows_count) const
    {
        return executeImpl(arguments, result_type, input_rows_count);
    }
    ...
};

using FunctionPtr = std::shared_ptr<IFunction>;

The core is the virtual method executeImpl.
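
To make the contract concrete, here is a hypothetical direct subclass (a sketch, not code from the repository; a real IFunction also has further virtual members such as getNumberOfArguments and getReturnTypeImpl, shown here with plausible signatures):

/// Assumes the usual ClickHouse headers (DataTypes/DataTypesNumber.h etc.)
/// and lives in namespace DB.
class FunctionAlwaysOne : public IFunction
{
public:
    static constexpr auto name = "alwaysOne"; /// hypothetical function name

    String getName() const override { return name; }
    size_t getNumberOfArguments() const override { return 0; }

    DataTypePtr getReturnTypeImpl(const DataTypes &) const override
    {
        return std::make_shared<DataTypeUInt8>();
    }

    ColumnPtr executeImpl(const ColumnsWithTypeAndName &, const DataTypePtr & result_type, size_t input_rows_count) const override
    {
        /// One constant UInt8 value 1 per input row.
        return result_type->createColumnConst(input_rows_count, 1u);
    }
};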

Now look at the template class that provides the actual implementation:

## ClickHouse/src/Functions/FunctionNumericPredicate.h
ColumnPtr executeImpl(const ColumnsWithTypeAndName & arguments, const DataTypePtr &, size_t /*input_rows_count*/) const override
{
    const auto * in = arguments.front().column.get();

    ColumnPtr res;
    if (!((res = execute<UInt8>(in))
        || (res = execute<UInt16>(in))
        || (res = execute<UInt32>(in))
        || (res = execute<UInt64>(in))
        || (res = execute<Int8>(in))
        || (res = execute<Int16>(in))
        || (res = execute<Int32>(in))
        || (res = execute<Int64>(in))
        || (res = execute<Float32>(in))
        || (res = execute<Float64>(in))))
        throw Exception{"Illegal column " + in->getName() + " of first argument of function " + getName(), ErrorCodes::ILLEGAL_COLUMN};

    return res;
}

template <typename T>
ColumnPtr execute(const IColumn * in_untyped) const
{
    if (const auto in = checkAndGetColumn<ColumnVector<T>>(in_untyped))
    {
        const auto size = in->size();

        auto out = ColumnUInt8::create(size);

        const auto & in_data = in->getData();
        auto & out_data = out->getData();

        for (const auto i : collections::range(0, size))
            out_data[i] = Impl::execute(in_data[i]);

        return out;
    }

    return nullptr;
}
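
This matches the call stack above: frame #0 is execute<char8_t> (ClickHouse defines UInt8 as char8_t), so the literal 2 arrives as a ColumnVector<UInt8>, the very first execute<UInt8> probe succeeds, and the remaining instantiations are short-circuited away.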

The lifecycle of a ClickHouse request

· 2 min read

In our current workload we use curl to send requests to ClickHouse and write data into it, so I was curious about the entire lifecycle of a ClickHouse request.

The request

Parsing the HTTP request

The raw request

echo -ne '1,Hello\n2,World\n' | curl -sSF 'file=@-' "http://localhost:8123/?query=SELECT+*+FROM+file&file_format=CSV&file_types=UInt8,String";
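
curl's -F flag encodes the body as multipart/form-data, which is why the buffer below starts with a boundary line and a Content-Disposition header, and why HTMLForm::readMultipart appears in the call stack. The query reads from the external table named file, whose format and column types are supplied through the file_format and file_types URL parameters.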

The request as it arrives in ClickHouse

(lldb) p in
(DB::PeekableReadBuffer) $3 = {
DB::BufferWithOwnMemory<DB::ReadBuffer> = {
DB::ReadBuffer = {
DB::BufferBase = {
pos = 0x0000398b7c4ea0fe "--------------------------0c8b1c0a5e3c9b36\r\nContent-Disposition: form-data; name=\"file\"; filename=\"-\"\r\n\r\n1,Hello\n2,World\n\r\n--------------------------0c8b1c0a5e3c9b36--\r\n"
bytes = 0
working_buffer = (begin_pos = "--------------------------0c8b1c0a5e3c9b36\r\nContent-Disposition: form-data; name=\"file\"; filename=\"-\"\r\n\r\n1,Hello\n2,World\n\r\n--------------------------0c8b1c0a5e3c9b36--\r\n", end_pos = "")
internal_buffer = (begin_pos = "--------------------------0c8b1c0a5e3c9b36\r\nContent-Disposition: form-data; name=\"file\"; filename=\"-\"\r\n\r\n1,Hello\n2,World\n\r\n--------------------------0c8b1c0a5e3c9b36--\r\n", end_pos = "")
padded = false
}
nextimpl_working_buffer_offset = 0
}
memory = (m_capacity = 0, m_size = 0, m_data = 0x0000000000000000, alignment = 0)
}
sub_buf = 0x00007fff2628c080
peeked_size = 0
checkpoint = Has Value=false {}
checkpoint_in_own_memory = false
stack_memory = "'\xf9\f\0\0\0\0(\0\0\0\0\0\0\0 "
use_stack_memory = true
}
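
PeekableReadBuffer wraps the underlying buffer (sub_buf) and can set a checkpoint and roll back to it, which, roughly speaking, lets the multipart parser probe for the boundary string without irrevocably consuming body bytes.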
(lldb) bt
* thread #4, name = 'HTTPHandler', stop reason = step over
* frame #0: 0x000000001c5a98dc clickhouse-server`DB::HTMLForm::MultipartReadBuffer::readLine(this=0x00007fff2c9eede8, append_crlf=true) at HTMLForm.cpp:271:9
frame #1: 0x000000001c5a95df clickhouse-server`DB::HTMLForm::MultipartReadBuffer::skipToNextBoundary(this=0x00007fff2c9eede8) at HTMLForm.cpp:253:21
frame #2: 0x000000001c5a8ad4 clickhouse-server`DB::HTMLForm::readMultipart(this=0x00007fff2c9f0e10, in_=0x00007fff2628c080, handler=0x00007fff2c9ef1f0) at HTMLForm.cpp:186:13
frame #3: 0x000000001c5a7e39 clickhouse-server`DB::HTMLForm::load(this=0x00007fff2c9f0e10, request=0x00007fff2c9f1438, requestBody=0x00007fff2628c080, handler=0x00007fff2c9ef1f0) at HTMLForm.cpp:99:13
frame #4: 0x000000001d3ba404 clickhouse-server`DB::DynamicQueryHandler::getQuery(this=0x00007fff262b4000, request=0x00007fff2c9f1438, params=0x00007fff2c9f0e10, context=std::__1::shared_ptr<DB::Context>::element_type @ 0x00007fff2628f800 strong=2 weak=6) at HTTPHandler.cpp:1032:12
frame #5: 0x000000001d3b5ed4 clickhouse-server`DB::HTTPHandler::processQuery(this=0x00007fff262b4000, request=0x00007fff2c9f1438, params=0x00007fff2c9f0e10, response=0x00007fff2c9f14f0, used_output=0x00007fff2c9f0ec8, query_scope= Has Value=true ) at HTTPHandler.cpp:764:26
frame #6: 0x000000001d3b90de clickhouse-server`DB::HTTPHandler::handleRequest(this=0x00007fff262b4000, request=0x00007fff2c9f1438, response=0x00007fff2c9f14f0) at HTTPHandler.cpp:960:9
frame #7: 0x000000001d3f09f7 clickhouse-server`DB::HTTPServerConnection::run(this=0x00007fff2628c000) at HTTPServerConnection.cpp:65:34
frame #8: 0x000000002308f119 clickhouse-server`Poco::Net::TCPServerConnection::start(this=0x00007fff2628c000) at TCPServerConnection.cpp:43:3
frame #9: 0x000000002308f926 clickhouse-server`Poco::Net::TCPServerDispatcher::run(this=0x00007fff29fa8800) at TCPServerDispatcher.cpp:115:20
frame #10: 0x00000000232cfc34 clickhouse-server`Poco::PooledThread::run(this=0x00007ffff702e200) at ThreadPool.cpp:199:14
frame #11: 0x00000000232cc75a clickhouse-server`Poco::(anonymous namespace)::RunnableHolder::run(this=0x00007ffff7001350) at Thread.cpp:55:11
frame #12: 0x00000000232cb53e clickhouse-server`Poco::ThreadImpl::runnableEntry(pThread=0x00007ffff702e238) at Thread_POSIX.cpp:345:27
frame #13: 0x00007ffff7dfeb43 libc.so.6`start_thread(arg=<unavailable>) at pthread_create.c:442:8
frame #14: 0x00007ffff7e90a00 libc.so.6`__clone3 at clone3.S:81

Building the syntax tree:

    std::tie(ast, streams) = executeQueryImpl(begin, end, context, false, QueryProcessingStage::Complete, &istr);
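
Inside executeQueryImpl the query text is first parsed into an AST; the ast returned through std::tie above is the result of that step. A minimal standalone sketch of the parsing call (parseQuery and ParserQuery are real ClickHouse APIs; the empty description and the size/depth limits here are arbitrary):

#include <Parsers/ParserQuery.h>
#include <Parsers/parseQuery.h>

#include <string>

DB::ASTPtr parseOneQuery(const std::string & query)
{
    const char * begin = query.data();
    const char * end = begin + query.size();

    DB::ParserQuery parser(end);
    /// Throws DB::Exception with a readable syntax error on failure.
    return DB::parseQuery(parser, begin, end, /*description=*/"", /*max_query_size=*/0, /*max_parser_depth=*/1000);
}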