14 posts tagged with "clickhouse"

clickhouse: the lifecycle of a request

· 2 min read

Our current pipeline uses curl to send requests to ClickHouse and write data into it, so I was curious what the full lifecycle of a request looks like inside ClickHouse.

The request

Parsing the HTTP request

The raw request:

echo -ne '1,Hello\n2,World\n' | curl -sSF 'file=@-' "http://localhost:8123/?query=SELECT+*+FROM+file&file_format=CSV&file_types=UInt8,String";

The request reaches ClickHouse. Because curl's -F flag sends the body as multipart/form-data, that is exactly what shows up in the server's read buffer:

(lldb) p in
(DB::PeekableReadBuffer) $3 = {
  DB::BufferWithOwnMemory<DB::ReadBuffer> = {
    DB::ReadBuffer = {
      DB::BufferBase = {
        pos = 0x0000398b7c4ea0fe "--------------------------0c8b1c0a5e3c9b36\r\nContent-Disposition: form-data; name=\"file\"; filename=\"-\"\r\n\r\n1,Hello\n2,World\n\r\n--------------------------0c8b1c0a5e3c9b36--\r\n"
        bytes = 0
        working_buffer = (begin_pos = "--------------------------0c8b1c0a5e3c9b36\r\nContent-Disposition: form-data; name=\"file\"; filename=\"-\"\r\n\r\n1,Hello\n2,World\n\r\n--------------------------0c8b1c0a5e3c9b36--\r\n", end_pos = "")
        internal_buffer = (begin_pos = "--------------------------0c8b1c0a5e3c9b36\r\nContent-Disposition: form-data; name=\"file\"; filename=\"-\"\r\n\r\n1,Hello\n2,World\n\r\n--------------------------0c8b1c0a5e3c9b36--\r\n", end_pos = "")
        padded = false
      }
      nextimpl_working_buffer_offset = 0
    }
    memory = (m_capacity = 0, m_size = 0, m_data = 0x0000000000000000, alignment = 0)
  }
  sub_buf = 0x00007fff2628c080
  peeked_size = 0
  checkpoint = Has Value=false {}
  checkpoint_in_own_memory = false
  stack_memory = "'\xf9\f\0\0\0\0(\0\0\0\0\0\0\0 "
  use_stack_memory = true
}
(lldb) bt
* thread #4, name = 'HTTPHandler', stop reason = step over
* frame #0: 0x000000001c5a98dc clickhouse-server`DB::HTMLForm::MultipartReadBuffer::readLine(this=0x00007fff2c9eede8, append_crlf=true) at HTMLForm.cpp:271:9
frame #1: 0x000000001c5a95df clickhouse-server`DB::HTMLForm::MultipartReadBuffer::skipToNextBoundary(this=0x00007fff2c9eede8) at HTMLForm.cpp:253:21
frame #2: 0x000000001c5a8ad4 clickhouse-server`DB::HTMLForm::readMultipart(this=0x00007fff2c9f0e10, in_=0x00007fff2628c080, handler=0x00007fff2c9ef1f0) at HTMLForm.cpp:186:13
frame #3: 0x000000001c5a7e39 clickhouse-server`DB::HTMLForm::load(this=0x00007fff2c9f0e10, request=0x00007fff2c9f1438, requestBody=0x00007fff2628c080, handler=0x00007fff2c9ef1f0) at HTMLForm.cpp:99:13
frame #4: 0x000000001d3ba404 clickhouse-server`DB::DynamicQueryHandler::getQuery(this=0x00007fff262b4000, request=0x00007fff2c9f1438, params=0x00007fff2c9f0e10, context=std::__1::shared_ptr<DB::Context>::element_type @ 0x00007fff2628f800 strong=2 weak=6) at HTTPHandler.cpp:1032:12
frame #5: 0x000000001d3b5ed4 clickhouse-server`DB::HTTPHandler::processQuery(this=0x00007fff262b4000, request=0x00007fff2c9f1438, params=0x00007fff2c9f0e10, response=0x00007fff2c9f14f0, used_output=0x00007fff2c9f0ec8, query_scope= Has Value=true ) at HTTPHandler.cpp:764:26
frame #6: 0x000000001d3b90de clickhouse-server`DB::HTTPHandler::handleRequest(this=0x00007fff262b4000, request=0x00007fff2c9f1438, response=0x00007fff2c9f14f0) at HTTPHandler.cpp:960:9
frame #7: 0x000000001d3f09f7 clickhouse-server`DB::HTTPServerConnection::run(this=0x00007fff2628c000) at HTTPServerConnection.cpp:65:34
frame #8: 0x000000002308f119 clickhouse-server`Poco::Net::TCPServerConnection::start(this=0x00007fff2628c000) at TCPServerConnection.cpp:43:3
frame #9: 0x000000002308f926 clickhouse-server`Poco::Net::TCPServerDispatcher::run(this=0x00007fff29fa8800) at TCPServerDispatcher.cpp:115:20
frame #10: 0x00000000232cfc34 clickhouse-server`Poco::PooledThread::run(this=0x00007ffff702e200) at ThreadPool.cpp:199:14
frame #11: 0x00000000232cc75a clickhouse-server`Poco::(anonymous namespace)::RunnableHolder::run(this=0x00007ffff7001350) at Thread.cpp:55:11
frame #12: 0x00000000232cb53e clickhouse-server`Poco::ThreadImpl::runnableEntry(pThread=0x00007ffff702e238) at Thread_POSIX.cpp:345:27
frame #13: 0x00007ffff7dfeb43 libc.so.6`start_thread(arg=<unavailable>) at pthread_create.c:442:8
frame #14: 0x00007ffff7e90a00 libc.so.6`__clone3 at clone3.S:81
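
Reading the backtrace bottom-up: Poco's thread pool dispatches the connection to DB::HTTPServerConnection, HTTPHandler::handleRequest processes the query, and HTMLForm::readMultipart peels the multipart/form-data body apart by scanning for the boundary line (skipToNextBoundary / readLine in the top frames). Below is a minimal, self-contained sketch of that boundary scan; the function names and the std::istream handling are mine for illustration only, while the real code in HTMLForm.cpp works on ClickHouse's ReadBuffer with checkpoints:

#include <iostream>
#include <sstream>
#include <string>

// Read one line and strip the trailing '\r' (multipart lines end with CRLF).
static bool readLine(std::istream & in, std::string & line)
{
    if (!std::getline(in, line))
        return false;
    if (!line.empty() && line.back() == '\r')
        line.pop_back();
    return true;
}

int main()
{
    // Boundary and body reconstructed from the lldb buffer dump above.
    std::string boundary = "------------------------0c8b1c0a5e3c9b36";
    std::istringstream body(
        "--" + boundary + "\r\n"
        "Content-Disposition: form-data; name=\"file\"; filename=\"-\"\r\n"
        "\r\n"
        "1,Hello\n2,World\n\r\n"
        "--" + boundary + "--\r\n");

    std::string line;
    // 1. Skip lines until the opening boundary ("--" + boundary), like skipToNextBoundary.
    while (readLine(body, line) && line != "--" + boundary)
        ;
    // 2. Read the part headers up to the blank separator line.
    while (readLine(body, line) && !line.empty())
        std::cout << "header: " << line << '\n';
    // 3. Everything up to the closing boundary ("--" + boundary + "--") is the payload,
    //    i.e. the CSV rows that the query will read from.
    while (readLine(body, line) && line != "--" + boundary + "--")
        if (!line.empty())
            std::cout << "data:   " << line << '\n';
}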

Generating the syntax tree:

    std::tie(ast, streams) = executeQueryImpl(begin, end, context, false, QueryProcessingStage::Complete, &istr);
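
From here executeQueryImpl parses the query text into an AST before building the execution pipeline. As a toy illustration of what "generating the syntax tree" produces (this is not ClickHouse's parser; the real tree is made of IAST subclasses such as ASTSelectQuery), here is a hand-built tree for the SELECT * FROM file query from the curl example:

#include <iostream>
#include <memory>
#include <string>
#include <vector>

// A toy AST node, just to show the shape of what the parse step returns.
struct Node
{
    std::string kind;                          // e.g. "Select", "Asterisk", "Table"
    std::string value;                         // identifier text, if any
    std::vector<std::unique_ptr<Node>> children;
};

static void dump(const Node & n, int depth = 0)
{
    std::cout << std::string(depth * 2, ' ') << n.kind
              << (n.value.empty() ? "" : ": " + n.value) << '\n';
    for (const auto & child : n.children)
        dump(*child, depth + 1);
}

int main()
{
    // Hand-built tree for: SELECT * FROM file
    Node select{"Select", "", {}};
    select.children.push_back(std::make_unique<Node>(Node{"Asterisk", "", {}}));
    select.children.push_back(std::make_unique<Node>(Node{"Table", "file", {}}));
    dump(select);
}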

clickhouse: creating a table

· 2 min read

Background: I'm using ClickHouse at work and wanted to set up my own instance and get a table created successfully.

zookeeper

# switch to the bin directory
cd apache-zookeeper-3.8.0-bin/bin/
# start ZooKeeper
./zkServer.sh

  • Step 3: create the zk node at path /path/to/zookeeper/node

# start zkCli
cd apache-zookeeper-3.8.0-bin/bin/
./zkCli.sh
# zk nodes must be created one level at a time,
# with the command: create <path>
[zk: localhost:2181(CONNECTED) 11] create /path
Created /path
[zk: localhost:2181(CONNECTED) 12] create /path/to
Created /path/to
[zk: localhost:2181(CONNECTED) 13] create /path/to/zookeeper
Created /path/to/zookeeper
[zk: localhost:2181(CONNECTED) 14] create /path/to/zookeeper/node
Created /path/to/zookeeper/node

Creating the table

Before creating the table, the cluster has to be configured. My configuration:

<!-- ZooKeeper configuration -->
<zookeeper>
    <node>
        <host>localhost</host>
        <port>2181</port>
    </node>
    <session_timeout_ms>30000</session_timeout_ms>
    <operation_timeout_ms>10000</operation_timeout_ms>
    <!-- Optional. Chroot suffix. Should exist. -->
    <root>/path/to/zookeeper/node</root>
</zookeeper>
<!-- Macros: the {variables} in the path given at CREATE TABLE time are read from here -->
<macros>
    <cluster>testcluster</cluster>
    <shard>01</shard>
    <replica>example01-01-1</replica>
</macros>
<remote_servers>
    <!-- the cluster is named testcluster; the name is arbitrary -->
    <testcluster>
        <shard>
            <replica>
                <host>localhost</host>
                <port>9000</port>
            </replica>
        </shard>
    </testcluster>
</remote_servers>
<!-- plus the <distributed_ddl> section shown in the error section below -->

The CREATE TABLE statement; the {cluster} in it resolves to testcluster via the macros configured above:

CREATE TABLE test ON CLUSTER `{cluster}`
(
`timestamp` DateTime,
`contractid` UInt32,
`userid` UInt32
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{cluster}/{shard}/default/test', '{replica}')
PARTITION BY toYYYYMM(timestamp)
ORDER BY (contractid, toDate(timestamp), userid)
SAMPLE BY userid

Query id: 56c07fac-9a0b-4b0b-bf8f-fb808ce452e6

Querying zk from ClickHouse (system.zookeeper must be filtered by a path condition in WHERE):

SELECT name, path FROM system.zookeeper WHERE path = '/'

Errors encountered

Error: There is no DistributedDDL configuration in server config

Cause: the ClickHouse config was incomplete; the distributed_ddl section has to be added (this is the part referenced from the cluster config above):

<distributed_ddl>
    <!-- Path in ZooKeeper to queue with DDL queries -->
    <path>/clickhouse/task_queue/ddl</path>
    <cleanup_delay_period>60</cleanup_delay_period>
    <task_max_lifetime>86400</task_max_lifetime>
    <max_tasks_in_queue>1000</max_tasks_in_queue>
</distributed_ddl>

clickhouse: insert and insert_deduplicate

· 3 min read

Reproducing the issue

  • The table is defined as follows:
CREATE TABLE test ON CLUSTER `{cluster}`
(
`timestamp` DateTime,
`contractid` UInt32,
`userid` UInt32
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{cluster}/{shard}/default/test', '{replica}')
PARTITION BY toYYYYMM(timestamp)
ORDER BY (contractid, toDate(timestamp), userid)
SAMPLE BY userid

  • First insert:
insert into test ( userid ,contractid ,  timestamp ) values (1,1,'2022-02-02');

  • The result is a single row:
SELECT *  FROM test;

┌───────────timestamp─┬─contractid─┬─userid─┐
│ 2022-02-02 00:00:00 │          1 │      1 │
└─────────────────────┴────────────┴────────┘

Second insert

insert into test ( userid ,contractid ,  timestamp ) values (1,1,'2022-02-02');

The result is still a single row:

:) insert into test ( userid ,contractid ,  timestamp ) values (1,1,'2022-02-02');

INSERT INTO test (userid, contractid, timestamp) FORMAT Values

Query id: 706e2447-95eb-4515-a7b7-cf363512b673

Ok.

1 row in set. Elapsed: 0.056 sec.

dai-MS-7B89 :) select * from test

SELECT *
FROM test

Query id: 3ba7cd7f-4621-4286-8646-79737ec3e763

┌───────────timestamp─┬─contractid─┬─userid─┐
│ 2022-02-02 00:00:00 │          1 │      1 │
└─────────────────────┴────────────┴────────┘

1 row in set. Elapsed: 0.030 sec.

When the same data is inserted twice, ClickHouse deduplicates the block: the second insert is silently dropped, so only one row ends up in the table.
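
Conceptually the deduplication works like the minimal model below: every inserted block gets an ID derived from its partition and a hash of its data, and a block whose ID has already been seen is ignored. This is only a sketch; the real implementation hashes the block with SipHash128 and keeps recent block IDs in ZooKeeper under the table's replication path, as the log excerpt in the section below shows:

#include <functional>
#include <iostream>
#include <set>
#include <string>

// Stand-in for the set of block IDs that ReplicatedMergeTree keeps in ZooKeeper.
static std::set<std::string> seen_block_ids;

static bool insertBlock(const std::string & partition_id, const std::string & data, bool deduplicate)
{
    if (deduplicate)
    {
        // Block ID = partition ID + hash of the data, so inserting identical data
        // into the same partition twice yields the same ID.
        std::string block_id = partition_id + "_" + std::to_string(std::hash<std::string>{}(data));
        if (!seen_block_ids.insert(block_id).second)
        {
            std::cout << "Block with ID " << block_id << " already exists; ignoring it.\n";
            return false;
        }
    }
    std::cout << "wrote block (" << data.size() << " bytes)\n";
    return true;
}

int main()
{
    insertBlock("202202", "1,1,2022-02-02", true);   // written
    insertBlock("202202", "1,1,2022-02-02", true);   // ignored: same block ID
    insertBlock("202202", "1,1,2022-02-02", false);  // insert_deduplicate=0: written again
}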

How to work around it

ClickHouse exposes the insert_deduplicate setting to control whether deduplication happens:

set insert_deduplicate=0;

After that, re-inserting the same row is no longer dropped by deduplication:

insert into test ( userid ,contractid ,  timestamp ) values (1,1,'2022-02-02');

INSERT INTO test (userid, contractid, timestamp) FORMAT Values

Query id: a8df989b-0b63-4b45-a1b8-22c13b18bf0a

Ok.

1 row in set. Elapsed: 0.070 sec.

dai-MS-7B89 :) select * from test

SELECT *
FROM test

Query id: e077b55e-bfd9-4678-ae46-9fc05714b3f7

┌───────────timestamp─┬─contractid─┬─userid─┐
│ 2022-02-02 00:00:00 │          1 │      1 │
└─────────────────────┴────────────┴────────┘
┌───────────timestamp─┬─contractid─┬─userid─┐
│ 2022-02-02 00:00:00 │          1 │      1 │
└─────────────────────┴────────────┴────────┘

Log and source-code analysis

Log analysis

2022.05.15 23:32:04.515912 [ 68323 ] {64b40d4f-0d00-4747-9af3-4afb56b6a84b} <Trace> MergedBlockOutputStream: filled checksums 202202_2_2_0 (state Temporary)
2022.05.15 23:32:04.517872 [ 68323 ] {64b40d4f-0d00-4747-9af3-4afb56b6a84b} <Debug> default.test (7d656761-7cd0-4866-a43e-f0e4cea97654) (Replicated OutputStream): Wrote block with ID '202202_8166901380224458449_12408515745921908624', 1 rows
2022.05.15 23:32:04.533981 [ 68323 ] {64b40d4f-0d00-4747-9af3-4afb56b6a84b} <Information> default.test (7d656761-7cd0-4866-a43e-f0e4cea97654) (Replicated OutputStream): Block with ID 202202_8166901380224458449_12408515745921908624 already exists locally as part 202202_0_0_0; ignoring it.


Debugging ClickHouse with lldb

lldb ./clickhouse-server
void ReplicatedMergeTreeSink::consume(Chunk chunk)
{
    auto block = getHeader().cloneWithColumns(chunk.detachColumns());

    String block_id;

    if (deduplicate) // set from the insert_deduplicate setting seen above
    {
        String block_dedup_token; // optional user-supplied token mixed into the block ID

        /// We add the hash from the data and partition identifier to deduplication ID.
        /// That is, do not insert the same data to the same partition twice.

        const String & dedup_token = settings.insert_deduplication_token;
        if (!dedup_token.empty())
        {
            /// multiple blocks can be inserted within the same insert query
            /// an ordinal number is added to dedup token to generate a distinctive block id for each block
            block_dedup_token = fmt::format("{}_{}", dedup_token, chunk_dedup_seqnum);
            ++chunk_dedup_seqnum;
        }

        block_id = temp_part.part->getZeroLevelPartBlockID(block_dedup_token);
        LOG_DEBUG(log, "Wrote block with ID '{}', {} rows", block_id, current_block.block.rows());
    }
    else
    {
        LOG_DEBUG(log, "Wrote block with {} rows", current_block.block.rows());
    }
    /// ... (rest of consume() omitted)
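
Tying this back to the log lines above: the ID 202202_8166901380224458449_12408515745921908624 produced by getZeroLevelPartBlockID looks like the partition ID (202202, from PARTITION BY toYYYYMM) followed by the two 64-bit words of the data hash. A reconstruction of that format, inferred only from the observed log output (an assumption, not the verbatim source):

#include <cstdint>
#include <iostream>
#include <string>

// Block ID format as it appears in the log: <partition_id>_<hash_word_0>_<hash_word_1>.
// The two words are assumed to be the halves of a 128-bit hash over the block data.
static std::string makeBlockID(const std::string & partition_id, uint64_t hash_lo, uint64_t hash_hi)
{
    return partition_id + "_" + std::to_string(hash_lo) + "_" + std::to_string(hash_hi);
}

int main()
{
    // Reproduces the ID from the log: 202202_8166901380224458449_12408515745921908624
    std::cout << makeBlockID("202202", 8166901380224458449ULL, 12408515745921908624ULL) << '\n';
}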

clickhouse: compiling and installing

· 2 min read

Build steps

  • Clone the code from GitHub:
git clone https://github.com/ClickHouse/ClickHouse.git
  • Create a build directory and run cmake from inside it:
cd ClickHouse/
mkdir build
cd build
cmake ..
  • The build requires at least gcc-8/g++-8:
 GCC version must be at least 8.  For example, if GCC 8 is available under
gcc-8, g++-8 names, do the following: export CC=gcc-8 CXX=g++-8;

My OS is Ubuntu, so:

$ sudo  apt-get install gcc-8 g++-8
$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 800 --slave /usr/bin/g++ g++ /usr/bin/g++-8
update-alternatives: using /usr/bin/gcc-8 to provide /usr/bin/gcc (gcc) in auto mode

gcc now reports 8.3.0:

$ gcc -v
Using built-in specs.
...
gcc version 8.3.0 (Ubuntu 8.3.0-6ubuntu1~18.04.1)

  • Delete the stale CMake cache:
$ rm -rf   CMakeCache.txt CMakeFiles

Re-running cmake failed with:

  Submodules are not initialized.  Run

git submodule update --init --recursive

So initialize the git submodules:

$ git submodule update --init --recursive

I'm behind v2ray + proxychains4, which just barely managed to pull all the submodules down. Then continue:

cmake -DCMAKE_BUILD_TYPE=Debug  -DCMAKE_INSTALL_PREFIX=/home/ubuntu/click  ..

ninja all

I also found that the build needs plenty of memory and an SSD; otherwise compilation is painfully slow.

Starting with docker-compose

  • Step 1: create the files config.xml and users.xml
  • Step 2: create docker-compose.yml:
version: '3'
services:
  clickhouse-server:
    image: yandex/clickhouse-server
    container_name: clickhouse-server
    hostname: clickhouse-server
    ports:
      - 8123:8123
    expose:
      - 9000
      - 9009
    volumes:
      - ./config.xml:/etc/clickhouse-server/config.xml
      - ./users.xml:/etc/clickhouse-server/users.xml
      - ./data:/var/lib/clickhouse
      - ./log/clickhouse-server.log:/var/log/clickhouse-server/clickhouse-server.log
      - ./log/clickhouse-server.err.log:/var/log/clickhouse-server/clickhouse-server.err.log
  • Step 3: bring it up:
docker-compose up

Connecting with clickhouse-client

docker run -it --rm --link clickhouse-server:clickhouse-server --net clickhouse_default yandex/clickhouse-client --host clickhouse-server --user seluser --password 8d969eef6ecad3c29a3a629280e686cf0c3f5d5a86aff3ca12020c923adc6c92
