Clickhouse log aggregation using Ansible
First Clone the project :)
git clone https://github.com/anvaari/ClickSight.git
cd ClickSight- Prepare Inventory file and Servers
- Edit hosts and replace
{node-1-address}and{node-2-address}with Clickhouse addresses.- You can use either the name or the IPv4 or IPv6 address.
- This playbook suppose Clickhouse error log place in
/var/log/clickhouse-server/clickhouse-server.err.log.- If it's different in your case, open issue to provide option to get it as variable.
- Also replace
{user}with user which have sudo access on Clickhouse servers.- Sudo user needed in case of
{user}hasn't read permission to read/var/log/clickhouse-server.err.log. If it has, you can remove allbecome: truefrom playbook and remove-Kfrom command.
- Sudo user needed in case of
- It's better to add your public key to authorized_key of Clickhouse servers so you don't need to enter password every time. See how
2.Install Ansible
You can Install it using official document. This playbook developed by Ansible core 2.15.0. And recommended to use this version.
ClickSight works in five mode for each node:
Timeline of all Fatal Errors in clickhouse-server.error.log.
-
Variables:
mode: For this, mode must befatal. Default =fatallines_after: Control number of lines you want to see after the line contains<Fatal>. Default = 20
-
Usage:
Show
Fatallogs and 35 lines after it.ansible-playbook -K -i hosts click-sight-playbook.yaml -e '{"mode":"fatal","lines_after":"35"}'Notes:
- Result will save in
{clickhouse_node_name}_clickhouse_fatal_errors.txt - You need enter
{user}password after executing command
- Result will save in
Timeline of Error log in clickhouse-server.err.log from from_datetime

-
Variables:
mode: For this, mode must beerror_timeline. Default =fatalfrom_datetime: Log after this time will aggregate and show. Default = 2 hours before now- Format must be
YYYY.MM.DD HH:MM - Important: If there isn't any
<Error>in given time, result will be empty so try to provide broad range like2023.07.14 15(It search for error which its line starts with2023.07.14 15So most likely there is error in that hour. But if you provide2023.07.14 15:00, it's more likely to there isn't any Error in that minute, so result will be empty!)
- Format must be
-
Usage: Show
Errorlogs from2023-07-14 12:00ansible-playbook -K -i hosts click-sight-playbook.yaml -e '{"mode":"error_timeline","from_datetime":"2023.07.14 12"}'Notes:
- Result will save in
{clickhouse_node_name}_clickhouse_error_timeline.txt - Result will group by minutes and sorted by time of occurrence
- Result will save in
Timeline of errors associated with query_id from from_datetime

-
Variables:
mode: For this, mode must bequeries_error_timeline. Default =fatalfrom_datetime: See detail of this variable in documentation of mode = error_timeline
-
Usage: Show
Errorlogs associated withquery_idfrom2023-07-14 12:00ansible-playbook -K -i hosts click-sight-playbook.yaml -e '{"mode":"queries_error_timeline","from_datetime":"2023.07.14 12"}'Notes:
- Result will save in
{clickhouse_node_name}_clickhouse_queries_error_timeline.txt - Result will group by minutes and sorted by time of occurrence
- You can see full detail about specific
query_idby executing this query in{clickhouse_node_name}:select * from system.query_log where query_id = {query_id}
- Result will save in
Timeline of all errors associated with given query_id from from_datetime

-
Variables:
mode: For this, mode must begiven_query_id_error_timeline. Default =fatalquery_id: query_id you want to have its error timeline. You can get one (query_id) from mode =queries_error_timeline. Default = "" (Default will show allErrorlevel logs which not associated with specific query)- Note that
query_idfrom node_x only valid for this node. You can't rely on result in another node (query is different in that node)
- Note that
from_datetime: See detail of this variable in documentation of mode = error_timeline
-
Usage: Show all
Errorlevel logs for7d6601eb-633d-4dc1-9863-65f0e3483717query incl-1from2023-07-14 12:00ansible-playbook -K -i hosts -l cl-1 click-disaster-playbook.yaml -e '{"mode":"given_query_id_error_timeline","from_datetime":"2023.07.14 12","query_id":"7d6601eb-633d-4dc1-9863-65f0e3483717"}'Notes:
- Result will save in
cl-1_clickhouse_given_query_id_error_timeline.txt - Result will group by minutes and sorted by time of occurrence
- You can see full detail about this
query_idby executing this query in node-1 :select * from system.query_log where query_id = 7d6601eb-633d-4dc1-9863-65f0e3483717
- Result will save in
Timeline of given error from from_datetime with more detail (contains full error line)

-
Variables:
mode: For this, mode must begiven_error_timeline. Default =fatalerror: Error you want to have its timeline with full detail. You can get one from mode =error_timeline. Default = "StorageKafka"from_datetime: See detail of this variable in documentation of mode = error_timeline
-
Usage: Show all logs which its error is
ServerErrorHandlerfrom2023-07-14 12:00with more detail.ansible-playbook -K -i hosts click-sight-playbook.yaml -e '{"mode":"given_error_timeline","from_datetime":"2023.07.14 12","error":"ServerErrorHandler"}'Notes:
- Result save in
{clickhouse_node_name}_clickhouse_given_error_timeline.txt - Result not grouped but sorted by time of occurrence
- Result save in