脚本怎么读取日志多个json

脚本高效读取日志中多个JSON对象的方法与实践

在软件开发和系统运维中,日志分析是排查问题、监控系统性能和了解用户行为的关键环节，现代应用程序常常将日志结构化为JSON格式，因为它易于机器解析且包含丰富的上下文信息，当日志文件包含多个连续或分散的JSON对象时，如何高效、准确地用脚本读取并处理它们，成为了一个常见的需求，本文将探讨几种常见的场景及对应的脚本实现方法，帮助您轻松应对这一挑战。

理解日志中多个JSON的存储方式

在开始编写脚本之前,首先要明确日志文件中多个JSON对象的存储形式，主要有以下几种情况：

每行一个JSON对象（JSON Lines / .jsonl）：这是最理想也最常见的格式，每个JSON对象独占一行，行与行之间由换行符分隔。

{"timestamp": "2023-10-27T10:00:00Z", "level": "info", "message": "User logged in"}
{"timestamp": "2023-10-27T10:00:01Z", "level": "error", "message": "Database connection failed", "error_code": 500}
{"timestamp": "2023-10-27T10:00:02Z", "level": "info", "message": "Processing order"}

多个JSON对象连续存储在同一行：较少见，但可能发生，JSON对象之间可能没有分隔符或使用特定分隔符（如逗号，但这不符合JSON标准，除非是数组的一部分）。
多个JSON对象连续存储在多行，没有严格换行分隔：一个较大的JSON对象跨越多行，多个这样的对象紧挨着。

JSON数组格式：整个日志文件是一个JSON数组，每个元素是一个日志对象。

[
  {"timestamp": "2023-10-27T10:00:00Z", "level": "info", "message": "User logged in"},
  {"timestamp": "2023-10-27T10:00:01Z", "level": "error", "message": "Database connection failed", "error_code": 500},
  {"timestamp": "2023-10-27T10:00:02Z", "level": "info", "message": "Processing order"}
]

针对以上不同情况,脚本读取策略也会有所不同，本文将重点介绍最常用的每行一个JSON对象的处理方式，并简要提及其他情况的处理思路。

脚本读取多个JSON对象的方法

我们将以几种主流的脚本语言为例,介绍如何读取日志中的多个JSON对象。

每行一个JSON对象（JSON Lines）

这是最简单高效的情况,因为每一行都是一个完整的、独立的JSON。

Python

Python的json模块是处理JSON的利器，结合文件读取，可以轻松实现。

import json
def read_json_logs(file_path):
    """读取每行一个JSON对象的日志文件"""
    json_objects = []
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if line:  # 跳过空行
                    try:
                        json_obj = json.loads(line)
                        json_objects.append(json_obj)
                        # 在这里处理每个json_obj，例如提取字段、过滤、统计等
                        # print(f"Timestamp: {json_obj.get('timestamp')}, Message: {json_obj.get('message')}")
                    except json.JSONDecodeError as e:
                        print(f"Error decoding JSON from line: {line}. Error: {e}")
    except FileNotFoundError:
        print(f"Error: Log file not found at {file_path}")
    except Exception as e:
        print(f"An error occurred: {e}")
    return json_objects
# 使用示例
log_file_path = 'app.log'
logs = read_json_logs(log_file_path)
# 现在logs是一个列表，包含了所有有效的JSON对象
# 可以进一步处理logs，
for log in logs:
    if log.get('level') == 'error':
        print(f"Found error log: {log.get('message')}")

说明：

with open(...) 确保文件正确关闭。
for line in f: 逐行读取文件，内存效率高。
json.loads(line) 将每一行的字符串解析为Python字典。
try-except 用于捕获JSON解析错误，避免因个别格式错误的行导致整个脚本中断。

Bash (结合 jq)

如果是在Linux/Unix环境下，Bash脚本结合jq工具（一个轻量级、灵活的命令行JSON处理器）会非常高效。

首先确保系统已安装jq（sudo apt-get install jq 或 sudo yum install jq）。

#!/bin/bash
log_file="app.log"
# 检查文件是否存在
if [ ! -f "$log_file" ]; then
    echo "Error: Log file not found at $log_file"
    exit 1
fi
# 逐行读取日志文件，并用jq解析
# -r 选项表示输出raw string，避免引号
while IFS= read -r line; do
    # 跳过空行
    if [ -n "$line" ]; then
        # 使用jq解析并提取特定字段（例如timestamp和message）
        timestamp=$(echo "$line" | jq -r '.timestamp // "N/A"')
        message=$(echo "$line" | jq -r '.message // "N/A"')
        level=$(echo "$line" | jq -r '.level // "N/A"')
        echo "Timestamp: $timestamp, Level: $level, Message: $message"
        # 可以在这里添加条件判断，例如只处理error级别的日志
        if [ "$level" == "error" ]; then
            echo "!!! ERROR LOG DETECTED: $message"
        fi
    fi
done < "$log_file"

说明：

while IFS= read -r line; done < "$log_file" 逐行读取文件。
echo "$line" | jq -r '.field' 使用jq提取JSON字段，// "N/A"表示如果字段不存在则返回"N/A"。
-r选项让jq输出原始字符串，而不是JSON字符串（不带引号）。

JavaScript (Node.js)

Node.js同样非常适合处理日志文件，其fs模块提供了文件系统操作能力。

const fs = require('fs');
const readline = require('readline');
function readJsonLogs(filePath) {
    const jsonObjects = [];
    const fileStream = fs.createReadStream(filePath);
    const rl = readline.createInterface({
        input: fileStream,
        crlfDelay: Infinity // 处理不同操作系统的换行符
    });
    rl.on('line', (line) => {
        line = line.trim();
        if (line) {
            try {
                const jsonObj = JSON.parse(line);
                jsonObjects.push(jsonObj);
                // 处理每个jsonObj
                // console.log(`Timestamp: ${jsonObj.timestamp}, Message: ${jsonObj.message}`);
            } catch (error) {
                console.error(`Error parsing JSON: ${line}`, error.message);
            }
        }
    });
    rl.on('close', () => {
        console.log('Finished reading log file.');
        // 在这里处理所有收集到的jsonObjects
        processLogs(jsonObjects);
    });
    rl.on('error', (error) => {
        console.error('Error reading file:', error);
    });
}
function processLogs(logs) {
    logs.forEach(log => {
        if (log.level === 'error') {
            console.log(`Found error log: ${log.message}`);
        }
    });
}
// 使用示例
const logFilePath = 'app.log';
readJsonLogs(logFilePath);

说明：

使用readline模块逐行读取大文件，避免内存问题。
JSON.parse(line) 将字符串解析为JavaScript对象。
通过事件监听（'line', 'close', 'error'）来处理读取过程。

其他存储方式的简要处理思路

JSON数组格式：
- Python：直接使用json.load(f)读取整个文件，得到一个列表，然后遍历这个列表。
```
with open('array_logs.json', 'r', encoding='utf-8') as f:
    log_array = json.load(f)
    for log_obj in log_array:
        # 处理每个log_obj
        pass
```
- Bash/jq：可以直接用jq '.' array_logs.json来查看或操作整个数组，然后用jq '.[]' array_logs.json来逐个输出数组元素，再用管道传递给其他命令处理。
- Node.js：类似Python，fs.readFileSync或`fs