Hadoop,  Spring & JVM

记一次 springboot+hive+log4j冲突导致CPU爆满问题

背景

  • 为实现公司实时业务数据流处理微服务(springboot:2.7.13)高压缩比,集成了 apache orc:1.9.3

现象

  • 在本地 IDEA 单测启动发现 @Test 方法死活不进入

分析

  • 开始以为是构建问题,反复尝试多次gradlew clean build ,但重启依然卡死无法进入 @Test 方法。
  • 后来发现 CPU 接近 90% (由于是最新 Apple M3 高配机器,开始都没感觉 CPU 高 (* ̄︶ ̄))

    • 第一反应,大概率是死循环了,于是想看看 GradleWorkerMain 进程的线程状态

      jps
      2305 Launcher
      2306 ExternalJavacProcess
      1446 GradleDaemon
      1386 Main
      1435 RemoteMavenServer36
      2318 Jps
      2303 GradleWorkerMain
      
      jstack 2303 > /tmp/1
      
      vim /tmp/1
    • 似乎只有 C1/C2 CompilerThreadTest worker 一直是 RUNNABLE 状态(刚启动 JVM 的 JITC1/C2编译线程就开始 working 说明调用极其频繁啊),且反复 jstack 了几次也是一样

    ...
    "Service Thread" #10 daemon prio=9 os_prio=31 tid=0x000000011dc6e800 nid=0x4a03 runnable [0x0000000000000000]
       java.lang.Thread.State: RUNNABLE
    
    "C1 CompilerThread3" #9 daemon prio=9 os_prio=31 tid=0x000000011e081000 nid=0x4603 waiting on condition [0x0000000000000000]
       java.lang.Thread.State: RUNNABLE
    
    "C2 CompilerThread2" #8 daemon prio=9 os_prio=31 tid=0x000000011dc59800 nid=0x4503 waiting on condition [0x0000000000000000]
       java.lang.Thread.State: RUNNABLE
    
    "C2 CompilerThread1" #7 daemon prio=9 os_prio=31 tid=0x000000011e080800 nid=0x4d03 waiting on condition [0x0000000000000000]
       java.lang.Thread.State: RUNNABLE
    
    "C2 CompilerThread0" #6 daemon prio=9 os_prio=31 tid=0x000000011421d800 nid=0x4f03 waiting on condition [0x0000000000000000]
       java.lang.Thread.State: RUNNABLE
    
    "Signal Dispatcher" #4 daemon prio=9 os_prio=31 tid=0x000000011e05b000 nid=0x5103 runnable [0x0000000000000000]
       java.lang.Thread.State: RUNNABLE
    ...
    "Test worker" #1 prio=5 os_prio=31 tid=0x000000011d80d000 nid=0x1603 runnable [0x000000016de2e000]
       java.lang.Thread.State: RUNNABLE
      at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1674)
      - locked <0x00000007afab3108> (a java.util.concurrent.ConcurrentHashMap$ReservationNode)
      at org.apache.log4j.Hierarchy.getInstance(Hierarchy.java:110)
      at org.apache.log4j.Hierarchy.getInstance(Hierarchy.java:102)
      at org.apache.log4j.Hierarchy.getLogger(Hierarchy.java:297)
      at org.apache.log4j.LogManager.getLogger(LogManager.java:148)
      at org.slf4j.impl.Reload4jLoggerFactory.getLogger(Reload4jLoggerFactory.java:84)
      at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:363)
      at org.apache.logging.slf4j.SLF4JLoggerContext.getLogger(SLF4JLoggerContext.java:36)
      at org.apache.log4j.Category.(Category.java:170)
      at org.apache.log4j.Logger.(Logger.java:53)
      at org.apache.log4j.Hierarchy$PrivateLoggerAdapter.newLogger(Hierarchy.java:74)
      at org.apache.log4j.Hierarchy.lambda$getInstance$1(Hierarchy.java:110)
      at org.apache.log4j.Hierarchy$$Lambda$300/1839138870.apply(Unknown Source)
      at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
      - locked <0x00000007afab3108> (a java.util.concurrent.ConcurrentHashMap$ReservationNode)
      at org.apache.log4j.Hierarchy.getInstance(Hierarchy.java:110)
      at org.apache.log4j.Hierarchy.getInstance(Hierarchy.java:102)
      at org.apache.log4j.Hierarchy.getLogger(Hierarchy.java:297)
      at org.apache.log4j.LogManager.getLogger(LogManager.java:148)
      at org.slf4j.impl.Reload4jLoggerFactory.getLogger(Reload4jLoggerFactory.java:84)
      at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:363)
      at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:388)
      at org.apache.hadoop.conf.Configuration.(Configuration.java:230)
      at com.xxx.xxx.common.dataformat.orc.OrcJsonHolder.(OrcJsonHolder.java:463)
      at com.xxx.xxx.dataformat.orc.OrcJsonHolderTest.testFastJsonOrcCompression(OrcJsonHolderTest.java:80)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    • 难道是log4j.Hierarchy.getInstance() 死循环了? 得继续尝试使用 arthas 抓个 flamegraph 确认一下
    java -jar arthas-boot.jar 
    [INFO] JAVA_HOME: /usr/local/jdk1.8.0_401.jdk/Contents/Home/jre
    [INFO] arthas-boot version: 3.7.2
    [INFO] Found existing java process, please choose one and input the serial number of the process, eg : 1. Then hit ENTER.
    * [1]: 2323 org.jetbrains.jps.cmdline.Launcher
      [2]: 1446 org.gradle.launcher.daemon.bootstrap.GradleDaemon
      [3]: 1386 com.intellij.idea.Main
      [4]: 1435 org.jetbrains.idea.maven.server.RemoteMavenServer36
      [5]: 2303 worker.org.gradle.process.internal.worker.GradleWorkerMain
    5
    [INFO] arthas home: /Users/jw/.arthas/lib/3.7.2/arthas
    [INFO] Try to attach process 2303
    [INFO] Attach process 2303 success.
    [INFO] arthas-client connect 127.0.0.1 3658
      ,---.  ,------. ,--------.,--.  ,--.  ,---.   ,---.                           
     /  O  \ |  .--. ''--.  .--'|  '--'  | /  O  \ '   .-'                          
    |  .-.  ||  '--'.'   |  |   |  .--.  ||  .-.  |.  -.                          
    |  | |  ||  |\  \    |  |   |  |  |  ||  | |  |.-'    |                         
    --' --'--' '--'   --'   --'  --'--' --'`-----'                          
    
    wiki       https://arthas.aliyun.com/doc                                        
    tutorials  https://arthas.aliyun.com/doc/arthas-tutorials.html                  
    version    3.7.2                                                                
    main_class worker.org.gradle.process.internal.worker.GradleWorkerMain           
    pid        2303                                                                 
    time       2024-04-14 22:14:36                                                  
    
    [arthas@2303]$ profiler start
    Profiling started
    [arthas@2303]$ profiler stop
    OK
    profiler output file: ~/arthas-output/20240414-221449.html
    [arthas@2303]$ 
    • 好家伙,果然调用链只到 log4j.Hierarchy.getInstance() 就死了

    hive+springboot+log4j+cpu+high.png

解决

  • 根据如上分析 100% 确认就是 log4j-1.2-api-2.17.2.jar!org.apache.log4j.Hierarchy.getInstance() 死循环了,这种基础库一般不会出现这种bug,大概率是冲突了,于是看看依赖树
./gradlew dependencies > /tmp/2
vim /tmp/2
  • 果然 hive-exec:3.1.13 依赖的 log4j-1.2-api:2.17.1springboot:2.7.13log4j-1.2-api:2.17.2 覆盖了
...
--- org.springframework.boot:spring-boot-dependencies:{strictly 2.7.13} -> 2.7.13
     +--- ...
     +--- org.apache.logging.log4j:log4j-1.2-api:2.17.2 (c)
...
+--- org.apache.hive:hive-exec:3.1.3
|    +--- org.apache.hive:hive-llap-tez:3.1.3
|    |    +--- org.apache.hive:hive-common:3.1.3
...
|    |    |    +--- org.apache.logging.log4j:log4j-1.2-api:2.17.1 -> 2.17.2
  • 解决方案
    • 方案1:直接 hive-exec:3.1.13 中彻底排除掉 log4j-1.2-api 库的依赖(都是 slf4j 体系,排除它不影响运行)。 —- 亲测成功
    • 方案2:重新引入并强制指定版本 org.apache.logging.log4j:log4j-1.2-api:**2.17.1** 即可。 —- 可行

留言

您的电子邮箱地址不会被公开。