记一次 springboot+hive+log4j冲突导致CPU爆满问题
背景
- 为实现公司实时业务数据流处理微服务
(springboot:2.7.13)
高压缩比,集成了apache orc:1.9.3
现象
- 在本地 IDEA 单测启动发现
@Test
方法死活不进入
分析
- 开始以为是构建问题,反复尝试多次
gradlew clean build
,但重启依然卡死无法进入@Test
方法。 -
后来发现 CPU 接近
90%
(由于是最新 Apple M3 高配机器,开始都没感觉 CPU 高 (* ̄︶ ̄))-
第一反应,大概率是死循环了,于是想看看
GradleWorkerMain
进程的线程状态jps 2305 Launcher 2306 ExternalJavacProcess 1446 GradleDaemon 1386 Main 1435 RemoteMavenServer36 2318 Jps 2303 GradleWorkerMain jstack 2303 > /tmp/1 vim /tmp/1
-
似乎只有
C1/C2 CompilerThread
和Test worker
一直是RUNNABLE
状态(刚启动 JVM 的JIT
的C1/C2
编译线程就开始 working 说明调用极其频繁啊),且反复jstack
了几次也是一样
... "Service Thread" #10 daemon prio=9 os_prio=31 tid=0x000000011dc6e800 nid=0x4a03 runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE "C1 CompilerThread3" #9 daemon prio=9 os_prio=31 tid=0x000000011e081000 nid=0x4603 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "C2 CompilerThread2" #8 daemon prio=9 os_prio=31 tid=0x000000011dc59800 nid=0x4503 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "C2 CompilerThread1" #7 daemon prio=9 os_prio=31 tid=0x000000011e080800 nid=0x4d03 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "C2 CompilerThread0" #6 daemon prio=9 os_prio=31 tid=0x000000011421d800 nid=0x4f03 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Signal Dispatcher" #4 daemon prio=9 os_prio=31 tid=0x000000011e05b000 nid=0x5103 runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE ... "Test worker" #1 prio=5 os_prio=31 tid=0x000000011d80d000 nid=0x1603 runnable [0x000000016de2e000] java.lang.Thread.State: RUNNABLE at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1674) - locked <0x00000007afab3108> (a java.util.concurrent.ConcurrentHashMap$ReservationNode) at org.apache.log4j.Hierarchy.getInstance(Hierarchy.java:110) at org.apache.log4j.Hierarchy.getInstance(Hierarchy.java:102) at org.apache.log4j.Hierarchy.getLogger(Hierarchy.java:297) at org.apache.log4j.LogManager.getLogger(LogManager.java:148) at org.slf4j.impl.Reload4jLoggerFactory.getLogger(Reload4jLoggerFactory.java:84) at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:363) at org.apache.logging.slf4j.SLF4JLoggerContext.getLogger(SLF4JLoggerContext.java:36) at org.apache.log4j.Category.
(Category.java:170) at org.apache.log4j.Logger. (Logger.java:53) at org.apache.log4j.Hierarchy$PrivateLoggerAdapter.newLogger(Hierarchy.java:74) at org.apache.log4j.Hierarchy.lambda$getInstance$1(Hierarchy.java:110) at org.apache.log4j.Hierarchy$$Lambda$300/1839138870.apply(Unknown Source) at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660) - locked <0x00000007afab3108> (a java.util.concurrent.ConcurrentHashMap$ReservationNode) at org.apache.log4j.Hierarchy.getInstance(Hierarchy.java:110) at org.apache.log4j.Hierarchy.getInstance(Hierarchy.java:102) at org.apache.log4j.Hierarchy.getLogger(Hierarchy.java:297) at org.apache.log4j.LogManager.getLogger(LogManager.java:148) at org.slf4j.impl.Reload4jLoggerFactory.getLogger(Reload4jLoggerFactory.java:84) at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:363) at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:388) at org.apache.hadoop.conf.Configuration. (Configuration.java:230) at com.xxx.xxx.common.dataformat.orc.OrcJsonHolder. (OrcJsonHolder.java:463) at com.xxx.xxx.dataformat.orc.OrcJsonHolderTest.testFastJsonOrcCompression(OrcJsonHolderTest.java:80) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) - 难道是
log4j.Hierarchy.getInstance()
死循环了? 得继续尝试使用arthas
抓个flamegraph
确认一下
java -jar arthas-boot.jar [INFO] JAVA_HOME: /usr/local/jdk1.8.0_401.jdk/Contents/Home/jre [INFO] arthas-boot version: 3.7.2 [INFO] Found existing java process, please choose one and input the serial number of the process, eg : 1. Then hit ENTER. * [1]: 2323 org.jetbrains.jps.cmdline.Launcher [2]: 1446 org.gradle.launcher.daemon.bootstrap.GradleDaemon [3]: 1386 com.intellij.idea.Main [4]: 1435 org.jetbrains.idea.maven.server.RemoteMavenServer36 [5]: 2303 worker.org.gradle.process.internal.worker.GradleWorkerMain 5 [INFO] arthas home: /Users/jw/.arthas/lib/3.7.2/arthas [INFO] Try to attach process 2303 [INFO] Attach process 2303 success. [INFO] arthas-client connect 127.0.0.1 3658 ,---. ,------. ,--------.,--. ,--. ,---. ,---. / O \ | .--. ''--. .--'| '--' | / O \ ' .-' | .-. || '--'.' | | | .--. || .-. |
.
-. | | | || |\ \ | | | | | || | | |.-' |--'
--'--' '--'
--'--'
--'--'
--'`-----' wiki https://arthas.aliyun.com/doc tutorials https://arthas.aliyun.com/doc/arthas-tutorials.html version 3.7.2 main_class worker.org.gradle.process.internal.worker.GradleWorkerMain pid 2303 time 2024-04-14 22:14:36 [arthas@2303]$ profiler start Profiling started [arthas@2303]$ profiler stop OK profiler output file: ~/arthas-output/20240414-221449.html [arthas@2303]$- 好家伙,果然调用链只到
log4j.Hierarchy.getInstance()
就死了
-
解决
- 根据如上分析 100% 确认就是
log4j-1.2-api-2.17.2.jar!org.apache.log4j.Hierarchy.getInstance()
死循环了,这种基础库一般不会出现这种bug,大概率是冲突了,于是看看依赖树
./gradlew dependencies > /tmp/2
vim /tmp/2
- 果然
hive-exec:3.1.13
依赖的log4j-1.2-api:2.17.1
被springboot:2.7.13
的log4j-1.2-api:2.17.2
覆盖了
...
--- org.springframework.boot:spring-boot-dependencies:{strictly 2.7.13} -> 2.7.13
+--- ...
+--- org.apache.logging.log4j:log4j-1.2-api:2.17.2 (c)
...
+--- org.apache.hive:hive-exec:3.1.3
| +--- org.apache.hive:hive-llap-tez:3.1.3
| | +--- org.apache.hive:hive-common:3.1.3
...
| | | +--- org.apache.logging.log4j:log4j-1.2-api:2.17.1 -> 2.17.2
- 解决方案
- 方案1:直接 hive-exec:3.1.13 中彻底排除掉 log4j-1.2-api 库的依赖(都是
slf4j
体系,排除它不影响运行)。 —- 亲测成功 - 方案2:重新引入并强制指定版本
org.apache.logging.log4j:log4j-1.2-api:**2.17.1
** 即可。 —- 可行
- 方案1:直接 hive-exec:3.1.13 中彻底排除掉 log4j-1.2-api 库的依赖(都是