Regression: Coordinator crashes with OOM in 468 #24572

Open
guyco33 opened this issue Dec 24, 2024 · 44 comments
Labels: bug (Something isn't working), RELEASE-BLOCKER

@guyco33 (Member) commented Dec 24, 2024

The coordinator in 468 consumes much more memory than in 467, which leads to an OS-level OOM.
What could be the reason for such a change in memory consumption?

Listed below is the output of free memory stats and GC logs captured during execution on both coordinator versions.

467
===
** after 15 minutes 

[2024-12-24T11:10:18.671+0000][7691][info][gc,start       ] GC(60) Pause Young (Normal) (G1 Evacuation Pause)
[2024-12-24T11:10:18.672+0000][7691][info][gc,task        ] GC(60) Using 4 workers of 4 for evacuation
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Pre Evacuate Collection Set: 0.87ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Merge Heap Roots: 0.44ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Evacuate Collection Set: 159.86ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Post Evacuate Collection Set: 1.28ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Other: 0.12ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap        ] GC(60) Eden regions: 100->0(89)
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap        ] GC(60) Survivor regions: 8->9(14)
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap        ] GC(60) Old regions: 34->34
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap        ] GC(60) Humongous regions: 0->0
[2024-12-24T11:10:18.834+0000][7691][info][gc,metaspace   ] GC(60) Metaspace: 291741K(295488K)->291741K(295488K) NonClass: 254345K(256448K)->254345K(256448K) Class: 37395K(39040K)->37395K(39040K)
[2024-12-24T11:10:18.834+0000][7691][info][gc             ] GC(60) Pause Young (Normal) (G1 Evacuation Pause) 4449M->1279M(16384M) 162.817ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,cpu         ] GC(60) User=0.43s Sys=0.04s Real=0.17s
               total        used        free      shared  buff/cache   available
Mem:              30           7          19           0           4          23
Swap:              0           0           0
Tue Dec 24 11:10:26 UTC 2024


** after 1:15 hour

[2024-12-24T12:09:59.680+0000][7686][info][gc,marking     ] GC(173) Concurrent Mark From Roots 2367.862ms
[2024-12-24T12:09:59.680+0000][7686][info][gc,marking     ] GC(173) Concurrent Preclean
[2024-12-24T12:09:59.681+0000][7686][info][gc,marking     ] GC(173) Concurrent Preclean 1.161ms
[2024-12-24T12:09:59.681+0000][7691][info][gc,start       ] GC(173) Pause Remark
[2024-12-24T12:09:59.724+0000][7691][info][gc             ] GC(173) Pause Remark 5988M->5691M(16384M) 42.204ms
[2024-12-24T12:09:59.724+0000][7691][info][gc,cpu         ] GC(173) User=0.15s Sys=0.00s Real=0.04s
[2024-12-24T12:09:59.726+0000][7686][info][gc,marking     ] GC(173) Concurrent Mark 2413.972ms
[2024-12-24T12:09:59.726+0000][7686][info][gc,marking     ] GC(173) Concurrent Rebuild Remembered Sets and Scrub Regions
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking     ] GC(173) Concurrent Rebuild Remembered Sets and Scrub Regions 483.491ms
[2024-12-24T12:10:00.210+0000][7691][info][gc,start       ] GC(173) Pause Cleanup
[2024-12-24T12:10:00.210+0000][7691][info][gc             ] GC(173) Pause Cleanup 5725M->5725M(16384M) 0.054ms
[2024-12-24T12:10:00.210+0000][7691][info][gc,cpu         ] GC(173) User=0.00s Sys=0.00s Real=0.00s
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking     ] GC(173) Concurrent Clear Claimed Marks
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking     ] GC(173) Concurrent Clear Claimed Marks 0.427ms
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking     ] GC(173) Concurrent Cleanup for Next Mark
[2024-12-24T12:10:00.228+0000][7686][info][gc,marking     ] GC(173) Concurrent Cleanup for Next Mark 17.547ms
[2024-12-24T12:10:00.228+0000][7686][info][gc             ] GC(173) Concurrent Mark Cycle 3285.446ms
               total        used        free      shared  buff/cache   available
Mem:              30          16          10           0           4          14
Swap:              0           0           0
Tue Dec 24 12:10:41 UTC 2024

===
468
===
** after 15 minutes 

[2024-12-24T11:10:04.019+0000][7679][info][gc,start       ] GC(62) Pause Young (Normal) (G1 Evacuation Pause)
[2024-12-24T11:10:04.019+0000][7679][info][gc,task        ] GC(62) Using 4 workers of 4 for evacuation
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Pre Evacuate Collection Set: 0.32ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Merge Heap Roots: 0.22ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Evacuate Collection Set: 139.19ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Post Evacuate Collection Set: 2.22ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Other: 0.08ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap        ] GC(62) Eden regions: 118->0(104)
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap        ] GC(62) Survivor regions: 8->10(16)
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap        ] GC(62) Old regions: 33->33
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap        ] GC(62) Humongous regions: 0->0
[2024-12-24T11:10:04.161+0000][7679][info][gc,metaspace   ] GC(62) Metaspace: 292209K(296000K)->292209K(296000K) NonClass: 254676K(256768K)->254676K(256768K) Class: 37533K(39232K)->37533K(39232K)
[2024-12-24T11:10:04.161+0000][7679][info][gc             ] GC(62) Pause Young (Normal) (G1 Evacuation Pause) 5009M->1277M(16384M) 142.330ms
[2024-12-24T11:10:04.162+0000][7679][info][gc,cpu         ] GC(62) User=0.44s Sys=0.04s Real=0.14s
               total        used        free      shared  buff/cache   available
Mem:              30          16           9           0           4          13
Swap:              0           0           0
Tue Dec 24 11:10:27 UTC 2024


** after 1:15 hour

[2024-12-24T12:10:12.713+0000][7679][info][gc,start       ] GC(179) Pause Young (Normal) (G1 Evacuation Pause)
[2024-12-24T12:10:12.714+0000][7679][info][gc,task        ] GC(179) Using 4 workers of 4 for evacuation
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Pre Evacuate Collection Set: 0.34ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Merge Heap Roots: 0.24ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Evacuate Collection Set: 77.50ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Post Evacuate Collection Set: 2.34ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Other: 0.09ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap        ] GC(179) Eden regions: 303->0(302)
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap        ] GC(179) Survivor regions: 4->5(39)
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap        ] GC(179) Old regions: 113->113
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap        ] GC(179) Humongous regions: 1->1
[2024-12-24T12:10:12.795+0000][7679][info][gc,metaspace   ] GC(179) Metaspace: 296387K(300352K)->296387K(300352K) NonClass: 258630K(260800K)->258630K(260800K) Class: 37756K(39552K)->37756K(39552K)
[2024-12-24T12:10:12.795+0000][7679][info][gc             ] GC(179) Pause Young (Normal) (G1 Evacuation Pause) 13361M->3699M(16384M) 82.112ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,cpu         ] GC(179) User=0.26s Sys=0.01s Real=0.08s
               total        used        free      shared  buff/cache   available
Mem:              30          29           1           0           0           0
Swap:              0           0           0
Tue Dec 24 12:10:25 UTC 2024


*** syslog OOM output 
grep -i oom /var/log/syslog

2024-12-24T12:14:35.481240+00:00 ip-172-30-20-32 kernel: http-worker-691 invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
2024-12-24T12:14:35.496433+00:00 ip-172-30-20-32 kernel:  oom_kill_process+0x116/0x270
2024-12-24T12:14:35.496434+00:00 ip-172-30-20-32 kernel:  ? oom_evaluate_task+0x143/0x1e0
2024-12-24T12:14:35.496435+00:00 ip-172-30-20-32 kernel:  __alloc_pages_may_oom+0x10c/0x1d0
2024-12-24T12:14:35.511859+00:00 ip-172-30-20-32 kernel: [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
2024-12-24T12:14:35.511912+00:00 ip-172-30-20-32 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/cloud-final.service,task=trino-server,pid=7634,uid=0
2024-12-24T12:14:35.511912+00:00 ip-172-30-20-32 kernel: Out of memory: Killed process 7634 (trino-server) total-vm:36015048kB, anon-rss:29973656kB, file-rss:2948kB, shmem-rss:0kB, UID:0 pgtables:61856kB oom_score_adj:0
2024-12-24T12:14:35.792898+00:00 ip-172-30-20-32 systemd[1]: cloud-final.service: A process of this unit has been killed by the OOM killer.
2024-12-24T12:14:35.803968+00:00 ip-172-30-20-32 systemd[1]: cloud-final.service: Failed with result 'oom-kill'.
2024-12-24T12:14:35.806272+00:00 ip-172-30-20-32 systemd[1]: system.slice: A process of this unit has been killed by the OOM killer.

The Trino coordinator is running on an EC2 r6a.xlarge instance with OpenJDK Runtime Environment Temurin-23.0.1+11 (build 23.0.1+11).

The jvm.conf is:

-server
-Xmx16G
-Xms16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-Djava.io.tmpdir=/data/tmpdir
-XX:ReservedCodeCacheSize=512M
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
-Dfile.encoding=UTF-8
-Dcom.sun.jndi.ldap.object.disableEndpointIdentification=true
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
# Reduce starvation of threads by GClocker, recommend to set about the number of cpu cores (JDK-8192647)
-XX:+UnlockDiagnosticVMOptions
-XX:G1NumCollectionsKeepPinned=10000000
-XX:GCLockerRetryAllocationCount=32
-Xlog:gc*:/data/trino/var/log/gc.log:time,level,tags,tid:filecount=10,filesize=100M
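
For reference, the free/top samples above were captured manually; a minimal scripted version, assuming pgrep matches the trino-server process:

$ while true; do date; free -g; ps -o rss=,vsz= -p "$(pgrep trino)" | awk '{printf "rss=%.1fG vsz=%.1fG\n", $1/1048576, $2/1048576}'; sleep 60; done
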
@guyco33 (Member Author) commented Dec 24, 2024

I'm executing the same load on r6a.2xlarge with double the memory, and it seems that OOM will happen there soon as well.

Memory consumption after 1.5 hours:

===
467
===

$ date && free -g && (top -n 20| grep trino) && date && free -g

Tue Dec 24 15:00:31 UTC 2024
               total        used        free      shared  buff/cache   available
Mem:              61          24          32           0           4          36
Swap:              0           0           0

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   7766 root      20   0   44.6g  23.0g  30676 S  18.2  37.4  35:48.76 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  20.3  37.4  35:49.37 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  18.6  37.4  35:49.93 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  16.3  37.4  35:50.42 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  15.0  37.4  35:50.87 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  16.6  37.4  35:51.37 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  12.3  37.4  35:51.74 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  11.3  37.4  35:52.08 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  13.7  37.4  35:52.49 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  12.3  37.4  35:52.86 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  11.0  37.4  35:53.19 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S   7.6  37.4  35:53.42 trino-s+
   7766 root      20   0   44.5g  23.0g  30676 S   7.3  37.4  35:53.64 trino-s+
   7766 root      20   0   44.5g  23.0g  30676 S   6.0  37.4  35:53.82 trino-s+
   7766 root      20   0   44.5g  23.0g  30676 S   6.6  37.4  35:54.02 trino-s+
   7766 root      20   0   44.5g  23.0g  30676 S   6.6  37.4  35:54.22 trino-s+
   7766 root      20   0   44.4g  23.0g  30676 S   8.0  37.4  35:54.46 trino-s+
   7766 root      20   0   44.4g  23.0g  30676 S   6.3  37.4  35:54.65 trino-s+
   7766 root      20   0   44.4g  23.0g  30676 S   7.0  37.4  35:54.86 trino-s+
   7766 root      20   0   44.6g  23.1g  30676 S 278.7  37.6  36:03.25 trino-s+
Tue Dec 24 15:01:29 UTC 2024
===
468
===

$ date && free -g && (top -n 20| grep trino) && date && free -g

Tue Dec 24 15:00:30 UTC 2024
               total        used        free      shared  buff/cache   available
Mem:              61          43          13           0           4          17
Swap:              0           0           0

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   7784 root      20   0   61.7g  42.1g  30820 S  10.0  68.6  39:12.02 trino-s+
   7784 root      20   0   61.7g  42.2g  30820 S   8.3  68.6  39:12.27 trino-s+
   7784 root      20   0   61.8g  42.2g  30820 S   7.0  68.7  39:12.48 trino-s+
   7784 root      20   0   61.8g  42.2g  30820 S  10.3  68.7  39:12.79 trino-s+
   7784 root      20   0   61.9g  42.2g  30820 S   7.3  68.7  39:13.01 trino-s+
   7784 root      20   0   61.9g  42.3g  30820 S   5.6  68.8  39:13.18 trino-s+
   7784 root      20   0   61.9g  42.3g  30820 S   5.7  68.8  39:13.35 trino-s+
   7784 root      20   0   61.8g  42.3g  30820 S   5.0  68.9  39:13.50 trino-s+
   7784 root      20   0   61.8g  42.3g  30820 S   3.0  68.9  39:13.59 trino-s+
   7784 root      20   0   61.8g  42.3g  30820 S   2.0  68.9  39:13.65 trino-s+
   7784 root      20   0   61.8g  42.3g  30820 S   2.3  68.8  39:13.72 trino-s+
   7784 root      20   0   61.8g  42.3g  30820 S   1.7  68.8  39:13.77 trino-s+
   7784 root      20   0   61.8g  42.2g  30820 S   2.0  68.7  39:13.83 trino-s+
   7784 root      20   0   61.8g  42.2g  30820 S   2.0  68.7  39:13.89 trino-s+
   7784 root      20   0   62.0g  42.4g  30820 S 181.4  69.0  39:19.35 trino-s+
   7784 root      20   0   62.9g  43.0g  30820 S 220.7  70.0  39:25.97 trino-s+
   7784 root      20   0   63.4g  43.5g  30820 S  41.2  70.8  39:27.21 trino-s+
   7784 root      20   0   63.8g  44.0g  30820 S  43.9  71.6  39:28.53 trino-s+
   7784 root      20   0   64.6g  44.7g  30820 S 155.0  72.7  39:33.18 trino-s+
   7784 root      20   0   64.9g  45.1g  30820 S  34.6  73.3  39:34.22 trino-s+
Tue Dec 24 15:01:27 UTC 2024
               total        used        free      shared  buff/cache   available
Mem:              61          46          10           0           4          14
Swap:              0           0           0

guyco33 added the bug (Something isn't working) and RELEASE-BLOCKER labels on Dec 24, 2024
@guyco33 (Member Author) commented Dec 24, 2024

The OS OOM happened again:

$ grep oom /var/log/syslog

2024-12-24T17:14:34.706917+00:00 ip-172-30-21-101 kernel: query-execution invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
2024-12-24T17:14:34.744709+00:00 ip-172-30-21-101 kernel:  oom_kill_process+0x116/0x270
2024-12-24T17:14:34.744711+00:00 ip-172-30-21-101 kernel:  ? oom_evaluate_task+0x143/0x1e0
2024-12-24T17:14:34.744712+00:00 ip-172-30-21-101 kernel:  __alloc_pages_may_oom+0x10c/0x1d0
2024-12-24T17:14:34.792485+00:00 ip-172-30-21-101 kernel: [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
2024-12-24T17:14:34.792510+00:00 ip-172-30-21-101 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/cloud-final.service,task=trino-server,pid=7784,uid=0
2024-12-24T17:14:34.792511+00:00 ip-172-30-21-101 kernel: Out of memory: Killed process 7784 (trino-server) total-vm:74788372kB, anon-rss:63005676kB, file-rss:2412kB, shmem-rss:0kB, UID:0 pgtables:128700kB oom_score_adj:0
2024-12-24T17:14:37.446266+00:00 ip-172-30-21-101 kernel: oom_reaper: reaped process 7784 (trino-server), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
2024-12-24T17:14:37.759101+00:00 ip-172-30-21-101 systemd[1]: cloud-final.service: Failed with result 'oom-kill'.

@wendigo (Contributor) commented Dec 24, 2024

@guyco33 can you do a heapdump when gc pressure goes up?
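
For reference, a heap dump can be captured with standard JDK tooling once heap usage climbs; a sketch, with the PID lookup and output path as placeholders:

# GC.heap_dump runs a full GC first and dumps live objects; point the path at a disk with space
$ jcmd $(pgrep trino) GC.heap_dump /data/tmpdir/coordinator-$(date +%s).hprof
# alternative: jmap -dump:live,format=b,file=/data/tmpdir/coordinator.hprof <pid>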

@guyco33 (Member Author) commented Dec 29, 2024

I've run the exact same test queries for about 1 hour on both 467 and 468 and created a trace dump comparison with VisualVM.
The increase in object counts might explain the increase in overall memory, although the total heap size is less than 2 GB.

(VisualVM comparison screenshots attached)
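
(For a command-line view of the same comparison, a class histogram can be captured on each version under the same load and then diffed; a sketch, assuming pgrep matches the server process:)

$ jcmd $(pgrep trino) GC.class_histogram > histo-468-$(date +%s).txt
# repeat on the 467 coordinator, then diff the two files to compare object counts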

@losipiuk (Member) commented:

It looks like we need to be looking for something off-heap (the Java heap seems fine).

  • @wendigo do you recall updating any libs that use off-heap memory in the 467..468 period? Anyway, this should rather impact workers, not the coordinator.
  • Does the Python functions support have the potential of using off-heap memory?
  • @guyco33 what connector/table format are you using?

@wendigo (Contributor) commented Dec 30, 2024

@losipiuk not really

@losipiuk (Member) commented:

As for Python, this code

Instance instance = Instance.builder(PYTHON_MODULE)
.withMachineFactory(PythonModule::create)
.withImportValues(importValues)
.build();
allocate = instance.export("allocate");
deallocate = instance.export("deallocate");
setup = instance.export("setup");
execute = instance.export("execute");
memory = instance.memory();

is potentially leaky. We are not closing Instance explicitly, just assuming GC will do the job, and I don't know whether it holds any native memory.
It also looks like we do not always have allocate/deallocate pairs, e.g. for argTypeAddress and returnTypeAddress in PythonEngine.doSetup. I don't know whether that is a problem or not. @electrum PTAL.
But this is probably not what @guyco33 is observing, as this code should not be called if Python functions are not in use.

@guyco33 (Member Author) commented Dec 30, 2024

what connector/table format are you using?

I'm using Postgres, MySQL, and Hive tables with ORC and gzipped CSV file formats.

@electrum (Member) commented:

@losipiuk There is no close for Instance. It doesn't hold any native memory. The WASM memory is allocated on-heap as a single contiguous array. The argType and returnType pointers passed to setup() are saved and held onto for the lifetime of the WASM engine.

There is a potential leak of PythonEngine which creates a Jimfs instance that may get registered with the JDK file system machinery. We create a ScalarFunctionImplementation with an instanceFactory so that the engine can create one per-thread as needed. If this is a problem, we'll need to change the engine to support Closeable function instances, use java.lang.ref.Cleaner, etc. Also, we may need memory tracking for the function instances, as language functions will typically use more memory than other types of functions.

That said, that seems unrelated to this issue, as there is no evidence they are being used here.

@guyco33 (Member Author) commented Jan 2, 2025

I ran the same tests after deleting the ./plugin/functions-python/ directory, but the OS OOM still happened on the coordinator, so it seems the leak comes from something else.
What would be the best approach to tracking down off-heap leaks?

@wendigo (Contributor) commented Jan 2, 2025

@guyco33 we don't allocate off-heap memory directly; the only known usage I'm aware of is Apache Arrow (BigQuery and Snowflake connectors), but that is not the case here.

@wendigo (Contributor) commented Jan 2, 2025

@guyco33 how much memory is assigned to the JVM process? Can you lower Xmx to 70% of the quota?

@teddy-hackle commented:

We are also experiencing the same problem after upgrading to 468.

We are using S3, Hive metastore, and Iceberg through the JDBC connector.

Because of this issue we added the -XX:TrimNativeHeapInterval=5000 option to the JVM options; in the metrics, native heap memory is trimmed periodically and fluctuates, but it keeps increasing overall and eventually an OOM occurs.

(native heap memory metric graph attached)

@wendigo (Contributor) commented Jan 3, 2025

@teddy-hackle @guyco33 can you do a thread dump?

@guyco33 (Member Author) commented Jan 3, 2025

how much memory is assigned to the JVM process? Can you lower Xmx to 70% of the quota?

@wendigo I'm using -Xmx16G on an r6a.xlarge instance with 32G of memory.

@wendigo (Contributor) commented Jan 3, 2025

@guyco33 can you provide a thread dump?

@losipiuk (Member) commented Jan 3, 2025

@guyco33 we don't allocate off-heap memory directly

I used to observe behaviour (in a different project) which suggested that the AWS SDK allocates memory off-heap when doing PUTs to S3. I never got to the root of that, though, and it doesn't match what the coordinator is doing anyway.

@wendigo (Contributor) commented Jan 3, 2025

Can one of you run Trino with the JVM flag -XX:NativeMemoryTracking=detail and then capture native memory statistics, using jps to get the Trino PID and then jcmd [PID] VM.native_memory baseline scale=MB?
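
Spelled out, one possible sequence (the flag goes into the JVM options before a restart; the waiting period is illustrative):

# 1. restart the coordinator with -XX:NativeMemoryTracking=detail added to the JVM options
# 2. record a baseline shortly after startup
$ jcmd $(pgrep trino) VM.native_memory baseline scale=MB
# 3. once memory has grown for a while, diff against the baseline
$ jcmd $(pgrep trino) VM.native_memory detail.diff scale=MB > detail.diff.txt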

@guyco33 (Member Author) commented Jan 3, 2025

@losipiuk the OOM also happens when Hive is not reachable from the coordinator and S3 operations are not involved, so my assumption is that it's something in the planner.
@wendigo attached are some output files from jcmd:

statistics.txt
summary.diff.txt
detail.diff.txt

@wendigo (Contributor) commented Jan 3, 2025

I'm worried about this Other section, which contains ByteBuffers and such (Unsafe allocations):

-                     Other (reserved=5839MB +449MB, committed=5839MB +449MB)
                            (malloc=5839MB +449MB #3755 +278)

Trino itself does not use ByteBuffers, as Slice wraps heap-allocated byte[] arrays and that is the primary memory allocation/manipulation unit. I think the allocations are coming from third-party code, e.g. Netty or the AWS SDK.

Is it possible for you to attach a debugger to a running process and check Unsafe.allocateMemory calls?
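
One way to do that without a full IDE is jdb; this sketch assumes the coordinator is started with a JDWP agent (the port is a placeholder):

# add to the JVM options (assumes remote debugging is acceptable in this environment):
#   -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005
$ jdb -attach 5005
> stop in jdk.internal.misc.Unsafe.allocateMemory
> where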

@wendigo (Contributor) commented Jan 3, 2025

@dain @electrum any ideas on how to approach this?

@wendigo (Contributor) commented Jan 3, 2025

I think that these allocations are happening in Jetty:

Breakpoint reached
	at jdk.internal.misc.Unsafe.allocateMemory(Unsafe.java:625)
	at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:115)
	at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:363)
	at org.eclipse.jetty.util.BufferUtil.allocateDirect(BufferUtil.java:137)
	at org.eclipse.jetty.util.BufferUtil.allocate(BufferUtil.java:153)
	at org.eclipse.jetty.io.ArrayByteBufferPool.acquire(ArrayByteBufferPool.java:208)
	at org.eclipse.jetty.server.internal.HttpConnection$SendCallback.process(HttpConnection.java:779)
	at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:262)
	at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:243)
	at org.eclipse.jetty.server.internal.HttpConnection$HttpStreamOverHTTP1.send(HttpConnection.java:1427)
	at org.eclipse.jetty.server.HttpStream$Wrapper.send(HttpStream.java:185)
	at org.eclipse.jetty.server.internal.HttpChannelState$ChannelResponse.write(HttpChannelState.java:1300)
	at org.eclipse.jetty.server.Response$Wrapper.write(Response.java:768)
	at org.eclipse.jetty.server.handler.EventsHandler$EventsResponse.write(EventsHandler.java:353)
	at org.eclipse.jetty.server.Response$Wrapper.write(Response.java:768)
	at org.eclipse.jetty.server.handler.ContextResponse.write(ContextResponse.java:56)
	at org.eclipse.jetty.ee10.servlet.ServletContextResponse.write(ServletContextResponse.java:288)
	at org.eclipse.jetty.ee10.servlet.HttpOutput.channelWrite(HttpOutput.java:215)
	at org.eclipse.jetty.ee10.servlet.HttpOutput.channelWrite(HttpOutput.java:200)
	at org.eclipse.jetty.ee10.servlet.HttpOutput.write(HttpOutput.java:848)
	at org.glassfish.jersey.servlet.internal.ResponseWriter$NonCloseableOutputStreamWrapper.write(ResponseWriter.java:298)
	at java.io.ByteArrayOutputStream.writeTo(ByteArrayOutputStream.java:170)
	at org.glassfish.jersey.message.internal.CommittingOutputStream.flushBuffer(CommittingOutputStream.java:286)
	at org.glassfish.jersey.message.internal.CommittingOutputStream.commit(CommittingOutputStream.java:235)
	at org.glassfish.jersey.message.internal.CommittingOutputStream.close(CommittingOutputStream.java:250)
	at org.glassfish.jersey.message.internal.OutboundMessageContext.close(OutboundMessageContext.java:568)
	at org.glassfish.jersey.server.ContainerResponse.close(ContainerResponse.java:403)
	at org.glassfish.jersey.server.ServerRuntime$Responder.writeResponse(ServerRuntime.java:763)
	at org.glassfish.jersey.server.ServerRuntime$Responder.processResponse(ServerRuntime.java:398)
	at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:388)
	at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:277)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:266)
	at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:253)
	at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:696)
	at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:397)
	at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:349)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:358)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:312)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
	at org.eclipse.jetty.ee10.servlet.ServletHolder.handle(ServletHolder.java:736)
	at org.eclipse.jetty.ee10.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1619)
	at io.airlift.http.server.tracing.TracingServletFilter.doFilter(TracingServletFilter.java:113)
	at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
	at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1591)
	at org.eclipse.jetty.ee10.servlet.ServletHandler$MappedServlet.handle(ServletHandler.java:1552)
	at org.eclipse.jetty.ee10.servlet.ServletChannel.dispatch(ServletChannel.java:819)
	at org.eclipse.jetty.ee10.servlet.ServletChannel.handle(ServletChannel.java:436)
	at org.eclipse.jetty.ee10.servlet.ServletHandler.handle(ServletHandler.java:469)
	at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:611)
	at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:1060)
	at org.eclipse.jetty.server.Handler$Wrapper.handle(Handler.java:740)
	at org.eclipse.jetty.server.handler.EventsHandler.handle(EventsHandler.java:81)
	at org.eclipse.jetty.server.Server.handle(Server.java:182)
	at org.eclipse.jetty.server.internal.HttpChannelState$HandlerInvoker.run(HttpChannelState.java:662)
	at org.eclipse.jetty.server.internal.HttpConnection.onFillable(HttpConnection.java:418)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:322)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:99)
	at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:480)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:443)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:293)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:201)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:311)
	at org.eclipse.jetty.util.thread.MonitoredQueuedThreadPool$1.run(MonitoredQueuedThreadPool.java:73)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:979)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1209)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1164)
	at java.lang.Thread.runWith(Thread.java:1588)
	at java.lang.Thread.run(Thread.java:1575)

I've reported this issue to the Jetty folks.

@guyco33 (Member Author) commented Jan 5, 2025

Attached are the output files of jcmd $(pgrep trino) VM.native_memory detail.diff from both 467 and 468 after running the same queries for about 1 hour:
detail.diff.467.txt
detail.diff.468.txt

The Other section in 468 seems to be the issue.

467

-                     Other (reserved=63MB +48MB, committed=63MB +48MB)
                            (malloc=63MB +48MB #1111 +328)

468

-                     Other (reserved=6586MB +5151MB, committed=6586MB +5151MB)
                            (malloc=6586MB +5151MB #3986 +2434)

@guyco33 (Member Author) commented Jan 5, 2025

I think that the allocations are coming from 3rd party code, i.e. Netty or AWS SDK.

@wendigo
I tried to run 468 after downgrading io.netty to 4.1.115.Final #24506 and the AWS SDK to 2.29.29 #24450 & 1.12.779 #24485 (same as in 467), but the memory leak still occurs.

@wendigo (Contributor) commented Jan 5, 2025

How did you downgrade them?

@guyco33 (Member Author) commented Jan 5, 2025

I built 468 after updating pom.xml with the previous versions.
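
Roughly (a sketch; the exact version properties in pom.xml and the module list may differ):

$ git checkout 468
# edit the root pom.xml to pin the previous Netty / AWS SDK versions
$ ./mvnw clean install -DskipTests -pl core/trino-server -am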

@wendigo (Contributor) commented Jan 5, 2025

Did the native memory (other) change with the downgrade?

@guyco33 (Member Author) commented Jan 5, 2025

Did the native memory (other) change with the downgrade?

No. Same high value

@robinsinghstudios commented Jan 5, 2025

Not sure if it was the same reason, but I believe I faced the same issue on 467 as well. Might it be related to the shift to the new UBI?

@guyco33 (Member Author) commented Jan 5, 2025

I have a production cluster running on 467 and the coordinator has been running for almost 3 weeks with no OS OOM issues.

@wendigo (Contributor) commented Jan 5, 2025

@robinsinghstudios the base image doesn't affect the runtime

@robinsinghstudios commented:

Understood. Thanks @wendigo.

@guyco33 (Member Author) commented Jan 5, 2025

I've tested two snapshot versions of 468 and found that the issue started with one of the changes made between #24305 (a7e72d4) and #22974 (4718011). The commit range is listed below, with a bisect sketch after it.

a7e72d4 (test) - no issue

-                     Other (reserved=95MB +91MB, committed=95MB +91MB)
                            (malloc=95MB +91MB #1542 +1451)

4718011 (HEAD -> test2) - OS OOM

-                     Other (reserved=4213MB +2180MB, committed=4213MB +2180MB)
                            (malloc=4213MB +2180MB #3592 +879)
* 4718011a39 (HEAD -> test2) Use BigQuery storage read API when reading external BigLake tables
* 9b24b74f80 Add assertConsistently
* 9499dc82f2 Update google-sheets api to v4-rev20241203-2.0.0
* b869fe2f43 Update exasol to 24.2.1
* ccfefa284d Update s3mock testcontainers to 3.12.0
* bcfbcc3248 Update nessie to 0.101.1
* 5553fc29c4 Update JLine to 3.28.0
* f2c1b83c67 Update AWS SDK v2 to 2.29.31
* c794993acd Update swagger to 2.2.27
* a017723a76 Update reactor-core to 3.7.1
* 6704bf467c Update metrics-core to 4.2.29
* 0c01c04242 Update airlift to 292
* ee5cba4407 Update airbase to 204
* 4fef0f0bee Update Hudi library to 1.0.0
* a7e72d424f (test) Add missing table features when adding timestamp_ntz in Delta
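
A scripted bisect over this range could pinpoint the exact commit; a sketch, where the load test stands in for whatever harness reproduces the leak:

$ git bisect start 4718011a39 a7e72d424f    # <bad> <good>
# at each step: rebuild, rerun the load, watch the NMT "Other" section, then mark the result
$ ./mvnw clean install -DskipTests -pl core/trino-server -am
$ git bisect good    # or: git bisect bad, until git reports the first bad commit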

@wendigo (Contributor) commented Jan 6, 2025

The first 7 commits don't matter, which leaves reactor, metrics, the AWS SDK, and airlift.

@wendigo (Contributor) commented Jan 6, 2025

* 4718011a39 (HEAD -> test2) Use BigQuery storage read API when reading external BigLake tables <- doesn't matter - different connector
* 9b24b74f80 Add assertConsistently <- doesn't matter - test only
* 9499dc82f2 Update google-sheets api to v4-rev20241203-2.0.0 <- doesn't matter - different connector
* b869fe2f43 Update exasol to 24.2.1 <- doesn't matter - different connector
* ccfefa284d Update s3mock testcontainers to 3.12.0 <- doesn't matter - test only
* bcfbcc3248 Update nessie to 0.101.1 <- doesn't matter - different connector
* 5553fc29c4 Update JLine to 3.28.0 <- doesn't matter - CLI only
* f2c1b83c67 Update AWS SDK v2 to 2.29.31 <- candidate
* c794993acd Update swagger to 2.2.27 <- doesn't matter
* a017723a76 Update reactor-core to 3.7.1 <- candidate (but this applies to Azure SDK only)
* 6704bf467c Update metrics-core to 4.2.29 <- candidate
* 0c01c04242 Update airlift to 292 <- candidate
* ee5cba4407 Update airbase to 204 <- candidate
* 4fef0f0bee Update Hudi library to 1.0.0 <- doesn't matter - different connector
* a7e72d424f (test) Add missing table features when adding timestamp_ntz in Delta <- doesn't matter - different connector

@guyco33 (Member Author) commented Jan 6, 2025

So it seems to be the update of airbase from version 203 to 204:

* ee5cba4407 (bad) Update airbase to 204 
* 4fef0f0bee (good) Update Hudi library to 1.0.0
* a7e72d424f Add missing table features when adding timestamp_ntz in Delta

airbase was updated 3 times in 467..468:

$ git log --oneline --decorate --graph 467..468 | grep 'airbase'
* 547cc430e7 Update airbase to 206 and airlift to 293
* c1cc49fce2 Update airbase to 205
* ee5cba4407 Update airbase to 204

@wendigo (Contributor) commented Jan 6, 2025

Airbase 204 brings, among other things, Jetty 12.0.16: https://github.com/airlift/airbase/releases/tag/204

@wendigo (Contributor) commented Jan 6, 2025

I've created https://github.com/airlift/airlift/pull/1349/files to bound and reuse memory/heap allocations in Jetty.

@guyco33 (Member Author) commented Jan 6, 2025

I've just confirmed that running 468 with airbase version 203 solves the memory leak issue:

* 5f9b222b93 (HEAD -> test) rollback to airbase 203
* cb7a9cc1cf (tag: 468) [maven-release-plugin] prepare release 468

@wendigo (Contributor) commented Jan 6, 2025

@guyco33 can you test with the airlift fix in place?

@guyco33 (Member Author) commented Jan 6, 2025

@wendigo doesn't it make more sense for the fix to be in airbase?

@wendigo (Contributor) commented Jan 6, 2025

@guyco33 airbase is just a set of dependency declarations. Airbase defines versions for airlift, and airlift is used by Trino. We need Jetty 12.0.16 as it solves another issue we were facing, so downgrading is not really an option.

@wendigo (Contributor) commented Jan 6, 2025

We've filed jetty/jetty.project#12670 to track the Jetty improvement.

@guyco33 (Member Author) commented Jan 6, 2025

@wendigo I can confirm that your airlift fix solves the issue!
Thanks for your professional and exceptional work on this fix.
