Error: RESOURCE_EXHAUSTED: Out of memory
I tried to run a job on Phys (7 nodes, 21 GPUs) and got this error:
external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2461] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 14158869984 bytes
The failed allocation is about 13.2 GiB. Using more nodes would presumably avoid this, but it would be nice to be able to estimate a priori how much memory a job will need.
Also, is there an alternative if the whole tail cannot fit in memory?
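
To make the request concrete, here is a back-of-envelope sketch in plain Python of the kind of estimate I mean. The helper names, the example shape, and the assumption that an array is split evenly across devices are all hypothetical and not taken from the code that failed. It counts only dense array storage and ignores temporaries and compiler workspace, so the result is at best a lower bound.

```python
import math

# Bytes per element for a few common dtypes (illustrative, not exhaustive).
DTYPE_BYTES = {
    "float16": 2,
    "bfloat16": 2,
    "float32": 4,
    "float64": 8,
    "complex64": 8,
    "complex128": 16,
}


def array_bytes(shape, dtype="float32"):
    """Bytes needed to store one dense array of the given shape and dtype."""
    return math.prod(shape) * DTYPE_BYTES[dtype]


def per_device_bytes(shape, dtype="float32", n_devices=1):
    """Lower bound per device, ASSUMING the array is split evenly over devices."""
    return math.ceil(array_bytes(shape, dtype) / n_devices)


def to_gib(n_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


if __name__ == "__main__":
    # The allocation size reported in the error message above.
    failed_alloc = 14158869984
    print(f"failed allocation: {to_gib(failed_alloc):.2f} GiB")

    # Hypothetical array shape, only to show how the helpers are used.
    example_shape = (21, 4096, 4096)
    total = array_bytes(example_shape, "float32")
    share = per_device_bytes(example_shape, "float32", n_devices=21)
    print(f"total: {to_gib(total):.2f} GiB, per device: {to_gib(share):.2f} GiB")
```

Comparing the per-device figure with the memory of a single GPU would show roughly how many devices are needed, provided the even-split assumption actually holds for the job.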