Stream: helpdesk (published)

Topic: on tpu-vm crashes with invalid pointer


Reid (Feb 18 2022 at 01:18):

I'm trying to run Julia on a TPU VM (v3-8) using the tpu-vm-pt-1.10 image. It crashes on various operations with "free(): invalid pointer". This happens with both the latest release and the LTS. (Cross-posting from Slack.)

(@v1.6) pkg> generate Demo
  Generating  project Demo:
free(): invalid pointer

signal (6): Aborted
in expression starting at none:0
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7f5bc79ef3ed)
unknown function (ip: 0x7f5bc79f747b)
unknown function (ip: 0x7f5bc79f8cab)
git_mbedtls_stream_global_init at /home/rs/tools/julia-1.6.5/bin/../lib/julia/libgit2.so (unknown line)
init_once at /home/rs/tools/julia-1.6.5/bin/../lib/julia/libgit2.so (unknown line)
__pthread_once_slow at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
git_libgit2_init at /home/rs/tools/julia-1.6.5/bin/../lib/julia/libgit2.so (unknown line)
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LibGit2/src/error.jl:108 [inlined]
initialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LibGit2/src/LibGit2.jl:986
#164 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LibGit2/src/LibGit2.jl:971
lock at ./lock.jl:187
ensure_initialized at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LibGit2/src/LibGit2.jl:967 [inlined]
GitConfig at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LibGit2/src/config.jl:50
GitConfig at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LibGit2/src/config.jl:50 [inlined]
with at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LibGit2/src/types.jl:1156 [inlined]
getconfig at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LibGit2/src/config.jl:160 [inlined]
project at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/generate.jl:30
#generate#3 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/generate.jl:15
generate at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/generate.jl:10 [inlined]
#generate#2 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/generate.jl:8 [inlined]
generate at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/generate.jl:8 [inlined]
#generate_deprecated#1 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/generate.jl:5 [inlined]
generate_deprecated at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/generate.jl:4
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
do_apply at /buildworker/worker/package_linux64/build/src/builtins.c:670
do_cmd! at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/REPLMode/REPLMode.jl:405
#do_cmd#21 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/REPLMode/REPLMode.jl:386
do_cmd at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/REPLMode/REPLMode.jl:377 [inlined]
#24 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/REPLMode/REPLMode.jl:550
jfptr_YY.24_45436.clone_1 at /home/rs/tools/julia-1.6.5/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
jl_f__call_latest at /buildworker/worker/package_linux64/build/src/builtins.c:714
#invokelatest#2 at ./essentials.jl:708 [inlined]
invokelatest at ./essentials.jl:706 [inlined]
run_interface at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/LineEdit.jl:2441
jfptr_run_interface_54737.clone_1 at /home/rs/tools/julia-1.6.5/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
run_frontend at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:1126
#44 at ./task.jl:411
jfptr_YY.44_53285.clone_1 at /home/rs/tools/julia-1.6.5/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1703 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:834
Allocations: 2654 (Pool: 2639; Big: 15); GC: 0
Aborted (core dumped)
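
The backtrace points at libgit2's mbedtls setup (git_mbedtls_stream_global_init), so forcing libgit2 to initialize directly should be a minimal way to exercise the same path while bypassing Pkg entirely (untested sketch; ensure_initialized is the internal helper that shows up in the trace):

julia> using LibGit2

julia> LibGit2.ensure_initialized()  # first call runs git_libgit2_init, which is where the abort above happens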

I found a very similar issue on Discourse, but it had no replies. GitHub issues did not seem to have anything relevant.
https://discourse.julialang.org/t/issues-with-julia-installation-on-google-tpu-vm/65783

$ uname -a
Linux *********** 5.11.0-1021-gcp #23~20.04.1-Ubuntu SMP Fri Oct 1 19:04:32 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

I did notice that trying to check the libc.so.6 version info segfaulted, which makes me suspect Google is doing something strange with their glibc.

$ /lib/x86_64-linux-gnu/libc.so.6
Segmentation fault (core dumped)
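
A few quick checks might narrow down whether the image preloads a custom allocator or ships a patched glibc (both just guesses on my part; run from a Julia session, which inherits the same environment):

julia> get(ENV, "LD_PRELOAD", "(not set)")        # an injected allocator here could explain free(): invalid pointer

julia> isfile("/etc/ld.so.preload") && print(read("/etc/ld.so.preload", String))

julia> run(`ldd --version`);                      # glibc version as reported by the dynamic loader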

Has anyone had success running Julia on TPU VMs? Thanks!

Mason Protter (Feb 21 2022 at 17:16):

Hm, yeah, I'm not sure how much attention Julia's TPU stuff has gotten in a while. It seemed to be the exciting new hotness for a little while, and then I suddenly stopped hearing about it.

Brenhin Keller (Feb 21 2022 at 19:39):

Yeah... I wonder how hard it would be to target TPUs with StaticCompiler?

Jayesh K. Gupta (Feb 22 2022 at 23:13):

To be fair, I have seen these often enough when using jax via PyCall that it's likely some interaction between XLA's LLVM and Julia's LLVM.
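
The clash itself is hard to observe directly, but listing which LLVM shared libraries end up mapped into the process after importing jax at least shows whether more than one copy is loaded dynamically (sketch; assumes PyCall's Python has jax installed):

julia> using PyCall, Libdl

julia> pyimport("jax");

julia> filter(contains("LLVM"), Libdl.dllist())   # which LLVM shared libraries are mapped into this process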

Brian Chen (Feb 22 2022 at 23:35):

AFAIK @Reid is using the official binaries, so there should only be one LLVM (Julia's own) in play? No other libraries.

Reid (Feb 23 2022 at 00:42):

Brian Chen said:

AFAIK Reid is using the official binaries, so there should only be one LLVM (Julia's own) in play? No other libraries.

I am using the official binaries. I also tried installing from conda-forge and building from source.
Also, here's the GitHub issue I opened:
https://github.com/JuliaLang/julia/issues/44242

Reid (Feb 23 2022 at 15:40):

Jayesh K. Gupta said:

To be fair, I have seen these often enough when using jax via PyCall that it's likely some interaction between XLA's LLVM and Julia's LLVM.

Yes, I understand XLA.jl is languishing, but I thought calling out to jax, maybe with a simple wrapper, would be usable. Sadly, I can't even get to the point of trying to use the TPU cores. Do you have a working setup on a TPU VM (i.e., can you use Julia even just on the CPU)?
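
For reference, the kind of thin wrapper I had in mind is nothing fancier than plain PyCall (sketch; assumes jax and libtpu are installed in the Python that PyCall is built against):

julia> using PyCall

julia> jax = pyimport("jax"); jnp = pyimport("jax.numpy")

julia> jax.devices()                    # should list the TPU cores if libtpu is picked up

julia> x = jnp.ones((8, 128)); jnp.dot(x, x.T)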

Brian Chen (Feb 23 2022 at 16:16):

What confuses me is that Julia does run on Colab VMs, which shouldn't be that different from the TPU VM you're using (they're all on GCP, after all). Perhaps there's some (mis)configuration with the TPU VM image that is wreaking havoc?

Reid (Feb 23 2022 at 19:35):

Brian Chen said:

What confuses me is that Julia does run on Colab VMs, which shouldn't be that different from the TPU VM you're using (they're all on GCP, after all). Perhaps there's some (mis)configuration with the TPU VM image that is wreaking havoc?

Yeah, I tested it on a regular VM and it worked. I have tried the v2-alpha, tpu-vm-pt-1.10, and tpu-vm-tf-2.8.0 images, and it failed on all of them.
