What is the correct/fastest way to make an Array which is created though unsafe_wrap safe? It seems that both collect
and copy
works, but is there some way which is the best way?
Here is the code which I need to fix for reference (which is not safe and segfaults): https://github.com/apache/arrow-julia/pull/436/files
Depends on where that pointer came from
Are you sure it's not being freed under your nose somewhere?
If you arent protecting batch.bytes
from garbage collection you’re in for a bad time
One thing I really like to do is use StrideArrays.jl for this kind of thing.
julia> using StrideArrays
julia> let A = [1, 2, 3, 4, 5]
pA = PtrArray(pointer(A), size(A))
sA = StrideArray(pA, A)
end
5-element StrideArray{Int64, 1, (1,), Tuple{Int64}, Tuple{Nothing}, Tuple{StaticInt{1}}, Vector{Int64}}:
1
2
3
4
5
the object pA
is an abstract array that just has the pointer to A
and information about it's size, and the object sA
is also an abstractarray, but it stores pA
and a reference to where the original data from pA
's pointer came from. Since sA
carries this reference around, it will be safe from garbage collection
Mason Protter said:
If you arent protecting
batch.bytes
from garbage collection you’re in for a bad time
Yeah exactly. It was a hipshot pr and I didn't read carefully what unsafe_wrap did (although the use of wrap
name should have triggered an alarm).
Since the normal case allocates a new array for the result, I'm sure that allocating a new array also for the uncompressed case is fine as well. I was more after if there is some idiomatic way to do it. I'm not sure the devs would like a new dependency just to handle this edge case.
I guess they'd also prefer if the function was type stable, but maybe one can use a StridedArray also for the compressed case.
@Sukera It absolutely is being freed under my nose, so I need to move it over to some safe array. Sorry if my question was unclear. I was trying to ask what is an idiomatic way to do it.
The most idiomatic thing I guess would be to not do the pointer gymnastics in the first place, but failing that, I'd change the else
block in buildbitmap
to something like
else
# compressed
ptr = pointer(batch.bytes, voff)
GC.@preserve batch begin
_, _decodedbytes = uncompress(ptr, buffer, rb.compression)
decodedbytes = collect(_decodedbytes)
end
return ValidityBitmap(decodedbytes, 1, node.length, node.null_count)
end
or you could change uncompress
to take batch
or batch.bytes
as an argument and do the GC.@preserve
and collect
in there.
And yeah, copy
or collect
should be fine. I guess copy
is a more more clear here?
I was also thinking that Vector{UInt8}(encodedbytes)
might be even more clear, e.g. in case copy
would someday return some other type than Vector
.
the use of wrap name should have triggered an alarm
the use of unsafe_*
should have triggered an alarm :eyes: I agree with Mason - if you reach for pointers without something in an FFI interopt requring that, you're willingly walking over a sea of knives.
copy
is perfectly good to use here.
It’s not immediately obvious to me that Vector{UInt8}(encodedbytes)
actually makes a copy, so I wouldn’t use that one personally
I agree - I wouldn't expect copy
to return a different type entirely.
@DrChainsaw I don't think this change https://github.com/apache/arrow-julia/pull/436/commits/5dec5f8353c26af5d017cbda5989cdfd026435b3 is sufficient
You also need to protect batch.bytes
from the GC while you do these operations on it
That's why I suggested GC.@preserve
here:
Mason Protter said:
The most idiomatic thing I guess would be to not do the pointer gymnastics in the first place, but failing that, I'd change the
else
block inbuildbitmap
to something likeelse # compressed ptr = pointer(batch.bytes, voff) GC.@preserve batch begin _, _decodedbytes = uncompress(ptr, buffer, rb.compression) decodedbytes = collect(_decodedbytes) end return ValidityBitmap(decodedbytes, 1, node.length, node.null_count) end
Hmm, yeah I agree it looks weird to not have any GC.@preserve
near that pointer. Wouldn't pretty much the entire codebase suffer from this since it uses pointers everywhere?
No idea, I haven't looked at the wider codebase
but any time you take the pointer from a julia GC tracked object, and then don't have references to that object, but keep your references to the pointer, you're dealing with a non-deterministic time bomb
Oh it's a very deterministic time bomb :grinning: It will fail, the only thing you don't know is when :D
The cyber security community made a whole industry about issues like that after all
the only thing you don't know is when :grinning:
So non-detereministic :stuck_out_tongue:
non-deterministic is that you don't know whether something happens or not - this WILL happen, you're just closing your eyes to it :stuck_out_tongue_wink:
But when and how it happens is completely non-deterministic
The answer to those is always "at the most inconvenient times" :P
There is something here which I think I don't have a firm grasp about.
any time you take the pointer from a julia GC tracked object, and then don't have references to that object, but keep your references to the pointer, you're dealing with a non-deterministic time bomb
This makes perfect sense to me and the first PR commit clearly did this by just returning the wrapped array (and that did indeed blow up in my face). However, the original (and afaik current PR code) seems to always copy over the pointed to data to a new array before the pointed to object (batch.bytes
) can go out of scope. I suppose one can complain about the coding style to even have a function which accepts a pointer, but I'm ok with leaving that judgement to the package maintainers.
However, looking at the the docs of GC.@preserve
, the first example seems like redundant use of GC.@preserve
since x
can't be garbage collected before unsafe_load
returns, or can it?
julia> let
x = Ref{Int}(101)
p = Base.unsafe_convert(Ptr{Int}, x)
GC.@preserve x unsafe_load(p)
end
Is this just a simple demonstration of how to use it, or is it a situation where it is needed or else the code will blow up eventually?
Without a GC.@preserve x
that code would blow up.
It's unlikely to blow up, but it can happen
If I write
x = Ref(101)
p = Base.unsafe_convert(Ptr{Int}, x)
unsafe_load(p)
then the runtime is fully allowed to garbage collect x
between the second and third lines, i.e.
x = Ref(101)
p = Base.unsafe_convert(Ptr{Int}, x)
# No more references to x, now allowed to free its memory here!!!
unsafe_load(p)
So I guess then the code in Arrow.jl might work (for now) because batch
is accessed later on in the code?
Probably yeah
with the Ref
example above, it's highly unlikely to blow up on you because the Ref
constructed in that way is stack allocated, and in simple examples the compiler wouldn't move the stack frame to corrupt the Ref
in the middle of a function's execution
but it is definitely allowed to do so
I guess I was assuming that scopes are as important to the compiler as they are to the programmer.
(Note that there's no semantic guarantee that the Ref
is actually stack allocated! Semantically, it's all just "julia managed memory" - there is no heap nor stack in that model, that's just the reality of computing leaking out from the abstraction)
Last updated: Dec 28 2024 at 04:38 UTC