Right now, going from Array{T}
to Array{Union{Missing, T}}
requires copying the original array: https://github.com/JuliaLang/julia/issues/26681
But according to this blog post a Array{Union{Missing, T}}
is represented internally as two arrays, one for the data and one for the Missing
array: https://julialang.org/blog/2018/06/missing/#an_efficient_representation
So my question is, why is it necessary to copy the original array when using convert()
? If an Array{Union{Missing, T}}
is represented internally as two arrays anyway, would it be enough to create a Array{Union{Missing, T}}
that references the original one but only initializes its internal Array{UInt8}
?
we could do that, but what do you then do about other references still pointing to the old array?
What about them? :thinking:
There are plenty of cases where array variables share the underlying data (e.g. reshape()
).
yeah, but there the element type is the same
imagine you have old_arr
with Int8
s, and then you have a non-copying conversion to Union{Missing, Int8}
. Now I set some element in the latter to Missing
- what should reading from old_arr
produce?
the Int8
value is now invalid after all
(Int8 is just an example, it gets spicier with mutable structs for example)
Well, not really. As far as old_arr
is concerned that element still has valid data, it's only in new_arr
that it is missing
.
I can see why it could be confusing at first if you don't know how it's implemented under the hood, but to me that seems like quite reasonable semantics.
inconsistent views of the same data does not at all sound reasonable to me :)
sounds very much like a bug waiting to happen
my point is that you can't really call it new_arr
if it's sharing the same data
I don't think it's inconsistent, anymore than something like PermutedDimsArray
is inconsistent :stuck_out_tongue:
Sukera said:
my point is that you can't really call it
new_arr
if it's sharing the same data
I think you absolutely can, and it's even very useful because it avoids copying a potentially large array.
Numpy's masked arrays are also implemented like this, but they have a copy
option. Maybe it'd be best to have something like that, but with copy=True
as the default?
permutedims
is chiefly not the same, since it keeps the underlying the data exactly the same and all its validity guarantees. All it does is defer a copy if you do decide to push to e.g. the original vector.
You can't really avoid copying that without being sure that the data isn't being messed with under your nose
I absolutely agree, but in many cases that's exactly what I want. The whole point of converting to a missing array is so that I can forget about masked out elements.
(i.e. I'm never going to touch old_arr
once it's converted)
that's a different usecase from just calling convert
though
that'd be perfectly fine in a MissingArray
wrapper
Is that a thing? :eyes:
no
Ah you had me excited :sweat_smile: But yes, that would also work if it existed.
either way, if that conversion/copy is your bottleneck, I'm a tiny bit concerned about the other bottlenecks that are sure to come up around such code :eyes:
It's not the biggest bottleneck, but it increases running time by about ~40%. The biggest bottleneck is IO.
Sukera said:
that'd be perfectly fine in a
MissingArray
wrapper
I do wonder though, could you in theory get the same performance with such a wrapper as with a Union{Missing, T}
? My understanding is that the compiler does a lot of work to optimize that away.
sounds like you want to preallocate a Vector{Union{Missing, T}}
in the first place to write into, but ok
the compiler isn't really able to optimize your Array{Union{Missing, T}}
away, no
it has to insert checks for every element access
Sukera said:
sounds like you want to preallocate a
Vector{Union{Missing, T}}
in the first place to write into, but ok
Unfortunately that's not possible, the data is being read from a Python library that doesn't accept a preallocated array.
Sukera said:
it has to insert checks for every element access
Right, but in the blogpost I linked there's an example of the compiler being able to vectorize a loop anyway using masked SIMD instructions. Is that sort of thing possible with a custom array type too? I don't know how specially the compiler treats unions of isbits types.
masked SIMD are very specialized, if your data is not a very simple datatype SIMD will be hard either way
the compiler can do some tricks to bring the data in that shape, if you don't need all of a struct in the loop, but it's generally much more difficult
Gotcha, thanks.
Sukera said:
yeah, but there the element type is the same
What about reinterpret
?
those at least share the exact same memory for all their representations, so changing one affects the other directly
you may still not get a valid instance for custom structs with custom invariants, yes
It’s the exact same memory with Union{T, Missing}
as well
There’s just an additional bitvector telling you if an element is junk or not
I think what James is asking for here is certainly not a good idea for a public API, but isn’t really any more scary than most unsafe pointer operations
Mason Protter said:
I think what James is asking for here is certainly got a good idea for a public API, but isn’t really any more scary than most unsafe pointer operations
The "but" makes me think you meant to type "certainly not a good idea for a public API" there?
Oops, yeah I meant 'not'
edited
Last updated: Nov 06 2024 at 04:40 UTC