Constructing missing arrays · helpdesk (published)

Stream: helpdesk (published)

Topic: Constructing missing arrays

James Wrigley (Aug 11 2022 at 09:11):

Right now, going from Array{T} to Array{Union{Missing, T}} requires copying the original array: https://github.com/JuliaLang/julia/issues/26681
But according to this blog post a Array{Union{Missing, T}} is represented internally as two arrays, one for the data and one for the Missing array: https://julialang.org/blog/2018/06/missing/#an_efficient_representation

So my question is, why is it necessary to copy the original array when using convert()? If an Array{Union{Missing, T}} is represented internally as two arrays anyway, would it be enough to create a Array{Union{Missing, T}} that references the original one but only initializes its internal Array{UInt8}?

Sukera (Aug 11 2022 at 09:21):

we could do that, but what do you then do about other references still pointing to the old array?

James Wrigley (Aug 11 2022 at 09:25):

What about them? :thinking:

James Wrigley (Aug 11 2022 at 09:26):

There are plenty of cases where array variables share the underlying data (e.g. reshape()).

Sukera (Aug 11 2022 at 09:30):

yeah, but there the element type is the same

Sukera (Aug 11 2022 at 09:31):

imagine you have old_arr with Int8s, and then you have a non-copying conversion to Union{Missing, Int8}. Now I set some element in the latter to Missing - what should reading from old_arr produce?

Sukera (Aug 11 2022 at 09:31):

the Int8 value is now invalid after all

Sukera (Aug 11 2022 at 09:31):

(Int8 is just an example, it gets spicier with mutable structs for example)

James Wrigley (Aug 11 2022 at 09:33):

Well, not really. As far as old_arr is concerned that element still has valid data, it's only in new_arr that it is missing.

James Wrigley (Aug 11 2022 at 09:34):

I can see why it could be confusing at first if you don't know how it's implemented under the hood, but to me that seems like quite reasonable semantics.

Sukera (Aug 11 2022 at 09:36):

inconsistent views of the same data does not at all sound reasonable to me :)

Sukera (Aug 11 2022 at 09:36):

sounds very much like a bug waiting to happen

Sukera (Aug 11 2022 at 09:36):

my point is that you can't really call it new_arr if it's sharing the same data

James Wrigley (Aug 11 2022 at 09:40):

I don't think it's inconsistent, anymore than something like PermutedDimsArray is inconsistent :stuck_out_tongue:

James Wrigley (Aug 11 2022 at 09:41):

Sukera said:

my point is that you can't really call it new_arr if it's sharing the same data

I think you absolutely can, and it's even very useful because it avoids copying a potentially large array.

James Wrigley (Aug 11 2022 at 09:42):

Numpy's masked arrays are also implemented like this, but they have a copy option. Maybe it'd be best to have something like that, but with copy=True as the default?

Sukera (Aug 11 2022 at 09:55):

permutedims is chiefly not the same, since it keeps the underlying the data exactly the same and all its validity guarantees. All it does is defer a copy if you do decide to push to e.g. the original vector.

Sukera (Aug 11 2022 at 09:56):

You can't really avoid copying that without being sure that the data isn't being messed with under your nose

James Wrigley (Aug 11 2022 at 09:59):

I absolutely agree, but in many cases that's exactly what I want. The whole point of converting to a missing array is so that I can forget about masked out elements.

James Wrigley (Aug 11 2022 at 09:59):

(i.e. I'm never going to touch old_arr once it's converted)

Sukera (Aug 11 2022 at 10:00):

that's a different usecase from just calling convert though

Sukera (Aug 11 2022 at 10:00):

that'd be perfectly fine in a MissingArray wrapper

James Wrigley (Aug 11 2022 at 10:01):

Is that a thing? :eyes:

Sukera (Aug 11 2022 at 10:01):

James Wrigley (Aug 11 2022 at 10:02):

Ah you had me excited :sweat_smile: But yes, that would also work if it existed.

Sukera (Aug 11 2022 at 10:02):

either way, if that conversion/copy is your bottleneck, I'm a tiny bit concerned about the other bottlenecks that are sure to come up around such code :eyes:

James Wrigley (Aug 11 2022 at 10:03):

It's not the biggest bottleneck, but it increases running time by about ~40%. The biggest bottleneck is IO.

James Wrigley (Aug 11 2022 at 10:05):

Sukera said:

that'd be perfectly fine in a MissingArray wrapper

I do wonder though, could you in theory get the same performance with such a wrapper as with a Union{Missing, T}? My understanding is that the compiler does a lot of work to optimize that away.

Sukera (Aug 11 2022 at 10:05):

sounds like you want to preallocate a Vector{Union{Missing, T}} in the first place to write into, but ok

Sukera (Aug 11 2022 at 10:06):

the compiler isn't really able to optimize your Array{Union{Missing, T}} away, no

Sukera (Aug 11 2022 at 10:06):

it has to insert checks for every element access

James Wrigley (Aug 11 2022 at 10:07):

Sukera said:

sounds like you want to preallocate a Vector{Union{Missing, T}} in the first place to write into, but ok

Unfortunately that's not possible, the data is being read from a Python library that doesn't accept a preallocated array.

James Wrigley (Aug 11 2022 at 10:09):

Sukera said:

it has to insert checks for every element access

Right, but in the blogpost I linked there's an example of the compiler being able to vectorize a loop anyway using masked SIMD instructions. Is that sort of thing possible with a custom array type too? I don't know how specially the compiler treats unions of isbits types.

Sukera (Aug 11 2022 at 10:31):

masked SIMD are very specialized, if your data is not a very simple datatype SIMD will be hard either way

Sukera (Aug 11 2022 at 10:31):

the compiler can do some tricks to bring the data in that shape, if you don't need all of a struct in the loop, but it's generally much more difficult

James Wrigley (Aug 11 2022 at 10:40):

Gotcha, thanks.

Mason Protter (Aug 11 2022 at 15:26):

Sukera said:

yeah, but there the element type is the same

What about reinterpret?

Sukera (Aug 11 2022 at 15:52):

those at least share the exact same memory for all their representations, so changing one affects the other directly

Sukera (Aug 11 2022 at 15:53):

you may still not get a valid instance for custom structs with custom invariants, yes

Mason Protter (Aug 11 2022 at 16:53):

It’s the exact same memory with Union{T, Missing} as well

Mason Protter (Aug 11 2022 at 16:53):

There’s just an additional bitvector telling you if an element is junk or not

Mason Protter (Aug 11 2022 at 16:55):

I think what James is asking for here is certainly not a good idea for a public API, but isn’t really any more scary than most unsafe pointer operations

Sundar R (Aug 11 2022 at 22:24):

Mason Protter said:

I think what James is asking for here is certainly got a good idea for a public API, but isn’t really any more scary than most unsafe pointer operations

The "but" makes me think you meant to type "certainly not a good idea for a public API" there?

Mason Protter (Aug 11 2022 at 22:48):

Oops, yeah I meant 'not'

Mason Protter (Aug 11 2022 at 22:49):

edited

Last updated: Aug 14 2025 at 04:51 UTC