Stream: helpdesk (published)

Topic: Constructing missing arrays


view this post on Zulip James Wrigley (Aug 11 2022 at 09:11):

Right now, going from Array{T} to Array{Union{Missing, T}} requires copying the original array: https://github.com/JuliaLang/julia/issues/26681
But according to this blog post a Array{Union{Missing, T}} is represented internally as two arrays, one for the data and one for the Missing array: https://julialang.org/blog/2018/06/missing/#an_efficient_representation

So my question is, why is it necessary to copy the original array when using convert()? If an Array{Union{Missing, T}} is represented internally as two arrays anyway, would it be enough to create a Array{Union{Missing, T}} that references the original one but only initializes its internal Array{UInt8}?

view this post on Zulip Sukera (Aug 11 2022 at 09:21):

we could do that, but what do you then do about other references still pointing to the old array?

view this post on Zulip James Wrigley (Aug 11 2022 at 09:25):

What about them? :thinking:

view this post on Zulip James Wrigley (Aug 11 2022 at 09:26):

There are plenty of cases where array variables share the underlying data (e.g. reshape()).

view this post on Zulip Sukera (Aug 11 2022 at 09:30):

yeah, but there the element type is the same

view this post on Zulip Sukera (Aug 11 2022 at 09:31):

imagine you have old_arr with Int8s, and then you have a non-copying conversion to Union{Missing, Int8}. Now I set some element in the latter to Missing - what should reading from old_arr produce?

view this post on Zulip Sukera (Aug 11 2022 at 09:31):

the Int8 value is now invalid after all

view this post on Zulip Sukera (Aug 11 2022 at 09:31):

(Int8 is just an example, it gets spicier with mutable structs for example)

view this post on Zulip James Wrigley (Aug 11 2022 at 09:33):

Well, not really. As far as old_arr is concerned that element still has valid data, it's only in new_arr that it is missing.

view this post on Zulip James Wrigley (Aug 11 2022 at 09:34):

I can see why it could be confusing at first if you don't know how it's implemented under the hood, but to me that seems like quite reasonable semantics.

view this post on Zulip Sukera (Aug 11 2022 at 09:36):

inconsistent views of the same data does not at all sound reasonable to me :)

view this post on Zulip Sukera (Aug 11 2022 at 09:36):

sounds very much like a bug waiting to happen

view this post on Zulip Sukera (Aug 11 2022 at 09:36):

my point is that you can't really call it new_arr if it's sharing the same data

view this post on Zulip James Wrigley (Aug 11 2022 at 09:40):

I don't think it's inconsistent, anymore than something like PermutedDimsArray is inconsistent :stuck_out_tongue:

view this post on Zulip James Wrigley (Aug 11 2022 at 09:41):

Sukera said:

my point is that you can't really call it new_arr if it's sharing the same data

I think you absolutely can, and it's even very useful because it avoids copying a potentially large array.

view this post on Zulip James Wrigley (Aug 11 2022 at 09:42):

Numpy's masked arrays are also implemented like this, but they have a copy option. Maybe it'd be best to have something like that, but with copy=True as the default?

view this post on Zulip Sukera (Aug 11 2022 at 09:55):

permutedims is chiefly not the same, since it keeps the underlying the data exactly the same and all its validity guarantees. All it does is defer a copy if you do decide to push to e.g. the original vector.

view this post on Zulip Sukera (Aug 11 2022 at 09:56):

You can't really avoid copying that without being sure that the data isn't being messed with under your nose

view this post on Zulip James Wrigley (Aug 11 2022 at 09:59):

I absolutely agree, but in many cases that's exactly what I want. The whole point of converting to a missing array is so that I can forget about masked out elements.

view this post on Zulip James Wrigley (Aug 11 2022 at 09:59):

(i.e. I'm never going to touch old_arr once it's converted)

view this post on Zulip Sukera (Aug 11 2022 at 10:00):

that's a different usecase from just calling convert though

view this post on Zulip Sukera (Aug 11 2022 at 10:00):

that'd be perfectly fine in a MissingArray wrapper

view this post on Zulip James Wrigley (Aug 11 2022 at 10:01):

Is that a thing? :eyes:

view this post on Zulip Sukera (Aug 11 2022 at 10:01):

no

view this post on Zulip James Wrigley (Aug 11 2022 at 10:02):

Ah you had me excited :sweat_smile: But yes, that would also work if it existed.

view this post on Zulip Sukera (Aug 11 2022 at 10:02):

either way, if that conversion/copy is your bottleneck, I'm a tiny bit concerned about the other bottlenecks that are sure to come up around such code :eyes:

view this post on Zulip James Wrigley (Aug 11 2022 at 10:03):

It's not the biggest bottleneck, but it increases running time by about ~40%. The biggest bottleneck is IO.

view this post on Zulip James Wrigley (Aug 11 2022 at 10:05):

Sukera said:

that'd be perfectly fine in a MissingArray wrapper

I do wonder though, could you in theory get the same performance with such a wrapper as with a Union{Missing, T}? My understanding is that the compiler does a lot of work to optimize that away.

view this post on Zulip Sukera (Aug 11 2022 at 10:05):

sounds like you want to preallocate a Vector{Union{Missing, T}} in the first place to write into, but ok

view this post on Zulip Sukera (Aug 11 2022 at 10:06):

the compiler isn't really able to optimize your Array{Union{Missing, T}} away, no

view this post on Zulip Sukera (Aug 11 2022 at 10:06):

it has to insert checks for every element access

view this post on Zulip James Wrigley (Aug 11 2022 at 10:07):

Sukera said:

sounds like you want to preallocate a Vector{Union{Missing, T}} in the first place to write into, but ok

Unfortunately that's not possible, the data is being read from a Python library that doesn't accept a preallocated array.

view this post on Zulip James Wrigley (Aug 11 2022 at 10:09):

Sukera said:

it has to insert checks for every element access

Right, but in the blogpost I linked there's an example of the compiler being able to vectorize a loop anyway using masked SIMD instructions. Is that sort of thing possible with a custom array type too? I don't know how specially the compiler treats unions of isbits types.

view this post on Zulip Sukera (Aug 11 2022 at 10:31):

masked SIMD are very specialized, if your data is not a very simple datatype SIMD will be hard either way

view this post on Zulip Sukera (Aug 11 2022 at 10:31):

the compiler can do some tricks to bring the data in that shape, if you don't need all of a struct in the loop, but it's generally much more difficult

view this post on Zulip James Wrigley (Aug 11 2022 at 10:40):

Gotcha, thanks.

view this post on Zulip Mason Protter (Aug 11 2022 at 15:26):

Sukera said:

yeah, but there the element type is the same

What about reinterpret?

view this post on Zulip Sukera (Aug 11 2022 at 15:52):

those at least share the exact same memory for all their representations, so changing one affects the other directly

view this post on Zulip Sukera (Aug 11 2022 at 15:53):

you may still not get a valid instance for custom structs with custom invariants, yes

view this post on Zulip Mason Protter (Aug 11 2022 at 16:53):

It’s the exact same memory with Union{T, Missing} as well

view this post on Zulip Mason Protter (Aug 11 2022 at 16:53):

There’s just an additional bitvector telling you if an element is junk or not

view this post on Zulip Mason Protter (Aug 11 2022 at 16:55):

I think what James is asking for here is certainly got a good idea for a public API, but isn’t really any more scary than most unsafe pointer operations

view this post on Zulip Sundar R (Aug 11 2022 at 22:24):

Mason Protter said:

I think what James is asking for here is certainly got a good idea for a public API, but isn’t really any more scary than most unsafe pointer operations

The "but" makes me think you meant to type "certainly not a good idea for a public API" there?

view this post on Zulip Mason Protter (Aug 11 2022 at 22:48):

Oops, yeah I meant 'not'

view this post on Zulip Mason Protter (Aug 11 2022 at 22:49):

edited


Last updated: Oct 02 2023 at 04:34 UTC