Stream: helpdesk (published)

Topic: Unexpected behavior in `Iterators.take`


view this post on Zulip mbaz (Dec 22 2022 at 19:25):

I'm writing an infinite, random iterator. Something like this (simplified):

struct RandIter
    x :: Vector{Int}
end

Base.length(ri :: RandIter) = 5

function Base.iterate(i::RandIter, state = 0)
    if state < 5
        i.x[1] = rand((1,2,3))
        i.x[2] = rand((1,2,3))
        i.x[3] = rand((1,2,3))
        state += 1
        return (i.x, state)
    else
        return nothing
    end
end

and then

julia> ri = RandIter([0,0,0]);

julia> collect(Iterators.take(ri, 4))
4-element Vector{Any}:
 [1, 3, 2]
 [1, 3, 2]
 [1, 3, 2]
 [1, 3, 2]

Here, [1, 3, 2] is the last random value assigned to ri.x.

This is specific to take; for example, this works:

julia> for v in ri println(v); end
[3, 3, 1]
[2, 3, 1]
[3, 3, 3]
[1, 2, 2]
[1, 1, 2]

Ultimately, what I want to do is generate the random values without allocating; that's the idea behind reusing the memory allocated in ri.x. How can I do that and still have a working take iterator?

view this post on Zulip Sukera (Dec 22 2022 at 20:26):

you're returning the exact same array in every iteration

view this post on Zulip Sukera (Dec 22 2022 at 20:26):

if you want to create distinct arrays, you HAVE to copy

view this post on Zulip mbaz (Dec 22 2022 at 20:27):

Hmm, I'm starting to think that this will be impossible... for take to work, at some point a copy will have to be made. Unfortunately, this makes the iterator almost 10x slower

view this post on Zulip Sukera (Dec 22 2022 at 20:27):

you could always return a tuple instead, if the size is limited

view this post on Zulip Sukera (Dec 22 2022 at 20:27):

but yeah, at some point you will have to copy

view this post on Zulip Sukera (Dec 22 2022 at 20:28):

the only reason println works is because it prints the current state of the array, and doesn't keep a reference to it around for longer than one iteration

view this post on Zulip mbaz (Dec 22 2022 at 20:28):

Yep -- the problem with tuples is that later on I'd like to use the random values in more flexible ways.

view this post on Zulip mbaz (Dec 22 2022 at 20:29):

I guess what I was hoping would happen is that collect(take(...)) would allocate an array of the required size and stick the iterator values in there, requiring only one allocation.

If I need to copy the array, I end up doing one allocation per iteration, which is sad.

view this post on Zulip Sukera (Dec 22 2022 at 20:30):

well it does only allocate once, the outer array

view this post on Zulip Sukera (Dec 22 2022 at 20:30):

it's just that every entry returned is the exact same array (===), where you just changed its contents

view this post on Zulip Sukera (Dec 22 2022 at 20:31):

it's the same reason why

julia> fill([], 2)
2-element Vector{Vector{Any}}:
 []
 []

julia> ans[1] === ans[2]
true

view this post on Zulip mbaz (Dec 22 2022 at 20:31):

Yeah, but collect is not copying the values returned by my iterator.

view this post on Zulip Sukera (Dec 22 2022 at 20:32):

why should it?

view this post on Zulip Michael Abbott (Dec 22 2022 at 20:32):

If you do something with the result, though, it might be fine that it gets re-used. E.g. map(sum, iter) ought to call sum before getting the next element, and thus have distinct values.

view this post on Zulip mbaz (Dec 22 2022 at 20:33):

@Michael Abbott Yeah, in some cases (like for) above it does work.

view this post on Zulip Sukera (Dec 22 2022 at 20:33):

if you want to copy, you need to copy explicitly - either by doing Iterators.map(copy, RandIter([1,2,3])) or by copying in your iterator

view this post on Zulip Sukera (Dec 22 2022 at 20:33):

mbaz said:

Michael Abbott Yeah, in some cases (like for) above it does work.

again - the for doesn't copy. The println just looks at the elements of the array before you change them again.

view this post on Zulip mbaz (Dec 22 2022 at 20:34):

I understand.

view this post on Zulip mbaz (Dec 22 2022 at 20:35):

I'm not complaining, by the way, I just was very confused by the behavior of collect(take(itr)). I assumed that, just like println uses the values before they change, collect would take the returned values and stick them in its own array.

view this post on Zulip Sukera (Dec 22 2022 at 20:36):

that is exactly what it's doing though! :D

view this post on Zulip Sukera (Dec 22 2022 at 20:36):

you just always return the same array

view this post on Zulip Sukera (Dec 22 2022 at 20:36):

so it always sticks the same exact array into the new location

view this post on Zulip mbaz (Dec 22 2022 at 20:38):

Maybe I don't know what you mean by "same array". I understand the array (in the sense of the array pointer) is the same. However, the values of the array are different in each iteration.

view this post on Zulip Michael Abbott (Dec 22 2022 at 20:38):

Maybe this is like copy vs deepcopy. collect is making a new array to hold things, but those things themselves are just pointers to memory. If I do A = [[1,2], [3,4]]; B = copy(A), doing B[1] = [5,6] mutates B but not A. But B[2] .= 0 mutates not B but B[2], which is === A[2], the same memory.

view this post on Zulip Sukera (Dec 22 2022 at 20:40):

By "same array" I mean that they are === (called egal) - they are literally the same exact array object! They not only have the same contents, they occupy the same memory because they are the exact same object.

view this post on Zulip Michael Abbott (Dec 22 2022 at 20:40):

I also had a mapslices example... internally it does exactly this re-use of one array for every slice. This is written into, passed to f, the result saved elsewhere, and repeat -- safe. But if you explicitly save the array elsewhere, it's always the same one:

julia> store = []; mapslices(rand(Int8,2,3); dims=1) do x
                       push!(store, x)
                       sum(x)
                   end
1×3 Matrix{Int64}:
 -225  90  -78

julia> store
3-element Vector{Any}:
 Int8[-64, -14]
 Int8[-64, -14]
 Int8[-64, -14]

view this post on Zulip Sukera (Dec 22 2022 at 20:41):

Whether the content of the array differs in each iteration is irrelevant - your iterator always returns the same object, so that's what collect sticks in the collected array. It doesn't go around checking whether the contents have changed - that'd be quite a lot of overhead (and hard to do in a generic way).

view this post on Zulip mbaz (Dec 22 2022 at 20:41):

Let's say A is collect's array and ri.x is what my iterator returns. I thought what would happen is A[:, n] = ri.x[:], where n is the iteration index.

view this post on Zulip Sukera (Dec 22 2022 at 20:41):

an iteration is not indexable - your iterator only returns a single object

view this post on Zulip Sukera (Dec 22 2022 at 20:41):

not a collection of objects

view this post on Zulip mbaz (Dec 22 2022 at 20:42):

No, but collect knows, since it iterates over the Take iterator.

view this post on Zulip Michael Abbott (Dec 22 2022 at 20:43):

That's what mapslices will do, it makes a matrix A to write slices into -- they are moved to new memory. But the collected array is not a matrix like A, it's a vector containing Vectors, each of which has its own pointer (and in this case they all agree)

view this post on Zulip Sukera (Dec 22 2022 at 20:43):

that is also not indexable!

view this post on Zulip Michael Abbott (Dec 22 2022 at 20:44):

Making one big output array and writing into it while iterating is also what stack will do:

julia> using Compat

julia> stack(Iterators.take(ri, 4))
3×4 Matrix{Int64}:
 2  1  2  1
 3  1  3  2
 1  2  3  3

view this post on Zulip Sukera (Dec 22 2022 at 20:45):

collect (roughly) does this, provided the array has some knowable size:

out = Array{eltype(itr)}(undef, length(itr))
for (idx, obj) in enumerate(itr)
    out[idx] = obj
end
out

view this post on Zulip Sukera (Dec 22 2022 at 20:45):

there is no copy involved and it has no knowledge of whether or not you return the same exact object in each iteration or not

view this post on Zulip Sukera (Dec 22 2022 at 20:46):

(there are some more cases with multidimensional iterators, but those share the non-copying semantics so I'll leave them out for simplicity)

view this post on Zulip Michael Abbott (Dec 22 2022 at 20:46):

And this is very different to having out[:, idx] .= obj in the loop, which copies the values of obj immediately (into out::Matrix instead of out::Vector{Vector{...}})

view this post on Zulip mbaz (Dec 22 2022 at 20:46):

@Sukera Agreed -- I see it now, but I didn't see it before.

@Michael Abbott stack looks like what I need here.

view this post on Zulip mbaz (Dec 22 2022 at 20:50):

stack is superfast as well... almost zero overhead. I can't wait for v1.9

view this post on Zulip mbaz (Dec 22 2022 at 20:51):

@Sukera @Michael Abbott Thanks for you help!


Last updated: Nov 06 2024 at 04:40 UTC