Stream: helpdesk (published)

Topic: Collecting an iterator at specified indices?


view this post on Zulip Júlio Hoffimann (Feb 05 2024 at 12:42):

Suppose we are given an iterator iter and a list of linear indices inds. How to collect iter at inds? Is there a helper function in Base for that?

view this post on Zulip Júlio Hoffimann (Feb 05 2024 at 13:00):

I wrote the following helper function assuming that inds is indexable and sorted:

function collectat(iter, inds)
  ind = inds[begin]
  itr = Iterators.drop(iter, ind - 1)
  vec = [first(itr)]
  for new in inds[(begin + 1):end]
    itr = Iterators.drop(itr, new - ind)
    push!(vec, first(itr))
    ind = new
  end
  vec
end

Do you have improvement suggestions?

view this post on Zulip Júlio Hoffimann (Feb 05 2024 at 13:14):

The function above is super slow though with large collections. Any workaround?

view this post on Zulip Júlio Hoffimann (Feb 05 2024 at 13:15):

The underlying issue is that I have a large collection that requires more memory than available RAM. However, I only need to materialize at specific indices efficiently.

view this post on Zulip Júlio Hoffimann (Feb 05 2024 at 13:27):

Another solution is:

function collectat(iter, inds)
  vec = Vector{eltype(iter)}()
  set = Set(inds)
  for (i, x) in enumerate(iter)
    if i in set
      push!(vec, x)
    end
  end
  vec
end

view this post on Zulip Mason Protter (Feb 05 2024 at 13:27):

Does inds have a known size?

view this post on Zulip Júlio Hoffimann (Feb 05 2024 at 13:27):

Mason Protter said:

Does inds have a known size?

yes, it does

view this post on Zulip Mason Protter (Feb 05 2024 at 13:37):

My first intinct is to write something like

using BangBang: setindex!!
using MicroCollections: UndefArray

# easy case where we can just take a view of `A`
collectat(A::AbstractArray, inds::AbstractArray) = collect(view(A, inds))

# Harder case where `iter` is some general iterator
function collectat(iter, inds)
    arr = UndefArray(size(inds)...)
    iter_next = iterate(iter)
    inds_ind = firstindex(inds)
    count = 1
    while !isnothing(iter_next)
        x, state = iter_next
        if count == inds[inds_ind]
            arr = setindex!!(arr, x, inds_ind)
            inds_ind += 1
        end
        count += 1
    end
    arr
end

view this post on Zulip Júlio Hoffimann (Feb 05 2024 at 13:39):

Thank you @Mason Protter , will take a look at it as soon as possible. Running some experiments at the moment to see if this bottleneck is solved.

view this post on Zulip Mason Protter (Feb 05 2024 at 13:43):

sorry I made some typos above

view this post on Zulip Mason Protter (Feb 05 2024 at 13:43):

Needs a few fixes

view this post on Zulip Júlio Hoffimann (Feb 05 2024 at 13:43):

Thanks for the help. I am investigating the runs so far and will look into it more carefully soon.

view this post on Zulip Mason Protter (Feb 05 2024 at 14:07):

Hm, yeah the things I've been trying are all pretty slow.

view this post on Zulip Mason Protter (Feb 05 2024 at 14:08):

This does seem to work:

function collectat(iter, inds)
    arr = Array{eltype(iter)}(undef, size(inds)...)
    iter_next = iterate(iter)
    count = 1
    ind = firstindex(inds)
    for x  iter
        if count == inds[ind]
            arr = setindex!!(arr, x, ind)
            ind += 1
        end
        count += 1
        if ind > lastindex(inds)
            break
        end
    end
    arr
end

but the perf is not good

view this post on Zulip Mason Protter (Feb 05 2024 at 14:08):

I think maybe one should just write a custom iterator?

view this post on Zulip Mason Protter (Feb 05 2024 at 14:10):

i.e. something like

struct IterView{T, Iter, Inds}
    iter::Iter
    inds::Inds
end
IterView(itr::A, inds::I) where {A, I} = IterView{eltype(A), A, I}(itr, inds)

and then a custom iterate method

view this post on Zulip Mason Protter (Feb 05 2024 at 14:10):

but it'd be annoying to write and I'm not sure it'd be faster

view this post on Zulip Júlio Hoffimann (Feb 05 2024 at 14:17):

I did a workaround in the script to avoid the construction of large iterators, but will continue to push the limits here, and come back to the collectat soon.

view this post on Zulip aplavin (Feb 05 2024 at 15:12):

Won't something like map(last, Iterators.filter(((i, v),) -> i in ind, enumerate(iter))) work? Basically, a one-line version of your second solution @Júlio Hoffimann.

view this post on Zulip Andy Dienes (Feb 05 2024 at 15:43):

that will still materialize each item right? just not all at the same time

view this post on Zulip Andy Dienes (Feb 05 2024 at 15:46):

julia> foo() = rand(10000000);

julia> idxs = [2,3,5,7];

julia> itr = (foo() for _  1:100);

julia> @allocated collectat(itr, idxs)
641436824

julia> @allocated map(last, Iterators.filter(((i, v),) -> i  idxs, enumerate(itr)))
8003540800

view this post on Zulip aplavin (Feb 05 2024 at 17:30):

Of course, iterators are fundamentally sequential, so the only way to get to the Nth element is to iterate over the first N-1. For random access, see arrays :)

view this post on Zulip Mason Protter (Feb 05 2024 at 17:35):

The problem is that your version goes through all elements of itr, whereas mine stops once you've consumed all the indices

view this post on Zulip Mason Protter (Feb 05 2024 at 17:36):

which can be very consequential:

#+begin_src julia
using BangBang, MicroCollections

idxs = [2,3,5,7];
itr = (rand(10000000) for _  1:100);

function collectat(iter, inds)
    arr = UndefArray(size(inds)...)
    iter_next = iterate(iter)
    count = 1
    ind = firstindex(inds)
    for x  iter
        if count == inds[ind]
            arr = setindex!!(arr, x, ind)
            ind += 1
        end
        count += 1
        if ind > lastindex(inds)
            break
        end
    end
    arr
end

collectat_itr(itr, idxs) = map(last, Iterators.filter(((i, v),) -> i  idxs, enumerate(itr)))

@btime collectat($itr, $idxs)
@btime collectat_itr($itr, $idxs);
#+end_src
#+RESULTS:
:   95.072 ms (17 allocations: 610.35 MiB)
:   1.181 s (307 allocations: 7.45 GiB)

view this post on Zulip aplavin (Feb 05 2024 at 17:43):

That's not a fundamental issue – just add Iterators.takewhile for optimization! Still composes nicely out of basic building blocks:

julia> using DataPipes  # @p to make piping nice
julia> collectat_itr(itr, idxs) = @p let
    itr
    enumerate()
    Iterators.takewhile(first(_)  last(idxs))
    Iterators.filter(first(_)  idxs)
    map(last)
end

julia> @btime collectat_itr($itr, $idxs);
  72.524 ms (26 allocations: 610.35 MiB)

view this post on Zulip Mason Protter (Feb 05 2024 at 17:50):

yeah that's pretty good

view this post on Zulip Mason Protter (Feb 05 2024 at 18:10):

I think if we want to talk about decomposing into modular building blocks though, Transducers is still my favourite:

#+begin_src julia
using Transducers
SelectAt(idxs) = enumerate  TakeWhile(x -> first(x)  last(idxs))  Filter(y -> first(y)  idxs)  Map(last)

@btime collectat($itr, $idxs)
@btime collectat_itr2($itr, $idxs);
@btime itr |> SelectAt($idxs) |> collect;
@btime itr |> SelectAt($idxs) |> x -> tcollect(x ;basesize=2);
#+end_src

#+RESULTS:
:   93.922 ms (17 allocations: 610.35 MiB)
:   93.933 ms (26 allocations: 610.35 MiB)
:   94.020 ms (24 allocations: 610.35 MiB)
:   61.983 ms (222 allocations: 915.54 MiB)

view this post on Zulip Júlio Hoffimann (Feb 06 2024 at 11:18):

Thank you all for the inputs, experimenting with them here.

view this post on Zulip Júlio Hoffimann (Feb 06 2024 at 11:57):

@Mason Protter I really like the Transducers.jl solution and we already have it as a dependency. What is the symbol? Couldn't find it in the docs. Would it be a lazy version of |>?

view this post on Zulip Mason Protter (Feb 06 2024 at 12:02):

Huh it's weird it's not in the docs. Here's the docstring, I'll try and get it in the docs website too though:

help?> 
"⨟" can be typed by \bbsemi<tab>

search: 

  g  f
  opcompose(g, f)


  The opposite composition operator defined as

  g  f = f  g
  (f) = f
  (fs...) = (reverse(fs)...)

view this post on Zulip Mason Protter (Feb 06 2024 at 12:05):

Mhm, I guess it's because it's actually a re-export from CompositionsBase.jl

view this post on Zulip Júlio Hoffimann (Feb 06 2024 at 13:41):

@Mason Protter I had to adjust the definition to handle empty inds:

using Transducers

"""
    collectat(iter, inds)

Collect iterator `iter` at indices `inds` without materialization.
"""
function collectat(iter, inds)
  if isempty(inds)
    eltype(iter)[]
  else
    selectat(inds) = enumerate  TakeWhile(x -> first(x)  last(inds))  Filter(y -> first(y)  inds)  Map(last)
    iter |> selectat(inds) |> tcollect
  end
end

view this post on Zulip Mason Protter (Feb 06 2024 at 13:52):

Okay got a docstring in Transducers itself now:https://juliafolds2.github.io/Transducers.jl/dev/reference/manual/#CompositionsBase.:%E2%A8%9F

view this post on Zulip Mason Protter (Feb 06 2024 at 13:53):

Some history on the topic if you're interested: https://github.com/JuliaFolds/Transducers.jl/issues/67

view this post on Zulip Mason Protter (Feb 06 2024 at 14:13):

Júlio Hoffimann said:

Mason Protter I had to adjust the definition to handle empty inds:

You could also do

SelectAt(idxs) = (enumerate
                   TakeWhile(x -> !isempty(idxs) && first(x)  last(idxs))
                   Filter(y -> first(y)  idxs)
                   Map(last))

view this post on Zulip Júlio Hoffimann (Feb 06 2024 at 23:44):

@Mason Protter is this a good application for OhMyThreads.jl? We are considering moving some code from Transducers.jl to the new package

view this post on Zulip Mason Protter (Feb 06 2024 at 23:51):

Not really, no. OhMyThreads doesn't handle general iterators or early termination like Transducers does

view this post on Zulip Mason Protter (Feb 07 2024 at 00:02):

OhMyThreads isn't me giving up on Transducers, it was more just that I knew that the design of OhMyThreads can cover a big chunk of use-cases with very little code, and be relatively low maintenance and easy for people to pick up and contribute to


Last updated: Nov 06 2024 at 04:40 UTC