Suppose we are given an iterator iter
and a list of linear indices inds
. How to collect iter
at inds
? Is there a helper function in Base for that?
I wrote the following helper function assuming that inds
is indexable and sorted:
function collectat(iter, inds)
ind = inds[begin]
itr = Iterators.drop(iter, ind - 1)
vec = [first(itr)]
for new in inds[(begin + 1):end]
itr = Iterators.drop(itr, new - ind)
push!(vec, first(itr))
ind = new
end
vec
end
Do you have improvement suggestions?
The function above is super slow though with large collections. Any workaround?
The underlying issue is that I have a large collection that requires more memory than available RAM. However, I only need to materialize at specific indices efficiently.
Another solution is:
function collectat(iter, inds)
vec = Vector{eltype(iter)}()
set = Set(inds)
for (i, x) in enumerate(iter)
if i in set
push!(vec, x)
end
end
vec
end
Does inds
have a known size?
Mason Protter said:
Does
inds
have a known size?
yes, it does
My first intinct is to write something like
using BangBang: setindex!!
using MicroCollections: UndefArray
# easy case where we can just take a view of `A`
collectat(A::AbstractArray, inds::AbstractArray) = collect(view(A, inds))
# Harder case where `iter` is some general iterator
function collectat(iter, inds)
arr = UndefArray(size(inds)...)
iter_next = iterate(iter)
inds_ind = firstindex(inds)
count = 1
while !isnothing(iter_next)
x, state = iter_next
if count == inds[inds_ind]
arr = setindex!!(arr, x, inds_ind)
inds_ind += 1
end
count += 1
end
arr
end
Thank you @Mason Protter , will take a look at it as soon as possible. Running some experiments at the moment to see if this bottleneck is solved.
sorry I made some typos above
Needs a few fixes
Thanks for the help. I am investigating the runs so far and will look into it more carefully soon.
Hm, yeah the things I've been trying are all pretty slow.
This does seem to work:
function collectat(iter, inds)
arr = Array{eltype(iter)}(undef, size(inds)...)
iter_next = iterate(iter)
count = 1
ind = firstindex(inds)
for x ∈ iter
if count == inds[ind]
arr = setindex!!(arr, x, ind)
ind += 1
end
count += 1
if ind > lastindex(inds)
break
end
end
arr
end
but the perf is not good
I think maybe one should just write a custom iterator?
i.e. something like
struct IterView{T, Iter, Inds}
iter::Iter
inds::Inds
end
IterView(itr::A, inds::I) where {A, I} = IterView{eltype(A), A, I}(itr, inds)
and then a custom iterate
method
but it'd be annoying to write and I'm not sure it'd be faster
I did a workaround in the script to avoid the construction of large iterators, but will continue to push the limits here, and come back to the collectat soon.
Won't something like map(last, Iterators.filter(((i, v),) -> i in ind, enumerate(iter)))
work? Basically, a one-line version of your second solution @Júlio Hoffimann.
that will still materialize each item right? just not all at the same time
julia> foo() = rand(10000000);
julia> idxs = [2,3,5,7];
julia> itr = (foo() for _ ∈ 1:100);
julia> @allocated collectat(itr, idxs)
641436824
julia> @allocated map(last, Iterators.filter(((i, v),) -> i ∈ idxs, enumerate(itr)))
8003540800
Of course, iterators are fundamentally sequential, so the only way to get to the Nth element is to iterate over the first N-1. For random access, see arrays :)
The problem is that your version goes through all elements of itr
, whereas mine stops once you've consumed all the indices
which can be very consequential:
#+begin_src julia
using BangBang, MicroCollections
idxs = [2,3,5,7];
itr = (rand(10000000) for _ ∈ 1:100);
function collectat(iter, inds)
arr = UndefArray(size(inds)...)
iter_next = iterate(iter)
count = 1
ind = firstindex(inds)
for x ∈ iter
if count == inds[ind]
arr = setindex!!(arr, x, ind)
ind += 1
end
count += 1
if ind > lastindex(inds)
break
end
end
arr
end
collectat_itr(itr, idxs) = map(last, Iterators.filter(((i, v),) -> i ∈ idxs, enumerate(itr)))
@btime collectat($itr, $idxs)
@btime collectat_itr($itr, $idxs);
#+end_src
#+RESULTS:
: 95.072 ms (17 allocations: 610.35 MiB)
: 1.181 s (307 allocations: 7.45 GiB)
That's not a fundamental issue – just add Iterators.takewhile
for optimization! Still composes nicely out of basic building blocks:
julia> using DataPipes # @p to make piping nice
julia> collectat_itr(itr, idxs) = @p let
itr
enumerate()
Iterators.takewhile(first(_) ≤ last(idxs))
Iterators.filter(first(_) ∈ idxs)
map(last)
end
julia> @btime collectat_itr($itr, $idxs);
72.524 ms (26 allocations: 610.35 MiB)
yeah that's pretty good
I think if we want to talk about decomposing into modular building blocks though, Transducers is still my favourite:
#+begin_src julia
using Transducers
SelectAt(idxs) = enumerate ⨟ TakeWhile(x -> first(x) ≤ last(idxs)) ⨟ Filter(y -> first(y) ∈ idxs) ⨟ Map(last)
@btime collectat($itr, $idxs)
@btime collectat_itr2($itr, $idxs);
@btime itr |> SelectAt($idxs) |> collect;
@btime itr |> SelectAt($idxs) |> x -> tcollect(x ;basesize=2);
#+end_src
#+RESULTS:
: 93.922 ms (17 allocations: 610.35 MiB)
: 93.933 ms (26 allocations: 610.35 MiB)
: 94.020 ms (24 allocations: 610.35 MiB)
: 61.983 ms (222 allocations: 915.54 MiB)
Thank you all for the inputs, experimenting with them here.
@Mason Protter I really like the Transducers.jl solution and we already have it as a dependency. What is the ⨟
symbol? Couldn't find it in the docs. Would it be a lazy version of |>
?
Huh it's weird it's not in the docs. Here's the docstring, I'll try and get it in the docs website too though:
help?> ⨟
"⨟" can be typed by \bbsemi<tab>
search: ⨟
g ⨟ f
opcompose(g, f)
The opposite composition operator defined as
g ⨟ f = f ∘ g
⨟(f) = f
⨟(fs...) = ∘(reverse(fs)...)
Mhm, I guess it's because it's actually a re-export from CompositionsBase.jl
@Mason Protter I had to adjust the definition to handle empty inds
:
using Transducers
"""
collectat(iter, inds)
Collect iterator `iter` at indices `inds` without materialization.
"""
function collectat(iter, inds)
if isempty(inds)
eltype(iter)[]
else
selectat(inds) = enumerate ⨟ TakeWhile(x -> first(x) ≤ last(inds)) ⨟ Filter(y -> first(y) ∈ inds) ⨟ Map(last)
iter |> selectat(inds) |> tcollect
end
end
Okay got a docstring in Transducers itself now:https://juliafolds2.github.io/Transducers.jl/dev/reference/manual/#CompositionsBase.:%E2%A8%9F
Some history on the topic if you're interested: https://github.com/JuliaFolds/Transducers.jl/issues/67
Júlio Hoffimann said:
Mason Protter I had to adjust the definition to handle empty
inds
:
You could also do
SelectAt(idxs) = (enumerate
⨟ TakeWhile(x -> !isempty(idxs) && first(x) ≤ last(idxs))
⨟ Filter(y -> first(y) ∈ idxs)
⨟ Map(last))
@Mason Protter is this a good application for OhMyThreads.jl? We are considering moving some code from Transducers.jl to the new package
Not really, no. OhMyThreads doesn't handle general iterators or early termination like Transducers does
OhMyThreads isn't me giving up on Transducers, it was more just that I knew that the design of OhMyThreads can cover a big chunk of use-cases with very little code, and be relatively low maintenance and easy for people to pick up and contribute to
Last updated: Nov 06 2024 at 04:40 UTC