Stream: helpdesk (published)

Topic: `mapreduce(f, op, v)` versus `reduce(op, map(f, v))`


view this post on Zulip Júlio Hoffimann (Nov 13 2024 at 16:59):

In some applications, we noticed that reduce(op, map(f, v)) is faster, specially when v items have a complicated parametric type.

Is there a clear rule to choose between the two functions? When is one expected to outperform the other?

view this post on Zulip Expanding Man (Nov 13 2024 at 17:10):

I don't know for sure, but I suspect you already nailed it. mapreduce will try to do a loop over s = op(s, f(v)) and unless f(v) is something relatively simple it'll give up and type s as Any. When you do reduce(op, map(f, v)), it's essentially forcing the type of s to be op(::T, ::T) where T = eltype(map(f, v)). The compiler doesn't guarantee that the type inference is equivalent in these cases.

view this post on Zulip Júlio Hoffimann (Nov 13 2024 at 17:12):

Is there a helper function in Base that can be used to tell when the compiler will give up given f and v?

view this post on Zulip Júlio Hoffimann (Nov 13 2024 at 17:13):

We have some algorithms to write, and it would be nice to know a priori when to use mapreduce or reduce.

view this post on Zulip Expanding Man (Nov 13 2024 at 17:14):

There's undoubtedly a way to do it with internals, but you should not do that in package code. The best thing to do would be to assume no and to work around it.

view this post on Zulip Júlio Hoffimann (Nov 13 2024 at 17:15):

We wouldn't use it in package code. Just curious if there is tooling to help decide between the two during development.

view this post on Zulip Expanding Man (Nov 13 2024 at 17:16):

There is a bit of general advice I can give that I'm not sure applies to you, but which I suspect may having seen some of what's going on in Meshes.jl: Julia really, really, really does not like containers (or iterators) of mixed eltype, even if those types look to you like they are very similar (i.e. differ only by a single parameter or something). Yes, you can get miraculously good performance out of containers with small union types, but for one, the compiler does not narrow the containers to those types for you, you'd have to do it yourself, and for another, this holds fragile promise and can break type inference in more complicated code down the road.

view this post on Zulip Júlio Hoffimann (Nov 13 2024 at 17:17):

Would be nice if VSCode could give hints to developers about mapreduce for example.

view this post on Zulip Júlio Hoffimann (Nov 13 2024 at 17:18):

The code in Meshes.jl assumes heterogeneous containers in general, and when the containers are homogeneous of the same geometry type, things run with the top performance, and inference works. We do a few checks here and there in the test suite for that.

view this post on Zulip Expanding Man (Nov 13 2024 at 17:19):

Yeah I suspected that. You may want to seriously consider wrapper types that hold different objects in separate containers internally, otherwise it's pretty much a guarantee that you will repeatedly slam into type inference issues. For some things you can rely on small union types, but again, this is fragile, and if you hand over an object that's typed that way to a "user" it's easy for them to break it unless they understand very clearly what is happening.

view this post on Zulip Júlio Hoffimann (Nov 13 2024 at 17:20):

As an example, the Grid subtypes are heavily optimized because the grid elements are always of the same type. Same for SimpleMesh of Triangle elements. In the case of a GeometrySet, there is nothing we can do. It can store heterogeneous containers. It still allows homogeneous containers though, as is the case with PointSet.

view this post on Zulip Júlio Hoffimann (Nov 13 2024 at 17:21):

Expanding Man said:

Yeah I suspected that. You may want to seriously consider wrapper types that hold different objects in separate containers internally, otherwise it's pretty much a guarantee that you will repeatedly slam into type inference issues. For some things you can rely on small union types, but again, this is fragile, and if you hand over an object that's typed that way to a "user" it's easy for them to break it unless they understand very clearly what is happening.

We have evidence that everything is working pretty well with rich parametric types. Only in a few places that we need to circumvent Julia's inference problems.

view this post on Zulip Expanding Man (Nov 13 2024 at 17:23):

Again, the problem is that it's fragile. It can work great until it doesn't, and knowing where type inference is going to start completely giving up is really hard. This is an issue with Julia's type system in general that I think should be better warned about in the documentation. You can be really impressed with how it's doing with apparently dynamic types and then all of a sudden you hit a performance wall when everything in a complicated block of code starts showing up as Any.

view this post on Zulip Júlio Hoffimann (Nov 13 2024 at 17:24):

I think that is part of the game. As a Julia developer you need to handle these. I agree it should be documented better in the official docs.

view this post on Zulip Expanding Man (Nov 13 2024 at 17:25):

I'm not talking about manifestly type stable code that just happens to have really complicated type signatures, that should be fine (sometimes it isn't, but it really should be), but stuff like containers without a single concrete type can start to get real ugly.

view this post on Zulip Júlio Hoffimann (Nov 13 2024 at 17:31):

Most container types in Meshes.jl have a concrete eltype. The GeometrySet is the one with greatest potential for non-concrete eltype. If you find issues with the package in practical applications, please feel free to report. We will be happy to discuss improvements.

view this post on Zulip Júlio Hoffimann (Nov 13 2024 at 17:32):

Now, coming back to the original issue. I wish VSCode could warn about mapreduce in some cases before running the function that uses it.

view this post on Zulip Júlio Hoffimann (Nov 13 2024 at 17:33):

If we know the type of the arguments of the function, then we can know the intermediate type of expressions inside the function. At that point, VSCode could say: "this mapreduce will not infer the type properly...".

view this post on Zulip Expanding Man (Nov 13 2024 at 17:35):

There are tools like Cthulhu.jl and @code_warntype. As things stand right now code has to get compiled (or lowered, or whatever @code_warntype does exactly) for you to know what's going to happen with type inference.

view this post on Zulip Júlio Hoffimann (Nov 13 2024 at 17:36):

Yes, I use @code_warntype in these situations, but sometimes we don't have time to launch a REPL and test these details.

view this post on Zulip Daniel VandenHeuvel (Nov 13 2024 at 20:48):

VSCode can show type hints. I've got it disabled so maybe I'm not up to date with how accurate it is. I don't think VSCode would really be suggesting things like that not using mapreduce directly though.

view this post on Zulip John Lapeyre (Nov 13 2024 at 23:25):

Sort of rephrasing and filling in: I think one thing that hasn't changed since I last used Julia frequently is that for Vector{T} with abstract T the compiler can't (does not) make any assumptions about the elements. So any dispatch, say in addition in sum happens at runtime. A Union is not concrete, but is not abstract. If T is a small union, then sum can be much faster.

In general, this is a good application for sum types (tagged unions). I don't know if it would work for you. In discussing this once on slack, someone showed me that a plain old Union is just as good in many cases. This is easy to verify with simple examples.

view this post on Zulip jar (Nov 14 2024 at 01:06):

Aside from the type inference issue, this seems like an instance of the general fusion/fission question: sometimes you want map(f, map(g, xs)) and sometimes you want map(f∘g, xs).

view this post on Zulip Kyle Beggs (Nov 14 2024 at 01:59):

@jar I'm confused, where is the reduce?

view this post on Zulip jar (Nov 14 2024 at 03:08):

I'm just talking about doing two passes vs one pass over the data


Last updated: Nov 22 2024 at 04:41 UTC