I'm trying to find some resources on how to work on improving the parallel processing of an algorithm, but I don't know what terms I should be searching with. The specific algorithm I'm looking at is Needleman-Wunch, a classic in bioinformatics and an example of "dynamic programming" (though I don't actually know what that term means).
Basically, given an n x m
matrix, one can calculate the value of position (i, j)
, as long as [(i-1, j-1), (i-1, j), (i, j-1)]
are known (the first row and first column can all be calculated from the start).
I found this paper, but it seems to be doing some fancy tricks to optimize, I'm looking to just sort of queue up tasks as they become available, if that makes any sense. Or maybe this is just an incredibly naive way to think about it...
Dynamic programming is the main pillar of modern mathematical economics. There are at least 4 or 5 econ Nobel prize winners that I can think of whose work contributed to the foundations of dynamic programming (sadly Richard Bellman, the "father" of dynamic programming, did not get the prize). Thomas Sargent, of Julia and Nobel fame, has good intro chapters in most of his books, and there will be Julia-based lectures over at Quant Econ: https://julia.quantecon.org/dynamic_programming/index_grad.html
Being an economist I was surprised that I'd never heard of Needleman-Wunch, not even by name. I got curious and found these introductory pages: It seems that naive approaches run into memory issues pretty quickly. Sorry I can't help at all, but I'd love to learn more about this.
Around page 30 or so:
Maybe just put everything in a channel and use Worker Pool approach?
https://juliafolds.github.io/data-parallelism/tutorials/concurrency-patterns/
You probable should lock channel on read/write operations, which can be bottleneck, but other then that it should work smoothly.
You sure you need Needleman-Wunch and not Smith-Waterman? Smith-Waterman has been ridiculously optimized. There are several shortcuts to make in the algorithm, and there exists SIMD variants. Ufortunately I don't know _too_ much about these optimizations themselves, but some Googling might reveal some nice sources
see: https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/1471-2105-12-221.pdf
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0082138
I agree with others that it might be better to look into existing implementation strategies. Maybe there are some domain specific optimizations you can do.
But, from just the description of the OP, it looks like what you want has this access pattern?
# "left top" means (1, 1) and "right bottom" means (n, m)
for i in left_top_to_right_bottom # serialized
for j in left_bottom_to_right_top # parallelizable
xs[i, j] = compute(xs[i-1, j-1], xs[i-1, j], xs[i, j-1])
end
end
If so, it looks like you can "just" parallelize the inner loop? You can use @threads
or @floop
to do it.
Also, for this kind of "parallel loop(s) inside sequential loop" pattern, https://github.com/tkf/Barriers.jl may be an interesting tool to use. Though it's a bit lower-level than simple parallel for
loops.
You can't parallelize the inner loop, right? It is dependent on the order of j.
However, if you iterate "diagonally", you can parallelize the inner loop.
ah, yeah, iterating diagonally was what I meant by left top to right bottom etc. but using i
and j
there was incorrect
Cool, thanks for the responses everyone, sorry it took me a while to get back to it!
@Patrick Toche Very interesting, thanks. I still don't know what dynamic programming means, but sounds like it has a rich history! :laughter_tears: . I will get to your reading list, thanks! Happy to tell you more about the bioinformatics part if you like.
@Jakob Nybo Nissen I am, because it's simpler than Smith-Waterman and I'm doing it as a learning exercise (for myself and my students). Ultimately I'm interested in figuring out this paper, which is supposedly based on NW, and possibly trying to write it in julia - the tool is supposed to be amazing, but basically only the grad student who wrote it knows how to use it, and he jumped to industry. The impression he gave (though to be frank it was a bit tough to understand him) was that those optimizations and shortcuts detract from the "provably optimal" nature of NW... whether that's actually significant, I don't really know.
@Takafumi Arakaki (tkf) Cool - thanks! If I understand what you mean by iterating diagonally, I think that's right, but I'll need to spend more time wrapping my head around it.
Always interested to read about this! Had a quick look at the BURST github page: looks like quite a bit of work went into it!
Yeah, I met the guy that wrote it - he's pretty clearly a genius. But one of those whose genius does not extend to explaining themselves well, nor apparently to documenting a tool that's meant to be used by others. :shrug:
I'm pretty sure his PI doesn't even understand what's going on there...
https://github.com/knights-lab/BURST looks reasonably short at least.. I wonder how direct of a C-to-Julia port you could get away with? Julia written in C-like style is usually not too bad performance wise.. but I guess the pointer stuff including all the ->
s would be a bit tricky to naively translate and still get the same performance as C given that there apparently isn't quite a proper "dereference" in Julia since according to https://giordano.github.io/blog/2019-05-03-julia-get-pointer-value/ even unsafe_load
makes a copy
…but the pragma omp stuff could be pretty trivially replaced with the built-in multithreading, and the preprocessor macro stuff seems like it ought to be no trouble by Julia metaprogramming standards?
atomic increment of an element in an array like this is a bit tricky ATM (it's doable but you'd need to write LLVM IR by hand)
CBinding.jl provides a decent solution for C's ->
style dereferencing in Julia without making copies.
Last updated: Dec 28 2024 at 04:38 UTC