For my code I often have a large number of tensors stored in RAM, but they typically aren't all required at the same time. Recalculating them whenever I need them would be extremely wasteful, so I was thinking that I could maybe create one big file and memory map parts of that file as backing memory for the tensors. When I don't need them, the operating system can then write things out to disk and keep my memory usage relatively low.
The idea would be to create some kind of memory manager that simply tracks which parts of a backing file are currently memory mapped, and I can request new memory-mapped arrays from it. Bookkeeping of what is freed up can then be done using finalizers.
My questions are:
is this a good idea, or can I get better performance by manually reading/writing things to disk as needed instead of relying on the OS to do this?
does something like this already exist?
memory mapping already does that
The smallest amount of memory your OS manages for you is called a page. When you memory map a file, the kernel more or less goes "I have a file of size X, so I need ceil(X/PAGE_SIZE) pages to store that. I won't fill them all, but when someone asks for memory in that region, I'll fill them lazily by loading from disk."
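(As an illustration, a minimal sketch of mapping a file in Julia with the Mmap stdlib; the file name is made up:)

```julia
using Mmap

# map a 1 MiB region of a (hypothetical) backing file as a byte vector;
# the file is grown to the requested size, and pages are only read from
# disk when the corresponding indices are actually touched
io = open("backing.bin", "w+")
buf = Mmap.mmap(io, Vector{UInt8}, 1 << 20)
buf[1] = 0x01        # touching an index faults that page in
Mmap.sync!(buf)      # ask the OS to flush dirty pages back to the file
close(io)
```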
right, but I would need some way of doing this for many different tensors, so unless I use a new file per tensor (which could be a ridiculous number of small files), I would still have to write some kind of "memory manager" which tracks which part of a file can be assigned again, right?
memory mapping appears to do exactly what I want, and I want to do this for many small objects
what do you mean with "assigned again"?
memory mapping is a bidirectional operation
the kernel already flushes old pages you haven't accessed in a while to disk
memory mapping is a generic operation for the exact same thing - if you'd want your version to be efficient, you'd have to work on the page level as well, to keep things fast
if the garbage collector frees up an array that was assigned to, say, bytes 10-20 of my file on disk, and my program wants to create a new tensor object that again requires 10 bytes, then I probably want to memory map bytes 10-20 of the file again, instead of growing it.
that's a completely orthogonal thing
if you want to work with memory mapping, you'll have to work with arrays (which are then memory mapped on disk)
you wouldn't necessarily create new tensor objects, you'd mutate the array, either by assigning an existing index or mutating in place
I'm probably not explaining myself very well.
let's say I have tensors (t1,t2,t3,t4), and I want to store them on disk and only load them in lazily. Now I can either create a separate file for every object "t1.bin", "t2.bin", "t3.bin", ..., or I can create one big "db.bin", and memory map regions of that file to the respective arrays.
Now let's say I don't really need tensor t1 anymore, ever. The garbage collector cleans it up. I now do a new operation, which creates a new tensor, and this tensor is exactly as big as tensor "t1". I can now once again create a new file "t5.bin", or if I work with one big "db.bin", I can memory map a new region from "db.bin" for my new tensor.
This essentially means that "db.bin" would keep on growing and growing, despite large parts of that file not being needed anymore. I could probably do better and keep track of what can be re-used, and whenever I need to create a new tensor, I could simply memory map a part of the file that is no longer in use.
The alternative is to work with many small files, but a finalizer is not allowed to do IO, so I cannot automatically delete those files and clean things up.
I understand what you're saying. What I'm saying is that memory mapping things that aren't page-sized is wasteful, as a page is the smallest unit of memory the kernel can load/handle. So you need to have one big memory mapped file which you use as backing storage (non-ideal, as you'd have to more or less write your own wrapper allocator around that and can't use existing interfaces to allocate it)
what you're describing is more or less replacing/amending the existing GC, which is really not well supported in julia
So what would you do, to provide some kind of lazy disk storage?
And also, what about it isn't well supported in julia? The implementation of mmap seems relatively simple, and I appear to be able to use pointer_to_array to use parts of my memory mapped region as arrays?
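(For illustration, roughly what that looks like with unsafe_wrap, the current spelling of pointer_to_array; the file name is made up:)

```julia
using Mmap

io = open("db.bin", "w+")
buf = Mmap.mmap(io, Vector{UInt8}, 4096)   # one page of backing storage
close(io)                                  # the mapping outlives the stream

# view the start of the mapped region as a 10-element ComplexF64 array;
# the wrapper aliases the mapped pages, it does not own them, so `buf`
# has to stay referenced for as long as `t` is in use
t = unsafe_wrap(Array, Ptr{ComplexF64}(pointer(buf)), (10,))
t .= 0                                     # writes land in the mapped pages
```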
I would first try to see if you can't save memory in other places first. The GC already reuses memory you can't reach anymore for new allocations - if you have lots of large, long lived allocations you don't actually need anymore (which is what would consume more memory) that to me suggests there are places in your code that can be made more efficient first.
on the other hand, are you certain that recomputing the tensors is too expensive?
what I was referring to with it not being well supported is that you can't e.g. just naively take the existing API, which may allocate regular GC memory instead of your special wrapped memory
so you'll have to modify your tensors and the operations on them to be aware of the Mmapped memory, if you need to allocate intermediaries
yes, though I could probably shave off a bit of my RAM usage. It's a standard algorithm in physics called "DMRG", and to reach state of the art people are now really storing every individual tensor on disk and loading them in when necessary. It is much cheaper to read them back from disk than it is to recalculate the necessary tensors
I would probably start out with writing a "send2disk" function, and copy regular GC-allocated memory over to a memory mapped region. It's an extra copy, but really simple to implement. I can go even further and directly allocate memory mapped arrays in a next step, but I'm not too sure if that'll even be necessary
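(A minimal sketch of such a send2disk, assuming one hypothetical file per tensor for simplicity:)

```julia
using Mmap

# copy a GC-allocated array into a freshly memory-mapped file, so the OS
# can page the data out when it isn't being touched
function send2disk(path::AbstractString, a::Array{T,N}) where {T,N}
    io = open(path, "w+")
    m = Mmap.mmap(io, Array{T,N}, size(a))  # file is grown to fit
    copyto!(m, a)                           # the extra copy mentioned above
    close(io)                               # the mapping stays valid
    return m
end

# usage: t1 = send2disk("t1.bin", t1)  # the old in-RAM copy becomes garbage
```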
if your algorithm does all that explicitly anyway, I'd just serialize the tensors at points where you know you don't need them anymore
I was hoping that I could offload that logic (serialize old tensors, keep recently used tensors still in ram) to the OS, precisely by using mmap
you can't, not easily at least
the OS has no concept of your tensors, it only knows blobs of memory
GC knows about your objects, but it doesn't even try to use disk as potential backing storage for them
you could try to increase your SWAP memory and just let GC do its thing
either way, is your concern theoretical or are you running into actual OOM errors?
At the moment I don't do any cache2disk things, and I run into OOM errors
have you profiled/minimized allocations already?
yes, the dominant memory usage comes from things that other people store on disk, and that I should also store on disk and only load in when necessary
how are those other people solving that problem?
are they using a custom allocator approach? This would be equivalent to subverting GC in julia.
No, they manually store and load things from disk, and they hardcode this logic in the algorithm itself.
then I'd follow that approach, as anything you could cook up in julia would look extremely similar, at least without replacing GC entirely
But this is ugly and cumbersome to deal with, so a first step would be to create an array type that only keeps a few elements in RAM, and serializes everything else to disk. That way, I don't have to deal with this bookkeeping in the core algorithm logic itself (where it imho doesn't belong)
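(Very roughly, the kind of wrapper type meant here, with made-up names and a deliberately trivial load/evict strategy:)

```julia
# an array type that lives on disk and is only materialized in RAM on access
mutable struct DiskBackedTensor{T,N} <: AbstractArray{T,N}
    path::String
    dims::NTuple{N,Int}
    data::Union{Array{T,N},Nothing}   # nothing while evicted to disk
end

Base.size(t::DiskBackedTensor) = t.dims

function materialize!(t::DiskBackedTensor{T,N}) where {T,N}
    if t.data === nothing
        t.data = open(io -> read!(io, Array{T,N}(undef, t.dims)), t.path)
    end
    return t.data
end

Base.getindex(t::DiskBackedTensor, I...) = materialize!(t)[I...]

function evict!(t::DiskBackedTensor)
    t.data === nothing && return t
    open(io -> write(io, t.data), t.path, "w")
    t.data = nothing
    return t
end
```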
if you already have those points of serialization, you can definitely Mmap the individual backing arrays though
yeah it doesn't belong there, but custom allocation schemes are not supported in julia
they don't play nice with GC without extreme care
But for example, something like this:
https://gist.github.com/maartenvd/a265cbcbb3eb5cb3e9915861ff158569
wouldn't something like this work just fine?
it's not very pretty, and there are probably much nicer strategies than using findfirst
to speed things up, I can be more intelligent about aligning larger tensors to the start of a page
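(Not the actual gist code, but a rough sketch of the idea it implements: a single mapped file plus a list of (offset, size) regions in use; thread safety deliberately ignored here:)

```julia
using Mmap

mutable struct DiskManager
    buf::Vector{UInt8}                # the memory-mapped backing file
    offsets::Vector{Tuple{Int,Int}}   # (byte offset, byte length) of live regions
end

DiskManager(path::AbstractString, nbytes::Integer) =
    DiskManager(Mmap.mmap(open(path, "w+"), Vector{UInt8}, nbytes), Tuple{Int,Int}[])

function allocate!(dm::DiskManager, ::Type{Array{T,N}}, dims) where {T,N}
    nbytes = prod(dims) * sizeof(T)
    # naive first-fit search for a gap between live regions
    # (page alignment ignored for brevity)
    offset = 0
    for (o, l) in sort(dm.offsets)
        offset + nbytes <= o && break
        offset = o + l
    end
    offset + nbytes <= length(dm.buf) || error("backing file is full")
    push!(dm.offsets, (offset, nbytes))
    arr = unsafe_wrap(Array, Ptr{T}(pointer(dm.buf, offset + 1)), dims)
    # when the wrapper array is collected, mark its region as reusable
    finalizer(_ -> deleteat!(dm.offsets, findfirst(==((offset, nbytes)), dm.offsets)), arr)
    return arr
end
```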
yes it would work, but you'd have to take care not to accidentally get an array that isn't managed by your allocator
for that to work in general, your tensor library would probably need to be aware of that
I'm not sure how growing a tensor vector would work or if that's required, but it would probably break under that scheme
you can probably also run into some other race conditions, if the finalizers run late
they're not guaranteed to run eagerly, after all
yes, the tensor library never "grows" the data, which in any case wouldn't be a problem, as it would simply result in a tensor that lives entirely in RAM - which is the current status anyway.
I thought briefly about race conditions, but I don't think there's a real problem. The only way the memory manager finalizer can be called is if the arrays that were created through it are also all going to be garbage collected (as their finalizers hold a reference to the memory manager), so that doesn't appear to be a problem (though I don't really know a lot about that).
although now that I think about it, push!/append! would indeed kinda be a problem, as some memory in the memory mapped file would not be freed, despite not being used either.
ah but push!/append! doesn't work. The array from unsafe_wrap is "shared", so I will get errors in that case :)
mn2 = DiskManager("derp3.bin",1024*1024);
arr = allocate!(mn2,Array{ComplexF64,1},(10,));
append!(arr,[1.0])
errors
In trying to further implement this diskmanager thing, I ran into the subtlety of finalizers and thread safety. My diskmanager has a lock, but the finalizers of the arrays I allocated also need to acquire this lock in order to signal that their data can be freed. This in turn could lead to a recursive lock deadlock.
The documentation recommends
A related third strategy is to use a yield-free queue. We don't currently have a lock-free queue implemented in Base, but Base.InvasiveLinkedListSynchronized{T} is suitable.
In what way is this type "suitable"? Is it simply a thread safe list, or is it special, in the sense that during execution it never yields to finalizers?
both
"yielding to a finalizer" doesn't really make sense, as finalizers are not tasks. they can't be yielded to. What the comment is referring to is that you really don't want to be interrupted by GC when you do such work (as you have no idea when you will be scheduled again), so you need a lock-free, non-yielding queue
GC is allowed to run at yieldpoints, so yielding is a nogo
and InvasiveLinkedListSynchronized is written in such a way that, during operations on this type, it won't be interrupted by the GC?
according to that comment, yes, but I don't recall where that comment is from, so I can't say much more
safe use of finalizers, the julia docs
aah, your context was running this as part of a finalizer, yes?
yes
Yes, the problem is that finalizers are not allowed to yield to other tasks, as that can potentially leave an object half-cleaned up
btw, which version of julia are you running? those docs you linked are neither official nor the most recent version
ah I'm on 1.8 rc4, and google simply happened to drop me there instead of the official docs
it's still there for 1.8 https://docs.julialang.org/en/v1/manual/multi-threading/#Safe-use-of-Finalizers
grr, google
yeah, the point is that old versions may have a different wording
or the newer version made some relevant clarification
let me quickly sketch the situation:
I have a DiskManager, which essentially has a large byte array and a list of memory regions in use.
If I allocate an array, I want to attach a finalizer to this array that tells the underlying diskmanager that this memory is no longer in use.
I want this to be thread safe, so the DiskManager needs to acquire a lock during allocation, and the finalizers of the arrays also need to acquire the same lock when "freeing up" the data. Using lock() in the finalizer would result in deadlocks.
The documentation recommends another approach as well - the delayed finalizers - but this appears to segfault if I use an anonymous function as finalizer:
function A_finalizer(x)
    if trylock(dm)
        # dm is the DiskManager; i is the offset associated with x (from the surrounding code)
        hit = findfirst(o -> o[1] == i, dm.offsets)
        deleteat!(dm.offsets, hit)
        unlock(dm)
    else
        # couldn't get the lock; delay the finalizer by re-registering it
        finalizer(A_finalizer, x)
        nothing
    end
end
but if InvasiveLinkedList operations are indeed never interrupted by the garbage collector, then I can simply use that
going through the code of this list type, I don't quite see what makes it special, i.e. why it won't be interrupted by the GC
I think they're just referring to it being safe to be interrupted at any time, as the critical region is atomic?
not too familiar with its internals though
remember, you can be interrupted by other finalizers too
I don't know how that type would be used - presumably you'd have to attach it to custom structs as an inline field, as it's an InvasiveLinkedList
when an array is finalized, I would simply push that information to an InvasiveLinkedListSynchronized in the DiskManager. Then, during allocate!, I can acquire the DiskManager lock, and process all this "region X is freed up" information in the InvasiveLinkedListSynchronized
that's not how invasive linked lists work though
the whole point of them is to not interact with GC, i.e. not having to allocate new memory
"pushing" to an array potentially needs to switch to the GC, since it may have to move an array around
invasive linked lists work by already being part of the object that's going to be part of the list - all "pushing" an element to that list does is let its parent know where its child is (and vice versa, if it's a doubly linked list)
that's why they're safe to use for such applications, all the memory that's required for managing them was already allocated as part of the object itself
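(A minimal sketch of that idea, with hypothetical types: the link field lives inside the element itself, so putting it on a list never allocates:)

```julia
# the object that will go on the free list carries its own link field
mutable struct FreedRegion
    offset::Int
    len::Int
    next::Union{FreedRegion,Nothing}   # the intrusive link
end

mutable struct FreeList
    head::Union{FreedRegion,Nothing}
end

# "pushing" just rewires references that already exist; no allocation,
# no chance of triggering GC
function push_free!(l::FreeList, r::FreedRegion)
    r.next = l.head
    l.head = r
    return l
end
```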
ah ok, I had no clue what invasive linked lists actually were
I could simply disable finalizers temporarily a la https://github.com/JuliaGPU/CuArrays.jl/commit/2258a24a6c1927a9acac6ca19ea604dc633e2028
I would also have to disable finalizers while running the finalizer itself, to prevent different finalizers from interrupting each other. Is that even allowed?
I don't know, but it doesn't sound good
I'd check whether they're using that macro in finalizers themselves
all they're doing is making sure that their critical portion is not interrupted by finalizers
aha = https://github.com/JuliaLang/julia/pull/38487
no finalizers are run if any lock is acquired!
correction - this is only true if I use ReentrantLock
Calling 'lock' will also inhibit running of finalizers on that thread until the corresponding 'unlock'.
so it finally works with surprisingly reasonable performance, given how stupid my allocator implementation is. One annoying caveat is that I rely on the julia GC to free up memory; but the GC has no idea how much memory I still have available in my file, and whether I'm close to running out. A very similar issue to what the CUDA people appear to have