Stream: helpdesk (published)

Topic: memory mapped cache


view this post on Zulip Maarten (Aug 16 2022 at 10:23):

For my code I often have a large number of tensors stored in RAM, but they typically aren't all required at the same time. Recalculating them whenever I need them would be extremely wasteful, so I was thinking that I could maybe create one big file and memory map parts of that file as memory for the tensors. When I don't need them, the operating system can then write things out to disk and keep my memory usage relatively low.

The idea would be to create some kind of memory manager that simply tracks which parts of some backend file are currently being memory mapped, and I can request new memory mapped arrays from it. Bookkeeping to see what is freed up can then be done using finalizers.

My questions are:

view this post on Zulip Sukera (Aug 16 2022 at 10:25):

memory mapping already does that

view this post on Zulip Sukera (Aug 16 2022 at 10:27):

The smallest amount of memory your OS manages for you is called a Page. When you memory map a file, the kernel more or less goes "I have a file of size X, so I need ceil(X/SIZE_PAGE) pages to store that. I won't fill them all, but when someone asks for memory in that region, I'll fill them lazily by loading from disk"
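
In Julia, that lazy paging is exactly what the Mmap standard library hands you; a minimal sketch (the file name is illustrative):

    using Mmap

    # Map a (possibly huge) file as an ordinary Julia array. Nothing is read
    # up front; the kernel faults pages in on first access and can evict
    # clean pages again under memory pressure.
    io = open("tensors.bin", "w+")                 # illustrative file name
    A = Mmap.mmap(io, Vector{Float64}, 1_000_000)  # file is grown as needed
    A[1] = 3.14          # touching an element pages in that part of the file
    Mmap.sync!(A)        # optionally force dirty pages back to disk
    close(io)            # the mapping stays valid after the stream is closed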

view this post on Zulip Maarten (Aug 16 2022 at 10:30):

right, but I would need some way of doing this for many different tensors, so unless I use a new file per tensor (which could mean a ridiculous number of small files), I would still have to write some kind of "memory manager" which tracks which part of a file can be assigned again, right?

view this post on Zulip Maarten (Aug 16 2022 at 10:30):

memory mapping appears to do exactly what I want, and I want to do this for many small objects

view this post on Zulip Sukera (Aug 16 2022 at 10:33):

what do you mean with "assigned again"?

view this post on Zulip Sukera (Aug 16 2022 at 10:33):

memory mapping is a bidirectional operation

view this post on Zulip Sukera (Aug 16 2022 at 10:34):

the kernel already flushes old pages you haven't accessed in a while to disk

view this post on Zulip Sukera (Aug 16 2022 at 10:34):

memory mapping is a generic operation for the exact same thing - if you'd want your version to be efficient, you'd have to work on the page level as well, to keep things fast

view this post on Zulip Maarten (Aug 16 2022 at 10:35):

if the garbage collector frees up an array that was assigned to, say, bytes 10-20 of my file on disk, and my program wants to create a new tensor object that again requires 10 bytes, then I probably want to memory map bytes 10-20 of the file again, instead of growing it.

view this post on Zulip Sukera (Aug 16 2022 at 10:36):

that's a completely orthogonal thing

view this post on Zulip Sukera (Aug 16 2022 at 10:37):

if you want to work with memory mapping, you'll have to work with arrays (which are then memory mapped on disk)

view this post on Zulip Sukera (Aug 16 2022 at 10:37):

you wouldn't necessarily create new tensor objects, you'd mutate the array, either by assigning an existing index or mutating in place

view this post on Zulip Maarten (Aug 16 2022 at 10:44):

I'm probably not explaining myself very well.

let's say I have tensors (t1,t2,t3,t4), and I want to store them on disk and only load them in lazily. Now I can either create a separate file for every object "t1.bin", "t2.bin", "t3.bin", ..., or I can create one big "db.bin", and memory map regions of that file to the respective arrays.

Now let's say I don't really need tensor t1 anymore, ever. The garbage collector cleans it up. I now do a new operation, which creates a new tensor, and this tensor is the same size as tensor "t1". I can now once again create a new file "t5.bin", or, if I work with one big "db.bin", I can memory map a new region from "db.bin" for my new tensor.

This essentially means that "db.bin" would keep on growing and growing, despite large parts of that file not being needed anymore. I could probably do better and keep track of what can be re-used, and whenever I need to create a new tensor, I could simply memory map a part of the file that is no longer in use.

The alternative is to work with many small files, but a finalizer is not allowed to do IO, so I cannot automatically delete those files and clean things up.

view this post on Zulip Sukera (Aug 16 2022 at 10:50):

I understand what you're saying. What I'm saying is that memory mapping things that aren't page size is wasteful, as that's the smallest unit of memory the kernel can load/handle. So you need to have one big memory mapped file which you use as backing storage (nonideal, as you'd have to more or less write your own wrapper allocator around that and can't use existing interfaces to allocate it)

view this post on Zulip Sukera (Aug 16 2022 at 10:51):

what you're describing is more or less replacing/amending the existing GC, which is really not well supported in julia

view this post on Zulip Maarten (Aug 16 2022 at 11:00):

So what would you do, to provide some kind of lazy disk storage?

And also, what about it isn't well supported in julia? The implementation of mmap seems relatively simple, and I appear to be able to use unsafe_wrap to use parts of my memory mapped region as arrays?

view this post on Zulip Sukera (Aug 16 2022 at 11:07):

I would first try to see if you can't save memory in other places. The GC already reuses memory you can't reach anymore for new allocations - if you have lots of large, long-lived allocations you don't actually need anymore (which is what would consume more memory), that to me suggests there are places in your code that can be made more efficient first.

view this post on Zulip Sukera (Aug 16 2022 at 11:08):

on the other hand, are you certain that recomputing the tensors is too expensive?

view this post on Zulip Sukera (Aug 16 2022 at 11:08):

what I was referring to with it not being well supported is that you can't e.g. just naively take the existing API, which may allocate regular GC memory instead of your special wrapped memory

view this post on Zulip Sukera (Aug 16 2022 at 11:09):

so you'll have to modify your tensors and the operations on them to be aware of the Mmapped memory, if you need to allocate intermediaries

view this post on Zulip Maarten (Aug 16 2022 at 11:13):

yes, though I could probably shave off a bit of my RAM usage. It's a standard algorithm in physics called "DMRG", and to reach state of the art, people really do store every individual tensor on disk and load it in when necessary. It is much cheaper to read it back from disk than it is to recalculate the necessary tensors

view this post on Zulip Maarten (Aug 16 2022 at 11:16):

I would probably start out with writing a "send2disk" function, and copy regular GC-allocated memory over to a memory mapped region. It's an extra copy, but really simple to implement. I can go even further and directly allocate memory mapped arrays as a next step, but I'm not too sure if that will even be necessary
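
A sketch of that first step (the name send2disk and the one-file-per-tensor layout are just working assumptions):

    using Mmap

    # Copy a regular GC-allocated array into a freshly mmapped file and
    # return the mapped array; the OS is then free to page it out.
    function send2disk(path::AbstractString, x::Array{T,N}) where {T,N}
        io = open(path, "w+")
        m = Mmap.mmap(io, Array{T,N}, size(x))
        copyto!(m, x)    # the extra copy mentioned above
        close(io)        # the mapping outlives the stream
        return m
    end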

view this post on Zulip Sukera (Aug 16 2022 at 11:19):

if your algorithm does all that explicitly anyway, I'd just serialize the tensors at points where you know you don't need them anymore
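
That is, something along these lines, spilling a tensor at a point where the algorithm knows it is done with it (file paths are illustrative):

    using Serialization

    serialize("t1.jls", t1)    # spill t1 once it is no longer needed
    t1 = nothing               # drop the in-memory reference so GC can reclaim it
    # ... much later ...
    t1 = deserialize("t1.jls")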

view this post on Zulip Maarten (Aug 16 2022 at 11:20):

I was hoping that I could offload that logic (serialize old tensors, keep recently used tensors still in ram) to the OS, precisely by using mmap

view this post on Zulip Sukera (Aug 16 2022 at 11:22):

you can't, not easily at least

view this post on Zulip Sukera (Aug 16 2022 at 11:22):

the OS has no concept of your tensors, it only knows blobs of memory

view this post on Zulip Sukera (Aug 16 2022 at 11:22):

GC knows about your objects, but it doesn't even try to use disk as potential backing storage for them

view this post on Zulip Sukera (Aug 16 2022 at 11:23):

you could try to increase your SWAP memory and just let GC do its thing

view this post on Zulip Sukera (Aug 16 2022 at 11:23):

either way, is your concern theoretical or are you running into actual OOM errors?

view this post on Zulip Maarten (Aug 16 2022 at 11:25):

At the moment I don't do any cache2disk things, and I run into OOM errors

view this post on Zulip Sukera (Aug 16 2022 at 11:25):

have you profiled/minimized allocations already?

view this post on Zulip Maarten (Aug 16 2022 at 11:26):

yes, the dominant memory usage comes from things that other people store on disk, and that I should also store on disk and only load in when necessary

view this post on Zulip Sukera (Aug 16 2022 at 11:26):

how are those other people solving that problem?

view this post on Zulip Sukera (Aug 16 2022 at 11:27):

are they using a custom allocator approach? This would be equivalent to subverting GC in julia.

view this post on Zulip Maarten (Aug 16 2022 at 11:27):

No, they manually store and load things from disk, and they hardcode this logic in the algorithm itself.

view this post on Zulip Sukera (Aug 16 2022 at 11:28):

then I'd follow that approach, as anything you could cook up in julia would look extremely similar, at least without replacing GC entirely

view this post on Zulip Maarten (Aug 16 2022 at 11:29):

But this is ugly and cumbersome to deal with, so a first step would be to create an array type that only keeps a few elements in RAM, and serializes everything else to disk. That way, I don't have to deal with this bookkeeping in the core algorithm logic itself (where it imho doesn't belong)

view this post on Zulip Sukera (Aug 16 2022 at 11:29):

if you already have those points of serialization, you can definitely Mmap the individual backing arrays though

view this post on Zulip Sukera (Aug 16 2022 at 11:30):

yeah it doesn't belong there, but custom allocation schemes are not supported in julia

view this post on Zulip Sukera (Aug 16 2022 at 11:30):

they don't play nice with GC without extreme care

view this post on Zulip Maarten (Aug 16 2022 at 15:31):

But for example something like this :

https://gist.github.com/maartenvd/a265cbcbb3eb5cb3e9915861ff158569

wouldn't something like this work just fine?
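
The gist itself isn't reproduced in this archive; roughly, what is being proposed is a small wrapper allocator over one mmapped file, along these lines (the DiskManager and allocate! names match the snippet used later in the thread, while the naive first-fit bookkeeping is an assumption about what the gist does):

    using Mmap

    mutable struct DiskManager
        data::Vector{UInt8}               # one big mmapped backing buffer
        offsets::Vector{Tuple{Int,Int}}   # (offset, nbytes) of regions in use
        lock::ReentrantLock
    end

    function DiskManager(path::AbstractString, nbytes::Integer)
        io = open(path, "w+")
        data = Mmap.mmap(io, Vector{UInt8}, nbytes)
        close(io)
        return DiskManager(data, Tuple{Int,Int}[], ReentrantLock())
    end

    function allocate!(dm::DiskManager, ::Type{Array{T,N}}, dims::NTuple{N,Int}) where {T,N}
        nbytes = prod(dims) * sizeof(T)
        lock(dm.lock) do
            # naive first fit: walk the used regions in offset order and take
            # the first gap that is large enough
            offset = 0
            for (o, l) in sort(dm.offsets)
                o - offset >= nbytes && break
                offset = o + l
            end
            offset + nbytes <= length(dm.data) || error("backing file exhausted")
            push!(dm.offsets, (offset, nbytes))
            ptr = Ptr{T}(pointer(dm.data, offset + 1))
            # NB: the caller must keep dm alive while this array is in use,
            # since the wrapped array does not root the mmapped buffer
            return unsafe_wrap(Array{T,N}, ptr, dims)
        end
    end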

view this post on Zulip Maarten (Aug 16 2022 at 15:32):

it's not very pretty, and there are probably much nicer strategies than using findfirst

view this post on Zulip Maarten (Aug 16 2022 at 15:34):

to speed things up, I can be more intelligent about aligning larger tensors to the start of a page
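
e.g. by rounding the offsets of large allocations up to the next page boundary (Mmap.PAGESIZE is an unexported but long-standing constant of the Mmap stdlib):

    using Mmap

    # round an offset up to the next page boundary
    pagealign(offset::Integer) = cld(offset, Mmap.PAGESIZE) * Mmap.PAGESIZE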

view this post on Zulip Sukera (Aug 16 2022 at 16:10):

yes it would work, but you'd have to take care not to accidentally get an array that isn't managed by your allocator

view this post on Zulip Sukera (Aug 16 2022 at 16:11):

for that to work in general, your tensor library would probably need to be aware of that

view this post on Zulip Sukera (Aug 16 2022 at 16:11):

I'm not sure how growing a tensor vector would work or if that's required, but it would probably break under that scheme

view this post on Zulip Sukera (Aug 16 2022 at 16:12):

you can probably also run into some other race conditions, if the finalizers run late

view this post on Zulip Sukera (Aug 16 2022 at 16:12):

they're not guaranteed to run eagerly, after all

view this post on Zulip Maarten (Aug 16 2022 at 16:19):

yes, the tensor library never "grows" the data; which in any case wouldn't be a problem, as it would simply result in a tensor that lives entirely in RAM - which is the current situation anyway.

I thought briefly about race conditions, but I don't think there's a real problem. The only way the memory manager finalizer can be called is if the arrays that were created through it are also all going to be garbage collected (as their finalizers hold a reference to the memory manager), so that doesn't appear to be a problem (though I don't really know a lot about that).

view this post on Zulip Maarten (Aug 16 2022 at 16:23):

although now I think about it, push!/append! would indeed kinda be a problem, as some memory in the memory mapped file will not be freed, despite it not being used either.

view this post on Zulip Maarten (Aug 16 2022 at 16:28):

ah but push!/append! doesn't work. The array from unsafe_wrap is "shared", so I will get errors in that case :)

view this post on Zulip Maarten (Aug 16 2022 at 16:29):

mn2 = DiskManager("derp3.bin",1024*1024);
arr = allocate!(mn2,Array{ComplexF64,1},(10,));
append!(arr,[1.0])

errors

view this post on Zulip Maarten (Aug 18 2022 at 09:44):

In trying to further implement this DiskManager thing, I ran into the subtlety of finalizers and thread safety. My DiskManager has a lock, but the finalizers of the arrays I allocate also need to acquire this lock in order to signal that their data can be freed. This in turn could lead to a recursive lock deadlock.

The documentation recommends

A related third strategy is to use a yield-free queue. We don't currently have a lock-free queue implemented in Base, but Base.InvasiveLinkedListSynchronized{T} is suitable.

In what way is this type "suitable"? Is it simply a thread safe list, or is it special, in the sense that during execution it never yields to finalizers?

view this post on Zulip Sukera (Aug 18 2022 at 09:54):

both

view this post on Zulip Sukera (Aug 18 2022 at 09:55):

"yielding to a finalizer" doesn't really make sense, as finalizers are not tasks. they can't be yielded to. What the comment is referring to is that you really don't want to be interrupted by GC when you do such work (as you have no idea when you will be scheduled again), so you need a lock-free, non-yielding queue

view this post on Zulip Sukera (Aug 18 2022 at 09:55):

GC is allowed to run at yield points, so yielding is a no-go

view this post on Zulip Maarten (Aug 18 2022 at 09:56):

and InvasiveLinkedListSynchronized is written in such a way that operations on this type won't be interrupted by the GC?

view this post on Zulip Sukera (Aug 18 2022 at 09:58):

according to that comment, yes, but I don't recall where that comment is from, so I can't say much more

view this post on Zulip Maarten (Aug 18 2022 at 09:58):

https://engaging-web.mit.edu/~alir/cnhlab004/julia-1.5.3/share/doc/julia/html/en/manual/multi-threading.html

safe use of finalizers, the julia docs

view this post on Zulip Sukera (Aug 18 2022 at 09:59):

aah, your context was running this as part of a finalizer, yes?

view this post on Zulip Maarten (Aug 18 2022 at 09:59):

yes

view this post on Zulip Sukera (Aug 18 2022 at 10:00):

Yes, the problem is that finalizers are not allowed to yield to other tasks, as that can potentially leave an object half-cleaned up

view this post on Zulip Sukera (Aug 18 2022 at 10:00):

btw, which version of julia are you running? those docs you linked are neither official nor the most recent version

view this post on Zulip Maarten (Aug 18 2022 at 10:01):

ah I'm on 1.8 rc4, and google simply happened to drop me there instead of the official docs

view this post on Zulip Maarten (Aug 18 2022 at 10:02):

it's still there for 1.8 https://docs.julialang.org/en/v1/manual/multi-threading/#Safe-use-of-Finalizers

view this post on Zulip Sukera (Aug 18 2022 at 10:02):

grr, google

view this post on Zulip Sukera (Aug 18 2022 at 10:03):

yeah, the point is that old versions may have a different wording

view this post on Zulip Sukera (Aug 18 2022 at 10:03):

or the newer version made some relevant clarification

view this post on Zulip Maarten (Aug 18 2022 at 10:06):

let me quickly sketch the situation:

I have a DiskManager, which essentially has a large byte array and a list of memory regions in use.

If I allocate an array, I want to attach a finalizer to this array that tells the underlying DiskManager that this memory is no longer in use.

I want this to be thread safe, so the DiskManager needs to acquire a lock during allocation, and the finalizers of the arrays also need to acquire the same lock when "freeing up" the data. Using lock() in the finalizer would result in deadlocks.

The documentation recommends another approach as well - the delayed finalizers - but this appears to segfault if I use an anonymous function as finalizer:

function A_finalizer(x)
    if trylock(dm)
        # mark region i as free again (i is the offset entry captured at allocation time)
        hit = findfirst(entry -> entry[1] == i, dm.offsets)
        deleteat!(dm.offsets, hit)
        unlock(dm)
    else
        # couldn't get the lock; re-register the finalizer and try again later
        finalizer(A_finalizer, x)
        nothing
    end
end

view this post on Zulip Maarten (Aug 18 2022 at 10:09):

but if InvasiveLinkedList operations are indeed never interrupted by the garbage collector, then I can simply use that

view this post on Zulip Maarten (Aug 18 2022 at 10:12):

going through the code of this list type, I don't quite see what makes it special, i.e. what keeps it from being interrupted by the GC

view this post on Zulip Sukera (Aug 18 2022 at 10:14):

I think they're just referring to it being safe to be interrupted at any time, as the critical region is atomic?

view this post on Zulip Sukera (Aug 18 2022 at 10:14):

not too familiar with its internals though

view this post on Zulip Sukera (Aug 18 2022 at 10:15):

remember, you can be interrupted by other finalizers too

view this post on Zulip Sukera (Aug 18 2022 at 10:15):

I don't know how that type would be used - presumably you'd have to attach it to custom structs as an inline field, as its an InvasiveLinkedList

view this post on Zulip Maarten (Aug 18 2022 at 10:18):

when an array is finalized, I would simply push that information to an InvasiveLinkedListSynchronized in the DiskManager. Then, during allocate!, I can acquire the DiskManager lock, and process all this "region X is freed up" information in the InvasiveLinkedListSynchronized

view this post on Zulip Sukera (Aug 18 2022 at 10:18):

that's not how invasive linked lists work though

view this post on Zulip Sukera (Aug 18 2022 at 10:19):

the whole point of them is to not interact with GC, i.e. not having to allocate new memory

view this post on Zulip Sukera (Aug 18 2022 at 10:19):

"pushing" to an array potentially needs to switch to the GC, since it may have to move an array around

view this post on Zulip Sukera (Aug 18 2022 at 10:20):

invasive linked lists work by already being part of the object that's going to be part of the list - all "pushing" an element to that list does is let its parent know where its child is (and vice versa, if it's a doubly linked list)

view this post on Zulip Sukera (Aug 18 2022 at 10:21):

that's why they're safe to use for such applications, all the memory that's required for managing them was already allocated as part of the object itself
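
In code the idea looks roughly like this (a generic sketch of the concept only; Base.InvasiveLinkedListSynchronized has its own required field layout and API):

    # the link field lives inside the element itself, so inserting an element
    # into the list never allocates
    mutable struct FreeRegion
        offset::Int
        nbytes::Int
        next::Union{FreeRegion, Nothing}   # intrusive link
    end

    function push_region!(head::Ref{Union{FreeRegion, Nothing}}, r::FreeRegion)
        r.next = head[]       # just pointer updates, no new memory
        head[] = r
        return head
    end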

view this post on Zulip Maarten (Aug 18 2022 at 10:24):

ah ok, I had no clue what invasive linked lists actually were

view this post on Zulip Maarten (Aug 18 2022 at 10:29):

I could simply disable finalizers temporarily a la https://github.com/JuliaGPU/CuArrays.jl/commit/2258a24a6c1927a9acac6ca19ea604dc633e2028

view this post on Zulip Maarten (Aug 18 2022 at 10:31):

I would also have to disable finalizers while running the finalizer itself, to prevent different finalizers from interrupting each other. Is that even allowed?

view this post on Zulip Sukera (Aug 18 2022 at 10:34):

I don't know, but it doesn't sound good

view this post on Zulip Sukera (Aug 18 2022 at 10:35):

I'd check whether they're using that macro in finalizers themselves

view this post on Zulip Sukera (Aug 18 2022 at 10:36):

all they're doing is making sure that their critical portion is not interrupted by finalizers

view this post on Zulip Maarten (Aug 18 2022 at 10:53):

aha: https://github.com/JuliaLang/julia/pull/38487

view this post on Zulip Maarten (Aug 18 2022 at 10:54):

no finalizers are run if any lock is acquired!

view this post on Zulip Maarten (Aug 18 2022 at 10:55):

correction - this is only true if I use ReentrantLock:

  Calling 'lock' will also inhibit running of finalizers on that thread until the corresponding 'unlock'.

view this post on Zulip Maarten (Aug 18 2022 at 17:17):

so it finally works, with surprisingly reasonable performance given how stupid my allocator implementation is. One annoying caveat is that I rely on the Julia GC to free up memory, but the GC has no idea how much space I still have available in my file, or whether I'm close to running out. A very similar issue to the one the CUDA people appear to have
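
One way to cope with that, similar in spirit to the allocation-retry strategy CUDA.jl uses for GPU memory, is to nudge the GC and retry when the file allocator reports it is full (allocate! here is the hypothetical allocator sketched earlier in the thread):

    function allocate_with_retry!(dm, T, dims)
        try
            return allocate!(dm, T, dims)
        catch
            GC.gc(false)      # quick collection: lets pending finalizers return regions
            try
                return allocate!(dm, T, dims)
            catch
                GC.gc(true)   # full collection as a last resort
                return allocate!(dm, T, dims)
            end
        end
    end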

