Stream: helpdesk (published)

Topic: memory mapped cache


view this post on Zulip Maarten (Aug 16 2022 at 10:23):

For my code I often have a large number of tensors stored in RAM, but they typically aren't all required at the same time. Recalculating them whenever I need them would be extremely wasteful, so I was thinking that I could maybe create one big file and memory map parts of that file as memory for the tensors. When I don't need them, the operating system can then write things out to disk and keep my memory usage relatively low.

The idea would be to create some kind of memory manager that simply tracks which parts of some backend file are currently being memory mapped, and I can request new memory mapped arrays from it. Bookkeeping to see what is freed up can then be done using finalizers.

My questions are:

view this post on Zulip Sukera (Aug 16 2022 at 10:25):

memory mapping already does that

view this post on Zulip Sukera (Aug 16 2022 at 10:27):

The smallest amount of memory your OS manages for you is called a Page. When you memory map a file, the kernel more or less goes "I have a file of size X, so I need ceil(X/SIZE_PAGE) pages to store that. I won't fill them all, but when someone asks for memory in that region, I'll fill them lazily by loading from disk"
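
In Julia, that lazy paging is exactly what the Mmap standard library hands you; a minimal sketch (the file name is illustrative):

    using Mmap

    # Map a (possibly huge) file as an ordinary Julia array. Nothing is read
    # up front; the kernel faults pages in on first access and can evict
    # clean pages again under memory pressure.
    io = open("tensors.bin", "w+")                 # illustrative file name
    A = Mmap.mmap(io, Vector{Float64}, 1_000_000)  # file is grown as needed
    A[1] = 3.14          # touching an element pages in that part of the file
    Mmap.sync!(A)        # optionally force dirty pages back to disk
    close(io)            # the mapping stays valid after the stream is closed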

view this post on Zulip Maarten (Aug 16 2022 at 10:30):

right, but I would need some way of doing this for many different tensors, so unless I use a new file per tensor (which could mean a ridiculous number of small files), I would still have to write some kind of "memory manager" which tracks which part of a file can be assigned again, right?

view this post on Zulip Maarten (Aug 16 2022 at 10:30):

memory mapping appears to do exactly what I want, and I want to do this for many small objects

view this post on Zulip Sukera (Aug 16 2022 at 10:33):

what do you mean with "assigned again"?

view this post on Zulip Sukera (Aug 16 2022 at 10:33):

memory mapping is a bidirectional operation

view this post on Zulip Sukera (Aug 16 2022 at 10:34):

the kernel already flushes old pages you haven't accessed in a while to disk

view this post on Zulip Sukera (Aug 16 2022 at 10:34):

memory mapping is a generic operation for the exact same thing - if you'd want your version to be efficient, you'd have to work on the page level as well, to keep things fast

view this post on Zulip Maarten (Aug 16 2022 at 10:35):

if the garbage collector frees up an array that was assigned to, say, bytes 10-20 of my file on disk, and my program wants to create a new tensor object that again requires 10 bytes, then I probably want to memory map bytes 10-20 of the file again, instead of growing it.

view this post on Zulip Sukera (Aug 16 2022 at 10:36):

that's a completely orthogonal thing

view this post on Zulip Sukera (Aug 16 2022 at 10:37):

if you want to work with memory mapping, you'll have to work with arrays (which are then memory mapped on disk)

view this post on Zulip Sukera (Aug 16 2022 at 10:37):

you wouldn't necessarily create new tensor objects, you'd mutate the array, either by assigning an existing index or mutating in place

view this post on Zulip Maarten (Aug 16 2022 at 10:44):

I'm probably not explaining myself very well.

let's say I have tensors (t1,t2,t3,t4), and I want to store them on disk and only load them in lazily. Now I can either create a separate file for every object "t1.bin", "t2.bin", "t3.bin", ..., or I can create one big "db.bin", and memory map regions of that file to the respective arrays.

Now let's say I don't really need tensor t1 anymore, ever. The garbage collector cleans it up. I now do a new operation, which creates a new tensor, and this tensor is the same size as tensor "t1". I can now once again create a new file "t5.bin", or, if I work with one big "db.bin", I can memory map a new region from "db.bin" for my new tensor.

This essentially means that "db.bin" would keep on growing and growing, despite large parts of that file not being needed anymore. I could probably do better and keep track of what can be re-used, and whenever I need to create a new tensor, I could simply memory map a part of the file that is no longer in use.

The alternative is to work with many small files, but a finalizer is not allowed to do IO, so I cannot automatically delete those files and clean things up.

view this post on Zulip Sukera (Aug 16 2022 at 10:50):

I understand what you're saying. What I'm saying is that memory mapping things that aren't page size is wasteful, as that's the smallest unit of memory the kernel can load/handle. So you need to have one big memory mapped file which you use as backing storage (nonideal, as you'd have to more or less write your own wrapper allocator around that and can't use existing interfaces to allocate it)

view this post on Zulip Sukera (Aug 16 2022 at 10:51):

what you're describing is more or less replacing/amending the existing GC, which is really not well supported in julia

view this post on Zulip Maarten (Aug 16 2022 at 11:00):

So what would you do, to provide some kind of lazy disk storage?

And also, what about it isn't well supported in julia? The implementation of mmap seems relatively simple, and I appear to be able to use unsafe_wrap to use parts of my memory mapped region as arrays?

view this post on Zulip Sukera (Aug 16 2022 at 11:07):

I would first try to see if you can't save memory in other places. The GC already reuses memory you can't reach anymore for new allocations - if you have lots of large, long-lived allocations you don't actually need anymore (which is what would consume more memory), that to me suggests there are places in your code that can be made more efficient first.

view this post on Zulip Sukera (Aug 16 2022 at 11:08):

on the other hand, are you certain that recomputing the tensors is too expensive?

view this post on Zulip Sukera (Aug 16 2022 at 11:08):

what I was referring to with it not being well supported is that you can't e.g. just naively take the existing API, which may allocate regular GC memory instead of your special wrapped memory

view this post on Zulip Sukera (Aug 16 2022 at 11:09):

so you'll have to modify your tensors and the operations on them to be aware of the Mmapped memory, if you need to allocate intermediaries

view this post on Zulip Maarten (Aug 16 2022 at 11:13):

yes, though I could probably shave off a bit of my RAM usage. It's a standard algorithm in physics called "DMRG", and to reach state of the art, people really do store every individual tensor on disk and load it in when necessary. It is much cheaper to read it back from disk than it is to recalculate the necessary tensors

view this post on Zulip Maarten (Aug 16 2022 at 11:16):

I would probably start out with writing a "send2disk" function, and copy regular GC-allocated memory over to a memory mapped region. It's an extra copy, but really simple to implement. I can go even further and directly allocate memory mapped arrays as a next step, but I'm not too sure if that will even be necessary
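
A sketch of that first step (the name send2disk and the one-file-per-tensor layout are just working assumptions):

    using Mmap

    # Copy a regular GC-allocated array into a freshly mmapped file and
    # return the mapped array; the OS is then free to page it out.
    function send2disk(path::AbstractString, x::Array{T,N}) where {T,N}
        io = open(path, "w+")
        m = Mmap.mmap(io, Array{T,N}, size(x))
        copyto!(m, x)    # the extra copy mentioned above
        close(io)        # the mapping outlives the stream
        return m
    end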

view this post on Zulip Sukera (Aug 16 2022 at 11:19):

if your algorithm does all that explicitly anyway, I'd just serialize the tensors at points where you know you don't need them anymore
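
That is, something along these lines, spilling a tensor at a point where the algorithm knows it is done with it (file paths are illustrative):

    using Serialization

    serialize("t1.jls", t1)    # spill t1 once it is no longer needed
    t1 = nothing               # drop the in-memory reference so GC can reclaim it
    # ... much later ...
    t1 = deserialize("t1.jls")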

view this post on Zulip Maarten (Aug 16 2022 at 11:20):

I was hoping that I could offload that logic (serialize old tensors, keep recently used tensors still in ram) to the OS, precisely by using mmap

view this post on Zulip Sukera (Aug 16 2022 at 11:22):

you can't, not easily at least

view this post on Zulip Sukera (Aug 16 2022 at 11:22):

the OS has no concept of your tensors, it only knows blobs of memory

view this post on Zulip Sukera (Aug 16 2022 at 11:22):

GC knows about your objects, but it doesn't even try to use disk as potential backing storage for them

view this post on Zulip Sukera (Aug 16 2022 at 11:23):

you could try to increase your SWAP memory and just let GC do its thing

view this post on Zulip Sukera (Aug 16 2022 at 11:23):

either way, is your concern theoretical or are you running into actual OOM errors?

view this post on Zulip Maarten (Aug 16 2022 at 11:25):

At the moment I don't do any cache2disk things, and I run into OOM errors

view this post on Zulip Sukera (Aug 16 2022 at 11:25):

have you profiled/minimized allocations already?

view this post on Zulip Maarten (Aug 16 2022 at 11:26):

yes, the dominant memory usage comes from things that other people store on disk, and that I should also store on disk and only load in when necessary

view this post on Zulip Sukera (Aug 16 2022 at 11:26):

how are those other people solving that problem?

view this post on Zulip Sukera (Aug 16 2022 at 11:27):

are they using a custom allocator approach? This would be equivalent to subverting GC in julia.

view this post on Zulip Maarten (Aug 16 2022 at 11:27):

No, they manually store and load things from disk, and they hardcode this logic in the algorithm itself.

view this post on Zulip Sukera (Aug 16 2022 at 11:28):

then I'd follow that approach, as anything you could cook up in julia would look extremely similar, at least without replacing GC entirely

view this post on Zulip Maarten (Aug 16 2022 at 11:29):

But this is ugly and cumbersome to deal with, so a first step would be to create an array type that only keeps a few elements in RAM, and serializes everything else to disk. That way, I don't have to deal with this bookkeeping in the core algorithm logic itself (where it imho doesn't belong)

view this post on Zulip Sukera (Aug 16 2022 at 11:29):

if you already have those points of serialization, you can definitely Mmap the individual backing arrays though

view this post on Zulip Sukera (Aug 16 2022 at 11:30):

yeah it doesn't belong there, but custom allocation schemes are not supported in julia

view this post on Zulip Sukera (Aug 16 2022 at 11:30):

they don't play nice with GC without extreme care

view this post on Zulip Maarten (Aug 16 2022 at 15:31):

But for example something like this :

https://gist.github.com/maartenvd/a265cbcbb3eb5cb3e9915861ff158569

wouldn't something like this work just fine?
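
The gist itself isn't reproduced in this archive; roughly, what is being proposed is a small wrapper allocator over one mmapped file, along these lines (the DiskManager and allocate! names match the snippet used later in the thread, while the naive first-fit bookkeeping is an assumption about what the gist does):

    using Mmap

    mutable struct DiskManager
        data::Vector{UInt8}               # one big mmapped backing buffer
        offsets::Vector{Tuple{Int,Int}}   # (offset, nbytes) of regions in use
        lock::ReentrantLock
    end

    function DiskManager(path::AbstractString, nbytes::Integer)
        io = open(path, "w+")
        data = Mmap.mmap(io, Vector{UInt8}, nbytes)
        close(io)
        return DiskManager(data, Tuple{Int,Int}[], ReentrantLock())
    end

    function allocate!(dm::DiskManager, ::Type{Array{T,N}}, dims::NTuple{N,Int}) where {T,N}
        nbytes = prod(dims) * sizeof(T)
        lock(dm.lock) do
            # naive first fit: walk the used regions in offset order and take
            # the first gap that is large enough
            offset = 0
            for (o, l) in sort(dm.offsets)
                o - offset >= nbytes && break
                offset = o + l
            end
            offset + nbytes <= length(dm.data) || error("backing file exhausted")
            push!(dm.offsets, (offset, nbytes))
            ptr = Ptr{T}(pointer(dm.data, offset + 1))
            # NB: the caller must keep dm alive while this array is in use,
            # since the wrapped array does not root the mmapped buffer
            return unsafe_wrap(Array{T,N}, ptr, dims)
        end
    end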

view this post on Zulip Maarten (Aug 16 2022 at 15:32):

it's not very pretty, and there are probably much nicer strategies than using findfirst

view this post on Zulip Maarten (Aug 16 2022 at 15:34):

to speed things up, I can be more intelligent about aligning larger tensors to the start of a page
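
e.g. by rounding the offsets of large allocations up to the next page boundary (Mmap.PAGESIZE is an unexported but long-standing constant of the Mmap stdlib):

    using Mmap

    # round an offset up to the next page boundary
    pagealign(offset::Integer) = cld(offset, Mmap.PAGESIZE) * Mmap.PAGESIZE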

view this post on Zulip Sukera (Aug 16 2022 at 16:10):

yes it would work, but you'd have to take care not to accidentally get an array that isn't managed by your allocator

view this post on Zulip Sukera (Aug 16 2022 at 16:11):

for that to work in general, your tensor library would probably need to be aware of that

view this post on Zulip Sukera (Aug 16 2022 at 16:11):

I'm not sure how growing a tensor vector would work or if that's required, but it would probably break under that scheme

view this post on Zulip Sukera (Aug 16 2022 at 16:12):

you can probably also run into some other race conditions, if the finalizers run late

view this post on Zulip Sukera (Aug 16 2022 at 16:12):

they're not guaranteed to run eagerly, after all

view this post on Zulip Maarten (Aug 16 2022 at 16:19):

yes, the tensor library never "grows" the data; which in any case wouldn't be a problem, as it would simply result in a tensor that lives entirely in RAM - which is the current situation anyway.

I thought briefly about race conditions, but I don't think there's a real problem. The only way the memory manager finalizer can be called is if the arrays that were created through it are also all going to be garbage collected (as their finalizers hold a reference to the memory manager), so that doesn't appear to be a problem (though I don't really know a lot about that).

view this post on Zulip Maarten (Aug 16 2022 at 16:23):

although now I think about it, push!/append! would indeed kinda be a problem, as some memory in the memory mapped file will not be freed, despite it not being used either.

view this post on Zulip Maarten (Aug 16 2022 at 16:28):

ah but push!/append! doesn't work. The array from unsafe_wrap is "shared", so I will get errors in that case :)

view this post on Zulip Maarten (Aug 16 2022 at 16:29):

mn2 = DiskManager("derp3.bin",1024*1024);
arr = allocate!(mn2,Array{ComplexF64,1},(10,));
append!(arr,[1.0])

errors

view this post on Zulip Maarten (Aug 18 2022 at 09:44):

In trying to further implement this DiskManager thing, I ran into the subtlety of finalizers and thread safety. My DiskManager has a lock, but the finalizers of the arrays I allocate also need to acquire this lock in order to signal that their data can be freed. This in turn could lead to a recursive lock deadlock.

The documentation recommends

A related third strategy is to use a yield-free queue. We don't currently have a lock-free queue implemented in Base, but Base.InvasiveLinkedListSynchronized{T} is suitable.

In what way is this type "suitable"? Is it simply a thread safe list, or is it special, in the sense that during execution it never yields to finalizers?

view this post on Zulip Sukera (Aug 18 2022 at 09:54):

both

view this post on Zulip Sukera (Aug 18 2022 at 09:55):

"yielding to a finalizer" doesn't really make sense, as finalizers are not tasks. they can't be yielded to. What the comment is referring to is that you really don't want to be interrupted by GC when you do such work (as you have no idea when you will be scheduled again), so you need a lock-free, non-yielding queue

view this post on Zulip Sukera (Aug 18 2022 at 09:55):

GC is allowed to run at yield points, so yielding is a no-go

view this post on Zulip Maarten (Aug 18 2022 at 09:56):

and InvasiveLinkedListSynchronized is written in such a way that operations on this type won't be interrupted by the GC?

view this post on Zulip Sukera (Aug 18 2022 at 09:58):

according to that comment, yes, but I don't recall where that comment is from, so I can't say much more

view this post on Zulip Maarten (Aug 18 2022 at 09:58):

https://engaging-web.mit.edu/~alir/cnhlab004/julia-1.5.3/share/doc/julia/html/en/manual/multi-threading.html

safe use of finalizers, the julia docs

view this post on Zulip Sukera (Aug 18 2022 at 09:59):

aah, your context was running this as part of a finalizer, yes?

view this post on Zulip Maarten (Aug 18 2022 at 09:59):

yes

view this post on Zulip Sukera (Aug 18 2022 at 10:00):

Yes, the problem is that finalizers are not allowed to yield to other tasks, as that can potentially leave an object half-cleaned up

view this post on Zulip Sukera (Aug 18 2022 at 10:00):

btw, which version of julia are you running? those docs you linked are neither official nor the most recent version

view this post on Zulip Maarten (Aug 18 2022 at 10:01):

ah I'm on 1.8 rc4, and google simply happened to drop me there instead of the official docs

view this post on Zulip Maarten (Aug 18 2022 at 10:02):

it's still there for 1.8 https://docs.julialang.org/en/v1/manual/multi-threading/#Safe-use-of-Finalizers

view this post on Zulip Sukera (Aug 18 2022 at 10:02):

grr, google

view this post on Zulip Sukera (Aug 18 2022 at 10:03):

yeah, the point is that old versions may have a different wording

view this post on Zulip Sukera (Aug 18 2022 at 10:03):

or the newer version made some relevant clarification

view this post on Zulip Maarten (Aug 18 2022 at 10:06):

let me quickly sketch the situation:

I have a DiskManager, which essentially has a large byte array and a list of memory regions in use.

If I allocate an array, I want to attach a finalizer to this array that tells the underlying DiskManager that this memory is no longer in use.

I want this to be thread safe, so the DiskManager needs to acquire a lock during allocation, and the finalizers of the arrays also need to acquire the same lock when "freeing up" the data. Using lock() in the finalizer would result in deadlocks.

The documentation recommends another approach as well - the delayed finalizers - but this appears to segfault if I use an anonymous function as finalizer:

function A_finalizer(x)
    if trylock(dm)
        # mark region i as free again (i is the offset entry captured at allocation time)
        hit = findfirst(entry -> entry[1] == i, dm.offsets)
        deleteat!(dm.offsets, hit)
        unlock(dm)
    else
        # couldn't get the lock; re-register the finalizer and try again later
        finalizer(A_finalizer, x)
        nothing
    end
end

view this post on Zulip Maarten (Aug 18 2022 at 10:09):

but if InvasiveLinkedList operations are indeed never interrupted by the garbage collector, then I can simply use that

view this post on Zulip Maarten (Aug 18 2022 at 10:12):

going through the code of this list type, I don't quite see what makes it special, i.e. what keeps it from being interrupted by the GC

view this post on Zulip Sukera (Aug 18 2022 at 10:14):

I think they're just referring to it being safe to be interrupted at any time, as the critical region is atomic?

view this post on Zulip Sukera (Aug 18 2022 at 10:14):

not too familiar with its internals though

view this post on Zulip Sukera (Aug 18 2022 at 10:15):

remember, you can be interrupted by other finalizers too

view this post on Zulip Sukera (Aug 18 2022 at 10:15):

I don't know how that type would be used - presumably you'd have to attach it to custom structs as an inline field, as its an InvasiveLinkedList

view this post on Zulip Maarten (Aug 18 2022 at 10:18):

when an array is finalized, I would simply push that information to an InvasiveLinkedListSynchronized in the DiskManager. Then, during allocate!, I can acquire the DiskManager lock, and process all this "region X is freed up" information in the InvasiveLinkedListSynchronized

view this post on Zulip Sukera (Aug 18 2022 at 10:18):

that's not how invasive linked lists work though

view this post on Zulip Sukera (Aug 18 2022 at 10:19):

the whole point of them is to not interact with GC, i.e. not having to allocate new memory

view this post on Zulip Sukera (Aug 18 2022 at 10:19):

"pushing" to an array potentially needs to switch to the GC, since it may have to move an array around

view this post on Zulip Sukera (Aug 18 2022 at 10:20):

invasive linked lists work by already being part of the object that's going to be part of the list - all "pushing" an element to that list does is let its parent know where its child is (and vice versa, if it's a doubly linked list)

view this post on Zulip Sukera (Aug 18 2022 at 10:21):

that's why they're safe to use for such applications, all the memory that's required for managing them was already allocated as part of the object itself
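
In code the idea looks roughly like this (a generic sketch of the concept only; Base.InvasiveLinkedListSynchronized has its own required field layout and API):

    # the link field lives inside the element itself, so inserting an element
    # into the list never allocates
    mutable struct FreeRegion
        offset::Int
        nbytes::Int
        next::Union{FreeRegion, Nothing}   # intrusive link
    end

    function push_region!(head::Ref{Union{FreeRegion, Nothing}}, r::FreeRegion)
        r.next = head[]       # just pointer updates, no new memory
        head[] = r
        return head
    end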

view this post on Zulip Maarten (Aug 18 2022 at 10:24):

ah ok, I had no clue what invasive linked lists actually were

view this post on Zulip Maarten (Aug 18 2022 at 10:29):

I could simply disable finalizers temporarily a la https://github.com/JuliaGPU/CuArrays.jl/commit/2258a24a6c1927a9acac6ca19ea604dc633e2028

view this post on Zulip Maarten (Aug 18 2022 at 10:31):

I would also have to disable finalizers while running the finalizer itself, to prevent different finalizers from interrupting each other. Is that even allowed?

view this post on Zulip Sukera (Aug 18 2022 at 10:34):

I don't know, but it doesn't sound good

view this post on Zulip Sukera (Aug 18 2022 at 10:35):

I'd check whether they're using that macro in finalizers themselves

view this post on Zulip Sukera (Aug 18 2022 at 10:36):

all they're doing is making sure that their critical portion is not interrupted by finalizers

view this post on Zulip Maarten (Aug 18 2022 at 10:53):

aha: https://github.com/JuliaLang/julia/pull/38487

view this post on Zulip Maarten (Aug 18 2022 at 10:54):

no finalizers are run if any lock is acquired!

view this post on Zulip Maarten (Aug 18 2022 at 10:55):

correction - this is only true if I use ReentrantLock:

  Calling 'lock' will also inhibit running of finalizers on that thread until the corresponding 'unlock'.

view this post on Zulip Maarten (Aug 18 2022 at 17:17):

so it finally works, with surprisingly reasonable performance given how stupid my allocator implementation is. One annoying caveat is that I rely on the Julia GC to free up memory, but the GC has no idea how much space I still have available in my file, or whether I'm close to running out. A very similar issue to the one the CUDA people appear to have
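
One way to cope with that, similar in spirit to the allocation-retry strategy CUDA.jl uses for GPU memory, is to nudge the GC and retry when the file allocator reports it is full (allocate! here is the hypothetical allocator sketched earlier in the thread):

    function allocate_with_retry!(dm, T, dims)
        try
            return allocate!(dm, T, dims)
        catch
            GC.gc(false)      # quick collection: lets pending finalizers return regions
            try
                return allocate!(dm, T, dims)
            catch
                GC.gc(true)   # full collection as a last resort
                return allocate!(dm, T, dims)
            end
        end
    end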

