Added error handling for objstore not having enough memory to allocate #275

Merged
merged 1 commit into from
Jul 19, 2016

@Wapaul1 Wapaul1 commented Jul 16, 2016

@mehrdadn
Collaborator

Nice commit! However, this breaks the build on Windows. Could you wrap the platform-specific bits in something like #if defined(WIN32) || defined(_WIN32) / #else / #endif so they are compiled only where they are supported?

@mehrdadn
Collaborator

mehrdadn commented Jul 16, 2016

By the way, there's a better way to check whether you're out of memory: try the allocation, and handle the error if it fails. I don't think you need to check how much free memory the system has at all. Besides the minor race condition (memory can be taken between the check and the allocation), the number can be misleading, since memory the system is using in non-critical ways can still be freed when needed.
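
A minimal sketch of the "try it and handle the failure" approach, assuming an ordinary heap allocation that reports failure by throwing std::bad_alloc. The helper name try_allocate is hypothetical and does not come from the PR; it only illustrates the pattern.

```cpp
#include <cstddef>
#include <memory>
#include <new>

// Hypothetical helper (not from the PR): attempt the allocation and report
// failure to the caller instead of probing free memory beforehand.
std::unique_ptr<char[]> try_allocate(std::size_t size) {
  try {
    return std::unique_ptr<char[]>(new char[size]);
  } catch (const std::bad_alloc&) {
    // The allocation failed; return an empty pointer so the caller can
    // decide how to recover (free objects, report an error, and so on).
    return nullptr;
  }
}
```

The caller tests the returned pointer against nullptr. This pattern only helps when the allocator actually reports failure; it does not cover allocations that fail later, when the memory is first touched.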

@Wapaul1
Collaborator Author

Wapaul1 commented Jul 17, 2016

I would do that, except that this allocation does not throw an exception; it only triggers a bus error. So we have to check the available memory up front. This fix is just a stopgap until the new object store is implemented.

@mehrdadn
Collaborator

That's strange; sorry, I didn't realize that. In that case, just make the check platform-dependent so it doesn't fail on other platforms. Note that even #include <sys/statvfs.h> fails on a system that doesn't provide that header, so the include itself has to be guarded too.
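
A minimal sketch of what guarding both the include and the call could look like. The wrapper name free_bytes and its "-1 means unknown" convention are assumptions made for illustration; the size computation mirrors the f_bsize * f_bfree expression used in the diff.

```cpp
#include <cstdint>

#if defined(WIN32) || defined(_WIN32)
// No <sys/statvfs.h> on Windows, so the include is skipped entirely.
#else
#include <sys/statvfs.h>
#endif

// Hypothetical wrapper: returns the free bytes reported for `path`, or -1
// when the platform has no statvfs or the call itself fails.
int64_t free_bytes(const char* path) {
#if defined(WIN32) || defined(_WIN32)
  (void)path;  // unused on this platform
  return -1;
#else
  struct statvfs buffer;
  if (statvfs(path, &buffer) != 0) return -1;
  return static_cast<int64_t>(buffer.f_bsize) *
         static_cast<int64_t>(buffer.f_bfree);
#endif
}
```

With this shape, a caller can skip the memory check whenever the result is -1 instead of failing the build or aborting on platforms without statvfs.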

@@ -14,9 +16,11 @@ namespace {
#endif

void objstore_memcheck(int64_t size) {
#if defined(__unix__) || defined(__linux__)
struct statvfs buffer;
statvfs("/dev/shm/", &buffer);
RAY_CHECK_LE(size + 100, buffer.f_bsize * buffer.f_bfree, "Not enough available memory for allocation by objstore");

@mehrdadn mehrdadn Jul 17, 2016

I'm not sure if these are necessary for /dev/shm/, but if it behaves like typical file systems, then:

  • If f_bavail also works, use that instead of f_bfree.
  • If applicable, check if f_favail >= 1 as well. (Note: I only took a cursory glance at this and it may not actually be a valid check; I'm not sure.)
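
The two suggestions above can be sketched roughly as follows for a POSIX system. The names FsSpace, query_space, and has_room are hypothetical. The sketch multiplies by f_frsize, the unit POSIX specifies for the block counts, rather than the f_bsize used in the diff; that substitution is a choice made here, not part of the PR. The inode count is only reported, not enforced, since it is unclear whether it is a meaningful check for /dev/shm/.

```cpp
#include <cstdint>
#include <sys/statvfs.h>  // POSIX only

// Hypothetical result type for illustration.
struct FsSpace {
  bool ok;               // false if statvfs failed
  int64_t avail_bytes;   // bytes available to unprivileged processes
  int64_t avail_inodes;  // inodes available to unprivileged processes
};

FsSpace query_space(const char* path) {
  struct statvfs buffer;
  if (statvfs(path, &buffer) != 0) return {false, 0, 0};
  // f_bavail excludes blocks reserved for privileged users, unlike f_bfree.
  return {true,
          static_cast<int64_t>(buffer.f_frsize) *
              static_cast<int64_t>(buffer.f_bavail),
          static_cast<int64_t>(buffer.f_favail)};
}

// True only if the query succeeded and `size` bytes fit in the available space.
bool has_room(const FsSpace& s, int64_t size) {
  return s.ok && size <= s.avail_bytes;
}
```

A caller would pass the result of query_space("/dev/shm/") to has_room with the requested size, and could additionally inspect avail_inodes if that turns out to be a valid check.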

@robertnishihara
Collaborator

robertnishihara commented Jul 18, 2016

I just tried the following in python scripts/shell.py (on Ubuntu):

import numpy as np
x = np.zeros(100000)
l = []
for i in range(1000):
  l.append(ray.put(x))

Eventually, I got an error saying

fatal error occured: Check failed at line 22 in /home/ubuntu/ray/src/utils.cc: (size + 100) <= (buffer.f_bsize * buffer.f_bfree) with message Not enough available memory for allocation by objstore

as expected.

However, if I then start up python scripts/shell.py again and do

import numpy as np
x = np.zeros(100000)
ray.put(x)

I immediately get the error

fatal error occured: Check failed at line 22 in /home/ubuntu/ray/src/utils.cc: (size + 100) <= (buffer.f_bsize * buffer.f_bfree) with message Not enough available memory for allocation by objstore

Do I need to clear out some files somewhere?

Also, this may be unrelated, but there were several instances where I got the following error as soon as I tried anything:

I0718 17:49:18.717566861   16251 subchannel.c:642]           Connect failed: {"created":"@1468864158.717542076","description":"Failed to connect to remote host","errno":107,"file":"src/core/lib/iomgr/error.c","file_line":256,"os_error":"Connection refused","target_address":"ipv4:127.0.0.1:10001"}
I0718 17:49:18.717596258   16251 subchannel.c:647]           Retry in 0.999912500 seconds

@robertnishihara
Collaborator

I wasn't able to trigger the error on Mac OS X.

I ran

import numpy as np
x = np.zeros(10000000)
l = []
for i in range(1000):
    print i
    l.append(ray.put(x))

and that completed successfully.

>>> import sys
>>> sys.getsizeof(x)
80000096

So this should be storing about 80 GB in the object store (1000 puts of roughly 80 MB each).

@pcmoritz pcmoritz merged commit 6c96a05 into master Jul 19, 2016
@pcmoritz pcmoritz deleted the mem_error branch July 19, 2016 00:45
pcmoritz pushed a commit to pcmoritz/ray that referenced this pull request Dec 18, 2017
* Pipe num_cpus and num_gpus through from start_ray.py.

* Improve load balancing tests.

* Fix bug.

* Factor out some testing code.
4 participants