Where are my files?

2017-10-17

A guy recently asked me:

I have a script that’s located in ~/projects/work/foo/bar. I added that path to my $PATH. Now, my script needs to read some more files that are located in the same directory. How do I find them?

That’s a very interesting question. At first, you think that this should be trivial to answer. Turns out that it’s not.

As usual, this blog post assumes that you run a UNIX-like operating system.

– Update, 2018-03-24: Part 2

The general, portable answer

The answer to that question is that there is no answer.

When you run some program, there is no guarantee that this program is located somewhere on your hard drive. It’s easy to show this: Just remove the program after it has been started.

#include <stdio.h>
#include <unistd.h>

int
main()
{
    sleep(5);
    printf("I'm still alive.\n");
    return 0;
}

Open two terminals, run the program in one of them, and remove the binary in the other. The program continues to run.

Another scenario: What if your program was available through more than one hard link? What’s the location of “the” program then?

So, there is no definite answer.

“But it works with `$foo`!”

There are programs which try to cope with that situation. Most people will probably be familiar with Java. You can drop a JRE somewhere on your hard drive, put it in $PATH and it will just work. It’s not that Java is a massive statically linked binary – Java has to read its JARs and other files. And, no, you don’t have to set $JAVA_HOME for this to work.

How does Java do that?

What about `argv[0]`?

By convention, argv[0] contains the name under which the program has been invoked. This would get us a little closer to the answer, since it would resolve the issue of multiple hard links. The problem of removed binaries would remain.

Most importantly, there still is no guarantee that argv[0] really does contain the path to your program:

#include <unistd.h>

int
main()
{
    /* The second argument becomes argv[0]; the list must end in a char * null pointer. */
    execl("/bin/ps", "foobar", "-f", (char *)NULL);
    return 1;
}

It will show something like this:

UID        PID  PPID  C STIME TTY          TIME CMD
void      3465  7122  0 08:24 pts/13   00:00:00 -/bin/bash
void      3544  3465  0 08:24 pts/13   00:00:00 vim -p bla.c
void     30336     1  0 08:42 pts/13   00:00:00 xclip
void     30337     1  0 08:42 pts/13   00:00:00 xclip -selection clipboard -f
void     30688  3544  0 08:42 pts/13   00:00:00 foobar -f

Also note the -/bin/bash: That additional dash, which clearly is not part of the program’s path, is used by login(1) (and others) to ask for a login shell. That’s another use case of altering argv[0].

Another obstacle: `$PATH`

Setting argv[0] to something weird is not common. You could argue that you could ignore this in 99% of the cases. Well, it’s not that easy.

When you add a directory to your $PATH, it allows you to type ls instead of /bin/ls. And that’s just it: As argv[0] should contain the name under which your program has been invoked, it now contains just ls. Bummer.

Workaround #1: `procfs` on Linux

Linux has a powerful interface called procfs. It’s a pseudo-filesystem usually mounted under /proc and it contains a lot of information about running processes. Each process is identified by its PID under /proc/$pid. There is also a special symlink called /proc/self which always points to the directory for the process reading that link.

Finally, in /proc/$pid, you have another symlink called exe. And that one points to the actual path of the executable that the process is running, no matter what has been put into argv[0].

See for yourself:

#include <stdio.h>
#include <unistd.h>

int
main()
{
    char buf[4096] = "";
    ssize_t len;

    if ((len = readlink("/proc/self/exe", buf, (sizeof buf) - 1)) != -1)
    {
        buf[len] = 0;
        printf("[%s]\n", buf);
        return 0;
    }
    else
        return 1;
}

This is also how Java finds its files on Linux. You can see it when you invoke it using strace:

07:52:14.046676 execve("/opt/jdk-9/bin/java", ["java", "Test"], 0x7ffda7b0fa28 /* 10 vars */) = 0
07:52:14.047496 brk(NULL)               = 0x1114000
07:52:14.047592 readlink("/proc/self/exe", "/opt/jdk-9/bin/java", 4096) = 19

Workaround #2: Traversing the `$PATH` by yourself

When procfs is not available, you can try to do what the shell (or a libc function like execvp(3)) has done to find your binary. After all, you just entered ls and something found your program, so why wouldn’t you be able to do just that?

This is what Java does on OpenBSD:

 93056 java     CALL  stat(0x7f7ffffc73d0,0x7f7ffffc7350)
 93056 java     NAMI  "/sbin/java"
 93056 java     RET   stat -1 errno 2 No such file or directory
 93056 java     CALL  stat(0x7f7ffffc73d0,0x7f7ffffc7350)
 93056 java     NAMI  "/usr/sbin/java"
 93056 java     RET   stat -1 errno 2 No such file or directory
 93056 java     CALL  stat(0x7f7ffffc73d0,0x7f7ffffc7350)
 93056 java     NAMI  "/bin/java"
 93056 java     RET   stat -1 errno 2 No such file or directory
 93056 java     CALL  stat(0x7f7ffffc73d0,0x7f7ffffc7350)
 93056 java     NAMI  "/usr/bin/java"
 93056 java     RET   stat -1 errno 2 No such file or directory
 93056 java     CALL  stat(0x7f7ffffc73d0,0x7f7ffffc7350)
 93056 java     NAMI  "/usr/X11R6/bin/java"
 93056 java     RET   stat -1 errno 2 No such file or directory
 93056 java     CALL  stat(0x7f7ffffc73d0,0x7f7ffffc7350)
 93056 java     NAMI  "/usr/local/sbin/java"
 93056 java     RET   stat -1 errno 2 No such file or directory
 93056 java     CALL  stat(0x7f7ffffc73d0,0x7f7ffffc7350)
 93056 java     NAMI  "/usr/local/bin/java"
 93056 java     RET   stat -1 errno 2 No such file or directory
 93056 java     CALL  stat(0x7f7ffffc73d0,0x7f7ffffc7350)
 93056 java     NAMI  "/usr/local/foobar/bin/java"
 93056 java     STRU  struct stat { dev=1027, ino=155914, mode=-rwxr-xr-x , nlink=1, uid=0<"root">, gid=7<"bin">, rdev=623880, atime=1508220651<"Oct 17 08:10:51 2017">.082047559, mtime=1506962794<"Oct  2 18:46:34 2017">, ctime=1508219904<"Oct 17 07:58:24 2017">.120012249, size=63657, blocks=128, blksize=16384, flags=0x0, gen=0x2049be9e }
 93056 java     RET   stat 0

(Please take this – and Java’s usage of /proc/self/exe above – with a grain of salt. I haven’t looked up Java’s source code, because it’s so incredibly complex, but the output above is a strong indicator that Java actually works this way. Either way, I’m just trying to make the point that manually traversing the $PATH is another possible approach.)

I won’t even start talking about the race conditions involved here.

Of course, this only works if your program still has access to the original $PATH variable. You can forcibly break this.

executor.c:

#include <unistd.h>

int
main()
{
    char *env[] = { "PATH=broken", NULL };
    /* Replace the environment; the argument list must end in a char * null pointer. */
    execle("my-sub-program", "foo", (char *)NULL, env);
    return 1;
}

my-sub-program.c:

#include <stdio.h>
#include <stdlib.h>

int
main()
{
    printf("PATH=%s\n", getenv("PATH"));
    return 0;
}

Running:

$ export PATH=$PATH:.
$ executor
PATH=broken

The fact that it shows PATH=broken means the sub-program has been executed – but it can no longer see the original $PATH.

Workaround #3: Hard code your path

This is what many programs do. A lot of software is distributed as source code and there is often a build process. During the build, a fixed string can be put into the compiled program. When you finally invoke it, it won’t have to “resolve” anything; it will just look under the hard-coded path.

This is what ratterplatter does:

README, line 33

(Update 2023-08-13: That’s not the case anymore, it now avoids the problem by requiring the user to specify the path in its argv. Works fine for this program.)

Or irssi:

But of course … there are scenarios where this is not feasible. For example, when you’re dealing with a rather simple script without a build process, as was the case for the guy who originally asked me.

What does it all mean?

It doesn’t look too good. Whatever you do, it’s fragile.

Unless: Maybe I’m missing something here. Is there a better way? Please tell me. :-)

Comments?
