
I made https://github.com/rubenvannieuwpoort/atomic-exchange for my use case.
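For reference, the Linux primitive for an atomic swap like that is renameat2() with the RENAME_EXCHANGE flag (presumably what the tool wraps; I haven't checked its source). A minimal sketch, assuming Linux >= 3.15, glibc >= 2.28, and a filesystem that supports the flag:

  #define _GNU_SOURCE
  #include <stdio.h>   /* renameat2() and RENAME_EXCHANGE (glibc >= 2.28) */
  #include <fcntl.h>   /* AT_FDCWD */

  int main(int argc, char **argv) {
      if (argc != 3) { fprintf(stderr, "usage: %s A B\n", argv[0]); return 2; }
      /* Both paths must already exist; they swap atomically or not at all. */
      if (renameat2(AT_FDCWD, argv[1], AT_FDCWD, argv[2], RENAME_EXCHANGE) != 0) {
          perror("renameat2");
          return 1;
      }
      return 0;
  }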
You're misinterpreting the title. The author didn't intend "Unix" to literally mean only the official AT&T/The Open Group UNIX® System to the exclusion of Linux.
The first sentence's "UNIX-like" makes that clear: >This is a catalog of things UNIX-like/POSIX-compliant operating systems can do atomically,
Further down, he then mentions some Linux specifics: >fcntl(fd, F_GETLK, &lock), fcntl(fd, F_SETLK, &lock), and fcntl(fd, F_SETLKW, &lock). [...] There is a “mandatory locking” mode but Linux’s implementation is unreliable as it’s subject to a race condition.
Linux-specific open file description locks could be brought up in a modern version of TFA though.
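For reference, a hedged sketch of an OFD lock (Linux >= 3.15, glibc >= 2.20): same struct flock as POSIX record locks, but the lock belongs to the open file description rather than the process, so closing some other fd referring to the same file doesn't drop it:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>

  int main(void) {
      int fd = open("some.lock", O_RDWR | O_CREAT, 0644);
      struct flock fl = {
          .l_type = F_WRLCK,     /* exclusive (write) lock */
          .l_whence = SEEK_SET,
          .l_start = 0,
          .l_len = 0,            /* 0 means "through end of file" */
          .l_pid = 0,            /* must be 0 for OFD locks */
      };
      if (fcntl(fd, F_OFD_SETLKW, &fl) != 0) {  /* blocking acquire */
          perror("fcntl(F_OFD_SETLKW)");
          return 1;
      }
      /* ... critical section; released on close(fd) or process exit ... */
      return 0;
  }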
Differing philosophies of how to interpret titles: prescriptive vs. descriptive language.[0]
There can be different usages of the word "Unix":
#1: Unix is a UNIX(tm) System V descendant. The emphasis is that the kernel itself needs to be UNIX. In this strict definition, you get the common reminder that "Linux is not a Unix!"
#2: "Unix" as a loose generic term for a family of o/s that looks/feels like Unix. This perspective includes using an o/s that has userland Unix utilities like cat/grep/awk. Sometimes deliberately styled as asterisk "*nix" or a suffix-qualifier "Unix-like" but often just written as a naked "Unix".
A Prescriptivist says the author's title is "incorrect". On the other hand, a Descriptivist looks at the whole content of the article -- notices the text has a lot of Linux-specific info such as fcntl(,F_GETLEASE/F_SETLEASE), and that every hyperlink to a man page points to https://linux.die.net/man/, etc. -- and thus determines that the author is using "Unix"(#2) in the looser way that can include some Linux idiosyncrasies.
"Unix" instead of "*nix" as a generic term for Linux is not uncommon. Another example article where the authors use the so-called incorrect "Unix" in the title even though it's mostly discussing Linux CUPS instead of Solaris : https://www.evilsocket.net/2024/09/26/Attacking-UNIX-systems...
I wasn't claiming that. I just thought the ggp had a useful comment about renameat2(), which led to the gp's "correction", which wasn't 100% accurate.
IBM z/OS UNIX also has renameat2(). It doesn't have the Linux-specific flag RENAME_EXCHANGE.
https://www.ibm.com/docs/en/zos/3.1.0?topic=functions-rename...
If Kubernetes starts using renameat2(RENAME_EXCHANGE), they could very plausibly add it.
[0] https://www.ibm.com/docs/en/zos/3.2.0?topic=csd-unshare-bpx1...
Many people write "UNIX/POSIX" without ever reading what the standard actually says.
> POSIX-compliant
Which, FWIW, doesn't mean Linux. AFAIK there is no Linux distro that's fully compliant, even before you worry about the specifics of whether it's certified as compliant.
I read the author's use of "POSIX-compliant" as a loose and fuzzy family category rather than an exhaustive and authoritative reference on 100% strict compliance. Therefore, the author mentioning non-100%-compliant Linux is ok.
There seem to be two different expectations and interpretations of what the article is about.
- (1) The article is attempting to be a strict intersection[0] of all Unix-like systems that conform to the official UNIX POSIX API. I didn't think this was a reasonable interpretation, since we can't be sure the author actually verified/tested other POSIX-like systems such as FreeBSD, HP-UX, IBM AIX, etc.
- (2) The article is a looser union[0] of operating systems and can also include idiosyncrasies of certain systems the author is familiar with, like Linux, that don't apply to all other UNIX systems. I think some readers don't realize that all the author's citations to man pages point to Linux-specific URLs at https://linux.die.net/man/
The ggp's (amstan) additional comment about renameat2(,,,,RENAME_EXCHANGE) is useful info and is consistent with interpretation (2).
If the author really didn't want Linux to be lumped in with "POSIX-like", it seems he would have avoided linux.die.net and instead pointed to something more of a UNIX standard, such as: https://unix.org/apis.html
[0] Intersection vs Union: https://en.wikipedia.org/wiki/Set_(mathematics)#Intersection
As in: Unix-like OR POSIX-compliant
In that light, it's probably fine to not nitpick over certifications here.
POSIX file locking semantics really are broken beyond repair: https://news.ycombinator.com/item?id=46542247
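The classic footgun, for anyone who hasn't hit it: a process's fcntl() record locks on a file are released when the process closes any descriptor referring to that file, not just the one the lock was taken through. A minimal sketch (error handling elided):

  #include <fcntl.h>
  #include <unistd.h>

  int main(void) {
      int fd1 = open("lockfile", O_RDWR | O_CREAT, 0644);
      struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
      fcntl(fd1, F_SETLK, &fl);             /* take a write lock via fd1 */

      int fd2 = open("lockfile", O_RDONLY); /* e.g. some library peeks at the file */
      close(fd2);                           /* ...and silently releases fd1's lock */

      /* The lock is gone here even though fd1 is still open. */
      return 0;
  }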
msync() syncs content in memory back to _disk_. But multiple processes mapping the same file already see the same content (barring memory consistency, caching, etc.), unless the file is mapped with MAP_PRIVATE.
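A minimal sketch of that distinction (error handling elided):

  #include <sys/mman.h>
  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  int main(void) {
      int fd = open("data.bin", O_RDWR | O_CREAT, 0644);
      ftruncate(fd, 4096);
      char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

      memcpy(p, "hello", 5);    /* other processes mapping the file see this now */
      msync(p, 4096, MS_SYNC);  /* ...but it's only durable on disk after this */

      munmap(p, 4096);
      close(fd);
      return 0;
  }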
I worked on a code base that was portable between Linux, AIX, and some other Unix flavors. mmap/msync was a source of bugs. Just imagine your system running for days, never syncing any data to disk... then someone pulls the plug. Where'd my data go? Even worse, it happened "in production" at a beta site. Fortunately we had a way to recover data from a log.
There's also `flock`, the CLI utility in util-linux, which lets you use flock(2) locks in shell scripts.
  frenameat2(srcdirfd, srcfd, srcname, dstdirfd, dstfd, dstname)

What else it does not do is a transaction over multiple objects. That is why, if I designed an operating system, I would make it possible to do a transaction over multiple objects.
https://learn.microsoft.com/en-us/windows/win32/fileio/about...
In some other cases, I've used a pattern based on a symlink to folders: the symlink is created, resolved, or updated atomically, and all I need is eventual consistency.
That last case was to manage several APT repository indices. The indices were constantly updated to publish new testing or unstable releases of software, and machines in the fleet were regularly fetching the repository index. The APT protocol and structure, being a bit "dumb" (for better or worse), require you to fetch files (many of them) in the reverse order they are created, which leads to obvious issues: the signature is updated only after the list of files is updated, and the list of files is created only after the list of packages is created.
Long story short: each update would create a new folder that's internally consistent, and a symlink would be repointed to the last created folder (atomically replacing the reference, since it was not possible to swap the folders themselves). A small HTTP server would initiate a server-side session when the first file was fetched and only return files from the same index list. Everything is eventually consistent, and we never got APT complaining about signature or hash mismatches. The pivotal component was indeed the atomicity of updating a symlink, as the Java implementation didn't have access to a more modern "openat" syscall relative to a specific folder.
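For anyone wanting to replicate this: symlink(2) can't overwrite an existing link, so the usual pattern (a hedged sketch; names are mine, not their code) is to create the new link under a temporary name and rename(2) it over the old one, which replaces it atomically:

  #include <stdio.h>
  #include <unistd.h>

  /* Repoint link_path at target_dir; readers always resolve the old or the
     new target, never a missing link. */
  int repoint(const char *target_dir, const char *link_path) {
      char tmp[4096];
      snprintf(tmp, sizeof tmp, "%s.tmp", link_path);
      unlink(tmp);                      /* ok if it didn't exist */
      if (symlink(target_dir, tmp) != 0)
          return -1;
      return rename(tmp, link_path);    /* atomic replacement */
  }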
mv a b
mv c d
We could observe a state where a and d exist? I would find such "out of order execution" shocking. If that's not what you're saying, could you give an example of something you want to be able to do but can't?
As to whether it’s technically possible for it to happen on a system that stays on, I’m not sure, but it’s certainly vanishingly rare and likely requires very specific circumstances—not just a random race condition.
You can remedy 2) by doing fsync() on the parent directory in between. I just asked ChatGPT which directory you need to fsync. It says it's both the source and the target directory. Which "makes sense" and simplifies implementations, but it means the rename operation is atomic only at runtime, not if there's a crash in between. I think you might end up with 0 or 2 entries after a crash if you're unlucky.
If that's true, then for safety maybe one should never rename across directories, but instead do a coordinated link(source, target), fsync(target_dir), unlink(source), fsync(source_dir).
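A sketch of that sequence (error handling elided; function names are mine). The point is that the new directory entry is durable before the old one is removed, so a crash leaves one or two names for the file, never zero:

  #include <fcntl.h>
  #include <unistd.h>

  static void fsync_dir(const char *dir) {
      int fd = open(dir, O_RDONLY | O_DIRECTORY);
      fsync(fd);    /* flush the directory's own entries to disk */
      close(fd);
  }

  void durable_move(const char *src, const char *src_dir,
                    const char *dst, const char *dst_dir) {
      link(src, dst);       /* the file briefly has two names */
      fsync_dir(dst_dir);   /* make the new entry durable first */
      unlink(src);
      fsync_dir(src_dir);   /* then make the removal durable */
  }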
Or to go a level deeper: if you have two occurrences of rename(2) from the C library ...
rename("a", "b"); rename("c", "d");
...and the compiler decides on out-of-order execution, or optimizes by scheduling them on different CPUs, you can get a and d existing at the same time.
The reason it won't happen in the example you posted is that the shell ensures the ordering (by not forking the second mv until the wait() on the first returns).
`inotifywait` actually sees them in order, but nothing ensures that it's that way.
$ inotifywait -m /tmp
/tmp/ MOVED_FROM a
/tmp/ MOVED_TO b
/tmp/ MOVED_FROM c
/tmp/ MOVED_TO d
`stat` tells us that the timestamps are equal as well.

$ stat b d | grep '^Change'
Change: 2026-02-06 12:22:55.394932841 +0100
Change: 2026-02-06 12:22:55.394932841 +0100
However, speeding things up changes it a bit. Given:
$ (
set -eo pipefail
for i in {1..10000}
do
printf '%d ' "$i"
touch a c
mv a b &
mv c d &
wait
rm b d
done
)
1 2 3 4 5 6 .....
And with `inotifywait` I saw this when running it for a while.

$ inotifywait -m -e moved_from,moved_to /tmp > /tmp/output
cat /tmp/output | xargs -l4 | sort | uniq -c
9104 /tmp/ MOVED_FROM a /tmp/ MOVED_TO b /tmp/ MOVED_FROM c /tmp/ MOVED_TO d
896 /tmp/ MOVED_FROM c /tmp/ MOVED_TO d /tmp/ MOVED_FROM a /tmp/ MOVED_TO b

Even then it is only some file systems that guarantee it, and even then file size updating isn’t atomic AFAIK.
Not so sure about file size update being atomic in this case but fairly sure about the rest.
Matklad had some writing or video about this.
Also, there is a tool called ALICE, and the authors of that tool have a white paper on this subject.
Also, there was a blog post about how the Badger database fixed some issues around this problem.
If there is a failure like a crash or a power outage, then it doesn’t work like that.
In terms of reliability, you might as well be pushing into an in-memory data structure and writing everything to disk at program exit.
POSIX says that for a file opened with O_APPEND "the file offset shall be set to the end of the file prior to each write." That's it. That's all it does.
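That one line is still what makes the classic multi-writer append-only log work: the implicit seek-to-end and the write happen as a single step, so concurrent appenders can't clobber each other's offsets. A minimal sketch (error handling elided; note O_APPEND is known to be unreliable on NFS):

  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  int main(void) {
      int fd = open("app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
      const char *rec = "one log record\n";
      /* Each write() lands at the then-current end of file, atomically with
         respect to other O_APPEND writers of the same file. */
      write(fd, rec, strlen(rec));
      close(fd);
      return 0;
  }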
That tradeoff is at the root of why most notify APIs are either approximate (events can be dropped) or rigidly bounded by kernel settings that prevent truly arbitrary numbers of watches. fanotify and some implementations of kqueue are better at efficiently triggering large recursive watches, but that’s still just a mitigation on the underlying memory/performance tradeoffs, not a full solution.
Inotify is the way to shovel these events out of the kernel; after that, normal userspace process rules apply. It's maybe not elegant from your POV, but it's simple.
There are sample "drivers" in easily modified Python that are fast enough for casual use.
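The raw API is small, too. A minimal sketch of such a driver in C (one watch on /tmp, printing the same events inotifywait reports elsewhere in this thread; error handling elided):

  #include <sys/inotify.h>
  #include <limits.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void) {
      int fd = inotify_init1(0);
      inotify_add_watch(fd, "/tmp", IN_MOVED_FROM | IN_MOVED_TO);

      char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
          __attribute__((aligned(__alignof__(struct inotify_event))));
      ssize_t n;
      while ((n = read(fd, buf, sizeof buf)) > 0) {
          /* a single read() may return several events back to back */
          for (char *p = buf; p < buf + n;) {
              struct inotify_event *ev = (struct inotify_event *)p;
              printf("%s %s\n",
                     (ev->mask & IN_MOVED_FROM) ? "MOVED_FROM" : "MOVED_TO",
                     ev->len ? ev->name : "");
              p += sizeof(struct inotify_event) + ev->len;
          }
      }
      return 0;
  }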
if (condition) { do the thing; }
With that said, at least for C and C++, the behavior of (std::)atomic in interprocess interactions is slightly outside the scope of the standard, but in practice (and as at least recommended by the C++ standard), (atomic_)is_lock_free() atomics are generally usable between processes.
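A minimal C11 sketch of that pattern (the setup is mine, not from the standard): an _Atomic counter placed in a MAP_SHARED anonymous mapping and shared across fork(). The same idea applies to any shared memory segment, provided the type really is lock-free on the platform:

  #include <stdatomic.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void) {
      atomic_int *counter = mmap(NULL, sizeof *counter, PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      atomic_init(counter, 0);

      if (fork() == 0) {               /* child */
          for (int i = 0; i < 100000; i++)
              atomic_fetch_add(counter, 1);
          _exit(0);
      }
      for (int i = 0; i < 100000; i++) /* parent */
          atomic_fetch_add(counter, 1);
      wait(NULL);

      /* expect "lock-free: 1, count: 200000" on typical platforms */
      printf("lock-free: %d, count: %d\n",
             atomic_is_lock_free(counter), atomic_load(counter));
      return 0;
  }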
In the interview, when they were describing this problem, I asked why they didn't just put all of the new release in a new dir and use symlinks to roll forward and backwards as needed. They kind of froze and looked at each other, and all had the same 'aha' moment. I ended up not being interested in taking the job, but they still made sure to thank me for the idea, which I thought was nice.
Not that I'm a genius or anything, it's something I'd done previously for years, and I'm sure I learned it from someone else who'd been doing it for years. It's a very valid deployment mechanism IMO, of course depending on your architecture.
Just git branch (one branch per region because of compliance requirements) -> branch creates "tar.gz" with a predefined name -> automated system downloads the new "tar.gz", checks release date, revision, etc. -> new symlink & PHP (serverless!!!) graceful restart and ka-b00m.
Rollbacks worked by pointing the symlink back to the old dir & restarting.
Worked like a charm :-)
That's how Chrome updates itself, but without the symlink part.
The OS core is deployed as a single unit and is a few GB in size, pretty small when internal storage runs into the hundreds of GB.