@slonopotamus

This syscall is required for multiple use cases; one of them is Buildah.

See moby/moby#42441

Also, see https://github.com/containers/buildah/blob/b74149334e3ca3d1898f4e46f4ea94db60d14eaa/chroot/run_linux.go#L152-L160

Tested by building Moby with these changes and successfully running Buildah inside Docker.

Without these changes, it fails:

$ docker run --rm -it quay.io/buildah/stable:latest buildah
Error during unshare(CLONE_NEWUSER): Operation not permitted
ERRO[0000] parsing PID "": strconv.Atoi: parsing "": invalid syntax 
ERRO[0000] (Unable to determine exit status)

},
},
},
},
Member

CLONE_NEWUSER does not seem covered in this PR.

Anyway, it might still be scary to allow CLONE_NEWUSER by default, given its several past vulnerabilities (CVE-2023-0386, etc.)

The Docker documentation lists unshare as a thing that is blocked by default, but doesn't provide an example of how an authorized user would unblock it or an argument to it for a particular workload. It just says you can pass a seccomp profile (which it seems to expect you to already be competent to create based on the default one) or you can use --security-opt seccomp=unconfined.

Given the number of people who actually know how to write a seccomp profile, and that there isn't a slightly-more-permissive one included as an available option, probably the vast majority of workloads that need user namespaces are currently running completely unconfined.

Author

CLONE_NEWUSER does not seem covered in this PR.

Whoops, I somehow messed up the PR and added CLONE_NEWUTS twice.

Author

@slonopotamus Oct 2, 2025

Anyway, it might still be scary to allow CLONE_NEWUSER by default, given its several past vulnerabilities (CVE-2023-0386, etc.)

There is a long discussion in moby/moby#42441 about the safety of these. I just wanted to make a PR that (if this feature is declared safe) would implement the change.

And as I said, my specific use case is to make Buildah work. Building with Buildah inside a container will always be at least as safe as (and normally much safer than) building with docker build outside of a container, even if a future vulnerability is found in unshare.

I agree with @adamnovak that writing a custom seccomp policy is too hard.

probably the vast majority of workloads that need user namespaces are currently running completely unconfined

Or, worse yet, they use custom builds of applications with sandboxing disabled (https://bugs.passt.top/show_bug.cgi?id=116#c6).

{
Index: 0,
Value: unix.CLONE_NEWNS,
ValueTwo: unix.CLONE_NEWNS,
Member

Why is ValueTwo needed?

Author

You're right, it is not needed. This is the first day I am writing seccomp policies :) Removed.

This syscall is required for multiple use cases; one of them is Buildah.

See moby/moby#42441

Signed-off-by: Marat Radchenko <[email protected]>
@sbrivio-rh

By the way, I just found this one after moving #5 to the right repository (yesterday, I filed moby/moby#51130).

It comes after a passt user happily reported at https://bugs.passt.top/show_bug.cgi?id=116 that they're now finally running a custom build of passt with a ton of security features commented out, so that it runs under Docker.

I'm not sure what to do with my pull request. On one hand, it comes with an explanatory diagram, and it's a bit different from this one. On the other hand, this one came first and I guess it could be amended for what passt and pasta need. @slonopotamus what should I do?

@slonopotamus
Author

slonopotamus commented Oct 8, 2025

I'm not sure what to do with my pull request.

Me neither. My personal opinion is that it's not Docker's job to guard the kernel against its own bugs, and it should just allow the full unshare as your PR does. And I totally agree that the ban on unshare reduces security. But Docker devs might have a different opinion.

@sbrivio-rh

Me neither. My personal opinion is that it's not Docker's job to guard the kernel against its own bugs, and it should just allow the full unshare as your PR does.

Yeah, same here. In any case, I'm trying to... try out your patch (checking passt in a Docker container), and I'm facing some issues because it looks like building Docker now requires Docker itself (!)

$ make
mkdir bundles
docker buildx bake binary
make: docker: No such file or directory

but I'm not really keen on installing third-party packages. Anyway, I managed to run that with some hacks, but now I only have a new dockerd built there, not docker. Still trying to figure that out. I used to build Moby without particular issues but that was months ago...

@sbrivio-rh

Hmm, okay, I guess the BPF program (the seccomp profile) is actually loaded by dockerd. But even amending your patch to:

diff --git a/vendor/github.com/moby/profiles/seccomp/default_linux.go b/vendor/github.com/moby/profiles/seccomp/default_linux.go
index 26c50c342e..a56cd26c21 100644
--- a/vendor/github.com/moby/profiles/seccomp/default_linux.go
+++ b/vendor/github.com/moby/profiles/seccomp/default_linux.go
@@ -646,6 +646,22 @@ func DefaultProfile() *Seccomp {
 				Arches: []string{"s390", "s390x"},
 			},
 		},
+		{
+			LinuxSyscall: specs.LinuxSyscall{
+				Names: []string{
+					"unshare",
+				},
+				Action: specs.ActAllow,
+				Args: []specs.LinuxSeccompArg{
+					{
+						Index:    0,
+						Value:    unix.CLONE_NEWUSER,
+						ValueTwo: 0,
+						Op:       specs.OpMaskedEqual,
+					},
+				},
+			},
+		},
 		{
 			LinuxSyscall: specs.LinuxSyscall{
 				Names: []string{

I'm getting:

/ # passt -f
[...]
Couldn't create user namespace: Operation not permitted

I'm not sure why. I guess I'll try dumping the BPF program with strace -e prctl and running it through something like https://seitan.rocks/seitan/tree/debug/disasm.c next, but there must be a simpler way...
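One possible reading of the diff above, not confirmed anywhere in this thread: with the masked-equal operator, the filter matches when (arg & Value) == ValueTwo. With Value set to CLONE_NEWUSER and ValueTwo set to 0, the rule would therefore match only unshare calls that do not request a user namespace, which would be consistent with the EPERM seen here. The sketch below just models that comparison in plain Go; `maskedEqual` is a hypothetical helper and the constants are the Linux flag values.

```go
package main

import "fmt"

// Linux clone flag values.
const (
	cloneNewNS   uint64 = 0x00020000
	cloneNewUser uint64 = 0x10000000
)

// maskedEqual models the masked-equal comparison: (arg & value) == valueTwo.
func maskedEqual(argVal, value, valueTwo uint64) bool {
	return argVal&value == valueTwo
}

func main() {
	calls := []struct {
		name  string
		flags uint64
	}{
		{"unshare(CLONE_NEWUSER)", cloneNewUser},
		{"unshare(CLONE_NEWUSER|CLONE_NEWNS)", cloneNewUser | cloneNewNS},
		{"unshare(CLONE_NEWNS)", cloneNewNS},
	}
	for _, c := range calls {
		asWritten := maskedEqual(c.flags, cloneNewUser, 0)              // ValueTwo: 0
		flagRequired := maskedEqual(c.flags, cloneNewUser, cloneNewUser) // ValueTwo: CLONE_NEWUSER
		fmt.Printf("%s: ValueTwo=0 matches=%v, ValueTwo=CLONE_NEWUSER matches=%v\n",
			c.name, asWritten, flagRequired)
	}
}
```

Under this model, the rule as written matches only the third call, while setting ValueTwo to CLONE_NEWUSER matches the first two. This would not explain errors coming from outside the seccomp filter.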

@sbrivio-rh

I tried with my version and now I'm getting EACCES, instead of EPERM (which is what the BPF program would return), so there's something else.

A change to the seccomp profile is needed in any case, but I would like to test this properly, so I'm going back and asking the reporter of https://bugs.passt.top/show_bug.cgi?id=116 about anything else that was needed to run passt other than not loading a seccomp profile.

If you want to try this out, by the way, just install passt (it's available on most distributions) and try running passt -f.

@AkihiroSuda
Member

I'm facing some issues because it looks like building Docker now requires Docker itself (!)

Practically you can just use go build ./cmd/dockerd

@sbrivio-rh

Practically you can just use go build ./cmd/dockerd

I didn't know, that makes things much simpler, thanks!
