Troubleshooting A Symlink — A Whodunnit For The Git Record Books

While I normally sport the well-worn fedora of a hard-boiled sysadmin, Sunday mornings I swap that neo-noir accessory for the tech-noir: a pair of pro headphones. This is the tale of the collision of those two roles. An educational caper, dear reader. You see, my weekly gig is to run a Facebook Live Stream, and Facebook just recently began enforcing a new policy: all video streams are required to use encryption. We have Fedora installed on the media machine, and use Open Broadcaster Software (OBS) to stream. It should have been easy to update the stream settings. I made the necessary changes and tested it out — no luck. The error message was less than helpful: “Failed to connect to server”. With a sigh, I took off my headphones, put my sysadmin hat on, and walked out into the digital darkness. It was time to get back to work.

What the RTMPS?

Some terms, before we dive in. RTMP is the Real Time Messaging Protocol, originally developed by Macromedia. Thanks to Adobe, a version of RTMP is now an open specification, and many video streaming services now use it to transport live audio and video across the internet. RTMPS is simply the encrypted version, where RTMP is wrapped inside a TLS/SSL connection. TLS, Transport Layer Security, is the same protocol that powers HTTPS. TLS does depend on your machine having a good copy of the certificate bundle, a collection of public keys that are considered to be trustworthy.

How does one go about trying to fix this sort of problem? A good first step is to get a more useful error message. Running OBS from a command line lets us see all the extra output messages that are usually invisible. Below I’ve cut out all the extra messages, to highlight the failing connection attempt.

info: [rtmp stream: 'simple_stream'] Connecting to RTMP URL rtmps://rtmp-api.facebook.com:443/rtmp/...
info: RTMP_Connect1, TLS_Connect failed: -0x7680
info: [rtmp stream: 'simple_stream'] Connection to rtmps://rtmp-api.facebook.com:443/rtmp/ failed: -2
info: ==== Streaming Stop ================================================

Now we’re getting somewhere. The failure is specifically in the TLS connection, which we could have guessed. We also get an error code. Quick note on trying to search for that error: Google will interpret the leading dash as an indicator that you want results that don’t include that search term. Surrounding it in quotation marks, "-0x7680", is the way to get useful results.

How to Narrow Down the Cause of an Error

Searching for that error number will bring up two interesting hits. One is the thread in the OBS forum where those of us facing this problem have been discussing it. The other hit is the mbedtls documentation, where an error with this code is defined. It’s possible that it’s a false positive, but since we’re troubleshooting a TLS issue, it’s likely related. That error is MBEDTLS_ERR_SSL_CA_CHAIN_REQUIRED, and is described as “No CA Chain is set, but required to operate.”
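Logs and documentation don't always agree on notation, so it can help to convert a raw return code into the negative-hex form that OBS prints before searching for it. What follows is a minimal sketch, not code from OBS or mbedtls: the helper name format_tls_error() and the EXAMPLE_ macro are made up for illustration, and only the value -0x7680 comes from the error quoted above.

```c
#include <stdio.h>

/* Value taken from the error quoted above. The macro name is a local
 * stand-in (an assumption) so the example builds without mbedtls headers. */
#define EXAMPLE_ERR_SSL_CA_CHAIN_REQUIRED (-0x7680)

/* Hypothetical helper: write a negative return code as "-0xNNNN",
 * the same notation that shows up in the OBS log line. */
static int format_tls_error(int ret, char *buf, size_t len)
{
    return snprintf(buf, len, "-0x%04X", (unsigned int)-ret);
}
```

The same value printed in decimal would be -30336, which is worth knowing if a different tool logs the code with %d instead.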

So what’s next? We’ve learned a bit, but still don’t have any answers, so let’s dive into code. My standby quick-and-dirty debugging technique is to add printf() calls to help follow code execution, but where to start? We have a breadcrumb in the program output, “RTMP_Connect1”. Searching the OBS codebase on GitHub lands us at a function with that name. Partway through the function, we can see the command to print the log message that brought us here.

The error message indicates that a CA chain isn’t loaded. That sounds like an initialization problem. Perhaps that term “chain” is used in the OBS source. Searching our suspect file returns 18 hits, 17 of which are in a function named RTMP_TLS_LoadCerts(). I spent some time chasing down the execution flow of the RTMP connection, even making a sketch of when each function gets called. The code led me back to RTMP_TLS_LoadCerts(). That function has quite a bit of code in it, but we can safely ignore the parts that are specific to Windows or macOS. There is an obvious line that should load system certs.

    if (mbedtls_x509_crt_parse_path(chain, "/etc/ssl/certs/") < 0) {
        goto error;
    }

So OBS makes a function call to the mbedtls library, requesting that the certificates in /etc/ssl/certs/ are loaded. Let’s make sure a proper certificate file is actually there where OBS expects it to be:

ca-bundle.crt is where it's supposed to be, but why is that file a different color?

ca-bundle.crt is the file we’re looking for. Notice the teal color? Those three files are actually symlinks to another location. Files on Linux filesystems are symbolically linked all the time, so that’s likely not a problem. I spent some time checking things like file permissions, and tried disabling SELinux, but came up with nothing. It didn’t seem to be an overzealous security setting. All I was left with was the knowledge that I had an mbedtls function that should be loading the certificate bundle, but when the program actually tried to verify a TLS certificate, it complained that the chain, ca-bundle.crt, was missing. If the mbedtls function was failing, shouldn’t it error out there? The next logical step is to look at the documentation for that function.

Now we find the first real hint about what could be happening. mbedtls_x509_crt_parse_path() can partially fail, and still give us a return code that doesn’t trigger an error. So, time to use printf() to see what that return code is on my machine. I added the code, compiled, ran the output binary… and got no such log output.
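To make the three possible outcomes concrete, here is a small sketch of the kind of debug print I mean. It follows the return convention documented for mbedtls_x509_crt_parse_path(): zero when everything parsed, a positive count when some certificates failed, and a negative X509 or PEM error code on a hard failure. The helper names describe_parse_result() and log_parse_result() are hypothetical and are not part of OBS or mbedtls.

```c
#include <stdio.h>

/* Hypothetical helper: classify a return value that follows the convention
 * documented for mbedtls_x509_crt_parse_path():
 *   0    every certificate in the directory parsed
 *   > 0  that many certificates failed to parse, the rest were loaded
 *   < 0  a hard X509 or PEM error */
static const char *describe_parse_result(int ret)
{
    if (ret < 0)
        return "hard failure";
    if (ret > 0)
        return "partial success";
    return "full success";
}

/* Hypothetical debug print in the quick-and-dirty printf() style. */
static void log_parse_result(int ret)
{
    fprintf(stderr, "parse_path returned %d (%s)\n", ret,
            describe_parse_result(ret));
}
```

Calling log_parse_result() with the value returned by the real function is enough to tell a partial load apart from a hard error.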

It took longer than I care to admit for me to figure out why my code changes didn’t seem to make a difference when running the program. This is a potential gotcha to watch out for. OBS uses a modular structure, consisting of the OBS binary, as well as various loadable modules. The code changes I was making were a part of obs-outputs.so, and even when running the compiled binary, those modules were being loaded from their default locations. To test my changes, I had to explicitly tell OBS to use the newly compiled module.

System Calls That Hate Symlinks

Something was obviously amiss with mbedtls_x509_crt_parse_path(). I wasn’t seeing anything obvious in the documentation, but I did see a similar function, mbedtls_x509_crt_parse_file(). What would happen if we forced mbedtls to only try loading the one crt file that we care about? I made the change, compiled, and to my surprise OBS finally connected to Facebook Live. I had a real fix, but I still didn’t understand why it was broken. It’s time to look at the mbedtls source code.

The parse_path() function is easily found in the mbedtls source tree. Make sure to watch the #if defined() blocks, since we’re not interested in the code for Windows. Once we find the loop that runs for each file in the given path, the problem code might jump out at you.

        else if( stat( entry_name, &sb ) == -1 )
        {
            ret = MBEDTLS_ERR_X509_FILE_IO_ERROR;
            goto cleanup;
        }

        if( !S_ISREG( sb.st_mode ) )
            continue;

stat() is a system call that gets the status of a filesystem path, and then S_ISREG is a macro that checks whether that path is pointing at a regular file. Notably, the parse_file function does not do that check, and will happily load a symlink.

Report the Bug, But to Whom?

That, of course, is the core problem. Fedora uses a symlinked ca-bundle.crt, and mbedtls refuses to load ca-bundle.crt when it’s a symlink. We understand the problem, but what’s the proper fix? Which project actually has the bug here? OBS used the mbedtls function properly according to its documentation, and mbedtls may have a good security reason for refusing to load a bundle that’s actually a symlink. Is it up to RPMFusion and the package maintainer to fix the incompatibility? Personally, I think it’s really an mbedtls problem, particularly because this quirk isn’t mentioned in any documentation that I came across. Ultimately, it’s not my call which project needs to own this problem.

Our last task, then, is to report the bug we discovered. It’s a good idea to stop at this point, and ask yourself, is this bug a potential security issue? It’s best to try to report security issues privately, and most projects have contact instructions for disclosing those sorts of issues. Depending on where you found the problem, you may even be eligible for a bug bounty reward for finding the problem.

Assuming there is no security angle to consider, you’ll want to make a bug report. Does the project have a public bug tracker? That’s probably where it should go. If not, there is likely a mailing list where bugs are reported. Include enough information to reproduce the bug, and details on what you think is happening, but don’t include a bunch of log output in the bug or the mailing list. If it’s relevant, use pastebin or one of the other text hosting sites, to avoid including a wall of text in the bug report. If you have an idea of how to fix the problem, mention it. On a mailing list, patches are usually accepted. If the project is using GitHub or GitLab, you can report the bug, and turn around and submit a pull request to fix it.

Particularly for trivial changes, I tend to ask what the project prefers: should I send a pull request, or is this trivial enough to fix without one? If you’re looking to do future work on the project, doing a PR is a handy way to get your name into the git record. Projects are more likely to look kindly on your future work if there’s a record of you already fixing bugs.

The End of Another Tale

This one turned out well enough. OBS is adding some workaround code to make sure the ca-bundle is properly loaded on systems where it’s a symlink. The mbedtls project sees this behavior as a bug, and I’ve submitted a patch to fix it. I noticed a related logic bug in the certificate loading code, and it’s been acknowledged as well. I’ve patched my copy of OBS so live-streams work again. It’s all in a day’s work for the sysadmin. No rest for the weary, though. I have a pair of 10Gb Ethernet cards that die whenever they transfer VLAN tagged traffic. Just another case.

Errata

I know the more experienced programmers will point it out in the comments: stat() doesn’t ever set st_mode to S_IFLNK. stat() follows the symlink to report on the target, while lstat() tells you about the symlink itself. My fix worked, but the problem was slightly different than I thought it was. mbedtls_x509_crt_parse_path() can return a positive value if only some of the files in the specified directory successfully loaded. OBS was treating that positive value as a failure, and immediately dumping the certificates that had been loaded. Chasing false leads like this is totally par for the course when it comes to finding and fixing bugs. In the end, the bug is fixed, and that’s what really matters. Now if you’ll excuse me, I need to find something to wash the taste of crow out of my mouth.
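For anyone who wants to see the stat() versus lstat() difference for themselves, here is a minimal sketch for a POSIX system. It assumes a writable /tmp, and the function name compare_stat_lstat() and the file names are made up for the demonstration. It creates a regular file, symlinks to it, and records what each call reports for the symlink.

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical demo: create a regular file and a symlink to it in a fresh
 * temporary directory, then compare what stat() and lstat() report for the
 * symlink. Returns 0 on success and -1 on any setup error. */
static int compare_stat_lstat(int *stat_is_regular, int *lstat_is_symlink)
{
    char dir[] = "/tmp/symlink-demo-XXXXXX";
    char target[64], link_path[64];
    struct stat via_stat, via_lstat;
    int rc = -1;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(target, sizeof target, "%s/target.crt", dir);
    snprintf(link_path, sizeof link_path, "%s/ca-bundle.crt", dir);

    FILE *f = fopen(target, "w");
    if (f != NULL) {
        fclose(f);
        if (symlink(target, link_path) == 0) {
            if (stat(link_path, &via_stat) == 0 &&
                lstat(link_path, &via_lstat) == 0) {
                /* stat() follows the link and describes the target file. */
                *stat_is_regular = S_ISREG(via_stat.st_mode) ? 1 : 0;
                /* lstat() describes the link itself. */
                *lstat_is_symlink = S_ISLNK(via_lstat.st_mode) ? 1 : 0;
                rc = 0;
            }
            unlink(link_path);
        }
        unlink(target);
    }
    rmdir(dir);
    return rc;
}
```

On a successful run both flags come back set: the S_ISREG check passes through stat() even though the path is a symlink, which is why that check could not have been what rejected the symlinked bundle.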

14 thoughts on “Troubleshooting A Symlink — A Whodunnit For The Git Record Books”

  1. Symlinks are the bane of my linux life, since the 90s… one early live CD distro (Forgetting the name, they didn’t last, wonder why) had a circular symlink trail to something really critical, something like libso.i386 god knows how the supplied apps worked, because nothing else would install. (After install to HDD obviously) Then a slackware 4 or 5ish distro had symlinks to stuff that was still on, I presume, Volkerding’s box, that didn’t make it to the CDs or packages. Then a mid noughties ubuntu when everyone started assuring me it was real mature now, I got intractable problems with symlinks again having no target. I think it was one of those things that don’t happen if you start some unknown number of versions earlier and incrementally upgrade. But anyway, got the helpful responses “Well it works for me” yeah of course it does you nitwit, it’s pointing to something that’s on your dev system but plainly isn’t in the distribution. Pointed out exact chapter and verse on why it can’t work for anyone, and got “Well nobody uses it” great… that’s another distro I had to avoid for 5 years until they got their crap together.

    That’s only the really unfixable ones that stick in my mind. I’ve lost hours putting some others right.

    1. Lol you should see the stuff I’m working on…. I’m using so many symlinks,tmpfses, overlays, and bind mounts that I wrote a program to manage correctly ordering them all.

      It’s for read-only root filesystems with certain things mapped to a separate rw partition. It all works reliably, but as a non sysadmin the amount of command line work needed to get there was sanity-crushing.

      1. But, that’s in a shell, this is C code. (c;

        I finally clicked on the GitHub link in the article and it looks to be a (WordPress?) artifact. The actual code on GitHub is just a less-than operator.

    1. typo / formatting issue,
      search for "if (mbedtls_x509_crt_parse_path(chain,"
      in the OBS GitHub and it shows
      "if (mbedtls_x509_crt_parse_path(chain, "/etc/ssl/certs/") < 0) {"

  2. A faster way to debug this issue is to use strace:

    strace -o outfile -f -p $(pidof obs)

    Then look for the error message in outfile after you’ve killed strace and it has followed the error.

    You’d see all the syscalls obs made and the likely failure in the chain of stat(2) and access(2) calls.

    The strace program is my go-to problem solver for everything, since it works on opaque binaries, scripting languages, etc.

  3. I’ve saved much of my sanity thanks to strace. I’ve even used it to debug Python programs that would spit out utterly non-helpful errors. It was helpful when python package A would call package B and B would call C, D, E and W, et al I’d at least know what package the error was occurring in and could work on it from there.

  4. Really fascinating to read this despite a rudimentary knowledge of programming. For me it’s like watching an expert manipulate a safe open or get an ice engine running while explaining why he’s doing what he’s doing.
