263: Murphy's Law (again)

This is a macro-style shot of a tiny Porsche that they had at the gift shop, built into the wall.
  • -anything that can go wrong, will
  • -today kinda stunk, but also mention “we still experience the range of human emotions” because yes my life is very good
  • -S6 cancelled, had to take another train & bus
  • -Porsche museum was dope as FUCK except it didn’t have the 911 GT1 Straßenversion. Mention the wheels maybe, show off the tiny cars.
  • -train to Paris wasn’t as nice, but was still very cool
    • Went 315 km/h for like a while across the French countryside
    • No photos
    • Wifi was kinda ass
    • But! Re-learned myself some photoshop
  • Paris!
  • Vibram store closed for renovations
  • McDo at Gare de Lyon didn’t have the fake chicken nuggies I wanted to try
  • Had a (actually pretty good) salad instead
  • Figure skating
  • SERVER EXPLODED
    • Son of bitch
    • Might text Brandon to see if he can go in my room and restart it.
      • But also it’s only like 3 days, my very fervent readers (mostly El) will survive
      • miss her so much
  • Wind Power meeting
  • It’s so fucking late
  • Train to Nice tomorrow is like 6 hours long. So tired.

One of the first posts on this blog is named “Murphy’s Law”, because I ran into a ton of trouble with the blog when I was first setting it up, and it’s absolutely true that “anything that can go wrong, will” in this case. I’ve run into every conceivable problem with my server (where the blog lives), and many a problem within the blog itself. At this point I’ve had to solve problems with just about every single subsystem on the server. You’d think that would make the system more stable, but au contraire, it might actually be worse.

There’s a great post I saw at some point that goes something like “non-tech people assume that tech people must have their computers running smooth, but in fact, my systems are fucked up in ways that you can’t even imagine.”

It’s a perfect description of how I roll, honestly. My systems are unfathomably cursed, and especially the server is an exercise in the dumbest possible way to accomplish my given goal. It does still work, usually, but it’s pretty unreliable, and that is entirely my fault.

There’s some sunk-cost-fallacy tied to it now, though, because I could offload almost all of the processing and storage I do on my server to a service on the internet, or just get rid of it entirely (the blog isn’t strictly necessary, after all), but doing that would be ridiculously expensive, and a huge waste of sunk cost. The infrastructure I have is comical, IT professionals would have a laughing fit if they knew what I was running, but it cost me the bare minimum (I think I’m only like $400 total into it), and allowed me to re-use hardware that I already had (the server runs on an old laptop motherboard, after all). That $400 was mostly for hard drives, but I did the math and the cost of having an equivalent system running on the cloud somewhere is something to the tune of $200 a month, so it’s absolutely worth running it myself.
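The back-of-the-envelope math behind that is simple enough to sketch. The numbers are the rough ones from this post ($400 up front, something like $200 a month for a cloud equivalent, about a year and a half of running). The function names are made up for illustration, and electricity and time spent fixing things are deliberately left out.

```python
def break_even_months(upfront_cost, cloud_monthly_cost):
    """Months of cloud hosting that would cost as much as the one-time hardware spend."""
    return upfront_cost / cloud_monthly_cost


def cloud_cost_over(months, cloud_monthly_cost):
    """Total cloud bill after the given number of months."""
    return months * cloud_monthly_cost


HARDWARE_COST = 400   # dollars, mostly hard drives (rough figure from the post)
CLOUD_MONTHLY = 200   # dollars per month (rough estimate from the post)
MONTHS_RUNNING = 18   # "like a year and a half"

print(break_even_months(HARDWARE_COST, CLOUD_MONTHLY))
print(cloud_cost_over(MONTHS_RUNNING, CLOUD_MONTHLY))
```

On those numbers the hardware pays for itself after about two months, and a year and a half of the cloud version would have run roughly $3,600.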

I also couldn’t run the smart-home parts of it online at all, and that might be my favorite part, honestly, so it’s definitely worth it.

I have to talk myself back into it being worth it every time the system decides to blow up, like it has as I’m writing this. I really need to figure out a better solution to some of these problems, because this thing is unreliable enough that it costs me a number of hours every so often. At this point, including initial setup of everything, I’m probably 300-ish hours deep, spread out over like a year and a half. That’s something like half an hour a day that I’ve had to spend working on this bastard, and while it’s working it is worth it, but yikes. This most recent period of functionality was 45 days long, but the periods before that were consistently 3 or 4 days. I solved that problem, as I talked about in an old post (I should link it but tbh I might forget), but I have literally no idea what caused the failure this time.

I’ll update the system when I get back, because as part of that SNAFU a month and a half ago I rolled back the version to the one I started on, which solved one of the problems I was having, but may have caused more (like the current whatever-this-is).


I have seen some of what this current failure manifested as before, though. The failure yesterday was an internal one, where only some parts of the system went down, not the whole thing, and not all at once. The exact cause was novel, but I had seen the symptoms before. Something weird goes wrong internally, where the bridge-0 network fails and anything that references other containers or ports on the 10.27.27.29/0 internal network (which is the LAN address) fails. I think the only thing that still makes those references is Ghost, the blogging platform, when trying to connect to its database. Everything else uses internal addresses for different VLANs (I think I have 176.0.0.16 through 176.0.0.20 set up for different subsystems), and they all stay connected during this specific type of failure. When I rebuild the system this time I’m going to change the Ghost database reference to its internal address, something that I’d thought about before but never actually done.

Every subsystem has its own tunnel, which is like a little VPN that connects the specific service to the Cloudflare network, which then keeps track of which subdomain (that’s the thing before the main website, jellyfin.lukeoliver.net rather than just lukeoliver.net, for example) goes to which service. The tunnels and individual services largely survive this type of failure, so I want to try to spread that resiliency to the rest of the system (specifically the blog).

It’s an annoying state to be in, because the server is clearly working in some respects: my photo and movie services continued to work just fine, as did my password manager, but the blog and my file storage system both stopped working. Those two are the more complicated parts of the system. I think the file server went down because its tunnel references its external address (that 10.27.27.29:XXXX) rather than its internal one. The blog was still connected to its tunnel, because I was getting an internal error, but that error was that the blog system itself had lost contact with the database it uses to manage everything, so it was effectively dead.
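A quick way to tell which address is actually dead during one of these failures is to try a plain TCP connection to each one. This is a minimal sketch using only Python’s standard library. The names `can_connect`, `parse_target`, and `report` are illustrative, and the host:port strings are whatever gets passed in (the real ports aren’t given here, so none are filled in).

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False


def parse_target(text):
    """Split a "host:port" string into (host, port as int)."""
    host, _, port = text.rpartition(":")
    return host, int(port)


def report(targets, timeout=2.0):
    """Print and return an up/down result for each "host:port" string."""
    results = {}
    for text in targets:
        host, port = parse_target(text)
        results[text] = can_connect(host, port, timeout)
        print(f"{text}: {'up' if results[text] else 'DOWN'}")
    return results
```

Calling `report([...])` with the same service listed once by its LAN address and once by its internal address would show which path broke. A successful connect only proves something is listening on that port, not that the service behind it is healthy.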


The thing that set this particular failure apart from the others is that I could still access the server itself. I have a VPN set up so that my laptop and servers pretend that they’re on a local network together, which gives me good, secure access from anywhere (shoutout Tailscale). In the past, when I had this external-address failure, I couldn’t connect to the server using Tailscale, but this time I could, which is strange.

The thing that really got me, though, is that the server said I wasn’t running any containers. Docker, the system that I use to run so many little services in their own little boxes, is almost never the problem, and I don’t know if Docker itself was the problem this time, but the server told me that the Docker service had failed to start entirely, despite the fact that it was verifiably running (photo server and tunnels still running just fine). The system did say that it was running something, though: the RAM was almost full (might have been part of the problem tbh) and the CPU was working, so there must have been some internal disconnect within the system. I tried toggling the Docker service from the GUI, but that didn’t get me anywhere, and then when I tried spinning down the array (effectively stopping everything for maintenance), it just… wouldn’t. The logs simply said that the drives were busy, and therefore couldn’t be spun down. I’m thinking that since the Docker service had disconnected, the spin-down process couldn’t shut down the containers to then spin down the disks, so it just got stuck in a loop.

If I were more knowledgeable (and in hindsight), I think there was probably a way to hit it from the command line and do a slightly better job, but there’s no guarantee that would have worked either, so there’s only a little bit of regret there, not a ton.
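For the "drives were busy" part, one command-line angle is to look for whichever processes still have files open on the disks. This is a rough sketch, not a known fix for this failure. It assumes a Linux host with `/proc`, it needs root to see other users’ processes, and the function name and the mount path in the usage note are placeholders.

```python
import os


def pids_holding(path):
    """Map PID -> open file paths under `path`, found by walking /proc/<pid>/fd.

    Linux only. Processes we are not allowed to inspect are silently skipped.
    """
    root = os.path.realpath(path)
    holders = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join("/proc", entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            # Process exited, or we lack permission to look at it.
            continue
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were scanning
            if target == root or target.startswith(root + os.sep):
                holders.setdefault(int(entry), []).append(target)
    return holders
```

Something like `pids_holding("/path/to/array/mount")` would list the processes keeping the disks busy, which at least narrows down what to stop by hand. It only checks open file descriptors, so a process that merely has its working directory on the disk would not show up.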

The reason there’s regret at all is that I risked a full restart, which might be the single riskiest thing I can do on this server (but I was out of other options). Full restarts have maybe a 25% chance of success. It’s one of the serious issues that persists on the server, because if a restart fails, my only recourse is to go in person and do a full, hard reset (unplug it and plug it back in). I don’t know what step of the process fails, nor how I would fix it, but it’s definitely an artifact of the sheer jank with which the system is set up. I know it can run smoothly, it’s done it before (I think my best uptime is something in the 100-day range, during fall quarter), but the initialization of the system is super unreliable.

I did reach out to the RA group chat to see if anyone is still in SLO and would be willing to poke it for me. I think I could walk them through what they need to do, but if they can’t, that’s totally okay. I’ll be home in 3 days, and I can make this post (and the next two) retroactively, because I am the ministry of truth.


After that massive YAP though I can actually talk about today/yesterday. I’m going to leave the little outline I did at the top, because I think it’s a fun little piece of transparency, but I’m writing this long part the next day, on my really long train ride. I’m going to use the word “today” instead of yesterday, but otherwise stick with past-tense, as if I was writing it at the end of the day, like I usually do.