Improving Beehaw

BLUF: The operations team at Beehaw has worked to increase site performance and uptime. This includes proactive monitoring to keep problems from escalating, and planning for likely future events.


Problem: Emails are only sent to approved users, not denied ones; denied users can’t reapply with the same username

  • Solution: Denied users now receive an email, and their usernames are freed up for re-use

Details:

  • Disabled the Dockerized postfix container; Lemmy runs on a Linux host that can run postfix directly, without the container overhead

  • Modified various postfix components to accept email traffic from localhost (the same system) only

  • Created two different scripts to:

    • Check the Lemmy database once in a while for denied users, send them an email, and delete the user from the database
      • User can use the same username to register again!
    • Send out those emails (and also make the other Lemmy emails look nicer)
  • Sending so many emails from our provider caused them to end up in spam!! We had to change a bit of the outgoing flow

    • Set up DKIM and SPF
    • Changed outgoing emails to relay through Mailgun instead of through our VPS
  • Configured the Lemmy containers to use the host postfix as their mail transport
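
For anyone replicating the postfix side, the host config boils down to a few main.cf lines. This is a minimal sketch, not our exact config: the Mailgun relay host is real, but the credential map path is a placeholder, and the DKIM/SPF part lives in DNS plus an opendkim setup, not here.

```
# /etc/postfix/main.cf (sketch)
# accept mail from this host only
inet_interfaces = loopback-only
mynetworks = 127.0.0.0/8 [::1]/128

# relay outbound mail through Mailgun instead of sending direct from the VPS
relayhost = [smtp.mailgun.org]:587
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
smtp_tls_security_level = encrypt
```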
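
The denied-user cleanup script is conceptually very small. A hypothetical sketch follows; the table and column names are assumptions based on the Lemmy schema, and the mail wording is made up, so check both against your own instance before running anything like it:

```shell
#!/bin/sh
# Sketch: free up usernames of denied applicants (names/paths are assumptions).

DB="lemmy"; DBUSER="lemmy"

# Body of the notification mail; pure text, easy to tweak.
mail_body() {
  printf 'Subject: Your Beehaw application\n\nSorry %s, your application was denied. You may register again under the same name.\n' "$1"
}

# List "name email" pairs for users whose application was denied.
denied_users() {
  psql -U "$DBUSER" -d "$DB" -At -F' ' -c "
    SELECT p.name, lu.email
      FROM registration_application ra
      JOIN local_user lu ON lu.id = ra.local_user_id
      JOIN person p      ON p.id  = lu.person_id
     WHERE ra.deny_reason IS NOT NULL;"
}

# Mail each denied user, then delete them so the username is freed up.
if [ "${1:-}" = "run" ]; then
  denied_users | while read -r name email; do
    mail_body "$name" | sendmail "$email"
    psql -U "$DBUSER" -d "$DB" -c "DELETE FROM person WHERE name = '$name';"
  done
fi
```

Run from cron with the `run` argument so sourcing the file for testing doesn’t touch the database.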

All is well?


Problem: NO file-level backups, only full image snapshots

  • Solution: Procured backup storage (Backblaze B2), set up system backups, and successfully tested restoration

Details:

  • Requested funds from the Beehaw org for the purchase of cloud-based storage, B2 - approved (thank you for the donations)

  • Installed and configured restic encrypted backups of key system files -> B2 ‘offsite’. This means that even the Beehaw data saved there is encrypted, and no one else can read it

  • Verified scheduled backups run every day to B2. They include important data such as the Lemmy volumes, pictures, configurations for various services, and a database dump

  • Verified restoration works! We had a small issue with the pictrs migration to object storage (B2), and restored the entire pictrs volume from the restic B2 backup successfully. Backups work!
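
For reference, the nightly job is conceptually just a database dump plus a restic run. A hypothetical crontab sketch; the bucket name, paths, and env file are placeholders, not our real layout:

```
# /etc/cron.d/beehaw-backup (sketch; all names are placeholders)
# restic.env exports B2_ACCOUNT_ID, B2_ACCOUNT_KEY and RESTIC_PASSWORD,
# so everything in the bucket stays encrypted client-side.
30 3 * * * root . /root/restic.env && pg_dump -U lemmy lemmy > /srv/backup/lemmy.sql && restic -r b2:beehaw-backups:host backup /srv/lemmy /srv/backup /etc
```

A restore is then `restic -r b2:beehaw-backups:host restore latest --target /tmp/restore`, which is essentially what we did for the pictrs volume.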

Sorry for that downtime, but hey… it worked


Problem: No metrics/monitoring; what do we focus on to fix?

  • Solution: Configured external system monitoring via SNMP, plus internal monitoring for services/scripts

Details:
  • Using an existing self-hosted Network Monitoring Solution (thus, no cost), established monitoring of Beehaw.org systems via SNMP
  • This gives us important metrics such as network bandwidth usage, memory and CPU usage (tracked down to which processes are using the most), parsed system event logs, and disk I/O and usage tracking
  • Host-based monitoring configured to take action on known error conditions and attempt to resolve them automatically. Such as the Lemmy app; crashing; again
  • Alerting for unexpected events or prolonged outages. Spams the crap out of @admin and @Lionir. They love me
  • Database-level tracking of ‘expensive’ queries, to see where Lemmy’s time and effort are spent. This helps us report the issues to the developers and get them fixed.
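
On the Beehaw side this is just the stock net-snmp agent answering the poller. A minimal snmpd.conf sketch; the community string, subnet, and thresholds are placeholders, not our real values:

```
# /etc/snmp/snmpd.conf (sketch)
rocommunity beehawRO 203.0.113.0/24   # read-only, restricted to the poller's subnet
syslocation "Beehaw VPS"
# expose disk, load and process health to the poller
disk / 10%
load 4 4 4
proc lemmy_server
```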

With this information we’ve determined the areas to focus on are database performance and storage concerns. We’ll be moving our image storage to a CDN if possible to help with bandwidth and storage costs.

Peace of mind, and let the poor admins sleep!


Problem: Lemmy is really slow and more resources for it are REALLY expensive

  • Solution: Based on metrics (see above), tuned and configured various applications to improve performance and uptime

Details:
  • I know it doesn’t seem like it, but really, uptime has been better with a few exceptions
  • Modified NGINX (the web server) to cache items and load-balance between UI instances (currently running 2 lemmy-ui containers)
  • Set up a frontend varnish cache to decrease backend (Lemmy/DB) load. It serves images and other content before requests hit the webserver; this saves on CPU resources and connections, but not on bandwidth cost
  • Artificially restricting resource usage (memory, CPU) to prove that Lemmy can run on less hardware without a ton of problems. We need to reduce the cost of running Beehaw
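
The nginx piece looks roughly like the following. A sketch only: the ports, cache-zone sizes, and TTLs are placeholders, and the real vhost obviously carries TLS and many more locations.

```nginx
# conf.d/beehaw.conf (fragment, sketch)
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=lemmy_cache:10m
                 max_size=1g inactive=60m;

upstream lemmy-ui {
    # two lemmy-ui containers behind one vhost
    server 127.0.0.1:1234;
    server 127.0.0.1:1235;
}

server {
    listen 80;
    server_name beehaw.org;

    location / {
        proxy_cache lemmy_cache;
        proxy_cache_valid 200 1m;   # short TTL keeps feeds fresh
        proxy_pass http://lemmy-ui;
    }
}
```
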
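
And the varnish layer in front can be as small as a backend definition plus a TTL rule. Again a sketch, with a placeholder port and path rather than our production VCL:

```
# /etc/varnish/default.vcl (sketch)
vcl 4.1;

backend web {
    .host = "127.0.0.1";
    .port = "8080";   # nginx sits behind varnish here
}

sub vcl_backend_response {
    # cache images briefly so repeat hits never reach nginx/Lemmy
    if (bereq.url ~ "^/pictrs/") {
        set beresp.ttl = 10m;
    }
}
```
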
THE DATABASE

This gets its own section. Look, the largest issue with Lemmy performance is currently the database. We’ve spent a lot of time attempting to track down why and what it is, and then fixing what we reliably can. However, none of us are Rust developers or database admins. We know where Lemmy spends its time in the DB, but not why, and we really don’t know how to fix it in the code. If you’ve complained about why Lemmy/Beehaw is so slow, this is it; this is the reason.

So, since I can’t code Rust, what do we do? Fix it where we can! PostgreSQL server setting tuning and changes. We changed the following items in postgresql to get better performance for our load and hardware:

 huge_pages = on # requires sysctl.conf changes and a system reboot
 shared_buffers = 2GB
 max_connections = 150
 work_mem = 3MB
 maintenance_work_mem = 256MB
 temp_file_limit = 4GB
 min_wal_size = 1GB
 max_wal_size = 4GB
 effective_cache_size = 3GB
 random_page_cost = 1.2
 wal_buffers = 16MB
 bgwriter_delay = 100ms
 bgwriter_lru_maxpages = 150
 effective_io_concurrency = 200
 max_worker_processes = 4 
 max_parallel_workers_per_gather = 2
 max_parallel_maintenance_workers = 2
 max_parallel_workers = 6
 synchronous_commit = off  	
 shared_preload_libraries = 'pg_stat_statements'
 pg_stat_statements.track = all

Now I’m not saying all of these had an effect, or even a cumulative effect; these are just the values we’ve changed. Be sure to use values suited to your own system and not copy the above. The three largest changes I’d say are key are synchronous_commit = off, huge_pages = on and work_mem = 3MB. This article may help you understand a few of those changes.
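
One gotcha worth spelling out: with huge_pages = on, postgres will refuse to start unless the kernel actually has huge pages reserved. The sysctl side looks something like this; the number is an example sized for a 2GB shared_buffers, so calculate your own from the postmaster’s VmPeak rather than copying it:

```
# /etc/sysctl.conf (example sizing, not ours)
# 1100 x 2MB pages covers shared_buffers plus some overhead
vm.nr_hugepages = 1100
```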

With these changes, the database seems to be working a damn sight better, even under heavier loads. There are still a lot of inefficiencies in these queries that can be fixed in the Lemmy app itself. A user, phiresky, has made some huge improvements there, and we’re hoping to see those pulled into main Lemmy in the next full release.
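
For the curious, pg_stat_statements (loaded via the settings above) is how we find which queries to report upstream. A query along these lines does it; the column names are for PostgreSQL 13+, older versions use total_time/mean_time instead:

```sql
-- top 10 queries by total time spent
SELECT calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       left(query, 80)                    AS query
  FROM pg_stat_statements
 ORDER BY total_exec_time DESC
 LIMIT 10;
```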


Problem: Lemmy errors aren’t helpful and sometimes don’t even reach the user (UI)

  • Solution: Make our own UI, with blackjack and hookers, that propagates backend Lemmy errors. Some of these fixes have been merged into the main Lemmy codebase

Details:

  • Yeah, we did that, including some other UI niceties. The main thing is, you need to pull in the lemmy-ui code, make your changes locally, and then use that custom image as your UI in Docker
  • Made some changes to a custom lemmy-ui image, such as handling a few JSON-parsed errors better and improving the feedback given to the user
  • Removed and/or moved some elements around, changed the CSS spacing
  • Changed the node server to listen for system signals sent to it, such as a graceful Docker restart
  • Other minor changes to assist caching; changed the container image to be Debian-based instead of Alpine (reducing crashes)
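
Wiring the custom image in is then just a compose change. A sketch; the build context and the Debian Dockerfile name are assumptions about how you lay out your own fork:

```yaml
# docker-compose.yml (fragment, sketch)
lemmy-ui:
  build:
    context: ./lemmy-ui             # local checkout with the UI patches
    dockerfile: Dockerfile.debian   # Debian base instead of Alpine
  restart: always
  init: true                        # forward signals for graceful restarts
```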

 

The end?

No, not by far. But I am about to hit the character limit for Lemmy posts. There have been many other changes and additions to Beehaw operations; these are but a few of the key ones. I’m sharing with the broader community so those of you also running Lemmy can see if these changes help you too. Ask questions and I’ll discuss and answer what I can; no secret sauce or passwords though; I’m not ChatGPT.

Shout out to @Lionir@beehaw.org, @Helix@beehaw.org and @admin@beehaw.org for continuing to work with me to keep Beehaw running smoothly.

Thanks all you Beeple, for being here and putting up with our growing pains!

  • There is a dedicated Lemmy community for this: !lemmyperformance@lemmy.ml

  • As someone technologically illiterate and new to the Fediverse (hi all), I wonder: is this something you need to figure out to work around and optimise your own hardware, or is this a usual thing with Lemmy? I guess the current events and influx of people also quite stress-test various systems.

  •  kool_newt   ( @kool_newt@beehaw.org )

    Is it possible to use Redis to help speed up DB queries?

    I’m assuming the DB is a container too; containers (Docker) and overlay networks have overhead. There could be overhead in the way the DB accesses the storage devices as well. Look into running the DB on a dedicated physical server if possible, otherwise a dedicated VM and not a container.

    You can also look into read-replicas of the DBs as I’d imagine there are way more DB reads than writes. Take your DB backups from a read replica (you can stop one of the read replicas to get a consistent DB backup without interrupting other reads and writes).

    You can set up slow-query logging if you haven’t yet to find out the problematic queries so you know where to optimize (if optimizing queries is an option).
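
    (For anyone following along: in postgres, slow-query logging is a single setting; the threshold below is just an example.)

```
# postgresql.conf -- log any statement slower than 500ms
log_min_duration_statement = 500
```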

    • Thanks, we have explored these options. The Lemmy DB runs as a container, yes. The overhead of Docker on it isn’t that much; it’s more the queries themselves. We do not want to increase the cost and complexity by adding another server just for the database. We have also explored multiple DB containers and connection pooling. Again, this only moves the problem, it does not solve it.

  • Honestly surprised it isn’t using redis already 😧

    I often end up plopping redis in as an ad-hoc caching layer pretty early during application development for backends that are expected to be load balanced. It’s super simple to use, has low resource costs relative to its load capacity, and solves a lot of low-hanging fruit as far as DB access performance goes.


    Opinion:

    It should definitely be a reasonably high/critical-priority roadmap item 🤔. The time cost is negligible assuming your ecosystem has a decent redis library; if you’re an expert in the codebase (a major/primary contributor) it can be as easy as a few days to do a cleanup (assuming redis/lib familiarity/docs) and knock out all the low-hanging fruit. And the benefits can be enormous, like 10x, 50x load-decrease enormous.


    Alternatively:

    Read replicas, as @kool_newt said. If not, some dev work is required.

    This can sometimes work as a quick fix to address application perf problems without adding infrastructure, but time cost is more or less based on codebase quality & conventions, since you’ll be touching a lot more queries to make this change. And you’ll need to slap in a config that handles deployments without a read replica.

    Then users of Lemmy could have as many read replicas as they want behind a load balancer/proxy which lets them scale in that direction going forward.

    This is actually a common solution for read performance anyways.

  •  douglasg14b   ( @douglasg14b@beehaw.org )

    Unfortunately you can only get so much out of config changes if the problems lie in access patterns 🫤

    DB performance problems are very typical of ORM usage, which Lemmy appears to use. Though I’m not sure to what extent.

    Not necessarily the ORM itself, but the database access patterns it encourages. If care is not taken to ensure performant hot paths receive more SQL and caching love, you end up with systemic performance problems.

    It’s endemic to the habits of the devs, not to specific queries or one particular workload; it comes from a broad set of generally unperformant patterns that may not individually be a problem, but become one as a whole.

    🤔

    I also don’t code in Rust, unfortunately, but I definitely understand ORM usage and how it can bite you, and I quite enjoy using them. So I’m not admonishing the choice.