cross-posted from: https://lemmy.ninja/post/30492

Summary

We started a Lemmy instance on June 13 during the Reddit blackout. While we were configuring the site, we accumulated a few thousand bot accounts, leading some sites to defederate with us. Read on to see how we cleaned up the mess.

Introduction

Like many of you, we came to Lemmy during the Great Reddit Blackout. @MrEUser started Lemmy.ninja on the 13th, and the rest of us on the site got to work populating some initial rules and content, learning how Lemmy worked, and finding workarounds for bugs and issues in the software. Unfortunately for us, one of the challenges to getting the site up turned out to be getting the email validation to work. So, assuming we were small and beneath notice, we opened our registration for a few days until we could figure out if the problems we were experiencing were configuration related or software bugs.

In that brief time, we were discovered by malicious actors and hundreds of new bot users were being created on the site. Of course we had no idea, since Lemmy provides no user management features. We couldn’t see them, and the bots didn’t participate in any of our local content.

Discovering the Bots

Within a couple of days, we discovered some third-party tools that gave us the only insights we had into our user base. Lemmy Explorer and The Federation were showing us that a huge number of users had registered. It took a while, but we eventually tracked down a post that described how to output a list of users from our Lemmy database. Sure enough, there were thousands of users there. It took some investigation, but we were eventually able to see which users were actually registered at lemmy.ninja. There were thousands, just like the third-party tools told us.

Meanwhile…

While we were figuring this out, others in Lemmy had noticed a coordinated bot attack, and some were rightly taking steps to cordon off the sites with bots as they began to interact with federated content. Unfortunately for us, this news never made it to us because our site was still young, and young Lemmy servers don’t automatically download all federated content right away. (In fact, despite daily efforts to connect lemmy.ninja to as many communities as possible, I didn’t even learn about the lemm.ee mitigation efforts until today.)

We know now that the bots began to interact with other Mastodon and Lemmy instances at some point, because we learned (again, today) that we had been blocked by a few of them. (Again, this required third-party tools to even discover.) At the time, we were completely unaware of the attack, that we had been blocked, or that the bots were doing anything at all.

Cleaning Up

The moment we learned that the bots were in our database, we set out to eliminate them. The first step, of course, was to enable a captcha and activate email validation so that no new bots could sign up. [Note: The captcha feature was eliminated in Lemmy 0.18.0.] Then we had to delete the bot users.

Before deleting anything, we made a backup. Always make a backup! After that, we asked the database to output all the users so we could manually review the data. After logging into the database Docker container, we ran the following query:


select
  p.name,
  p.display_name,
  a.person_id,
  a.email,
  a.email_verified,
  a.accepted_application
from
  local_user a,
  person p
where
  a.person_id = p.id;

That showed us that yes, every user after #8 or so was indeed a bot.
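
A grouped count can be quicker to scan than the full listing when there are thousands of rows. This is a minimal sketch that uses only the tables and columns from the query above; the accounts column alias is illustrative.

```sql
-- Summarize local accounts by verification and application status.
-- Read-only: nothing is modified.
SELECT
  a.email_verified,
  a.accepted_application,
  count(*) AS accounts
FROM
  local_user a
  JOIN person p ON a.person_id = p.id
GROUP BY
  a.email_verified,
  a.accepted_application
ORDER BY
  accounts DESC;
```

A large group of unverified accounts next to a handful of verified ones is the kind of pattern that suggests a bot sign-up wave.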

Next, we composed a SQL statement to wipe all the bots.


BEGIN;
CREATE TEMP TABLE temp_ids AS
SELECT person_id FROM local_user WHERE person_id > 85347;
DELETE FROM local_user WHERE person_id IN (SELECT person_id FROM temp_ids);
DELETE FROM person WHERE id IN (SELECT person_id FROM temp_ids);
DROP TABLE temp_ids;
COMMIT;

And to finalize the change:


UPDATE site_aggregates SET users = (SELECT count(*) FROM local_user) WHERE site_id = 1;
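
Before committing a delete like the one above, a read-only preview can show how many accounts a cutoff would remove and which ones it would keep. This is a sketch that assumes the same 85347 cutoff; the right value will differ on every instance, and the would_delete alias is illustrative.

```sql
-- How many local accounts fall above the cutoff?
SELECT count(*) AS would_delete
FROM local_user
WHERE person_id > 85347;

-- Which local accounts would survive? Review this list by hand.
SELECT
  p.id,
  p.name,
  a.email
FROM
  local_user a
  JOIN person p ON a.person_id = p.id
WHERE
  a.person_id <= 85347
ORDER BY
  p.id;
```

If a known real account is missing from the second result, the cutoff is too low.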

If you read the code, you’ll see that we deleted records whose person_id was > 85347. That’s the approach that worked for us. But you could just as easily delete all users who haven’t passed email verification, for example. If that’s the approach you want to use, try this SQL statement:


BEGIN;
CREATE TEMP TABLE temp_ids AS
SELECT person_id FROM local_user WHERE email_verified = 'f';
DELETE FROM local_user WHERE person_id IN (SELECT person_id FROM temp_ids);
DELETE FROM person WHERE id IN (SELECT person_id FROM temp_ids);
DROP TABLE temp_ids;
COMMIT;

And to finalize the change:


UPDATE site_aggregates SET users = (SELECT count(*) FROM local_user) WHERE site_id = 1;
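
One caveat follows directly from the WHERE clause: it matches every local account whose email_verified is false, which may include legitimate accounts created before email verification was working. A hedged way to check the impact first is to run the same statements but end with ROLLBACK instead of COMMIT, so nothing is kept. This is a sketch of that dry run, using only the tables and columns already shown.

```sql
BEGIN;

CREATE TEMP TABLE temp_ids AS
SELECT person_id FROM local_user WHERE email_verified = 'f';

DELETE FROM local_user WHERE person_id IN (SELECT person_id FROM temp_ids);
DELETE FROM person WHERE id IN (SELECT person_id FROM temp_ids);

-- Inspect which local accounts would be left after the delete.
SELECT
  p.name,
  a.email
FROM
  local_user a
  JOIN person p ON a.person_id = p.id
ORDER BY
  p.name;

-- Undo everything, including the temp table.
ROLLBACK;
```

If the remaining list looks right, the same statements can be run again with COMMIT.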

Even more aggressive mods could put these commands into a nightly cron job, wiping accounts every day if they don’t finish their registration process. We chose not to do that (yet). Our user count has remained stable with email verification on.

After that, the bots were gone. Third-party tools reflected the change in about 12 hours. We did some testing to make sure we hadn’t destroyed the site, but found that everything worked flawlessly.
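
One simple check after a cleanup like this is to compare the stored counter with the actual number of local accounts. This is a minimal sketch, assuming site_id = 1 as in the UPDATE statements above; the column aliases are illustrative.

```sql
-- The two numbers should match once site_aggregates has been updated.
SELECT
  s.users AS reported_users,
  (SELECT count(*) FROM local_user) AS actual_local_users
FROM
  site_aggregates s
WHERE
  s.site_id = 1;
```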

Wrapping Up

We chose to write this up for the rest of the new Lemmy administrators out there who may unwittingly be hosts of bots. Hopefully having all of the details in one place will help speed their discovery and elimination. Feel free to ask questions, but understand that we aren’t experts. Hopefully other, more knowledgeable people can respond to your questions in the comments here.

  • That’s a good indicator when you find your instance blocked by a lot of other instances. I think the lesson is don’t leave low hanging fruit out there.

    It actually amazes me there’s people out there doing these bot infestations. I mean there is some effort involved. Why go to all the trouble, what’s the payoff? And how are they able to find new unadvertised instances so quickly?

    • That’s a good indicator when you find your instance blocked by a lot of other instances.

      That’s just it: it took a third-party tool for us to even know we were being blocked. Our Lemmy instance really had no tools in place for us to see anything was wrong. If we hadn’t been extremely curious about our high user count, we never would have known there was a bot on our site. Never.

      Interestingly, when we discovered the tool that let us see that we were being blocked, I noticed that almost all of the sites that were reported as blocking us were in fact not blocking us. To their immense credit, they had apparently blocked us and then unblocked us after we wiped out the bots. It says a lot that those admins kept checking whatever report they were checking and followed up after we cleared up the problem.

    • I believe the way it works is that the moment you interact with something, every instance with at least one user who subscribes to the community you’re interacting with gets a ping with the activity associated with you. Since each message is signed, webfinger is used to verify your user’s authenticity (prevents me from posting something offensive pretending to be from your instance). That would then allow the bad actors to quickly collect instances to bot upon.

      Payoff is minimal but theoretically they’d be able to shill for things just like they already do on Reddit.

  • Jamie ( @Jamie@jamie.moe ), 1 year ago

    I run a private instance, but haven’t had captcha or email verification on because, well, it’s just me and one friend that I don’t think even uses his account. I have applications on and don’t approve anyone because it’s a personal instance. So far, I’ve had 5 bots apply. I’ll put their application text at the bottom of this post.

    Names tended to follow (noun)(noun)## format. One actually only had one noun. But it seems like having applications on by itself makes a lot of them just not bother with you. Even better, the wording of the applications was… odd. They’d stick out like a sore thumb in a batch of real ones, I think.

    “I’m eager to join the World News@lemmy.ml community to broaden my global perspective and participate in discussions about current affairs.”

    “I want to join the Lemmy.world community because I’m curious to connect with fellow users and engage in discussions about various topics.”

    “I yearn to depart from Reddit and embark on a transformative journey within this innovative social network by joining your instance.”

    “Joining the /kbin meta@kbin.social community seems interesting as I can engage in discussions about the platform’s development and future enhancements.”

    “Driven by the ongoing events on Reddit, I’m eager to join this instance and find the satisfaction and pleasure that has eluded me elsewhere.”

    • The wording on those applications would definitely raise a flag for me. They totally sound bot generated.

      I’ve joined a number of instances looking for the best performer, hops, pings, server response. That’s what I’ve been saying in my applications. Interestingly, my first sign-up was on Beehaw before I knew what I was doing and they are the only ones that rejected me, about a week after I applied. Made me think, what did I say that was so awful? No biggie, I already had some good instances to sign into.

        • That will be the problem with LLMs. Considering the application questions can simply be used as a prompt, bots will ace the Turing test. Would different questions or phrasings make it easier to filter them?

          I guess the tell from your single application to all these, is that they flock at the registration.

          All this just proves why 3rd party tools are important for managing an instance.

  • As a webmaster myself, I’ve noticed a small number of users with repeating seemingly generated names, all with the same or similar answer to the registration screening question. I’d be curious if you could release the database of usernames and screening question answers. I’d bet other Lemmy admins would benefit from any analysis done on that database. TTP.

        • Let’s rephrase that… if you can’t manage your instance with an appropriate approval process (bearing in mind, no process is also a process; the community might just choose to de-federate a no approval server, however), then don’t host an instance. Not everyone has to, nor should they, all congregate in one instance. They’d have access to all the communities as long as they’re not on a de-federated instance, so spreading out will prevent another single instance’s admin going down spez’s path, thereby reinforcing the federated network’s resilience.

          • Ada ( @ada@lemmy.blahaj.zone ), 1 year ago

            What I’m saying is that if every instance tried to do manual approval, the threadiverse wouldn’t have been able to cope with the influx of reddit users. Across every instance, all combined, we didn’t have the resources to manually approve the influx of users.

            To cope, some instances had to be on open signups. If people coming from reddit couldn’t sign up at their preferred instance, they went somewhere else with open signups. And if there was nowhere with open signups, a good portion of them would have given up, moved on, and the threadiverse would have lost momentum before it found it.

            And in our case specifically, as the only explicitly queer focused instance (at least at the time of the initial reddit migration) we felt it was important to be open so queer folk could find a space and set up communities during those early days, rather than forcing them on to generalist instances without the protections and community that come with queer spaces.

            • Right; and as I was saying, the choice to be open sign up and have no approval process is in itself a process choice that the instance operator can choose to take.

              However, if the instance (not your instance, just a hypothetical instance) gets abused, and bad actors choose to launch attacks by massing bot accounts, then it is also entirely possible for others to choose to de-federate that instance.

              It’s a fine line to balance; as someone who’s been building discussion forums since early 2000’s, I fully understand the implications of needing to balance between ease of sign up and having the appropriate process in place to keep the community clean.

              I think having a more modular bot prevention system (i.e. allowing users to plug in code to handle different types of captcha/question answer/bot detection/etc.) will add a lot of value, but the devs haven’t quite found their footing yet. They removed captcha altogether in 0.18 only to be told vocally to put it back in. I’d say it is just typical growing pains of suddenly being vaulted to the spotlight…

              • That’s a little less confrontational than what you first wrote, where you said that you shouldn’t be running an instance if you can’t handle a manual approval process. My whole point is that no one is resourced to properly handle manual approvals at that scale.

                I absolutely agree that open approvals come at a cost, and do have real risks associated. We’re holding off on upgrading to 0.18 specifically because of lack of captcha. That’s not something we’re prepared to risk.

                Like you, I’ve been doing this for decades. We might have made different choices, but I think it’s fair to say, both of us are making choices from positions of first hand experience, and I think that’s probably why I got a little defensive.