Data protection, the right to be forgotten, and federation

interolivary ( @interolivary@beehaw.org ) · edit-2 1 year ago

Data protection, the right to be forgotten, and federation

pe1uca ( @pe1uca@lemmy.pe1uca.dev ) · 1 year ago

I don’t fully understand the “right to be forgotten”.
I mean, it’s very useful when you want to make sure a corporation which profits from your data doesn’t want to delete that data, but from the perspective of forums like in here I struggle to understand the need of people to delete everything at some point.

The only result I see from this is useful knowledge being lost.
Imagine if I make a useful post which people come from time to time to solve their issue. People would probably link to beehaw not my instance, since I posted in this community. After a couple of years I no longer can maintain my instance and goes down, then my useful post has a silent self-destruct, people won’t know this and keep linking it and eventually it’ll end up like with a lot of forums:
“The solution is in this link”
“Thanks, that solved my issue”
But now link is dead and the solution gone.

With how lemmy works now then people will still be able to find the content even if the instance where it originated from dies.
I see this as a very useful feature to preserve knowledge.

If you don’t want something to be forever in the internet then don’t post it, as you said, the wayback machine exists, so even then you’re acknowledging the GDPR request you made to the instance was useless, you still need to go to any archiver there is to be sure your data has been properly deleted.

Jummit ( @Jummit@lemmy.one ) · edit-2 1 year ago

I don’t fully understand the “right to be forgotten”.

I think there is a difference between agreeing with the law itself and agreeing with the usefulness. GDPR gives users incredible power over their data, and in the case of Reddit it allows you to leave the platform very effectively for example.

“The solution is in this link” “Thanks, that solved my issue” But now link is dead and the solution gone.

This is sadly the case with everything on the internet and life in general tbh.

even then you’re acknowledging the GDPR request you made to the instance was useless

Don’t quote me on this, but I don’t think GDPR says they have to delete every instance of your content across the internet, just the ones they have power over.> “The solution is in this link”

“Thanks, that solved my issue” But now link is dead and the solution gone.

Also, I’m mainly adding some of my thoughts, don’t take this as criticism of your post or your viewpoint. I fully agree that there is no solution that pleases everyone here.

interolivary ( @interolivary@beehaw.org ) · edit-2 1 year ago

I don’t fully understand the “right to be forgotten”.

The general idea is that when a user deletes their account, anyone storing their data or data produced by them must delete it in some time frame (can’t remember off hand.)

The knowledge preservation angle is valid, but it’s at odds with the right to be forgotten. Of course we can always choose to just ignore that right, but down the line that could well lead to trouble with data protection authorities, at least for instances hosted in the EU. I’m also fairly convinced other countries and states have similar regulations. But yeah, this is by no means a simple problem and I absolutely see your point about data preservation.

If you don’t want something to be forever in the internet then don’t post it, as you said, the wayback machine exists, so even then you’re acknowledging the GDPR request you made to the instance was useless, you still need to go to any archiver there is to be sure your data has been properly deleted.

As I noted in the proposal, “best effort” is what I’m looking for here. You can never guarantee that all traces of your personal data (ie. data you’ve produced or is about you) are totally gone from the internet, but that doesn’t mean we shouldn’t make any effort to do so.

It’d be worthwhile to check whether simply eg. deleting the username from content would be enough, but based on previous experience with GDPR stuff (although in a stricter context) I’m betting it wouldn’t be.

Edit: I floated the idea of allowing users to opt out of this in another comment

bjornsno ( @bjornsno@lemm.ee ) · edit-2 1 year ago

TTL on all content scales extremely poorly. You touch on this but I don’t think you appreciate just hope big of a SELECT * WHERE TTL ... this would be in just a few months/years. As an alternative, every instance sync should come with a list of newly deleted users. Retrying would not need to be reimplemented. If a user who wishes to be forgotten has had their home instance go dark, there will need to be a way for them to prove ownership over the original account (signup confirmation email perhaps) so a delete can be started from a foreign instance.

If we can’t make the proof of identity work on a trust level, it does have issues with rogue instances being able to delete people, we’ll need to instead set a TTL on whole instances, which wouldn’t have the performance issues it would on content.

interolivary ( @interolivary@beehaw.org ) · 1 year ago

Good points!

I think you’re right about mandatory individual content TTLs (instead of just optional like what Mastodon does, which is a completely different proposal) being a bad fit for this problem.

Admittedly it’s been a while since I’ve had to run Postgres at scale, but with good indices (and the case is simple since it’s a numeric comparison plus primary key), having a separate worker process, and sensible intervals it should be fine for quite a while (especially if Postgres can stream results without materializing the whole result, can’t remember if it could).

I’ll have to have a ponder about unfucking the idea 😁 Thank you for the feedback

StrayCatFrump ( @StrayCatFrump@beehaw.org ) · edit-2 1 year ago

IMO it can be MUCH simpler. Deleting content should propagate across federation just like adding content does. De-federating should retroactively remove all content that it would normally keep from propagating (possibly leaving “this post/comment deleted” markers so that replies make sense). And losing track of an instance for long enough (e.g. a week, or a month) should be equivalent to de-federating, possibly with the option to resurrect content when and if the instance comes back online.

I believe that would remove a lot of the issues with extra traffic, and possibly a lot of the issues with extra processing. I don’t know enough about the protocol to tell whether it would add requirements for extra data, but I suspect it wouldn’t.

1 year ago

I like this idea, I think you could do some smart logic with non-responses to avoid spurious deletions. Like if any post from another instance responded recently, hold off. I’m just imagining if a server had some downtime and suddenly their content in the fedi is gone

interolivary ( @interolivary@beehaw.org ) · edit-2 1 year ago

I’m just imagining if a server had some downtime and suddenly their content in the fedi is gone

Yeah this is definitely a possibility, but if content TTLs are a month or two and refreshing user account liveness starts happening much before the TTL of a content runs out (possibly even halfway through the TTL as I noted somewhere), an instance would have to be unreachable for up to a month (depending on the TTL of content / cached user account liveness info).

But in general you’re definitely right that there’s probably Smart Stuff™ we could do regarding liveness checks to make spurious deletions as unlikely as humanly possible.

Edit: I could also see having some sort of provision for “returning instances”, but no idea how this would work

lazyguru ( @lazyguru@discuss.online ) · edit-2 1 year ago

Something else to consider here would be some kind of batching. A system doing this check should group users together by instance and make a single call to that instance. Something like: “Hey, I have this list of users from your instance. Are they all still active? A, B, C, D…” Reply: “From your request, here is the list of users that I found in my database: A, D”. Now the calling system would know it should remove all data for users B & C.

interolivary ( @interolivary@beehaw.org ) · 1 year ago

Oh yeah, good point, thanks. I forgot all about batching since I was just concentrated on caching. I’ll add this to the proposal

LemmyLurker ( @LemmyLurker@beehaw.org ) · 1 year ago

Pretty interesting! At some point, developers an admins of federated services need to carefully consider the GDPR and how to comply efficiently. This jives well with the concept of “don’t store data for longer than absolutely necessary”. There is a risk that it will lead to broken or crippled conversations as they get older. I do agree that an instance shouldn’t try to act as an archive of all data. But there is sometimes great value in keeping these - both for historical and practical reasons. Maybe the data could be anonymized somehow, instead of deleted? But that would require manual review of the data, to ensure correct anonymization.

Maybe you could mark certain threads as " important" and only these would require manual review, the rest of the users data would be deleted.

Just some quick thoughts

interolivary ( @interolivary@beehaw.org ) · edit-2 1 year ago

Yeah it’s a difficult problem for sure, and like you and @pe1uca@lemmy.pe1uca.dev noted, there’s absolutely value in having some stuff be around essentially forever.

Maybe users could opt out of this mechanism? Not sure if it’d be per user or per content? So either allow flagging their profile with “keep my data around forever unless I specifically delete it myself, please” or flagging some of their own content as “keep this around forever”

edit: added opt-out to the proposal

Max-P ( @Max_P@lemmy.max-p.me ) · 1 year ago

A similar recent discussion on GDPR: https://lemmy.one/post/629353

interolivary ( @interolivary@beehaw.org ) · 1 year ago

Ah nice, thanks for the link. I’ll have to give that a read later, I’m all brained out for today 😅

evilviper ( @evilviper@beehaw.org ) · 1 year ago

Not a backend dev, but it would seem like this could possibly be partially solved by purging data past a certain age that falls into specific scenarios:

Data from unfederated instances
Data from users/posts/comments that have been deleted/removed

Also, deleting/removing content doesn’t really seem to do much currently as you still get all the info back from the server and it’s up to the frontend to not display it. I’m normally of the opinion of it you want to delete your comment it should be properly deleted (moderation removal being a separate issue).

Data protection, the right to be forgotten, and federation

Data protection, the right to be forgotten, and federation

The problem

1. The proposal: TTLs on user content

2. Advantages of this proposal

3. Disadvantages of this proposal

3.1 “It’s a feature, not a bug”