Submitted by Loquzofaricoalaphar t3_10ixiu6 in MachineLearning

Obviously nation states can already identify people pretty comprehensively using other methods, even on Tor and the like, because of user error. But if your average home user could quickly do this using text alone, what would the implications be for the web?

  1. I am assuming that it is currently possible to feed a model a bunch of text written by “Bobby”, then put a specific post into the model and get a confidence stat that it was written by Bobby.

  2. Would it be possible in the future, with better models and a lot more compute, to use non-anonymous data from all of Facebook or the internet to quickly scan pseudo-anonymous places like Reddit and Twitter, or even something truly anonymous like the dark web, and return a list of probable authors?

I’m assuming people who are seeking true anonymity already put their text through paraphrase models or just write very blandly.

I am using the word “mask” instead of “anonymous” because Reddit seems more like obfuscation than the potential true anonymity of, say, a Tor forum with a sophisticated user.

It is interesting to think that all the subtle errors and invisible algorithmic choices of the human brain are trivial for a machine to identify, given a sufficient natural language model that can translate the text and incorporate pattern matching.

Edit: I mean a noisy probability stat, not an assurance that x was written by y. More like a 75% match to Bobby and a 32% match to Sally. It would match on errors, flow, and unusual word choices, which is more advanced than just a plagiarism detector.
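As a rough illustration of that kind of noisy match score, here is a minimal sketch that compares character trigram counts with cosine similarity. The feature choice, the function names, and the toy samples are all assumptions made for illustration, not a method anyone in the thread describes. The scores are independent similarities per candidate rather than calibrated probabilities, so they need not sum to 100%.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams, which pick up spelling,
    spacing, and punctuation habits."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def match_scores(known: dict[str, list[str]], post: str) -> dict[str, float]:
    """Score one unattributed post against each candidate's known writing."""
    target = char_ngrams(post)
    return {
        author: cosine(char_ngrams(" ".join(samples)), target)
        for author, samples in known.items()
    }


# Toy, made-up samples; real use would need far more text per author.
known = {
    "bobby": ["i recon its fine tbh , teh model dosent care",
              "tbh i recon teh rest is fine"],
    "sally": ["I believe the model is adequate; however, it requires validation."],
}
scores = match_scores(known, "tbh i recon teh results dosent matter")
for author, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{author}: {score:.0%} match")
```

With samples this short the numbers mean very little; the point is only the shape of the pipeline: build a profile per known author, then score the unknown post against each profile.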

0

Comments

[deleted] t1_j5h4cp9 wrote

[deleted]

17

Loquzofaricoalaphar OP t1_j5h5kq4 wrote

Perhaps it could return the top 10 most likely authors of the account. Some patterns of writing and grammatical errors might be pretty unique, and the more posts it has, the more unique, right?

−4

neanderthal_math t1_j5henyu wrote

People have been working on the Author Identification problem for about 20 years.

https://dergipark.org.tr/en/download/article-file/2482752

https://en.wikipedia.org/wiki/Author_profiling?wprov=sfti1

There is no way to unmask all of Reddit though. Too many people and many text samples are way too short. Some Redditors only speak in emoji and gif.

13

sothatsit t1_j5hhb31 wrote

I’ve actually done some work on this, and the real issues here are:

  1. You’d need a lot of text from other sources with people’s real names.
  2. You’d need the user to have written a lot of Reddit comments or posts.
  3. The style of user’s writing would need to match between Reddit and your other source.

If you’re interested though, I made the following library for my Master’s thesis, which can be used for this: https://github.com/TycheLibrary/Tyche

However, it would need more work to get close to identifying thousands, never mind millions, of users.

3

PredictorX1 t1_j5h3ymz wrote

>With more compute could it be easy to quickly unmask all the people on Reddit by using text correlations to non-masked publicly available text data?

With labeled samples of text, I think it would be pretty easy to come up with a likelihood model that gives a reasonable educated guess of the identity of some Reddit members, and I don't think it would take much computing power.
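One simple form such a likelihood model could take is a word-level naive Bayes classifier. The sketch below is only an assumption about what "likelihood model" might mean here; the add-one smoothing, the uniform prior, and the toy labeled samples are illustrative choices.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def train(labeled: dict[str, list[str]]) -> tuple[dict[str, Counter], set[str]]:
    """Count word frequencies per author and collect the shared vocabulary."""
    counts = {
        author: Counter(w for sample in samples for w in tokenize(sample))
        for author, samples in labeled.items()
    }
    vocab = {w for c in counts.values() for w in c}
    return counts, vocab


def posterior(counts: dict[str, Counter], vocab: set[str], text: str) -> dict[str, float]:
    """Posterior over authors under a uniform prior, with add-one smoothing."""
    log_like = {}
    for author, c in counts.items():
        # The extra +1 is a single bucket for words outside the vocabulary.
        total = sum(c.values()) + len(vocab) + 1
        log_like[author] = sum(math.log((c[w] + 1) / total) for w in tokenize(text))
    top = max(log_like.values())  # subtract the max before exp() for stability
    weights = {a: math.exp(v - top) for a, v in log_like.items()}
    norm = sum(weights.values())
    return {a: w / norm for a, w in weights.items()}


# Toy, made-up labeled samples standing in for text with known authors.
labeled = {
    "bobby": ["honestly i reckon the model is fine", "i reckon it works honestly"],
    "sally": ["the results are statistically significant",
              "the model requires further validation"],
}
counts, vocab = train(labeled)
guess = posterior(counts, vocab, "honestly i reckon the results are fine")
for author, p in sorted(guess.items(), key=lambda kv: -kv[1]):
    print(f"{author}: {p:.2f}")
```

This is cheap to compute, which fits the point about not needing much computing power, but the posterior is only meaningful relative to the candidates it was given labeled samples for.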

2

Loquzofaricoalaphar OP t1_j5h59id wrote

So, like, if you fed it samples from 200 people you were looking for and then fed it Reddit? Perhaps all of Reddit would be tricky, because some people might not have public text and it would be difficult to label all the text on Facebook, LinkedIn, etc.

2

PredictorX1 t1_j5h5pb5 wrote

The biggest technical challenges I see:

  1. Having enough reference samples from known people
  2. The difference between how people write on Reddit and how they write elsewhere (professional articles, e-mail, etc., presumably used as reference)
  3. If too many Reddit users are being considered, it may all dissolve into mush (estimated probabilities would all be low)
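The third point can be made concrete with a back-of-the-envelope Bayes calculation. Assume, purely for illustration, a uniform prior over N candidates and a model under which the true author's text is LR times as likely as under any other candidate. The posterior for the true author is then LR / (LR + N - 1), which collapses as N grows. The function name and the value LR = 1000 are hypothetical.

```python
def posterior_true_author(likelihood_ratio: float, pool_size: int) -> float:
    """Posterior for the true author under a uniform prior, assuming their
    text is `likelihood_ratio` times as likely as for every other candidate."""
    return likelihood_ratio / (likelihood_ratio + pool_size - 1)


for pool_size in (2, 200, 20_000, 2_000_000):
    p = posterior_true_author(likelihood_ratio=1000.0, pool_size=pool_size)
    print(f"{pool_size:>9,} candidates -> {p:.4f}")
```

Under these assumptions, evidence that is near-conclusive between two candidates still gives about 0.83 with 200 candidates and only about 0.05 with 20,000.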

3

Loquzofaricoalaphar OP t1_j5h6s4z wrote

That is interesting to think about. I’m biased to think text patterns have lots of variables and are fairly unique. Perhaps it’s more of a model than compute problem to analyze it at scale and not get mush.

1

HateRedditCantQuitit t1_j5hymmu wrote

Could you? Probably, but with a nontrivial error rate. Should you? No, that would make YTA.

2

PryomancerMTGA t1_j5hbwrg wrote

Trying to match all businesses with fuzzy matching is hard enough when you have misspellings. To think you could identify redditors with any degree of certainty is optimistic at best.

1

1980sMUD t1_j5inqqg wrote

If you’re worried about this, then first ask a model to generate your comments for you.

1

MrEloi t1_j5j2hz1 wrote

Most people are already uniquely identifiable via browser fingerprinting.

The Powers That Be can find you if they are interested enough.

The Unabomber had the 'right' idea with regard to security: he lived in a basic hut in the woods.

Ironically, he was identified by his writing style. His brother recognized the style of the published manifesto from letters Ted Kaczynski had sent him.

1

Loquzofaricoalaphar OP t1_j5kmiqg wrote

Yes, this is the sort of thing I am thinking about. Some percentage of people have very distinct styles; however, with Ted it might have been the content that gave it away.

Yes, I am familiar with AmIUnique and all the variables of the browser.

I wonder if this way of identifying people is ever used when Google or others get subpoenaed and hand over data. It seems it would be more accurate than an IP address at pinning down the individual through correlations, but I wonder whether it is accepted by, or holds up in, a court of law.

1

glitteringpenny t1_j5nfjzq wrote

Start by getting your hands on LastPass's customer data. Bet you could unmask a good number of us. 25 million users' data… unleashed.

1

glitteringpenny t1_j5nfl5t wrote

Yeah, I’m being salty because I was viewing LastPass before this subreddit.

1