Sanitext – Remove LLM-Generated Text Fingerprints

14 points by panpan2 4 months ago

"− (U+2212) is a minus sign instead of a dash"

This is technically true, but the character in the other version is a hyphen instead of a dash (though given the absence of dashes in ASCII, one, two, or three ASCII hyphens are often used in place of dashes in environments constrained to ASCII).

And while AI watermarking and fingerprinting are real, using typographically correct Unicode instead of base ASCII isn't really it (though I guess anything that reduces variety in text, as this does, will make some of it less effective).

panpan2 4 months ago

Thanks for catching this, changed to "hyphen".

> And while AI watermarking and fingerprinting are real, using typographically correct Unicode instead of base ASCII isn't really it (though I guess anything that reduces variety in text, as this does, will make some of it less effective).

I disagree. Your "writing signature" changes when you go from never using proper typography to suddenly using it perfectly. If you don't typically follow typography rules, LLM-generated text can make your writing inconsistent and detectable, especially in notes, where some parts follow your natural style while others suddenly have perfect punctuation (e.g., now you need to search for both your usual punctuation and the LLM's version to find something). Also, if you use an LLM to help rewrite a sentence within a longer piece, the output might include typographic details (like curly quotes or en-dashes) that don't match the rest of your writing.

mwinatschek 4 months ago

"- I'm AI. (Normal text) − І’m󠅘󠅟󠅜󠅑 ΑІ.󠅓󠅙󠅑󠅟 (AI-tainted text)

’ (U+2019) is a right single quotation mark instead of a regular quote"

I think AI just uses the correct apostrophe, doesn’t it?

https://dictionary.cambridge.org/ja/grammar/british-grammar/...

panpan2 4 months ago

That's right! The same goes for en-dashes, em-dashes, and some other punctuation. While these aren’t ASCII, you can enable them with `--allow-chars` if you want to keep them. I imagine the average person doesn't know when to use which.
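For readers wondering what mapping typographic punctuation back to ASCII with an allow-list might look like, here is a minimal Python sketch. It is not Sanitext's actual code: the table, the function name `to_ascii_typography`, and the `allow_chars` parameter (named loosely after the `--allow-chars` flag mentioned above) are assumptions made for illustration.

```python
# Hypothetical sketch, not Sanitext's implementation.
# A small, incomplete table of typographic characters and ASCII stand-ins.
TYPOGRAPHY_TO_ASCII = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (typographic apostrophe)
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2212": "-",    # minus sign
    "\u2026": "...",  # horizontal ellipsis
}


def to_ascii_typography(text: str, allow_chars: str = "") -> str:
    """Replace typographic punctuation with ASCII, keeping any character
    that appears in allow_chars untouched."""
    keep = set(allow_chars)
    return "".join(
        ch if ch in keep else TYPOGRAPHY_TO_ASCII.get(ch, ch)
        for ch in text
    )


print(to_ascii_typography("I\u2019m here \u2013 ok"))  # I'm here - ok
```

Passing `allow_chars="\u2019"` in this sketch would leave the curly apostrophe alone while still converting the dash.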

gs17 4 months ago

> This isn’t a claim that major LLMs do all (or any) of these tricks. That said, I started working on this because I accidentally discovered an instance of text fingerprinting while debugging a byte-sensitive bug. That’s when I realized: it’s time to say goodbye to (at least these kinds of) fingerprints for good.

Are there any examples of this being used?

panpan2 4 months ago

Just try it :) I’ve definitely come across random variation selectors now and then. Otherwise, the most common case is typography: like em-dashes instead of hyphens, curly apostrophes, etc. But if you're feeding LLM output into a search tool, these subtle differences might not be helping you!
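If you want to check a piece of text yourself, a rough Python sketch like the one below lists invisible code points. It is only an illustration, not how Sanitext works internally: the function names are made up, and treating every format (`Cf`) character as suspicious is an assumption that will also flag legitimate uses.

```python
import unicodedata


def is_variation_selector(ch: str) -> bool:
    """True for code points in the two Unicode variation selector blocks."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def find_invisible(text: str) -> list[tuple[int, str, str]]:
    """Return (index, 'U+XXXX', name) for variation selectors and
    format (Cf) characters such as zero-width spaces and joiners."""
    hits = []
    for i, ch in enumerate(text):
        if is_variation_selector(ch) or unicodedata.category(ch) == "Cf":
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((i, f"U+{ord(ch):04X}", name))
    return hits


def strip_invisible(text: str) -> str:
    """Drop the same characters that find_invisible reports."""
    return "".join(
        ch for ch in text
        if not (is_variation_selector(ch) or unicodedata.category(ch) == "Cf")
    )
```

Note that stripping these blindly can change legitimate text: emoji sequences rely on U+FE0F and the zero-width joiner, so a real tool would probably want exceptions for them.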

theamk 4 months ago

I don't think this has any legitimate use, does it?

It seems this is just to support cheating and misinformation, and to generally make the web worse.

panpan2 4 months ago

Another viewpoint is that it's about privacy (e.g., unwanted tracking) and security (e.g., homograph attacks). As LLMs are increasingly used everywhere, this provides a way to normalize text as it moves between different systems.
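As a rough illustration of the homograph side, the Python sketch below folds a few look-alike letters back to Latin. The table is a tiny hand-picked assumption, not Sanitext's mapping; Unicode publishes much larger confusables data as part of its security mechanisms (UTS #39), which a serious tool would more likely build on.

```python
# Hypothetical, deliberately tiny table of look-alike letters.
HOMOGLYPHS = {
    "\u0406": "I",  # Cyrillic capital Byelorussian-Ukrainian I
    "\u0391": "A",  # Greek capital Alpha
    "\u0410": "A",  # Cyrillic capital A
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
}

_TABLE = str.maketrans(HOMOGLYPHS)


def fold_homoglyphs(text: str) -> str:
    """Replace the look-alike letters listed above with Latin letters."""
    return text.translate(_TABLE)


sample = "\u0406'm \u0391\u0406."   # renders like "I'm AI." but is not ASCII
print(sample.isascii())                   # False
print(fold_homoglyphs(sample))            # I'm AI.
print(fold_homoglyphs(sample).isascii())  # True
```

Folding like this would corrupt text that is genuinely written in Cyrillic or Greek, so it only makes sense where the input is expected to be Latin-script.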

Der_Einzige 4 months ago

Good. Cheating on interviews is a huge net positive for society. Unironically.
- theamk 4 months ago
  
  Are there interviews where the candidates submit large blocks of plain text? I am not aware of them. There are plenty of opportunities to cheat in coding challenges, but the Unicode tricks obviously won't apply there, since the programs won't compile.
  Instead, by "cheating", I mean cheating in schools and colleges, as well as passing off ChatGPT output as a blog post and pretending it's human-written. The first is very unfair to the other students who are not cheating, as well as to future employers who discover that their new employee can only write on topics that ChatGPT is familiar with. The latter is simply disrespectful to readers.
  - MandieD 4 months ago
    
    And in the end, unfair to the student themselves as they won't actually know how to write for themselves - and there are occasions in life when it is critical to be able to write for yourself, by hand. Not very many, but the ones that still exist are probably not going anywhere.