"− (U+2212) is a minus sign instead of a dash"
This is technically true, but the character in the other version is a hyphen instead of a dash (though given the absence of dashes in ASCII, one, two, or three ASCII hyphens are often used in place of dashes in environments constrained to ASCII).
And while AI watermarking and fingerprinting are real, using typographically correct Unicode instead of base ASCII isn't really it (though I guess anything that transforms text in a way that reduces variety, as this does, will make some of it less effective).
Thanks for catching this, changed to "hyphen".
> And while AI watermarking and fingerprinting are real, using typographically correct Unicode instead of base ASCII isn't really it (though I guess anything that transforms text in a way that reduces variety, as this does, will make some of it less effective).
I disagree. Your "writing signature" changes when you go from never using proper typography to suddenly using it perfectly. If you don't typically follow typography rules, LLM-generated text can make your writing inconsistent and detectable, especially in notes, where some parts follow your natural style while others suddenly have perfect punctuation (e.g., to find something, you now need to search for both your usual punctuation and the LLM's version). Also, if you use an LLM to help rewrite a sentence within a longer piece, the output might include typographic details (like curly quotes or en dashes) that don't match the rest of your writing.
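To make the search point concrete, here is a minimal sketch in Python. It is not how the tool itself works: the mapping table and the names `fold_punctuation` and `search` are made up for illustration. It folds a few common typographic characters to ASCII on both the query and the notes, so one query matches either style.

```python
# Hypothetical mapping for illustration; a real tool may cover far more characters.
TYPOGRAPHIC_TO_ASCII = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (typographic apostrophe)
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2212": "-",    # minus sign
    "\u2026": "...",  # horizontal ellipsis
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(TYPOGRAPHIC_TO_ASCII)


def fold_punctuation(text: str) -> str:
    """Replace the typographic characters listed above with ASCII look-alikes."""
    return text.translate(_TABLE)


def search(notes: list[str], query: str) -> list[str]:
    """Return the notes containing the query, ignoring punctuation style and case."""
    needle = fold_punctuation(query).casefold()
    return [note for note in notes if needle in fold_punctuation(note).casefold()]


notes = ["I don\u2019t know", "I don't care", "unrelated"]
matches = search(notes, "don't")
print(matches)
```

Without the folding step, a plain substring search for `don't` would only find the second note.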
"- I'm AI. (Normal text) − І’m󠅘󠅟󠅜󠅑 ΑІ.󠅓󠅙󠅑󠅟 (AI-tainted text)
’ (U+2019) is a right single quotation mark instead of a regular quote"
I think AI just uses the correct apostrophe, doesn’t it?
https://dictionary.cambridge.org/ja/grammar/british-grammar/...
That's right! The same goes for en-dashes, em-dashes, and some other punctuation. While these aren’t ASCII, you can enable them with `--allow-chars` if you want to keep them. I imagine the average person doesn't know when to use which.
> This isn’t a claim that major LLMs do all (or any) of these tricks. That said, I started working on this because I accidentally discovered an instance of text fingerprinting while debugging a byte-sensitive bug. That’s when I realized: it’s time to say goodbye to (at least these kinds of) fingerprints for good.
Are there any examples of this being used?
Just try it :) I’ve definitely come across random variation selectors now and then. Otherwise, the most common case is typography, such as em dashes instead of hyphens and curly apostrophes. And if you're feeding LLM output into a search tool, these subtle differences can work against you!
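If you want to check a piece of text for variation selectors yourself, a rough Python sketch follows. It uses only the two Unicode variation selector blocks (U+FE00 to U+FE0F and U+E0100 to U+E01EF). The function names are hypothetical and not taken from the tool.

```python
def is_variation_selector(ch: str) -> bool:
    """True for code points in the two Unicode variation selector blocks."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def find_variation_selectors(text: str) -> list[tuple[int, str]]:
    """List (index, code point) for every variation selector in the text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_variation_selector(ch)
    ]


def strip_variation_selectors(text: str) -> str:
    """Drop all variation selectors; note this also affects emoji presentation."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


sample = "I'm\U000E0158 AI"  # an invisible selector hidden after "I'm"
found = find_variation_selectors(sample)
cleaned = strip_variation_selectors(sample)
print(found, repr(cleaned))
```

Keep in mind that variation selectors also have legitimate uses (U+FE0F requests emoji presentation, for example), so blanket stripping is a trade-off rather than a free win.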
I don't think this has any legitimate use, does it?
It seems this just supports cheating and misinformation, and generally makes the web worse.
Another viewpoint is that it's about privacy (e.g., unwanted tracking) and security (e.g., homograph attacks). As LLMs are increasingly used everywhere, this provides a way to normalize text as it moves between different systems.
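As a rough illustration of the homograph angle, the Python sketch below flags words that mix letters from more than one script. It is a heuristic, not the tool's method: the standard library has no script property, so it uses the first word of each character's Unicode name as a stand-in, and the function names are made up for this example.

```python
import unicodedata


def scripts_in_word(word: str) -> set[str]:
    """Approximate the scripts used by a word's letters.

    Assumption: the first word of the Unicode character name
    (e.g. "LATIN", "CYRILLIC", "GREEK") is a good enough proxy for the script.
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def suspicious_words(text: str) -> list[str]:
    """Return whitespace-separated words whose letters span several scripts."""
    return [word for word in text.split() if len(scripts_in_word(word)) > 1]


# Cyrillic "І" (U+0406) and Greek "Α" (U+0391) look like Latin "I" and "A".
sample = "\u0406\u2019m \u0391\u0406. fine"
flagged = suspicious_words(sample)
print(flagged)
```

This only catches mixed-script words. A word written entirely in look-alike characters from a single script would pass, which is where a proper confusables table would be needed.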