Email Obfuscation
There’s a niche collection of curated website directories revolving around page size. See 1MB Club, 10 KB Club, and the very tiny 1kB Club. While I like fast-loading pages as much as anyone, that’s not the main reason I find these directories interesting.
These directories almost entirely consist of bespoke websites made by individuals. I suppose it harkens back to the early days of the web when it was less corporate and more weird, but I wasn’t around for much of that period.
The other commonality in these directories is that many of the sites are personal websites, meaning they usually contain contact information, and I’ve noticed a minor trend: email obfuscation.
I’m not sure why people are doing this, but the obvious assumption is privacy and spam prevention. Here are some examples: [1]
My email: john (at) (second and top level domain)
Contact
Email jane @ this domain
Contact me at john.doe(at)gmail(dot)com.
[first name].z.[last name]@icloud.com
You can reach me by email at me [at] [this-domain].
The most common obfuscation strategy is to describe the email address using surrounding context on the site. This seems like it could be effective against simplistic bots, but in the unfolding age of Large Language Models, I wouldn’t count on it.
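To make "simplistic bots" a little more concrete, here is a rough sketch of what such a scraper might do. It is purely illustrative: the pattern and the helper names (`normalizeObfuscated`, `scrapeEmails`) are assumptions for this example, not taken from any real scraper.

```javascript
// Matches literal addresses such as "someone@example.com".
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Hypothetical helper: undo the common bracketed "(at)" / "[dot]" substitutions.
function normalizeObfuscated(text) {
  return text
    .replace(/\s*[(\[]\s*at\s*[)\]]\s*/gi, "@")
    .replace(/\s*[(\[]\s*dot\s*[)\]]\s*/gi, ".");
}

// Return every address this naive scraper can recover from a string.
function scrapeEmails(text) {
  return normalizeObfuscated(text).match(EMAIL_PATTERN) || [];
}

const samples = [
  "Contact me at john.doe(at)gmail(dot)com.",
  "You can reach me by email at me [at] [this-domain].",
  "My email: john (at) (second and top level domain)",
];

for (const sample of samples) {
  console.log(sample, "=>", scrapeEmails(sample));
}
```

In this sketch the bracketed (at)/(dot) form is recovered, while the two examples that lean on surrounding context come back empty. That is the gap the "describe it" strategy relies on.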
Attack of the LLMs
Let’s experiment with my site. I’ll copy my own about page, but I’ll use the “describe my email address” strategy. Here’s the prompt to ChatGPT 3.5:
Here’s some HTML from a website whose domain is mthadley.com. Please extract contact information from the website, including names and email addresses:
<main>
  <h2 id="hey-there" style="background-image: linear-gradient(41deg, var(--pink), var(--blue), var(--light-blue));">
    Hey there!
  </h2>
  <p>
    My name is <strong>Michael Hadley</strong>, and this is my personal site. I’m a programmer by trade, currently working remotely on web stuff at <a href="https://workos.com">WorkOS</a>, and living in Southern California. You can find me on <a href="https://github.com/mthadley">GitHub</a> to see what I’ve recently been working on.
  </p>
  <p>If you want to get in touch, you can email me at m at this domain.</p>
</main>
And here is ChatGPT’s response:
From the given HTML, I’ve extracted the contact information, including the name and email address:
Name: Michael Hadley
Email: [email protected]
Please note that the email address is obfuscated in the HTML as “m at this domain,” but it can be reconstructed as “[email protected].”
Ouch. That did not work so well. It even picked up on my obfuscation attempt and put a little disclaimer at the end. None of the other forms of this strategy fared any better.
One last tactic I came across was only revealing the email on click, with the address itself base64-encoded inline in a script tag. I again replicated this approach on my own site:
Here’s some HTML from a website whose domain is mthadley.com. Please extract contact information from the website, including names and email addresses:
<main>
  <h2 id="hey-there" style="background-image: linear-gradient(41deg, var(--pink), var(--blue), var(--light-blue));">
    Hey there!
  </h2>
  <p>
    My name is <strong>Michael Hadley</strong>, and this is my personal site. I’m a programmer by trade, currently working remotely on web stuff at <a href="https://workos.com">WorkOS</a>, and living in Southern California. You can find me on <a href="https://github.com/mthadley">GitHub</a> to see what I’ve recently been working on.
  </p>
  <p>
    If you want to get in touch, you can email me at <a id="email" href="">Show Email</a>.
  </p>
  <script>
    const email = atob("bUBtdGhhZGxleS5jb20=")
    const link = document.getElementById("email")
    link.addEventListener(
      "click",
      event => {
        event.preventDefault()
        link.setAttribute("href", `mailto:${email}`)
        link.innerText = email
      },
      { once: true }
    )
  </script>
</main>
Once again, ChatGPT’s response:
From the provided HTML, I can see that the contact information is embedded in the JavaScript code as a base64-encoded email address. Here’s the extracted contact information:
- Name: Michael Hadley
- Email: mthadley.com
Please note that the email address is encoded using base64 in the JavaScript code. It’s decoded using the atob function to reveal the actual email address. In this case, the actual email address is [email protected].
While it incorrectly displayed mthadley.com as my “Email”, it clearly picked up on the runtime decoding and still printed my actual email address in the end. Maybe this could be a viable strategy if the code was less readable, but my “naive” implementation isn’t cutting it.
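It doesn't even take an LLM to get past this one. As a minimal sketch, assuming a scraper that simply looks for string literals passed to `atob` (the names `ATOB_CALL`, `decodeBase64`, and `decodeInlineBase64` are hypothetical), the inline script above gives the address away:

```javascript
// Finds string literals passed directly to atob(), e.g. atob("...").
const ATOB_CALL = /atob\(\s*["'`]([A-Za-z0-9+/=]+)["'`]\s*\)/g;

// Decode base64 in browsers and Node (Buffer fallback for older Node versions).
function decodeBase64(encoded) {
  return typeof atob === "function"
    ? atob(encoded)
    : Buffer.from(encoded, "base64").toString("binary");
}

// Hypothetical scraper step: decode every inline atob() literal in a page.
function decodeInlineBase64(html) {
  return Array.from(html.matchAll(ATOB_CALL), (match) => decodeBase64(match[1]));
}

// The relevant line from the script tag shown earlier.
const snippet = 'const email = atob("bUBtdGhhZGxleS5jb20=")';
console.log(decodeInlineBase64(snippet));
```

Splitting the string, computing it at runtime, or otherwise making the code less readable would defeat this particular pattern, though presumably not a client that actually runs the script.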
Trial by Cloudflare
This site is hosted on Cloudflare Pages, which has email obfuscation as an out-of-the-box feature, so I was curious how it fared. Same prompt again, but I’ve included more of the surrounding HTML served by Cloudflare to give ChatGPT a better chance:
Here’s some HTML from a website whose domain is mthadley.com. Please extract contact information from the website, including names and email addresses:
<body>
  <header>
    <h1>
      <a href="/">
        <span aria-hidden="">></span> mthadley<span class="blink" aria-hidden="">_</span>
      </a>
    </h1>
  </header>
  <main>
    <h2 id="hey-there" style="background-image: linear-gradient(41deg, var(--pink), var(--blue), var(--light-blue));">
      Hey there!
    </h2>
    <p>
      My name is <strong>Michael Hadley</strong>, and this is my personal site. I’m a programmer by trade, currently working remotely on web stuff at <a href="https://workos.com">WorkOS</a>, and living in Southern California. You can find me on <a href="https://github.com/mthadley">GitHub</a> to see what I’ve recently been working on.
    </p>
    <p>
      If you want to get in touch, you can email me at <a href="/cdn-cgi/l/email-protection#1b765b766f737a7f777e6235787476"><span class="__cf_email__" data-cfemail="c8a588a5bca0a9aca4adb1e6aba7a5">[email protected]</span></a>.
    </p>
  </main>
  <footer>
    <nav>
      <ul>
        <li><a href="/">Home</a></li>
        <li><a href="/about">About</a></li>
        <li><a href="/feed.xml">RSS</a></li>
      </ul>
    </nav>
  </footer>
  <script data-cfasync="false" src="/cdn-cgi/scripts/5c5dd728/cloudflare-static/email-decode.min.js"></script>
  <script defer="" src="https://static.cloudflareinsights.com/beacon.min.js/v8b253dfea2ab4077af8c6f58422dfbfd1689876627854" integrity="sha512-bjgnUKX4azu3dLTVtie9u6TKqgx29RBwfj3QXYt5EKfWM/9hPSAI/4qcV5NACjwAo8UtTeWefx6Zq5PHcMm7Tg==" data-cf-beacon='{"rayId":"7f64c4074c56dbc6","version":"2023.7.0","r":1,"token":"a6727bf0020b45dc993287dbbbe77c93","si":100}' crossorigin="anonymous"></script>
</body>
Finally, the response:
I’ve extracted the contact information from the provided HTML:
Name: Michael Hadley
Email: [email protected]
Please note that the email address is encoded using a technique to protect it from email scraping bots. The actual email address is hidden within the JavaScript and would require further processing to decode it properly. If you’re the website owner and need to display this information in a usable format, you might need to adjust the way the email address is encoded or use a contact form instead.
Not bad, Cloudflare!
It still got my name, but it never recovered my email address. The address doesn’t appear in the HTML in plain text; the only traces are the encoded data-cfemail attribute and the matching link fragment, which ChatGPT didn’t decode, so the big bad AI came up empty. Though, I’d be in trouble if it learned to decode that attribute, or grew the ability to execute the JavaScript injected by Cloudflare, which doesn’t seem too far-fetched. Hopefully its client would be detected as a bot and the de-obfuscation script wouldn’t be served, but it’s hard to say.
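For the curious, the data-cfemail value in the markup above can apparently be decoded without running Cloudflare's script at all. The following is a minimal sketch, assuming the first hex byte is an XOR key applied to each remaining byte. That scheme is inferred from inspecting this markup, not taken from Cloudflare's documentation, and `decodeCfEmail` and `extractCfEmails` are hypothetical names.

```javascript
// Assumption: byte 0 is an XOR key; every following byte is one character XORed with it.
function decodeCfEmail(hex) {
  const key = parseInt(hex.slice(0, 2), 16);
  let decoded = "";
  for (let i = 2; i < hex.length; i += 2) {
    decoded += String.fromCharCode(parseInt(hex.slice(i, i + 2), 16) ^ key);
  }
  return decoded;
}

// Pull every data-cfemail value out of a chunk of HTML and decode it.
function extractCfEmails(html) {
  const pattern = /data-cfemail="([0-9a-f]+)"/gi;
  return Array.from(html.matchAll(pattern), (match) => decodeCfEmail(match[1]));
}

// The span served by Cloudflare in the prompt above.
const served = '<span class="__cf_email__" data-cfemail="c8a588a5bca0a9aca4adb1e6aba7a5">[email protected]</span>';
console.log(extractCfEmails(served));
```

If that assumption holds, this protection works against a model that doesn't know the scheme, but it wouldn't stop a scraper written specifically to target it.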
[1] I don’t want to call anyone out, so I’ve removed or anonymized names and won’t be linking to the original sites.