raduloff.dev


building an email verifier for fun (and profit)

Boris Radulov
published on 2024-07-23

1. the problem

Anyone who’s tried to do their own email marketing in-house (or, god forbid, even run their own mail server) can attest to how much of a nightmare modern email is.

However, one thing remains the same. One of the best ways to ensure your emails get to your recipients’ inboxes is to lower your bounce rate and increase your open rate. These numbers are directly correlated with the quality of your email list: are the emails there active, valid, recent, non-catch-all, etc.? While data brokers and marketing agencies might tell you they’re offering high-quality data, the highest form of trust is verification.

So here’s my problem: I have a few million business emails (with some extra metadata) on mostly US-based companies. Who do I send to, knowing that they’re likely to receive and open it?

2. making sure they receive it

Here are some of the main methods I’ve found to make sure the emails I send at least get delivered.

the email has to be valid (duh)

Use a nice regex to verify the email is valid according to RFC 5322. I recommend this one. I’ve also found this regex to be a good litmus test of the quality of your data. If you’re getting anything more than 0.1% invalid emails, your dataset is probably garbage.
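As a rough illustration of this check, here is a minimal Python sketch. The pattern below is an assumption: a simplified stand-in that covers the common unquoted address form, not the full RFC 5322 grammar and not the regex recommended above. The helper names (is_valid_email, invalid_ratio) are made up for this example.

```python
import re

# Simplified pattern (assumption): dot-separated runs of allowed characters in
# the local part, then a domain with at least two labels. Quoted local parts,
# comments and IP-literal domains from RFC 5322 are deliberately not handled.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
    r"@(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+"
    r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
)


def is_valid_email(address: str) -> bool:
    """Return True if the whole string matches the simplified pattern."""
    return EMAIL_RE.fullmatch(address) is not None


def invalid_ratio(addresses: list) -> float:
    """Fraction of addresses that fail the syntax check (0.0 for an empty list)."""
    if not addresses:
        return 0.0
    bad = sum(1 for a in addresses if not is_valid_email(a.strip()))
    return bad / len(addresses)
```

Running invalid_ratio over the whole list gives the number to compare against the 0.1% rule of thumb.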

check the domains

MX records tell you where the mail server lives, and you can check them with dig (dig example.com MX). Besides those, email servers often need a bunch of other DNS records to work properly.
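The MX check can be scripted by shelling out to dig and parsing its short output. This is a sketch under assumptions: it expects the dig binary on PATH, relies on the "priority host" line format that dig +short prints for MX records, and the function names are hypothetical.

```python
import subprocess


def parse_mx_lines(dig_output: str) -> list:
    """Parse `dig +short <domain> MX` output into (priority, host) pairs, best first."""
    records = []
    for line in dig_output.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank lines and anything that is not "priority host"
        host = parts[1].rstrip(".").lower()
        if not host:
            continue  # "0 ." is a null MX (RFC 7505): the domain accepts no mail
        records.append((int(parts[0]), host))
    return sorted(records)


def lookup_mx(domain: str) -> list:
    """Run dig for the domain's MX records; an empty list means no usable MX."""
    result = subprocess.run(
        ["dig", "+short", domain, "MX"],
        capture_output=True, text=True, timeout=10, check=False,
    )
    return parse_mx_lines(result.stdout)
```

A domain that comes back with an empty list is a reasonable candidate to drop before doing any of the more expensive checks.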

For example, SPF records are TXT records that list all the hosts that can send emails for that domain. You can check for one by doing dig example.com TXT and looking for records that start with the string v=spf1. Here’s an example one for my company’s Google Workspace: cbt.bg. 10800 IN TXT "v=spf1 include:_spf.google.com ~all"
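Picking the SPF record out of a domain's TXT values could look like the sketch below. It assumes the TXT strings have already been fetched (for example from dig output) and only does the string filtering; find_spf and spf_terms are hypothetical helper names.

```python
from typing import Optional


def find_spf(txt_records: list) -> Optional[str]:
    """Return the first TXT value that is an SPF record, or None."""
    for record in txt_records:
        value = record.strip().strip('"')  # dig prints TXT values in quotes
        lowered = value.lower()
        if lowered == "v=spf1" or lowered.startswith("v=spf1 "):
            return value
    return None


def spf_terms(spf_record: str) -> list:
    """Split an SPF record into its mechanisms and modifiers, dropping the version tag."""
    return spf_record.split()[1:]
```

For the record quoted above, spf_terms returns the two terms include:_spf.google.com and ~all.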

DMARC records are another important TXT record that tells the receiver what to do when it believes an email is spoofed. You can check for one by querying the TXT records of the _dmarc subdomain (dig _dmarc.example.com TXT) and filtering for strings starting with v=DMARC1.
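A DMARC policy is published as a TXT record on the _dmarc subdomain and is a semicolon-separated list of tag=value pairs beginning with v=DMARC1. A minimal parsing sketch, assuming the TXT value has already been fetched; the function names and the example.com addresses are illustrative only.

```python
from typing import Optional


def dmarc_query_name(domain: str) -> str:
    """DNS name that holds the DMARC policy for a domain."""
    return f"_dmarc.{domain}"


def parse_dmarc(txt_value: str) -> Optional[dict]:
    """Parse a DMARC TXT value into a tag dict, or return None if it is not DMARC."""
    value = txt_value.strip().strip('"')
    tags = {}
    for part in value.split(";"):
        if "=" in part:
            key, _, val = part.partition("=")  # split on the first "=" only
            tags[key.strip().lower()] = val.strip()
    if tags.get("v") != "DMARC1":
        return None
    return tags
```

The p tag in the result (none, quarantine or reject) is the policy the domain asks receivers to apply.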

DKIM is the third important DNS record that you need to look at, however it’s a lot trickier. It’s another record that fights spoofing, this time through public-key cryptography, but finding it requires you to know the selector, because you need to query for $selector._domainkey.example.com, where $selector could be anything. This means that in our use case, you need to brute-force DNS queries until you find the one you need. I’ve compiled a short list of the most common selectors here: DKIM selectors.
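The selector brute-force can be sketched as a loop over candidate names. Everything here is an assumption for illustration: the selector list is a small made-up sample rather than the compiled list mentioned above, and resolve_txt is a hypothetical callable (wrap dig or a DNS library) that returns the TXT values for a name, or an empty list.

```python
from typing import Callable, Optional

# Illustrative sample only; substitute a real selector list.
SAMPLE_SELECTORS = ["google", "default", "selector1", "selector2", "k1"]


def dkim_query_name(selector: str, domain: str) -> str:
    """DNS name where a DKIM public key for this selector would be published."""
    return f"{selector}._domainkey.{domain}"


def find_dkim_selector(
    domain: str,
    resolve_txt: Callable[[str], list],
    selectors: list = SAMPLE_SELECTORS,
) -> Optional[str]:
    """Return the first selector whose TXT record looks like a DKIM key, else None."""
    for selector in selectors:
        for value in resolve_txt(dkim_query_name(selector, domain)):
            cleaned = value.strip().strip('"')
            # Key records carry the public key in a p= tag; v=DKIM1 is common but optional.
            if cleaned.lower().startswith("v=dkim1") or "p=" in cleaned:
                return selector
    return None
```

Passing the resolver in as a function keeps the DNS client swappable and makes the loop easy to test with a dictionary of canned answers.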

trim the fat

Two more things you can do to remove emails you probably shouldn’t be sending to are dropping disposable domains and dropping obvious catch-all users such as office@. Here are some lists you can use: disposable_domains.txt and catchall_users.txt.
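Both filters reduce to set lookups on the two halves of the address. In this sketch the two sets are tiny hand-picked stand-ins, not the contents of disposable_domains.txt or catchall_users.txt, and should_skip is a hypothetical name.

```python
# Stand-in samples (assumption); in practice, load these from the list files.
DISPOSABLE_DOMAINS = {"mailinator.com", "guerrillamail.com"}
CATCHALL_USERS = {"office", "info", "contact", "admin", "sales", "support"}


def should_skip(
    address: str,
    disposable_domains: set = DISPOSABLE_DOMAINS,
    catchall_users: set = CATCHALL_USERS,
) -> bool:
    """True if the address uses a disposable domain or a generic shared mailbox name."""
    # rpartition splits on the last "@", so the domain is always the final part.
    local, _, domain = address.strip().lower().rpartition("@")
    return domain in disposable_domains or local in catchall_users
```

Lowercasing before the lookup matters because list files are usually lowercase while scraped data often is not.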

the holy grail: SMTP handshake

While the above methods are useful as heuristics, the best way to verify that the actual email exists and can be delivered to is by simulating an SMTP handshake. You can read more about this process in the relevant RFC (RFC 5321), but the gist of it is as follows: you connect to the server, pretend you’re gonna send an email, and watch how the server reacts. Here’s an example transcript of me doing this on my company’s mail server (hosted by Google):

12:05:10 boris@fedora ~ → nc smtp.google.com 25
220 mx.google.com ESMTP 4fb4d7f45d1cf-5aa1b524138si1712282a12.603 - gsmtp
HELO localhost
250 mx.google.com at your service
MAIL FROM: <[email protected]>
250 2.1.0 OK 4fb4d7f45d1cf-5aa1b524138si1712282a12.603 - gsmtp
RCPT TO: <[email protected]>
250 2.1.5 OK 4fb4d7f45d1cf-5aa1b524138si1712282a12.603 - gsmtp
^C

If you get an OK status code (2XX), you’re on the right track. If you get some other status code, probably don’t send there.
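The same handshake can be driven from Python's standard smtplib. This is a sketch, not a hardened verifier: the host and both addresses are supplied by the caller, the three-way classification is an assumption layered on the 2XX rule above (4XX replies are temporary failures such as greylisting, so they are treated as "try again" rather than "bad"), and the function names are hypothetical.

```python
import smtplib


def classify_reply(code: int) -> str:
    """Map an SMTP reply code to a coarse verdict."""
    if 200 <= code < 300:
        return "accepted"
    if 400 <= code < 500:
        return "retry_later"  # temporary failure, e.g. greylisting or rate limiting
    return "rejected"


def probe_recipient(mx_host: str, sender: str, recipient: str, timeout: float = 10.0):
    """Run HELO / MAIL FROM / RCPT TO against mx_host and return (code, message).

    No DATA command is issued, so no message is ever sent. Leaving the
    `with` block sends QUIT and closes the connection.
    """
    with smtplib.SMTP(mx_host, 25, timeout=timeout) as server:
        server.helo()
        server.mail(sender)
        code, message = server.rcpt(recipient)
    return code, message.decode(errors="replace")
```

A caller would pass the best host from the MX lookup as mx_host and feed the returned code to classify_reply.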

Bonus points: try sending to a username that’s a random string. This way, you can tell if the mail server has a catch-all receiver set up. Generally, you don’t want to be sending to those either.
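The random-username trick, sketched under the assumption that some function already performs the RCPT TO probe and returns the reply code. That function is passed in as rcpt_code, a hypothetical callable; the names below are made up for this example.

```python
import secrets
from typing import Callable


def random_address(domain: str) -> str:
    """Build an address whose local part is 24 random hex characters."""
    return f"{secrets.token_hex(12)}@{domain}"


def is_catch_all(domain: str, rcpt_code: Callable[[str], int]) -> bool:
    """True if the server accepts mail for a mailbox that almost certainly does not exist."""
    return 200 <= rcpt_code(random_address(domain)) < 300
```

Since the answer is per domain rather than per address, it makes sense to run this once per domain and cache the result.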

3. conclusion

Of course, there are SMTP and DNS client libraries that allow you to automate this process. All of this could be turned into a big script that processes CSV files in bulk. However, you’d need multiple IP addresses, as this process gets bottlenecked on network and IP blacklisting. Additionally, you’d need a database that can take in a lot of data at once and a system for message passing to all the secondary services that are doing the actual verification. This is outside the scope of this post, but might be worth exploring (or even building as a product) in the future.
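For a sense of what the single-machine version of that script might look like, here is a small sketch that runs a set of checks over CSV rows. It leaves out everything that makes the real problem hard (IP rotation, storage, distribution). The column name email, the function verify_rows and the sample data are all assumptions.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List


def verify_rows(
    lines: Iterable[str],
    checks: Dict[str, Callable[[str], bool]],
    email_column: str = "email",
) -> List[dict]:
    """Run every check on each row's address and add the verdicts as new columns."""
    results = []
    for row in csv.DictReader(lines):
        address = (row.get(email_column) or "").strip()
        verdicts = {name: check(address) for name, check in checks.items()}
        # "send" is True only if every check passed for this address.
        results.append({**row, **verdicts, "send": all(verdicts.values())})
    return results


if __name__ == "__main__":
    # Made-up sample data; a real run would pass an open file instead.
    sample = io.StringIO("email,company\njane@example.com,Acme\nnot-an-email,Foo\n")
    for result in verify_rows(sample, {"has_at": lambda a: "@" in a}):
        print(result)
```

Each check is just a function from address to bool, so the syntax, DNS and SMTP checks can be plugged in one by one.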