CT logs => DNS Database

Pierre Barre

Oct 15, 2024, 3:20:33 PM
to certificate-transparency
Hello,

I recently shared my work regarding building new tooling to search CT logs.

Building upon this work, I have developed a pipeline to create a DNS records dataset based on the hostnames found in CT logs. 
The dataset includes all the common DNS record types (A, AAAA, ANAME, CAA, CNAME, HINFO, HTTPS, MX, NAPTR, NS, PTR, SOA, SRV, SSHFP, SVCB, TLSA, TXT).
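
To make the lookup step concrete, here is a minimal sketch of the kind of resolution such a pipeline performs, written in Python with the dnspython library. It is an illustration only, not the pipeline used to build the dataset: the record-type list is trimmed and error handling is reduced to skipping failed lookups.

import dns.exception
import dns.resolver  # pip install dnspython

# A subset of the record types covered by the dataset.
RECORD_TYPES = ["A", "AAAA", "CAA", "CNAME", "MX", "NS", "TXT"]

def resolve_host(hostname):
    """Query each record type for one hostname and collect the answers as text."""
    records = {}
    for rdtype in RECORD_TYPES:
        try:
            answer = dns.resolver.resolve(hostname, rdtype)
            records[rdtype] = [rdata.to_text() for rdata in answer]
        except dns.exception.DNSException:
            # NoAnswer, NXDOMAIN, timeouts, etc. -- skip this record type.
            continue
    return records

print(resolve_host("example.com"))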
 
Some stats:

Number of DNS records: 4,028,766,034
Uncompressed size: 211 GB
Compressed size: 28 GB

The dataset is available for download at: https://www.merklemap.com/dns-records-database

As far as I know, it's the biggest public DNS records dataset that currently exists.

I believe this dataset will be valuable for various applications and research purposes. If you have any questions or feedback, please don't hesitate to reach out.

Best regards,
Pierre

Andrew C Aitchison

Oct 15, 2024, 5:39:03 PM
to certificate-transparency
Is this a one-off, or do you intend to update it?

If you intend to update it regularly or frequently, you might wish to
consider rsync or zsync (which works with "any" web server)
to reduce the amount of data transferred.

--
Andrew C. Aitchison Kendal, UK
and...@aitchison.me.uk

Pierre Barre

Oct 15, 2024, 5:44:14 PM
to Andrew C Aitchison, certificate-transparency
Hi,

I plan to do two releases a month. However, I need to improve some automation here and there to be able to achieve that.

I'm not particularly concerned about bandwidth as it's hosted on Cloudflare R2, and egress is supposed to be free (as long as it's not costing them too much?).

Making one release a day would be feasible and great for history-building purposes, but I don't currently have the compute capacity to do that. I needed around 6,000 CPU-hours to make this one.
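(Back-of-the-envelope: at 6,000 CPU-hours per run, a daily release would need roughly 6,000 / 24 ≈ 250 cores running around the clock.)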

Best,
Pierre

Pierre Barre

Oct 25, 2024, 9:34:21 AM
to certificate-transparency

Hello,

Based on the feedback received, I have improved the format by:

- Adding timestamps
- Grouping querysets and results
- Saving errors

The documentation is available at: https://www.merklemap.com/documentation/dns-records-database

The data file remains accessible at the previous location: https://www.merklemap.com/dns-records-database

It's basically a JSON Lines (https://jsonlines.org/) file where each line is a structure that looks like:

{
  "hostname": "example.com",
  "results": [
    {
      "success": {
        "query": "example.com IN A",
        "query_timestamp": "2024-10-22T21:48:45.415226364Z",
        "records": {
          "A": ["192.0.2.1", "192.0.2.2"]
        }
      }
    }
  ]
}
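
As a consumption example, here is a minimal sketch in Python that streams the file line by line and prints the A records per hostname. It assumes a local copy named dns-records.jsonl (a placeholder name) and the field layout shown above:

import json

def iter_a_records(path):
    # Each line of the JSON Lines file is one hostname entry.
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            for result in entry.get("results", []):
                success = result.get("success")
                if success is None:
                    # Entries without a "success" key (e.g. saved errors) are skipped here.
                    continue
                for address in success.get("records", {}).get("A", []):
                    yield entry["hostname"], address

for hostname, address in iter_a_records("dns-records.jsonl"):
    print(hostname, address)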


Best regards,
Pierre Barre
