That sounds about right; by eyeball measure, it looks like this should come out to about 30-40k pages.
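If it helps to double-check that figure, a quick count via the Wayback Machine CDX API would look roughly like the sketch below; the query parameters (prefix match, collapse on urlkey) are my guesses at a reasonable query, not something I've verified against this listing:

import requests

# Minimal sketch: count unique captured URLs under list.seqfan.eu via the
# Wayback Machine CDX API instead of eyeballing the wildcard listing.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "list.seqfan.eu/*",   # trailing * asks for a prefix match
        "output": "json",
        "collapse": "urlkey",        # one row per unique URL rather than per snapshot
        "fl": "original,timestamp",
    },
    timeout=120,
)
rows = resp.json()
print(f"{len(rows) - 1} unique captured URLs")  # first row is the column header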
https://web.archive.org/web/*;type=text/list.seqfan.eu/*

What about the Yahoo Groups data? Does that still need to be recovered, or, if it's already covered, is there some way I can assist?
From what I can tell, the Yahoo Groups data appears to be well captured in Common Crawl's WARC files across a few years, in a format like this:
{"numRecords": 1, "recFirstNextTopic": 0, "recFirstLastPosted": 0, "digestNum": 0, "recFirstTopicStatus": 0, "subject": "Hypergeometric 2F1", "yahooAlias": "grafixpl", "author": "Artur", "topicLastRecord": 162, "topicInfoStatus": 0, "recFirstTopicFirstRecord": 0, "topicStatus": 0, "email": "grafix@...", "firstRecInfoStatus": 2, "parent": 0, "recFirstTopicNextRecord": 0, "prevTopic": 154, "recFirstDigestNum": 0, "nextTopic": 0, "lastPosted": 1225116634, "date": 1225116634, "recFirstPrevTopic": 0, "topicNextRecord": 0, "recFirstTopicLastRecord": 0, "hasAttachments": 0, "threadLevel": 0, "topicPrevRecord": 0, "recFirstTopicPrevRecord": 0, "summary": "Dear Richard, Thank you for this formula!!!! That mean that roots of my quintic polynomial have also geometric interpretation! Root[4 k - k2 + 5 k2 x + (20 k -", "length": 2219, "messageId": 162, "recFirstNumRecords": 0, "topicFirstRecord": 0}
I run a computer science organization, and I know that parsing Common Crawl and extracting years' worth of WARCs is a compute-intensive and extremely frustrating task if you don't already have the infrastructure built to do so, so I imagine I may be able to save a lot of effort on this front.
Otherwise, I'm very capable when it comes to large-scale data, so please do use me as a resource if that's at all helpful.
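For concreteness, roughly how I'd scope the extraction is sketched below: query the Common Crawl CDX index per crawl to get WARC filenames and byte ranges, then range-fetch only those records instead of whole crawls. The crawl IDs and URL pattern here are placeholders, not a tested pipeline:

import json

import requests

CRAWLS = ["CC-MAIN-2017-04", "CC-MAIN-2017-09"]  # placeholder crawl IDs, not a checked list

def cc_index_hits(crawl, url_pattern):
    # Query the Common Crawl CDX index for one crawl; each line of the response is
    # a JSON object naming the WARC file plus byte offset/length of one capture.
    resp = requests.get(
        f"https://index.commoncrawl.org/{crawl}-index",
        params={"url": url_pattern, "output": "json"},
        timeout=120,
    )
    resp.raise_for_status()
    for line in resp.text.splitlines():
        yield json.loads(line)

for crawl in CRAWLS:
    for hit in cc_index_hits(crawl, "groups.yahoo.com/*"):
        # With offset/length in hand, only the matching records need to be
        # range-fetched from data.commoncrawl.org rather than full WARC files.
        print(hit["filename"], hit["offset"], hit["length"])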
Appreciated,