Access to the original LaTeX files


Alex Kleiner

May 15, 2017, 2:17:10 PM
to arXiv api
Hi,

Many researchers upload (or used to upload) the original LaTeX file, from which a PDF was generated on your server.
For the purpose of data mining, LaTeX would be a far more useful resource than PDF.
Are those LaTeX files available?

Thanks,

Alex 

Thorsten S

May 15, 2017, 2:32:11 PM
to arXiv api

Dear Alex,

please search this thread before asking repeat questions; earlier posts point to bulk source file access via S3.


Cheers
T.


--
You received this message because you are subscribed to the Google Groups "arXiv api" group.
To unsubscribe from this group and stop receiving emails from it, send an email to arxiv-api+unsubscribe@googlegroups.com.
To post to this group, send email to arxi...@googlegroups.com.
Visit this group at https://groups.google.com/group/arxiv-api.
For more options, visit https://groups.google.com/d/optout.

Grady D

May 15, 2017, 3:28:25 PM
to arxi...@googlegroups.com
I was able to download and decompress all of the source files, and to convert some of them to plaintext. I am happy to answer any questions you may have.

LaTeX source is not available for all documents (some are HTML, DOCX, PS, etc.), so the corpus was difficult for me to work with.

Sincerely, 
Grady Daniels
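
For anyone attempting the same thing, here is a minimal sketch of how a downloaded source item might be sorted into "TeX" versus "something else". It is not Grady's code and not an arXiv tool. It assumes each item is either a gzipped tar archive, a gzipped single file, or a bare file such as a PDF; the function name `classify_source` and the marker list are made up for illustration.

```python
import gzip
import io
import tarfile

# Byte strings that usually appear in a (La)TeX main file (an assumption,
# not an exhaustive list).
TEX_MARKERS = (b"\\documentclass", b"\\documentstyle", b"\\begin{document}")


def classify_source(data: bytes) -> str:
    """Return a coarse label for one source item: 'tex', 'pdf', 'ps' or 'other'."""
    # Unwrap one layer of gzip if the gzip magic bytes (1f 8b) are present.
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)

    # If the payload is a tar archive, look for any member ending in .tex.
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tar:
            names = [member.name.lower() for member in tar.getmembers()]
        return "tex" if any(name.endswith(".tex") for name in names) else "other"
    except tarfile.TarError:
        pass  # not a tar archive, so treat the payload as a single file

    # Single-file submissions: check well-known leading bytes first.
    if data.startswith(b"%PDF"):
        return "pdf"
    if data.startswith(b"%!PS"):
        return "ps"
    if any(marker in data for marker in TEX_MARKERS):
        return "tex"
    return "other"
```

Running this over every item and tallying the labels with `collections.Counter` would give a rough share of TeX-based submissions. Expect misclassifications (for example, TeX sources whose main file has no `.tex` extension inside the archive).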

Thorsten S

May 15, 2017, 3:31:06 PM
to arXiv api
On Mon, May 15, 2017 at 12:52 PM, Grady D <gsmai...@gmail.com> wrote:
I was able to download and decompress all of the source files, and to convert some of them to plaintext. I am happy to answer any questions you may have.

LaTeX source is not available for all documents (some are HTML, DOCX, PS, etc.), so the corpus was difficult for me to work with.


right, arXiv doesn't have TeX source unless the author provided it. Other submission formats are accepted, albeit discouraged.

Cheers
T.

Johnathan Mercer

May 31, 2023, 9:32:52 AM
to arXiv API
Hi Grady, do you remember what percentage of documents had LaTeX source, or do you know of a current (2023) estimate?


Jake Weiskoff

May 31, 2023, 9:37:27 AM
to arxi...@googlegroups.com
About 89% of the corpus is TeX-based.

Regards,
-Jake Weiskoff (he/him)
arXiv Technical Support Manager


Johnathan Mercer

May 31, 2023, 1:42:26 PM
to arXiv API
Oh wow, thanks Jake!