What kind of Hebrew content is in Common Crawl?

Albatross

May 21, 2022, 10:00:18 AM
to Common Crawl
What kind of Hebrew content is in Common Crawl?

How many "docs"? How many tokens?

Is there a list of topics, such as legal docs, government docs, etc.?

Henry S. Thompson

May 21, 2022, 1:48:05 PM
to common...@googlegroups.com
I happen to have a tabulation of the languages reported when a single
language is detected for CC-MAIN-2019-35, as reported in the index
(cdx files). Hebrew is in 30th place:

1 eng 1,156,553,881
2 zho 162,611,350
3 rus 132,453,629
4 deu 97,254,471
5 spa 81,140,006
6 fra 72,782,475
7 jpn 62,983,782
8 NA 62,831,851
9 por 42,768,802
10 pol 39,112,080
11 ita 34,186,127
12 nld 25,915,757
13 ces 24,771,767
14 tur 19,183,539
15 vie 16,990,376
16 swe 12,444,601
17 hun 11,688,805
18 fas 11,521,816
19 ara 8,788,615
20 ron 8,745,252
21 kor 8,257,168
22 fin 7,777,679
23 dan 7,490,724
24 slk 7,089,628
25 ind 6,711,637
26 lit 5,743,618
27 ukr 5,703,678
28 nor 5,669,081
29 ell 5,348,743
30 heb 5,335,808

You can use the query interface to retrieve just the 'heb' files...
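
As a rough illustration of what "retrieving just the 'heb' files" could look like once index (cdx) lines are in hand, here is a minimal Python sketch. It assumes each cdx line has the layout "<urlkey> <timestamp> <JSON record>" and that the JSON record carries a "languages" field holding a comma-separated list of detected language codes; the function name and the sample lines are hypothetical and not taken from a real crawl. It keeps only records where Hebrew is the single detected language, matching the tabulation above.

```python
import json


def single_language_records(cdx_lines, lang="heb"):
    """Yield parsed index records whose only detected language is `lang`.

    Assumption: each line looks like "<urlkey> <timestamp> <json>", and the
    JSON object has a "languages" field such as "heb" or "heb,eng".
    """
    for line in cdx_lines:
        parts = line.strip().split(" ", 2)
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        try:
            record = json.loads(parts[2])
        except json.JSONDecodeError:
            continue  # skip lines whose third field is not valid JSON
        # Exact match means a single language was detected, and it is `lang`.
        if record.get("languages") == lang:
            yield record


# Hypothetical sample lines, for illustration only.
sample = [
    'il,co,example)/ 20190817000000 {"url": "http://example.co.il/", "languages": "heb"}',
    'com,example)/ 20190817000001 {"url": "http://example.com/", "languages": "eng"}',
    'il,org,example)/ 20190817000002 {"url": "http://example.org.il/", "languages": "heb,eng"}',
]
hebrew = list(single_language_records(sample))
```

With the sample above, only the first record is kept; the third is dropped because two languages were detected for it. To include pages where Hebrew is one of several detected languages, the exact comparison could be swapped for a membership test on the comma-split list.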

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
