Uppsala Persian Corpus (UPC)

Behnam Esfahbod

Apr 27, 2017, 3:25:52 PM

to Persian Computing, Mojgan Seraji

The Uppsala Persian Corpus (UPC) (Seraji, 2015, Chapter 3, pp. 68-81) is a large, freely available Persian corpus. It is a modified version of the Bijankhan corpus (Bijankhan, 2004) with added sentence segmentation and consistent tokenization. It contains 2,704,028 tokens annotated with 31 part-of-speech tags.

The corpus was developed by Mojgan Seraji and is licensed under the GNU General Public License.

The corpus can be downloaded here: http://stp.lingfil.uu.se/~mojgan/UPC.html

The part-of-speech tags are listed with explanations in this table: http://stp.lingfil.uu.se/~mojgan/Table_tag.pdf

See also: Universal Dependencies for Persian

The Persian Universal Dependency Treebank (Persian UD) is the converted version of the Uppsala Persian Dependency Treebank (UPDT) (Seraji, 2015):

https://github.com/UniversalDependencies/UD_Persian

-Behnam

Behdad Esfahbod

Apr 27, 2017, 3:49:36 PM

to Behnam Esfahbod, Persian Computing, Mojgan Seraji

Our team at Google paid vendors to vocalize the UPC and contributed that back to the corpus.

--
--
http://persian-computing.org/
http://groups.google.com/group/persian-computing/

---
You received this message because you are subscribed to the Google Groups "Persian Computing" group.
To unsubscribe from this group and stop receiving emails from it, send an email to persian-computing+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

behdad
http://behdad.org/

Pejman Habashi

Jun 24, 2022, 7:22:46 PM

to Persian Computing

Although it is a very valuable resource for Persian, the Uppsala corpus has a lot of mistakes, some easy and some hard to catch and fix. I think it would be best to put it in a public repository (such as GitHub) and start proposing improvements to it. For example, search for "دایرث" in the corpus; I assume it was used instead of "دایره‌ی". I have found a few similar mistakes and fixed them (to the best of my knowledge); however, I do not know where I can publish the fixes so that the rest of the community can benefit from them.
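
The search suggested above could be sketched roughly as follows. This is only an illustration: the function name `find_suspect` and the sample lines are hypothetical, and nothing here assumes the actual file layout of the UPC beyond it being UTF-8 text read line by line.

```python
def find_suspect(lines, suspect):
    """Return (line_number, line) pairs for every line containing `suspect`."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if suspect in line:
            hits.append((number, line.rstrip("\n")))
    return hits


# Hypothetical sample lines, not taken from the corpus itself.
sample = [
    "شعاع دایرث را حساب کنید\n",
    "این خط مشکلی ندارد\n",
]

hits = find_suspect(sample, "دایرث")
for number, line in hits:
    print(number, line)
```

To run this on the real corpus, one would pass an open file instead of `sample`, for example `open(path, encoding="utf-8")`, where `path` is wherever the downloaded corpus was saved.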


Peter von Kaehne

Jun 25, 2022, 3:56:34 AM

to Pejman Habashi, Persian Computing

If you maintain a change set or patch set, could you keep that on GitHub? That is where I would search first, at least. What you probably want to avoid, unless your interest is massive and lasting, is becoming a new publisher of the source.
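
One possible shape for such a patch set, as a minimal sketch: keep the corrections as a small table and generate a unified diff from it with Python's standard `difflib`, so that only the diff needs to be published rather than the corpus itself. The names `CORRECTIONS`, `apply_corrections`, `make_patch` and the file name `corpus.txt` are hypothetical, and the single correction listed is the one assumed earlier in this thread.

```python
import difflib

# Assumed correction from the thread; extend with further (wrong, right) pairs.
CORRECTIONS = {"دایرث": "دایره‌ی"}


def apply_corrections(lines, corrections):
    """Return a copy of `lines` with every wrong form replaced by its fix."""
    fixed = []
    for line in lines:
        for wrong, right in corrections.items():
            line = line.replace(wrong, right)
        fixed.append(line)
    return fixed


def make_patch(original, corrected, name="corpus.txt"):
    """Build a unified diff between the original and corrected lines."""
    diff = difflib.unified_diff(
        original, corrected, fromfile="a/" + name, tofile="b/" + name
    )
    return "".join(diff)


# Hypothetical sample lines, not taken from the corpus itself.
original = ["شعاع دایرث\n", "خط دوم\n"]
corrected = apply_corrections(original, CORRECTIONS)
patch = make_patch(original, corrected)
print(patch)
```

Plain string replacement is used here for simplicity; a real change set would probably need to be token-aware so that a correction does not touch longer words that happen to contain the same characters.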

Sent from my phone. Please forgive misspellings and weird “corrections”



To view this discussion on the web visit https://groups.google.com/d/msgid/persian-computing/edd5a19e-f669-471d-b1fb-c1faea1d2b32n%40googlegroups.com.
