An intelligent extension of the training set for the Persian n-gram language model: an enrichment algorithm

Authors

  • Rezvan Motavallian Linguistics Department, Faculty of Foreign Languages, University of Isfahan https://orcid.org/0000-0002-2319-8414
  • Masoud Komeily Linguistics Department, Faculty of Foreign Languages, University of Isfahan

DOI:

https://doi.org/10.7764/onomazein.61.09

Keywords:

training corpus, n-gram language model, dependency parsing, enrichment algorithm, free word-order

Abstract

In this article, we introduce an automatic mechanism for intelligently extending the training set to improve the n-gram language model of Persian. Given the free word-order property of Persian, our enrichment algorithm diversifies the n-gram combinations in the baseline training data through dependency reordering, adding permissible sentences and filtering out ungrammatical ones using a hybrid empirical (heuristic) and linguistic approach. Experiments performed on the baseline training set (taken from a standard Persian corpus) and the resulting enriched training set indicate a declining trend in average relative perplexity (between 34% and 73%) for informal/spoken vs. formal/written Persian test data.
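
The core idea of reordering-based enrichment can be sketched as follows. This is a minimal illustration, not the authors' implementation: the chunk representation, the verb-final constraint, and the filter predicate are all simplifying assumptions made for the example (in the paper, reorderings are derived from full dependency parses and filtered by a hybrid heuristic/linguistic method).

```python
from itertools import permutations

def reorder_variants(chunks, verb, is_grammatical=lambda v: True):
    """Generate word-order variants of one clause by permuting the verb's
    dependents while keeping the verb clause-final (a common constraint in
    Persian, which is predominantly verb-final despite free word order).

    chunks: list of dependent phrases (each already a contiguous string)
    verb:   the clause-final verb
    is_grammatical: heuristic filter that rejects impermissible variants
    """
    variants = set()
    for perm in permutations(chunks):
        candidate = " ".join(list(perm) + [verb])
        if is_grammatical(candidate):
            variants.add(candidate)
    return sorted(variants)

# Illustrative (romanized) Persian clause: "Ali ketab-ra xarid" (Ali bought the book).
# Both orders of the dependents are permissible, so both enrich the training set.
print(reorder_variants(["Ali", "ketab-ra"], "xarid"))
# → ['Ali ketab-ra xarid', 'ketab-ra Ali xarid']
```

Each accepted variant is then appended to the baseline corpus, so the n-gram model sees word-order combinations that are grammatical in Persian but absent from the original training data.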

Published

2023-11-06

How to Cite

Motavallian, R., & Komeily, M. (2023). An intelligent extension of the training set for the Persian n-gram language model: an enrichment algorithm. Onomázein, (61), 191–211. https://doi.org/10.7764/onomazein.61.09

Issue

Section

Articles