Project Details
Description
Modern Standard Arabic is a morphologically and syntactically complex language, the understanding of which challenges both humans and machines. We propose to study the problems encountered in automatically correcting Arabic text, a task that covers errors in spelling, lexical choice, and grammar (morphology and syntax). Our approach is twofold. First, we will build a large corpus (~2M words) of human-corrected Arabic text produced by native speakers, non-native speakers, and machines. The QALB corpus (Qatar Arabic Language Bank) will provide a resource for training and testing automatic-correction systems. QALB's annotations will also support several other Arabic NLP efforts. Second, we will build a general system (ACLE) for automatically correcting Arabic-language errors. In developing ACLE, we will investigate and compare various methods for detecting and correcting errors, methods that rely on differing degrees of training-data availability. To compensate for the sparsity of error-correction data, our methods will incorporate models that reflect Arabic's complex morphology and orthography. This project will be the first to study automatic Arabic error correction using large-scale, manually annotated data, and it will form the basis for a shared-task workshop on Arabic automatic correction.
| Status | Finished |
|---|---|
| Effective start/end date | 4/1/12 → 8/16/15 |
ASJC Scopus Subject Areas
- Information Systems
- Social Sciences(all)