
Uyghur morphological analysis using joint conditional random fields: Based on small scaled corpus

  • * Corresponding author: Ghalip Abdukerim

  • Uyghur morphological analysis is a fundamental task in natural language processing. It determines the part of speech (POS) of each word in a given sentence, segments the word into morphemes (stem and affixes), and automatically annotates the grammatical function of each morpheme based on the context. Its output supplies information needed by other natural language processing tasks, including syntactic analysis, machine translation, automatic summarization, and semantic analysis. To improve the performance of morphological analysis, this paper proposes a hybrid approach that builds a statistical model for Uyghur morphological tagging from a small-scale corpus. Experimental results show that the approach achieves an overall accuracy of 92.58% with a limited training corpus.

    Mathematics Subject Classification: Primary: 68T50, 68U15; Secondary: 60G60.

  • Figure 1.  The morphological analysis result and hierarchical relationship of a Uyghur sentence

    Figure 2.  The Architecture of a semi-supervised morphological analysis based on the hybrid approach

    Figure 3.  Morphological Tag Decoding Process of Words in the Sentence

    Figure 4.  The Relationship between Parameter $\beta$ and Accuracy

    Table 1.  Feature Template of POS Tagging Model

    | Features | Description |
    |---|---|
    | ${{w}_{i-2}}{{pos}_{i}}$, ${{w}_{i-1}}{{pos}_{i}}$, ${{w}_{i}}{{pos}_{i}}$, ${{w}_{i+1}}{{pos}_{i}}$, ${{w}_{i+2}}{{pos}_{i}}$ | Unary context features of the word |
    | ${{w}_{i-2}}{{w}_{i-1}}{{pos}_{i}}$, ${{w}_{i-1}}{{w}_{i}}{{pos}_{i}}$, ${{w}_{i}}{{w}_{i+1}}{{pos}_{i}}$, ${{w}_{i+1}}{{w}_{i+2}}{{pos}_{i}}$, ${{w}_{i-1}}{{w}_{i+1}}{{pos}_{i}}$ | Binary context features of the word |
    | $h_1(w_i){{pos}_{i}}$, $h_2(w_i){{pos}_{i}}$, $h_3(w_i){{pos}_{i}}$, $h_4(w_i){{pos}_{i}}$, $h_5(w_i){{pos}_{i}}$ | n characters selected from the beginning of the word |
    | $t_1(w_i){{pos}_{i}}$, $t_2(w_i){{pos}_{i}}$, $t_3(w_i){{pos}_{i}}$, $t_4(w_i){{pos}_{i}}$, $t_5(w_i){{pos}_{i}}$ | n characters selected from the end of the word |
    | ${{pos}_{i-1}}{{pos}_{i}}$ | POS tag transition feature |
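The templates in Table 1 can be read as a recipe for turning a word position into a list of observation features. The sketch below is one possible reading, not the authors' implementation: it assumes that $h_n(w_i)$ is the first n characters of the word and $t_n(w_i)$ the last n characters, as the table descriptions suggest. The function name `pos_features`, the `<PAD>` boundary token, the string encoding of each feature, and the choice to skip prefixes and suffixes longer than the word are all hypothetical. In a CRF, each of these observations would be paired with the candidate tag $pos_i$.

```python
from typing import List


def pos_features(words: List[str], i: int, prev_pos: str) -> List[str]:
    """Instantiate the Table 1 templates for the word at position i.

    Only the observation side of each template is built here; a CRF
    toolkit would conjoin every feature with the candidate tag pos_i.
    """

    def w(offset: int) -> str:
        # Word at a relative offset; "<PAD>" (an assumed token) outside the sentence.
        j = i + offset
        return words[j] if 0 <= j < len(words) else "<PAD>"

    feats: List[str] = []

    # Unary context features: w_{i-2} ... w_{i+2}
    for k in (-2, -1, 0, 1, 2):
        feats.append(f"w[{k}]={w(k)}")

    # Binary context features: the five word pairs listed in Table 1
    for a, b in ((-2, -1), (-1, 0), (0, 1), (1, 2), (-1, 1)):
        feats.append(f"w[{a}]w[{b}]={w(a)}|{w(b)}")

    # h_n: first n characters, t_n: last n characters, for n = 1..5
    current = words[i]
    for n in range(1, 6):
        if len(current) >= n:  # assumption: skip affixes longer than the word
            feats.append(f"h{n}={current[:n]}")
            feats.append(f"t{n}={current[-n:]}")

    # POS tag transition feature: pos_{i-1} pos_i
    feats.append(f"pos[-1]={prev_pos}")
    return feats


if __name__ == "__main__":
    # Toy placeholder tokens, not Uyghur data.
    for feature in pos_features(["alpha", "beta", "gamma"], 1, "N"):
        print(feature)
```

The morphological tagging templates in Table 2 have the same shape, with morphemes $m_i$ in place of words and morphological tags $t_i$ in place of POS tags, so the same pattern would apply without the prefix and suffix features.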

    Table 2.  Feature Template of the Morphological Tagging Model

    | Features | Description |
    |---|---|
    | ${{m}_{i-2}}{{t}_{i}}$, ${{m}_{i-1}}{{t}_{i}}$, ${{m}_{i}}{{t}_{i}}$, ${{m}_{i+1}}{{t}_{i}}$, ${{m}_{i+2}}{{t}_{i}}$ | Unary context features of the morpheme |
    | ${{m}_{i-2}}{{m}_{i-1}}{{t}_{i}}$, ${{m}_{i-1}}{{m}_{i}}{{t}_{i}}$, ${{m}_{i}}{{m}_{i+1}}{{t}_{i}}$, ${{m}_{i+1}}{{m}_{i+2}}{{t}_{i}}$, ${{m}_{i-1}}{{m}_{i+1}}{{t}_{i}}$ | Binary context features of the morpheme |
    | ${{t}_{i-1}}{{t}_{i}}$ | Morphological tag transition feature |

    Table 3.  List of Morphological Tag Candidates of Words in the Sentence


    Table 4.  Manually Tagged Corpus Format and Content Example


    Table 5.  Details of Experimental Data

    | Data set | Number of sentences | Number of words (including punctuation marks) | Number of Uyghur words |
    |---|---|---|---|
    | Training set | 1000 | 12433 | 10391 |
    | Development set | 200 | 2564 | 2151 |
    | Test set | 200 | 2492 | 2075 |

    Table 6.  Experimental Results

    | Method | Stemming accuracy (%) | Morpheme segmentation accuracy (%) | POS accuracy (%) | Overall accuracy (%) |
    |---|---|---|---|---|
    | Tag sequence Markov model | 90.18 | 83.25 | 86.17 | 75.13 |
    | Joint CRF model | 91.98 | 85.79 | 92.7 | 77.95 |
    | Tag sequence Markov model, $\alpha$=0.95 | 92.65 | 88.47 | 88.12 | 79.65 |
    | Joint CRF model, $\alpha$=0.9 | 92.85 | 89.76 | 92.6 | 80.73 |

    Table 7.  Analysis for the Influence of Filtering Rules on Morphological Tagging

    | Method | Stemming accuracy (%) | Morpheme segmentation accuracy (%) | POS accuracy (%) | Overall accuracy (%) |
    |---|---|---|---|---|
    | Joint CRF model, $\alpha$=0.9, $\beta$=0.1, filtering rules not used | 92.85 | 89.76 | 92.6 | 80.73 |
    | Joint CRF model, $\alpha$=0.9, $\beta$=0.1, filtering rules used | 97.4 | 94.58 | 96.35 | 92.58 |
    | Tag sequence transition model, $\alpha$=0.95, filtering rules used | 94.35 | 93.22 | 94.78 | 91.81 |
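To make the effect of the filtering rules in Table 7 easier to read, the following sketch computes, for the joint CRF model ($\alpha$=0.9, $\beta$=0.1), the absolute gain in percentage points and the relative error reduction for each metric. The scores are copied from Table 7. The relative error reduction is a convenience measure added here and is not reported in the paper, and the names `METRICS` and `gains` are hypothetical.

```python
from typing import Dict, Tuple

# Accuracy scores (%) copied from Table 7, joint CRF model, alpha=0.9, beta=0.1.
METRICS = ("stemming", "morpheme_segmentation", "pos", "overall")
without_rules = dict(zip(METRICS, (92.85, 89.76, 92.60, 80.73)))
with_rules = dict(zip(METRICS, (97.40, 94.58, 96.35, 92.58)))


def gains(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, Tuple[float, float]]:
    """Return (absolute gain in points, relative error reduction in %) per metric."""
    result = {}
    for name, old in before.items():
        absolute = after[name] - old
        # Share of the original error (100 - old) removed by the change.
        relative = 100.0 * absolute / (100.0 - old)
        result[name] = (round(absolute, 2), round(relative, 1))
    return result


if __name__ == "__main__":
    for name, (absolute, relative) in gains(without_rules, with_rules).items():
        print(f"{name}: +{absolute} points, {relative}% of errors removed")
```

For the overall score, the gain is 92.58 − 80.73 = 11.85 percentage points, which is the largest improvement among the four metrics.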
Open Access Under a Creative Commons license
