Abstract
Introduction
The Chinese Sentence Pattern Structure Treebank (the SPS Treebank) was developed at Beijing Normal University and Peking University. It contains 5,016 sentences and 119,627 tokens, syntactically annotated following the concept of sentence constituent analysis, which emphasizes sentence pattern structure. This concept is based on The New Chinese Grammar by the linguist Jinxi Li. The source data consists of 27 chapters extracted from modern Mandarin and ancient Chinese works.
Data
The SPS Treebank has three annotation layers: lexical sense and structural mode for dynamic words; syntactic structure for clauses; and inter-clause relations within complex sentences and sentence clusters. These structures can be visualized using the Jbw-viewer tool.
Below are the text data sources and volumes contained in this release:
| Book Name | Chapters | Characters | Sentences |
| --- | --- | --- | --- |
| Selected Works of Lu Xun (《鲁迅全集》) | 8 | 25,545 | 948 |
| Selected Works of Mao Zedong (《毛泽东选集》) | 2 | 32,454 | 771 |
| From the Soil: The Foundations of Chinese Society (《乡土中国》) | 4 | 16,018 | 532 |
| A Dream in Red Mansions (《红楼梦》) | 5 | 33,087 | 1,781 |
| The Analects of Confucius (《论语》) | 6 | 5,392 | 517 |
| Mencius (《孟子》) | 2 | 6,771 | 467 |
| Total | 27 | 119,267 | 5,016 |
The data is presented in UTF-8 encoding. Each file contains the three-layer annotation stored in XML format. All files were automatically verified and manually checked.
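Because the element names used in the annotation files are not described here, a first step when working with the data might be to list which tags a file contains. The following is a minimal sketch in Python using only the standard library. The sample document and its tag names (`doc`, `sent`, `w`) are hypothetical stand-ins, not the treebank's actual schema, and `count_element_tags` is an illustrative helper rather than part of the release.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_element_tags(xml_bytes: bytes) -> Counter:
    """Count how often each element tag occurs in one XML document."""
    # Parsing from bytes lets the XML parser handle decoding itself;
    # UTF-8 is the default when no encoding is declared.
    root = ET.fromstring(xml_bytes)
    # root.iter() walks the root element and all of its descendants.
    return Counter(elem.tag for elem in root.iter())


# Hypothetical stand-in for one annotation file; the real tag names
# and attributes in the SPS Treebank may differ.
sample = '<doc><sent id="1"><w>学</w><w>而</w></sent></doc>'.encode("utf-8")

tag_counts = count_element_tags(sample)
for tag, n in tag_counts.most_common():
    print(tag, n)
```

For a file on disk, the same function can be called on the result of `pathlib.Path(path).read_bytes()`, where `path` points to one of the XML files in the release.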
(2025-06-16)