Corpus Database
The corpus database consists of two components: the corpus system and the corpus data.
The corpus system software was developed by a research group from the Computer-Aided Translation Unit of Universiti Sains Malaysia under the USM-DBP collaboration framework based on a memorandum of understanding signed in 1993.
The corpus system can retrieve word forms, derived forms and phrases. Search results are displayed as sorted concordance lines, with the search keyword aligned in the middle of each line. Keywords can be searched using a variety of techniques, depending on the information to be extracted and displayed. The most common search techniques are as follows:
a) Access via Keyword (Word Form)
A word form can be accessed by typing the word. For example, a search that uses the keyword “kata” will display all forms of this word that exist in a text corpus. (See screen display example)
b) Access via Keyword and the symbols “*” and “?”
A word form can also be accessed by combining a keyword with the symbols “*” and “?” (“*” represents any run of characters, which may be empty, as the first example below shows, and “?” represents exactly one character).
For example, a search using the keyword “*kata*” will display forms such as “kata,” “perkataan,” “berkata” and so on. (See screen display example)
A search using the keyword “b?t?l” will display forms such as “botol,” “batal,” “betul” and so on. (See screen display example)
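The two search techniques above can be illustrated with a minimal Python sketch. It is not the corpus system's own implementation: the function names (`wildcard_to_regex`, `find_forms`, `concordance`) and the sample sentence are hypothetical, and “*” is assumed to match zero or more characters so that “*kata*” also finds “kata”, as in the example above.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a keyword containing '*' and '?' into a regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")  # any run of word characters, possibly empty
        elif ch == "?":
            parts.append(r"\w")   # exactly one word character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def find_forms(text: str, pattern: str) -> list[str]:
    """Return every word in the text that matches the keyword as a whole."""
    regex = wildcard_to_regex(pattern)
    return [m.group() for m in re.finditer(r"\w+", text) if regex.fullmatch(m.group())]


def concordance(text: str, pattern: str, width: int = 30) -> list[str]:
    """Return concordance lines with the matched word placed in the middle."""
    regex = wildcard_to_regex(pattern)
    lines = []
    for m in re.finditer(r"\w+", text):
        if regex.fullmatch(m.group()):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            # Right-align the left context and left-align the right context
            # so that the keywords line up in one column.
            lines.append(f"{left:>{width}} [{m.group()}] {right:<{width}}")
    return lines


# Hypothetical sample text, not taken from the corpus.
sample = "Dia berkata bahawa perkataan itu betul, tetapi botol itu batal dibeli."

for line in concordance(sample, "*kata*"):
    print(line)
for line in concordance(sample, "b?t?l"):
    print(line)
```

Matching is done word by word here, so a hyphenated form such as “kata-kata” would be treated as two separate words; a real system may tokenise differently.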
Text Analysis
The corpus system is also equipped with a module to analyse texts (known as MATA - Malay Text Analysis) and generate statistics on a text as follows:
(a) Number of words;
(b) Word frequency;
(c) Number and list of root words;
(d) Number and list of new words; and
(e) Number and list of non-authentic words.
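Some of the statistics listed above can be sketched in a few lines of Python. This is only an illustration, not the MATA module itself: the function `text_statistics`, the `known_words` set and the sample sentence are hypothetical, and “new words” is assumed here to mean words missing from a reference word list. Root-word and non-authentic-word counts are left out because they would need a Malay stemmer and further word lists.

```python
import re
from collections import Counter


def text_statistics(text: str, known_words: set[str]) -> dict:
    """Compute word count, word frequency and a list of unlisted words."""
    # A token is a run of letters; hyphenated forms such as "kata-kata"
    # are kept together as one token.
    tokens = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())
    frequency = Counter(tokens)
    # Assumption: a "new word" is any word absent from the reference list.
    new_words = sorted(word for word in frequency if word not in known_words)
    return {
        "word_count": len(tokens),
        "frequency": frequency,
        "new_words": new_words,
    }


# Hypothetical sample text and reference word list.
sample = "Kata demi kata, kata-kata itu disebut."
known_words = {"kata", "demi", "itu", "kata-kata"}

stats = text_statistics(sample, known_words)
print(stats["word_count"])
print(stats["frequency"].most_common(3))
print(stats["new_words"])
```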
Corpus Data
DEFINITION
In general, a corpus can be defined as “a collection of articles (writings, etc.) regarding a particular matter or collection of materials to be studied (such as group of word usage, etc.)” (Kamus Dewan, third edition, 1994). In modern linguistics, however, “corpus” carries the additional meaning of material that is “read and processed by a computer”.
In other words, the corpus collected and maintained in this project is a collection of digital texts that can be processed with the techniques and methods of linguistic computing to reveal patterns of usage and the relationships between words.
DATA
Corpus data may come from written or spoken sources. However, the current focus of this programme is still on written material from books, magazines, newspapers, monographs, documents, working papers, correspondence, brochures, and so on.
Each type of discourse is compiled in a separate sub-corpus. As of 25 November 2008, the Corpus Database held approximately 135 million words in the following ten sub-corpora:
| No | Sub-Corpus | Current Amount (words) | Type of Material |
|---|---|---|---|
| 1. | Books | 31,580,305 | Novels, scholarly books, general books, textbooks |
| 2. | Magazines | 14,406,888 | General, covering various fields |
| 3. | Newspapers | 80,029,347 | Dailies, tabloids, weeklies |
| 4. | Translation (books) | 2,021,191 | Scholarly books, general books |
| 5. | Ephemerals | 290,207 | Pamphlets, brochures, advertisements |
| 6. | Dramas | 404,176 | Dramas published in book form |
| 7. | Poetry | 116,428 | Poetry published in book form |
| 8. | Material Cards | 3,130,641 | Collection cards for compiling the Kamus Dewan |
| 9. | Traditional Texts | 2,825,329 | Classical texts in the form of hikayat, folklore |
| 10. | Textbooks | 1,095,726 | Primary and secondary level textbooks |
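As a quick arithmetic check, the ten figures in the table above can be summed. They come to 135,900,238 words, in line with the stated total of approximately 135 million, and newspapers alone account for well over half of it. The short Python sketch below reproduces the calculation; the variable names are illustrative only.

```python
# Word counts per sub-corpus, copied from the table (as of 25 November 2008).
sub_corpora = {
    "Books": 31_580_305,
    "Magazines": 14_406_888,
    "Newspapers": 80_029_347,
    "Translation (books)": 2_021_191,
    "Ephemerals": 290_207,
    "Dramas": 404_176,
    "Poetry": 116_428,
    "Material Cards": 3_130_641,
    "Traditional Texts": 2_825_329,
    "Textbooks": 1_095_726,
}

total = sum(sub_corpora.values())
print(f"Total: {total:,} words in {len(sub_corpora)} sub-corpora")

# Share of each sub-corpus in the whole, largest first.
for name, count in sorted(sub_corpora.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} {count:>12,} {count / total:6.1%}")
```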
GOALS AND CORPUS-BASED RESEARCH OUTCOMES
This database was built with the aim of providing research data that can be used for compiling dictionaries, grammar research and other forms of linguistic studies. Examples of corpus-based studies are as follows:
| No | Working Paper | Notes |
|---|---|---|
| 1 | Perempuan, Wanita Dan .....: Satu kajian hubungan leksikal berdasarkan korpus | Presented at the Malay Lexicography Seminar on 20-21 December 1994 in the Seminar Hall, Dewan Bahasa dan Pustaka |
| 2 | Ianya benar | Presented at the International Conference of Malay-Indonesian Studies: Proactive and Creativity on 21-23 June 1999 at the Prince of Songkla University, Pattani, Thailand |
| 3 | Melayari Samudera Maya, Mencari Mutiara Kata: Suatu Metodologi Pemerolehan Kata Baru Berdasarkan Korpus | Presented at the 1st ASEAN Linguistics Conference on 14-16 November 2000 at Universiti Kebangsaan Malaysia |
| 4 | KIM VS KIM: Kajian Leksis Berdasarkan Analisis Teks Selari | Presented at the Malay Language and Translation Department One-Day Seminar on 7 February 2001 at Universiti Kebangsaan Malaysia |
| 5 | Istilah Sains Dalam Teks Bacaan Umum | Published in Jurnal Rampak Serantau No. 8 2001 |
| 6 | Penggunaan Istilah Teknologi Maklumat dan Komunikasi: Suatu Kajian Berdasarkan Teks Akhbar Harian | Presented at the Seminar on Science and Technology Challenges and Writings in the New Millennium on 25-26 April 2001 at Universiti Kebangsaan Malaysia |
| 7 | Soal Hati: Suatu Kajian Korpus | Presented at the 2001 National Language Convention on 2-4 May 2001 at the Nikko Hotel, Kuala Lumpur |
| 8 | Yang Selari dan Yang Setanding: Peranan Korpus dalam Penterjemahan | Presented at the 8th International Conference on Translation on 3-5 September 2001 in Langkawi, Kedah |
| 9 | Baik Buruk Byte dan Bait | Published in Jurnal Rampak Serantau No. 9 2002 |
| 10 | 'PUN', Kepelbagaian Makna Berdasarkan Teks Sejarah Melayu | Presented at the Malay Lexicology and Lexicography Conference on 16-17 December 2002 at Universitas Indonesia, Depok, Jakarta |
| | Yang Dini dan Yang Kini: Kisah Dua Naskhah | Presented at the Malay Lexicology and Lexicography Conference on 16-17 December 2002 at Universitas Indonesia, Depok, Jakarta |
| 11 | Pengkomputeran Bahasa Melayu: Kegiatan, Kerjasama dan Kemajuan | Presented at the Seminar on Leading Indonesia into the Era of Globalisation through Language, Communication and Information Technology on 18 September 2003 at the Agency for the Assessment and Application of Technology (BPPT), Jakarta, Indonesia |
| 12 | Pangkalan Data Korpus DBP: Perancangan, Pembinaan dan Pemanfaatan | Presented at the One-Day Linguistics Seminar on “Grammar and the Usage of Malay Language: Corpus Data Analysis” on 30 March 2004 at Universiti Kebangsaan Malaysia |
| 13 | Kesejagatan Bahasa Melayu Melalui Teknologi | Published in Dewan Bahasa magazine, March 2004 |
| 14 | Bahasa Sukuan: Suatu Kajian Analisis Terhadap Pengaruhnya dalam Bahasa Melayu | Presented at the Universiti Kebangsaan Malaysia ATMA and IKON International Conference on “The Languages and Literatures of Western Borneo: 144 Years of Research” from 31 January to 2 February 2005 at Universiti Kebangsaan Malaysia |
| 15 | Analisis -ik, -ikal dan -is dalam bahasa Melayu berdasarkan data korpus | Presented at the National Linguistics Seminar on “Language and Corpus Studies: Current Linguistic Dimensions” on 12-13 April 2005 at Universiti Kebangsaan Malaysia |
| 16 | Suara sasterawan, suara awam | Presented at the National Linguistics Seminar on “Language and Corpus Studies: Current Linguistic Dimensions” on 12-13 April 2005 at Universiti Kebangsaan Malaysia |
| 17 | Sinonim Tetapi Tidak Seerti | Published in Pelita Bahasa magazine, May 2005 |
| 18 | Lexical Associations of Malayness in Hikayat Abdullah: A Collocational Analysis | Published in the Research Journal of Applied Sciences 5 (6): 429-433, 2010. ISSN: 1815932X. Medwell Journals, 2010 |