Corpus Database

The corpus database consists of two parts: the corpus system and the corpus data.

The corpus system software was developed by a research group from the Computer-Aided Translation Unit of Universiti Sains Malaysia under the USM-DBP collaboration framework based on a memorandum of understanding signed in 1993.

The corpus system lets users look up word forms, derivatives and phrases. Search results are displayed as concordance lines, sorted and aligned so that the search keyword sits in the middle of each line.
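A concordance display of this kind can be illustrated with a short Python sketch. This is not the USM-DBP software, whose internals are not described here; the function name `concordance`, the `width` parameter, the sample sentence and the choice to sort by the text following the keyword are all assumptions made for illustration.

```python
import re

def concordance(text, keyword, width=30):
    """Return keyword-in-context lines with the keyword starting at a fixed column."""
    # Collapse line breaks and repeated spaces so each context window stays on one line.
    flat = " ".join(text.split())
    # Whole-word, case-insensitive match of the exact word form.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    rows = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        rows.append((left, match.group(0), right))
    # Assumed ordering: sort by the text that follows the keyword.
    rows.sort(key=lambda row: row[2].lower())
    # Right-align the left context so every keyword begins at column `width`.
    return [f"{left:>{width}}{word}{right}" for left, word, right in rows]

# Hypothetical sample text, not taken from the DBP corpus.
sample = "Dia berkata bahawa kata itu betul. Kata orang, perkataan ini sudah lama digunakan."
for line in concordance(sample, "kata", width=20):
    print(line)
```

Because the left context is padded to a fixed width, the keywords line up vertically, which is what makes patterns to the left and right of the keyword easy to scan.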
 
Keyword searches can be run using several techniques, depending on the information to be extracted and displayed. The common search techniques are as follows:

a) Access via Keyword (Word Form)

A word form can be looked up by typing the word itself. For example, a search for the keyword “kata” displays every occurrence of that word form in the text corpus.

b) Access via Keyword and the symbols “*” and “?”

A word form can also be looked up by combining a keyword with the symbols “*” and “?”. Here “*” stands for any run of characters (including none, since the pattern “*kata*” also returns “kata” itself) and “?” stands for exactly one character.

For example, a search using the keyword “*kata*” will display forms such as “kata,” “perkataan,” “berkata” and so on.

A search using the keyword “b?t?l” will display forms such as “botol,” “batal,” “betul” and so on.
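The wildcard behaviour described above can be sketched in Python by translating a pattern into a regular expression. This is an illustrative sketch only, not the corpus system's implementation: the function names `wildcard_to_regex` and `search_word_forms` and the small word list are hypothetical. It assumes that “*” may match an empty run of characters, because the example lists “kata” among the results for “*kata*”.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a keyword containing * and ? into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")   # assumed: any run of word characters, possibly empty
        elif ch == "?":
            parts.append(r"\w")    # exactly one word character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def search_word_forms(words, pattern):
    """Return the words that match the wildcard pattern in full."""
    regex = wildcard_to_regex(pattern)
    return [word for word in words if regex.fullmatch(word)]

# Hypothetical word list standing in for the word forms of a corpus.
words = ["kata", "perkataan", "berkata", "botol", "batal", "betul", "buku"]
print(search_word_forms(words, "*kata*"))
print(search_word_forms(words, "b?t?l"))
```

Using `fullmatch` keeps a plain keyword such as “kata” from matching longer forms, so the same function covers both search technique (a) and search technique (b).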

Text analysis

The corpus system also includes a text analysis module, known as MATA (Malay Text Analysis), which generates the following statistics for a text:

(a)  Number of words;
(b)  Word frequency;
(c)  Number and list of root words;
(d)  Number and list of new words; and
(e)  Number and list of non-authentic words.
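Some of these statistics can be approximated with a few lines of Python. The sketch below is not MATA: the names `tokenize` and `text_statistics` are hypothetical, and it assumes that a “new word” is simply a word missing from a reference word list supplied by the caller. Root words and non-authentic words are left out because they would need a Malay morphological analyser and a lexicon, neither of which is described here.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase words, keeping hyphenated reduplications such as buku-buku."""
    return re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())

def text_statistics(text, known_words):
    """Return word count, word frequencies and words absent from known_words."""
    tokens = tokenize(text)
    frequency = Counter(tokens)
    # Assumption: a word counts as "new" when it is not in the reference list.
    new_words = sorted(word for word in frequency if word not in known_words)
    return {
        "word_count": len(tokens),
        "frequency": frequency,
        "new_words": new_words,
    }

# Hypothetical sample text and reference list.
stats = text_statistics(
    "Kata demi kata, buku-buku itu dibaca.",
    known_words={"kata", "demi", "itu", "buku-buku"},
)
print(stats["word_count"])
print(stats["frequency"].most_common(3))
print(stats["new_words"])
```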

Corpus Data

DEFINITION

In general, a corpus can be defined as “a collection of articles (writings, etc.) regarding a particular matter or collection of materials to be studied (such as group of word usage, etc.)” (Kamus Dewan, third edition, 1994). In modern linguistics, however, “corpus” has the additional meaning of material that is “read and processed by a computer”.

This means that the corpus collected and maintained in this project is a collection of digital texts that can be processed using the techniques and methods of linguistic computing to show patterns and to correlate one word with another.


 

DATA

Corpus data may come from written or spoken sources. However, the programme currently still focuses on written material from books, magazines, newspapers, monographs, documents, working papers, correspondence, brochures, and so on.

Each type of discourse is compiled in a separate sub-corpus. As of 25 November 2008, the Corpus Database held approximately 135 million words in the following ten sub-corpora:

No.  Sub-Corpus            Current Amount (words)  Type of Material
1.   Books                             31,580,305  Novels, scholarly books, general books, textbooks
2.   Magazines                         14,406,888  General, covering various fields
3.   Newspapers                        80,029,347  Dailies, tabloids, weeklies
4.   Translation (books)                2,021,191  Scholarly books, general books
5.   Ephemerals                           290,207  Pamphlets, brochures, advertisements
6.   Dramas                               404,176  Dramas published in book form
7.   Poetry                               116,428  Poetry published in book form
8.   Material Cards                     3,130,641  Collection cards for compiling the Kamus Dewan
9.   Traditional Texts                  2,825,329  Classical texts in the form of hikayat, folklore
10.  Textbooks                          1,095,726  Primary and secondary level textbooks
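The figures in the table can be checked with a short calculation. The sketch below uses only the word counts listed above; the variable names are illustrative. The ten sub-corpora sum to 135,900,238 words, which is consistent with the stated figure of approximately 135 million, and newspapers are by far the largest component.

```python
# Word counts per sub-corpus, copied from the table (as of 25 November 2008).
SUB_CORPORA = {
    "Books": 31_580_305,
    "Magazines": 14_406_888,
    "Newspapers": 80_029_347,
    "Translation (books)": 2_021_191,
    "Ephemerals": 290_207,
    "Dramas": 404_176,
    "Poetry": 116_428,
    "Material Cards": 3_130_641,
    "Traditional Texts": 2_825_329,
    "Textbooks": 1_095_726,
}

total = sum(SUB_CORPORA.values())
print(f"Total words: {total:,}")

# Show each sub-corpus with its share of the whole database, largest first.
for name, count in sorted(SUB_CORPORA.items(), key=lambda item: item[1], reverse=True):
    share = count / total
    print(f"{name:<20}{count:>12,}{share:>8.1%}")
```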

CORPUS-BASED GOALS AND RESEARCH OUTCOMES

This database was built with the aim of providing research data that can be used for compiling dictionaries, grammar research and other forms of linguistic studies. Examples of corpus-based studies are as follows:

No.  Working Paper and Notes

1.  Perempuan, Wanita Dan .....: Satu kajian hubungan leksikal berdasarkan korpus
    Presented at the Malay Lexicography Seminar on 20 and 21 December 1994 in the Seminar Hall, Dewan Bahasa dan Pustaka

2.  Ianya benar
    Presented at the International Conference of Malay-Indonesian Studies: Proactive and Creativity on 21-23 June 1999 at the Prince of Songkla University, Pattani, Thailand

3.  Melayari Samudera Maya, Mencari Mutiara Kata: Suatu Metodologi Pemerolehan Kata Baru Berdasarkan Korpus
    Presented at the 1st ASEAN Linguistics Conference on 14-16 November 2000 at Universiti Kebangsaan Malaysia

4.  KIM VS KIM: Kajian Leksis Berdasarkan Analisis Teks Selari
    Presented at the Malay Language and Translation Department One-Day Seminar on 7 February 2001 at Universiti Kebangsaan Malaysia

5.  Istilah Sains Dalam Teks Bacaan Umum
    Published in Jurnal Rampak Serantau No. 8, 2001

6.  Penggunaan Istilah Teknologi Maklumat dan Komunikasi: Suatu Kajian Berdasarkan Teks Akhbar Harian
    Presented at the Seminar on Science and Technology Challenges and Writings in the New Millennium on 25-26 April 2001 at Universiti Kebangsaan Malaysia

7.  Soal Hati: Suatu Kajian Korpus
    Presented at the 2001 National Language Convention on 2-4 May 2001 at the Nikko Hotel, Kuala Lumpur

8.  Yang Selari dan Yang Setanding: Peranan Korpus dalam Penterjemahan
    Presented at the 8th International Conference on Translation on 3-5 September 2001 in Langkawi, Kedah

9.  Baik Buruk Byte dan Bait
    Published in Jurnal Rampak Serantau No. 9, 2002

10. 'PUN', Kepelbagaian Makna Berdasarkan Teks Sejarah Melayu
    Presented at the Malay Lexicology and Lexicography Conference on 16-17 December 2002 at Universitas Indonesia, Depok, Jakarta

    Yang Dini dan Yang Kini: Kisah Dua Naskhah
    Presented at the Malay Lexicology and Lexicography Conference on 16-17 December 2002 at Universitas Indonesia, Depok, Jakarta

11. Pengkomputeran Bahasa Melayu: Kegiatan, Kerjasama dan Kemajuan
    Presented at the Seminar on Leading Indonesia into the Era of Globalisation through Language, Communication and Information Technology on 18 September 2003 at the Agency for the Assessment and Application of Technology (BPPT), Jakarta, Indonesia

12. Pangkalan Data Korpus DBP: Perancangan, Pembinaan dan Pemanfaatan
    Presented at the One-Day Linguistics Seminar on “Grammar and the Usage of Malay Language: Corpus Data Analysis” on 30 March 2004 at Universiti Kebangsaan Malaysia

13. Kesejagatan Bahasa Melayu Melalui Teknologi
    Published in Dewan Bahasa magazine, March 2004

14. Bahasa Sukuan: Suatu Kajian Analisis Terhadap Pengaruhnya dalam Bahasa Melayu
    Presented at the Universiti Kebangsaan Malaysia ATMA and IKON International Conference on “The Languages and Literatures of Western Borneo: 144 Years of Research” from 31 January to 2 February 2005 at Universiti Kebangsaan Malaysia

15. Analisis -ik, -ikal dan -is dalam bahasa Melayu berdasarkan data korpus
    Presented at the National Linguistics Seminar on “Language and Corpus Studies: Current Linguistic Dimensions” on 12-13 April 2005 at Universiti Kebangsaan Malaysia

16. Suara sasterawan, suara awam
    Presented at the National Linguistics Seminar on “Language and Corpus Studies: Current Linguistic Dimensions” on 12-13 April 2005 at Universiti Kebangsaan Malaysia

17. Sinonim Tetapi Tidak Seerti
    Published in Pelita Bahasa magazine, May 2005

18. Lexical Associations of Malayness in Hikayat Abdullah: A Collocational Analysis
    Published in the Research Journal of Applied Sciences 5 (6): 429-433, 2010. ISSN: 1815-932X. Medwell Journals, 2010
Address:

Dewan Bahasa dan Pustaka,
Jalan Dewan Bahasa,
50460 Kuala Lumpur.

Telephone: 03-2148 1011
Facsimile: 03-2144 7248 (General)
03-2144 5727 (Language)
03-2141 4109 (Literature)
03-2148 2945 (Corporate)
 Last updated: 27 May 2013