WordNets (semantic networks similar to enhanced thesauruses) have been built for over 50 languages, based on the design of the original, freely released English WordNet, and are widely used in research and text processing applications. However, not all WordNets have been released under an open license. We will measure how the openness of each WordNet's license correlates with its use in subsequent applications and research, based on metrics such as the number of citations for each WordNet. We hypothesize that open WordNets are used more often in subsequent research and development. Finally, we will create a server that will offer a unified, online interface to all open WordNets, and encourage other projects to also make their WordNets available.
There will be two main deliverables:
(i) an academic paper that measures the effect of WordNet license restrictions on the general success of the WordNet (measured by its use in applications and citation in academic papers). We will attempt to control for such factors as size and age of the projects.
(ii) a server with a unified, online interface to all open WordNets. We will host this at Nanyang Technological University, but will make the source code open and expect the server to be mirrored at other sites. This multilingual WordNet will allow applications that use WordNet to be accessed in multiple languages.
We will also update the table of WordNet projects maintained by the Global WordNet Association, including: name, language(s), size, coverage, contact details and license (http://www.globalwordnet.org/gwa/wordnet_table.htm).
We are targeting two communities. The first is the WordNet developer community: we hope to give quantitative arguments for why WordNets should be released under open licenses. According to discussions with other researchers, many projects would like to release their data, but have difficulty persuading their funding bodies that this is the right decision. A study showing the benefits of open release should help them make their case. Existing WordNets range in size from 117,000 concepts (English) to a few thousand (newly constructed WordNets for languages such as Farsi). The original WordNet predates CC licenses and was released under a modified BSD license. Many new WordNets place more restrictions on their use, from research only to full commercial licensing. We would like to help move the community toward general adoption of open source licenses by showing that this is the most effective way of leveraging the investment in creating the language resource.
The second community is that of natural language processing researchers (and the wider world). Many people use the English WordNet, but not everyone is aware that there are now free WordNets in a variety of languages. We will make these new resources more visible by providing an online API for the open ones. We also hope that the open WordNets will inspire other lexical resource creation projects to become more open.
Francis Bond is the principal developer of the Japanese WordNet and a member of two projects on WordNet development (the Kyoto Project, which brings together the most active European projects, and the Asian WordNet project, which is doing the same for Asian WordNets). Kyonghee Paik has extensive experience in machine translation and contrastive linguistic research. Together we represent both the WordNet community and the wider NLP community.
We will measure our impact on the WordNet community by seeing if we can persuade any existing projects to use a more open license, and any new projects to start with an open license. We will measure our impact on the wider community by seeing how much usage the open WordNet server gets.
The academic paper will be written by the two main participants. However, we will survey all known WordNet projects by email questionnaire, starting with the Global WordNet Association list, and interview a selected sample. We know at least 15 of the 58 WordNet developers.
For the open server, we will use existing standards as much as possible and share the code freely. We have already spoken informally to some developers and have their support. It may be possible to use the Asian WordNet project infrastructure directly, in which case all we will need to do is reformat all open WordNets to the ASWN standard.
WordNets in many languages help to increase the amount of creativity in two main ways:
(a) by allowing semantic relations to be used in other applications. For example, in the open clip art library, someone searching for a picture of “carnivore” will currently get no hits. With WordNets they could be given pictures from all nodes under “carnivore”, such as lions, tigers and seals, ...
(b) the multilingual WordNets allow work done in one language to be more quickly ported to another: for example a word tagged as “driver#n#1” could be searched for with any of its synonyms in any language (eng: chauffeur; jpn: 運転手; deu: Fahrer; ...) without being confused by different senses of driver (the golf club or software driver).
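The two uses above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny hierarchy, the intermediate "feline" node, the second sense identifier and the function names are all hypothetical stand-ins for data that a real WordNet (or the proposed server) would supply.

```python
from collections import deque

# Toy hyponym ("kind of") links. Illustrative only; a real WordNet
# would provide a much larger hierarchy.
HYPONYMS = {
    "carnivore": ["feline", "seal"],
    "feline": ["lion", "tiger"],
}

# Toy multilingual lemma table keyed by sense identifier.
# "driver#n#2" is a hypothetical identifier for a different sense.
LEMMAS = {
    "driver#n#1": {
        "eng": ["driver", "chauffeur"],
        "jpn": ["運転手"],
        "deu": ["Fahrer"],
    },
    "driver#n#2": {"eng": ["driver"]},
}


def expand_query(term):
    """Return the term plus every node below it, in breadth-first order."""
    seen = [term]
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for child in HYPONYMS.get(node, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


def search_terms_for_sense(sense_id):
    """Return the lemmas of one sense per language, ignoring other senses."""
    return {lang: list(words) for lang, words in LEMMAS.get(sense_id, {}).items()}
```

With this sketch, a search for "carnivore" would be expanded to also match images tagged "lion", "tiger" or "seal", and a search for the sense "driver#n#1" would use "chauffeur", "運転手" and "Fahrer" without pulling in the lemmas of any other sense of "driver".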
We will use text mining techniques to find references to WordNets in academic papers (from the Global WordNet Conference proceedings and the Association for Computational Linguistics Anthology) and applications (SourceForge). We intend to reuse techniques from Kozawa et al. (2008) for this. This will be enhanced with data from questionnaires and interviews.
Automatic Acquisition of Usage Information for Language Resources, Shunsuke Kozawa, Hitomi Tohyama, Kiyotaka Uchimoto and Shigeki Matsubara, In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), pp. 672-677, Marrakech, Morocco, May, 2008.
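As a rough illustration of the first step in such text mining, the sketch below counts how many documents mention each resource by name. It is a plain pattern-matching stand-in, not the method of Kozawa et al. (2008); the function name and the choice to count documents rather than individual mentions are assumptions made for this example.

```python
import re
from collections import Counter


def count_mentions(documents, resource_names):
    """Count how many documents mention each resource name at least once."""
    counts = Counter({name: 0 for name in resource_names})
    for text in documents:
        for name in resource_names:
            # Match the full name, ignoring case and allowing any
            # amount of whitespace between its words.
            pattern = r"\b" + r"\s+".join(map(re.escape, name.split())) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                counts[name] += 1
    return counts
```

A count like this only shows that a name appears in a text. It cannot tell whether the paper develops the resource or uses it, so the results would still need manual review.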
The open WordNet server will require some programming and experience with dictionary formats. We are very experienced in this area.
We will need to distinguish between papers about the development of a resource and papers about using a resource, which is beyond the level of current automatic techniques. In addition, there are multiple WordNet projects for some languages (e.g. Polish, French, ...), so we will need to separate their references. Current NLP techniques cannot solve these problems reliably, which is why we need human analysis of the results. Further, licensing is not the only difference between projects: we will also have to account for differences in project funding and longevity.
The individual WordNets sustain themselves. If we can persuade them to use open licenses, this will encourage community participation in further expansion. Our experience is that it is hard to get funding for resource maintenance, whereas communities are good at extending resources. Persuading WordNet projects to adopt open source licenses thus improves their sustainability. We will also seek academic grants at Nanyang Technological University to continue research on exploiting and creating WordNets.
WordNets currently exist for 58 languages, but there are thousands of human languages. Our previous research has shown that a WordNet for a new language can be bootstrapped efficiently from multiple existing languages. Many WordNet projects are currently being started for new languages, and we hope to provide solid evidence for them to choose an open license. As the number of WordNets increases, semantic web technologies based on resources such as WordNet scale straightforwardly to the new languages.
To attain our goal, it is important to interview developers and attend meetings to collect information, as it is easier to get the full story through face-to-face discussions. We therefore need the Catalyst Grant to support our attendance at one Kyoto Project meeting and one Asian WordNet meeting, as well as to present our output at a conference.
Communication within projects takes place through physical meetings, email, the Kyoto project wiki, and updates to web pages. As a result, useful information often stays within a single project.