There is a great need for proper topic classification of research papers when they are contributed to conferences or submitted to journals. Currently, authors are asked to indicate the topic (EDIC category) to which their paper belongs. Contrary to what one would expect, many prospective authors do not understand the available EDIC categories well enough to make the correct choice themselves. Alternatively, the EDIC categories themselves may be inherently flawed. It would therefore be useful to have an automatically generated "second opinion" on the topic. This additional information should help journal editors and conference organizers to spot wrongly categorized submissions at an early stage and to readjust topic and reviewer assignments accordingly. This could considerably improve and speed up reviewing procedures.
The purpose of this thesis is to construct such a system for the field of speech and language technology, based on data available from many years of conferences in this domain.
In recent years there has been substantial work on automatic topic classification across many kinds of subject matter. This is typically achieved with a technique called Latent Semantic Analysis (LSA), which bases its analysis on document-word co-occurrence counts. LSA automatically groups papers into a predefined number of classes. In this thesis we want to explore how such an automatically generated set of topics can be used in conjunction with the human-defined set. Questions of interest include: Is one set mappable onto the other? Can the automatic set be used to generate useful subcategories of the human set? Should the paper title be given extra weight? Is an abstract sufficient, or do we need to screen the full paper? Does the reference list contain relevant or different information?
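To make the idea of LSA on document-word co-occurrence counts concrete, the following is a minimal sketch and not the system proposed in this thesis. It assumes whitespace tokenization, raw counts without any term weighting, and a plain power iteration with deflation standing in for a library SVD routine. The function names (`count_matrix`, `truncated_svd`, `document_coordinates`, `cosine`) and the four toy abstracts are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter


def count_matrix(docs):
    """Build the document-word count matrix: rows[i][j] = count of vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]  # assumption: whitespace tokens
    vocab = sorted({word for tokens in tokenized for word in tokens})
    column = {word: j for j, word in enumerate(vocab)}
    rows = []
    for tokens in tokenized:
        row = [0.0] * len(vocab)
        for word, count in Counter(tokens).items():
            row[column[word]] = float(count)
        rows.append(row)
    return vocab, rows


def truncated_svd(rows, k, iters=200, seed=0):
    """Approximate the k largest singular triplets (sigma, u, v) of the count
    matrix by power iteration with deflation (a stand-in for a library SVD)."""
    a = [row[:] for row in rows]  # work on a copy; deflation modifies it
    n_docs, n_words = len(a), len(a[0])
    rng = random.Random(seed)
    triplets = []
    for _ in range(k):
        v = [rng.random() for _ in range(n_words)]
        for _ in range(iters):
            av = [sum(x * y for x, y in zip(row, v)) for row in a]  # A v
            w = [sum(a[i][j] * av[i] for i in range(n_docs))        # A^T (A v)
                 for j in range(n_words)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        av = [sum(x * y for x, y in zip(row, v)) for row in a]
        sigma = math.sqrt(sum(x * x for x in av))
        if sigma == 0.0:
            break  # nothing left to extract
        u = [x / sigma for x in av]
        triplets.append((sigma, u, v))
        # Deflate: subtract the rank-one component that was just found.
        for i, row in enumerate(a):
            for j in range(n_words):
                row[j] -= sigma * u[i] * v[j]
    return triplets


def document_coordinates(triplets, n_docs):
    """Coordinates of each document in the latent space (rows of U * Sigma)."""
    return [[sigma * u[i] for sigma, u, _ in triplets] for i in range(n_docs)]


def cosine(x, y):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(p * q for p, q in zip(x, y))
    nx = math.sqrt(sum(p * p for p in x))
    ny = math.sqrt(sum(q * q for q in y))
    return dot / (nx * ny) if nx and ny else 0.0


# Hypothetical toy "abstracts": two about speech recognition, two about parsing.
docs = [
    "speech recognition acoustic model",
    "speech recognition acoustic features",
    "syntactic parsing grammar",
    "syntactic parsing treebank",
]
vocab, rows = count_matrix(docs)
triplets = truncated_svd(rows, k=2)  # k is the predefined number of latent topics
coords = document_coordinates(triplets, len(docs))
print("speech vs speech :", round(cosine(coords[0], coords[1]), 3))
print("speech vs parsing:", round(cosine(coords[0], coords[2]), 3))
```

On this toy input the two speech abstracts end up pointing in essentially the same direction of the two-dimensional latent space, while a speech abstract and a parsing abstract are essentially orthogonal. Turning such coordinates into discrete classes would still need an extra step, for example assigning each paper to its dominant latent dimension or clustering the coordinates. A practical system would also replace the power iteration with a tested SVD implementation.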
