The Index Thomisticus is considered to be one of the pioneer projects in the research areas that today are known as Computational Linguistics, Humanities Computing and Digital Humanities. Started by father Roberto Busa SJ in the second half of the 40s, the Index Thomisticus is a corpus containing the opera omnia (in Latin) of Thomas Aquinas (118 texts) as well as 61 texts by other authors related to Thomas, for a total of approximately 11 million words, each morphologically tagged and lemmatized by hand.
Early in the 70s, Busa started to plan a project aimed at both the morphosyntactic disambiguation of the Index Thomisticus lemmatization and the syntactic annotation of its sentences. Nowadays, these tasks are performed by the Index Thomisticus Treebank project. Started in 2006 at UniversitĂ Cattolica del Sacro Cuore (Milan, Italy), the Index Thomisticus Treebank is a dependency-based syntactically annotated corpus built upon the texts of Thomas Aquinas provided by the Index Thomisticus corpus. The annotation style of the treebank is based on the guidelines developed in Prague for the so-called 'analytical' layer of annotation of the Prague Dependency Treebank for Czech. Furthermore, specific guidelines for the syntactic annotation of Latin texts were built together with the Latin Dependency Treebank, developed by the Perseus Digital Library at Tufts University in Boston, MA (USA).
Recently, the project has started to perform also the semantic and pragmatic annotation of the already available syntactically annotated data of the Index Thomisticus. This new level of annotation resembles the so called 'tectogrammatical' layer of the Prague Dependency Treebank; it features semantic role labelling, ellipsis resolution, coreference analysis and sentential information structure.
A more general aim of the Index Thomisticus Treebank project is to develop and make available different kinds of advanced language resources for Latin. Beyond the Index Thomisticus Treebank, the project includes also the following resources: