Lessico Tomistico Biculturale

 

 The Index Thomisticus Treebank

 'arbor est causa proxima fructus'

(Scriptum super Sententiis, lib. 2, dist. 34, qu. 1, art. 5, expos., 7-2.7-6)

 

The Index Thomisticus (IT) Treebank is an ongoing project that forms part of the Lessico Tomistico Biculturale (LTB) project by Father Roberto Busa. The project is hosted at the Catholic University of the Sacred Heart in Milan (Italy).

 

The IT is considered a pathfinder in Digital Humanities; it contains the opera omnia of Thomas Aquinas (118 texts), plus works by 61 other authors related to Thomas (61 texts). The size of the corpus is around 11 million tokens (150,000 types; 20,000 lemmas).

 

LTB aims to develop a lexicon from the IT whose lexical entries are all the IT lemmas. Each entry reports the morphological, syntactic and semantic uses and values of the lemma in the IT.

 

The IT-Treebank is the syntactically annotated portion of the IT. The main reference of the IT-Treebank is the Prague Dependency Treebank (PDT).

Syntactic annotation at the analytical level is performed on the basis of the PDT Annotation Guidelines and according to guidelines written specifically for Latin, which are shared and developed with the Latin Dependency Treebank of the Perseus Project in Boston. The IT tagset is available here.

The IT-Treebank data are available in CSTS-SGML (Czech Sentence Tree Structure), PML-XML (Prague Markup Language) and CoNLL format.
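As a rough illustration of the CoNLL option, the sketch below reads tab-separated dependency data into Python. It assumes the ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL); the exact columns of the distributed IT-Treebank files are not specified here and may differ. The sample sentence is the motto quoted above, and its annotation (including the PDT-style labels Pred and Atr) is made up for illustration, not copied from the treebank. The names Token and read_conll are hypothetical.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    index: int   # position of the token in the sentence (1-based)
    form: str    # word form as it appears in the text
    lemma: str   # dictionary form
    head: int    # index of the governing token, 0 for the root
    deprel: str  # dependency label (analytical function)


# Illustrative annotation only: assumed CoNLL-X columns, invented analysis.
SAMPLE = """\
1\tarbor\tarbor\t_\t_\t_\t2\tSb\t_\t_
2\test\tsum\t_\t_\t_\t0\tPred\t_\t_
3\tcausa\tcausa\t_\t_\t_\t2\tPnom\t_\t_
4\tproxima\tproximus\t_\t_\t_\t3\tAtr\t_\t_
5\tfructus\tfructus\t_\t_\t_\t3\tAtr\t_\t_
"""


def read_conll(text: str) -> List[List[Token]]:
    """Split CoNLL-X style text into sentences (blank line = sentence break)."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        current.append(Token(int(cols[0]), cols[1], cols[2], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


sentences = read_conll(SAMPLE)
root = next(tok for tok in sentences[0] if tok.head == 0)
print(root.form)  # est
```

Keeping only the columns needed for the dependency tree (index, form, lemma, head, label) is a simplification; morphological tags would sit in the columns skipped here.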

Presently, the IT-Treebank is composed of 238,528 tokens, for a total of 13,754 syntactically parsed sentences excerpted from Scriptum super Sententiis Magistri Petri Lombardi, Summa contra Gentiles and Summa Theologiae.
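To put these figures in proportion, the short calculation below derives the average sentence length of the treebank and its share of the whole IT. The corpus size is the approximate "around 11 million tokens" given above, so the percentage is only indicative.

```python
TREEBANK_TOKENS = 238_528
TREEBANK_SENTENCES = 13_754
CORPUS_TOKENS = 11_000_000  # approximate figure ("around 11 million")

avg_len = TREEBANK_TOKENS / TREEBANK_SENTENCES
share = TREEBANK_TOKENS / CORPUS_TOKENS

print(f"{avg_len:.1f} tokens per sentence")  # 17.3 tokens per sentence
print(f"{share:.1%} of the IT corpus")       # 2.2% of the IT corpus
```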

A list of publications on the Index Thomisticus Treebank can be found here.

 

 

Partners and People

Browsing the data of IT-Treebank

The IT-Treebank can be browsed through Netgraph. Netgraph is a client-server application developed for browsing the data of PDT.

There are two ways to access the data:

 

1.    Stand-alone application: download Netgraph and request a password;

2.    Applet version of Netgraph

The IT-Treebank can also be accessed at the CLARIN website through the search tool TüNDRA (a free CLARIN account is required).


The Index Thomisticus Treebank Valency Lexicon (IT-VaLex) can be browsed on-line here. Valency is generally defined as the number of obligatory complements required by a word: these complements are usually named ‘arguments’, while the non-obligatory ones are referred to as ‘adjuncts’. Although valency can be assigned to different parts of speech (usually verbs, nouns and adjectives), scholars have mainly focused their attention on verbs, so that the notion of valency often coincides with verbal valency.

IT-VaLex is a collection of verbal lexical entries enhanced with valency and subcategorization frames. IT-VaLex is closely related to the Index Thomisticus Treebank project, since it is a corpus-based valency lexicon automatically induced from IT-TB data. The lexicon can be browsed by lexical entry, or by number and surface order of the arguments, which are linked to their lexical fillers.

In the Index Thomisticus Treebank annotation style, verbal arguments are annotated using the following tags: Sb (Subject), Obj (Object), OComp (Object Complement) and Pnom (Predicate Nominal).
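The idea of reading a valency frame off a dependency tree can be sketched as follows: collect the direct dependents of a verb that carry one of the four argument tags, and keep them in surface order together with their lexical fillers. This is a simplified illustration under stated assumptions, not the procedure actually used to induce IT-VaLex; it ignores coordination, prepositional arguments and other complications. The function name valency_frame, the tuple layout and the sample annotation are hypothetical.

```python
# Argument tags listed in the IT-Treebank annotation style.
ARGUMENT_TAGS = {"Sb", "Obj", "OComp", "Pnom"}

# Tokens as (index, form, lemma, head, deprel); invented annotation
# of "arbor est causa proxima fructus" for illustration only.
SENTENCE = [
    (1, "arbor", "arbor", 2, "Sb"),
    (2, "est", "sum", 0, "Pred"),
    (3, "causa", "causa", 2, "Pnom"),
    (4, "proxima", "proximus", 3, "Atr"),
    (5, "fructus", "fructus", 3, "Atr"),
]


def valency_frame(sentence, verb_index):
    """Return the (tag, lemma) pairs of the verb's arguments in surface order."""
    arguments = [
        (index, deprel, lemma)
        for index, _form, lemma, head, deprel in sentence
        if head == verb_index and deprel in ARGUMENT_TAGS
    ]
    arguments.sort()  # surface order = ascending token index
    return [(deprel, lemma) for _index, deprel, lemma in arguments]


frame = valency_frame(SENTENCE, verb_index=2)
print(frame)  # [('Sb', 'arbor'), ('Pnom', 'causa')]
```

Here the frame of sum has two arguments, a subject followed by a predicate nominal; the two Atr dependents of causa are not arguments of the verb and are left out.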

We are constantly improving our data. Please report any errors or send your comments to Marco Passarotti. The data of the Index Thomisticus Treebank corpus can be requested from Marco Passarotti.

IT data (not treebanked) can be browsed through Corpus Thomisticum.

The Index Thomisticus Treebank by CIRCSE is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.