Outline

Motivation

In the last decade, an important methodological breakthrough took place in Natural Language Processing (NLP) with the advent of statistical approaches. This permitted important advances in the efficiency and robustness of tools in almost every area of NLP -- ranging from tokenization to parsing -- and in a wide range of applications -- ranging from information extraction to machine translation.

These approaches are data-intensive and need large data sets for the estimation of relevant parameters as well as for the subsequent evaluation of the trained classifiers. These data sets have steadily grown not only in size but also in the complexity of the linguistic information they store, as the application of stochastic techniques has moved from relatively shallow tasks (e.g. POS tagging) to deeper linguistic processing tasks (e.g. semantic role labeling).

Hence, development activities on annotated corpora have concentrated around extending morphological information with information concerning phrase constituency (aka TreeBanks), syntactic functions (aka DependencyBanks), and most recently with phrase-level semantic functions and roles (aka PropBanks). The next generation of annotated corpora will thus expand these annotations with semantic information of different sorts beyond the phrase level, starting at the sentence-level representations of meaning (logical forms).


Major Goals

A central goal of the SemanticShare project is the development of annotated corpora for Portuguese of the recent and next generations, namely a PropBank and a LogicalFormBank. Part of these corpora is parallel to similar banks being produced for other languages in other projects.


Major features

These corpora are materializations of a single bank of utterances and corresponding grammatical representations, with the following major features:

  • Information breadth:
    They contain integrated morphological, syntactic and semantic information;
  • Flexible accessibility:
    They can be displayed in one or more of multiple views:
    1. Sentences
    2. Tokens
    3. Lemmas
    4. Inflection features
    5. POS tags
    6. Named entities and MWU
    7. Constituency trees
    8. Syntactic function trees
    9. Semantic function and role trees
    10. Logical forms
  • Empirical accuracy:
    Each stored representation is selected by a human annotator from among those generated by a grammatical analyzer;
  • Linguistic depth:
    They are stored under an internal representation format that is linguistically well informed, in compliance with a top-level mature framework for computational linguistics (HPSG).
  • Dynamic evolution:
    They are supported by cutting-edge corpus development tools that make it easy to extend the annotated structures, either as information from further linguistic dimensions is added (e.g. tense, anaphora resolution) or as the depth of the grammar is upgraded.
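
The idea of several views materialized from a single bank can be sketched as follows. This is only an illustrative sketch, not the project's actual storage format or tooling: the names Token, Entry, VIEWS and render are hypothetical, the POS labels are placeholders rather than the tag set of the corpora, and only four of the ten views listed above are modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Token:
    form: str   # surface form as it occurs in the sentence
    lemma: str  # dictionary form
    pos: str    # placeholder POS label (hypothetical tag set)


@dataclass(frozen=True)
class Entry:
    """One stored utterance with its annotations (hypothetical layout)."""
    sentence: str
    tokens: Tuple[Token, ...]


# Each view is a projection of the same stored entry.
VIEWS: Dict[str, Callable[[Entry], object]] = {
    "sentences": lambda e: e.sentence,
    "tokens": lambda e: [t.form for t in e.tokens],
    "lemmas": lambda e: [t.lemma for t in e.tokens],
    "pos": lambda e: [f"{t.form}/{t.pos}" for t in e.tokens],
}


def render(entry: Entry, view_names: List[str]) -> Dict[str, object]:
    """Display an entry in one or more of the available views."""
    return {name: VIEWS[name](entry) for name in view_names}


# Toy entry, invented for illustration only.
entry = Entry(
    sentence="Os gatos dormem.",
    tokens=(
        Token("Os", "o", "ART"),
        Token("gatos", "gato", "N"),
        Token("dormem", "dormir", "V"),
        Token(".", ".", "PNT"),
    ),
)

if __name__ == "__main__":
    for name, value in render(entry, ["sentences", "lemmas", "pos"]).items():
        print(name, value)
```

Because every view is computed from the same entry, adding a further annotation layer in such a design would mean adding a field and a projection, without duplicating the underlying utterances.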

Support from an international research community

This will be accomplished with the support of the DELPH-IN consortium, a worldwide initiative fostering cutting-edge research on deep linguistic processing by sharing open-source development tools, resources and best practices among its invited participants.

Its platform and annotation tool will permit rapid advances towards the project's goals, thus continuing previous cooperation, namely in the scope of the GramaXing project, where LXGram, a grammar for deep linguistic processing of Portuguese, was developed and is being maintained.

Also, part of the linguistic bank to be developed is the Portuguese counterpart of parallel banks being developed for other languages by other DELPH-IN members, following similar design requirements.


Major applications

These annotated corpora are key resources for the processing of Portuguese. Their uses include:
- providing an empirical basis for the linguistic study of this language and for the development of hand-crafted processing tools;
- training statistically based tools for shallow to deep language processing, such as parsers and semantic role labelers;
- evaluating processing tools;
- supporting experimentation with novel approaches to multilingual NLP, such as statistical machine translation or automatic meta-annotation for the semantic web.




Participants

The SemanticShare project is currently under development by the NLX-Natural Language and Speech Group, at the Department of Informatics of the Faculty of Sciences of the University of Lisbon. The experimental extraction of statistical parsers from the treebanks is being performed in cooperation with the Natural Language Processing Group of PUCRS-Pontifícia Universidade Católica do Rio Grande do Sul.


Team

Mariana Avelãs

António Horta Branco (coord.)

Sérgio Castro

Francisco Costa

Rosa del Gaudio

Patricia Nunes

Filipe Gil

Vera Strube de Lima

Clara Pinto

Carlos Prolo

Joana Ramos

David Raposo

João Silva

Sara Silveira



Funding

The project is funded by the FCT - Foundation for Science and Technology of the MCT - Portuguese Ministry of Science and Technology, under the contract PTDC/PLP/81157/2006. The project was developed from February 2008 to December 2010.




Results

Online Services

Demos of the tools developed are available in the LX Center.


Corpora and datasets

Corpora and datasets developed are available from the LX Center.


Publications

Branco, António and Francisco Costa, forth.a, "HPSG: Arquitectura Gramatical", chapter in Alencar, Leonel and Gabriel Othero, eds., Abordagens Computacionais da Teoria da Gramática, Mercado de Letras, Campinas, Brasil.

Branco, António and Francisco Costa, forth.b, "Processamento Gramatical em HPSG", chapter in Alencar, Leonel and Gabriel Othero, eds., Abordagens Computacionais da Teoria da Gramática, Mercado de Letras, Campinas, Brasil.

Branco, António and Francisco Costa, forth.c, "Processamento Semântico em HPSG", chapter in Alencar, Leonel and Gabriel Othero, eds., Abordagens Computacionais da Teoria da Gramática, Mercado de Letras, Campinas, Brasil.

Costa, Francisco and António Branco, forth., LXGram Implementation Report, University of Lisbon, Faculty of Sciences, Department of Informatics.

Branco, António and Sara Silveira, 2011, Guidelines for Versioning and Data Management in Circular TAVA Treebanking, University of Lisbon, Faculty of Sciences, Department of Informatics (in Portuguese).

Branco, António, João Silva, Francisco Costa and Sérgio Castro, 2011, CINTIL TreeBank Handbook: Design options for the representation of syntactic constituency, University of Lisbon, Faculty of Sciences, Department of Informatics.

Branco, António, Sérgio Castro, João Silva and Francisco Costa, 2011, CINTIL DepBank Handbook: Design options for the representation of grammatical dependencies, University of Lisbon, Faculty of Sciences, Department of Informatics.

Branco, António and Francisco Costa, 2010, "A Deep Linguistic Processing Grammar for Portuguese", In Lecture Notes in Artificial Intelligence, 6001, pp. 86-89, Berlin: Springer.

Branco, António, Francisco Costa, João Silva, Sara Silveira, Sérgio Castro, Mariana Avelãs, Clara Pinto and João Graça, 2010, "Developing a Deep Linguistic Databank Supporting a Collection of Treebanks: the CINTIL DeepGramBank", In Proceedings, LREC2010 - The 7th International Conference on Language Resources and Evaluation, Valletta, Malta, May 19-21, 2010.

Branco, António and Sara Silveira, 2010, Guidelines for Dynamic Annotation, University of Lisbon, Faculty of Sciences, Department of Informatics (in Portuguese).

Branco, António, 2009, "LogicalFormBanks, the Next Generation of Semantically Annotated Corpora: key issues in construction methodology", In Mieczyslaw Klopotek, Adam Przepiorkowski, Slawomir Wierzchón, Krzysztof Trojanowski, eds., Recent Advances in Intelligent Information Systems, Academic Publishing House EXIT, Warsaw, pp. 3-12.

Branco, António, Francisco Costa, Eduardo Ferreira, Pedro Martins, Filipe Nunes, João Silva and Sara Silveira, 2009a, "LX-Center: a center of online linguistic services", In Proceedings of the Demo Session, ACL-IJCNLP2009 - Joint conference of the 47th Annual Meeting of the ACL-Association for Computational Linguistics and the 4th IJCNLP-International Joint Conference on Natural Language Processing, Singapore.

Branco, António, Francisco Costa, Eduardo Ferreira, Pedro Martins, Filipe Nunes, João Silva and Sara Silveira, 2009b, "LX-Center: A Center of Online Services for Education, Research and Development on Language Science and Technology", In Proceedings of the I Iberian SLTech - I Joint SIG-IL/Microsoft Workshop on Speech and Language Technologies for Iberian Languages, Porto Salvo, September 3-4, 2009.

Branco, António, Sara Silveira, Sérgio Castro, Mariana Avelãs, Clara Pinto and Francisco Costa, 2009, "Dynamic Propbanking with Deep Linguistic Grammars", In Proceedings, TLT2009 - The 8th International Workshop on Treebanks and Linguistic Theories, Milan, December 4-5, 2009.

Branco, António and Francisco Costa, 2008a, "High Precision Analysis of NPs with a Deep Processing Grammar", In Johan Bos and Rodolfo Delmonte (eds.), Semantics in Text Processing, London, College Publications, Research in Computational Semantics Series, Vol. 1, pp. 31-44.

Branco, António and Francisco Costa, 2008b, "LXGram in the Shared Task 'Comparing Semantic Representations' of STEP2008", In Johan Bos and Rodolfo Delmonte (eds.), Semantics in Text Processing, London, College Publications, Research in Computational Semantics Series, Vol. 1, pp. 299-314.

Branco, António, Francisco Costa, Pedro Martins, Filipe Nunes, João Silva and Sara Silveira, 2008, "LXService: Web Services of Language Technology for Portuguese", In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, D. Tapias (eds.), Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC2008), Paris, ELRA.

Costa, Francisco, 2010, Processing Temporal Information in Unstructured Documents, Doctoral course qualification report, University of Lisbon, Faculty of Sciences, Department of Informatics.

Costa, Francisco and António Branco, 2010, "Temporal information processing of a new language: fast porting with minimal resources", In Proceedings, ACL2010 - The 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, July 11-16, 2010.

Del Gaudio, Rosa, 2010, Automatic Extraction of Definitions, Doctoral course qualification report, University of Lisbon, Faculty of Sciences, Department of Informatics.

Del Gaudio, Rosa and António Branco, 2009a, "Extraction of Definitions in Portuguese: An Imbalanced Data Set Problem", In Luís Seabra Lopes, Nuno Lau, Pedro Mariano and Luís M. Rocha (eds.), New Trends in Artificial Intelligence, 14th Portuguese Conference on Artificial Intelligence, EPIA 2009, Aveiro, October 12-15, 2009, pp. 501-512.

Del Gaudio, Rosa and António Branco, 2009b, "Language Independent System for Definition Extraction: First Results Using Learning Algorithms", In Proceedings, wDE2009 - Workshop on Definition Extraction, RANLP2009, Borovets, September 18, 2009.

Nunes, Patricia and António Branco, 2010, "Buscador Online do CINTIL-Treebank", In Proceedings of XXV Annual Meeting of the Portuguese Association of Linguistics (APL), Lisbon, APL.

Nunes, Patricia and António Branco, 2009, "CINTIL-Treebank Searcher", In Proceedings of the I Iberian SLTech - I Joint SIG-IL/Microsoft Workshop on Speech and Language Technologies for Iberian Languages, Porto Salvo, September 3-4, 2009.

Reis, Ruben, 2010, Marcação Semântica de Páginas Web Apoiada por Parsers de Dependências Gramaticais, MA Dissertation, University of Lisbon, Faculty of Sciences, Department of Informatics.

Silva, João, 2010, Robust Handling of Out-of-Vocabulary Words in Portuguese Deep Processing Grammar, Doctoral course qualification report, University of Lisbon, Faculty of Sciences, Department of Informatics.

Silva, João, António Branco, Sérgio Castro and Ruben Reis, 2010, "Out-of-the-Box Robust Parsing of Portuguese", In Lecture Notes in Artificial Intelligence, 6001, pp. 86-89, Berlin: Springer.

Silva, João, António Branco and Patricia Nunes, 2010, "Top-Performing Robust Constituency Parsing of Portuguese: freely available in as many ways as you can get it", In Proceedings, LREC2010 - The 7th International Conference on Language Resources and Evaluation, Valletta, Malta, May 19-21, 2010.




Cooperation

The SemanticShare project integrates participants and tools from the following related projects:

  • GRAMAXING - Computational Grammar for Deep Linguistic Processing of Portuguese
  • TagShare - Tagging and Shallow Processing Tools and Resources