Urania Berlin, June 6-7, 2011

Semantic / NLP Hackathon


  • Date: June 8 & 9 (after the conference that takes place on June 6 & 7)
  • Time: Let us start at 10am

What to bring

  • Your Berlin Buzzwords badge, which is required for admission to the hackathon
  • Enthusiasm
  • Creativity and expertise
  • Your laptop (or other hardware you need), preloaded with your favourite programming tools

The general workshop topic is European R&D projects that involve data-intensive and semantic technologies (e.g. IKS, DICODE, and related open source projects such as Stanbol, OpenNLP, UIMA and Mahout).

To be more concrete, here is a task proposal that has gained some traction among members of the Stanbol and OpenNLP developer communities:

Task title: Open-data corpora and open-source tools to train statistical models for NLP and knowledge extraction from text

Incubating Apache projects such as Stanbol and OpenNLP need to train statistical models on annotated corpora, for instance for Named Entity Recognition (NER). The currently available models were mostly built on copyrighted corpora, typically from the Linguistic Data Consortium (LDC), whose licenses prevent those projects from improving, modifying, extending and redistributing the annotated corpora in order to build and distribute user-adapted statistical models.

We propose to take the opportunity of the Berlin Buzzwords meeting to organize a hackathon that kick-starts an effort to build our own annotated training corpus from freely redistributable sources such as Wikipedia, Wikinews, DBpedia and Gutenberg, while collaborating with other interested projects such as OpenNLP and UIMA.

Several developers from OpenNLP, UIMA and Stanbol have already expressed interest in attending such a workshop. A practical goal could be to develop UI tools to manually refine, correct and complete a tokenized, semi-annotated NER corpus automatically extracted from Wikipedia and DBpedia using pignlproc. We could base this work on existing projects such as WordFreak or the UIMA CasEditor, or start a new web-based UI.
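To make the corpus format concrete, here is a minimal sketch of how one tokenized sentence with entity spans could be written out in the inline `<START:type> ... <END>` markup that the OpenNLP name finder reads as training data. The function name `to_opennlp_line`, the span representation and the sample sentence are assumptions made for illustration; they are not taken from pignlproc, and the sketch assumes the spans do not overlap.

```python
from typing import List, Tuple

# A span is (first_token_index, end_token_index_exclusive, entity_type).
# This representation is an assumption made for this sketch.
Span = Tuple[int, int, str]


def to_opennlp_line(tokens: List[str], spans: List[Span]) -> str:
    """Render one tokenized sentence with inline <START:type> ... <END> markup.

    Spans are assumed not to overlap.
    """
    starts = {start: etype for start, _end, etype in spans}
    ends = {end for _start, end, _etype in spans}
    out: List[str] = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")  # close the span that ended just before this token
        if i in starts:
            out.append(f"<START:{starts[i]}>")
        out.append(token)
    if len(tokens) in ends:
        out.append("<END>")  # a span that runs to the end of the sentence
    return " ".join(out)


tokens = ["Ada", "Lovelace", "was", "born", "in", "London", "."]
spans = [(0, 2, "person"), (5, 6, "location")]
line = to_opennlp_line(tokens, spans)
print(line)
# <START:person> Ada Lovelace <END> was born in <START:location> London <END> .
```

A correction UI of the kind proposed above would essentially let a human edit the `spans` list for each sentence before the line is written out again.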

We could also extend the topic of the hackathon to improving or creating tools to build, package and distribute a Solr-based index of entities and topics from various sources such as DBpedia and GeoNames, with ranking scores based on popularity metrics. For instance, one could use graph centrality metrics derived from the link structure of Wikipedia articles, applying Apache Mahout's Lanczos SVD to the adjacency matrix to compute an eigenvector centrality score for each DBpedia entity and SKOS topic.
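As an illustration of the centrality idea, the following sketch computes eigenvector centrality on a tiny link graph. It uses plain power iteration in pure Python, not Mahout's Lanczos solver, which would be the choice at Wikipedia scale. The function name `eigen_centrality`, its parameters and the four-node toy graph are all hypothetical.

```python
from typing import Dict, List


def eigen_centrality(
    links: Dict[str, List[str]],
    iterations: int = 100,
    tol: float = 1e-9,
) -> Dict[str, float]:
    """Approximate eigenvector centrality over incoming links by power iteration.

    `links` maps a page to the pages it links to. Each step adds the current
    score to itself (iterating A + I rather than A), which keeps the same
    dominant eigenvector while avoiding oscillation on periodic graphs.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = dict(score)
        for source, targets in links.items():
            for target in targets:
                new[target] += score[source]  # a link passes on the source's score
        norm = sum(v * v for v in new.values()) ** 0.5
        new = {n: v / norm for n, v in new.items()}
        if max(abs(new[n] - score[n]) for n in nodes) < tol:
            return new
        score = new
    return score


# Hypothetical toy link graph standing in for Wikipedia article links.
toy_links = {
    "Berlin": ["Germany", "Europe"],
    "Germany": ["Berlin", "Europe"],
    "Europe": ["Germany"],
    "Urania": ["Berlin"],
}
scores = eigen_centrality(toy_links)
for name, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.4f}")
```

In this toy graph "Urania" receives no incoming links, so its score tends towards zero. Scores like these could be stored as a ranking field next to each entity in the Solr index.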


If you would like to participate, please add yourself below. Please also mention which particular aspect of the topic you would like to work on, and link to any relevant open source projects you contribute to that might be related to the topic.

Name                 Bio
Olivier Grisel       Author of pignlproc and contributor to Stanbol
Rupert Westenthaler  contributor to Stanbol
Hannes Korte         generally interested in open source NLP applications
Daniel Streiff       interested in LOD, semantic graphs and NER
Szabolcs Grünwald    interested in semantic annotation and search UI solutions
Eduardo Torres       interested in NLP for social media data and semantic topic modelling
Julien Nioche        Nutch, Tika, GORA committer; contributor to UIMA and GATE; main author of Behemoth
Doris Maassen        Research Manager, currently working for Dicode
Martin Gerlach       interested in LOD, Data aggregation, NER and NE disambiguation based on semantic graphs, duplicate detection and merging
Anna Głazek          NLP for Polish language
Joseph Turian        Machine Learning / NLP
Riko Tertsch         Machine Learning, NLP, Semantic Search, BigData