Semantic Hackathon

Location:

Neofonie GmbH

Date and time:

Wed, 2011-06-08 10:00 - 18:00

Speaker:

Everyone

For updated information see also:

https://2011.berlinbuzzwords.de/wiki/semantic-hackathon

The general workshop topic is European R&D projects that involve data-intensive and semantic technologies (e.g. IKS [1], DICODE [2] and related open source projects such as Stanbol [3], OpenNLP [4], UIMA [5] and Mahout [6]).

To be more concrete, here is a task proposal that seems to have gained traction among at least some members of the Stanbol and OpenNLP developer communities:

Task title: Open-data corpora and open-source tools to train statistical models for NLP and knowledge extraction from text

Incubating Apache projects such as Stanbol and OpenNLP need to train statistical models on annotated corpora, for instance for Named Entity Recognition (NER). The currently available models were mostly built on copyrighted corpora, typically from the Linguistic Data Consortium (LDC), whose licensing prevents those projects from improving, modifying, extending and redistributing the annotated corpora in order to build and distribute user-adapted statistical models.

We propose to take the opportunity of the Berlin Buzzwords meeting to organize a hackathon that kick-starts an effort to build our own annotated training corpus from freely redistributable sources such as Wikipedia, Wikinews, DBpedia and Gutenberg, in collaboration with other interested projects such as OpenNLP and UIMA.

Several developers from OpenNLP, UIMA and Stanbol have already expressed interest in attending such a workshop. A practical goal could be to develop UI tools to manually refine, correct and complete a tokenized, semi-annotated NER corpus automatically extracted from Wikipedia and DBpedia using pignlproc [7]. We could base this work on existing projects such as wordfreak [8] or the UIMA CasEditor [9], or start a new web-based UI.
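To make the notion of an annotated NER corpus more concrete, here is a minimal Python sketch that writes one tokenized sentence with entity spans in the inline markup used by the OpenNLP name finder trainer (`<START:type> ... <END>`, one whitespace-tokenized sentence per line). The input structure (a token list plus `(start, end, type)` spans) and the function name `to_opennlp_line` are assumptions made for illustration; they are not the actual output format of pignlproc.

```python
def to_opennlp_line(tokens, spans):
    """Render one sentence in OpenNLP name finder training markup.

    tokens: list of token strings for a single sentence.
    spans:  list of (start, end, entity_type) tuples, with `end` exclusive.
            This input layout is hypothetical, chosen for the example.
    """
    starts = {start: entity_type for start, _end, entity_type in spans}
    ends = {end for _start, end, _entity_type in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in starts:
            out.append("<START:%s>" % starts[i])
        out.append(token)
        if i + 1 in ends:
            out.append("<END>")
    return " ".join(out)


# Toy sentence with two hand-made annotations.
tokens = ["Neofonie", "GmbH", "is", "based", "in", "Berlin", "."]
spans = [(0, 2, "organization"), (5, 6, "location")]
line = to_opennlp_line(tokens, spans)
print(line)
# <START:organization> Neofonie GmbH <END> is based in <START:location> Berlin <END> .
```

A correction UI of the kind proposed above would essentially let a human edit the `spans` list before lines like this are written out for training.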

We could also extend the topic of the hackathon to improving or building tools that build, package and distribute a Solr-based index of entities and topics from various sources such as DBpedia and GeoNames, with ranking scores based on popularity metrics. For instance, one could use graph centrality metrics computed from the link structure of Wikipedia articles [10], using Apache Mahout's Lanczos SVD of the adjacency matrix to compute an eigen-centrality score for each DBpedia entity and SKOS topic.
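As a rough illustration of the eigen-centrality idea, the following Python sketch scores the nodes of a tiny link graph by power iteration. Note that this swaps in plain power iteration for the Lanczos SVD mentioned above, which is what one would use at Wikipedia scale. The function name `eigen_centrality`, the dict-of-lists input and the three-page toy graph are all assumptions made for the example, not part of any existing tool.

```python
def eigen_centrality(links, iterations=200, tol=1e-9):
    """Approximate eigenvector centrality by power iteration.

    links: dict mapping a page to the list of pages it links to
           (hypothetical input layout for this sketch).
    A page scores highly when it is linked from other high-scoring pages.
    """
    nodes = sorted(set(links) | {dst for targets in links.values() for dst in targets})
    score = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Start from the previous score (i.e. iterate with A + I), which
        # avoids oscillation on bipartite-like graphs, then add the score
        # of every page that links to the node.
        new = dict(score)
        for src, targets in links.items():
            for dst in targets:
                new[dst] += score[src]
        # Rescale to unit Euclidean length so values stay bounded.
        norm = sum(value * value for value in new.values()) ** 0.5
        new = {node: value / norm for node, value in new.items()}
        delta = max(abs(new[node] - score[node]) for node in nodes)
        score = new
        if delta < tol:
            break
    return score


# Toy link graph: "Germany" is linked from both other pages.
links = {
    "Berlin": ["Germany"],
    "Germany": ["Berlin", "Europe"],
    "Europe": ["Germany"],
}
scores = eigen_centrality(links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

Scores of this kind could then be stored as a numeric field alongside each entity in the Solr index and used as a ranking boost.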

Please contact Isabel Drost (isabel@apache.org) and Olivier Grisel (ogrisel@apache.org) if you are interested in participating. Please mention which particular aspect of the topic you would like to work on.

[1] https://iks-project.eu
[2] https://dicode-project.eu
[3] https://incubator.apache.org/stanbol
[4] https://incubator.apache.org/opennlp
[5] https://uima.apache.org
[6] https://mahout.apache.org
[7] https://github.com/ogrisel/pignlproc
[8] https://wordfreak.sourceforge.net
[9] https://uima.apache.org/d/uimaj-2.3.1/tools.html#ugr.tools.ce
[10] https://dbpedia.org/Downloads36#wikipediapagelinks

Berlin Buzzwords 2011 is a conference for developers and users of open source software projects, focussing on the issues of scalable search, data analysis in the cloud and NoSQL databases. Berlin Buzzwords presents more than 30 talks and presentations by international speakers, organized around the three tags "search", "store" and "scale".