Semantic / NLP Hackathon


When

Date: June 8 & 9 (after the conference that takes place on June 6 & 7)
Time: Let us start at 10am

Where

Location: Neofonie GmbH in Berlin: https://www.neofonie.de/standorte
See also map

What to bring

Your Berlin Buzzwords badge to be admitted to the Hackathon.
Enthusiasm
Creativity and expertise
Your laptop (or other hardware you need) preloaded with your favourite programming tools.

What
The general workshop topic is European R&D projects that involve data-intensive and semantic technologies (e.g. IKS, DICODE and related open source projects such as Stanbol, OpenNLP, UIMA, Mahout).

To be more concrete, here is a task proposal that has gained some traction among members of the Stanbol and OpenNLP developer communities:

Task title: Open-data corpora and open-source tools to train statistical models for NLP and knowledge extraction from text

Incubating Apache projects such as Stanbol and OpenNLP need to train statistical models on annotated corpora, for instance for Named Entity Recognition. The presently available models were mostly built on copyrighted corpora, typically from the Linguistic Data Consortium (LDC), whose licensing prevents those projects from improving, modifying, extending and redistributing the existing annotated corpora in order to build and distribute user-adapted statistical models.

We propose to take the opportunity of the Berlin Buzzwords meeting to organize a hackathon to kick-start an effort to build our own annotated training corpus from free-to-redistribute sources such as Wikipedia, Wikinews, DBpedia, Gutenberg... while collaborating with other interested projects such as OpenNLP and UIMA.

Several developers from OpenNLP, UIMA and Stanbol have already expressed interest in attending such a workshop. A practical goal could be to develop UI tools to manually refine, correct and complete a tokenized and semi-annotated NER corpus automatically extracted from Wikipedia / DBpedia using pignlproc. We could base this work on existing projects such as wordfreak or the UIMA CasEditor, or start a new web-based UI, for instance.
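As a rough illustration of what a corrected NER corpus could be exported to, the sketch below renders one tokenized sentence in the inline <START:type> ... <END> markup read by the OpenNLP name finder trainer (one whitespace-tokenized sentence per line). The function name, the (start, end, type) span representation and the sample sentence are assumptions made for this example; they are not taken from pignlproc or from any existing annotation tool.

```python
def to_opennlp_line(tokens, spans):
    """Render one sentence as an OpenNLP name finder training line.

    tokens: list of token strings for a single sentence.
    spans:  list of (start, end, entity_type) tuples, token-indexed,
            with `end` exclusive. Spans are assumed not to overlap.
    """
    starts = {start: entity_type for start, _end, entity_type in spans}
    ends = {end for _start, end, _entity_type in spans}

    parts = []
    for index, token in enumerate(tokens):
        if index in ends:
            # Close the entity that finished just before this token.
            parts.append("<END>")
        if index in starts:
            parts.append("<START:%s>" % starts[index])
        parts.append(token)
    if len(tokens) in ends:
        # An entity that runs to the end of the sentence still needs closing.
        parts.append("<END>")
    return " ".join(parts)


# Hypothetical sample sentence, already tokenized.
sample_tokens = ["The", "hackathon", "takes", "place", "in", "Berlin", "."]
sample_spans = [(5, 6, "location")]
print(to_opennlp_line(sample_tokens, sample_spans))
# The hackathon takes place in <START:location> Berlin <END> .
```

A manual correction UI would then only need to edit the span list; the export step stays a one-line call per sentence.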

We could also extend the topic of the hackathon to improving or building tools that build, package and distribute a Solr-based index of entities and topics from various sources such as DBpedia and geonames, with ranking scores based on popularity metrics. For instance, one could use graph centrality metrics derived from the link structure of Wikipedia articles, applying Apache Mahout's Lanczos SVD to the adjacency matrix to compute the eigen-centrality score of each DBpedia entity and SKOS topic.
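To make the eigen-centrality idea concrete, here is a minimal single-machine sketch. It does not use Mahout or a Lanczos solver: it swaps in plain power iteration, which finds the same dominant eigenvector on a small graph. It also assumes that links are treated as undirected and that the graph is connected; the entity names and links are made up for the example.

```python
import math


def eigen_centrality(links, iterations=200, tol=1e-10):
    """Eigenvector centrality by power iteration on an undirected link graph.

    links: iterable of (source, target) pairs; direction is ignored here.
    Returns a dict mapping each node to its score (unit Euclidean norm).
    """
    neighbours = {}
    for source, target in links:
        neighbours.setdefault(source, set()).add(target)
        neighbours.setdefault(target, set()).add(source)

    scores = {node: 1.0 for node in neighbours}
    for _ in range(iterations):
        # Multiply by (A + I) rather than A: the eigenvectors are the same,
        # but the shift avoids oscillation on bipartite graphs.
        updated = {
            node: scores[node] + sum(scores[other] for other in neighbours[node])
            for node in neighbours
        }
        norm = math.sqrt(sum(value * value for value in updated.values()))
        updated = {node: value / norm for node, value in updated.items()}
        delta = max(abs(updated[node] - scores[node]) for node in updated)
        scores = updated
        if delta < tol:
            break
    return scores


# Hypothetical toy link graph between four entities.
links = [
    ("Berlin", "Germany"),
    ("Berlin", "Europe"),
    ("Germany", "Europe"),
    ("Potsdam", "Berlin"),
]
for entity, score in sorted(eigen_centrality(links).items(), key=lambda item: -item[1]):
    print("%-8s %.3f" % (entity, score))
```

On a Wikipedia-sized graph the adjacency matrix would not fit this in-memory approach, which is where a distributed solver such as the one mentioned above would come in; the resulting scores could then be stored as a ranking field in the entity index.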

Registration

If you would like to participate, please add yourself below. Please also mention which particular aspect of the topic you would like to work on, and link to any relevant open source projects you contribute to that might be related to the topic.

Name	Mail	Bio
Olivier Grisel	ogrisel@apache.org	Author of pignlproc and contributor to Stanbol
Rupert Westenthaler	rupert.westenthaler@gmail.com	contributor to Stanbol
Hannes Korte	hannes.korte@iais.fraunhofer.de	generally interested in open source NLP applications
Daniel Streiff	daniel.streiff@htwchur.ch	interested in LOD, semantic graphs and NER
Szabolcs Grünwald	szaby.gruenwald@gmail.com	interested in semantic annotation and search UI solutions
Eduardo Torres	eduardo.torres-schumann@vico-research.com	interested in NLP for social media data and semantic topic modelling
Julien Nioche	julien@digitalpebble.com	Nutch, Tika, GORA committer; contributor to UIMA and GATE; main author of Behemoth
Doris Maassen	doris@neofonie.de	Research Manager, currently working for Dicode
Martin Gerlach	martin.gerlach@neofonie.de	interested in LOD, Data aggregation, NER and NE disambiguation based on semantic graphs, duplicate detection and merging
Anna Głazek	anna.glazek@nk.pl	NLP for Polish language
Joseph Turian	joseph@metaoptimize.com	Machine Learning / NLP
Riko Tertsch	dreamonspammmers@gmail.com	Machine Learning, NLP, Semantic Search, BigData
Name	Mail	Your short bio and interests go here

Log in or register to be able to edit this wiki page and add yourself here. If your account is not activated within 24 hours, please contact isabel@apache.org


Berlin Buzzwords 2011 is a conference for developers and users of open source software projects, focussing on the issues of scalable search, data analysis in the cloud and NoSQL databases. Berlin Buzzwords presents more than 30 talks and presentations by international speakers, organized around the three tags "search", "store" and "scale".