TrendMiner: Large-scale, Cross-Lingual Trend Mining and Summarisation of Real-Time Media Streams
TrendMiner project website, @TrendMiner
Summary
The recent massive growth in online media and the rise of user-authored content (e.g. weblogs, Twitter, Facebook) have led to challenges in how to access and interpret these strongly multilingual data in a timely, efficient, and affordable manner. Scientifically, streaming online media pose new challenges, due to their shorter, noisier, and more colloquial nature. Moreover, they form a temporal stream strongly grounded in events and context. Consequently, existing language technologies fall short on accuracy, scalability and portability.
The goal of this project is to deliver innovative, portable open-source real-time methods for cross-lingual mining and summarisation of large-scale stream media. TrendMiner will achieve this through an inter-disciplinary approach, combining deep linguistic methods from text processing, knowledge-based reasoning from web science, machine learning, economics, and political science. No expensive human-annotated data will be required, because we will use time-series data (e.g. financial markets, political polls) as a proxy. A key novelty will be weakly supervised machine learning algorithms for automatic discovery of new trends and correlations. Scalability and affordability will be addressed through a cloud-based infrastructure for real-time text mining from stream media. Results will be validated in two high-profile case studies: financial decision support (with analysts, traders, regulators, and economists) and political analysis and monitoring (with politicians, economists, and political journalists). The techniques will be generic, with many business applications: business intelligence, customer relations management, and community support. The project will also benefit society and ordinary citizens by enabling enhanced access to government data archives, summarisation of online health information, and tracking of hot societal issues.
Contact: Kalina Bontcheva (PI)
Publications and Deliverables
Objectives
We aim to develop novel multilingual ontology-based extraction methods, which are capable of analysing the shorter, colloquial, noisy, and contextualised social media streams. Our goal is to identify trends and sentiment across multiple languages, as well as to extract relevant entities and events and store them in a knowledge base. We will use DFKI and USFD’s state-of-the-art methods for Ontology-Based Information Extraction (OBIE) as a starting point. Another innovative contribution will be the integration of opinion and trend elements in ontologies. This will be supported by semi-automatic lexical and terminological acquisition methods, applied to existing multilingual knowledge resources and unstructured documents. Those resources will help in modelling the type of information associated with sentiment and opinions in general.
We also plan to develop portable, weakly supervised machine-learning approaches for the automatic identification of important messages and for extracting text fragments from large volumes of streaming social media text. Training data for supervised learning is not readily available and would be expensive and time-consuming to create. For this reason we will investigate ways of using readily available data instead, in the form of market price movements and poll results.
Furthermore we aim to:
- Develop new approaches for timeline-based summarisation of stream media, which will display how events unfold and attitudes change over time.
- Deliver a multi-paradigm semantic search which can be used to index and search over multi-lingual stream media, linguistic annotations, semantic schemas (ontologies), and semantic meta-data (instance data).
- Develop a cross-lingual information access User Interface (UI) for scalable semantic-based faceted browsing of trends and sentiment, linking these to the original stream media via the multi-paradigm semantic search index.
- Evaluate the new stream media summarisation and semantic information access UIs.
Our Role
Our role focuses firstly on multilingual ontology-based information extraction and knowledge modelling:
- Developing methods for semi-automatically acquiring multilingual lexical and terminological knowledge, which will allow us to model relevant aspects of opinion in ontologies.
- Developing ontology-based IE methods capable of analysing the shorter, colloquial, noisy, and contextualised social media streams.
Additionally we will work on machine learning models for mining trends from streaming media:
- The problem is one of modelling a real-valued quantity that varies over time (price or poll results), using contemporaneous text to predict how the quantity changes. This is naturally modelled using regression, which learns correlations between features of the text at a given time and subsequent price or opinion poll movements (the response variable). We will develop a regression approach, building upon previous work on predicting market movements from textual input.
- Developing a new Bayesian non-parametric machine-learning algorithm for inferring a clustering of the text into groups of like items or authors.
- Evaluation of the above.
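The regression setup described above can be illustrated with a minimal sketch. It assumes bag-of-words counts as text features and an L2-regularised linear model fitted by gradient descent; the project's actual features and model are not specified here. The function names (`bag_of_words`, `fit_ridge`, `predict`) and the four toy observations are hypothetical and invented purely for illustration.

```python
from collections import Counter

def bag_of_words(text, vocab):
    """Map a text to a vector of word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def fit_ridge(X, y, l2=0.1, lr=0.05, epochs=2000):
    """Fit a linear model with an L2 penalty by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Predicted movement of the response variable for one feature vector."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

# Toy data (invented): text observed at time t, paired with the
# subsequent movement of a price or poll figure.
observations = [
    ("profits surge strong growth", 1.0),
    ("losses mount weak outlook", -1.0),
    ("strong growth continues", 0.8),
    ("weak sales losses", -0.9),
]
vocab = sorted({word for text, _ in observations for word in text.split()})
X = [bag_of_words(text, vocab) for text, _ in observations]
y = [movement for _, movement in observations]
w, b = fit_ridge(X, y)

rise = predict(w, b, bag_of_words("strong growth", vocab))
fall = predict(w, b, bag_of_words("weak losses", vocab))
print(f"'strong growth' -> {rise:+.2f}, 'weak losses' -> {fall:+.2f}")
```

The point of the sketch is that the response variable comes directly from the time series, so no manually annotated text is needed to train the model.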
Furthermore we will focus on cross-lingual summarisation and information access over media streams:
- Adapting to stream media two open-source, state-of-the-art extractive multi-document summarisation tools: MEAD and USFD’s GATE-based SUMMA.
- Developing an open-source tool for multi-paradigm search over multi-lingual stream media, linguistic annotations, semantic schemas (ontologies), and semantic meta-data (instance data).
- Evaluation of these.
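As a rough illustration of what extractive multi-document summarisation does, the sketch below scores each sentence by the average corpus frequency of its words and keeps the top-scoring sentences in their original order. This is a simple frequency-based baseline, not the algorithm used by MEAD or SUMMA; the function names, the stopword list, and the example posts are all hypothetical.

```python
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "on", "for", "as"}

def tokenize(sentence):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    words = (w.strip(".,!?;:").lower() for w in sentence.split())
    return [w for w in words if w and w not in STOPWORDS]

def summarise(sentences, k=2):
    """Return the k highest-scoring sentences, kept in their original order."""
    tokens = [tokenize(s) for s in sentences]
    freq = Counter(w for toks in tokens for w in toks)

    def score(toks):
        # Average frequency, so that long sentences are not favoured by length alone.
        return sum(freq[w] for w in toks) / len(toks) if toks else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(tokens[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]

# Invented example posts standing in for a media stream.
posts = [
    "Markets fall as bank shares drop.",
    "Bank shares drop again in early trading.",
    "My cat is asleep.",
]
for sentence in summarise(posts, k=2):
    print(sentence)
```

Sentences that share vocabulary with many other posts score highest, so off-topic messages tend to be left out of the summary.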
Partners
- DEUTSCHES FORSCHUNGSZENTRUM FUER KUENSTLICHE INTELLIGENZ GMBH
- THE UNIVERSITY OF SHEFFIELD
- Ontotext AD
- UNIVERSITY OF SOUTHAMPTON
- STICHTING INTERNET MEMORY FOUNDATION
- EUROKLEIS S.R.L.
- SORA OGRIS & HOFINGER GMBH
- Hardik Fintrade Pvt Ltd.
Key Personnel in Sheffield
- Kalina Bontcheva (Principal Investigator)
- Trevor Cohn (Co-Investigator)
- Danica Damljanovic
Funding
TrendMiner addresses Objective ICT-2011.4.2 Language Technologies, target outcome b) Information access and mining.
- Project Number: 287863
- Project Acronym: TrendMiner
- Project Title: Large-scale, Cross-lingual Trend Mining and Summarisation of Real-time Media Streams
- Starting Date: 01/11/2011
- Duration in Months: 36
- Call (part) identifier: FP7-ICT-2011-7
- Free Keywords: real-time text mining and summarisation; multilinguality; weakly supervised machine learning; analysing media streams; information extraction; cloud-based infrastructure; time series modelling