About

Project title	NewsReader: Building structured event indexes of large volumes of financial and economic data for decision making.
Funded by the EU	FP7 2011.4.4 - ICT Objective ‘Cooperation’ – Research theme: ‘Information and communication technologies’
Targeted Project	duration 36 months
Starting	1/1/2013
Ending	1/1/2016
Overall budget	3.770.000 euros
Community funding	2.800.000 euros

More info @ Community Research and Development Information Service (Cordis): the Gateway to European research and development

Big data as a problem

Professionals in any sector need access to accurate and complete knowledge to make well-informed decisions. This is getting more and more difficult because of the sheer volume of data they need to consider. There is more data than ever, and it is becoming highly interconnected, so that data gives access to other data. This is partly because digital data is only a click away, but also because our world has expanded from a regional to a global scope and is changing more rapidly. It also means that the knowledge and information of professionals quickly goes out of date, while at the same time their decisions have a bigger impact in a highly interconnected world. Professional decision-makers are therefore in a constant race to stay informed and to respond adequately to changes, developments and news. It is the great paradox of the information age that the increase in available knowledge and information makes it harder to use and exploit. The main reason is that we cannot sift the right from the wrong, or the relevant from the irrelevant. The more time we spend selecting relevant information, the less time we have to digest and process it.

Depending on others

With the increase in volume and dynamism, we rely more and more on other people and on technology to filter and select knowledge and information for us, because we cannot cope with it physically or mentally. The industrial project partner LexisNexis recently issued a study, conducted by Purple Market Research (http://www.purplemr.co.uk), on information consumption in a variety of private and public sectors. The study showed that most of these professionals (almost 90%) use the Internet to keep up to date (monitoring and ad hoc research), and that most of that usage (almost 60%) is through Google. It is striking that professional decisions thus depend on the quality of access provided by the Internet: an old search paradigm that mostly returns relevance-ranked lists of pointers to sources. Consequently, there is no guarantee that the information is accurate, up to date and complete. Nevertheless, according to the same report by Purple Market Research, these decision makers rank accuracy, being up to date and comprehensiveness as the most important features of the information that reaches them. Depending on their role and tasks, professional decision makers, such as security and credit officers in the finance domain or communications/PR managers or spokesmen, need to consider hundreds to several thousands of documents on a daily basis. This is still a small part of all the information coming in each day.

How big is the problem?

A large international information broker such as the project partner LexisNexis handles about 1.5 million news documents and 400 thousand web pages each day. The accumulation of all this knowledge and information over time is enormous. The LexisNexis archive contains over 25 billion documents spanning several decades. It includes, among others, 30,000 different newspapers (with 35,000 issues each day), 85 million company reports, over 60 million manager biographies and several hundred thousand market reports. LexisNexis serves about 4,000 different clients that consult its archive and the daily stream of information for their decision making. This information is gathered by specialists and journalists and distributed by information brokers such as LexisNexis, but it goes back to, and can be linked to, even more sources of knowledge and information of unknown proportions.

The big-data problem for professionals is to make well-informed decisions, based on accurate, up-to-date and comprehensive information, while sitting on top of an information iceberg that grows by millions of new documents every day. They partly solve this problem by relying on the quality of the services that information brokers offer to preselect information. LexisNexis provides text search, metadata search, user profiling and analytic tools to show the relevance of topics and news. In the end, these tools still return a list of (thousands of) documents, which becomes too long to consider as the volume expands. Another common solution is to (re-)use the work of other people who compile overviews, reports and summaries, rather than the source documents. In that case, decision-makers have to assume that these reports and interpretations are correct and meet the above criteria. As a consequence of this dependency, knowledge and information become less and less transparent and verifiable as the volume and complexity of information grow. Still, these professionals will be held liable for decisions based on trusted information provided by others.

Structuring information as stories

The most natural and effective way in which people process, store and remember information is not through lists of textual sources but by integrating new information with what they already know into a single representation of the past. When we read the news, we extract abstract story lines from text. We do not remember separate events but a single coherent story line in which events are connected. We may read another news article about the same story. It contains both duplicate and new information. In the end, we separate the old from the new and still store a single story line in which the information is integrated. In fact, we integrate that story line with everything we learned in the past and always store only one story line. Eventually, people summarize the information in a very compact and efficient form as an abstract story or plot.

Recording history

The NewsReader project addresses the above data volume problem in a similar way, by partly mimicking humans who read text and integrate new information with what is known of the past. Like human readers, NewsReader will reconstruct a coherent story in which new events are related to past events. NewsReader will represent events and their relations using formal semantic structures in an extremely compact and compressed form:

eliminating duplication and repetition,
detecting event identity,
completing incomplete descriptions, and finally
chaining and relating events into plots (temporal, local and causal chains).

The result of the day-by-day processing of large volumes of news and information will be stored in a knowledge base, in which each event is unique, connected to time and place and connected to many other events.
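The four steps above can be illustrated with a minimal sketch. It is not NewsReader's implementation: the `Event` schema, the `merge_mentions` function and the sample data are all hypothetical, and event identity is approximated by an exact match on (what, who, when), which is far cruder than what real news text would require.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Event:
    """One unique event instance (hypothetical schema for illustration)."""
    what: str
    who: str
    when: date
    where: Optional[str] = None
    sources: set = field(default_factory=set)


def merge_mentions(mentions):
    """Collapse mentions of the same event and order the result in time.

    Assumption: two mentions describe the same event when their
    (what, who, when) values match exactly.
    """
    events = {}
    for what, who, when, where, source in mentions:
        key = (what, who, when)
        event = events.setdefault(key, Event(what, who, when))
        # Complete an incomplete description from another mention.
        if event.where is None and where is not None:
            event.where = where
        # Record the source instead of storing the duplicate mention.
        event.sources.add(source)
    # A simple temporal chain: unique events sorted by date.
    return sorted(events.values(), key=lambda e: e.when)


# Invented sample data: two sources report the same acquisition.
mentions = [
    ("acquire", "CompanyA", date(2013, 3, 1), None, "paper-1"),
    ("acquire", "CompanyA", date(2013, 3, 1), "Amsterdam", "paper-2"),
    ("announce", "CompanyA", date(2013, 2, 1), "Amsterdam", "paper-3"),
]
chain = merge_mentions(mentions)
```

Here the three mentions collapse into two events. The acquisition is stored once, gains its missing location from the second source, and keeps both sources attached. Causal and local chaining would need additional relations that this sketch does not attempt.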

The accumulated knowledge base represents a complete, exact and rich record of the past. At the same time, this record is the most compact representation of the information contained in billions of text documents. Instead of a keyword index with relevance scores, NewsReader creates an index of abstract semantic schemas capturing sequences of essential event instances. Just like a normal keyword index, this story index provides access to all the original sources, but it abstracts from the exact wording and from multiple references, and it captures the relations as in a plot, which is the most intuitive way of structuring information for humans. Human memory is good at remembering stories, but it is also selective. When we read the news, we forget what the sources are, how many different sources there were and how they differed. We may not be able to recall the exact story later on; we forget most details and may even remember things wrongly. In fact, we may need to search for the original sources again to refresh our memory. Each of these sources again tells only part of the story, not the condensed and comprehensive story as a whole. In contrast to human readers, however, NewsReader will not forget any detail, will keep track of all the facts and will even know when and how stories were told differently by different sources. NewsReader will therefore be able to present the essential knowledge and information both as structured lists of data and facts and as abstract schemas of event sequences that represent stories going back in time, as humans do.

Who will be using this?

NewsReader will be tested on economic-financial news and on events relevant for political and financial decision-makers. About 25% of the news is about finance and the economy. This means that about 500,000 of the roughly 1.9 million news items and web pages processed by LexisNexis each day are potentially relevant for professionals working in this sector. We will process these data streams on a day-to-day basis, using natural-language-processing techniques to extract the economic-financial events from the text. Each event is described in terms of who, what, where and when, and each of these slots is interpreted as referring to unique entities, e.g. companies, people, governments. We will process text in English, Dutch, Spanish and Italian, from a variety of sources. From the text we extract a story line that connects different events. The extracted events are stored in a KnowledgeStore that keeps track of the event history by integrating the new knowledge into the knowledge of the past. As a result, we will know what happened to whom, at what time and place, and in which sequence these events took place. We will also know what is new about an event and what is knowledge from the past. As more and more news is added, the knowledge and information will grow, but it will be stored in the most compact form, since each entity and each event will be represented only once. Consequently, the true information grows only marginally in comparison to the size and volume of the new sources, with all their duplications, redundancies and speculations. The accumulated KnowledgeStore will still keep track of, and provide access to, all the sources from which the information is extracted. The large archive and the daily stream of data represent a major challenge for demonstrating the scalability of storage, processing and access. We will use distributed architectures and cloud-computing services to show the scalability of our solution.
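The idea that each event is stored only once while every source stays reachable can be sketched as follows. This is a toy in-memory stand-in, not the project's KnowledgeStore: the class name `KnowledgeStoreSketch`, its methods and the sample values are assumptions made purely for illustration.

```python
from collections import defaultdict


class KnowledgeStoreSketch:
    """Toy in-memory stand-in; the name and API are illustrative only."""

    def __init__(self):
        # Maps an event key (who, what, where, when) to its sources.
        self._sources_by_event = defaultdict(set)
        self.mentions_seen = 0

    def add(self, who, what, where, when, source):
        """Integrate one extracted event; return True if it was new."""
        self.mentions_seen += 1
        key = (who, what, where, when)
        is_new = key not in self._sources_by_event
        self._sources_by_event[key].add(source)
        return is_new

    def sources(self, who, what, where, when):
        """Return every source from which this event was extracted."""
        return set(self._sources_by_event.get((who, what, where, when), ()))

    def __len__(self):
        # Number of unique events, regardless of how often they were reported.
        return len(self._sources_by_event)


# Invented sample data: the same event reported by two sources.
store = KnowledgeStoreSketch()
first = store.add("BankX", "raise_rates", "London", "2013-05-02", "wire-1")
second = store.add("BankX", "raise_rates", "London", "2013-05-02", "blog-7")
```

After both calls the store holds one event with two sources, so `len(store)` grows more slowly than `store.mentions_seen`. That gap is the compaction effect described above, shown here only under the simplifying assumption that identical slot values mean identical events.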