Big data blah $ blah $ blah $

Why does LinkedIn feed me with big data hype from 2011?
Talking only about dollar metrics turns potential big data intelligence into junk data science.

blunt abstraction of native domain metrics into dollars is a source of junk data

All meaningful automation, quality, energy efficiency, and resilience metrics are obliterated by translating them into dollars. Good business decisions are made by understanding the implications of domain-specific metrics:
  1. Level of automation
  2. Volume of undesirable waste
  3. Energy use
  4. Reliability and other quality of service attributes

Any practitioner of Kaizen knows that sustainable cost reductions are the result of improvements in concrete metrics that relate directly to the product that is being developed or the service that is being delivered. The same domain expertise that is useful for Kaizen can be combined with high quality external big data sources to produce insights that enable radical innovation.

Yes, often the results have a highly desirable effect on operating costs or sales, but the real value can only be understood in terms of native domain metrics. The healthcare domain is a good example. Minimising the costs of high quality healthcare is desirable, but only when patient outcomes and quality of care are not compromised.

When management consultants only talk about results in dollars, there is a real danger of expressing goals only in financial terms. This leads down the slippery slope of tinkering with outcomes and accounting procedures until the desirable numbers are within range. By the time experts start to ask questions about outcomes, and the lack of native domain metrics exposes the reduction in operational costs as a case of cutting corners, it is too late.

Before believing a big data case study, always look beyond the dollars. If in doubt, survey customers to confirm claims of improved outcomes and customer satisfaction. The referenced McKinsey article does not encourage corner cutting, but it fails to highlight the need for setting targets in native domain metrics, and it distracts the reader with blunt financial metrics.

Let’s talk semantics. Do you know what I mean?

Over the last few years the talk about search engine optimisation has given way to hype about semantic search.


context matters

The challenge with semantics is always context. Any useful form of semantic search would have to consider the context of a given search request. At a minimum the following context variables are relevant: industry, organisation, product line, scientific discipline, project, and geography. When this context is known, a semantic search engine can realistically tackle the following use cases:

  1. Looking up the natural language names or idioms that are in use to refer to a specific concept
  2. Looking for domain knowledge; i.e. looking for all concepts that are related to a given concept
  3. Investigating how a word or idiom is used in other industries, organisations, products, research projects, and geographies; i.e. investigating the variability of a concept across these contexts
  4. Looking up all the instances where a concept is used in Web content
  5. Investigating how established a specific word or idiom is in the scientific community, to distinguish between established terminology and fashionable marketing jargon
  6. Looking up the formal names that are used in database definitions, program code, and database content to refer to a specific concept
  7. Looking up all the instances where a concept is used in database definitions, program code, and database content
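As a sketch of what such a context-aware search request might look like, the request could carry its context variables explicitly rather than leaving them to be inferred. The class and field names below are illustrative assumptions, not an existing API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SemanticSearchRequest:
    """A search request that carries its context variables explicitly."""
    concept: str
    industry: Optional[str] = None
    organisation: Optional[str] = None
    product_line: Optional[str] = None
    discipline: Optional[str] = None
    project: Optional[str] = None
    geography: Optional[str] = None

    def context(self):
        """Return only the context variables that were actually supplied."""
        return {k: v for k, v in vars(self).items()
                if k != "concept" and v is not None}

request = SemanticSearchRequest(
    concept="solution architecture",
    industry="healthcare",
    geography="Australia",
)
print(request.context())  # {'industry': 'healthcare', 'geography': 'Australia'}
```

With the context made explicit, a search engine no longer has to guess it from preferences or browsing history, which is the core complaint developed below.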

These use cases relate to the day-to-day work of many knowledge workers. The following presentation illustrates the challenges of semantic search and it contains examples that illustrate how semantic search based on concepts differs from search based on words.

semantic search

Do you know what I mean?

The current semantic Web is largely blind to the context parameters of industry, organisation, product line, scientific discipline, and project. Google, Microsoft, and other developers of search engines consider a fixed set of filter categories such as geography, time of publication, application, etc. and apply a more or less secret sauce to deduce further context from a user’s preferences and browsing history. This approach is fundamentally flawed:

  • Each search engine relies on an idiosyncratic interpretation of user preferences and browsing history to deduce the values of further context variables, and the user is only given limited tools for influencing the interpretation, for example via articulating “likes” and “dislikes”
  • Search engines rely on idiosyncratic algorithms for translating filters, “likes”, and “dislikes” into search engine semantics
  • Search engines are unaware of the specific intent of the user at a given point in time, and without more dynamic and explicit mechanisms for a user to articulate intent, relying on a small set of filter categories, user’s preferences, and browsing history is a poor choice

The weak foundations of the “semantic Web”, which evolved from a 1994 keynote by Tim Berners-Lee, compound the problem:

“Adding semantics to the Web involves two things: allowing documents which have information in machine readable forms, and allowing links to be created with relationship values.”

Subsequently developed W3C standards are the result of design by committee with the best intentions.

All organisations that have high hopes for turning big data into gold should pause for a moment, and consider the full implication of “garbage in, garbage out” in their particular context. Ambiguous data is not the only problem. Preconceived notions about semantics are another big problem. Implicit assumptions are easily baked into analytical problem statements, thereby confining the space of potential “insights” gained from data analysis to conclusions that are consistent with preconceived interpretations of so-called metadata.

The root cause of the limitations of state-of-the-art semantic search lies in the following implicit assumptions:

  • Text / natural language is the best mechanism for users to articulate intent, i.e. a reliance on words rather than concepts
  • The best mechanism to determine context is via a limited set of filter categories, user preferences, and via browsing history

words vs concepts

Semantic search will only improve if and when Web browsers rely on explicit user guidance to translate words into concepts before executing a search request. Furthermore, to reduce search complexity, a formal notion of semantic equivalence is essential.
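A minimal sketch of what such a formal notion of semantic equivalence could look like, assuming a word-to-concept mapping built with explicit user guidance (all words and concept identifiers below are invented for illustration): words that translate to the same concept identifier form one equivalence class, so many word-level queries collapse into a single concept-level query.

```python
from collections import defaultdict

# Hypothetical word-to-concept mapping; in practice it would be
# assembled with explicit user guidance, as argued above.
word_to_concept = {
    "car": "concept:automobile",
    "automobile": "concept:automobile",
    "motor vehicle": "concept:automobile",
    "lorry": "concept:truck",
    "truck": "concept:truck",
}

def equivalence_classes(mapping):
    """Group words that denote the same concept into equivalence classes."""
    classes = defaultdict(set)
    for word, concept in mapping.items():
        classes[concept].add(word)
    return dict(classes)

classes = equivalence_classes(word_to_concept)
# "car", "automobile", and "motor vehicle" now fall into one class,
# so a search for any of them becomes a search for one concept.
```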

semantic equivalence


Lastly, the mapping between labels and semantics depends significantly on linguistic scope. For example, the meaning of solution architecture in organisation A is typically different from the meaning of solution architecture in organisation B.
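The point can be made concrete with a hypothetical scoped lookup, where the concept a label denotes is keyed by its linguistic scope. All identifiers below are invented for the sake of the example.

```python
# The same label resolves to a different concept depending on its
# linguistic scope (here, the organisation).
scoped_meaning = {
    ("org-A", "solution architecture"): "concept:org-A/system-integration-design",
    ("org-B", "solution architecture"): "concept:org-B/enterprise-it-blueprint",
}

def resolve(scope, label):
    """Look up the concept that a label denotes within a given scope."""
    return scoped_meaning.get((scope, label))

# The same words, two different concepts:
assert resolve("org-A", "solution architecture") != resolve("org-B", "solution architecture")
```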

linguistic scope


If the glacial speed of innovation in mainstream programming languages and tools is any indication, the main use case of semantic search is going to remain:

User looks for a product with features x, y, and z

The other use cases mentioned above may have to wait for another 10 years.

Participate: Tweeting in the format URL relationship URL

Twitter has emerged as a very powerful medium for propagating ideas and thoughts. Twitter may well be the ideal data input tool for harnessing the collective insights of the humans and systems that are connected to the web – effectively a significant proportion of all humans and virtually every non-trivial system on the planet.

By simply adopting a convention of tweeting important insights in the format <some URL> <some relationship> <some other URL>, users can incrementally, one step at a time, create a personal model of the web. These personal models can grow arbitrarily large, and Twitter is certainly not the appropriate tool for visualising, modularising, and analysing such models. But arguably, Twitter is the simplest and most elegant possible front end for capturing atoms of knowledge.

Note that URLs used on Twitter typically point to a substantial piece of information, and not a simple word or sentence. Often a URL references an entire article, a web site, or a non-trivial web-based system. These articles, web sites or systems can be considered semantic identities in that specific users (or groups of users) associate them with specific semantics (or “meaning”). Hence tweets in the <some URL> <some relationship> <some other URL> format suggested above represent connections between two semantic identities. A set of such tweets amounts to the construction of a mathematical graph, where the URLs are the vertices, and the relationships are the edges.
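Under the stated convention, turning a stream of such tweets into a graph is mechanical. The sketch below assumes well-formed tweets of the form <some URL> <some relationship> <some other URL>; the example tweets themselves are invented.

```python
import re

# Matches "<URL> <relationship> <URL>"; anything after the second URL
# (hashtags etc.) is ignored.
TWEET_PATTERN = re.compile(r"(https?://\S+)\s+(.+?)\s+(https?://\S+)")

def parse_tweets(tweets):
    """Build a graph from #URLrelURL tweets:
    URLs are the vertices, relationships are the labelled edges."""
    vertices, edges = set(), []
    for tweet in tweets:
        match = TWEET_PATTERN.search(tweet)
        if match:
            source, relationship, target = match.groups()
            vertices.update([source, target])
            edges.append((source, relationship, target))
    return vertices, edges

tweets = [
    "http://example.org/a explains http://example.org/b #URLrelURL",
    "http://example.org/b contradicts http://example.org/c #URLrelURL",
]
vertices, edges = parse_tweets(tweets)
# Three vertices, two labelled edges.
```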

If we add functions for transforming graphs into the mix, and consider that we are connecting representations of semantic identities, we end up in the mathematical discipline of model theory. Considering further that Twitter models are user-specific, and that the semantics that users associate with a URL are not necessarily identical – but rather complementary, we can further exploit results from the mathematics of denotational semantics. For the average user there is no need to worry about the formal mathematics; it is sufficient to understand that the <some URL> <some relationship> <some other URL> format (I will use #URLrelURL on Twitter when referencing this format) allows the articulation of insights that correspond to the atoms of knowledge that humans store in their brains.

With appropriate software technology it is extremely easy to translate sets of #URLrelURL tweets into a proper mathematical graph, and into a user specific semantic model. These models can then be analysed, modularised, visualised, compared, and transformed with the help of machine & human intelligence. Amongst other things, retweets can be taken as an indication of some degree of shared understanding in relation to a particular insight. Further qualification of the semantic significance of specific tweets can be calculated from the connections between Twitter users, and from analysis of the information/functionality offered by the two connected URLs.

The most interesting results are unlikely to be the individual mental models that are recorded via #URLrelURL tweets, but will rather be the overlay of all the mental models, leading to a complex graph with weighted edges, which can be analysed from various perspectives. This graph represents a much better organisation of semantic knowledge than the organisation of information delivered by systems like Google search.
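One way to sketch this overlay, assuming each user's model has already been parsed into (URL, relationship, URL) triples: count how many users assert each triple and use the count as the edge weight. All users and triples below are hypothetical.

```python
from collections import Counter

# Per-user #URLrelURL models as lists of (source, relationship, target)
# triples; the data is invented for illustration.
user_models = {
    "alice": [("http://a", "explains", "http://b")],
    "bob":   [("http://a", "explains", "http://b"),
              ("http://b", "contradicts", "http://c")],
}

def overlay(models):
    """Overlay all personal models into one weighted graph: an edge's
    weight is the number of users asserting that triple."""
    weights = Counter()
    for edges in models.values():
        weights.update(edges)
    return weights

combined = overlay(user_models)
# The shared insight ("http://a" explains "http://b") gets weight 2,
# indicating some degree of shared understanding.
```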

Instead of processing semantic models, Google search must process entire web sites with arbitrary syntactic content, with no indication of which pairs of URLs constitute insights useful to humans. Google can only indirectly infer (and make assumptions about) the semantics that humans associate with URLs by applying statistics and proprietary algorithms to syntactic information.

In contrast, the raw aggregated #URLrelURL tweet model of the world captures collective human semantics, and any additional machine generated #URLrelURL insights can be marked as such. The latter insights will not necessarily be of less value, but it will be reassuring to know that they are firmly grounded in the collective semantic perspective of human web users.

Making this semantic perspective accessible to humans and to software via appropriate search, visualisation, and analysis tools will constitute a huge step forward in terms of learning, effective collaboration, quality of decision making, and eliminating the boundary between biological and computer software intelligence.

Therefore, please join me in capturing valuable nuggets of insight in the format of
<some URL> <some relationship> <some other URL> tweets.

Example: #gmodel can be used to #translate twitter models into #semantic #models