Cognitive Analytics is Bigger than Big Data Analytics
“Big Data” has been a corporate buzzword for decades. Promising huge efficiency gains for the enterprise and dazzling customer insights that lead to groundbreaking new products and industry-leading customer service, it sounds like the key to running a successful modern business. Cognitive Analytics is a more recent phenomenon, with very similar promises.
Big Data Analytics is primarily concerned with the symbolic processing of vast amounts of data for unwieldy problems: enterprise search, targeted advertising, recommendation systems, and the learning of straightforward correlations. With the advent of scalable, distributed computing, data scientists are now also keen to intelligently manage information captured in text, speech, and other unstructured forms. The important distinction here is the intelligent management of information. Data management tools often need experts constantly overseeing their deployment, even when the tools support text analytics directly. They have offloaded part of the problem to automated systems, but we still need human experts to manage those systems.
- What if we were able to train these automated systems to think and learn in the same way as their human managers?
- What if we could structure our data analytics such that anyone could benefit from interaction without intensive training?
- What if we could leverage AI for human language?
Cognitive Computing-enabled Analytics, or Cognitive Analytics as we call this field, holds much greater potential than Big Data Analytics because it unlocks the value of Big Data by making the whole system more self-sufficient and the information it contains more accessible. It also holds more challenges.
In this post we discuss what it is that makes Cognitive Analytics bigger – and more challenging – than Big Data Analytics.
Probabilistic Computing vs. Deterministic Computing
The answers to analyses conducted using Big Data are usually deterministic. Given a set of assumptions and rules, a machine will reliably return the same output every time. The trick is to get the right set of assumptions and rules, and to program the machine or cluster in a resource-efficient way.
Builders of cognitive systems are not so lucky. Cognitive Analytics deals with problems that are inherently uncertain; problems whose solutions are unavoidably and necessarily probabilistic.
Consider a classic probabilistic problem: natural language processing (NLP). Stanford NLP and LingPipe are two of the leading software packages in this area: both claim 98% accuracy, but ask them to process real-world data and the probabilistic nature of language means they agree only 70% of the time. We know because we tested this in 2015, asking both Stanford NLP and LingPipe to extract the noun phrases from a corpus consisting of one week’s content posted by a popular Wall Street Twitter account.
What our Wall Street experiment demonstrates is that we can only be about 70% confident that any noun phrase extracted by industry-standard NLP tools is accurate.
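To make that agreement figure concrete, here is a minimal sketch of how such a comparison can be scored. The phrase sets below are illustrative placeholders rather than data from our experiment, and the overlap ratio is just one reasonable way to measure agreement between two extractors:

```python
# Minimal sketch: given the noun phrases each toolkit extracted per document,
# measure how often the two outputs agree. The phrase sets are placeholders,
# not data from the original Wall Street experiment.

def agreement(pairs):
    """Overlap ratio of the noun-phrase sets produced by two extractors."""
    matched = sum(len(a & b) for a, b in pairs)
    total = sum(len(a | b) for a, b in pairs)
    return matched / total if total else 1.0

# One document's output from two hypothetical extractors:
toolkit_a = {"interest rates", "the fed", "quarterly earnings"}
toolkit_b = {"interest rates", "fed", "quarterly earnings", "guidance"}

print(f"agreement: {agreement([(toolkit_a, toolkit_b)]):.0%}")  # 40% on this toy pair
```

Run over a real corpus with two real extractors plugged in, this kind of score is what surfaced the 70% agreement figure, even though each toolkit reports ~98% accuracy in isolation.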
Coseer ameliorates this problem by using Noun Phrase Chunking, a Cognitive Analytics technique that identifies key topics in a corpus and uses them to learn associations within the data. This creates more data points and delivers true 98% accuracy, so analysts can be confident that the output is trustworthy. The answer itself is, of course, still necessarily probabilistic, but it is now underwritten by extensive corpus-based processing.
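Coseer’s chunking pipeline is proprietary, but the basic idea of noun phrase chunking can be sketched with an off-the-shelf toolkit. The grammar below is a textbook pattern applied with NLTK’s regexp chunker, not our production grammar; it simply groups determiners, adjectives, and nouns into candidate noun phrases:

```python
import nltk

# First run may require model downloads, e.g.:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# (exact package names vary slightly by NLTK version)

# A textbook chunk grammar: an optional determiner, any adjectives, then nouns.
GRAMMAR = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(GRAMMAR)

def noun_phrases(text: str) -> list[str]:
    """Extract candidate noun phrases by chunking POS-tagged tokens."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = chunker.parse(tagged)
    return [
        " ".join(word for word, _ in subtree.leaves())
        for subtree in tree.subtrees(lambda t: t.label() == "NP")
    ]

print(noun_phrases("The central bank raised interest rates after strong quarterly earnings."))
# e.g. ['The central bank', 'interest rates', 'strong quarterly earnings']
```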
Cognitive Analytics is Difficult to Map-Reduce
Unlike Big Data Analytics, most Cognitive Analytics tasks do not lend themselves to the map-reduce style of distributed computing. Because the data is unpredictable, unstructured, and string-based, the computational complexity of the reduce step tends to be of the same order of magnitude as that of the original problem.
For the simple problem of finding the phrase with the highest TF-IDF in a corpus, each node in the cluster will output a data structure that is almost as bulky as the input corpus; only a very powerful node, therefore, is able to complete the reduce step.
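A rough sketch of why the reduce step becomes the bottleneck: to pick the single highest-TF-IDF phrase, every mapper has to ship its full phrase-count table to the reducer, because document frequencies can only be finalized once all partitions are merged on one node. The phrase extractor here is a deliberately naive placeholder:

```python
from collections import Counter
import math

# --- map step: each node counts phrase occurrences in its slice of the corpus.
# The output is a per-document phrase table that is nearly as large as the
# input text itself, since most phrases are distinct.
def map_partition(documents):
    return [Counter(extract_phrases(doc)) for doc in documents]

# --- reduce step: document frequencies (and hence IDF) can only be computed
# after every partition's tables are merged, so the reducer works on a
# structure of the same order of magnitude as the whole corpus.
def reduce_partitions(partitions):
    doc_tables = [table for part in partitions for table in part]
    n_docs = len(doc_tables)
    doc_freq = Counter()
    for table in doc_tables:
        doc_freq.update(table.keys())
    best_phrase, best_score = None, float("-inf")
    for table in doc_tables:
        for phrase, tf in table.items():
            score = tf * math.log(n_docs / doc_freq[phrase])
            if score > best_score:
                best_phrase, best_score = phrase, score
    return best_phrase, best_score

# Naive placeholder; a real system would use noun phrase chunking as above.
def extract_phrases(doc):
    return doc.lower().split()

partitions = [["interest rates rose", "rates held steady"], ["earnings beat estimates"]]
print(reduce_partitions([map_partition(p) for p in partitions]))
```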
What does this mean for the AI Industry?
Cognitive Analytics systems should therefore be designed to execute as much as possible on a single node and to use problem-specific distributed architectures where needed. That is partly why IBM Watson is a hardware-based solution, a derivative of the Blue Gene supercomputer. This is also one of the reasons Google, Facebook, Apple, and other companies dealing with large quantities of text design their data centers around super-fast, flash-based Direct Attached Storage rather than the prevalent network storage options (SAN or NAS).
Coseer takes an engineer’s approach to the problem. For the Noun Phrase Extraction step, we compare Coseer to the platinum standards in the image below.
What we give up marginally in accuracy, we make up for with our corpus-based approach. As well as being incredibly memory-efficient, our technique is also far superior in terms of time taken, as demonstrated in the next graph.
The Need for Context-Aware, Interactive Design
Ultimately, analytics are meant to enable better business decisions through effectively leveraged data. To this end, many Big Data systems are deployed to make decisions on their own once trained. Due to the probabilistic nature of the tools available, Cognitive Analytics systems must instead be designed for a higher level of interaction. This is true not only for learning but also for decision-making. High levels of interactivity increase a user’s confidence in the system, improve the accuracy of results, and reduce the complexity of the challenges involved. Such interactivity also provides the context for further improvement. There is enough mystery when talking about AI as it is; large enterprises with vast, valuable data sets need an AI that isn’t a black box.
Similarly, the output from Cognitive Analytics should come with confidence intervals attached. Such systems can also report multiple options and let the user choose which is best. These considerations add to the challenge for software designers, who must build in features that make this interactivity easy.
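As a sketch of what that interaction could look like (the structure below is hypothetical, not Coseer’s API), each candidate answer carries a confidence score, and the system falls back to asking the user whenever no candidate clears a threshold:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    answer: str
    confidence: float  # 0.0 - 1.0, reported alongside every answer

def resolve(candidates: list[Candidate], auto_threshold: float = 0.9) -> Candidate:
    """Accept the top answer automatically only when its confidence is high;
    otherwise list the ranked options and let the user decide."""
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    if ranked and ranked[0].confidence >= auto_threshold:
        return ranked[0]
    for i, c in enumerate(ranked, start=1):
        print(f"{i}. {c.answer}  (confidence {c.confidence:.0%})")
    choice = int(input("Pick the best option: "))
    return ranked[choice - 1]

# No candidate clears the threshold here, so the system shows both options
# with their confidence scores and asks the analyst to choose.
best = resolve([
    Candidate("Q3 revenue grew 12%", 0.74),
    Candidate("Q3 revenue grew 21%", 0.22),
])
```

The exact interface matters less than the principle: the user always sees how sure the system is, and ambiguous calls are surfaced rather than silently resolved.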