
Adding vision and speech inside your business

Over the last five years, the scale of cloud compute power and the exponential growth of training data available to machine learning algorithms have created a viable business model for the technology platforms we all know and love. This growth has allowed those platforms to invest heavily in the development of these services, helping to solve global problems and supporting the developer community in building more “intelligent” applications, at a cost that makes them accessible to most.

What services can be used?

Cognitive services is very much a brand term used to cover the things the human brain can do, so it spans an array of areas. The two areas we are focusing on are vision and speech, as they enable the broadest set of use cases and we use them daily.

If you want a comparison of the services from Microsoft, Google, Amazon and others to evaluate which platform is best for you, check out the links at the bottom. This post provides a background to the use cases of the technologies.

Interesting facts

Over 2 billion photos are uploaded every day, each being processed, categorised and fed into training models to better identify images on the public and private cloud

Microsoft recently announced it has hit a 5.1% error rate converting speech to text, the same level as human transcribers


Vision services provide the ability to extract insight from both images and video, interpreting the context of the visual input in a similar way to our brains. Vision works through the training of data models: you might want to identify a car’s make and model, the breed of a pet, or which vacuum cleaner sits in a store window. This is where you need to provide a volume of training data (photos of a BMW, a Tesla, etc.).

The main “issue” with these training models is the need for a large set of available data to teach the model what is a correct and incorrect match. This enables it to learn the difference between a BMW, a Tesla and, crucially, not a cat. Unfortunately, there is no clear answer to how many images you need per entity, beyond “the more the better” (useful, I know, sorry).

Most vision platforms do, however, provide a higher level of abstraction so you can get started quickly: these pre-trained models enable the identification of faces, objects, emotion and environment (a plate on a table).
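As a sketch of what these pre-trained models return (the JSON shape below is illustrative, not any particular vendor’s schema), a tagging call typically yields labels with confidence scores that you filter before use:

```python
# Illustrative shape of a pre-trained vision API response (not a real vendor schema).
response = {
    "tags": [
        {"name": "person", "confidence": 0.97},
        {"name": "table", "confidence": 0.91},
        {"name": "plate", "confidence": 0.88},
        {"name": "cat", "confidence": 0.12},
    ]
}

def confident_tags(result, threshold=0.8):
    """Keep only the tags the model is reasonably sure about."""
    return [t["name"] for t in result["tags"] if t["confidence"] >= threshold]

print(confident_tags(response))  # ['person', 'table', 'plate']
```

The threshold is a business decision: a photo-organising app might accept 0.5, while an access-control system would demand far more.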


Facial identification within a photo or video can provide a great deal of value to an organisation across an array of services. Security is the most obvious: using door cameras or CCTV, people can be granted or denied access to floors or areas, and in the event of an alert, security can be notified if someone is spotted in a restricted zone or a face can’t be matched.

Imagine starting your presentation: everyone’s attendance is recorded by the meeting room camera simply by their presence in the room, and as the presentation is underway you can see an engagement rating for the room, allowing you to optimise your delivery.

Face detection also provides facial characteristics such as glasses, hair length, age, gender and facial hair, as well as pixel-level positioning of every element of your face. Together these give each face a visual fingerprint, allowing you to find all photos that contain a specific person, or everyone who wears glasses.
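One way to picture the “visual fingerprint” idea is as an embedding vector per face: finding a person then becomes a similarity search. A minimal sketch, using made-up three-dimensional fingerprints (real services use much longer vectors and provide their own matching APIs):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two fingerprint vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def find_person(target, photo_faces, threshold=0.9):
    """Return the photo ids whose face fingerprint matches the target closely enough."""
    return [pid for pid, emb in photo_faces if cosine_similarity(target, emb) >= threshold]

target = [0.1, 0.9, 0.3]  # hypothetical fingerprint of the person we are searching for
photos = [("holiday.jpg", [0.11, 0.88, 0.31]), ("office.jpg", [0.9, 0.1, 0.2])]
print(find_person(target, photos))  # ['holiday.jpg']
```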

As well as these more logical insights, you can extract emotional insights from faces, covering key emotions such as happiness, fear, anger and surprise. These emotions are provided with a confidence score and are surprisingly accurate; KFC in China recently launched payment by smile. In a single image or video frame, these platforms can identify over fifty faces, providing a mass of insight and use cases.
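Because every emotion comes back with a confidence score, consuming the result is usually a matter of picking the dominant emotion and deciding how confident is confident enough. A small sketch (the score shape is an assumption, not a vendor schema):

```python
# Illustrative emotion scores for one detected face.
emotions = {"happiness": 0.92, "surprise": 0.05, "anger": 0.02, "fear": 0.01}

def top_emotion(scores, min_confidence=0.5):
    """Return the dominant emotion, or None if nothing is confident enough."""
    name, score = max(scores.items(), key=lambda kv: kv[1])
    return name if score >= min_confidence else None

print(top_emotion(emotions))  # happiness
```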


Understanding the context around an image is a very useful capability. Generic insights can be: this photo is outside or inside, there is sky, there is grass, there is a person, there is a group of people, and so on. These generic insights help with basic categorisation, such as “this photo can’t contain a person” or “this image of a house tends to be taken outside”.

Where context gets very interesting is with deeper learning: there is a cup, the window is open, a phone is on the table. If you are working with user-generated content, these vision services also provide an adult score, useful for filtering out inappropriate content. This can be taken further by building your own models with your own training data, enabling you to identify, say, a Google Pixel, or an iPad with a cracked screen.
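In practice the adult score feeds a simple moderation gate. A sketch, where the field names and thresholds are assumptions you would tune to your own risk appetite:

```python
# Hypothetical moderation result for each piece of user-generated content.
def is_safe(result, adult_threshold=0.5, racy_threshold=0.7):
    """Reject content whose adult/racy scores exceed the chosen thresholds."""
    return result["adult_score"] < adult_threshold and result["racy_score"] < racy_threshold

uploads = [
    {"id": "img-1", "adult_score": 0.02, "racy_score": 0.10},
    {"id": "img-2", "adult_score": 0.91, "racy_score": 0.95},
]
approved = [u["id"] for u in uploads if is_safe(u)]
print(approved)  # ['img-1']
```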

Using this context mapping in video presents an interesting opportunity, particularly with the ability to index the appearance, movement or disappearance of people or items. With a catalogue of indexed sales meetings you could search for “when does the customer look down”; if you were monitoring a car park, “show me when the car park is over 50% full”; or if running QA on factory equipment, “show me when you see smoke”.
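Conceptually, a video index is just per-timestamp labels you can query. A toy sketch (the index structure here is invented for illustration; real video indexers expose richer search APIs):

```python
# Hypothetical video index: per-second labels produced by a vision service.
frames = [
    {"second": 10, "labels": {"machine": 0.90}},
    {"second": 11, "labels": {"machine": 0.90, "smoke": 0.85}},
    {"second": 12, "labels": {"machine": 0.90, "smoke": 0.92}},
]

def find_label(index, label, threshold=0.8):
    """Return the timestamps (in seconds) where a label appears with confidence."""
    return [f["second"] for f in index if f["labels"].get(label, 0) >= threshold]

print(find_label(frames, "smoke"))  # [11, 12]
```

“Show me when you see smoke” then reduces to `find_label(frames, "smoke")`, with the threshold controlling how many false alarms you tolerate.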


Voice computing has come a long way in recent years, with error rates now matching human transcribers, and with the rise of consumer devices such as the Amazon Echo, Google Home, Siri and Cortana, people are becoming more open to voice as an interface for tasks around the home and workplace. The major advantage of voice computing over other interfaces is its ability to let us multi-task more easily.

Speech to Text

There are three primary ways to get a textual result from the audio of a phone call, computer microphone or connected speaker, each with its pros and cons.

  • Direct transcribing — a generalised, best-effort model will work out what you’re saying with a low error rate, but will tend to fail on domain-specific terms due to the lack of context.
  • Utterances — the ability to define the sentences people tend to use when asking for your given service; matches are assigned a confidence score. This approach yields a lower error rate, though it requires all your utterances to be defined, or the use of a tool to generate them (shameless plug — we have a tool for that).
  • Knowledge-based — feeding your model the context of your domain through documents written in your business language. Your voice interface then sits on top of this knowledge graph (which can be harnessed for other things): speech gets the generalised transcription, applied with the context of your domain, producing the highest-quality result. The downside is the need to feed it domain data, which may not be accessible or available in high enough volume.
Building upon your selected method, you can take the textual representation of the request and perform an unlimited number of actions: for example, control (book me a meeting with John next Thursday in meeting room 4) or conversation, where one person talks while the recipient replies over SMS.
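A toy illustration of the utterances approach, where a transcription is scored against pre-defined templates and the best match wins. Everything here (the intents, templates and word-overlap scoring) is hypothetical; real platforms do far more sophisticated matching:

```python
# Hypothetical utterance templates; {…} marks slots whose words are ignored when scoring.
UTTERANCES = {
    "book_meeting": "book me a meeting with {person} on {day}",
    "room_availability": "is meeting room {room} free",
}

def match_utterance(text, utterances=UTTERANCES):
    """Return (intent, confidence) for the best-matching utterance template."""
    words = set(text.lower().split())
    best, best_score = None, 0.0
    for intent, template in utterances.items():
        fixed = {w for w in template.lower().split() if not w.startswith("{")}
        score = len(words & fixed) / len(fixed)
        if score > best_score:
            best, best_score = intent, score
    return best, best_score

intent, confidence = match_utterance("book me a meeting with John on Thursday")
print(intent, confidence)  # book_meeting 1.0
```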


Translation has been a manual task for many years, with the dream of real-time Star Trek-style language translation seemingly decades away. The good news is that we are almost there: two people can now hold a voice conversation in two languages with only a 1–2 second delay, each hearing the other in their native tongue. There are early previews of audio-to-audio group chat, where ten people can each speak a different language and receive all responses in their own, in near real-time.
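Under the hood this is a pipeline of three services: speech-to-text, text translation, then text-to-speech. A stubbed sketch of that composition, where the phrasebook and both audio functions are stand-ins for real service calls:

```python
# Stand-in translation table; a real service would call a translation API here.
PHRASEBOOK = {("en", "fr"): {"hello": "bonjour", "goodbye": "au revoir"}}

def speech_to_text(audio):
    return audio["transcript"]          # stand-in for a real STT call

def translate(text, source, target):
    return PHRASEBOOK[(source, target)].get(text, text)

def text_to_speech(text):
    return {"audio_for": text}          # stand-in for a real TTS call

def pipeline(audio, source="en", target="fr"):
    """Speech in one language out as speech in another: STT -> translate -> TTS."""
    return text_to_speech(translate(speech_to_text(audio), source, target))

print(pipeline({"transcript": "hello"}))  # {'audio_for': 'bonjour'}
```

The 1–2 second delay mentioned above is essentially the sum of the latencies of these three stages.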

Cognitive services are in their infancy, and with machine learning and deep learning improving every day on a global scale, the possibilities are endless. The question for today is: what are the use cases you can embrace within your organisation, and are they generic enough to use a generalised model, or do you need to start recording data now to gain the benefits of a data science strategy in 3–5 years’ time?