Computer Vision

There are many applications for computer vision; one of the most useful for interaction technology is facial identification and emotion recognition. This technology opens up many opportunities, both intended and unforeseen or harmful, so it may be helpful to ask yourself and your team about the possible ethical and societal impact of your work. A few questions to get you started: Does testing show that all races, genders, and ages are recognised with equal accuracy? If not, how might that impact your application or deployment? What are the consequences if the application or deployment misidentifies someone? Are there mitigations we can put in place?

Scikit-image is an open-source image-processing library for Python. The gallery is a good starting point. If you cannot find your starting point in the gallery, then the user guide may be more useful.
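
As a minimal sketch of what a typical scikit-image pipeline looks like (the file names are placeholders):

    # Load an image, convert it to greyscale, and run an edge filter.
    from skimage import io, color, filters, img_as_ubyte

    image = io.imread("photo.jpg")      # placeholder input file
    grey = color.rgb2gray(image)        # RGB uint8 -> greyscale float in [0, 1]
    edges = filters.sobel(grey)         # Sobel edge detection
    io.imsave("edges.png", img_as_ubyte(edges))  # back to uint8 for saving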

To convert between OpenCV and scikit-image, or vice versa, see https://scikit-image.org/docs/dev/user_guide/data_types.html#working-with-opencv
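
The practical differences are the channel order and the data type: OpenCV uses BGR uint8 arrays, while scikit-image functions usually expect RGB and often return floats in [0, 1]. A sketch of the round trip, following that guide:

    import cv2
    from skimage import img_as_float, img_as_ubyte

    bgr = cv2.imread("photo.jpg")                # OpenCV loads BGR, uint8
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)   # reorder channels for skimage
    as_float = img_as_float(rgb)                 # uint8 [0, 255] -> float [0, 1]

    back = img_as_ubyte(as_float)                # float [0, 1] -> uint8
    back_bgr = cv2.cvtColor(back, cv2.COLOR_RGB2BGR)  # back to OpenCV layout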

OpenCV is another good option that is more geared towards faces, which makes it useful in interaction technologies. This tutorial is a good start: https://realpython.com/face-recognition-with-python/, and this extension is relevant if you want to use a webcam: https://realpython.com/face-detection-in-python-using-a-webcam/.
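
Along the lines of that tutorial, a minimal face-detection sketch with OpenCV's bundled Haar cascade (the image file name is a placeholder):

    import cv2

    # Load the frontal-face Haar cascade that ships with opencv-python.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    image = cv2.imread("group_photo.jpg")         # placeholder input file
    grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(grey, scaleFactor=1.1, minNeighbors=5)

    for (x, y, w, h) in faces:                    # draw a box around each face
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite("faces.jpg", image)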

Google Cloud Vision AI has several built-in options for objects and faces. (Requires a credit card even for free use.)
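
A hedged sketch of the face-detection call with the google-cloud-vision package, assuming credentials are already configured; the file name is a placeholder:

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    with open("photo.jpg", "rb") as f:            # placeholder input file
        image = vision.Image(content=f.read())

    # Face detection also returns likelihoods for emotions such as joy.
    response = client.face_detection(image=image)
    for face in response.face_annotations:
        print(face.joy_likelihood, face.anger_likelihood)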

Deep AI does very poorly on age prediction as well as other demographic information. In quick testing with a younger white woman's face, it was also bad at predicting emotion.

MediaPipe is an option that has face detection, face mesh, iris, hand, and pose tracking. This option also has model cards available.
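
A hedged sketch with MediaPipe's Python solutions API (the image file name is a placeholder):

    import cv2
    import mediapipe as mp

    image = cv2.imread("photo.jpg")               # placeholder input file
    with mp.solutions.face_detection.FaceDetection(
            min_detection_confidence=0.5) as detector:
        # MediaPipe expects RGB input, so reorder OpenCV's BGR channels.
        results = detector.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    if results.detections:
        for detection in results.detections:
            print(detection.location_data.relative_bounding_box)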

IBM no longer does face recognition in Watson or any other platform.

AWS Rekognition is the AWS option and has 5,000 images free per month for 12 months. This AWS service requires a credit card for free use.
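
A hedged boto3 sketch, assuming AWS credentials are configured; the file name is a placeholder:

    import boto3

    client = boto3.client("rekognition")
    with open("photo.jpg", "rb") as f:            # placeholder input file
        response = client.detect_faces(
            Image={"Bytes": f.read()},
            Attributes=["ALL"])                   # include age range, emotions, etc.

    for face in response["FaceDetails"]:
        print(face["AgeRange"], face["Emotions"])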

OpenFace 2.0 is an open-source facial behaviour analysis toolkit, capable of facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation, with source code available for both running and training the models.

OpenPose is an open source real-time full-body pose detection system. It is capable of multi-person tracking to jointly detect human body, hand, facial, and foot keypoints (in total 135 keypoints) on single 2D images.

Detectron is Facebook/Meta's open-source object and person detection and segmentation tool; its PyTorch-based successor, Detectron2, is the actively maintained version.
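
A hedged sketch with Detectron2 (not the original Detectron), assuming detectron2 and PyTorch are installed; the config path is one of the standard model-zoo entries:

    import cv2
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(
        "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
        "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5   # confidence threshold

    predictor = DefaultPredictor(cfg)
    outputs = predictor(cv2.imread("photo.jpg"))  # placeholder input file
    print(outputs["instances"].pred_classes)      # detected class indices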

ASR (Automatic Speech Recognition) and STT (Speech To Text)

Google Cloud Speech-to-Text is an API that returns a transcript of an audio file. https://cloud.google.com/speech-to-text/docs/quickstart-gcloud is a quick-start guide for the API. The API fully supports several English accents as well as Dutch. There are several how-to guides at https://cloud.google.com/speech-to-text/docs/how-to, including guides for transcription, detecting language, and separating different speakers.
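
A hedged sketch of a synchronous request with the google-cloud-speech package, assuming configured credentials and 16 kHz mono audio; the file name is a placeholder:

    from google.cloud import speech

    client = speech.SpeechClient()
    with open("utterance.wav", "rb") as f:        # placeholder input file
        audio = speech.RecognitionAudio(content=f.read())

    config = speech.RecognitionConfig(
        language_code="nl-NL",                    # Dutch; swap for other languages
        sample_rate_hertz=16000)
    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        print(result.alternatives[0].transcript)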

AWS has Amazon Transcribe (the free tier is 60 minutes per month, so this may not work well for a student project where prototyping and testing happen on a short time frame but more intensively). Take a look at the get-started guide, with additional links to more information. This AWS service requires a credit card for free use.
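
A hedged boto3 sketch; Transcribe jobs are asynchronous and read audio from S3, so the bucket and job name below are placeholders:

    import boto3

    client = boto3.client("transcribe")
    client.start_transcription_job(
        TranscriptionJobName="demo-job",          # placeholder, must be unique
        Media={"MediaFileUri": "s3://my-bucket/utterance.wav"},  # placeholder
        MediaFormat="wav",
        LanguageCode="nl-NL")

    # Poll for completion, then download the transcript from the
    # TranscriptFileUri in the finished job description.
    job = client.get_transcription_job(TranscriptionJobName="demo-job")
    print(job["TranscriptionJob"]["TranscriptionJobStatus"])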

In IBM Cloud there is speech to text via the Speech to Text service (add it via the Catalog); see the getting started guide. Dutch is supported, and the Lite plan has 500 minutes per month.
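
A hedged sketch with the ibm-watson SDK; the API key, service URL, and file name are placeholders taken from your own service credentials:

    from ibm_watson import SpeechToTextV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    stt = SpeechToTextV1(authenticator=IAMAuthenticator("YOUR_API_KEY"))
    stt.set_service_url("YOUR_SERVICE_URL")       # from the service credentials

    with open("utterance.wav", "rb") as f:        # placeholder input file
        result = stt.recognize(
            audio=f,
            content_type="audio/wav",
            model="nl-NL_BroadbandModel").get_result()  # Dutch broadband model
    print(result["results"][0]["alternatives"][0]["transcript"])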

Microsoft has Speech to Text under Cognitive Services; this service's documentation can be found here. Dutch is supported, and the free instance has 5 audio hours per month.
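
A hedged sketch with the azure-cognitiveservices-speech SDK; the key, region, and file name are placeholders:

    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(
        subscription="YOUR_KEY", region="westeurope")  # placeholders
    speech_config.speech_recognition_language = "nl-NL"
    audio_config = speechsdk.audio.AudioConfig(filename="utterance.wav")

    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config)
    result = recognizer.recognize_once()          # transcribe a single utterance
    print(result.text)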

DeepSpeech is an open-source option developed by Mozilla, but it seems to be no longer maintained. See the GitHub repository and tutorial for more information. The documentation can be found at https://deepspeech.readthedocs.io/en/r0.9/. A more recent fork, Coqui AI's STT, is actively developed and maintained and might work better on newer OS and hardware versions.
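
A hedged sketch of the DeepSpeech 0.9 Python API; it needs a downloaded model file and 16 kHz, 16-bit mono audio (the file names are placeholders):

    import wave
    import numpy as np
    import deepspeech

    model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")  # placeholder path
    model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

    with wave.open("utterance.wav", "rb") as w:   # placeholder input file
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    print(model.stt(audio))                       # returns the transcript string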

If you wish to easily compare and use various online/offline ASR tools in one package, you could consider using a wrapper like Uberi's SpeechRecognition, which abstracts away much of the underlying APIs by offering a common speech-recognition interface.
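
A minimal sketch of that common interface; recognize_google uses a free web API, and the same Recognizer object exposes other backends as recognize_* methods:

    import speech_recognition as sr

    r = sr.Recognizer()
    with sr.AudioFile("utterance.wav") as source: # placeholder input file
        audio = r.record(source)                  # read the whole file

    print(r.recognize_google(audio, language="nl-NL"))  # Dutch transcription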

Once you have your transcript, you can apply sentiment analysis or other NLP components.
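
For example, a quick sentiment score with NLTK's VADER model (one illustrative option among many; the lexicon must be downloaded once):

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")                # one-time lexicon download
    sia = SentimentIntensityAnalyzer()
    print(sia.polarity_scores("I really enjoyed talking to this robot!"))
    # -> neg/neu/pos proportions plus a compound score in [-1, 1]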

Text to speech synthesis (TTS)

All four major platforms have some form of TTS, so it is probably easiest to use the same platform as the rest of your project.

IBM can be found at https://cloud.ibm.com/apidocs/text-to-speech?code=python

Google Cloud text to speech can be found at https://cloud.google.com/text-to-speech/docs/samples/tts-synthesize-text (see the sketch after this list).

Amazon Polly can be found at https://aws.amazon.com/polly/

Microsoft text to speech is at https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/#overview.
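
As one example, a hedged sketch of the Google Cloud option above, assuming the google-cloud-texttospeech package and configured credentials:

    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text="Hallo, hoe gaat het?"),
        voice=texttospeech.VoiceSelectionParams(language_code="nl-NL"),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3))

    with open("output.mp3", "wb") as f:           # placeholder output file
        f.write(response.audio_content)

The other platforms follow a similar request/response pattern through their respective SDKs.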

TTS is an open-source option that comes with pre-trained models of high, life-like quality. See the GitHub repository and a sample: https://soundcloud.com/user-565970875/pocket-article-wavernn-and-tacotron2.
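
A hedged sketch of the TTS (Coqui) Python API; model names change between releases, so the name below is a placeholder and the project README lists the currently available models:

    from TTS.api import TTS

    # Placeholder model name; see the README for the current model list.
    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
    tts.tts_to_file(text="Hello from an open-source voice.",
                    file_path="sample.wav")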