In the 4th semester of my Software Engineering course I chose Applied Data Sciences as a specialization route. In this route, I got the chance to do a personal project. For this project, I want to develop an app for Apple’s iOS platform which incorporates Machine Learning (ML). I am interested in using ML in a computer vision context, specifically.
Apple provides a broad collection of frameworks and libraries for image processing and recognition, and it already uses these tools in many applications on its own platform. One example is the Photos app. With face recognition, an iPhone can recognize different people in a photo. You can then label these photos with the name of the person, which makes it easy to search for photos of a certain person on your phone. See image 1.
The app I want to develop closely resembles the Not Hotdog app from the HBO series Silicon Valley. To achieve this, a good understanding of the different tools provided is necessary. In this document I want to look into the following subjects:
- Which frameworks does Apple provide for computer vision and machine learning;
- For which purposes can these frameworks be used;
- How do the implementations of these frameworks work (pre-trained models? etc…);
- How do the frameworks of Apple compare with existing frameworks and tools;
- Is there an ethical difference;
- What are potential shortcomings and how to deal with these shortcomings?
This research is the first of three steps in making an application. In the second step, I want to go hands-on with coding and follow tutorials on the internet in order to learn how to use the frameworks. The last step is the actual development of the application.
Which frameworks does Apple provide for computer vision and machine learning?
On developer.apple.com, Apple provides its documentation for developers. The getting started with CoreML part of the Machine Learning page offers some introduction videos that were recorded during the Worldwide Developers Conference (WWDC) in June 2017.
CoreML forms the foundation for domain-specific frameworks and functionality. A trained model can be integrated into the app with CoreML to perform predictions on new input data from the user. On top of CoreML, Vision is built for image analysis, Foundation for natural language processing (NLP) and GameplayKit for evaluating learned decision trees. CoreML itself is built on low-level APIs for neural networks and GPU usage. CoreML runs on-device, which improves performance and helps protect the privacy of the user; both points are discussed later on.
The framework supports a wide variety of model types. Table 1 shows the model type and the supported models. It also shows the supported tools for creating and training the model.
|Model type|Supported models|Supported tools|
|---|---|---|
|Neural networks|feedforward, convolutional, recurrent|Caffe v1, Keras 1.2.2+|
|Tree ensembles|random forests, boosted trees, decision trees|scikit-learn 0.18, XGBoost 0.6|
|Support vector machines|scalar regression, multiclass classification|scikit-learn 0.18, LIBSVM 3.22|
|Generalized linear models|linear regression, logistic regression|scikit-learn 0.18|
|Feature engineering|sparse vectorization, dense vectorization, categorical processing|scikit-learn 0.18|
|Pipeline models|sequentially chained models|scikit-learn 0.18|
Table 1: Supported model types and the tools that can be used to create and train the models.
As mentioned earlier, Vision is Apple’s framework for image recognition and analysis. It can be used for a variety of purposes: face detection, face landmarks, image registration, rectangle detection, barcode detection and object tracking (for faces, rectangles and general templates). It is also possible to integrate CoreML into Vision. One example that Apple gave during the WWDC presentation Vision Framework: Building on Core ML was recognizing and labeling the different parts of a wedding (the reception, walking down the aisle, etc.) and creating new albums based on these events. A large CoreML model is fed to Vision, which makes this kind of application possible.
For which purposes can these frameworks be used?
In this section I want to show some examples from Apple, but also from developers around the world, by searching on GitHub, YouTube and the App Store.
As shown in image 1, Apple uses ML in combination with Vision to recognize faces in photos. It is not possible to know exactly how Apple uses its own frameworks within its apps, but judging from the WWDC videos, the face detection API from Vision is used for this purpose. The Photos app can also search for a specific object in your own photos. If you, for instance, type dog (hond in Dutch) into the search field, it finds all images of dogs. See image 2.
In the camera app, Vision’s face detection is used to automatically focus on the faces present in the frame. The keyboard predicts the next word you would like to type and shows you a few options above the keys. Unfortunately this does not work (yet) in Dutch. Siri, the digital assistant, probably uses NLP to understand what the user is saying. Image 3, which comes from Apple’s website, shows a few apps in which Apple uses CoreML and Vision.
Right after the WWDC keynote, when the developer betas of iOS 11 and the next version of Xcode became available, developers quickly began playing with the new frameworks. This video on Instagram shows a developer who built an app that displays the name of a scanned object in Augmented Reality (with ARKit) above or in front of that object. The keynote was held on the 5th of June and the video was posted on the 31st of the following month, which shows how fast someone can build an app that integrates CoreML and ARKit. Another example that combines these two techniques is shown in this YouTube video.
On GitHub, a user called likedan shows an example that recognizes the manufacturer of a car. This person used a model trained on the Comprehensive Cars (CompCars) dataset, which was converted to the CoreML model format for use in his app. Another user made an example that uses a food dataset to recognize what food the user of the app has on their plate.
When looking in the App Store, there are several apps that use Machine Learning. One example is BotanicApp: you take a photo of a plant and it tells you what species it is. In image 4 I took a photo of a plant in one of my study rooms and the app labeled it as a Dracaena fragrans. Another, maybe not so obvious, app is Snapchat, which uses face detection to apply filters to someone’s face. Though Snapchat probably uses its own algorithms for facial recognition, it shows the potential of ML. Nude also became quite popular when some blogs wrote about it. The app filters NSFW photos from your library and ‘protects’ them in a secure album.
How do the implementations of these frameworks work?
As mentioned earlier, CoreML supports multiple machine learning models. To use a model in an app, it has to be converted into the CoreML model format, which has the file extension .mlmodel. Apple has made a Python package available, called coremltools, to create, examine and test models in the .mlmodel format. It can be used to:
- Convert existing models to .mlmodel from the tools mentioned in the table above;
- Express models in .mlmodel format through a simple API;
- Make predictions with an .mlmodel.
More can be read on the coremltools page on the Python website. The CoreML framework thus uses models that are (pre-)trained in one of the tools from the table above. Siraj Raval, who makes YouTube videos about AI, has also made a video about CoreML in which he addresses how coremltools can be used; see the video at around 12:20. Image 5, which is taken from his GitHub page, shows the three-step conversion workflow.
It couldn’t be simpler to use the model in a project: the .mlmodel file only has to be dropped into the directory structure of the project in Xcode and it is ready to be worked with.
In the Vision framework, Apple supports a variety of image formats:
- CVPixelBufferRef: everything that has to do with streaming video. The AR sample would be a good example of this;
- CIImage (Core Image): when already using CI in your application or when you want to pre-process the image;
- NSURL: images from disk;
- NSData: images from the web.
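As a rough sketch of how these four sources map onto Vision, the snippet below builds a VNImageRequestHandler (the single-image handler discussed in the next paragraph) from each of them. The ImageSource enum and the makeHandler function are hypothetical names introduced only for illustration; the four initializers are the ones Vision provides. In Swift, CVPixelBufferRef, NSURL and NSData appear as CVPixelBuffer, URL and Data.

```swift
import CoreImage
import CoreVideo
import Foundation
import Vision

// Hypothetical wrapper type: one case per image source listed above.
enum ImageSource {
    case videoFrame(CVPixelBuffer) // streaming video, e.g. camera or ARKit frames
    case coreImage(CIImage)        // already using Core Image or pre-processing
    case file(URL)                 // image stored on disk
    case downloaded(Data)          // raw image data fetched from the web
}

// Hypothetical helper: picks the matching Vision initializer for each source.
func makeHandler(for source: ImageSource) -> VNImageRequestHandler {
    switch source {
    case .videoFrame(let buffer):
        return VNImageRequestHandler(cvPixelBuffer: buffer, options: [:])
    case .coreImage(let image):
        return VNImageRequestHandler(ciImage: image, options: [:])
    case .file(let url):
        return VNImageRequestHandler(url: url, options: [:])
    case .downloaded(let data):
        return VNImageRequestHandler(data: data, options: [:])
    }
}
```

Wrapping the sources in one enum is only a design convenience here; an app that uses a single source would simply call the matching initializer directly.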
To run a CoreML model on an image, the model is first wrapped in the container class VNCoreMLModel. A VNCoreMLRequest is then created with this wrapped model; its completion handler receives the results once the request has been performed. Which image format to use depends on where the image comes from. When you want to process a single image and do something interactive with it, VNImageRequestHandler should be used as the handler. This is an object that performs image analysis requests on a single image. When something has to be tracked, VNSequenceRequestHandler should be used, since this object performs image analysis on a sequence of multiple images. Vision has many more request types for all kinds of image analysis, which can all be viewed on the Vision documentation page.
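The single-image flow described above can be sketched roughly as follows. This is a minimal example under assumptions, not Apple’s reference implementation: the function name classify is made up for illustration, and the model passed in is assumed to be an image classifier, so that Vision returns VNClassificationObservation results.

```swift
import CoreGraphics
import CoreML
import Vision

/// Classifies one image with a Core ML model through Vision.
/// `mlModel` is assumed to be an image classification model.
func classify(_ image: CGImage, using mlModel: MLModel) throws {
    // 1. Wrap the Core ML model in Vision's container class.
    let visionModel = try VNCoreMLModel(for: mlModel)

    // 2. Create the request; the completion handler receives the results.
    let request = VNCoreMLRequest(model: visionModel) { request, error in
        guard let observations = request.results as? [VNClassificationObservation],
              let best = observations.max(by: { $0.confidence < $1.confidence }) else {
            print("No classification: \(error?.localizedDescription ?? "no results")")
            return
        }
        // Label and confidence of the most likely class.
        print("\(best.identifier) (confidence \(best.confidence))")
    }

    // 3. Perform the request on a single image.
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])
}
```

When an .mlmodel file has been added to the Xcode project, Xcode generates a Swift class for it. With a hypothetical model file named HotdogClassifier.mlmodel, the call would look like `try classify(cgImage, using: HotdogClassifier().model)`.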
How does Apple’s CoreML framework compare with existing tools?
One way of using ML in your app is to use cloud-based solutions like Google Cloud Vision, Microsoft Azure Cognitive Services and IBM Watson. One big advantage of cloud-based ML tools is that it is “just” a matter of calling an API. All the hard work is done in the cloud and you get the result back from the API, which can then be integrated into the app. This ease of use unfortunately comes at a cost, literally: these services are usually not free at production scale. It is also slower than ML on the device itself, because every API call has to go over the internet. Another issue, which I’ll discuss below, is privacy. Information about the user is sent over the web to a cloud service, which is a black box to the outside world when it comes to what happens with the user’s data.
TensorFlow, Google’s ML framework, also has a version that runs on iOS. On the positive side, TensorFlow is fairly easy to integrate and use in an app. If you ask someone in the IT field whether they know TensorFlow, the answer is probably yes. TF is quite popular and thus has a large user base, which makes it easy to find solutions and models for it on the web. On the negative side, TensorFlow is not as quick as CoreML. CoreML optimizes usage of the CPU and GPU for the desired task (via Apple’s Metal API), whereas TF runs only on the CPU. For heavier tasks, this makes the library slow. The tool is written in C++ and the app has to be written in Objective-C and C++. Since I only know (the basics of) Swift, this is an extra learning step when implementing TensorFlow in an app.
The pros and cons that hold for TensorFlow also hold for Caffe2. Caffe2 was announced by Facebook at 2017’s F8 conference and is aimed at use on mobile platforms. At this moment, speed is the biggest plus of CoreML. An article with more information can be found on the blog of Hollemans. At the time of writing, Google has also announced TensorFlow Lite, a TF variant that is specifically targeted at mobile, but there are not many comparisons online yet. When more information is available, I will look into it.
A bit about privacy
When working with user data, there is always a concern about the privacy of the user of the app. Where is the border between a gross violation of someone’s privacy and harmless Machine Learning? Not many humans were harmed by the recent discovery of a new exoplanet, for instance: NASA discovered the planet by applying AI to overlooked data from the Kepler telescope. On the other end of the spectrum, there are companies like Facebook, Google, Walmart and Ahold that collect data about people on a large scale, with the primary goal of making money from that data. As long as no user data is involved, or anything that can be traced back to a user, there should be no privacy concern. When a personal mobile phone is used, things like the device ID and GPS coordinates might also be sent to the cloud. This should be taken into consideration as well when using cloud-connected solutions. With on-device ML, a cloud service inferring knowledge from the data is much less of a concern. Images or text do not have to be sent to the cloud and the device itself does all the magic. It should however be kept in mind that the developer might still upload the photo that you took to their own servers in order to expand their dataset.
Though CoreML makes it easy to tinker with Machine Learning, it has a few shortcomings. First, as described earlier, CoreML does not actually do any learning itself; it only runs trained models in an app. Frameworks like scikit-learn, Keras or Caffe do the actual magic. It is also not possible to train the model further at runtime, which makes it impossible to improve the model with the user’s input. It is however possible to replace the model with one downloaded from a server, when the maker of the app has a new model available. Apple provides the method compileModel(at:) on MLModel for this. Perhaps the biggest downside, which is somewhat connected with updating the model, is that CoreML does not support federated learning. Since this is not offered inside CoreML, the maker of the app has to build these provisions themselves. If Apple had built the infrastructure for federated learning, it would potentially be more privacy friendly, since the user could then be asked to consent to uploading training content to the cloud. More can be read in this blog post by Alex Sosnovshchenko.
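Replacing the model with a newer one from a server could look roughly like the sketch below. It assumes the app has already downloaded a new .mlmodel file to a local URL; the function and parameter names are hypothetical, and the download step itself is left out. The compile call used is MLModel.compileModel(at:), which Core ML offers from iOS 11 onwards.

```swift
import CoreML
import Foundation

/// Loads a model from an .mlmodel file that the app has already downloaded.
/// `downloadedModelURL` is a hypothetical local file URL.
func loadUpdatedModel(from downloadedModelURL: URL) throws -> MLModel {
    // Compile the raw .mlmodel into the compiled form that Core ML can load.
    let compiledURL = try MLModel.compileModel(at: downloadedModelURL)

    // The compiled model is written to a temporary location; a real app would
    // probably move it to a permanent directory before reusing it across launches.
    return try MLModel(contentsOf: compiledURL)
}
```

Note that this only swaps in a model that was trained elsewhere; it does not learn anything from the user’s input on the device.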
HBO (2017, 14 May). Silicon Valley: Season 4 Episode 4: Not Hotdog (HBO). [YouTube] Available at: https://www.youtube.com/watch?v=ACmydtFDTGs
Developer.apple.com (2017). Converting Trained Models to Core ML | Apple Developer Documentation. [online] Available at: https://developer.apple.com/documentation/coreml/converting_trained_models_to_core_ml#2880117 [Accessed 1 Dec. 2017].