Get Started With Image Recognition in Core ML
With technological advances, we're at the point where our devices can use their built-in cameras to accurately identify and label images using a pre-trained model. You can also train your own models, but in this tutorial, we'll be using an open-source model to create an image classification app.
I'll show you how to create an app that can identify images. We'll start with an empty Xcode project, and implement machine-learning-powered image recognition one step at a time.
Getting Started
Xcode Version
Before we begin, make sure you have the latest version of Xcode installed on your Mac. This is very important because Core ML is only available in Xcode 9 or newer. You can check your version by opening Xcode and choosing Xcode > About Xcode from the menu bar.
If your version of Xcode is older than Xcode 9, you can go to the Mac App Store and update it, or if you don't have it, download it for free.
Sample Project
New Project
After you have made sure you have the right version of Xcode, you'll need to make a new Xcode project.
Go ahead and open Xcode and click Create a new Xcode project.
Next, you'll need to choose a template for your new Xcode project. It's pretty common to use a Single View App, so go ahead and select that and click Next.
You can name your project anything you like, but I will be naming mine CoreML Image Classification. For this project, we'll be using Swift, so make sure that it's selected in the Language dropdown.
Preparing to Debug
Connecting an iPhone
Since the Xcode Simulator doesn't have a camera, you'll need to plug in your iPhone. Unfortunately, if you don't have an iPhone, you'll need to borrow one to be able to follow along with this tutorial (and for any other camera-related apps). If you already have an iPhone connected to Xcode, you can skip ahead to the next step.
A nifty new feature in Xcode 9 is that you can wirelessly debug your app on a device, so let's take the time to set that up now:
In the top menu bar, choose Window > Devices and Simulators. In the window that appears, make sure that Devices is selected at the top.
Now, plug in your device using a lightning cable. This should make your device appear in the left pane of the Devices and Simulators window. Simply click your device, and check the Connect via Network box.
You will now be able to wirelessly debug on this iPhone for all future apps. To add other devices, you can follow a similar process.
Device Selection
When you want to finally use your iPhone to debug, simply select it from the dropdown beside the Run button. You should see a network icon next to it, showing that it's connected for wireless debugging. I've selected Vardhan's iPhone, but you need to select your specific device.
Diving Deeper
Now that you've created your project and set up your iPhone for debugging, we'll dive a bit deeper and begin programming the real-time image classification app.
Preparing Your Project
Getting a Model
To be able to start making your Core ML image classification app, you'll first need to get the Core ML model from Apple's website. As I mentioned before, you can also train your own models, but that requires a separate process. If you scroll to the bottom of Apple's machine learning website, you'll be able to choose and download a model.
In this tutorial, I will be using the MobileNet.mlmodel model, but you can use any model as long as you know its name and can ensure that it ends in .mlmodel.
Importing Libraries
There are a couple of frameworks you'll need to import along with the usual UIKit. At the top of the file, make sure the following import statements are present:
import UIKit
import AVKit
import Vision
We'll need AVKit because we'll be creating an AVCaptureSession to display a live feed while classifying images in real time. And since this app uses computer vision, we'll need to import the Vision framework.
Designing Your User Interface
An important part of this app is displaying the image classification data labels as well as the live video feed from the device's camera. To begin designing your user interface, head to your Main.storyboard file.
Adding an Image View
Head to the Object Library and search for an Image View. Simply drag this onto your View Controller to add it in. If you'd like, you can also add a placeholder image so that you can get a general idea of what the app will look like when it's being used.
If you do choose to have a placeholder image, make sure that the Content Mode is set to Aspect Fit, and that you check the box that says Clip to Bounds. This way, the image will not appear stretched, and it won't spill outside of the UIImageView box.
Here's what your storyboard should now look like:
Adding a View
Back in the Object Library, search for a View and drag it onto your View Controller. This will serve as a nice background for our labels so that they don't get hidden in the image being displayed. We'll be making this view translucent so that some of the preview layer is still visible (this is just a nice touch for the user interface of the app).
Drag this to the bottom of the screen so that it touches the container on three sides. It doesn't matter what height you choose, because we'll be setting constraints for it in just a moment.
Adding Labels
This, perhaps, is the most important part of our user interface. We need to display what our app thinks the object is and how sure it is (the confidence level). As you've probably guessed, you'll need to drag two Labels from the Object Library onto the view we just created. Drag these labels somewhere near the center, stacked on top of each other.
For the top label, head to the Attributes Inspector and click the T button next to the font style and size and, in the popup, select System as the font. To differentiate this from the confidence label, select Black as the style. Lastly, change the size to 24.
For the bottom label, follow the same steps, but instead of selecting Black as the style, select Regular, and for the size, select 17.
The image below shows how your Storyboard should look once you've added all these views and labels. Don't worry if yours doesn't match exactly; we'll be adding constraints to them in the next step.
Adding Constraints
In order for this app to work on different screen sizes, it's important to add constraints. This step isn't crucial to the rest of the app, but it's highly recommended that you do this in all your iOS apps.
Image View Constraints
The first thing to constrain is our UIImageView. To do this, select your image view and open the Pin Menu in the bottom toolbar (it looks like a square with constraints and is second from the right). Then, you'll need to add the following values:
Before you proceed, make sure that the Constrain to Margins box isn't checked, as this would create a gap between the screen and the actual image view. Then, hit Enter. Now your UIImageView is centered on the screen, and it should look right on all device sizes.
View Constraints
Now, the next step is to constrain the view on which the labels appear. Select the view, and then go to the Pin Menu again. Add the following values:
Now, simply hit Enter to save the values. Your view is now constrained to the bottom of the screen.
Label Constraints
Since the view is now constrained, you can add constraints to the labels relative to the view instead of the screen. This is helpful if you later decide to change the position of the labels or the view.
Select both of the labels and put them in a stack view. If you don't know how to do this, simply press the button (second one from the left) which looks like a stack of books with a downwards arrow. You will then see the labels become one selectable object.
Click on your stack view, and then click on the Align Menu (third from the left) and make sure the following boxes are checked:
Now, hit Enter. Your labels should be centered in the view from the previous step, and they will now appear the same on all screen sizes.
Interface Builder Outlets
The last step for the user interface is to connect the elements to your ViewController class. Simply open the Assistant Editor, then Control-drag each element to the top of your class inside ViewController.swift. Here's what I'll be naming them in this tutorial:
- UILabel: objectLabel
- UILabel: confidenceLabel
- UIImageView: imageView
Of course, you can name them whatever you want, but these are the names you'll find in my code.
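As a rough sketch (assuming the names above), the outlet declarations that Xcode generates at the top of ViewController should look something like this:

```swift
import UIKit

class ViewController: UIViewController {

    // Outlets connected from Main.storyboard via the Assistant Editor.
    // The names here match the ones used throughout this tutorial.
    @IBOutlet weak var objectLabel: UILabel!
    @IBOutlet weak var confidenceLabel: UILabel!
    @IBOutlet weak var imageView: UIImageView!
}
```

Remember that typing these declarations by hand doesn't create the storyboard connections; they need to be made by Control-dragging in the Assistant Editor.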
Preparing a Capture Session
The live video feed will require an AVCaptureSession, so let's create one now. We'll also be displaying the camera input to the user in real time. Making a capture session is a fairly long process, and it's important that you understand how to do it, because it will be useful in any other development you do with the on-board camera on any of Apple's devices.
Class Extension and Function
To begin, we'll create a class extension and make it conform to the AVCaptureVideoDataOutputSampleBufferDelegate protocol. You could easily do this within the ViewController class itself, but we're using best practices here so that the code stays neat and organized (this is the way you would do it for production apps).
So that we can call it inside viewDidLoad(), we'll need to create a function called setupSession() which doesn't take any parameters. You can name this anything you want, but be mindful of the naming when we call this method later.
Once you're finished, your code should look like the following:
// MARK: - AVCaptureSession
extension ViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
    func setupSession() {
        // Your code goes here
    }
}
Device Input and Capture Session
The first step in creating the capture session is to check whether or not the device has a camera. In other words, don't attempt to use the camera if there is no camera. We'll then need to create the actual capture session.
Add the following code to your setupSession() method:
guard let device = AVCaptureDevice.default(for: .video) else { return }
guard let input = try? AVCaptureDeviceInput(device: device) else { return }

let session = AVCaptureSession()
session.sessionPreset = .hd4K3840x2160
Here, we're using a guard let statement to check that the device (AVCaptureDevice) has a camera. When you ask for the device's camera, you must also specify the mediaType, which in this case is .video.
Then, we create an AVCaptureDeviceInput, which is an input that brings the media from the device to the capture session.
Finally, we simply create an instance of the AVCaptureSession class and assign it to a variable called session. We've set the session preset to Ultra High Definition (UHD), which is 3840 by 2160 pixels. You can experiment with this setting to see what works for you.
Preview Layer and Output
The next step in our AVCaptureSession setup is to create a preview layer, where the user can see the input from the camera. We'll be adding this onto the UIImageView we created earlier in our Storyboard. The most important part, though, is actually creating the output for the Core ML model to process later in this tutorial, which we'll also do in this step.
Add the following code directly underneath the code from the previous step:
let previewLayer = AVCaptureVideoPreviewLayer(session: session)
previewLayer.frame = view.frame
imageView.layer.addSublayer(previewLayer)

let output = AVCaptureVideoDataOutput()
output.setSampleBufferDelegate(self, queue: DispatchQueue(label: "videoQueue"))
session.addOutput(output)
We first create an instance of the AVCaptureVideoPreviewLayer class, initializing it with the session we created in the previous step, and assign it to a variable called previewLayer. This layer is used to actually display the input from the camera.
Next, we make the preview layer fill the whole screen by setting its frame to the view's frame. This way, the desired appearance will persist across all screen sizes. To actually show the preview layer, we add it as a sublayer of the UIImageView that we created when building the user interface.
Now, for the important part: we create an instance of the AVCaptureVideoDataOutput class and assign it to a variable called output. We also set ourselves as its sample buffer delegate (on a dedicated serial queue) and add the output to the session, so that each captured frame is delivered to our delegate method.
Input and Start Session
Finally, we're done with our capture session. All that's left to do before the actual Core ML code is to add the input and start the capture session.
Add the following two lines of code directly under the previous step:
// Sets the input of the AVCaptureSession to the device's camera input
session.addInput(input)

// Starts the capture session
session.startRunning()
This adds the input that we created earlier to the AVCaptureSession; before this, we'd only created the input and hadn't added it. Lastly, the second line starts the session which we've spent so long creating.
Integrating the Core ML Model
We've already downloaded the model, so the next step is to actually use it in our app. So let's get started with using it to classify images.
Delegate Method
To begin, you'll need to add the following delegate method into your app:
func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    // Your code goes here
}
This delegate method is triggered whenever a new video frame is written. In our app, that happens every time a frame is recorded through our live video feed (the rate depends solely on the hardware the app is running on).
Pixel Buffer and Model
Now, we'll turn the image (one frame from the live feed) into a pixel buffer, which the model can recognize. With this, we'll later be able to create a VNCoreMLRequest.
Add the following two lines of code inside the delegate method you created earlier:
guard let pixelBuffer: CVPixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
guard let model = try? VNCoreMLModel(for: MobileNet().model) else { return }
First, we create a pixel buffer (a format which Core ML accepts) from the argument passed into the delegate method, and assign it to a constant called pixelBuffer. Then we wrap our MobileNet model in a VNCoreMLModel and assign it to a constant called model.
Notice that both of these are created using guard let statements, so the function returns early if either of these values is nil.
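If the guard let pattern is new to you, here's a minimal, self-contained sketch (unrelated to Vision) showing the early-return behavior described above; the function and values are made up for illustration:

```swift
// guard let unwraps an optional; if it's nil, the else branch must
// exit the current scope (here, by returning early).
func describe(_ value: Int?) -> String {
    guard let value = value else { return "missing" }
    return "got \(value)"
}

print(describe(42))   // got 42
print(describe(nil))  // missing
```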
Creating a Request
After the previous two lines of code have executed, we know for sure that we have a pixel buffer and a model. The next step is to create a VNCoreMLRequest using both of them.
Right below the previous step, paste the following lines of code inside of the delegate method:
let request = VNCoreMLRequest(model: model) { (data, error) in
    // Your code goes here
}
Here, we're creating a constant called request and assigning it a VNCoreMLRequest initialized with our model. The trailing closure is the completion handler that runs once the request has been performed.
Getting and Sorting Results
We're almost finished! All we need to do now is get our results (what the model thinks our image is) and then display them to the user.
Add the next two lines of code into the completion handler of your request:
// Checks if the data is in the correct format and assigns it to results
guard let results = data.results as? [VNClassificationObservation] else { return }

// Assigns the first result (if it exists) to firstObject
guard let firstObject = results.first else { return }
If the results from the data (from the completion handler of the request) are available as an array of VNClassificationObservation objects, this code takes the first object from the array and assigns it to a constant called firstObject. The first object in this array is the one for which the image recognition engine has the most confidence, because the results are sorted in descending order of confidence.
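To see why taking the first element gives the best match, here's a standalone sketch with hypothetical data: the identifiers and confidence values below are made up, but they mirror the descending-confidence order in which Vision delivers its classification results.

```swift
// Hypothetical classification results, already sorted by descending
// confidence, mirroring the order of Vision's observations.
let observations: [(identifier: String, confidence: Float)] = [
    ("golden retriever", 0.91),
    ("tennis ball", 0.06),
    ("banana", 0.01),
]

// results.first in the tutorial corresponds to observations.first here:
// the most confident classification.
if let top = observations.first {
    print(top.identifier)  // golden retriever
}
```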
Displaying Data and Image Processing
Remember when we created the two labels (confidence and object)? We'll now be using them to display what the model thinks the image is.
Append the following lines of code after the previous step:
if firstObject.confidence * 100 >= 50 {
    self.objectLabel.text = firstObject.identifier.capitalized
    self.confidenceLabel.text = String(firstObject.confidence * 100) + "%"
}
The if statement makes sure that the algorithm is at least 50% certain about its identification of the object. We then set firstObject's identifier as the text of the objectLabel, since we know that the confidence level is high enough, and display the certainty percentage using the text property of confidenceLabel. Since firstObject.confidence is represented as a decimal, we need to multiply it by 100 to get the percentage.
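One caveat with String(firstObject.confidence * 100): it prints the raw floating-point value (for example "87.33999..."). If you'd prefer a tidier label, a small helper like the one below rounds to one decimal place. The function name and the rounding choice are my own additions, not part of the original code:

```swift
import Foundation

// Formats a 0.0–1.0 confidence value as a percentage string
// with one decimal place, e.g. 0.8734 -> "87.3%".
func confidenceText(_ confidence: Float) -> String {
    return String(format: "%.1f%%", confidence * 100)
}

print(confidenceText(0.8734))  // 87.3%
print(confidenceText(0.5))     // 50.0%
```

You could then set self.confidenceLabel.text = confidenceText(firstObject.confidence) instead.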
The last thing to do is to run each frame through the request we just created. To do this, add the following line of code directly before exiting the captureOutput(_:didOutput:from:) delegate method:
try? VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:]).perform([request])
Conclusion
The concepts you learned in this tutorial can be applied to many kinds of apps. I hope you've enjoyed learning to classify images using your phone. While it may not yet be perfect, you can train your own models in the future to be more accurate.
Here's what the app should look like when it's done:
While you're here, check out some of our other posts on machine learning and iOS app development!