Gentle Introduction to ARKit


Apple launched ARKit at WWDC 2017, and since then it has created quite a buzz in the developer community, as most mobile app developers have at least tried building an augmented reality app. Let’s also not miss the opportunity to formally say goodbye to Google Tango (DAMN YOU GOOGLE!!! 😡 Where should I shove my Asus Zenfone AR??)

With so much already happening in augmented reality, Apple was kind of lagging behind the other players, but they had announced that they were going to launch an AR platform, and yes, they rolled out the ARKit API at WWDC. Before we begin, let us break the ice: I am not an Apple fan for an obvious reason – I am poor 😜

The biggest advantage that Apple has over Android is that they control the hardware – look what happened to poor Tango (Dear Google, I am very pissed 😡).

ARKit does not require any special hardware, unlike Tango, which had a fisheye lens and an IR sensor for better depth perception and therefore a better understanding of the environment. ARKit is best suited to table-top experiences: at least as of the day I am writing this post, it can only detect horizontal planes such as floors and tabletops. I sincerely hope the developers at Apple are working on detecting vertical walls too, but this doesn’t mean that we can’t create rich AR experiences using ARKit. In fact, it is the best mobile AR experience I have ever had (I have yet to test ARCore, though).

Something Technical –

ARKit is a session-based API, and the key highlights of an application built using the ARKit API are –

  1. Tracking
  2. Scene Understanding
  3. Rendering

Note: I will not be covering rendering in this post.




As you can see from the image above, ARKit is not responsible for rendering graphics on screen; its only objective is to do all the calculations and processing for a rich AR experience. ARKit receives data from two frameworks, AVFoundation and CoreMotion.

AVFoundation is responsible for visual data, i.e. images, whereas CoreMotion reports motion and environment-related data.

Let us dig deep –

As I mentioned above, ARKit is a session-based API: you have to create a session before you can start rendering anything on screen.

  1. Create an ARSession object, which controls all the processing of an AR app.
  2. Then determine what sort of tracking your app needs; for this you create an ARSessionConfiguration. By enabling and disabling its properties, you get different levels of scene understanding.
  3. To start the session, call run on the ARSession object with the ARSessionConfiguration you want to use.


ARSessionConfiguration determines the tracking you want to run in your app, i.e. whether you need 3DOF (rotation only) or 6DOF (rotation plus position). The base class ARSessionConfiguration provides 3DOF, whereas its subclass ARWorldTrackingSessionConfiguration provides 6DOF. You also get feature points, which can be enabled and disabled.

NOTE: When the session is started with the base ARSessionConfiguration, it only provides 3DOF and has no scene understanding; therefore, functionality like hit-testing will not be available.

As ARKit is a session-based API, we can pause, resume and reset its session. Just as run(_ configuration) is called to start the session, you can call pause on the ARSession object to pause it. To resume tracking after a pause, call run with the configuration again. A point worth noting is that you can call run again with a different configuration to switch between configurations.
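The lifecycle above can be sketched in Swift roughly as follows. This is a minimal sketch, not a definitive setup: `SessionController` is a hypothetical wrapper, and the code uses the class names from the shipping iOS 11 SDK, where the beta-era `ARSessionConfiguration` and `ARWorldTrackingSessionConfiguration` were renamed `ARConfiguration` and `ARWorldTrackingConfiguration`. The `#if canImport(ARKit)` guard only keeps the snippet compilable on platforms without the framework.

```swift
#if canImport(ARKit)
import ARKit

/// Hypothetical wrapper that owns the session for this example.
final class SessionController {
    let session = ARSession()

    /// 6DOF world tracking with horizontal plane detection enabled.
    private func makeConfiguration() -> ARWorldTrackingConfiguration {
        let configuration = ARWorldTrackingConfiguration()
        configuration.planeDetection = .horizontal
        return configuration
    }

    func start() {
        // World tracking is not supported on every device.
        guard ARWorldTrackingConfiguration.isSupported else { return }
        session.run(makeConfiguration())
    }

    func pause() {
        session.pause()
    }

    /// Resuming is just another call to run with a configuration.
    func resume() {
        session.run(makeConfiguration())
    }

    /// Restart tracking from scratch and drop anchors from the old session.
    func reset() {
        session.run(makeConfiguration(),
                    options: [.resetTracking, .removeExistingAnchors])
    }
}
#endif
```

Calling run with a different configuration object on the same session is how you would switch tracking modes without creating a new ARSession.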

AVCaptureSession and CMMotionManager are created automatically to get image and motion data. If everything goes as planned and nothing crashes, ARSession outputs an ARFrame, which is a snapshot in time (yeah, right, weird, but that is what Apple calls it) and contains all the state of the session and everything else needed to render an augmented reality scene.

To access the ARFrame, read currentFrame on the ARSession, or set up a delegate to receive updates whenever a new ARFrame is available.

ARFrame – ARFrame provides the captured image, tracking information including its state, and scene information, which includes feature points and a light estimate.

ARAnchor – Physical locations in space are represented as ARAnchor objects by ARKit. An ARAnchor is a real-world position and orientation in space that can be added to and removed from the scene; if you are using plane detection, anchors are added to the scene automatically. ARAnchor objects can be accessed through a list on ARFrame, or we can set up a delegate that is notified when an ARAnchor is added, updated or removed.
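As a rough illustration of the delegate route, here is a minimal Swift sketch. `FrameObserver` is a hypothetical class name, the callbacks are the ones declared by `ARSessionDelegate`, and the print statements only stand in for real handling. The `#if canImport(ARKit)` guard just keeps the snippet compilable where ARKit is missing.

```swift
#if canImport(ARKit)
import ARKit

/// Hypothetical observer that receives frames and anchor changes.
final class FrameObserver: NSObject, ARSessionDelegate {

    // Called every time ARKit produces a new frame.
    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        let image = frame.capturedImage                     // CVPixelBuffer
        let featureCount = frame.rawFeaturePoints?.points.count ?? 0
        print("frame \(frame.timestamp): \(CVPixelBufferGetWidth(image)) px wide,",
              "\(featureCount) feature points, \(frame.anchors.count) anchors")
    }

    // Called when anchors appear, e.g. planes found by plane detection.
    func session(_ session: ARSession, didAdd anchors: [ARAnchor]) {
        for case let plane as ARPlaneAnchor in anchors {
            print("plane \(plane.identifier): center \(plane.center), extent \(plane.extent)")
        }
    }

    func session(_ session: ARSession, didUpdate anchors: [ARAnchor]) {
        print("\(anchors.count) anchors updated")
    }

    func session(_ session: ARSession, didRemove anchors: [ARAnchor]) {
        print("\(anchors.count) anchors removed")
    }
}

/// Attach the observer and add one anchor by hand at the session origin.
func attach(_ observer: FrameObserver, to session: ARSession) {
    session.delegate = observer   // weak reference: keep the observer alive elsewhere
    session.add(anchor: ARAnchor(transform: matrix_identity_float4x4))
}
#endif
```

If you only need the latest state occasionally, polling `session.currentFrame` (which is optional and nil until the first frame arrives) avoids the delegate entirely.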

So far, I have discussed the four main classes that are used to create an augmented reality experience using ARKit, i.e.

  1. ARSession
  2. ARSessionConfiguration
  3. ARFrame
  4. ARAnchor



ARKit is capable of world tracking, which is made possible by visual-inertial odometry. It gives you the pose (position and rotation) and, most importantly, scale and distances.

Tracking also relies on 3D feature points: distinctive points captured through the camera, which are then matched against the motion data to get a precise position and rotation.

Below is the image showing what is happening under the hood.



CMMotionManager provides motion data at a higher rate, and AVCaptureSession provides image data. The data from both is combined to find the precise pose of the device, which is returned in the ARFrame. The ARFrame also contains an ARCamera, an object that represents the device’s location and orientation as well as its translation from the point at which the session started. The camera intrinsics are provided too; they match the physical camera on the device and are used to find the projection matrix that is used to render the virtual geometry.
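To make the pose data a bit more concrete, here is a small Swift sketch of reading it from a frame. The helper names `cameraPosition` and `distanceFromSessionOrigin` are hypothetical. It relies on the camera transform being a 4x4 matrix whose fourth column holds the translation from the session origin, in meters.

```swift
/// Position part of a pose, taken from the translation column of a 4x4 transform.
func cameraPosition(fromTranslationColumn column: SIMD4<Float>) -> SIMD3<Float> {
    return SIMD3<Float>(column.x, column.y, column.z)
}

/// Straight-line distance from where the session started.
func distanceFromSessionOrigin(_ position: SIMD3<Float>) -> Float {
    let squared = position.x * position.x
        + position.y * position.y
        + position.z * position.z
    return squared.squareRoot()
}

#if canImport(ARKit)
import ARKit

/// Print the pose-related values ARKit exposes on a frame's camera.
func logCameraPose(of frame: ARFrame) {
    let camera = frame.camera
    let position = cameraPosition(fromTranslationColumn: camera.transform.columns.3)
    print("position:", position)
    print("distance from origin:", distanceFromSessionOrigin(position))
    print("euler angles (radians):", camera.eulerAngles)
    print("intrinsics:", camera.intrinsics)
    print("projection matrix:", camera.projectionMatrix)
}
#endif
```

Keeping the math in plain functions means it can be checked off-device, while `logCameraPose` is the only part that needs ARKit.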

There are a few points worth noting for good tracking –

  1. Uninterrupted Sensor Data – Camera images are needed, and if the camera somehow gets covered, we will not get precise tracking.
  2. Textured environment – If we are in a room with only plain white walls, we will not have enough 3D feature points, which again leads to poor tracking.
  3. Static Scenes – If what the camera is seeing is moving too much, it will lead to drift or a limited tracking experience.

ARCamera provides a tracking state property to help manage limited tracking experiences. When the ARSession starts, it begins in the not available state, which means the camera transform is the identity matrix.

When the app finds the first tracking pose, the state changes from not available to normal, i.e. we can now use the camera transform to place our virtual geometry. After that, if tracking becomes limited at any point for the reasons discussed above, the state changes to limited, and the reason is provided through the delegate. We can use it to tell the user to, for example, light up the room or move away from a plain white wall.


But what if sensor data becomes unavailable, like when the camera gets covered or the app gets backgrounded because you are using other apps at the same time? In this case, tracking stops and the session is interrupted. ARKit provides delegate callbacks to handle this scenario, so you can tell the user that tracking has stopped. Since the relative position of the device is lost, it’s better to restart the experience when coming back from the interruption, as the anchors will not be aligned anymore.
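The tracking-state and interruption handling described above could look roughly like this in Swift. This is a sketch under assumptions: `LimitedReason`, `userHint` and `TrackingObserver` are hypothetical names, the hint texts are made up for illustration, and resetting the session after an interruption is one possible policy rather than a requirement.

```swift
/// Hypothetical, ARKit-independent mirror of the reasons tracking can be limited.
enum LimitedReason {
    case initializing, excessiveMotion, insufficientFeatures, other
}

/// Message to show the user while tracking is limited.
func userHint(for reason: LimitedReason) -> String {
    switch reason {
    case .initializing:         return "Hold on, tracking is starting up."
    case .excessiveMotion:      return "Move the device more slowly."
    case .insufficientFeatures: return "Point at a textured, well-lit surface."
    case .other:                return "Tracking is limited."
    }
}

#if canImport(ARKit)
import ARKit

final class TrackingObserver: NSObject, ARSessionDelegate {

    func session(_ session: ARSession, cameraDidChangeTrackingState camera: ARCamera) {
        switch camera.trackingState {
        case .notAvailable:
            print("Tracking not available yet")
        case .normal:
            print("Tracking normal, safe to place content")
        case .limited(let reason):
            let mapped: LimitedReason
            switch reason {
            case .initializing:         mapped = .initializing
            case .excessiveMotion:      mapped = .excessiveMotion
            case .insufficientFeatures: mapped = .insufficientFeatures
            default:                    mapped = .other   // reasons added in later SDKs
            }
            print(userHint(for: mapped))
        }
    }

    func sessionWasInterrupted(_ session: ARSession) {
        print("Session interrupted, tracking has stopped")
    }

    func sessionInterruptionEnded(_ session: ARSession) {
        // Old anchors are no longer aligned, so start over.
        session.run(ARWorldTrackingConfiguration(),
                    options: [.resetTracking, .removeExistingAnchors])
    }
}
#endif
```

Mapping ARKit's reasons onto a small enum of your own keeps the user-facing messages testable without a device.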

Scene Understanding

Scene understanding is important if you want to place a virtual item in the real world. It provides information such as horizontal surfaces, the hit point on which the item will be placed, and a light estimate so that the item blends in with the environment.

  • Plane Detection – It provides horizontal planes with respect to gravity, which includes floors and table tops. ARKit does this in the background over multiple frames, so as you move around a flat surface, the device learns more about the plane. ARKit also provides an aligned extent for each surface: it fits a rectangle to the surface, which gives the orientation of the plane. If multiple planes are detected for the same physical surface, ARKit merges them and the newer plane is removed.



  • Hit-Testing – We send a ray from the device into the real world and find where it intersects, so that we can then place our virtual object at that coordinate. For this, ARKit uses all the scene information, which includes the detected planes and the 3D feature points. The resulting hit points are returned in an array sorted by distance, i.e. the intersection point closest to the device is at index zero of the array. There are four hit-test types:
    1. Existing plane using extent – An intersection is reported only if the ray hits within the extent of a detected plane; outside the extent, no intersection is detected.
    2. Existing plane – For a use case like moving furniture, where only a small extent has been detected, you can ignore the extent: ARKit treats the plane as infinite, so intersections are detected outside the extent as well.
    3. Estimated plane – In this case, ARKit estimates a plane from the 3D feature points by looking for co-planar points in the environment and fitting a virtual plane to them.
    4. Feature point – If the environment is very irregular or the surface is very small, you can choose to intersect with the feature points directly.


  • Light Estimation – It makes the virtual object look realistic by adjusting its brightness relative to the scene. For this, ARKit uses exposure information from the captured image. The default value is 1000 lumens.
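Hit-testing and light estimation from the list above can be sketched in Swift like this. It is a minimal sketch with assumptions spelled out: `relativeBrightness`, `placementTransform` and `brightnessScale` are hypothetical helpers, and dividing the ambient intensity by the neutral value of 1000 is one simple way to derive a brightness factor, not something ARKit does for you. `ARFrame.hitTest` takes a point in normalized image coordinates and was deprecated in iOS 14 in favor of raycasting, so treat this as matching the ARKit version the post describes.

```swift
/// Ambient intensity that ARKit treats as neutral lighting.
let neutralAmbientIntensity = 1000.0

/// Hypothetical brightness factor: 1.0 at neutral lighting, lower in dim scenes.
func relativeBrightness(forAmbientIntensity intensity: Double) -> Double {
    return max(0, intensity / neutralAmbientIntensity)
}

#if canImport(ARKit)
import ARKit

/// World transform of the nearest hit under a normalized image point, if any.
func placementTransform(in frame: ARFrame, at normalizedPoint: CGPoint) -> simd_float4x4? {
    let results = frame.hitTest(normalizedPoint,
                                types: [.existingPlaneUsingExtent,
                                        .estimatedHorizontalPlane,
                                        .featurePoint])
    // Results are sorted nearest first, so index zero is the closest hit.
    return results.first?.worldTransform
}

/// Brightness factor for virtual content based on the frame's light estimate.
func brightnessScale(for frame: ARFrame) -> Double {
    guard let estimate = frame.lightEstimate else { return 1 }
    return relativeBrightness(forAmbientIntensity: Double(estimate.ambientIntensity))
}
#endif
```

For example, `placementTransform(in: frame, at: CGPoint(x: 0.5, y: 0.5))` asks for the surface under the center of the camera image, and the returned transform can be used to create an ARAnchor at that spot.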


So far, I have tried to cover all the major areas that you need to know to make your own augmented reality app using ARKit. Here is a small demo built in Unity3D that I created to showcase what can be done with ARKit.

Download and install Unity3D to run this project







Download this project or clone it from GitHub


If you are curious and want to know about this project, then leave a comment or email me at

If you are interested in knowing more about the work I have done, then do the same as above 😀