Face tracking in video editing refers to software that identifies and follows the position of a human face across the frames of a video. The software updates its knowledge of the face's location with each new frame, allowing other systems — like cropping tools or effects — to use that position information in real time.
For content creators, face tracking has one primary practical application: keeping the speaker's face centered in the frame when reformatting a horizontal video to a vertical crop.
Modern face tracking uses neural networks trained on large datasets of human faces in different orientations, lighting conditions, and distances. The model is applied to each frame of the video and produces detection output: the coordinates of any faces in the frame, typically as a bounding box (x, y, width, height).
Tracking adds a temporal dimension to detection: the model maintains identity across frames, linking the face detected in frame 1 to the face detected in frame 2, frame 3, and so on. This allows it to follow a specific person even as they move, turn their head, or briefly disappear behind another object.
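The frame-to-frame linking step can be sketched with a common heuristic: match each new detection to the previous track whose bounding box overlaps it most (intersection over union). This is a minimal illustration of the idea, not any specific product's tracker; the `(x, y, width, height)` box format and the 0.3 threshold are assumptions for the sketch.

```python
# Minimal IoU-based identity linking across frames. Boxes are
# (x, y, width, height) tuples, as a hypothetical per-frame face
# detector might produce them.

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(ax, bx)
    iy = max(ay, by)
    iw = min(ax + aw, bx + bw) - ix
    ih = min(ay + ah, by + bh) - iy
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)

def link_tracks(prev_tracks, detections, threshold=0.3):
    """Match each new detection to the previous track with the highest
    IoU above threshold; unmatched detections start new track IDs."""
    next_id = max(prev_tracks, default=-1) + 1
    tracks = {}
    unclaimed = dict(prev_tracks)
    for det in detections:
        best_id, best_iou = None, threshold
        for tid, box in unclaimed.items():
            score = iou(box, det)
            if score > best_iou:
                best_id, best_iou = tid, score
        if best_id is None:
            best_id = next_id
            next_id += 1
        else:
            del unclaimed[best_id]
        tracks[best_id] = det
    return tracks

frame1 = link_tracks({}, [(100, 80, 60, 60)])          # new track, id 0
frame2 = link_tracks(frame1, [(110, 82, 60, 60)])      # same face, still id 0
```

Because the face in frame 2 overlaps its frame-1 position heavily, it keeps the same ID — that continuity is what turns per-frame detection into tracking.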
More sophisticated face tracking goes beyond bounding boxes to detect specific facial landmarks: the positions of the eyes, nose, mouth corners, jawline, and eyebrows. This allows the tracking system to determine head pose (which direction the face is turned), eye contact (is the person looking toward the camera?), and facial expression (are they smiling, raising eyebrows?).
These landmarks are used in AI clipping systems to score moment quality — a speaker making direct eye contact with the camera while expressing strong emotion scores higher than the same speaker looking away with a neutral expression.
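The exact signals and weights real clipping systems use are not public, but the shape of such a landmark-based heuristic can be sketched: combine head yaw (how far the face is turned from the camera) with an expression-intensity signal derived from landmarks. All names, thresholds, and weights here are illustrative assumptions.

```python
# Hypothetical moment-quality score from landmark-derived signals:
# yaw_degrees (0 = facing camera) and a 0-1 expression intensity
# (e.g. smile strength estimated from mouth-corner positions).
# Weights and the 45-degree falloff are illustrative, not published values.

def moment_score(yaw_degrees, expression_intensity):
    # Eye contact: full credit facing the camera, falling to zero by 45 deg.
    eye_contact = max(0.0, 1.0 - abs(yaw_degrees) / 45.0)
    # Weighted blend: an engaged gaze weighted above expression alone.
    return 0.6 * eye_contact + 0.4 * expression_intensity

# Direct eye contact with strong emotion outscores looking away, neutral:
engaged = moment_score(yaw_degrees=5, expression_intensity=0.9)
disengaged = moment_score(yaw_degrees=40, expression_intensity=0.1)
```

Running this, `engaged` lands well above `disengaged`, matching the scoring behavior described above.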
When a horizontal video is converted to vertical format (from 16:9 to 9:16) while keeping the full frame height, the crop can only be 9/16 of the height wide — roughly 32% of the original frame width (a 1920×1080 source yields a 608×1080 crop). The crop window must be positioned somewhere in the original frame.
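The arithmetic is simple enough to show directly: the widest full-height crop at a target aspect ratio is just height times that ratio.

```python
# Crop-window arithmetic for a full-height 9:16 crop from a 16:9 frame.

def vertical_crop_size(frame_w, frame_h, target_aspect=9 / 16):
    """Width and height of the largest full-height crop at target aspect."""
    crop_w = round(frame_h * target_aspect)
    return crop_w, frame_h

w, h = vertical_crop_size(1920, 1080)   # (608, 1080)
fraction_kept = w / 1920                # about 0.32 of the original width
```

For a 1080p source that leaves a 608×1080 window — everything outside it is discarded, which is why *where* the window sits matters so much.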
Without face tracking: the crop window is fixed in one position — typically the center of the frame. If the speaker moves left or right, they drift out of the crop. This is the "bad vertical crop" you see when creators simply crop without any tracking.
With face tracking: the crop window follows the speaker's face across the frame. As the speaker moves left, the crop window moves left. As they lean forward (face gets larger), the crop adjusts to maintain the same relative head-to-frame ratio. The speaker stays centered throughout the clip.
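One way to get that smooth following behavior is exponential smoothing: each frame, move the crop a fraction of the way toward centering the face, then clamp it inside the frame. This is a sketch of the general technique, not any editor's actual implementation; the 0.2 smoothing factor is an illustrative value.

```python
# Crop window that eases toward the detected face center instead of
# snapping to it, so small detection jitter doesn't shake the frame.
# All positions are in pixels.

def clamp(v, lo, hi):
    return max(lo, min(hi, v))

def follow_face(crop_x, face_center_x, crop_w, frame_w, alpha=0.2):
    """Move the crop's left edge a fraction (alpha) of the way toward
    centering the face, then clamp so the crop stays inside the frame."""
    target_x = face_center_x - crop_w / 2
    new_x = crop_x + alpha * (target_x - crop_x)
    return clamp(new_x, 0, frame_w - crop_w)

# Speaker drifts left; the 608-wide crop in a 1920 frame eases after them.
x = 656.0                               # start centered on face at x=960
for face_x in [960, 940, 920, 900]:     # face center moving left
    x = follow_face(x, face_x, crop_w=608, frame_w=1920)
```

Without the clamp, a face near the edge would push the crop outside the frame; without the smoothing, every one-pixel detection wobble would become visible camera shake.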
Adobe Premiere Pro: the "Auto Reframe" effect uses Adobe Sensei AI to track subjects and adjust the crop dynamically. It's applied when you reframe a clip to a different aspect ratio.
DaVinci Resolve: face detection and tracking features are available in the Color page (for color-grading specific subjects) and the Fusion page (for motion graphics tracking to faces). Requires some manual setup for auto-reframe applications.
CapCut: basic auto-reframe that follows the primary subject in the frame. Effective for standard talking-head content.
AI clip generation tools like Clipsy include face tracking as part of their vertical reframing process. When you generate clips from a YouTube URL, the clips come back with face-tracked vertical crops applied — the speaker stays centered without any setup required on your end.
Face tracking isn't perfect. Common failure scenarios include: faces that turn more than about 90 degrees to the side (profile views), rapid motion that creates motion blur making detection unreliable, very small faces (speaker far from camera), occlusion (face blocked by another object), and multiple faces where the system can't determine which is the "correct" subject.
For standard talking-head YouTube content — a speaker sitting or standing in front of a camera — face tracking works correctly more than 95% of the time without requiring manual correction.
Use dynamic face tracking reframe when: the speaker moves significantly, there are multiple speakers who take turns being the focus, or the original framing was wide with significant head room. Use a static crop when: the speaker is stationary and centered, the video has very little movement, and speed of processing is a priority over perfect tracking.
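That decision can even be automated with a rough rule of thumb: sample face positions across the clip and check how far the face actually moves. The 5% band here is an illustrative threshold, not a published standard.

```python
# Hypothetical static-vs-dynamic chooser: if the face center stays within
# a small band of the frame width across sampled frames, a static crop is
# enough; otherwise use dynamic tracking.

def needs_dynamic_crop(face_centers_x, frame_w, band=0.05):
    """True when horizontal face movement exceeds `band` of frame width."""
    if not face_centers_x:
        return False
    spread = max(face_centers_x) - min(face_centers_x)
    return spread > band * frame_w

needs_dynamic_crop([950, 960, 955], frame_w=1920)   # small drift -> static
needs_dynamic_crop([600, 960, 1300], frame_w=1920)  # big moves -> dynamic
```

A stationary, centered speaker fails the movement test and gets the faster static crop; a speaker pacing across a wide frame triggers dynamic tracking.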
Try Clipsy Free