Debugging Hard Things: Safari Edition

Some problems are hard for us to debug because we think they’re too difficult to understand. Signs that this may be happening: we start making wild guesses, try the same thing over and over again, or copy and paste a solution we don’t understand in the hope that the problem will go away.

Hint: Adding more zeros usually won’t help…

I’ve seen this in practice from many smart and capable folks (including myself!) with concepts like z-index (like did you know there are multiple stacking contexts, not a single global one?), CSS specificity (it really is just counting!), spotting memory leaks, puzzling through concurrency issues, or trying to work around browser bugs.

So what can we do instead, when we notice something that’s hard for us to debug? Plenty!

  • Take a pause with the beginner’s mindset. It’s okay to not know how something works. We can take some time to learn some of the basics to help us move forward.
  • Remember that computers aren’t magic! 🪄 There’s a reason why this is happening. If we’re having trouble figuring it out, the cause might sit on a layer that we’re not familiar with (browser, compiler, DB, OS, API, network, a bad physical cable, and so on).
  • Know that we can break big problems into smaller ones and work systematically to rule out or narrow down theories.

So that all sounds great, but how do we apply that?

Let’s take a closer look at what this process can look like in practice. The following is a real example of a browser bug I hunted down and some techniques I have found helpful.

A Real Example: Flickering Elements in Safari

One issue that piqued my interest while working full time on the WordPress Block Editor was some puzzling behavior in Safari when scrolling post content.

The text was flickering on image captions and there was black flashing when scrolling quickly.

A cursory search showed that most people simply promoted more elements to their own compositing layer (more on this below) via some CSS like transform: translate3D(0,0,0); and called it a day.

While it’s pretty tempting to copy and paste a CSS rule like that, what made me pause before accepting a PR that did just that was:

  1. We didn’t understand why this was happening.
  2. We didn’t understand why this might have fixed it.
  3. We didn’t understand what the consequences of doing so would be.

With that in mind, I dug in to try and provide an alternative solution. Here’s where I started:

Reproduce the Issue

A great first step when looking at a bug is to see if we can reproduce it.

Reproducing a bug lets us iterate on our theories and potential fixes in an ideally speedy test-a-fix and see-if-the-bug-is-still-there loop. If we can’t reproduce an issue, it doesn’t mean the problem doesn’t exist; it’ll just be much more difficult to iterate on. When things are not reproducible, we sometimes end up testing ideas by chatting with those who can reproduce the issue, or by bulletproofing the code and verifying with production instrumentation whether an error or issue goes away 😭.

Observe and Ask

In this Safari example, the bug was straightforward to reproduce. The issue already had a few videos attached, so I could verify which editors were being used by 🔍 looking at the UX elements, and I could spot the types of block content being used in the post as a starting point (paragraphs, cover block, gallery block and more). Creating a quick test post and scrolling in Safari confirmed that this was an issue.

If there’s not enough information in a bug report, we can ask the reporter for more. Screenshots or full videos can help a lot too when folks don’t know how to describe a behavior or the technical terms for what’s actually happening.

Reduce What’s Needed to Reproduce

A bug report might have hidden assumptions or ideas on why something is happening. These assumptions aren’t always correct, so it helps to confirm or rule them out ourselves. One of the first things I did was verify that the same content wasn’t showing any visual issues in Firefox or Chrome.

If we can reduce what’s needed to reproduce an issue, this can also help speed up our testing loop. With the flickering elements, I narrowed down post content to contain a single cover block and gallery.

Great, so we can reproduce the issue! What’s next?

Assemble Our Known Clues

Like reading a murder mystery, it can help to keep our known facts or hints assembled together. It makes it easy to revisit when thinking of new ideas to investigate or if we need to go back and challenge our assumptions.

Initial Clues List

  • Glitches only display while scrolling in Safari?
  • The black flashing is from a bad browser paint?
  • Maybe related to images (since it’s easier to reproduce with that content)?
  • From quick internet searches: possibly related to layer compositing issues?

Pick a Clue to Investigate or Challenge

Sidenote: sometimes we can use past experience (intuition) to skip through some of these discovery steps. I personally have some experience working on some reasonably complex canvas apps, so compositing was something that came to top of mind when looking at the symptoms of this bug. For the purposes of this section let’s assume that we have zero experience to work with.

Okay, so searching the internet for “Safari flickering scroll” gives us answers along the lines of “add this magic CSS rule that does nothing”: transform: translate3D(0,0,0); (move this element nowhere in 3D space) or backface-visibility: hidden (toggling this value is invisible in 2D space), with zero explanation.

Very suspicious, right?

At this point, it might be easy to give up and think “This is weird! Just paste that answer and move on with our lives!” but let’s remain curious and dig deeper. It’s okay if we’ve never encountered a browser bug like this before or never had to dig more deeply into understanding browser rendering.

Be persistent and keep searching (or asking folks) and refine our questions!

  • With a follow-up search like “transform: translate3D(0,0,0);”, we can see that we’re trying to get the browser to do something around hardware acceleration.
  • Another search on hardware acceleration hints at a rendering process called compositing.

Bingo! There’s something new to learn or refresh our memories about! Let’s look at browser compositing.

A great way of gaining a basic understanding is finding multiple (hopefully reputable) resources, reading them, synthesizing them, and trying to explain the topic to someone else.

Here are the posts I used to put together the summary in the next section:

Let’s give that explanation part a try!

Understand a System More Deeply: Compositing

Compositing is one of the last steps that a browser takes when turning a web page into pixels on your screen. It’s also an optimization over a more naive implementation.

Very broadly a modern browser renderer process handles:

  • Parsing: turning an HTML string into a Document Object Model (DOM), loading external resources (images, styles, JavaScript), and parsing and executing any JS.
  • Style: computing the style for each DOM node. (Which CSS rule won?)
  • Layout: calculating where to draw nodes and how big they should be.
  • Paint Order: what order should we paint elements? Think of how we might paint a real oil painting where we have some background mountains, a person, and a dog as our main subject. One method would be to paint back-to-front, drawing background elements first. Mountains, then the person, then the dog.
  • Paint and Compositing: Determine how to group elements into layers, paint each layer to fill with pixels, and then draw or put together each layer in the right order for a final image. Let’s go over this in more detail below.

What Is Compositing?

Animation of compositing process from https://developers.google.com/web/updates/2018/09/inside-browser-part3#what_is_compositing

While we might naively fill in pixels on our screen by painting each element in our viewport in paint order, this is slow. What if we break up parts of the page into their own layers that don’t change as much? Using our painting analogy, we might make a layer for the mountains, one for the person, and a last one for the dog. Like in cel animation, after we paint each layer, we can reposition each layer independently without needing to repaint the entire scene. For browsers this is very useful in scenarios like smooth animation or scrolling.

Determining what should be in a layer is non-trivial. These are internal implementation details that may vary by browser and change over time, but roughly a browser will create a new layer for an element when:

  • It has 3D or perspective transform CSS properties
  • It is an <iframe>, <canvas> or <video> element
  • It uses CSS animations or accelerated CSS filters
  • It has a descendant that is a compositing layer
  • It has a sibling with a lower z-index which has a compositing layer (in other words the layer overlaps a composited layer and should be rendered on top of it)

Layers can also get pretty large too, which can waste resources if the browser viewport only intersects a small part of it. We can optimize for this by subdividing a layer in a process called tiling. Going back to our painting analogy, think of how we might portion out squares of a large wall mural, in some prioritized order, for multiple artists to draw.
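To make tiling a bit more concrete, here’s a minimal sketch of figuring out which tiles of a big layer actually overlap the viewport. Everything in it is an assumption for illustration: the 512px tile size, the 20000px layer height, and names like visibleTiles are made up, and real browsers pick tile sizes and priorities internally.

```typescript
// Hypothetical sketch: which tiles of a large layer overlap the viewport?
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Assumed tile size; browsers choose their own.
const TILE_SIZE = 512;

// Returns the [column, row] index of every tile that intersects the viewport.
function visibleTiles(layer: Rect, viewport: Rect, tileSize: number = TILE_SIZE): Array<[number, number]> {
  // Clip the viewport to the layer's bounds, in layer-local coordinates.
  const left = Math.max(viewport.x, layer.x) - layer.x;
  const top = Math.max(viewport.y, layer.y) - layer.y;
  const right = Math.min(viewport.x + viewport.width, layer.x + layer.width) - layer.x;
  const bottom = Math.min(viewport.y + viewport.height, layer.y + layer.height) - layer.y;

  const result: Array<[number, number]> = [];
  if (right <= left || bottom <= top) {
    return result; // the viewport doesn't overlap the layer at all
  }

  const firstCol = Math.floor(left / tileSize);
  const lastCol = Math.ceil(right / tileSize) - 1;
  const firstRow = Math.floor(top / tileSize);
  const lastRow = Math.ceil(bottom / tileSize) - 1;

  for (let row = firstRow; row <= lastRow; row++) {
    for (let col = firstCol; col <= lastCol; col++) {
      result.push([col, row]);
    }
  }
  return result;
}

// A tall layer (made-up height) scrolled partway down behind a small viewport.
const layer: Rect = { x: 0, y: 0, width: 1584, height: 20000 };
const viewport: Rect = { x: 0, y: 5000, width: 1584, height: 588 };
const tiles = visibleTiles(layer, viewport);
const totalTiles = Math.ceil(layer.width / TILE_SIZE) * Math.ceil(layer.height / TILE_SIZE);
console.log(`${tiles.length} of ${totalTiles} tiles intersect the viewport`);
```

With these made-up numbers only a small fraction of the tiles is on screen, which is the whole point: the rest can be painted later or not at all.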

In the browser, determining what layers to create and how to put them back together again is usually split out to a compositor thread, which may in turn also create child threads to give small pieces of work to the Graphics Processing Unit (GPU). The GPU is great at small tasks like painting pixels for polygons, hence the term hardware acceleration that gets thrown around.

On Performance

Another way of thinking about this is that each compositing layer acts as a pixel cache. This is of interest to us as web developers because some types of updates to a web page can skip parts of the expensive rendering process.

From most to least expensive:

  • Layout Change: If we make an element bigger, smaller, or change its position on the page we can’t skip any steps of the rendering process. (Using cel animation as an analogy, think of needing to throw out all of our existing cels and needing to ink, color and reposition them.)
  • Paint Change: If we update a paint property like background, or color, we can skip layout. (Using cel animation as an analogy, we can repaint the existing cels, then reposition them).
  • Compositor Change: If we only update a compositor-supported property (transform or opacity), we can skip layout and paint. When done correctly we can see very smooth animations and scrolls. (Using cel animation as an analogy, there’s no need to ink or repaint cels; we can simply reposition the layers for a different final image.)
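As a rough sketch of the list above, here’s one way to model which stages rerun for a given change. The property-to-stage mapping is a simplified assumption for illustration: real browsers decide this internally, and a transform or opacity change only stays on the compositor when the element already has its own layer. Names like stagesToRerun are made up for this example.

```typescript
type Stage = "layout" | "paint" | "composite";

// The order in which the (simplified) pipeline runs.
const PIPELINE: Stage[] = ["layout", "paint", "composite"];

// Assumed, simplified mapping from a changed CSS property to the first stage that must rerun.
function firstStageFor(property: string): Stage {
  switch (property) {
    case "transform":
    case "opacity":
      return "composite"; // can skip layout and paint
    case "background":
    case "color":
      return "paint"; // can skip layout
    default:
      return "layout"; // conservative: size, position, and anything unknown
  }
}

// Every stage from the first affected one onward has to run again.
function stagesToRerun(property: string): Stage[] {
  return PIPELINE.slice(PIPELINE.indexOf(firstStageFor(property)));
}

console.log(stagesToRerun("width"));
console.log(stagesToRerun("color"));
console.log(stagesToRerun("transform"));
```

The later the first affected stage, the less work the browser has to redo, which is why transform-based animations tend to feel smoother than ones that animate width or top.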

Why can’t we make everything a layer?

Have you tried promoting another layer? Maybe all of them? Is this perhaps a bad idea?

We can’t make everything a layer since the tradeoff is memory use and overhead for managing each of those layers! Done haphazardly, we can make our webpage much slower, or even crash! Like with most things we should profile to make sure layers make sense and are kept in check.

Revisit Our Browser Tools

After we understand a system more deeply, let’s check to see if browsers have any tools to help track this down! It’s always great to double-check what debugging tools are available to us, since they make investigation work go much more quickly.

Thankfully, Chrome and Safari both have a layers tab in devtools, which lists these layers, their memory use, and why each layer was created.

Using the Layers Tool

So right out of the box, we can see a few suspicious things:

  1. We have a number of layers, and some of them are very big!
  2. Some layers are caused by some position value, like “position: fixed”
  3. Others are caused by “-webkit-overflow-scrolling:touch”

Let’s update the clues list

  • Glitches only display while scrolling in Safari?
  • The black flashing is from a bad browser paint?
  • Maybe related to images (since it’s easier to reproduce with that content)?
  • Possibly related to layer compositing issues?
    • We have many layers. Some of them are big
    • Some layers are caused by some position value, like “position: fixed” or “-webkit-overflow-scrolling:touch”

Refine the Search Using New Information

Using the information taken from inspecting the layers, one thing that jumped out at me was -webkit-overflow-scrolling, since this issue was only triggered on scroll (as far as we knew).

What can we do with that? Well, maybe let’s try toggling the value!

And so here we try to override this in dev tools, but with an unhappy surprise! It’s an unsupported property! But somehow a compositing reason? How rude! 

Well how about we change overflow on .interface-interface-skeleton__content?

https://github.com/WordPress/gutenberg/pull/32637

This works to get rid of the glitches, but it breaks the sidebars. We can’t go with that approach of course, but we now have more information!

Let’s update the clues list

  • Glitches only display while scrolling in Safari?
  • The black flashing is from a bad browser paint?
  • Maybe related to images (since it’s easier to reproduce with that content)?
  • Possibly related to layer compositing issues?
    • We have many layers. Some of them are big
    • Some layers are caused by some position value, like “position: fixed” or “-webkit-overflow-scrolling:touch”
    • -webkit-overflow-scrolling:touch can no longer be set or unset. This is added automatically when overflow is set to scroll or auto. In other words, any scroller now uses compositing with hardware acceleration and we can’t turn it off.

Hidden Assumption

Sometimes assumptions on our clues list are incorrect and can even limit our thinking. Let’s take a closer look at this one:

The black flashing is from a bad browser paint?

If true, this would almost 💯 be a browser bug to isolate and, ideally, file upstream.

This wasn’t the case! While debugging in the elements pane, I stumbled upon the fact that it was coming from a .edit-post-layout .interface-interface-skeleton__content parent element, and we could make it any color we pleased, like pink! The gray overlay was intended to be used to frame the tablet/mobile and template previews.

This wasn’t a graphics glitch, but possibly an incorrect ordering or compositing problem on scroll.

Let’s update the clues list

  • Glitches only display while scrolling in Safari?
  • The black flashing is from a bad browser paint? No: the background color is “bleeding” from .edit-post-layout .interface-interface-skeleton__content
  • Maybe related to images (since it’s easier to reproduce with that content)?
  • Possibly related to layer compositing issues?
    • We have many layers. Some of them are big
    • Some layers are caused by some position value, like “position: fixed” or “-webkit-overflow-scrolling:touch”
    • -webkit-overflow-scrolling:touch can no longer be set or unset. This is added automatically when overflow is set to scroll or auto. In other words, any scroller now uses compositing with hardware acceleration and we can’t turn it off.

Reduce the Problem

There’s a lot going on in the WordPress Block Editor. To help isolate the noise, two approaches can work in reducing a problem. We can start turning off larger pieces of logic in the app in exploratory PRs OR create a new base case mimicking conditions from scratch.

I opted to try and create a simple HTML/CSS test case, since I suspected it was still part of a browser bug. It’d be much easier to test our guesses in simpler markup and we’d need a simple test case to use for browser bug reporting in WebKit anyway.

The first attempt I came up with was this. I tried to pick out what I thought were the most important parts of the skeleton interface, along with what I hoped was enough test content to trigger the scrolling glitch.

I had partial success. I could trigger this on my large resolution monitor (and see it stop displaying the behavior when I set font-size back down to something more reasonable). Others, however, still couldn’t consistently reproduce it.

Spot the Difference

Another game I like to play when debugging is spot the difference, where we go piece by piece and make sure we note where a difference appears in one environment or another. This work can be a bit tedious and sometimes requires turning off your brain, similar to going through git bisect to test which commit caused a failure in production. Once we spot a difference, we can then dig in and question why that is.

So, as I was refining the test case to try and make it more reproducible for more folks, I noticed something. Do you see it?

One of the compositing layers was much bigger in size in WordPress Block Editor than in the test case! In the Block Editor one layer was as tall as all content in that pane! Meanwhile my test case shows a layer that is the size of the current browser viewport.
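Spotting this kind of difference by eye works, but it can also be done mechanically if we jot down the layer sizes from each environment. Here’s a small sketch of that idea. The LayerInfo shape, the diffLayers helper, and all of the sizes below are hypothetical (I’m not aware of devtools exporting layers in this form), so treat it as a way to organize notes rather than a real tool.

```typescript
// Hypothetical record of one compositing layer, copied by hand from the layers tab.
interface LayerInfo {
  name: string; // a label we choose ourselves, e.g. the element it belongs to
  width: number;
  height: number;
}

// Lists layers whose size differs between two environments, or that exist in only one.
function diffLayers(first: LayerInfo[], second: LayerInfo[]): string[] {
  const remaining: Record<string, LayerInfo> = {};
  for (const layer of second) {
    remaining[layer.name] = layer;
  }

  const differences: string[] = [];
  for (const a of first) {
    const b: LayerInfo | undefined = remaining[a.name];
    if (b === undefined) {
      differences.push(`${a.name}: only in first`);
    } else if (a.width !== b.width || a.height !== b.height) {
      differences.push(`${a.name}: ${a.width}x${a.height} vs ${b.width}x${b.height}`);
    }
    delete remaining[a.name];
  }
  for (const name of Object.keys(remaining)) {
    differences.push(`${name}: only in second`);
  }
  return differences;
}

// Made-up sizes: one layer is as tall as the content in the editor,
// but only viewport-sized in the test case.
const editorLayers: LayerInfo[] = [
  { name: "content scroller", width: 1584, height: 588 },
  { name: "block list wrapper", width: 1584, height: 12000 },
  { name: "sidebar", width: 280, height: 588 },
];
const testCaseLayers: LayerInfo[] = [
  { name: "content scroller", width: 1584, height: 588 },
  { name: "block list wrapper", width: 1584, height: 588 },
];

const layerDifferences = diffLayers(editorLayers, testCaseLayers);
console.log(layerDifferences.join("\n"));
```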

Let’s update the clues list

  • Glitches only display while scrolling in Safari?
  • The black flashing is from a bad browser paint? No: the background color is “bleeding” from .edit-post-layout .interface-interface-skeleton__content
  • Maybe related to images (since it’s easier to reproduce with that content)?
  • Possibly related to layer compositing issues?
    • We have many layers. Some of them are big. Our initial test case does not have such large layers; what could be causing it?
    • Some layers are caused by some position value, like “position: fixed” or “-webkit-overflow-scrolling:touch”
    • -webkit-overflow-scrolling:touch can no longer be set or unset. This is added automatically when overflow is set to scroll or auto. In other words, any scroller now uses compositing with hardware acceleration and we can’t turn it off.

Detailed Work

With that new clue in hand, I tried to make a much more accurate base case. I attempted to fully mirror the skeleton interface. It was a lot of divs 😭.

My second base case attempt was this, and a surprising item popped out of the methodical work. A simple div was the cause of the huge layer:

<div tabindex="0" style="position: fixed;"></div>

See how removing it gives us a much more reasonable layer size? With the scrollable browser pane at ~1584px x 588px with the test case, there was also around a 50MB difference in memory usage.
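As a back-of-the-envelope check on why layer size matters for memory, here’s a sketch that estimates the pixel buffer for a layer. It assumes 4 bytes per pixel (RGBA) and ignores tiling, compression, and anything else the browser does internally, so treat it as an order-of-magnitude guide rather than what Safari actually reports. The 8000px content height and the function names are made up for illustration; only the ~1584px x 588px pane size comes from the test case above.

```typescript
// Rough estimate only: assumes 4 bytes per pixel (RGBA) and no tiling or compression savings.
function estimateLayerBytes(cssWidth: number, cssHeight: number, pixelRatio: number = 1): number {
  const bytesPerPixel = 4;
  const deviceWidth = Math.round(cssWidth * pixelRatio);
  const deviceHeight = Math.round(cssHeight * pixelRatio);
  return deviceWidth * deviceHeight * bytesPerPixel;
}

// Formats a byte count as mebibytes with one decimal place.
function toMiB(bytes: number): string {
  return (bytes / (1024 * 1024)).toFixed(1) + " MiB";
}

// A layer the size of the scrollable pane vs. one as tall as the content.
const viewportSized = estimateLayerBytes(1584, 588);
const contentSized = estimateLayerBytes(1584, 8000); // 8000px is a made-up content height

console.log("viewport-sized layer:", toMiB(viewportSized));
console.log("content-sized layer:", toMiB(contentSized));
console.log("difference:", toMiB(contentSized - viewportSized));
```

With these assumed numbers, the content-height layer needs roughly 45 MiB more than the viewport-sized one at a pixel ratio of 1, and a pixel ratio of 2 would multiply both figures by 4. That is the same order of magnitude as the difference I saw, which makes a single oversized layer a believable culprit.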

Neat! I also opened a debug PR to see what would happen. In the Block Editor we insert this div in the content to aid in focus-related issues while scrolling. When we don’t allow it to be inserted, we can see that this also gives us a more reasonable layer size.

We have mixed results: I wasn’t able to recreate the text flickering issue anymore, but I could still see the background bleed on scroll in some cases.

And for no apparent reason, doing so also triggered a new fun glitch where the background color from .edit-post-layout .interface-interface-skeleton__content “bleeds” into the scrollbar element when selecting an image.

Let’s update the clues list

  • Glitches only display while scrolling in Safari?
  • The black flashing is from a bad browser paint? No: the background color is “bleeding” from .edit-post-layout .interface-interface-skeleton__content
  • Maybe related to images (since it’s easier to reproduce with that content)?
  • Possibly related to layer compositing issues?
    • We have many layers. Some of them are big. Our initial test case does not have such large layers; what could be causing it? Adding a div with position: fixed in the scrollable content will create a layer whose height equals the scrollable content height.
      • Removing the fixed div causes a different scrollbar glitch to appear. Removing the fixed div did not fix the background scroll flashing. Do we have more than one problem?
    • Some layers are caused by some position value, like “position: fixed” or “-webkit-overflow-scrolling:touch”
    • -webkit-overflow-scrolling:touch can no longer be set or unset. This is added automatically when overflow is set to scroll or auto. In other words, any scroller now uses compositing with hardware acceleration and we can’t turn it off.

Surfacing for Air

It is easy and normal to get frustrated or even stuck when debugging hard things. When that happens, pause and surface for air. One great thing to do is recap what we’ve done so far. Reviewing our clues list, summarizing the pieces, and getting input on what to try next and which assumptions to test works wonders to unstick us!

So after writing an internal post at Automattic (which contained content like this post, just stopping at this part) I circled back and tried to decompose the problem. I was pretty sure we were seeing multiple issues instead of just one.

Background Fixes

I tried to tackle the background “bleeding” on scroll first. While I had a good lead on the flickering text issue, at the time I didn’t know what functionality I had broken by deleting the position:fixed divs or what they were intended for. I asked for help in tracking that down and went back to the background flashing issue with fresh eyes.

One of the first things I attempted was changing what components were toggling the overlay for tablet and mobile previews. This wasn’t quite right.

After a lot of trial and error, and revising my clues list several times, I tried to move the background color to a child div that wasn’t a compositing layer! It worked.

https://github.com/WordPress/gutenberg/pull/32747

This was super simple in retrospect but took quite a bit of 🔍️ investigation to get there.

Back to Text Flickering

With that fix in hand I went ahead and rebased my debug PR with the background fixes. As a bonus, that scrollbar glitch was also fixed!

We already knew that removing the fixed divs would fix the text flickering, since doing so avoided creating a very large compositing layer on the block list wrapper.

Now it was a straightforward matter of understanding why the fixed divs were added, and providing an alternative implementation to retain functionality. (Spoiler: we used them to help prevent scrolling on tab).

https://github.com/WordPress/gutenberg/pull/32824

Isolate and Report

With limited time, it’d be pretty common to call it quits once we find a workaround for a browser bug. I was really curious about the root causes here too, so I wanted to isolate good test cases for WebKit bug reporting and hopefully get this fixed for others.

Doing this work took about as much time for me as it took to fix the problems in the WordPress Block Editor. I used all the techniques I noted before: reduce the problem, spot the difference, and lots of tedious detailed work.

I also bothered other folks to verify that they could reproduce the test case before reporting. Sometimes specific OS settings or different hardware are needed to trigger an issue. Aiming for a simple and easy-to-reproduce test case will usually lead to faster fixes in a project.

The Solutions

Background Flashing

We fixed this in the Block Editor by moving a background color to a div that was not a compositing layer.

https://github.com/WordPress/gutenberg/pull/32747

https://bugs.webkit.org/show_bug.cgi?id=227532

To isolate the problem, one missing ingredient was that it required very quick scrolling (usually from a mousewheel). WebKit maintainers also noted that the artifacts here are from tiled layer flashing (when a large layer is split up into smaller pieces to paint).

It should look like this:

Text Flickering

This was fixed in the Block Editor by removing two position:fixed divs, so we avoided creating a compositing layer that was very large.

https://github.com/WordPress/gutenberg/pull/32824

To isolate a test case, I had to brute force this one by turning off as much as I could in the editor, then performing a binary search on the styles to see which ones kept the glitch. The overall test case is a weird combination of needing flex styles, two fixed divs, an iframe and some extra z-index stacking contexts.

https://bugs.webkit.org/show_bug.cgi?id=227705
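The brute force search described above can be sketched as a small reduction loop: try removing chunks of “ingredients”, keep a removal whenever the glitch still reproduces, and shrink the chunk size until nothing more can be dropped. This is a simplified take on the general idea, not the exact procedure I followed. In real life the reproduces check was me reloading Safari and scrolling; here it’s a stand-in predicate, and the ingredient labels and the minimize name are purely illustrative.

```typescript
// Repeatedly tries to drop chunks of items while the bug still reproduces.
function minimize<T>(items: T[], reproduces: (subset: T[]) => boolean): T[] {
  let current = items.slice();
  let chunk = Math.ceil(current.length / 2);

  while (chunk >= 1) {
    let removedAny = false;
    let start = 0;
    while (start < current.length) {
      // Candidate = everything except the chunk starting at `start`.
      const candidate = current.slice(0, start).concat(current.slice(start + chunk));
      if (reproduces(candidate)) {
        current = candidate; // the chunk wasn't needed, so leave it out
        removedAny = true;
      } else {
        start += chunk; // the chunk is needed, so move past it
      }
    }
    if (chunk === 1) {
      if (!removedAny) break; // nothing else can be removed one at a time
    } else {
      chunk = Math.ceil(chunk / 2);
    }
  }
  return current;
}

// Illustrative stand-ins for things we might toggle in a test page.
const ingredients = [
  "flex styles", "custom font", "fixed divs", "box shadow",
  "iframe", "border radius", "z-index contexts", "large images",
];

// Stand-in for "reload Safari and check": the glitch needs all four of these.
const required = ["flex styles", "fixed divs", "iframe", "z-index contexts"];
const glitchReproduces = (subset: string[]): boolean =>
  required.every((item) => subset.indexOf(item) !== -1);

const minimalCase = minimize(ingredients, glitchReproduces);
console.log(minimalCase);
```

Because the glitch here depends on a combination of ingredients, a plain halving search for a single culprit wouldn’t find it, which is why the loop falls back to removing items one at a time.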

After I submitted the issue, WebKit maintainers quickly used the same technique to reduce what was needed to reproduce (https://bug-227705-attachments.webkit.org/attachment.cgi?id=432961 ✨). It turns out this was a duplicate of this regression, which had a patch, and there was an amazing turnaround of about a day to get that committed.

See also this comment from Simon Fraser on what was happening:

Compositing backing sharing logic exists to reduce the count of composited layers,
allowing layers that would otherwise get composited to paint into the backing store
of some containing block ancestor (usually a scroller). A "backing sharing sequence"
is a backing-provider layer, and a set of layers contiguous in z-order that can
paint into that shared backing. If a layer becomes composited, it must interrupt
the sequence (because layers later in z-order must render on top).

The bug occurred when a layer became composited between the calls to 
BackingSharingState::updateBeforeDescendantTraversal() and BackingSharingState::updateAfterDescendantTraversal(),
for example because of an indirect reason like overflow positioning...

Bonus: Safari Scrollbar Wrong Color

I happened to stumble on this one while debugging a black scrollbar on a test commit, but thought it would be interesting to isolate. This was fast for me to find through luck (half-a-day), since I’m inherently suspicious of extra z-index contexts.

https://bugs.webkit.org/show_bug.cgi?id=227545

Somehow the overflow controls container is getting behind the scroller layer. Triggered by the negative z-index child

Simon Fraser https://bugs.webkit.org/show_bug.cgi?id=227545#c3

There’s already a committed patch in WebKit for this one! 🎉

Summary

So when debugging hard problems, don’t despair! Remember that it’s okay to not know how things work. If we notice a gap in our knowledge, we can take some time to learn how things work.

If we work systematically and keep track of what we know and don’t know so far, we can work toward finding a fix, or help narrow down what the problem might be. Techniques like picking a clue to challenge or investigate, reducing the problem, spotting the difference, doing detailed work, and surfacing for air can help move an issue forward.

When time allows, isolating an issue and reporting upstream can both help others and deepen our own understanding of what’s actually happening in a layer we are unfamiliar with.

I’m happily employed by Automattic. If working on these kinds of problems excites you, check out our open jobs page!
