Some problems can be hard for us to debug because we think they’re too difficult to understand. Signs that this may be happening include making wild guesses, trying the same thing over and over again, or copying and pasting a solution we don’t understand in the hope that the problem will go away.
I’ve seen this in practice from many smart and capable folks (including myself!) with concepts like z-index (did you know there are multiple stacking contexts, not a single global one?), CSS specificity (it really is just counting!), spotting memory leaks, puzzling through concurrency issues, or trying to work around browser bugs.
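To make the two CSS asides concrete, here is a minimal sketch. All selectors are hypothetical and exist only for illustration, and the specificity comments ignore !important and cascade layers.

```css
/* Specificity is a count of (ids, classes/attributes/pseudo-classes, elements).
   Compare the counts left to right; the first larger number wins. */
#sidebar a   { color: red; }   /* (1, 0, 1) */
.nav .item a { color: blue; }  /* (0, 2, 1): loses to the rule above, 0 ids < 1 id */
a            { color: green; } /* (0, 0, 1): loses to both */

/* z-index only competes inside the same stacking context. */
.card {
  position: relative;
  z-index: 1;            /* positioned + z-index creates a new stacking context */
}
.card .tooltip {
  position: absolute;
  z-index: 9999;         /* only ranks against other descendants of .card */
}
.banner {
  position: relative;
  z-index: 2;            /* sibling of .card: paints above .card and its tooltip */
}
```

If a link matches both of the first two rules, it ends up red no matter which rule comes first in the stylesheet. Likewise, the tooltip can never climb above the banner, however large its z-index gets.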
So what can we do instead, when we notice something that’s hard for us to debug? Plenty!
- Take a pause with the beginner’s mindset. It’s okay to not know how something works. We can take some time to learn some of the basics to help us move forward.
- Remember that computers aren’t magic! 🪄 There’s a reason why this is happening. If we’re having trouble figuring it out, the cause might sit on a layer that we’re not familiar with (browser, compiler, DB, OS, API, network, a bad physical cable, and so on).
- Know that we can break big problems into smaller ones and work systematically to rule out or narrow down theories.
So that all sounds great, but how do we apply that?
Let’s take a closer look at what this process can look like in practice. The following is a real example of a browser bug I hunted down and some techniques I have found helpful.
A Real Example: Flickering Elements in Safari
One issue that piqued my interest while working full time on the WordPress Block Editor was some puzzling behavior in Safari when scrolling post content.
The text was flickering on image captions and there was black flashing when scrolling quickly.
A cursory search showed that most people simply promoted more elements to their own compositing layer (more on this below) via some CSS like transform: translate3D(0,0,0); and called it a day.
While it’s pretty tempting to copy and paste a CSS rule like that, what made me pause on accepting a PR that did just that was:
- We didn’t understand why this was happening.
- We didn’t understand why this might fix it.
- We didn’t understand what the consequences of doing so would be.
With that in mind, I dug in to try and provide an alternative solution. Here’s where I started:
Reproduce the Issue
A great first step when looking at a bug is to see if we can reproduce it.
Reproducing a bug lets us iterate on our theories and potential fixes in an ideally speedy test-a-fix and see-if-the-bug-is-still-there loop. If we can’t reproduce an issue, it doesn’t mean the problem doesn’t exist. It’ll just be much more difficult to iterate on. When things are not reproducible, sometimes we end up needing to test ideas by chatting with those who can reproduce the issue, or bulletproof and verify if an error or issue goes away with production instrumentation 😭.
Observe and Ask
In this Safari example, it was straightforward to reproduce. The bug report already had a few videos attached, so I could verify which editors were being used by 🔍 looking at the UX elements, and I could spot the types of block content being used in the post as a starting point (paragraphs, cover block, gallery block and more). Creating a quick test post and scrolling in Safari confirmed that this was an issue.
If there’s not enough information in a bug report, we can ask the reporter for more. Screenshots or full videos can help a lot too when folks don’t know how to describe a behavior or the technical terms for what’s actually happening.
Reduce What’s Needed to Reproduce
A bug report might have hidden assumptions or ideas on why something is happening. These assumptions aren’t always correct, so it helps to confirm or rule them out ourselves. One of the first things I did was verify that the same content wasn’t showing any visual issues in Firefox or Chrome.
If we can reduce what’s needed to reproduce an issue, this can also help speed up our testing loop. With the flickering elements, I narrowed down post content to contain a single cover block and gallery.
Great, so we can reproduce the issue! What’s next?
Assemble Our Known Clues
Like reading a murder mystery, it can help to keep our known facts or hints assembled together. It makes it easy to revisit when thinking of new ideas to investigate or if we need to go back and challenge our assumptions.
Initial Clues List
- The black flashing is from a bad browser paint?
- Maybe related to images (since it’s easier to reproduce with that content)?
- From quick internet searches: possibly related to layer compositing issues?
Pick a Clue to Investigate or Challenge
Okay, so searching on the internet for “Safari flickering scroll” gives us answers along the lines of: add this magic CSS rule that seemingly does nothing, transform: translate3D(0,0,0); (move this element nowhere in 3D space), or something like backface-visibility: hidden (toggling this value is invisible in 2D space), with zero explanation.
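For reference, the two commonly pasted snippets look roughly like this. The selectors are hypothetical, since the search results rarely agree on which element to apply them to.

```css
/* Hypothetical selector: stands in for whatever element is flickering. */
.flickering-element {
  /* Moves the element by zero on every axis, so nothing shifts visually. */
  transform: translate3d(0, 0, 0);
}

/* Hypothetical selector for the alternative suggestion. */
.another-flickering-element {
  /* Hides the back face of the element. An element that is never rotated
     in 3D never shows its back face, so nothing changes visibly. */
  backface-visibility: hidden;
}
```

Neither rule changes what the page looks like, which is exactly why it is worth asking what else they might be doing.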
Very suspicious, right?
At this point, it might be easy to give up and think “This is weird! Just paste that answer and move on with our lives!” but let’s remain curious and dig deeper. It’s okay if we’ve never encountered a browser bug like this before or never had to dig more deeply into understanding browser rendering.
Be persistent and keep searching (or asking folks) and refine our questions!
- With a follow-up search like “transform: translate3D(0,0,0);” we can see that we’re trying to get the browser to do something around hardware acceleration.
- Another search on hardware acceleration hints at a rendering process called compositing.
Bingo! There’s something new to learn or refresh our memories about! Let’s look at browser compositing.
A great way of gaining a basic understanding is finding multiple (hopefully reputable) resources, reading them, synthesizing them, and trying to explain the topic to someone else.
Here are the posts I used to put together the next section:
- https://developers.google.com/web/updates/2018/09/inside-browser-part3
- https://developers.google.com/web/fundamentals/performance/rendering/stick-to-compositor-only-properties-and-manage-layer-count
- https://www.chromium.org/developers/design-documents/gpu-accelerated-compositing-in-chrome
- https://www.smashingmagazine.com/2016/12/gpu-animation-doing-it-right/
Let’s give that explanation part a try!
Understand a System More Deeply: Compositing
Compositing is one of the last steps that a browser takes when turning a web page into pixels on your screen. It’s also an optimization over a more naive implementation.
Very broadly, a modern browser renderer process handles:
- Parsing: turning an HTML string into a Document Object Model (DOM), loading external resources (images, styles, scripts), and parsing and executing any JavaScript.
- Style: computing the style for each DOM node. (Which CSS rule won?)
- Layout: calculating where to draw nodes and how big they should be.
- Paint Order: what order should we paint elements? Think of how we might paint a real oil painting where we have some background mountains, a person, and a dog as our main subject. One method would be to paint back-to-front, drawing background elements first. Mountains, then the person, then the dog.
- Paint and Compositing: Determine how to group elements into layers, paint each layer to fill with pixels, and then draw or put together each layer in the right order for a final image. Let’s go over this in more detail below.
What Is Compositing?
While we might naively fill in pixels on our screen by painting each element in our viewport in paint order, this is slow. What if we break up parts of the page into their own layers that don’t change as much? Using our painting analogy, we might make a layer for the mountains, one for the person, and a last one for the dog. Like in cel animation, after we paint each layer, we can reposition each layer independently without needing to repaint the entire scene. For browsers this is very useful in scenarios like smooth animation or scrolling.
Determining what should be in a layer is non-trivial. These are internal implementation details that may vary by browser and change over time, but roughly, a browser will create a new layer for an element when:
- It has 3D or perspective transform CSS properties
- It is an <iframe>, <canvas> or <video> element
- It uses CSS animations or accelerated CSS filters
- It has a descendant that is a compositing layer
- It has a sibling with a lower z-index which has a compositing layer (in other words the layer overlaps a composited layer and should be rendered on top of it)
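As a rough illustration of the first few reasons in that list, the sketch below shows rules that commonly lead a browser to give an element its own layer. The selectors are hypothetical, and whether a layer is actually created is an implementation detail that differs between browsers and versions.

```css
/* 3D and perspective transforms */
.card-3d     { transform: translate3d(0, 0, 0); }
.card-tilted { transform: perspective(800px) rotateY(10deg); }

/* A CSS animation on a compositor-friendly property */
@keyframes spin {
  to { transform: rotate(360deg); }
}
.spinner { animation: spin 1s linear infinite; }

/* A filter, when the browser accelerates it */
.blurred { filter: blur(4px); }

/* Not in the list above: will-change is an explicit hint that browsers
   typically honor by promoting the element ahead of time. */
.about-to-move { will-change: transform; }
```

The layers tooling covered later in this post is the way to check what a given browser really did with rules like these.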
Layers can also get pretty large, which can waste resources if the browser viewport only intersects a small part of one. We can optimize for this by subdividing a layer in a process called tiling. Going back to our painting analogy, think of how we might portion out squares of a large wall mural, in some prioritized order, for multiple artists to draw.
In the browser, determining what layers to create and how to put them back together again is usually split out to a compositor thread, which may in turn create child threads to hand small pieces of work to the Graphics Processing Unit (GPU). The GPU is great at small tasks like painting pixels for polygons, hence the term hardware acceleration that gets thrown around.
On Performance
Another way of thinking about this is that each compositing layer acts as a pixel cache. This is of interest to us as web developers because some types of updates to a web page can skip parts of the expensive rendering process.
From most to least expensive:
- Layout Change: If we make an element bigger, smaller, or change its position on the page we can’t skip any steps of the rendering process. (Using cel animation as an analogy, think of needing to throw out all of our existing cels and needing to ink, color and reposition them.)
- Paint Change: If we update a paint property like background, or color, we can skip layout. (Using cel animation as an analogy, we can repaint the existing cels, then reposition them).
- Compositor Change: If we only update a compositor-supported property (transform or opacity), we can skip layout and paint. When done correctly we can see very smooth animations and scrolls. (Using cel animation as an analogy, no need to ink or repaint cels, we can simply reposition the layers for a different final image).
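The three kinds of change above can be sketched as three animations on the same box. The class and keyframe names are hypothetical; the point is only which property each animation touches.

```css
.box {
  position: relative; /* so that left has an effect */
  width: 100px;
  height: 100px;
  background-color: teal;
}

/* Layout change: animating left re-runs layout, paint, and compositing. */
@keyframes slide-with-layout {
  from { left: 0; }
  to   { left: 200px; }
}

/* Paint change: animating background-color skips layout but still repaints. */
@keyframes fade-with-paint {
  from { background-color: teal; }
  to   { background-color: coral; }
}

/* Compositor change: animating transform can skip both layout and paint. */
@keyframes slide-with-compositor {
  from { transform: translateX(0); }
  to   { transform: translateX(200px); }
}

.box.layout     { animation: slide-with-layout 1s ease-in-out infinite alternate; }
.box.paint      { animation: fade-with-paint 1s ease-in-out infinite alternate; }
.box.compositor { animation: slide-with-compositor 1s ease-in-out infinite alternate; }
```

The first and last animations look the same on screen, but a performance profile of each should show noticeably less work per frame for the transform version.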
Why can’t we make everything a layer?
We can’t make everything a layer since the tradeoff is memory use and overhead for managing each of those layers! Done haphazardly, we can make our webpage much slower, or even crash! Like with most things we should profile to make sure layers make sense and are kept in check.
Revisit Our Browser Tools
After we understand a system more deeply, let’s check to see if browsers have any tools to help track this down! It’s always great to double check what debugging tools are available to us since they make investigation work go by much more quickly.
Thankfully, Chrome and Safari both have a layers tab in devtools that lists these layers, their memory use, and why each layer was created.
Using the Layers Tool
So right out of the box, we can see a few suspicious things:
- We have a number of layers, some of them are very big!
- Some layers are caused by some position value, like position: fixed
- Others are caused by -webkit-overflow-scrolling: touch
Let’s update the clues list
- The black flashing is from a bad browser paint?
- Maybe related to images (since it’s easier to reproduce with that content)?
- Possibly related to layer compositing issues?
- We have many layers. Some of them are big
- Some layers are caused by some position value, like “position: fixed” or “-webkit-overflow-scrolling: touch”
Refine the Search Using New Information
Using the information taken from inspecting the layers, one thing that stood out to me was -webkit-overflow-scrolling, since this issue only triggered on scroll (that we know about).
What can we do with that? Well, maybe let’s try toggling the value!
And so here we try to override this in dev tools, but with an unhappy surprise! It’s an unsupported property! But somehow a compositing reason? How rude!
Well, how about we change overflow on .interface-interface-skeleton__content?
https://github.com/WordPress/gutenberg/pull/32637
This works to get rid of the glitches, but it breaks the sidebars. We can’t go with that approach of course, but we now have more information!
Let’s update the clues list
- The black flashing is from a bad browser paint?
- Maybe related to images (since it’s easier to reproduce with that content)?
- Possibly related to layer compositing issues?
- We have many layers. Some of them are big
- Some layers are caused by some position value, like “position: fixed” or “-webkit-overflow-scrolling: touch”
- -webkit-overflow-scrolling: touch can no longer be set or unset. This is added automatically when overflow is set to scroll or auto. In other words, any scroller now uses compositing with hardware acceleration and we can’t turn it off.
Hidden Assumption
Sometimes the assumptions on our clues list are incorrect and can even limit our thinking. Let’s take a closer look at this one:
The black flashing is from a bad browser paint?
If true, this would almost 💯 be a browser bug to isolate and, ideally, report upstream.
This wasn’t the case! While debugging in the elements pane, I stumbled upon the fact that it was coming from a .edit-post-layout .interface-interface-skeleton__content parent element, and we could make it any color we pleased, like pink! The gray overlay was intended to be used to frame the tablet/mobile and template previews.
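A quick way to run that kind of check is a temporary override in the devtools styles pane. This is a sketch of the idea rather than the exact rule I used; the !important is an assumption, added so the override wins over whatever rule already sets the background.

```css
/* Temporary debugging override: never meant to be committed. */
.edit-post-layout .interface-interface-skeleton__content {
  background-color: pink !important;
}
```

If the flash seen while scrolling turns pink as well, the color is coming from this element’s background and not from uninitialized pixels in a bad paint.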
This wasn’t a graphics glitch, but possibly an incorrect ordering or compositing problem on scroll.
Let's update the clues list
- The black flashing is from a bad browser paint? The background color is “bleeding” from .edit-post-layout .interface-interface-skeleton__content
- Maybe related to images (since it’s easier to reproduce with that content)?
- Possibly related to layer compositing issues?
- We have many layers. Some of them are big
- Some layers are caused by some position value, like “position: fixed” or “-webkit-overflow-scrolling: touch”
- -webkit-overflow-scrolling: touch can no longer be set or unset. This is added automatically when overflow is set to scroll or auto. In other words, any scroller now uses compositing with hardware acceleration and we can’t turn it off.
Reduce the Problem
There’s a lot going on in the WordPress Block Editor. To help isolate the noise, two approaches can work in reducing a problem. We can start turning off larger pieces of logic in the app in exploratory PRs OR create a new base case mimicking conditions from scratch.
I opted to try and create a simple HTML/CSS test case, since I suspected it was still part of a browser bug. It’d be much easier to test our guesses in simpler markup and we’d need a simple test case to use for browser bug reporting in WebKit anyway.
For my first attempt, I tried to pick out what I thought were the most important parts of the skeleton interface, along with what I hoped was enough test content to trigger the scrolling glitch.
I had partial success. I could trigger this on my large resolution monitor (and see it stop displaying the behavior when I set font-size back down to something more reasonable). Others, however, still couldn’t consistently reproduce it.
Spot the Difference
Another game I like to play when debugging is spot the difference, where we go piece by piece and make sure we note where a difference appears in one environment or another. This work can be a bit tedious and sometimes requires turning off your brain, similar to going through git bisect to test which commit caused a failure in production. Once we spot a difference, we can then dig in and question why that is.
So, as I was refining the test case to try and make it more reproducible for more folks, I noticed something. Do you see it?
One of the compositing layers was much bigger in size in WordPress Block Editor than in the test case! In the Block Editor one layer was as tall as all content in that pane! Meanwhile my test case shows a layer that is the size of the current browser viewport.
Let’s update the clues list
- The black flashing is from a bad browser paint? The background color is “bleeding” from .edit-post-layout .interface-interface-skeleton__content
- Maybe related to images (since it’s easier to reproduce with that content)?
- Possibly related to layer compositing issues?
- We have many layers. Some of them are big. Our initial test case does not have such large layers, what could be causing it?
- Some layers are caused by some position value, like “position: fixed” or “-webkit-overflow-scrolling: touch”
- -webkit-overflow-scrolling: touch can no longer be set or unset. This is added automatically when overflow is set to scroll or auto. In other words, any scroller now uses compositing with hardware acceleration and we can’t turn it off.
Detailed Work
With that new clue in hand, I tried to make a much more accurate base case. I attempted to fully mirror the skeleton interface. It was a lot of divs 😭.
In my second base case attempt, the methodical work made a surprising item pop out. A simple div was the cause of the huge layer:
<div tabindex="0" style="position: fixed;"></div>
See how removing it gives us a much more reasonable layer size? With the scrollable browser pane at ~1584px x 588px with the test case, there was also around a 50MB difference in memory usage.
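The ingredients of that finding can be sketched in CSS as follows. This is not the original test case: the class names are hypothetical, the content height is arbitrary, and a page this small may well not reproduce the oversized layer on its own.

```css
/* Hypothetical scroller, sized like the pane mentioned above. */
.scroll-pane {
  width: 1584px;
  height: 588px;
  overflow: auto; /* a scroller, which Safari composites */
}

/* Arbitrary height: anything much taller than the pane. */
.scroll-pane .tall-content {
  height: 5000px;
}

/* Stand-in for <div tabindex="0" style="position: fixed;"></div>
   placed inside the scrollable content. */
.scroll-pane .focus-capture {
  position: fixed;
}
```

The comparison to make in Safari’s layers tab is the size and memory of the scroller’s layer with and without the last rule applied.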
Neat! I also opened a debug PR to see what would happen. In the Block Editor we insert this div in the content to aid in focus-related issues while scrolling. When we prevent it from being inserted, we can see that it also gives us a more reasonable layer size.
We have mixed results: I wasn’t able to recreate the text flickering issue anymore, but I could still see the background bleed on scroll in some cases.
And for no apparent reason, doing so also triggered a new fun glitch where the background color from .edit-post-layout .interface-interface-skeleton__content also “bleeds” into the scrollbar element when selecting an image.
Let’s update the clues list
- The black flashing is from a bad browser paint? The background color is “bleeding” from .edit-post-layout .interface-interface-skeleton__content
- Maybe related to images (since it’s easier to reproduce with that content)?
- Possibly related to layer compositing issues?
- We have many layers. Some of them are big. Our initial test case does not have such large layers, what could be causing it? Adding a div with position: fixed in the scrollable content creates a layer whose height equals the scrollable content height.
- Removing the fixed div causes a different scrollbar glitch to appear. Removing the fixed div did not fix the background scroll flashing. Do we have more than one problem?
- Some layers are caused by some position value, like “position: fixed” or “-webkit-overflow-scrolling: touch”
- -webkit-overflow-scrolling: touch can no longer be set or unset. This is added automatically when overflow is set to scroll or auto. In other words, any scroller now uses compositing with hardware acceleration and we can’t turn it off.
Surfacing for Air
It is easy and normal to get frustrated or even stuck when debugging hard things. When that happens, pause, and surface for air. One great thing to do is recap what we’ve done so far. Reminding ourselves of what we’ve uncovered in our clues list, summarizing the pieces, and getting input on what to try next and which assumptions to test works wonders for getting unstuck!
So after writing an internal post at Automattic (which contained content like this post, just stopping at this part) I circled back and tried to decompose the problem. I was pretty sure we were seeing multiple issues instead of just one.
Workarounds
At this point, I was pretty sure that we were looking at at least one browser bug, perhaps several. The symptoms here were interesting enough to distill and report back to WebKit, but knowing if I could isolate a bug and see the fix upstream in a reasonable amount of time is typically out of my control.
As web developers, if functionality is important, we often need to pragmatically find a workaround where we typically re-implement the same functionality but avoid the browser bug, even if we’ve isolated the bug or have found a fix.
Background Bleeding on Scroll
The first workaround I focused on was the background “bleeding” on scroll. While I had a good lead on the flickering text issue, at the time I didn’t know what functionality I broke by deleting the position: fixed divs or what they were intended to be used for. I asked for help in tracking down that why and moved back with fresh eyes to the background-flashing issue.
I tried several things, like changing which components were toggling the overlay for tablet and mobile previews. Nothing I tried was quite right, and I still saw that color bleed. Taking a break, I looked at my clues list again.
Going back to my clues list:
- The black flashing is from a bad browser paint? The background color is “bleeding” from .edit-post-layout .interface-interface-skeleton__content
- Maybe related to images (since it’s easier to reproduce with that content)?
- Possibly related to layer compositing issues?
- We have many layers. Some of them are big. Our initial test case does not have such large layers, what could be causing it? Adding a div with position: fixed in the scrollable content creates a layer whose height equals the scrollable content height.
- Removing the fixed div causes a different scrollbar glitch to appear. Removing the fixed div did not fix the background scroll flashing. Do we have more than one problem?
- Some layers are caused by some position value, like “position: fixed” or “-webkit-overflow-scrolling: touch”
- -webkit-overflow-scrolling: touch can no longer be set or unset. This is added automatically when overflow is set to scroll or auto. In other words, any scroller now uses compositing with hardware acceleration and we can’t turn it off.
Well, it does look like the common thread here is the compositing layer. In this particular case, the background-color was also being set on a div that was a compositing layer.
I had been focusing so much on business logic. What would happen if we changed how the CSS rules were applied instead? If there was a browser bug, what if we moved the background-color rule to a div that wasn’t a compositing layer?
https://github.com/WordPress/gutenberg/pull/32747
Moving the background-color to a div that was not a compositing layer did the trick! The workaround PR was super simple in retrospect but took quite a bit of 🔍️ investigation to get there.
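The shape of that workaround can be sketched as below. This is not the actual diff from the PR: the inner wrapper class is hypothetical, the color is a placeholder, and overflow: auto on the skeleton content is an assumption based on it being the scroller.

```css
/* Before (shown as a comment): the background sat on the scroller,
   which is itself a compositing layer.

.edit-post-layout .interface-interface-skeleton__content {
  overflow: auto;
  background-color: #2f2f2f;
}
*/

/* After: the scroller keeps scrolling but no longer paints the background. */
.edit-post-layout .interface-interface-skeleton__content {
  overflow: auto;
}

/* Hypothetical non-composited element that now carries the background. */
.edit-post-layout .hypothetical-inner-wrapper {
  background-color: #2f2f2f; /* placeholder color */
}
```

Before relying on a change like this, the layers tab is the place to confirm that the element receiving the background has not become a compositing layer itself.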
Back to Text Flickering
With the background bleeding fix in hand I went ahead and rebased my debug PR with the background fixes. As a bonus, that scrollbar glitch was also fixed!
We already knew that removing the fixed divs would fix the text-flickering and doing so avoided creating a very large compositing layer on the block list wrapper.
Now it was a straightforward matter of understanding why the fixed divs were added and providing an alternative implementation to retain functionality. Thanks to others we learned that the divs were used to help prevent scrolling on tab. The workaround needed here was to remove these fixed divs and re-implement the scrolling on tab behavior in a different way:
https://github.com/WordPress/gutenberg/pull/32824
Isolate and Report
At this point all issues were resolved in Gutenberg with the workarounds that landed.
With limited time, it’d be pretty common to call it quits once we find a workaround for a browser bug. I was really curious about the root causes here too, so I wanted to isolate good test cases for WebKit bug reporting and hopefully get this fixed for others.
Doing this work took about as much time for me as it took to fix the problems in the WordPress Block Editor. I used all the techniques I noted before: reduce the problem, spot the difference, and lots of tedious detailed work.
One of the goals of isolation is to find the smallest test case possible that reproduces the behavior. If you have a good instinct on where the problem might lie, it might be faster to start from scratch and reproduce the conditions agnostic to your app. If not, it can make sense to start with where you can reproduce an issue, then try to chisel away as much as you can.
I also asked others to verify if they could reproduce a test case before reporting. Sometimes different operating system settings, or different hardware can be needed to trigger an issue. Aiming for a simple and easy to reproduce test case will usually lead to faster fixes in any project.
The Solutions
Background Flashing
We fixed this in the Block Editor by moving a background color to a div that was not a compositing layer.
https://github.com/WordPress/gutenberg/pull/32747
https://bugs.webkit.org/show_bug.cgi?id=227532
To isolate the problem, one missing ingredient was that it required very quick scrolling (usually from a mouse wheel). WebKit maintainers also noted that the artifacts here are from tiled layer flashing (when a large layer is split up into smaller pieces to paint).
It should look like this:
Text Flickering
This was fixed in the Block Editor by removing two position: fixed divs, so we avoided creating a compositing layer that was very large.
https://github.com/WordPress/gutenberg/pull/32824
To isolate a test case, I had to brute force this one by turning off as much as I could in the editor, then performing a binary search on the styles to see which ones kept the glitch. The overall test case is a weird combination of needing flex styles, two fixed divs, an iframe and some extra z-index stacking contexts.
After I submitted the issue, WebKit maintainers quickly reduced what was needed to reproduce it even further (https://bug-227705-attachments.webkit.org/attachment.cgi?id=432961 ✨). It turns out this was a duplicate of a regression that already had a patch, and there was an amazing turnaround of about a day to get that committed.
See also this comment from Simon Fraser on what was happening:
Compositing backing sharing logic exists to reduce the count of composited layers,
allowing layers that would otherwise get composited to paint into the backing store
of some containing block ancestor (usually a scroller). A "backing sharing sequence" is a backing-provider layer, and a set of layers contiguous in z-order that can paint into that shared backing. If a layer becomes composited, it must interrupt the sequence (because layers later in z-order must render on top).
The bug occurred when a layer became composited between the calls to
BackingSharingState::updateBeforeDescendantTraversal() and BackingSharingState::updateAfterDescendantTraversal(),
for example because of an indirect reason like overflow positioning...
Bonus: Safari Scrollbar Wrong Color
I happened to stumble on this one while debugging a black scrollbar on a test commit, but thought it would be interesting to isolate. This was fast for me to find through luck (half-a-day), since I’m inherently suspicious of extra z-index contexts.
Somehow the overflow controls container is getting behind the scroller layer. Triggered by the negative z-index child
- Simon Fraser https://bugs.webkit.org/show_bug.cgi?id=227545#c3
There’s already a committed patch in WebKit for this one! 🎉
Summary
So when debugging hard problems, don’t despair! Remember that it’s okay to not know how things work. If we notice a gap in our knowledge, we can take some time to learn how things work.
If we work systematically and keep track of what we know and don’t know so far, we can work toward finding a fix, or help narrow down what the problem might be. Techniques like picking a clue to challenge or investigate, reducing the problem, spotting the difference, doing detailed work and surfacing for air can help move an issue forward.
When time allows, isolating an issue and reporting upstream can both help others and deepen our own understanding of what’s actually happening in a layer we are unfamiliar with.