So, You Want an Accessibility Score with Karl Groves

In this presentation, Karl Groves, Founder and President at Tenon, spoke about one of the bigger challenges when it comes to managing accessibility. How do you create a score in a way that accurately reflects how usable a product is for users with disabilities?

Thanks to Our Sponsors

Empire Caption Solutions strives to create inclusive experiences and engage individuals with different abilities and backgrounds by providing high-quality accessibility services for recorded media, such as closed captions, transcriptions, Audio Description, and ASL interpretation. By utilizing both the latest technology and human expertise, ECS is able to help its clients meet WCAG 2.1 success criteria and ADA compliance while offering options that fit almost any budget.

Bet Hannon Business Websites is a full-service WordPress agency, providing design, development, managed hosting and content management services. They have a special focus on website accessibility and offer accessibility audits as well as remediation assistance. BHBW also offer services related to integrating WordPress with other platforms and building out complex Gravity Forms implementations.

Watch the Recording

If you missed the meetup or would like a recap, watch the video below or read the transcript. If you have questions about what was covered in this meetup please tweet us @EqualizeDigital on Twitter or join our Facebook group for WordPress Accessibility.

Read the Transcript


>> AMBER HINDS: Welcome, everyone. We’re at 10:05, so I’m going to officially get started. Feel free to continue introducing yourself in the chat if you’d like. Hopefully, everyone can see my screen, someone shout out at me if you can’t because sometimes I think I’m sharing my slides and I’m actually not. A few announcements before we get started. The first one if you haven’t been before, we have a Facebook group. If you look on Facebook for WordPress Accessibility, you’ll find our Facebook group. It’s a great place to connect with people between meetups, ask questions, get information.

I saw someone recently was posting in there because they were having trouble getting the CSS to work to add underlines to all of their links. They asked for some help. I posted in there recently about an issue with Beaver Builder and I had like three people be like, “Oh, here’s the JavaScript to fix that.” Why am I requesting remote control of my own screen? I don’t even know why I got a Zoom message about that. We’ll see how this goes. Anyway, the Facebook group is a great resource in between meetups, so feel free to join that and connect between meetups.

If you are looking for recordings, this meetup is recorded. It takes us about a week to correct the captions, and then get the video posted with corrected captions and a full transcript but it will be available on our website if you go to equalizedigital.com/meetup. That’s a frequently asked question. People always ask how they can get it and we generally will also post about it on the meetup page. You’ll get a notification from meetup if you RSVP’d on meetup and we’ll also post about it in the Facebook group when it is available. You can also go to that same page and watch all of the recordings from all of the past meetups as well, including a different one that Karl gave for us earlier in the year.

If you are interested in staying on top of WordPress Accessibility news, and getting email reminders about the meetups, and getting email notifications when the recordings are available, please join our email list. It’s at equalizedigital.com/focus-state. We send out emails about twice a month if we’re on top of our game, and it’s a great resource and it is more accessible with headings and things like that than the emails that meetup sends out.

We do require sponsors to help us cover the cost for this meetup because we’ve committed to always having live captioning at our meetups. We have openings for sponsors for this daytime meetup starting I think next month. If anyone is interested in sponsoring the meetup or if their company is interested in sponsoring the meetup, please reach out to us and let us know. We will have captions regardless, because otherwise we’ll pay for it ourselves, but it is helpful if the community can help us as well.

If you have any suggestions for the meetup or you would be interested in speaking, we are actively looking for speakers for over the summer and early fall right now. Please email us at meetup@equalizedigital.com. That goes to myself and Paula. That’s the best way to get a hold of us and please let us know if there’s anything we can do to make the meetup work better for you. Those are most of my announcements. I have not really introduced myself but if you haven’t been before, my name is Amber Hinds. I’m the CEO of a company called Equalize Digital.

We are a certified B Corporation and part of organizing this meetup is how we meet one of our goals and objectives as a company to create more education and resources out there. I will also say I do it somewhat selfishly because I love to hear from our speakers and learn from them and this was a great way to get them to come and answer all the questions that I have. We have a plugin called Accessibility Checker, which you can find on the wordpress.org repo, that scans for some of the problems, the ones that can be identified automatically. Not all problems can be but it does scan for some and that’s available if you’re interested.

Today, we have two sponsors. Our first is Bet Hannon Business Websites. Bet Hannon Business Websites has covered the cost of our live captioning for today so thank you so much to Bet and her team, we really appreciate it. Bet Hannon Business Websites is a full-service WordPress agency providing design, development, managed hosting, and content management services. They have a special focus on website accessibility and they offer accessibility audits as well as remediation assistance. I think they’ve also helped some plugin developers and things like that so it’s not always just limited to websites.

They also offer services related to integrating WordPress with other platforms and they have a specialty with complex Gravity Forms implementations. Let’s see. Please check out, learn more about Bet Hannon Business Websites, we always recommend that you tweet a thank you to our sponsors to help encourage them to continue wanting to sponsor so if you’re interested, you can find that on Twitter, @bethannon.

Then, our other sponsor for today is Empire Caption Solutions. They have very kindly donated their services for our post-event transcription. They have offered to transcribe all of our videos so that we can have accurate captions after the fact, and we greatly appreciate that. You can get them at Empire Caption Solution on Twitter. We have a couple of upcoming events. We will have a talk from Joe Dolson about how you can contribute to the WordPress Accessibility team on Monday, June 20th and then on Thursday, July 7th, Dacey Nolan and Alex Zlatkus will be talking.

Dacey is going to do a 15-minute talk at the beginning about her experience as a person with epilepsy and how people can make the web work better for her and then the two of them will be giving a presentation on building accessible design systems. Then a quick save the date, WordPress Accessibility Day is a 24-hour conference. It is going to be on Wednesday, November 2nd. You can learn more about it at wpaccessibility.day. We are actively seeking sponsors for that event. Go there and you can get more information. Also, there’s a call for volunteers up as well.

Hopefully, we will have no more interruptions because I am super excited to have our speaker, Karl Groves, here today. Karl is Chief Innovation Officer at Level Access and he has nearly two decades of experience doing IT consulting for the biggest companies in the world and the biggest agencies for the US government. He has previously spoken for us and has done a phenomenal job. I always feel like I’m learning from him on social media and all the things so I decided to have him here. I am going to stop sharing.

>> KARL GROVES: Super happy to be doing this one. This is a topic that is actually very, very interesting to me. In particular, I’ve been doing this stuff for a really long time. For those who don’t know me, my name is Karl Groves. Amber has already done a really good introduction but I’ve been doing accessibility consulting for about 20 years now. There’s always this question from customers like, “Can you give us a grade?” Really, the true answer is, “Well, if you have accessibility problems, then you fail.” That’s not a message that most people want to hear. Most people want to hear that they’re doing well. They don’t want bad news.

Most people just don’t want bad news and unfortunately, when it comes to web accessibility consulting, the news is mostly bad. If somebody is new to accessibility, then it’s always going to be bad because people just don’t think about it upfront and it’s always just going to be a bad time. This question in terms of can we get a grade, can I get an accessibility score? This topic was raised in my mind after several customers asked me if Tenon can give a grade. Tenon philosophically is a product whose sole job is to find problems, not to give a grade.

That’s one thing that was something that I had to mention to people, which was, it is a diagnostic tool. It’s not a program management tool. There’s other tools out there, Level Access, for instance, has a product called AMP. AMP literally stands for accessibility management platform. It is a more all-encompassing performance, enterprise performance, and governance type of platform. We have a new product called Elevin. Elevin is again a much more enterprisey kind of tool and that has capabilities for scoring and things like that.

For me, when it came to Tenon and its philosophy on finding problems, this question about getting a grade is really perplexing in a lot of ways, because what is a grade going to be based on? Like I said, some of my background images have meaning behind them. This background image is a picture of a person assembling an engine, a car engine.

I mentioned to Amber before we started that I’m a gearhead. As a matter of fact, once I retire I’m starting a hot rod shop. There’s a tool placed on top of the engine block, it’s a metal bar bolted to the block and its job is to find what’s called top dead center. Top dead center, especially top dead center on cylinder one, is how you know how to time the engine’s ignition and all this other stuff. What are we going to base our grade on? That’s question number one. Getting a grade for something is really actually simple.

You divide the passed things, passed as in the things that are good, by the total things, multiply that quotient by 100, then apply your standard North American grading scale to it. A is 90 to 100, B is 80 to 89, C is 70 to 79, so on and so forth. If you subject your website to 20 accessibility tests and you pass 15 of those tests, you get a 75, which is a C grade and you’re done. That is actually the answer to how to score something for accessibility, at least when it comes to an automatic testing tool. Depending on your manual methodology, this idea works there too.
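The arithmetic described here can be sketched in a few lines of Python. This is only an illustration: the function name `letter_grade` is hypothetical, and the D and F cutoffs are an assumption that extends the "so on and so forth" in the usual way, not something any tool mentioned in the talk is confirmed to implement.

```python
def letter_grade(passed: int, total: int) -> tuple[float, str]:
    """Return (percentage, letter) for `passed` tests out of `total` tests."""
    if total <= 0:
        raise ValueError("total must be a positive number of tests")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")

    # Passed things divided by total things, expressed as a percentage.
    score = 100 * passed / total

    # A, B, and C cutoffs come from the talk; D and F are assumed.
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= cutoff:
            return score, letter
    return score, "F"


# The example from the talk: 20 tests, 15 passed.
print(letter_grade(15, 20))  # (75.0, 'C')
```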

There’s a number of flaws with most manual testing methodologies that I’ll talk about as we go through this. That’s really my answer. Like if you’re not going to listen to anything else, if you just want the answer to how to get a grade, it’s passed items, total items, divide those, you’re done. There’s a lot of things other people talk about when they talk about grading, and so you’ll hear people, when they’re talking about this concept of applying a grade, who have other things that they want to think about when they talk about the grade.

One of those things could be relevance. Background image here is a protest poster someone’s carrying, and the words on the protest poster say, “I can be persuaded by a logical argument.” Thinking about relevance actually is an interesting one. Most, if not all, automatic testing tools are unable to give a reliable score because they don’t track anything but failures and this goes back to the Tenon philosophy, but it’s also true for almost all other tools out there. If you think about it, they’re not telling you what is good. They’re telling you what is bad. They’re telling you they’ve found an issue. They found a problem.

There is no concept of relevance in terms of performance. There’s no concept of passing other than by virtue of not failing. You see this a lot when people are talking about accessibility, they’re saying, “I want a clean score from WAVE or a clean score from aXe,” or whatever. We got no issues. Just because an automatic testing tool didn’t find any issues doesn’t mean that you haven’t gotten any issues.

What it means is that you’ve got no failures from the things that that thing tests. A passed condition is created by either not failing the existing tests or the tests not being relevant. This is actually why Tenon doesn’t give a grade: it doesn’t track what tests are relevant or passed, it just tracks failures. While there is a value in getting a score based on the extent or lack thereof of accessibility errors, it lacks that context. Getting a really useful score requires knowing a couple of things.

First off, of the tests that were relevant, which ones passed and which ones failed. You really actually have to keep track of what tests were relevant in the first place. Again, to maybe simplify this, let’s say, we’re talking about a testing tool that has tests for tables. You have no tables and therefore those tests are not relevant. You can neither pass nor fail those things. This is actually something that I’ve had tough conversations about with customers that say, “Well, an irrelevant thing is a pass because it didn’t fail,” and I’m like, “No, that’s really spurious logic. An irrelevant thing can neither pass nor fail because it doesn’t meet the criteria to do either thing.”

To use a computer programming analogy, I think an irrelevant test is something that’s null. For anybody who’s done any programming, JavaScript, PHP, whatever, an irrelevant thing is null because it can be neither true nor false. That’s how I think about this. We actually built this capability into Mortise.io, which is the Tenon company’s manual testing tool. Each test has specific criteria and those criteria determine if that test is even applicable in the first place. It provides specific instructions for testing whether it’s applicable and whether it’s passed or failed.
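The "irrelevant is null" idea maps naturally onto a three-state result. Below is a minimal Python sketch of it, using `None` for the null case. The function `check_table_headers` and its parameters are hypothetical stand-ins for a table test like the one described above; this is not how Mortise.io or any other tool is known to model results.

```python
from typing import Optional

# True = passed, False = failed, None = not relevant (the "null" case).
TestResult = Optional[bool]


def check_table_headers(page_has_tables: bool, headers_have_scope: bool) -> TestResult:
    """Hypothetical table test that only applies when the page has tables."""
    if not page_has_tables:
        return None  # the applicability criteria are not met: neither pass nor fail
    return headers_have_scope


results = {
    "page without tables": check_table_headers(False, False),
    "table with scoped headers": check_table_headers(True, True),
    "table missing scope": check_table_headers(True, False),
}

labels = {True: "pass", False: "fail", None: "not relevant"}
for name, outcome in results.items():
    print(f"{name}: {labels[outcome]}")
```

Keeping the third state explicit is what later makes it possible to divide passed tests by relevant tests instead of by all tests.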

I believe at least in terms of scoring, this relevance part is pretty important and it’s important to have that knowledge before we even think about any grading scheme. This is one of my favorite accessibility fail pictures. It is a picture, unfortunately highly pixelated at this resolution, but there’s a set of stairs. There’s a ramp on it. It’s going at at least a 45-degree angle, very high, clearly not a viable ramp, probably really fun to go down, well, until you get to the bottom, impossible to get up.

Anyway, the text on the slide says, “What about user impact?” This is actually an interesting one. The argument made for factoring in user impact is basically this: a raw pass versus fail score is fine if everything that we’re testing for has the same impact, but we know, and all of us who have done accessibility for any length of time know, accessibility’s different. Some things have very different levels of impact for different users and this is extremely hard to do with automation.

A lot of people know me for a lot of my content out there on overlays. As I often say in the context of overlays, it’s easy to find images that have no text alternatives. Very easy to find, there’s an algorithm on the W3C website about accessible name calculation. You can program something to find the accessible name of something, following that algorithm very easily. That’s really easy and that’s a Boolean pass-fail thing. It’s really hard to determine whether a text alternative is accurate and informative.

As a matter of fact, I served as an expert witness on a legal case called Murphy versus Eyebobs. One of the big things about Murphy versus Eyebobs is that the defendant in the case, which is Eyebobs, their lawyer was trying to argue that we use AccessiBe and because we use AccessiBe, we are compliant. What we found in our research on AccessiBe’s capabilities is that it uses, I don’t know what library or what system it uses for its image recognition, but it uses some image recognition software. It was often wrong.

When we were looking through the machine-generated text alternatives that were provided, a lot of them were completely wrong. Sometimes they got the content of the image right, but not the context. It’s really, really, really hard to determine whether a text alternative, if it’s supplied, is correct in the first place, but then there’s this whole issue of if the text alternative is wrong, how wrong is it? If we’re gauging our score based on user impact, how bad is it that it’s wrong?

For instance, if it’s a picture of a product and they describe the product, but they get the color of the product wrong, how big of a deal is that versus whether the text alternative is wrong enough that the user is missing important information, it’s not conveyed any other way on the page and so on and so forth. That’s a different story. In one case, we’re talking about nuisance, and in the other case, we’re talking about the complete failure of a core system task.

Another thing is that some issues impact multiple user types, and those impacts may vary. A missing label on a form field or a missing programmatic association for that label could cause an impact for a person on voice dictation software, but they could use the mouse grid or something like that, whereas for a person who’s blind it could screw things up completely. We have a high impact for one population, and a medium to low impact for another population. How does that weigh in? For us, with Tenon and Mortise, what we do is we actually factor some of that stuff into the prioritization scoring.

The prioritization scoring that we use has a number of factors that contribute to the prioritization score and the severity of impact is combined into the prioritization, not the actual score, so to speak. Priority is simply a measure of the urgency with which you really want to fix the issue. In other words, you fail and it’s how badly are you failing? That being said, I remain open to the idea that severity should be its own metric, but I still don’t know how to apply that to an accessibility grade. That brings its own set of challenges. What about volume? The picture here is actually a picture of a tulip field in the Netherlands. I’m assuming millions upon millions of red tulips are in this picture. Okay.

The text on the slide here says, “Wait, what about volume?” At its most basic, the more issues a system has, the lower its quality. This is not unique to the web and it’s not unique to web accessibility, this is actually a metric that’s been pretty well tracked traditionally in software QA since before the web even existed. Basically in the context of accessibility, the higher the number of accessibility issues a system has, the lower its accessibility grade should be. It’s a lower-quality system. The difference here is that, on the web, raw issue count isn’t really useful without additional context.

This is where the concept called defect density comes in. It takes into consideration the number of issues in the code versus the size of the page. Tenon was actually the first accessibility testing tool to provide this metric, but I didn’t come up with it myself. Again, it’s been a traditional QA thing for a long time. In traditional QA, defect density is the number of issues per 1,000 lines of code, which they call KLOC. In Tenon, it’s the number of issues per kilobyte of code. Because websites have many blank lines and a lot of white space, what Tenon does is it collapses all the white space and uses kilobytes of the resulting source code as the comparison.
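A rough Python sketch of that idea follows. Tenon's exact formula is not given in the talk, so everything here is an assumption made for illustration: the function name `defect_density`, collapsing each run of whitespace to a single space, measuring size in UTF-8 bytes, and treating a kilobyte as 1,024 bytes. It also does not try to reproduce the percentage figures quoted in the talk.

```python
import re


def defect_density(issue_count: int, source: str) -> float:
    """Issues per kilobyte of source after collapsing whitespace runs."""
    # Assumption: every run of whitespace (including blank lines) becomes one space.
    collapsed = re.sub(r"\s+", " ", source).strip()
    size_kb = len(collapsed.encode("utf-8")) / 1024
    if size_kb == 0:
        raise ValueError("source is empty after collapsing whitespace")
    return issue_count / size_kb


# Two made-up pages: one small, one 25 times larger.
simple_page = "<html>\n\n  <body>\n    <p>Hi</p>\n  </body>\n</html>" * 20
complex_page = simple_page * 25

# Same raw issue count, very different density.
print(f"simple page:  {defect_density(100, simple_page):.1f} issues per KB")
print(f"complex page: {defect_density(100, complex_page):.1f} issues per KB")
```

With the same 100 issues, the smaller page comes out with a much higher density, which is the point of normalizing by page size.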

The logic for defect density is pretty straightforward. A simple webpage with a lot of issues is worse than a complex webpage with the same number of issues, so when I talk about this with customers, what I say is, imagine that you test the Google homepage. The Google homepage has like a logo, a search form, buttons in the top and bottom corners, there’s other links and settings menus and stuff like that but feature-wise, super, super simple page. We test that and we get 100 issues, and then we take the msnbc.com webpage and we get 100 issues. Msnbc.com obviously is much more complex, with lots more content and lots more code, of course.

100 issues on Google, 100 issues on MSNBC: if we just count raw issues, of course, they’re going to seem the same, they’re going to seem equal, but we know for a fact that that’s not the case. If there’s 100 issues on the Google homepage, that means very simply that it’s a worse page, because it’s a much simpler page. In practice, what we’ve seen is a strong correlation between density and usability. Pages that exceed 50% density on Tenon are found to be more difficult for users to deal with in the real world. As density increases, so does the likelihood that users are going to be completely unable to use the content and features on that page.

This, in my opinion, actually begs the question as to whether density is really the actual true metric that we should measure a grade on. Picture on the background of this slide is apparently a stuntman who is on a dirt bike riding through or has ridden through something on fire and he himself is also on fire. The text on the slide says, “Wait, what about comparing the norm?” We hear a lot of that in the accessibility consulting field from people who are like, “Well, how do we compare to our peers?” I’ve heard that a ton from e-retailers for sure. One retailer would be like, “How do we compare against–” and they would name their direct competitors, or at least who they see as a competitor.

At this point, the Tenon product has assessed millions of pages on the web and logged tens of millions of issues across those pages. This is actually more than enough data for us to calculate any data point that we want with a statistically significant sample size, a confidence level of 99%, and a confidence interval of one. What I’m saying there is that we can gather any statistics that we want, and it will be an accurate comparison against the web as a whole. One way to do this comparison thing is to provide a grade based on the norm. In other words, a comparison against all the other pages that have ever been tested.
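To put "more than enough data" in perspective, here is a small Python sketch of the standard sample-size formula for estimating a proportion, n = z² · p(1 − p) / e². It assumes that "a confidence interval of one" means a margin of error of plus or minus one percentage point, and it uses the most conservative proportion, p = 0.5. The function name `required_sample_size` is hypothetical, and the talk does not say this is the calculation that was used.

```python
import math
from statistics import NormalDist


def required_sample_size(confidence: float, margin: float, p: float = 0.5) -> int:
    """Minimum sample size for estimating a proportion in a very large population."""
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z * z * p * (1 - p) / margin ** 2)


# 99% confidence, plus or minus 1 percentage point.
print(required_sample_size(0.99, 0.01))
```

Under those assumptions the answer is roughly 16,600 pages, far fewer than the millions of pages mentioned above.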

A common example of this could be basically considered like grading on a curve in college. Unfortunately, the normal webpage is really bad so the average number of errors across the web is 83 errors per page. Tenon is a little bit unique in the way it does its testing. We count individual instances of issues. Let’s say we have a table that has 10 columns, and none of those 10 columns has a scope attribute on them. We would log 10 issues. Other tools might log one issue and say, “This table is messed up.”

The average number of errors per page is 83, and the average density is 15%. What this means, this average density of 15% suggests that most pages on the web are kind of crappy. When it comes to grading for accessibility, it doesn’t really seem useful to base a grade on the norm, when the norm itself is just not accessible, so what about scope? This image doesn’t mean anything. What about scope? There’s several layers to consider in a scoring scenario in terms of the scope of the thing we’re measuring. We’ll talk about three different layers. The component, that’s the individual feature of a page or an application screen such as its navigation.

There’s the page, that’s going to be the entire page or application screen and all of its individual components or you can also call this a view. Then there’s the product, and that’s the entire collection of pages or screens that make up the product. Getting a grade on a component is actually really useful, I think. At least, determining the urgency with which you need to make repairs. Getting a grade on a page is less useful without any specific means to identify the value of the page. In other words, a per-page grade is pretty simple, but getting an A grade on an inconsequential page is less useful than getting an A grade on a page that sees the most traffic from users or includes specific features or documentation for people with accessibility concerns. At the page level, what tends to happen is that the cumulative grade of the product could be impacted either too high or too low by outliers that skew the results in one direction or another.

For instance, a great example of this would be pages that are just text, blog posts, or documentation, or something like that. Let’s face it. People do screw that stuff up, but by volume, you’re only going to get things like color contrast, headings, that sort of stuff out of mostly text pages, whereas stuff that’s interactive is going to have a lot more likelihood of having errors anyway. Identifying the relative importance of a page can be useful, but what I’ve seen or what I feel in practice is that that’s actually probably still better left as part of the priority scoring for those issues.
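The skew from mostly-text pages is easy to see with made-up numbers. In the Python sketch below, every page name and score is invented for illustration: nine text-heavy pages score well, one interactive page scores badly, and a plain average across pages hides the problem.

```python
from statistics import mean

# Hypothetical per-page scores (percent of relevant tests passed).
page_scores = {f"blog-post-{n}": 95.0 for n in range(1, 10)}
page_scores["checkout"] = 40.0  # the one highly interactive page

product_average = mean(page_scores.values())
worst_page = min(page_scores, key=page_scores.get)

print(f"product average: {product_average:.1f}")  # 89.5
print(f"worst page: {worst_page} at {page_scores[worst_page]:.1f}")
```

The product-level number looks like a high B even though the page where users actually complete a task is failing badly.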

In other words, we’re going to deprioritize low traffic, low importance, low interactivity pages from the prioritization and increase the priority for the other things. In other words, I don’t think scoping to a page is really relevant towards a grade. Probably scoping the grades for components is more useful. This background image has three slots for binders. The one on the left is numbered 403. The one on the right is labeled 405. That means 404 is not found.

The text on this slide says, “No, but really, what about relevance?” I want to harp on this one a little bit more, and the reason why is that relevance matters especially in terms of automatic testing. Well, actually manual also applies here as well. No matter how we’re running this assessment, the relevance of the grade is directly tied to the completeness and relevance of the test set. The really simple one in terms of automation is the more tests you have, the more complete your coverage is going to be.

Of course, as anyone who’s involved in automatic testing in general, unit testing and stuff like that, knows, testing irrelevant stuff or testing inconsequential stuff is a problem. You don’t want to just sit there and have a pile of completely irrelevant tests. Assuming, however, we can have a very large set of relevant tests for our system, the more the merrier. It pays to use a product that has a large number of tests.

If you’re, for instance, basing your score on something you get out of Lighthouse, Lighthouse is good and the tests within Lighthouse are very good, but there’s not a lot of them. They’ve chosen to use a subset of the aXe tests, and even aXe doesn’t have as many tests as say AMP or Elevin or Tenon. That’s a real important thing to mention, but you can close that gap, of course, and you can close that gap with manual testing. Again, you have to have a codified set of manual tests, complete instructions, steps, and requirements for determining relevance, accuracy, and all that stuff. You can get 100% coverage reliably if both your automated and your manual testing are, of course, complete.

Our picture here in the background is another gearhead one. This is a person who’s assembling an engine. He’s got pistons on the table and a cylinder head. Our text here on the slide is “Putting it together.” It turns out really in my exploration of this topic that there’s really two things that are most important with creating a grade: relevance and number of tests. Or really one thing: the number of relevant tests. That’s the most important part. Relevance is really vital. If you have 50 tests to do with forms, but you have no forms on the thing you’re testing, then factoring those tests into a grade makes no sense and artificially skews your data.
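A small worked example in Python shows how much those 50 form tests can skew a grade. Only the 50 comes from the talk. The other numbers (20 relevant tests, 10 of them passed) and the helper name `percent_passed` are assumptions chosen to make the arithmetic easy to follow.

```python
def percent_passed(passed: int, counted: int) -> float:
    """Passed tests as a percentage of the tests counted toward the grade."""
    return 100 * passed / counted


relevant_tests = 20          # assumed: tests that actually apply to the page
passed_tests = 10            # assumed: relevant tests that passed
irrelevant_form_tests = 50   # from the talk: form tests on a page with no forms

# Grade based only on relevant tests.
honest = percent_passed(passed_tests, relevant_tests)

# Grade if the irrelevant tests are wrongly treated as passes.
inflated = percent_passed(
    passed_tests + irrelevant_form_tests,
    relevant_tests + irrelevant_form_tests,
)

print(f"relevant tests only: {honest:.1f}%")              # 50.0%
print(f"irrelevant counted as passes: {inflated:.1f}%")   # 85.7%
```

With these numbers, a page that fails half of the tests that apply to it moves from a failing grade to a B just by padding the total with tests that could never have failed.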

The number of tests is vital because if you don’t have a complete and thorough list of tests, then you may not be gathering enough data upon which to base the grade in the first place and you’ll wind up with a score that is not accurate. By the way, you also might remember that I talked about defect density. I mentioned that defect density itself may suffice as the only necessary metric for a grade, but what I found when I was looking at Tenon’s data is that once you start tracking relevant tests and passed tests, density is actually automatically figured into that. Either one of those is a fine metric, density or the grade itself, because they’re both synonymous at this point.

The image in the background of this slide has nothing to do with the content, but I’ve had it sitting in my slide assets archive for a long time and I just had to use it. It’s a picture of a stormtrooper and he’s holding the sign that says, “At least we didn’t kiss our sister.” If you’re familiar with Star Wars, you’ll know where that’s coming from. If you haven’t, then just watch the first one chronologically or according to release date.

Our text says, “Our target must be an A.” Regardless of what you base your score on, your target must be an A. Going back to these requests that we’ve gotten from customers, like “How do we compare to our competitor?” No. That’s totally the wrong question to ask. Getting a grade that you can look at and immediately know where your system stands is super useful. It’s an awesome idea. It’s a great way of tracking your progress provided that, of course, you are tracking your progress.

It should be relatively straightforward to get a grade that’s useful and then use that as a– I don’t know. A KPI to say what’s our distance from an A. Accessibility, because it’s a compliance domain, is the kind of thing that large companies want to track and that’s fine, but in practice, at least in my history, organizations that do this are doing a bottom-up race to whatever their bare minimum acceptable grade is going to be, and then they stop. They’ll be, “Oh, well, a B is good enough.” Well, a B’s not good enough. That’s going to be their target. They’re not going to look any further. That’s why the desire to pursue a grade really is misleading and dangerous anyway.

I want to read to you a quote from the WCAG standard, and it says, “Conformance to a standard means that you meet or satisfy the requirements of the standard.” That seems pretty self-explanatory. In WCAG, the requirements are the success criteria. To conform to WCAG, you need to satisfy the success criteria, that is, there is no content which violates the success criteria.

At a glance, the ability to see a score and intuitively understand how far away you are from getting that grade is cool, but choosing a less than perfect grade as good enough is dangerous, especially when you are working for an organization that has a high-risk profile (e-retail, banking, travel, that sort of stuff), or you’re subject to regulatory oversight, if you’re an organization of a certain size in Canada or Europe, or you’re a federal agency, or something like that. You have to comply with all of the success criteria at the chosen level of WCAG in order to be considered compliant. A B is not compliant, a C is not compliant.

It’s important to keep in mind that the answer to anybody who asks you this question, like “What’s our grade?”, is that if you have any issues, you are non-compliant, and that’s really the message that they need to get. A grade is good for determining how far away you are from that.

The background image here is another funny one. It’s a picture in a bathroom. Part of the sign is cut off, but it says, “If this bathroom needs service, please turn on the switch.” Below that sign is the switch plate that has no switch. The text on this slide says, “There’s only one true metric,” and this is actually one that I borrowed from a friend of mine who was doing accessibility compliance at Google. He mentioned this to me that there’s really only one thing that we need to track, there’s only one real KPI. That is, will users with disabilities want to use the product?

For this, I’ll take us back to another quote from the WCAG standard. It says, “Although these guidelines cover a wide range of issues, they are not able to address the needs of people with all types, degrees, and combinations of disability.” The real approach for scoring really requires us to interact with real users, watch them use the product, and ask them one of these three questions. If you’re not a current user of our product, would you want to use it? Or if you are a current user of our product, would you continue to use it? Third, if you’re a former user of this product, would you come back to using it?

So while automated and manual testing are really useful in finding potential problems in your product, only usability testing with real users is going to tell you if you’ve gotten it right, and this one true metric is basically, will people with disabilities want to use this product? That’s the way we get our score. So that’s me, Karl Groves, you can follow me on Twitter @karlgroves, where I mainly talk about overlays and politics. The website here is levelaccess.com. My email is karl.groves@levelaccess.com. Are there any questions? I see there’s 17 comments in the chat that I haven’t read. Let’s take a look at some of these and see if there’s anything.

>> AMBER: There were a couple of comments about– let’s see, let me scroll back through, about measuring against, which I think you might have touched on, but just in case, measuring against that base of other sites, most of which have problems. I don’t know if you have any extra thoughts on that?

>> KARL: Yes. It’s really bad. I mean, the reality is there’s a lot of bad websites out there, and especially in certain industries or certain market segments, a lot of those can be horrible. Unfortunately, K-12 is a big area where things are just horrible for people with disabilities to even try to use. You can kind of guess that the larger the company, the more likely it is they’ll have done something for accessibility. But just in general, I wouldn’t use what other people do as a good example or a good thing to follow.

>> AMBER: M had a comment, you can get a 100% accessibility score from certain auto test tools, even though the site is not accessible, or you use workarounds to trick the tools?

>> KARL: Yes.

>> AMBER: I think a good follow on question like that is, are there common tools that people use that– and I know, I’m not asking you to bash on anything, but things that you wouldn’t recommend, or obviously your product is a great product, but are there other things that can be helpful in finding some of the obvious problems but that won’t create a misleading view of what the accessibility status of the product is or the website?

>> KARL: Well, first off I would say that any testing tool is good as a way to get started because you’re testing. I’ve got to say that there are tools out there, certain legacy tools that I’ve seen people use, that are really old and sort of out of date and things like that. I’m not going to mention their names because I don’t want to disparage anybody. I mentioned Lighthouse before. Lighthouse uses a subset of the aXe rule set. The tests are good. If you’re using Lighthouse and doing testing with it, or using Accessibility Insights from Microsoft, which also uses a subset, those are good tests. If you’re finding stuff with those and you’re fixing them, you’re already doing a good thing.

You’re not getting a [inaudible] because again they’re just a subset. What I would caution against is people trusting any tool as gospel. We see this a lot with legal cases. There’s actually a law firm in Beverly Hills that fires out these demand letters and they base all of their stuff on WAVE. WAVE is awesome, it’s the number one downloaded and used accessibility testing tool for a reason, right? But WAVE in the hands of an amateur can be deceiving because what it’s going to list is outright errors, color contrast problems (which you do sometimes need to verify), then warnings, and also information. The warnings could be completely irrelevant. They could just be wrong, and that’s fine because they disclose that it’s a warning.

The information ones, that’s just information for you, to assist you when you’re doing your manual checking. A lot of people don’t understand that and they’ll take it as gospel, or even my tool or others will have an accurate test but there’s a conditional thing that makes it irrelevant. I used to use this line from– I used to work at a Harley dealership a long, long time ago and there was a mechanic there we called Old Man Brian. His name was Brian and he was old. There was this younger kid that worked as a mechanic next to him who– he had the worst Snap-on habit. Snap-on is a tool company. They’ll drive to auto shops, something like that, they used to sell you tools and they had the best tools. They are phenomenal tools, Snap-on is.

But this kid always had to have the latest and greatest stuff, and Brian would say, “It’s a poor mechanic who blames his tools for not getting the job done right.” I carry that over to everything else. If you don’t understand what the testing tool is doing, then you’re going to have problems. You can get misled or you’re going to chase your tail. The only thing I would say about some of that stuff is that tool quality is a big deal, and understanding the tool is even bigger.

>> AMBER: For people that are just getting started, what resources do you have for learning or trying to better understand the tools and creating what their process should be for testing?

>> KARL: That’s a good question. For anyone involved in accessibility, especially anyone new to accessibility, go to the WebAIM website, webaim.org. Both their website and their discussion list were instrumental for me when I was starting out, and I still point people there. Huge shout-out to the folks at WebAIM. I love them all dearly. Small plug: Level Access has released Access Academy. Access Academy is awesome because it has free courses. It talks about everything from [inaudible] to older stuff.

>> AMBER: I don’t see too many more questions coming in, so feel free, anyone, if you have questions, put them in, but I have one or two others that I sort of thought of while Karl was talking, otherwise we might wrap a little early. One thing I was wondering, because you mentioned that there are some law firms that just use the testing tools: do you feel like there is a good response to that if someone has a client that gets a demand letter? I mean, obviously, the ideal response, of course, is to make the website accessible and do user testing and prove that it already is accessible, and that it was a false flag on WAVE or whatever that might be. Do you have any thoughts on that, having been involved in some of the lawsuits as an expert?

>> KARL: Yes, I mean it’s really– it’s a huge topic because there are trolls out there who– they’ll go away for $4,000, some of them, not all of them. A lot of them want real money, but there are some of these folks, they’ll just settle for like four grand, they’ll go away. You’ve got to ask yourself– like my lawyer, at least before we got acquired by Level Access, my lawyer was $500 an hour. We’re not talking a lot of time before that $4,000 has run out with me paying my lawyer to fight this thing.

The other reality is they probably have a good complaint. If you have a website and somebody reaches out to you and says we’re going to sue you, unless you know for a fact your website is accessible, they might have a point. That’s actually a huge problem, makes it really hard to defend. Obviously the bigger your company the more money you have to fight this sort of stuff and there you go, but there’s something to be said for just fixing your stuff.

As a matter of fact, when it comes to a legitimate complaint, all paths lead to you fixing your stuff. If you follow the possible options– we actually have a flowchart. There’s a blog post on the Tenon blog, blog.tenon.io. I forget what the blog post is titled, but we have a flowchart. It says, basically, “All paths lead to fixing your stuff.” If you fight the lawsuit and you win, the only way you’re winning is because you already have an accessible site. If you settle, you have to fix your site. If you lose, you have to fix your site.

There’s only one way to deal with the lawsuits, and that’s to fix your site. [coughs] That’s the one thing I would say. The other part, just strategically, is that some lawsuits can be mooted. Mooting in legal terms is a strategy where you basically pull the rug out from underneath the plaintiff by making it no longer an issue, so fixing your site before the trial date. Then you have the trial, you have the lawsuit thrown out, and so on and so forth. That’s legal stuff that I can’t get into any more than that, other than knowing that I have seen it happen and I participated in it happening. There’s lots of legal mumbo jumbo behind that, that I’m under-qualified for.

>> AMBER: I appreciate the thoughts and obviously that would be the most ideal response, is you get a complaint and you just go fix it and you say, “Oh, thanks for letting us know. Here’s the fix on our website.”

[laughter]

>> KARL: Yes.

>> AMBER: Someone said they’re trying to learn web development. “I find myself paralyzed because I’m terrified of learning inaccessible practices because accessibility is usually treated as an afterthought. Can you recommend any courses?” I think they’re referencing on web development that account for accessibility throughout.

>> KARL: Man, I wish there was a good answer for that.

>> AMBER: I know Joe Dolson does some LinkedIn Learning stuff. I know he has one on accessibility. I’m not certain if he has any without going and looking if he has just web development courses, but I would guess if he does on LinkedIn Learning, that might be a good resource.

>> KARL: Yes. That’s true. There’s a couple of LinkedIn Learning courses out there that have covered some of that. Joe Dolson, I think Marcy Sutton had some stuff out there. Gerard Cohen, I think he did one on Lynda.com or something like that. I just put a link to an Amazon purchase for a book called Designing With Progressive Enhancement. [clears throat] The book is a little bit on the out-of-date side when it comes to modern practices. However, if you’re just learning web development, this is going to cover a lot of really good stuff.

That’ll be a big deal because it’ll talk about JavaScript. It has HTML, CSS, CSS3 stuff. It’s semi-modern, but a lot of the JavaScript in there is jQuery rather than more modern framework stuff. That will teach a lot of the fundamentals for that stuff. Also, this is another one that’s really out of date, Beginning JavaScript with DOM Scripting and Ajax. A good friend of mine, Christian Heilmann, wrote this book.

>> AMBER: [silence] I think while you’re looking for that, Glen commented in the chat that if you start with semantic HTML, that’s a great start for accessible practices. Maybe don’t skip those HTML basics courses.

>> KARL: The unfortunate part is not a lot of those resources out there talk about semantic HTML. They don’t show you. None of the websites that are out there, like tutorial websites, even care about semantic HTML for the most part.

>> AMBER: Christina commented that [crosstalk] college in Toronto’s web development boot camp program integrates accessibility.

>> KARL: That’s cool.

>> AMBER: She attended that on a scholarship and there was accessibility there.

>> KARL: That’s cool.

>> AMBER: Here’s a question that is very current. How does accessibility fit into Metaverse and Web 3.0 and future emerging technologies like virtual reality, AR, VR, MR? AR I know VR. I don’t know MR. What is the scope of accessibility in these areas? Do you have thoughts on this?

>> KARL: Glen Walker can answer that one. Thomas Logan is probably the first resource I would point anybody to for AR and VR stuff.

>> AMBER: Glen posted in the chat just in case anyone can’t see it. Thomas Logan of the A11yNYC meetup.

>> KARL: [clears throat]

>> AMBER: Do you know if those are virtual meetups?

>> KARL: Yes.

>> AMBER: Anyone could join? Great. There was a comment or question earlier that said, “One thing I’ve considered is using a VPAT template for developing an accessibility score, so removing the non-relevant items and creating a score per accessibility level. For example, level A only has 10 relevant items, divide that by 100, and so on for each level. Does this seem problematic?” The person said, “Keeping in mind, this is for clients who just really want to have a score.”

>> KARL: I don’t really agree with the idea of using a VPAT for that. A VPAT is great for disclosing where your failings are, success criterion by success criterion. I agree with that, but as far as whether we have an A or a B? No. [clears throat]

>> AMBER: You actually helped me when we were working on one for ours, and I was looking at some of them and I feel like there was a lot of what you were talking about earlier, where they just said, “Oh, it passes,” even though it didn’t actually apply. I was going back and forth and I appreciated your thoughts on that as we were writing ours. I was like, “Well, does it pass?” [clears throat] We don’t have this. There’s no images of text. I guess we passed it. [chuckles]

>> KARL: Exactly. I’m not going to name them, but there’s a very, very large software company out there who says that if a criterion is irrelevant, then it passes, and I’m like, “No, it’s just not relevant,” but their logic for saying that is that they haven’t failed. I’m like, “That’s not the same.”

[laughter]

That’s not the same.

>> AMBER: I’m not seeing any other questions coming up. M said, “I agree that we should move away from scores and just fix what is wrong or build it right at the start. The point should be to create an inclusive and welcoming site for everyone.”

>> KARL: Yes.

>> AMBER: The last question that I wrote down was when you were talking about user testing. The way we sort of have approached this is we try to do automated testing and then we have our developers do manual testing themselves. After all those things are fixed, we bring in users, and I’m wondering if you have thoughts about how people can incorporate users in their testing process. Should it be earlier? Also, maybe for someone who’s new to that, how would they go about finding users?

>> KARL: I would suggest strongly never to do usability testing until you’ve done automated and manual testing. The reason why is because you don’t want to waste your time, you don’t want to waste your participants’ time. If there’s money involved, like paying stipends, which you should be doing, then you’re wasting money. I would strongly recommend not doing user testing until you’ve gotten that other stuff out of the way. I’ve seen this plenty of times. You’re going to have a person failing a task for an extraordinarily obvious reason. I’ll give you a great example. I was once testing a jobs website. This was a jobs site where in order to post your resume, you have to have an account. In order to apply for a job, you have to have a resume, so on and so forth.

If you’re testing this whole ability to upload a resume, but the fields don’t have labels or there’s a stupid focus problem or something like that, then you’re not really testing the usability of this process. You’re getting bogged down in technical problems that should have been addressed in the first place, but once you’ve done that, the next part would be usability testing. There’s a ton of ways to do it right. There’s a ton of ways to do it effectively that aren’t, like, ideal. The ideal is, of course, you do a recruit. You have a defined persona. You recruit against that persona, blah, blah, blah, blah. That stuff can get really, really expensive.

If you don’t have the budget for that, you can grab some people who at least have some domain knowledge in the area of this website, and recruit them to do the testing. Lots of places are out there for finding participants for this. NFB, local lighthouses, or any of the others, AFB, NFB, and ACB, those sorts of things. You can reach out to them to see if there are any people who would be willing to participate. That should have you covered for non-visual testers. The same thing goes for any other sort of population.

There’s plenty of disability rights organizations or advocacy organizations that have lists of people who might want to participate or might be available to participate, or they’ll share your announcement, your call for participants. Like M said, just make sure you pay them. [chuckles] Give them a stipend of some kind. The going rate for stipends is about $75 to $150. It’s well worth it once you’ve done that testing, if you’re ready for it.

By the way, going back to the question about the VPAT. Another thing that I should have mentioned before is there’s a new organization out there that is trying to create an accessibility report standard. It’s like what they say here in the introduction, a window sticker, similar to a car window sticker, for telling people what your performance is for accessibility.

If this is a topic you’re interested in and have time for, definitely get in touch with Chris Law and try to participate in that and give your feedback on that. That is one area of trying to boil all of this stuff I’ve been talking about down into something actually digestible for consumers.

>> AMBER: Is that like a “Self-proclaimed” or they’re going to come and be an external credentialing body that people can opt into? Or do you know much about what they’re looking at going to for that window sticker?

>> KARL: I don’t know, actually. I guess that has yet to evolve. Step one, of course, is getting people involved in the organization and setting it up, then there’s more to go from that. I have only been involved in the very, very early days, with the acquisition of Tenon by Level Access. That took up pretty much most of my time from fall to winter and into spring this year. That’s been all I’ve been able to work on. [clears throat]

>> AMBER: Awesome. Well, thank you very much. This has been fabulous despite our surprise in the beginning.

[laughter]

>> KARL: It’s in the transcript. It’s in the transcript too, so that’s a joyous little nugget for [crosstalk]

[laughter]

[END OF AUDIO]

Links Mentioned

About the Meetup

The WordPress Accessibility Meetup is a global group of WordPress developers, designers, and users interested in building more accessible websites. The meetup meets twice per month for presentations on a variety of topics related to making WordPress websites that can be used by people of all abilities. Meetups are held on the 1st Thursday of the month at 10 AM Central/8 AM Pacific and on the 3rd Monday of the month at 7 PM Central/5 PM Pacific.

Learn more about WordPress Accessibility Meetup

Summarized Session Information

Accessibility scoring is a topic that often comes up when discussing web accessibility. Organizations frequently ask for a grade or a score to quantify their accessibility efforts. However, accessibility expert Karl Groves argues that this approach is fundamentally flawed. In this presentation, Karl explores the challenges of assigning an accessibility score, the limitations of automated testing, and the importance of real user experiences in determining true accessibility.

Session Outline

The problem with accessibility scoring
The issue of relevance in scoring
The impact of accessibility issues on users
The role of issue volume in accessibility grading
Comparing accessibility scores across websites
Determining the scope of accessibility scoring
The importance of relevance in testing
What should an accessibility grade be based on?
The ultimate accessibility metric

The problem with accessibility scoring

Clients often ask, “Can you give us a grade?” The reality is that if a website has accessibility problems, it fails. While this blunt truth is brutal for many to accept, accessibility issues often go unnoticed until tested, and the results are rarely good.

Some organizations look to tools like Tenon for grading, but Tenon is designed as a diagnostic tool rather than a program management system. Unlike tools such as Level Access AMP, which serves as an enterprise accessibility management platform, Tenon aims to find problems, not assign grades.

The fundamental question in accessibility scoring is: what should a grade be based on? The simplest method is to calculate the number of passed tests divided by the total tests, multiply by 100, and apply a standard letter grading scale. However, this approach has significant flaws, particularly when considering the nuances of accessibility.
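As a rough illustration of the simple method described above, here is a minimal Python sketch. The function names and the 90/80/70/60 cutoffs are assumptions chosen for the example; the talk refers only to "a standard letter grading scale" and does not name specific thresholds.

```python
def naive_score(passed: int, total: int) -> float:
    """Percentage of tests passed: passed / total * 100."""
    if total <= 0:
        raise ValueError("total must be a positive number of tests")
    return 100 * passed / total


def letter_grade(score: float) -> str:
    """Map a percentage onto an assumed 90/80/70/60 letter scale."""
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= cutoff:
            return letter
    return "F"


if __name__ == "__main__":
    score = naive_score(passed=42, total=50)
    print(score, letter_grade(score))
```

Note that nothing in this calculation asks whether each test was relevant to the page or how badly a failure affects users, which is where the flaws discussed in the following sections come in.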

The issue of relevance in scoring

Automated testing tools primarily identify failures rather than successes. Because they focus on detecting problems, they do not account for what is working correctly. This creates a skewed perception that if no failures are detected, the website is fully accessible, which is not necessarily true.

The concept of relevance also plays a crucial role in accessibility testing. If a test assesses table accessibility but the website does not contain any tables, then the test is irrelevant. Some argue that an irrelevant test should be considered a pass, but Karl strongly disagrees, stating that an irrelevant test is neither a pass nor a fail—it is simply null.

To address this issue, Tenon’s manual testing tool, Mortise.io, includes criteria to determine whether a test is applicable before assessing whether it passes or fails. This approach ensures that the grading system accurately reflects accessibility issues that matter.
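One way to picture the "neither pass nor fail" idea is a three-state result where non-applicable tests are left out of the score entirely. The sketch below is an illustration only and is not how Mortise.io is implemented; the `Outcome` enum and `relevant_score` function are hypothetical names.

```python
from enum import Enum
from typing import Iterable, Optional


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    NOT_APPLICABLE = "n/a"  # e.g. a table test on a page with no tables


def relevant_score(outcomes: Iterable[Outcome]) -> Optional[float]:
    """Score only the applicable tests; return None if nothing applied."""
    applicable = [o for o in outcomes if o is not Outcome.NOT_APPLICABLE]
    if not applicable:
        return None  # no relevant tests means no score, not a perfect score
    passed = sum(1 for o in applicable if o is Outcome.PASS)
    return 100 * passed / len(applicable)


if __name__ == "__main__":
    results = (
        [Outcome.PASS] * 2
        + [Outcome.FAIL] * 2
        + [Outcome.NOT_APPLICABLE] * 6
    )
    print(relevant_score(results))
```

With two passes, two failures, and six non-applicable tests, treating the non-applicable tests as passes would report 8 out of 10, while scoring only the four applicable tests reports 2 out of 4.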

The impact of accessibility issues on users

One of the key arguments for factoring user impact into accessibility scoring is that not all failures have the same effect. Some issues, such as missing alt text on images, are easy to detect automatically but challenging to assess for accuracy and usefulness. The severity of an issue varies depending on the user. For example, a missing form label might be an inconvenience for voice dictation users but a complete barrier for blind users.

Tenon incorporates impact severity into its prioritization scoring. Rather than influencing the overall accessibility grade, the prioritization score helps organizations determine which issues to fix first based on their potential to disrupt user experiences. However, Karl acknowledges that the challenge remains: how do we factor severity into a universal accessibility score?

The role of issue volume in accessibility grading

Another aspect of accessibility scoring is issue volume. In traditional software quality assurance (QA), defect density—the number of issues per thousand lines of code—is a common metric. Tenon applies a similar concept to accessibility, measuring the number of issues relative to the size of a web page. A complex webpage with a hundred issues is inherently more accessible than a simple webpage with the same number of issues.

Karl has found a strong correlation between issue density and usability through real-world testing. Pages exceeding 50% defect density tend to be significantly more challenging for users to navigate. This suggests that defect density alone may be the most meaningful accessibility metric.
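The summary does not give Tenon's exact density formula, so the following Python sketch uses an assumed one purely to show the idea: issues per kilobyte of page source, expressed as a percentage. The function name and the kilobyte unit are hypothetical.

```python
def issue_density(issue_count: int, page_size_kb: float) -> float:
    """Issues relative to page size, as a percentage.

    Assumed formula for illustration: one issue per kilobyte of source
    reads as 100%. This is not Tenon's documented calculation.
    """
    if page_size_kb <= 0:
        raise ValueError("page_size_kb must be positive")
    return 100 * issue_count / page_size_kb


if __name__ == "__main__":
    # Same number of issues, very different densities.
    print(issue_density(100, page_size_kb=400))  # large, complex page
    print(issue_density(100, page_size_kb=50))   # small, simple page
```

Under this assumed formula, a hundred issues on a 400 KB page yields a far lower density than a hundred issues on a 50 KB page, which matches the point that raw issue counts alone do not say much.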

Comparing accessibility scores across websites

Organizations frequently ask how their accessibility compares to competitors. With Tenon’s vast dataset from millions of web pages, statistical comparisons can be made. However, since the average webpage has 83 errors and a defect density of 15%, the “norm” itself is not accessible. Using a relative score based on industry averages does not provide meaningful insight into true accessibility.

Instead of striving to be “better than competitors,” organizations should aim for an A-grade in accessibility, meaning full compliance with WCAG standards. Anything less than an A still leaves barriers in place for users with disabilities.

Determining the scope of accessibility scoring

Another factor in accessibility grading is scope. Should accessibility be measured at the component, page, or product levels? Karl argues that:

Component-level grading helps identify high-priority issues.
Page-level grading can be misleading if the page is of low importance.
Product-level grading can be skewed by outlier pages, making it an unreliable measure of overall accessibility.

Instead, prioritization should be based on each page’s importance and traffic level, ensuring that the most impactful accessibility issues are addressed first.
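To make the prioritization idea concrete, here is a small Python sketch that orders pages for remediation. The weighting (issues multiplied by an importance rating and by monthly visits), the `Page` fields, and the example URLs are all assumptions for illustration; the talk does not prescribe a formula.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Page:
    url: str
    issues: int
    importance: int      # hypothetical rating: 1 (low) to 5 (critical)
    monthly_visits: int


def priority(page: Page) -> int:
    """Assumed weighting: more issues on important, busy pages rank higher."""
    return page.issues * page.importance * page.monthly_visits


def fix_order(pages: Iterable[Page]) -> List[Page]:
    """Return pages sorted from highest to lowest remediation priority."""
    return sorted(pages, key=priority, reverse=True)


if __name__ == "__main__":
    pages = [
        Page("/archive/2012", issues=80, importance=1, monthly_visits=300),
        Page("/checkout", issues=10, importance=5, monthly_visits=20_000),
        Page("/", issues=15, importance=4, monthly_visits=50_000),
    ]
    for page in fix_order(pages):
        print(page.url, priority(page))
```

In this example the archive page has by far the most issues but lands last, because few people visit it and it matters little to the product's core tasks.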

The importance of relevance in testing

When performing accessibility assessments, relevance is crucial. Automated tools should include many relevant tests, ensuring comprehensive coverage. Lighthouse, for example, includes only a subset of aXe’s tests, making it less thorough than enterprise tools like AMP or Tenon.

Manual testing is essential to fill gaps left by automated tools. However, manual testing must be structured with clear instructions and criteria to ensure consistent and meaningful results. Combining automated and manual testing is the best way to achieve a complete and accurate accessibility assessment.

What should an accessibility grade be based on?

Karl concludes that the two most important factors in creating an accessibility score are relevance and the number of tests performed. If tests are irrelevant, they should not be included in the grade, as they distort the results. Similarly, a grading system that does not incorporate enough tests will fail to provide an accurate picture of accessibility.

Interestingly, Karl found that tracking relevant tests and passed tests inherently accounts for defect density. Whether an organization uses defect density or a pass/fail grading system, both methods ultimately provide the same insights into accessibility.

The ultimate accessibility metric

Regardless of how an accessibility score is calculated, Karl emphasizes that organizations must aim for an A. Many companies are satisfied with a “good enough” grade, setting a B as their target rather than striving for full accessibility compliance. However, this mindset is problematic, especially for organizations at high risk of legal action or regulatory scrutiny.

To truly understand accessibility, organizations must go beyond automated and manual testing and ask real users:

If you’re not a current user, would you want to use this product?
If you are a current user, would you continue to use it?
If you were a former user, would you return to using it?

Ultimately, the best accessibility metric is user experience. Only real users can determine whether a product is accessible in practice. Organizations must shift their focus from scores and compliance checkboxes to ensuring that people with disabilities can and want to use their digital products.

About Equalize Digital