Testing CSS in mozilla: where to go from here

Version 1.1

Introduction
Testing methodology
CSS: what has been done so far
Writing manual QA tests for CSS
Glossary
References
Contributors

Introduction

Aims

The primary aim of Quality Assurance (QA) is to make the product better by reporting failures in sufficient detail for the engineering team to then fix these failures.

In order to report failures, they first have to be found. Finding failures forms the bulk of the work done by QA. There are several ways of finding failures, and these are discussed in detail in the next section. Testing the product directly is the most obvious way of finding failures, but there are several other techniques, for instance reading user feedback (e.g. from beta programmes).

CSS

Cascading Style Sheets (CSS) is a simple technology for styling web pages. It is designed to allow an easy separation of the stylistic aspects of a document (e.g. "green bold text") from the structural and semantic aspects of a document (e.g. "section header").

CSS is based on two fundamental concepts.

The first concept is that CSS is a tree decoration language. CSS defines a list of properties, for example 'color' or 'font-size'. Applying CSS style sheets to a tree causes each node in the tree to have a specified value for each of these properties.

The second concept is the CSS rendering model. This describes how blocks, tables, text and other layout elements are displayed, how fonts are selected, and so on.

The mapping of the decorated tree into the rendering model is what forms the majority of the CSS specification.

CSS was originally invented in 1995 by Håkon Lie and Bert Bos, and became a World Wide Web Consortium (W3C) Recommendation in late 1996. In 1998 a second version was released [1], and since then much progress has been made on a third version, which will be split into many modules for both political and technical reasons, and on publishing extensive errata for the published versions based on implementation experience.

The CSS Working Group, which is responsible for this work, consists of representatives from various implementors of CSS user agents, users of the technology, and other interested parties.

Further reading: http://www.w3.org/Style/CSS/

Mozilla

Mozilla is a free software internet application suite. It forms the basis of products such as the Netscape browser, the "Instant AOL" consumer device, and the ActiveState Komodo IDE.

Further reading: http://www.mozilla.org/

Testing methodology

Types of coverage ("when")

There are several ways of finding failures. These form a broad spectrum of test coverage, ranging from the unsuspecting public to the simplest of targeted test cases. Let us examine each in turn.

Real world coverage. This is the ultimate test. Ideally, end users would never experience software defects (bugs) but, unfortunately, writing software as complex as a web browser on a tight schedule inevitably means a compromise must be made between perfection and shipping the product before the company goes bankrupt. In the case of CSS, bugs in the last released version of the product are sometimes reported by web developers in official feedback forms or in public forums. These are an important source of bug reports, but in practice the signal to noise ratio is usually too low to warrant spending much time here. (Typically, issues reported in official feedback forms and public forums are well known issues, or even incorrect reports.)
Pre-release beta testing. Errors reported in widely distributed releases of non-final versions of the product are often known issues (as with bugs reported in final versions) but by examining how many times issues are reported the most important bugs can be prioritized before the final version is released.
Dogfood. It is good practice to use the software one is developing on a daily basis, even when one is not actively working on developing or testing the product. This is known as "eating one's dogfood". [2] Many bugs, usually user interface (UI) issues but also occasionally web standards bugs, are found by people using daily builds of the product while not actively looking for failures. A bug found using this technique which prevents the use of the product on a regular basis is called a "dogfood" bug and is usually given a very high priority.
"top100" and site-specific testing. The web's most popular pages are regularly checked by visual inspection to ensure that they display correctly. (It is hard, if not impossible, to automate this task, because these pages change very, very frequently.) Bugs found through this technique are important, because many users would encounter them should a product be released with such a defect. In practice, many rendering issues found on top100 pages are actually caused by errors on the pages themselves, for example using invalid CSS or CSS which is incorrectly handled by other browsers. Most web authors do not check the validity of their pages, and assume that if the page "looks right" on popular browsers, it must be correct.
Smoketests. Each day, before allowing work to begin on the code base, the previous day's work must pass an extremely simple set of tests known as "smoketests". The name comes from the idea that these tests are the software equivalent of powering a new prototype circuit and seeing if it catches fire! CSS does not figure very prominently in the smoketest steps, so few, if any, CSS bugs are caught this way. However, since CSS underpins a large part of the application's UI, if anything is seriously wrong with the CSS infrastructure, it will be caught by the smoketests. Bugs found this way are known as "smoketest blockers", and with good reason: all work is blocked until the bugs are fixed. This is to ensure that the bugs are fixed quickly.
Tinderbox tests. Tests are also run on a continuous basis on a system known as the tinderbox. [3] This tests the absolute latest code base, and therefore is a good way of catching unexpected compile errors (code that works on one platform might not work on another) and major problems such as startup failures. There are no CSS tests currently being run on the tinderboxes; however, this is a direction which will be worth pursuing in the future.
Automated tests. Some tests have been adapted for a test harness known as NGDriver. These tests run unattended and can therefore cover large areas of the product with minimum effort. Until recently, CSS could not easily be tested using an automation system. However, with the advent of the Layout Automation System (LAS), there now exists a test harness that is capable of displaying a test page and then comparing this test page, pixel for pixel, with a pre-stored image.

Automation is the holy grail of QA. Unfortunately, there are many aspects that are hard to impossible to automate, such as printing.
Engineer regression and pre-checkin tests. In order to catch errors before they are flagged on the tinderbox (and thus waste a lot of time), engineers must run a set of tests before committing their changes to the main code base. These tests are known as 'pre-checkin tests'. Certain changes also require that the new code be run through specially designed regression tests that are written to flag any regressions (new defects which were not present in a previous version) in the new code.
Manual test runs. Before releases, and occasionally at other times as well (for instance when a new operating system is released) every test case is run through a test build of the product and manually inspected for errors. This is a very time consuming process, but (assuming the person running the tests is familiar with them and the specification being tested) it is a very good way of catching regressions.
QA test development. The main way of discovering bugs is the continuous creation of new test cases. These tests then get added either to the manual test case lists or the automated test case lists, so that they can flag regressions if they occur.
Engineer test development. When a bug is discovered, the file showing this bug is then reduced to the smallest possible file still reproducing the bug. This enables engineers to concentrate on the issue at hand, without getting confused by other issues. Oddly enough, during this process it is not unusual to discover multiple other related bugs, and this is therefore an important source of bug reports.
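
The reduction step described under engineer test development can be partly mechanised. The following is a minimal sketch in Python, not a description of any tool used by Mozilla: it greedily deletes one line at a time from a failing file and keeps each deletion only if the bug still reproduces. The predicate `still_fails` is a hypothetical stand-in for "load the file in the product and check whether the failure is visible", and the sample markup and the `fake_bug` check are invented for illustration.

```python
from typing import Callable, List


def reduce_test(lines: List[str], still_fails: Callable[[List[str]], bool]) -> List[str]:
    """Greedily remove lines while the bug still reproduces.

    `still_fails` is a hypothetical predicate supplied by the caller; in
    real use it would render the candidate file and look for the failure.
    """
    if not still_fails(lines):
        raise ValueError("the original file does not show the bug")
    reduced = list(lines)
    changed = True
    while changed:  # repeat until a full pass removes nothing
        changed = False
        index = 0
        while index < len(reduced):
            candidate = reduced[:index] + reduced[index + 1:]
            if still_fails(candidate):
                reduced = candidate  # this line was irrelevant: drop it
                changed = True
            else:
                index += 1  # this line is needed: keep it and move on
    return reduced


# Invented example: pretend the bug needs both the float rule and the table.
page = [
    "<style>",
    "p { color: green }",
    "div { float: left }",
    "</style>",
    "<h1>News</h1>",
    "<div><table><tr><td>x</td></tr></table></div>",
]


def fake_bug(candidate: List[str]) -> bool:
    text = "\n".join(candidate)
    return "float: left" in text and "<table>" in text


print(reduce_test(page, fake_bug))
```

In this toy run only the two lines that the fake bug depends on survive. A real reduction would also have to keep the remaining markup well formed, which is one reason the process is usually done by hand.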

Types of test cases ("where")

If the various techniques for finding failures give a list of when bugs are typically found, then the various kinds of test cases give a list of where the failures are found.

The original files of a bug found in the field. Typically, bugs reported by end users and beta testers will simply consist of the web address (URI) of the page showing the problem. Similarly, bugs reported by people while using daily builds (dogfood testing) and bugs found on top100 pages will consist of entire web pages.

An entire web page is usually not very useful to anyone by itself. If a bug is found on a web page, then the page will have to be turned into a reduced test to be useful for developers (engineer test development). This will then typically be used as the basis for a group of more complicated QA tests for regression testing (and maybe finding more bugs in that area).

Example:
- http://www.cnn.com/ or any other web site.
Reduced test (attached to a bug). In order for an engineer to find the root cause of a bug, it is helpful if the original page which demonstrates the bug is simplified to the point where no unrelated material is left. This is known as minimizing or reducing a test case, and forms part of engineer test development.

Reduced tests are typically extremely small (less than one kilobyte including any support files) and extremely simple. There are obviously exceptions to these rules, for example tests for bugs that only manifest themselves with one megabyte files will be big (although still simple) and bugs that are only visible with a convoluted set of conditions will be complicated (although still small).

A good reduced test will also be self explanatory, if that can be done without adding text which would be unrelated to the test.

Examples:
- http://bugzilla.mozilla.org/attachment.cgi?id=25713&action=view
- http://bugzilla.mozilla.org/attachment.cgi?id=39662&action=view
Note that both these examples are attached to Bugzilla, the Mozilla bug tracking tool. [4]
Simple test. When the implementation of a feature is in its infancy, it is useful to create a few simple tests to check the basics. These tests are also known as isolation tests (since the features are typically tested in isolation) or ping tests (in computing terms, to ping something means to check that it is alive).

Simple tests consist of a test as simple as a reduced test, but designed to be easy for QA to use, rather than for engineers, and therefore may have the appearance of a complicated test.

Simple tests are often used as part of a complicated test. When used in this way they are known as a control test, with analogy to the concept of a control in experimental physics. If the control test fails, then the rest of the test is to be considered irrelevant. For example, if a complicated test uses colour matching to test various colour related properties, then a good control test would be one testing that the 'color' property is supported at all.

Example:
- http://www.hixie.ch/tests/adhoc/css/cascade/style/001.xml
Complicated test. This is the most useful type of test for QA, and is the type of test most worth writing. One well written complicated test can show half a dozen bugs, and is therefore worth many dozens of simple tests, since for a complicated feature to work, the simpler features it uses must all work too.

Complicated tests should appear to be very simple, but their markup can be quite convoluted since they are typically testing combinations of several things at once. The section "Writing manual QA tests for CSS" describes how to write these tests.

Example:
- http://www.hixie.ch/tests/adhoc/css/selectors/not/006.xml
Use case demo page. Occasionally, pages will be written to demonstrate a particular feature. Pages like this are written for various reasons -- they are written by marketing teams to show users the new features of a product, they are written by technology evangelism teams to show web developers features that would make their site more interesting or to answer frequently asked questions about a particular feature, sometimes they are even written for fun! CSS, due to its very graphical nature, has many demo pages.

Demo pages are really another kind of complicated test, except that because the target audience is not QA it may take longer to detect failures and reduce them to useful tests for engineers.

Example:
- http://damowmow.com/mozilla/demos/layout/demo.html
Extremely complicated demo page. When an area has been highlighted as needing a lot of new complicated tests, it may be hard to decide where to begin working. To help decide, one can attempt to write an entire web site using the feature in question (plus any others required for the site). During this process, any bug discovered should be noted, and then used as the basis for complicated tests.

This technique is surprisingly productive, and has the added advantage of discovering bugs that will be hit in real web sites, meaning that it also helps with prioritisation.

At least two web sites exist purely to act as extremely complicated demo pages:
- http://www.libpr0n.com/
- http://www.mozillaquestquest.com/
Automated tests. These are used for the same purposes as complicated tests, except that they are then added to automated regression test suites rather than manual test suites.

Typically the markup of automated tests is impenetrable to anyone who hasn't worked on them, due to the peculiarities of the test harness used for the test. This means that when a bug is found on an automated test, reducing it to a reduced test can take a long time, and sometimes it is easier to just use automated tests as a pointer for running related complicated tests. This is suboptimal, however, and well designed automated tests have clear markings in the source explaining what should be critical to reproducing the test without its harness.

The worst fear of someone running automated tests is that a failure will be discovered that can only be reproduced with the harness, as reducing such a bug can take many hours due to the complexities of the test harnesses (for example the interactions with the automation server).

Example:
- http://www.hixie.ch/tests/ngdriver/domcss/sc2p004.html

Finding bugs

The following flowchart is a summary of this section.

                                  BETA FEEDBACK
                                        |
 EXTREMELY                             \|/
COMPLICATED        USER FEEDBACK --> WEB SITE <-- DOGFOOD
 DEMO PAGE                              |
     |                                  |
     |                                  |
    \|/                                \|/
  LIST OF -----> COMPLICATED ------> REDUCED
   BUGS   <-----    TESTS             TEST
    /|\            |                    |
     |            \|/                  \|/
     AUTOMATED TESTS                BUG FILED

CSS: what has been done so far

Existing coverage (as of September 2001)

CSS1 is thoroughly covered and methodical testing at this stage would not give a tests-to-bugs ratio that is worth the time investment. The only exception would be the list related properties.

CSS2 is less thoroughly covered. Positioning, tables, generated content and the font matching algorithm have had little testing.

Selectors, the cascade, syntax, the block box model, the inline box model, floats, colour and background related properties, the text properties, and the font properties are all well covered.

Current tests are spread across many test suites, including:

http://www.hixie.ch/tests/adhoc/ A large selection of complicated tests designed for ease of use by QA.
http://www.hixie.ch/tests/evil/ Some very complicated tests and test generators. These tests are designed more with exploratory testing in mind -- in some cases, it is not even clear what the correct behaviour should be.
http://www.people.fas.harvard.edu/~dbaron/csstest/ A set of complicated tests. Some of these tests require careful study and are not designed for use by QA.
http://www.bath.ac.uk/~py8ieh/internet/results.html An earlier set of complicated tests. Most of these tests are very descriptive, and are therefore quite useful when learning CSS. This test suite has some tests that examine some fundamental, if complicated, aspects of CSS, such as the inline box model and the 'width' and 'height' properties.
http://www.w3.org/Style/CSS/Test/current/ The official W3C CSS1 Test Suite.

New tests

The majority of new tests should be in the areas listed as lacking tests in the previous section. These are the areas that have the least support in Mozilla.

With the recent advent of LAS, the automation system for layout tests, it would be a good idea to work on automating the many manual tests already in existence. Having done this, linking the automation with the tinderbox tests would give a good advance warning of regressions.
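
This document describes LAS only as displaying a test page and comparing it, pixel for pixel, with a pre-stored image; how LAS implements that comparison is not covered here. The following Python sketch merely illustrates what such a comparison involves. Images are modelled as nested lists of (red, green, blue) tuples, and every name in it is invented for illustration.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of pixels; a stand-in for a real screenshot

GREEN: Pixel = (0, 128, 0)
RED: Pixel = (255, 0, 0)


def first_difference(rendered: Image, reference: Image) -> Optional[Tuple[int, int]]:
    """Return the (x, y) of the first differing pixel, or None if identical."""
    if len(rendered) != len(reference):
        # Different heights: report the first row that one image lacks.
        return (0, min(len(rendered), len(reference)))
    for y, (row_a, row_b) in enumerate(zip(rendered, reference)):
        if len(row_a) != len(row_b):
            # Different widths: report the first column that one row lacks.
            return (min(len(row_a), len(row_b)), y)
        for x, (a, b) in enumerate(zip(row_a, row_b)):
            if a != b:
                return (x, y)
    return None


reference = [[GREEN] * 4 for _ in range(3)]
rendered = [row[:] for row in reference]
rendered[2][1] = RED  # simulate a one-pixel regression

print(first_difference(reference, reference))  # identical images
print(first_difference(rendered, reference))   # differs at x=1, y=2
```

An exact match of this kind is only meaningful for tests whose rendering should be identical, to the pixel, on any machine.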

Writing manual QA tests for CSS

How manual QA tests are used

Tests are viewed one after the other in quick succession, usually in groups of several hundred to a thousand. As such, it is vital that:

the results be easy to determine,
the tests need no more than a few seconds to convey their results to the tester,
the tests not need an understanding of the spec to use them.

A badly written test can lead to the tester not noticing a regression, as well as breaking the tester's concentration.

Ideal tests

Well designed CSS tests typically fall into the following categories, named after the features that the test will have when correctly rendered by a user agent (UA).

Note: The terms "the test has passed" and "the test has failed" refer to whether the user agent has passed or failed a particular test -- a test can pass in one web browser and fail in another. In general, the language "the test has passed" is used when it is clear from context that a particular user agent is being tested, and the term "this-or-that-user-agent has passed the test" is used when multiple user agents are being compared.

The green paragraph. This is the simplest form of test, and is most often used when testing the parts of CSS that are independent of the rendering, like the cascade or selectors. Such tests consist of a single line of text describing the pass condition, which will be one of the following:
```
This line should be green.
This line should have a green border.
This line should have a green background.
```
Example:
- http://www.hixie.ch/tests/adhoc/css/box/inline/002.html
- http://www.hixie.ch/tests/adhoc/css/background/20.xml
The green page. This is a variant on the green paragraph test. There are certain parts of CSS that will affect the entire page; when testing these, this category of test may be used. Care has to be taken when writing tests like this that the test will not result in a single green paragraph if it fails. This is usually done by forcing the short descriptive paragraph to have a neutral colour (e.g. white).

Example:
- http://www.hixie.ch/tests/adhoc/css/background/18.xml
(This example is poorly designed, because it does not look red when it has failed.)
The green block. This is the best type of test for cases where a particular rendering rule is being tested. The test usually consists of two boxes of some kind that are (through the use of positioning, negative margins, zero line height, or other mechanisms) carefully placed over each other. The bottom box is coloured red, and the top box is coloured green. Should the top box be misplaced by a faulty user agent, it will cause the red to be shown. (These tests sometimes come in pairs, one checking that the first box is no bigger than the second, and the other checking the reverse.)

Examples:
- http://www.hixie.ch/tests/adhoc/css/box/absolute/001.xml
- http://www.hixie.ch/tests/adhoc/css/box/table/010.xml
The green paragraph and the blank page. These tests appear to be identical to the green paragraph tests mentioned above. In reality, however, they actually have more in common with the green block tests, but with the green block coloured white instead. This type of test is used when the displacement that could be expected in the case of failure is likely to be very small, and so any red must be made as obvious as possible. Because of this, the test would appear totally blank when it has passed. This is a problem because a blank page is the symptom of a badly handled network error. For this reason, a single line of green text is added to the top of the test, reading something like:
```
This line should be green and there should be no red on this page.
```
Examples:
- http://www.hixie.ch/tests/adhoc/css/fonts/size/002.xml
The two identical renderings. It is often hard to make a test that is purely green when the test passes and visibly red when the test fails. For these cases, it may be easier to make a particular pattern using the feature that is being tested, and then have a reference rendering next to the test showing exactly what the test should look like.

The reference rendering could be either an image, in the case where the rendering should be identical, to the pixel, on any machine, or the same pattern made using totally different parts of the CSS specification. (Doing the second has the advantage of making the test a test of both the feature under test and the features used to make the reference rendering.)

Examples:
The positioned text. There are some cases where the easiest test to write is one where the four letters of the word 'PASS' are individually positioned on the page. This type of test is then said to have passed when all that can be seen is the word with all its letters aligned. Should the test fail, the letters are likely to go out of alignment, for instance:
```
PA
  SS
```
...or:
```
SSPA
```
The problem with this test is that when there is a failure it is sometimes not immediately clear that the rendering is wrong. (e.g. the first example above could be thought to be intentional.)

Example:
- http://www.hixie.ch/tests/adhoc/css/box/block/text-indent/001.html
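
To make the green block category more concrete, here is a minimal sketch, written as a small Python function that emits the markup. It is not taken from any of the test suites linked above; the class names, the default size and the wording of the pass condition are all assumptions made for illustration. The red box is placed by the normal block box model and the green box is placed over it with absolute positioning, so any misplacement should uncover some red.

```python
def green_block_test(size_px: int = 100) -> str:
    """Build the markup for a simple 'green block' test page (illustrative only)."""
    return f"""<!DOCTYPE html>
<html>
 <head>
  <title>Green block sketch</title>
  <style>
   .container {{ position: relative; width: {size_px}px; height: {size_px}px; }}
   /* Bottom box: placed with the normal block box model. */
   .control {{ width: {size_px}px; height: {size_px}px; background: red; }}
   /* Top box: placed with absolute positioning; it should hide all the red. */
   .test {{ position: absolute; top: 0; left: 0;
            width: {size_px}px; height: {size_px}px; background: green; }}
  </style>
 </head>
 <body>
  <p>There should be a green square below and no red.</p>
  <div class="container">
   <div class="control"></div>
   <div class="test"></div>
  </div>
 </body>
</html>
"""


print(green_block_test())
```

Saving the output to a file and loading it in a user agent gives a page that fits on one screen and can be judged in a second or two, which is what the manual test runs described above require.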

Ideal tests, as well as having well defined characteristics when they pass, should have some clear signs when they fail. It can sometimes be hard to make a test do something only when the test fails, because it is very hard to predict how user agents will fail! Furthermore, in a rather ironic twist, the best tests are those that catch the most unpredictable failures!

Having said that, here are the best ways to indicate failures:

Red. This is probably the best way of highlighting bugs. Tests should be designed so that if the rendering is a few pixels off some red is uncovered.

Examples:
- http://www.hixie.ch/tests/adhoc/css/box/block/first-line/001.html
Overlapped text. Tests of the 'line-height', 'font-size' and similar properties can sometimes be devised in such a way that a failure will result in the text overlapping.
The word "FAIL". Some properties lend themselves well to this kind of test, for example 'quotes' and 'content'. The idea is that if the word "FAIL" appears anywhere, something must have gone wrong.

Examples:
- http://www.hixie.ch/tests/adhoc/css/box/table/004.html
- http://www.hixie.ch/tests/adhoc/css/box/absolute/002.xml
Scrambled text. This is similar to using the word "FAIL", except that instead of (or in addition to) having the word "FAIL" appear when an error is made, the rest of the text in the test is generated using the property being tested. That way, if anything goes wrong, it is immediately obvious.

Examples:
- http://www.hixie.ch/tests/adhoc/css/quotes/001.xml

These are in addition to those inherent to the various test types, e.g., a difference between the two halves of a "two identical renderings" test obviously also shows a bug.

Tests to avoid

The long test. Any manual test that is so long that it needs to be scrolled to be completed is too long. The reason for this becomes obvious when you consider how manual tests will be run. Typically, the tester will be running a program (such as "Loaderman") which cycles through a list of several hundred tests. Whenever a failure is detected, the tester will do something (such as hit a key) that takes a note of the test case name. Each test will be on the screen for about two or three seconds. If the tester has to scroll the page, that means he has to stop the test to do so.

Of course, there are exceptions -- the most obvious one being any tests that examine the scrolling mechanism! However, these tests are considered tests of user interaction and are not run with the majority of the tests.

In general, any test that is so long that it needs scrolling can be split into several smaller tests, so in practice this isn't much of a problem.

This is an example of a test that is too long:
- http://www.bath.ac.uk/~py8ieh/internet/eviltests/lineheight3.html
The counter intuitive "this should be red" test. As mentioned many times in this document, red indicates a bug, so nothing should ever be red in a test.

There is one important exception to this rule... the test for the 'red' value for the colour properties!

The first subtest on this page shows this problem:
- http://www.people.fas.harvard.edu/~dbaron/css/test/childsel
Unobvious tests. A test that has half a sentence of normal text with the second half bold if the test has passed is not very obvious, even if the sentence in question explains what should happen.

There are various ways to avoid this kind of test, but no general rule can be given since the affected tests are so varied.

The last subtest on this page shows this problem:
- http://www.w3.org/Style/CSS/Test/current/sec525.htm

Techniques

In addition to the techniques mentioned in the previous sections, there are some techniques that are important to consider or to underscore.

Overlapping. This technique should not be cast aside as a curiosity -- it is in fact one of the most useful techniques for testing CSS, especially for areas like positioning and the table model.

The basic idea is that a red box is first placed using one set of properties, e.g. the block box model's margin, height and width properties, and then a second box, green, is placed on top of the red one using a different set of properties, e.g. using absolute positioning.

This idea can be extended to any kind of overlapping, for example overlapping two lines of identical text of different colours.
Special Fonts. Todd Fahrner has developed a font called Ahem, which consists of some very well defined glyphs of precise sizes and shapes. This font is especially useful for testing font and text properties. Without this font it would be very hard to use the overlapping technique with text.

Examples:
- http://www.hixie.ch/tests/adhoc/css/fonts/ahem/001.xml
- http://www.hixie.ch/tests/adhoc/css/fonts/ahem/002.xml
The self explanatory sentence followed by pages of identical text. For tests that must be long (e.g. scrolling tests), it is important to make it clear that the filler text is not relevant, otherwise the tester may think he is missing something and therefore waste time reading the filler text. Good text for use in these situations is, quite simply, "This is filler text. This is filler text. This is filler text.". If it looks boring, it's working!
Colour. In general, using colours in a consistent manner is recommended. Specifically, the following convention has been developed:

Red

Any red indicates failure.

Green

In the absence of any red, green indicates success.

Blue

Tests that do not use red or green to indicate success or failure should use blue to indicate that the tester should read the text carefully to determine the pass conditions.

Black

Descriptive text is usually black.

Fuchsia, Yellow, Teal

These are useful colours when making complicated patterns for tests of the two identical renderings type.

Gray

Descriptive lines, such as borders around nested boxes, are usually light gray. These lines come in useful when trying to reduce the test for engineers. Dark gray is sometimes used for filler text to indicate that it is irrelevant.

Here is an example of blue being used:
- http://www.hixie.ch/tests/adhoc/css/fonts/size/004.xml
Methodical testing. There are particular parts of CSS that can be tested quite thoroughly with a very methodical approach. For example, testing that all the length units work for each property taking lengths is relatively easy, and can be done methodically simply by creating a test for each property/unit combination.

In practice, the important thing to decide is when to be methodical and when to simply test, in an ad hoc fashion, a cross section of the possibilities.

This example is a methodical test of the :not() pseudo-class with each attribute selector in turn, first for long values and then for short values:
- http://www.hixie.ch/tests/adhoc/css/selectors/not/010.xml
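
Because methodical tests are mechanical, they lend themselves to being generated by a script. The following Python sketch enumerates property/unit combinations as described above. The property list is a small illustrative subset, the file naming scheme and the `.test` selector are hypothetical, and a real test would also need markup and a visible pass condition around each rule.

```python
from itertools import product
from typing import Dict

# Illustrative subset only; CSS has many more properties that take lengths.
PROPERTIES = ["margin-left", "padding-left", "text-indent"]
# The CSS2 length units.
UNITS = ["em", "ex", "px", "in", "cm", "mm", "pt", "pc"]


def methodical_tests() -> Dict[str, str]:
    """Return a mapping of hypothetical file names to one style rule each."""
    tests = {}
    for number, (prop, unit) in enumerate(product(PROPERTIES, UNITS), start=1):
        name = f"{prop}-{unit}-{number:03d}.css"
        tests[name] = f".test {{ {prop}: 1{unit}; }}"
    return tests


tests = methodical_tests()
print(len(tests))  # 3 properties x 8 units
for name in list(tests)[:3]:
    print(name, "->", tests[name])
```

The number of generated tests is the product of the sizes of the input lists, which is why the decision between methodical and ad hoc testing matters: adding a third dimension, such as several values per unit, multiplies the count again.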

Glossary

There are many terms which will be encountered when writing or using tests for CSS. This list is by no means complete, but should give the reader a head start.

Full support. The unachievable goal of perfection. A user agent which claims to have "full support" for a specification is claiming the impossible. In addition to the great difficulty in attaining "full support" there is the problem that the specification itself currently has some minor contradictions, and therefore cannot be fully implemented.
100% Support. See full support. Note that Microsoft claim that Internet Explorer has "100% support for CSS1" while meaning that Internet Explorer passes the majority of the tests explicitly mentioned in the W3C CSS1 Test Suite that test the CSS1 core properties and that are not controversial.
Best support. At all times, one particular user agent will have the best implementation of CSS. There is a quite friendly and healthy rivalry between the competing implementors to beat the others in terms of CSS support, and this is probably the main reason for increased support in recent releases of the main browsers.
Complete support. See full support.
Compliant implementation. Claiming to be a compliant implementation is not as bold as claiming full support, but is just as unlikely to be true. The main difference between a full implementation and a compliant implementation is that the specification lists certain aspects as being optional, and therefore one can legitimately fail to implement those parts.
Comprehensive testing. A feature has been comprehensively tested if every possible combination has been tested. This is generally impossible unless the feature is very well defined. For example, testing all possible style sheets to ensure that they are all correctly parsed is impossible, because it would take longer to do that than the estimated lifetime of the universe. However, it is possible (although rather pointless) to perform that exercise for all one byte style sheets.
Exhaustive testing. See comprehensive testing.
Implementation. See user agent.
Methodical testing. This is the antithesis of ad hoc testing. Methodical testing is the act of taking a set of possible input values, and enumerating all permutations, creating a test for each. (Due to the mechanical nature of this process, it is common to create such tests using some sort of script.)
Thorough testing. A feature is said to have been thoroughly tested if it is believed that a reasonably large and well distributed cross section of possible combinations has been tested. This is no guarantee that no bugs are lurking in the untested cases, of course!
User agent. A web browser. Technically, a user agent can be more than just a web browser -- any application that processes CSS is a user agent of some kind. For example, a CSS validator would classify as a user agent.
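
The arithmetic behind the comprehensive testing entry can be sketched as follows. The sketch treats a style sheet as an arbitrary string of bytes, so there are 256 possible sheets of one byte and 256 to the power n of n bytes. The testing rate used below is purely an assumption for illustration.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365
ASSUMED_TESTS_PER_SECOND = 1_000_000  # assumption for illustration only


def count_style_sheets(length_in_bytes: int) -> int:
    """Number of distinct byte strings of exactly the given length."""
    return 256 ** length_in_bytes


def years_to_test(length_in_bytes: int) -> float:
    """Years needed to try every sheet of this length at the assumed rate."""
    seconds = count_style_sheets(length_in_bytes) / ASSUMED_TESTS_PER_SECOND
    return seconds / SECONDS_PER_YEAR


for n in (1, 2, 4, 16):
    print(n, count_style_sheets(n), years_to_test(n))
```

All one byte style sheets can be covered almost instantly, whereas even at sixteen bytes, far shorter than any realistic style sheet, the time required at this rate is many orders of magnitude beyond the age of the universe.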

References

[1] Cascading Style Sheets: http://www.w3.org/TR/REC-CSS2/
[2] Dogfood in the Jargon File: http://www.tuxedo.org/~esr/jargon/html/entry/dogfood.html
[3] The Mozilla Tinderbox: http://tinderbox.mozilla.org/showbuilds.cgi?tree=SeaMonkey
[4] Bugzilla: http://bugzilla.mozilla.org

Contributors

Ian Hickson <ian@hixie.ch>