
Automated GUI testing


ides...@eudoramail.com

Jul 1, 2005, 1:03:15 AM
Hi,

I'm in the process of developing a GUI testing tool. The tool should
provide the ability to define a GUI test case using a 'script' and
then 'simulate' the necessary key/mouse inputs and 'recognize' actual
outputs on the screen using OCR or similar techniques.

The platform I use is MFC with VC 6.0 or 7.0 (unmanaged).

My approach is to use two levels of scripts:
1) Window Definitions Scripts
2) Business Scripts

What I want now is to come up with a scheme to 'define' the layout of a
window. This definition must be 'tied' to the actual app windows, to
enable input simulation and output recognition.

Say, I want to test a Login Screen. Then I want to define the screen
as:

//----------------------------------------------------------------
<Window Name="LoginScreen">
<Component Title="UserID" Type="Lable" X=100 Y=20>
<Component Title="Password" Type="Lable" X=100 Y=50>
...
<Component Title="OK" Type="Button" X=400 Y=100>
<Component Title="Cancel" Type="Button" X=400 Y=130>

<Action Name="OK" Target="OK" ID="IDC_OK" Command="WM_CLICK">
</Window>
//----------------------------------------------------------------


Then to write the business script as:


//----------------------------------------------------------------
LoginScreen scr; // define a LoginScreen object

scr.UserID = "test";
scr.Password = "123";

scr.OK(); // Perform the 'Action' defined above
//----------------------------------------------------------------


If I define 'actions' as in:

<Action Title="OK" Target="OK" ID="IDC_OK" Command="WM_CLICK">

I should be able to send commands to a 'different' app from my test
tool.

Can this be done? How can one application query the window components
of another app and send commands to it?

I would appreciate any insights (or references) into how MS Windows
uses resource IDs, control IDs and window IDs etc. to tie up a UI.
Please share your experiences and comment on the feasibility of the
approach I'm going to take.

To sum up, my main question is how can one 'define' the layout of a
window so that a robot app can read and control it?

Many thanks in advance,
Ishan.

Joseph M. Newcomer

Jul 1, 2005, 3:35:00 AM
There isn't a way to "define" the layout. Period. Forget it. You have to derive the layout
from the actual controls. There is absolutely NO RELIABLE WAY to encode the x,y
coordinates because it depends on the screen resolution, the device driver, the fonts
selected by the user, and so on. You cannot rely on strings like "UserID" or "Password",
because of internationalization. The best you can deal with is control IDs; given the
control ID, you can locate the control, and determine, for the currently running instance,
what its coordinates are, and what its type is.
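
For instance, a minimal sketch of that query, assuming you already have the
dialog's HWND (IDC_USER_ID here is a hypothetical control ID):

//----------------------------------------------------------------
#include <windows.h>
#include <stdio.h>

// Locate a control by its ID and report its type and live coordinates.
void DescribeControl(HWND hDlg, int ctlId)
{
    HWND hCtl = GetDlgItem(hDlg, ctlId);    // locate by control ID
    if (hCtl == NULL)
        return;

    char cls[64];
    GetClassName(hCtl, cls, sizeof(cls));   // "Button", "Edit", etc.

    RECT rc;
    GetWindowRect(hCtl, &rc);               // screen coordinates, queried now

    printf("id=%d class=%s rect=(%ld,%ld)-(%ld,%ld)\n",
           ctlId, cls, rc.left, rc.top, rc.right, rc.bottom);
}
//----------------------------------------------------------------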

That said, there is almost no reliable way to send mouse clicks to an application; lots of
people have tried this and failed. You are in the model of "I want to write a Windows
scripting language", and it is nearly impossible to get right. I've been involved with at
least two projects that failed miserably (a fact which I predicted in both cases, but my
part of the project didn't depend on doing the scripting). Adding to this is Microsoft's
continuous "improvement" of GUIs by creating bizarre controls with non-standard and
undocumented behavior. For example, I used to have a remote-control program that would
force multiple button clicks into Outlook 2000 so I could manage my mail rules in spite of
the hopelessly inadequate tooling provided by Microsoft. I installed Office 2003, and it
uses some bizarre toy created by some programmer with nothing better to do, and there is
no way to determine how this control works. So it is hard to simulate any activity to it.

While we need a good testing tool, it is a very, very difficult task. One thing that is
guaranteed to NOT work is to encode window coordinates into the scripting language. Since
you have no idea how the dialog coordinates will actually be translated (this is a
function of the display driver, and is based on numerous parameters over which you can
assume you have no control), the coordinates are pretty meaningless. And if you wanted
them, creating a new scripting language to represent them is pointless, unless the script
is created by actually reading the .rc file, or by extracting the dialog templates from an
executable and examining them. Having to maintain two representations of controls is
simply an impossible task, and imposes a pointless burden. But having extracted the
DBU values, either from the .rc file or the actual resources, does NOT tell you how they
will actually be instantiated. You need to spend a lot of analysis, including
::GetSystemMetrics calls, to infer what the actual representation is going to look like.
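
(For reference, MapDialogRect is the API that performs that translation for a
live dialog instance, scaling template units by the dialog's actual font; a
minimal sketch, using hypothetical .rc values:)

//----------------------------------------------------------------
#include <windows.h>

// Convert a rectangle expressed in dialog base units (as found in the
// .rc file) into pixels for one specific running dialog instance.
RECT DbuToPixels(HWND hDlg, int x, int y, int cx, int cy)
{
    RECT rc = { x, y, x + cx, y + cy }; // template (DBU) coordinates
    MapDialogRect(hDlg, &rc);           // scaled by this dialog's font
    return rc;                          // pixels, client-relative
}
//----------------------------------------------------------------

Note that the result is valid only for that one running instance, which is
exactly the point.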

Note that your scripting language encodes nothing that could not be derived from the
actual windows themselves, and more reliably. The caption, the coordinates, and the type
are all there. The most likely stable piece of information is not the caption or
coordinates, but the control ID.

CBT hooks were intended to help with this process, but with the increasing tendency within
Microsoft to create completely off-the-wall controls with ill-defined and undocumented
behavior, I think that these efforts at developing scripting languages are ultimately
doomed. I have yet to hear of a single success story. Eventually, everyone bogs down in
one or another of the deep problems involving common controls, common dialogs (especially
common dialogs with enhancements), owner-draw controls, and other nasty realities.

I've not seen a single successful scripting language yet that simulated user interaction.
The problems are deep and largely intractable. Everyone seems to get to the point you are
at; the longer you work on it, the more problems you will encounter, and eventually you
will run into the same solid brick wall everyone else runs into: the fact that you can
neither simulate input accurately nor analyze output.
joe

Joseph M. Newcomer [MVP]
email: newc...@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

ides...@eudoramail.com

Jul 1, 2005, 6:57:43 AM
Hi Joe,

First of all, thanks a lot for sharing your experience.

Well... it seems that we (I'm part of a team doing this) are up against a
difficult task. But I have some questions regarding your comments. I hope
you will consider them too.

From the literature I've read so far, there seem to be two approaches
to GUI testing: position-based and object-based. Position-based is
where you define the (x,y) coordinates, and object-based uses control IDs
etc.

You were mainly referring to the difficulty of using position-based
testing, due to problems in reliably defining coordinates.

But we have a question about whether using control IDs to recognize output
will actually verify what the human eye sees, because Windows will
return us the data contained in a component irrespective of whether it has
been properly drawn on the screen. Right? That's the reason why we
thought of defining screen coordinates in a script. (This, however, is a
separate question.)

We found some 3rd party components for input simulation (AutoIt) and
screen output recognition (ScreenOCR). Given a control ID, AutoIt can
send commands to it, and given coordinates, ScreenOCR reads the text
correctly.

> The best you can deal with is control IDs; given the control ID, you
> can locate the control, and determine, for the currently running
> instance, what its coordinates are, and what its type is.

This comment signals a possibility of using the above components,
provided that we know the control ID of each GUI component we need to
test. What if we query the coordinates at run-time rather than scripting
them? (We also can remove certain variables, such as
internationalization, different Windows versions, etc., from the
equation.)

I would really appreciate it if you could reconsider the changed approach
of using 'control IDs' with the above 3rd party components. Also, please
give some guidance as to where I should start if I'm using 'control IDs'
to query the coordinates, type, etc. of GUI components.

Thanks,
Ishan.

Scott McPhillips [MVP]

Jul 1, 2005, 8:55:55 AM
ides...@eudoramail.com wrote:
> I would really appreciate it if you could reconsider the changed approach
> of using 'control IDs' with the above 3rd party components. Also, please
> give some guidance as to where I should start if I'm using 'control IDs'
> to query the coordinates, type, etc. of GUI components.
>
> Thanks,
> Ishan.
>

You can use FindWindowEx and EnumChildWindows to find a window and
enumerate its controls. For each callback with an HWND you can call
GetClassName and GetWindowLong(GWL_ID...) to determine the control type
and ID. GetWindowRect will give you the coordinates.
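
A minimal sketch of that enumeration (the class name "MyAppWindow" and the
window title "Login" are hypothetical; substitute your real target):

//----------------------------------------------------------------
#include <windows.h>
#include <stdio.h>

// Called once per child window. EnumChildWindows also recurses into
// grandchildren, so nested controls are visited too.
BOOL CALLBACK OnChild(HWND hCtl, LPARAM lParam)
{
    char cls[64];
    GetClassName(hCtl, cls, sizeof(cls));     // control type
    LONG id = GetWindowLong(hCtl, GWL_ID);    // control ID
    RECT rc;
    GetWindowRect(hCtl, &rc);                 // screen coordinates
    printf("id=%ld class=%s (%ld,%ld)-(%ld,%ld)\n",
           id, cls, rc.left, rc.top, rc.right, rc.bottom);
    return TRUE;                              // keep enumerating
}

int main()
{
    // Find the top-level window, then walk its children.
    HWND hTop = FindWindowEx(NULL, NULL, "MyAppWindow", "Login");
    if (hTop != NULL)
        EnumChildWindows(hTop, OnChild, 0);
    return 0;
}
//----------------------------------------------------------------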

--
Scott McPhillips [VC++ MVP]

Joseph M. Newcomer

Jul 1, 2005, 1:47:35 PM
See below...

On 1 Jul 2005 03:57:43 -0700, ides...@eudoramail.com wrote:

>Hi Joe,
>
>First of all, thanks a lot for sharing your experience.
>
>Well... it seems that we (I'm part of a team doing this) are up against a
>difficult task. But I have some questions regarding your comments. I hope
>you will consider them too.

***
Actually, "difficult" is not the characterization that I would use. "Impossible" is a good
adjective. Let's put it this way: you could not hire me, at any rate of pay you choose, to
try to build such a system. I do not think the problem is solvable. I can't even figure
out how I would write test drivers to test specific applications I have written, let alone
solve the general problem.

Note that a testing harness actually requires full Win32 API scripting support as well
(read: VB), because I would need, in my testing code, to actually execute real system
calls.
***


>
>From the literature I've read so far, there seem to be two approaches
>to GUI testing: position-based and object-based. Position-based is
>where you define the (x,y) coordinates, and object-based uses control IDs
>etc.
>
>You were mainly referring to the difficulty of using position-based
>testing, due to problems in reliably defining coordinates.
>
>But we have a question about whether using control IDs to recognize output
>will actually verify what the human eye sees, because Windows will
>return us the data contained in a component irrespective of whether it has
>been properly drawn on the screen. Right? That's the reason why we
>thought of defining screen coordinates in a script. (This, however, is a
>separate question.)

****
"Recognize output" already is a problem. What do you mean by this? For example, in an
owner-draw listbox without LBS_HASSTRINGS, what you get "returned" is an address in the
context of the process, which does you no good at all. You can't really use the address
effectively, but even if you figure out how to use the debugger calls, you then have to
understand the class/struct to figure out how the data is represented, and this could
change from release to release. Think about the problems of invoking a virtual method to
obtain information.

Example: take a look at my logging listbox control, on my MVP Tips site, and explain
exactly how you would analyze its output. And this is a SIMPLE example compared to, say,
the Outlook 2003 Tools>Create Rules dialog that comes up.

So Windows CAN'T "tell you the data" because it is impossible to do this except for a few
trivial built-in controls. Explain exactly how you are going to pass an LVITEM or TVITEM
structure across a process boundary so it can be filled in with the information about the
list control or tree control state, for example. (Hint: think "DLL Injection". Now figure
out how, even with DLL injection, you can interpret the data obtained without knowing
details of the structure that is contained in the LPARAM field of the TVITEM. I can send
you a program that will tax any algorithm you propose)
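
(To make the problem concrete: even the lighter-weight variant of that trick,
allocating the structures inside the target process rather than injecting a
DLL, only retrieves the visible text. A rough sketch, assuming an ANSI build;
the owner data behind LVITEM.lParam remains opaque, which is the whole point:)

//----------------------------------------------------------------
#include <windows.h>
#include <commctrl.h>

// Read the text of one list-view item that lives in another process.
// The LVITEM and its text buffer must be in THAT process's address space.
BOOL GetRemoteItemText(HWND hList, int item, char *out, int cch)
{
    DWORD pid = 0;
    GetWindowThreadProcessId(hList, &pid);
    HANDLE hProc = OpenProcess(PROCESS_VM_OPERATION | PROCESS_VM_READ |
                               PROCESS_VM_WRITE, FALSE, pid);
    if (hProc == NULL)
        return FALSE;

    // Allocate the LVITEM plus its text buffer inside the target process.
    SIZE_T size = sizeof(LVITEM) + cch;
    LPBYTE remote = (LPBYTE)VirtualAllocEx(hProc, NULL, size,
                                           MEM_COMMIT, PAGE_READWRITE);
    BOOL ok = FALSE;
    if (remote != NULL) {
        LVITEM lvi = {0};
        lvi.iSubItem   = 0;
        lvi.cchTextMax = cch;
        lvi.pszText    = (LPSTR)(remote + sizeof(LVITEM)); // remote address!
        WriteProcessMemory(hProc, remote, &lvi, sizeof(lvi), NULL);

        SendMessage(hList, LVM_GETITEMTEXT, item, (LPARAM)remote);

        ok = ReadProcessMemory(hProc, remote + sizeof(LVITEM), out, cch, NULL);
        VirtualFreeEx(hProc, remote, 0, MEM_RELEASE);
    }
    CloseHandle(hProc);
    return ok;
}
//----------------------------------------------------------------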

The only way to figure out what is going on is to analyze the pixels, and this means
something like OCR. And pray that the application doesn't do an owner-draw listbox that
uses icons to convey information. I have one app that has three icons on the left, and
those three icons are selected from sets of icons, so the combination of three icon
positions represent about 40 or so states. How do you determine that I'm displaying the
correct state as encoded in the icons? Oh yes, one way is to use a pre-canned database
(the icon selections depend on the database contents, which are modified by incoming input
data) and pre-canned input data, which means you will have to, for this particular
application, create a simulation script that sends network messages in a particular
syntax, using a particular IP address and socket, and then, before sending the next
message, make sure the response to the first message is correct.

Now deal with the fact that some messages may change the behavior based upon whether or
not they are received during or after certain actions with respect to the first message
(if during the processing, a new message for the same remote device could terminate
processing of the earlier message; but if already processed, it sets a different state).
And the simulator must be able to properly handle the response I send out, and send the
appropriate response back.

Test to make sure I respond properly to syntactically incorrect messages, by putting up
the correct information on the screen, and enabling/disabling appropriate menu items.

Make sure that, given a particular selection in a listbox, list control, combo box, etc.,
I properly enable/disable the appropriate controls, menu items, etc.

Make sure that copy, paste, cut, and delete options are properly selected, just as a
beginning example. Note that what I implement is that if multiple selections are made in a
listbox, a "copy" operation will place, in the clipboard, the proper text (which might not
be the actual screen contents).

Here's one from a liquid CO2 analyzer: the data in the structure came from a
temperature converter that returned temperature in 1/64 degree C. But the end user might
choose F, C or K as the display. If a row of data from the grid control was placed in the
clipboard, the temperature data was in 1/64C units, although the text displayed was a
conversion to integral degrees F, C or K. So I would do a copy, the user could change the
display mode from C to K, and then do a paste, so the numbers pasted were not the numbers
copied, although (within the limits of the conversion to an integer value, which was
unimportant to this application, because external constraints did not care about the small
errors) the values themselves were "identical". Write a test script that proves that
copy-change representation-paste produces the correct result.

I've built scripting systems to test programs such as Windows control programs for
embedded systems; I build an embedded system simulator (often because the client does not
yet have working devices). These are hard to write. How do you propose to build a testing
system that would interface to a simulator component to provide the input data stream (the
input stream is not just user interaction. It is often user interaction in the context of
live data streams).

Since you can't define screen coordinates effectively, it is not clear how you would
maintain correspondence between your debugging script and each release.

Examples: two bitmap buttons. I decide to exchange their positions. How does your script
cope with this? Version 1.1 had two icon buttons. Version 2.0 has three icon buttons. How
do you determine their new arrangement? What if I move the listbox from the left to the right of
the dialog? What about situations in which I dynamically rearrange controls, so their size
and/or positions change as the window is resized? (Actually, my FIRST Windows app, back in
1990 or so, had a dialog with two columns, with the usual "Add", "Remove", "Add All" and
"Remove All" buttons. As you resized the dialog, the sizes of the two listboxes changed,
but the buttons remained fixed size, but always remained between the two listboxes. Hence,
static analysis not only fails because it is unmaintainable, it fails because dialogs,
form views, etc. dynamically resize).

One app I can't send you (but you can get by buying a $15,000 controller...) has edit
controls, combo boxes, check boxes, and radio buttons as child controls of CListBox. These
controls are created dynamically based upon reading a controller configuration file,
which defines abstract properties of the controller, and I create the controls on the fly.
Note that since these controls are child controls of a listbox, they scroll with the
listbox! In a case like this, you cannot do a simple two-level enumeration to obtain the
controls, you cannot predict the control IDs (which I assign dynamically), and the
positions are non-constant. Oh yes, there are somewhere between 20 and 100 list boxes,
each in a tabbed dialog. How do you handle tab controls with child dialogs? How do you
handle the case where the tabs are scrolled? How do you handle the case where the tabs are
stacked? How do you handle the case where the number of tabs depends upon runtime
information? Where the number of tabs changes each time the program executes, and you can
only determine the correct number of tabs by reading the same configuration file I read
and figuring out what I did (which, I might add, is nontrivial).


>
>We found some 3rd party components for input simulation (AutoIt) and
>screen output recognition (ScreenOCR). Given a control ID, AutoIt can
>send commands to it, and given coordinates, ScreenOCR reads the text
>correctly.

****
Interesting. Will ScreenOCR handle rich-edit controls, an owner-draw CListCtrl with multiple
fonts, and owner-draw controls (how about one with rotated text? I've got one of these
right now)? And what about my controls that are now displaying Hebrew, Japanese, Chinese,
Arabic, Korean and a dozen other scripts I also cannot read myself? Will it check my
owner-draw pushbutton to make sure the arrow is pointing in the correct direction? What
does it do with the app I have that displays circle-slashed icons in the tabs for those
tabs that are illegal in the current context? Will it verify that I have enabled/disabled
the correct tabs? Will it check my owner-draw combo box that uses non-textual output, such
as line shapes with a radio button indicating the one selected?

What about constraint management? For example, how will you express rules of the form "If
the thus-and-such field is blank, disable OK", "If the checkbox so-and-so is checked, the
listbox should be enabled" "If the combo box selection is thus-and-such, the following
controls should be visible, and these other controls should be invisible". How do you
impose rules that state the contents of controls that are sensitive to the context the
user is running in? "If the code page of the end user is thus-and-such, use this API to
get the correct character to use for this purpose"? "For locales in Europe, make sure the
correct digit separator is used"? (And test this out for a German user who is running
Windows in Indiana, but wants to see familiar information representations). What about "If
this item is selected in the rich edit control A, then the following text in rich edit
control B should also be selected"? (That is a piece of code sitting on the screen next to
me right this moment). "The output should be underlined in groups computed by the contents
of the text" (the bug I'm working on right now...there's an error in my highlighting code,
which I just wrote yesterday). What about parsing pictures (for example, making sure that
the values selected result in the correct part of the picture being highlighted)? What
about making sure that listbox items in a draggable listbox are properly dragged and end
up in the right place?


>
>> The best you can deal with is control IDs; given the control ID, you
>> can locate the control, and determine, for the currently running
>> instance, what its coordinates are, and what its type is.
>
>This comment signals a possibility of using the above components,
>provided that we know the control ID of each GUI component we need to
>test. What if we query the coordinates at run-time rather than scripting
>them? (We also can remove certain variables, such as
>internationalization, different Windows versions, etc., from the
>equation.)

****
You HAVE to query the components the INSTANT BEFORE YOU TRY TO USE THEM. You cannot even
enumerate them at startup time, because between two tests, their positions will change.
Between any two activations of a dialog, the layout could change (for example, I have an
application that has one dialog box that rearranges the controls based on the size of the
picture that is displayed).

Here's one scenario from an application I have right now: right click on part of the
screen. Get a menu item that says "Maximize". The window is maximized, and all the
contents are rearranged for the new geometry. Now send an activation request to a
particular control. If you don't do it via the control ID, AT THAT INSTANT, you would not
know where the control was.

Oh yes, here's another case: a dialog-based application for which there is only one
control ID. I use TextOut to draw the text in that one-and-only control. In a variety of
languages and fonts. Some of the text is conditional based on program state. How do you
test that I'm displaying the correct information relative to the internal program state?
And the program state changes dynamically based on internal computations, such as
timeouts. Can you verify that my program is working correctly? Can you test it?

What about a user-defined control that does its own selection highlighting? Can you tell
where the text is to highlight? Suppose it is as powerful as rich edit, so the text is in
different sizes. How do you check that my code is working correctly? Did I really put the
error message in red? Did I really put the warning message in italic? Did I use boldface
in the proper places?

Verify the proper tooltip message pops up. What is the control ID of a tooltip message?
What if it is a window that I created that simulates a tooltip, but is more powerful?

How do you deal with situations where the number of controls depends upon external state,
and the control IDs, layout, etc. are all determined at runtime based upon input data. How
do you test that the correct configuration has been constructed?

What about the "Do not enable this control if the operating system does not support this
feature"? I have controls that disable on XP but enable on Server 2003.

(Note that I've actually built and delivered code that contains all these features I'm
discussing)

I have a piece of pre-Alpha code--I'm still heavily involved in writing and debugging
it--that would be a serious test of any scripting engine design. If you agree to keep the
current version confidential and not distribute it, I'll even send you the existing code.
(I will be releasing it as open source as soon as it all works). Now your problem would be
to determine how you would check it for correctness, so having once created a testing
script, you could test the final release (or at least the subset of the final release that
corresponds to what I've sent). This represents perhaps 25% of the techniques I have
actually used, so it wouldn't be a full test of all the complex cases, but it would
represent the minimum you would have to achieve.

How about a couple of scenarios based on time? For example, one product I deliver requires
that a script be executed at, say, 8am. Can you test to make sure that this script is
indeed executed at 8am? Can you even FIND OUT that such a script exists, given there may
be no display currently active on the screen that indicates this should be so?

And what about the speech output from that app? Can you verify that it actually says
"oh-three-hundred" for 3:00? (Yes, I admit that this program is highly ethnocentric. But
it is a 16-bit Windows program designed for the English-only marketplace). Note that the
Microsoft Text-to-speech engine translates 03:00 as "three" <short pause> "zero". (Oh, you
ask, how do you do TTS in a 16-bit program? The answer is, I don't. I do the TTS interface
in a 32-bit co-process which has its OWN GUI; I prep the text in the 16-bit app and send
the rendered TTS text to the 32-bit app, which also handles network traffic).

For that matter, take a situation where I receive a message via the network. This message
triggers a sequence of actions, some of which have GUI output, but most of which don't. Or
the output is optional depending upon what view is being displayed. Can you test that my
program is responding correctly to input across the network?

For that matter, consider the case of a two-view splitter window. Can you verify that a
change to the contents of one view is properly represented in the other view? Suppose the
other view is a graphical view?

The assumption that you can actually read the contents of the text, even if you can
capture the bits, is naive.

Here's one: I have a vertically-scrollable, horizontally-scrollable grid control. You
can't see some of the bits because they are off-window. Can you test my program?

Oh yes, this particular grid control highlights illegal values in red. Can you test that
it is properly highlighting illegal values? Write the constraint equation in your
scripting language. Assume that in order to verify the information in cell (to use Excel
terminology) L27, I have to use data from cells A3, B7..B9, C22, C23, and
L1..L6. This is a custom third-party grid control (e.g., Stingray, Dundas). Explain
exactly how you plan to test this program under a variety of inputs. How do you write the
script to test that it is behaving correctly? How do you write the script to enter the
data in the first place? Note that entering some data will cause some of the columns to
widen to accommodate the text, while entering other data will create a
horizontally-scrolling edit control. When the data is displayed, if the field is too
short, it might be displayed as "#######", and expect the user to manually widen the
column to make it visible. Or it might just truncate it, perhaps displaying only half a
letter (see what your OCR does with a partial letter). What about the case of ellipses?
Can you test that the data behind the ellipsis representation is valid in spite of the fact
you can't see it on the screen?
*****


>
>I would really appreciate it if you could reconsider the changed approach
>of using 'control IDs' with the above 3rd party components. Also, please
>give some guidance as to where I should start if I'm using 'control IDs'
>to query the coordinates, type, etc. of GUI components.
>

****
GetWindowRect is the way to get the coordinates. ScreenToClient will give you the client
coordinates relative to the parent window (use GetParent).
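
A minimal sketch of that query:

//----------------------------------------------------------------
#include <windows.h>

// Screen coordinates of a control, translated into its parent's
// client space. (For RTL-mirrored windows, MapWindowPoints is safer.)
RECT GetRectInParent(HWND hCtl)
{
    RECT rc;
    GetWindowRect(hCtl, &rc);                     // screen coordinates
    HWND hParent = GetParent(hCtl);
    ScreenToClient(hParent, (LPPOINT)&rc.left);   // top-left corner
    ScreenToClient(hParent, (LPPOINT)&rc.right);  // bottom-right corner
    return rc;                                    // parent-client coordinates
}
//----------------------------------------------------------------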

(If you have to ask this question, it strongly suggests that you really have no idea what
you're getting into! One other observation I've made is the less experience a Windows
programmer has, the more confident said programmer is that he or she knows how to write a
GUI scripting language).
****
>Thanks,

Kurt Grittner

Jul 1, 2005, 4:46:56 PM
Hi Ide,

I did a program like this, but it wasn't for testing; it was to
completely hide an expensive, hard-to-use licensed program and present
to the users a simple, cut-down interface. Amazingly, this project
succeeded. It succeeded for a few key reasons.

1. Keyboard accelerators are your friends.
2. Command buttons are your enemy, especially those that can change
meaning with context.
3. Text boxes are your friends.
4. Bitmaps are your enemy.
5. Menus are your friend only if each menu item has a keyboard
accelerator associated with it.
6. Variable menus are your enemy.
7. Focus (or the loss of it) is your enemy.

Keeping these ideas in mind, you can succeed if you control the program
to be tested and you can limit your remote control to keyboard
accelerators and setting text properties in dialogs, etc. Thoughtful
naming can make your job of locating the proper window and control
easier.

In my case, the program to be manipulated was used by an operator in a
call center. The interface was considered too complicated for our
users to master. I never expected it to work at all, but somehow it
worked marvelously well.

So here's how I attacked the problem. I coded one step at a time. I
used SPY++ heavily to see what messages were flowing in normal
operation of the program. Then I wrote code to find those windows,
give them focus, and send them the same messages. I was able to avoid
having to simulate mouse movement (thank god).

I used a lot of FindWindow followed by PostMessage of WM_COMMAND.
For dialogs I used FindWindow with EnumChildWindows, wherein I would
read the window class name, compare it against "Button", then look at
the caption, strip the ampersands, and compare the caption against "OK"
or "Yes" or "Cancel". Then I would send that window WM_SETFOCUS,
WM_LBUTTONDOWN, and WM_LBUTTONUP. For text boxes on dialogs I would send
the same three messages followed by EM_SETSEL 0, and WM_KEYDOWN messages
for each character I wanted to load into the control on the (invisible)
dialog.
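
The button-pushing part looked roughly like the following sketch; it's a
reconstruction from memory, not the original code, and the dialog title
"Confirm" is hypothetical:

//----------------------------------------------------------------
#include <windows.h>

// Remove the accelerator markers: "&OK" -> "OK".
static void StripAmpersands(char *s)
{
    char *d = s;
    for (; *s; ++s)
        if (*s != '&')
            *d++ = *s;
    *d = '\0';
}

struct FindCtx { const char *caption; HWND found; };

// Enumeration callback: match class "Button" and the stripped caption.
static BOOL CALLBACK FindButton(HWND hChild, LPARAM lp)
{
    FindCtx *ctx = (FindCtx *)lp;
    char cls[32], text[64];
    GetClassName(hChild, cls, sizeof(cls));
    if (lstrcmpi(cls, "Button") == 0) {
        GetWindowText(hChild, text, sizeof(text));
        StripAmpersands(text);
        if (lstrcmpi(text, ctx->caption) == 0) {
            ctx->found = hChild;
            return FALSE;                    // stop enumerating
        }
    }
    return TRUE;
}

// Find a dialog by title (e.g. "Confirm") and push one of its buttons.
void ClickDialogButton(const char *dlgTitle, const char *btnCaption)
{
    HWND hDlg = FindWindow(NULL, dlgTitle);
    if (hDlg == NULL)
        return;
    FindCtx ctx = { btnCaption, NULL };
    EnumChildWindows(hDlg, FindButton, (LPARAM)&ctx);
    if (ctx.found != NULL) {
        PostMessage(ctx.found, WM_SETFOCUS, 0, 0);  // the sequence above
        PostMessage(ctx.found, WM_LBUTTONDOWN, MK_LBUTTON, 0);
        PostMessage(ctx.found, WM_LBUTTONUP, 0, 0);
    }
}
//----------------------------------------------------------------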

I also had to use GetMenu to observe status information, which was
thankfully visible as checkmarks on menu items.
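
That observation can be a one-liner per item (ID_SOME_ITEM would be a
hypothetical command ID in the target program):

//----------------------------------------------------------------
#include <windows.h>

// Check whether a menu command is currently checkmarked.
BOOL IsMenuItemChecked(HWND hWnd, UINT cmdId)
{
    HMENU hMenu = GetMenu(hWnd);            // NULL if no menu bar
    if (hMenu == NULL)
        return FALSE;
    UINT state = GetMenuState(hMenu, cmdId, MF_BYCOMMAND);
    return (state != (UINT)-1) && (state & MF_CHECKED);
}
//----------------------------------------------------------------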

Incredibly, this program worked even though the underlying program
went through several versions. It was used on 50 workstations at
once.

So, if you control that target program then you can make your job much
easier by making everything visible in these ways which are fairly
straightforward. Avoid menus that add, change, or delete items all the
time. Use accelerators on everything you can. Always supply check
marks on menu items for things that you want to observe. Always use
simple text controls and unambiguous command buttons where you know
you will need to feed text into the target program.

I was lucky in my project because I only had to interact with perhaps
20 of these points in the target program. I wish you similar good
luck with your project.

Hope this helps,

- Kurt
