GMAT Math Strategies — Solving Not-so-simple Statistics Problems

This tutorial is one of two focusing on GMAT math strategies. In this Q&A, you'll gain insight into descriptive statistics problems — specifically, those involving the concepts of arithmetic mean (simple average), median and range. You'll discover that these seemingly simple concepts can make for surprisingly difficult GMAT questions, unless you're truly prepared for them.

Q: Can you briefly define the term descriptive statistics, and describe what aspects of descriptive statistics GMAT test takers are likely to encounter?

A: The term descriptive statistics embraces such concepts as arithmetic mean (simple average), median, range, and standard deviation. All but the last of these concepts are easily understood. Just for the record, here's a definition of each one:

For any set of numerical terms:

  • The arithmetic mean is the sum of the terms, divided by the number of terms in the set.

  • The median is the middle term in value if the set contains an odd number of terms, or the arithmetic mean (average) of the two middle terms if the set contains an even number of terms.

  • The range is the difference on the real-number line between the term with the greatest value and the term with the least value.
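As a quick check on these three definitions, here is a minimal Python sketch. The function name `describe` and the sample set are illustrative assumptions, not part of any GMAT material; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

def describe(terms):
    """Return (arithmetic mean, median, range) for a non-empty list of numbers."""
    return mean(terms), median(terms), max(terms) - min(terms)

# Hypothetical set with an even number of terms, so the median is the
# average of the two middle terms (1 and 4).
am, med, rng = describe([7, -2, 4, 1])
print(am, med, rng)
```

Note that the range is computed as greatest minus least, so a set with negative terms can have a range larger than its greatest term.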

You should understand each of these terms, because the test makers won't provide you with their definitions during the test. By the way, the same goes for standard deviation — a more advanced statistical concept that inherently makes for a relatively challenging GMAT question.

Q: The three concepts you just defined seem very straightforward. How can the test makers design challenging questions — or even moderately difficult ones — involving these concepts?

A: To increase the difficulty of questions that focus on these concepts, the test designers often use variables, which add an algebraic dimension to these questions. Consider, for instance, the following Problem Solving question (answer choices are omitted here):

Which of the following expressions represents the arithmetic mean (average) of the five terms p, q, p + q, p – 1, and q + 1?

Solving the problem requires not only application of the arithmetic-mean concept, but also a bit of algebraic manipulation, rendering the question a bit more complex than simply adding together five numbers and dividing by five. Here are the algebraic steps, plugging the variable expressions into a general equation for arithmetic mean (AM):

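$$
\begin{aligned}
\text{AM} &= \frac{p + q + (p + q) + (p - 1) + (q + 1)}{5} \\
&= \frac{3p + 3q}{5} \\
&= \frac{3}{5}(p + q)
\end{aligned}
$$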

Still not too difficult, is it? So to further increase the difficulty level of an arithmetic-mean question, the test makers might provide the arithmetic mean and ask instead for the value of an unknown term in the set. Consider this variation on the problem just solved (again, omitting answer choices):

Assume that 3/5(p + q) represents the arithmetic mean (average) of a set containing five terms. If the set includes the four terms p, q, p + q, and p – 1, which of the following expressions represents the fifth term?

This one's a bit more challenging, isn't it? It's more difficult to understand and to determine how to approach and solve. Moreover, although you apply the same arithmetic-mean formula to solve this problem as for the previous one, you need to perform more algebraic steps along the way:

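$$
\begin{aligned}
\frac{3}{5}(p + q) &= \frac{p + q + (p + q) + (p - 1) + x}{5} \\
3(p + q) &= 3p + 2q - 1 + x \\
3p + 3q &= 3p + 2q - 1 + x \\
x &= q + 1
\end{aligned}
$$

Here x stands for the unknown fifth term, which turns out to be q + 1.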
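The algebra in the two arithmetic-mean problems above can be spot-checked numerically. The sketch below is only a sanity check under stated assumptions: the helper names `arithmetic_mean` and `fifth_term` and the grid of sample values are hypothetical, and exact `Fraction` arithmetic is used so that equality tests are reliable.

```python
from fractions import Fraction
from itertools import product

def arithmetic_mean(terms):
    """Sum of the terms divided by the number of terms (exact for Fractions)."""
    return sum(terms) / Fraction(len(terms))

def fifth_term(p, q):
    """Solve AM = 3/5(p + q) for the unknown fifth term."""
    required_sum = 5 * Fraction(3, 5) * (p + q)
    return required_sum - (p + q + (p + q) + (p - 1))

# Hypothetical sample grid: -3, -2.5, ..., 3 for both p and q.
sample_values = [Fraction(n, 2) for n in range(-6, 7)]
for p, q in product(sample_values, repeat=2):
    # First problem: the mean of the five terms is 3/5(p + q).
    assert arithmetic_mean([p, q, p + q, p - 1, q + 1]) == Fraction(3, 5) * (p + q)
    # Second problem: the missing fifth term is q + 1.
    assert fifth_term(p, q) == q + 1
```

Passing checks on sample values do not prove the identity, but a failed assertion would quickly expose an algebra slip.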

Q: What about the concepts of median and range? How might the test makers design a challenging GMAT question involving either of these simple concepts?

A: Again, the use of variables, instead of or in addition to numbers, adds complexity to the question. Also, the concepts of median and range are typically incorporated into an arithmetic-mean problem. Here's a Problem Solving question that employs both devices (once again, answer choices are omitted here):

If 0 < q < p, and if the median of the four terms p, q, p + q, and q – p is 2, what is the arithmetic mean (average) of the four terms?

Your first task here is to rank the four terms from least to greatest in value. Given q < p and that p and q are both positive, q – p must be negative and hence lowest in value among the four terms, while q + p must be greatest in value among the four terms. Here are the four terms, then, ranked from least to greatest in value:

(q – p) ... q ... p ... (p + q)

The median value, given as 2, is the average (arithmetic mean) of the two middle terms q and p:

$$
\frac{q + p}{2} = 2 \quad\Longrightarrow\quad p + q = 4
$$

To answer the question, you can substitute the value 4 for (p + q) in the arithmetic-mean formula:

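$$
\begin{aligned}
\text{AM} &= \frac{(q - p) + q + p + (p + q)}{4} \\
&= \frac{(q - p) + 4 + 4}{4} \\
&= 2 + \frac{q - p}{4}
\end{aligned}
$$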

Had the question asked instead for the range of values in the set, once you've determined the lowest and highest valued terms, you could express the range as the sum of the greatest term's value, which you know is positive, and the absolute value of the lowest value, which you know is negative:

$$
\text{range} = (p + q) + |q - p| = (p + q) + (p - q) = 2p
$$
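To see the median, mean, and range of this example with concrete numbers, here is a small sketch. It takes the fourth term to be q – p (the term the solution describes as negative), and the function name `summarize` and the two sample pairs are assumptions chosen so that 0 < q < p and p + q = 4.

```python
from statistics import mean, median

def summarize(p, q):
    """Return (median, arithmetic mean, range) of p, q, p + q, and q - p."""
    terms = [p, q, p + q, q - p]
    return median(terms), mean(terms), max(terms) - min(terms)

# Hypothetical pairs satisfying 0 < q < p and p + q = 4.
for p, q in [(3, 1), (2.5, 1.5)]:
    med, am, rng = summarize(p, q)
    assert med == 2                  # the given median
    assert am == 2 + (q - p) / 4     # mean after substituting p + q = 4
    assert rng == 2 * p              # greatest term plus |least term|
    print(p, q, med, am, rng)
```

The two pairs share the same median but give different means and ranges, which shows that the median alone fixes only the sum p + q, not p and q individually.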

Q: So far you've used examples only in the Problem Solving format. How do the test makers employ the Data Sufficiency format to cover the concepts of arithmetic mean, median, and range?

A: First consider the general equation for arithmetic mean (average), which contains three distinct components:

  1. The arithmetic mean
  2. The number of terms in the set
  3. The sum of the terms in the set

If you're given any two of these, you can determine the third. Thus the correct response to the following Data Sufficiency question would be (C):

How many sweaters does Hritik own?

(1) Hritik paid an average of $25 for each sweater he owns.

(2) Hritik paid a total of $240 for all of the sweaters he owns.

  (A) Statement (1) ALONE is sufficient to answer the question, but statement (2) alone is NOT sufficient.
  (B) Statement (2) ALONE is sufficient to answer the question, but statement (1) alone is NOT sufficient.
  (C) BOTH statements (1) and (2) TOGETHER are sufficient to answer the question, but NEITHER statement ALONE is sufficient.
  (D) Each statement ALONE is sufficient to answer the question.
  (E) Statements (1) and (2) TOGETHER are NOT sufficient to answer the question.

This is a very simple example, of course. Just as with Problem Solving questions, to enhance the difficulty of an arithmetic-mean question in the Data Sufficiency format the test makers will often incorporate either the median or range concept into the question.
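The "any two determine the third" relationship behind this kind of question can be sketched in a few lines. The function name `missing_component` and the sample figures (8 items averaging $25) are hypothetical and chosen only for illustration.

```python
def missing_component(mean=None, count=None, total=None):
    """Given exactly two of mean, count, and total, return the third."""
    given = sum(value is not None for value in (mean, count, total))
    if given != 2:
        # One value alone is never enough, which is why neither
        # statement is sufficient by itself in such a question.
        raise ValueError("exactly two of the three components are required")
    if mean is None:
        return total / count
    if count is None:
        return total / mean
    return mean * count

print(missing_component(mean=25, count=8))     # total
print(missing_component(total=200, count=8))   # mean
print(missing_component(mean=25, total=200))   # count
```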

Q: Can you illustrate how an arithmetic-mean question in the Data Sufficiency format can be made more difficult by incorporating the concept of either median or range?

A: Sure. Here's a Data Sufficiency question that incorporates certain information about range. You'll probably agree that this example, which also involves Hritik's sweaters, is far more difficult than the previous one:

If Hritik paid an average of $25 per sweater for four sweaters, one of which was more expensive than any of the others, how much did he pay for the most expensive sweater?

(1) The amount Hritik paid for the most expensive sweater was $25 more than the lowest amount he paid for a sweater.

(2) Hritik paid an average of $20 per sweater for three of the sweaters.

  (A) Statement (1) ALONE is sufficient to answer the question, but statement (2) alone is NOT sufficient.
  (B) Statement (2) ALONE is sufficient to answer the question, but statement (1) alone is NOT sufficient.
  (C) BOTH statements (1) and (2) TOGETHER are sufficient to answer the question, but NEITHER statement ALONE is sufficient.
  (D) Each statement ALONE is sufficient to answer the question.
  (E) Statements (1) and (2) TOGETHER are NOT sufficient to answer the question.

First consider statement (1) alone, which provides the range of values in the set. Without more information about the price of individual sweaters it is not possible to answer the question. Next consider statement (2) alone, which establishes that Hritik paid a total of $60 for three of the four sweaters. Given an average price of $25 for each of the four sweaters, the total for all four sweaters was $100. Thus the fourth sweater must have cost $40. But is that $40 sweater necessarily the most expensive one? No. For example, the three sweaters whose total cost was $60 might have cost $45, $10, and $5 individually. Thus statement (2) alone does not suffice to answer the question.

Considered together, however, statements (1) and (2) establish that the most expensive sweater must have cost $40. Why? Assume the contrary: that the $40 sweater was not the most expensive one. Then the most expensive sweater cost more than $40, so by statement (1) the least expensive sweater must have cost more than $15, as must the remaining sweater. But then the total cost of all four sweaters would exceed $100:

($40) + ($40+) + ($15+) + ($15+) > $100

Since the contrary assumption is impossible, the most expensive sweater must have cost $40, and the correct response to this question is (C).
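The sufficiency reasoning for the four-sweater question can also be checked by brute force. The sketch below assumes whole-dollar prices of at least $1 (an assumption made only to keep the search finite; the question itself does not say so), and all function names are hypothetical.

```python
def price_sets(total=100):
    """Yield sorted whole-dollar prices (a <= b <= c < d) that sum to `total`.

    Requiring c < d enforces that one sweater cost more than any other.
    """
    for a in range(1, total):
        for b in range(a, total):
            for c in range(b, total):
                d = total - a - b - c
                if d > c:
                    yield (a, b, c, d)

def statement_1(prices):
    # The most expensive sweater cost $25 more than the cheapest one.
    return prices[-1] - prices[0] == 25

def statement_2(prices):
    # Some three of the sweaters average $20, that is, total $60.
    return any(sum(prices) - price == 60 for price in prices)

def possible_top_prices(*statements):
    """Collect every top price consistent with the given statements."""
    return {prices[-1] for prices in price_sets()
            if all(check(prices) for check in statements)}

print(sorted(possible_top_prices(statement_1)))               # several values
print(sorted(possible_top_prices(statement_2)))               # several values
print(sorted(possible_top_prices(statement_1, statement_2)))  # a single value
```

Each statement alone leaves more than one possible top price, while the two together leave only $40, which matches response (C).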

Q: What other devices might the test makers use to enhance the difficulty of Data Sufficiency questions involving arithmetic mean, median, and range?

A: To answer this question, let's revisit the set of variable expressions from the first arithmetic-mean example:

S: {p, q, p + q, p – 1, q + 1}

Determining the median value of these five terms requires additional information. The median value would depend on:

  • The signs of p and q — whether p and q are positive or negative
  • Which value — p or q — is greater, and by how much

For example, assuming p > q, whether (p + q) is greater or less than p and q depends on the sign of q. If q is positive, then (p + q) > p > q. But if q is negative, then p > (p + q) > q. Even if you assume p and q are both positive, the median value depends on the difference between p and q. If the difference is less than 1, then the median is p; if the difference is between 1 and 2, then the median is (q + 1); and if the difference is greater than 2, then the median is (p – 1):

If p – q < 1, then (q + 1) > p > q > (p – 1), and (p + q) > p.

If 1 < p – q < 2, then (p + q) > p > (q + 1) > (p – 1) > q.

If p – q > 2, then (p + q) > p > (p – 1) > (q + 1) > q.

These sorts of dynamics between variable expressions are great fodder for Data Sufficiency questions, because whether you can determine the relationships between the expressions depends on how much and what type of information you're provided about them. For example, here's the scenario we just looked at, transformed into a Data Sufficiency question:

Among the five terms p, q, p + q, p – 1, and q + 1, which represents the median value?

(1) p > q

(2) p – q < 1

  (A) Statement (1) ALONE is sufficient to answer the question, but statement (2) alone is NOT sufficient.
  (B) Statement (2) ALONE is sufficient to answer the question, but statement (1) alone is NOT sufficient.
  (C) BOTH statements (1) and (2) TOGETHER are sufficient to answer the question, but NEITHER statement ALONE is sufficient.
  (D) Each statement ALONE is sufficient to answer the question.
  (E) Statements (1) and (2) TOGETHER are NOT sufficient to answer the question.

The correct answer is (E). Even considering both statements (1) and (2) together, the median value depends on the signs of p and q.
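A pair of counterexamples makes the insufficiency concrete. In the sketch below, statement (2) is taken to mean that the difference p – q is less than 1; the function name `median_term` and the two sample pairs are hypothetical, and the sample values are chosen so that no two terms tie.

```python
from fractions import Fraction

def median_term(p, q):
    """Return the label of the median among p, q, p + q, p - 1, and q + 1."""
    terms = [(p, "p"), (q, "q"), (p + q, "p + q"),
             (p - 1, "p - 1"), (q + 1, "q + 1")]
    terms.sort(key=lambda pair: pair[0])
    return terms[2][1]  # third of five terms once sorted by value

# Both pairs satisfy p > q and p - q < 1, yet the signs of q differ.
cases = [(Fraction(1, 2), Fraction(3, 10)),   # q positive
         (Fraction(1, 2), Fraction(-1, 5))]   # q negative
for p, q in cases:
    assert p > q and p - q < 1
    print(f"p = {p}, q = {q}: median term is {median_term(p, q)}")
```

Because the two pairs satisfy both statements but produce different median terms, the statements together cannot settle the question.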

Q: In your last example, whether the question was answerable depended on the sign and relative values of the variable expressions. Is this typical of GMAT Data Sufficiency questions? If so, is there a systematic process for ensuring that your analysis accounts for all possible values of the variable expressions?

A: Yes, it's very typical. In fact, identifying possible value ranges for variables is at the heart of many Data Sufficiency questions. Whenever you encounter a Data Sufficiency question involving variable expressions — as opposed to numbers — check to see whether the question itself asks:

  • Which of two variable expressions is greater in value
  • Whether two variable expressions are equal in value
  • Whether the value of a variable expression is positive or negative

The question might look something like one of the following:

If..., is x greater than y in value?

If..., does x equal y in value?

If..., is x a positive number?

Your immediate reaction to this sort of question should be to consider the following value ranges along the real-number line:

  • Values greater than 1
  • Fractional values between 0 and 1
  • Fractional values between –1 and 0
  • Values less than –1

Why these four ranges? Well, when you perform certain operations with variables, the result depends on what range the variable falls into. For instance, when you take a number's cube root, whether you end up with a positive number, negative number, a smaller number, or a larger number depends on which of the four ranges the original number falls into:

If x > 1, then 1 < ∛x < x

If 0 < x < 1, then x < ∛x < 1

If –1 < x < 0, then –1 < ∛x < x

If x < –1, then x < ∛x < –1
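These four cube-root patterns are easy to confirm with one sample value per range. The sketch below is illustrative only: the helper `cube_root` and the sample values are assumptions, and the helper uses `math.copysign` because raising a negative float to the power 1/3 in Python does not return a real number.

```python
import math

def cube_root(x):
    """Real cube root that also works for negative x."""
    return math.copysign(abs(x) ** (1 / 3), x)

# One hypothetical sample value from each of the four ranges.
samples = {
    "x > 1": 8.0,
    "0 < x < 1": 0.125,
    "-1 < x < 0": -0.125,
    "x < -1": -8.0,
}
for label, x in samples.items():
    root = cube_root(x)
    direction = "larger" if root > x else "smaller"
    print(f"{label}: x = {x}, cube root = {root:.3f} ({direction} than x)")
```

Swapping in a square or a square root for the cube root is a quick way to run the same exercise with even exponents and roots, keeping in mind that even roots of negative numbers are not real.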

As you prepare for GMAT Data Sufficiency, go through the exercise of applying exponents (odd as well as even) and roots (odd as well as even) to numbers, and note the patterns that result.