NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL
Which method is best to select values present in one table but missing in another one?
This:
1.
SELECT
l.*
2.
FROM
t_left l
3.
LEFT
JOIN
4.
t_right r
5.
ON
r.value = l.value
6.
WHERE
r.value
IS
NULL
, this:
1.
SELECT
l.*
2.
FROM
t_left l
3.
WHERE
l.value
NOT
IN
4.
(
5.
SELECT
value
6.
FROM
t_right r
7.
)
or this:
1.
SELECT
l.*
2.
FROM
t_left l
3.
WHERE
NOT
EXISTS
4.
(
5.
SELECT
NULL
6.
FROM
t_right r
7.
WHERE
r.value = l.value
8.
)
Finally, it’s MySQL time.
As always, we will create the sample tables:
Table t_left
contains 100,000 rows with 10,000 distinct values.
Table t_right
contains 1,000,000 rows with 10,000 distinct values.
There are 10 rows in t_left
with values not present in t_right
.
In both tables the field value
is indexed.
LEFT JOIN / IS NULL
1.
SELECT
l.id, l.value
2.
FROM
t_left l
3.
LEFT
JOIN
4.
t_right r
5.
ON
r.value = l.value
6.
WHERE
r.value
IS
NULL
View query results and execution plan
The query returns correct results in 0.73 seconds.
If we look into the Extra part of the execution plan, we will see an interesting thing there:
Using where; Using index; Not exists
What does it mean?
MySQL documentation mentions this:
Not exists
MySQL was able to do a
LEFT JOIN
optimization on the query and does not examine more rows in this table for the previous row combination after it finds one row that matches theLEFT JOIN
criteria. Here is an example of the type of query that can be optimized this way:1.
SELECT
*
2.
FROM
t1
3.
LEFT
JOIN
4.
t2
5.
ON
t1.id = t2.id
6.
WHERE
t2.id
IS
NULL
;
Assume that
t2.id
is defined asNOT NULL
. In this case, MySQL scanst1
and looks up the rows int2
using the values oft1.id
. If MySQL finds a matching row int2
, it knows thatt2.id
can never beNULL
, and does not scan through the rest of the rows int2
that have the same id value. In other words, for each row int1
, MySQL needs to do only a single lookup int2
, regardless of how many rows actually match int2
.
This is exactly our case.
MySQL, as well as all other systems except SQL Server, is able to optimize LEFT JOIN / IS NULL
to return FALSE
as soon the matching value is found, and it is the only system that cared to document this behavior.
The assumption that t2.id
should be defined as NOT NULL
, however, is too strong, since a successfull JOIN
on equality condition implies that the value found is NOT NULL
.
Since MySQL is not capable of using HASH
and MERGE
join algorithms, the only ANTI JOIN
it is capable of is the NESTED LOOPS ANTI JOIN
, which is exactly what we see in the query plan (despite the fact that MySQL doesn’t call it that). However, this behavior is what an ANTI JOIN
does: it checks the values from the left table against only one of each distinct values in the right table, skipping the duplicates.
NOT IN
1.
SELECT
l.id, l.value
2.
FROM
t_left l
3.
WHERE
value
NOT
IN
4.
(
5.
SELECT
value
6.
FROM
t_right
7.
)
View query results and execution plan
This query is as fast as the LEFT JOIN / NOT NULL
, however its plan looks quite different.
First, it mentions a DEPENDENT SUBQUERY
instead of a second table (which was used in the LEFT JOIN / IS NULL
). Nominally, is a dependent subquery indeed, since we don’t have a join here but rather a SELECT
from a single table with predicate in the WHERE
clause.
Second, the description part of the plan mentions this:
<exists>(<index_lookup>(<cache>(`20090918_anti`.`l`.`value`) in t_right on ix_right_value))
What is that?
This is of course our good old friend, the ANTI JOIN
. MySQL applies EXISTS
optimization to the subquery: it uses the index scan over ix_right_value
and returns as soon as it finds (or not finds) a row.
NOT IN
is different in how it handles NULL
values. However, since both t_left.value
and t_right.value
are defined as NOT NULL
, no NULL
value can ever be returned by this predicate, and MySQL takes this into account.
Essentially, this is exactly the same plan that LEFT JOIN / IS NULL
uses, despite the fact these plans are executed by the different branches of code and they look different in the results of EXPLAIN
. The algorithms are in fact the same in fact and the queries complete in same time.
NOT EXISTS
1.
SELECT
l.id, l.value
2.
FROM
t_left l
3.
WHERE
NOT
EXISTS
4.
(
5.
SELECT
value
6.
FROM
t_right r
7.
WHERE
r.value = l.value
8.
)
View query results and execution plan
This query of course produces the same results.
Execution plan, again, is different. MySQL is the only system that produces three different plans for three different methods.
The plan does not differ much: MySQL does know what an index lookup is and what EXISTS
is and it does combine them together.
EXISTS
in MySQL is optimized so that it returns as soon as the first value is found. So this query in fact is an ANTI JOIN
as well as first two queries are.
This query, however, is a little bit less efficient than the previous two: it takes 0.92 s.
This is not much of a performance drop, however, the query takes 27% more time.
It’s hard to tell exact reason for this, since this drop is linear and does not seem to depend on data distribution, number of values in both tables etc., as long as both fields are indexed. Since there are three pieces of code in MySQL that essentialy do one job, it is possible that the code responsible for EXISTS
makes some kind of an extra check which takes extra time.
Summary
MySQL can optimize all three methods to do a sort of NESTED LOOPS ANTI JOIN
.
It will take each value from t_left
and look it up in the index on t_right.value
. In case of an index hit or an index miss, the corresponding predicate will immediately return FALSE
or TRUE
, respectively, and the decision to return the row from t_left
or not will be made immediately without examining other rows in t_right
.
However, these three methods generate three different plans which are executed by three different pieces of code. The code that executes EXISTS
predicate is about 30% less efficient than those that execute index_subquery
and LEFT JOIN
optimized to use Not exists
method.
That’s why the best way to search for missing values in MySQL is using a LEFT JOIN / IS NULL
or NOT IN
rather than NOT EXISTS
.
fonte: