You should avoid using the IN-funtion in MySQL
In this small article, I would like to discuss why you should be careful with the popular IN function of SQL in certain situations. Of course, the function is not fundamentally bad, it is even extremely helpful. But there are always situations where the ill-considered use of this function can lead to real problems.
What is the aim of the IN function?
Let's start with what the IN function is good for. What problem does it actually solve, and why is it used so much?
For example, if you have a table with employees and a list of employee numbers for which you need all the information. Then you don't need to query the data individually, but you could create a statement like this:
SELECT * FROM employees AS e WHERE e.id = 7 OR e.id = 13 OR ... OR e.id = 5409;
In principle this works fine, but the statement gets very long quickly and if you want to generate it, it is much more time-consuming than if you only had to list the IDs. This is exactly where the IN function comes into play. This makes it possible to avoid a long count and the statement becomes much clearer.
SELECT * FROM employees AS e WHERE e.id IN (7, 13, ..., 5409);
This already looks much clearer and is also much easier to generate. As long as there is no other way to identify the employees you are looking for, this is probably also the easiest way to retrieve the data. Writing an SQL statement for each employee will only make sense in very few cases anyway.
What is the range optimizer?
Now, to understand why this statement can become a problem, one must first understand which part of MySQL takes care that such statements are processed efficiently.
Assuming that there is an index on the IDs column of our table, MySQL will probably use the BTREE algorithm, so we are searching relatively efficiently on a binary tree with our queries. However, we are looking for many different values that may be in very distant places in this tree. This is where the Range Optimizer comes in, trying to derive an optimal query from the values we have written to the IN function in this case. It tries to minimize the space of the binary tree that we have to search. A description of the process can be found directly in MySQL's own documentation.
That all sounds perfect so far. Our statement is relatively readable and MySQL itself makes sure that it can use the index as good as possible. Unfortunately, there is a downside to this, and it lies in the server system variable range_optimizer_max_mem_size which limits the amount of memory the range optimizer is allowed to use. The default value for this parameter is 8 MB and this is probably enough for most applications. But how is this memory used at all? According to the official documentation the Range Optimizer uses about 230 bytes for each OR. An IN function is technically a collection of OR statements, so each value counts as one OR. This means you can pack about 36472 entries into an IN function before the Range Optimizer runs out of memory. Maybe you should leave a little buffer for the calculation, but the limit can be calculated well. But this raises the question, what happens when the memory runs out? The MySQL documentation may not seem very clear at this point, but it is very precise and says "the optimizer falls back to less optimal plans".
In many situations, less optimal means that the statement performs a full table scan. If you now use the IN function in an update or delete, it unfortunately does not get better. A full table scan can have an extreme effect on the performance of an application in such situations.
Unfortunately, these problems do not appear slowly, but abruptly. There are applications that run completely smoothly for years because they always work within the memory limit. However, if the lists of IDs in the IN functions grow as the amount of data increases, at some point the threshold is exceeded at which the performance suddenly collapses. The difficulty lies especially with old applications, whose code was never tested on certain amounts of data because one simply did not expect it.
Another tip, increasing the memory is often not useful. Since the data volumes in most systems do not grow linearly but exponentially, you lose this battle every time. A permanent solution is usually only possible with adjustments to the application's code, for example by splitting the data into batches or adding other ways to select the data.