Using LIKE operator in SQL for efficient slug generation
In the current state of the set_slug
method, we are running an iteration and
generating a non-duplicate slug for the current task at hand.
This approach is highly inefficient. The reason is that, for every iteration, we
are calculating a possible slug candidate by appending the iterator to
parameterized title of current task.
In the background our application is making a call to the database to check the
existence of a task whose slug matches the possible slug candidate in that
particular iteration.
In a real world scenario, there might be hundreds or even thousands of such
iterations before we can generate a valid slug. This will increase the load on
our database and might hamper with the performance.
Let's see how we can increase the efficiency of slug generation process. Our
main concern is excessive database queries. Thus if we can figure out a way to
minimize database calls then our problem will be solved.
If we were using PostgreSQL
, then the solution was straightforward, which is
to use the REGEXP
operator. But SQLite
doesn't have any in-built REGEXP
operator. Thus let's try to devise a solution with what we have first.
One solution here is to use the LIKE
operator from SQLite
to query all tasks
with a matching slug. Once we have a list of all such tasks, we can append an
integer value greater than the count of such tasks to parameterized task title.
This will produce a unique slug.
For example, if a list of matching slugs in our application looks like,
example-slug, example-slug-1, example-slug-2
, then the slug count is 3. Thus
when creating a new slug, appending 4
, that is slug count + 1
, to
example-slug
will result in a unique slug, which is example-slug-4
.
LIKE
operator is used inside a WHERE
clause to search for a specific pattern
and return a collection of records matching the pattern.
Syntax of a query using LIKE
operator is:
table_name
, column_name
and pattern
are placeholders and they should be
replaced with appropriate values.
Before moving on, it is important to understand how to declare patterns inside a
query that uses LIKE
operator for pattern matching. The SQL LIKE
condition
allows us to use the following wildcards to perform pattern matching in a query.
-
The percent sign %
: It represents one or more than one character preceding
or succeeding the pattern string, depending upon where it is placed.
For example, if the pattern is buy%
, then SQL will match with "buy" along
with all words which are prefixed with "buy" such as "buying". If percent sign
precedes "buy", then SQL will look for all words which end with "buy".
LIKE
pattern matching always covers the entire string. Therefore, if it's
desired to match a sequence anywhere within a string, the pattern must start
and end with a percent sign.
-
The underscore sign _
: Each underscore sign represents a single character
preceding or succeeding the pattern text, depending upon where it is placed.
For example, if the pattern is buy__
, then along with the pattern string SQL
will also match with all five letter words prefixed with "buy" because of the
two underscore signs succeeding "buy".
If pattern does not contain percent signs or underscores, then the pattern only
represents the string itself.
Let us apply the optimized logic now. Fully replace the set_slug
method with
the following lines of code:
We are treating the parameterized title as a potential slug candidate for the
task at hand and storing this value in title_slug
variable.
We have appended a percent sign to title_slug
to use it as the pattern for
LIKE
query. Doing so will match the pattern with all slugs which are either
equal to the title_slug
variable's value or prefixed with the value of
title_slug
variable.
With the above approach, we are taking the count of all such tasks in the
database whose slug matches the value of the title_slug
variable.
If this count
is greater than 0
, as in a task exists in database whose slug
matches with title_slug
, then we will append the integer value of count + 1
to the parameterized title.
If the count
is less than 0
, then it means that no such tasks exists whose
slug matches with title_slug
and we can use the parameterized title as it is
for slug.
Different edge cases while using LIKE query
Let us understand this with the help of a hypothetical example. Suppose we have
to add a new task record whose title is "Buying", then its potential slug
candidate will be "buying", and database already contains another task record
whose slug also happens to be "buying".
In such a case, we will query the count for tasks with slugs similar to the
potential slug candidate which will be equal to 1
here, since the slug
buying
already exists in the database. Hence appending 2
to the potential
slug candidate will yield a unique slug.
With the above approach we can generate a unique slug while restricting the
number of database calls to one.
There are however, a couple of edge cases for which the above logic will fail.
To understand those edge cases let us consider a hypothetical example where
three different task records exist in the database whose slugs are "buying",
"buying-2" and "buying-3" respectively.
Edge case 1
Suppose one of the task records, say the one whose slug is "buying-2" is deleted
from the database. Now, when we create a new task record whose title is
"buying", the count in set_slug
method will be equal to 2
.
Hence the application will append 3
, that is count + 1
, to the title_slug
and attempt to save it.
But the database transaction will not go through because a task with slug
"buying-3" is already present in the database and the uniqueness validation will
fail. Hence for this edge case our set_slug
method will fail to generate an
unique slug.
Edge case 2
If we create a new task with title "buy", the slug generated for new task by
set_slug
method would be "buy-4". Ideally the slug should have been "buy".
Upon taking a good look at the pattern we are using for pattern matching it will
become evident that even though no slugs that match with either buy
or buy-
pattern exist in database at the time of record creation, the query for
calculating the count of tasks will still return 3
.
This is because SQL will also match with the "buying-2" and "buying-3" slugs in
the database due to the use of percent wildcard sign in pattern.
Recall that using a succeeding %
wildcard sign in the pattern string will
match the string with the task slug whose value is equal to the pattern string
along with all other slugs which contain the pattern string as a prefix.
Since 3 such tasks are found in this case, 4
will be appended to the
parameterized title of task at hand and the resulting slug will be buy-4
.
Handling discussed edge cases
To fix the edge cases, fully replace the set_slug
method with following lines
of code:
Let us break down the updated set_slug
method with the help of the same
example we had just discussed and understand how it is handling the edge cases.
Consider a hypothetical example of task records in the database whose slugs are
buying
, buying-2
and buying-3
.
We'd suggest reading the upcoming edge case handling section with the above code
split open to the side. This will give you more clarity on the variable names,
etc, that are being referred to.
Handling edge case 1
Note that, in all cases where we mention title_slug
, we are referring to the
value of the variable title_slug
from above mentioned code block and not any
random string title_slug
.
Suppose the task with slug buying-2
has been deleted from the database. Now,
during the slug generation for a new task with title Buying
, set_slug
method
will assign a potential slug candidate which will be equal to buying
and store
it inside the title_slug
variable.
In the next step, all tasks whose slug matches either buying
or buying-1
,
buying-2
etc are matched. That is why we have two LIKE
queries, one for
matching buying
and one for matching the hyphenated version of the slug.
Those matching tasks are then ordered in descending order of the length of their
respective slugs.
We are using the order
method to sort the records in descending order of the
length of their slugs. order
method accepts the name of attribute and the
order in which the records are to be sorted. For example, in the set_slug
method, order
method contains two arguments, LENGTH(slug) DESC
and
slug: :desc
.
Let us understand why we need two arguments. Suppose the task slugs stored in
database are buying-1
, buying-2
and buying-10
. If we sort by slug: :desc
then the result would be buying-2
, buying-10
and buying-1
which is
incorrect and if we sort by the length of slugs alone then buying-2
and
buying-1
will have the same rank. Hence we have combined the sorting
parameters.
After combining the parameters and sorting by "LENGTH(slug) DESC", slug: :desc
the result would be buying-10
, buying-2
and buying-1
. So the suffix
number, that is the integer after the hyphen of the slug is also taken into
account while ordering.
The main reason for ordering is because we are only interested in the latest
task in the database which matches the pattern.
Thus after ordering in descending order we take the first task from the
collection of matched tasks and its corresponding slug and store it in the
latest_task_slug
variable.
We have also declared a variable called slug_count
and initialized it to 0
.
slug_count
variable will keep track of the number of tasks whose slugs matches
with title_slug
at the time of slug generation.
If latest_task_slug
doesn't exist, then it will mean that no task with a slug
similar to title_slug
is present and the value for slug_count
will remain
0
. In such a case we can use the parameterized title itself for slug.
There can be two different cases if latest_task_slug
is present.
-
No integer value is appended to latest_task_slug
, then slug_count
will be
0
itself while we try to split the slug using the hyphen and try to get the
integer suffix. This also implies that only one other task exists in the
database whose slug matches with title_slug
. Hence we will update the value
of slug_count
to 1
.
-
The latest_task_slug
variable contains a value that has a valid integer as
a suffix after the hyphen. This implies that at the time of slug generation,
the count of tasks whose slugs match with task_title
is either equal to or
less than this integer value. Thus to play it safe, we will update
slug_count
to this value.
If the value of slug_count
is positive, then incrementing it by 1
and
appending it to title_slug
will guarantee a unique slug else we can use
title_slug
as it is for slug.
In this case the value of latest_task_slug
and slug_count
will be buying-3
and 3
respectively, hence the slug generated will be "buying-4".
Handling edge case 2
On creating a new task with title Buy
, set_slug
method will assign a
potential slug candidate which will be equal to buy
and store it inside the
title_slug
variable.
Note that we have updated the pattern matching query and we are using two
LIKE
operators separated by an OR
operator.
The first LIKE
operator will match slugs with slug_title
pattern whereas the
second LIKE
operator will match slugs with slug_title-%
pattern.
In this case, the LIKE
query will match the slug values with either buy
itself or with the buy-
pattern. Hence any other slugs which are suffixed or
prefixed with buy
will not be included in the pattern matching result.
This approach towards slug generation is more efficient because it will only
lead to a single database query for each slug.
Some missed out edge cases
Although we have solidified our slug generation process a lot, there still
remains some edge cases that we missed out or let's say left out for the last.
Consider the following scenario:
- A task with slug "buy-milk" already exists in database.
- A new task with title "buy" is being created.
- Upon reaching the slug generation process for "buy", it will check if any slug
matches "buy" or "buy-".
- The "buy-" pattern will match with "buy-milk" of the unrelated task's slug.
- Thus new slug generated for the task with title "buy" will be "buy-2", even
when no other slug having value "buy" exists in database.
Here, we could make a call and allow this edge case to go through. But we are
not going to give up yet. Let's try to handle this edge case too.
For handling this edge case, one logic looks like this:
- Check for existence of slugs that match "buy". This matches whole word or
none. Thus the regex pattern would look like "buy$", where "$" denotes the end
of string.
- After that check for the regex pattern "buy-[0-9]+$". This regex pattern will
check for all slugs that start with "buy-" and ends with a number.
Now let's take the edge case we discussed before:
- A task with slug "buy-milk" already exists in database.
- A new task with title "buy" is being created.
- Upon reaching the slug generation process for "buy", it will check if any slug
matches "buy$" or "buy-[0-9]+$".
- Neither matches. Already existing slug "buy-milk" won't match with
"buy-[0-9]+$", given that it doesn't end with a number.
- Slug value of "buy" will be generated for this new task with title "buy".
Consider another scenario as follows:
- A task with slug "go-for-grocery-shopping-and-buy-apples" already exists in
database.
- A new task with title "buy apples" is being created.
- Upon reaching the slug generation process for "buy apples", it will check if
any slug matches "buy-apples$" or "buy-apples-[0-9]+$".
- The "buy-apples" pattern will match with
"go-for-grocery-shopping-and-buy-apples" slug as they both are ending with the
same substring that is "buy-apples".
- Thus new slug generated for the task with title "buy apples" will be
"buy-apples-2", even when no other slug having value "buy-apples" exists in
database.
Let's try to handle this edge case too.
For handling this edge case, one logic looks like this:
- Check for existence of slugs that match "buy-apples". This matches whole word
or none. Thus the regex pattern would look like "^buy-apples$", where "^"
denotes the start of string.
- After that check for the regex pattern "^buy-apples-[0-9]+$". This regex
pattern will check for all slugs that start with "buy-apples-" and ends with a
number.
Now let's take the edge case we discussed before:
- A task with slug "go-for-grocery-shopping-and-buy-apples" already exists in
database.
- A new task with title "buy apples" is being created.
- Upon reaching the slug generation process for "buy apples", it will check if
any slug matches "^buy-apples$" or "^buy-apples-[0-9]+$".
- Neither of them found any matches. Already existing slug
"go-for-grocery-shopping-and-buy-apples" won't match with "^buy-apples$",
given that it doesn't start and end with "buy-apples".
- Slug value of "buy-apples" will be generated for this new task with title "buy
apples".
But still there exists one more minor edge case. We are calling "allow" on that
one for now. If you're curious, then go ahead and find out that particular edge
case on your own.
Adding REGEX function to SQLite
The implementation of this logic can be a bit tricky because SQLite doesn't have
any in-built functions to handle the REGEX
operator by default.
But as per the official SQLite documentation it supports adding user defined
functions to act as the REGEX
matcher.
Another issue is that these custom functions are not persisted in the database.
They are only stored in local memory and will only remain as long as the
database is up and running.
Fortunately, Rails provides an elegant way to run code on application start-up
through initializer files, which are loaded after frameworks and gems.
Initializer files are stored in the ./config/initializers
directory, and it
allows running Ruby scripts.
SQLite allows creating new SQL functions in the database level using the
sqlite3_create_function()
family of interfaces.
For doing this using Ruby, let's see check the custom DSL defined in the
sqlite3
gem, which looks like this:
The important arguments here are name
, which will set the name of the custom
function and arity
, which denotes number of arguments the newly defined
function will require.
The neat part here is that the FunctionProxy
object fp
allows us to set the
result of custom function directly.
Let's try to apply this knowledge and add the custom function.
Create a new initializer file:
In Rails, the AbstractAdapter
class represents a connection to the database
and provides an abstract interface for database-specific functionality such as
establishing a connection, escaping values etc.
The initialize
method within this AbstractAdapter
class establishes the
connection with database and runs other side effects.
Thus, what we can do is, make the initialize
method run our custom logic. In
order to do that we can use the class_eval
method to define the initialize
instance method of the AbstractAdapter
class.
But when doing that we'd have to execute all the code within AbstractAdapter's
initialize
method first before executing our custom logic. This can be done by
aliasing the original initialize method.
Now, let's actually implement our case-insensitive REGEXP
pattern matcher. Add
the following lines of code into
config/initializers/sqlite_custom_regexp_function.rb
:
That's it! Now every time Rails boots up, this initializer will also be run,
such that the REGEXP
matcher function will be available in the local memory
and SQLite will be able to use it.
Updating the set_slug method
Let's modify the set_slug
method to use the REGEXP
operator.
Fully replace the set_slug
method in the task model with the following lines
of code:
This handles the final edge case we had discussed.
Ensuring initializer works in production environment
There is one minor catch here. Our production environment doesn't support SQLite database. That is why in the upcoming chapters we add
PostgreSQL for production environment.
We have to modify our SQLite initializer file such that it will only run if the
current database adapter is SQLite3. PostgreSQL comes with an in-built REGEX
function. Thus it wouldn't make sense running this initializer file in
production since there we use will be using PostgreSQL.
First fully replace config/initializers/sqlite_custom_regexp_function.rb
with
the following content:
So previously we had used SQLite3::Database
for checking whether database
connection is SQLite3
or not. But the SQLite3
class is part of the sqlite3
gem and it won't be available in production environment. That's why we replaced
that condition with native ActiveRecord::Base
statement which is part of Rails
itself.
Now comes the important part. In our current set_slug
method we have used the
REGEX
operator. But in PostgreSQL, the
POSIX Regex operator
is ~
. We need a case insensitive regex and the operator for that is ~*
.
Since this regex operator, that is either REGEXP
or ~*
, needs to be
accessible in most of the Rails files after booting, it can be defined as a
constant under a global context like Constants
instead of defining it under a
specific Model or Class, like we have done in the previous chapters. We can do
so by defining it in our config/initializers/constants.rb
file.
The reason for making this a constant is because we will only be setting the
operator to a string value and it will remain constant throughout.
Create constants.rb
inside config/initializers/
to store constant values:
Now add the following lines of code into config/initializers/constants.rb
:
Now fully replace the set_slug
method in Task
model with the following lines
of code:
count vs length vs size
So if you had noticed, in one of the iterations of our set_slug
method, we had
used size
method like this
count = Task.where("slug LIKE ?", "#{title_slug}%").size
. But why exactly did
we use it over count
and length
? It's pretty straightforward:
count
: This method determines the number of elements with an SQL COUNT
query. If you do not have anything loaded in the memory, use count to make a
query on database.
length
: This method simply loads the data of the association into the memory
and then returns the number of elements loaded.
size
: This method works as a combination of both the above mentioned methods
i.e. if the collection has already been loaded in the memory, it will return
the length same as calling length. If it has not been loaded yet, it is like
calling count.
So in most cases size
does the job.
Let's commit the changes made in this chapter: