A brief intro:
One of the core responsibilities of a back-end developer is to write optimal queries to fetch the required data from the database. The way this task is handled dictates the overall performance of the APIs. Even the smallest errors here can snowball into big issues that can downgrade the application as a whole.
Read on for a list of things to look out for while designing queries to fetch data from the database.
Points to ponder
1. Working with ORMs
ORMs are a great way to delegate the task of writing complex queries to libraries specialised in this task. But make sure these ORMs aren’t complicating the simple stuff! Always make it a point to examine the query they brew up underneath.
Say we have two tables Slot
and Course
, such that there can be many slots under any given course. We need to query for one particular Slot
along with its Course
details. TypeORM offers us a way to do so:
getConnection().getRepository(Slot).findOne({
where: { id: "f9df9b0f-48d6-4d46-851f-2d044a25b1c7" },
relations: ["course"]
});
Now, let us look at the query that this simple two-liner generates. If you said
select * from slot s
inner join course c on s.course_id = c.id
where s.id = 'f9df9b0f-48d6-4d46-851f-2d044a25b1c7'
here is your reality check:
SELECT DISTINCT "distinctAlias"."Slot_id" as "ids_Slot_id" FROM (
SELECT "Slot"."id" AS "Slot_id", {...} FROM
"slot" "Slot"
LEFT JOIN "course" "Slot__course" ON "Slot__course"."id" = "Slot"."course_id"
WHERE "Slot"."id" = $1
) "distinctAlias"
ORDER BY "Slot_id" ASC LIMIT 1
-- PARAMETERS: ["f9df9b0f-48d6-4d46-851f-2d044a25b1c7"]
SELECT "Slot"."id" AS "Slot_id", {...} FROM
"slot" "Slot"
LEFT JOIN "course" "Slot__course" ON "Slot__course"."id" = "Slot"."course_id"
WHERE ("Slot"."id" = $1) AND ("Slot"."id" IN ($2))
-- PARAMETERS: ["f9df9b0f-48d6-4d46-851f-2d044a25b1c7","f9df9b0f-48d6-4d46-851f-2d044a25b1c7"]
Hard to believe? It was for us too! Nonetheless, that was what happened.
So during development, always log the queries that the ORMs generate and make sure they're doing only what they are meant to do.
2. Using Indexes
Choose your index wisely. They are subject to risks of not getting used at all.
Indexes are often the one-stop solution to all our query performance related problems. But sometimes, they are not so magical. They don't work on columns that return a high percentage of data rows after filtering using a WHERE clause.
Consider the Student-User Table of a Subscription based Application where the students can be of the following two types:
- FREE (free user of app)
- PAID (subscribed to paid offerings on the application)
Assuming there are around 2000 FREE users, and 3000 PAID users, there is no point in creating an index on the column, subscription_type, that stores this information. A query to select from this table using a where clause on subscription_type , would still go for a sequential scan instead of using the index.
This is because the DBMS finds it faster to fetch sequentially if there is a large percentage of rows to be returned.
Similarly, columns that are frequently manipulated should not be indexed as the maintenance on the index in such cases can become excessive.
3. Reuse of Data Access Methods
Reuse of code helps a product developer save a lot of time and effort. And sometimes reusing DB queries will help save a lot of time and effort, that can later be spent on troubleshooting the performance issues caused by reusing them.
Consider a hypothetical scenario in a School where there is 1 Course, with 3 associated slots, 5 teachers, and 110 students subscribed to this course.
Performing a JOIN on these tables to get these details would yield in a total of 1 * 3 * 5 * 110 = 1650 rows.
SELECT * FROM "course"
INNER JOIN "slot" ON "slot"."course_id" = "course"."id"
INNER JOIN "teacher" ON "teacher"."course_id" = "course"."id"
INNER JOIN "student" ON "student"."course_id" = "course"."id"
WHERE "course"."id" = '97c83afb-529b-46fa-828e-145c2a852c0f'
If there is a subsequent use-case that requires you to fetch only the Course, Slots and Teachers data, instead of reusing the existing method, write a new query to fetch only those details. This would drastically bring down the query output to just 15 rows.
4. Queries with multiple Joins
This is a corollary to the previous point. Keep track of the tables that are being joined, especially the ones that store a user input. This is because there is limited control over the amount of data in such tables and they can quickly escalate in size.
Leaving such queries unchecked can lead to huge data sets getting pulled to the program memory, ultimately bringing down the entire service.
If the use-case calls for numerous joins, limit the result set by adding a limit param.
Rounding up
These are some of the points to look out for while writing that perfect query for your perfect API. Surely, this list is never exhaustive. We will keep adding points here as we design products, develop software using the latest technology and uncover more secrets to writing great queries. So do follow us to be in the loop.