Skip to content

Ivan Li

Database & SQL Overview

8 min read

General Database Overview:

MySQL in general used to be considered very bad, has come a long way since then - used to be called ‘the php of databases’

MariaDB designed to be a drop-in replacement for MySQL - but ‘better’ … better how exactly? Forked from MySQL itself, maintained completely open source and not tied to any company like MySQL -> Oracle

PostgresQL also is conceptually very similar to the above two databases. Also very old and has been in use in production, battle tested for a very long time.

RDBS contain relationships linking data together in tables (customers -> products -> orders)

SQL is the language used to work with relational DB systems

NoSQL DBs generally store desired data through mixed blobs or ‘collections of things’ instead of neat, clearly defined tables like in RDBS.

For all database architectures and deployments we can generally only pick between fast updates or strong consistency. This is due to limitations in physics and current technologies. Two servers sitting on the opposite ends of the world will still have a long way to travel in order to communicate with each other.

GCP Database Table

Main Difference between SQL & noSQL:

An example would be that in MongoDB, we would have a ‘people’ JSON collection that would store a persons name, address, etc in one collective ‘mixed’ object. In SQL, we might decide to partition specific data out into separate tables for better relational organization. We might have a person table that holds only their name and age but also an address ID. We would then have an address table with the primary key being the address ID, with each row storing address information such as zip code, street name and city. In addition, we could partition the city column out into its own separate table and reference it as a city ID instead. Thereafter, we can fill the city table with other relevant information pertaining to just the city itself, such as lat long, timezone, population, etc.

How to properly store passwords in a DB?

Some of the popular cryptographic hash functions are MD5 and SHA1 . Instead of storing plain text password in the database one way is to store the hash of the password. You might be thinking that if we cannot get the actual password back from the hash then how are we going to verify the credentials that the user entered? It’s simple, apply the same hash function on the password which user entered and then compare it with the hash stored in the database. If both hashes match then the user is authenticated (since hash of same input will give same output). Now if the attacker is able to get database access, he will be only able to view the hashed output and not the actual password.

An additional layer of security should be added in the form of a ‘SALT’. A Salt is basically just another randomly generated sequence of characters that is appended to the password before being processed by the hash function. For instance, if your password was abc123, and the Salt was $*%%^678 , the password to be hashed and stored into the database would be: abc123$*%%^678

Salt is not stored in the database and only present in the application configuration file which is not accessible to outer world. Gaining access to the back end source files is much more difficult than gaining access to the database.

How do we deal with databases failing?

We can utilize a parent child architecture where the children are always replicating the data from the parent/main database server and are available to be promoted in the case of parent failure. We can also have an external service constantly recording all transaction into logs so that the database can be recreated if our system were to face catastrophic failure.

How do you speed/scale up a database?

We can begin by inspecting slow running, frequently used queries to find optimizations to be made via SQL. We can also set up table indexing over frequently accessed keys for both RDBs and many NoSQL implementations as well such as MongoDB.

Or we can also throw more hardware at the server by upgrading RAM/CPU/STORAGE speeds (vertical scaling)

If there are any stored procedures or processing being done on the database server, these calculations might have to be offloaded to the client side or back end server.

Although not directly related to the database itself, we could start caching the most common queries that clients request to relieve some pressure.

Finally, we can also rely on sharding the database horizontally or vertically by splitting the data out into multiple servers.

What is sharding?

Sharding is a method for distributing data across multiple machines via some designated key. We can imagine a pizza pie cut into equal slices that we then fill depending on some rule based on the user ID or user Location. Instead of having all the data sitting on one database, we can horizontally scale the data across multiple servers as needed. The trade off for horizontal scaling is increased complexity in the infrastructure and maintenance.

There are two main strategies to selecting a key for horizontal sharding:

  1. Using category based sharding keys such as using the user ID. If we go with this method, all the associated information about the particular user should all be on the same shard to avoid cross-node queries/joins.
  2. We can also do hashed sharding keys, which automatically distribute data across shards based on the hash function. The downside to this is that we must now perform cross-node joins in order to retrieve information for some particular data sets.

Another idea related to sharding is vertical partitioning, which is separating each RDB table to its own dedicated instance and then performing slower cross joins on them.

Good article on MySQL database sharding: Challenges of Sharding MySQL - DZone Database

What is database indexing?

Adding Indexing to a database is simply adding a sorted helper table based on one of the keys in a table which we want to query more efficiently.

The concept is similar to the word index in a scientific textbook where we would look up a word by its first letter and it would provide us with the location of where to find such words in the textbook. Indexing helps speed up queries when a table becomes very large with billions of rows. As an example, if we were to find all posts by a user based on user ID, we could significantly speed up that look up time by creating an index that sorts the post table by the user ID beforehand. Now the database logic can quickly and efficiently search for the desired user ID with a binary search instead of combing through each unordered row from the original table.

The tradeoff for creating an index for a table is that upon every update we make on the table, the index will also have to be updated to reflect the pointers correctly.

We can create multiple indexes per table if we deem the tradeoff to be worthwhile.

What is Consistent Hashing?

The concept of creating a consistent system of hashing so that a load balancer is able to easily adapt to changes when origin nodes are added or removed. A diagram of an abstract ‘hash ring’ is often used to represent S servers that share the load between all incoming requests R.

hash ring

If we theoretically have only 4 origin servers with the hash function hash(R = 0) % S and one of them goes down, there is potential for one to become a ‘hot spot’ that will receive 50% of the total traffic because the ‘next’ closest server will receive all the traffic designated for the failed server.

However, through the use of multiple K replica IDs for each origin server to create ‘virtual’ points of presence on the hash ring, we can ensure that no single server will become overloaded in the event that another goes down. For instance, if all the red colored nodes went down on the ring above, the ideal setup would still allow for a fairly even re-distribution of the load to the remaining nodes.

What is ACID?

ACID is an acronym of Atomicity, Consistency, Isolation and Durability.

Atomicity means transaction either completes or fails in entirety. There is no state in between. No body sees a partial completion of a transaction.

Consistency means the transaction leaves the database in the valid state.

Isolation means no two transactions mingle or interfere with each other. The result of two transactions executed in parallel would be same as sequential execution.

Durability means the changes of the transaction are saved. It remains there even if power is turned off.

Basically all relational database such as MySQL, postgresql, oracle, etc guarantees ACID properties on transactions.

A common example is that if you go to the bank and withdraw $100, the new balance should show up if someone else were to access your account at any another location in the world (although there might be some wait time for the second person). In an ACID database, if we perform a ‘write’, any subsequent ‘read’ should reflect the changes “immediately”.

If we did not require ACID transactions for our application and only cared for ‘eventual’ consistency, we can have faster reads and writes in a more widely distributed system. For instance, if one of your friends posts something new on Facebook, its probably okay if you don’t see it show up on your news feed immediately upon refresh as long as you see it “eventually”

When would you use NoSQL vs. SQL?

Duplicate data, faster efficient queries instead of using joins across tables, simply have an orders collection that already stores customer information and product information combined. Good for applications where you have a lot of reads and not complicated writes (depending on application). SQL is ‘stricter’, uses Schemas, has relations — if you need to update data frequently, it is better to do it on just one table and reference it with a join instead of maintaining multiple table sources in MongoDB.

The downside of SQL and Schemas is if you have complex queries and need to do a lot of reads, it can become slow (vs. Mongo where we would just continue to read from a collection that already contains all the data we want). SQL databases are hard to scale horizontally (split data amongst several servers and ‘share’ them as a pool of resources)… Many NoSQL DBs have sharding/horizontal scaling as a supported feature. There is no “winner”, it depends on the application.

NoSQL has no inherent relationships and therefore, performing JOIN operations are expensive.

NoSQL does not care about or have a schema to follow. If a user inputs their information and opts out of entering their address, the NoSQL DB can handle that and enter it into the 'Users' collection just as well (assuming the back end logic of your application is tolerant of this).

Faster insertion to NoSQL due to all relevant information coming in as a giant object package instead of having to insert a users information to separate tables.

The downside of storing data as a mixed blob is that, upon retrieval, the entire blob must always be retrieved … in other words, every query on a NoSQL collection is equivalent to a SELECT * from SQL. If we wanted to get the average age of every user, it wouldn’t be as efficient as just going down the age column in an equivalent SQL database because each blob would have to be read and ‘age’ specifically picked out.

SQL Overview:

General query format: SELECT * FROM animals WHERE animal_name = ‘tiger’ ORDER BY ID

Select only specific table columns (faster performance), perform arithmetic inline, set an alias to a result:

2 last_name,
3 first_name,
4 customer_id,
5 customer_id * 10 + 100 AS ‘my_alias_here’
6FROM customer

Distinct query (return only 1 result):

2FROM actor
3WHERE first_name = ‘nick’

Equality operators work the same as any other language, demo of AND clause:

2FROM actor
3WHERE actor_id <= 5
4AND first_name != ‘NICK’
5AND last_name != ‘GUINESS’

Query for items between two specific dates, ORDER BY Descending:

1SELECT * from payment
2WHERE payment_date >=2005-05-01
3AND payment_date <2005-06-01
4ORDER BY payment_date DESC

Another way to find query for results BETWEEN two ranges:

2FROM film
3WHERE length BETWEEN 90 AND 150

Combine multiple OR statements using IN:

2FROM customer
3WHERE first_name IN (‘MARY’, ‘MARIA’, ‘JOHN’)
4WHERE first_name NOT IN (‘MARY’, ‘MARIA’, ‘JOHN’)return opposite

LIKE clause:

2FROM film
3WHERE title LIKE%ALA%’ — this will find all titles with ‘ALA’ anywhere in the name
4WHERE title LIKE ‘ALA%’ — this will find all titles with ‘ALA’ only in the front

Regex like wildcard using LIKE:

2FROM film
3WHERE title LIKE ‘A__**_**__A’ — this will return results such as ALABAMA where ‘LABAM’ could be any character
4% any number of chars
5— _ represent one single character


2FROM customers
3-- WHERE first_name REGEXP 'ELKA|AMBUR'
4WHERE last_name REGEXP 'B[R|U]'

Data LENGTH less than 60 and title contains ‘ANA’ or ‘CAN’:

2FROM film
3WHERE (title LIKE%ALA%OR title LIKE%CAN%)
4AND length <= 60

Query NULL column values, missing phone number:

2FROM customers

Using LIMIT, top 3 customers by points:

2FROM customers
3ORDER BY points desc

Using INNER JOIN, match orders to products based on shared key

1SELECT order_id, p.product_id,, o.quantity, o.unit_price
2FROM order_items o
3JOIN products p ON p.product_id = o.product_id

Query same table (SELF join) to find employee and manager

2 e.employee_id,
3 e.first_name as employee,
4 m.first_name as manager
6 sql_hr.employees e
7 JOIN sql_hr.employees m ON m.employee_id = e.reports_to

Multiple tables using multiple JOINs:

2 o.order_id, o.order_date, c.first_name, c.last_name, os.`name` as status
4 orders o
5 JOIN customers c ON o.customer_id = c.customer_id
6 JOIN order_statuses os ON os.order_status_id = o.status

Querying to match COMPOUND keys:

2FROM order_items oi
3JOIN order_item_notes oin
4 ON oi.order_id = oin.order_id
5 AND oi.product_id = oin.product_id


1SELECT c.customer_id, c.first_name, o.order_id
4 ON c.customer_id = o.customer_id -- LEFT JOIN IGNORES THIS CONDITION
5ORDER BY c.customer_id

LEFT JOIN to show quantity of products sold:

1SELECT p.product_id, p.`name`, oi.quantity
2FROM products p
3LEFT JOIN order_items oi ON p.product_id = oi.product_id

Multiple LEFT JOIN to get NULL results as well:

2 c.customer_id,
3 c.first_name,
4 o.order_id,
5 as shipper
6FROM customers c
7LEFT JOIN orders o
8 ON c.customer_id = o.customer_id
9LEFT JOIN shippers sh
10 ON o.shipper_id = sh.shipper_id
11ORDER BY c.customer_id


1SELECT * FROM table1 [BLANK] OUTER JOIN table2 ON table1.col = table2.col

Replace [BLANK] with: LEFT - if you want all records from table1 even if they don’t have a col that matches table2’s (also included are table2 records with matches) RIGHT - if you want all records from table2 even if they don’t have a col that matches table1’s (also included are table1 records with matches) FULL - if you want all records from table1 and from table2

AGGREGATE Math functions:

1SELECT SUM(salary) FROM employees;
3SELECT MAX(salary) FROM employees;
5SELECT COUNT(*) FROM employees;
7SELECT CEIL(AVG(salary)) FROM employees;