MySQL used to be considered a very bad database (it was once called 'the PHP of databases'), but it has come a long way since then.
MariaDB is designed to be a drop-in replacement for MySQL, but 'better'. Better how, exactly? It was forked from MySQL itself and is maintained as a completely open source project, not tied to any single company the way MySQL is tied to Oracle.
PostgreSQL is also conceptually very similar to the above two databases. It is likewise very old and has been used in production, battle tested, for a very long time.
Relational databases (RDBMSs) contain relationships linking data together in tables (customers -> products -> orders).
SQL is the language used to work with relational DB systems
NoSQL databases generally store data as mixed blobs or 'collections of things' instead of the neat, clearly defined tables of a relational database.
For all database architectures and deployments, we can generally only pick between fast updates and strong consistency. This is due to the limits of physics and current technology: two servers sitting on opposite ends of the world still have a long round trip to make in order to communicate with each other.
An example: in MongoDB, we might have a 'people' JSON collection that stores a person's name, address, etc. in one collective 'mixed' object. In SQL, we might instead partition specific data out into separate tables for better relational organization. We might have a person table that holds only their name and age plus an address ID. We would then have an address table whose primary key is that address ID, with each row storing address information such as zip code, street name, and city. In addition, we could partition the city column out into its own separate table and reference it as a city ID instead. We can then fill the city table with other information pertaining to just the city itself, such as latitude/longitude, timezone, population, etc.
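As a rough sketch of that normalization in code, using Python's built-in sqlite3 (the table and column names here are made up for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical normalized schema: person references address, address references city
cur.executescript("""
CREATE TABLE city (
    city_id    INTEGER PRIMARY KEY,
    name       TEXT,
    timezone   TEXT,
    population INTEGER
);
CREATE TABLE address (
    address_id INTEGER PRIMARY KEY,
    street     TEXT,
    zip_code   TEXT,
    city_id    INTEGER REFERENCES city(city_id)
);
CREATE TABLE person (
    person_id  INTEGER PRIMARY KEY,
    name       TEXT,
    age        INTEGER,
    address_id INTEGER REFERENCES address(address_id)
);
""")

cur.execute("INSERT INTO city VALUES (1, 'Austin', 'America/Chicago', 960000)")
cur.execute("INSERT INTO address VALUES (1, '123 Main St', '73301', 1)")
cur.execute("INSERT INTO person VALUES (1, 'Alice', 30, 1)")

# Reassemble the 'mixed object' a document store would hold in one blob, via joins
row = cur.execute("""
    SELECT p.name, p.age, a.street, a.zip_code, c.name
    FROM person p
    JOIN address a ON a.address_id = p.address_id
    JOIN city c    ON c.city_id = a.city_id
""").fetchone()
print(row)  # ('Alice', 30, '123 Main St', '73301', 'Austin')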
Two popular cryptographic hash functions are MD5 and SHA-1 (both are now considered broken for security purposes; modern systems use stronger, slower alternatives). Instead of storing the plain-text password in the database, one approach is to store a hash of the password. You might be thinking: if we cannot get the actual password back from the hash, how are we going to verify the credentials the user entered? It's simple: apply the same hash function to the password the user entered and compare the result with the hash stored in the database. If both hashes match, the user is authenticated (since the same input always hashes to the same output). Now if an attacker gains database access, they will only be able to view the hashed output, not the actual password.
An additional layer of security should be added in the form of a salt. A salt is just another randomly generated sequence of characters that is appended to the password before it is processed by the hash function. For instance, if your password was abc123 and the salt was $*%%^678, the string to be hashed and stored in the database would be abc123$*%%^678.
In this scheme the salt is not stored in the database; it is only present in the application configuration file, which is not accessible to the outside world. (A secret value kept out of the database like this is usually called a 'pepper'; per-user salts, by contrast, are typically stored in the database right alongside the hash.) Gaining access to back-end source files is much more difficult than gaining access to the database.
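A minimal sketch of the hash-and-compare flow in Python, using SHA-256 from hashlib for illustration (real systems should prefer a dedicated slow password-hashing function such as bcrypt, scrypt, or Argon2; the salt value and function names here are hypothetical):

import hashlib
import secrets

SALT = "$*%%^678"  # in the scheme described above, kept in app config, not the database

def hash_password(password: str) -> str:
    # Append the salt to the password before hashing
    return hashlib.sha256((password + SALT).encode()).hexdigest()

def verify_password(entered: str, stored_hash: str) -> bool:
    # Re-hash what the user typed and compare against the stored digest
    return secrets.compare_digest(hash_password(entered), stored_hash)

stored = hash_password("abc123")           # this is what goes in the database
print(verify_password("abc123", stored))   # True
print(verify_password("letmein", stored))  # False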
We can utilize a parent-child architecture where the children constantly replicate data from the parent/main database server and are available to be promoted if the parent fails. We can also have an external service constantly recording all transactions into logs so that the database can be recreated if our system faces catastrophic failure.
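A toy sketch of that log-replay idea in Python (the log format and 'apply' logic are hypothetical; real systems use write-ahead logs with far more machinery):

import json

log = []  # stand-in for an append-only log kept by an external service

def record(op, key, value=None):
    # Every write is recorded before/alongside being applied
    log.append(json.dumps({"op": op, "key": key, "value": value}))

def replay(entries):
    # Recreate the database state from scratch after a catastrophic failure
    state = {}
    for line in entries:
        e = json.loads(line)
        if e["op"] == "set":
            state[e["key"]] = e["value"]
        elif e["op"] == "delete":
            state.pop(e["key"], None)
    return state

record("set", "user:1", {"name": "Alice"})
record("set", "user:2", {"name": "Bob"})
record("delete", "user:2")
print(replay(log))  # {'user:1': {'name': 'Alice'}}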
We can begin by inspecting slow-running, frequently used queries to find optimizations that can be made in the SQL itself. We can also set up indexes over frequently accessed keys, both in relational databases and in many NoSQL implementations such as MongoDB.
We can also throw more hardware at the server by upgrading RAM, CPU, or storage speed (vertical scaling).
If there are any stored procedures or processing being done on the database server, these calculations might have to be offloaded to the client side or back end server.
Although not directly related to the database itself, we could start caching the most common queries that clients request to relieve some pressure.
Finally, we can also rely on sharding the database horizontally or vertically by splitting the data out into multiple servers.
Sharding is a method for distributing data across multiple machines via some designated key. We can imagine a pizza pie cut into equal slices that we then fill depending on some rule based on the user ID or user location. Instead of having all the data sitting on one database, we can horizontally scale the data across multiple servers as needed. The trade-off for horizontal scaling is increased complexity in the infrastructure and maintenance.
There are two main strategies for selecting a key for horizontal sharding: hash-based (key-based) sharding, where a hash function applied to the key determines which shard a record lands on, and range-based sharding, where contiguous ranges of the key map to specific shards. A minimal sketch of both is shown below.
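A minimal Python sketch, assuming a fixed shard count of 4 and made-up ID ranges:

import hashlib

NUM_SHARDS = 4

def hash_shard(user_id: int) -> int:
    # Hash-based: a hash of the key decides the shard; spreads load evenly
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def range_shard(user_id: int) -> int:
    # Range-based: contiguous key ranges map to shards; simple, but hot
    # ranges can skew the load
    ranges = [(0, 1_000_000), (1_000_000, 2_000_000),
              (2_000_000, 3_000_000), (3_000_000, float("inf"))]
    for shard, (lo, hi) in enumerate(ranges):
        if lo <= user_id < hi:
            return shard

print(hash_shard(42), range_shard(42))
print(hash_shard(2_500_000), range_shard(2_500_000))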
Another idea related to sharding is vertical partitioning: separating each relational table onto its own dedicated instance and then performing slower cross-server joins on them when needed.
Good article on MySQL database sharding: Challenges of Sharding MySQL - DZone Database
Adding an index to a database essentially means adding a sorted helper structure based on one of the keys of a table that we want to query more efficiently.
The concept is similar to the index of a textbook, where we look up a word and it tells us where in the book that word can be found. Indexing helps speed up queries when a table becomes very large, with millions or billions of rows. As an example, if we wanted to find all posts by a user based on user ID, we could significantly speed up that lookup by creating an index that keeps the post table's user ID column sorted ahead of time. The database can then quickly and efficiently search for the desired user ID with a binary search instead of combing through every unordered row of the original table.
The tradeoff for creating an index for a table is that upon every update we make on the table, the index will also have to be updated to reflect the pointers correctly.
We can create multiple indexes per table if we deem the tradeoff to be worthwhile.
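A quick sketch of the tradeoff with Python's sqlite3 (table and index names are hypothetical):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE posts (post_id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)")
cur.executemany("INSERT INTO posts (user_id, body) VALUES (?, ?)",
                [(i % 1000, "hello") for i in range(100_000)])

# Without an index: a full table scan
print(cur.execute("EXPLAIN QUERY PLAN SELECT * FROM posts WHERE user_id = 42").fetchall())

# Create the index; every future INSERT/UPDATE must now also maintain this structure
cur.execute("CREATE INDEX idx_posts_user_id ON posts (user_id)")

# With the index: a fast search of the sorted structure instead of a scan
print(cur.execute("EXPLAIN QUERY PLAN SELECT * FROM posts WHERE user_id = 42").fetchall())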
Consistent hashing is the idea of building a hashing scheme such that a load balancer can easily adapt when origin nodes are added or removed. A diagram of an abstract 'hash ring' is often used to represent S servers that share the load of all incoming requests R.
If we theoretically have only 4 origin servers assigned via a naive hash such as hash(R) % S and one of them goes down, there is potential for another server to become a 'hot spot' receiving up to 50% of the total traffic, because the 'next' closest server on the ring receives all the traffic designated for the failed server.
However, by giving each origin server K replica IDs that act as 'virtual' points of presence on the hash ring, we can ensure that no single server becomes overloaded in the event that another goes down. For instance, if all of one server's virtual nodes dropped off the ring at once, this setup would still allow a fairly even redistribution of its load across the remaining nodes.
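A minimal Python sketch of a hash ring with virtual nodes (the MD5-based hash, replica count, and server names are arbitrary choices for illustration):

import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, replicas=100):
        # Each server gets `replicas` virtual points scattered around the ring
        self.ring = sorted((_hash(f"{s}#{i}"), s)
                           for s in servers for i in range(replicas))
        self.keys = [h for h, _ in self.ring]

    def get(self, request_key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash
        idx = bisect.bisect(self.keys, _hash(request_key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["s1", "s2", "s3", "s4"])
print(ring.get("request-123"))
# If s2 dies, its keys scatter across all survivors instead of one 'next' server
ring = HashRing(["s1", "s3", "s4"])
print(ring.get("request-123"))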
ACID is an acronym for Atomicity, Consistency, Isolation, and Durability.
Atomicity means a transaction either completes or fails in its entirety. There is no state in between; nobody sees a partial completion of a transaction.
Consistency means a transaction leaves the database in a valid state.
Isolation means no two transactions mingle or interfere with each other. The result of two transactions executed in parallel is the same as if they had run sequentially.
Durability means the changes made by a transaction are saved permanently, remaining in place even if the power is turned off.
Essentially all relational databases, such as MySQL, PostgreSQL, and Oracle, guarantee ACID properties on transactions.
A common example: if you go to the bank and withdraw $100, the new balance should show up if someone else accesses your account from any other location in the world (although there might be some wait time for that second person). In an ACID database, if we perform a 'write', any subsequent 'read' should reflect the change "immediately".
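A small sketch of atomicity with Python's sqlite3, using the withdrawal example (the schema and the deliberately failing second statement are contrived for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
cur.execute("INSERT INTO accounts VALUES (1, 500)")
conn.commit()

try:
    # Both statements succeed together or not at all (atomicity)
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    cur.execute("INSERT INTO withdrawals VALUES (1, 100)")  # fails: no such table
    conn.commit()
except sqlite3.OperationalError:
    conn.rollback()  # the $100 debit above is undone

print(cur.execute("SELECT balance FROM accounts WHERE id = 1").fetchone())  # (500,)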
If we do not require ACID transactions for our application and only care about 'eventual' consistency, we can have faster reads and writes in a more widely distributed system. For instance, if one of your friends posts something new on Facebook, it's probably okay if you don't see it show up in your news feed immediately upon refresh, as long as you see it "eventually".
NoSQL trades duplicated data for faster, simpler queries: instead of joining across tables, we can simply keep an orders collection that already stores the customer information and product information combined. This is good for applications with many reads and relatively uncomplicated writes (depending on the application). SQL is 'stricter': it uses schemas and has relations. If you need to update data frequently, it is better to update just one table and reference it with a join than to maintain the same data across multiple MongoDB collections.
The downside of SQL and Schemas is if you have complex queries and need to do a lot of reads, it can become slow (vs. Mongo where we would just continue to read from a collection that already contains all the data we want). SQL databases are hard to scale horizontally (split data amongst several servers and ‘share’ them as a pool of resources)… Many NoSQL DBs have sharding/horizontal scaling as a supported feature. There is no “winner”, it depends on the application.
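A rough illustration of the two shapes of the same data (the field names are made up):

# A Mongo-style denormalized 'orders' document: one read fetches everything,
# but the customer's name is duplicated into every one of their orders
order_doc = {
    "order_id": 1,
    "customer": {"customer_id": 7, "name": "Alice", "email": "a@example.com"},
    "items": [
        {"product_id": 3, "name": "Widget", "quantity": 2, "unit_price": 9.99},
    ],
}

# The SQL equivalent keeps customers, products, and orders in separate tables;
# reading the same view requires joins, but renaming 'Alice' means updating
# one customer row rather than every order document she appears in.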
NoSQL has no inherent relationships, so performing JOIN-style operations is expensive.
NoSQL does not care about or have a schema to follow. If a user inputs their information and opts out of entering their address, the NoSQL DB can handle that and enter it into the 'Users' collection just as well (assuming the back end logic of your application is tolerant of this).
Insertion is faster in NoSQL because all relevant information arrives as one big object, instead of having to insert a user's information into several separate tables.
The downside of storing data as a mixed blob is that, upon retrieval, the entire blob must generally be read; loosely speaking, a query on a NoSQL collection behaves like a SELECT * in SQL. If we wanted the average age of every user, it wouldn't be as efficient as just walking down the age column of an equivalent SQL table, because each blob would have to be read and the 'age' field picked out individually.
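A toy illustration of the difference in Python (the document layout is made up):

import json

# Document store: every 'blob' must be parsed just to pick out one field
docs = ['{"name": "Alice", "age": 30, "address": {"city": "Austin"}}',
        '{"name": "Bob", "age": 40, "address": {"city": "Boston"}}']
ages = [json.loads(d)["age"] for d in docs]
print(sum(ages) / len(ages))  # 35.0

# The SQL equivalent touches only the age column:
#   SELECT AVG(age) FROM users;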
General query format: SELECT * FROM animals WHERE animal_name = 'tiger' ORDER BY id
Select only specific table columns (faster performance), perform arithmetic inline, set an alias to a result:
SELECT
    last_name,
    first_name,
    customer_id,
    customer_id * 10 + 100 AS 'my_alias_here'
FROM customer
Distinct query (return only unique values, no duplicates):
SELECT DISTINCT first_name
FROM actor
WHERE first_name = 'nick'
Equality operators work the same as in any other language; demo of the AND clause:
SELECT *
FROM actor
WHERE actor_id <= 5
AND first_name != 'NICK'
AND last_name != 'GUINESS'
Query for items between two specific dates, ordered descending:
SELECT * FROM payment
WHERE payment_date >= '2005-05-01'
AND payment_date < '2005-06-01'
ORDER BY payment_date DESC
Another way to find query for results BETWEEN two ranges:
SELECT *
FROM film
WHERE length BETWEEN 90 AND 150
ORDER BY length DESC
Combine multiple OR statements using IN:
SELECT *
FROM customer
WHERE first_name IN ('MARY', 'MARIA', 'JOHN')
-- WHERE first_name NOT IN ('MARY', 'MARIA', 'JOHN') -- returns the opposite
LIKE clause:
SELECT *
FROM film
WHERE title LIKE '%ALA%' -- finds all titles with 'ALA' anywhere in the name
-- WHERE title LIKE 'ALA%' -- finds all titles starting with 'ALA'
Regex-like wildcards using LIKE:
SELECT *
FROM film
WHERE title LIKE 'A_____A' -- returns results such as ALABAMA, where 'LABAM' could be any characters
-- % matches any number of characters
-- _ matches exactly one character
REGEXP:
SELECT *
FROM customers
-- WHERE first_name REGEXP 'ELKA|AMBUR'
WHERE last_name REGEXP 'B[RU]' -- 'B' followed by 'R' or 'U'
Length at most 60 and title contains 'ALA' or 'CAN':
SELECT *
FROM film
WHERE (title LIKE '%ALA%' OR title LIKE '%CAN%')
AND length <= 60
Query NULL column values, missing phone number:
SELECT *
FROM customers
WHERE phone IS NULL
Using LIMIT, top 3 customers by points:
SELECT *
FROM customers
ORDER BY points DESC
LIMIT 3
Using INNER JOIN, match order items to products on a shared key:
SELECT order_id, p.product_id, p.name, o.quantity, o.unit_price
FROM order_items o
JOIN products p ON p.product_id = o.product_id
Query the same table (self join) to find each employee and their manager:
SELECT
    e.employee_id,
    e.first_name AS employee,
    m.first_name AS manager
FROM
    sql_hr.employees e
    JOIN sql_hr.employees m ON m.employee_id = e.reports_to
Multiple tables using multiple JOINs:
SELECT
    o.order_id, o.order_date, c.first_name, c.last_name, os.`name` AS status
FROM
    orders o
    JOIN customers c ON o.customer_id = c.customer_id
    JOIN order_statuses os ON os.order_status_id = o.status
Querying to match COMPOUND keys:
SELECT *
FROM order_items oi
JOIN order_item_notes oin
    ON oi.order_id = oin.order_id
    AND oi.product_id = oin.product_id
Outer Joins: (LEFT JOIN, RIGHT JOIN)
SELECT c.customer_id, c.first_name, o.order_id
FROM customers c -- a LEFT JOIN returns all records from the 'left' side
RIGHT JOIN orders o -- a RIGHT JOIN returns all records from the 'right' side
    ON c.customer_id = o.customer_id -- unmatched rows from the preserved side are still returned, with NULLs
ORDER BY c.customer_id
LEFT JOIN to show quantity of products sold:
SELECT p.product_id, p.`name`, oi.quantity
FROM products p
LEFT JOIN order_items oi ON p.product_id = oi.product_id
Multiple LEFT JOINs to include NULL results as well:
SELECT
    c.customer_id,
    c.first_name,
    o.order_id,
    sh.name AS shipper
FROM customers c
LEFT JOIN orders o
    ON c.customer_id = o.customer_id
LEFT JOIN shippers sh
    ON o.shipper_id = sh.shipper_id
ORDER BY c.customer_id
LEFT JOIN vs. RIGHT JOIN vs. FULL JOIN
SELECT * FROM table1 [BLANK] OUTER JOIN table2 ON table1.col = table2.col
Replace [BLANK] with:
LEFT - if you want all records from table1 even if they have no match in table2 (matching table2 records are also included)
RIGHT - if you want all records from table2 even if they have no match in table1 (matching table1 records are also included)
FULL - if you want all records from both table1 and table2
Note that MySQL does not support FULL OUTER JOIN directly; it is typically emulated by taking the UNION of a LEFT and a RIGHT join.
AGGREGATE Math functions:
SELECT SUM(salary) FROM employees;
SELECT MAX(salary) FROM employees;
SELECT COUNT(*) FROM employees;
SELECT CEIL(AVG(salary)) FROM employees;