This Week in Data with Colin Charles 28: Percona Live, MongoDB Transactions and Spectre/Meltdown Rumble On

Colin Charles

Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

In case you missed last week’s column, don’t forget to read the fairly lengthy FOSDEM MySQL & Friends DevRoom summary.

From a Percona Live Santa Clara 2018 standpoint, beyond the tutorials being picked and scheduled, the talks have also been picked and scheduled (so you very likely received acceptance emails from the Hubb.me system by Tuesday). The rejections have not gone out yet but will follow soon. I expect the schedule to go live either today (end of week) or early next week. The cheapest tickets end March 4, so don’t wait to register!

Amazon Relational Database Service has had a lot of improvements in 2017, and the excellent summary from Jeff Barr is worth a read: Amazon Relational Database Service – Looking Back at 2017. Plenty of improvements for the MySQL, MariaDB Server, PostgreSQL and Aurora worlds.

Spectre/Meltdown and its impact are still being discovered. You need to read Brendan Gregg’s amazing post: KPTI/KAISER Meltdown Initial Performance Regressions. And if you visit Percona Live, you’ll see an amazing keynote from him too! Are you still using MyISAM? MyISAM and KPTI – Performance Implications From The Meltdown Fix suggests switching to Aria or InnoDB.

Probably the biggest news this week though? Transactions are coming to MongoDB 4.0. From the site: “MongoDB 4.0 will add support for multi-document transactions, making it the only database to combine the speed, flexibility, and power of the document model with ACID guarantees. Through snapshot isolation, transactions will provide a globally consistent view of data, and enforce all-or-nothing execution to maintain data integrity.” You want to read the blog post, MongoDB Drops ACID (the title works if you’re a native English speaker, but maybe not quite if you aren’t). The summary diagram was a highlight for me because you can see the building blocks, plus future plans for MongoDB 4.2.

Releases

Link List

Upcoming appearances

  • SCALE16x – Pasadena, California, USA – March 8-11 2018
  • FOSSASIA 2018 – Singapore – March 22-25 2018

Feedback

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.

MongoDB revs you up: What storage engine is right for you? (Part 3)

Differentiating Between MongoDB Storage Engines: RocksDB

In this series of posts, we discussed what a storage engine is, and how you can determine the characteristics of one versus the other:

“A database storage engine is the underlying software that a DBMS uses to create, read, update and delete data from a database. The storage engine should be thought of as a “bolt on” to the database (server daemon), which controls the database’s interaction with memory and storage subsystems.”

Generally speaking, it’s important to understand what type of work environment the database is going to interact with, and to select a storage engine that is tailored to that environment.

The first post looked at MMAPv1, the original default engine for MongoDB (through release 3.0). The second post examined WiredTiger, the new default MongoDB engine.

This post will cover RocksDB. RocksDB builds on LevelDB, Google’s open source key value database library. It was designed to address several scenarios:

  1. Scale to run on servers with many CPU cores.
  2. Use fast storage efficiently.
  3. Be flexible to allow for innovation.
  4. Support IO-bound, in-memory, and write-once workloads.

RocksDB

Find it in: Percona Builds

RocksDB, originally designed at Facebook, uses LSM trees to store data, unlike most other storage engines, which use B-Trees.

LSM trees are designed to amortize the cost of writes: data is written to log files that are sequentially written to disk and never modified. A background thread then merges the log files (compaction) into a tree-like structure. With this design, a single I/O can flush tens or hundreds of write operations to disk.

The tradeoff is that reading a document is more complex and therefore slower than for a B-Tree; because we don’t know in advance in which log file the latest version of the data is stored, we may need to read multiple files to perform a single read. RocksDB uses bloom filters and fractional cascading to minimize the impact of these issues.
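
To make that write/read asymmetry concrete, here is a deliberately simplified sketch in plain JavaScript. It is not RocksDB's actual code or file format, just a toy illustration: writes land in an in-memory table that is periodically flushed as an immutable file, while a read may have to check several files, newest first.

// Toy LSM sketch: an in-memory "memtable" plus a list of immutable "files".
var memtable = {};   // recent writes, cheap to update in memory
var sstables = [];   // flushed, never-modified files (newest first)

function put(key, value) {
    memtable[key] = value;                   // in-memory write
    if (Object.keys(memtable).length >= 4) { // flush threshold (arbitrary here)
        sstables.unshift(memtable);          // one sequential write covers many puts
        memtable = {};
    }
}

function get(key) {
    if (key in memtable) return memtable[key];
    // the latest version could be in any file, so scan newest to oldest;
    // this scan is the cost that bloom filters help RocksDB avoid
    for (var i = 0; i < sstables.length; i++) {
        if (key in sstables[i]) return sstables[i][key];
    }
    return null;
}

Compaction, which merges and rewrites these files in the background, is what keeps the number of files a read has to check bounded.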

As far as workload fit, RocksDB can provide very good insert and query performance while providing compression ratios that are typically better than WiredTiger and slightly worse than PerconaFT. Also, RocksDB is theoretically better than PerconaFT at keeping up with the frequent and heavy delete workloads that accompany TTL indexes in high-insert workloads.

Percona is excited to offer enterprise support for RocksDB as part of our MongoDB support options: https://www.percona.com/services/support/rocksdb-support.

Conclusion

Most people don’t know that they have a choice when it comes to storage engines, and that the choice should be based on what the database workload will look like. Percona’s Vadim Tkachenko performed an excellent benchmark test comparing the performance of PerconaFT and WiredTiger to help specifically differentiate between these engines.

In Part Four of this blog series, we’ll take a closer look at Percona’s MongoDB storage engine: PerconaFT.

Part 1: Intro and the MMAPv1 storage engine.

Part 2: WiredTiger storage engine.

Find unused indexes on MongoDB and TokuMX

Finding and removing unused indexes is a pretty common technique to improve the overall performance of relational databases. Fewer indexes mean faster inserts and updates, and also less disk space used. The usual way to do it is to log all queries’ execution plans and then get a list of those indexes that are not used. The same theory applies to MongoDB and TokuMX, so in this blog post I’m going to explain how to find them.

Profiling in MongoDB

To understand what profiling is, you only need to think about MySQL’s slow query log: it is basically the same idea. It can be enabled with the following command:

db.setProfilingLevel(level, slowms)

There are three different levels:

  • 0: No profiling enabled.
  • 1: Only those queries slower than “slowms” are profiled.
  • 2: All queries are profiled, similar to long_query_time=0.
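
For example, to capture only the queries slower than 100 ms on the current database, something like this should work (100 is just an illustrative threshold):

// profile only operations slower than 100 ms
db.setProfilingLevel(1, 100)

// verify the current profiling settings
db.getProfilingStatus()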

Once it is enabled, you can use db.system.profile.find().pretty() to read it. You would need to scan through all the profiles and find those indexes that are never used. To make things easier, there is a JavaScript program that will find the unused indexes after reading all the profile information. Unfortunately, it only works with MongoDB 2.x.

The JavaScript is hosted in this GitHub project: https://github.com/wfreeman/indexalizer. You just need to start the mongo shell with indexStats.js loaded and run the db.indexStats() command. This is a sample output:

scanning profile {ns:"test.col"} with 2 records... this could take a while.
{
	"query" : {
		"b" : 1
	},
	"count" : 1,
	"index" : "",
	"cursor" : "BtreeCursor b_1",
	"millis" : 0,
	"nscanned" : 1,
	"n" : 1,
	"scanAndOrder" : false
}
{
	"query" : {
		"b" : 2
	},
	"count" : 1,
	"index" : "",
	"cursor" : "BtreeCursor b_1",
	"millis" : 0,
	"nscanned" : 1,
	"n" : 1,
	"scanAndOrder" : false
}
checking for unused indexes in: col
this index is not being used:
"_id_"
this index is not being used:
"a_1"

 

So “a_1” is not used and could be dropped. We can ignore “_id_” because that one is needed :)

There is a problem with profiling: it affects performance, so you should run it only for a few hours, usually during off-peak times. That means there is a possibility that not all of the queries from your application will be executed during that maintenance window. What alternative does TokuMX provide?

Finding unused indexes in TokuMX

Good news for all of us: TokuMX doesn’t require you to enable profiling. Index usage statistics are stored as part of every query execution, and you can access them with a simple db.collection.stats() command. Let me show you an example:

> db.col.stats()
[...]
{
"name" : "a_1",
"count" : 5,
"size" : 140,
"avgObjSize" : 28,
"storageSize" : 16896,
"pageSize" : 4194304,
"readPageSize" : 65536,
"fanout" : 16,
"compression" : "zlib",
"queries" : 0,
"nscanned" : 0,
"nscannedObjects" : 0,
"inserts" : 0,
"deletes" : 0
},
{
"name" : "b_1",
"count" : 5,
"size" : 140,
"avgObjSize" : 28,
"storageSize" : 16896,
"pageSize" : 4194304,
"readPageSize" : 65536,
"fanout" : 16,
"compression" : "zlib",
"queries" : 2,
"nscanned" : 2,
"nscannedObjects" : 2,
"inserts" : 0,
"deletes" : 0
}
],
"ok" : 1
}

 

There are our statistics, without profiling enabled. queries means the number of times that the index has been used by a query execution. b_1 has been used twice and a_1 has never been used. You can use this small piece of JavaScript code I’ve written to scan all collections inside the current database:

db.forEachCollectionName(function (cname) {
	// get the collection statistics, including per-index usage counters
	output = db.runCommand({collstats : cname });
	print("Checking " + output.ns + "...")
	// report any index that has never been used by a query
	output.indexDetails.forEach(function(findUnused) { if (findUnused.queries == 0) { print( "Unused index: " + findUnused.name ); }})
});

 

An example using the same data:

> db.forEachCollectionName(function (cname) {
... output = db.runCommand({collstats : cname });
... print("Checking " + output.ns + "...")
... output.indexDetails.forEach(function(findUnused) { if (findUnused.queries == 0) { print( "Unused index: " + findUnused.name ); }})
...
... });
Checking test.system.indexes...
Checking test.col...
Unused index: a_1

 

Conclusion

Finding unused indexes is a regular task that every DBA should do. In MongoDB you have to use profiling, while in TokuMX nothing needs to be enabled because it gathers the information by default without impacting service performance.


ObjectRocket’s David Murphy talks about MongoDB, Percona Live Amsterdam

Say hello to David Murphy, lead DBA and MongoDB Master at ObjectRocket (a Rackspace company). David works on sharding, tool building, very large-scale issues and high-performance MongoDB architecture. Prior to ObjectRocket he was a MySQL/NoSQL architect at Electronic Arts. David enjoys large-scale operational tool building and high-performance OS and database tuning. He is also a core code contributor to MongoDB. He’ll be speaking next month at Percona Live Amsterdam, which runs Sept. 21-23. Enter promo code “BlogInterview” at registration to save €20!


Tom: David, your 3-hour tutorial is titled “Mongo Sharding from the trench: A Veterans field guide.” Did your experience working with vast amounts of data at Rackspace give you a unique perspective that now puts you in a position to help people just getting started? Can you give a couple of examples?

David: I think this is something I grew into organically, from the days of supporting cPanel-type MySQL instances to today. I have worked in a few verticals, from hosting to advertising to gaming, before finally entering platform services. Those roles gave me a host of knowledge about how customers need systems to work, and the number and range of workloads we see at Rackspace reinforces this.


ObjectRocket’s David Murphy

Many times the unique perspective comes with scale, such as someone scaling a single node up to the multi-terabyte range. When they go to shard, they can find that the process, which is normally very light and unnoticeable in most Mongo sharding, can severely lock the metadata for an extended time. In other cases, the balancer might not be able to keep up with the amount of work being asked of it.

Toward the smaller end of the spectrum, having seen so many workloads from big to small, I can see similar thought processes and trends. Having worked with so many of these workloads, and honestly having learned along with the evolution of Mongo, helps me explain to clients the good, the bad, and the hairy. Many times discussions come down to people not using connection pooling, non-indexed sorting, or complex operators such as $in, $nin, and more. In these cases, I can talk to people about the balance of using these concepts and when they will become bigger issues for them. My goal is to give them enough knowledge to help determine when it is correct to use development resources to fix an issue, and when it’s manageable and development time could be better spent elsewhere.

 

Tom: The title of your tutorial also sounds like the perfect title of a book. Do you have any plans for one?

David: What an excellent question! I have thought about this; however, I think it is only a goal for a book if I can find the time to do it. A working title might be “Mongo from the trenches: Surviving the minefield to get ahead”. I think the book might be broken into three sections: “When should you use or not use Mongo”, “Schema and operators in the NoSQL world”, and “Sharding”. I would break it up this way because each could be a great mini book on its own, and the community really could use a level of depth similar to the MySQL 5.0 certification guides. I liked those books because they helped someone understand all the bits of what to consider in your schema design and how it affects the application as much as the database hosts. Then the second half, which was more administration geared, took those same schema and design choices and helped you manage them with confidence.

In the end, Mongo is a good product that works well for most people. As it matures, we need more discussion on topics such as what you should monitor, how you should predict issues, and how valuable regular audits are, especially in an ecosystem where it’s easy to spin something up, launch it, and move on to the next project.

 

Tom: When and why would you recommend using MongoDB instead of MySQL?

David: I am glad I mentioned this is worthy of a book already, as it is such a complex topic and one that gets me very excited.

I feel there is a bit of misinformation on both sides of this field. Many experts in the MySQL camp know that when someone says they can’t get more than 1,000 TPS out of MySQL, nine times out of ten it is a design issue, not a technology issue. The Mongo crowd loves this, and due to the inherent sharding nature of Mongo they can sidestep these types of issues. Conversely, in the Mongo camp you will hear how bad the SQL standard is; however, omitting transactions for a moment, the same types of operations exist in MySQL and Mongo. There are some interesting powers in Mongo’s aggregation. However, SQL is more powerful, and just as complex as some of the map-reduce jobs and aggregations I have written.

As to your question, MySQL will always win with regard to repeatable reads in a transaction. There is some talk of limited transactions in Mongo; however, these will likely not become global and cluster-wide anytime soon, if ever. I don’t trust floats in Mongo for financials; it’s not that Mongo doesn’t do them, but rather that JavaScript-type floats are what’s present. Sometimes you need to store data as a 64-bit integer and do the math in the app to get a high-precision value. MySQL, on the other hand, has excellent support for precision.

Another area is simply looking at the history of Mongo and MySQL. Until WiredTiger and RocksDB, Mongo was very similar to MyISAM from a locking behavior and support perspective. With the advent of the new storage engines, we will see major leaps forward in the types of workflows you will want in Mongo. With the writer lock issue gone, locking between the systems is becoming more and more similar, making it much harder to decide which to use.

The news is not all one-sided, however. Subdocument and array support in Mongo is amazing; there are so many things I can do in Mongo that I could not do even with bitwise SET/ENUM operators. So if you need that type of system, or you want to create a semi-denormalized form of a view in the database, Mongo can do this with ease and on the fly. MySQL, on the other hand, would take careful planning and need whole tables updated. In this regard I feel more people could use Mongo and its ability to have a versioned document schema, allowing more incremental changes to documents: with new code releases, the application can read old versions and “upgrade” them to the latest form. That removes a whole flurry of maintenance-related pains that RDBMSs have, to the frustration of developers who just want to launch the new product.

The last thing I would say here is you need not choose: why not use both? Mongo can be very powerful for keeping a semi-denormalized version of the data that is nimble enough to allow fast application or system updates and features, leaving MySQL for the very specific workloads that need precision, are simple, and are not expected to have schema changes. I am a huge fan of keeping the transactional portions in MySQL and the rest in Mongo. That allows you to scale the bulk of your data needs quickly up and down, and more slowly change the parts that need to be 100% consistent all of the time, with no room for eventual consistency.

 

Tom: What other session(s) are you most looking forward to besides your own at Percona Live Amsterdam?

David: There are a few that are near and dear to me.

“Turtles all the way down: tuning Linux for database workloads” looks like a great one. It reflects a view I have always had: DBAs should be DBAs, sysadmins, and storage people rolled into one. That way they can understand the impacts of the application down to the blocks the database reads.

“TokuDB internals” is another one. I have used TokuDB in MySQL and Mongo to some degree, but it has never had in-depth documentation. A topic like that is a great way to fill any gaps for experienced and new people alike.

“Database Reliability Engineering” looks like a great talk from a great speaker.

As an InnoDB geek, I like the idea around “Understanding InnoDB locks: case studies.”

I see a huge amount of potential for MaxScale. If anyone else is curious, “Anatomy of a Proxy Server: MaxScale Internals” should be good for read/write splits and split-writing type cases.

Finally, one of my favorite people is Charity, as she is always so energetic and can get to the heart of the matter. If you are not going to “Upgrade your database: without losing your data, your perf or your mind”, you are missing out!

 

Tom: Thanks for speaking with me, David! Is there anything else you’d like to add: either about Rackspace or Percona Live Amsterdam?

David: In regards to Rackspace, I urge everyone to check out the Data Services group. We handle everything from Redis to Hadoop, with a goal of augmenting your teams or providing experts to help keep your uptime as high as possible. With options ranging from dedicated hosts to platform-type services, there is something that helps everyone. Rackspace is not just a cloud company but a real support company that provides amazing hardware to use, or support for other hardware locations, and it is growing rapidly.

As for Percona Live Amsterdam, everyone should come: the group of speakers is simply amazing, and I for one am excited by so many topics because they are all so compelling. Outside of that, you will find it hard to find another gathering of database experts with multiple technologies under their belt who truly believe in picking the right technology for the right use case.


Percona Live Europe 2015! Call for speakers; registration open

Percona Live is moving from London to Amsterdam this year and the event is also expanding to three full days. Percona Live Europe 2015, September 21-23, will be at the Mövenpick Hotel Amsterdam City Centre. The call for speakers and Super Saver registration are now open. Hurry though, because the deadline for submitting a speaking proposal is June 21st and Super Saver registration ends July 5th!

This year’s conference will feature one day of tutorials and two days of keynote talks and breakout sessions related to MySQL, NoSQL and Data in the Cloud. You’ll get briefed on the hottest topics, learn about operating a high-performing deployment and hear from top-industry leaders describe the future of the ecosystem – encompassing MySQL, MariaDB, Percona Server, MongoDB (and more). Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

Have something to say? Why not lead a breakout session or a tutorial?

Breakout sessions are 50 minutes including a Q&A. Tutorial sessions focus on an immediate and practical application of in-depth MySQL and NoSQL knowledge. Tutorial speakers should assume that attendees will have laptops to work through detailed and potentially hands-on presentations. Tutorials are typically three hours long including a Q&A, however, if you have content for a full day, submissions for 6-hour-long tutorials are also being accepted. If your tutorial or breakout session is approved, you’ll receive a complimentary full-conference pass.

Huge thanks to our Conference Committee!

There would be no Percona Live without the hard work of our conference committees. Meet this year’s Percona Live Europe 2015 Conference Committee – a dedicated group of experts in MySQL, NoSQL and Data in the Cloud:

  • Erik Beebe – Founder and CTO, ObjectRocket
  • Luis Motta Campos – Database Administrator, eBay Classifieds Group
  • Colin Charles – Chief Evangelist, MariaDB
  • César Trigo Esteban – Development Director, Gigigo
  • Kenny Gorman – Chief Technologist; Data. Office of the CTO, Rackspace
  • Amrith Kumar – Founder and CTO, Tesora
  • Giuseppe Maxia – Quality Assurance Architect, VMware
  • Shlomi Noach – Senior Systems Engineer, Booking.com
  • Konstantin Osipov – Engineering Manager, Mail.Ru
  • Morgan Tocker – MySQL Community Manager, Oracle
  • Art van Scheppingen – Head of Database Engineering, Spil Games
  • Charity Majors – Production Engineering Manager, Facebook
  • Peter Zaitsev – Co-founder and CEO, Percona

Sponsorships

Sponsorship opportunities for Percona Live Europe 2015 are now available. Sponsors become part of a dynamic and fast-growing ecosystem and interact with hundreds of DBAs, sysadmins, developers, CTOs, CEOs, business managers, technology evangelists, solution vendors, and entrepreneurs who typically attend the event. This year’s conference will feature expanded accommodations and turnkey kiosks.

Planning to attend?

Super Saver registration discounts for Percona Live Europe 2015 are available through July 5th (at 11:30 p.m. CEST). Visit the Percona Live Europe 2015 website for more information about the conference. Interested community members can also register to receive email updates about Percona Live Europe 2015.

Percona has also negotiated a special hotel rate at the Mövenpick Hotel Amsterdam City Centre. If you book your hotel before July 6th your delicious breakfast is included.

I hope to see you in Amsterdam!


New MySQL features, related technologies at Percona Live London

The upcoming Percona Live London conference, November 11-12, features quite a number of talks about the latest MySQL features and related technologies. There will be lots of talks about the new MySQL 5.6 features:

  • Opening keynote highlights MySQL 5.6 new features.
  • New InnoDB Compression talk will cover the new compression algorithm, implemented by Facebook and included in MySQL 5.6.
  • New MySQL Replication features, including the multi-threaded slave applier, Global Transaction IDs (which can help with automatic failover), and lots of performance optimizations.

Although MySQL 5.6 is a very important milestone, there are many more interesting technologies going on around MySQL. Here are some of the talks which look pretty interesting (at least to me):

NoSQL World

Hadoop

Hadoop is a relatively new topic at MySQL conferences; however, it is gaining more and more traction, especially after the release of the MySQL Applier for Hadoop (alpha version). Danil Zburivsky will be talking about building a data warehouse with Hadoop and MySQL. I personally have a strong interest in Hadoop and recently did a webinar about this topic. The Hadoop concept is very different from MySQL, but there are lots of real use cases where Hadoop fits best.

MongoDB

MongoDB is another interesting technology. There will be a full MongoDB tutorial by Stephane Combaudon, as well as a MongoDB for the MySQL Guru talk by Robert Hodges (Continuent) and Tim Callaghan (Tokutek).

New MySQL Cluster features

MySQL Cluster 7.3 (based on the mainline MySQL Server 5.6 release plus the NDBCluster storage engine) was recently released. Johan Andersson will cover some new MySQL Cluster 7.3 features in his MySQL Cluster Performance Tuning talk, including foreign key constraints (foreign key constraints were the “showstopper” for many customers), memcached integration, etc. I have known Johan since the early MySQL AB days and he always was (and still is) “the MySQL Cluster guy”, so I’m sure he will show some new MySQL Cluster magic.

Other Storage Engines

TokuDB features fractal tree indexes and compression. Vadim blogged about using TokuDB for storing time series data and it looks promising. Tim Callaghan of Tokutek will talk about Fractal Tree Indexes.

MariaDB contains the CONNECT engine (to join data between Oracle and Cassandra, for example) and the SPIDER storage engine (for automatic “sharding”). Colin Charles from Monty Program Ab will talk about new MariaDB features.

Percona Live London is approaching fast so be sure to register today!


Schema Design in MongoDB vs Schema Design in MySQL

For people used to relational databases, using NoSQL solutions such as MongoDB brings interesting challenges. One of them is schema design: while in the relational world, normalization is a good way to start, how should we design our collections when creating a new MongoDB application?

Let’s see with a simple example how we would create a data structure for MySQL (or any relational database) and for MongoDB. We will assume in this post that we want to store people information (their name) and the details from their passport (country and validity date).

Relational Design

In the relational world, the basic idea is to try to stick to the 3rd normal form and create two tables (I’ll omit indexes and foreign keys for clarity – MongoDB supports indexes but not foreign keys):

mysql> select * from people;
+----+------------+
| id | name       |
+----+------------+
|  1 | Stephane   |
|  2 | John       |
|  3 | Michael    |
|  4 | Cinderella |
+----+------------+
mysql> select * from passports;
+----+-----------+---------+-------------+
| id | people_id | country | valid_until |
+----+-----------+---------+-------------+
|  4 |         1 | FR      | 2020-01-01  |
|  5 |         2 | US      | 2020-01-01  |
|  6 |         3 | RU      | 2020-01-01  |
+----+-----------+---------+-------------+

One of the good things with such a design is that it’s equally easy to run any query (as long as we don’t consider joins as something difficult to use):

  • Do you want the number of people?
    SELECT count(*) FROM people
  • Do you want to know the validity date of Stephane’s passport?
    SELECT valid_until from passports ps join people pl ON ps.people_id = pl.id WHERE name = 'Stephane'
  • Do you want to know which people do not have a passport? Run
    SELECT name FROM people pl LEFT JOIN passports ps ON ps.people_id = pl.id WHERE ps.id IS NULL
  • etc

MongoDB design

Now how should we design our collections in MongoDB to make querying easy?

Using the 3rd normal form is of course possible, but that would probably be inefficient, as all joins would have to be done in the application. So out of the 3 queries above, only query #1 could be easily run. So which other designs could we have?
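
As a quick illustration (assuming two normalized collections called people and passports that mirror the MySQL tables above; the collection names are just for this example), query #1 stays trivial while query #2 already needs an application-side join:

// query #1: how many people?
db.people.count()

// query #2: validity date of Stephane's passport, joined in the application
var person = db.people.findOne({ name: "Stephane" });
var passport = db.passports.findOne({ people_id: person._id });
print(passport.valid_until);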

A first option would be to store everything in the same collection:

> db.people_all.find().pretty()
{
	"_id" : ObjectId("51f7be1cd6189a56c399d3bf"),
	"name" : "Stephane",
	"country" : "FR",
	"valid_until" : ISODate("2019-12-31T23:00:00Z")
}
{
	"_id" : ObjectId("51f7be3fd6189a56c399d3c0"),
	"name" : "John",
	"country" : "US",
	"valid_until" : ISODate("2019-12-31T23:00:00Z")
}
{
	"_id" : ObjectId("51f7be4dd6189a56c399d3c1"),
	"name" : "Michael",
	"country" : "RU",
	"valid_until" : ISODate("2019-12-31T23:00:00Z")
}
{ "_id" : ObjectId("51f7be5cd6189a56c399d3c2"), "name" : "Cinderella" }

By the way, we can see here that MongoDB is schemaless: there is no problem in storing documents that do not have the same structure.

The drawback is that it is no longer clear which attributes belong to the passport, so if you want to get all passport information for Michael, you will need to correctly understand the whole data structure.
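
For example, to pull Michael's passport data out of this flat collection you already have to know that country and valid_until are the passport attributes; something like:

// nothing in the document structure says these fields come from the passport
db.people_all.findOne(
    { name: "Michael" },
    { country: 1, valid_until: 1, _id: 0 }
)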

A second option would be to embed passport information inside people information – MongoDB supports rich documents:

> db.people_embed.find().pretty()
{
	"_id" : ObjectId("51f7c0048ded44d5ebb83774"),
	"name" : "Stephane",
	"passport" : {
		"country" : "FR",
		"valid_until" : ISODate("2019-12-31T23:00:00Z")
	}
}
{
	"_id" : ObjectId("51f7c70e8ded44d5ebb83775"),
	"name" : "John",
	"passport" : {
		"country" : "US",
		"valid_until" : ISODate("2019-12-31T23:00:00Z")
	}
}
{
	"_id" : ObjectId("51f7c71b8ded44d5ebb83776"),
	"name" : "Michael",
	"passport" : {
		"country" : "RU",
		"valid_until" : ISODate("2019-12-31T23:00:00Z")
	}
}
{ "_id" : ObjectId("51f7c7258ded44d5ebb83777"), "name" : "Cinderella" }
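
With this embedded design, the passport queries from the relational example become single lookups; for instance (using the field names shown above):

// validity date of Stephane's passport, no join needed
db.people_embed.findOne(
    { name: "Stephane" },
    { "passport.valid_until": 1, _id: 0 }
)

// people without a passport, the equivalent of the LEFT JOIN ... IS NULL query
db.people_embed.find(
    { passport: { $exists: false } },
    { name: 1, _id: 0 }
)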

Or we could embed the other way (however, this looks a bit dubious, as some people, like Cinderella in our example, may not have a passport):

> db.passports_embed.find().pretty()
{
	"_id" : ObjectId("51f7c7e58ded44d5ebb8377b"),
	"country" : "FR",
	"valid_until" : ISODate("2019-12-31T23:00:00Z"),
	"person" : {
		"name" : "Stephane"
	}
}
{
	"_id" : ObjectId("51f7c7ec8ded44d5ebb8377c"),
	"country" : "US",
	"valid_until" : ISODate("2019-12-31T23:00:00Z"),
	"person" : {
		"name" : "John"
	}
}
{
	"_id" : ObjectId("51f7c7fa8ded44d5ebb8377d"),
	"country" : "RU",
	"valid_until" : ISODate("2019-12-31T23:00:00Z"),
	"person" : {
		"name" : "Michael"
	}
}
{
	"_id" : ObjectId("51f7c8058ded44d5ebb8377e"),
	"person" : {
		"name" : "Cinderella"
	}
}

That’s a lot of options! How can we choose? Here is where you should be aware of a fundamental difference between MongoDB and relational databases when it comes to schema design:

Collections inside MongoDB should be designed with the most frequent access patterns of the application in mind, while in the relational world, you can forget how data will be accessed if your tables are normalized.

So…

  • If you read people information 99% of the time, having 2 separate collections can be a good solution: it avoids keeping in memory data that is almost never used (passport information), and when you need all the information for a given person, it may be acceptable to do the join in the application.
  • Same thing if you want to display the name of people on one screen and the passport information on another screen.
  • But if you want to display all information for a given person, storing everything in the same collection (with embedding or with a flat structure) is likely to be the best solution.

Conclusion

We saw in this post one of the fundamental differences between MySQL and MongoDB when it comes to creating the right data structure for an application: with MongoDB, you need to know the data access pattern of the application. This should not be neglected as creating a wrong schema design is a recipe for disaster: queries will be difficult to write and to optimize, they will be slow and they will sometimes need to be replaced by custom code. All that can lead to low performance and frustration.

The next question is: which way is better? And of course, there is no definite answer: MongoDB fans will say that by making all access patterns equal, normalization makes them all equally bad, and normalization fans will say that a normalized schema provides good performance for most applications and that you can always denormalize to help a few queries run faster.
