Saturday, June 13, 2015

NoSQL Schema Design: "The Questions you have" vs. "The Answers I can give"

Migrating to NoSQL from SQL has been a real continuing challenge for us. SQL third normal form provided a very clear set of rules on how to design databases, and I had long mastered how to design SQL schema, and had become fairly adept at making indexes to optimize queries. When these tools seemed to fail then I would build some selective de-normalizations into our databases. However, with NoSQL we are given a choice to have a flexible schema and anything goes.

Where to start? When given a complete blank slate with no fixed rules or methodologies, I was not sure what to do. Fortunately, I had already moved our application architecture to a Service Oriented Architecture and my APIs already generated structured objects which could be easily ported to our new Couchbase NoSQL database, and that's what we did. So now we were finally dipping our toes into the NoSQL world. We then attached our Couchbase cluster to an ElasticSearch cluster, and then we could text search absolutely everything in our universe of data with one simple query. It was amazing. We could search all the people, companies, deals, emails, telephone numbers, notes, phone logs, email logs, street addresses, web addresses...everything with just one simple query. In SQL, this would be enormously difficult. You would have to create text indexes on all the numerous tables and then write a very complex union query to look at all the tables, and then who knew what the performance would be like. My users found this new ability interesting, but then I got these questions...

  • I would like to see only my stuff, or what my co-workers have
  • I would like to see stuff from a certain time period
  • I would like to see only companies within a certain revenue range or other criteria
  • I would like to see counts of what I want before I see the results, so that if there isn't anything there I don't want to bother seeing the results
  • I spoke to someone but I can't remember their name, but we talked about "blah blah" in the last year or maybe two years ago
I then realized that the object models I had used from my API were not going to work, and I realized that these object models were really based on doing CRUD (Create, Read, Update, Delete) operations on a SQL data store, and that I had to change the model.

I read online extensively. I posted on forums asking how to design a schema for NoSQL. I really did not see anything as definitive as third normal form. Finally, I saw on Stack Overflow an idea which cleared up everything. I paraphrase the idea:
With SQL databases you design your schema based on the answers you have, and with NoSQL databases you design your schema based on the questions you have.
I wish I could credit where I saw this idea, but it cleared everything up for me. Searching for the third normal form equivalent in NoSQL was pointless. Those rules were built to optimize space and simplify queries. It was based on linear algebra. It was perfect for a world of smaller clearly defined data. However, in this new world the questions became more important than the answers. The users expected information to be more loosely correlated, and they wanted their simple questions to be answered simply and quickly even though the underlying information can be quite complex. And so I begin now to restructure my schemas based not on the information I have but based on the questions my users have. Wish me luck.

No comments:

Post a Comment