Friday, July 31, 2015

Control Z in a NoSQL world

I was going on a trip with my 13 year old son, and out of the blue, he said "What if you can hit 'Command Z' in the real world?" "What?" "What if you spilled your drink and you hit 'Command Z' and it never happened?" Silence, indicating my mind blowing. As a millennial raised in the age of the Internet and cloud computing, he was wondering if our real world actions can be as transient as our virtual world. In the laws of physics, entropy says that energy and structure will break down, and processes tend not to be reversible. But have we defied the laws of physics in our virtual world? I remember how awesome it was to use a word processor and learn how to fix typos. In high school, typos were fixed with correction tape and there was always a little smudge to indicate that a mistake was made. Meanwhile, 'Control Z' just makes your actions go away as if they never happened. My son's comments made me wonder whether I would want to 'Control Z' anything I have done in my life. The reality is that, like everyone, I have regretted actions and decisions I have made and the resulting consequences. But these same actions may have also had some wonderful consequences, or at least lessons. So I don't think I would have used 'Control Z'. I would live with the smudges in my life.

In our virtual world, 'Control Z' is not as powerful as it seems. Our email is archived and backed up to other systems. Gmail recently came out with a message unsend feature, but it only works for a limited period after hitting the send button. Facebook posts can be shared and propagated soon after you have posted them, and even if you retract a post, it lives on in some archive.

What does this mean in a NoSQL world? NoSQL databases are powering the internet these days, and though it is not required, these data stores often contain denormalized data, where the same fact is stored in multiple places. For instance, a user name may be stored in a user profile and also in all of that user's posts, comments and likes. Let's say a user gets married and, as a traditional person, she changes her last name. Then all her posts, comments and likes need to be updated. SQL databases usually solve this problem by storing the name once and using an immutable primary key for the user that is stored in the posts, comments and likes. Then the two tables are joined to show the name. NoSQL usually does not join its objects, hence the need to denormalize or repeat the data. This makes 'Control Z' much harder and reminds us that we do live in a world where actions leave a mark and those marks are hard to erase, and we have to live with the smudges that life gives us.
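A minimal sketch of the name-change problem, using plain Python dictionaries as hypothetical stand-ins for the two kinds of stores. None of the function or field names below come from a real database API; they are assumptions made up for illustration.

```python
# Hypothetical in-memory stand-ins; no real database API is used here.

def rename_denormalized(documents, user_id, new_name):
    """Denormalized store: every document carrying a copy of the name must change."""
    touched = 0
    for doc in documents:
        if doc.get("user_id") == user_id and "user_name" in doc:
            doc["user_name"] = new_name
            touched += 1
    return touched  # number of documents that had to be rewritten


def rename_normalized(users, user_id, new_name):
    """Normalized store: the name lives in exactly one row."""
    users[user_id]["name"] = new_name
    return 1  # always a single write


def render_post(users, post):
    """Stand-in for a SQL join: look the name up by its immutable key at read time."""
    return users[post["user_id"]]["name"] + ": " + post["text"]


if __name__ == "__main__":
    documents = [
        {"type": "profile", "user_id": 7, "user_name": "Ann Smith"},
        {"type": "post", "user_id": 7, "user_name": "Ann Smith", "text": "Hello"},
        {"type": "comment", "user_id": 7, "user_name": "Ann Smith", "text": "Nice"},
    ]
    print(rename_denormalized(documents, 7, "Ann Jones"), "documents rewritten")
```

The denormalized rename touches every copy of the name, and undoing it means touching every copy again. The normalized version is one write, paid for with a lookup on every read.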

Sunday, July 26, 2015

What is NoSQL?

I recently read Paul Ford's article, "What is code?", and it is a very compelling look into the ideas and culture of software and the people who create it. He describes the religious adherence programmers have for the many competing technologies they use. Corporate vs. Startup, East coast vs. West coast, Windows vs. Linux vs. Mac, C vs. Ruby vs. Python vs. PHP vs. C# vs. Java. Django vs. Rails... There also is a bit of a divide between SQL and NoSQL. SQL is more definitive: there is an ANSI standard for SQL (Structured Query Language), schemas are defined, and it is based on relational algebra. NoSQL is not defined. In fact, there isn't even a real tight definition other than NOT SQL.
The site that is top ranked by Google for the word NoSQL has a definition:

Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable.

It is an interesting definition. SQL databases are very elegant and, until the advent of Web 2.0, seemed to handle any data problem you could throw at them. Web 2.0, and specifically social networks, exploded data, since this was when the users of a given site were not only consuming (reading and downloading the data), but also contributing to the content of the site. Currently Facebook has over 1.4 billion users and most of those users are posting and creating content, and this was where SQL databases started to fail. Elegant SQL became a bottleneck, and none of the SQL databases out there could scale to what is now called Internet Scale. So new technologies had to be developed to serve data at Internet Scale, and that is pretty much what NoSQL is. Some of the other features are mostly just side effects of the scaling issue.

Interestingly enough, there are cases where SQL databases have been used for Internet Scale sites. A good example is Instagram. In this posting, Instagram co-founder Mike Krieger explains how they scaled out PostgreSQL to handle Instagram data. However, they had to employ a lot of other NoSQL products like Memcached and Redis, and a fair amount of engineering, to get it to work and handle the millions of users creating content every day.

Due to its success with Internet Scale projects, programmers have been using NoSQL for everything, and not necessarily with good results. I know on my recent projects we are starting to experiment with NoSQL to see if we get the speeds and optimizations that we need. However, I will not pretend that these projects are even approaching Internet Scale, with maybe 150,000 users at best. Still, after years of slow and complicated queries and indexes in SQL, we do get the speed we need. Wish me luck.

Wednesday, June 17, 2015

NoSQL- structured vs unstructured what's better?

I recently had a conversation with a friend of mine, Dan Torres, who is one of the most brilliant SQL data gurus I know and has taught me most of the meaningful SQL knowledge I have. We were talking about how I was adopting NoSQL, and he was concerned. He has built some extremely large and meaningful data sets with SQL, and he said he had trouble with the lack of data structure in most applications of NoSQL.

I do agree that with structured data you can get some very meaningful insight out of your data, and that NoSQL does allow for looser data structure. But I think that with NoSQL you can build an evolutionary structure that evolves as your data evolves. I have more than once found that data modeling decisions I have made on day one of a project need to be changed and depending on what the change is, it can be quite painful to change in a SQL database.
In addition, often your inbound data is not structured and it can be quite challenging to convert that unstructured data into structured data. My team has dedicated data entry people who are looking at RSS feeds and extracting and entering deal information field by field into our database. This is not an automated process, though we have put some automated helpers in place. 
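To make the idea of a structure that evolves with the data a little more concrete, here is a minimal sketch of upgrading older documents as they are read. The field names (`company`, `companies`, `schema_version`) are hypothetical and not taken from our actual deal data; the pattern, not the fields, is the point.

```python
def upgrade_deal(doc):
    """Return a copy of a deal document in the newest shape (version 2).

    Assumed history, for illustration only:
      v1 stored a single "company" string and no "schema_version" field.
      v2 stores a "companies" list, because deals turned out to involve several.
    """
    upgraded = dict(doc)  # shallow copy so the stored document is left untouched
    version = upgraded.get("schema_version", 1)
    if version < 2:
        company = upgraded.pop("company", None)
        upgraded["companies"] = [company] if company else []
        upgraded["schema_version"] = 2
    return upgraded


if __name__ == "__main__":
    print(upgrade_deal({"company": "Acme", "amount": 5}))
```

Old and new documents can sit side by side in the same store, and each one is brought up to date the next time it is read. In a SQL database the equivalent change would typically be a table migration applied to every row at once.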

So part of my exploration of NoSQL is to see how it can help us tease structure and meaning out of unstructured data. We have built semantic technologies to see if we can help our research group classify deals and companies from unstructured text with some success, and I am hoping NoSQL can help us be more successful. 

Looking at Google and Yahoo!, we see they are addressing the same problem at a much greater scale. They are taking all the web pages on the web, which are extremely unstructured, and are gleaning structure and meaning from them so that we can find them in their search applications. And I know that they are not using SQL to structure their data. 

Monday, June 15, 2015

NoSQL - Open source equals more innovation?

As I have moved my team to using more NoSQL data stores, I found an interesting side topic to the SQL vs. NoSQL issues. The vast majority of NoSQL technology is open source. Open source has been around for a while. It was started in 1985 by the Free Software Foundation and is now fostered by the Open Source Initiative.

But we can look at the roots of the concept going back to the Founding Fathers with the creation of the U.S. patent system, which may seem to be anathema to our current views of open source. At that point, the U.S. government wanted to spur technical innovation by allowing inventors to share their inventions and in return have the ability to license their shared inventions for a limited time (20 years, which is forever in software time frames). After that time, the invention enters the public domain. The alternative to a patent was to keep the invention secret, and not share it at all. This is what Coca-Cola did, and it is generally called a trade secret.

So it is interesting that patents have recently come to be seen as inhibiting innovation. NoSQL has been the most significant innovation in data technology, and almost all the main players are open source: MongoDB, Couchbase, CouchDB, Cassandra, Redis, Elasticsearch, Lucene, Hadoop... Here is one list from last year.

So what is it that makes open source drive innovation? I found my entry into that world quite intriguing. I jumped on the forums to learn about the new software I was testing. Soon I found I was poking around GitHub to understand how the software works. I noted that one of the software packages did not work with the JVM I was using, and I tweaked some Java code (I am not a Java programmer). This tweak was then rolled up into GitHub, where the moderators incorporated it into the next release. Similarly, for another product, I found I needed some connection pooling, and working with the community we came up with C# code to pool the connections. Those modifications were rolled into the next release.

I really did not spend that much time on these code changes, but open source took my minor efforts and improved the overall code base in a way that proprietary code would never have done, and if thousands of programmers are doing this, you can see how powerful open source is in driving innovation and software development.

As a side note, I did just get a patent grant on some 3D integral photography software I created with a friend. It took almost 5 years, and in those 5 years a lot of software innovation has occurred!

Sunday, June 14, 2015

NoSQL - Old Masters Oil painting technique and Software development

Years ago I devoted a substantial amount of time to oil painting. It was a deep dive for 6 years, and I learned a lot about the art of oil painting including some of the old master techniques commonly used by Rembrandt, Vermeer, Goya, Velasquez and more. 

Though I don't use the skills I learned from that period today, I recently realized I use the attitude and approach I learned, especially in working with NoSQL. 

To generalize, the Old Masters technique was to work from thin to fat. The artist would start a painting with very thin paint, using lots of turpentine and medium to thin the color. With the thin paint, the artist would then cover the whole canvas with color but in a very general way, not specifying any details but setting up the top level structure of the composition. Then the artist would wait a day or two for it to dry. Note that a productive studio would have many canvases in process. The painter then would add another layer of paint, which was a little bit fatter with less turpentine, defining pictorial elements a little more. Then the painter would let that dry. With each session, the painter would progressively use fatter paint and paint more definitively until the painting was completed.

How is this similar to building a NoSQL application? I would say that with NoSQL you should start your design very broadly and the flexible schema allows you to do so. You can then go fatter and refine and add details as you go because of the flexible schema. The result is a growing and living application that is agile enough to adjust to new needs and ideas. 

In contrast, I would say that SQL design is more akin to ink drawing where you make permanent, definitive and detailed decisions up front. Then you have to live with them for the rest of the life of your application. You could change the design later but not without some heartache. 

So if you are using NoSQL, let's see your masterpiece!

NoSQL Evangelism - Are we drinking the Koolaid?

Recently I went to a workshop to learn more about Neo4j, a NoSQL graph database. I sat at a table with a bunch of other techies, as we heard all about the benefits of graph databases, and how the whole world can be modeled as a set of graph relationships. They say graphs are everywhere. One of my neighbors at the table asked "Have you drunk the Koolaid yet?".

This expression got me thinking about how technology is evangelized. Technology companies need developers and systems engineers to adopt their technology in order to be successful, and they need to convince them to consider their technology. This is especially difficult when developers have already gotten comfortable with other technologies.

I unfortunately remember the origin of the "Koolaid" reference. Back in the 1970s, there were a lot of cults, and at Jonestown, Guyana, more than 900 people from Jim Jones's Peoples Temple died in a mass homicide-suicide from drinking poison-laced Koolaid.

As gruesome as the origin is, we still use the expression today to mean blindly adopting beliefs and taking actions without sensibly acknowledging the consequences. In the 1970s many institutions had broken down, and many people were looking for new institutions to join to feel community and structure. Unfortunately, some of these new institutions were not so benign.

Back to technology evangelism...

I believe we are in a Golden Age of information technology. The speed at which new technologies are being developed is breathtaking. But how do tech leaders and architects, whose job is to adopt the appropriate new technologies, know which ones to pick? We listen to all the pitches from all the up and coming companies about how great their technologies are, and how they will make the software we are creating even better. But in order to learn if this is all true, we have to invest a fair amount of time and money to determine if such and such technology will work for us and improve our own software. What do we do?

We trust our instincts and drink the Koolaid.

Saturday, June 13, 2015

NoSQL Schema Design: "The Questions you have" vs. "The Answers I can give"

Migrating to NoSQL from SQL has been a real continuing challenge for us. SQL third normal form provided a very clear set of rules on how to design databases, and I had long mastered how to design SQL schema, and had become fairly adept at making indexes to optimize queries. When these tools seemed to fail then I would build some selective de-normalizations into our databases. However, with NoSQL we are given a choice to have a flexible schema and anything goes.

Where to start? When given a complete blank slate with no fixed rules or methodologies, I was not sure what to do. Fortunately, I had already moved our application architecture to a Service Oriented Architecture, and my APIs already generated structured objects which could be easily ported to our new Couchbase NoSQL database, and that's what we did. So now we were finally dipping our toes into the NoSQL world. We then attached our Couchbase cluster to an Elasticsearch cluster, and then we could text search absolutely everything in our universe of data with one simple query. It was amazing. We could search all the people, companies, deals, emails, telephone numbers, notes, phone logs, email logs, street addresses, web addresses...everything with just one simple query. In SQL, this would be enormously difficult. You would have to create text indexes on all the numerous tables and then write a very complex union query to look at all the tables, and then who knows what the performance would be like. My users found this new ability interesting, but then I got these questions...

  • I would like to see only my stuff, or what my co-workers have
  • I would like to see stuff from a certain time period
  • I would like to see only companies within a certain revenue range or other criteria
  • I would like to see counts of what I want before I see the results, so that if there isn't anything there I don't want to bother seeing the results
  • I spoke to someone but I can't remember their name, but we talked about "blah blah" in the last year or maybe two years ago
I then realized that the object models I had used from my API were not going to work, and I realized that these object models were really based on doing CRUD (Create, Read, Update, Delete) operations on a SQL data store, and that I had to change the model.

I read online extensively. I posted on forums asking how to design a schema for NoSQL. I really did not see anything as definitive as third normal form. Finally, I saw on Stack Overflow an idea which cleared up everything. I paraphrase the idea:
With SQL databases you design your schema based on the answers you have, and with NoSQL databases you design your schema based on the questions you have.
I wish I could credit where I saw this idea, but it cleared everything up for me. Searching for the third normal form equivalent in NoSQL was pointless. Those rules were built to optimize space and simplify queries. They were based on relational algebra. They were perfect for a world of smaller, clearly defined data. However, in this new world the questions became more important than the answers. The users expected information to be more loosely correlated, and they wanted their simple questions to be answered simply and quickly even though the underlying information can be quite complex. And so I now begin to restructure my schemas based not on the information I have but on the questions my users have. Wish me luck.
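Here is a rough sketch of what designing for the questions could look like, using the "only my stuff", "certain time period" and "counts first" requests from my users. It uses plain Python structures rather than Couchbase or Elasticsearch, and the field names (`owner`, `date`, `text`) and function names are assumptions made up for illustration.

```python
from collections import defaultdict


def build_question_view(notes):
    """Group notes by (owner, year) so 'my stuff in a time period' is one lookup.

    Assumes each note has an "owner", an ISO "date" string (YYYY-MM-DD) and "text".
    """
    view = defaultdict(list)
    for note in notes:
        year = note["date"][:4]
        view[(note["owner"], year)].append(note)
    return view


def count_matches(view, owner, year):
    """Answer 'is there anything there?' before fetching any results."""
    return len(view.get((owner, year), []))


def search(view, owner, year, phrase):
    """Find notes for one owner and year that mention a phrase, ignoring case."""
    wanted = phrase.lower()
    return [n for n in view.get((owner, year), []) if wanted in n["text"].lower()]


if __name__ == "__main__":
    notes = [
        {"owner": "me", "date": "2014-03-02", "text": "Talked about blah blah"},
        {"owner": "me", "date": "2015-01-10", "text": "Follow-up call"},
    ]
    view = build_question_view(notes)
    print(count_matches(view, "me", "2014"))
```

The data is the same as before; what changes is that it is keyed by the way people ask for it. In a real document store this grouping would more likely be a view or an index than a dictionary built in application code.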