Spanner and Colossus: I don't know the answer to this for sure, but I'm going to hazard a guess. It's replication for two different purposes. Colossus is a file system, so its replication is about not losing data that you've saved to disk. Spanner replication is about getting availability and throughput to transact on that data. When you commit something to a given spanserver, it gets written to Colossus, which can then do its own replication to make sure that data doesn't disappear when a disk dies.
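To make that guess concrete, here is a toy sketch of the two layers. Everything in it is made up for illustration (ToyFileSystem, ToyReplica, ToyReplicaGroup are hypothetical names, not real Colossus or Spanner APIs), and it leaves out Paxos and everything else Spanner does to keep replicas in agreement. It only shows that the lower layer protects against a dead disk while the upper layer keeps serving when a whole server is unavailable.

```python
class ToyFileSystem:
    """Stand-in for a Colossus-like layer: every write is copied to several disks."""

    def __init__(self, num_disks=3):
        # Each disk is a dict of path -> data; None marks a dead disk.
        self.disks = [dict() for _ in range(num_disks)]

    def write(self, path, data):
        for disk in self.disks:
            if disk is not None:
                disk[path] = data

    def fail_disk(self, index):
        self.disks[index] = None

    def read(self, path):
        # Any surviving disk that holds the path can answer.
        for disk in self.disks:
            if disk is not None and path in disk:
                return disk[path]
        raise IOError("data lost: " + path)


class ToyReplica:
    """Stand-in for one spanserver-like replica with its own file system."""

    def __init__(self, name):
        self.name = name
        self.fs = ToyFileSystem()
        self.up = True


class ToyReplicaGroup:
    """Upper layer: several replicas so reads survive a server outage.

    Assumption: a real system runs a consensus protocol here; this toy
    simply writes to every replica that is up.
    """

    def __init__(self, names):
        self.replicas = [ToyReplica(name) for name in names]

    def commit(self, key, value):
        for replica in self.replicas:
            if replica.up:
                replica.fs.write(key, value)

    def read(self, key):
        for replica in self.replicas:
            if replica.up:
                return replica.fs.read(key)
        raise RuntimeError("no replica available")
```

In this toy, killing one disk under a replica is absorbed by the file system layer, and taking a whole replica down is absorbed by the replica group, which is the split in purpose I was guessing at.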
Other storage stuff:
- Dremel is a read-only query tool. It can do ridiculously complex queries on really large data, even on the fly. We use it a lot on my team to query our log files. I think the fanciness comes from the way it scales by distributing query execution over a lot of machines, but I don't really know anything about the internals. It seemed to me that the point of mentioning it in the paper was that there's high demand for being able to do more than key lookups. There's a project called Apache Drill that's supposed to be an open-source version of it, if you're curious.
- Megastore is a lot like Spanner, and it's fair to consider Spanner the next generation of Megastore. Megastore, which is built on top of Bigtable, also supports transactions and SQL-like queries.
- I'd actually never heard of Percolator before you mentioned it, but from the documentation it looks to be along the same lines as Spanner and Megastore.
Spanner and search: Spanner, as far as I'm aware, isn't used by search at all. It's pretty new, and has relatively few clients. The read/write pattern for ads is very different from search. The stuff that ads is using Spanner for is reads/writes triggered by calls to our APIs. I would suspect that search has less complex queries, but much more data and QPS. They also probably do their writes from MapReduces. They may not even care too much about consistency or transactions, but I've never seen any search code, so...
Spanner queries: The values in Spanner aren't just a byte array; they're structured. Like a LOT of other technologies at Google, Spanner data is often protocol buffers. It also supports typical scalars like string, boolean, ints, etc. The schema is defined just like a SQL table. You can query on the fields you declare (including fields inside the protobufs). My understanding is that if you're not querying on the key of a table, you trigger a full table scan. That's why F1 needed to build secondary indexes on top of Spanner. As far as joins go... they don't look supported to me. You basically get selects, sub-selects, and a where clause. When we defined our new schema, a lot of our joins became nested fields. Many-to-many relationships are a little trickier, and you need to consider the schema carefully.
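Here's a rough sketch of why querying on a non-key field hurts and what a secondary index buys you. It is a plain Python toy, not Spanner's or F1's actual API; ToyTable, ToySecondaryIndex, and the field names are all invented for the example, and the index is built once rather than kept up to date on writes.

```python
from collections import defaultdict


class ToyTable:
    """Rows stored by primary key, like a key-value store with structured values."""

    def __init__(self):
        self.rows = {}          # primary key -> row (a dict of fields)
        self.rows_examined = 0  # counter so the cost of each query is visible

    def insert(self, key, row):
        self.rows[key] = row

    def get(self, key):
        # Lookup by primary key touches a single row.
        self.rows_examined += 1
        return self.rows.get(key)

    def scan_where(self, field, value):
        # Filtering on a non-key field has to look at every row.
        matches = []
        for key, row in self.rows.items():
            self.rows_examined += 1
            if row.get(field) == value:
                matches.append(key)
        return matches


class ToySecondaryIndex:
    """Maps a field value to the primary keys that have it.

    Assumption: built once from the table's current contents; a real
    index would have to be updated along with every write.
    """

    def __init__(self, table, field):
        self.entries = defaultdict(list)
        for key, row in table.rows.items():
            self.entries[row.get(field)].append(key)

    def lookup(self, value):
        return list(self.entries.get(value, []))


# A one-to-many relationship stored as a nested, repeated field instead
# of a second table that would need a join (hypothetical schema).
campaigns = ToyTable()
campaigns.insert(1, {"owner": "alice", "ads": [{"id": 10}, {"id": 11}]})
campaigns.insert(2, {"owner": "bob", "ads": [{"id": 20}]})
campaigns.insert(3, {"owner": "alice", "ads": []})

by_owner = ToySecondaryIndex(campaigns, "owner")
```

With this setup, campaigns.scan_where("owner", "alice") examines all three rows, while by_owner.lookup("alice") goes straight to the matching keys, and reading campaigns.get(1)["ads"] gets the child records without any join. A many-to-many relationship doesn't nest this cleanly, which is the part that takes more careful schema design.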
Sorry I couldn't give very definitive answers. Hopefully that all makes sense, though? I wouldn't be surprised if there are other Googlers lurking in the class who could fill in the blanks or correct me. Most of my knowledge is concentrated in ads, particularly our SQL db.