Re: Facebook Database Schema Pdf Download

Berry Spitsberg

Jul 17, 2024, 2:19:38 PM

to trocadrice

Well, this is a graph. :) It doesn't tell you how to build it in SQL. There are several ways to do that, and this site covers a good number of different approaches. Note that a relational DB is what it is: it is designed to store normalised data, not a graph structure, so it won't perform as well as a specialised graph database.
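One of the several ways to hold a graph in SQL is a plain adjacency list: one table for nodes and one for edges. The sketch below is only an illustration of that approach. It uses SQLite through Python so that it runs on its own, and the table names, column names and sample users are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- Adjacency list: one row per directed edge.
    CREATE TABLE friendships (
        user_id   INTEGER NOT NULL REFERENCES users(id),
        friend_id INTEGER NOT NULL REFERENCES users(id),
        PRIMARY KEY (user_id, friend_id)
    );
""")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")],
)


def add_friendship(a, b):
    """Store an undirected friendship as two directed rows."""
    conn.executemany(
        "INSERT INTO friendships (user_id, friend_id) VALUES (?, ?)",
        [(a, b), (b, a)],
    )


add_friendship(1, 2)
add_friendship(2, 3)
add_friendship(3, 4)


def friends_of_friends(user_id):
    """Names two hops away, excluding the user and their direct friends."""
    rows = conn.execute(
        """
        SELECT DISTINCT u.name
        FROM friendships f1
        JOIN friendships f2 ON f2.user_id = f1.friend_id
        JOIN users u        ON u.id = f2.friend_id
        WHERE f1.user_id = ?
          AND f2.friend_id != ?
          AND f2.friend_id NOT IN (
              SELECT friend_id FROM friendships WHERE user_id = ?
          )
        ORDER BY u.name
        """,
        (user_id, user_id, user_id),
    ).fetchall()
    return [name for (name,) in rows]
```

Every extra hop costs another self-join on the edge table, which is one concrete reason a relational store tends to lag behind a graph database on deep traversals.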

Download File https://urllio.com/2yUMej

You'll see that these companies are dealing with data warehouses, partitioned databases, data caching and other higher-level concepts that most of us never deal with on a daily basis. Or at least, maybe we don't know that we do.

It's not possible to retrieve a user's friends data from an RDBMS in constant time once the data set crosses half a billion records, so Facebook implemented this using a hash database (NoSQL) and open-sourced that database as Cassandra.

I am building an application that I want to interface with Facebook Connect, Twitter, OpenID, and potentially other social networks. Users will be able to login using any number of these methods at the same time. My application uses MySQL as a backend database.

Can someone give me guidance on what my db schema should look like for capturing user info from various social networks at the same time? One idea I have (based on my reading online) is something like:

This is what I do: I separate the accounts table from the authentication process, e.g. the account table holds the account name, registration date, and maybe a unique id. Then I create four additional tables, for example users_openid, users_facebook, users_twitter and users (for your normal username/website authentication), each with a foreign key (account_id) that links to the account table.
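A rough sketch of that layout follows. The question is about MySQL, but the example uses SQLite through Python so it can run standalone. The table names follow the post; every column other than account_id (facebook_uid, twitter_uid, openid_url, password_hash and so on) and both helper functions are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE accounts (
        id            INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        registered_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
    -- One table per login method, each pointing back at accounts.
    CREATE TABLE users (
        account_id    INTEGER NOT NULL REFERENCES accounts(id),
        username      TEXT NOT NULL UNIQUE,
        password_hash TEXT NOT NULL
    );
    CREATE TABLE users_openid (
        account_id INTEGER NOT NULL REFERENCES accounts(id),
        openid_url TEXT NOT NULL UNIQUE
    );
    CREATE TABLE users_facebook (
        account_id   INTEGER NOT NULL REFERENCES accounts(id),
        facebook_uid TEXT NOT NULL UNIQUE
    );
    CREATE TABLE users_twitter (
        account_id  INTEGER NOT NULL REFERENCES accounts(id),
        twitter_uid TEXT NOT NULL UNIQUE
    );
""")

# Fixed whitelist, so table and column names never come from user input.
PROVIDERS = {
    "openid": ("users_openid", "openid_url"),
    "facebook": ("users_facebook", "facebook_uid"),
    "twitter": ("users_twitter", "twitter_uid"),
}


def link_identity(account_id, provider, external_id):
    """Attach one more external login to an existing account."""
    table, column = PROVIDERS[provider]
    conn.execute(
        f"INSERT INTO {table} (account_id, {column}) VALUES (?, ?)",
        (account_id, external_id),
    )


def find_account(provider, external_id):
    """Return the account id for an external login, or None if unknown."""
    table, column = PROVIDERS[provider]
    row = conn.execute(
        f"SELECT account_id FROM {table} WHERE {column} = ?",
        (external_id,),
    ).fetchone()
    return row[0] if row else None


account_id = conn.execute(
    "INSERT INTO accounts (name) VALUES (?)", ("example-user",)
).lastrowid
link_identity(account_id, "facebook", "fb-123")
link_identity(account_id, "twitter", "tw-456")
```

Because every login method resolves to the same account_id, a user can sign in through any linked network and land on the same account.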

MySQL is the primary database used by Facebook for storing all social data. They started with the InnoDB storage engine and then wrote MyRocks, a MySQL storage engine built on RocksDB, which eventually became the MySQL storage engine in use.

A polyglot persistence architecture has several upsides. Different databases with different data models can be leveraged to implement different use cases. The system is more highly available and easier to scale.

If we have ACID requirements, as for a financial transaction, MySQL would fit best. When we need fast data access, we would pick Memcache. When we are okay with denormalized, eventually consistent data but need a fast, highly available database, a NoSQL solution would fit best.

When a user updates the value of an object, the new value is written to the database and the old value is deleted from Memcache. The next time a user requests that object, the updated value is fetched from the database and written to Memcache. From then on, every request for that value is served from Memcache until it is modified again.
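The write and read paths described above can be sketched in a few lines. This is a toy model, not Facebook's code: two plain dictionaries stand in for MySQL and Memcache, and the class name, method names and the db_reads counter are invented for the example.

```python
class LookasideStore:
    """Toy model of a database fronted by a look-aside cache."""

    def __init__(self):
        self.database = {}  # stands in for MySQL
        self.cache = {}     # stands in for Memcache
        self.db_reads = 0   # counts how often reads fall through to the database

    def write(self, key, value):
        # Write the new value to the database, then delete the stale cache entry.
        self.database[key] = value
        self.cache.pop(key, None)

    def read(self, key):
        # Serve from the cache when possible.
        if key in self.cache:
            return self.cache[key]
        # Cache miss: fetch from the database and populate the cache.
        self.db_reads += 1
        value = self.database[key]
        self.cache[key] = value
        return value


store = LookasideStore()
store.write("profile:1", "v1")
first = store.read("profile:1")    # miss, loaded from the database
second = store.read("profile:1")   # hit, served from the cache
store.write("profile:1", "v2")     # invalidates the cached copy
third = store.read("profile:1")    # miss again, sees the new value
```

Deleting the cached entry on write, rather than overwriting it, keeps the database as the single source of truth: the cache is only ever filled from a database read.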

The instances of an app are geographically distributed. When one instance of a distributed database is updated, say in Asia, it takes a while for the changes to cascade to all of the instances of the database running globally.

The migration of the messenger service database from HBase to MyRocks (MySQL on the RocksDB storage engine) enabled Facebook to serve messages to its users from flash memory instead of spinning hard disks. Also, the replication topology of MySQL is more compatible with the way Facebook data centers operate in production. This made the service more available and gave it better disaster recovery capabilities.

Facebook uses the storage engine to store system measurements: product stats such as how many messages are sent per minute, service stats such as the rate of queries hitting the cache versus the MySQL database, and system stats such as CPU, memory and network usage.

Where can I find database architectures (db model or db schema) of Social Media Sites (like Facebook, Twitter), Content Websites (like medium, youtube, vimeo), & Question-Answer websites (like Stack Overflow, Quora etc) to learn database design effectively?

If you read the Facebook engineering blog, you'll see they initially used HBase but have now switched to MyRocks, which is a MySQL storage engine. Take these design posts with a grain of salt. Sometimes they are just taken from old engineering posts with no explanation. Do your own research. You can actually use any type of database to store messages.

My question is, what is the best practice for storing comments and likes in a post? Should these be completely different tables? I read in a blog once that normalizing a database for the web can be horrible for querying; what's the best approach to this?

Performance. The copy is wasteful on disk space. However, it does get rid of the need to maintain an index on a foreign key. It may not sound like much, but on an extremely large database that is a huge deal. With any Facebook-caliber dataset you are going to have to break with tradition a bit to make things fast. Minus 1 to links.

Your intuition is correct, definitely use option 2. There is no reason to duplicate the entire post in the database each time someone shares it. Just have a separate SharedPosts table which simply maps userIDs to other users' posts, like you say.
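A minimal sketch of that second option, again using SQLite through Python so it runs on its own. Only the idea of a SharedPosts mapping table comes from the answer; the column names, the sample rows and the timeline helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE posts (
        id        INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES users(id),
        body      TEXT NOT NULL
    );
    -- A share is just a (user, post) pair; the post body is never copied.
    CREATE TABLE shared_posts (
        user_id INTEGER NOT NULL REFERENCES users(id),
        post_id INTEGER NOT NULL REFERENCES posts(id),
        PRIMARY KEY (user_id, post_id)
    );
""")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)", [(1, "alice"), (2, "bob")]
)
conn.executemany(
    "INSERT INTO posts (id, author_id, body) VALUES (?, ?, ?)",
    [(1, 1, "hello from alice"), (2, 2, "hello from bob")],
)
# alice shares bob's post
conn.execute("INSERT INTO shared_posts (user_id, post_id) VALUES (?, ?)", (1, 2))


def timeline(user_id):
    """Own posts plus shared posts as (post_id, body, is_shared) tuples."""
    return conn.execute(
        """
        SELECT p.id, p.body, 0
        FROM posts p
        WHERE p.author_id = ?
        UNION ALL
        SELECT p.id, p.body, 1
        FROM shared_posts s
        JOIN posts p ON p.id = s.post_id
        WHERE s.user_id = ?
        ORDER BY 1
        """,
        (user_id, user_id),
    ).fetchall()


post_count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
```

The share shows up on alice's timeline while the posts table still holds a single copy of bob's post, so an edit to the original is visible everywhere it was shared.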

In the lookaside caching pattern, the application first requests data from the cache instead of the database. If the data is not cached, the application gets the data from the backing database and puts it into the cache for subsequent reads. Note that the PHP application was accessing MySQL and memcache directly without any intermediate data abstraction layer.

Engineers had to work with two data stores with two very different data models: a large collection of MySQL master-slave pairs for storing data persistently in relational tables, and an equally large collection of memcache servers for storing and serving flat key-value pairs derived (some indirectly) from the results of SQL queries. Working with the database tier now required intricate knowledge of how the two stores worked in conjunction with each other. The net result was a loss of developer agility.

TAO represented data items as nodes (objects) and relationships between them as edges (associations). The FB application developers loved the API because they could now easily manage database updates and queries necessary for their application logic with no direct knowledge of MySQL or even memcache.
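To make the objects-and-associations model concrete, here is a small in-memory sketch. It is not TAO and does not reproduce TAO's actual API: the class, its method names and the sample data are hypothetical, and everything a real system would need (MySQL persistence, caching, sharding) is left out.

```python
import itertools
from collections import defaultdict


class GraphStore:
    """In-memory toy: typed objects (nodes) and typed associations (edges)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._objects = {}                # object id -> (type, fields)
        self._assocs = defaultdict(dict)  # (id1, assoc type) -> {id2: data}

    def obj_add(self, otype, **fields):
        """Create a node and return its id."""
        oid = next(self._ids)
        self._objects[oid] = (otype, fields)
        return oid

    def obj_get(self, oid):
        """Return (type, fields) for a node."""
        return self._objects[oid]

    def assoc_add(self, id1, atype, id2, **data):
        """Create or overwrite the directed edge id1 --atype--> id2."""
        self._assocs[(id1, atype)][id2] = data

    def assoc_get(self, id1, atype):
        """Ids reachable from id1 over edges of the given type, oldest first."""
        return list(self._assocs.get((id1, atype), {}))

    def assoc_count(self, id1, atype):
        """Number of edges of the given type leaving id1."""
        return len(self._assocs.get((id1, atype), {}))


graph = GraphStore()
alice = graph.obj_add("user", name="alice")
bob = graph.obj_add("user", name="bob")
post = graph.obj_add("post", text="hello")

graph.assoc_add(alice, "authored", post)
graph.assoc_add(post, "liked_by", bob)
graph.assoc_add(post, "liked_by", alice)
```

Application code written against an interface like this asks for "who liked this post" and never mentions tables, cache keys or shards, which is the kind of separation the paragraph above credits for the API's popularity.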

Most of us in the enterprise world do not have Facebook-scale problems but nevertheless want to scale out SQL databases on-demand. We love SQL for its flexibility and ubiquity, which means we want to scale without giving up on SQL. Is there a general purpose solution for enterprises like us? The answer is Yes!

We are now in the second generation of distributed SQL databases where massive scalability and global data distribution are built into the database layer as opposed to 10 years back when Facebook had to build these features into the application layer.

Sharding is completely automatic in the Spanner architecture. Additionally, shards become auto balanced across all available nodes as new nodes are added or existing nodes are removed. Microservices needing massive write scalability can now rely on the database directly as opposed to adding new infrastructure layers similar to the ones we saw in the FB architecture. No need for an in-memory cache (that offloads read requests from the database thereby freeing it up for serving write requests) and also no need for a TAO-like application layer that does shard management.
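The idea of shards being spread across nodes and rebalanced as nodes come and go can be illustrated with a toy placement function. This is not how Spanner or any particular database implements it: the fixed shard count, the hash function and the rebalancing policy are all assumptions chosen to keep the sketch short.

```python
import hashlib

NUM_SHARDS = 12  # assumed fixed shard count for the sketch


def shard_for(key):
    """Map a row key to a shard with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def rebalance(assignment, nodes):
    """Return a new {shard: node} map with shards spread evenly over nodes.

    A shard stays on its current node whenever that node still exists and
    has room, so adding or removing a node only moves a few shards.
    """
    nodes = list(nodes)
    base, extra = divmod(NUM_SHARDS, len(nodes))
    # Remaining capacity per node; the first `extra` nodes take one more shard.
    capacity = {n: base + (1 if i < extra else 0) for i, n in enumerate(nodes)}

    new_assignment = {}
    homeless = []
    for shard in range(NUM_SHARDS):
        owner = assignment.get(shard)
        if owner in capacity and capacity[owner] > 0:
            new_assignment[shard] = owner
            capacity[owner] -= 1
        else:
            homeless.append(shard)

    # Place shards that lost their node (or overflowed) where room is left.
    for shard in homeless:
        node = max(capacity, key=capacity.get)
        new_assignment[shard] = node
        capacity[node] -= 1
    return new_assignment


three_nodes = rebalance({}, ["n1", "n2", "n3"])
four_nodes = rebalance(three_nodes, ["n1", "n2", "n3", "n4"])
moved = sum(1 for s in range(NUM_SHARDS) if three_nodes[s] != four_nodes[s])
```

In this example, growing from three nodes to four moves only the three shards handed to the new node. The application keeps calling shard_for and never needs to know where a shard currently lives.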

The benefit of a globally-consistent database architecture is that microservices needing absolutely correct data in multi-zone and multi-region write scenarios can finally rely on the database directly. Conflicts and data loss observed in typical multi-master deployments of the past do not occur. Features such as table-level and row-level geo-partitioning ensure that data relevant to the local region remains leadered in the same region. This ensures that the strongly consistent read path never incurs cross-region/WAN latency.

Unlike the legacy NewSQL databases, SQL and ACID transactions in their complete form can be supported in the Spanner architecture. Single-key operations are by default strongly consistent and transactional (the technical term is linearizable). Single-shard transactions by definition are leadered at a single shard and hence can be committed without the use of a distributed transaction manager. Multi-shard (aka distributed) ACID transactions involve a 2-Phase Commit using a distributed transaction manager that also tracks clock skews across the nodes. Multi-shard JOINs are similarly handled by querying data across the nodes. The key here is that all data access operations are transparent to the developer who simply uses regular SQL constructs to interact with the database.
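The two-phase commit used for multi-shard transactions can be sketched as follows. This is a bare-bones illustration of the protocol's shape only: the Shard class, the fail_prepare switch and run_transaction are hypothetical, and clock-skew tracking, durable logging and coordinator failure handling are all omitted.

```python
class Shard:
    """Toy participant that can stage writes and then commit or abort them."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.pending = None
        self.fail_prepare = False  # test switch: simulate a shard voting "no"

    def prepare(self, writes):
        """Phase 1: stage the writes and vote on whether they can commit."""
        if self.fail_prepare:
            return False
        self.pending = dict(writes)
        return True

    def commit(self):
        """Phase 2 (success): make the staged writes visible."""
        self.data.update(self.pending)
        self.pending = None

    def abort(self):
        """Phase 2 (failure): drop the staged writes."""
        self.pending = None


def run_transaction(writes_by_shard):
    """Coordinator: commit on every shard or on none.

    writes_by_shard maps each participating Shard to its {key: value} writes.
    """
    prepared = []
    for shard, writes in writes_by_shard.items():
        if shard.prepare(writes):
            prepared.append(shard)
        else:
            # One "no" vote aborts everything staged so far.
            for participant in prepared:
                participant.abort()
            return False
    for participant in prepared:
        participant.commit()
    return True


accounts_a = Shard("a")
accounts_b = Shard("b")
ok = run_transaction({accounts_a: {"alice": 90}, accounts_b: {"bob": 110}})

accounts_b.fail_prepare = True
failed = run_transaction({accounts_a: {"alice": 0}, accounts_b: {"bob": 200}})
```

After the second call neither shard has changed, which is the all-or-nothing behaviour the developer gets without writing any of this coordination logic in the application.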

The issue is complicated because the relational database method is intended to be a declarative approach where you specify your solution and the relational database management system (RDBMS) would then optimize your declarative expressions.

To move this question beyond hand-waving and opinion-mongering, you should probably show some work on how you express your requirements and then speculate about how an RDBMS might realize your query. Only then can one start discussing the trade-offs of different solutions. You also have to be aware of the framing of the question. NoSQL always wins because it can just be hard-coded to one partial problem, but the full context of the full system (especially ACID requirements) doesn't go away. It's not as if graph databases did not exist before the relational model was invented by Edgar Codd.

The trade-off is where you want the massive UPDATE activity to reside: with the Post or in a separate table? Heck, if you use a column-oriented database it doesn't matter. This is the point: optimization done prematurely boxes you in. SQL is, if anything, a language for concise specification of the logic of the data, an idea of E. F. Codd's that has been implemented many times but never perfectly.