data system通常要做下面的事情:
保存数据,以便应用本身或者其它应用能够在未来使用(数据库);
保存大开销的计算结果,加速读取(缓存);
通过关键字查找/过滤数据;(查找索引)
给其它进程发消息异步处理(流处理);
周期性处理大批量的增量数据(批处理);
there are datastores that are also used as message queues (Redis), and there are message queues with database-like durability guarantees (Apache Kafka). The boundaries between the categories are becoming blurred.
Secondly, increasingly many applications now have such demanding or wide-ranging requirements that a single tool can no longer meet all of its data processing and storage needs. Instead, the work is broken down into tasks that can be performed efficiently on a single tool, and those different tools are stitched together using application code.
总之一个data system现在往往需要多种组件组合搭建。

In this book, we focus on three concerns that are important in most software systems:
Reliability The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).
Scalability As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
Maintainability Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.
thinking about Reliability
it only makes sense to talk about tolerating certain types of faults.
a fault is not the same as a failure . A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.
-
Hardware Faults(硬件错误往往是uncorrelated )
- Software Errors (经常是uncorrelated 的:举例,闰秒错误;共享了资源的跑飞程序;上游service不行了;Cascading failures雪崩)
- Human Errors (Design systems in a way that minimizes opportunities for error写好代码,接口强约束但是可能会影响效率;Decouple the places where people make the most mistakes from the places where they can cause failures;测试;快速恢复;监控,Set up detailed and clear monitoring, such as performance metrics and error rates. )