Star

Two Years After Leaving Here

2019-09-18T05:54:16.000Z

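It has been two years, and I feel very far from the age when I was most imaginative and most passionate. Lately I keep wondering how I turned into an increasingly mediocre person.

I can't even find a label for myself.

At least here I can still write a little and look for what I really want.

Today I received an email asking for a referral, with no résumé attached. For the first time, I wasn't annoyed. I suddenly remembered how reckless and lost I was when I was job hunting myself: the sleepless nights, the moments of crying and crying, the version of me who could not face failure and yet was so resilient. At least back then I was full of hope, quick-witted, and kind.

I don't know what work has changed in me: whether it made me harsh or impatient, or simply dull. In the end I fell into society's trap.

After thinking it over: what I'm good at may not be good enough, but I still love writing, recording, and thinking. I won't arrive anywhere in particular, I have no "small goals", and I may never publish a book. But I want to keep writing properly and push myself a little harder.

Perhaps I worked too hard in the years before, so once I had a job I kept thinking about how to enjoy life. But I'm probably not made for enjoyment. I still miss the feeling of striving so hard that I forgot who I was, and I miss the me who was full of ideas.

I don't know whether a startup would suit me better. If I can make a living and keep myself fed, why should I choose mediocrity for the sake of so-called prestige?

I don't even know when I will write the next post. I don't know whether I will work hard tomorrow.

But I'm recording it here: at least in this moment, I still wondered whether I could be something other than ordinary.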

Computer Networking Fundamentals: A First Pass at the Key Points

2017-06-07T17:57:55.000Z

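Work in progress...

HTTP

HTTP messages

A request consists of a request line, header lines, and a message body.

Basic HTTP methods:

GET, POST, PUT, DELETE. URL stands for Uniform Resource Locator; think of a URL as an address that identifies a resource on the network and says where to find it.

GET:

Used to retrieve information; it should be safe and idempotent.

Safe means the operation retrieves information rather than modifying it. In other words, a GET request should generally have no side effects: like a database query, it only reads the resource and does not modify or add data or change the resource's state.

Idempotent means that sending the same request several times has the same effect on the server as sending it once. (The responses themselves may still differ if the resource changes in between.)

Example GET request: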

GET /books/?sex=man&name=Professional HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.6) Gecko/20050225 Firefox/1.0.1
Connection: Keep-Alive

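POST:

Example request:

POST / HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.6) Gecko/20050225 Firefox/1.0.1
Content-Type: application/x-www-form-urlencoded
Content-Length: 25
Connection: Keep-Alive

sex=man&name=Professional

By convention the data submitted with POST travels in the message body, but the protocol does not prescribe an encoding or data format for it. In practice developers are free to choose the body format, as long as the resulting HTTP request is well-formed: an empty line separates the headers from the body, Content-Length matches the body size in bytes, and Content-Type tells the server how to parse it.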
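To make the relationship between a form body and its Content-Length concrete, here is a minimal Java sketch. The class and method names (FormBodyExample, encodeForm, contentLength) are made up for illustration; only the standard library is used, and UTF-8 is assumed as the body encoding.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class FormBodyExample {
    /** Encodes key/value pairs as application/x-www-form-urlencoded. */
    static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    /** Content-Length counts the bytes of the body, not its characters. */
    static int contentLength(String body) {
        return body.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("sex", "man");
        fields.put("name", "Professional");
        String body = encodeForm(fields);
        System.out.println(body);
        System.out.println("Content-Length: " + contentLength(body));
    }
}
```

For the body sex=man&name=Professional this gives 25 bytes, which is the value the Content-Length header of such a request should carry.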

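Ways to submit data with POST:

application/x-www-form-urlencoded

This is the default for a native browser form: if the enctype attribute is not set, the data is submitted as application/x-www-form-urlencoded.

multipart/form-data

Response messages

An HTTP response is structured like a request and also has three parts:

Status line, response headers, response body

Common status codes:

200 OK: the request succeeded.
301 Moved Permanently: the resource has been permanently redirected.
302 Found (called "Moved Temporarily" in HTTP/1.0): the resource is temporarily redirected.
304 Not Modified: the resource has not changed, so the cached copy can be used.
400 Bad Request: the request has a syntax error and the server cannot understand it.
401 Unauthorized: the request lacks valid authentication. The response carries a WWW-Authenticate header.
403 Forbidden: the server received the request but refuses to serve it, usually explaining why in the response body.
404 Not Found: the requested resource does not exist, for example because the URL was mistyped.
500 Internal Server Error: the server hit an unexpected error and could not complete the request.
503 Service Unavailable: the server cannot handle the request right now and may recover after some time.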

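Keep-alive:

In HTTP/1.1 connections are persistent by default and are closed only when "Connection: close" is sent. Most browsers now speak HTTP/1.1 and therefore request keep-alive connections by default, so whether a connection is actually kept alive depends on the server configuration.

With a persistent connection, how do the client and server know that one message has ended? There are two cases: 1. the amount of data received reaches the size given by Content-Length; 2. dynamically generated content has no Content-Length and is sent with chunked transfer encoding, in which case the end is marked by a final zero-length chunk.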
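The chunked case can be sketched in a few lines of Java. This is an illustration under simplifying assumptions, not a full HTTP parser: the whole body is already in memory, chunk extensions are dropped, and trailers after the last chunk are ignored. ChunkedDecoder is a made-up name.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class ChunkedDecoder {
    /** Decodes a chunked body; stops at the zero-length chunk. */
    static byte[] decode(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pos = 0;
        while (true) {
            int lineEnd = indexOfCrlf(raw, pos);
            if (lineEnd < 0) {
                throw new IllegalArgumentException("missing chunk-size line");
            }
            String sizeLine = new String(raw, pos, lineEnd - pos, StandardCharsets.US_ASCII);
            int semi = sizeLine.indexOf(';');            // drop optional chunk extensions
            if (semi >= 0) {
                sizeLine = sizeLine.substring(0, semi);
            }
            int size = Integer.parseInt(sizeLine.trim(), 16); // chunk size is hexadecimal
            pos = lineEnd + 2;
            if (size == 0) {
                break;                                    // empty chunk: end of this message
            }
            if (pos + size + 2 > raw.length) {
                throw new IllegalArgumentException("truncated chunk");
            }
            out.write(raw, pos, size);
            pos += size + 2;                              // skip the data and its trailing CRLF
        }
        return out.toByteArray();
    }

    private static int indexOfCrlf(byte[] raw, int from) {
        for (int i = from; i + 1 < raw.length; i++) {
            if (raw[i] == '\r' && raw[i + 1] == '\n') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] wire = "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(decode(wire), StandardCharsets.US_ASCII));
    }
}
```

The receiver never needs to know the total length up front; it just keeps reading size-prefixed chunks until it sees size 0.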

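A cookie is a small piece of information that a web server sends to the client. The client stores it and sends it back with later requests to that server, which lets the server identify the user. The server sets a cookie with the Set-Cookie response header, and the client returns it in the Cookie request header.

The client can keep a cookie in one of two ways. A session (temporary) cookie lives in memory and disappears when the browser is closed. A persistent cookie is stored on the client's disk; as long as it has not expired, the client sends it again whenever it visits the site, which is how users are tracked across visits.

Cookies can be disabled.

Session

Every user has a separate session that is not shared with other users, and information can be stored in it.

The server creates a session object and generates a session ID that identifies it. The session ID is sent to the client in a cookie; on the next visit the client sends the session ID back, and the server uses it to tell users apart.

Sessions are usually implemented on top of cookies, so if cookies are disabled the session stops working unless the session ID is passed some other way, such as URL rewriting.

Token

The current mainstream way to defend against CSRF attacks is to use a token.

Principles for using tokens

Sufficiently random: only then is it unpredictable.
One-time: refresh the token after each successful request, which makes it harder to attack and to predict.
Kept confidential: use POST for sensitive operations so that the token does not appear in the URL.

Note: filtering user input does not stop CSRF; what has to be checked is where the request comes from.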
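As a rough illustration of the "sufficiently random" principle, the sketch below generates a token with SecureRandom and compares a submitted token in constant time. CsrfTokens and its methods are hypothetical names; storing the token in the session and rotating it after each request are left to the surrounding application.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Base64;

class CsrfTokens {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Returns a URL-safe token built from 32 random bytes (256 bits). */
    static String newToken() {
        byte[] bytes = new byte[32];
        RANDOM.nextBytes(bytes);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    /** Compares the stored and submitted tokens without leaking timing information. */
    static boolean matches(String expected, String submitted) {
        if (expected == null || submitted == null) {
            return false;
        }
        return MessageDigest.isEqual(
                expected.getBytes(StandardCharsets.UTF_8),
                submitted.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String token = newToken();
        System.out.println(token + " valid=" + matches(token, token));
    }
}
```

A plain java.util.Random would not be a good substitute here, because its output can be predicted from a few observed values.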

TCP

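TCP provides a connection-oriented, reliable byte-stream service.
A TCP connection has exactly two endpoints; broadcast and multicast cannot be used with TCP.
TCP uses checksums, acknowledgements, and retransmission to provide reliable delivery.
TCP numbers its segments and uses cumulative acknowledgements so that data is delivered in order and without duplicates.
TCP uses a sliding window for flow control and adjusts the window size dynamically for congestion control.

Three-way Handshake

Establishing a TCP connection takes three packets in total between client and server. The handshake connects to the given port on the server, sets up the TCP connection, synchronizes both sides' sequence and acknowledgement numbers, and exchanges TCP window sizes. In socket programming, the handshake is triggered when the client calls connect().

First handshake: the client sends a SYN packet (seq=j) to the server and enters the SYN_SENT state, waiting for the server to confirm.

Second handshake: the server receives the SYN and must acknowledge it (ack=j+1) while sending a SYN of its own (seq=k), i.e. a SYN+ACK packet; the server then enters the SYN_RECV state.

Third handshake: the client receives the SYN+ACK and sends an ACK (ack=k+1) back. The client enters ESTABLISHED when it sends this ACK and the server when it receives it. That completes the handshake, and the two sides can start transferring data.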
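The bookkeeping of sequence and acknowledgement numbers can be traced with a toy simulation. This is only a model of the three packets, with invented class names; it opens no real sockets and ignores timeouts, retransmission, and window sizes.

```java
class HandshakeSketch {
    enum State { CLOSED, LISTEN, SYN_SENT, SYN_RECV, ESTABLISHED }

    /** A TCP segment reduced to the fields the handshake cares about. */
    static final class Segment {
        final boolean syn;
        final boolean ack;
        final long seq;
        final long ackNo;

        Segment(boolean syn, boolean ack, long seq, long ackNo) {
            this.syn = syn;
            this.ack = ack;
            this.seq = seq;
            this.ackNo = ackNo;
        }
    }

    /** Runs the three steps and returns the final {client, server} states. */
    static State[] handshake(long j, long k) {
        State client = State.CLOSED;
        State server = State.LISTEN;

        // 1. client -> server: SYN, seq=j
        Segment syn = new Segment(true, false, j, 0);
        client = State.SYN_SENT;

        // 2. server -> client: SYN+ACK, seq=k, ack=j+1
        Segment synAck = new Segment(true, true, k, syn.seq + 1);
        server = State.SYN_RECV;

        // 3. client -> server: ACK, ack=k+1 (only if the server acknowledged j+1)
        if (!(synAck.syn && synAck.ack && synAck.ackNo == j + 1)) {
            throw new IllegalStateException("unexpected SYN+ACK");
        }
        Segment ack = new Segment(false, true, j + 1, synAck.seq + 1);
        client = State.ESTABLISHED;
        if (ack.ack && ack.ackNo == k + 1) {
            server = State.ESTABLISHED;
        }
        return new State[] {client, server};
    }

    public static void main(String[] args) {
        State[] result = handshake(100, 300);
        System.out.println("client=" + result[0] + " server=" + result[1]);
    }
}
```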

Socket

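A socket is an abstraction over the TCP/IP protocol suite: a software layer that sits between the application layer and TCP/IP. In design-pattern terms a socket is a facade: it hides the complexity of TCP/IP behind the socket interface, so the user only sees a small set of simple calls and lets the socket arrange the data to fit the chosen protocol. A socket can also be seen as a way for processes on different machines to communicate: the triple (IP address, protocol, port) uniquely identifies a process on the network, and processes use that identifier to talk to each other.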

References: https://hit-alibaba.github.io/interview/basic/network/HTTP.html

Binary Search

2017-05-28T21:13:27.000Z

Binary search template summary

Four key elements:

start + 1 < end
start + (end - start) / 2
A[mid] ==, <, >
A[start] A[end] ? target

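Key points of binary search

Head and tail pointers, take the midpoint, decide which way to go
Find the first or the last position that satisfies a condition
Keep the half that is guaranteed to contain the answer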
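The four elements and the "first position that satisfies a condition" idea can be folded into one reusable helper. This is a sketch under the assumption that the condition is monotone over the indices (false for a prefix, then true); BinarySearchTemplate and firstTrue are illustrative names.

```java
import java.util.function.IntPredicate;

class BinarySearchTemplate {
    /**
     * Returns the first index in [0, n) for which ok is true, or -1 if none.
     * Assumes ok is false for a (possibly empty) prefix and true afterwards.
     */
    static int firstTrue(int n, IntPredicate ok) {
        if (n <= 0) {
            return -1;
        }
        int start = 0;
        int end = n - 1;
        while (start + 1 < end) {                 // stop when start and end are adjacent
            int mid = start + (end - start) / 2;  // avoids overflow of start + end
            if (ok.test(mid)) {
                end = mid;                        // answer is at mid or to its left
            } else {
                start = mid;                      // answer is to the right of mid
            }
        }
        if (ok.test(start)) {                     // check the two survivors, left first
            return start;
        }
        if (ok.test(end)) {
            return end;
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 2, 4, 5, 5};
        System.out.println(firstTrue(a.length, i -> a[i] >= 2));
    }
}
```

Finding the last position works the same way with the roles of start and end swapped, checking end before start after the loop.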

Question 1: classical-binary-search

https://www.lintcode.com/en/problem/classical-binary-search/

Find any position of a target number in a sorted array. Return -1 if target does not exist.

Given [1, 2, 2, 4, 5, 5].

For target = 2, return 1 or 2.

For target = 5, return 4 or 5.

For target = 6, return -1.

Explanation

Plain binary search.

Code

public class Solution {
    /**
     * @param A an integer array sorted in ascending order
     * @param target an integer
     * @return an integer
     */
    public int findPosition(int[] A, int target) {
        // Write your code here
        if (A == null || A.length == 0){
            return -1;
        }
        int start = 0;
        int end = A.length - 1;
        while (start + 1 < end) {
            int mid = start + (end - start) / 2;
            if (target == A[mid]) {
                end = mid;
            } else if (target > A[mid]) {
                start = mid;
            } else {
                end = mid;
            }
        }
        if (target == A[start]) {
            return start;
        }
        if (target == A[end]) {
            return end;
        }
        return -1;
    }
}

Question 2: First Position of target

For a given sorted array (ascending order) and a target number, find the first index of this number in O(log n) time complexity.

If the target number does not exist in the array, return -1.

If the array is [1, 2, 3, 3, 4, 5, 10], for given target 3, return 2.

Code

class Solution {
    /**
     * @param nums: The integer array.
     * @param target: Target to find.
     * @return: The first position of target. Position starts from 0.
     */
    public int binarySearch(int[] nums, int target) {
        //write your code here
        if (nums == null || nums.length == 0) return -1;
        int start = 0;
        int end = nums.length - 1;
        while (start + 1 < end) {
            int mid = start + (end - start) / 2;
            if (target == nums[mid]) {
                end = mid;
            } else if (target > nums[mid]) {
                start = mid;
            } else {
                end = mid;
            }
        }
        if (target == nums[start]) return start;
        if (target == nums[end]) return end;
        return -1;
    }
}

Question 3: last-position-of-target

Given an array sorted in ascending order, find the last position at which target occurs; return -1 if it does not occur.

Code

public class Solution {
    /**
     * @param A an integer array sorted in ascending order
     * @param target an integer
     * @return an integer
     */
    public int lastPosition(int[] A, int target) {
        // Write your code here
        if (A == null || A.length == 0){
            return -1;
        }
        int start = 0;
        int end = A.length - 1;
        while (start + 1 < end) {
            int mid = start + (end - start) / 2;
            if (target == A[mid]) {
                //difference with the above questions
                start = mid;
            } else if (target > A[mid]) {
                start = mid;
            } else {
                end = mid;
            }
        }
        //check end element first
        if (target == A[end]) {
            return end;
        }
        if (target == A[start]) {
            return start;
        }
        return -1;
    }
}

Question 4: search-insert-position

Given a sorted array and a target value, return the index if the target is found. If not, return the index where it would be if it were inserted in order.

You may assume NO duplicates in the array.

[1,3,5,6], 5 → 2

[1,3,5,6], 2 → 1

[1,3,5,6], 7 → 4

[1,3,5,6], 0 → 0

Code

public class Solution {
    /**
     * param A : an integer sorted array
     * param target :  an integer to be inserted
     * return : an integer
     */
    public int searchInsert(int[] A, int target) {
        // write your code here
        if ( A == null || A.length == 0) return 0;
        int start = 0;
        int end = A.length - 1;
        while (start + 1 < end) {
            int mid = start + (end - start) / 2;
            if (target == A[mid]) {
                return mid;
            } else if (target > A[mid]) {
                start = mid;
            } else {
                end = mid;
            }
        }
        if (target <= A[start]) return start;
        if (target <= A[end]) return end;
        return end + 1;
    }
}

Question 5: Search in a big sorted array

Given a big sorted array with positive integers sorted by ascending order. The array is so big so that you can not get the length of the whole array directly, and you can only access the kth number by ArrayReader.get(k) (or ArrayReader->get(k) for C++). Find the first index of a target number. Your algorithm should be in O(log k), where k is the first index of the target number. Return -1, if the number doesn't exist in the array.

Explanation

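Unlike the problems above, which look for a number in an ordinary array, this array is very large, so the question is how to grow the search range by doubling. Keep doubling the right bound until the value there is at least the target, then run a first-position binary search within that range.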

Code

class Solution {
    /**
     * @param nums: The integer array.
     * @param target: Target to find.
     * @return: The first position of target. Position starts from 0.
     */
    public int binarySearch(int[] nums, int target) {
        //write your code here
        if (nums == null || nums.length == 0) return -1;
        int end = 0;
        while (end < nums.length && nums[end] < target) {
          end = end * 2 + 1;
        }
        if (end >= nums.length) end = nums.length - 1;
        int start = 0;
        while (start + 1 < end) {
            int mid = start + (end - start) / 2;
            if (target == nums[mid]) {
                end = mid;
            } else if (target > nums[mid]) {
                start = mid;
            } else {
                end = mid;
            }
        }
        if (target == nums[start]) return start;
        if (target == nums[end]) return end;
        return -1;
    }
}
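The listing above works on an int[] and reads nums.length, which the problem statement rules out. A variant that only uses get(k) might look like the sketch below. The ArrayReader interface here is a hypothetical stand-in for the judge's class, and it is assumed to return Integer.MAX_VALUE for indices past the end of the data.

```java
class BigSortedArraySearch {
    /** Hypothetical stand-in for the judge's ArrayReader (assumed to return Integer.MAX_VALUE past the end). */
    interface ArrayReader {
        int get(int k);
    }

    /** Returns the first index of target, or -1 if it is absent. */
    static int searchBigSortedArray(ArrayReader reader, int target) {
        // Double the candidate range until its last element is at least target.
        int count = 1;
        while (reader.get(count - 1) < target) {
            count *= 2;
        }
        int start = 0;
        int end = count - 1;
        // First-position binary search inside [0, count - 1].
        while (start + 1 < end) {
            int mid = start + (end - start) / 2;
            if (reader.get(mid) >= target) {
                end = mid;
            } else {
                start = mid;
            }
        }
        if (reader.get(start) == target) {
            return start;
        }
        if (reader.get(end) == target) {
            return end;
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] data = {1, 3, 6, 9, 21};
        ArrayReader reader = k -> k < data.length ? data[k] : Integer.MAX_VALUE;
        System.out.println(searchBigSortedArray(reader, 3));
    }
}
```

Because the range doubles until it passes the target's position k, both the doubling phase and the binary search take O(log k) calls to get.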

Install Sudo in Debian

2017-04-29T05:21:34.000Z

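Debian does not ship with sudo by default. Install it as follows.

Step 1: Become root

su -

Step 2: Install sudo

apt-get install sudo

Step 3: Add your user name to the sudoers list

visudo

This opens /etc/sudoers for editing and checks the syntax when you save. Add this line:

star ALL=(ALL:ALL) ALL

Step 4: Leave the root shell and test sudo

exit
sudo su -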

Done!

Paper notes: "Stronger consistency and semantics for low-latency geo-replicated storage"

2017-04-12T06:11:01.000Z

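These are reading notes on "Stronger consistency and semantics for low-latency geo-replicated storage".

(Unfinished, to be updated...)

Background

Geo-replication

Large web services today need data storage at massive scale, serving millions of concurrent users. In these systems the data is typically fully replicated, meaning every data center stores the entire data set; Facebook, for example, keeps all user information in every data center. Replicating data across different geographic locations in this way is called geo-replication.
Geo-replication has two benefits: fault tolerance and low latency. If the database at one location goes down, the others keep serving, and users can be served by the location nearest to them.

Data partitioning

In a large-scale store, the data in each data center is so big that it is often spread over tens of thousands of machines. The usual technique is partitioning (sharding): different parts of the data are placed on different servers, and the data has to be repartitioned when machines are added. In short, data is replicated across geographic regions and partitioned within each region.

To lower the round-trip time (RTT), i.e. the network delay between a user sending a request and getting the response, the most important thing is to cut the latency of reaching the data store. One way is to serve requests locally whenever possible. As shown in the figure:

Local-replica-only

Reading and writing at the nearest location is fastest. Reads go to the nearest data center and never fetch remotely. Writes are also applied in that data center and return before the remote data centers are updated. This is called a "local-replica-only" design: it avoids remote access so that no request pays an inter-data-center RTT. However, a local-replica-only design generally cannot provide strong consistency.

Linearizability

Linearizability is a strong consistency model: once a write completes, a read in any other data center must see that write. So in principle low latency and strong consistency are a trade-off.

ALPS systems

From the CAP theorem we know that a system cannot provide Consistency, Availability, and Partition tolerance all at once. Modern web services therefore usually give up strong consistency in favor of availability and partition tolerance. Such systems can be called "ALPS systems": Availability, low Latency, Partition tolerance, and high Scalability.

Given that ALPS systems must give up strong consistency, the question is how strong a consistency model can be under the ALPS constraints. The answer proposed here is causal consistency with convergent conflict handling, or causal+ consistency.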

Paper notes: An Overview of Facebook's Scalable Architecture

2017-03-30T04:43:57.000Z

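These are reading notes on the paper "Overview of Facebook Scalable Architecture" by Hugo Barrigas, Daniel Barrigas, Melyssa Barata, Pedro Furtado, and Jorge Bernardina. It contains few technical details and only sketches the overall framework, so I will add more here when I have more to add. The biggest highlight is probably MySQL + Memcached.

The paper describes Facebook's site architecture, the difficulties met while scaling, and how they were solved, which helps in understanding how Facebook runs. For a large distributed system, scalability is a key property of the network, the system, and its processes: it indicates whether the system can grow and handle an increasing workload. Size is only one aspect of scaling. Scalability covers the following:
a. How easily storage capacity can be added
b. How much additional traffic can be handled
c. How many more transactions can be run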

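Facebook site architecture

How Facebook works

As its user base grew, Facebook made some changes but still uses the LAMP stack (Linux, Apache, MySQL, PHP), with Memcached added as a cache: a. It still uses PHP, but wrote a compiler that turns PHP into C++ to improve server performance. b. It still uses Linux, with some optimizations. c. The most controversial choice is MySQL, which remains the primary database. Facebook also has two systems of its own:

Haystack: highly scalable, used to store huge numbers of photos.
Scribe: a scalable logging system.

Facebook front end

The front end runs LAMP servers with Memcached as the caching layer.

Linux & Apache

Facebook uses Linux and the Apache HTTP Server.

PHP & BigPipe

BigPipe is a dynamic web page serving system developed by Facebook. It breaks a page into parts and processes them in separate stages on the server and in the browser. For example:

HipHop

A compiler that translates PHP into C++. Key points:

It is a PHP compiler
Extensions are easy to add
It greatly reduces CPU and memory usage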

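Facebook back end

MySQL

Facebook applies sharding and caching to MySQL. Sharding, i.e. splitting the database into several parts, is the main way it makes MySQL scale; in addition, about 90% of queries are answered from the cache and never reach the database. Facebook relies heavily on Memcached, and it is worth noting that it runs several thousand MySQL servers across multiple data centers. Some complex join operations are performed at the server level rather than directly on the tables. (How this is done is unclear to me; the technical details are not given.)

Scribe

Scribe is Facebook's logging system. It collects and aggregates log data from many servers and forwards it to Hadoop:

Thrift

Thrift provides cross-language serialization, which lets Facebook build applications using several languages together.

Memcached

Memcached is an in-memory key-value caching system.

Hadoop & Hive

Haystack

Haystack adds a scalable cache in main memory:

Overall architecture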

REFERENCES：

[1] Building Scalable Web Architecture and Distributed Systems http://www.drdobbs.com/web-development/building-scalable-web-architecture-and-d/240142422, (Accessed 26 January , 2014)

[2] How Does Facebook Work? The Nuts and Bolts [Technology Explained] http://www.makeuseof.com/tag/facebook-work-nuts-bolts-technology-explained/ (Accessed 25 February 2014)

[3] Lloys G. W. Lloyd and Connie U. S. 2008. Scalable Query Result Caching for Web Applications. PVLDB vol. 1 no. 1 pp. 550-561, 2008

[4] Parris I., Abdesslem F. B., and Henderson T. 2012. Facebook or Fakebook? The effect of simulation on location privacy user studies. Ad Hoc Networks vol. 12 pp. 35-49

[5] Performance and scalability techniques 101 http://www.webforefront.com/performance/scaling101.html , (Accessed 30 January 2014)

[6] Scaling the Messages Application Back End https://www.facebook.com/note.php?note_id=10150148835363920 (Accessed 25 February 2014)

[7] The effects of teacher self-disclosure via Facebook on teacher credibility http://www.gtaan.gatech.edu/meetings/handouts/MazerFacebook.pdf (Accessed 25 February 2014)

Distributed Systems Notes: Two-phase Commit and Three-phase Commit

2017-03-30T02:19:56.000Z

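This post reviews two common concepts in distributed systems: the two-phase commit protocol and the three-phase commit protocol.

Two-phase Commit Protocol

TPC (more commonly abbreviated 2PC) is an algorithm designed to keep all nodes consistent when committing a transaction in a distributed system. Each node knows whether its own operation succeeded or failed, but not what happened on the other nodes. A coordinator is therefore needed to gather the results from all nodes and then tell them whether to really commit.

Phase 1: voting

1) The coordinator sends the transaction to the participants, asks whether they can commit, and waits for their replies.
2) Each participant executes the transaction's operations and writes its local redo and undo logs, but does not commit.
3) Each participant replies to the coordinator: "agree" if the operations succeeded, otherwise "abort".

Phase 2: commit

a. If every participant voted "agree" in phase 1:
1) The coordinator sends a "commit" request to all participants.
2) Each participant completes the operation for real and releases its resources.
3) Each participant returns a "done" message.
4) Once the coordinator has received every "done" message, the transaction is complete.

b. If any participant voted "abort" in phase 1:
1) The coordinator sends a "rollback" request to all participants.
2) Each participant rolls back using its undo information and releases the resources held during the transaction.
3) Each participant returns a "rollback done" message.
4) Once the coordinator has received "rollback done" from all participants, the transaction is cancelled.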
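The two phases can be traced with a small in-memory simulation. Everything here is invented for illustration (class names, the canCommit flag, string states); there is no networking, logging, or failure handling, so it shows only the decision rule: commit if and only if every participant votes to agree.

```java
import java.util.ArrayList;
import java.util.List;

class TwoPhaseCommitSketch {
    enum Vote { AGREE, ABORT }
    enum Outcome { COMMITTED, ROLLED_BACK }

    /** A participant that either can or cannot carry out its local work. */
    static class Participant {
        final String name;
        final boolean canCommit;
        String state = "INIT";

        Participant(String name, boolean canCommit) {
            this.name = name;
            this.canCommit = canCommit;
        }

        /** Phase 1: do the work tentatively (redo/undo logs in a real system) and vote. */
        Vote prepare() {
            state = canCommit ? "PREPARED" : "FAILED";
            return canCommit ? Vote.AGREE : Vote.ABORT;
        }

        /** Phase 2a: make the work permanent and release resources. */
        void commit() {
            state = "COMMITTED";
        }

        /** Phase 2b: undo the tentative work and release resources. */
        void rollback() {
            state = "ROLLED_BACK";
        }
    }

    /** The coordinator: collect every vote, then broadcast a single decision. */
    static Outcome run(List<Participant> participants) {
        boolean allAgree = true;
        for (Participant p : participants) {
            if (p.prepare() == Vote.ABORT) {
                allAgree = false;
            }
        }
        for (Participant p : participants) {
            if (allAgree) {
                p.commit();
            } else {
                p.rollback();
            }
        }
        return allAgree ? Outcome.COMMITTED : Outcome.ROLLED_BACK;
    }

    public static void main(String[] args) {
        List<Participant> nodes = new ArrayList<>();
        nodes.add(new Participant("db1", true));
        nodes.add(new Participant("db2", false));
        System.out.println(run(nodes));
    }
}
```

In this run one participant votes to abort, so both are rolled back, which matches case b above.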

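The whole process is shown in the figure:

Problems

While waiting for messages a node is blocked, and other processes on the node must wait for the blocked process to release its resources.
If a participant fails, the coordinator has to rely on a timeout for each participant, and the whole transaction fails when the timeout expires. TPC has no fault tolerance: one failed node forces the entire transaction to roll back, which is expensive.
If the coordinator fails, the participants cannot finish the transaction and stay blocked indefinitely.
If the coordinator crashes right after sending a commit message, and the only participant that received it crashes too, then even after a new coordinator is elected nobody knows the state of the transaction.

This led to the three-phase commit protocol.

Three-phase Commit Protocol

A non-blocking protocol.

It changes two things:

Timeouts are introduced for both the coordinator and the participants.
A preparation (pre-commit) phase is inserted between the first and second phases of TPC, so that all participants are in a consistent state before the final commit phase.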

References:

3.1.4 Two-phase commit: http://itdoc.hitachi.co.jp/manuals/3000/30003D5030e/DESC0050.HTM
Wikipedia:
https://zh.wikipedia.org/zh-hans/%E4%BA%8C%E9%98%B6%E6%AE%B5%E6%8F%90%E4%BA%A4
分布式协议之两阶段提交协议（2PC）和改进三阶段提交协议（3PC）: http://www.mamicode.com/info-detail-890945.html

Cloud Computing Notes: Stream Processing with Apache Kafka and Samza

2017-03-25T22:48:14.000Z

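This post introduces Stream Processing with Kafka and Samza and records some reflections.

Stream processing

Demand for real-time data processing keeps growing. Sources of real-time data include sensors (such as IoT devices), social network interactions, and live business data. These use cases need extremely low latency; LinkedIn, for example, continuously feeds real-time ad-click data into its advertising system. Systems such as Hadoop and Spark, where data ingestion is separated from data processing, cannot meet low-latency real-time requirements. This post therefore introduces some stream processing frameworks for managing and processing large volumes of real-time data.

Apache Kafka

Kafka is a distributed publish-subscribe messaging system. It was developed at LinkedIn and later became an Apache project. Publishers put messages into categories without knowing how subscribers will use them, and subscribers subscribe to specific categories and receive only the matching messages. Kafka persists data in a commit log: an ordered, immutable, append-only data structure. Kafka's biggest strength is that it acts as a unified data pipeline from which every system in an organization can read independently and reliably. Kafka can be thought of as the source of the streaming data. Some key Kafka terms:

Topic: a user-defined category under which messages are published. Each topic is maintained as a partitioned log.
Producers: processes that publish messages to one or more topics in the Kafka cluster.
Consumers: processes that read messages from the Kafka cluster.
Partitions: a topic is divided into partitions. A partition is the unit of parallelism; in general, more partitions mean higher throughput. Every message within a partition has its own offset, which consumers use to track their position. Put simply, Kafka groups messages by key and keeps them ordered within each partition, similar to the Map and Shuffle phases of MapReduce.
Brokers: brokers handle data persistence and replication. They talk to producers, which publish messages to the cluster, and to consumers, which fetch them.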
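The ideas of partitions, keys, and offsets can be mimicked with plain Java collections. This toy model is not the Kafka client API: PartitionedLogSketch is an invented class, and hashing the key with hashCode() is a simplification of how a real producer picks a partition.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionedLogSketch {
    // One append-only list per partition; the list index plays the role of the offset.
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedLogSketch(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    /** Messages with the same key always land in the same partition. */
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    /** Appends a message and returns its offset within the chosen partition. */
    long append(String key, String value) {
        List<String> log = partitions.get(partitionFor(key));
        log.add(value);
        return log.size() - 1;
    }

    /** A consumer reads by (partition, offset) and keeps track of the offset itself. */
    String read(int partition, long offset) {
        return partitions.get(partition).get((int) offset);
    }

    public static void main(String[] args) {
        PartitionedLogSketch topic = new PartitionedLogSketch(3);
        long first = topic.append("user-1", "click A");
        long second = topic.append("user-1", "click B");
        int p = topic.partitionFor("user-1");
        System.out.println("partition " + p + ", offsets " + first + " and " + second);
        System.out.println(topic.read(p, second));
    }
}
```

Ordering is guaranteed only inside one partition, which is why choosing the key matters when related messages must be processed in order.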

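Note that Kafka does not process data itself; it is only a way to store and categorize streaming data. In this project, Samza is used to process the streams that Kafka provides.

This video is worth watching (when I can't grasp a concept I usually search YouTube _(:зゝ∠)_ ): Understanding Kafka with Legos

Apache Samza

Samza is a distributed stream processing framework developed at LinkedIn. The three layers of the stream processing stack and their key components are:

Streaming: provides the input as partitioned streams; here this is Kafka
Execution: schedules and coordinates tasks across machines; here this is YARN
Processing: does the actual data processing; here this is Samza

Samza terminology:

Streams: equivalent to Kafka topics
Jobs: use the Samza API to read and process data from one or more streams. A job may be split into tasks, and each task may consume one or more partitions of the input streams
Stateful Stream Processing: stream processing can be stateful or stateless.

Samza architecture:

Samza API

The Samza API is a simple abstraction. Take this example code, which shows how a Twitter-like real-time feed can be built: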

public class FanOutTask implements StreamTask, InitableTask, WindowableTask {
   // Note: the generic type parameters of both stores were lost when this listing was extracted.
   private KeyValueStore socialGraph;
   private KeyValueStore userTimeline;
   private long numMessages = 0;

   @Override
   @SuppressWarnings("unchecked")
   public void init(Config config, TaskContext context) throws Exception {
     socialGraph = (KeyValueStore) context.getStore("social-graph");
     userTimeline = (KeyValueStore) context.getStore("user-timeline");
   }

   @Override
   @SuppressWarnings("unchecked")
   public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
     String incomingStream = envelope.getSystemStreamPartition().getStream();
     if (incomingStream.equals(NewsfeedConfig.FOLLOWS_STREAM.getStream())) {
       processFollowsEvent((Map) envelope.getMessage());
     } else if (incomingStream.equals(NewsfeedConfig.MESSAGES_STREAM.getStream())) {
       processMessageEvent((Map) envelope.getMessage(), collector);
     } else {
       throw new I
       // (the listing is truncated here in the source)

Every Samza job must implement the StreamTask interface, and sometimes the InitableTask and WindowableTask interfaces as well.

Cloud Computing Case Studies: OFA / Start-ups / Data Analytics / Cloud Migration Exercise

2017-03-06T22:41:09.000Z

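This post collects some case studies of cloud computing. Whether to use the cloud is a complicated question, and most of the time a definite answer depends on requirements and budget. Cloud computing is the trend, but how to use it deserves careful thought.

Case study: Obama For America

Obama For America (OFA) is a classic cloud computing case. The cloud was used for campaign fundraising, analyzing the opponent, and spending campaign funds effectively, all of which contributed substantially to Obama's re-election.

What is OFA?

OFA used data integration and predictive analytics to help Obama win the election, mainly by mining data to identify and influence voting targets in swing states. OFA's data sources included demographic data, voting history, fundraising data, volunteer data, social network data, and polling data. With this data in hand, voters were influenced through door-to-door visits, phone calls, email, letters, online ads, social media, paid TV, websites, and so on.

Some OFA numbers

According to the head of DevOps, OFA's key figures included:

4 Gbps of bandwidth
10,000 requests per second
2,000 data nodes
3 data centers
180 TB of data
About 8.5 billion requests in total
583 days from design through deployment to use
At most 30 minutes of downtime
A technical team of 40 people

Core applications

The Obama For America team built 200 applications in total, including: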

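Narwhal

Narwhal is a REST API written in Python that gives access to all of the stored data. In OFA most data lived in relational databases, chiefly MySQL, but PostgreSQL, MS SQL Server, MongoDB, Vertica, LevelDB, S3, DynamoDB, and SimpleDB were also used. Beyond storage, AWS EMR (Elastic MapReduce) and Vertica did the bulk of the data modeling and analysis.

CallTool

CallTool let volunteers phone voters. In the last four days of the election, more than 1,000 volunteers used it to call over a million voters. The tool matches volunteers with voters. The team used AWS Auto Scaling so that cloud resources could be added quickly at peak demand.

DashBoard

An online Rails application for registering volunteers and letting them see their progress and the information collected.

Dreamcatcher

Dreamcatcher processes political sentiment on social networks to pinpoint target voters and win them over.

GOTV

GOTV stands for Get Out The Vote; this application mobilizes supporters so that support turns into votes.

How did AWS help OFA?

Almost all OFA applications were deployed on AWS, using distributed message queues, NoSQL and SQL databases, virtual private cloud, load balancing, content delivery networks, and other services. Besides meeting large-scale data processing needs, this allowed cloud resources to be added automatically at peak demand to keep response times low. The team also made full use of the cloud for data backup.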

Reference:

How Obama’s tech team helped deliver the 2012 election

Wikipedia:Organizing for America

CMU 15719 slides

6 Ways Amazon Cloud Helped Obama Win

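Case study: Start-ups

Background

A start-up's computing needs tend to be varied, small in volume, and quick to change. On a limited budget it also faces long hardware procurement cycles, long deployment cycles, the cost of system administration, data center power supply, low utilization, slow disaster response, and similar problems.

So the cloud becomes an option for start-ups. For example:

GoSquared

GoSquared provides real-time online website analytics. After TechCrunch covered it, usage suddenly surged and it did not have enough resources to cope. It eventually moved the site to AWS, where it could obtain more resources quickly and scale fast.

Reference

AWS case study: GoSquared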

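Case study: Data Analytics

Business decisions today are expected to be data-driven, and the endless flow of multimedia data comes from many sources, from text to images to audio and video. Data processing is further divided into batch processing and stream processing. Traditional data processing pipelines are gradually moving to the cloud as well.

Data processing flow diagram

Lambda architecture

The Lambda architecture defines a clear set of architectural principles and combines batch and stream processing to handle data at scale. It was proposed mainly to answer one question: how can queries be run in real time over an arbitrarily large data set?

The Lambda architecture has three layers:

Batch layer: write once, bulk read many times
Implemented mainly with Hadoop, it stores the data and produces the view data. When new data arrives, MapReduce iteratively aggregates it into the views. Depending on data size and cluster size, each iteration takes on the order of hours.
Serving layer: random reads, no random writes, batch computation and batch writes
This layer is implemented with Cloudera Impala. For the set of raw files containing precomputed views that the batch layer outputs, the serving layer creates a table in the Hive metastore whose metadata points at the files in HDFS; users can then query the views through Impala. Impala is used here so that the data stored and processed by Hadoop can be queried quickly and interactively. Because MapReduce has high latency for real-time data, a speed layer is also needed. The serving layer merges the batch views and the real-time views to produce the query result the user wants.
Speed layer: random reads, random writes, incremental computation
This layer uses the Storm framework to compute real-time views with an incremental model. Real-time views are temporary: once the data has made it through the batch layer, the corresponding real-time view in the serving layer is discarded. Because of the incremental model, this layer tends to be complex.

So the overall flow is: incoming data is sent to the batch layer and the speed layer in parallel, and a query is complete only when the results of both have been merged in the serving layer.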
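The merge step can be pictured with a deliberately tiny example. Assume, purely for illustration, that both views hold page-view counts per page: the batch view covers everything up to the last batch run and the real-time view covers what arrived since. LambdaQuerySketch and its method are invented names, and real views would live in systems such as HDFS or a key-value store rather than in Java maps.

```java
import java.util.HashMap;
import java.util.Map;

class LambdaQuerySketch {
    /** Answers a query by combining the precomputed batch view with the recent real-time view. */
    static long pageViews(Map<String, Long> batchView, Map<String, Long> realtimeView, String page) {
        return batchView.getOrDefault(page, 0L) + realtimeView.getOrDefault(page, 0L);
    }

    public static void main(String[] args) {
        Map<String, Long> batchView = new HashMap<>();
        batchView.put("/home", 100L);          // computed by the last batch run

        Map<String, Long> realtimeView = new HashMap<>();
        realtimeView.put("/home", 7L);         // events since the last batch run
        realtimeView.put("/about", 2L);

        System.out.println(pageViews(batchView, realtimeView, "/home"));
    }
}
```

When the next batch run absorbs the recent events, the matching real-time entries are dropped, so nothing is counted twice.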

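Exciting as this model is, it is still debated. For example:

Code has to be maintained in two complex distributed systems, with different programming for each framework.
Running and maintaining programs in two distributed systems is still hard; for example, doing ORM (object-relational mapping) across database systems is difficult.

For the details of the Lambda architecture, see the book "Big data: Principles and best practices of scalable realtime data systems". Besides introducing the Lambda architecture, it emphasizes several important points:

Distributed thinking
Avoiding incremental architectures
Building recomputation algorithms
Data relevance
A focus on immutable data

This figure illustrates well how Lambda handles a query:

Later on, Spark came to be seen as a reasonable way to implement the Lambda architecture, so writing about Lambda here feels a bit dated, but since the instructor mentioned it, it was worth understanding again. That is how things develop: a stopgap like the Lambda architecture gets replaced by better solutions. Either way, human progress is a delight. When I have time I may write a more detailed post about Spark. Amazing.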

References：

大数据Lambda架构

Wikipedia Lambda Architecture

Lambda Architecture with Apache Spark

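Case study: Cloud migration (should you choose the cloud at all?)

For a research lab, is it better to use on-premises hardware or to move to the cloud? The answer, of course, is "It depends."

Suppose the requirements are as follows:

If we choose on-premises machines, over five years:

There are two cloud options. Option 1: Option 2:

So the decision whether to migrate depends on the requirements, the cloud offerings, the pricing models, the access requirements, and so on.

When comparing, we also need to consider:

For on-premises machines: power, cooling, security, etc.
For the cloud: the cost of migrating into and out of the cloud, etc.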
Binary Tree

2017-02-28T22:12:00.000Z

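This post summarizes the common knowledge about trees; for now it covers only binary trees.

Binary Tree

Why Tree?

Because a tree combines the advantages of other data structures:

Sorted array: lookup with binary search is fast.
Linked list: insertion and deletion are fast, with no need to shift values.

Basic concepts:

Root: the top of the tree.
Parent node
Child node
Leaf node: a node with no children.
Level: how many generations a node is from the root.

Balanced and unbalanced trees

Balanced tree: the heights of the left and right subtrees differ by at most 1, and both subtrees are themselves balanced.

Full Tree and Complete Tree:

Full Tree: every node has either 0 or 2 children.
Complete Tree: every level except possibly the last is completely filled, and the nodes of the last level are as far left as possible.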
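The balanced and full definitions translate directly into recursive checks. The sketch below uses its own minimal Node type and invented method names, so it stands apart from the BST code later in this post.

```java
class TreeShapeChecks {
    /** Minimal node type used only for these checks. */
    static class Node {
        final int key;
        Node left;
        Node right;

        Node(int key) {
            this.key = key;
        }
    }

    /** Balanced: at every node the subtree heights differ by at most 1. */
    static boolean isBalanced(Node root) {
        return heightOrMinusOne(root) >= 0;
    }

    /** Returns the height of the subtree, or -1 as soon as an imbalance is found. */
    private static int heightOrMinusOne(Node node) {
        if (node == null) {
            return 0;
        }
        int left = heightOrMinusOne(node.left);
        if (left < 0) {
            return -1;
        }
        int right = heightOrMinusOne(node.right);
        if (right < 0 || Math.abs(left - right) > 1) {
            return -1;
        }
        return 1 + Math.max(left, right);
    }

    /** Full: every node has either zero or two children. */
    static boolean isFull(Node node) {
        if (node == null) {
            return true;
        }
        if ((node.left == null) != (node.right == null)) {
            return false; // exactly one child
        }
        return isFull(node.left) && isFull(node.right);
    }

    public static void main(String[] args) {
        Node root = new Node(2);
        root.left = new Node(1);
        root.right = new Node(3);
        System.out.println("balanced=" + isBalanced(root) + " full=" + isFull(root));
    }
}
```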

Traversal order:

Binary Tree implementation

Binary Tree Interface

public interface BSTInterface {
    /**
     * Searches for the specified key in the tree.
     * @param key key of the element to search
     * @return boolean value indication of success or failure
     */
    boolean find(int key);
    /**
     * Inserts a new element into the tree.
     * @param key key of the element
     * @param value value of the element
     */
    void insert(int key, double value);
    /**
     * Deletes an element from the tree using the specified key.
     * @param key key of the element to delete
     */
    void delete(int key);
    /**
     * Traverses and prints values of the tree in ascending order based on key.
     */
    void traverse();
}
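The methods that follow refer to a Node type and a root field that are not shown in this post. A plausible minimal Node, written here as an assumption rather than the original class, could look like this; the toString format is invented so that traverse() prints something readable.

```java
class Node {
    int key;        // search key; smaller keys go left, larger keys go right
    double value;   // payload stored with the key
    Node left;      // left child, or null
    Node right;     // right child, or null

    Node(int key, double value) {
        this.key = key;
        this.value = value;
    }

    @Override
    public String toString() {
        return "[" + key + ": " + value + "] ";
    }
}
```

The tree class itself would then hold a single field such as `private Node root;` alongside the methods below.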

Binary Tree operations

Find:

public boolean find(int key) {
    // tree is empty
    if (root == null) {
        return false;
    }
    Node curr = root;
    // while not found
    while (curr.key != key) {
        if (curr.key < key) {
            // go right
            curr = curr.right;
        } else {
            // go left
            curr = curr.left;
        }
        // not found
        if (curr == null) {
            return false;
        }
    }
    return true; // found
}

Insert

public void insert(int key, double value) {
    Node newNode = new Node(key, value);
    // empty tree
    if (root == null) {
        root = newNode;
        return;
    }
    Node parent = root; // keep track of parent
    Node curr = root;
    while (true) {
        // no duplicate keys allowed
        // simply keep the existing one here
        if (curr.key == key) {
            return;
        }
        parent = curr; // update parent
        if (curr.key < key) {
            // go right
            curr = curr.right;
            if (curr == null) {
                // found a spot
                parent.right = newNode;
                return;
            }
        } else {
            // go left
            curr = curr.left;
            if (curr == null) {
                // found a spot
                parent.left = newNode;
                return;
            }
        } // end of if-else to go right or left
    } // end of while
} // end of insert method

Delete

public void delete(int key) {
    // empty tree
    if (root == null) {
        return;
    }
    Node parent = root;
    Node curr = root;
    /*
     * flag to check left child
     *
     * need this flag because actual deletion process happens after the
     * while loop that is to find the key to delete
     */
    boolean isLeftChild = true;
    while (curr.key != key) {
        parent = curr; // update parent first
        if (curr.key < key) { // go right
            isLeftChild = false;
            curr = curr.right;
        } else { // go left
            isLeftChild = true;
            curr = curr.left;
        }
        // case 1: not found
        if (curr == null) {
            return;
        }
    }
    if (curr.left == null && curr.right == null) {
        // case 2: leaf
        if (curr == root) {
            root = null;
        } else if (isLeftChild) {
            parent.left = null;
        } else {
            parent.right = null;
        }
    } else if (curr.right == null) {
        // case 3: no right child
        if (curr == root) {
            root = curr.left;
        } else if (isLeftChild) {
            parent.left = curr.left;
        } else {
            parent.right = curr.left;
        }
    } else if (curr.left == null) {
        // case 3: no left child
        if (curr == root) {
            root = curr.right;
        } else if (isLeftChild) {
            parent.left = curr.right;
        } else {
            parent.right = curr.right;
        }
    } else {
        // case 4: with two children
        // here we use successor but using predecessor is also an option
        Node successor = getSuccessor(curr);
        if(curr == root) {
            root = successor;
        } else if(isLeftChild) {
            parent.left = successor;
        } else {
            parent.right = successor;
        }
        successor.left = curr.left;
    }
}

Finding the successor node

/**
 * Helper method to find the successor of the toDelete node.
 * This tries to find the smallest value of the right subtree
 * of the toDelete node by going down to the left most node in the subtree
 * @param toDelete node to delete
 * @return the successor of the toDelete node
 */
private Node getSuccessor(Node toDelete) {
    Node successorParent = toDelete;
    Node successor = toDelete;
    // start the search from the root of the right subtree
    Node curr = toDelete.right;
    // move down to left as far as possible in the right subtree
    // successor's left child must be null
    while (curr != null) {
        successorParent = successor;
        successor = curr;
        curr = curr.left;
    }
    /*
     * If successor is NOT the right child of the node to delete, then
     * need to take care of two connections in the right subtree
     */
    if (successor != toDelete.right) {
        successorParent.left = successor.right;
        successor.right = toDelete.right;
    }
    return successor;
}

Traverse Binary Tree:

public void traverse() {
    inOrderHelper(root);
    System.out.println();
}
private void inOrderHelper(Node toVisit) {
    if(toVisit != null) {
        inOrderHelper(toVisit.left);
        System.out.print(toVisit);
        inOrderHelper(toVisit.right);
    }
}

Reference: @Terry Lee

Cloud Computing Project: Twitter Big Data Analytics

2017-02-28T04:37:26.000Z

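This post introduces and reviews the Twitter Analytics on the Cloud project. The team assignment was done in a hurry at the time; looking back, there is a lot that could be optimized. Thanks to my teammates @shuangshuang and @烟酱.

Project overview

Goals:

Build a high-performance, reliable web service in the cloud.
Design, develop, deploy, and optimize servers that can handle a high load of tens of thousands of requests per second.
Run ETL on a 1 TB data set and load it into MySQL and HBase.
Design the MySQL and HBase schemas and tune their configuration to improve performance.
Explore ways to find the bottlenecks of a cloud-based web service and improve its performance.

Basic structure:

Front end:

The web service is accessed through HTTP GET requests; different queries use different paths followed by different parameters.
It must return correct responses and keep running normally through tests that last several hours.
The web service must not reject requests and has to withstand high load.

Back end:

Stores the data files to be queried
Compares SQL (MySQL) and NoSQL (HBase)
Compares performance across data sets and query types to decide how to implement the back end.

Data set:

A Twitter data set, larger than 1 TB, stored as JSON.

Hands-on

Building the front end:

The framework has to be chosen carefully before building the front end. After comparing mainstream web frameworks with the TechEmpower benchmarks as a reference, we chose Vert.x and Undertow. Some good setup guides:

Vertx:

vertx Document My first Vert.x 3 Application

Front-end optimization:

Use a cache and check it first on every request. When the cache is full, evict the least used entry.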
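One common way to write such a cache by hand in Java is to build on LinkedHashMap in access order. The sketch below evicts the least recently used entry, which is one reading of "least used" and an assumption on my part; the project's actual cache is not shown in this post. It is not thread-safe, so a server would need to wrap it, for example with Collections.synchronizedMap.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true);   // true = iterate in access order, oldest access first
        this.capacity = capacity;
    }

    /** Called by LinkedHashMap after each insertion; returning true drops the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("q1", "response 1");
        cache.put("q2", "response 2");
        cache.get("q1");                  // q1 is now the most recently used
        cache.put("q3", "response 3");    // evicts q2
        System.out.println(cache.keySet());
    }
}
```

A request handler would call get first and fall back to the database only on a miss, then put the result into the cache.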

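ETL:

Once the database schema has been designed around the requests, the ETL needs careful design. We used EMR to load the Twitter data set into the data store, which takes 10 to 20 hours per run, and EMR is expensive, so avoid repeating the work. Start by testing with a small data set.

In this phase two types of request have to be handled by fetching data from the storage systems. The web service must be able to connect to two different storage back ends (MySQL and HBase), and the front end must accept HTTP GET requests on port 80.

Procedure:

The main work is writing one Map and one Reduce program to process the data. The raw data is JSON, and it has to be turned into the format we need:

Request format (userid + hashtag): GET /q2?userid=uid&hashtag=hashtag

Response format (if tweets exist)

The tweet's sentiment density
The tweet's posting time
The tweet id
The censored tweet text; many things can go wrong here, such as emoji, backslashes, and characters from other languages

TEAMID,TEAM_AWS_ACCOUNT_ID\n Sentiment_density1:Tweet_time1:Tweet_id1:Censored_text1\n Sentiment_density2:Tweet_time2:Tweet_id2:Censored_text2\n Sentiment_density3:Tweet_time3:Tweet_id3:Censored_text3\n

Response format (if no tweet exists): TEAMID,TEAM_AWS_ACCOUNT_ID\n \n

After the map and reduce programs are written and run on EMR, keep in mind:

Test with a small data set first.
Pay attention to all the small details.
As for operating EMR, I will write up the steps from the earlier EMR project when I have time.

Query: text cleaning and analysis

Target throughput: 10,000 rps. Off-the-shelf caching systems are not allowed, but you may write your own cache. The query asks for the tweets a given user posted with a given hashtag; it mainly tests how to design an efficient back end that handles a large number of requests.

Back-end database

After ETL the data has to be loaded into the database. Here we went back and forth between replication and sharding. Replication stores a complete copy of the database on every machine, while sharding splits it into parts stored on different machines. In the end we chose sharding.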
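With sharding, the front end has to decide which machine holds a given user's rows. A simple rule, shown here as an assumption since the post does not say how the team routed requests, is to take the user id modulo the number of shards. ShardRouter is an invented name.

```java
class ShardRouter {
    /** Maps a user id to a shard index in [0, shardCount). */
    static int shardFor(long userId, int shardCount) {
        // floorMod keeps the result non-negative even for negative ids.
        return (int) Math.floorMod(userId, (long) shardCount);
    }

    public static void main(String[] args) {
        int shards = 3;
        for (long userId : new long[] {10L, 11L, 12L}) {
            System.out.println("user " + userId + " -> shard " + shardFor(userId, shards));
        }
    }
}
```

With replication the routing question disappears, because any machine can answer any query, at the price of storing the full data set everywhere.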

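Database design:

Following the request and response formats above, we designed MySQL and HBase as follows:

MySQL:

Schema design:

(This follows the winning design from Yuki's team, which is very simple and blunt.) The original schema had a clean column for every field, but it produced far more rows than the later design, which made reads much slower. The new schema stores only the id as the key, reads all the tweets for it at once, and lets the front end parse them.

Optimizations:

Build indexes.
MySQL has two main storage engines, MyISAM and InnoDB. MyISAM suits read-heavy workloads but is unfriendly to writes, since an update locks the whole table, whereas InnoDB uses row-level locks. Raising key_buffer_size and query_cache_size increases the buffer capacity.
Declare every column NOT NULL, so that MySQL does not need to reserve space for and check NULL values, which speeds up reads.

HBase:

Since HBase is a key-value store, we only need to decide what goes into the key and put all remaining data into the column family. We used tweet_id + user_id + hashtag as the rowkey.

Optimizations (from the 小土刀 blog):

1. Give the RegionServer an appropriate amount of memory: for example, append export HBASE_REGIONSERVER_OPTS="-Xmx16000m $HBASE_REGIONSERVER_OPTS" to the end of hbase-env.sh in HBase's conf directory, where 16000m is the memory allocated to the RegionServer.

2. Number of RegionServer request-handling IO threads: fewer IO threads suit Big Put scenarios where a single request consumes a lot of memory (a large single Put, or a Scan with a large cache, both count as Big Put) or where RegionServer memory is tight. More IO threads suit scenarios where each request uses little memory and a very high TPS (transactions per second) is required. Use memory monitoring as the main guide when setting this value. In hbase-site.xml the setting is hbase.regionserver.handler.count, here set to 200.

3. Tune the block cache: hfile.block.cache.size limits the memory of the RegionServer's block cache and defaults to 0.25. For read-heavy workloads it can be raised; the exact value depends on the workload of the HBase cluster and should be weighed against the memstore's share of memory.

Summary:

The team project was a while ago and I have forgotten many details. Data cleaning in particular involves many details and is not as simple as the sentence or two here suggests. Database tuning is also a road of no return: blind tuning can make things worse, and judging from the winners' report, tuning did not help much; a good schema design is what fundamentally improves performance. The essence of the cloud computing course is in this project, which covers most of what the course's labs taught, from load balancing to sharding and replication, SQL and NoSQL databases, and the use of EMR; only parallelism and concurrency are missing. Learning is not hard, and doing a project with guidance is not hard either; in real applications nobody knows the right answer, and everything comes down to thinking and experience.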

References：

小土刀云计算语料分析&反思课
小Yuki的Report

Cloud Computing Project: A Social Network Timeline on Multiple Back Ends

2017-02-23T23:12:14.000Z

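This post introduces and reviews the CMU 15619 Cloud Computing project Social Networking Timeline with Heterogeneous Backends.

Main goals of the project:

Explore how to provision, configure, and manage AWS DBaaS offerings.
Compare loading data into MySQL, HBase, and MongoDB through their Java APIs.
Serve a single complex web application from multiple back ends.
Compare how the different databases behave in practice.

Background

DBaaS (Database-as-a-Service):

On AWS we can use the MySQL service of RDS.

MongoDB:

MongoDB is a typical NoSQL database. It is document-oriented and, at least at the time of writing, does not support transactions or table joins in the relational sense, so queries are relatively easy to write, understand, and optimize. I plan to write a summary of NoSQL later (a promise to myself). Unlike HBase's key-value model, MongoDB's document model supports complex data types and also supports indexes. MongoDB stores data as BSON, a binary encoding of JSON-like documents. BSON is designed with three goals:

Lightweight: BSON aims to keep space overhead small, although a BSON document is not always smaller than the equivalent JSON.
Traversable: BSON is designed to be traversed easily.
Performance: BSON can be encoded and decoded quickly in most programming languages.

Data structure: graphs

1. Adjacency matrix: space complexity O(n^2)

For example:

2. Adjacency list, which uses less space: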
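An adjacency list for a follow graph can be sketched with a map from each user to the list of that user's followers. FollowGraph and its methods are illustrative names; user ids are plain ints here for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FollowGraph {
    // followee -> followers; only edges that exist take up space.
    private final Map<Integer, List<Integer>> followers = new HashMap<>();

    /** Records that follower follows followee. */
    void addFollow(int follower, int followee) {
        followers.computeIfAbsent(followee, k -> new ArrayList<>()).add(follower);
    }

    /** Returns the followers of a user, or an empty list if there are none. */
    List<Integer> followersOf(int user) {
        return followers.getOrDefault(user, Collections.emptyList());
    }

    public static void main(String[] args) {
        FollowGraph graph = new FollowGraph();
        graph.addFollow(2, 1);
        graph.addFollow(3, 1);
        System.out.println(graph.followersOf(1));
    }
}
```

Space grows with the number of follow relationships rather than with the square of the number of users, which is why it suits sparse social graphs better than a matrix.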

社交网络应用基础：

如今像Facebook, Twitter和Instagram都需要复杂和涉及良好的后端来处理多种类型的用户数据，提供持续的高性能低延迟的服务。同时还要通过实时数据分析为公司和广告商提供有价值的信息。

不同的数据类型（Video，Text，Link，etc.)需要存在不同的数据库中）
一个简单的展示社交网络页面的HTTP请求会触发后端一系列的请求和数据库动作。可以参见下图：

社交网络中的数据通常包括以下三种：

用户信息：
- 身份验证系统
- 用户信息/简介
- 活动日志
- 社交关系图（在下面会进步介绍）
用户活动：
- 用户产生的多媒体数据
大数据分析系统：
- 搜索系统
- 推荐系统
- 用户行为分析（基于云数据仓库的OLAP，有机会单独更新这个部分）

社交网络的前端已经做好，我们需要把四中不同的数据集存入三种数据库（MySQL, HBase, MongoDB),你完成的后端要能同时响应四中不同的request。

项目操作

通过RDS的MySQL实现基本登录：

在AWS RDS中配置MySQL并导入users.csv, userinfo.csv数据集。连接AWS RDS中MySQL时注意：远程登录需要导入数据时要加入 --local-infile得到授权。 mysql -u username -p password -h hostname --port=portname --local-infile database

数据集格式:

users.csv [UserID, Password]
userinfo.csv [UserID, Name, Profile Image URL]

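MySQL import statement: LOAD DATA LOCAL INFILE 'filename' INTO TABLE tablename CHARACTER SET utf8mb4 FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

Request format: GET /task1?id=[UserID]&pwd=[Password]

Response format: returnRes({"name":"my_name", "profile":"profile_image_url"})

After that, connect to the database from the Java code and build the corresponding JSON response. Testing:

Start the front-end and back-end servers and visit http://:3000
Log in with correct and incorrect credentials to test.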

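Storing the social graph in HBase:

HBase stores the follow relationships between users. Choose either the adjacency matrix or the adjacency list introduced in the graph section above to hold the data. Raw data format:

Request format: GET /task2?id=[UserID]

Response format: {"followers":[{"name":"follower_name_1", "profile":"profile_image_url_1"}, {"name":"follower_name_2", "profile":"profile_image_url_2"}, ...]}

Approach:

Store rows in HBase in the form followee: follower1, follower2, ...
Load the data once the HBase schema is designed.
Start the front-end and back-end servers, then visit http://:3000
Enter a userid to test.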

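Building the home page with MongoDB:

As introduced earlier, MongoDB is a good choice for storing posts that come in many forms. The queries here target a few specific fields, so indexes can be created to speed them up.

Format of a post document: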

{
    "pid":xxx,                                      // PostID
    "uid":xxx,                                      // UserID of poster
    "name":"xxx",                                   // User name of poster
    "profile":"xxx",                                // Poster profile image URL
    "timestamp":"YYYY-MM-DD HH:MM:SS",              // When post is posted
    "image":"xxx",                                  // Post image
    "content":"xxx",                                // Post text content
    "comments":[                                    // comments json array
        {
            "uid":xxx,                              // UserID of commenter
            "name":"xxx",                           // User name of commenter
            "profile":"xxx",                        // Commenter profile image URL
            "timestamp":"YYYY-MM-DD HH:MM:SS",      // When comment is made
            "content":"xxx"                         // Comment text content
        },
        {
            "uid":xxx,
            .......
        },
        ......
    ]
}

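For how to create indexes in MongoDB, see the linked reference.

Request format: GET /task3?id=[UserID]

Response format: {"posts":[{post1_json}, {post2_json}, ...]}

Testing:

Start the front-end and back-end servers and enter a userid.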

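Final integration

The previous three parts each stored data in one database. Now a single userid should return the user's information (MySQL), the user's follower list (HBase), and the 30 most recent posts from the people the user follows (MongoDB).

Sorting rules:

Sort the followers:
- by name, ascending
- by profile image URL, ascending
Sort the 30 most recent posts:
- by timestamp, ascending
- by PostID, ascending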
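The sorting rules can be sketched with chained comparators. This is only an illustration under two assumptions: the second rule in each pair is read as a tie-breaker for the first, and Follower, Post, and latest are hypothetical names rather than anything from the project code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TimelineSorting {
    // Hypothetical value types; the field names mirror the JSON keys used in this post.
    static final class Follower {
        final String name;
        final String profile;
        Follower(String name, String profile) { this.name = name; this.profile = profile; }
    }

    static final class Post {
        final long pid;
        final String timestamp; // "YYYY-MM-DD HH:MM:SS"
        Post(long pid, String timestamp) { this.pid = pid; this.timestamp = timestamp; }
    }

    // Name ascending, then profile image URL ascending as the tie-breaker (assumed reading).
    static final Comparator<Follower> FOLLOWER_ORDER =
            Comparator.comparing((Follower f) -> f.name).thenComparing(f -> f.profile);

    // Fixed-width timestamps sort chronologically when compared as plain strings.
    static final Comparator<Post> POST_ORDER =
            Comparator.comparing((Post p) -> p.timestamp).thenComparingLong(p -> p.pid);

    // Returns the newest `limit` posts, ordered oldest to newest.
    static List<Post> latest(List<Post> posts, int limit) {
        List<Post> sorted = new ArrayList<>(posts);
        sorted.sort(POST_ORDER);
        int from = Math.max(0, sorted.size() - limit);
        return new ArrayList<>(sorted.subList(from, sorted.size()));
    }

    public static void main(String[] args) {
        List<Post> posts = new ArrayList<>();
        posts.add(new Post(3, "2017-02-01 10:00:00"));
        posts.add(new Post(1, "2017-02-01 10:00:00"));
        posts.add(new Post(2, "2017-01-31 09:30:00"));
        for (Post p : latest(posts, 2)) {
            System.out.println(p.pid + " " + p.timestamp);
        }
    }
}
```

In the real backend the 30-post limit would more likely be pushed into the MongoDB query; sorting in memory just keeps the sketch free of database code.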

Request format: GET /task4?id=[UserID]

Response format: {"name":"my_name", "profile":"my_profile_image_url", "followers":[{"name":"follower_name_1", "profile":"profile_image_url_1"}, {"name":"follower_name_2", "profile":"profile_image_url_2"}, ...], "posts":[{post1_json}, {post2_json}, ...]}

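A simple recommender

Recommender systems are a huge topic; see shaung's blog (a shameless plug). Here we use a collaborative filtering approach to build a simple recommender that suggests people to follow based on "friends of friends".

Graph Distance:

For example: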

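A follows {B, C, D}
Followee B follows {C, E, A}
Followee C follows {F, G}
Followee D follows {G, H}

Counting how often each user appears among these second-level followees gives {A:1, C:1, E:1, F:1, G:2, H:1}. After removing A itself and C, whom A already follows, what remains is {G: 2, E: 1, F: 1, H: 1}.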

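Approach:

Find the set of users that userid follows.
For each user in that set, add the users they follow to a new collection: the first occurrence gets a score of 1, and every later occurrence adds 1.
Store the candidates in a priority queue; nobody from the original followee set should appear in this queue.
Return the name and profile URL of the top ten.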
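The steps above can be sketched as follows. This is an in-memory illustration, not the project's implementation: the follows map stands in for the HBase lookups, the class and method names are hypothetical, and breaking score ties by name is an assumption made so the output is deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

class FriendRecommender {
    // follows.get(u) is the set of users that u follows (hypothetical stand-in for the database).
    static List<String> recommend(Map<String, Set<String>> follows, String user, int limit) {
        Set<String> direct = follows.getOrDefault(user, Collections.<String>emptySet());
        Map<String, Integer> score = new HashMap<>();
        for (String followee : direct) {
            for (String candidate : follows.getOrDefault(followee, Collections.<String>emptySet())) {
                // Skip the user and anyone the user already follows.
                if (candidate.equals(user) || direct.contains(candidate)) {
                    continue;
                }
                score.merge(candidate, 1, Integer::sum); // first sighting = 1, then +1 each time
            }
        }
        // Highest score first; ties broken by name (an assumption).
        Comparator<Map.Entry<String, Integer>> order = (a, b) -> {
            int byScore = Integer.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        };
        PriorityQueue<Map.Entry<String, Integer>> queue = new PriorityQueue<>(order);
        queue.addAll(score.entrySet());
        List<String> result = new ArrayList<>();
        while (!queue.isEmpty() && result.size() < limit) {
            result.add(queue.poll().getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> follows = new HashMap<>();
        follows.put("A", new HashSet<>(Arrays.asList("B", "C", "D")));
        follows.put("B", new HashSet<>(Arrays.asList("C", "E", "A")));
        follows.put("C", new HashSet<>(Arrays.asList("F", "G")));
        follows.put("D", new HashSet<>(Arrays.asList("G", "H")));
        System.out.println(recommend(follows, "A", 10)); // [G, E, F, H]
    }
}
```

Run on the example graph, G comes first with a score of 2, followed by E, F, and H with a score of 1 each.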

Request format: http://backend-public-dns:8080/MiniSite/task5?id=

Response format: returnRes({"recommendation":[{name:, profile:},{name:, profile:},...,{name:, profile:}]})

Done!

Reference:
CMU 15619 course material: Social Networking Timeline with Heterogeneous Backends
小土刀's blog: http://wdxtub.com/vault/cc-17.html

Java Interview Questions

2017-02-11T17:51:39.000Z

Always update...

Is the JVM platform dependent?

Is the JVM (Java Virtual Machine) platform dependent or platform independent? What is the advantage of using the JVM, and having Java be a translated language?

JVM translates bytecode into machine language: Every Java program is first compiled into an intermediate language called Java bytecode. The JVM is used primarily for 2 things: the first is to translate the bytecode into the machine language for a particular computer, and the second is to execute the corresponding machine-language instructions. The JVM and bytecode combined give Java its status as a "portable" language, because Java bytecode can be transferred from one machine to another.

Machine language is platform dependent: Since the JVM must translate the bytecode into machine language, and since the machine language depends on the CPU architecture and operating system being used, it is clear that the JVM is platform dependent. In other words, the JVM is not platform independent.

The JVM is not platform independent: The key here is that the JVM depends on the operating system, so if you are running Mac OS X you will have a different JVM than if you are running Windows or some other operating system.

Overloading & Overriding

In Java, what’s the difference between method overloading and method overriding?

Overloading: Method overloading in Java occurs when two or more methods in the same class have the exact same name but different parameters (remember that method parameters accept values passed into the method). Method overloading is a compile-time phenomenon. Changes that count as overloading:

1. The number of parameters is different for the methods.
2. The parameter types are different (like changing a parameter that was a float to an int).

Changes that do not count as overloading:

1. Just changing the return type of the method. (Compiler Error)
2. Changing just the name of the method parameters, but not changing the parameter types.

Overriding: [I can never remember this one, so my trick is to picture a child riding on dad's shoulders: the outline stays the same, meaning the method name, parameters, and return type do not change, but the content does. (:зゝ∠)] Overriding means that a method inherited from a parent class will be changed. When overriding a method, everything remains exactly the same except the method body: what the method does is changed to fit the needs of the child class. The method name, the number and types of parameters, and the return type all remain the same (since Java 5 the return type may also be a subtype of the original one). Method overriding is a run-time phenomenon that is the driving force behind polymorphism.
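A small sketch may help keep the two apart. The Animal and Dog classes are invented for illustration: speak(int) overloads speak(), while Dog.speak() overrides it.

```java
class OverloadOverrideDemo {
    public static void main(String[] args) {
        Animal pet = new Dog();
        System.out.println(pet.speak());  // Woof
        System.out.println(pet.speak(2)); // WoofWoof
    }
}

class Animal {
    String speak() {
        return "...";
    }

    // Overload: same name, different parameter list.
    // The compiler picks this method from the argument types.
    String speak(int times) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < times; i++) {
            sb.append(speak()); // dispatched at run time to the overriding version, if any
        }
        return sb.toString();
    }
}

class Dog extends Animal {
    // Override: same signature as Animal.speak(), new body.
    @Override
    String speak() {
        return "Woof";
    }
}
```

Even though pet is declared as Animal, the call inside speak(int) reaches Dog.speak(), which is the run-time dispatch that overriding relies on.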

Private Constructor

What’s the point of having a private constructor?

Defining a constructor with the private modifier says that only the native class (as in the class in which the private constructor is defined) is allowed to create an instance of the class, and no other caller is permitted to do so.

There are two possible reasons why one would want to use a private constructor – the first is that you don’t want any objects of your class to be created at all, and the second is that you only want objects to be created internally – as in only created in your class.

A singleton is a design pattern that allows only one instance of your class to be created, and this can be accomplished by using a private constructor.
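As a minimal sketch of the singleton case, assuming a made-up IdRegistry class: the private constructor blocks outside callers, and the class hands out its single instance through a static accessor.

```java
final class IdRegistry {
    // The only instance, created once when the class is loaded.
    private static final IdRegistry INSTANCE = new IdRegistry();

    private int counter;

    // Private: no code outside this class can call `new IdRegistry()`.
    private IdRegistry() {
    }

    static IdRegistry getInstance() {
        return INSTANCE;
    }

    // synchronized so concurrent callers never receive the same id twice.
    synchronized int nextId() {
        return ++counter;
    }

    public static void main(String[] args) {
        IdRegistry first = IdRegistry.getInstance();
        IdRegistry second = IdRegistry.getInstance();
        System.out.println(first == second); // true
    }
}
```

Eager initialization in a static final field is only one of several ways to write a singleton; it is used here because it is short and safe under concurrent class loading.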

An object and a class

In Java, what’s the difference between an object and a class?

In short: an object is an instance of a class. Objects have a lifespan but classes do not.

Java platform

What is the main difference between Java platform and other platforms?

The Java platform differs from most other platforms in the sense that it's a software-based platform that runs on top of other hardware-based platforms. It has two components:

Runtime Environment
API(Application Programming Interface)

write once and run anywhere

What gives Java its 'write once and run anywhere' nature?

The bytecode. Java source is compiled to bytecode, an intermediate language between source code and machine code. Bytecode is not platform specific and hence can run on any platform that has a JVM.

Is an empty .java file name a valid source file name?

Yes. Save your Java file as .java only, compile it with javac .java, and run it with java yourclassname. A simple example:

//save by .java only  
class A{  
public static void main(String args[]){  
System.out.println("Hello java");  
}  
}  
//compile by javac .java  
//run by     java A

If I don't provide any arguments on the command line, will the String array of the main method be empty or null?

It is empty, not null.

What if I write static public void instead of public static void?

Program compiles and runs properly.

What is the default value of the local variables?

The local variables are not initialized to any default value, neither primitives nor object references.

What is the difference between an object-oriented programming language and an object-based programming language?

Object-based programming languages support all the features of OOP except inheritance. Commonly cited examples are JavaScript and VBScript (JavaScript does offer prototype-based inheritance, so the label is debatable).

What will be the initial value of an object reference which is defined as an instance variable?

Object references declared as instance variables are initialized to null in Java.

Constructor in Java

A constructor in Java is a special type of method that is used to initialize an object.

A Java constructor is invoked at the time of object creation. It constructs the values, i.e. provides data for the object, which is why it is known as a constructor.

Rules for creating a Java constructor

There are basically two rules defined for the constructor.

Constructor name must be the same as its class name
Constructor must have no explicit return type

Types of Java constructors

There are two types of constructors:

Default constructor (no-arg constructor)
Parameterized constructor
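The two kinds can be shown side by side in a short sketch. GridPoint is a hypothetical class; note that strictly speaking "default constructor" refers to the one the compiler supplies when a class declares none, so the no-arg constructor below is written explicitly.

```java
class GridPoint {
    final int x;
    final int y;

    // No-arg constructor: same name as the class, no return type.
    GridPoint() {
        this(0, 0); // delegate to the parameterized constructor
    }

    // Parameterized constructor: the caller supplies the initial values.
    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public static void main(String[] args) {
        GridPoint origin = new GridPoint();
        GridPoint p = new GridPoint(3, 4);
        System.out.println(origin.x + "," + origin.y); // 0,0
        System.out.println(p.x + "," + p.y);           // 3,4
    }
}
```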

References:
http://www.programmerinterview.com/
https://www.javatpoint.com/constructor

Leetcode 271. Encode and Decode Strings

2017-02-11T07:52:01.000Z

Question:

Design an algorithm to encode a list of strings to a string. The encoded string is then sent over the network and is decoded back to the original list of strings.

Machine 1 (sender) has the function:

string encode(vector<string> strs) {
  // ... your code
  return encoded_string;
}

Machine 2 (receiver) has the function:

vector<string> decode(string s) {
  //... your code
  return strs;
}

So Machine 1 does:

string encoded_string = encode(strs);

and Machine 2 does:

vector<string> strs2 = decode(encoded_string);

strs2 in Machine 2 should be the same as strs in Machine 1.

Implement the encode and decode methods.

Note:

The string may contain any possible characters out of 256 valid ascii characters. Your algorithm should be generalized enough to work on any possible characters.
Do not use class member/global/static variables to store states. Your encode and decode algorithms should be stateless.
Do not rely on any library method such as eval or serialize methods. You should implement your own encode/decode algorithm.

Explanation:

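At first glance it is not obvious what the problem is asking. The task is: given a list of strings, join them into one string (encode), then split that string back into the original list (decode). So the real question is how to delimit the pieces sensibly so that they can still be told apart afterwards. This is hardly an algorithm problem; it tests serialization, a basic concept in computer systems.

Serialization is a computer science concept: it means storing an object to a medium (such as a file or a memory buffer) or sending it over the network in binary form. Deserialization later rebuilds, from that sequence of bytes, an object with the same state as the original. In certain cases this amounts to obtaining a copy, though not in all cases.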

Code:

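import java.util.ArrayList;
import java.util.List;

public class Codec {
    // Encodes a list of strings to a single string.
    // Each string is written as <length>/<content>, so the content itself
    // may contain any character, including '/'.
    public String encode(List<String> strs) {
        StringBuilder sb = new StringBuilder();
        for (String s : strs) {
            sb.append(s.length()).append('/').append(s);
        }
        return sb.toString();
    }

    // Decodes a single string to a list of strings.
    public List<String> decode(String s) {
        List<String> ret = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int slash = s.indexOf('/', i);                      // end of the length prefix
            int size = Integer.parseInt(s.substring(i, slash)); // length of the next string
            ret.add(s.substring(slash + 1, slash + size + 1));
            i = slash + size + 1;                               // jump to the next prefix
        }
        return ret;
    }
}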

Leetcode 346. Moving Average from Data Stream

2017-02-11T05:17:27.000Z

Question

Given a stream of integers and a window size, calculate the moving average of all integers in the sliding window.

For example,

MovingAverage m = new MovingAverage(3);
m.next(1) = 1
m.next(10) = (1 + 10) / 2
m.next(3) = (1 + 10 + 3) / 3
m.next(5) = (10 + 3 + 5) / 3

Explanation:

A very simple problem. The values passed to next can be kept in a queue, an ArrayList, or an array. Keep a running sum and compute the average on each call.

Code:

import java.util.LinkedList;
import java.util.Queue;

public class MovingAverage {
    // Values currently inside the window, oldest first.
    private final Queue<Integer> q = new LinkedList<>();
    private final int size;
    private int total = 0; // running sum of the values in the window

    /** Initialize your data structure here. */
    public MovingAverage(int size) {
        this.size = size;
    }

    public double next(int val) {
        if (q.size() == size) {
            total -= q.poll(); // window is full: drop the oldest value
        }
        q.offer(val);
        total += val;
        return (double) total / q.size();
    }
}

Distributed System-Indirect Communication and Naming

2017-02-06T20:49:33.000Z

Summary notes for chapters 6 and 9 of CMU 95-702 Distributed Systems

Indirect Messaging: Indirect communication is defined as communication between entities in a distributed system through an intermediary with no direct coupling between the sender and the receiver(s).
- Decoupled in space:
  - Sender does not need to know the identity of the receiver(s) and vice versa
  - Good for handling legacy systems
- Decoupled in time:
  - A component need not even be running
  - The messaging system can store messages until they are successfully delivered
  - Reliable delivery is ensured
Two messaging modes:
- Point-to-point:
  - Inventory to Factory
  - Inventory to Sales
  - Factory to Accounting
- publish / subscribe :
  - parts to parts inventory and parts order
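The difference between the two modes can be illustrated with a tiny in-memory sketch. This is not JMS and uses no messaging middleware; PointToPointQueue and Topic are hypothetical classes that only mimic the delivery semantics.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class MessagingModes {
    // Point-to-point: each message is consumed by exactly one receiver.
    static final class PointToPointQueue {
        private final Deque<String> messages = new ArrayDeque<>();

        void send(String message) {
            messages.addLast(message);
        }

        // Removes and returns the oldest message, or null when the queue is empty.
        String receive() {
            return messages.pollFirst();
        }
    }

    // Publish/subscribe: every subscriber gets its own copy of each message.
    static final class Topic {
        private final List<List<String>> inboxes = new ArrayList<>();

        List<String> subscribe() {
            List<String> inbox = new ArrayList<>();
            inboxes.add(inbox);
            return inbox;
        }

        void publish(String message) {
            for (List<String> inbox : inboxes) {
                inbox.add(message);
            }
        }
    }

    public static void main(String[] args) {
        PointToPointQueue orders = new PointToPointQueue();
        orders.send("order-1");
        System.out.println(orders.receive()); // order-1
        System.out.println(orders.receive()); // null: already consumed

        Topic parts = new Topic();
        List<String> inventory = parts.subscribe();
        List<String> partsOrder = parts.subscribe();
        parts.publish("part-42");
        System.out.println(inventory + " " + partsOrder); // [part-42] [part-42]
    }
}
```

The stored messages also hint at time decoupling: the sender can finish before any receiver asks for the message.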
Some example scenarios:
- asynchronous communication: chat and Twitter-type apps; reporting info to one or more interested systems
- event-driven problems
- decoupled / multiple consumers
- multiple interested parties
Indirect messaging protocols:
- two open standards: XMPP and AMQP
Java's JMS API:
- An API for performing indirect messaging.
- It is an abstraction API like JNDI and JDBC.
- Interacts with some Message Oriented Middleware (MOM)
- JMS is a client-facing interface, meant to abstract away the particulars of any MOM.
- In theory, you should have portability of systems written with JMS such that they can work with any MOM.
- API is javax.jms
JMS Queues and topics
JMS message types:
- Stream
  - Sequential stream of Java primitive data types.
- Map
  - Set of name-value pairs
  - Names are String objects
  - Values are Java primitives (including String)
- Text
  - Message is a String object
  - Plain-text message
  - XML messages
- Object
  - Serialized Java object
  - Simple if in Java-only environment
- Bytes
  - Stream of un-interpreted bytes.
  - To encode a message body to match an existing message format
Message driven beans: components that are executed asynchronously by messages becoming available in a Queue or Topic.

Line Question on class:

Abstract indirect messaging API: JMS
Point-to-point messaging: Queue
MOM used in lab: GlassFish
Publish/subscribe messaging: Topics
Indirect messaging destinations: Queues and Topics
Administratively managed resources (external to your program): ConnectionFactory, Queue, Topics

Distributed System-Mobile and ubiquitous computing

2017-02-06T20:28:32.000Z

Summary notes for chapter 19 of CMU 95-702 Distributed Systems

Design issues in distributed mobile applications:
- Association:
  - Devices:
    - Appear and disappear from the space.
    - Do so unpredictably
    - May be totally new to the space.
    - Or may be returning to the space.
  - They need to be:
    - Perhaps added to the network
    - Brought into Association with resources and applications
  - Examples of Association:
    - Come on campus and be able to be associated with the printers that are close to you.
    - Be alerted if someone you know is walking near you.
    - Be provided with selling prices in your local area for your goods (not prices in far-away areas).
- Application-level Association: often by discovery, broadcasts

How does a new device become part of the local network?
- ARP
- DHCP
Sensing and Context Awareness:
- sensing: camera, time, acceleration, location, speed, temperature
- context awareness: in terms of sensed data, or associated data
Location Sensing:
- GPS
- Database of collected Wifi access points – stores the access point's MAC address and the GPS location at which it was observed
- Cellular – compute using signal strength to multiple cellular tower locations
- RFID tags – tags are associated with a location
Adaptation:
- Presentation to fit the screen
- Use of JavaScript to fit the devices capabilities
- Media quality to fit the screen and device capabilities
- Language to fit the user
- Information to fit the physical context, e.g. give only movie times in the future, and in nearby theaters
Device awareness / browser detection: Reply differently depending on what device makes the request. 3 HTTP headers provide clues of what the device is:

- User-Agent
  - Identifies the mobile browser and almost always the device manufacturer and model.
  - E.g. BlackBerry8330/4.3.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/105
  - Collection of mobile agent strings: http://www.zytrax.com/tech/web/mobile_ids.html
- X-Wap-Profile
  - Link to an XML profile of the phone's capabilities
  - E.g. http://www.blackberry.net/go/mobile/profiles/uaprof/8310/4.2.2.rdf
- Accept
  - Supported MIME (Multipurpose Internet Mail Extensions) types
  - E.g. text/html, application/xhtml+xml, etc.

These 3 headers can provide enough info, but they can:
- be missing
- have inaccurate values
- have invalid URLs

Feature detection (a more flexible and reliable solution): the bottom line is to use feature detection, not browser detection.
- Two strategies for feature detection:
  - Graceful degradation
    - Design for modern browsers
    - Where features are not available, provide a simpler alternative
      - If not possible, alert the user
    - Don't allow it to invisibly fail
  - Progressive enhancement
    - Design with a baseline of usable functionality
    - Enrich the user experience step-by-step by testing for features before using them.

Responsive web design:
- A strategy of web design for multiple screen sizes
- Uses:
  - Fluid grids expressing sizes in terms of percents, not pixels
  - Modify size of media using relative units
    - Keep them within their bounding elements
    - Images
    - Media
    - Font size
  - Crossing size thresholds switch to completely different designs
    - Accomplished using media queries

Mobile first:
- A philosophy of web design
- Design for mobile first, and desktop second
- Counter to what has been done historically, of mobile 2nd
Benefits of Mobile First:
- Focus on the platform on which you will reach the most users
- Forces designers to focus on the most important content and functionality
- Allows for using technologies on mobile:
  - touch events
  - geolocation
  - accelerometer
Mobile deployment strategies:
- Native
  - E.g. Android, iOS
  - Requires redeveloping for each architecture
  - 2 code bases
- Native with Development Framework
  - Use a framework that compiles to multiple native applications
  - E.g. Corona (http://www.coronalabs.com)
- Mobile Web
  - Develop in HTML / CSS /JavaScript
  - Accessed in a browser
  - Can install local icon to launch to site
    - Use local storage to store information when off line
    - Use manifest to cache application to use when off line
    - Sync when Internet is again available.
- Hybrid
  - Develop in HTML / CSS / JavaScript
  - Wrap in a browser wrapper to create native apps
  - Wrapper provides access to phone hardware not accessible from the browser
  - Apache Cordova (https://cordova.apache.org) is an open source native wrapper

Distributed System-ACID

2017-02-06T20:20:32.000Z

Summary of ACID concepts from CMU 95-702 Distributed Systems

ACID Transactions:
- Atomic: All or nothing. No intermediate states are visible. No possibility that only part of the transaction ran. If a transaction fails or aborts prior to committing, the TP system will undo the effects of any updates (will recover). We either commit or abort the entire process. Checkpointing and Logging and recoverable objects can be used to ensure a transaction is atomic with respect to failures.
- Consistent: system invariants preserved, e.g., if there were n dollars in a bank before a transfer transaction then there will be n dollars in the bank after the transfer. This is largely in the hands of the application programmer.
- Isolated: Two transactions do not interfere with each other. They appear as serial executions. This is the case even though transactions may run concurrently. Locking is often used to prevent one transaction from interfering with another.
- Durable: The commit causes a permanent change to stable storage. This property may be obtained with log-based recovery algorithms. If there has been a commit but updates have not yet been completed due to a crash, the logs will hold the necessary information on recovery.
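A toy sketch can show the first three properties in miniature. It is an in-memory illustration with hypothetical names (InMemoryBank, transfer), not how a transaction processing system is built: a snapshot plays the role of the undo log, synchronized methods stand in for locking, and durability is out of scope because nothing is written to stable storage.

```java
import java.util.HashMap;
import java.util.Map;

class InMemoryBank {
    private final Map<String, Long> balances = new HashMap<>();

    synchronized void deposit(String account, long amount) {
        balances.merge(account, amount, Long::sum);
    }

    synchronized long balance(String account) {
        return balances.getOrDefault(account, 0L);
    }

    // Invariant for consistency: transfers never change the total.
    synchronized long total() {
        long sum = 0;
        for (long value : balances.values()) {
            sum += value;
        }
        return sum;
    }

    // All or nothing: either both updates apply, or the snapshot is restored.
    // synchronized keeps concurrent transfers from interleaving (isolation).
    synchronized void transfer(String from, String to, long amount) {
        Map<String, Long> snapshot = new HashMap<>(balances); // undo information
        try {
            balances.merge(from, -amount, Long::sum);
            if (balances.get(from) < 0) {
                throw new IllegalStateException("insufficient funds");
            }
            balances.merge(to, amount, Long::sum);
        } catch (RuntimeException e) {
            balances.clear();
            balances.putAll(snapshot); // abort: roll back the partial update
            throw e;
        }
    }

    public static void main(String[] args) {
        InMemoryBank bank = new InMemoryBank();
        bank.deposit("alice", 100);
        bank.transfer("alice", "bob", 30);
        System.out.println(bank.balance("alice") + " " + bank.balance("bob")); // 70 30
    }
}
```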

Where To

2017-01-24T09:02:55.000Z

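In Advanced Cloud Computing today, three professors got into an argument over a single question.

It was a moment that truly moved me: three top academics, nearly 150 years old between them, standing in front of the projector and treating knowledge with the focus and enthusiasm of children. It was adorable.

It suddenly reminded me that Greg's homepage says, "I am not a professor who never takes vacations. I went surfing in 2012!"

Honestly, choosing courses gives me a headache every time. There is too little time and too many courses I want to take, and I only wish I could study for a few more years. I also remember swearing, when I was miserable last semester, that I would not pick hard courses this semester. But I only get one master's degree, and only one chance to study at CMU (though of course I would not mind coming back, hahaha), so I cannot help talking myself into pushing a little harder.

It is just that one question has been slowly growing lately.

Where exactly am I going?

Truth be told, many people do not know where they are going either. Yet everyone can see some faint glimmer of light that seems to mark the place. Some choose to run toward it as hard as they can, some doubt whether they really saw the light, and some pretend not to see it.

I was chatting with 萱哥 a while ago, and 萱哥 was refreshingly blunt: wherever it ends up being, no more staying in Beijing breathing smog, unless the Beijing Film Academy exam works out this time. Two years on, the film academy dream is still alive.

See? People who are sure of themselves are at peace.

Like Mengyao, who got a perfect score in machine learning yet keeps telling me the plan after graduation, with no job in sight, is to squat outside a Starbucks with a begging bowl.

So sometimes, not knowing where you are going is not that scary after all.

After all, an unknown future is not frightening. A known one is the most frightening of all.

I think I have finally figured it out: I have been wearing myself out thinking about life. It is plainly a problem with no solution, and even if at some moment I feel satisfied that I have figured it out, that must be an illusion.

There is no answer. So stop thinking about it.

Better to just live once, fully and freely.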

Sorting in Java

2017-01-08T20:56:31.000Z

A summary of several sorting methods: bubble sort, selection sort, insertion sort, Quick Sort, Merge Sort (to be continued...)

Simple sorting

Bubble Sort

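Slow, but simple. Time complexity: O(N^2)

Steps:

1. Compare only two values at a time.
2. If the left value is larger, swap the two.
3. Keep moving right, carrying the larger value along.

Example:

Original array: [4,7,2,5,3]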
Round 1:
  -> [4,7,2,5,3]
  -> [4,2,7,5,3]
  -> [4,2,5,7,3]
  -> [4,2,5,3,7]
Round 2:
  -> [2,4,5,3,7]
  -> [2,4,5,3,7]
  -> [2,4,3,5,7]
Round 3:
  -> [2,4,3,5,7]
  -> [2,3,4,5,7]
Round 4:
  -> [2,3,4,5,7]

Code:

int[] data = {4, 7, 2, 5, 3};

Swap Method

// a helper method that swaps two values in an int array
private static void swap(int[] data, int one, int two) {
    int temp = data[one];
    data[one] = data[two];
    data[two] = temp;
}

Bubble Sort:

public static void bubbleSort(int[] data) {
       // move backward from the last index
       for (int out = data.length - 1; out >= 1; out--) {
           // move forward from the beginning
           // bubble up the largest value to the right
           for (int in = 0; in < out; in++) {
               if (data[in] > data[in + 1]) {
                   swap(data, in, in + 1);
               }
           }
       }
   }

Selection Sort

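Faster than bubble sort, but still not fast enough. Time complexity: O(N^2). It performs far fewer swaps than bubble sort, so it is somewhat faster.

Steps:

1. Pick the smallest value.
2. Move it to the leftmost unsorted position.

Example:

Original array: [4,7,2,5,3]
Round 1:
  The smallest is 2:
  [2,7,4,5,3]
Round 2:
  The smallest is 3:
  [2,3,4,5,7]
Round 3:
  The smallest is 4:
  [2,3,4,5,7]
Round 4:
  The smallest is 5:
  [2,3,4,5,7]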

Code:

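// Selection sort: repeatedly move the smallest remaining value to the front.
public static void selectionSort(int[] data) {
  for (int out = 0; out < data.length - 1; out++) {
    int min = out; // index of the smallest value seen so far
    for (int in = out + 1; in < data.length; in++) {
      if (data[in] < data[min]) { // compare values, not a value with an index
        min = in;
      }
    }
    if (out != min) {
      swap(data, out, min);
    }
  }
}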

Insertion Sort:

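The most intuitive sorting method. Time complexity: O(N^2)

Steps:

Imagine a dividing line inside the array.
1. Everything to the left of the line is already sorted.
2. The first element to the right of the line is inserted into its proper place on the left.

Example:

Original array: [4,7,2,5,3]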
-> [4,|7,2,5,3]
-> [4,7,|2,5,3]
-> [2,4,7,|5,3]
-> [2,4,5,7,|3]
-> [2,3,4,5,7]

Code:

public static void insertionSort(int[] data) {
   // start from the 1st index till the last index
  for (int out = 1; out < data.length; out++) {
    int in = out;
    int temp = data[out];
    /*
    * loop to check the sorted section going backward
    * but not necessarily all the way to the 0th
    * On average, look halfway through the sorted section
    */
    while (in>0 && data[in-1]>=temp) {
      data[in] = data[in-1];
      in --;
    }
    if (out != in) {
      data[in] = temp;
    }
  }
}

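Quick Sort

Time complexity: O(N log N) on average, O(N^2) in the worst case.

Steps: Quick sort is an improvement on bubble sort. One pass partitions the data into two independent parts such that no element in one part is larger than any element in the other. Each part is then sorted the same way, recursively, until the whole array is in order.

Example (the pivot is the first element of each range, as in the code that follows):

Original array: [4,7,2,5,3]
-> pivot 4: [3,2,|4|,5,7]
-> left part, pivot 3: [2,|3|,4,5,7]
-> right part, pivot 5: [2,3,4,|5|,7]
-> [2,3,4,5,7]

Code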

static void quicksort(int n[], int left, int right) {
    int dp;
    if (left < right) {
        dp = partition(n, left, right);
        quicksort(n, left, dp - 1);
        quicksort(n, dp + 1, right);
    }
}
static int partition(int n[], int left, int right) {
    int pivot = n[left];
    while (left < right) {
        while (left < right && n[right] >= pivot)
            right--;
        if (left < right)
            n[left++] = n[right];
        while (left < right && n[left] <= pivot)
            left++;
        if (left < right)
            n[right--] = n[left];
    }
    n[left] = pivot;
    return left;
}

Merge Sort

Time complexity: O(N log N)

Code


public  void sort(int[] A) {
  int[] tmp = new int[A.length];
  mergeSort(A, 0, A.length -1 , tmp);
}
public void mergeSort(int[] A, int start, int end, int[] tmp) {
  if (start >= end) return; // zero or one element is already sorted
  int mid = start + (end - start) / 2; // avoids int overflow of start + end
  mergeSort(A, start, mid, tmp);
  mergeSort(A, mid + 1, end, tmp);
  merge(A, start, mid, end, tmp);
}
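public void merge(int[] A, int start, int mid, int end, int[] tmp) {
  int left = start;      // next unread index in the left half
  int right = mid + 1;   // next unread index in the right half
  int index = start;     // next write position in tmp
  while (left <= mid && right <= end) {
    // <= takes equal elements from the left half first, keeping the sort stable
    if (A[left] <= A[right]) {
      tmp[index++] = A[left++];
    } else {
      tmp[index++] = A[right++];
    }
  }
  // copy whatever remains in either half
  while (left <= mid) {
    tmp[index++] = A[left++];
  }
  while (right <= end) {
    tmp[index++] = A[right++];
  }
  // write the merged range back into A
  for (index = start; index <= end; index++) {
    A[index] = tmp[index];
  }
}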