Linux Kernel SKBuff

I found some brief explanations of sk_buff here and here.

The socket buffer, or “SKB”, is the most fundamental data structure in the Linux networking code. Every packet sent or received is handled using this data structure.

The most fundamental parts of the SKB structure are as follows:


struct sk_buff {
	/* These two members must be first. */
	struct sk_buff		*next;
	struct sk_buff		*prev;

	struct sk_buff_head	*list;
 ...

The first two members implement list handling. Packets can exist on several kinds of lists and queues, for example a TCP socket send queue. The third member records which list the packet is on. Learn more about SKB list handling here.


	struct sock		*sk;

This is where we record the socket associated with this SKB. When a packet is sent or received for a socket, the memory associated with the packet must be charged to the socket for proper memory accounting. Read more about socket packet buffer memory accounting here.


	struct timeval		stamp;

Here we record the timestamp for the packet, either when it arrived or when it was sent. Calculating this is somewhat expensive, so this value is only recorded if necessary. When something happens that requires that we start recording timestamps, net_enable_timestamp() is called. If that need goes away, net_disable_timestamp() is called.

Timestamps are mostly used by packet sniffers, but they are also used to implement certain socket options, and some netfilter modules make use of this value as well.


	struct net_device	*dev;
	struct net_device	*input_dev;
	struct net_device	*real_dev;

These three members help keep track of the devices associated with a packet. The reason we have three different device pointers is that the main ‘skb->dev’ member can change as we encapsulate and decapsulate via a virtual device.

So if we are receiving a packet from a device which is part of a bonding device instance, initially ‘skb->dev’ will be set to point to the real underlying bonding slave. When the packet enters the networking stack (via ‘netif_receive_skb()’) we save ‘skb->dev’ away in ‘skb->real_dev’ and update ‘skb->dev’ to point to the bonding device.

Likewise, the physical device receiving a packet always records itself in ‘skb->input_dev’. In this way, no matter how many layers of virtual devices end up being decapsulated, ‘skb->input_dev’ can always be used to find the top-level device that actually received this packet from the network.


	union {
		struct tcphdr	*th;
		struct udphdr	*uh;
		struct icmphdr	*icmph;
		struct igmphdr	*igmph;
		struct iphdr	*ipiph;
		struct ipv6hdr	*ipv6h;
		unsigned char	*raw;
	} h;

	union {
		struct iphdr	*iph;
		struct ipv6hdr	*ipv6h;
		struct arphdr	*arph;
		unsigned char	*raw;
	} nh;

	union {
		unsigned char	*raw;
	} mac;

Here we store the location of the various protocol layer headers as we build outgoing packets, and parse incoming ones. For example, ‘skb->mac.raw’ is set by ‘eth_type_trans()’, when an ethernet packet is received. Later, we can use this to find the location of the MAC header.

These members are potentially redundant, and could be removed. Read a discussion about that here.


	struct  dst_entry	*dst;

This member is the generic route for the packet. It tells us how to get the packet to its destination. Note that routes are used for both input and output. DST entries are about as complex as SKBs are, and thus probably deserve their own tutorial.


	struct	sec_path	*sp;

Here we store the security path traversed by the packet, if any. For example, on input IPSEC records each transformation which has been applied to the packet by a decapsulator. The records are an array of ‘struct sec_decap_state’, each of which records the security association that matched and got applied. Later, when we are trying to validate the security policy against a packet, we make sure that the transformations applied match the ones allowed by the policy.


	char			cb[40];

This is the SKB control block. It is an opaque storage area usable by protocols, and even some drivers, to store private per-packet information. TCP uses this, for example, to store sequence numbers and retransmission state for the frame.


	unsigned int		len,
				data_len,
				mac_len,
				csum;

The three length members are pretty straightforward. The total number of bytes in the packet is ‘len’. SKBs are composed of a linear data buffer, and optionally a set of 1 or more page buffers. If there are page buffers, the total number of bytes in the page buffer area is ‘data_len’. Therefore the number of bytes in the linear buffer is ‘skb->len - skb->data_len’. There is a shorthand function for this in ‘skb_headlen()’.


static inline unsigned int skb_headlen(const struct sk_buff *skb)
{
	return skb->len - skb->data_len;
}

The ‘mac_len’ holds the length of the MAC header. Normally, this isn’t really necessary to maintain, except to implement IPSEC decapsulation of IP tunnels properly. This field is initialized once inside of ‘netif_receive_skb()’ to the formula ‘skb->nh.raw - skb->mac.raw’.

Since we only use this for one purpose, with some clever ideas we may be able to eliminate this member in the future. For example, perhaps we can store the value in the ‘struct sec_path’.

Finally, ‘csum’ holds the checksum of the packet. When building send packets, we copy the data in from userspace and calculate the 16-bit one’s complement sum in parallel for performance. This sum is accumulated in ‘skb->csum’. This helps us compute the final checksum stored in the protocol packet header checksum field. This field can end up being ignored if, for example, the device will checksum the packet for us.

On input, the ‘csum’ field can be used to store a checksum calculated by the device. If the device indicates ‘CHECKSUM_HW’ in the SKB ‘ip_summed’ field, this means that ‘csum’ is the one’s complement checksum of the entire packet data area starting at ‘skb->data’. This is generic enough that both IPV4 and IPV6 checksum offloading can be supported.
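To make the accumulate-then-finish idea concrete, here is a minimal user-space sketch of the 16-bit one's complement Internet checksum. It is not the kernel's implementation; the names csum_accumulate() and csum_finish() are made up for this illustration, and the code assumes packet-sized buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add the 16-bit big-endian words of buf to a running partial sum.
 * Hypothetical helper for illustration; assumes packet-sized buffers. */
static uint32_t csum_accumulate(const uint8_t *buf, size_t len, uint32_t sum)
{
	assert(buf != NULL || len == 0);
	while (len > 1) {
		sum += ((uint32_t)buf[0] << 8) | buf[1];
		buf += 2;
		len -= 2;
	}
	if (len == 1)                           /* odd trailing byte, zero padded */
		sum += (uint32_t)buf[0] << 8;
	/* Fold once so the partial sum stays small across repeated calls. */
	return (sum & 0xffff) + (sum >> 16);
}

/* Fold any remaining carries back in (end-around carry) and complement. */
static uint16_t csum_finish(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

Because the partial sum can be carried from one call to the next, the data can be summed chunk by chunk as it is copied, which is the property the text relies on. Summing a buffer that already contains its own checksum yields zero after csum_finish().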


	unsigned char		local_df,
				cloned:1,
				nohdr:1,
				pkt_type,
				ip_summed;

The ‘local_df’ field is used by the IPV4 protocol, and when set allows us to locally fragment frames which have already been fragmented. This situation can arise, for example, with IPSEC.

In order to make quick references to SKB data, Linux has the concept of SKB clones. When a clone of an SKB is made, all of the ‘struct sk_buff’ structure members of the clone are private to the clone. The data, however, is shared between the primary SKB and its clone. When an SKB is cloned, the ‘cloned’ field will be set in both the primary and clone SKB. Otherwise it will be zero.

The ‘nohdr’ field is used in the support of TCP Segmentation Offload (‘TSO’ for short). Most devices supporting this feature need to make some minor modifications to the TCP and IP headers of an outgoing packet to get it in the right form for the hardware to process. We do not want these modifications to be seen by packet sniffers and the like. So we use this ‘nohdr’ field and a special bit in the data area reference count to keep track of whether the device needs to replace the data area before making the packet header modifications.

The type of the packet (basically, who it is for) is stored in the ‘pkt_type’ field. It takes on one of the ‘PACKET_*’ values defined in the ‘linux/if_packet.h’ header file. For example, when an incoming ethernet frame is addressed to a destination MAC address matching the MAC address of the ethernet device it arrived on, this field will be set to ‘PACKET_HOST’. When a broadcast frame is received, it will be set to ‘PACKET_BROADCAST’. And likewise when a multicast packet is received it will be set to ‘PACKET_MULTICAST’.
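The classification described above can be sketched in user space as follows. This is only an illustration, not the code in ‘eth_type_trans()’: the TOY_ names are invented here, and the fourth case mirrors ‘PACKET_OTHERHOST’, which exists in ‘linux/if_packet.h’ but is not discussed in the text.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Invented stand-ins for the PACKET_* constants. */
enum toy_pkt_type {
	TOY_PACKET_HOST,
	TOY_PACKET_BROADCAST,
	TOY_PACKET_MULTICAST,
	TOY_PACKET_OTHERHOST
};

/* Classify a frame by its destination MAC, given the receiving device's MAC. */
static enum toy_pkt_type toy_classify(const uint8_t dest[6], const uint8_t dev_addr[6])
{
	static const uint8_t bcast[6] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};

	assert(dest != NULL && dev_addr != NULL);
	if (memcmp(dest, bcast, 6) == 0)
		return TOY_PACKET_BROADCAST;
	if (dest[0] & 0x01)                     /* group bit set: multicast */
		return TOY_PACKET_MULTICAST;
	if (memcmp(dest, dev_addr, 6) == 0)
		return TOY_PACKET_HOST;
	return TOY_PACKET_OTHERHOST;            /* unicast for someone else */
}
```

The broadcast test comes first because the all-ones address also has the group bit set.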

The ‘ip_summed’ field describes what kind of checksumming assistance the card has provided for a receive packet. It takes on one of three values: ‘CHECKSUM_NONE’ if the card provided no checksum assistance, ‘CHECKSUM_HW’ if the one’s complement checksum over the entire packet has been provided in ‘skb->csum’, and ‘CHECKSUM_UNNECESSARY’ if it is not necessary to verify the checksum of this packet. The latter usually occurs when the packet is received over the loopback device. ‘CHECKSUM_UNNECESSARY’ can also be used when the device only provides a ‘checksum OK’ indication for receive packet checksum offload.


	__u32			priority;

The ‘priority’ field is used in the implementation of QoS. The packet’s value of this field can be determined by, for example, the TOS field setting in the IPV4 header. Then, the packet scheduler and classifier layer can key off of this SKB priority value to schedule or classify the packet, as configured by the administrator.


	unsigned short		protocol,
				security;

The ‘protocol’ field is initialized by routines such as ‘eth_type_trans()’. It takes on one of the ‘ETH_P_*’ values defined in the ‘linux/if_ether.h’ header file. Even non-ethernet devices use these ethernet protocol type values to indicate what protocol should receive the packet. As long as we always have some ethernet protocol value for each and every protocol, this should not be a problem.

The ‘security’ field was meant to be used in the implementation of IP Security, but that never materialized. It can probably be safely removed. Since the next field is a pointer, and thus needs to be aligned properly, eliminating the ‘security’ field would unfortunately not buy us any space savings.


	void			(*destructor)(struct sk_buff *skb);
	...
	unsigned int		truesize;

The SKB ‘destructor’ and ‘truesize’ fields are used for socket buffer accounting. See the SKB socket accounting page for details.


	atomic_t		users;

We reference count SKB objects using the ‘users’ field. Extra references can be obtained by invoking ‘skb_get()’. An implicit single reference is present in the SKB (that is, ‘users’ has a value of ‘1’) when it is first allocated. References are dropped by invoking ‘kfree_skb()’.
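The reference counting rules can be modeled in a few lines of user-space C. This is a sketch under simplifying assumptions: a plain int replaces the kernel's atomic_t, and the toy_ names are invented stand-ins for alloc_skb(), skb_get() and kfree_skb().

```c
#include <assert.h>
#include <stdlib.h>

struct toy_skb {
	int users;              /* stand-in for atomic_t users; not thread safe */
};

/* A fresh buffer starts with one implicit reference. */
static struct toy_skb *toy_alloc_skb(void)
{
	struct toy_skb *skb = malloc(sizeof(*skb));

	if (skb)
		skb->users = 1;
	return skb;
}

/* Take an extra reference, as skb_get() does. */
static struct toy_skb *toy_skb_get(struct toy_skb *skb)
{
	assert(skb != NULL && skb->users > 0);
	skb->users++;
	return skb;
}

/* Drop one reference; free the buffer when the last one goes away.
 * Returns 1 if the buffer was freed, 0 otherwise (the return value is
 * an addition for this sketch so the behavior can be observed). */
static int toy_kfree_skb(struct toy_skb *skb)
{
	assert(skb != NULL && skb->users > 0);
	if (--skb->users > 0)
		return 0;
	free(skb);
	return 1;
}
```

Each holder of a reference calls the free routine exactly once, and only the last call actually releases the memory.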


	unsigned char		*head,
				*data,
				*tail,
				*end;

These four pointers provide the core management of the linear packet data area of an SKB. SKB data area handling is involved enough to deserve its very own tutorial. Check it out here.

sk_buff

All network-related queues and buffers in the kernel use a common data structure, struct sk_buff. This is a large struct containing all the control information required for the packet (datagram, cell, whatever). The sk_buff elements are organized as a doubly linked list, in such a way that it is very efficient to move an sk_buff element from the beginning/end of a list to the beginning/end of another list. A queue is defined by struct sk_buff_head, which includes a head and a tail pointer to sk_buff elements.

All the queuing structures include an sk_buff_head representing the queue. For instance, struct sock includes a receive and send queue. Functions to manage the queues (skb_queue_head(), skb_queue_tail(), skb_dequeue(), skb_dequeue_tail()) operate on an sk_buff_head. In reality, however, the sk_buff_head is included in the doubly linked list of sk_buffs (so it actually forms a ring).
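The ring layout can be illustrated with a small user-space model. The toy_ names are invented for this sketch; in the kernel the queue head works as the sentinel because struct sk_buff_head begins with the same next/prev pair as struct sk_buff, and the real functions also take a lock, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

struct toy_node {                       /* plays the role of the next/prev pair */
	struct toy_node *next, *prev;
};

struct toy_queue {
	struct toy_node head;           /* sentinel: part of the ring itself */
	unsigned int qlen;
};

static void toy_queue_init(struct toy_queue *q)
{
	q->head.next = &q->head;        /* an empty queue points at itself */
	q->head.prev = &q->head;
	q->qlen = 0;
}

/* Splice n between two adjacent ring members. */
static void toy_link(struct toy_queue *q, struct toy_node *n,
		     struct toy_node *prev, struct toy_node *next)
{
	assert(prev->next == next && next->prev == prev);
	n->prev = prev;
	n->next = next;
	prev->next = n;
	next->prev = n;
	q->qlen++;
}

static void toy_queue_head(struct toy_queue *q, struct toy_node *n)
{
	toy_link(q, n, &q->head, q->head.next);
}

static void toy_queue_tail(struct toy_queue *q, struct toy_node *n)
{
	toy_link(q, n, q->head.prev, &q->head);
}

/* Remove and return the first element, or NULL if the queue is empty. */
static struct toy_node *toy_dequeue(struct toy_queue *q)
{
	struct toy_node *n = q->head.next;

	if (n == &q->head)
		return NULL;
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = NULL;
	q->qlen--;
	return n;
}
```

Because the sentinel is always present, inserting at either end and removing from the front are constant-time pointer updates with no special case for an empty list.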

When an sk_buff is allocated, its data space is also allocated from kernel memory. Allocation is done with alloc_skb() or dev_alloc_skb() (drivers use dev_alloc_skb()), and buffers are freed with kfree_skb() and dev_kfree_skb(). However, sk_buff provides an additional management layer. The data space is divided into a head area and a data area. This allows kernel functions to reserve space for the header, so that the data doesn’t need to be copied around. Typically, therefore, after allocating an sk_buff, header space is reserved using skb_reserve(). Conversely, skb_pull() removes len bytes from the start of a buffer (skipping over an existing header) by advancing data to data+len and decreasing len.
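A user-space model of the four pointers may make the head/data split easier to follow. It is a sketch, not kernel code: the toy_ names are invented, and toy_put() and toy_push() model skb_put() and skb_push(), which the text above does not mention but which are the kernel's counterparts for appending and prepending data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_buf {
	unsigned char *head, *data, *tail, *end;
	unsigned int len;               /* bytes between data and tail */
};

/* Allocate the linear area; returns 0 on success, -1 on failure. */
static int toy_buf_alloc(struct toy_buf *b, size_t size)
{
	b->head = malloc(size);
	if (!b->head)
		return -1;
	b->data = b->tail = b->head;
	b->end = b->head + size;
	b->len = 0;
	return 0;
}

/* Like skb_reserve(): open up headroom in a still-empty buffer. */
static void toy_reserve(struct toy_buf *b, size_t n)
{
	assert(b->len == 0 && (size_t)(b->end - b->tail) >= n);
	b->data += n;
	b->tail += n;
}

/* Like skb_put(): append n bytes at the tail, return where they start. */
static unsigned char *toy_put(struct toy_buf *b, size_t n)
{
	unsigned char *old_tail = b->tail;

	assert((size_t)(b->end - b->tail) >= n);
	b->tail += n;
	b->len += (unsigned int)n;
	return old_tail;
}

/* Like skb_push(): prepend n bytes (a header) into the headroom. */
static unsigned char *toy_push(struct toy_buf *b, size_t n)
{
	assert((size_t)(b->data - b->head) >= n);
	b->data -= n;
	b->len += (unsigned int)n;
	return b->data;
}

/* Like skb_pull(): skip n bytes at the front, e.g. a parsed header. */
static unsigned char *toy_pull(struct toy_buf *b, size_t n)
{
	assert(n <= b->len);
	b->data += n;
	b->len -= (unsigned int)n;
	return b->data;
}
```

On the send path a protocol reserves headroom, puts the payload, and then each lower layer pushes its header in front without moving the payload; on the receive path each layer pulls its header off as it finishes with it.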

struct sk_buff has fields to point to the specific network layer headers:

  • transport_header (previously called h) – for layer 4, the transport layer (can include tcp header or udp header or icmp header, and more)
  • network_header – (previously called nh) for layer 3, the network layer (can include ip header or ipv6 header or arp header).
  • mac_header – (previously called mac) for layer 2, the link layer.
  • skb_network_header(skb), skb_transport_header(skb) and skb_mac_header(skb) each return a pointer to the corresponding header.

The struct sk_buff objects themselves are private for every network layer. When a packet is passed from one layer to another, the struct sk_buff is cloned. However, the data itself is not copied in that case. Note that struct sk_buff is quite large, but most of its members are unused in most situations. The copy overhead when cloning is therefore limited.

  • Almost always sk_buff instances appear as “skb” in the kernel code.
  • struct dst_entry *dst – the route for this sk_buff; this route is determined by the routing subsystem.
    • It has 2 important function pointers:
      • int (*input)(struct sk_buff*);
      • int (*output)(struct sk_buff*);
    • input() can be assigned to one of the following: ip_local_deliver, ip_forward, ip_mr_input, ip_error or dst_discard_in.
    • output() can be assigned to one of the following: ip_output, ip_mc_output, ip_rt_bug, or dst_discard_out.
    • We will deal more with dst when talking about routing.
    • In the usual case, there is only one dst_entry for every skb.
    • When using IPsec, there is a linked list of dst_entries and only the last one is for routing; all the other dst_entries are for IPsec transformers and have the DST_NOHASH flag set. Entries with the DST_NOHASH flag set are not kept in the routing cache, but in the flow cache instead.
  • tstamp (of type ktime_t ) : time stamp of receiving the packet.
    • net_enable_timestamp() must be called in order to get values.

Relax

Exercise is a good way to unwind: go for a swim, or downshift and rev it to 7,000 rpm.

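Quick tip

I exported a large table (23 GBytes, 60 million+ records) to CSV, processing it in batches of 500k records so that each batch became its own CSV file.
So how do you merge these files back together?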

# keep the header row once, then append every part without its header
head -n 1 a.001.csv > all.csv
for f in a.*.csv; do
    tail -n +2 "$f" >> all.csv
done

The small built-in tools are really handy :)

Reference books on statistical learning and data mining

The elements of statistical learning : data mining, inference, and prediction
ISBN 9780387848570
Authors Hastie, Trevor
Publisher Springer
Copyright Date c2009
Price $89.95

Applied predictive modeling
ISBN 9781461468486
Authors Kuhn, Max
Publisher Springer
Copyright Date c2013
Price $89.95

Data Mining: Practical Machine Learning Tools and Techniques
ISBN 9780120884070
Authors Witten, I. H., Frank, E. and Hall, M. A.
Publisher Morgan Kaufmann
Copyright Date 2011
Price $75.95

Thanks to a classmate at Columbia University for the recommendations.

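Certificate compatibility problems since Java 7

When using JavaApns for iOS push notifications, the following error is reported:

javapns.communication.exceptions.InvalidCertificateChainException

With Java 6 or PHP there is no problem. A quick search online suggests the cause is the security mechanisms introduced with Java 7. I tried the following fix:

Use the keytool that ships with JDK 7 to convert the keystore twice:

.p12 to jks: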

keytool -importkeystore -destkeystore temp.jks -srckeystore src.p12 -srcstoretype PKCS12

jks to .p12:

keytool -importkeystore -srckeystore temp.jks -srcstoretype JKS -deststoretype PKCS12 -destkeystore dest.p12

With the newly generated .p12 key file, everything works.

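Logical architecture of OpenStack networking with Open vSwitch and GRE

A virtualized instance (virtual machine) generates a packet and sends it out through its internal virtual network interface (virtual NIC), for example eth0; the packet is then delivered to a Test Access Point (TAP) device on the compute node. The file /etc/libvirt/qemu/instance-xxxxxxxx.xml records the current configuration of that device.

The TAP device name is built from the first 11 characters of the port ID (10 hexadecimal digits and 1 hyphen), so another way to find the TAP device name is to use the neutron command. neutron port-list returns the list of ports, with the port ID as the first item. The first 11 characters of that output give the name of the corresponding TAP device.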
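The naming rule can be written down as a small helper. This is a sketch with assumptions: the function name is made up, the port ID used in any call is hypothetical, and the "tap" prefix is the usual Neutron convention rather than something stated above. The 11-character cut fits the Linux interface name limit of 15 characters plus a terminating NUL.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write "tap" + the first 11 characters of port_id into out.
 * Returns 0 on success, -1 if the ID is too short or out is too small.
 * Hypothetical helper; the "tap" prefix is an assumed convention. */
static int tap_name_from_port_id(const char *port_id, char *out, size_t out_size)
{
	assert(port_id != NULL && out != NULL);
	if (strlen(port_id) < 11 || out_size < 15)      /* 3 + 11 + NUL */
		return -1;
	snprintf(out, out_size, "tap%.11s", port_id);
	return 0;
}
```

For a port ID that starts with 3b7f2a1c-9d4e, this produces tap3b7f2a1c-9d, which can then be looked up with ip link or ovs-vsctl show on the compute node.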



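The TAP device is attached to the integration bridge br-int. This bridge connects the TAP devices of all instances as well as the other bridges in the system. In this example it has the int-br-eth1 and patch-tun interfaces. int-br-eth1 is one end of the veth pair that connects to the br-eth1 bridge, which handles the VLANs carried over the physical interface eth1. patch-tun is an Open vSwitch internal port that connects to the br-tun bridge, which provides the GRE-encapsulated tunnel network.

On the network node, the integration bridge br-int ties in the corresponding network services, such as DHCP (dnsmasq), DNS (dnsmasq), layer-3 routing (in practice, isolation built with netns and iptables), and security groups and firewalling (iptables). There is also br-ext, which provides access to the external network.

Traffic between compute nodes and the network node can be carried in several ways: over layer 2 using VLAN/VXLAN (br-eth1), or through GRE tunnels over layer 3 (br-tun).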

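MySQL: a preliminary performance study of auto-increment IDs vs. UUIDs as primary keys

For the past few days I have been going back and forth on primary key design: whether to use an auto-increment ID or a UUID as the primary key, with MySQL as the database back end.
I first searched the web and found the article 实测 Mysql UUID 性能 ("Measuring MySQL UUID performance"), whose conclusion is:

When the table engine is MyISAM, an auto-increment ID is without doubt the most efficient; UUID is slightly slower, but not unacceptably so. With the InnoDB engine, however, performance drops very severely, to an outrageous degree. Because InnoDB uses a clustered index for the primary key, inserted records are physically ordered, while UUIDs are essentially unordered, which causes enormous I/O overhead. So if you use InnoDB, never use UUIDs.

My own later tests showed this conclusion to be basically correct, but I cannot agree with the test method used in that article.

The major flaw in its test procedure: the inserts into the two auto-increment ID tables did not write the varchar column. Given the cost of inserting a varchar value, this absolutely cannot be ignored!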

Create four test tables:

uuidtest_inno(uuid,text),
inttest_inno(id,text),
uuidtest_myisam(uuid,text),
inttest_myisam(id,text)

Create four stored procedures, each inserting 100,000 rows of test data:

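DROP PROCEDURE IF EXISTS p_uuid_inno//
CREATE PROCEDURE p_uuid_inno()
BEGIN
DECLARE i INT;
SET i=0;
-- The loop was cut off in the source after "WHILE i"; the rest is a
-- reconstruction from the description above, and the inserted text
-- value is an assumption.
WHILE i < 100000 DO
INSERT INTO uuidtest_inno VALUES (UUID(), 'test');
SET i = i + 1;
END WHILE;
END//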

Empty the four tables:

TRUNCATE inttest_inno//
TRUNCATE uuidtest_inno//
TRUNCATE inttest_myisam//
TRUNCATE uuidtest_myisam//

Run the stored procedures:

call p_int_myisam()//
call p_uuid_myisam()//
call p_int_inno()//
call p_uuid_inno()//

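The execution time turned out to be enormously long, so I reluctantly cut the test size down to 1,000 inserts.
The MyISAM runs each took about 0.2 s; the InnoDB runs took about 55 s.

Turning to database tuning and giving up full ACID guarantees,
I set innodb_flush_log_at_trx_commit = 2
and got: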

mysql> call p_int_myisam();
Query OK, 1 row affected (2.02 sec)

mysql> call p_uuid_myisam();
Query OK, 1 row affected (2.63 sec)

mysql> call p_int_inno();
Query OK, 1 row affected (9.71 sec)

mysql> call p_uuid_inno();
Query OK, 1 row affected (13.88 sec)

Next I set innodb_flush_method = O_DIRECT
and got:

mysql> call p_int_myisam();
Query OK, 1 row affected (2.06 sec)

mysql> call p_uuid_myisam();
Query OK, 1 row affected (2.56 sec)

mysql> call p_int_inno();
Query OK, 1 row affected (7.59 sec)

mysql> call p_uuid_inno();
Query OK, 1 row affected (10.88 sec)

Then I set innodb_log_buffer_size = 8M (the previous setting was 3M)
and got:

mysql> call p_int_myisam();
Query OK, 1 row affected (1.96 sec)

mysql> call p_uuid_myisam();
Query OK, 1 row affected (2.63 sec)

mysql> call p_int_inno();
Query OK, 1 row affected (5.28 sec)

mysql> call p_uuid_inno();
Query OK, 1 row affected (9.59 sec)

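As the numbers show, InnoDB inserts are slower than MyISAM inserts, and this has little to do with whether a UUID or an auto-increment ID is used as the primary key. UUIDs are indeed slower than auto-increment IDs, but not by an order of magnitude.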

Getting standard network time with Java

import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;
public class DaytimeClient
{
	/**
	 * @param args optional host name to try before the built-in servers
	 */
	public static void main(String[] args)
	{
		// Slot 0 is reserved for a host given on the command line.
		String[] hostname = new String[]{"", "time.windows.com","time.nist.gov","time-nw.nist.gov","time-a.nist.gov","time-b.nist.gov"};
		int i = 1;
		boolean received = false;
		if(args.length > 0)
		{
			hostname[0] = args[0];
			i = 0;
		}
		// Try each server in turn until one answers on the Daytime port (13).
		while(i < hostname.length && !received)
		{
			// try-with-resources closes the socket even if reading fails.
			try(Socket timeSocket = new Socket(hostname[i], 13))
			{
				InputStream timeStream = timeSocket.getInputStream();
				StringBuilder time = new StringBuilder();
				int c;
				// The server sends one line of text and then closes the connection.
				while((c = timeStream.read()) != -1)
					time.append((char)c);
				String timeString = time.toString().trim();
				System.out.println("It is " + timeString + " at " + hostname[i] + " on port " + timeSocket.getPort());
				received = true;
			}
			catch(IOException e)
			{
				// UnknownHostException is an IOException, so one handler covers both cases.
				System.out.println("Failed to get the information from " + hostname[i]);
				System.out.println(e);
				i++;
			}
		}
		if(!received)
			System.out.println("Failed to get the information please check your network settings.");
		else
			System.out.println("Complete.");
	}
}

Notes on compiling and running a Java Applet

First, here is a Hello World Java Applet together with the code that embeds it in HTML:
Source path: JavaApplet\src\HelloWorld.java
JavaApplet\bin\JavaTest.html

import java.awt.*;
import java.applet.*;
public class HelloWorld extends Applet
{
	public void paint(Graphics g)
	{
		g.drawString("Hello World!", 5, 35);
	}
}

<HTML>
<HEAD><TITLE>JavaApplet-Text</TITLE></HEAD>
<BODY>
<applet code = "HelloWorld.class" width = 200 height = 200></applet>
</BODY>
</HTML>

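For the code above to run correctly, note the following:
1. After HelloWorld.java compiles successfully, HelloWorld.class is generated in …\bin; put the HTML file in the same directory as the class file. If they are not in the same directory, add a "codebase" attribute to the HTML, i.e.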

<applet code = "HelloWorld.class" codebase= "location" width = 200 height = 200></applet>

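2. Make sure Java is up to date.
3. In Control Panel -> Programs -> Java -> Security settings, set the security level to the lowest.
4. Do not put HelloWorld in a package; otherwise the package name must be specified in the HTML. P.S. DS has not yet figured out how to specify the package name =.=!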