Linux Kernel Switchdev Code Reading Notes

The previous article, "Linux Kernel Switchdev Framework Study Notes", covered the overall structure of the switchdev framework, based mainly on Documentation/networking/switchdev.rst in the kernel source (Linux 6.5). Details may differ between kernel versions, and much of that article was translation supplemented with my own understanding and thoughts, so use it with appropriate judgment.

The switchdev framework makes full use of the Linux networking stack: the APIs it provides connect the network subsystem to the switch ASIC driver, so that each vendor can offload to its silicon whatever features it supports. As the saying goes, "Talk is cheap, show me the code", so let's look at how switchdev is actually implemented in the kernel.

Searching for switchdev-related files under Documentation in the kernel tree gives the following results:

./networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
./networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst
./networking/device_drivers/ethernet/ti/cpsw_switchdev.rst
./networking/switchdev.rst
./output/.doctrees/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.doctree
./output/.doctrees/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.doctree
./output/.doctrees/networking/device_drivers/ethernet/ti/cpsw_switchdev.doctree
./output/.doctrees/networking/switchdev.doctree
./output/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.html
./output/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.html
./output/networking/device_drivers/ethernet/ti/cpsw_switchdev.html
./output/networking/switchdev.html
./output/_sources/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst.txt
./output/_sources/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst.txt
./output/_sources/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst.txt
./output/_sources/networking/switchdev.rst.txt

Searching for switchdev in the rest of the source tree (outside Documentation) gives the following results:

./drivers/net/dsa/mv88e6xxx/switchdev.c
./drivers/net/dsa/mv88e6xxx/switchdev.h
./drivers/net/ethernet/marvell/prestera/prestera_switchdev.c
./drivers/net/ethernet/marvell/prestera/prestera_switchdev.h
./drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
./drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.h
./drivers/net/ethernet/microchip/lan966x/lan966x_switchdev.c
./drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c
./drivers/net/ethernet/ti/am65-cpsw-switchdev.c
./drivers/net/ethernet/ti/am65-cpsw-switchdev.h
./drivers/net/ethernet/ti/cpsw_switchdev.c
./drivers/net/ethernet/ti/cpsw_switchdev.h
./drivers/net/wireless/quantenna/qtnfmac/switchdev.h
./include/net/switchdev.h
./net/bridge/br_mrp_switchdev.c
./net/bridge/br_switchdev.c
./net/switchdev
./net/switchdev/switchdev.c

From the documentation and code search results above, the switchdev framework itself lives under net/switchdev, the bridge and DSA code use the switchdev interfaces, and the supported vendors and chips are mainly TI (cpsw), Microchip (sparx5, lan966x), Marvell (prestera) and Mellanox (spectrum), plus the rocker driver, which was written specifically to demonstrate switchdev. Below we look at how switchdev fits into the Linux network subsystem, focusing on the rocker driver and only briefly touching on the vendor implementations.

Dynamic Interface Registration

In rocker_probe() (rocker_main.c), rocker_probe_ports() iterates over all ports and initializes each of them. Code excerpt:

static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
...
	err = rocker_probe_ports(rocker);
	if (err) {
		dev_err(&pdev->dev, "failed to probe ports\n");
		goto err_probe_ports;
	}
...
}

rocker_probe_ports() is implemented as follows:

static int rocker_probe_ports(struct rocker *rocker)
{
	int i;
	size_t alloc_size;
	int err;

	alloc_size = sizeof(struct rocker_port *) * rocker->port_count;
	rocker->ports = kzalloc(alloc_size, GFP_KERNEL);
	if (!rocker->ports)
		return -ENOMEM;
	for (i = 0; i < rocker->port_count; i++) {
		err = rocker_probe_port(rocker, i);
		if (err)
			goto remove_ports;
	}
	return 0;

remove_ports:
	rocker_remove_ports(rocker);
	return err;
}

rocker_probe_port() does the actual per-port registration:

static int rocker_probe_port(struct rocker *rocker, unsigned int port_number)
{
	struct pci_dev *pdev = rocker->pdev;
	struct rocker_port *rocker_port;
	struct net_device *dev;
	int err;

	dev = alloc_etherdev(sizeof(struct rocker_port));
	if (!dev)
		return -ENOMEM;
	SET_NETDEV_DEV(dev, &pdev->dev);
	rocker_port = netdev_priv(dev);
	rocker_port->dev = dev;
	rocker_port->rocker = rocker;
	rocker_port->port_number = port_number;
	rocker_port->pport = port_number + 1;

	err = rocker_world_check_init(rocker_port);
	if (err) {
		dev_err(&pdev->dev, "world init failed\n");
		goto err_world_check_init;
	}

	rocker_port_dev_addr_init(rocker_port);
	dev->netdev_ops = &rocker_port_netdev_ops;
	dev->ethtool_ops = &rocker_port_ethtool_ops;
	netif_napi_add_tx(dev, &rocker_port->napi_tx, rocker_port_poll_tx);
	netif_napi_add(dev, &rocker_port->napi_rx, rocker_port_poll_rx);
	rocker_carrier_init(rocker_port);

	dev->features |= NETIF_F_NETNS_LOCAL | NETIF_F_SG;

	/* MTU range: 68 - 9000 */
	dev->min_mtu = ROCKER_PORT_MIN_MTU;
	dev->max_mtu = ROCKER_PORT_MAX_MTU;

	err = rocker_world_port_pre_init(rocker_port);
	if (err) {
		dev_err(&pdev->dev, "port world pre-init failed\n");
		goto err_world_port_pre_init;
	}
	err = register_netdev(dev);
	if (err) {
		dev_err(&pdev->dev, "register_netdev failed\n");
		goto err_register_netdev;
	}
	rocker->ports[port_number] = rocker_port;

	err = rocker_world_port_init(rocker_port);
	if (err) {
		dev_err(&pdev->dev, "port world init failed\n");
		goto err_world_port_init;
	}

	return 0;

err_world_port_init:
	rocker->ports[port_number] = NULL;
	unregister_netdev(dev);
err_register_netdev:
	rocker_world_port_post_fini(rocker_port);
err_world_port_pre_init:
err_world_check_init:
	free_netdev(dev);
	return err;
}

If you are familiar with NIC drivers, the code above will look familiar. The rocker driver registers one net_device with the network subsystem via register_netdev() for each port, so every port shows up as an interface in ifconfig/ip. As many ports as the chip has, that many net_device structures get registered; you can think of the switch ASIC as a NIC with a great many ports (an ordinary NIC has 1/2/4). Treating the switch chip as a NIC from here on makes many things easier to understand, for example ethtool support, and that is exactly the original intent of the switchdev design: managing a switch device with a stock Linux operating system. rocker_port is the driver's private per-port data structure, or, in NIC terms, the private bookkeeping the NIC keeps for each of its ports.
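
For reference, here is an abridged reconstruction of rocker_port_netdev_ops (several callbacks are omitted and the exact list should be checked against rocker_main.c): beyond the usual NIC-style callbacks, .ndo_get_port_parent_id is what lets user space and the bridge tell that all of these netdevs sit on the same switch ASIC.

static const struct net_device_ops rocker_port_netdev_ops = {
	.ndo_open		= rocker_port_open,
	.ndo_stop		= rocker_port_stop,
	.ndo_start_xmit		= rocker_port_xmit,
	.ndo_set_mac_address	= rocker_port_set_mac_address,
	.ndo_change_mtu		= rocker_port_change_mtu,
	.ndo_get_port_parent_id	= rocker_port_get_port_parent_id,
	/* ... a few more callbacks omitted ... */
};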

Of course, just as most NICs support hot plugging, a switch driver modeled on a NIC can also be removed. For rocker this is rocker_remove(), which simply tears down all the resources it manages, so we won't go through it in detail.

Interrupts and DMA

Besides the per-port interfaces, rocker_probe() also registers interrupt handlers and DMA rings, functionality that every NIC driver has to implement. Partial code:

static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
...
	err = rocker_dma_rings_init(rocker);
	if (err)
		goto err_dma_rings_init;

	err = request_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_CMD),
			  rocker_cmd_irq_handler, 0,
			  rocker_driver_name, rocker);
	if (err) {
		dev_err(&pdev->dev, "cannot assign cmd irq\n");
		goto err_request_cmd_irq;
	}

	err = request_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_EVENT),
			  rocker_event_irq_handler, 0,
			  rocker_driver_name, rocker);
	if (err) {
		dev_err(&pdev->dev, "cannot assign event irq\n");
		goto err_request_event_irq;
	}
...
}

Here we have a DMA ring initialization, rocker_dma_rings_init(), and two interrupt handlers, rocker_cmd_irq_handler() and rocker_event_irq_handler(), which already looks a bit different from an ordinary NIC driver. On top of these two interrupts, rocker_port_open(), hooked up through rocker_port_netdev_ops (the netdev_ops mentioned above), performs another DMA ring initialization per port and registers two more interrupt handlers:

static int rocker_port_open(struct net_device *dev)
{
	struct rocker_port *rocker_port = netdev_priv(dev);
	int err;

	err = rocker_port_dma_rings_init(rocker_port);
	if (err)
		return err;

	err = request_irq(rocker_msix_tx_vector(rocker_port),
			  rocker_tx_irq_handler, 0,
			  rocker_driver_name, rocker_port);
	if (err) {
		netdev_err(rocker_port->dev, "cannot assign tx irq\n");
		goto err_request_tx_irq;
	}

	err = request_irq(rocker_msix_rx_vector(rocker_port),
			  rocker_rx_irq_handler, 0,
			  rocker_driver_name, rocker_port);
	if (err) {
		netdev_err(rocker_port->dev, "cannot assign rx irq\n");
		goto err_request_rx_irq;
	}

	err = rocker_world_port_open(rocker_port);
	if (err) {
		netdev_err(rocker_port->dev, "cannot open port in world\n");
		goto err_world_port_open;
	}

	napi_enable(&rocker_port->napi_tx);
	napi_enable(&rocker_port->napi_rx);
	if (!dev->proto_down)
		rocker_port_set_enable(rocker_port, true);
	netif_start_queue(dev);
	return 0;

err_world_port_open:
	free_irq(rocker_msix_rx_vector(rocker_port), rocker_port);
err_request_rx_irq:
	free_irq(rocker_msix_tx_vector(rocker_port), rocker_port);
err_request_tx_irq:
	rocker_port_dma_rings_fini(rocker_port);
	return err;
}

Wait, what is going on here? Why are there so many DMA rings and IRQs? Ordinary NICs are not this complicated!

Let's look at the code in more detail.

rocker_dma_rings_init() actually initializes two DMA rings, cmd_ring and event_ring, which serve as the DMA buffers for the two interrupt handlers below.

The rocker_cmd_irq_handler() interrupt handler walks the desc_info entries on cmd_ring and wakes up the corresponding wait queues.

static irqreturn_t rocker_cmd_irq_handler(int irq, void *dev_id)
{
	struct rocker *rocker = dev_id;
	const struct rocker_desc_info *desc_info;
	struct rocker_wait *wait;
	u32 credits = 0;

	spin_lock(&rocker->cmd_ring_lock);
	while ((desc_info = rocker_desc_tail_get(&rocker->cmd_ring))) {
		wait = rocker_desc_cookie_ptr_get(desc_info);
		if (wait->nowait) {
			rocker_desc_gen_clear(desc_info);
		} else {
			rocker_wait_wake_up(wait);
		}
		credits++;
	}
	spin_unlock(&rocker->cmd_ring_lock);
	rocker_dma_ring_credits_set(rocker, &rocker->cmd_ring, credits);

	return IRQ_HANDLED;
}

The actual command processing happens in rocker_cmd_exec(). It takes a desc_info from cmd_ring, initializes the wait state, calls the supplied prepare callback to fill in the descriptor, then kicks off the command (rocker_desc_head_set() makes the descriptor the new head, and the cmd interrupt fires when the device completes it), waits for completion via rocker_wait_event_timeout(), and finally calls the supplied process callback to finish the job.

int rocker_cmd_exec(struct rocker_port *rocker_port, bool nowait,
		    rocker_cmd_prep_cb_t prepare, void *prepare_priv,
		    rocker_cmd_proc_cb_t process, void *process_priv)
{
	struct rocker *rocker = rocker_port->rocker;
	struct rocker_desc_info *desc_info;
	struct rocker_wait *wait;
	unsigned long lock_flags;
	int err;

	spin_lock_irqsave(&rocker->cmd_ring_lock, lock_flags);

	desc_info = rocker_desc_head_get(&rocker->cmd_ring);
	if (!desc_info) {
		spin_unlock_irqrestore(&rocker->cmd_ring_lock, lock_flags);
		return -EAGAIN;
	}

	wait = rocker_desc_cookie_ptr_get(desc_info);
	rocker_wait_init(wait);
	wait->nowait = nowait;

	err = prepare(rocker_port, desc_info, prepare_priv);
	if (err) {
		spin_unlock_irqrestore(&rocker->cmd_ring_lock, lock_flags);
		return err;
	}

	rocker_desc_head_set(rocker, &rocker->cmd_ring, desc_info);

	spin_unlock_irqrestore(&rocker->cmd_ring_lock, lock_flags);

	if (nowait)
		return 0;

	if (!rocker_wait_event_timeout(wait, HZ / 10))
		return -EIO;

	err = rocker_desc_err(desc_info);
	if (err)
		return err;

	if (process)
		err = process(rocker_port, desc_info, process_priv);

	rocker_desc_gen_clear(desc_info);
	return err;
}

That is the execution path of a single command. In short: the prepare callback builds the desc_info, which carries the get/set command word, the port index and so on; rocker_desc_head_set() then tells the Rocker device to execute the command; when the device is done it writes the result back into desc_info and raises the interrupt; once rocker_wait_event_timeout() returns, rocker_desc_err() checks for errors and, if there were none, the process callback parses the returned desc_info and copies the result into the caller's data.
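
As a concrete example of a prepare callback, here is an approximate reconstruction of rocker_cmd_get_port_settings_prep() (written from memory of rocker_main.c, so treat it as a sketch and check the source for the exact TLV layout): it simply serializes the command word and the physical port number into desc_info as TLVs.

static int
rocker_cmd_get_port_settings_prep(const struct rocker_port *rocker_port,
				  struct rocker_desc_info *desc_info,
				  void *priv)
{
	struct rocker_tlv *cmd_info;

	/* Top-level TLV: which command the device should run. */
	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE,
			       ROCKER_TLV_CMD_TYPE_GET_PORT_SETTINGS))
		return -EMSGSIZE;
	/* Nested TLV: the command arguments, here just the physical port. */
	cmd_info = rocker_tlv_nest_start(desc_info, ROCKER_TLV_CMD_INFO);
	if (!cmd_info)
		return -EMSGSIZE;
	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_CMD_PORT_SETTINGS_PPORT,
			       rocker_port->pport))
		return -EMSGSIZE;
	rocker_tlv_nest_end(desc_info, cmd_info);
	return 0;
}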

In the rocker code, these command wrappers back the callbacks registered through ethtool_ops:

static int
rocker_cmd_get_port_settings_ethtool(struct rocker_port *rocker_port,
				     struct ethtool_link_ksettings *ecmd)
{
	return rocker_cmd_exec(rocker_port, false,
			       rocker_cmd_get_port_settings_prep, NULL,
			       rocker_cmd_get_port_settings_ethtool_proc,
			       ecmd);
}

static int rocker_cmd_get_port_settings_macaddr(struct rocker_port *rocker_port,
						unsigned char *macaddr)
{
	return rocker_cmd_exec(rocker_port, false,
			       rocker_cmd_get_port_settings_prep, NULL,
			       rocker_cmd_get_port_settings_macaddr_proc,
			       macaddr);
}

static int rocker_cmd_get_port_settings_mode(struct rocker_port *rocker_port,
					     u8 *p_mode)
{
	return rocker_cmd_exec(rocker_port, false,
			       rocker_cmd_get_port_settings_prep, NULL,
			       rocker_cmd_get_port_settings_mode_proc, p_mode);
}

static int
rocker_cmd_set_port_settings_ethtool(struct rocker_port *rocker_port,
				     const struct ethtool_link_ksettings *ecmd)
{
	struct ethtool_link_ksettings copy_ecmd;

	memcpy(&copy_ecmd, ecmd, sizeof(copy_ecmd));

	return rocker_cmd_exec(rocker_port, false,
			       rocker_cmd_set_port_settings_ethtool_prep,
			       &copy_ecmd, NULL, NULL);
}

Note that the set functions above do not register a process callback: the configuration is already pushed down by prepare, so once the interrupt completes there is nothing left to do.

The same applies to the wrappers backing the callbacks registered through netdev_ops; their set functions have no process step either:

static int rocker_cmd_set_port_settings_macaddr(struct rocker_port *rocker_port,
						unsigned char *macaddr)
{
	return rocker_cmd_exec(rocker_port, false,
			       rocker_cmd_set_port_settings_macaddr_prep,
			       macaddr, NULL, NULL);
}

static int rocker_cmd_set_port_settings_mtu(struct rocker_port *rocker_port,
					    int mtu)
{
	return rocker_cmd_exec(rocker_port, false,
			       rocker_cmd_set_port_settings_mtu_prep,
			       &mtu, NULL, NULL);
}

int rocker_port_set_learning(struct rocker_port *rocker_port,
			     bool learning)
{
	return rocker_cmd_exec(rocker_port, false,
			       rocker_cmd_set_port_learning_prep,
			       &learning, NULL, NULL);
}

The rocker_event_irq_handler() interrupt handler walks the desc_info entries on event_ring and hands each one to rocker_event_process() for further handling.

static irqreturn_t rocker_event_irq_handler(int irq, void *dev_id)
{
	struct rocker *rocker = dev_id;
	const struct pci_dev *pdev = rocker->pdev;
	const struct rocker_desc_info *desc_info;
	u32 credits = 0;
	int err;

	while ((desc_info = rocker_desc_tail_get(&rocker->event_ring))) {
		err = rocker_desc_err(desc_info);
		if (err) {
			dev_err(&pdev->dev, "event desc received with err %d\n",
				err);
		} else {
			err = rocker_event_process(rocker, desc_info);
			if (err)
				dev_err(&pdev->dev, "event processing failed with err %d\n",
					err);
		}
		rocker_desc_gen_clear(desc_info);
		rocker_desc_head_set(rocker, &rocker->event_ring, desc_info);
		credits++;
	}
	rocker_dma_ring_credits_set(rocker, &rocker->event_ring, credits);

	return IRQ_HANDLED;
}

rocker_event_process() parses the desc_info and, depending on the type, handles link-change events and MAC/VLAN events.

static int rocker_event_process(const struct rocker *rocker,
				const struct rocker_desc_info *desc_info)
{
	const struct rocker_tlv *attrs[ROCKER_TLV_EVENT_MAX + 1];
	const struct rocker_tlv *info;
	u16 type;

	rocker_tlv_parse_desc(attrs, ROCKER_TLV_EVENT_MAX, desc_info);
	if (!attrs[ROCKER_TLV_EVENT_TYPE] ||
	    !attrs[ROCKER_TLV_EVENT_INFO])
		return -EIO;

	type = rocker_tlv_get_u16(attrs[ROCKER_TLV_EVENT_TYPE]);
	info = attrs[ROCKER_TLV_EVENT_INFO];

	switch (type) {
	case ROCKER_TLV_EVENT_TYPE_LINK_CHANGED:
		return rocker_event_link_change(rocker, info);
	case ROCKER_TLV_EVENT_TYPE_MAC_VLAN_SEEN:
		return rocker_event_mac_vlan_seen(rocker, info);
	}

	return -EOPNOTSUPP;
}

As we can see, all configuration and status queries go through cmd_ring, while all events originate from event_ring; in other words, rocker emulates a switch ASIC by means of cmd_ring and event_ring. So who actually puts data into cmd_ring and event_ring? Keep reading; we will reveal the answer later.

rocker_port_dma_rings_init() actually initializes two DMA rings, tx_ring and rx_ring, which serve as the DMA buffers for the two interrupt handlers below.

The rocker_tx_irq_handler() interrupt handler calls napi_schedule() to trigger rocker_port_poll_tx(), the callback registered on napi_tx.

static irqreturn_t rocker_tx_irq_handler(int irq, void *dev_id)
{
	struct rocker_port *rocker_port = dev_id;

	napi_schedule(&rocker_port->napi_tx);
	return IRQ_HANDLED;
}

rocker_port_poll_tx() walks the tx_ring: at this point the packets have already been transmitted, so it frees each skb and, once the cleanup is done, calls napi_complete() to tell NAPI to stop polling.

static int rocker_port_poll_tx(struct napi_struct *napi, int budget)
{
	struct rocker_port *rocker_port = rocker_port_napi_tx_get(napi);
	const struct rocker *rocker = rocker_port->rocker;
	const struct rocker_desc_info *desc_info;
	u32 credits = 0;
	int err;

	/* Cleanup tx descriptors */
	while ((desc_info = rocker_desc_tail_get(&rocker_port->tx_ring))) {
		struct sk_buff *skb;

		err = rocker_desc_err(desc_info);
		if (err && net_ratelimit())
			netdev_err(rocker_port->dev, "tx desc received with err %d\n",
				   err);
		rocker_tx_desc_frags_unmap(rocker_port, desc_info);

		skb = rocker_desc_cookie_ptr_get(desc_info);
		if (err == 0) {
			rocker_port->dev->stats.tx_packets++;
			rocker_port->dev->stats.tx_bytes += skb->len;
		} else {
			rocker_port->dev->stats.tx_errors++;
		}

		dev_kfree_skb_any(skb);
		credits++;
	}

	if (credits && netif_queue_stopped(rocker_port->dev))
		netif_wake_queue(rocker_port->dev);

	napi_complete(napi);
	rocker_dma_ring_credits_set(rocker, &rocker_port->tx_ring, credits);

	return 0;
}

That was the transmit-completion side; the actual transmit path is rocker_port_xmit(), the ndo_start_xmit callback. It grabs a desc_info for the skb, fills it in and places it on tx_ring, completing the transmission.

static netdev_tx_t rocker_port_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct rocker_port *rocker_port = netdev_priv(dev);
	struct rocker *rocker = rocker_port->rocker;
	struct rocker_desc_info *desc_info;
	struct rocker_tlv *frags;
	int i;
	int err;

	desc_info = rocker_desc_head_get(&rocker_port->tx_ring);
	if (unlikely(!desc_info)) {
		if (net_ratelimit())
			netdev_err(dev, "tx ring full when queue awake\n");
		return NETDEV_TX_BUSY;
	}

	rocker_desc_cookie_ptr_set(desc_info, skb);

	frags = rocker_tlv_nest_start(desc_info, ROCKER_TLV_TX_FRAGS);
	if (!frags)
		goto out;
	err = rocker_tx_desc_frag_map_put(rocker_port, desc_info,
					  skb->data, skb_headlen(skb));
	if (err)
		goto nest_cancel;
	if (skb_shinfo(skb)->nr_frags > ROCKER_TX_FRAGS_MAX) {
		err = skb_linearize(skb);
		if (err)
			goto unmap_frags;
	}

	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

		err = rocker_tx_desc_frag_map_put(rocker_port, desc_info,
						  skb_frag_address(frag),
						  skb_frag_size(frag));
		if (err)
			goto unmap_frags;
	}
	rocker_tlv_nest_end(desc_info, frags);

	rocker_desc_gen_clear(desc_info);
	rocker_desc_head_set(rocker, &rocker_port->tx_ring, desc_info);

	desc_info = rocker_desc_head_get(&rocker_port->tx_ring);
	if (!desc_info)
		netif_stop_queue(dev);

	return NETDEV_TX_OK;

unmap_frags:
	rocker_tx_desc_frags_unmap(rocker_port, desc_info);
nest_cancel:
	rocker_tlv_nest_cancel(desc_info, frags);
out:
	dev_kfree_skb(skb);
	dev->stats.tx_dropped++;

	return NETDEV_TX_OK;
}

The rocker_rx_irq_handler() interrupt handler calls napi_schedule() to trigger rocker_port_poll_rx(), the callback registered on napi_rx.

static irqreturn_t rocker_rx_irq_handler(int irq, void *dev_id)
{
	struct rocker_port *rocker_port = dev_id;

	napi_schedule(&rocker_port->napi_rx);
	return IRQ_HANDLED;
}

rocker_port_poll_rx() processes up to budget descriptors from rx_ring, handing each desc_info to rocker_port_rx_proc(); if it consumed less than the full budget it calls napi_complete_done() to tell NAPI to stop polling.

static int rocker_port_poll_rx(struct napi_struct *napi, int budget)
{
	struct rocker_port *rocker_port = rocker_port_napi_rx_get(napi);
	const struct rocker *rocker = rocker_port->rocker;
	struct rocker_desc_info *desc_info;
	u32 credits = 0;
	int err;

	/* Process rx descriptors */
	while (credits < budget &&
	       (desc_info = rocker_desc_tail_get(&rocker_port->rx_ring))) {
		err = rocker_desc_err(desc_info);
		if (err) {
			if (net_ratelimit())
				netdev_err(rocker_port->dev, "rx desc received with err %d\n",
					   err);
		} else {
			err = rocker_port_rx_proc(rocker, rocker_port,
						  desc_info);
			if (err && net_ratelimit())
				netdev_err(rocker_port->dev, "rx processing failed with err %d\n",
					   err);
		}
		if (err)
			rocker_port->dev->stats.rx_errors++;

		rocker_desc_gen_clear(desc_info);
		rocker_desc_head_set(rocker, &rocker_port->rx_ring, desc_info);
		credits++;
	}

	if (credits < budget)
		napi_complete_done(napi, credits);

	rocker_dma_ring_credits_set(rocker, &rocker_port->rx_ring, credits);

	return credits;
}

rocker_port_rx_proc() fetches the skb attached to the desc_info, fills it in and hands it to the kernel network stack via netif_receive_skb().

static int rocker_port_rx_proc(const struct rocker *rocker,
			       const struct rocker_port *rocker_port,
			       struct rocker_desc_info *desc_info)
{
	const struct rocker_tlv *attrs[ROCKER_TLV_RX_MAX + 1];
	struct sk_buff *skb = rocker_desc_cookie_ptr_get(desc_info);
	size_t rx_len;
	u16 rx_flags = 0;

	if (!skb)
		return -ENOENT;

	rocker_tlv_parse_desc(attrs, ROCKER_TLV_RX_MAX, desc_info);
	if (!attrs[ROCKER_TLV_RX_FRAG_LEN])
		return -EINVAL;
	if (attrs[ROCKER_TLV_RX_FLAGS])
		rx_flags = rocker_tlv_get_u16(attrs[ROCKER_TLV_RX_FLAGS]);

	rocker_dma_rx_ring_skb_unmap(rocker, attrs);

	rx_len = rocker_tlv_get_u16(attrs[ROCKER_TLV_RX_FRAG_LEN]);
	skb_put(skb, rx_len);
	skb->protocol = eth_type_trans(skb, rocker_port->dev);

	if (rx_flags & ROCKER_RX_FLAGS_FWD_OFFLOAD)
		skb->offload_fwd_mark = 1;

	rocker_port->dev->stats.rx_packets++;
	rocker_port->dev->stats.rx_bytes += skb->len;

	netif_receive_skb(skb);

	return rocker_dma_rx_ring_skb_alloc(rocker_port, desc_info);
}

With the analysis above we now know that the cmd and event interrupts carry interface configuration commands, while tx and rx handle packet transmission and reception for the chip.

Now let's answer the question raised earlier. If you search for "rocker driver", the first hit is a document introducing it: "Rocker: switchdev prototyping vehicle". It describes the main design of the rocker device: in short, Rocker is a switch ASIC emulated by QEMU, with cmd and event carrying device configuration and device event reporting respectively, and tx and rx carrying packet transmission and reception. If you are curious about the Rocker device itself, have a look at the QEMU source; according to that document, Rocker ultimately emulates the switch ports on top of the tun/tap mechanism.

Actually, everything so far has been about the Rocker driver itself; our protagonist, switchdev, has not appeared yet. So where is it? rocker_probe() contains the following code:

	/* Only FIBs pointing to our own netdevs are programmed into
	 * the device, so no need to pass a callback.
	 */
	rocker->fib_nb.notifier_call = rocker_router_fib_event;
	err = register_fib_notifier(&init_net, &rocker->fib_nb, NULL, NULL);
	if (err)
		goto err_register_fib_notifier;

	err = register_switchdev_notifier(&rocker_switchdev_notifier);
	if (err) {
		dev_err(&pdev->dev, "Failed to register switchdev notifier\n");
		goto err_register_switchdev_notifier;
	}

	nb = &rocker_switchdev_blocking_notifier;
	err = register_switchdev_blocking_notifier(nb);
	if (err) {
		dev_err(&pdev->dev, "Failed to register switchdev blocking notifier\n");
		goto err_register_switchdev_blocking_notifier;
	}

rocker_switchdev_notifier and rocker_switchdev_blocking_notifier are defined as follows:

static struct notifier_block rocker_switchdev_notifier = {
	.notifier_call = rocker_switchdev_event,
};

static struct notifier_block rocker_switchdev_blocking_notifier = {
	.notifier_call = rocker_switchdev_blocking_event,
};

Concretely, rocker_router_fib_event() handles FIB change events, which include:

enum fib_event_type {
	FIB_EVENT_ENTRY_REPLACE,
	FIB_EVENT_ENTRY_APPEND,
	FIB_EVENT_ENTRY_ADD,
	FIB_EVENT_ENTRY_DEL,
	FIB_EVENT_RULE_ADD,
	FIB_EVENT_RULE_DEL,
	FIB_EVENT_NH_ADD,
	FIB_EVENT_NH_DEL,
	FIB_EVENT_VIF_ADD,
	FIB_EVENT_VIF_DEL,
};

rocker_switchdev_event() handles the generic switchdev events listed below (a sketch of the typical shape of such a handler follows the list):

enum switchdev_notifier_type {
	SWITCHDEV_FDB_ADD_TO_BRIDGE = 1,
	SWITCHDEV_FDB_DEL_TO_BRIDGE,
	SWITCHDEV_FDB_ADD_TO_DEVICE,
	SWITCHDEV_FDB_DEL_TO_DEVICE,
	SWITCHDEV_FDB_OFFLOADED,
	SWITCHDEV_FDB_FLUSH_TO_BRIDGE,

	SWITCHDEV_PORT_OBJ_ADD, /* Blocking. */
	SWITCHDEV_PORT_OBJ_DEL, /* Blocking. */
	SWITCHDEV_PORT_ATTR_SET, /* May be blocking . */

	SWITCHDEV_VXLAN_FDB_ADD_TO_BRIDGE,
	SWITCHDEV_VXLAN_FDB_DEL_TO_BRIDGE,
	SWITCHDEV_VXLAN_FDB_ADD_TO_DEVICE,
	SWITCHDEV_VXLAN_FDB_DEL_TO_DEVICE,
	SWITCHDEV_VXLAN_FDB_OFFLOADED,

	SWITCHDEV_BRPORT_OFFLOADED,
	SWITCHDEV_BRPORT_UNOFFLOADED,
};
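
As a rough, hypothetical sketch of such a handler (all example_* names are made up; switchdev_notifier_info_to_dev(), the notifier types and the fdb_info fields come from include/net/switchdev.h), the usual pattern is: recover the net_device from the notifier info, check that the port belongs to this driver, then dispatch on the event type. FDB events arrive in atomic context, so a real driver defers the hardware programming to a work item.

#include <linux/netdevice.h>
#include <linux/notifier.h>
#include <net/switchdev.h>

extern const struct net_device_ops example_port_netdev_ops;

static bool example_port_dev_check(const struct net_device *dev)
{
	/* A port belongs to this driver iff the netdev uses the driver's
	 * net_device_ops; rocker_port_dev_check() uses the same trick. */
	return dev->netdev_ops == &example_port_netdev_ops;
}

static int example_switchdev_event(struct notifier_block *nb,
				   unsigned long event, void *ptr)
{
	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
	struct switchdev_notifier_fdb_info *fdb_info = ptr;

	if (!example_port_dev_check(dev))
		return NOTIFY_DONE;

	switch (event) {
	case SWITCHDEV_FDB_ADD_TO_DEVICE:
	case SWITCHDEV_FDB_DEL_TO_DEVICE:
		/* Atomic context: a real driver queues a work item here and
		 * programs fdb_info->addr / fdb_info->vid into hardware from
		 * process context. */
		pr_info("FDB %s %pM vid %u\n",
			event == SWITCHDEV_FDB_ADD_TO_DEVICE ? "add" : "del",
			fdb_info->addr, fdb_info->vid);
		return NOTIFY_OK;
	default:
		return NOTIFY_DONE;
	}
}

static struct notifier_block example_switchdev_nb = {
	.notifier_call = example_switchdev_event,
};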

rocker_switchdev_blocking_event() handles the port-related blocking events from the same switchdev_notifier_type enum above, namely SWITCHDEV_PORT_OBJ_ADD, SWITCHDEV_PORT_OBJ_DEL and SWITCHDEV_PORT_ATTR_SET.

Notifications for these three classes of events are generated by the subsystems concerned, such as bridge and fib, by calling call_fib_notifiers(), call_switchdev_notifiers() and call_switchdev_blocking_notifiers() respectively; call_fib_notifiers() is the FIB's own notification mechanism, while the latter two are implemented by switchdev. These calls are not limited to the bridge and FIB code; some NIC drivers also trigger them, for example the call sites below (from a cross-reference search of the kernel source):

drivers/net/ethernet/marvell/prestera/prestera_switchdev.c
line 1212
line 1803
line 1807
drivers/net/ethernet/mellanox/mlx5/core/esw/bridge.c, line 32
drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
line 10621
line 10669
drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
line 3016
line 3042
line 3416
line 3429
line 3574
line 3595
drivers/net/ethernet/microchip/lan966x/lan966x_mac.c, line 270
drivers/net/ethernet/microchip/sparx5/sparx5_mactable.c, line 286
drivers/net/ethernet/rocker/rocker_main.c, line 2710
drivers/net/ethernet/rocker/rocker_ofdpa.c, line 1834
drivers/net/ethernet/ti/am65-cpsw-switchdev.c, line 362
drivers/net/ethernet/ti/cpsw_switchdev.c, line 372
drivers/net/vxlan/vxlan_core.c, line 318
drivers/s390/net/qeth_l2_main.c
line 292
line 652
line 658
net/bridge/br_switchdev.c
line 100
line 166
line 170
net/dsa/port.c, line 51
net/dsa/user.c, line 3553
net/switchdev/switchdev.c, line 517

The reason for this is that switchdev defines many notification types. After a device has offloaded an FDB entry, it can send SWITCHDEV_FDB_OFFLOADED to inform other modules; any module that cares about the event handles it, and the rest simply ignore it. For example, br_switchdev_event() in the bridge module handles SWITCHDEV_FDB_OFFLOADED, because that is the only way the bridge learns that the FDB entry really has been offloaded. Likewise, when the device learns a dynamic MAC address it notifies the bridge with a SWITCHDEV_FDB_ADD_TO_BRIDGE event; the bridge marks the entry as learned in hardware, and subsequent aging also relies on updates from hardware.
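
As a minimal sketch of that second case (the function example_hw_fdb_learned() and its calling context are hypothetical; the struct fields and call_switchdev_notifiers() come from include/net/switchdev.h), a driver whose hardware has just learned a MAC/VLAN pair could inform the bridge roughly like this:

#include <linux/netdevice.h>
#include <net/switchdev.h>

/* Hypothetical helper: called when the switch has learned (mac, vid) on the
 * port represented by port_dev. */
static void example_hw_fdb_learned(struct net_device *port_dev,
				   const unsigned char *mac, u16 vid)
{
	struct switchdev_notifier_fdb_info info = {
		.addr = mac,
		.vid = vid,
		.offloaded = true,	/* the entry lives in hardware */
	};

	/* br_switchdev_event() in the bridge picks this up, adds or refreshes
	 * the FDB entry and marks it as externally learned. */
	call_switchdev_notifiers(SWITCHDEV_FDB_ADD_TO_BRIDGE,
				 port_dev, &info.info, NULL);
}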

switchdev also provides a set of generic helpers for vendors implementing their own drivers, such as switchdev_handle_fdb_event_to_device(), switchdev_handle_port_obj_add(), switchdev_handle_port_attr_set() and switchdev_bridge_port_offload(). Internally these helpers recurse through bridge and LAG upper devices, invoke the driver-specific callbacks passed to them, or take care of sending the offload-completed notifications.

This completes the main code walk-through of the switchdev driver framework. We never actually said where the rocker driver offloads its entries to; if you are interested, take a look at rocker_ofdpa.c to see what rocker uses internally. Since it has little to do with switchdev itself, we won't go into it here.

The main interaction flow of the code covered in this article:

Green arrows are the registration flows, blue arrows the notification flows; note that the blue arrows work in both directions.

References

  1. Linux Switchdev the Mellanox way, an example of a switch built on the switchdev model with Mellanox silicon
  2. Switchdev: unleashing the networking power of open-source Linux, an introduction to the switchdev framework, the DENT operating system, and a use case
  3. switchdev in Linux 4.0, which draws the Linux switch software/hardware flow diagrams and shows a Linux Box example of hardware offload