Monday 27 May 2013

What is epoll? epoll vs poll vs select, and how to implement a UDP server in Linux using epoll



As the number of Internet users grows day by day, supporting them requires more efficient HTTP servers.

A common problem in HTTP server scalability is how to ensure that the server handles a large number of connections simultaneously without degrading the performance.

An event-driven approach is often implemented in high-performance network servers to multiplex a large number of concurrent connections over a few server processes. 

In event-driven servers it is important that the server focuses on connections that can be serviced without blocking its main process.

What is EPOLL?
===========
epoll - I/O event notification facility

Select Vs poll Vs Epoll
==================
The epoll event mechanism is designed to scale to larger numbers of connections than select and poll.

One of the problems with select and poll is that in a single call they must both inform the kernel of all of the events of interest and obtain new events.
This can result in large overheads, particularly in environments with large numbers of connections and relatively few new events occurring.

If your server application is network-intensive (for example, thousands of concurrent connections and/or a high connection rate), this overhead is something you have to take seriously.
This situation is often called the c10k problem. With select() or poll(), a network server under such load spends most of its CPU cycles managing descriptors rather than doing useful work.

c10k Problem
===========
Suppose that there are 10,000 concurrent connections. Typically, only a small number of their file descriptors, say 10, are ready to read at any moment.
The remaining 9,990 file descriptors are still copied and scanned on every select()/poll() call, for no benefit.
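
To make that cost concrete, here is a minimal sketch using poll(). The names count_ready_with_poll, demo_poll_scan and NUM_PIPES are hypothetical, and 8 pipes stand in for the thousands of clients of a real server. Every call hands the whole array to the kernel and then walks the whole array again, even though only one descriptor has data.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

#define NUM_PIPES 8   /* illustrative; stands in for thousands of clients */

/* Hypothetical helper: count readable descriptors with poll().
 * All n entries are handed to the kernel and all n are scanned
 * afterwards, however few of them are actually ready. */
static int count_ready_with_poll(const int *fds, int n)
{
  struct pollfd pfds[NUM_PIPES];
  int i, ready = 0;

  assert(n > 0 && n <= NUM_PIPES);

  for (i = 0; i < n; i++)              /* O(n): rebuild the interest list */
  {
    pfds[i].fd = fds[i];
    pfds[i].events = POLLIN;
    pfds[i].revents = 0;
  }

  if (poll(pfds, (nfds_t)n, 0) == -1)  /* timeout 0: return immediately */
    return -1;

  for (i = 0; i < n; i++)              /* O(n): scan every entry */
  {
    if (pfds[i].revents & POLLIN)
      ready++;
  }

  return ready;
}

/* Demo: NUM_PIPES idle pipes, data written to exactly one of them. */
int demo_poll_scan(void)
{
  int readEnds[NUM_PIPES], writeEnds[NUM_PIPES], p[2], i, ready;

  for (i = 0; i < NUM_PIPES; i++)
  {
    if (pipe(p) == -1)
      return -1;
    readEnds[i] = p[0];
    writeEnds[i] = p[1];
  }

  if (write(writeEnds[3], "x", 1) != 1)
    return -1;

  ready = count_ready_with_poll(readEnds, NUM_PIPES);

  for (i = 0; i < NUM_PIPES; i++)
  {
    close(readEnds[i]);
    close(writeEnds[i]);
  }
  return ready;
}
```

Calling demo_poll_scan() reports a single ready descriptor, yet both loops in the helper still ran over all eight entries.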

Another example:

The cost of epoll is closer to the number of file descriptors that actually have events on them.
If you're monitoring 200 file descriptors, but only 100 of them have events on them, then you're (very roughly) only paying for those 100 active file descriptors.
This is where epoll tends to offer one of its major advantages over select. If you have a thousand clients that are mostly idle,
then with select you're still paying for all one thousand of them. With epoll, however, it's as if you had only a few: you're only paying for the ones that are active at any given time.

All this means that epoll will lead to less CPU usage for most workloads.
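
A small sketch of the same idea with epoll, under the assumption that 8 pipes are enough to illustrate it (demo_epoll_ready_only and WATCHED are hypothetical names). Each descriptor is registered once, and epoll_wait() hands back only the descriptors that became readable.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

#define WATCHED 8   /* illustrative number of mostly idle descriptors */

/* Demo: register WATCHED pipes once, make two of them readable, and
 * see that epoll_wait() reports only those two. */
int demo_epoll_ready_only(void)
{
  int readEnds[WATCHED], writeEnds[WATCHED], p[2];
  struct epoll_event ev = { 0 }, events[WATCHED];
  int epollFd, i, numEvents;

  epollFd = epoll_create(1);          /* size is ignored but must be > 0 */
  if (epollFd == -1)
    return -1;

  for (i = 0; i < WATCHED; i++)
  {
    if (pipe(p) == -1)
      return -1;
    readEnds[i] = p[0];
    writeEnds[i] = p[1];

    /* Registration happens once, not on every wait. */
    ev.events = EPOLLIN;
    ev.data.fd = readEnds[i];
    if (epoll_ctl(epollFd, EPOLL_CTL_ADD, readEnds[i], &ev) == -1)
      return -1;
  }

  if (write(writeEnds[1], "a", 1) != 1 || write(writeEnds[6], "b", 1) != 1)
    return -1;

  /* Only the ready descriptors are copied into events[]. */
  numEvents = epoll_wait(epollFd, events, WATCHED, 0);

  for (i = 0; i < numEvents; i++)
  {
    assert(events[i].data.fd == readEnds[1] ||
           events[i].data.fd == readEnds[6]);
  }

  for (i = 0; i < WATCHED; i++)
  {
    close(readEnds[i]);
    close(writeEnds[i]);
  }
  close(epollFd);
  return numEvents;
}
```

Here the loop over the results runs twice, not eight times; the six idle pipes cost nothing at wait time.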

Time Complexity
=============

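select -> O(n)   epoll -> O(1)

A select() call is O(n) in the number of watched file descriptors, whereas the cost of epoll_wait() depends on how many descriptors are ready, not on how many are watched. This is why epoll is usually described as O(1), and why it scales well as the number of watched file descriptors increases.
select does a linear scan through the list of watched file descriptors, which causes its O(n) behaviour, whereas epoll uses callbacks in the kernel file structure to put a descriptor on a ready list when an event occurs.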

Another fundamental difference of epoll is that it can be used in an edge-triggered, as opposed to level-triggered, fashion.
This means that you receive “hints” when the kernel believes the file descriptor has become ready for I/O, as opposed to being told “I/O can be carried out on this file descriptor”.

The number of clients is a limitation of select
===============================================
With select(), the maximum number of clients a server can handle is 1024 (1k).

In other words, the server handled only 1024 clients, after which new connections failed.
Raising the per-process limit on open files from 1024 to 100000 did not help: connections still failed at 1024.

select limitation

select fails beyond 1024 fds because FD_SETSIZE is fixed at 1024.
As a natural progression, poll was tried next to get around the limit on open fds.

poll limitation
poll solves the max fd issue. But as the number of concurrent clients increased, performance dropped drastically.
poll does O(n) work internally on every call, so performance drops as the number of fds grows.

epoll
epoll solved both problems and gave much better performance.

Triggering modes
=============

  • Edge Triggered Mode
  •  Level Triggered Mode

Epoll provides both edge-triggered and level-triggered modes. 

In edge-triggered mode, a call to epoll_wait will return only when a new event is enqueued with the epoll object, while in level-triggered mode, epoll_wait will return as long as the condition holds.

For instance, if a pipe registered with epoll has received data, a call to epoll_wait will return, signaling the presence of data to be read.
Suppose the reader consumed only part of the data from the buffer. In level-triggered mode, further calls to epoll_wait will return immediately, as long as the pipe's buffer contains data to be read.
In edge-triggered mode, however, epoll_wait will return only once new data is written to the pipe.
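
The pipe scenario can be tried directly. This is a minimal sketch, with second_wait_after_partial_read as a hypothetical helper name: it writes two bytes, reads one after the first notification, and returns what a second, non-blocking epoll_wait() reports.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper. extraFlags is 0 for level-triggered or EPOLLET
 * for edge-triggered. Returns the result of the second epoll_wait(). */
int second_wait_after_partial_read(uint32_t extraFlags)
{
  int p[2], epollFd, first, second;
  struct epoll_event ev = { 0 }, out;
  char byte;

  if (pipe(p) == -1)
    return -1;
  epollFd = epoll_create(1);          /* size is ignored but must be > 0 */
  if (epollFd == -1)
    return -1;

  ev.events = EPOLLIN | extraFlags;
  ev.data.fd = p[0];
  if (epoll_ctl(epollFd, EPOLL_CTL_ADD, p[0], &ev) == -1)
    return -1;

  if (write(p[1], "ab", 2) != 2)
    return -1;

  first = epoll_wait(epollFd, &out, 1, 0);   /* new data: always reported */
  assert(first == 1);

  if (read(p[0], &byte, 1) != 1)             /* leave one byte unread */
    return -1;

  second = epoll_wait(epollFd, &out, 1, 0);  /* timeout 0: never blocks */

  close(p[0]);
  close(p[1]);
  close(epollFd);
  return second;
}
```

With extraFlags set to 0 the second wait still reports the pipe, because one byte is left in it. With EPOLLET it reports nothing, because no new data has been written since the first notification.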

To understand this better:

When an FD becomes read or write ready, you might not necessarily want to read (or write) all the data immediately.

Level-triggered epoll will keep nagging you as long as the FD remains ready, whereas edge-triggered epoll tells you once and won't bother you again until the FD becomes ready anew, which is only guaranteed after you have hit an EAGAIN
(so it's more complicated to code around, but can be more efficient depending on what you need to do).

Say you're writing from a resource to an FD. If you register your interest for that FD becoming write ready as level-triggered, you'll get constant notification that the FD is still ready for writing.
If the resource isn't yet available, that's a waste of a wake-up, because you can't write any more anyway.

If you were to add it as edge-triggered instead, you'd get notification that the FD was write ready once, then when the other resource becomes ready you write as much as you can.
Then if write(2) returns EAGAIN, you stop writing and wait for the next notification.

The same applies for reading, because you might not want to pull all the data into user-space before you're ready to do whatever you want to do with it (thus having to buffer it, and so on).
With edge-triggered epoll you get told when it's ready to read, and then can remember that and do the actual reading "as and when".
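
The write side of this pattern can be sketched as follows. The helper write_until_eagain and the demo around it are hypothetical names; the demo assumes a pipe whose buffer is smaller than 1 MiB, which is the usual Linux default. The idea is to write as much as the descriptor accepts and stop cleanly on EAGAIN, leaving the rest for the next EPOLLOUT notification.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write from src until everything is sent or the
 * descriptor would block. Returns bytes written, or -1 on a real error.
 * On a short count the caller waits for the next EPOLLOUT event. */
long write_until_eagain(int fd, const char *src, size_t len)
{
  size_t done = 0;
  ssize_t n;

  assert(src != NULL);

  while (done < len)
  {
    n = write(fd, src + done, len - done);
    if (n > 0)
      done += (size_t)n;
    else if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
      break;                /* kernel buffer full: stop, wait for epoll */
    else if (n == -1 && errno == EINTR)
      continue;             /* interrupted by a signal: retry */
    else
      return -1;
  }
  return (long)done;
}

/* Demo: offer a non-blocking pipe more data than it can hold.
 * Returns 1 when some, but not all, of the payload was accepted. */
int demo_write_stops_at_eagain(void)
{
  static char payload[1 << 20];       /* 1 MiB, assumed larger than the pipe */
  int p[2], flags;
  long sent;

  if (pipe(p) == -1)
    return -1;

  flags = fcntl(p[1], F_GETFL, 0);
  if (flags == -1 || fcntl(p[1], F_SETFL, flags | O_NONBLOCK) == -1)
    return -1;

  memset(payload, 0x31, sizeof(payload));
  sent = write_until_eagain(p[1], payload, sizeof(payload));

  close(p[0]);
  close(p[1]);

  return sent > 0 && (size_t)sent < sizeof(payload);
}
```

A real server would keep the unsent remainder (src + done) around and resume from there when epoll reports the descriptor writable again.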

EPOLL SYSTEM Calls
==================

The Epoll interface consists of three system calls:

int epoll_create(int size);

Creates an epoll object and returns its file descriptor. size is obsolete since kernel 2.6.8 but must be greater than zero for backwards compatibility.

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

Controls (configures) which file descriptors are watched by this object, and for which events. op can be EPOLL_CTL_ADD, EPOLL_CTL_MOD or EPOLL_CTL_DEL.

int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

Waits for any of the events registered with epoll_ctl, until at least one occurs or the timeout elapses. Returns the events that occurred in events, up to maxevents at once.
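
As a quick illustration of how the three calls fit together, the sketch below walks one descriptor through EPOLL_CTL_ADD, EPOLL_CTL_MOD and EPOLL_CTL_DEL and checks what a non-blocking epoll_wait() reports after each step. The function name demo_epoll_ctl_ops and the tag values 1 and 2 stored in data.u32 are assumptions made for the example.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: returns 0 if every step behaves as described,
 * otherwise the number of the step that did not (-1 on setup failure). */
int demo_epoll_ctl_ops(void)
{
  int p[2], epollFd, n;
  struct epoll_event ev = { 0 }, out;

  if (pipe(p) == -1 || write(p[1], "x", 1) != 1)
    return -1;                        /* read end p[0] is now readable */

  epollFd = epoll_create(1);          /* size is ignored but must be > 0 */
  if (epollFd == -1)
    return -1;

  /* Step 1, EPOLL_CTL_ADD: watch p[0] for input, tagged with 1. */
  ev.events = EPOLLIN;
  ev.data.u32 = 1;
  if (epoll_ctl(epollFd, EPOLL_CTL_ADD, p[0], &ev) == -1)
    return 1;
  n = epoll_wait(epollFd, &out, 1, 0);
  if (n != 1 || out.data.u32 != 1)
    return 1;
  assert(out.events & EPOLLIN);

  /* Step 2, EPOLL_CTL_MOD: same fd, new settings (here a new tag). */
  ev.data.u32 = 2;
  if (epoll_ctl(epollFd, EPOLL_CTL_MOD, p[0], &ev) == -1)
    return 2;
  n = epoll_wait(epollFd, &out, 1, 0);
  if (n != 1 || out.data.u32 != 2)
    return 2;

  /* Step 3, EPOLL_CTL_DEL: stop watching; nothing is reported now. */
  if (epoll_ctl(epollFd, EPOLL_CTL_DEL, p[0], &ev) == -1)
    return 3;
  n = epoll_wait(epollFd, &out, 1, 0);
  if (n != 0)
    return 3;

  close(p[0]);
  close(p[1]);
  close(epollFd);
  return 0;
}
```

The value stored in ev.data comes back unchanged in the reported event, which is how a server usually maps an event to its connection.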


 UDP SERVER IMPLEMENTED USING EPOLL
==========================================


#include <stdio.h>          // for printf() and fprintf()
#include <sys/socket.h>     // for socket(), bind(), and connect()
#include <arpa/inet.h>      // for sockaddr_in and inet_ntoa()
#include <stdlib.h>         // for atoi() and exit()
#include <string.h>         // for memset()
#include <unistd.h>         // for close()
#include <fcntl.h>          // for fcntl()
#include <errno.h>
#include <sys/epoll.h>

#define MAX_EVENTS 100

#define BUFFSIZE 5096

unsigned char buf[BUFFSIZE];

/*
 * Dump Data
 */
void dumpData(unsigned char *data,  unsigned int len)
{
  unsigned int uIndx;

  if(data)
    {
      for(uIndx=0; uIndx<len; ++uIndx)
        {
          if(uIndx%32 == 0)
            {
              printf("\n%4u:", uIndx);
            }
          if(uIndx%4 == 0)
            {
              printf(" ");
            }
          printf("%02x", data[uIndx]);
        }
    }
  printf(" Length of Bytes: %u\n", len);
  printf("\n");
}


/*
 * make_socket_non_blocking :
 *   This Function makes socket as Non blocking
 */
static int make_socket_non_blocking(int sockFd)
{
  int getFlag, setFlag;
 
  getFlag = fcntl(sockFd, F_GETFL, 0);
 
  if(getFlag == -1)
  {
    perror("fcntl");
    return -1;
  }
 
  /* Set the Flag as Non Blocking Socket */
  getFlag |= O_NONBLOCK;
 
  /* F_SETFL (not F_GETFL) is what actually applies the new flags */
  setFlag = fcntl(sockFd, F_SETFL, getFlag);
 
  if(setFlag == -1)
  {
    perror("fcntl");
    return -1;
  }
 
  return 0;
}

/*
 *  Main Routine
 */
int main()
{
  int i, length, receivelen;

  /* Socket Parameters */
  int sockFd;
  int optval = 1;   // Socket Option Always = 1

  /* Server Address */
  struct sockaddr_in serverAddr, receivesocket;

  /* Epoll File Descriptor */
  int epollFd;      

  /* EPOLL Event structures */
  struct epoll_event  ev;                  
  struct epoll_event  events[MAX_EVENTS];               
  int numEvents;    
             
  int ctlFd; 
  // Step 1: First Create UDP Socket 
 
  /* Create UDP socket
   * socket(protocol_family, sock_type, IPPROTO_UDP);
   */
  sockFd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

  /* Check socket is successful or not */
  if (sockFd == -1)
  {
    perror("socket");
    return -1;
  }

  // Step 2: Make the socket non-blocking.
  //         To handle multiple clients asynchronously, the socket
  //         must be configured as non-blocking.

  if (make_socket_non_blocking(sockFd) == -1)
  {
    return -1;
  }

  // Step 3: Set socket options
  //    One can set different socket options such as SO_REUSEADDR,
  //    SO_BROADCAST etc.

  /*  In this program, the socket is set to SO_REUSEADDR,
   *  which gives other sockets the flexibility to bind to the
   *  same port number.
   */

  if(setsockopt(sockFd, SOL_SOCKET, SO_REUSEADDR, &optval, sizeof(optval))== -1)
  {
     perror("setsockopt");
     return -1;
  }

  // Step 4: Bind the receive socket
  /* Set up the local address (IP address and port number) that this
   * socket binds to and receives on.
   */
  memset(&receivesocket, 0, sizeof(receivesocket));
  receivesocket.sin_family = AF_INET;
  receivesocket.sin_addr.s_addr = htonl(INADDR_ANY);
  receivesocket.sin_port = htons(2905);

  receivelen = sizeof(receivesocket);

  /* Bind the my Socket */
  if (bind(sockFd, (struct sockaddr *) &receivesocket, receivelen) < 0)
  {
    perror("bind");
    return -1;
  }

  // EPOLL Implementation Starts
  // Step 5: Create Epoll Instance
  /* The size argument is ignored since Linux 2.6.8 but must be > 0 */
 
  epollFd = epoll_create(6);

  if(epollFd == -1)
  {
     perror("epoll_create");
     return -1;
  }

  /* Add the udp Sock File Descriptor to Epoll Instance */
  ev.data.fd = sockFd;
 
  /* Events are Read Only and Edge-Triggered */
  ev.events = EPOLLIN | EPOLLET;

 
  // Step 6: control interface for an epoll descriptor
  /* EPOLL_CTL_ADD
      Register the target file descriptor fd on the epoll instance
      referred to by the file descriptor epfd and
      associate the event event with the internal file linked to fd.
  */


  /* Add the sock Fd to the EPOLL */
  ctlFd  = epoll_ctl (epollFd, EPOLL_CTL_ADD, sockFd, &ev);
 
  if (ctlFd == -1)
  {
    perror ("epoll_ctl");
    return -1;
  }

 // Step 7: Start the Event Loop using epoll_wait() in while Loop.

 /* Event Loop */
 while(1)
 {
     /*  Wait for events.
      *  int epoll_wait(int epfd, struct epoll_event *events, int
      *  maxevents, int timeout);
      *  Specifying a timeout of -1 makes epoll_wait() wait
      *  indefinitely.
      */
    
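     /* Wait indefinitely, since the timeout is -1 */
     numEvents = epoll_wait(epollFd, events, MAX_EVENTS, -1);

     if (numEvents == -1)
     {
       if (errno == EINTR)
         continue;   /* interrupted by a signal: wait again */
       perror("epoll_wait");
       return -1;
     }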

     for (i = 0; i < numEvents; i++)
     {
       if ((events[i].events & EPOLLERR) ||
           (events[i].events & EPOLLHUP) ||
           (!(events[i].events & EPOLLIN)))
        {
           /* An error has occurred on this fd, or the socket is not
            * ready for reading (why were we notified then?)
            */
           fprintf (stderr, "epoll error\n");
           close (events[i].data.fd);
           continue;
        }
       /* We have data on the fd waiting to be read. Read and
        * display it. We must read whatever data is available
        * completely, as we are running in edge-triggered mode
        * and won't get a notification again for the same data.
        */
       else if ( (events[i].events & EPOLLIN) &&
           (sockFd == events[i].data.fd) )
       {
         while (1)
         {
           memset(buf, 0, BUFFSIZE);
           /* Receive the data from the other system */
           length = recvfrom(sockFd, buf, BUFFSIZE, 0, NULL, NULL);
           if (length < 0)
           {
             if (errno == EAGAIN || errno == EWOULDBLOCK)
             {
               /* Nothing left to read: go back to epoll_wait() */
               break;
             }
             perror("recvfrom");
             return -1;
           }
           else if (length == 0)
           {
             /* A zero-length datagram is valid in UDP; keep reading */
             printf(" The Return Value is 0\n");
           }
           else
           {
             /* Print the data */
             printf("Recvd Byte length : %d", length);
             dumpData(buf, length);
           }
         }
       }
     }
  }

close( sockFd );
close( epollFd );
return 0;
}
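
Because the socket is registered with EPOLLET, the receive loop has to keep calling recvfrom() until it reports EAGAIN (or EWOULDBLOCK), and that result means "queue empty" rather than failure. Here is a minimal sketch of that drain step on its own. The names drain_datagrams and demo_drain_three are hypothetical, and the demo uses an AF_UNIX datagram socket pair purely so that it runs without a network; the server above uses an AF_INET UDP socket.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: read datagrams from a non-blocking socket until
 * the kernel queue is empty. Returns the number of datagrams read, or
 * -1 on a real error. */
int drain_datagrams(int sockFd, unsigned char *buf, size_t bufLen)
{
  int count = 0;
  ssize_t length;

  assert(buf != NULL && bufLen > 0);

  for (;;)
  {
    length = recvfrom(sockFd, buf, bufLen, 0, NULL, NULL);
    if (length >= 0)
    {
      count++;              /* a zero-length datagram still counts */
      continue;
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK)
      break;                /* queue drained: go back to epoll_wait() */
    if (errno == EINTR)
      continue;             /* interrupted by a signal: retry */
    return -1;
  }
  return count;
}

/* Demo: queue three datagrams, then drain them in one go. */
int demo_drain_three(void)
{
  int sv[2], n;
  unsigned char buf[64];

  if (socketpair(AF_UNIX, SOCK_DGRAM | SOCK_NONBLOCK, 0, sv) == -1)
    return -1;

  if (send(sv[0], "one", 3, 0) != 3 ||
      send(sv[0], "two", 3, 0) != 3 ||
      send(sv[0], "three", 5, 0) != 5)
    return -1;

  n = drain_datagrams(sv[1], buf, sizeof(buf));

  close(sv[0]);
  close(sv[1]);
  return n;
}
```

In an event loop, drain_datagrams() would be called once per EPOLLIN event on the socket, with the per-datagram processing done inside the loop.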



==============================================================================
UDP CLIENT -> udpclient.c
==============================================================================

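#include <stdio.h>
#include <sys/socket.h>     // for socket(), bind(), sendto() and setsockopt()
#include <arpa/inet.h>      // for sockaddr_in and inet_addr()
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>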


#define BUFFSIZE 5096
#define MAX_LEN 100000

int sendlen, receivelen;
int received = 0, i,count, rcvCnt=0, sentCnt=0;
unsigned char buffer[BUFFSIZE];
struct sockaddr_in receivesocket;
struct sockaddr_in sendsocket;
int sock;
unsigned int ch;
unsigned int noOfTimes;
   
int sendUDPData();  
   
int main(int argc, char *argv[]) {
   
    int ret = 0;
  int optval = 1;

    /* Create the UDP socket */
    if ((sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP)) < 0) {
        perror("socket");
        return -1;
    }

    /* my address */
    memset(&receivesocket, 0, sizeof(receivesocket));
    receivesocket.sin_family = AF_INET;
    receivesocket.sin_addr.s_addr = htonl(INADDR_ANY);
    receivesocket.sin_port = htons(2905);

    receivelen = sizeof(receivesocket);

    if (setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &optval, sizeof(optval)) == -1)
    {
        perror("setsockopt");
        return -1;
    }
    if (bind(sock, (struct sockaddr *) &receivesocket, receivelen) < 0)
    {
        perror("bind");
        return -1;
    }
    /* kernel address */
    memset(&sendsocket, 0, sizeof(sendsocket));
    sendsocket.sin_family = AF_INET;
    sendsocket.sin_addr.s_addr = inet_addr("10.12.7.95");
    sendsocket.sin_port = htons(2905);

   do
    {
       printf("\n");
       printf(" Enter your choice:\t");
       printf(" 1. Send UDP Data \n");
       printf(" 2. exit \n");
       scanf("%u", &ch);
       printf("\n");

       switch(ch)
       {

           case 1:
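                   printf("Enter the Length of the Payload \n");
                   scanf("%d", &sendlen);
                   /* The payload must fit into buffer[BUFFSIZE] */
                   if (sendlen < 0 || sendlen > BUFFSIZE)
                   {
                       printf("Length must be between 0 and %d\n", BUFFSIZE);
                       break;
                   }
                   printf("Enter How many times you want to send data \n");
                   scanf("%u", &noOfTimes);
                   sendUDPData();
                   break;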

           case 2:
                  /* Exit: the loop condition ends the program */
                  break;

           default:
                  printf("Invalid Choice\n");
                  break;
       }
    }while(ch!=2);
return 0;
}
int sendUDPData()
{
    unsigned int count;

    /* Fill the payload with the character '1' (0x31) */
    memset(buffer, 0x31, sendlen);

    /* Send the same datagram noOfTimes times */
    for (count = 0; count < noOfTimes; count++)
    {
        if (sendto(sock, buffer, sendlen, 0, (struct sockaddr *) &sendsocket,
                   sizeof(sendsocket)) != sendlen)
        {
            perror("sendto");
            return -1;
        }
        printf("\n");
    }
    return 0;
}

19 comments:

  2. This comment has been removed by the author.

    1. Hi Jakub,

      I added the bind() call intentionally so that the client can receive data from the server (I mean, if the server sends data, the client can receive it using the recv() function).

  3. Thank you for your article.
