[lwip-users] lwip lock


Tazzari Davide

Mar 22, 2011, 11:41:41 AM
to lwip-...@nongnu.org

Hi to all.

I have created an AVR32 application based on FreeRTOS and lwIP 1.3.2.

My application is very large, but I want to concentrate on my problems.

There are two tasks that use HTTP connections: a web server, and a web client that talks to an external portal.

The application simply collects some data and periodically POSTs it to an Apache-based web portal.

The web server is of course active only when a browser wants to connect; otherwise it sits almost frozen in the listen state.

Here is my problem.

Sometimes, somehow, all the TCP connections lock up and are lost: the web server is no longer accessible and the application cannot communicate with the portal.

This seems to happen when I access the web server at the same time as the device tries to access the portal.

 

I started to analyze lwIP, and here is what I found.

 

In file mem.c I added the following code

 

 

static u8_t *ram;

/** the last entry, always unused! */

static struct mem *ram_end;

/** pointer to the lowest free block, this is used for faster search */

static struct mem *lfree;

 

u8_t ** ppMemRam;           // DT:2011/03/09

/** the last entry, always unused! */

struct mem ** ppMemRamEnd;  // DT:2011/03/09

/** pointer to the lowest free block, this is used for faster search */

struct mem ** ppMemLFree;   // DT:2011/03/09

 

...

 

void

mem_init(void)

{

...

  ppMemRam = & ram;           // DT:2011/03/09

  ppMemRamEnd = & ram_end;  // DT:2011/03/09

  ppMemLFree = & lfree;   // DT:2011/03/09

}

 

 

This lets me see (through a serial debugger) the state of the heap area used for lwIP data.

When the problem happens, the "lfree" pointer is stuck at an address different from "ram".

I looked at the mem ram area and found that the chain of allocations was OK.

It seems that something was not freed, for some reason unknown to me.

Sometimes this is not critical because access still works, but the wasted area grows little by little until it saturates the heap and locks up communication.
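The manual inspection described above (follow the allocation chain, check that it is intact, and see where the lowest free block sits) can be sketched in code. This is a simplified, self-contained model, not lwIP's mem.c: the header layout with `next`/`prev` byte offsets and a `used` flag is modelled on lwIP's heap, but `heap_hdr`, `heap_report` and `heap_walk` are hypothetical names, and the end of the heap is represented simply by the offset `heap_size`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified model of a heap block header: next/prev are byte offsets
 * from the start of the heap, not pointers. */
struct heap_hdr {
    uint16_t next;  /* offset of the following block header */
    uint16_t prev;  /* offset of the preceding block header */
    uint8_t  used;  /* 1 if allocated, 0 if free */
};

struct heap_report {
    unsigned used_blocks;  /* blocks still marked allocated */
    unsigned free_blocks;  /* blocks marked free */
    int      chain_ok;     /* 1 if every next/prev pair is consistent */
    uint16_t lowest_free;  /* offset of first free block, heap_size if none */
};

/* Copy a header out of the heap; memcpy avoids alignment assumptions. */
static struct heap_hdr read_hdr(const uint8_t *ram, uint16_t off)
{
    struct heap_hdr h;
    memcpy(&h, ram + off, sizeof h);
    return h;
}

/* Walk the chain from offset 0 up to heap_size and summarise it. */
struct heap_report heap_walk(const uint8_t *ram, uint16_t heap_size)
{
    struct heap_report r = { 0, 0, 1, 0 };
    uint16_t off = 0;

    assert(ram != NULL);
    r.lowest_free = heap_size;
    while (off < heap_size) {
        struct heap_hdr h = read_hdr(ram, off);

        /* next must move strictly forward and stay inside the heap */
        if (h.next <= off || h.next > heap_size) {
            r.chain_ok = 0;
            break;
        }
        /* the following block (unless it is the end) must point back */
        if (h.next < heap_size && read_hdr(ram, h.next).prev != off) {
            r.chain_ok = 0;
            break;
        }
        if (h.used) {
            r.used_blocks++;
        } else {
            r.free_blocks++;
            if (r.lowest_free == heap_size) {
                r.lowest_free = off;
            }
        }
        off = h.next;
    }
    return r;
}
```

Under this model, an intact chain (`chain_ok == 1`) with `used_blocks` greater than zero while the system is idle, and `lowest_free` above offset 0, would point to a leaked allocation rather than a corrupted heap. That matches the symptom described here, but it is only an interpretation of the model, not a diagnosis.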

 

I suppose this is an effect rather than a cause, so I continued my analysis.

I concentrated on the memp area.

I studied it, well aware that I did not understand much of it, but here is what I discovered.

 

I show only the TCP_SEG pool, which seems relevant to me.

 

HEX    Offset  Delta  Block  Arg      RefCh  RefMem  Free
1E08   2564    20     0      TCP_SEG  0
1E1C   2584    20     1      TCP_SEG  1E08
1E30   2604    20     2      TCP_SEG  1E1C
1E44   2624    20     3      TCP_SEG  1E30
1E58   2644    20     4      TCP_SEG  1E44
1E6C   2664    20     5      TCP_SEG  1E58
1E80   2684    20     6      TCP_SEG  1E6C
1E94   2704    20     7      TCP_SEG  ?      1EE4
1EA8   2724    20     8      TCP_SEG  ?      0
1EBC   2744    20     9      TCP_SEG  1E80   -       xxx
1ED0   2764    20     10     TCP_SEG  ?      0
1EE4   2784    20     11     TCP_SEG  ?      0

 

Let me try to describe the columns:

HEX is the absolute memory address of the memp block

Offset is the byte offset from the start of the whole memp structure

Delta is the size of a single block

Block is the index of the block

RefCh is the address of the "next" block in the chain

RefMem is the address of the "next" block found by reading the memory directly

Free marks the first free block

 

It appears that block 9 is the first free block. The next is block 6, then 5, 4, 3, 2, 1, 0.

Reading the memory, I saw that block 7 is chained to block 11. These two blocks are chained together but are no longer reachable.

Likewise, blocks 10 and 8 seem to be no longer reachable and are chained to nothing.
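The reachability analysis above can be reproduced mechanically. The following is a minimal sketch over a simplified model in which each block stores the index of the next block (or `NO_BLOCK`) instead of an address; `POOL_BLOCKS`, `NO_BLOCK` and `pool_count_free` are hypothetical names, not part of lwIP's memp.c.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BLOCKS 12
#define NO_BLOCK    (-1)

/* Follow the free list from free_head and mark every block reached.
 * Returns the number of blocks on the free list. The walk stops on an
 * out-of-range index or when it meets a block twice (a cycle). */
int pool_count_free(const int next[POOL_BLOCKS], int free_head,
                    unsigned char reachable[POOL_BLOCKS])
{
    int count = 0;
    int i;

    assert(next != NULL && reachable != NULL);
    for (i = 0; i < POOL_BLOCKS; i++) {
        reachable[i] = 0;
    }
    for (i = free_head;
         i >= 0 && i < POOL_BLOCKS && !reachable[i];
         i = next[i]) {
        reachable[i] = 1;
        count++;
    }
    return count;
}
```

With the table above encoded as `next = {-1, 0, 1, 2, 3, 4, 5, 11, -1, 6, -1, -1}` and a free head of 9, the walk reaches eight blocks and leaves 7, 8, 10 and 11 unmarked. A block that is not on the free list is either legitimately in use or leaked; this walk alone cannot tell the two apart, so it would need to be run while no connection is active.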

 

What I see is that these two phenomena are related: whenever I lose mem area I lose TCP_SEG blocks as well.

If we take a look at the tcp_seg structure

 

struct tcp_seg {

  struct tcp_seg *next;    /* used when putting segements on a queue */

  struct pbuf *p;          /* buffer containing data + TCP header */

...

 

 

we can see that there is a reference to a pbuf. The lost tcp_seg blocks refer to that lost mem area!

 

Has anyone ever seen such a problem?

Any suggestion on how to solve it?

 

I also read the lwIP memp stats

 

lwip_stats.memp[i].max

lwip_stats.memp[i].avail

lwip_stats.memp[i].used

 

and what I found for TCP_SEG is 12, 12, 12, so all memp blocks are in use!

 

I have one idea, but I don't know whether it could create worse problems. It is not a solution, because I don't know the real problem, but it is a sort of sanity repair of the TCP_SEG blocks.

Looking at the example posted above, I could chain the two lost blocks (10 and 8) to the top of the list and the chained blocks (7 and 11) to the bottom of the list. In this way I could recover at least the lost blocks. The chained blocks (7 and 11) could, in theory, still be in use and freed later; at least, I don't know whether they are really in use or lost.

 

So, the result should be

7(chained) -> 11(lost) -> 9 (free) -> 6 -> 5 -> 4 -> 3 -> 2 -> 1 -> 0 -> 8 (lost)-> 10(lost)

This, of course, must be done by hand.

For blocks 8 and 10 I suppose I also have to call the mem_free function on the block->p area.

 

Is it a good idea?

Again, does anybody know the problem or what the hell I have done to create this problem?

 

Another problem. I don't know whether it is related; maybe it is the same problem with a different effect.

The tcpip_thread stalls!

 

static void

tcpip_thread(void *arg)

{

...

  while (1) {                          /* MAIN Loop */

    gusTcpThread ++;  // DT 03/03/2011 Debug

    gucStatusTCPIP = 0; //DT 2011/03/04 TEST

    sys_mbox_fetch(mbox, (void *)&msg);

    gucStatusTCPIP = 1; //DT 2011/03/04 TEST

    switch (msg->type) {

#if LWIP_NETCONN

    case TCPIP_MSG_API:

      LWIP_DEBUGF(TCPIP_DEBUG, ("tcpip_thread: API message %p\n", (void *)msg));

      gucStatusTCPIP = 2; //DT 2011/03/04 TEST

      msg->msg.apimsg->function(&(msg->msg.apimsg->msg));

      gucStatusTCPIP = 3; //DT 2011/03/04 TEST

      break;

#endif /* LWIP_NETCONN */

...

}

 

What I see is that the gusTcpThread counter stops incrementing. When this happens the debug variable gucStatusTCPIP is 2, so the thread is stalled inside the call to the API function. I don't know which function it is or which mbox is involved.

 

// Posts the "msg" to the mailbox. This function has to block until the "msg"

// is really posted.

void sys_mbox_post(sys_mbox_t mbox, void *msg)

{

  // NOTE: we assume mbox != SYS_MBOX_NULL; iow, we assume the calling function

  // takes care of checking the mbox validity before calling this function.

  while( pdTRUE != xQueueSend( mbox, &msg, SYS_ARCH_BLOCKING_TICKTIMEOUT ) )

  {

      vTaskDelay(10); // DT 08/03/2011 Debug

      gusCntMBoxFull++; // DT 03/03/2011 Debug

  }

  gusCntMBoxFull = 0; // DT 03/03/2011 Debug

}

 

In the normal case gusCntMBoxFull is expected to be 0. If the tcpip thread (the only one that can pop the queue) is locked, the queue fills up and that while loop becomes an infinite loop.
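To make such a stall visible instead of spinning forever, one option is a post with a bounded number of retries that reports failure to the caller. The sketch below is a self-contained model with a tiny ring-buffer mailbox; `struct mbox`, `mbox_trypost`, `mbox_tryfetch` and `mbox_post_bounded` are hypothetical names. In the FreeRTOS port shown above, the try step would presumably be an `xQueueSend` with a zero timeout and the wait step a `vTaskDelay`; that mapping is an assumption, not tested code.

```c
#include <assert.h>
#include <stddef.h>

#define MBOX_SLOTS 4

struct mbox {
    void    *slot[MBOX_SLOTS];
    unsigned head;   /* index of the oldest message */
    unsigned count;  /* number of messages currently queued */
};

/* Non-blocking post: returns 1 on success, 0 if the mailbox is full. */
int mbox_trypost(struct mbox *m, void *msg)
{
    assert(m != NULL);
    if (m->count == MBOX_SLOTS) {
        return 0;
    }
    m->slot[(m->head + m->count) % MBOX_SLOTS] = msg;
    m->count++;
    return 1;
}

/* Non-blocking fetch: returns the oldest message, or NULL if empty. */
void *mbox_tryfetch(struct mbox *m)
{
    void *msg;

    assert(m != NULL);
    if (m->count == 0) {
        return NULL;
    }
    msg = m->slot[m->head];
    m->head = (m->head + 1) % MBOX_SLOTS;
    m->count--;
    return msg;
}

/* Try to post up to (max_retries + 1) times, calling wait_fn (if not
 * NULL) after each failed attempt. Returns 1 if posted, 0 if it gave up. */
int mbox_post_bounded(struct mbox *m, void *msg, unsigned max_retries,
                      void (*wait_fn)(void))
{
    unsigned attempt;

    for (attempt = 0; attempt <= max_retries; attempt++) {
        if (mbox_trypost(m, msg)) {
            return 1;
        }
        if (wait_fn != NULL) {
            wait_fn();
        }
    }
    return 0;
}
```

Note that lwIP expects sys_mbox_post itself to block until the message is posted, so a bounded variant like this fits better as a diagnostic wrapper (for example, to trigger a controlled reset or log which task is stuck) than as a drop-in replacement.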

 

Any ideas? Do you think these two problems are the same problem with two different effects? Note that this problem also happens in the same situation: web server and portal both active.

 

One last piece of information: I build with optimization -O1. I am going to try -O0, but I have to remove pieces of code, so it is not a simple job.

 

Best regards

Davide

 

 

Kieran Mansley

Mar 22, 2011, 11:55:57 AM
to Mailing list for lwIP users
On Tue, 2011-03-22 at 16:41 +0100, Tazzari Davide wrote:
>
> Has anyone ever seen such a problem?

It sounds like you're corrupting internal stack state by having more
than one thread active in lwIP's core at the same time. This would also
explain tcpip_thread being stalled as it is probably stuck in a loop
iterating a corrupt list.

> Any suggestion on how to solve it?

Make sure that only one thread is active in lwIP at once. This should
in your case be the tcpip_thread. All other threads (including
interrupts) should make sure they're not calling directly into lwIP and
are instead queueing work for the tcpip_thread to perform for them. If
you're using the sockets API then most of this will be done for you but
you still need to be careful; you can't use one socket in two different
threads for example. Make sure your driver is interfacing to lwIP
correctly as that was a common source of porting errors.
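
The "queue work for one owner thread" rule described above can be illustrated with a small model. In lwIP the real mechanism is the tcpip_callback() family, which hands a function and a context pointer to tcpip_thread; everything in the sketch below (`core_post`, `core_drain`, `bump_counter`, the `in_core` flag) is a hypothetical, single-threaded stand-in that only shows the shape of the pattern. A real port would need an RTOS queue in place of the plain array.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*work_fn)(void *ctx);

struct work {
    work_fn fn;
    void   *ctx;
};

#define WORK_SLOTS 8

static struct work work_q[WORK_SLOTS];
static unsigned work_head;   /* index of the oldest queued item */
static unsigned work_count;  /* number of queued items */
static int in_core;          /* 1 while the owner is running stack code */

/* Called from any task or driver: queue the work instead of running it.
 * Returns 1 if queued, 0 if the queue is full. */
int core_post(work_fn fn, void *ctx)
{
    struct work *w;

    assert(fn != NULL);
    if (work_count == WORK_SLOTS) {
        return 0;
    }
    w = &work_q[(work_head + work_count) % WORK_SLOTS];
    w->fn = fn;
    w->ctx = ctx;
    work_count++;
    return 1;
}

/* Called only from the owner thread: run everything queued so far.
 * Returns the number of items executed. */
unsigned core_drain(void)
{
    unsigned ran = 0;

    while (work_count > 0) {
        struct work w = work_q[work_head];

        work_head = (work_head + 1) % WORK_SLOTS;
        work_count--;
        assert(!in_core);  /* fires if stack code is entered twice */
        in_core = 1;
        w.fn(w.ctx);
        in_core = 0;
        ran++;
    }
    return ran;
}

/* Example work item: increments the int that ctx points to. */
void bump_counter(void *ctx)
{
    (*(int *)ctx)++;
}
```

The point of the pattern is that posted work has no effect until the owner drains the queue, so stack state is only ever touched from one context. A driver or task that calls raw-API functions directly bypasses this and can corrupt the lists the owner is iterating.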

Kieran


_______________________________________________
lwip-users mailing list
lwip-...@nongnu.org
http://lists.nongnu.org/mailman/listinfo/lwip-users

Tazzari Davide

Apr 12, 2011, 11:36:17 AM
to Mailing list for lwIP users

Hi Kieran,

I have made this improvement to my web server

 

WebServer task (old version)

...

    for (;;)

    {

        iRestartBinding = 0;

        pxHTTPListener = netconn_new( NETCONN_TCP );

        netconn_bind(pxHTTPListener, NULL, webHTTP_PORT );

        netconn_listen( pxHTTPListener );

        int iTimeout = 1000;

 

        for( ; (iRestartBinding < 10) && (gucRestartWebServer == FALSE); iRestartBinding++)

        {

            xLastFocusTime = xTaskGetTickCount();

            vTaskDelayUntil( &xLastFocusTime, xDelayLength );

            if (iGlobalWtdBomb == FALSE) // TRUE I am waiting for a WDT suicide

            {

                // Wait for a first connection.

                #if LWIP_SO_RCVTIMEO

                pxHTTPListener->recv_timeout = iTimeout;

                #endif

 

                pxNewConnection = netconn_accept(pxHTTPListener);

                if(pxNewConnection != NULL)

                {

                    prvweb_ParseHTMLRequest( pxNewConnection );

                    netconn_close( pxNewConnection );

                    netconn_delete( pxNewConnection );

                    iRestartBinding = 0;

                    iTimeout = 5000;

                }// end if new connection

                else

                {

                    iTimeout = 1000;

                }

            }

        }   // end acquisition loop

        gucRestartWebServer = FALSE;

        netconn_close(pxHTTPListener);

        while(netconn_delete(pxHTTPListener) != 0)

        {

            vTaskDelay(20);

        }

        pxHTTPListener = NULL;

   }

...

 

 

static unsigned char prvweb_ParseHTMLRequest( struct netconn *pxNetCon )

{

struct netbuf *pxRxBuffer;

portCHAR *pcRxString;

unsigned portSHORT usLength;

 

    /* We expect to immediately get data. */

    pxNetCon->recv_timeout = 1000;

    pxRxBuffer = netconn_recv( pxNetCon );

 

    if( pxRxBuffer != NULL )

    {

        /* Where is the data? */

        netbuf_data( pxRxBuffer, ( void * ) &pcRxString, &usLength );

        ...

        netbuf_delete( pxRxBuffer );

        return 0;

    }

    else

    {

        return -1;

    }

}

 

This was my first implementation. Why these two loops? Because this way, when the Ethernet cable is unplugged and then plugged back in, I recognize it and create the listener again. Otherwise I lose the Ethernet connection! I don't know whether this is THE solution but, at least, it is a solution.

 

After your comments I changed the web server task to a more flexible structure: for each accepted connection, I create a task to serve it, like this:

 

portTASK_FUNCTION( WebServerAnswerTask, pvParameters )

{

struct netconn * pxNewConnection = (struct netconn *) pvParameters;

    prvweb_ParseHTMLRequest( pxNewConnection );

    netconn_close( pxNewConnection );

    netconn_delete( pxNewConnection );

    vTaskDelete( NULL );

}

 

And ...

 

if(pxNewConnection != NULL)

{

    if (xTaskCreate(WebServerAnswerTask,

            ( signed portCHAR * ) "WebServerAnswer",

            WEB_SERVER_STACK_SIZE,

            pxNewConnection,

            ethWEBSERVER_PRIORITY,

            ( xTaskHandle * ) NULL ) != pdPASS)

    {

        // Task not correctly created!!!

        netconn_write( pxNewConnection, (char *) webHTTP_HTM_INTERNAL_ERROR, (u16_t) strlen( webHTTP_HTM_INTERNAL_ERROR ), NETCONN_COPY ); // error HTTP 500

        netconn_close( pxNewConnection );

        netconn_delete( pxNewConnection );

    }

    iRestartBinding = 0;

    iTimeout = 5000;

}// end if new connection

else

{

    iTimeout = 1000;

}

 

instead of

 

prvweb_ParseHTMLRequest( pxNewConnection );

netconn_close( pxNewConnection );

netconn_delete( pxNewConnection );

 

directly in the main web server task

 

Results:

Web server faster.

The mbox-full condition has not happened again.

The stuck mem area problem is reduced but, unfortunately, not to zero. A few times I have seen the lfree pointer differ from (and stay stuck away from) the ram pointer. In those few cases the web server remains inaccessible until a reset. I monitor this value and reset the machine (WDT) if it occurs. I dislike this very much but... anyway... it doesn't happen very often.

 

Let's consider memp.

Now TCP_SEG seems correct, and it seems that no blocks are lost.

TCP_PCB, instead, goes to full usage almost immediately. I set the limit to 12 and then to 30, but every time a connection arrives the number increases until it reaches the limit. My home page contains 8 images, 1 CSS file and 1 JS file, so within a couple of reloads I reach the limit (whatever it is set to).

I have read somewhere that even after the connection is closed the pcb remains in a wait state for a couple of minutes (to wait for connection synchronisation packets lost in the net), and that the rule is to use the "not used" pcbs first and then the "wait" pcbs, so at the beginning I didn't worry about it. After waiting 10 minutes, the corresponding lwip_stats.memp[i].used is still equal to the limit or one less than it: with a limit of 12, 11 used; the "used" value never goes to zero.
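
The allocation rule mentioned above (use free PCBs first, then recycle PCBs in the wait state) can be modelled in a few lines. This is a sketch of that described behaviour only: it ignores the expiry of the wait state, and `pcb_pool`, `pcb_alloc`, `pcb_close` and `pcb_in_use` are hypothetical names. Whether lwIP 1.3.2 behaves exactly this way would need to be checked in its tcp_alloc() code.

```c
#include <assert.h>
#include <stddef.h>

#define PCB_POOL 12

enum pcb_state { PCB_FREE, PCB_ACTIVE, PCB_TIME_WAIT };

struct pcb {
    enum pcb_state state;
    unsigned long  closed_at;  /* time the connection was closed */
};

struct pcb_pool {
    struct pcb pcb[PCB_POOL];
    unsigned   alloc_failures;  /* allocations that found nothing usable */
};

/* Allocate a PCB: prefer a free one, otherwise recycle the oldest one
 * in the wait state. Returns the index, or -1 if every PCB is active. */
int pcb_alloc(struct pcb_pool *p)
{
    int i;
    int oldest = -1;

    assert(p != NULL);
    for (i = 0; i < PCB_POOL; i++) {
        if (p->pcb[i].state == PCB_FREE) {
            p->pcb[i].state = PCB_ACTIVE;
            return i;
        }
        if (p->pcb[i].state == PCB_TIME_WAIT &&
            (oldest < 0 || p->pcb[i].closed_at < p->pcb[oldest].closed_at)) {
            oldest = i;
        }
    }
    if (oldest >= 0) {
        p->pcb[oldest].state = PCB_ACTIVE;
        return oldest;
    }
    p->alloc_failures++;
    return -1;
}

/* Close a connection: the PCB is not freed, it enters the wait state. */
void pcb_close(struct pcb_pool *p, int idx, unsigned long now)
{
    assert(p != NULL && idx >= 0 && idx < PCB_POOL);
    p->pcb[idx].state = PCB_TIME_WAIT;
    p->pcb[idx].closed_at = now;
}

/* Number of PCBs that a "used" statistic would count (anything not free). */
unsigned pcb_in_use(const struct pcb_pool *p)
{
    unsigned n = 0;
    int i;

    assert(p != NULL);
    for (i = 0; i < PCB_POOL; i++) {
        if (p->pcb[i].state != PCB_FREE) {
            n++;
        }
    }
    return n;
}
```

In this model, 30 short connections through a pool of 12 leave the "used" count at 12 without a single allocation failure. So a "used" value pinned at the limit is not, by itself, proof of a leak; the telling signs would be allocation errors, or PCBs stuck in a state other than the wait state.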

What I see is that when the used pcb count is well below the limit the web server works almost perfectly; when the count is near the limit the web server is slower and (this is the bad thing) sometimes locks up. In those cases the lfree pointer of the mem area is stuck at a value different from the ram pointer.

 

Moreover, because of code memory constraints I moved all the code to SDRAM. Of course I see a speed reduction, but I expected that. The problem is that after a few web server connections the web server sometimes locks up, i.e. connection refused ([RST, ACK] immediately after a [SYN] request), with no possibility of restarting until a reset. It seems that this happens when the TCP_PCB limit is reached, whatever the value of the limit. But sometimes everything works regardless of these values.

The code is exactly the same; the only difference is where it is fetched from.

 

 

About the lwIP interface.

Throughout the code I used only the netconn API (or at least that was my intention!). I suppose that either I made some mistake, or somewhere in the code (FreeRTOS? lwIP itself? my fault? ...) something uses a low-level lwIP access that I didn't find.

Here are the lwIP connections I have:

 

1) WebServer (shown above)

 

2) PortalConnection

... // Send and receive function

    * pps_Connection = netconn_new(NETCONN_TCP);

    error_get_web = netconn_connect(* pps_Connection, &ipaddr, gs_EthernetParameters.siPort);

    if(error_get_web == 0)

    {

        if ((* pps_Connection)->pcb.tcp->state != ESTABLISHED) // if the portal doesn't respond I don't receive any error at all!

        {

            DestroyConnection(pps_Connection);

            return ERR_CONN;

        }

        else

        {

            error_get_web = netconn_write(* pps_Connection, pcBuffer, iSize, NETCONN_COPY );

            if ((* pps_Connection)->state != NETCONN_NONE)

            {

                 // error code but connection not destroyed. I don't know what to do here and if I have to do something!!!

            }

        }

    }

 

    #if LWIP_SO_RCVTIMEO

    ps_Connection->recv_timeout = 10000;    // 10 sec max

    #endif

    unsigned char ucFlagFirstPage = TRUE;

    while( (nb = netconn_recv( ps_Connection ) ) != NULL )

    {

        netbuf_data( nb, (void *) & pcPageData, & usLength );

        ... // transfer data to a temporary file to be analyzed later.

        netbuf_delete(nb);

    }

    #if LWIP_SO_RCVTIMEO

    if (ps_Connection->err == ERR_TIMEOUT)

    {

        DestroyConnection(& ps_Connection);

        return ERR_TIMEOUT;

    }

    #endif

    DestroyConnection(& ps_Connection);

 

And

 

void DestroyConnection(struct netconn ** pps_Connection)

/// \brief Destroy active connection

/// \param pps_Connection pointer to pointer to connection

{

    if (pps_Connection == NULL)

        return;

    netconn_close(* pps_Connection);

    while(netconn_delete(* pps_Connection) != 0)

    {

        vTaskDelay(DELAY_TO_WAIT_DISPOSE_CONNECTION);

    }

    * pps_Connection = NULL;

}

 

3) UDP Debug Client (this task sends data to a remote client).

conn = netconn_new(NETCONN_UDP);

if (conn != NULL)

{

    nb = netbuf_new();

    netconn_connect(conn, &ipaddr, ti_UdpPortDebug.i);

    while (xQueueReceive(xQueueUdpDebug, & s_Block, 0) == pdTRUE)

    {

        sprintf(pcBitmaskCode, "#######   Code: %02x - %02x #######\r\n", s_Block.ucClassCode, s_Block.ucSubClassCode);

        netbuf_ref(nb, pcBitmaskCode, strlen(pcBitmaskCode));

        cError = netconn_send(conn, nb);

        vTaskDelay(10);

        int len = strlen(s_Block.pcTextBloc);

        int i = 0;

        while ((len - i) > 1000)

        {

            netbuf_ref(nb, (char *) & s_Block.pcTextBloc[i], (unsigned short)1000);

            cError = netconn_send(conn, nb);

            i += 1000;

            vTaskDelay(20);

        }

        if (len - i)

        {

            netbuf_ref(nb, (char *) & s_Block.pcTextBloc[i], (unsigned short)(len - i));

            cError = netconn_send(conn, nb);

            vTaskDelay(20);

        }

 

        netbuf_ref(nb, "\r\n\r\n", 4);

        cError = netconn_send(conn, nb);

        vTaskDelay(5);

 

        vPortFree(s_Block.pcTextBloc);

        vTaskDelay(5);

    }

    netconn_disconnect(conn);

    netbuf_free(nb);

    netbuf_delete(nb);

}

 

Forget the s_Block data for the moment; it is a structure used to enqueue a debug message text.

 

4) UDP Configuration Server

 

portTASK_FUNCTION( vBasicUDPCOMSERVER, pvParameters )

{

struct udp_pcb *connUdp;

err_t myError;

 

    connUdp = udp_new();

    myError = udp_bind(connUdp, IP_ADDR_ANY, UDPCOMNET_PORT);

    udp_recv(connUdp, Server_udp_recv, NULL);

    cUDPTxBuffer[0] = myError;

    // Loop forever

    for( ;; )

    {

        vTaskDelay(1000);

        __asm__ __volatile__("nop");

    }

}

 

void Server_udp_recv(void *_args, struct udp_pcb *upcb, struct pbuf *pBuffUdp, struct ip_addr *Remoteaddr, u16_t Remote_port_udp)

{

int uiUdpLenMessage= 0;

    if(pBuffUdp!= NULL)

    {

        .. // message analyzed

        udp_sendto(...); // send the answer

        pbuf_free(pBuffUdp);

    }

}

 

5) UDP Management Server

Exactly the same as the UDP Configuration Server.



In any case, the UDP servers and client were not in use while the web server was locked.



As far as I know, nothing else is using lwIP. Where should I look for unknown low-level lwIP access?



Any further clever ideas about all these problems?



Sorry for boring you with such a huge e-mail.

 

Best regards

Davide

Tazzari Davide

Apr 18, 2011, 4:17:27 AM
to Mailing list for lwIP users

New information about the problem.
Probably my UDP server was not written correctly, as Kieran supposed.

Here are my changes:

 

portTASK_FUNCTION( vBasicUDPCOMSERVER, pvParameters )

{

static struct netconn *conn;

static struct netbuf * pxRxBuffer;

struct ip_addr *addr;

struct ip_addr destip;

unsigned int uiTxLen;

char pcTxData[BUFFER_LENGTH];

char * pcRxData;

    conn = netconn_new(NETCONN_UDP);

    netconn_bind(conn, NULL, UDPCOMNET_PORT);

 

    for (;;)

    {

        pxRxBuffer = netconn_recv(conn);

        if (pxRxBuffer != NULL)

        {

            addr = netbuf_fromaddr(pxRxBuffer);

            destip = *addr;

            unsigned short usLength = pxRxBuffer->p->tot_len;

            pcRxData = (char *) pvPortMalloc(usLength);

            if (pcRxData != NULL)

            {

                netbuf_copy(pxRxBuffer, pcRxData, usLength);

                InitTxBuffer(pcTxData);

                uiTxLen = VsInterpreter(usLength, (char *) pcRxData, (char *) pcTxData, BUFFER_LENGTH);

                vPortFree(pcRxData);

                if (uiTxLen > 0)

                {

                    struct netbuf * pxTxBuffer;

                    pxTxBuffer = netbuf_new();

                    // Reference the request data into net_buf

                    netbuf_ref( pxTxBuffer , pcTxData , uiTxLen );

                    netconn_sendto(conn, pxTxBuffer, & destip, UDPCOMNET_PORT);

                    netbuf_delete(pxTxBuffer);

                }

            }

            netbuf_delete(pxRxBuffer); // De-allocate packet buffer

        }

    }

}

 

Of course, nothing has changed, because my problem happens even with only the web server active. Anyway, both UDP servers have been changed.

 

I have done other tests.

With Internet Explorer I requested the same page repeatedly: a very simple page with little data, no images, no JS … nothing but that page. The first load was OK. The second, third, fourth and so on were OK but, when I pressed F5 to reload the page faster, after some correct reloads everything stopped.

I used Wireshark.

Often I saw that my browser sent a [SYN] with no answer from the device.

Sometimes (very, very rarely), after a [SYN] I immediately received a [RST, ACK].

 

As you may remember from my last post, every access to the web server creates a new task to serve it.

At the beginning I have 28 tasks running; after the crash I have 28 tasks running. So no task has been left frozen or stuck.

The TCP_SEG pcb seems completely lost as written in my first post.  

TCP_SEG Stats: Max Used 12, Max 12, Used 12, Error 23

The error count grows by one for each request. Firefox makes 3 requests per reload and the error count has grown from 23 to 26.

 

I have seen also that the SYS_TIMEOUT has a lot  of errors: 188!

In this case the stats are Max Used 6, Max 6, Used 1, Error 188

After a request, the error doesn’t increase.

 

Last, but not least, the mem ram pointer address.

As written in the first post, the lfree pointer is stuck at an intermediate address.

I suppose that TCP_SEG and the stuck lfree pointer are related, but I don’t know why.

 

Any further idea?

Does anybody need other tests to investigate?

 

Best regards

Davide

 

 

Kieran Mansley

Apr 18, 2011, 4:53:49 AM
to Mailing list for lwIP users
On Mon, 2011-04-18 at 10:17 +0200, Tazzari Davide wrote:
>
> The TCP_SEG pcb seems completely lost as written in my first post.
>
> TCP_SEG Stats: Max Used 12, Max 12, Used 12, Error 23
>
> The error count grows by one for each request. Firefox makes 3
> requests per reload and the error count has grown from 23 to 26

I'm pretty sure this is the cause of your problems: the stack has no
available packet buffers, and so can't deal with your requests. Why it
has got into this state is the problem you have to solve. I would look
at all the tasks that can call into lwIP: your application, the device
driver, timers, etc. and check how they are doing this. If you can post
how each of these is done, and how they are protected against the
others, that would help. I suspect that for example you have timers
running while you're processing a received packet or something like that
which leads to a list of packets being corrupted, those packets on the
list are leaked, and you're stuck.

>
> I have seen also that the SYS_TIMEOUT has a lot of errors: 188!
>
> In this case the stats are Max Used 6, Max 6, Used 1, Error 188
>
> After a request, the error doesn’t increase.

That is also interesting, but I'm not sure off the top of my head what
those errors indicate.

> Last, but not least, the mem ram pointer address.
>
> As written in the first post, the lfree pointer is stuck at an
> intermediate address
>
> I suppose that TCP_SEG and the stuck lfree pointer are related but I
> don’t know why.

I agree. I would first sort out the segments problem and then see if
this persists.

Tazzari Davide

Apr 18, 2011, 11:18:01 AM
to Mailing list for lwIP users

I agree with you Kieran, but the problem is that I don't know where to look.

I used the lwIP 1.3.2 port for AVR32 and I changed almost nothing.

In one of my long (and boring) previous posts I described all the tasks that use lwIP through the netconn API. I can repost it if you wish.

About other... I have looked for some timers and I have seen that in the lwip core there are a lot of them that I suppose correct. I said "I suppose" because I don't really know how to investigate.

Can you please suggest where to look for?

Test 1.
I connected the device to my computer with a crossover Ethernet cable, so that there is no wireless, switch, ... in the middle.

The situation is much the same, except that the lock is harder to trigger. Everything locks only after a lot of F5 reloads, whereas in the normal setup I need only 5-10 fast reloads.

This could suggest that the heavy traffic handled by lwIP itself interferes with normal operation. I don't know whether it is really a timer; it is probably something related to the MAC itself but, as you said, at interrupt level. But I don't know where.

What I saw in this test is that the key really is TCP_SEG: as long as there is at least one free block, communication is possible even if the lfree ram pointer is not at the top of the area; otherwise there is the lock.

About SYS_TIMEOUT: every time I request a page (or at least a connection) a timeout is created. I have configured 6 SYS_TIMEOUT entries. If I reload the page 5 times and wait, no error occurs. With 6 or more, the error counter increases. This seems to have no relationship with TCP_SEG. Anyway, even after a lot of errors lwIP continues to work. So let's forget it for the moment.

Test 2:

I have put a relay toggle in the web server task

WebServer task

...
    for (;;)
    {
        iRestartBinding = 0;
        pxHTTPListener = netconn_new( NETCONN_TCP );
        netconn_bind(pxHTTPListener, NULL, webHTTP_PORT );
        netconn_listen( pxHTTPListener );
        int iTimeout = 1000;

        //for( ; (iRestartBinding < 10) && (gucRestartWebServer == FALSE); iRestartBinding++)
        for( ; ; ) // <<-- for this test purpose; In the real case the above line is present
        {
            REL_TGL; // <<-- for this test purpose

            xLastFocusTime = xTaskGetTickCount();
            vTaskDelayUntil( &xLastFocusTime, xDelayLength );
            if (iGlobalWtdBomb == FALSE) // TRUE I am waiting for a WDT suicide
            {
                // Wait for a first connection.
                #if LWIP_SO_RCVTIMEO
                pxHTTPListener->recv_timeout = iTimeout;
                #endif

                pxNewConnection = netconn_accept(pxHTTPListener);

                if (xTaskCreate(WebServerAnswerTask,
                        ( signed portCHAR * ) "WebServerAnswer",
                        WEB_SERVER_STACK_SIZE,
                        pxNewConnection,
                        ethWEBSERVER_PRIORITY,
                        ( xTaskHandle * ) NULL ) != pdPASS)
                {
                   // Task not correctly created!!!
                   netconn_write( pxNewConnection, (char *) webHTTP_HTM_INTERNAL_ERROR, (u16_t) strlen( webHTTP_HTM_INTERNAL_ERROR ), NETCONN_COPY ); // error HTTP 500

                   netconn_close( pxNewConnection );
                   netconn_delete( pxNewConnection );
                }
                iRestartBinding = 0;
                iTimeout = 5000;           
            }

        }   // end acquisition loop
        gucRestartWebServer = FALSE;
        netconn_close(pxHTTPListener);
        while(netconn_delete(pxHTTPListener) != 0)
        {
            vTaskDelay(20);
        }
        pxHTTPListener = NULL;
    }
...

Result...
When I reload the page slowly, everything is OK almost forever.
When I reload the page faster, I see that both Firefox and Internet Explorer open the TCP connection, send the GET request and immediately afterwards send [RST, ACK] to close the connection, except for the last one, which waits for the device's answer. I suppose that, because the browser has not received any answer and the user has requested a reload, it wants only the last request to be processed.

On every netconn_accept (timeout or not) I can hear the relay toggle. If I press F5 5 times I hear 5 toggles. That's what I expect.

Sometimes one toggle is missing (5 presses of F5, 4 toggles!). Exactly in this case, I lose a TCP_SEG block and a portion of the mem area.

One lost toggle also means that netconn_accept doesn't see the connection, so from the web server task's point of view I cannot see the problem.

Again, this happens if there are lots of requests (connection, GET, [RST,ACK] from browser, close connection) before a (connection, GET, answer, [RST,ACK], close connection).

Sometimes I have seen this exchange in the middle of a reload:
(Firefox) Connection [SYN]
(device) Connection [SYN, ACK]
(Firefox) Connection [ACK]
(Firefox) GET request
(Firefox) [TCP Retransmission] of the GET request
(device) [ACK] of the HTTP
(Firefox) [RST, ACK]   without any answer from the device

This seems to be one case where a TCP_SEG is lost. It is not easy to say, because I don't know exactly when the loss happens or how to relate it to the Wireshark capture.

It also seems that the loss often (but not always) happens when a [TCP Retransmission] is present.
Anyway, it seems there is something in the internal handling of the [RST,ACK], the retransmission or something like that, which is probably not related to the code I have written.

How can I handle this? Where should I look? I have no idea at the moment.
My working assumption is that the lwIP port is correct, but at this point I am not so sure. I still hope that the bug is in a piece of code I wrote but, as I said, I have no idea where to look.

I hope my new analysis can help

Best regards
Davide




Kieran Mansley

Apr 18, 2011, 11:24:15 AM
to Mailing list for lwIP users
On Mon, 2011-04-18 at 17:18 +0200, Tazzari Davide wrote:
> About other... I have looked for some timers and I have seen that in
> the lwip core there are a lot of them that I suppose correct. I said
> "I suppose" because I don't really know how to investigate.
> Can you please suggest where to look for?

The port will be responsible for calling TCP timers and passing received
packets to lwIP in the correct way. Ports have, in the past, often got
this wrong with bugs that sound a lot like yours. You need to look at
the code in the port that calls in to lwIP to process the timers (this
might be handled internally in lwIP if you have an OS, and so will
probably be correct) and where the driver is passing received packets to
the stack: what function does it call?

Martin Persich

Apr 18, 2011, 2:45:13 PM
to Mailing list for lwIP users
Hi Davide,
I see one very important piece of information in your message today: "AVR32"!
I think the problem is not in lwIP but in Atmel's port files and Atmel's MACB driver. (Many thanks to Kieran for the stable version of lwIP ...)
I work with the AVR32 too, and there were (are?) many, many bugs in Atmel's MACB driver and Atmel's port files for lwIP!!
I don't have time to study your problem at the moment, but it looks like the problems I had one or two years ago.
You can look at my messages in:
...
I had problems with reconnection of the Ethernet cable too ...  :-(
I can send my port files for lwIP 1.4.0 (I advise upgrading to 1.4.0) and my working version of the MACB driver to your private address.
 
Martin Persich
 
 

Tazzari Davide

Apr 19, 2011, 2:29:14 AM
to Mailing list for lwIP users

Hi Martin,

I am very happy to hear that I have been living with a bug for a long time!!!!

 

By the way, what you are saying also explains why, if I run my code from FLASH, the problem is still there but takes a long time to appear, whereas if I run my code from SDRAM everything crashes within a few reloads.

The reason is that code in SDRAM runs slower than code in FLASH. The MACB works at the same speed no matter where the code runs. So, as a percentage, from the point of view of the SDRAM code, the MACB is active more of the time than before. If there is a bug there, it may be triggered proportionally more often than with FLASH code.

 

I’ll take a look at the messages you wrote.

If you could send me your corrected solution I would be very, very pleased.

 

About the Ethernet cable reconnection: I have had the same problem. If I unplugged and replugged the cable, the web server no longer recognized requests.

My solution is in my previous post…

WebServer task
...
    for (;;)
    {
        iRestartBinding = 0;
        pxHTTPListener = netconn_new( NETCONN_TCP );
        netconn_bind(pxHTTPListener, NULL, webHTTP_PORT );
        netconn_listen( pxHTTPListener );
        int iTimeout = 1000;

        for( ; (iRestartBinding < 10) && (gucRestartWebServer == FALSE); iRestartBinding++)
        {

a double “for” loop.

The first “for” is the infinite loop that runs the task.

The second “for” lets the task recreate the connection and the binding/listening if nothing has touched the web server for a certain amount of time (10 seconds in my solution).

The timeout is also important, because it lets the netconn_accept function return so that the loop can advance the counter iRestartBinding. The flag gucRestartWebServer is only for testing purposes, so forget it.

This way I can plug and unplug the cable as many times as I want. Of course it doesn’t react immediately; I need to wait at least 10 seconds. I don’t think this is such a big problem!

 

Davide

 

 
