Very nice to see you made this much progress on this!
I'm replying to the code with comments in the way that you would
get in a formal review on the linux-arm-kernel and mmc mailing
lists, to speed up the process of getting the driver included.
I hope this helps you, feel free to ask me when you have further
questions.
Obviously there are still quite a few things to be done, and I suppose
you are aware of most of them. First of all, the submission
format -- it should be sent to the lists as an inline patch with
a description and Signed-off-by line from you, see
Documentation/SubmittingPatches.
> // Driver module name - required for requesting platform_device resources
> #define DRIVER_NAME "wmt-sdmmc"
The kernel generally does not use C++ style comments, it should be
either
/* one-line comment */
or
/*
* multi-line comment */
*/
The particular comment for the DRIVER_NAME is rather useless though,
you could just as well remove it, or even remove the macro and open-code
the string in the two places where it is used.
> volatile int dma_completed;
Volatile does not mean what you want it to mean. If you want to communicate
atomically, you should use one of spinlock, atomic_t or bitops.
> struct wmt_dma_descriptor {
> u32 reqcount:16;
> u32 interrupt:1;
> u32 reserved0:13;
> u32 format:1;
> u32 end:1;
Bit fields are not particularly portable. It would be better to use a single
u32 (or two u16) here and mask the bits manually/
> u32 DataBufferAddr;
> u32 BranchAddr;
> u32 reserved1;
CaMelCase is against the kernel coding style, the more common way would
be to write "data_buffer_addr", or just "buffer" here.
> struct wmt_mci_priv {
> struct mmc_host *mmc;
> void __iomem *sdmmc_base;
>
> int irq_regular;
> int irq_dma;
>
> // dma buffers
> void *dma_data_buffer;
> u32 dma_data_cpuaddr;
>
> void *dma_desc_buffer;
> u32 dma_desc_cpuaddr;
>
> // sd data
> struct mmc_request *req;
> };
It's rather unusual to have pointers to the request in the same
data structure as the host information. You normally want to
keep the host information global and have pointers to that
from the request where needed.
> static irqreturn_t wmt_mci_dma_isr(int irq_num, void *data)
> {
> struct wmt_mci_priv *priv;
>
> // printk("<1>[MMC/SD] DMA ISR FIRED!!\n");
You should never pass literal "<1>" strings or similar to
printk. Instead, use the KERN_DEBUG etc macros. Even better,
use dev_dbg() or pr_debug() so you can decide at run-time
whether the messages make it to printk.
> priv = (struct wmt_mci_priv *)data;
>
> // disable the SD DMA controller
> writel(0, priv->sdmmc_base + SDDMA_DESPR);
> writel(0, priv->sdmmc_base + SDDMA_GCR);
>
> dma_completed = 1;
>
> return IRQ_HANDLED;
> }
The use of the dma_completed variable is wrong for a number
of reasons here.
One solution would be to instead use complete() in the ISR
and wait_for_completion() in the request function.
> static irqreturn_t wmt_mci_regular_isr(int irq_num, void *data)
> {
> // This function should never be called!
> // printk("<1>[MMC/SD] REGULAR ISR FIRED!!\n");
> return IRQ_HANDLED;
> }
Can you explain why it should not get called?
> // This code is hacky and should be fixed
> // Hardcoded GPIO bases should be passed from board definition
> // This code enables the GPIO pin that power the SD controller - could/should be done with GPIO functions
> // Also configures some general purpose pins for MMC/SD functions
right.
> // wait for command to send
> while ((readb(priv->sdmmc_base + SDMMC_CTLR) & CTLR_CMD_START) != 0);
>
> // wait for command to complete
> while ((readb(priv->sdmmc_base + SDMMC_STS1) & STS1_RSP_TIMEOUT) != STS1_RSP_TIMEOUT) {
> if (( readb(priv->sdmmc_base + SDMMC_STS1) & STS1_CMDRSP_DONE) == STS1_CMDRSP_DONE)
> break;
> }
How long do these take? For really long times, it would be good to sleep
inbetween, for short times, you should have a maximum time to block for.
Use something like
unsigned long timeout = jiffies + (TIMEOUT / HZ);
do {
...
} while (time_before(jiffies, timeout);
> if (priv->req->data->flags & MMC_DATA_READ) {
> // if we recieved data, copy it to the mmc scatterlist
> local_irq_save(flags);
Why the local_irq_save?
> dma_sync_sg_for_cpu(mmc_dev(mmc), priv->req->data->sg, priv->req->data->sg_len, DMA_FROM_DEVICE);
>
> buffer = (u8 *)priv->dma_data_buffer;
> dma_map_sg(mmc_dev(mmc), priv->req->data->sg, priv->req->data->sg_len, DMA_FROM_DEVICE);
> addr = (u8 *)(sg_dma_address(priv->req->data->sg));
> vaddr = kmap_atomic(sg_page(priv->req->data->sg), KM_BIO_SRC_IRQ) + priv->req->data->sg->offset;
>
> for (cnt = 0; cnt < sg_dma_len(priv->req->data->sg); cnt++)
> vaddr[cnt] = buffer[cnt];
>
> priv->req->data->bytes_xfered = priv->req->data->blksz;
>
> kunmap_atomic(vaddr, KM_BIO_SRC_IRQ);
> dma_unmap_sg(mmc_dev(mmc), priv->req->data->sg, priv->req->data->sg_len, DMA_FROM_DEVICE);
>
> local_irq_restore(flags);
There is some confusion in this code. You don't need dma_sync_sg_for_cpu
here, in fact it's invalid to call dma_sync_sg_for_cpu before mapping
the sglist.
The idea of the dma_map_sg is to get a list of entries that you can
pass to the device directly, so you don't need to copy the data
around manually. Note that priv->req->data->sg is an data structue,
not a single value, you need to walk through it using for_each_sg().
You should not need the kmap_atomic here, in fact it is quite harmful
to have it because you are mapping the same page that is still owned
by the device.
> if (!pdev) {
> printk("<1>[MMC/SD] Probe failed!\n");
> return -EINVAL;
> }
Not possible here, the probe function only gets called with a valid pdev.
Arnd
1) Performance is dramatically improved (but still below what I expect).
The previous driver benchmarked around 77Kb/sec (write) on my setup.
This new driver benchmarked around 200Kb/sec (write) on the same setup.
This is mostly due to expanding the code for the PLL clock (now operates up
to 25Mhz).
Added support for card insert detection and write protect detection (haven't
checked the WP yet). Card Insert was required otherwise when the card was
removed the driver would go 'wonky' :)
2) I have fixed the source as per Arnd's notes.
All comments are now the correct style - I have removed some of the more
obvious ones.
Bitfields in the wmt_dma_descriptor struct are replaced with a u32 field and
masked.
CaMelCase has been removed (might still be a bit but I've removed what I've
noticed so far)
Removed the '<1>' literals and replaced with KERN_DEBUG, KERN_ERR where
appropriate
Removed the dma_completed variable and setup a completion struct.
TODO:
Remove the blocking code in the post_command function. Should be movable to
regular_isr() with a bit of work so its non-blocking.
Removing the mmc_request variable from the wmt_mci_priv struct - Not too
sure how I am going to accomplish this (feedback welcomed)
Fix up the data copy from DMA buffers to scatterlists (and vice-versa).
Implement multi-block requests.
It is temporary solution?
> 8) Thanks for the pointers regarding the last code block. To be
> honest, I don't completely understand this code and it was mostly
> copied from other driver source. The dma_map_sg is another left-over
> piece of code that should have been removed. The reason is currently
> doesn't walk the scatterlist is that the list driver is currently
> limited to single-block requests only, and the mmc host driver seems
> to always passes single-entry sg at the moment. kmap_atomic is, again,
> another piece of code that was copied without understanding.
The idea of kmap_atomic is to access a user space page that may or may
not be in highmem. It creates a new page table entry so you can access
a page through a pointer.
> I would appreciate any comments you might have about correcting this
> last block of code.
> It's basic function is to copy the data from the dma buffer to the
> scatterlist. There is an accompanying piece of code which performs the
> same copy in the other direction.
>
> Is it possibly to copy data from the scatterlist without mapping?
> Would it be better to map the scatterlist and pass the bus_addr of the
> mapped sg to the descriptor rather than copying it to my own buffer?
That is indeed the intention of a scatterlist. The kernel uses scatterlists
in order to pass around pages to driver to do DMA in. After you have
done dma_map_sg, you are supposed to use use for_each_sg(), sg_dma_len()
and sg_dma_address() to access the bus addresses that you pass to the
device. Most DMA engines can operate on some kind of scatter-gather
list, so you fill that list with the data from sg_dma_address and
pass the list to the device. There is probably a way to tell the MMC
layer if the hardware cannot deal with these lists and needs to
get contiguous buffers.
Arnd
This may well be because of the single-block access. Most SD cards can only
write blocks of multiples of 16 KB or larger, if you write smaller data,
you can easily get to the worst-case scenario of having the card write
4 MB for every 512 bytes you send it. In fact, many high-speed class 10
cards will have this behaviour, while slower class 2 or 4 cards will deal
a lot better with small accesses.
> 2) I have fixed the source as per Arnd's notes.
> All comments are now the correct style - I have removed some of the more
> obvious ones.
> Bitfields in the wmt_dma_descriptor struct are replaced with a u32 field and
> masked.
> CaMelCase has been removed (might still be a bit but I've removed what I've
> noticed so far)
> Removed the '<1>' literals and replaced with KERN_DEBUG, KERN_ERR where
> appropriate
> Removed the dma_completed variable and setup a completion struct.
very nice.
> TODO:
> Remove the blocking code in the post_command function. Should be movable to
> regular_isr() with a bit of work so its non-blocking.
> Removing the mmc_request variable from the wmt_mci_priv struct - Not too
> sure how I am going to accomplish this (feedback welcomed)
> Fix up the data copy from DMA buffers to scatterlists (and vice-versa).
> Implement multi-block requests.
sounds good.
> /*
> * release the dma buffers
> * TODO: fix these lines - currently crashes with an address invalid
> * dma_free_coherent(&pdev->dev, 2048, &priv->dma_data_cpuaddr, (u32)priv->dma_data_buffer);
> * dma_free_coherent(&pdev->dev, 512, &priv->dma_desc_cpuaddr, (u32)priv->dma_desc_buffer);
> */
You have mixed up the arguments in these, it should be:
dma_free_coherent(&pdev->dev, 2048, priv->dma_data_buffer, priv->dma_data_cpuaddr);
dma_free_coherent(&pdev->dev, 512, priv->dma_desc_buffer, priv->dma_desc_cpuaddr);
Note that what you call the 'buffer' is usually called the 'cpu address', it
is a pointer that the CPU can use. In turn, what you call the 'cpuaddr' is
normally called the 'dma address', it's the address that the device uses to
access the buffer, and should be of type dma_addr_t. When you use the correct
types and names, a lot of the confusion will go away.
Arnd
NOTE: There are no new features or noticeable speed improvements in this
code.
1) Removed the intermediate buffer for DMA transfers - now copies from the
scatterlist to device and vice-versa. This save's on copying the data from
one buffer to another and simplify's the DMA setup code. Removed the
dma_data_buffer variable.
2) Fixed the memory leak with the descriptor DMA buffer not being released -
thanks Arnd. (Also changed the 'u32 dma_desc_cpuaddr' to 'dma_addr_t
dma_desc_device_addr')
TODO:
1) Remove the blocking code in post_command.
2) Implement multi-block transfers. I have done a bit of work on this, but
with no luck. Invalid data/kernel crashes :)
Arnd:
How would you recommend removing the mmc_request variable from the
wmt_mci_priv structure?
It is required in the interrupt routines, and the priv structure is the only
structure passed.
> This may well be because of the single-block access. Most SD cards can only
> write blocks of multiples of 16 KB or larger, if you write smaller data,
> you can easily get to the worst-case scenario of having the card write
> 4 MB for every 512 bytes you send it. In fact, many high-speed class 10
> cards will have this behaviour, while slower class 2 or 4 cards will deal
> a lot better with small accesses.
sorry by the offtopic, but where I can read more about this?
1) Implemented multi-block read/write (currently still limited to 16 blocks
due to a scatterlist mapping problem).
2) Write speed is improved to about 685KB/s (about 330% faster).
WARNING! There is a known fault in the scatterlist mapping which may result
in data loss (either when reading/writing). It is currently set to
max_blocks=16 which hasn't caused any issues for me in about 2 hours use,
but be warned!
TODO:
1) Check the STOP_TRANSMISSION code - I think something is passed in the
req->data->stop that relates to this but at the moment its hardcoded.
2) Fix the scatterlist mapping code. Something is wrong with it which might
cause problems (BE WARNED!) If the max_blocks is set to >16, then the
scatter mapping seems to allocate insufficient descriptors to cover the
requested blocks (eg. 20 blocks requested but only 12 descriptors allocated
- causes a kernel lockup).
3) Still have to remove the blocking code in post_command()
1) Fixed the scatterlist mapping problem. Max_blocks=128 is now the default
(64k/request on FAT filesystem). No reason why it can't be higher except
that it would require more DMA memory. No point going higher than
max_blocks=2048 (fat is 512 bytes/block, 2048 blocks = 1Mb, max_request is
also 1Mb).
2) Write speed is improved again - getting about 780-800Kb/s write on my
card (I need to find another card as this one doesn't even have a class
rating on it :) )
This driver is definitely useable for those that don't want to wait :)
Also, I'd be interested to hear what kind of speeds people get with the
following commands:
'mount /dev/mmcblk0 /mnt/sd -o sync'
'time dd if=/dev/zero of=/mnt/sd/rubbish bs=1M count=5'
Time and Kb/s are of interest (especially if you have a class 4/6/10 card to
test on).
I'm still doing the research, but my findings so far are on
https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashCardSurvey
Arnd
Note that performance values are highly unreliable if you do that,
because if the interaction of ext3 with the layout of the card,
especially if you mount with -o sync.
Better write to the raw card:
time dd if=/dev/zero of=/dev/mmcblk0 bs=1M count=5 conv=direct seek=16
Skipping the first 16 MB will make it not write to the FAT area,
which is often slower than the rest (while allowing random access).
Arnd
do you compare performance with binary kernel from VIA/WMT?
Actually, count and size combined also should be a multiple of 4MB.
Try it multiple times though, some cards need to get into the right
mood before they get fast.
It would be good to compare the same cards doing this using a
USB or sdhci reader on a PC, your driver, and the original VIA
driver.
Arnd
Great work, Tony!
Would you like me to push this code to Gitorious on your behalf?
Best regards,
Alexey
i got much higher values with:
dd if=/dev/mmcblk0 of=/dev/null iflag=direct bs=1M skip=32 (=16 for 32Mb card)
dd if=/dev/zero of=/dev/mmcblk0 oflag=direct bs=1M seek=32 (=16 for 32Mb card)
Transcend SD 150x, 2Gb (SLC):
- your code, reading: 8.0MB/s
- your code, writing: 8.0MB/s
- VIA/WMT kernel, reading: 18.4 MB/s
- VIA/WMT kernel, writing: 12.1 MB/s
- USB cardreader, reading: 21.1 MB/s
- USB cardreader, writing: 14.6 MB/s
Kingston MMC plus, 128Mb:
- your code, reading: 7.6MB/s
- your code, writing: 5.4MB/s
- VIA/WMT kernel, reading: does not work
- VIA/WMT kernel, writing: does not work
- USB cardreader, reading: 11.6 MB/s
- USB cardreader, writing: 10.4 MB/s
old Canon SD, 32Mb:
- your code, reading: 3.4 MB/s
- your code, writing: 1.3 MB/s
- VIA/WMT kernel, reading: 3.6 MB/s
- VIA/WMT kernel, writing: 1.4 MB/s
- USB cardreader, reading: 3.7 MB/s
- USB cardreader, writing: 1.3 MB/s
PS: after some testing I got error on netbook (with your mmc module):
dd: opening `/dev/null': Invalid argument.
I think your code is still buggy and cause memory corrution ;)
BTW this card don't work with u-boot, but work with WinCE.
1) Max_blks=2048 (same as the VIA binary driver).
2) 50Mhz clock now supported for high-speed cards.
3) Changed the coding for the DMA buffer allocation to be 16*max_blks so we
don't have problems with crashes due to wrong calcs :) Just change the
max_blks and the dma buffer will resize according.
Hi Tony,
I just wanted to say thanks for all the work you're putting on this.
I haven't had time to test anything yet, sorry. I was going to make one
simple suggestion, though - have you considered forking Alexey's
repository on gitorious?
Then you can just say "I pushed new source to gitorious" instead of
needing to attach the source file each time, and people who want to test
can just do a 'git pull' and rebuild instead of having to save the
source file each time.
(Even if you don't adopt this suggestion, I do really appreciate the
work that you and the others are putting in on this!)
Cheers,
- Angus
You can probably get a bit better by using double buffering dma,
which Per Forlin has recently implemented, see
http://lkml.org/lkml/2011/1/12/236
Your driver is starting to look really good now, you should probably
post it to the linu...@vger.kernel.org mailing list, following
the rules from the Documentation/SubmittingPatches file in the
kernel, to get more input. The only thing that still sticks out
is the size of the wmt_mci_request() function, it would be more
readable if you split out the read and write code into separate
functions, or possibly one function that takes the dma direction
as an argument.
There are also a few tiny things that scripts/checkpatch.pl should
warn about like whitespace changes, and making all functions 'static',
and of course the things that you already marked as TODO in your
comments.
Arnd
now performance is reasonably good (current version uses logging
agressive, it can also cause slowdown)
> Transcend SD 150x, 2Gb (SLC):
> - your code, reading: 8.0MB/s
> - your code, writing: 8.0MB/s
> - VIA/WMT kernel, reading: 18.4 MB/s
> - VIA/WMT kernel, writing: 12.1 MB/s
> - USB cardreader, reading: 21.1 MB/s
> - USB cardreader, writing: 14.6 MB/s
new result: 16.0 MB/s read, 10.4 MB/s write.
what you think about this:
> PS: after some testing I got error on netbook (with your mmc module):
> dd: opening `/dev/null': Invalid argument.
>
> I think your code is still buggy and cause memory corrution ;)
I get similar errors twice with previous driver version.
In both cases error appear until reboot.
One more idea: When reading through the sdhci driver, I noticed that
you don't actually have to wait in the request function. Just
move everything you do after wmt_mci_start_command() into the
interrupt handler and call mmc_request_done() when the interrupt
arrives. This will save you a full context switch for every
interrupt.
Arnd
I cannot reproduce error after this changes:
But now I find another trouble: if I run dd and remove card - dd hang.
while dd if running.
No more polling/blocking code - fully interrupt driven.
Doesn't seem to have made much improvement to the raw speed (although it
could be that my card is maxed out @ 15MB/s (read) 10MB/s (write) ) but it
will reduce the CPU time.
It is in this tree:
http://gitorious.org/linux-on-via-vt8500/tprisk-wm8505-sdmmc-dev
(you must use origin v2.6.35-vt8500).
PS: I can publish diff's with velocity support for this tree.
I got another error:
CC arch/arm/mach-vt8500/wm8505-clocks.o
arch/arm/mach-vt8500/wm8505-clocks.c:461: error: stray '\' in program
arch/arm/mach-vt8500/wm8505-clocks.c:462: error: stray '\' in program
arch/arm/mach-vt8500/wm8505-clocks.c:463: error: stray '\' in program
arch/arm/mach-vt8500/wm8505-clocks.c:464: error: stray '\' in program
arch/arm/mach-vt8500/wm8505-clocks.c:465: error: stray '\' in program
arch/arm/mach-vt8500/wm8505-clocks.c:467: error: stray '\' in program
arch/arm/mach-vt8500/wm8505-clocks.c:468: error: stray '\' in program
arch/arm/mach-vt8500/wm8505-clocks.c:469: error: stray '\' in program
arch/arm/mach-vt8500/wm8505-clocks.c:470: error: stray '\' in program
arch/arm/mach-vt8500/wm8505-clocks.c:471: error: stray '\' in program
arch/arm/mach-vt8500/wm8505-clocks.c:472: error: stray '\' in program
arch/arm/mach-vt8500/wm8505-clocks.c:473: error: stray '\' in program
arch/arm/mach-vt8500/wm8505-clocks.c:474: error: stray '\' in program
arch/arm/mach-vt8500/wm8505-clocks.c:475: error: stray '\' in program
arch/arm/mach-vt8500/wm8505-clocks.c:476: error: stray '\' in program
make[1]: *** [arch/arm/mach-vt8500/wm8505-clocks.o] Error 1
Could you guys try to 'git bisect' the issue? Last time I checked, it worked fine, so there must have been some change in kernel.
On Feb 28, 2011 3:03 PM, "Josiah Walker" <mer...@gmail.com> wrote:
Yes, until then I guess I'm waiting for a fix to test all this on my machines...
On Sun, Feb 27, 2011 at 6:01 PM, Tony Prisk <sent...@digitalfantasy.co.nz> wrote:
>
> Even worse,...
Try a bit earlier, in December or so. I haven't really had any opportunity to test stuff since then.
On Feb 28, 2011 5:05 PM, "Josiah Walker" <mer...@gmail.com> wrote:
Maybe I've messed up in how to use git, but doing a git bisect going back a month, and running make clean && make uImage each time I bisect... I get the same error every time.
On Mon, Feb 28, 2011 at 8:21 PM, Alexey Charkov <alc...@gmail.com> wrote:
>
> Could you guys try ...
Well, Tony's tree is a descendant of my one, so it should contain all the commits. Just grep the changelog for something authored by me, most likely the culprit will be there:-)
As for a fresh copy of my tree, that may take long. As I mentioned in some other discussion, I've changed my job in January, so now I'm somewhat constrained on time and far away from the hardware.
Best,
Alexey
On Feb 28, 2011 5:33 PM, "Josiah Walker" <mer...@gmail.com> wrote:
I'm not sure how to do that from tony's tree, so it'll have to wait until I can grab a fresh copy of yours I guess.
On Mon, Feb 28, 2011 at 10:10 PM, Alexey Charkov <alc...@gmail.com> wrote:
>
> Try a bit earlier,...
firefox is very slow on this device ;)
Guys, I have just pushed a slight change to the IRQ code that is
supposed to fix the irq_desc problem. Could you please test this on
your hardware, as currently I have no access to mine?
Best,
Alexey