Vocabulary Files C1 Key

Ardelle Abdullah

Aug 5, 2024, 10:45:14 AM
You can explore any of these files by highlighting one and opening it. Explore the page and button organization by choosing the buttons. Buttons with arrows in the corners will move to new pages. As you explore the page layouts, consider whether one of these might work as a starting point for the intended device user.

This process transfers just the vocabulary file. If you want to transfer the vocabulary and the settings from one NovaChat to another, see the Backup/Restore all user Vocabulary and Settings using a FlashDrive or Google Drive instructions.


You will need a flash drive with an appropriate USB connector for your NovaChat Device. The USB flash drive provided by Saltillo has the appropriate USB connector on one end that plugs into your device and a USB 3.0 connector on the other end that plugs into a computer.


Note: If your two NovaChat devices do not have the same USB connector, you will need to plug the USB drive that contains the vocabulary file into a computer, copy the vocabulary file to the computer, plug the other USB drive into the computer, and copy the vocabulary file to it.


"Word is not a typesetting/layout program, so I cannot supply the InDesign files as a Word document for them to edit themselves. If I convert the InDesign document to Word, the margins, running heads/footnotes, and image layout will not be kept exactly as requested."


"If you want a Word file, I can give you one, but it will not be a designed layout. It will be a text file. Word is unsuitable for professional printing and cannot maintain the design in many instances. I'm happy to send a Word file. However, you should be aware that if you make changes to a Word file at this stage and send it back, it may require me to restart the design process from the beginning."


Beyond simply exporting a text file from InDesign, you can go through Acrobat and a PDF to save a Word file; some type sizing, color, and similar attributes can be retained that way. It still won't be a "great" Word file, but it will look somewhat closer to the design than plain text. (Generate the PDF from the INDD file, open the PDF in Acrobat, and save as Word.)


Either way, you'll need to educate the client. Like many, they probably assume everyone uses Word because that's what they use. Most often they are accustomed to editing Word files, which is why they request one. If you take the time to explain that they can mark up a PDF with edits and corrections, they may be fine with that.


Lastly, some clients request a Word file because, to them, it means they have a copy of your work that they can reuse in the future rather than paying you or someone else to rework things. They may be ignorant of the fact that Word is almost never used for commercial reproduction (there's always an exception to every rule). So...


Send them the raw, untouched Word file generated from the PDF. The more elaborate your InDesign layout, the more "wonky" the Word file is going to be. In almost every case the page breaks will be horrible, object positions will be shifted, and so on. If they complain, that's your opportunity to explain that Word isn't used as a tool by professional designers because it fails to support precise object positioning, CMYK color, etc.


If no part of your agreement stated that the client is to receive final deliverables as native files, then don't provide such files. Any contract/agreement should clearly state what the final deliverables are. In my case, it's always simply a PDF and explicitly never "native" or "working" files.


You should always specify deliverables in writing when taking on such a job. Some clients, especially inexperienced ones who go around on sites like Upwork, will have no idea how things work and will assume the wrong things. Before you take a job like this, you need to state in writing, "I am going to deliver this as a PDF," and they need to agree to that; otherwise you end up where you are now.


InDesign comes with a basic text-export plugin which can dump the raw text content into TXT format, with all design elements removed. There are also a few paid plugins out there (search for Rorohiko Text Exporter) that can keep some of the design and export to RTF format, but generally speaking, everyone in the field knows there is no perfect INDD-to-DOCX conversion.


Use custom vocabularies to improve transcription accuracy for one or more specific words. These are generally domain-specific terms, such as brand names and acronyms, proper nouns, and words that Amazon Transcribe isn't rendering correctly.


You are responsible for the integrity of your own data when you use Amazon Transcribe. Do not enter confidential information, personally identifiable information (PII), or protected health information (PHI) into a custom vocabulary.


You can test your custom vocabulary using the AWS Management Console. Once your custom vocabulary is ready to use, log in to the AWS Management Console, select Real-time transcription, scroll to Customizations, toggle on Custom vocabulary, and select your custom vocabulary from the dropdown list. Then select Start streaming. Speak some of the words in your custom vocabulary into your microphone to see if they render correctly.


The AWS Management Console, AWS CLI, and AWS SDKs all use custom vocabulary tables in the same way; lists are used differently for each method and thus may require additional formatting for successful use between methods.
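For reference, here is a sketch of what a custom vocabulary table can look like. The layout shown (a tab-separated header row with a required Phrase column plus optional SoundsLike, IPA, and DisplayAs columns, with hyphens standing in for spaces inside a phrase) is an assumption to verify against the current Amazon Transcribe documentation:

```
Phrase	SoundsLike	IPA	DisplayAs
Los-Angeles			Los Angeles
A.W.S.	ay-double-you-ess		AWS
Etienne		eɪ.tjɛn	
```

Empty optional fields are left blank, but the tab separators between columns must be kept so each row has the same number of fields.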


For text inputs, vocabulary files should be provided in the data configuration. A vocabulary file is a simple text file with one token per line. It should start with these three special tokens:


If your training data is already tokenized, you can build a vocabulary with the most frequent tokens. For example, the command below extracts the 50,000 most frequent tokens from the files train.txt.tok and other.txt.tok and saves them to vocab.txt:
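The command itself is missing from the text above, and the original likely used the toolkit's own vocabulary-building script. As an illustrative stand-in, a plain shell pipeline can extract the most frequent tokens from whitespace-tokenized files (the tiny sample files here stand in for real training data):

```shell
# Create tiny sample tokenized files standing in for real training data
printf 'the cat sat\nthe dog sat\n' > train.txt.tok
printf 'the bird flew\n' > other.txt.tok

# Count token frequencies across both files, sort by descending count,
# and keep the top N tokens (N=50000; this toy input has only six)
cat train.txt.tok other.txt.tok \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -n 50000 \
  | awk '{print $2}' > vocab.txt

head -n 2 vocab.txt   # prints "the" (freq 3) then "sat" (freq 2)
```

Note that a real NMT toolkit's vocabulary builder would also prepend the special tokens described earlier; this pipeline only illustrates the frequency-ranking step.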


The Flashlight decoder, deployed by default in Riva, is a lexicon-based decoder and only emits words that are present in the provided lexicon file. That means uncommon and new words, such as domain-specific terminology, that are not present in the lexicon file have no chance of being generated.


On the other hand, the greedy decoder (available as an option during the riva-build process with the flag --decoder_type=greedy) is not lexicon-based and hence can produce virtually any word or character sequence.


Lexicon file: The lexicon file is a flat text file that maps each vocabulary word to its tokenized form (e.g., SentencePiece tokens), separated by a tab. Below is an example:
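A few illustrative lexicon lines, assuming SentencePiece-style subword tokens (the ▁ character marks the start of a word); the actual tokenized forms depend on the tokenizer model packaged with your acoustic model:

```
with	▁with
manufacturers	▁manufacturer s
gpu	▁g pu
```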


Riva ServiceMaker automatically tokenizes the words in the vocabulary file to generate the lexicon file. It uses the correct tokenizer model that is packaged together with the acoustic model in the .riva file. By default, Riva generates one tokenized form for each word in the vocabulary file.


In a local Riva deployment, the actual physical location of Riva assets depends on the value of the riva_model_loc variable in the config.sh file under the Riva quickstart folder. The vocabulary file is bundled with the Flashlight decoder.


By default, riva_model_loc is set to riva-model-repo, which is a docker volume. You can inspect this docker volume and copy the vocabulary file from within the docker volume to the host file system with commands such as:
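A sketch of such commands, assuming the default riva-model-repo volume name and an illustrative model directory; the exact path depends on the model you have deployed:

```shell
# Confirm the docker volume exists
docker volume inspect riva-model-repo

# Mount the volume in a throwaway container and copy the vocabulary file
# out to the current host directory (the model directory name is illustrative)
docker run --rm \
  -v riva-model-repo:/riva \
  -v "$(pwd)":/out \
  alpine cp /riva/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/dict_vocab.txt /out/
```

These commands require a running Docker daemon and an already-deployed Riva model, so treat them as a starting point rather than something to run verbatim.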


If you modify riva_model_loc to an absolute path pointing to a folder, the specified folder in the local file system will be used to store Riva assets instead. The vocabulary file can then be found under that folder at, for example, /models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/dict_vocab.txt.


The lexicon file that is used by the Flashlight decoder can be found in the Riva assets directory, as specified by the value of the riva_model_loc variable in the config.sh file under the Riva quickstart folder (see above).


If you modify riva_model_loc to an absolute path pointing to a folder, the specified folder in the local file system will be used to store Riva assets instead. The lexicon file can then be found under that folder at, for example, /models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/lexicon.txt.


Finally, once this is done, regenerate the model repository using the new decoding lexicon by passing --decoding_lexicon=modified_lexicon.txt to riva-build instead of --decoding_vocab=decoding_vocab.txt.


This ETL process will be improved and extended with new vocabulary sources over time. It is somewhat complex, has very specific prerequisites, requires a good knowledge of the CDM schema and the source vocabulary data sets, and includes some manual steps.


Note: Oracle XE cannot be used because it has an 11 GB database size limitation. To minimize network latency for database loads and data-transformation SQL scripts, it is recommended to host the Oracle DBMS and the data set files to be loaded on the same server.


Note: As a general principle when running the create_source_tables.sql scripts in this ETL process, you will get the best performance by first running the CREATE TABLE statements, then loading the tables, and then running the CREATE INDEX statements. The reason is that it is faster to add an index to a populated table than to load data into a table that already has an index.
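The create-load-index ordering can be demonstrated with any DBMS; the sketch below uses sqlite3 as a stand-in (the table and index names are illustrative, not the actual ETL tables) purely to show the sequence:

```shell
rm -f etl_demo.db

# 1. Create the table with no indexes, 2. bulk-load the rows,
# 3. build the index only after the table is populated.
sqlite3 etl_demo.db <<'SQL'
CREATE TABLE concept_stage (concept_id INTEGER, concept_name TEXT);
INSERT INTO concept_stage VALUES (1, 'Aspirin'), (2, 'Ibuprofen');
CREATE INDEX idx_concept_stage_id ON concept_stage (concept_id);
SQL

sqlite3 etl_demo.db "SELECT count(*) FROM concept_stage;"   # prints 2
```

For a real Oracle load, the INSERT step would be a bulk loader such as SQL*Loader, but the ordering principle is the same.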


This ETL process is an update process which merges new data into an existing populated vocabulary database. If you want to refresh V4 vocabulary data you will need both V4 and V5 vocabularies (V4 is refreshed from V5). If you only want to refresh V5 vocabulary data then V4 is not required and you can skip the V4 update step in the ETL process.


Create a directory, e.g. /vocabETL, on your server (on a Windows server use C:\vocabETL, and make similar Windows file path substitutions in subsequent instructions in this document). Use a web browser or the command-line program wget to download the following zip file into the vocabETL directory: -v5.0/archive/master.zip. Unzip the master.zip file using 7-Zip on Windows or unzip on Linux.
