I hope this message finds you well. I am currently studying your paper and have come across a couple of points that I would greatly appreciate some clarification on. Your insights would be invaluable to my understanding of the work. Here are my questions:
- In the paper, you observe four distinct patterns in the per-token loss trajectories during training: H->H, L->L, H->L, and L->H, and note that all categories except L->L exhibit a higher average loss. To address this, the paper introduces SLM, which uses a reference model to filter out tokens with higher loss. However, I am curious how this approach specifically targets and removes the noisy tokens depicted in Figure 2. Intuitively, one might expect noisy tokens to have a higher loss and thus be more likely to be selected by the reference model. Could you elaborate on the mechanism by which SLM effectively eliminates these noisy tokens?
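To make my current reading concrete, here is a minimal sketch in plain Python of how I understand the selection rule (the function name `slm_select` and the `keep_ratio` parameter are my own illustration, not from the paper): score each token by its excess loss, i.e., the training model's loss minus the reference model's loss, and keep only the top-scoring fraction for the training objective.

```python
def slm_select(loss_current, loss_ref, keep_ratio=0.6):
    """Sketch of my reading of SLM token selection.

    loss_current: per-token losses under the model being trained.
    loss_ref:     per-token losses under the reference model.
    Returns a boolean mask over tokens; True = token kept for training.
    """
    # Score each token by excess loss: current loss minus reference loss.
    excess = [c - r for c, r in zip(loss_current, loss_ref)]
    # Keep the top keep_ratio fraction of tokens by excess loss.
    k = max(1, int(keep_ratio * len(excess)))
    ranked = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(excess))]


# Hypothetical per-token losses: token 0 is "noisy" (hard for both models,
# so high absolute loss but small excess loss); token 2 is one the reference
# model finds easy but the current model does not (large excess loss).
mask = slm_select([3.0, 0.5, 2.0, 1.0], [2.8, 0.4, 0.5, 0.9], keep_ratio=0.5)
```

Under this reading, a noisy token that is hard for both models would have a high absolute loss but a small excess loss, and would therefore be dropped; is that the intended mechanism, or does the filtering rely on the reference loss alone?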
- For continual pre-training, I am wondering whether it is feasible to use the pretrained model itself as the reference model. What would be the potential implications or limitations of such an approach?